# Practical session I: Building a text corpus

In this practical exercise we will import some data, process it to extract the plain text and reformat it into our corpus format.

Many Python modules exist for downloading and processing text data. We will only discuss a few examples here.

## Requirements for our corpus data:
- language: (original) English only - no translations; no historical texts (before ~1960)
- text only (no audio/video/image data)
- metadata: none from the data

In [1]:
# import some general modules
import logging
import datetime as dt
from pathlib import Path

# set logging level (suggested: logging.INFO; for bug fixing: logging.DEBUG)
# logging_level = logging.INFO
logging_level = logging.DEBUG

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging_level)

# 1. Process source data

As a first step we will discuss several methods to extract text from various sources (local or online documents).

In [40]:
# simple print function we can use to look at some data results
def print_some(some_data, some=7):
    """Print up to `some` entries, starting from the end of the list."""
    print('*'*21)
    for i in range(min(some, len(some_data))):
        print(some_data[-(i + 1)])
        print('*'*10 + str(i+1) + '*'*10)

## 1.1 Text documents

Load raw text documents (sometimes preformatted, e.g. csv or XML) from other sources.

### 1.1.1 Pure text files

In this case we process either text files available on your computer or files found online. Depending on the text format you might have to add some basic text cleaning to remove, e.g., headers, footers or other kinds of metadata. Older data may be stored in a legacy encoding, so make sure to decode it correctly and to save your corpus as UTF-8.<br>
For our corpus, we are only interested in the plain text body.
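
When the encoding of older files is unknown, one option is to try UTF-8 first and fall back to a legacy encoding. The following is a minimal sketch under that assumption; the helper name `read_text_with_fallback` and the fallback encoding `cp1252` are illustrative choices, not part of the course material.

```python
from pathlib import Path


def read_text_with_fallback(filename, encodings=("utf-8", "cp1252")):
    """Return the file content decoded with the first encoding that works."""
    raw_bytes = Path(filename).read_bytes()
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            # this encoding does not fit the bytes, try the next candidate
            continue
    # last resort: decode with the first encoding and mark undecodable bytes
    return raw_bytes.decode(encodings[0], errors="replace")
```

The order of the candidate encodings matters: UTF-8 is tried first because it rejects most byte sequences that were written in a single-byte legacy encoding.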

In [3]:
# process raw text files without any special formatting/metadata
def process_text_files(filename, encoding='utf8'):
    logging.info('Reading text from pure text file %s' % filename)
    # use a context manager so the file handle is closed after reading
    with open(filename, mode='r', encoding=encoding) as text_file:
        return text_file.read()

As an example, we process the BBC news article dataset http://mlg.ucd.ie/datasets/bbc.html (direct download link: http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip).

In [5]:
# The function above is for single files only.
# Here, we show how to process a zip folder of multiple files using a loop.

# first step: unzip archive
import zipfile

BASE_PATH = Path(r'C:\Users\hp\Desktop\natural language processing')
DATA_ZIP_PATH = BASE_PATH / 'bbc-fulltext.zip'
DATA_OUTPUT_PATH = BASE_PATH  # the archive contains a top-level 'bbc' folder
FILE_EXTENSION = '.txt'

with zipfile.ZipFile(DATA_ZIP_PATH) as zip_file:
    for filename in zip_file.namelist():
        if filename.endswith(FILE_EXTENSION):
            logging.debug('Extracting file %s' % filename)
            zip_file.extract(filename, DATA_OUTPUT_PATH)

DEBUG:Extracting file bbc/business/001.txt
DEBUG:Extracting file bbc/business/002.txt
DEBUG:Extracting file bbc/business/003.txt
DEBUG:Extracting file bbc/business/004.txt
DEBUG:Extracting file bbc/business/005.txt
...

In [6]:
# loop over multiple text files (opening files might take some time)
data = []

DATA_PATH = Path(r'C:\Users\hp\Desktop\natural language processing') / 'bbc'
for filename in Path(DATA_PATH).glob('**/*.txt'):
    data.append(process_text_files(filename))
    
    # we stop here after the first few files, NOTE: you should remove this break when you process your data
    if len(data) > 10:
        break

print_some(data)

INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\001.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\002.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\003.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\004.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\005.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\006.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\007.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\008.txt
INFO:Reading text from pure text file C:\Users\hp\Desktop\natural language processing\bbc\business\009.txt
INFO:Reading text from pure text file

*********************
Ask Jeeves tips online ad revival

Ask Jeeves has become the third leading online search firm this week to thank a revival in internet advertising for improving fortunes.

The firm's revenue nearly tripled in the fourth quarter of 2004, exceeding $86m (£46m). Ask Jeeves, once among the best-known names on the web, is now a relatively modest player. Its $17m profit for the quarter was dwarfed by the $204m announced by rival Google earlier in the week. During the same quarter, Yahoo earned $187m, again tipping a resurgence in online advertising.

The trend has taken hold relatively quickly. Late last year, marketing company Doubleclick, one of the leading providers of online advertising, warned that some or all of its business would have to be put up for sale. But on Thursday, it announced that a sharp turnaround had brought about an unexpected increase in profits. Neither Ask Jeeves nor Doubleclick thrilled investors with their profit news, however. In both cases, 
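
The extract-then-read steps above could also be combined by reading the archive members directly, without writing them to disk first. This is only a sketch: the function name `read_texts_from_zip` is hypothetical, and UTF-8 is an assumed encoding (undecodable bytes are replaced rather than raising an error).

```python
import zipfile


def read_texts_from_zip(zip_path, extension=".txt", encoding="utf-8"):
    """Yield (member name, decoded text) for every matching archive member."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(extension):
                # archive.read returns bytes, so we decode them ourselves
                yield name, archive.read(name).decode(encoding, errors="replace")
```

Because this is a generator, the archive stays open only while the results are being consumed, e.g. with `for name, text in read_texts_from_zip(DATA_ZIP_PATH): ...`.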

### 1.1.2 csv text files

Multiple texts are often stored in a single file of comma-separated values (csv). <br> Here, we are only interested in the text column and ignore all other values, which are usually metadata (e.g. author or date) or some kind of annotation (e.g. sentiment).

In [7]:
# csv files
import csv

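def process_csv_to_text(filename, tag_of_text_column, encoding='utf-8'):
    # 'mbcs' only exists on Windows and mis-decodes UTF-8 files, so we default to UTF-8;
    # newline='' is the setting recommended by the csv module documentation
    with open(filename, encoding=encoding, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row[tag_of_text_column]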

For example, here we process some publicly available emails saved in csv format which can be downloaded from https://www.kaggle.com/kaggle/hillary-clinton-emails.

In [8]:
data = []

csv_filename = Path(r'C:\Users\hp\Desktop\natural language processing') / 'Emails.csv'
TAG_OF_TEXT_COLUMN = 'ExtractedBodyText'
 
for text in process_csv_to_text(csv_filename, TAG_OF_TEXT_COLUMN):
    data.append(text)
    
print_some(data)

*********************
See below.
**********1**********
PVerveer B6
Friday, December 17, 2010 12:12 AM
From B6
Please
let me know if I can be of any help to your department and will happy to do and please thank
Mrs. Hillary Clinton on behalf of me and
. supporting Afghan women.
•Thank you,
B6
B6
B6
B6
B6
B6
B6
B6
**********2**********

**********3**********
Big change of plans in the Senate. Senator Reid just announced that he was no longer going to move forward with the
omnibus appropriations bill. Instead, he filed cloture motions on the repeal of Don't Ask, Don't Tell and the DREAM
Act.
Those petitions will ripen on Saturday. So it looks like the Senate will be again considering the new START Treaty
tomorrow. We should know the starting time shortly.
**********4**********

**********5**********
B6
I assume you saw this by now -- if not, it's worth a read.
Forwarded message
**********6**********
Hi. Sorry I haven't had a chance to see you, but I did want you to hear directly from me
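
The output above contains empty entries, since not every row has text in the selected column. A possible way to skip them while reading is sketched below; the helper name `iter_nonempty_texts` is hypothetical, and it takes an already opened file object instead of a filename so that it can be tried on in-memory data.

```python
import csv
import io


def iter_nonempty_texts(csvfile, text_column):
    """Yield the stripped text of every row that has content in text_column."""
    reader = csv.DictReader(csvfile)
    for row in reader:
        # missing or empty cells become '' and are skipped
        text = (row.get(text_column) or "").strip()
        if text:
            yield text


# small in-memory example with one empty and one whitespace-only entry
sample = io.StringIO('Id,ExtractedBodyText\n1,"See below."\n2,\n3,"  "\n')
texts = list(iter_nonempty_texts(sample, "ExtractedBodyText"))
```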

### 1.1.3 XML text files

Text with more metadata information and various structural markers is often saved in XML format. This format is often used by companies for various structured documents, e.g., for handbooks. <br>
For this course, we will ignore all structural markers and metadata and only extract the plain text entries.

In [9]:
# XML documents
import xml.etree.ElementTree as ET

# Set a minimum number of words for an entry to be considered as text to exclude non-running text.
MIN_WORDS_IN_TEXT = 30

# You can also specify specific tags which surround the text in your xml-document; all other tags will be ignored.
# If this list is empty, text from all tags will be selected.
TEXT_TAGS = []


def get_node_text_recursively(node, min_words, text_tags=None):
    if not text_tags or node.tag in text_tags:
        if node.text and len(node.text.strip().split(' ')) > min_words:
            yield node.text
    for child in node:
        yield from get_node_text_recursively(child, min_words=min_words, text_tags=text_tags)


def process_xml_to_text(xml_document):
    tree = ET.parse(xml_document)
    root = tree.getroot()
    
    texts = list(get_node_text_recursively(root, min_words=MIN_WORDS_IN_TEXT, text_tags=TEXT_TAGS))
    
    logging.info('Found %d texts in the xml document "%s".' % (len(texts), xml_document))
    return texts

To test the XML reader, we use PubMed XML files (citations and abstracts of biomedical publications), which are available online at ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/.

In [10]:
data = []

xml_filename = Path(r'C:\Users\hp\Desktop\natural language processing\pubmed20n1020.xml') / 'pubmed20n1020.xml'

for text in process_xml_to_text(xml_filename):
    data.append(text)

print_some(data)

INFO:Found 12689 texts in the xml document "C:\Users\hp\Desktop\natural language processing\pubmed20n1020.xml\pubmed20n1020.xml".


*********************
Tardive dyskinesia (TD), a condition characterized by involuntary movements, is found in patients taking antipsychotics or other agents that block dopamine receptors. Symptoms of TD are associated with reduced quality of life, psychosocial problems, and medication nonadherence. Few agents tested in the treatment of TD had sufficient data to support or refute their use, until recently. A review of new evidence was combined with the existing guideline to provide new treatment recommendations. This activity provides an overview of treatments for patients with TD, including valbenazine and deutetrabenazine, which both received FDA approval for the treatment of TD.
**********1**********
Among 526 patients at risk, those taking ramelteon and/or suvorexant developed delirium significantly less frequently than those who did not, after control for the effects of risk factors on the estimate of an independent association between the effects of ramelteon and/or suvorexant an
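
For large XML files it can help to process the document incrementally instead of loading the whole tree. The sketch below uses `ET.iterparse` and `itertext()`, which also collects text that follows nested inline tags. The function name `iter_xml_texts` is hypothetical, and `AbstractText` in the usage note is an assumed tag name that should be checked against your own files.

```python
import xml.etree.ElementTree as ET


def iter_xml_texts(source, text_tag, min_words=30):
    """Yield the full text of each <text_tag> element with more than min_words words."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == text_tag:
            # itertext() joins the element's own text with that of nested tags
            text = "".join(elem.itertext()).strip()
            if len(text.split()) > min_words:
                yield text
            # free the memory held by the element once it has been handled
            elem.clear()
```

A call could look like `list(iter_xml_texts(xml_filename, 'AbstractText'))`. Note that the word count here splits on any whitespace, so the number of texts found may differ slightly from the function above.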

## 1.2 Documents with encoded formats

Sometimes documents are only available in specialized formats such as pdf, doc, ppt, etc. <br>
Here the text is encoded and first has to be extracted using preprocessing techniques.

We will use the package textract (https://textract.readthedocs.io/en/latest/) to get the plain text out of pdf documents using pdftotext.
Some other formats might require additional tools to be installed separately on your system (which is not possible on MS Azure Notebooks); for example, doc files require antiword.

In [11]:
# textract requires quite a few other modules, installing might take a minute
#!pip install textract

In [17]:
import textract
# we can now simply use: textract.process(filename)
data = []

# let us process a pdf file
pdf_filename = Path('data_sources') / 'OJ_L_2020_065_FULL_EN_TXT.pdf'
# textract expects the filename as a plain string; passing a pathlib.Path raised
# "TypeError: argument of type 'WindowsPath' is not iterable"
pdf_text = textract.process(str(pdf_filename)).decode("utf-8")

print(pdf_text)

data.append(pdf_text)

## 1.3 Websites

We can download websites using their url with the requests module (https://requests.readthedocs.io/en/master/). <br>
This will usually return an html document from which we can extract the plain text in the same way as with the XML documents above.

In [12]:
import requests

def download_website(url):
    r = requests.get(url)
    r.raise_for_status()  # stop early on HTTP errors (e.g. 404) instead of parsing an error page
    r.encoding = 'utf-8'  # sometimes gets guessed wrongly, but we strictly assume the website uses utf-8
    logging.info('Downloaded url %s ; text encoding = "%s".' % (url, r.encoding))
    html_text = r.text
    return html_text

In [19]:
# process html text file to plain text
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.all_data = []

    def handle_data(self, data):
        if data.strip():
            self.all_data.append(data.strip())


def process_html_to_text(html_text):
    parser = MyHTMLParser()
    parser.feed(html_text)
    return parser.all_data
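
The parser above collects every text node, which includes the content of `<script>` and `<style>` elements. If only the visible text is wanted, a variant could skip these elements. This is a sketch; the names `VisibleTextParser` and `html_to_visible_text` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.all_data = []
        self._skip_depth = 0  # > 0 while we are inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.all_data.append(data.strip())


def html_to_visible_text(html_text):
    parser = VisibleTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.all_data
```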

Let us try it out:

In [20]:
test_url = 'https://www.ismll.uni-hildesheim.de/'
text = download_website(test_url)
print(text)

DEBUG:Starting new HTTPS connection (1): www.ismll.uni-hildesheim.de:443
DEBUG:https://www.ismll.uni-hildesheim.de:443 "GET / HTTP/1.1" 200 44927
INFO:Downloaded url https://www.ismll.uni-hildesheim.de/ ; text encoding = "utf-8".


<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="Keywords" content="Wirtschaftsinformatik und Maschinelles Lernen, Universit&#xE4;t Hildesheim, University of Hildesheim, Information Systems and Machine Learning Lab, ISMLL, data mining, machine learning, maschinelles Lernen, E-Commerce, E-Business, E-Learning, Recommender-Systeme, recommender systems, web mining, Lars Schmidt-Thieme" />
    <meta name="Description" content="Wirtschaftsinformatik und Maschinelles Lernen, Universit&#xE4;t Hildesheim" />
    <meta name="Author" content="Lars Schmidt-Thieme" />
    <meta name="robots" content="index,follow" />
    <meta name="revisit-after" content="5 days" />
    <meta http-equiv="Content-Script-Type" content="text/javascript" />
    <meta http

In [21]:
text = process_html_to_text(text)

for entry in text:
    print(entry)

Wirtschaftsinformatik und Maschinelles Lernen, Universität Hildesheim
wir bieten...
Studieninteressierte
Der Weg ins Studium:
Studienangebot
Vorlesungsverzeichnis
Bewerbung & Zulassung
Schnupperstudium
Gasthörerstudium
Beratung und Information
Rund ums Studium:
Profil Universität Hildesheim
Campusleben
Studienfinanzierung
Studienbeiträge
Service:
Zentrale Studienberatung
International Office
Studieren mit Familie
Studieren mit Behinderung
Termine & Fristen
Anfahrts- und Lageplan
FAQ
Studierende
Studium:
Fachbereiche
Vorlesungsverzeichnis LSF
Studien- & Prüf.-Ordnungen
Learnweb
Termine & Fristen
Studienbeiträge
Wichtige Einrichtungen:
Einschreibung, Rückmeldung, Prüfungen...
Universitätsbibliothek
Rechenzentrum
Studierenden-Vertretung
Rund ums Studium:
Studienfinanzierung
Jobs und Praktika
Campusleben
Service:
Information & Beratung
Studieren mit Familie
Lagepläne
Personal
Studium & Lehre:
Vorlesungsverzeichnis LSF
Termine & Fristen
Learnweb
Einrichtungen:
Gleichstellungsbüro
Konferenz 

In [22]:
# some further processing might be needed, 
# however: this is specific to the format of the particular website!

# let us for this example simply add the entire text, combined using whitespace characters
data = []
data.append(' '.join(text))    
print(data)

['Wirtschaftsinformatik und Maschinelles Lernen, Universität Hildesheim wir bieten... Studieninteressierte Der Weg ins Studium: Studienangebot Vorlesungsverzeichnis Bewerbung & Zulassung Schnupperstudium Gasthörerstudium Beratung und Information Rund ums Studium: Profil Universität Hildesheim Campusleben Studienfinanzierung Studienbeiträge Service: Zentrale Studienberatung International Office Studieren mit Familie Studieren mit Behinderung Termine & Fristen Anfahrts- und Lageplan FAQ Studierende Studium: Fachbereiche Vorlesungsverzeichnis LSF Studien- & Prüf.-Ordnungen Learnweb Termine & Fristen Studienbeiträge Wichtige Einrichtungen: Einschreibung, Rückmeldung, Prüfungen... Universitätsbibliothek Rechenzentrum Studierenden-Vertretung Rund ums Studium: Studienfinanzierung Jobs und Praktika Campusleben Service: Information & Beratung Studieren mit Familie Lagepläne Personal Studium & Lehre: Vorlesungsverzeichnis LSF Termine & Fristen Learnweb Einrichtungen: Gleichstellungsbüro Konferen

You can also process multiple urls using a loop:

In [23]:
urls = ['https://www.ismll.uni-hildesheim.de/',
        'https://www.ismll.uni-hildesheim.de/da/index_en.html',
       ]
data = []
for some_url in urls:
    text = download_website(some_url)
    text = process_html_to_text(text)
    data.append(' '.join(text))    
print(data)

DEBUG:Starting new HTTPS connection (1): www.ismll.uni-hildesheim.de:443
DEBUG:https://www.ismll.uni-hildesheim.de:443 "GET / HTTP/1.1" 200 44927
INFO:Downloaded url https://www.ismll.uni-hildesheim.de/ ; text encoding = "utf-8".
DEBUG:Starting new HTTPS connection (1): www.ismll.uni-hildesheim.de:443
DEBUG:https://www.ismll.uni-hildesheim.de:443 "GET /da/index_en.html HTTP/1.1" 200 41159
INFO:Downloaded url https://www.ismll.uni-hildesheim.de/da/index_en.html ; text encoding = "utf-8".


['Wirtschaftsinformatik und Maschinelles Lernen, Universität Hildesheim wir bieten... Studieninteressierte Der Weg ins Studium: Studienangebot Vorlesungsverzeichnis Bewerbung & Zulassung Schnupperstudium Gasthörerstudium Beratung und Information Rund ums Studium: Profil Universität Hildesheim Campusleben Studienfinanzierung Studienbeiträge Service: Zentrale Studienberatung International Office Studieren mit Familie Studieren mit Behinderung Termine & Fristen Anfahrts- und Lageplan FAQ Studierende Studium: Fachbereiche Vorlesungsverzeichnis LSF Studien- & Prüf.-Ordnungen Learnweb Termine & Fristen Studienbeiträge Wichtige Einrichtungen: Einschreibung, Rückmeldung, Prüfungen... Universitätsbibliothek Rechenzentrum Studierenden-Vertretung Rund ums Studium: Studienfinanzierung Jobs und Praktika Campusleben Service: Information & Beratung Studieren mit Familie Lagepläne Personal Studium & Lehre: Vorlesungsverzeichnis LSF Termine & Fristen Learnweb Einrichtungen: Gleichstellungsbüro Konferen

## 1.4 Social media data


There are several social media sites whose APIs provide access to user-generated data.
These APIs differ in how much data they make available, e.g. Twitter data can only be accessed with a Twitter account, and the free version only returns posts from the last 7 days.
Facebook is even stricter about providing its data for analysis.

In this exercise we will download comments from the social news aggregation, web content rating, and discussion website https://www.reddit.com/.
Here we show how to use the https://pushshift.io/ API implemented in the Python module psaw https://pypi.org/project/psaw/.

In [5]:
# install and import API module (this might take a minute)
#!pip install psaw

from psaw import PushshiftAPI

In [6]:
# function to download the comments and submissions

POST_LIMIT = 1000000  # might have to be lowered for weaker systems

def download_subreddit(subreddit_name, start_epoch, end_epoch):
    """Function to crawl all submissions and comments of a subreddit.

    :param subreddit_name: name of the subreddit to be crawled
    :param start_epoch: only return comments posted after this timestamp
    :param end_epoch: only return comments posted before this timestamp
    """
    api = PushshiftAPI()

    # query comments from the subreddit for the given epoch
    gen = api.search_comments(after=start_epoch, before=end_epoch,
                              subreddit=subreddit_name, limit=POST_LIMIT)
    # select the body containing the text of the post
    posts = [comment[-1]['body'] for comment in list(gen)]

    # check if limit reached/exceeded, i.e. the result may be incomplete
    if len(posts) >= POST_LIMIT:
        logging.warning('Post limit (%d) reached/exceeded for epoch %d to %d; some comments may be missing.'
                        % (POST_LIMIT, start_epoch, end_epoch))
    else:
        logging.info('No errors reported.\n')
    
    return posts
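
If the post limit is reached, one option is to query several smaller time windows instead of one large epoch. The helper below only splits the time range and does not call the API; `split_epoch_range` and the default window of one day are illustrative assumptions.

```python
def split_epoch_range(start_epoch, end_epoch, window_seconds=24 * 60 * 60):
    """Yield consecutive (start, end) windows covering start_epoch up to end_epoch."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_start = start_epoch
    while window_start < end_epoch:
        # the last window is shortened so that it never passes end_epoch
        window_end = min(window_start + window_seconds, end_epoch)
        yield window_start, window_end
        window_start = window_end
```

Each window could then be passed to `download_subreddit`. Whether the API treats the window bounds as inclusive is not checked here, so comments posted exactly on a boundary might need to be deduplicated.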

In [10]:
# now let us call the function with parameters
data = []

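# NOTE: subreddit names cannot contain spaces, so 'Donald Trump' matches no subreddit
# and the query below returns no comments; replace it with an existing subreddit name
subreddit = 'Donald Trump'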

start_year = 2018
start_month = 4
start_day = 1

end_year = 2019
end_month = 4
end_day = 2

start_epoch = int(dt.datetime(start_year, start_month, start_day).timestamp())
end_epoch = int(dt.datetime(end_year, end_month, end_day).timestamp())

data = download_subreddit(subreddit, start_epoch, end_epoch)
logging.info('Downloaded %d reddit posts from the subreddit /r/%s.' % (len(data), subreddit))

print_some(data)

DEBUG:Connecting to /meta endpoint to learn rate limit.
DEBUG:URL: https://api.pushshift.io/meta
DEBUG:Payload: {}
DEBUG:Starting new HTTPS connection (1): api.pushshift.io:443
DEBUG:https://api.pushshift.io:443 "GET /meta HTTP/1.1" 200 None
INFO:https://api.pushshift.io/meta
DEBUG:Response status code: 200
DEBUG:server_ratelimit_per_minute: 120
DEBUG:URL: https://api.pushshift.io/reddit/comment/search
DEBUG:Payload: {'after': 1522533600, 'before': 1554156000, 'subreddit': 'Donald Trump', 'limit': 1000, 'metadata': 'true', 'sort': 'desc'}
DEBUG:Starting new HTTPS connection (1): api.pushshift.io:443
DEBUG:https://api.pushshift.io:443 "GET /reddit/comment/search?after=1522533600&before=1554156000&subreddit=Donald+Trump&limit=1000&metadata=true&sort=desc HTTP/1.1" 200 None
INFO:https://api.pushshift.io/reddit/comment/search?after=1522533600&before=1554156000&subreddit=Donald+Trump&limit=1000&metadata=true&sort=desc
DEBUG:Response status code: 200
DEBUG:Metadata: {'after': 1522533600, 'ag

*********************
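
The `after` value 1522533600 in the payload above lies two hours before midnight UTC on 2018-04-01, because `timestamp()` interprets a datetime without timezone information in the local timezone of the machine. If the epochs should be the same on every machine, the timezone can be set explicitly. A minimal sketch, with `utc_epoch` as a hypothetical helper name:

```python
import datetime as dt


def utc_epoch(year, month, day):
    """Return the Unix timestamp of midnight UTC on the given date."""
    return int(dt.datetime(year, month, day, tzinfo=dt.timezone.utc).timestamp())


print(utc_epoch(2018, 4, 1))  # 1522540800
print(utc_epoch(2019, 4, 2))  # 1554163200
```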


# 2. Basic text processing

To create a uniform plain text corpus, certain data processing methods might have to be applied (this depends on how **clean** your data already is).

### First, gather all your data in a single list:


In [23]:
import re
import os
import gzip
import json
from pathlib import Path
import pickle

In [54]:
def load_data(filename):
    return json.loads(gzip.GzipFile(filename).read().decode('utf-8'))
data = load_data('reviews_Musical_Instruments_5.json.gz')

JSONDecodeError: Extra data: line 2 column 1 (char 523)
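
The "Extra data" error above typically means that the file contains one JSON object per line rather than a single JSON document. Under that assumption, the compressed file could be read directly as sketched below; the function name `load_json_lines_gz` is hypothetical.

```python
import gzip
import json


def load_json_lines_gz(filename):
    """Parse a gzip-compressed file that stores one JSON object per line."""
    records = []
    # mode 'rt' makes gzip decode the bytes to text for us
    with gzip.open(filename, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore blank lines
                records.append(json.loads(line))
    return records
```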

In [56]:
# the file stores one JSON object per line (JSON Lines), so it has to be parsed line by line
data = []
with open('Musical_Instruments_5.json', 'r') as json_file:
    for line in json_file:
        data.append(json.loads(line))


In [84]:
# keep only the review text of each entry
cont = [entry["reviewText"] for entry in data]

In [92]:
data = cont

In [93]:
# only run this cell once
all_data = []
logging.info('Your data is now empty.')

INFO:Your data is now empty.


In [94]:
# here you can add data from the above methods to the all_data list
logging.info('Your data contained %d text entries.' % len(all_data))
all_data += data
logging.info('Your data now contains %d text entries.' % len(all_data))

INFO:Your data contained 0 text entries.
INFO:Your data now contains 10261 text entries.


### Apply some text cleaning:

In [95]:
# some very basic text cleaning functions
import re
from html import unescape


def remove_short_texts(texts, minimum_char_count=250):
    new_data = []
    for text in texts:
        if len(text) >= minimum_char_count:
            new_data.append(text)
    return new_data


def remove_html_escaping(texts):
    """Function to unescape html-escaped symbols, such as &gt; or &#x200B;.
    The list is modified in place and returned.
    """
    for index, text in enumerate(texts):
        texts[index] = unescape(text)
    return texts


def remove_urls(texts):
    """Function to remove URLs from a text.
    """
    url_regex = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"
    for index, text in enumerate(texts):
        texts[index] = re.sub(url_regex, '', text)
    return texts
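
To see what these cleaning steps do, here is a small self-contained sketch on made-up sample texts. It uses `html.unescape` as above, but replaces the long URL pattern with a much simpler one (`https?://\S+`) purely for illustration; this simplified pattern only catches URLs that start with `http://` or `https://`. The helper name `clean_texts` and the sample strings are hypothetical. Unlike the functions above, it returns a new list rather than modifying its input.

```python
import re
from html import unescape

# simplified URL pattern (assumption, for illustration only)
SIMPLE_URL_REGEX = re.compile(r'https?://\S+')


def clean_texts(texts, minimum_char_count=10):
    """Return a cleaned copy of texts; the input list is left unchanged."""
    cleaned = []
    for text in texts:
        if len(text) < minimum_char_count:
            continue  # drop very short texts
        text = unescape(text)                  # e.g. &#34; becomes "
        text = SIMPLE_URL_REGEX.sub('', text)  # remove http(s) URLs
        cleaned.append(text)
    return cleaned


samples = [
    'too short',
    'The &#34;jumbo&#34; sound, see https://example.com/review for details.',
]
print(clean_texts(samples))
```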

In [96]:
print_some(all_data)
print('\n---------------------------------------------\n')

MINIMUM_CHAR_COUNT = 10

data_clean = all_data
data_clean = remove_short_texts(data_clean, minimum_char_count=MINIMUM_CHAR_COUNT)
data_clean = remove_html_escaping(data_clean)
data_clean = remove_urls(data_clean)

print_some(data_clean)

*********************
These strings are really quite good, but I wouldn't call them perfect.  The unwound strings are not quite as bright as I am accustomed to, but they still ring nicely.  This is the only complaint I have about these strings.  If the unwound strings were a tiny bit brighter, these would be 5-star strings.  As it stands, I give them 4.5 stars... not a big knock, actually.The low-end on the wound strings is very nice and quite warm.  I put these on a jumbo and it definitely accentuates the &#34;jumbo&#34; aspect of my acoustic.  The sound is very big, full, and nice.Definitely a recommended product!4.5/5 stars
**********1**********
Well, MADE by Elixir and DEVELOPED with Taylor Guitars ... these strings were designed for the new 800 (Rosewood) series guitars that came out this year (2014) ... the promise is a &#34;bolder high end, fuller low end&#34; ... I am a long-time Taylor owner and favor their 800 series (Rosewood/Spruce is my favorite combo in tone woods) ... I 

*********************
These strings are really quite good, but I wouldn't call them perfect.  The unwound strings are not quite as bright as I am accustomed to, but they still ring nicely.  This is the only complaint I have about these strings.  If the unwound strings were a tiny bit brighter, these would be 5-star strings.  As it stands, I give them 4.5 stars... not a big knock, actually.The low-end on the wound strings is very nice and quite warm.  I put these on a jumbo and it definitely accentuates the "jumbo" aspect of my acoustic.  The sound is very big, full, and nice.Definitely a recommended product!4.5/5 stars
**********1**********
Well, MADE by Elixir and DEVELOPED with Taylor Guitars ... these strings were designed for the new 800 (Rosewood) series guitars that came out this year (2014) ... the promise is a "bolder high end, fuller low end" ... I am a long-time Taylor owner and favor their 800 series (Rosewood/Spruce is my favorite combo in tone woods) ... I have almost alwa

In [97]:
len(data_clean)

10253

# 3. Check your data

__Do not modify this section! We will run it with the unmodified version before including your data into our corpus; if your data does not pass the check, you will fail the exercise.__

Run these functions before uploading your data to check whether its format and size are valid for inclusion in our corpus.

In [98]:
# !!! do not modify this part of the code !!!
import string
import nltk
from nltk.corpus import brown

nltk.download('brown')
BROWN_TYPES = set(brown.words())

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [99]:
MIN_CHARS = 1000000  # 1 M characters ~ 100k words
CHAR_LIMIT = 5000000  # 5 M characters ~ 500k words


def check_language(some_data):
    matched_words = 0
    total_words = 0
    for text in some_data:
        text = text.translate(str.maketrans('', '', string.punctuation)).split()
        for word in text:
            total_words += 1
            if word in BROWN_TYPES:
                matched_words += 1
    threshold = 2./3
    if matched_words/max(float(total_words), 1) > threshold:
        logging.info(' Language ok: %2.2f%% of the words in your data seem to be common English.' % (100*matched_words/float(total_words)))
    else:
        logging.error(' Language error: only %2.2f%% of the words in your data seem to be common English; some additional preprocessing might be necessary.'  % (100*matched_words/max(float(total_words), 1)))


def limit_size(some_data):
    char_count = 0
    limited_data = []
    for text in some_data:
        if char_count + len(text) > CHAR_LIMIT:
            logging.warning(' data truncated ...')
            continue  # note: only this segment gets skipped, there might be a shorter one following which still fits
        limited_data.append(text)
        char_count += len(text)
    if char_count < MIN_CHARS:
        logging.error(' Data size error: your text has too few characters (%d); you should aim for at least %d.' % (char_count, MIN_CHARS))
    else:
        logging.info(' Data size ok: your text consists of %d characters.' % char_count)
    return limited_data


all_data = remove_short_texts(all_data, minimum_char_count=250)
all_data = limit_size(all_data)
check_language(all_data)

# !!! do not modify this part of the code !!!

INFO: Data size ok: your text consists of 4266978 characters.
INFO: Language ok: 91.23% of the words in your data seem to be common English.


# 4. Save your data

Save the plain text in our specific format so that it can be included in the NLP2020 course text corpus.

In [100]:
import os
import gzip
import json

def save_data(texts, filename):
    # find the first unused index so that existing files are never overwritten
    file_id = 0
    while os.path.isfile(filename.with_name(filename.name + '_' + str(file_id)).with_suffix('.json.gz')):
        file_id += 1
    filename = filename.with_name(filename.name + '_' + str(file_id)).with_suffix('.json.gz')
    # store the list of texts as gzip-compressed, UTF-8 encoded JSON
    with gzip.GzipFile(filename, 'w') as data_file:
        data_file.write(json.dumps(texts).encode('utf-8'))
    return filename


# you probably won't need this function
def load_data(filename):
    # close the file again once the content has been read
    with gzip.GzipFile(filename) as data_file:
        return json.loads(data_file.read().decode('utf-8'))

In [101]:
output_data_file_path = Path('.') / 'all_data2'

save_data(all_data, output_data_file_path)

WindowsPath('all_data2_0.json.gz')
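
Before uploading, it can be worth checking that the saved file reads back unchanged. Below is a minimal round-trip sketch, assuming the same gzip-compressed JSON format as `save_data`. The helper name `roundtrip_ok`, the file name and the sample texts are made up for illustration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def roundtrip_ok(texts, path):
    """Write texts as gzip-compressed JSON and check they read back unchanged."""
    with gzip.open(path, 'wb') as data_file:
        data_file.write(json.dumps(texts).encode('utf-8'))
    with gzip.open(path, 'rb') as data_file:
        restored = json.loads(data_file.read().decode('utf-8'))
    return restored == texts


# use a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample_0.json.gz'
    print(roundtrip_ok(['first text', 'second text with ünicode'], sample_path))
```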

In [None]:
# Credits to the University of Hildesheim