# Preprocessing - Extra tools

## Convenience functions for reading in data

Here, we define a set of helper functions that simplify reading in the datasets we'll use throughout today.

In [None]:
import os
import re
import glob
import pandas as pd

DATA_DIR = '../data'

def read_pride():
    fname = os.path.join(DATA_DIR, 'pride-and-prejudice.txt')
    with open(fname) as f:
        return f.read()

def read_trump():
    fname = os.path.join(DATA_DIR, 'trump-tweets.csv')
    df = pd.read_csv(fname)
    return list(df['Tweet_Text'].values)

def read_austen():
    fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
    fnames = glob.glob(fnames)
    austen = ''
    for fname in fnames:
        with open(fname) as f:
            text = f.read()
            austen += text
    return austen

def read_amazon(n=2):   
    fnames = os.path.join(DATA_DIR, 'amazon', '*.csv')
    fnames = glob.glob(fnames)
    reviews = []
    column_names = ['id', 'product_id', 'user_id', 'profile_name', 'helpfulness_num', 'helpfulness_denom',
                   'score', 'time', 'summary', 'text']
    for fname in fnames[:n]:
        df = pd.read_csv(fname, names=column_names)
        text = list(df['text'].iloc[1:])
        reviews.extend(text)
    return reviews

def read_dante():
    fname = os.path.join(DATA_DIR, 'dante.txt')
    with open(fname) as f:
        return f.read()

def read_example(n=1):
    fname = os.path.join(DATA_DIR, 'example{}.txt'.format(n))
    with open(fname) as f:
        return f.read()
    
def read_music():
    fname = os.path.join(DATA_DIR, 'music_reviews.csv')
    return list(pd.read_csv(fname, sep='\t')['body'])

## Reading in files

The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

#### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

- What type of object is `raw`?
- How many characters are in `raw`?
- How do we get the first 1000 characters of `raw`?

In [None]:
import os
DATA_DIR = '../day-1/data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

#### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

- What type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- How can we get the text of the first tweet?
- How can we get a list of the texts of all tweets?

In [None]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []
with open(fname) as f:
    reader = csv.reader(f)
    tweets = list(reader)

#### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

- How many tweets are there?
- What happened to the header row?

In [None]:
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname)

In [None]:
tweets.head(3)

In [None]:
tweet_text = list(tweets['Tweet_Text'])
tweet_text[:4]

#### Reading in `.json` files

Python has built-in support for reading in `.json` files.

- How many questions are there in the dataset?
- What data type is each question?
- How can we access the question text of the first question?
- How can we get a list of the texts of all questions?

In [None]:
# import json
# fname = 'jeopardy.json'
# fname = os.path.join(DATA_DIR, fname)
# with open(fname) as f:
#     data = json.load(f)

In [None]:
# data[:3]

#### Reading in `.html` files

The best way to read in `.html` files in Python is with the `BeautifulSoup` package.

In [None]:
from bs4 import BeautifulSoup
fname = 'time.html'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    html = f.read()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning

In [None]:
texts = soup.find_all(string=True)  # find_all/string replace the older findAll/text spelling
#texts = soup.getText()
texts[:5]

#### Reading in `.xml` files

We read in `.xml` files using the `ElementTree` module of Python's standard library. We can think of `.xml` files as trees where each branch has a tag name. We can find all the branches with a certain name as follows:

In [None]:
from xml.etree import ElementTree as ET
fname = 'books.xml'
fname = os.path.join(DATA_DIR, fname)
e = ET.parse(fname)
root = e.getroot()

In [None]:
descriptions = root.findall('*/description')
text = [d.text for d in descriptions]
text[:3]
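
To see how `findall` walks the tree without needing `books.xml`, here is a minimal sketch on an inline XML string. The catalog layout and the tag names below are invented for illustration and are only assumed to resemble the real file.

```python
from xml.etree import ElementTree as ET

# A tiny, made-up catalog; the real books.xml may be structured differently.
xml_string = """
<catalog>
  <book id="b1">
    <title>First Book</title>
    <description>A story about trees.</description>
  </book>
  <book id="b2">
    <title>Second Book</title>
    <description>A story about branches.</description>
  </book>
</catalog>
""".strip()

toy_root = ET.fromstring(xml_string)

# '*/description' means: any child of the root, then its <description> child.
toy_descriptions = toy_root.findall('*/description')
toy_text = [d.text for d in toy_descriptions]
print(toy_text)

# iter() visits matching tags at any depth, which helps if the nesting is unknown.
all_descriptions = [d.text for d in toy_root.iter('description')]
print(all_descriptions)
```

In this toy tree both approaches return the same two strings, because every `description` sits exactly two levels below the root.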

#### Reading in multiple files

Often, our text data is split across multiple files in a folder. We want to be able to read them all into a single variable.

- What type is `austen`?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How 

In [None]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with open(fname) as f:
        text = f.read()
        austen += text
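
The same pattern can be tried without the Austen folder. The sketch below writes three throwaway files to a temporary directory (the file names and contents are made up) and reads them back. It also sorts the result of `glob.glob`, since that function returns matches in arbitrary order, and joins the pieces once at the end rather than growing a string inside the loop.

```python
import glob
import os
import tempfile

# Build a throwaway folder with three small files (stand-ins for the novels).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [('b.txt', 'second '), ('a.txt', 'first '), ('c.txt', 'third')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(content)

    # glob.glob returns matches in arbitrary order; sort for reproducibility.
    toy_fnames = sorted(glob.glob(os.path.join(tmp_dir, '*.txt')))

    # Read each file, collect the pieces, and join them once at the end.
    pieces = []
    for toy_fname in toy_fnames:
        with open(toy_fname, encoding='utf-8') as f:
            pieces.append(f.read())
    combined = ''.join(pieces)

print(combined)
```

Whether file order matters depends on the analysis: for word counts it does not, but for anything that looks at position in the text it does.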

## Character encoding

Character encoding used to cause more trouble, particularly in Python 2. With Python 3 and most text files being encoded in `UTF-8`, we rarely need to think about it. Note, though, that `open` falls back to a platform-dependent default encoding when none is given. If you're getting nonsense (or a `UnicodeDecodeError`) when reading in a file, try adding `encoding='utf-8'` to the `open` call.

In [None]:
fname = 'dante.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text[5000:6000]

In [None]:
fname = 'akutagawa.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text[5000:6000]
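
What "nonsense" looks like can be reproduced without any files. The sketch below encodes a short string (chosen here purely for illustration) to bytes and decodes it twice, once with the wrong codec and once with the right one.

```python
original = 'perché'

# Text is stored on disk as bytes; the encoding decides which bytes.
utf8_bytes = original.encode('utf-8')

# Decoding with the wrong codec does not fail here, it silently garbles the text.
garbled = utf8_bytes.decode('latin-1')

# Decoding with the codec that was used for writing recovers the original.
recovered = utf8_bytes.decode('utf-8')

print(len(original), len(utf8_bytes))
print(garbled)
print(recovered)
```

The accented character takes two bytes in UTF-8, and reading those two bytes as two separate Latin-1 characters is what produces the garbled output.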

## Revision

I've read in the text of Jane Austen's _Pride and Prejudice_ into a variable called `pride`. Your tasks are to:
- Figure out what type of Python object `pride` is.
- Tokenize the text and store it in a variable called `tokenized_pride`.
- Figure out what type `tokenized_pride` is.
- Remove all punctuation from `pride`.
- Remove all punctuation from `tokenized_pride`.
- Break `pride` up into sentences and store the result as `sents_pride`.

In [None]:
pride = read_pride()

## OOV words

Sometimes it's best to remove infrequent words (sometimes not!). When we do, it's usually because a downstream method (like classification) is sensitive to rare words. Here we replace every word that occurs only once with a single out-of-vocabulary (`OOV`) placeholder.

In [None]:
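from nltk import word_tokenize
from string import punctuation
# `tweets` is a DataFrame at this point, so join the list of tweet texts instead
all_tweets = ' '.join(tweet_text)

# define regex patterns for cleaning
url_pattern = r'https?://\S+'  # stop at whitespace so the text after a URL survives
URL_SIGN = ' URL '
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'  # raw string avoids an invalid-escape warning
DIGIT_SIGN = ' DIGIT '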

In [None]:
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

We can count the frequency of each word type with `Counter` from Python's built-in `collections` module. It takes the list of tokens and builds a special Python dictionary that maps each word type to the number of times it appears in the list. We can ask that dictionary for the most common words, or for the frequency of an individual word type.

In [None]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

In [None]:
freq['unleashed']

In [None]:
OOV = 'OOV'
new_tokens = []
for token in tokens:
    if freq[token] == 1:
        new_tokens.append(OOV)
    else:
        new_tokens.append(token)

In [None]:
new_tokens[:20]
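
The loop above uses a fixed cutoff of one occurrence. As a minimal sketch, the same idea can be wrapped in a function with an adjustable threshold; the name `replace_rare`, the `min_count` parameter, and the toy sentence are all assumptions made for this example.

```python
from collections import Counter

toy_tokens = 'the cat sat on the mat the dog sat'.split()
toy_freq = Counter(toy_tokens)

def replace_rare(tokens, freq, min_count=2, oov='OOV'):
    """Replace every token seen fewer than min_count times with the oov marker."""
    return [tok if freq[tok] >= min_count else oov for tok in tokens]

toy_new_tokens = replace_rare(toy_tokens, toy_freq)
print(toy_new_tokens)

# A stricter threshold sends more word types to the placeholder.
print(replace_rare(toy_tokens, toy_freq, min_count=3))
```

With `min_count=2` this matches the behaviour of the loop above: only words that occur exactly once are replaced.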

## TF-IDF scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind these word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is the TF-IDF score. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as:

`number_of_documents / number_documents_with_term`

so:

`tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)`

You can, and often should, normalize the term frequency by the length of the document:

`tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)`

We can calculate this manually, but scikit-learn has a built-in class, `TfidfVectorizer`, to do so. It also takes the logarithm of the inverse document frequency and normalizes each document's scores, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use pandas to calculate this manually.
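
To make the length-normalized formula concrete, here is a minimal pure-Python sketch on three made-up "documents". It implements the plain ratio from the formula above, with no logarithm, so its numbers will differ from what scikit-learn reports. The toy corpus and the helper name `toy_tfidf` are assumptions for illustration.

```python
from collections import Counter

# Three tiny, made-up documents, already lowercased and split on whitespace.
toy_docs = [
    'the guitar solo is the best'.split(),
    'the drums are loud'.split(),
    'the guitar is quiet'.split(),
]

n_docs = len(toy_docs)

# Document frequency: in how many documents does each word type appear?
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc))

def toy_tfidf(doc):
    """Length-normalized term frequency times (n_docs / document frequency)."""
    counts = Counter(doc)
    return {word: (count / len(doc)) * (n_docs / doc_freq[word])
            for word, count in counts.items()}

scores = toy_tfidf(toy_docs[0])
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 3))
```

In the first document, 'solo' and 'best' score highest because they appear nowhere else. Note that 'the' is not pushed to zero: with the plain ratio, a word found in every document gets an inverse document frequency of 1, so its score is simply its normalized frequency.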

In [None]:
whitespace_pattern = r'\s+'
clean = [re.sub(url_pattern, URL_SIGN, t) for t in tweet_text]  # iterate over the text list, not the DataFrame
clean = [re.sub(hashtag_pattern, HASHTAG_SIGN, t) for t in clean]
clean = [re.sub(digit_pattern, DIGIT_SIGN, t) for t in clean]
clean = [re.sub(whitespace_pattern, ' ', t) for t in clean]
clean[:4]

In [None]:
music = read_music()
music[:5]

In [None]:
def remove_digit(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

no_digits = [remove_digit(comment) for comment in music]

### The `TfidfVectorizer` class

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(no_digits)
sparse_tfidf

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out())
tfidf.head()

### Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre.

In [None]:
fname = os.path.join(DATA_DIR, 'music_reviews.csv')
reviews = pd.read_csv(fname, sep='\t')

tfidf['genre_'] = reviews['genre']
tfidf.head()

In [None]:
rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']

rap.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
indie.max(numeric_only=True).sort_values(ascending=False).head()

In [None]:
jazz.max(numeric_only=True).sort_values(ascending=False).head()

There we go! A method of identifying distinctive words. Notice that there are some proper nouns in there. How might we remove those if we're not interested in them?

### Difference of proportions

A simple way to find distinctive words in two texts (see also TF-IDF in the day 2 notebook) is to calculate, for each word, the difference between its proportions in the two texts, and then look at the words with the highest and lowest differences. In theory, frequent words like 'the' and 'of' will have a small difference. In practice this doesn't happen.

To demonstrate this we will run a difference of proportions calculation on *Pride and Prejudice* and *A Garland for Girls*.

To get the text in shape for scikit-learn we need to create a list with each novel as an element. We'll use the `append` method to do this.

In [None]:
import nltk

text_list = []
# open and read the novels, save them as variables
with open('../day-2/data/Austen_PrideAndPrejudice.txt', encoding='utf-8') as f:
    austen_string = f.read()
with open('../day-2/data/Alcott_GarlandForGirls.txt', encoding='utf-8') as f:
    alcott_string = f.read()

# append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)
print(text_list[0][:100])

Create a DTM (document-term matrix) from these two novels, convert it to a pandas DataFrame, and inspect the output:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()

novels_df = pd.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

Notice the number of rows and columns.

Question: What does this mean?

Next, we need to get a word frequency count for each novel, which we can do by summing across the entire row. Note how the syntax is different here compared to when we summed one column across all rows.

In [None]:
novels_df['word_count'] = novels_df.sum(axis=1)
novels_df

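Next we divide each frequency cell by the word count. This syntax gets a bit tricky, so let's walk through it: `.iloc[:,:]` selects every row and column, and `.div(novels_df.word_count, axis=0)` divides each cell in a row by that row's `word_count`.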

In [None]:
novels_df = novels_df.iloc[:,:].div(novels_df.word_count, axis=0)
novels_df
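
The effect of `axis=0` is easier to see on a table small enough to check by hand. The counts below are made up for illustration, and `toy_df` is a hypothetical stand-in for `novels_df`.

```python
import pandas as pd

# A made-up table of word counts, one row per document.
toy_df = pd.DataFrame({'cat': [2, 0], 'dog': [1, 3], 'fish': [1, 1]})
toy_df['word_count'] = toy_df.sum(axis=1)

# axis=0 aligns the divisor with the row index, so every cell in row i
# is divided by word_count[i]. The word_count column itself becomes 1.0.
toy_props = toy_df.div(toy_df['word_count'], axis=0)
print(toy_props)
```

Each row of `toy_props` now holds proportions that sum to 1 across the word columns, which is what lets us compare documents of different lengths.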

Finally, we subtract one row from another, and add the output as a third row.

In [None]:
novels_df.loc[2] = novels_df.loc[0] - novels_df.loc[1]
novels_df

We can sort based on the values of this row:

In [None]:
novels_df.loc[2].sort_values(ascending=False)
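
For readers who want to see the whole calculation in one place, here is a minimal sketch of the same difference of proportions in plain Python, without the DTM. The two toy texts and the helper name `proportions` are assumptions for illustration, and tokenization is a simple whitespace split rather than what `CountVectorizer` does.

```python
from collections import Counter

def proportions(text):
    """Map each word type to its share of all tokens in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

text_a = 'the ball the dance the ball'
text_b = 'the garden the garden the girls'

prop_a = proportions(text_a)
prop_b = proportions(text_b)

# Words missing from one text count as proportion 0 there.
vocab = set(prop_a) | set(prop_b)
diff = {word: prop_a.get(word, 0.0) - prop_b.get(word, 0.0) for word in vocab}

ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])   # largest positive difference: most characteristic of text_a
print(ranked[-1])  # largest negative difference: most characteristic of text_b
```

Here 'the' gets a difference of exactly zero only because the toy texts were built so that it makes up half of each one.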

Stop words are still in there. Why?

We can, of course, remove stop words ourselves, which does successfully surface distinctive content words.

We can do this in the `CountVectorizer` step by setting the `stop_words` option.

In [None]:
# change the stop_words option to 'english'
countvec_sw = CountVectorizer(stop_words="english")

# same as the code above
novels_df_sw = pd.DataFrame(countvec_sw.fit_transform(text_list).toarray(), columns=countvec_sw.get_feature_names_out())
novels_df_sw['word_count'] = novels_df_sw.sum(axis=1)
novels_df_sw = novels_df_sw.iloc[:,0:].div(novels_df_sw.word_count, axis=0)
novels_df_sw.loc[2] = novels_df_sw.loc[0] - novels_df_sw.loc[1]
novels_df_sw.loc[2].sort_values(axis=0, ascending=False)

We can also do this by setting the `max_df` option (maximum document frequency) to either an absolute value (an integer) or a decimal between 0 and 1. An absolute value indicates that if a word occurs in more documents than the stated value, that word **will not** be included in the DTM. A decimal value does the same, but as a proportion of documents. Note that `max_df=1` (an integer) and `max_df=1.0` (a float) therefore mean different things.
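
The rule can be sketched in plain Python on a corpus small enough to check by hand. This is an illustration of the rule as described above, not scikit-learn's own code; the helper name `filter_by_max_df` and the three toy documents are assumptions.

```python
def filter_by_max_df(docs, max_df):
    """Keep word types whose document frequency does not exceed max_df.

    An int is read as an absolute number of documents, a float as a
    proportion of the corpus.
    """
    n_docs = len(docs)
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    doc_freq = {}
    for doc in docs:
        # set() so that a word is counted once per document
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(word for word, df in doc_freq.items() if df <= limit)

toy_corpus = ['the red cat', 'the red dog', 'the blue bird']

kept_abs = filter_by_max_df(toy_corpus, 2)     # drop words found in more than 2 documents
kept_prop = filter_by_max_df(toy_corpus, 0.5)  # drop words found in more than half of them
print(kept_abs)
print(kept_prop)
```

With three documents, the integer `2` removes only 'the', while the proportion `0.5` sets the limit at 1.5 documents and so also removes 'red'.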

Question: In the case of this corpus, what does setting the max_df value to 1 do? What output do you expect?

In [None]:
#Change max_df option to 1
countvec_freq = CountVectorizer(max_df=1)

#same as the code above
novels_df_freq = pd.DataFrame(countvec_freq.fit_transform(text_list).toarray(), columns=countvec_freq.get_feature_names_out())
novels_df_freq['word_count'] = novels_df_freq.sum(axis=1)
novels_df_freq = novels_df_freq.iloc[:,0:].div(novels_df_freq.word_count, axis=0)
novels_df_freq.loc[2] = novels_df_freq.loc[0] - novels_df_freq.loc[1]
novels_df_freq.loc[2].sort_values(axis=0, ascending=False)

Question: What would happen if we set the max_df to 2, in this case?

Question: What might we do for the music reviews dataset?

### Challenge

Use the difference of proportions calculation to compare two genres, or two artists, in the music reviews dataset. There are many ways you can do this. Think through the problem in steps. 