# Preprocessing

Like other kinds of data, text never comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we always need to perform some level of preprocessing. Text data has its own kinds of preprocessing. In this notebook, we will cover the core preprocessing methods to get your feet wet:

- Reading in .txt and .csv files
- Tokenization
- Sentence segmentation
- Removing punctuation
- Stripping whitespace
- Text normalization
- Stop words
- Stemming/Lemmatizing
- POS tagging

This notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to Python, see the notebook at `solutions/intro-to-python.ipynb`. 

## Reading in files

The first step is to read in the files containing the data. The most common file types for text data are: `.txt`, `.csv`, `.json`, `.html` and `.xml`.

### Reading in `.txt` files

Python has built-in support for reading in `.txt` files.

In [5]:
import os
DATA_DIR = 'data'
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()

#### Review of Python string methods

- What type of object is `raw`?
- How many characters are in `raw`?
- Get the first 1000 characters of `raw`.
- Join together the first 200 and the last 200 characters of `raw`.

In [6]:
# your code here

### Reading in `.csv`

Python has a built-in module called `csv` for reading in csv files.

In [8]:
import csv
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = []
import codecs
# errors='ignore' silently drops any bytes that are not valid UTF-8
with codecs.open(fname, "r", encoding='utf-8', errors='ignore') as f:
    reader = csv.reader(f)
    tweets = list(reader)
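
As a small aside, the `csv` module also offers `csv.DictReader`, which uses the first row as column names. The sketch below runs on an in-memory sample rather than the real file. The `Date` column and both rows are made up for illustration, and `Tweet_Text` is the column name used later in this notebook.

```python
import csv
import io

# Hypothetical sample data; the quoted field contains a comma.
sample = io.StringIO(
    'Date,Tweet_Text\n'
    '2016-01-01,"Hello, world"\n'
    '2016-01-02,Second tweet\n'
)

dict_reader = csv.DictReader(sample)  # the first row becomes the field names
rows = list(dict_reader)              # each row is a dict keyed by column name

sample_texts = [row['Tweet_Text'] for row in rows]
print(sample_texts)
```

Note that the comma inside the quoted field does not split the field, which is the main reason to use `csv` rather than `line.split(',')`.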

#### Review of Python list methods

- What data type is `tweets`?
- How many entries are in `tweets`?
- Which entry is the header row?
- Get the first 10 entries.
- Join together the 5th and 10th elements of `tweets`.

In [9]:
# your code here

### Reading in `.csv` with `pandas`

`pandas` is a third-party library that makes working with tabular data much easier. This is the recommended way to read in a `.csv` file.

In [18]:
import os
import pandas as pd
fname = 'trump-tweets.csv'
fname = os.path.join(DATA_DIR, fname)
tweets = pd.read_csv(fname) 

#### Review of `pandas`

- What data type is `tweets`?
- How many tweets are there?
- What happened to the header row?
- Get the first row of `tweets`.
- Get the first 5 entries in the `Tweet_Text` column.

In [None]:
# your code here

### Reading in multiple files

Often, our text data is split across multiple files in a folder, and we want to read them all into a single variable. `glob` is a handy standard-library module for this: it lists all files matching a pattern, so we can use it to get all the files in a folder.

In [22]:
import glob
fnames = os.path.join(DATA_DIR, 'austen', '*.txt')
fnames = glob.glob(fnames)
austen = ''
for fname in fnames:
    with codecs.open(fname, "r", encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
        austen += text
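
The same pattern can be tried without the course data. This is a minimal sketch that builds a throwaway folder first; the file names and contents are invented for illustration. It also sorts the matches, since `glob` returns them in arbitrary order, and collects the pieces in a list before joining them.

```python
import glob
import os
import tempfile

# Create a temporary folder holding two small, made-up text files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [('b.txt', 'second '), ('a.txt', 'first ')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(content)

    pattern = os.path.join(tmp_dir, '*.txt')
    paths = sorted(glob.glob(pattern))  # sort for a reproducible order

    pieces = []
    for path in paths:
        with open(path, encoding='utf-8') as f:
            pieces.append(f.read())

combined = ''.join(pieces)
print(combined)
```

Joining a list once at the end is usually cheaper than repeated `+=` on a long string, although for a handful of novels either approach works.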

#### Review of working with files

- What does `os.path.join()` do in this case?
- What type is `fnames` after it is first assigned a value?
- What type is `fnames` after it is assigned a second value?
- How many files are in `fnames`?
- What type is `austen`?

In [None]:
# your code here

### Challenge 

Read in all the `.csv` files in the folder `amazon`. Extract out only the `text` column from THE FIRST TWO files and store them all in a list called `reviews`. 

**Hint 1:** Not all of these files have a header row to indicate column names. But for your reference, the columns are in this order: <br>
```Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text```

**Hint 2:** You can deal with `.csv` files without header rows by passing the argument `header=None` when loading them into a pandas DataFrame. This tells pandas not to mistake the first row of data for column names.

In [35]:
# your solution here

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,20000,B002C50X1M,A3LWC833HQIG7J,austin_Larry,0,0,5,1295568000,"Excellent chips, full of flavor and just the r...",I purchased the Salt and Vinegar chips and hav...
1,20001,B002C50X1M,AJKQOYD9ILG9Y,T. Ferek,0,0,5,1295049600,good chips,This is a heavy kettle style chip that is not ...
2,20002,B002C50X1M,A2WV4N0X29CIFN,leecash2fly,0,0,5,1294099200,Great Chips,I recently bought these chips. Came home and ...
3,20003,B002C50X1M,A32SQ3PVLO0UGQ,J. Gilbert,0,0,4,1293753600,Very good taste-twist chips,"These are not the ""same old"" chips by far and ..."
4,20004,B002C50X1M,A1WRBCQDIH209I,angiechildress,0,0,5,1286064000,chips,"I love Rosemary chips, but can't find any at m..."


In [39]:
amz_df1[9] # This is the `text` column

0       I purchased the Salt and Vinegar chips and hav...
1       This is a heavy kettle style chip that is not ...
2       I recently bought these chips.  Came home and ...
3       These are not the "same old" chips by far and ...
4       I love Rosemary chips, but can't find any at m...
                              ...                        
9995    I really like Pamela's baking mix so I tried t...
9996    This is the best gf bread mix I have found by ...
9997    THIS BREAD MIX IS THE CLOSEST THING TO REGULAR...
9998    Delicious and easy to make.  An excellent brea...
9999    I bought this mix for my daughter's boyfriend,...
Name: 9, Length: 10000, dtype: object

In [43]:
reviews = list(amz_df1[9]) + list(amz_df2[9])
reviews[:10]

['I purchased the Salt and Vinegar chips and have been very pleased. There is the right amount of vinegar, virtually every single chip I have tasted is done just right, no burned chips, and they have an excellent thickness to impart just the correct amount of potato taste.<br /><br />They go great with lunches or as a snack. They are very economical. Beats the heck in terms of quality, taste, and price to buying these at work or out and about (I bought the 2 ounce bags). I will be trying some of the other intriguing flavors. Recommended.',
 'This is a heavy kettle style chip that is not as "heavy" as kettle brand chips, but bas good flavor.  I will reorder.',
 "I recently bought these chips.  Came home and only three bags were left in the box.  My whole family loves them.  They have great flavor and better and healthier than more chips. I'm ordering a few more boxes today",
 'These are not the "same old" chips by far and the Asian Sweet & Spicy sparked my imagination too, makes you app

## Tokenization

Once we've read in the data, our next step is often to split it into words. This step is referred to as "tokenization". That's because each occurrence of a word is called a "token". Each distinct word used is called a word "type". So the word type "the" may correspond to multiple tokens of "the" in a text.
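
To make the token/type distinction concrete, here is a tiny sketch on a made-up sentence, using plain whitespace splitting as the tokenizer:

```python
sentence = 'the cat saw the dog and the dog saw the cat'  # made-up example

toy_tokens = sentence.split()  # every occurrence counts as a token
toy_types = set(toy_tokens)    # each distinct word counts once

print(len(toy_tokens))           # number of tokens
print(len(toy_types))            # number of word types
print(toy_tokens.count('the'))   # how many tokens belong to the type "the"
```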

#### Tokenizing by whitespace

- What problems do you notice with tokenizing by whitespace?
- What type is `text`?
- What type is `tokens`?
- What type is each element of `tokens`?

In [None]:
import os
fname = 'example1.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()

In [None]:
text

In [None]:
text.split()[:10]

#### Tokenizing with regular expressions

In [None]:
import re
word_pattern = r'\w+'
tokens = re.findall(word_pattern, text)
tokens[:10]
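
Neither approach is perfect. The sketch below compares the two on a made-up sentence: whitespace splitting leaves punctuation glued to words, while `\w+` drops punctuation but also breaks contractions apart.

```python
import re

sample_sentence = "Don't panic, it's fine."  # made-up example

by_space = sample_sentence.split()
by_regex = re.findall(r'\w+', sample_sentence)

print(by_space)
print(by_regex)
```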

#### Tokenizing with `nltk`

[Just a bunch of regular expressions under the hood](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py)

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
tokens = word_tokenize(text)
tokens[:10]

### Challenge

A while ago you read a bunch of Jane Austen books into a variable called `austen`. Tokenize that using a method of your choice. Find all the unique word types (you might want the `set` function). Sort the resulting set object to create a vocabulary (you might want to use the `sorted` function).

In [None]:
# your solution here

## Sentence segmentation

Sentence segmentation involves identifying the boundaries of sentences.

#### Sentence segmentation by splitting on punctuation

In [None]:
text.split('.')

We could improve on this by using regular expressions. They allow us to split a string on any of several characters at once.

In [None]:
sent_boundary_pattern = r'[.?!]'
re.split(sent_boundary_pattern, text)
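
One thing to notice is that splitting on `[.?!]` throws the punctuation away. A possible variation, sketched here on made-up strings, splits on the whitespace that follows a sentence-final mark by using a lookbehind, so each sentence keeps its punctuation. It is still fooled by abbreviations.

```python
import re

# Split on whitespace preceded by '.', '?' or '!'; the lookbehind keeps the mark.
boundary = r'(?<=[.?!])\s+'

clean_case = re.split(boundary, 'Is it late? Yes. Go home!')
abbrev_case = re.split(boundary, 'Dr. Smith arrived. He sat down.')

print(clean_case)
print(abbrev_case)  # "Dr." is wrongly treated as its own sentence
```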

### Challenge

The file `example2.txt` has more punctuation problems. Read it in and see what the problems are. Try your best to modify the code from above to work for as many cases as you can.

In [None]:
# your solution here

#### Sentence segmentation by `nltk`

In [None]:
from nltk.tokenize import sent_tokenize
fname = 'example2.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
sent_tokenize(text)

## Removing punctuation

Sometimes (although admittedly less frequently than tokenizing and sentence segmentation), you might want to keep only the alphanumeric characters (i.e. the letters and numbers) and ditch the punctuation. Here's how we can do that.

- What type is `punctuation`?

In [None]:
from string import punctuation
punctuation

In [None]:
no_punct = ''.join([ch for ch in text if ch not in punctuation])
no_punct
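
An alternative to the list comprehension is `str.translate`, which deletes characters using a translation table. This is a sketch on a made-up string; note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would survive either approach.

```python
from string import punctuation

sample_string = 'Hello, world! (Testing: 1, 2, 3...)'  # made-up example

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() removes it.
delete_punct = str.maketrans('', '', punctuation)
sample_no_punct = sample_string.translate(delete_punct)
print(sample_no_punct)
```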

## Strip whitespace

This is an extremely common step. It's simple to perform and nicely pre-packaged in Python. It's particularly common for user-generated text (think survey forms).

In [None]:
string = ' Hello! '
string.strip()

In [None]:
fname = 'example3.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
print(text)

In [None]:
stripped_text = text.strip()
print(stripped_text)

In [None]:
whitespace_pattern = r'\s+'
clean_text = re.sub(whitespace_pattern, ' ', text)
clean_text.strip()

## Text normalization

Text normalization means making our text fit some standard patterns. Lots of steps come under this wide umbrella, but the most common are:

- case folding
- removing URLs, digits, hashtags
- handling out-of-vocabulary (OOV) words by removing infrequent words (not done here)

#### Case folding

Case folding means dealing with upper- and lower-case characters. This is usually done by making all characters lower case.

In [None]:
fname = 'example4.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname) as f:
    text = f.read()
text

In [None]:
text.lower()

### Challenge
The `lower` method we used above is a string method; that is, it works on strings. But what if you want to lowercase every word in a list (say you've already tokenized the text)? Take a list of tokens, such as the `tokens` list created earlier, and make each one lower case.

In [None]:
# your solution here

### Removing URLs, digits and hashtags

We rarely care about the exact URL used in a tweet, or the exact number. We could remove them completely (think about how we'd do that), but it's often informative to know that there is a URL or a digit in the text. So we want to replace individual URLs and digits with a placeholder that preserves the fact that a URL or a digit was there. It's standard to just use the strings "URL" and "DIGIT".

How do we do this? Once again, regular expressions save the day.

In [None]:
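url_pattern = r'https?://\S+'  # match the URL only, up to the next whitespace
tweet_text = tweets['Tweet_Text']  # the tweet text column of the DataFrame
single_tweet = tweet_text[0]
single_tweet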

In [None]:
URL_SIGN = ' URL '
re.sub(url_pattern, URL_SIGN, single_tweet)

Above we replaced the URL in a single tweet. Now we will replace all the URLs in all tweets in `tweet_text`.

In [None]:
url_pattern = r'https?://\S+'  # match the URL only, so text after it is kept
URL_SIGN = ' URL '
list_of_url_less_tweets = []
## Using a for loop
for tweet in tweet_text:
    url_less_tweet = re.sub(url_pattern, URL_SIGN, tweet)
    list_of_url_less_tweets.append(url_less_tweet)
list_of_url_less_tweets

In [None]:
## Alternative using list comprehension
list_of_url_less_tweets = [re.sub(url_pattern, URL_SIGN, tweet) for tweet in tweet_text]
list_of_url_less_tweets

Now let's remove hashtags and digits.

In [None]:
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
HASHTAG_SIGN = ' HASHTAG '
digit_pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged
DIGIT_SIGN = ' DIGIT '

In [None]:
no_hashtags = [re.sub(hashtag_pattern, HASHTAG_SIGN, tweet) for tweet in tweet_text]
no_hashtags

In [None]:
no_digit = [re.sub(digit_pattern, DIGIT_SIGN, tweet) for tweet in tweet_text]
no_digit
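
The three replacements can also be chained in one helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `normalize`, the simplified hashtag pattern and the example tweet are all assumptions made for illustration. URLs are replaced first because a URL can itself contain digits or a `#`.

```python
import re

URL_RE = re.compile(r'https?://\S+')
HASHTAG_RE = re.compile(r'#\w+')  # simplified pattern, assumed for this sketch
DIGIT_RE = re.compile(r'\d+')

def normalize(tweet):
    """Replace URLs, hashtags and digit runs with placeholder strings."""
    tweet = URL_RE.sub(' URL ', tweet)          # first: URLs may contain digits and '#'
    tweet = HASHTAG_RE.sub(' HASHTAG ', tweet)
    tweet = DIGIT_RE.sub(' DIGIT ', tweet)
    return re.sub(r'\s+', ' ', tweet).strip()   # collapse the extra spaces we added

example_tweet = 'Rally at 7pm #vote https://example.com/a1 see you there'
normalized = normalize(example_tweet)
print(normalized)
```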

## Counting word frequencies (after text normalization)

We can count the frequency of each word type with `Counter` from Python's built-in `collections` module. `Counter` takes the list of tokens and builds a special Python dictionary that maps each word type to the number of times it appears in the list. We can ask that dictionary for the most common words, or for the frequency of individual word types.

First, clean and normalize the text:

In [None]:
all_tweets = '\n'.join(tweets['Tweet_Text'].dropna())  # join the tweet texts, one tweet per line
clean = re.sub(url_pattern, URL_SIGN, all_tweets)
clean = re.sub(hashtag_pattern, HASHTAG_SIGN, clean)
clean = re.sub(digit_pattern, DIGIT_SIGN, clean)
tokens = word_tokenize(clean)
tokens = [token for token in tokens if token not in punctuation]
tokens[:20]

In [None]:
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)
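
To see how `Counter` behaves on something small enough to check by eye, here is a sketch with a made-up token list:

```python
from collections import Counter

toy_tokens = ['the', 'cat', 'saw', 'the', 'dog', 'the']  # made-up token list

toy_freq = Counter(toy_tokens)
print(toy_freq['the'])          # count for a single word type
print(toy_freq['bird'])         # unseen words count as 0 instead of raising KeyError
print(toy_freq.most_common(1))  # list of (word, count) pairs, most frequent first
```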

### Challenge 

Earlier, we read some Amazon reviews into a list called `reviews`. Each element of the list is a string, representing the text of a single review. Try to:
- Tokenize each review
- Strip all whitespace
- Make all characters lower case
- Replace any URLs and digits

Then find the most common 50 words.

In [None]:
# your solution here

## Removing stop words

You might have noticed that the most common words above aren't terribly exciting. They're words like "am", "i", "the" and "a": stop words. These are rarely useful to us in computational text analysis, so it's very common to remove them completely.

- What other stop words do you think there are?

In [None]:
import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop
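
The filtering idea can be tried on a toy example first. The stop list below is a tiny hand-written set used only for illustration (NLTK's English list is much longer), and the tokens are made up. Two details matter in practice: NLTK's list is lower case, so tokens should be lowercased before the lookup, and membership tests on a `set` are faster than on a list.

```python
# Tiny hand-written stop set, for illustration only.
toy_stop = {'i', 'am', 'the', 'a', 'of'}

toy_tokens = ['I', 'am', 'the', 'owner', 'of', 'a', 'bakery']  # made-up tokens

# Lowercase each token for the lookup so that "I" matches "i".
content_tokens = [tok for tok in toy_tokens if tok.lower() not in toy_stop]
print(content_tokens)
```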

### Challenge 

Use the list `stop` of English stopwords to remove stopwords from our tokenized review above.

In [None]:
# your solution here

## Stemming/lemmatization

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third person present tense verb or a progressive participle.

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter](https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py) algorithm, which in spirit isn't that far from a bunch of regular expressions.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmer.stem('grows')

In [None]:
stemmer.stem('running')

In [None]:
stemmer.stem('leaves')
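
To get a feel for what "a bunch of regular expressions" could look like, here is a deliberately naive suffix stripper. It is a toy written for this illustration and is not the Porter algorithm: the rules and the name `toy_stem` are made up, and the mistakes it makes on "running" and "sing" hint at why real stemmers need extra conditions.

```python
import re

# Ordered (pattern, replacement) rules; the first matching rule wins.
# Illustrative only: these are NOT the Porter rules.
TOY_RULES = [
    (r'sses$', 'ss'),   # "classes" -> "class"
    (r'ies$', 'y'),     # "parties" -> "party"
    (r'ing$', ''),      # "growing" -> "grow"
    (r'(?<!s)s$', ''),  # "grows" -> "grow", but leave "class" alone
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for pattern, replacement in TOY_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for word in ['grows', 'growing', 'classes', 'parties', 'running', 'sing']:
    print(word, '->', toy_stem(word))
```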

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
import nltk; nltk.download('wordnet') # Download resource for working with WordNet via NLTK
snowballer_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowballer_stemmer.stem('running'))
print(snowballer_stemmer.stem('leaves'))

In [None]:
print(lemmatizer.lemmatize('leaves'))

### Challenge 

Use the Porter stemmer to stem each word in the tweet dataset after having removed stop words.

In [None]:
# your solution here

## POS tagging

POS tagging means assigning each token a part-of-speech (e.g. noun, verb, adjective, etc.). Again, there are many different [alternatives](https://github.com/nltk/nltk/tree/develop/nltk/tag), but NLTK keeps its recommended POS tagger available through the function `pos_tag`. The tagger expects a list of tokens as input. When doing POS tagging, it is advisable **not** to remove stop words beforehand (although you are free to do it afterwards).

In [None]:
from nltk import pos_tag
single_review = reviews[3]
single_review

In [None]:
tokens = word_tokenize(single_review)
import nltk; nltk.download('averaged_perceptron_tagger')
tagged_review = pos_tag(tokens)
tagged_review
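
Once text is tagged, a common next step is to keep only tokens with certain tags. The sketch below uses hand-written `(token, tag)` pairs in the shape that `pos_tag` returns. The Penn Treebank tags were assigned by hand for illustration and are not actual tagger output.

```python
# Hand-tagged example, not produced by the tagger.
toy_tagged = [('The', 'DT'), ('fox', 'NN'), ('jumps', 'VBZ'),
              ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dogs', 'NNS')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in toy_tagged if tag.startswith('NN')]
print(nouns)
```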

### Challenge 

Below I've read in the text of Austen's _Pride and Prejudice_ into a variable called `pride`. Preprocess using the following steps:

- Strip whitespace
- Replace all numbers with '0'
- Tokenize
- Tag each token with a POS tag

Make sure you know:
- What type is the result?
- What type is each element of the result?
- What type are the elements of the elements of the result?

In [None]:
fname = 'pride-and-prejudice.txt'
fname = os.path.join(DATA_DIR, fname)
with open(fname, encoding='utf-8') as f:
    raw = f.read()
pride = raw[679:684814]
pride

In [None]:
# your solution here

## Things we didn't cover
(see `solutions/preprocessing_extra.ipynb` and [this repo](https://github.com/geoffbacon/nlp-with-nltk-spacy/blob/master/03-NLTK.ipynb) for more on these)

- Reading in JSON, HTML, and XML files 
- Removing infrequent words
- Named entity recognition
- Syntactic parsing
- Information extraction
- Removing markup from HTML
- Extracting numerical features
- DTM/TF-IDF
- SpaCy