# Natural Language Processing (NLP)

## What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build a probabilistic model using data about a language
- Requires an understanding of language and the world

## Higher level "task areas"

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
    - [Smart Compose](https://www.blog.google/products/gmail/subject-write-emails-faster-smart-compose-gmail/?utm_source=tw&utm_medium=feed&utm_campaign=io18)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

## Lower level "components"

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stop word removal**: removing common words
- **TF-IDF**: computing word importance
- **Stemming and lemmatization**: reducing words to their base form
- **Part-of-speech tagging**
- **Named entity recognition**: person/organization/location
- **Segmentation**: "New York City subway"
- **Word sense disambiguation**: "buy a mouse"
- **Spelling correction**
- **Language detection**
- **Machine learning**
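
As a rough illustration of one of these components, the sketch below computes TF-IDF by hand with the standard library. It assumes one common textbook variant (raw term count times `log(N / df)`); libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ. The toy documents and the `tf_idf` helper are made up for this example.

```python
import math
from collections import Counter

# toy corpus (hypothetical documents, not taken from the Yelp data)
docs = [
    "the food was great",
    "the service was slow",
    "the food and the service were great",
]

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({term: count * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, '->', {t: round(s, 3) for t, s in doc_scores.items()})
```

A term that appears in every document ("the") scores zero, while a term unique to one document ("slow") gets the highest weight per occurrence.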

## Agenda

1. Reading in the Yelp reviews corpus
2. Tokenizing the text
3. Comparing the accuracy of different approaches
4. Removing frequent terms (stop words)
5. Removing infrequent terms
6. Handling Unicode errors

## Part 1: Reading in the Yelp reviews corpus

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [12]:
# import pandas (used below to read yelp.csv into a DataFrame)
import pandas as pd

# mount google drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
# read yelp.csv from the mounted Google Drive into a DataFrame
path = '/content/drive/My Drive/BT4222/data/yelp.csv'
yelp = pd.read_csv(path)

# examine the first three rows
yelp.head(3)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 2012-06-14 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | review | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |


In [14]:
yelp.shape

(10000, 10)

In [15]:
# examine the text for the first row
yelp.loc[0, 'text']

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

**Goal:** Distinguish between 5-star and 1-star reviews using **only** the review text. (We will not be using the other columns.)

In [16]:
# examine the class distribution
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [0]:
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [18]:
# examine the shape
yelp_best_worst.shape

(4086, 10)

In [0]:
# define X and y
X = yelp_best_worst.text # yelp_best_worst['text']
y = yelp_best_worst.stars

In [0]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [21]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)

(3064,)
(1022,)
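
The shapes above follow from the split proportions. As a quick check, assuming `train_test_split` holds out 25% of the rows when `test_size` is not given and rounds the test count up (both assumptions are consistent with the output shown), the arithmetic is:

```python
import math

n_samples = 4086       # rows in yelp_best_worst
test_fraction = 0.25   # assumed default when test_size is not given

n_test = math.ceil(n_samples * test_fraction)   # 1021.5 rounds up
n_train = n_samples - n_test
print(n_train, n_test)
```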


In [22]:
X_train[0]

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

## Part 2: Tokenizing the text

- **What:** Separate text into units such as words, n-grams, or sentences
- **Why:** Gives structure to previously unstructured text
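- **Notes:** Relatively easy for English text, where whitespace and punctuation mark most word boundaries; harder for languages written without spaces between words (such as Chinese or Japanese)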
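
A minimal word tokenizer can be sketched with a regular expression. The pattern below is the default `token_pattern` that `CountVectorizer` reports; the `tokenize` helper itself is hypothetical and ignores the vectorizer's other options.

```python
import re

# same pattern as CountVectorizer's default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, lowercase=True):
    """Split text into word tokens, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Anyway, I can't wait to go back!")
print(tokens)
```

Note that single-character tokens are dropped, so "I" disappears and "can't" contributes only "can".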

In [0]:
# use CountVectorizer to create document-term matrices from X_train and X_test
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [0]:
# fit and transform X_train
X_train_dtm = vect.fit_transform(X_train)

In [0]:
# only transform X_test
X_test_dtm = vect.transform(X_test)

In [26]:
# examine the shapes: rows are documents, columns are terms (aka "tokens" or "features")
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(3064, 16825)
(1022, 16825)
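
The reason for fitting on the training data only can be seen in a small hand-rolled version. This is a sketch, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers, and the matrices are plain lists rather than sparse matrices.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    """Learn a sorted vocabulary and map each term to a column index."""
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Count vocabulary terms per document; unseen terms are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for t in tokenize(doc):
            if t in vocabulary:
                row[vocabulary[t]] += 1
        rows.append(row)
    return rows

train_docs = ["great food great service", "slow service"]
test_docs = ["great pizza"]   # "pizza" never appears in the training documents

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)
print(sorted(vocab))
print(train_dtm)
print(test_dtm)
```

Because the test documents are mapped onto the training vocabulary, both matrices have the same number of columns, and the unseen word "pizza" is simply dropped.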


In [27]:
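# examine the last 50 features
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
print(vect.get_feature_names()[-50:])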

['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']


In [28]:
# show default parameters for CountVectorizer
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- **lowercase:** boolean, True by default
    - Convert all characters to lowercase before tokenizing.

In [29]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 20838)

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.
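
The n-gram range can be illustrated with a short helper that works on an already-tokenized document. The `ngrams` function is a hypothetical sketch written for this example.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    result = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

tokens = ["zucchini", "fries", "were", "great"]
grams = ngrams(tokens, 1, 2)
print(grams)
```

Four tokens yield four 1-grams plus three 2-grams, which hints at why the feature count grows so quickly when 2-grams are included.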

In [30]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 169847)

In [31]:
# examine the last 50 features
# (use get_feature_names_out() in scikit-learn 1.2 and later)
print(vect.get_feature_names()[-50:])

['zone out', 'zone when', 'zones', 'zones dolls', 'zoning', 'zoning issues', 'zoo', 'zoo and', 'zoo is', 'zoo not', 'zoo the', 'zoo ve', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini bread', 'zucchini broccoli', 'zucchini carrots', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchini veal', 'zucchini very', 'zucchini with', 'zuchinni', 'zuchinni again', 'zuchinni the', 'zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zuzu', 'zuzu in', 'zuzu is', 'zuzu the', 'zwiebel', 'zwiebel kräuter', 'zzed', 'zzed in', 'éclairs', 'éclairs napoleons', 'école', 'école lenôtre', 'ém', 'ém all']


## Part 3: Comparing the accuracy of different approaches

**Approach 1:** Always predict the most frequent class

In [32]:
y_test.value_counts().head()

5    838
1    184
Name: stars, dtype: int64

In [33]:
y_test.shape

(1022,)

In [34]:
# calculate null accuracy
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64

In [35]:
from sklearn.dummy import DummyClassifier
from sklearn import metrics

dummy = DummyClassifier(strategy='most_frequent', random_state=0)

dummy.fit(X_train_dtm, y_train)

y_pred_class = dummy.predict(X_test_dtm)

metrics.accuracy_score(y_test, y_pred_class)

0.8199608610567515
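
The null accuracy is simply the share of the majority class, which can be checked without any model. The sketch below rebuilds the labels from the class counts shown above (838 five-star and 184 one-star reviews).

```python
from collections import Counter

# class counts taken from the y_test.value_counts() output above
y_test_labels = [5] * 838 + [1] * 184

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]
null_accuracy = majority_count / len(y_test_labels)
print(majority_class, round(null_accuracy, 6))
```

Any model worth keeping has to beat this baseline.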

**Approach 2:** Use the default parameters for CountVectorizer

In [0]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(X_train, y_train, X_test, y_test, vect):
    # create document-term matrices using the vectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    
    # print the number of features that were generated
    print('Features: ', X_train_dtm.shape[1])
    
    # use Multinomial Naive Bayes to predict the star rating
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    
    # print the accuracy on the training set
    print('Training Accuracy: ', metrics.accuracy_score(y_train, nb.predict(X_train_dtm)))
    # print the accuracy on the test set
    print('Test Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
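
To make the classifier less of a black box, here is a simplified sketch of multinomial Naive Bayes on token counts. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the data is a toy example, and the additive smoothing value `alpha=1.0` is chosen to mirror the documented default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class token log-probabilities."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        # total token count for the class, plus alpha for every vocabulary term
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {tok: math.log((token_counts[label][tok] + alpha) / total)
                          for tok in vocab}
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[tok] for tok in doc
                               if tok in log_likelihood)
    return max(model, key=score)

# toy, already-tokenized reviews (hypothetical data)
train_docs = [["great", "food"], ["amazing", "service"],
              ["terrible", "food"], ["rude", "service"]]
train_labels = [5, 5, 1, 1]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["great", "service"]))
print(predict_nb(model, ["terrible", "rude", "pizza"]))
```

Each token votes for the class in which it was seen more often, which is why the choice of tokens (the vectorizer settings) has such a direct effect on accuracy.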

In [37]:
# use the default parameters
vect = CountVectorizer()
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  16825
Training Accuracy:  0.972911227154047
Test Accuracy:  0.9187866927592955


**Approach 3:** Don't convert to lowercase

In [38]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  20838
Training Accuracy:  0.9768276762402088
Test Accuracy:  0.9099804305283757


**Approach 4:** Include 1-grams and 2-grams

In [39]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  169847
Training Accuracy:  0.993798955613577
Test Accuracy:  0.8542074363992173


**Summary:** Tuning CountVectorizer is a form of **feature engineering**, the process through which you create features that don't natively exist in the dataset. Your goal is to create features that contain the **signal** from the data (with respect to the response value), rather than the **noise**.

## Part 4: Removing frequent terms (stop words)

- **What:** Remove common words that appear in most documents
- **Why:** They probably don't tell you much about your text

In [40]:
# show vectorizer parameters
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.
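
The list-based behaviour can be sketched in a few lines. The stop word set below is a tiny hand-picked assumption (the built-in English list is much longer), and `remove_stop_words` is a hypothetical helper.

```python
# a tiny, hand-picked stop word list (an assumption for this example)
STOP_WORDS = {"the", "was", "and", "it", "is", "to"}

def remove_stop_words(tokens, stop_words=None):
    """Drop tokens found in stop_words; with None, keep everything."""
    if stop_words is None:
        return list(tokens)
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "food", "was", "tasty", "and", "delicious"]
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
print(remove_stop_words(tokens))
```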

In [41]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  16528
Training Accuracy:  0.9758485639686684
Test Accuracy:  0.9158512720156555


For comparison, the results with the default parameters (Approach 2) were:

Features:  16825
Training Accuracy:  0.972911227154047
Test Accuracy:  0.9187866927592955

In [42]:
# examine the stop words
print(sorted(vect.get_stop_words()))

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
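
The thresholding rule can be sketched as follows. This is an illustration under the definition above ("strictly higher than the threshold"), using hypothetical helpers and toy documents rather than scikit-learn's code.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count how many documents contain each term."""
    return Counter(term for doc in tokenized_docs for term in set(doc))

def apply_max_df(tokenized_docs, max_df):
    """Split terms into (kept, removed) using a max_df threshold."""
    n_docs = len(tokenized_docs)
    # a float is a proportion of documents, an int is an absolute count
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequency(tokenized_docs)
    kept = {t for t, count in df.items() if count <= limit}
    removed = set(df) - kept
    return kept, removed

docs = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "pizza", "is", "great"],
    ["the", "staff", "is", "rude"],
]
kept, removed = apply_max_df(docs, max_df=0.5)
print(sorted(removed))
```

With `max_df=0.5`, "was" survives because it appears in exactly half of the documents, not in more than half.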

In [43]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  16815
Training Accuracy:  0.9751958224543081
Test Accuracy:  0.9207436399217221


- **stop\_words\_:** Terms that were ignored because they either:
    - occurred in too many documents (max_df)
    - occurred in too few documents (min_df)
    - were cut off by feature selection (max_features)

In [44]:
# examine the terms that were removed due to max_df ("corpus-specific stop words")
print(vect.stop_words_)

{'of', 'to', 'it', 'in', 'and', 'for', 'is', 'this', 'my', 'the'}


In [45]:
# vect.stop_words_ is completely distinct from vect.get_stop_words()
print(vect.get_stop_words())

None


## Part 5: Removing infrequent terms

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
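
A sketch of the idea: rank terms by their total count across the corpus and keep the top few. The `top_terms` helper is hypothetical, and its alphabetical tie-breaking is a choice made for this example, not necessarily what scikit-learn does.

```python
from collections import Counter

def top_terms(tokenized_docs, max_features):
    """Keep the max_features terms with the highest total count in the corpus."""
    totals = Counter(term for doc in tokenized_docs for term in doc)
    # break ties alphabetically so the result is deterministic
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return sorted(ranked[:max_features])

docs = [
    ["great", "food", "great", "service"],
    ["slow", "service"],
    ["great", "pizza"],
]
vocabulary = top_terms(docs, max_features=2)
print(vocabulary)
```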

In [46]:
# only keep the top 1000 most frequent terms
vect = CountVectorizer(max_features=1000)
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  1000
Training Accuracy:  0.9275456919060052
Test Accuracy:  0.8923679060665362


- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
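
The effect on vocabulary size can be seen in a toy example. The helper below is a hypothetical sketch of the rule described above (keep terms whose document frequency is at least the threshold).

```python
from collections import Counter

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Return the sorted terms whose document frequency is at least min_df."""
    n_docs = len(tokenized_docs)
    # a float is a proportion of documents, an int is an absolute count
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, count in df.items() if count >= threshold)

docs = [
    ["great", "food", "zuchinni"],   # a misspelling that appears only once
    ["great", "service"],
    ["slow", "service"],
]
for min_df in (1, 2):
    print(min_df, vocabulary_with_min_df(docs, min_df))
```

Raising `min_df` from 1 to 2 removes every term that occurs in a single document, which often includes typos and rare names.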

In [47]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  8783
Training Accuracy:  0.966710182767624
Test Accuracy:  0.9246575342465754


In [48]:
# include 1-grams and 2-grams, and only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(X_train, y_train, X_test, y_test, vect)

Features:  43957
Training Accuracy:  0.9895561357702349
Test Accuracy:  0.9324853228962818


**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!

## Part 6: Handling Unicode errors

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#decoding-text-files):

> Text is made of **characters**, but files are made of **bytes**. These bytes represent characters according to some **encoding**. To work with text files in Python, their bytes must be decoded to a character set called **Unicode**. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

**Why should you care?**

When working with text in Python, you are likely to encounter errors related to encoding, and understanding Unicode will help you to troubleshoot these errors.

**Unicode basics:**

- Unicode is a system that assigns a unique number to every character in every language. These numbers are called **code points**. For example, the [code point](http://www.unicode.org/charts/index.html) for "A" is U+0041, and the official name is "LATIN CAPITAL LETTER A".
- An **encoding** specifies how to store the code points as bytes:
    - **UTF-8** is the most popular Unicode encoding. It uses 8 to 32 bits to store each character.
    - **UTF-16** is the second most popular Unicode encoding. It uses 16 or 32 bits to store each character.
    - **UTF-32** is the least popular Unicode encoding. It uses 32 bits to store each character.
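
These sizes can be checked directly in Python 3. The sketch below uses the little-endian codec names (`utf-16-le`, `utf-32-le`) so that no byte order mark is added to the counts; the choice of sample characters is arbitrary.

```python
import unicodedata

# A, Greek small alpha, euro sign, grinning face emoji
chars = ["A", "\u03b1", "\u20ac", "\U0001F600"]

sizes = {}
for ch in chars:
    code_point = f"U+{ord(ch):04X}"
    sizes[code_point] = {enc: len(ch.encode(enc))
                         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    # print the code point, official name, and bytes needed per encoding
    print(code_point, unicodedata.name(ch), sizes[code_point])
```

UTF-8 grows from 1 to 4 bytes as the code point increases, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4.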

**ASCII basics:**

- ASCII is an encoding from the 1960s that defines 128 characters using 7 bits each (usually stored one per 8-bit byte), and only supports **unaccented English characters**.
- ASCII-encoded files are sometimes called **plain text**.
- UTF-8 is **backward-compatible** with ASCII, because the 128 ASCII characters are encoded in UTF-8 as the same single bytes they have in ASCII.
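
The compatibility claim is easy to verify in Python 3. This is a small check, not an exhaustive test:

```python
text = "hello"

# for pure-ASCII text, the ASCII and UTF-8 encodings produce identical bytes
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)

# every ASCII byte value is below 128, so it fits in 7 bits
fits_in_7_bits = max(text.encode("ascii")) < 128
print(fits_in_7_bits)

# characters outside ASCII cannot be encoded as ASCII
try:
    "\u03b1".encode("ascii")   # Greek small letter alpha
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```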

In **Python 2**, the default source-code and string encoding is ASCII. In **Python 3**, it is UTF-8.

In [51]:
# Python 3: examine two types of strings
print(type(b'hello'))
print(type('hello'))

<class 'bytes'>
<class 'str'>


In [52]:
# Python 3: 'decode' converts 'bytes' to 'str'
b'hello'.decode(encoding='utf-8')

'hello'

In [53]:
# Python 3: 'encode' converts 'str' to 'bytes'
'hello'.encode(encoding='utf-8')

b'hello'

In [54]:
# Python 3: non-ASCII characters take more than one byte in UTF-8
a = "αά".encode('utf-8')
print(a)
print(type(a))
print(len(a))

b'\xce\xb1\xce\xac'
<class 'bytes'>
4


In [55]:
a.decode('utf-8')

'αά'

In [56]:
b'\xce\xb1'.decode('utf-8')

'α'

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#decoding-text-files):

> The text feature extractors in scikit-learn know how to **decode text files**, but only if you tell them what encoding the files are in. The CountVectorizer takes an **encoding parameter** for this purpose. For modern text files, the correct encoding is probably **UTF-8**, which is therefore the default (encoding="utf-8").

> If the text you are loading is not actually encoded with UTF-8, however, you will get a **UnicodeDecodeError**. The vectorizers can be told to be silent about decoding errors by setting the **decode_error parameter** to either "ignore" or "replace".
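
The same three error-handling modes exist on Python's own `bytes.decode`, so the behaviour can be explored without scikit-learn. The sketch below decodes Latin-1 bytes as if they were UTF-8; the sample word is arbitrary, and `ascii()` is used only so the output shows non-ASCII characters as escapes.

```python
# "café" stored as Latin-1: the "é" becomes the single byte 0xE9
raw = "caf\u00e9".encode("latin-1")

try:
    raw.decode("utf-8")              # default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD
correct = raw.decode("latin-1")                   # the right encoding recovers the text

print(strict_failed, ascii(ignored), ascii(replaced), ascii(correct))
```

Both "ignore" and "replace" avoid the exception but lose information, so specifying the correct encoding is preferable whenever it is known.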