<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Introduction to Natural Language Processing

## Icebreaker

Today we're talking about text data. With your neighbour, discuss some use cases for analysing/making predictions using text data.

- what problems could you solve with text analysis algorithms?
- what are some text-based **targets** you could predict with machine learning?

## Housekeeping

- Unit 3 project (two parts) due end of the week

- Final project due in **two weeks**!

- Project presentations on Tuesday 10th and Thursday 12th July

### Project Submission

#### Presentation

- ~10 minutes

- your target audience is **non-technical stakeholders**

- your goal is to set up your problem, summarise your findings, and **make recommendations** about both what decisions to make and what further work to undertake

- graphs & a description of your approach are encouraged, but think about how much technical detail you can/want to go into

#### Technical Report

- can be a clean version of your Jupyter notebook (with commentary)

- target audience is **other Data Scientists**

- your goal is to **explain/justify your approach** and detail any important **assumptions/decisions**

- aim for **reproducibility** i.e. could another Data Scientist pick up/reproduce/continue your work?

# Presentations in Jupyter

### Three-stage workflow

#### 1. Write your content in Markdown cells

#### 2. Designate the way each cell behaves in your presentation

#### 3. Export as slides and test-drive

### Demo

# Why Natural Language Processing?

### Uses of NLP

- **Chatbots:** Understand natural language from the user and return intelligent responses

- **Information retrieval:** Search!

- **Information extraction:** Structured information from unstructured documents, e.g. Google extracting events from emails

- **Machine translation**

- **Predictive text input**

- **Sentiment analysis**

- **Automatic summarisation:** Extractive or abstractive summarisation.

- **Speech recognition and generation:** Speech-to-text, text-to-speech

## Why is NLP hard?

### Ambiguity

- Hospitals Are Sued by 7 Foot Doctors

- Juvenile Court to Try Shooting Defendant

- Local High School Dropouts Cut in Half

### Non-standard English

- slang

- txt msg speak (LOL)

- newly coined words like "retweet"

### Idioms

- "throw in the towel"

### Tricky entity names

"Where is A Bug's Life playing?"

### Sarcasm

The Data Science/machine learning approach is **not** to model human language, but to **find patterns**

Today: what category does a "document" belong to based on the words in it?

Category can be anything - it's just a classification problem!

For our example we will use text from Yelp reviews to predict the star rating given by the user

# Text Classification

In [1]:
import pandas as pd

df = pd.read_csv("assets/data/yelp.csv.gz")
df.shape

(10000, 10)

In [2]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
reviews = df["text"].values
star_ratings = df["stars"].values
print(reviews[0])
print(star_ratings[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
5


### How do we turn this into a machine learning task?

Our target is obviously "number of stars"

But what are our features?

## Bag of Words

Our first approach is simple. Each review (or generically "document") is characterised by the presence of the words in it

We use all our documents to construct a "vocabulary" and create binary features for each word

Imagine two documents: "The cat sat on the mat" and "the dog sat on the log"

Our full vocabulary is `["The", "cat", "sat", "on", "the", "mat", "dog", "log"]`

How are our documents represented?

Our two documents become `[1, 1, 1, 1, 1, 1, 0, 0]` and `[0, 0, 1, 1, 1, 0, 1, 1]` (the second document contains "the" but not "The")

Now these are just binary features for a machine learning model!
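A minimal sketch of how those binary features could be built by hand, assuming plain whitespace splitting and no lowercasing (so "The" and "the" stay separate words). The variable names are illustrative only.

```python
docs = ["The cat sat on the mat", "the dog sat on the log"]

# Tokenise by whitespace only; nothing is lowercased yet.
tokenised = [doc.split() for doc in docs]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenised:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One binary feature per vocabulary word: 1 if the word is in the document.
vectors = [[1 if word in tokens else 0 for word in vocabulary]
           for tokens in tokenised]

print(vocabulary)
for vector in vectors:
    print(vector)
```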

# Preprocessing

What do we need to do with documents like this?

In [4]:
reviews[0]

'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

### Lowercasing

We usually want "The" and "the" to be the same word!

In [5]:
words = reviews[0].split()
words[:10]

['My',
 'wife',
 'took',
 'me',
 'here',
 'on',
 'my',
 'birthday',
 'for',
 'breakfast']

In [6]:
lower_words = [w.lower() for w in words]
lower_words[0:10]

['my',
 'wife',
 'took',
 'me',
 'here',
 'on',
 'my',
 'birthday',
 'for',
 'breakfast']

### Remove stopwords

We'll be using NLTK ([http://nltk.org](http://nltk.org))

In [7]:
from nltk.corpus import stopwords as nltk_stopwords

stopwords = nltk_stopwords.words('english')
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [8]:
print(len(set(lower_words)))
useful_words = [word for word in lower_words if word not in stopwords]
print(len(set(useful_words)))

106
76


### Stemming

- We want words like "having" and "have" to be the same

In [9]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmer.stem("having")

'have'

In [10]:
stemmed_words = [stemmer.stem(word) for word in useful_words]
print(useful_words[:10])
print(stemmed_words[:10])

['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sitting', 'outside']
['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sit', 'outsid']


### Lemmatisation

Like stemming, but it uses a dictionary (WordNet) to map each word to a real base form (its "lemma") instead of just chopping off endings

In [11]:
from nltk import WordNetLemmatizer

lem = WordNetLemmatizer()
lemmatised_words = [lem.lemmatize(word, 'v') for word in useful_words]
print(useful_words[:10])
print(stemmed_words[:10])
print(lemmatised_words[:10])

['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sitting', 'outside']
['wife', 'took', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'made', 'sit', 'outsid']
['wife', 'take', 'birthday', 'breakfast', 'excellent.', 'weather', 'perfect', 'make', 'sit', 'outside']


Downside: you need to tell it what type of word it is (verb, noun etc.)

You can use a "part-of-speech tagging" approach. More info here: [https://www.nltk.org/book/ch05.html](https://www.nltk.org/book/ch05.html)
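One possible way to connect a tagger to the lemmatiser is a small mapping function. This is a sketch, assuming the tagger returns Penn Treebank tags (as `nltk.pos_tag` does) and that anything unrecognised should fall back to noun. The function name `penn_to_wordnet` and the tagged tokens are made up for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatiser's own default

# Hand-written (word, tag) pairs in the format a tagger would return.
tagged = [("took", "VBD"), ("sitting", "VBG"), ("excellent", "JJ"), ("wife", "NN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])

# With NLTK's tagger and WordNet data downloaded, this could be used as:
#   tagged = nltk.pos_tag(useful_words)
#   [lem.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]
```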

### Machine Learning with Bag of Words

In [12]:
# apply stemming
df["text_stemmed"] = df["text"].apply(lambda x: " ".join([stemmer.stem(w) for w in x.split()]))
print(df["text"].values[0])
print(df["text_stemmed"].values[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
my wife took me here on my birthday for breakfast and it was excellent. the weather was perfect which made sit

Let's start with "1 star" vs. "5 star" reviews as a binary classification

In [13]:
from sklearn.model_selection import train_test_split

X = df.loc[df["stars"].isin([1, 5]), "text_stemmed"]
y = df.loc[df["stars"].isin([1, 5]), "stars"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
y.value_counts()

5    3337
1     749
Name: stars, dtype: int64

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True,
                      stop_words='english',
                      lowercase=True # default
                     )

X_train_text = vec.fit_transform(X_train)
X_test_text = vec.transform(X_test)

# vocabulary_ is a dict obtained from the data
print(len(vec.vocabulary_))
# look at some random features
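# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print(vec.get_feature_names()[1000:1010])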

14379
['astound', 'astounded', 'astounding', 'astrological', 'astronom', 'asu', 'atari', 'ate', 'atf', 'athelet']


We get lowercasing and "tokenisation" (splitting into tokens) for free with `CountVectorizer`.
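To see what the vectoriser does to a single document, you can ask it for its analyser function. A small sketch using the default settings and a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectoriser applies to each document:
# lowercasing, then tokenising (and stop word removal, if configured).
analyse = CountVectorizer().build_analyzer()
tokens = analyse("The cat sat on the mat. It's a BIG mat!")
print(tokens)
```

With the defaults, punctuation and single-character tokens are dropped, which is why a token like "excellent." does not end up in the vocabulary.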

In [15]:
print(X_train_text.shape)
type(X_train_text)

(2860, 14379)


scipy.sparse.csr.csr_matrix

A "sparse" matrix is one where most entries are zero. `scipy.sparse.csr.csr_matrix` stores only the non-zero values and their positions, which uses far less memory than a full array
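A rough pure-Python sketch of the idea behind the compressed sparse row (CSR) layout. The names `data`, `indices` and `indptr` mirror the attributes on scipy's CSR matrices; the tiny matrix is made up.

```python
dense = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)     # the non-zero value itself
            indices.append(col)    # which column it sits in
    indptr.append(len(data))       # where the next row starts in `data`

print(data)     # [1, 1, 1]
print(indices)  # [2, 0, 4]
print(indptr)   # [0, 1, 1, 3]
```

Only 3 values are stored instead of 15. For a 2860 x 14379 matrix of mostly zeros the saving is much larger.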

Look at random forest performance (we score with F1 rather than accuracy because the classes are imbalanced)

In [16]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()

scores = cross_val_score(rf, X_train_text, y_train, scoring="f1", cv=7)
print(scores, np.mean(scores))

[0.57391304 0.57627119 0.46956522 0.55462185 0.55462185 0.49090909
 0.43137255] 0.5216106835311333
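Why F1 and not accuracy? A small sketch with made-up labels. As far as I understand scikit-learn's defaults, `scoring="f1"` treats label `1` as the positive class, so here that is the 1-star reviews. The helper `f1_for_label` is illustrative, not part of any library.

```python
def f1_for_label(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: a "model" that predicts 5 stars for everything.
true_stars = [5, 5, 5, 5, 1]
always_five = [5, 5, 5, 5, 5]

accuracy = sum(t == p for t, p in zip(true_stars, always_five)) / len(true_stars)
print(accuracy)                               # 0.8
print(f1_for_label(true_stars, always_five))  # 0.0
```

Accuracy looks respectable even though the model never finds a single 1-star review. F1 exposes that.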


Try some more things!

`min_df` is the minimum number of documents a word must appear in to be kept in the vocabulary
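A minimal sketch of what that filter does, using three made-up documents and plain whitespace splitting (not the vectoriser's own tokeniser):

```python
from collections import Counter

docs = [
    "the food was great",
    "great service and great food",
    "the waiter was rude",
]
min_df = 2

# Document frequency: in how many documents does each word appear?
# set() makes sure a word repeated within one document is counted once.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

kept = sorted(word for word, count in doc_freq.items() if count >= min_df)
print(kept)  # ['food', 'great', 'the', 'was']
```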

In [17]:
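def try_new_vectoriser(vec, X, y):
    # fit the vectoriser on the documents passed in and report the vocabulary size
    X_text = vec.fit_transform(X)
    print(len(vec.vocabulary_))
    rf = RandomForestClassifier()
    # score against the y argument rather than the global y_train
    scores = cross_val_score(rf, X_text, y, scoring="f1", cv=7)
    print(scores, np.mean(scores))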

try_new_vectoriser(CountVectorizer(binary=True,
                                   stop_words='english',
                                   min_df=2
                                  ),
                   X_train,
                   y_train)

7251
[0.64566929 0.56       0.58461538 0.66666667 0.59375    0.55462185
 0.608     ] 0.6019033130514471


Instead of binary let's use actual counts

In [18]:
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2
                                  ),
                   X_train,
                   y_train)

7251
[0.4957265  0.65116279 0.59130435 0.62015504 0.58823529 0.63565891
 0.608     ] 0.5986061259794679


Limit to top 1000 most frequent words

In [19]:
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   max_features=1000
                                  ),
                   X_train,
                   y_train)

1000
[0.57777778 0.62666667 0.63888889 0.63636364 0.49612403 0.69014085
 0.5       ] 0.5951374065393062


Fit another Random Forest on the latest set of features (top 1000 word counts)

In [20]:
vectorizer_1000 = CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   max_features=1000)

X_train_text = vectorizer_1000.fit_transform(X_train)

rf = RandomForestClassifier()
rf.fit(X_train_text, y_train);

Function to get most important words (according to Random Forest)

In [21]:
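def get_feature_importances(vocabulary, rf_importances, top_n):
    # vocabulary maps word -> column index; sort by index to line up with the importances
    vocab_features = sorted(vocabulary.items(), key=lambda x: x[1])
    importances = zip(vocab_features, rf_importances)
    
    # print the top_n ((word, index), importance) pairs, largest importance first
    for z in sorted(importances, key=lambda x: abs(x[1]), reverse=True)[:top_n]:
        print(z)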

get_feature_importances(vectorizer_1000.vocabulary_, rf.feature_importances_, 10)

(('rude', 740), 0.026705862649869384)
(('great', 394), 0.022213958081078882)
(('worst', 983), 0.021575755583571932)
(('horrible', 438), 0.015605916794288011)
(('minut', 558), 0.014030570462479802)
(('love', 522), 0.013534719826647574)
(('told', 900), 0.012457455352125698)
(('horribl', 437), 0.010841151079898496)
(('poor', 664), 0.00875188552468973)
(('bare', 77), 0.00865837382446233)


## Moving beyond bag of words

What's wrong with the bag of words approach?

- doesn't preserve the order of words!

One solution is to use n-grams

#### N-grams

Split into groups of `n` consecutive words, e.g.

`"the cat sat on the mat"`

where `n`=2 becomes

`[('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]`
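A minimal sketch of how n-grams could be generated by hand from a list of tokens (the function name `ngrams` is just for illustration):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```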

In [22]:
cvec = CountVectorizer(ngram_range=(2, 2))
cvec.fit_transform(["the cat sat on the mat"])
cvec.vocabulary_

{'the cat': 3, 'cat sat': 0, 'sat on': 2, 'on the': 1, 'the mat': 4}

In [23]:
# 2-grams
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   max_features=1000,
                                   ngram_range=(2, 2)
                                  ),
                   X_train,
                   y_train)

1000
[0.46428571 0.40718563 0.425      0.51162791 0.37037037 0.36144578
 0.30487805] 0.4063990646126231


In [24]:
# 3-grams
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   max_features=1000,
                                   ngram_range=(3, 3)
                                  ),
                   X_train,
                   y_train)

1000
[0.29166667 0.20560748 0.2        0.2970297  0.23300971 0.26415094
 0.16842105] 0.2371265072911639


In these cases using **only** n-grams made the model worse - why?

Let's try mixing **both** words and n-grams

In [25]:
# single words and 2-grams together
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2,
                                   ngram_range=(1, 2)
                                  ),
                   X_train,
                   y_train)

20946
[0.55045872 0.59677419 0.61666667 0.49541284 0.50819672 0.60504202
 0.65517241] 0.5753890816799119


This is now a similar model, but we're not really gaining anything from adding n-grams.

With a lot more data, or a different minimum document frequency (`min_df`), they might be more useful.

### Moving beyond counting

So far we've counted the presence, then the number of occurrences of words/n-grams

What are some problems with this approach?

- counts of common words will be high in all documents = **BAD**

- instead, we want to capture something about the **relative** occurrence of words in documents

## TF-IDF

We're really interested in two things:

- term frequency (TF) = count of a word in a document

- better: the **proportion** of that word in a document

$$ TF_{w, D} = \frac{f_{w, D}}{|{D}|} $$

$TF_{w, D}$: Term Frequency of word $w$ in document $D$

$f_{w, D}$: frequency (count) of occurrences of word $w$ in document $D$

$|D|$: the size of document $D$ (number of words)

"The cat sat on the mat": what is the TF of each word?

- 1/6 for all words except "the" which is 1/3
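The same calculation as a short sketch, assuming we lowercase and split on whitespace first:

```python
from collections import Counter

def term_frequencies(document):
    """Proportion of the document taken up by each word."""
    words = document.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

tf = term_frequencies("The cat sat on the mat")
print(tf)
```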

#### "Inverse Document Frequency"

- tells us how rare a word is

- even if we remove stopwords, some words will appear in many more documents than others

- rare words are often more informative

- words like "ball" might appear in articles about different sports, but "offside" would be more useful to find football-related articles

$$ IDF_{w} = \log\left(\frac{N}{DF_{w}}\right) $$

$IDF_{w}$: the inverse document frequency of word $w$

$N$: the total number of documents

$DF_{w}$: number of documents that word $w$ appears in

#### Combining the two

- TF-IDF is just $TF_{w,D}\times IDF_{w}$

- TF is measured **per word per document** whereas IDF is **a single value per word**

- It's best to think of TF-IDF as a value that measures the importance of a word in a document
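A minimal sketch of the whole calculation on two toy documents, taking TF as the proportion of the document and IDF as $\log(N / DF_{w})$. Note that `TfidfVectorizer` uses a smoothed IDF and normalises each row by default, so its numbers will not match these exactly.

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# DF_w: number of documents containing word w (set() counts each word once per document).
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
idf = {word: math.log(n_docs / count) for word, count in doc_freq.items()}

def tf_idf(tokens):
    """TF-IDF for every word in one tokenised document."""
    counts = Counter(tokens)
    return {word: (count / len(tokens)) * idf[word] for word, count in counts.items()}

for tokens in tokenised:
    print(tf_idf(tokens))
```

Words that appear in both documents ("the", "sat", "on") score zero, while the words that distinguish the documents ("cat", "mat", "dog", "log") get a positive score.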

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer(stop_words="english",
                            min_df=2,
                            max_features=1000)

try_new_vectoriser(tfidf_vec,
                   X_train,
                   y_train)

1000
[0.5785124  0.62992126 0.69767442 0.546875   0.50877193 0.62937063
 0.5840708 ] 0.5964566329709646


In [27]:
rf = RandomForestClassifier()
rf.fit(tfidf_vec.fit_transform(X_train), y_train)

get_feature_importances(tfidf_vec.vocabulary_, rf.feature_importances_, 10)

(('rude', 740), 0.029649745257964565)
(('great', 394), 0.022120506675726216)
(('poor', 664), 0.019361399331322183)
(('worst', 983), 0.01890557830023056)
(('manag', 531), 0.01438489495058681)
(('bad', 71), 0.01308535748992586)
(('money', 565), 0.012676370301556952)
(('told', 900), 0.01247111328174572)
(('minut', 558), 0.011509296374896016)
(('avoid', 64), 0.010764451524851759)


### Sentiment Analysis

Based on what we've seen, how would we approach **sentiment analysis**?

# Other NLP Approaches & Resources

What if we didn't know what we were looking for in our text?

*What is this kind of learning vs. the document classification approach?*

There are **unsupervised** NLP approaches available.

#### Topic modelling

- finds "topics" which are collections of words that appear together

- one example is **Latent Dirichlet Allocation (LDA)** (resources linked in README.md)

- available in `scikit-learn` but also look at the [`gensim`](https://radimrehurek.com/gensim/) library
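A rough sketch of what LDA looks like in `scikit-learn`, on a made-up corpus that is far too small to give meaningful topics. The choice of 2 topics and the documents themselves are assumptions for illustration only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two documents about football, two about food.
docs = [
    "the striker scored a goal from an offside position",
    "the goalkeeper saved the penalty and the match ended",
    "the pasta was delicious and the dessert was amazing",
    "great pizza, friendly waiter and tasty dessert",
]

counts = CountVectorizer(stop_words="english").fit(docs)
X_counts = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X_counts)  # one row per document, one column per topic

# Invert the vocabulary dict so column indices map back to words.
index_to_word = {index: word for word, index in counts.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(topic_id, [index_to_word[int(i)] for i in top])
```

Each row of `doc_topics` is that document's mix of topics. There is no target column anywhere, which is what makes this unsupervised.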

#### Word Embeddings

- finds "embeddings" which are compact (low-dimensional) numerical vector representations of words/phrases

- words close in the "embedding space" are also semantically similar

- look at an approach called word2vec (also [available in gensim](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code))
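"Close in the embedding space" is usually measured with cosine similarity. A small sketch with invented 3-dimensional vectors (real embeddings are learned from data and typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up purely for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_pizza = cosine_similarity(embeddings["cat"], embeddings["pizza"])
print(cat_dog, cat_pizza)
```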

## Exercise

- Start with notebook 01 to install `nltk`!