# Machine Learning with Python

In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.datasets import load_files

reviews_train = load_files("imdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

reviews_test = load_files("imdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

# 3.2 Sentiment Analysis

### The task

Our first machine learning task will be *binary classification*: building a predictor for the class label (positive or negative review) from the text. This is a form of [*sentiment analysis*](https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17).

Unlike our previous examples using structured data, we do not yet have any features that can be fed to a learning algorithm. Finding a useful data representation is therefore a key component of NLP.

### Bag of Words

A very simple but usually quite effective approach is the so-called *bag of words*.

Here, we discard the information contained in the document structure and the order of the words in each sentence, and just represent each document as a frequency table showing how often each word occurs therein.

The three stages are

* *Tokenization* - break each document into a list of words.
* *Vocabulary building* - collect all the words found in the corpus and sort them in alphabetical order.
* *Encoding* - for each document, create a frequency table over the vocabulary.
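
The three stages above can be sketched by hand in a few lines. This is only an illustration on two made-up documents, not how scikit-learn is implemented; the helper names `tokenize` and `encode` are hypothetical, and the regular expression is chosen to mimic the default token pattern of `CountVectorizer` (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

def tokenize(doc):
    # Tokenization: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

toy_docs = ["The cat sat on the mat.", "The dog sat."]
toy_tokens = [tokenize(doc) for doc in toy_docs]

# Vocabulary building: every distinct token, in alphabetical order
toy_vocabulary = sorted({token for doc in toy_tokens for token in doc})

def encode(doc_tokens):
    # Encoding: one count per vocabulary entry, in vocabulary order
    counts = Counter(doc_tokens)
    return [counts[word] for word in toy_vocabulary]

toy_bow = [encode(doc) for doc in toy_tokens]

print(toy_vocabulary)
print(toy_bow)
```

Each row of `toy_bow` is the frequency table for one document, with one column per vocabulary entry.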

Here is a simple example on two short documents:

In [None]:
bards_words = ["The fool doth think he is wise,",
              "but the wise man knows himself to be a fool"]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

In [None]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

The vectorizer converts all text to lowercase by default (`lowercase=True`), so "The" and "the" are counted as the same word.

In [None]:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))

`transform` returns a SciPy sparse matrix, which stores only the nonzero entries. This format is much more memory-efficient for data where we expect many zeros.
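
To get a feel for the saving, the sketch below compares the storage of a dense array with that of a CSR sparse matrix. The matrix is a made-up stand-in for a document-term matrix (200 documents, 1000 terms, three nonzero counts), and summing the three underlying CSR arrays is just one reasonable way to estimate the sparse footprint.

```python
import numpy as np
from scipy import sparse

# Hypothetical document-term matrix with only three nonzero counts
dense_counts = np.zeros((200, 1000), dtype=np.int64)
dense_counts[0, 3] = 2
dense_counts[0, 9] = 1
dense_counts[17, 512] = 4

sparse_counts = sparse.csr_matrix(dense_counts)

dense_bytes = dense_counts.nbytes
# A CSR matrix keeps three arrays: the values, their column indices, and row pointers
sparse_bytes = (sparse_counts.data.nbytes
                + sparse_counts.indices.nbytes
                + sparse_counts.indptr.nbytes)

print("Stored nonzero entries: {}".format(sparse_counts.nnz))
print("Dense storage (bytes): {}".format(dense_bytes))
print("Sparse storage (bytes): {}".format(sparse_bytes))
```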

For inspection, we can use `toarray()` to change this back into a "dense" numpy array:

In [None]:
print("Dense representation of bag_of_words:\n{}".format(
    bag_of_words.toarray()))

Notice that the words with index 3 ("fool"), 9 ("the") and 12 ("wise") appear in both documents.

Let's try the same approach with the IMDB data:

In [None]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vect.get_feature_names_out()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Notice that the definition of "word" is currently very basic: any run of two or more alphanumeric characters, delimited by white space or punctuation. Numbers, alternative word forms and typos therefore all appear as separate features.
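
One way to see exactly what counts as a "word" is to run the vectorizer's own analyzer on a sample string. The sketch below uses `build_analyzer()` with the default settings; the sample sentence is made up for illustration and contains a rating, a deliberate typo and a contraction.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer applies lowercasing and the default token pattern
analyzer = CountVectorizer().build_analyzer()

sample_text = "Great movie, 10/10!! Greaat acting, isn't it?"
sample_tokens = analyzer(sample_text)

print(sample_tokens)
```

The typo "greaat" survives as its own token, the rating contributes the token "10" twice, and the contraction is split at the apostrophe, with the single-character remainder "t" discarded.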

We can now try a classifier, using logistic regression on these features:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Not bad for such a simple model!

One simple improvement is to restrict the allowed features to words that appear in at least a minimum number of documents in the corpus. This will help to weed out typos and other uninformative text, and also reduce the size of the feature space, which is helpful in itself. Here we will set `min_df=5`, so that a word must appear in at least five documents to be kept.

In [None]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vect.get_feature_names_out()

print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

The cross-validation score is essentially unchanged, but the number of features has been reduced to roughly a third of the original number.
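
The pruning effect of `min_df` is easy to inspect on a toy corpus. In this sketch the three documents are made up (including the deliberate typo "actng"), and `min_df=2` plays the role that `min_df=5` plays on the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

small_corpus = ["good film", "good plot", "good actng"]  # "actng" is a deliberate typo

full_vect = CountVectorizer().fit(small_corpus)
pruned_vect = CountVectorizer(min_df=2).fit(small_corpus)

full_vocab = sorted(full_vect.vocabulary_)
pruned_vocab = sorted(pruned_vect.vocabulary_)

print("Without min_df: {}".format(full_vocab))
print("With min_df=2:  {}".format(pruned_vocab))
```

Only "good" occurs in at least two documents, so it is the only feature that survives. On the IMDB features the same thresholding applies, just at `min_df=5`.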

On the test dataset:

In [None]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)

X_test = vect.transform(text_test)

print("Test score: {:.2f}".format(lr.score(X_test, y_test)))

### Stopwords

Sometimes we have a list of words that we do not want to use as features, for example because they do not add any information. We can eliminate these using the `stop_words` argument.
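
As a small sketch of passing our own list, the built-in English stop words can be extended with domain-specific terms. The extra words "movie" and "film" and the two documents are made up for illustration. The list is converted with `list(...)` because recent scikit-learn versions expect `stop_words` to be a list, the string `"english"`, or `None`.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Built-in English list plus two hypothetical corpus-specific words
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"movie", "film"}))

stop_docs = ["the movie was superb", "a tedious film"]
stop_vect = CountVectorizer(stop_words=custom_stop_words).fit(stop_docs)
stop_vocab = sorted(stop_vect.vocabulary_)

print(stop_vocab)
```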

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

In [None]:
# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

In [None]:
scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Using this fixed list has not improved our model. However, if the dataset were much smaller, we might find that removing stopwords helps to focus the model on more informative words.

A simple corpus-specific option is to use the `max_df` argument to eliminate words that appear in a very large share of the documents.
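
A minimal sketch of `max_df` on a made-up three-document corpus follows. A float value is interpreted as a proportion of documents, so `max_df=0.7` drops any word that appears in more than 70% of them; the threshold is chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

freq_docs = ["the movie was great",
             "the movie was awful",
             "the plot was weak"]

all_terms = set(CountVectorizer().fit(freq_docs).vocabulary_)
kept_terms = set(CountVectorizer(max_df=0.7).fit(freq_docs).vocabulary_)

# Words removed because they occur in more than 70% of the documents
dropped_terms = all_terms - kept_terms

print("Kept:    {}".format(sorted(kept_terms)))
print("Dropped: {}".format(sorted(dropped_terms)))
```

Here "the" and "was" appear in every document and are removed, while "movie" (two of three documents) stays below the threshold and is kept.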

### Rescaling with *tf-idf*

We can try to use the corpus itself to determine feature importance. One common approach is *term frequency-inverse document frequency* (tf-idf).

tf-idf for a word *w* in a document *d* is given by

\begin{equation*}
\text{tfidf}(w, d) = \text{tf} \cdot \left( \log\left(\frac{N + 1}{N_w + 1}\right) + 1 \right)
\end{equation*}

where

*tf* is the number of times the word *w* appears in document *d*.

*N* is the total number of documents in the training set.

$N_w$ is the number of documents in the training set that contain *w*.

This is the smoothed form that scikit-learn's `TfidfVectorizer` uses by default, with the natural logarithm.
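
As a sanity check, the formula can be evaluated by hand and compared with scikit-learn. The sketch assumes the smoothed form tf · (log((N + 1) / (N_w + 1)) + 1) with the natural logarithm, which is what `TfidfVectorizer` computes by default when `norm=None`. The three documents and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_docs = ["good good film", "bad film", "good plot"]

# Raw term frequencies, one row per document
tf_counts = CountVectorizer().fit_transform(tfidf_docs).toarray()

n_docs = len(tfidf_docs)
docs_with_word = (tf_counts > 0).sum(axis=0)  # N_w for every vocabulary entry

# Smoothed inverse document frequency, then tf * idf
manual_idf = np.log((n_docs + 1) / (docs_with_word + 1)) + 1
manual_tfidf = tf_counts * manual_idf

sklearn_tfidf = TfidfVectorizer(norm=None).fit_transform(tfidf_docs).toarray()

print("Manual and scikit-learn values agree: {}".format(
    np.allclose(manual_tfidf, sklearn_tfidf)))
```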

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=5, norm=None).fit(text_train)
X_train = vect.transform(text_train)

scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

There is no improvement in performance, but organising the features in this way can make the model more interpretable:

In [None]:
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
# get feature names
feature_names = np.array(vect.get_feature_names_out())

print("Features with lowest tfidf:\n{}".format(
      feature_names[sorted_by_tfidf[:20]]))
print()
print("Features with highest tfidf: \n{}".format(
      feature_names[sorted_by_tfidf[-20:]]))

Features with low inverse document frequency are the most commonly occurring words, which presumably have low information content:

In [None]:
sorted_by_idf = np.argsort(vect.idf_)
print("Features with lowest idf:\n{}".format(
       feature_names[sorted_by_idf[:100]]))

### Investigating the model

Because the features in *bag of words* are just natural language terms, the resulting models can be highly interpretable.
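
To see the idea in miniature, the sketch below fits a logistic regression on six made-up two-word reviews and pairs each coefficient with its word. All names prefixed with `toy_` are hypothetical and chosen so that they do not overwrite the IMDB variables used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_docs = ["great film", "great acting", "great story",
            "awful film", "awful acting", "awful story"]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive review, 0 = negative review

toy_vect = CountVectorizer().fit(toy_docs)
toy_X = toy_vect.transform(toy_docs)
toy_lr = LogisticRegression().fit(toy_X, toy_labels)

# Recover the feature names in column order from the vocabulary mapping
toy_names = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
toy_coefs = pd.Series(toy_lr.coef_[0], index=toy_names).sort_values(ascending=False)

print(toy_coefs)
```

Words that occur only in positive reviews receive positive coefficients and words that occur only in negative reviews receive negative ones, while words shared equally by both classes stay close to zero.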


In [None]:
lr.coef_[0]

In [None]:
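# lr was fitted on the CountVectorizer(min_df=5) features; the TfidfVectorizer
# (min_df=5) now held in vect learns the same vocabulary, so the names line up.
coefs = pd.Series(lr.coef_[0], vect.get_feature_names_out())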
coefs = coefs.sort_values(ascending=False)

In [None]:
coefs.head(10)

In [None]:
coefs.tail(10)

### n-Grams

One major problem with *bag of words* is that the meaning contained in word order is completely lost. Compare the sentiment of

* She was happy not to be going to the party.
* She was not happy to be going to the party.

If we extend our features to encompass two or more consecutive tokens, then we will have a chance to extract at least some meaning from word order. This may be very important for languages whose word boundaries are not so easily recognised as in English. 

This approach is called *n-grams*. We use a "sliding window" of a specified number of tokens.
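
The sliding window can be sketched in plain Python. The helper `ngrams` is a hypothetical name, and the whitespace tokenization is a simplification of what the vectorizer does. Applied to the two example sentences, it shows that their single-word counts are identical while their two-word features differ.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n consecutive tokens across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "she was happy not to be going to the party".split()
second = "she was not happy to be going to the party".split()

same_unigram_counts = Counter(first) == Counter(second)
first_bigrams = ngrams(first, 2)
second_bigrams = ngrams(second, 2)

print("Identical bag of words: {}".format(same_unigram_counts))
print("Bigrams only in the first sentence:  {}".format(
    sorted(set(first_bigrams) - set(second_bigrams))))
print("Bigrams only in the second sentence: {}".format(
    sorted(set(second_bigrams) - set(first_bigrams))))
```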


`n=1` is just the same as *bag of words*:

In [None]:
print("bards_words:\n{}".format(bards_words))

In [None]:
cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

With *n* fixed at 2, we get different features:

In [None]:
cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

In [None]:
print("Transformed data (dense):\n{}".format(cv.transform(bards_words).toarray()))

Or we can specify a range for *n*:

In [None]:
cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

Notice how increasing the value of *n* will cause the number of features to explode rapidly.

Applying this to the IMDB data:

In [None]:
vect = CountVectorizer(min_df=5, ngram_range=(2, 2)).fit(text_train)
print("Vocabulary size: {}".format(len(vect.vocabulary_)))

In [None]:
X_train = vect.transform(text_train)

scores = cross_val_score(LogisticRegression(max_iter=100000), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

In [None]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train, y_train)

coefs = pd.Series(lr.coef_[0], vect.get_feature_names_out())
coefs = coefs.sort_values(ascending=False)

In [None]:
coefs

### Exercise

Try applying *bag of words* to the Spanish language paper reviews. 

Can you fit a linear regression to predict the `evaluation` score?

Apply some of the variations discussed above to see if you can improve cross-validation performance.