# Sentiment Analysis P1

In this notebook, you will learn to:
- use different Python packages to build a complete pipeline for solving a sentiment analysis problem.

- use the simplest model, Multinomial NB.

## Pipeline

<img src="resources/pipeline.png">

## Get familiar with the dataset

In [None]:
import pandas as pd
from nlp_proj_utils import get_imdb_dataset

pd.set_option('max_colwidth', 500)  # Set display column width to show more content

In [None]:
# Load dataset, download if necessary
train, test = get_imdb_dataset()

In [None]:
# Get a random sample of rows from the data frame
train.sample(5)

In [None]:
print('train shape:', train.shape)
print('test  shape:', test.shape)

In [None]:
# Statistics on the sentiment labels
train['sentiment'].value_counts()

See [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/10min.html?highlight=data%20frame) for more details.

## Preprocessing

### Tokenization and Normalization

For preprocessing, we will apply the following steps:

1. Remove HTML tags (`<br />` in this case) from the review text
2. Remove punctuation (replace with whitespace)
3. Split review text into tokens
4. Remove tokens that are considered "**stopwords**"
5. Lemmatize the remaining tokens

In [None]:
import nltk
import string

In [None]:
# If this is your first time using nltk,
# make sure to download the necessary resources and pre-trained models
nltk.download()


In [None]:
transtbl = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
stopwords = nltk.corpus.stopwords.words('english')
lemmatizer = nltk.WordNetLemmatizer()

In [None]:
stopwords[:10]

In [None]:
string.punctuation

In [None]:
'ababc'.translate(str.maketrans('abc','def'))

In [None]:
def preprocessing(line: str) -> str:
    """
    Take a text input and return the preprocessed string.
    i.e.: preprocessed tokens concatenated by whitespace
    """
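    # Replace HTML line breaks and punctuation with spaces so neighboring words stay separated
    line = line.replace('<br />', ' ').translate(transtbl)
    
    # Lowercase each token, drop stopwords, and lemmatize the rest as verbs
    tokens = [lemmatizer.lemmatize(t.lower(), 'v')
              for t in nltk.word_tokenize(line)
              if t.lower() not in stopwords]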
    
    return ' '.join(tokens)

In [None]:
test_str = "I bought several books yesterday<br /> and I really love them!"
preprocessing(test_str)

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
# If you're using macOS or Linux, you may un-comment the following code to speed up the preprocessing

# !pip install pandarallel
# from pandarallel import pandarallel
# pandarallel.initialize(progress_bar=True)

In [None]:
# If you're using Windows, run the following loop; otherwise, comment it out and run the second loop instead
for df in train, test:
    df['text_prep'] = df['text'].progress_apply(preprocessing)
    
# If you're using macOS or Linux, un-comment and run the following code
# for df in train, test:
#     df['text_prep'] = df['text'].parallel_apply(preprocessing)

In [None]:
assert train.shape == (25000, 3)
assert test.shape == (25000, 3)

In [None]:
train.sample(2)

### Build Vocabulary

Instead of using `CountVectorizer` (N-gram) provided by sklearn directly, we will build the vocabulary on our own, so that we have more control over it.

<span style="color:red">**Tips:**</span>

Build the vocabulary from the training data only; words that appear only in the test data must not influence it.
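
A minimal sketch of what this restriction means in practice: a word that occurs only in test data has no slot in the vocabulary, so it is simply ignored at vectorization time. The toy corpora and the helper names `build_vocab` and `to_counts` are hypothetical and used only for illustration.

```python
from collections import Counter

def build_vocab(train_texts, top_n):
    """Return the top_n most frequent words seen in the training texts."""
    counts = Counter(w for text in train_texts for w in text.split())
    return [word for word, _ in counts.most_common(top_n)]

def to_counts(text, vocab):
    """Count vocabulary words in text; out-of-vocabulary words are ignored."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

toy_train = ['good movie good plot', 'bad movie']
toy_test = ['good soundtrack']  # 'soundtrack' never appears in training

vocab = build_vocab(toy_train, top_n=3)
test_vector = to_counts(toy_test[0], vocab)
print(vocab, test_vector)
```

Only `good` contributes to `test_vector`; `soundtrack` is dropped because it was never seen during training.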

In [None]:
all_words = [w for text in tqdm(train['text_prep']) 
             for w in text.split()]

In [None]:
# Use FreqDist to get count for each word
voca = nltk.FreqDist(all_words)
print(voca)

In [None]:
voca.most_common(10)

In [None]:
topwords = [word for word, _ in voca.most_common(10000)]

### Vectorizer

For this section, we will try two ways to do vectorization: **BoW** (1-gram) and **BoW with Tfidf Transformer**.

In [None]:
from sklearn.feature_extraction.text import (
    CountVectorizer, 
    TfidfTransformer, 
    TfidfVectorizer,)

In [None]:
CountVectorizer()

### Tf–idf Transformer

- Tf: Term-Frequency
- idf: Inverse Document-Frequency
- Tf-idf = $tf(t,d) \times idf(t)$

$$
idf(t) = \log{\frac{1 + n_d}{1 + df(d, t)}} + 1
$$

![](http://www.onemathematicalcat.org/Math/Algebra_II_obj/Graphics/log_base_gt1.gif)

> Sentence 1: The boy **love** the toy <br>
> Sentence 2: The boy **hate** the toy
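
To make the formula concrete, the sketch below computes the smoothed idf by hand for these two sentences, assuming the natural logarithm and treating each sentence as one document. The helper name `smoothed_idf` is hypothetical.

```python
import math

docs = ['the boy love the toy', 'the boy hate the toy']
tokenized = [d.split() for d in docs]
n_d = len(tokenized)  # number of documents

def smoothed_idf(term):
    """idf(t) = ln((1 + n_d) / (1 + df(t))) + 1"""
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    return math.log((1 + n_d) / (1 + df)) + 1

all_terms = sorted({w for doc in tokenized for w in doc})
idf = {t: smoothed_idf(t) for t in all_terms}

# Unnormalized tf-idf for sentence 1: term count times idf
tfidf_s1 = {t: tokenized[0].count(t) * idf[t] for t in set(tokenized[0])}

for t in all_terms:
    print(t, round(idf[t], 3))
```

Words shared by both sentences (`the`, `boy`, `toy`) get the minimum idf of 1, while `love` and `hate` get about 1.405, so the words that distinguish the two sentences are weighted up relative to their raw counts.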

In [None]:
transformer = TfidfTransformer(smooth_idf=False)
transformer

In [None]:
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf

In [None]:
tfidf.toarray()
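
The matrix above can be reproduced by hand. The sketch below assumes what scikit-learn documents for `TfidfTransformer(smooth_idf=False)`: idf(t) = ln(n_d / df(t)) + 1, followed by L2 normalization of each row. The variable and helper names are illustrative.

```python
import math

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_d = len(counts)
n_terms = len(counts[0])

# Document frequency: number of rows in which each term has a non-zero count
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Non-smoothed idf: ln(n_d / df) + 1
idf = [math.log(n_d / d) + 1 for d in df]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

manual_tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)])
                for row in counts]

print([round(v, 4) for v in manual_tfidf[0]])  # [0.8194, 0.0, 0.5732]
```

The first term appears in every document, so its idf is exactly 1; the rarer second and third terms receive larger weights before normalization.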

<span style="color:red">**Tips:**</span>

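sklearn computes tf-idf slightly differently from the standard textbook definition, which is:

$$
idf(t) = \log{\frac{n_d}{1 + df(d, t)}}
$$

With `smooth_idf=False`, sklearn uses:

$$
idf(t) = \log{\frac{n_d}{df(d, t)}} + 1
$$

With `smooth_idf=True` (the default), one is added to both the numerator and the denominator, as if an extra document contained every term exactly once. This gives $\log{\frac{1 + n_d}{1 + df(d, t)}} + 1$ and prevents division by zero.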


It's always worth trying the tf-idf transformer for text classification problems. Since `CountVectorizer` and `TfidfTransformer` are often chained together, sklearn also provides a class that combines the two steps: `TfidfVectorizer`.

In [None]:
TfidfVectorizer()

Let's take the sentences from the slide as an example:

In [None]:
t_corpus = ['the boy love the toy', 'the boy hate the toy']

In [None]:
# Bag of words
# Voc = ['boy', 'hate', 'love', 'the', 'toy']

t_cnt_vec = CountVectorizer()
t_cnt_vec.fit(t_corpus)  # fit on whole sentences (documents), not on individual words
t_cnt_vec.transform(t_corpus).toarray()

In [None]:
# Tfidf
t_tfidf_vec = TfidfVectorizer()
t_tfidf_vec.fit(t_corpus)  # fit on whole sentences so document frequencies are counted per sentence
t_tfidf_vec.transform(t_corpus).toarray()

### Vectorization / Featurization

In [None]:
train_x, train_y = train['text_prep'], train['sentiment']
test_x, test_y = test['text_prep'], test['sentiment']

In [None]:
# Use topwords as vocabulary
tf_vec = TfidfVectorizer(vocabulary=topwords)

In [None]:
train_features = tf_vec.fit_transform(train_x)
test_features = tf_vec.transform(test_x)

In [None]:
assert train_features.shape == (25000, 10000)
assert test_features.shape == (25000, 10000)

In [None]:
# First 50 feature values of the first review
train_features[0, :50].toarray()

## Training

### [Multinomial NB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

The multinomial Naive Bayes classifier is suitable for **classification with discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
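
To show what the classifier computes, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing on a toy count matrix. It is not scikit-learn's implementation; the function names `fit_mnb` and `predict_mnb` and the toy data are hypothetical.

```python
import math

def fit_mnb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict_mnb(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    classes, log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))

    return max(classes, key=score)

# Toy vocabulary: ['good', 'bad', 'movie']; label 1 = positive, 0 = negative
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
y = [1, 1, 0, 0]

model = fit_mnb(X, y)
pred_good = predict_mnb(model, [1, 0, 1])  # 'good' once, 'movie' once
pred_bad = predict_mnb(model, [0, 3, 0])   # 'bad' three times
print(pred_good, pred_bad)
```

In `score`, a feature value only multiplies a log likelihood, so nothing in the arithmetic requires integers. That is why fractional tf-idf weights can be used as inputs as well.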

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
mnb_model = MultinomialNB()
mnb_model

In [None]:
%%time

# Train Model
mnb_model.fit(train_features, train_y)

## Evaluation

In [None]:
from sklearn import metrics

In [None]:
# Predict on test set
pred = mnb_model.predict(test_features)
print(pred)

In [None]:
print('Accuracy: %f' % metrics.accuracy_score(pred,test_y))

<span style="color:red">**Tips:**</span>

It doesn't matter if you change the order of `pred` and `test_y` passed into `accuracy_score`, since the metric is symmetric. **However**, it is extremely important that you pass them in the correct order when you need to calculate per-class metrics like f-score.
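
A small sketch of this point in plain Python: accuracy is unchanged when the two arguments are swapped, while precision for a class is not. The labels and the helper names `accuracy` and `precision` are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Of the items predicted as `positive`, the fraction that truly are."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 0, 0]

print(accuracy(toy_true, toy_pred), accuracy(toy_pred, toy_true))
print(precision(toy_true, toy_pred), precision(toy_pred, toy_true))
```

Accuracy is 0.6 in both orders. Precision is 1.0 in the correct order but about 0.33 when the arguments are swapped, because swapping turns precision into recall.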

In [None]:
# Pass in as keyword arguments to make sure the order is correct
print(
    metrics.classification_report(y_true=test_y, y_pred=pred))

In [None]:
# Example from sklearn documentation

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(metrics.classification_report(y_true, y_pred, target_names=target_names))

## Predict new text

In [None]:
def predict_new(prep_func,  # func for preprocessing
                vec,        # vectorizer
                model,      # model
                text):      # text
    
    prep_text = prep_func(text)
    features = vec.transform([prep_text])
    pred = model.predict(features)
    return pred[0]

In [None]:
from functools import partial

predict_new_p1 = partial(predict_new, preprocessing, tf_vec, mnb_model)

In [None]:
predict_new_p1('It looks nice')

## Tuning hyperparameters

In [None]:
def train_with_n_topwords(n: int, tfidf=False) -> tuple:
    """
    Train a model and measure its test accuracy under different settings
    Args:
        n: number of features (the n most frequent words in the vocabulary)
        tfidf: whether to apply tf-idf re-weighting
    Returns:
        tuple: (accuracy score, classifier, vectorizer)
    """
    topwords = [word for word, _ in voca.most_common(n)]
    
    if tfidf:
        vec = TfidfVectorizer(vocabulary=topwords)
    else:
        vec = CountVectorizer(vocabulary=topwords)
    
    # Generate feature vectors
    train_features = vec.fit_transform(train_x)
    test_features  = vec.transform(test_x)
    
    # NB
    mnb_model = MultinomialNB()
    mnb_model.fit(train_features, train_y)
    
    # Test predict
    pred = mnb_model.predict(test_features)
    
    return metrics.accuracy_score(pred, test_y), mnb_model, vec

In [None]:
train_with_n_topwords(500, tfidf=True)

In [None]:
possible_n = [500 * i for i in range(1, 20)]

cnt_accuracies = []
tfidf_accuracies = []

for n in tqdm(possible_n):
    cnt_accuracies.append(train_with_n_topwords(n)[0])
    tfidf_accuracies.append(train_with_n_topwords(n, tfidf=True)[0])

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

plt.plot(possible_n, cnt_accuracies, label='Word Count')
plt.plot(possible_n, tfidf_accuracies, label='Tf-idf')
plt.legend()  # without this call, the labels above are never displayed

**Expected**:

<img src="resources/plot.png" width="400">

## Save model

In [None]:
_, model, vec = train_with_n_topwords(3000, tfidf=True)

In [None]:
import pickle

with open('tf_vec.pkl', 'wb') as fp:
    pickle.dump(vec, fp)
    
with open('mnb_model.pkl', 'wb') as fp:
    pickle.dump(model, fp)