In this notebook, we will build a classification model to predict whether a movie review from IMDb is positive or negative. We will use the dataset named [IMDb Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from [Kaggle](https://en.wikipedia.org/wiki/Kaggle). 

In [None]:
import warnings 
warnings.filterwarnings("ignore")

import pandas as pd
df = pd.read_csv("IMDB Dataset.csv")
df.head()

In [None]:
df.iloc[0, 0]

Each input example in the logistic classification algorithm was a vector (a point) in n-dimensional space. Here, our input consists of text, which is a sequence of words, spaces, punctuation marks, emojis, etc. So, we need to convert this input into a feature vector of numerical values. When we have categories, such as the three ports of entry in the Titanic dataset, we use one-hot encoding to create a column for each category and thus get a numerical feature vector. What should we do in this case?

We will use techniques from **natural language processing (NLP) for text classification**. This particular model is an example of **sentiment analysis**, which, as the name suggests, identifies the sentiment of a text.

### Bag of Words (BOW)

A simple way to vectorize a text would be to convert it into a sequence of words. For example,
```
"It is sunny in Los Angeles." ->  ["It", "is", "sunny", "in", "Los", "Angeles", "."]
```

Now we have a vector, but its values are not numerical. So, we create a vocabulary from the training text and count how many times each vocabulary word occurs in a new text:
```
Training text: ["I like to read in cafes.", "The walk in the park is nice."]
Vocabulary: ["I", "like", "to", "read", "in", "cafes", "the", "walk", "park", "is", "nice"]
New text: "I like the walk in the park."
```

|I| like| to| read| in| cafes|the|walk| park|is|nice|
|-|-----|---|-----|---|------|---|----|-----|--|----|
|1|  1  | 0 |  0  | 1 |  0   | 2 |  1 |  1  |0 | 0  |

```
Vectorization: "I like the walk in the park." -> [1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]
``` 
Suppose I know the vocabulary `["I", "like", "to", "read", "in", "cafes", "the", "walk", "park", "is", "nice"]` and I am given the vector `[1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]` corresponding to this vocabulary. Can I retrieve the original sentence? If not, what is missing?

This technique is called Bag of words (BOW) as it disregards the order of the words. You can think of it as putting all the words from a sentence in a bag and thereby breaking the sequence of words completely.
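
The counting step above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `tokenize` and `bag_of_words` are made up here, and the vocabulary is lowercased so that "The" and "the" are counted together.

```python
import re
from collections import Counter

# Vocabulary from the example above, lowercased for matching
vocabulary = ["i", "like", "to", "read", "in", "cafes",
              "the", "walk", "park", "is", "nice"]

def tokenize(text):
    # Lowercase the text and keep only runs of letters as words
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocabulary):
    # Count every word, then read off the counts in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vector = bag_of_words("I like the walk in the park.", vocabulary)
print(vector)  # [1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]

# A scrambled sentence gives the same vector: the word order is lost
scrambled = bag_of_words("The park like I in the walk.", vocabulary)
print(scrambled == vector)  # True
```

The second call shows why the original sentence cannot be recovered from the vector alone.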

With the above steps, you will get a pair (input, output) for each training example, where the input is the numerical vector and the output is the label: 0 for a negative review and 1 for a positive one.

In practice, the vocabulary is usually very large, which means the feature matrix has many columns (one per vocabulary word). More columns make the model more complex, so to keep overfitting in check you need a much higher number of rows (training examples) to train the model.

The above process of vectorization can be performed using [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from [`scikit-learn`](https://scikit-learn.org/stable/) as follows. 

First we import and define the vectorizer.
```
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() 
```

Then, we use only the training set to fit (train) the vectorizer. The method `fit_transform` fits the vectorizer and transforms the training set in a single step.
```
X_train_vectorized = vectorizer.fit_transform(X_train)
```
Lastly, we transform the validation set. Note that we do not use the validation set to fit/train the vectorizer.
```
X_valid_vectorized = vectorizer.transform(X_valid)
```

The variables `X_train_vectorized` and `X_valid_vectorized` thus obtained are numerical feature matrices (stored in sparse format, with one row per review and one column per vocabulary word) that can be fed into a logistic classifier.

Since the vocabulary comes solely from the training set, the performance of our model depends on the training set being large and diverse enough to contain most of the vocabulary we need. Words that appear only in the validation set are simply ignored.
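
As a small illustration of fitting on the training set only, the sketch below reuses the two training sentences from the vocabulary example. The validation sentence and the variable names are made up for this illustration. It relies on two `CountVectorizer` defaults: text is lowercased, and tokens shorter than two characters (such as "I") are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data: the training sentences from above and a made-up validation sentence
train_texts = ["I like to read in cafes.", "The walk in the park is nice."]
valid_texts = ["I like the walk in the garden."]

toy_vectorizer = CountVectorizer()
toy_train_vectorized = toy_vectorizer.fit_transform(train_texts)  # learn vocabulary + transform
toy_valid_vectorized = toy_vectorizer.transform(valid_texts)      # transform only

# Vocabulary learned from the training sentences, in column order
print(sorted(toy_vectorizer.vocabulary_))

# "garden" never appeared in the training set, so it gets no column
print("garden" in toy_vectorizer.vocabulary_)

# Dense view of the validation vector (one column per vocabulary word)
print(toy_valid_vectorized.toarray())
```

Both matrices have the same number of columns, because the columns are fixed by the vocabulary learned during fitting.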

There are in fact two basic steps to follow before building the model:
- Preprocessing: Clean the text and make it easier to process
- Vectorization: Create numerical feature vectors from the text

To decide how to clean the text, let us have a closer look at one of the reviews:

In [None]:
df.iloc[16, 0]

Some of the text preprocessing we performed earlier can be useful here:
* Remove HTML tags such as `<br />`
* Remove characters such as `\`, `'` and `"`
* Replace punctuation with spaces
* Convert all the text to lowercase

It can be summed up nicely in a function.
```
import re
def clean_text(text):
    # add the cleaning steps listed above here
    return text
```

You can pass this function on to the [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) so that it cleans up the reviews before vectorizing.
```
vectorizer = CountVectorizer(preprocessor=clean_text) 
```

What else can we do? Are there words in the reviews that are not adding any value to the model for predicting the sentiment?

In [None]:
df.iloc[1, 0]

Common words such as "the", "a", "is", "it", etc. can be conveniently removed. They are called **stopwords**.

```
vectorizer = CountVectorizer(stop_words="english", preprocessor=clean_text)                         
```
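
To get a feel for what `stop_words="english"` removes, here is a minimal sketch that compares the vocabulary learned with and without the built-in list. The two sentences and the variable names are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The walk in the park is nice.", "It is not a bad film."]

# Vocabulary without stop word removal
vectorizer_all = CountVectorizer()
vectorizer_all.fit(sentences)

# Vocabulary with scikit-learn's built-in English stop word list
vectorizer_stop = CountVectorizer(stop_words="english")
vectorizer_stop.fit(sentences)

removed = sorted(set(vectorizer_all.vocabulary_) - set(vectorizer_stop.vocabulary_))
print("kept:   ", sorted(vectorizer_stop.vocabulary_))
print("removed:", removed)
```

Note that the built-in list also contains "not", so negations are dropped along with words like "the" and "is". This is worth keeping in mind when judging the model's predictions on negated sentences.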

So go ahead and try to build the model! The solution will be shared after the session. We will reconvene to learn more.

Guideline: 
* Divide the dataset into training and validation sets
* Define the function for cleaning text to be used in the next step
* Vectorize both the training and validation sets using [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Make sure to pass arguments for the `stop_words` and `preprocessor` keywords.
* Train a logistic classifier using [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the vectorized training set
* Predict the labels for the validation set and compute the accuracy of the predictions
* Write a few reviews and test them to see if the model correctly predicts the sentiment labels (Optional)

In [None]:
from sklearn.model_selection import train_test_split
X = df['review']
# encode the labels: positive -> 1, negative -> 0
y = df['sentiment'].map({'positive': 1, 'negative': 0})
# default is a 75% / 25% train-validation split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [None]:
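import re
def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation, digits and other characters
    - Converting to lowercase
    """

    # replace HTML tags such as <br /> with a space
    text = re.sub(r'<.*?>', ' ', text)

    # replace punctuation with spaces; the hyphen is placed last in the
    # character class so that it is read literally and not as a range
    text = re.sub(r"[,.:;?!@#$%^&*()+_=/{}-]+", ' ', text)

    # remove the characters \, ', ", [ and ]
    text = re.sub(r"[\\'\"\[\]]", '', text)

    # remove digits
    text = re.sub(r'\d+', '', text)

    # a custom preprocessor replaces the vectorizer's built-in lowercasing,
    # so convert to lowercase here
    text = text.lower()

    return text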
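
The remaining steps of the guideline (vectorize, train, predict, score) can be sketched end to end on a tiny made-up dataset. This is only a sketch under stated assumptions and not the shared solution: the eight toy reviews, the variable names prefixed with `toy_`, and the choice of `max_iter=1000` are all illustrative, and the accuracy on such a small sample says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews; 1 = positive, 0 = negative
toy_texts = [
    "A wonderful film with a great cast",
    "I loved every minute of this movie",
    "Brilliant acting and a touching story",
    "An excellent and funny show",
    "A terrible film with awful acting",
    "I hated every minute of this movie",
    "Boring plot and a disgusting ending",
    "A dull and painfully slow show",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the toy data for validation, keeping both classes in each part
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Fit the vectorizer on the training part only, then transform both parts
toy_vectorizer = CountVectorizer(stop_words="english")
toy_X_train_vec = toy_vectorizer.fit_transform(toy_X_train)
toy_X_valid_vec = toy_vectorizer.transform(toy_X_valid)

# Train the logistic classifier and score it on the validation part
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_X_train_vec, toy_y_train)
toy_pred = toy_model.predict(toy_X_valid_vec)
toy_accuracy = accuracy_score(toy_y_valid, toy_pred)
print("validation accuracy:", toy_accuracy)

# Predict the label of a new, hand-written review
new_review = ["What a wonderful and touching movie"]
print(toy_model.predict(toy_vectorizer.transform(new_review)))
```

On the real data, the same steps apply with `X_train`, `X_valid`, `y_train` and `y_valid` from the split above, and with `preprocessor=clean_text` passed to the vectorizer.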

### TF-IDF Vectorizer

If we look only at individual words, as in Bag-of-Words (BOW), some words such as "wonderful", "disgusting", etc. are stronger indicators of the sentiment of a review than words such as "watching", "become", "every", "after", etc. In the above method, the words were weighted solely based on their frequency in a review. Wouldn't it be useful to weigh rarer words higher than commonly occurring ones?

Term Frequency Inverse Document Frequency (TF-IDF)

$$ \text{TF-IDF} = \text{TF (Term Frequency)} \times \text{IDF (Inverse Document Frequency)} $$

Term Frequency (TF) is the same count as above, i.e. the number of times a word occurs in a review. It is multiplied by the Inverse Document Frequency (IDF), which measures how rare the word is across all the reviews. Rarer words have higher IDF values and hence receive more weight in TF-IDF than commonly occurring words with the same raw count.

$$ \text{Inverse Document Frequency (IDF) for a word} = \log \Bigg( \frac{\text{Total number of reviews}}{\text{Number of reviews that contain this word}}\Bigg)$$

Term Frequency Inverse Document Frequency (TF-IDF) vectorization is implemented in [`scikit-learn`](https://scikit-learn.org/stable/) as [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and has the same syntax as [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) above.
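
To make the two formulas concrete, here is a minimal sketch that computes TF-IDF by hand, exactly as written above, on three made-up reviews. The reviews and the helper names `idf` and `tf_idf` are illustrative assumptions.

```python
import math
from collections import Counter

# Three made-up, already cleaned reviews
reviews = [
    "a wonderful film with a wonderful cast",
    "a dull film",
    "a film about a dog",
]
tokenized = [review.split() for review in reviews]
n_reviews = len(tokenized)

def idf(word):
    # log(total number of reviews / number of reviews that contain the word)
    n_containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_reviews / n_containing)

def tf_idf(tokens):
    # term frequency (raw count in this review) times IDF, per word
    term_frequency = Counter(tokens)
    return {word: count * idf(word) for word, count in term_frequency.items()}

scores = tf_idf(tokenized[0])
for word, score in scores.items():
    print(f"{word:10s} {score:.3f}")
```

The word "film" occurs in every review, so its IDF is $\log(3/3) = 0$ and it contributes nothing, while "wonderful" occurs twice in a single review and scores $2 \log 3 \approx 2.197$. The values produced by `TfidfVectorizer()` will differ from this hand computation: by default it uses a smoothed IDF, $\ln\big(\frac{1 + n}{1 + \text{df}}\big) + 1$, and rescales each row to unit length.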

### Using n-grams

The above methods using Bag-Of-Words (BOW) technique are not good at detecting negation. Let's predict the sentiment for some of the reviews. Recall that $0$ corresponds to negative and $1$ corresponds to positive sentiment.

In [None]:
review1 = ["In and of itself it is not a bad film."]
vectorized_review1 = vectorizer2.transform(review1)
model2.predict(vectorized_review1)

In [None]:
review2 = ["""It plays on our knowledge and our senses, particularly with the scenes concerning
          Orton and Halliwell and the sets are terribly well done."""]
vectorized_review2 = vectorizer2.transform(review2)
model2.predict(vectorized_review2)

In [None]:
review3 = ["""This show was not really funny anymore."""]
vectorized_review3 = vectorizer2.transform(review3)
model2.predict(vectorized_review3)

An improvement would be to include phrases in the model instead of simply breaking the sentence into single words. This is achieved using word $n$-grams: bigrams take two consecutive words at a time, trigrams take three, and so on. It is implemented in the vectorizer using the keyword `ngram_range` as follows:
```
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 3))
```

where
```
ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
```
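
The effect of `ngram_range` can be seen by listing the features extracted from a single sentence. The sketch below uses `CountVectorizer()` without a preprocessor to keep it short; the variable names are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

review = ["This show was not really funny anymore."]

# Collect the extracted n-grams for three different ranges
ngrams = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit(review)
    ngrams[ngram_range] = sorted(vec.vocabulary_)
    print(ngram_range, ngrams[ngram_range])

# With stop word removal, stop words are filtered out first and the
# n-grams are then built from the words that remain
vec_stop = CountVectorizer(stop_words="english", ngram_range=(1, 2))
vec_stop.fit(review)
ngrams_after_stop_words = sorted(vec_stop.vocabulary_)
print(ngrams_after_stop_words)
```

Without stop word removal, the bigram "not really" is available as a feature. With `stop_words="english"`, the word "not" is dropped before the n-grams are formed, so no n-gram contains it.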


In [None]:
review1 = ["In and of itself it is not a bad film."]
vectorized_review1 = vectorizer3.transform(review1)
model3.predict(vectorized_review1)

In [None]:
review2 = ["""It plays on our knowledge and our senses, particularly with the scenes concerning
          Orton and Halliwell and the sets are terribly well done."""]
vectorized_review2 = vectorizer3.transform(review2)
model3.predict(vectorized_review2)

In [None]:
review3 = ["""This show was not really funny anymore."""]
vectorized_review3 = vectorizer3.transform(review3)
model3.predict(vectorized_review3)

As you can see, the model correctly predicts the sentiment only for the second review. It still does not get the sentiment of the other two reviews! Logistic Regression is limited in that it can only draw linear decision boundaries, so we will come back to this dataset when we use more advanced neural network algorithms to see if they improve the results. We will also study some neural network architectures that are specifically designed to remember the previous words in a sentence.