## Loading the input dataset
This is a dataset of movie reviews scraped from IMDb. You can download it from **Kaggle** here:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews  
I have already downloaded this dataset and placed it in the same local directory as this notebook.  
Kaggle is a treasure chest of publicly available datasets, with community-contributed code for cleaning and prediction! If you're interested in more tutorials like this, pay a visit there!

In [89]:
import pandas as pd
df = pd.read_csv('IMDB Dataset.csv')

In [90]:
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. \<br />\<br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [91]:
df['review'].iloc[49999]

"No one expects the Star Trek movies to be high art, but the fans do expect a movie that is as good as some of the best episodes. Unfortunately, this movie had a muddled, implausible plot that just left me cringing - this is by far the worst of the nine (so far) movies. Even the chance to watch the well known characters interact in another movie can't save this movie - including the goofy scenes with Kirk, Spock and McCoy at Yosemite.<br /><br />I would say this movie is not worth a rental, and hardly worth watching, however for the True Fan who needs to see all the movies, renting this movie is about the only way you'll see it - even the cable channels avoid this movie."

### Power in Numbers!
There is great power in big numbers! The larger the dataset, the more detailed the patterns the model can learn, which in turn tends to lead to more accurate predictions.  
In this tutorial we take a smaller random sample to make things faster, but the more data you use, the better the accuracy will usually be (**except when it's "garbage in, garbage out"**).

In [92]:
df = df.sample(5000)

In [93]:
df['sentiment'].value_counts()

negative    2565
positive    2435
Name: sentiment, dtype: int64

The dataset is roughly balanced. If it weren't, we could take measures, such as resampling or class weighting, to help the model learn the under-represented class.
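As a rough illustration of one such measure, the sketch below computes "balanced" class weights with scikit-learn. The `labels` array is made up for the example (it is not the IMDb data), and this is only one of several possible approaches.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 negative (0) reviews and 2 positive (1) reviews
labels = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * count_of_class)
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}

# the rarer class receives the larger weight
print(class_weights)
```

Many scikit-learn classifiers, including `LogisticRegression`, accept `class_weight="balanced"` directly, which applies the same weighting during training.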

In [94]:
class_mapping = {'positive': 1, "negative": 0}
reverse_class_mapping = {v: k for k, v in class_mapping.items()}

In [95]:
# Convert the 'positive'/'negative' labels to 1/0
# (assigning the result back avoids calling replace(..., inplace=True) on a column selection)
df['sentiment'] = df['sentiment'].map(class_mapping)

In [96]:
df['sentiment']

189      0
4536     0
36924    1
19830    1
40655    0
        ..
7912     0
23148    0
26603    0
41193    0
42349    1
Name: sentiment, Length: 5000, dtype: int64

### Cleaning and preprocessing
This part usually requires the most work in data science and is an essential step in the process. We need to look at the data, work out whether it is dirty, and decide how it can be cleaned.

In [97]:
import re

# matches HTML tags such as <br /> and character entities such as &amp;
HTML_PATTERN = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', re.IGNORECASE)

def text_normalization(txt: str, to_lower: bool = True, no_punct: bool = True) -> str:
    # strip HTML first: once punctuation is removed, '<br />' would survive as the word 'br'
    txt = HTML_PATTERN.sub(' ', txt)

    # remove an opening parenthesised chunk such as '(spoilers)', then every character
    # that is not a letter, digit or whitespace (dashes are removed too)
    if no_punct:
        txt = re.sub(r'(^\(?[^()]*\))|([^a-zA-Z0-9\s]+)', '', txt)

    # collapse runs of whitespace into a single space
    txt = re.sub(r'\s+', ' ', txt)

    # apply lowercase
    if to_lower:
        txt = txt.lower()

    return txt
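
The order of the cleaning steps matters when reviews contain HTML such as `<br />`. The sketch below uses its own simplified patterns (the helper names `strip_punct`, `strip_tags` and `squeeze` are illustrative and not part of this notebook) to compare the two orders on a made-up string.

```python
import re

raw = "Great film.<br /><br />Loved it!"

def strip_punct(text):
    # keep only letters, digits and whitespace
    return re.sub(r"[^a-zA-Z0-9\s]+", "", text)

def strip_tags(text):
    # replace anything that looks like an HTML tag with a space
    return re.sub(r"<.*?>", " ", text)

def squeeze(text):
    # collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# punctuation first: the angle brackets vanish, so the tag names are left behind as words
punct_first = squeeze(strip_tags(strip_punct(raw)))
# tags first: the whole tag is removed before punctuation is touched
tags_first = squeeze(strip_punct(strip_tags(raw)))

print(punct_first)
print(tags_first)
```

Running a cleaning function on a handful of raw reviews like this is a cheap way to catch leftovers such as a stray `br` token before they end up in the vocabulary.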

In [98]:
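# function to remove the stopwords
# (needs the NLTK 'stopwords' and 'punkt' data, which can be fetched with nltk.download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
STOPWORDS = set(stopwords.words('english'))  # a set makes membership tests fast

def remove_stopwords(text):
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in STOPWORDS]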

In [99]:
# perform stemming
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

def stem_words(tokens):
    # reduce every token in the list to its stem
    return [ps.stem(token) for token in tokens]
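
To get a feel for what stemming does, here is a small sketch that runs NLTK's `PorterStemmer` on a few hand-picked words. The word list is made up for illustration. Note that a stem does not have to be a real dictionary word; it only needs to be the same for related word forms.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# a few illustrative words (not taken from the dataset)
words = ["movies", "running", "wonderful", "acting"]

# map each word to the stem the Porter algorithm produces for it
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Because different forms of a word collapse to one stem, the vocabulary built later is smaller and each remaining column collects more evidence.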

In [100]:
# convert list into string
def join_back(list_input):
    return " ".join(list_input)

Pulling all the preprocessing functions into one function to make it easy to apply them to all the text

In [101]:
def preprocess(text):
    text = text_normalization(text)
    tokens = remove_stopwords(text)
    tokens = stem_words(tokens)
    return join_back(tokens)

**progress_apply** is a method that tqdm adds to pandas. It works like `apply`, but shows a progress bar while the function is being applied to the data.

In [102]:
from tqdm.notebook import tqdm
tqdm.pandas()

df['reviews_preprocessed'] = df['review'].progress_apply(preprocess)

  0%|          | 0/5000 [00:00<?, ?it/s]

## Vectorizing the input - transformation into a vector space
For the model to understand anything about the input, the text needs to be transformed into numbers! Human language on its own does not mean anything to an algorithm. So how do we do this?  
There are many options for transforming text input into a vector space. One of the simplest methods is:  
**bag-of-words**: simply count the occurrences of each word and record the count under the word's own column. For example, given the corpus of documents below:
```
["My cats are very polite",
"I play videogames on the weekends",
"This machine does not have enough capacity"]
```
and for input = `"I have two cats"` the corresponding transformation would look something like this (the word `two` never appears in the corpus, so it has no column and is ignored):

| WORDS | my | cats | are | very | polite | I | play | videogames | on | the | weekends | this | machine | does | not | have | enough | capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VECTOR | 0 | **1** | 0 | 0 | 0 | **1** | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | **1** | 0 | 0 |


This method first builds a dictionary (the vocabulary) of all words present in the corpus, then counts how often each of those words occurs in a document to create that document's vector.
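
The two steps can be sketched in plain Python. This is a simplified illustration that assumes lowercasing and whitespace tokenisation; the helper names `tokenize` and `bag_of_words` are made up for the example, and a library vectorizer may tokenise differently.

```python
from collections import Counter

corpus = [
    "My cats are very polite",
    "I play videogames on the weekends",
    "This machine does not have enough capacity",
]

def tokenize(text):
    # lowercase and split on whitespace (a deliberately simple tokenizer)
    return text.lower().split()

# Step 1: build the vocabulary, keeping words in order of first appearance
vocabulary = []
for document in corpus:
    for word in tokenize(document):
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(text):
    # Step 2: count each vocabulary word; words outside the vocabulary are ignored
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vector = bag_of_words("I have two cats")
print(vocabulary)
print(vector)
```

Only words seen while building the vocabulary get a column, so an unseen word in new input contributes nothing to the vector.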

**Note**: There are far more sophisticated transformation techniques, such as **tf-idf vectorization**, **word embeddings trained using deep learning**, and many more! Embeddings come in a huge variety of flavours, and all you need to do is give it a search. You'll find options from **GloVe** and Google's **Word2Vec** to **fastText** and the sentence-level **SBERT**. These embedding models are pre-trained, so you can just download and use them.
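
For comparison, a tf-idf transformation can be sketched with scikit-learn's `TfidfVectorizer`, here on the same toy corpus. This tutorial does not use it; the snippet only shows that it is close to a drop-in alternative. With default settings the vectorizer lowercases the text and skips one-letter tokens such as "I".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "My cats are very polite",
    "I play videogames on the weekends",
    "This machine does not have enough capacity",
]

tfidf = TfidfVectorizer()
# sparse matrix with one row per document and one column per vocabulary word
weights = tfidf.fit_transform(corpus)

# with default settings every row is scaled to unit (L2) length
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(weights.shape)
print(row_norms)
```

Instead of raw counts, each cell holds a weight that is higher for words frequent in a document but rare across the corpus.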
  
In this tutorial we will use the simple **bag-of-words** method. For all its simplicity, it is highly explainable and very useful for understanding the basics of natural language processing.


In [103]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [104]:
X = cv.fit_transform(df['reviews_preprocessed']).toarray()

### What does the data look like at this point?
Just a series of word counts, most of them zero! **Computer goes YUMMM!**

In [105]:
X[0]

array([0, 0, 0, ..., 0, 0, 0])

In [106]:
# extract y
y = df['sentiment'].values

### Train set and Test (hold-out) set
In order to verify if the model is actually learning anything, we need to split the full dataset at hand into a training set and a hold-out set (test set). A typical split ratio can be **80% for train** and **20% for test set**, but depending on data availability this can be changed.
- The **training set** will be used to teach the model patterns existing in the data that would lead to a Positive sentiment or a Negative sentiment.  
- The **test set** will be kept away from the model during training and only used to get predictions and to verify those predictions. This will allow for an important step in any machine learning process: **Model Evaluation**

In [107]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2)
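
A possible variant, sketched on synthetic data: `random_state` makes the split repeatable between runs, and `stratify` keeps the class proportions the same in both sets. The arrays and the seed value below are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 samples with 2 features, perfectly balanced labels
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # arbitrary seed, chosen only for reproducibility
    stratify=labels,   # preserve the 50/50 class ratio in both sets
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```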

## Training a simple classifier with Logistic Regression
Logistic Regression is one of the simpler Machine Learning algorithms. It assumes a linear relationship between the input features and the log-odds of the target class. It is quick to train and to get predictions from, and it is easy to work with and understand.  
In this tutorial we will use Logistic Regression with its default parameters, but there are hyperparameters that can be tuned to get better accuracy. You can view all the possible hyperparameters here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
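
As a rough sketch of what tuning could look like, the snippet below searches over the regularisation strength `C` with cross-validation. It runs on synthetic data from `make_classification`, and the grid values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the vectorized reviews
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# smaller C means stronger regularisation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The search should only ever see the training set, so the test set stays untouched for the final evaluation.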

In [108]:
from sklearn.linear_model import LogisticRegression

In [109]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [110]:
y_train

array([0, 0, 1, ..., 0, 1, 0])

In [111]:
clf = LogisticRegression(random_state=0).fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
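
The warning above means the solver stopped before it had fully converged. One of the remedies it suggests is raising `max_iter` (the default is 100). A minimal sketch on synthetic data, where the value 1000 is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the vectorized reviews
X_demo, y_demo = make_classification(n_samples=300, n_features=50, random_state=0)

# allow the solver up to 1000 iterations instead of the default 100
clf_demo = LogisticRegression(random_state=0, max_iter=1000).fit(X_demo, y_demo)

# number of iterations the solver actually used
print(clf_demo.n_iter_)
```

A model trained with the warning can still give usable predictions, but letting the solver converge removes one source of doubt when comparing results.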


In [112]:
y_pred_logreg = clf.predict(X_test)

In [113]:
from sklearn.metrics import accuracy_score
print("logreg",accuracy_score(y_test,y_pred_logreg))

logreg 0.85


### Calculating False Negative Rate and False Positive Rate
- False Negative Rate: the fraction of reviews that were **actually positive** but that we **incorrectly predicted as negative**
- False Positive Rate: the fraction of reviews that were **actually negative** but that we **incorrectly predicted as positive**

In [114]:
from sklearn.metrics import confusion_matrix

In [84]:
# Want to know what a function or class in Python does? Just put '?' after its name and run it, like below
confusion_matrix?

Confusion matrix whose i-th row and j-th column entry indicates the number of samples with true label being i-th class and predicted label being j-th class.

In [115]:
confusion_matrix(y_test, y_pred_logreg)

array([[429,  85],
       [ 65, 421]])
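
Both rates can be read off this matrix. Rows are the true labels and columns the predicted labels, with class 0 (negative) first, so the four cells are TN, FP, FN and TP in reading order. The sketch below copies the numbers from the output above.

```python
import numpy as np

# confusion matrix copied from the cell output above
cm = np.array([[429, 85],
               [65, 421]])

# ravel() flattens row by row: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

false_negative_rate = fn / (fn + tp)   # share of actual positives predicted as negative
false_positive_rate = fp / (fp + tn)   # share of actual negatives predicted as positive
accuracy = (tp + tn) / cm.sum()

print(f"FNR: {false_negative_rate:.3f}")
print(f"FPR: {false_positive_rate:.3f}")
print(f"Accuracy: {accuracy:.2f}")
```

For this run, about 13.4% of the positive reviews were missed and about 16.5% of the negative reviews were flagged as positive, which is consistent with the 0.85 accuracy reported earlier.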

### Getting predictions for new/custom input

In [116]:
input_list = [
    "we enjoyed this movie a lot",
    "we hated this movie",
    "do not take your children to this movie. too scary!",
    "I could watch this movie over and over again",
    "I can watch this forever",
    "It started off great but went downhill too fast. Struggled to stay awake.",
]

In [117]:
def vectorize(input_tokens, vectorizer):
    return vectorizer.transform([input_tokens]).toarray()

In [118]:
for review in input_list:
    numeric_predictions = clf.predict(vectorize(preprocess(review), cv))
    readable_predictions = [reverse_class_mapping[p] for p in numeric_predictions]
    print(f"Prediction: {str(readable_predictions)}, Review: {review}")

Prediction: ['positive'], Review: we enjoyed this movie a lot
Prediction: ['negative'], Review: we hated this movie
Prediction: ['negative'], Review: do not take your children to this movie. too scary!
Prediction: ['negative'], Review: I could watch this movie over and over again
Prediction: ['positive'], Review: I can watch this forever
Prediction: ['negative'], Review: It started off great but went downhill too fast. Struggled to stay awake.


### How do you think the model did on these custom reviews?