
- Ask yourself: why would they have selected this problem for the challenge? What are some gotchas in this domain I should know about?
- What types of visualizations will help me grasp the nature of the problem / data?
- What feature engineering might help improve the signal?
- Which modeling techniques are good at capturing the types of relationships I see in this data?
- Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!
- What are some of the weaknesses of the model, and how can the model be improved with additional work?
- Choose a CV or NLP problem. Do a thorough Exploratory Data Analysis of the dataset and report the final performance metrics for your approach. Suggest ways in which you can improve the model.

## Frame the Problem

- This is the IMDB dataset of 50K movie reviews for natural language processing.
- It is a binary sentiment classification dataset consisting of 25,000 positive and 25,000 negative reviews.

## Objective

- The objective is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

## Top Solution

The highest accuracy listed for this dataset is 96.21%, reported in the [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://paperswithcode.com/sota/sentiment-analysis-on-imdb) paper.

## Performance Measure

## Data Source
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

## Constants

In [1]:
DATA_PATH = '../Data/Raw/IMDB Dataset.csv'
PREPROCESSED_PATH = "../Data/Processed/preprocessed_df.pkl"

MLFLOW_TRACKING_URI = '../Models/mlruns'
MLFLOW_EXPERIMENT_NAME = "imdb_review_sentiment_analysis"
LOG_PATH = "../Models/temp/"
LOG_DATA_PKL    =  "data.pkl"
LOG_MODEL_PKL   =  "model.pkl"
LOG_METRICS_PKL =  "metrics.pkl"

## Packages

In [None]:
import os
import numpy as np
import pandas as pd

import logging
import pickle
from pathlib import Path

import re
import string
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

import mlflow
from mlflow.tracking import MlflowClient


# pd.options.display.max_rows = 10000
# pd.options.display.max_columns = 10000

## Functions

In [3]:
# Function to log Data, Model, Metrics and Track models.
def log_data(x_train,y_train,x_test,y_test):
    # Save the data the model was trained and evaluated on
    data_details = {
                    "x_train": x_train,
                    "x_test": x_test,
                    "y_train": y_train,
                    "y_test": y_test
    }

    with open(os.path.join(LOG_PATH, LOG_DATA_PKL), "wb") as output_file:
        pickle.dump(data_details, output_file)
        
        
def log_model(clf,model_description=''):
    # save the model, model details and model's description
    model = {"model_description": model_description,
             "model_details": str(clf),
             "model_object": clf} 

    with open(os.path.join(LOG_PATH, LOG_MODEL_PKL), "wb") as output_file:
        pickle.dump(model, output_file)
        
    return model
        
def log_metrics(train_scores, test_scores):
    # save the model metrics
    classes_metrics = {"train_scores": train_scores,
                        "test_scores" : test_scores} 


    with open(os.path.join(LOG_PATH, LOG_METRICS_PKL), "wb") as output_file:
        pickle.dump(classes_metrics, output_file)

def track_model(model, scores):
    # Start a run in the experiment and track current model
    with mlflow.start_run(experiment_id=exp.experiment_id, run_name=model["model_description"]):
        # Track pickle files
        mlflow.log_artifacts(LOG_PATH)

        # Track metrics 
        for metric, score in scores.items():
            mlflow.log_metric(metric, score)

In [None]:
# count number of characters 
def chars_count(text):
    return len(text)

In [None]:
# count number of words 
def words_count(text):
    return len(text.split())

In [None]:
# count number of capital words
def capital_words_count(text):
    return sum(map(str.isupper,text.split()))

In [None]:
# count occurrences of each punctuation character
def punctuations_count(text):
    # string.punctuation is the same character set as the original literal
    return {f"{p} count": text.count(p) for p in string.punctuation}

In [None]:
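# count number of words in quotes
def words_in_quotes_count(text):
    # Non-greedy match of anything between a pair of single or double quotes
    quoted = re.findall(r"'.*?'|\".*?\"", text)
    # re.findall returns an empty list when nothing matches, so the sum is 0
    return sum(words_count(q[1:-1]) for q in quoted)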

In [None]:
# count number of sentences
from nltk.tokenize import sent_tokenize

def sent_count(text):
    return len(sent_tokenize(text))

In [None]:
# calculate average word length
def avg_word_len(char_count,word_count):
    return char_count/word_count

In [None]:
# calculate average sentence length
def avg_sent_len(word_count,sent_count):
    return word_count/sent_count

In [None]:
# count number of unique words 
def unique_words_count(text):
    return len(set(text.split()))

In [None]:
# Calculate the percentage of unique words
def unique_words_percent(unique_count,words_count):
    return unique_count/words_count

In [None]:
# count of stopwords
def stopwords_count(text):
    stop_words = set(stopwords.words('english'))  
    word_tokens = word_tokenize(text)
    stopwords_x = [w for w in word_tokens if w in stop_words]
    return len(stopwords_x)

In [None]:
# stopwords vs words
def stopwords_percent(stopwords_count,text):
    return stopwords_count/len(word_tokenize(text))
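
The counting helpers above are defined but not yet applied to the reviews. As a minimal sketch of how a few of them could be combined into one feature dictionary per review, the following uses only the standard library. `basic_text_features` is a hypothetical name that does not appear in the notebook, and the zero-word guard is an added assumption to avoid division by zero on empty text.

```python
# Sketch only: combines the character, word, capital-word and unique-word
# counts into a single dict. `basic_text_features` is a hypothetical helper.

def basic_text_features(text):
    words = text.split()
    n_chars = len(text)
    n_words = len(words)
    n_unique = len(set(words))
    return {
        "chars": n_chars,
        "words": n_words,
        "capital_words": sum(w.isupper() for w in words),
        "unique_words": n_unique,
        # Same ratios as avg_word_len / unique_words_percent, guarded for empty text
        "avg_word_len": n_chars / n_words if n_words else 0.0,
        "unique_ratio": n_unique / n_words if n_words else 0.0,
    }

features = basic_text_features("GREAT movie, great cast")
print(features)
```

With pandas, the same function could be mapped over the review column, for example `raw_df['review'].apply(basic_text_features).apply(pd.Series)`, to get one feature column per key.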

## Load Dataset

In [16]:
# Read Dataset and print shape
raw_df = pd.read_csv(DATA_PATH)
raw_df.shape

(50000, 2)

## Data Preprocessing

In [17]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


- The Dataset contains only two columns
- Review column is the training data and Sentiment is the labels column
- The dataset doesn't have null values

In [18]:
# Check for Duplicates
raw_df.duplicated().value_counts()

False    49582
True       418
dtype: int64

- The Dataset Contains 418 Duplicated samples

In [19]:
# Remove the Duplicates
raw_df = raw_df.drop_duplicates()

In [20]:
# Check whether the dataset is balanced or imbalanced?
raw_df['sentiment'].value_counts()

positive    24884
negative    24698
Name: sentiment, dtype: int64

- The Dataset is Balanced

In [21]:
# Check whether any empty reviews exist
raw_df['length'] = raw_df['review'].apply(len)
print(len(raw_df[raw_df['length'] == 0]))
raw_df = raw_df.drop(columns='length')

0


- The Dataset doesn't have empty reviews

### Text preprocessing

In [22]:
df = raw_df.copy()

In [23]:
# Convert text to lowercase
raw_df['review_cleaned'] = raw_df['review'].str.lower()

In [24]:
# Remove HTML tags
raw_df['review_cleaned'] = raw_df['review_cleaned'].apply(lambda x: re.sub('<[^<]+?>', ' ', x))

In [25]:
# Remove Punctuations
# raw_df['review'] = raw_df['review'].str.translate(str.maketrans('', '', string.punctuation))
raw_df['review_cleaned'] = raw_df['review_cleaned'].apply(lambda x: re.sub(f"[{re.escape(string.punctuation)}]",' ', x))

In [26]:
# Remove Digits
raw_df['review_cleaned'] = raw_df['review_cleaned'].apply(lambda x: re.sub(r'\d+', '', x))

In [27]:
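# Remove URLs
# Note: punctuation (including ':', '/' and '.') was already replaced with spaces above,
# so this pattern can no longer match here; it only takes effect if run before punctuation removal.
raw_df['review_cleaned'] = raw_df['review_cleaned'].apply(lambda x: re.sub(r'https?://\S+|www\.\S+', '', x))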

In [28]:
# Create new column with no stopwords (a set gives fast membership tests)
stop_words = set(stopwords.words('english'))
raw_df['review_nostopwords'] = raw_df['review_cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [29]:
# Create new column for stemmed words
stemmer = PorterStemmer()
raw_df["review_stemmed"] = raw_df["review_nostopwords"].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))

In [None]:
# Create new column for lemmatized words 
#Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
# Process the text using spaCy and Extract lemmatized tokens
raw_df['review_lemma'] = raw_df["review_nostopwords"].apply(lambda x: " ".join([word.lemma_ for word in nlp(x)]))

- Applying the text preprocessing steps in this specific order produced the cleanest dataset.

In [None]:
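# Verify your results on a random review
i = df.sample(1).index[0]
# i = 100
# i is an index label, so use .loc (labels are no longer contiguous after drop_duplicates)
print(raw_df['review_lemma'].loc[i])
print('###########################################################')
print(df['review'].loc[i])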

In [None]:
# Export the preprocessed dataset with pickle
raw_df.to_pickle(PREPROCESSED_PATH)

In [None]:
prep_df = pd.read_pickle(PREPROCESSED_PATH)

## Feature Engineering

### Text Representation

In [6]:
# Initialize
tf_idf = TfidfVectorizer(ngram_range=(1,2))

# Fitting
tf = tf_idf.fit_transform(raw_df['review_lemma'])

In [7]:
# Length of vocabulary
print(f"The Length of Tf-idf Vocabulary is {len(tf_idf.vocabulary_)}")

The Length of Tf-idf Vocabulary is 2750963


In [8]:
x = tf
y = raw_df['sentiment']

In [71]:
# IDF scores of words
idf_scores = tf_idf.idf_

# Print the IDF scores of words and the vocabulary
print("IDF Scores of Words:", idf_scores)

IDF Scores of Words: [ 9.17234598 11.11825613 11.11825613 ... 11.11825613 11.11825613
 11.11825613]
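
The repeated value 11.118 in the output above can be explained with scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below assumes `n = 49582` (the row count after dropping duplicates). The document frequency of 13 used for the second value is an inference that happens to match the first printed score, not something stated in the notebook.

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Formula used by scikit-learn when smooth_idf=True (the default)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 49582  # assumed: rows remaining after drop_duplicates

# A term that occurs in exactly one review gets the maximum possible IDF
print(round(smooth_idf(n_docs, 1), 4))

# A term that occurs in 13 reviews (inferred, see the note above)
print(round(smooth_idf(n_docs, 13), 4))
```

Most entries being at the maximum suggests that the bulk of the 2.75M n-grams occur in a single review, which is one argument for trying `min_df=2` or higher in `TfidfVectorizer`.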


In [34]:
# BAG of words
cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(raw_df['review'])
x = traindata
y = raw_df['sentiment']

In [35]:
x.shape

(49582, 2451230)
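
With `ngram_range=(1,2)` every adjacent word pair becomes its own column in addition to the single words, which is the likely reason the matrix has about 2.45 million columns. The following standard-library sketch shows how unigrams and bigrams are enumerated. It splits on whitespace for simplicity, whereas `CountVectorizer` lowercases and uses its own token pattern (which drops one-character tokens), so counts will differ from the real vectorizer. `ngrams` is a hypothetical helper.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

docs = ["not a good movie", "a good movie indeed"]

# Document frequency: in how many documents each n-gram appears
vocab = Counter()
for doc in docs:
    vocab.update(set(ngrams(doc.split())))

unigrams = [g for g in vocab if " " not in g]
bigrams = [g for g in vocab if " " in g]
print(len(unigrams), len(bigrams))
print(vocab["a good"])
```

Bigrams such as "not good" carry negation that unigrams miss, which is the usual motivation for including them in sentiment tasks despite the larger feature space.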

### Initialize MLflow

In [9]:
# Create Directories
Path(MLFLOW_TRACKING_URI).mkdir(parents=True, exist_ok=True)
Path(LOG_PATH).mkdir(parents=True, exist_ok=True)

In [10]:
# Initialize client and experiment
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
client = MlflowClient()
mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
exp = client.get_experiment_by_name(MLFLOW_EXPERIMENT_NAME)

In [70]:
x = raw_df.iloc[0:,0].values
y = raw_df.iloc[0:,1].values

In [23]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25,random_state = 42)

In [72]:
tf = TfidfVectorizer()
from sklearn.pipeline import Pipeline

In [73]:
from sklearn.linear_model import LogisticRegression
log_clf =LogisticRegression()
model=Pipeline([('vectorizer',tf),('Log_clf',log_clf)])

model.fit(x_train,y_train)

In [36]:
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(max_iter=10000)
log_clf.fit(x_train,y_train)

In [37]:
y_pred=log_clf.predict(x_test)

In [38]:
# model score (scikit-learn metrics expect y_true first, then y_pred)
accuracy=accuracy_score(y_test,y_pred)
print(accuracy)

0.8950467892868668


In [39]:
# confusion matrix
cm=confusion_matrix(y_test,y_pred)
print(cm)

[[5450  718]
 [ 583 5645]]


In [40]:
recall= recall_score(y_test, y_pred, average="binary", pos_label="negative")
precision = precision_score(y_test, y_pred, average="binary", pos_label="negative")
f1 = f1_score(y_test, y_pred, average="binary", pos_label="negative")
print(f"precision: {precision}, Recall: {recall}, F1: {f1}")

precision: 0.9033648267860103, Recall: 0.8835927367055771, F1: 0.8933693959511515
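
The scores above follow directly from the confusion matrix. As a worked check, the sketch below recomputes them by hand, assuming the usual scikit-learn layout (rows are true labels, columns are predictions, labels in sorted order: negative, then positive) and treating "negative" as the positive class, as `pos_label="negative"` does.

```python
# Confusion matrix copied from the output above
cm = [[5450, 718],
      [583, 5645]]

tp = cm[0][0]  # negative reviews predicted negative
fn = cm[0][1]  # negative reviews predicted positive
fp = cm[1][0]  # positive reviews predicted negative
tn = cm[1][1]  # positive reviews predicted positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Reproducing the reported numbers this way is a cheap sanity check that the metric calls used the intended label as the positive class.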


In [41]:
scores={'accuracy' : accuracy,
        'precision' : precision,
        'recall' : recall,
        'f1': f1}

In [42]:
scores

{'accuracy': 0.8950467892868668,
 'precision': 0.9033648267860103,
 'recall': 0.8835927367055771,
 'f1': 0.8933693959511515}

In [43]:
# Log the train and test data used by the model
log_data(x_train,y_train,x_test,y_test)
# Log the model and its description
model = log_model(log_clf,'Log ReG, BOW, 1,2 bigram')
# Log the model's scores (only test scores were computed, so they are passed for both arguments)
log_metrics(scores, scores)
# Track the model artifacts and test scores with MLflow
track_model(model,scores)

## Retrieve Runs

In [44]:
runs = mlflow.search_runs([exp.experiment_id])
runs[['run_id','tags.mlflow.runName','metrics.precision','metrics.recall','metrics.f1','metrics.accuracy']]

| | run_id | tags.mlflow.runName | metrics.precision | metrics.recall | metrics.f1 | metrics.accuracy |
|---|---|---|---|---|---|---|
| 0 | ffdb5813bf0a47318f9c67012a3cf6a6 | Log ReG, BOW, 1,2 bigram | 0.903365 | 0.883593 | 0.893369 | 0.895047 |
| 1 | 6b4b59c8d3fb429c934e7968d7bf7901 | Log ReG, BOW, 1,2 bigram,lemma | 0.903365 | 0.883593 | 0.893369 | 0.895047 |
| 2 | dacb0747953943b6ac752d36f2bcf578 | Log ReG, tf-idf, 1,2 bigram,lemma | 0.902224 | 0.861706 | 0.881499 | 0.884721 |
| 3 | 7b592fe69d134112a6c84fbdb5c8be4d | Log ReG, tf-idf, 1,2 bigram,no stopwords | 0.905906 | 0.88035 | 0.892945 | 0.894966 |
| 4 | 5ab3454d99494e939b27394a47c9fbe8 | Logistic Regression, tf-idf, 1,2 bigram, with ... | 0.904408 | 0.868191 | 0.885929 | 0.888754 |
| 5 | d2feaee50f68455781aec918adc780f2 | Logistic Regression, BOW, 1,2 bigram, with sto... | 0.913295 | 0.896563 | 0.904852 | 0.906179 |
| 6 | a8c1d3ee67f64b37bf587ea7dd4a2a30 | Logistic Regression, BOW, bigram | 0.879706 | 0.833495 | 0.855977 | 0.860439 |
| 7 | 36ca8b6129ad46e48c10794db3844bb0 | Logistic Regression, BOW, unigram | 0.889072 | 0.874514 | 0.881733 | 0.883269 |
| 8 | 50741e3566b344c0a60535724167a91e | Logistic Regression, BOW, | 0.903157 | 0.886025 | 0.894509 | 0.896015 |
| 9 | 1f4e81e7082c414dbf4acaafad7b6e4e | Logistic Regression, Tf-idf, lemma | 0.900868 | 0.875162 | 0.887829 | 0.889965 |


In [45]:
runs['tags.mlflow.runName'].iloc[5]

'Logistic Regression, BOW, 1,2 bigram, with stopwords '

### Exploratory Data Analysis
#### Questions we need to answer

In [None]:
raw_df

In [None]:
raw_df.describe()

In [None]:
# The temporary 'length' column was dropped earlier, so compute the length inline
raw_df.loc[raw_df['review'].str.len() == 32].iloc[0,0]

### Model Building

We can observe that both the logistic regression and multinomial naive Bayes models perform well compared to linear support vector machines.
We can still improve the accuracy of the models by further preprocessing the data and by using lexicon-based models such as TextBlob.

Word clouds

In [None]:
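#word cloud for positive review words
from wordcloud import WordCloud

plt.figure(figsize=(10,10))
# norm_train_reviews is not defined in this notebook; as an assumption, use the first positive cleaned review
positive_text=raw_df.loc[raw_df['sentiment'] == 'positive', 'review_cleaned'].iloc[0]
WC=WordCloud(width=1000,height=500,max_words=500,min_font_size=5)
positive_words=WC.generate(positive_text)
plt.imshow(positive_words,interpolation='bilinear')
plt.show()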

In [None]:
from sklearn.naive_bayes import MultinomialNB
MultinomialNB()