In this notebook I walk through building a sentiment analyzer for the well-known IMDB Movie Review dataset, freely available at "https://ai.stanford.edu/~amaas/data/sentiment/". After downloading the compressed dataset file "aclImdb_v1.tar.gz", I extracted it into a new folder named "aclImdb". This folder includes a "train" directory with 25,000 movie reviews for training the sentiment analyzer and a "test" directory with 25,000 movie reviews for testing the model's performance. I then followed these steps:

Step 1: First, we load the dataset:

In [1]:
import os
import pandas as pd

def load_imdb_dataset(dataset_dir):
    train_dir = os.path.join(dataset_dir, 'train')
    test_dir = os.path.join(dataset_dir, 'test')
    
    # Load the training data
    train_texts = []
    train_labels = []
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(train_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    train_texts.append(f.read())
                train_labels.append(0 if label_type == 'neg' else 1)

    # Load the test data
    test_texts = []
    test_labels = []
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(test_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    test_texts.append(f.read())
                test_labels.append(0 if label_type == 'neg' else 1)

    # Convert the data to Pandas DataFrames
    train_df = pd.DataFrame({'text': train_texts, 'label': train_labels})
    test_df = pd.DataFrame({'text': test_texts, 'label': test_labels})

    return train_df, test_df

# Load the IMDb dataset
train_df, test_df = load_imdb_dataset('aclImdb')
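
To make the expected folder layout concrete, here is a minimal sketch that builds a tiny stand-in for "aclImdb/train" in a temporary directory and reads it back. The helper `load_split` and the two sample reviews are hypothetical and only for illustration. It is a more compact variant of the loops above, not a replacement for them. Like the loader above, it only reads the "neg" and "pos" folders, so the unlabeled "unsup" folder that ships inside "train" is skipped.

```python
import tempfile
from pathlib import Path

def load_split(split_dir):
    """Return (texts, labels) for one split; 0 = negative, 1 = positive."""
    texts, labels = [], []
    for label, name in enumerate(['neg', 'pos']):
        # Only the labeled folders are read; anything else (e.g. 'unsup') is ignored
        for path in sorted(Path(split_dir, name).glob('*.txt')):
            texts.append(path.read_text(encoding='utf-8'))
            labels.append(label)
    return texts, labels

# Build a tiny fake dataset with the same layout as aclImdb/train
with tempfile.TemporaryDirectory() as root:
    samples = [('neg', 'Dull and far too long.'), ('pos', 'A delightful surprise.')]
    for name, review in samples:
        folder = Path(root, 'train', name)
        folder.mkdir(parents=True)
        (folder / '0_1.txt').write_text(review, encoding='utf-8')
    Path(root, 'train', 'unsup').mkdir()  # present in the real dataset, never read here

    demo_texts, demo_labels = load_split(Path(root, 'train'))

print(demo_labels)
```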


Step 2: After loading the dataset, we preprocess the data by removing HTML tags, punctuation, numbers, and stop words, then stemming and lemmatizing the remaining words:

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import re

# Download the NLTK resources, then initialize the stop-word list, stemmer, and lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Define a function for preprocessing the text data
def preprocess_text(text):
    # Remove HTML tags
    text = re.sub('<[^>]*>', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and numbers
    text = re.sub('[^a-z]+', ' ', text)
    
    # Tokenize the text into words
    words = text.split()
    
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    
    # Stem the words
    words = [stemmer.stem(word) for word in words]
    
    # Lemmatize the words
    words = [lemmatizer.lemmatize(word) for word in words]
    
    # Join the words back into a string
    text = ' '.join(words)
    
    return text

# Preprocess the training data
train_df['text'] = train_df['text'].apply(preprocess_text)

# Preprocess the test data
test_df['text'] = test_df['text'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
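
To see what the regex cleaning steps do to a single review, here is a small sketch that runs them on a made-up sentence. It leaves out NLTK entirely: `DEMO_STOP_WORDS` is a hypothetical handful of words standing in for NLTK's English list, and stemming and lemmatizing are skipped. One deliberate difference from the cell above is labeled in the code: tags are replaced with a space rather than an empty string, because IMDB reviews often contain `<br />` directly between two words, and deleting the tag would merge them ("thin<br />but" would become "thinbut").

```python
import re

# Hypothetical tiny stop-word set, standing in for NLTK's English list
DEMO_STOP_WORDS = {'the', 'was', 'but', 'i', 'it', 'a'}

def clean_text(text):
    # Assumption: replace tags with a space so neighbouring words do not merge
    text = re.sub(r'<[^>]*>', ' ', text)
    # Lowercase before filtering so 'LOVED' and 'loved' are treated alike
    text = text.lower()
    # Collapse every run of non-letters (punctuation, digits, spaces) into one space
    text = re.sub(r'[^a-z]+', ' ', text)
    # Split on whitespace and drop stop words
    return [word for word in text.split() if word not in DEMO_STOP_WORDS]

sample = 'The plot was thin<br />but I LOVED it, 10/10!'
tokens = clean_text(sample)
print(tokens)  # ['plot', 'thin', 'loved']
```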


Step 3: After loading and cleaning the data, we vectorize the text using TF-IDF (scikit-learn's TfidfVectorizer), keeping the 10,000 most frequent terms:

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

# Fit and transform the training data
train_vectors = vectorizer.fit_transform(train_df['text'])

# Transform the test data
test_vectors = vectorizer.transform(test_df['text'])
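
For intuition about the numbers inside these vectors, here is a hand-rolled TF-IDF computation on a two-document toy corpus. It is a sketch that assumes scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The corpus is made up, and the sketch tokenizes with a plain `split()`, so it does not reproduce TfidfVectorizer's own tokenization, stop-word removal, or `max_features` cut-off.

```python
import math
from collections import Counter

docs = ['great film great cast', 'dull film']  # toy corpus, for illustration only
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]  # L2-normalized row

rows = [tfidf_row(doc) for doc in tokenized]
for term, weight in zip(vocab, rows[0]):
    print(f'{term}: {weight:.3f}')
```

A term such as "film" that occurs in every document gets the minimum IDF of 1, so it ends up with a lower weight than a term like "great" that is both frequent in the review and rarer across the corpus.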


Step 4: After vectorizing the data, we train and evaluate the sentiment analyzer with three machine learning algorithms: Naive Bayes, Logistic Regression, and a Support Vector Machine:

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Define the machine learning models
nb_model = MultinomialNB()
lr_model = LogisticRegression(max_iter=1000)
svm_model = LinearSVC(max_iter=1000)

# Train and evaluate the models
models = [nb_model, lr_model, svm_model]
model_names = ['Naive Bayes', 'Logistic Regression', 'Support Vector Machine']

for i, model in enumerate(models):
    # Train the model
    model.fit(train_vectors, train_df['label'])

    # Make predictions on the test data
    y_pred = model.predict(test_vectors)

    # Evaluate the model performance
    accuracy = accuracy_score(test_df['label'], y_pred)
    precision = precision_score(test_df['label'], y_pred)
    recall = recall_score(test_df['label'], y_pred)

    print(f'{model_names[i]}: Accuracy = {accuracy:.3f}, Precision = {precision:.3f}, Recall = {recall:.3f}')


Naive Bayes: Accuracy = 0.824, Precision = 0.851, Recall = 0.785
Logistic Regression: Accuracy = 0.875, Precision = 0.874, Recall = 0.875
Support Vector Machine: Accuracy = 0.855, Precision = 0.865, Recall = 0.842
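
To make the three reported metrics less of a black box, here is a sketch that computes them by hand from true and predicted labels, treating label 1 (positive review) as the positive class, which matches the default behaviour of `precision_score` and `recall_score` for binary labels. The helper `binary_metrics` and the eight example labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f'Accuracy = {accuracy:.3f}, Precision = {precision:.3f}, Recall = {recall:.3f}')
# Accuracy = 0.625, Precision = 0.600, Recall = 0.750
```

Read this way, the Naive Bayes row above (precision higher than recall) suggests that model is comparatively cautious about predicting "positive": it misses more positive reviews than it mislabels negative ones.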


Step 5: We finish by deploying the sentiment analyzer behind a Gradio interface, where it accepts a new review as input and classifies it as positive or negative:

In [7]:
import gradio as gr

def predict_sentiment(review):
    # Preprocess the review
    review = preprocess_text(review)
    # Vectorize the preprocessed review
    review_vector = vectorizer.transform([review])
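    # Predict with the logistic regression model (best accuracy in step 4);
    # the bare name `model` would be whatever the training loop assigned last (the SVM)
    if lr_model.predict(review_vector)[0] == 0: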
        sentiment = "Negative"
    else:
        sentiment = "Positive"
    return sentiment

# Create a Gradio interface
iface = gr.Interface(fn=predict_sentiment, 
                     inputs=gr.Textbox(lines=10, label="Enter review here"),
                     outputs="text",
                     title="IMDB Movie Review Sentiment Analyzer",
                     description="Analyze the sentiment of a movie review as positive or negative.")

# Launch the interface
iface.launch()




Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
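
The prediction function above depends on several notebook globals (the fitted vectorizer, a trained model, and the preprocessing function). One possible way to reduce that coupling, sketched here under stated assumptions, is to bundle the vectorizer and classifier into a single scikit-learn pipeline so the deployed function calls one object. The four toy reviews and the names `demo_pipeline` and `predict_sentiment_demo` are hypothetical, and the text cleaning from step 2 is left out to keep the sketch short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (1 = positive, 0 = negative)
toy_reviews = [
    'great wonderful film', 'loved great acting',
    'terrible boring film', 'awful boring plot',
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
demo_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
demo_pipeline.fit(toy_reviews, toy_labels)

def predict_sentiment_demo(review):
    # The pipeline vectorizes the raw string before classifying it
    label = demo_pipeline.predict([review])[0]
    return 'Positive' if label == 1 else 'Negative'

print(predict_sentiment_demo('great wonderful'))
print(predict_sentiment_demo('awful boring'))
```

A function like this could be passed to `gr.Interface` as `fn` in the same way as `predict_sentiment` above.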


