## 1. Import the packages

In [1]:
import os
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import joblib

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/geuse/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/geuse/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/geuse/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/geuse/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## 2. Load the data

In [3]:
def load_reviews(path, sentiment):
    reviews = []
    labels = []
    # Load all the reviews from the directory.
    for filename in os.listdir(path):
        if filename.endswith(".txt"):
            with open(os.path.join(path, filename), 'r', encoding='utf-8') as file:
                reviews.append(file.read())
                labels.append(sentiment)
    return reviews, labels


def load_imdb_data(base_path):
    # Define the paths for positive and negative reviews for both training and testing datasets.
    train_pos_path = os.path.join(base_path, 'train', 'pos')
    train_neg_path = os.path.join(base_path, 'train', 'neg')
    test_pos_path = os.path.join(base_path, 'test', 'pos')
    test_neg_path = os.path.join(base_path, 'test', 'neg')

    # Load the reviews and labels for each category.
    train_pos_reviews, train_pos_labels = load_reviews(train_pos_path, 1)
    train_neg_reviews, train_neg_labels = load_reviews(train_neg_path, 0)
    test_pos_reviews, test_pos_labels = load_reviews(test_pos_path, 1)
    test_neg_reviews, test_neg_labels = load_reviews(test_neg_path, 0)

    # Combine the reviews and labels for training and testing datasets.
    train_reviews = train_pos_reviews + train_neg_reviews
    train_labels = train_pos_labels + train_neg_labels
    test_reviews = test_pos_reviews + test_neg_reviews
    test_labels = test_pos_labels + test_neg_labels

    # Create pandas dataframes from the reviews and labels.
    train_data = pd.DataFrame({
        'review': train_reviews,
        'sentiment': train_labels
    })

    test_data = pd.DataFrame({
        'review': test_reviews,
        'sentiment': test_labels
    })
    return train_data, test_data

In [5]:
base_path = "./aclImdb"
train_data, test_data = load_imdb_data(base_path)

In [6]:
print(train_data.head())
print(test_data.head())

                                              review  sentiment
0  For a movie that gets no respect there sure ar...          1
1  Bizarre horror movie filled with famous faces ...          1
2  A solid, if unremarkable film. Matthau, as Ein...          1
3  It's a strange feeling to sit alone in a theat...          1
4  You probably all already know this by now, but...          1
                                              review  sentiment
0  Based on an actual story, John Boorman shows t...          1
1  This is a gem. As a Film Four production - the...          1
2  I really like this show. It has drama, romance...          1
3  This is the best 3-D experience Disney has at ...          1
4  Of the Korean movies I've seen, only three had...          1


## 3. Preprocess the data

In [7]:
def preprocess_review(review):
    # Convert to lowercase.
    review = review.lower()
    # Remove non-alphabet characters.
    review = re.sub('[^a-z]', ' ', review)
    # Tokenize the review.
    words = nltk.word_tokenize(review)
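    # Remove stop words. Build the set once per review: a set lookup is far
    # cheaper than re-reading the stop word list for every single word.
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]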
    # Lemmatize the words.
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the preprocessed words back into a string.
    review = ' '.join(words)
    return review

In [None]:
# Preprocess the reviews in the training and testing datasets.
train_data['review'] = train_data['review'].apply(preprocess_review)
test_data['review'] = test_data['review'].apply(preprocess_review)

In [5]:
print(train_data.head())
print(test_data.head())

                                              review  sentiment
0  movie get respect sure lot memorable quote lis...          1
1  bizarre horror movie filled famous face stolen...          1
2  solid unremarkable film matthau einstein wonde...          1
3  strange feeling sit alone theater occupied par...          1
4  probably already know additional episode never...          1
                                              review  sentiment
0  based actual story john boorman show struggle ...          1
1  gem film four production anticipated quality i...          1
2  really like show drama romance comedy rolled o...          1
3  best experience disney themeparks certainly be...          1
4  korean movie seen three really stuck first exc...          1


## 4. Save the preprocessed data

In [14]:
import pickle

# Save train_data.
with open('./train_data.pkl', 'wb') as f:
    pickle.dump(train_data, f)

# Save test_data.
with open('./test_data.pkl', 'wb') as f:
    pickle.dump(test_data, f)

In [4]:
import pickle

# Load train_data.
with open('./train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

# Load test_data.
with open('./test_data.pkl', 'rb') as f:
    test_data = pickle.load(f)

## 5. Model 1: TF-IDF + Logistic Regression
**TF-IDF Representation:** TF-IDF is a numerical statistic that reflects how important a word is to a document within a corpus. It combines the frequency of a word in a document (Term Frequency) with the inverse of the fraction of documents in the corpus that contain the word (Inverse Document Frequency). Words that appear in most documents get lower weights, while words concentrated in a few documents get higher weights. This down-weights the noise from very common words that stop-word removal did not catch and emphasizes words that are more likely to carry sentiment.
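
The weighting can be worked through by hand on a toy corpus. The sketch below uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults; the three documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the IMDB data).
docs = ["great movie great cast", "boring movie", "great fun"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term count times IDF, then scaled to unit L2 length.
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(tokenized[1])
print(weights)
```

In the second document, "boring" occurs in one document while "movie" occurs in two, so "boring" receives the larger weight even though both appear once.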

**Logistic Regression:** Logistic Regression is a simple yet effective linear model for binary classification. It works well with high-dimensional, sparse data, which makes it a good fit for text, where each word or n-gram is a feature. It is also cheap in computation and memory, which matters for large datasets. The model is fairly interpretable: the sign and magnitude of each feature's coefficient indicate how strongly that feature pushes a prediction toward the positive or negative class. For sentiment analysis, this helps show which words or n-grams are most influential.
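
As a rough illustration of how coefficients turn into predictions, the sketch below applies the logistic function to a weighted sum of features. The coefficient values and feature weights are hypothetical and are not read from the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients; a fitted model would expose its own in model.coef_.
coefficients = {"great": 2.1, "boring": -2.8, "movie": 0.05}
intercept = 0.0

def predict_positive_proba(features):
    """Probability of the positive class for a {term: tf-idf weight} mapping."""
    z = intercept + sum(coefficients.get(t, 0.0) * v for t, v in features.items())
    return sigmoid(z)

p_pos = predict_positive_proba({"great": 0.8, "movie": 0.6})
p_neg = predict_positive_proba({"boring": 0.8, "movie": 0.6})

# Rank terms by absolute coefficient to see which ones move predictions most.
ranked = sorted(coefficients, key=lambda t: abs(coefficients[t]), reverse=True)
print(p_pos, p_neg, ranked)
```

Terms with large positive coefficients push the probability toward 1, large negative ones push it toward 0, and near-zero ones such as "movie" barely matter.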

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectorizer.
vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1, 2))

# Fit the vectorizer to the training set reviews and transform them to vectors.
train_features = vectorizer.fit_transform(train_data['review'])

# Transform the test set reviews to vectors using the same vectorizer.
test_features = vectorizer.transform(test_data['review'])

In [15]:
from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model.
model = LogisticRegression(solver='liblinear')

# Train the model with the training set features and labels.
model.fit(train_features, train_data['sentiment'])

LogisticRegression(solver='liblinear')

In [18]:
from sklearn.metrics import log_loss

# Calculate the training accuracy using the model's score() method.
train_accuracy = model.score(train_features, train_data['sentiment'])

# Print the training accuracy.
print('The train accuracy of the Logistic Regression model is:', train_accuracy)

# Calculate the training loss.
train_prob = model.predict_proba(train_features)
train_loss = log_loss(train_data['sentiment'], train_prob)

# Print the training loss.
print('The train loss of the Logistic Regression model is:', train_loss)

The train accuracy of the Logistic Regression model is: 0.94528
The train loss of the Logistic Regression model is: 0.26181187270627615


In [19]:
from sklearn.metrics import accuracy_score

# Use the trained model to predict the sentiment of the test set reviews.
predictions = model.predict(test_features)

# Calculate the prediction accuracy.
accuracy = accuracy_score(test_data['sentiment'], predictions)

print('The test accuracy of the Logistic Regression model is:', accuracy)

The test accuracy of the Logistic Regression model is: 0.88604


In [8]:
# Save the model.
joblib.dump(model, 'logistic_regression_model.pkl')

# Save the vectorizer.
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']

**Comment:** The training accuracy is high at 0.94528, so the model fits the training set well. The test accuracy is lower at 0.88604, a gap of about 0.06 that suggests a moderate degree of overfitting: the model has learned the training data well but generalizes somewhat less well to unseen data. The training log loss is relatively low at 0.26181, which also indicates a good fit to the training set.
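
To give the loss value some intuition, here is a minimal sketch of binary log loss written out by hand. The function name and the toy labels and probabilities are hypothetical; it takes positive-class probabilities only, whereas `predict_proba` returns one column per class.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Average negative log-probability assigned to the true label."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)

toy_loss = binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6])

# exp(-loss) is the geometric mean of the probability given to the true label.
reported_train_loss = 0.26181187270627615
typical_true_label_prob = math.exp(-reported_train_loss)
print(toy_loss, typical_true_label_prob)
```

Read this way, a training loss of about 0.26 means the model assigns the correct label a probability of roughly 0.77 in the geometric-mean sense.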

## 6. Model 2: TF-IDF + SVM
**Support Vector Machine (SVM):** SVM is a powerful machine learning model for classification tasks. It is designed to find the best hyperplane or decision boundary that can separate classes in a higher-dimensional space, which is ideal for high-dimensional data like text data. SVMs are effective in high-dimensional spaces, even in cases where the number of dimensions exceeds the number of samples. This makes them a good choice for text classification problems where each word or n-gram can be considered as a separate dimension.

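**Handling Non-linearities:** Unlike Logistic Regression, an SVM can model non-linear decision boundaries through the kernel trick, which can capture more complex relationships between data points and may improve performance in some cases. Note that the model below uses `kernel='linear'`, so it learns a linear decision boundary just as Logistic Regression does; the two differ mainly in their loss functions.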

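**Robustness:** SVMs tend to be fairly robust against overfitting in high-dimensional spaces because they maximize the margin, i.e., the distance between the decision boundary and the closest points of each class. This does not rule out overfitting, and the regularization parameter `C` still controls the trade-off between a wide margin and fitting every training point.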
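
The margin idea can be sketched for a linear SVM with a few lines of arithmetic. The weight vector, bias, and points below are hypothetical, and labels are written as -1/+1 (the usual SVM convention) instead of the 0/1 labels used in this notebook.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(y, score):
    """Zero when the point is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)

# Hypothetical separating hyperplane in two dimensions.
w, b = [2.0, -1.0], -0.5

# The margin width is 2 / ||w||, so smaller weights mean a wider margin.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

points = [([1.5, 0.5], +1), ([0.6, 0.2], +1), ([0.0, 1.0], -1)]
losses = [hinge_loss(y, decision_value(w, b, x)) for x, y in points]
print(margin_width, losses)
```

Only the second point, which is classified correctly but sits inside the margin, contributes a non-zero loss; points far from the boundary do not influence the solution.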

In [8]:
from sklearn import svm

# Create a SVM model.
model = svm.SVC(kernel='linear')

# Train the model with the training set features and labels.
model.fit(train_features, train_data['sentiment'])

SVC(kernel='linear')

In [20]:
from sklearn.metrics import accuracy_score

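# Reload the saved SVM model (written by the joblib.dump cell below).
# Skip this line when running the notebook top to bottom in a single session.
model = joblib.load('svm_model.pkl')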

train_accuracy = model.score(train_features, train_data['sentiment'])

# Print the training accuracy.
print('The train accuracy of the SVM model is:', train_accuracy)

The train accuracy of the SVM model is: 0.9818


In [10]:
from sklearn.metrics import accuracy_score

# Predict the sentiment of the test set reviews using the SVM model.
predictions = model.predict(test_features)

# Calculate the prediction accuracy.
accuracy = accuracy_score(test_data['sentiment'], predictions)

print('The test accuracy of the SVM model is:', accuracy)

The test accuracy of the SVM model is: 0.88652


In [13]:
joblib.dump(model, 'svm_model.pkl')

['svm_model.pkl']

**Comment:** The SVM model has an even higher training accuracy of 0.9818, so it fits the training data very closely. Its test accuracy of 0.88652 is only marginally higher than the Logistic Regression model's 0.88604. The train-test gap (about 0.095) is larger than for Logistic Regression (about 0.059), which indicates more overfitting, and the 0.0005 difference in test accuracy is too small to conclude that the SVM generalizes better.
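
A quick back-of-the-envelope check puts the two test accuracies in perspective. The test set size of 25,000 is an assumption based on the standard aclImdb split (it is consistent with the reported accuracies), and the binomial standard error is only a rough yardstick because both models are scored on the same reviews.

```python
import math

n_test = 25000      # assumption: size of the standard aclImdb test split
acc_lr = 0.88604    # Logistic Regression test accuracy reported above
acc_svm = 0.88652   # SVM test accuracy reported above

# How many more reviews the SVM classified correctly.
extra_correct = round((acc_svm - acc_lr) * n_test)

# Approximate standard error of a single accuracy estimate.
std_err = math.sqrt(acc_lr * (1 - acc_lr) / n_test)

print(extra_correct, std_err)
```

Under that assumption the SVM gets about a dozen more reviews right, and the accuracy difference is well below one standard error (about 0.002).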

## 7. Model 3: LSTM
**Sequence Understanding:** Text data is essentially a sequence of words. LSTMs are designed for such sequence data: their gating mechanisms (input gate, forget gate, and output gate) decide what information to keep and what to discard as the sequence is read. This ability to "remember" earlier parts of the sequence helps capture context and mitigates the difficulty that plain recurrent networks have with long-term dependencies.
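
The gating mechanism can be illustrated with a single step of a scalar LSTM cell. This is a simplified sketch with hand-picked, hypothetical weights; a real layer such as `LSTM(32)` uses weight matrices and learns them during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, bias = p[gate]
        return w_x * x + w_h * h_prev + bias
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate new information
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate open, input gate closed.
keep = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
        "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
# Same parameters but with the forget gate closed.
drop = dict(keep, forget=(0.0, 0.0, -10.0))

_, c_keep = lstm_step(0.5, 0.0, 1.0, keep)
_, c_drop = lstm_step(0.5, 0.0, 1.0, drop)
print(c_keep, c_drop)
```

With the forget gate open the previous cell state of 1.0 passes through almost unchanged, and with it closed the state is wiped to nearly zero.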

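**Context Capturing:** Unlike bag-of-words models such as the TF-IDF-based Logistic Regression or SVM above, an LSTM reads the words in order and can use the context of the whole sequence, not just individual words or n-grams. This lets it pick up on nuances that affect sentiment, such as negation and, to some extent, sarcasm. Note that the stop-word removal in Section 3 also drops negation words (for example, "gets no respect" becomes "get respect"), which limits how much of this the model can exploit here.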

**Handling Variable-Length Sequences:** LSTMs can in principle process sequences of any length, so reviews of different lengths do not require a fixed number of features as they do with traditional machine learning models. In practice, batched training needs sequences of equal length, so the code below pads or truncates every review to `max_len = 100` tokens.
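
The padding and truncation step can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical; it mimics what I understand to be the default behavior of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`), where zeros are added at the front and the earliest tokens are dropped.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value` or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                       # drop the earliest tokens
    return [value] * (maxlen - len(seq)) + seq      # pad at the front

short_review = [4, 8, 15]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded = pad_or_truncate(short_review, maxlen=5)
truncated = pad_or_truncate(long_review, maxlen=5)
print(padded, truncated)
```

With `max_len = 100`, a long review is therefore represented by its last 100 tokens only.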

**Model Complexity:** LSTM networks are capable of modeling complex patterns and interactions between words, which can be particularly useful in sentiment analysis where the meaning of a particular word can depend heavily on its surrounding words.

**End-to-End Learning:** LSTM allows for end-to-end learning, where the model learns the best representation for the task at hand by itself. You do not have to manually engineer features (like TF-IDF).

In [23]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Set the maximum size of the vocabulary.
max_words = 10000
# Set the maximum length for each review.
max_len = 100

# Create a tokenizer, set the maximum size of the vocabulary.
tokenizer = Tokenizer(num_words=max_words)
# Fit the tokenizer using the training set reviews.
tokenizer.fit_on_texts(train_data['review'])

# Transform the reviews to sequences of integers using the tokenizer.
train_sequences = tokenizer.texts_to_sequences(train_data['review'])
test_sequences = tokenizer.texts_to_sequences(test_data['review'])

# Pad or truncate the sequences to the same length.
train_sequences = pad_sequences(train_sequences, maxlen=max_len)
test_sequences = pad_sequences(test_sequences, maxlen=max_len)

In [26]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, BatchNormalization, Dropout
from keras import regularizers

# Define a LSTM model.
model = Sequential()
model.add(Embedding(max_words, 50, input_length=max_len))
model.add(Dropout(0.5))
model.add(LSTM(32))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# Compile the model.
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

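# Train the model.
# Note: validation_split holds out the last 20% of the rows before any shuffling.
# train_data lists all positive reviews first, so the held-out rows are all negative
# unless the data is shuffled beforehand.
history = model.fit(train_sequences, train_data['sentiment'], epochs=10, batch_size=128, validation_split=0.2)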

# Retrieve the accuracy history from the training process.
accuracy_history = history.history['acc']

# Get the training accuracy of the last epoch.
final_training_accuracy = accuracy_history[-1]

# Print the final training accuracy.
print('The final training accuracy of the LSTM model is:', final_training_accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
The final training accuracy of the LSTM model is: 0.9559500217437744


In [27]:
test_loss, test_acc = model.evaluate(test_sequences, test_data['sentiment'])
print('The test accuracy of the LSTM model is:', test_acc)

The test accuracy of the LSTM model is: 0.8497200012207031


**Comment:** The LSTM model shows a different pattern. Over the 10 epochs of training, the training loss decreases steadily and the training accuracy increases steadily, so the model keeps fitting the training data better. The final training accuracy is 0.9560 (measured on the 80% of the training set left after `validation_split=0.2`), which is in the same range as the Logistic Regression and SVM models. However, the test accuracy of 0.8497 is about 0.036 lower than that of the other two models. The validation loss and accuracy fluctuate across epochs, suggesting that the model is overfitting to the training data and may not generalize as well to new, unseen data.

## 8. Conclusion
In conclusion, all three models reached high accuracy on the training data. All of them also showed signs of overfitting to varying degrees, as indicated by test accuracy being lower than training accuracy. Strategies such as stronger regularization, early stopping, or data augmentation could help reduce it.
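
Early stopping, for instance, comes down to a small amount of bookkeeping on the validation loss. The sketch below shows the patience logic in plain Python; the function name is hypothetical and the loss values are made up for illustration, not taken from the training run above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch   # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1  # ran to the end without triggering

# Hypothetical validation losses over six epochs.
val_losses = [0.60, 0.45, 0.40, 0.42, 0.44, 0.50]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)
```

Training would halt at epoch index 4 and the weights from epoch index 2 would be the ones to keep.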

If interpretability is important, Logistic Regression might be the best choice, as its coefficients show how each feature contributes to the prediction. SVMs and LSTMs are less interpretable but may provide better performance on more complex tasks or larger datasets; in this experiment, the linear SVM matched Logistic Regression on the test set and the LSTM scored lowest. LSTMs are particularly well suited to sequence data and might outperform the other models on tasks where word order matters. However, they are more computationally expensive to train and to use for prediction.