# Hack The Box Skills Assessment: Sentiment Analysis on IMDB Movie Reviews

## Project Overview

This project serves as the skills assessment for the **Applications of AI in Infosec** module, part of the **Hack The Box AI Red Teamer Path**. The module is designed to equip learners with essential data science and machine learning skills, focusing on applications within cybersecurity.

For this assessment, I am working with the **IMDB dataset** introduced by Maas et al. (2011). This dataset consists of 50,000 movie reviews extracted from the Internet Movie Database (IMDB), annotated for sentiment analysis. The reviews are balanced with an equal number of positive and negative examples and are divided into training and test sets. This curated mixture of reviews makes the dataset a valuable resource for evaluating various natural language processing (NLP) techniques and machine learning models for sentiment classification tasks.

The objective of this project is to develop a model capable of predicting whether a given movie review is **positive (1)** or **negative (0)**. The IMDB dataset has become a cornerstone for research in NLP, particularly for developing word representations like word embeddings. By leveraging this dataset, I aim to benchmark and optimize machine learning models in the domain of sentiment classification.

## Model Evaluation and Flag Submission

After training and evaluating the model, I will upload it to the **Hack The Box Playground VM** evaluation portal for testing. If the model meets the specified performance criteria, a **flag value** will be generated. This flag serves as confirmation of the model's success and completion of the project.

---


In [2]:
#!pip install tensorflow keras numpy pandas matplotlib seaborn scikit-learn

# Download the Dataset and Unzip

---

In [4]:
# Downloading the Dataset
import requests
import zipfile
import io

# URL of the dataset
url = "https://academy.hackthebox.com/storage/modules/292/skills_assessment_data.zip"

# Download the dataset
response = requests.get(url)
if response.status_code == 200:
    print("Download Successful.")
else:
    print("Failed to download the dataset.")

Download Successful.


In [5]:
# Unzipping the dataset
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("skills_assessment_data")
    print("Extraction Complete")

# Verify complete extraction
import os
extracted_files = os.listdir("skills_assessment_data")
print("Extracted files: ", extracted_files)

Extraction Complete
Extracted files:  ['test.json', 'train.json']


# Load and Inspect the Dataset
We load the training and test JSON files as UTF-8 and convert each to a DataFrame for initial exploration. The `text` column holds the movie review, and `label` holds the sentiment: `1` for positive and `0` for negative.

---

In [7]:
import json
import pandas as pd

# Load training data
with open("skills_assessment_data/train.json", "r", encoding="utf-8") as f:
    train_data = json.load(f)

# Load testing data
with open("skills_assessment_data/test.json", "r", encoding="utf-8") as f:
    test_data = json.load(f)

# Convert to two DataFrames, one for the train set and one for the test set
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Display some examples
print(train_df.head())
print(test_df.head())

                                                text  label
0  Bromwell High is a cartoon comedy. It ran at t...      1
1  Homelessness (or Houselessness as George Carli...      1
2  Brilliant over-acting by Lesley Ann Warren. Be...      1
3  This is easily the most underrated film inn th...      1
4  This is not the typical Mel Brooks film. It wa...      1
                                                text  label
0  I went and saw this movie last night after bei...      1
1  Actor turned director Bill Paxton follows up h...      1
2  As a recreational golfer with some knowledge o...      1
3  I saw this film in a sneak preview, and it is ...      1
4  Bill Paxton has taken the true story of the 19...      1
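
The first five rows of both frames all carry label 1, which says nothing about the overall class balance. Below is a minimal sketch of a balance check, assuming each JSON file holds records with `text` and `label` keys. The `records` list and the `label_share` helper are hypothetical stand-ins, not taken from the dataset or the notebook.

```python
from collections import Counter

# Hypothetical stand-in for the output of json.load(...); the real files
# are assumed to contain records shaped like these.
records = [
    {"text": "loved it", "label": 1},
    {"text": "hated it", "label": 0},
    {"text": "great fun", "label": 1},
    {"text": "dull and slow", "label": 0},
]

def label_share(rows):
    """Return the fraction of rows carrying each label."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

shares = label_share(records)
print(shares)
```

On the actual DataFrames, `train_df["label"].value_counts(normalize=True)` reports the same kind of shares.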


#### Checking for missing values, which can skew results and reduce data quality
We find no missing values.

---

In [9]:
print("Missing values:\n", train_df.isnull().sum())
print("Missing values:\n", test_df.isnull().sum())

Missing values:
 text     0
label    0
dtype: int64
Missing values:
 text     0
label    0
dtype: int64


#### Checking for duplicate rows, for the same reason
Both sets contain duplicate rows (96 in the training set and 199 in the test set), so we drop them.

---

In [11]:
print("Duplicate entries:\n", train_df.duplicated().sum())
print("Duplicate entries:\n", test_df.duplicated().sum())

Duplicate entries:
 96
Duplicate entries:
 199


In [12]:
# Removing duplicate entries from both training and test datasets
train_df = train_df.drop_duplicates()
test_df = test_df.drop_duplicates()

In [13]:
# Confirmation of deletion
print("Duplicate entries:\n", train_df.duplicated().sum())
print("Duplicate entries:\n", test_df.duplicated().sum())

Duplicate entries:
 0
Duplicate entries:
 0
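
With default arguments, `drop_duplicates()` keeps the first occurrence of each fully identical row and drops later copies. The same rule can be sketched in plain Python; the rows and the `drop_duplicate_rows` helper below are hypothetical and only illustrate the behavior.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each (text, label) pair, preserving order."""
    seen = set()
    unique = []
    for text, label in rows:
        key = (text, label)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique

# Hypothetical rows: the first and third are exact duplicates, while the
# fourth repeats the text under a different label.
rows = [("great film", 1), ("awful film", 0), ("great film", 1), ("great film", 0)]
deduped = drop_duplicate_rows(rows)
print(len(rows) - len(deduped), "duplicate(s) removed")
```

A review whose text repeats under a different label is not a duplicate by this rule. Catching that case in pandas would need something like `drop_duplicates(subset="text")`, which is a different design choice from the one made in this notebook.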


# Preprocess the Text Data
This step standardizes the text, reduces noise, and extracts meaningful features, all of which tend to improve performance. It relies on the NLTK library for tokenization, stop-word removal, and stemming, in preparation for the Naive Bayes sentiment classifier.

---

In [15]:
import nltk

# Download the necessary NLTK data files
#nltk.download("punkt") # tokenization
#nltk.download("punkt_tab")
#nltk.download("stopwords") # Stop words

print("=== BEFORE ANY PREPROCESSING ===")
print(train_df.head(5))
print(test_df.head(5))

=== BEFORE ANY PREPROCESSING ===
                                                text  label
0  Bromwell High is a cartoon comedy. It ran at t...      1
1  Homelessness (or Houselessness as George Carli...      1
2  Brilliant over-acting by Lesley Ann Warren. Be...      1
3  This is easily the most underrated film inn th...      1
4  This is not the typical Mel Brooks film. It wa...      1
                                                text  label
0  I went and saw this movie last night after bei...      1
1  Actor turned director Bill Paxton follows up h...      1
2  As a recreational golfer with some knowledge o...      1
3  I saw this film in a sneak preview, and it is ...      1
4  Bill Paxton has taken the true story of the 19...      1


#### Text Cleaning
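We remove unnecessary elements from the reviews, such as HTML tags, special characters, and extra whitespace, and convert all text to lowercase for uniformity. Note that the output recorded for this cell already looks stop-word-filtered and stemmed (for example "comedi" and "movi"); the cell's execution count suggests it was re-run after the later steps, so that output does not show the result of cleaning alone.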

---

In [69]:
import re

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and numbers
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

# Apply preprocessing only to non-null entries
train_df['text'] = train_df['text'].dropna().apply(clean_text)
test_df['text'] = test_df['text'].dropna().apply(clean_text)

print("\n=== AFTER REMOVING NON-NECESSARY ELEMENTS AND LOWERCASING TEXT ===")
print(train_df["text"].head(5))
print(test_df["text"].head(5))


=== AFTER REMOVING NON-NECESSARY ELEMENTS AND LOWERCASING TEXT ===
0    bromwel high cartoon comedi ran time program s...
1    homeless houseless georg carlin state issu yea...
2    brilliant overact lesley ann warren best drama...
3    easili underr film inn brook cannon sure flaw ...
4    typic mel brook film much less slapstick movi ...
Name: text, dtype: object
0    went saw movi last night coax friend mine ill ...
1    actor turn director bill paxton follow promis ...
2    recreat golfer knowledg sport histori pleas di...
3    saw film sneak preview delight cinematographi ...
4    bill paxton taken true stori us golf open made...
Name: text, dtype: object
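
One detail of `clean_text` is worth checking by hand: replacing a tag with an empty string joins the words on either side of it. The sketch below compares that behavior with a variant that substitutes a space instead. The `clean_text_spaced` function and the sample string are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_text(text):
    """The cleaning steps from the notebook cell, unchanged."""
    text = re.sub(r'<.*?>', '', text)        # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and numbers
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_spaced(text):
    """Illustrative variant: replace tags and symbols with a space."""
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

sample = "Great film.<br /><br />Loved it!"
print(clean_text(sample))         # great filmloved it
print(clean_text_spaced(sample))  # great film loved it
```

The spaced variant has its own trade-off: a contraction such as "don't" becomes the two tokens "don" and "t" rather than "dont". Which behavior suits this dataset better would need to be measured.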


#### Tokenization
Break down each review into individual words or tokens.

---

In [21]:
from nltk.tokenize import word_tokenize

# Split each message into individual tokens
train_df["text"] = train_df["text"].apply(word_tokenize)
test_df["text"] = test_df["text"].apply(word_tokenize)

print("\n=== AFTER TOKENIZATION ===")
print(train_df["text"].head(5))
print(test_df["text"].head(5))


=== AFTER TOKENIZATION ===
0    [bromwell, high, is, a, cartoon, comedy, it, r...
1    [homelessness, or, houselessness, as, george, ...
2    [brilliant, overacting, by, lesley, ann, warre...
3    [this, is, easily, the, most, underrated, film...
4    [this, is, not, the, typical, mel, brooks, fil...
Name: text, dtype: object
0    [i, went, and, saw, this, movie, last, night, ...
1    [actor, turned, director, bill, paxton, follow...
2    [as, a, recreational, golfer, with, some, know...
3    [i, saw, this, film, in, a, sneak, preview, an...
4    [bill, paxton, has, taken, the, true, story, o...
Name: text, dtype: object


#### Stop Word Removal
Eliminate common words like "is," "the," "and" that don’t contribute much meaning.

----

In [23]:
from nltk.corpus import stopwords

# Define a set of English stop words and remove them from the tokens
stop_words = set(stopwords.words("english"))
train_df["text"] = train_df["text"].apply(lambda x: [word for word in x if word not in stop_words])
test_df["text"] = test_df["text"].apply(lambda x: [word for word in x if word not in stop_words])

print("\n=== AFTER REMOVING STOP WORDS ===")
print(train_df["text"].head(5))
print(test_df["text"].head(5))


=== AFTER REMOVING STOP WORDS ===
0    [bromwell, high, cartoon, comedy, ran, time, p...
1    [homelessness, houselessness, george, carlin, ...
2    [brilliant, overacting, lesley, ann, warren, b...
3    [easily, underrated, film, inn, brooks, cannon...
4    [typical, mel, brooks, film, much, less, slaps...
Name: text, dtype: object
0    [went, saw, movie, last, night, coaxed, friend...
1    [actor, turned, director, bill, paxton, follow...
2    [recreational, golfer, knowledge, sports, hist...
3    [saw, film, sneak, preview, delightful, cinema...
4    [bill, paxton, taken, true, story, us, golf, o...
Name: text, dtype: object
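
Stop-word removal can change sentiment-bearing phrases: NLTK's English list includes negations such as "not" and "no", so "not good" is reduced to "good". Below is a minimal sketch of keeping negations, using a small hypothetical stop-word set in place of the NLTK list. `STOP_WORDS`, `NEGATIONS`, and `remove_stop_words` are names invented for this illustration.

```python
# Small hypothetical stop-word set standing in for stopwords.words("english").
STOP_WORDS = {"is", "the", "and", "a", "this", "was", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stop_words(tokens, keep=frozenset()):
    """Drop stop words, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = ["this", "film", "was", "not", "good"]
print(remove_stop_words(tokens))                  # ['film', 'good']
print(remove_stop_words(tokens, keep=NEGATIONS))  # ['film', 'not', 'good']
```

Whether keeping negations helps on this dataset is an open question that would need a re-run of the evaluation. With bigrams enabled in the vectorizer, a retained "not good" would at least become its own feature.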


#### Stemming
Reduce words to their base or root form for consistency. For example, "running" becomes "run."

---

In [25]:
from nltk.stem import PorterStemmer

# Stem each token to reduce words to their base form
stemmer = PorterStemmer()
train_df["text"] = train_df["text"].apply(lambda x: [stemmer.stem(word) for word in x])
test_df["text"] = test_df["text"].apply(lambda x: [stemmer.stem(word) for word in x])

print("\n=== AFTER STEMMING ===")
print(train_df["text"].head(5))
print(test_df["text"].head(5))


=== AFTER STEMMING ===
0    [bromwel, high, cartoon, comedi, ran, time, pr...
1    [homeless, houseless, georg, carlin, state, is...
2    [brilliant, overact, lesley, ann, warren, best...
3    [easili, underr, film, inn, brook, cannon, sur...
4    [typic, mel, brook, film, much, less, slapstic...
Name: text, dtype: object
0    [went, saw, movi, last, night, coax, friend, m...
1    [actor, turn, director, bill, paxton, follow, ...
2    [recreat, golfer, knowledg, sport, histori, pl...
3    [saw, film, sneak, preview, delight, cinematog...
4    [bill, paxton, taken, true, stori, us, golf, o...
Name: text, dtype: object


#### Joining Tokens Back into a Single String
Vectorizers such as scikit-learn's `CountVectorizer` (and TF-IDF variants) expect each document as a single string by default. Rejoining the tokens into a space-separated string restores that format and prepares the dataset for the feature extraction phase.
                                                                                                                                                                                                                          
----

In [27]:
# Rejoin tokens into a single string for feature extraction
train_df["text"] = train_df["text"].apply(lambda x: " ".join(x))
test_df["text"] = test_df["text"].apply(lambda x: " ".join(x))

print("\n=== AFTER JOINING TOKENS BACK INTO STRINGS ===")
print(train_df["text"].head(5))
print(test_df["text"].head(5))


=== AFTER JOINING TOKENS BACK INTO STRINGS ===
0    bromwel high cartoon comedi ran time program s...
1    homeless houseless georg carlin state issu yea...
2    brilliant overact lesley ann warren best drama...
3    easili underr film inn brook cannon sure flaw ...
4    typic mel brook film much less slapstick movi ...
Name: text, dtype: object
0    went saw movi last night coax friend mine ill ...
1    actor turn director bill paxton follow promis ...
2    recreat golfer knowledg sport histori pleas di...
3    saw film sneak preview delight cinematographi ...
4    bill paxton taken true stori us golf open made...
Name: text, dtype: object


#### Bag-of-Words (BoW) Feature Extraction (Convert Text to Numerical Data)
This step transforms the preprocessed reviews into numerical vectors, since a model cannot process raw text directly. We use `CountVectorizer` from scikit-learn to implement a Bag-of-Words (BoW) approach. It converts a collection of documents into a matrix of term counts in which each row represents a review and each column corresponds to a term (a unigram or bigram). Internally, `CountVectorizer` tokenizes the text, builds a vocabulary, and maps each document to a numeric vector. Here `min_df=2` drops terms that appear in fewer than two reviews, `max_df=0.8` drops terms that appear in more than 80% of reviews, and `max_features=10000` keeps the 10,000 most frequent remaining terms.

This step produces `X_train` and `X_test`, numerical feature matrices in the form a classifier such as Naive Bayes expects. Note that the grid search later refits the vectorizer inside the pipeline on the `text` column, so these matrices serve mainly as a shape check.

---

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize optimized CountVectorizer with bigrams and additional feature adjustments
vectorizer = CountVectorizer(min_df=2, max_df=0.8, ngram_range=(1, 2), max_features=10000)

# Fit and transform the 'text' column for training data
X_train = vectorizer.fit_transform(train_df["text"].dropna())

# Transform the 'text' column for testing data
X_test = vectorizer.transform(test_df["text"].dropna())

# Map labels to binary values for training and testing datasets
y_train = (train_df["label"].dropna() == 1).astype(int)
y_test = (test_df["label"].dropna() == 1).astype(int)

# Output the shapes of the transformed data
print("Shape of training data:", X_train.shape)
print("Shape of testing data:", X_test.shape)

Shape of training data: (24904, 10000)
Shape of testing data: (24801, 10000)
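
To make the term-count matrix concrete, here is a hand-rolled sketch of unigram and bigram counting with `collections.Counter`. It mirrors only the idea behind `ngram_range=(1, 2)`: it ignores `min_df`, `max_df`, `max_features`, and scikit-learn's own tokenization, and the two documents are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(doc):
    """Count unigrams and bigrams in a whitespace-tokenized document."""
    tokens = doc.split()
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))

# Made-up, already stemmed documents.
docs = ["good movi good stori", "bad movi"]
counts = [bag_of_words(d) for d in docs]

# Columns: the sorted vocabulary over all documents.
vocabulary = sorted(set().union(*counts))
# Rows: one count vector per document (missing terms count as 0).
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary term, which is why both real matrices above have exactly 10,000 columns regardless of how many reviews they contain.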


# Build the Sentiment Classification Model using MultinomialNB

---

#### Building the Pipeline
This will chain together vectorization using CountVectorizer and classification using MultinomialNB.

---

In [53]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Create the pipeline for movie reviews with an optimized Naive Bayes classifier
pipeline = Pipeline([
    ("vectorizer", vectorizer),  # Ensure vectorizer is appropriately configured (e.g., ngram_range, max_features)
    ("classifier", MultinomialNB(alpha=0.75))  # Include a tuned alpha parameter for smoothing
])

#### Hyperparameter Tuning
We use `GridSearchCV` to find the best value of MultinomialNB's `alpha` (smoothing) parameter, scored by F1 under 5-fold stratified cross-validation, so that the model generalizes better when classifying reviews as positive (1) or negative (0).

---

In [55]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Define the parameter grid for alpha tuning
param_grid = {
    "classifier__alpha": [0.01, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1.0]
}

# Use StratifiedKFold for balanced cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Run the grid search with F1 as the scoring metric
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring="f1",
    n_jobs=-1,  # Utilize all available CPU cores
    verbose=3   # Enable detailed progress logging
)

# Train the grid search on the training data
grid_search.fit(train_df["text"], y_train)

# Retrieve the best model and parameters
best_model = grid_search.best_estimator_
print("Best hyperparameters:", grid_search.best_params_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best hyperparameters: {'classifier__alpha': 0.75}
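
The `alpha` being tuned is the additive smoothing term in Multinomial Naive Bayes: the estimated probability of a term within a class is (count of the term in the class + alpha) divided by (total term count in the class + alpha × vocabulary size). The sketch below uses made-up counts to show how a larger alpha shifts probability toward terms that were never seen in a class. The `smoothed_prob` helper is written for this illustration only.

```python
def smoothed_prob(term_count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(term | class)."""
    return (term_count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts for one class: 3 vocabulary terms, 10 tokens observed.
counts = {"great": 7, "plot": 3, "awful": 0}
class_total = sum(counts.values())

table = {}
for alpha in (0.01, 0.75, 1.0):
    table[alpha] = {
        term: smoothed_prob(count, class_total, len(counts), alpha)
        for term, count in counts.items()
    }
    print(alpha, {term: round(p, 4) for term, p in table[alpha].items()})
```

With a very small alpha the unseen term "awful" gets a probability close to zero, which makes the classifier react strongly to rare words; larger values flatten the distribution. The grid search above selects among such values empirically.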


# Evaluate the Model
Now we can evaluate the model's performance on the test set of movie reviews.

----

In [56]:
from sklearn.metrics import classification_report

# Predict on the test data using the best model
y_pred = best_model.predict(test_df["text"])

# Generate and print a classification report
report = classification_report(y_test, y_pred, digits=4)  # Higher precision for metrics
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0     0.8271    0.8590    0.8428     12361
           1     0.8543    0.8216    0.8376     12440

    accuracy                         0.8402     24801
   macro avg     0.8407    0.8403    0.8402     24801
weighted avg     0.8408    0.8402    0.8402     24801
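
The F1 column can be re-derived from the precision and recall columns as their harmonic mean, and accuracy can be recovered from per-class recall weighted by support. The quick check below uses only the rounded numbers printed in the report, so agreement is expected to about three decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above.
report = {0: (0.8271, 0.8590, 12361), 1: (0.8543, 0.8216, 12440)}

f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Recall times support approximates the correctly classified reviews per class.
correct = sum(r * n for _, r, n in report.values())
accuracy = correct / sum(n for _, _, n in report.values())

print({label: round(score, 3) for label, score in f1_scores.items()})
print(round(macro_f1, 3), round(accuracy, 3))
```

Because the two classes have nearly equal support, the macro and weighted averages in the report almost coincide.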



#### Evaluating new movie reviews

---

In [None]:
new_reviews = [
    "This movie was absolutely fantastic! A must-watch for everyone.",
    "Terrible plot and awful acting. I would not recommend it to anyone.",
    "A beautifully written storyline with stellar performances. I loved it!",
    "It was just okay. Nothing special to write home about.",
    "Horrible. Worst movie I have ever seen in my life."
]

#### Preprocess and Predict

---

In [59]:
# Predict sentiments of the new reviews
predictions = best_model.predict(new_reviews)
prediction_probabilities = best_model.predict_proba(new_reviews)

# Display results
for i, review in enumerate(new_reviews):
    prediction = "Positive" if predictions[i] == 1 else "Negative"
    positive_prob = prediction_probabilities[i][1]
    negative_prob = prediction_probabilities[i][0]
    print(f"Review: {review}")
    print(f"Prediction: {prediction}")
    print(f"Positive Probability: {positive_prob:.2f}")
    print(f"Negative Probability: {negative_prob:.2f}")
    print("-" * 50)

Review: This movie was absolutely fantastic! A must-watch for everyone.
Prediction: Positive
Positive Probability: 0.71
Negative Probability: 0.29
--------------------------------------------------
Review: Terrible plot and awful acting. I would not recommend it to anyone.
Prediction: Negative
Positive Probability: 0.43
Negative Probability: 0.57
--------------------------------------------------
Review: A beautifully written storyline with stellar performances. I loved it!
Prediction: Positive
Positive Probability: 0.75
Negative Probability: 0.25
--------------------------------------------------
Review: It was just okay. Nothing special to write home about.
Prediction: Negative
Positive Probability: 0.24
Negative Probability: 0.76
--------------------------------------------------
Review: Horrible. Worst movie I have ever seen in my life.
Prediction: Negative
Positive Probability: 0.08
Negative Probability: 0.92
--------------------------------------------------
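
Note that `new_reviews` is passed to the pipeline as raw text, while the vectorizer's vocabulary was learned from cleaned, stop-word-filtered, stemmed text. `CountVectorizer.transform` ignores tokens that are not in its vocabulary, so an unstemmed word such as "movie" cannot match the stem "movi". The sketch below illustrates the effect with a toy vocabulary and a hypothetical stem lookup, a stand-in for `PorterStemmer` restricted to stems visible in the stemming output earlier in this notebook.

```python
# Stems as they appear in the notebook's stemming output; the lookup table
# itself is a hypothetical stand-in for PorterStemmer.
TOY_STEMS = {"movie": "movi", "story": "stori", "comedy": "comedi"}

# Toy vocabulary, as if learned from stemmed training text.
vocabulary = {"movi", "stori", "comedi", "film"}

def known_tokens(text, stem=False):
    """Return the tokens of `text` that the toy vocabulary recognizes."""
    tokens = text.lower().split()
    if stem:
        tokens = [TOY_STEMS.get(t, t) for t in tokens]
    return [t for t in tokens if t in vocabulary]

review = "A comedy movie with a good story"
print(known_tokens(review))             # []
print(known_tokens(review, stem=True))  # ['comedi', 'movi', 'stori']
```

Applying the same cleaning, stop-word removal, and stemming to new text before calling `best_model.predict` would keep inference consistent with training. Whether the evaluation portal sends raw or preprocessed text is not stated here, so this is something to verify rather than assume.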


# Save the Model for Submission

---

In [61]:
import joblib

# Save the trained model to a file for future use
model_filename = 'skills_assessment.joblib'
joblib.dump(best_model, model_filename)  # Save the model

print(f"Model saved to {model_filename}")

Model saved to skills_assessment.joblib


# Uploading the Model to the Hack The Box Endpoint for the Flag
I cleared the output of the flag in order to encourage those who may land here to do the work themselves.

---

In [None]:
import requests
import json

# Define the URL of the API endpoint
url = "http://10.129.151.56:5000/api/upload"

# Path to the model file you want to upload
model_file_path = "skills_assessment.joblib"

# Open the file in binary mode and send the POST request
with open(model_file_path, "rb") as model_file:
    files = {"model": model_file}
    response = requests.post(url, files=files)

# Pretty print the response from the server
print(json.dumps(response.json(), indent=4))
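
`response.json()` raises an exception when the server returns a body that is not JSON, for example an HTML error page. Below is a small hypothetical helper that degrades gracefully instead. It is written against plain values so it can be tried without the endpoint, and the sample bodies are made up rather than real server responses.

```python
import json

def summarize_response(status_code, body_text):
    """Return pretty-printed JSON if the body parses, else a short raw excerpt."""
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        return f"HTTP {status_code} (non-JSON body): {body_text[:200]}"
    return f"HTTP {status_code}\n{json.dumps(parsed, indent=4)}"

# Made-up bodies for illustration only.
print(summarize_response(200, '{"status": "ok"}'))
print(summarize_response(502, "<html>Bad Gateway</html>"))
```

With the upload cell above, this could be called as `summarize_response(response.status_code, response.text)`.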