
---

The goal of this exercise is to perform sentiment analysis on the IMDb dataset by transforming the movie review texts into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. You will learn how to preprocess and vectorize text data, train several machine learning models (logistic regression, Random Forest, KNN, and XGBoost), and evaluate how well they predict whether a movie review is positive or negative. This hands-on project gives you practical experience with text processing, feature extraction, model training, and performance assessment in the context of Natural Language Processing (NLP).

---

#### Library Imports

This cell imports the essential Python libraries and modules required for data processing and machine learning tasks in this notebook:

- `random`, `numpy`, and `pandas`: Standard libraries for data manipulation, numerical computations, and data handling.
- Various modules from `sklearn`: For model training, evaluation, and text vectorization, including:
  - `train_test_split`: Splits data into training and test sets.
  - `TfidfVectorizer`: Converts text data into numerical features using TF-IDF.
  - `LogisticRegression`, `RandomForestClassifier`, `KNeighborsClassifier`: Implements different classification algorithms.
  - `accuracy_score`, `classification_report`: Evaluates and summarizes model performance.
- `XGBClassifier` from `xgboost`: Provides gradient boosting classification.

These imports set up the environment for building, training, and evaluating text classification models.

In [41]:
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

In [2]:
!pip install datasets --upgrade

Collecting datasets
  Downloading datasets-4.1.1-py3-none-any.whl.metadata (18 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading datasets-4.1.1-py3-none-any.whl (503 kB)
Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (42.8 MB)
Installing collected packages: pyarrow, datasets
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 18.1.0
    Uninstalling pyarrow-18.1.0:
      Successfully uninstalled pyarrow-18.1.0
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
ERROR: pip's dependency resolver does

In [None]:
#!pip install xgboost --upgrade


#### Load and Combine IMDB Dataset

This cell performs the following actions:

- Loads the IMDB movie reviews dataset using the `datasets` library.
- Converts both the training and test splits of the IMDB dataset to pandas DataFrames.
- Concatenates the training and test DataFrames into a single DataFrame (`df`) containing all samples.
- Prints the shape of the combined DataFrame to display the total number of rows and columns.

This step prepares the full IMDB dataset for subsequent data processing and machine learning tasks.

In [42]:
from datasets import load_dataset
dataset = load_dataset("imdb")
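# Combine both splits; ignore_index=True gives the rows a fresh 0..N-1 index
# instead of repeating 0..24999 for each split
df = pd.concat([pd.DataFrame(dataset["train"]), pd.DataFrame(dataset["test"])], axis=0, ignore_index=True)
print(df.shape)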

(50000, 2)


#### Display Data Overview

This cell displays the first five rows of the combined IMDB DataFrame (`df`) using the `head()` function.  
This provides a quick overview of the dataset's structure, including sample movie reviews and their associated labels.

In [43]:
df.head()

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


#### Check for Missing Values

This cell checks for missing (null) values in each column of the DataFrame (`df`).  
It does so by:
- Calling `isnull().sum()` on the DataFrame to count the number of null entries per column.
- Printing the results to identify if any columns contain missing data.

This helps ensure data quality before proceeding with further analysis or modeling.

In [44]:
print (df.isnull().sum())

text     0
label    0
dtype: int64



#### Display Class Distribution

This cell displays the distribution of classes in the `label` column of the DataFrame (`df`).  
It uses the `value_counts()` function to count and show the number of instances for each class label, allowing you to assess class balance in the dataset.

In [45]:
df['label'].value_counts()

| label | count |
|-------|-------|
| 0 | 25000 |
| 1 | 25000 |


#### Extract review texts and sentiment labels

This cell extracts the review texts and their corresponding sentiment labels from the DataFrame.
- `balanced_reviews` is a list containing all the review texts from the `text` column.
- `balanced_sentiments` is a list containing all the sentiment labels from the `label` column (0 = negative, 1 = positive).


In [46]:
balanced_reviews = list(df["text"])
balanced_sentiments = list(df["label"])

Check how many elements each list has.

In [47]:
print (len(balanced_reviews))
print (len(balanced_sentiments))

50000
50000


Let's see a positive and a negative review from the dataset.

In [48]:
# Show a positive review
review_pozitiv = df[df['label'] == 1].iloc[10]['text']
print("Positive review:\n", review_pozitiv)

# Show a negative review
review_negativ = df[df['label'] == 0].iloc[10]['text']
print("\nNegative review:\n", review_negativ)

Positive review:
 Lars von Trier's Europa is a worthy echo of The Third Man, about an American coming to post-World War II Europe and finds himself entangled in a dangerous mystery.<br /><br />Jean-Marc Barr plays Leopold Kessler, a German-American who refused to join the US Army during the war, arrives in Frankfurt as soon as the war is over to work with his uncle as a sleeping car conductor on the Zentropa Railway. What he doesn't know is the war is still secretly going on with an underground terrorist group called the Werewolves who target American allies. Leopold is strongly against taking any sides, but is drawn in and seduced by Katharina Hartmann (Barbara Sukowa), the femme fatale daughter of the owner of the railway company. Her father was a Nazi sympathizer, but is pardoned by the American Colonel Harris (Eddie Considine) because he can help get the German transportation system up and running again. The colonel soon enlists, or forces, Leopold to be a spy (without giving him a 

#### Split the data into training, validation, and test sets

This cell splits the dataset into training, validation, and test sets in two steps:
- First, it splits `balanced_reviews` and `balanced_sentiments` into:
  - Training set (60% of the data)
  - Temporary set (40% of the data)
- Then, it splits the temporary set evenly into:
  - Validation set (20% of the data)
  - Test set (20% of the data)

Finally, it prints the number of samples in each subset for both the inputs and the labels.


In [49]:
X_train, X_temp, y_train, y_temp = train_test_split(balanced_reviews, balanced_sentiments, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print (len(X_train), len(y_train))
print (len(X_val), len(y_val))
print (len(X_test), len(y_test))

30000 30000
10000 10000
10000 10000
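The two-step split can be checked on a small example. The sketch below uses hypothetical toy data (`toy_texts`, `toy_labels`) and adds `stratify=`, which the cell above does not use; it is shown here only as an optional way to keep the class ratio identical across subsets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 "reviews" with perfectly balanced labels
toy_texts = [f"review {i}" for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

# Step 1: hold out 40% as a temporary set (stratify is an optional addition)
tr_x, tmp_x, tr_y, tmp_y = train_test_split(
    toy_texts, toy_labels, test_size=0.4, random_state=42, stratify=toy_labels
)

# Step 2: halve the temporary set -> 20% validation, 20% test of the original data
va_x, te_x, va_y, te_y = train_test_split(
    tmp_x, tmp_y, test_size=0.5, random_state=42, stratify=tmp_y
)

print("sizes:", len(tr_x), len(va_x), len(te_x))
print("positives per subset:", sum(tr_y), sum(va_y), sum(te_y))
```

Without `stratify`, the class ratio in each subset is only approximately preserved, which is usually acceptable for a dataset as large and balanced as this one.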


In [9]:
X_train[0]

"The story of peace-loving farmers and townspeople fighting for land, water, law and order, and the respect and ultimate subjugation of the long entrenched cattle interests and their hired guns had been worked over better in earlier (Shane) and probably later films as well. There's some good action scenes and the general layout of the story, excluding a disappointing ending, is well executed. Law and order and religion have established roots in the town, but the old order of cattle drives, cowboys, and gunslingers is still around as well. The clash of the two occurs in a nicely staged ambush scene where the townsmen ride right into a trap. Granger, an ex-gunfighter, plays the guy who is shunned by the very townspeople who need his expertise with a gun."

#### Vectorize training data using TF-IDF

This cell converts the training text data (`X_train`) into numerical features using the TF-IDF (Term Frequency–Inverse Document Frequency) method:
- Initializes a `TfidfVectorizer` that removes English stop words.
- Fits the vectorizer on the training data and transforms `X_train` into a TF-IDF feature matrix (`X_train_tfidf`).
- Prints the size of the vocabulary (i.e., the number of unique words/features extracted from the training data).


In [50]:
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
print (len(vectorizer.vocabulary_))

82357
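To make the TF-IDF weighting concrete, here is a minimal sketch that recomputes the values by hand on a hypothetical three-document corpus (`toy_docs`) and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["great movie great cast", "terrible movie", "great fun"]

# Raw term counts per document (rows = documents, columns = terms)
counts = CountVectorizer().fit_transform(toy_docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (scikit-learn default)

manual = counts * idf                                             # TF * IDF
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)   # L2-normalize each row

# Reference values from TfidfVectorizer with default parameters
reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print("manual computation matches TfidfVectorizer:", np.allclose(manual, reference))
```

Terms that occur in many documents (here `movie` and `great`) receive a lower IDF than terms confined to a single document, which is what lets rare, distinctive words dominate a review's vector.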


This cell repeats the vectorization with a restricted vocabulary:
- Initializes a `TfidfVectorizer` that removes English stop words and keeps only the 10,000 terms with the highest frequency across the training corpus (`max_features=10000`).

In [11]:
vectorizer = TfidfVectorizer(stop_words="english", max_features=10000)
X_train_tfidf = vectorizer.fit_transform(X_train)
print (len(vectorizer.vocabulary_))

10000


This cell transforms the training text data (`X_train`) into TF-IDF features with additional vocabulary selection criteria:
- Initializes a `TfidfVectorizer` configured to:
  - Remove English stop words.
  - Exclude words that appear in more than 95% of documents (`max_df=0.95`).
  - Exclude words that appear in fewer than 5% of documents (`min_df=0.05`).

Words that appear in almost every document are removed because they do little to distinguish one review from another. Words that appear in very few documents are removed because they may be noise or too rare to carry a reliable signal.

In [12]:
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=0.05)
X_train_tfidf = vectorizer.fit_transform(X_train)
print (len(vectorizer.vocabulary_))

258


This cell performs the following tasks:

- Initializes a `TfidfVectorizer` configured to:
  - Remove English stop words,
  - Consider both unigrams and bigrams (`ngram_range=(1,2)`),
  - Ignore terms that appear in more than 95% (`max_df=0.95`) or less than 2.5% (`min_df=0.025`) of documents.
- Fits the vectorizer on the `X_train` data and transforms the text data into a TF-IDF weighted document-term matrix (`X_train_tfidf`).
- Prints the size of the vocabulary (`len(vectorizer.vocabulary_)`) generated by the vectorizer, indicating how many unique terms were retained after preprocessing.



In [13]:
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_df=0.95, min_df=0.025)
X_train_tfidf = vectorizer.fit_transform(X_train)
print(len(vectorizer.vocabulary_))

631


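Note on execution order: cells with an execution counter of 50 or higher were run with the first configuration above (`TfidfVectorizer(stop_words="english")`, 82,357 features), as the shapes below confirm. Cells with lower counters come from an earlier pass and may reflect a different vectorizer configuration.

In [51]: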
print(X_train_tfidf.shape)

(30000, 82357)


In [52]:
feature_names = vectorizer.get_feature_names_out()
first_vector = X_train_tfidf[0].toarray().flatten()
first_doc_features = dict(zip(feature_names, first_vector))
# show only the nonzero values
print({k: v for k, v in first_doc_features.items() if v > 0})

{'action': np.float64(0.06357570747429808), 'ambush': np.float64(0.15680079336667058), 'better': np.float64(0.05044596686622952), 'cattle': np.float64(0.27443975406121096), 'clash': np.float64(0.13555986568333372), 'cowboys': np.float64(0.13942782366592), 'disappointing': np.float64(0.09638136957014569), 'drives': np.float64(0.1149654117132686), 'earlier': np.float64(0.08835318364760894), 'ending': np.float64(0.0668831792253816), 'entrenched': np.float64(0.1639146602211857), 'established': np.float64(0.11465034749260912), 'ex': np.float64(0.09638136957014569), 'excluding': np.float64(0.1578143310006706), 'executed': np.float64(0.107119105366374), 'expertise': np.float64(0.14430705935413138), 'farmers': np.float64(0.14887396964554003), 'fighting': np.float64(0.09155706990595328), 'films': np.float64(0.050753563409767895), 'general': np.float64(0.08697128282668588), 'good': np.float64(0.03696216401337171), 'granger': np.float64(0.15242147743419188), 'gun': np.float64(0.09367787030313793)

If we want to see the vocabulary:

In [53]:
for word, idx in list(vectorizer.vocabulary_.items())[:20]:
    print(word, ":", idx)

story : 69773
peace : 53845
loving : 43539
farmers : 26116
townspeople : 74326
fighting : 26853
land : 41393
water : 79484
law : 41787
order : 51998
respect : 60775
ultimate : 75847
subjugation : 70332
long : 43252
entrenched : 24242
cattle : 12228
interests : 37375
hired : 34103
guns : 32018
worked : 80976


Let's see how the first example from the training set looks after all this processing:

In [54]:
# 1. Show the original text
print("Original text (X_train[0]):")
print(X_train[0])
print("="*50)

# # 2. Show all features (the terms kept by the vectorizer)
# feature_names = vectorizer.get_feature_names_out()
# print("Full list of features kept by the vectorizer:")
# print(feature_names)
# print(f'Total number of features: {len(feature_names)}')
# print("="*50)

# 3. Show the TF-IDF vector (nonzero values only) for X_train[0]
first_vector = X_train_tfidf[0].toarray().flatten()
nonzero_idx = first_vector.nonzero()[0]  # indices where the value is nonzero

print("Nonzero features for X_train[0] and their TF-IDF values:")
for idx in nonzero_idx:
    print(f"{feature_names[idx]} : {first_vector[idx]:.4f}")

Original text (X_train[0]):
The story of peace-loving farmers and townspeople fighting for land, water, law and order, and the respect and ultimate subjugation of the long entrenched cattle interests and their hired guns had been worked over better in earlier (Shane) and probably later films as well. There's some good action scenes and the general layout of the story, excluding a disappointing ending, is well executed. Law and order and religion have established roots in the town, but the old order of cattle drives, cowboys, and gunslingers is still around as well. The clash of the two occurs in a nicely staged ambush scene where the townsmen ride right into a trap. Granger, an ex-gunfighter, plays the guy who is shunned by the very townspeople who need his expertise with a gun.
Nonzero features for X_train[0] and their TF-IDF values:
action : 0.0636
ambush : 0.1568
better : 0.0504
cattle : 0.2744
clash : 0.1356
cowboys : 0.1394
disappointing : 0.0964
drives : 0.1150
earlier : 0.0

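Now we also need to transform the validation and test sets. We call `transform` (not `fit_transform`) so that they are encoded with the vocabulary and IDF weights learned from the training data.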

In [55]:
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

In [56]:
X_val_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 876106 stored elements and shape (10000, 82357)>

In [57]:
print(X_val_tfidf.shape)
print(X_test_tfidf.shape)

(10000, 82357)
(10000, 82357)
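The reason for calling `transform` rather than `fit_transform` here can be shown on a tiny example. In this sketch (hypothetical toy data), the vectorizer is fitted on one "training" document, and a new document containing an unseen word is then transformed: the unseen word is ignored and the number of columns stays the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: the word "unknown" never appears in the training text
toy_train = ["good movie"]
toy_new = ["good unknown"]

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary: good, movie
toy_new_matrix = toy_vectorizer.transform(toy_new)          # reuses that vocabulary

print("train shape:", toy_train_matrix.shape)
print("new shape:  ", toy_new_matrix.shape)
print("nonzero entries in new document:", toy_new_matrix.nnz)
```

Keeping the same columns is what lets a model trained on `X_train_tfidf` accept `X_val_tfidf` and `X_test_tfidf`, and fitting only on the training data avoids leaking information from the held-out sets.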


#### Training a Logistic Regression Model

This cell initializes a `LogisticRegression` model and trains (fits) it on the TF-IDF-transformed training data (`X_train_tfidf`) with the corresponding target labels (`y_train`). This trained model will be used for predictive tasks in subsequent steps.

In [58]:
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

In [24]:
y_train_pred = model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)

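Train Accuracy: 0.8567666666666667

This cell's execution counter (24) is lower than that of the vectorizer and model cells (50 and 58), so this figure likely comes from an earlier run and is not directly comparable with the validation and test accuracies below.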


In [59]:
y_val_pred = model.predict(X_val_tfidf)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

Validation Accuracy: 0.8946


In [60]:
y_test_pred = model.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.8931
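Accuracy alone hides how the errors are distributed between the two classes. `classification_report` is imported at the top of the notebook but not used; the sketch below shows what it provides, on hypothetical label lists (`toy_true`, `toy_pred`) rather than the real predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true labels and predictions, used only for illustration
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

# Human-readable table with per-class precision, recall, and F1
print(classification_report(toy_true, toy_pred, target_names=["negative", "positive"]))

# The same numbers as a dictionary, convenient for further processing
report = classification_report(toy_true, toy_pred, output_dict=True)
print("accuracy:", report["accuracy"])
print("precision for class 1:", report["1"]["precision"])
```

On the real data the call would take `y_test` and `y_test_pred` in place of the toy lists.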


If we want to see some predictions:

In [61]:
# Predictions on the test set
y_test_pred = model.predict(X_test_tfidf)

# Indices of correct and incorrect predictions
# (y_test is a plain list, so convert it for element-wise comparison)
y_test_arr = np.array(y_test)
correct_indices = np.where(y_test_arr == y_test_pred)[0]
incorrect_indices = np.where(y_test_arr != y_test_pred)[0]

print("\nCORRECT EXAMPLES:")
for idx in correct_indices[:5]:  # first 5 correct examples
    print(f"Text: {X_test[idx]}\nTrue label: {y_test[idx]}\nPrediction: {y_test_pred[idx]}\n")

print("\nINCORRECT EXAMPLES:")
for idx in incorrect_indices[:5]:  # first 5 incorrect examples
    print(f"Text: {X_test[idx]}\nTrue label: {y_test[idx]}\nPrediction: {y_test_pred[idx]}\n")



CORRECT EXAMPLES:
Text: too bad this movie isn't. While "Nemesis Game" is mildly entertaining, I found it hard to suspend my disbelief the whole length of the movie, especially the situations that Sara was putting herself into. Are we supposed to believe that:<br /><br />1) this hot chick is going to go slumming unarmed around abandoned buildings and dark subway tunnels in the middle of the night just to solve some riddles?<br /><br />2) the protagonists are supposedly such experts that they play riddle games for fun, but don't put the whole "I Never Sinned" riddle together until the very end...and then...and then...get this...she has to do the whole mirror thing to finally put the pieces together?? I know it was the filmmaker's device to show the audience what was going on, but do they really think we're that stupid?<br /><br />3) when Vern and Sara go to the Chez M to question the blonde, there is not ONE topless chick in the whole building. Nada. C'mon. I know it's Canada, but I wou

Now let's try a Random Forest classifier.

In [27]:
model = RandomForestClassifier(n_estimators=30)
model.fit(X_train_tfidf, y_train)

In [28]:
y_train_pred = model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)
y_val_pred = model.predict(X_val_tfidf)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)
y_test_pred = model.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Train Accuracy: 0.9997
Validation Accuracy: 0.804
Test Accuracy: 0.8014
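The same three-way evaluation is repeated for every model in this notebook. One way to avoid the repetition is a small helper; the sketch below is an illustration only, with a hypothetical function name (`report_accuracies`) and toy data, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def report_accuracies(model, splits):
    """Return {split_name: accuracy} for a fitted model.

    `splits` maps a name such as "train" to a pair (features, labels).
    """
    return {name: accuracy_score(y, model.predict(X)) for name, (X, y) in splits.items()}


# Hypothetical toy data standing in for the real train/validation splits
toy_train_texts, toy_train_labels = ["great film", "great cast", "awful film", "awful cast"], [1, 1, 0, 0]
toy_val_texts, toy_val_labels = ["great plot", "awful plot"], [1, 0]

toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_texts)
toy_X_val = toy_vectorizer.transform(toy_val_texts)

toy_model = LogisticRegression().fit(toy_X_train, toy_train_labels)
scores = report_accuracies(
    toy_model,
    {"train": (toy_X_train, toy_train_labels), "validation": (toy_X_val, toy_val_labels)},
)
for name, score in scores.items():
    print(f"{name} accuracy: {score:.4f}")
```

With the real data, the dictionary would hold `(X_train_tfidf, y_train)`, `(X_val_tfidf, y_val)`, and `(X_test_tfidf, y_test)`. Comparing train and validation accuracy side by side makes overfitting, such as the large gap in the output above, easy to spot.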


In [29]:
model = RandomForestClassifier(n_estimators=50, max_depth=20)
model.fit(X_train_tfidf, y_train)

In [30]:
y_train_pred = model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)
y_val_pred = model.predict(X_val_tfidf)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)
y_test_pred = model.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Train Accuracy: 0.9018
Validation Accuracy: 0.7989
Test Accuracy: 0.7994


Now with a KNeighborsClassifier.

In [31]:
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train_tfidf, y_train)

In [32]:
y_train_pred = model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)
y_val_pred = model.predict(X_val_tfidf)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)
y_test_pred = model.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Train Accuracy: 0.7962
Validation Accuracy: 0.702
Test Accuracy: 0.7004


Next, an `XGBClassifier`.

In [33]:
model = XGBClassifier(n_estimators=50, max_depth=5)
model.fit(X_train_tfidf, y_train)

In [34]:
y_train_pred = model.predict(X_train_tfidf)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)
y_val_pred = model.predict(X_val_tfidf)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)
y_test_pred = model.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Train Accuracy: 0.8686666666666667
Validation Accuracy: 0.8126
Test Accuracy: 0.8135


In [36]:
len(y_test_pred)

10000

This function **trains an ensemble of three classifiers** (one Logistic Regression and two Random Forests), all using the same training data and features. The two Random Forests share the same hyperparameters and differ only in their random initialization. The function makes predictions with each model on the test data, then combines them by **majority voting** (a review is labeled positive when at least 2 of the 3 models predict positive). It returns the **accuracy score** of the combined predictions on the test set.

In [37]:
def ensemble():
    # Model 1: logistic regression
    model = LogisticRegression()
    model.fit(X_train_tfidf, y_train)
    y_pred1 = model.predict(X_test_tfidf)

    # Model 2: random forest
    model = RandomForestClassifier(n_estimators=100, max_depth=20)
    model.fit(X_train_tfidf, y_train)
    y_pred2 = model.predict(X_test_tfidf)

    # Model 3: a second random forest (same settings, different random seed)
    model = RandomForestClassifier(n_estimators=100, max_depth=20)
    model.fit(X_train_tfidf, y_train)
    y_pred3 = model.predict(X_test_tfidf)

    # Majority vote: predict 1 when at least two of the three models predict 1
    y_pred = []
    for i in range(len(y_pred1)):
        if y_pred1[i] + y_pred2[i] + y_pred3[i] >= 2:
            y_pred.append(1)
        else:
            y_pred.append(0)

    acc = accuracy_score(y_test, y_pred)
    return acc

In [38]:
acc = ensemble()
print("Ensemble acc:", acc)

Ensemble acc: 0.8145
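The voting step itself can be written without an explicit loop. The sketch below applies the same "at least 2 of 3" rule with NumPy to three hypothetical prediction arrays; it illustrates the voting logic only and does not retrain any model.

```python
import numpy as np

# Hypothetical binary predictions from three models on three samples
pred_a = np.array([1, 0, 1])
pred_b = np.array([1, 1, 0])
pred_c = np.array([0, 0, 1])

# Stack into shape (n_models, n_samples) and count the votes for class 1
votes = np.vstack([pred_a, pred_b, pred_c]).sum(axis=0)

# Majority vote: class 1 wins when at least 2 of the 3 models predict it
majority = (votes >= 2).astype(int)
print("votes for class 1:", votes)
print("majority prediction:", majority)
```

scikit-learn also offers `VotingClassifier` with `voting="hard"` for this purpose. Since two of the three ensemble members here are near-identical Random Forests, their agreement can outvote the logistic regression, which may be why the ensemble scores below the logistic regression on its own.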
