### 1. Installation and Imports


In [1]:
! pip install datasets



- **Purpose:** Installs the Hugging Face `datasets` library, which provides `load_dataset` for downloading ready-made datasets such as IMDb.

In [2]:
# Load the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from datasets import load_dataset



- **Purpose:** Imports necessary libraries for data manipulation, machine learning, and evaluation.

- **pd, np:** Standard imports for data manipulation (`pandas`) and numerical operations (`numpy`).
- **train_test_split:** Utility to split datasets into training and testing subsets.
- **TfidfVectorizer:** Converts text data into numerical features using TF-IDF.
- **MultinomialNB, LogisticRegression, SVC, RandomForestClassifier:** Machine learning algorithms for classification tasks.
- **accuracy_score, classification_report:** Metrics to evaluate model performance.
- **load_dataset:** Loads datasets from the `datasets` library.


### 2. Loading the Dataset

In [3]:
# Load the IMDb dataset
dataset = load_dataset("imdb")

- **Purpose:** Loads the IMDb movie-review dataset from the `datasets` library. It contains 25,000 labeled reviews for training and 25,000 for testing, and it is a standard benchmark for binary sentiment analysis.


In [4]:
# Convert the dataset to a dataframe
df = pd.DataFrame(dataset["train"])
# df = df.sample(frac=0.1, random_state=42) # Use a subset for faster preprocessing
df.head()

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


- **"train":** Refers to the training split of the IMDb dataset.
- **Purpose:** Converts the training portion of the dataset to a DataFrame for easier manipulation and inspection.

### 3. Data Preparation

In [5]:
# Define the X and y
X = df["text"]
y = df["label"]


- **X:** Feature set containing the raw review text.
- **y:** Target variable containing the sentiment labels (`0` = negative, `1` = positive).

- **Purpose:** Separates the input text from the labels so that each can be passed to the splitting and training steps.


In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- **test_size=0.2:** Specifies that 20% of the data should be used for testing, and the remaining 80% for training.
- **random_state=42:** Ensures reproducibility by setting a seed for the random number generator.

- **Purpose:** Holds out 20% of the reviews (5,000 of 25,000) as a test set that the models never see during training.
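The reports later in this notebook show 2,515 negative and 2,485 positive test reviews, so the plain random split is close to balanced but not exactly. As a minimal sketch on made-up toy data (the texts and labels below are hypothetical), passing `stratify` to `train_test_split` keeps the class proportions identical in both subsets, and reusing `random_state` reproduces the same split:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 short "reviews", 5 negative (0) and 5 positive (1).
texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

# stratify=labels preserves the 50/50 class balance in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Repeating the call with the same seed yields the identical split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 8 2
print(Counter(y_te))         # one example of each class
print(X_te == X_te2)         # True
```

Whether stratification matters here is a judgment call: with 25,000 nearly balanced reviews the effect is small, but it removes one source of variation between runs.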

### 4. Text Vectorization


In [7]:
# Vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

- **max_features=5000:** Keeps only the 5,000 terms with the highest term frequency across the training corpus, which caps the vocabulary size.
- **stop_words='english':** Removes common English stop words (e.g., "the", "and").
- **fit_transform:** Fits the vectorizer to the training data and transforms it into a TF-IDF matrix.
- **transform:** Applies the fitted vectorizer to the test data.

- **Purpose:** Converts the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). The vectorizer is fitted on the training set only, so no vocabulary or document-frequency information from the test set leaks into the features.
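To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF computation on a hypothetical three-document corpus. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and skips tokenization and stop-word removal, so it illustrates the formula rather than reimplementing `TfidfVectorizer`:

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of stop words.
docs = [
    ["great", "movie", "great", "acting"],
    ["terrible", "movie"],
    ["great", "plot"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears.
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF weights of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


for doc in docs:
    print([round(w, 3) for w in tfidf_vector(doc)])
```

Terms that occur in fewer documents (here `plot` or `terrible`) receive a larger IDF than terms shared by several documents (`great`, `movie`), which is what lets distinctive words dominate a review's vector.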

### 5. Model Training and Evaluation

#### Naive Bayes

In [8]:
# Train the Naive_Bayes Model
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
# Make predictions on the test set
y_pred_nb = nb_model.predict(X_test_tfidf)

- **MultinomialNB:** Naive Bayes classifier designed for count-like features such as word counts; in practice it also works with fractional values such as TF-IDF weights.
- **fit:** Trains the model using the training data.
- **predict:** Makes predictions on the test data.
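The idea behind multinomial Naive Bayes can be sketched in a few lines of plain Python. The example below uses hypothetical tokenized reviews and raw word counts with Laplace smoothing (`alpha = 1.0`); it is a simplified illustration, not scikit-learn's implementation:

```python
import math
from collections import Counter

# Hypothetical training data: tokenized reviews with labels (0 = negative, 1 = positive).
train = [
    (["boring", "bad", "plot"], 0),
    (["bad", "acting"], 0),
    (["great", "plot"], 1),
    (["great", "great", "acting"], 1),
]
vocab = sorted({t for doc, _ in train for t in doc})
classes = sorted({label for _, label in train})
alpha = 1.0  # Laplace smoothing

# Log prior: share of training documents in each class.
log_prior = {
    c: math.log(sum(label == c for _, label in train) / len(train)) for c in classes
}

# Smoothed log likelihood of each term given the class.
log_likelihood = {}
for c in classes:
    counts = Counter(t for doc, label in train if label == c for t in doc)
    total = sum(counts.values()) + alpha * len(vocab)
    log_likelihood[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}


def predict(doc):
    """Pick the class with the highest log posterior; unseen terms are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in vocab)
        for c in classes
    }
    return max(scores, key=scores.get)


print(predict(["great", "acting"]))  # 1
print(predict(["bad", "plot"]))      # 0
```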

In [9]:
print("Naive Bayes Classifier report: ")
print(classification_report(y_test, y_pred_nb))

Naive Bayes Classifier report: 
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      2515
           1       0.85      0.85      0.85      2485

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000



- **Purpose:** Trains a Naive Bayes model and evaluates its performance using the `classification_report`, which includes precision, recall, and F1-score metrics.
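For reference, the three per-class metrics in the report follow directly from the counts of true positives, false positives, and false negatives. A small sketch with hypothetical counts (not taken from the run above):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive class.
tp, fp, fn = 80, 20, 10
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.89 f1=0.84
```

The `support` column in the report is simply the number of true examples of each class in the test set.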

#### Logistic Regression

In [10]:
# Train the Logistic Regression model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

- **max_iter=1000:** Raises the cap on solver iterations from the default of 100, giving the optimizer more room to converge.
- **LogisticRegression:** A linear model for binary classification.
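As a rough illustration of what "linear model" means here, the sketch below scores one review as a weighted sum of its TF-IDF features and passes the score through the logistic (sigmoid) function. The weights, bias, and feature values are hypothetical and are not taken from `lr_model`:

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical learned weights for three terms and a bias (illustrative values only).
weights = {"great": 1.8, "boring": -2.1, "plot": 0.1}
bias = -0.2

# TF-IDF weights of one hypothetical review.
features = {"great": 0.6, "plot": 0.3}

# Linear score: bias plus the weighted sum of the features that are present.
score = bias + sum(weights[t] * v for t, v in features.items())
prob_positive = sigmoid(score)
predicted_label = int(prob_positive >= 0.5)
print(round(score, 2), round(prob_positive, 3), predicted_label)
```

A positive score yields a probability above 0.5 and therefore the label `1`; the sign and size of each weight indicate how strongly a term pushes the prediction toward one class.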

In [11]:
# Make predictions with the model
y_pred_lr = lr_model.predict(X_test_tfidf)
print("Logistic Regression Classification report: ")
print(classification_report(y_test, y_pred_lr))

Logistic Regression Classification report: 
              precision    recall  f1-score   support

           0       0.90      0.87      0.88      2515
           1       0.87      0.90      0.89      2485

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000



- **Purpose:** Trains a Logistic Regression model, allowing up to 1000 solver iterations so that it is more likely to converge, and evaluates its performance.

#### Support Vector Machine (SVM)

In [12]:
# Train the Support Vector Machine
svm_model = SVC()
svm_model.fit(X_train_tfidf, y_train)



- **SVC:** Support Vector Classification model.
- **kernel='rbf':** The default kernel, so it is not set explicitly in the code above. The Radial Basis Function kernel implicitly maps input data into a higher-dimensional space.
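The RBF kernel itself is a similarity score between two vectors that decays with their squared distance. A minimal sketch with hypothetical vectors and an illustrative `gamma` (scikit-learn's default derives `gamma` from the data instead):

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical 3-dimensional feature vectors.
a = [0.0, 0.5, 0.5]
b = [0.0, 0.5, 0.4]
c = [0.9, 0.0, 0.1]

gamma = 1.0  # illustrative value, not the one scikit-learn derives from the data
print(rbf_kernel(a, a, gamma))  # 1.0: a vector is maximally similar to itself
print(rbf_kernel(a, b, gamma) > rbf_kernel(a, c, gamma))  # True: b is closer to a
```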


In [13]:
# Make the prediction and evaluate the model
y_pred_svm = svm_model.predict(X_test_tfidf)
print("Support Vector Machine Classification report: ")
print(classification_report(y_test, y_pred_svm))


Support Vector Machine Classification report: 
              precision    recall  f1-score   support

           0       0.90      0.87      0.89      2515
           1       0.88      0.90      0.89      2485

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000



- **Purpose:** Trains an SVM model and evaluates its performance. The default kernel is `rbf` (Radial Basis Function).

#### Random Forest

In [14]:
# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train_tfidf, y_train)

- **n_estimators=100:** Specifies the number of trees in the forest.
- **RandomForestClassifier:** An ensemble method that combines multiple decision trees.
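How the trees are combined can be sketched as follows: scikit-learn's forest averages the class-probability estimates of its trees and predicts the class with the highest mean. The per-tree probabilities below are hypothetical:

```python
# Hypothetical class-probability estimates [P(negative), P(positive)]
# from five trees for a single review.
tree_probs = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],
    [0.45, 0.55],
    [0.2, 0.8],
]

n_classes = len(tree_probs[0])

# Average the per-tree probabilities class by class.
mean_probs = [
    sum(probs[k] for probs in tree_probs) / len(tree_probs)
    for k in range(n_classes)
]

# The forest predicts the class with the highest mean probability.
predicted_label = max(range(n_classes), key=lambda k: mean_probs[k])
print([round(p, 2) for p in mean_probs], predicted_label)  # [0.45, 0.55] 1
```

Note that one very confident tree (the first row) is outweighed by four moderately confident ones, which is the variance-reducing effect of the ensemble.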

In [15]:
# Predict and evaluate the model
y_pred_rf = rf_model.predict(X_test_tfidf)

In [16]:
# Print the report
print("Random Forest Classification report: ")
print(classification_report(y_test, y_pred_rf))

Random Forest Classification report: 
              precision    recall  f1-score   support

           0       0.84      0.86      0.85      2515
           1       0.85      0.84      0.85      2485

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000



- **Purpose:** Trains a Random Forest model with 100 trees (estimators) and evaluates its performance.

### 6. Accuracy Comparison

In [17]:
# Accuracy Comparison
accuracy_nb = accuracy_score(y_test, y_pred_nb)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

In [18]:
print(f"Accuracy Naive Bayes: {accuracy_nb:.4f}")
print(f"Accuracy Logistic Regression: {accuracy_lr:.4f}")
print(f"Accuracy Support Vector Machine: {accuracy_svm:.4f}")
print(f"Accuracy Random Forest: {accuracy_rf:.4f}")

Accuracy Naive Bayes: 0.8496
Accuracy Logistic Regression: 0.8840
Accuracy Support Vector Machine: 0.8862
Accuracy Random Forest: 0.8482


- **accuracy_score:** Computes the accuracy of the model by comparing predicted labels with actual labels.
- **Purpose:** Computes and prints the accuracy for each model, which helps compare their performance.
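To compare the models at a glance, the printed accuracies can be collected in a dictionary and ranked. The values below are copied from the output above; the variable names are illustrative:

```python
# Accuracies copied from the output above.
accuracies = {
    "Naive Bayes": 0.8496,
    "Logistic Regression": 0.8840,
    "Support Vector Machine": 0.8862,
    "Random Forest": 0.8482,
}

# Sort models from most to least accurate.
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
best_name, best_acc = ranking[0]

for name, acc in ranking:
    # Gap to the best model, in percentage points.
    gap = (best_acc - acc) * 100
    print(f"{name:<24} {acc:.4f}  (-{gap:.2f} pp)")
```

The gap between the SVM and Logistic Regression is 0.22 percentage points, which on 5,000 test reviews corresponds to 11 reviews, so the two models are close on this split.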

### 7. Hyperparameter Tuning for SVM

In [19]:
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Set up a grid search with 3-fold cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=3)



- **param_grid:** Dictionary specifying the hyperparameters to be tuned. 
    - **C:** Regularization parameter controlling the trade-off between margin size and misclassification.
    - **kernel:** Specifies the kernel type to be used in the algorithm (e.g., `linear`, `rbf`).
- **cv=3:** Specifies 3-fold cross-validation to evaluate the performance of each parameter combination.
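The cost of this search follows from the grid size. The sketch below enumerates the candidates with `itertools.product` purely for illustration; `GridSearchCV` does this internally:

```python
from itertools import product

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
cv_folds = 3

# Every combination of the listed values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for params in candidates:
    print(params)

# Each candidate is trained once per fold during the search.
n_search_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_search_fits, "fits during the search")
# 6 candidates, 18 fits during the search
```

With the default `refit=True`, `GridSearchCV` then trains the winning combination once more on the full training set, so the SVM is fitted 19 times in total.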


In [20]:
grid_search.fit(X_train_tfidf, y_train)

# Retrieve the best model and evaluate it on the test set
best_svm = grid_search.best_estimator_
y_pred_best_svm = best_svm.predict(X_test_tfidf)

print("Best SVM Classification report: ")
print(classification_report(y_test, y_pred_best_svm))

- **best_estimator_:** Retrieves the model that was refit on the full training set with the best parameter combination found during grid search.

In [None]:
accuracy_svm_best = accuracy_score(y_test, y_pred_best_svm)
print(f"Accuracy Support Vector Machine with Best Parameters: {accuracy_svm_best:.4f}")

Accuracy Support Vector Machine with Best Parameters: 0.8874


- **Purpose:** Performs hyperparameter tuning for the SVM model using `GridSearchCV` to find the best combination of parameters (`C` and `kernel`). The results are then evaluated.
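The search object also exposes the winning settings through `best_params_` and `best_score_`. The sketch below shows them on a hypothetical toy corpus. As a variation on the notebook's approach (not what the code above does), it wraps the vectorizer and the SVM in a `Pipeline`, so the TF-IDF statistics are recomputed inside each cross-validation fold; the step names `tfidf` and `svm` are arbitrary choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 6 negative (0) and 6 positive (1) reviews.
texts = [
    "boring plot and bad acting", "a dull and bad film", "terrible boring script",
    "bad pacing and dull scenes", "awful film with terrible acting", "dull boring awful",
    "great plot and great acting", "a wonderful and moving film", "brilliant script",
    "great pacing and lovely scenes", "wonderful film with brilliant acting",
    "lovely moving great",
]
labels = [0] * 6 + [1] * 6

# The vectorizer is refitted inside every fold, so validation folds
# never influence the vocabulary or the IDF weights.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)           # winning combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

On a corpus this small the scores are not meaningful; the point is only the mechanics of reading the results and of keeping preprocessing inside the cross-validation loop.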

### 8. Confusion Matrix for SVM

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

In [None]:
y_pred_svm_cm = svm_model.predict(X_test_tfidf)


In [None]:
# Generate the CM
cm = confusion_matrix(y_test, y_pred_svm_cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svm_model.classes_)

- **confusion_matrix:** Computes the confusion matrix, which shows the counts of true positives, false positives, true negatives, and false negatives.
- **display_labels=svm_model.classes_:** Labels for the classes in the confusion matrix.
- **ConfusionMatrixDisplay:** Utility to visualize the confusion matrix using `matplotlib`.

In [None]:
# Display the confusion matrix

disp.plot()
plt.show()

- **Purpose:** Generates and displays a confusion matrix for the default (untuned) SVM model, `svm_model`, to visualize its performance in terms of true positives, false positives, true negatives, and false negatives.
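The layout of the matrix can be reproduced by hand. The sketch below counts the four cells for hypothetical labels, using the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels):

```python
# Hypothetical true and predicted labels (0 = negative, 1 = positive).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

labels = [0, 1]

# Rows index the true label, columns the predicted label,
# which is the layout scikit-learn's confusion_matrix uses.
cm = [[0 for _ in labels] for _ in labels]
for t, p in zip(y_true, y_hat):
    cm[t][p] += 1

(tn, fp), (fn, tp) = cm
print(cm)              # [[3, 1], [2, 4]]
print(tn, fp, fn, tp)  # 3 1 2 4

# Accuracy is the share of counts on the diagonal.
accuracy = (tn + tp) / len(y_true)
print(accuracy)        # 0.7
```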
