# Notebook: IMDb Sentiment Analysis with ClearML

## Objective
This notebook performs sentiment analysis on the IMDb Movie Reviews dataset, classifying each review as positive or negative. ClearML is used to track, visualize, and compare experiments.

## Workflow
1. Load and explore the dataset.
2. Preprocess the text data.
3. Train and evaluate machine learning models.
4. Track experiments and log results using ClearML.

## Tools and Libraries
- Python
- pandas and scikit-learn for data handling and modeling
- ClearML for experiment tracking and logging

## Step 1: Load and Explore the Dataset


In [6]:
import pandas as pd

df = pd.read_csv("../data/IMDB-Dataset.csv")

df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


In [7]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [8]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Step 2: Data Preprocessing
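We'll split the data into training and test sets, then convert the review text into numerical features using TF-IDF. No separate cleaning step is applied in this notebook; `TfidfVectorizer` lowercases the text and removes English stop words during vectorization.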
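The sample reviews in Step 1 contain HTML line breaks such as `<br />`, which the vectorizer would otherwise tokenize as the word `br`. A minimal sketch of an optional cleaning step follows. The helper name `clean_review` and the sample string are made up for illustration, and the regular expression is a simple pattern, not a full HTML parser.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace.

    Hypothetical helper, not part of the original notebook.
    """
    text = TAG_PATTERN.sub(" ", text)   # replace each tag with a space
    text = html.unescape(text)          # e.g. &amp; becomes &
    return WHITESPACE_PATTERN.sub(" ", text).strip()


# Made-up sample, not taken from the dataset.
sample = "Great film.<br /><br />Loved the &quot;ending&quot; &amp; the cast."
print(clean_review(sample))
```

If this step were adopted, it could be applied before the split with something like `df['review'].apply(clean_review)`; the accuracy reported later in this notebook was obtained without it.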

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and testing sets
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text to numerical representation using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

X_train_vec

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2806532 stored elements and shape (40000, 5000)>
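To make the TF-IDF weighting concrete, here is a small pure-Python sketch on a made-up three-document corpus. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which match scikit-learn's default `TfidfVectorizer` settings, but it tokenizes on whitespace only and removes no stop words, so it is an illustration and not a reimplementation of the vectorizer. The function name `tfidf` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: rarer terms get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])  # unit-length row
    return vocab, rows


# Toy corpus, not taken from the dataset.
docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document, `bad` appears in only one document while `movie` appears in two, so `bad` receives the larger weight. The same effect is what lets distinctive words in a review carry more signal than common ones.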

## Step 3: Model Training and ClearML Integration
We'll train a Logistic Regression model and track the experiment using ClearML.


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from clearml import Task

# Initialize ClearML Task
task = Task.init(project_name="IMDb Sentiment Analysis", task_name="Logistic Regression")

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Evaluate the model
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

ClearML Task: created new task id=921c179a4dff4db186856b3836a81a26
2024-12-01 11:29:35,326 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/8d7422d16fe44ab780913011deb6f3f8/experiments/921c179a4dff4db186856b3836a81a26/output/log
Test Accuracy: 0.8889

Classification Report:
              precision    recall  f1-score   support

    negative       0.90      0.87      0.89      4961
    positive       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
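The columns of the classification report can be reproduced by hand from true and predicted labels. The sketch below computes precision, recall, F1, and support for a single class on a made-up list of six labels (these are not the IMDb results); the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, F1, and support for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1-score": f1, "support": tp + fn}


# Toy labels for illustration only.
y_true_toy = ["positive", "positive", "negative", "negative", "positive", "negative"]
y_pred_toy = ["positive", "negative", "negative", "positive", "positive", "positive"]

for label in ("negative", "positive"):
    print(label, per_class_metrics(y_true_toy, y_pred_toy, label))
```

Precision answers "of the reviews predicted as this class, how many were right?", while recall answers "of the reviews that truly belong to this class, how many were found?". Support is the number of true examples of the class in the test set.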


In [None]:
# Log the test accuracy to ClearML as a scalar (plot title "Accuracy", series "Test")
task.get_logger().report_scalar(title="Accuracy", series="Test", value=accuracy, iteration=1)
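Beyond the single accuracy value, the per-class metrics could be logged the same way. The sketch below is one possible approach, under two assumptions: the helper `log_report_scalars` is hypothetical, and it accepts any object with a `report_scalar(title, series, value, iteration)` method, so it runs here with a stand-in `PrintLogger` and needs no ClearML connection. The example dictionary copies the rounded values from the classification report printed in Step 3.

```python
def log_report_scalars(logger, report, iteration=1):
    """Report precision, recall, and F1 for every class in a report dict.

    `report` is assumed to have the shape returned by scikit-learn's
    classification_report(..., output_dict=True).
    Returns the (metric, label) pairs that were logged.
    """
    logged = []
    for label, metrics in report.items():
        if not isinstance(metrics, dict):
            continue  # skip the top-level "accuracy" entry, which is a float
        for name in ("precision", "recall", "f1-score"):
            logger.report_scalar(title=name, series=label,
                                 value=metrics[name], iteration=iteration)
            logged.append((name, label))
    return logged


class PrintLogger:
    """Stand-in logger so the sketch runs without ClearML."""

    def report_scalar(self, title, series, value, iteration):
        print(f"{title} / {series} @ {iteration}: {value}")


# Rounded values copied from the classification report above.
report = {
    "negative": {"precision": 0.90, "recall": 0.87, "f1-score": 0.89, "support": 4961},
    "positive": {"precision": 0.88, "recall": 0.90, "f1-score": 0.89, "support": 5039},
    "accuracy": 0.89,
}
log_report_scalars(PrintLogger(), report)
```

In the notebook itself, the dictionary would come from `classification_report(y_test, y_pred, output_dict=True)` and the logger from `task.get_logger()`. Using the metric name as the title and the class as the series groups both classes on one plot per metric.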