## Sentiment Analysis
We use the IMDB dataset of 50K movie reviews (`IMDB Dataset.csv`, loaded from Kaggle) to build:
- A baseline model using TF-IDF + a linear SVM
- A fine-tuned DistilBERT transformer model

### Importing required libraries

#### The notebook was trained on Kaggle, so this cell lists the input files available there

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
import tensorflow as tf
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

2025-04-19 15:55:35.622644: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745078135.804346      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745078135.859881      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Functions for cleaning dataset

In [4]:
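# Regex that matches HTML tags such as <br />
CLEANR = re.compile('<.*?>')
lemmatizer = WordNetLemmatizer()
vectorizer = TfidfVectorizer()
clf = LinearSVC()

def data_preprocess(review):
    """Strip HTML tags and non-letters, lowercase, and lemmatize each word."""
    review = re.sub(CLEANR, '', review)        # drop HTML tags
    review = re.sub('[^a-zA-Z ]', '', review)  # keep only letters and spaces
    review = review.lower()
    review = review.split()
    # lemmatize() treats every token as a noun by default, which is why
    # "was" and "has" show up as "wa" and "ha" in the cleaned text
    review = [lemmatizer.lemmatize(i) for i in review]
    return ' '.join(review)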
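
To see what the regex steps in `data_preprocess` do on their own, here is a minimal sketch that applies them to a made-up review using only the standard library. The lemmatization step is left out so the snippet runs without NLTK data; `clean_text`, `TAG_RE` and the sample string are hypothetical names introduced for this example.

```python
import re

TAG_RE = re.compile('<.*?>')  # same pattern as CLEANR above

def clean_text(review):
    """Regex-only part of the cleaning: strip tags, keep letters, lowercase."""
    review = re.sub(TAG_RE, '', review)
    review = re.sub('[^a-zA-Z ]', '', review)
    return ' '.join(review.lower().split())

# Made-up review text, for illustration only
sample = "Great film!<br /><br />Loved it, 10/10."
cleaned = clean_text(sample)
print(cleaned)
```

Because tags are replaced with an empty string, words on either side of a `<br />` that has no surrounding spaces get fused together ("filmloved"). Substituting a space instead would keep them apart, at the cost of changing the features the models in this notebook were trained on.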

### Loading the dataset and encoding the labels

In [5]:
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
data['review'] = data['review'].apply(data_preprocess)
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})
data.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewer ha mentioned that af... | 1 |
| 1 | a wonderful little production the filming tech... | 1 |
| 2 | i thought this wa a wonderful way to spend tim... | 1 |
| 3 | basically there a family where a little boy ja... | 0 |
| 4 | petter matteis love in the time of money is a ... | 1 |


### Splitting Data into train and test

In [6]:
x_train, x_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=0)

### Baseline: TF-IDF + SVM
We vectorize the text using TF-IDF and train a linear SVM classifier.
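
As a rough illustration of what `TfidfVectorizer` computes with its default settings (raw term counts, smoothed IDF, L2 normalization), here is a pure-Python sketch on a toy corpus. The corpus and the helper name `tfidf_vectors` are made up for this example, and the real vectorizer also handles tokenization and sparse storage, which are skipped here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalization on whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        vectors.append([w / norm for w in row])  # L2-normalize each document
    return vocab, vectors

# Toy corpus, for illustration only
toy_docs = ["good movie", "bad movie", "good good plot"]
toy_vocab, toy_vectors = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, [round(w, 3) for w in vec])
```

Terms that appear in fewer documents (here "bad" or "plot") receive a larger weight than terms shared across documents (here "movie"), which is the property the linear SVM relies on.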

In [7]:
X_train_vec = vectorizer.fit_transform(x_train)
X_test_vec = vectorizer.transform(x_test)
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.89      0.89      5035
           1       0.89      0.90      0.89      4965

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
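
The columns of this report can be reproduced from raw counts. The sketch below computes precision, recall and F1 for one class from two hypothetical label lists (they are not taken from this notebook's predictions); `prf_for_class` is a name introduced for the example.

```python
def prf_for_class(y_true, y_pred, cls):
    """Precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels, for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
precision_demo, recall_demo, f1_demo = prf_for_class(y_true_demo, y_pred_demo, cls=1)
print(precision_demo, recall_demo, f1_demo)
```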



In [8]:
transformers.__version__

'4.51.1'

## Data Processing for Transformers

In [9]:
data_transformers = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
data_transformers['review'] = data_transformers['review'].apply(data_preprocess)

In [10]:
sentiments = list(data_transformers['sentiment'])
# 1 = positive, 0 = negative; cast to int because get_dummies can return booleans
labels = pd.get_dummies(sentiments)['positive'].astype(int)

In [11]:
x_train, x_test, y_train, y_test = train_test_split(list(data_transformers['review']), labels, test_size=0.2, random_state=0)

In [12]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


### Tokenizing the data and building TensorFlow datasets

In [13]:
train_encodings = tokenizer(x_train,
                            truncation=True,
                            padding=True)

test_encodings = tokenizer(x_test,
                            truncation=True,
                            padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
                                dict(train_encodings),
                                y_train
))

test_dataset = tf.data.Dataset.from_tensor_slices((
                                dict(test_encodings),
                                y_test
))

I0000 00:00:1745078296.016767      31 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0
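
What `truncation=True` and `padding=True` do to the token ids can be sketched without the tokenizer: sequences longer than the limit are cut, and shorter ones are filled with a pad id while an attention mask records which positions hold real tokens. The helper `pad_and_truncate`, the toy ids, and the pad id of 0 are assumptions made for this illustration; the real tokenizer also adds special tokens and takes its length limit from the model configuration.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each id list to max_length, then pad all to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, not real DistilBERT vocabulary entries
toy_batch = pad_and_truncate([[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]],
                             max_length=4)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Since the notebook tokenizes the whole training set in a single call, every review ends up padded to the longest sequence in that set after truncation, which is simple but uses more memory than padding batch by batch.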


In [14]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [15]:
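# Sanity check: run one training example through the still-untrained classification head.
# The loop variables are named so they do not overwrite the `data` DataFrame or `labels` Series.
for inputs, label in train_dataset.take(1):
    predictions = model(inputs)
    print(predictions)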

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.04265959, 0.07845549]], dtype=float32)>, hidden_states=None, attentions=None)


### Training for 2 epochs

In [16]:

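# The model outputs raw logits, so the string loss 'sparse_categorical_crossentropy'
# (which expects probabilities) is not appropriate here. Omitting `loss` lets the
# Transformers model fall back to its built-in loss, which is computed from logits.
model.compile(optimizer='adam', metrics=['accuracy'])

model.fit(train_dataset.shuffle(100).batch(16),
          epochs=2,
          validation_data=test_dataset.batch(16))  # validation data needs no shuffling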

Epoch 1/2


I0000 00:00:1745078354.481066     103 service.cc:148] XLA service 0x7f44b818d980 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1745078354.481548     103 service.cc:156]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
I0000 00:00:1745078354.552973     103 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1745078354.688759     103 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/2


<tf_keras.src.callbacks.History at 0x7f460fdb5d90>
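
The classification head returns raw logits rather than probabilities (see the `logits` tensor printed earlier), so a cross-entropy loss on these outputs needs a softmax step first. As a sketch of that computation in plain Python, the hypothetical helper `sparse_ce_from_logits` below evaluates the loss for the logits printed earlier, assuming for illustration that the true label is 1.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Logits copied from the untrained model's output shown earlier;
# the label value 1 is an assumption made for this example
demo_logits = [0.04265959, 0.07845549]
demo_loss = sparse_ce_from_logits(demo_logits, label=1)
print(round(demo_loss, 4))
```

A value close to ln 2 ≈ 0.693 is what one would expect from a two-class head that has not been trained yet, since its two logits are nearly equal.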

### Saving model checkpoints

In [17]:
model.save_pretrained("./sentiment_transformer_model")

In [18]:
prediction_model = TFDistilBertForSequenceClassification.from_pretrained("./sentiment_transformer_model")

Some layers from the model checkpoint at ./sentiment_transformer_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./sentiment_transformer_model and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
test_sentence = x_test[4]
test_sentence

'this movie is simply awesome it is so hilarious although the skating and other montage are played out the comedy is awesome raab himself and brandon dicamillo are hilarious there will be moment when you cant breath youre laughing so hard plus there are scene that you can watch hundred of time and still laugh this is one of the funniest comedy ive ever seen'

### Prediction

In [21]:
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
tf_output = prediction_model.predict(predict_input)[0]



In [22]:
tf_prediction = tf.nn.softmax(tf_output, axis=1).numpy()[0]
tf_prediction

array([0.06466941, 0.9353306 ], dtype=float32)
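
To turn the probability vector into a label, take the index of the largest entry. Index 1 corresponds to `positive` because the training labels were built from the `positive` column of `pd.get_dummies`; the `ID2LABEL` mapping is a name introduced here for illustration.

```python
ID2LABEL = {0: "negative", 1: "positive"}  # hypothetical helper mapping

# Probabilities copied from the output above
probs = [0.06466941, 0.9353306]
pred_id = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
pred_label = ID2LABEL[pred_id]
print(pred_label, round(probs[pred_id], 3))
```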