In [15]:
## Sentiment Analysis using Support Vector Machine (SVM)
'''
This project aims to classify text data into positive and negative sentiments
using a Support Vector Machine (SVM) classifier. Sentiment analysis helps
understand public opinion from textual data such as reviews and comments.

SVM is chosen because it performs well on high-dimensional data and is effective
for text classification tasks.
'''



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Load the IMDB reviews dataset (downloaded from Kaggle)
df=pd.read_csv(r"C:\Users\jayas\Downloads\sentimentanalysis\IMDB Dataset.csv")

In [3]:
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.columns

Index(['review', 'sentiment'], dtype='object')

In [5]:
df.shape

(50000, 2)

In [6]:
# Check which labels the sentiment column contains and how many rows each has
df['sentiment'].value_counts() 

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [7]:
# ML models cannot work with text labels directly, so encode them as integers
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df.head()


,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
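`Series.map` returns NaN for any value that is missing from the mapping dictionary, so a stray label such as `Positive` would silently become a missing target. The following is a minimal sketch of a sanity check, using a made-up series rather than the IMDB data; the names `toy_sentiment` and `mapping` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical labels; the third one differs in casing and has padding
toy_sentiment = pd.Series(["positive", "negative", " Positive "])
mapping = {"positive": 1, "negative": 0}

# Values absent from the mapping become NaN
encoded_raw = toy_sentiment.map(mapping)
n_unmapped = int(encoded_raw.isna().sum())

# Normalizing whitespace and case first avoids the silent NaN
encoded_clean = toy_sentiment.str.strip().str.lower().map(mapping)

print("Unmapped before cleaning:", n_unmapped)
print("Unmapped after cleaning:", int(encoded_clean.isna().sum()))
```

On the notebook's data the same check would be `df['sentiment'].isna().sum()` right after the `map` call.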


In [8]:
X = df['review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [9]:
# Text preprocessing:
# Bag of Words converts text into numbers for machine learning by counting how often each word occurs, ignoring grammar and word order.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

# Bag of Words
bow = CountVectorizer(stop_words='english', max_df=0.95, min_df=5)

X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

# Linear SVM
svm_bow = LinearSVC()
svm_bow.fit(X_train_bow, y_train)

y_pred_bow = svm_bow.predict(X_test_bow)

print("Bag of Words + Linear SVM")
print("Accuracy:", accuracy_score(y_test, y_pred_bow))
print(classification_report(y_test, y_pred_bow))


Bag of Words + Linear SVM
Accuracy: 0.8591
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      4961
           1       0.86      0.86      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



In [10]:
'''
TF-IDF (Term Frequency–Inverse Document Frequency) is used to convert text into
numerical features. It assigns higher importance to words that are frequent in
a document but rare across all documents, which often improves classification
performance over raw counts. TfidfVectorizer was already imported in the first
cell.
'''
tfidf = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=5)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

svm_tfidf_linear = LinearSVC()
svm_tfidf_linear.fit(X_train_tfidf, y_train)

y_pred_tfidf = svm_tfidf_linear.predict(X_test_tfidf)

print("TF-IDF + Linear SVM")
print("Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print(classification_report(y_test, y_pred_tfidf))


TF-IDF + Linear SVM
Accuracy: 0.8912
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4961
           1       0.89      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
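The weighting idea behind TF-IDF can be seen on a tiny made-up corpus. In this sketch (default `TfidfVectorizer` settings, illustrative variable names), the word "movie" occurs in every document and therefore gets the lowest inverse document frequency, while a word that occurs in only one document gets a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "movie" is in every document, the adjectives are not
toy_reviews = ["good movie", "bad movie", "great movie"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_reviews).toarray()

col = toy_tfidf.vocabulary_   # term -> column index
idf = toy_tfidf.idf_          # one IDF value per column

print("idf(movie):", idf[col["movie"]])
print("idf(good): ", idf[col["good"]])

# Within the first document both words occur once,
# so the rarer word ends up with the larger weight
first_doc = toy_weights[0]
print("weight(good) :", first_doc[col["good"]])
print("weight(movie):", first_doc[col["movie"]])
```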



In [None]:
## Hyperparameter Tuning using GridSearchCV
'''
Support Vector Machine performance depends on the choice of hyperparameters.
To find the optimal value of the regularization parameter `C`, GridSearchCV is
used.

GridSearchCV performs an exhaustive search over a specified parameter grid and
evaluates each combination using cross-validation. This helps select the model
that generalizes best to unseen data.

The F1-macro score is used as the evaluation metric because it balances precision
and recall across all classes, making it suitable for sentiment classification.
'''


In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10]
}

grid = GridSearchCV(
    LinearSVC(),
    param_grid,
    cv=5,
    scoring='f1_macro'
)

grid.fit(X_train_tfidf, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)


Best Parameters: {'C': 0.1}
Best CV Score: 0.8936856627356857
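One detail worth noting: the search above runs on `X_train_tfidf`, whose vocabulary and IDF weights were fitted on the whole training split, so every validation fold has already influenced the features and the CV score may be slightly optimistic. Wrapping the vectorizer and the classifier in a `Pipeline` refits TF-IDF inside each fold. The sketch below shows the pattern on a made-up corpus; the corpus, the step names `tfidf` and `svm`, and the variable names are assumptions for illustration, and scores on such toy data are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini corpus: 10 positive and 10 negative snippets
pos = ["great movie", "wonderful acting", "loved the plot",
       "excellent film", "brilliant and fun"]
neg = ["terrible movie", "awful acting", "hated the plot",
       "boring film", "dull and slow"]
toy_texts = pos * 2 + neg * 2
toy_labels = [1] * 10 + [0] * 10

# The vectorizer is a pipeline step, so it is refitted on each training fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

toy_grid = GridSearchCV(
    pipe,
    {"svm__C": [0.01, 0.1, 1, 10]},  # step name + "__" + parameter name
    cv=5,
    scoring="f1_macro",
)
toy_grid.fit(toy_texts, toy_labels)

print("Best Parameters:", toy_grid.best_params_)
# The refitted best pipeline accepts raw text directly
toy_prediction = toy_grid.predict(["wonderful film"])
print("Prediction:", toy_prediction)
```

With the notebook's data, the same pipeline would be fitted on `X_train` and scored on `X_test` via `toy_grid.predict(X_test)`, which would also give the tuned model a held-out test score.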


In [None]:
## Best Parameter Selection
'''
The GridSearchCV results indicate that `C = 0.1` gives the best mean F1-macro
score (0.8937) under 5-fold cross-validation.

A smaller value of `C` applies stronger regularization, which encourages a wider
margin and tolerates some training errors, while a larger value of `C` penalizes
training errors more heavily and increases the risk of overfitting.
Among the four values tried, `C = 0.1` gives the best trade-off between bias
and variance.
'''
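The relationship between `C` and the margin can be checked numerically. For a linear SVM with weight vector w, the margin width is 2 / ||w||. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` and `SVC(kernel="linear")` rather than the notebook's `LinearSVC` on TF-IDF features, purely to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, noisy two-class data (assumption: stands in for real features)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, flip_y=0.1, random_state=0,
)

margins = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    # Margin width of a linear SVM is 2 / ||w||
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:<6} margin width={margins[C]:.3f}")
```

The margin shrinks as `C` grows, because the optimizer is allowed larger weights in exchange for fewer training errors.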

In [12]:
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

svd = TruncatedSVD(n_components=300, random_state=42)
X_train_reduced = svd.fit_transform(X_train_tfidf)
X_test_reduced = svd.transform(X_test_tfidf)

svm_rbf = SVC(kernel='rbf', C=1, gamma='scale')
svm_rbf.fit(X_train_reduced, y_train)

y_pred_rbf = svm_rbf.predict(X_test_reduced)

print("TF-IDF + RBF SVM (Reduced Features)")
print("Accuracy:", accuracy_score(y_test, y_pred_rbf))
print(classification_report(y_test, y_pred_rbf))


TF-IDF + RBF SVM (Reduced Features)
Accuracy: 0.8811
              precision    recall  f1-score   support

           0       0.89      0.87      0.88      4961
           1       0.87      0.89      0.88      5039

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000



In [13]:
svm_poly = SVC(kernel='poly', degree=3, C=1)
svm_poly.fit(X_train_reduced, y_train)

y_pred_poly = svm_poly.predict(X_test_reduced)

print("Polynomial Kernel SVM")
print(accuracy_score(y_test, y_pred_poly))


Polynomial Kernel SVM
0.8621


In [14]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)

y_pred_rf = rf.predict(X_test_tfidf)

print("Random Forest")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest
Accuracy: 0.8634
              precision    recall  f1-score   support

           0       0.86      0.87      0.86      4961
           1       0.87      0.85      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



In [None]:
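# On the same TF-IDF features, Linear SVM (accuracy 0.8912) outperformed Random Forest (0.8634), which supports using linear SVMs for text classification.
# Random Forest was, however, marginally ahead of the Polynomial SVM (0.8621) and the Bag of Words Linear SVM (0.8591).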

In [None]:
# Comparison table
"""
| Model              | Feature Representation | Accuracy   | Precision (Avg) | Recall (Avg) | F1-Score (Avg)  |
| ------------------ | ---------------------- | ---------- | --------------- | ------------ | --------------- |
| Linear SVM         | Bag of Words           | **0.8591** | 0.86            | 0.86         | 0.86            |
| Linear SVM         | TF-IDF                 | **0.8912** | 0.89            | 0.89         | 0.89            |
| Linear SVM (Tuned) | TF-IDF (C = 0.1)       | –          | –               | –            | **0.8937 (CV)** |
| RBF SVM            | TF-IDF + SVD           | **0.8811** | 0.88            | 0.88         | 0.88            |
| Polynomial SVM     | TF-IDF + SVD           | **0.8621** | –               | –            | –               |
| Random Forest      | TF-IDF                 | **0.8634** | 0.86            | 0.86         | 0.86            |

The tuned row reports the mean 5-fold cross-validated F1-macro score on the
training split, not test accuracy.
"""
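A table like this can also be built from the predictions instead of being typed by hand. The following is a rough sketch: `summarize` is a hypothetical helper, and the labels are made up so the block runs on its own. In the notebook one would pass `y_test` together with `y_pred_bow`, `y_pred_tfidf`, and so on.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarize(name, y_true, y_pred):
    """Return one table row with accuracy and macro-averaged metrics."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (Avg)": precision,
        "Recall (Avg)": recall,
        "F1-Score (Avg)": f1,
    }


# Made-up labels and predictions, for illustration only
demo_true = [0, 0, 1, 1]
rows = [
    summarize("perfect", demo_true, [0, 0, 1, 1]),
    summarize("one error", demo_true, [0, 1, 1, 1]),
]

summary = pd.DataFrame(rows).round(4)
print(summary.to_string(index=False))
```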


In [None]:
## Results and Interpretation
'''
The best model, a linear SVM on TF-IDF features, reaches 0.8912 accuracy on the
held-out test set, indicating that SVM is effective for sentiment
classification. Its classification report shows precision and recall between
0.88 and 0.90 for both classes, so the misclassifications are spread roughly
evenly between positive and negative reviews.

This suggests that TF-IDF combined with SVM is suitable for text-based sentiment
analysis.
'''
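A confusion matrix gives the raw counts behind per-class precision and recall. Below is a minimal sketch with made-up labels; with the notebook's variables one would pass `y_test` and `y_pred_tfidf` instead of `demo_true` and `demo_pred`, which are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = negative, 1 = positive
demo_true = [0, 0, 1, 1, 1, 0]
demo_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_true, demo_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)  # [[2 1], [1 2]]
print("True negatives:", tn, "False positives:", fp)
print("False negatives:", fn, "True positives:", tp)
```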