
In [None]:
!pip install kaggle



Importing the Dependencies

In [None]:
import os
import json
from zipfile import ZipFile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier



Data Collection: Kaggle API


In [None]:
# read the Kaggle API credentials; the context manager closes the file handle
with open("kaggle.json") as f:
    kaggle_dictionary = json.load(f)

In [None]:
kaggle_dictionary.keys()

dict_keys(['username', 'key'])

In [None]:
# set up Kaggle credentials as environment variables
os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]
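
A minimal sketch of a slightly more defensive way to load the credentials, assuming the same `kaggle.json` layout with `username` and `key` fields. The helper name `load_kaggle_credentials` and its `env` parameter are hypothetical; the demo writes dummy values to a temporary file and a plain dict, so no real credentials or environment variables are touched.

```python
import json
import os
import tempfile


def load_kaggle_credentials(path, env=None):
    """Read a kaggle.json-style file and copy its fields into `env`.

    `env` defaults to os.environ; pass a plain dict to avoid side effects.
    """
    env = os.environ if env is None else env
    with open(path) as f:  # context manager closes the file
        creds = json.load(f)
    missing = {"username", "key"} - set(creds)
    if missing:
        raise KeyError(f"{path} is missing fields: {sorted(missing)}")
    env["KAGGLE_USERNAME"] = creds["username"]
    env["KAGGLE_KEY"] = creds["key"]
    return env


# Demo with dummy values (hypothetical, not real credentials)
demo_env = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "kaggle.json")
    with open(demo_path, "w") as f:
        json.dump({"username": "demo_user", "key": "demo_key"}, f)
    load_kaggle_credentials(demo_path, env=demo_env)

print(demo_env)
```

Calling `load_kaggle_credentials("kaggle.json")` without `env` would write to `os.environ`, as the cell above does.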

In [None]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 97% 25.0M/25.7M [00:02<00:00, 22.1MB/s]
100% 25.7M/25.7M [00:02<00:00, 12.3MB/s]


In [None]:
!ls

imdb-dataset-of-50k-movie-reviews.zip  kaggle.json  sample_data


In [None]:
# unzip the dataset file
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
    zip_ref.extractall()

In [None]:
!ls

'IMDB Dataset.csv'   imdb-dataset-of-50k-movie-reviews.zip   kaggle.json   sample_data


Loading the Dataset

In [None]:
data = pd.read_csv("/content/IMDB Dataset.csv")

In [None]:
data.shape

(50000, 2)

In [None]:
data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive
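
The previews above still contain HTML line breaks such as `<br />`. A minimal sketch of stripping them before vectorizing, offered as an optional step the notebook itself does not perform; `strip_html` is a hypothetical helper and the sample string is made up.

```python
import html
import re


def strip_html(text):
    """Remove HTML tags, decode entities, and collapse repeated whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)   # replace each tag with a space
    unescaped = html.unescape(no_tags)         # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", unescaped).strip()


# Made-up sample in the style of the reviews
sample = "Nice little film.<br /><br />Good pacing &amp; acting."
cleaned = strip_html(sample)
print(cleaned)
```

With a DataFrame like `data`, the same helper could be applied with `data["review"].apply(strip_html)`; whether this changes the scores reported here has not been measured.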


In [None]:
data.tail()

                                                  review  sentiment
49995  I thought this movie did a down right good job...   positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...   negative
49997  I am a Catholic taught in parochial elementary...   negative
49998  I'm going to have to disagree with the previou...   negative
49999  No one expects the Star Trek movies to be high...   negative


In [None]:
data["sentiment"].value_counts()

sentiment  count
positive   25000
negative   25000


In [None]:
# encode the labels as integers: positive -> 1, negative -> 0
data["sentiment"] = data["sentiment"].map({"positive": 1, "negative": 0})


In [None]:
data.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1


In [None]:
data["sentiment"].value_counts()

sentiment  count
1          25000
0          25000


In [None]:
# split data into training data and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
train_x, train_y = train_data["review"], train_data["sentiment"]
test_x, test_y = test_data["review"], test_data["sentiment"]

In [None]:
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)


**Vectorize the data**

In [None]:
tfidf = TfidfVectorizer(max_features=5000)
train_x_vector = tfidf.fit_transform(train_x)
test_x_vector = tfidf.transform(test_x)
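
To make the fit/transform split concrete, here is a small sketch on a made-up corpus: the vectorizer learns its vocabulary and IDF weights from the training texts only, then reuses them for the test texts. The `toy_*` names and the tiny `max_features=5` are illustrative assumptions, not values from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts standing in for train_x and test_x
toy_train = ["a great movie", "a terrible movie", "great acting and a great plot"]
toy_test = ["terrible plot"]

# Keep only the 5 most frequent terms, mirroring max_features=5000 above
toy_tfidf = TfidfVectorizer(max_features=5)
toy_train_vec = toy_tfidf.fit_transform(toy_train)  # learns vocabulary + IDF
toy_test_vec = toy_tfidf.transform(toy_test)        # reuses them, no refitting

print(toy_train_vec.shape)  # one row per document, one column per kept term
print(toy_test_vec.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one score the other.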

In [None]:
print(test_x)

33553    I really liked this Summerslam due to the look...
9427     Not many television shows appeal to quite as m...
199      The film quickly gets to a major chase scene w...
12447    Jane Austen would definitely approve of this o...
39489    Expectations were somewhat high for me when I ...
                               ...                        
28567    Although Casper van Dien and Michael Rooker ar...
25079    I liked this movie. I wasn't really sure what ...
18707    Yes non-Singaporean's can't see what's the big...
15200    As far as films go, this is likable enough. En...
5857     I saw Anatomy years ago -- dubbed at a friends...
Name: review, Length: 10000, dtype: object


In [None]:
train_y = train_data["sentiment"]
test_y = test_data["sentiment"]

In [None]:
print(test_y)

33553    1
9427     1
199      0
12447    1
39489    0
        ..
28567    0
25079    1
18707    1
15200    0
5857     1
Name: sentiment, Length: 10000, dtype: int64


**Model Selection and Training**

Support Vector Machine

In [None]:
svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

Decision Trees

In [None]:
dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

Naive Bayes

In [None]:
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)
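
`GaussianNB` needs a dense array, which is why the cell above converts the TF-IDF matrix with `.toarray()`. A sketch of an alternative, not what this notebook ran: `MultinomialNB` accepts the sparse matrix directly. The toy reviews and labels are made up, and scores on the real data would differ from the Gaussian results reported here.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = positive, 0 = negative
toy_reviews = ["great film", "wonderful acting", "terrible plot", "awful film"]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_reviews)  # remains a scipy sparse matrix

mnb = MultinomialNB()
mnb.fit(toy_X, toy_labels)                  # no .toarray() required

toy_pred = mnb.predict(toy_vec.transform(["wonderful film"]))
print(sparse.issparse(toy_X), toy_pred)
```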

Logistic Regression

In [None]:
log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)


**Model Evaluation**

In [None]:
print("Mean Accuracy:")
print("SVM Accuracy:", svc.score(test_x_vector, test_y))
print("Decision Tree Accuracy:", dec_tree.score(test_x_vector, test_y))
print("Naive Bayes Accuracy:", gnb.score(test_x_vector.toarray(), test_y))
print("Logistic Regression Accuracy:", log_reg.score(test_x_vector, test_y))

Accuracy Comparison Chart

In [None]:
import matplotlib.pyplot as plt

# Define the algorithms and their respective accuracies.
# The values are hard-coded; update them if the models are retrained.
algorithms = ['SVM', 'Decision Tree', 'Naive Bayes', 'Logistic Regression']
accuracies = [0.8949, 0.7082, 0.7997, 0.8955]

# Create a bar chart for the accuracies
plt.figure(figsize=(8, 6))
plt.bar(algorithms, accuracies, color=['blue', 'orange', 'green', 'red'])
plt.ylim(0, 1)  # Set y-axis limits to 0-1 for easier interpretation
plt.xlabel('Algorithms')
plt.ylabel('Accuracy')
plt.title('Mean Accuracy of Different Algorithms')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the accuracy values on top of each bar
for i, v in enumerate(accuracies):
    plt.text(i, v + 0.02, f"{v:.4f}", ha='center', fontweight='bold')

plt.show()
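
The accuracy values in the chart are typed in by hand, so they can drift from what the models actually score. A minimal sketch of collecting the scores in a loop instead, using synthetic data from `make_classification` in place of the TF-IDF matrices; the `models` and `scores` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized reviews
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(),
}

# Fit each model and record its mean accuracy on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# These two lists can be passed to plt.bar in place of hand-typed values
algorithms = list(scores)
accuracies = list(scores.values())
print(dict(zip(algorithms, accuracies)))
```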

F1 Score

In [None]:
print("\nF1 Score:")
print(f"SVM F1 Score: {f1_score(test_y, svc.predict(test_x_vector), average='weighted')}")
print(f"Decision Tree F1 Score: {f1_score(test_y, dec_tree.predict(test_x_vector), average='weighted')}")
print(f"Naive Bayes F1 Score: {f1_score(test_y, gnb.predict(test_x_vector.toarray()), average='weighted')}")
print(f"Logistic Regression F1 Score: {f1_score(test_y, log_reg.predict(test_x_vector), average='weighted')}")

F1 Score Comparison Chart

In [None]:
import matplotlib.pyplot as plt

# Define the algorithms and their respective F1 scores.
# The values are hard-coded; update them if the models are retrained.
algorithms = ['SVM', 'Decision Tree', 'Naive Bayes', 'Logistic Regression']
f1_scores = [0.8948821538096041, 0.7082042953868904, 0.7996908601719831, 0.8954741540229668]

# Create a bar chart for the F1 scores
plt.figure(figsize=(8, 6))
plt.bar(algorithms, f1_scores, color=['blue', 'orange', 'green', 'red'])
plt.ylim(0, 1)  # Set y-axis limits to 0-1 for easier interpretation
plt.xlabel('Algorithms')
plt.ylabel('F1 Score')
plt.title('F1 Score of Different Algorithms')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the F1 score values on top of each bar
for i, v in enumerate(f1_scores):
    plt.text(i, v + 0.02, f"{v:.4f}", ha='center', fontweight='bold')

plt.show()

Classification Report for Logistic Regression

In [None]:
print("\nClassification Report (Logistic Regression):")
print(classification_report(test_y, log_reg.predict(test_x_vector)))


Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4961
           1       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.89      0.89     10000
weighted avg       0.90      0.90      0.89     10000



Confusion Matrix for Logistic Regression

In [None]:
print("\nConfusion Matrix (Logistic Regression):")
print(confusion_matrix(test_y, log_reg.predict(test_x_vector)))


Confusion Matrix (Logistic Regression):
[[4376  585]
 [ 465 4574]]
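
The figures in the classification report can be re-derived from this confusion matrix. A short worked sketch, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and using the matrix printed above:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[4376, 585],
               [465, 4574]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all reviews classified correctly
precision_pos = tp / (tp + fp)        # of predicted positives, how many are right
recall_pos = tp / (tp + fn)           # of true positives, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_pos:.4f} recall={recall_pos:.4f} f1={f1_pos:.4f}")
```

Rounded to two decimals, the precision, recall, and F1 for class 1 match the report's 0.89, 0.91, and 0.90.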


Hyperparameter Tuning for Logistic Regression using GridSearchCV

In [None]:
params = {'C': [0.1, 1, 10, 100], 'solver': ['liblinear', 'lbfgs']}
log_reg_grid = GridSearchCV(LogisticRegression(), params, cv=5)
log_reg_grid.fit(train_x_vector, train_y)

In [None]:
print("\nBest parameters found for Logistic Regression:", log_reg_grid.best_params_)
print("Best cross-validated score for Logistic Regression:", log_reg_grid.best_score_)


Best parameters found for Logistic Regression: {'C': 1, 'solver': 'lbfgs'}
Best cross-validated score for Logistic Regression: 0.8878
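
In the search above, the TF-IDF vectorizer was fitted on all of the training reviews before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. A sketch of one way to avoid that, assuming a `Pipeline` that refits the vectorizer inside every fold; the toy data, the step names `tfidf` and `clf`, and the reduced grid are illustrative, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up reviews: 1 = positive, 0 = negative
toy_reviews = [
    "great film", "wonderful acting", "loved the plot", "fantastic movie",
    "terrible film", "awful acting", "hated the plot", "boring movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                  # refitted on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10]}
toy_grid = GridSearchCV(pipe, param_grid, cv=2)
toy_grid.fit(toy_reviews, toy_labels)              # raw text in, no manual transform

print(toy_grid.best_params_)
toy_pipeline_pred = toy_grid.predict(["wonderful film"])
```

Because the fitted search wraps the whole pipeline, `toy_grid.predict` takes raw strings directly.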


Example usage of the best model from GridSearch for prediction

In [None]:
best_log_reg = log_reg_grid.best_estimator_


Function to predict sentiment using the best Logistic Regression model

In [None]:
def predict_sentiment_logreg(review):
    review_vector = tfidf.transform([review])  # Transform the review into the same vector format
    prediction = best_log_reg.predict(review_vector)
    return "positive" if prediction[0] == 1 else "negative"

Example prediction

In [None]:
new_review = "This movie was fantastic!"
print(f"\nThe sentiment of the review '{new_review}' is: {predict_sentiment_logreg(new_review)}")


The sentiment of the review 'This movie was fantastic!' is: positive
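
The function above returns only a label. Logistic regression also exposes class probabilities through `predict_proba`, which can be reported as a rough confidence. A minimal sketch on made-up data; `predict_with_confidence` and the `toy_*` objects are hypothetical stand-ins for `predict_sentiment_logreg`, `tfidf`, and `best_log_reg`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up reviews: 1 = positive, 0 = negative
toy_reviews = [
    "great film", "wonderful acting", "loved the plot", "fantastic movie",
    "terrible film", "awful acting", "hated the plot", "boring movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_tfidf = TfidfVectorizer()
toy_clf = LogisticRegression()
toy_clf.fit(toy_tfidf.fit_transform(toy_reviews), toy_labels)


def predict_with_confidence(review):
    """Return (label, probability of that label) for one review string."""
    proba = toy_clf.predict_proba(toy_tfidf.transform([review]))[0]
    best = proba.argmax()  # index of the most probable class
    label = "positive" if toy_clf.classes_[best] == 1 else "negative"
    return label, float(proba[best])


label, confidence = predict_with_confidence("fantastic movie")
print(label, round(confidence, 3))
```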


Classification Report for SVM

In [None]:
print("\nClassification Report (SVM):")
print(classification_report(test_y, svc.predict(test_x_vector)))


Classification Report (SVM):
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      4961
           1       0.89      0.90      0.90      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



Confusion Matrix for SVM

In [None]:
print("\nConfusion Matrix (SVM):")
conf_mat = confusion_matrix(test_y, svc.predict(test_x_vector))
print(conf_mat)


Confusion Matrix (SVM):
[[4387  574]
 [ 485 4554]]


Hyperparameter Tuning for SVM using GridSearchCV

In [None]:
params = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc_grid = GridSearchCV(SVC(), params, cv=5)
svc_grid.fit(train_x_vector, train_y)

In [None]:
print("\nBest parameters found for SVM:", svc_grid.best_params_)
print("Best cross-validated score for SVM:", svc_grid.best_score_)
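
This grid covers 10 parameter combinations with 5 folds each, and scikit-learn's documentation notes that `SVC` fit time scales at least quadratically with the number of samples, so the search may run for a long time on 40,000 reviews. A sketch of a faster option for the linear case only, swapping in `LinearSVC` (a different implementation, so results may not match `SVC(kernel='linear')` exactly); the toy data and grid are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Made-up reviews: 1 = positive, 0 = negative
toy_reviews = [
    "great film", "wonderful acting", "loved the plot", "fantastic movie",
    "terrible film", "awful acting", "hated the plot", "boring movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_reviews)

# Search over C only; LinearSVC has no kernel parameter
toy_svc_grid = GridSearchCV(
    LinearSVC(random_state=0),
    {"C": [1, 4, 8, 16, 32]},
    cv=2,
)
toy_svc_grid.fit(toy_X, toy_labels)

toy_svc_pred = toy_svc_grid.predict(toy_vec.transform(["awful film"]))
print(toy_svc_grid.best_params_, toy_svc_pred)
```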

Example usage of the best model from GridSearch for prediction

In [None]:
best_svc = svc_grid.best_estimator_
def predict_sentiment(review):
    review_vector = tfidf.transform([review])
    prediction = best_svc.predict(review_vector)
    return "positive" if prediction[0] == 1 else "negative"

Example prediction

In [None]:
new_review = "This movie was fantastic!"
print(f"\nThe sentiment of the review '{new_review}' is: {predict_sentiment(new_review)}")