<a href="https://colab.research.google.com/github/Shaffizy/Sentimental-Analysist/blob/main/Sentimental_Analysis_of_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SENTIMENT ANALYSIS**
Sentiment analysis of movie reviews is a natural language processing (NLP) technique that involves analyzing and interpreting the sentiments expressed in written movie reviews. The primary goal is to determine whether a review conveys a positive, negative, or neutral sentiment about a movie. This process typically involves several steps, including data preprocessing, feature extraction, and model training.

How It Works:
Data Collection: The process begins by gathering a large dataset of movie reviews, often from online platforms like IMDb, Rotten Tomatoes, or specific datasets provided for sentiment analysis tasks. Each review is typically labeled with its corresponding sentiment (e.g., positive or negative).

Data Preprocessing: Before analysis, the text data is cleaned and prepared. This step may involve removing stopwords (common words like "the" or "and"), stemming or lemmatizing words to their root forms, and converting the text into a format that a machine learning model can understand.
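The cleaning step described above can be sketched in plain Python. This is only an illustration: the `STOPWORDS` set and the `preprocess` helper are hypothetical names that do not appear elsewhere in this notebook, and the notebook itself relies on the Keras `Tokenizer` rather than on this function. The HTML-stripping step assumes that scraped reviews may contain tags such as `<br />`.

```python
import re

# A tiny illustrative stopword list; real projects usually use a fuller one.
STOPWORDS = {"the", "and", "a", "an", "is", "it", "of", "to", "was", "this"}

def preprocess(text):
    """Lowercase, strip HTML tags and punctuation, then drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # remove tags such as <br />
    tokens = re.findall(r"[a-z']+", text)          # keep alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("This movie was GREAT!<br />The acting is superb."))
# ['movie', 'great', 'acting', 'superb']
```

Stemming or lemmatization would be a further step on top of this and is left out to keep the sketch short.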

Feature Extraction: The next step is to convert the raw text into numerical features. This can be done using techniques like Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF), or word embeddings (e.g., Word2Vec, GloVe). These features represent the text in a way that a machine learning model can process.
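To make Bag of Words and TF-IDF concrete, here is a minimal sketch on three made-up documents. It uses the plain textbook formula (term frequency times log of inverse document frequency); libraries often apply smoothed variants, so their numbers may differ. The `docs` list and the `tf_idf` helper are hypothetical and exist only for this example.

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "boring movie",
    "great fun",
]
tokenized = [d.split() for d in docs]

# Bag of Words: raw term counts per document
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, doc_index):
    """Plain TF-IDF: (count / document length) * log(N / df)."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(len(docs) / df[term])
    return tf * idf

print(bow[0])
print(round(tf_idf("great", 0), 4))    # frequent in the document, but common overall
print(round(tf_idf("boring", 1), 4))   # appears in one document only, so weighted higher
```

Note that this notebook does not use either technique directly; it feeds integer word indices into an Embedding layer instead.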

Model Training: A machine learning model, such as a logistic regression classifier, support vector machine (SVM), or a neural network, is trained on the labeled data. The model learns to associate certain words, phrases, or patterns in the text with positive or negative sentiments.

Prediction and Evaluation: Once the model is trained, it can be used to predict the sentiment of new, unseen movie reviews. The performance of the model is typically evaluated using metrics such as accuracy, precision, recall, and F1-score. More detailed evaluations might include analyzing the model’s performance using a confusion matrix, which provides insight into how well the model distinguishes between positive and negative reviews.

**Importing the Dependencies**
In this code, I import the following dependencies: Pandas, Scikit-learn, and TensorFlow.
Pandas loads and inspects the data in this notebook.
Scikit-learn splits the data into training and test sets.
TensorFlow builds and trains the model once the data has been split. We use the Dense, Embedding, and LSTM layers, together with the Tokenizer and pad_sequences utilities that turn text into fixed-length integer sequences.

In [None]:
import os
import json

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

**Data Collection-kaggle API**
This function loads the file "kaggle.json", which contains my Kaggle API credentials (username and API key), and exports them as environment variables so that the Kaggle CLI can download the dataset "imdb-dataset-of-50k-movie-reviews".

In [None]:
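def Data_Collection():
    # Read the Kaggle API credentials; the with-block closes the file afterwards
    with open("kaggle.json") as credentials_file:
        kaggle_dictionary = json.load(credentials_file)
    # Expose the credentials to the Kaggle CLI through environment variables
    os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
    os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]

Data_Collection()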

In [None]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


**Data Extraction**
This function unzips "imdb-dataset-of-50k-movie-reviews.zip" so that the CSV file inside it can be loaded with pandas.


In [None]:
def Data_Extraction():
    # Extract the downloaded archive into the current working directory
    with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
        zip_ref.extractall()

Data_Extraction()
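Before extracting, it can help to check what an archive actually contains. The sketch below builds a small in-memory zip so that it runs without the Kaggle download; the one-row CSV content is made up for illustration, and only the file name mirrors the one read later in this notebook.

```python
import io
from zipfile import ZipFile

# Build a small in-memory archive so the example runs without the Kaggle download
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("IMDB Dataset.csv", "review,sentiment\nGreat film,positive\n")

# Inspect the archive before extracting anything
with ZipFile(buffer, "r") as archive:
    names = archive.namelist()
    header = archive.read(names[0]).decode("utf-8").splitlines()[0]

print(names, header)
```

The same `namelist()` call works on the real zip file and shows which CSV name to pass to pandas.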


**Data Processing**
In this function I use pandas to read the CSV file into the variable *data*. I then replace the "positive" and "negative" values in the "sentiment" column with the integers 1 and 0, because the model needs numeric labels. After that I split the data into *train* and *test* sets with scikit-learn (80% for training, 20% for testing). Tokenization then replaces each word with a corresponding integer, using a vocabulary fitted on the training reviews only, and every review is padded or truncated to 200 tokens so that all inputs have the same length. Finally, X (train, test) holds the tokenized "review" values and Y (train, test) holds the "sentiment" labels.

In [None]:
def Data_Processing():
    data = pd.read_csv("/content/IMDB Dataset.csv")
    data.replace({"sentiment": {"positive": 1, "negative": 0}}, inplace=True)
    train_data, test_data = train_test_split(data, test_size=0.2, random_state=70)
    print(train_data.shape)
    print(test_data.shape)

    return train_data , test_data

In [None]:
# Initialize the values of train and test data
train_data , test_data = Data_Processing()

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])

X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)
print(X_train)
print(X_test)

Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]
print(Y_train)
print(Y_test)

(40000, 2)
(10000, 2)
[[   0    0    0 ...   78  174  204]
 [   0    0    0 ...  377   11   19]
 [   0    0    0 ...   47   56  527]
 ...
 [   0    0    0 ...  112  509  315]
 [   0    0    0 ...    2  548  346]
 [   0    0    0 ... 1469   38 2260]]
[[ 154    4  259 ... 2451   71   12]
 [   0    0    0 ...  148    5   68]
 [   0    0    0 ...   21  276  146]
 ...
 [ 292    2   16 ...   76 1879  644]
 [   0    0    0 ...    2  353 1076]
 [   7 1829  266 ...   49  683    5]]
22375    0
43545    1
41827    1
41018    1
46867    1
        ..
21563    1
25916    1
44824    0
21618    0
23886    0
Name: sentiment, Length: 40000, dtype: int64
14397    1
2775     1
28441    0
7544     0
13908    0
        ..
40441    0
13069    0
46889    1
34084    1
47718    0
Name: sentiment, Length: 10000, dtype: int64
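The leading zeros in the arrays printed above come from padding. The sketch below approximates what `Tokenizer` and `pad_sequences` do with their default settings: words are ranked by frequency, unknown words are dropped, and short sequences are padded on the left. It is a simplification under those assumptions, not the Keras implementation (for example, it does not strip punctuation), and the helper names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Replace words with ranks, dropping unknown or too-rare words."""
    ranks = (word_index.get(word) for word in text.lower().split())
    return [r for r in ranks if r is not None and r < num_words]

def pad(sequence, maxlen):
    """Left-pad with zeros; if too long, keep only the last maxlen items."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

word_index = fit_word_index(["good movie", "bad movie", "good good fun"])
print(pad(to_sequence("good movie tonight", word_index, num_words=5000), maxlen=6))
# [0, 0, 0, 0, 1, 2]
```

The word "tonight" was never seen during fitting, so it disappears from the sequence. The same thing happens to any word in a new review that falls outside the fitted vocabulary.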


**LSTM - Long Short-Term Memory**
This function builds the model from the Embedding, LSTM, and Dense layers provided by TensorFlow. I then compile the model with the Adam optimizer and binary cross-entropy loss, train it, and finally evaluate it on the test data to report its accuracy.
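The training cell further down compiles the model with `loss="binary_crossentropy"`. As a rough illustration of what that loss measures, here is a hand-written version for a handful of predictions. The function below is a sketch of the standard formula and is not the Keras code; the clipping value `eps` is an assumption.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))   # confident and correct: low loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))   # confident and wrong: high loss
```

The sigmoid output of the final Dense layer supplies the probabilities `p`, which is why the loss and the activation are chosen together.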

In [None]:
def Trainning():
    # build the model

    model = Sequential()
    model.add(Embedding(input_dim=5000, output_dim=128, input_length=200))
    model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation="sigmoid"))
    model.summary()

    return model
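To get a feel for the size of this model, the trainable parameters of each layer can be counted by hand. The arithmetic below assumes the standard Keras definitions of these layers, so the totals should agree with what `model.summary()` prints.

```python
vocab_size, embed_dim, lstm_units = 5000, 128, 128

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 640000 131584 129 771713
```

Most of the parameters sit in the Embedding layer, so the vocabulary size has the largest effect on model size.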


In [None]:
# compile the model
model = Trainning()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# training the Model
model.fit(X_train, Y_train, epochs=8, batch_size=64, validation_split=0.2)

# model Evaluation
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")




Epoch 1/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 308s 610ms/step - accuracy: 0.7224 - loss: 0.5352 - val_accuracy: 0.5726 - val_loss: 0.6530
Epoch 2/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 297s 560ms/step - accuracy: 0.7496 - loss: 0.4991 - val_accuracy: 0.8331 - val_loss: 0.3822
Epoch 3/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 312s 540ms/step - accuracy: 0.8516 - loss: 0.3506 - val_accuracy: 0.8559 - val_loss: 0.3568
Epoch 4/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 341s 579ms/step - accuracy: 0.8844 - loss: 0.2886 - val_accuracy: 0.8764 - val_loss: 0.3116
Epoch 5/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 298s 531ms/step - accuracy: 0.9012 - loss: 0.2505 - val_accuracy: 0.8700 - val_loss: 0.3254
Epoch 6/8
500/500 ━━━━━━━━━━━━━━━━━━━━ 322s 530ms/step - accuracy: 0.9166 - loss: 0.2169 - val_accuracy: 0.8759 - val_loss: 0.3079
Epoch 7/8
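The `500/500` counter in the training log follows directly from the data size, the validation split, and the batch size passed to `model.fit`. The short calculation below reproduces it, and also the `313/313` counter that appears when predicting on the test set. The prediction batch size of 32 is an assumption based on the Keras default, since `model.predict` is called without one.

```python
import math

train_rows = 40000        # rows in train_data, from the shape printed earlier
test_rows = 10000         # rows in test_data
validation_split = 0.2    # fraction of train_data that model.fit holds out
batch_size = 64

val_rows = round(train_rows * validation_split)   # rows behind val_accuracy / val_loss
fit_rows = train_rows - val_rows                  # rows the weights are updated on
steps_per_epoch = math.ceil(fit_rows / batch_size)

# Assumption: model.predict was called without batch_size, so Keras uses 32
predict_steps = math.ceil(test_rows / 32)

print(fit_rows, val_rows, steps_per_epoch, predict_steps)
# 32000 8000 500 313
```

So each epoch trains on 32,000 reviews and reports `val_accuracy` on the remaining 8,000, which are separate from the 10,000 test reviews.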


**Building a Predictive System**
This section tests and verifies what the model has learned. I first evaluate it on the test set, then try it on several new reviews that I wrote myself.


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Predict on the test data
Y_pred = model.predict(X_test)

# For binary classification, convert predictions to class labels
Y_pred_classes = (Y_pred > 0.5).astype("int32")

# Compute the confusion matrix
conf_matrix = confusion_matrix(Y_test, Y_pred_classes)
print("Confusion Matrix:\n", conf_matrix)

# Optionally, get a full classification report
print("Classification Report:\n", classification_report(Y_test, Y_pred_classes))

313/313 ━━━━━━━━━━━━━━━━━━━━ 40s 126ms/step
Confusion Matrix:
 [[4345  638]
 [ 577 4440]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.87      0.88      4983
           1       0.87      0.88      0.88      5017

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000
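The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below does this for class 1 (positive), assuming the scikit-learn layout in which rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 4345, 638
fn, tp = 577, 4440

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # of reviews predicted positive, how many were positive
recall = tp / (tp + fn)         # of truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.88 precision=0.87 recall=0.88 f1=0.88
```

The row sums (4983 and 5017) match the `support` column, which confirms the layout assumption.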



In [None]:
def predict_sentiment(review):
  # tokenize and pad the review
  sequence = tokenizer.texts_to_sequences([review])
  padded_sequence = pad_sequences(sequence, maxlen=200)
  prediction = model.predict(padded_sequence)
  sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
  return sentiment
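`predict_sentiment` returns only a label, so a prediction of 0.51 and one of 0.99 look the same. A small variation, sketched below, also reports how far the probability is from the other class. The helper name `label_with_confidence` is hypothetical and works on a plain probability, so it can be tried without the trained model.

```python
def label_with_confidence(probability, threshold=0.5):
    """Turn a sigmoid output into a label plus the probability of that label."""
    label = "positive" if probability > threshold else "negative"
    confidence = probability if label == "positive" else 1 - probability
    return label, confidence

print(label_with_confidence(0.9))
print(label_with_confidence(0.2))
```

With the trained model, the input would be `prediction[0][0]` from `predict_sentiment`.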

In [None]:
# example usage
new_review = "what exactly is this movie about, I cant make sense of it"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 74ms/step
The sentiment of the review is: negative


In [None]:
# example usage
new_review = "This movie was not that good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 174ms/step
The sentiment of the review is: negative


In [None]:
# example usage
new_review = "This movie was bad"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 73ms/step
The sentiment of the review is: negative
