<a href="https://colab.research.google.com/github/OsirisEscaL/Machine_learning/blob/main/Performing_Sentiment_Analysis_on_Movie_Reviews_Using_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Performing Sentiment Analysis on Movie Reviews Using Scikit-Learn

Sentiment analysis, or opinion mining, is a natural language processing (NLP) technique that determines a text's sentiment or emotive tone. There are numerous applications for sentiment analysis, ranging from analyzing movie reviews and social media posts to comprehending customer feedback. This article will use Python's Scikit-Learn library to conduct sentiment analysis on text data. We will use an actual dataset, experiment with different vectorization techniques, and train classifiers to determine whether a given text expresses a positive or negative sentiment.




**Dataset**

We will use the IMDB movie reviews dataset for our sentiment analysis task. This dataset contains movie reviews and their corresponding positive or negative sentiment labels. It is available for download from [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

**Step 1: Importing Essential Libraries**

We begin by importing the Python libraries needed for the project:

In [50]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

**Step 2: Loading and Preprocessing the Dataset**

Once the dataset has been downloaded and extracted, it will be loaded and preprocessed.

In [51]:
# Load the dataset
data = pd.read_csv('IMDB Dataset.csv')
data.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
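
The preview above shows that some reviews still contain HTML line-break tags (`<br />`). With `CountVectorizer`'s default token pattern, `br` would be counted as an ordinary word. The notebook does not remove these tags, so the following is only an optional cleaning sketch; the helper name `strip_br_tags`, the `review_clean` column, and the sample sentences are made up for illustration.

```python
import re

import pandas as pd

# Made-up stand-in for the `review` column of the IMDB DataFrame.
sample = pd.DataFrame({
    "review": [
        "Great acting. <br /><br />The ending felt rushed.",
        "No markup in this one.",
    ]
})

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(text: str) -> str:
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample["review_clean"] = sample["review"].apply(strip_br_tags)
print(sample["review_clean"].tolist())
# ['Great acting. The ending felt rushed.', 'No markup in this one.']
```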


In [52]:
# Convert to numerical values where 'negative' is 0 and 'positive' is 1
label_mapping = {'negative': 0, 'positive': 1}
data['sentiment'] = data['sentiment'].map(label_mapping)
data.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
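
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN` instead of raising an error. The sketch below illustrates this on a few made-up labels (the stray `' Positive '` entry is hypothetical and is not claimed to occur in the IMDB file) and shows one way to check that every label was mapped.

```python
import pandas as pd

# Made-up labels; the last one has extra whitespace and capitalisation.
labels = pd.Series(["positive", "negative", " Positive "])
label_mapping = {"negative": 0, "positive": 1}

# Values missing from the dictionary are silently mapped to NaN.
mapped = labels.map(label_mapping)
print(mapped.isna().sum())  # 1

# Normalising the strings first makes the mapping robust to such variations.
normalised = labels.str.strip().str.lower().map(label_mapping)
assert normalised.notna().all(), "unexpected label found"
print(normalised.astype(int).tolist())  # [1, 0, 1]
```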


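To keep training time manageable, we reduce the dataset to 5,000 reviews while preserving the class proportions. The sample is drawn without a fixed random seed, so the exact scores reported below will vary from run to run.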

In [53]:
# Reduce the dataset
class_proportions = data['sentiment'].value_counts(normalize=True)
small_dataset_size = 5000
smaller_dataset = data.groupby('sentiment').apply(lambda x: x.sample(int(small_dataset_size * class_proportions[x.name]))).reset_index(drop=True)
smaller_dataset.groupby('sentiment').count()

sentiment,review
0,2500
1,2500


In [54]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(smaller_dataset['review'], smaller_dataset['sentiment'], test_size=0.2, random_state=42)
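
The split above is random with respect to the labels, so the class balance in the test set can drift slightly away from 50/50. If an exact balance matters, `train_test_split` accepts a `stratify` argument. The sketch below demonstrates it on a toy DataFrame (`toy` is a stand-in for `smaller_dataset`, not the real data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset standing in for `smaller_dataset` (10 rows per class).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": [0] * 10 + [1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"],
    toy["sentiment"],
    test_size=0.2,
    random_state=42,
    stratify=toy["sentiment"],  # keep the 50/50 class ratio in both splits
)

print(y_tr.value_counts().to_dict())  # 8 rows of each class
print(y_te.value_counts().to_dict())  # 2 rows of each class
```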

**Step 3: Feature Extraction with Count and TF-IDF Vectorization**

Before building and training a sentiment analysis model, we need to convert the raw text into numerical features. We will compare two representations. Count vectorization represents each review by the number of times each vocabulary word occurs in it. TF-IDF (Term Frequency-Inverse Document Frequency) starts from the same counts but down-weights words that appear in many documents, so words shared by most reviews contribute less than words that are distinctive for a particular review.

In [55]:
# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Transform the text data using TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
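
As an aside, scikit-learn also provides `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in a single step. Wrapping it in a pipeline with a classifier means the vocabulary is always fitted on the training text only and new text can be classified directly. The following is a minimal sketch on a tiny made-up training set, not the setup used for the results in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = positive, 0 = negative).
train_texts = [
    "a wonderful and moving film",
    "wonderful acting and a great story",
    "a terrible and boring film",
    "boring plot and terrible acting",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier on the result.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer before predicting.
predictions = clf.predict(["what a wonderful story", "terrible and boring"])
print(predictions)  # [1 0]
```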

**Step 4: Building and Training the Classifiers**

Now that our data has been prepared and preprocessed, let's build and train our sentiment analysis classifier. We will conduct experiments with various classifiers.

In [56]:
# Create a dictionary to store models
models = {
    'Naive Bayes Classifier': MultinomialNB(),
    'Support Vector Classifier': SVC(),
    'Decision Tree Classifier': DecisionTreeClassifier(),
    'Random Forest Classifier': RandomForestClassifier(),
    'Gradient Boosting Classifier': GradientBoostingClassifier(),
}

In [57]:
# Train and evaluate each model using TfidfTransformer to preprocess the text data
results_tfidf = {}
for model_name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results_tfidf[model_name] = {'Accuracy': accuracy, 'F1': f1}

In [58]:
# Train and evaluate each model using CountVectorizer to preprocess the text data
results_counts = {}
for model_name, model in models.items():
    model.fit(X_train_counts, y_train)
    y_pred = model.predict(X_test_counts)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results_counts[model_name] = {'Accuracy': accuracy, 'F1': f1}
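
The two loops above score each model with `accuracy_score` and `f1_score`. As a reminder of what these numbers mean, the sketch below recomputes both metrics by hand from a confusion matrix. The `y_true` and `y_hat` lists are made up for illustration and are not taken from the IMDB data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # predicted positives that are correct
recall = tp / (tp + fn)                     # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, accuracy_score(y_true, y_hat))  # both 0.8
print(f1, f1_score(y_true, y_hat))              # both 0.75
```

On a balanced dataset such as this one, accuracy and F1 tend to be close, which matches the result tables in this article.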

We evaluate the performance of the trained classifiers with two metrics: accuracy and F1 score. We first show the results for the text data transformed with TfidfTransformer.

In [59]:
results_tfidf = pd.DataFrame(results_tfidf)
results_tfidf

Metric,Naive Bayes Classifier,Support Vector Classifier,Decision Tree Classifier,Random Forest Classifier,Gradient Boosting Classifier
Accuracy,0.824,0.85,0.684,0.815,0.806
F1,0.809524,0.850895,0.685885,0.811801,0.81165


Next, we show the results for the text data converted with CountVectorizer.

In [60]:
results_counts = pd.DataFrame(results_counts)
results_counts

Metric,Naive Bayes Classifier,Support Vector Classifier,Decision Tree Classifier,Random Forest Classifier,Gradient Boosting Classifier
Accuracy,0.819,0.8,0.689,0.84,0.81
F1,0.810073,0.80198,0.688064,0.84127,0.819048
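
The two tables are easier to compare side by side. The sketch below copies the accuracy values from the tables above into one DataFrame (the column names `TF-IDF`, `Counts`, and `Difference` are chosen here for illustration) and reports the best model for each representation. Since these numbers come from a single run on 1,000 test reviews, small differences should not be over-interpreted.

```python
import pandas as pd

# Accuracy values copied from the two result tables above.
accuracy = pd.DataFrame(
    {
        "TF-IDF": [0.824, 0.850, 0.684, 0.815, 0.806],
        "Counts": [0.819, 0.800, 0.689, 0.840, 0.810],
    },
    index=[
        "Naive Bayes Classifier",
        "Support Vector Classifier",
        "Decision Tree Classifier",
        "Random Forest Classifier",
        "Gradient Boosting Classifier",
    ],
)

# Positive values mean TF-IDF scored higher than raw counts for that model.
accuracy["Difference"] = (accuracy["TF-IDF"] - accuracy["Counts"]).round(3)
print(accuracy.sort_values("TF-IDF", ascending=False))

# Best-scoring model for each representation.
best = accuracy[["TF-IDF", "Counts"]].idxmax()
print(best)
```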


**Conclusion**

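Sentiment analysis is a valuable NLP technique for making sense of customer feedback, social media posts, and similar text. In this experiment on 5,000 IMDB reviews, the support vector classifier performed best on TF-IDF features (accuracy 0.85), the random forest performed best on raw counts (accuracy 0.84), and the decision tree was the weakest model in both settings (accuracy of about 0.69). The choice of vectorizer and the choice of classifier therefore interact, and it is worth evaluating several combinations before settling on one.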