
# Sentiment Analysis (NLP)

- IMDB Movie Reviews

### Import necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

### Load the dataset and explore it

In [None]:
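# Load the dataset. quoting=3 is csv.QUOTE_NONE, so quote characters are treated as ordinary text.
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False;
# rows that fail to parse are dropped silently, so check the resulting shape below.
df = pd.read_csv("/content/IMDB_dataset-1.csv", quoting=3, on_bad_lines='skip')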

# Drop rows with NaN values in the 'sentiment' column
df = df.dropna(subset=['sentiment'])

# Display the first few rows
print(df.head())

# Check the shape of the dataset
print("Dataset Shape:", df.shape)
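A note on the loading options: with `quoting=3` (`csv.QUOTE_NONE`), a comma inside a quoted review counts as a field separator, so that row has too many fields and `on_bad_lines='skip'` discards it. The sketch below uses a hypothetical three-review CSV (not the real dataset) to show how the row count can differ from pandas' default quoting. Whether this affects `IMDB_dataset-1.csv` depends on how that file is quoted, which is an assumption here.

```python
import csv
import io

import pandas as pd

# Hypothetical sample; the second review contains a comma inside quotes
sample = (
    'review,sentiment\n'
    '"Dull",negative\n'
    '"Great film, loved it",positive\n'
    '"Fine",positive\n'
)

# Same options as the cell above: quotes are ordinary characters,
# so the second review splits into two fields and the row is skipped
no_quoting = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE, on_bad_lines="skip")

# pandas defaults: a quoted field may contain commas
default = pd.read_csv(io.StringIO(sample))

print("rows with quoting=3:", len(no_quoting))
print("rows with default quoting:", len(default))
```

Comparing `df.shape` under both settings is a quick way to see how many reviews are lost to skipping.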


### Data Preprocessing:

In [None]:
# Tokenization and cleaning
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Check if text is a string before processing
    if isinstance(text, str):
        # Tokenize the text
        words = nltk.word_tokenize(text)

        # Lowercase each token, then drop stopwords and non-alphabetic tokens
        # (NLTK's stopword list is lowercase, so compare the lowercased form)
        cleaned_words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]

        return ' '.join(cleaned_words)
    else:
        # Handle non-string values (e.g., NaN) by returning an empty string
        return ''

# Apply preprocessing to the 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
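One detail worth checking in a filter like this is the order of lowercasing and stopword lookup. NLTK's English stopword list is all lowercase, so a capitalized token such as "The" only matches if it is lowercased before the lookup. The sketch below illustrates the difference with a hypothetical four-word stopword set and a regex tokenizer standing in for `nltk.word_tokenize`, so that it runs without any downloads.

```python
import re

# Hypothetical miniature stopword list (lowercase, like NLTK's English list)
STOP_WORDS = {"the", "a", "is", "this"}

def clean(text, lowercase_before_lookup):
    # Simple regex tokenizer standing in for nltk.word_tokenize
    tokens = re.findall(r"[A-Za-z]+", text)
    kept = []
    for token in tokens:
        # Choose which form of the token is compared against the stopword set
        key = token.lower() if lowercase_before_lookup else token
        if key not in STOP_WORDS:
            kept.append(token.lower())
    return " ".join(kept)

review = "The plot is thin. This cast is great!"
print(clean(review, lowercase_before_lookup=False))  # the plot thin this cast great
print(clean(review, lowercase_before_lookup=True))   # plot thin cast great
```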


### Split the dataset into training and testing sets:

In [None]:
X = df['cleaned_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
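The split above is purely random. On a small dataset, a stratified split keeps the positive/negative ratio the same in the training and test sets. This is an optional variation rather than what the notebook does; the sketch uses hypothetical toy data with 50 labels per class.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 50 negative and 50 positive labels
texts = [f"review {i}" for i in range(100)]
labels = ["negative"] * 50 + ["positive"] * 50

# stratify=labels preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_tr))
print("test:", Counter(y_te))
```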


### Feature Extraction:

In [None]:
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)


### Build and train a sentiment analysis model:

In [None]:
model = MultinomialNB()
model.fit(X_train_bow, y_train)
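As an illustration of what the classifier learns from word counts, the sketch below trains the same kind of model on four hypothetical reviews. It chains the vectorizer and the classifier with `make_pipeline`, which is a convenience not used in the notebook itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set, two reviews per class
train_texts = [
    "great wonderful film",
    "loved great acting",
    "terrible boring film",
    "awful boring plot",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline vectorizes the text and feeds the counts to Naive Bayes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Words seen mostly in one class pull the prediction toward that class
preds = clf.predict(["great acting", "boring awful"])
print(list(preds))
```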

### Make predictions and evaluate the model:

In [None]:
y_pred = model.predict(X_test_bow)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.6111111111111112
Classification Report:
               precision    recall  f1-score   support

    negative       0.62      0.62      0.62        37
    positive       0.60      0.60      0.60        35

    accuracy                           0.61        72
   macro avg       0.61      0.61      0.61        72
weighted avg       0.61      0.61      0.61        72



## Interpretation:

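Accuracy: The model is correct about 61% of the time (44 of the 72 test reviews) when predicting whether a review is positive or negative. With 37 negative and 35 positive reviews in the test set, always predicting "negative" would already score about 51%, so this is only a modest improvement over a trivial baseline.

Precision and recall for "negative":

- Precision: when the model predicts a review as "negative", it is right about 62% of the time.
- Recall: when a review is actually negative, the model catches about 62% of them.

Precision and recall for "positive":

- Precision: when the model predicts a review as "positive", it is right about 60% of the time.
- Recall: when a review is actually positive, the model catches about 60% of them.

F1-score: The harmonic mean of precision and recall, giving a single balanced measure of correctness per class (0.62 for negative, 0.60 for positive).

Support: The number of test reviews in each category (37 negative, 35 positive).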
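The metrics can be recomputed by hand from per-class counts. The notebook does not print a confusion matrix, so the counts below are inferred from the supports (37 and 35) and the rounded recalls in the report (23 of 37 negatives and 21 of 35 positives classified correctly). Treat them as a reconstruction, not as notebook output.

```python
# Inferred counts (rows: actual class, columns: predicted class)
neg_as_neg, neg_as_pos = 23, 14  # 37 actual negatives
pos_as_neg, pos_as_pos = 14, 21  # 35 actual positives

def precision_recall_f1(tp, fp, fn):
    # precision: share of predictions for a class that are correct
    # recall: share of actual members of a class that are found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

neg_metrics = precision_recall_f1(tp=neg_as_neg, fp=pos_as_neg, fn=neg_as_pos)
pos_metrics = precision_recall_f1(tp=pos_as_pos, fp=neg_as_pos, fn=pos_as_neg)

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

print("negative:", [round(m, 2) for m in neg_metrics])  # [0.62, 0.62, 0.62]
print("positive:", [round(m, 2) for m in pos_metrics])  # [0.6, 0.6, 0.6]
print("accuracy:", round(accuracy, 2))                  # 0.61
```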