<a href="https://colab.research.google.com/github/Rubaikaa/Data-Science-Internship/blob/main/Sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Objective:**
This notebook performs ***sentiment analysis on IMDB movie reviews*** using machine learning. It:

1- Loads and preprocesses review text (removes punctuation and stopwords, lemmatizes words).

2- Converts text to a numerical format using TF-IDF vectorization.

3- Trains a logistic regression model to classify reviews as positive or negative.

4- Evaluates performance using accuracy, precision, recall, and F1-score.

The goal is to build an efficient sentiment classifier for movie reviews.

In [1]:
# Step 1: Install and Import Dependencies
!pip install nltk scikit-learn pandas numpy



In [2]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [19]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [20]:
# Step 2: Load Dataset
df = pd.read_csv("/content/IMDB Dataset.csv")
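
Before preprocessing, it can help to confirm that the file has the expected columns and to look at the class balance. The sketch below is one way to do that, using a tiny made-up frame as a stand-in for `df`; the helper name `summarize_labels` is hypothetical and not part of this notebook.

```python
import pandas as pd

def summarize_labels(frame, text_col="review", label_col="sentiment"):
    """Return basic sanity-check counts for a labelled text dataset."""
    missing = [c for c in (text_col, label_col) if c not in frame.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return {
        "rows": len(frame),
        "null_reviews": int(frame[text_col].isna().sum()),
        "duplicate_reviews": int(frame[text_col].duplicated().sum()),
        "label_counts": frame[label_col].value_counts().to_dict(),
    }

# Toy stand-in for the real DataFrame (assumption: same column names).
toy = pd.DataFrame({
    "review": ["Great film", "Awful plot", "Great film", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})
summary = summarize_labels(toy)
print(summary)
```

On the real data the call would be `summarize_labels(df)`.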

In [21]:
# Display basic dataset info
display(df.head())
df.info()  # info() prints its summary and returns None, so it needs no display()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB

In [24]:
# Step 3: Preprocess Text
STOP_WORDS = set(stopwords.words('english'))  # build once, not on every call
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    """Strips punctuation, lowercases, tokenizes, removes stopwords, and lemmatizes."""
    text = re.sub(r'[^\w\s]', '', text).lower()  # Remove punctuation and lowercase
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

In [25]:
# Apply preprocessing
df['processed_review'] = df['review'].apply(preprocess_text)
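
The raw reviews contain HTML line breaks such as `<br />`. The punctuation regex in the preprocessing step removes `<`, `/`, and `>` but leaves the letters `br` behind as a token. A minimal sketch of an extra cleaning step is shown below, assuming it would be applied to the text before punctuation removal. The helper `strip_html` and the sample sentence are hypothetical, and adding this step would change the features, so the metrics reported in this notebook would no longer apply exactly.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches tags such as <br />

def strip_html(text):
    """Replace HTML tags with a space and decode entities such as &amp;."""
    no_tags = TAG_RE.sub(" ", text)
    return html.unescape(no_tags)

# Made-up sample in the style of the dataset.
sample = "A wonderful little production. <br /><br />Loved it &amp; would rewatch."
cleaned = strip_html(sample)
print(cleaned)
```

Replacing each tag with a space rather than an empty string keeps the words on either side of the tag from being glued together.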

In [27]:
# Step 4: Convert Text to Numerical Format
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['processed_review'])
y = (df['sentiment'] == 'positive').astype(int)  # Convert sentiment to binary: 1 = positive, 0 = negative
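
To make the TF-IDF weighting concrete, the sketch below recomputes it by hand for two made-up documents. It follows scikit-learn's default formula, `idf = ln((1 + n) / (1 + df)) + 1` with L2-normalised rows, but splits on whitespace only, which is a simplification of `TfidfVectorizer`'s own tokenizer. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF rows for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["great movie great cast", "terrible movie"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Because "movie" appears in both documents, it gets the lowest IDF and therefore a smaller weight than the terms that distinguish each review.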

In [28]:
# Step 5: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
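
Note that the vectorizer above is fitted on all reviews before the split, so IDF statistics from the test reviews influence the training features. One possible alternative, sketched here on made-up data, is to split the text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training portion only. The `stratify` argument is an addition not used in this notebook, and results from this variant may differ from the metrics reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['processed_review'] and the binary labels (made-up data).
texts = ["great film", "loved great cast", "wonderful story", "great fun",
         "terrible film", "awful boring plot", "boring story", "terrible waste"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the text first, keeping the class ratio the same in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
clf = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
clf.fit(train_texts, train_labels)

test_predictions = clf.predict(test_texts)
print(len(train_texts), len(test_texts))
```

The fitted pipeline also accepts new text directly, for example `clf.predict(["great story"])`, as long as the text has gone through the same preprocessing.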

In [29]:
# Step 6: Train a Logistic Regression Classifier
model = LogisticRegression()
model.fit(X_train, y_train)
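
As a rough illustration of what the fitted model does at prediction time: logistic regression computes a weighted sum of the TF-IDF features plus an intercept and passes it through the sigmoid function, predicting "positive" when the resulting probability exceeds 0.5. The weights below are hypothetical; real values would come from `model.coef_` and `model.intercept_`.

```python
import math

def predict_proba_positive(features, weights, intercept):
    """Sigmoid of the linear score w.x + b for one sparse feature vector."""
    score = intercept + sum(
        weights[term] * value for term, value in features.items() if term in weights
    )
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical learned weights, chosen only for illustration.
weights = {"great": 2.0, "terrible": -2.5, "movie": 0.1}
intercept = 0.0

p_pos = predict_proba_positive({"great": 0.8, "movie": 0.6}, weights, intercept)
p_neg = predict_proba_positive({"terrible": 0.9, "movie": 0.4}, weights, intercept)
print(p_pos > 0.5, p_neg > 0.5)
```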

In [30]:
# Step 7: Make Predictions
y_pred = model.predict(X_test)

In [31]:
# Step 8: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
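
For reference, the four scores can be derived from the confusion-matrix counts. The sketch below does this by hand on made-up labels, treating class 1 (positive) as the target class, which is also the default for the scikit-learn functions used above. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]
scores = binary_metrics(toy_true, toy_pred)
print(scores)
```

F1 is the harmonic mean of precision and recall, so it always lies between the two.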

In [32]:
# Step 9: Print Evaluation Metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.8869
Precision: 0.8774
Recall: 0.9016
F1-score: 0.8893
