Task 2: Sentiment Analysis
--------------------------
Description:
 Build a simple text sentiment analysis model using a small dataset like 1000 IMDB reviews.
Steps:

 1. Preprocessing:

       - Lowercase the text.

       - Remove stopwords and special characters.

 2. Feature Engineering: Convert text data into numerical format using CountVectorizer.

 3. Model Training: Train a simple Logistic Regression model to classify reviews as positive or negative.

 4. Model Evaluation: Check accuracy only (optional: precision and recall).

Outcome:
A Python script to predict whether a review is positive or negative.


Step 1. Preprocessing:
----------------------

   1. Lowercase the text.

   2. Remove stopwords and special characters.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Load dataset
df = pd.read_csv("IMDB Dataset.csv")

# Download stopwords once
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    
    # 3. Remove special characters and numbers
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # 4. Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    
    # Join back into string
    return " ".join(words)

# Apply preprocessing to dataset
df["clean_review"] = df["review"].apply(preprocess_text)

# Show a sample (first 5 rows)
df[["review", "clean_review", "sentiment"]].head()


[nltk_data] Downloading package stopwords to C:\Users\Muhammad
[nltk_data]     Mamoon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,review,clean_review,sentiment
0,One of the other reviewers has mentioned that ...,one reviewers mentioned watching oz episode ho...,positive
1,A wonderful little production. <br /><br />The...,wonderful little production filming technique ...,positive
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...,positive
3,Basically there's a family where a little boy ...,basically family little boy jake thinks zombie...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter mattei love time money visually stunnin...,positive
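
To see what each preprocessing step does to a single review, here is a minimal, self-contained sketch of the same steps. The `DEMO_STOPWORDS` set and the `preprocess_demo` name are hypothetical stand-ins chosen so the sketch runs without downloading NLTK data; the real NLTK English list is much longer. Like the NLTK list, the stand-in contains "not", so negations are dropped along with the other stopwords.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: the
# real list is far longer; it also contains "not").
DEMO_STOPWORDS = {"this", "was", "a", "the", "and", "of", "not", "is"}

def preprocess_demo(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                      # 1. lowercase
    text = re.sub(r"<.*?>", " ", text)       # 2. strip HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)    # 3. drop digits and punctuation
    # 4. drop stopwords and rejoin
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "This movie was NOT good.<br /><br />A waste of 2 hours!"
cleaned = preprocess_demo(sample)
print(cleaned)  # movie good waste hours
```

The word "not" disappears from the cleaned text, which is worth keeping in mind when testing the model on negated sentences.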


Step 2. Feature Engineering: Convert text data into numerical format using CountVectorizer.
------------------------------------------------------------------------------------------

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

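# Initialize vectorizer (bag-of-words counts, as the task specifies)
vectorizer = CountVectorizer(max_features=5000)  # keep the 5000 most frequent words

# Alternative: TF-IDF weighting. NOTE: this reassignment replaces the
# CountVectorizer above, so the features used below are TF-IDF values.
# Comment out this line to use raw counts instead.
vectorizer = TfidfVectorizer(max_features=5000)  # keep the 5000 most frequent words

# Fit and transform clean reviews
X = vectorizer.fit_transform(df["clean_review"])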

# Target variable
y = df["sentiment"]

print("Shape of X (features):", X.shape)
print("Unique labels:", y.unique())


Shape of X (features): (1003, 5000)
Unique labels: ['positive' 'negative']
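
The difference between the two vectorizers is easier to see on a toy corpus. The sketch below uses two made-up documents (an assumption for illustration, not part of the IMDB data): CountVectorizer stores raw term counts, while TfidfVectorizer reweights them and, with its default settings, scales each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus, for illustration only
docs = ["good movie good acting", "bad movie"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(count_vec.vocabulary_)
print(vocab)             # ['acting', 'bad', 'good', 'movie']
print(counts.toarray())  # rows: [1 0 2 1] and [0 1 0 1]
print(tfidf.toarray().round(2))
```

Both matrices have the same shape, so either can be passed to the model in Step 3; only the values in the cells differ.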


Step 3. Model Training: Train a simple Logistic Regression model to classify reviews as positive or negative.
---------------------------------------------------------------------------------------

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Sentiment Analysis Model:", accuracy)


Accuracy of Sentiment Analysis Model: 0.8258706467661692
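
In the cells above, the vectorizer is fit on all reviews before the train/test split, so vocabulary statistics from the test reviews influence the features. One possible way to avoid this is to wrap the vectorizer and the model in a scikit-learn pipeline and fit it on the training split only. The sketch below uses hypothetical toy data in place of df["clean_review"] and df["sentiment"]; because the toy reviews repeat, its accuracy is not meaningful and is printed only to show the call.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned reviews and their labels
texts = [
    "wonderful film great acting", "loved great story", "great fun wonderful cast",
    "boring film waste time", "terrible plot boring acting", "awful waste terrible",
] * 5
labels = (["positive"] * 3 + ["negative"] * 3) * 5

# Split the raw text first; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only
pipe = make_pipeline(
    CountVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)

print("Test accuracy:", pipe.score(X_test, y_test))
print(pipe.predict(["great wonderful story"])[0])
```

A side benefit of this design is that the fitted pipeline accepts raw strings directly, so prediction needs a single `pipe.predict([...])` call after preprocessing.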


Step 4. Model Evaluation: Check accuracy only (optional: precision and recall).
------------------------------------------------------------------------------

In [None]:
def predict_sentiment(review_text):
    # Step 1: Preprocess the text
    clean_text = preprocess_text(review_text)
    
    # Step 2: Convert text to vector using the same vectorizer
    vector = vectorizer.transform([clean_text])
    
    # Step 3: Predict sentiment using trained model
    prediction = model.predict(vector)[0]
    
    return prediction

# 🔹 Example tests
print(predict_sentiment("This movie was absolutely fantastic!"))  # Expected: positive
print(predict_sentiment("The film was boring and a waste of time."))  # Expected: negative
# Note: "not" is in the NLTK stopword list, so preprocessing removes the negation here
print(predict_sentiment("This movie was not good."))  # Expected: negative

positive
negative
negative
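
The cell above exercises the prediction function; the optional precision and recall mentioned in the task can be computed with scikit-learn's metric functions. The sketch below uses hypothetical label lists (`y_true`, `y_hat`) in place of `y_test` and `y_pred` from Step 3, so the numbers are illustrative and not results for the IMDB model. With string labels, `pos_label` has to be set explicitly.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels standing in for y_test and y_pred
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_hat = ["positive", "positive", "negative", "negative",
         "negative", "positive", "negative", "negative"]

acc = accuracy_score(y_true, y_hat)
# Precision: of the reviews predicted positive, how many really are positive
prec = precision_score(y_true, y_hat, pos_label="positive")
# Recall: of the truly positive reviews, how many the model found
rec = recall_score(y_true, y_hat, pos_label="positive")

print("Accuracy :", acc)             # 0.625 (5 of 8 correct)
print("Precision:", round(prec, 3))  # 0.667 (2 true positives of 3 predicted)
print("Recall   :", rec)             # 0.5   (2 of 4 actual positives found)
```

On the real data, replacing `y_true` and `y_hat` with `y_test` and `y_pred` should give the corresponding figures for the trained model.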
