### Customer Review Sentiment Classification

#### 1. Data Collection

In [34]:
# Import necessary libraries
import numpy as np  
import nltk
import glob
import re
import string
import pandas as pd  
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV  
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from sklearn.metrics import accuracy_score  


In [12]:
# Download required NLTK resources
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_

True

In [13]:
# Define dataset path
dataset_path = "aclImdb"

def load_imdb_data(directory, label):
    """Function to load IMDB dataset."""
    data = []
    for file_path in glob.glob(f"{directory}/*.txt"):
        with open(file_path, "r", encoding="utf-8") as file:
            data.append((file.read(), label))
    return data
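
The loader above can be tried without the full dataset. The sketch below is a hypothetical variant (`load_reviews` is not part of the notebook) that uses `pathlib` and sorts the file names, since the order returned by `glob` depends on the filesystem. The directory and the three files are made up to mimic the `aclImdb/train/pos` and `aclImdb/train/neg` layout.

```python
import tempfile
from pathlib import Path


def load_reviews(directory, label):
    """Read every .txt file in `directory` and pair its text with `label`."""
    # sorted() makes the order reproducible across machines
    return [
        (path.read_text(encoding="utf-8"), label)
        for path in sorted(Path(directory).glob("*.txt"))
    ]


# Throwaway directory with made-up reviews, only for illustration
with tempfile.TemporaryDirectory() as root:
    fake_reviews = {"pos": ["Loved it.", "Great cast."], "neg": ["Dull plot."]}
    for folder_name, texts in fake_reviews.items():
        folder = Path(root) / "train" / folder_name
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"{i}_0.txt").write_text(text, encoding="utf-8")

    sample = (
        load_reviews(Path(root) / "train" / "pos", "positive")
        + load_reviews(Path(root) / "train" / "neg", "negative")
    )

print(len(sample))
print(sample[0])
```

Sorting is optional for training, but it makes a later shuffle with a fixed seed give the same result on every machine.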

In [14]:
# Load dataset
train_pos = load_imdb_data(f"{dataset_path}/train/pos", "positive")
train_neg = load_imdb_data(f"{dataset_path}/train/neg", "negative")
test_pos = load_imdb_data(f"{dataset_path}/test/pos", "positive")
test_neg = load_imdb_data(f"{dataset_path}/test/neg", "negative")


In [17]:
# Convert to DataFrame
train_df = pd.DataFrame(train_pos + train_neg, columns=["review", "sentiment"])
test_df = pd.DataFrame(test_pos + test_neg, columns=["review", "sentiment"])

#### 2. Data Preprocessing

In [45]:
# Initialize text preprocessing tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    """Function for text cleaning."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)  # Remove HTML tags
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

In [46]:
# Apply text cleaning
train_df["cleaned_review"] = train_df["review"].apply(clean_text)
test_df["cleaned_review"] = test_df["review"].apply(clean_text)
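
The regex and punctuation steps of `clean_text` can be checked on their own, without NLTK. The sketch below uses a hypothetical helper, `strip_markup`, that repeats those steps and collapses whitespace in place of tokenization. It also shows one thing worth verifying: IMDB reviews often contain `<br /><br />` directly after a sentence, so replacing tags with an empty string can glue two words together, while replacing them with a space keeps the words apart. The sample sentence is made up.

```python
import re
import string


def strip_markup(text, tag_replacement=""):
    """Lowercase, drop HTML tags, digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)  # HTML tags
    text = re.sub(r"\d+", "", text)                 # digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


raw = "Great film.<br /><br />Don't miss it, 10/10!"

as_in_notebook = strip_markup(raw)    # tags replaced with ""
with_space = strip_markup(raw, " ")   # tags replaced with " "

print(as_in_notebook)
print(with_space)
```

If the merged tokens turn out to be common in the real data, passing a space as the replacement in the tag-removal step is a small change to consider; the scores reported in this notebook were produced with the empty-string version.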

#### 3. Feature Extraction

TF-IDF is chosen over plain Bag of Words counts because it weights each word by how informative it is, not only by how often it appears.

1- Down-weights Common Words – Words that appear in almost every review (for example "movie" or "film", once stop words are removed) receive a low weight, whereas raw Bag of Words counts treat every occurrence equally.

2- Highlights Distinctive Words – If a word appears often in one review but in few others, TF-IDF assigns it a higher weight, making it more useful for classification.

3- Often Better Accuracy – Because distinctive words carry more weight, a linear model such as Logistic Regression often separates the classes better than with raw counts, although this should be confirmed on the validation set.

4- Efficient Processing – TF-IDF itself produces the same number of features as Bag of Words; it is the max_features=5000 setting below that keeps only the 5,000 most frequent terms, making the model smaller and faster to train.

In short, TF-IDF helps the model identify which words matter most for classifying reviews as positive or negative.
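
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a made-up three-document corpus. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The corpus and the helper names are illustrative only.

```python
import math
from collections import Counter

# Made-up corpus: "movie" and "great" appear in two documents, the rest in one
docs = [
    "great movie great cast",
    "boring movie",
    "great soundtrack",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(tokenized[1])  # "boring movie"
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

In the second document both words occur once, yet "boring" gets the larger weight because it appears in fewer documents than "movie".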

In [36]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df["cleaned_review"])
X_test = vectorizer.transform(test_df["cleaned_review"])


In [37]:
# Encode labels: LabelEncoder sorts classes alphabetically, so negative --> 0 and positive --> 1
encoder = LabelEncoder()
y_train = encoder.fit_transform(train_df["sentiment"])
y_test = encoder.transform(test_df["sentiment"])
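
The label-to-integer mapping can be confirmed rather than assumed by reading the encoder's `classes_` attribute. A small sketch on made-up labels (`demo_encoder` is a separate object, not the notebook's `encoder`):

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(
    ["positive", "negative", "negative", "positive"]
)

# classes_ is sorted, and the position of each class is its integer code
mapping = {label: code for code, label in enumerate(demo_encoder.classes_)}

print(mapping)
print(encoded.tolist())
```

This is what lets the rows labelled 0 and 1 in the classification reports be read as negative and positive respectively.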


#### 4. Model Selection and Training

In [38]:
# Hold out 20% of the training data as a stratified validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [39]:
# Train Logistic Regression Model
model = LogisticRegression(C=1.0, max_iter=200)
model.fit(X_train, y_train)

In [40]:
# Evaluate Model
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))

Accuracy: 0.868

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.86      0.87      2500
           1       0.86      0.88      0.87      2500

    accuracy                           0.87      5000
   macro avg       0.87      0.87      0.87      5000
weighted avg       0.87      0.87      0.87      5000



In [42]:
# Hyperparameter Tuning using GridSearchCV
param_grid = {
    "C": [0.1, 1, 10],
    "max_iter": [100, 200, 300]
}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)
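
In the cells above the vectorizer is fitted once on all training reviews, so each cross-validation fold has already influenced the vocabulary and IDF values. One way to avoid this, sketched below on a tiny made-up corpus (not the IMDB data), is to place the vectorizer and the classifier in a `Pipeline` and let `GridSearchCV` refit both inside every fold. The step names `tfidf` and `clf` and the parameter values are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs in seconds
texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the cast", "brilliant and moving film",
    "boring movie hated it", "terrible acting weak story",
    "hated the ending", "dull and lifeless film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Parameters are addressed as <step name>__<parameter name>
demo_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, demo_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

The same two attributes, `best_params_` and `best_score_`, are available on the notebook's `grid_search` object and show which combination was selected.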

#### 5. Evaluation

In [44]:
# Evaluate the best model found by the grid search on the held-out test set
y_pred_best = grid_search.predict(X_test)
print(" Model Accuracy:", accuracy_score(y_test, y_pred_best))
print("\n Model Classification Report:\n", classification_report(y_test, y_pred_best))


 Model Accuracy: 0.87356

 Model Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.87      0.87     12500
           1       0.87      0.88      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



## Model Improvement Strategies

### 1. Use Different Algorithms
- So far, we have used only **Logistic Regression**.  
- We can try **Naive Bayes** (fast and a common baseline for text data) or a **Support Vector Machine (SVM)** (often strong on high-dimensional sparse features such as TF-IDF).  
- Comparing several models on the same validation data helps find the one that performs best.  
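
A comparison along these lines could look like the sketch below. It runs on a tiny made-up corpus, so the printed scores say nothing about IMDB; the point is only the pattern of wrapping each candidate in the same TF-IDF pipeline and scoring it with the same cross-validation. The dictionary keys are arbitrary names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only to keep the example self-contained
texts = [
    "great movie loved it", "wonderful acting great story",
    "loved the cast", "brilliant and moving film",
    "boring movie hated it", "terrible acting weak story",
    "hated the ending", "dull and lifeless film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, estimator in candidates.items():
    candidate_model = make_pipeline(TfidfVectorizer(), estimator)
    fold_scores = cross_val_score(
        candidate_model, texts, labels, cv=2, scoring="accuracy"
    )
    scores[name] = float(fold_scores.mean())

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```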

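### 2. Tune Hyperparameters
- The grid search above tuned only **C (inverse regularization strength; smaller values regularize more)** and **max_iter (maximum solver iterations)**. Once the solver has converged, increasing max_iter does not change the result, so C is the setting that matters most here.  
- **Grid Search** or **Random Search** can also cover vectorizer settings, such as the n-gram range or vocabulary size, to find better values automatically.  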

### 3. Add More Data
- More training data can help the model generalize better and reduce errors.  
- We can collect more reviews or use **data augmentation** (e.g., back-translation, where a review is translated to another language and back, or replacing words with synonyms) to create variations.  
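
As a rough illustration of synonym-based augmentation, the sketch below swaps words using a small hand-written dictionary. The dictionary, the `augment` helper and the 50% swap probability are all assumptions made for this example; a real setup would more likely draw synonyms from a lexical resource such as WordNet.

```python
import random

# Hypothetical synonym table, written by hand for this example
SYNONYMS = {
    "great": ["excellent", "superb"],
    "boring": ["dull", "tedious"],
    "movie": ["film"],
}


def augment(review, rng, swap_prob=0.5):
    """Replace each word that has synonyms with probability `swap_prob`."""
    augmented = []
    for word in review.split():
        options = SYNONYMS.get(word)
        if options and rng.random() < swap_prob:
            augmented.append(rng.choice(options))
        else:
            augmented.append(word)
    return " ".join(augmented)


rng = random.Random(42)  # fixed seed so the variants are reproducible
original = "a great movie with a boring ending"
variants = [augment(original, rng) for _ in range(3)]

for variant in variants:
    print(variant)
```

Each variant keeps the label of the original review, so augmented copies should be added to the training split only, never to validation or test data.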


### 4. Try Word Embeddings
- Instead of TF-IDF, we can use **Word2Vec** or **BERT embeddings**, which represent words as dense vectors so that words with similar meanings get similar representations, something word counts alone cannot capture.  
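
The idea can be sketched without any pretrained model: represent a review as the average of its word vectors and compare reviews by cosine similarity. The two-dimensional vectors below are invented purely for illustration; in practice they would come from a trained Word2Vec model or a BERT encoder and have hundreds of dimensions.

```python
import math

# Invented 2-d "embeddings"; real ones are learned from data
EMBEDDINGS = {
    "great": [0.9, 0.1],
    "excellent": [0.8, 0.2],
    "boring": [-0.7, 0.3],
    "movie": [0.0, 0.5],
}


def document_vector(text, dim=2):
    """Average the vectors of all known words; zero vector if none are known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


great = document_vector("great movie")
excellent = document_vector("excellent movie")
boring = document_vector("boring movie")

print(cosine(great, excellent))
print(cosine(great, boring))
```

With TF-IDF, "great movie" and "excellent movie" share only the word "movie"; with embeddings, the two reviews end up close together because "great" and "excellent" have similar vectors.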

By testing these improvements, we can make the model more accurate, reliable, and useful for real-world applications!