# Sentiment Analysis of Product Reviews  
**An End-to-End Machine Learning & NLP Project**

This project classifies Amazon product reviews as Positive, Negative, or Neutral using NLP and machine learning techniques.

## 1. Load Dataset

In [1]:
import pandas as pd

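# Local path to the dataset; adjust for your machine.
file_path = r"C:\Users\harsh\OneDrive\Desktop\Data Science\Amazon Product Review Project\Amazon Dataset\1429_1.csv"

df = pd.read_csv(file_path, low_memory=False)

# Keep only the review text, star rating, and product name (as an independent copy).
df = df[['reviews.text', 'reviews.rating', 'name']].copy()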

print("TOP 5 ROWS:\n")
print(df.head())

print("\nDATASET SHAPE:")
print(df.shape)

print("\nMISSING VALUES:")
print(df.isnull().sum())


TOP 5 ROWS:

                                        reviews.text  reviews.rating  \
0  This product so far has not disappointed. My c...             5.0   
1  great for beginner or experienced person. Boug...             5.0   
2  Inexpensive tablet for him to use and learn on...             5.0   
3  I've had my Fire HD 8 two weeks now and I love...             4.0   
4  I bought this for my grand daughter when she c...             5.0   

                                                name  
0  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  
1  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  
2  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  
3  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  
4  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...  

DATASET SHAPE:
(34660, 3)

MISSING VALUES:
reviews.text         1
reviews.rating      33
name              6760
dtype: int64


## 2. Handle Missing Values
Missing values are handled as follows:
- Drop rows with missing review text  
- Fill missing ratings with the median rating; the sentiment labels for these rows are therefore based on an imputed rating  
- Replace missing product names with 'Unknown Product'


In [2]:
# Drop rows without review text, then impute the remaining two columns.
df = df.dropna(subset=['reviews.text']).copy()
df['reviews.rating'] = df['reviews.rating'].fillna(df['reviews.rating'].median())
df['name'] = df['name'].fillna("Unknown Product")

print("MISSING VALUES AFTER CLEANING:\n")
print(df.isnull().sum())

print("\nDATASET SHAPE AFTER CLEANING:")
print(df.shape)


MISSING VALUES AFTER CLEANING:

reviews.text      0
reviews.rating    0
name              0
dtype: int64

DATASET SHAPE AFTER CLEANING:
(34659, 3)


## 3. Sentiment Label Creation

Sentiment labels are created from review ratings:
- Rating ≥ 4 → Positive  
- Rating = 3 → Neutral  
- Rating ≤ 2 → Negative


In [3]:
def label_sentiment(rating):
    """Map a star rating to a sentiment label."""
    if rating >= 4:
        return "Positive"
    elif rating == 3:
        return "Neutral"
    else:
        # Ratings of 1 or 2
        return "Negative"

df['sentiment'] = df['reviews.rating'].apply(label_sentiment)
print(df[['reviews.rating', 'sentiment']].head(6))

   reviews.rating sentiment
0             5.0  Positive
1             5.0  Positive
2             5.0  Positive
3             4.0  Positive
4             5.0  Positive
5             5.0  Positive
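
The first six rows above are all Positive, so it can help to look at the full label distribution before modeling. The following is a minimal sketch on a small, made-up ratings column (the values are illustrative and not taken from the dataset); in the notebook the equivalent call would be `df['sentiment'].value_counts()`.

```python
import pandas as pd

# Hypothetical ratings for illustration only (not from the Amazon dataset).
ratings = pd.Series([5.0, 5.0, 4.0, 3.0, 1.0, 5.0, 2.0, 4.0])

def label_sentiment(rating):
    """Same rule as above: >= 4 Positive, 3 Neutral, otherwise Negative."""
    if rating >= 4:
        return "Positive"
    elif rating == 3:
        return "Neutral"
    return "Negative"

labels = ratings.apply(label_sentiment)

# Absolute counts and relative shares per class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(3)
print(counts)
print(shares)
```

A strongly skewed distribution here is a signal that accuracy alone will be a weak yardstick later on.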


## 4. Text Preprocessing (NLP)

In this step, review text is cleaned using NLP techniques:
- Lowercasing  
- Removing special characters using regex  
- Removing stopwords  
- Removing extra spaces


In [4]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

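def clean_text(text):
    # Cast to str so a stray non-string value does not raise on .lower()
    text = str(text).lower()
    # Drop everything except letters and whitespace. Characters are deleted,
    # not replaced by a space, so "I've" becomes "ive".
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove English stopwords (NLTK's list includes negations such as "not")
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)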

df['clean_review'] = df['reviews.text'].apply(clean_text)

print(df[['reviews.text', 'clean_review']].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                        reviews.text  \
0  This product so far has not disappointed. My c...   
1  great for beginner or experienced person. Boug...   
2  Inexpensive tablet for him to use and learn on...   
3  I've had my Fire HD 8 two weeks now and I love...   
4  I bought this for my grand daughter when she c...   

                                        clean_review  
0  product far disappointed children love use lik...  
1  great beginner experienced person bought gift ...  
2  inexpensive tablet use learn step nabi thrille...  
3  ive fire hd two weeks love tablet great valuew...  
4  bought grand daughter comes visit set user ent...  
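
Two side effects are visible in the output above: "has not disappointed" is reduced to "disappointed" because "not" is a stopword, and tokens such as "ive" and "valuew..." appear because punctuation is deleted rather than replaced by a space. The sketch below reproduces both effects and shows one possible variant. It assumes a tiny stand-in stopword list so it runs without NLTK data; `clean_text_variant` and `NEGATIONS` are hypothetical names, not part of the notebook.

```python
import re

# Tiny stand-in stopword list so the sketch runs without downloading NLTK data.
# (Assumption: the notebook uses NLTK's full English list, which also contains "not".)
STOP_WORDS = {"this", "so", "has", "not", "my", "to", "i", "the", "we", "it", "is", "a"}
NEGATIONS = {"not", "no", "nor", "never"}

def clean_text_original(text):
    """Same steps as the notebook's clean_text, using the stand-in stopword list."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def clean_text_variant(text):
    """Hypothetical variant: punctuation becomes a space, negations are kept."""
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    keep_out = STOP_WORDS - NEGATIONS
    return " ".join(w for w in text.split() if w not in keep_out)

sample = "This product so far has not disappointed. Great value.We love it"
original_result = clean_text_original(sample)
variant_result = clean_text_variant(sample)
print(original_result)  # product far disappointed great valuewe love
print(variant_result)   # product far not disappointed great value love
```

Whether keeping negations helps on this dataset is untested here; switching the cleaning function would also require refitting the vectorizer and models.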


## 5. Feature Extraction using TF-IDF
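Text data is converted into numerical features using TF-IDF vectorization so that machine learning models can process it. The vocabulary is capped at 5,000 terms (`max_features=5000`). Note that the vectorizer is fitted on the full dataset here, before the train-test split.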

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)

X = tfidf.fit_transform(df['clean_review'])
y = df['sentiment']

print("TF-IDF matrix shape:", X.shape)
print("Sample feature names:", tfidf.get_feature_names_out()[:10])


TF-IDF matrix shape: (34659, 5000)
Sample feature names: ['abc' 'abilities' 'ability' 'able' 'absolute' 'absolutely' 'abundance'
 'abuse' 'ac' 'accent']
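
To make the weighting concrete, here is a minimal hand computation of TF-IDF on a three-document toy corpus. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; the corpus and the helper `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus for illustration only (already cleaned, tokens separated by spaces).
docs = ["great tablet", "great price", "slow tablet"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in df_counts.items()}

def tfidf_vector(tokens):
    """Return L2-normalised TF-IDF weights for one document as a dict."""
    tf = Counter(tokens)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document, "price" receives a higher weight than "great" because it occurs in fewer documents.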


## 6. Train-Test Split

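The dataset is split 80/20 into training and testing sets to evaluate model performance. The split is stratified on the sentiment label, so both sets keep the same class proportions.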


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


Training data shape: (27727, 5000)
Testing data shape: (6932, 5000)
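
The effect of `stratify=y` can be seen on a small example. This sketch uses made-up labels with an 80/10/10 class mix (an assumption for illustration, not the dataset's actual proportions) and checks the class counts on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced labels for illustration only.
labels = ["Positive"] * 80 + ["Neutral"] * 10 + ["Negative"] * 10
features = list(range(len(labels)))  # stand-in for the TF-IDF rows

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train:", train_counts)
print("Test: ", test_counts)
```

Without stratification, a rare class could end up with very few or no examples in the test set, which would make its per-class metrics unreliable.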


## 7. Model Training

Three machine learning models are trained and evaluated:
- Naive Bayes  
- Logistic Regression  
- Support Vector Machine (SVM)


In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

model_nb = MultinomialNB()
model_nb.fit(X_train, y_train)

y_pred_nb = model_nb.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))


Naive Bayes Accuracy: 0.9329197922677438


In [8]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)

y_pred_lr = model_lr.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))


Logistic Regression Accuracy: 0.9371032890940566


In [9]:
from sklearn.svm import LinearSVC

model_svm = LinearSVC()
model_svm.fit(X_train, y_train)

y_pred_svm = model_svm.predict(X_test)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))


SVM Accuracy: 0.9350836699365263
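
Accuracy treats every review equally, so with imbalanced classes it is worth reporting macro-averaged F1 next to it. The following is a sketch on synthetic data, not a rerun of the notebook: the dataset comes from `make_classification`, the 5/5/90 class weights are an assumption, and the balanced SVM is an extra variant that the notebook does not train. Naive Bayes is left out because `MultinomialNB` needs non-negative features, which this synthetic data does not provide.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, imbalanced 3-class data; a stand-in for the TF-IDF features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    n_classes=3, weights=[0.05, 0.05, 0.90], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Linear SVM (balanced)": LinearSVC(class_weight="balanced"),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "macro_f1": f1_score(y_te, pred, average="macro"),
    }
    print(f"{name}: accuracy={scores[name]['accuracy']:.3f}, "
          f"macro-F1={scores[name]['macro_f1']:.3f}")
```

The same loop could be pointed at `X_train`, `X_test`, `y_train`, and `y_test` from the notebook to compare the three trained models on both metrics.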


## 8. Model Evaluation (Final Model - SVM)

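The three models land within half a percentage point of each other on accuracy: Logistic Regression 0.937, SVM 0.935, and Naive Bayes 0.933. The Support Vector Machine (SVM) is carried forward as the final model here, although Logistic Regression scored marginally higher on this metric.  
In this section, we evaluate the SVM using accuracy, a classification report, and a confusion matrix. The per-class figures matter because the test set is heavily imbalanced: 6,470 of the 6,932 test reviews are Positive.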


In [10]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_svm))
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_svm))

SVM Accuracy: 0.9350836699365263

Classification Report:

              precision    recall  f1-score   support

    Negative       0.55      0.19      0.28       162
     Neutral       0.34      0.08      0.13       300
    Positive       0.94      0.99      0.97      6470

    accuracy                           0.94      6932
   macro avg       0.61      0.42      0.46      6932
weighted avg       0.91      0.94      0.92      6932


Confusion Matrix:

[[  31   15  116]
 [  14   24  262]
 [  11   32 6427]]


## 9. Save the Final Model

The trained SVM model and TF-IDF vectorizer are saved for future predictions and deployment.

In [11]:
import pickle

with open("svm_sentiment_model.pkl", "wb") as f:
    pickle.dump(model_svm, f)

with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)

print("Model and vectorizer saved successfully!")

Model and vectorizer saved successfully!
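
An alternative to saving two separate files is to wrap the vectorizer and classifier in a single scikit-learn `Pipeline` and pickle that one object, so the two can never get out of sync. This is a sketch under stated assumptions: the training texts are made up, the file name `sentiment_pipeline.pkl` is hypothetical, and the pipeline is not the object the notebook actually saves.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set for illustration only.
texts = ["great tablet love it", "love this great product",
         "terrible waste of money", "awful broken waste",
         "okay average nothing special", "average okay product"]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# One object that holds both the vectorizer and the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
pipeline.fit(texts, labels)

# Save to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

new_reviews = ["great product", "waste of money"]
print(restored.predict(new_reviews))
```

A pipeline also fits the vectorizer only on whatever data is passed to `fit`, which keeps test-set vocabulary out of training when it is fitted on the training split. Pickle files should only be loaded from sources you trust.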


## 10. Real-Time Sentiment Prediction

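The saved model and vectorizer are loaded back and wrapped in a `predict_sentiment` function, which cleans a new review, vectorizes it, and returns the SVM's predicted label. An interactive loop then lets users type in reviews and see the prediction for each one.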


In [12]:
with open("svm_sentiment_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded_tfidf = pickle.load(f)

def predict_sentiment(review_text):
    clean_review = clean_text(review_text)
    vector = loaded_tfidf.transform([clean_review])
    prediction = loaded_model.predict(vector)[0]
    return prediction

# Test with sample reviews
test_reviews = [
    "This product is absolutely amazing! Worth every penny.",
    "Worst product ever. Totally waste of money.",
    "It is okay, not too good and not too bad."
]

for review in test_reviews:
    print("Review:", review)
    print("Predicted Sentiment:", predict_sentiment(review))
    print("-" * 50)

Review: This product is absolutely amazing! Worth every penny.
Predicted Sentiment: Positive
--------------------------------------------------
Review: Worst product ever. Totally waste of money.
Predicted Sentiment: Negative
--------------------------------------------------
Review: It is okay, not too good and not too bad.
Predicted Sentiment: Neutral
--------------------------------------------------


In [13]:
while True:
    user_review = input("\nEnter a product review (type 'exit' to stop): ")

    if user_review.lower() == "exit":
        print("Exiting sentiment checker...")
        break

    sentiment = predict_sentiment(user_review)
    print("Predicted Sentiment:", sentiment)



Enter a product review (type 'exit' to stop):  Good quality, but overpriced for what it offers.


Predicted Sentiment: Positive



Enter a product review (type 'exit' to stop):  Camera is great but performance is slow sometimes.


Predicted Sentiment: Positive



Enter a product review (type 'exit' to stop):  बहुत बढ़िया प्रोडक्ट है. पैसे वसूल!


Predicted Sentiment: Positive



Enter a product review (type 'exit' to stop):  Bad experience. पैसे पूरी तरह बर्बाद हो गए.


Predicted Sentiment: Positive



Enter a product review (type 'exit' to stop):  I am extremely disappointed. Cheap material and horrible performance.


Predicted Sentiment: Neutral



Enter a product review (type 'exit' to stop):  This is the worst product I have ever bought. Totally waste of money.


Predicted Sentiment: Negative



Enter a product review (type 'exit' to stop):  The product arrived broken and customer service was useless. Never buying again.


Predicted Sentiment: Negative



Enter a product review (type 'exit' to stop):  exit


Exiting sentiment checker...
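
The Hindi inputs in the session above show a limitation of the cleaning step: `clean_text` keeps only the letters a to z, so a Hindi-only review is reduced to an empty string and the prediction no longer depends on the review at all. The sketch below reproduces this with the two regex steps and adds a simple guard. `predict_with_guard`, the "Unknown" label, and the stand-in predictor are hypothetical and not part of the notebook.

```python
import re

def strip_to_ascii_letters(text):
    """The regex steps of clean_text (stopword removal omitted for brevity)."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def predict_with_guard(review_text, predict_fn):
    """Return 'Unknown' instead of a prediction when cleaning leaves no text.

    `predict_fn` is a hypothetical callable that maps cleaned text to a label,
    e.g. a wrapper around the loaded vectorizer and model.
    """
    cleaned = strip_to_ascii_letters(review_text)
    if not cleaned:
        return "Unknown"
    return predict_fn(cleaned)

hindi_only = "बहुत बढ़िया प्रोडक्ट है. पैसे वसूल!"
mixed = "Bad experience. पैसे पूरी तरह बर्बाद हो गए."

print(repr(strip_to_ascii_letters(hindi_only)))  # ''
print(repr(strip_to_ascii_letters(mixed)))       # 'bad experience'

# Stand-in predictor so the sketch runs without the trained model.
dummy_predict = lambda cleaned: "Positive"
guarded_result = predict_with_guard(hindi_only, dummy_predict)
print(guarded_result)  # Unknown
```

For the mixed-language review, only "bad experience" reaches the model, so the Hindi half of the sentence contributes nothing to the prediction.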


In [14]:
import os

print("Current working directory:")
print(os.getcwd())

print("\nFiles in this folder:")
print(os.listdir())


Current working directory:
C:\Users\harsh\Amazon Product Review Project

Files in this folder:
['.ipynb_checkpoints', '1429_1.csv', 'sentiment_analysis.ipynb', 'svm_sentiment_model.pkl', 'tfidf_vectorizer.pkl']
