### Project Title:
## Sentiment Analysis of Product Reviews Using NLP

### Objective:
To analyze customer product reviews and classify them as positive, negative, or neutral using Natural Language Processing (NLP) techniques.



### Tools & Libraries:
Python

Pandas, NumPy

NLTK / spaCy (for NLP preprocessing)

Scikit-learn (for ML models)

Matplotlib/Seaborn (for visualization)

Jupyter Notebook / Google Colab

### Dataset:
Use any publicly available dataset, such as:

Amazon Product Reviews

### Project Workflow:
### 1. Import Libraries & Load Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### 2. Data Preprocessing (NLP Concepts)
Lowercasing

Removing punctuations and numbers

Tokenization

Stopwords removal

Lemmatization (or Stemming)
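
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the notebook's own cell: the stopword set `TOY_STOPWORDS` and the suffix-stripping helper `toy_stem` are hypothetical stand-ins for NLTK's stopword list and a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "but", "was", "i"}

def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_review(text):
    text = text.lower()                                      # 1. lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                    # 2. drop punctuation and digits
    tokens = text.split()                                    # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]   # 4. stopword removal
    return [toy_stem(t) for t in tokens]                     # 5. stemming

print(clean_review("The GPS was freezing after 2 days, but it worked great!"))
# ['gps', 'freez', 'after', 'day', 'work', 'great']
```

The crude stem "freez" shows why a proper lemmatizer is usually preferred when readable tokens matter.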

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

### Load Dataset

In [7]:
import pandas as pd

# Load dataset
df = pd.read_csv(r'C:\Users\Sonams\Desktop\amazon_reviews.csv')

# Show the first few rows
print(df.head())


   Unnamed: 0        asin   helpful  overall  \
0           0  0528881469    [0, 0]      5.0   
1           1  0528881469  [12, 15]      1.0   
2           2  0528881469  [43, 45]      3.0   
3           3  0528881469   [9, 10]      2.0   
4           4  0528881469    [0, 0]      1.0   

                                          reviewText   reviewTime  \
0  We got this GPS for my husband who is an (OTR)...   06 2, 2013   
1  I'm a professional OTR truck driver, and I bou...  11 25, 2010   
2  Well, what can I say.  I've had this unit in m...   09 9, 2010   
3  Not going to write a long review, even thought...  11 24, 2010   
4  I've had mine for a year and here's what we go...  09 29, 2011   

       reviewerID              reviewerName  \
0   AO94DHGC771SJ                   amazdnu   
1   AMO214LNFCEI4           Amazon Customer   
2  A3N7T0DY83Y4IG             C. A. Freeman   
3  A1H8PY3QHMQQA0  Dave M. Shaw "mack dave"   
4  A24EV6RXELQZ63               Wayne Smith   

             

In [9]:
print(df.columns)


Index(['Unnamed: 0', 'asin', 'helpful', 'overall', 'reviewText', 'reviewTime',
       'reviewerID', 'reviewerName', 'summary', 'unixReviewTime'],
      dtype='object')


In [11]:
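# Text preprocessing: fetch the stopword list and tokenizer data
# (recent NLTK releases may also need nltk.download("punkt_tab") for word_tokenize)
nltk.download("stopwords")
nltk.download("punkt")
stop_words = set(stopwords.words("english"))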


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonams\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sonams\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return " ".join(filtered_tokens)

In [15]:
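def preprocess_text(text):
    """Lowercase, tokenize, and drop stopwords and non-alphanumeric tokens."""
    if isinstance(text, str):
        tokens = word_tokenize(text.lower())
        filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
        return " ".join(filtered_tokens)
    return ""  # missing reviews (NaN) become empty strings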


In [17]:
df["sentiment"] = df["overall"].apply(lambda x: "positive" if x >= 4 else "negative")
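
The objective mentions a neutral class, whereas the cell above produces only two labels (3-star reviews end up as "negative"). A minimal sketch of a three-way mapping, assuming that exactly 3 stars counts as neutral; the function name `rating_to_sentiment` and the thresholds are illustrative choices, not taken from the notebook.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"  # assumption: 3 stars is treated as neutral

# Ratings of the first five rows printed earlier: 5.0, 1.0, 3.0, 2.0, 1.0
labels = [rating_to_sentiment(r) for r in [5.0, 1.0, 3.0, 2.0, 1.0]]
print(labels)
# ['positive', 'negative', 'neutral', 'negative', 'negative']
```

On the DataFrame this could be applied as `df["overall"].apply(rating_to_sentiment)`.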


In [19]:
df["cleaned_review"] = df["reviewText"].apply(preprocess_text)



In [20]:
X_train, X_test, y_train, y_test = train_test_split(df["cleaned_review"], df["sentiment"], test_size=0.2, random_state=42)
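
Star ratings are often skewed toward positive reviews, so it may be worth keeping the class ratio identical in both splits. A minimal sketch using the `stratify` parameter of `train_test_split`, run here on made-up toy data rather than the review dataset (the names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen only to avoid clashing with the notebook's variables).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 positive and 20 negative examples.
texts = [f"review {i}" for i in range(100)]
toy_labels = ["positive"] * 80 + ["negative"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# With stratification the 80/20 class ratio is preserved in both splits.
print(Counter(y_tr), Counter(y_te))
```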


In [23]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

In [25]:
# Model evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.815
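
Accuracy alone is hard to interpret when one class dominates, so it can help to compare it with the score of always predicting the most frequent label. A minimal sketch in plain Python; the label counts below are made up for illustration and are not the notebook's actual class distribution.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(y_true)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(y_true)

# Hypothetical test labels: 160 positive, 40 negative.
y_demo = ["positive"] * 160 + ["negative"] * 40
print(majority_baseline_accuracy(y_demo))  # 0.8
```

In the notebook this could be called as `majority_baseline_accuracy(list(y_test))`; a classifier is only informative if it clearly beats that number. scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same baseline.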


In [27]:
product_id = "B00813GRG4"  # Replace with actual product ASIN
print(df[df["asin"] == product_id]["reviewText"])


Series([], Name: reviewText, dtype: object)


In [29]:
for index, row in df.head(10).iterrows():
    print(f"🔹 **Review by {row['reviewerName']}** (Rating: {row['overall']}/5) 📅 {row['reviewTime']}")
    print(f"📝 {row['reviewText']}")
    print("-" * 60)


🔹 **Review by amazdnu** (Rating: 5.0/5) 📅 06 2, 2013
📝 We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the "trucker" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that's just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!
------------------------------------------------------------
🔹 **Review by Amazon Customer** (Rating: 1.0/5) 📅 11 25, 2010
📝 I'm a professi

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # Limit vocabulary size for efficiency
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
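
To make the transformation concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The two toy documents are made up, and the sketch ignores tokenization details and `max_features`.

```python
import math
from collections import Counter

toy_docs = [["great", "gps", "great", "screen"], ["gps", "froze"]]  # pre-tokenized toy data

n_docs = len(toy_docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

def tfidf_row(doc):
    """Term counts times smoothed IDF, scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(toy_docs[1])
print(row)  # "froze" (one document) outweighs "gps" (every document)
```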


In [33]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_test_tfidf)
print("✅ Model Accuracy:", accuracy_score(y_test, y_pred))


✅ Model Accuracy: 0.815


In [35]:
# Apply the same cleaning used on the training text before vectorizing
new_review = [preprocess_text("This product is amazing! Totally worth the price.")]
new_review_tfidf = vectorizer.transform(new_review)
print("Predicted Sentiment:", model.predict(new_review_tfidf)[0])


Predicted Sentiment: positive


In [37]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=100, padding="post")
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=100, padding="post")
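
What `pad_sequences` does with these arguments can be sketched in plain Python. With `padding="post"`, zeros are appended to short sequences; because `truncating` is left at its default (`"pre"`), sequences longer than `maxlen` keep their last `maxlen` tokens. The helper name `pad_post` is hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding="post", truncating="pre")."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # too long: drop tokens from the front
        padded.append(seq + [value] * (maxlen - len(seq)))   # too short: append zeros at the end
    return padded

print(pad_post([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 8, 0, 0], [3, 4, 5, 6]]
```

If the opening sentences of a review matter most, passing `truncating="post"` to `pad_sequences` keeps the first 100 tokens instead.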


In [38]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    # input_length is deprecated in Keras 3; the sequence length is inferred from the data
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])




In [39]:
label_map = {"positive": 1, "negative": 0}  # encode string labels as 0/1
model.fit(
    X_train_seq, y_train.map(label_map),
    epochs=5, batch_size=32,
    validation_data=(X_test_seq, y_test.map(label_map)),
)


Epoch 1/5
25/25 ━━━━━━━━━━━━━━━━━━━━ 5s 72ms/step - accuracy: 0.6878 - loss: 0.6117 - val_accuracy: 0.8150 - val_loss: 0.4800
Epoch 2/5
25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 53ms/step - accuracy: 0.7733 - loss: 0.5112 - val_accuracy: 0.8150 - val_loss: 0.4835
Epoch 3/5
25/25 ━━━━━━━━━━━━━━━━━━━━ 3s 54ms/step - accuracy: 0.7582 - loss: 0.5423 - val_accuracy: 0.7750 - val_loss: 0.4852
Epoch 4/5
25/25 ━━━━━━━━━━━━━━━━━━━━ 3s 55ms/step - accuracy: 0.8085 - loss: 0.5011 - val_accuracy: 0.8250 - val_loss: 0.4478
Epoch 5/5
25/25 ━━━━━━━━━━━━━━━━━━━━ 3s 55ms/step - accuracy: 0.8138 - loss: 0.4791 - val_accuracy: 0.8250 - val_loss: 0.4500


<keras.src.callbacks.history.History at 0x233eb875af0>
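
The sigmoid output layer yields one probability per review, so turning predictions back into labels needs a threshold. A minimal sketch, assuming the common 0.5 cut-off and the same `{"positive": 1, "negative": 0}` encoding used for training; the helper name `probs_to_labels` and the probabilities below are made up.

```python
def probs_to_labels(probs, threshold=0.5):
    """Convert sigmoid outputs to sentiment labels (1 -> positive, 0 -> negative)."""
    return ["positive" if p >= threshold else "negative" for p in probs]

print(probs_to_labels([0.91, 0.12, 0.50]))
# ['positive', 'negative', 'positive']
```

With the trained network, the input would be `model.predict(X_test_seq).ravel()`.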

In [40]:
pip install keras-tuner


Note: you may need to restart the kernel to use updated packages.


In [43]:
from tensorflow.keras.optimizers import Adam
from keras_tuner import RandomSearch

def build_model(hp):
    embedding_dim = hp.Int("embedding_dim", 32, 128, step=32)
    # Both LSTM layers use the same "lstm_units" hyperparameter, so they always
    # share one value; give them distinct names to tune them independently.
    lstm_units = hp.Int("lstm_units", 32, 128, step=32)
    model = Sequential([
        Embedding(input_dim=5000, output_dim=embedding_dim),
        LSTM(lstm_units, return_sequences=True),
        LSTM(lstm_units),
        Dense(1, activation="sigmoid")
    ])
    model.compile(
        loss="binary_crossentropy",
        optimizer=Adam(hp.Choice("learning_rate", [0.001, 0.0005, 0.0001])),
        metrics=["accuracy"],
    )
    return model

# Trial results are cached under `directory`; a rerun reloads them unless overwrite=True is passed.
tuner = RandomSearch(build_model, objective="val_accuracy", max_trials=5, directory="hyperparam_tuning")
label_map = {"positive": 1, "negative": 0}
tuner.search(X_train_seq, y_train.map(label_map), epochs=5,
             validation_data=(X_test_seq, y_test.map(label_map)))

best_hyperparameters = tuner.get_best_hyperparameters()[0]
print("Best Parameters:", best_hyperparameters.values)


Reloading Tuner from hyperparam_tuning\untitled_project\tuner0.json
Best Parameters: {'embedding_dim': 128, 'lstm_units': 96, 'learning_rate': 0.001}
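
Conceptually, `RandomSearch` draws a handful of configurations from the search space and keeps the one with the best validation score. The sketch below imitates that loop in simplified form over the same three hyperparameters. `fake_val_accuracy` is a made-up scoring function standing in for actually training a model, so its numbers carry no meaning, and unlike a real tuner this loop may sample the same configuration twice.

```python
import random

# Same values as in build_model: Int(32, 128, step=32) -> 32, 64, 96, 128.
SEARCH_SPACE = {
    "embedding_dim": [32, 64, 96, 128],
    "lstm_units": [32, 64, 96, 128],
    "learning_rate": [0.001, 0.0005, 0.0001],
}

def fake_val_accuracy(config):
    """Hypothetical stand-in for 'train the model and return val_accuracy'."""
    return 0.7 + config["lstm_units"] / 1000 + config["learning_rate"] * 10

def random_search(space, score_fn, max_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        # Draw one value per hyperparameter, independently and uniformly.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(SEARCH_SPACE, fake_val_accuracy, max_trials=5)
print(best, round(score, 4))
```

The search space holds 4 × 4 × 3 = 48 combinations, so `max_trials=5` samples only a small fraction of it.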
