---

## 🔄 Project Flow: Sentiment Analysis of Product Reviews

---

### 🔹 1. **Dataset Creation**
- A synthetic dataset is created containing 20 product reviews labeled as either `positive` or `negative`.
- Saved as `product_reviews.csv` for reproducibility.

---

### 🔹 2. **Text Preprocessing (NLP Pipeline)**  
Performed using **NLTK**:
- **Tokenization** – Split each review into individual words (tokens).
- **Lowercasing** – Normalize all text to lowercase.
- **Stopword Removal** – Eliminate common non-informative words (e.g., “is”, “and”, “the”).
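- **Lemmatization** – Convert words to their base (dictionary) form (e.g., “expectations” → “expectation”). `WordNetLemmatizer` treats every word as a noun unless a part of speech is supplied, so verb forms such as “exceeded” are left unchanged.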
- **Punctuation Removal** – Remove special characters and symbols.

✔️ Result: Cleaned and meaningful text ready for vectorization.
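
The stopword step deserves care in sentiment work: NLTK's English stopword list contains negations such as "not", so a review like "Not worth the money" loses its negation. The sketch below shows one way to keep negations. It is a simplified stand-in, not this notebook's pipeline: it uses a regex tokenizer and a tiny hand-written stopword set instead of NLTK, it skips lemmatization, and the names `SMALL_STOPWORDS`, `NEGATIONS`, and `preprocess_keep_negations` are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# NLTK's English list is much longer.
SMALL_STOPWORDS = {"this", "is", "and", "the", "it", "i", "my", "very", "of", "not", "no"}
NEGATIONS = {"not", "no"}
EFFECTIVE_STOPWORDS = SMALL_STOPWORDS - NEGATIONS  # negations are kept


def preprocess_keep_negations(text):
    """Lowercase, tokenize on letters, and drop stopwords except negations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in EFFECTIVE_STOPWORDS]
    return " ".join(kept)


print(preprocess_keep_negations("Not worth the money, very disappointing"))
# not worth money disappointing
```

Whether keeping negations helps depends on the data; on a dataset this small it is mainly a point to be aware of.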

---

### 🔹 3. **Feature Extraction (TF-IDF Vectorization)**  
- Convert cleaned reviews into numerical format using **TfidfVectorizer**.
- Each review is now represented as a vector of weighted word frequencies.

✔️ This step transforms raw text into a format suitable for machine learning.
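
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It uses the smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which matches the default settings of `TfidfVectorizer`. The helper name `tfidf_vectors` is hypothetical, and the three documents are illustrative cleaned reviews.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = [
    "product amazing work perfectly",
    "terrible product complete waste",
    "love item highly recommend",
]
for doc, vec in zip(docs, tfidf_vectors(docs)):
    print(doc, "->", {t: round(w, 2) for t, w in vec.items()})
```

In the first two documents, "product" receives a lower weight than the other terms because it appears in two of the three documents, while every other term appears in only one.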

---

### 🔹 4. **Train-Test Split**
- Dataset is split into training (75%) and testing (25%) sets using `train_test_split`.
- Ensures models are trained and evaluated on separate data for unbiased results.
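
With only 20 reviews, a purely random split can leave the test set unbalanced between the two classes. A stratified split samples each class separately so that both sets keep the original class ratio; in scikit-learn this is what passing `stratify=y` to `train_test_split` does. The pure-Python sketch below illustrates the idea under stated assumptions: `stratified_split` is a hypothetical helper, and it rounds the per-class test count down, so the test set here holds 4 reviews rather than 5.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    test_idx = []
    for indices in by_class.values():
        n_test = int(len(indices) * test_fraction)  # rounds down per class
        test_idx.extend(rng.sample(indices, n_test))

    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


labels = ["positive", "negative"] * 10  # 10 reviews per class, alternating
train_idx, test_idx = stratified_split(labels)
print("test class counts:", sorted(labels[i] for i in test_idx))
# test class counts: ['negative', 'negative', 'positive', 'positive']
```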

---

### 🔹 5. **Model Training**
Two classification models are trained:
- **Multinomial Naive Bayes** – Designed for discrete features such as word counts; in practice it also works with fractional TF-IDF weights, which is how it is used here.
- **Logistic Regression** – A strong baseline for binary classification problems.

✔️ Both models are fitted on the training data.
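
As a rough illustration of what Multinomial Naive Bayes computes, the sketch below implements it from scratch on raw word counts with Laplace (add-one) smoothing. It is a simplified stand-in for intuition only, not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the four training reviews are illustrative, and the notebook's models are fitted on TF-IDF weights rather than counts.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in doc.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Smoothed log likelihood of the word given the class
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


train_docs = [
    "love item highly recommend",
    "fantastic quality exceeded expectation",
    "terrible experience broke one use",
    "poor design stopped working quickly",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)

print(predict_nb(model, "love quality"))     # positive
print(predict_nb(model, "terrible design"))  # negative
```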

---

### 🔹 6. **Model Evaluation**
Each model is evaluated on the test data using:
- **Accuracy Score** – Overall correctness.
- **Classification Report** – Includes precision, recall, and F1-score for both classes.

✔️ This helps compare performance and choose the better model, although with only 5 test reviews the scores are a rough indication at best.

---

### 🔹 7. **Predictions on New Reviews**
- Example reviews (unseen during training) are preprocessed and vectorized.
- Predictions are made using both trained models.
  
✔️ Shows how the models behave on new, unseen inputs.

---

In [12]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from wordcloud import WordCloud
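import matplotlib.pyplot as plt

# Download the NLTK data used below (skipped if already present).
# Depending on the NLTK version, the tokenizer data is named 'punkt' or 'punkt_tab'.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)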

In [3]:
# Step 1: Create a small synthetic dataset (20 labeled reviews)
data = {
    'review': [
        'This product is amazing and works perfectly',
        'Terrible experience, it broke after one use',
        'I love this item, highly recommend it',
        'Not worth the money, very disappointing',
        'Fantastic quality, exceeded my expectations',
        'Poor design, stopped working quickly',
        'Really happy with this purchase',
        'Awful product, waste of time',
        'Superb performance, very satisfied',
        'Horrible, it failed immediately',
        'Great value for the price',
        'Bad quality, would not buy again',
        'Excellent product, works as advertised',
        'Disappointing, broke within days',
        'Love the design and functionality',
        'Worst purchase ever, totally useless',
        'Impressive features, highly recommend',
        'Cheaply made, fell apart fast',
        'Amazing experience, will buy again',
        'Terrible product, complete waste'
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'positive', 'negative', 
                  'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 
                  'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 
                  'positive', 'negative']
}
df = pd.DataFrame(data)
df.to_csv('product_reviews.csv', index=False)

In [4]:
# Step 2: Text Preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

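def preprocess_text(text):
    """Tokenize, drop stopwords and punctuation, then lowercase and lemmatize."""
    tokens = word_tokenize(text)
    # The punctuation check removes single-character punctuation tokens such as ',' or '!'
    tokens = [lemmatizer.lemmatize(token.lower()) for token in tokens 
              if token.lower() not in stop_words and token not in string.punctuation]
    # Re-join into one string, the input format TfidfVectorizer expects
    return ' '.join(tokens)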

In [5]:
# Apply preprocessing and inspect a sample of the result
df['cleaned_review'] = df['review'].apply(preprocess_text)
print("Cleaned Reviews Sample:")
print(df[['review', 'cleaned_review']].head())

Cleaned Reviews Sample:
                                        review  \
0  This product is amazing and works perfectly   
1  Terrible experience, it broke after one use   
2        I love this item, highly recommend it   
3      Not worth the money, very disappointing   
4  Fantastic quality, exceeded my expectations   

                           cleaned_review  
0          product amazing work perfectly  
1       terrible experience broke one use  
2              love item highly recommend  
3               worth money disappointing  
4  fantastic quality exceeded expectation  


In [6]:
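# Step 3: Vectorization
# Note: the vectorizer is fitted on all reviews before the train-test split, so
# test-set vocabulary and IDF statistics leak into training; fitting on the
# training split only would give a stricter evaluation.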
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_review'])
print("\nTF-IDF Vocabulary:")
print(list(tfidf_vectorizer.vocabulary_.keys())[:10])  # Show first 10 words


TF-IDF Vocabulary:
['product', 'amazing', 'work', 'perfectly', 'terrible', 'experience', 'broke', 'one', 'use', 'love']


In [7]:
# Step 4: Train-Test Split
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

In [8]:
# Step 5: Train Models
# Naive Bayes
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)

# Logistic Regression
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [9]:
# Step 6: Evaluate Models
print("\nNaive Bayes Results:")
y_pred_nb = nb_clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))

print("\nLogistic Regression Results:")
y_pred_lr = lr_clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))


Naive Bayes Results:
Accuracy: 0.6
Classification Report:
               precision    recall  f1-score   support

    negative       1.00      0.33      0.50         3
    positive       0.50      1.00      0.67         2

    accuracy                           0.60         5
   macro avg       0.75      0.67      0.58         5
weighted avg       0.80      0.60      0.57         5


Logistic Regression Results:
Accuracy: 0.4
Classification Report:
               precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.40      1.00      0.57         2

    accuracy                           0.40         5
   macro avg       0.20      0.50      0.29         5
weighted avg       0.16      0.40      0.23         5
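
The Logistic Regression numbers above can be reproduced by hand. The report shows a test set of 3 negative and 2 positive reviews, and a recall of 0.00 for `negative` together with 1.00 for `positive` means the model predicted `positive` for all 5. The sketch below recomputes the per-class metrics from those labels. `precision_recall_f1` is a hypothetical helper, and the order of the labels in `y_true` is an assumption, since only the class counts are known from the report.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1; undefined ratios are reported as 0.0."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 3 negative and 2 positive test reviews, all predicted positive
y_true = ["negative", "negative", "negative", "positive", "positive"]
y_pred = ["positive"] * 5

for label in ("negative", "positive"):
    p, r, f1 = precision_recall_f1(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# negative: precision=0.00 recall=0.00 f1=0.00
# positive: precision=0.40 recall=1.00 f1=0.57
```

The precision for `negative` is 0/0 because no review was predicted negative. scikit-learn reports such undefined values as 0.00 and emits an undefined-metric warning.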





In [10]:
# Step 7: Predict on New Reviews
new_reviews = ["This product is great!", "Really bad quality, broke instantly"]
new_reviews_cleaned = [preprocess_text(review) for review in new_reviews]
new_reviews_tfidf = tfidf_vectorizer.transform(new_reviews_cleaned)
print("\nNew Reviews Cleaned:")
print(new_reviews_cleaned)

print("\nNaive Bayes Predictions:")
nb_predictions = nb_clf.predict(new_reviews_tfidf)
for review, pred in zip(new_reviews, nb_predictions):
    print(f"Review: {review} --> Sentiment: {pred}")

print("\nLogistic Regression Predictions:")
lr_predictions = lr_clf.predict(new_reviews_tfidf)
for review, pred in zip(new_reviews, lr_predictions):
    print(f"Review: {review} --> Sentiment: {pred}")


New Reviews Cleaned:
['product great', 'really bad quality broke instantly']

Naive Bayes Predictions:
Review: This product is great! --> Sentiment: positive
Review: Really bad quality, broke instantly --> Sentiment: negative

Logistic Regression Predictions:
Review: This product is great! --> Sentiment: positive
Review: Really bad quality, broke instantly --> Sentiment: positive


### Key Learnings

> Text preprocessing techniques such as tokenization, stopword removal, and lemmatization.
>
> Representing text data numerically using TF-IDF vectorization for machine learning.
>
> Comparing machine learning models like Naive Bayes and Logistic Regression for classification tasks.
>
> Use of evaluation metrics like accuracy, precision, recall, and F1-score to assess model performance.




### Summary
This project demonstrates the complete lifecycle of an NLP task using classical machine learning:
> **Raw Text ➡️ Cleaned Text ➡️ Vectorized ➡️ Trained ➡️ Evaluated ➡️ Predicted**

---
