# Sentiment Analysis on Amazon Product Reviews

## 1. Dataset Overview
- **Dataset Description**:
  - Analyze an Amazon product review dataset containing textual reviews (`reviewText`) and corresponding sentiment labels (`Positive`).
  - Sentiment is binary: 1 for positive, 0 for negative.
- **Objective**:
  - Predict the sentiment of a product review based on its textual content.


In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/rashakil-ds/Public-Datasets/refs/heads/main/amazon.csv'
df = pd.read_csv(url)
df.head()

| Unnamed: 0 | reviewText | Positive |
|---|---|---|
| 0 | This is a one of the best apps acording to a b... | 1 |
| 1 | This is a pretty good version of the game for ... | 1 |
| 2 | this is a really cool game. there are a bunch ... | 1 |
| 3 | This is a silly game and can be frustrating, b... | 1 |
| 4 | This is a terrific game on any pad. Hrs of fun... | 1 |
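
Since the label is binary, it can help to check the class balance and count missing values right after loading, because both affect how the later metrics should be read. The following is a minimal sketch on a small made-up frame: `toy_df` and its rows are hypothetical, and only the column names match the dataset. On the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the column names match.
toy_df = pd.DataFrame({
    "reviewText": ["great app", "love it", None, "crashes constantly"],
    "Positive": [1, 1, 1, 0],
})

# Number of missing values per column.
missing_counts = toy_df.isna().sum()

# Share of each label (1 = positive, 0 = negative).
label_share = toy_df["Positive"].value_counts(normalize=True)

print(missing_counts)
print(label_share)
```

A strongly skewed `label_share` is a hint to look at per-class precision and recall rather than accuracy alone.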


## 2. Data Preprocessing
- Handle missing values, if any.
- Perform text preprocessing on the `reviewText` column:
  - Convert text to lowercase.
  - Remove stop words, punctuation, and special characters.
  - Tokenize and lemmatize text data.
- Split the dataset into training and testing sets.


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import string
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

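# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
punct_table = str.maketrans('', '', string.punctuation)

# Preprocessing function
def preprocess_text(text):
    """Lowercase, strip ASCII punctuation, drop stop words, and lemmatize."""
    text = text.lower().translate(punct_table)
    tokens = text.split()  # simple whitespace tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Handle missing values: drop rows without review text so .lower() never sees NaN
df = df.dropna(subset=['reviewText']).reset_index(drop=True)
df['cleaned_text'] = df['reviewText'].apply(preprocess_text)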

# Split dataset
X = df['cleaned_text']
y = df['Positive']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
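
The split above is a plain random split. When the labels are imbalanced, an optional variant is to pass `stratify` so that both splits keep the same label ratio. This is a sketch of that option, not what the notebook does, and it runs on hypothetical toy data (`toy_X`, `toy_y`) rather than the review dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 75 positive and 25 negative labels.
toy_X = [f"review {i}" for i in range(100)]
toy_y = [1] * 75 + [0] * 25

# stratify=toy_y keeps the 75/25 label ratio in both the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(sum(y_te), len(y_te))
```

Switching the notebook to a stratified split would change which rows land in the test set, so the reported metrics would shift slightly.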

## 3. Model Selection
- Choose at least three machine learning models for sentiment classification:
  - Logistic Regression
  - Random Forest
  - Support Vector Machine (SVM)


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Vectorize text using TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)
lr_predictions = lr_model.predict(X_test_tfidf)

# Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(X_train_tfidf, y_train)
rf_predictions = rf_model.predict(X_test_tfidf)

# SVM
svm_model = SVC()
svm_model.fit(X_train_tfidf, y_train)
svm_predictions = svm_model.predict(X_test_tfidf)
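
The vectorize-then-fit steps above can also be bundled into a single scikit-learn pipeline, so that TF-IDF is always fitted on training text only and raw strings can be passed straight to `predict`. The sketch below assumes a tiny made-up corpus (`toy_texts`, `toy_labels`) and uses Logistic Regression as the example classifier; any of the three models could be substituted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the notebook would pass X_train / y_train instead.
toy_texts = [
    "great game love it",
    "terrific app works great",
    "really cool fun game",
    "terrible app crashes constantly",
    "waste of money awful",
    "awful game crashes terrible",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits TF-IDF first, then the classifier on the TF-IDF features.
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(),
)
sentiment_pipeline.fit(toy_texts, toy_labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
toy_predictions = sentiment_pipeline.predict(
    ["love this great game", "awful terrible app"]
)
print(toy_predictions)
```

Keeping the vectorizer inside the pipeline also makes it harder to accidentally call `fit_transform` on the test set.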

## 4. Model Evaluation
- Evaluate models using Accuracy, Precision, Recall, and F1 Score.


In [5]:
# Evaluate Logistic Regression
print('Logistic Regression Metrics:')
print(classification_report(y_test, lr_predictions))

# Evaluate Random Forest
print('Random Forest Metrics:')
print(classification_report(y_test, rf_predictions))

# Evaluate SVM
print('SVM Metrics:')
print(classification_report(y_test, svm_predictions))

Logistic Regression Metrics:
              precision    recall  f1-score   support

           0       0.84      0.65      0.73       958
           1       0.90      0.96      0.93      3042

    accuracy                           0.89      4000
   macro avg       0.87      0.81      0.83      4000
weighted avg       0.88      0.89      0.88      4000

Random Forest Metrics:
              precision    recall  f1-score   support

           0       0.83      0.59      0.69       958
           1       0.88      0.96      0.92      3042

    accuracy                           0.87      4000
   macro avg       0.85      0.77      0.80      4000
weighted avg       0.87      0.87      0.86      4000

SVM Metrics:
              precision    recall  f1-score   support

           0       0.86      0.66      0.75       958
           1       0.90      0.97      0.93      3042

    accuracy                           0.89      4000
   macro avg       0.88      0.81      0.84      4000
weighted 
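
To make the columns in the reports above concrete, the four metrics can be computed by hand from confusion-matrix counts. The helper `binary_metrics` below is a hypothetical illustration and the counts passed to it are made up; `classification_report` performs the same arithmetic for each label in turn, treating that label as the "positive" one.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything truly positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
example = binary_metrics(tp=90, fp=10, fn=10, tn=40)
print(example)
```

With these counts, precision and recall are both 90/100, while accuracy is 130/150, which shows how accuracy can differ from the per-class figures.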

## 5. Conclusion
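- Summarize findings and discuss the best-performing model.
  - On the 4,000-review test set, Logistic Regression and SVM both reach 0.89 accuracy; Random Forest reaches 0.87.
  - SVM has the highest macro-averaged F1 (0.84), narrowly ahead of Logistic Regression (0.83) and Random Forest (0.80).
  - All three models are weaker on the negative class: recall for label 0 ranges from 0.59 (Random Forest) to 0.66 (SVM), compared with 0.96 to 0.97 for label 1.
  - The test set is imbalanced (958 negative vs. 3,042 positive reviews), so accuracy alone overstates how well negative reviews are detected.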
