
Step 1: Download the Dataset

In [None]:
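# Requires a Kaggle API token (kaggle.json) to be configured in this environment
!kaggle datasets download -d snap/amazon-fine-food-reviews
!unzip -o amazon-fine-food-reviews.zip  # -o overwrites existing files without prompting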


Step 2: Load the Data

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("Reviews.csv")

# Display first few rows
df.head()


Step 3: Data Preprocessing

Drop Unnecessary Columns

In [None]:
df = df[['Score', 'Text']]
df = df.dropna()  # Remove missing values


Convert Ratings to Sentiment Labels
We'll classify reviews into Positive, Neutral, and Negative:

1-2 stars → Negative (-1)
3 stars → Neutral (0)
4-5 stars → Positive (1)

In [None]:
def sentiment_label(score):
    if score <= 2:
        return -1  # Negative
    elif score == 3:
        return 0   # Neutral
    else:
        return 1   # Positive

df['Sentiment'] = df['Score'].apply(sentiment_label)
df = df[['Text', 'Sentiment']].copy()  # Keep only needed columns; copy so later column assignments do not warn
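
As a quick sanity check of the mapping, the sketch below applies the same thresholds to a handful of made-up scores and counts the resulting labels. The `sample_scores` list is hypothetical and not taken from the dataset; on the real data, `df['Sentiment'].value_counts()` gives the same kind of summary and shows how balanced the three classes are.

```python
from collections import Counter

def sentiment_label(score):
    """Map a 1-5 star rating to -1 (negative), 0 (neutral), or 1 (positive)."""
    if score <= 2:
        return -1
    elif score == 3:
        return 0
    return 1

# Hypothetical ratings, not taken from the dataset
sample_scores = [5, 4, 1, 3, 5, 2, 5]

# Count how many sample ratings fall into each sentiment class
label_counts = Counter(sentiment_label(s) for s in sample_scores)
print(label_counts)
```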


Step 4: Text Preprocessing
We clean the text before applying ML models.

In [None]:
# Remove digits from the raw review text
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)



In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLP resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab') # Download the punkt_tab data

# Initialize NLP tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if not isinstance(text, str):  # Check if text is a valid string
        return ""

    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = re.sub(r'\d+', '', text)  # Remove numbers (optional)
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces

    tokens = word_tokenize(text)  # Tokenization
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]  # Remove stopwords & lemmatize

    return ' '.join(tokens) if tokens else "empty"  # Avoid empty strings

# Apply cleaning
df['Cleaned_Text'] = df['Text'].apply(clean_text)
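
To see what the regex steps do on their own, here is a minimal sketch that applies only the lowercasing and the three substitutions, without the NLTK tokenization, stopword removal, or lemmatization. The helper name `normalize` and the sample sentence are made up for illustration.

```python
import re

def normalize(text):
    """Apply only the regex steps of clean_text (no NLTK required)."""
    text = text.lower()                       # Lowercase
    text = re.sub(r'\W', ' ', text)           # Replace non-word characters with spaces
    text = re.sub(r'\d+', '', text)           # Drop digit runs
    return re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace

# Hypothetical review text
sample = "I LOVED it!!! 5 stars, 100% recommend."
print(normalize(sample))  # i loved it stars recommend
```

Stopword removal would then drop words such as "i" and "it", and lemmatization would reduce plural nouns such as "stars" to their base form.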




Step 5: Sentiment Analysis

We train a logistic regression classifier on TF-IDF features to predict the sentiment label.

1. Convert Text into Features (TF-IDF Vectorization)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # Convert text to numerical form
X = vectorizer.fit_transform(df['Cleaned_Text'])
y = df['Sentiment']


2. Train a Logistic Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (max_iter raised from the default of 100 to give the solver room to converge)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
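
One possible variation, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that TF-IDF statistics are learned from the training portion only. The sketch also stratifies the split and prints per-class precision and recall, since a single accuracy figure can hide weak performance on smaller classes. The `texts` and `labels` lists are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up cleaned reviews and labels (hypothetical, four per class)
texts = [
    "great taste love", "delicious fresh tasty", "best snack ever", "excellent flavor",
    "okay nothing special", "average taste fine", "decent not great", "fine okay average",
    "terrible stale awful", "bad taste waste", "horrible smell", "worst product ever",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, -1, -1, -1, -1]

# Split the raw text first; stratify keeps the class ratio in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, y_train)

# Per-class precision, recall, and F1 on the held-out texts
y_pred = clf.predict(test_texts)
print(classification_report(y_test, y_pred, zero_division=0))
```

With only twelve toy reviews the scores themselves mean little; the point is the order of operations.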


Step 6: Topic Modeling using LDA (Latent Dirichlet Allocation)

To find topics in the reviews, we use LDA.

In [None]:
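from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA models word counts, so build a count matrix instead of reusing the TF-IDF features
count_vectorizer = CountVectorizer(max_features=5000)
X_counts = count_vectorizer.fit_transform(df['Cleaned_Text'])

# Apply LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)  # 5 topics
lda.fit(X_counts)

# Get the 10 highest-weighted words per topic, strongest first
words = count_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx+1}: ", [words[i] for i in topic.argsort()[::-1][:10]])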
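
Besides the top words per topic, LDA also gives a topic mixture for each document. The sketch below shows this on four made-up reviews, assuming a `CountVectorizer` as input because LDA is usually described in terms of word counts. All names and texts here are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cleaned reviews covering two rough themes (hypothetical data)
toy_docs = [
    "coffee bean roast flavor", "strong coffee dark roast",
    "dog treat puppy love", "dog food healthy treat",
]

# Word counts per document
counts = CountVectorizer().fit_transform(toy_docs)

# fit_transform returns one row per document; each row is a topic distribution
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(counts)

for doc, weights in zip(toy_docs, doc_topics):
    print(f"{doc!r}: dominant topic {np.argmax(weights) + 1}, weights {weights.round(2)}")
```

Each row sums to 1, so the weights can be read as the share of the review attributed to each topic.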


Step 7: (Optional) Deploy with Streamlit

In [None]:
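# Save this code as app.py. It expects clean_text, vectorizer, and model
# to be defined or loaded in the same file.
import streamlit as st

def predict_sentiment(text):
    cleaned = clean_text(text)  # Local name must differ from the clean_text function
    vectorized_text = vectorizer.transform([cleaned])
    sentiment = model.predict(vectorized_text)[0]
    return "Positive" if sentiment == 1 else "Negative" if sentiment == -1 else "Neutral"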

st.title("Amazon Review Sentiment Analyzer")
user_input = st.text_area("Enter a review:")
if st.button("Analyze"):
    result = predict_sentiment(user_input)
    st.write("Predicted Sentiment:", result)
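
A Streamlit app runs as its own script, so it cannot see variables that exist only in the notebook session. One common way to bridge the gap, sketched here under the assumption that `joblib` is acceptable for this project, is to save the fitted objects to disk in the notebook and load them at the top of `app.py`. The file name `sentiment_pipeline.joblib` and the tiny training set are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the trained vectorizer and model (made-up data)
texts = ["great taste", "love it", "awful taste", "hate it"]
labels = [1, 1, -1, -1]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# In the notebook: write the fitted objects to disk (hypothetical file name)
model_path = os.path.join(tempfile.gettempdir(), "sentiment_pipeline.joblib")
joblib.dump(pipeline, model_path)

# In app.py: load them back before handling any user input
loaded = joblib.load(model_path)
print(loaded.predict(["great taste"]))
```

The same pattern works for a separately stored `vectorizer` and `model`; the text-cleaning function still has to be defined or imported in `app.py` itself.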


In [None]:
!streamlit run app.py
