### Customer Insight

# Exploratory Data Analysis & Model Understanding

This notebook explores the dataset used for the Customer Sentiment Analysis project.

The goals are:
- Understand the structure and quality of the data
- Analyze label distribution and text characteristics
- Establish a baseline machine learning model
- Explain why transformer-based models are expected to outperform the baseline


In [None]:
from datasets import load_dataset
import pandas as pd
from pathlib import Path

In [None]:
RAW_DATA_DIR = Path("data/raw")
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

In [None]:
def download_amazon_reviews(sample_size: int = 5000) -> None:
  """
  Download a sample of Amazon Polarity reviews.
  Saves raw data to data/raw/.
  """
  print("Downloading dataset...")
  dataset = load_dataset("amazon_polarity", split=f"train[:{sample_size}]")

  df = pd.DataFrame({
      "text": dataset["content"],
      "label": dataset["label"]
  })

  output_path = RAW_DATA_DIR / "amazon_reviews.csv"
  df.to_csv(output_path, index=False)

  print(f"Saved raw data to {output_path}")


if __name__ == "__main__":
  download_amazon_reviews()

## 1. Dataset Overview

The dataset consists of customer product reviews labeled with sentiment.

Each row contains:
- `text`: review text (taken from the `content` field of the source dataset)
- `label`: sentiment class (0 = negative, 1 = positive)

The dataset was cleaned prior to modeling to remove noise such as emails, excessive whitespace, and formatting issues.
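
The cleaning code itself is not shown in this notebook. As a rough sketch of what removing emails and excessive whitespace could look like, the hypothetical `clean_text` helper below uses two illustrative regular expressions; both the function name and the patterns are assumptions, not the project's actual rules.

```python
import re

# Hypothetical patterns: the notebook does not show the real cleaning rules.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove email addresses and collapse runs of whitespace."""
    text = EMAIL_RE.sub(" ", text)       # drop anything that looks like an email
    text = WHITESPACE_RE.sub(" ", text)  # newlines, tabs, repeated spaces -> one space
    return text.strip()


example = "Great   product!\nContact me at jane.doe@example.com  for details."
cleaned = clean_text(example)
print(cleaned)
```

If a helper like this were used, it could be applied to the loaded data with `df["text"].apply(clean_text)`.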


In [None]:
# Load the raw reviews written by download_amazon_reviews()

df = pd.read_csv(RAW_DATA_DIR / "amazon_reviews.csv")
df.head()

In [None]:
df.tail()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.isna().sum()

In [None]:
df.describe()

In [None]:
df['label'].value_counts(normalize=True)

In [None]:
import matplotlib.pyplot as plt

df['label'].value_counts().plot(kind="bar")
plt.title("Label Distribution")
plt.show()

In [None]:
# Text length in words (whitespace-separated tokens); missing text stays NaN
df["text_length"] = df["text"].str.split().str.len()

df["text_length"].describe()


In [None]:
df["text_length"].hist(bins=50)
plt.title("Text Length Distribution")
plt.show()

### Baseline TF-IDF experiment

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


X = df["text"]
y = df["label"]


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

vectorizer = TfidfVectorizer(max_features=20000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

preds = model.predict(X_test_vec)

print(classification_report(y_test, preds))

## 5. Baseline Model

Before applying deep learning, a classical machine learning baseline was implemented.

Pipeline:
- TF-IDF vectorization
- Logistic Regression classifier

This baseline establishes a reference performance level for comparison.
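
The two pipeline steps can also be bundled into a single scikit-learn `Pipeline`, so that the vectorizer and classifier are fitted and applied together. The sketch below is one possible arrangement, not necessarily how the project structures it; the toy reviews and the `baseline` variable name are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy reviews invented for illustration; not taken from the dataset.
toy_texts = [
    "great product, works perfectly",
    "terrible quality, broke in a day",
    "absolutely love it",
    "waste of money, very disappointed",
]
toy_labels = [1, 0, 1, 0]

# Same components and hyperparameters as the baseline experiment above.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(toy_texts, toy_labels)

# The pipeline accepts raw strings and vectorizes them internally.
toy_preds = baseline.predict(["love this great product", "very disappointed"])
print(toy_preds)
```

Keeping both steps in one object means the vectorizer is only ever fitted on whatever data is passed to `fit`, which helps avoid leaking test-set vocabulary into training.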


TF-IDF treats text as an unordered bag of independent words, so word order and context are lost.
Transformers model relationships between words using attention, which lets them handle negation, sentiment intensity, and context better.
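
To make the negation point concrete, the plain-Python sketch below builds the unigram bag that word-level TF-IDF weights are computed from. The two example reviews are invented, and the hypothetical `unigram_bag` helper only mimics the default tokenization of `TfidfVectorizer` (lowercasing, tokens of two or more word characters); it is not the library's implementation.

```python
import re
from collections import Counter


def unigram_bag(text: str) -> Counter:
    """Lowercase the text and count word tokens, discarding word order."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


# Invented reviews with opposite sentiment but the same set of words.
review_a = "The food was good, not bad."
review_b = "The food was bad, not good."

bag_a = unigram_bag(review_a)
bag_b = unigram_bag(review_b)

# Identical bags mean a unigram model cannot tell the two reviews apart.
same_bag = bag_a == bag_b
print(same_bag)
```

Because both reviews reduce to the same word counts, any classifier built on unigram features must give them the same prediction, whereas a model that attends to word order can separate them.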

### Conclusion

- The dataset is moderately clean and balanced.
- The TF-IDF + Logistic Regression baseline performs reasonably well.
- Transformer-based models are expected to improve on the baseline by capturing contextual semantics.
- Deployment requires separating experimentation code from production code.