# Automated Customer Reviews — Preprocessing (Day 1)

Goal: turn raw Amazon reviews into clean `train.csv` / `test.csv` files we can reuse for:
- Sentiment classification (negative / neutral / positive)
- Category clustering
- GenAI summaries

## Dataset check

Confirm that the raw file loads and contains the columns we need, so problems surface here rather than in later steps.

In [1]:
import pandas as pd

RAW_PATH = "../data/raw/1429_1.csv"

df_raw = pd.read_csv(RAW_PATH, low_memory=False)

print("shape:", df_raw.shape)
print("n_cols:", len(df_raw.columns))

key_cols = ["reviews.text", "reviews.rating", "name", "brand", "categories"]
missing = [c for c in key_cols if c not in df_raw.columns]
print("missing key cols:", missing if missing else "None")

df_raw[key_cols].head(2)

shape: (34660, 21)
n_cols: 21
missing key cols: None


|   | reviews.text | reviews.rating | name | brand | categories |
|---|---|---|---|---|---|
| 0 | This product so far has not disappointed. My c... | 5.0 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... |
| 1 | great for beginner or experienced person. Boug... | 5.0 | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... |
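
A lighter variant of this check reads only the header row, so a wrong file fails fast before the full load. This is a minimal sketch using the standard `csv` module; `missing_columns` is a hypothetical helper, and the inline sample is made-up data standing in for the real file.

```python
import csv
import io

KEY_COLS = ["reviews.text", "reviews.rating", "name", "brand", "categories"]

def missing_columns(handle, required):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(handle), [])  # [] if the file is empty
    present = set(header)
    return [c for c in required if c not in present]

# Made-up in-memory sample (assumption: comma-delimited, header on the first line).
sample = io.StringIO("id,name,brand,reviews.rating\nA1,Fire HD 8,Amazon,5\n")
print(missing_columns(sample, KEY_COLS))  # ['reviews.text', 'categories']
```

On the real file the handle would come from something like `open(RAW_PATH, newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.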


## Only the columns we’ll use

To keep things simple (and fast), we trim the dataset to:
- product metadata: `id`, `name`, `brand`, `categories`
- review fields: `reviews.text`, `reviews.rating`

In [5]:
KEEP_COLS = ["id", "name", "brand", "categories", "reviews.text", "reviews.rating"]

df = df_raw[KEEP_COLS].copy()
print("working shape:", df.shape)
df.head(2)

working shape: (34660, 6)


|   | id | name | brand | categories | reviews.text | reviews.rating |
|---|---|---|---|---|---|---|
| 0 | AVqkIhwDv8e3D1O-lebb | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | This product so far has not disappointed. My c... | 5.0 |
| 1 | AVqkIhwDv8e3D1O-lebb | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | great for beginner or experienced person. Boug... | 5.0 |
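
If load time or memory ever becomes a concern, the same trim can happen at read time instead of after it. The sketch below uses the `usecols` parameter of `pd.read_csv` on a made-up two-row sample; `extra_col` is a hypothetical column that stands in for everything we do not need.

```python
import io
import pandas as pd

KEEP_COLS = ["id", "name", "brand", "categories", "reviews.text", "reviews.rating"]

# Made-up sample with one extra column that should not be loaded.
raw = io.StringIO(
    "id,name,brand,categories,extra_col,reviews.text,reviews.rating\n"
    "A1,Tablet,Amazon,Electronics,x,Works well,5\n"
    "A2,Charger,Amazon,Accessories,y,Stopped working,1\n"
)

# usecols loads only the named columns; the result follows the file's column order.
df_small = pd.read_csv(raw, usecols=KEEP_COLS)
print(df_small.shape)  # (2, 6)
```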


## Clean review text + normalize ratings

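Light cleaning only:
- collapse runs of whitespace (spaces, tabs, newlines) into a single space
- leave wording, casing, and punctuation untouched so the text stays readable for models
- convert ratings to numeric
- drop rows with empty text or a missing / non-numeric rating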

In [6]:
import re

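def clean_text(x) -> str:
    """Return x as a single-line string with whitespace runs collapsed."""
    if pd.isna(x):
        return ""  # missing reviews become empty strings and are dropped below
    x = str(x).strip()
    x = re.sub(r"\s+", " ", x)  # collapse spaces, tabs, and newlines
    return x

df["text"] = df["reviews.text"].map(clean_text)
df["rating"] = pd.to_numeric(df["reviews.rating"], errors="coerce")  # non-numeric -> NaN

before = len(df)
df = df.dropna(subset=["rating"]).copy()  # drop missing / non-numeric ratings
df = df[df["text"].str.len() > 0].copy()  # drop empty reviews
after = len(df)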

print(f"kept {after:,} / {before:,} rows after cleaning")
df[["rating", "text"]].head(3)

kept 34,626 / 34,660 rows after cleaning


|   | rating | text |
|---|---|---|
| 0 | 5.0 | This product so far has not disappointed. My c... |
| 1 | 5.0 | great for beginner or experienced person. Boug... |
| 2 | 5.0 | Inexpensive tablet for him to use and learn on... |
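
The cell above drops ratings that are missing or fail numeric conversion, but it does not check that they fall between 1 and 5. The sketch below repeats the same steps on made-up rows and adds a range filter. The range check is an assumption added here for illustration, not something the cell above does, and `toy` is invented data.

```python
import re
import pandas as pd

# Made-up rows covering the edge cases: messy whitespace, empty text, bad ratings.
toy = pd.DataFrame({
    "reviews.text": ["  Great\n\ntablet  ", None, "ok", "   ", "fine"],
    "reviews.rating": ["5", "4", "abc", "3", "9"],
})

def clean_text(x) -> str:
    if pd.isna(x):
        return ""
    return re.sub(r"\s+", " ", str(x).strip())

toy["text"] = toy["reviews.text"].map(clean_text)
toy["rating"] = pd.to_numeric(toy["reviews.rating"], errors="coerce")

valid = (
    toy["rating"].between(1, 5)        # assumed valid range; NaN compares as False
    & (toy["text"].str.len() > 0)      # non-empty text after cleaning
)
clean = toy[valid]
print(clean[["rating", "text"]])
```

Only the first row survives: the others have empty text, a non-numeric rating, or a rating outside the assumed range.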


## Create sentiment labels (negative / neutral / positive)

Default mapping (project requirement):
- 1–2 → **negative**
- 3 → **neutral**
- 4–5 → **positive**

In [7]:
def rating_to_label(r: float) -> str:
    """Map a star rating to a sentiment label (expects a non-missing rating from 1 to 5)."""
    if r <= 2:
        return "negative"
    if r == 3:
        return "neutral"
    return "positive"

df["label"] = df["rating"].map(rating_to_label)

print("label counts:")
print(df["label"].value_counts())
df[["rating", "label", "text"]].head(5)

label counts:
label
positive    32315
neutral      1499
negative      812
Name: count, dtype: int64


|   | rating | label | text |
|---|---|---|---|
| 0 | 5.0 | positive | This product so far has not disappointed. My c... |
| 1 | 5.0 | positive | great for beginner or experienced person. Boug... |
| 2 | 5.0 | positive | Inexpensive tablet for him to use and learn on... |
| 3 | 4.0 | positive | I've had my Fire HD 8 two weeks now and I love... |
| 4 | 5.0 | positive | I bought this for my grand daughter when she c... |
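
`rating_to_label` assigns a label to any number it receives, so a stray 0 would become "negative" and a 3.5 would become "positive". That is fine as long as ratings are whole stars from 1 to 5. A stricter sketch, assuming that range, spells out each boundary and raises on anything else; `rating_to_label_strict` is a hypothetical name, not part of the notebook.

```python
def rating_to_label_strict(r: float) -> str:
    """Map a 1-5 star rating to a sentiment label, rejecting anything else."""
    if r in (1, 2):
        return "negative"
    if r == 3:
        return "neutral"
    if r in (4, 5):
        return "positive"
    raise ValueError(f"unexpected rating: {r!r}")

# Float ratings such as 5.0 compare equal to their integer counterparts.
for stars in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(stars, "->", rating_to_label_strict(stars))
```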


## Final modeling dataset

This is the clean table we’ll reuse across the whole project.

In [8]:
FINAL_COLS = ["id", "name", "brand", "categories", "rating", "label", "text"]
df_final = df[FINAL_COLS].copy()

print("final shape:", df_final.shape)
df_final.head(3)

final shape: (34626, 7)


|   | id | name | brand | categories | rating | label | text |
|---|---|---|---|---|---|---|---|
| 0 | AVqkIhwDv8e3D1O-lebb | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 | positive | This product so far has not disappointed. My c... |
| 1 | AVqkIhwDv8e3D1O-lebb | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 | positive | great for beginner or experienced person. Boug... |
| 2 | AVqkIhwDv8e3D1O-lebb | All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,... | Amazon | Electronics,iPad & Tablets,All Tablets,Fire Ta... | 5.0 | positive | Inexpensive tablet for him to use and learn on... |


## Split + save files

We use a stratified 80/20 train/test split with a fixed `random_state`, so both files keep the same label proportions as the full dataset and the split is reproducible.

Outputs:
- `data/processed/train.csv`
- `data/processed/test.csv`

In [9]:
from sklearn.model_selection import train_test_split
from pathlib import Path

out_dir = Path("../data/processed")
out_dir.mkdir(parents=True, exist_ok=True)

train_df, test_df = train_test_split(
    df_final,
    test_size=0.2,
    random_state=42,
    stratify=df_final["label"],
)

train_path = out_dir / "train.csv"
test_path  = out_dir / "test.csv"

train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print("saved:")
print(" -", train_path, train_df.shape)
print(" -", test_path,  test_df.shape)

print("\ntrain label dist:")
print(train_df["label"].value_counts(normalize=True).round(3))

print("\ntest label dist:")
print(test_df["label"].value_counts(normalize=True).round(3))

saved:
 - ../data/processed/train.csv (27700, 7)
 - ../data/processed/test.csv (6926, 7)

train label dist:
label
positive    0.933
neutral     0.043
negative    0.023
Name: proportion, dtype: float64

test label dist:
label
positive    0.933
neutral     0.043
negative    0.023
Name: proportion, dtype: float64
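
As a sanity check, the split sizes and proportions printed above can be reproduced from the label counts alone. The arithmetic below assumes the test size is rounded up, which matches the 6,926 test rows reported.

```python
import math

# Label counts as printed in the labeling step.
label_counts = {"positive": 32315, "neutral": 1499, "negative": 812}
total = sum(label_counts.values())      # 34626

n_test = math.ceil(0.2 * total)         # 6925.2 rounded up -> 6926
n_train = total - n_test                # 27700
print(n_train, n_test)

# Stratification keeps these proportions (to rounding) in both splits.
for label, count in label_counts.items():
    print(label, round(count / total, 3))
```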


## Spot-check

Draw a small random sample to confirm that labels and text look reasonable.

In [10]:
train_df.sample(5, random_state=42)[["rating", "label", "text"]]

|   | rating | label | text |
|---|---|---|---|
| 17858 | 5.0 | positive | Purchased for my 75 year old mom so she could ... |
| 13914 | 5.0 | positive | very fast and kid friendly for my young son gi... |
| 3402 | 5.0 | positive | Good charger. |
| 6356 | 5.0 | positive | I need tablets that were cheap but had the pro... |
| 2205 | 5.0 | positive | Got this when they were on sale last year and ... |
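
With about 93% of rows labeled positive, a plain random sample of five rarely surfaces neutral or negative reviews, and this one did not. Sampling per label is one way to see the minority classes too. The sketch below runs on made-up rows and assumes pandas 1.1 or later for `DataFrameGroupBy.sample`.

```python
import pandas as pd

# Made-up rows with at least two examples per label.
toy = pd.DataFrame({
    "rating": [5, 5, 4, 3, 3, 1, 2, 5],
    "label": ["positive", "positive", "positive", "neutral", "neutral",
              "negative", "negative", "positive"],
    "text": ["love it", "great", "good value", "it is ok", "average",
             "broke fast", "too slow", "perfect"],
})

# Two rows per label; n must not exceed the smallest group unless replace=True.
per_label = toy.groupby("label").sample(n=2, random_state=42)
print(per_label[["rating", "label", "text"]])
```

On the real split, the equivalent call would be along the lines of `train_df.groupby("label").sample(n=3, random_state=42)`.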


## Output summary (for reporting)

At this point we have:
- Cleaned review text
- Sentiment labels from star ratings
- Stratified train/test CSVs saved to `data/processed/`

In [11]:
train_path, test_path

(PosixPath('../data/processed/train.csv'),
 PosixPath('../data/processed/test.csv'))
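
Before these files go to the modeling notebooks, it can be worth reloading them once to confirm the schema survived the round trip. This is a minimal sketch using the standard `csv` module; `check_split_file` is a hypothetical helper, and the demo writes a throwaway file with one made-up row rather than touching `data/processed/`.

```python
import csv
import tempfile
from pathlib import Path

FINAL_COLS = ["id", "name", "brand", "categories", "rating", "label", "text"]
LABELS = {"negative", "neutral", "positive"}

def check_split_file(path):
    """Check the header and label values of a saved split; return its row count."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != FINAL_COLS:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        n_rows = 0
        for row in reader:
            if row["label"] not in LABELS:
                raise ValueError(f"unexpected label: {row['label']!r}")
            n_rows += 1
    return n_rows

# Demo on a throwaway file with one made-up row.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "train.csv"
    with open(demo, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FINAL_COLS)
        writer.writerow(["A1", "Tablet", "Amazon", "Electronics", 5.0, "positive", "Works well"])
    n_demo_rows = check_split_file(demo)
    print(n_demo_rows)  # 1
```

Calling `check_split_file(train_path)` and `check_split_file(test_path)` should then return the row counts printed by the split cell.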