# Day 2 – Text Preprocessing & Lemmatization

**Goal:** Clean and lemmatize the text data from raw Amazon reviews to prepare for model training.


In [1]:
import pandas as pd
import spacy
import nltk
import re
from nltk.corpus import stopwords
from tqdm import tqdm
import json


## Set Up NLP Tools

- Download NLTK stopwords
- Load spaCy English model (`en_core_web_sm`)


In [2]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

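# Load spaCy; download the model on first use if it is missing
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:  # raised when the model package is not installed
    print("Downloading SpaCy model...")
    import subprocess, sys
    # Use the current interpreter so the model installs into this environment
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=True)
    nlp = spacy.load("en_core_web_sm")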


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Downloading SpaCy model...


## Load Raw Dataset
We’ll load `raw_reviews.csv` and only use the `Text` and `Score` columns for now.


In [3]:
df = pd.read_csv("../data/raw_reviews.csv")
# Select the two columns and drop missing rows in one step (avoids in-place edits on a slice)
df = df[['Text', 'Score']].dropna()
df.head()


| | Text | Score |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price.  There was a wid... | 5 |


## Text Cleaning + Lemmatization

We’ll:
- Lowercase text
- Remove special characters (anything that is not a letter or whitespace)
- Lemmatize using spaCy
- Remove stopwords (matched against each lemma)


In [4]:
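def clean_and_lemmatize(text):
    """Lowercase, strip non-letters, lemmatize with spaCy, and drop stopwords."""
    text = text.lower()
    # Delete everything except letters and whitespace (characters are removed, not replaced)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    doc = nlp(text)
    # Keep non-empty lemmas that are not in the NLTK stopword set
    lemmas = [token.lemma_ for token in doc if token.lemma_ not in stop_words and token.lemma_.strip() != ""]

    return " ".join(lemmas)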
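Two details of this function show up in the cleaned output further down. Deleting characters outright joins words that were separated only by punctuation (as in `peanutsthe`), and the pronoun lemma `I` survives because the NLTK stopword list is lowercase. The sketch below shows one possible adjustment using only the standard library. `strip_non_letters`, `keep_lemma`, and the sample sentence are hypothetical names and data introduced for illustration; they are not part of the notebook.

```python
import re

def strip_non_letters(text: str) -> str:
    """Lowercase, replace non-letter characters with a space, collapse whitespace."""
    text = text.lower()
    # Substitute a space (not '') so punctuation still separates words
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_lemma(lemma: str, stop_words: set) -> bool:
    """Keep a lemma only if it is non-empty and not a stopword (case-insensitive)."""
    return bool(lemma.strip()) and lemma.lower() not in stop_words

sample = "Salted Peanuts...the peanuts were small."  # hypothetical review text

# Behaviour of the regex used in the notebook: words get glued together
print(re.sub(r"[^a-zA-Z\s]", "", sample.lower()))
# Adjusted behaviour: word boundaries are preserved
print(strip_non_letters(sample))
```

With this variant, the list comprehension in `clean_and_lemmatize` could filter on `keep_lemma(token.lemma_, stop_words)` instead of the exact-match test. Re-running the cell would change the saved corpus, so the outputs shown in this notebook reflect the original function.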


In [5]:
tqdm.pandas()
df['clean_text'] = df['Text'].progress_apply(clean_and_lemmatize)
df[['clean_text', 'Score']].head()


100%|██████████| 568454/568454 [13:54:40<00:00, 11.35it/s]       


| | clean_text | Score |
|---|---|---|
| 0 | I buy several vitality dog food product find g... | 5 |
| 1 | product arrived label jumbo salt peanutsthe pe... | 1 |
| 2 | confection around century light pillowy citrus... | 4 |
| 3 | look secret ingredient robitussin I believe I ... | 2 |
| 4 | great taffy great price wide assortment yummy ... | 5 |
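The progress bar above reports close to 14 hours for 568,454 reviews (about 11 per second) when `nlp` is called once per row. spaCy also offers `nlp.pipe`, which processes texts in batches and is generally the recommended route for large corpora. The following is a minimal sketch, not benchmarked here; `lemmatize_corpus` and the default `batch_size=1000` are illustrative choices, and the function assumes the texts were already lowercased and stripped of special characters.

```python
from typing import Iterable, List, Set

def lemmatize_corpus(texts: Iterable[str], nlp, stop_words: Set[str],
                     batch_size: int = 1000) -> List[str]:
    """Lemmatize many texts in batches with nlp.pipe and drop stopwords."""
    cleaned = []
    # nlp.pipe yields one Doc per input text, in input order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas = [
            token.lemma_ for token in doc
            if token.lemma_ not in stop_words and token.lemma_.strip() != ""
        ]
        cleaned.append(" ".join(lemmas))
    return cleaned

# Usage sketch (assumes spaCy and en_core_web_sm are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # lemmatizer stays enabled
# df["clean_text"] = lemmatize_corpus(df["Text"].tolist(), nlp, stop_words)
```

Disabling the parser and named-entity recognizer is an assumption that nothing downstream needs them; the tagger and lemmatizer remain active, so lemmas are still produced.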


## Save Cleaned Corpus

We’ll save it as `cleaned_reviews.json` in the `data/` directory.


In [6]:
output = df[['clean_text', 'Score']].to_dict(orient='records')

with open("../data/cleaned_reviews.json", "w") as f:
    json.dump(output, f)

print("✅ Cleaned reviews saved to data/cleaned_reviews.json")


✅ Cleaned reviews saved to data/cleaned_reviews.json
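Because `to_dict(orient='records')` produces a list of dictionaries, the saved file is a JSON array whose items carry the keys `clean_text` and `Score`. A later notebook could read it back along the lines below. This is a sketch under assumptions: the single record and the temporary path are stand-ins so the example runs on its own, where the real file would be `../data/cleaned_reviews.json`.

```python
import json
import os
import tempfile

# Hypothetical one-record sample in the same shape as the saved corpus
records = [{"clean_text": "great taffy great price", "Score": 5}]

# Write to a temporary location instead of ../data/ so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "cleaned_reviews.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Read the records back and split them into texts and labels
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

texts = [row["clean_text"] for row in loaded]
labels = [row["Score"] for row in loaded]
print(texts, labels)
```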


## ✅ Summary

- Loaded raw reviews and selected relevant fields
- Preprocessed and lemmatized text using spaCy
- Removed punctuation and stopwords
- Saved clean corpus for modeling
