### STEP 1: Setup and Load Libraries
##### This step prepares the environment:
- Mounts Google Drive for file access
- Installs and loads required Python libraries (NLTK, tqdm, pandas, etc.)
- Downloads NLTK data for text processing

In [4]:
from google.colab import drive
import pandas as pd
import numpy as np
import re, string
import nltk
import pickle
from tqdm.notebook import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Mount Google Drive to access files
drive.mount('/content/drive')

# Install tqdm if not present, then enable .progress_apply for pandas
!pip install tqdm
tqdm.pandas()

# NLTK setup
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')

# Initialize NLP tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

Mounted at /content/drive


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


### STEP 2: Load and Clean Sentiment Data
##### This step loads the sentiment-labeled review data:
- Reads preprocessed CSV with sentiment scores
- Drops null values from key columns

In [5]:
sentiment_df = pd.read_csv('/content/drive/MyDrive/Amazon_Recommender/data/processed/03_df_with_sentiment_2.csv')
sentiment_df = sentiment_df[['asin', 'reviewText','overall','sentiment_score']].dropna()
display(sentiment_df.head())
print(sentiment_df.isnull().sum())

| | asin | reviewText | overall | sentiment_score |
|---|---|---|---|---|
| 0 | 151004714 | This is the best novel I have read in 2 or 3 y... | 5.0 | 0.9601 |
| 1 | 151004714 | Pages and pages of introspection, in the style... | 3.0 | 0.8382 |
| 2 | 151004714 | This is the kind of novel to read when you hav... | 5.0 | 0.9642 |
| 3 | 151004714 | What gorgeous language! What an incredible wri... | 5.0 | 0.9737 |
| 4 | 151004714 | I was taken in by reviews that compared this b... | 3.0 | 0.985 |


asin               0
reviewText         0
overall            0
sentiment_score    0
dtype: int64


#### Challenge: reviewText Column Unexpectedly Contains Nulls After Reloading Data
###### Problem Description:
- A DataFrame that had been confirmed free of nulls before it was saved showed 287 nulls in the reviewText column after being reloaded from the file.

###### Root Cause Analysis:

- Debugging traced the problem to the method used to read the file.

- We had initially read what appeared to be a JSON Lines (.jsonl) file with a line-by-line parser. This approach assumes that each physical line is a complete, independent JSON object.

- That assumption breaks when the reviewText value itself contains newline characters (\n), for instance when a user presses "Enter" to start a new paragraph in a review. A single JSON object is then split across multiple physical lines, and the line-by-line parser treats each fragment as if the object had ended prematurely. The resulting parse errors surface as null values for those records.
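
The failure mode described above can be reproduced on a toy input. The sketch below is not the loader used in this notebook: it assumes the file holds raw (unescaped) newlines inside string values and is small enough to read into memory. The names `parse_line_by_line` and `iter_json_objects` are hypothetical. It contrasts a line-by-line parse with one that decodes object by object using the standard library's `json.JSONDecoder.raw_decode`.

```python
import json

def parse_line_by_line(text):
    """Naive JSON Lines parse: one json.loads call per physical line."""
    records, failures = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            failures += 1  # a fragment of a record that was split by a newline
    return records, failures

def iter_json_objects(text):
    """Decode consecutive JSON objects regardless of where the line breaks fall."""
    # strict=False lets the decoder accept raw control characters inside strings
    decoder = json.JSONDecoder(strict=False)
    pos, n = 0, len(text)
    while pos < n:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Toy data (hypothetical): the first review contains a raw newline
raw = (
    '{"asin": "A1", "reviewText": "First paragraph.\nSecond paragraph."}\n'
    '{"asin": "A2", "reviewText": "One line."}\n'
)

naive_records, naive_failures = parse_line_by_line(raw)
robust_records = list(iter_json_objects(raw))
print(len(naive_records), naive_failures, len(robust_records))
```

On this input the line-by-line parse recovers only the second record, while the object-by-object decode recovers both, including the paragraph break in the first review.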

### STEP 3: Group Reviews by ASIN
##### This step consolidates all reviews per product:
- Groups multiple reviews by ASIN
- Joins all review text into a single string per product

In [6]:
df_grouped = sentiment_df.groupby('asin')['reviewText'].apply(lambda x: ' '.join(x)).reset_index()
print(f"Number of unique products: {len(df_grouped)}")
display(df_grouped.head())

Number of unique products: 160052


| | asin | reviewText |
|---|---|---|
| 0 | 101635370 | I figured out how to use it. It's okay for li... |
| 1 | 151004714 | This is the best novel I have read in 2 or 3 y... |
| 2 | 380709473 | I read this probably 50 years ago in my youth ... |
| 3 | 446697192 | Fresh from Connecticut, Taylor Henning lands a... |
| 4 | 511189877 | This remote, for whatever reason, was chosen b... |


### STEP 4: Clean Text Reviews
##### This step cleans review text:
- Lowercases and removes punctuation/numbers
- Tokenizes and lemmatizes the text
- Removes stopwords and short tokens

In [7]:
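def clean_text(text):
    """Lowercase, strip punctuation and digits, then tokenize, filter, and lemmatize."""
    text = text.lower()
    # Delete punctuation characters (note: "it's" becomes "its")
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    # Delete runs of digits
    text = re.sub(r'\d+', '', text)
    words = nltk.word_tokenize(text)
    # Drop stopwords and tokens of two characters or fewer, then lemmatize what remains
    return ' '.join([lemmatizer.lemmatize(word) for word in words if word not in stop_words and len(word) > 2])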

df_grouped['cleaned_text'] = df_grouped['reviewText'].progress_apply(clean_text)
display(df_grouped.head())

| | asin | reviewText | cleaned_text |
|---|---|---|---|
| 0 | 101635370 | I figured out how to use it. It's okay for li... | figured use okay listening radio station would... |
| 1 | 151004714 | This is the best novel I have read in 2 or 3 y... | best novel read year everything fiction beauti... |
| 2 | 380709473 | I read this probably 50 years ago in my youth ... | read probably year ago youth reread first time... |
| 3 | 446697192 | Fresh from Connecticut, Taylor Henning lands a... | fresh connecticut taylor henning land dream jo... |
| 4 | 511189877 | This remote, for whatever reason, was chosen b... | remote whatever reason chosen time warner repl... |
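
To see what each cleaning step does to a single sentence, here is a simplified stand-in for `clean_text` that runs without NLTK. It is a sketch only: `clean_text_sketch` and `TOY_STOP_WORDS` are hypothetical names, the stopword list is a tiny illustrative subset, whitespace splitting replaces `nltk.word_tokenize`, and lemmatization is omitted.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration (the notebook uses NLTK's English list)
TOY_STOP_WORDS = {"the", "and", "for", "this", "was"}

def clean_text_sketch(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()
    # Delete punctuation; contractions collapse ("it's" -> "its")
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # Delete digits
    text = re.sub(r"\d+", "", text)
    # Whitespace split stands in for nltk.word_tokenize here
    tokens = text.split()
    # Keep tokens longer than two characters that are not stopwords
    return " ".join(t for t in tokens if t not in stop_words and len(t) > 2)

sample = "This remote was chosen in 2015, and it's OK for the price!"
print(clean_text_sketch(sample))  # remote chosen its price
```

The example also shows a side effect of deleting punctuation outright: "it's" becomes "its", which survives this toy stopword list.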


### STEP 5: Aggregate Sentiment Scores by Product
##### This step computes average sentiment score per ASIN:
- Groups sentiment by ASIN and calculates the mean

In [8]:
sentiment_avg = sentiment_df.groupby('asin')['sentiment_score'].mean().reset_index()
sentiment_avg.columns = ['asin', 'avg_sentiment']
display(sentiment_avg.head())

| | asin | avg_sentiment |
|---|---|---|
| 0 | 101635370 | 0.218728 |
| 1 | 151004714 | 0.94424 |
| 2 | 380709473 | 0.495933 |
| 3 | 446697192 | 0.972782 |
| 4 | 511189877 | 0.397433 |


### STEP 6: Merge Cleaned Text with Sentiment Scores
##### This step merges textual and sentiment data:
- Joins aggregated sentiment to grouped text
- Renames the column for clarity

In [9]:
df_grouped = df_grouped.merge(sentiment_avg, on='asin', how='left')
df_grouped = df_grouped.rename(columns={'avg_sentiment': 'sentiment_score'})
display(df_grouped.head())

| | asin | reviewText | cleaned_text | sentiment_score |
|---|---|---|---|---|
| 0 | 101635370 | I figured out how to use it. It's okay for li... | figured use okay listening radio station would... | 0.218728 |
| 1 | 151004714 | This is the best novel I have read in 2 or 3 y... | best novel read year everything fiction beauti... | 0.94424 |
| 2 | 380709473 | I read this probably 50 years ago in my youth ... | read probably year ago youth reread first time... | 0.495933 |
| 3 | 446697192 | Fresh from Connecticut, Taylor Henning lands a... | fresh connecticut taylor henning land dream jo... | 0.972782 |
| 4 | 511189877 | This remote, for whatever reason, was chosen b... | remote whatever reason chosen time warner repl... | 0.397433 |


### STEP 7: TF-IDF Vectorization
##### This step vectorizes text:
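- Initializes a TfidfVectorizer (at most 5,000 features, min_df=5)
- Fits it on the cleaned text and transforms the text into a TF-IDF matrix
- Saves the vectorizer, a preview of the features, and the matrix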

In [10]:
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, min_df=5, token_pattern=r'\b[a-zA-Z]{3,}\b', stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df_grouped['cleaned_text'])
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

# Preview only: densify the first 5 rows, not the whole matrix
feature_names = vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf_matrix[:5].toarray(), columns=feature_names)
display(df_tfidf.head())

# Save to disk (note: this CSV holds only the 5-row preview above)
df_tfidf.to_csv('/content/drive/MyDrive/Amazon_Recommender/data/processed/04_df_tfidf.csv', index=False)
with open('/content/drive/MyDrive/Amazon_Recommender/models/04_tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Sparse copy of the full matrix
sparse.save_npz('/content/drive/MyDrive/Amazon_Recommender/models/04_tfidf_matrix.npz', tfidf_matrix)
# Dense copy: 160052 x 5000 float64 values take roughly 6.4 GB in memory
np.save('/content/drive/MyDrive/Amazon_Recommender/models/04_tfidf_matrix.npy', tfidf_matrix.toarray())

TF-IDF matrix shape: (160052, 5000)


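| | aaa | abc | aberration | ability | able | absolute | absolutely | abuse | abused | accent | ... | zen | zero | zip | zipper | zippered | zone | zoom | zoomed | zooming | zune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.083263 | 0.0 | 0.0 | 0.005358 | 0.017298 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.011406 | 0.0 | 0.015625 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |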
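
For intuition about the values in this matrix, the following pure-Python sketch computes TF-IDF weights on a toy corpus. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that `TfidfVectorizer` applies by default. The function name `tfidf_sketch` and the three documents are hypothetical, and the sketch ignores `max_features`, `min_df`, `token_pattern`, and `stop_words`.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for toks in tokenized for t in toks})
    # Document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        # L2-normalize each row so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["great remote great price", "great novel", "remote stopped working"]
vocab, rows = tfidf_sketch(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

In the second toy document, "novel" outweighs "great" because "great" also appears in another document. The same effect pushes product-specific vocabulary ahead of generic praise in the real matrix.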


### STEP 8: Build Nearest Neighbors Model
##### This step builds the recommender engine:
- Fits NearestNeighbors using cosine distance
- Saves the fitted model together with the neighbor distances and indices


In [None]:
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(tfidf_matrix)

distances, indices = nn.kneighbors(tfidf_matrix, n_neighbors=6)
asin_list = df_grouped['asin'].tolist()

with open('/content/drive/MyDrive/Amazon_Recommender/models/04_nearest_neighbors_and_result.pkl', 'wb') as f:
    pickle.dump({'model': nn, 'distances': distances, 'indices': indices}, f)
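
As a rough illustration of what a brute-force cosine search returns, the sketch below ranks a handful of toy vectors by cosine distance to a query row. The helper names and vectors are hypothetical, and it is not how scikit-learn implements the search. Treating an all-zero vector as maximally distant is an assumption made here for simplicity.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an all-zero vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def brute_force_neighbors(vectors, query_idx, k):
    """Return the k nearest (row index, distance) pairs, closest first."""
    scored = [(cosine_distance(vectors[query_idx], v), j) for j, v in enumerate(vectors)]
    scored.sort()
    return [(j, d) for d, j in scored[:k]]

# Toy vectors (hypothetical): row 1 points mostly in the same direction as row 0
vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
neighbors = brute_force_neighbors(vectors, query_idx=0, k=3)
print(neighbors)
```

The query row is its own nearest neighbor at distance 0, which is why the cell above requests `n_neighbors=6` to end up with five other products.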

### STEP 9: Define Recommendation Function
##### This step calculates a hybrid score:
- Uses cosine similarity (1 - dist)
- Adds weighted sentiment

In [None]:
# Map each ASIN to its row position once, so lookups avoid a linear list scan
asin_to_idx = {a: i for i, a in enumerate(asin_list)}

def recommend_similar_products(asin, top_n=5, sentiment_weight=0.3):
    """Rank the precomputed neighbors of `asin` by similarity plus weighted sentiment."""
    if asin not in asin_to_idx:
        return f"{asin} not found."

    idx = asin_to_idx[asin]
    final_scores = []
    for i, dist in zip(indices[idx], distances[idx]):
        if i == idx:
            continue  # skip the query product itself, wherever it appears in the neighbor list
        sentiment = df_grouped.iloc[i]['sentiment_score']
        # Hybrid score: cosine similarity (1 - distance) plus weighted average sentiment
        final_score = (1 - dist) + sentiment_weight * sentiment
        final_scores.append((i, final_score))

    final_scores = sorted(final_scores, key=lambda x: x[1], reverse=True)
    top_idxs = [i for i, score in final_scores[:top_n]]
    return df_grouped.iloc[top_idxs][['asin', 'reviewText', 'sentiment_score']]
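
The effect of the sentiment term is easiest to see on a few made-up candidates. This sketch applies the same formula, `(1 - dist) + sentiment_weight * sentiment`, to hypothetical distances and sentiment values. The helper `hybrid_score` and the product labels are illustrative only.

```python
def hybrid_score(distance, sentiment, sentiment_weight=0.3):
    # Cosine similarity (1 - distance) plus a weighted sentiment bonus
    return (1 - distance) + sentiment_weight * sentiment

# (label, cosine distance to the query, average sentiment): hypothetical values
candidates = [
    ("A", 0.10, 0.20),
    ("B", 0.15, 0.95),
    ("C", 0.40, 0.99),
]

by_similarity = sorted(candidates, key=lambda c: c[1])
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2]), reverse=True)

print([c[0] for c in by_similarity])  # ['A', 'B', 'C']
print([c[0] for c in ranked])         # ['B', 'A', 'C']
```

With a weight of 0.3, a well-reviewed product can overtake a slightly closer but poorly reviewed one (B passes A). Sentiment alone cannot rescue a distant match (C stays last).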

### STEP 10: Generate Full Recommendation List
##### This step builds the recommendation table:
- Iterates through all ASINs
- Applies the recommender to each one and collects the results in a DataFrame

In [None]:
results = []
for asin in tqdm(df_grouped['asin']):
    recs = recommend_similar_products(asin)
    results.append({'asin': asin, 'recommended_asins': recs['asin'].tolist()})

df_recommendations_results = pd.DataFrame(results)
display(df_recommendations_results.head())

### STEP 11: Analyze Output
##### This step analyzes the top recommended items:
- Counts the most frequently recommended products

In [14]:
from collections import Counter
flat = []
for rec in df_recommendations_results['recommended_asins']:
    flat.extend(rec)

print(f"Total number of unique recommended products: {len(set(flat))}")
Counter(flat).most_common(10)

Total number of unique recommended products: 75078


[('B011KFQASE', 512),
 ('B00FJILVDS', 432),
 ('B00ITYXRU4', 420),
 ('B00D6NLDJU', 416),
 ('B00LP6CFEC', 408),
 ('B005G2C42E', 392),
 ('B00J21DFGE', 388),
 ('B01865QFJA', 371),
 ('B001IBFSJ8', 368),
 ('B00STP86CW', 351)]
