# Data Loading
This is the first notebook in my pipeline for building a recommendation system on Amazon products. It handles data ingestion and preprocessing for my model. The data comes from an open source repository of datasets that can be used to train a recommendation model (https://cseweb.ucsd.edu/~jmcauley/datasets.html). I need to load:

- **Training Data**: User reviews with their ratings, timestamps, and purchase history
- **Product Metadata**: Product details including titles, prices, ratings, and store names

I will then merge these datasets to create one training dataset that contains user review data and product metadata.

In [None]:
from glob import glob
import gzip
import json
import os
import pandas as pd

First, I need to load the compressed CSV files containing user review data. Each record represents a user's interaction with a product, including:
- **user_id**: The unique user identifier
- **parent_asin**: The unique product identifier (Amazon Standard Identification Number)
- **rating**: A user's rating for the product (1-5 scale)
- **timestamp**: When the review was made
- **history**: A sequential list of previously purchased products
- **category**: The product's category

In [None]:
# read training review data
all_files = glob("train_data/*.csv.gz")

train_dfs = []

for file in all_files:
    # each file is a gzip-compressed CSV; open it in text mode for pandas
    with gzip.open(file, 'rt', encoding='utf-8') as f:
        df = pd.read_csv(f)
    train_dfs.append(df)

train_df = pd.concat(train_dfs, ignore_index=True)

Let's examine the structure and content of our training data.

In [None]:
train_df.head()

Unnamed: 0,user_id,parent_asin,rating,timestamp,history,category
0,AFJTRBXMURLHS5EGNXLUHDHIZRFQ,B096WPNG8Q,5.0,1600542207688,,Patio_Lawn_and_Garden
1,AFJTRBXMURLHS5EGNXLUHDHIZRFQ,B000BQT5IG,3.0,1602272552200,B096WPNG8Q,Patio_Lawn_and_Garden
2,AFJTRBXMURLHS5EGNXLUHDHIZRFQ,B002FGU2MI,4.0,1624053736863,B096WPNG8Q B000BQT5IG,Patio_Lawn_and_Garden
3,AEFKF6R2GUSK2AWPSWRR4ZO36JVQ,B073V7N6RQ,5.0,1566941698710,,Patio_Lawn_and_Garden
4,AEFKF6R2GUSK2AWPSWRR4ZO36JVQ,B01J0RIRUS,4.0,1566941843328,B073V7N6RQ,Patio_Lawn_and_Garden
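
The preview suggests two columns that will likely need parsing before modeling: `timestamp` looks like milliseconds since the Unix epoch, and `history` looks like a space-separated string of ASINs that is empty for a user's first review. The sketch below works on a few toy rows copied from the preview. The unit of the timestamp is an assumption, and the column names `reviewed_at`, `history_list`, and `history_len` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy rows copied from the preview above (columns trimmed for brevity).
sample = pd.DataFrame({
    "parent_asin": ["B096WPNG8Q", "B000BQT5IG", "B002FGU2MI"],
    "timestamp": [1600542207688, 1602272552200, 1624053736863],
    "history": [None, "B096WPNG8Q", "B096WPNG8Q B000BQT5IG"],
})

# Assumption: timestamps are milliseconds since the Unix epoch.
sample["reviewed_at"] = pd.to_datetime(sample["timestamp"], unit="ms")

# A missing history means "no prior purchases", so map it to an empty list.
sample["history_list"] = sample["history"].fillna("").str.split()
sample["history_len"] = sample["history_list"].str.len()

print(sample[["parent_asin", "reviewed_at", "history_len"]])
```

Keeping the parsed history as a list makes it straightforward to truncate or pad sequences later, if the model ends up consuming purchase history as a sequence.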


Next, I need to process the product metadata, which is stored in compressed JSONL files. From each file, I extract the key product features, write them to Parquet in chunks of up to 500,000 records so the raw metadata never has to sit in memory all at once, and then combine the chunks into a single dataframe for analysis.

The key features I extract are:
- **title**: The product's name/description
- **average_rating**: The product's average rating across all users
- **price**: The product's price
- **store**: The product's brand/store name
- **parent_asin**: The product's unique identifier
- **rating_number**: The number of ratings received
- **main_category**: The product's main category


In [None]:
output_dir = "meta_chunks"
data_dir = "metadata"
file_pattern = os.path.join(data_dir, "meta_*.jsonl.gz")

columns_to_keep = ["title", "average_rating", "price", "store", "parent_asin", "rating_number", "main_category"]

os.makedirs(output_dir, exist_ok=True)

chunk_id = 0

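def safe_float(val):
    # convert a raw price to float; missing or unparseable values become -1.0
    try:
        return float(val)
    except (TypeError, ValueError, OverflowError):
        return -1.00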

for file_path in glob(file_pattern):
    chunk = []
    with gzip.open(file_path, 'rt', encoding='utf-8') as f:
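        for line in f:
            try:
                record = json.loads(line)
                # keep only the selected fields; absent keys become None
                filtered = {key: record.get(key) for key in columns_to_keep}
                # missing or unparseable prices become the sentinel -1.0
                filtered["price"] = safe_float(filtered["price"])
                chunk.append(filtered)
            except (json.JSONDecodeError, AttributeError):
                # skip malformed lines and records that are not JSON objects
                continue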

            if len(chunk) >= 500000:
                df = pd.DataFrame(chunk)
                df["price"] = df["price"].astype("float64")
                output_file = os.path.join(output_dir, f"chunk_{chunk_id}.parquet")
                df.to_parquet(output_file, engine='pyarrow', compression='snappy', index=False)
                chunk_id += 1
                chunk = []

    if chunk:
        df = pd.DataFrame(chunk)
        df["price"] = df["price"].astype("float64")
        output_file = os.path.join(output_dir, f"chunk_{chunk_id}.parquet")
        df.to_parquet(output_file, engine='pyarrow', compression='snappy', index=False)
        chunk_id += 1
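
The chunking logic in the cell above can be easier to reason about in isolation. The following is a minimal sketch, not the notebook's actual implementation: it batches parsed records with a generator and runs on a few in-memory lines instead of the real `.jsonl.gz` files. The names `iter_chunks`, `to_price`, and `COLUMNS`, along with the sample records, are hypothetical and exist only for this illustration.

```python
import json
from typing import Iterable, Iterator

# Shortened column list, for illustration only.
COLUMNS = ["title", "price", "parent_asin"]


def to_price(val) -> float:
    """Return val as a float, or the sentinel -1.0 if it cannot be parsed."""
    try:
        return float(val)
    except (TypeError, ValueError):
        return -1.0


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size filtered records."""
    chunk = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        row = {key: record.get(key) for key in COLUMNS}
        row["price"] = to_price(row["price"])
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # flush the final partial chunk


sample_lines = [
    '{"title": "A", "price": "15.99", "parent_asin": "X1"}',
    '{"title": "B", "price": null, "parent_asin": "X2"}',
    'not valid json',
    '{"title": "C", "price": "from 9.99", "parent_asin": "X3"}',
]
chunks = list(iter_chunks(sample_lines, chunk_size=2))
print([len(c) for c in chunks])
```

With a generator like this, the Parquet write would only need to appear once, inside a single loop over the yielded chunks, rather than once inside the loop and again for the leftover records.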

In [4]:
# read metadata
all_chunks = glob(os.path.join(output_dir, "chunk_*.parquet"))
meta_df = pd.concat([pd.read_parquet(f) for f in all_chunks], ignore_index=True)

Let's examine the loaded metadata structure and sample records.

In [6]:
meta_df.head()

Unnamed: 0,title,average_rating,price,store,parent_asin
0,Dark Roast Pure Coffee,4.7,-1.0,Luzianne,B00NE08WM6
1,PICARAS Galletas Peruanas Bañadas en Chocolate...,4.5,15.99,Winters,B084Q13Q5Q
2,Chipped Beef and Gravy By Patterson's - Great ...,3.2,-1.0,Pattersons,B00KBRUYVM
3,Asher's Sugar Free Milk Chocolate Cordial Cher...,5.0,29.99,Generic,B0BN4PW255
4,Messmer Peppermint 25 bags (6er pack),3.5,29.99,Messmer,B06X9DC27H


In [7]:
meta_df.shape

(22930397, 5)

To make future analysis easier, let's do some light feature engineering. Missing or unparseable prices were stored as the sentinel value -1.0 during loading, so we can create a binary feature that indicates whether a price is missing. This lets the model treat price availability as a feature in its own right, one that might influence recommendations.

In [6]:
# flag missing prices (stored as -1.0) so the model can learn from price availability
meta_df["price_missing"] = (meta_df["price"] == -1.0).astype(int)
meta_df.head()

Unnamed: 0,title,average_rating,price,store,parent_asin,price_missing
0,Dark Roast Pure Coffee,4.7,-1.0,Luzianne,B00NE08WM6,1
1,PICARAS Galletas Peruanas Bañadas en Chocolate...,4.5,15.99,Winters,B084Q13Q5Q,0
2,Chipped Beef and Gravy By Patterson's - Great ...,3.2,-1.0,Pattersons,B00KBRUYVM,1
3,Asher's Sugar Free Milk Chocolate Cordial Cher...,5.0,29.99,Generic,B0BN4PW255,0
4,Messmer Peppermint 25 bags (6er pack),3.5,29.99,Messmer,B06X9DC27H,0


In [38]:
meta_df.head(n=10)

Unnamed: 0,title,average_rating,price,store,parent_asin,price_missing
0,Dark Roast Pure Coffee,4.7,-1.0,Luzianne,B00NE08WM6,1
1,PICARAS Galletas Peruanas Bañadas en Chocolate...,4.5,15.99,Winters,B084Q13Q5Q,0
2,Chipped Beef and Gravy By Patterson's - Great ...,3.2,-1.0,Pattersons,B00KBRUYVM,1
3,Asher's Sugar Free Milk Chocolate Cordial Cher...,5.0,29.99,Generic,B0BN4PW255,0
4,Messmer Peppermint 25 bags (6er pack),3.5,29.99,Messmer,B06X9DC27H,0
5,Crystal Light Peach Tea Drink Mix (36 Pitcher ...,4.7,-1.0,Crystal Light,B0BN7CKZYC,1
6,Chincoteague Seafood 90944 Vegetable Red Crab ...,5.0,73.57,Chincoteague Seafood,B002HQF1BI,0
7,Lmtime Double Wall Glass Leak-Proof Water Bott...,1.0,-1.0,Lmtime,B083JDXY4S,1
8,"Vintners Best Fruit Wine Base-Rhubarb,128 oz",4.8,47.86,Home Brew Ohio,B019QP6648,0
9,Yuengling Medium Wing Sauce 13 oz,4.1,-1.0,Yuengling,B01FHPU34A,1
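
One thing to keep in mind with the -1.0 sentinel is that it still takes part in any numeric summary of `price`. The sketch below, using the five prices from the first preview rows, shows one possible way to mask the sentinel with NaN before computing statistics. The column name `price_clean` is a hypothetical name for this illustration and is not part of the pipeline above.

```python
import pandas as pd

# Prices from the first five preview rows; -1.0 marks a missing price.
prices = pd.DataFrame({"price": [-1.0, 15.99, -1.0, 29.99, 29.99]})
prices["price_missing"] = (prices["price"] == -1.0).astype(int)

# Replace the sentinel with NaN so pandas skips it in aggregations.
prices["price_clean"] = prices["price"].where(prices["price_missing"] == 0)

naive_mean = prices["price"].mean()        # sentinel values drag this down
clean_mean = prices["price_clean"].mean()  # only real prices contribute

print(f"mean with sentinel: {naive_mean:.2f}")
print(f"mean without sentinel: {clean_mean:.2f}")
```

Whether the model should see the sentinel, NaN, or an imputed value is a modeling decision. The `price_missing` flag keeps the missingness information available in any of those cases.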


In [None]:
# filter metadata to only include products that appear in our training data
relevant_asins = train_df["parent_asin"].unique()
meta_df = meta_df[meta_df["parent_asin"].isin(relevant_asins)]

Now, we can create the final dataset by joining our user interaction data with product metadata. This unified dataset will be used for model training and contains all the necessary features for building a recommendation system:

**Final Dataset Features:**
- **User Information**: user_id, history of previous purchases
- **Product Information**: parent_asin, title, store, category
- **Interaction Data**: rating, timestamp
- **Product Attributes**: average_rating, price, price_missing indicator


In [None]:
train_df = train_df.merge(meta_df, on="parent_asin", how="inner")

# save merged dataframe
output_path = "merged_data/merged_train_df.parquet"
os.makedirs("merged_data", exist_ok=True)
train_df.to_parquet(output_path, index=False, engine='pyarrow', compression='snappy')
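
An inner join like the one above changes the row count in two ways: reviews without matching metadata are dropped, and reviews are duplicated if the metadata contains more than one row for the same `parent_asin`. Whether the real metadata contains such duplicates is not something this notebook checks, so the following is only a sketch on toy data. It deduplicates the metadata first and uses the `validate` argument of `merge` to guard the join. The frames `reviews` and `meta` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train_df and meta_df.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "parent_asin": ["A", "B", "C"],
    "rating": [5.0, 3.0, 4.0],
})
meta = pd.DataFrame({
    "parent_asin": ["A", "A", "B"],  # "A" appears twice, "C" is absent
    "title": ["Item A", "Item A", "Item B"],
    "price": [9.99, 9.99, -1.0],
})

# Keep one metadata row per product so each review matches at most once.
meta_unique = meta.drop_duplicates(subset="parent_asin", keep="first")

# validate="many_to_one" raises MergeError if meta_unique still has duplicate keys.
merged = reviews.merge(
    meta_unique, on="parent_asin", how="inner", validate="many_to_one"
)

dropped = len(reviews) - len(merged)
print(f"{len(merged)} rows kept, {dropped} reviews dropped for missing metadata")
```

Comparing row counts before and after the merge is a cheap sanity check to run before saving the merged file.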