# Recommender Systems: Data Preprocessing

This notebook is supplementary material for a series of blog posts on Recommender Systems at [Encora's Insights](https://www.encora.com/insights/all).

This notebook performs a series of essential transformations on the dataset used in this repository, making the execution of the other notebooks possible.

# Download the data

For this demonstration, we use the Amazon 2014 Product Review data. You can download it [here](http://jmcauley.ucsd.edu/data/amazon/links.html).

There is a more recent version of this dataset, but you have to request permission to download the larger files.

As stated on the dataset page:

>This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

More specifically, we will use the [5-core (9.9gb)](http://snap.stanford.edu/data/amazon/productGraph/kcore_5.json.gz) file and the [metadata (3.1gb)](http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz) file.
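Both downloads are gzip archives, while the cells later in this notebook read `dataset/kcore_5.json` and `dataset/metadata.json`. A minimal sketch of one way to decompress them, assuming the archives were saved under `dataset/` (the `gunzip` helper is hypothetical and not part of the original notebook):

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dst: str, buffer_size: int = 1 << 20) -> None:
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=buffer_size)


# Hypothetical usage, assuming the archives live in dataset/:
# gunzip("dataset/kcore_5.json.gz", "dataset/kcore_5.json")
# gunzip("dataset/metadata.json.gz", "dataset/metadata.json")
```

Streaming the copy keeps memory usage flat, which matters for files of this size.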

The 5-core dataset is interesting because it guarantees that each user and each item has at least 5 reviews. The metadata file describes the products, which will be especially useful when building a Content Based Recommender System based on product metadata.
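To make the 5-core property concrete, here is a small sketch of one common way to compute a k-core from a ratings table: repeatedly drop users and items with fewer than `k` reviews until nothing changes. This is an illustration on toy data, not necessarily how the dataset authors built their file; the `k_core` function is hypothetical.

```python
import pandas as pd


def k_core(ratings: pd.DataFrame, k: int = 5,
           user_col: str = "reviewerID", item_col: str = "asin") -> pd.DataFrame:
    """Iteratively remove users/items with fewer than k reviews."""
    while True:
        # number of reviews of each row's user and of each row's item
        user_sizes = ratings[user_col].map(ratings[user_col].value_counts())
        item_sizes = ratings[item_col].map(ratings[item_col].value_counts())
        kept = ratings[(user_sizes >= k) & (item_sizes >= k)]
        if len(kept) == len(ratings):  # stable: nothing was removed
            return kept
        ratings = kept


# Toy example: with k=2, item i3 is removed first, which then leaves
# user u3 with a single review, so u3 is removed in the next pass.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "asin":       ["i1", "i2", "i1", "i2", "i1", "i3"],
})
core = k_core(toy, k=2)
```

The loop is needed because removing an item can push a user below the threshold, and vice versa.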

In this notebook, we will reduce the dataset size so the code can be easily run on systems with limited memory. Despite this reduction, the dataset will need to be further subsampled in the Memory Based Collaborative Filtering notebook to accommodate the creation of a user-item matrix in memory.


# Import the libraries

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import math
import ast
import os

# Configurable Notebook Variables

The variables below configure the data processing and sampling performed in this Notebook.

In [2]:
CHUNK_SIZE = 200000 # chunk size for processing the original kcore_5 file 
USER_SAMPLES = 10000 # number of users to use in the sample. 

# Preprocessing the data

## Reducing the file size of the reviews dataset

Since most of the information in the 5-core file is not needed for product recommendation, we will rewrite the file into a more compact version that is easier to work with in memory.

In [3]:
# setting the seed for reproducibility purposes
np.random.seed(47)

In [4]:
if not os.path.exists('dataset/kcore_5_compact.csv'):
    nrows_og_file = 41135700
    nchunks = math.ceil(nrows_og_file / CHUNK_SIZE)

    with pd.read_json('dataset/kcore_5.json', lines=True, chunksize=CHUNK_SIZE) as reviews_chunk:
        for chunk in tqdm(reviews_chunk, total=nchunks):
            chunk.drop(columns=['reviewerName', 'helpful', 'reviewText', 'summary', 'unixReviewTime','reviewTime'], inplace=True)
            chunk.to_csv('dataset/kcore_5_compact.csv', mode='a', header=False, index=False, quotechar='"')
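The cell above hard-codes the number of rows, which is only used to size the progress bar. If the file differs from the one used here, the count can be computed instead. A minimal sketch, assuming one JSON record per line (the `count_lines` helper is hypothetical):

```python
import math


def count_lines(path: str) -> int:
    """Count lines by iterating over the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def number_of_chunks(path: str, chunk_size: int) -> int:
    """Number of chunks a chunked reader would yield for this file."""
    return math.ceil(count_lines(path) / chunk_size)


# Hypothetical usage with the notebook's variables:
# nchunks = number_of_chunks('dataset/kcore_5.json', CHUNK_SIZE)
```

This costs one extra pass over the file, so hard-coding the value remains reasonable when the input is known.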

The compact version of the dataset consists of the ID of the reviewer, the ID of the product, and the rating given to the product.

In [5]:
kcore_5_compact = pd.read_csv('dataset/kcore_5_compact.csv', names=['reviewerID', 'asin', 'overall'], quotechar='"')
kcore_5_compact.head()

| | reviewerID | asin | overall |
|---|---|---|---|
| 0 | ACNGUPJ3A3TM9 | 13714 | 4 |
| 1 | A2SUAM1J3GNN3B | 13714 | 5 |
| 2 | APOZ15IEYQRRR | 13714 | 5 |
| 3 | AYEDW3BFK53XK | 13714 | 5 |
| 4 | A1KLCGLCXYP1U1 | 13714 | 3 |


In [6]:
kcore_5_compact.shape

(41135700, 3)
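In the preview above, `asin` is shown as `13714`, while product IDs elsewhere in this notebook are zero-padded strings such as `0000589012`. This suggests pandas inferred an integer type for at least part of the column. A small sketch of how passing `dtype` to `read_csv` keeps the IDs as text; the zero-padded value in the toy CSV is an assumption for illustration:

```python
import io

import pandas as pd

# Toy CSV in the same layout as kcore_5_compact.csv (values are illustrative).
csv_text = "ACNGUPJ3A3TM9,0000013714,4\nA2SUAM1J3GNN3B,0000013714,5\n"
cols = ["reviewerID", "asin", "overall"]

# Default type inference turns an all-digit column into integers.
inferred = pd.read_csv(io.StringIO(csv_text), names=cols)

# Forcing a string dtype preserves the leading zeros.
as_text = pd.read_csv(io.StringIO(csv_text), names=cols, dtype={"asin": str})
```

Keeping `asin` as a string everywhere avoids mismatches when the reviews are later matched against the metadata.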

## Resample

We will resample the data to simplify this demonstration and fit the whole solution (data, model, etc.) in memory. Because we sample whole users, every sampled user keeps all of their reviews, but the resulting dataset is no longer a 5-core: some items may end up with fewer than 5 reviews after resampling.

It would be possible to use other smaller 5-core files available in the Amazon review data page. However, the difference here is that this sample contains items from multiple product categories.

In [7]:
if not os.path.exists('dataset/kcore_5_compact_sample.csv'):
    unique_users = kcore_5_compact['reviewerID'].unique()
    user_sample = np.random.choice(unique_users.shape[0], USER_SAMPLES, replace=False)
    kcore_5_compact_sample = kcore_5_compact[kcore_5_compact['reviewerID'].isin(unique_users[user_sample])]
    kcore_5_compact_sample.to_csv('dataset/kcore_5_compact_sample.csv', mode='a', header=False, index=False, quotechar='"')

In [8]:
kcore_5_compact_sample = pd.read_csv('dataset/kcore_5_compact_sample.csv', names=['reviewerID', 'asin', 'overall'], quotechar='"')
kcore_5_compact_sample.shape

(135470, 3)
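The sampling step above can be illustrated on a toy table. This sketch uses NumPy's `default_rng` generator instead of the global seed used in the notebook, and the data is made up; it only shows that sampling by user keeps every review of each selected user.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u3", "u3", "u3", "u4"],
    "asin":       ["i1", "i2", "i1", "i2", "i3", "i4", "i1"],
    "overall":    [5, 4, 3, 5, 2, 4, 1],
})

rng = np.random.default_rng(47)

# Draw users (not rows) without replacement, then keep all of their reviews.
users = ratings["reviewerID"].drop_duplicates().to_numpy()
chosen = rng.choice(users, size=2, replace=False)
sample = ratings[ratings["reviewerID"].isin(chosen)]
```

Sampling rows directly would instead thin out each user's history, which is usually less desirable for collaborative filtering.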

Now we process the metadata file as well, dropping some columns that will not be used in this demonstration and keeping only the products that exist in our 5-core sample.

These are the attributes we keep from the product metadata:
* asin: ID of the product
* categories: list of categories the product belongs to
* title: name of the product
* description: description of the product 
* related: related products (also bought, also viewed, bought together, buy after viewing)
    * also bought: other products that customers bought after purchasing a given product

In [9]:
if not os.path.exists('dataset/metadata_sample.csv'):

    total_sampled_rows = 0
    nrows_og_file = 9430088
    keep_cols = ['asin', 'categories', 'title', 'description', 'related']

    def write_chunk(rows):
        # keep only the products present in the review sample and append them to the CSV
        chunk = pd.DataFrame.from_dict(rows, orient='columns')
        chunk = chunk[chunk['asin'].isin(kcore_5_compact_sample['asin'])]
        chunk = chunk[keep_cols]
        chunk.to_csv('dataset/metadata_sample.csv', mode='a', header=False, index=False, quotechar='"')
        return len(chunk)

    data = list()
    with open('dataset/metadata.json', 'r') as f:
        for line in tqdm(f, total=nrows_og_file):
            line = ast.literal_eval(line)
            # drop unused attributes early to save memory
            for key in ['imUrl', 'price', 'salesRank', 'brand']:
                line.pop(key, None)
            data.append(line)
            if len(data) == CHUNK_SIZE:
                total_sampled_rows += write_chunk(data)
                data = list()

    # write the rows left over after the last full chunk
    if data:
        total_sampled_rows += write_chunk(data)

    print(f'Rows written to the metadata sample: {total_sampled_rows}')
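The cell above parses each line with `ast.literal_eval` rather than a JSON parser, presumably because the metadata lines are written as Python dict literals (single-quoted strings), which strict JSON does not accept. A small sketch with a made-up line in that style:

```python
import ast
import json

# Illustrative line in the style of metadata.json (content is made up).
line = "{'asin': '0000589012', 'categories': [['Movies & TV', 'Movies']]}"

# Strict JSON requires double-quoted strings, so json.loads rejects the line.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates Python literals without running code.
record = ast.literal_eval(line)
```

Unlike `eval`, `ast.literal_eval` only accepts literal structures, so it is the safer choice for untrusted files.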

In [10]:
meta = pd.read_csv('dataset/metadata_sample.csv', names=['asin', 'categories', 'title', 'description', 'related'], quotechar='"')
meta.head()

| | asin | categories | title | description | related |
|---|---|---|---|---|---|
| 0 | 0000589012 | [['Movies & TV', 'Movies']] | Why Don't They Just Quit? DVD Roundtable Discu... | | {'also_bought': ['B000Z3N1HQ', '0578045427', '... |
| 1 | 000100039X | [['Books']] | The Prophet | In a distant, timeless place, a mysterious pro... | {'also_bought': ['1851686274', '0785830618', '... |
| 2 | 0002051850 | [['Books']] | For Whom the Bell Tolls | | {'also_bought': ['0684801469', '0743297334', '... |
| 3 | 0002007770 | [['Books']] | Water For Elephants | | {'also_bought': ['0399155341', '1573222453', '... |
| 4 | 0002247399 | [['Books']] | A Dance with Dragons | | {'also_bought': ['0553801503', '0553106635', '... |
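After the round trip through CSV, `categories` and `related` come back as plain strings rather than lists and dicts. A minimal sketch of how they could be converted back when a later step needs the nested values; the toy frame and the `parse_literal` helper are illustrative assumptions:

```python
import ast

import pandas as pd

meta_toy = pd.DataFrame({
    "asin": ["000100039X"],
    "categories": ["[['Books']]"],
    "related": ["{'also_bought': ['1851686274', '0785830618']}"],
})


def parse_literal(value):
    """Turn a stringified Python literal back into an object; pass NaN through."""
    return ast.literal_eval(value) if isinstance(value, str) else value


meta_toy["categories"] = meta_toy["categories"].map(parse_literal)
meta_toy["related"] = meta_toy["related"].map(parse_literal)

# Extract the 'also_bought' list, defaulting to an empty list when missing.
also_bought = meta_toy["related"].map(
    lambda r: r.get("also_bought", []) if isinstance(r, dict) else []
)
```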


We also remove entries that contain no category or title, since these text-based attributes are used in the Content Based Recommender System we will build.

In [11]:
meta = meta.dropna(axis=0, subset=('categories', 'title'))
meta.to_csv('dataset/metadata_sample.csv', header=False, index=False, quotechar='"')
meta.shape

(87849, 5)
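Dropping metadata rows means some products in the review sample may no longer have a metadata entry. A small sketch, on made-up data, of the same `dropna` call followed by a check of how many reviews still refer to a product with metadata:

```python
import numpy as np
import pandas as pd

meta_toy = pd.DataFrame({
    "asin": ["A", "B", "C"],
    "categories": ["[['Books']]", np.nan, "[['Books']]"],
    "title": ["Title A", "Title B", np.nan],
})
ratings_toy = pd.DataFrame({"asin": ["A", "A", "B", "C"]})

# Keep only products that have both a category and a title.
cleaned = meta_toy.dropna(axis=0, subset=["categories", "title"])

# Share of reviews whose product still has a metadata row.
coverage = ratings_toy["asin"].isin(cleaned["asin"]).mean()
```

On this toy data only product `A` survives, so half of the reviews remain covered.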

In [12]:
kcore_5_compact_sample.shape

(135470, 3)

We now check whether we still have a considerable number of users who rated at least 5 times and items that received at least 5 ratings.

In [13]:
user_count = kcore_5_compact_sample.groupby('reviewerID').size().to_frame('count').sort_values(by='count') 
user_count[user_count['count'] > 4]

| reviewerID | count |
|---|---|
| A01338202O0PRUBIBEPNF | 5 |
| A2NW337W0ZCZHT | 5 |
| A2NWRZHM0E9O03 | 5 |
| A2NY95RRTI3Z6W | 5 |
| A2NYK433VLN57H | 5 |
| ... | ... |
| A2Y3ZGVRA3S23L | 368 |
| A3P3UOHYBFRGJN | 447 |
| A1X2LENOF84LCQ | 568 |
| A34CSXOGVYF94S | 669 |


In [14]:
user_count

| reviewerID | count |
|---|---|
| A01338202O0PRUBIBEPNF | 5 |
| A2NW337W0ZCZHT | 5 |
| A2NWRZHM0E9O03 | 5 |
| A2NY95RRTI3Z6W | 5 |
| A2NYK433VLN57H | 5 |
| ... | ... |
| A2Y3ZGVRA3S23L | 368 |
| A3P3UOHYBFRGJN | 447 |
| A1X2LENOF84LCQ | 568 |
| A34CSXOGVYF94S | 669 |


In [15]:
item_count = kcore_5_compact_sample.groupby('asin').size().to_frame('count').sort_values(by='count') 
item_count[item_count['count'] > 5]

| asin | count |
|---|---|
| 0007447868 | 6 |
| 0060573775 | 6 |
| B0058UUR6E | 6 |
| B006HJKKCG | 6 |
| B00609B3J2 | 6 |
| ... | ... |
| 0439023483 | 36 |
| 030758836X | 37 |
| B0074BW614 | 41 |
| B0051VVOB2 | 43 |
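Note that the user filter above uses `> 4` (at least 5 reviews), while the item filter uses `> 5`, which leaves out items with exactly 5 ratings. A small sketch on made-up data showing the difference between the two thresholds, using `value_counts` as an alternative to `groupby(...).size()`:

```python
import pandas as pd

# Toy ratings: item X has 6 ratings, Y has exactly 5, Z has 2.
ratings_toy = pd.DataFrame({"asin": ["X"] * 6 + ["Y"] * 5 + ["Z"] * 2})

item_counts = ratings_toy["asin"].value_counts()

items_at_least_5 = int((item_counts >= 5).sum())   # counts X and Y
items_more_than_5 = int((item_counts > 5).sum())   # counts only X
```

For integer counts, `> 4` and `>= 5` are equivalent, so using `>= 5` for both users and items would match the "at least 5" wording.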


After completing the essential preprocessing steps on the data in this notebook, we can proceed to execute the other notebooks.