# Fast AI Text: Product Recommendation
Fast AI is a deep learning library built on top of PyTorch, designed to make it easier and faster to develop and train deep learning models.

In this section I used FastAI to implement a product recommendation model built on top of a text dataset of clothing reviews from amazon.com. The 'X' features are a json dataset of reviews, star ratings, etc. of users for certain products, and the 'y' target is the product id number.

## Imports

In [1]:
# loads the libraries used in this notebook
import pandas as pd
import gzip
import json
from fastai.tabular.all import *
from fastai.text.all import *
import ast

## The Data

In [2]:
# explore the data and select features
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

In [3]:
fashion_reviews = getDF('../../data/text_data/Amazon Fashion Review Data.json.gz')

In [4]:
fashion_reviews.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5.0,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
1,5.0,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Black (37467610) / Red/White'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
2,5.0,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Gray Logo'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
3,5.0,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue (37867638-99) / Yellow'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
4,5.0,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Pink'}",Tonya B.,Great product and price!,Five Stars,1441324800,,


Since the purposes of our recommender is to recommend based on product review text to a product, it should be the product ASIN that is being predicted, and the input should be the review text.

## Pre-processing

In [5]:
fashion_reviews.columns

Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image'],
      dtype='object')

Looking at the columns, [reviewTime, unixReviewTime, vote, verified] features are numerical & may have signals useful for prediction a products likeability. The [style, reviewerID, reviewText, summary, overall] features are categorical and could also be useful. Finally, the ASIN is our 'y' target variable. The reviewerName and image features could be dropped as they may offer less useful signals for our purporse and given choice of a text recommendation model. 

In [6]:
fashion_reviews.shape

(3176, 12)

In [7]:
fashion_reviews.drop(['image', 'reviewerName'], axis=1, inplace=True)

### Numerical Features

#### Verified
It may be a good idea to drop unverified reviews as they might be spam and there is only a small number of unverified comments.

In [8]:
fashion_reviews.verified.value_counts()

True     3079
False      97
Name: verified, dtype: int64

In [9]:
fashion_reviews = fashion_reviews.where(fashion_reviews.verified == True).dropna(how='all')
fashion_reviews.shape

(3079, 10)

In [10]:
fashion_reviews.drop('verified', axis=1, inplace=True) #can be dropped as column is all true

#### Date

In [11]:
# date has to be converted into seperate date features
date_feature = fashion_reviews.copy()
date_feature = add_datepart(date_feature, 'reviewTime')

In [12]:
datelike_features = list(date_feature.describe().columns)
date_features = date_feature.loc[:, datelike_features]
date_features.drop(['overall'], axis=1,inplace=True)

In [13]:
date_features.head()

Unnamed: 0,unixReviewTime,reviewTimeYear,reviewTimeMonth,reviewTimeWeek,reviewTimeDay,reviewTimeDayofweek,reviewTimeDayofyear,reviewTimeElapsed
0,1441325000.0,2015,9,36,4,4,247,1441325000.0
1,1441325000.0,2015,9,36,4,4,247,1441325000.0
2,1441325000.0,2015,9,36,4,4,247,1441325000.0
3,1441325000.0,2015,9,36,4,4,247,1441325000.0
4,1441325000.0,2015,9,36,4,4,247,1441325000.0


### Categorical Features

#### Style

In [14]:
# converting style into features
product_style = fashion_reviews.loc[:, ['style']]

In [15]:
product_style

Unnamed: 0,style
0,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}"
1,"{'Size:': ' Big Boys', 'Color:': ' Black (37467610) / Red/White'}"
2,"{'Size:': ' Big Boys', 'Color:': ' Blue/Gray Logo'}"
3,"{'Size:': ' Big Boys', 'Color:': ' Blue (37867638-99) / Yellow'}"
4,"{'Size:': ' Big Boys', 'Color:': ' Blue/Pink'}"
...,...
3171,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Green Glow/Seaweed - Hasta - White'}"
3172,"{'Size:': ' 5 B(M) US', 'Color:': ' Wolf Grey/Black-pink Blast/White'}"
3173,"{'Size:': ' 8 B(M) US', 'Color:': ' Blue Tint/Green Glow/Hasta/White'}"
3174,"{'Size:': ' 9 B(M) US', 'Color:': ' Blue Tint/Green Glow/Hasta/White'}"


In [16]:
product_style['style'].isnull().sum()

61

In [17]:
# style has to be converted from json to a feature of columns
# made with help of ai
def unpack_style_features(df, style_column):
    def process_styles(styles):
        if isinstance(styles, dict):
            return styles
        else:
            return {}
    
    # Get unique style keys
    unique_keys = set()
    for styles in df[style_column]:
        styles = process_styles(styles)
        unique_keys.update(styles.keys())
    
    # Create a DataFrame with missing category for each unique style key
    style_df = pd.DataFrame(columns=list(unique_keys))
    
    # Fill the DataFrame with the style values
    for index, styles in enumerate(df[style_column]):
        styles = process_styles(styles)
        for key, value in styles.items():
            if key == 'Color:':
                colors = value.split('/')
                for i, color in enumerate(colors, start=1):
                    style_df.loc[index, f'Color{i}'] = color
            else:
                style_df.loc[index, key] = value
    
    # Fill missing values with 'missing'
    style_df = style_df.fillna('missing')
    
    # Concatenate the style DataFrame with the original DataFrame
    df = pd.concat([df, style_df], axis=1)
    
    return df



In [18]:
style_features = unpack_style_features(product_style, 'style')
style_features = style_features.dropna(how='all')
style_features['Size'] = style_features['Size:'].copy()
style_features = style_features.drop(['Size:','style','Style:', 'Size Name:', 'Color:'], axis=1)

In [19]:
style_features

Unnamed: 0,Color1,Color2,Color3,Color4,Size
0,Blue,Orange,missing,missing,Big Boys
1,Black (37467610),Red,White,missing,Big Boys
2,Blue,Gray Logo,missing,missing,Big Boys
3,Blue (37867638-99),Yellow,missing,missing,Big Boys
4,Blue,Pink,missing,missing,Big Boys
...,...,...,...,...,...
2943,Pink Blast,Stealth,Hyper Pink,White,10 M US
2979,Black,White,Anthracite,Stealth,9.5 M US
3035,Black,White,Anthracite,Stealth,9 B(M) US
3049,Blue Tint,Green Glow,Hasta,White,6 B(M) US


In [20]:
style_features.value_counts()

Color1                Color2           Color3      Color4   Size        
 Black                White            Anthracite  Stealth   9 B(M) US      175
                                                             8.5 B(M) US    152
                                                             9.5 B(M) US    152
                                                             8 B(M) US      140
                                                             7.5 B(M) US     96
                                                                           ... 
 Blue                 Orange           missing     missing   Little Boys      1
                                                             Big Boys         1
 White                Metallic Silver  Black       missing   8.5 M US         1
 Blue                 Gray Logo        missing     missing   Big Boys         1
 Blue (37867638-99)    Yellow          missing     missing   Little Boys      1
Length: 279, dtype: int64

#### reviewerID, reviewText, summary, overall

In [21]:
fashion_reviews = fashion_reviews.drop(['style', 'reviewTime'], axis = 1)

Althought 'vote' is numerical, it doesn't have much data. Only about 10% of the dataset has a 'vote' value. This may be useful though as it may mean that the other votes are lower ranked and are not as informative. I debate dropping this one due to high missing values, but may just fill in using imputation later on.

In [22]:
fashion_reviews.vote.isnull().sum()

2814

In [23]:
fashion_reviews.vote.describe()

count     265
unique     15
top         2
freq       84
Name: vote, dtype: object

In [24]:
fashion_reviews.overall.value_counts() #overall is the ordinal star value, where 5 is the best

5.0    2077
4.0     463
3.0     329
1.0     117
2.0      93
Name: overall, dtype: int64

### Combining Features

In [25]:
combined_preprocessed = pd.concat([fashion_reviews, style_features, date_features], axis=1)

In [26]:
combined_preprocessed = combined_preprocessed.dropna(subset=['reviewTimeYear'])
combined_preprocessed = combined_preprocessed.dropna(subset=['asin'])

In [27]:
combined_preprocessed.reset_index(drop=True, inplace=True)

In [28]:
combined_preprocessed.shape

(3079, 20)

In [29]:
combined_preprocessed.head()

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,unixReviewTime,vote,Color1,Color2,Color3,Color4,Size,unixReviewTime.1,reviewTimeYear,reviewTimeMonth,reviewTimeWeek,reviewTimeDay,reviewTimeDayofweek,reviewTimeDayofyear,reviewTimeElapsed
0,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,,Blue,Orange,missing,missing,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0
1,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,,Black (37467610),Red,White,missing,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0
2,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,,Blue,Gray Logo,missing,missing,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0
3,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,,Blue (37867638-99),Yellow,missing,missing,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0
4,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,,Blue,Pink,missing,missing,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0


### Processing with FastAI

In [30]:
combined_preprocessed.reviewTimeYear.value_counts(normalize=True) # training on years <= 2017

reviewTimeYear
2017.0    0.525495
2016.0    0.272491
2018.0    0.180578
2015.0    0.013966
2014.0    0.003248
2010.0    0.001624
2012.0    0.001299
2009.0    0.001299
Name: proportion, dtype: float64

In [31]:
year_cond = (combined_preprocessed.reviewTimeYear<2018)
train_idx = np.where( year_cond)[0]
valid_idx = np.where(~year_cond)[0]
splits = (list(train_idx),list(valid_idx))

In [32]:
len(splits[0]) + len(splits[1]) == len(combined_preprocessed)

True

In [33]:
combined_preprocessed.columns

Index(['overall', 'reviewerID', 'asin', 'reviewText', 'summary',
       'unixReviewTime', 'vote', 'Color1', 'Color2', 'Color3', 'Color4',
       'Size', 'unixReviewTime', 'reviewTimeYear', 'reviewTimeMonth',
       'reviewTimeWeek', 'reviewTimeDay', 'reviewTimeDayofweek',
       'reviewTimeDayofyear', 'reviewTimeElapsed'],
      dtype='object')

In [34]:
combined_preprocessed.shape

(3079, 20)

In [35]:
continuous_vars = list(date_features.columns)+['vote','overall']
continuous_vars

['unixReviewTime',
 'reviewTimeYear',
 'reviewTimeMonth',
 'reviewTimeWeek',
 'reviewTimeDay',
 'reviewTimeDayofweek',
 'reviewTimeDayofyear',
 'reviewTimeElapsed',
 'vote',
 'overall']

In [36]:
categorical_vars = list(set(combined_preprocessed.columns)-set(continuous_vars)-set(['reviewText', 'summary']))
categorical_vars

['asin', 'Color1', 'reviewerID', 'Size', 'Color2', 'Color3', 'Color4']

In [37]:
text_vars = ['reviewText', 'summary']

In [38]:
# filling na with 'no color' for color1-4
combined_preprocessed['Color1'] = combined_preprocessed['Color1'].fillna('no color')
combined_preprocessed['Color2'] = combined_preprocessed['Color2'].fillna('no color')
combined_preprocessed['Color3'] = combined_preprocessed['Color3'].fillna('no color')
combined_preprocessed['Color4'] = combined_preprocessed['Color4'].fillna('no color')
# fill na with 'no size' for Size
combined_preprocessed['Size'] = combined_preprocessed['Size'].fillna('no size')

In [39]:
# filling na with 0 for vote 
combined_preprocessed['vote'] = combined_preprocessed['vote'].fillna(0)

In [40]:
combined_preprocessed['reviewText'].fillna('', inplace=True)

In [41]:
combined_preprocessed.isna().sum()

overall                0
reviewerID             0
asin                   0
reviewText             0
summary                0
unixReviewTime         0
vote                   0
Color1                 0
Color2                 0
Color3                 0
Color4                 0
Size                   0
unixReviewTime         0
reviewTimeYear         0
reviewTimeMonth        0
reviewTimeWeek         0
reviewTimeDay          0
reviewTimeDayofweek    0
reviewTimeDayofyear    0
reviewTimeElapsed      0
dtype: int64

## Training a Recommender Model

### Creating the FastAI Dataloaders

In [42]:
combined_preprocessed.to_csv('../../data/text_data/combined_preprocessed.csv', index=False)

In [43]:
textreviewdl = TextDataLoaders.from_df(combined_preprocessed, text_col='reviewText', is_lm=False)

In [44]:
textsummarydl = TextDataLoaders.from_df(combined_preprocessed, text_col='summary', is_lm=False)

In [45]:
tabulardls = TabularDataLoaders.from_csv('./combined_preprocessed.csv', y_names="asin",
    cat_names = categorical_vars,
    cont_names = continuous_vars,
    procs = [Categorify, FillMissing, Normalize], splits=splits)

In [46]:
# Create a learner for the tabular model
tabular_learn = tabular_learner(tabulardls, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

In [47]:
text_learn1 = text_classifier_learner(textreviewdl, arch=AWD_LSTM)

In [48]:
text_learn2 = text_classifier_learner(textsummarydl, arch=AWD_LSTM)

## Evaluating the Models

In [49]:
predicted_user1 = text_learn1.predict('I really like the color of my shoes')[0]
predicted_user2 = text_learn2.predict('Hat')[0]

In [50]:
predicted_product_rows1 = combined_preprocessed[combined_preprocessed['reviewerID'] == predicted_user1]
predicted_product_rows2 = combined_preprocessed[combined_preprocessed['reviewerID'] == predicted_user2]


In [51]:
predicted_product_rows1.head()

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,unixReviewTime,vote,Color1,Color2,Color3,Color4,Size,unixReviewTime.1,reviewTimeYear,reviewTimeMonth,reviewTimeWeek,reviewTimeDay,reviewTimeDayofweek,reviewTimeDayofyear,reviewTimeElapsed
466,5.0,AIM9MWMG87AWG,B001IKJOLW,Size 5. Very comfortable shoes. Love!,Very comfortable shoes. Love,1469578000.0,0,Black,White,Anthracite,Stealth,8 B(M) US,1469578000.0,2016.0,7.0,30.0,27.0,2.0,209.0,1469578000.0
817,5.0,AIM9MWMG87AWG,B0058YEJ5K,Size 5. Very comfortable shoes. Love!,Very comfortable shoes. Love,1469578000.0,0,Ocean Fog,Blue Grey,Mango,missing,8 B(M) US,1469578000.0,2016.0,7.0,30.0,27.0,2.0,209.0,1469578000.0
1160,5.0,AIM9MWMG87AWG,B0014F7B98,Size 5. Very comfortable shoes. Love!,Very comfortable shoes. Love,1469578000.0,0,Black,White,Anthracite,Stealth,9.5 B(M) US,1469578000.0,2016.0,7.0,30.0,27.0,2.0,209.0,1469578000.0
1516,5.0,AIM9MWMG87AWG,B009MA34NY,Size 5. Very comfortable shoes. Love!,Very comfortable shoes. Love,1469578000.0,0,Cool Grey,Team Orange,White,Platinum,13 D(M) US,1469578000.0,2016.0,7.0,30.0,27.0,2.0,209.0,1469578000.0
1873,5.0,AIM9MWMG87AWG,B0092UF54A,Size 5. Very comfortable shoes. Love!,Very comfortable shoes. Love,1469578000.0,0,Black,White,Anthracite,Stealth,7 B(M) US,1469578000.0,2016.0,7.0,30.0,27.0,2.0,209.0,1469578000.0


In [52]:
predicted_product_rows2.head()

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,unixReviewTime,vote,Color1,Color2,Color3,Color4,Size,unixReviewTime.1,reviewTimeYear,reviewTimeMonth,reviewTimeWeek,reviewTimeDay,reviewTimeDayofweek,reviewTimeDayofyear,reviewTimeElapsed
464,5.0,A36XF6818PQ4DJ,B001IKJOLW,These sneakers give me the motivation to workout because they feel so good and are great for any training,... me the motivation to workout because they feel so good and are great for any training,1470614000.0,0,Black,White,Anthracite,Stealth,6.5 B(M) US,1470614000.0,2016.0,8.0,32.0,8.0,0.0,221.0,1470614000.0
815,5.0,A36XF6818PQ4DJ,B0058YEJ5K,These sneakers give me the motivation to workout because they feel so good and are great for any training,... me the motivation to workout because they feel so good and are great for any training,1470614000.0,0,Black,White,Anthracite,Stealth,6.5 B(M) US,1470614000.0,2016.0,8.0,32.0,8.0,0.0,221.0,1470614000.0
1158,5.0,A36XF6818PQ4DJ,B0014F7B98,These sneakers give me the motivation to workout because they feel so good and are great for any training,... me the motivation to workout because they feel so good and are great for any training,1470614000.0,0,Black,White,Anthracite,Stealth,7 B(M) US,1470614000.0,2016.0,8.0,32.0,8.0,0.0,221.0,1470614000.0
1514,5.0,A36XF6818PQ4DJ,B009MA34NY,These sneakers give me the motivation to workout because they feel so good and are great for any training,... me the motivation to workout because they feel so good and are great for any training,1470614000.0,0,Black,White,Anthracite,Stealth,8.5 B(M) US,1470614000.0,2016.0,8.0,32.0,8.0,0.0,221.0,1470614000.0
1871,5.0,A36XF6818PQ4DJ,B0092UF54A,These sneakers give me the motivation to workout because they feel so good and are great for any training,... me the motivation to workout because they feel so good and are great for any training,1470614000.0,0,Pure Platinum,Blue Glow,Wolf Grey,missing,9.5 B(M) US,1470614000.0,2016.0,8.0,32.0,8.0,0.0,221.0,1470614000.0
