# Minimal Recommendation Engine

### Notation <sup>[1]</sup>

- $U$ is the set of users in our domain. Its size is $|U|$.
- $I$ is the set of items in our domain. Its size is $|I|$.
- $I(u)$ is the set of items that user $u$ has rated.
- $-I(u)$ is the complement of $I(u)$, i.e., the set of items that user $u$ has not yet rated.
- $U(i)$ is the set of users that have rated item $i$.
- $-U(i)$ is the complement of $U(i)$, i.e., the set of users that have not yet rated item $i$.
- $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.
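
To make the notation concrete, here is a minimal sketch in plain Python. The `ratings` dictionary and the helper names (`rated_items`, `unseen_items`, `raters`) are made up for illustration and are not part of the notebook's pipeline.

```python
# Hypothetical toy data: {user: {item: rating}}.
ratings = {
    "u_1": {"a_3": 4, "a_5": 1},
    "u_2": {"a_1": 3, "a_4": 2, "a_5": 2},
    "u_5": {},  # a user with no ratings yet
}

items = {"a_1", "a_2", "a_3", "a_4", "a_5"}  # I
users = set(ratings)                          # U


def rated_items(u):
    """I(u): items that user u has rated."""
    return set(ratings[u])


def unseen_items(u):
    """-I(u): complement of I(u) within I."""
    return items - rated_items(u)


def raters(i):
    """U(i): users that have rated item i."""
    return {u for u, user_ratings in ratings.items() if i in user_ratings}


print(sorted(unseen_items("u_1")))
print(sorted(raters("a_5")))
```

A user with no ratings, such as `u_5` here, has an empty $I(u)$, so every item is a candidate for recommendation.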

### Goal of a recommendation system <sup>[1]</sup>

$
i^{*} = \underset{i \in -I(u)}{\operatorname{argmax}} \; S(u,i), \quad \forall u \in U
$
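
The objective above can be sketched as a search over the items a user has not rated. The function `recommend`, the lookup table `toy_scores`, and the scoring function `toy_score` are hypothetical names used only for this illustration; a real $S(u,i)$ would come from a trained model.

```python
def recommend(user, all_items, seen, score):
    """Return the unseen item with the highest score(user, item), or None."""
    candidates = all_items - seen  # -I(u)
    if not candidates:
        return None
    return max(candidates, key=lambda item: score(user, item))


# Hypothetical utility values standing in for S(u, i).
toy_scores = {("u_1", "a_1"): 3.2, ("u_1", "a_2"): 1.5, ("u_1", "a_4"): 4.1}


def toy_score(user, item):
    """Look up a made-up utility; unknown pairs default to 0.0."""
    return toy_scores.get((user, item), 0.0)


all_items = {"a_1", "a_2", "a_3", "a_4", "a_5"}
seen_by_u1 = {"a_3", "a_5"}  # I(u_1)

best = recommend("u_1", all_items, seen_by_u1, toy_score)
print(best)
```

In practice a system usually returns the top $k$ candidates rather than a single item, but the single-item form matches the equation as written.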

### Problem statement <sup>[1]</sup>

The recommendation problem in its most basic form is quite simple to define:

```
|-------------------+-----+-----+-----+-----+-----|
|  user_id, asin    | a_1 | a_2 | a_3 | a_4 | a_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|
```

*Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.*
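
As a rough illustration of the problem, the sketch below builds a small ratings matrix from long-format rows and fills the gaps with each item's mean rating. The item-mean fill is a naive baseline chosen here only for illustration, not a method used in this notebook, and the toy rows in `long_df` are invented; only the column names (`reviewerID`, `asin`, `overall`) come from the dataset.

```python
import pandas as pd

# Hypothetical long-format ratings, one row per (user, item) pair.
long_df = pd.DataFrame({
    "reviewerID": ["u_1", "u_1", "u_2", "u_2", "u_2", "u_3"],
    "asin":       ["a_3", "a_5", "a_1", "a_4", "a_5", "a_1"],
    "overall":    [4, 1, 3, 2, 2, 3],
})

# Users as rows, items as columns; unrated cells become NaN.
matrix = long_df.pivot_table(
    index="reviewerID", columns="asin", values="overall", aggfunc="mean"
)

# Naive baseline: estimate each missing cell with the item's mean rating.
item_means = matrix.mean(axis=0)
filled = matrix.fillna(item_means)

print(matrix)
print(filled)
```

Items that nobody has rated (such as `a_2` in the table above) never appear as columns after the pivot, which is one reason this baseline cannot say anything about brand-new items.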

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Load Data

In [2]:
df = pd.read_csv('../Data/eda_data.csv')

In [3]:
df.head(2)

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,5.0,True,A1J205ZK25TZ6W,8.0,
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,3.0,True,ACOICLIJQYECU,9.0,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083170 entries, 0 to 1083169
Data columns (total 14 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   category    1083170 non-null  object 
 1   title       1083170 non-null  object 
 2   also_buy    926546 non-null   object 
 3   brand       1075197 non-null  object 
 4   rank        1039163 non-null  object 
 5   also_view   577060 non-null   object 
 6   main_cat    1081896 non-null  object 
 7   price       750231 non-null   float64
 8   asin        1083170 non-null  object 
 9   overall     1083170 non-null  float64
 10  verified    1083170 non-null  bool   
 11  reviewerID  1083170 non-null  object 
 12  vote        149247 non-null   float64
 13  style       559212 non-null   object 
dtypes: bool(1), float64(3), object(10)
memory usage: 108.5+ MB


### Preprocessing

In [5]:
# Full grocery data set
print(df.shape)
print(df.reviewerID.nunique())
print(df.asin.nunique())

(1083170, 14)
127496
41280
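
The counts above give a sense of how sparse the ratings matrix is. The arithmetic below is a sketch that assumes every row is a distinct (user, item) pair, which the notebook does not verify; duplicate pairs would make the true density slightly lower.

```python
# Counts printed by the cell above.
n_ratings = 1_083_170
n_users = 127_496
n_items = 41_280

# Fraction of the |U| x |I| matrix that is filled in.
density = n_ratings / (n_users * n_items)
print(f"density: {density:.6%}")
```

Under that assumption roughly 0.02% of the cells are observed, so nearly all of the matrix has to be estimated.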


In [6]:
# Subset of the grocery dataset, used for testing and for speed reasons
df_subset = df.iloc[np.random.choice(df.index, size=10000, replace=False)]
print(df_subset.shape)
print(df_subset.reviewerID.nunique())
print(df_subset.asin.nunique())

(10000, 14)
9382
6686
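
The sampling cell above is not seeded, so rerunning it produces a different subset and different counts. One possible way to make the draw repeatable is `DataFrame.sample` with a fixed `random_state`. The sketch below uses an invented toy frame, and the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe.
toy = pd.DataFrame({
    "reviewerID": [f"u_{k}" for k in range(100)],
    "overall": [k % 5 + 1 for k in range(100)],
})

# Sampling without replacement; the same seed yields the same rows.
first_draw = toy.sample(n=10, random_state=42)
second_draw = toy.sample(n=10, random_state=42)

print(first_draw.equals(second_draw))
```

Note also that the subset has 9,382 distinct reviewers across 10,000 rows, so most users contribute a single rating, which leaves very little per-user signal.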


### Train test split
Using an 80/20 split.

In [7]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=26)
print(df.shape)
print(df_train.shape)
print(df_test.shape)
assert len(df_train.index.intersection(df_test.index)) == 0

(1083170, 14)
(866536, 14)
(216634, 14)
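
Because the split is done row by row, some users or items in the test set may never appear in the training set (the cold-start case). The sketch below shows one way to measure that; the `train` and `test` frames are invented stand-ins, not the notebook's `df_train` and `df_test`.

```python
import pandas as pd

# Hypothetical train/test frames with the dataset's column names.
train = pd.DataFrame({
    "reviewerID": ["u_1", "u_1", "u_2", "u_3"],
    "asin":       ["a_1", "a_2", "a_1", "a_3"],
})
test = pd.DataFrame({
    "reviewerID": ["u_1", "u_4"],
    "asin":       ["a_3", "a_9"],
})

# True where the test row's user (or item) was never seen in training.
cold_user = ~test["reviewerID"].isin(train["reviewerID"])
cold_item = ~test["asin"].isin(train["asin"])

print(f"cold-start users: {cold_user.mean():.0%}")
print(f"cold-start items: {cold_item.mean():.0%}")
```

Running the same check on the real split would show how many test rows a purely collaborative model cannot score without a fallback.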


In [8]:
df_train.head(2)

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style
573045,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",Nescafe Gold Blend Instant Coffee Refill Packe...,"['B0046DM09G', 'B0085H9YHK', 'B01B50WLWS', 'B0...",Nescaf,"34,306 in Grocery & Gourmet Food (",,Grocery,19.38,B006R9F6TI,5.0,True,A9LGCFXYIJ4GO,,
66933,"['Grocery & Gourmet Food', 'Cooking & Baking',...","SweetLeaf Sweet Drops Liquid Stevia Sweetener,...","['B000ELQNRE', 'B00H4HKNA4', 'B00282UD0K', 'B0...",SweetLeaf,,"['B00GRY33AC', 'B002LMBIVA', 'B000ELQNRE', 'B0...",Grocery,9.49,B000E8WIAS,5.0,True,AOIXKRX33QSEV,,{'Size:': ' 2 Ounce'}


In [9]:
df_test.head(2)

Unnamed: 0,category,title,also_buy,brand,rank,also_view,main_cat,price,asin,overall,verified,reviewerID,vote,style
1006152,"['Grocery & Gourmet Food', 'Candy & Chocolate']",SKITTLES &amp; STARBURST Full Size Candy Varie...,"['B01KH9SUBY', 'B000WL39JQ', 'B003N0R5BG', 'B0...",Wrigley's,587 in Grocery & Gourmet Food (,,Grocery,15.09,B011B45FZS,1.0,True,A342ULF8S62BG,,{'Size:': ' 30 Full Size Pieces'}
1036443,"['Grocery & Gourmet Food', 'Cooking & Baking',...",FRESH 2018/19 Harvest PJ KABOS 16.9Floz Greek ...,"['B01MTR4NV9', 'B00CMGRNAK', 'B01NCJBLG9', 'B0...",PJ KABOS,"16,057 in Grocery & Gourmet Food (",,Grocery,22.95,B018KRPLH6,3.0,True,A33DGWR9R1LMSN,,{'Size:': ' Delicate-Medium (16.9Floz Tin)'}


In [10]:
# Copying various dataframes to prevent SettingWithCopyWarning.
df_subset = df_subset.copy()
df_subset['for_testing'] = False
df_train_subset, df_test_subset = train_test_split(df_subset, test_size=0.2, random_state=123)
df_test_subset = df_test_subset.copy()
df_test_subset['for_testing'] = True
print(df_subset.shape)
print(df_train_subset.shape)
print(df_test_subset.shape)
assert len(df_train_subset.index.intersection(df_test_subset.index)) == 0

(10000, 15)
(8000, 15)
(2000, 15)


In [11]:
df_train_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8000 entries, 611135 to 146834
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   category     8000 non-null   object 
 1   title        8000 non-null   object 
 2   also_buy     6857 non-null   object 
 3   brand        7958 non-null   object 
 4   rank         7680 non-null   object 
 5   also_view    4278 non-null   object 
 6   main_cat     7991 non-null   object 
 7   price        5537 non-null   float64
 8   asin         8000 non-null   object 
 9   overall      8000 non-null   float64
 10  verified     8000 non-null   bool   
 11  reviewerID   8000 non-null   object 
 12  vote         1069 non-null   float64
 13  style        4120 non-null   object 
 14  for_testing  8000 non-null   bool   
dtypes: bool(2), float64(3), object(10)
memory usage: 890.6+ KB


In [12]:
df_test_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 278403 to 60643
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   category     2000 non-null   object 
 1   title        2000 non-null   object 
 2   also_buy     1718 non-null   object 
 3   brand        1980 non-null   object 
 4   rank         1932 non-null   object 
 5   also_view    1064 non-null   object 
 6   main_cat     1996 non-null   object 
 7   price        1381 non-null   float64
 8   asin         2000 non-null   object 
 9   overall      2000 non-null   float64
 10  verified     2000 non-null   bool   
 11  reviewerID   2000 non-null   object 
 12  vote         292 non-null    float64
 13  style        1058 non-null   object 
 14  for_testing  2000 non-null   bool   
dtypes: bool(2), float64(3), object(10)
memory usage: 222.7+ KB


## Save Data

In [13]:
df_train.to_csv('../Data/training_data.csv', index=False)
df_test.to_csv('../Data/testing_data.csv', index=False)
df_train_subset.to_csv('../Data/training_data_subset.csv', index=False)
df_test_subset.to_csv('../Data/testing_data_subset.csv', index=False)
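
Since the files are written with `index=False`, the original row labels are not stored, and a reloaded frame gets a fresh 0-based index. The round trip below is a small sketch on an invented frame written to a temporary directory; the file name `toy.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with non-contiguous row labels, like a split produces.
frame = pd.DataFrame(
    {"asin": ["B1", "B2"], "overall": [5.0, 3.0], "for_testing": [False, True]},
    index=[573045, 66933],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    frame.to_csv(path, index=False)  # row labels are dropped
    restored = pd.read_csv(path)

print(restored.index.tolist())
print(restored.dtypes)
```

If the original row labels are needed later, for example to trace a test row back to the full dataset, they would have to be saved as an explicit column.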

## Summary
- Performed an 80/20 train/test split on the dataset.
- Also created a random 10,000-row subset of the original dataset and performed an 80/20 split on it.
- The subset will be used for the rest of this project, both for speed and as a proof of concept.
- On most projects, I would do this step first to get feedback as fast as possible, then run the most promising models against the entire dataset.
- Note: On the full dataset, a couple of the models initially tested took 2-6 hours to return results.

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  