# Minimal Recommendation Engine

### Notation <sup>[1]</sup>

- $U$ is the set of users in our domain. Its size is $|U|$.
- $I$ is the set of items in our domain. Its size is $|I|$.
- $I(u)$ is the set of items that user $u$ has rated.
- $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet seen by user $u$.
- $U(i)$ is the set of users that have rated item $i$.
- $-U(i)$ is the complement of $U(i)$.
- $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.

### Goal of a recommendation system <sup>[1]</sup>

$
i^{*} = \operatorname*{arg\,max}_{i \in -I(u)} S(u,i), \quad \forall u \in U
$
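
The goal above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a hypothetical dense score matrix `S` (rows are users, columns are items) and a boolean mask `rated` that marks $I(u)$; neither name nor any of the values come from the original notebook.

```python
import numpy as np

# Hypothetical utility scores S(u, i) for 3 users and 4 items (illustrative values).
S = np.array([
    [0.9, 0.2, 0.5, 0.1],
    [0.3, 0.8, 0.7, 0.4],
    [0.6, 0.1, 0.2, 0.9],
])

# rated[u, i] is True when item i is in I(u), i.e. user u has already rated it.
rated = np.array([
    [True,  False, False, False],
    [False, True,  False, False],
    [False, False, False, True],
])

# Restrict the argmax to -I(u) by masking already-rated items with -inf.
masked = np.where(rated, -np.inf, S)
best_item = masked.argmax(axis=1)
print(best_item)
```

Masking with `-inf` keeps the computation vectorised; it assumes every user has at least one unrated item, otherwise the returned index is meaningless.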

### Problem statement <sup>[1]</sup>

The recommendation problem in its most basic form is quite simple to define:

```
|-------------------+-----+-----+-----+-----+-----|
|  user_id, asin    | a_1 | a_2 | a_3 | a_4 | a_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|
```

*Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.*
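
As a small sketch of this setup, the snippet below rebuilds three rows of the matrix above (`u_1`, `u_2` and `u_8`) from long-format records, with unobserved ratings stored as `NaN`. The frame `ratings_long` and its column names are illustrative assumptions, not part of the dataset loaded later.

```python
import pandas as pd

# Long-format ratings for users u_1, u_2 and u_8 from the table above.
ratings_long = pd.DataFrame({
    "user_id": ["u_1", "u_1", "u_2", "u_2", "u_2", "u_8", "u_8", "u_8"],
    "asin":    ["a_3", "a_5", "a_1", "a_4", "a_5", "a_1", "a_2", "a_3"],
    "rating":  [4, 1, 3, 2, 2, 3, 1, 5],
})

# Pivot to a user x item matrix; unobserved (user, item) pairs become NaN.
ratings_matrix = ratings_long.pivot(index="user_id", columns="asin", values="rating")

# Fraction of cells that still need to be estimated.
missing_fraction = ratings_matrix.isna().to_numpy().mean()
print(ratings_matrix)
print(f"missing fraction: {missing_fraction:.2f}")
```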

## Imports

In [23]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Load Data

In [24]:
df = pd.read_csv('../Data/eda_data.csv')

In [25]:
df.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,,5.0,True,A1J205ZK25TZ6W,I make the best brewed iced tea with this yell...,Best for brewed iced tea.,8.0,
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,,3.0,True,ACOICLIJQYECU,I have recently started drinking hot tea again...,Not Bad for iced Tea,9.0,


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1086548 entries, 0 to 1086547
Data columns (total 18 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   category     1086548 non-null  object 
 1   description  993137 non-null   object 
 2   title        1086548 non-null  object 
 3   also_buy     929733 non-null   object 
 4   brand        1078563 non-null  object 
 5   rank         1042485 non-null  object 
 6   also_view    578911 non-null   object 
 7   main_cat     1085268 non-null  object 
 8   price        752804 non-null   float64
 9   asin         1086548 non-null  object 
 10  details      1086497 non-null  object 
 11  overall      1086548 non-null  float64
 12  verified     1086548 non-null  bool   
 13  reviewerID   1086548 non-null  object 
 14  reviewText   1086175 non-null  object 
 15  summary      1086335 non-null  object 
 16  vote         149486 non-null   float64
 17  style        561516 non-null   object 
dtypes:

### Preprocessing: split ratings into train and test sets

In [27]:
# Full grocery data set
print(df.shape)
print(df.reviewerID.nunique())
print(df.asin.nunique())

(1086548, 18)
127496
41280


In [28]:
# Subset of the grocery dataset, for testing and speed reasons
df_subset = df.iloc[np.random.choice(df.index, size=10000, replace=False)]
print(df_subset.shape)
print(df_subset.reviewerID.nunique())
print(df_subset.asin.nunique())

(10000, 18)
9367
6641
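
The subset drawn above changes on every run because no random seed is set. A minimal sketch of a repeatable alternative, assuming `DataFrame.sample` with a fixed `random_state` is acceptable for this purpose; the toy frame and the seed value are illustrative only.

```python
import pandas as pd

# Toy stand-in for the full ratings frame (illustrative data only).
toy = pd.DataFrame({
    "reviewerID": [f"r{i % 7}" for i in range(50)],
    "asin": [f"a{i % 11}" for i in range(50)],
    "overall": [float(i % 5 + 1) for i in range(50)],
})

# Sample 10 rows without replacement; the fixed seed makes the draw repeatable.
subset_a = toy.sample(n=10, replace=False, random_state=42)
subset_b = toy.sample(n=10, replace=False, random_state=42)

print(subset_a.shape)
print(subset_a.index.equals(subset_b.index))
```

`sample` also returns a new frame rather than a slice of the original, so columns can be added to the result without triggering a copy warning.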


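### Train/test split using a custom function
Generate train and test subsets by marking 20% of each user's ratings for testing, using `groupby` and `apply`. <sup>[1]</sup> Because the sample size is rounded up, every user contributes at least one rating to the test set. In this sparse subset, where almost every user has a single rating, most rows therefore end up in the test set (9,367 of 10,000 in the run below).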

In [29]:
def assign_to_set(df_):
    # Mark ceil(20%) of this user's rows for testing, so every user gets at least one.
    sampled_ids = np.random.choice(df_.index,
                                   size=int(np.ceil(df_.index.size * 0.2)),
                                   replace=False)
    df_.loc[sampled_ids, 'for_testing'] = True
    return df_

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
df_subset = df_subset.copy()
df_subset['for_testing'] = False
grouped = df_subset.groupby('reviewerID', group_keys=False).apply(assign_to_set)
# The flags live in `grouped`; use them as a boolean mask over df_subset.
df_train_subset = df_subset[~grouped.for_testing]
df_test_subset = df_subset[grouped.for_testing]
print(df_subset.shape)
print(df_train_subset.shape)
print(df_test_subset.shape)
# Train and test rows must not overlap.
assert len(df_train_subset.index.intersection(df_test_subset.index)) == 0

(10000, 19)
(633, 19)
(9367, 19)
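
The effect of rounding up in `assign_to_set` is easiest to see on a toy frame. The sketch below applies the same `ceil(0.2 * n)` rule per user, but with a seeded generator and without `groupby.apply`; the helper `pick_test_rows` and the frame `toy` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator so the toy split is repeatable

# Toy ratings: user "a" has 10 ratings, users "b" and "c" have one each.
toy = pd.DataFrame({
    "reviewerID": ["a"] * 10 + ["b", "c"],
    "overall": [5.0] * 12,
})

def pick_test_rows(index, frac=0.2):
    """Return ceil(frac * n) row labels drawn without replacement from one user's rows."""
    n_test = int(np.ceil(len(index) * frac))
    return rng.choice(index, size=n_test, replace=False)

# Collect the sampled labels per user, then mark them in one step.
test_ids = np.concatenate([
    pick_test_rows(rows.index.to_numpy())
    for _, rows in toy.groupby("reviewerID")
])
toy["for_testing"] = toy.index.isin(test_ids)

print(toy.groupby("reviewerID")["for_testing"].sum())
```

User `a` contributes 2 of 10 ratings, while `b` and `c` each contribute their only rating, so single-rating users never appear in the training set. Filtering out users below a minimum number of ratings before splitting is one possible way to keep the split closer to 80/20.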


In [30]:
df_test_subset.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style,for_testing
370983,"['Grocery & Gourmet Food', 'Cooking & Baking',...","['For thousands of years, sourdough starter ha...",Breadtopia Sourdough Starter (Live),"['B00471MP42', 'B01GM4UZJI', '1607740079', 'B0...",Breadtopia,"1,750 in Grocery & Gourmet Food (",,Grocery,8.95,B002C0E5VG,"{'Shipping Weight:': '4 ounces (', 'ASIN: ': '...",5.0,True,A3TXRT03F81NGO,It started right up and now I have tons of sta...,Great starter.,,,False
631706,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...","[""At high temperatures sugar carmalizes and tr...",Starbucks Natural Fusions Caramel Ground Coffe...,"['B00UW71SQQ', 'B01MDPYFBC', 'B071CL2TXK', 'B0...",Starbucks,"261,295 in Grocery & Gourmet Food (","['B00UW71SQQ', 'B00ILOYD96', 'B01LTI95ZM', 'B0...",Grocery,31.99,B0090X8JUG,"{'Shipping Weight:': '4.8 pounds (', 'Domestic...",1.0,True,A3IILODDI2ZDFG,DISGUSTING!!!!!!!! This is NOT the same item ...,This is NOT the same thing you buy in the stor...,,{'Size:': ' 11 Ounce (Pack of 2)'},False


In [31]:
df_train_subset.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style,for_testing
933722,"['Grocery & Gourmet Food', 'Snack Foods', 'Cra...","['Shortly after the American Revolution, at th...",New England Original Westminster Bakeries Oyst...,"['B000J2DQ46', 'B00EQD7F82', 'B0096POWJ0', 'B0...",Westminster Bakers,"18,324 in Grocery & Gourmet Food (",,Grocery,5.51,B00Q3DQI0E,{'\n Product Dimensions: \n ': '8.5 x 2 ...,5.0,True,AOLRD8F5KRBOB,Oyster crackers were a terrific surprise for r...,Oyster crackers were a terrific surprise for r...,,,False
293456,"['Grocery & Gourmet Food', 'Herbs, Spices & Se...",['Flavorful Tip 1/8 tsp. McCormickGranulated G...,"McCormick Garlic, Granulated, 26-Ounce Units (...","['B008OGCPSM', 'B008OU2XQW', 'B00FOLNUZC', 'B0...",McCormick,"198,987 in Grocery & Gourmet Food (","['B074ZK8MQS', 'B001PQOATU', 'B008OU2XQW', 'B0...",Grocery,,B001EQ56NA,"{'Shipping Weight:': '3.6 pounds', 'Domestic S...",5.0,True,A2696CFF1SBXH0,Good quality and flavor.,Five Stars,,{'Size:': ' 26 oz'},False


## Save Data

In [32]:
# df_train_subset.to_csv('../Data/training_data.csv', index=False)

In [33]:
# df_test_subset.to_csv('../Data/testing_data.csv', index=False)

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  