# Minimal Recommendation Engine

### Notation <sup>[1]</sup>

- $U$ is the set of users in our domain. Its size is $|U|$.
- $I$ is the set of items in our domain. Its size is $|I|$.
- $I(u)$ is the set of items that user $u$ has rated.
- $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet rated by user $u$.
- $U(i)$ is the set of users that have rated item $i$.
- $-U(i)$ is the complement of $U(i)$, i.e., the set of users that have not rated item $i$.
- $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.
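
The notation above maps directly onto Python sets. The following is a minimal sketch, assuming ratings arrive as `(user, item, rating)` triples; the toy values and the helper names `unseen_items` and `non_raters` are illustrative and not part of the original notebook.

```python
from collections import defaultdict

# Toy ratings as (user, item, rating) triples; the values are illustrative only.
ratings = [
    ("u_1", "a_3", 4), ("u_1", "a_5", 1),
    ("u_2", "a_1", 3), ("u_2", "a_4", 2), ("u_2", "a_5", 2),
    ("u_3", "a_1", 3),
]

all_users = {"u_1", "u_2", "u_3"}                 # U
all_items = {"a_1", "a_2", "a_3", "a_4", "a_5"}   # I

items_rated_by = defaultdict(set)    # I(u)
users_who_rated = defaultdict(set)   # U(i)
for user, item, _ in ratings:
    items_rated_by[user].add(item)
    users_who_rated[item].add(user)


def unseen_items(user):
    """-I(u): items the user has not rated (hypothetical helper)."""
    return all_items - items_rated_by[user]


def non_raters(item):
    """-U(i): users who have not rated the item (hypothetical helper)."""
    return all_users - users_who_rated[item]


print(len(all_users), len(all_items))   # |U| and |I|: 3 5
print(sorted(unseen_items("u_1")))      # ['a_1', 'a_2', 'a_4']
print(sorted(non_raters("a_1")))        # ['u_1']
```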

### Goal of a recommendation system <sup>[1]</sup>

$
i^{*} = \underset{i \in -I(u)}{\operatorname{arg\,max}} \; S(u,i), \quad \forall u \in U
$
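
In code, this goal amounts to taking a maximum over the items a user has not rated. The sketch below assumes a utility function is already available; `recommend`, `popularity`, and `popularity_score` are hypothetical names, and the scores are made up for illustration rather than taken from the dataset.

```python
def recommend(user, all_items, rated_items, score):
    """Return the unrated item with the highest score, or None if none is left."""
    candidates = all_items - rated_items   # -I(u)
    if not candidates:
        return None
    # Sorting first makes tie-breaking deterministic (first item alphabetically wins).
    return max(sorted(candidates), key=lambda item: score(user, item))


# Hypothetical utility S(u, i): a fixed per-item score that ignores the user.
popularity = {"a_1": 2.75, "a_2": 1.0, "a_3": 3.25, "a_4": 1.5, "a_5": 1.5}


def popularity_score(user, item):
    return popularity[item]


best = recommend("u_1", set(popularity), {"a_3", "a_5"}, popularity_score)
print(best)  # a_1
```

A real engine would replace `popularity_score` with a personalised estimate of $S(u,i)$; the selection step itself stays the same.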

### Problem statement <sup>[1]</sup>

The recommendation problem in its most basic form is quite simple to define:

```
|-------------------+-----+-----+-----+-----+-----|
|  user_id, asin    | a_1 | a_2 | a_3 | a_4 | a_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|
```

*Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.*
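
A matrix like this can be built from long-format ratings with a pandas pivot. The sketch below uses the first three users of the table above; the column names `user_id`, `asin`, and `rating` are assumptions for illustration (the notebook's own data uses `reviewerID`, `asin`, and `overall`).

```python
import pandas as pd

# Long-format ratings for users u_1 to u_3 of the example table.
ratings = pd.DataFrame({
    "user_id": ["u_1", "u_1", "u_2", "u_2", "u_2", "u_3"],
    "asin":    ["a_3", "a_5", "a_1", "a_4", "a_5", "a_1"],
    "rating":  [4, 1, 3, 2, 2, 3],
})

# Rows are users, columns are items; cells without a rating become NaN.
matrix = ratings.pivot(index="user_id", columns="asin", values="rating")
# Add columns for items nobody in this sample rated (here a_2).
matrix = matrix.reindex(columns=["a_1", "a_2", "a_3", "a_4", "a_5"])

n_missing = int(matrix.isna().sum().sum())
print(matrix)
print(f"{n_missing} of {matrix.size} cells are missing")
```

On the full dataset a dense pivot of this kind would likely be too large to hold in memory, so this is only meant to make the problem statement concrete.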

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Load Data

In [2]:
df = pd.read_csv('../Data/eda_data.csv')

In [3]:
df.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,,5.0,True,A1J205ZK25TZ6W,I make the best brewed iced tea with this yell...,Best for brewed iced tea.,8.0,
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (","['B00CREXSHY', 'B001QTRGAQ', 'B000JSQK70', 'B0...",Grocery,12.46,4639725043,,3.0,True,ACOICLIJQYECU,I have recently started drinking hot tea again...,Not Bad for iced Tea,9.0,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1086548 entries, 0 to 1086547
Data columns (total 18 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   category     1086548 non-null  object 
 1   description  993137 non-null   object 
 2   title        1086548 non-null  object 
 3   also_buy     929733 non-null   object 
 4   brand        1078563 non-null  object 
 5   rank         1042485 non-null  object 
 6   also_view    578911 non-null   object 
 7   main_cat     1085268 non-null  object 
 8   price        752804 non-null   float64
 9   asin         1086548 non-null  object 
 10  details      1086497 non-null  object 
 11  overall      1086548 non-null  float64
 12  verified     1086548 non-null  bool   
 13  reviewerID   1086548 non-null  object 
 14  reviewText   1086175 non-null  object 
 15  summary      1086335 non-null  object 
 16  vote         149486 non-null   float64
 17  style        561516 non-null   object 
dtypes: bool(1), float64(3), object(14)

### Preprocessing: split ratings into train and test sets

In [5]:
# Full grocery data set
print(df.shape)
print(df.reviewerID.nunique())
print(df.asin.nunique())

(1086548, 18)
127496
41280


In [6]:
# Random subset of the grocery dataset, used to speed up testing
df_subset = df.iloc[np.random.choice(df.index, size=10000, replace=False)]
print(df_subset.shape)
print(df_subset.reviewerID.nunique())
print(df_subset.asin.nunique())

(10000, 18)
9373
6669
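
The draw above is not seeded, so the subset (and the unique counts) will change on each run. One possible alternative, sketched here on a small stand-in frame, is `DataFrame.sample` with a fixed `random_state`; `df_demo` and the seed value are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the full ratings frame; only the columns needed for the demo.
df_demo = pd.DataFrame({
    "reviewerID": [f"user_{i % 40}" for i in range(100)],
    "asin": [f"item_{i % 25}" for i in range(100)],
    "overall": [float(i % 5 + 1) for i in range(100)],
})

# random_state pins the draw, so re-running the cell returns the same rows.
subset_a = df_demo.sample(n=10, random_state=26)
subset_b = df_demo.sample(n=10, random_state=26)

print(subset_a.index.equals(subset_b.index))  # True
print(subset_a.shape)                         # (10, 3)
```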


### Train/test split using scikit-learn's `train_test_split`
Generate train and test subsets using an 80/20 split.

In [7]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=26)
print(df.shape)
print(df_train.shape)
print(df_test.shape)
assert len(df_train.index.intersection(df_test.index)) == 0

(1086548, 18)
(869238, 18)
(217310, 18)
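
Because the split is made over rows (individual ratings), some users in the test set may have no ratings left in the training set. The sketch below shows one way to count them on a toy frame; the data and the name `cold_start_users` are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings frame with the same key columns as the notebook's data.
demo = pd.DataFrame({
    "reviewerID": ["A", "A", "A", "B", "B", "C", "D", "D", "E", "F"],
    "asin": ["x", "y", "z", "x", "y", "z", "x", "z", "y", "x"],
    "overall": [5.0, 3.0, 4.0, 2.0, 5.0, 1.0, 4.0, 4.0, 3.0, 5.0],
})
train, test = train_test_split(demo, test_size=0.2, random_state=26)

# Users that appear in the test rows but in none of the training rows.
train_users = set(train["reviewerID"])
test_users = set(test["reviewerID"])
cold_start_users = test_users - train_users

print(f"{len(cold_start_users)} of {len(test_users)} test users have no training ratings")
```

The same three lines of set arithmetic can be applied to `df_train` and `df_test` to see how many test users a collaborative method would have no history for.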


In [8]:
df_test.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style
906154,"['Grocery & Gourmet Food', 'Beverages', 'Bottl...",['Bai Bubbles Jamaica Blood Orange is a refres...,"Bai Bubbles, Sparkling Water, Jamaica Blood Or...","['B004JLRYCS', 'B0799978Q3', 'B017W3QUL0', 'B0...",bai,"3,954 in Grocery & Gourmet Food (",,Grocery,20.02,B00N14ZDDQ,"{'\n Item Weight: \n ': '11.5 ounces', '...",4.0,False,APYHSY5CVS48I,This is probably my favorite Bai bubbles. I s...,Pretty good. My favorite Bai Bubbles!,,{'Flavor:': ' Waikiki Coconut Lime'}
663706,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...","['The awesome, centering energy of golden turm...",Rishi Tea Organic Caffeine-Free Turmeric Ginge...,"['B008YDOKL0', 'B00E7VTFHC', 'B00CX33V6C', 'B0...",Rishi Tea,"4,193 in Grocery & Gourmet Food (","['B017TNT4K2', 'B01CIW4ELI', 'B006PTWQJI', 'B0...",Grocery,,B00AR6Q5JC,{'\n Product Dimensions: \n ': '2.8 x 2....,5.0,True,A2E7UZ9BLOTZVO,I love the taste and it helps my chronic tendo...,Great product,,


In [9]:
df_train.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style
93329,"['Grocery & Gourmet Food', 'Snack Foods', 'Coo...",['The Original Digestive: Wheat meal Biscuit. ...,"McVitie's Digestive Biscuits -400g 3 Pack, Ori...","['B000F9X40O', 'B00I8VS45G', 'B009CS3LNE', 'B0...",McVitie's,"7,580 in Grocery & Gourmet Food (","['B000JSOC0C', 'B000F9X40O', 'B00I8VS45G', 'B0...",Grocery,12.59,B000FA8SH2,{'\n Product Dimensions: \n ': '6 x 4 x ...,5.0,True,A6KL17KKN0A5L,"I grew up with Digestives biscuits, and nothin...",and nothing goes better with tea than these,,{'Size:': ' Pack of 3'}
481878,"['Grocery & Gourmet Food', 'Cooking & Baking',...","[""Over time, CARNATION Milk has found its way ...","Nestle Carnation Instant Nonfat Dry Milk, 25.6...","['B005CUM0J2', 'B0054DDBGI', 'B00FRFRZF6', 'B0...",Carnation,"20,889 in Grocery & Gourmet Food (","['B0054DDBGI', 'B00FRFRZF6', 'B004K0862K', 'B0...",Grocery,,B004VITI0K,"{'Shipping Weight:': '1.7 pounds', 'Domestic S...",4.0,True,AAOY6JB236XZW,Expensive but what can you do? I need it for ...,Good,,{'Size:': ' Pack of 1'}


In [10]:
df_train_subset, df_test_subset = train_test_split(df_subset, test_size=0.2, random_state=123)
# Flag each row with its split; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
df_train_subset = df_train_subset.assign(for_testing=False)
df_test_subset = df_test_subset.assign(for_testing=True)
print(df_subset.shape)
print(df_train_subset.shape)
print(df_test_subset.shape)

(10000, 18)
(8000, 19)
(2000, 19)


In [11]:
df_test_subset.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style,for_testing
715571,"['Grocery & Gourmet Food', 'Herbs, Spices & Se...",,"Bay Leaves, Ground-4oz-Bay Leaf Powder","['B00PXAWB1I', 'B000N8MTWG', 'B000N4OF6S', 'B0...",Red Bunny Farms,"25,024 in Grocery & Gourmet Food (",,Grocery,6.88,B00CJ27NJS,"{'Shipping Weight:': '4 ounces (', 'ASIN: ': '...",4.0,True,A3BBZVH5MRLC7O,"good bay leaf powder, but only needed a very s...",good bay leaf powder,,,True
434633,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",,Aiya Organic Ceremonial Matcha (30g),"['B006LLWB8G', 'B003VSEG7Q', 'B00L9IEBE8', 'B0...",Aiya,"27,596 in Grocery & Gourmet Food (",,Grocery,24.03,B00447S4YY,{'\n Product Dimensions: \n ': '2.5 x 2....,4.0,True,A11B369RS9ON4L,"Nice, but a little clumpy.",Clumpy,,{'Size:': ' 100 grams'},True


In [12]:
df_train_subset.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details,overall,verified,reviewerID,reviewText,summary,vote,style,for_testing
865780,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['This sampler pack includes 50 premium gourme...,"Coffee, Tea, and Hot Chocolate Variety Sampler...","['B003KRHDNC', 'B00N5I46BI', 'B071Z8LD77', 'B0...",Custom Variety Pack,"1,988 in Grocery & Gourmet Food (",,Grocery,29.75,B00K2RYBMO,"{'Shipping Weight:': '2 pounds (', 'Domestic S...",5.0,True,AZBWA0QDFA2XR,This sampler has a really great variety. I fou...,"No duplicates, great variety",2.0,,False
262733,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...","['Organic French Roast, Dark Roast, Caffeine F...",Teeccino Maya Herbal Coffee Caffe Dark Roast F...,"['B072FJCVZQ', 'B00TZ5UV28', 'B07FDPWMJ1', 'B0...",Teeccino,"58,716 in Grocery & Gourmet Food (",,Grocery,10.0,B001BYGJLS,{'\n Product Dimensions: \n ': '6 x 6 x ...,3.0,True,A152DM9G44QW8J,It is OK.,Three Stars,,"{'Size:': ' 11 oz', 'Flavor:': ' French Roast'}",False


## Save Data

In [13]:
df_train.to_csv('../Data/training_data.csv', index=False)
df_test.to_csv('../Data/testing_data.csv', index=False)
df_train_subset.to_csv('../Data/training_data_subset.csv', index=False)
df_test_subset.to_csv('../Data/testing_data_subset.csv', index=False)
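
Writing with `index=False` drops the original row labels, so a reloaded file starts again from a fresh `RangeIndex`. The sketch below illustrates this round trip on a toy frame in a temporary directory; the file name `demo.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with non-contiguous row labels, like the rows of a shuffled split.
frame = pd.DataFrame(
    {"asin": ["a_1", "a_2"], "overall": [5.0, 3.0]},
    index=[906154, 663706],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.csv"
    frame.to_csv(path, index=False)   # row labels are not written
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['asin', 'overall']
print(list(reloaded.index))    # [0, 1]
```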

## References
1) Unata (2015). [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).