# Minimal Recommendation Engine

### Notation <sup>[1]</sup>

- $U$ is the set of users in our domain. Its size is $|U|$.
- $I$ is the set of items in our domain. Its size is $|I|$.
- $I(u)$ is the set of items that user $u$ has rated.
- $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet rated by user $u$.
- $U(i)$ is the set of users that have rated item $i$.
- $-U(i)$ is the complement of $U(i)$, i.e., the set of users that have not rated item $i$.
- $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.
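
As a minimal sketch of this notation, assuming a small hypothetical ratings table with the same `reviewerID`, `asin`, and `overall` columns as the dataset used in this notebook, the sets $I(u)$, $U(i)$, and their complements can be built with pandas and Python sets. The toy values and the helper names `unseen_items` and `non_raters` are illustrative only.

```python
import pandas as pd

# Hypothetical toy ratings (not taken from the real dataset)
ratings = pd.DataFrame({
    "reviewerID": ["u_1", "u_1", "u_2", "u_2", "u_3"],
    "asin":       ["a_3", "a_5", "a_1", "a_4", "a_1"],
    "overall":    [4.0, 1.0, 3.0, 2.0, 3.0],
})

# U and I, restricted here to users and items that appear in at least one rating
all_users = set(ratings["reviewerID"])
all_items = set(ratings["asin"])

# I(u): items rated by each user; U(i): users who rated each item
items_by_user = ratings.groupby("reviewerID")["asin"].apply(set).to_dict()
users_by_item = ratings.groupby("asin")["reviewerID"].apply(set).to_dict()

def unseen_items(user):
    """-I(u): items the user has not rated yet."""
    return all_items - items_by_user.get(user, set())

def non_raters(item):
    """-U(i): users who have not rated the item."""
    return all_users - users_by_item.get(item, set())
```

Note that a user with no ratings at all never shows up in a long table like this, so in practice $U$ may need to come from a separate user list.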

### Goal of a recommendation system <sup>[1]</sup>

$$
i^{*} = \underset{i \in -I(u)}{\operatorname{argmax}} \; S(u,i), \quad \forall u \in U
$$
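
One way to read this objective in code, as a sketch only: given some utility function, pick for each user the not-yet-rated item with the highest score. The `score` function below is a hypothetical placeholder (the item's mean rating, ignoring the user), not the method from the reference, and the toy ratings are made up.

```python
import pandas as pd

# Hypothetical toy ratings
ratings = pd.DataFrame({
    "reviewerID": ["u_1", "u_1", "u_2", "u_2", "u_2", "u_4"],
    "asin":       ["a_3", "a_5", "a_1", "a_4", "a_5", "a_3"],
    "overall":    [4.0, 1.0, 3.0, 2.0, 2.0, 2.0],
})

# Mean rating per item, used by the placeholder utility function
item_means = ratings.groupby("asin")["overall"].mean()

def score(user, item):
    """Placeholder S(u, i): the item's mean rating, ignoring the user."""
    return item_means[item]

def recommend(user):
    """Return i*, the argmax of S(u, i) over items the user has not rated."""
    seen = set(ratings.loc[ratings["reviewerID"] == user, "asin"])
    candidates = [i for i in item_means.index if i not in seen]
    if not candidates:
        return None  # the user has already rated every known item
    return max(candidates, key=lambda i: score(user, i))

print(recommend("u_1"))
print(recommend("u_2"))
```

Any real $S(u,i)$ would depend on the user as well; only the argmax over $-I(u)$ is the point here.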

### Problem statement <sup>[1]</sup>

The recommendation problem in its most basic form is quite simple to define:

```
|-------------------+-----+-----+-----+-----+-----|
|  user_id, asin    | a_1 | a_2 | a_3 | a_4 | a_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|
```

*Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.*
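
As an illustration, a long table of ratings can be turned into this kind of matrix with `DataFrame.pivot_table`, with `NaN` playing the role of `?`. The toy values below copy the first three rows of the table above. This is a sketch for small data only: a dense user-by-item matrix grows as $|U| \times |I|$ and quickly becomes impractical.

```python
import pandas as pd

# Ratings copied from the u_1, u_2, and u_3 rows of the example table
ratings = pd.DataFrame({
    "reviewerID": ["u_1", "u_1", "u_2", "u_2", "u_2", "u_3"],
    "asin":       ["a_3", "a_5", "a_1", "a_4", "a_5", "a_1"],
    "overall":    [4, 1, 3, 2, 2, 3],
})

# Rows are users, columns are items, NaN marks a missing rating ("?")
matrix = ratings.pivot_table(index="reviewerID", columns="asin", values="overall")

# Fraction of cells that still have to be estimated
missing_fraction = matrix.isna().to_numpy().mean()

print(matrix)
print(f"missing: {missing_fraction:.0%}")
```

Items with no ratings in the input (such as `a_2` here) do not get a column, so the pivoted matrix only covers items that were rated at least once.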

## Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Load Data

In [2]:
df = pd.read_csv('../Data/eda_data.csv')

In [3]:
df.head(2)

Unnamed: 0,category,description,title,also_buy,brand,rank,main_cat,asin,details,overall,verified,reviewerID,reviewText,summary
0,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,5.0,True,A1J205ZK25TZ6W,I make the best brewed iced tea with this yell...,Best for brewed iced tea.
1,"['Grocery & Gourmet Food', 'Beverages', 'Coffe...",['Lipton Yellow Label Tea use only the finest ...,Lipton Yellow Label Tea (loose tea) - 450g,"['B00886E4K0', 'B00CREXSHY', 'B001QTRGAQ', 'B0...",Lipton,"30,937 in Grocery & Gourmet Food (",Grocery,4639725043,,3.0,True,ACOICLIJQYECU,I have recently started drinking hot tea again...,Not Bad for iced Tea


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1086548 entries, 0 to 1086547
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   category     1086548 non-null  object 
 1   description  993137 non-null   object 
 2   title        1086548 non-null  object 
 3   also_buy     929733 non-null   object 
 4   brand        1078563 non-null  object 
 5   rank         1042485 non-null  object 
 6   main_cat     1085268 non-null  object 
 7   asin         1086548 non-null  object 
 8   details      1086497 non-null  object 
 9   overall      1086548 non-null  float64
 10  verified     1086548 non-null  bool   
 11  reviewerID   1086548 non-null  object 
 12  reviewText   1086175 non-null  object 
 13  summary      1086335 non-null  object 
dtypes: bool(1), float64(1), object(12)
memory usage: 108.8+ MB


### Preprocessing: split ratings into train and test sets

In [5]:
print(df.shape)
print(df.reviewerID.nunique())
print(df.asin.nunique())

(1086548, 14)
127496
41280


In [6]:
# TODO: Discuss the two splitting approaches below.
# Probably only the recommendation function is needed, but this needs confirming.

X = df.drop('overall', axis=1)
y = df['overall']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)

In [7]:
print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')
print(f'y_train: {y_train.shape}')
print(f'y_test: {y_test.shape}')

X_train: (869238, 13)
X_test: (217310, 13)
y_train: (869238,)
y_test: (217310,)
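
One property of this row-wise split worth keeping in mind: `train_test_split` shuffles rows without regard to `reviewerID`, so a user with few ratings may end up with rows only in the test set. The sketch below checks for that on hypothetical toy data; the column names match the dataset, the values do not.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings: u_1 has eight ratings, u_2 and u_3 have one each
toy = pd.DataFrame({
    "reviewerID": ["u_1"] * 8 + ["u_2", "u_3"],
    "asin": [f"a_{k}" for k in range(1, 11)],
    "overall": [5, 4, 3, 5, 2, 1, 4, 3, 5, 2],
})

train, test = train_test_split(toy, test_size=0.2, random_state=123)

# Users that appear in the test rows but have no training rows at all
users_without_training_rows = set(test["reviewerID"]) - set(train["reviewerID"])

print(len(train), len(test))
print(users_without_training_rows)
```

Whether that set is empty depends on the shuffle; the same check can be run on `X_train` and `X_test`.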


### Scale Data

TODO: The reference recommendation example did not scale the data.
Scaling, standardizing, or normalizing is probably unnecessary here. If it is needed, how should it be done, given that the majority of features are not numeric?

### Train test split using custom function
Generate train and test subsets by marking 20% of each user's ratings for testing, using groupby and apply. <sup>[1]</sup>

In [8]:
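def assign_to_set(df):
    # Randomly pick 20% (rounded up) of this user's rows and flag them as test rows
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.loc[sampled_ids, 'for_testing'] = True
    return df

df['for_testing'] = False
# Apply the sampling per user; group_keys=False keeps the original row index
grouped = df.groupby('reviewerID', group_keys=False).apply(assign_to_set)
# Select rows by aligning the flag computed in `grouped` with df's index
df_train = df[grouped.for_testing == False]
df_test = df[grouped.for_testing == True]
print(df.shape)
print(df_train.shape)
print(df_test.shape)
# The two subsets must not share any rows
assert len(df_train.index.intersection(df_test.index)) == 0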

(1086548, 15)
(824585, 15)
(261963, 15)
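
The test set holds about 24% of the rows (261963 of 1086548) rather than 20% because `np.ceil` in `assign_to_set` rounds each user's share up, so every user contributes at least one test rating. A small sketch of that arithmetic, using hypothetical per-user rating counts:

```python
import numpy as np

# Hypothetical number of ratings per user
ratings_per_user = np.array([1, 2, 3, 5, 10, 11])

# Same rule as assign_to_set: ceil(20% of the user's ratings)
test_per_user = np.ceil(ratings_per_user * 0.2).astype(np.int64)

# Share of all ratings that ends up in the test set
overall_test_fraction = test_per_user.sum() / ratings_per_user.sum()

print(test_per_user)
print(round(overall_test_fraction, 3))
```

One consequence of this rule is that a user with a single rating has that rating placed in the test set and none left for training.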


## Save Data

In [9]:
df_train.to_csv('../Data/training_data.csv', index=False)

In [10]:
df_test.to_csv('../Data/testing_data.csv', index=False)

## References
1) Unata 2015 [Hands-on with PyData: How to Build a Minimal Recommendation Engine](https://www.youtube.com/watch?v=F6gWjOc1FUs).  