# Product Recommendation System

## Concept
We want to build a model that will learn from previous transactions to assess the relationship between items. The model will then be available to recommend items to new incomplete transactions.

In [66]:
import pandas as pd
import numpy as np

In [67]:
df = pd.read_csv("groceries.csv")

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Data columns (total 33 columns):
Item(s)    9835 non-null int64
Item 1     9835 non-null object
Item 2     7676 non-null object
Item 3     6033 non-null object
Item 4     4734 non-null object
Item 5     3729 non-null object
Item 6     2874 non-null object
Item 7     2229 non-null object
Item 8     1684 non-null object
Item 9     1246 non-null object
Item 10    896 non-null object
Item 11    650 non-null object
Item 12    468 non-null object
Item 13    351 non-null object
Item 14    273 non-null object
Item 15    196 non-null object
Item 16    141 non-null object
Item 17    95 non-null object
Item 18    66 non-null object
Item 19    52 non-null object
Item 20    38 non-null object
Item 21    29 non-null object
Item 22    18 non-null object
Item 23    14 non-null object
Item 24    8 non-null object
Item 25    7 non-null object
Item 26    7 non-null object
Item 27    6 non-null object
Item 28    5 non-null object
It

In [69]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 23,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,
2,1,whole milk,,,,,,,,,...,,,,,,,,,,
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,


### Data Cleaning
For our model to learn properly, we should format our data.
We will convert the "Item n" columns to boolean features: one for each item.

In [70]:
def translate_to_array(row):
    columns = ["Item " + str(i) for i in range(1, 33)]
    return [row[column] for column in columns if type(row[column]) != float]

In [71]:
df["items"] = df.apply(translate_to_array, axis=1)

In [72]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32,items
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,"[citrus fruit, semi-finished bread, margarine,..."
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,"[tropical fruit, yogurt, coffee]"
2,1,whole milk,,,,,,,,,...,,,,,,,,,,[whole milk]
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,"[pip fruit, yogurt, cream cheese, meat spreads]"
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,"[other vegetables, whole milk, condensed milk,..."


In [73]:
def get_unique_items(df):
    transactions = list(df["items"])
    items = []
    for t in transactions:
        for item in t:
            items.append(item)
    return list(set(items))

In [74]:
items = get_unique_items(df)

In [75]:
def categorize_items(df, items):
    for item in items:
        df[item] = df["items"].apply(lambda transaction: int(item in transaction))

In [76]:
categorize_items(df, items)

In [77]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,cookware,butter,waffles,abrasive cleaner,cream cheese,chocolate marshmallow,coffee,fruit/vegetable juice,nuts/prunes,margarine
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,0,0,0,0,0,0,0,0,0,1
1,3,tropical fruit,yogurt,coffee,,,,,,,...,0,0,0,0,0,0,1,0,0,0
2,1,whole milk,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,0,0,0,0,1,0,0,0,0,0
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,0,0,0,0,0,0,0,0,0,0


In [78]:
def drop_previous_item_columns(df):
    columns = ["Item " + str(i) for i in range(1, 33)]
    return df.drop(columns=columns)

In [79]:
df = drop_previous_item_columns(df)

In [80]:
df.head(6)

Unnamed: 0,Item(s),items,condensed milk,packaged fruit/vegetables,flower (seeds),salty snack,canned fish,bottled beer,sauces,salt,...,cookware,butter,waffles,abrasive cleaner,cream cheese,chocolate marshmallow,coffee,fruit/vegetable juice,nuts/prunes,margarine
0,4,"[citrus fruit, semi-finished bread, margarine,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,3,"[tropical fruit, yogurt, coffee]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1,[whole milk],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,"[pip fruit, yogurt, cream cheese, meat spreads]",0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,4,"[other vegetables, whole milk, condensed milk,...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,"[whole milk, butter, yogurt, rice, abrasive cl...",0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0


## Output of our model
Client A
T1.[onions, salt]
T2.[beer]
T3.[salt, waffles] -> next time, the client will probably want x

What is our goal? 1. Make him stay in our shop 2.Promotion 3.Suggesting what he needs the most to add to his physical shopping list=convenience 4.Could use a pricing model to find the best promotion

We promote new items as well as items he is used to buying

How to find which items to promote?
* we can look into what similar client buy (k-neareast-neighbors)
* we can look into what this client tends to buy (probabilistic, Naive bayes?)

### Reducing Bias Towards Common Items
Our model will inherently be biased towards common items. We will therefore prioritize rare items to balance it.

In [81]:
def get_occurences(item, df):
    return df[item].sum()
    
item_occurences = {item: get_occurences(item, df) for item in items}
max_occurences = max(item_occurences.values())
item_lift_factors = {}
for item in item_occurences.keys():
    item_lift_factors[item] = max_occurences/item_occurences[item]
item_lift_factors

{'condensed milk': 24.88118811881188,
 'packaged fruit/vegetables': 19.6328125,
 'flower (seeds)': 24.637254901960784,
 'salty snack': 6.755376344086022,
 'canned fish': 16.97972972972973,
 'bottled beer': 3.172979797979798,
 'sauces': 46.53703703703704,
 'salt': 23.70754716981132,
 'domestic eggs': 4.027243589743589,
 'other vegetables': 1.3205465055176038,
 'cooking chocolate': 100.52,
 'frozen fruits': 209.41666666666666,
 'dental care': 44.08771929824562,
 'kitchen towels': 42.59322033898305,
 'grapes': 11.422727272727272,
 'misc. beverages': 9.007168458781361,
 'hamburger meat': 7.685015290519877,
 'cat food': 10.973799126637555,
 'detergent': 13.296296296296296,
 'frozen vegetables': 5.312896405919662,
 'zwieback': 36.955882352941174,
 'hair spray': 228.45454545454547,
 'white bread': 6.070048309178744,
 'pip fruit': 3.377688172043011,
 'bottled water': 2.311867525298988,
 'tropical fruit': 2.435077519379845,
 'prosecco': 125.65,
 'potato products': 89.75,
 'rubbing alcohol': 251

In [82]:
max_occurences

2513

In [83]:
# Returns the proportion of unique items that are in both transactions
def get_transaction_similarity(transaction_1, transaction_2):
    n_unique_items = len(set(transaction_1).union(set(transaction_2)))
    n_common_items = len(set(transaction_1).intersection(set(transaction_2)))
    return n_common_items / n_unique_items

In [84]:
def row_number_to_transaction(row_number, df):
    return df["items"].iloc[row_number]

# Sorts a dictionary by its values
def sort_dictio(d, descending=True):
    return {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse = descending)}

In [85]:
# To be usable, a transaction must be of similar size, not exactly the same, and not be a subset of the current transaction
def transaction_is_usable(current_tr, tr, transaction_size_factor_threshold):
    is_not_subset = len(set(tr).difference(set(current_transaction))) > 0
    if is_not_subset:
        return transaction_size_factor_threshold < abs(len(current_tr) - len(tr)) / len(current_tr) < 1
    return False

# Returns every transaction that is usable, as defined in the previous method
def get_usable_transactions(df, current_transaction, transaction_size_factor_threshold):
    usable_transactions = {}
    for index, row in df.iterrows():
        tr = row["items"]
        if transaction_is_usable(current_transaction, tr, transaction_size_factor_threshold):
            usable_transactions[index] = tr
    return usable_transactions

# Returns a dictionary associating each transaction to its similarity with the current transaction
def get_similarity_per_transaction(current_transaction, usable_transactions):
    similarity_per_transaction = {}
    for row_number, u_tr in usable_transactions.items():
        similarity_per_transaction[row_number] = get_transaction_similarity(current_transaction, u_tr)
    similarity_per_transaction = sort_dictio(similarity_per_transaction)
    return similarity_per_transaction

def select_k_most_similar(similarity_per_transaction, k):
    k_nearest_neighbors = {}
    for k, v in tuple(similarity_per_transaction.items())[:k]:
        k_nearest_neighbors[k] = v
    return k_nearest_neighbors 

# Returns a dictionary containing the k most similar transactions as
# {transaction_row_number: similarity_factor}
def select_k_nearest_transactions(df, k, current_transaction, transaction_size_factor_threshold):
    usable_transactions = get_usable_transactions(df, current_transaction, transaction_size_factor_threshold)    
    similarity_per_transaction = get_similarity_per_transaction(current_transaction, usable_transactions)
    k_nearest_neighbors = select_k_most_similar(similarity_per_transaction, k)
    return k_nearest_neighbors

In [86]:
current_transaction = list(df["items"].iloc[10])
print(current_transaction)
current_transaction = current_transaction[:-1]
k_nearest_neighbors_dict = select_k_nearest_transactions(df, 10, current_transaction, .3)
k_nearest_neighbors_dict

neighbor_transactions = [row_number_to_transaction(row_number, df) for row_number in k_nearest_neighbors_dict.keys()]
neighbor_transactions

['tropical fruit', 'other vegetables', 'white bread', 'bottled water', 'chocolate']


[['tropical fruit',
  'berries',
  'root vegetables',
  'other vegetables',
  'bottled water',
  'shopping bags'],
 ['tropical fruit',
  'other vegetables',
  'rolls/buns',
  'white bread',
  'bottled beer',
  'potted plants'],
 ['tropical fruit',
  'pip fruit',
  'root vegetables',
  'other vegetables',
  'whole milk',
  'bottled water'],
 ['other vegetables',
  'whole milk',
  'whipped/sour cream',
  'specialty cheese',
  'white bread',
  'cat food',
  'bottled water'],
 ['citrus fruit',
  'tropical fruit',
  'herbs',
  'other vegetables',
  'whole milk',
  'whipped/sour cream',
  'bottled water'],
 ['sausage',
  'tropical fruit',
  'whole milk',
  'yogurt',
  'whipped/sour cream',
  'white bread',
  'bottled water'],
 ['other vegetables',
  'whole milk',
  'rolls/buns',
  'margarine',
  'bottled water',
  'bottled beer'],
 ['onions', 'other vegetables', 'pastry', 'bottled water', 'soda', 'napkins'],
 ['meat',
  'other vegetables',
  'hard cheese',
  'frozen meals',
  'bottled water'

In [98]:
import random

def recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df, apply_lift=False, debug=False):
    neighbor_transactions = [row_number_to_transaction(row_number, df) for row_number in k_nearest_neighbors_dict.keys()]
    flatten_items = [item for transaction in neighbor_transactions for item in transaction]
    
    # sorting new items by occurences
    item_counts = {}
    for item in set(flatten_items):
        if item not in current_transaction:
            # count number of occurences
            item_counts[item] = flatten_items.count(item)
    if len(item_counts) == 0:
        return np.nan
            
    if debug:
        print(item_counts)
            
    # implementing "lift"
    if apply_lift:
        for item, occurences in item_counts.items():
            item_counts[item] *= item_lift_factors[item]
        return [item for item in item_counts.keys() if item_counts[item] == max(item_counts.values())]
     
    # without lift, returns one of the items that have the most occurences
    max_occurences = max(item_counts.values())
    return random.choice([item for item in item_counts.keys() if item_counts[item] == max_occurences])

In [88]:
recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df)

'whole milk'

In [147]:
def recommend(current_transaction, df, k, transaction_size_factor_threshold, debug=False):
    if len(current_transaction) == 0:
        return np.nan
    k_nearest_neighbors_dict = select_k_nearest_transactions(df, k, current_transaction, transaction_size_factor_threshold)
    if debug:
        print("Current transaction: " + str(current_transaction))
        print(k_nearest_neighbors_dict)
        
    if len(k_nearest_neighbors_dict) == 0:
        return np.nan
    return recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df, debug=debug)

In [90]:
current_transaction = list(df["items"].iloc[11])
k = 10
transaction_size_factor_threshold = .3
recommend(current_transaction, df, k, transaction_size_factor_threshold, True)

Current transaction: ['citrus fruit', 'tropical fruit', 'whole milk', 'butter', 'curd', 'yogurt', 'flour', 'bottled water', 'dishes']
{3692: 0.5, 707: 0.4, 740: 0.4, 4638: 0.4, 8506: 0.4, 8585: 0.4, 9081: 0.4, 1658: 0.36363636363636365, 1846: 0.36363636363636365, 2435: 0.36363636363636365}
{'soft cheese': 1, 'frankfurter': 2, 'frozen meals': 1, 'whipped/sour cream': 2, 'rolls/buns': 1, 'pip fruit': 1, 'beverages': 1, 'onions': 1, 'specialty cheese': 1, 'other vegetables': 2}


'other vegetables'

## Testing

In [166]:
def recommend_row(row, df, k, transaction_size_factor_threshold):
#     items = [item for item in row["items"] if item not in [random.choice(row["items"])]]
    items = row["items"][:-1]
    return recommend(items, df, k, transaction_size_factor_threshold)

# test_sample = df.sample(n=30, random_state=1)
test_sample = df.loc[:200, :].copy()
test_sample["recommendation"] = test_sample.apply(lambda row: recommend_row(row, test_sample, k, transaction_size_factor_threshold), axis=1)

In [167]:
test_sample["recommendation_is_accurate"] = test_sample.apply(lambda row: row["recommendation"] in row["items"], axis=1)
test_sample[["items", "recommendation", "recommendation_is_accurate"]]

Unnamed: 0,items,recommendation,recommendation_is_accurate
0,"[citrus fruit, semi-finished bread, margarine,...",ready soups,True
1,"[tropical fruit, yogurt, coffee]",coffee,True
2,[whole milk],,False
3,"[pip fruit, yogurt, cream cheese, meat spreads]",butter milk,False
4,"[other vegetables, whole milk, condensed milk,...",long life bakery product,True
...,...,...,...
196,[canned beer],,False
197,"[pork, beef, pip fruit, herbs, spices]",tropical fruit,False
198,"[frankfurter, citrus fruit, UHT-milk, margarin...",dessert,False
199,"[sausage, bottled beer, liquor (appetizer)]",whole milk,False


In [168]:
accuracy = test_sample["recommendation_is_accurate"].sum() / len(test_sample)
str(accuracy * 100) + "%"

'11.442786069651742%'

In [None]:
def get_accuracy(df, sample_size, k, transaction_size_factor_threshold):
    test_sample = df.loc[:sample_size, :].copy()
    test_sample["recommendation"] = test_sample.apply(lambda row: recommend_row(row, test_sample, k, transaction_size_factor_threshold), axis=1)
    test_sample["recommendation_is_accurate"] = test_sample.apply(lambda row: row["recommendation"] in row["items"], axis=1)
    accuracy = test_sample["recommendation_is_accurate"].sum() / len(test_sample)
    return accuracy

# can probably disable the size condition
def optimize_hyperparameters(df, k_values):
    pass

In [144]:
recommend(["pip fruit", "yogurt", "cream cheese", "meat spreads"], df, k, transaction_size_factor_threshold)

'citrus fruit'

In [131]:
items

['condensed milk',
 'packaged fruit/vegetables',
 'flower (seeds)',
 'salty snack',
 'canned fish',
 'bottled beer',
 'sauces',
 'salt',
 'domestic eggs',
 'other vegetables',
 'cooking chocolate',
 'frozen fruits',
 'dental care',
 'kitchen towels',
 'grapes',
 'misc. beverages',
 'hamburger meat',
 'cat food',
 'detergent',
 'frozen vegetables',
 'zwieback',
 'hair spray',
 'white bread',
 'pip fruit',
 'bottled water',
 'tropical fruit',
 'prosecco',
 'potato products',
 'rubbing alcohol',
 'chicken',
 'dishes',
 'finished products',
 'organic sausage',
 'cereals',
 'house keeping products',
 'brown bread',
 'fish',
 'flower soil/fertilizer',
 'cocoa drinks',
 'beverages',
 'dessert',
 'candy',
 'instant coffee',
 'baby food',
 'ready soups',
 'sliced cheese',
 'rum',
 'root vegetables',
 'pickled vegetables',
 'honey',
 'candles',
 'curd',
 'sound storage medium',
 'liquor (appetizer)',
 'pet care',
 'berries',
 'organic products',
 'specialty vegetables',
 'snack products',
 'nut 

In [361]:
df.head()

Unnamed: 0,Item(s),items,onions,fruit/vegetable juice,misc. beverages,salt,shopping bags,waffles,curd,brandy,...,salty snack,whisky,dental care,beef,cookware,liqueur,liver loaf,dishes,meat spreads,red/blush wine
0,4,"[citrus fruit, semi-finished bread, margarine,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,"[tropical fruit, yogurt, coffee]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,[whole milk],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,"[pip fruit, yogurt, cream cheese, meat spreads]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,"[other vegetables, whole milk, condensed milk,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
