# Product Recommendation System

## Concept
We want to build a model that will learn from previous transactions to assess the relationship between items. The model will then be available to recommend items to new incomplete transactions.

In [125]:
import pandas as pd
import numpy as np

In [126]:
df = pd.read_csv("groceries.csv")

In [127]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Data columns (total 33 columns):
Item(s)    9835 non-null int64
Item 1     9835 non-null object
Item 2     7676 non-null object
Item 3     6033 non-null object
Item 4     4734 non-null object
Item 5     3729 non-null object
Item 6     2874 non-null object
Item 7     2229 non-null object
Item 8     1684 non-null object
Item 9     1246 non-null object
Item 10    896 non-null object
Item 11    650 non-null object
Item 12    468 non-null object
Item 13    351 non-null object
Item 14    273 non-null object
Item 15    196 non-null object
Item 16    141 non-null object
Item 17    95 non-null object
Item 18    66 non-null object
Item 19    52 non-null object
Item 20    38 non-null object
Item 21    29 non-null object
Item 22    18 non-null object
Item 23    14 non-null object
Item 24    8 non-null object
Item 25    7 non-null object
Item 26    7 non-null object
Item 27    6 non-null object
Item 28    5 non-null object
It

In [128]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 23,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,
2,1,whole milk,,,,,,,,,...,,,,,,,,,,
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,


### Data Cleaning
For our model to learn properly, we should format our data.
We will convert the "Item n" columns to boolean features: one for each item.

In [129]:
def translate_to_array(row):
    columns = ["Item " + str(i) for i in range(1, 33)]
    return [row[column] for column in columns if type(row[column]) != float]

In [130]:
df["items"] = df.apply(translate_to_array, axis=1)

In [131]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32,items
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,"[citrus fruit, semi-finished bread, margarine,..."
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,"[tropical fruit, yogurt, coffee]"
2,1,whole milk,,,,,,,,,...,,,,,,,,,,[whole milk]
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,"[pip fruit, yogurt, cream cheese, meat spreads]"
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,"[other vegetables, whole milk, condensed milk,..."


In [132]:
def get_unique_items(df):
    transactions = list(df["items"])
    items = []
    for t in transactions:
        for item in t:
            items.append(item)
    return list(set(items))

In [133]:
items = get_unique_items(df)

In [134]:
def categorize_items(df, items):
    for item in items:
        df[item] = df["items"].apply(lambda transaction: int(item in transaction))

In [135]:
categorize_items(df, items)

In [136]:
df.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,salty snack,whisky,dental care,beef,cookware,liqueur,liver loaf,dishes,meat spreads,red/blush wine
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,0,0,0,0,0,0,0,0,0,0
1,3,tropical fruit,yogurt,coffee,,,,,,,...,0,0,0,0,0,0,0,0,0,0
2,1,whole milk,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,0,0,0,0,0,0,0,0,1,0
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,0,0,0,0,0,0,0,0,0,0


In [137]:
def drop_previous_item_columns(df):
    columns = ["Item " + str(i) for i in range(1, 33)]
#     columns.append("items")
    return df.drop(columns=columns)

In [138]:
df = drop_previous_item_columns(df)

In [290]:
df.head(6)

Unnamed: 0,Item(s),items,onions,fruit/vegetable juice,misc. beverages,salt,shopping bags,waffles,curd,brandy,...,salty snack,whisky,dental care,beef,cookware,liqueur,liver loaf,dishes,meat spreads,red/blush wine
0,4,"[citrus fruit, semi-finished bread, margarine,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,"[tropical fruit, yogurt, coffee]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,[whole milk],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,"[pip fruit, yogurt, cream cheese, meat spreads]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,"[other vegetables, whole milk, condensed milk,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,"[whole milk, butter, yogurt, rice, abrasive cl...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Output of our model
Client A
T1.[onions, salt]
T2.[beer]
T3.[salt, waffles] -> next time, the client will probably want x

What is our goal? 1. Make him stay in our shop 2.Promotion 3.Suggesting what he needs the most to add to his physical shopping list=convenience 4.Could use a pricing model to find the best promotion

We promote new items as well as items he is used to buying

How to find which items to promote?
* we can look into what similar client buy (k-neareast-neighbors)
* we can look into what this client tends to buy (probabilistic, Naive bayes?)

In [317]:
# Implementing "lift"$
def get_occurences(item, df):
    return df[item].sum()
    
item_occurences = {item: get_occurences(item, df) for item in items}
max_occurences = max(item_occurences.values())
item_lift_factors = {}
for item in item_occurences.keys():
    item_lift_factors[item] = max_occurences/item_occurences[item]
item_lift_factors

{'onions': 8.239344262295083,
 'fruit/vegetable juice': 3.5344585091420533,
 'misc. beverages': 9.007168458781361,
 'salt': 23.70754716981132,
 'shopping bags': 2.5933952528379773,
 'waffles': 6.648148148148148,
 'curd': 4.7958015267175576,
 'brandy': 61.292682926829265,
 'nut snack': 81.06451612903226,
 'cooking chocolate': 100.52,
 'specialty fat': 69.80555555555556,
 'house keeping products': 30.646341463414632,
 'baby food': 2513.0,
 'pork': 4.432098765432099,
 'curd cheese': 50.26,
 'cling film/bags': 22.4375,
 'pastry': 2.872,
 'roll products': 24.88118811881188,
 'pip fruit': 3.377688172043011,
 'turkey': 31.4125,
 'fish': 86.65517241379311,
 'bottled beer': 3.172979797979798,
 'brown bread': 3.938871473354232,
 'whipped/sour cream': 3.5645390070921987,
 'skin care': 71.8,
 'cake bar': 19.33076923076923,
 'other vegetables': 1.3205465055176038,
 'frozen chicken': 418.8333333333333,
 'dog food': 29.916666666666668,
 'packaged fruit/vegetables': 19.6328125,
 'honey': 167.533333333

In [309]:
max_occurences

2513

In [143]:
# Returns the proportion of unique items that are in both transactions
def get_transaction_similarity(transaction_1, transaction_2):
    n_unique_items = len(set(transaction_1).union(set(transaction_2)))
    n_common_items = len(set(transaction_1).intersection(set(transaction_2)))
    return n_common_items / n_unique_items

0.6666666666666666

In [206]:
def row_number_to_transaction(row_number, df):
    return df["items"].iloc[row_number]

# Sorts a dictionary by its values
def sort_dictio(d, descending=True):
    return {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse = descending)}

In [279]:
# Returns a dictionary containing the k most similar transactions as
# {transaction_row_number: similarity_factor}
def select_k_nearest_transactions(df, k, current_transaction, transaction_size_factor_threshold):
#     usable_transactions = [tr for tr in past_transactions if transaction_is_usable(current_transaction, tr, transaction_size_factor_threshold)]
    usable_transactions = {}
    for index, row in df.iterrows():
        tr = row["items"]
        if transaction_is_usable(current_transaction, tr, transaction_size_factor_threshold):
            usable_transactions[index] = tr
#             usable_transactions.append(tr)
#     return {u_transaction: get_transaction_similarity(current_transaction, u_transaction) for u_transaction in usable_transactions} 
    similarity_per_transaction = {}
    for row_number, u_tr in usable_transactions.items():
        similarity_per_transaction[row_number] = get_transaction_similarity(current_transaction, u_tr)
    similarity_per_transaction = sort_dictio(similarity_per_transaction)
    
    # extracting the k most similar
    k_nearest_neighbors = {}
    for k, v in tuple(similarity_per_transaction.items())[:k]:
        k_nearest_neighbors[k] = v
    return k_nearest_neighbors 

# To be usable, a transaction must be of similar size, not exactly the same, and not be a subset of the current transaction
def transaction_is_usable(current_tr, tr, transaction_size_factor_threshold):
    is_not_subset = len(set(tr).difference(set(current_transaction))) > 0
    if is_not_subset:
        return transaction_size_factor_threshold < abs(len(current_tr) - len(tr)) / len(current_tr) < 1
    return False

In [295]:
current_transaction = list(df["items"].iloc[10])
print(current_transaction)
current_transaction = current_transaction[:-1]
k_nearest_neighbors_dict = select_k_nearest_transactions(df, 10, current_transaction, .3)
k_nearest_neighbors_dict

['tropical fruit', 'other vegetables', 'white bread', 'bottled water', 'chocolate']


{230: 0.42857142857142855,
 3330: 0.42857142857142855,
 4345: 0.42857142857142855,
 2478: 0.375,
 3474: 0.375,
 4166: 0.375,
 411: 0.25,
 438: 0.25,
 486: 0.25,
 533: 0.25}

In [357]:
import random

def recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df):
    print("We found " + str(len(k_nearest_neighbors_dict)) + " neighbors")
    print(k_nearest_neighbors_dict)
    
    neighbor_transactions = [row_number_to_transaction(row_number, df) for row_number in k_nearest_neighbors_dict.keys()]
    flatten_items = [item for transaction in neighbor_transactions for item in transaction]
    # sorting new items by occurences
    item_counts = {}
    for item in set(flatten_items):
        if item not in current_transaction:
            # count number of occurences
            item_counts[item] = flatten_items.count(item)
            
#     print(item_counts)

    # implementing lift
    for item, occurences in item_counts.items():
        item_counts[item] *= item_lift_factors[item]
    
#     print(item_counts)
    
    return [item for item in item_counts.keys() if item_counts[item] == max(item_counts.values())]
    

In [330]:
recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df)

{'onions': 1, 'frozen meals': 1, 'pasta': 1, 'shopping bags': 2, 'margarine': 1, 'napkins': 1, 'soda': 2, 'potted plants': 1, 'rolls/buns': 2, 'herbs': 1, 'specialty cheese': 1, 'pastry': 1, 'hard cheese': 1, 'pip fruit': 1, 'citrus fruit': 1, 'bottled beer': 2, 'cat food': 1, 'whipped/sour cream': 3, 'beverages': 1, 'root vegetables': 2, 'sausage': 1, 'berries': 1, 'meat': 1, 'yogurt': 1, 'whole milk': 6}
{'onions': 0.8239344262295083, 'frozen meals': 0.9007168458781362, 'pasta': 1.697972972972973, 'shopping bags': 0.5186790505675954, 'margarine': 0.43628472222222225, 'napkins': 0.4879611650485437, 'soda': 0.29306122448979594, 'potted plants': 1.478235294117647, 'rolls/buns': 0.2778330569375346, 'herbs': 1.5706250000000002, 'specialty cheese': 2.9916666666666667, 'pastry': 0.2872, 'hard cheese': 1.0427385892116183, 'pip fruit': 0.3377688172043011, 'citrus fruit': 0.30872235872235876, 'bottled beer': 0.6345959595959596, 'cat food': 1.0973799126637556, 'whipped/sour cream': 1.0693617021

['specialty cheese']

In [298]:
def recommend(current_transaction, df, k, transaction_size_factor_threshold):
    k_nearest_neighbors_dict = select_k_nearest_transactions(df, k, current_transaction, transaction_size_factor_threshold)
    return recommend_from_neighbors(current_transaction, k_nearest_neighbors_dict, df)

In [300]:
current_transaction = list(df["items"].iloc[10])
k = 10
transaction_size_factor_threshold = .3
recommend(current_transaction, df, k, transaction_size_factor_threshold)

{'misc. beverages': 1, 'pasta': 1, 'chewing gum': 1, 'coffee': 1, 'soda': 1, 'rolls/buns': 1, 'specialty cheese': 1, 'prosecco': 1, 'hygiene articles': 1, 'dessert': 1, 'citrus fruit': 1, 'bottled beer': 1, 'cake bar': 1, 'root vegetables': 1, 'zwieback': 1, 'whole milk': 1, 'cream cheese': 1}


['misc. beverages',
 'pasta',
 'chewing gum',
 'coffee',
 'soda',
 'rolls/buns',
 'specialty cheese',
 'prosecco',
 'hygiene articles',
 'dessert',
 'citrus fruit',
 'bottled beer',
 'cake bar',
 'root vegetables',
 'zwieback',
 'whole milk',
 'cream cheese']

## Testing

In [343]:
def recommend_row(row, df, k, transaction_size_factor_threshold):
    items = [item for item in row["items"] if item not in [row["items"]]]
    return recommend(items, df, k, transaction_size_factor_threshold)

test_sample = df.sample(n=30, random_state=1)
test_sample["recommendation"] = test_sample.apply(lambda row: recommend_row(row, df, k, transaction_size_factor_threshold), axis=1)

In [344]:
test_sample[["items", "recommendation"]]
# test_sample["recommendation_is_accurate"] = test_sample.apply(lambda row: row["recommendation"] in row["items"], axis=1)

Unnamed: 0,items,recommendation
260,"[pork, berries, root vegetables, other vegetab...",[syrup]
8463,"[soda, rum]",[tropical fruit]
6662,"[whipped/sour cream, candy]",[]
473,[bottled water],[]
7294,"[pastry, specialty chocolate, female sanitary ...",[condensed milk]
9005,[red/blush wine],[]
3376,"[misc. beverages, bottled beer, white wine]",[prosecco]
2807,[frozen meals],[]
4839,"[tropical fruit, brown bread, specialty bar]",[spread cheese]
550,"[onions, fruit/vegetable juice]",[]


In [358]:
recommend(["hamburger meat", "domestic eggs", "mayonnaise", "chocolate"], df, k, .5)

We found 10 neighbors
{388: 0.25, 512: 0.25, 664: 0.25, 802: 0.25, 944: 0.25, 1827: 0.25, 1959: 0.25, 3521: 0.25, 3591: 0.25, 3763: 0.25}


[]

In [361]:
df.head()

Unnamed: 0,Item(s),items,onions,fruit/vegetable juice,misc. beverages,salt,shopping bags,waffles,curd,brandy,...,salty snack,whisky,dental care,beef,cookware,liqueur,liver loaf,dishes,meat spreads,red/blush wine
0,4,"[citrus fruit, semi-finished bread, margarine,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,"[tropical fruit, yogurt, coffee]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,[whole milk],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,"[pip fruit, yogurt, cream cheese, meat spreads]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,4,"[other vegetables, whole milk, condensed milk,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
