# Market Basket

This competition was held as part of Shopee ID NDSC 2020.

Source: https://www.kaggle.com/c/market-basket-id-ndsc-2020

### Question and problem understanding

At Shopee, sellers list thousands of products for sale on our platform. A better understanding of users' tastes and preferences for products can help Shopee design better promotions and recommendations for our users. To do that, we conduct market basket analysis which allows us to identify the relationship between different combinations of products that users buy.

We are interested in finding association rules between combinations of different products. These association rules can help to uncover regularities in purchasing behaviors of our users.

For example, an association rule between 3 products, {Product A & Product B} → {Product C}, would indicate that a user buying both Product A & Product B would likely buy Product C as well.

Confidence is a measure that is used to indicate such tendencies and can be used to determine the association for varying numbers of products. For the purpose of this question, we will be using confidence to calculate the association for 2 products and 3 products.

Confidence for two products:
**Confidence(A > B) = (No. of orders containing both products A and B) / (No. of orders containing product A)**

Confidence for three products:
**Confidence(A > B&C) = (No. of orders containing products A, B, and C) / (No. of orders containing product A)**

Or:
**Confidence(A&B > C) = (No. of orders containing products A, B, and C) / (No. of orders containing both products A and B)**
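As a quick check of the three formulas, here is a minimal sketch on a handful of made-up orders. The `orders` dictionary, the product labels, and the helper names `count_orders` and `toy_confidence` are assumptions for illustration only; none of them come from the competition data.

```python
# Toy orders (hypothetical data, not from the competition files)
orders = {
    1: {"A", "B", "C"},
    2: {"A", "B"},
    3: {"A", "C"},
    4: {"A"},
    5: {"B", "C"},
}

def count_orders(items):
    """Number of orders that contain every product in `items`."""
    return sum(1 for basket in orders.values() if items <= basket)

def toy_confidence(antecedent, consequent):
    """Confidence(antecedent > consequent) as a fraction between 0 and 1."""
    return count_orders(antecedent | consequent) / count_orders(antecedent)

conf_a_b = toy_confidence({"A"}, {"B"})        # 2 of the 4 orders with A also contain B
conf_a_bc = toy_confidence({"A"}, {"B", "C"})  # 1 of the 4 orders with A contains B and C
conf_ab_c = toy_confidence({"A", "B"}, {"C"})  # 1 of the 2 orders with A and B contains C
print(conf_a_b, conf_a_bc, conf_ab_c)
```

Note that Confidence(A > B&C) and Confidence(A&B > C) share the same numerator and differ only in the denominator.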

### Data Processing

In [1]:
# Importing needed packages and dataset
import pandas as pd
from math import floor
association = pd.read_csv('association_order.csv')
rules = pd.read_csv('rules.csv')

In [2]:
association.head()

Unnamed: 0,orderid,itemid
0,31379820545759,719740607
1,31378575577269,1825360194
2,31369591568249,1108903291
3,31369836201769,4507360843
4,31372360246729,1821888475


In [3]:
rules.head()

Unnamed: 0,rule
0,100242812>80361758
1,100242812>89031406
2,1003153762>1016449477
3,1006024995>2727415265
4,1006024995>866012366
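Each rule is stored as a string with the antecedent to the left of `>` and the consequent to the right. The sketch below shows one way to split such a string into lists of item ids. `parse_rule` is a hypothetical helper name, and the three-product example uses made-up ids, assuming that `&` separates products on the same side of a rule as in the confidence equations above.

```python
def parse_rule(rule):
    """Split a rule string into (antecedent, consequent) lists of integer item ids.

    `parse_rule` is a hypothetical helper used only for this illustration.
    """
    left, right = rule.split(">")
    antecedent = [int(item) for item in left.split("&")]
    consequent = [int(item) for item in right.split("&")]
    return antecedent, consequent

two_item = parse_rule("100242812>80361758")  # rule taken from the output above
three_item = parse_rule("1&2>3")             # made-up ids showing the '&' separator
print(two_item)
print(three_item)
```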


In [4]:
# Map each item id to the set of order ids that contain it
group_itemid = association.groupby('itemid')['orderid'].apply(set)

In [5]:
group_itemid.head()

itemid
103572                    {31347710941008}
103580                    {31338334007720}
108696                    {31372952019978}
240094                    {31380092284102}
262532    {31362741095032, 31379924542443}
Name: orderid, dtype: object
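Storing one set of order ids per item means that "orders containing both items" becomes a plain set intersection. The following is a minimal sketch on a tiny made-up frame; the `toy` data and variable names are assumptions and only mimic the two columns used above.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as association_order.csv
toy = pd.DataFrame({
    "orderid": [10, 10, 11, 12, 12],
    "itemid": [1, 2, 1, 2, 3],
})

# One entry per item id, holding the set of order ids that contain it
toy_group = toy.groupby("itemid")["orderid"].apply(set)

# Orders that contain both item 1 and item 2
both = toy_group[1] & toy_group[2]
print(toy_group.to_dict())
print(both)
```

The size of such an intersection is exactly the kind of count that appears in the numerator and denominator of the confidence equations.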

In [6]:
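# Confidence of a rule such as 'A>B', 'A>B&C' or 'A&B>C', following the equations above
def confidence(rule):
    # Split into antecedent (left of '>') and consequent (right of '>') item ids
    antecedent, consequent = [side.split('&') for side in rule.split('>')]
    # Denominator: orders containing every antecedent item
    denominator = set(group_itemid[int(antecedent[0])])
    for item in antecedent[1:]:
        denominator = denominator.intersection(group_itemid[int(item)])
    # Numerator: of those orders, the ones that also contain every consequent item
    numerator = set(denominator)
    for item in consequent:
        numerator = numerator.intersection(group_itemid[int(item)])
    # Scale the ratio by 1000 and round down to an integer
    return floor(len(numerator)/len(denominator)*1000)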

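The **floor** function returns the largest integer that is less than or equal to its argument. Here the confidence ratio is multiplied by 1000 before flooring, so each rule's confidence is stored as an integer between 0 and 1000.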
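To make the effect of flooring concrete, here is a minimal sketch with made-up counts. The numbers are assumptions chosen only so that flooring and rounding disagree. The integer-only variant at the end is an alternative shown for comparison, not the method used in this notebook.

```python
from math import floor

# Hypothetical counts, chosen only for illustration
orders_with_a = 17
orders_with_a_and_b = 8

ratio = orders_with_a_and_b / orders_with_a  # about 0.4706
scaled = floor(ratio * 1000)                 # fractional part is dropped, not rounded
rounded = round(ratio * 1000)                # rounding would give a different value here

# Integer-only variant (an alternative, not the notebook's method):
# floor division on integers avoids floating-point arithmetic entirely
scaled_int = orders_with_a_and_b * 1000 // orders_with_a
print(scaled, rounded, scaled_int)
```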

In [7]:
rules['confidence'] = rules['rule'].apply(confidence)

In [8]:
rules

Unnamed: 0,rule,confidence
0,100242812>80361758,470
1,100242812>89031406,352
2,1003153762>1016449477,388
3,1006024995>2727415265,171
4,1006024995>866012366,171
...,...,...
14233,995073047>3202007524,102
14234,995073047>651958908,130
14235,995073047>7902698606,65
14236,995073047>922394800,56


In [9]:
# Creating submission file
rules.to_csv('submissionMB.csv',index=False)

References:

[https://www.kaggle.com/dimaskuncoro/market-basket-accuracy-1-0]

[https://www.tutorialgateway.org/math-floor-in-python/]