<h1 id="basics" style="font-family:verdana;"> 
    <center> Association Rules Learning for Online Retail Dataset
    </center>
</h1>
<div style="width:100%;text-align: center;"> <img align=middle src="https://i0.wp.com/analyticsarora.com/wp-content/uploads/2022/06/association-rule-learning-visual-example-ml-interview.png?w=800&ssl=1" alt="ARL" style="height:500px;margin-top:2rem;"> </div>



This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.


## Main topics of the study can be seen below:

* [Aim of the study](#section-one)
* [Understanding the data](#section-two)
* [Preparation of data](#section-three)
* [Invoice - Product Matrix](#section-four)
* [Preparing Association Rule Learning Data Structures](#section-five)
* [Product Recommendation](#section-six)
* [Conclusion](#section-seven)


<a id="section-one"></a>
## 1. Aim of the Study

The main purpose of the study is to find relationships between the products that customers buy: when product X is ordered, which product Y tends to be ordered with it, according to the Online Retail data. The data covers two periods, 2009 - 2010 and 2010 - 2011. The rest of this part explains Association Rule Learning (ARL) as simply as possible.

Association Rule Learning is a machine learning technique used to identify relationships and patterns in large datasets. It is a process of discovering frequent patterns and associations between items or attributes in a dataset.

In simple terms, Association Rule Learning aims to identify patterns that indicate the co-occurrence of items in a given dataset. For example, it can help identify that customers who buy bread are likely to buy butter as well, or that people who watch a certain TV show are more likely to watch a specific movie.

The process of Association Rule Learning involves two main steps:

- Finding frequent itemsets:
The first step is to identify sets of items that appear frequently together in the dataset. This is achieved by applying a support threshold, which sets the minimum share of transactions (or, equivalently, the minimum number of occurrences) in which an itemset must appear to be considered frequent.

- Generating association rules:
The second step involves generating association rules from the frequent itemsets identified in step one. Association rules are conditional statements of the form "if X is in the basket, then Y is likely to be in it too". These rules are evaluated based on their confidence, which estimates how often the consequent appears in transactions that contain the antecedent. Lift, which compares that confidence with the overall frequency of the consequent, is commonly used alongside it.
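
The two steps above rest on a few simple ratios. As a minimal sketch, assuming a made-up set of four baskets (the items and the helper names `support` and `rule_metrics` are illustrative and not part of this study), support, confidence and lift for the rule "bread -> butter" can be computed by hand:

```python
# Toy baskets (hypothetical data, used only for illustration)
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    sup_both = support(set(antecedent) | set(consequent), transactions)
    sup_antecedent = support(antecedent, transactions)
    sup_consequent = support(consequent, transactions)
    confidence = sup_both / sup_antecedent   # P(consequent | antecedent)
    lift = confidence / sup_consequent       # values above 1 suggest a positive association
    return {"support": sup_both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bread"}, {"butter"}, baskets)
print(metrics)
```

Here bread and butter appear together in 2 of 4 baskets, and bread appears in 3 of 4, so the confidence is 2/3 and the lift is (2/3) / (1/2).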

Association Rule Learning has several applications in various fields, including market basket analysis, recommender systems, and web mining. It can help businesses understand customer behavior and preferences, make recommendations to customers, and improve marketing strategies.

<div style="width:100%;text-align: center;"> <img align=middle src="https://sp-ao.shortpixel.ai/client/to_auto,q_glossy,ret_img,w_1536/https://pianalytix.com/wp-content/uploads/2020/11/Association-Rule-1536x640.jpg" alt="ARL" style="height:300px;margin-top:1rem;"> </div>

<a id="section-two"></a>
## 2. Understanding the Data

First of all, we should import the libraries that will be used throughout the analysis.

In [1]:
# Let's import the libraries

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.width", 500)
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: '%.5f' % x)

In [2]:
# Let's import the dataset

df_ = pd.read_csv("/kaggle/input/online-retail-ii-uci/online_retail_II.csv")
df = df_.copy()

In [3]:
# The "check_df" function can be used to get an overview of the data and decide how it should be prepared.

def check_df(dataframe, head=10):
    print("########## First 10 Data #############")
    print(dataframe.head(head))
    print("########## Info #############")
    print(dataframe.info())
    print("########## Statistical Data #############")
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
    print("########## Null Data #############")
    print(dataframe.isnull().sum())
    print("########## Variable Types #############")
    print(dataframe.dtypes)
    
check_df(df)

########## First 10 Data #############
  Invoice StockCode                          Description  Quantity          InvoiceDate   Price  Customer ID         Country
0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12  2009-12-01 07:45:00 6.95000  13085.00000  United Kingdom
1  489434    79323P                   PINK CHERRY LIGHTS        12  2009-12-01 07:45:00 6.75000  13085.00000  United Kingdom
2  489434    79323W                  WHITE CHERRY LIGHTS        12  2009-12-01 07:45:00 6.75000  13085.00000  United Kingdom
3  489434     22041         RECORD FRAME 7" SINGLE SIZE         48  2009-12-01 07:45:00 2.10000  13085.00000  United Kingdom
4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24  2009-12-01 07:45:00 1.25000  13085.00000  United Kingdom
5  489434     22064           PINK DOUGHNUT TRINKET POT         24  2009-12-01 07:45:00 1.65000  13085.00000  United Kingdom
6  489434     21871                  SAVE THE PLANET MUG        24  2009-12-01 07:45:0

Before starting the analysis, note that, according to the summary above, the dataset has 8 variables. Let's check them:

1. Invoice: Unique number of each invoice. Codes containing "C" mark cancelled orders.
2. StockCode: Unique code of each product.
3. Description: Name of the product sold by the company.
4. Quantity: Number of units of the product on each invoice line.
5. InvoiceDate: Date and time of the invoice.
6. Price: Unit price of the product.
7. Customer ID: Unique ID of each customer.
8. Country: Country of the customer.

According to the null-value summary, Customer ID is missing in 135080 rows. These rows should be removed from the data in the next section.

Also, the descriptive statistics of the numerical variables show negative Price or Quantity values. These typically come from orders cancelled by the customers. We should also eliminate these rows to obtain meaningful results during the analysis.

<a id="section-three"></a>
## 3. Preparation of the Data

In this stage, any rows with null values are dropped from the dataset. Outliers are then capped using custom helper functions.

In [4]:
# Before the analysis we have to handle the outliers in the data, especially in the "Quantity" and "Price" columns.

def outlier_thresholds(dataframe, variable):
    # Compute the lower and upper limits beyond which a value is treated as an outlier.
    # Q1 and Q3 are normally the 0.25 and 0.75 quantiles, but in this project we use 0.01 and 0.99.
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

In [5]:
# Values outside the lower and upper limits calculated above are replaced with those limits.

def replace_with_thresholds(dataframe, variable):
    # Cap outliers at the lower and upper limits (modifies the dataframe in place)
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit
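
To see what the two helper functions above do to a column, here is a minimal sketch in plain Python on made-up quantities. The data and the names `quantile` and `cap_outliers` are illustrative assumptions, not part of the notebook; the quantile helper assumes the linear interpolation that pandas uses by default.

```python
def quantile(values, q):
    """Linearly interpolated quantile, mirroring the pandas default rule."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def cap_outliers(values, q_low=0.01, q_high=0.99):
    """Cap values outside [low_limit, up_limit], using the 1% / 99% quantiles."""
    low_q = quantile(values, q_low)
    high_q = quantile(values, q_high)
    spread = high_q - low_q
    low_limit = low_q - 1.5 * spread
    up_limit = high_q + 1.5 * spread
    capped = [min(max(v, low_limit), up_limit) for v in values]
    return capped, (low_limit, up_limit)


# Hypothetical quantities: 1..100 plus one extreme order of 10000 units
quantities = list(range(1, 101)) + [10000]
capped, (low_limit, up_limit) = cap_outliers(quantities)
print(low_limit, up_limit, capped[-1])
```

With such wide quantiles only the extreme order is pulled down to the upper limit, while the ordinary quantities stay unchanged.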

In [6]:
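# This function removes the null, cancelled, negative and otherwise meaningless rows from the dataset.

def retail_data_prep(dataframe):
    # Return a new object instead of dropping rows from the caller's dataframe in place
    dataframe = dataframe.dropna()
    # Invoices containing "C" are cancellations
    dataframe = dataframe[~dataframe["Invoice"].astype(str).str.contains("C", na=False)]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    # Copy after filtering so the threshold replacement writes to an independent dataframe
    dataframe = dataframe[dataframe["Price"] > 0].copy()
    replace_with_thresholds(dataframe, "Quantity")
    replace_with_thresholds(dataframe, "Price")
    return dataframe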

In [7]:
# Let's finalize the dataset before starting ARL.

df = retail_data_prep(df)

<a id="section-four"></a>
## 4. Invoice - Product Matrix

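In this part, the Invoice - Product matrix is created. It is the input that the "apriori" function expects: one row per invoice, one column per product, and a 1 wherever the product appears on the invoice.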

<div style="width:100%;text-align: center;"> <img align=middle src="https://miro.medium.com/max/497/1*9J50LPtmb0fcgR5FhnDljQ.png" alt="ARL" style="height:300px;margin-top:1rem;"> </div>

In [8]:
# We will work on the orders from France in this study. Another country can be chosen to repeat the calculation.


df_fr = df[df["Country"] == "France"]

df_fr.groupby(["Invoice", "Description"]).agg({"Quantity": "sum"}).head(20)

df_fr.groupby(["Invoice", "Description"]).agg({"Quantity": "sum"}).unstack().iloc[0:5,0:5] # Important function!

df_fr.groupby(["Invoice", "Description"]).agg({"Quantity": "sum"}).unstack().fillna(0).iloc[0:5,0:5]

df_fr.groupby(["Invoice", "StockCode"]).\
    agg({"Quantity": "sum"}).\
    unstack().fillna(0).applymap(lambda x: 1 if x > 0 else 0).iloc[0:5, 0:5]

          Quantity
StockCode    10002  10120  10123C  10123G  10125
Invoice
489439           0      0       0       0      0
489557           0      0       0       0      0
489883           0      0       0       0      0
490139           0      0       0       0      0
490152           0      0       0       0      0
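
The chain of `groupby`, `unstack`, `fillna` and the 0/1 mapping can be hard to picture. The following plain-Python sketch reproduces the same idea on a handful of made-up order lines; the invoice numbers, stock codes and the name `invoice_product_matrix` are hypothetical and only serve as an illustration.

```python
from collections import defaultdict

# Hypothetical order lines: (invoice, stock_code, quantity)
order_lines = [
    ("INV1", "A", 12),
    ("INV1", "B", 6),
    ("INV1", "A", 3),   # same product twice on one invoice: quantities are summed
    ("INV2", "B", 1),
    ("INV3", "C", 24),
]


def invoice_product_matrix(lines):
    """Return {invoice: {stock_code: 0 or 1}} with 1 where the product was bought."""
    totals = defaultdict(int)
    for invoice, stock_code, quantity in lines:
        totals[(invoice, stock_code)] += quantity       # groupby(...).sum()
    invoices = sorted({invoice for invoice, _ in totals})
    products = sorted({code for _, code in totals})
    return {
        invoice: {
            # a missing pair counts as 0 (fillna(0)); a positive total becomes 1
            code: int(totals.get((invoice, code), 0) > 0)
            for code in products
        }
        for invoice in invoices
    }


matrix = invoice_product_matrix(order_lines)
for invoice, row in matrix.items():
    print(invoice, row)
```

Each row of the result is one basket, which is the shape that frequent-itemset mining works on.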


In [9]:
# This function can be used to create the invoice-product matrix.

def create_invoice_product_df(dataframe, id=False):
    if id:
        return dataframe.groupby(['Invoice', "StockCode"])['Quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)
    else:
        return dataframe.groupby(['Invoice', 'Description'])['Quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)

fr_inv_pro_df = create_invoice_product_df(df_fr, id = True)

In [10]:
# From here on we will continue with the "StockCode". However, if the user wants to check the description of a StockCode, this function can be used.

def check_id(dataframe, stock_code):
    product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0].tolist()
    print(product_name)


<a id="section-five"></a>
## 5. Preparing Association Rule Learning Data Structures


In [11]:
# The "apriori" function finds the frequent itemsets in the invoice-product matrix, and "association_rules" derives the rules from them.

frequent_itemsets = apriori(fr_inv_pro_df,
                            min_support=0.01,
                            use_colnames=True)

frequent_itemsets.sort_values("support", ascending=False)

rules = association_rules(frequent_itemsets,
                          metric="support",
                          min_threshold=0.01)
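
For intuition about what the frequent-itemset step does, here is a minimal level-wise sketch of the Apriori idea in plain Python. It is an illustration under simplifying assumptions (toy baskets, a hypothetical helper named `apriori_sketch`) and not the way mlxtend implements it internally.

```python
from itertools import combinations


def apriori_sketch(baskets, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that are frequent enough
        level = {}
        for candidate in candidates:
            share = sum(candidate <= basket for basket in baskets) / n
            if share >= min_support:
                level[candidate] = share
        frequent.update(level)
        # Build candidates one item larger; every subset of the previous
        # size must itself be frequent (the Apriori pruning rule)
        size += 1
        merged = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [
            c for c in merged
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical baskets, used only for illustration
toy_baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
toy_frequent = apriori_sketch(toy_baskets, min_support=0.5)
for itemset, share in sorted(toy_frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), share)
```

The pruning rule is what keeps the search manageable: a pair is only counted if both of its items are frequent on their own, and so on for larger sets.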


In [12]:
rules[(rules["support"]>0.05) & (rules["confidence"]>0.1) & (rules["lift"]>5)]

check_id(df_fr, "22352")

['LUNCHBOX WITH CUTLERY RETROSPOT ']


In [13]:
rules[(rules["support"]>0.05) & (rules["confidence"]>0.1) & (rules["lift"]>5)]. \
sort_values("confidence", ascending=False)


                antecedents    consequents  antecedent support  consequent support  support  confidence     lift  leverage  conviction
47446  (21080, POST, 21086)        (21094)             0.07818             0.12704  0.07492     0.95833  7.54380   0.06499    20.95114
14759        (21080, 21094)        (21086)             0.09609             0.13844  0.09121     0.94915  6.85623   0.07790    16.94408
14758        (21080, 21086)        (21094)             0.09609             0.12704  0.09121     0.94915  7.47153   0.07900    17.16830
47448  (21080, POST, 21094)        (21086)             0.07980             0.13844  0.07492     0.93878  6.78127   0.06387    14.07220
1629                (21094)        (21086)             0.12704             0.13844  0.11564     0.91026  6.57526   0.09805     9.60028
...                     ...            ...                 ...                 ...      ...         ...      ...       ...         ...
5104                (22629)        (22631)             0.13029             0.08795  0.06026     0.46250  5.25880   0.04880     1.69684
639                 (20724)        (22356)             0.13681             0.08469  0.05863     0.42857  5.06044   0.04705     1.60179
30433               (22629)  (22630, POST)             0.13029             0.07329  0.05537     0.42500  5.79889   0.04583     1.61167
30481               (22629)  (POST, 22631)             0.13029             0.07166  0.05049     0.38750  5.40739   0.04115     1.51566
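
The columns of this table are tied together by simple formulas. As a sketch, the rule (21094) -> (21086) from the output above can be re-derived from its three support values; the helper name `derived_metrics` is an assumption for illustration, and the results should match the table only up to the rounding of the displayed inputs.

```python
def derived_metrics(antecedent_support, consequent_support, support):
    """Recompute confidence, lift, leverage and conviction from the supports."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    # Difference between the observed joint support and the one expected
    # if antecedent and consequent were independent
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite for a rule that never fails (confidence == 1)
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Values copied from the row with index 1629 in the output above
row_1629 = derived_metrics(
    antecedent_support=0.12704,
    consequent_support=0.13844,
    support=0.11564,
)
for name, value in row_1629.items():
    print(f"{name}: {value:.5f}")
```

Read this way, a lift of about 6.6 means the two products appear together roughly six and a half times more often than would be expected if they were bought independently.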


<a id="section-six"></a>
## 6. Product Recommendation

<div style="width:100%;text-align: center;"> <img align=middle src="https://camo.githubusercontent.com/65663a743ab156da04afcb9783a4571af3968929d46c893edb78680e54c243e3/68747470733a2f2f692e68697a6c69726573696d2e636f6d2f68677272366b332e6a7067" alt="ARL" style="height:300px;margin-top:1rem;"> </div>

In [14]:
# Let's see a specific product_id's relationship with the other products.

product_id = "22492"
check_id(df, product_id)

sorted_rules = rules.sort_values("lift", ascending=False)

recommendation_list = []


['MINI PAINT SET VINTAGE ']


In [15]:
def arl_recommender(rules_df, product_id, rec_count=1):
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for i, product in enumerate(sorted_rules["antecedents"]):
        for j in list(product):
            if j == product_id:
                recommendation_list.append(list(sorted_rules.iloc[i]["consequents"])[0])

    return recommendation_list[0:rec_count]

In [16]:
arl_recommender(rules, "22492", 1)
arl_recommender(rules, "22492", 2)
arl_recommender(rules, "22492", 3)

for each in arl_recommender(rules, "22492", 3):
    check_id(df, each)
    
arl_recommender(rules, "22492", 3)

['SPACEBOY LUNCH BOX ']
['SET/20 RED SPOTTY PAPER NAPKINS ']
['SET/20 RED SPOTTY PAPER NAPKINS ']


['22629', '21080', '21080']
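
The output above contains the same product twice, because several rules can share a consequent. One possible variant, sketched here as an assumption rather than as part of the original study, skips products that were already suggested. To stay independent of pandas it works on a list of rule dictionaries, which could be obtained with `rules.to_dict("records")`; the name `recommend_unique` and the toy rules are hypothetical.

```python
def recommend_unique(rule_records, product_id, rec_count=1):
    """Recommend distinct consequents of rules whose antecedents contain product_id."""
    ranked = sorted(rule_records, key=lambda rule: rule["lift"], reverse=True)
    picks = []
    for rule in ranked:
        if product_id in rule["antecedents"]:
            # sorted() keeps the order deterministic for multi-item consequents
            for item in sorted(rule["consequents"]):
                if item not in picks:
                    picks.append(item)
    return picks[:rec_count]


# Hypothetical rules, used only for illustration
toy_rules = [
    {"antecedents": frozenset({"A"}), "consequents": frozenset({"B"}), "lift": 6.0},
    {"antecedents": frozenset({"A", "C"}), "consequents": frozenset({"B"}), "lift": 5.0},
    {"antecedents": frozenset({"A"}), "consequents": frozenset({"D", "POST"}), "lift": 4.0},
    {"antecedents": frozenset({"C"}), "consequents": frozenset({"E"}), "lift": 9.0},
]
toy_recommendations = recommend_unique(toy_rules, "A", rec_count=3)
print(toy_recommendations)
```

Unlike `arl_recommender`, this variant also considers every item of a multi-item consequent instead of only the first one, so its results can differ from the output shown above.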

<a id="section-seven"></a>
## 7. Conclusion

In conclusion, Association Rule Learning is a powerful technique for identifying relationships and patterns in large datasets. The Online Retail Dataset is a real-world example of a dataset that can benefit from Association Rule Learning techniques. Using this dataset, we can identify frequent itemsets and generate association rules that reveal patterns in customer behavior.

By applying Association Rule Learning to the Online Retail Dataset, we can gain insights into which products are frequently purchased together, which can help businesses identify cross-selling opportunities and improve their marketing strategies. For example, we might discover that customers who purchase a specific product are likely to also purchase another related product, and use this information to suggest complementary products to customers during their shopping experience.

Overall, Association Rule Learning is a valuable tool for gaining insights into complex datasets such as the Online Retail Dataset, and can help businesses make data-driven decisions to improve customer satisfaction and increase sales.


## Keep in Touch!

You can follow my other social media accounts to see more work like this!

1. [GitHub](https://github.com/KeskinHakan)
2. [LinkedIn](https://www.linkedin.com/in/hakan-keskin-/)
3. [Medium](https://medium.com/@hakan-keskin)
