
Importing libraries

In [16]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Reading Data & Checking data

In [17]:
df = pd.read_csv('/content/Online_Retail.csv',encoding='latin1')
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 08:26,3.39,17850.0,United Kingdom


Size of the data (rows and columns)

In [18]:
df.shape

(197732, 8)

Checking the data types

In [19]:
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

Describing the data

In [20]:
df.describe()

,Quantity,UnitPrice,CustomerID
count,197732.0,197731.0,139871.0
mean,9.275626,5.074364,15279.012926
std,241.859574,96.986985,1726.936375
min,-74215.0,0.0,12346.0
25%,1.0,1.25,13824.0
50%,3.0,2.1,15157.0
75%,10.0,4.21,16813.0
max,74215.0,16888.02,18283.0


**Data Cleaning**


*   First, some of the descriptions have leading or trailing spaces that need to be removed.
*   We’ll also drop the rows that don’t have invoice numbers.
*   Remove the credit transactions (those with invoice numbers containing C).

In [21]:
df['Description'] = df['Description'].str.strip()  # remove leading and trailing spaces
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)  # drop rows that don't have invoice numbers
df['InvoiceNo'] = df['InvoiceNo'].astype('str')  # convert InvoiceNo to string
df = df[~df['InvoiceNo'].str.contains('C')]  # drop rows whose invoice number contains C (credit transactions)
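
To see what these cleaning steps do, here is a minimal sketch on a tiny made-up frame. The names `toy` and `cleaned` and all the rows are hypothetical; only the three steps mirror the cell above.

```python
import pandas as pd

# Hypothetical rows: one padded description, one credit invoice, one missing invoice number.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", None, "536370"],
    "Description": [" WHITE METAL LANTERN ", "DISCOUNT", "POSTAGE", "ALARM CLOCK"],
})

cleaned = (
    toy.dropna(subset=["InvoiceNo"])  # drop rows without an invoice number
       .assign(
           InvoiceNo=lambda d: d["InvoiceNo"].astype(str),       # make sure invoice numbers are strings
           Description=lambda d: d["Description"].str.strip(),   # trim surrounding spaces
       )
)
cleaned = cleaned[~cleaned["InvoiceNo"].str.contains("C")]  # remove credit transactions

print(cleaned)
```

Only the two ordinary invoices remain, and the first description no longer has surrounding spaces.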

Checking the size of the data (rows and columns) after cleaning

In [22]:
df.shape

(194168, 8)

After the cleanup, we need to consolidate the items into one transaction per row, with each product one-hot encoded. To keep the data set small, I’m only looking at sales for France.

In [23]:
basket = (df[df['Country'] == "France"]  # keep only rows where Country is France
          .groupby(['InvoiceNo', 'Description'])['Quantity']  # group Quantity by invoice and product
          .sum().unstack().reset_index().fillna(0)  # sum quantities, pivot products into columns, replace NaN with 0
          .set_index('InvoiceNo'))  # use the invoice number as the index

basket.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,20 DOLLY PEGS RETROSPOT,3 HOOK HANGER MAGIC GARDEN,...,WRAP I LOVE LONDON,WRAP POPPIES DESIGN,WRAP RED APPLES,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,ZINC STAR T-LIGHT HOLDER,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
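
The groupby/unstack chain is easier to follow on a few rows. The sketch below applies the same chain to a hypothetical three-product frame (`toy` and `toy_basket` are made-up names, and the rows are invented for illustration).

```python
import pandas as pd

# Hypothetical line items: invoice 1002 contains PEN on two separate lines.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1002", "1002"],
    "Description": ["PEN", "CUP", "PEN", "PEN", "PLATE"],
    "Quantity": [2, 1, 3, 1, 6],
})

toy_basket = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
       .sum()                 # total quantity per (invoice, product)
       .unstack()             # products become columns, invoices become rows
       .reset_index()
       .fillna(0)             # products absent from an invoice get 0
       .set_index("InvoiceNo")
)

print(toy_basket)
```

Each invoice becomes one row, the two PEN lines of invoice 1002 are summed to 4, and products missing from an invoice show up as 0.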


Size of the basket data (rows and columns)

In [24]:
basket.shape

(137, 860)

There are a lot of zeros in the data, but we also need to make sure any positive values are converted to 1 and anything less than or equal to 0 is set to 0.

In [25]:
# Convert values <= 0 to 0 and any positive value to 1.
def encode_units(x):
    if x <= 0:
        return 0
    return 1

# Apply the function to every cell using applymap.
basket_sets = basket.applymap(encode_units)
basket_sets.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,20 DOLLY PEGS RETROSPOT,3 HOOK HANGER MAGIC GARDEN,...,WRAP I LOVE LONDON,WRAP POPPIES DESIGN,WRAP RED APPLES,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,ZINC STAR T-LIGHT HOLDER,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
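
As an alternative to a cell-by-cell function, the same 0/1 encoding can be sketched with a vectorized comparison. This is not what the notebook does, only an equivalent shortcut when quantities are whole numbers; `toy_basket` and `toy_sets` are hypothetical names.

```python
import pandas as pd

# Hypothetical basket with a positive, a zero and a negative quantity.
toy_basket = pd.DataFrame(
    {"CUP": [1.0, 0.0, -2.0], "PEN": [2.0, 4.0, 0.0]},
    index=pd.Index(["1001", "1002", "1003"], name="InvoiceNo"),
)

# True where the product was bought at least once, then cast to 0/1.
toy_sets = (toy_basket > 0).astype(int)

print(toy_sets)
```

Dropping `.astype(int)` leaves a boolean frame, which recent mlxtend releases appear to prefer as input to `apriori`; treat that as something to check against your installed version.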


Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7%, that is, item sets that appear in at least 7% of the invoices (this number was chosen so that I could get enough useful examples).

In [26]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
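
To make the idea of "support" and the level-wise search concrete, here is a small pure-Python sketch of the Apriori idea on made-up transactions. It is not mlxtend's implementation; the function name `frequent_itemsets_sketch` and the toy data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    found = {}
    k = 1
    while candidates:
        frequent = {}
        for c in candidates:
            # Support = fraction of transactions that contain every item of the candidate.
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                frequent[c] = support
        found.update(frequent)
        k += 1
        # Join frequent itemsets into candidates that are one item larger.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Apriori pruning: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
    return found

toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}, {"A", "B", "C"}]
found = frequent_itemsets_sketch(toy, min_support=0.5)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a threshold of 0.5, the pair {B, C} (support 0.4) is dropped, so the triple {A, B, C} is pruned without ever being counted.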

The final step is to generate the rules with their corresponding support, confidence and lift.

In [27]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(4 TRADITIONAL SPINNING TOPS),(POSTAGE),0.109489,0.766423,0.094891,0.866667,1.130794,0.010976,1.751825
1,(POSTAGE),(4 TRADITIONAL SPINNING TOPS),0.766423,0.109489,0.094891,0.12381,1.130794,0.010976,1.016344
2,(POSTAGE),(BAKING SET 9 PIECE RETROSPOT),0.766423,0.087591,0.072993,0.095238,1.087302,0.005861,1.008452
3,(BAKING SET 9 PIECE RETROSPOT),(POSTAGE),0.087591,0.766423,0.072993,0.833333,1.087302,0.005861,1.40146
4,(CHARLOTTE BAG DOLLY GIRL DESIGN),(POSTAGE),0.080292,0.766423,0.072993,0.909091,1.186147,0.011455,2.569343
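
The metric columns can be reproduced by hand for the first rule, (4 TRADITIONAL SPINNING TOPS) → (POSTAGE). The counts below are not in the notebook output; they are inferred from the supports in the table under the assumption that all 137 French invoices were used (for example, 0.109489 × 137 ≈ 15).

```python
# Inferred counts (assumption: 137 transactions, as in basket.shape).
n_transactions = 137
n_antecedent = 15    # invoices containing 4 TRADITIONAL SPINNING TOPS
n_consequent = 105   # invoices containing POSTAGE
n_both = 13          # invoices containing both

antecedent_support = n_antecedent / n_transactions
consequent_support = n_consequent / n_transactions
support = n_both / n_transactions

# confidence: how often the consequent appears when the antecedent does
confidence = support / antecedent_support
# lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: ratio of expected to observed frequency of the rule being wrong
conviction = (1 - consequent_support) / (1 - confidence)

print(f"support={support:.6f} confidence={confidence:.6f} lift={lift:.6f}")
print(f"leverage={leverage:.6f} conviction={conviction:.6f}")
```

A lift close to 1, as here, means the two items appear together about as often as would be expected if they were independent.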


We can now filter the rules by lift and confidence. This part of the analysis is where domain knowledge comes in handy. Since we do not have that, we'll just look for a couple of illustrative examples. We can filter the DataFrame using standard pandas code. In this case, we look for a large lift (at least 6) and high confidence (at least 0.8).

In [28]:
rules[(rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8)]

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
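
When a strict filter comes back empty, ranking the rules is one way to see what the data does contain. The sketch below rebuilds only the five rules printed by `rules.head()` (the full `rules` frame may hold more), with antecedents and consequents simplified to plain strings instead of mlxtend's frozensets. The names `shown`, `strict` and `ranked` are hypothetical.

```python
import pandas as pd

# The five rules shown earlier, with only the columns needed for filtering.
shown = pd.DataFrame({
    "antecedents": [
        "4 TRADITIONAL SPINNING TOPS", "POSTAGE", "POSTAGE",
        "BAKING SET 9 PIECE RETROSPOT", "CHARLOTTE BAG DOLLY GIRL DESIGN",
    ],
    "consequents": [
        "POSTAGE", "4 TRADITIONAL SPINNING TOPS", "BAKING SET 9 PIECE RETROSPOT",
        "POSTAGE", "POSTAGE",
    ],
    "confidence": [0.866667, 0.123810, 0.095238, 0.833333, 0.909091],
    "lift": [1.130794, 1.130794, 1.087302, 1.087302, 1.186147],
})

# Same thresholds as the cell above: nothing passes.
strict = shown[(shown["lift"] >= 6) & (shown["confidence"] >= 0.8)]

# Rank instead of filtering to see the strongest rules available.
ranked = shown.sort_values(["lift", "confidence"], ascending=False)

print(len(strict))
print(ranked.head(3))
```

On the real frame, the equivalent call would be `rules.sort_values(["lift", "confidence"], ascending=False)`.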


**Conclusion**

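In this run, the filter above returns no rules: none of the generated rules reaches both a lift of at least 6 and a confidence of at least 0.8. The five rules shown earlier all involve POSTAGE, which appears in about 77% of the French invoices, and their lift is only slightly above 1 (roughly 1.09 to 1.19). The output therefore does not show that the green and red alarm clocks, or the red paper cups, napkins and plates, are purchased together more often than the overall probability would suggest. Lowering min_support or the filter thresholds, or using a larger portion of the data, may surface such rules.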

