**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

**Note**: You are **ALLOWED** to use Generative AI for this notebook, but you must properly cite your usage. Be sure to review the syllabus for details on citation requirements and the consequences of failing to cite your sources correctly or simply copy-pasting without meaningful engagement.

# Analysis of Purchases and Product Recommendations

In this notebook, we will work with a dataset of purchases, where each row represents a customer's purchase of a product.

Most column names are self-explanatory. The "Amount" column indicates the quantity of items sold.

## Load the Dataset

In [62]:
### DO NOT CHANGE THIS CODE###

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


url = "https://drive.google.com/uc?export=download&id=1JbXO4RgolhHz1FmVDJpTyFUV4oSEagjA"


# Read the CSV file into a DataFrame and make some edits.
df = pd.read_csv(url, on_bad_lines='skip', encoding="latin1")
df = df.drop('Cust_name', axis=1)
df = df.drop('Product_ID', axis=1)
df.Age = df.Age.astype('int64')
df.Orders = df.Orders.astype('int64')

# Display the first few rows
print(df.dtypes)
df


User_ID               int64
Gender               object
Age Group            object
Age                   int64
Marital_Status        int64
State                object
Zone                 object
Occupation           object
Product_Category     object
Orders                int64
Amount              float64
dtype: object


Unnamed: 0,User_ID,Gender,Age Group,Age,Marital_Status,State,Zone,Occupation,Product_Category,Orders,Amount
0,1002903,F,26-35,28,0,Maharashtra,Western,Healthcare,Auto,1,23952.0
1,1000732,F,26-35,35,1,Andhra Pradesh,Southern,Govt,Auto,3,23934.0
2,1001990,F,26-35,35,1,Uttar Pradesh,Central,Automobile,Auto,3,23924.0
3,1001425,M,0-17,16,0,Karnataka,Southern,Construction,Auto,2,23912.0
4,1000588,M,26-35,28,1,Gujarat,Western,Food Processing,Auto,2,23877.0
...,...,...,...,...,...,...,...,...,...,...,...
11246,1000695,M,18-25,19,1,Maharashtra,Western,Chemical,Office,4,370.0
11247,1004089,M,26-35,33,0,Haryana,Northern,Healthcare,Veterinary,3,367.0
11248,1001209,F,36-45,40,0,Madhya Pradesh,Central,Textile,Office,4,213.0
11249,1004023,M,36-45,37,0,Karnataka,Southern,Agriculture,Office,3,206.0


In [63]:
df[df['User_ID'] == 1000695]

Unnamed: 0,User_ID,Gender,Age Group,Age,Marital_Status,State,Zone,Occupation,Product_Category,Orders,Amount
3532,1000695,M,18-25,20,1,Uttar Pradesh,Central,Retail,Food,3,11678.0
4465,1000695,M,18-25,22,1,Uttarakhand,Central,Hospitality,Electronics & Gadgets,1,9816.0
7438,1000695,M,18-25,18,0,Punjab,Northern,Automobile,Clothing & Apparel,3,6945.0
9044,1000695,M,18-25,21,0,Haryana,Northern,Healthcare,Clothing & Apparel,3,5146.0
9893,1000695,M,18-25,24,1,Kerala,Southern,Aviation,Clothing & Apparel,3,3643.0
11246,1000695,M,18-25,19,1,Maharashtra,Western,Chemical,Office,4,370.0


## Data Cleaning and Preprocessing **(2 Points)**

For data cleaning and preprocessing, follow these steps:

- Use `SimpleImputer` to fill in missing values with the mean of their respective category. (For example, if "Amount" is missing for an "Auto" purchase, replace it with the average "Amount" of all "Auto" purchases.) **(1 Point)**
- Ensure the "Amount" column is an integer since it represents the number of units sold. **(0.5 Points)**
- Remove any duplicate entries from the dataset. **(0.5 Points)**
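
As a cross-check on the first step, the same per-category mean fill can be sketched with pandas alone. This is a minimal sketch on made-up toy data, assuming only the column names `Product_Category` and `Amount` from the dataset; it is not a substitute for the `SimpleImputer` requirement above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the course dataset).
toy = pd.DataFrame({
    "Product_Category": ["Auto", "Auto", "Auto", "Office", "Office"],
    "Amount": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Mean of each category (NaN values are skipped), broadcast back to every row.
category_mean = toy.groupby("Product_Category")["Amount"].transform("mean")

# Fill only the missing entries with their category mean, then cast to integer.
toy["Amount"] = toy["Amount"].fillna(category_mean).astype("int64")
print(toy)
```

Comparing a result like this with the imputer's output on a few rows is a quick way to confirm that each category was filled with its own mean rather than the global mean.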




In [64]:
from sklearn.impute import SimpleImputer
import numpy as np

# Group data by 'Product_Category'
grouped = df.groupby('Product_Category')

# Iterate through each group and fill missing 'Amount' values with the group's mean
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
for product, group in grouped:
    # fit_transform returns an (n, 1) array; flatten it before assigning to a single column
    df.loc[df['Product_Category'] == product, 'Amount'] = mean_imputer.fit_transform(group[['Amount']]).ravel()
# 'Amount' represents units sold, so store it as an integer
df.Amount = df.Amount.astype('int64')

# Remove duplicate rows
df.drop_duplicates(inplace=True)
df


Unnamed: 0,User_ID,Gender,Age Group,Age,Marital_Status,State,Zone,Occupation,Product_Category,Orders,Amount
0,1002903,F,26-35,28,0,Maharashtra,Western,Healthcare,Auto,1,23952
1,1000732,F,26-35,35,1,Andhra Pradesh,Southern,Govt,Auto,3,23934
2,1001990,F,26-35,35,1,Uttar Pradesh,Central,Automobile,Auto,3,23924
3,1001425,M,0-17,16,0,Karnataka,Southern,Construction,Auto,2,23912
4,1000588,M,26-35,28,1,Gujarat,Western,Food Processing,Auto,2,23877
...,...,...,...,...,...,...,...,...,...,...,...
11246,1000695,M,18-25,19,1,Maharashtra,Western,Chemical,Office,4,370
11247,1004089,M,26-35,33,0,Haryana,Northern,Healthcare,Veterinary,3,367
11248,1001209,F,36-45,40,0,Madhya Pradesh,Central,Textile,Office,4,213
11249,1004023,M,36-45,37,0,Karnataka,Southern,Agriculture,Office,3,206


# Question 1 **(10 Points)**



You have been tasked with sending personalized promotions to customers based on their purchase history to increase sales across different product categories.  

Since the product range is diverse, it's important to match each customer with the most relevant promotions.  

To do this, you will focus on **loyal customers**—those who either:  
1) Frequently make **high-value purchases**, *or*  
2) Place **multiple orders**.  

A customer is considered **loyal** if they have made **more than 20 orders** or bought **more than 80,000 items**.  
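
The loyalty rule is an OR over per-customer totals, so purchases must be aggregated by customer before the thresholds are applied. A minimal sketch on hypothetical toy data (the IDs and values are made up for illustration):

```python
import pandas as pd

# Toy purchases (hypothetical IDs and values).
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 3],
    "Orders":  [15, 10, 3, 2, 4],
    "Amount":  [1000, 2000, 50000, 40000, 500],
})

# Total orders and total amount per customer.
totals = toy.groupby("User_ID")[["Orders", "Amount"]].sum()

# Loyal: more than 20 orders OR more than 80,000 in total Amount.
is_loyal = (totals["Orders"] > 20) | (totals["Amount"] > 80000)
loyal_ids = totals.index[is_loyal].tolist()
print(loyal_ids)
```

In this toy example, customer 1 qualifies through order count and customer 2 through amount, while customer 3 meets neither condition.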

Using your data analytics skills, answer the following question:  

> Based on the *general* purchase patterns of loyal customers, what are the **top two product categories** that you recommend to your most loyal customer, uniquely based on their *personal* purchase history?





In [65]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

### Step 1: Identify Purchases Made by Loyal Customers **(3 Points)**

In [66]:
#Identify loyal customers based on Orders and Amount
loyal_customers = df.groupby('User_ID').agg({
    'Orders': 'sum',
    'Amount': 'sum'
}).reset_index()

# Define a loyal customer as one with more than 20 total orders or a total 'Amount' above 80,000
loyal_customers = loyal_customers[(loyal_customers['Orders'] > 20) | (loyal_customers['Amount'] > 80000)]
loyal_customer_ids = loyal_customers['User_ID']

# Filter the dataset for loyal customers
loyal_df = df[df['User_ID'].isin(loyal_customer_ids)]
loyal_df

Unnamed: 0,User_ID,Gender,Age Group,Age,Marital_Status,State,Zone,Occupation,Product_Category,Orders,Amount
3,1001425,M,0-17,16,0,Karnataka,Southern,Construction,Auto,2,23912
4,1000588,M,26-35,28,1,Gujarat,Western,Food Processing,Auto,2,23877
5,1000588,M,26-35,28,1,Himachal Pradesh,Northern,Food Processing,Auto,1,23877
7,1002092,F,55+,61,0,Maharashtra,Western,IT Sector,Auto,1,20191
8,1003224,M,26-35,35,0,Uttar Pradesh,Central,Govt,Auto,2,23809
...,...,...,...,...,...,...,...,...,...,...,...
11237,1000687,M,26-35,29,1,Haryana,Northern,Media,Office,2,557
11240,1001425,F,0-17,12,0,Delhi,Central,IT Sector,Veterinary,1,396
11241,1003032,F,26-35,33,0,Delhi,Central,Hospitality,Office,3,384
11242,1004344,F,26-35,27,1,Delhi,Central,Healthcare,Office,2,382


### Step 2: Identify Purchase Patterns of Loyal Customers **(4 Points)**

Only consider patterns that appear in at least **5% of all purchases made by loyal customers**.  

This helps focus on meaningful trends rather than rare or one-off purchases.
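
To make the threshold concrete: the support of an itemset is the share of baskets that contain every item in it. In the cells that follow, each basket is one loyal customer's set of purchased categories. The sketch below computes pair supports by hand on hypothetical toy baskets; `pair_supports` is an illustrative helper, not part of `mlxtend`, and the 0.5 threshold is chosen only because the toy data is tiny.

```python
from collections import Counter
from itertools import combinations

# Toy baskets: one set of categories per customer (hypothetical data).
baskets = [
    {"Food", "Clothing & Apparel"},
    {"Food", "Footwear & Shoes"},
    {"Food", "Clothing & Apparel", "Footwear & Shoes"},
    {"Office"},
]

def pair_supports(baskets):
    """Support of each item pair: the share of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

supports = pair_supports(baskets)

# Keep only the pairs that meet the minimum support.
min_support = 0.5
frequent = {pair: s for pair, s in supports.items() if s >= min_support}
print(frequent)
```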

In [67]:
basket = loyal_df.groupby('User_ID')['Product_Category'].apply(lambda x: ','.join(x)).reset_index()
basket

Unnamed: 0,User_ID,Product_Category
0,1000033,"Furniture,Sports Products,Food,Sports Products..."
1,1000036,"Furniture,Food,Electronics & Gadgets,Clothing ..."
2,1000053,"Footwear & Shoes,Auto,Clothing & Apparel,Cloth..."
3,1000148,"Furniture,Food,Footwear & Shoes,Footwear & Sho..."
4,1000151,"Stationery,Tupperware,Food,Food,Food,Books,Ele..."
...,...,...
267,1005837,"Food,Food,Food,Food,Electronics & Gadgets,Elec..."
268,1005954,"Footwear & Shoes,Footwear & Shoes,Food,Food,Fo..."
269,1006000,"Food,Footwear & Shoes,Sports Products,Games & ..."
270,1006016,"Footwear & Shoes,Footwear & Shoes,Footwear & S..."


In [68]:
# Convert the text in the table to a list of items
data_column = basket['Product_Category']
data = list(data_column.apply(lambda x: x.split(',')))

In [69]:
# What are the unique values?
unique_categories = {item for lst in data for item in lst}
unique_categories

{'Auto',
 'Beauty',
 'Books',
 'Clothing & Apparel',
 'Decor',
 'Electronics & Gadgets',
 'Food',
 'Footwear & Shoes',
 'Furniture',
 'Games & Toys',
 'Hand & Power Tools',
 'Household items',
 'Office',
 'Pet Care',
 'Sports Products',
 'Stationery',
 'Tupperware',
 'Veterinary'}

In [70]:
# Transform data
te = TransactionEncoder()
transactions = te.fit(data).transform(data) # or fit_transform(data)

# Create a dataframe from the data
df_encoded = pd.DataFrame(transactions, columns=te.columns_)
df_encoded

Unnamed: 0,Auto,Beauty,Books,Clothing & Apparel,Decor,Electronics & Gadgets,Food,Footwear & Shoes,Furniture,Games & Toys,Hand & Power Tools,Household items,Office,Pet Care,Sports Products,Stationery,Tupperware,Veterinary
0,False,False,False,True,False,False,True,False,True,True,False,False,False,False,True,False,False,False
1,False,False,False,True,False,True,True,False,True,False,False,False,False,False,False,False,False,False
2,True,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False
3,False,False,False,True,False,False,True,True,True,False,False,False,True,True,True,False,False,False
4,False,False,True,False,False,True,True,False,False,False,False,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,False,False,False,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False
268,False,True,False,True,True,True,True,True,False,False,False,False,False,True,False,False,False,False
269,False,False,False,True,False,True,True,True,False,True,False,False,False,False,True,False,False,False
270,False,False,False,True,False,True,True,True,False,False,False,False,True,False,False,False,False,False


In [71]:
# Let's find the most frequent itemsets.
frequent_itemsets = apriori(df_encoded, min_support=0.05, use_colnames=True)
frequent_itemsets.sort_values(by="support")

Unnamed: 0,support,itemsets
75,0.051471,"(Furniture, Pet Care)"
168,0.051471,"(Footwear & Shoes, Office, Food)"
161,0.051471,"(Furniture, Electronics & Gadgets, Sports Prod..."
219,0.051471,"(Furniture, Clothing & Apparel, Electronics & ..."
212,0.051471,"(Veterinary, Clothing & Apparel, Electronics &..."
...,...,...
47,0.716912,"(Electronics & Gadgets, Food)"
5,0.790441,(Electronics & Gadgets)
33,0.794118,"(Clothing & Apparel, Food)"
3,0.875000,(Clothing & Apparel)


In [72]:
#Let's find the rules of interest.
rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.05, num_itemsets=frequent_itemsets.shape[0])
rules = rules[rules["lift"] >  1]
rules.sort_values(by=["support", "confidence"])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
655,(Food),"(Footwear & Shoes, Office)",0.911765,0.055147,0.051471,0.056452,1.023656,1.0,0.001189,1.001383,0.261905,0.056225,0.001381,0.494892
673,(Food),"(Stationery, Footwear & Shoes)",0.911765,0.051471,0.051471,0.056452,1.096774,1.0,0.004542,1.005279,1.000000,0.056452,0.005251,0.528226
1029,(Food),"(Footwear & Shoes, Household items, Beauty)",0.911765,0.051471,0.051471,0.056452,1.096774,1.0,0.004542,1.005279,1.000000,0.056452,0.005251,0.528226
1695,(Food),"(Clothing & Apparel, Electronics & Gadgets, Ho...",0.911765,0.055147,0.051471,0.056452,1.023656,1.0,0.001189,1.001383,0.261905,0.056225,0.001381,0.494892
439,(Clothing & Apparel),"(Footwear & Shoes, Office)",0.875000,0.055147,0.051471,0.058824,1.066667,1.0,0.003217,1.003906,0.500000,0.058577,0.003891,0.496078
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
362,"(Footwear & Shoes, Clothing & Apparel)",(Food),0.547794,0.911765,0.503676,0.919463,1.008443,1.0,0.004217,1.095588,0.018515,0.526923,0.087248,0.735941
35,(Clothing & Apparel),(Footwear & Shoes),0.875000,0.625000,0.547794,0.626050,1.001681,1.0,0.000919,1.002809,0.013423,0.575290,0.002801,0.751261
34,(Footwear & Shoes),(Clothing & Apparel),0.625000,0.875000,0.547794,0.876471,1.001681,1.0,0.000919,1.011905,0.004474,0.575290,0.011765,0.751261
81,(Food),(Footwear & Shoes),0.911765,0.625000,0.580882,0.637097,1.019355,1.0,0.011029,1.033333,0.215190,0.607692,0.032258,0.783254


### Step 3: Recommend the Top Two Product Categories **(3 Points)**
Using the identified purchase patterns, determine the two most relevant product categories for your top loyal customer, based on their unique purchase history.

In [73]:
loyal_customers.sort_values(by=['Amount'], ascending=False)

Unnamed: 0,User_ID,Orders,Amount
1045,1001680,58,281034
1197,1001941,52,239147
2134,1003476,57,220435
1628,1002665,50,201104
2355,1003808,55,197660
...,...,...,...
2979,1004823,21,50210
2381,1003850,22,48572
2876,1004647,24,48351
1772,1002895,22,37174


In [74]:
# Find the top loyal customer based on total Amount
top_loyal_customer = loyal_customers.sort_values(by=['Amount'], ascending=False).head(1)
print(top_loyal_customer)

      User_ID  Orders  Amount
1045  1001680      58  281034


In [75]:
customer_id = top_loyal_customer.iloc[0]['User_ID']

print(f"\nRecommended Product Categories for Customer {customer_id}:")

# Products that the customer has already bought
purchased_products = df[df['User_ID'] == customer_id]['Product_Category'].unique()
# Rules where the antecedent contains products the customer has bought
recommendations = rules[rules['antecedents'].apply(lambda x: all(item in purchased_products for item in x))]
# Get top recommendation: highest lift
top_recommendations = recommendations.sort_values(by='lift', ascending=False)
top_recommendations




Recommended Product Categories for Customer 1001680:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1737,"(Footwear & Shoes, Electronics & Gadgets, Spor...","(Clothing & Apparel, Beauty)",0.117647,0.268382,0.051471,0.437500,1.630137,1.0,0.019896,1.300654,0.438095,0.153846,2.311558e-01,0.314640
1726,"(Footwear & Shoes, Clothing & Apparel, Electro...",(Beauty),0.102941,0.312500,0.051471,0.500000,1.600000,1.0,0.019301,1.375000,0.418033,0.141414,2.727273e-01,0.332353
1746,"(Sports Products, Electronics & Gadgets)","(Footwear & Shoes, Clothing & Apparel, Beauty)",0.194853,0.172794,0.051471,0.264151,1.528703,1.0,0.017801,1.124152,0.429550,0.162791,1.104403e-01,0.281012
1748,"(Footwear & Shoes, Sports Products)","(Clothing & Apparel, Electronics & Gadgets, Be...",0.147059,0.231618,0.051471,0.350000,1.511111,1.0,0.017409,1.182127,0.396552,0.157303,1.540670e-01,0.286111
988,"(Sports Products, Footwear & Shoes, Electronic...",(Beauty),0.117647,0.312500,0.055147,0.468750,1.500000,1.0,0.018382,1.294118,0.377778,0.147059,2.272727e-01,0.322610
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104,"(Clothing & Apparel, Electronics & Gadgets)","(Games & Toys, Food)",0.691176,0.286765,0.198529,0.287234,1.001637,1.0,0.000324,1.000658,0.005291,0.254717,6.580390e-04,0.489771
319,(Electronics & Gadgets),"(Furniture, Clothing & Apparel)",0.790441,0.264706,0.209559,0.265116,1.001550,1.0,0.000324,1.000558,0.007387,0.247826,5.581395e-04,0.528391
1661,(Food),"(Clothing & Apparel, Electronics & Gadgets, Be...",0.911765,0.084559,0.077206,0.084677,1.001403,1.0,0.000108,1.000130,0.015873,0.084000,1.295505e-04,0.498860
910,"(Footwear & Shoes, Food)","(Electronics & Gadgets, Beauty)",0.580882,0.272059,0.158088,0.272152,1.000342,1.0,0.000054,1.000128,0.000816,0.227513,1.278609e-04,0.426616


In [76]:
# Consequent of the bought items, which are the recommended products
recommended_products = top_recommendations.head(1)['consequents'].apply(lambda x: list(x)).explode().unique()
print(recommended_products)

['Clothing & Apparel' 'Beauty']


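Using market basket analysis on the purchase histories of loyal customers, we matched the association rules against the categories already bought by the top loyal customer (User_ID 1001680). Among the rules whose antecedents are fully covered by that history, the one with the highest lift (about 1.63) has the consequent (Clothing & Apparel, Beauty). The two categories to recommend to the top loyal customer are therefore Clothing & Apparel and Beauty.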
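
The cells above keep every consequent of a matching rule, whether or not the customer has already bought it. One possible variant, sketched below on hypothetical toy rules, skips categories the customer already owns. The `recommend` helper and all rule values are made up for illustration and are not taken from the notebook's results.

```python
# Categories the customer has already bought (hypothetical).
purchased = {"Food", "Clothing & Apparel"}

# Toy rules as (antecedent, consequent, lift) tuples; values are hypothetical.
rules = [
    (frozenset({"Food"}), frozenset({"Beauty"}), 1.4),
    (frozenset({"Food", "Office"}), frozenset({"Decor"}), 1.9),
    (frozenset({"Food"}), frozenset({"Clothing & Apparel"}), 1.6),
    (frozenset({"Clothing & Apparel"}), frozenset({"Footwear & Shoes", "Beauty"}), 1.2),
]

def recommend(rules, purchased, top_n=2):
    """Return up to top_n new categories, taken from applicable rules in order of lift."""
    # A rule applies only if the customer bought every item in its antecedent.
    applicable = [rule for rule in rules if rule[0] <= purchased]
    applicable.sort(key=lambda rule: rule[2], reverse=True)

    picks = []
    for _, consequent, _ in applicable:
        for item in sorted(consequent):
            # Skip categories already owned or already recommended.
            if item not in purchased and item not in picks:
                picks.append(item)
    return picks[:top_n]

print(recommend(rules, purchased))
```

Whether to exclude already-purchased categories is a design choice: a promotion for a category the customer already buys may still be worthwhile, so this filter is optional.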





# Question 2 **(5 Points)**

*Knowing nothing else about the customer*, what product categories would you *uniquely* recommend to a female customer aged 26-35?


💡 **Hint:** Look for rules that specifically apply to this demographic.

In [None]:
# To solve this, we need to encode all of product categories, gender, and age group.

basket = df.groupby('User_ID')[['Product_Category', 'Gender', 'Age Group']]
basket = basket.agg({
    'Product_Category': lambda x: ','.join(map(str, x)),
    'Gender': 'first',
    'Age Group': 'first'
}).reset_index()
basket = basket.reset_index()
basket



Unnamed: 0,index,User_ID,Product_Category,Gender,Age Group
0,0,1000001,Electronics & Gadgets,F,0-17
1,1,1000002,Clothing & Apparel,F,55+
2,2,1000003,"Footwear & Shoes,Food",F,26-35
3,3,1000004,"Footwear & Shoes,Decor",F,46-50
4,4,1000005,"Footwear & Shoes,Sports Products",F,26-35
...,...,...,...,...,...
3750,3750,1006035,"Clothing & Apparel,Clothing & Apparel,Clothing...",M,26-35
3751,3751,1006036,"Food,Food,Food,Games & Toys,Food,Electronics &...",M,26-35
3752,3752,1006037,"Food,Clothing & Apparel",M,46-50
3753,3753,1006039,"Electronics & Gadgets,Clothing & Apparel,House...",M,46-50


In [None]:
# Convert each customer's row to a list of items: purchased categories plus gender and age group
data = list(basket.apply(lambda x: x['Product_Category'].split(',') + [x['Gender'], x['Age Group']], axis=1))
data



[['Electronics & Gadgets', 'F', '0-17'],
 ['Clothing & Apparel', 'F', '55+'],
 ['Footwear & Shoes', 'Food', 'F', '26-35'],
 ['Footwear & Shoes', 'Decor', 'F', '46-50'],
 ['Footwear & Shoes', 'Sports Products', 'F', '26-35'],
 ['Food', 'Electronics & Gadgets', 'F', '36-45'],
 ['Footwear & Shoes', 'Sports Products', 'Clothing & Apparel', 'F', '26-35'],
 ['Footwear & Shoes', 'Household items', 'F', '26-35'],
 ['Food',
  'Furniture',
  'Clothing & Apparel',
  'Games & Toys',
  'Beauty',
  'M',
  '36-45'],
 ['Food', 'Electronics & Gadgets', 'Clothing & Apparel', 'F', '46-50'],
 ['Food', 'Clothing & Apparel', 'F', '26-35'],
 ['Sports Products', 'M', '36-45'],
 ['Footwear & Shoes', 'Games & Toys', 'M', '51-55'],
 ['Clothing & Apparel', 'Furniture', 'M', '18-25'],
 ['Food', 'Clothing & Apparel', 'F', '0-17'],
 ['Food',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'F',
  '18-25'],
 ['Footwear & Shoes',
  'Electronics & Gadgets',
  'Household 

In [None]:
# Transform data
te = TransactionEncoder()
transactions = te.fit(data).transform(data) # or fit_transform(data)

# Create a dataframe from the data
df_encoded = pd.DataFrame(transactions, columns=te.columns_)
df_encoded



Unnamed: 0,0-17,18-25,26-35,36-45,46-50,51-55,55+,Auto,Beauty,Books,...,Games & Toys,Hand & Power Tools,Household items,M,Office,Pet Care,Sports Products,Stationery,Tupperware,Veterinary
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3750,False,False,True,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3751,False,False,True,False,False,False,False,False,True,False,...,True,False,False,True,False,False,True,False,False,False
3752,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3753,False,False,False,False,True,False,False,False,False,False,...,False,False,True,True,False,False,False,False,False,False


In [None]:
frequent_itemsets = apriori(df_encoded, min_support=0.05, use_colnames=True)
frequent_itemsets.sort_values(by="support")



Unnamed: 0,support,itemsets
43,0.050333,"(Games & Toys, Clothing & Apparel)"
49,0.050333,"(Electronics & Gadgets, Household items)"
60,0.050333,"(M, Household items)"
71,0.050866,"(M, 26-35, Food)"
85,0.051398,"(M, Electronics & Gadgets, Food)"
...,...,...
1,0.369374,(26-35)
8,0.375499,(Electronics & Gadgets)
10,0.452730,(Food)
7,0.467909,(Clothing & Apparel)


In [None]:
rules = association_rules(frequent_itemsets, num_itemsets=frequent_itemsets.shape[0], metric="support", min_threshold=0.05)
rules = rules[rules["lift"] > 1] # Only recommend products that are of more interest to the demographic than the general population
rules = rules[rules["antecedents"].apply(lambda x: {"26-35", "F"} == set(x) )] # Specify the demographic
rules = rules[rules["consequents"].apply(lambda x: len(x) == 1 )] # Recommend single product categories.

# Show the matching rules; compare them by lift, then by confidence.
rules



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
104,"(26-35, F)",(Clothing & Apparel),0.256458,0.467909,0.1249,0.48702,1.040842,1.0,0.004901,1.037253,0.052773,0.208352,0.035915,0.376976
124,"(26-35, F)",(Electronics & Gadgets),0.256458,0.375499,0.099068,0.386293,1.028744,1.0,0.002768,1.017587,0.037579,0.185907,0.017283,0.325061
134,"(26-35, F)",(Food),0.256458,0.45273,0.11984,0.46729,1.032161,1.0,0.003734,1.027332,0.041905,0.203344,0.026605,0.365998
140,"(26-35, F)",(Footwear & Shoes),0.256458,0.213316,0.073502,0.286604,1.34357,1.0,0.018795,1.102732,0.343914,0.185484,0.093162,0.315587


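For a female customer aged 26-35 about whom nothing else is known, recommending the consequents above is sensible, because each has a lift above 1: this demographic buys these categories more often than the general population. Footwear & Shoes shows the strongest effect (lift of about 1.34), while the other three are only marginally above 1 (roughly 1.03 to 1.04).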
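
The lift values can be verified by hand from the support columns: confidence is the joint support divided by the antecedent support, and lift is confidence divided by the consequent support. A small sketch using the numbers from the (26-35, F) → (Footwear & Shoes) row of the table above:

```python
# Values copied from the rule table for (26-35, F) -> (Footwear & Shoes).
antecedent_support = 0.256458   # share of customers who are female and aged 26-35
consequent_support = 0.213316   # share of all customers who bought Footwear & Shoes
joint_support = 0.073502        # share of customers who satisfy both

# Confidence: P(Footwear & Shoes | female, 26-35).
confidence = joint_support / antecedent_support

# Lift: how much more likely the demographic is to buy than the average customer.
lift = confidence / consequent_support

print(round(confidence, 4), round(lift, 4))
```

Up to rounding, the results agree with the `confidence` and `lift` columns reported for that row.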

# Question 3 **(5 Points)**

Your company plans to open a branch near a **bank** and wants to cater to **banking customers** based on transaction data.  

  
> Find a product category that is **at least 20% more likely** to be purchased by **bankers** compared to the **average population**.  

This will help the company **strategically stock products** that appeal most to banking customers.



In [None]:
# To solve this, we need to encode product categories and occupation.

basket = df.groupby('User_ID')[['Product_Category', 'Occupation']]
basket = basket.agg({
    'Product_Category': lambda x: ','.join(map(str, x)),
    'Occupation': 'first',
}).reset_index()
basket



Unnamed: 0,User_ID,Product_Category,Occupation
0,1000001,Electronics & Gadgets,Aviation
1,1000002,Clothing & Apparel,Agriculture
2,1000003,"Footwear & Shoes,Food",Automobile
3,1000004,"Footwear & Shoes,Decor",Retail
4,1000005,"Footwear & Shoes,Sports Products",Food Processing
...,...,...,...
3750,1006035,"Clothing & Apparel,Clothing & Apparel,Clothing...",Healthcare
3751,1006036,"Food,Food,Food,Games & Toys,Food,Electronics &...",Chemical
3752,1006037,"Food,Clothing & Apparel",IT Sector
3753,1006039,"Electronics & Gadgets,Clothing & Apparel,House...",Hospitality


In [None]:
# Preparing data format
data = list(basket.apply(lambda x: x['Product_Category'].split(',') + [x['Occupation']], axis=1))
data



[['Electronics & Gadgets', 'Aviation'],
 ['Clothing & Apparel', 'Agriculture'],
 ['Footwear & Shoes', 'Food', 'Automobile'],
 ['Footwear & Shoes', 'Decor', 'Retail'],
 ['Footwear & Shoes', 'Sports Products', 'Food Processing'],
 ['Food', 'Electronics & Gadgets', 'Media'],
 ['Footwear & Shoes',
  'Sports Products',
  'Clothing & Apparel',
  'Food Processing'],
 ['Footwear & Shoes', 'Household items', 'Food Processing'],
 ['Food',
  'Furniture',
  'Clothing & Apparel',
  'Games & Toys',
  'Beauty',
  'Chemical'],
 ['Food', 'Electronics & Gadgets', 'Clothing & Apparel', 'IT Sector'],
 ['Food', 'Clothing & Apparel', 'Chemical'],
 ['Sports Products', 'Retail'],
 ['Footwear & Shoes', 'Games & Toys', 'Banking'],
 ['Clothing & Apparel', 'Furniture', 'Media'],
 ['Food', 'Clothing & Apparel', 'Aviation'],
 ['Food',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'Clothing & Apparel',
  'IT Sector'],
 ['Footwear & Shoes',
  'Electronics & Gadgets',
  'Household items',
 

In [None]:
# Transform data
te = TransactionEncoder()
transactions = te.fit(data).transform(data) # or fit_transform(data)
# Create a dataframe from the data
df_encoded = pd.DataFrame(transactions, columns=te.columns_)
df_encoded



Unnamed: 0,Agriculture,Auto,Automobile,Aviation,Banking,Beauty,Books,Chemical,Clothing & Apparel,Construction,...,Lawyer,Media,Office,Pet Care,Retail,Sports Products,Stationery,Textile,Tupperware,Veterinary
0,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3750,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3751,False,False,False,False,False,True,False,True,True,False,...,False,False,False,False,False,True,False,False,False,False
3752,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3753,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:

# Apply the Apriori algorithm to identify frequent itemsets
frequent_itemsets = apriori(df_encoded, min_support=0.005, use_colnames=True)

# Sort the frequent itemsets by support (descending order)
frequent_itemsets = frequent_itemsets.sort_values(by="support", ascending=False)

# Display the top frequent itemsets
print(frequent_itemsets.head())


     support                    itemsets
5   0.467909        (Clothing & Apparel)
7   0.452730                      (Food)
6   0.375499     (Electronics & Gadgets)
8   0.213316          (Footwear & Shoes)
24  0.207989  (Food, Clothing & Apparel)




In [None]:
rules = association_rules(frequent_itemsets, num_itemsets=frequent_itemsets.shape[0], metric="support", min_threshold=0.005)

rules = rules[(rules["lift"] > 1.2) & # 20% projected increase in purchase compared to general population
 (rules["antecedents"] == {"Banking"}) &  # For bankers
  rules["consequents"].apply(lambda x: len(x) == 1) ] # Single category recommendations.
rules.sort_values(by=["lift"])



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
617,(Banking),(Games & Toys),0.097736,0.094274,0.013848,0.141689,1.502948,1.0,0.004634,1.055242,0.37089,0.077728,0.05235,0.144291


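Games & Toys stands out as a product category that is much more likely to be purchased by bankers than by the average customer: with a lift of about 1.50, bankers are roughly 50% more likely to buy it, well above the 20% threshold.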
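
The "at least 20% more likely" condition corresponds to a lift above 1.2, since lift compares the purchase rate among bankers with the rate in the whole population. The sketch below computes lift from scratch on hypothetical toy transactions in which the occupation is treated as one more item. The `support` and `lift` helpers and the data are made up for illustration.

```python
# Toy transactions: purchased categories plus the customer's occupation (hypothetical data).
transactions = [
    {"Games & Toys", "Banking"},
    {"Games & Toys", "Food", "Banking"},
    {"Food", "Banking"},
    {"Food", "Retail"},
    {"Games & Toys", "Retail"},
    {"Food", "Media"},
    {"Food", "Media"},
    {"Clothing & Apparel", "Retail"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    """support(A and C) / (support(A) * support(C))."""
    joint = support(antecedent | consequent, transactions)
    return joint / (support(antecedent, transactions) * support(consequent, transactions))

banker_lift = lift({"Banking"}, {"Games & Toys"}, transactions)

# A lift above 1.2 means bankers are at least 20% more likely to buy the category.
print(banker_lift > 1.2)
```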