# Joe Wehbe - 202000908 - Part 1

### Perform any necessary preprocessing steps. Identify the attributes that need to be discretized and report any analysis that allows you to choose the right number of bins.

Importing necessary libraries for data preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

Loading and storing the data

In [2]:
data = pd.read_csv('data/bank.csv')

Exploring the data and gaining insight into its structure

In [3]:
data

Unnamed: 0,id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pip
0,ID12101,48,FEMALE,INNER_CITY,17546.00,NO,1,NO,NO,NO,NO,YES
1,ID12102,40,MALE,TOWN,30085.10,YES,3,YES,NO,YES,YES,NO
2,ID12103,51,FEMALE,INNER_CITY,16575.40,YES,0,YES,YES,YES,NO,NO
3,ID12104,23,FEMALE,TOWN,20375.40,YES,3,NO,NO,YES,NO,NO
4,ID12105,57,FEMALE,RURAL,50576.30,YES,0,NO,YES,NO,NO,NO
...,...,...,...,...,...,...,...,...,...,...,...,...
595,ID12696,61,FEMALE,INNER_CITY,47025.00,NO,2,YES,YES,YES,YES,NO
596,ID12697,30,FEMALE,INNER_CITY,9672.25,YES,0,YES,YES,YES,NO,NO
597,ID12698,31,FEMALE,TOWN,15976.30,YES,0,YES,YES,NO,NO,YES
598,ID12699,29,MALE,INNER_CITY,14711.80,YES,0,NO,YES,NO,YES,NO


Checking for missing and duplicate values

In [4]:
print(data.isnull().sum())
print()
print('Number of duplicate values: ', data.duplicated().sum())

id             0
age            0
sex            0
region         0
income         0
married        0
children       0
car            0
save_act       0
current_act    0
mortgage       0
pip            0
dtype: int64

Number of duplicate values:  0


Identifying numerical attributes for discretization

In [5]:
numerical_attributes = data.select_dtypes(include=['float64', 'int64']).columns

print('The numerical attributes for discretization are: ')
for i in numerical_attributes:
    print(i)

The numerical attributes for discretization are: 
age
income
children


Determining the number of bins for each attribute using the Freedman-Diaconis rule.
This method is based on the interquartile range (IQR) and consists of three steps:
- Step 1: Finding the IQR, which is the difference between the third and first quartiles
- Step 2: Calculating the bin width as 2 × IQR / n^(1/3), where n is the number of observations
- Step 3: Determining the number of bins by dividing the data range by the bin width and rounding up to the nearest integer
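
The three steps can be sketched as a small standalone function. This is a minimal illustration on made-up values, not the notebook's data; the function name `freedman_diaconis_bins` and the guard for a zero IQR are assumptions added here.

```python
import numpy as np

def freedman_diaconis_bins(values):
    """Return the Freedman-Diaconis bin count for a 1-D sequence of numbers."""
    values = np.asarray(values, dtype=float)
    # Step 1: interquartile range (third quartile minus first quartile)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    if iqr == 0:
        # Assumption: fall back to a single bin when the rule is undefined
        return 1
    # Step 2: bin width = 2 * IQR / n^(1/3)
    bin_width = 2 * iqr / len(values) ** (1 / 3)
    # Step 3: data range divided by bin width, rounded up
    return int(np.ceil((values.max() - values.min()) / bin_width))

# Hypothetical sample with one large outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
sample_bins = freedman_diaconis_bins(sample)
print(sample_bins)
```

Because the bin width depends only on the IQR, a single outlier stretches the range without widening the bins, so the bin count grows quickly on skewed data.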

In [6]:
for attribute in numerical_attributes:
    # step 1: interquartile range (Q3 - Q1)
    iqr = np.percentile(data[attribute], 75) - np.percentile(data[attribute], 25)
    # step 2: Freedman-Diaconis bin width, 2 * IQR / n^(1/3)
    bin_width = (2 * iqr) / (len(data[attribute]) ** (1/3))
    # step 3: number of bins = data range / bin width, rounded up
    num_bins = int(np.ceil((data[attribute].max() - data[attribute].min()) / bin_width))

    print(attribute, ": ", num_bins, "bins")

age :  9 bins
income :  13 bins
children :  7 bins


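Discretizing the attributes with equal-width bins and displaying the new dataset. Note that `num_bins` is not recomputed in this cell: it still holds the last value from the previous loop (7, computed for children), so all three attributes are split into 7 bins.

In [7]:
for attribute in numerical_attributes:
    # num_bins carries over from the previous cell (7), so every attribute gets 7 uniform bins
    discretizer = KBinsDiscretizer(n_bins=num_bins, encode='ordinal', strategy='uniform', subsample=None)
    # KBinsDiscretizer expects a 2-D array, hence the reshape to a single column
    discretized_values = discretizer.fit_transform(data[attribute].values.reshape(-1, 1))
    data[attribute] = discretized_values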
data

Unnamed: 0,id,age,sex,region,income,married,children,car,save_act,current_act,mortgage,pip
0,ID12101,4.0,FEMALE,INNER_CITY,1.0,NO,2.0,NO,NO,NO,NO,YES
1,ID12102,3.0,MALE,TOWN,3.0,YES,6.0,YES,NO,YES,YES,NO
2,ID12103,4.0,FEMALE,INNER_CITY,1.0,YES,0.0,YES,YES,YES,NO,NO
3,ID12104,0.0,FEMALE,TOWN,1.0,YES,6.0,NO,NO,YES,NO,NO
4,ID12105,5.0,FEMALE,RURAL,5.0,YES,0.0,NO,YES,NO,NO,NO
...,...,...,...,...,...,...,...,...,...,...,...,...
595,ID12696,6.0,FEMALE,INNER_CITY,5.0,NO,4.0,YES,YES,YES,YES,NO
596,ID12697,1.0,FEMALE,INNER_CITY,0.0,YES,0.0,YES,YES,YES,NO,NO
597,ID12698,1.0,FEMALE,TOWN,1.0,YES,0.0,YES,YES,NO,NO,YES
598,ID12699,1.0,MALE,INNER_CITY,1.0,YES,0.0,NO,YES,NO,YES,NO
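
To give each attribute its own bin count, the counts can be kept in a dictionary and looked up inside the loop. The sketch below uses `pd.cut` as a stand-in for `KBinsDiscretizer` with `strategy='uniform'` (both produce equal-width bins, although the exact edge handling may differ slightly). The toy values and the `bins_per_attribute` dictionary are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from bank.csv)
toy = pd.DataFrame({
    "age": [18, 25, 40, 52, 67],
    "income": [5000.0, 12000.0, 30000.0, 45000.0, 63000.0],
})

# Assumed per-attribute bin counts, e.g. collected from a bin-selection rule
bins_per_attribute = {"age": 3, "income": 2}

binned = toy.copy()
for attribute, n_bins in bins_per_attribute.items():
    # labels=False returns the integer bin index instead of an interval label
    binned[attribute] = pd.cut(toy[attribute], bins=n_bins, labels=False)

print(binned)
```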


### Apply association rule mining on the preprocessed data and experiment with different parameters so that you get at least 20-30 strong rules (e.g., rules with high lift and confidence which at the same time have relatively good support). 

Importing necessary libraries for association rule mining

In [8]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Applying association rule mining on the preprocessed data by applying the following four steps:
- Step 1: Converting the data into a binary matrix using one-hot encoding
- Step 2: Applying the apriori algorithm to find the frequent itemsets
- Step 3: Generating association rules with specified parameters
- Step 4: Sorting the rules by lift and confidence
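
For reference, the three metrics used to rank rules can be computed by hand for a single rule. This is a minimal sketch on a toy boolean table with made-up values; it only illustrates the definitions of support, confidence, and lift for the rule `married_YES -> pip_YES`.

```python
import pandas as pd

# Toy boolean transactions (hypothetical values, not taken from bank.csv)
toy = pd.DataFrame({
    "married_YES": [True, True, True, False, True, False, True, False],
    "pip_YES":     [True, False, True, False, True, True, False, False],
})

antecedent = toy["married_YES"]
consequent = toy["pip_YES"]

# support: share of rows where antecedent and consequent both hold
support = (antecedent & consequent).mean()
# confidence: support of the rule divided by support of the antecedent
confidence = support / antecedent.mean()
# lift: confidence divided by support of the consequent (1 means independence)
lift = confidence / consequent.mean()

print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent.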



In [9]:
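# step 1: one-hot encode the categorical columns; the numeric columns (age, income,
# children) are not expanded and are cast to bool, i.e. True for any non-zero bin index
one_hot_data = pd.get_dummies(data).astype(bool)
# step 2: keep itemsets that appear in at least 10% of the rows
frequent_itemsets = apriori(one_hot_data, min_support=0.1, use_colnames=True)
# step 3: keep rules whose lift is at least 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# step 4: sort by lift, breaking ties by confidence (both descending)
strong_rules = rules.sort_values(by=['lift','confidence'], ascending=[False, False])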

# printing the first 20 rules of the sorted table
print("Top 20 Strongest Rules:")
print()
for index, rule in strong_rules.head(20).iterrows():
    antecedents = ', '.join(rule['antecedents'])
    consequents = ', '.join(rule['consequents'])
    support = rule['support']
    confidence = rule['confidence']
    lift = rule['lift']
    print(f"Rule {index + 1}:")
    print(f"Antecedents: {antecedents}")
    print(f"Consequents: {consequents}")
    print(f"Support: {support:.3f}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Lift: {lift:.3f}")
    print()
    print("-------------")
    print()

Top 20 Strongest Rules:

Rule 25010:
Antecedents: save_act_YES, age, current_act_YES, married_YES, pip_YES
Consequents: children, income
Support: 0.105
Confidence: 0.940
Lift: 1.926

-------------

Rule 25121:
Antecedents: children, income
Consequents: save_act_YES, age, current_act_YES, married_YES, pip_YES
Support: 0.105
Confidence: 0.215
Lift: 1.926

-------------

Rule 25056:
Antecedents: age, children, current_act_YES, income
Consequents: save_act_YES, pip_YES, married_YES
Support: 0.105
Confidence: 0.318
Lift: 1.872

-------------

Rule 25075:
Antecedents: save_act_YES, pip_YES, married_YES
Consequents: age, children, current_act_YES, income
Support: 0.105
Confidence: 0.618
Lift: 1.872

-------------

Rule 25041:
Antecedents: save_act_YES, pip_YES, current_act_YES, married_YES
Consequents: age, children, income
Support: 0.105
Confidence: 0.829
Lift: 1.856

-------------

Rule 25090:
Antecedents: age, children, income
Consequents: save_act_YES, pip_YES, current_act_YES, married_YE
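
The items `age`, `income`, and `children` appear in the rules without a bin label because `pd.get_dummies` only expands non-numeric columns, and the cast to `bool` then turns every non-zero bin index into `True`. A possible way to get one item per bin is to convert the bin indices to string labels before encoding. The sketch below assumes this approach and uses made-up values; the `bin` prefix is a hypothetical naming choice.

```python
import pandas as pd

# Toy frame with one discretized column and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "age": [0.0, 3.0, 6.0],
    "married": ["YES", "NO", "YES"],
})

# Turn the numeric bin index into a string label so get_dummies expands it per bin
toy["age"] = "bin" + toy["age"].astype(int).astype(str)

one_hot = pd.get_dummies(toy).astype(bool)
print(list(one_hot.columns))
```

With this encoding a rule can refer to a specific age bin (for example `age_bin3`) instead of the single item `age`.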

### Select the top 5 most "interesting" rules and for each specify the following:
##### •	an explanation of the pattern and why you believe it is interesting based on the business objectives of the company;
##### •	any recommendations based on the discovered rule that might help the company to better understand behavior of its customers or in its marketing campaign.


In [10]:
sorted_rules = rules.sort_values(by=['lift','support'], ascending=[False, False])
top_rules = sorted_rules.head(5)

print("Top 5 Most Interesting Rules:")
print()
for index, rule in top_rules.iterrows():
    antecedents = ', '.join(rule['antecedents'])
    consequents = ', '.join(rule['consequents'])
    support = rule['support']
    confidence = rule['confidence']
    lift = rule['lift']
    print(f"Rule {index + 1}:")
    print(f"Antecedents: {antecedents}")
    print(f"Consequents: {consequents}")
    print(f"Support: {support:.3f}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Lift: {lift:.3f}")
    print()
    print("-------------")
    print()

Top 5 Most Interesting Rules:

Rule 25010:
Antecedents: save_act_YES, age, current_act_YES, married_YES, pip_YES
Consequents: children, income
Support: 0.105
Confidence: 0.940
Lift: 1.926

-------------

Rule 25121:
Antecedents: children, income
Consequents: save_act_YES, age, current_act_YES, married_YES, pip_YES
Support: 0.105
Confidence: 0.215
Lift: 1.926

-------------

Rule 25056:
Antecedents: age, children, current_act_YES, income
Consequents: save_act_YES, pip_YES, married_YES
Support: 0.105
Confidence: 0.318
Lift: 1.872

-------------

Rule 25075:
Antecedents: save_act_YES, pip_YES, married_YES
Consequents: age, children, current_act_YES, income
Support: 0.105
Confidence: 0.618
Lift: 1.872

-------------

Rule 25041:
Antecedents: save_act_YES, pip_YES, current_act_YES, married_YES
Consequents: age, children, income
Support: 0.105
Confidence: 0.829
Lift: 1.856

-------------



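The top 5 most interesting rules are selected by sorting on lift first and support second. Below is an interpretation of why each of the five rules is interesting in terms of the business objectives. Because the discretized columns are cast to booleans in the one-hot step, the items age, income, and children are true whenever the bin index is non-zero, i.e. the customer is above the lowest bin (for children, this means at least one child).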

- Rule 25010: Customers who have a savings account (save_act_YES), a current account (current_act_YES), and a personal investment plan (pip_YES), are married (married_YES), and are above the lowest age bin (age) are highly likely to have at least one child (children) and an income above the lowest bin (income). Confidence is 0.940 and lift is 1.926.

- Rule 25121: This is the reverse of Rule 25010. Customers who have children and an income above the lowest bin are associated with having a savings account (save_act_YES), a current account (current_act_YES), and a personal investment plan (pip_YES), being married (married_YES), and being above the lowest age bin. The lift is the same (1.926), but the confidence is only 0.215.

- Rule 25056: Customers who are above the lowest age and income bins, have children, and have a current account (current_act_YES) are associated with having a savings account (save_act_YES) and a personal investment plan (pip_YES) and being married (married_YES). Confidence is 0.318 and lift is 1.872.

- Rule 25075: This is the reverse of Rule 25056. Customers who have a savings account (save_act_YES) and a personal investment plan (pip_YES) and are married (married_YES) are associated with being above the lowest age and income bins, having children, and having a current account (current_act_YES). Confidence is 0.618 and lift is 1.872.

- Rule 25041: Customers who have a savings account (save_act_YES), a personal investment plan (pip_YES), and a current account (current_act_YES) and are married (married_YES) are associated with being above the lowest age and income bins and having children. Confidence is 0.829 and lift is 1.856.

The rules above have the highest lift among the generated rules, each with a support of 0.105, which makes them worth considering for the objectives of the business.

We can identify a pattern as to which customers are more likely to buy the personal investment plan after the last mailing: all of the above rules suggest that customers who are married, have children, have an income above the lowest bin, and hold a current or savings account are more likely to have a personal investment plan, or vice versa.

These rules can help the company identify its target market and run more effective marketing campaigns.
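
Since the business question is who buys the personal investment plan, it may help to look only at rules whose consequent is exactly `pip_YES`. The sketch below assumes a rules table shaped like the one returned by `association_rules` (with `frozenset` values in the `antecedents` and `consequents` columns). The rows are made-up values for illustration, not results from this notebook.

```python
import pandas as pd

# Hypothetical rules table mimicking the column layout of association_rules output
rules_toy = pd.DataFrame({
    "antecedents": [
        frozenset({"married_YES", "save_act_YES"}),
        frozenset({"pip_YES"}),
        frozenset({"car_YES"}),
    ],
    "consequents": [
        frozenset({"pip_YES"}),
        frozenset({"married_YES"}),
        frozenset({"pip_YES", "mortgage_NO"}),
    ],
    "confidence": [0.70, 0.55, 0.40],
    "lift": [1.5, 1.2, 1.1],
})

# Keep only the rules that predict pip_YES and nothing else
target = frozenset({"pip_YES"})
is_pip_rule = rules_toy["consequents"].apply(lambda items: items == target)
pip_rules = rules_toy[is_pip_rule].sort_values(by="lift", ascending=False)

print(pip_rules)
```

Rules of this shape read directly as "customers with these attributes tend to hold a PIP", which maps more naturally onto a mailing target list than rules with many items on both sides.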

### Transferring the preprocessed dataset from this notebook to the notebook of part 2

In [11]:
import pickle

with open('preprocessed_dataset.pkl', 'wb') as file:
    pickle.dump(data, file)
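
On the receiving side, the dataset can be read back with `pickle.load`. The round trip below is a minimal sketch using a small made-up DataFrame and a temporary directory; in part 2 the path would presumably be the `preprocessed_dataset.pkl` file written above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for the preprocessed dataset (hypothetical values)
frame = pd.DataFrame({"age": [4.0, 3.0], "pip": ["YES", "NO"]})

# Write to a temporary location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), "preprocessed_dataset.pkl")
with open(path, "wb") as file:
    pickle.dump(frame, file)

# Read the DataFrame back, as the second notebook would
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.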
