# CS 1656 – Introduction to Data Science 

## Instructors: Alexandros Labrinidis, Xiaowei Jia
## Teaching Assistants: Evangelos Karageorgos, Xiaoting Li, Zi Han Ding
## Recitation: Association Rule Mining
---
In this recitation, you will use the Apriori algorithm on a dataset that contains credit card transaction information (fraudulent and legitimate ones) to discover relationships between variables and fraudulent transactions.

### Library imports

In [53]:
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules

%matplotlib inline

### Dataset at a Glance

This recitation uses real credit card transaction data from European cardholders in 2023. Because the dataset is sensitive, the data was anonymized for compliance with policies. For example, instead of the time of a transaction, which is considered sensitive information, all time values were converted to decimal values. This was done for 28 different transaction attributes. To save space and CPU time, the original dataset has been reduced to contain only:
* ID of the transaction in the dataset
* 8 V columns, representing anonymized transaction attributes (e.g., time, location, etc.)
* Amount of the transaction
* Binary label indicating whether the transaction is fraudulent (1) or not (0)

Kaggle link to the original dataset: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023

In [56]:
data = pd.read_csv('creditcard_public.csv')
data.head()

Unnamed: 0,id,V2,V4,V7,V17,V22,V24,V25,V27,Amount,Class
0,106066,-1.075548,-0.017186,0.442199,0.213408,0.513924,0.970368,0.248378,-0.258531,11177.47,0
1,102005,-0.225103,-0.658931,0.733825,0.222475,0.193785,0.763737,-0.013376,-0.267523,7881.18,0
2,278450,-0.170331,-1.149702,0.433624,0.232038,-0.48156,-0.62981,-0.706611,-0.028033,3255.41,0
3,120218,-0.437368,-0.48068,0.400437,0.325641,-0.283255,-1.250555,0.562463,-0.271604,16052.42,0
4,258527,-0.110636,-0.24477,0.437923,0.599195,-1.207685,1.238109,0.461472,0.011344,18197.45,0


Let's look into the dataset further and examine its number of rows and columns using the shape attribute.

In [57]:
data.shape

(2000, 11)

So we have 2000 total rows with 11 columns. Next, let's look at the type of each of these columns.

In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      2000 non-null   int64  
 1   V2      2000 non-null   float64
 2   V4      2000 non-null   float64
 3   V7      2000 non-null   float64
 4   V17     2000 non-null   float64
 5   V22     2000 non-null   float64
 6   V24     2000 non-null   float64
 7   V25     2000 non-null   float64
 8   V27     2000 non-null   float64
 9   Amount  2000 non-null   float64
 10  Class   2000 non-null   int64  
dtypes: float64(9), int64(2)
memory usage: 172.0 KB


float64 is something we do not want to see, as the Apriori and association rule mining algorithms generally expect binary (on/off) items as input.

### Feature Engineering

For this recitation, we are interested only in the attributes of the transactions, not the amount or the id, so we can drop those two columns.

In [59]:
# Drop the amount column
data = data.drop('Amount', axis=1)

# Drop the id column
data = data.drop('id', axis=1)

data.head()

Unnamed: 0,V2,V4,V7,V17,V22,V24,V25,V27,Class
0,-1.075548,-0.017186,0.442199,0.213408,0.513924,0.970368,0.248378,-0.258531,0
1,-0.225103,-0.658931,0.733825,0.222475,0.193785,0.763737,-0.013376,-0.267523,0
2,-0.170331,-1.149702,0.433624,0.232038,-0.48156,-0.62981,-0.706611,-0.028033,0
3,-0.437368,-0.48068,0.400437,0.325641,-0.283255,-1.250555,0.562463,-0.271604,0
4,-0.110636,-0.24477,0.437923,0.599195,-1.207685,1.238109,0.461472,0.011344,0


Next, we need to turn the decimal values in the attribute columns (the columns starting with V) into binary data. For simplicity's sake, let's treat negative values as off (0) and all other values as on (1).

In [60]:
# Convert the V columns to binary:
# 0 if the value is negative, 1 otherwise (a value of exactly zero maps to 1)

for col in data.columns:
    if col.startswith('V'):
        data[col] = data[col].apply(lambda x: 0 if x < 0 else 1)

data.head()

Unnamed: 0,V2,V4,V7,V17,V22,V24,V25,V27,Class
0,0,0,1,1,1,1,1,0,0
1,0,0,1,1,1,1,0,0,0
2,0,0,1,1,0,0,0,0,0
3,0,0,1,1,0,0,1,0,0
4,0,0,1,1,0,1,1,1,0
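
The same thresholding can also be written as a single vectorized comparison instead of a per-value lambda. The following is a minimal sketch on a made-up toy frame (the values and the names `toy` and `v_cols` are illustrative and not taken from creditcard_public.csv). Like the loop above, it assumes that a value of exactly zero counts as 1.

```python
import pandas as pd

# Toy frame with made-up values; only the column naming mirrors the recitation data.
toy = pd.DataFrame({
    "V2": [-1.07, 0.25, 0.0],
    "V4": [0.44, -0.66, 1.15],
    "Class": [0, 1, 1],
})

# Pick the anonymized attribute columns by their "V" prefix.
v_cols = [c for c in toy.columns if c.startswith("V")]

# Compare whole columns at once: negative -> 0, everything else -> 1.
for col in v_cols:
    toy[col] = (toy[col] >= 0).astype(int)

print(toy)
```

On a 2000-row frame the difference in speed is negligible, so this is mostly a matter of readability.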


We can convert the dataframe to true boolean values to make it more compatible with the library.

In [61]:
data = data.astype(bool)
data.head()

Unnamed: 0,V2,V4,V7,V17,V22,V24,V25,V27,Class
0,False,False,True,True,True,True,True,False,False
1,False,False,True,True,True,True,False,False,False
2,False,False,True,True,False,False,False,False,False
3,False,False,True,True,False,False,True,False,False
4,False,False,True,True,False,True,True,True,False


### Apriori Algorithm

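Using the Apriori algorithm, we can create a dataframe that lists the frequent itemsets and their support values, where the support of an itemset is the fraction of rows in which every item of the set is present.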

In [62]:
# Run the apriori algorithm
# This will return a dataframe with the itemsets and their corresponding support
frequent_itemsets = apriori(data, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.3950,(V2)
1,0.4840,(V4)
2,0.6360,(V7)
3,0.6120,(V17)
4,0.4850,(V22)
...,...,...
334,0.0110,"(V7, Class, V17, V4, V25, V24)"
335,0.0155,"(V7, Class, V17, V27, V4, V25)"
336,0.0155,"(Class, V27, V4, V25, V24, V22)"
337,0.0140,"(V7, Class, V17, V27, V4, V25, V2)"
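
To make the support column concrete, here is a minimal sketch in plain Python that computes support by hand. The four transactions and the helper name `support` are made up for illustration. This is not how mlxtend computes it internally, only the definition written out.

```python
# Hypothetical toy transactions: each row is the set of items that are "on".
transactions = [
    {"V2", "V4", "Class"},
    {"V7", "V17"},
    {"V2", "V4"},
    {"V7", "V17", "Class"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= row for row in rows) / len(rows)

print(support({"V2"}, transactions))           # 0.5  (rows 1 and 3)
print(support({"V2", "V4"}, transactions))     # 0.5  (rows 1 and 3)
print(support({"V7", "Class"}, transactions))  # 0.25 (row 4 only)
```

An itemset is reported as frequent when this fraction reaches the chosen min_support.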


Note that each value in the 'itemsets' column is a frozenset, an immutable kind of Python set.

In [63]:
# get the 'itemset' value of the last row
frequent_itemsets.iloc[-1]['itemsets']

frozenset({'Class', 'V2', 'V22', 'V24', 'V25', 'V27', 'V4'})
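
A frozenset behaves like a regular set for comparisons and membership tests, but it cannot be modified after creation. The short sketch below uses made-up itemsets (the names `a`, `b`, and `support_by_itemset` and the value 0.25 are placeholders) to show the operations that are most useful when filtering itemsets and rules.

```python
# Made-up itemsets for illustration; the item names mirror this recitation's columns.
a = frozenset({"V2", "V4", "Class"})
b = frozenset(["Class", "V4", "V2"])   # same items, listed in a different order

print(a == b)                   # True: order does not matter
print("Class" in a)             # True: membership test
print(frozenset({"V2"}) <= a)   # True: subset test
print(len(a))                   # 3

# Because a frozenset is immutable and hashable, it can be used as a dictionary key.
support_by_itemset = {a: 0.25}  # 0.25 is an arbitrary placeholder value
print(support_by_itemset[b])    # 0.25
```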

Now that we have the frequent itemsets, we can use the association_rules function to generate the rule set. By default, it keeps the rules whose confidence is at least 0.8.

In [64]:
# Generate the rules
rules = association_rules(frequent_itemsets)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(V2),(V4),0.3950,0.4840,0.3410,0.863291,1.783659,0.149820,3.774444,0.726206
1,(V2),(Class),0.3950,0.5000,0.3545,0.897468,1.794937,0.157000,4.876543,0.732029
2,(Class),(V4),0.5000,0.4840,0.4335,0.867000,1.791322,0.191500,3.879699,0.883506
3,(V4),(Class),0.4840,0.5000,0.4335,0.895661,1.791322,0.191500,4.792079,0.856111
4,(V7),(V17),0.6360,0.6120,0.5850,0.919811,1.502960,0.195768,4.838588,0.919358
...,...,...,...,...,...,...,...,...,...,...
240,"(Class, V17, V27, V4, V25)","(V2, V7)",0.0170,0.0870,0.0140,0.823529,9.465855,0.012521,5.173667,0.909824
241,"(Class, V17, V27, V25, V2)","(V7, V4)",0.0140,0.1585,0.0140,1.000000,6.309148,0.011781,inf,0.853448
242,"(V17, V27, V4, V25, V2)","(Class, V7)",0.0150,0.1525,0.0140,0.933333,6.120219,0.011713,12.712500,0.849347
243,"(Class, V27, V25, V2, V24, V22)",(V4),0.0115,0.4840,0.0105,0.913043,1.886453,0.004934,5.934000,0.475372
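
The metric columns are all derived from the three support values. As a worked check, the sketch below recomputes confidence, lift, leverage, and conviction for the rule (V2) -> (Class) from row 1 of the output above, using the standard definitions of these metrics. The variable names are illustrative.

```python
# Values copied from row 1 of the rules output: (V2) -> (Class)
antecedent_support = 0.3950   # support of (V2)
consequent_support = 0.5000   # support of (Class)
rule_support = 0.3545         # support of (V2, Class) together

# How often the consequent holds when the antecedent holds.
confidence = rule_support / antecedent_support
# Confidence relative to how common the consequent is overall (1.0 means independent).
lift = confidence / consequent_support
# Observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support
# Ratio of expected to observed frequency of the rule being wrong.
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 6))   # 0.897468
print(round(lift, 6))         # 1.794937
print(round(leverage, 6))     # 0.157
print(round(conviction, 6))   # 4.876543
```

These agree with the confidence, lift, leverage, and conviction columns of that row.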


### Tasks

__T1) Find all itemsets in the dataset that occur more than 30% of the time. _(Dataframe with 'support' and 'itemsets' columns)___

In [65]:
frequent_itemsets = apriori(data, min_support=0.30, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.395,(V2)
1,0.484,(V4)
2,0.636,(V7)
3,0.612,(V17)
4,0.485,(V22)
5,0.516,(V24)
6,0.491,(V25)
7,0.38,(V27)
8,0.5,(Class)
9,0.341,"(V2, V4)"


__T2) Find all the rules in the frequent itemsets (not from task 1, use frequent_itemsets) that are true over 90% of the time. _(Dataframe with 'antecedents', 'consequents', 'support', and 'confidence' columns)___

In [68]:
# Keep only the rules whose confidence is at least 0.9
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.9)
rules[['antecedents', 'consequents', 'support', 'confidence']]


Unnamed: 0,antecedents,consequents,support,confidence
0,(2),(3),0.585,0.919811
1,(3),(2),0.585,0.955882


__T3) Find all the "rules" for Class, which is our fraud/legitimate column. Which itemsets of transaction attributes (V2, V4, V7, ...) lead to fraudulent transactions (Class == 1)? _(Dataframe with 'antecedents', 'consequents', 'support', and 'confidence' columns)___

In [123]:
frequent_itemsets = apriori(data, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.3950,(V2)
1,0.4840,(V4)
2,0.6360,(V7)
3,0.6120,(V17)
4,0.4850,(V22)
...,...,...
334,0.0110,"(V7, Class, V17, V4, V25, V24)"
335,0.0155,"(V7, Class, V17, V27, V4, V25)"
336,0.0155,"(Class, V27, V4, V25, V24, V22)"
337,0.0140,"(V7, Class, V17, V27, V4, V25, V2)"
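
One possible way to narrow a rules dataframe down to rules about Class is to compare the 'consequents' column against a frozenset. The sketch below does not call mlxtend. It uses a small hand-built stand-in, `rules_demo`, whose four rows are copied from the rules output in the Apriori Algorithm section. The names `rules_demo`, `target`, `class_rules`, `most_common`, and `most_consistent` are illustrative.

```python
import pandas as pd

# Hand-built stand-in for a rules dataframe (rows 0-3 of the earlier rules output).
rules_demo = pd.DataFrame({
    "antecedents": [frozenset({"V2"}), frozenset({"V2"}),
                    frozenset({"Class"}), frozenset({"V4"})],
    "consequents": [frozenset({"V4"}), frozenset({"Class"}),
                    frozenset({"V4"}), frozenset({"Class"})],
    "support": [0.3410, 0.3545, 0.4335, 0.4335],
    "confidence": [0.863291, 0.897468, 0.867000, 0.895661],
})

# Keep only the rules whose consequent is exactly {Class}.
target = frozenset({"Class"})
class_rules = rules_demo[rules_demo["consequents"].apply(lambda c: c == target)]

# Rank the remaining rules: support measures how common a rule is,
# confidence measures how consistently the antecedent implies Class.
most_common = class_rules.sort_values("support", ascending=False).head(1)
most_consistent = class_rules.sort_values("confidence", ascending=False).head(1)

print(class_rules[["antecedents", "consequents", "support", "confidence"]])
```

On the real data, the same filter would be applied to the dataframe returned by association_rules rather than to this stand-in.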


__T4) From the "rules" for class, which "rule" is most prevalent/most common in the dataset? _(Dataframe with 'antecedents', 'consequents', 'support', and 'confidence' columns, AND a single row)___

_If you did not get T3, you can use the rules dataframe for this task (From the apriori algorithm section)_

In [120]:
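# Sort the rules by support (most common first) and show the top row
rules = association_rules(frequent_itemsets)
rules = rules.sort_values('support', ascending=False).reset_index(drop=True)
rules.loc[0, ['antecedents', 'consequents', 'support', 'confidence']]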

antecedents        (V7)
consequents       (V17)
support           0.585
confidence     0.919811
Name: 0, dtype: object

__T5) From the "rules" for Class, which rule has the highest consistency? That is, for which antecedent itemset is Class almost always the consequent? _(Dataframe with 'antecedents', 'consequents', 'support', and 'confidence' columns, AND a single row)___

_If you did not get T3, you can use the rules dataframe for this task (From the apriori algorithm section)_