### Google Colab
Run the following cell only when working on the Google Colab platform.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://github.com/Nak007/AssoruleMining">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [None]:
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Clone the repository and install the remaining dependency. After cloning,
# the *.py files are stored under '/content/AssoruleMining/'.
!git clone 'http://github.com/Nak007/AssoruleMining.git'
!pip install PrettyTable
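
Since the cell above only applies on Colab, it can help to guard it so the notebook also runs elsewhere. The following is a minimal sketch, assuming the commonly used heuristic of looking for `google.colab` in `sys.modules`; the name `IN_COLAB` is illustrative and not part of the repository.

```python
import sys

# Colab preloads the `google.colab` package into the kernel, so its presence
# in sys.modules is a widely used (heuristic) signal of the platform.
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount("/content/drive")
else:
    print("Not running on Colab; skipping the Drive mount.")
```

The shell commands (`!git clone`, `!pip install`) could be placed in the same Colab-only branch of a notebook cell.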

## Example

In [19]:
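import sys
import numpy as np
import pandas as pd
# On Colab, the repository is cloned to '/content/AssoruleMining'.
# Appending a path that does not exist is harmless elsewhere.
sys.path.append('/content/AssoruleMining')
from AssoruleMining import *
from sklearn.model_selection import train_test_split as tts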

In [20]:
X = pd.read_csv('card_transdata_10K.txt', sep="|")
y = X.pop("fraud").values

In [21]:
for var in ["repeat_retailer", "used_chip", "used_pin_number", "online_order"]:
    X[var] = np.where(X[var]==1,"yes","no")

In [22]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   distance_from_home              10000 non-null  float64
 1   distance_from_last_transaction  10000 non-null  float64
 2   ratio_to_median_purchase_price  10000 non-null  float64
 3   repeat_retailer                 10000 non-null  object 
 4   used_chip                       10000 non-null  object 
 5   used_pin_number                 10000 non-null  object 
 6   online_order                    10000 non-null  object 
dtypes: float64(3), object(4)
memory usage: 547.0+ KB


We use **`define_dtype`** to convert each column of `X` to one of four dtypes: `float32`, `int32`, `category`, or `object`. Columns whose dtype is `np.datetime64` or `np.timedelta64` are left unchanged.

In [23]:
X = define_dtype(X)

In [24]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   distance_from_home              10000 non-null  float32 
 1   distance_from_last_transaction  10000 non-null  float32 
 2   ratio_to_median_purchase_price  10000 non-null  float32 
 3   repeat_retailer                 10000 non-null  category
 4   used_chip                       10000 non-null  category
 5   used_pin_number                 10000 non-null  category
 6   online_order                    10000 non-null  category
dtypes: category(4), float32(3)
memory usage: 156.9 KB
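
The conversion shown above can be approximated with plain pandas. The sketch below is an assumption about the behaviour described in the text, not the package's implementation; the function name `sketch_define_dtype` and the demo columns are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes

def sketch_define_dtype(df):
    """Hypothetical stand-in: downcast numerics, turn other columns into categories."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Leave datetime and timedelta columns untouched, as the text describes.
        if ptypes.is_datetime64_any_dtype(s) or ptypes.is_timedelta64_dtype(s):
            continue
        if ptypes.is_integer_dtype(s):
            out[col] = s.astype("int32")
        elif ptypes.is_float_dtype(s):
            out[col] = s.astype("float32")
        else:
            # Assumption: remaining (string-like) columns become categoricals.
            out[col] = s.astype("category")
    return out

# Small made-up frame, for illustration only.
demo = pd.DataFrame({
    "distance_from_home": [1.5, 20.0, 3.2],
    "used_chip": ["yes", "no", "yes"],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})
converted = sketch_define_dtype(demo)
print(converted.dtypes)
```

Halving the width of the float columns and storing repeated strings as categories is consistent with the memory drop reported by the two `X.info()` calls.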


Split the data into **train** and **test** sets with [**`train_test_split`**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [25]:
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, shuffle=True, random_state=0)

To discretize `X_train`, we use **`discretize`**.

In [26]:
discr_X1, rules1 = discretize(X_train, n_cutoffs=20)
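
To make the idea of discretization concrete, the sketch below turns one numeric column into binary features of the form `x >= cutoff` and `x < cutoff`. It assumes the cutoffs sit at evenly spaced percentiles, which may differ from how `discretize` chooses them; `sketch_discretize`, its return layout, and the random demo data are hypothetical.

```python
import numpy as np

def sketch_discretize(x, name, n_cutoffs=20):
    """Hypothetical helper: binary features from percentile cutoffs of `x`."""
    x = np.asarray(x, dtype=float)
    # Interior percentiles only, so no cutoff sits at the minimum or maximum.
    q = np.linspace(0, 100, n_cutoffs + 2)[1:-1]
    cutoffs = np.unique(np.percentile(x, q))
    features, rules = [], []
    for c in cutoffs:
        for sign, mask in ((">=", x >= c), ("<", x < c)):
            features.append(mask)
            rules.append((name, sign, float(c)))
    return np.column_stack(features), rules

rng = np.random.default_rng(0)
ratio = rng.exponential(scale=2.0, size=1000)
feats, rules = sketch_discretize(ratio, "ratio_to_median_purchase_price", n_cutoffs=5)
print(feats.shape, rules[0])
```

Each boolean column then corresponds to one `(variable, sign, value)` triple, the same shape of information that `print_rule` tabulates later in this example.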

## Creation of rules (1)
- The antecedent rule is assumed to be mutually exclusive with the consequent rule.
- Training samples captured by the antecedent rule(s) are excluded before the next consequent rule is determined.
- The procedure stops when the evaluation metric is deemed satisfactory or no longer improves.
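
The steps above amount to a sequential covering loop. The sketch below illustrates it on precomputed boolean rule features and is not the package's algorithm: each round picks the single feature with the best F1 on the remaining samples, removes the samples it captures, and stops once the best F1 falls below a threshold. The names `f1`, `sequential_cover`, and `min_f1` are hypothetical.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1-score from boolean arrays; 0.0 when there are no true positives."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))

def sequential_cover(features, y, max_rules=3, min_f1=0.3):
    """Pick rules one at a time, dropping captured samples after each pick."""
    remaining = np.ones(len(y), dtype=bool)
    chosen = []
    for _ in range(max_rules):
        scores = [f1(y[remaining], features[remaining, j])
                  for j in range(features.shape[1])]
        best = int(np.argmax(scores))
        if scores[best] < min_f1:
            break  # metric is no longer satisfactory
        chosen.append(best)
        remaining &= ~features[:, best]  # exclude samples captured by the rule
    return chosen

# Toy data: column 0 captures the first three targets, column 1 the fourth,
# and column 2 is mostly noise.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
features = np.array([
    [1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0],
], dtype=bool)
print(sequential_cover(features, y))
```

In the notebook the same idea is carried out by hand: fit, pick a rule, filter the training set, and fit again.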

In [27]:
asso1 = AssoRuleMining(metric="f1", operator="and", n_jobs=3, n_batches=5).fit(discr_X1, y_train, rules=rules1)

HBox(children=(HTMLMath(value='Calculating . . .'), HTMLMath(value='')))

**`info`** (attribute): a summary table in the form of a `dict` whose keys are the column headers. It can be loaded into a pandas DataFrame.

In [28]:
pd.DataFrame(asso1.info).sort_values(by=["f1_score","n_features"], ascending=[False,True]).head()

| | start_with | variable | n_features | p_target | p_sample | f1_score | recall | precision | entropy |
|---|---|---|---|---|---|---|---|---|---|
| 2 | | 124 | 3 | 0.69191 | 0.057429 | 0.817904 | 0.69191 | 1.0 | 0.16946 |
| 19 | | 99 | 4 | 0.690189 | 0.057286 | 0.816701 | 0.690189 | 1.0 | 0.170203 |
| 27 | | 61 | 4 | 0.659208 | 0.054714 | 0.794606 | 0.659208 | 1.0 | 0.18339 |
| 50 | | 21 | 4 | 0.657487 | 0.054571 | 0.793354 | 0.657487 | 1.0 | 0.184113 |
| 35 | | 58 | 4 | 0.652324 | 0.054143 | 0.789583 | 0.652324 | 1.0 | 0.186275 |


In this example we focus on the F1-score, so we choose the rule with the highest `f1_score`. In the case of a tie, we select the `variable` with the fewest features, which keeps the rule as simple as possible.
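
The selection criterion can also be written as a single sort key. The sketch below is illustrative only; it copies `variable`, `n_features`, and `f1_score` from the five rows of the summary table and picks the best candidate in plain Python.

```python
# (variable, n_features, f1_score) copied from the summary table above.
candidates = [
    (124, 3, 0.817904),
    (99, 4, 0.816701),
    (61, 4, 0.794606),
    (21, 4, 0.793354),
    (58, 4, 0.789583),
]

# Highest F1 first; among equal F1 values, the fewest features wins.
best_variable, best_n_features, best_f1 = min(
    candidates, key=lambda row: (-row[2], row[1])
)
print(best_variable)
```

This mirrors the `sort_values(by=["f1_score", "n_features"], ascending=[False, True])` call used on the DataFrame.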

To create the $1^{st}$ rule, we use **`RuleToFeature`** to convert the selected rule into an array of features.

In [29]:
rule1_index = 124
FirstRule = RuleToFeature(X_train, asso1.asso_results_, which_rules=[rule1_index])

Use **`print_rule`** to tabulate the rule information, i.e. its intervals.

In [30]:
print_rule(FirstRule[1][rule1_index])

Operator:  and
+------+--------------------------------+------+-------+
| Item | Variable                       | Sign | Value |
+------+--------------------------------+------+-------+
|  1   | used_pin_number                |  ==  |    no |
|  2   | ratio_to_median_purchase_price |  >=  |  4.07 |
|  3   | online_order                   |  ==  |   yes |
+------+--------------------------------+------+-------+


Before determining the next rule, we exclude the instances that meet the $1^{st}$ rule.

In [31]:
index = FirstRule[0].values.ravel()
X2 = X_train.loc[~index] 
y2 = y_train[~index]
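
The filtering above relies on `index` being a boolean array. The source does not state the dtype returned by `RuleToFeature`, so the sketch below, using made-up arrays, shows why it matters: on 0/1 integers `~` is a bitwise NOT rather than a logical negation.

```python
import numpy as np

captured = np.array([True, False, True, False])
y = np.array([1, 0, 1, 1])

# Boolean mask: `~` flips True/False, so this keeps the uncaptured rows.
kept = y[~captured]
print(kept)

# Integer 0/1 array: `~` yields -1 and -2, which NumPy would treat as
# positions rather than as a mask. Casting to bool first avoids this.
as_int = captured.astype(int)
print(~as_int)
kept_from_int = y[~as_int.astype(bool)]
print(kept_from_int)
```

If the feature column ever arrives as integers, an explicit `.astype(bool)` before negating keeps the filter correct.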

In [32]:
discr_X2, rules2 = discretize(X2, n_cutoffs=20)

In [33]:
asso2 = AssoRuleMining(metric="f1", operator="and", n_jobs=3).fit(discr_X2, y2, rules=rules2)

HBox(children=(HTMLMath(value='Calculating . . .'), HTMLMath(value='')))

In [34]:
pd.DataFrame(asso2.info).sort_values(by=["f1_score","n_features"], ascending=[False,True]).head(5)

| | start_with | variable | n_features | p_target | p_sample | f1_score | recall | precision | entropy |
|---|---|---|---|---|---|---|---|---|---|
| 22 | | 122 | 4 | 0.620112 | 0.017884 | 0.747475 | 0.620112 | 0.940678 | 0.088354 |
| 26 | | 98 | 5 | 0.608939 | 0.017581 | 0.738983 | 0.608939 | 0.939655 | 0.090314 |
| 69 | | 61 | 5 | 0.603352 | 0.017278 | 0.737201 | 0.603352 | 0.947368 | 0.090663 |
| 43 | | 101 | 5 | 0.592179 | 0.017126 | 0.726027 | 0.592179 | 0.938053 | 0.093229 |
| 68 | | 62 | 5 | 0.586592 | 0.016823 | 0.724138 | 0.586592 | 0.945946 | 0.093575 |


Create the $2^{nd}$ rule.

In [35]:
rule2_index = 122
SecondRule = RuleToFeature(X_train, asso2.asso_results_, which_rules=[rule2_index])
print_rule(SecondRule[1][rule2_index])

Operator:  and
+------+--------------------+------+-------+
| Item | Variable           | Sign | Value |
+------+--------------------+------+-------+
|  1   | used_chip          |  ==  |    no |
|  2   | distance_from_home |  >=  | 96.43 |
|  3   | online_order       |  ==  |   yes |
|  4   | used_pin_number    |  ==  |    no |
+------+--------------------+------+-------+


Summary on `X_train`

In [36]:
corr = np.corrcoef(np.hstack((FirstRule[0], SecondRule[0])).T)[0,1]
print("Correlation between 1st and 2nd rules : {:.2%}".format(corr))

Correlation between 1st and 2nd rules : 2.06%


Since the correlation is negligible (2.06%), we do not add the negation of the first rule to the second rule.
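
For two boolean rule outputs, the Pearson correlation computed by `np.corrcoef` is the phi coefficient. The sketch below uses made-up vectors to show both the correlation and what adding the negation of the first rule to the second would look like; the variable names are illustrative.

```python
import numpy as np

rule1 = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
rule2 = np.array([0, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

# Pearson correlation of two 0/1 vectors (the phi coefficient).
corr = np.corrcoef(rule1.astype(float), rule2.astype(float))[0, 1]

# If the overlap mattered, the second rule could be restricted to samples
# that the first rule misses; the union of captured samples is unchanged.
rule2_only = rule2 & ~rule1
same_union = np.array_equal(rule1 | rule2, rule1 | rule2_only)
print(round(float(corr), 3), int(rule2_only.sum()), same_union)
```

Because the union is identical either way, the negation only changes how captured samples are attributed to each rule, not the combined prediction.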

In [37]:
y_pred_train = (FirstRule[0].values | SecondRule[0].values)
print_stats(y_train, y_pred_train)

+----------------+-------+-------+
| Statistics     | Value |     % |
+----------------+-------+-------+
| N              | 7,000 |       |
| Target         |   581 |  8.3% |
| True Positive  |   513 |  7.3% |
| True Negative  | 6,412 | 91.6% |
| False Positive |     7 |  0.1% |
| False Negative |    68 |  1.0% |
| Precision      |       | 98.7% |
| Recall         |       | 88.3% |
| Accuracy       |       | 98.9% |
| F1-Score       |       | 93.2% |
+----------------+-------+-------+


Summary on `X_test`

In [38]:
y_pred_test = (RuleToFeature(X_test, asso1.asso_results_, which_rules=[rule1_index])[0].values |
               RuleToFeature(X_test, asso2.asso_results_, which_rules=[rule2_index])[0].values)
print_stats(y_test, y_pred_test, 0)

+----------------+-------+------+
| Statistics     | Value |    % |
+----------------+-------+------+
| N              | 3,000 |      |
| Target         |   258 |   9% |
| True Positive  |   211 |   7% |
| True Negative  | 2,742 |  91% |
| False Positive |     0 |   0% |
| False Negative |    47 |   2% |
| Precision      |       | 100% |
| Recall         |       |  82% |
| Accuracy       |       |  98% |
| F1-Score       |       |  90% |
+----------------+-------+------+


Alternatively, we can use **`evaluate_rules`** to evaluate all datasets at the same time.

In [39]:
rules=[asso1.asso_results_[rule1_index], asso2.asso_results_[rule2_index]]
evaluate_rules([(X_train,y_train), (X_test,y_test)], rules=rules, operator="or")

EvalResults(sample=[7000, 3000], target=[581, 258], tp=[513, 211], fp=[7, 0], fn=[68, 47], tn=[6412, 2742], recall=[0.882960413080895, 0.8178294573643411], precision=[0.9865384615384616, 1.0], f1=[0.9318801089918256, 0.8997867803837952], accuracy=[0.9892857142857143, 0.9843333333333333])
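
The reported scores can be re-derived from the confusion counts alone. The sketch below is a plain-Python check using the `tp`, `fp`, `fn`, and `tn` values printed above; the helper name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Counts reported above for the train and test sets.
train = metrics(tp=513, fp=7, fn=68, tn=6412)
test = metrics(tp=211, fp=0, fn=47, tn=2742)
print(train)
print(test)
```

Agreement with the `EvalResults` fields is a quick sanity check that both evaluation routes combine the two rules in the same way.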

## Creation of rules (2)
- Convert all rules into features.
- Determine combinations of rules that optimize the evaluating metric. This can be used as validation of rules.
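
One simple way to search for such combinations is greedy forward selection over the rule features, joined with a logical OR. The sketch below is an illustration under that assumption and is not how `AssoRuleMining` explores combinations; `f1` and `greedy_or` are hypothetical names and the data is made up.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1-score from boolean arrays; 0.0 when there are no true positives."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))

def greedy_or(features, y):
    """Add rule columns one by one while the OR-combination improves F1."""
    chosen, pred, best = [], np.zeros(len(y), dtype=bool), 0.0
    while True:
        trials = [(f1(y, pred | features[:, j]), j)
                  for j in range(features.shape[1]) if j not in chosen]
        if not trials:
            break
        # Highest score wins; ties go to the lowest column index.
        score, j = max(trials, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break  # no further improvement
        chosen.append(j)
        pred |= features[:, j]
        best = score
    return chosen, best

# Toy rule features: columns 0 and 1 are precise, column 2 is noisy.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
features = np.array([
    [1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0],
], dtype=bool)
chosen, best = greedy_or(features, y)
print(chosen, best)
```

A greedy search can miss the best subset, which is one reason to treat this step as a validation of the rules found earlier rather than as a replacement for them.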

Keeping only the variables that capture more than `x`% of the target removes features whose impact is insignificant. For this example, we use 1% (`p_target > 0.01`).

In [43]:
which_rules = np.array(asso1.info["variable"])[np.array(asso1.info["p_target"])>0.01]
discr_X3, rules3 = RuleToFeature(X_train, asso1.asso_results_, which_rules=which_rules)

In [44]:
asso3 = AssoRuleMining(metric="f1", operator="or", n_jobs=4, n_batches=5)
asso3.fit(discr_X3, y_train, rules=rules3)

HBox(children=(HTMLMath(value='Calculating . . .'), HTMLMath(value='')))

<__main__.AssoRuleMining at 0x1bae1dae580>

In [45]:
pd.DataFrame(asso3.info).sort_values(by=["f1_score", "n_features"], ascending=[False, True]).head(5)

| | start_with | variable | n_features | p_target | p_sample | f1_score | recall | precision | entropy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | 0 | 4 | 0.908778 | 0.077 | 0.942857 | 0.908778 | 0.979592 | 0.074412 |
| 1 | | 121 | 4 | 0.924269 | 0.079857 | 0.942105 | 0.924269 | 0.960644 | 0.073367 |
| 2 | | 2 | 4 | 0.924269 | 0.080714 | 0.937173 | 0.924269 | 0.950442 | 0.07721 |


We select the rule set from `variable` 0 because the correlations between its rules are low.

In [46]:
rule3_index = 0
selected_rules = asso3.asso_results_[rule3_index].features
np.round(RuleToFeature(X_train, asso1.asso_results_, 
                       which_rules=selected_rules)[0].corr(),2)

| | 0 | 21 | 39 | 126 |
|---|---|---|---|---|
| 0 | 1.0 | -0.01 | -0.01 | -0.0 |
| 21 | -0.01 | 1.0 | 0.02 | -0.01 |
| 39 | -0.01 | 0.02 | 1.0 | -0.01 |
| 126 | -0.0 | -0.01 | -0.01 | 1.0 |


See all selected rules and their subrules.

In [47]:
for n,r in zip(asso3.asso_results_[rule3_index].features,
               asso3.asso_results_[rule3_index].rule):
    print("Rule number: ",n); print_rule(r); print()

Rule number:  0
Operator:  and
+------+--------------------------------+------+-------+
| Item | Variable                       | Sign | Value |
+------+--------------------------------+------+-------+
|  1   | distance_from_home             |  <   |  1.00 |
|  2   | ratio_to_median_purchase_price |  >=  |  4.07 |
|  3   | used_pin_number                |  ==  |    no |
+------+--------------------------------+------+-------+

Rule number:  21
Operator:  and
+------+--------------------------------+------+-------+
| Item | Variable                       | Sign | Value |
+------+--------------------------------+------+-------+
|  1   | distance_from_home             |  >=  |  1.00 |
|  2   | ratio_to_median_purchase_price |  >=  |  4.07 |
|  3   | online_order                   |  ==  |   yes |
|  4   | used_pin_number                |  ==  |    no |
+------+--------------------------------+------+-------+

Rule number:  39
Operator:  and
+------+--------------------+------+-------+
| I

Summary on `X_train`

In [48]:
y_pred_train = RuleToFeature(X_train, asso1.asso_results_, which_rules=selected_rules)[0].sum(1)>0
print_stats(y_train, y_pred_train)

+----------------+-------+-------+
| Statistics     | Value |     % |
+----------------+-------+-------+
| N              | 7,000 |       |
| Target         |   581 |  8.3% |
| True Positive  |   528 |  7.5% |
| True Negative  | 6,408 | 91.5% |
| False Positive |    11 |  0.2% |
| False Negative |    53 |  0.8% |
| Precision      |       | 98.0% |
| Recall         |       | 90.9% |
| Accuracy       |       | 99.1% |
| F1-Score       |       | 94.3% |
+----------------+-------+-------+


Summary on `X_test`

In [49]:
y_pred_test = RuleToFeature(X_test, asso1.asso_results_, which_rules=selected_rules)[0].sum(1)>0
print_stats(y_test, y_pred_test)

+----------------+-------+-------+
| Statistics     | Value |     % |
+----------------+-------+-------+
| N              | 3,000 |       |
| Target         |   258 |  8.6% |
| True Positive  |   223 |  7.4% |
| True Negative  | 2,740 | 91.3% |
| False Positive |     2 |  0.1% |
| False Negative |    35 |  1.2% |
| Precision      |       | 99.1% |
| Recall         |       | 86.4% |
| Accuracy       |       | 98.8% |
| F1-Score       |       | 92.3% |
+----------------+-------+-------+


In [50]:
rules = [asso1.asso_results_[n] for n in selected_rules]
evaluate_rules([(X_train,y_train), (X_test,y_test)], rules=rules, operator="or")

EvalResults(sample=[7000, 3000], target=[581, 258], tp=[528, 223], fp=[11, 2], fn=[53, 35], tn=[6408, 2740], recall=[0.9087779690189329, 0.8643410852713178], precision=[0.9795918367346939, 0.9911111111111112], f1=[0.9428571428571427, 0.9233954451345755], accuracy=[0.9908571428571429, 0.9876666666666667])

## Creation of rules (3)
- Create a set of rules of your own choice.

In [51]:
subrules = [('ratio_to_median_purchase_price', '>=', 4.065), 
            ('online_order', '==', 'yes'), 
            ('used_pin_number', '==', 'no')]
operator = 'and'
rule1 = create_rule(subrules)

In [52]:
subrules = [('distance_from_home', '>=', 96.4349), 
            ('used_chip', '==', 'no'), 
            ('online_order', '==', 'yes'), 
            ('used_pin_number', '==', 'no')]
operator = 'and'
rule2 = create_rule(subrules)

In [53]:
evaluate_rules([(X_train,y_train), (X_test,y_test)], 
               rules=[rule1, rule2], operator="or")

EvalResults(sample=[7000, 3000], target=[581, 258], tp=[513, 211], fp=[7, 0], fn=[68, 47], tn=[6412, 2742], recall=[0.882960413080895, 0.8178294573643411], precision=[0.9865384615384616, 1.0], f1=[0.9318801089918256, 0.8997867803837952], accuracy=[0.9892857142857143, 0.9843333333333333])
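
To show what a hand-written rule means operationally, the sketch below evaluates `(variable, sign, value)` triples against plain dictionaries. It is an illustration only and does not reproduce `create_rule` or `evaluate_rules`; `SIGNS`, `apply_rule`, and the two sample transactions are hypothetical.

```python
import operator

# Map each sign used in a subrule to the matching comparison function.
SIGNS = {"==": operator.eq, "!=": operator.ne,
         ">=": operator.ge, ">": operator.gt,
         "<=": operator.le, "<": operator.lt}

def apply_rule(row, subrules, how="and"):
    """Evaluate (variable, sign, value) triples against one record (a dict)."""
    hits = [SIGNS[sign](row[var], value) for var, sign, value in subrules]
    return all(hits) if how == "and" else any(hits)

subrules = [("ratio_to_median_purchase_price", ">=", 4.065),
            ("online_order", "==", "yes"),
            ("used_pin_number", "==", "no")]

# Two made-up transactions that differ only in PIN usage.
rows = [
    {"ratio_to_median_purchase_price": 5.2,
     "online_order": "yes", "used_pin_number": "no"},
    {"ratio_to_median_purchase_price": 5.2,
     "online_order": "yes", "used_pin_number": "yes"},
]
flags = [apply_rule(r, subrules) for r in rows]
print(flags)
```

With `how="and"` every subrule must hold, so the second transaction is not captured once a PIN is used.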