
# 12 Association Rules
Association rules describe relationships between items or variable values that tend to occur together in a dataset. They are widely used in data mining, machine learning, and business intelligence to surface meaningful co-occurrence patterns in large datasets.

In the context of association rule mining, a dataset typically consists of transactions, where each transaction contains a set of items. The goal is to identify rules that indicate the likelihood of certain items appearing together in transactions. These rules are usually expressed in the form of "if X, then Y," where X and Y are sets of items.
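When the data is a table rather than a list of baskets, one common approach (an assumption here, not something the dataset dictates) is to treat each row as a transaction and each `column=value` pair as an item. The sketch below uses made-up records and a hypothetical helper named `to_transactions`:

```python
# Hypothetical records, invented for illustration only
records = [
    {"housing": "yes", "loan": "no", "deposit": "no"},
    {"housing": "no", "loan": "no", "deposit": "yes"},
    {"housing": "yes", "loan": "yes", "deposit": "no"},
]

def to_transactions(records):
    """Turn each record into a list of 'column=value' items."""
    return [[f"{col}={val}" for col, val in rec.items()] for rec in records]

transactions = to_transactions(records)
for t in transactions:
    print(t)
```

Prefixing the value with its column name keeps `housing=yes` and `loan=yes` apart, which a bare `yes` would not.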

There are two fundamental metrics used to evaluate association rules:

**Support (Sup):** The proportion of transactions that contain a given itemset. High support means the itemset is common in the dataset.

**Confidence (Conf):** The proportion of transactions containing X that also contain Y, for a rule "if X, then Y." High confidence means that Y usually appears whenever X does.

Additionally, there are other metrics like lift, conviction, and leverage that provide further insights into the relationships between items.
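To make these metrics concrete, here is a minimal sketch that computes support, confidence, and lift by hand. The transactions are invented, and the helper names `support`, `confidence`, and `lift` are illustrative rather than part of any library:

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup = support({"bread", "milk"}, transactions)
conf = confidence({"bread"}, {"milk"}, transactions)
lft = lift({"bread"}, {"milk"}, transactions)
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lft:.4f}")
```

A lift below 1, as in this toy example, means that the antecedent makes the consequent slightly less likely than its baseline rate.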

Association rule mining is commonly used in various applications, such as market basket analysis, where retailers aim to understand purchasing patterns, and recommendation systems, where the goal is to suggest items based on users' past behaviors.

Popular algorithms for association rule mining include Apriori, Eclat, and FP-growth. These algorithms efficiently discover frequent itemsets and generate association rules from large datasets.
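The level-wise idea behind Apriori can be sketched in a few lines. This is a simplified illustration under the assumption of a small in-memory dataset, not how apyori or any production library implements it, and `frequent_itemsets` is a hypothetical name:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sup(itemset)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent

# Toy transactions (hypothetical data)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without counting.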

In [2]:
!pip install apyori



## 12.1 How to Mine Association Rules Using Python


In [5]:
import pandas as pd
from apyori import apriori

# Load the bank dataset (adjust the path to wherever bank.csv is stored)
bank_data = pd.read_csv('/content/bank.csv')

# Keep only the categorical columns and store every value as a string
cols = ["job", "marital", "education", "default", "housing", "loan", "contact", "poutcome", "deposit"]
min_bank = bank_data[cols].astype(str)

# Print column names to check for correctness
print(min_bank.columns)

# Baseline distribution of each variable (proportion of rows per category)
baseline_tables = {col: min_bank[col].value_counts(normalize=True) for col in cols}

# One transaction per row. Each item is labelled "column=value" so that,
# for example, housing=yes and loan=yes remain distinct items.
transactions = [
    [f"{col}={val}" for col, val in zip(cols, row)]
    for row in min_bank.itertuples(index=False, name=None)
]

# Run apriori. apyori reads min_support, min_confidence, min_lift and
# max_length; it has no min_length option, so entries with an empty
# antecedent are dropped below instead. Adding max_length limits rule size
# and runtime.
results = list(apriori(transactions, min_support=0.01, min_confidence=0.4, min_lift=1))

# Flatten the results into one row per rule
rows = []
for record in results:
    for stat in record.ordered_statistics:
        if not stat.items_base:
            continue  # skip entries with an empty antecedent
        rows.append((", ".join(sorted(stat.items_base)), ", ".join(sorted(stat.items_add)),
                     record.support, stat.confidence, stat.lift))
rules_df = pd.DataFrame(rows, columns=["antecedent", "consequent", "support", "confidence", "lift"])

# Filter out rules containing 'deposit' in the antecedent
filtered_rules_df = rules_df[~rules_df["antecedent"].str.contains("deposit=")]

# Display the top 10 rules sorted by lift
top_rules = filtered_rules_df.sort_values(by="lift", ascending=False).head(10)
print(top_rules)


Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'poutcome', 'deposit'],
      dtype='object')


## 12.2 How to Apply the Confidence Difference Criterion Using Python
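The confidence difference criterion is usually described as keeping a rule only when its confidence (the posterior) differs from the baseline rate of the consequent (the prior) by at least some threshold. The sketch below assumes that definition and uses made-up numbers. The helper names are hypothetical, and the prior is recovered from the identity lift = confidence / support(consequent):

```python
def confidence_difference(prior, posterior):
    """Absolute gap between the prior and posterior confidence of a rule."""
    return abs(posterior - prior)

def prior_from_rule(confidence, lift):
    """Recover the consequent's support from a rule's confidence and lift."""
    return confidence / lift

# Hypothetical rule: confidence 0.62 with lift 1.24
posterior = 0.62
prior = prior_from_rule(posterior, lift=1.24)
diff = confidence_difference(prior, posterior)
keep = diff >= 0.1  # threshold chosen for illustration
print(f"prior={prior:.2f}, difference={diff:.2f}, keep={keep}")
```

Because the prior can be derived from confidence and lift, the criterion can be applied to mined rules without recounting the transactions.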



In [14]:
import pandas as pd
from apyori import apriori

# Load the bank dataset (adjust the path to wherever bank.csv is stored)
bank_data = pd.read_csv('/content/bank.csv')

# Keep only the categorical columns and store every value as a string
cols = ["job", "marital", "education", "default", "housing", "loan", "contact", "poutcome", "deposit"]
min_bank = bank_data[cols].astype(str)

# One transaction per row, with items labelled "column=value"
transactions = [
    [f"{col}={val}" for col, val in zip(cols, row)]
    for row in min_bank.itertuples(index=False, name=None)
]

# Mine candidate rules. apriori() returns a one-shot generator, so list()
# keeps the results available for the filtering step below.
rules_confdiff = list(apriori(transactions, min_support=0.01, min_confidence=0.4, min_lift=1))
print(len(rules_confdiff), "itemsets returned before filtering")

# Confidence difference: |posterior - prior|, where posterior is the rule
# confidence and prior is the support of the consequent (confidence / lift)
confidence_difference_threshold = 0.1  # Set your desired confidence difference threshold
filtered_rules_confdiff = []
for rule in rules_confdiff:
    for stat in rule.ordered_statistics:
        if not stat.items_base:
            continue  # skip entries with an empty antecedent
        prior = stat.confidence / stat.lift
        diff = abs(stat.confidence - prior)
        if diff >= confidence_difference_threshold:
            filtered_rules_confdiff.append(
                (sorted(stat.items_base), sorted(stat.items_add), stat.confidence, diff))

# Display the first 10 filtered rules
for antecedent, consequent, confidence, diff in filtered_rules_confdiff[:10]:
    print(f"{antecedent} -> {consequent}: confidence={confidence:.3f}, difference={diff:.3f}")



## 12.3 How to Apply the Confidence Quotient Criterion Using Python
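The confidence quotient criterion compares the prior and posterior by ratio rather than by difference. A common formulation, assumed here, is 1 - min(prior/posterior, posterior/prior). The sketch below uses invented numbers and a hypothetical helper to show why this can matter for rare consequents:

```python
def confidence_quotient(prior, posterior):
    """1 - min(prior/posterior, posterior/prior); 0 means no change."""
    return 1 - min(prior / posterior, posterior / prior)

# Hypothetical rules given as (prior, posterior)
examples = {"common": (0.50, 0.60), "rare": (0.05, 0.10)}
for name, (prior, posterior) in examples.items():
    diff = abs(posterior - prior)
    quot = confidence_quotient(prior, posterior)
    print(f"{name}: difference={diff:.2f}, quotient={quot:.2f}")
```

The "rare" rule doubles the chance of its consequent yet has a small absolute difference, so a difference threshold would discard it while a quotient threshold would keep it.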

In [15]:
import pandas as pd
from apyori import apriori

# Load the bank dataset (adjust the path to wherever bank.csv is stored)
bank_data = pd.read_csv('/content/bank.csv')

# Keep only the categorical columns and store every value as a string
cols = ["job", "marital", "education", "default", "housing", "loan", "contact", "poutcome", "deposit"]
min_bank = bank_data[cols].astype(str)

# One transaction per row, with items labelled "column=value"
transactions = [
    [f"{col}={val}" for col, val in zip(cols, row)]
    for row in min_bank.itertuples(index=False, name=None)
]

# Mine candidate rules. apyori reads only min_support, min_confidence,
# min_lift and max_length; the arules-style options arem, aval, minval and
# target are not used by it, so the quotient is computed by hand below.
rules_confquot = list(apriori(transactions, min_support=0.01, min_confidence=0.05,
                              min_lift=1, max_length=2))

# Confidence quotient, taken here as 1 - min(prior/posterior, posterior/prior),
# where posterior is the rule confidence and prior is the support of the
# consequent (recovered as confidence / lift)
min_quotient = 0.4
rows = []
for record in rules_confquot:
    for stat in record.ordered_statistics:
        if not stat.items_base:
            continue  # skip entries with an empty antecedent
        prior = stat.confidence / stat.lift
        quotient = 1 - min(prior / stat.confidence, stat.confidence / prior)
        if quotient >= min_quotient:
            rows.append((", ".join(stat.items_base), ", ".join(stat.items_add),
                         record.support, stat.confidence, stat.lift, quotient))
rules_confquot_df = pd.DataFrame(
    rows, columns=["antecedent", "consequent", "support", "confidence", "lift", "quotient"])

# Filter out rules containing 'deposit' in the antecedent
filtered_rules_confquot_df = rules_confquot_df[
    ~rules_confquot_df["antecedent"].str.contains("deposit=")]

# Display the rules
print(filtered_rules_confquot_df)