# Homework 2 - association rules

In this homework we mine association rules between the attributes of the obesity dataset. This notebook uses the following formulas:

$$
\begin{align*}
    support(X) &= P(X) \\
            &= \frac{\text{number of instances containing X in the dataset}}{\text{total number of instances in the dataset}} \\
    \\
    support(X \rightarrow Y) &= P(X \cap Y) \\
                            &= support(X \cup Y) \\
                            &= \frac{\text{number of instances containing both X and Y in the dataset}}{\text{total number of instances in the dataset}} \\
    \\
    confidence(X \rightarrow Y) &= P(Y|X) = \frac{P(X \cap Y)}{P(X)} \\
                                &= \frac{support(X \cup Y)}{support(X)} \\
    \\
    lift(X \rightarrow Y) &= \frac{P(X \cap Y)}{P(X) \cdot P(Y)} \\
                          &= \frac{support(X \cup Y)}{support(X) \cdot support(Y)} \\
    \\
    conviction(X \rightarrow Y) &= \frac{1 - P(Y)}{1 - P(Y|X)} = \frac{1 - P(Y)}{1 - \frac{P(X \cap Y)}{P(X)}} \\
                                &= \frac{1 - support(Y)}{1 - confidence(X \rightarrow Y)} \\
\end{align*}
$$
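
As a quick sanity check of the formulas above, here is a minimal sketch that evaluates them on four toy transactions. The transactions are made up for illustration and are not taken from the obesity dataset; the helper names are hypothetical.

```python
from fractions import Fraction

# Toy transactions, invented for illustration only.
transactions = [
    {"SMOKE=no", "FAVC=yes"},
    {"SMOKE=no", "FAVC=yes"},
    {"SMOKE=no", "FAVC=no"},
    {"SMOKE=yes", "FAVC=yes"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    return support(lhs | rhs) / (support(lhs) * support(rhs))


def conviction(lhs, rhs):
    conf = confidence(lhs, rhs)
    if conf == 1:
        return None  # the denominator is zero, so conviction is undefined
    return (1 - support(rhs)) / (1 - conf)


X = {"SMOKE=no"}
Y = {"FAVC=yes"}
print(support(X | Y), confidence(X, Y), lift(X, Y), conviction(X, Y))
```

Exact fractions are used instead of floats so that each value can be checked by hand against the definitions.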

We will use the Apriori and ECLAT algorithms.
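
To give an intuition for what Apriori does, the following is a minimal pure-Python sketch of its level-wise search for frequent itemsets. It is an illustration under simplifying assumptions (tiny invented data, no rule generation) and not the implementation used by `arules`; the function name is hypothetical.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting `min_support`."""
    total = len(transactions)

    def support(itemset):
        return sum(1 for transaction in transactions if itemset <= transaction) / total

    items = sorted({item for transaction in transactions for item in transaction})
    level = [frozenset([item]) for item in items]  # candidates of size 1
    frequent = {}
    k = 1
    while level:
        # Count step: keep the candidates that reach the support threshold.
        survivors = {}
        for candidate in level:
            candidate_support = support(candidate)
            if candidate_support >= min_support:
                survivors[candidate] = candidate_support
        frequent.update(survivors)

        # Join step: merge pairs of frequent k-itemsets into (k + 1)-itemsets.
        k += 1
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}

        # Prune step: every (k - 1)-subset of a candidate must itself be frequent.
        level = [
            candidate
            for candidate in joined
            if all(frozenset(sub) in survivors for sub in combinations(candidate, k - 1))
        ]
    return frequent


toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori_frequent_itemsets(toy, min_support=0.5)
print(sorted("".join(sorted(itemset)) for itemset in result))
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so `{a, b, c}` is discarded without being counted because `{b, c}` is not frequent.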

## Conclusions

Using the Apriori algorithm, we reached the following conclusions:
- for numerical - numerical attribute analysis:
  - With above-average confidence (`0.67`) but low support (`0.27`), people who **consume many vegetables per day** also tend to **eat more meals**
- for categorical - categorical attribute analysis:
  - The majority of the surveyed population **does not smoke** (support `0.97`, confidence `0.97`)
  - The majority of the surveyed population **does not monitor calorie intake** (support `0.95`, confidence `0.95`)
  - Most of the surveyed people who **have an overweight family member** reported **frequent consumption of high-caloric food** (support `0.74`, confidence `0.91`)
- for categorical + numerical - categorical + numerical attribute analysis:
  - With rather high confidence (`0.90`) but rather low support (`0.40`), people who **eat 3 meals per day**, **sometimes eat between meals** and **do not monitor calorie intake** tend to have **an overweight family member**
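
Because confidence is defined as support(X ∪ Y) / support(X), the support of each rule's antecedent (its coverage) can be recovered from the numbers quoted above. The sketch below does this arithmetic; the rule names are shorthand labels chosen here, and the results are approximate because the quoted inputs are already rounded.

```python
# (support, confidence) pairs quoted in the conclusions above.
reported = {
    "vegetables -> more meals": (0.27, 0.67),
    "overweight family member -> high-caloric food": (0.74, 0.91),
    "3 meals, snacking, no monitoring -> overweight family member": (0.40, 0.90),
}


def antecedent_support(rule_support, rule_confidence):
    """support(X) = support(X u Y) / confidence(X -> Y)."""
    return rule_support / rule_confidence


coverage = {
    name: round(antecedent_support(rule_support, rule_confidence), 2)
    for name, (rule_support, rule_confidence) in reported.items()
}

for name, value in coverage.items():
    print(f"{name}: antecedent support ~ {value}")
```

For example, roughly 81% of the surveyed people satisfy the antecedent of the second rule, which puts its high support into perspective.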

## Dependencies

### General dependencies

Imports for Python

In [None]:
import copy
import os
import platform

if platform.system() == "Windows":
    # Raw string so the backslashes in the Windows path are not read as escape sequences
    os.environ['R_HOME'] = r'C:\Program Files\R\R-4.3.3'

In [None]:
import typing as t
import csv
import numpy as np
import numpy.typing as npt
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import HTML, IFrame

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline


In [None]:
import arulespy.arules as arules
import arulespy.arulesViz as arulesviz


In [None]:
import seaborn as sns
import rpy2.robjects.packages as packages
import rpy2.robjects.lib.ggplot2 as gp
from rpy2.ipython.ggplot import image_png
from arulespy.arulesViz import plot, inspectDT
from catscatter import catscatter

htmlwidgets = packages.importr("htmlwidgets")


### Dataset-specific dependencies

Dataset manager, known labels, known outputs for the dataset.

In [None]:
LABEL_VARIABLE = "NObeyesdad"
NUMERICAL_VARIABLES = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
CATEGORICAL_VARIABLES_NO_LABEL = [
    "FAVC",
    "CAEC",
    "CALC",
    "SCC",
    "MTRANS",
    "Gender",
    "family_history_with_overweight",
    "SMOKE",
]
CATEGORICAL_VARIABLES = [
    *CATEGORICAL_VARIABLES_NO_LABEL,
    LABEL_VARIABLE,
]
ALL_VARIABLES_NO_LABEL = [*NUMERICAL_VARIABLES, *CATEGORICAL_VARIABLES_NO_LABEL]
ALL_VARIABLES = [*NUMERICAL_VARIABLES, *CATEGORICAL_VARIABLES]
LABEL_DICTIONARY = {
    "Age": "Age",
    "Height": "Height (m)",
    "Weight": "Weight (kg)",
    "FCVC": "Frequency of consumption of vegetables (times per day)",
    "NCP": "Number of main meals",
    "CH2O": "Consumption of water daily (Liters)",
    "FAF": "Physical activity frequency (times per day)",
    "TUE": "Time using technology devices (hours)",
    "FAVC": "Frequent consumption of high caloric food",
    "CAEC": "Consumption of food between meals",
    "CALC": "Consumption of alcohol",
    "SCC": "Calories consumption monitoring",
    "MTRANS": "Transportation used",
    "Gender": "Gender",
    "family_history_with_overweight": "Family member suffered or suffers from overweight",
    "SMOKE": "Smoker or not",
    "NObeyesdad": "Obesity level",
}

T = t.TypeVar("T")


class Person:
    Gender: str
    Age: np.float32
    Height: np.float32
    Weight: np.float32
    family_history_with_overweight: str
    FAVC: str
    FCVC: np.float32
    NCP: np.float32
    CAEC: str
    SMOKE: str
    CH2O: np.float32
    SCC: str
    FAF: np.float32
    TUE: np.float32
    CALC: str
    MTRANS: str
    NObeyesdad: str

    def __init__(
        self,
        Gender: str,
        Age: str,
        Height: str,
        Weight: str,
        family_history_with_overweight: str,
        FAVC: str,
        FCVC: str,
        NCP: str,
        CAEC: str,
        SMOKE: str,
        CH2O: str,
        SCC: str,
        FAF: str,
        TUE: str,
        CALC: str,
        MTRANS: str,
        NObeyesdad: str,
    ):
        self.Gender = Gender
        self.Age = np.float32(Age)
        self.Height = np.float32(Height)
        self.Weight = np.float32(Weight)
        self.family_history_with_overweight = family_history_with_overweight
        self.FAVC = FAVC
        self.FCVC = np.float32(FCVC)
        self.NCP = np.float32(NCP)
        self.CAEC = CAEC
        self.SMOKE = SMOKE
        self.CH2O = np.float32(CH2O)
        self.SCC = SCC
        self.FAF = np.float32(FAF)
        self.TUE = np.float32(TUE)
        self.CALC = CALC
        self.MTRANS = MTRANS
        self.NObeyesdad = NObeyesdad

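    def __str__(self):
        # __str__ must return a string, not the attribute dict itself
        return str(vars(self))

    def __len__(self):
        return len(vars(self))

    def __repr__(self):
        # Same requirement as __str__: the return value must be a string
        return repr(vars(self))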


class DatasetManager:
    def __init__(self, path_to_csv: str):
        self.path_to_csv = path_to_csv

    def load_as_obj_list(self) -> list[Person]:
        with open(self.path_to_csv) as csv_file:
            csv_reader = csv.DictReader(csv_file)
            return [Person(**row) for row in csv_reader]

In [None]:
dataset_manager = DatasetManager("data/ObesityDataSet.csv")
dataset_obj_list = dataset_manager.load_as_obj_list()
dataset_dataframe = pd.DataFrame.from_records(
    data=[vars(entry) for entry in dataset_obj_list]
)

### Categorical data utility functions

Here we add utility functions (if any) for the categorical data types.

In [None]:
# TODO: add if any are found

### Continuous data utility functions

Here we add utility functions (if any) for the continuous data types.
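
The two binning strategies used in this section can be illustrated on small inputs. This is a minimal sketch, assuming `pandas` is available; the sample values are invented, and the age edges mirror the predefined age bins defined in this section.

```python
import pandas as pd

# Equal-frequency binning: the edges are data quantiles, so every bin
# receives (roughly) the same number of values.
values = pd.Series(range(1, 10))
equal_frequency = pd.qcut(values, q=3, duplicates="drop")
print(equal_frequency.value_counts().sort_index())

# Predefined binning: fixed, right-closed intervals with readable labels.
ages = pd.Series([10, 18, 19, 40])
edges = [0, 12, 18, 26, 36, 46, 60, 200]
labels = [f"({low}, {high}]" for low, high in zip(edges[:-1], edges[1:])]
age_groups = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(age_groups.astype(str).tolist())
```

Note that an age of exactly 18 falls into `(12, 18]` because the intervals are closed on the right.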

In [None]:
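def get_np_bins_and_labels(bins: list[tuple[float, float]]) -> tuple[npt.NDArray[np.float32], list[str]]:
    # Bin edges: the left bound of every bin plus the right bound of the last one
    edges = np.array([lh for lh, _ in bins] + [bins[-1][1]], dtype=np.float32)
    # One human-readable label per bin, written as a right-closed interval
    labels = [f"({lh}, {rh}]" for lh, rh in bins]
    return edges, labels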


predifined_bins = {
    "Age": get_np_bins_and_labels([
        (0, 12),
        (12, 18),
        (18, 26),
        (26, 36),
        (36, 46),
        (46, 60),
        (60, 200)
    ]),
    "Weight": get_np_bins_and_labels([
        (0, 55),
        (55, 70),
        (70, 80),
        (80, 100),
        (100, 120),
        (120, 400)
    ]),
    "Height": get_np_bins_and_labels([
        (0, 1.62),
        (1.62, 1.75),
        (1.75, 3.00)
    ])
}

def bin_numerical_equally_by_frequency(
    data: t.Union[npt.NDArray[np.float32], npt.NDArray[np.int32]], bins: int = 30
):
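    """
    Bins the data into intervals that each hold roughly the same number of items.
    Example: for data = [1, 2, 3, 4, 5, 6, 7, 8, 9] and bins = 3, each of the three
    resulting intervals holds three values: {1, 2, 3}, {4, 5, 6} and {7, 8, 9}.
    The interval edges come from the data quantiles, so they are not round numbers.
    """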

    if type(data[0]) is not np.float32 and type(data[0]) is not np.int32:
        return data

    result = pd.qcut(data, q=bins, duplicates="drop").astype("string")
    return result

def bin_on_predefined_way(data: t.Union[npt.NDArray[np.float32], npt.NDArray[np.int32]], predefined_bins: tuple[np.array, list[str]]):

    if type(data[0]) is not np.float32 and type(data[0]) is not np.int32:
        return data

    return pd.cut(data, bins=predefined_bins[0], labels=predefined_bins[1], include_lowest=True)

def bin_numerical_smartly(data_name: str, data: t.Union[npt.NDArray[np.float32], npt.NDArray[np.int32]], bins: int = 30):

    if data_name not in predifined_bins:
        return bin_numerical_equally_by_frequency(data, bins=bins)
    else:
        return bin_on_predefined_way(data, predifined_bins[data_name])


### Algorithm utility functions

Here we add utility functions for the algorithms to help us reduce code duplication.
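
Unlike Apriori, ECLAT only mines frequent itemsets, which is why it needs a support threshold but no confidence threshold. The following is a minimal pure-Python sketch of the idea (vertical transaction-id lists that are intersected depth-first). It is an illustration on invented data, not the `arules` implementation, and the function name is hypothetical.

```python
def eclat_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} using intersections of transaction-id sets."""
    total = len(transactions)
    min_count = min_support * total

    # Vertical layout: item -> ids of the transactions that contain it.
    tid_lists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tid_lists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        for index, (item, tids) in enumerate(candidates):
            # Support of prefix + item is the size of the intersected id sets.
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_count:
                itemset = prefix | {item}
                frequent[itemset] = len(new_tids) / total
                # Only extend with later items so each itemset is visited once.
                extend(itemset, new_tids, candidates[index + 1:])

    extend(frozenset(), set(), sorted(tid_lists.items()))
    return frequent


toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
itemsets = eclat_frequent_itemsets(toy, min_support=0.5)
print(sorted("".join(sorted(itemset)) for itemset in itemsets))
```

The output contains itemsets with their support only; there is no left-hand or right-hand side, so confidence and lift are not defined at this stage.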

In [None]:
def run_apriori(
    transactions: arules.ro.DataFrame, support: float, confidence: float
) -> t.Optional[pd.DataFrame]:
    try:
        rules = arules.apriori(
            transactions,
            parameter=arules.parameters({"supp": support, "conf": confidence}),
            control=arules.parameters({"verbose": False}),
        )
        return rules.as_df()
    except Exception:
        # Any failure in the R call or the conversion is reported as "no rules found"
        return None


def run_apriori_build_html(
    transactions: arules.ro.DataFrame, support: float, confidence: float
) -> tuple[str, pd.DataFrame]:
    result = ""

    rules_dataframe = run_apriori(
        transactions=transactions, support=support, confidence=confidence
    )

    result += f"<h2>Result for Apriori run (support: {support}, confidence: {confidence})</h2>"
    result += (
        rules_dataframe.to_html()
        if rules_dataframe is not None
        else "<p>No rules were found</p>"
    )
    result += "<br/>"

    return result, rules_dataframe


def run_eclat(
    transactions: arules.ro.DataFrame, support: float, confidence: float
) -> t.Optional[pd.DataFrame]:
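    # ECLAT mines frequent itemsets, so only the support threshold is used;
    # `confidence` is accepted to keep the same signature as run_apriori.
    try:
        itemsets = arules.eclat(
            transactions,
            parameter=arules.parameters({"supp": support}),
            control=arules.parameters({"verbose": False}),
        )
        return itemsets.as_df()
    except Exception:
        return None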


def run_eclat_build_html(
    transactions: arules.ro.DataFrame, support: float, confidence: float
) -> tuple[str, pd.DataFrame]:
    result = ""

    rules_dataframe = run_eclat(
        transactions=transactions, support=support, confidence=confidence
    )

    result += f"<h2>Result for ECLAT run (support: {support})</h2>"
    result += (
        rules_dataframe.to_html()
        if rules_dataframe is not None
        else "<p>No rules were found</p>"
    )
    result += "<br/>"

    return result, rules_dataframe

def data_frame_to_html(title: str, data_frame: pd.DataFrame) -> str:
    return f"<h2>{title}</h2><br/>{data_frame.to_html()}<br/>"

## Preliminary data analysis

Here we build plots of the data, mostly for debugging purposes.

In [None]:
for label in CATEGORICAL_VARIABLES_NO_LABEL:
    pretty_label = LABEL_DICTIONARY[label]
    dataset_subset_dataframe = dataset_dataframe[label].astype(str)

    plt.figure()
    plt.hist(dataset_subset_dataframe)
    plt.xlabel(pretty_label)
    plt.ylabel("Count")
    plt.show()

In [None]:
for label in NUMERICAL_VARIABLES:
    pretty_label = LABEL_DICTIONARY[label]
    dataset_subset_dataframe = dataset_dataframe[label].astype(np.float32)
    binned_data = bin_numerical_smartly(
        data_name=label,
        data=dataset_subset_dataframe, bins=10
    )

    plt.figure()
    binned_data.value_counts().plot(kind="bar", xlabel=label, ylabel="Count", rot=90)
    plt.xlabel(pretty_label)
    plt.ylabel("Count")
    plt.show()

## Finding association rules

Here we use the `arules` library to find association rules.
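
Association rule mining works on transactions, that is, sets of items. For tabular data, each row becomes one transaction and each cell becomes a `column=value` item. The sketch below shows this encoding in plain Python; the rows are invented, and the `column=value` naming is inferred from the rule strings parsed later in this notebook rather than from the internals of `arules`.

```python
# Two invented rows; the numerical attribute is assumed to be binned already.
rows = [
    {"SMOKE": "no", "FAVC": "yes", "Age": "(18, 26]"},
    {"SMOKE": "no", "FAVC": "no", "Age": "(26, 36]"},
]


def row_to_items(row):
    """Encode one table row as a set of `column=value` items."""
    return frozenset(f"{column}={value}" for column, value in row.items())


transactions = [row_to_items(row) for row in rows]

for transaction in transactions:
    print(sorted(transaction))
```

With this encoding, a rule such as `{SMOKE=no} -> {FAVC=yes}` is simply a statement about how often two items occur in the same row.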

### Categorical - categorical associations with Apriori algorithm


In [None]:
parameters = [
    (0.7, 0.4),
    (0.8, 0.8),
    (0.9, 0.8),
    (0.9, 0.9),
    (0.95, 0.9),
    (0.95, 0.95),
]

dataset_subset_dataframe = dataset_dataframe[CATEGORICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
categorical_categorical_apriori = {}

for support, confidence in parameters:
    html_out, df_out = run_apriori_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    categorical_categorical_apriori[(support, confidence)] = df_out

HTML(result_html)

### Categorical - categorical associations with ECLAT algorithm


In [None]:
parameters = [
    (0.7, 0.4),
    (0.8, 0.8),
    (0.9, 0.8),
    (0.9, 0.9),
    (0.95, 0.9),
    (0.95, 0.95),
]

dataset_subset_dataframe = dataset_dataframe[CATEGORICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
categorical_categorical_eclat = {}

for support, confidence in parameters:
    html_out, df_out = run_eclat_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    categorical_categorical_eclat[(support, confidence)] = df_out

HTML(result_html)

### Numerical - numerical associations with Apriori algorithm


In [None]:
parameters = [
    (0.2, 0.4),
    (0.25, 0.6),
]

dataset_subset_dataframe = dataset_dataframe[NUMERICAL_VARIABLES]
dataset_subset_dataframe = dataset_subset_dataframe.apply(
    # Each column arrives as a Series whose .name is its column label
    lambda dataseries: bin_numerical_smartly(data_name=dataseries.name, data=dataseries, bins=10)
)
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
numerical_numerical_apriori = {}

for support, confidence in parameters:
    html_out, df_out = run_apriori_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    numerical_numerical_apriori[(support, confidence)] = df_out

HTML(result_html)

### Numerical - numerical associations with ECLAT algorithm


In [None]:
parameters = [
    (0.2, 0.4),
    (0.25, 0.6),
]

dataset_subset_dataframe = dataset_dataframe[NUMERICAL_VARIABLES]
dataset_subset_dataframe = dataset_subset_dataframe.apply(
    # Each column arrives as a Series whose .name is its column label
    lambda dataseries: bin_numerical_smartly(data_name=dataseries.name, data=dataseries, bins=10)
)
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
numerical_numerical_eclat = {}

for support, confidence in parameters:
    html_out, df_out = run_eclat_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    numerical_numerical_eclat[(support, confidence)] = df_out

HTML(result_html)

### Categorical + numerical - categorical + numerical associations with Apriori algorithm


In [None]:
parameters = [
    (0.4, 0.9),
]

dataset_subset_dataframe = dataset_dataframe[ALL_VARIABLES]
dataset_subset_dataframe = dataset_subset_dataframe.apply(
    # Each column arrives as a Series whose .name is its column label
    lambda dataseries: bin_numerical_smartly(data_name=dataseries.name, data=dataseries, bins=10)
)
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
catnum_catnum_apriori = {}

for support, confidence in parameters:
    html_out, df_out = run_apriori_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    catnum_catnum_apriori[(support, confidence)] = df_out

HTML(result_html)

### Categorical + numerical - categorical + numerical associations with ECLAT algorithm

In [None]:
parameters = [
    (0.4, 0.9),
]

dataset_subset_dataframe = dataset_dataframe[ALL_VARIABLES]
dataset_subset_dataframe = dataset_subset_dataframe.apply(
    # Each column arrives as a Series whose .name is its column label
    lambda dataseries: bin_numerical_smartly(data_name=dataseries.name, data=dataseries, bins=10)
)
transactions = arules.Transactions.from_df(dataset_subset_dataframe)
result_html = ""
catnum_catnum_eclat = {}

for support, confidence in parameters:
    html_out, df_out = run_eclat_build_html(
        transactions=transactions, support=support, confidence=confidence
    )
    result_html += html_out
    catnum_catnum_eclat[(support, confidence)] = df_out

HTML(result_html)

### Utils
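
The helpers in this section turn rule sides such as `{Age=(18, 26],SMOKE=no}` into lists of `(variable, category)` pairs. The difficulty is that interval labels contain commas too, so a comma only separates items when it is outside any bracket. Below is a self-contained sketch of that bracket-depth idea; the function names and the sample string are hypothetical and only illustrate the technique.

```python
def split_top_level(text, separator=","):
    """Split `text` on `separator`, ignoring separators inside ( ) or [ ]."""
    parts, current, depth = [], [], 0
    for char in text:
        if char in "([":
            depth += 1
        elif char in ")]":
            depth -= 1
        if char == separator and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return [part for part in parts if part]


def parse_rule_side(side):
    """Parse '{a=b,c=d}' into [('a', 'b'), ('c', 'd')]."""
    inner = side.strip()[1:-1]  # drop the surrounding braces
    return [tuple(part.split("=", 1)) for part in split_top_level(inner)]


print(parse_rule_side("{Age=(18, 26],SMOKE=no}"))
print(parse_rule_side("{}"))
```

Both bracket kinds are counted together because a half-open interval such as `(18, 26]` opens with one kind and closes with the other.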

In [None]:
in_data = categorical_categorical_apriori[(0.8, 0.8)]

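def preprocess_str(arg_str: str) -> str:
    # Replace the commas that separate items with ";" while leaving commas
    # inside interval labels such as "(18, 26]" untouched.
    arg_str = list(arg_str)
    status = 0  # current bracket nesting depth
    for it in range(len(arg_str)):
        if arg_str[it] == "," and status == 0:
            arg_str[it] = ";"
        if arg_str[it] in ["(", "["]:
            status += 1
        if arg_str[it] in [")", "]"]:
            status -= 1
    return "".join(arg_str)

def get_assoc(assoc: str) -> list[tuple[str, str]]:
    # Drop the surrounding braces, split on top-level separators, skip empty parts
    parts = [part for part in preprocess_str(assoc[1:-1]).split(";") if part != ""]
    # Split on the first "=" only, so values that contain "=" stay intact
    return [tuple(part.split("=", 1)) for part in parts]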

def distill_dataframe(arg_data: pd.DataFrame) -> pd.DataFrame:
    rules = {
        "LHS": [],
        "RHS": [],
        "support": [],
        "confidence": [],
        "coverage": [],
        "lift": [],
        "count": []
    }

    for it in range(0, len(arg_data)):
        lh_rule = get_assoc(arg_data["LHS"][it])
        rh_rule = get_assoc(arg_data["RHS"][it])

        if len(lh_rule) == 0 or len(rh_rule) == 0:
            continue

        rules["LHS"].append(lh_rule)
        rules["RHS"].append(rh_rule)
        rules["support"].append(arg_data["support"][it])
        rules["confidence"].append(arg_data["confidence"][it])
        rules["coverage"].append(arg_data["coverage"][it])
        rules["lift"].append(arg_data["lift"][it])
        rules["count"].append(arg_data["count"][it])


    return pd.DataFrame(rules).sort_values("confidence", ascending=False).reset_index(drop=True)[:10]


def get_associations_with_label(assoc_data: pd.DataFrame) -> pd.DataFrame:
    rules = {
        "LHS": [],
        "RHS": [],
        "support": [],
        "confidence": [],
        "coverage": [],
        "lift": [],
        "count": []
    }

    for it in range(len(assoc_data)):
        lh_rule = assoc_data["LHS"][it]
        rh_rule = assoc_data["RHS"][it]

        if len(rh_rule) != 1 or rh_rule[0][0] != LABEL_VARIABLE:
            continue

        rules["LHS"].append(lh_rule)
        rules["RHS"].append(rh_rule)
        rules["support"].append(assoc_data["support"][it])
        rules["confidence"].append(assoc_data["confidence"][it])
        rules["coverage"].append(assoc_data["coverage"][it])
        rules["lift"].append(assoc_data["lift"][it])
        rules["count"].append(assoc_data["count"][it])

    return pd.DataFrame(rules)


### Categorical - Categorical top rules with Apriori algorithm


In [None]:
distyled_categorical_categorical_apriori = distill_dataframe(categorical_categorical_apriori[(0.7, 0.4)])

HTML(data_frame_to_html("Categorical - Categorical top rules with Apriori algorithm", distyled_categorical_categorical_apriori))

### Numerical - Numerical top rules with Apriori algorithm


In [None]:
distyled_numerical_numerical_apriori = distill_dataframe(numerical_numerical_apriori[(0.2, 0.4)])

HTML(data_frame_to_html("Numerical - Numerical top rules with Apriori algorithm", distyled_numerical_numerical_apriori))

### Categorical + Numerical - Categorical + Numerical top rules with Apriori algorithm


In [None]:
distyled_catnum_catnum_apriori = distill_dataframe(catnum_catnum_apriori[(0.4, 0.9)])

HTML(data_frame_to_html("Categorical + Numerical - Categorical + Numerical top rules with Apriori algorithm", distyled_catnum_catnum_apriori))

### Categorical - Label associations with Apriori algorithm

In [None]:
parameters = [
    (0.05, 0.05)
]

result = ""
catlabel_df = None

for cat in CATEGORICAL_VARIABLES_NO_LABEL:

    dataset_subset_dataframe = dataset_dataframe[[LABEL_VARIABLE, cat]]
    transactions = arules.Transactions.from_df(dataset_subset_dataframe)

    for support, confidence in parameters:
        html_out, df_out =  run_apriori_build_html(
            transactions=transactions, support=support, confidence=confidence
        )

        df_out = get_associations_with_label(distill_dataframe(df_out))

        if catlabel_df is not None:
            catlabel_df = pd.concat([catlabel_df, df_out])
        else:
            catlabel_df = df_out

catlabel_df = catlabel_df.sort_values("confidence", ascending=False).reset_index(drop=True)

HTML(data_frame_to_html("Categorical to Label Top Rules", catlabel_df))

### Numerical - Label associations with Apriori algorithm


In [None]:
parameters = [
    (0.05, 0.05)
]

result = ""
numlabel_df = None

for num in NUMERICAL_VARIABLES:

    dataset_subset_dataframe = dataset_dataframe[[LABEL_VARIABLE, num]]

    dataset_subset_dataframe = dataset_subset_dataframe.apply(
        # Each column arrives as a Series whose .name is its column label
        lambda dataseries: bin_numerical_smartly(data_name=dataseries.name, data=dataseries, bins=10)
    )

    transactions = arules.Transactions.from_df(dataset_subset_dataframe)

    result = ""

    for support, confidence in parameters:
        html_out, df_out =  run_apriori_build_html(
            transactions=transactions, support=support, confidence=confidence
        )

        df_out = get_associations_with_label(distill_dataframe(df_out))

        if numlabel_df is not None:
            numlabel_df = pd.concat([numlabel_df, df_out])
        else:
            numlabel_df = df_out

numlabel_df = numlabel_df.sort_values("confidence", ascending=False).reset_index(drop=True)

HTML(data_frame_to_html("Numerical to Label Top Rules", numlabel_df))

### Visualisations

Here we visualise the rules we found using the algorithms above.

In [None]:
def get_single_association(arg_data: pd.DataFrame) -> pd.DataFrame:
    arg_data = copy.deepcopy(arg_data)

    rules = {
        "LHS": [],
        "RHS": [],
        "support": [],
        "confidence": [],
        "coverage": [],
        "lift": [],
        "count": []
    }

    for it in range(0, len(arg_data)):
        lh_rule = arg_data["LHS"][it]
        rh_rule = arg_data["RHS"][it]

        if len(lh_rule) != 1 or len(rh_rule) != 1:
            continue

        rules["LHS"].append(lh_rule[0])
        rules["RHS"].append(rh_rule[0])
        rules["support"].append(arg_data["support"][it])
        rules["confidence"].append(arg_data["confidence"][it])
        rules["coverage"].append(arg_data["coverage"][it])
        rules["lift"].append(arg_data["lift"][it])
        rules["count"].append(arg_data["count"][it])

    return pd.DataFrame(rules)


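def strip_variables_categories(arg_data: pd.DataFrame) -> pd.DataFrame:
    arg_data = copy.deepcopy(arg_data)
    # Keep only the first element of every LHS/RHS entry.
    # Whole-column assignment avoids chained indexing, which may not write back.
    arg_data["LHS"] = arg_data["LHS"].apply(lambda entry: entry[0])
    arg_data["RHS"] = arg_data["RHS"].apply(lambda entry: entry[0])
    return arg_data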


### Categorical - Categorical CatPlot

In [None]:
categorical_categorical_catplot_df = get_single_association(distyled_categorical_categorical_apriori)


colors=['green','grey','orange']
catscatter(categorical_categorical_catplot_df,'LHS','RHS','confidence', color=colors,ratio=100)


### Numerical - Numerical CatPlot

In [None]:
numerical_numerical_catplot_df = get_single_association(distyled_numerical_numerical_apriori)


colors=['green','grey','orange']
catscatter(numerical_numerical_catplot_df,'LHS','RHS','confidence', color=colors,ratio=100)


### Categorical - categorical visualisations

In [None]:
dataset_subset_dataframe = dataset_dataframe[CATEGORICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.5, "conf": 0.90}),
    control=arules.parameters({"verbose": False}),
)

gg = plot(rules, method="scatter")
image_png(gg)

In [None]:
dataset_subset_dataframe = dataset_dataframe[CATEGORICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.7, "conf": 0.90}),
    control=arules.parameters({"verbose": False}),
)

#gg = plot(rules, method="grouped")
#image_png(gg)

In [None]:
dataset_subset_dataframe = dataset_dataframe[CATEGORICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.5, "conf": 0.90}),
    control=arules.parameters({"verbose": False}),
)

# Keep the first 100 rules after sorting by confidence
top_rules = rules.sort(by = 'confidence')[0:100]
gg = plot(top_rules, method="graph")
image_png(gg)

### Numerical - numerical visualisations

In [None]:
dataset_subset_dataframe = dataset_dataframe[NUMERICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.2, "conf": 0.4}),
    control=arules.parameters({"verbose": False}),
)

gg = plot(rules, method="scatter")
image_png(gg)

In [None]:
dataset_subset_dataframe = dataset_dataframe[NUMERICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.2, "conf": 0.4}),
    control=arules.parameters({"verbose": False}),
)

gg = plot(rules, method="grouped")
image_png(gg)

In [None]:
dataset_subset_dataframe = dataset_dataframe[NUMERICAL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.2, "conf": 0.2}),
    control=arules.parameters({"verbose": False}),
)

rules_20 = rules.sort(by = 'confidence')[0:20]
gg = plot(rules_20, method="graph")
image_png(gg)

### Categorical + numerical - categorical + numerical visualisations

In [None]:
dataset_subset_dataframe = dataset_dataframe[ALL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.2, "conf": 0.4}),
    control=arules.parameters({"verbose": False}),
)

gg = plot(rules, method="scatter")
image_png(gg)

In [None]:
dataset_subset_dataframe = dataset_dataframe[ALL_VARIABLES]
transactions = arules.Transactions.from_df(dataset_subset_dataframe)

rules = arules.apriori(
    transactions,
    parameter=arules.parameters({"supp": 0.2, "conf": 0.2}),
    control=arules.parameters({"verbose": False}),
)

rules_20 = rules.sort(by = 'confidence')[0:20]
gg = plot(rules_20, method="graph")
image_png(gg)

### Catscatter

In [None]:
catlabel_df_catscat = get_single_association(catlabel_df)


colors=['green','grey','orange']
catscatter(catlabel_df_catscat,'LHS','RHS','confidence', color=colors,ratio=100)


In [None]:
numlabel_df_catscat = get_single_association(numlabel_df)


colors=['green','grey','orange']
catscatter(numlabel_df_catscat,'LHS','RHS','confidence', color=colors,ratio=100)
