## Association Rule Mining on US Census Data

In [1]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
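# Base URL of the course copy of the Adult (census income) dataset.
path = "https://raw.githubusercontent.com/cs6220/cs6220.spring2019/master/data/adult/"
# Read adult.names with one line per row; rows 92:108 are used as the attribute descriptions.
names = pd.read_csv(path + "adult.names", sep="\n", header=None)
# Keep the text before the first ":" on each line, i.e. the attribute name.
parse_cols = lambda x: x.str.split(":", expand=True).iloc[:, 0]
# Move the first parsed entry to the end so the names line up with the columns of adult.data.
columns = np.roll(parse_cols(names.iloc[92:108, 0]), shift=-1)
df_adult = pd.read_csv(path + "adult.data", sep=",", header=None, index_col=False)
df_adult.columns = columns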

In [None]:
df_adult.head()

#### 2.1 Association Rule Mining

The raw dataset is transformed into a format suitable for association rule mining by dropping all continuous columns and one-hot encoding the remaining categorical columns. Every resulting column should be binary, with each value represented by a 1 or a 0.

In [None]:
df_adult_transformed = df_adult.select_dtypes(exclude=np.number)
df_adult_encoded = pd.get_dummies(df_adult_transformed)
df_adult_encoded.head()

In [None]:
frequent_itemsets = apriori(df_adult_encoded, min_support=0.1, use_colnames=True, max_len=3)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

In [None]:
frequent_itemsets

Confidence is used as the rule interestingness measure (metric="confidence") to generate rules from frequent itemsets of up to three items (max_len=3 in the apriori call above).

In [None]:
rules_confidence_1 = association_rules(frequent_itemsets, metric="confidence", min_threshold = 0.1).sort_values(by='lift', ascending=False).reset_index()
rules_confidence_1.head()

For the explanations below, I'm basing my answer on a combination of support, confidence and, mainly, lift.
From my understanding, the antecedent is the "if" part of a rule (items observed in a record) and the consequent is the "then" part (items the rule predicts will appear alongside them).
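To make the three measures concrete, here is a minimal sketch that computes support, confidence and lift by hand for a single rule. The transactions and item names are hypothetical toy data, not rows from the census file, and the helper functions `support` and `rule_metrics` are illustrative names rather than part of mlxtend.

```python
# Toy transactions (hypothetical data, not taken from the census file).
transactions = [
    {"own-child", "never-married", "us"},
    {"own-child", "never-married"},
    {"husband", "married", "us"},
    {"never-married", "us"},
    {"husband", "married"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a   # how often the consequent appears when the antecedent does
    lift = confidence / s_c   # confidence relative to the consequent's base rate
    return s_ac, confidence, lift

sup, conf, lift = rule_metrics({"own-child"}, {"never-married"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=1.00 lift=1.67
```

In this toy data a lift of about 1.67 means the two items appear together roughly 1.67 times as often as they would if they were independent.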

From the above result, the top rules by lift suggest that:
- People whose relationship is own-child tend to be never married and to fall in the income class named in the rule. (The one-hot label for income appears to concatenate the column name, which lists both classes, with the actual value, so only the value after the underscore identifies the class.)
- The converse of the previous rule: people who are never married and in that income class tend to be own-child.
- People who are own-child tend to be never married and from the US.
- People who are never married and from the US tend to be own-child.
- People who are white and never married tend to be own-child.
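Several rules in the list above are the same itemset with antecedent and consequent swapped. The sketch below, which uses hypothetical counts chosen only for illustration, shows why such a pair shares the same lift (and support) but generally not the same confidence. The function name `confidence_and_lift` is an illustrative helper, not an mlxtend call.

```python
def confidence_and_lift(n_total, n_a, n_c, n_both):
    """Confidence and lift of the rule A -> C computed from raw counts."""
    confidence = n_both / n_a
    lift = confidence / (n_c / n_total)
    return confidence, lift

# Hypothetical counts: 1000 records, 150 contain A, 300 contain C, 120 contain both.
n_total, n_a, n_c, n_both = 1000, 150, 300, 120

conf_ac, lift_ac = confidence_and_lift(n_total, n_a, n_c, n_both)  # A -> C
conf_ca, lift_ca = confidence_and_lift(n_total, n_c, n_a, n_both)  # C -> A

print(f"A -> C: confidence={conf_ac:.2f} lift={lift_ac:.2f}")
print(f"C -> A: confidence={conf_ca:.2f} lift={lift_ca:.2f}")
```

Because lift is symmetric, sorting by lift places a rule and its converse next to each other, which is consistent with the paired rules seen above.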

Lift is then used as the rule interestingness measure (metric="lift") to generate rules from frequent itemsets of up to three items (max_len=3 in the apriori call above).

In [None]:
rules_lift_1 = association_rules(frequent_itemsets, metric="lift", min_threshold = 0.1).sort_values(by='support', ascending=False).reset_index()
rules_lift_1.head()

From the above result, the rules with the highest support suggest that:
1. People who are native to the United States tend to be white.
2. The converse of the rule above: people who are white tend to be native to the US.
3. People in the income class named in the rule tend to be native to the US. (The income item's label appears to include the column name, which lists both classes, so the value at the end of the label identifies the class.)
4. The converse: people who are native to the US tend to be in that income class.
5. People who are white tend to be in that income class.

The top rules are compared using the two interestingness measures for the same levels of support.

I first repeat the process used above with min_threshold=0.1 for both metrics, and then again with min_threshold=0.8.

In [None]:
rules_confidence_1 = association_rules(frequent_itemsets, metric="confidence", min_threshold = 0.1)
rules_lift_1 = association_rules(frequent_itemsets, metric="lift", min_threshold = 0.1)

In [None]:
rules_confidence_2 = association_rules(frequent_itemsets, metric="confidence", min_threshold = 0.8)
rules_lift_2 = association_rules(frequent_itemsets, metric="lift", min_threshold = 0.8)

In [None]:
rules_confidence_2.head()

In [None]:
rules_confidence_1.head()

In [None]:
rules_lift_1.head()

In [None]:
rules_lift_2.head()

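Comparing the results above for confidence and lift at the same two values of min_threshold (0.1 and 0.8; min_support stays at 0.1 throughout), I get the same rules for lift and different rules for confidence. A likely reason is that nearly every rule has a lift above 0.8, so both cutoffs keep the same rules, whereas confidence lies between 0 and 1 and a 0.8 cutoff removes the weaker rules. From my understanding, lift is one of the most important criteria: a lift above 1 means the antecedent and consequent occur together more often than they would if they were independent, so the rule is less likely to be a coincidence. Support is somewhat low for the top rules, which is a little concerning, but confidence is higher, so by that measure the rules still hold for most records that match the antecedent.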
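One way to see how a threshold change can affect one metric and not the other is to apply the same two cutoffs to a small table of rules. The sketch below uses made-up rules and metric values (not results from this notebook), and `filter_rules` is a hypothetical helper that only imitates a "keep rules whose metric is at least the threshold" filter.

```python
# Hypothetical rules with made-up metric values (not results from this notebook).
rules = [
    {"rule": "A -> B", "confidence": 0.92, "lift": 1.05},
    {"rule": "B -> A", "confidence": 0.35, "lift": 1.05},
    {"rule": "C -> D", "confidence": 0.60, "lift": 2.40},
    {"rule": "D -> C", "confidence": 0.85, "lift": 2.40},
]

def filter_rules(rules, metric, min_threshold):
    """Keep the names of rules whose `metric` is at least `min_threshold`."""
    return [r["rule"] for r in rules if r[metric] >= min_threshold]

for metric in ("confidence", "lift"):
    low = filter_rules(rules, metric, 0.1)
    high = filter_rules(rules, metric, 0.8)
    print(metric, "same rules at 0.1 and 0.8:", low == high)
# prints:
# confidence same rules at 0.1 and 0.8: False
# lift same rules at 0.1 and 0.8: True
```

If the aim is to filter on lift in a meaningful way, a cutoff above 1 would be the more natural choice, since values below 1 indicate items that co-occur less often than independence would predict.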

Although each result can be justified, I am not able to pinpoint which single metric makes the most sense.
In the case of confidence_2, a person in the Private work class can be white, and according to confidence_1, a person with a Bachelors degree can be working in the private sector. Lift seems more straightforward as a measure: the top rules are less likely to occur at random, so the results make sense.
