### Association Rule Mining
In this notebook, I explore association rule mining on the Toronto Crimes dataset. Association rule mining is an unsupervised learning technique that searches the data for associations (for example, neighborhood X and time Y are associated with Assault crimes). For this dataset, association rule learning can be useful in many different ways: the uncovered patterns can offer insight for individual safety, inform housing decisions, and support the Toronto Police Department, to name a few.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import networkx as nx

### Data Processing
Association rule mining is best applied to sparse data with binary variables. To get the data into that form, I one-hot-encode the categorical variables for hour, location type, neighborhood, MCI category, day of week, and month before running the Apriori algorithm.
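As a minimal sketch of what one-hot encoding does here, the pure-Python example below turns three made-up records into binary items named `<column>_<value>`, the same naming pattern that appears in the encoded columns later in this notebook. The record values are illustrative assumptions, not rows taken from the dataset, and the notebook itself uses scikit-learn's `OneHotEncoder` rather than this hand-rolled loop.

```python
# Toy records reusing three of the notebook's column names (values are made up).
records = [
    {"LOCATION_TYPE": "Apartment", "MCI_CATEGORY": "Assault", "OCC_HOUR": 4},
    {"LOCATION_TYPE": "Parking Lots", "MCI_CATEGORY": "Auto Theft", "OCC_HOUR": 22},
    {"LOCATION_TYPE": "Apartment", "MCI_CATEGORY": "Assault", "OCC_HOUR": 22},
]

# Every distinct (column, value) pair becomes one binary item "<column>_<value>".
items = sorted({f"{col}_{val}" for row in records for col, val in row.items()})

# One boolean per item per record: True where the record carries that value.
encoded = []
for row in records:
    present = {f"{col}_{val}" for col, val in row.items()}
    encoded.append({item: item in present for item in items})

for item in items:
    print(item, [row[item] for row in encoded])
```

Each record ends up with exactly one `True` per original column, which is why the encoded table is sparse.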

In [2]:
# import data
df = pd.read_csv('data/clean_data.csv')

In [3]:
# drop unneeded data
df = df.drop(columns=['OBJECTID', 'OCC_DATE', 'dayofweek_sin', 
                      'dayofweek_cos', 'hour_sin', 'hour_cos', 'year', 
                      'lat_hour_cos', 'lat_hour_sin', 'long_hour_cos', 'long_hour_sin', 'LONG_WGS84','LAT_WGS84', 'isholiday'])
df

Unnamed: 0,OCC_HOUR,LOCATION_TYPE,NEIGHBOURHOOD_158,MCI_CATEGORY,dayofweek,month
0,3,Commercial Dwelling Unit,South Riverdale,Break and Enter,2,1
1,4,Apartment,North St.James Town,Assault,2,1
2,4,"Streets, Roads, Highways",NSA,Theft Over,2,1
3,4,"Streets, Roads, Highways",Blake-Jones,Assault,2,1
4,2,Bar / Restaurant,Wellington Place,Assault,2,1
...,...,...,...,...,...,...
321838,8,"Single Home, House",Wexford/Maryvale,Break and Enter,3,12
321839,3,Other Commercial / Corporate Places,Milliken,Break and Enter,5,12
321840,16,Apartment,Forest Hill North,Assault,3,12
321841,4,Convenience Stores,Rosedale-Moore Park,Assault,5,12


In [4]:
# one-hot-encode the categorical variables
encoder = OneHotEncoder()
new_df = encoder.fit_transform(X=df)

In [5]:
# construct corresponding column names
column_names = encoder.get_feature_names_out(input_features=df.columns.to_list())

# convert output array back to a sparse pandas df
new_df = pd.DataFrame.sparse.from_spmatrix(new_df, columns=column_names).astype(bool)
new_df


Unnamed: 0,OCC_HOUR_0,OCC_HOUR_1,OCC_HOUR_2,OCC_HOUR_3,OCC_HOUR_4,OCC_HOUR_5,OCC_HOUR_6,OCC_HOUR_7,OCC_HOUR_8,OCC_HOUR_9,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321838,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,True
321839,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
321840,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
321841,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


### Find the Association Rules
We now can search the data for association rules.

In [6]:
# Find high support itemsets (support >= 0.02) by running the Apriori Algorithm
frequent_itemsets = apriori(new_df, min_support=0.02, use_colnames=True)

In [7]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.length.value_counts()

2    88
1    58
3     7
Name: length, dtype: int64

In [8]:
# find association rules given the frequent item sets (set confidence > 0.7)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
rules.sort_values(by='confidence', ascending=False, inplace=True)

rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
2,"(dayofweek_6, LOCATION_TYPE_Apartment)",(MCI_CATEGORY_Assault),0.027961,0.767898,1.435974
1,"(dayofweek_5, LOCATION_TYPE_Apartment)",(MCI_CATEGORY_Assault),0.027572,0.748924,1.400492
0,(LOCATION_TYPE_Apartment),(MCI_CATEGORY_Assault),0.167271,0.704481,1.317383


Here we see that in this dataset, the Apartment location type is most strongly associated with assault crimes, particularly on Saturday and Sunday. Confidence is reasonably high (0.70 to 0.77), and the lift values, which measure how much more likely the consequent is given the antecedent than it is overall, sit moderately above 1 (1.32 to 1.44). Interestingly, the association between apartments and assaults also has very high support: 16.7% of all records are assaults in apartments.
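To make the three metrics concrete, the sketch below computes support, confidence, and lift by hand on five made-up records. The records and the helper names `support`, `confidence`, and `lift` are illustrative assumptions; the notebook itself obtains these values from mlxtend.

```python
# Five made-up records, each written as the set of items that are True for it.
transactions = [
    {"LOCATION_TYPE_Apartment", "MCI_CATEGORY_Assault"},
    {"LOCATION_TYPE_Apartment", "MCI_CATEGORY_Assault"},
    {"LOCATION_TYPE_Apartment", "MCI_CATEGORY_Break and Enter"},
    {"LOCATION_TYPE_Parking Lots", "MCI_CATEGORY_Auto Theft"},
    {"LOCATION_TYPE_Parking Lots", "MCI_CATEGORY_Assault"},
]

def support(itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's overall frequency."""
    return confidence(antecedent, consequent) / support(consequent)

a = {"LOCATION_TYPE_Apartment"}
c = {"MCI_CATEGORY_Assault"}
print(support(a | c), confidence(a, c), lift(a, c))
```

A lift of exactly 1 would mean the antecedent tells us nothing about the consequent; values above 1 mean the consequent is more common when the antecedent is present.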

### Explore Association Rules on Location Variables
Here I specifically explore the structure between location variables and crime type consequents. In principle, these relationships could also be found simultaneously with the analysis above by adjusting the minimum support and confidence thresholds, but since the algorithm runs very quickly we can instead subset the data so that only the desired results are produced.
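For comparison, the alternative mentioned above (mining everything at once and filtering afterwards) might look roughly like the sketch below. It filters a rules table by item-name prefix. The `rules_demo` frame is a hand-built stand-in with illustrative rows, and `LOCATION_PREFIXES` and `only_prefixes` are hypothetical names introduced here; the only assumption about mlxtend is that its rules table stores `antecedents` and `consequents` as frozensets.

```python
import pandas as pd

# Hand-built stand-in for a rules table; both sides are frozensets of item names.
rules_demo = pd.DataFrame({
    "antecedents": [
        frozenset({"LOCATION_TYPE_Apartment"}),
        frozenset({"dayofweek_6", "LOCATION_TYPE_Apartment"}),
        frozenset({"MCI_CATEGORY_Auto Theft"}),
    ],
    "consequents": [
        frozenset({"MCI_CATEGORY_Assault"}),
        frozenset({"MCI_CATEGORY_Assault"}),
        frozenset({"LOCATION_TYPE_Parking Lots"}),
    ],
})

LOCATION_PREFIXES = ("LOCATION_TYPE_", "NEIGHBOURHOOD_158_")

def only_prefixes(items, prefixes):
    """True when every item name starts with one of the given prefixes."""
    return all(item.startswith(prefixes) for item in items)

# Keep rules whose antecedents are purely location items
# and whose consequents are purely crime categories.
mask = (
    rules_demo["antecedents"].apply(lambda s: only_prefixes(s, LOCATION_PREFIXES))
    & rules_demo["consequents"].apply(lambda s: only_prefixes(s, ("MCI_CATEGORY_",)))
)
location_rules = rules_demo[mask]
print(location_rules)
```

Subsetting the columns before mining, as done in the cells that follow, keeps the itemset search smaller and avoids this extra filtering step.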

In [9]:
# select location variables
location_df = df[['LOCATION_TYPE', 'NEIGHBOURHOOD_158', 'MCI_CATEGORY']].copy()

In [10]:
# one-hot-encode location df
encoder_loc = OneHotEncoder()
new_loc_df = encoder_loc.fit_transform(X=location_df)

# construct column names
column_names_loc = encoder_loc.get_feature_names_out(input_features=['LOCATION_TYPE', 'NEIGHBOURHOOD_158', 'MCI_CATEGORY'])

# convert output array back to a sparse pandas df
new_loc_df = pd.DataFrame.sparse.from_spmatrix(new_loc_df, columns=column_names_loc).astype(bool)
new_loc_df


Unnamed: 0,LOCATION_TYPE_Apartment,LOCATION_TYPE_Bank And Other Financial Institutions,LOCATION_TYPE_Bar / Restaurant,LOCATION_TYPE_Commercial Dwelling Unit,LOCATION_TYPE_Community Group Home,LOCATION_TYPE_Construction Site,LOCATION_TYPE_Convenience Stores,LOCATION_TYPE_Dealership,LOCATION_TYPE_Gas Station,LOCATION_TYPE_Group Homes,...,NEIGHBOURHOOD_158_Yonge-Doris,NEIGHBOURHOOD_158_Yonge-Eglinton,NEIGHBOURHOOD_158_Yonge-St.Clair,NEIGHBOURHOOD_158_York University Heights,NEIGHBOURHOOD_158_Yorkdale-Glen Park,MCI_CATEGORY_Assault,MCI_CATEGORY_Auto Theft,MCI_CATEGORY_Break and Enter,MCI_CATEGORY_Robbery,MCI_CATEGORY_Theft Over
0,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321838,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321839,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321840,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
321841,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [11]:
# Find high support itemsets (min_support = 0.02) by running the Apriori Algorithm
frequent_itemsets_loc = apriori(new_loc_df, min_support=0.02, use_colnames=True)

In [12]:
frequent_itemsets_loc['length'] = frequent_itemsets_loc['itemsets'].apply(len)
frequent_itemsets_loc.length.value_counts()

1    16
2    13
Name: length, dtype: int64

In [13]:
# Find association rules given the frequent item sets (set confidence > 0.5)
rules_loc = association_rules(frequent_itemsets_loc, metric='confidence', min_threshold=0.5)
rules_loc.sort_values(by='confidence', ascending=False, inplace=True)

rules_loc[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(LOCATION_TYPE_Apartment),(MCI_CATEGORY_Assault),0.167271,0.704481,1.317383
3,"(LOCATION_TYPE_Streets, Roads, Highways)",(MCI_CATEGORY_Assault),0.093875,0.591808,1.106685
1,(LOCATION_TYPE_Bar / Restaurant),(MCI_CATEGORY_Assault),0.022368,0.568058,1.062272
2,(LOCATION_TYPE_Parking Lots),(MCI_CATEGORY_Auto Theft),0.048465,0.563939,3.946252


Beyond the Apartment-Assault connection discovered earlier, we also find an association between Streets, Roads, Highways and Assault with 59% confidence, although the lift (1.11) is closer to 1, which suggests a weaker association. We also find an association between Bar / Restaurant and Assault with 57% confidence, but the lift (1.06) is very close to 1, so the association is very weak. Finally, we find an unsurprising relationship between Parking Lots and Auto Theft with 56% confidence; here the lift is very high at 3.95, which suggests the two are strongly associated.
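One way to see why similar confidence values lead to such different lifts is to recover the baseline frequency of each consequent. Since lift is confidence divided by the consequent's overall frequency, the baseline is confidence divided by lift. The sketch below applies that arithmetic to the confidence and lift values from the table above; the variable names are illustrative.

```python
# (antecedent, consequent, confidence, lift) copied from the rules table above.
reported = [
    ("Apartment", "Assault", 0.704481, 1.317383),
    ("Streets, Roads, Highways", "Assault", 0.591808, 1.106685),
    ("Bar / Restaurant", "Assault", 0.568058, 1.062272),
    ("Parking Lots", "Auto Theft", 0.563939, 3.946252),
]

baselines = []
for antecedent, consequent, conf, lift in reported:
    # lift = confidence / P(consequent), so P(consequent) = confidence / lift
    baseline = conf / lift
    baselines.append(baseline)
    print(f"{antecedent} -> {consequent}: "
          f"baseline {baseline:.3f}, confidence {conf:.3f}")
```

The implied baseline for Assault is roughly 0.53, so a 57% confidence for Bar / Restaurant is only slightly above what we would expect by chance. The implied baseline for Auto Theft is roughly 0.14, which is why a 56% confidence for Parking Lots corresponds to a lift near 4.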