### Association Rule Mining
In this notebook, I explore association rule mining on the dataset. Association rule mining is an unsupervised learning technique that searches for associations in the data (for example, neighborhood X and time Y are associated with Assault crimes). For the Toronto Crimes dataset, association rule learning can be useful in several ways: the uncovered patterns and structures in the crime data can offer insight for individual safety, inform housing decisions, and support the Toronto Police Department, to name a few.

In [35]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import networkx as nx

### Data Processing
Association rule mining is best applied to sparse data with binary variables. To get the data into that form, I one-hot-encode the variables for location type, neighborhood, hour, day of week, and MCI category (treating hour and day of week as categorical) before running the Apriori algorithm.
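
To make the target shape concrete, here is a minimal pure-Python sketch of one-hot encoding on three made-up records. The helper `one_hot` and the `records` values are hypothetical and only for illustration; the notebook itself uses scikit-learn's `OneHotEncoder`, which orders columns per feature rather than alphabetically as this sketch does.

```python
# Toy records using three of the notebook's column names (values are illustrative).
records = [
    {"LOCATION_TYPE": "Apartment", "MCI_CATEGORY": "Assault", "dayofweek": 6},
    {"LOCATION_TYPE": "Parking Lots", "MCI_CATEGORY": "Auto Theft", "dayofweek": 2},
    {"LOCATION_TYPE": "Apartment", "MCI_CATEGORY": "Assault", "dayofweek": 5},
]


def one_hot(rows):
    """Return (column_names, boolean_rows) with one column per feature/value pair."""
    # Every distinct "feature_value" combination becomes its own boolean column.
    columns = sorted({f"{key}_{value}" for row in rows for key, value in row.items()})
    encoded = []
    for row in rows:
        present = {f"{key}_{value}" for key, value in row.items()}
        encoded.append([column in present for column in columns])
    return columns, encoded


columns, encoded = one_hot(records)
for name in columns:
    print(name)
for row in encoded:
    print(row)
```

Each encoded row has exactly one `True` per original column, which is why the resulting table is sparse once a column such as neighborhood has many distinct values.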

In [18]:
# import data
df = pd.read_csv('data/clean_data.csv')

In [19]:
# drop unneeded data
df = df.drop(columns=['OBJECTID', 'OCC_DATE', 'HOOD_158','LONG_WGS84','LAT_WGS84', 'isholiday'])
df

Unnamed: 0,OCC_HOUR,LOCATION_TYPE,NEIGHBOURHOOD_158,MCI_CATEGORY,dayofweek
0,3,Commercial Dwelling Unit,South Riverdale,Break and Enter,2
1,4,Apartment,North St.James Town,Assault,2
2,4,"Streets, Roads, Highways",NSA,Theft Over,2
3,4,"Streets, Roads, Highways",Blake-Jones,Assault,2
4,2,Bar / Restaurant,Wellington Place,Assault,2
...,...,...,...,...,...
321838,8,"Single Home, House",Wexford/Maryvale,Break and Enter,3
321839,3,Other Commercial / Corporate Places,Milliken,Break and Enter,5
321840,16,Apartment,Forest Hill North,Assault,3
321841,4,Convenience Stores,Rosedale-Moore Park,Assault,5


In [20]:
# one-hot-encode the categorical variables
encoder = OneHotEncoder()
new_df = encoder.fit_transform(X=df)

In [21]:
# construct corresponding column names
column_names = encoder.get_feature_names_out(input_features=df.columns.to_list())

# convert output array back to a sparse pandas df
new_df = pd.DataFrame.sparse.from_spmatrix(new_df, columns=column_names).astype(bool)
new_df

Unnamed: 0,OCC_HOUR_0,OCC_HOUR_1,OCC_HOUR_2,OCC_HOUR_3,OCC_HOUR_4,OCC_HOUR_5,OCC_HOUR_6,OCC_HOUR_7,OCC_HOUR_8,OCC_HOUR_9,...,MCI_CATEGORY_Break and Enter,MCI_CATEGORY_Robbery,MCI_CATEGORY_Theft Over,dayofweek_0,dayofweek_1,dayofweek_2,dayofweek_3,dayofweek_4,dayofweek_5,dayofweek_6
0,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,True,False,False,False,False
1,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,False,False,True,False,False,True,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321838,False,False,False,False,False,False,False,False,True,False,...,True,False,False,False,False,False,True,False,False,False
321839,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,True,False
321840,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
321841,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


### Find the Association Rules
We can now search the data for association rules.

In [95]:
# Find high support itemsets (support > 0.02) by running the Apriori Algorithm
frequent_itemsets = apriori(new_df, min_support=0.02, use_colnames=True)

In [96]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets.length.value_counts()

2    70
1    46
3     7
Name: length, dtype: int64
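
For intuition about what the `min_support` threshold does, the following is a simplified, level-wise sketch of the Apriori idea in pure Python. It is not mlxtend's implementation: `apriori_sketch` and the toy `transactions` are hypothetical, and the sketch ignores the performance tricks a real library uses.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: keep the single items that are frequent on their own.
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidates of size k are unions of frequent itemsets of size k - 1.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori property: every (k - 1)-subset of a frequent itemset is frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Made-up transactions, loosely modeled on the notebook's variables.
transactions = [
    {"Apartment", "Assault", "Sunday"},
    {"Apartment", "Assault", "Saturday"},
    {"Apartment", "Break and Enter", "Sunday"},
    {"Parking Lots", "Auto Theft", "Sunday"},
]
frequent = apriori_sketch(transactions, min_support=0.5)
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

Raising `min_support` prunes candidates earlier, which is why the length counts above are dominated by itemsets of size 1 and 2.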

In [98]:
# find association rules given the frequent item sets (set confidence > 0.7)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
rules.sort_values(by='confidence', ascending=False, inplace=True)

rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
2,"(LOCATION_TYPE_Apartment, dayofweek_6)",(MCI_CATEGORY_Assault),0.027961,0.767898,1.435974
1,"(LOCATION_TYPE_Apartment, dayofweek_5)",(MCI_CATEGORY_Assault),0.027572,0.748924,1.400492
0,(LOCATION_TYPE_Apartment),(MCI_CATEGORY_Assault),0.167271,0.704481,1.317383


Here we see that, in this dataset, location type Apartment is most strongly associated with assault crimes, particularly on Saturday and Sunday (`dayofweek` 5 and 6). Confidence is reasonably high, and the lift values, which measure the strength of the association between the antecedent and consequent, are noticeably higher than 1. Interestingly, the association between apartments and assaults also has a very high support of 16.7%.
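
As a reference for how these three numbers relate, here is a small sketch that computes support, confidence, and lift by hand on made-up transactions. The function `rule_metrics` and the toy data are hypothetical; the definitions are the standard ones (support of the combined itemset, confidence as a conditional share, lift as confidence divided by the consequent's support).

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    n = len(transactions)
    a = frozenset(antecedent)
    c = frozenset(consequent)
    count_a = sum(a <= t for t in transactions)
    count_c = sum(c <= t for t in transactions)
    count_both = sum((a | c) <= t for t in transactions)

    support = count_both / n             # how often the full rule occurs
    confidence = count_both / count_a    # share of antecedent rows that also match
    lift = confidence / (count_c / n)    # confidence relative to the base rate
    return support, confidence, lift


# Made-up transactions for illustration only.
toy = [
    {"Apartment", "Assault"},
    {"Apartment", "Assault"},
    {"Apartment", "Break and Enter"},
    {"Parking Lots", "Auto Theft"},
    {"Parking Lots", "Assault"},
]

apartment_assault = rule_metrics(toy, {"Apartment"}, {"Assault"})
parking_auto_theft = rule_metrics(toy, {"Parking Lots"}, {"Auto Theft"})
print("Apartment -> Assault:", apartment_assault)
print("Parking Lots -> Auto Theft:", parking_auto_theft)
```

In this toy data the second rule has lower confidence but higher lift, because its consequent is rare overall. A lift of 1 would mean the antecedent tells us nothing beyond the base rate.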

### Explore Association Rules on Location Variables
Next, I specifically explore the structure of location variables as antecedents with crime type as the consequent. In principle, these relationships could also be found simultaneously in the analysis above by lowering the minimum support and confidence thresholds and filtering the resulting rules, but since the algorithm runs very quickly, we can instead subset the data so that only the desired rules are produced.
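
The filtering alternative mentioned above could look roughly like the sketch below. It is pure Python over a hand-copied list of the three rules from the first rules table; `rules_sample`, `all_start_with`, and `location_rules` are hypothetical names, and the prefixes assume the one-hot column naming used in this notebook.

```python
# Rules copied from the first rules table (antecedents, consequents, confidence).
rules_sample = [
    {"antecedents": frozenset({"LOCATION_TYPE_Apartment", "dayofweek_6"}),
     "consequents": frozenset({"MCI_CATEGORY_Assault"}), "confidence": 0.767898},
    {"antecedents": frozenset({"LOCATION_TYPE_Apartment", "dayofweek_5"}),
     "consequents": frozenset({"MCI_CATEGORY_Assault"}), "confidence": 0.748924},
    {"antecedents": frozenset({"LOCATION_TYPE_Apartment"}),
     "consequents": frozenset({"MCI_CATEGORY_Assault"}), "confidence": 0.704481},
]


def all_start_with(items, prefixes):
    """True if every item name begins with one of the given prefixes."""
    return all(item.startswith(prefixes) for item in items)


# Keep rules whose antecedents are purely location variables
# and whose consequents are purely crime categories.
location_rules = [
    rule for rule in rules_sample
    if all_start_with(rule["antecedents"], ("LOCATION_TYPE_", "NEIGHBOURHOOD_158_"))
    and all_start_with(rule["consequents"], ("MCI_CATEGORY_",))
]

for rule in location_rules:
    print(sorted(rule["antecedents"]), "->", sorted(rule["consequents"]), rule["confidence"])
```

On a full rules DataFrame the same predicate could be applied to the `antecedents` and `consequents` columns, although the thresholds of the original run would still need to be lowered for weaker location rules to appear at all.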

In [99]:
# select location variables
location_df = df[['LOCATION_TYPE', 'NEIGHBOURHOOD_158', 'MCI_CATEGORY']].copy()

In [100]:
# one-hot-encode location df
encoder_loc = OneHotEncoder()
new_loc_df = encoder_loc.fit_transform(X=location_df)

# construct column names
column_names_loc = encoder_loc.get_feature_names_out(input_features=['LOCATION_TYPE', 'NEIGHBOURHOOD_158', 'MCI_CATEGORY'])

# convert output array back to a sparse pandas df
new_loc_df = pd.DataFrame.sparse.from_spmatrix(new_loc_df, columns=column_names_loc).astype(bool)
new_loc_df

Unnamed: 0,LOCATION_TYPE_Apartment,LOCATION_TYPE_Bank And Other Financial Institutions,LOCATION_TYPE_Bar / Restaurant,LOCATION_TYPE_Commercial Dwelling Unit,LOCATION_TYPE_Community Group Home,LOCATION_TYPE_Construction Site,LOCATION_TYPE_Convenience Stores,LOCATION_TYPE_Dealership,LOCATION_TYPE_Gas Station,LOCATION_TYPE_Group Homes,...,NEIGHBOURHOOD_158_Yonge-Doris,NEIGHBOURHOOD_158_Yonge-Eglinton,NEIGHBOURHOOD_158_Yonge-St.Clair,NEIGHBOURHOOD_158_York University Heights,NEIGHBOURHOOD_158_Yorkdale-Glen Park,MCI_CATEGORY_Assault,MCI_CATEGORY_Auto Theft,MCI_CATEGORY_Break and Enter,MCI_CATEGORY_Robbery,MCI_CATEGORY_Theft Over
0,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321838,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321839,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321840,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
321841,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [101]:
# Find high support itemsets (support > 0.02) by running the Apriori Algorithm
frequent_itemsets_loc = apriori(new_loc_df, min_support=0.02, use_colnames=True)

In [102]:
frequent_itemsets_loc['length'] = frequent_itemsets_loc['itemsets'].apply(lambda x: len(x))
frequent_itemsets_loc.length.value_counts()

1    16
2    13
Name: length, dtype: int64

In [103]:
# Find association rules given the frequent item sets (set confidence > 0.5)
rules_loc = association_rules(frequent_itemsets_loc, metric='confidence', min_threshold=0.5)
rules_loc.sort_values(by='confidence', ascending=False, inplace=True)

rules_loc[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(LOCATION_TYPE_Apartment),(MCI_CATEGORY_Assault),0.167271,0.704481,1.317383
3,"(LOCATION_TYPE_Streets, Roads, Highways)",(MCI_CATEGORY_Assault),0.093875,0.591808,1.106685
1,(LOCATION_TYPE_Bar / Restaurant),(MCI_CATEGORY_Assault),0.022368,0.568058,1.062272
2,(LOCATION_TYPE_Parking Lots),(MCI_CATEGORY_Auto Theft),0.048465,0.563939,3.946252


Beyond the Apartment-Assault connection discovered earlier, we also find an association between streets, roads, and highways and assault with 59% confidence, although the lift (1.11) is closer to 1, which suggests a weaker association. We also find an association between Bar / Restaurant and assault with 57% confidence, but the lift (1.06) is very close to 1, which suggests that the association is very weak. Finally, we find an unsurprising relationship between Parking Lots and Auto Theft with 56% confidence; here the lift is very high at 3.95, which suggests the two are strongly associated.
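
One way to see why similar confidence values lead to such different lifts: since lift is confidence divided by the consequent's overall support, dividing the two reported columns recovers each crime type's base rate. The sketch below does this arithmetic on the four rows of the table above; `rules_loc_rows` is a hypothetical hand-copied list, not a variable from the notebook.

```python
# (antecedent, consequent, confidence, lift) copied from the location rules table.
rules_loc_rows = [
    ("Apartment", "Assault", 0.704481, 1.317383),
    ("Streets, Roads, Highways", "Assault", 0.591808, 1.106685),
    ("Bar / Restaurant", "Assault", 0.568058, 1.062272),
    ("Parking Lots", "Auto Theft", 0.563939, 3.946252),
]

base_rates = {}
for antecedent, consequent, confidence, lift in rules_loc_rows:
    # lift = confidence / support(consequent)  =>  support(consequent) = confidence / lift
    base = confidence / lift
    base_rates.setdefault(consequent, []).append(base)
    print(f"{antecedent} -> {consequent}: implied base rate {base:.4f}")
```

The three Assault rules all imply the same base rate of roughly 0.53, while Auto Theft comes out at roughly 0.14. A confidence of 56% is therefore close to the baseline for Assault but about four times the baseline for Auto Theft.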

### What about only neighborhoods and crime type?
This might give an indication about which neighborhoods are most associated with which crimes.

In [73]:
# select neighborhood and MCI
neighborhood_df = df[['NEIGHBOURHOOD_158', 'MCI_CATEGORY']].copy()

In [75]:
# one-hot-encode neighborhood df
encoder_nb = OneHotEncoder()
new_nb_df = encoder_nb.fit_transform(X=neighborhood_df)

# construct column names
column_names_nb = encoder_nb.get_feature_names_out(input_features=['NEIGHBOURHOOD_158', 'MCI_CATEGORY'])

# convert output array back to a sparse pandas df
new_nb_df = pd.DataFrame.sparse.from_spmatrix(new_nb_df, columns=column_names_nb).astype(bool)
new_nb_df

Unnamed: 0,NEIGHBOURHOOD_158_Agincourt North,NEIGHBOURHOOD_158_Agincourt South-Malvern West,NEIGHBOURHOOD_158_Alderwood,NEIGHBOURHOOD_158_Annex,NEIGHBOURHOOD_158_Avondale,NEIGHBOURHOOD_158_Banbury-Don Mills,NEIGHBOURHOOD_158_Bathurst Manor,NEIGHBOURHOOD_158_Bay-Cloverhill,NEIGHBOURHOOD_158_Bayview Village,NEIGHBOURHOOD_158_Bayview Woods-Steeles,...,NEIGHBOURHOOD_158_Yonge-Doris,NEIGHBOURHOOD_158_Yonge-Eglinton,NEIGHBOURHOOD_158_Yonge-St.Clair,NEIGHBOURHOOD_158_York University Heights,NEIGHBOURHOOD_158_Yorkdale-Glen Park,MCI_CATEGORY_Assault,MCI_CATEGORY_Auto Theft,MCI_CATEGORY_Break and Enter,MCI_CATEGORY_Robbery,MCI_CATEGORY_Theft Over
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321838,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321839,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
321840,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
321841,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [82]:
# Find high support itemsets (support > 0.01) by running the Apriori Algorithm
frequent_itemsets_nb = apriori(new_nb_df, min_support=0.01, use_colnames=True)

In [83]:
frequent_itemsets_nb['length'] = frequent_itemsets_nb['itemsets'].apply(lambda x: len(x))
frequent_itemsets_nb.length.value_counts()

1    26
2     7
Name: length, dtype: int64

In [84]:
# Find association rules given the frequent item sets (set confidence > 0.5)
rules_nb = association_rules(frequent_itemsets_nb, metric='confidence', min_threshold=0.5)
rules_nb.sort_values(by='confidence', ascending=False, inplace=True)

rules_nb[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
4,(NEIGHBOURHOOD_158_West Hill),(MCI_CATEGORY_Assault),0.01072,0.68128,1.273997
5,(NEIGHBOURHOOD_158_Yonge-Bay Corridor),(MCI_CATEGORY_Assault),0.013609,0.675509,1.263206
3,(NEIGHBOURHOOD_158_Wellington Place),(MCI_CATEGORY_Assault),0.012826,0.66292,1.239664
0,(NEIGHBOURHOOD_158_Downtown Yonge East),(MCI_CATEGORY_Assault),0.013491,0.635353,1.188113
2,(NEIGHBOURHOOD_158_Moss Park),(MCI_CATEGORY_Assault),0.01493,0.621765,1.162704
1,(NEIGHBOURHOOD_158_Kensington-Chinatown),(MCI_CATEGORY_Assault),0.010726,0.596716,1.115862


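Finally, we find that these 6 neighborhoods are most strongly associated with assault crimes, with confidence of roughly 60% or higher (0.60 to 0.68). The lift values, however, are fairly close to 1 (1.12 to 1.27), so these associations are weak. This makes sense: dividing confidence by lift shows that assault makes up roughly 53% of all records, so an assault share of 60% to 68% within a neighborhood is only a modest increase over the dataset-wide baseline, and many other crime types also occur in these neighborhoods.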