# Apriori Algorithm

---

In unsupervised learning, the Apriori algorithm is a key method for finding association rules. For this business problem, mining rules over any combination of features is useful for identifying key aspects of the issue. After importing and transforming the data and applying the algorithm, it became clear that several rules could be of use in determining the factors that affect the severity or fatality of accidents. Multiple experiments are shown below, each of which adjusts either the confidence threshold or the support threshold to gain more insight into the data.

Though the rules were limited in number, their high confidence values indicate that they hold very consistently across the records. However, limitations of the dataset, such as a lack of relevant data, led to a reduced number of rules. This was surprising, since at first glance association rule mining seemed like an effective way to uncover many rules. Following the experiments below, the results are summarized and interpreted in terms of the business problem's needs.
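
As a quick illustration of the two thresholds adjusted in the experiments, the sketch below computes support and confidence by hand. The toy transactions and the `support` and `confidence` helper names are made up for illustration; they are not taken from the collisions dataset or from mlxtend.

```python
# Toy transactions (hypothetical, not taken from collisions.csv)
transactions = [
    {"Clear", "Dry", "Non-Fatal Injury"},
    {"Clear", "Dry", "Non-Fatal Injury"},
    {"Clear", "Dry", "Fatal"},
    {"Rain", "Wet", "Non-Fatal Injury"},
    {"Clear", "Wet", "Non-Fatal Injury"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)


# Support of {Clear, Dry}: 3 of the 5 transactions contain both items
print(support({"Clear", "Dry"}, transactions))

# Confidence of Clear -> Dry: of the 4 transactions with Clear, 3 also have Dry
print(confidence({"Clear"}, {"Dry"}, transactions))
```

A support threshold discards itemsets that occur too rarely, while a confidence threshold discards rules whose consequent does not follow the antecedent often enough.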



In [None]:
# Import libraries and tools
import pandas as pd
import csv

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
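# Read the CSV data into a list of rows (the header row is included)
dataset = []

with open('collisions.csv', newline='') as f:
  reader = csv.reader(f)
  for row in reader:
    dataset.append(row)

# Print only the first few rows; printing every row exceeds the notebook's output rate limit
for row in dataset[:5]:
  print(row)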

['_id', 'ACCNUM', 'YEAR', 'DATE', 'TIME', 'HOUR', 'STREET1', 'STREET2', 'OFFSET', 'ROAD_CLASS', 'DISTRICT', 'WARDNUM', 'DIVISION', 'LOCCOORD', 'ACCLOC', 'TRAFFCTL', 'VISIBILITY', 'LIGHT', 'RDSFCOND', 'ACCLASS', 'IMPACTYPE', 'INVTYPE', 'INVAGE', 'INJURY', 'FATAL_NO', 'INITDIR', 'VEHTYPE', 'MANOEUVER', 'DRIVACT', 'DRIVCOND', 'PEDTYPE', 'PEDACT', 'PEDCOND', 'CYCLISTYPE', 'CYCACT', 'CYCCOND', 'PEDESTRIAN', 'CYCLIST', 'AUTOMOBILE', 'MOTORCYCLE', 'TRUCK', 'TRSN_CITY_VEH', 'EMERG_VEH', 'PASSENGER', 'SPEEDING', 'AG_DRIV', 'REDLIGHT', 'ALCOHOL', 'DISABILITY', 'POLICE_DIVISION', 'HOOD_ID', 'NEIGHBOURHOOD', 'ObjectId', 'geometry']
['1', '892658', '2006', '2006-03-11T00:00:00', '852', '8', 'BLOOR ST W', 'DUNDAS ST W', '', 'Major Arterial', 'Toronto and East York', '4', '11', 'Intersection', 'At Intersection', 'Traffic Signal', 'Clear', 'Daylight', 'Dry', 'Fatal', 'Pedestrian Collisions', 'Driver', 'unknown', 'None', '', 'South', 'Automobile, Station Wagon', 'Turning Left', 'Failed to Yield Right o




In [None]:
# Data transformation to prepare for rule generation:
# one-hot encode every row as a transaction of items
oht = TransactionEncoder()
oht_array = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_array, columns=oht.columns_)

# Remove irrelevant features. Note: the header row is part of `dataset`,
# so the column names themselves appear as items (columns) here.
# drop() returns a new DataFrame, so the result must be assigned back.
df = df.drop(['ACCNUM', 'HOOD_ID', 'OFFSET', '_id', 'geometry'], axis=1)

# Drop any row (axis 0) that has any type of null value
df = df.dropna(how='any', axis=0)

# Create a DataFrame with the subset of features selected for the initial
# supervised learning task (alongside the target class, INJURY)
df_sub = df[['DRIVACT', 'INVAGE', 'VEHTYPE', 'DRIVCOND', 'INVTYPE', 'INJURY']]

df.head()  # view the original dataset

Unnamed: 0,Unnamed: 1,0,0 to 4,"0,51","0,52","07,06",1,1 AUTUMN AVE,1 MASSEY Sq,1 MURRAY GLEN DR,...,"{u'type': u'Point', u'coordinates': (-79.621974, 43.746806)}","{u'type': u'Point', u'coordinates': (-79.62429, 43.753145)}","{u'type': u'Point', u'coordinates': (-79.62619, 43.734345)}","{u'type': u'Point', u'coordinates': (-79.62659, 43.747645)}","{u'type': u'Point', u'coordinates': (-79.63249, 43.747745)}","{u'type': u'Point', u'coordinates': (-79.633502, 43.751265)}","{u'type': u'Point', u'coordinates': (-79.634021, 43.751216)}","{u'type': u'Point', u'coordinates': (-79.63419, 43.747445)}","{u'type': u'Point', u'coordinates': (-79.63467, 43.751242)}","{u'type': u'Point', u'coordinates': (-79.63839, 43.749045)}"
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
df_sub.head() #viewing the dataset with subset features

   DRIVACT  INVAGE  VEHTYPE  DRIVCOND  INVTYPE  INJURY
0     True    True     True      True     True    True
1    False   False    False     False    False   False
2    False   False    False     False    False   False
3    False   False    False     False    False   False
4    False   False    False     False    False   False
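
In the preview above, only row 0 is `True` because `dataset` still contains the header row, so the column names were encoded as the items of the first transaction. One possible alternative, sketched here as an assumption rather than the approach used in this notebook, is to skip the header, drop blank cells, and prefix each value with its column name so that `Yes` in `SPEEDING` and `Yes` in `ALCOHOL` become distinct items. The `rows_to_transactions` helper and the small inline sample are hypothetical.

```python
import csv
import io

# Small inline sample standing in for collisions.csv (hypothetical values)
sample_csv = """VISIBILITY,RDSFCOND,ACCLASS,SPEEDING,ALCOHOL
Clear,Dry,Non-Fatal Injury,Yes,
Rain,Wet,Fatal,,Yes
"""


def rows_to_transactions(file_obj, keep_columns=None):
    """Turn CSV rows into lists of 'COLUMN=value' items, skipping blank cells."""
    reader = csv.DictReader(file_obj)  # consumes the header row as field names
    transactions = []
    for row in reader:
        items = [
            f"{column}={value}"
            for column, value in row.items()
            if value and (keep_columns is None or column in keep_columns)
        ]
        transactions.append(items)
    return transactions


transactions = rows_to_transactions(io.StringIO(sample_csv))
for t in transactions:
    print(t)
```

The resulting lists of items can be passed to `TransactionEncoder` in the same way as `dataset`.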


In [None]:
# Finding frequent elements
frequent_itemsets = apriori(df, min_support = 0.6, use_colnames = True)
print(frequent_itemsets)

     support                                   itemsets
0   0.999941                                         ()
1   0.858549                                    (Clear)
2   0.796809                                      (Dry)
3   0.660756                             (Intersection)
4   0.710278                           (Major Arterial)
5   0.863591                         (Non-Fatal Injury)
6   0.999644                                      (Yes)
7   0.858549                                  (, Clear)
8   0.796809                                    (, Dry)
9   0.660756                           (, Intersection)
10  0.710278                         (, Major Arterial)
11  0.863591                       (, Non-Fatal Injury)
12  0.999644                                    (, Yes)
13  0.794378                               (Clear, Dry)
14  0.605777                    (Major Arterial, Clear)
15  0.741237                  (Clear, Non-Fatal Injury)
16  0.858371                               (Clea

## Experiment 1 - Confidence Increase
---


**Confidence Threshold:** 0.9

**Hypothesis:**

By raising the confidence threshold to a high value, the rules generated will most likely relate weather conditions to accident fatality, because these features occur frequently and are easy to identify. For example, we can hypothesize that rainy conditions might lead to fatal accidents.

**Purpose:**

To explore the rules that are almost certain, which can provide insight into which features contribute to more severe accidents.

**Findings:**

As expected, the rules generated mostly associated dry and clear driving conditions with non-fatal injuries. Interestingly, no rule linked rainy conditions to fatal accidents. This is partly a consequence of the 0.6 support threshold used to build the frequent itemsets: items that occur in fewer than 60% of the records cannot appear in any rule.


In [None]:
# Generating the rules
rules_experiment_1 = association_rules(frequent_itemsets, metric = 'confidence', min_threshold = 0.9)

print(rules_experiment_1[['antecedents','consequents','support','confidence']])

                       antecedents     consequents   support  confidence
0                          (Clear)              ()  0.858549    1.000000
1                            (Dry)              ()  0.796809    1.000000
2                   (Intersection)              ()  0.660756    1.000000
3                 (Major Arterial)              ()  0.710278    1.000000
4               (Non-Fatal Injury)              ()  0.863591    1.000000
..                             ...             ...       ...         ...
88    (Dry, Yes, Non-Fatal Injury)       (, Clear)  0.686851    0.997244
89  (Clear, Yes, Non-Fatal Injury)         (, Dry)  0.686851    0.926851
90  (Clear, Dry, Non-Fatal Injury)         (, Yes)  0.686851    0.999741
91         (Dry, Non-Fatal Injury)  (, Clear, Yes)  0.686851    0.996987
92       (Clear, Non-Fatal Injury)    (, Yes, Dry)  0.686851    0.926628

[93 rows x 4 columns]
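
High confidence alone does not show that the antecedent makes the consequent more likely, since a consequent that is already very common yields high confidence for almost any antecedent. A rough check is lift, the rule's confidence divided by the consequent's overall support. The sketch below applies it to the rule (Clear) -> (Non-Fatal Injury), using the support values printed in the frequent itemset output earlier in this notebook; the `lift` helper is illustrative.

```python
def lift(support_joint, support_antecedent, support_consequent):
    """Lift of A -> C: confidence(A -> C) divided by support(C)."""
    confidence = support_joint / support_antecedent
    return confidence / support_consequent


# Supports copied from the frequent itemset output
support_clear = 0.858549
support_non_fatal = 0.863591
support_clear_and_non_fatal = 0.741237

confidence_clear_to_non_fatal = support_clear_and_non_fatal / support_clear
rule_lift = lift(support_clear_and_non_fatal, support_clear, support_non_fatal)

print(round(confidence_clear_to_non_fatal, 4))  # about 0.8634
print(round(rule_lift, 4))                      # about 0.9997
```

A lift close to 1 suggests that clear weather and non-fatal injuries co-occur about as often as their individual frequencies would predict. The DataFrame returned by mlxtend's `association_rules` also includes a `lift` column, so it can be printed alongside `support` and `confidence`.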


## Experiment 2 - Confidence Decrease
---


**Confidence Threshold:** 0.7

**Hypothesis:**

By lowering the confidence threshold, I believe there will be more rules related to rainy or dry conditions. As well, other features such as time of day might also appear in the rules.

**Purpose:**

To reveal more of the uncommon rules present within the dataset.

**Findings:**

Surprisingly, lowering the confidence threshold did not surface rules involving new features. More rules were generated (231 compared with 93 in Experiment 1), but they involve the same items as before. There appears to be no association with the other features that we thought might play a big role, such as time of day.


In [None]:
# Generating the rules
rules_experiment_2 = association_rules(frequent_itemsets, metric = 'confidence', min_threshold = 0.7)

print(rules_experiment_2[['antecedents','consequents','support','confidence']])

                   antecedents                       consequents   support  \
0                           ()                           (Clear)  0.858549   
1                      (Clear)                                ()  0.858549   
2                           ()                             (Dry)  0.796809   
3                        (Dry)                                ()  0.796809   
4               (Intersection)                                ()  0.660756   
..                         ...                               ...       ...   
226  (Clear, Non-Fatal Injury)                      (, Yes, Dry)  0.686851   
227               (Clear, Dry)         (, Yes, Non-Fatal Injury)  0.686851   
228         (Non-Fatal Injury)               (, Clear, Yes, Dry)  0.686851   
229                      (Dry)  (, Clear, Yes, Non-Fatal Injury)  0.686851   
230                    (Clear)    (, Dry, Yes, Non-Fatal Injury)  0.686851   

     confidence  
0      0.858600  
1      1.000000  
2      0.
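
Both confidence experiments reuse the same `frequent_itemsets`, which were mined with `min_support = 0.6`, so changing the confidence threshold can only filter rules over those itemsets. The toy sketch below (hypothetical transactions and a hypothetical `frequent_items` helper that counts single items only) shows how a rare item such as `Rain` or `Fatal` is excluded at a high support threshold and only appears once that threshold is lowered.

```python
from collections import Counter

# Hypothetical transactions: 'Rain' and 'Fatal' are deliberately rare
transactions = (
    [{"Clear", "Dry", "Non-Fatal Injury"}] * 8
    + [{"Rain", "Wet", "Non-Fatal Injury"}]
    + [{"Rain", "Wet", "Fatal"}]
)


def frequent_items(transactions, min_support):
    """Return single items whose support is at least `min_support`."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {
        item: count / n
        for item, count in counts.items()
        if count / n >= min_support
    }


high = frequent_items(transactions, min_support=0.6)
low = frequent_items(transactions, min_support=0.1)

print(sorted(high))  # only the common items survive
print(sorted(low))   # rare items such as 'Rain' and 'Fatal' now appear
```

Under this reading, exploring rules about rain or fatal outcomes would require lowering `min_support` in the `apriori` call rather than the confidence threshold.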

## Experiment 3 - Support Increase
---


**Support Threshold:** 0.8

**Hypothesis:**

By increasing the support threshold, the rules should be limited to only the common features. For example, there might be some association between vehicle type, weather, and fatality.


**Purpose:**

Since support describes how frequently an itemset occurs, raising the support threshold keeps only rules over the most common items, which may also have higher confidence.

**Findings:**

As expected, the confidence levels were on average higher, and most of the rules were associated with the common features.




In [None]:
# Generating the rules
rules_experiment_3 = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.8)

print(rules_experiment_3[['antecedents','consequents','support','confidence']])

                antecedents              consequents   support  confidence
0                        ()                  (Clear)  0.858549    0.858600
1                   (Clear)                       ()  0.858549    1.000000
2                        ()       (Non-Fatal Injury)  0.863591    0.863642
3        (Non-Fatal Injury)                       ()  0.863591    1.000000
4                        ()                    (Yes)  0.999644    0.999703
5                     (Yes)                       ()  0.999644    1.000000
6                   (Clear)                    (Yes)  0.858371    0.999793
7                     (Yes)                  (Clear)  0.858371    0.858677
8                     (Yes)       (Non-Fatal Injury)  0.863294    0.863601
9        (Non-Fatal Injury)                    (Yes)  0.863294    0.999657
10                (, Clear)                    (Yes)  0.858371    0.999793
11                  (, Yes)                  (Clear)  0.858371    0.858677
12             (Clear, Ye

## Summarizing Unsupervised Learning Results
---
Overall, analyzing the data and using the Apriori algorithm to generate rules made it evident that several features carry greater significance. When using the features derived from the classification task, it became apparent that the whole dataset needed to be considered, as the `df_sub` features did not have enough variables to create rules meeting adequate confidence thresholds. Several of the generated rules involve weather conditions and accident severity, which are key factors for solving the business problem.

Adjusting the support and confidence thresholds clearly differentiated which features were more significant in rule mining. The Apriori algorithm provided a clear and concise way to determine associations; however, the depth of insight gained from these rules was limited by the nature of the dataset chosen.