

The dataset contains many binary features, each indicating whether a particular piece of optional equipment is present in the car. 
This makes it well suited to association analysis on the equipment fields. 

Association analysis (also known as Market Basket Analysis) is a data mining process that identifies which events or activities performed by a group of users tend to occur together.

In our case, each listing plays the role of a transaction and each equipment feature is an item, so the results show which equipment features are found together most often. Three main measures describe the strength of an association rule: 
***
1. *Support*: 
    - $ supp(X) = {\text{# of listings in which }X \text{ appears} \over \text{Total # of listings}}$

    - The support of an itemset $ X $ is the proportion of transactions in the database (here, listings) that contain $ X $.
<br/><br/>

2.  *Confidence*:  
    - $conf(X \to Y) = {supp(X \cup Y)\over supp(X)}$

    - Confidence estimates the conditional probability that a transaction containing itemset $ X $ also contains itemset $ Y $.
<br/><br/>

3. *Lift*:   

    - $lift(X \to Y) = {supp(X \cup Y)\over supp(X) \times supp(Y)}$ 
    <br/><br/>
    - Lift measures the ratio of the observed support to that expected if  $ X $ and $ Y $ were independent.
        <br/><br/>
        - If $ lift(X \to Y)  = 1 $, the occurrences of itemset $ X $ and itemset $ Y $ are independent of each other, so the rule shows no association between them.
        <br/><br/>
        - If $ lift(X \to Y) > 1 $, $ X $ and $ Y $ appear together more often than expected under independence; the larger the value, the stronger the dependence between them.
        <br/><br/>
        - If $ lift(X \to Y) < 1 $, $ X $ and $ Y $ appear together less often than expected under independence, which suggests the items are substitutes for each other.
        
***
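To make the three definitions concrete, here is a minimal sketch on a made-up set of five listings. The data and the helper names `supp`, `conf` and `lift` are illustrative assumptions and are not part of the analysis below.

```python
# Toy data (invented for illustration): each set is the equipment of one listing.
listings = [
    {"ABS", "Side airbag", "Power windows"},
    {"ABS", "Side airbag"},
    {"ABS", "Power windows"},
    {"Side airbag"},
    {"ABS", "Side airbag", "Power windows"},
]


def supp(itemset):
    """Share of listings that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= listing for listing in listings) / len(listings)


def conf(x, y):
    """supp(X u Y) / supp(X): how often Y is present when X is present."""
    return supp(set(x) | set(y)) / supp(x)


def lift(x, y):
    """Observed joint support divided by the support expected under independence."""
    return supp(set(x) | set(y)) / (supp(x) * supp(y))


x, y = {"Side airbag"}, {"ABS"}
print(round(supp(x), 4), round(supp(y), 4), round(supp(x | y), 4))  # 0.8 0.8 0.6
print(round(conf(x, y), 4))  # 0.75
print(round(lift(x, y), 4))  # 0.9375
```

In this toy data the confidence is fairly high (0.75) even though the lift is below 1, because ABS is common on its own. Lift corrects for that base rate, which confidence does not.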

We sort the association table by lift, as it is the most comprehensive of the three measures and the most useful for our dataset: unlike confidence, it accounts for how common the consequent is on its own. 

In [7]:
import numpy as np
import pandas as pd
from pymongo import MongoClient 
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Raw string values found in the equipment columns; all of them are
# replaced with True below, i.e. the feature is treated as present
tbr = ['1','10 months','11 months','112 months','12 months','13 months','14 months','15 months','16 months','17 months','18 months',
       '19 months','2 months','20 months','21 months','22 months','23 months','24 months','25 months','26 months','27 months',
       '28 months','29 months','3 months','30 months','31 months','32 months','33 months','34 months','35 months','36 months',
       '38 months','4 months','40 months','41 months','42 months','43 months','44 months','45 months','46 months','47 months',
       '48 months','5 months','50 months','52 months','53 months','54 months','55 months','56 months','58 months','59 months',
       '6 months','60 months','7 months','72 months','8 months','84 months','88 months','9 months', '0 months','1 months']


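def readData():
    """Return the car listings whose `Loaded_in_DW` flag is False."""
    client = MongoClient('mongodb+srv://Martin:Kostadinov@dwprojectcluster.lpqbf.mongodb.net/cars_database?retryWrites=true&w=majority')

    # Pull every document from the `cars` collection into a DataFrame
    df_cars = pd.DataFrame(list(client.cars_database.cars.find({})))
    # MongoDB's internal document id is not a feature
    df_cars.drop('_id', axis = 1, inplace = True)
    # Keep only the rows with Loaded_in_DW == False
    df_cars = df_cars[df_cars['Loaded_in_DW'].eq(False)]

    return df_cars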


df_cars = readData()

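# The equipment columns start at positional index 15
equipment = df_cars.iloc[:,15:]
# Encode every cell as a boolean: a missing value means the feature is absent,
# while 1, '1' and the strings listed in tbr mean it is present
equipment = equipment.replace({np.nan: False})
equipment = equipment.replace({1: True})
equipment = equipment.replace({'1': True})
equipment = equipment.replace(tbr , True)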

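# Frequent itemsets: keep only equipment combinations present in at least 70% of listings
ap = apriori(equipment, min_support=0.7, use_colnames=True)
# min_threshold=0 keeps every rule derived from those itemsets; the ranking is done by the sort
rules_ap = association_rules(ap, metric="lift", min_threshold=0)
# Show the rules with the highest lift first
rules_ap.sort_values(by = 'lift', ascending = False)[0:20]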

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 94 | (Side airbag) | (Passenger-side airbag, ABS) | 0.783843 | 0.786129 | 0.70144 | 0.894873 | 1.138328 | 0.085238 | 2.034404 |
| 91 | (Passenger-side airbag, ABS) | (Side airbag) | 0.786129 | 0.783843 | 0.70144 | 0.892271 | 1.138328 | 0.085238 | 2.006483 |
| 93 | (Passenger-side airbag) | (Side airbag, ABS) | 0.821904 | 0.75247 | 0.70144 | 0.853433 | 1.134175 | 0.082982 | 1.688848 |
| 92 | (Side airbag, ABS) | (Passenger-side airbag) | 0.75247 | 0.821904 | 0.70144 | 0.932183 | 1.134175 | 0.082982 | 2.626135 |
| 41 | (Side airbag) | (Passenger-side airbag) | 0.783843 | 0.821904 | 0.723756 | 0.923343 | 1.123419 | 0.079512 | 2.323283 |
| 40 | (Passenger-side airbag) | (Side airbag) | 0.821904 | 0.783843 | 0.723756 | 0.880584 | 1.123419 | 0.079512 | 1.810123 |
| 86 | (ABS, Power windows) | (Side airbag) | 0.800806 | 0.783843 | 0.701644 | 0.876171 | 1.117789 | 0.073937 | 1.745611 |
| 87 | (Side airbag) | (ABS, Power windows) | 0.783843 | 0.800806 | 0.701644 | 0.895133 | 1.117789 | 0.073937 | 1.899482 |
| 84 | (Side airbag, ABS) | (Power windows) | 0.75247 | 0.84132 | 0.701644 | 0.932454 | 1.108323 | 0.068576 | 2.34922 |
| 89 | (Power windows) | (Side airbag, ABS) | 0.84132 | 0.75247 | 0.701644 | 0.83398 | 1.108323 | 0.068576 | 1.490963 |

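All lift values in this table sit close to 1. One likely reason is the high `min_support`: the joint support of a rule can never exceed the smaller of the two individual supports, so lift is bounded above by $ 1 / \max(supp(X), supp(Y)) $. The sketch below checks this bound for the row indexed 94, using values copied from the table; the helper name `max_lift` is hypothetical.

```python
def max_lift(supp_x, supp_y):
    """Upper bound on lift(X -> Y), since supp(X u Y) <= min(supp_x, supp_y)."""
    return min(supp_x, supp_y) / (supp_x * supp_y)


# Row 94 of the table: (Side airbag) -> (Passenger-side airbag, ABS)
supp_x, supp_y, observed_lift = 0.783843, 0.786129, 1.138328

bound = max_lift(supp_x, supp_y)
print(round(bound, 3))          # 1.272
print(observed_lift <= bound)   # True

# Every antecedent and consequent comes from an itemset with support >= 0.7,
# so no rule produced with min_support=0.7 can have a lift above this value:
print(round(max_lift(0.7, 0.7), 3))  # 1.429
```

Lowering `min_support` would admit rarer equipment combinations, for which much larger lift values are possible.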

##### The same table sorted by confidence:

In [6]:
rules_ap.sort_values(by = 'confidence', ascending = False)[0:20]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 90 | (Passenger-side airbag, Side airbag) | (ABS) | 0.723756 | 0.891539 | 0.70144 | 0.969166 | 1.087071 | 0.056183 | 3.517606 |
| 85 | (Side airbag, Power windows) | (ABS) | 0.724028 | 0.891539 | 0.701644 | 0.969084 | 1.086979 | 0.056145 | 3.508252 |
| 13 | (Electronic stability control) | (ABS) | 0.736619 | 0.891539 | 0.713473 | 0.968578 | 1.086412 | 0.056749 | 3.451762 |
| 79 | (Passenger-side airbag, Power windows) | (ABS) | 0.748709 | 0.891539 | 0.725023 | 0.968365 | 1.086173 | 0.057521 | 3.428512 |
| 61 | (Side airbag, Power steering) | (ABS) | 0.731519 | 0.891539 | 0.707683 | 0.967415 | 1.085107 | 0.055505 | 3.328592 |
| 55 | (Passenger-side airbag, Power steering) | (ABS) | 0.757113 | 0.891539 | 0.732066 | 0.966917 | 1.084549 | 0.05707 | 3.278506 |
| 72 | (Passenger-side airbag, Air conditioning) | (ABS) | 0.742349 | 0.891539 | 0.716947 | 0.965782 | 1.083275 | 0.055114 | 3.169701 |
| 50 | (Power windows, Power steering) | (ABS) | 0.773884 | 0.891539 | 0.745344 | 0.963121 | 1.080291 | 0.055396 | 2.940986 |
| 43 | (Air conditioning, Power steering) | (ABS) | 0.768087 | 0.891539 | 0.738366 | 0.961305 | 1.078254 | 0.053587 | 2.802995 |
| 67 | (Air conditioning, Power windows) | (ABS) | 0.76837 | 0.891539 | 0.737936 | 0.960392 | 1.077229 | 0.052904 | 2.73834 |

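Every rule in this table has ABS as its consequent, and ABS alone appears in about 89% of listings (consequent support 0.891539), so the high confidence values partly reflect how common ABS is. From the definitions, $ lift(X \to Y) = conf(X \to Y) / supp(Y) $. The short check below recomputes lift that way for the rows indexed 90 and 13, with values copied from the table; the variable names are illustrative.

```python
# (antecedents, consequent support, confidence, reported lift), copied from the table
rows = [
    ("Passenger-side airbag, Side airbag", 0.891539, 0.969166, 1.087071),
    ("Electronic stability control", 0.891539, 0.968578, 1.086412),
]

recomputed_lifts = []
for antecedents, consequent_support, confidence, reported_lift in rows:
    # lift = confidence / supp(consequent)
    recomputed = confidence / consequent_support
    recomputed_lifts.append(recomputed)
    print(f"{antecedents}: recomputed {recomputed:.4f}, reported {reported_lift:.4f}")
```

The recomputed values agree with the reported ones up to the rounding of the table entries: a confidence of about 0.97 corresponds to a lift of only about 1.09 once the base rate of ABS is taken into account.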