## Chapter 13: Data Association

In this notebook, you will learn how to use association to identify patterns within datasets, such as the items a customer often purchases together at a grocery store, the links a customer clicks on a website before making a purchase, and the snacks a fan purchases together at a ballgame.

As you will learn, one of the most well-known association applications is the shopping-cart problem, which identifies the association between buying diapers and beer. You can think of this association process as looking in each shopping cart as customers leave the market and taking note of the items they bought. By noting that many of the carts that contained diapers also contained beer, you form an association. Data analysts often refer to this process as Market-Basket Analysis.

Some of the scripts presented in this notebook use several Python libraries, which have been pre-installed for you. If you needed to install these libraries yourself, you would run the following commands:

```python
! pip install --user pandas
! pip install --user numpy
! pip install --user matplotlib
! pip install --user apyori
```

# Understanding Support, Confidence, and Lift

To determine the level of association between two variables (the antecedent, which is the variable that occurs first, and the consequent, which is the variable that occurs following or as a result of the antecedent), we will examine three measures: support, confidence, and lift.

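* Support is a measure that specifies the frequency with which an item (or item set) occurs within a data set.
* Confidence is a measure that indicates how likely the consequent is given the antecedent: the number of occurrences that contain both the antecedent and the consequent, divided by all occurrences of the antecedent.
* Lift is a measure that shows the ratio of confidence to the expected confidence, where the expected confidence is the support of the consequent on its own.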
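
To make these three measures concrete, here is a minimal sketch that computes them by hand for the rule "diapers, then beer". The five baskets are made-up illustration data, not rows from the DiapersAndBeer dataset, and the helper function names are hypothetical.

```python
# Hypothetical baskets for illustration only (not from DiapersAndBeer.csv).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's own support (the expected confidence)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)

print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 suggests that the two items appear together more often than they would if they were purchased independently, while a lift near 1 suggests little or no association.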
    

The following Python script, DiapersAndBeer.py, uses the DiapersAndBeer dataset to calculate support, confidence, and lift:

In [None]:
import pandas as pd
from apyori import apriori

data = pd.read_csv('DiapersAndBeer.csv', header=None)

records = []
for i in range(0, len(data)):  
    records.append([str(data.values[i,j]) for j in range(0, len(data.columns))])

rules = apriori(records, min_length=2)  
results = list(rules)  

for item in results:
    print()
    print(item)
    print()
    print("-----------------------")

As you can see, the program displays the results for each item and for the combinations of items that occur in the shopping cart (called item sets). Within the combinations, you will see items_base, which specifies the antecedent, and items_add, which specifies the consequent.
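
Each printed result is a nested record, which can be hard to read. The sketch below flattens such records into one line per rule. It assumes the records expose the fields `items`, `support`, and `ordered_statistics`, and that each ordered statistic exposes `items_base`, `items_add`, `confidence`, and `lift`, as they appear in the printed output. The two namedtuples are stand-ins defined here so the example runs without the library, and `flatten_rules` and the sample numbers are hypothetical.

```python
from collections import namedtuple

# Stand-ins that mirror the field names seen in the printed output.
# They are defined here only so the sketch runs without the library.
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")

def flatten_rules(results):
    """Return one (antecedent, consequent, support, confidence, lift) tuple per rule."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            if not stat.items_base:
                continue  # an empty antecedent describes the item set alone, not a rule
            rows.append((
                sorted(stat.items_base),
                sorted(stat.items_add),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows

# Made-up sample record for illustration only.
sample = [
    RelationRecord(
        items=frozenset({"diapers", "beer"}),
        support=0.4,
        ordered_statistics=[
            OrderedStatistic(frozenset(), frozenset({"diapers", "beer"}), 0.4, 1.0),
            OrderedStatistic(frozenset({"diapers"}), frozenset({"beer"}), 0.67, 1.11),
        ],
    )
]

rows = flatten_rules(sample)
for antecedent, consequent, sup, conf, rule_lift in rows:
    print(f"{antecedent} -> {consequent}: support={sup:.2f}, confidence={conf:.2f}, lift={rule_lift:.2f}")
```

If the field names match, passing the `results` list from the script above in place of `sample` should work without other changes.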

The following Python script, RealWorldApriori.py, uses the Groceries dataset to calculate support, confidence, and lift:

In [None]:
######################################
# Chapter 13 (Python) / Deliverable 1
######################################

import pandas as pd
from apyori import apriori

data = pd.read_csv('Groceries.csv', header=None)

records = []
for i in range(0, len(data)):  
    records.append([str(data.values[i,j]) for j in range(0, len(data.columns))])

rules = apriori(records, min_support=0.025, min_length=2, min_lift=1.2)  
results = list(rules)  

for item in results:
    if 'nan' not in str(item):
        print()
        print(item)
        print()
        print("-----------------------")

As you can see, the program opens the data set and then uses the apriori function to determine the association rules. The function call specifies the minimum number of items to consider as two and, to reduce uninteresting output, specifies minimum support and lift parameters.
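
The script skips any result that contains the string 'nan', presumably because baskets with fewer items leave empty cells that pandas reads as missing values. An alternative, sketched below under the assumption that each CSV row lists the items of one basket, is to drop the empty cells while building the records so that the padding never reaches the algorithm. The function name `load_baskets` and the inline sample rows are hypothetical.

```python
import csv
import io

def load_baskets(csv_file):
    """Read one basket per row, skipping blank cells and blank rows."""
    baskets = []
    for row in csv.reader(csv_file):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:
            baskets.append(items)
    return baskets

# Inline sample standing in for a file such as Groceries.csv (made-up rows).
sample_csv = io.StringIO(
    "milk,bread,butter\n"
    "beer,,\n"
    "milk,bread,\n"
)

records = load_baskets(sample_csv)
print(records)
```

To read a real file, you would pass `open('Groceries.csv', newline='')` in place of the `StringIO` object.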

# Dataset Summaries and Correlation


The goal of association is to identify relationship patterns within a dataset that illustrate the influence of an antecedent variable on a consequent variable. Do not confuse association with correlation, which merely identifies a statistical relationship between two variables. Correlation can be positive (the variables increase together), negative (one variable increases as the other decreases), or nonexistent.
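
To illustrate these three cases, the following sketch computes the Pearson correlation coefficient from its definition, using only the standard library. The function name `pearson` and the small lists of values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: co-variation of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

base = [1, 2, 3, 4, 5]           # made-up values
rising = [2, 4, 6, 8, 10]        # increases as base increases
falling = [10, 8, 6, 4, 2]       # decreases as base increases
unrelated = [1, -1, 0, -1, 1]    # no linear trend against base

positive = pearson(base, rising)
negative = pearson(base, falling)
no_trend = pearson(base, unrelated)

print(f"positive={positive:.2f} negative={negative:.2f} no_trend={no_trend:.2f}")
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a coefficient near 0 indicates little or no linear relationship.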

The following Python script, Summary.py, loads the Auto-MPG dataset that contains data about different car models, such as the horsepower, weight, and miles per gallon (MPG). The script then uses the describe function to provide a summary of the dataset values, which includes each column’s min, max, mean, standard deviation, and so on:

In [None]:
import pandas as pd
data = pd.read_csv('auto-mpg.csv')
print(data.describe())

As you can see, the describe function returns the count, mean, min, max, and standard deviation, as well as quartile values. Using the describe function, you can quickly gain insights into the data a dataset contains.

The following Python script, MPGCorrelation.py, displays the correlation between MPG and other vehicle attributes:

In [None]:
######################################
# Chapter 13 (Python) / Deliverable 2
######################################


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('auto-mpg.csv')

coefs = np.corrcoef(data['mpg'], data['weight'])
plt.scatter(data['mpg'], data['weight'])
plt.title('MPG and Weight Correlation: ' + str(coefs[0,1]))
plt.show()

coefs = np.corrcoef(data['mpg'], data['horsepower'])
plt.scatter(data['mpg'], data['horsepower'])
plt.title('MPG and Horsepower Correlation: ' + str(coefs[0,1]))
plt.show()

coefs = np.corrcoef(data['mpg'], data['acceleration'])
plt.scatter(data['mpg'], data['acceleration'])
plt.title('MPG and Acceleration Correlation: ' + str(coefs[0,1]))
plt.show()