## Chapter 13: Data Association

In this notebook, you will learn how to use association to identify patterns within datasets, such as the items a customer often purchases together at a grocery store, the links on a Web site upon which a customer clicks before making a purchase, and the snacks a fan purchases together at a ballgame.

As you will learn, one of the best-known association applications is the shopping-cart problem, which identified an association between buying diapers and beer. You can think of the association process as looking in each shopping cart as customers leave the market and noting the items they bought. By observing that many of the carts that contained diapers also contained beer, you form an association. Data analysts often refer to this process as market basket analysis.

Some of the scripts presented in this notebook use R libraries (known as packages), which have been pre-installed for you. If you had to install these libraries on your own, you would issue the command below.

```R
install.packages("arules")
# The datasets package ships with base R, so it needs no separate installation.
```

# Understanding Support, Confidence, and Lift

To measure the strength of association between two variables (the antecedent, which is the item or event that occurs first, and the consequent, which is the item or event that follows or results from the antecedent), we will examine three measures: support, confidence, and lift.

* Support is a measure of the frequency with which an item (or a combination of items) occurs within a dataset.
* Confidence is a measure of the likelihood of the consequent given the antecedent: the fraction of all occurrences of the antecedent in which the consequent also occurs.
* Lift is the ratio of a rule's confidence to its expected confidence, which is the confidence you would see if the antecedent and consequent were independent.
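
To make these three measures concrete, the following sketch computes them by hand for the rule {diapers} => {beer}. The five carts and the `support` helper are hypothetical and are not taken from the DiapersAndBeer data. The sketch also assumes the common definitions in which a rule's support is the fraction of carts that contain both items and the expected confidence is the support of the consequent.

```R
# Toy transactions (hypothetical data, not the DiapersAndBeer file).
carts <- list(
  c("diapers", "beer", "milk"),
  c("diapers", "beer"),
  c("diapers", "bread"),
  c("beer", "chips"),
  c("milk", "bread")
)

# Hypothetical helper: fraction of carts that contain every item in `items`.
support <- function(items, carts) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

supp_rule  <- support(c("diapers", "beer"), carts)   # carts with both items
confidence <- supp_rule / support("diapers", carts)  # beer carts among diaper carts
lift       <- confidence / support("beer", carts)    # confidence / expected confidence

print(c(support = supp_rule, confidence = confidence, lift = lift))
```

A lift above 1 suggests that the two items appear together more often than they would if they were purchased independently, whereas a lift near 1 suggests little or no association.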

The following R script, DiapersAndBeer.R, uses the DiapersAndBeer data to calculate support, confidence, and lift. In the output, the antecedents appear in the lhs (left-hand side) column and the consequents in the rhs (right-hand side) column:

In [None]:
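library(arules)    # provides apriori() and inspect()
library(datasets)

df <- read.csv(file='DiapersAndBeer.csv')

rules <- apriori(df)   # mine association rules using the default thresholds
inspect(rules)         # list each rule with its support, confidence, and lift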

The following R script, RealWorldApriori.R, uses the Groceries dataset to calculate support, confidence, and lift, and specifies that only item associations with a minimum support of 0.001 and minimum confidence of 0.9 be returned:

In [None]:
#################################
# Chapter 13 (R) / Deliverable 1
#################################

library(arules)
library(datasets)

data(Groceries)

rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.9))

inspect(rules)

# Dataset Summaries and Correlation


The goal of association is to identify relationship patterns within a dataset that illustrate the influence of an antecedent variable on a consequent variable. Do not confuse association with correlation, which identifies a statistical relationship between two variables. Correlation can be positive, negative, or nonexistent.
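
As a quick illustration of the three cases, the sketch below uses base R's `cor` function on small made-up vectors (the variable names and values are hypothetical). One variable rises with `x`, one falls as `x` rises, and one has no linear trend with `x`.

```R
x <- 1:10
y_pos  <- 2 * x + 1          # rises as x rises
y_neg  <- -3 * x             # falls as x rises
y_none <- (x - mean(x))^2    # symmetric about the mean of x: no linear trend

print(cor(x, y_pos))    # positive correlation
print(cor(x, y_neg))    # negative correlation
print(cor(x, y_none))   # essentially zero correlation
```

Note that `y_none` depends entirely on `x`, yet its correlation is essentially zero, because correlation captures only the linear part of a relationship.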

The following R script, Summary.R, loads the Auto-MPG dataset, which contains data about different car models such as the horsepower, weight, and miles per gallon (MPG). The script then uses the summary function to provide a summary of the dataset values, which includes each numeric column’s min, max, mean, median, and quartiles:

In [None]:
df <- read.csv(file='auto-mpg.csv')
print(summary(df))

As you can see, the summary function returns each numeric column's min, max, mean, and median, as well as quartile values. Using the summary function, you can quickly gain insights into the data a dataset contains.
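
For numeric columns, base R's `summary` reports the min, quartiles, median, mean, and max, but not the count or standard deviation. If you also want those two values, one possible approach is sketched below. The `cars_df` data frame is a hypothetical stand-in with made-up values, not the contents of auto-mpg.csv.

```R
# Hypothetical stand-in for a few rows of car data.
cars_df <- data.frame(
  mpg    = c(18, 15, 36, 26, 31),
  weight = c(3504, 3693, 1800, 2265, 1985)
)

# Keep only the numeric columns, then compute the count of
# non-missing values and the standard deviation for each one.
numeric_cols <- cars_df[sapply(cars_df, is.numeric)]
extra <- data.frame(
  count = sapply(numeric_cols, function(col) sum(!is.na(col))),
  sd    = sapply(numeric_cols, sd, na.rm = TRUE)
)

print(summary(cars_df))
print(extra)
```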

The following R script, MPGCorrelation.R, displays the correlation between MPG and other vehicle attributes:

In [None]:
#################################
# Chapter 13 (R) / Deliverable 2
#################################

df <- read.csv(file='auto-mpg.csv')

plots <- par(mfrow=c(1, 3)) 

corr <- cor(df$mpg, df$weight)   # correlate MPG with the plotted variable (weight)
title <- paste("MPG / Weight ", sprintf("%s", corr))
plot(df$weight, df$mpg, main=title,
   xlab="Car Weight ", ylab="Miles Per Gallon ")

corr <- cor(df$mpg, df$horsepower)   # assumes horsepower was read as a numeric column
title <- paste("         MPG / Horsepower ", sprintf("%s", corr))
plot(df$horsepower, df$mpg, main=title,
   xlab="Car Horsepower ", ylab="Miles Per Gallon ")

corr <- cor(df$mpg, df$acceleration)   # assign the variables to be correlated
title <- paste("       MPG / Acceleration ", sprintf("%s", corr))
plot(df$acceleration, df$mpg, main=title,  # assign the variables to be plotted
   xlab="Car Acceleration ", ylab="Miles Per Gallon ")

par(plots)
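
If you want the correlations without the plots, a more compact alternative is to pass several columns to `cor` at once, which returns a correlation matrix. The sketch below assumes a small hypothetical data frame, `cars_df`, with made-up values; in the notebook, the columns would come from auto-mpg.csv instead.

```R
# Hypothetical sample values; not taken from auto-mpg.csv.
cars_df <- data.frame(
  mpg          = c(18, 15, 36, 26, 31, 22),
  weight       = c(3504, 3693, 1800, 2265, 1985, 2800),
  horsepower   = c(130, 165, 58, 90, 67, 95),
  acceleration = c(12.0, 11.5, 16.4, 15.5, 16.0, 15.0)
)

# use = "complete.obs" drops rows that contain NA before correlating.
corr_matrix <- cor(cars_df, use = "complete.obs")

# The "mpg" column holds the correlation of each attribute with MPG.
print(round(corr_matrix[, "mpg"], 2))
```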