# CAR TALK: SLICING AND DICING OF PRICING
**_The search for normality and significance_**
### Data Science 410 BB
#### University of Washington Professional & Continuing Education
#### Homework 4: Hypothesis Testing
#### Leo Salemann, 2/1/18


# Load & Inspect Data

In [None]:
read.auto = function(file = '../../../DataScience410/Lecture1/Automobile price data _Raw_.csv'){
  ## Read the csv file
  auto.price <- read.csv(file, header = TRUE, 
                      stringsAsFactors = FALSE)

  ## Coerce some character columns to numeric
  numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm')
  auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)

  ## Remove cases or rows with missing values. In this case we keep the 
  ## rows which do not have nas. 
  auto.price[complete.cases(auto.price), ]
}
auto.price = read.auto()
str(auto.price)

## Normalize Pricing values
Rescale everything to range from zero to one.
The `normit` function comes from [Stack Overflow](https://stackoverflow.com/questions/5665599/range-standardization-0-to-1-in-r).

In [None]:
prices = auto.price$price
prices.log = log(auto.price$price)

normit = function(x){(x-min(x))/(max(x)-min(x))}

prices.norm = normit(prices)
prices.log.norm = normit(prices.log)

head(prices.norm)
head(prices.log.norm)

# Compare and test Normality: A Graphical Approach

## Start with paired Q-Q plots  

In [None]:
options(repr.plot.width=5, repr.plot.height=3)

# Visual test of normality
par(mfrow = c(1, 2))
qqnorm(prices.norm, main = 'Q-Q plot of Price')
qqnorm(prices.log.norm, main = 'Q-Q plot of log(Price)')
par(mfrow = c(1, 1))

Log(Price) appears to be closer to normal (i.e. closer to a straight line) than regular price.

## A closer look with individual plots

In [None]:
options(repr.plot.width=5, repr.plot.height=5)

# Visual test of normality
qqnorm(prices.norm, main = 'Q-Q plot of Price')
abline(a = 0.0, b = 1.0, lty = 2, col = 'blue')

In [None]:
options(repr.plot.width=5, repr.plot.height=5)

# Visual test of normality
qqnorm(prices.log.norm, main = 'Q-Q plot of log(Price)')
abline(a = 0.0, b = 1.0, lty = 2, col = 'blue')

There's a tiny patch between the 0.4 and 0.6 quantiles where log(Price) matches a normal distribution; regular Price doesn't come close.
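
One caveat on the reference line: `abline(a = 0.0, b = 1.0)` only lines up with a Q-Q plot when the sample is on the same scale as the theoretical standard normal quantiles, and `normit` puts prices on a 0-to-1 scale instead. Below is a minimal sketch of an alternative, assuming simulated log-normal values (`sim.prices` is a hypothetical stand-in, not the auto data). `qqline` draws a line through the first and third quartiles on whatever scale the sample uses, and the hypothetical helper `qq.cor` summarizes straightness as the correlation of the Q-Q points.

```r
# Hypothetical stand-in for the price column: simulated log-normal values
set.seed(410)
sim.prices <- rlnorm(195, meanlog = 9.3, sdlog = 0.5)

par(mfrow = c(1, 2))
qqnorm(sim.prices, main = 'Q-Q plot of simulated Price')
qqline(sim.prices, lty = 2, col = 'blue')       # line through 1st and 3rd quartiles
qqnorm(log(sim.prices), main = 'Q-Q plot of log(simulated Price)')
qqline(log(sim.prices), lty = 2, col = 'blue')
par(mfrow = c(1, 1))

# Hypothetical helper: correlation between theoretical and sample quantiles.
# Values closer to 1 mean the points fall closer to a straight line.
qq.cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq.cor(sim.prices)
qq.cor(log(sim.prices))
```

With the real data, the same calls would take `prices` and `prices.log` in place of the simulated vectors.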

# Compare and test Normality: Formal tests through K-S Statistics

In [None]:
normal.distro = rnorm(948) ## Our standard Normal for comparison.
ks.test(prices.norm, normal.distro, alternative = "two.sided") 
ks.test(prices.log.norm, normal.distro, alternative = "two.sided") 


Analytically, the p-values are the same; the K-S test can't tell the two apart.
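
A possible reason the two p-values come out the same: the min-max normalized values all lie between 0 and 1, while `rnorm(948)` is centered at 0 with a standard deviation of 1, so the mismatch in location and scale probably swamps any difference in shape. The sketch below shows one alternative under stated assumptions. It uses simulated log-normal values (`sim.prices`, hypothetical, not the auto data), standardizes them to z-scores with `scale`, and compares against the theoretical normal CDF (`"pnorm"`) rather than a random sample. Because the mean and standard deviation are estimated from the same data, the resulting p-values are only approximate.

```r
# Hypothetical stand-in for the price column: simulated log-normal values
set.seed(410)
sim.prices <- rlnorm(195, meanlog = 9.3, sdlog = 0.5)

# z-score standardization: mean 0, standard deviation 1
z.price     <- as.numeric(scale(sim.prices))
z.log.price <- as.numeric(scale(log(sim.prices)))

# One-sample K-S tests against the standard normal CDF
ks.price     <- ks.test(z.price, "pnorm")
ks.log.price <- ks.test(z.log.price, "pnorm")

# Smaller D means the empirical CDF sits closer to the normal CDF
c(price = unname(ks.price$statistic), log.price = unname(ks.log.price$statistic))
c(price = ks.price$p.value, log.price = ks.log.price$p.value)
```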


## Conclusion, Normality Tests
Both are terrible, but log of price is slightly more similar to a normal distribution.

# Testing Significance of log(Price) based on fuel, aspiration, drive train

## Significance by Fuel Type: Are Diesel Cars More Expensive than Gas?

Start with a plot function that we'll be using over and over again.

In [None]:
plot.t <- function(a, b, cols = c('pop_A', 'pop_B'), nbins = 20){
  maxs = max(c(max(a), max(b)))
  mins = min(c(min(a), min(b)))
  breaks = seq(mins, maxs, length.out = (nbins + 1))
  par(mfrow = c(2, 1))
  hist(a, breaks = breaks, main = paste('Histogram of', cols[1]), xlab = cols[1])
  abline(v = mean(a), lwd = 4, col = 'red')
  hist(b, breaks = breaks, main = paste('Histogram of', cols[2]), xlab = cols[2])
  abline(v = mean(b), lwd = 4, col = 'red')
  par(mfrow = c(1, 1))
}

In [None]:
table(auto.price$fuel.type)

In [None]:
autos.gas = auto.price[auto.price$fuel.type == 'gas',]
autos.diesel = auto.price[auto.price$fuel.type == 'diesel',]
autos.gas.log.prices = log(autos.gas$price)
autos.diesel.log.prices = log(autos.diesel$price)

In [None]:
plot.t(autos.gas.log.prices, autos.diesel.log.prices, 
       cols = c('log(Gas Auto Price)', 'log(Diesel Auto Price)'))

t.test(autos.gas.log.prices, autos.diesel.log.prices, alternative = "two.sided")

* Histograms appear to have significant overlap, with means fairly close together.

* The p-value indicates that, if gas and diesel cars had the same mean log(price), we'd obtain the observed difference (or larger) in 5.6% of samples due to random sampling error.

* The 95% confidence interval straddles zero. 

* All the above suggests diesel autos *are not* reliably more expensive than gas autos.
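
The checks in the bullets above can also be read straight off the object that `t.test` returns. This is a minimal sketch on simulated values (`log.a` and `log.b` are hypothetical groups, not the gas and diesel prices); the names `ci.straddles.zero` and `reject.at.5pct` are introduced here for illustration only.

```r
# Hypothetical log prices for two groups; not the real auto data
set.seed(410)
log.a <- rnorm(170, mean = 9.3, sd = 0.5)
log.b <- rnorm(20,  mean = 9.5, sd = 0.4)

# t.test runs Welch's unequal-variance test by default
tt <- t.test(log.a, log.b, alternative = "two.sided")

tt$p.value     # probability of a gap at least this large if the true means are equal
tt$conf.int    # 95% confidence interval for mean(log.a) - mean(log.b)

# For a two-sided test, these two checks agree with each other
ci.straddles.zero <- tt$conf.int[1] < 0 && tt$conf.int[2] > 0
reject.at.5pct    <- tt$p.value < 0.05
c(ci.straddles.zero = ci.straddles.zero, reject.at.5pct = reject.at.5pct)
```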

## Aspiration: Are Turbos more expensive? 

In [None]:
head(auto.price, 3)

In [None]:
table(auto.price$aspiration)

In [None]:
autos.std = auto.price[auto.price$aspiration == 'std',]
autos.turbo = auto.price[auto.price$aspiration == 'turbo',]
autos.std.log.prices = log(autos.std$price)
autos.turbo.log.prices = log(autos.turbo$price)

In [None]:
plot.t(autos.std.log.prices, autos.turbo.log.prices, 
       cols = c('log(Auto Prices), std Aspiration', 'log(Auto Prices), Turbo'))

t.test(autos.std.log.prices, autos.turbo.log.prices, alternative = "two.sided")

* Histograms appear to have somewhat divergent shapes with a fairly big gap between means.

* P-value is infinitesimal, indicating it would be highly unlikely that the observed price difference is due to random sampling error.

* Both ends of the 95% confidence interval are on the same side of zero.

* All the above suggests a turbocharger is usually going to have a significant impact on price.
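
Since the test runs on log(price), the difference in means is on the log scale. Exponentiating it gives the ratio of geometric mean prices, which may be easier to read. This is a sketch on simulated values only (`log.std` and `log.turbo` are hypothetical, not the auto data).

```r
# Hypothetical log prices; not the real auto data
set.seed(410)
log.std   <- rnorm(160, mean = 9.2, sd = 0.5)
log.turbo <- rnorm(35,  mean = 9.6, sd = 0.4)

tt <- t.test(log.turbo, log.std)

# Difference of mean log prices (turbo minus std)
diff.log <- unname(tt$estimate[1] - tt$estimate[2])

# Back-transform: ratio of geometric mean prices and its 95% interval
ratio    <- exp(diff.log)
ratio.ci <- exp(tt$conf.int)

ratio      # e.g. 1.5 would mean turbo cars cost about 1.5x as much
ratio.ci
```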

## Drive Train: Is Front-Wheel Drive cheaper than Rear-Wheel Drive?

In [None]:
head(auto.price, 3)

In [None]:
table(auto.price$drive.wheels)

In [None]:
autos.fwd = auto.price[auto.price$drive.wheels == 'fwd',]
autos.rwd = auto.price[auto.price$drive.wheels == 'rwd',]
autos.fwd.log.prices = log(autos.fwd$price)
autos.rwd.log.prices = log(autos.rwd$price)

In [None]:
plot.t(autos.fwd.log.prices, autos.rwd.log.prices, 
       cols = c('log of Auto Prices, FWD', 'log of Auto Prices, RWD'))

t.test(autos.fwd.log.prices, autos.rwd.log.prices, alternative = "two.sided")

* Histograms appear to have highly divergent shapes with a major gap between means.

* P-value is infinitesimal, indicating it would be highly unlikely that the observed price difference is due to random sampling error.

* Both ends of the 95% confidence interval are on the same side of zero.

* All the above indicates front-wheel-drive cars are, on average, significantly cheaper than rear-wheel-drive cars.
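
The questions in this section are directional ("cheaper than", "more expensive than"), while the tests are two-sided. Here is a sketch of the one-sided variant, on simulated values only (`log.fwd` and `log.rwd` are hypothetical, not the auto data). `alternative = "less"` tests whether the first group's mean is below the second's.

```r
# Hypothetical log prices; not the real auto data
set.seed(410)
log.fwd <- rnorm(115, mean = 9.0, sd = 0.35)
log.rwd <- rnorm(75,  mean = 9.8, sd = 0.45)

two.sided <- t.test(log.fwd, log.rwd, alternative = "two.sided")
one.sided <- t.test(log.fwd, log.rwd, alternative = "less")  # H1: mean(fwd) < mean(rwd)

# When the observed difference is in the hypothesized direction,
# the one-sided p-value is half the two-sided one.
c(two.sided = two.sided$p.value, one.sided = one.sided$p.value)

# The one-sided interval is unbounded below and bounded above
one.sided$conf.int
```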

# Pricing and Body Style

## Price by body style, ANOVA

In [None]:
head(auto.price, 4)

In [None]:
table (auto.price$body.style)

Drop convertibles and hardtops, since there aren't very many of them.

In [None]:
autos.body.style = auto.price[auto.price$body.style %in% c("hatchback","sedan", "wagon"),]
table (autos.body.style$body.style)

In [None]:
autos.body.style.aov = aov(log(price) ~ body.style, data = autos.body.style)
summary(autos.body.style.aov)

The p-value indicates that at least two body styles differ significantly in mean log(price); ANOVA alone doesn't say which ones.
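
For reference, the F statistic and p-value can be pulled out of the ANOVA table directly, and the F statistic is the ratio of the between-group to the within-group mean square. This is a sketch on simulated data (the `sim` data frame and its group means are hypothetical, not the auto data).

```r
# Hypothetical data: three body styles with made-up mean log prices
set.seed(410)
group.means <- c(hatchback = 9.1, sedan = 9.4, wagon = 9.3)
sim <- data.frame(
  body.style = factor(rep(names(group.means), times = c(60, 90, 25)))
)
sim$log.price <- rnorm(nrow(sim),
                       mean = group.means[as.character(sim$body.style)],
                       sd = 0.45)

fit <- aov(log.price ~ body.style, data = sim)
tab <- summary(fit)[[1]]          # the ANOVA table as a data frame

f.stat <- tab[["F value"]][1]
p.val  <- tab[["Pr(>F)"]][1]

# F is the between-group mean square over the within-group mean square
ms.between <- tab[["Mean Sq"]][1]
ms.within  <- tab[["Mean Sq"]][2]

c(F = f.stat, by.hand = ms.between / ms.within, p = p.val)
```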

In [None]:
print(autos.body.style.aov)

In [None]:
options(repr.plot.width=4, repr.plot.height=6)
boxplot(log(autos.body.style$price) ~ autos.body.style$body.style)

* Sedan and Wagon appear very similar
* Greatest difference appears to be between hatchback and wagon

## Price by body style, Tukey HSD

In [None]:
autos.body.style.hsd = TukeyHSD(autos.body.style.aov)
autos.body.style.hsd

* P-values indicate a significant price (okay, log(price)) difference between sedans & hatchbacks.
* Wagons & Hatchbacks fall just short of the 5% p-value threshold.
* Wagon vs. Sedan body style is not a good price indicator.
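
The pairwise results can also be flagged programmatically instead of read off the printout. This is a sketch on simulated data (the `sim` data frame is hypothetical, not the auto data). `TukeyHSD` returns one matrix per factor with columns `diff`, `lwr`, `upr` and `p adj`.

```r
# Hypothetical data: three body styles with made-up mean log prices
set.seed(410)
group.means <- c(hatchback = 9.1, sedan = 9.4, wagon = 9.3)
sim <- data.frame(
  body.style = factor(rep(names(group.means), times = c(60, 90, 25)))
)
sim$log.price <- rnorm(nrow(sim),
                       mean = group.means[as.character(sim$body.style)],
                       sd = 0.45)

hsd <- TukeyHSD(aov(log.price ~ body.style, data = sim))
pairs <- hsd$body.style           # one row per pair of body styles

# A pair is significant at 5% exactly when its 95% interval excludes zero
significant      <- pairs[, "p adj"] < 0.05
ci.excludes.zero <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0

cbind(round(pairs, 4), significant, ci.excludes.zero)
```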

`las` parameter setting from [Stack Overflow](https://stackoverflow.com/questions/1828742/rotating-axis-labels-in-r)

mar parameter from [r-bloggers](https://www.r-bloggers.com/setting-graph-margins-in-r-using-the-par-function-and-lots-of-cow-milk/)

In [None]:
options(repr.plot.width=8, repr.plot.height=3)
par (las=2)
par(mar=c(5,10,3,1))
plot(autos.body.style.hsd)

Tukey HSD validates the "hunch" we got from ANOVA:
* There's a significant difference between sedans and hatchbacks.
* Wagon vs. Hatchback almost makes the cut, but not quite.
* Wagon vs. Sedan does not have a significant impact on price.