# Connect Intensive - Machine Learning Nanodegree
# Lesson 2: Predicting Automobile Gas Mileage 

## Objectives
  - Use [NumPy](https://docs.scipy.org/doc/numpy/reference/routines.statistics.html) to compute statistics
  - Split a dataset into training and testing sets using scikit-learn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) 
  - Build a [decision tree regressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) model using scikit-learn
  




## Background

The purpose of this project is to use a decision tree regressor to predict automobile gas mileage. The dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).


**Target variable**
 - mpg: The vehicle’s gas mileage, in miles per gallon

**Predictor variables**
 - displacement: The vehicle’s engine displacement
 - horsepower: The vehicle’s horsepower
 - weight: The vehicle’s weight





## Import the necessary libraries and dataset

In [46]:
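import numpy as np
import pandas as pd
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide ShuffleSplit in sklearn.model_selection.
from sklearn.cross_validation import ShuffleSplit

# Load the dataset and separate the target (mpg) from the predictors
car_data = pd.read_csv('car_mpg_data.csv')
mpg = car_data['mpg']
predictor_variables = car_data.drop('mpg', axis=1)

print("Rows: {} Columns: {} ".format(*car_data.shape))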

Rows: 392 Columns: 4 


## Use NumPy to learn more about the dataset by calculating basic statistics

In [47]:
# Use descriptive names so the built-ins min() and max() are not shadowed
min_mpg = np.min(mpg)
max_mpg = np.max(mpg)
mean_mpg = np.mean(mpg)
median_mpg = np.median(mpg)
std_mpg = np.std(mpg)

print("Minimum MPG: {:,.0f}".format(min_mpg))
print("Maximum MPG: {:,.0f}".format(max_mpg))
print("Mean MPG: {:,.0f}".format(mean_mpg))
print("Median MPG: {:,.0f}".format(median_mpg))
print("MPG standard deviation: {:,.0f}".format(std_mpg))

Minimum MPG: 9
Maximum MPG: 47
Mean MPG: 23
Median MPG: 23
MPG standard deviation: 8
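
As a small illustration of the NumPy calls used in this step, the sketch below computes the same statistics on a made-up array (the values are hypothetical, not taken from the dataset). It also shows the `ddof` argument of `np.std`: with a plain NumPy array the default `ddof=0` gives the population standard deviation, while `ddof=1` gives the sample standard deviation.

```python
import numpy as np

# Hypothetical mileage values, chosen only for illustration
sample_values = np.array([10.0, 20.0, 30.0, 40.0])

sample_mean = np.mean(sample_values)
sample_median = np.median(sample_values)
population_std = np.std(sample_values)        # divides by n (ddof=0)
sample_std = np.std(sample_values, ddof=1)    # divides by n - 1

print("Min: {:.1f}".format(np.min(sample_values)))
print("Max: {:.1f}".format(np.max(sample_values)))
print("Mean: {:.1f}".format(sample_mean))
print("Median: {:.1f}".format(sample_median))
print("Population std: {:.2f}".format(population_std))
print("Sample std: {:.2f}".format(sample_std))
```

The sample standard deviation is always at least as large as the population value, and the gap shrinks as the number of observations grows.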


## Import the r2_score function for evaluating the model


In [48]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Return the coefficient of determination (R^2) between the true and predicted values."""
    score = r2_score(y_true, y_predict)
    return score
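
To make the metric concrete, here is a minimal sketch of the R^2 formula, 1 - SS_res / SS_tot, in plain Python. The function name `r2_from_scratch` and the sample values are hypothetical and exist only for illustration; the notebook itself relies on scikit-learn's `r2_score`. The sketch does not handle the edge case where all true values are identical (SS_tot would be zero).

```python
def r2_from_scratch(y_true, y_predict):
    """Compute R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / float(len(y_true))
    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predict))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up mileage values for illustration
y_true = [18.0, 15.0, 20.0, 27.0]

perfect = r2_from_scratch(y_true, y_true)                    # predictions match exactly
baseline = r2_from_scratch(y_true, [20.0] * 4)               # always predict the mean
worse = r2_from_scratch(y_true, [27.0, 20.0, 15.0, 18.0])    # worse than the mean

print("perfect: {}, baseline: {}, worse: {:.3f}".format(perfect, baseline, worse))
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0.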

## Import the train_test_split function for dividing the dataset into training and testing sets

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(predictor_variables, mpg, test_size=0.25, random_state=42)

print("X_train: rows:  {},  columns: {}".format(*X_train.shape))
print("X_test:  rows:  {},  columns: {}".format(*X_test.shape))
print("y_train: rows:  {},  columns: 1".format(*y_train.shape))
print("y_test:  rows:  {},  columns: 1".format(*y_test.shape))

X_train: rows:  294,  columns: 3
X_test:  rows:  98,  columns: 3
y_train: rows:  294,  columns: 1
y_test:  rows:  98,  columns: 1
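
The row counts above follow directly from `test_size=0.25` applied to 392 rows. The sketch below imitates the idea of a shuffled split in plain Python. The helper `simple_train_test_split` is hypothetical and is not scikit-learn's implementation; it only shows how a 25% test fraction yields 98 test rows and 294 training rows.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.25, seed=42):
    """Shuffle the row indices, then carve off a test fraction (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # This sketch rounds the test count up to a whole number of rows
    n_test = int(math.ceil(test_size * len(rows)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(392))  # stand-in for the 392 rows of the dataset
train_rows, test_rows = simple_train_test_split(rows)

print("train: {}, test: {}".format(len(train_rows), len(test_rows)))
```

Fixing the seed plays the same role as `random_state=42` in the notebook: the shuffle, and therefore the split, is repeatable from run to run.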


## Use the grid search technique to train the decision tree algorithm 

In [50]:
# Import 'make_scorer', 'DecisionTreeRegressor', 'GridSearchCV', and 'ShuffleSplit'
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

def fit_model(X, y):
    """Grid-search max_depth for a decision tree regressor and return the best model."""
    # Five random 75/25 splits of the data passed in, used for cross-validation
    cross_validation_sets = ShuffleSplit(n_splits=5, test_size=0.25, random_state=1)

    reg_model = DecisionTreeRegressor()

    # Candidate tree depths: 1 through 10
    max_depth = {'max_depth': list(range(1, 11))}

    # greater_is_better=True tells the grid search to prefer the highest R^2
    scorer = make_scorer(performance_metric, greater_is_better=True)

    grid = GridSearchCV(reg_model, param_grid=max_depth, scoring=scorer, cv=cross_validation_sets)
    grid = grid.fit(X, y)

    return grid.best_estimator_
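
`fit_model` returns only the best estimator, so the chosen depth and the per-depth scores are not visible. The sketch below shows one way to inspect them, assuming a recent scikit-learn release where `ShuffleSplit` lives in `sklearn.model_selection`. It runs on synthetic data (not the car dataset) and uses the built-in `'r2'` scoring string in place of a custom scorer; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only so the example is self-contained
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=1)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': list(range(1, 11))},
    scoring='r2',
    cv=cv,
)
grid.fit(X, y)

# Depth with the highest mean cross-validated R^2, and that score
print("Best max_depth: {}".format(grid.best_params_['max_depth']))
print("Best mean CV R^2: {:.3f}".format(grid.best_score_))

# Mean validation score for every candidate depth
for depth, score in zip(range(1, 11), grid.cv_results_['mean_test_score']):
    print("max_depth={:2d}  mean R^2={:.3f}".format(depth, score))
```

Looking at the full score table can reveal whether the best depth is a clear winner or only marginally ahead of shallower, simpler trees.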

## Fit the model to the training data

In [51]:
reg = fit_model(X_train, y_train)

In [52]:
print(car_data.head())
len(car_data.index)

    mpg  displacement  horsepower  weight
0  18.0         307.0         130    3504
1  15.0         350.0         165    3693
2  18.0         318.0         150    3436
3  16.0         304.0         150    3433
4  17.0         302.0         140    3449


392

## Make predictions using the model

In [53]:
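# Example specs as [displacement, horsepower, weight]; defined here but not used below
car_specs = [[302, 140, 3450], [350, 165, 3690]]

# Predict once on the test set; the loop variable no longer shadows the 'mpg' Series
pred = reg.predict(X_test)
for i, predicted_mpg in enumerate(pred):
    print("Car {}'s predicted MPG: {:,.0f}".format(i + 1, predicted_mpg))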

Car 1's predicted MPG: 30
Car 2's predicted MPG: 25
Car 3's predicted MPG: 34
Car 4's predicted MPG: 34
Car 5's predicted MPG: 25
Car 6's predicted MPG: 30
Car 7's predicted MPG: 13
Car 8's predicted MPG: 30
Car 9's predicted MPG: 19
Car 10's predicted MPG: 34
Car 11's predicted MPG: 15
Car 12's predicted MPG: 25
Car 13's predicted MPG: 13
Car 14's predicted MPG: 30
Car 15's predicted MPG: 19
Car 16's predicted MPG: 25
Car 17's predicted MPG: 19
Car 18's predicted MPG: 30
Car 19's predicted MPG: 30
Car 20's predicted MPG: 25
Car 21's predicted MPG: 25
Car 22's predicted MPG: 34
Car 23's predicted MPG: 34
Car 24's predicted MPG: 15
Car 25's predicted MPG: 34
Car 26's predicted MPG: 25
Car 27's predicted MPG: 25
Car 28's predicted MPG: 19
Car 29's predicted MPG: 34
Car 30's predicted MPG: 30
Car 31's predicted MPG: 13
Car 32's predicted MPG: 19
Car 33's predicted MPG: 25
Car 34's predicted MPG: 30
Car 35's predicted MPG: 13
Car 36's predicted MPG: 34
Car 37's predicted MPG: 13
Car 38's p

In [54]:
# For a regressor, score() returns R^2 on the given data, not classification accuracy
r2 = reg.score(X_test, y_test)

In [55]:
print(r2)

0.639407328253
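
The number above is an R^2 value: for scikit-learn regressors, `score(X, y)` is documented to return the coefficient of determination of the predictions. The sketch below checks that equivalence on synthetic data (not the car dataset); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship, used only for illustration
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

# Train on the first 75 rows, evaluate on the remaining 25
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[:75], y[:75])

score_from_method = model.score(X[75:], y[75:])
score_from_metric = r2_score(y[75:], model.predict(X[75:]))

print("model.score: {:.6f}".format(score_from_method))
print("r2_score:    {:.6f}".format(score_from_metric))
```

Because the two values agree, the test-set result can be read with the usual R^2 interpretation: the fraction of variance in the target that the model explains on unseen data.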
