
_Lambda School Data Science_

# Regression Sprint Challenge

For this Sprint Challenge, you'll predict the price of used cars. 

The dataset is real-world data, collected from advertisements of cars for sale in Ukraine in 2016.

The following import statements are provided and should be sufficient. You may not need every import, and you are permitted to add others.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

[The dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/car_regression.csv) contains 8,495 rows and 9 variables:

- make: manufacturer brand
- price: seller’s price in advertisement (in USD)
- body: car body type
- mileage: as mentioned in advertisement (‘000 Km)
- engV: rounded engine volume (‘000 cubic cm)
- engType: type of fuel
- registration: whether car registered in Ukraine or not
- year: year of production
- drive: drive type

Run this cell to read the data:

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/car_regression.csv')
print(df.shape)
df.sample(10)

(8495, 9)


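| | make | price | body | mileage | engV | engType | registration | year | drive |
|---|---|---|---|---|---|---|---|---|---|
| 2674 | 17 | 3400.0 | 3 | 133 | 1.3 | 1 | 1 | 2005 | 0 |
| 3974 | 77 | 10200.0 | 3 | 129 | 2.0 | 3 | 1 | 2006 | 0 |
| 1412 | 77 | 3650.0 | 3 | 197 | 2.0 | 3 | 1 | 1996 | 0 |
| 6203 | 39 | 22500.0 | 0 | 70 | 2.0 | 0 | 1 | 2013 | 1 |
| 93 | 74 | 73900.0 | 0 | 1 | 4.5 | 0 | 1 | 2016 | 1 |
| 3020 | 67 | 5999.0 | 1 | 98 | 1.2 | 1 | 1 | 2007 | 0 |
| 4926 | 59 | 11900.0 | 5 | 143 | 2.0 | 0 | 1 | 2009 | 0 |
| 4234 | 26 | 4100.0 | 3 | 108 | 1.5 | 1 | 1 | 2011 | 0 |
| 4037 | 22 | 8300.0 | 5 | 149 | 1.6 | 0 | 1 | 2012 | 0 |
| 2742 | 17 | 4300.0 | 3 | 80 | 1.3 | 1 | 1 | 2012 | 0 |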


In [3]:
df.isnull().sum()

make            0
price           0
body            0
mileage         0
engV            0
engType         0
registration    0
year            0
drive           0
dtype: int64

# Predictive Modeling with Linear Regression

## 1.1 Split the data into an X matrix and y vector (`price` is the target we want to predict).

In [0]:
features = ['make', 
            'body', 'mileage', 'engV', 'engType', 'registration', 'year', 'drive']

target = 'price'

X = df[features]
y = df[target]

## 1.2 Split the data into test and train sets, using `train_test_split`.
You may use a train size of 80% and a test size of 20%.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)


## 1.3 Use scikit-learn to fit a multiple regression model, using your training data.
Use `year` and one or more features of your choice. You will not be evaluated on which features you choose. You may choose to use all features.

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

## 1.4 Report the Intercept and Coefficients for the fitted model.

In [7]:
print(model.intercept_)

-2269355.0772314165


In [8]:
print(model.coef_)

[  -35.16726588 -1770.98509064   -40.26859658   273.03540784
 -1111.08031708  4535.06013378  1140.73124767  8292.04613874]


## 1.5 Use the test data to make predictions.

In [0]:
y_pred = model.predict(X_test)

## 1.6 Use the test data to get both the Root Mean Square Error and $R^2$ for the model. 
You will not be evaluated on how high or low your scores are.

In [10]:
# Compare predictions to test target
rmse = (np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print('Root Mean Squared Error', rmse)
print('R^2 Score', r2)

Root Mean Squared Error 21394.43524600266
R^2 Score 0.29213322373743256


## 1.7 How should we interpret the coefficient corresponding to the `year` feature?
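The coefficient on `year` is approximately 1140.7: holding the other features constant, a car that is one model year newer is predicted to cost about $1,141 more. This is an association estimated by the model, not evidence of a causal relationship between `year` and `price`.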

In [11]:
pd.Series(model.coef_, features)

make             -35.167266
body           -1770.985091
mileage          -40.268597
engV             273.035408
engType        -1111.080317
registration    4535.060134
year            1140.731248
drive           8292.046139
dtype: float64
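
The "holding other features constant" reading of a coefficient can be checked directly. The following is a minimal sketch on synthetic stand-in data (not the car dataset); the names `X_demo`, `demo_model`, `older`, and `newer` are hypothetical. It predicts for two rows that differ only by one year and compares the gap with the fitted coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the car dataset):
# price rises with year and falls with mileage.
rng = np.random.default_rng(0)
year = rng.integers(1990, 2017, size=500).astype(float)
mileage = rng.uniform(0, 300, size=500)
price = 1000 * (year - 1990) - 40 * mileage + rng.normal(0, 500, size=500)

X_demo = np.column_stack([year, mileage])
demo_model = LinearRegression().fit(X_demo, price)

# Two hypothetical cars that are identical except one is a year newer.
older = np.array([[2005.0, 120.0]])
newer = np.array([[2006.0, 120.0]])
gap = demo_model.predict(newer)[0] - demo_model.predict(older)[0]

# The predicted gap matches the fitted `year` coefficient.
print(gap, demo_model.coef_[0])
```

In a linear model this equality holds by construction, which is why the coefficient can be read as the predicted change per one-unit increase.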

## 1.8 How should we interpret the Root Mean Square Error?
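The RMSE is about 21,394, in the same units as `price` (USD), so the model's predictions are typically off by roughly $21,000. That is very high, suggesting the model predicts price inaccurately with the current features.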
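
One way to judge whether an RMSE is "high" is to compare it with a naive baseline that always predicts the mean. This is a hedged sketch on synthetic values; `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`, and for brevity the baseline uses the mean of the same values rather than the training-set mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Synthetic, right-skewed "prices" and hypothetical predictions (assumptions).
rng = np.random.default_rng(1)
y_true = rng.lognormal(mean=9.0, sigma=1.0, size=1000)
y_hat = y_true + rng.normal(0, 5000, size=1000)

model_rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# Baseline: predict the mean for every row.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_true, baseline_pred))

print('Model RMSE   ', model_rmse)
print('Baseline RMSE', baseline_rmse)
```

A model RMSE close to the baseline RMSE means the features add little beyond knowing the average price.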

## 1.9 How should we interpret the $R^2$?
One sentence can be sufficient

The $R^2$ of about 0.29 means the model explains only about 29% of the variance in test-set prices. That is low, especially given that all eight features were used, and it indicates the model performs poorly.
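
The "share of variance explained" reading follows from the definition $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below, on synthetic values (`y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`), computes it by hand and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic targets and hypothetical predictions (assumptions).
rng = np.random.default_rng(2)
y_true = rng.normal(10000, 8000, size=1000)
y_hat = y_true + rng.normal(0, 6000, size=1000)

# Residual sum of squares: error left over after using the model.
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: error from always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```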

# Log-Linear and Polynomial Regression

## 2.1 Engineer a new variable by taking the log of the price variable.

In [0]:
df['log_price'] = np.log(df['price'])

## 2.2 Visualize scatterplots of the relationship between each feature versus the log of price, to look for non-linearly distributed features.
You may use any plotting tools and techniques.

In [0]:
# Plot each predictor in `features` against the log of price.
log_price = 'log_price'
for feature in features:
    sns.lmplot(x=feature, y=log_price, data=df, scatter_kws=dict(alpha=0.1))
    plt.show()

## 2.3 Create polynomial feature(s)
You will not be evaluated on which feature(s) you choose. But try to choose appropriate features.

In [0]:
# Visualizing without the distraction of log_price
# A simple modification of the above code-cell

graph_target = 'price'
# Drop both the raw target and its log so that only predictors are plotted.
features_graph = df.columns.drop([graph_target, 'log_price'])
for feature in features_graph:
    sns.lmplot(x=feature, y=graph_target, data=df, scatter_kws=dict(alpha=0.1))
    plt.show()

The `year` plot shows a clear curve, which makes it a good candidate for a squared term.

In [0]:
df['year_squared'] = df['year']**2
for feature in ['year', 'year_squared']:
    sns.scatterplot(x=feature, y=target, data=df, alpha=0.1)
    plt.show()

I also want to create a polynomial feature from `mileage`.

In [0]:
df['mileage_squared'] = df['mileage']**2
for feature in ['mileage', 'mileage_squared']:
    sns.scatterplot(x=feature, y=target, data=df, alpha=0.1)
    plt.show()

## 2.4 Use the new log-transformed y variable and your x variables (including any new polynomial features) to fit a new linear regression model. Then report the: intercept, coefficients, RMSE, and $R^2$.

In [17]:
df.head()

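| | make | price | body | mileage | engV | engType | registration | year | drive | log_price | year_squared | mileage_squared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 15500.0 | 0 | 68 | 2.5 | 1 | 1 | 2010 | 1 | 9.648595 | 4040100 | 4624 |
| 1 | 50 | 20500.0 | 3 | 173 | 1.8 | 1 | 1 | 2011 | 2 | 9.92818 | 4044121 | 29929 |
| 2 | 50 | 35000.0 | 2 | 135 | 5.5 | 3 | 1 | 2008 | 2 | 10.463103 | 4032064 | 18225 |
| 3 | 50 | 17800.0 | 5 | 162 | 1.8 | 0 | 1 | 2012 | 0 | 9.786954 | 4048144 | 26244 |
| 4 | 55 | 16600.0 | 0 | 83 | 2.0 | 3 | 1 | 2013 | 1 | 9.717158 | 4052169 | 6889 |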


In [0]:
features = ['make', 
            'body', 'mileage', 'engV', 'engType', 'registration', 'year', 'drive', 'year_squared', 'mileage_squared']

target = 'log_price'

X = df[features]
y = df[target]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)

In [21]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [22]:
print(model.intercept_)
print(model.coef_)

6082.256345344236
[-1.68957134e-03 -9.42269254e-02  7.78177052e-04  8.22070681e-03
 -4.80904423e-02  6.71842604e-01 -6.17044214e+00  3.74004369e-01
  1.56661127e-03 -2.06653079e-07]


In [0]:
y_pred = model.predict(X_test)

In [24]:
# Compare predictions to test target
rmse = (np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print('Root Mean Squared Error', rmse)
print('R^2 Score', r2)

Root Mean Squared Error 0.562808450807755
R^2 Score 0.6694853735895996



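Needs improvement, but far better than our previous attempt: $R^2$ rose from about 0.29 to about 0.67. Note that the RMSE is now in log-price units, so it is not directly comparable with the earlier RMSE in dollars.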
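
The RMSE of about 0.56 above is measured on the log scale. To get an error in dollars, one option is to exponentiate both the predictions and the true values before scoring. This is a minimal sketch on synthetic stand-in data; `X_demo`, `log_model`, and the other names are hypothetical, not the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): log price is linear in year.
rng = np.random.default_rng(3)
year = rng.integers(1990, 2017, size=1000).astype(float)
log_price = 7.0 + 0.1 * (year - 1990) + rng.normal(0, 0.5, size=1000)

X_demo = year.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, log_price, test_size=0.2, random_state=42)
log_model = LinearRegression().fit(X_tr, y_tr)
log_pred = log_model.predict(X_te)

# RMSE in log units, as reported by the log-linear model.
rmse_log = np.sqrt(mean_squared_error(y_te, log_pred))
# RMSE in original units, after undoing the log with np.exp.
rmse_dollars = np.sqrt(mean_squared_error(np.exp(y_te), np.exp(log_pred)))

print('RMSE (log units)', rmse_log)
print('RMSE (dollars)  ', rmse_dollars)
```

Only the dollar-scale figure can be set side by side with the RMSE of a model fitted on the raw price.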

In [25]:
pd.Series(model.coef_, features)

make              -1.689571e-03
body              -9.422693e-02
mileage            7.781771e-04
engV               8.220707e-03
engType           -4.809044e-02
registration       6.718426e-01
year              -6.170442e+00
drive              3.740044e-01
year_squared       1.566611e-03
mileage_squared   -2.066531e-07
dtype: float64

## 2.5 How do we interpret coefficients in Log-Linear Regression (differently than Ordinary Least Squares Regression)?
One sentence can be sufficient

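We interpret the coefficients as approximate percentage changes in price: a one-unit increase in a feature changes the predicted price by roughly 100 × coefficient percent (exactly $e^{\beta} - 1$ as a proportion), rather than by a fixed dollar amount as in OLS on the untransformed price.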
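
As a small worked example, the snippet below converts two coefficients from the fitted log-linear model's output above into percentage changes using $e^{\beta} - 1$. It assumes `registration` is a 0/1 indicator, which the sample rows suggest but the data description does not state outright.

```python
import numpy as np

# Coefficients copied from the log-linear model output above.
coef_registration = 0.6718426
coef_engV = 0.008220707

# Exact percentage change in predicted price for a one-unit increase.
pct_registration = (np.exp(coef_registration) - 1) * 100
pct_engV = (np.exp(coef_engV) - 1) * 100

print(f'registration: {pct_registration:.1f}%')
print(f'engV: {pct_engV:.2f}%')
```

For small coefficients such as `engV`, 100 × coefficient is a close approximation. For large ones such as `registration`, the exact formula gives a noticeably bigger number (roughly 96% rather than 67%).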

# Decision Trees

## 3.1 Use scikit-learn to fit a decision tree regression model, using your training data.
Use one or more features of your choice. You will not be evaluated on which features you choose. You may choose to use all features.

You may use the log-transformed target or the original un-transformed target. You will not be evaluated on which you choose.

In [0]:
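%matplotlib inline
import graphviz
import matplotlib.pyplot as plt
from IPython.display import display
from ipywidgets import interact
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# NOTE: `putts`, `putts_X`, and `putts_y` (the golf-putts example data) are not
# defined in this notebook and must be loaded before this cell will run.

def viztree(decision_tree, feature_names):
    """Render a fitted tree as a Graphviz diagram."""
    dot_data = export_graphviz(decision_tree, out_file=None, feature_names=feature_names,
                               filled=True, rounded=True)
    return graphviz.Source(dot_data)

def putts_tree(max_depth=1):
    """Fit a tree of the given depth, report its R^2, and plot its step predictions."""
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(putts_X, putts_y)
    print('R^2 Score', tree.score(putts_X, putts_y))
    ax = putts.plot('distance', 'rate of success', kind='scatter', title='Golf Putts')
    ax.step(putts_X, tree.predict(putts_X), where='mid')
    plt.show()
    display(viztree(tree, feature_names=['distance']))

# Slider for max_depth from 1 to 6 in steps of 1.
interact(putts_tree, max_depth=(1,6,1));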

## 3.2 Use the test data to get the $R^2$ for the model. 
You will not be evaluated on how high or low your scores are.
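
A minimal sketch of fitting a decision tree regressor and scoring it on held-out data is shown below. It uses synthetic stand-in data so that it runs on its own; in this notebook the same calls would presumably take `X_train`, `y_train`, `X_test`, and `y_test`. The `max_depth=4` setting is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption): a step in one feature plus a
# linear trend in another, with a little noise.
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 10, size=(800, 2))
y_demo = (np.where(X_demo[:, 0] > 5, 10.0, 2.0)
          + 0.5 * X_demo[:, 1]
          + rng.normal(0, 0.3, size=800))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_tr, y_tr)

# .score() returns R^2; comparing train and test hints at overfitting.
train_r2 = tree.score(X_tr, y_tr)
test_r2 = tree.score(X_te, y_te)
print('Train R^2', train_r2)
print('Test R^2 ', test_r2)
```

Leaving `max_depth` unset lets the tree grow until it fits the training data almost perfectly, so the test score is the one to report.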

# Regression Diagnostics

## 4.1 Use statsmodels to run a log-linear or log-polynomial linear regression with robust standard errors.

## 4.2 Calculate the Variance Inflation Factor (VIF) of our X variables. 

### Do we have multicollinearity problems?
One sentence can be sufficient