
_Lambda School Data Science_

# Regression Sprint Challenge

For this Sprint Challenge, you'll predict the price of used cars. 

The dataset is real-world. It was collected from advertisements of cars for sale in Ukraine in 2016.

The following import statements have been provided for you and should be sufficient. You may not need every import, and you are permitted to make additional imports.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

[The dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/car_regression.csv) contains 8,495 rows and 9 variables:

- make: manufacturer brand
- price: seller’s price in advertisement (in USD)
- body: car body type
- mileage: as mentioned in advertisement (thousands of km)
- engV: rounded engine volume (thousands of cubic cm)
- engType: type of fuel
- registration: whether car registered in Ukraine or not
- year: year of production
- drive: drive type

Run this cell to read the data:

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/car_regression.csv')
print(df.shape)
df.sample(10)

(8495, 9)


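| | make | price | body | mileage | engV | engType | registration | year | drive |
|---|---|---|---|---|---|---|---|---|---|
| 466 | 55 | 12500.0 | 0 | 107 | 2.0 | 1 | 1 | 2008 | 1 |
| 4352 | 33 | 28500.0 | 0 | 0 | 2.0 | 3 | 1 | 2016 | 1 |
| 6798 | 50 | 30500.0 | 3 | 89 | 6.3 | 3 | 1 | 2008 | 2 |
| 1270 | 4 | 5000.0 | 3 | 280 | 2.3 | 1 | 1 | 1992 | 0 |
| 6201 | 24 | 2200.0 | 2 | 32 | 2.3 | 1 | 1 | 2002 | 2 |
| 5932 | 74 | 30600.0 | 0 | 39 | 2.2 | 0 | 1 | 2014 | 1 |
| 876 | 4 | 3550.0 | 3 | 196 | 2.8 | 3 | 0 | 1997 | 0 |
| 1096 | 67 | 12800.0 | 3 | 135 | 1.8 | 3 | 1 | 2011 | 0 |
| 372 | 30 | 13700.0 | 3 | 120 | 3.5 | 3 | 1 | 2008 | 0 |
| 1842 | 57 | 11200.0 | 3 | 59 | 1.6 | 3 | 1 | 2013 | 0 |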


# Predictive Modeling with Linear Regression

## 1.1 Split the data into an X matrix and y vector (`price` is the target we want to predict).

In [0]:
# X holds every column except the target; dropping a column keeps X two-dimensional (a feature matrix)
# y is the target column, price, as a one-dimensional Series
X = df.drop(['price'], axis=1)
y = df['price']

## 1.2 Split the data into test and train sets, using `train_test_split`.
You may use a train size of 80% and a test size of 20%.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, test_size=.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(6796, 8) (1699, 8) (6796,) (1699,)
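The split above is random, so the shapes repeat on every run but the selected rows, and every score computed from them, do not. A minimal sketch of a repeatable split, assuming a fixed `random_state` (the value 42 is arbitrary) and a small synthetic frame named `demo` standing in for the car data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": rng.integers(1990, 2017, size=100),
    "mileage": rng.integers(0, 300, size=100),
    "price": rng.uniform(1000, 30000, size=100),
})
X_demo = demo.drop(columns=["price"])
y_demo = demo["price"]

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b = split_b[0]
print(X_train_a.shape, X_test_a.shape)
print(X_train_a.index.equals(X_train_b.index))
```

Both calls select the same rows, so metrics computed downstream can be compared across runs.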


## 1.3 Use scikit-learn to fit a multiple regression model, using your training data.
Use `year` and one or more features of your choice. You will not be evaluated on which features you choose. You may choose to use all features.

In [5]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

## 1.4 Report the Intercept and Coefficients for the fitted model.

In [9]:
print("Intercept: ", model.intercept_)
print("Coefficients: ", model.coef_)

Intercept:  -2343224.8014021353
Coefficients:  [  -37.81961472 -1694.80229329   -40.84415778   319.47420551
 -1031.60528786  5121.42582865  1177.19788217  8663.80658434]
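The raw coefficient array is easier to read when each value is paired with its feature name. The sketch below shows one way to do that; it uses synthetic data with known coefficients (an assumption made so the result can be checked), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (hypothetical, not the car data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "year": rng.integers(1990, 2017, size=200).astype(float),
    "mileage": rng.uniform(0, 300, size=200),
})
y_demo = 1000.0 * X_demo["year"] - 40.0 * X_demo["mileage"] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order passed to fit, so the column index labels it.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
```

Applied to the fitted model in this notebook, the equivalent call would be `pd.Series(model.coef_, index=X_train.columns)`.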


## 1.5 Use the test data to make predictions.

In [11]:
y_pred = model.predict(X_test)
y_pred

array([24240.07367707, 34948.89499335, 16685.01418327, ...,
       25957.6371672 , 20370.96536107,  6988.22958964])

## 1.6 Use the test data to get both the Root Mean Square Error and $R^2$ for the model. 
You will not be evaluated on how high or low your scores are.

In [14]:
print("Root mean square error: ")
np.sqrt(mean_squared_error(y_test, y_pred))

Root mean square error: 


17934.633576419805

In [15]:
print("r2 score: ")
r2_score(y_true=y_test, y_pred=y_pred)

r2 score: 


0.3287363416962159

## 1.7 How should we interpret the coefficient corresponding to the `year` feature?
One sentence can be sufficient

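Holding the other features constant, each additional model year (a newer car) is associated with an increase of about 1,177 USD in predicted price; the `year` coefficient is the seventh value in the array above, 1177.20, following the column order of `X`.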

## 1.8 How should we interpret the Root Mean Square Error?
One sentence can be sufficient

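The Root Mean Square Error is the typical size of a prediction error, in the same units as the target: on the test set, predicted prices miss the actual prices by roughly 17,935 USD, with large errors weighted more heavily than small ones.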
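To make the units concrete, here is the RMSE arithmetic worked by hand on four hypothetical prices (the numbers are invented for illustration and are not from the car data):

```python
import numpy as np

# Hypothetical prices in USD, chosen so the arithmetic is easy to follow.
y_true_demo = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_pred_demo = np.array([12000.0, 18000.0, 33000.0, 37000.0])

errors = y_true_demo - y_pred_demo   # -2000, 2000, -3000, 3000
mse = np.mean(errors ** 2)           # (4e6 + 4e6 + 9e6 + 9e6) / 4 = 6.5e6
rmse = np.sqrt(mse)                  # square root returns the result to USD
print(round(rmse, 2))  # → 2549.51
```

The result sits between the smallest and largest absolute error (2,000 and 3,000 USD) and leans toward the larger ones because errors are squared before averaging.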

## 1.9 How should we interpret the $R^2$?
One sentence can be sufficient

$R^2$ is the proportion of the variance in the dependent variable that the model explains, not an accuracy score; here the model explains about 33% of the variance in price on the test set.
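The definition can be checked by hand: $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations from the mean. A small sketch on hypothetical values (not the car data):

```python
import numpy as np

# Hypothetical actual and predicted prices in USD.
actual = np.array([10000.0, 20000.0, 30000.0, 40000.0])
predicted = np.array([12000.0, 18000.0, 33000.0, 37000.0])

ss_res = np.sum((actual - predicted) ** 2)      # 26e6: variation the model misses
ss_tot = np.sum((actual - actual.mean()) ** 2)  # 500e6: total variation around the mean
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))  # → 0.948
```

Because the formula compares the model against always predicting the mean, $R^2$ on test data can fall below zero when the model does worse than that baseline.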

# Log-Linear and Polynomial Regression

## 2.1 Engineer a new variable by taking the log of the price variable.

In [0]:
df['price_log'] = np.log(df['price'])
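One thing worth checking after this step: `np.log` returns `-inf` for a price of zero, which would break a later model fit. Whether the car data contains zero prices is not established here, so the sketch below uses a hypothetical series to show how such rows could be detected and filtered.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including one zero, to show the failure mode.
prices = pd.Series([0.0, 1500.0, 20000.0])

# Suppress the divide-by-zero warning so the -inf result can be inspected.
with np.errstate(divide="ignore"):
    logged = np.log(prices)

n_infinite = int(np.isinf(logged).sum())
print(n_infinite)  # → 1

# One option (an assumption, not the notebook's method): keep only positive prices.
positive_logged = np.log(prices[prices > 0])
print(len(positive_logged))  # → 2
```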

In [18]:
df.head()

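| | make | price | body | mileage | engV | engType | registration | year | drive | price_log |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 15500.0 | 0 | 68 | 2.5 | 1 | 1 | 2010 | 1 | 9.648595 |
| 1 | 50 | 20500.0 | 3 | 173 | 1.8 | 1 | 1 | 2011 | 2 | 9.92818 |
| 2 | 50 | 35000.0 | 2 | 135 | 5.5 | 3 | 1 | 2008 | 2 | 10.463103 |
| 3 | 50 | 17800.0 | 5 | 162 | 1.8 | 0 | 1 | 2012 | 0 | 9.786954 |
| 4 | 55 | 16600.0 | 0 | 83 | 2.0 | 3 | 1 | 2013 | 1 | 9.717158 |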


## 2.2 Visualize scatterplots of each feature versus the log of price, to look for features with a non-linear relationship to the target.
You may use any plotting tools and techniques.

In [22]:
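target = 'price_log'
# Plot each numeric feature against the log of price.
numeric_columns = df.select_dtypes(include='number').columns
# Drop the target and the raw price from the column labels with Index.drop;
# df.drop(label) without axis=1 searches the row index and raises a KeyError.
for feature in numeric_columns.drop([target, 'price']):
    sns.scatterplot(x=feature, y=target, data=df, alpha=0.1)
    plt.show()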

## 2.3 Create polynomial feature(s)
You will not be evaluated on which feature(s) you choose. But try to choose appropriate features.
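One possible approach is to square a feature whose scatterplot against log price looks curved. The sketch below runs on a small hypothetical frame with the same column names as the car data; the choice of `year` and `mileage` is an assumption, not a conclusion drawn from the plots.

```python
import pandas as pd

# Hypothetical rows with the same column names as the car data.
demo = pd.DataFrame({
    "year": [1995, 2005, 2015],
    "mileage": [250, 120, 10],
})

# Squared terms let a linear model fit a curved relationship.
demo["year_squared"] = demo["year"] ** 2
demo["mileage_squared"] = demo["mileage"] ** 2
print(demo)
```

On the real data the same two assignments would be applied to `df` before re-splitting into train and test sets.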

## 2.4 Use the new log-transformed y variable and your x variables (including any new polynomial features) to fit a new linear regression model. Then report the: intercept, coefficients, RMSE, and $R^2$.
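A minimal sketch of this step, under stated assumptions: the data is synthetic, the feature `age` (imagined as 2016 minus `year`) and its square are hypothetical, and the true relationship is built in so the fitted values can be sanity-checked. It is not a result for the car data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: log price depends on age and age squared, plus noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.uniform(0, 25, size=500)})
demo["age_squared"] = demo["age"] ** 2
demo["price_log"] = (
    10.0 - 0.10 * demo["age"] + 0.001 * demo["age_squared"]
    + rng.normal(0, 0.1, size=500)
)

features = ["age", "age_squared"]
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[features], demo["price_log"], test_size=0.2, random_state=0
)

log_model = LinearRegression().fit(X_tr, y_tr)
log_pred = log_model.predict(X_te)

# Both metrics are on the log scale because the target is log(price).
rmse_log = np.sqrt(mean_squared_error(y_te, log_pred))
r2_log = r2_score(y_te, log_pred)

print("Intercept:", log_model.intercept_)
print("Coefficients:", dict(zip(features, log_model.coef_)))
print("RMSE (log scale):", rmse_log)
print("R^2:", r2_log)
```

Note that an RMSE on the log scale is not in USD and cannot be compared directly with the RMSE from section 1.6.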

## 2.5 How do we interpret coefficients in Log-Linear Regression (differently than Ordinary Least Squares Regression)?
One sentence can be sufficient
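In a log-linear model the target is log(price), so a coefficient $b$ describes a multiplicative effect: a one-unit increase in the feature multiplies the predicted price by $e^b$, which is close to a $100 \cdot b$ percent change when $b$ is small. A quick numeric check, using a hypothetical coefficient of 0.05:

```python
import numpy as np

b = 0.05  # hypothetical coefficient on a feature in a model of log(price)

# Exact percentage change in predicted price per one-unit increase in the feature.
pct_change = (np.exp(b) - 1.0) * 100.0
print(round(pct_change, 2))  # → 5.13
```

By contrast, an ordinary least squares coefficient on untransformed price is read as a fixed change in USD per unit of the feature.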

# Decision Trees

## 3.1 Use scikit-learn to fit a decision tree regression model, using your training data.
Use one or more features of your choice. You will not be evaluated on which features you choose. You may choose to use all features.

You may use the log-transformed target or the original un-transformed target. You will not be evaluated on which you choose.

## 3.2 Use the test data to get the $R^2$ for the model. 
You will not be evaluated on how high or low your scores are.
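Steps 3.1 and 3.2 could look like the sketch below. It runs on synthetic data with a step-shaped target that a tree captures well; the frame `demo`, the chosen features, and `max_depth=5` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car data (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": rng.integers(1990, 2017, size=500),
    "mileage": rng.uniform(0, 300, size=500),
})
# Price jumps for newer cars and falls with mileage.
demo["price"] = np.where(demo["year"] >= 2008, 20000.0, 5000.0) - 10.0 * demo["mileage"]

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["year", "mileage"]], demo["price"], test_size=0.2, random_state=0
)

# max_depth caps the tree size to limit overfitting; 5 is an arbitrary choice.
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_tr, y_tr)

tree_r2 = r2_score(y_te, tree.predict(X_te))
print("Tree R^2:", tree_r2)
```

`tree.score(X_te, y_te)` returns the same $R^2$ value, since `score` on a scikit-learn regressor is defined as $R^2$.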

# Regression Diagnostics

## 4.1 Use statsmodels to run a log-linear or log-polynomial linear regression with robust standard errors.

## 4.2 Calculate the Variance Inflation Factor (VIF) of our X variables. 

### Do we have multicollinearity problems?
One sentence can be sufficient