# ML Community Workshop - Linear Regression with scikit-learn

## Goals
* Demo and explain linear regression
* Show why encoding features is important

## Agenda
1. Read and visualize data
2. Review Linear Regression
3. Evaluate Linear Regression without encoding
4. Evaluate Linear Regression with encoding

## Part 1 - Read and Visualize the Data

### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

### Read data into a new dataframe

#### Read into DataFrame

In [None]:
housing_data = pd.read_csv('home_data.csv')

#### Check out the data

In [None]:
# Show first few rows with headings
housing_data.head()

In [None]:
# List column names
print(list(housing_data.columns.values))

### Plot _price_ vs. *sqft_living*: is there a correlation?

In [None]:
housing_data.plot(kind='scatter', x='sqft_living', y='price', xlim=(0,16000))
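
The scatter plot answers the correlation question visually; a Pearson coefficient puts a number on it. The snippet below is a minimal sketch on made-up data (the `demo` frame and its values are assumptions, not rows from `home_data.csv`); on the real data the same call would be `housing_data['sqft_living'].corr(housing_data['price'])`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data: price rises roughly linearly with size.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 5000, size=200)
price = 250 * sqft_living + rng.normal(0, 100_000, size=200)
demo = pd.DataFrame({'sqft_living': sqft_living, 'price': price})

# Pearson correlation ranges from -1 to 1; values near 1 mean a strong positive linear relationship.
corr = demo['sqft_living'].corr(demo['price'])
print('Pearson r: {:.2f}'.format(corr))
```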

## Part 2 - Review Linear Regression

### Store feature data in _X_ and target in _y_

In [None]:
X = housing_data.sqft_living
y = housing_data.price

### Create a train-test split of the data using *train_test_split*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
y_train.shape

#### The _shape_ of each array is missing a dimension, so reshape the *feature and target variables for train and test*

In [None]:
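# Feature variables: convert to NumPy and reshape to (n_samples, 1);
# -1 lets NumPy infer the row count instead of hard-coding it
X_train, X_test = X_train.values.reshape(-1, 1), X_test.values.reshape(-1, 1)

# Target variables
y_train, y_test = y_train.values.reshape(-1, 1), y_test.values.reshape(-1, 1)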

#Quick sanity check on target variable
y_train.shape
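
scikit-learn estimators expect features as a 2-D array of shape `(n_samples, n_features)`, which is why a single column has to be reshaped. A small sketch of the idea, using a made-up Series `s` (the values are illustrative only):

```python
import pandas as pd

# Hypothetical square-footage values
s = pd.Series([1180, 2570, 770, 1960])
print(s.shape)       # one dimension only

# .values gives the underlying NumPy array; -1 tells NumPy to infer the number of rows
col = s.values.reshape(-1, 1)
print(col.shape)     # two dimensions: rows x 1 feature
```

Using `-1` avoids hard-coding row counts, which change whenever `test_size` or the dataset changes.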

#### Train the model using the *fit* function of the *LinearRegression* object.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
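
After fitting, the learned line is available through the `coef_` (slope) and `intercept_` attributes. The sketch below uses hypothetical, noise-free data generated from a known line, so the recovered parameters can be checked by eye; `demo_reg`, `sqft`, and `price` are illustrative names, not part of the workshop data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from price = 200 * sqft + 50,000
sqft = np.array([[1000.0], [1500.0], [2000.0], [3000.0]])
price = 200.0 * sqft.ravel() + 50_000.0

demo_reg = LinearRegression().fit(sqft, price)

# coef_ holds one slope per feature; intercept_ is the value predicted at sqft = 0
slope = demo_reg.coef_[0]
intercept = demo_reg.intercept_
print('slope: {:.1f}, intercept: {:,.1f}'.format(slope, intercept))
```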

### Evaluate the model

In [None]:
y_pred = lin_reg.predict(X_test)

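# RMSE of the model's predictions on the test set
print('RMSE with size only: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_pred))))
# Baseline for comparison: RMSE of always predicting the mean test price
print('RMSE (standard): {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test, [y_test.mean()] * len(y_test)))))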
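
The second RMSE above scores a trivial predictor that always outputs the mean price, so it serves as a baseline: a useful model should come in below it. A minimal sketch of the calculation with NumPy only (the `rmse` helper and the four sample values are hypothetical):

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

# Hypothetical targets and model predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_model = np.array([110.0, 190.0, 310.0, 390.0])

# Baseline: predict the mean of the targets for every sample
y_baseline = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, y_model)
baseline_rmse = rmse(y_true, y_baseline)
print('model: {:.2f}, baseline: {:.2f}'.format(model_rmse, baseline_rmse))
```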

### Plot the model with the data

In [None]:
plt.plot(X_test, y_test, '.', X_test, y_pred, '-')

## Part 3 - Evaluate Linear Regression without encoding

### Plot *price* vs. *zipcode*

In [None]:
housing_data.plot(kind='scatter', x='zipcode', y='price')

### Let's build a model by adding zipcode without encoding and evaluate it.

In [None]:
feature_subset = ['sqft_living', 'zipcode']

In [None]:
X_feature_subset = housing_data[feature_subset]
X_train_feature_subset, X_test_feature_subset, y_train_feature_subset, y_test_feature_subset = train_test_split(X_feature_subset, y, test_size = 0.2)

### Train the model

In [None]:
lin_reg_feature_subset = LinearRegression()
lin_reg_feature_subset.fit(X_train_feature_subset, y_train_feature_subset)

### Evaluate the model with size and zip code (unencoded)

In [None]:
y_pred_feature_subset = lin_reg_feature_subset.predict(X_test_feature_subset)
print('RMSE with size & zip: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test_feature_subset, y_pred_feature_subset))))
# Baseline for comparison: RMSE of always predicting the mean test price
print('RMSE (standard): {:,.2f}'.format(
    np.sqrt(metrics.mean_squared_error(y_test_feature_subset, [y_test_feature_subset.mean()] * len(y_test_feature_subset)))))
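
A zip code is a label, not a quantity, but a linear model given the raw number can only learn a single slope along it. The sketch below illustrates this on made-up data in which the middle zip code is the most expensive; all values and names (`demo`, `raw_rmse`, `ohe_rmse`) are assumptions chosen for illustration, and `pd.get_dummies` stands in for whatever encoder is used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: prices do not rise or fall steadily with the zip code number
demo = pd.DataFrame({
    'zipcode': [98001, 98001, 98002, 98002, 98003, 98003],
    'price': [200_000, 220_000, 600_000, 620_000, 300_000, 320_000],
})
y_demo = demo['price']

# Raw zip code as a number: the model can only fit one straight line through the codes
raw_X = demo[['zipcode']]
raw_pred = LinearRegression().fit(raw_X, y_demo).predict(raw_X)

# One column per zip code: the model can learn a separate price level for each
ohe_X = pd.get_dummies(demo['zipcode'].astype(str)).astype(float)
ohe_pred = LinearRegression().fit(ohe_X, y_demo).predict(ohe_X)

raw_rmse = np.sqrt(np.mean((y_demo - raw_pred) ** 2))
ohe_rmse = np.sqrt(np.mean((y_demo - ohe_pred) ** 2))
print('raw zip RMSE: {:,.0f}, one-hot zip RMSE: {:,.0f}'.format(raw_rmse, ohe_rmse))
```

The errors here are measured on the training rows, which is enough to show the difference in what each representation lets the model express.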

## Part 4 - Evaluate Linear Regression with encoding

#### Get the feature subset from the main dataset

In [None]:
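# Copy the subset so the conversion below does not write to a view of housing_data
X_feature_one_hot = housing_data[feature_subset].copy()
# Convert zip codes to strings so they are treated as categories, not numbers
X_feature_one_hot['zipcode'] = X_feature_one_hot['zipcode'].astype(str)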

### Encode the zipcode with the custom function *one_hot_dataframe*
The function comes from [this gist](https://gist.github.com/saihttam/cad6d3d223fc8d769227). It uses sklearn.feature_extraction.*DictVectorizer*.

In [None]:
from ohe import one_hot_dataframe

X_feature_ohe,categorical,_ = one_hot_dataframe(X_feature_one_hot, ['zipcode'], True)
X_feature_ohe.head()
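
To see what one-hot encoding produces without the custom helper, the sketch below uses `pandas.get_dummies`. This is a swapped-in alternative, not the `one_hot_dataframe` function used above, and the three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; zip codes are strings so they are treated as categories
demo = pd.DataFrame({
    'sqft_living': [1180, 2570, 770],
    'zipcode': ['98178', '98125', '98028'],
})

# Each distinct zip code becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(demo, columns=['zipcode'])
print(encoded.columns.tolist())

# Exactly one zipcode_* indicator is set in each row
flags_per_row = encoded.filter(like='zipcode_').astype(int).sum(axis=1)
print(flags_per_row.tolist())
```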

### Train the model

In [None]:
X_train_feature_ohe, X_test_feature_ohe, y_train_feature_ohe, y_test_feature_ohe = train_test_split(X_feature_ohe, y, test_size = 0.2)
lin_reg_feature_ohe = LinearRegression()
lin_reg_feature_ohe.fit(X_train_feature_ohe, y_train_feature_ohe)

### Evaluate the model with size & encoded zip

In [None]:
y_pred_feature_ohe = lin_reg_feature_ohe.predict(X_test_feature_ohe)
print('RMSE with size & encoded zip: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test_feature_ohe, y_pred_feature_ohe))))
# Baseline for comparison: RMSE of always predicting the mean test price
print('RMSE (standard): {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(y_test_feature_ohe, [y_test_feature_ohe.mean()] * len(y_test_feature_ohe)))))