# Scikit-learn Intro
Scikit-learn is a widely used machine learning library for Python. We will use the [Kaggle housing dataset][1]. Keep the [scikit-learn API reference][2] open during this section.

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
[2]: http://scikit-learn.org/stable/modules/classes.html

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# read in the data
housing = pd.read_csv('data/housing.csv')
housing.head()

# Use correlation matrix to understand relationships

In [None]:
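# numeric_only=True skips text columns, which recent pandas versions refuse to correlate
corr = housing.corr(numeric_only=True)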
corr.head()

In [None]:
sns.clustermap(corr)

In [None]:
corr['SalePrice'].sort_values(ascending=False).head(10)

# Machine Learning Must Have a Goal
Our goal is to predict sale price. We use **squared error**, the square of the difference between the predicted and the actual price, as the metric for how well we have done. We will try to find a model that minimizes it.
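As a minimal sketch of the metric, the cell below computes the squared error by hand and averages it. The prices and predictions are made up for illustration and do not come from the housing data.

```python
import numpy as np

# Hypothetical actual and predicted sale prices (illustrative values only)
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Squared error for each house, then the mean over all houses
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()

# The square root puts the error back in the units of the target (dollars)
rmse = np.sqrt(mse)

print(mse)   # 200000000.0
print(rmse)
```

Squaring penalizes large misses much more than small ones, which is why one badly mispredicted house can dominate the total.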

# Supervised vs Unsupervised
There are two broad types of machine learning:
* Supervised learning provides the **ground truth** (the known target values) - the goal is to predict it accurately
* Unsupervised learning has no ground truth - the goal is to discover structure within the data (for example, clustering similar data points together)
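The difference shows up directly in the call to `fit`. The sketch below uses tiny made-up arrays: the supervised model is given both `X` and `y`, while the clustering model (`KMeans`, chosen here only as an example of an unsupervised estimator) sees `X` alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Made-up data: two well-separated groups of points
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = 2 * X_demo.ravel() + 1  # ground truth, used only by the supervised model

# Supervised: fit receives the ground truth y
supervised = LinearRegression().fit(X_demo, y_demo)
print(supervised.coef_, supervised.intercept_)

# Unsupervised: fit receives only X and looks for structure on its own
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
labels = unsupervised.labels_
print(labels)
```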

# Regression vs Classification
There are two types of supervised learning:
* Regression - continuous output (such as a sale price)
* Classification - discrete output (classes)

# Begin with simplest model - Use one predictor variable

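# Go from Pandas to NumPy for Scikit-Learn
Scikit-learn also accepts pandas objects directly, but working with NumPy arrays makes the expected shapes explicit.
* Use `X` for predictor variables
* Use `y` for target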

In [None]:
X = housing['OverallQual'].values
y = housing['SalePrice'].values
X

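# Scikit-Learn gotcha - X must be two-dimensional
Scikit-learn expects `X` to have one row per sample and one column per feature, even when there is only one feature.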

In [None]:
X = X.reshape(-1, 1)
X
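A small sketch of the shapes involved, using a made-up three-row DataFrame (`housing_demo` is a stand-in, not the real data). It also shows that selecting with a list of column names gives a two-dimensional result without any reshape.

```python
import pandas as pd

# Stand-in for the housing data, with illustrative values only
housing_demo = pd.DataFrame({
    'OverallQual': [5, 7, 8],
    'SalePrice': [150000, 210000, 300000],
})

# Selecting one column gives a Series, so .values is one-dimensional
X_1d = housing_demo['OverallQual'].values
print(X_1d.shape)   # (3,)

# reshape(-1, 1): one column, and -1 lets NumPy infer the number of rows
X_2d = X_1d.reshape(-1, 1)
print(X_2d.shape)   # (3, 1)

# Selecting with a list of columns gives a DataFrame, which is already 2-D
X_alt = housing_demo[['OverallQual']].values
print(X_alt.shape)  # (3, 1)
```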

# Three steps to doing machine learning in Scikit-Learn
1. Import Model
2. Instantiate model
3. Train model (use `fit` method)

# Step 1: Import Model

In [None]:
from sklearn.linear_model import LinearRegression

# Step 2: Instantiate Model

In [None]:
lr = LinearRegression()

# Step 3: Train model with `fit` method

In [None]:
lr.fit(X, y)

# More steps - Predict Housing Price

In [None]:
lr.predict(X)

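# Evaluate Model - Find the score
For regression models, `score` returns R² (the coefficient of determination), not the squared error. Higher is better, and 1.0 is a perfect fit.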

In [None]:
lr.score(X, y)
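A rough sketch of what `score` computes for a regressor, on synthetic data (the arrays and the 40000 slope are invented for illustration). It compares the value from `score` with a manual R² calculation and also reports the mean squared error from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for quality ratings and sale prices
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=50).reshape(-1, 1).astype(float)
y_demo = 40000 * X_demo.ravel() + rng.normal(0, 20000, size=50)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = ((y_demo - pred) ** 2).sum()
ss_tot = ((y_demo - y_demo.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

r2_from_score = model.score(X_demo, y_demo)
mse = mean_squared_error(y_demo, pred)

print(r2_from_score, r2_manual)
print(mse)
```

Both numbers here are measured on the same data the model was trained on, so they tend to be optimistic.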

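# Use Cross Validation to estimate performance on unseen data
By default, `cross_val_score` reports the model's `score` (R² for regressors) on each held-out fold.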

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(lr, X, y)

In [None]:
scores = cross_val_score(lr, X, y, cv=10)
scores

In [None]:
scores.mean()
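To get a cross-validated squared error instead of the default score, `cross_val_score` takes a `scoring` argument. The sketch below uses synthetic data (invented for illustration) and the built-in `'neg_mean_squared_error'` scorer, which is negated because scikit-learn treats higher scores as better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for quality ratings and sale prices
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=100).reshape(-1, 1).astype(float)
y_demo = 40000 * X_demo.ravel() + rng.normal(0, 20000, size=100)

# One negated MSE value per fold
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=10, scoring='neg_mean_squared_error',
)

# Flip the sign to recover the usual (positive) mean squared error
mse_per_fold = -neg_mse
rmse_per_fold = np.sqrt(mse_per_fold)

print(mse_per_fold.mean())
print(rmse_per_fold.mean())
```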

# More Machine Learning Models
### Try a random forest
Use exactly the same steps as above. Scikit-Learn estimators share the same `fit`, `predict`, and `score` interface, so only the import and the instantiation change. Make sure you are using **Regression** models and not classification.
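Because the interface is shared, different models can be evaluated with the same code. The sketch below is one possible way to do that on synthetic data. The data, the `models` dictionary, and the settings `n_estimators=50` and `random_state=0` are all choices made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for quality ratings and sale prices
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=100).reshape(-1, 1).astype(float)
y_demo = 40000 * X_demo.ravel() + rng.normal(0, 20000, size=100)

# Any regressor can go in here, since they all expose fit/predict/score
models = {
    'linear regression': LinearRegression(),
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Default scoring for regressors is R^2 on each held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(name, results[name])
```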

# 1. Import Model

In [None]:
from sklearn.ensemble import RandomForestRegressor

# 2. Instantiate Model

In [None]:
# random_state fixes the seed so results are repeatable between runs
rf = RandomForestRegressor(random_state=0)

# 3. Train

In [None]:
rf.fit(X, y)

# Evaluate and Cross Validate

In [None]:
rf.score(X, y)

In [None]:
scores = cross_val_score(rf, X, y, cv=10)
scores.mean()

# Practice
Try several more regression models here, following the same three steps: import, instantiate, train.

In [None]:
# your code here