# Integrating Pandas with Scikit-Learn, an Exciting New Workflow

## About Me - Ted Petrou

* Author of Pandas Cookbook
* Author of [Master Data Analysis with Python][2]
* Founder of Dunder Data - Expert Data Science Instruction
* Specializes in finding best practices for using the Python data science ecosystem
* Follow me on Twitter [@TedPetrou][3]

## Make sure you have scikit-learn 0.20 installed
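
One way to confirm the installed version from inside a notebook is to inspect `sklearn.__version__`. The check below is a minimal sketch that assumes a release-style version string beginning with `major.minor`.

```python
import sklearn

# Show the installed version string
print(sklearn.__version__)

# Compare only the major and minor parts against 0.20
major, minor = (int(part) for part in sklearn.__version__.split('.')[:2])
has_required_version = (major, minor) >= (0, 20)
print(has_required_version)
```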

## Major Objective
* Teach the latest and most robust workflows for those who use pandas for data exploration and scikit-learn for machine learning.
* Cover the new features added in version 0.20 of scikit-learn, released in September 2018

[1]: https://scikit-learn.org/stable/whats_new.html#version-0-20-0
[2]: https://online.dunderdata.com/courses/master-data-analysis-with-python-volume-1-foundations-of-data-exploration
[3]: https://twitter.com/tedpetrou

## 1. The Scikit-Learn Estimator
* Scikit-learn has one primary type of object for machine learning: the **estimator**.

All estimators:
* Learn from data
* Are Python types (classes)
* Are written in CamelCase
* Use the three-step process: import, instantiate, fit

Types of estimators:
* Regressors - Supervised learning with a continuous target
* Transformers - Transform the input/output data
* Meta-estimators - Learn from other estimators
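
As a small illustration of these points, the sketch below checks that an estimator is an ordinary Python class and uses a helper from `sklearn.base` to tell a regressor from a transformer. `LinearRegression` and `StandardScaler` are chosen here only as examples.

```python
from sklearn.base import BaseEstimator, is_regressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Estimators are ordinary Python classes that share a common base class
print(isinstance(LinearRegression, type))
print(issubclass(LinearRegression, BaseEstimator))

# A regressor predicts a continuous target; a transformer has a transform method
lr = LinearRegression()
scaler = StandardScaler()
print(is_regressor(lr), is_regressor(scaler))
print(hasattr(scaler, 'transform'), hasattr(lr, 'transform'))
```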

### Helper Functions

### Finding estimators in the scikit-learn API
[1]: https://scikit-learn.org/stable/glossary.html#class-apis-and-estimator-types

In [None]:
from IPython.display import IFrame
IFrame('https://scikit-learn.org/stable/modules/classes.html', 800, 600)

## Common Estimators and Helper Functions


### House - Room - Object

![](images/scikit_house.png)

## The Housing Dataset
* 80 variables
* 1460 rows
* Ames, Iowa 2006 - 2010

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [None]:
import pandas as pd
hs = pd.read_csv('data/housing_sample.csv')
hs.head()

Get the shape, data types, and number of missing values:
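
A minimal sketch of these three checks follows. It builds a tiny stand-in DataFrame with made-up values so it runs on its own; the column names are assumptions, and with the real data `hs` would come from `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing sample, with made-up values
hs = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500],
    'GarageType': ['Attchd', 'Attchd', np.nan, 'Detchd'],
    'SalePrice': [150000, 200000, 250000, 300000],
})

print(hs.shape)         # (number of rows, number of columns)
print(hs.dtypes)        # data type of each column
print(hs.isna().sum())  # number of missing values in each column
```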

## Prepare Data - scikit-learn gotchas
* No missing values
* Use `X` and `y`

### Model sale price with ground living area
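
The sketch below shows one way to build `X` and `y` for this model. It uses a small made-up DataFrame in place of the housing sample so that it is self-contained; only the column names `GrLivArea` and `SalePrice` are taken from the dataset.

```python
import pandas as pd

# Made-up stand-in rows for the housing sample
hs = pd.DataFrame({'GrLivArea': [1000, 1500, 2000, 2500],
                   'SalePrice': [150000, 200000, 250000, 300000]})

# Confirm there are no missing values in the columns used for modeling
n_missing = hs[['GrLivArea', 'SalePrice']].isna().sum().sum()
print(n_missing)

# Double brackets keep X two-dimensional: (n_samples, n_features)
X = hs[['GrLivArea']].values
# Single brackets give a one-dimensional target: (n_samples,)
y = hs['SalePrice'].values

print(X.shape, y.shape)
```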

## Import, Instantiate, Fit - The three-step process for each estimator

* **Import** the estimator from its module
* **Instantiate** the estimator, possibly changing the (hyper)parameters
* **Fit** the estimator to the data

## Linear regression with the three-step process

### Step 1: Import

[1]: https://scikit-learn.org/stable/glossary.html#term-regressors
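
For linear regression, the import step can be written as follows. The estimator is imported directly from its module, `sklearn.linear_model`.

```python
# Step 1 - import the estimator class from its module
from sklearn.linear_model import LinearRegression

# The imported name is a class, not yet a model that has learned anything
print(LinearRegression.__name__)
```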

### Step 2: Instantiate
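
A minimal sketch of instantiation follows. The second estimator, `lr_no_intercept`, is a hypothetical name used only to show that hyperparameters are set at this step.

```python
from sklearn.linear_model import LinearRegression

# Step 2 - instantiate with the default hyperparameters
lr = LinearRegression()

# Hyperparameters can be changed here, for example fitting without an intercept
lr_no_intercept = LinearRegression(fit_intercept=False)

# get_params returns the hyperparameters as a dictionary
print(lr.get_params())
```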

### Step 3: Fit
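
The sketch below fits the estimator on a few made-up rows (living area and sale price) rather than the housing sample, so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: a two-dimensional input and a one-dimensional target
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([150000, 200000, 250000, 300000])

lr = LinearRegression()

# Step 3 - fit learns from the data and returns the estimator itself
fitted = lr.fit(X, y)
print(fitted is lr)
```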

## Estimated Parameters - end in a single underscore
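
As a sketch of this naming convention, the example below fits a linear regression on made-up data and inspects `coef_` and `intercept_`. These attributes exist only after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from price = 100 * area + 50000
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([150000, 200000, 250000, 300000])

# Before fitting, the estimated parameters do not exist yet
unfitted = LinearRegression()
print(hasattr(unfitted, 'coef_'))

lr = LinearRegression()
lr.fit(X, y)

print(lr.coef_)       # one slope per input feature
print(lr.intercept_)  # the estimated intercept
```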

## Make predictions
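
A fitted estimator makes predictions with its `predict` method, which expects two-dimensional input just like `fit`. The sketch below uses made-up data, and the 1800 square foot house is a hypothetical new observation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from price = 100 * area + 50000
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([150000, 200000, 250000, 300000])

lr = LinearRegression()
lr.fit(X, y)

# Predict on the training data: one prediction per row of X
train_predictions = lr.predict(X)
print(train_predictions)

# Predict for a new observation; the input must still be two-dimensional
new_prediction = lr.predict(np.array([[1800]]))
print(new_prediction)
```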

## Summary of commands

In [None]:
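import pandas as pd

hs = pd.read_csv('data/housing_sample.csv')
X = hs[['GrLivArea']].values     # two-dimensional input: (n_samples, 1)
y = hs.pop('SalePrice').values   # one-dimensional target; pop also removes the column from hs

from sklearn.linear_model import LinearRegression  # step 1 - import
lr = LinearRegression()                            # step 2 - instantiate
lr.fit(X, y)                                       # step 3 - fit

lr.predict(X)                                      # predictions for the training data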

## Exercise
All other regression estimators use the same three-step process to learn from the data. Complete the three-step process for the following models:
* K-nearest neighbors
* Decision trees
* Random forests
* Gradient boosted trees

The learned model can change drastically depending on the hyperparameters set in step 2 during instantiation. We aren't concerned with hyperparameters at this point. Also, you may choose input data from other columns that have no missing values.
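
One possible solution is sketched below, using made-up data instead of the housing sample. The `random_state=0` arguments are an assumption added only to make the output repeatable; all other hyperparameters are left at their defaults.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor    # step 1 - import
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Made-up data: living area and sale price
X = np.array([[1000], [1250], [1500], [1750], [2000], [2500]])
y = np.array([150000, 170000, 205000, 220000, 250000, 310000])

# step 2 - instantiate each estimator
models = {
    'knn': KNeighborsRegressor(),
    'tree': DecisionTreeRegressor(random_state=0),
    'forest': RandomForestRegressor(random_state=0),
    'boosting': GradientBoostingRegressor(random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # step 3 - fit
    predictions[name] = model.predict(X)  # predict on the training data
    print(name, predictions[name][:3])
```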