# Intro to Machine Learning with scikit-learn

## Resources
* [Scikit-learn docs][1]
* [Master Machine Learning with Python by Ted Petrou][2]
* [Dunder Data][3]

[1]: http://scikit-learn.org/stable/
[2]: https://www.dunderdata.com/offers/zL9AKLct/checkout
[3]: https://www.dunderdata.com

## Typical Workflow for Beginners
* Find a dataset
    * [Kaggle Datasets](https://www.kaggle.com/datasets)
    * [data.world](https://data.world/)
    * [data.gov](https://www.data.gov/)
* Read the data into pandas
* Clean the data
* Explore the data with basic statistics and visualizations
* Define the problem
* Train and evaluate a model with scikit-learn

## Learning vs Machine Learning

### What is Learning?
Learning is the ability to improve at a **task**. Learning is done by animals, humans, and some machines.

### What is a task?
A task is a clearly defined piece of work.

### Measuring task performance
Learning happens when the person or machine improves its performance at completing the task. 

### What is Machine Learning?
Machine learning is often defined as the ability of a machine to learn (to improve on a specific task) without being explicitly programmed to do so.

### What is "not explicitly programmed"?
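The rules the machine follows are not written or updated by a human. The machine adjusts its own behavior based on the data it is given.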

### The two types of machine learning
* **supervised** - have labels (ground truth)
* **unsupervised** - no labels

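### Regression vs Classification
Both are supervised learning tasks. They differ in the type of label.
* **regression** - continuous labels (for example, a sale price)
* **classification** - discrete labels (for example, spam or not spam)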


## Terminology

![](images/terminology.png)


## Assessing task performance
Task performance must be assessed with an objective, quantifiable measure.

### Assessing regression task performance
For regression, the goal is to minimize the error between the predicted and actual values.

### Make prediction on unseen data
What matters is how well the model performs on data it did not see during training.
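One common way to simulate unseen data is to hold out part of the dataset before training. The sketch below uses scikit-learn's `train_test_split` on synthetic stand-in data (not the Ames data); the 25% split size and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Ames dataset): 100 samples, 1 feature.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)

# Hold out 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```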

[1]: images/terminology.png

## Ames Housing Data

* [Well-known beginner Kaggle competition][0] using data compiled by Professor Dean De Cock on home sales in Ames, Iowa from 2006 to 2010
* Original dataset has 79 features and 1460 samples
* For simplicity, we will only look at 8 features
* Goal: predict the sale price
* Evaluation metric: R^2, which compares the model's squared error to that of always guessing the mean

[0]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

### Read in Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
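Reading the data is typically a single `pd.read_csv` call. The sketch below reads from an in-memory string so that it runs on its own; the column names `GrLivArea` and `SalePrice` are assumed to match the Kaggle file, the three rows are made up, and the variable name `housing` is hypothetical.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real file; values are illustrative only.
csv_text = """GrLivArea,SalePrice
1500,180000
2100,250000
900,110000
"""

# With the real data, pass the path to the downloaded CSV file instead.
housing = pd.read_csv(io.StringIO(csv_text))
print(housing.shape)
print(housing.head())
```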

## Quick EDA with pandas_profiling

In [None]:
from pandas_profiling import ProfileReport

### What are we predicting?
In this problem, we want to predict the final sale price of the house.

### Assign target variable to `y`

### Baseline model for regression
The simplest baseline is to guess the mean sale price for every house. Any useful model should beat it.
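A minimal sketch of the mean baseline, using made-up prices rather than the Ames data. Because R^2 measures squared error relative to always guessing the mean, this baseline scores 0 on the data it was computed from.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up sale prices, for illustration only.
y = np.array([110_000.0, 180_000.0, 250_000.0, 320_000.0])

# Baseline: predict the mean sale price for every house.
baseline_pred = np.full(y.shape, y.mean())

# R^2 is 1 minus (model squared error / baseline squared error).
baseline_r2 = r2_score(y, baseline_pred)
print(baseline_r2)
```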

### Choose a single column to learn from
Use above-ground living area, which is highly correlated with sale price.

### Linear Regression

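Simple linear regression maps a single input $x_1$ to the output $y$ using the following equation, where $w_0$ is the intercept and $w_1$ is the slope.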

$$y = w_{0} + w_{1}x_{1}$$
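To make the equation concrete, the sketch below evaluates it for a few inputs with NumPy. The weights and living areas are hypothetical values picked for easy arithmetic, not fitted results.

```python
import numpy as np

# Hypothetical weights: intercept w0 and slope w1 (price per square foot).
w0, w1 = 20_000.0, 100.0

# Hypothetical living areas in square feet.
x1 = np.array([1000.0, 1500.0, 2000.0])

# y = w0 + w1 * x1, applied element-wise to every house.
y_hat = w0 + w1 * x1
print(y_hat)
```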

### Extract columns into own variable
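A sketch of the extraction step, assuming a DataFrame named `housing` with columns `GrLivArea` and `SalePrice` (both names are assumptions, and the rows are made up). scikit-learn expects the input `X` to be two-dimensional, which is why the feature is selected with double brackets.

```python
import pandas as pd

# Made-up rows standing in for the real data.
housing = pd.DataFrame({
    "GrLivArea": [1500, 2100, 900],
    "SalePrice": [180_000, 250_000, 110_000],
})

# Double brackets return a DataFrame (2D), the shape scikit-learn expects for X.
X = housing[["GrLivArea"]]

# Single brackets return a Series (1D), which is fine for the target y.
y = housing["SalePrice"]
print(X.shape, y.shape)
```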

## Import, Instantiate, Fit - Three-step process for all Estimators
* Import - find the model in the scikit-learn 'house' and import it
* Instantiate - create a single instance of the model
* Fit - learn from the data

All scikit-learn estimators follow this three-step process.

### Step 1: Import

### Step 2: Instantiate

### Step 3: Fit
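The three steps can be sketched as follows with `LinearRegression`. The data here is made up and noise-free (generated from an intercept of 20,000 and a slope of 100), so the fitted `intercept_` and `coef_` should recover those values up to floating-point error.

```python
import numpy as np

# Step 1: Import the estimator class from its scikit-learn module.
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data: price = 20,000 + 100 * area.
X = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
y = 20_000 + 100 * X[:, 0]

# Step 2: Instantiate a single instance of the model.
lr = LinearRegression()

# Step 3: Fit, which learns the weights from the data.
lr.fit(X, y)

# The learned weights are stored on the fitted estimator.
print(lr.intercept_, lr.coef_)
```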

### Get the model

### How does linear regression work
Linear regression finds the combination of coefficients (weights) that minimizes the squared error between the predictions and the actual values.

![](images/r2.png)

[0]: images/r2.png
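As a rough check on the idea of minimizing squared error, the sketch below solves the least-squares problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data is synthetic, and this only illustrates the objective; it is not a claim about how scikit-learn computes the solution internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
x = rng.uniform(500, 3000, size=50)
y = 20_000 + 100 * x + rng.normal(0, 10_000, size=50)

# Design matrix with a column of ones so that w[0] plays the role of w0.
A = np.column_stack([np.ones_like(x), x])

# Weights that minimize the sum of squared errors ||A w - y||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# The same weights, found by scikit-learn.
lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(w, lr.intercept_, lr.coef_[0])
```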

### Make predictions
Use the `predict` method of a fitted model to get predictions.

### Evaluate model - get R-squared
Use the `score` method. For regressors, it returns R-squared.
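A sketch of `predict` and `score` on held-out data, using synthetic stand-in data. It also confirms that `score` on a regressor matches `r2_score` computed from the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lr = LinearRegression()
lr.fit(X_train, y_train)

# predict returns one predicted value per row of the input.
y_pred = lr.predict(X_test)

# score predicts internally and returns R-squared against the true values.
r2 = lr.score(X_test, y_test)
print(r2, r2_score(y_test, y_pred))
```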

### Multiple Linear Regression
Multiple linear regression uses more than one predictor variable.
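With two predictors the equation becomes $y = w_0 + w_1 x_1 + w_2 x_2$, and the fitted model has one coefficient per column. The sketch below uses synthetic, noise-free data; living area and an overall quality rating are assumed features chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in features: living area and a 1-10 quality rating.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=50)
quality = rng.integers(1, 11, size=50).astype(float)
X = np.column_stack([area, quality])

# Noise-free target built from known weights, so the fit can recover them.
y = 10_000 + 100 * area + 5_000 * quality

lr = LinearRegression()
lr.fit(X, y)

# One coefficient per column of X, in column order.
print(lr.intercept_, lr.coef_)
```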

## Other ML models - The scikit-learn house

![](images/scikit_house.png)

[0]: images/scikit_house.png

## K-Nearest Neighbors
K-nearest neighbors follows the same three-step process: import, instantiate, fit.
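A sketch of the three steps with `KNeighborsRegressor` on synthetic stand-in data. The choice of 5 neighbors is an assumption for illustration (it is also the scikit-learn default). Each prediction is the average target of the nearest training points.

```python
import numpy as np

# Step 1: Import
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)

# Step 2: Instantiate, choosing how many neighbors to average over.
knn = KNeighborsRegressor(n_neighbors=5)

# Step 3: Fit
knn.fit(X, y)

# Predict for two hypothetical houses.
new_houses = np.array([[1000.0], [2000.0]])
knn_pred = knn.predict(new_houses)
print(knn_pred)
```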

## Decision Trees
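Decision trees can be used through the same import, instantiate, fit pattern. The sketch below uses `DecisionTreeRegressor` on synthetic stand-in data; limiting `max_depth` to 3 is an assumption made here to keep the tree small.

```python
import numpy as np

# Step 1: Import
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)

# Step 2: Instantiate, capping the depth to limit how finely the tree splits.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)

# Step 3: Fit
tree.fit(X, y)

tree_pred = tree.predict(np.array([[1000.0], [2000.0]]))
print(tree.get_depth(), tree_pred)
```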

## Model evaluation
We must evaluate the model on unseen data. Cross-validation does this by repeatedly holding out a different part of the data for evaluation.

![](images/kfold.png)

[0]: images/kfold.png
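K-fold cross-validation can be sketched with `KFold` and `cross_val_score`. The data is synthetic, and 5 shuffled folds is an assumed setting. For a regressor, `cross_val_score` uses the estimator's own `score` method by default, so each value is an R-squared on one held-out fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)

# Split the data into 5 folds; each fold is held out exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One R-squared value per held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores, scores.mean())
```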

## Tune Hyper-parameters with grid-search
Create a parameter grid, then follow the same three-step process.
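A sketch of grid search with `GridSearchCV`, tuning `n_neighbors` for a `KNeighborsRegressor` on synthetic stand-in data. The candidate values in the grid are assumptions picked for illustration. Each candidate is scored with 5-fold cross-validation.

```python
import numpy as np

# Step 1: Import
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the Ames dataset).
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 20_000 + 100 * X[:, 0] + rng.normal(0, 20_000, size=100)

# Parameter grid: keys are parameter names, values are the candidates to try.
param_grid = {"n_neighbors": [1, 5, 10, 20]}

# Step 2: Instantiate the model and wrap it in the grid search.
knn = KNeighborsRegressor()
gs = GridSearchCV(knn, param_grid, cv=5)

# Step 3: Fit, which cross-validates every candidate in the grid.
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)
```

The per-candidate scores are available afterwards in `gs.cv_results_`, for example under the key `"mean_test_score"`.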

### Grid search results

## Extra
* Remedying missing values
* Categorical features
* Feature Standardization
* Pipelines
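The four topics above can be combined in a single scikit-learn `Pipeline`. The sketch below is one possible arrangement, not a prescribed recipe: the rows are made up, the column names `GrLivArea` and `Neighborhood` are assumptions, and median imputation is just one of several strategies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one missing numeric value and one categorical column.
X = pd.DataFrame({
    "GrLivArea": [1500.0, np.nan, 900.0, 2100.0, 1200.0, 1800.0],
    "Neighborhood": ["A", "B", "A", "B", "A", "B"],
})
y = np.array([180_000.0, 210_000.0, 110_000.0, 250_000.0, 140_000.0, 230_000.0])

# Numeric columns: fill missing values with the median, then standardize.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
categorical = OneHotEncoder(handle_unknown="ignore")

# Apply each transformer to its own set of columns.
preprocess = ColumnTransformer([
    ("num", numeric, ["GrLivArea"]),
    ("cat", categorical, ["Neighborhood"]),
])

# The pipeline behaves like any other estimator: fit, predict, score.
model = Pipeline([("preprocess", preprocess), ("lr", LinearRegression())])
model.fit(X, y)
pipe_pred = model.predict(X)
print(pipe_pred)
```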