# Machine Learning Workflow: Steps 4 and 5

Most machine learning projects follow the pipeline below, although the detailed implementation can vary. In practice we often iterate over some of the steps, such as feature engineering and selection.
1. Data cleaning and formatting
2. Exploratory Data Analysis (EDA)
3. Feature engineering and selection
4. Split dataset and compare different models on a performance metric
5. Perform hyperparameter tuning on the best model to optimize it 
6. Evaluate the best model on the testing set
7. Interpret the model results
8. Draw conclusions and write a well-documented report

Most of the following explanations and code are from [Will Koehrsen](https://github.com/WillKoehrsen/machine-learning-project-walkthrough/blob/master/Machine%20Learning%20Project%20Part%202.ipynb).


In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['font.size'] = 24
plt.style.use('fivethirtyeight')

import seaborn as sns
sns.set(font_scale = 2)

# Imputing missing values and scaling values
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Splitting the data and hyperparameter tuning
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

## 4. Split dataset and compare different models 
For supervised machine learning, first check that every row has a target value. Rows with a missing target cannot be used for training or evaluation, so remove them before splitting the data.
- no_target_score = data[data[target].isna()]
- target_score = data[data[target].notnull()]

### 4.1 Split dataset
#### Separate out the features and target
- features = target_score.drop(columns=target)
- targets = pd.DataFrame(target_score[target])

#### Replace inf and -inf with nan
- features = features.replace({np.inf: np.nan, -np.inf: np.nan})

#### Split into 70% training and 30% testing (train_test_split comes from sklearn.model_selection)
- x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)
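The steps above can be tried end to end on a small made-up table. This is only a sketch: the column names (`area`, `age`, `score`) and the values are invented for illustration and are not from the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (hypothetical columns and values).
data = pd.DataFrame({
    "area": [100.0, 250.0, np.inf, 80.0, 120.0, 300.0, 90.0, 150.0, 210.0, 60.0],
    "age": [10, 25, 3, 40, 15, 8, 30, 22, 5, 50],
    "score": [55.0, 70.0, 62.0, np.nan, 48.0, 91.0, 33.0, np.nan, 77.0, 20.0],
})
target = "score"  # name of the target column

# Keep only the rows where the target is known.
target_score = data[data[target].notnull()]

# Separate the features from the target, then turn +/-inf into NaN
# so that the later imputation step can handle them.
features = target_score.drop(columns=target)
targets = pd.DataFrame(target_score[target])
features = features.replace({np.inf: np.nan, -np.inf: np.nan})

# 70% training, 30% testing; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.3, random_state=42
)
print(x_train.shape, x_test.shape)
```

Taking the targets from `target_score` rather than from `data` keeps the features and targets aligned row by row, which `train_test_split` requires.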


### 4.2 Establish a baseline
The goal of establishing a baseline is to compare the model against a naive guess. If the models we build cannot beat the baseline, machine learning as we have applied it may not be suited to the problem. This may be because we are not using the right models, because we need more data, or because there is a simpler solution that does not require machine learning.

For a regression task, a good naive baseline is to predict the median value of the target on the training set for all examples in the test set. If the model can't do better than guessing the median value, we will need to rethink the approach. We use the mean absolute error (MAE) as the metric, both for the baseline and for the models.

#### Calculate the median guess and evaluate it on the test set (`mean_abso_err` is a helper that returns the mean absolute error)
- baseline_guess = np.median(y_train)

- print('The baseline guess is a score of %0.2f' % baseline_guess)
- print('Baseline Performance on the test set: Mean Absolute Error = %0.4f' % mean_abso_err(y_test, baseline_guess))
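The helper `mean_abso_err` is not defined in this section. A minimal sketch of what it could look like is shown here, assuming it simply averages the absolute differences; the target values are made up for illustration.

```python
import numpy as np

def mean_abso_err(y_true, y_pred):
    """Mean absolute error; y_pred may be a single number or an array."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Toy targets (hypothetical values).
y_train = np.array([40.0, 55.0, 60.0, 72.0, 90.0])
y_test = np.array([50.0, 65.0, 80.0])

# The naive model predicts the training median for every test example.
baseline_guess = np.median(y_train)
baseline_mae = mean_abso_err(y_test, baseline_guess)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print('Baseline Performance on the test set: Mean Absolute Error = %0.4f' % baseline_mae)
```

Because NumPy broadcasts the scalar guess against the test targets, the same helper works for the baseline and for per-example model predictions.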

### 4.3 Evaluating and Comparing Machine Learning Models
In this section we will build, train, and evaluate several models. The objective is to determine which model holds the most promise for further hyperparameter tuning.

We will compare models using the mean absolute error (MAE).

### a) Imputing Missing Values
Standard machine learning models can't deal with missing values, which means we must either fill in those missing values or discard the features that contain them. Since we already removed features with more than 50% missing values in the first part, here we focus on filling in the remaining missing values, a process known as imputation. There are many ways to fill in missing values; one relatively simple method is to replace them with the median of the column.


In the code below, a scikit-learn imputer object is created to fill in missing values with the median of each column. Notice that the imputer is fitted with imputer.fit on the training data only, not on the testing data. Both the training and the testing data are then transformed with imputer.transform. This means the missing values in the testing set are filled in with the median of the corresponding column in the training set. Doing it this way avoids the problem known as data leakage, where information from the testing set 'leaks' into the training process.

##### Create an imputer object with a median filling strategy
- imputer = SimpleImputer(strategy = 'median')

##### Train on the training features
- imputer.fit(x_train)

##### Transform both training and testing data
- x_train = imputer.transform(x_train)
- x_test = imputer.transform(x_test)

##### Confirm that no missing values remain
- print('Missing values in training features: ', np.sum(np.isnan(x_train)))
- print('Missing values in testing features: ', np.sum(np.isnan(x_test)))
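To see that the testing set really is filled with training medians, the imputation can be run on two tiny arrays. The numbers below are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices (hypothetical values) with some missing entries.
train_features = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
test_features = np.array([[np.nan, 20.0], [100.0, np.nan]])

# Learn the column medians from the training data only.
imputer = SimpleImputer(strategy='median')
imputer.fit(train_features)

# Apply the same training medians to both sets.
x_train = imputer.transform(train_features)
x_test = imputer.transform(test_features)

print('Medians learned from the training set:', imputer.statistics_)
print('Missing values in training features: ', np.sum(np.isnan(x_train)))
print('Missing values in testing features: ', np.sum(np.isnan(x_test)))
```

The missing value in the first test column is filled with 2.0, the training median, even though the only observed test value in that column is 100.0.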

### b) Scaling Features
Features are often measured in different units and ranges. If we do not scale them, features with large values can dominate distance computations, and features with small values may be effectively ignored even when they are important. **Linear Regression and Random Forest do not require feature scaling.** Other methods, such as support vector machines and k-nearest neighbors, do require it because they take into account the Euclidean distance between observations.


There are two common ways to scale features:
1. For each value, subtract the mean of the feature and divide by the standard deviation of the feature. This is known as standardization and results in each feature having a mean of 0 and a standard deviation of 1.

2. For each value, subtract the minimum value of the feature and divide by the range of the feature (the maximum minus the minimum). This ensures that all the values of a feature lie between 0 and 1 and is called scaling to a range or normalization.

##### Create the scaler object with a range of 0-1
- scaler = MinMaxScaler(feature_range=(0, 1))

##### Fit on the training data
- scaler.fit(x_train)

##### Transform both the training and testing data
- x_train = scaler.transform(x_train)
- x_test = scaler.transform(x_test)
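A small sketch of scaling to a range, using invented numbers, shows what the scaler computes and what happens to test values that fall outside the range seen in training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices (hypothetical values); the two columns have very different ranges.
x_train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
x_test = np.array([[2.5, 200.0], [12.0, 500.0]])

# Learn the per-column minimum and maximum from the training data only.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# The same result computed by hand: (value - train min) / (train max - train min).
train_min = x_train.min(axis=0)
train_max = x_train.max(axis=0)
manual = (x_test - train_min) / (train_max - train_min)

print(x_test_scaled)
```

The training data always ends up in the interval from 0 to 1. A test value larger than the training maximum (12.0 against a maximum of 10.0 here) is mapped above 1, because the scaler reuses the training minimum and maximum rather than refitting on the test data.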

##### Convert y to a one-dimensional array (vector)
- y_train = np.array(y_train).reshape((-1, ))
- y_test = np.array(y_test).reshape((-1, ))

### c) Models to Evaluate
We will compare five different machine learning models:

- Linear Regression
- Support Vector Machine Regression
- Random Forest Regression
- Gradient Boosting Regression
- K-Nearest Neighbors Regression

With mostly default settings these models generally perform decently, but they should be optimized before being used in practice. At first we just want to determine the out-of-the-box performance of each model, then select the best-performing model for further optimization using hyperparameter tuning.
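The snippets that follow call a helper named `fit_and_evaluate`, which is not defined in this section. One possible sketch is given here, assuming the helper reads `x_train`, `x_test`, `y_train`, and `y_test` from the enclosing scope and returns the test MAE. The synthetic data from `make_regression` is only a stand-in so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the original dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def mean_abso_err(y_true, y_pred):
    """Mean absolute error; y_pred may be a single number or an array."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def fit_and_evaluate(model):
    """Fit the model on the training set and return its MAE on the test set."""
    model.fit(x_train, y_train)
    model_pred = model.predict(x_test)
    return mean_abso_err(y_test, model_pred)

# Example: evaluate one model and compare it with the median baseline.
lr_mae = fit_and_evaluate(LinearRegression())
baseline_mae = mean_abso_err(y_test, np.median(y_train))
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
print('Baseline Performance on the test set: MAE = %0.4f' % baseline_mae)
```

Relying on variables from the enclosing scope keeps each call short, as in the snippets below; passing the data in as arguments would be a safer design for reusable code.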

#### Linear Regression
- lr = LinearRegression()
- lr_mae = fit_and_evaluate(lr)

#### Support Vector Machine Regression
- svm = SVR(C = 1000, gamma = 0.1)
- svm_mae = fit_and_evaluate(svm)

#### Random Forest Regression
- random_forest = RandomForestRegressor(random_state = 60)
- random_forest_mae = fit_and_evaluate(random_forest)

#### Gradient Boosting Regression
- gradient_boosted = GradientBoostingRegressor(random_state = 60)
- gradient_boosted_mae = fit_and_evaluate(gradient_boosted)

#### K-Nearest Neighbors Regression
- knn = KNeighborsRegressor(n_neighbors = 10)
- knn_mae = fit_and_evaluate(knn)

In [None]:
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)
print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae)
print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae)
print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae)
print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae)

##### Collect the MAEs above in a DataFrame
- model_comparison = pd.DataFrame({'model':['Linear Regression','Support Vector Machine', 'Random Forest', 'Gradient Boosted', 'K-Nearest Neighbors'],'Mean Absolute Error':[lr_mae, svm_mae, random_forest_mae, gradient_boosted_mae, knn_mae]})

##### Horizontal bar chart of test MAE
- model_comparison.sort_values('Mean Absolute Error', ascending = False).plot(x='model',y='Mean Absolute Error', kind = 'barh', edgecolor='black')

- plt.xlabel('??')
- plt.xticks(size=??)
- plt.ylabel('??')
- plt.yticks(size=??)
- plt.title('??', size=20)

We then compare the MAE of each model with the baseline we set in 4.2. Because a lower MAE is better, a model outperforms the baseline when its MAE is lower than the baseline MAE, and machine learning is then applicable to the problem.

## 5. Perform hyperparameter tuning on the best model to optimize
What are hyperparameters and model parameters?
- **Hyperparameters** are best thought of as settings for a machine learning algorithm that the data scientist chooses before training, such as the number of trees in a random forest or the number of neighbors used in K-Nearest Neighbors regression.

- Model **parameters** are what the model learns during training, such as the weights in a linear regression.


Tuning the model hyperparameters controls the balance between underfitting and overfitting. A model that underfits has high bias: it does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target. A model that overfits has high variance: in effect it has memorized the training set and generalizes poorly to new data.
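The effect of a single hyperparameter on this balance can be sketched with a decision tree, which is not one of the five models compared above and is used here only because its `max_depth` setting makes the contrast easy to see. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve (not the original dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# max_depth is a hyperparameter: we choose it before training.
# The split thresholds inside each tree are parameters: the model learns them.
results = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits the training set exactly
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    train_mae = mae(y_train, tree.predict(x_train))
    test_mae = mae(y_test, tree.predict(x_test))
    results[depth] = (train_mae, test_mae)
    print('max_depth=%s: train MAE = %0.4f, test MAE = %0.4f' % (depth, train_mae, test_mae))
```

The depth-1 tree is too simple to follow the curve, so its error is high even on the training set (underfitting). The unrestricted tree reaches a training error of zero by memorizing the training points, yet its test error stays well above zero (overfitting).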

### Hyperparameter Tuning with Random Search and Cross Validation
We can choose the best hyperparameters for a model through random search and cross validation.
- **Random search** refers to the method by which we choose hyperparameters to evaluate: we define a range of options and **randomly select combinations to try**. This is in contrast to **grid search**, which **evaluates every single combination specified**. Generally, random search is better when we have limited knowledge of the best model hyperparameters. We can use random search to narrow down the options and then use grid search with a more limited range of options.


- **Cross validation** is the method used to **assess the performance of the hyperparameters**. Rather than splitting the dataset into fixed training and validation sets, which reduces the amount of training data, we use K-fold cross validation. This means dividing the training data into K folds and then going through an iterative process in which we train on K-1 of the folds and evaluate performance on the remaining fold. We repeat this process K times, so that eventually every example in the training data has been used for evaluation, and in each iteration we evaluate on data that we did not train on. At the end of K-fold cross validation, we take the average error across the K iterations as the final performance measure, then train the model on the full training set.
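The K-fold procedure described above can be written out as an explicit loop. This is a sketch on synthetic data, with K = 4 and a K-Nearest Neighbors regressor chosen only for illustration; the last lines check the loop against scikit-learn's `cross_val_score`, which performs the same procedure in one call.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training data (not the original dataset).
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

# Divide the training data into K = 4 folds.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

fold_maes = []
for train_idx, val_idx in kfold.split(X):
    # Train on K-1 folds, evaluate on the fold that was held out.
    model = KNeighborsRegressor(n_neighbors=10)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_maes.append(np.mean(np.abs(y[val_idx] - pred)))

# The cross-validation score is the average error over the K folds.
cv_mae = np.mean(fold_maes)
print('MAE per fold:', np.round(fold_maes, 4))
print('Cross-validated MAE = %0.4f' % cv_mae)

# The same procedure in one call; scikit-learn reports the negated MAE.
scores = -cross_val_score(
    KNeighborsRegressor(n_neighbors=10), X, y,
    cv=kfold, scoring='neg_mean_absolute_error'
)
```

Scikit-learn negates error metrics so that a higher score is always better, which is why the sign is flipped when reading the result.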


To implement random search with cross validation for a model, say the **Gradient Boosting Regressor**, we first define a grid of hyperparameter values. We then repeat the following steps: randomly sample a set of hyperparameters from the grid, then evaluate it using 4-fold cross validation. Finally, we select the set of hyperparameters with the best performance.
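A minimal sketch of this procedure with scikit-learn's `RandomizedSearchCV` follows. The grid values, the number of sampled combinations (`n_iter=5`), and the synthetic data are assumptions chosen to keep the example small and fast; they are not the settings used in the original project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (not the original dataset).
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)

# Hypothetical grid: 3 * 3 * 3 * 3 = 81 possible combinations.
hyperparameter_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [2, 3, 5],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 4, 6],
}

model = GradientBoostingRegressor(random_state=42)

# Sample 5 of the 81 combinations at random and score each with 4-fold cross validation.
random_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=hyperparameter_grid,
    n_iter=5,
    cv=4,
    scoring='neg_mean_absolute_error',
    random_state=42,
)
random_cv.fit(X, y)

print('Best hyperparameters:', random_cv.best_params_)
print('Best cross-validated MAE = %0.4f' % -random_cv.best_score_)
```

After the search, `random_cv.best_estimator_` holds a model refitted on all the data passed to `fit` using the best combination found. Grid search (`GridSearchCV`) would instead evaluate all 81 combinations, which is why it is usually reserved for a narrowed-down grid.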