# Machine Learning Workflow: Parts 1 and 2

A typical machine learning project follows the pipeline below, although the detailed implementation can vary. For example, we often iterate over some of the steps, such as feature engineering and selection.
1. Data cleaning and formatting
2. Exploratory Data Analysis (EDA)
3. Feature engineering and selection
4. Split dataset and compare different models on a performance metric
5. Perform hyperparameter tuning on the best model
6. Evaluate the model
7. Interpret the model results

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# set default font size
plt.rcParams['font.size'] = 24

from IPython.core.pylabtools import figsize

import seaborn as sns
sns.set(font_scale = 2)

# imputing missing values and scaling values
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

import graphviz

## 1. Data Cleaning and formatting
### 1.1 Check the data types
For some datasets, the default index is just a range of numbers that carries no meaning. We can replace it with the 'id' column when loading the file, as follows:
- pd.read_csv('filename with path', index_col = column_with_id)


- data.info()
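
The two calls above can be tried on a small in-memory file. This is only a sketch: the CSV contents and the column names `id`, `score`, and `city` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV contents, standing in for a real file on disk
csv_text = """id,score,city
101,88.5,Boston
102,Not Available,Austin
103,72.0,Denver
"""

# index_col tells pandas to use the 'id' column as the row index
data = pd.read_csv(io.StringIO(csv_text), index_col="id")

# info() lists each column's dtype and its count of non-null entries
data.info()

# 'score' is not read as a number because of the 'Not Available' string
print(data.dtypes)
```

Here `score` comes back as a text column, which is the situation section 1.2 deals with.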

### 1.2 Convert to correct types
a) in 'object' columns that hold numbers, replace the 'Not Available' placeholder with np.nan:
- data = data.replace({'Not Available': np.nan})

b) convert 'object' columns to numeric:
- data[column] = pd.to_numeric(data[column])

c) convert date columns to datetime:
- data[datetime] = pd.to_datetime(data[datetime])
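
A minimal sketch of steps a) to c) on a toy DataFrame. The column names `energy_use` and `recorded_on` are hypothetical, and `pd.to_numeric` is one of several ways to do the numeric conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: numbers stored as text, with a placeholder string
data = pd.DataFrame({
    "energy_use": ["10.5", "Not Available", "7.25"],
    "recorded_on": ["2020-01-15", "2020-02-01", "2020-03-20"],
})

# a) turn the placeholder into a real missing value
data = data.replace({"Not Available": np.nan})

# b) convert the text column to floats; missing entries stay NaN
data["energy_use"] = pd.to_numeric(data["energy_use"])

# c) parse the date strings into datetime values
data["recorded_on"] = pd.to_datetime(data["recorded_on"])

print(data.dtypes)
```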

### 1.3 Define the target

### 1.4 Check missing values
Missing values are fine during exploratory data analysis (EDA), but they have to be filled in before training machine learning models.
- method 1: using a heatmap
    - sns.heatmap(data.isnull(), yticklabels=False, cbar=False)

- method 2: using a user-defined function
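
One possible shape for such a function is sketched below. The name `missing_values_table` and its output columns are assumptions for illustration, not something defined elsewhere in these notes.

```python
import numpy as np
import pandas as pd

def missing_values_table(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percents = 100 * counts / len(df)
    table = pd.DataFrame({"missing": counts, "percent": percents})
    # keep only columns with at least one missing value, worst first
    table = table[table["missing"] > 0]
    return table.sort_values("percent", ascending=False)

# Hypothetical data: 'a' is half empty, 'b' is complete, 'c' has one gap
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
table = missing_values_table(data)
print(table)
```

A table like this makes it easy to decide which columns to drop and which to impute.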

### 1.5 Check for outliers and remove them
- method 1: using a built-in function
    - data.describe()
    
- method 2: using a user-defined rule:

Outliers are extreme values. One definition is any value that lies more than 3 times the interquartile range below the first quartile or above the third quartile. The lower bound is therefore 'First Quartile - 3 * Interquartile Range', and the upper bound is 'Third Quartile + 3 * Interquartile Range'.
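
The rule above can be sketched as follows. The data and the column name `target` are made up, and the quartiles come from pandas' default (linear interpolation) `quantile`.

```python
import pandas as pd

# Hypothetical data with one extreme value
data = pd.DataFrame({"target": [10, 12, 11, 13, 12, 14, 11, 500]})

q1 = data["target"].quantile(0.25)   # first quartile
q3 = data["target"].quantile(0.75)   # third quartile
iqr = q3 - q1                        # interquartile range

lower = q1 - 3 * iqr
upper = q3 + 3 * iqr

# keep only the rows whose value lies inside [lower, upper]
cleaned = data[(data["target"] >= lower) & (data["target"] <= upper)]
print(lower, upper, len(cleaned))
```

With a multiplier of 3 only very extreme points are removed; the value 500 is dropped here while the rest of the rows are kept.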

## 2. EDA
### 2.1 Single-variable plots (distribution of the target)
- plt.hist(data['target'].dropna())

- data[target].hist(bins=?, edgecolor='black')  # usually the better option: the bar edges are outlined

### 2.2 Relationships between categorical variables and target
To look at the effect of a categorical variable on the target, we can make density plots colored by the value of the categorical variable. To avoid cluttering the plot, we sometimes limit it to categories that have more than n observations in the dataset.
#### choose the categories that have counts > n
- types = data.dropna(subset = ['target'])
- types = types['??'].value_counts()
- types = list(types[types.values > n].index)
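
A small sketch of the filtering step, assuming a hypothetical categorical column `building_type` and a threshold of `n = 2`. Each resulting subset could then be passed to a density plot such as `sns.kdeplot`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'building_type' is the categorical column
data = pd.DataFrame({
    "building_type": ["office"] * 5 + ["hotel"] * 3 + ["school", "office"],
    "target": [60, 62, 58, 61, 63, 70, 72, 71, 40, np.nan],
})
n = 2  # minimum number of observations per category

# count categories only among rows with a known target
types = data.dropna(subset=["target"])
types = types["building_type"].value_counts()
types = list(types[types.values > n].index)
print(types)

# one target series per retained category, ready for plotting
subsets = {
    t: data.loc[data["building_type"] == t, "target"].dropna()
    for t in types
}
```

The single "school" row is excluded, so the plot would show only the two well-populated categories.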

To check the relationship between a categorical variable and the target, we can also use a bar plot:
- sns.barplot(x=categorical variable, y=target, data=?, ax=ax[0 or 1])


The ax[0 or 1] argument matters when the figure contains more than one subplot: it selects the subplot in which the bar plot is drawn.
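
To illustrate how the `ax` argument places a plot, here is a sketch that uses plain matplotlib and pandas rather than `sns.barplot`. The bars show the mean target per category, and the column names `kind` and `region` are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "kind": ["a", "a", "b", "b", "c", "c"],
    "region": ["n", "s", "n", "s", "n", "s"],
    "target": [1.0, 3.0, 4.0, 6.0, 2.0, 2.0],
})

# one figure with two side-by-side axes; ax is an array of length 2
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# mean target per category: left plot goes to ax[0], right plot to ax[1]
data.groupby("kind")["target"].mean().plot.bar(ax=ax[0], title="by kind")
data.groupby("region")["target"].mean().plot.bar(ax=ax[1], title="by region")
fig.tight_layout()
```

Passing `ax=ax[0]` or `ax=ax[1]` to `sns.barplot` works the same way.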

### 2.3 Correlation coefficients between all variables(features) and target
This quantifies the correlation between each feature and the target. The correlation coefficient gives the strength and direction of a linear relationship between two variables.
#### a) find the correlation coefficients
- correlation_data = data.corr()[target].sort_values()


- correlation_data.head(10)  (This is to find the most negative correlations)


- correlation_data.tail(10)  (This is to find the most positive correlations)
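
A minimal sketch of the sorting step on made-up data. Selecting the numeric columns first is an added precaution here, since `corr()` can fail on text columns in recent pandas versions.

```python
import pandas as pd

data = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up": [2.0, 4.1, 5.9, 8.2, 10.0],     # rises with the target
    "down": [9.0, 7.2, 4.9, 3.1, 1.0],    # falls as the target rises
    "label": ["a", "b", "a", "b", "a"],   # non-numeric, excluded below
})

# restrict to numeric columns before computing correlations
numeric = data.select_dtypes("number")
correlation_data = numeric.corr()["target"].sort_values()

print(correlation_data.head(1))  # most negative correlation
print(correlation_data.tail(2))  # most positive; the target itself comes last
```

Because the values are sorted in ascending order, the target's correlation with itself always appears at the very end of the tail.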

To account for possible non-linear relationships, we can take square root and log transformations of the variables and then calculate their correlation coefficients with the target.
#### b) take square root and log transformations (numeric columns)
- numeric_subset = data.select_dtypes('number')
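
One way to add the transformed columns is sketched below. The `sqrt_` and `log_` prefixes and the toy column `area` are assumptions. Note that `np.log` of zero gives `-inf`, which is why infinite values are replaced with NaN before plotting in section 2.4.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0],
    "area": [1.0, 4.0, 9.0, 16.0],
    "label": ["a", "b", "c", "d"],
})
target = "target"

numeric_subset = data.select_dtypes("number").copy()

# add sqrt_ and log_ versions of every numeric feature except the target
for col in list(numeric_subset.columns):
    if col == target:
        continue
    numeric_subset["sqrt_" + col] = np.sqrt(numeric_subset[col])
    numeric_subset["log_" + col] = np.log(numeric_subset[col])

print(numeric_subset.corr()[target].sort_values())
```

In this toy case the target is exactly the square root of `area`, so `sqrt_area` correlates more strongly with it than the raw column does.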

#### c) use one-hot encoding to capture possible relationships between the categorical variables and the target (categorical columns)
- categorical_subset = data[['??', '??', '??'...]]


- categorical_subset = pd.get_dummies(categorical_subset)

#### d) Join the above (b) and (c) together
- new_features = pd.concat([numeric_subset, categorical_subset], axis = 1)


- new_features = new_features.dropna(subset=[target]) (This drops the rows that have no target value)


- new_correlations = new_features.corr()[target].dropna().sort_values()


- new_correlations.head(10)  (the most negative correlations)


- new_correlations.tail(10)  (the most positive correlations)
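
Steps c) and d) can be sketched end to end on toy data. The columns `area` and `kind` are made up, and `dtype=float` is passed to `get_dummies` only so that the indicator columns are plainly numeric.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "target": [10.0, 12.0, 30.0, 31.0, np.nan],
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "kind": ["flat", "flat", "house", "house", "flat"],
})
target = "target"

numeric_subset = data.select_dtypes("number")
# one-hot encode the categorical column: one 0/1 column per category
categorical_subset = pd.get_dummies(data[["kind"]], dtype=float)

# join side by side, then drop the rows without a target value
new_features = pd.concat([numeric_subset, categorical_subset], axis=1)
new_features = new_features.dropna(subset=[target])

new_correlations = new_features.corr()[target].dropna().sort_values()
print(new_correlations)
```

The indicator columns `kind_flat` and `kind_house` now appear in the correlation list alongside the numeric features.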

### 2.4 Two-variable relationship plots (between numerical columns)
#### a) for several numerical variables (more than 2)
- plot_data = new_features[['??', '??', '??']] (This selects the features to plot)
- plot_data = plot_data.replace({np.inf:np.nan, -np.inf:np.nan}) (This replaces inf with nan)
- plot_data = plot_data.dropna()

In [4]:
# define a function that annotates the current axes with the correlation coefficient of two columns
def corr_coef(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate('r = {:.2f}'.format(r), xy = (.2, .8), xycoords = ax.transAxes, size=20)

- grid = sns.PairGrid(data = plot_data, height = 3)  (older seaborn versions call this argument size)


- grid.map_upper(plt.scatter)


- grid.map_diag(plt.hist, edgecolor = 'black')


- grid.map_lower(corr_coef)


- grid.map_lower(sns.kdeplot)


- plt.suptitle('???', size=36)
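
Before building the full grid, the `corr_coef` helper can be checked on a single pair of axes. This sketch uses made-up data and adds a `return r`, which the version above does not have, purely so the value can be inspected.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

def corr_coef(x, y, **kwargs):
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate('r = {:.2f}'.format(r), xy=(.2, .8),
                xycoords=ax.transAxes, size=20)
    return r  # returned only for this check

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

fig, ax = plt.subplots()
ax.scatter(x, y)
r = corr_coef(x, y)  # writes the annotation onto the current axes
```

Using `xycoords=ax.transAxes` places the label at 20% of the width and 80% of the height of each panel, regardless of the data range.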

#### b) for two numerical variables
Use a scatter plot to visualize the relationship between two variables. We can also include an additional variable by using the color or size of the markers to represent a third, categorical variable.
- sns.lmplot(x = numerical feature, y = target, hue = categorical feature, data = new_features, scatter_kws = {'alpha':0.8})
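
The idea of coloring points by a third, categorical variable can also be sketched with plain matplotlib instead of `sns.lmplot`. Unlike `lmplot`, this version draws no regression line, and the column names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

new_features = pd.DataFrame({
    "area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [10.0, 12.0, 15.0, 30.0, 33.0, 37.0],
    "kind": ["flat", "flat", "flat", "house", "house", "house"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# one scatter call per category, so each gets its own color and legend entry
for kind, group in new_features.groupby("kind"):
    ax.scatter(group["area"], group["target"], alpha=0.8, label=kind)
ax.set_xlabel("area")
ax.set_ylabel("target")
ax.legend(title="kind")
```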