# Machine Learning Pipeline Cheat Sheet
---  

## Data Preparation
  Data Preprocessing
    1. Imputation
    2. Outlier Handling
    3. Grouping
  
  Handling Continuous Data
    1. Using It Raw
    2. Using Derived Value
    3. Using Count and Frequency
    4. Binarization
    5. Percentage & Rounding
    6. Polynomial Features
    7. Binning
    8. Statistical Transformation
  
  Handling Categorical Data
    1. Nominal Attributes
    2. Ordinal Attributes
    3. One-Hot Encoding
    4. Bin Counting
    5. Feature Hashing
  
  Feature Selection
    1. Correlation Heatmap
    2. Univariate Selection using Select K-Best w/ Scoring
    3. Recursive Feature Elimination
    4. Select from Model
  
  Handling Imbalanced Data
    1. SMOTE (Oversampling)
    2. Random Undersampling w/ Ensemble Methods
    3. Near Miss (Undersampling)
    4. Cost-based Classification
    5. Probability Threshold
  
  Test and Train Splitting
    1. Stratification
    
### Data Preprocessing

#### Imputation

In [1]:
import numpy as np
import pandas as pd

##### Dropping Missing Values

In [2]:
threshold = 0.7
#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]]

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]

##### Numerical Imputation

In [None]:
#Filling all missing values with 0
data = data.fillna(0)
#Filling missing values with medians of the columns
data = data.fillna(data.median())

##### Categorical Imputation

In [None]:
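#Max fill function for categorical columns
#Assign the result back; fillna(..., inplace=True) on a selected column may not update the frame in recent pandas
data['column_name'] = data['column_name'].fillna(
    data['column_name'].value_counts().idxmax())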

---
#### Handling Outlier

##### Outlier detection with SD
If a value lies more than x standard deviations from the mean, it can be treated as an outlier. What should x be?

There is no universal answer for x, but a value between 2 and 4 is usually practical.

In [None]:
#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

In addition, z-score can be used instead of the formula above.

Z-score (or standard score) standardizes the distance between a value and the mean using the standard deviation.

In [None]:
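#Remove outliers using the z-score, computed within each group of column1
#transform() returns a result aligned with the original index
data['z_score'] = data.groupby('column1')['column2'].transform(
    lambda x: (x - x.mean()) / x.std())
#Keep rows whose z-score lies within 3 standard deviations of the group mean
data = data[(data['z_score'] < 3) & (data['z_score'] > -3)]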

##### Outlier detection with Percentile
Another way to detect outliers is to use percentiles: treat a fixed percentage of the values at the top or the bottom of the distribution as outliers.

The key decision is again the cutoff percentage, which depends on the distribution of your data.

In [None]:
#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

##### An Outlier Dilemma: Drop or Cap
Another option for handling outliers is to cap them instead of dropping them. Capping keeps the data size unchanged, which may help final model performance.
On the other hand, capping distorts the distribution of the data, so it is better not to overdo it.

In [None]:
#Capping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data.loc[data['column'] > upper_lim, 'column'] = upper_lim
data.loc[data['column'] < lower_lim, 'column'] = lower_lim
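
Percentile capping can also be sketched more compactly with `Series.clip`. The toy frame and the column name `column` below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: 1..100 plus two extreme values
data = pd.DataFrame({'column': list(range(1, 101)) + [1000, -500]})

upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

# clip() caps values outside [lower_lim, upper_lim] and keeps every row
capped = data['column'].clip(lower=lower_lim, upper=upper_lim)
```

Unlike dropping, the capped series has the same length as the original column.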

---
#### Grouping

##### Categorical Grouping
The first option is to select the label with the highest frequency. This is the equivalent of a max operation for categorical columns, but ordinary max functions do not return the most frequent value, so you need a lambda function for this purpose.

In [None]:
data.groupby('id').agg(lambda x: x.value_counts().index[0])
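
As a small worked example of this aggregation, here is a sketch on made-up data; the `id` and `city` columns are assumptions, not part of any dataset used here.

```python
import pandas as pd

# Hypothetical data: several categorical observations per id
data = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'city': ['Rome', 'Rome', 'Paris', 'Oslo', 'Oslo'],
})

# value_counts() sorts labels by frequency, so index[0] is the most frequent one
most_frequent = data.groupby('id').agg(lambda x: x.value_counts().index[0])
```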

##### Pivot Table
The second option is to make a pivot table. This approach resembles one-hot encoding, with one difference: instead of binary flags, each cell holds an aggregate of the values for a pair of grouped and encoded columns. It is a good option if you want to go beyond binary flag columns and merge multiple features into more informative aggregated features.

In [None]:
#Pivot table Pandas Example
data.pivot_table(index='column_to_group',
                 columns='column_to_encode',
                 values='aggregation_column',
                 aggfunc='sum',
                 fill_value=0)
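
To make the shape of the result concrete, here is a minimal sketch on made-up data; the `user`, `city` and `days` columns are hypothetical.

```python
import pandas as pd

# Hypothetical visit log: one row per (user, city) visit
visits = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b'],
    'city': ['Rome', 'Rome', 'Paris', 'Rome', 'Oslo'],
    'days': [2, 3, 1, 4, 5],
})

# One row per user, one column per city, cells hold the summed days
pivot = visits.pivot_table(index='user',
                           columns='city',
                           values='days',
                           aggfunc='sum',
                           fill_value=0)
```

Each user ends up with one aggregated feature per city instead of a binary flag.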

##### Numerical Column Grouping
Numerical columns are usually grouped using the sum or mean function. Which one is preferable depends on the meaning of the feature. For example, the mean of a binary column gives a ratio, while the sum of the same column gives the total count.

In [None]:
#sum_cols: List of columns to sum
#mean_cols: List of columns to average
grouped = data.groupby('column_to_group')

sums = grouped[sum_cols].sum().add_suffix('_sum')
avgs = grouped[mean_cols].mean().add_suffix('_avg')

new_df = pd.concat([sums, avgs], axis=1)

For time series, you can group data by day.

In [None]:
pm3_s2_d = pm3_s2_d.set_index(pd.DatetimeIndex(pm3_s2_d['timest'])).groupby(
    pd.Grouper(freq='d')).mean().dropna(how='all')

---
### Handling Continuous Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats

%matplotlib inline

#### Using Raw Measure
Raw measures are typically indicated using numeric variables directly as features without any form of transformation or engineering. Typically these features can indicate values or counts.

In [None]:
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

---
#### Clip Value
If you look closely at the data frame preview above, you can see that several attributes hold raw numeric values that can be used directly. The following snippet shows some of these features.

In [None]:
poke_df[['HP', 'Attack', 'Defense']].head()
poke_df[['HP', 'Attack', 'Defense']].describe()

---
#### Counts and Frequency
Another form of raw measure is a feature that represents the frequency, count, or number of occurrences of a specific attribute. Let’s look at a sample of data from the Million Song Dataset, which records how many times various users have listened to each song.
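
A minimal sketch of working with such count features follows. The small frame is a hypothetical stand-in for the listening data, and the `user_id` and `song_id` column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the listening data
popsong_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'song_id': ['s1', 's2', 's1', 's3'],
    'listen_count': [2, 0, 5, 1],
})

# Raw counts can be used directly as a feature
print(popsong_df['listen_count'].describe())

# They can also be aggregated, e.g. total listens per song
listens_per_song = popsong_df.groupby('song_id')['listen_count'].sum()
```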

---
#### Binarization
Raw frequencies or counts are often not relevant to the problem being solved. For instance, if I’m building a song recommendation system, I may only want to know whether a person has listened to a particular song. The number of listens is not needed, since I am more concerned with which songs they have listened to. In this case, a binary feature is preferred over a count-based feature.


In [None]:
watched = np.array(popsong_df['listen_count']) 
watched[watched >= 1] = 1
popsong_df['watched'] = watched

Alternatively, you can use the scikit-learn `Binarizer`.

In [None]:
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)

---
#### Percentage and Rounding
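Continuous attributes such as proportions or percentages often carry more precision than a model needs. Hence it often makes sense to round off these `high precision percentages into numeric integers`. These integers can then be used directly as raw values or even as categorical (discrete-class based) features.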

In [None]:
items_popularity = pd.read_csv('datasets/item_popularity.csv', encoding='utf-8')
items_popularity['popularity_scale_10'] = np.array(
                   np.round((items_popularity['pop_percent'] * 10)),  
                   dtype='int')
items_popularity['popularity_scale_100'] = np.array(
                  np.round((items_popularity['pop_percent'] * 100)),    
                  dtype='int')
items_popularity

---
#### Feature Interaction and Polynomial Features
A simple linear model describes the relationship between the output and the inputs purely in terms of the individual, separate input features:

$$y = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$

However, in many real-world scenarios it makes sense to also `capture the interactions between these feature variables as a part of the input feature set`. Extending the linear regression formulation above with interaction features gives

$$y = c_1 x_1 + c_2 x_2 + \dots + c_n x_n + c_{11} x_1^2 + c_{22} x_2^2 + c_{12} x_1 x_2 + \dots$$

In [None]:
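from sklearn.preprocessing import PolynomialFeatures
#Assumption: the two input features are the Attack and Defense columns of poke_df
atk_def = poke_df[['Attack', 'Defense']]
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res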

---
#### Binning
Binning, also known as quantization, is used for `transforming continuous numeric features into discrete ones (categories)`. These discrete values can be thought of as categories or bins into which the raw, continuous numeric values are grouped. Each bin represents a specific degree of intensity and therefore covers a specific range of continuous values.

##### Fixed Width

In [None]:
df['Age_band']=0
df.loc[df['Age']<=16,'Age_band']=0
df.loc[(df['Age']>16)&(df['Age']<=32),'Age_band']=1
df.loc[(df['Age']>32)&(df['Age']<=48),'Age_band']=2
df.loc[(df['Age']>48)&(df['Age']<=64),'Age_band']=3
df.loc[df['Age']>64,'Age_band']=4
df.tail()

In [None]:
bin_ranges = [0, 15, 30, 45, 60, 75, 100]
bin_names = [1, 2, 3, 4, 5, 6]
fcc_survey_df['Age_bin_custom_range'] = pd.cut(np.array(
                                              fcc_survey_df['Age']), 
                                              bins=bin_ranges)
fcc_survey_df['Age_bin_custom_label'] = pd.cut(np.array(
                                              fcc_survey_df['Age']), 
                                              bins=bin_ranges,            
                                              labels=bin_names)
# view the binned features 
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round', 
               'Age_bin_custom_range',   
               'Age_bin_custom_label']].iloc[1071:1076]


##### Adaptive
(you can change the criteria to be z-score or SD instead of quartile)

In [None]:
#Quartile boundaries: 0%, 25%, 50%, 75%, 100%
quantile_list = [0, .25, .5, .75, 1.]
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(
                                            fcc_survey_df['Income'], 
                                            q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(
                                            fcc_survey_df['Income'], 
                                            q=quantile_list,       
                                            labels=quantile_labels)

fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_quantile_range', 
               'Income_quantile_label']].iloc[4:9]

In [None]:
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3', 
                             edgecolor='black', grid=False)
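#Quantile values to mark on the histogram (0%, 25%, 50%, 75%, 100%)
quantiles = fcc_survey_df['Income'].quantile([0, .25, .5, .75, 1.])
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')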
ax.legend([qvl], ['Quantiles'], fontsize=10)
ax.set_title('Developer Income Histogram with Quantiles', 
             fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

scikit-learn also provides the `KBinsDiscretizer` class.

In [None]:
import seaborn as sns
from sklearn.preprocessing import KBinsDiscretizer

binner = KBinsDiscretizer(encode='ordinal')
bins = binner.fit_transform(train_df[['sepal.length']])

sns.scatterplot(x=train_df['sepal.length'], y=bins[:, 0])
plt.show()

---
#### Statistical Transformation
Statistical transformations such as the log transform are useful because they help in `stabilizing variance`, bring the data closer to a normal distribution, and make the data independent of the mean based on its distribution.

In [None]:
fcc_survey_df['Income_log'] = np.log((1+ fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]

In [None]:
income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)
fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3', 
                                 edgecolor='black', grid=False)
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', 
             fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)

`scikit-learn` also provides standardization.
There are many approaches to standardizing features. Common approaches are:

* `sklearn.preprocessing.StandardScaler`
* `sklearn.preprocessing.RobustScaler`

`RobustScaler` uses quartiles instead of mean and variance as in `StandardScaler`, so it is more suitable for data with outliers.
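
The difference between the two scalers can be sketched on a single made-up feature with one extreme value. The data below is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical single feature with one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the standard deviation
standard = StandardScaler().fit_transform(x)

# RobustScaler: subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(x)
```

With `RobustScaler` the four ordinary points stay spread around zero, because the median and interquartile range are barely affected by the outlier.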

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_train_df = pd.DataFrame(
    scaler.fit_transform(train_df[features]), 
    index=train_df.index,
    columns=features)
scaled_train_df['variety'] = train_df['variety']

In [None]:
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(8, 8))
for i, c in enumerate(train_df.columns):
    if c == 'variety':
        continue
    sns.violinplot(y='variety', x=c, data=train_df, ax=axes[i][0])
    sns.violinplot(y='variety', x=c, data=scaled_train_df, ax=axes[i][1])
plt.tight_layout()

---
### Handling Categorical Data

In [None]:
import pandas as pd
import numpy as np

#### Nominal Attribute
Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines.

`scikit-learn` provides `LabelEncoder` for this task.

In [None]:
from sklearn.preprocessing import LabelEncoder
gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in 
                  enumerate(gle.classes_)}
genre_mappings
Output
------
{0: 'Action', 1: 'Adventure', 2: 'Fighting', 3: 'Misc',
 4: 'Platform', 5: 'Puzzle', 6: 'Racing', 7: 'Role-Playing',
 8: 'Shooter', 9: 'Simulation', 10: 'Sports', 11: 'Strategy'}

The transformed labels are stored in the `genre_labels` value which we can write back to our data frame.

In [None]:
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]

---
#### Ordinal Attribute
Ordinal attributes are categorical attributes with a sense of order amongst the values.

`scikit-learn` also provides `OrdinalEncoder` for this task.

In [None]:
#manual way
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

Or

In [None]:
from sklearn.preprocessing import OrdinalEncoder
goe = OrdinalEncoder()
#OrdinalEncoder expects 2-D input, so select the column as a DataFrame
gen_labels = goe.fit_transform(poke_df[['Generation']])
#categories_ holds one array of categories per encoded column
gen_mappings = {label: index for index, label in 
                  enumerate(goe.categories_[0])}
gen_mappings
Output
------
{'Gen 1': 0, 'Gen 2': 1, 'Gen 3': 2, 'Gen 4': 3, 'Gen 5': 4, 'Gen 6': 5}

The transformed labels are stored in `gen_labels`, which we can write back to our data frame. Note that `OrdinalEncoder` numbers the categories from 0 and returns floats.

In [None]:
poke_df['GenerationLabel'] = gen_labels.ravel()
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

---
#### One-hot Encoding Scheme
One-hot encoding turns a categorical attribute with m distinct labels into m binary features. For each row, the feature matching its label is 1 and all the others are 0.

`scikit-learn` provides `OneHotEncoder` for this task.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#Assumption: Gen_Label and Lgnd_Label are label-encoded versions of the
#Generation and Legendary columns
gen_le = LabelEncoder()
poke_df['Gen_Label'] = gen_le.fit_transform(poke_df['Generation'])
leg_le = LabelEncoder()
poke_df['Lgnd_Label'] = leg_le.fit_transform(poke_df['Legendary'])

# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

# encode legendary status labels using one-hot encoding scheme
leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_'+str(cls_label) for cls_label in leg_le.classes_]
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)

You can also do this in `pandas` with `get_dummies`.

In [None]:
gen_onehot_features = pd.get_dummies(poke_df['Generation'])
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

---
#### Bin-counting Scheme
The encoding schemes discussed so far work quite well on categorical data in general, but they start causing problems when the number of distinct categories in a feature becomes very large.

Besides this, we also have to deal with the `curse of dimensionality`: with an enormous number of features and not enough representative samples, model performance suffers, often through overfitting.

The bin-counting scheme replaces each category with statistics computed from historical data, so it needs historical data as a prerequisite and is more elaborate than the schemes above.

You can do this by grouping the categories with `pandas.groupby`, counting each group, and then binning the counts.
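
A minimal sketch of the groupby-count-bin recipe follows. The `ip` column, the toy data and the bin edges are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical historical data with a categorical column
hist = pd.DataFrame({'ip': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'a']})

# Count how often each category occurs, aligned with the original rows
counts = hist.groupby('ip')['ip'].transform('count')

# Bin the counts into coarse levels (the bin edges are arbitrary assumptions)
hist['ip_count_bin'] = pd.cut(counts,
                              bins=[0, 1, 3, float('inf')],
                              labels=['rare', 'common', 'frequent'])
```

The encoded feature has three levels no matter how many distinct categories the original column holds.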

---
#### Feature Hashing Scheme
The feature hashing scheme is another useful feature engineering scheme for `dealing with large scale categorical features`.

Hashing schemes work on strings, numbers and other structures like vectors. You can think of the hashed outputs as a finite set of b bins: when the hash function is applied to the same value or category, it is always assigned to the same bin (or subset of bins) out of the b bins, based on the hash value. We can pre-define b, which becomes the final size of the encoded feature vector for each categorical attribute encoded with the feature hashing scheme.

Thus even if we have over 1000 distinct categories in a feature and we set b=10 as the final feature vector size, the output feature set will still have only 10 features as compared to 1000 binary features if we used a one-hot encoding scheme. Let’s consider the Genre attribute in our video game dataset.

`scikit-learn` provides `FeatureHasher` for this task.

In [None]:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=6, input_type='string')
#input_type='string' expects an iterable of strings per sample,
#so wrap each genre in a one-element list
hashed_features = fh.fit_transform(vg_df['Genre'].apply(lambda genre: [genre]))
hashed_features = hashed_features.toarray()
pd.concat([vg_df[['Name', 'Genre']], pd.DataFrame(hashed_features)], 
          axis=1).iloc[1:7]
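
The property that equal categories always land in the same bin can be checked on a few made-up genre values. The list below is hypothetical.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical genre values; each sample is wrapped in a list because
# input_type='string' expects an iterable of strings per sample
genres = ['Action', 'Sports', 'Action', 'Puzzle']

fh = FeatureHasher(n_features=6, input_type='string')
hashed = fh.transform([[g] for g in genres]).toarray()
```

Rows 0 and 2 are identical because both encode `'Action'`, and the output always has 6 columns regardless of how many distinct genres appear.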

---
### Feature Selection

#### Correlation Heatmap
Correlation describes how the features are related to each other or to the target variable.

A heatmap makes it easy to identify which features are most related to the target variable. We will plot a heatmap of the feature correlations using the `seaborn` library.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#get correlations of each features in dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")

#### Univariate Selection using Select K-Best w/ Scoring
The scikit-learn library provides the `SelectKBest` class that can be used with a suite of different statistical tests to select a specific number of features.

There are many statistical methods to measure the dependency between two variables. In this notebook, we will look at a few of the scoring functions available in sklearn.

Classification
* Chi-squared test (`chi2`): often used for frequency variables (counts)
* One-way ANOVA (`f_classif`): used when variables are approximately normally distributed. Also all groups (by labels) have roughly the same variance.
* Mutual Information (`mutual_info_classif`)

Regression
* Pearson Correlation (`f_regression`): used when variables are approximately normally distributed.
* Mutual Information (`mutual_info_regression`)
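
The regression scoring functions listed above plug into `SelectKBest` the same way as the classification ones. Here is a minimal sketch on synthetic data; the dataset and the choice of `k=3` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import (SelectKBest, f_regression,
                                       mutual_info_regression)

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Univariate linear F-test between each feature and the target
X_f = SelectKBest(score_func=f_regression, k=3).fit_transform(X, y)

# Mutual information can also pick up non-linear dependencies
mi_selector = SelectKBest(score_func=mutual_info_regression, k=3).fit(X, y)
X_mi = mi_selector.transform(X)
```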

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
new_X = bestfeatures.fit_transform(X, y)

In [None]:
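from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
new_X = SelectKBest(f_classif, k=2).fit_transform(X, y)
sub2 = cross_val_score(DecisionTreeClassifier(), new_X, y, cv=10, scoring='accuracy')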

#### Recursive Feature Elimination (RFE)
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features.

`RFECV` performs `RFE` in a cross-validation loop to find the optimal number of features.

`scikit-learn` provides both `RFE` and `RFECV`.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier() 
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5,scoring='accuracy')   #5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features :', rfecv.n_features_)
print('Best features :', x_train.columns[rfecv.support_])

In [None]:
# Plot number of features VS. cross-validation scores
import matplotlib.pyplot as plt
# grid_scores_ was removed in scikit-learn 1.2; use cv_results_ instead
mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score of number of selected features")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()

#### Select from Model
You can get the `feature importance` of each feature of your dataset from the feature importance attribute of a fitted model.
Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.

`scikit-learn` provides `SelectFromModel`.

##### L1 Based Feature Selection

In [None]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)

##### Tree-based Feature Selection

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)

#use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

##### Select from Model as a part of Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
  #the l1 penalty requires dual=False in LinearSVC
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

---
### Handling Imbalanced Data

#### Label distribution
Another quick check is to see whether the labels are uniformly distributed (a fancy way of saying that all labels have the same frequency). Most machine learning algorithms work best on balanced datasets.

In [None]:
df['variety'].value_counts()

If the label distribution is not uniform (that is, the dataset is imbalanced), there are several ways to deal with the problem:

#### SMOTE
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
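
The interpolation step described above can be sketched in plain NumPy. This is an illustration of the idea for a single synthetic point, not the `imblearn` implementation; the minority points and `k` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
k = 2

# Pick a minority instance a at random
a_idx = rng.integers(len(minority))
a = minority[a_idx]

# Find its k nearest minority neighbors (position 0 is a itself, distance 0)
dists = np.linalg.norm(minority - a, axis=1)
neighbor_idx = np.argsort(dists)[1:k + 1]

# Pick one neighbor b at random and interpolate along the segment a -> b
b = minority[rng.choice(neighbor_idx)]
lam = rng.random()  # uniform in [0, 1)
synthetic = a + lam * (b - a)
```

Because `lam` lies between 0 and 1, the synthetic point always falls on the line segment between `a` and `b`.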

`imblearn` provides `SMOTE`.

In [None]:
# Oversample with SMOTE and random undersample for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from matplotlib import pyplot
from numpy import where

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, 
                           n_redundant=0,n_clusters_per_class=1, 
                           weights=[0.99], flip_y=0, random_state=1)

# summarize class distribution
counter = Counter(y)
print(counter)

# define pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

# transform the dataset
X, y = pipeline.fit_resample(X, y)

# summarize the new class distribution
counter = Counter(y)
print(counter)

# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

##### SMOTE in Pipeline

In [None]:
# decision tree  on imbalanced dataset with SMOTE oversampling and random undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, 
                           n_redundant=0, n_clusters_per_class=1, 
                           weights=[0.99], flip_y=0, random_state=1)

# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))

#### Random Undersampling w/ Ensemble Methods

Oversampling the minority class in the bootstrap sample is referred to as OverBagging, undersampling the majority class in the bootstrap sample is referred to as UnderBagging, and combining both approaches is referred to as OverUnderBagging.

The `imbalanced-learn` library provides an implementation of UnderBagging.

Specifically, it provides a version of bagging that uses a random undersampling strategy on the majority class within a bootstrap sample in order to balance the two classes. This is provided in the `BalancedBaggingClassifier` class.

In [None]:
# bagged decision trees with random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import BalancedBaggingClassifier

# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, 
                           n_redundant=0, n_clusters_per_class=1, 
                           weights=[0.99], flip_y=0, random_state=4)

# define model
model = BalancedBaggingClassifier()

# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

### Data Splitting

#### Test Train Splitting

Test data = Black Box. Don't touch it. It is meant for testing only.
Train data = Do whatever you want with it.

In [None]:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

# Load the Diabetes dataset
columns = "age sex bmi map tc ldl hdl tch ltg glu".split()
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=columns)
y = diabetes.target

# create training and testing vars
# the target here is continuous, so stratify is omitted
# (for classification labels, pass stratify=y)
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

#### Stratification for Imbalanced Data

`scikit-learn` provides the `StratifiedKFold` class for this.

It keeps the proportion of each class in every fold roughly the same as in the full dataset. This ensures that no class is over- or underrepresented in a fold, which matters especially when the target variable is imbalanced.

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
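
To see the effect of stratification, here is a sketch that counts the classes in each test fold. The imbalanced labels below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class land in this test fold
    fold_counts.append(np.bincount(y[test_idx]).tolist())
```

Every test fold keeps the 2:1 class ratio of the full dataset.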

## Visualization

## Modeling
  Classification
    1. Decision Tree Classifier (> 100K Sample)
    2. SGD Classifier
    3. Logistic Regression
    4. KNN (< 100K Sample)
    5. Linear SVC
    6. SVC (Non-linear Kernel)
    7. Ensemble Classifier
    8. Naive Bayes (Text Data)
    9. Neural Network
    
  Regression
    1. SGD Regressor (> 100K Sample)
    2. Linear Model
    3. Ridge Regression (< 100K Sample)
    4. SVR (Linear Kernel)
    5. SVR (Non-linear Kernel)
    6. Ensemble Regressor
    7. Lasso
    8. Elastic Net
    9. Neural Network
    
  Clustering
    1. Minibatch K-means (know K)
    2. K-means
    3. Spectral Clustering 
    4. Gaussian Mixture Modeling
    5. Meanshift (Don't know K)
    6. DBSCAN
    
  Dimensionality Reduction
    1. Randomized PCA
    2. Isomap
    3. Spectral Embedding
    4. Locally Linear Embedding (LLE)
  
  Pipeline
  
  GridSearch CV
  
  Examples

***
### Classification Model

#### Decision Tree Classifier

In [None]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 

# Create Decision Tree classifier object
model = DecisionTreeClassifier()

# Train Decision Tree classifier
model = model.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = model.predict(X_test)

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

#### SGD Classifier

In [None]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
model.fit(X, y)

#### Logistic Regression

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

#### K-Nearest Neighbours

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

#### Linear SVC

In [None]:
from sklearn.svm import LinearSVC

model = LinearSVC(random_state=0, tol=1e-5)
model.fit(X, y)

#### SVC w/ Non-linear Kernel

In [None]:
from sklearn.svm import SVC

model = SVC(gamma='auto')
model.fit(X, y)

#### Ensemble Classifiers

##### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)

##### Adaboost

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

# Load the iris data so that X and y are defined
X, y = load_iris(return_X_y=True)

model = AdaBoostClassifier(n_estimators=100)

scores = cross_val_score(model, X, y, cv=5)
scores.mean()

##### Gradient Boosting Tree

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100,
                                   learning_rate=1.0,
                                   max_depth=1,
                                   random_state=0)
model.fit(X, y)

#### Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier

# 5 nodes in hidden layer 1
# 2 nodes in hidden layer 2
model = MLPClassifier(solver='lbfgs',
                      alpha=1e-5,
                      hidden_layer_sizes=(5, 2),
                      random_state=1)
model.fit(X, y)

#### Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)

***
### Regression Model

#### Linear Regressor

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)

#### SGD Regressor

In [None]:
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X, y)

#### Ridge Regressor

In [None]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X, y)

#### Linear SVR

In [None]:
from sklearn.svm import LinearSVR

model = LinearSVR(random_state=0, tol=1e-5)
model.fit(X, y)

#### Non-linear SVR

In [None]:
from sklearn.svm import SVR

model = SVR(C=1.0, epsilon=0.2)
model.fit(X, y)

#### Ensemble Regressor

##### Bagging Regressor

In [None]:
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# The parameter was called base_estimator before scikit-learn 1.2
model = BaggingRegressor(estimator=SVR(),
                         n_estimators=100,
                         random_state=0)
model.fit(X, y)

##### Adaboost

In [None]:
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X, y)

##### Gradient Boosting Tree

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

# loss='squared_error' was named 'ls' before scikit-learn 1.0
model = GradientBoostingRegressor(n_estimators=100,
                                  learning_rate=0.1,
                                  max_depth=1,
                                  random_state=0,
                                  loss='squared_error')
model.fit(X_train, y_train)
mean_squared_error(y_test, model.predict(X_test))

#### Lasso Regressor

In [None]:
from sklearn import linear_model

model = linear_model.Lasso(alpha=0.1)
model.fit(X, y)

#### Elastic Net

In [None]:
from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(cv=5, random_state=0)
model.fit(X, y)

#### Neural Network

In [None]:
from sklearn.neural_network import MLPRegressor

# One hidden layer with 100 nodes
model = MLPRegressor(solver='sgd',
                     alpha=1e-5,
                     hidden_layer_sizes=(100, ),
                     random_state=1)
model.fit(X, y)

***
### Clustering

#### MiniBatch K-means

In [None]:
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=2,
                        random_state=0,
                        batch_size=6,
                        max_iter=10)
model.fit(X)

#### K-means Clustering

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=2, random_state=0, max_iter=10)
model.fit(X)

#### Spectral Clustering

In [None]:
from sklearn.cluster import SpectralClustering

model = SpectralClustering(n_clusters=2,
                           assign_labels="discretize",
                           random_state=0)
model.fit(X)

#### Gaussian Mixture Model

In [None]:
# GaussianMixture replaces the older GMM class, which was removed in scikit-learn 0.20
from sklearn.mixture import GaussianMixture

model = GaussianMixture(n_components=2)
model.fit(X)

#### Meanshift Clustering

In [None]:
from sklearn.cluster import MeanShift

model = MeanShift(bandwidth=2)
model.fit(X)

#### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=3, min_samples=2)
model.fit(X)
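
The clustering cells above stop at `fit`. A minimal sketch of reading the result back, assuming a small hand-made array (the toy points are hypothetical and chosen only to form two tight groups plus one isolated point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one isolated point
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

model = DBSCAN(eps=3, min_samples=2).fit(X)

# One cluster id per row; DBSCAN marks noise points with -1
labels = model.labels_

# Count clusters without counting the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```

The other fitted clustering estimators in this section also expose `labels_`, except `GaussianMixture`, where `model.predict(X)` returns the component assignments.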

***
### Dimensionality Reduction

#### Randomized PCA

In [None]:
from sklearn.decomposition import PCA

model = PCA(n_components=2)
model.fit(X)

In [None]:
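from sklearn.decomposition import PCA

# RandomizedPCA was removed in scikit-learn 0.20;
# the randomized solver is now selected through svd_solver
model = PCA(n_components=2, svd_solver='randomized', random_state=0)
model.fit(X)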
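
A fitted PCA model is usually inspected for how much variance the kept components retain. A minimal sketch, assuming the iris data as a stand-in for `X` (the dataset choice is an assumption for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # stand-in data: 150 rows, 4 features

model = PCA(n_components=2, svd_solver='randomized', random_state=0)
X_2d = model.fit_transform(X)       # project onto the first two components

# Fraction of total variance captured by each kept component
ratio = model.explained_variance_ratio_
```

If `ratio.sum()` is well below 1, two components are probably discarding structure worth keeping, and a larger `n_components` may be appropriate.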

#### Isomap

In [None]:
from sklearn.manifold import Isomap

embedding = Isomap(n_components=2)
X_transformed = embedding.fit_transform(X)

#### Spectral Embedding

In [None]:
from sklearn.manifold import SpectralEmbedding

embedding = SpectralEmbedding(n_components=2)
X_transformed = embedding.fit_transform(X)

#### Locally Linear Embedding (LLE)

In [None]:
from sklearn.manifold import LocallyLinearEmbedding

embedding = LocallyLinearEmbedding(n_components=2)
X_transformed = embedding.fit_transform(X)

***
### Pipelining

In [None]:
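import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import KFold, cross_validate
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed 5-fold splitter; adjust to match your data
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
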
model = make_pipeline(
    PolynomialFeatures(degree=2), preprocessing.StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100, ),
                 solver='sgd',
                 alpha=0.0001,
                 max_iter=100000))

scores = cross_validate(model,
                        X,
                        y,
                        cv=kfold,
                        scoring='neg_mean_squared_error',
                        return_train_score=True)

a = pd.DataFrame(scores)
a['fold'] = a.index
a = a.drop(columns=['fit_time', 'score_time'])
a['test_score'] = a['test_score'] * -1
a['train_score'] = a['train_score'] * -1
b = pd.melt(a, id_vars='fold', var_name='split', value_name='score')
sns.barplot(x='fold',
            y='score',
            hue='split',
            hue_order=['train_score', 'test_score'],
            data=b)
__ = plt.title((
    f'Train avg MSE = {a["train_score"].mean():.3} +/- {a["train_score"].std() * 2:.3}\n'
    f'Test avg MSE = {a["test_score"].mean():.3} +/- {a["test_score"].std() * 2:.3}'
))

### GridSearch CV

In [None]:
from sklearn import datasets
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# Load data

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Grid Search
# A single scorer is used, so refit=True refits the best model on that score
gs = GridSearchCV(DecisionTreeClassifier(),
                  param_grid={'min_samples_split': range(2, 15, 1)},
                  scoring='accuracy',
                  refit=True,
                  return_train_score=True)
gs.fit(X, y)
results = gs.cv_results_

# Plot
plt.figure(figsize=(10, 4))
plt.title("GridSearchCV: accuracy vs. min_samples_split",
          fontsize=16)

plt.xlabel("min_samples_split")
plt.ylabel("Score")

ax = plt.gca()

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_min_samples_split'].data, dtype=float)

scorer = 'score'
color = 'g'
for sample, style in (('train', '--'), ('test', '-')):
    sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
    sample_score_std = results['std_%s_%s' % (sample, scorer)]
    ax.fill_between(X_axis,
                    sample_score_mean - sample_score_std,
                    sample_score_mean + sample_score_std,
                    alpha=0.1 if sample == 'test' else 0,
                    color=color)
    ax.plot(X_axis,
            sample_score_mean,
            style,
            color=color,
            alpha=1 if sample == 'test' else 0.7,
            label="%s (%s)" % (scorer, sample))

best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
best_score = results['mean_test_%s' % scorer][best_index]

plt.legend(loc="best")
plt.grid(False)
plt.show()
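
Besides plotting `cv_results_`, the winning setting can be read directly from the fitted search object. A minimal sketch, assuming the same iris data and parameter grid as above (`random_state=0` is added here only to make the tree deterministic):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid={'min_samples_split': range(2, 15)},
                  scoring='accuracy')
gs.fit(X, y)

best_params = gs.best_params_     # dict with the winning min_samples_split
best_score = gs.best_score_       # mean cross-validated accuracy of that setting
best_model = gs.best_estimator_   # refit on all of X, y (refit=True by default)
```

`best_model` can then be used like any fitted estimator, for example `best_model.predict(X_new)` on unseen rows.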