## (Datacamp) Designing Machine Learning Workflows in Python

1. Feature engineering
Most classifiers expect numeric features
Need to convert string columns to numbers
- Preprocess using LabelEncoder from sklearn.preprocessing
le = LabelEncoder()
le.fit_transform(column)

2. Model fitting
.fit(features, labels)
.predict(features)

3. Model Selection
.fit() optimises the parameters of the given model; hyperparameters such as n_estimators are chosen separately, e.g. with GridSearchCV

4. Performance assessment
Need to avoid overfitting: assess accuracy on held-out test data
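
The four steps above can be sketched end to end. This is a minimal sketch on a made-up toy frame (the `toy` DataFrame, its values, and the names `X_toy`, `y_toy`, and `search` are assumptions for illustration, not the credit data used below).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one string feature, one numeric feature, one label
toy = pd.DataFrame({
    'housing': ['own', 'rent', 'own', 'free', 'rent', 'own'] * 10,
    'duration': [6, 48, 12, 24, 36, 18] * 10,
    'label': ['good', 'bad', 'good', 'good', 'bad', 'good'] * 10,
})

# 1. Feature engineering: convert the string column to integers
toy['housing'] = LabelEncoder().fit_transform(toy['housing'])
X_toy = toy[['housing', 'duration']]
y_toy = toy['label']

# Hold out test data before any fitting (needed for step 4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# 2 and 3. Model fitting and model selection: GridSearchCV calls .fit()
# for every candidate n_estimators and keeps the best one
search = GridSearchCV(
    RandomForestClassifier(random_state=0), {'n_estimators': [5, 10]}, cv=3)
search.fit(X_tr, y_tr)

# 4. Performance assessment on data the model has not seen
test_accuracy = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_, test_accuracy)
```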


- Scalable ways to tune your pipeline
- Making sure your predictions are relevant by involving domain experts
- Making sure your model continues to perform well over time

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Feature engineering

In [4]:
credit = pd.read_csv('/Users/ingeonhwang/Desktop/1.Yonsei_bigdata_analysis/1.Class_material/5.머신러닝_박홍규/data/credit.csv')

In [6]:
# Inspect the first few lines of your data using head()
credit.head(3)


Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,'<0',6,'critical/other existing credit',buy_radio_tv,1169,'no known savings','>=7',4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,'0<=X<200',48,'existing paid',buy_radio_tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good


In [7]:
non_numeric_columns = ['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status',
                       'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker']

In [8]:
from sklearn.preprocessing import LabelEncoder
# Create a label encoder for each column. Encode the values
for column in non_numeric_columns:
    le = LabelEncoder()
    credit[column] = le.fit_transform(credit[column])

# Inspect the data types of the columns of the data frame
print(credit.dtypes)

checking_status            int64
duration                   int64
credit_history             int64
purpose                    int64
credit_amount              int64
savings_status             int64
employment                 int64
installment_commitment     int64
personal_status            int64
other_parties              int64
residence_since            int64
property_magnitude         int64
age                        int64
other_payment_plans        int64
housing                    int64
existing_credits           int64
job                        int64
num_dependents             int64
own_telephone              int64
foreign_worker             int64
class                     object
dtype: object
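
The loop above reuses the name `le`, so only the last encoder survives. If the original strings are needed later, one option is to keep one encoder per column. A minimal sketch, assuming a made-up miniature frame (`toy` and the `encoders` dict are hypothetical names, not part of the course code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame standing in for the credit data
toy = pd.DataFrame({
    'housing': ['own', 'rent', 'own', 'for free'],
    'job': ['skilled', 'skilled', 'unskilled resident', 'skilled'],
})

# Keep one fitted encoder per column instead of overwriting a single one
encoders = {}
for column in ['housing', 'job']:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ holds the original strings; position i is the string encoded as i
print(encoders['housing'].classes_)

# inverse_transform maps the integer codes back to the original strings
restored = encoders['housing'].inverse_transform(toy['housing'])
print(list(restored))
```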


#### Your first pipeline

In [9]:
X = credit[['checking_status', 'duration', 'credit_history', 'purpose', 'credit_amount', 'savings_status', 'employment', 'installment_commitment', 'personal_status', 'other_parties', 'residence_since',
            'property_magnitude', 'age', 'other_payment_plans', 'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone', 'foreign_worker']]
y = credit[['class']]

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Split the data into train and test, with 20% as test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Create a random forest classifier, fixing the seed to 2
rf_model = RandomForestClassifier(random_state=2, n_estimators=10).fit(
    X_train, y_train.values.ravel())

# Use it to predict the labels of the test data
rf_predictions = rf_model.predict(X_test)

# Assess the accuracy of the classifier on the test data
accuracy_score(y_test, rf_predictions)

0.74
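
A test accuracy on its own does not show whether the model overfits. One common check is to compare it with the accuracy on the training data. The sketch below does this on synthetic data from `make_classification` (an assumption, used so the example runs without credit.csv); a large gap between the two numbers would suggest overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (assumption for illustration)
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1)

model = RandomForestClassifier(n_estimators=10, random_state=2).fit(X_tr, y_tr)

# Accuracy on data the model was fitted on versus data it has never seen
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```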

#### Grid search CV for model complexity

In [11]:
from sklearn.model_selection import GridSearchCV
# Set a range for n_estimators from 10 to 40 in steps of 10
param_grid = {'n_estimators': range(10, 50, 10)}

# Optimize for a RandomForestClassifier() using GridSearchCV
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid.fit(X, y.values.ravel())
grid.best_params_

{'n_estimators': 40}

In [12]:
from sklearn.ensemble import AdaBoostClassifier
# Define a grid for n_estimators ranging from 1 to 10
param_grid = {'n_estimators': range(1, 11)}

# Optimize for a AdaBoostClassifier() using GridSearchCV
grid = GridSearchCV(AdaBoostClassifier(), param_grid, cv=3)
grid.fit(X, y.values.ravel())
grid.best_params_

{'n_estimators': 10}

In [13]:
from sklearn.neighbors import KNeighborsClassifier
# Define a grid for n_neighbors with values 10, 50 and 100
param_grid = {'n_neighbors': [10, 50, 100]}

# Optimize for KNeighborsClassifier() using GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid.fit(X, y.values.ravel())
grid.best_params_

{'n_neighbors': 50}
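
`best_params_` only reports the winner. To see how close the other candidates were, the fitted search also exposes `best_score_` and `cv_results_`. A minimal sketch on synthetic data (the data and the name `results` are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the credit data (assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'n_neighbors': [10, 50, 100]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid.fit(X_syn, y_syn)

# One row per candidate, with the mean and spread of the 3 fold scores
results = pd.DataFrame(grid.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score']]
print(results)

# best_score_ is the mean cross-validated score of the best candidate
print(grid.best_params_, grid.best_score_)
```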

#### Number of trees and estimators
Random forests are an ensemble over a large number of decision trees. The number of trees is controlled by the parameter n_estimators. The exercise refers to a heatmap of the accuracy of a random forest classifier, with different values of maximum depth (max_depth) on the vertical axis and different numbers of estimators (n_estimators) on the horizontal axis.

In this case, the best-performing tree depth increases as the number of estimators grows. This is in fact what tends to happen in most cases.
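
The accuracy grid behind such a heatmap can be computed with a grid search over both parameters. This is a sketch on synthetic data, not the course's own plot; the parameter values and the name `accuracy_grid` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the credit data (assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

# Search over every combination of max_depth and n_estimators
param_grid = {'max_depth': [2, 5, 10], 'n_estimators': [10, 20, 40]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_syn, y_syn)

results = pd.DataFrame(grid.cv_results_)
results['param_max_depth'] = results['param_max_depth'].astype(int)
results['param_n_estimators'] = results['param_n_estimators'].astype(int)

# Rows: max_depth (vertical axis); columns: n_estimators (horizontal axis)
accuracy_grid = results.pivot(
    index='param_max_depth',
    columns='param_n_estimators',
    values='mean_test_score')
print(accuracy_grid)
```

The resulting table can be passed to a plotting call such as `plt.imshow(accuracy_grid)` to draw it as a heatmap.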

#### Categorical encodings

In [14]:
# 1) Use LabelEncoder(); in the end, one-hot encoding will be used instead of this

# Create numeric encoding for credit_history
credit_history_num = LabelEncoder().fit_transform(
  credit['credit_history'])

# Create a new feature matrix including the numeric encoding
X_num = pd.concat([X, pd.Series(credit_history_num)], axis = 1)

# Create new feature matrix with dummies for credit_history
X_hot = pd.concat(
  [X, pd.get_dummies(credit['credit_history'])],axis = 1)

# Compare the number of features of the resulting DataFrames
X_hot.shape[1] > X_num.shape[1]

True
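
The difference between the two encodings is easier to see on a handful of values. A minimal sketch, assuming a made-up four-row series (the values in `history` are illustrative, not read from credit.csv): label encoding yields a single integer column, while one-hot encoding yields one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of credit_history values
history = pd.Series(
    ['existing paid', 'all paid', 'delayed previously', 'existing paid'],
    name='credit_history')

# Label encoding: one column, categories numbered in sorted order
codes = LabelEncoder().fit_transform(history)
print(list(codes))

# One-hot encoding: one indicator column per distinct category
dummies = pd.get_dummies(history)
print(dummies.columns.tolist())
print(dummies.shape)
```

Label encoding imposes an arbitrary ordering on the categories, which is one reason the notes above prefer one-hot encoding.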

You are discussing the credit dataset with the bank manager. She suggests that the safest loan applications tend to request mid-range credit amounts. Values that are either too low or too high suggest high risk. This means that a non-linear relationship might exist between this variable and the class. You want to test this hypothesis. You will construct a non-linear transformation of the feature. Then, you will assess which of the two features is better at predicting the class using SelectKBest() and the chi2() metric, both of which have been preloaded.

In [15]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Function computing absolute difference from column mean
def abs_diff(x):
    return np.abs(x-np.mean(x))

# Apply it to the credit amount and store to new column
credit['diff'] = abs_diff(credit['credit_amount'])

# Create a feature selector with chi2 that picks one feature
sk = SelectKBest(chi2, k=1)

# Use the selector to pick between credit_amount and diff
sk.fit(credit[['credit_amount', 'diff']], credit['class'])

# Inspect the results
sk.get_support()

array([ True, False])
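
`get_support()` only says which column won. The fitted selector also stores the per-feature statistics in `scores_`, which shows how far apart the two candidates are. The sketch below uses randomly generated amounts and made-up labels (both assumptions), so its numbers say nothing about the real credit data. Note that `chi2` expects non-negative features, and its score depends on the magnitude of each column, so comparing features on very different scales should be read with care.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical credit amounts drawn at random
rng = np.random.default_rng(0)
frame = pd.DataFrame({'credit_amount': rng.integers(250, 20000, size=200)})

# Absolute difference from the column mean, as in abs_diff() above
frame['diff'] = np.abs(frame['credit_amount'] - frame['credit_amount'].mean())

# Made-up labels: mid-range amounts are 'good', extreme amounts are 'bad'
labels = np.where(frame['diff'] < frame['diff'].median(), 'good', 'bad')

sk = SelectKBest(chi2, k=1).fit(frame[['credit_amount', 'diff']], labels)

# Higher chi2 score means stronger association with the labels
scores = pd.Series(sk.scores_, index=['credit_amount', 'diff'])
print(scores)
print(sk.get_support())
```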

To make sure you don't overfit by mistake, you have already split your data. You will use X_train and y_train for the grid search, and X_test and y_test to decide if feature selection helps. All four dataset folds are preloaded in your environment. You also have access to GridSearchCV(), train_test_split(), SelectKBest(), chi2() and RandomForestClassifier as rfc.

In [18]:
# Find the best value for max_depth among values 2, 5 and 10
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=1, n_estimators=10), param_grid={'max_depth': [2, 5, 10]}, cv=3)
best_value = grid_search.fit(
    X_train, y_train.values.ravel()).best_params_['max_depth']

# Using the best value from above, fit a random forest
clf = RandomForestClassifier(
    random_state=1, max_depth=best_value, n_estimators=10).fit(X_train, y_train.values.ravel())

# Apply SelectKBest with chi2 and pick the top 5 features
vt = SelectKBest(chi2, k=5).fit(X_train, y_train.values.ravel())

# Create a new dataset only containing the selected features
X_train_reduced = vt.transform(X_train)
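
The cell above stops at the reduced training set. To decide whether feature selection helps, the remaining step is to fit on the reduced features and compare test accuracy against the full model. A minimal sketch on synthetic data (the data and all names here are assumptions; the features are shifted to be non-negative because `chi2` rejects negative values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data, shifted so every value is >= 0
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)
X_syn = X_syn - X_syn.min(axis=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1)

# Model on all features
full = RandomForestClassifier(
    random_state=1, max_depth=5, n_estimators=10).fit(X_tr, y_tr)

# Fit the selector on training data only, then apply it to both splits
selector = SelectKBest(chi2, k=5).fit(X_tr, y_tr)
X_tr_red = selector.transform(X_tr)
X_te_red = selector.transform(X_te)

# Same model settings, reduced feature set
reduced = RandomForestClassifier(
    random_state=1, max_depth=5, n_estimators=10).fit(X_tr_red, y_tr)

acc_full = accuracy_score(y_te, full.predict(X_te))
acc_reduced = accuracy_score(y_te, reduced.predict(X_te_red))
print(f"all features: {acc_full:.2f}, top 5 features: {acc_reduced:.2f}")
```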

In [20]:
flows = pd.read_csv('/Users/ingeonhwang/Desktop/1.Yonsei_bigdata_analysis/1.Class_material/5.머신러닝_박홍규/data/lanl_flows.csv')
flows.head()

Unnamed: 0,time,duration,source_computer,source_port,destination_computer,destination_port,protocol,packet_count,byte_count
0,471692,0,C5808,N24128,C26871,N17023,6,1,60
1,471692,0,C5808,N2414,C26871,N19148,6,1,60
2,471692,0,C5808,N24156,C26871,N8001,6,1,60
3,471692,0,C5808,N24161,C26871,N18502,6,1,60
4,471692,0,C5808,N24162,C26871,N11309,6,1,60


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)