# HW2

#### Machine Learning in Korea University
#### COSE362, Fall 2018
#### Due : 11/26 (TUE) 11:59 PM

#### In this assignment, you will practice several classification methods on the given datasets.
* Implementation details: Anaconda 5.3 with Python 3.7.
* Use the given dataset. Please do not change the train / valid / test split.
* Use the numpy, scikit-learn, and matplotlib libraries.
* You don't have to use all of the packages imported below (some are optional). <br>
You can also import additional packages in the "(Option) Other Classifiers" part.
* <b>*DO NOT MODIFY ANY PART OF THE CODE EXCEPT "Your Code Here"*</b>

In [2]:
# Basic packages
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

# Machine Learning Models
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Additional packages
from sklearn.model_selection import cross_val_score

## Process

> 1. Load "train.csv". It includes the features and labels of all samples.
> 2. Train four types of classifiers (logistic regression, decision tree, random forest, support vector machine) and <b>validate</b> them in your own way. <b>(You can't get full credit if you don't conduct validation.)</b>
> 3. Optionally, if you train your own classifier (e.g. ensembling or gradient boosting), you can evaluate it on the development data. <br>
> 4. <b>You should submit the predictions on the test data made with the classifier you select.</b>
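
The validation step can be done in several ways. As one possible sketch, the snippet below generates repeated random train/validation index splits with plain numpy, similar in spirit to scikit-learn's `ShuffleSplit`. The function name `shuffle_split` and its defaults are assumptions made for illustration, not part of the assignment.

```python
import numpy as np

def shuffle_split(n_samples, n_splits=5, test_size=0.2, seed=0):
    """Yield (train_idx, val_idx) pairs from independent random permutations."""
    rng = np.random.RandomState(seed)
    n_val = int(round(test_size * n_samples))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        # The first n_val shuffled indices form the validation set.
        yield perm[n_val:], perm[:n_val]

# 2,000 development samples, as stated in the dataset description.
splits = list(shuffle_split(2000))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Because every split is drawn independently, a sample can appear in the validation part of more than one split, unlike in k-fold cross-validation.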

## Task & dataset description
1. 6 Features (1~6)<br>
Feature 2, 4, 6 : Real-valued<br>
Feature 1, 3, 5 : Categorical <br>

2. Samples <br>
>In development set : 2,000 samples <br>
>In test set : 1,500 samples

## Load development dataset
Load your development dataset. You should read <b>"train.csv"</b>. This is a classification task, and you need to preprocess your data before training your model. <br>
> You need to use the <b>1-of-K coding scheme</b> to convert categorical features to one-hot vectors. <br>
> For example, if a categorical feature takes 3 distinct values, the 1-of-K coding scheme converts them to [1,0,0], [0,1,0], and [0,0,1]. <br>
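
As a minimal sketch of the 1-of-K scheme described above, the helper below maps each distinct value to a column and picks rows of an identity matrix. The name `one_hot` and the toy colour values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def one_hot(values):
    """Return the sorted categories and a 0/1 matrix with one column per category."""
    categories = sorted(set(values))                  # fixed, reproducible column order
    index = {c: i for i, c in enumerate(categories)}
    codes = np.array([index[v] for v in values])
    # Row i of the identity matrix is the one-hot vector for category i.
    return categories, np.eye(len(categories), dtype=int)[codes]

categories, encoded = one_hot(["red", "green", "blue", "green"])
print(categories)
print(encoded)
```

Sorting the categories keeps the column order stable between runs, which matters if the same mapping has to be applied to another split later.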

In [3]:
# For training your model, you need to convert categorical features to one-hot encoding vectors.
# Your Code Here

# Load data
train_data = pd.read_csv('./data/train.csv')
X_data, y_train = np.array(train_data.iloc[:,:-1]), np.array(train_data.iloc[:,-1])

le = LabelEncoder()
X_data[:,0], X_data[:,2], X_data[:,4] = map(le.fit_transform, [X_data[:,0], X_data[:,2], X_data[:,4]])

numeric_features = [1, 3, 5]

# Z-normalization for numerical features
for feature_num in numeric_features:
    feature_values = X_data[:, feature_num]
    feature_mean = np.mean(feature_values)
    feature_std = np.std(feature_values)
    
    X_data[:, feature_num] = (X_data[:, feature_num] - feature_mean) / feature_std
    
X_train = X_data.copy()

# One-hot encoding for categorical features
categorical_features = [0, 2, 4]

for feature_num in categorical_features:
    feature_values_tr = X_train[:, feature_num]
    feature_set = set(feature_values_tr)
    
    feature_dict = {}
    for i, value in enumerate(feature_set):
        feature_dict[value] = i
    
    for i, value in enumerate(feature_values_tr):
        feature_values_tr[i] = feature_dict[value]

    one_hot_matrix = np.eye(len(feature_set))[feature_values_tr.astype(int)]

    X_train = np.concatenate((X_train, one_hot_matrix), axis=1)

X_train = np.delete(X_train, categorical_features, 1)

X_train_ohv = X_train.copy()

# End Your Code

### Logistic Regression
Train and validate your <b>logistic regression classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>
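
The solution cells in this notebook call `f1_score` with `labels=np.unique(y_pred)`, which restricts the macro average to classes that were actually predicted. The hand-written sketch below (pure Python, with hypothetical helper names and toy labels) illustrates how that choice can differ from averaging over every class that occurs.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, defined as 0.0 when the class has no TP, FP, or FN."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 0, 1]          # class 2 is never predicted

all_labels = sorted(set(y_true) | set(y_pred))
predicted_only = sorted(set(y_pred))

f1_all = macro_f1(y_true, y_pred, all_labels)           # class 2 contributes 0
f1_predicted = macro_f1(y_true, y_pred, predicted_only)  # class 2 is left out
print(f1_all, f1_predicted)
```

In this toy case the score over predicted classes only is higher, because the missed class no longer pulls the average down. Scores computed that way may therefore be optimistic compared with a macro average over all classes.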

In [5]:
# Training your logistic regression classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
from sklearn.model_selection import ShuffleSplit

# For the case of CV
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)

# Train the model with regularization
coefs = [0.01, 0.05, 0.1, 0.5, 1, 10, 100]

max_f1 = 0
f1 = []

print("For choosing the optimal regularization parameter\n")
for coef in coefs:
    logreg = LogisticRegression(C=coef, solver='lbfgs', multi_class='multinomial', max_iter=500)
    f1_cv = []
    for t_index, v_index in cv.split(X_train):
        X_tr, X_val = X_train[t_index], X_train[v_index]
        y_tr, y_val = y_train[t_index], y_train[v_index]
        logreg.fit(X_tr, y_tr)
        y_pred = logreg.predict(X_val)
        f1_cv.append(f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred)))
    mean_f1 = np.mean(f1_cv)
    f1.append(mean_f1)
    if max_f1 < mean_f1:
        max_f1 = mean_f1
        opt_coef = coef

    print("Regularization parameter: ", 1/coef)
    print("F1 Score: ", mean_f1)
    print("="*30)

print("Optimal: {}, F1 score: {}".format(1/opt_coef, max_f1))
print("\n\n\n")

max_f1 = 0
solvers = ['newton-cg', 'lbfgs', 'sag', 'saga']

print("For choosing the optimal solver\n")
for solver in solvers:
    logreg = LogisticRegression(C=opt_coef, solver=solver, multi_class='multinomial', max_iter=1800)
    f1_cv = []
    for t_index, v_index in cv.split(X_train):
        X_tr, X_val = X_train[t_index], X_train[v_index]
        y_tr, y_val = y_train[t_index], y_train[v_index]
        logreg.fit(X_tr, y_tr)
        y_pred = logreg.predict(X_val)
        f1_cv.append(f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred)))
    mean_f1 = np.mean(f1_cv)
    if max_f1 < mean_f1:
        max_f1 = mean_f1
        opt_sol = solver
    print("Solver: ", solver)
    print("F1 score:", mean_f1)
    print("="*30)
    
print("Optimal: {}, F1 score: {}".format(opt_sol, max_f1))
print("\n\n\n")


logreg = LogisticRegression(C=opt_coef, solver=opt_sol, max_iter=1800, multi_class='multinomial')

print("Optimal regularization parameter: {}\nOptimal solver: {}\n3-fold cross validation score {:.6f}".\
     format(1/opt_coef, opt_sol, np.mean(cross_val_score(logreg, X_train, y_train, cv=3))))
# End Your Code

For choosing the optimal regularization parameter

Regularization parameter:  100.0
F1 Score:  0.18018405350874023
Regularization parameter:  20.0
F1 Score:  0.18677884488976837
Regularization parameter:  10.0
F1 Score:  0.19584122916108573
Regularization parameter:  2.0
F1 Score:  0.2301127012471597


Regularization parameter:  1.0
F1 Score:  0.2468880340300123
Regularization parameter:  0.1
F1 Score:  0.28165619208649806
Regularization parameter:  0.01
F1 Score:  0.27319212346618726
Optimal: 0.1, F1 score: 0.28165619208649806




For choosing the optimal solver

Solver:  newton-cg
F1 score: 0.283502132638321
Solver:  lbfgs
F1 score: 0.28165619208649806
Solver:  sag
F1 score: 0.283502132638321
Solver:  saga
F1 score: 0.283502132638321
Optimal: newton-cg, F1 score: 0.283502132638321




Optimal regularization parameter: 0.1
Optimal solver: newton-cg
3-fold cross validation score 0.290478


### Decision Tree
Train and validate your <b>decision tree classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>
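
One feature selection method is greedy forward selection, which is also the approach taken in this notebook's decision tree cell. The sketch below shows only the selection loop; `forward_selection`, `toy_score`, and the integer weights are stand-ins invented for illustration, where a real run would score each candidate subset with cross-validated F1.

```python
def forward_selection(n_features, score_fn):
    """Greedily add the feature that maximizes score_fn, then keep the best prefix."""
    selected, history = [], []
    for _ in range(n_features):
        candidates = [i for i in range(n_features) if i not in selected]
        # Pick the candidate whose addition gives the highest score.
        best = max(candidates, key=lambda i: score_fn(selected + [i]))
        selected.append(best)
        history.append(score_fn(selected))
    # Truncate to the shortest prefix that reached the maximum score.
    best_len = history.index(max(history)) + 1
    return selected[:best_len], history

# Toy score: each feature adds a fixed (possibly negative) contribution.
weights = [1, 5, -2, 3]

def toy_score(features):
    return sum(weights[i] for i in features)

chosen, history = forward_selection(len(weights), toy_score)
print(chosen, history)
```

The search evaluates on the order of n² subsets instead of all 2ⁿ, at the cost of possibly missing feature combinations that only help together.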

In [6]:
# Training your decision tree classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
X_train = X_data.copy()

dt = DecisionTreeClassifier(criterion="entropy", random_state=0)

# Feature selection
sel_num = X_train.shape[1]
selected_feature = []
selected_f1 = []

for sel in range(sel_num) :
    max_f1 = 0
    min_feature = 0
    
    # For each feature
    for i in range(X_train.shape[1]) :
        f1_ith = []
        
        if i in selected_feature:
            continue
        X_tr = X_train[:, selected_feature + [i]]
        
        # For cross validation
        for train_index, val_index in cv.split(X_train) :
            X_tr_cv, X_val = X_tr[train_index], X_tr[val_index]
            y_tr_cv, y_val = y_train[train_index], y_train[val_index]
        
            # Derive f1 score
            dt.fit(X_tr_cv, y_tr_cv)
            y_pred = dt.predict(X_val)
            f1_ith.append(f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred)))

        if np.mean(f1_ith) > max_f1:
            max_f1 = np.mean(f1_ith)
            min_feature = i
            opt_params = dt.get_params()

    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("Selected feature of this iteration : {}".format(min_feature))
    print("F1 score: {:.8f}".format(max_f1))
    selected_feature.append(min_feature)
    selected_f1.append(max_f1)

print("\n\n\n")    

selected_feature = selected_feature[:selected_f1.index(max(selected_f1))+1]
X_train = X_train[:, selected_feature]

print("Selected features: ", selected_feature)
print("3-fold cross validation score of this model: ", np.mean(cross_val_score(dt, X_train, y_train, cv = 3)))
# End Your Code

# of selected feature(s) : 1
Selected feature of this iteration : 4
F1 score: 0.28491780
# of selected feature(s) : 2
Selected feature of this iteration : 1
F1 score: 0.20820198
# of selected feature(s) : 3
Selected feature of this iteration : 5
F1 score: 0.24934150


# of selected feature(s) : 4
Selected feature of this iteration : 0
F1 score: 0.31586097
# of selected feature(s) : 5
Selected feature of this iteration : 3
F1 score: 0.35752144
# of selected feature(s) : 6
Selected feature of this iteration : 2
F1 score: 0.38406268




Selected features:  [4, 1, 5, 0, 3, 2]
3-fold cross validation score of this model:  0.4354333529839798


### Random Forest
Train and validate your <b>random forest classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>

In [8]:
# Training your random forest classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
X_train = X_data.copy()

num_set = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]   

max_f1 = 0

for n in num_set:
    f1_cv = []
    rfc = RandomForestClassifier(n_estimators=n ,criterion="entropy", random_state=0)
    for t_index, v_index in cv.split(X_train):
        X_tr, y_tr = X_train[t_index], y_train[t_index]
        X_val, y_val = X_train[v_index], y_train[v_index]
        rfc.fit(X_tr, y_tr)
        y_pred = rfc.predict(X_val)
        f1 = f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred))
        f1_cv.append(f1)
    if max_f1 < np.mean(f1_cv):
        max_f1 = np.mean(f1_cv)
        opt_n = n
    print("The number of trees: ", n)
    print("F1 score: ", np.mean(f1_cv))
    print("="*40)

print("\n\n\n")
print("Optimal number of trees: ", opt_n)
print("3-fold cross validation score of this model: ", np.mean(cross_val_score(\
                        RandomForestClassifier(n_estimators=opt_n, criterion='entropy'), X_train, y_train, cv = 3)))
# End Your Code

The number of trees:  5
F1 score:  0.3684475152545085


The number of trees:  10
F1 score:  0.41688447130181555


The number of trees:  15
F1 score:  0.4269743546613952


The number of trees:  20
F1 score:  0.41748892136043125


The number of trees:  25
F1 score:  0.436561404271713


The number of trees:  30
F1 score:  0.44789144035029305


The number of trees:  35
F1 score:  0.44966763296766327


The number of trees:  40
F1 score:  0.45877048961196454


The number of trees:  45
F1 score:  0.45589503520366714


The number of trees:  50
F1 score:  0.45971643069335333




Optimal number of trees:  50
3-fold cross validation score of this model:  0.4774643516205715


### Support Vector Machine
Train and validate your <b>support vector machine classifier</b>, and print out your validation (or cross-validation) error.
> If you want, you can use cross-validation, regularization, or feature selection methods. <br>
> <b> You should use the F1 score ('macro' option) as the evaluation metric. </b>

In [10]:
# Training your support vector machine classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here
X_train = X_train_ohv.copy()

reg_params = [0.001, 0.01, 0.05, 0.1, 1.0, 5.0, 10.0, 50.0, 100.0, 1000.0]
max_f1 = 0

print("For choosing the optimal regularization parameter\n")
for param in reg_params:
    f1_cv = []
    svc = SVC(C=param)
    for t_index, v_index in cv.split(X_train):
        X_tr, y_tr = X_train[t_index], y_train[t_index]
        X_val, y_val = X_train[v_index], y_train[v_index]
        svc.fit(X_tr, y_tr)
        y_pred = svc.predict(X_val)
        f1 = f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred))
        f1_cv.append(f1)
    if max_f1 < np.mean(f1_cv):
        max_f1 = np.mean(f1_cv)
        opt_param = param
    print("Regularization parameter: ", param)
    print("F1 score: ", np.mean(f1_cv))
    print("="*40)

print("Optimal regularization parameter: ", opt_param)
print("F1 score: {:.6f}".format(max_f1))
print("\n\n\n")    
    
kernels = ['poly', 'rbf', 'sigmoid']
max_f1 = 0

print("For choosing the optimal kernel function\n")
for kernel in kernels:
    f1_cv = []
    svc = SVC(C=opt_param, kernel=kernel)
    for t_index, v_index in cv.split(X_train):
        X_tr, y_tr = X_train[t_index], y_train[t_index]
        X_val, y_val = X_train[v_index], y_train[v_index]
        svc.fit(X_tr, y_tr)
        y_pred = svc.predict(X_val)
        f1 = f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred))
        f1_cv.append(f1)
    if max_f1 < np.mean(f1_cv):
        max_f1 = np.mean(f1_cv)
        opt_kernel = kernel
    print("Current kernel: ", kernel)
    print("F1 score: ", np.mean(f1_cv))
    print("="*40)
print("Optimal kernel function: ", opt_kernel)
print("F1 score: {:.6f}".format(max_f1))
print('\n\n\n')

print("Optimal regularization parameter: ", opt_param)    
print("Optimal kernel: ", opt_kernel)   
print("3-fold cross validation score: {:.6f}".format(cross_val_score(\
                            SVC(C=opt_param, kernel=opt_kernel), X_train, y_train, cv=3).mean()))
# End Your Code

For choosing the optimal regularization parameter

Regularization parameter:  0.001
F1 score:  0.294155912528284
Regularization parameter:  0.01
F1 score:  0.294155912528284
Regularization parameter:  0.05
F1 score:  0.15677424975336465
Regularization parameter:  0.1
F1 score:  0.15637113397087893
Regularization parameter:  1.0
F1 score:  0.23887572379076266
Regularization parameter:  5.0
F1 score:  0.35681018433278033


Regularization parameter:  10.0
F1 score:  0.40038368405115704
Regularization parameter:  50.0
F1 score:  0.42141988168121775
Regularization parameter:  100.0
F1 score:  0.4232442372826927
Regularization parameter:  1000.0
F1 score:  0.4046013696307599
Optimal regularization parameter:  100.0
F1 score: 0.423244




For choosing the optimal kernel function

Current kernel:  poly
F1 score:  0.40958825673060983
Current kernel:  rbf
F1 score:  0.4232442372826927
Current kernel:  sigmoid
F1 score:  0.26266681223909444
Optimal kernel function:  rbf
F1 score: 0.423244




Optimal regularization parameter:  100.0
Optimal kernel:  rbf
3-fold cross validation score: 0.465432


### (Option) Other Classifiers
Train and validate other classifiers in your own way.
> <b> If needed, you can import other models in this cell only, and only from scikit-learn. </b>

In [15]:
# If you need additional packages, import your own packages below.
# Your Code Here
from sklearn.ensemble import BaggingClassifier

models = {"Logistic Regression" : LogisticRegression(C=0.1, solver='lbfgs', max_iter=2000, multi_class='multinomial'),
         "Decision Tree" : DecisionTreeClassifier(criterion="entropy", random_state=0),
         "Random Forest" : RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
         "Support Vector Machine" : SVC(C=100.0, kernel="rbf")}
num_set = [5, 10, 15, 20, 25, 30, 35, 40]

for model in models.keys():
    print("\nCurrent model for bagging: ", model, "\n")
    
    max_f1 = 0
    
    if model == "Logistic Regression" or model == "Support Vector Machine":
        X_train = X_train_ohv
    else:
        X_train = X_data
        
    for n in num_set:
        f1_cv = []
        accuracy_cv = []
        bag = BaggingClassifier(base_estimator=models[model], n_estimators=n)
        for t_index, v_index in cv.split(X_train):
            X_tr, y_tr = X_train[t_index], y_train[t_index]
            X_val, y_val = X_train[v_index], y_train[v_index]
            
            bag.fit(X_tr, y_tr)
            y_pred = bag.predict(X_val)
            
            f1 = f1_score(y_val, y_pred, average='macro', labels=np.unique(y_pred))
            f1_cv.append(f1)
            
            accuracy = len(y_pred[y_pred == y_val]) / len(y_pred)
            accuracy_cv.append(accuracy)
            
        if max_f1 < np.mean(f1_cv):
            max_f1 = np.mean(f1_cv)
            opt_n = n
            
        print("With {} models, F1 score is {:.6f}".format(n, np.mean(f1_cv)))
        print("Accuracy: {:.6f}".format(np.mean(accuracy_cv)))
        print("="*40)
    print("Optimal number of models: ", opt_n)
    print("3-fold cross validation score: {:.6f}".format(cross_val_score(BaggingClassifier(base_estimator=models[model], n_estimators=opt_n), X_train, y_train, cv=3).mean()))
    print("\n\n\n")
#End your code


Current model for bagging:  Logistic Regression 

With 5 models, F1 score is 0.194204
Accuracy: 0.257000
With 10 models, F1 score is 0.187815
Accuracy: 0.252000
With 15 models, F1 score is 0.195679
Accuracy: 0.263000
With 20 models, F1 score is 0.186904
Accuracy: 0.253500
With 25 models, F1 score is 0.193956
Accuracy: 0.257500
With 30 models, F1 score is 0.190547
Accuracy: 0.257500
With 35 models, F1 score is 0.189012
Accuracy: 0.256000
With 40 models, F1 score is 0.190781
Accuracy: 0.260000
Optimal number of models:  15
3-fold cross validation score: 0.274439





Current model for bagging:  Decision Tree 

With 5 models, F1 score is 0.429073
Accuracy: 0.466000


With 10 models, F1 score is 0.452904
Accuracy: 0.481500
With 15 models, F1 score is 0.468887
Accuracy: 0.495000


With 20 models, F1 score is 0.454635
Accuracy: 0.485000


With 25 models, F1 score is 0.479146
Accuracy: 0.500500


With 30 models, F1 score is 0.477688
Accuracy: 0.501500


With 35 models, F1 score is 0.459558
Accuracy: 0.499000


With 40 models, F1 score is 0.466857
Accuracy: 0.504500
Optimal number of models:  25
3-fold cross validation score: 0.501940





Current model for bagging:  Random Forest 



With 5 models, F1 score is 0.433194
Accuracy: 0.472000


With 10 models, F1 score is 0.462866
Accuracy: 0.480000


With 15 models, F1 score is 0.445912
Accuracy: 0.483000


With 20 models, F1 score is 0.455900
Accuracy: 0.480000


With 25 models, F1 score is 0.464734
Accuracy: 0.486000


With 30 models, F1 score is 0.472419
Accuracy: 0.492000


With 35 models, F1 score is 0.462846
Accuracy: 0.488000


With 40 models, F1 score is 0.451946
Accuracy: 0.481000
Optimal number of models:  30
3-fold cross validation score: 0.491945





Current model for bagging:  Support Vector Machine 



With 5 models, F1 score is 0.405933
Accuracy: 0.444500


With 10 models, F1 score is 0.419313
Accuracy: 0.455500


With 15 models, F1 score is 0.408988
Accuracy: 0.452500


With 20 models, F1 score is 0.433607
Accuracy: 0.459500


With 25 models, F1 score is 0.424960
Accuracy: 0.460000


With 30 models, F1 score is 0.415550
Accuracy: 0.462000


With 35 models, F1 score is 0.419730
Accuracy: 0.456000


With 40 models, F1 score is 0.424578
Accuracy: 0.466000
Optimal number of models:  20
3-fold cross validation score: 0.460479






## Submit your prediction on the test data.

* Select your model and explain it briefly.
* You should read <b>"test.csv"</b>.
* Predict with your model in dictionary form.
* Prediction example <br>
[2, 6, 14, 8, $\cdots$]
* We will rank your results by the <b>F1 metric (with the 'macro' option)</b>.
* <b> If you don't submit a prediction file, or submit it in the wrong format, you can't get points for this part.</b>

# Explain your final model
For each model, I tuned several parameters to find the configuration with the highest F1 score, and then applied that model to bagging, one of the ensemble methods. I chose bagging because I judged that the number of training samples is small relative to the number of target classes, so I wanted a more stable model. For bagging, I adjusted only the number of resampled datasets to find the best-fitting model. As a result, the score was highest when a Decision Tree was trained through bagging on 40 resampled datasets. For Decision Tree and Random Forest, leaving the categorical features without one-hot encoding scored higher than encoding them, so I proceeded without one-hot encoding.
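
The bagging idea described above can be sketched in two steps: draw bootstrap resamples of the training rows, then combine the base models' predictions by majority vote. The code below is only an illustration with hypothetical helper names and a hand-made vote table. It is not how scikit-learn's `BaggingClassifier` is implemented internally.

```python
import numpy as np

def bootstrap_indices(n_samples, n_estimators, seed=0):
    """One index array per base model, sampled with replacement."""
    rng = np.random.RandomState(seed)
    return [rng.randint(0, n_samples, size=n_samples) for _ in range(n_estimators)]

def majority_vote(predictions):
    """predictions: (n_estimators, n_samples) array of non-negative class labels."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for every sample (column).
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)      # ties resolve to the smallest label

resamples = bootstrap_indices(n_samples=10, n_estimators=3)

# Assumed predictions of three base models on three validation samples.
votes = [[0, 1, 2],
         [0, 1, 1],
         [1, 1, 2]]
combined = majority_vote(votes)
print(combined)
```

Each base model sees a slightly different resample, so their individual errors are less correlated, and the vote tends to vary less than any single model's prediction.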



In [17]:
# Load test dataset.
# Your Code Here
X_test = pd.read_csv('./data/test.csv')
X_test = np.array(X_test)

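# NOTE: the label encoder is refit on the test set here, so the integer codes
# match the training codes only if both sets contain the same categories.
le = LabelEncoder()
X_test[:,0], X_test[:,2], X_test[:,4] = map(le.fit_transform, [X_test[:,0], X_test[:,2], X_test[:,4]])

numeric_features = [1, 3, 5]

# Z-normalization for numerical features (uses the test set's own mean and std)
for feature_num in numeric_features:
    feature_values = X_test[:, feature_num]
    feature_mean = np.mean(feature_values)
    feature_std = np.std(feature_values)
    
    X_test[:, feature_num] = (X_test[:, feature_num] - feature_mean) / feature_std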
    
X_train = X_data.copy()
# End Your Code
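
The cell above computes the normalization statistics from the test set itself. A common alternative is to estimate the mean and standard deviation on the training data and reuse them for the test data, so both are scaled identically. The sketch below shows that pattern on made-up numbers, and the helper names are assumptions, not part of the assignment.

```python
import numpy as np

def fit_standardizer(X, columns):
    """Estimate per-column mean and std on the training data only."""
    X = np.asarray(X, dtype=float)
    return X[:, columns].mean(axis=0), X[:, columns].std(axis=0)

def apply_standardizer(X, columns, mean, std):
    """Scale the chosen columns with previously fitted statistics."""
    X = np.asarray(X, dtype=float).copy()
    X[:, columns] = (X[:, columns] - mean) / std
    return X

# Toy data: column 0 is an encoded category, column 1 is real-valued.
train = np.array([[0, 1.0], [1, 3.0], [0, 5.0]])
test = np.array([[1, 3.0], [0, 7.0]])

mean, std = fit_standardizer(train, [1])
scaled_test = apply_standardizer(test, [1], mean, std)
print(scaled_test)
```

The same principle applies to categorical encoders: fitting them once on the training data keeps the category-to-code mapping identical across splits.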

In [20]:
# Predict target class
# Make variable "my_answer", type of array, and fill this array with your class predictions.
# Modify file name into your student number and your name.
# Your Code Here
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy', random_state=0), n_estimators=25)
my_answer = bag.fit(X_train, y_train).predict(X_test)

file_name = "HW2_2012130730_ShimJaeheon.csv"
# End Your Code

In [22]:
# This section is for saving predicted answers. DO NOT MODIFY.
pd.Series(my_answer).to_csv("./data/" + file_name, header=None, index=None)