# Abstraction
## 1. Test the effect of different data preprocessing pipelines
Tested preprocessing variants:
- ### [No preprocess](#Test-Preprocess:-No-preprocess)

- ### [Preprocess 1](#Test-Preprocess:-Preprocess-1)
 - Drop less relevant data
 - One-hot encode categorical data
 - Normalize numerical data

- ### [Preprocess 2](#Test-Preprocess:-Preprocess-2)
 - Same as Preprocess 1, plus
 - Oversampling of the training set

- ### [Preprocess 3](#Test-Preprocess:-Preprocess-3)
 - Only drop geo_level_2_id and geo_level_3_id
 - All other operations are the same as in Preprocess 1

- ### [Preprocess 4](#Test-Preprocess:-Preprocess-4)
 - Drop only geo_level_3_id
 - All other operations are the same as in Preprocess 1

## 2. Test the effect of different hyperparameters
Tested hyperparameters:
- [Iteration number](#Test-Hyperparameter:-max_iter)
- [C: penalty on wrong responses](#Test-Hyperparameter:-C:-punishment-on-wrong-resoposes)
- [class_weight](#Test-Hyperparameter:-class_weight)

## 3. Test the effect of different kernel tricks
<b><em>Note: each of the following models takes roughly 3 hours or more to train</em></b><br>
Tested kernels:
- [RBF](#Kernel-Trick-Test:-RBF)
- [Polynomial (3 degree)](#Kernel-Trick-Test:-3-Degree-Polynomial)
- [Sigmoid](#Kernel-Trick-Test:-Sigmoid)

## 4. [Predict on DrivenData Competition Test Set](#Prediction-on-DrivenData-Competition-Dataset)
## 5. [Save the Optimal Model](#Save-the-Optimal-Model-to-Avoid-Training-Again)


# Results
- ## Preprocess Test
 - Raw data cannot be used to train the SVM: the SVM accepts only numeric input, while the raw data contains string-valued columns
 - Preprocess 1: On test set: Accuracy: 0.6620, F1 Score: 0.6428
 - Preprocess 2: On test set: Accuracy: 0.5559, F1 Score: 0.5570
 - Preprocess 3: On test set: Accuracy: 0.6655, F1 Score: 0.6470
 - Preprocess 4: Failed to run due to a memory error

- ## Hyperparameter Test:
 - Iteration number<br>
    No effect over the tested range (10 to 5000): `max_iter` is only an upper bound, and the identical scores suggest the scikit-learn solver stops on its own convergence criterion before reaching it
 - C: penalty on wrong responses<br>
    The optimal value is around 0.1~1.0.<br>
    When C < 0.1, the prediction accuracy decreases.<br>
    When C > 1.0, the risk of overfitting increases.<br>
 - class_weight
   - Uniform weight: On test set: Accuracy: 0.6620, F1 Score: 0.6428
   - Balanced weight: On test set: Accuracy: 0.6391, F1 Score: 0.6413

- ## Kernel Trick Test
 - RBF: On test set: Accuracy: 0.6783, F1 Score: 0.6669
 - Polynomial: On test set: Accuracy: 0.6749, F1 Score: 0.6648
 - Sigmoid: On test set: Accuracy: 0.5448, F1 Score: 0.5460
 
- ### Test result on DrivenData: F1 score 0.6768

# Conclusions
- The optimal SVM model:
 - No oversampling
 - Drop less relevant features to save training time
 - Using default C value (C = 1)
 - Using uniform class weight
 - Using RBF kernel
- About oversampling:
 - It degrades the overall performance on the test set
 - It increases the prediction accuracy for the minority class (damage_grade == 1)
 - It decreases the prediction accuracy for the majority class (damage_grade == 2)
- About class_weight:
 - Uniform weight provides better overall performance
 - Balanced weight provides more even performance on predicting each class
- Our hardware is a limiting factor: it cannot train a model with ~1500-dimensional features, and training each kernel SVM takes roughly 3 to 4.5 hours (2 h 49 min for sigmoid, 3 h 42 min for RBF, 4 h 30 min for polynomial).

In [1]:
import numpy as np
import pandas as pd
import time
import pickle
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix

### Auxiliary functions
[Back to abstraction](#Abstraction)

In [2]:
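# Print per-class rates and F1 scores calculated from a confusion matrix
# (rows = true classes, columns = predicted classes, as returned by sklearn).
# Note: the value printed as "Micro F1 Score" weights each class's F1 by its
# number of true samples, i.e. it is a support-weighted average F1.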
def print_conf_paras(conf_mat):
    print('Classification Accuracy\t:', conf_mat.trace()/conf_mat.sum())
    print()
    micro_f1 = 0
    macro_f1 = 0
    for i in range(len(conf_mat)):
        tp = conf_mat[i,i]/conf_mat[i].sum()
        fp = (conf_mat[:,i].sum()-conf_mat[i,i])/(conf_mat.sum()-conf_mat[i].sum())
        fn = (conf_mat[i].sum()-conf_mat[i,i])/conf_mat[i].sum()
        precision = conf_mat[i,i]/conf_mat[:,i].sum()
        f1 = 2*precision*tp/(precision+tp)
        micro_f1 += f1 * conf_mat[i].sum()
        macro_f1 += f1
        print('  True Positive Rate for ', i, '\t:', tp)
        print('  False Positive Rate for', i, '\t:', fp)
        print('  False Negative Rate for', i, '\t:', fn)
        print('  Precision & Recall for ', i, '\t:', precision, tp)
        print('  F1 Score for', i, '           \t:', f1)
        print()
    micro_f1 /= conf_mat.sum()
    macro_f1 /= len(conf_mat)
    print('Micro F1 Score          \t:', micro_f1)
    print('Macro F1 Score          \t:', macro_f1)

# Print only the accuracy and the support-weighted F1 score (labeled "Micro F1 Score")
# calculated from a confusion matrix.
def print_conf_paras_simple(conf_mat):
    print('Classification Accuracy\t:', conf_mat.trace()/conf_mat.sum())
    micro_f1 = 0
    for i in range(len(conf_mat)):
        recall = conf_mat[i,i]/conf_mat[i].sum()
        precision = conf_mat[i,i]/conf_mat[:,i].sum()
        f1 = 2*precision*recall/(precision+recall)
        micro_f1 += f1 * conf_mat[i].sum()
    micro_f1 /= conf_mat.sum()
    print('Micro F1 Score          \t:', micro_f1)

In [3]:
# Train a linear SVM on the given data set with the given hyperparameters,
# then evaluate it on the train set and the test set and print the results.
# If verbose == True, it prints detailed per-class results.
# If verbose == False, it prints only the classification accuracy and the F1 score.
# Extra keyword arguments (**args) are passed on to LinearSVC.
# It returns the trained SVM.
def test_linear_svm_with(train_X, train_Y, test_X, test_Y, verbose=True, max_iter=5000, **args):
    # Training
    linear_svc = LinearSVC(dual=False, max_iter=max_iter, **args)
    linear_svc.fit(train_X, train_Y)
    test_svm(linear_svc, train_X, train_Y, test_X, test_Y, verbose)
    
    return linear_svc

# This function is used to test the given SVM with given data set
# If verbose == True, it will print lots of details.
# If verbose == False, it will only print classification accuracy and micro f1 score
def test_svm(svc, train_X, train_Y, test_X, test_Y, verbose=True):
    # Test on train set
    train_pred = svc.predict(train_X)
    train_conf = confusion_matrix(train_Y, train_pred)
    print('======= Test result on train set =======')
    if verbose:
        print_conf_paras(train_conf)
    else:
        print_conf_paras_simple(train_conf)
    
    # Test on test set
    test_pred = svc.predict(test_X)
    test_conf = confusion_matrix(test_Y, test_pred)
    print('======= Test result on test set ========')
    if verbose:
        print_conf_paras(test_conf)
    else:
        print_conf_paras_simple(test_conf)
    
    print('================= Done =================')
    

## Test Preprocess: No preprocess
[Back to abstraction](#Abstraction)<br>
The code below cannot run because the raw data contains string-valued columns that have not been encoded

In [4]:
# from sklearn.model_selection import train_test_split  # needed by the split below
# raw_data = pd.read_csv('data/feature.csv')
# raw_label = pd.read_csv('data/label.csv')
# 
# raw_data.drop(columns=['building_id'], inplace=True)
# raw_label = raw_label['damage_grade']
# 
# raw_train_x, raw_test_x, raw_train_y, raw_test_y = train_test_split(raw_data, raw_label, test_size=0.2)
# 
# test_linear_svm_with(raw_train_x, raw_train_y, raw_test_x, raw_test_y)

## Test Preprocess: Preprocess 1
[Back to abstraction](#Abstraction)<br>
- Drop less relevant data
- One-hot encode categorical data
- Normalize numerical data
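
The three steps listed above are implemented in `Preprocess.ipynb`, which is not shown here. As a rough sketch of what such a pipeline can look like, the example below drops columns, one-hot encodes a categorical column and rescales numerical columns. Everything in it is an assumption for illustration: the toy table, the choice of which columns to drop, the column names other than `building_id` and the `geo_level_*_id` columns, and min-max scaling as the normalization method.

```python
import pandas as pd

# Toy stand-in for the feature table. Apart from building_id and the
# geo_level_*_id columns, the column names are illustrative assumptions.
features = pd.DataFrame({
    "building_id": [101, 102, 103, 104],
    "geo_level_2_id": [487, 900, 363, 418],
    "geo_level_3_id": [12198, 2812, 8973, 10694],
    "age": [30, 10, 10, 25],
    "area_percentage": [6, 8, 5, 6],
    "foundation_type": ["r", "r", "w", "i"],
})

def preprocess(df, drop_cols, categorical_cols, numerical_cols):
    """Drop columns, one-hot encode categoricals, min-max scale numericals."""
    df = df.drop(columns=drop_cols)
    # One 0/1 column per category value.
    df = pd.get_dummies(df, columns=categorical_cols, dtype=float)
    for col in numerical_cols:
        lo, hi = df[col].min(), df[col].max()
        # Assumes hi > lo; a constant column would need special handling.
        df[col] = (df[col] - lo) / (hi - lo)
    return df

processed = preprocess(
    features,
    drop_cols=["building_id", "geo_level_2_id", "geo_level_3_id"],
    categorical_cols=["foundation_type"],
    numerical_cols=["age", "area_percentage"],
)
print(processed)
```

Scaling matters for SVMs because features with large numeric ranges would otherwise dominate the margin computation.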

In [5]:
%run Preprocess.ipynb

In [6]:
lsvm1 = test_linear_svm_with(train_x, train_y, test_x, test_y)

Classification Accuracy	: 0.6637759017651573

  True Positive Rate for  0 	: 0.23035216872264225
  False Positive Rate for 0 	: 0.01530449738820232
  False Negative Rate for 0 	: 0.7696478312773577
  Precision & Recall for  0 	: 0.6163162097418152 0.23035216872264225
  F1 Score for 0            	: 0.3353609964515895

  True Positive Rate for  1 	: 0.8347999055999461
  False Positive Rate for 1 	: 0.555935259806759
  False Negative Rate for 1 	: 0.16520009440005395
  Precision & Recall for  1 	: 0.664782833401572 0.8347999055999461
  F1 Score for 1            	: 0.7401534201942227

  True Positive Rate for  2 	: 0.49774852291630817
  False Positive Rate for 2 	: 0.12447026263441635
  False Negative Rate for 2 	: 0.5022514770836919
  Precision & Recall for  2 	: 0.6677504376767541 0.49774852291630817
  F1 Score for 2            	: 0.5703510775525631

Micro F1 Score          	: 0.6443235859462847
Macro F1 Score          	: 0.5486218313994584
Classification Accuracy	: 0.6642236334682757

 

## Test Preprocess: Preprocess 2
[Back to abstraction](#Abstraction)<br>
- Drop less relevant data
- One-hot encode categorical data
- Normalize numerical data
- Oversample the training set
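
The oversampling step itself lives in the preprocessing notebook and is not shown here. A minimal sketch of one common approach, random oversampling with replacement until every class matches the largest one, is given below. The function name `random_oversample`, the seed and the toy arrays are assumptions for illustration, and the notebook may use a different method.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many samples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    indices = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        indices.append(np.concatenate([cls_idx, extra]))
    # Shuffle so that the classes are not grouped together.
    indices = rng.permutation(np.concatenate(indices))
    return X[indices], y[indices]

# Hypothetical imbalanced toy data: 1, 6 and 3 samples for grades 1, 2 and 3.
X = np.arange(20).reshape(10, 2)
y = np.array([1, 2, 2, 2, 2, 2, 2, 3, 3, 3])
X_over, y_over = random_oversample(X, y)
print(np.unique(y_over, return_counts=True))
```

Oversampling is normally applied to the training split only, so that the test scores still reflect the original class distribution.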

In [7]:
lsvm2 = test_linear_svm_with(train_x_over, train_y_over, test_x_over, test_y_over)

Classification Accuracy	: 0.6530264207770023

  True Positive Rate for  0 	: 0.8156754661002663
  False Positive Rate for 0 	: 0.16683102390344223
  False Negative Rate for 0 	: 0.18432453389973366
  Precision & Recall for  0 	: 0.709691849635529 0.8156754661002663
  F1 Score for 0            	: 0.7590017489784553

  True Positive Rate for  1 	: 0.41133980647988944
  False Positive Rate for 1 	: 0.1729122416641381
  False Negative Rate for 1 	: 0.5886601935201106
  Precision & Recall for  1 	: 0.5432636113677601 0.41133980647988944
  F1 Score for 1            	: 0.46818593897648186

  True Positive Rate for  2 	: 0.7320639897508513
  False Positive Rate for 2 	: 0.18071710326691615
  False Negative Rate for 2 	: 0.2679360102491487
  Precision & Recall for  2 	: 0.6694697734647788 0.7320639897508513
  F1 Score for 2            	: 0.6993691143847557

Micro F1 Score          	: 0.6421856007798976
Macro F1 Score          	: 0.6421856007798977
Classification Accuracy	: 0.5568388941117783

 

## Test Preprocess: Preprocess 3
[Back to abstraction](#Abstraction)<br>
- Drop geo_level_2_id and geo_level_3_id
- One-hot encode categorical data
- Normalize numerical data

In [8]:
%run Preprocess_keep_features_except_geo_2_3.ipynb

In [9]:
lsvm3 = test_linear_svm_with(train_x_keep, train_y_keep, test_x_keep, test_y_keep, verbose=False)

Classification Accuracy	: 0.6667881811204912

  True Positive Rate for  0 	: 0.25326070661136374
  False Positive Rate for 0 	: 0.017111567419575632
  False Negative Rate for 0 	: 0.7467392933886362
  Precision & Recall for  0 	: 0.6111178102013747 0.25326070661136374
  F1 Score for 0            	: 0.35811192764273597

  True Positive Rate for  1 	: 0.8338214255659345
  False Positive Rate for 1 	: 0.5478416062465142
  False Negative Rate for 1 	: 0.16617857443406547
  Precision & Recall for  1 	: 0.6685897825192143 0.8338214255659345
  F1 Score for 1            	: 0.7421197107408615

  True Positive Rate for  2 	: 0.5005959304412757
  False Positive Rate for 2 	: 0.12337133843749325
  False Negative Rate for 2 	: 0.4994040695587243
  Precision & Recall for  2 	: 0.6705327947682247 0.5005959304412757
  F1 Score for 2            	: 0.5732350015210188

Micro F1 Score          	: 0.6488477325514083
Macro F1 Score          	: 0.5578222133015388
Classification Accuracy	: 0.6645689837109802


## Test Preprocess: Preprocess 4
[Back to abstraction](#Abstraction)<br>
- Drop geo_level_3_id only
- One-hot encode categorical data
- Normalize numerical data

### Failed to run due to Memory Error

In [11]:
# clean memory for next test, which will take lots of memory
# del train_x, train_y, test_x, test_y
# del train_x_over, train_y_over, test_x_over, test_y_over
# del train_x_keep, train_y_keep, test_x_keep, test_y_keep

In [12]:
# %run Preprocess_keep_features_except_geo_3.ipynb

In [13]:
# lsvm4 = test_linear_svm_with(train_x_keep12, train_y_keep12, test_x_keep12, test_y_keep12)

In [14]:
# clean memory
# del train_x_keep12, train_y_keep12, test_x_keep12, test_y_keep12

## Test Hyperparameter: max_iter
[Back to abstraction](#Abstraction)<br>

In [15]:
%run Preprocess.ipynb

In [16]:
for max_iter in [10, 100, 1000, 5000]:
    print('Max_iter =', max_iter)
    test_linear_svm_with(train_x, train_y, test_x, test_y, max_iter=max_iter, verbose=False)
    print()

Max_iter = 10
Classification Accuracy	: 0.6650038372985418
Micro F1 Score          	: 0.6463600586133069
Classification Accuracy	: 0.662017229139886
Micro F1 Score          	: 0.643059238487166

Max_iter = 100
Classification Accuracy	: 0.6650038372985418
Micro F1 Score          	: 0.6463600586133069
Classification Accuracy	: 0.662017229139886
Micro F1 Score          	: 0.643059238487166

Max_iter = 1000
Classification Accuracy	: 0.6650038372985418
Micro F1 Score          	: 0.6463600586133069
Classification Accuracy	: 0.662017229139886
Micro F1 Score          	: 0.643059238487166

Max_iter = 5000
Classification Accuracy	: 0.6650038372985418
Micro F1 Score          	: 0.6463600586133069
Classification Accuracy	: 0.662017229139886
Micro F1 Score          	: 0.643059238487166
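
The identical scores above are consistent with the solver converging before even the smallest cap is reached. One way to check this, sketched here on synthetic data rather than the competition data, is to inspect the fitted model's `n_iter_` attribute, which reports how many iterations were actually run. The random data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic, roughly linearly separable toy problem (not the real data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

runs = []
for max_iter in [10, 100, 1000]:
    clf = LinearSVC(dual=False, max_iter=max_iter).fit(X, y)
    # n_iter_ is the number of iterations the solver actually used.
    runs.append((max_iter, int(clf.n_iter_), clf.score(X, y)))
    print('max_iter =', max_iter, '| iterations used =', int(clf.n_iter_),
          '| train accuracy =', clf.score(X, y))
```

If the number of iterations used stays below every tested cap, raising `max_iter` cannot change the fitted model.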



## Test Hyperparameter: C: punishment on wrong resoposes
[Back to abstraction](#Abstraction)<br>

In [17]:
for lgC in range(-4, 5):
    print('C =', 10**lgC)
    test_linear_svm_with(train_x, train_y, test_x, test_y, verbose=False, C=10**lgC)
    print()

C = 0.0001
Classification Accuracy	: 0.6507338833461244
Micro F1 Score          	: 0.615520964395088
Classification Accuracy	: 0.6508700907503693
Micro F1 Score          	: 0.6149150584307574

C = 0.001
Classification Accuracy	: 0.6626295088257866
Micro F1 Score          	: 0.6419265774910258
Classification Accuracy	: 0.6597148941885229
Micro F1 Score          	: 0.6386266930609307

C = 0.01
Classification Accuracy	: 0.6647448196469685
Micro F1 Score          	: 0.6458628208710776
Classification Accuracy	: 0.661767809520155
Micro F1 Score          	: 0.6425117620569234

C = 0.1
Classification Accuracy	: 0.66504221028396
Micro F1 Score          	: 0.6463751456045157
Classification Accuracy	: 0.6620556013890754
Micro F1 Score          	: 0.6430567831434513

C = 1
Classification Accuracy	: 0.6650038372985418
Micro F1 Score          	: 0.6463600586133069
Classification Accuracy	: 0.662017229139886
Micro F1 Score          	: 0.643059238487166

C = 10
Classification Accuracy	: 0.665013430544

## Test Hyperparameter: class_weight
[Back to abstraction](#Abstraction)<br>
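With `class_weight='balanced'`, scikit-learn documents the per-class weight as `n_samples / (n_classes * np.bincount(y))`, so rarer classes are penalized more heavily when misclassified. The sketch below reproduces that formula on a made-up label vector. The helper name `balanced_class_weights` and the class counts are assumptions for illustration.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights as documented for class_weight='balanced':
    n_samples / (n_classes * count_of_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical label distribution: grade 2 is the majority class.
y = np.array([1] * 10 + [2] * 60 + [3] * 30)
weights = balanced_class_weights(y)
print(weights)
```

The minority class receives the largest weight, which fits the more even per-class results and the lower overall accuracy reported for the balanced setting.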

In [18]:
# The default class_weight (None, i.e. all classes weighted equally) was already tested in Preprocess 1

lsvm5 = test_linear_svm_with(train_x, train_y, test_x, test_y, class_weight='balanced')

Classification Accuracy	: 0.6426132003069839

  True Positive Rate for  0 	: 0.61587019776101
  False Positive Rate for 0 	: 0.08175215060333789
  False Negative Rate for 0 	: 0.38412980223899
  Precision & Recall for  0 	: 0.4480160723254646 0.61587019776101
  F1 Score for 0            	: 0.5187015845984507

  True Positive Rate for  1 	: 0.675204580578146
  False Positive Rate for 1 	: 0.3751318462922047
  False Negative Rate for 1 	: 0.324795419421854
  Precision & Recall for  1 	: 0.7029453138737471 0.675204580578146
  F1 Score for 1            	: 0.688795751077953

  True Positive Rate for  2 	: 0.5950852557673019
  False Positive Rate for 2 	: 0.18267358857884491
  False Negative Rate for 2 	: 0.4049147442326981
  Precision & Recall for  2 	: 0.6211078874166243 0.5950852557673019
  F1 Score for 2            	: 0.6078181711743356

Micro F1 Score          	: 0.6451444429660078
Macro F1 Score          	: 0.6051051689502464
Classification Accuracy	: 0.6369793365438116

  True Positiv

## Kernel Trick Test: RBF
[Back to abstraction](#Abstraction)<br>
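For reference, the three kernels tested in this and the following two sections can be written as small NumPy functions. This is a sketch of the kernel formulas as documented for `sklearn.svm.SVC`, using its defaults `degree=3`, `coef0=0.0` and `gamma='scale'` (that is, `1 / (n_features * X.var())`). The function names and the toy vectors are assumptions for illustration.

```python
import numpy as np

def scale_gamma(X):
    """gamma='scale' as documented for SVC: 1 / (n_features * X.var())."""
    return 1.0 / (X.shape[1] * X.var())

def rbf_kernel(x, z, gamma):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, gamma, degree=3, coef0=0.0):
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * np.dot(x, z) + coef0) ** degree

def sigmoid_kernel(x, z, gamma, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return np.tanh(gamma * np.dot(x, z) + coef0)

# Two hypothetical feature vectors.
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
gamma = scale_gamma(np.vstack([x, z]))
print('rbf:', rbf_kernel(x, z, gamma))
print('poly:', poly_kernel(x, z, gamma))
print('sigmoid:', sigmoid_kernel(x, z, gamma))
```

Kernel SVMs evaluate such a function for many pairs of training samples, which helps explain why each fit below takes hours while the linear models finish quickly.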

In [18]:
# Set up RBF SVM
rbf_svc = SVC(cache_size = 1024)

In [19]:
print(time.ctime())
rbf_svc.fit(train_x, train_y)
print(time.ctime())

Wed Apr  8 03:19:33 2020
Wed Apr  8 07:01:18 2020


In [20]:
print(time.ctime())
test_svm(rbf_svc, train_x, train_y, test_x, test_y)
print(time.ctime())

Wed Apr  8 07:01:18 2020
Classification Accuracy	: 0.6813986953184957

  True Positive Rate for  0 	: 0.40453865336658357
  False Positive Rate for 0 	: 0.026036193812025685
  False Negative Rate for 0 	: 0.5954613466334164
  Precision & Recall for  0 	: 0.6231082430667588 0.40453865336658357
  F1 Score for 0            	: 0.49057973205915256

  True Positive Rate for  1 	: 0.822725471515172
  False Positive Rate for 1 	: 0.49831428793964816
  False Negative Rate for 1 	: 0.17727452848482805
  Precision & Recall for  1 	: 0.6854234859446778 0.822725471515172
  F1 Score for 1            	: 0.7478244875906702

  True Positive Rate for  2 	: 0.5208312447187889
  False Positive Rate for 2 	: 0.12066466171920638
  False Negative Rate for 2 	: 0.47916875528121106
  Precision & Recall for  2 	: 0.6848974518334369 0.5208312447187889
  F1 Score for 2            	: 0.5917019199479336

Micro F1 Score          	: 0.670796938718233
Macro F1 Score          	: 0.6100353798659188
Classification Accura

## Kernel Trick Test: 3 Degree Polynomial
[Back to abstraction](#Abstraction)<br>

In [35]:
# Test for Polynomial SVM
poly_svc = SVC(kernel='poly', cache_size = 1024)

In [36]:
print(time.ctime())
poly_svc.fit(train_x, train_y)
print(time.ctime())

Sat Apr 11 01:55:52 2020
Sat Apr 11 06:26:17 2020


In [37]:
print(time.ctime())
test_svm(poly_svc, train_x, train_y, test_x, test_y)
print(time.ctime())

Sat Apr 11 06:26:17 2020
Classification Accuracy	: 0.6820606293169609

  True Positive Rate for  0 	: 0.41130430475058205
  False Positive Rate for 0 	: 0.02578959387762689
  False Negative Rate for 0 	: 0.5886956952494179
  Precision & Recall for  0 	: 0.6309749981001596 0.41130430475058205
  F1 Score for 0            	: 0.49799076350986626

  True Positive Rate for  1 	: 0.8214614222664788
  False Positive Rate for 1 	: 0.49465986507962967
  False Negative Rate for 1 	: 0.17853857773352125
  Precision & Recall for  1 	: 0.6862315213636652 0.8214614222664788
  F1 Score for 1            	: 0.7477818662282892

  True Positive Rate for  2 	: 0.5236846629986245
  False Positive Rate for 2 	: 0.1219932510383018
  False Negative Rate for 2 	: 0.4763153370013755
  Precision & Recall for  2 	: 0.6835677414528316 0.5236846629986245
  F1 Score for 2            	: 0.5930391043323057

Micro F1 Score          	: 0.6717921318229684
Macro F1 Score          	: 0.6129372446901536
Classification Accura

## Kernel Trick Test: Sigmoid
[Back to abstraction](#Abstraction)<br>

In [42]:
sig_svc = SVC(kernel='sigmoid', cache_size = 1024)

In [43]:
print(time.ctime())
sig_svc.fit(train_x, train_y)
print(time.ctime())

Wed Apr 15 12:38:15 2020
Wed Apr 15 15:26:52 2020


In [44]:
print(time.ctime())
test_svm(sig_svc, train_x, train_y, test_x, test_y)
print(time.ctime())

Wed Apr 15 15:26:52 2020
Classification Accuracy	: 0.5457453952417498

  True Positive Rate for  0 	: 0.37790871866573456
  False Positive Rate for 0 	: 0.09391681789720568
  False Negative Rate for 0 	: 0.6220912813342655
  Precision & Recall for  0 	: 0.2995211144971702 0.37790871866573456
  F1 Score for 0            	: 0.33417967456339837

  True Positive Rate for  1 	: 0.6292453307688414
  False Positive Rate for 1 	: 0.5129756054170651
  False Negative Rate for 1 	: 0.3707546692311586
  Precision & Recall for  1 	: 0.6178516996885561 0.6292453307688414
  F1 Score for 1            	: 0.623496468424792

  True Positive Rate for  2 	: 0.4522399588053553
  False Positive Rate for 2 	: 0.2227642745799896
  False Negative Rate for 2 	: 0.5477600411946447
  Precision & Recall for  2 	: 0.505993438425222 0.4522399588053553
  F1 Score for 2            	: 0.4776090092675816

Micro F1 Score          	: 0.5467833920699035
Macro F1 Score          	: 0.4784283840852573
Classification Accuracy	:

## Prediction on DrivenData Competition Dataset 
[Back to abstraction](#Abstraction)<br>

In [22]:
print(time.ctime())
compit_pred = rbf_svc.predict(test_values)
print(time.ctime())

Sat Apr 11 19:44:57 2020
Sat Apr 11 20:11:51 2020


In [23]:
compit_result = np.concatenate([[test_building_id], [compit_pred]], axis=0)

In [24]:
compit_result = np.transpose(compit_result)
compit_result.shape

(86868, 2)

In [26]:
np.savetxt("compit_answer.csv", compit_result, delimiter=",")
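
Note that `np.savetxt` defaults to `fmt='%.18e'`, so the call above writes every value in scientific notation and without a header row. The sketch below shows one way to write integer IDs and labels with a header instead. The header `building_id,damage_grade` is an assumption about the expected submission format, and the IDs and predictions are made up. It writes to an in-memory buffer so that the example stays self-contained.

```python
import io
import numpy as np

# Hypothetical building IDs and predicted damage grades.
building_ids = np.array([300051, 99355, 890251])
predictions = np.array([3, 2, 2])

# One row per building: [building_id, damage_grade].
submission = np.column_stack([building_ids, predictions])

buffer = io.StringIO()  # replace with a file path to write to disk
np.savetxt(buffer, submission, delimiter=",", fmt="%d",
           header="building_id,damage_grade", comments="")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `comments=""` stops NumPy from prefixing the header line with `# `.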

## Save the Optimal Model to Avoid Training Again
[Back to abstraction](#Abstraction)<br>

In [28]:
with open('rbf_svm_sklearn0.22.2.post1', 'wb') as rbf_file:
    pickle.dump(rbf_svc, rbf_file)
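
To reuse a saved model later, it can be loaded back with `pickle.load`. The sketch below round-trips a small stand-in model trained on synthetic data, since the real `rbf_svc` takes hours to train. The toy data and the temporary file name are assumptions for illustration. Pickled scikit-learn models are generally only reliable when loaded with the same scikit-learn version that saved them, which is presumably why the file name above records `0.22.2.post1`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Small stand-in model on synthetic data (the real model is rbf_svc).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
model = SVC(cache_size=1024).fit(X, y)

# Save the model, then load it back from disk.
path = os.path.join(tempfile.mkdtemp(), "toy_rbf_svm.pkl")
with open(path, "wb") as model_file:
    pickle.dump(model, model_file)
with open(path, "rb") as model_file:
    restored = pickle.load(model_file)

# The restored model should give the same predictions as the original.
same = np.array_equal(model.predict(X), restored.predict(X))
print('Predictions identical after reload:', same)
```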