# Support Vector Machine

Support vector machines (SVMs) are supervised learning algorithms for both classification and regression.
They are discriminative classifiers: rather than modeling each class, they find a line or curve (in two dimensions) or a manifold (in higher dimensions) that separates the classes from each other.

Data points from different classes are separated by a decision boundary (a straight line or hyperplane when the SVM uses a linear kernel) that is surrounded by a margin.
The margin is widened until it "touches" some data points.
These data points are called "support vectors", and they are the only training points that determine the boundary and therefore future predictions. Points that lie outside the margin, on the correct side of the boundary, do not influence the prediction.
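
The role of the support vectors can be checked on toy data. The sketch below is an illustration only: it uses synthetic clusters from `make_blobs` rather than the Titanic data, and the variable names (`X_toy`, `clf_all`, `clf_sv`) are made up for this example. It fits a linear SVM, refits it on the support vectors alone, and compares the predictions of the two models.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two well-separated 2-D clusters (illustrative, not the Titanic data).
X_toy, y_toy = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

# A very large C approximates a hard margin on separable data.
clf_all = SVC(kernel='linear', C=1e10).fit(X_toy, y_toy)

# Refit using only the support vectors found by the first model.
sv_idx = clf_all.support_
clf_sv = SVC(kernel='linear', C=1e10).fit(X_toy[sv_idx], y_toy[sv_idx])

print('support vectors: {} of {} points'.format(len(sv_idx), len(X_toy)))
print('same predictions:',
      np.array_equal(clf_all.predict(X_toy), clf_sv.predict(X_toy)))
```

If the two models agree on every point, the remaining training points were not needed to place the boundary.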

The C parameter controls how tolerant the margin is of data points that fall inside it. The smaller C is, the more tolerant (softer) the margin.
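
One way to see the effect of C is to count support vectors for a small and a large value. The following is a minimal sketch on synthetic, overlapping clusters (not the Titanic data); the two C values are arbitrary choices for illustration. A softer margin typically lets more points sit inside it, so the smaller C should give at least as many support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping toy clusters, so some points have to end up inside the margin.
X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.2)

n_support = {}
for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X_toy, y_toy)
    # Every point on or inside the margin becomes a support vector.
    n_support[C] = len(clf.support_vectors_)
    print('C={:<6} support vectors: {}'.format(C, n_support[C]))
```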

**References**
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html)
* [An Idiot's guide to Support vector machines (SVMs)](http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf)

## Load Data

In [7]:
import pandas as pd

data_path = 'input/'

df_train = pd.read_csv(data_path + 'train.csv')
df_test = pd.read_csv(data_path + 'test.csv')

df_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Data processing and model training

In [8]:
# Preprocessing
dv_train_X = df_train.drop(['PassengerId','Survived'], axis=1).values
dv_train_y = df_train['Survived'].values

In [9]:
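# Prepare training set: hold out 25% of the rows, stratified on the target
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    dv_train_X, dv_train_y, test_size=0.25, random_state=1, stratify=dv_train_y)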

In [None]:
# Drop the non-numeric columns (and the PassengerId identifier)
cols_to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
dv_train_X = df_train.drop(cols_to_drop + ['Survived'], axis=1)

# Convert the remaining string columns to numbers (one-hot encoding)
dv_train_X = pd.get_dummies(dv_train_X)

dv_train_y = df_train['Survived'].values

In [14]:
# Grid search to find best parameter values
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma' : [0.001, 0.01, 0.1, 1]
}

grid_svc = GridSearchCV(svm.SVC(), param_grid, cv=10, scoring='accuracy')
grid_svc.fit(dv_train_X, dv_train_y)

print('Best score: {}'.format(grid_svc.best_score_))
print('Best parameters: {}'.format(grid_svc.best_params_))

ValueError: 
All the 720 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
720 fits failed with the following error:
Traceback (most recent call last):
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\svm\_base.py", line 197, in fit
    X, y = validate_data(
           ~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<5 lines>...
        accept_large_sparse=False,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 2961, in validate_data
    X, y = check_X_y(X, y, **check_params)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1370, in check_X_y
    X = check_array(
        X,
    ...<12 lines>...
        input_name="X",
    )
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1107, in check_array
    _assert_all_finite(
    ~~~~~~~~~~~~~~~~~~^
        array,
        ^^^^^^
    ...<2 lines>...
        allow_nan=ensure_all_finite == "allow-nan",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 120, in _assert_all_finite
    _assert_all_finite_element_wise(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        X,
        ^^
    ...<4 lines>...
        input_name=input_name,
        ^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 169, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
SVC does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
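
The failure above is caused by missing values in the feature matrix (the preview of `df_train` shows an empty `Age` entry, for example). One possible fix, sketched here under assumptions, is to place an imputer in front of the SVC inside a pipeline so that each cross-validation fold is imputed separately. The data below is a small synthetic stand-in, not the Titanic file; the median strategy, the `StandardScaler` step, and the reduced grid are choices made for this illustration only. Note that grid keys need the `svc__` prefix, which is the step name `make_pipeline` assigns automatically.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny synthetic stand-in for the Titanic features (not the real data).
rng = np.random.default_rng(0)
n = 80
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Sex': rng.choice(['male', 'female'], size=n),
    'Age': rng.uniform(1, 70, size=n),
    'Fare': rng.uniform(5, 100, size=n),
})
# Inject missing ages to reproduce the NaN problem.
toy.loc[rng.choice(n, size=15, replace=False), 'Age'] = np.nan
toy_y = (toy['Sex'] == 'female').astype(int).values  # synthetic target

# One-hot encode the string column; NaN values in Age are still present here.
toy_X = pd.get_dummies(toy, dtype=float)

# Impute, scale, then classify; the imputer is refit on each training fold.
pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler(), SVC())
toy_grid = {'svc__kernel': ['linear', 'rbf'], 'svc__C': [0.1, 1, 10]}

toy_search = GridSearchCV(pipe, toy_grid, cv=5, scoring='accuracy')
toy_search.fit(toy_X, toy_y)

print('Best score: {}'.format(toy_search.best_score_))
print('Best parameters: {}'.format(toy_search.best_params_))
```

Because the imputer lives inside the pipeline, the fitted search object can be applied to new rows that also contain NaN without a separate preprocessing step.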


In [None]:
# Model training
svc = svm.SVC(**grid_svc.best_params_).fit(X_train, y_train)

## Model parameters

In [None]:
# Feature importance (for linear kernels only)
try:
    f_name = df_train.drop(['PassengerId','Survived'], axis=1).columns.values
    f_score = map(lambda x: -x.round(2), svc.coef_[0])
    
    print('{:<10}{:16}{:>10}'.format('RANK', 'FEATURE', 'SCORE'))
    for i, f in enumerate(sorted(zip(f_name, f_score), key=lambda x: x[1], reverse=True)):
        print('{:<10}{:16}{:10}'.format(i+1, f[0], f[1]))
    
except AttributeError:
    # coef_ is only defined when the SVC uses a linear kernel
    print('non-linear kernels are not supported')

non-linear kernels are not supported
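
For comparison, the sketch below shows what the ranking looks like when the kernel is linear. It is an illustration on synthetic data with hypothetical feature names (`f_strong`, `f_weak`, `f_noise`), and it ranks by absolute weight, which is a different choice from the signed score used in the cell above. Weights are only comparable across features when the features are on similar scales, which holds here by construction.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical feature names, not the Titanic columns).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(200, 3)),
                     columns=['f_strong', 'f_weak', 'f_noise'])
# The target depends strongly on f_strong, weakly on f_weak, not on f_noise.
toy_y = (3 * toy_X['f_strong'] + 0.5 * toy_X['f_weak'] > 0).astype(int).values

linear_svc = SVC(kernel='linear').fit(toy_X, toy_y)

# Rank features by the absolute weight in the separating hyperplane.
weights = pd.Series(linear_svc.coef_[0], index=toy_X.columns)
ranking = weights.abs().sort_values(ascending=False)
print(ranking.round(2))

# With a non-linear kernel there is no weight vector in the input space,
# so accessing coef_ raises AttributeError and hasattr returns False.
rbf_svc = SVC(kernel='rbf').fit(toy_X, toy_y)
print('rbf model exposes coef_:', hasattr(rbf_svc, 'coef_'))
```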


In [None]:
# Number of support vectors
print('number of support vectors: {}'.format(len(svc.support_vectors_)))

number of support vectors: 264


## Score

In [None]:
# Test set score
# svc.score returns a single accuracy value, so there is no spread to report
testset_score = svc.score(X_test, y_test)
print('Accuracy with test set: {}'.format(round(testset_score, 2)))

Accuracy with test set: 0.85


In [None]:
# Cross-validation score
from sklearn.model_selection import cross_val_score

cv_iterations = 10  # number of folds
cv_score = cross_val_score(svc, dv_train_X, dv_train_y, cv=cv_iterations)
print('Accuracy with cross-validation (split size = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

Accuracy with cross-validation (split size = 10): 0.84 (+/- 0.07)


## Test set prediction

In [None]:
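# Prediction on test set
# Assumption: the model was trained on the one-hot-encoded columns of dv_train_X,
# so the test features are encoded the same way and aligned to those columns.
# Missing values (e.g. in Age) still need the same treatment as in training.
dv_test_X = pd.get_dummies(df_test.drop(cols_to_drop, axis=1))
dv_test_X = dv_test_X.reindex(columns=dv_train_X.columns, fill_value=0).values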

test_prediction_results = pd.DataFrame(
    data={'PassengerId': df_test['PassengerId'].values,
          'Survived': svc.predict(dv_test_X).astype(int)})

# Write results to a csv file
test_prediction_results.to_csv(data_path+'support-vector-machine.csv', index=False)