##### Today we are going to learn how to clean data for an SVM.


Resources: 
This notebook follows https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

For Easy.py see http://www.developerstation.org/2011/03/simple-tutorial-on-using-libsvm.html

### Transform the data to the format of an SVM package
We will use `sklearn.svm`, which requires a list of data vectors and a list of labels.
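
As a minimal sketch of that input format (the toy points and labels below are invented for illustration, not taken from the task data), `SVC.fit` takes one feature vector per sample and one label per sample, in the same order:

```python
from sklearn.svm import SVC

# Toy data (invented): four samples with two features each
X = [[0.0, 0.0], [0.5, 0.0], [3.0, 3.0], [3.5, 3.0]]
# One label per sample, in the same order as the rows of X
y = [0, 0, 1, 1]

toy_clf = SVC(kernel='linear').fit(X, y)

# predict also takes a list of feature vectors and returns one label per vector
predictions = toy_clf.predict([[0.2, 0.1], [3.2, 2.9]])
print(predictions)
```

The CSV files read in the next cell end up in this same shape: a list of rows for the features and a flat vector for the labels.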

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.svm import SVC
import numpy as np
import csv

pathx='C:/Users/FSU/Documents/HW/data/task1_1_learn_X.csv'    
pathy='C:/Users/FSU/Documents/HW/data/task1_1_learn_y.csv'
pathx_2='C:/Users/FSU/Documents/HW/data/task1_1_test_X.csv'

with open(pathx, newline='') as csvfile:
    datax = csv.reader(csvfile, delimiter=' ', quoting = csv.QUOTE_NONNUMERIC)
    dataX_train = list(datax)
with open(pathx_2, newline='') as csvfile:
    datax2 = csv.reader(csvfile, delimiter=' ', quoting = csv.QUOTE_NONNUMERIC)
    dataX_test = list(datax2)
with open(pathy, newline='') as csvfile:
    datay = csv.reader(csvfile, delimiter=' ', quoting = csv.QUOTE_NONNUMERIC)
    dataY_train =np.ravel(list(datay))

print("Data is in the correct format for sklearn.svm")
print("Input format: list of vectors to train, vector of labels.")
print("_----------------------------------_")
print("Testing that the input is correct:")
test_clf =  SVC(kernel = 'rbf', class_weight = 'balanced')
test_clf = test_clf.fit(dataX_train[:200], dataY_train[:200])
test_y_predict = test_clf.predict(dataX_train[200:300])                       
print(confusion_matrix(dataY_train[200:300], test_y_predict, labels = range(2)))
print('accuracy: ',accuracy_score(dataY_train[200:300], test_y_predict))
print('f1: ' ,f1_score(dataY_train[200:300], test_y_predict))
print("Test classifier input passed.")
print("_----------------------------------_")

Above we ran the SVC on the raw data and it worked. Below we will remove outliers and normalize the data. It is important to apply the same normalization to the test data.

In [None]:
#let's work with numpy arrays
X_train = np.array(dataX_train)
Y_train = np.array(dataY_train)
X_test = np.array(dataX_test)


### Are there outliers?
We will use IsolationForest to find outliers. I'll explain more about Isolation Forest in a future Jupyter notebook.
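
The sketch below shows, on synthetic data invented for illustration, the one convention that matters here: `IsolationForest.predict` returns `1` for points it treats as inliers and `-1` for points it treats as outliers, so a boolean mask on `== 1` keeps the inliers. The planted far-away points and `random_state=0` are assumptions made for the demo.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 synthetic points around the origin plus 4 planted far-away points
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0]])
points = np.vstack([cluster, planted])

forest = IsolationForest(random_state=0).fit(points)
flags = forest.predict(points)   # 1 = inlier, -1 = outlier

# Keep only the rows flagged as inliers
inliers = points[flags == 1]
print(len(points), len(inliers))
```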
### Is the data balanced?
We have two classes and we need to make sure the ratio between them is close to 1 (or, equivalently, that the number of samples in one class divided by the total number of samples is close to .5).
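
A small sketch of that check, using a hypothetical label vector invented for illustration (the real labels come from the training file):

```python
import numpy as np

# Hypothetical 0/1 labels: six samples of class 0 and four of class 1
demo_labels = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])

classes, counts = np.unique(demo_labels, return_counts=True)
fractions = counts / counts.sum()          # share of each class in the data
ratio = counts.min() / counts.max()        # 1.0 means perfectly balanced

for cls, frac in zip(classes, fractions):
    print('class {0}: {1:.2f}'.format(cls, frac))
print('minority/majority ratio: {0:.2f}'.format(ratio))
```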


In [None]:
from sklearn.ensemble import IsolationForest
# fit the model
clf = IsolationForest(max_samples=250)
clf = clf.fit(X_train)
y_pred_train = clf.predict(X_train)

X_no_Outliers = X_train[y_pred_train==1]
Y_no_Outliers = Y_train[y_pred_train==1]
print("Outliers removed")
print("++++++++++++++++++++++++++++++++++++++++++++++")
print(len(X_train), len(X_no_Outliers))
print('fraction of points of class 1: {0}'.format(np.mean(Y_no_Outliers == 1)))

Since $.47$ is close to $.5$ we don't need to weight the data.
If needed, we can add class weights to the SVM, for example `svm.SVC(kernel='linear', class_weight={1: 10})`.


Our current data sets are "X_no_Outliers", "Y_no_Outliers" and "X_test".

### Conduct simple scaling on the data
Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order.
In `preprocessing` there is a class called `StandardScaler` that finds the coefficients needed to center our features and give them standard deviation 1. We use it to transform our training data and our test data with the same parameters.
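
To make "the same parameters" concrete, here is a minimal sketch on toy numbers invented for illustration. The scaler is fitted on the training rows only, and the test row is transformed with the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (invented): three training rows and one test row
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

demo_scaler = StandardScaler().fit(train)   # parameters come from train only
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# The same transformation written out by hand: (x - mean) / std of the training data
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(test_scaled, manual)
```

Note that the scaled test row is not itself centered; only the training data has mean 0 and standard deviation 1 after the transform.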

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_no_Outliers)

print(scaler.mean_)
print(scaler.scale_)
dataX = scaler.transform(X_no_Outliers)
dataY = Y_no_Outliers
X_test = scaler.transform(X_test)

print("_______________________________________________________________________________")


Our current data sets are 'dataX', 'dataY' and 'X_test'. From 'dataX' we found the parameters and we used them to scale 'dataX' and 'X_test'.

### Consider the RBF kernel $K(x,y)=e^{-\gamma\|x-y\|^2}$
When the number of features is very large, one may just use the linear kernel.
We split the training set to have a way to evaluate performance.
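
As a quick sanity check of the RBF kernel in its usual form, $K(x,y)=e^{-\gamma\|x-y\|^2}$, the sketch below evaluates it by hand and compares the result with `sklearn.metrics.pairwise.rbf_kernel`. The two vectors and the value of `gamma` are invented for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-gamma * np.dot(diff, diff))

x = [1.0, 2.0]
y = [2.0, 4.0]
gamma = 0.1

manual_value = rbf(x, y, gamma)                       # squared distance is 1 + 4 = 5
library_value = rbf_kernel([x], [y], gamma=gamma)[0, 0]
print(manual_value, library_value)
```

A larger `gamma` makes the kernel fall off faster with distance, which is why it is tuned together with `C`.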


### Use cross-validation to find the best parameter $C$, $\gamma$
Here we tell the grid search to use accuracy as the score for comparing parameters. You will need to modify this to test the 4 different scores available in the first task.
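
As a sketch of how the score can be swapped, the example below runs a small grid search with F1 instead of accuracy. F1 is used purely as an example, since the four scores of the task are not listed here; the synthetic data from `make_classification` and the tiny grid are assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic two-class data, invented for illustration
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# Wrap a different metric; only this line changes when switching scores
f1_scorer = make_scorer(f1_score)

demo_cv = KFold(n_splits=4, shuffle=True, random_state=0)
demo_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}
search = GridSearchCV(SVC(kernel='rbf'), demo_grid, scoring=f1_scorer, cv=demo_cv)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

For the common metrics, `scoring` also accepts a string such as `'accuracy'` or `'f1'` in place of a scorer object.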

After we get the best parameters, we repeat the search with parameters close to those previously found.
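
One possible way to build the finer grid is sketched below. It uses multiplicative steps around the best values, in the spirit of the exponentially growing sequences suggested by the guide linked under Resources. This is a swapped-in variant, not the additive offsets used in the code cell; the helper name `neighbours`, the factors, and the example values are hypothetical.

```python
def neighbours(value, factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Return candidates around `value`; multiplying keeps them positive."""
    return [value * f for f in factors]

# Hypothetical best values from a coarse search
best_C, best_gamma = 100.0, 0.01

fine_grid = {
    'C': neighbours(best_C),
    'gamma': neighbours(best_gamma),
    'kernel': ['rbf'],
}
print(fine_grid)
```

Because every candidate is a positive multiple of the coarse optimum, the grid never produces a zero or negative `C` or `gamma`, which `SVC` would reject.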

In [None]:
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer

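# Train on all the cleaned training data; no labels are loaded for X_test, so there is no testing_Y
training_X, testing_X, training_Y = dataX, X_test, dataY
# Alternative with a held-out split of the training data:
# training_X, testing_X, training_Y, testing_Y = train_test_split(dataX, dataY, train_size = .90, random_state = 1)

accuracy_scorer = make_scorer(accuracy_score, greater_is_better=True)  # make_scorer wraps a metric so GridSearchCV can use it as a custom score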
inner_cv = KFold(n_splits=4, shuffle=True)
param_grid =[ {'C':[.00001, .001, 1, 1e2, 1e3, 1e4, 1e5], 'kernel':['linear']},
{'C':[.00001, .01, 1, 1e2, 1e3, 1e4, 1e5],
             'gamma':[.0001, .001, .01, .1, 1, 10, 100],'kernel':['rbf'] },
]

clf = GridSearchCV( SVC( class_weight = 'balanced'), param_grid, scoring = accuracy_scorer, cv = inner_cv)# lets you determine other scoring
clf = clf.fit(training_X, training_Y)
print("Classifier parameters found")
print(clf.best_params_)
print("Now we make another search close to the previous parameters.")

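e = np.e  # base of the natural logarithm, used for the offsets below
c_v = clf.best_params_['C']
# 'gamma' is absent from best_params_ when the linear kernel wins the first search;
# the fallback value .1 (middle of the first grid) is an assumption
g_v = clf.best_params_.get('gamma', .1)
# C and gamma must stay positive, so drop candidates that the subtraction pushed to zero or below
c_grid = [c for c in (c_v, c_v+e**(-9), c_v+e**(-5), c_v-e**(-5), c_v-e**(-9)) if c > 0]
g_grid = [g for g in (g_v, g_v+e**(-7), g_v+e**(-10), g_v-e**(-7), g_v-e**(-10)) if g > 0]
param_grid = [{'C': c_grid, 'gamma': g_grid, 'kernel': ['rbf']}]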

clf = GridSearchCV( SVC( class_weight = 'balanced'), param_grid, scoring = accuracy_scorer, cv = inner_cv)# lets you determine other scoring
clf = clf.fit(training_X, training_Y)
print("Classifier parameters found")
print(clf.best_params_)




print("Grid scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.6f (+/-%0.03f) for %r"% (mean, std * 2, params))


### Use the best parameter $C$ and $\gamma$ to train the whole training set
After the best $(C, \gamma)$ is found, the classifier is trained again on the whole training set
to generate the final classifier.
The above approach works well for problems with thousands or more data points.
For very large data sets a feasible approach is to randomly choose a subset of the
data set, conduct grid-search on them, and then do a better-region-only grid-search
on the complete data set.
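
The subset-then-full strategy described above can be sketched as follows. The synthetic data, the subset size, and both grids are assumptions chosen to keep the example fast; they are not tuned for the task data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a large data set
X_big, y_big = make_classification(n_samples=600, n_features=5, random_state=0)

# Step 1: coarse grid search on a random subset
rng = np.random.default_rng(0)
subset = rng.choice(len(X_big), size=150, replace=False)
coarse = GridSearchCV(SVC(kernel='rbf'),
                      {'C': [0.01, 1, 100], 'gamma': [0.001, 0.1, 10]}, cv=3)
coarse.fit(X_big[subset], y_big[subset])
c0, g0 = coarse.best_params_['C'], coarse.best_params_['gamma']

# Step 2: finer search around the better region, on the complete data set
fine = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [c0 / 2, c0, c0 * 2], 'gamma': [g0 / 2, g0, g0 * 2]}, cv=3)
fine.fit(X_big, y_big)

# With the default refit=True, best_estimator_ is already retrained on all of X_big
final_model = fine.best_estimator_
print(fine.best_params_)
```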

In [None]:

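# Build the final classifier with the best parameters and train it on the whole training set
clf = SVC(C= clf.best_params_['C'], gamma= clf.best_params_['gamma'], kernel = clf.best_params_['kernel'],  class_weight = 'balanced').fit(training_X, training_Y)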

### Test
Now we use it on the new data and save the predictions to a file.
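
A minimal sketch of the saving step, with hypothetical predictions and a temporary file so that nothing in the working directory is overwritten. `csv.writer.writerow` expects a sequence of fields, so each single label is wrapped in a list:

```python
import csv
import os
import tempfile

import numpy as np

# Hypothetical predictions, invented for illustration
demo_predictions = np.array([0, 1, 1])

demo_path = os.path.join(tempfile.mkdtemp(), 'predictions_demo.csv')
with open(demo_path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ')
    for label in demo_predictions:
        writer.writerow([int(label)])   # one label per row

# Read the file back to check what was written
with open(demo_path, newline='') as csvfile:
    rows = list(csv.reader(csvfile, delimiter=' '))
print(rows)
```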

In [None]:

print("+++++++++++++++++++++++++++++++")
print("Evaluation on new data.")
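presults = clf.predict(testing_X)  # testing_X is the scaled X_test
name = 'experiment.csv' 
with open(name, 'w', newline='') as csvfile:
    writing = csv.writer(csvfile, delimiter = ' ')
    for line in presults:
        writing.writerow([line])  # writerow expects a sequence, so wrap the single label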
print("data saved")

#### There is already a script that does all of this for you: easy.py. It is part of libsvm, but it expects the data in a different format.