# Performance sklearn
This notebook tries out methods from the Python machine learning package `sklearn`. Some performance tests may be added as well.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import *
from sklearn import svm
%matplotlib inline

## SVM Boundary
Learn two well-separated, imbalanced classes with a 1:5 ratio.  
Generate two classes:  
* Minority (40 samples): $x,y\sim N(0,1)$ 
* Majority (200 samples): $x,y\sim N(10,25)$ (variance 25, so standard deviation 5)

The cell below generates both classes, merges them, shuffles the rows, and plots each class.

In [None]:
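# Minority class: 40 points from N(0, 1), labelled 0
x=np.random.randn(40)
y=np.random.randn(40)
d1=np.vstack((x,y,np.zeros(40)))
# Majority class: 200 points from N(10, 25), labelled 1
x=np.random.randn(200)*5+10
y=np.random.randn(200)*5+10
d2=np.vstack((x,y,np.ones(200)))
# Stack into rows of (x, y, label) and shuffle the rows in place
data=np.vstack((d1.T, d2.T))
np.random.shuffle(data)
# Split the features back out by label for plotting
d1=data[data[:, -1]==0, :-1]
d2=data[data[:, -1]==1, :-1]
# Index columns directly; zip(*d1)[0] only works on Python 2
plt.plot(d1[:, 0], d1[:, 1], 'x')
plt.plot(d2[:, 0], d2[:, 1], '.')
plt.xlim(-5, 25)
plt.ylim(-5, 25)
plt.show()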

### Train with default SVM
Use `sklearn.svm.SVC` with its default settings (`kernel='rbf'`, `C=1`).

In [None]:
bsvm=svm.SVC()
bsvm.fit(data[:,:-1], data[:,-1])
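
Because the classes are imbalanced, overall accuracy can hide how the minority class fares. The sketch below is an illustration rather than part of the original experiment: it rebuilds the same kind of data with a fixed seed (an assumption, for repeatability) and reports per-class recall on the training set. The names `features`, `labels`, `model`, and `recall_per_class` are introduced here only for the example.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability

# Same recipe as above: 40 minority points ~ N(0, 1), 200 majority points ~ N(10, 25).
minority = rng.normal(0.0, 1.0, size=(40, 2))
majority = rng.normal(10.0, 5.0, size=(200, 2))
features = np.vstack((minority, majority))
labels = np.concatenate((np.zeros(40), np.ones(200)))

model = svm.SVC()  # defaults: RBF kernel, C=1
model.fit(features, labels)

# Rows are true classes (0 = minority, 1 = majority), columns are predictions.
cm = confusion_matrix(labels, model.predict(features), labels=[0, 1])
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(cm)
print("recall (minority, majority):", recall_per_class)
```

Recall is measured on the training data here, so it is optimistic; a held-out split would give a fairer picture.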

### The boundary
Plot the decision boundary of the trained SVM by predicting on a dense grid.

In [None]:
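# Evaluate the classifier on a dense grid covering the plot area
x=np.arange(-5,25,0.1)
y=x
X,Y=np.meshgrid(x,y)
grid_points=np.c_[X.ravel(), Y.ravel()]
Z=bsvm.predict(grid_points)
Z=Z.reshape(X.shape)
# The contour between predicted labels 0 and 1 is the decision boundary
plt.contour(x,y,Z, 1)
# Index columns directly; zip(*d1)[0] only works on Python 2
plt.plot(d1[:, 0], d1[:, 1], 'x')
plt.plot(d2[:, 0], d2[:, 1], '.')
plt.xlim(-5, 25)
plt.ylim(-5, 25)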

## One Class SVM
Next I try `one-class SVM`, a good method for extremely imbalanced data: it learns a representation of the majority class and treats minority samples as outliers.  
The boundary of `one-class SVM` is demonstrated below.  
Note that with an `RBF` kernel, where $\gamma = 1/(2\sigma^2)$, extreme $\gamma$ values give extreme results:
* $\gamma \rightarrow +\infty$: the kernel becomes so narrow that the boundary wraps around individual training points, i.e. severe **over-fitting**.
* $\gamma \rightarrow 0$: the kernel becomes nearly constant, so the model is too smooth to capture the shape of the data (**under-fitting**).

In [None]:
def foo(gamma):
    # Fit on the majority class only; OneClassSVM does not use labels
    vsvm=svm.OneClassSVM(kernel='rbf', gamma=gamma)
    vsvm.fit(d2)
    # Predict on a dense grid: +1 for inliers, -1 for outliers
    x=np.arange(-5,25,0.1)
    y=x
    X,Y=np.meshgrid(x,y)
    grid_points=np.c_[X.ravel(), Y.ravel()]
    Z=vsvm.predict(grid_points)
    Z=Z.reshape(X.shape)
    plt.contour(x,y,Z, 1)
    # Index columns directly; zip(*d1)[0] only works on Python 2
    plt.plot(d1[:, 0], d1[:, 1], 'x')
    plt.plot(d2[:, 0], d2[:, 1], '.')
    plt.xlim(-5, 25)
    plt.ylim(-5, 25)


plt.gcf().set_size_inches(12,4)
plt.subplot(121)
foo(5)
plt.title(r'$\gamma = 5$, over-fitting')
plt.subplot(122)
foo(1e-9)
plt.title(r'$\gamma = 10^{-9}$, under-fitting')
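
To see why extreme $\gamma$ values behave so differently, it can help to look at the kernel value itself. This is a minimal sketch, assuming the scikit-learn parameterisation $k(a,b)=\exp(-\gamma\|a-b\|^2)$; the two points and the list of $\gamma$ values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two points one unit apart; the RBF kernel is exp(-gamma * ||a - b||^2).
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

gammas = [1e-9, 0.01, 1.0, 5.0, 100.0]
similarities = [rbf_kernel(a, b, gamma=g)[0, 0] for g in gammas]
for g, k in zip(gammas, similarities):
    print(f"gamma={g:g}: k(a, b)={k:.6f}")
```

As $\gamma$ shrinks the value approaches 1 for every pair, so all points look alike to the model; as $\gamma$ grows it approaches 0 for any two distinct points, so each training sample only influences its immediate neighbourhood.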

## Decision Tree and Small Disjuncts
We use a decision tree to find small disjuncts (leaves that cover only a few training samples) and see how they affect classifier performance on an imbalanced problem.

In [None]:
import json
import pandas as pd
from sklearn import tree
import seaborn as sns


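# conf.json is expected to hold the data directory under the "dpath" key
with open('../../conf.json') as f:
    conf=json.load(f)
pima=pd.read_csv(conf["dpath"]+'/pima/pima.data')
# Class balance, then pairwise feature plots coloured by class
pima['class'].value_counts().plot(kind='bar')
sns.pairplot(pima, hue='class')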


In [None]:
# .ix was removed from pandas; .iloc selects by position
X=pima.iloc[:, 0:8]
y=pima.iloc[:, 8]
clf=tree.DecisionTreeClassifier()
clf.fit(X, y)

In [None]:
from io import StringIO
from IPython.display import Image
import pydot


# Export the fitted tree to Graphviz DOT text held in memory
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=list(pima.columns[:8]),
                     class_names=['negative', 'positive'],
                     filled=True, rounded=True,
                     special_characters=True)
# Recent pydot versions return a list of graphs; take the first
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
Image(graph.create_png())
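
Beyond reading the rendered tree by eye, small disjuncts can be counted from the fitted tree's leaf sizes. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the Pima file (sizes chosen only to loosely resemble it), and both the name `demo_clf` and the threshold `SMALL = 5` are hypothetical choices, not values from this notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data: 8 features, roughly 2:1 class ratio (assumed).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

demo_clf = tree.DecisionTreeClassifier(random_state=0)
demo_clf.fit(X_demo, y_demo)

fitted = demo_clf.tree_
is_leaf = fitted.children_left == -1          # leaves have no children
leaf_sizes = fitted.n_node_samples[is_leaf]   # training samples per leaf

SMALL = 5  # hypothetical threshold for calling a leaf a "small disjunct"
small = leaf_sizes <= SMALL
print("leaves:", is_leaf.sum())
print("small disjuncts:", small.sum())
print("samples covered by small disjuncts:", leaf_sizes[small].sum())
```

The same leaf-size check should apply to the `clf` fitted on the Pima data above, by reading `clf.tree_` instead of `demo_clf.tree_`.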