# Classification using Scikit-Learn

This notebook may or may not run as-is on the Docker data science container at pb.csc.fi.

To run it locally, install recent versions of scikit-learn, bokeh, numpy and pandas.

In [4]:
import numpy as np

from bokeh.plotting import figure, show, save

from bokeh.io import output_notebook, output_file
import pandas as pd

In [3]:
output_notebook()

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from bokeh.layouts import column
from bokeh.models import Range1d

# All three plots share the same axis range so the effect of scaling is visible
range_ = Range1d(-10, 10)

# One blob of 100 points centred at (2, 2) with a standard deviation of 3
dims, labels = make_blobs(n_samples=100, n_features=2, centers=1, cluster_std=3, center_box=(2, 2), shuffle=True)
df_1 = pd.DataFrame(dims, columns=["first", "second"])

# Centre the data (subtract the mean) but keep the original spread
dims_scaler_1 = StandardScaler(copy=True, with_mean=True, with_std=False)
df_2 = pd.DataFrame(dims_scaler_1.fit_transform(dims), columns=["first", "second"])

# Centre the data and scale it to unit variance
dims_scaler_2 = StandardScaler(copy=True, with_mean=True, with_std=True)
df_3 = pd.DataFrame(dims_scaler_2.fit_transform(dims), columns=["first", "second"])

def scatter_plot(df, title):
    # bokeh.charts is no longer part of Bokeh, so plot with bokeh.plotting instead
    p = figure(title=title, x_range=range_, y_range=range_)
    p.scatter(df["first"], df["second"])
    return p

show(column(scatter_plot(df_1, "Raw"), scatter_plot(df_2, "Centred"), scatter_plot(df_3, "Centred and scaled")))
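
The effect of the two scalers can also be checked numerically rather than visually. The following is a minimal sketch, not part of the original notebook; the names `points`, `centred` and `standardised` and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same kind of blob as above; random_state is an added assumption for repeatability
points, _ = make_blobs(n_samples=100, n_features=2, centers=1,
                       cluster_std=3, center_box=(2, 2), random_state=0)

# with_std=False only subtracts the column means
centred = StandardScaler(with_mean=True, with_std=False).fit_transform(points)
# with_std=True additionally divides by the column standard deviations
standardised = StandardScaler(with_mean=True, with_std=True).fit_transform(points)

for name, arr in [("raw", points), ("centred", centred), ("standardised", standardised)]:
    print(name, "mean:", arr.mean(axis=0).round(3), "std:", arr.std(axis=0).round(3))
```

Centring should leave the standard deviation unchanged and move the mean to zero, while full standardisation should also bring the standard deviation of each column to one.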


## Iris

See [here](http://archive.ics.uci.edu/ml/datasets/Iris) for more info about the dataset. The iris dataset is a classic in the machine learning community. 

[Scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)

In [17]:
from sklearn import datasets
iris = datasets.load_iris()

scaler = StandardScaler()
scaler = scaler.fit(iris.data)
iris.data = scaler.transform(iris.data)

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target)
clf = DummyClassifier()
clf.fit(X_train, Y_train)
clf.score(X_test,Y_test)


0.31578947368421051

In [18]:
from sklearn.svm import LinearSVC
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target)
clf = LinearSVC()
clf.fit(X_train, Y_train)
clf.score(X_test,Y_test)

0.97368421052631582

In [19]:
from sklearn.svm import SVC
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target)
clf = SVC()
clf.fit(X_train, Y_train)
clf.score(X_test,Y_test)

0.92105263157894735

## Cross-validation

[Overfitting](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) is a real problem when training ML models. For this reason it is usually necessary either to withhold a third set of data for validation or to use [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29).

See also the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/cross_validation.html).
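
To make the idea concrete, here is a rough sketch of what k-fold cross-validation does under the hood: the data is split into folds, and each fold in turn serves as the test set for a model trained on the rest. This is an illustration rather than the notebook's own code; the variable names are assumptions, and it relies on `cross_val_score` using stratified folds for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()

# Split into 3 folds that preserve the class proportions
folds = StratifiedKFold(n_splits=3)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    # Train a fresh copy of the model on two folds, score it on the third
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function should give the same per-fold accuracies
library_scores = cross_val_score(model, X, y, cv=3)
print(manual_scores)
print(library_scores)
```

Every sample is used for testing exactly once, so the mean of the fold scores is a less optimistic estimate than the score on the training data.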




In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()

clf1 = SVC()
clf2 = KNeighborsClassifier()

score1 = cross_val_score(clf1, iris.data, iris.target, cv=3)
score2 = cross_val_score(clf2, iris.data, iris.target, cv=3)


In [27]:
print(score1)
print(score2)

[ 0.96        0.93333333]
[ 0.96  0.92]


## Cancer data

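[Description](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)) of the dataset. Note that `load_breast_cancer` in scikit-learn loads the related Wisconsin Diagnostic variant (569 samples, 30 features) rather than the original one.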



In [40]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target)
#clf = DummyClassifier()
#clf.fit(X_train, Y_train)
#clf.score(X_test,Y_test)
cancer.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0,

In [47]:
#example
from sklearn.svm import LinearSVC
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target)
clf = LinearSVC()
clf.fit(X_train, Y_train)
clf.score(X_test,Y_test)




0.74825174825174823

In [45]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Columns 2-10 of the CSV are used as features, column 11 as the class label
data = pd.read_csv("./cancer.csv", usecols=range(2, 11)).to_numpy()
# Map the labels to integers and flatten to the 1-D array scikit-learn expects
target = (pd.read_csv("./cancer.csv", usecols=[11])
          .replace({"malignant": 1, "benign": 2})
          .astype(int).to_numpy().ravel())
print(target)

x1, x2, y1, y2 = train_test_split(data, target)
clf = SVC()
print(x1.shape, y1.shape)
clf.fit(x1, y1)
clf.score(x2, y2)

## 20 Newsgroups

The [20 newsgroups](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups) dataset contains discussions from 20 Usenet news discussion groups.

As the data is text and not numeric, we first need to transform it into a numeric format as a preprocessing step. To do so we weight each term by its [inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). This way terms that appear in only a few documents get larger weights and terms that appear in most documents get smaller weights.
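
A toy example may help to see the weighting in action. The three-sentence corpus below is made up for illustration and is not taken from the newsgroups data; the sketch only inspects the fitted `idf_` values of a default `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "the" occurs in every document, "rocket" in only one
docs = [
    "the rocket reached orbit",
    "the hockey team won the game",
    "the graphics card renders the game",
]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)

# Map each term to its inverse document frequency
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}
for term in ["the", "game", "rocket"]:
    print(term, round(idf[term], 3))
```

The term found in all three documents should end up with the smallest idf value, the term found in two documents with a larger one, and the term found in a single document with the largest.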

In [17]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
#cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train')#, categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test')#, categories=cats)
vectorizer = TfidfVectorizer()
vectorized = vectorizer.fit_transform(newsgroups_train.data)
vectorized_test = vectorizer.transform(newsgroups_test.data)
print(vectorized.shape)
print(vectorized.count_nonzero())
print(vectorized.count_nonzero()/(vectorized.shape[0]*vectorized.shape[1]))#ratio

(11314, 130107)
1787565
0.001214353154362896


In [14]:
vectorized.size

1787565

As you can see, the dataset has roughly 11 000 rows and 130 000 columns. The matrix is [sparse](https://en.wikipedia.org/wiki/Sparse_matrix), 
as is typical for text matrices. 

The original file in zipped format is 17 megabytes. Assuming 4 bytes per cell (i.e. a single precision float), a dense 
representation of the data would take up 11314 * 130107 * 4 bytes, or roughly 5.9 gigabytes! Yet only roughly 0.1% of the cells are nonzero. 
Fortunately scikit-learn supports sparse matrices (through SciPy).
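
The arithmetic can be spelled out as follows. This is a back-of-the-envelope sketch using the shape and nonzero count printed above; the 4-byte values and 4-byte indices of the compressed sparse row (CSR) layout are assumptions, and the real matrix may use wider types.

```python
rows, cols, nnz = 11314, 130107, 1787565

# Dense storage: every cell takes 4 bytes, zero or not
dense_bytes = rows * cols * 4

# CSR storage: one value and one column index per nonzero cell,
# plus one row pointer per row (and one extra at the end)
csr_bytes = nnz * (4 + 4) + (rows + 1) * 4

print(f"dense: {dense_bytes / 1e9:.1f} GB")   # dense: 5.9 GB
print(f"sparse: {csr_bytes / 1e6:.1f} MB")    # sparse: 14.3 MB
print(f"nonzero ratio: {nnz / (rows * cols):.4f}")
```

Under these assumptions the sparse representation is several hundred times smaller than the dense one.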

Any assumptions about the data columns being normally distributed can be thrown out of the window. 

Fortunately a Naive Bayes classifier can be fitted with a single read-through of the data.

In [None]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB(alpha=0.01)
classifier.fit(vectorized, newsgroups_train.target)

classifier.score(vectorized_test, newsgroups_test.target)

0.83523632501327671

In [None]:
#from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
# TruncatedSVD accepts the sparse tf-idf matrix directly; reduce it to 200 components
svd = TruncatedSVD(n_components=200)
svd.fit(vectorized)
svd_vectorized = svd.transform(vectorized)
svd_vectorized_test = svd.transform(vectorized_test)
clf = SVC()
# Train and evaluate on the reduced features (the original cell used undefined pca_* names)
clf.fit(svd_vectorized, newsgroups_train.target)
clf.score(svd_vectorized_test, newsgroups_test.target)

## Handwritten digits

The [handwritten digits](http://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html) dataset benefits considerably from scaling.

Try it out without the scaling feature!

In [22]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

digits = load_digits()
# Standardise each pixel feature; pass digits.data instead of X to try it without scaling
X = scale(digits.data)
clf = SVC()
# cv=3 matches the three scores shown below
cross_val_score(clf, X, digits.target, cv=3)

array([ 0.96179402,  0.95993322,  0.95134228])

array([ 0.91860465,  0.88146912,  0.91610738])
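
To compare the scaled and unscaled variants side by side, something like the following sketch can be used. It is not the notebook's original code: wrapping the scaler in a pipeline (so that it is fitted on the training folds only) is an assumption about how one might set this up, and the scores you get will depend on the scikit-learn version and its default `SVC` parameters.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()

# Scaling is part of the pipeline, so each fold is scaled using its own training data
scaled_scores = cross_val_score(make_pipeline(StandardScaler(), SVC()),
                                digits.data, digits.target, cv=3)
# Same classifier on the raw pixel values
raw_scores = cross_val_score(SVC(), digits.data, digits.target, cv=3)

print("with scaling:   ", scaled_scores)
print("without scaling:", raw_scores)
```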