**Pyladies Berlin Meetup 05.12.17**

In this notebook we will apply the unsupervised machine learning methods discussed in the talk to both toy and real-world examples.

First, we import the required modules.

In [None]:
import matplotlib.pyplot as plt

import sklearn
import sklearn.datasets
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans




**One-Class SVM**

We will explore using an OC-SVM on a toy data set, and then continue to use it in a real-world scenario.

We will generate a 2-D Swiss roll for testing the OC-SVM.
The algorithm is from *S. Marsland, “Machine Learning: An Algorithmic Perspective”, Chapter 10, 2009*. Alternatively, we could have taken two of the three dimensions returned by sklearn.datasets.make_swiss_roll.
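
The alternative mentioned above could look roughly like the sketch below. It assumes a sample count and noise level chosen only for illustration, and the name `toy_alt` is hypothetical. `make_swiss_roll` returns three coordinates; the spiral lies in the first and third, while the second is just a uniform offset along the roll.

```python
from sklearn.datasets import make_swiss_roll

# Generate a 3-D Swiss roll; n_samples and noise are illustrative values.
X3d, t_roll = make_swiss_roll(n_samples=1000, noise=0.1, random_state=1711)

# Keep columns 0 and 2, which trace out the 2-D spiral.
toy_alt = X3d[:, [0, 2]]
print(toy_alt.shape)  # (1000, 2)
```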

In [None]:
nSamples = 7000
nFeatures = 2
nLayers = 5
radius = 5
rand = np.random.RandomState(1711)

t = rand.uniform(low=0, high=1, size=nSamples)
toyData1 = np.zeros((nSamples, nFeatures))

maxRot = nLayers * 2 * np.pi
# Draw the noise from the seeded generator as well, so the data set is reproducible.
toyData1[:, 0] = radius * t * np.cos(t * maxRot) + 0.1 * rand.normal(0, 1, nSamples)
toyData1[:, 1] = radius * t * np.sin(t * maxRot) + 0.1 * rand.normal(0, 1, nSamples)


We will now use matplotlib to scatter this data.

In [None]:
# TODO: Plot the toy data scatter plot

In the next section we will create a one-class SVM and train it on the toy data. 
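
As a hint, here is a minimal sketch of fitting a one-class SVM on a smaller version of the spiral. The sample count of 1000 is an assumption made to keep the example fast, and the lowercase variable names are hypothetical. `random_state` is left out because newer scikit-learn releases no longer accept it for `OneClassSVM`. The parameter `nu` is an upper bound on the fraction of training errors, so roughly that share of the training points can be expected to end up outside the learned region.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rand = np.random.RandomState(1711)
n_samples, n_layers, radius = 1000, 5, 5  # fewer samples than the notebook uses

# Same spiral construction as in the cell above, on a smaller scale.
t = rand.uniform(low=0, high=1, size=n_samples)
max_rot = n_layers * 2 * np.pi
toy = np.column_stack([
    radius * t * np.cos(t * max_rot) + 0.1 * rand.normal(0, 1, n_samples),
    radius * t * np.sin(t * max_rot) + 0.1 * rand.normal(0, 1, n_samples),
])

# RBF kernel with the parameters asked for in the exercise.
clf = OneClassSVM(kernel="rbf", nu=0.15, gamma=7)
clf.fit(toy)

pred = clf.predict(toy)              # +1 = inlier, -1 = outlier
scores = clf.decision_function(toy)  # signed distance to the decision boundary
print("fraction flagged as outliers:", np.mean(pred == -1))
```

A larger `gamma` makes the boundary hug the data more tightly, which is what lets the model follow the individual turns of the spiral.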

In [None]:
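# TODO: Use sklearn's OneClassSVM with nu=0.15 and gamma=7 and train it on the toy data set.
# Store the fitted model in clf, which the plotting cells below use.
# Set random_state to 1711 only if your scikit-learn version still accepts it
# (newer releases removed this parameter from OneClassSVM).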

In the next section we will plot the data set, the decision surface and the contour plot. 

In [None]:
xx, yy = np.meshgrid(np.linspace(-10, 10, 500), np.linspace(-10, 10, 500))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

In [None]:
plt.figure(figsize=(12, 12), dpi= 80)
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')
plt.scatter(toyData1[:,0], toyData1[:,1], c='white')

plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))

plt.show()

**Using a One-Class SVM to detect credit card fraud**

We will now work through a real-life example: detecting credit card fraud, based on a data set provided by Worldline and the Machine Learning Group of ULB.

In [None]:
#TODO: Use Pandas to load the credit card transaction data from creditCardTransactions.csv into dfCreditCardData

The dataset contains transactions made by credit cards in September 2013 by European cardholders. For anonymity reasons the authors transformed the majority of the data using PCA and provided the first 28 principal components. The only features they left untransformed are the transaction amount and the time. Let us still inspect the value ranges using describe().

The dataset is provided by a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles). It is under the ODbL 1.0 license.

In [None]:
dfCreditCardData.describe()

Let us observe how many transactions are normal and how many are fraudulent. 

In [None]:
dfCreditCardData['Class'].value_counts()

What are your observations?

Since we are observing the transactions independently we can drop the time feature. 

In [None]:
#TODO: Drop the Time column

Our next step is to separate the data points from their labels. We will then apply a MinMaxScaler to the features, followed by PCA.
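
The scaling and PCA steps can be sketched on synthetic data as follows. The data, the `*_demo` names, and the use of a `Pipeline` to chain both steps are assumptions for illustration; calling the two transformers one after the other works just as well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features with very different ranges (hypothetical stand-in data).
rng = np.random.RandomState(1711)
X_demo = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(200, 3))

# Chain the two steps: rescale every feature to [0, 1], then rotate with PCA.
preprocess = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA()),
])
X_demo_pca = preprocess.fit_transform(X_demo)

# Inspect the intermediate result and the variance captured per component.
X_demo_scaled = preprocess.named_steps["scale"].transform(X_demo)
variance_ratio = preprocess.named_steps["pca"].explained_variance_ratio_
print(X_demo_scaled.min(axis=0), X_demo_scaled.max(axis=0))
print(variance_ratio)
```

Scaling first matters here because PCA is driven by variance: without it, the feature with the largest raw range would dominate the components.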

In [None]:
# For consistency of notation, we label the anomalous cases -1 and the normal ones 1.
dfCreditCardData['Class'] = dfCreditCardData['Class'].replace({0: 1, 1: -1})

# .ix has been removed from pandas; .iloc selects by integer position.
X = dfCreditCardData.iloc[:, 0:29]
y = dfCreditCardData.iloc[:, 29]

#TODO: Apply a Min Max Scaler on X


#TODO: Apply a regular PCA on X


Separate the data and labels into a training set and a test set using train_test_split. Usually, the test set should be smaller than the training set. However, because of the convergence rate of the algorithm and the time constraints of the meetup, set test_size to 0.75 and random_state to 1711.
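
To illustrate the call, here is a small sketch on made-up arrays (the `*_demo` names and the data are hypothetical). With 100 samples and test_size=0.75, 75 samples go to the test set and 25 remain for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each, 5 of them labelled as anomalies.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 95 + [-1] * 5)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.75, random_state=1711
)
print(X_train_demo.shape, X_test_demo.shape)  # (25, 2) (75, 2)
```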

In [None]:
# TODO, Split data

Remove the fraudulent transactions from the training set, so that the one-class SVM is fitted on normal transactions only.
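
One way to do this is with a boolean mask on the labels. The sketch below assumes NumPy arrays and uses made-up values and hypothetical names; it relies on the notebook's convention that 1 marks normal and -1 marks fraudulent samples.

```python
import numpy as np

# Made-up training data: four samples, the third one is fraudulent (-1).
X_train_demo = np.array([[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.2, 0.2]])
y_train_demo = np.array([1, 1, -1, 1])

# Keep only the rows whose label is 1 (normal).
normal_mask = y_train_demo == 1
X_train_normal = X_train_demo[normal_mask]
print(X_train_normal.shape)  # (3, 2)
```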

In [None]:
# TODO: Remove the fraudulent training samples

Create a OneClassSVM with an rbf kernel using sklearn and fit it on the training data.
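
The idea of fitting on normal data only and then classifying unseen points can be sketched on synthetic data. The Gaussian blobs, the names `X_normal` and `X_new`, and gamma="scale" are assumptions for illustration; `random_state` is omitted because newer scikit-learn releases no longer accept it for `OneClassSVM`.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1711)

# "Normal" training data: a single Gaussian blob around the origin.
X_normal = rng.normal(0, 1, size=(300, 2))

# Unseen data: 20 points like the training data plus 5 points far away from it.
X_new = np.vstack([
    rng.normal(0, 1, size=(20, 2)),
    rng.normal(6, 0.5, size=(5, 2)),
])

ocSVM = OneClassSVM(kernel="rbf", nu=0.2, gamma="scale")
ocSVM.fit(X_normal)  # no labels needed: the model only learns what "normal" looks like

pred_new = ocSVM.predict(X_new)  # +1 = looks normal, -1 = anomaly
print(pred_new)
```

Because nu=0.2 allows up to about a fifth of the training points to fall outside the boundary, some of the normal-looking test points may be flagged as well.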

In [None]:
# TODO: Create a OneClassSVM with an rbf kernel and nu=0.2, and name it ocSVM
# (the prediction cell below uses that name). Set random_state=1711 only if your
# scikit-learn version still accepts it for OneClassSVM.

In [None]:
# Fit the OCSVM on the good data

After the fitting is done, proceed to classify the remaining test samples and evaluate the predictions with a classification report.
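
To help with reading the report, here is a tiny hand-made example (the label lists are invented for illustration). With the notebook's convention, the row for -1 describes the fraudulent class: its recall is the share of true frauds that were caught, and its precision is the share of flagged transactions that really were fraudulent.

```python
from sklearn.metrics import classification_report

# Invented labels: 6 normal (1) and 2 fraudulent (-1) transactions.
y_true_demo = [1, 1, 1, 1, 1, 1, -1, -1]
y_pred_demo = [1, 1, 1, 1, -1, -1, -1, 1]

print(classification_report(y_true_demo, y_pred_demo))

# The same numbers as a dictionary, keyed by the label as a string.
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
print(report["-1"]["precision"], report["-1"]["recall"])
```

In this example one of the two frauds is caught (recall 0.5), and one of the three flagged transactions is a real fraud (precision 1/3). On heavily imbalanced data such as the credit card set, these per-class numbers are far more informative than overall accuracy.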

In [None]:
y_test_pred = ocSVM.predict(X_test)

In [None]:
y_test_pred

In [None]:
print(classification_report(y_test, y_test_pred))

**Breast Cancer**

In the next example we will use unsupervised machine learning algorithms in order to detect breast cancer. 
The data set used is the Breast Cancer Wisconsin (Diagnostic) Data Set, provided by the UCI Machine Learning Repository. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. We inspect their characteristics. 

In [None]:
dataBreastCancer = sklearn.datasets.load_breast_cancer()
dfBreastCancer = pd.DataFrame(dataBreastCancer.data, columns=dataBreastCancer.feature_names)

In [None]:
dfBreastCancer.describe()

In [None]:
# TODO: Load the features into nXBreastCancer and the labels into yBreastCancer

In [None]:
# TODO: Normalize the data using a min max scaler and then apply t-sne to it. 

In [None]:
# TODO: Scatter the transformed data and color them according to their labels

In [None]:
# TODO: Use K-Means with 2 clusters to predict the classes of the data 


In [None]:
# TODO: Scatter the transformed data with the prediction based on k-means
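
One possible way to approach the steps in this section is sketched below. The names `X_bc`, `y_bc`, `X_embedded` and `cluster_ids` are hypothetical, and running K-Means on the scaled features (rather than on the t-SNE embedding) is an assumption, since the exercise leaves that choice open. Cluster ids are arbitrary, so the agreement with the true labels is computed for both possible id-to-label mappings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X_bc, y_bc = data.data, data.target  # 569 samples, 30 features; 0 = malignant, 1 = benign

# Rescale every feature to [0, 1] so that no single feature dominates the distances.
X_scaled = MinMaxScaler().fit_transform(X_bc)

# 2-D embedding, used only for visualisation.
X_embedded = TSNE(n_components=2, random_state=1711).fit_transform(X_scaled)

# Cluster the scaled features into two groups without looking at the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1711)
cluster_ids = kmeans.fit_predict(X_scaled)

# Score both possible assignments of cluster ids to class labels.
agreement = max(np.mean(cluster_ids == y_bc), np.mean(cluster_ids != y_bc))
print(X_embedded.shape, round(agreement, 3))
```

For the two scatter plots, the embedding can be passed to `plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=...)`, once with `y_bc` and once with `cluster_ids` as the colors.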