## Introduction to Scikit-Learn and Pandas
Artificial Intelligence and Machine Learning Symposium at OU
University of Oklahoma Memorial Union Ballroom
September 25, 2019
Author: Keerti Banweer keerti.banweer@ou.edu

## Overview: Dimensionality reduction and Clustering (Breast Cancer dataset)
The following topics are covered in this section:

1. Load the dataset using sklearn.datasets
2. Describe the dataset using DESCR
3. Check for missing values using isna() and any()
4. Scale the data using an sklearn scaler (we will use the min-max scaler)
5. Reduce dimensionality using PCA and t-SNE from sklearn
6. Build the models using sklearn packages: KMeans
7. Evaluate the predictions and check accuracy
8. Visualize the clusters with different graphs
9. Compare different models using cross-validation (sklearn.model_selection.cross_validate)

### General References
* [scikit-learn API](https://scikit-learn.org/stable/modules/classes.html)

Clustering is the task of identifying similar instances and grouping them into clusters.


## IMPORTS

In [1]:
# Index of sklearn datasets: https://scikit-learn.org/stable/datasets/index.html#datasets
# https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

"""
This section will import all the required packages for this tutorial

"""

from sklearn.datasets import load_breast_cancer, load_iris
from sklearn import cluster, datasets

import numpy as np
import pandas as pd
import itertools 
import time

from matplotlib import rcParams, pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics.cluster import contingency_matrix 
from sklearn.metrics.pairwise import paired_euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE

import scipy.stats as stats

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.decomposition import PCA


%matplotlib inline
%reload_ext autoreload
%autoreload 2

rcParams['figure.figsize'] = (8, 8)

globalStart = time.time()


## Data Preprocessing
In this section, we will explore and visualize the structure of the breast cancer dataset from sklearn.
feature_names lists the names of the attributes.
target_names lists the names of the classes.


In [2]:
"""
1. We will load breast cancer dataset
2. Using the function keys(), we will display all the keys of the dataset
3. DESCR will describe the dataset. It includes the list of attributes and their meaning. 
"""

#loading breast cancer dataset 


## Display the keys


## Describe the dataset
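One possible way to fill in the three steps above is sketched below. The variable name `bc_dataset` follows the commented-out lines later in this notebook; printing only the first 600 characters of `DESCR` is an assumption made here to keep the output short.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset; the result is a dictionary-like Bunch object
bc_dataset = load_breast_cancer()

# Display the keys
print(bc_dataset.keys())

# Describe the dataset (only the start of the full description is printed)
print(bc_dataset.DESCR[:600])
```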



In [3]:
"""
Store the data in variable X and using pandas, we convert it into a dataframe
Feature names and target names are available under keys: feature_names and target_names respectively
"""
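A minimal sketch of this step, assuming the dataset is loaded into `bc_dataset` and the DataFrame is called `data_bc` (the name used by the clustering cells further down):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

bc_dataset = load_breast_cancer()

# Raw feature matrix as a NumPy array
X = bc_dataset.data

# Names of the attributes and of the classes
feature_names = bc_dataset.feature_names
target_names = bc_dataset.target_names

# Wrap the array in a DataFrame so each column carries its feature name
data_bc = pd.DataFrame(X, columns=feature_names)
print(target_names)
```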




In [4]:
""" 
Store the number of samples and the number of features, by
accessing the values from the shape of X
"""
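As a short sketch, the shape tuple of `X` can be unpacked directly. The names `n_samples` and `n_features` are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

# X.shape is (rows, columns), i.e. (samples, features)
n_samples, n_features = X.shape
print(n_samples, n_features)
```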




In [5]:
## Breast cancer dataset loaded in the form of dictionary
## changing it to pandas dataframe for more features

# data = np.c_[bc_dataset.data, bc_dataset.target]
# columns = np.append(bc_dataset.feature_names, ["target"])
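The two commented lines above can be completed roughly as follows. This is a sketch that appends the target as an extra column named `"target"`; the DataFrame name `data_bc` is taken from the later clustering cells.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

bc_dataset = load_breast_cancer()

# Stack the feature matrix and the target vector column-wise
data = np.c_[bc_dataset.data, bc_dataset.target]
columns = np.append(bc_dataset.feature_names, ["target"])

# DataFrame with 30 feature columns plus the target column
data_bc = pd.DataFrame(data, columns=columns)
print(data_bc.shape)
```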


## Data cleanup

Check for any missing values using isna() and any(). 
We can use functions like head() and tail() to view top 5 and bottom 5 rows of the dataframe. 

In [6]:
"""
Using the head() function, we can check the top 5 rows of the dataframe
"""



In [7]:
"""
Using tail() function we can check last 5 rows
"""



## Check for missing values

In [8]:
""" 
Determine whether any data are NaN. Use isna() and
any() to obtain a summary of which features have at 
least one missing value
"""
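A minimal sketch of the missing-value check, assuming the features are held in a DataFrame named `data_bc`:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

bc_dataset = load_breast_cancer()
data_bc = pd.DataFrame(bc_dataset.data, columns=bc_dataset.feature_names)

# isna() marks every NaN cell; any() reduces each column to one boolean
missing_per_feature = data_bc.isna().any()
print(missing_per_feature)

# A second any() collapses the summary to a single yes/no answer
print("Any missing values:", missing_per_feature.any())
```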



In [9]:
"""
List of attributes
"""



## Histogram of the features

In [10]:
"""
HISTOGRAMS OF THE PREDICTOR FEATURES 
"""





## Normalizing and Scaling the dataset

In [11]:
"""
Use Min-Max scaler from sklearn
"""
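Min-max scaling maps each feature to the range [0, 1] using x' = (x - min) / (max - min), computed per column. A short sketch, with `X_scaled` as an illustrative name:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X = load_breast_cancer().data

# Fit the scaler on the data and rescale every feature to [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Per-feature minimum and maximum after scaling (first five features)
print(X_scaled.min(axis=0)[:5], X_scaled.max(axis=0)[:5])
```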



In [12]:
"""
Display the scaled data using histograms
"""




## PCA for Dimensionality Reduction

In [13]:
"""
Principal Component Analysis (PCA) is one of the most popular dimensionality reduction algorithms
Visualize using bar plot
"""
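The sketch below applies PCA to the min-max scaled features. Keeping three components is an assumption made here so the result can later be shown in a 3D plot; the explained variance ratios it prints are the values one would pass to a bar plot.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_breast_cancer().data)

# Project the 30 scaled features onto the first three principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```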



In [24]:
"""
Display top five rows of PCA
"""




In [15]:
"""
Using seaborn to display the distribution of the principal components (PCs)
"""




In [16]:
## 3D version of the same plot

# num_classes = len(np.unique(colors))
# palette = np.array(sns.color_palette("hls", num_classes))


## tSNE for dimensionality reduction

In [17]:
# Also use t-SNE for dimensionality reduction
## t-Distributed Stochastic Neighbor Embedding (t-SNE)
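A minimal t-SNE sketch on the scaled features. The two-dimensional embedding and `random_state=0` are assumptions chosen for illustration, and all other parameters are left at the library defaults.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_breast_cancer().data)

# Embed the 30-dimensional data into 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)

print(X_tsne.shape)
```

Unlike PCA, t-SNE has no separate `transform` step for new data, so the embedding is mainly useful for visualization.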



In [18]:
## In this section we will use seaborn to visualize


## Clustering

In [19]:
"""
Train a KMeans clustering model on the breast cancer dataset
We need to specify the number of clusters that the algorithm will find
In the next section, we will compare the clusters with n_clusters = 2, 4, 8
"""
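One way this cell could look, assuming the model is trained on the min-max scaled features with two clusters. `n_init=10` and `random_state=0` are set explicitly here only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_breast_cancer().data)

# n_clusters must be chosen up front; KMeans does not infer it from the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_scaled)

# One cluster index per sample
print(kmeans.labels_[:10])
```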




In [20]:
"""
Centroid and labels for clustering
"""
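The fitted model exposes the centroids as `cluster_centers_` and the assignments as `labels_`. The sketch below also compares the assignments with the true classes through `contingency_matrix`, which is one possible way to evaluate the clustering; the setup (scaled features, two clusters) is assumed as in the previous step.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import MinMaxScaler

bc_dataset = load_breast_cancer()
X_scaled = MinMaxScaler().fit_transform(bc_dataset.data)
y = bc_dataset.target

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# One centroid per cluster, each with one coordinate per feature
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids.shape)

# Rows are true classes, columns are cluster indices
cm = contingency_matrix(y, labels)
print(cm)
```

Cluster indices are arbitrary, so cluster 0 need not correspond to class 0; the contingency matrix makes the actual correspondence visible.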



In [21]:

# Add 3rd dimension to figure


# Plot all the features and assign color based on cluster identity label


# Plot centroids, though you can't really see them.


#Set labels on figure and show 3D scatter plot to visualize data and clusters.



#### CLUSTERING DATASET

In [22]:
# RETRIEVING CLUSTER EXAMPLE INDICES
def get_examples_in_cluster_c(estimator, X, c):
    # Return the row indices of the examples assigned to cluster c
    # (X is unused but kept so the signature stays unchanged)
    inds = np.where(estimator.labels_ == c)[0]
    return inds


In [23]:
# https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
# NOTE: data_bc, selected_features (three column names), y (class labels) and
# targ_names (class names) are expected to be defined in the cells above.
features_diff = np.array(data_bc[selected_features])
kmeans = KMeans(n_clusters=2)
kmeans.fit(features_diff)

# Observing different cluster counts
# TUTORIAL NOTES: Just have them play with different cluster sizes in the constructors
estimators = [('2_clusters', KMeans(n_clusters=2)),
              ('4_clusters', KMeans(n_clusters=4)),
              ('8_clusters', KMeans(n_clusters=8))]
titles = ['2 Clusters', '4 Clusters', '8 Clusters']

for i, (name, est) in enumerate(estimators):
    # One figure with a single 3D axes per estimator
    fig = plt.figure(i + 1)
    ax = fig.add_subplot(projection='3d', elev=48, azim=134)
    est.fit(features_diff)
    labels = est.labels_

    # Color each point by the cluster it was assigned to
    ax.scatter(features_diff[:, 0], features_diff[:, 1], features_diff[:, 2],
               c=labels.astype(float), edgecolor='k')

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Radius Mean')
    ax.set_ylabel('Concavity Mean')
    ax.set_zlabel('Symmetry length')
    ax.set_title(titles[i])
    plt.show()

# Plot the ground truth
fig = plt.figure(4)
ax = fig.add_subplot(projection='3d', elev=48, azim=134)
for label, name in enumerate(targ_names):
    # Place each class name near the mean position of that class
    ax.text3D(features_diff[y == label, 0].mean(),
              features_diff[y == label, 1].mean(),
              features_diff[y == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
# Color each point by its true class
ax.scatter(features_diff[:, 0], features_diff[:, 1], features_diff[:, 2],
           c=np.asarray(y, dtype=float), edgecolor='k')
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Radius Mean')
ax.set_ylabel('Concavity Mean')
ax.set_zlabel('Symmetry length')
ax.set_title('Ground Truth')
plt.show()


## Clustering using AffinityPropagation

In [120]:
cluster_affinity = AffinityPropagation(damping=0.5, max_iter=300, affinity='euclidean', verbose=False)
data_affinity = np.array(features_diff)

In [159]:
# display affinity propagation clustering method
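A hedged sketch of fitting the estimator configured above. The three selected columns are an assumption inferred from the axis labels used in the 3D plots, and `random_state=0` is added only for repeatability. Unlike KMeans, AffinityPropagation chooses the number of clusters itself, and with these settings it may stop before converging.

```python
import pandas as pd
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import load_breast_cancer

bc_dataset = load_breast_cancer()
data_bc = pd.DataFrame(bc_dataset.data, columns=bc_dataset.feature_names)

# Assumed feature subset; replace with the columns chosen earlier
selected_features = ["mean radius", "mean concavity", "mean symmetry"]
data_affinity = data_bc[selected_features].to_numpy()

cluster_affinity = AffinityPropagation(damping=0.5, max_iter=300,
                                       affinity="euclidean", verbose=False,
                                       random_state=0)
cluster_affinity.fit(data_affinity)

# Exemplars are actual samples chosen as cluster representatives
n_found = len(cluster_affinity.cluster_centers_indices_)
print("Number of clusters found:", n_found)
```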

In [75]:
# analysing, visualizing dataset

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    #plt.savefig('confusion_mtx', bbox_inches="tight")
    