# Classification
Classification is another supervised learning task where an algorithm learns a function to map an input to an output class.

In this activity we will become familiar with several techniques for predicting categorical outcomes. Such models are also called classifiers (they classify an input as one of the outcome groups).

## Breast Cancer Dataset
This is one of the classic datasets used for classification. It contains actual values about tumor characteristics and the diagnosis of the tumor as benign or malignant.

Two versions of this dataset can be found at 
*   https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) 
*   https://archive.ics.uci.edu/ml/support/Breast+Cancer


In [None]:
#loading the breast cancer dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names) #load the dataset as a dataframe
df_y = pd.DataFrame(data.target, columns=['Malignant']) # in original dataset Benign=1, Malignant=0 
#we named the target column "Malignant" and will recode the variable (Malignant=1, Benign=0) correspondingly
df=pd.concat([df,1-df_y],axis=1)
df.head(2)

In [None]:
df

In [None]:
#212 observations are Malignant tumors, 357 are benign
df['Malignant'].sum()

In [None]:
#let's take a look at the summary statistics
df.describe()

In [None]:
#to save the loaded file as csv, uncomment next line and run this cell
#df.to_csv('breastcancer.csv')

In [None]:
#check variables for NULL or NA values
df.isnull().sum()

In [None]:
df.isna().sum()

In [None]:
#visually explore the data: correlation matrix of the dataset
import seaborn as sb
cormatrix=df.corr()
sb.heatmap(cormatrix,cmap='Blues') 
#cmap(colormap) options: 'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds','YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu','GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn'
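The heatmap shows all pairwise correlations at once. As a complementary, minimal sketch, the correlations with the outcome alone can be ranked numerically. The names `bc`, `demo`, and `corr_with_target` are illustrative and are not used elsewhere in this notebook; the data is reloaded so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
demo = pd.DataFrame(bc.data, columns=bc.feature_names)
demo["Malignant"] = 1 - bc.target  # recode as in this notebook: Malignant=1, Benign=0

# Absolute correlation of each feature with the outcome, strongest first
corr_with_target = (demo.corr()["Malignant"]
                    .drop("Malignant")
                    .abs()
                    .sort_values(ascending=False))
print(corr_with_target.head(5))
```

A high correlation with the outcome suggests a useful predictor, but it does not account for redundancy among strongly correlated features.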

In [None]:
##preparing the data for model training
#Outcome variable y: binary (Malignant=1, Benign=0)
y = df[['Malignant']]  
X = df.drop('Malignant',axis=1) #all other variables used as potential predictors

None of the potential predictor/input/X variables are categorical, so we won't need to encode them (i.e., dummy or OneHot encoding). 
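For contrast, here is a minimal sketch of what encoding would look like if a predictor were categorical. The `toy` DataFrame and its `tumor_site` column are hypothetical and are not part of the breast cancer dataset.

```python
import pandas as pd

# Hypothetical toy data: 'tumor_site' is an invented categorical column
toy = pd.DataFrame({
    "size_mm": [12.0, 20.5, 8.3],
    "tumor_site": ["left", "right", "left"],
})

# One-hot (dummy) encoding: one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["tumor_site"])
print(toy_encoded.columns.tolist())
```

Numeric columns pass through unchanged, and each category becomes its own 0/1 column.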

In [None]:
X.head(2)

In [None]:
y.head(2)

## Splitting the data into Train/Test set
We will split the data into a train set and a test set, try several classification models, and evaluate their predictive performance:

In [None]:
#splitting the dataset into training (75%) and testing (25%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)
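A minimal sketch of how the resulting split sizes can be checked, and of the optional `stratify` argument of `train_test_split`, which keeps the class proportions similar in both subsets. It reloads the data so that it runs on its own; the `_demo` and `_s` variable names are illustrative, and the labels use sklearn's original coding (1 = benign).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the benign/malignant ratio similar in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)

print(X_tr_s.shape, X_te_s.shape)    # rows and columns in each subset
print(y_tr_s.mean(), y_te_s.mean())  # share of class 1 in each subset
```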


In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#uncomment ONLY one of the following two, depending on how you want to scale the data
sc = StandardScaler() #standardization
#sc = MinMaxScaler() #minmax scaling
X_train_sc=sc.fit_transform(X_train)
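To make the two options concrete, here is a small sketch of what each scaler computes on a single invented feature. The array `x` is toy data, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])  # toy single-feature data

x_std = StandardScaler().fit_transform(x)  # (x - mean) / standard deviation
x_mm = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min)

print(x_std.ravel().round(3))  # centered on 0 with unit variance
print(x_mm.ravel().round(3))   # squeezed into the range [0, 1]
```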

In [None]:
import warnings
warnings.filterwarnings('ignore')
# Let's compare the histograms of pre/post scaling for one variable ("mean radius") in our training data (X_train)
import matplotlib.pyplot as plt
fig, axs = plt.subplots(ncols=2)
sb.distplot(X_train[X_train.columns[0]], ax=axs[0])
sb.distplot(X_train_sc[:, 0], ax=axs[1]) #X_train_sc is a NumPy array, so [:, 0] selects the first column


In [None]:
fig, axs = plt.subplots(ncols=2)
sb.distplot(X_train[X_train.columns[3]], ax=axs[0])
sb.distplot(X_train_sc[:, 3], ax=axs[1]) #[:, 3] selects the fourth column of the scaled array

Now rescale X_train using the **minmax** scaler and rerun the plots in the previous cell for some variable (e.g., pick one column from 0 to 29). 

### Question 1
Can you explain what happened after using a different scaler?

answer here

## Logistic Regression

In [None]:
#using logistic regression 
from sklearn.linear_model import LogisticRegression
logistic1 = LogisticRegression(random_state = 0, solver='liblinear')
logistic1.fit(X_train_sc, y_train) #fitting the model

In [None]:
#using the trained model to predict outcome values for the test data (X_test)
y_pred = logistic1.predict(sc.transform(X_test))


### Question 2
What is the problem in the previous step? Did we miss anything?

answer here

In [None]:
#fix issue before proceeding!!

## Model Evaluation
We can use sklearn.metrics to calculate performance metrics (the code is provided below).

Another option is scikitplot (which runs on top of sklearn), which creates more appealing displays of evaluation metrics such as the confusion matrix and various model evaluation plots.

In [None]:
#evaluating model performance using sklearn metrics
#see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics  for definition/overview of each metric
import sklearn.metrics as metrics
metrics.confusion_matrix(y_test, y_pred) #confusion matrix

In [None]:
#!pip install scikit-plot #run once to install library
#using scikit plot for model evaluation
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_test, y_pred) #confusion matrix

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True) #normalized confusion matrix

In [None]:
#Let's take a look at model's Precision, Recall, and Accuracy
(metrics.precision_score(y_test, y_pred), #Precision
metrics.recall_score(y_test, y_pred), #Recall
metrics.accuracy_score(y_test, y_pred), #Accuracy
metrics.balanced_accuracy_score(y_test, y_pred)) #Balanced Accuracy
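As a sketch of where these numbers come from, the same four metrics can be derived by hand from the cells of a confusion matrix. The labels below are invented toy values (1 = positive class), not model output from this notebook.

```python
from sklearn import metrics

# Toy labels, invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
balanced = (recall + tn / (tn + fp)) / 2    # mean of the recall on each class

print(precision, recall, accuracy, balanced)
```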

In [None]:
#calculate probability of each observation in X_test to belong to either outcome class; Benign=0 , Malignant=1
y_pred_prob=logistic1.predict_proba(sc.transform(X_test)) #transform only: the scaler must not be re-fit on the test data
y_pred_prob.round(3)
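The predicted probabilities can also be turned into class labels with a cut-off other than the default. The following is a minimal, self-contained sketch; the 0.8 cut-off and all variable names are illustrative assumptions, and the labels use sklearn's original coding (1 = benign) rather than the recoded `Malignant` column.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=123)

scaler = StandardScaler().fit(X_tr)  # fit on the training data only
clf = LogisticRegression(solver="liblinear").fit(scaler.transform(X_tr), y_tr)

proba = clf.predict_proba(scaler.transform(X_te))  # columns follow clf.classes_
labels_default = clf.predict(scaler.transform(X_te))

# predict() corresponds to a 0.5 cut-off on the probability of class 1
labels_manual = clf.classes_[(proba[:, 1] >= 0.5).astype(int)]

# A stricter, illustrative cut-off: assign class 1 only when p >= 0.8
labels_strict = (proba[:, 1] >= 0.8).astype(int)
print(labels_default.sum(), labels_strict.sum())
```

Raising the cut-off for a class trades recall on that class for precision, which is what the ROC and precision-recall curves summarize across all cut-offs.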

In [None]:
#plotting the ROC curve to evaluate the model's aggregate performance
skplt.metrics.plot_roc_curve(y_test,y_pred_prob)


In [None]:
#plot precision-recall curve
skplt.metrics.plot_precision_recall_curve(y_test,y_pred_prob)
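Both curves can be summarized as single numbers using sklearn.metrics alone. This sketch uses four invented observations (`y_true`, `p_class1`), not the predictions of the model above.

```python
from sklearn import metrics

# Toy example: true labels and predicted probabilities of class 1 (invented values)
y_true = [0, 0, 1, 1]
p_class1 = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, p_class1)  # points on the ROC curve
auc = metrics.roc_auc_score(y_true, p_class1)               # area under the ROC curve
ap = metrics.average_precision_score(y_true, p_class1)      # summary of the precision-recall curve
print(auc, ap)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.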

## Decision Tree Classification 

Learn more at https://scikit-learn.org/stable/modules/tree.html#tree 

In [None]:
#using the Decision Tree Classifier from sklearn
from sklearn.tree import DecisionTreeClassifier
DecTree1 = DecisionTreeClassifier(criterion = 'entropy', random_state = 123) #change metric to 'gini' and rerun
DecTree1.fit(X_train, y_train)
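The `criterion` argument selects the impurity measure used to choose splits. As a rough illustration, the two measures can be computed by hand for a node's class proportions; the helper functions `gini` and `entropy` below are written for this sketch and are not sklearn functions.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k)), skipping empty classes."""
    return sum(p * math.log2(1.0 / p) for p in proportions if p > 0)

# Class proportions in three hypothetical nodes: mixed, mostly pure, pure
for node in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(node, round(gini(node), 3), round(entropy(node), 3))
```

Both measures are largest for an evenly mixed node and zero for a pure node, so they often lead to similar trees.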

In [None]:
#using the model to predict outcomes for X_test
y_pred = DecTree1.predict(X_test) 
#confusion matrix for the DecTree Classifier
skplt.metrics.plot_confusion_matrix(y_test, y_pred)


In [None]:
#calculating other model evaluation metrics
(metrics.precision_score(y_test, y_pred), #Precision
metrics.recall_score(y_test, y_pred), #Recall
metrics.accuracy_score(y_test, y_pred),#Accuracy
metrics.balanced_accuracy_score(y_test, y_pred)) # Balanced Accuracy

Change the split criterion for the Decision Tree to 'gini' and rerun the model and evaluation. 
### Question 3
Does it improve the model? Which metric did you consider for your answer?

answer here

## Cross validation 
Cross-validation (CV) is an approach to evaluating machine learning models. CV helps detect **overfitting** in supervised learning because each model is scored on data it was not trained on. It is also used for model selection (selecting the best-performing model among several candidates).

https://scikit-learn.org/stable/modules/cross_validation.html 
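A minimal sketch of k-fold cross-validation, assuming 5 stratified folds and a scaler wrapped in a pipeline so that it is re-fit on each training fold only. The fold count, the pipeline, and the `_demo` names are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so no information leaks from the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.round(3), scores.mean().round(3))  # one score per fold, then the average
```

The spread of the per-fold scores gives a sense of how much the estimate depends on the particular split.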


In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN1=KNeighborsClassifier(n_neighbors=5)
KNN1.fit(X_train, y_train)

In [None]:
y_pred=KNN1.predict(X_test)
(metrics.precision_score(y_test, y_pred), #Precision
metrics.recall_score(y_test, y_pred), #Recall
metrics.accuracy_score(y_test, y_pred))
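Cross-validation can also guide the choice of `n_neighbors`. The sketch below compares a few illustrative values of k; the candidate values and the added scaling step are assumptions made for this example (K-NN is distance-based, so it is sensitive to feature scale).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv_means = {}
for k in (1, 5, 15):  # illustrative candidate values
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_k = max(cv_means, key=cv_means.get)  # k with the highest mean CV accuracy
print(cv_means, best_k)
```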

## Comparing several models (using CV for model validation)
Now we will put everything together and compare several models based on several performance metrics. We use k-fold cross-validation to evaluate each model. 

In [None]:
import warnings
warnings.filterwarnings('ignore')
# Compare Algorithms
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# prepare models
models = []
models.append(('LogisticReg', LogisticRegression(max_iter=500)))
models.append(('K-NN      ', KNeighborsClassifier()))
models.append(('DecisTree', DecisionTreeClassifier(criterion = 'entropy')))
models.append(('NaiveBayes', GaussianNB()))
models.append(('SVM       ', SVC())) # add an SVM classifier
# uncomment the next line to add a RandomForest classifier
#models.append(('RandForest', RandomForestClassifier(n_estimators = 10,criterion = 'entropy')))
# evaluate each model in turn
results = []
names = []
scoring = 'recall' #metric we want to compare
#see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for complete list of options
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=123) #random_state only takes effect when shuffle=True
	cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison: ' +scoring)
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

### Question 4
Which model has the best Accuracy? 

answer

Change the scoring metric in the previous code cell as needed to answer the following questions. For example: 
* scoring='accuracy'
* scoring='precision'
* scoring='recall'

See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter for a comprehensive list.
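As an alternative to rerunning the cell once per metric, `cross_validate` accepts a list of scoring names and evaluates them in one pass. This is a sketch for a single model; the choice of `GaussianNB` and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_malignant = 1 - y_demo  # recode as in this notebook: Malignant=1, Benign=0

res = cross_validate(GaussianNB(), X_demo, y_malignant, cv=10,
                     scoring=["accuracy", "precision", "recall"])

# Results are stored under keys of the form 'test_<metric>', one value per fold
for metric in ("accuracy", "precision", "recall"):
    print(metric, res["test_" + metric].mean().round(3))
```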
### Question 5
Which model has the best Precision for the Malignant (1) class? 

answer

### Question 6
Which model has the best Recall (for the Malignant class)?

answer

Let's uncomment the line for adding a RandomForest Classifier and rerun the previous cell. 
### Question 7
Which model has a better predictive performance now? What metric did you use?

answer

## Iris dataset (3 outcome classes)

In [None]:
from sklearn.datasets import load_iris
data = load_iris()
(data.feature_names, 
data.target_names)
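With three outcome classes, the confusion matrix becomes 3x3, and precision and recall need an averaging strategy. The following sketch assumes a default decision tree and a stratified 75/25 split, both chosen only for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=123, stratify=iris.target)

tree_iris = DecisionTreeClassifier(random_state=123).fit(X_tr, y_tr)
y_hat = tree_iris.predict(X_te)

cm = metrics.confusion_matrix(y_te, y_hat)  # one row and one column per species
# 'macro' averages the per-class recall values with equal weight
macro_recall = metrics.recall_score(y_te, y_hat, average="macro")
print(cm)
print(metrics.classification_report(y_te, y_hat, target_names=iris.target_names))
```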

## Interactive Visualization of Decision Tree Classifiers

In [None]:
!pip install ipywidgets

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.datasets import load_iris, load_breast_cancer
from IPython.display import SVG
from graphviz import Source
from IPython.display import display                               
from ipywidgets import interactive
# load dataset
data = load_iris() #or load_breast_cancer()
# feature matrix
X = data.data
# target vector
y = data.target
# feature names (used to label the split nodes in the plot)
labels = data.feature_names
def plot_tree(crit, split, depth, min_split, min_leaf=0.2):
  estimator = DecisionTreeClassifier(random_state = 0, criterion = crit, 
                                     splitter = split, max_depth = depth,
                                     min_samples_split=min_split, 
                                     min_samples_leaf=min_leaf)
  estimator.fit(X, y)
  graph = Source(tree.export_graphviz(estimator, out_file=None, 
                                      feature_names=labels, 
                                      class_names=['0', '1', '2'], filled = True))
  display(SVG(graph.pipe(format='svg')))
  return estimator

inter=interactive(plot_tree, 
                  crit = ["gini", "entropy"] , 
                  split = ["best", "random"] , 
                  depth=[1,2,3,4], 
                  min_split=(0.1,1), #min fraction of samples required to further split a node
                  min_leaf=(0.1,0.5)) #min fraction of samples required at a leaf node
display(inter)
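If graphviz or ipywidgets is unavailable, a fitted tree can still be inspected as plain text. This is a sketch using `export_text` and the `feature_importances_` attribute; the depth limit of 2 and the names `iris` and `small_tree` are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Plain-text rendering of the fitted tree (no graphviz needed)
tree_rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(tree_rules)

# Impurity-based importances: one value per feature, summing to 1
for name, imp in zip(iris.feature_names, small_tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The feature used at the root split is the first line of the printed rules.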

Outcome classes for visualized Decision Tree (Breast Cancer dataset)
*   Benign=1
*   Malignant=0


### Question 8
What is the most important variable to consider when classifying tumors according to the Decision Tree we have built?

Outcome classes for visualized Decision Tree (Iris dataset)
*   Iris Setosa = 0
*   Iris Versicolour = 1
*   Iris Virginica = 2

### Question 9
What is the most important variable to consider when classifying Iris plants?

answer