The purpose of this notebook is to implement at least four different cross-validation methods.

Two of those methods are then used to compare the performance of the SVM, decision tree, AdaBoost, and random forest models on the breast cancer data.

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

from sklearn import tree
from sklearn.tree import plot_tree




In [3]:
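# Note: wdbc.data has no header row, so read_csv as called here treats the first record
# as column names and drops it from the data; passing header=None would keep it.
data = pd.read_csv('/Users/danielaquijano/Documents/GitHub/Machine-Learning-Course-Projects/sourcefiles/wdbc.data')

# Keep only the ID, the diagnosis, and the ten mean features (a through j)
data = data.iloc[:, 0:12]
data.columns = ['ID', 'Diagnosis', 'Radius', 'Texture', 'Perimeter', 'Area', 'Smoothness', 'Compactness', 'Concavity', 'Concave Points', 'Symmetry', 'Fractal Dimension']
data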

| | ID | Diagnosis | Radius | Texture | Perimeter | Area | Smoothness | Compactness | Concavity | Concave Points | Symmetry | Fractal Dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 |
| 1 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 |
| 2 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 |
| 3 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 |
| 4 | 843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.15780 | 0.08089 | 0.2087 | 0.07613 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 563 | 926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 |
| 564 | 926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 |
| 565 | 926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 |
| 566 | 927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 |


In [38]:
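# Train/test split (80% train, 20% test).
# X keeps every column except Diagnosis, which means the ID column is still included as a feature.
X = data.drop('Diagnosis', axis=1)
y = data['Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)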
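The split above is random and unseeded, so the reports that follow will change from run to run. A minimal sketch of a seeded, stratified variant is shown here. It assumes scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the local `wdbc.data` file, and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the local wdbc.data file.
# In this copy the target is encoded as 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :10]  # the ten "mean" features (radius ... fractal dimension)

# stratify keeps the class ratio similar in both parts; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print("benign fraction (train, test):", y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is much rarer than the other; with this data it mainly reduces run-to-run noise in the per-class metrics.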


### Decision Tree Classifier

In [39]:
dec_tree= DecisionTreeClassifier()
dec_tree.fit(X_train,y_train)

DecisionTreeClassifier()

In [40]:
predictions_DT = dec_tree.predict(X_test)

In [41]:
print(classification_report(y_test,predictions_DT))

              precision    recall  f1-score   support

           B       0.89      0.93      0.91        72
           M       0.87      0.81      0.84        42

    accuracy                           0.89       114
   macro avg       0.88      0.87      0.88       114
weighted avg       0.89      0.89      0.89       114



### Support Vector Machine Classifier

In [43]:
SVM = SVC()
SVM.fit(X_train,y_train)

SVC()

In [44]:
predictions_SVM = SVM.predict(X_test)

In [45]:
print(classification_report(y_test,predictions_SVM))

              precision    recall  f1-score   support

           B       0.63      1.00      0.77        72
           M       0.00      0.00      0.00        42

    accuracy                           0.63       114
   macro avg       0.32      0.50      0.39       114
weighted avg       0.40      0.63      0.49       114
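The report above shows the SVM labeling every test sample as B. One likely cause, stated here as an assumption rather than a confirmed diagnosis, is feature scale: `SVC` uses an RBF kernel by default, and the inputs mix very large values (the ID column, Area) with values below 1. A minimal sketch of the usual remedy, standardizing the features inside a pipeline, follows. It uses scikit-learn's bundled copy of the data without an ID column, and `scaled_svm` is a hypothetical name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the local wdbc.data file.
# Target encoding in this copy: 0 = malignant (M), 1 = benign (B).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :10]  # ten mean features, no ID column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

# The scaler is fit on the training data only, then applied to the test data
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
pred_scaled_svm = scaled_svm.predict(X_te)

print(classification_report(y_te, pred_scaled_svm, target_names=["M", "B"]))
```

Keeping the scaler inside the pipeline also matters for cross-validation, because the scaling statistics are then recomputed on each training fold instead of leaking information from the held-out fold.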





### Random Forest Model 

In [46]:
Rand_Forest = RandomForestClassifier(oob_score = True)
Rand_Forest.fit(X_train,y_train)

RandomForestClassifier(oob_score=True)

In [47]:
predictions_RF = Rand_Forest.predict(X_test)

In [48]:
print(classification_report(y_test,predictions_RF))

              precision    recall  f1-score   support

           B       0.88      0.96      0.92        72
           M       0.92      0.79      0.85        42

    accuracy                           0.89       114
   macro avg       0.90      0.87      0.88       114
weighted avg       0.90      0.89      0.89       114
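The forest above is built with `oob_score=True`, but the out-of-bag estimate is never read. After fitting, it is available as the `oob_score_` attribute and gives an accuracy estimate from the samples each tree did not see, without a separate validation set. A minimal sketch, assuming scikit-learn's bundled copy of the data and a hypothetical `rf_demo` model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled dataset stands in for the local wdbc.data file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :10]  # ten mean features

# Each tree is trained on a bootstrap sample; the samples it did not see
# ("out of bag") are used to score it, so no hold-out set is needed here.
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

print("OOB accuracy estimate:", round(rf_demo.oob_score_, 3))
```

In the notebook itself the equivalent would be reading `Rand_Forest.oob_score_` after the `fit` call.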



### AdaBoost Model 
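No AdaBoost cell appears under this heading. The sketch below follows the same fit, predict, and report pattern as the other three models. It is an illustration under assumptions, not the notebook's own result: it uses scikit-learn's bundled copy of the data, default `AdaBoostClassifier` settings apart from the seed, and the hypothetical names `ada_demo` and `pred_ada`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the local wdbc.data file.
# Target encoding in this copy: 0 = malignant (M), 1 = benign (B).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :10]  # ten mean features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

# Default base learner is a depth-1 decision tree (a "stump")
ada_demo = AdaBoostClassifier(n_estimators=50, random_state=0)
ada_demo.fit(X_tr, y_tr)
pred_ada = ada_demo.predict(X_te)

print(classification_report(y_te, pred_ada, target_names=["M", "B"]))
```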

# Implementing Cross Validation Methods

Tools for implementing cross-validation in scikit-learn. The first three are helper functions that run the evaluation; the rest are splitters that define how the data is divided:
- cross_validate: returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score
- cross_val_score: returns an array with the test score for each fold
- cross_val_predict: returns, for each element in the input, the prediction that was obtained for that element when it was in the test set
- KFold: divides all the samples into k groups of samples, called folds; each fold is used once as the test set while the remaining k - 1 folds form the training set
- RepeatedKFold: repeats K-Fold n times, with a different randomization in each repetition
- LeaveOneOut (or LOO): a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out
- LeavePOut: similar to LeaveOneOut, as it creates all the possible training/test sets by removing p samples from the complete set
- ShuffleSplit: an iterator that generates a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.
- StratifiedKFold: a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
- StratifiedShuffleSplit: a variation of ShuffleSplit which returns stratified splits
- GroupKFold: a variation of k-fold which ensures that the same group is not represented in both testing and training sets.
- StratifiedGroupKFold: a cross-validation scheme that combines both StratifiedKFold and GroupKFold. The idea is to try to preserve the distribution of classes in each split while keeping each group within a single split.
- LeaveOneGroupOut: a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups.
- LeavePGroupsOut: similar to LeaveOneGroupOut, but removes the samples related to P groups for each training/test set.
- GroupShuffleSplit: behaves as a combination of ShuffleSplit and LeavePGroupsOut, generating randomized partitions in which a subset of groups is held out for each split
- TimeSeriesSplit: a variation of k-fold which returns the first k folds as the train set and the (k+1)th fold as the test set.
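As a sketch of how the tools listed above fit together, the code below compares the four models under two splitters, `KFold` and `StratifiedKFold`, using `cross_val_score`. The choice of these two splitters, the five folds, the scaling step for the SVM, and the use of scikit-learn's bundled copy of the data are all assumptions made for illustration; `models`, `cv_schemes`, and `cv_results` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the local wdbc.data file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = X_demo[:, :10]  # ten mean features

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),  # scaling is refit on each training fold
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

cv_schemes = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

# One array of per-fold accuracies for every (splitter, model) pair
cv_results = {}
for cv_name, cv in cv_schemes.items():
    for model_name, model in models.items():
        scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
        cv_results[(cv_name, model_name)] = scores
        print(f"{cv_name:16s} {model_name:14s} "
              f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Passing a splitter object as `cv` is what makes the comparison fair: every model is scored on exactly the same folds within a scheme. Swapping in another splitter from the list, such as `ShuffleSplit` or `RepeatedKFold`, only requires adding an entry to `cv_schemes`.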