<h1 style='color:blue;' align='center'>PRC_8: KFold Cross Validation</h1>

## AIM: Apply K-Fold Cross Validation and implement it.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

**Logistic Regression**

In [3]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.95

**SVM**

In [4]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.40555555555555556

**Random Forest**

In [5]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9777777777777777

<h2 style='color:purple'>KFold cross validation</h2>

**Basic example**

# K-fold cross validation

In [6]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [7]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
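
The split above can be reproduced by hand, which makes it clearer what `KFold` does when `shuffle=False`. The following is only a minimal sketch: `kfold_indices` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes contiguous folds where the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds.

    Hypothetical helper for illustration; mirrors the unshuffled behaviour
    shown above, where each fold is a consecutive block of indices.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the test block is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


splits = list(kfold_indices(9, 3))
for train, test in splits:
    print(train, test)
```

Each sample lands in the test set exactly once, so every observation is used for both training and evaluation across the three rounds.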


**Use KFold for our digits example**

In [8]:
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [9]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [10]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [11]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [12]:
scores_rf

[0.9348914858096828, 0.9532554257095158, 0.9181969949916527]

<h2 style='color:purple'>cross_val_score function</h2>

In [13]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [14]:
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

array([0.89482471, 0.95325543, 0.90984975])

**svm model performance using cross_val_score**

In [15]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

array([0.38063439, 0.41068447, 0.51252087])

**random forest performance using cross_val_score**

In [16]:
cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)

array([0.93322204, 0.95325543, 0.91652755])

For classifiers, cross_val_score uses stratified k-fold by default when cv is an integer.
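
One way to check this default is to pass an explicit `StratifiedKFold` splitter and compare the scores with those from `cv=3`. This is a sketch under one assumption: a `DecisionTreeClassifier` with a fixed `random_state` is swapped in (it is not one of the models used above) so that both runs are deterministic and directly comparable.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Fixed random_state (an assumption of this sketch) so both runs fit identical trees.
model = DecisionTreeClassifier(random_state=0)

# Integer cv: scikit-learn chooses the splitter itself.
scores_default = cross_val_score(model, digits.data, digits.target, cv=3)

# Explicit splitter: unshuffled stratified folds.
scores_explicit = cross_val_score(
    model, digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(scores_default)
print(scores_explicit)
print(np.allclose(scores_default, scores_explicit))
```

If the two arrays match, the integer `cv` argument produced the same folds as the explicit stratified splitter.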

<h2 style='color:purple'>Parameter tuning using k fold cross validation</h2>

In [17]:
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8708907510862819

In [18]:
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9393544382371198

In [19]:
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9465673494723774

In [20]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9471291123525759

Here we used cross_val_score to tune our random forest classifier. Among the values tried (5, 20, 30, and 40 trees), 40 trees gave the best mean accuracy, although the gain over 30 trees is small (0.9471 vs. 0.9466).
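
The four cells above repeat the same call with a different `n_estimators`. The same comparison could be written as a loop, sketched below. The names `candidates`, `mean_scores`, and `best_n` are illustrative, and `random_state=0` is an added assumption so that reruns give the same numbers; the scores will therefore not match the outputs above exactly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()

# Candidate forest sizes, taken from the cells above.
candidates = [5, 20, 30, 40]
mean_scores = {}

for n in candidates:
    # random_state is fixed here only to make the sketch reproducible.
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, digits.data, digits.target, cv=10)
    mean_scores[n] = float(np.mean(scores))

# Pick the candidate with the highest mean cross-validated accuracy.
best_n = max(mean_scores, key=mean_scores.get)

for n, score in mean_scores.items():
    print(n, round(score, 4))
print("best n_estimators:", best_n)
```

Because random forests are randomized, small differences between neighbouring candidates (such as 30 vs. 40 trees) may change from run to run when no seed is fixed.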

<h2 style='color:purple'>Exercise</h2>

Use the iris flower dataset from the sklearn library and apply cross_val_score to the following
models to measure the performance of each. Finally, determine which model performs best:
1. Logistic Regression
2. SVM
3. Decision Tree
4. Random Forest

# Solution to given Exercise:

In [24]:
import pandas as pd

In [28]:
# Loading Iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data,iris.target,test_size=0.3)

### Logistic Regression Model

In [30]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.8888888888888888

In [39]:
model=[]

In [40]:
model.append(lr.score(X_test, y_test))

In [41]:
model

[0.8888888888888888]

### SVM Model

In [43]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9777777777777777

### Decision Tree

In [44]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()

In [45]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.9555555555555556

### Random Forest Model

In [46]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9555555555555556

### Note: Fitting the prescribed models on the iris flower dataset with a single train/test split, we got a maximum score of 97.78% with the SVM model

# K-fold cross validation for iris dataset

In [57]:
iris_scores_logistic = []
iris_scores_svm = []
iris_scores_dtree=[]
iris_scores_rf = []

for train_index, test_index in folds.split(iris.data,iris.target):
    X_train, X_test, y_train, y_test = iris.data[train_index], iris.data[test_index], \
                                       iris.target[train_index], iris.target[test_index]
    iris_scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    iris_scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    iris_scores_dtree.append(get_score(tree.DecisionTreeClassifier(),X_train, X_test, y_train, y_test))    
    iris_scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [58]:
iris_scores_logistic

[0.96, 0.96, 0.94]

In [75]:
np.average(iris_scores_logistic)

0.9533333333333333

In [59]:
iris_scores_svm

[0.98, 0.98, 0.96]

In [76]:
np.average(iris_scores_svm)

0.9733333333333333

In [60]:
iris_scores_dtree

[0.98, 0.94, 0.98]

In [77]:
np.average(iris_scores_dtree)

0.9666666666666667

In [62]:
iris_scores_rf

[0.98, 0.94, 0.96]

In [78]:
np.average(iris_scores_rf)

0.96

### Note: From the k-fold cross-validation scores above, we got the maximum mean score of 97.33% with the SVM model

## cross_val_score function

### Logistic regression model performance using cross_val_score

In [66]:
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), iris.data, iris.target,cv=3)

array([0.96, 0.96, 0.94])

### svm model performance using cross_val_score

In [67]:
cross_val_score(SVC(gamma='auto'), iris.data, iris.target,cv=3)

array([0.98, 0.98, 0.96])

### Decision tree performance using cross_val_score

In [69]:
cross_val_score(tree.DecisionTreeClassifier(),iris.data, iris.target,cv=3)

array([0.98, 0.92, 0.98])

### random forest performance using cross_val_score

In [68]:
cross_val_score(RandomForestClassifier(n_estimators=40),iris.data, iris.target,cv=3)

array([0.98, 0.94, 0.96])
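
The four comparisons above can also be collected in one place so the mean scores are easy to rank. This is a sketch with some assumptions: the dictionary keys are illustrative names, `random_state=0` is added to the tree-based models for repeatability, and `LogisticRegression` is used with its default solver and a larger `max_iter` rather than the `liblinear`/`ovr` settings used above, so its score may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Model names are illustrative labels chosen for this sketch.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(gamma="auto"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=40, random_state=0),
}

# Mean 3-fold cross-validated accuracy per model.
mean_scores = {
    name: float(cross_val_score(model, iris.data, iris.target, cv=3).mean())
    for name, model in models.items()
}

# Print models from best to worst mean score.
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```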

### Therefore, we now tune parameters on the best model, which is SVM

## Parameter tuning using k fold cross validation

In [80]:
scores1 = cross_val_score(SVC(gamma='auto'),iris.data, iris.target, cv=10)
np.average(scores1)

0.9800000000000001
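
The cell above evaluates a single SVM configuration. To actually compare several settings, one option is scikit-learn's `GridSearchCV`, which runs k-fold cross validation for every combination in a grid. The grid below is an assumption chosen purely for illustration; the values of `C`, `gamma`, and `kernel` do not come from the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

# Illustrative grid; these candidate values are assumptions, not prescribed ones.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["auto", "scale"],
    "kernel": ["rbf", "linear"],
}

# cv=10 matches the number of folds used in the cell above.
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(round(search.best_score_, 4))
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it can be compared directly with the `np.average(scores1)` value above.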

## Result: I fitted all the required ML models on the iris flower dataset from scikit-learn and compared them with cross_val_score using k-fold cross validation. From this we found that SVM performs best on the iris dataset.