# *Model Validation*

Model validation is the process of determining how closely a model corresponds to the real system it represents.

It shows how the model behaves on real-world data and helps determine how well the model has been trained.

## ***The steps involved in model validation are stated below:***

1) Reserve a sample data set.

2) Train the model using the remaining part of the dataset.

3) Use the reserved sample as the test (validation) set to measure your model’s performance.
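
The three steps above can be sketched as follows. This is a minimal illustration on synthetic data rather than the wine dataset; the names `X_demo`, `y_demo` and `demo_clf`, the 25% split size and the tree depth are all assumptions made for the example.

```python
import numpy as np
from sklearn import metrics, model_selection, tree

# Synthetic stand-in data (an assumption; this is not the wine dataset)
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Step 1: reserve 25% of the rows as a validation sample
X_tr, X_val, y_tr, y_val = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 2: train the model on the remaining rows
demo_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
demo_clf.fit(X_tr, y_tr)

# Step 3: measure performance on the reserved sample only
val_acc = metrics.accuracy_score(y_val, demo_clf.predict(X_val))
print(f"validation accuracy: {val_acc:.2f}")
```

Because `train_test_split` shuffles the rows by default, the reserved sample is drawn from the whole dataset rather than from one end of it.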

Let's import some important libraries

In [None]:
import pandas as pd
from sklearn import tree
from sklearn import metrics
from sklearn import model_selection
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

This notebook is based on the **'Red Wine Quality dataset'**, which contains only physicochemical (input) and sensory (output) data for red wine.

### **Input variables (based on physicochemical tests):**

1) fixed acidity

2) volatile acidity

3) citric acid

4) residual sugar

5) chlorides

6) free sulfur dioxide

7) total sulfur dioxide

8) density

9) pH

10) sulphates

11) alcohol

### **Output variable (based on sensory data):**

1) quality (score between 0 and 10)

In [None]:
wine = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [None]:
wine.head()

In [None]:
wine.info()

In [None]:
qual = wine.quality.unique()
qual.sort()
qual

Since the quality only takes values from 3 to 8, let's map these values to the range 0 to 5 so that the class labels are consecutive integers starting at 0.

In [None]:
wine_quality_map = {
    3:0,
    4:1,
    5:2,
    6:3,
    7:4,
    8:5
}

wine.quality = wine.quality.map(wine_quality_map)

In [None]:
qual_map = wine.quality.unique()
qual_map.sort()
qual_map
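
As an alternative to writing the mapping by hand, the same dictionary can be derived from the values present in the column. The sketch below uses a small made-up Series (`toy_quality` is an assumption, not the real column) so that it runs on its own.

```python
import pandas as pd

# Made-up quality scores standing in for wine.quality (an assumption)
toy_quality = pd.Series([5, 3, 8, 6, 4, 7, 5])

# Sort the distinct scores and number them from 0 upwards
levels = sorted(toy_quality.unique())
derived_map = {int(level): idx for idx, level in enumerate(levels)}

# Apply the mapping exactly as the hand-written dictionary is applied
toy_mapped = toy_quality.map(derived_map)
print(derived_map)
print(toy_mapped.tolist())
```

This only reproduces the hand-written dictionary when every score from 3 to 8 occurs in the data; a missing score would shift the later labels down by one.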

### **Simple model selection technique**

For the simple model selection technique, I have divided the data (which contains 1599 entries) into a train set and a test set.

The train set contains the first 1000 entries and the test set contains the last 599 entries.

We will train the model on the first 1000 entries and check it by predicting on the test set and comparing the predictions with the true labels.

In [None]:
wine_train = wine.head(1000)
wine_train_label = wine_train['quality']
wine_train = wine_train.drop('quality', axis = 1)

In [None]:
wine_train.head()

In [None]:
wine_test = wine.tail(599)
wine_test_label = wine_test['quality']
wine_test = wine_test.drop('quality', axis = 1)

In [None]:
wine_test.head()

From here on we will use only the DecisionTreeClassifier, to show how the model accuracy varies across different model validation methods.

In [None]:
# Placeholder at index 0 so that list index i matches max_depth = i in the plot
train_accuracy = [0.5]
test_accuracy = [0.5]

for depth in range(1, 20):
    clf = tree.DecisionTreeClassifier(max_depth = depth)
    
    clf.fit(wine_train, wine_train_label)
    
    train_pred = clf.predict(wine_train)
    acc_train = metrics.accuracy_score(wine_train_label, train_pred)
    
    test_pred = clf.predict(wine_test)
    acc_test = metrics.accuracy_score(wine_test_label, test_pred)
    
    train_accuracy.append(acc_train)
    test_accuracy.append(acc_test)

In [None]:
plt.figure(figsize = (10, 5))
sns.set_style('whitegrid')
plt.plot(train_accuracy, label = 'train accuracy')
plt.plot(test_accuracy, label = 'test accuracy')
plt.legend(loc = 'upper left', prop = {'size': 15})
plt.xticks(range(0, 20, 5))
plt.xlabel('max_depth', size = 20)
plt.ylabel('accuracy', size = 20)
plt.show()

#### **Insight:**

From the graph plotted above, it is clear that the model predicts unseen data poorly: the highest test accuracy is approximately 0.57, at a max depth of 5.

The model overfits, as the train accuracy is much higher than the test accuracy.
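
Reading the best depth and the train/test gap off the plot can also be done numerically. The sketch below uses hypothetical accuracy values, invented purely for illustration and not taken from the run above.

```python
import numpy as np

# Hypothetical accuracies for max_depth = 1..5 (illustrative values only)
depths = [1, 2, 3, 4, 5]
train_acc_demo = [0.55, 0.60, 0.68, 0.80, 0.95]
test_acc_demo = [0.52, 0.56, 0.58, 0.55, 0.53]

# The best depth is the one with the highest test accuracy
best_depth = depths[int(np.argmax(test_acc_demo))]

# A large gap between train and test accuracy is a sign of overfitting
gap = train_acc_demo[-1] - test_acc_demo[-1]
print(f"best depth: {best_depth}, gap at deepest tree: {gap:.2f}")
```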

### **Cross Validation:**

Cross validation is a step in building any machine learning model that checks how well the model performs on data it was not trained on, which helps detect overfitting.

In this notebook we will only look at K-Fold cross validation and stratified K-Fold cross validation.

### **K-Fold Cross Validation:**

k-Fold Cross Validation randomly divides the set of observations into k groups, or folds, of approximately equal size.

The first fold is treated as a validation set, and the machine learning model is fit on the remaining (k-1) folds.

This procedure is repeated k times. Each time, a different fold is treated as the validation set.
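
The procedure described above can be seen on a tiny example. This sketch splits ten toy indices into five folds with scikit-learn's `KFold` (without shuffling, so the folds are easy to read); `toy` and `folds_demo` are names made up for the illustration.

```python
import numpy as np
from sklearn import model_selection

# Ten toy observations, identified by their index
toy = np.arange(10)

# Five folds, no shuffling, so each validation fold is a contiguous block
kf_demo = model_selection.KFold(n_splits=5)

folds_demo = []
for fold, (train_idx, valid_idx) in enumerate(kf_demo.split(toy)):
    folds_demo.append((train_idx.tolist(), valid_idx.tolist()))
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every observation lands in the validation set exactly once, and in the training set the other four times.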

In [None]:
# Let's Apply K-Fold Cross Validation

# For this create a new column K-Fold with entry -1
wine['kFold'] = -1

# Shuffle the Data
wine = wine.sample(frac = 1).reset_index(drop = True)

# Split the data into 5 Folds
kfolds = model_selection.KFold(n_splits = 5)
for fold, (t, v) in enumerate(kfolds.split(X = wine)):
    wine.loc[v, 'kFold'] = fold
    
# Saving the data for further use
wine.to_csv('k_fold.csv', index = False)

First, we will look at the train and validation accuracy for each choice of validation fold as the depth of the decision tree increases.

In [None]:
def check(fold):
    dt = pd.read_csv('./k_fold.csv')
    dt_train = dt[dt.kFold != fold].reset_index(drop = True)
    dt_test = dt[dt.kFold == fold].reset_index(drop = True)  
    
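    y_train = dt_train.quality.values
    # Drop the target and the fold label so that neither is used as a feature
    x_train = dt_train.drop(['quality', 'kFold'], axis = 1).values
    
    y_valid = dt_test.quality.values
    x_valid = dt_test.drop(['quality', 'kFold'], axis = 1).values
    
    # Placeholder at index 0 so that list index i matches max_depth = i in the plot
    ktrain_acc = [0.5]
    ktest_acc = [0.5]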
    for depth in range(1, 20):
        clf = tree.DecisionTreeClassifier(max_depth = depth)

        clf.fit(x_train, y_train)

        train_pred = clf.predict(x_train)
        acc_train = metrics.accuracy_score(y_train, train_pred)

        test_pred = clf.predict(x_valid)
        acc_test = metrics.accuracy_score(y_valid, test_pred)

        ktrain_acc.append(acc_train)
        ktest_acc.append(acc_test)
    plt.figure(figsize=(10,5))
    sns.set_style('whitegrid')
    plt.plot(ktrain_acc, label = 'train accuracy')
    plt.plot(ktest_acc, label = 'test accuracy')
    plt.legend(loc='upper left', prop = {'size': 15})
    plt.xticks(range(0, 20, 5))
    plt.xlabel('max_depth', size = 20)
    plt.ylabel('accuracy', size = 20)
    plt.show()

In [None]:
# Taking fold 0 as test set and rest as training set
check(fold = 0)

In [None]:
# Taking fold 1 as test set and rest as training set
check(fold = 1)

In [None]:
# Taking fold 2 as test set and rest as training set
check(fold = 2)

In [None]:
# Taking fold 3 as test set and rest as training set
check(fold = 3)

In [None]:
# Taking fold 4 as test set and rest as training set
check(fold = 4)

The performance measure reported by k-Fold Cross Validation is then the average of the values computed in the loop. The code below computes the accuracy on each fold, taken one at a time, at a max depth of 25, and then the average is calculated.

In [None]:
X = pd.read_csv('./k_fold.csv')
y = X.quality.values
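# Drop the target and the fold label so that neither is used as a feature
X = X.drop(['quality', 'kFold'], axis = 1).values
clf = tree.DecisionTreeClassifier(max_depth = 25)
# Pass an explicit KFold splitter: with an integer cv and a classifier,
# cross_val_score would use stratified folds instead
scores = model_selection.cross_val_score(clf, X, y, cv = model_selection.KFold(n_splits = 5))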
print(scores)

In [None]:
# Calculating the average
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x = wine['quality'])

### Stratified K-Fold Cross Validation:

From the graph above, it is clear that the classes are imbalanced: very little data is available for wine with quality index 0.

Wines with quality index 1 and 5 also have few samples, while wines with quality index 2 and 3 have a large number of samples available.

For such a classification problem, simple k-Fold validation doesn't produce good results, because a fold can end up with few or no samples of a rare class. So we move to another cross validation technique called stratified K-Fold cross validation.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
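
The difference is easiest to see on a deliberately imbalanced toy problem. In the sketch below, `y_toy` has 90 samples of class 0 followed by 10 samples of class 1 (made-up labels, an assumption for illustration), and the helper `minority_counts` counts how many class-1 samples fall in each validation fold.

```python
import numpy as np
from sklearn import model_selection

# Made-up imbalanced labels: 90 of class 0, then 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # the features do not matter for the split

def minority_counts(splitter):
    """Return the number of class-1 samples in each validation fold."""
    return [
        int(y_toy[valid_idx].sum())
        for _, valid_idx in splitter.split(X_toy, y_toy)
    ]

plain_counts = minority_counts(model_selection.KFold(n_splits=5))
strat_counts = minority_counts(model_selection.StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```

With plain `KFold` and no shuffling, all ten minority samples end up in the last fold, while `StratifiedKFold` places two in every fold. Shuffling before a plain `KFold` split reduces the problem but does not guarantee equal proportions.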

In [None]:
# Applying stratified k-Fold Cross validation
y_data = wine.quality.values
kf = model_selection.StratifiedKFold(n_splits = 5)

# Overwrite the existing 'kFold' column (the capitalisation that check_stratified reads)
for fold, (t, v) in enumerate(kf.split(X = wine, y = y_data)):
    wine.loc[v, 'kFold'] = fold

wine.to_csv('st_fold.csv', index = False)

In [None]:
def check_stratified(fold):
    st = pd.read_csv('./st_fold.csv')
    st_train = st[st.kFold != fold].reset_index(drop = True)
    st_test = st[st.kFold == fold].reset_index(drop = True)  
    
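    y_train = st_train.quality.values
    # Drop the target and the fold label so that neither is used as a feature
    x_train = st_train.drop(['quality', 'kFold'], axis = 1).values
    
    y_valid = st_test.quality.values
    x_valid = st_test.drop(['quality', 'kFold'], axis = 1).values
    
    # Placeholder at index 0 so that list index i matches max_depth = i in the plot
    sktrain_acc = [0.5]
    sktest_acc = [0.5]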
    for depth in range(1, 20):
        clf = tree.DecisionTreeClassifier(max_depth = depth)

        clf.fit(x_train, y_train)

        train_pred = clf.predict(x_train)
        acc_train = metrics.accuracy_score(y_train, train_pred)

        test_pred = clf.predict(x_valid)
        acc_test = metrics.accuracy_score(y_valid, test_pred)

        sktrain_acc.append(acc_train)
        sktest_acc.append(acc_test)
    plt.figure(figsize=(10,5))
    sns.set_style('whitegrid')
    plt.plot(sktrain_acc, label = 'train accuracy')
    plt.plot(sktest_acc, label = 'test accuracy')
    plt.legend(loc='upper left', prop = {'size': 15})
    plt.xticks(range(0, 20, 5))
    plt.xlabel('max_depth', size = 20)
    plt.ylabel('accuracy', size = 20)
    plt.show()

In [None]:
# Taking fold 0 as test set and rest as training set
check_stratified(fold = 0)

In [None]:
# Taking fold 1 as test set and rest as training set
check_stratified(fold = 1)

In [None]:
# Taking fold 2 as test set and rest as training set
check_stratified(fold = 2)

In [None]:
# Taking fold 3 as test set and rest as training set
check_stratified(fold = 3)

In [None]:
# Taking fold 4 as test set and rest as training set
check_stratified(fold = 4)

In [None]:
X = pd.read_csv('./st_fold.csv')
y = X.quality.values
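# Drop the target and the fold label so that neither is used as a feature
X = X.drop(['quality', 'kFold'], axis = 1).values
clf = tree.DecisionTreeClassifier(max_depth = 25)
# Make the stratified splitter explicit
scores = model_selection.cross_val_score(clf, X, y, cv = model_selection.StratifiedKFold(n_splits = 5))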
print(scores)

In [None]:
# Calculating the average
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

You can see from the plots and the accuracy scores above that stratified k-Fold Cross Validation produced a much better result than the k-Fold Cross Validation technique of model validation.

Thus, whenever there is an uneven distribution of targets, choose stratified k-Fold Cross Validation instead of simple k-Fold Cross Validation.
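
To compare the two techniques directly, both splitters can be passed to `cross_val_score` through its `cv` argument. The sketch below does this on synthetic imbalanced data (`X_syn` and `y_syn` are made up, and the features are pure noise), so the scores themselves mean nothing; only the pattern is of interest. Note that when `cv` is an integer and the estimator is a classifier, scikit-learn already uses stratified folds, so a plain `KFold` has to be requested explicitly.

```python
import numpy as np
from sklearn import model_selection, tree

# Synthetic imbalanced data (an assumption; not the wine dataset)
rng = np.random.RandomState(0)
X_syn = rng.rand(120, 3)
y_syn = rng.permutation(np.array([0] * 96 + [1] * 24))

clf_syn = tree.DecisionTreeClassifier(max_depth=3, random_state=0)

# Explicit splitter objects make the choice of technique visible
plain_cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
strat_cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain_scores = model_selection.cross_val_score(clf_syn, X_syn, y_syn, cv=plain_cv)
strat_scores = model_selection.cross_val_score(clf_syn, X_syn, y_syn, cv=strat_cv)

print("KFold mean accuracy:          %0.2f" % plain_scores.mean())
print("StratifiedKFold mean accuracy: %0.2f" % strat_scores.mean())
```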

## Any suggestions will be great for me to learn