<a href="https://colab.research.google.com/github/PorasS/AI/blob/master/KFoldCV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

KFold for regression.

StratifiedKFold for classification.

Cross-validation can be used for a variety of purposes in predictive modeling. These include:

- Generating out-of-sample predictions from a neural network
- Estimating a good number of epochs to train a neural network for (early stopping)
- Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses a number of folds, and multiple models, to give each segment of data a chance to serve as both validation and training data.
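
The idea can be sketched on a toy array, assuming ten items and five folds (both numbers are illustrative and not taken from the dataset used below). Each index should land in the validation set of exactly one fold and in the training set of all the others.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten items identified only by their index (illustrative).
items = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

validation_counts = np.zeros(len(items), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf.split(items), start=1):
    # Within a fold, training and validation indices never overlap.
    assert len(np.intersect1d(train_idx, val_idx)) == 0
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: validation indices {sorted(val_idx.tolist())}")

# Count how many times each item served as validation data.
print(validation_counts)
```

Every entry of `validation_counts` ends up as 1, which is what lets the per-fold predictions be stitched together into one out-of-sample prediction per item.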

**Regression vs Classification K-Fold Cross-Validation**

Regression and classification are handled somewhat differently with regard to cross-validation. Regression is the simpler case: you can break the data set into K folds with little regard for where each item lands. For regression, it is best that the data items fall into the folds as randomly as possible. It is also important to remember that not every fold will necessarily have exactly the same number of data items, because the data set cannot always be divided evenly into K folds. For regression cross-validation we will use the Scikit-Learn class KFold.
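
As a small illustration of uneven folds, the sketch below splits ten toy items into three folds. The item count and fold count are assumptions chosen so that the division leaves a remainder.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy items cannot be divided evenly into three folds.
items = np.arange(10)
kf = KFold(n_splits=3, shuffle=True, random_state=42)

# Size of the validation part of each fold.
fold_sizes = [len(val_idx) for _, val_idx in kf.split(items)]
print(fold_sizes)
```

The sizes differ by at most one item, and together they still cover all ten items.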

Cross-validation for classification could also use the KFold object; however, this technique would not ensure that the class balance in each fold remains the same as in the original data set. It is very important that the balance of classes a model encounters in use remains the same as (or similar to) the balance in its training set. A drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Preserving the class balance in every fold is referred to as stratified sampling and is accomplished by using the Scikit-Learn object StratifiedKFold in place of KFold whenever you are using classification. In summary, the following two objects in Scikit-Learn should be used:

- KFold: when dealing with a regression problem.
- StratifiedKFold: when dealing with a classification problem.

The following two sections demonstrate cross-validation with regression and classification.
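
To make the difference concrete, here is a minimal sketch that compares the two splitters on hypothetical imbalanced labels (80 items of class 0 and 20 of class 1; the labels and the helper `minority_counts` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 80 items of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # Feature values do not affect the split.

def minority_counts(splitter):
    """Number of class-1 items in the validation part of each fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

StratifiedKFold places four class-1 items in every fold, matching the 20% share in the full label set, while the counts from plain KFold are free to vary from fold to fold.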

**Out-of-Sample Regression Predictions with K-Fold Cross-Validation**

The following code trains a neural network on the jh-simple-dataset using 5-fold cross-validation. The expected performance of a neural network of the type trained here would be the score for the generated out-of-sample predictions. We begin by preparing a feature vector from the jh-simple-dataset to predict age. This is a regression problem.

https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_2_kfold.ipynb

In [None]:
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

display(df[0:5])

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b
1,2,kd,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a


In [None]:
# Generate dummies 
df = pd.concat([df,pd.get_dummies(df['job'],prefix='job')],axis=1)
df.drop('job',axis=1,inplace=True)
display(df[0:5])

Unnamed: 0,id,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product,job_11,job_al,job_am,job_ax,job_bf,job_by,job_cv,job_de,job_dz,job_e2,job_f8,job_gj,job_gv,job_kd,job_ke,job_kl,job_kp,job_ks,job_kw,job_mm,job_nb,job_nn,job_ob,job_pe,job_po,job_pq,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz
0,1,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,2,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,4,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,5,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix='area')],axis=1)
df.drop('area',axis=1,inplace=True)
display(df[0:5])

Unnamed: 0,id,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product,job_11,job_al,job_am,job_ax,job_bf,job_by,job_cv,job_de,job_dz,job_e2,job_f8,job_gj,job_gv,job_kd,job_ke,job_kl,job_kp,job_ks,job_kw,job_mm,job_nb,job_nn,job_ob,job_pe,job_po,job_pq,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz,area_a,area_b,area_c,area_d
0,1,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
1,2,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,3,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
3,4,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,5,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [None]:
# Generate dummies for product
df = pd.concat([df,pd.get_dummies(df['product'],prefix='product')],axis=1)
df.drop('product',axis=1,inplace=True)
display(df[0:5])

Unnamed: 0,id,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,job_11,job_al,job_am,job_ax,job_bf,job_by,job_cv,job_de,job_dz,job_e2,job_f8,job_gj,job_gv,job_kd,job_ke,job_kl,job_kp,job_ks,job_kw,job_mm,job_nb,job_nn,job_ob,job_pe,job_po,job_pq,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz,area_a,area_b,area_c,area_d,product_a,product_b,product_c,product_d,product_e,product_f,product_g
0,1,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0
1,2,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
2,3,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
3,4,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
4,5,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0


In [None]:
# Fill missing income values with the median
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['subscriptions'] = zscore(df['subscriptions'])
df['save_rate'] = zscore(df['save_rate'])

# Convert to numpy - regression
x_columns = df.columns.drop('age').drop('id')
# Cast to float32 so the dummy columns are numeric on every pandas version
x = df[x_columns].values.astype('float32')
y = df['age'].values

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We will assume 500 epochs and not use early stopping. Later we will see how we can estimate a better epoch count.

In [None]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Cross-validate
kf = KFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []
oos_idx = []  # Original row positions of the out-of-sample predictions

fold = 0
for train, test in kf.split(x):
  fold += 1
  print(f'Fold no: {fold}')
  x_train = x[train]
  y_train = y[train]
  x_test = x[test]
  y_test = y[test]

  # Build a fresh model for every fold so no weights carry over
  model = Sequential()
  model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
  model.add(Dense(10, activation='relu'))
  model.add(Dense(1))
  model.compile(loss='mean_squared_error', optimizer='adam')

  model.fit(x_train, y_train, validation_data=(x_test, y_test), verbose=0, epochs=500)

  pred = model.predict(x_test)

  oos_y.append(y_test)
  oos_pred.append(pred)
  oos_idx.append(test)

  # Measure this fold's RMSE
  score = np.sqrt(metrics.mean_squared_error(y_test, pred))
  print(f'RMSE score: {score}')

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_idx = np.concatenate(oos_idx)
print(f'oos_y : {oos_y}')
print(f'oos_pred : {oos_pred}')
score = np.sqrt(metrics.mean_squared_error(oos_y, oos_pred))
print(f'Final RMSE Score: {score}')

# Write the cross-validated prediction. The folds are shuffled, so index the
# results by original row position to keep them aligned with df.
oos_y = pd.DataFrame(oos_y, index=oos_idx, columns=['y'])
oos_pred = pd.DataFrame(oos_pred, index=oos_idx, columns=['pred'])
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)


Fold no: 1
RMSE score: 0.4949236023875796
Fold no: 2
RMSE score: 0.5643183622736091
Fold no: 3
RMSE score: 0.6626730776292719
Fold no: 4
RMSE score: 0.5266866995884728
Fold no: 5
RMSE score: 0.9742748783289431
oos_y : [47 49 46 ... 49 44 38]
oos_pred : [[47.382854]
 [49.178684]
 [45.64683 ]
 ...
 [49.113636]
 [43.712086]
 [38.006805]]
Final RMSE Score: 0.667705116362065
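
The final score is not the plain average of the fold scores, because RMSE is computed over the pooled predictions. The sketch below works from the fold scores printed above under the assumption that all five folds hold the same number of rows; with unequal folds, each fold's squared error would have to be weighted by its row count instead.

```python
import numpy as np

# Per-fold RMSE values copied from the output above.
fold_rmse = np.array([
    0.4949236023875796,
    0.5643183622736091,
    0.6626730776292719,
    0.5266866995884728,
    0.9742748783289431,
])

# Assumption: equal-sized folds, so each fold's mean squared error
# carries the same weight in the pooled error.
pooled_rmse = np.sqrt(np.mean(fold_rmse ** 2))
simple_mean = fold_rmse.mean()

print(f"Pooled RMSE:       {pooled_rmse:.6f}")
print(f"Mean of fold RMSE: {simple_mean:.6f}")
```

Under that assumption the pooled value should agree with the final RMSE score reported above, while the simple mean comes out lower because it gives less weight to the one fold with a large error.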


**Classification with Stratified K-Fold Cross-Validation**

The following code trains a neural network on the jh-simple-dataset with cross-validation to generate out-of-sample predictions. It also collects the out-of-sample results (predictions on the test set) so they can be written out.

It is good practice to perform stratified k-fold cross-validation with classification data. This ensures that the percentage of each class remains approximately the same across all folds. To do this, use the StratifiedKFold object instead of the KFold object used in regression.

In [1]:

import pandas as pd
from scipy.stats import zscore

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

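# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
# Cast to float32 so the dummy columns are numeric on every pandas version
x = df[x_columns].values.astype('float32')
dummies = pd.get_dummies(df['product']) # One-hot encode the target classes
products = dummies.columns
y = dummies.values.astype('float32')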
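
The target is one-hot encoded here, and class indices are later recovered with argmax. A minimal sketch of that round trip follows, using a short hypothetical label series in place of the product column (the names `labels`, `classes`, and `recovered` are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for the 'product' column.
labels = pd.Series(['b', 'c', 'b', 'a', 'c'])

dummies = pd.get_dummies(labels)
classes = dummies.columns  # Column order defines the class index.
y_onehot = dummies.values.astype('float32')

# argmax recovers the class index; indexing the columns recovers the label.
class_idx = np.argmax(y_onehot, axis=1)
recovered = classes[class_idx]

print(list(classes))
print(class_idx.tolist())
print(list(recovered))
```

The same argmax step applies to the softmax output of a network, where it picks the class with the highest predicted probability.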

We will assume 500 epochs and not use early stopping. Later we will see how we can estimate a better epoch count.

In [5]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

kf = StratifiedKFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []
oos_idx = []  # Original row positions of the out-of-sample predictions
fold = 0

# StratifiedKFold needs the class labels (y) to stratify on
for train, test in kf.split(x, df['product']):
  fold += 1
  print(f'fold: #{fold}')

  x_train = x[train]
  y_train = y[train]
  x_test = x[test]
  y_test = y[test]

  # Build a fresh model for every fold so no weights carry over
  model = Sequential()
  model.add(Dense(50, input_dim=x.shape[1], activation='relu')) # Hidden 1
  model.add(Dense(25, activation='relu')) # Hidden 2
  model.add(Dense(y.shape[1], activation='softmax')) # Output
  model.compile(loss='categorical_crossentropy', optimizer='adam')

  model.fit(x_train, y_train, validation_data=(x_test, y_test), verbose=0,
            epochs=500)

  pred = model.predict(x_test)

  oos_y.append(y_test)
  oos_idx.append(test)
  # raw probabilities to chosen class (highest probability)
  pred = np.argmax(pred, axis=1)
  oos_pred.append(pred)

  # Measure this fold's accuracy
  y_compare = np.argmax(y_test, axis=1) # For accuracy calculation
  score = metrics.accuracy_score(y_compare, pred)
  print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_idx = np.concatenate(oos_idx)
oos_y_compare = np.argmax(oos_y, axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")

# Write the cross-validated prediction. The folds are shuffled, so index the
# results by original row position to keep them aligned with df.
oos_y = pd.DataFrame(oos_y, index=oos_idx, columns=products)
oos_pred = pd.DataFrame(oos_pred, index=oos_idx, columns=['pred'])
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write,index=False)


fold: #1
Fold score (accuracy): 0.6775
fold: #2
Fold score (accuracy): 0.6425
fold: #3
Fold score (accuracy): 0.7025
fold: #4
Fold score (accuracy): 0.6275
fold: #5
Fold score (accuracy): 0.67
Final score (accuracy): 0.664
