# K Fold Cross Validation Algorithm

Sometimes it is necessary to implement K-fold cross-validation manually, without relying on libraries, for academic research or other special situations. In this demonstration in R, I'm going to use the Titanic dataset to show how to split the data into $k$ folds. Once you understand the logic, you can code it in another language such as C++, Java, or Python.

In [1]:
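library('titanic') #Package that provides the Titanic dataset.
database<-titanic_train
head(database) #Preview the first rows.
print(dim(database)) #Check the dimensions before removing missing values.
database<-na.omit(database) #Drop the rows that contain missing values (NA), e.g. a missing Age.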

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


[1] 891  12


Next, set the number of folds $k$ for the split. Let's use $k=4$.

In [2]:
k <- 4
N <- dim(database)[1] #Database total size

We are going to write a loop that slices the original database into folds of equal size. If $k$ does not divide the total database size $N$ evenly, we round the fold size $N/k$ down to the nearest integer. For example, if $N/k$ is 3.5, each fold gets 3 rows. The leftover rows, counted with the modulo operator, are added to the last fold.
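As a quick check of that arithmetic, here is a minimal sketch with hypothetical values (`N_example` and `k_example` are chosen only for illustration and are not the Titanic sizes). The first three folds get `floor(10/4) = 2` rows each, and the last fold also takes the 2 leftover rows.

```r
# Hypothetical sizes, chosen only to illustrate the arithmetic
N_example <- 10
k_example <- 4

base_size <- floor(N_example / k_example)  # 2 rows per regular fold
remainder <- N_example %% k_example        # 2 leftover rows

# The first k-1 folds get base_size rows; the last one also takes the remainder
fold_sizes <- c(rep(base_size, k_example - 1), base_size + remainder)
print(fold_sizes)       # [1] 2 2 2 4
print(sum(fold_sizes))  # [1] 10
```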

In [3]:
#database<-database[sample(1:dim(database)[1]),] #Uncomment to shuffle the rows before splitting.
Min_index<-0 #Last row index of the previous fold (0 before the first fold)
folds<-list()
for(i in 1:k)
{
    if(i==k)
    {
        Max_index<-(floor(N/k))*i+(N%%k) #The last fold ends at floor(N/k)*k plus the remainder, i.e. at row N
    }
    else
    {
        Max_index<-(floor(N/k))*i #Every other fold ends at a multiple of floor(N/k)
    }
    folds[[i]]<-database[(Min_index+1):Max_index,] #Store the slice as the i-th fold
    Min_index<-Max_index #The next fold starts right after this one
}
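The slicing loop above can also be wrapped in a reusable function. The sketch below is one possible way to do it, not the only one: `make_folds` and its `shuffle` and `seed` arguments are hypothetical names introduced here, and the built-in `iris` data (150 rows) is used only as a stand-in so the example runs without the `titanic` package.

```r
# Hypothetical helper (not part of the original notebook): same logic as the loop above
make_folds <- function(data, k, shuffle = FALSE, seed = NULL) {
    N <- nrow(data)
    stopifnot(k >= 2, k <= N)
    if (shuffle) {
        if (!is.null(seed)) set.seed(seed)  # make the shuffle reproducible
        data <- data[sample(N), , drop = FALSE]
    }
    folds <- vector("list", k)
    Min_index <- 0
    for (i in 1:k) {
        Max_index <- floor(N / k) * i
        if (i == k) Max_index <- N  # the last fold absorbs the remainder
        folds[[i]] <- data[(Min_index + 1):Max_index, , drop = FALSE]
        Min_index <- Max_index
    }
    folds
}

# Usage on the stand-in data: 150 rows split into 4 folds
example_folds <- make_folds(iris, k = 4, shuffle = TRUE, seed = 1)
iris_fold_sizes <- sapply(example_folds, nrow)
print(iris_fold_sizes)  # [1] 37 37 37 39

# Every row should appear in exactly one fold
all_rows <- unlist(lapply(example_folds, rownames))
print(length(all_rows) == nrow(iris) && anyDuplicated(all_rows) == 0)
```

Shuffling before slicing matters when the rows are stored in a meaningful order, because contiguous slices would then produce folds that are not representative of the whole dataset.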

Each fold can be retrieved by converting the corresponding list element into a data frame:

In [4]:
head(data.frame(folds[1]))

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
7,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S


For a quick implementation, we are going to compute the accuracy with the "MLmetrics" library.

In [5]:
library('MLmetrics')


Attaching package: 'MLmetrics'

The following object is masked from 'package:base':

    Recall



The next step is to bind all the training folds together, train your ML algorithm on them, and then predict on the test fold. We repeat this $k$ times, once per fold. In this implementation I'm only using accuracy, but you can calculate other metrics such as MSE. The important lines are explained in the code comments:

In [6]:
Acc<-vector() #Empty vector that will store the accuracy of each of the k folds
for(j in 1:k) #Each fold j takes one turn as the test fold
{
    test<-data.frame(folds[j]) #Extract fold j as a data frame, as shown before
    train<-data.frame() #Start from an empty data frame and append the training folds to it
    for(i in 1:k) #This loop only binds the training folds
    {
        if(i!=j) #Skip the current test fold so that training and test data stay separate
        {
            train<-rbind(train,data.frame(folds[i])) #Binding process
        }
    }
    logisticmodel<-glm(Survived~Fare+Age+Sex, data = train,family=binomial) #Declare your ML algorithm. In this case I
                                                                            #decided to use a logistic model.
    predictions<-predict(logisticmodel,test,type="response") #Predicted survival probabilities for the test fold
    predictions<-(predictions>0.5)*1 #Convert the probabilities to 0/1 labels with a 0.5 threshold
    Acc[j]<-Accuracy(y_pred=predictions,y_true=test$Survived) #Store the accuracy of fold j
}
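MLmetrics is used above only for convenience. Accuracy is simply the proportion of predictions that match the true labels, so it can also be computed in base R. The sketch below uses made-up labels and probabilities (`y_true` and `y_prob` are hypothetical and unrelated to the Titanic results) to show the thresholding and the comparison.

```r
# Hypothetical labels and predicted probabilities, for illustration only
y_true <- c(1, 0, 1, 1, 0)
y_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6)

# Threshold the probabilities at 0.5 to get 0/1 predictions: 1 0 0 1 1
y_pred <- as.integer(y_prob > 0.5)

# Accuracy is the share of positions where prediction and truth agree
manual_accuracy <- mean(y_pred == y_true)
print(manual_accuracy)  # [1] 0.6
```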

Let's see how the accuracy behaves in each of the $k=4$ folds:

In [7]:
Acc

The last fold seems very good, but the others are weak. If we had skipped K-fold CV and kept only the last result as the best one, we would have been misled. As it stands, this model is not the best implementation for the Titanic problem, but it helps to show the importance of K-fold CV and why it is sometimes worth coding this kind of process without any library.

Finally, we use the mean to summarize the model's overall performance. You can also compute confidence intervals if you want further information about how the model behaves.

In [8]:
mean(Acc)
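One possible way to get the confidence interval mentioned above is a t-based interval around the mean fold accuracy. This is only a sketch: the values in `acc_example` are placeholders, not the accuracies produced by the model above, and because the training sets of different folds overlap, the fold accuracies are not fully independent, so the interval should be read as approximate.

```r
# Placeholder accuracies, NOT results from the model above
acc_example <- c(0.70, 0.75, 0.80, 0.75)

k_folds <- length(acc_example)
acc_mean <- mean(acc_example)
acc_se <- sd(acc_example) / sqrt(k_folds)  # standard error of the mean

# Two-sided 95% interval from the t distribution with k - 1 degrees of freedom
t_crit <- qt(0.975, df = k_folds - 1)
ci <- c(lower = acc_mean - t_crit * acc_se, upper = acc_mean + t_crit * acc_se)
print(acc_mean)
print(ci)
```

To apply it to the notebook's results, replace `acc_example` with the `Acc` vector.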

In general, this model is underfitted and needs further work, but it is good enough to show how K-fold CV should work. You can find complete solutions to this problem on Kaggle and elsewhere online.

I hope you've found this notebook useful for your purposes. If you notice any inconsistency, please let me know. Thank you!

email:luisangelalcantara@yahoo.com