## Cross Validation

### Table of Contents

<ul>
<li><a> Problem Statement  </a></li>
<li><a> k - fold Cross Validation </a></li>
<li><a> k - fold Cross Validation Algorithm</a></li>   
<li><a> Significance of Standardisation</a> </li>

<ul>
    <li><a>Why is it important?</a></li>
    <li><a>How are variables Standardised??</a></li>
    </ul>
<li><a>Types of Performance Metrics</a></li>
<li><a>Modus Operandi</a></li>
    
</ul>

#### Problem Statement:


Create a function for k-fold Cross Validation, $ \texttt{CV}$ (model,X_train,Y_train,$\textbf{k}$), which returns Cross validation Score as output by computing Validation Score of k- samples with Mean Squared Error as the Error metric. 

#### $\textbf{k}$ - fold Cross Validation:

For a sufficiently large sized sample, the conventional train-validate-test methodology is deployed to fit a model onto the data. However, in cases where sample size is significantly smaller, there is not enough scope for the sample to be sliced into appropriate proportions of train, validate and test data. To address this issue, a different methodology has been formulated where sample is randomly split into $\textbf{k}$ - smaller samples of identical size and thereby allowing the model to keep little more data for training.

The model runs for $\textbf{k}$ no.of iterations and takes a different ($\textbf{k-1}$) no.of samples as training data for each iteration to fit the model. The model thus fit is then tested on the left-over sample and the corresponding validation score is computed. The validation scores of all the iterations are then averaged as Arithmetic mean or Weighted average based on heterogenity of sample sizes. The average thus obtained is termed as Cross-validation Score and the technique is called $\textbf{k}$ - fold Cross Validation.

Cross validation is therefore a technique preferred when there is not enough data available for training set. Though not regarded as a thumb rule, it is preferred to have larger $\textbf{k}$ i.e slicing data into many samples ($ S_{1},S_{2},S_{3}.....S_{k} $) for smaller sized sample $ S $.

Smaller the Cross-Validation score, better fit the model is.

#### $\textbf{k}$ - fold Cross Validation algorithm:


The algorithm is as follows:

1. Randomly divide the sample into $\textbf{k}$ equal parts $ S_{1},S_{2},S_{3}.....S_{k} $ such that $ S_{1} \cup S_{2} \cup S_{3}  \cup ......\cup S_{k} = S $(Master Sample)  

   \& $ S_{i} \cap S_{j} = \varnothing $ , $\forall$   $i \neq j $
   
<table>
  <tbody>
    <tr>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td>.......</td>
      <td>k-1</td>
      <td>k</td>
    </tr>
  </tbody>
</table>
   
   
2. for i from 1 to k do:


3. hold out = $S_{i}$


4. training data  = $ \underset{j\neq i}{\underset{j=1}{\overset{k}{\bigcup}}} S_{j}$


5. Train Model M on training data


6. Validation score = Validation score of model M on holdout sample


7. end for


8. Cross Validation Score = Average of all validation scores.

    If samples are of equal size $\Rightarrow$ Simple Average.
    
    If samples are of unequal size $\Rightarrow$ Weighted Average.


$\textbf{Simple Average}$ = $\dfrac{V_{1}+V_{2}+V_{3}+.....+V_{k}}{k}$

where  $V_{1},V_{2}....V_{k}$ are respective validation scores of k-samples.

$\textbf{Weighted Average}$ = $\dfrac{n_{1}V_{1}+n_{2}V_{2}+n_{3}V_{3}.....+n_{k}V_{k}}{n_{1}+n_{2}+n_{3}+.....+n_{k}}$

where $n_{1},n_{2},n_{3}...$ are sizes of individual samples and $V_{1},V_{2}....V_{k}$ are  respective validation scores of k-samples.

### Significance of Standardisation:


#### Why is it important???


The variables that are used as predictors are usually quite disparate. This disparate nature acts as a hindrance for the user to compare them across the board as they are spread across wide orders of magnitude. In case of Linear Regression, the coefficients of fitted model cannot be compared as the predictors are of different range. In such a scenario, Standardisation comes to the rescue by bringing variables on an equivalent scale and facilitating comparison between them.


#### How are variables Standardised??

For a variable X, 

\begin{equation}
\textbf{Standardization} = \dfrac{X - \mu\left(X\right)}{std\left(X\right)}
\end{equation}

After Standardisation, all the predictors will have a common mean of 0 and a standard deviation of 1 thus allowing comparison between variables.

#### Types of Performance Metrics:

* Mean Absolute Deviation

* Mean Squared Error

* Root Mean Squared Error


In this problem, we use Mean Squared Error(MSE) as a performance metric/error metric to compute the validation score on the holdout sample.


\begin{equation}
\textbf{Mean Squared Error (MSE)} = \dfrac{1}{n} \underset{i=1} {\overset{n}{\sum}} \left( y_{i}^{test}- {\hat{y_{i}}}\right)^{2}
\end{equation}

**Kindly go through the Modus Operandi before executing the function.**

#### Modus Operandi:

1. Cars dataset is randomly shuffled.


2. Categorical variables are dropped from the dataframe to ensure the calculation of aggregate functions mean and standard deviation.


3. The dataset is then Standardised using the expression given above.


4. From the standardised dataset, X_train and Y_train are sent in as input parameters into the function $ \texttt{CV}$ (model,X_train,Y_train,$\textbf{k}$). 


5. Inside the function, the dataset is sliced into $\textbf{k}$-parts and a model(OLS Linear Regression in this case) is fit on ($\textbf{k-1}$) samples by holding out a sample $S_{i}$, where i is in range of 1 to k.


6. The model is then fit on the hold out sample $S_{i}$ and the validation score is evaluated.
    
    #### Note: In this function, Mean Square Error is used as performance metric to calculate Validation score.


7. After iterating on all the samples, Cross Validation score is returned as output. The model which returns the least Cross validation score is regarded as the ideal model.


*Comments:*

**i) The function used in this notebook is confined to Ordinary Least Squares(OLS) Linear Regression model and hence model object is not entered as input parameter to the function.**

**ii) The function can fit a Linear Regression Model on one (or) more predictors.**

**iii) The sample is split into k-folds and each of these samples are appended into an empty list so that the user can access individual samples.**


(X_train and Y_train need not necessarily be Weight and MPG)

In [1]:
#importing the required packages
import statsmodels.api as stm
import numpy as np
import pandas as pd

In [2]:
#initializing the dataframe data from cars.csv
data = pd.read_csv(r"C:\Users\anjit\Documents\cars.csv")

#Shuffle the data 
data = data.sample(len(data))

#Drop the nominal columns from the dataframe
data = data.drop(['Car','Origin','Model'], axis = 1)


In [3]:
#Function CV outputs the cross validation score, pass the predictors as X_train and Target variable as Y_train
#and k value (number of groups the data sample is to be split into)

def CV(X_train,Y_train,k):
    #initialization of arr array, used for calculation of cv score 
    arr = np.array([])
    #initializing an empty list s which stores the samples
    s = []
    
    #Standardize the data
    X_train = ((X_train - np.mean(X_train))/np.std(X_train))
    
    # n stores the number of predictors. When X_train is a series, length of columns returns AttributeError,
    # so we convert the series to dataframe and then calculate the value of n
    try:
        n = len(X_train.columns)
    except AttributeError:
        #converting series to data frame
        X_train = X_train.to_frame()
        n = len(X_train.columns)
    
    Y_train = Y_train.to_frame()
    #train stores the join of X_train and Y_train by concatenation along the vertical axis(axis = 1),
    #this is done before spliting the data into samples
    train = pd.concat([X_train,Y_train],axis=1)
    
        
    #Split the data into k samples
    
    #Check whether the length of data is divisible by k, if yes, divide the data into k equal parts
    if len(data)%k == 0:
        for i in range(k):
            #d stores the number of data elements in a sample
            d = len(data)//k
            #train dataframe is divided into samples, each with d elements
            sample = train.iloc[d*i : d*(i+1)]
            #append each sample to s
            s.append(sample)
    #If length of data is not exactly divisible by k, create k-1 samples of equal length, 
    #the remaining data is part of kth sample
    else:
        for i in range(k-1):
            d = len(data)//k
            sample= train.iloc[d*i : d*(i+1)]
            s.append(sample)
        sample = train.iloc[d*(k-1):]
        s.append(sample)
    
    #for loop for cross validation, for k samples
    for j in range(k): 
        #Store jth sample as holdout,validation is done on the holdout sample
        holdout = s[j]
    
        #the rest of the data(other than holdout) is stored as training data
        #df1 and df2 stores data from train dataframe excluding the jth sample
        df1 = train.iloc[0:(d*j)]
        df2 = train.iloc[(d*(j+1)):]
    
        #train_new is the union of df1 and df2 by concatenation along the horizontal axis(axis = 0)
        train_new = pd.concat([df1,df2],axis = 0)   
    
        #Get all the predictors
        X = train_new.loc[:,X_train.columns]
            
       #Store the target variable 
        Y = train_new.loc[:,Y_train.columns]
        
        #Adding constant to X (predictors)
        X = stm.add_constant(X)
     
        #Fitting the line through linear regression
        linreg = stm.OLS(Y,X).fit()
    
        #calculating y_pred
        x = 0
        #Adding the intercept(b0) to y_pred
        y_pred = linreg.params[0]
        while x <n:
            #adding the coeff*predictor value to y_pred
            y_pred += (linreg.params[x+1]*(holdout.iloc[:,x]))
            x += 1
       #Slicing y_test values from holdout sample
        y_test = holdout.iloc[:,(len(holdout.columns)-1)]
        
        # calculating mean squared error
        mse = (np.dot(np.transpose(y_pred-y_test),(y_pred-y_test)))/len(s[j])
        weight = len(s[j])*mse
        arr = np.append(arr,weight)
    #Calculate the cross validation score. As the number of elements is not equal in all samples , 
    #we calculate weighted avg of validation score
    cv = np.sum(arr)/len(train)
    return cv

In [4]:
#Function call for CV, pass the predictors as X_train and Target variable as Y_train
#and k value (number of groups the data sample is to be split into)
CV(data['Weight'],data.MPG,150)

26.940841478931667