# Theory
https://www.youtube.com/watch?v=xE9cIcJf48A

## Need for Cross-Validation

- When choosing an ML model, we need to compare how different models perform on our dataset.
- Data is usually limited, and training and testing on the same portion of data does not give an accurate view of how the model performs on new data.


<span style='color: red'> A model that is evaluated on the same data it was trained on can look accurate simply because it has learned that particular data well, while still failing on new data. This is called ***OVERFITTING***.</span>
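
A minimal sketch of this effect, assuming a synthetic noisy dataset from `make_classification` and a 1-nearest-neighbour classifier (both chosen here for illustration only, they are not part of the demo below). The names `X_toy`, `y_toy` and `memoriser` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, deliberately noisy data (an assumption for illustration only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# With n_neighbors=1 every training point is its own nearest neighbour,
# so the model effectively memorises the training set
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = memoriser.score(X_tr, y_tr)  # scored on data the model has seen
test_acc = memoriser.score(X_te, y_te)   # scored on held-out data
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The training score looks perfect, while the held-out score gives a more honest picture of how the model would behave on new data.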

## What is Cross-Validation
- Cross-validation is a technique in which we train and evaluate our model on one partition of the dataset, then re-partition the dataset and repeat the training and evaluation on the new partitions.

- We partition the dataset into training and testing data.
    - The training data is used by our model to learn.
    - The testing data is held out as unseen data. The model predicts on it so that we can evaluate the model's performance.
- We then choose a different portion of the data for training and testing and re-evaluate the model, to get a more reliable estimate of its performance.

## Steps in Cross-Validation

1. Split the data into train and test sets, train the model on the train set, and evaluate its performance on the test set.
2. Split the data again into new train and test sets, retrain, and re-evaluate the model's performance.
3. To get the overall performance metric, take the average of all the measures.
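
The steps above can be sketched as a manual loop. This is only an illustration under assumptions: the data is synthetic (`make_classification`), the splits come from `KFold`, and the names `X_toy`, `y_toy` and `fold_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data standing in for a real dataset (assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

# Steps 1 and 2: split, train, evaluate, then repeat with a new split
for train_idx, test_idx in splitter.split(X_toy):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))

# Step 3: average the per-split measures
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

In practice `cross_val_score` wraps this loop, which is what the demo further down uses.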

## Types of Cross-Validation
<span style='color: orange'>1. Leave One out Cross-Validation</span>
- All but one datapoint are used for training, and the single remaining datapoint is kept as the testing data.
    - Consider a dataset with n points: n-1 points form the training set and 1 point is the testing set.
        - Then another point is chosen as the testing data and the rest of the points form the training set.
        - This repeats until every point has been the testing point once, i.e. n times.
- The final performance measure is the average of the measures over all n iterations.
- (Best used for small datasets, since the model has to be trained n times.)
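
A small sketch of leave-one-out with scikit-learn's `LeaveOneOut` splitter, assuming a 30-point synthetic dataset and a 3-nearest-neighbour classifier picked only for illustration. Each iteration tests on a single point, so every per-iteration accuracy is either 0 or 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset with n = 30 points (assumption for illustration)
X_small, y_small = make_classification(n_samples=30, random_state=0)

loo = LeaveOneOut()
loo_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_small, y_small, cv=loo
)

# One score per held-out point; the final measure is their average
print(len(loo_scores), loo_scores.mean())
```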

<span style='color: orange'>2. K-Fold Cross-Validation</span>     
- The data is divided into K sections (folds).
- One section is used for testing and the rest for training.
    - The number of sections, K, is selected depending on the size of the dataset.
        - Then another section is chosen for testing and the remaining sections are used for training.
        - This continues K times, until every section has been used as the testing set once.
- The final performance measure is the average of the measures from the K iterations.
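
To make the rotation of the folds concrete, here is a sketch on ten dummy rows with K = 5. The array `X_rows` and the counter `test_counts` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows; only their indices matter here
X_rows = np.arange(10).reshape(-1, 1)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=7)
test_counts = np.zeros(len(X_rows), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_rows), start=1):
    test_counts[test_idx] += 1  # record which rows were tested in this fold
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# After K iterations every row has been in the testing set exactly once
print(test_counts)
```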
- -----------------------------------------
<span style='color: orange'>2.1 Stratified K-Fold Cross-Validation</span>
- The data is split so that each fold has the same percentage of each class as the full dataset.
    - Consider a dataset which has 2 classes of data.
        - In normal K-fold cross-validation, the data is divided without taking the distribution of the individual classes into account.
        - A fold can then contain few or no samples of the minority class, so the model's performance on that class cannot be measured reliably.
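
The difference can be seen on a toy imbalanced label vector. This sketch assumes 90 samples of class 0 followed by 10 samples of class 1, and compares the minority share in each test fold for plain `KFold` (unshuffled) and `StratifiedKFold`. The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90% class 0, 10% class 1, sorted by class
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # features are irrelevant for the split

def minority_share(splitter):
    """Fraction of class-1 samples in each test fold."""
    return [float(y_imb[test_idx].mean())
            for _, test_idx in splitter.split(X_imb, y_imb)]

plain_share = minority_share(KFold(n_splits=5))
strat_share = minority_share(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_share)
print("StratifiedKFold:", strat_share)
```

With plain K-fold, four of the five test folds contain no minority samples at all, while the stratified folds each keep the 10% share of the full dataset.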

# Demo

- We will use adult census data to predict whether a person is making above 50,000 dollars a year (the `income` column is either `<=50K` or `>50K`).
- Then, we use cross-validation on our models to see which one is performing better.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split, KFold 

In [2]:
dataset = pd.read_csv('adult.csv')
dataset.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

In [3]:
dataset.head(10)

,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
5,34,Private,216864,HS-grad,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
6,38,Private,150601,10th,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
7,74,State-gov,88638,Doctorate,16,Never-married,Prof-specialty,Other-relative,White,Female,0,3683,20,United-States,>50K
8,68,Federal-gov,422013,HS-grad,9,Divorced,Prof-specialty,Not-in-family,White,Female,0,3683,40,United-States,<=50K
9,41,Private,70037,Some-college,10,Never-married,Craft-repair,Unmarried,White,Male,0,3004,60,?,>50K


In [4]:
dataset.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education.num        16
marital.status        7
occupation           15
relationship          6
race                  5
sex                   2
capital.gain        119
capital.loss         92
hours.per.week       94
native.country       42
income                2
dtype: int64

In [5]:
dataset.groupby('marital.status')[['marital.status']].count()
dataset.groupby('sex')[['sex']].count()

sex,count
Female,10771
Male,21790


In [6]:
# Convert sex value to 0 and 1
dataset['sex'] = dataset['sex'].map({'Male': 0, 'Female': 1})

# Recode marital.status in place as binary: married (1) or not married (0)
dataset['marital.status'] = dataset['marital.status'].replace(['Never-married', 'Divorced', 'Separated', 'Widowed'], 0)
dataset['marital.status'] = dataset['marital.status'].replace(['Married-civ-spouse', 'Married-AF-spouse', 'Married-spouse-absent'], 1)
dataset['sex'].sample(5)

1054     0
6906     1
5377     0
24695    1
23934    0
Name: sex, dtype: int64

In [7]:
dataset.groupby('sex')[['sex']].count()

sex,count
0,21790
1,10771


In [8]:
dataset.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [9]:
dataset.drop(labels=['workclass', 'education', 'occupation', 'relationship', 'race', 'native.country'], axis=1, inplace=True)
dataset.head()

,age,fnlwgt,education.num,marital.status,sex,capital.gain,capital.loss,hours.per.week,income
0,90,77053,9,0,1,0,4356,40,<=50K
1,82,132870,9,0,1,0,4356,18,<=50K
2,66,186061,10,0,1,0,4356,40,<=50K
3,54,140359,4,0,1,0,3900,40,<=50K
4,41,264663,10,0,1,0,3900,40,<=50K


In [11]:
y = dataset['income']  # target: the income class we want to predict
X = dataset.drop('income', axis=1)  # features: every remaining column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models

[('LR', LogisticRegression()), ('KNN', KNeighborsClassifier())]

In [18]:
results = dict()
# Use the same 10 shuffled folds for every model so the scores are comparable
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
for name, model in models:
    # Accuracy on each of the 10 test folds
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results[name] = (cv_results.mean(), cv_results.std())

print()
print('name    results.mean    results.std')
for key, value in results.items():
    print(key, value)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



name    results.mean    results.std
LR (0.7980270424063377, 0.00828559456483822)
KNN (0.7754143779981189, 0.009084459642974708)
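
The convergence warning printed above suggests scaling the data or raising `max_iter`. One possible way to do both is to wrap the scaler and the model in a pipeline, so that the scaler is fitted on the training folds only. The sketch below is not the notebook's method: it runs on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for `X_train`, `y_train`), and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the census data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 100_000  # mimic one large-valued column such as fnlwgt

# Scaling happens inside each fold, so test folds never influence the scaler
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kfold_scaled = KFold(n_splits=10, shuffle=True, random_state=7)
scaled_scores = cross_val_score(
    scaled_lr, X_demo, y_demo, cv=kfold_scaled, scoring='accuracy'
)
print(scaled_scores.mean(), scaled_scores.std())
```

To try this on the census data, `('LR', scaled_lr)` could replace the plain `LogisticRegression()` entry in `models`. The resulting scores would need to be re-measured rather than assumed.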
