# Data partitioning

To make effective use of the available data, it's important to separate it into sets meant for training the models and sets meant for evaluating them, in order to detect and avoid overfitting to the available data.<br>
There are plenty of strategies one can use for splitting the data, each with its own benefits and downsides.

But first, let's get the libraries and data to be used throughout this notebook.

<b>1. Import the needed libraries (pandas as pd and numpy as np)</b>

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

<b>2. Read the dataset `diabetes.csv`</b>

In [2]:
diabetes = pd.read_csv('datasets/diabetes.csv')
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


<b>3. Create an object named `data` that will contain your independent features and another object named `target` that will contain your dependent feature/target (last column in the dataset).</b>

In [3]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
data = diabetes.iloc[:,:-1]
target = diabetes.iloc[:,-1]

In [5]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


A lot of the material in this notebook applies only to numerical data. For cases like these, it's important to filter the features properly, separating the numeric ones from the non-numeric ones. This can be done with the pandas method `.select_dtypes()`, which can include and/or exclude features of a given data type. Numeric variables are identified by their data type: `np.number` from the `numpy` library matches all numeric dtypes.

- _Documentation pandas.DataFrame.select_dtypes():_ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
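
As a small illustration of how `include` and `exclude` behave, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its column names are assumptions for the example only and are not part of the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with mixed dtypes (not the diabetes data)
toy = pd.DataFrame({
    "age": [50, 31, 32],
    "bmi": [33.6, 26.6, 23.3],
    "smoker": ["yes", "no", "no"],
})

# Keep only the numeric columns, then only the non-numeric ones
toy_numeric = toy.select_dtypes(include=np.number)
toy_other = toy.select_dtypes(exclude=np.number)

print(list(toy_numeric.columns))
print(list(toy_other.columns))
```

When a dataset has no non-numeric columns, as is the case for `diabetes.csv`, the `exclude` call simply returns a frame with no columns.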
    
<b>4. Define a new object named `data_n` where only the numerical variables are maintained, setting the `include` parameter's value to `np.number`, and an object named `data_c` with all the categorical independent variables. These objects will be used later in the notebook.</b>

In [6]:
data_n = data.select_dtypes(include=np.number).set_index(data.index)
data_c = data.select_dtypes(exclude=np.number).set_index(data.index)

In [7]:
data_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Empty DataFrame


Now, let's go into the data.

### 1.1. The train-test split

<img src="./img/training-validation-test-data-set.png" alt="Drawing" style="width: 700px;"/>

This is the simplest and most common approach: splitting the data into two sets, one for training the model (`train`) and one for model validation purposes (`test`). Ideally the data is split leaving 70-80% of observations for training, and the rest for validation.

In this exercise, we are going to split our dataset into train, validation and test sets. <br>
scikit-learn provides a function named `train_test_split`, used last week, that splits a dataset into two parts.

- _Documentation: sklearn.model_selection.train_test_split():_ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

<b>5. Import the function `train_test_split` from `sklearn.model_selection`</b>

In [8]:
from sklearn.model_selection import train_test_split

<b>6. Divide the `data` into `X_train_val` and `X_test`, the `target` into `y_train_val` and `y_test`, and define the following arguments: `test_size = 0.2`, `random_state = 15`, `shuffle = True` and `stratify = target` </b>

In [9]:
X_train_val, X_test, y_train_val, y_test = train_test_split(data, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state = 15,
                                                    shuffle = True,
                                                    stratify = target
                                                   )

This will create two different datasets, one for train (80% of the data) and one for test (20% of the data). <br>
`shuffle` randomizes the order of the observations, and `stratify` makes it so that every dataset resulting from the split has the same proportion of each label of the dependent variable.
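
To see what `stratify` does in practice, the following minimal sketch uses made-up labels in which 30% of the observations belong to class 1. The arrays `X_toy` and `y_toy` are assumptions for illustration; the call itself mirrors the one above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% of them labelled 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=15, shuffle=True, stratify=y_toy
)

# With stratify, both parts keep the same share of positives as the full data
print("train share of 1s:", y_tr.mean())
print("test share of 1s:", y_te.mean())
```

Without `stratify`, the two shares would generally differ from each other and from the overall 30%, more so on small datasets.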


### How to create the three datasets: train, validation and test?
To create three datasets (train, validation and test) with `train_test_split`, the function has to be called twice. <br>
The first call, above, produced a test set (`X_test` and `y_test`) and a set that still holds the data for both training and validation (`X_train_val` and `y_train_val`). The second call splits the latter into train and validation.

<b>7. Divide the `X_train_val` into `X_train` and `X_val`, the `y_train_val` into `y_train` and `y_val`, and define the following arguments: `test_size = 0.25`, `random_state = 15`, `shuffle = True` and `stratify = y_train_val`.</b>

In [10]:

X_train, X_val, y_train, y_val = train_test_split(X_train_val, 
                                                    y_train_val, 
                                                    test_size = 0.25,
                                                    random_state = 15,
                                                    shuffle = True,
                                                    stratify = y_train_val
                                                   )

<b>8. Run the cell below to check the proportion of data for each dataset. </b>

In [11]:
print('train:{:.0%} | validation:{:.0%} | test:{:.0%}'.format(len(y_train)/len(target),
                                                              len(y_val)/len(target),
                                                              len(y_test)/len(target)
                                                             ))

train:60% | validation:20% | test:20%


Now we have three different datasets, namely:
- Training dataset, with 60% of the data, that will allow me to build the model;
- Validation dataset, with 20% of the data, that will allow me to fine tune the model and check some problems like overfitting;
- Test dataset, with 20% of the data, that will allow me to evaluate the performance of the final model.
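
The 60/20/20 proportions follow from the two calls: the second call takes 25% of the remaining 80%, which is 20% of the full dataset. The sketch below wraps that arithmetic in a helper; `three_way_split` is a hypothetical name, not a scikit-learn function, and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.2, test_size=0.2, random_state=15):
    """Hypothetical helper: sizes are fractions of the FULL dataset."""
    X_tv, X_te, y_tv, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Fraction of the remaining data that must go to validation,
    # e.g. 0.2 / (1 - 0.2) = 0.25
    rel_val = val_size / (1 - test_size)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_tv, y_tv, test_size=rel_val, random_state=random_state, stratify=y_tv
    )
    return X_tr, X_va, X_te, y_tr, y_va, y_te

# Made-up data: 100 rows, 30% positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

parts = three_way_split(X_demo, y_demo)
print([len(p) for p in parts[:3]])  # sizes of train, validation, test
```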

With this approach, there is a possibility of high bias if we have limited data, since there's a higher chance of missing important information for training. Additionally, the model's performance on validation data is a less reliable indicator of the model's performance on new data, since the amount of validation data is rather small.<br>
If there's a high amount of data, and the test sample has the same distribution as the train sample, then this approach is acceptable.

****

The techniques covered in the next steps are commonly used in applied machine learning to compare and select a model for a given predictive modeling problem. <br>
In the following cases, we are going to check the performance of a Logistic Regression using those different techniques.

### 1.2. K-Fold Cross-Validation

<img src="./img/kfold.png" alt="Drawing" style="width: 600px;"/>

This approach's idea is to make the most of the available data, whilst avoiding overlap between validation datasets.<br>
There are multiple steps to this approach:
1. Split the data into k partitions of equal size
2. Each partition is used for testing the model once. When a partition is used for testing, all others are used for training
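3. Optionally, the partitions are stratified so that each keeps the class proportions of the target (plain `KFold` does not do this; see Stratified K-Fold below)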
4. The estimations from each test are averaged, resulting in an overall estimate
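
The partitioning part of the steps above can be sketched by hand with NumPy. This is only an illustration of the idea under simple assumptions (contiguous folds, no shuffling, no stratification); `kfold_indices` is a hypothetical helper and not how scikit-learn is implemented internally.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # Every other fold is used for training
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, 5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each observation ends up in a test fold exactly once, and a fold is never part of its own training set.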

In the following examples we are going to use only the numeric variables of our dataset.

<b>9. Check `data`</b>

In [12]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [13]:
data_c.head()

0
1
2
3
4


<b>10. Import `KFold` from `sklearn.model_selection`</b>

In [14]:

from sklearn.model_selection import KFold

<b>11. Import `LogisticRegression` from `sklearn.linear_model`</b>

In [15]:

from sklearn.linear_model import LogisticRegression

<b>12. Create a function named `run_model_LR` that receives the independent variables and the dependent variable as parameters and returns a Logistic Regression model fitted to that data.</b>

In [16]:

def run_model_LR(X,y):
    logge = LogisticRegression()
    logge.fit(X,y)
    return logge

<b>13. Create a function named as `evaluate_model` that receives as parameters the independent variables, the dependent variable and the model and returns the ``score`` method result.</b>

In [17]:
def evaluate_model(X,y,model):
    return model.score(X,y)

<b>14. Run the cell below to create a function named `avg_score_LR` that prints the average score for the train and the test sets and returns the per-split scores. Its parameters are the partition technique you are going to use, your independent variables and your dependent variable.</b>

In [18]:
def avg_score_LR(method,X,y):
    score_train = []
    score_test = []

    for train_index, test_index in method.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = run_model_LR(X_train, y_train)
        value_train = evaluate_model(X_train, y_train, model)
        value_test = evaluate_model(X_test,y_test, model)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))
    
    return score_train, score_test

<b>15. Create a KFold Instance where the number of splits is 10 (*n_splits*) and name it as `kf`</b>

In [19]:
kf = KFold(n_splits = 10)



<b>16. Call the function `avg_score_LR` and check the average score for the train and the test sets using `kf`</b>





In [20]:
avg_score_LR(kf, data,target)

Train: 0.7838532996494985
Test: 0.7721291866028708


([0.7988422575976846,
  0.7756874095513748,
  0.7858176555716353,
  0.788712011577424,
  0.7829232995658466,
  0.7901591895803184,
  0.768451519536903,
  0.7742402315484804,
  0.7933526011560693,
  0.7803468208092486],
 [0.7012987012987013,
  0.8441558441558441,
  0.7402597402597403,
  0.6883116883116883,
  0.7922077922077922,
  0.7402597402597403,
  0.8571428571428571,
  0.8181818181818182,
  0.7368421052631579,
  0.8026315789473685])
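
For comparison, scikit-learn's `cross_validate` runs the same kind of loop as `avg_score_LR`. The sketch below is run on synthetic stand-in data from `make_classification` (an assumption, so that the example is self-contained), and `max_iter=1000` is also an assumption added to avoid convergence warnings; with the notebook's objects you would pass `data` and `target` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=KFold(n_splits=10),
    return_train_score=True,  # train scores are not returned by default
)

print("Train:", cv_results["train_score"].mean())
print("Test:", cv_results["test_score"].mean())
```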

### 1.3. Repeated K-Fold

Repeated K-Fold is, as the name says, running K-Fold multiple times, with a different random partition each time, and averaging the results across all runs.

<b>17. Import `RepeatedKFold` from `sklearn.model_selection`</b>

In [21]:

from sklearn.model_selection import RepeatedKFold

<b>18. Create a RepeatedKFold instance where the number of splits is 6 (`n_splits=6`) and the number of times the cross-validator is repeated is 2 (`n_repeats=2`), and name it `rkf`</b>

In [22]:
rkf = RepeatedKFold(n_splits = 6, n_repeats = 2)

<b>19. Call the function `avg_score_LR` and check the average score for the train and the test sets using `rkf`</b>

In [23]:
avg_score_LR(rkf,data,target)

Train: 0.7828125
Test: 0.771484375


([0.784375,
  0.78125,
  0.7875,
  0.7703125,
  0.790625,
  0.7796875,
  0.78125,
  0.79375,
  0.7828125,
  0.7796875,
  0.7890625,
  0.7734375],
 [0.734375,
  0.7578125,
  0.75,
  0.8515625,
  0.7265625,
  0.78125,
  0.78125,
  0.75,
  0.78125,
  0.8046875,
  0.765625,
  0.7734375])
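
Each list above holds twelve scores because 6 splits repeated 2 times give 12 train/test pairs. Since `rkf` was created without a `random_state`, its partitions change from run to run. The following sketch, on a made-up array of 12 rows, shows the count and how a fixed seed makes the partitions reproducible; the name `rkf_demo` and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_demo = np.arange(24).reshape(12, 2)  # 12 hypothetical rows

rkf_demo = RepeatedKFold(n_splits=6, n_repeats=2, random_state=15)
splits_a = [test.tolist() for _, test in rkf_demo.split(X_demo)]
splits_b = [test.tolist() for _, test in rkf_demo.split(X_demo)]

print(len(splits_a))         # number of train/test pairs
print(splits_a == splits_b)  # whether a fixed seed reproduces the folds
```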

### 1.4. Leave One Out

The Leave One Out method is the most extreme version of K-Fold Cross-Validation: for _n_ observations, it uses _n_ splits and therefore _n_ training phases. <br>
Each time the model is trained, a prediction is made for the one observation left out, so with accuracy as the score each individual test score is either 0 or 1. <br>
This method uses almost all the data for training in every split and guarantees that test sets never overlap. However, it's very computationally expensive, and will take a long time to run for higher values of _n_.
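
The relationship with K-Fold can be checked directly on a tiny made-up array: with _n_ observations, Leave One Out produces the same partitions as an unshuffled K-Fold with _n_ splits. The array `X_demo` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(10).reshape(5, 2)  # 5 hypothetical observations

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=5).split(X_demo)]

print(len(loo_splits))          # one split per observation
print(loo_splits == kf_splits)  # same partitions as KFold with n splits
```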

<b>20. Do the same steps you applied on the previous techniques, but this time using the Leave One Out. For that, you need to import `LeaveOneOut` from `sklearn.model_selection`</b>

In [24]:
from sklearn.model_selection import LeaveOneOut


In [25]:
loo = LeaveOneOut()

In [26]:
avg_score_LR(loo,data,target)

Train: 0.7817372202303345
Test: 0.7799479166666666


([0.7770534550195567,
  0.7796610169491526,
  0.7757496740547588,
  ...

### 1.5. Stratified K-Fold

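The Stratified K-Fold method is simply a version of K-Fold where each split keeps approximately the same class proportions of the target variable as the full dataset. Before trying this split, it's necessary to adapt our scoring function, because `StratifiedKFold.split` needs the target as well as the features.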
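
The difference from plain K-Fold is easiest to see on labels that are sorted by class. The sketch below uses made-up labels with 20% positives (an assumption for illustration) and reports the share of positives in each test fold for both splitters; `positive_share` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X_demo = np.zeros((100, 1))
y_demo = np.array([0] * 80 + [1] * 20)  # sorted labels, 20% positives

def positive_share(splitter):
    """Share of class 1 in each test fold produced by the splitter."""
    return [float(y_demo[test].mean()) for _, test in splitter.split(X_demo, y_demo)]

plain = positive_share(KFold(n_splits=5))
stratified = positive_share(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With unshuffled `KFold`, whole folds contain a single class, whereas the stratified folds all reflect the overall class balance.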

<b>21. Run the cell below to create a function named `avg_score_LR_skf` that prints the average score for the train and the test sets. Its parameters are the partition technique you are going to use, your independent variables and your dependent variable. Unlike `avg_score_LR`, it passes the target to `split`, which `StratifiedKFold` requires in order to stratify the data.</b>

In [27]:
def avg_score_LR_skf(method,X,y):
    score_train = []
    score_test = []
    for train_index, test_index in method.split(X,y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = run_model_LR(X_train, y_train)
        value_train = evaluate_model(X_train, y_train, model)
        value_test = evaluate_model(X_test,y_test, model)
        score_train.append(value_train)
        score_test.append(value_test)

    print('Train:', np.mean(score_train))
    print('Test:', np.mean(score_test))
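
Since the `split` method of `KFold` also accepts a target argument (and ignores it), a single function can serve every splitter used in this notebook. The sketch below is one possible way to write it, not a replacement prescribed by the exercise: `avg_score_any` is a hypothetical name, the data is synthetic, and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold

def avg_score_any(method, X, y):
    """Hypothetical variant of avg_score_LR that works with any splitter used here."""
    score_train, score_test = [], []
    # Passing y is harmless for KFold (ignored) and required by StratifiedKFold
    for train_index, test_index in method.split(X, y):
        X_tr, X_te = X.iloc[train_index], X.iloc[test_index]
        y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        score_train.append(model.score(X_tr, y_tr))
        score_test.append(model.score(X_te, y_te))
    return float(np.mean(score_train)), float(np.mean(score_test))

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_arr, y_arr = make_classification(n_samples=200, n_features=8, random_state=0)
X_demo, y_demo = pd.DataFrame(X_arr), pd.Series(y_arr)

res_kf = avg_score_any(KFold(n_splits=5), X_demo, y_demo)
res_skf = avg_score_any(StratifiedKFold(n_splits=5), X_demo, y_demo)
print("KFold (train, test):", res_kf)
print("StratifiedKFold (train, test):", res_skf)
```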

<b>22. Import `StratifiedKFold` from `sklearn.model_selection`</b>

In [28]:
from sklearn.model_selection import StratifiedKFold

<b>23. Create a `StratifiedKFold` instance and store it in `skf`. Then, call the function `avg_score_LR_skf` and check the average score for the train and the test sets using `skf`</b>

In [29]:
skf = StratifiedKFold(n_splits = 10)

In [30]:
avg_score_LR_skf(skf,data,target)

Train: 0.7795136478087383
Test: 0.7734791524265209


In [31]:
lr = LogisticRegression().fit(X = X_train,y = y_train)

In [32]:
lr.score(X_train,y_train)

0.7934782608695652

In [33]:
lr.score(X_val,y_val)

0.7597402597402597

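Summary of the average scores per technique (some values differ slightly from the cell outputs above, presumably because they were recorded in a different run):

| Technique | Train | Test / validation |
|---|---|---|
| base (hold-out) | 0.7869565217391304 | 0.7597402597402597 |
| k-fold | 0.7832744284483407 | 0.7708304853041694 |
| repeated k-fold | 0.7833333333333333 | 0.7747395833333334 |
| leave one out | 0.7817168486527598 | 0.7786458333333334 |
| stratified k-fold | 0.779368930008449 | 0.7734791524265209 |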

### 1.6. Other methods

scikit-learn's `model_selection` module offers several other splitters and model-selection utilities, and they are applied in much the same way as the ones we saw previously.

- _Documentation: sklearn.model_selection:_ https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection


### That's all for today!

<details>
<summary><b>Image links (click this)</b></summary>



```
https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
    
```

</details>

### Important Note:

Please remember that just because the scores are better or worse when using **k-fold cross-validation** compared to the **hold-out method** does **not** mean that the model is inherently *"better with k-fold."* These are two different techniques to evaluate the **same model**.

- **K-fold cross-validation** is more robust because it evaluates the model across multiple subsets of the data, but it doesn't change the underlying model.
- **Hold-out validation** evaluates the model on a single split of the data, which can lead to slightly different performance due to random variations in the split.

In both cases, the model being evaluated is exactly the same. What differs is how we assess its performance. The model’s performance on unseen data will be **consistent**, regardless of the validation method used. These validation techniques help you **choose the best model** and evaluate its stability, but they do not change the model’s performance on new, unknown data.

# What should I do?

The choice between using hold-out and k-fold cross-validation (or another cross-validation technique) to evaluate your model's performance depends on various factors, and each approach has its own advantages and disadvantages. The appropriate choice depends on your needs and project objectives.

**1.  Hold-Out Method (`train_test_split`):**

Advantages:

- Faster: Simple data splitting is faster than k-fold cross-validation, which is useful when you have time or resource constraints.
- Simplicity: It is easy to implement and understand, making it a good choice for quick analyses or prototyping.
- Useful for initial assessments: It can be used for an initial assessment of the model before investing more time in cross-validation techniques.

Disadvantages:

- Can be biased: Depending on how the data is split, results can be biased, as performance can vary significantly based on the random choice of training and testing data.
- Does not capture variability: It does not take into account the variability of results due to different data splits, which can result in an optimistic or pessimistic evaluation of the model.

**2. K-Fold Cross-Validation:**

Advantages:

- Robust evaluation: Provides a more robust and reliable evaluation of the model's performance, as it considers multiple data splits.
- Better use of data: Utilizes the entire dataset for both training and testing in multiple iterations, reducing data wastage.
- Helps detect overfitting: Allows you to detect whether the model is overfitting the training data.

Disadvantages:

- More time-consuming: It can be more time-consuming, especially with a large number of folds.
- Complex to set up: Requires more setup and implementation than simple data splitting.
- Not suitable for all cases: with highly imbalanced data, plain k-fold can produce folds with very few minority-class observations (stratified k-fold mitigates this), and under tight time constraints the repeated training may be too costly.

In many cases, k-fold cross-validation is the default preference, but the choice depends on the specific circumstances of the project. You may also consider other cross-validation techniques, such as stratified cross-validation or leave-one-out cross-validation, depending on the project's requirements.
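
To make the variability argument concrete, the sketch below evaluates the same kind of model with several hold-out splits that differ only in their seed, and then with 10-fold stratified cross-validation. Everything here is an assumption for illustration: the data is synthetic, the seeds are arbitrary, and `max_iter=1000` is added to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold-out: one score per random split
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

# K-fold: ten scores summarised by their mean and spread
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=StratifiedKFold(n_splits=10)
)

print("hold-out scores:", np.round(holdout_scores, 3))
print("10-fold mean:", round(float(cv_scores.mean()), 3))
print("10-fold std:", round(float(cv_scores.std()), 3))
```

The spread of the hold-out scores gives a feel for how much a single split can move the estimate, while the cross-validation summary reports that spread explicitly.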

### That's all!