# Splitting the data into k folds
Link to the project: drinkability of water 
[(fr)](https://drive.google.com/file/d/1FGNR1O8EKGVKpVB_PMb5Ty2LipYgoM8q/view?usp=sharing)
[(kaggle)](https://www.kaggle.com/artimule/drinking-water-probability)

In this notebook, we explore
- the new test CSV
- the new folds CSV

to see the number of samples per file and per fold.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load the data

In [2]:
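# Detect whether the notebook runs in Google Colab and mount Drive if so
try:
    from google.colab import drive

    IN_COLAB = True
    drive.mount('/content/drive')
except ImportError:
    # google.colab is only importable inside Colab
    IN_COLAB = False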

In [3]:
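path = "/content/drive/MyDrive/Best ML model ever/input/dri_wat_pot_folds.csv"

# Load the training data (with fold assignments) and shuffle the rows.
# The name df_train matches the cells below.
df_train = pd.read_csv(path)
df_train = df_train.sample(frac=1, random_state=1).reset_index(drop=True)
df_train.head()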

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability,kfold
0,6.799376,239.05768,7986.493239,10.36568,419.881175,373.232137,18.828594,43.440644,3.819985,1,3
1,5.498515,180.825114,21542.83003,6.707095,352.250711,419.512958,13.183432,68.90437,3.074815,1,2
2,7.386582,191.585566,26351.90377,8.426161,,505.187929,18.925674,72.649614,3.791373,1,4
3,6.783888,193.653581,13677.10644,5.171454,323.728663,477.854687,15.056064,,3.250022,0,1
4,7.137429,210.502749,17506.6088,7.304928,301.642004,304.239481,13.076007,64.230942,2.964181,1,2


In [4]:
if IN_COLAB:
    path_train = "/content/drive/MyDrive/Best ML model ever/input/dri_wat_pot_folds.csv"
    path_test = "/content/drive/MyDrive/Best ML model ever/input/dri_wat_pot_test.csv"
else:
    path_train = "../input/dri_wat_pot_folds.csv"
    path_test = "../input/dri_wat_pot_test.csv"

df_test = pd.read_csv(path_test)
df_test = df_test.sample(frac=1, random_state=1).reset_index(drop=True)
df_test.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,188.743562,19037.46264,6.034236,,388.065857,15.149068,78.499418,2.723651,1
1,,139.169744,33784.82623,9.64052,275.33196,499.428133,13.664485,70.368364,4.678745,1
2,,221.126963,27158.67773,7.875937,274.658788,472.09609,15.759417,35.710402,4.688955,0
3,6.280978,205.123123,25972.80375,8.417896,383.671459,456.543945,13.95471,32.799029,4.599432,1
4,4.07792,185.852326,9975.601334,10.758464,,307.877571,9.702581,64.361116,4.789052,1


## Check train and test dataframes

In [5]:
print(df_train.shape[0])
print(df_test.shape[0])

2620
656


The test file holds 656 of the 3276 samples (2620 + 656), which is approximately 20% of the data.
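As a quick check of that proportion, the arithmetic can be reproduced from the two row counts printed above. This is only a sketch: the counts are typed in by hand rather than read from the CSV files.

```python
# Row counts copied from the output of the cell above
n_train = 2620
n_test = 656

# Share of all samples that ended up in the test file
test_fraction = n_test / (n_train + n_test)
print(f"{test_fraction:.1%}")  # 20.0%
```

In the notebook itself, the same value would come from `df_test.shape[0] / (df_train.shape[0] + df_test.shape[0])`.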

In [7]:
print(df_train["Potability"].value_counts()/df_train.shape[0])
print(df_test["Potability"].value_counts()/df_test.shape[0])

0    0.609924
1    0.390076
Name: Potability, dtype: float64
0    0.609756
1    0.390244
Name: Potability, dtype: float64


Both files have nearly the same label proportions: about 61% of the samples are class 0 and 39% are class 1.

## Check folds

In [8]:
df_train['kfold'].value_counts()

0    524
2    524
4    524
1    524
3    524
Name: kfold, dtype: int64

All folds have 524 samples. This is expected, as the training data has 2620 samples and we made 5 folds (2620 / 5 = 524). So far, so good.
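The expected fold size follows directly from the training row count printed earlier (2620) and the number of folds (5). The sketch below works it out with plain Python. Handing any remainder to the first folds is an assumption for illustration, and it does not matter here because 2620 divides evenly by 5.

```python
n_train = 2620  # rows in the folds file, from the output above
n_folds = 5

# Split as evenly as possible: the first `remainder` folds get one extra row
base, remainder = divmod(n_train, n_folds)
fold_sizes = [base + 1 if i < remainder else base for i in range(n_folds)]
print(fold_sizes)  # [524, 524, 524, 524, 524]
```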

Now let's check the target distribution per fold.

In [9]:
df_train[df_train['kfold']==0]['Potability'].value_counts()

0    319
1    205
Name: Potability, dtype: int64

In [10]:
df_train[df_train['kfold']==1]['Potability'].value_counts()

0    319
1    205
Name: Potability, dtype: int64

In [11]:
df_train[df_train['kfold']==2]['Potability'].value_counts()

0    320
1    204
Name: Potability, dtype: int64

In [12]:
df_train[df_train['kfold']==3]['Potability'].value_counts()

0    320
1    204
Name: Potability, dtype: int64

In [13]:
df_train[df_train['kfold']==4]['Potability'].value_counts()

0    320
1    204
Name: Potability, dtype: int64
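The five cells above can be condensed into a single cross-tabulation. The sketch below rebuilds a stand-in frame named `demo` (a hypothetical name) from the counts shown in the outputs, so that it runs without the CSV file.

```python
import pandas as pd

# Per-fold class counts copied from the outputs above: {fold: (n_class_0, n_class_1)}
counts = {0: (319, 205), 1: (319, 205), 2: (320, 204), 3: (320, 204), 4: (320, 204)}

# Rebuild a minimal stand-in for df_train with only the two columns we need
rows = []
for fold, (n0, n1) in counts.items():
    rows += [(fold, 0)] * n0 + [(fold, 1)] * n1
demo = pd.DataFrame(rows, columns=["kfold", "Potability"])

# One table instead of five cells: rows are folds, columns are classes
table = pd.crosstab(demo["kfold"], demo["Potability"])
print(table)

# Row-normalised version: class proportions within each fold
proportions = pd.crosstab(demo["kfold"], demo["Potability"], normalize="index")
print(proportions.round(4))
```

On the real data, the equivalent call would be `pd.crosstab(df_train["kfold"], df_train["Potability"])`.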

In each fold, the distribution of targets is nearly identical: folds 0 and 1 hold 319
samples of class 0 and 205 of class 1, and folds 2 to 4 hold 320 and 204. This is what we
need. The counts only have to be similar, not exactly equal. When we build our models,
every fold will have essentially the same distribution of targets.
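For context, here is one way a stratified fold column like `kfold` could be produced. This is a hand-rolled sketch, not necessarily how `dri_wat_pot_folds.csv` was created, and it is not scikit-learn's `StratifiedKFold`. The helper `add_stratified_folds` and the frame `demo` are hypothetical names. It shuffles the rows, then deals the rows of each class out to the folds in turn, so per-fold class counts differ by at most one.

```python
import pandas as pd


def add_stratified_folds(df, target, n_folds=5, seed=1):
    """Return a shuffled copy of df with a 'kfold' column stratified on `target`."""
    out = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Within each class, number the rows 0, 1, 2, ... and deal them out round-robin
    out["kfold"] = out.groupby(target).cumcount() % n_folds
    return out


# Stand-in data with the same class totals as the training file (1598 and 1022)
demo = pd.DataFrame({"Potability": [0] * 1598 + [1] * 1022})
folds = add_stratified_folds(demo, "Potability")

# Per-fold class counts; each column should vary by at most one across folds
table = pd.crosstab(folds["kfold"], folds["Potability"])
print(table)
```

Unlike the file explored above, this simple scheme does not force the total fold sizes to be equal. It only balances each class across folds.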