<a href="https://colab.research.google.com/github/Shubham04689/colab_notebooks/blob/main/U1W1_01_Train_Test_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Learning Objectives

At the end of the experiment, you will be able to:

*  Apply a train-test split to a dataset


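### Dataset

This experiment uses the diabetes dataset from Kaggle (`akshaydattatraykhare/diabetes-dataset`), provided as `diabetes.csv`. It contains 768 records with eight input features and a binary `Outcome` target.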

## Train-Test Split Evaluation

The train-test split is a technique for assessing a machine learning algorithm's performance.


The procedure involves dividing a dataset into two subsets. The first subset, known as the training dataset, is used to fit the model. The second subset, known as the test dataset, is not used for training; instead, its input features are fed to the fitted model, and the resulting predictions are compared with the expected values.

*   Train Dataset: Used to fit the machine learning model.
*   Test Dataset: Used to evaluate the fitted machine learning model.

The goal is to estimate the machine learning model's performance on new data that was not used to train the model.



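The train-test procedure is appropriate when the dataset is large enough that both the train and test sets are suitable representations of the problem domain. A suitable representation means that there are enough records to cover all of the domain's common and uncommon cases, that is, the combinations of input variables observed in practice. Depending on the problem, this may require thousands, hundreds of thousands, or millions of examples.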

When the available dataset is small, the train-test procedure is not appropriate. Once the dataset is divided, the training set may not contain enough data for the model to learn an effective mapping of inputs to outputs, and the test set may not contain enough data to evaluate the model reliably. The estimated performance may then be overly optimistic or overly pessimistic.

If you don't have enough data, the k-fold cross-validation procedure is a good alternative model evaluation method.
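As a rough illustration of the k-fold alternative, the sketch below evaluates a classifier with 5-fold cross-validation. The synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and the number of folds are all assumptions made for the example, not part of this experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset standing in for a case with limited data (assumption).
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# The model choice is illustrative only.
model = LogisticRegression(max_iter=1000)

# With 5 folds, each sample is used for testing once and for training four times.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Averaging over the folds gives an estimate that depends less on one particular split, at the cost of training the model once per fold.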

Some models are so expensive to train that repeated evaluation, as required by procedures such as k-fold cross-validation, is impractical. Deep neural networks are one example. In such cases the train-test procedure is commonly used.


### How to set up

The main configuration parameter of the procedure is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or the test set. For example, a training set of size 0.67 (67%) means that the remaining 0.33 (33%) is assigned to the test set.

There is no universally optimal split percentage; choose one that meets the goals of your project. Common choices include:


*   Train: 80%, Test: 20%
*   Train: 67%, Test: 33%
*   Train: 50%, Test: 50%
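To make the arithmetic concrete, the sketch below converts a test fraction into row counts for a dataset of 768 rows (the size of the dataset used later in this notebook). The helper `split_sizes` is hypothetical, and rounding the test count up is an assumed convention chosen for illustration.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a dataset with n_samples rows."""
    # Assumed convention: round the test count up, give the remainder to training.
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_samples = 768
for test_fraction in (0.20, 0.33, 0.50):
    n_train, n_test = split_sizes(n_samples, test_fraction)
    print(f"test fraction {test_fraction:.2f}: train = {n_train}, test = {n_test}")
```

Whatever the fraction, the two counts always add up to the full dataset, so every row lands in exactly one of the two sets.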


Now that we are familiar with the train-test split model evaluation procedure, let’s look at how we can use this procedure in Python.







<br><br>
<center>

![](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/1_train-test-split_0.jpg)


</center>

In [1]:

!pip install kaggle

# Upload your kaggle.json file
from google.colab import files
files.upload()

# Create the .kaggle directory and move the kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the diabetes dataset from Kaggle
!kaggle datasets download -d akshaydattatraykhare/diabetes-dataset

# Unzip the downloaded file
!unzip diabetes-dataset.zip




Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset
License(s): CC0-1.0
Downloading diabetes-dataset.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 20.7MB/s]
Archive:  diabetes-dataset.zip
  inflating: diabetes.csv            


In [2]:
import pandas as pd

In [3]:
dataframe  = pd.read_csv('diabetes.csv')
dataframe.head()

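|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |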


##### Train-test split without the sklearn library

In [4]:
# Shuffle the dataset
shuffle_df = dataframe.sample(frac=1)

# Define a size for your train set
train_size = int(0.7 * len(dataframe))

# Split your dataset
train_set = shuffle_df[:train_size]

test_set = shuffle_df[train_size:]

In [5]:
print("train data :",len(train_set), ", test data :",len(test_set))

train data : 537 , test data : 231
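The shuffle-and-slice steps above can be wrapped in a reusable function. The sketch below is one possible way to do that; the function name `manual_train_test_split`, its `seed` parameter, and the toy DataFrame are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
import pandas as pd

def manual_train_test_split(df, train_fraction=0.7, seed=None):
    """Shuffle the rows of df and cut them into a train set and a test set."""
    # frac=1 returns every row in random order; seed makes the shuffle repeatable.
    shuffled = df.sample(frac=1, random_state=seed)
    train_size = int(train_fraction * len(df))
    return shuffled.iloc[:train_size], shuffled.iloc[train_size:]

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({"x": np.arange(10), "Outcome": [0, 1] * 5})

toy_train, toy_test = manual_train_test_split(toy, train_fraction=0.7, seed=42)
print("train rows:", len(toy_train), ", test rows:", len(toy_test))

# Check that no row appears in both sets.
print("disjoint:", set(toy_train.index).isdisjoint(toy_test.index))
```

Passing a fixed `seed` is a design choice that makes the split reproducible between runs; leaving it as `None` gives a different shuffle each time, as in the cell above.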


In [7]:
train_set.groupby(['Outcome']).count()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 350 | 350 | 350 | 350 | 350 | 350 | 350 | 350 |
| 1 | 187 | 187 | 187 | 187 | 187 | 187 | 187 | 187 |


In [8]:
test_set.groupby(['Outcome']).count()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| 1 | 81 | 81 | 81 | 81 | 81 | 81 | 81 | 81 |
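In the counts above, class 0 accounts for 350 of 537 training rows and 150 of 231 test rows, roughly 65% in both sets. Raw counts are hard to compare across sets of different sizes, so it can help to look at proportions instead. The sketch below shows one way to do this with `value_counts(normalize=True)`; the two small label series are made-up stand-ins for the `Outcome` column.

```python
import pandas as pd

# Toy labels standing in for the Outcome column (illustrative values only).
train_labels = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="Outcome")
test_labels = pd.Series([0, 1, 0, 0], name="Outcome")

# normalize=True turns raw counts into fractions that sum to 1 within each set.
summary = pd.DataFrame({
    "train": train_labels.value_counts(normalize=True),
    "test": test_labels.value_counts(normalize=True),
}).sort_index()

print(summary)
```

If the proportions in the two columns differ widely, the test set may not reflect the class balance the model was trained on.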


##### Train-test split with the sklearn library

In [9]:
features = dataframe.iloc[:,:8]
target = dataframe['Outcome']

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Split the data into train and test sets in a 70:30 ratio,
# i.e. 70% of the data is the train set and 30% is the test set
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.3)

In [12]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((537, 8), (231, 8), (537,), (231,))
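Because `train_test_split` shuffles the rows, repeated runs of the cell above generally place different rows in each set. A minimal sketch of how the `random_state` parameter can make the split repeatable is shown below; the toy arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the features and target (illustrative values only).
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# The same random_state produces the same shuffle, and therefore the same split.
first = train_test_split(X, y, test_size=0.3, random_state=42)
second = train_test_split(X, y, test_size=0.3, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("identical splits:", same)
print("train rows:", len(first[0]), ", test rows:", len(first[1]))
```

Fixing the seed is useful when results need to be compared across runs or shared with others.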

We can divide the dataset so that 70% is used to train the model and 30% is used to evaluate it.

The next step is to fit a model on the training data, use it to make predictions on the test data, and evaluate those predictions with the accuracy metric.
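The fit, predict, and evaluate steps could look roughly like the sketch below. It is not the model used in this course: the synthetic data from `make_classification` stands in for the diabetes dataset so that the example runs on its own, and `LogisticRegression` is an assumed model choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data with 8 features (stand-in for the real dataset).
X, y = make_classification(n_samples=768, n_features=8, random_state=1)

# 70:30 split, with a fixed seed so the result is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# The model choice is an assumption made for this example.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)  # learn from the training set only

y_pred = model.predict(x_test)  # predict labels for rows the model has not seen
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
print(f"test accuracy: {accuracy:.3f}")
```

The test set is touched only at prediction time, which is what makes the accuracy an estimate of performance on new data.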

To summarise, you now understand what it means to divide data into two sets, namely the train and test sets. In future sessions, you will also learn about the various methods available in the sklearn library for splitting data into train and test sets.