
#Splitting Machine Learning Data

Using data that predicts how students will perform in their national exams based on their mock exams, we will explore how to split data correctly and make it ready for machine learning.

We need to split the dataset into training and test datasets so that we can train the model on one part and check its predictions on the other.

The dataset we are going to use comprises exam data for 1000 students from both public and private schools in Kenya. 50% of this data is from public schools and the other 50% is from private schools. We need to maintain this proportion when creating our train and test datasets.

[Download Dataset](https://drive.google.com/file/d/12OGVlkFkLwycegmG5zkdDfzoxCJ3qU_k/view?usp=sharing)

First we will load the data and the required libraries.


In [14]:
import pandas as pd
import numpy as np

In [15]:
#Load the data
school = pd.read_csv("/content/student_exam_data.csv")
#inspect the first 5 items in the data
school.head()

Unnamed: 0,mock_result,school_type,national_result
0,27,PUBLIC,55
1,60,PRIVATE,35
2,57,PUBLIC,39
3,52,PUBLIC,39
4,44,PUBLIC,63


There are two recommended methods to split the data while keeping the public/private proportion intact.


1.   Using the pandas groupby function
2.   Using the Scikit-Learn library



##1.Using Stratified Technique

Using the stratified technique, we want to split the dataset so that 70% of it becomes the train set and 30% becomes the test set. Furthermore, the proportion of public and private schools should be the same in both the train and test datasets. For example, in the train dataset we should have 350 rows from public schools and 350 rows from private schools. The same goes for the test dataset, where we expect 150 rows from public schools and 150 rows from private schools.
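
The expected group sizes can be worked out before any sampling is done. The sketch below is only an illustration: `demo` is a hypothetical stand-in for the exam data (500 PUBLIC and 500 PRIVATE rows), and `train_frac` is a name introduced here, not something from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the exam data: 500 PUBLIC and 500 PRIVATE rows.
demo = pd.DataFrame({"school_type": ["PUBLIC"] * 500 + ["PRIVATE"] * 500})

train_frac = 0.7  # assumed share of each group that goes to the train set

# Size of each school_type group in the full data.
group_sizes = demo["school_type"].value_counts()

# Rows each group should contribute to the train set, and what is left for test.
expected_train = (group_sizes * train_frac).round().astype(int)
expected_test = group_sizes - expected_train

print(expected_train.to_dict())
print(expected_test.to_dict())
```

Comparing these numbers with the `value_counts()` output of the actual split is a quick way to confirm that the stratification worked.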

In [16]:
# Stratified train sample
train_dataset = school.groupby('school_type', group_keys=False).apply(lambda grouped_subset : 
                                                                         grouped_subset.sample(frac=0.7))
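
The cell above draws a different sample every time it runs. If a repeatable split is wanted, one option is to pass a seed. The sketch below assumes pandas 1.1 or newer, where `GroupBy.sample` is available, and uses a hypothetical `demo` frame in place of the real exam file; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the exam data (the real CSV is not needed here).
demo = pd.DataFrame({
    "mock_result": range(1000),
    "school_type": ["PUBLIC", "PRIVATE"] * 500,
})

# Sample 70% of each school_type group; random_state makes the draw repeatable.
train_demo = demo.groupby("school_type").sample(frac=0.7, random_state=42)
train_again = demo.groupby("school_type").sample(frac=0.7, random_state=42)

# Each group contributes 70% of its 500 rows.
print(train_demo["school_type"].value_counts())

# The two draws pick exactly the same rows.
print(train_demo.index.equals(train_again.index))
```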

In [17]:
# inspect the stratified train dataset
train_dataset

Unnamed: 0,mock_result,school_type,national_result
967,45,PRIVATE,60
971,64,PRIVATE,61
441,62,PRIVATE,68
835,55,PRIVATE,49
798,50,PRIVATE,65
...,...,...,...
389,61,PUBLIC,43
101,61,PUBLIC,39
320,67,PUBLIC,47
321,55,PUBLIC,57


In [18]:
train_dataset.shape

(700, 3)

As we can see, our training set now has 700 rows and 3 columns.

Now we will build the test set. We can use the drop function to remove the rows selected for the train set from the original data; the rows that remain form the test set.

In [19]:
# Stratified test sample
test_dataset = school.drop(train_dataset.index)

# Preview the stratified test dataset
test_dataset

Unnamed: 0,mock_result,school_type,national_result
4,44,PUBLIC,63
6,40,PUBLIC,43
7,45,PUBLIC,47
8,43,PUBLIC,83
12,62,PRIVATE,18
...,...,...,...
990,50,PRIVATE,63
992,71,PRIVATE,68
994,18,PRIVATE,70
997,30,PRIVATE,41
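
Building the test set with `drop` relies on the index labels being unique, because rows are removed by label. The following sketch, on a small hypothetical `demo` frame, shows one way to check that the two sets do not overlap and together cover the full data.

```python
import pandas as pd

# Small hypothetical frame: 5 PUBLIC and 5 PRIVATE rows.
demo = pd.DataFrame({
    "mock_result": range(10),
    "school_type": ["PUBLIC", "PRIVATE"] * 5,
})

# Take 60% of each group for training, then drop those rows to get the test set.
train_demo = demo.groupby("school_type").sample(frac=0.6, random_state=0)
test_demo = demo.drop(train_demo.index)

# drop() removes rows by label, so the index must not contain duplicates.
print(demo.index.is_unique)

# No row should appear in both sets, and no row should be lost.
overlap = train_demo.index.intersection(test_demo.index)
print(len(overlap))
print(len(train_demo) + len(test_demo) == len(demo))
```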


In [20]:
# Print out the number of private vs public schools in both the train and test datasets
test_count=test_dataset['school_type'].value_counts()
train_count=train_dataset['school_type'].value_counts()

print(train_count)
print('*************************************************')
print(test_count)

PRIVATE    350
PUBLIC     350
Name: school_type, dtype: int64
*************************************************
PRIVATE    150
PUBLIC     150
Name: school_type, dtype: int64
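
Raw counts are easy to compare when the groups are the same size. When they are not, comparing proportions can be clearer. The sketch below assumes a hypothetical 60/40 mix of school types, which is not the mix in the exam data, and places the proportions for the full, train and test sets side by side.

```python
import pandas as pd

# Hypothetical imbalanced data: 600 PUBLIC and 400 PRIVATE rows.
demo = pd.DataFrame({"school_type": ["PUBLIC"] * 600 + ["PRIVATE"] * 400})

# Stratified 70/30 split using the same groupby-and-drop approach.
train_demo = demo.groupby("school_type").sample(frac=0.7, random_state=1)
test_demo = demo.drop(train_demo.index)

# normalize=True returns shares instead of counts.
proportions = pd.DataFrame({
    "full": demo["school_type"].value_counts(normalize=True),
    "train": train_demo["school_type"].value_counts(normalize=True),
    "test": test_demo["school_type"].value_counts(normalize=True),
})
print(proportions)
```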


##2.Using Sklearn/Scikit-Learn

Scikit-Learn (imported as sklearn) is a free, open-source machine learning library for the Python programming language.

It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

To split datasets we will make use of the train_test_split function.

The documentation for train_test_split is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [23]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split



```
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None,
 random_state=None, shuffle=True, stratify=None)
```



***test_size - float or int, default=None***

If float, it should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

***train_size - float or int, default=None***

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
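
The three ways of setting `test_size` described above can be tried on a toy array. This is a minimal sketch: `rows` is a hypothetical array of 1000 row numbers standing in for the exam data, and `random_state=0` is an arbitrary seed added so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for 1000 rows of data.
rows = np.arange(1000)

# test_size as a float: 30% of the rows go to the test set.
train_a, test_a = train_test_split(rows, test_size=0.3, random_state=0)

# test_size as an int: exactly 300 rows go to the test set.
train_b, test_b = train_test_split(rows, test_size=300, random_state=0)

# Neither size given: the test set defaults to 25% of the rows.
train_c, test_c = train_test_split(rows, random_state=0)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
print(len(train_c), len(test_c))
```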

For our example, we'll also use a third argument called stratify, which keeps the class proportions the same in both parts of the split.

From the documentation:

***stratify - array-like, default=None***


If not None, data is split in a stratified fashion, using this as the class labels.
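
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below assumes a hypothetical 900/100 mix of labels, which is not the mix in the exam data, and `count` is a small helper introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 900 PUBLIC and 100 PRIVATE.
labels = np.array(["PUBLIC"] * 900 + ["PRIVATE"] * 100)

# Without stratify, the class mix of the test set depends on the random draw.
_, plain_test = train_test_split(labels, test_size=0.3, random_state=0)

# With stratify, each class keeps its 90/10 share in both splits.
strat_train, strat_test = train_test_split(
    labels, test_size=0.3, stratify=labels, random_state=0
)

def count(values):
    """Return {label: count} for an array of labels (illustrative helper)."""
    names, counts = np.unique(values, return_counts=True)
    return dict(zip(names.tolist(), counts.tolist()))

print("unstratified test:", count(plain_test))
print("stratified train: ", count(strat_train))
print("stratified test:  ", count(strat_test))
```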

In [25]:
# Split our dataset into train_data and test_data using sklearn's train_test_split function
train_data, test_data = train_test_split(school, test_size=0.3, stratify=school['school_type'])

# Preview the train dataset
print(train_data)

# Preview the test dataset
print(test_data)

# Count private vs public schools in both the train and test datasets.
# Store the new counts so we print this split, not the counts from the groupby method.
train_count = train_data['school_type'].value_counts()
test_count = test_data['school_type'].value_counts()

print(train_count)
print('*************************************************')
print(test_count)

     mock_result school_type  national_result
654           43      PUBLIC               58
715           79     PRIVATE               59
367           58      PUBLIC               71
770           56      PUBLIC               48
210           51      PUBLIC               49
..           ...         ...              ...
394           50      PUBLIC               54
954           72     PRIVATE               53
365           46      PUBLIC               45
347           63      PUBLIC               56
300           57      PUBLIC               52

[700 rows x 3 columns]
     mock_result school_type  national_result
656           41     PRIVATE               65
908           41     PRIVATE               72
531           58      PUBLIC               78
659           67      PUBLIC               71
829           49     PRIVATE                3
..           ...         ...              ...
786           59     PRIVATE               62
441           62     PRIVATE               68
222       

Of the two methods, the Scikit-Learn approach is the one more commonly recommended in the data science community.