## Crafting Training Sets

#### Overview
As mentioned in the intro, the first step in our predictive analytics work is to identify our test and training data sets. In this section, we will define key concepts and run you through a few exercises on how to use sklearn to achieve this.

#### Main Resources
> Understanding your variables.

First, you must analyze your variables and determine which variable you want your model to predict. We will refer to it as the **dependent variable**.

Next, you must establish which other variables will help you predict your dependent variable. These will be referred to as **independent variables**.

It is important to perform exploratory data analysis to identify whether there is a relationship between your dependent and independent variables. This does not mean that your independent variable causes the dependent one, just that they are connected. For example, if we have a dataset on students, we may find variables such as student height, mock exam results, and national exam results. Plotting mock exam results against national exam results, you will see the points roughly take the shape of a line, which makes intuitive sense: students who do poorly in the mock are likely not to be ready for the national exam, and vice versa.

Plotting height against national exam results will probably lead to a much more scattered plot, indicating that there isn't a strong relationship between height and academic performance.

Therefore, as we create our training and testing sets to predict national exam performance, we will want to include mock exam performance, but not height.
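As a rough illustration of the check described above, the sketch below computes the correlation of each candidate variable with the national exam result. The data is synthetic and generated only for this example (it is not the course dataset), and the column name `height_cm` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data only (not the course dataset).
rng = np.random.default_rng(seed=0)
n_students = 200
mock = rng.uniform(20, 90, size=n_students)

students = pd.DataFrame({
    "mock_result": mock,
    # The national result loosely follows the mock result, plus noise.
    "national_result": mock + rng.normal(0, 8, size=n_students),
    # Height is generated independently of any exam score.
    "height_cm": rng.normal(165, 10, size=n_students),
})

# Pearson correlation of each candidate variable with the dependent variable.
correlations = students.corr()["national_result"].drop("national_result")
print(correlations)
```

A correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests the variable is unlikely to help a linear model. A scatter plot remains the better first look, since correlation only captures linear patterns.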

> Why do we need two sets?

This is where the machine learning actually happens. The training set includes data on your dependent variable, alongside all the independent variables you choose to include. Your supervised learning algorithm will go through this data set and, for a given row, try to predict what the dependent variable should be given the independent ones, then adjust its understanding of the process based on how good its prediction was. Over time, your algorithm will get really good at recognizing the patterns in your data set.

Why do we need the test set then? Well, the test set is not used for training, but to validate how well the model you've created predicts the desired dependent variable on data it has not seen.

Later this week, we will explore ethical considerations when creating train and test datasets. Remember this though: your predictive model is only as good as the data you've used to train it. There have been many challenges with training data; the reading and exercises below will run you through ways to deal with them.
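To make the two roles concrete, here is a minimal sketch that fits a model on a training set and then scores it on a held-out test set. It assumes synthetic data (the slope, intercept, and noise level are invented for illustration) and uses scikit-learn's `LinearRegression` as one possible supervised learning algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data only: national result as a noisy
# linear function of the mock result.
rng = np.random.default_rng(seed=1)
mock = rng.uniform(20, 90, size=(500, 1))
national = 0.9 * mock[:, 0] + 5 + rng.normal(0, 6, size=500)

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    mock, national, test_size=0.3, random_state=1
)

model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the training rows versus the unseen test rows.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on training set: {train_r2:.3f}")
print(f"R^2 on test set:     {test_r2:.3f}")
```

If the test score is much lower than the training score, the model has likely memorised the training data rather than learned a pattern that generalises.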


#### Additional learning links 

Here are additional learning materials for this lesson:

1. [How Do You Know You Have Enough Training Data?](https://towardsdatascience.com/how-do-you-know-you-have-enough-training-data-ad9b1fd679ee)

2. [Google ‘fixed’ its racist algorithm by removing gorillas from its image-labeling tech](https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai)


## 2. Practice

Our main goal here is to predict how a student will perform in the national exam using their mock exam scores. There are a few steps we need to take to achieve this.

First, we need to split the dataset into training and test datasets so that we can train the model to predict our desired outcome.

After splitting the dataset, we are going to employ a method for training the model.

The following example is split into two parts: the first covers how to split the dataset into train and test datasets, and the second covers how to train a model on the data using linear regression.


In this example, we are going to learn how to split a dataset into train and test sets so that we can start training our model. We will first show a naive way of splitting a dataset then continue to show different ways of efficiently splitting the dataset.

The dataset we are going to use comprises exam data for 1000 students from both public and private schools in Kenya. 50% of this data is from public schools and the other 50% is from private schools. We need to maintain this proportion when creating our sample datasets.

**Naive splitting:**

- Start with a simple table containing one independent variable (the mock exam column) and one dependent variable (the national exam column), alongside the school type.
- Use simple splits to create two datasets, one for train and one for test.

In [5]:
# Import the relevant libraries and packages

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [6]:
# Load the data
data = pd.read_csv("./data/student_exam_data.csv")

# Display the first 5 rows
data.head()

| Unnamed: 0 | mock_result | school_type | national_result |
|---|---|---|---|
| 0 | 27 | PUBLIC | 55 |
| 1 | 60 | PRIVATE | 35 |
| 2 | 57 | PUBLIC | 39 |
| 3 | 52 | PUBLIC | 39 |
| 4 | 44 | PUBLIC | 63 |


In [7]:
# Split the dataset into train and test sets.
# We take the first 700 entries of our dataset as train and the remaining 300 entries as test.

train = data[:700]

# Drop every index of the train data from the main dataset and store the remaining rows in a variable called test
test = data.drop(train.index)

# Confirm that the train and test datasets have our desired lengths
print("train:" + str(len(train)))
print("test:"+ str(len(test)))

train:700
test:300


**Analyzing the sets**:
How similar are the training and test datasets?

In [8]:
# Let's analyse the training and test datasets and see if they have the right proportions.
# Ideally, we want both our training and test datasets to have a 50-50 apportionment of private and public schools

# Check the apportionment of Private and Public schools in the train data set
train_count=train['school_type'].value_counts()

# Check the apportionment of Private and Public in the test data set
test_count=test['school_type'].value_counts()

# Print out the apportionment of private and public schools in both train and test dataset
print(train_count)
print('*************************')
print(test_count)

school_type
PUBLIC     450
PRIVATE    250
Name: count, dtype: int64
*************************
school_type
PRIVATE    250
PUBLIC      50
Name: count, dtype: int64


As you can see, the train dataset contains 450 public school students and 250 private school students. This translates to roughly 64% and 36% respectively, which is not the proportion we are aiming for.

Similarly, the test dataset contains 250 private school students and 50 public school students. This translates to roughly 83% and 17% respectively. Again, this is not the proportion we are aiming for.

In conclusion, this differs greatly from our goal, which is to have 50% public school students and 50% private school students in both the train and test datasets.

This is why we call this a naive way of splitting the dataset: it does not reflect the population's initial proportions.
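The shares in each split can also be computed directly rather than by hand. The sketch below assumes a hypothetical stand-in DataFrame built to have the same `school_type` counts as the output above (it does not load the course CSV), and uses `value_counts(normalize=True)` to return fractions instead of raw counts.

```python
import pandas as pd

# Hypothetical stand-in with the same counts as the naive split above:
# the first 700 rows hold 450 PUBLIC / 250 PRIVATE,
# the last 300 rows hold 50 PUBLIC / 250 PRIVATE.
school_type = (
    ["PUBLIC"] * 450 + ["PRIVATE"] * 250
    + ["PUBLIC"] * 50 + ["PRIVATE"] * 250
)
data = pd.DataFrame({"school_type": school_type})

# Same naive split as before: first 700 rows for train, the rest for test.
train = data[:700]
test = data.drop(train.index)

# normalize=True returns each category's share of the rows.
train_share = train["school_type"].value_counts(normalize=True)
test_share = test["school_type"].value_counts(normalize=True)

print(train_share.round(3))
print("*************************")
print(test_share.round(3))
```

Comparing these shares against the 50-50 split in the full dataset gives a quick check on whether a split is representative.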

To achieve the proportion we want, we will employ one of the sampling techniques we covered in module 1.

**Sampling**:

Recalling the sampling techniques from module 1, let's do some stratified sampling and check that our train and test sets are now similar to each other in their representation of public versus private school students.

In [9]:
# Using stratified sampling, we want to split the dataset so that 70% of it becomes the train set and 30% the test set.
# Furthermore, the proportion of public and private schools should be equal in both the train and test datasets.
# For example, the train dataset should have 350 public and 350 private school entries,
# and the test dataset should have 150 public and 150 private school entries.

# Stratified train sample
train_strat_datset = data.groupby('school_type', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.7))

# preview the stratified train dataset
train_strat_datset

# Stratified test sample
test_strat_dataset = data.drop(train_strat_datset.index)

# Preview the stratified test dataset
test_strat_dataset

# Print out the proportion of private vs public schools in both the train and test datasets
test_strat_count=test_strat_dataset['school_type'].value_counts()
train_strat_count=train_strat_datset['school_type'].value_counts()

print(train_strat_count)
print('*************************************************')
print(test_strat_count)


school_type
PRIVATE    350
PUBLIC     350
Name: count, dtype: int64
*************************************************
school_type
PRIVATE    150
PUBLIC     150
Name: count, dtype: int64


