# Train-Test Data Sets

To check whether your models are over-fitting, you should use Train-Test splitting on your data sets, so that a model is judged on data it never saw while it was being built.  

The most common split is 80-20; that is, put 80% of the data in the training data set and the rest in the testing data set.  

I realize that since we haven't discussed model building yet, this won't 100% make sense right now. But, here's the idea.  You take 80% of your full data set and split that off into the Training data.  You use this to build your model.


Building the model can mean things like picking which variables are the important ones that you want to use for the predictions, or deciding which modeling method is best.  In this class, we plan to discuss logistic regression and decision trees.  With logistic regression, there is also a probability threshold that normally defaults to 50% but can be adjusted; that, too, is part of what you'd use the training data to decide.  
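To make the probability threshold idea concrete, here is a minimal sketch. The probabilities are made-up numbers standing in for the output of some fitted model, and the helper name `classify` is hypothetical. Whether a probability exactly equal to the threshold counts as a 1 is also an assumption here; different tools may handle that edge case differently.

```python
import numpy as np

# Made-up predicted probabilities, standing in for the output of a fitted model.
probs = np.array([0.10, 0.35, 0.50, 0.62, 0.90])

def classify(probabilities, threshold=0.5):
    """Label an observation 1 when its probability is at least `threshold`, else 0."""
    return (probabilities >= threshold).astype(int)

default_labels = classify(probs)       # uses the usual 50% threshold
strict_labels = classify(probs, 0.7)   # a stricter threshold predicts fewer 1s

print(default_labels)
print(strict_labels)
```

Raising the threshold makes the model more reluctant to predict a 1; which threshold works best is one of the decisions you would make using the training data.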

Then, once you've decided which modeling method, parameters, etc., you want to use, you use your model to make predictions for the testing data, the remaining 20% that was not used in the model building.  You compare the predictions from your model to the real results in the Testing data.  
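Here is a rough sketch of what "compare the predictions to the real results" can look like. Both arrays are made-up values, not output from any model in this notebook. Accuracy and a cross-tabulation are just two common ways to summarize the comparison.

```python
import numpy as np
import pandas as pd

# Made-up values: the real classes in a testing set and a model's predictions for them.
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of testing rows where the prediction matches the real class.
accuracy = (actual == predicted).mean()

# Counts of each (actual, predicted) combination.
table = pd.crosstab(
    pd.Series(actual, name="Actual"),
    pd.Series(predicted, name="Predicted"),
)

print(accuracy)
print(table)
```

The off-diagonal cells of the table count the two kinds of mistakes: predicting 1 when the real class is 0, and predicting 0 when the real class is 1.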

## Train-Test-Validate

There's also Train-Test-Validate, which we won't do in this class.  Since we're not going to do it, and I'm not really supposed to cover it, I'll say only this.

|Training Data | Testing Data | Validating Data|
|:--|:--|:--|
|Build lots of competing models, possibly using a variety of different methods and/or parameters. |Run this data through all the models built using the training data.  Pick the model that performs the best according to the metrics you care the most about. |Run this data through the final model you picked, look at the same metrics, and make sure they still look good. |


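I just wanted you to be aware that a higher-level, more sophisticated course focused on building elaborate predictive models would use train-test-validate data splitting instead of just train-test data splitting.  If you ever encounter this, in a future class or in your career, you will have at least heard of it.  Be aware, too, that many textbooks and libraries swap the last two names: they call the data used to pick among competing models the validation set, and the data held back for the final check the test set.  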
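If you are curious, a three-way split can be sketched like this. The 60-20-20 proportions, the stand-in data frame `df`, and the variable names are all assumptions made for illustration; the names of the three pieces follow the table above.

```python
import numpy as np
import pandas as pd

# Stand-in data: 100 rows with a feature and a 0/1 class (made up for illustration).
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

rng = np.random.default_rng(1)        # seeded so the shuffle is repeatable
shuffled = rng.permutation(len(df))   # every row position exactly once, in random order

n_train = round(0.6 * len(df))        # assumed proportions: 60% / 20% / 20%
n_test = round(0.2 * len(df))

train = df.iloc[shuffled[:n_train]]                   # build the competing models
test = df.iloc[shuffled[n_train:n_train + n_test]]    # pick the best model
validate = df.iloc[shuffled[n_train + n_test:]]       # final check of the chosen model

print(len(train), len(test), len(validate))
```

Because the three pieces are slices of one shuffled list of row positions, no row can land in more than one of them.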


Now, let's look at a data set that we will eventually use in our discussion of logistic regression.


We're going to use `pandas` instead of `datascience` because in this case, it's much simpler.  At the end of this notebook, I'll give you some code that you can copy and change to work on any .csv file.  But, before I do that, I am going to explain the process to you, so if you had to replicate this from scratch, you could.  


In [53]:
from datascience import *
import numpy as np
import pandas as pd 

In [54]:
divorce = pd.read_csv("divorce2.csv")

len(divorce)


170

In [55]:
round(0.8* len(divorce))

136

In [56]:
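np.random.seed(1) ## This ensures that we all get the same values when we run this.
                  ## We should change the seed every time we start a new analysis,
                  ## or potentially omit it.

## Note: np.random.choice samples WITH replacement by default, so some row numbers
## in the result can appear more than once.  Passing replace=False picks each row at most once.
train_rows = np.random.choice(np.arange(len(divorce)), round(0.8*len(divorce)))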

train_rows

array([ 37, 140,  72, 137, 133,  79, 144, 129,  71, 134,  25,  20, 101,
       146, 139, 156, 157, 142,  50,  68,  96,  86, 141, 137,   7,  63,
        61,  22,  57,   1, 128,  60,   8, 141, 115, 121,  30,  71, 131,
       149,  49,  57,   3,  24,  43,  76,  26,  52,  80, 109, 115,  41,
        15,  64,  25, 111, 135,  26, 153, 104,  22,   9, 126,  23, 125,
       100, 155, 165,  57,  83, 166, 136,  32, 162,  10,  23, 143,  87,
        25,  92,  74,  46, 160, 151,  65, 113,  77,   3, 128,   6,  52,
         2,  76, 149,   7,  77,  75,  76,  43,  20,  30,  36, 103,   7,
        45,  57,  96,  13,  10,  23, 124,  81, 135, 121, 152, 148, 160,
       140,  94,  60, 152,  82, 115,  97, 130, 103,  98,  10,  96,  82,
        71,  54,  15, 133, 145,  20])
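Notice that some row numbers (137, 141, and 57, for example) show up more than once in the array above. That happens because `np.random.choice` draws with replacement unless told otherwise. The sketch below compares the default with `replace=False`, using the same row count as the divorce data; the variable names are made up for this illustration.

```python
import numpy as np

N = 170                      # number of rows, as in the divorce data
train_N = round(0.8 * N)     # 136 rows for training

np.random.seed(1)
with_repl = np.random.choice(np.arange(N), train_N)                       # default: with replacement

np.random.seed(1)
without_repl = np.random.choice(np.arange(N), train_N, replace=False)     # each row at most once

n_unique_with = len(np.unique(with_repl))
n_unique_without = len(np.unique(without_repl))

print(n_unique_with, n_unique_without)
```

With `replace=False`, all 136 selected rows are distinct, so the remaining rows make up exactly 20% of the data. With the default, the training data contains repeated rows and the testing data ends up larger than 20%.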

In [57]:
train_rows.tolist()

[37,
 140,
 72,
 137,
 133,
 79,
 144,
 129,
 71,
 134,
 25,
 20,
 101,
 146,
 139,
 156,
 157,
 142,
 50,
 68,
 96,
 86,
 141,
 137,
 7,
 63,
 61,
 22,
 57,
 1,
 128,
 60,
 8,
 141,
 115,
 121,
 30,
 71,
 131,
 149,
 49,
 57,
 3,
 24,
 43,
 76,
 26,
 52,
 80,
 109,
 115,
 41,
 15,
 64,
 25,
 111,
 135,
 26,
 153,
 104,
 22,
 9,
 126,
 23,
 125,
 100,
 155,
 165,
 57,
 83,
 166,
 136,
 32,
 162,
 10,
 23,
 143,
 87,
 25,
 92,
 74,
 46,
 160,
 151,
 65,
 113,
 77,
 3,
 128,
 6,
 52,
 2,
 76,
 149,
 7,
 77,
 75,
 76,
 43,
 20,
 30,
 36,
 103,
 7,
 45,
 57,
 96,
 13,
 10,
 23,
 124,
 81,
 135,
 121,
 152,
 148,
 160,
 140,
 94,
 60,
 152,
 82,
 115,
 97,
 130,
 103,
 98,
 10,
 96,
 82,
 71,
 54,
 15,
 133,
 145,
 20]

In [58]:
divorce_train =  divorce.iloc[train_rows.tolist()]
divorce_train

Unnamed: 0,Atr1,Atr2,Atr3,Atr4,Atr5,Atr6,Atr7,Atr8,Atr9,Atr10,...,Atr51,Atr52,Atr53,Atr54,Class,Positive Sum,Negative Sum,Intercept,Positive Scale,Negative Scale
37,3,3,2,3,3,1,1,3,3,3,...,3,3,4,4,1,89,86,1,2.966667,3.583333
140,0,2,0,0,0,1,0,0,0,0,...,1,3,2,2,0,7,24,1,0.233333,1.000000
72,3,3,3,3,3,1,1,3,3,3,...,3,3,3,3,1,86,69,1,2.866667,2.875000
137,0,0,1,0,0,0,0,1,1,0,...,3,3,3,1,0,4,28,1,0.133333,1.166667
133,1,2,0,0,0,0,0,0,0,0,...,2,2,1,0,0,10,22,1,0.333333,0.916667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54,4,3,3,2,4,1,0,3,3,2,...,3,4,3,4,1,89,80,1,2.966667,3.333333
15,4,4,3,2,4,0,0,4,3,2,...,4,4,4,4,1,97,92,1,3.233333,3.833333
133,1,2,0,0,0,0,0,0,0,0,...,2,2,1,0,0,10,22,1,0.333333,0.916667
145,0,0,0,0,0,0,0,0,0,0,...,3,0,1,0,0,0,34,1,0.000000,1.416667


In [59]:
divorce_test = divorce.drop(index = train_rows.tolist())

divorce_test 

Unnamed: 0,Atr1,Atr2,Atr3,Atr4,Atr5,Atr6,Atr7,Atr8,Atr9,Atr10,...,Atr51,Atr52,Atr53,Atr54,Class,Positive Sum,Negative Sum,Intercept,Positive Scale,Negative Scale
0,2,2,4,1,0,0,0,0,0,0,...,2,3,2,1,1,15,44,1,0.500000,1.833333
4,2,2,1,1,1,1,0,0,0,0,...,2,2,1,0,1,25,29,1,0.833333,1.208333
5,0,0,1,0,0,2,0,0,0,1,...,1,1,2,0,1,17,34,1,0.566667,1.416667
11,4,4,4,3,4,0,0,4,4,3,...,4,4,4,4,1,107,92,1,3.566667,3.833333
12,3,4,3,4,3,0,1,4,3,4,...,4,4,4,4,1,99,92,1,3.300000,3.833333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,2,0,1,0,0,0,0,0,0,2,...,2,1,0,0,0,15,22,1,0.500000,0.916667
164,2,1,1,0,0,2,0,0,0,2,...,3,1,1,1,0,15,21,1,0.500000,0.875000
167,1,1,0,0,0,0,0,0,0,1,...,1,3,0,0,0,11,27,1,0.366667,1.125000
168,0,0,0,0,0,0,0,0,0,0,...,2,4,3,1,0,1,37,1,0.033333,1.541667
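It can be worth checking a split before you rely on it. The following is a minimal sketch with a hypothetical helper, `check_split`, run on a stand-in data frame of 170 rows rather than the real divorce data. It reports how many rows are in each piece, how many training rows are repeats, and whether any row is in both pieces or in neither.

```python
import numpy as np
import pandas as pd

def check_split(full, train, test):
    """Summarize a train/test split of `full` (assumes `full` has unique row labels)."""
    train_labels = set(train.index)
    test_labels = set(test.index)
    return {
        "train_rows": len(train),
        "test_rows": len(test),
        "repeated_train_rows": int(train.index.duplicated().sum()),
        "rows_in_both": len(train_labels & test_labels),
        "rows_in_neither": len(set(full.index) - train_labels - test_labels),
    }

# Stand-in data, split the same way as above (np.random.choice with its defaults).
full = pd.DataFrame({"value": np.arange(170)})
np.random.seed(1)
rows = np.random.choice(np.arange(len(full)), round(0.8 * len(full)))
train = full.iloc[rows]
test = full.drop(index=rows)

report = check_split(full, train, test)
print(report)
```

A nonzero `repeated_train_rows` means the same observation appears more than once in the training data, which also explains why `test_rows` is larger than 20% of the full data.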


In [60]:
# save them as .csv files so you can access them later 

divorce_train.to_csv("divorce2_train.csv")

divorce_test.to_csv("divorce2_test.csv")
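One thing to know when you read these files back in: by default, `to_csv` writes the row labels as a first column with no name. The sketch below uses a tiny made-up data frame and a temporary folder (both assumptions for illustration) to show how `index_col=0` restores the original row labels.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up data frame with non-consecutive row labels, like a training split.
df = pd.DataFrame({"Atr1": [3, 0, 2], "Class": [1, 0, 1]}, index=[37, 140, 72])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_train.csv")
    df.to_csv(path)                             # row labels are written as an unnamed first column

    plain = pd.read_csv(path)                   # the labels come back as an ordinary column
    restored = pd.read_csv(path, index_col=0)   # the labels come back as the index

print(plain.columns.tolist())
print(restored.index.tolist())
```

If you don't need the original row labels at all, passing `index=False` to `to_csv` leaves them out of the file.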

In [49]:
name ="divorce"  ## Put the name of the csv file here.  DO NOT include the .csv, the code will add that later  
Save = "Yes"     ## Use "Yes" or "No" to indicate whether you want these split data sets to be available for other 
                 ## notebooks to use.  Today, for these data sets, we do.  


np.random.seed(3)  # Change frequently, but use if you want the same data every time. 

## Change nothing below here #########################################################################
######################################################################################################

csvname = name+".csv"

ds = pd.read_csv(csvname, sep=";")  # Change only this line to read in the data that you need!!
N = len(ds)

 

train_N = round(0.8*N)

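# np.random.choice samples with replacement by default, so a row can be picked more than once;
# add replace=False inside the parentheses to pick each row at most once.
train_rows = np.random.choice(np.arange(N), train_N).tolist()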

ds_train = ds.iloc[train_rows]

ds_test = ds.drop(index = train_rows)

if Save == "Yes":
    train_name = name+"_train.csv"
    test_name = name+"_test.csv"

    ds_train.to_csv(train_name)

    ds_test.to_csv(test_name)
    
    print(f"You selected 'Yes' for Save so your two new datasets are called {train_name} and {test_name}. \n ")

print("Your split data sets are called ds_train and ds_test in this notebook, now.  You can change their names by resaving them now.\n ")

print("Here's a preview of your training data, ds_train")
display(ds_train.head(5))

print("Here's a preview of your testing data, ds_test")
ds_test.head(5)



You selected 'Yes' for Save so your two new datasets are called divorce_train.csv and divorce_test.csv. 
 
Your split data sets are called ds_train and ds_test in this notebook, now.  You can change their names by resaving them now.
 
Here's a preview of your training data, ds_train


Unnamed: 0,Atr1,Atr2,Atr3,Atr4,Atr5,Atr6,Atr7,Atr8,Atr9,Atr10,...,Atr46,Atr47,Atr48,Atr49,Atr50,Atr51,Atr52,Atr53,Atr54,Class
106,0,0,0,0,0,0,0,0,0,0,...,3,1,3,1,3,3,3,1,0,0
152,1,0,0,0,0,1,0,0,0,1,...,2,1,2,1,2,2,4,2,0,0
131,0,1,1,1,0,0,0,0,0,0,...,3,0,2,2,2,2,0,0,0,0
0,2,2,4,1,0,0,0,0,0,0,...,2,1,3,3,3,2,3,2,1,1
21,4,3,3,3,4,1,0,3,3,3,...,4,4,4,4,4,4,4,4,4,1


Here's a preview of your testing data, ds_test


Unnamed: 0,Atr1,Atr2,Atr3,Atr4,Atr5,Atr6,Atr7,Atr8,Atr9,Atr10,...,Atr46,Atr47,Atr48,Atr49,Atr50,Atr51,Atr52,Atr53,Atr54,Class
3,3,2,3,2,3,3,3,3,3,3,...,2,2,3,3,3,3,2,2,2,1
4,2,2,1,1,1,1,0,0,0,0,...,2,1,2,3,2,2,2,1,0,1
5,0,0,1,0,0,2,0,0,0,1,...,2,2,1,2,1,1,1,2,0,1
6,3,3,3,2,1,3,4,3,2,2,...,3,2,3,2,3,3,2,2,2,1
10,4,4,4,3,4,0,0,4,4,3,...,4,4,4,4,4,4,4,4,4,1


In [51]:
name ="alzheimers_disease_data"  ## Put the name of the csv file here.  DO NOT include the .csv, the code will add that later  
Save = "Yes"     ## Use "Yes" or "No" to indicate whether you want these split data sets to be available for other 
                 ## notebooks to use.  Today, for these data sets, we do.  

np.random.seed(3)  # Change frequently, but use if you want the same data every time. 

## Change nothing below here unless you know what you're changing ####################################
######################################################################################################

csvname = name+".csv"

ds = pd.read_csv(csvname)  # Change only this line as needed to read in the data!!
N = len(ds)

 

train_N = round(0.8*N)

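# np.random.choice samples with replacement by default, so a row can be picked more than once;
# add replace=False inside the parentheses to pick each row at most once.
train_rows = np.random.choice(np.arange(N), train_N).tolist()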

ds_train = ds.iloc[train_rows]

ds_test = ds.drop(index = train_rows)

if Save == "Yes":
    train_name = name+"_train.csv"
    test_name = name+"_test.csv"

    ds_train.to_csv(train_name)

    ds_test.to_csv(test_name)
    
    print(f"You selected 'Yes' for Save so your two new datasets are called {train_name} and {test_name}. \n ")

print("Your split data sets are called ds_train and ds_test in this notebook, now.  You can change their names by resaving them now.\n ")

print("Here's a preview of your training data, ds_train")
display(ds_train.head(5))

print("Here's a preview of your testing data, ds_test")
ds_test.head(5)



You selected 'Yes' for Save so your two new datasets are called alzheimers_disease_data_train.csv and alzheimers_disease_data_test.csv. 
 
Your split data sets are called ds_train and ds_test in this notebook, now.  You can change their names by resaving them now.
 
Here's a preview of your training data, ds_train


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
1898,6649,67,0,0,0,34.090861,0,1.878551,1.561772,9.773309,...,0,0,6.309786,1,0,0,0,0,0,XXXConfid
1688,6439,79,1,1,1,39.053378,1,8.842797,9.095801,6.833856,...,0,0,3.031109,0,0,0,0,0,0,XXXConfid
1667,6418,67,1,0,1,29.337619,0,8.046338,4.179567,0.624651,...,0,0,8.292964,0,0,0,0,1,0,XXXConfid
968,5719,89,1,3,1,38.866212,0,8.624554,9.869697,4.370038,...,0,0,1.71792,0,0,0,0,0,1,XXXConfid
789,5540,69,1,0,1,34.519868,0,19.078696,1.393494,7.963929,...,0,0,6.672147,1,0,0,0,0,0,XXXConfid


Here's a preview of your testing data, ds_test


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,4751,73,0,0,2,22.927749,0,13.297218,6.327112,1.347214,...,0,0,1.725883,0,0,0,1,0,0,XXXConfid
3,4754,74,1,0,1,33.800817,1,12.209266,8.428001,7.435604,...,0,1,6.481226,0,0,0,0,0,0,XXXConfid
5,4756,86,1,1,1,30.626886,0,4.140144,0.211062,1.584922,...,0,0,9.015686,1,0,0,0,0,0,XXXConfid
6,4757,68,0,3,2,38.387622,1,0.646047,9.257695,5.897388,...,0,0,9.236328,0,0,0,0,1,0,XXXConfid
16,4767,63,1,1,2,22.822896,1,4.433961,7.182895,7.929486,...,1,0,1.382086,0,0,0,0,0,1,XXXConfid
