# Assigning Panel Data to Training, Testing and Validation Groups for Machine Learning Models

Cross-sectional data includes individual entities measured in one time period. For example, if you have 10,000 people measured once, you have cross-sectional data.

Time series data includes one entity measured over multiple time periods. For example, if you have a single machine measured every day for ten years, you have a time series.

Panel data includes multiple entities measured over multiple time periods.  For example, if you have 1,000 consumers measured over ten years, you have panel data. Or, if you have 100 machines measured over 100 months, you have panel data.
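
To make the three layouts concrete, here is a minimal sketch with made-up data. The column names `entity`, `period` and `value` and the helper `describe_layout` are illustrative assumptions, not part of the data set used later in this notebook.

```python
import pandas as pd

# Cross-sectional: many entities, one time period
cross_section = pd.DataFrame({
    "entity": ["A", "B", "C"],
    "period": [2020, 2020, 2020],
    "value": [1.0, 2.0, 3.0],
})

# Time series: one entity, many time periods
time_series = pd.DataFrame({
    "entity": ["A", "A", "A"],
    "period": [2018, 2019, 2020],
    "value": [0.8, 0.9, 1.0],
})

# Panel: many entities, each measured over many time periods
panel = pd.DataFrame({
    "entity": ["A", "A", "B", "B", "C", "C"],
    "period": [2019, 2020, 2019, 2020, 2019, 2020],
    "value": [0.9, 1.0, 1.8, 2.0, 2.7, 3.0],
})


def describe_layout(df):
    """Classify a frame by counting its distinct entities and periods."""
    n_entities = df["entity"].nunique()
    n_periods = df["period"].nunique()
    if n_entities > 1 and n_periods > 1:
        return "panel"
    if n_entities > 1:
        return "cross-sectional"
    # A single entity is treated as a time series in this sketch
    return "time series"
```

Calling `describe_layout` on each frame returns the layout name, which is a quick way to confirm what kind of data you are holding before you decide how to split it.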

Panel data is quite common in data science. Sometimes, it is called cross-sectional time-series data. I have also heard it referred to as pooled time series data. Whatever you want to call it, as a practicing data scientist, you'll more than likely have to deal with it.

It is standard procedure when building machine learning models to assign your data to modeling groups. Typically, we randomly subset the data into Training, Testing and Validation groups. Random, in this case, means that every record is assigned independently, with the same probability of landing in a given group as every other record.

When you are working with panel data, however, you will need to alter the normal process a little.

In this notebook, I walk through a simple example of how to do this.




## Table of Contents

1. [Getting Set Up](#setup1)<br>
 
2. [Data Exploration](#explore)<br>
 
3. [Create the Testing, Training and Validation Groupings by Entity (machine id)](#groups)<br>

4. [Conclusion](#conc)<br>

### 1. Getting Set Up <a id="setup1"></a>

Import all of the relevant Python libraries:

In [1]:
import numpy as np
import pandas as pd

Import the data from GitHub:

In [None]:
# Remove the data if you run this notebook more than once (-f avoids an error on the first run)
!rm -f equipment_failure_data_1.csv

In [None]:
# Import the first half from GitHub
!wget https://raw.githubusercontent.com/shadgriffin/panelsampling/main/equipment_failure_data_1.csv

In [4]:
# Convert csv to pandas dataframe
pd_data_1 = pd.read_csv("equipment_failure_data_1.csv", sep=",", header=0)

In [None]:
# Remove the data if you run this notebook more than once (-f avoids an error on the first run)
!rm -f equipment_failure_data_2.csv

In [None]:
# Import the second half from GitHub
!wget https://raw.githubusercontent.com/shadgriffin/panelsampling/main/equipment_failure_data_2.csv

In [7]:
# convert to pandas dataframe
pd_data_2 = pd.read_csv("equipment_failure_data_2.csv", sep=",", header=0)

In [8]:
#concatenate the two data files into one dataframe
pd_data = pd.concat([pd_data_1, pd_data_2])

### 2. Data Exploration <a id="explore"></a>

To see the first rows:

In [9]:
pd_data.head()

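| Unnamed: 0 | ID | DATE | REGION_CLUSTER | MAINTENANCE_VENDOR | MANUFACTURER | WELL_GROUP | S15 | S17 | S13 | S5 | S16 | S19 | S18 | EQUIPMENT_FAILURE | S8 | AGE_OF_EQUIPMENT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100001 | 12/2/14 | G | O | Y | 1 | 11.088 | 145.223448 | 39.34 | 3501.0 | 8.426869 | 1.9 | 24.610345 | 0 | 0.0 | 880 |
| 1 | 100001 | 12/3/14 | G | O | Y | 1 | 8.877943 | 187.573214 | 39.2 | 3489.0 | 6.483714 | 1.9 | 24.671429 | 0 | 0.0 | 881 |
| 2 | 100001 | 12/4/14 | G | O | Y | 1 | 8.676444 | 148.363704 | 38.87 | 3459.0 | 6.159659 | 2.0 | 24.733333 | 0 | 0.0 | 882 |
| 3 | 100001 | 12/5/14 | G | O | Y | 1 | 9.988338 | 133.66 | 39.47 | 3513.0 | 9.320308 | 2.0 | 24.773077 | 0 | 0.0 | 883 |
| 4 | 100001 | 12/6/14 | G | O | Y | 1 | 8.475264 | 197.1816 | 40.33 | 3589.0 | 8.02296 | 1.5 | 24.808 | 0 | 0.0 | 884 |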


ID - machine ID

DATE - date of observation

REGION_CLUSTER - region in which the machine is located

MAINTENANCE_VENDOR - company that provides machine maintenance and service

MANUFACTURER - equipment manufacturer

WELL_GROUP - machine type

AGE_OF_EQUIPMENT - machine age, in days

S15 - sensor value

S17 - sensor value

S13 - sensor value

S5 - sensor value

S16 - sensor value

S19 - sensor value

S18 - sensor value

S8  - sensor value

EQUIPMENT_FAILURE - '1' means that the equipment failed. '0' means the equipment did not fail.

As you can see, this is a panel data set. We have multiple machines measured over multiple time periods. ID identifies the machine and DATE gives the date of the observation. Now, let's examine how many machines and how many dates we have.

Examine the number of rows and columns.  The data has 307,751 rows and 16 columns.

In [10]:
pd_data.shape

(307751, 16)

There are 421 machines in the data set.

In [11]:
# One row per machine, with a count for each remaining column
counts_by_id = pd_data.groupby(['ID']).agg(['count'])
counts_by_id.shape

(421, 15)


There are 731 unique dates in the data set.


In [12]:
# One row per date, with a count for each remaining column
counts_by_date = pd_data.groupby(['DATE']).agg(['count'])
counts_by_date.shape

(731, 15)

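So we have 421 machines and 731 unique dates. Since 421 × 731 = 307,751, which matches the row count, the data is consistent with a balanced panel: one record per machine per date.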
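
Multiplying the number of machines by the number of dates and comparing the result with the row count shows whether every machine has a record for every date. Here is a minimal sketch of that check. `panel_summary` is a hypothetical helper, and the toy frame stands in for `pd_data`; it assumes that one row per machine per date is the layout you expect.

```python
import pandas as pd


def panel_summary(df, entity_col="ID", time_col="DATE"):
    """Count entities, periods and rows, and flag whether the panel is balanced."""
    n_entities = int(df[entity_col].nunique())
    n_periods = int(df[time_col].nunique())
    n_rows = len(df)
    # A repeated (entity, period) pair would inflate the row count
    has_duplicates = bool(df.duplicated(subset=[entity_col, time_col]).any())
    balanced = (not has_duplicates) and n_rows == n_entities * n_periods
    return {
        "entities": n_entities,
        "periods": n_periods,
        "rows": n_rows,
        "balanced": balanced,
    }


# Toy stand-in for pd_data: 3 machines x 2 dates
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "DATE": ["12/2/14", "12/3/14"] * 3,
})
summary = panel_summary(toy)
```

If `balanced` comes back `False`, some machines are missing dates or some machine and date pairs are repeated, which is worth knowing before you count records per modeling group.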

### 3. Create the Testing, Training and Validation Groupings by Entity (machine id) <a id="groups"></a>

We could just randomly assign each record to one of the three groups.  While that could work, I wouldn't recommend it.  I would recommend assigning the group at an entity level (machine in this case).  

Why?  

Well, I could use some multi-syllabic words (like auto-correlation or mayonnaise) to describe why, but let's just think about it.

Why do we separate the data into training, testing and validation groups? 

We want to ensure that our model is not over-fit. In other words, we want to make sure that our model applies to new data as it becomes available.

For example, let's pretend that we built a model that predicts what happened last year with 100% accuracy. Good job, right?  Well, it really doesn't matter how well the model predicts last year.  We need it to predict today, tomorrow and the day after that.  So, if a model predicts last year with 100% accuracy but fails to predict tomorrow, it's no good.

Building a model on the training data and verifying its accuracy on the testing and validation data sets keeps this from happening.







In order to prevent over-fitting we need our training, testing and validation groups to be independent.  That is, we need to ensure that the data in the training group is different from the testing and validation groups.  Or, at least as different as possible.

So what happens if we just randomly assign each record to each of the groups in question?  We end up with records from each entity in each group.  For example with a simple random selection method, if we are dealing with machines, it is probable that machine 123 will appear in your training, testing and validation groups.  If you are dealing with individuals, it is probable that Steve Wakahookie will appear in all three groups.  

In other words, your training, testing and validation groups ARE NOT as independent as they could be, because good ol' Steve and machine 123 are present in all three groups.
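
You can see this for yourself with a small experiment. The sketch below uses a made-up panel of 20 machines with 50 records each (not the notebook's data) and assigns every record to a group on its own, using the same 35/30/35 thresholds applied later in this notebook. The random generator and seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy panel (made up): 20 machines, 50 daily records each
toy = pd.DataFrame({
    "ID": np.repeat(np.arange(20), 50),
    "DAY": np.tile(np.arange(50), 20),
})

# Record-level assignment: every row draws its own random number
rng = np.random.default_rng(0)
draw = rng.random(len(toy))
toy["MODELING_GROUP"] = np.where(
    draw <= 0.35, "TRAINING",
    np.where(draw <= 0.65, "VALIDATION", "TESTING"),
)

# Count how many distinct groups each machine landed in
groups_per_machine = toy.groupby("ID")["MODELING_GROUP"].nunique()
print(groups_per_machine.value_counts())
```

With 50 records per machine, it is all but certain that every machine shows up in more than one group, which is exactly the overlap described above.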

Now, if you assign group membership based on entity, all of Steve's records will be in exactly one of the training, testing or validation groups. Likewise, all of the records associated with machine 123 will be in only one of the three groups.

Get a unique list of all IDs:

In [13]:
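# One row per machine: keep only the ID column, then drop repeated IDs.
# The explicit copy avoids a SettingWithCopyWarning when a column is added later.
pd_id = pd_data[['ID']].drop_duplicates().copy()
pd_id.shape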

(421, 1)

Create a new variable that holds one random number between 0 and 1 for each machine:

In [14]:
np.random.seed(42)
pd_id['wookie'] = (np.random.randint(0, 10000, pd_id.shape[0]))/10000

In [15]:
pd_id = pd_id[['ID', 'wookie']]

Give each machine a 35% chance of being in the training, a 30% chance of being in the validation and a 35% chance of being in the testing data set.


In [16]:
pd_id['MODELING_GROUP'] = np.where(((pd_id.wookie <= 0.35)), 'TRAINING', np.where(((pd_id.wookie <= 0.65)), 'VALIDATION', 'TESTING'))

This is how many machines fall into each group:

In [17]:
machines_per_group = pd_id.groupby(['MODELING_GROUP'])['wookie'].count()
machines_per_group

MODELING_GROUP
TESTING       149
TRAINING      146
VALIDATION    126
Name: wookie, dtype: int64
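
If you need to repeat this assignment for other data sets, the steps can be wrapped in a small function. This is a sketch under stated assumptions, not the notebook's exact method: `assign_entity_groups` is a hypothetical helper, and it draws from `np.random.default_rng` rather than `np.random.seed` plus `np.random.randint`, so it will not reproduce the group counts shown above.

```python
import numpy as np
import pandas as pd


def assign_entity_groups(ids, train_share=0.35, validation_share=0.30, seed=42):
    """Map each unique entity ID to TRAINING, VALIDATION or TESTING."""
    unique_ids = pd.Series(ids).drop_duplicates().reset_index(drop=True)
    rng = np.random.default_rng(seed)
    draw = rng.random(len(unique_ids))  # one uniform draw in [0, 1) per entity
    group = np.where(
        draw <= train_share, "TRAINING",
        np.where(draw <= train_share + validation_share, "VALIDATION", "TESTING"),
    )
    return pd.DataFrame({"ID": unique_ids, "MODELING_GROUP": group})


# Toy usage: 1,000 made-up machine IDs, each appearing three times
toy_ids = np.repeat(np.arange(1000), 3)
assignment = assign_entity_groups(toy_ids)

# Realized share of machines per group
shares = assignment["MODELING_GROUP"].value_counts(normalize=True)
```

The realized shares will rarely match the target percentages exactly, because each machine is assigned by an independent random draw. The more entities you have, the closer the shares tend to get.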

Append each ID's group to its individual records:

In [18]:
pd_data = pd_data.sort_values(by=['ID'], ascending=[True])
pd_id = pd_id.sort_values(by=['ID'], ascending=[True])

In [19]:
pd_data = pd_data.merge(pd_id, on=['ID'], how='inner')

pd_data.head()

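| Unnamed: 0 | ID | DATE | REGION_CLUSTER | MAINTENANCE_VENDOR | MANUFACTURER | WELL_GROUP | S15 | S17 | S13 | S5 | S16 | S19 | S18 | EQUIPMENT_FAILURE | S8 | AGE_OF_EQUIPMENT | wookie | MODELING_GROUP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100001 | 12/2/14 | G | O | Y | 1 | 11.088 | 145.223448 | 39.34 | 3501.0 | 8.426869 | 1.9 | 24.610345 | 0 | 0.0 | 880 | 0.727 | TESTING |
| 1 | 100001 | 3/29/16 | G | O | Y | 1 | 18.96 | 0.0 | 38.87 | 3459.0 | 10.0473 | 1.3 | 36.6 | 0 | 34.37 | 1363 | 0.727 | TESTING |
| 2 | 100001 | 3/30/16 | G | O | Y | 1 | 29.04 | 0.0 | 37.36 | 3325.0 | 10.2351 | 1.4 | 36.0 | 0 | 32.37 | 1364 | 0.727 | TESTING |
| 3 | 100001 | 3/31/16 | G | O | Y | 1 | 18.0 | 0.0 | 38.81 | 3454.0 | 8.5449 | 1.4 | 36.1 | 0 | 34.44 | 1365 | 0.727 | TESTING |
| 4 | 100001 | 4/1/16 | G | O | Y | 1 | 26.16 | 0.0 | 39.47 | 3513.0 | 10.9863 | 1.4 | 36.3 | 0 | 33.26 | 1366 | 0.727 | TESTING |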


This is how many records are in each group:

In [20]:
records_per_group = pd_data.groupby(['MODELING_GROUP'])['wookie'].count()
records_per_group

MODELING_GROUP
TESTING       108919
TRAINING      106726
VALIDATION     92106
Name: wookie, dtype: int64
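
Before modeling, it is worth confirming that no machine ended up in more than one group, and then pulling the three groups apart. Here is a minimal sketch. `split_by_group` is a hypothetical helper, and the toy frame stands in for the merged `pd_data`.

```python
import pandas as pd


def split_by_group(df, group_col="MODELING_GROUP", entity_col="ID"):
    """Check that no entity spans groups, then return one frame per group."""
    groups_per_entity = df.groupby(entity_col)[group_col].nunique()
    if (groups_per_entity > 1).any():
        raise ValueError("At least one entity appears in more than one group")
    # One independent copy per modeling group, keyed by group name
    return {name: part.copy() for name, part in df.groupby(group_col)}


# Toy stand-in for the merged pd_data
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "MODELING_GROUP": ["TRAINING", "TRAINING", "TESTING", "TESTING",
                       "VALIDATION", "VALIDATION"],
    "EQUIPMENT_FAILURE": [0, 1, 0, 0, 0, 1],
})
frames = split_by_group(toy)
training = frames["TRAINING"]
```

Raising an error on overlap is a design choice for this sketch: a silent leak between groups is hard to spot later, so failing early seems the safer default.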

### 4. Conclusion <a id="conc"></a>

So, there you go. Now, we are ready to build a machine learning model. By placing entities, not records, into your training, testing and validation groups, you can keep the groups independent of one another and build models that work yesterday, today and tomorrow.




### Author





**Shad Griffin** is a Certified Thought Leader and a Data Scientist at IBM.

<hr>
Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.