# Create Evenly-Spaced Time Series Labels 

Field observation data will be used as labels (*i.e., the y column in the data*) for training image processing algorithms to learn the complex relationships between images (*i.e., the x columns in the data*) and the chick behavior events of interest. The images, together with their position or index as frames in a single video, are the input data, on the assumption that hidden or complex relationships link them to the target variable (**the behavior events in the y column**). Note: having the target (y) column turns an unsupervised machine learning problem into a supervised one.

The field observation data are not evenly spaced in time. The images, however, are evenly spaced because they are frames of videos recorded at a constant frame rate. This notebook imports the field data (*.csv files*), extracts the times at which chick behaviors of interest occurred, and creates a new time-series variable that combines generated blank time steps with the field observation events.

In [1]:
## Here we go! 

import sys 
import os 
import pandas as pd 
import numpy as np


### Step 1: Read in raw field data 

In [2]:
## Explore CSV of single day data (July 26)

nest13B35A_0726 = pd.read_table("csvdata/nest13B35A/13B35A_0726_1240.csv") 
nest13B35A_0726.head()

| Unnamed: 0 | Name | Start | Duration | Time Format | Type | Description |
|---|---|---|---|---|---|---|
| 0 | 13B35A_0726_1240 | 0:00.000 | 0:00.000 | decimal | Cue | |
| 1 | AP N1N4 | 1:00.096 | 0:00.438 | decimal | Cue | ON BEAK |
| 2 | AP N1N4 | 2:06.909 | 0:00.701 | decimal | Cue | |
| 3 | AP N1N3 | 6:31.457 | 0:01.467 | decimal | Cue | ON BEAK |
| 4 | AP N1N4 | 6:33.624 | 0:00.773 | decimal | Cue | ON BEAK |


At **1:00.096** (*Start column*) there is a behavior (*Description column*) that lasts **0:00.438** (*Duration column*). That is almost half of a second. Videos are either *30 or 60 frames per second*. That implies that, even in the worst case (30 frames per second), we can capture / quantify behaviors that last at least 1/30 of a second. However, processing every frame will be computationally expensive for videos that are several minutes long. For example, a 10 min video at 30 frames per second becomes a time-series of **18,000 images**, and two weeks of daily 10 min videos results in **252,000 images**.
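The frame counts above follow from simple arithmetic. Here is a minimal sketch; the helper name `frame_count` is hypothetical and only illustrates the calculation.

```python
def frame_count(minutes, fps):
    """Number of frames in a video of the given length and frame rate."""
    return minutes * 60 * fps

# One 10 min video at 30 frames per second
frames_per_video = frame_count(10, 30)

# Two weeks (14 days) of one such video per day
frames_two_weeks = 14 * frames_per_video

print(frames_per_video, frames_two_weeks)  # 18000 252000
```

At 60 frames per second, both totals double.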

Note: There is a chance this will be necessary in the future. But for now, especially for exploring the data and initial probing, I will downsample the number of images. 

I will need to know the shortest duration of a behavior and round the *start column* in order to downsample the number of images and still create an evenly-spaced time-series file to be used as our model's target variable. I need to confirm that rounding the *start column* will not change the relative order of the chick behavior events recorded in the *description column*. Also, I need to add an annotated *y observation* for every time unit that a behavior's *duration* spans. For example, if I downsample to **10 frames per second**, then a behavior that lasted **0.7 seconds** would have *7 observations*.
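As a rough illustration of the downsampling idea, the sketch below keeps every third frame to go from 30 to 10 frames per second, and counts how many 10-per-second observations a 0.7 second behavior spans. The function name `downsample_indices` is hypothetical, and the sketch assumes the target rate divides the source rate evenly.

```python
def downsample_indices(n_frames, source_fps, target_fps):
    """Return the indices of the frames kept when reducing the frame rate."""
    if source_fps % target_fps != 0:
        raise ValueError("target_fps must divide source_fps evenly")
    step = source_fps // target_fps
    return range(0, n_frames, step)

# A 10 min video at 30 fps has 18,000 frames; keep every 3rd one for 10 fps
kept = downsample_indices(18_000, source_fps=30, target_fps=10)
print(len(kept))  # 6000

# A 0.7 s (700 ms) behavior at 10 observations per second,
# computed in integer milliseconds to avoid floating-point error
n_observations = 700 * 10 // 1000
print(n_observations)  # 7
```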

### Step 2: Determine time unit 

Find the minimum duration to determine what the *start and duration columns* should be rounded to.

In [12]:
## Remove 0th row because it is an artifact of the data collection method 

duration_raw = nest13B35A_0726["Duration"]
duration_mod = duration_raw.iloc[1:]
#duration_mod.head()

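def get_min_duration(pd_series):
    # Skip the 0th (artifact) row and return the smallest remaining value
    return pd_series.iloc[1:].min()

# Evaluated last so the notebook displays the value.
# Note: the durations are strings here, so min() compares them as text.
duration_mod.min()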
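Because the *Duration* values are read in as strings such as `0:00.438`, taking their minimum compares text rather than time. For example, `"10:00.000"` sorts before `"2:00.000"`. One way to avoid this is to convert each value to integer milliseconds first. The sketch below is an assumption about how that could be done: `to_milliseconds` is a hypothetical helper, and it assumes timestamps of the form `m:ss.mmm` (optionally with a leading hours field).

```python
import pandas as pd

def to_milliseconds(stamp):
    """Convert a timestamp such as '1:00.096' to integer milliseconds."""
    parts = stamp.split(":")
    # Every field before the last is a base-60 unit (hours, minutes)
    total = 0
    for field in parts[:-1]:
        total = total * 60 + int(field)
    whole, _, frac = parts[-1].partition(".")
    total = total * 60 + int(whole)          # now in whole seconds
    frac = (frac + "000")[:3]                # pad or trim to milliseconds
    return total * 1000 + int(frac)

# Duration values copied from the table above
durations = pd.Series(["0:00.000", "0:00.438", "0:00.701", "0:01.467", "0:00.773"])

# Drop the 0th (artifact) row, then compare as numbers
durations_ms = durations.iloc[1:].map(to_milliseconds)
min_ms = durations_ms.min()
print(min_ms)  # 438
```

Integer milliseconds also sidestep floating-point surprises in the rounding steps that follow.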

#### The shortest recorded behavior lasted just under 2/10 of a second 

I will assume I am going to use **10 frames per second**. Thus, every duration should be rounded up to the next **0:00.1** time unit: add one time unit, then floor (round down) to the nearest **0:00.1**.

Example:

*0:00.195* will be rounded based on the calculation **floor(0:00.195 + 0:00.1) = 0:00.2**. This behavior is, thus, used in the target variable as *2 observations of the same chick behavior*.

And *0:00.402* would be rounded like so: **floor(0:00.402 + 0:00.1) = 0:00.5**. This behavior would be *5 observations of the same chick behavior*.

This is a liberal approach to rounding and recording the number of chick behaviors. For now, I believe this approach has fewer cons and less potential for introducing bias into the computational system than the analogous conservative approach.
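The rounding rule from the examples above can be sketched in integer milliseconds, which avoids floating-point error near the 0.1 second boundaries. The names `UNIT_MS` and `observations_for_duration` are hypothetical, and the 100 ms unit reflects the assumed 10 frames per second.

```python
UNIT_MS = 100  # one time unit at the assumed 10 frames per second

def observations_for_duration(duration_ms, unit_ms=UNIT_MS):
    """Add one time unit to the duration, then floor to whole time units."""
    return (duration_ms + unit_ms) // unit_ms

print(observations_for_duration(195))  # 2
print(observations_for_duration(402))  # 5
```

One consequence of adding a full unit before flooring is that a duration sitting exactly on a boundary, such as 200 ms, counts as 3 observations rather than 2, which is in keeping with the liberal approach.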

### Step 3: Floor Start column 

Round down the *start column* to the nearest time unit.  
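A minimal sketch of this step, assuming the *Start* values have already been converted to integer milliseconds and that the time unit is 100 ms (10 frames per second). The variable names are illustrative only. The sketch also checks that flooring keeps the events in their original order and in distinct time steps.

```python
import pandas as pd

UNIT_MS = 100  # assumed time unit: 0.1 s

# Start values from the table above, in milliseconds
# (0:00.000, 1:00.096, 2:06.909, 6:31.457, 6:33.624)
start_ms = pd.Series([0, 60_096, 126_909, 391_457, 393_624])

# Floor each start time to the beginning of its time unit
start_floor_ms = (start_ms // UNIT_MS) * UNIT_MS
print(start_floor_ms.tolist())  # [0, 60000, 126900, 391400, 393600]

# Flooring never reorders events; it can only merge events that fall in the same unit
order_preserved = bool(start_floor_ms.is_monotonic_increasing)
no_collisions = bool(start_floor_ms.is_unique)
print(order_preserved, no_collisions)  # True True
```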

### Step 4: Round up Duration column

Each *duration* should be counted for the entirety of every time unit it was recorded in.
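Putting Steps 3 and 4 together, the evenly-spaced label series could be built roughly as follows. This is a sketch under several assumptions: times are in integer milliseconds, the time unit is 100 ms, blank time steps are represented by an empty string, the *Description* text is used as the label, and `build_labels` is a hypothetical helper rather than an existing function.

```python
import numpy as np

UNIT_MS = 100  # assumed time unit: 0.1 s (10 labels per second)

def build_labels(events, total_ms, unit_ms=UNIT_MS, blank=""):
    """Return one label per time step, blank except where an event occurs.

    events: iterable of (start_ms, duration_ms, label) tuples
    total_ms: length of the video in milliseconds
    """
    n_steps = total_ms // unit_ms + 1
    labels = np.full(n_steps, blank, dtype=object)
    for start_ms, duration_ms, label in events:
        first = start_ms // unit_ms                   # Step 3: floor the start
        count = (duration_ms + unit_ms) // unit_ms    # Step 4: add a unit, then floor
        labels[first:first + count] = label
    return labels

# Two "ON BEAK" events from the table above, in a video assumed to be 400 s long
events = [(60_096, 438, "ON BEAK"), (391_457, 1_467, "ON BEAK")]
labels = build_labels(events, total_ms=400_000)

n_labeled = sum(1 for value in labels if value == "ON BEAK")
print(len(labels), n_labeled)  # 4001 20
```

The first event fills time steps 600 through 604 (5 observations) and the second fills 3914 through 3928 (15 observations); every other step stays blank.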