# Create Evenly-Spaced Time Series Labels 

Field observation data will be used as labels (*i.e. the y column in the data*) for training image processing algorithms to learn the relationships between images (*i.e. the x columns in the data*) and the chick behavior events of interest. The images, together with their position (frame index) in a single video, are the input data, on the assumption that they carry hidden or convoluted relationships to the target variable (**the behavior events in the y column**). Note: having a target (y) column is what makes this a supervised rather than an unsupervised machine learning problem. 

The field observation data is not evenly spaced in time. The images, however, are evenly spaced because they are frames of a video, which are captured at a fixed rate. This notebook will import the field data (*.csv files*), extract the times at which chick behaviors of interest occurred, and create a new time series that combines generated, initially blank time steps with the field observation events. 

In [1]:
## Here we go! 

import sys 
import os 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as pyplot
import datetime

%matplotlib inline 

### Step 1: Read in raw field data 

In [2]:
## Explore CSV of single day data (July 26)

nest13B35A_0726 = pd.read_table("csvdata/nest13B35A/13B35A_0726_1240.csv") 
nest13B35A_0726.head()

Unnamed: 0,Name,Start,Duration,Time Format,Type,Description
0,13B35A_0726_1240,0:00.000,0:00.000,decimal,Cue,
1,AP N1N4,1:00.096,0:00.438,decimal,Cue,ON BEAK
2,AP N1N4,2:06.909,0:00.701,decimal,Cue,
3,AP N1N3,6:31.457,0:01.467,decimal,Cue,ON BEAK
4,AP N1N4,6:33.624,0:00.773,decimal,Cue,ON BEAK


In [3]:
## Create a dataframe 

#nest_df["Duration_in_sec"] = pd.to_datetime(nest_df["Duration"])
#nest_df["Duration_in_sec"].dt.second
#nest_df["Milli"] = nest_df.index.second

def get_nest_string(nest_id, nest_day):
    return "csvdata/" + nest_id + "/" + nest_day + ".csv"

def create_df_with_duration_as_sec(nest_raw):
    ## Keep only the needed columns; the time format column is dropped by not selecting it
    nest_df = pd.DataFrame(nest_raw, 
                           columns = ["Name", "Start", "Duration", "Type", "Description"])
    nest_df.set_index("Start", inplace = True)   ## Make column "Start" index of time series data
    ## NOTE: to_datetime parses the "M:SS.mmm" strings as clock times, not as elapsed durations
    nest_df["Duration_in_sec"] = pd.to_datetime(nest_df["Duration"])
    return nest_df

nest_raw = pd.read_table(get_nest_string("nest13B35A", "13B35A_0726_1240"))
nest_mod = create_df_with_duration_as_sec(nest_raw)

In [4]:
## See range of "Duration" column (still strings here, so min/max compare lexicographically)

nest_min = nest_mod["Duration"].min()
nest_max = nest_mod["Duration"].max()

print("Nest Duration min:", nest_min, "\nNest Duration max: ", nest_max)            

Nest Duration min: 0:00.000 
Nest Duration max:  9:16.643


In [5]:
from pandas import Timestamp   ## pandas.lib is a private module and is gone from newer pandas

nest_mod_dt = pd.to_datetime(nest_mod["Duration"], format = "%M:%S.%f").dt  # .time

def round_datetime_to_millisecond(pd_datetime):
    return Timestamp(round(pd_datetime.value, -3))

#nest_mod_dt_avg = nest_mod_dt.map(round_datetime_to_millisecond)
#nest_mod_dt_avg = nest_mod["Duration"].map(round_datetime_to_millisecond)


##### -or-

np.round(nest_mod["Duration"].astype(np.int64), -3).astype('datetime64[Ms]')

ValueError: invalid literal for int() with base 10: '0:00.000'

At **1:00.096** (*Start column*) there is a behavior (*Description column*) that lasts **0:00.438** (*Duration column*), almost half a second. Videos are recorded at either *30 or 60 frames per second*, so at 30 frames per second the shortest behavior we can capture / quantify lasts 1/30 of a second. However, processing every frame will be computationally expensive for videos that are several minutes long. For example, a 10 min video at 30 frames per second becomes a time series of **18,000 images**, and two weeks of daily 10 min videos results in **252,000 images**. 
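
The image counts above follow from simple arithmetic. The sketch below reproduces them and, for comparison, shows the count for the same video kept at 10 frames per second. `frames_in_video` is a hypothetical helper, not part of the notebook.

```python
def frames_in_video(minutes, fps):
    """Number of frames in a video of the given length at a fixed frame rate."""
    return minutes * 60 * fps

one_video = frames_in_video(10, 30)       # single 10 min video at 30 frames per second
two_weeks = 14 * one_video                # one 10 min video per day for 14 days
downsampled = frames_in_video(10, 10)     # same video kept at 10 frames per second

print(one_video, two_weeks, downsampled)  # 18000 252000 6000
```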

Note: There is a chance this will be necessary in the future. But for now, especially for exploring the data and initial probing, I will downsample the number of images. 

I will need to know the shortest duration of a behavior and round the *start column* in order to downsample the number of images and still create an evenly-spaced time series to be used as our model's target variable. I need to confirm that rounding the *start column* will not change the relative order of the chick behavior events recorded in the *description column*. I also need to add an annotated *y observation* for every time unit that a behavior's *duration* spans. For example, if I downsample to **10 frames per second**, then a behavior that lasted **0.7 seconds** would have *7 observations*. 
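
As a rough sketch of the target variable described here, the function below lays events onto an evenly-spaced grid. It assumes times have already been converted to integer milliseconds, uses an empty string for blank time steps, floors each start to its time step, and counts every time step a duration touches. `build_labels`, the blank label, and the sample event are illustrative assumptions rather than the notebook's final method.

```python
def build_labels(events, total_ms, step_ms=100, blank=""):
    """Return one label per time step of length step_ms covering total_ms.

    events: iterable of (start_ms, duration_ms, label) tuples.
    If events overlap, the later one in the iterable overwrites the earlier one.
    """
    n_steps = -(-total_ms // step_ms)          # ceiling division
    labels = [blank] * n_steps
    for start_ms, duration_ms, label in events:
        first = start_ms // step_ms            # floor the start to its time step
        count = -(-duration_ms // step_ms)     # round the duration up to whole steps
        for i in range(first, min(first + count, n_steps)):
            labels[i] = label
    return labels

# Hypothetical event: a 0.7 s behavior starting at 1.0 s, at 10 steps per second.
labels = build_labels([(1000, 700, "ON BEAK")], total_ms=3000)
print(len(labels))               # 30
print(labels.count("ON BEAK"))   # 7
```

A zero-length event, such as the artifact in row 0, fills no time steps under this scheme.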

### Step 2: Determine time unit 

Find the minimum duration to determine what the *start and duration columns* should be rounded to. 

In [32]:
## Remove 0th row because it is an artifact of the data collection method 

def get_min_duration(pd_series):
    return pd_series.iloc[1:].min()

def get_max_duration(pd_series):
    return pd_series.iloc[1:].max()

def get_time_unit(duration_min):
    return np.ceil(duration_min)

duration_sec = nest_mod["Duration_in_sec"]
duration_min = get_min_duration(duration_sec)
duration_max = get_max_duration(duration_sec)

print("Nest Duration in sec min: ", duration_min, "\nNest Duration in sec max: ", duration_max)


##get_time_unit(duration_min)
#get_time_unit(get_min_duration(duration_raw))

Nest Duration in sec min:  2019-04-29 00:00:11 
Nest Duration in sec max:  2019-04-29 09:16:38
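
The values printed above are timestamps with a calendar date, not durations: the maximum `9:16.643` appears as `09:16:38`, which is what results from reading the string as 9 hours and 16.643 minutes. One possible way to get real durations is to parse the `M:SS.mmm` and `H:MM:SS.mmm` strings by hand into integer milliseconds. `clock_to_ms` below is a hypothetical helper, and it assumes the field files only ever use those two layouts.

```python
def clock_to_ms(text):
    """Convert 'M:SS.mmm' or 'H:MM:SS.mmm' to integer milliseconds."""
    parts = text.strip().split(":")
    if len(parts) not in (2, 3):
        raise ValueError("expected M:SS.mmm or H:MM:SS.mmm, got %r" % text)
    sec_text, _, frac_text = parts[-1].partition(".")
    millis = int((frac_text + "000")[:3])   # pad or truncate the fraction to milliseconds
    seconds = int(sec_text)
    weight = 60
    for field in reversed(parts[:-1]):      # minutes first, then hours if present
        seconds += int(field) * weight
        weight *= 60
    return seconds * 1000 + millis

print(clock_to_ms("0:00.438"))      # 438
print(clock_to_ms("9:16.643"))      # 556643
print(clock_to_ms("1:00:05.845"))   # 3605845
```

With pandas this could be applied as `nest_mod["Duration"].map(clock_to_ms)`, after which `min()` and `max()` compare numbers rather than strings or timestamps.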


In [14]:
duration_raw.dt.minute  

Start
0:00.000        0
1:00.096        0
2:06.909        0
6:31.457        1
6:33.624        0
6:58.795        1
7:03.140        7
11:34.386       0
13:20.027       0
17:10.425      41
21:09.881       1
23:59.449       0
24:05.854       1
24:10.825      11
24:27.889       0
24:41.320       4
24:47.875       1
24:56.962       1
25:11.142       3
25:16.728       1
25:18.922       2
25:22.180       2
25:29.737       4
26:01.253      29
34:06.454      26
41:09.338       0
45:39.895       1
45:43.073       1
45:53.000       0
53:47.469       0
               ..
54:38.379       0
54:43.597       0
59:22.337       1
59:41.952       0
1:00:05.845     3
1:00:11.881     3
1:00:16.210     0
1:00:22.984     4
1:00:32.216     1
1:00:54.684    41
1:02:47.507     0
1:03:15.900     0
1:03:17.472     0
1:03:19.034     0
1:03:20.851     0
1:03:41.893     0
1:03:43.703     1
1:03:45.127     3
1:03:51.595     0
1:03:55.940     2
1:04:01.568     3
1:04:10.110     2
1:04:19.200     1
1:04:22.223     4
1:04

#### The shortest recorded behavior lasted just under 2/10 of a second 

I will assume I am going to use **10 frames per second**, so the time unit is **0:00.1**. Thus, every duration should be rounded up to the **0:00.1** boundary above its current value: add one time unit, then floor (round down) to the nearest **0:00.1**. 

Example: 

*0:00.195* will be rounded based on the calculation **floor(0:00.195 + 0:00.1) = 0:00.2**. This behavior is thus represented in the target variable as *2 observations of the same chick behavior*. 

And *0:00.402* would be rounded like so: **floor(0:00.402 + 0:00.1) = 0:00.5**. This behavior would be *5 observations of the same chick behavior*. 

This is a liberal approach to rounding and recording the number of chick behaviors. For now, I believe this approach has fewer cons and less potential for introducing bias into the computational system than the analogous conservative approach. 
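
The two worked examples can be checked with integer milliseconds, which avoids floating-point surprises at the 0.1 boundaries. This is only a sketch of the add-one-unit-then-floor rule described here. `liberal_round_ms` is a hypothetical name, and note that under this rule a duration that is already an exact multiple of the unit also gains one unit.

```python
UNIT_MS = 100   # one time unit at 10 frames per second

def liberal_round_ms(duration_ms, unit_ms=UNIT_MS):
    """Add one time unit, then floor to a whole number of units."""
    return (duration_ms + unit_ms) // unit_ms * unit_ms

for duration_ms in (195, 402):
    rounded = liberal_round_ms(duration_ms)
    print(duration_ms, "->", rounded, "ms =", rounded // UNIT_MS, "observations")
# 195 -> 200 ms = 2 observations
# 402 -> 500 ms = 5 observations
```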

### Step 3: Floor Start column 

Round down the *start column* to the nearest time unit.  
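
A minimal sketch of this step, assuming start times already converted to integer milliseconds and a 100 ms time unit. Flooring never reorders events, but two events that begin within the same time unit collapse onto the same step, which the check below counts. The start values are the first four behavior rows of the table shown in Step 1, and `floor_to_unit` is a hypothetical helper.

```python
def floor_to_unit(time_ms, unit_ms=100):
    """Round a time in milliseconds down to the start of its time unit."""
    return time_ms // unit_ms * unit_ms

starts_ms = [60096, 126909, 391457, 393624]   # 1:00.096, 2:06.909, 6:31.457, 6:33.624
floored = [floor_to_unit(t) for t in starts_ms]

# Relative order is kept if the floored starts are still non-decreasing.
order_kept = all(a <= b for a, b in zip(floored, floored[1:]))
# Events that now share a time step would need special handling.
collisions = len(floored) - len(set(floored))

print(floored)      # [60000, 126900, 391400, 393600]
print(order_kept)   # True
print(collisions)   # 0
```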

### Step 4: Round up Duration column

A behavior should be counted for the entirety of each time unit it was recorded in, so the *duration column* is rounded up. 