Applying the HMM to the sleep dataset
======

Let's return to the dataset we'll be working with for this tutorial, a Drosophila movement dataset. Data can come in many formats, but a csv file is one of the most common due to its simplicity and integration with spreadsheets. The data we will be using for this tutorial is real, raw data from the Gilestro lab, where we track and record the movement of fruit flies using machine vision. The tracking can discern small movements and robustly records each fly several times per second, giving a multitude of variables to work with.

We'll be using the pandas package to import and store our data in the notebooks. Pandas is a widely used tool in data science and is built on top of numpy, which we briefly used previously. At the core of pandas is the DataFrame, a table format you will all be familiar with from spreadsheets. Pandas gives you many tools to manipulate the data before you feed it into any analysis or machine learning tool. As with numpy, everything used here will be explained as we use it, but if you'd like to read more about how to use pandas there is a quick tutorial on their website -> [here](https://pandas.pydata.org/docs/user_guide/10min.html)
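
As a minimal sketch of what a DataFrame looks like, the snippet below builds a tiny table by hand. The column names mirror the ones in our dataset, but the values and the ID 'fly_01' are made up purely for illustration.

```python
import pandas as pd

# A toy DataFrame: each dictionary key becomes a column
toy = pd.DataFrame({
    'id': ['fly_01', 'fly_01', 'fly_01'],   # hypothetical fly ID
    't': [0, 30, 60],                       # time in seconds
    'moving': [True, False, True],
})

# Selecting one column returns a pandas Series
moving = toy['moving']

# .shape gives (number of rows, number of columns)
n_rows, n_cols = toy.shape
```

Each row is one observation and each column one variable, which is the same layout the real dataset uses.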

In [319]:
# first we need to import pandas and numpy
# like numpy, pandas is usually imported under a shorthand name
import pandas as pd
import numpy as np

In [320]:
import sys
sys.path.append('../src/HMM')
from misc import get_data

df = get_data()

The Data Structure
=================

In [321]:
df

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
2453010,2016-09-27_10-56-35_053c6b|15,86250,0.776577,0.064865,0.034109,0.022879,0.799611,0.673182,False,False,False
2453011,2016-09-27_10-56-35_053c6b|15,86280,0.776577,0.064537,0.033866,0.022686,0.774246,0.659115,False,False,False
2453012,2016-09-27_10-56-35_053c6b|15,86310,0.776577,0.064823,0.035156,0.021957,0.779612,0.679327,False,False,False
2453013,2016-09-27_10-56-35_053c6b|15,86340,0.776577,0.064693,0.035478,0.022051,0.772465,0.678201,False,False,False


The first column is 'id', which contains a unique ID per fly and will allow us to filter and apply methods to just one fly at a time. The next most important variable is 't', or time. As we are working with time series data we must ensure this is structured properly, i.e. in sequential order and at regular intervals (the latter we will go over). The rest are various variables recorded at each timestamp; for this tutorial we'll only be interested in 'moving', 'micro', and 'walk'.
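
One possible way to check both properties per fly is sketched below. The toy data and the IDs 'fly_a' and 'fly_b' are invented for the example; 'fly_b' deliberately skips a reading.

```python
import pandas as pd

# Toy data for two hypothetical flies; 'fly_b' skips the reading at t = 60
toy = pd.DataFrame({
    'id': ['fly_a'] * 4 + ['fly_b'] * 3,
    't':  [0, 30, 60, 90, 0, 30, 90],
})

# Is 't' in increasing order within each fly?
is_sorted = toy.groupby('id')['t'].apply(lambda s: s.is_monotonic_increasing)

# Gap between consecutive rows within each fly (the first row of each fly is NaN)
gaps = toy.groupby('id')['t'].diff()

# A fly is sampled regularly if it only ever has one gap size
is_regular = gaps.groupby(toy['id']).nunique() == 1
```

Here `is_regular` is True for 'fly_a' and False for 'fly_b', because the skipped reading produces a second, larger gap.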

Checking for missing data
======

Most real datasets will not be perfectly populated, with tracking dropping out over the course of an experiment. In a DataFrame or an array, missing data at a timepoint or index is represented by a NaN value, which lets methods and functions know there is no data rather than a zero value. However, analysis packages will often throw an error if you feed them NaN values, so it's good practice to check for them first and either remove them or replace them with an approximation.
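
Before filtering anything it can help to count how much is missing. The sketch below does this on a small made-up table rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy table with some missing values
toy = pd.DataFrame({
    'x': [0.1, np.nan, 0.3, np.nan],
    'y': [0.5, 0.6, np.nan, np.nan],
})

# Number of NaN values in each column
nan_per_column = toy.isna().sum()

# Number of rows that contain at least one NaN
rows_with_nan = toy.isna().any(axis=1).sum()
```

`isna` and `isnull` are two names for the same pandas method, so either can be used.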

In [322]:
# Let's filter our dataframe for NaN values
# With pandas you can filter the dataframe by its columns
# To filter or slice the dataframe, put square brackets after the dataframe and inside them give the column condition
# For finding NaN values we have to call a method; for other regular filtering you just use ==, <, > and so on

df[df['x'].isnull()]

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
25938,2016-04-04_17-39-22_033aee|04,236160,,,,0.019395,0.756622,0.650191,False,False,
41513,2016-04-04_17-39-05_009aee|01,126150,,0.076196,0.050891,,,0.892686,,True,False
48302,2016-04-04_17-39-05_009aee|01,329820,,0.077477,,0.019779,0.715943,,,False,False
56176,2016-04-04_17-39-05_009aee|01,566040,,0.077477,0.050651,0.020420,,,False,,False
66013,2016-04-04_17-39-05_009aee|04,283950,,0.075793,,,1.153131,0.802417,True,True,
...,...,...,...,...,...,...,...,...,...,...,...
2389223,2016-09-27_10-55-13_052c6b|15,569280,,0.043125,,0.016434,,0.643020,False,,False
2396065,2016-09-27_10-55-13_052c6b|17,174930,,,,0.022859,0.841160,,False,False,False
2423645,2016-09-27_10-55-13_052c6b|19,402930,,,0.045413,0.018493,0.779612,0.668706,False,,
2427575,2016-09-27_10-55-13_052c6b|19,520830,,,0.043981,0.018593,45.172980,12.037528,,False,


In [323]:
# To break down what's happening we can call just what's inside the brackets. You can see that it is an array (or Series in pandas terms) with False or True per row.
# This array then dictates which rows get returned from the whole dataframe, i.e. only the ones that fulfil the condition and are True

df['x'].isnull()

0          False
1          False
2          False
3          False
4          False
           ...  
2453010    False
2453011    False
2453012    False
2453013    False
2453014    False
Name: x, Length: 2453015, dtype: bool

In [324]:
# However, we are not just looking at one column.
# Luckily with pandas you can filter by multiple columns; all you need to do is put each filter condition in round brackets and then separate them with an & ("and") or | ("or") logical operator
# By calling OR here we get all the rows where x or y is NaN

df[(df['x'].isnull()) | (df['y'].isnull())]

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
2774,2016-04-04_17-39-22_033aee|01,114630,0.195341,,0.043329,0.020550,,0.655093,,,False
12327,2016-04-04_17-39-22_033aee|01,401220,0.618850,,,0.020569,35.882174,2.237015,,,True
20704,2016-04-04_17-39-22_033aee|04,79140,0.470888,,0.049861,,0.767147,,False,,False
25938,2016-04-04_17-39-22_033aee|04,236160,,,,0.019395,0.756622,0.650191,False,False,
41513,2016-04-04_17-39-05_009aee|01,126150,,0.076196,0.050891,,,0.892686,,True,False
...,...,...,...,...,...,...,...,...,...,...,...
2445481,2016-09-27_10-56-35_053c6b|10,458460,0.117117,,0.042843,,0.744524,0.654873,,,False
2446139,2016-09-27_10-56-35_053c6b|10,478200,0.203604,,,0.019738,0.730935,,,False,False
2451076,2016-09-27_10-56-35_053c6b|15,28230,0.864988,,,,1.348586,0.819512,True,True,
2451728,2016-09-27_10-56-35_053c6b|15,47790,0.872031,,,,,0.674368,False,False,False


In [325]:
# Now we want to remove those rows containing NaN values as they aren't providing any information
# This time we'll want to call the & operator as we only want rows where both x and y are not NaN
# When filtering for NaNs above we were selecting for them; adding the ~ operator tells the filter to look for the opposite, so where NaN was True it now becomes False
# If taking a slice of a dataframe it's good practice to make it a copy, otherwise pandas may throw warnings when you later modify it
df_filtered = df[~(df['x'].isnull()) & ~(df['y'].isnull())].copy(deep = True)

# the new DataFrame now won't have any rows where 'x' or 'y' is NaN
df_filtered

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
2453010,2016-09-27_10-56-35_053c6b|15,86250,0.776577,0.064865,0.034109,0.022879,0.799611,0.673182,False,False,False
2453011,2016-09-27_10-56-35_053c6b|15,86280,0.776577,0.064537,0.033866,0.022686,0.774246,0.659115,False,False,False
2453012,2016-09-27_10-56-35_053c6b|15,86310,0.776577,0.064823,0.035156,0.021957,0.779612,0.679327,False,False,False
2453013,2016-09-27_10-56-35_053c6b|15,86340,0.776577,0.064693,0.035478,0.022051,0.772465,0.678201,False,False,False


## Task:
As stated before, for this tutorial we will be focussing on the variables 'moving', 'micro', and 'walk'. Now that you know how to filter out NaN values, apply this to only these columns.

In [326]:
# To complete 

# df = 

df = df[(~df['moving'].isnull()) & (~df['micro'].isnull()) & (~df['walk'].isnull())].copy(deep = True)


## Extra Task:
1) If you're new to pandas (or just want some practice) have a play around with other types of filtering (such as df[df['mean_velocity'] > 5]). It's a quick and easy way to filter your data, and if you're doing the same thing repeatedly you can create a function to do it instantly.

2) Rather than filtering out the NaN values you can replace them with something else. If we knew that tracking drops out when the flies are still for a long time, we could reasonably replace all NaNs in 'moving', 'micro', and 'walk' with False.
This can be done with the .fillna method, see here for how to do it -> [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).
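
A minimal sketch of the second extra task is shown below, using a small made-up table. It assumes that a missing value really does mean "not moving", which you would need to justify for your own data. Converting to pandas' nullable 'boolean' dtype first is one way to keep the columns strictly True/False after filling.

```python
import numpy as np
import pandas as pd

# Toy table where tracking dropped out in some rows
toy = pd.DataFrame({
    'moving': [True, np.nan, False],
    'micro':  [False, np.nan, np.nan],
    'walk':   [True, np.nan, False],
})

cols = ['moving', 'micro', 'walk']

# Convert to the nullable boolean dtype, replace missing values with False,
# then convert back to an ordinary bool dtype
filled = toy[cols].astype('boolean').fillna(False).astype(bool)
```

After this, `filled` has no NaN values and every column is a plain boolean.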

Binning the data to a larger time step
======

#Outcome - Why you would want to do it, the benefits and downsides

It's important with Hidden Markov models that any time series dataset is complete, with no skips due to missing data, as the model will assume the array you're feeding it has the same time step throughout. One way to achieve this is to increase the time step. Currently the dataset has a row for every 30 seconds; however, we know from filtering out the NaN values that we won't have them all. To rectify this we will increase the time step to 60 seconds: as long as there is at least 1 row out of 2 in each 60 second window, we'll have a perfectly populated dataset.

Additionally, doing so will decrease the size of the data, meaning the model will train more quickly. It's always worth trying the model with a few different time steps to see how this affects the output; then you can pick the one you think is the most representative and quickest to train.
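
The binning itself is just arithmetic: divide each time by the new time step, round down, and multiply back. As a small worked sketch, assuming 't' holds whole seconds, integer division with `//` does the rounding down for a whole column at once. The five times below are the first five rows of the dataset.

```python
import pandas as pd

t = pd.Series([31140, 31170, 31200, 31230, 31260], name='t')
timestep = 60

# Floor division drops the remainder; multiplying back gives the start of each bin
bin_t = (t // timestep) * timestep
```

The result is 31140, 31140, 31200, 31200, 31260, so each pair of 30 second rows lands in the same 60 second bin.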

In [327]:
# we'll go through it step by step, before wrapping it all in a function
# First we'll need the function floor which rounds down numbers
from math import floor

In [328]:
# first we'll create a new column with the new time step
# lambda functions are an easy way to apply a function to every value of a specific column
# We divide the time by our new time step and round down. The result is multiplied by the new time step, giving the largest multiple of the time step that is not greater than t
df['bin_t'] = df['t'].map(lambda t: 60 * floor(t / 60)) # the t represents each row value for the column 't'
df

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk,bin_t
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True,31140
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True,31140
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True,31200
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True,31200
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True,31260
...,...,...,...,...,...,...,...,...,...,...,...,...
2453010,2016-09-27_10-56-35_053c6b|15,86250,0.776577,0.064865,0.034109,0.022879,0.799611,0.673182,False,False,False,86220
2453011,2016-09-27_10-56-35_053c6b|15,86280,0.776577,0.064537,0.033866,0.022686,0.774246,0.659115,False,False,False,86280
2453012,2016-09-27_10-56-35_053c6b|15,86310,0.776577,0.064823,0.035156,0.021957,0.779612,0.679327,False,False,False,86280
2453013,2016-09-27_10-56-35_053c6b|15,86340,0.776577,0.064693,0.035478,0.022051,0.772465,0.678201,False,False,False,86340


You should see in the column 'bin_t' that rows next to each other now share a time step. Now that we have that, we'll want to pivot or group by this column so all rows that have the same time stamp are collected together.

In [329]:
# The pandas groupby method does this, all you need to do is call the method with the column you want to pivot by in the brackets
# Then you can tell it what aggregating function you want to call on the columns of interest
df_grouped = df.groupby('bin_t').agg(**{
            'x' : ('x', 'mean'), # before the brackets is the name of the new column, we'll keep it the same
            'y' : ('y', 'mean')  # within the brackets is the column you want to use and the function to apply to it. You have 'mean', 'median', 'max'... etc. built in, but you can also use your own functions
})

df_grouped

bin_t,x,y
6960,0.504669,0.051607
7020,0.468417,0.051249
7080,0.493713,0.051730
7140,0.495665,0.052071
7200,0.540849,0.051219
...,...,...
608100,0.516213,0.052228
608160,0.443002,0.052321
608220,0.499276,0.052075
608280,0.497366,0.053447


Some of you may have noticed that doing it this way will aggregate our whole dataset and lose the information per fly. To keep this information we can call a groupby with two levels, the first will be the higher level that the data is grouped by first, and the second the one that the functions will be applied to.

In [330]:
# We do exactly the same, but instead of just 'bin_t' we have a list with 'id' first
# Calling it this way on a lot of rows can take a few minutes or more depending on your computer, so don't worry if it takes a while
df_grouped = df.groupby(['id', 'bin_t']).agg(**{
            'x' : ('x', 'mean'),
            'y' : ('y', 'mean')
})
# We need to reset the index as it will have both 'id' and 'bin_t' as the index
df_grouped.reset_index(inplace = True)
# We'll also rename the column 'bin_t' back to 't' for clarity
df_grouped.rename(columns = {'bin_t' : 't'}, inplace = True)
df_grouped

Unnamed: 0,id,t,x,y
0,2016-04-04_17-38-06_019aee|07,31080,0.450670,0.052448
1,2016-04-04_17-38-06_019aee|07,31140,0.598823,0.044937
2,2016-04-04_17-38-06_019aee|07,31200,0.624971,0.049116
3,2016-04-04_17-38-06_019aee|07,31260,0.511102,0.052778
4,2016-04-04_17-38-06_019aee|07,31320,0.577133,0.049853
...,...,...,...,...
1226646,2016-09-27_11-30-35_009d6b|19,606300,0.568432,0.052146
1226647,2016-09-27_11-30-35_009d6b|19,606360,0.568489,0.052408
1226648,2016-09-27_11-30-35_009d6b|19,606420,0.568645,0.053953
1226649,2016-09-27_11-30-35_009d6b|19,606480,0.550620,0.051864


## Task:
Same as before: re-create the steps above, but for the columns 'moving', 'micro', and 'walk'. Instead of mean use max, so that a behaviour is recorded if it occurs at any point in that time window; it will also keep our results as either True or False, which are discrete categories.
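
To see why max suits True/False columns better than mean, here is a small sketch on made-up data for a single hypothetical fly. The column names `moving_mean` and `moving_max` are only used for this comparison.

```python
import pandas as pd

# Two 60 second bins, each made of two 30 second rows
toy = pd.DataFrame({
    'id':     ['fly_a'] * 4,
    'bin_t':  [0, 0, 60, 60],
    'moving': [True, False, False, False],
})

binned = toy.groupby(['id', 'bin_t']).agg(
    moving_mean=('moving', 'mean'),   # fraction of rows that were True
    moving_max=('moving', 'max'),     # True if any row in the bin was True
).reset_index()
```

The mean of the first bin is 0.5, which is no longer a category, whereas max returns True for the first bin and False for the second.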

In [331]:
# To complete 

# df = 

df = df.groupby(['id', 'bin_t']).agg(**{
            'moving' : ('moving', 'max'),
            'micro' : ('micro', 'max'),
            'walk' : ('walk', 'max')
})
df.reset_index(inplace = True)
df.rename(columns = {'bin_t' : 't'}, inplace = True)

Filling in the gaps
========

Another method to fill in the gaps in data is interpolation. This is where you determine a value at any given timepoint from the rest of the dataset. If you have just a few points missing the interpolation results can be quite accurate. Here we'll run through the steps to interpolate your data.

In [332]:
# First we can check if we have all the data points for each fly
# We'll use this function to check

def check_num_points(data, timestep=60):

    # Time covered by this fly's recording
    array_time = max(data['t']) - min(data['t'])

    # A complete series has one row per timestep, counting both end points
    return (array_time / timestep) + 1 == len(data)

# We need to call the function on each fly individually
# To do this you call a groupby with 'id' as the argument
df_check = df.groupby('id').apply(check_num_points)

# This gives us a pandas Series of True and False for each fly
df_check

id
2016-04-04_17-38-06_019aee|07     True
2016-04-04_17-38-06_019aee|12     True
2016-04-04_17-38-06_019aee|17     True
2016-04-04_17-38-06_019aee|20    False
2016-04-04_17-38-27_007aee|06     True
                                 ...  
2016-09-27_11-22-19_007d6b|15     True
2016-09-27_11-22-19_007d6b|17    False
2016-09-27_11-22-19_007d6b|19     True
2016-09-27_11-30-35_009d6b|15    False
2016-09-27_11-30-35_009d6b|19    False
Length: 135, dtype: bool

In [333]:
# We can count all the True and Falses with the method .value_counts()
df_check.value_counts()

True     86
False    49
Name: count, dtype: int64

We can see that 49 of our 135 flies (about 36%) are missing some points, so it's best we move ahead with interpolation.

## Extra Task:
Rather than just returning True or False you can create a function that returns the percentage of points you have out of the amount needed. You can then use this to filter out the flies that have less than 75% of their points.
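
One possible approach to this extra task is sketched below. The function name `percent_complete` and the toy data are made up for illustration, and the 75% cut-off is the one suggested above.

```python
import pandas as pd

def percent_complete(data, timestep=60):
    # Number of rows we would expect if no time point were missing
    expected = (data['t'].max() - data['t'].min()) / timestep + 1
    return 100 * len(data) / expected

# Two hypothetical flies: 'fly_b' is missing half of its points
toy = pd.DataFrame({
    'id': ['fly_a'] * 4 + ['fly_b'] * 2,
    't':  [0, 60, 120, 180, 0, 180],
})

# Iterating over a groupby gives (id, sub-DataFrame) pairs
completeness = pd.Series({
    fly_id: percent_complete(group) for fly_id, group in toy.groupby('id')
})

# Keep only the flies with at least 75% of their points
keep_ids = completeness[completeness >= 75].index
filtered = toy[toy['id'].isin(keep_ids)]
```

In this toy case 'fly_a' scores 100 and is kept, while 'fly_b' scores 50 and is dropped.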

As when checking for the points, we'll need to create a function that we can call to apply the interpolation per fly, as we want it to use only each fly's own data. But we'll walk through the steps before creating it. As the data is discrete we'll be using forward filling interpolation, which propagates the last valid observation forward to the next, see [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html) for more information.
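
As a quick illustration of forward filling on its own, the sketch below uses a made-up Series indexed by time, with 1.0 and 0.0 standing in for True and False.

```python
import numpy as np
import pandas as pd

# Observations at 60 second steps, with three missing values
s = pd.Series([1.0, np.nan, np.nan, 0.0, np.nan], index=[0, 60, 120, 180, 240])

# Each NaN takes the value of the last valid observation before it
filled = s.ffill()
```

The result is 1, 1, 1, 0, 0. Note that a NaN at the very start of a series would stay NaN, as there is no earlier value to carry forward.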

If we were working on continuous data we could use linear interpolation, which we'll briefly demonstrate at the end with np.interp, a one-dimensional linear interpolator, see [here](https://numpy.org/devdocs/reference/generated/numpy.interp.html#numpy.interp) for the documentation from numpy.
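
To give a feel for what np.interp does before we reach that section, here is a tiny sketch with made-up numbers where the reading at t = 120 is missing.

```python
import numpy as np

# Known time points and values; t = 120 was not recorded
t_known = np.array([0, 60, 180])
x_known = np.array([0.0, 0.6, 0.0])

# The full, regular time series we want values for: 0, 60, 120, 180
t_new = np.arange(0, 180 + 60, 60)

# New times first, known times second, known values third
x_new = np.interp(t_new, t_known, x_known)
```

The value at t = 120 comes out as 0.3, halfway along the straight line between 0.6 at t = 60 and 0.0 at t = 180.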

In [334]:
# for now we'll work with a subsample of the main DataFrame so we can check things are working before creating the function

# Task:
# With the 'id' of '2016-04-04_17-38-06_019aee|20', create a sub DataFrame with just this data

# small_df = 

small_df = df[df['id'] == '2016-04-04_17-38-06_019aee|20'].copy(deep=True) 

small_df

Unnamed: 0,id,t,moving,micro,walk
28866,2016-04-04_17-38-06_019aee|20,31080,True,False,True
28867,2016-04-04_17-38-06_019aee|20,31140,True,False,True
28868,2016-04-04_17-38-06_019aee|20,31200,True,False,True
28869,2016-04-04_17-38-06_019aee|20,31260,True,False,True
28870,2016-04-04_17-38-06_019aee|20,31320,True,False,True
...,...,...,...,...,...
38400,2016-04-04_17-38-06_019aee|20,608100,True,False,True
38401,2016-04-04_17-38-06_019aee|20,608160,True,False,True
38402,2016-04-04_17-38-06_019aee|20,608220,True,False,True
38403,2016-04-04_17-38-06_019aee|20,608280,True,False,True


In [335]:
# Now we'll want to create a time series that contains all the points we want
# For this we can use np.arange, which creates an array from a given start point to an end point with regular steps
# You need to add your time step to the end point, as np.arange stops one step before the end value otherwise
ts_seq = np.arange(min(small_df['t']), max(small_df['t']) + 60, 60)

# You can see it's an array that increases by 60 at each step by checking the difference per point
np.all(np.diff(ts_seq) == 60)

True

In [336]:
# Next we'll need to merge this back with the data to create new rows with NaN values, which we'll then replace
# To do this we make a pandas Series, which is like a single-column dataframe, named 't'
# With both the small dataframe and the Series containing 't' we can merge the two together using this column as the key
ts_seq = pd.Series(ts_seq, name = 't')
small_df = small_df.merge(ts_seq, on = 't', how = 'right') # A right merge is used as we want the final result to be the length of the new sequence

# Checking for NaN values we can see the new time points are all there
small_df[small_df['moving'].isnull()]

Unnamed: 0,id,t,moving,micro,walk
1132,,99000,,,
1133,,99060,,,
1134,,99120,,,
1135,,99180,,,
1136,,99240,,,
...,...,...,...,...,...
8324,,530520,,,
8325,,530580,,,
8326,,530640,,,
8327,,530700,,,


In [337]:
# Now all we need to call is ffill
small_df.ffill(inplace=True)

In [338]:
# The NaNs have been filled
small_df[small_df['moving'].isnull()]

Unnamed: 0,id,t,moving,micro,walk


In [339]:
small_df

Unnamed: 0,id,t,moving,micro,walk
0,2016-04-04_17-38-06_019aee|20,31080,True,False,True
1,2016-04-04_17-38-06_019aee|20,31140,True,False,True
2,2016-04-04_17-38-06_019aee|20,31200,True,False,True
3,2016-04-04_17-38-06_019aee|20,31260,True,False,True
4,2016-04-04_17-38-06_019aee|20,31320,True,False,True
...,...,...,...,...,...
9617,2016-04-04_17-38-06_019aee|20,608100,True,False,True
9618,2016-04-04_17-38-06_019aee|20,608160,True,False,True
9619,2016-04-04_17-38-06_019aee|20,608220,True,False,True
9620,2016-04-04_17-38-06_019aee|20,608280,True,False,True


We can now make a function that will complete this for the whole dataset.

In [355]:
# Fill in the missing parts with what we've done above

def fill_interpolate(data, timestep = 60):

    # ts_seq = 
    # new_df = 
    ts_seq = np.arange(min(data['t']), max(data['t']) + timestep, timestep)
    ts_seq = pd.Series(ts_seq, name = 't')
    new_df = data.merge(ts_seq, on = 't', how = 'right')
    new_df.ffill(inplace=True)

    return new_df

In [360]:
# Now call a groupby method, applying the interpolate function
df = df.groupby('id', group_keys=False).apply(fill_interpolate)

In [361]:
# Let's use the check function to see if it's worked
df_check = df.groupby('id').apply(check_num_points)
df_check.value_counts()

True    135
Name: count, dtype: int64

### Linear interpolation
For continuous data like the x, y coordinates we can use linear interpolation, which fills in each missing value by placing it on a straight line drawn between the known data points on either side.
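
One thing to watch for: np.interp does not skip missing values for you, so NaN values in the known data can end up in the result. A possible way around this, sketched below on made-up numbers, is to drop the rows where the variable is NaN before interpolating. Whether that is appropriate depends on how large the gaps in your own data are.

```python
import numpy as np
import pandas as pd

# Toy data sampled every 30 seconds, with two missing 'x' readings
toy = pd.DataFrame({
    't': [0, 30, 60, 90, 120],
    'x': [0.2, np.nan, 0.4, np.nan, 0.8],
})

# Keep only the rows where 'x' was actually measured
known = toy.dropna(subset=['x'])

# The regular time series we want values for: 0, 30, 60, 90, 120
ts_seq = np.arange(toy['t'].min(), toy['t'].max() + 30, 30)

x_interp = np.interp(ts_seq, known['t'].to_numpy(), known['x'].to_numpy())
```

The two gaps are filled with 0.3 and 0.6, the midpoints of their neighbouring readings.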

In [362]:
# We'll load in the original dataset to get some continuous data again.
interp_df  = get_data()

# We'll check to see if any are missing datapoints
df_check = interp_df.groupby('id').apply(check_num_points)
df_check.value_counts()

False    135
Name: count, dtype: int64

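Every fly fails the check. Bear in mind that check_num_points defaults to a 60 second timestep while the raw data has a row every 30 seconds, so a failure here doesn't by itself show that points are missing; passing timestep=30 would test the raw sampling rate. Either way, we'll use the same specimen as last time.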

In [363]:
small_interp = interp_df[interp_df['id'] == '2016-04-04_17-38-06_019aee|20'].copy(deep=True) 
small_interp

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
422359,2016-04-04_17-38-06_019aee|20,31080,0.556231,0.047642,0.035975,0.016934,180.666963,30.257077,True,False,True
422360,2016-04-04_17-38-06_019aee|20,31110,0.417508,0.048524,0.047534,0.020466,154.127007,26.580541,True,False,True
422361,2016-04-04_17-38-06_019aee|20,31140,0.326009,0.047780,0.042044,0.019375,32.499655,9.865146,True,False,True
422362,2016-04-04_17-38-06_019aee|20,31170,0.381538,0.049076,0.043198,0.019001,26.114321,10.625785,True,False,True
422363,2016-04-04_17-38-06_019aee|20,31200,0.361699,0.043517,0.045145,0.019143,23.707117,5.229156,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
441432,2016-04-04_17-38-06_019aee|20,608250,0.796086,0.045157,0.052065,0.020053,7.012595,1.313101,True,False,True
441433,2016-04-04_17-38-06_019aee|20,608280,0.374443,0.051471,0.050208,0.019719,31.036929,4.431430,True,False,True
441434,2016-04-04_17-38-06_019aee|20,608310,0.210739,0.050675,0.051257,0.020699,13.768250,1.658855,True,False,True
441435,2016-04-04_17-38-06_019aee|20,608340,0.433078,0.046695,0.048090,0.020305,36.887459,6.745071,True,False,True


In [364]:
# Like before we'll make a new time series of the length and intervals we want
ts_seq = np.arange(min(small_interp['t']), max(small_interp['t']) + 60, 60)

# Call np.interp with the new time series first, the old second, and the corresponding data third
new_seq = np.interp(ts_seq, small_interp['t'].to_numpy(), small_interp['x'].to_numpy())
new_seq

array([0.55623076, 0.32600946, 0.36169883, ..., 0.37444296, 0.43307758,
       0.66402162])

### Extra task:
- Can you make a function that will use np.interp on the whole interp_df dataset per fly for variables 'x' and 'y'?

In [370]:
# EXTRA TASK COMPLETED
def interpolate(data, timestep = 60):

    ts_seq = np.arange(min(data['t']), max(data['t']) + timestep, timestep)
    new_df = pd.DataFrame(data = {'t' : ts_seq})
    
    for i in ['x', 'y']:

        new_df[i] = np.interp(ts_seq, data['t'].to_numpy(), data[i].to_numpy())

    return new_df

idf = interp_df.groupby('id').apply(interpolate)
idf.reset_index(inplace=True, level = 0)

In [154]:
# we can do this for the rest of the columns quickly with a for loop
# for loops are useful for when you need to do the same thing over and over, with a few things changed
for i in ['moving', 'micro', 'walk']:
    small_df[i] = np.where(small_df[i] == True, 1, 0)

# The columns are now nicely binary
small_df

Unnamed: 0,id,t,moving,micro,walk
28866,2016-04-04_17-38-06_019aee|20,31080,1,0,1
28867,2016-04-04_17-38-06_019aee|20,31140,1,0,1
28868,2016-04-04_17-38-06_019aee|20,31200,1,0,1
28869,2016-04-04_17-38-06_019aee|20,31260,1,0,1
28870,2016-04-04_17-38-06_019aee|20,31320,1,0,1
...,...,...,...,...,...
38400,2016-04-04_17-38-06_019aee|20,608100,1,0,1
38401,2016-04-04_17-38-06_019aee|20,608160,1,0,1
38402,2016-04-04_17-38-06_019aee|20,608220,1,0,1
38403,2016-04-04_17-38-06_019aee|20,608280,1,0,1


In [372]:
# Finally, we don't know if each fly's data has a good amount of data points
# Flies with a low amount could indicate they died early or the tracking stopped working

len_check = df.groupby('id').agg(**{
    'length' : ('t', len)
})
len_check['length'].value_counts()

# You can see most flies have over 9000 data points, but 2 have only 200 odd, we'll want to remove them 

length
9697    22
9622    10
9698     9
9702     8
9994     7
9620     6
9621     6
9703     5
9967     4
2385     3
9782     3
9458     2
9498     2
9976     2
9452     2
9947     2
9978     2
9968     2
9711     2
9499     2
9959     2
9701     2
9696     2
9769     2
2672     1
9966     1
9557     1
9680     1
2028     1
9977     1
1299     1
9992     1
9940     1
9768     1
9754     1
9710     1
9459     1
9461     1
9477     1
6633     1
1212     1
6637     1
258      1
259      1
9743     1
9717     1
9505     1
9612     1
9714     1
9753     1
Name: count, dtype: int64

In [374]:
# Can you devise some code that will remove these two flies' data?

len_df = df.groupby('id').agg(**{
    'len' : ('t', len)
})
filt_len = len_df[len_df['len'] < 300]
filt_len
filt_list = filt_len.index.tolist()

tdf = df[~df['id'].isin(filt_list)]
tdf

Unnamed: 0,id,t,moving,micro,walk
0,2016-04-04_17-38-06_019aee|07,31080,True,False,True
1,2016-04-04_17-38-06_019aee|07,31140,True,False,True
2,2016-04-04_17-38-06_019aee|07,31200,True,False,True
3,2016-04-04_17-38-06_019aee|07,31260,True,False,True
4,2016-04-04_17-38-06_019aee|07,31320,True,False,True
...,...,...,...,...,...
9954,2016-09-27_11-30-35_009d6b|19,606300,True,False,True
9955,2016-09-27_11-30-35_009d6b|19,606360,False,False,False
9956,2016-09-27_11-30-35_009d6b|19,606420,True,True,False
9957,2016-09-27_11-30-35_009d6b|19,606480,True,False,True


Coding our categories
=======

The HMM we'll be using is categorical, which means that if we want to use the information from all 3 columns we must create a new column of numbers, where each number represents which variable is True. Hmmlearn takes each observable as a number, with the first being 0, the next 1 and so on. Here we have 3 observables: not moving, micro moving, and walking, so we would like them to be 0, 1, 2 respectively.
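
As a sketch of the mapping we are aiming for, the snippet below codes a small made-up table in one step with np.select, an alternative to chaining np.where calls. The default of -1 is an assumption added here to flag any row that matches none of the conditions.

```python
import numpy as np
import pandas as pd

# One toy row per behaviour: not moving, micro moving, walking
toy = pd.DataFrame({
    'moving': [False, True, True],
    'micro':  [False, True, False],
    'walk':   [False, False, True],
})

# Conditions are checked in order and the first one that is True wins
conditions = [~toy['moving'], toy['micro'], toy['walk']]
codes = [0, 1, 2]

toy['hmm'] = np.select(conditions, codes, default=-1)
```

This gives 0, 1, 2 for the three rows. Because np.select takes the first matching condition while a chain of np.where calls keeps the last, the two only agree when the categories are mutually exclusive, as they are here.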

In [375]:
# To do this we'll use np.where
# np.where tests each row against a condition and creates a new array whose values depend on whether the condition comes back True or False

# At first we'll look for all rows where the flies aren't moving; if True we label it 0, if not we give it a NaN value for now
df['hmm'] = np.where(df['moving'] == 0, 0, np.nan)

# Next we'll look at micro, a fly cannot be both micro moving and walking, they are distinct. So we can just select for True cases.
# We make the False argument what the column was previously to keep the old category
df['hmm'] = np.where(df['micro'] == 1, 1, df['hmm'])

# Now we'll finish with walk, can you complete it?
# df['hmm'] = np.where()
df['hmm'] = np.where(df['walk'] == 1, 2, df['hmm'])

In [380]:
# Next we'll save our cleaned dataframe as a pickle file
# Pickle is a popular format for saving many kinds of Python objects
# Pandas has a built-in method to save the file
df.to_pickle('YOUR_PATH/data/cleaned_data.pkl')

# **Extra Tasks**

## 1. Split the data by Male and Female into separate dataframes
## 2. Convert a continuous float column to a discrete categorical column
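
For the second extra task, one option is pd.cut, which sorts continuous values into labelled bins. The sketch below uses made-up velocities, and the bin edges of 1 and 5 are hypothetical, chosen only to show the mechanics rather than taken from the dataset.

```python
import pandas as pd

# Made-up continuous values standing in for a column like 'mean_velocity'
velocity = pd.Series([0.5, 1.2, 3.0, 8.0, 25.0], name='mean_velocity')

# Hypothetical bin edges: (0, 1], (1, 5], and everything above 5
bins = [0, 1, 5, float('inf')]
labels = [0, 1, 2]

# Each value is replaced by the label of the bin it falls into
velocity_cat = pd.cut(velocity, bins=bins, labels=labels)
```

The five values are coded 0, 1, 1, 2, 2. In practice you would pick the edges by looking at the distribution of the column first.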