Applying the HMM to the sleep dataset
======

# add links to documentation for bits used

Back to the dataset we'll be working with for this tutorial, a Drosophila movement dataset. Data can come in all formats, but a csv file is one of the most common due to its simplicity and integration with spreadsheets. The data we will be using for this tutorial is real, raw data from the Gilestro lab, where we track and record the movement of fruit flies using machine vision. The tracking can discern small movements and robustly records the flies several times per second, giving a multitude of variables to work with.

We'll be using the pandas package to import and store our data in the notebooks. Pandas is a widely used tool in data science and is built on top of numpy, which we briefly used previously. At the core of pandas is the DataFrame, a table format you will all be familiar with from spreadsheets. Pandas gives you many tools to manipulate the data before you feed it into any analysis or machine learning tool. As with numpy, everything used here will be explained as we use it, but if you'd like to read more about how to use pandas there is a quick tutorial on their website -> [here](https://pandas.pydata.org/docs/user_guide/10min.html)

In [16]:
# first we need to import pandas
# like numpy it is often imported in a shorthand 
import pandas as pd
import numpy as np

In [17]:
import sys
sys.path.append('../src/HMM')
from misc import get_data

# df = get_data()

df = pd.read_csv('/home/lab/Desktop/ReCoDE-HMMs-for-the-discovery-of-behavioural-states/admin/training_data_30_nans.zip')




In [18]:
# def add_nans(data):
#     t_array = data['t'].to_numpy()
#     t_select = np.random.permutation(t_array)[:3]
#     for i in t_select:
#         for q in np.random.permutation(['x', 'y', 'w', 'h', 'max_velocity', 'mean_velocity', 'moving', 'micro', 'walk'])[:4]:
#             data[q][data['t'] == i] = np.nan
#     return data
# df = df.groupby('id', group_keys = False).apply(add_nans)

The Data Structure
=================

In [19]:
df

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
6919122,2016-09-27_10-56-35_053c6b|19,606420,0.537450,0.047642,0.052998,0.023141,7.428117,2.037493,True,False,True
6919123,2016-09-27_10-56-35_053c6b|19,606450,0.211436,0.063828,0.048854,0.024929,21.177698,4.470726,True,False,True
6919124,2016-09-27_10-56-35_053c6b|19,606480,0.131377,0.065893,0.041694,0.025711,10.986990,3.057987,True,False,True
6919125,2016-09-27_10-56-35_053c6b|19,606510,0.512140,0.064421,0.054938,0.021951,29.166126,6.249765,True,False,True


The first column is 'id', which contains a unique ID per fly and will allow us to filter and apply methods to just one fly at a time. The next most important variable is 't', or time. As we are working with time series data we must ensure this is structured properly, i.e. in sequential order and at regular intervals (the latter we will go over). The rest are various variables recorded at each timestamp; for this tutorial we'll only be interested in 'moving', 'micro', and 'walk'.
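
As a rough illustration of what "sequential order and at regular intervals" means in practice, the sketch below checks both properties per fly on a small made-up table. The ids, times, and variable names here are invented for the example and are not part of the tutorial dataset.

```python
import pandas as pd

# Toy data with the same 'id' and 't' columns as the tutorial dataset (values are made up)
toy = pd.DataFrame({
    'id': ['fly_a', 'fly_a', 'fly_a', 'fly_b', 'fly_b', 'fly_b'],
    't':  [0, 30, 60, 0, 30, 90],
})

# True for a fly if its timestamps never decrease
is_sorted = toy.groupby('id')['t'].apply(lambda s: s.is_monotonic_increasing)

# Number of distinct gaps between consecutive timestamps per fly
# A value of 1 means the fly has a single, regular time step
n_step_sizes = toy.groupby('id')['t'].apply(lambda s: s.diff().dropna().nunique())

print(is_sorted)
print(n_step_sizes)
```

In this toy table both flies are in order, but 'fly_b' has two different gap sizes (30 and 60), which is the kind of irregularity the later sections deal with.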

Checking for missing data
======

Most real datasets will not be perfectly populated, with tracking dropping out over the course of an experiment. In a dataframe or an array, data missing at a timepoint or index is represented by a NaN value, which lets methods and functions know there is no data rather than a zero value. However, analysis packages will often throw an error if you feed them NaN values, so it's good practice to check for them first and either remove them or replace them with an approximation.
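
Before filtering, it can help to see how many values are missing in each column. The following is a minimal sketch on a made-up table (the column values are invented); `isna` is the pandas alias of the `isnull` method used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy table with a few missing values (values are made up)
toy = pd.DataFrame({
    'x': [0.1, np.nan, 0.3, np.nan],
    'moving': [True, False, np.nan, True],
})

# isna() marks each missing cell as True; summing the Trues counts NaNs per column
nan_counts = toy.isna().sum()

# any(axis=1) flags rows that have at least one missing value
rows_with_nan = toy.isna().any(axis=1).sum()

print(nan_counts)
print(rows_with_nan)
```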

In [20]:
# Let's filter our dataframe for NaN values
# With pandas you can filter the dataframe by the columns
# To filter or slice the dataframe, put square brackets after the dataframe and inside them give the condition on the column
# For finding NaN values we have to call a method; for other regular filtering you just use ==, <, > and so on

df[df['x'].isnull()]

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
33037,2016-04-04_17-39-22_033aee|03,445260,,0.065451,0.046868,0.019324,7.445241,,,False,
45180,2016-04-04_17-39-22_033aee|04,236160,,,,0.019395,0.756622,0.650191,False,False,
64277,2016-04-04_17-39-22_033aee|05,231810,,0.061041,0.050922,,0.751413,0.657350,,,False
71447,2016-04-04_17-39-22_033aee|05,447090,,,,0.017913,,0.727550,True,True,False
108397,2016-04-04_17-39-22_033aee|07,402270,,0.052370,0.053485,,3.418840,0.886484,,False,
...,...,...,...,...,...,...,...,...,...,...,...
6842546,2016-09-27_10-56-35_053c6b|11,183990,,,0.051491,0.024905,0.702876,,,False,False
6853428,2016-09-27_10-56-35_053c6b|11,510450,,0.062462,,0.019540,16.783354,3.954978,True,,
6873720,2016-09-27_10-56-35_053c6b|13,519690,,,0.050000,,0.715943,,False,False,False
6878121,2016-09-27_10-56-35_053c6b|15,53610,,,,0.019024,0.744524,,False,False,False


In [21]:
# To break down what's happening we can just call what's inside the brackets; you can see that it is an array (or Series in pandas terms) with False or True per row.
# This array then dictates which rows get returned from the whole dataframe, i.e. only the ones that fulfil the condition and are True

df['x'].isnull()

0          False
1          False
2          False
3          False
4          False
           ...  
6919122    False
6919123    False
6919124    False
6919125    False
6919126    False
Name: x, Length: 6919127, dtype: bool

In [22]:
# However, we are not just looking at one column. 
# Luckily with pandas you can filter by multiple columns; all you need to do is put each filter argument in round brackets and then separate them with an & ("and") or | ("or") logical operator

df[(df['x'].isnull()) | (df['y'].isnull())]

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
2774,2016-04-04_17-39-22_033aee|01,114630,0.195341,,0.043329,0.020550,,0.655093,,,False
12327,2016-04-04_17-39-22_033aee|01,401220,0.618850,,,0.020569,35.882174,2.237015,,,True
33037,2016-04-04_17-39-22_033aee|03,445260,,0.065451,0.046868,0.019324,7.445241,,,False,
39946,2016-04-04_17-39-22_033aee|04,79140,0.470888,,0.049861,,0.767147,,False,,False
45180,2016-04-04_17-39-22_033aee|04,236160,,,,0.019395,0.756622,0.650191,False,False,
...,...,...,...,...,...,...,...,...,...,...,...
6877927,2016-09-27_10-56-35_053c6b|15,47790,0.872031,,,,,0.674368,False,False,False
6878121,2016-09-27_10-56-35_053c6b|15,53610,,,,0.019024,0.744524,,False,False,False
6880860,2016-09-27_10-56-35_053c6b|17,56400,0.647482,,,,0.776030,0.665181,False,,False
6880918,2016-09-27_10-56-35_053c6b|17,58140,,,0.043084,0.018435,,0.687163,False,,False


In [36]:
# Now we want to remove those rows containing NaN values as they aren't providing any information
# When filtering for NaNs above we were selecting for them; adding the ~ operator tells the filter to look for the opposite, so where NaN was True it now becomes False
# We join the two conditions with & so that a row is kept only when both 'x' and 'y' are present
# If taking a slice of a dataframe it's good practice to make it a copy, otherwise pandas can raise warnings when you modify it later
df_filtered = df[~ (df['x'].isnull()) & ~ (df['y'].isnull())].copy(deep = True)

# the new DataFrame now won't have any rows where 'x' or 'y' has a NaN value
df_filtered

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
6919122,2016-09-27_10-56-35_053c6b|19,606420,0.537450,0.047642,0.052998,0.023141,7.428117,2.037493,True,False,True
6919123,2016-09-27_10-56-35_053c6b|19,606450,0.211436,0.063828,0.048854,0.024929,21.177698,4.470726,True,False,True
6919124,2016-09-27_10-56-35_053c6b|19,606480,0.131377,0.065893,0.041694,0.025711,10.986990,3.057987,True,False,True
6919125,2016-09-27_10-56-35_053c6b|19,606510,0.512140,0.064421,0.054938,0.021951,29.166126,6.249765,True,False,True
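
When negated filters are combined, the choice between & and | changes which rows survive. The toy sketch below (made-up values) shows the difference, and also shows `dropna` with the `subset` argument, a built-in pandas shortcut that should give the same result as the & version.

```python
import numpy as np
import pandas as pd

# Toy table: row 0 is complete, rows 1 and 2 miss one value, row 3 misses both
toy = pd.DataFrame({
    'x': [0.1, np.nan, 0.3, np.nan],
    'y': [0.5, 0.6, np.nan, np.nan],
})

# "or": keeps a row if at least one of x, y is present, so rows with a single NaN survive
kept_or = toy[~toy['x'].isnull() | ~toy['y'].isnull()]

# "and": keeps a row only if both x and y are present
kept_and = toy[~toy['x'].isnull() & ~toy['y'].isnull()]

# dropna with subset removes rows that have a NaN in any of the listed columns
kept_dropna = toy.dropna(subset=['x', 'y'])

print(len(kept_or), len(kept_and), len(kept_dropna))
```

Only the & version (or `dropna`) removes every row that contains a NaN in the chosen columns.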


## Task:
As stated before, for this tutorial we will be focussing on the variables 'moving', 'micro', and 'walk'. Now that you know how to filter out NaN values, apply this to only these columns.

In [38]:
# To complete 

# df = 

# & keeps only the rows where all three columns are present
df = df[(~df['moving'].isnull()) & (~df['micro'].isnull()) & (~df['walk'].isnull())].copy(deep = True)


## Extra Task:
1) If you're new to pandas (or just want some practice) have a play around with other types of filtering (such as df[df['mean_velocity'] > 5]). It is a quick and easy way to filter your data, and if you're doing the same thing repeatedly you can create a function to do it instantly.

2) Rather than filtering out the NaN values you can replace them with something else. We might know that tracking drops out when the flies are still for a long time, so we could reasonably replace all missing values of 'moving', 'micro', and 'walk' with False.
This can be done with the .fillna method, see here for how to do it -> [fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).
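
A minimal sketch of the `fillna` approach on a made-up table follows. The assumption that a missing value means "not moving" is taken from the task description above; the toy values and the `behaviour_cols` name are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy table with missing behaviour values (values are made up)
toy = pd.DataFrame({
    'moving': [True, np.nan, False],
    'micro': [np.nan, False, True],
    'walk': [True, False, np.nan],
    'x': [0.1, np.nan, 0.3],
})

behaviour_cols = ['moving', 'micro', 'walk']

# Passing a dict to fillna sets the replacement per column; 'x' is not listed so it is left untouched
filled = toy.fillna({col: False for col in behaviour_cols})

print(filled)
```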

Binning the data to a larger time step
======

#Outcome - Why you would want to do it, the benefits and downsides

It's important with Hidden Markov Models that any time series dataset is complete, with no skips due to missing data, as the model will assume the array you're feeding it has the same time step throughout. One way to achieve this is to increase the time step. Currently the dataset has a row for every 30 seconds; however, we know from filtering out the NaN values that we won't have them all. To rectify this we will increase the time step to 60 seconds: as long as there is at least 1 row out of 2 in each 60-second window, we'll have a perfectly populated dataset.

Additionally, doing so will decrease the size of the data, meaning the model will train more quickly. It's always worth trying the model with a few different time steps to see how this affects the output; then you can pick the one you think is the most representative and quickest to train.

In [39]:
# we'll go through it step by step, before wrapping it all in a function
# First we'll need the function floor which rounds down numbers
from math import floor

In [40]:
df2 = df.copy(deep=True)

In [41]:
# first we'll create a new column with the new time step
# lambda functions are an easy way to apply a function per row with a specific column
# We divide the time by the new time step and round down, then multiply the result by the new time step. This gives the largest multiple of the time step that is not above the original time
df['bin_t'] = df['t'].map(lambda t: 60 * floor(t / 60)) # the t represents each row value for the column 't'
df

Unnamed: 0,id,t,x,y,w,h,max_velocity,mean_velocity,moving,micro,walk,bin_t
0,2016-04-04_17-39-22_033aee|01,31140,0.269116,0.069594,0.038829,0.020012,75.662162,25.713480,True,False,True,31140
1,2016-04-04_17-39-22_033aee|01,31170,0.606590,0.068019,0.048224,0.020609,27.471271,9.145901,True,False,True,31140
2,2016-04-04_17-39-22_033aee|01,31200,0.398307,0.070464,0.049073,0.020628,19.718721,5.478951,True,False,True,31200
3,2016-04-04_17-39-22_033aee|01,31230,0.469571,0.066383,0.046558,0.020423,20.224544,7.475374,True,False,True,31200
4,2016-04-04_17-39-22_033aee|01,31260,0.260085,0.073667,0.047548,0.020133,34.824007,6.163203,True,False,True,31260
...,...,...,...,...,...,...,...,...,...,...,...,...
6919122,2016-09-27_10-56-35_053c6b|19,606420,0.537450,0.047642,0.052998,0.023141,7.428117,2.037493,True,False,True,606420
6919123,2016-09-27_10-56-35_053c6b|19,606450,0.211436,0.063828,0.048854,0.024929,21.177698,4.470726,True,False,True,606420
6919124,2016-09-27_10-56-35_053c6b|19,606480,0.131377,0.065893,0.041694,0.025711,10.986990,3.057987,True,False,True,606480
6919125,2016-09-27_10-56-35_053c6b|19,606510,0.512140,0.064421,0.054938,0.021951,29.166126,6.249765,True,False,True,606480


You should see in the column 'bin_t' that rows next to each other now share a time step. Now that we have that, we'll want to pivot or group by this column so that all rows with the same time stamp are collected together.
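
The same binning can also be written without a lambda, using floor division on the whole column at once. This is only an alternative sketch on a few made-up rows (the times match the first rows shown above); the column names `bin_t_map` and `bin_t_vec` are invented for the comparison.

```python
import pandas as pd
from math import floor

toy = pd.DataFrame({'t': [31140, 31170, 31200, 31230, 31260]})

# Row-by-row version, using map with a lambda
toy['bin_t_map'] = toy['t'].map(lambda t: 60 * floor(t / 60))

# Vectorised version: // is floor division and is applied to the whole column at once
toy['bin_t_vec'] = (toy['t'] // 60) * 60

print(toy)
```

Both columns hold the same values; on millions of rows the vectorised form is usually noticeably faster.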

In [42]:
# The pandas groupby method does this, all you need to do is call the method with the column you want to pivot by in the brackets
# Then you can tell it what aggregating function you want to call on the columns of interest
df_grouped = df.groupby('bin_t').agg(**{
            'x' : ('x', 'mean'), # before the brackets is the name of the new column, we'll keep it the same
            'y' : ('y', 'mean')  # within the brackets is the column you want to use and the function to apply to it. You have 'mean', 'median', 'max', etc. built in, but you can also use your own functions
})

df_grouped

Unnamed: 0_level_0,x,y
bin_t,Unnamed: 1_level_1,Unnamed: 2_level_1
6960,0.548046,0.055010
7020,0.483673,0.055646
7080,0.502220,0.056058
7140,0.499007,0.056736
7200,0.535441,0.055255
...,...,...
608100,0.486551,0.052753
608160,0.466558,0.052861
608220,0.485503,0.052645
608280,0.474725,0.053159


Some of you may have noticed that doing it this way will aggregate our whole dataset and lose the information per fly. To keep this information we can call a groupby with two levels, the first will be the higher level that the data is grouped by first, and the second the one that the functions will be applied to.

In [43]:
# We do exactly the same, but instead of just 'bin_t' we have a list with 'id' first
# Calling it this way on a lot of rows can take a few minutes or more depending on your computer, so don't worry if it takes a while
df_grouped = df.groupby(['id', 'bin_t']).agg(**{
            'x' : ('x', 'mean'),
            'y' : ('y', 'mean')
})
# We need to reset the index as it will have both 'id' and 'bin_t' as the index
df_grouped.reset_index(inplace = True)
# We'll also rename the column 'bin_t' back to 't' for clarity
df_grouped.rename(columns = {'bin_t' : 't'}, inplace = True)
df_grouped

Unnamed: 0,id,t,x,y
0,2016-04-04_17-38-06_019aee|01,31080,0.571842,0.056286
1,2016-04-04_17-38-06_019aee|01,31140,0.566849,0.056479
2,2016-04-04_17-38-06_019aee|01,31200,0.490586,0.059259
3,2016-04-04_17-38-06_019aee|01,31260,0.438665,0.056716
4,2016-04-04_17-38-06_019aee|01,31320,0.749282,0.051902
...,...,...,...,...
3460000,2016-09-27_11-30-35_009d6b|19,606300,0.568432,0.052146
3460001,2016-09-27_11-30-35_009d6b|19,606360,0.568489,0.052408
3460002,2016-09-27_11-30-35_009d6b|19,606420,0.568645,0.053953
3460003,2016-09-27_11-30-35_009d6b|19,606480,0.550620,0.051864


## Task:
The same as before: re-create the steps above, but for the columns 'moving', 'micro', and 'walk'. Instead of mean use max, as we care about the most dominant behaviour in that time window; it will also keep our results as either True or False, which are discrete categories.

In [44]:
# To complete 

# df = 

df = df.groupby(['id', 'bin_t']).agg(**{
            'moving' : ('moving', 'max'),
            'micro' : ('micro', 'max'),
            'walk' : ('walk', 'max')
})

In [45]:
df.reset_index(inplace = True)
# We'll also rename the column 'bin_t' back to 't' for clarity
df.rename(columns = {'bin_t' : 't'}, inplace = True)
df

Unnamed: 0,id,t,moving,micro,walk
0,2016-04-04_17-38-06_019aee|01,31080,True,False,True
1,2016-04-04_17-38-06_019aee|01,31140,True,False,True
2,2016-04-04_17-38-06_019aee|01,31200,True,False,True
3,2016-04-04_17-38-06_019aee|01,31260,True,False,True
4,2016-04-04_17-38-06_019aee|01,31320,True,False,True
...,...,...,...,...,...
3460000,2016-09-27_11-30-35_009d6b|19,606300,True,False,True
3460001,2016-09-27_11-30-35_009d6b|19,606360,False,False,False
3460002,2016-09-27_11-30-35_009d6b|19,606420,True,True,False
3460003,2016-09-27_11-30-35_009d6b|19,606480,True,False,True
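
The binning steps can be wrapped into a single function, sketched below. The name `bin_time` and its parameters are hypothetical, chosen for this example rather than taken from the tutorial's source code, and the toy data is made up.

```python
import pandas as pd

def bin_time(data, columns, bin_secs=60, function='max'):
    """Hypothetical helper: bin 't' into windows of bin_secs and aggregate per fly."""
    data = data.copy()
    # Floor every time to the start of its window
    data['bin_t'] = (data['t'] // bin_secs) * bin_secs
    # Aggregate each requested column within every (fly, window) pair
    grouped = data.groupby(['id', 'bin_t']).agg(**{col: (col, function) for col in columns})
    # Move 'id' and 'bin_t' back into columns and restore the name 't'
    return grouped.reset_index().rename(columns={'bin_t': 't'})

toy = pd.DataFrame({
    'id': ['fly_a'] * 4 + ['fly_b'] * 2,
    't': [0, 30, 60, 90, 0, 30],
    'moving': [True, False, False, False, False, True],
})

binned = bin_time(toy, ['moving'])
print(binned)
```

Keeping the time step and the aggregating function as parameters makes it easy to try a few different time steps, as suggested earlier.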


Interpolating the data
========

Another method to fill in the gaps in data is interpolation. This is where you estimate the value at a given timepoint from the rest of the dataset. If you have just a few points missing, the interpolation results can be quite accurate. Here we'll run through the steps for interpolating your data.

In [53]:
# First we can check if we have all the data points for each fly
# We'll use this method to check

def check_num_points(data, timestep=60):

    array_time = max(data['t']) - min(data['t'])

    if (array_time / timestep) >= len(data):
        return False
    else:
        return True

# We need to call the function on each fly individually
# To do this you call a groupby with 'id' as the argument
df_check = df.groupby('id').apply(check_num_points)

# This gives us a pandas series of True and False for each fly
df_check

id
2016-04-04_17-38-06_019aee|01    False
2016-04-04_17-38-06_019aee|02    False
2016-04-04_17-38-06_019aee|03    False
2016-04-04_17-38-06_019aee|04    False
2016-04-04_17-38-06_019aee|05    False
                                 ...  
2016-09-27_11-30-35_009d6b|11     True
2016-09-27_11-30-35_009d6b|13     True
2016-09-27_11-30-35_009d6b|15    False
2016-09-27_11-30-35_009d6b|17     True
2016-09-27_11-30-35_009d6b|19    False
Length: 381, dtype: bool

In [54]:
# We can count all the True and Falses with the method .value_counts()
df_check.value_counts()

True     216
False    165
Name: count, dtype: int64

We can see that 165 of our 381 flies (about 43%) are missing some points, so it's best we move ahead with interpolation.

## Extra Task:
Rather than just returning True or False, you can create a function that returns the percentage of the expected points that each fly has. You can then use this to filter out the flies that have less than 75% of their points.
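
One possible shape for such a function is sketched below on made-up data. The helper name `percent_points` is hypothetical, and it takes only the time column of one fly rather than the whole sub-dataframe.

```python
import pandas as pd

def percent_points(times, timestep=60):
    """Hypothetical helper: percentage of the expected time points present for one fly."""
    # A complete series from the first to the last time has this many points
    expected = (times.max() - times.min()) / timestep + 1
    return 100 * len(times) / expected

toy = pd.DataFrame({
    'id': ['fly_a'] * 4 + ['fly_b'] * 2,
    't': [0, 60, 120, 180, 0, 180],
})

# One percentage per fly
coverage = toy.groupby('id')['t'].apply(percent_points)

# Keep only the flies with at least 75% of their expected points
keep_ids = coverage[coverage >= 75].index
toy_filtered = toy[toy['id'].isin(keep_ids)]

print(coverage)
```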

As when checking for the points, we'll need to create a function that we can call to apply the interpolation per fly, as we want it to use only each fly's data. But we'll walk through the steps before creating it. The function we'll be using is np.interp, a one-dimensional linear interpolator; see [here](https://numpy.org/devdocs/reference/generated/numpy.interp.html#numpy.interp) for the documentation from numpy.
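
To get a feel for what np.interp returns, here is a tiny made-up example with one missing time point. Note that linear interpolation between a 1 and a 0 gives a fractional value; the thresholding step at the end is only one possible way to get back to binary values and is an assumption of this sketch, not part of the tutorial's method.

```python
import numpy as np

t_known = np.array([0, 120, 180])    # observed time points; t = 60 is missing
moving_known = np.array([1, 0, 0])   # 1 = moving, 0 = not moving

# The complete, regular time series we want values for
t_full = np.arange(0, 180 + 60, 60)

# The value at t = 60 is taken from the straight line between t = 0 and t = 120
interpolated = np.interp(t_full, t_known, moving_known)
print(interpolated)

# Assumed post-processing: treat anything at or above 0.5 as moving
binary = (interpolated >= 0.5).astype(int)
print(binary)
```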

In [48]:
# for now we'll work with a subsample of the main DataFrame so we can check things are working before creating the function

# Task:
# With the 'id' of '2016-04-04_17-38-06_019aee|01', create a sub DataFrame with just this data

# small_df = 

small_df = df[df['id'] == '2016-04-04_17-38-06_019aee|01'].copy(deep=True)

small_df

Unnamed: 0,id,t,moving,micro,walk
0,2016-04-04_17-38-06_019aee|01,31080,True,False,True
1,2016-04-04_17-38-06_019aee|01,31140,True,False,True
2,2016-04-04_17-38-06_019aee|01,31200,True,False,True
3,2016-04-04_17-38-06_019aee|01,31260,True,False,True
4,2016-04-04_17-38-06_019aee|01,31320,True,False,True
...,...,...,...,...,...
9554,2016-04-04_17-38-06_019aee|01,608100,False,False,False
9555,2016-04-04_17-38-06_019aee|01,608160,False,False,False
9556,2016-04-04_17-38-06_019aee|01,608220,False,False,False
9557,2016-04-04_17-38-06_019aee|01,608280,False,False,False


In [49]:
# Now we'll want to create a time series that contains all the points we want
# For this we can use np.arange which creates an array from a given start point to an end point, with regular steps
# You need to add your time step to the end point, as np.arange stops one step before the end point otherwise
ts_seq = np.arange(min(small_df['t']), max(small_df['t']) + 60, 60)

# It's an array that increases by 60 at each step; here we check its length
len(ts_seq)

9622

In [50]:
# Now we can call np.interp; the first argument takes the new time array, the second the x axis data (which is time) and the third the y axis data (movement)
np.interp(ts_seq, small_df['t'].to_numpy(), small_df['moving'].to_numpy())

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

####  You'll notice this causes an error. This is because our data at the moment is a boolean, Trues and Falses. We can rectify this by changing them to 1s and 0s

In [51]:
# To do this we'll use np.where
# np.where tests a condition on every value in a column and creates a new array, giving one value where the condition is True and another where it is False
small_df['moving'] = np.where(small_df['moving'] == True, 1, 0)

In [52]:
# we can do this for the rest of the columns quickly with a for loop
# for loops are useful for when you need to do the same thing over and over, with a few things changed
for i in ['moving', 'micro', 'walk']:
    small_df[i] = np.where(small_df[i] == True, 1, 0)

# The columns are now nicely binary
small_df

Unnamed: 0,id,t,moving,micro,walk
0,2016-04-04_17-38-06_019aee|01,31080,1,0,1
1,2016-04-04_17-38-06_019aee|01,31140,1,0,1
2,2016-04-04_17-38-06_019aee|01,31200,1,0,1
3,2016-04-04_17-38-06_019aee|01,31260,1,0,1
4,2016-04-04_17-38-06_019aee|01,31320,1,0,1
...,...,...,...,...,...
9554,2016-04-04_17-38-06_019aee|01,608100,0,0,0
9555,2016-04-04_17-38-06_019aee|01,608160,0,0,0
9556,2016-04-04_17-38-06_019aee|01,608220,0,0,0
9557,2016-04-04_17-38-06_019aee|01,608280,0,0,0


In [55]:
# Calling np.interp should now work, returning an array the length of ts_seq
new_mov = np.interp(ts_seq, small_df['t'].to_numpy(), small_df['moving'].to_numpy())
new_mov

array([1., 1., 1., ..., 0., 0., 1.])

We can now try to make a function that works on the whole dataframe, but first we'll change all the Trues and Falses to 1s and 0s in the main dataset.

In [56]:
for i in ['moving', 'micro', 'walk']:
    df[i] = np.where(df[i] == True, 1, 0)

In [57]:
# Fill in the missing parts with what we've done above

def interpolate(data, timestep = 60):
    
    # ts_seq = 
    ts_seq = np.arange(min(data['t']), max(data['t']) + timestep, timestep)
    
    # repeat the fly's id once for every time point in the new series
    id_list = [data['id'].iloc[0]] * len(ts_seq)

    new_df = pd.DataFrame(data = {'id' : id_list, 't' : ts_seq})

    for i in ['moving', 'micro', 'walk']:

        new_df[i] = np.interp(ts_seq, data['t'].to_numpy(), data[i].to_numpy())

    return new_df

In [58]:
# Now call a groupby method, applying the interpolate function
# df = 

df = df.groupby('id', group_keys=False).apply(interpolate)

In [None]:
# Re-run the check function to see if every fly now has all of its points

In [None]:
df_check = df.groupby('id').apply(check_num_points)
df_check.value_counts()

The HMM we'll be using is categorical, which means that if we want to use all the information from the 3 columns we must create a new column whose numbers represent each variable when it is True. Hmmlearn takes each observable as a number, with the first being 0, the next 1, and so on. Here we have 3 observables: not moving, micro moving, and walking, so 0, 1, and 2 respectively.
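
As a toy illustration of this encoding, the sketch below maps three made-up rows to the categories 0, 1, and 2 in one call using np.select. This is just one compact way to express the mapping, with walking taking priority over micro movement; the values are invented for the example.

```python
import numpy as np
import pandas as pd

# One row per behaviour: not moving, micro moving, walking (values are made up)
toy = pd.DataFrame({
    'moving': [0, 1, 1],
    'micro':  [0, 1, 0],
    'walk':   [0, 0, 1],
})

# np.select checks the conditions in order and uses the choice of the first one that is True
conditions = [toy['walk'] == 1, toy['micro'] == 1, toy['moving'] == 0]
choices = [2, 1, 0]

# Rows matching none of the conditions are left as NaN
toy['hmm'] = np.select(conditions, choices, default=np.nan)

print(toy)
```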

In [59]:
# We'll use np.where again here

# First we'll look for all rows where the flies aren't moving; if True we label it 0, if not we give it a NaN value for now
df['hmm'] = np.where(df['moving'] == 0, 0, np.nan)

# Next we'll look at micro, a fly cannot be both micro moving and walking, they are distinct. So we can just select for True cases.
# We make the False argument what the column was previously to keep the old category
df['hmm'] = np.where(df['micro'] == 1, 1, df['hmm'])

# Now we'll finish with walk, can you complete it?
# df['hmm'] = np.where()
df['hmm'] = np.where(df['walk'] == 1, 2, df['hmm'])

In [61]:
df

In [67]:
# Next we'll save our cleaned dataframe as a pickle file 
# Pickles are a popular format for saving many different types of Python variables
# Pandas has a built-in function to save the file
df.to_pickle('YOUR_PATH/cleaned_data.pkl')
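
To load the data back in a later notebook you can use pd.read_pickle. The sketch below does a save-and-load round trip on a small made-up table, writing to a temporary directory so that no real path is assumed.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'id': ['fly_a', 'fly_a'], 't': [0, 60], 'hmm': [0, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cleaned_data.pkl')
    # Save the dataframe, then read it straight back
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# The loaded dataframe has the same values and column types as the original
print(loaded.equals(toy))
```

Only load pickle files from sources you trust, as unpickling can run arbitrary code.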


# **Extra Tasks**

## 1. Split the data by Male and Female into separate dataframes
## 2. Convert a continuous float column to a discrete categorical column