# Allen Cell Types Database

This notebook will serve as an introduction to the Allen Cell Types database. We'll work with the AllenSDK to see what information we can gain about our cells.

First, we'll `import` the `CellTypesCache` class. It provides tools that let us get information from the Cell Types database. We're giving it a **manifest** filename as well. `CellTypesCache` will create this manifest file, which contains metadata about the cache. You can find the file under `cell_types` in your working directory and take a look at it.

(If you're curious you can see the full documentation for the core package <a href="https://allensdk.readthedocs.io/en/latest/allensdk.core.html">here</a>.)

<b>Note</b>: In order to run the cell below, you need to have the AllenSDK installed. You can find information on how to do that <a href="http://alleninstitute.github.io/AllenSDK/install.html">here</a>. If you're running this on the UCSD Datahub, the AllenSDK has already been installed for you.

In [1]:
#Import the "Cell Types Cache" from the AllenSDK core package
from allensdk.core.cell_types_cache import CellTypesCache

#Import CellTypesApi, which will allow us to query the database.
from allensdk.api.queries.cell_types_api import CellTypesApi

# We'll then initialize the cache as 'ctc' (cell types cache)
ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

## Step One: Get Cells & Manipulate Dataframe

Look through <a href="https://allensdk.readthedocs.io/en/latest/allensdk.core.cell_types_cache.html">the documentation for the CellTypesCache</a> for information on the `get_cells` method.

Use the `get_cells` method in the cell below to get information about all of the human cells in the database. Assign the output of this to `human_cells`, and look at the output when it is done.

In [4]:
human_cells = ctc.get_cells(species=[CellTypesApi.HUMAN])
human_cells

[{'reporter_status': None,
  'cell_soma_location': [273.0, 354.0, 216.0],
  'species': 'Homo Sapiens',
  'id': 525011903,
  'name': 'H16.03.003.01.14.02',
  'structure_layer_name': '3',
  'structure_area_id': 12113,
  'structure_area_abbrev': 'FroL',
  'transgenic_line': '',
  'dendrite_type': 'spiny',
  'apical': 'intact',
  'reconstruction_type': None,
  'disease_state': 'epilepsy',
  'donor_id': 524848408,
  'structure_hemisphere': 'right',
  'normalized_depth': None},
 {'reporter_status': None,
  'cell_soma_location': [69.0, 254.0, 96.0],
  'species': 'Homo Sapiens',
  'id': 528642047,
  'name': 'H16.06.009.01.02.06.05',
  'structure_layer_name': '5',
  'structure_area_id': 12141,
  'structure_area_abbrev': 'MTG',
  'transgenic_line': '',
  'dendrite_type': 'aspiny',
  'apical': 'NA',
  'reconstruction_type': None,
  'disease_state': 'epilepsy',
  'donor_id': 528574320,
  'structure_hemisphere': 'left',
  'normalized_depth': None},
 {'reporter_status': None,
  'cell_soma_location':

Chances are, your output looks a bit messy. This is where pandas can really come in handy! Convert `human_cells` into a pandas DataFrame by:
1. Importing `pandas` as `pd`
2. Creating a dataframe with `pd.DataFrame()`
3. Assigning that dataframe to `human_df`
4. Showing the first five rows of the df using the `.head()` method.

Note: If you're having trouble with Pandas, it can help to look at <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/">the user guide</a>.
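To see what `pd.DataFrame()` does with a list of dictionaries, here is a minimal sketch on made-up data. The records in `toy_cells` are hypothetical and only mimic the shape of the `get_cells` output; they are not real entries from the database.

```python
import pandas as pd

# Hypothetical records shaped like the dictionaries returned by get_cells
toy_cells = [
    {"id": 1, "dendrite_type": "spiny", "normalized_depth": None},
    {"id": 2, "dendrite_type": "aspiny", "normalized_depth": 0.5},
    {"id": 3, "dendrite_type": "spiny", "normalized_depth": 0.25},
]

# Each dictionary becomes one row; each key becomes one column
toy_df = pd.DataFrame(toy_cells)

print(toy_df.shape)           # (number of rows, number of columns)
print(list(toy_df.columns))   # column names come from the dictionary keys
print(toy_df.head(2))         # first two rows only
```

Note that the `None` value is stored as a missing value (`NaN`) in the resulting dataframe.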

In [6]:
import pandas as pd

human_df = pd.DataFrame(human_cells)
human_df.head()

,reporter_status,cell_soma_location,species,id,name,structure_layer_name,structure_area_id,structure_area_abbrev,transgenic_line,dendrite_type,apical,reconstruction_type,disease_state,donor_id,structure_hemisphere,normalized_depth
0,,"[273.0, 354.0, 216.0]",Homo Sapiens,525011903,H16.03.003.01.14.02,3,12113,FroL,,spiny,intact,,epilepsy,524848408,right,
1,,"[69.0, 254.0, 96.0]",Homo Sapiens,528642047,H16.06.009.01.02.06.05,5,12141,MTG,,aspiny,,,epilepsy,528574320,left,
2,,"[322.0, 255.0, 92.0]",Homo Sapiens,537256313,H16.03.006.01.05.02,4,12141,MTG,,spiny,truncated,,epilepsy,536912860,right,
3,,"[79.0, 273.0, 91.0]",Homo Sapiens,519832676,H16.03.001.01.09.01,3,12141,MTG,,spiny,truncated,full,epilepsy,518641172,left,0.290951
4,,"[66.0, 220.0, 105.0]",Homo Sapiens,596020931,H17.06.009.11.04.02,4,12141,MTG,,aspiny,,full,tumor,595954915,left,0.497825


Let's get some information about our cells. We can use `len()` on a dataframe to get the number of rows, which is equivalent to the number of observations. Alternatively, the `count()` method on our dataframe reports the number of non-missing entries in each column.

1. Use the `count()` method to see how many observations there are in each of the columns. Why might some be missing?
2. Use `len()` to see the length of the whole dataframe and assign the output to `n_human_cells`.
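The difference between `len()` and `count()` is easiest to see on a small example. The sketch below uses a made-up dataframe (`toy_df` and its values are assumptions for illustration): `len()` counts every row, while `count()` skips missing values, so a column with gaps reports a smaller number.

```python
import pandas as pd

# Made-up data with two missing depth values
toy_df = pd.DataFrame({
    "dendrite_type": ["spiny", "aspiny", "spiny", "aspiny"],
    "normalized_depth": [0.29, None, None, 0.50],
})

n_rows = len(toy_df)             # number of rows, regardless of missing values
per_column = toy_df.count()      # non-missing entries in each column
n_missing = toy_df.isna().sum()  # missing entries in each column

print(n_rows)
print(per_column)
print(n_missing)
```

For each column, the `count()` value plus the number of missing entries adds up to `len()` of the dataframe.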

In [8]:
n_human_cells = len(human_df)
print(n_human_cells)

411


At the moment, our row labels don't carry any useful information -- they're simply a list of integer positions. We can change the row labels with the `set_index` method. Use it to set the 'id' column as the index, and assign the result back to `human_df`.
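As a small illustration on made-up data (the ids and values below are hypothetical), `set_index` returns a new dataframe rather than changing the original, which is why the result needs to be assigned to a name:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "id": [101, 102, 103],
    "dendrite_type": ["spiny", "aspiny", "spiny"],
})

# set_index returns a new dataframe; toy_df itself is left unchanged
indexed_df = toy_df.set_index("id")

# Rows can now be looked up by their id with .loc
print(indexed_df.loc[102, "dendrite_type"])

# 'id' is now the index of indexed_df, but still a column of toy_df
print(list(indexed_df.columns))
print("id" in toy_df.columns)
```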

In [9]:
human_df = human_df.set_index('id')
human_df.head()

id,reporter_status,cell_soma_location,species,name,structure_layer_name,structure_area_id,structure_area_abbrev,transgenic_line,dendrite_type,apical,reconstruction_type,disease_state,donor_id,structure_hemisphere,normalized_depth
525011903,,"[273.0, 354.0, 216.0]",Homo Sapiens,H16.03.003.01.14.02,3,12113,FroL,,spiny,intact,,epilepsy,524848408,right,
528642047,,"[69.0, 254.0, 96.0]",Homo Sapiens,H16.06.009.01.02.06.05,5,12141,MTG,,aspiny,,,epilepsy,528574320,left,
537256313,,"[322.0, 255.0, 92.0]",Homo Sapiens,H16.03.006.01.05.02,4,12141,MTG,,spiny,truncated,,epilepsy,536912860,right,
519832676,,"[79.0, 273.0, 91.0]",Homo Sapiens,H16.03.001.01.09.01,3,12141,MTG,,spiny,truncated,full,epilepsy,518641172,left,0.290951
596020931,,"[66.0, 220.0, 105.0]",Homo Sapiens,H17.06.009.11.04.02,4,12141,MTG,,aspiny,,full,tumor,595954915,left,0.497825


It would help to know what information is in our dataset. In other words, what is across the columns at the top? We can get a list by accessing the attribute `.columns`. Assign the value of this attribute to `human_df_columns`.

In [11]:
human_df_columns = human_df.columns
print(human_df_columns)

Index(['reporter_status', 'cell_soma_location', 'species', 'name',
       'structure_layer_name', 'structure_area_id', 'structure_area_abbrev',
       'transgenic_line', 'dendrite_type', 'apical', 'reconstruction_type',
       'disease_state', 'donor_id', 'structure_hemisphere',
       'normalized_depth'],
      dtype='object')


We can access individual columns with the notation `dataframe['column name']`. Check out the `dendrite_type` column by using this notation.

In [13]:
human_df['dendrite_type']

id
525011903     spiny
528642047    aspiny
537256313     spiny
519832676     spiny
596020931    aspiny
              ...  
508298270     spiny
545612828     spiny
527952884     spiny
488701127     spiny
561469082    aspiny
Name: dendrite_type, Length: 411, dtype: object

Like numpy arrays, we can use boolean indexing to filter our pandas dataframe. Our dataframe has data on several different dendrite types (for example, 'spiny' and 'aspiny'). Filter your dataframe by using the following syntax:
```
new_df = original_df[original_df['Column of Interest'] == 'Desired Value']
```
In plain English, this says: create a new dataframe from the rows of the original dataframe where the value in my Column of Interest is equal to my Desired Value.

1. Assign your new dataframe and give it a reasonable name (e.g., `spiny_df`)
2. Create a second dataframe for a *different* dendrite type (e.g., `aspiny_df`).
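The filtering syntax above works in two steps, which the following sketch separates on made-up data (the ids and the names `is_spiny`, `toy_spiny_df`, and `toy_not_spiny_df` are assumptions for illustration). The comparison first produces a boolean Series, and that Series then selects the matching rows.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"dendrite_type": ["spiny", "aspiny", "spiny", "sparsely spiny"]},
    index=[101, 102, 103, 104],
)

# Step 1: the comparison gives one True/False value per row
is_spiny = toy_df["dendrite_type"] == "spiny"

# Step 2: indexing with that boolean Series keeps only the True rows
toy_spiny_df = toy_df[is_spiny]

# ~ inverts the mask, selecting every row that is NOT spiny
toy_not_spiny_df = toy_df[~is_spiny]

# value_counts shows how many rows fall into each category
counts = toy_df["dendrite_type"].value_counts()

print(toy_spiny_df)
print(toy_not_spiny_df)
print(counts)
```

Checking `value_counts()` before filtering is a quick way to see which values a column actually contains.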

In [14]:
spiny_df = human_df[human_df['dendrite_type'] == 'spiny']
spiny_df.head()

id,reporter_status,cell_soma_location,species,name,structure_layer_name,structure_area_id,structure_area_abbrev,transgenic_line,dendrite_type,apical,reconstruction_type,disease_state,donor_id,structure_hemisphere,normalized_depth
525011903,,"[273.0, 354.0, 216.0]",Homo Sapiens,H16.03.003.01.14.02,3,12113,FroL,,spiny,intact,,epilepsy,524848408,right,
537256313,,"[322.0, 255.0, 92.0]",Homo Sapiens,H16.03.006.01.05.02,4,12141,MTG,,spiny,truncated,,epilepsy,536912860,right,
519832676,,"[79.0, 273.0, 91.0]",Homo Sapiens,H16.03.001.01.09.01,3,12141,MTG,,spiny,truncated,full,epilepsy,518641172,left,0.290951
569095789,,"[110.0, 122.0, 240.0]",Homo Sapiens,H17.06.004.11.05.04,2,12136,AnG,,spiny,intact,full,tumor,569008241,left,0.0564
545608578,,"[312.0, 280.0, 89.0]",Homo Sapiens,H16.03.010.13.06.01,3,12141,MTG,,spiny,intact,,epilepsy,545510854,right,


## Step Two: Get Electrophysiology Data

At this point, you might have realized that this dataframe doesn't contain any data about the electrophysiology -- it's just metadata about the cells. In order to get information about the electrophysiological properties of these cells, we need to use the `get_ephys_features()` method on our instance of the cell types cache.

1. Execute the `get_ephys_features` method on our cell types instance and assign the output of this to `ephys_features`.
2. Convert `ephys_features` into a pandas dataframe.
3. Re-assign the index to be the column labeled 'specimen_id' of the cell (and reassign to `ephys_features`). 'specimen_id' is the label that can link this dataframe to our metadata (human_df) dataframe.

In [16]:
ephys_features = pd.DataFrame(ctc.get_ephys_features()).set_index('specimen_id')
ephys_features.head()

specimen_id,adaptation,avg_isi,electrode_0_pa,f_i_curve_slope,fast_trough_t_long_square,fast_trough_t_ramp,fast_trough_t_short_square,fast_trough_v_long_square,fast_trough_v_ramp,fast_trough_v_short_square,...,trough_t_ramp,trough_t_short_square,trough_v_long_square,trough_v_ramp,trough_v_short_square,upstroke_downstroke_ratio_long_square,upstroke_downstroke_ratio_ramp,upstroke_downstroke_ratio_short_square,vm_for_sag,vrest
529878215,,134.7,22.697498,0.08335459,1.18768,13.2952,1.025916,-56.375004,-57.38542,-57.431251,...,13.29568,1.13478,-56.593754,-57.739586,-74.143753,3.029695,3.061646,2.969821,-80.46875,-73.553391
548459652,,,-24.887498,-3.9136299999999995e-19,1.09984,20.650105,1.02546,-54.0,-54.828129,-54.656254,...,20.650735,1.16094,-55.406254,-55.242191,-73.5,2.441895,2.245653,2.231575,-84.406258,-73.056595
579978640,0.00977,39.0448,-46.765002,0.5267857,1.15784,2.55131,1.025387,-59.5,-58.234378,-59.940975,...,2.55196,1.089851,-60.0625,-58.570314,-61.371531,2.023762,2.162878,2.006406,-93.375008,-60.277321
439024551,-0.007898,117.816429,5.99625,0.1542553,1.989165,9.572025,1.028733,-47.53125,-50.359375,-65.5,...,9.576308,1.423229,-49.406254,-52.718752,-75.273443,3.105931,3.491663,1.733896,-87.65625,-75.205559
515188639,0.022842,68.321429,14.91,0.1714041,1.08198,2.46288,1.02562,-48.437504,-46.520837,-51.406253,...,2.490433,1.47969,-53.000004,-54.645837,-64.250003,3.28576,3.363504,4.234701,-81.625008,-63.474991


Now we have two dataframes: one with the metadata for human cells (indexed by `id`) and another with the electrophysiology data for all cells (indexed by `specimen_id`). Both indices hold the same kind of identifier, and each identifier is unique to a cell, meaning we can match cells across dataframes.

We can use either the `merge` or `join` pandas methods in order to pull all of this data into one dataframe. 

![](http://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)

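There are different types of joins/merges you can do in pandas, illustrated <a href="http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/">above</a>. Here, we want an **inner** join, where we only keep entries whose indices appear in both dataframes. (Alternatively, `merge` can match rows on columns instead of on the index.)

Note that the two methods have different defaults. `merge` performs an **inner** join unless told otherwise, whereas `join` performs a **left** join: it keeps every row of the 'left' dataframe (in other words, the dataframe that is executing the `join` method) and fills in missing values wherever the other dataframe has no match. To request an inner join from `join`, pass `how='inner'`. If every cell in the left dataframe also appears in the other one, both kinds of join give the same result.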

If you need more information, look at the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html">join</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html">merge</a> documentation: you can use either of these to unite your dataframes, though join will be simpler!
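The sketch below compares the two approaches on made-up data. The dataframes `meta` and `ephys`, their ids, and their values are hypothetical; id 103 deliberately has no electrophysiology entry so that the left and inner joins differ.

```python
import pandas as pd

# Hypothetical metadata for three cells
meta = pd.DataFrame(
    {"dendrite_type": ["spiny", "aspiny", "spiny"]},
    index=pd.Index([101, 102, 103], name="id"),
)

# Hypothetical ephys features; 103 is missing and 999 has no metadata
ephys = pd.DataFrame(
    {"vrest": [-70.5, -69.2, -65.0]},
    index=pd.Index([101, 102, 999], name="specimen_id"),
)

# join defaults to how='left': all rows of meta are kept, 103 gets NaN
left_joined = meta.join(ephys)

# An inner join keeps only ids present in both dataframes
inner_joined = meta.join(ephys, how="inner")

# merge defaults to how='inner'; here we match on both indices
inner_merged = meta.merge(ephys, left_index=True, right_index=True)

print(left_joined)
print(inner_joined)
print(inner_merged)
```

Comparing `len()` of the joined dataframe with `len()` of the left dataframe, and checking for unexpected missing values, is a simple way to notice which kind of join was performed.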

In [17]:
human_ephys_features = human_df.join(ephys_features)
human_ephys_features.head()

id,reporter_status,cell_soma_location,species,name,structure_layer_name,structure_area_id,structure_area_abbrev,transgenic_line,dendrite_type,apical,...,trough_t_ramp,trough_t_short_square,trough_v_long_square,trough_v_ramp,trough_v_short_square,upstroke_downstroke_ratio_long_square,upstroke_downstroke_ratio_ramp,upstroke_downstroke_ratio_short_square,vm_for_sag,vrest
525011903,,"[273.0, 354.0, 216.0]",Homo Sapiens,H16.03.003.01.14.02,3,12113,FroL,,spiny,intact,...,4.134987,1.375253,-53.968754,-59.51042,-71.197919,2.895461,2.559876,3.099787,-88.843758,-70.561035
528642047,,"[69.0, 254.0, 96.0]",Homo Sapiens,H16.06.009.01.02.06.05,5,12141,MTG,,aspiny,,...,,1.05116,-67.468758,,-70.875002,1.891881,,1.989616,-101.0,-69.20961
537256313,,"[322.0, 255.0, 92.0]",Homo Sapiens,H16.03.006.01.05.02,4,12141,MTG,,spiny,truncated,...,5.694547,1.3899,-52.125004,-51.520836,-72.900002,3.121182,3.464528,3.054681,-87.53125,-72.628105
519832676,,"[79.0, 273.0, 91.0]",Homo Sapiens,H16.03.001.01.09.01,3,12141,MTG,,spiny,truncated,...,9.96278,1.21102,-53.875004,-52.416668,-73.693753,4.574865,3.817988,4.980603,-84.218758,-72.547661
596020931,,"[66.0, 220.0, 105.0]",Homo Sapiens,H17.06.009.11.04.02,4,12141,MTG,,aspiny,,...,14.66734,1.336668,-63.593754,-63.239583,-75.518753,1.45289,1.441754,1.556087,-82.53125,-74.260269


## Step Three: Confirm the data and take a look!

As a result, you should have a dataframe called 'human_ephys_features' that contains metadata about your cells, as well as their electrophysiological properties.

1. Confirm that you have the right amount of data by checking its length: use a Boolean expression to test whether it is equal to the `n_human_cells` that you assigned above.
2. Confirm that you have all of the columns from *both* the human_df and ephys_features dataframes programmatically. Remember that you can get the columns by accessing the `columns` attribute, and that you already assigned the human_df columns to a variable above. There are a few different ways to do this!
3. Confirm that the only 'species' in your `human_ephys_features` dataframe is 'Homo Sapiens'. You can use the `unique()` method to show unique values in a column.
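One possible way to run these three checks is sketched below on made-up data (`meta`, `ephys`, and `combined` are hypothetical stand-ins for the real dataframes). Comparing *sets* of column names is a slightly stricter test than comparing the number of columns, because it also catches a column that was renamed or replaced.

```python
import pandas as pd

meta = pd.DataFrame(
    {"species": ["Homo Sapiens", "Homo Sapiens"],
     "dendrite_type": ["spiny", "aspiny"]},
    index=[101, 102],
)
ephys = pd.DataFrame(
    {"vrest": [-70.5, -69.2], "avg_isi": [134.7, 39.0]},
    index=[101, 102],
)
combined = meta.join(ephys)

# 1. Same number of rows as the metadata we started with?
same_length = len(combined) == len(meta)

# 2. Exactly the columns of both source dataframes, by name?
expected_columns = set(meta.columns) | set(ephys.columns)
same_columns = set(combined.columns) == expected_columns

# 3. Which species appear in the combined dataframe?
species_found = combined["species"].unique()

print(same_length)
print(same_columns)
print(species_found)
```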

In [23]:
print(n_human_cells == len(human_ephys_features))

all_columns = list(ephys_features.columns) + list(human_df_columns)
complete_final_list = list(human_ephys_features.columns)

print(len(all_columns) == len(complete_final_list))

print(human_ephys_features['species'].unique())

True
True
['Homo Sapiens']


*Finally*, let's take a look at the data. You can use the `describe()` method to show basic summary statistics for the numeric columns; the cell below simply previews the first rows with `head()`. We'll start plotting these metrics next week!

In [25]:
human_ephys_features.head()

id,reporter_status,cell_soma_location,species,name,structure_layer_name,structure_area_id,structure_area_abbrev,transgenic_line,dendrite_type,apical,...,trough_t_ramp,trough_t_short_square,trough_v_long_square,trough_v_ramp,trough_v_short_square,upstroke_downstroke_ratio_long_square,upstroke_downstroke_ratio_ramp,upstroke_downstroke_ratio_short_square,vm_for_sag,vrest
525011903,,"[273.0, 354.0, 216.0]",Homo Sapiens,H16.03.003.01.14.02,3,12113,FroL,,spiny,intact,...,4.134987,1.375253,-53.968754,-59.51042,-71.197919,2.895461,2.559876,3.099787,-88.843758,-70.561035
528642047,,"[69.0, 254.0, 96.0]",Homo Sapiens,H16.06.009.01.02.06.05,5,12141,MTG,,aspiny,,...,,1.05116,-67.468758,,-70.875002,1.891881,,1.989616,-101.0,-69.20961
537256313,,"[322.0, 255.0, 92.0]",Homo Sapiens,H16.03.006.01.05.02,4,12141,MTG,,spiny,truncated,...,5.694547,1.3899,-52.125004,-51.520836,-72.900002,3.121182,3.464528,3.054681,-87.53125,-72.628105
519832676,,"[79.0, 273.0, 91.0]",Homo Sapiens,H16.03.001.01.09.01,3,12141,MTG,,spiny,truncated,...,9.96278,1.21102,-53.875004,-52.416668,-73.693753,4.574865,3.817988,4.980603,-84.218758,-72.547661
596020931,,"[66.0, 220.0, 105.0]",Homo Sapiens,H17.06.009.11.04.02,4,12141,MTG,,aspiny,,...,14.66734,1.336668,-63.593754,-63.239583,-75.518753,1.45289,1.441754,1.556087,-82.53125,-74.260269
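For comparison with `head()`, here is a minimal sketch of what `describe()` returns, using a made-up dataframe (the `vrest` values are hypothetical, not taken from the database). By default, `describe()` summarizes only the numeric columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "dendrite_type": ["spiny", "aspiny", "spiny", "aspiny"],
    "vrest": [-70.0, -68.0, -74.0, -72.0],
})

# Summary statistics (count, mean, std, min, quartiles, max)
# for the numeric columns; 'dendrite_type' is skipped by default
summary = toy_df.describe()

print(summary)
print(summary.loc["mean", "vrest"])
```

The result is itself a dataframe, so individual statistics can be pulled out with `.loc` as shown in the last line.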
