# Tutorial 05, Part 1: Pandas DataFrames 
[The official project homepage](https://pandas.pydata.org)

## Basic data structures - start with Series then build up to DataFrames

[Pandas quick start guide for Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Unlike a 1D NumPy array, each entry is **labeled** with an index that keeps track of what each entry is, and that can be used to look up the value corresponding to each label during analysis.
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specify usable and informative labels: meaningful labels are one of the main advantages of organizing your data in this manner. If you just want to fly blind, then NumPy is usually fine on its own.
        
<div class="alert alert-warning">
Pandas will allow you to specify non-unique labels. This can be OK for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error, so in general it's good practice to use unique labels!
</div>
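
As a quick illustration of the points above, here is a minimal sketch of a labeled Series. The neuron names and firing rates are made up for the demo, and the variable names (`rates`, `default_labels`, `dupes`) are just placeholders.

```python
import pandas as pd

# hypothetical firing rates (Hz), labeled with made-up neuron names
rates = pd.Series([14, 27, 62], index=['Nrn1', 'Nrn2', 'Nrn3'])

# look up a value by its label rather than by its position
nrn2_rate = rates['Nrn2']
print(nrn2_rate)

# without an explicit index, the labels default to 0..len(data)-1
default_labels = pd.Series([14, 27, 62])
print(list(default_labels.index))

# non-unique labels are allowed, but a label lookup then returns
# every matching entry (a Series) instead of a single value
dupes = pd.Series([14, 27, 62], index=['Nrn1', 'Nrn1', 'Nrn2'])
print(dupes['Nrn1'])
print(dupes.index.is_unique)
```

Checking `index.is_unique` before doing label-based lookups is one simple way to catch duplicate labels early.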

## Import libs

In [None]:
# standard numpy and matplotlib imports
import numpy as np
import matplotlib.pyplot as plt

# for plotting in a separate window (not inline with notebook output)
# %matplotlib qt

# import a generic pandas object and also a few specific functions that we'll use
import pandas as pd 
from pandas import DataFrame, read_csv

# new - get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

# also define the default font we'll use for figures. 
fig_font = {'fontname':'Arial', 'size':'20'}

## DataFrames

* 
* 
* 


## Make a data set that we can play with (we'll import some real data later on)
* just make up some stuff here...let's say responses in 10 different neurons to two different stimuli

In [None]:
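# seed NumPy's global random number generator so the results are reproducible
# (np.random.RandomState(0) would only create a separate generator and discard it)
np.random.seed(0)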

# labels for our 10 neurons...
neurons = ['Nrn1','Nrn2','Nrn3','Nrn4','Nrn5','Nrn6','Nrn7','Nrn8','Nrn9','Nrn10']

# measured responses in Hz to two stimulus conditions...stimulus 1 first
resp1_hz = [14, 27, 62, 88, 45, 56, 75, 63, 33, 46]

# set up our response to stimulus 2...use random.randint for fun
min_resp = 1  # inclusive
max_resp = 90 # exclusive
resp2_hz = np.random.randint(min_resp, max_resp, len(resp1_hz))

## New - use the 'zip' function to wrap up the data from each list into one list
* does what it sounds like it does - takes any number of iterables (three here) and groups them into a single iterator of tuples, with the 1st element of each input together, then the 2nd, etc.
* length of the resulting iterator is limited by the length of the shortest input

[reference](https://www.w3schools.com/python/ref_func_zip.asp)

In [None]:
neuron_data = list(zip(neurons, resp1_hz, resp2_hz))
print(neuron_data)
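
To make the "shortest input" rule concrete, here is a small sketch using two deliberately mismatched lists. The names `names`, `rates`, and `pairs` are hypothetical and only used for this demo.

```python
# hypothetical lists of unequal length
names = ['Nrn1', 'Nrn2', 'Nrn3']
rates = [14, 27]

# zip stops at the shortest input, so 'Nrn3' is silently dropped
pairs = list(zip(names, rates))
print(pairs)

# zip(*pairs) reverses the grouping: one tuple per original list
unzipped_names, unzipped_rates = zip(*pairs)
print(unzipped_names, unzipped_rates)
```

Because the extra entry is dropped without any warning, it's worth checking that your lists have the same length before zipping them.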

## Make a dataframe object to hold the contents of the data set

In [None]:
df = pd.DataFrame(data = neuron_data, columns = ['neuron', 'resp1', 'resp2'])

# take a look at the nice output here...
df

## Saving data in a csv file

In [None]:
# let's save our header as well so that it doesn't think our first row is the header when we read the file back in
# (index=False leaves the row labels out of the file)
df.to_csv('spike_rates.csv', index=False, header=True)

In [None]:
# use our current working directory to build a path to the file
print(cwd)
file_name = os.path.join(cwd, 'spike_rates.csv')  # portable across operating systems
print(file_name)

df = pd.read_csv(file_name)
df
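
To see why `index=False` matters when saving, here is a small sketch that does the round trip in memory rather than on disk. The toy DataFrame `df_demo` is made up for the demo. Calling `to_csv` without a path returns the CSV text as a string, which can then be read back through `io.StringIO`.

```python
import io
import pandas as pd

# small made-up DataFrame for the demo
df_demo = pd.DataFrame({'neuron': ['Nrn1', 'Nrn2'], 'resp1': [14, 27]})

# with no path given, to_csv returns the CSV text as a string
with_index = df_demo.to_csv()                # row labels are written by default
without_index = df_demo.to_csv(index=False)  # row labels left out

# read both versions back in from memory
round_trip = pd.read_csv(io.StringIO(without_index))
extra_col = pd.read_csv(io.StringIO(with_index))

print(list(round_trip.columns))  # same columns we started with
print(list(extra_col.columns))   # picks up an extra unnamed column of old row labels
```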

## Get a high-level summary of the data

In [None]:
df.describe()

## Can also apply a set of more targeted analyses using the df object

* [Pandas doc for all functions](https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats)

In [None]:
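# average down each numeric column (numeric_only skips the string 'neuron' column,
# which would otherwise raise a TypeError in recent pandas versions)
df.mean(axis=0, numeric_only=True)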
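
As a further sketch of what "more targeted" can look like, the example below computes several statistics at once with `agg` and finds the neuron with the largest response with `idxmax`. The small `df_demo` table is made up here so that the block runs on its own.

```python
import pandas as pd

# small made-up data set in the same shape as the tutorial's df
df_demo = pd.DataFrame({'neuron': ['Nrn1', 'Nrn2', 'Nrn3'],
                        'resp1': [14, 27, 62],
                        'resp2': [10, 30, 50]})

# select the numeric columns explicitly, then compute several statistics at once
stats = df_demo[['resp1', 'resp2']].agg(['mean', 'median', 'max'])
print(stats)

# idxmax gives the row label of the largest value; use it to look up the neuron name
best = df_demo.loc[df_demo['resp1'].idxmax(), 'neuron']
print(best)
```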

## Making cooler DataFrame styles (and more useful...although that should take a backseat to coolness)
[Check here for a bunch of neat style options](https://pandas.pydata.org/pandas-docs/stable/style.html)
* Simple demo - you can write custom functions that highlight specific aspects of your data, which can be very useful for communicating key points in the data within a notebook

In [None]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]
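
To get a feel for what this function returns, here is a minimal sketch that calls it directly on a made-up Series (the values and labels are hypothetical). The function is repeated so the block runs on its own. The result is one CSS string per entry, which is the form pandas' `Styler.apply` expects from a column-wise function, so a call along the lines of `df.style.apply(highlight_max, subset=['resp1', 'resp2'])` should color the maximum in each response column.

```python
import pandas as pd

def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# made-up responses for three neurons
resp = pd.Series([14, 88, 45], index=['Nrn1', 'Nrn2', 'Nrn3'])

# one CSS string per entry: only the maximum gets a background color
styles = highlight_max(resp)
print(styles)
```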