# Pandas, data frames and making nice figures with Seaborn (and Matplotlib)
[the official homepage](https://pandas.pydata.org)

## Basic data structures - start with Series then build up to DataFrames

[Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Unlike a 1D NumPy array, each entry is **labeled** with an index that keeps track of what each entry is, and that can be used to look up the corresponding value during analysis.
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specifying usable and informative labels is a good idea!
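
As a minimal sketch of the two ways an index can arise, the snippet below builds the same values once with the default integer labels and once with explicit labels. The label names (`SubA`, `SubB`, `SubC`) are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([0.5, 1.5, 2.5])

# no index given: labels default to 0 .. len(data)-1
s_default = pd.Series(values)

# explicit, informative labels (hypothetical subject names)
s_labeled = pd.Series(values, index=['SubA', 'SubB', 'SubC'])

print(list(s_default.index))   # [0, 1, 2]
print(s_labeled['SubB'])       # 1.5
```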
        
<div class="alert alert-warning">
Pandas will allow you to specify non-unique labels. This can be fine for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error, so it's good general practice to use unique labels!
</div>
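
A small sketch of what non-unique labels can do, using made-up labels `'a'` and `'b'`: a lookup on a repeated label returns several entries instead of one value, and `reindex` is one example of an operation that refuses to work on a duplicated index.

```python
import pandas as pd

# the label 'a' appears twice (toy data for illustration)
s_dup = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])

print(s_dup.index.is_unique)   # False

# a repeated label returns a Series with every match, not a scalar
print(s_dup['a'])

# a unique label still returns a single value
print(s_dup['b'])              # 3.0

# reindexing needs unique labels and raises a ValueError otherwise
reindex_failed = False
try:
    s_dup.reindex(['a', 'b'])
except ValueError as err:
    reindex_failed = True
    print('reindex failed:', err)
```

Checking `index.is_unique` right after building a Series is a cheap way to catch this early.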

## import libs

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# import a generic pandas object and also a few specific functions
import pandas as pd 
from pandas import DataFrame, read_csv

# new - get current path for file i/o later on in tutorial
import os
cwd = os.getcwd()

# also define the default font we'll use for figures. 
fig_font = {'fontname':'Arial', 'size':'20'}

## Create a series from a NumPy ndarray

In [61]:
# make some data and then use pd.Series

# seed the global random number generator so we get the same thing each time
# (np.random.RandomState(0) on its own builds a separate generator and seeds nothing)
np.random.seed(0)

# For this simulation, let's have 20 subjects, and some data
# generated from a Rayleigh distribution 
# (no particular motivation for selecting this distribution, just for something different)
# Rayleigh is the distribution of the length of a 2D vector whose two components
# are independent, zero-mean Gaussians with equal variance
N = 20
data = np.random.rayleigh(scale=1, size=N)

# make a list of subject names for use as index labels
var_name = 'Sub'
index = [var_name + str(n) for n in range(N)]

print(index)

# then make our series by passing in data and our index labels
s = pd.Series(data, index=index)
print(s)

['Sub0', 'Sub1', 'Sub2', 'Sub3', 'Sub4', 'Sub5', 'Sub6', 'Sub7', 'Sub8', 'Sub9', 'Sub10', 'Sub11', 'Sub12', 'Sub13', 'Sub14', 'Sub15', 'Sub16', 'Sub17', 'Sub18', 'Sub19']
Sub0     2.387730
Sub1     0.668752
Sub2     1.812427
Sub3     1.708040
Sub4     0.063006
Sub5     1.697212
Sub6     0.645894
Sub7     1.323838
Sub8     1.033158
Sub9     1.213241
Sub10    0.727011
Sub11    0.652808
Sub12    1.359432
Sub13    2.069343
Sub14    2.047089
Sub15    1.500805
Sub16    2.006576
Sub17    1.412180
Sub18    0.594815
Sub19    1.066827
dtype: float64


## Note that each subject is now a field in the series

In [20]:
# access by field
print(s.Sub11)

# access by index label (like a dictionary)
print(s['Sub11'])

0.968412104214
0.968412104214


## Can also use labels to check for membership or to index over labels

In [25]:
# check for membership
'Sub11' in s

# iterate over index labels, with l==index name
for l in s.index:
    print(l)

# iterate over data in series
for d in s:
    print(d)

Sub0
Sub1
Sub2
Sub3
Sub4
Sub5
Sub6
Sub7
Sub8
Sub9
Sub10
Sub11
Sub12
Sub13
Sub14
Sub15
Sub16
Sub17
Sub18
Sub19
0.5724313933715071
0.919161253533595
1.4223315782361374
0.5738025918805562
2.0509992892515525
0.6066176319710674
2.057260422755941
0.568230663141102
0.7151931567545999
0.7539588152057879
1.2883602368122091
0.9684121042138205
1.1152165323867755
0.9469896126811969
0.7543922178439852
1.7990168647407292
1.3515529404611313
1.711253932085822
1.3732442911700165
0.7410062582482115


## And we can do operations on the series 
* A pd Series behaves similarly to a NumPy ndarray, and can be passed to many NumPy functions
* Slicing also works like an ndarray - note that the index is also sliced
* Lots of built in methods as well that emulate NumPy functionality
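
As a rough illustration of these three points, the sketch below uses a toy Series (values and labels are made up). It assumes nothing beyond NumPy and pandas themselves, and uses `.iloc` to make the positional slice explicit.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 4.0, 9.0, 16.0], index=['a', 'b', 'c', 'd'])

# element-wise NumPy functions return a Series and keep the labels
roots = np.sqrt(s_demo)
print(roots)

# slicing by position keeps the matching slice of the index
first_two = s_demo.iloc[:2]
print(list(first_two.index))        # ['a', 'b']

# the built-in method and the NumPy function give the same answer
print(s_demo.sum(), np.sum(s_demo))
```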

### Can pass pd.Series to most NumPy functions... 

In [64]:
print('Mean: ', np.mean(s), 'Max: ', np.max(s))

Mean:  1.29950919891203 Max:  2.38772968529


In [43]:
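# Grab specific values by position (3rd entry here)
# use .iloc, because s[2] on a string-labeled series is deprecated in recent pandas
print(s.iloc[2])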

# find all entries where data > .9
print(s[s>.9])

0.304905494296
Sub1     2.190370
Sub4     1.181104
Sub5     2.246532
Sub6     1.425475
Sub7     1.259178
Sub10    1.692030
Sub13    1.298887
Sub17    1.312072
Sub18    1.702105
Sub19    0.987555
dtype: float64
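
The difference between selecting by position, by label, and by a boolean mask can be sketched on a toy Series (values made up for illustration):

```python
import pandas as pd

s_demo = pd.Series([0.2, 1.1, 0.95], index=['Sub0', 'Sub1', 'Sub2'])

# by position: third entry
third = s_demo.iloc[2]

# by label: same entry, found through its index label
by_label = s_demo.loc['Sub2']

# by boolean mask: keep entries where the condition is True
above = s_demo[s_demo > 0.9]

print(third, by_label)       # 0.95 0.95
print(list(above.index))     # ['Sub1', 'Sub2']
```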


In [46]:
# More advanced slicing - note that index labels come along for the ride 
s[:-1]    #0:end-1

Sub0     0.583014
Sub1     2.190370
Sub2     0.304905
Sub3     0.864659
Sub4     1.181104
Sub5     2.246532
Sub6     1.425475
Sub7     1.259178
Sub8     0.551014
Sub9     0.853764
Sub10    1.692030
Sub11    0.844297
Sub12    0.823154
Sub13    1.298887
Sub14    0.636744
Sub15    0.508969
Sub16    0.626780
Sub17    1.312072
Sub18    1.702105
dtype: float64

### Series have many built in operations, much like NumPy 
[list of attributes and methods](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [59]:
# attributes
print('Data Type: ', s.dtype)

# basic methods
print('Mean: ', s.mean(), ' Std:', s.std(), 'Max: ', s.max())

# first difference between consecutive entries (the first entry has no predecessor, so it is NaN)
print('Diff: ', s.diff())

Data Type:  int64
Mean:  14.0  Std: 0.0 Max:  14
Diff:  0    NaN
1    0.0
2    0.0
3    0.0
dtype: float64
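
To make `diff()` concrete, here is a small sketch on made-up values where the differences are easy to check by hand: each output entry is the current value minus the previous one.

```python
import numpy as np
import pandas as pd

s_steps = pd.Series([1, 4, 9, 16])

# entry i is s_steps[i] - s_steps[i-1]; the first entry has nothing to subtract
d = s_steps.diff()
print(d)
# 0    NaN
# 1    3.0
# 2    5.0
# 3    7.0
# dtype: float64
```

The result is float64 even though the input is integer, because NaN cannot be stored in a NumPy integer array.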


## Can also make series from scalars (assign all indices same value) or from dicts

### Suppose you want a series with all the same values...you can do this using np.repeat

In [54]:
N=4
data = np.repeat(14, N)
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

0    14
1    14
2    14
3    14
dtype: int32

### Can achieve the same thing like this


In [55]:
# series from scalars
N=4

# don't need repeat because it's a single scalar linked to each index
data = 14
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

0    14
1    14
2    14
3    14
dtype: int64

In [32]:
data = {'Bob' : 20, 'Ella' : 17, 'Sam' : 23, 'Tim' : 25.3}
s = pd.Series(data)
print(s)

Bob     20.0
Ella    17.0
Sam     23.0
Tim     25.3
dtype: float64


<div class="alert alert-info">
Note that the integer entries are upcast to float64, because a Series has a single data type and one entry (25.3) is a float
</div>
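
A quick sketch of how the resulting dtype depends on the values in the dict. The names and values are illustrative only, and the `'n/a'` string is a hypothetical placeholder entry.

```python
import pandas as pd

# all integers: stays an integer dtype
ints_only = pd.Series({'Bob': 20, 'Ella': 17})

# one float among integers: everything becomes float64
mixed = pd.Series({'Bob': 20, 'Tim': 25.3})

# a number and a string have no common numeric type: falls back to object
with_text = pd.Series({'Bob': 20, 'Tim': 'n/a'})

print(ints_only.dtype)
print(mixed.dtype)       # float64
print(with_text.dtype)   # object
```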

## DataFrames

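* A **DataFrame** is a 2D labeled data structure: rows are labeled by an index and columns are labeled by name.
* Each column is a Series, and all columns share the same row index.
* Different columns can hold different data types (e.g. strings in one column, numbers in another).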


## Make a data set that we can play with (we will import some real data later on)
* just make up some stuff here...let's say responses in 10 different neurons to two different stimuli 

In [None]:
# seed the global random number generator
np.random.seed(0)

# labels for the 10 neurons
neurons = ['Nrn1','Nrn2','Nrn3','Nrn4','Nrn5','Nrn6','Nrn7','Nrn8','Nrn9','Nrn10']  

# dependent variable...responses in Hz to the first of two stimulus conditions
resp1_hz = [14, 27, 62, 88, 45, 56, 75, 63, 33, 46]

# set up our response to stimulus 2...use random.randint for fun
min_resp = 1  # inclusive
max_resp = 90 # exclusive
resp2_hz = np.random.randint(min_resp, max_resp, len(resp1_hz))

## New - use 'zip' function to wrap up the data from each list into one list
* does what it sounds like - takes any number of iterables (three here) and groups them into a single iterator of tuples, with the 1st element of each input together, then the 2nd, etc. 
* length of the resulting iterator is limited by the length of the shortest input

[reference](https://www.w3schools.com/python/ref_func_zip.asp)
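
A minimal sketch of the truncation behavior, using two made-up lists of unequal length: the extra element of the longer list is silently dropped.

```python
names = ['Nrn1', 'Nrn2', 'Nrn3']
rates = [14, 27]              # one entry short on purpose

# zip stops as soon as the shortest input runs out
pairs = list(zip(names, rates))
print(pairs)                  # [('Nrn1', 14), ('Nrn2', 27)]
```

Because nothing warns about the dropped element, it is worth checking that the inputs have equal length before zipping them into a data set.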

In [None]:
neuron_data = list(zip(neurons, resp1_hz, resp2_hz))
print(neuron_data)

## Make a dataframe object to hold the contents of the data set

In [None]:
df = pd.DataFrame(data = neuron_data, columns = ['neuron', 'resp1', 'resp2'])

# take a look at the nice output here...
df
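
For comparison, a DataFrame can also be built column by column from a dict of lists. The sketch below uses a shortened, made-up version of the data to show that both routes give the same columns, and that pulling out one column gives back a Series.

```python
import pandas as pd

neurons = ['Nrn1', 'Nrn2', 'Nrn3']
resp1 = [14, 27, 62]
resp2 = [30, 11, 75]   # made-up values for illustration

# row-wise: one tuple per neuron, column names given separately
df_rows = pd.DataFrame(list(zip(neurons, resp1, resp2)),
                       columns=['neuron', 'resp1', 'resp2'])

# column-wise: dict keys become the column names
df_cols = pd.DataFrame({'neuron': neurons, 'resp1': resp1, 'resp2': resp2})

print(df_rows.equals(df_cols))

# a single column is a Series that shares the DataFrame's row index
resp1_col = df_cols['resp1']
print(type(resp1_col).__name__)   # Series
```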

## Saving data in a csv file

In [None]:
# let's save our header as well so that it doesn't think our first row is the header when we read the file back in
df.to_csv('spike_rates.csv',index=False,header=True)

In [None]:
# use our current working directory to build a path to the file
# (os.path.join picks the right path separator for the operating system)
print(cwd)
file_name = os.path.join(cwd, 'spike_rates.csv')
print(file_name)

df = pd.read_csv(file_name)
df
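
The write/read round trip can be sketched without touching the disk by using an in-memory text buffer (`io.StringIO`) in place of a file. The tiny DataFrame here is made up for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'neuron': ['Nrn1', 'Nrn2'], 'resp1': [14, 27]})

# write to an in-memory buffer: header row kept, row index left out
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# read it back: the first line is used as the column names
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

With `index=False` the row labels are not written, so reading the file back produces a fresh default index rather than an extra unnamed column.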

## Get a high-level summary of the data

In [None]:
df.describe()

## Can also apply a set of more targeted analyses using the df object

* [Pandas doc for all functions](https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats)

In [None]:
# average down each column; numeric_only skips the 'neuron' column of strings
df.mean(axis=0, numeric_only=True)
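
The `axis` argument controls the direction of the average. A small sketch with made-up numbers: `axis=0` gives one mean per column, while `axis=1` gives one mean per row (here, per neuron across the two response columns).

```python
import pandas as pd

df_demo = pd.DataFrame({'neuron': ['Nrn1', 'Nrn2'],
                        'resp1': [14, 27],
                        'resp2': [30, 11]})

# one mean per numeric column
col_means = df_demo.mean(axis=0, numeric_only=True)
print(col_means)    # resp1 20.5, resp2 20.5

# one mean per row, restricted to the numeric columns
row_means = df_demo[['resp1', 'resp2']].mean(axis=1)
print(row_means)    # 22.0 and 19.0
```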