# Tutorial 05, Part 0: Pandas, series, data frames 
[The official project homepage](https://pandas.pydata.org)

## Basic data structures - start with Series then build up to DataFrames

[Pandas quick start guide for Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Unlike a 1D numpy array, each entry is **labeled** with an index that is used to keep track of what each entry is, and can be used to lookup the value corresponding to each index during analysis.
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specifying usable and informative labels is a good idea because that's one of the main advantages of organizing your data in this manner - if you just want to fly blind then NumPy is usually fine on its own
        
<div class="alert alert-warning">
Pandas will allow you to specify non-unique labels. This can be ok for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error so in general its good practice to use unique labels!
</div>

## Import libs

In [1]:
# standard numpy and matplotlib imports
import numpy as np
import matplotlib as plt

# for plotting in a separte window (not inline with notebook output)
# %matplotlib qt

# import a generic pandas object and also a few specific functions that we'll use
import pandas as pd 
from pandas import DataFrame, read_csv

# new - get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

# also define the default font we'll use for figures. 
fig_font = {'fontname':'Arial', 'size':'20'}

## Create a series from an numpy ndarray

In [16]:
# make some data and then use pd.Series

# random seed so we get the same thing each time 
np.random.RandomState(0)

# For this simulation, lets have 20 subjects, and some data
# generated from a Rayleigh distribution 
# (no particular motivation for selecting this distribution, just for something different)
# Rayleigh is the distribution of vector magnitudes generated by two independent components (e.g. wind speed)
N = 12
data = np.random.rayleigh(scale=1, size=N)

# make a list of subject names for use as an index labels
label_prefix = 'Sub'
index=[]
for n in np.arange(N):
    index.append(label_prefix+str(n))

print('Index labels: ', index, '\n')

# then make our pandas series by passing in our data array and our index labels
s = pd.Series(data, index=index)
print(s)

Index labels:  ['Sub0', 'Sub1', 'Sub2', 'Sub3', 'Sub4', 'Sub5', 'Sub6', 'Sub7', 'Sub8', 'Sub9', 'Sub10', 'Sub11'] 

Sub0     0.699258
Sub1     1.274437
Sub2     1.780352
Sub3     0.201680
Sub4     0.804139
Sub5     1.835721
Sub6     0.493615
Sub7     1.047985
Sub8     1.871220
Sub9     2.090627
Sub10    1.343622
Sub11    1.080105
dtype: float64


## Note that each subject is now a field in the series and can be used to retrieve the corresponding value...there are a few ways to do this

In [None]:
# access by field
print(s.Sub11)

# access by index label (like a dictionary)
print(s['Sub11'])

## Can also use labels to check for membership or to index over labels

In [None]:
# check for membership
'Sub11' in s

# iterate over index labels, with l==index name
for l in s.index:
    print(l)

# iterate over data in series
for d in s:
    print(d)

## Before moving on, there are a few other optional (but important) parameters of the pd.Series call
* dtype - default is to infer the data type (int32, float64, str, etc) based on the values in data
    * However, can also explicitly declare the data
    * This can be good if you want to, for example, re-cast the data to save space or to make types compatible
    * But this may also have important negative consequences if not done thoughtfully! 
* copy - if not specified then the default behavior is set to False and the new series will have a 'view' of the data.
    * This can save space, but can sometimes lead to confusion as any change to the values in s will also change the values in the original 'data' array
    * Setting copy=False will make a new copy of the data in 's' that is independent of the input 'data' array


### Explicitly declare a different dtype to see where things can go wrong

In [None]:
# make a series with the data array from above, but make it int32 instead of the inferred (and correct) float64 type
s = pd.Series(data, index=index, dtype='int32')

# first 4 values in our original data array
print(data[:4])

# first 4 values in our series of type int32...might not be what you want!
print('\n', s[:4])

### Another example: declaring dtype can be handy if you want to, for example, do str manipulations with the data array or if you want to merge with another series of type str

In [None]:
# make a series with the data array from above, but this time make it a str
# instead of the inferred float64 type
s = pd.Series(data, index=index, dtype='str')

# first 4 values in our original data array
print(data[:4])

# first 4 values in our series of type str...preserves info and we're now
# all set to do a bunch of str operation without having to deal with 
# explictly recasting each time we interact with the values in s
print('\n', s[:4])

<div class="alert alert-info">
Note that the dtype of series 's' is now an 'object'. This is the Pandas equivalent of a Python 'str'
</div>

### Now explicitly ask for a 'view' of the data instead of the default copy

In [8]:
# same as before - create a series based on a short data array (0:4 in this case for simplicity)
# let Pandas figure out the dtype, and use the default copy behavior (i.e. copy=False)

N = 4                # number of data points

# make data
data = np.arange(N)
# make index labels
index = ['d1','d2','d3','d4']

# print out the original data array for reference
print('Original data: ', data, '\n')

# make a series with the default behavior of copy=False
s = pd.Series(data, index=index, copy=False)

# print out the new series
print('Original values in series')
print(s)

# now change the value of the first entry in the series
s['d1'] = 100

# new values in series 's'
print('\nNew values in series')
print(s)

# and then print the corresponding entry in the data array
print('\nNew data:', data, '\ndata[0] changed too!')

# Note that data[0] changed because the values in s are a view of data...
# both are referencing the same chunk of memory

Original data:  [0 1 2 3] 

Original values in series
d1    0
d2    1
d3    2
d4    3
dtype: int32

New values in series
d1    100
d2      1
d3      2
d4      3
dtype: int32

New data: [100   1   2   3] 
data[0] changed too!


<div class="alert alert-danger">
Note that this works in the other direction too, which can be more insidious...if you create a Series based on the values in 'data', and then do more work with 'data', then every time you change a value in the original data array, you will also change the corresponding value in s!!!
</div>

In [11]:
# now do the same thing but this time lets explicitly ask for a copy of the data
N = 4                # number of data points

# make data
data = np.arange(N)
# make index labels
index = ['d1','d2','d3','d4']

# print out the original data array for reference
print('Original data: ', data, '\n')

# make a series, but change the default behavior of copy to copy=True
s = pd.Series(data, index=index, copy=True)

# print out the new series
print('Original values in series')
print(s)

# now change the value of the first entry in the series
s['d1'] = 100

# new values in series 's'
print('\nNew values in series')
print(s)

# and then print the corresponding entry in the data array
print('\nNew data is the same as the old data:', data)
print('data[0] did not change because it is independent from values in s')

Original data:  [0 1 2 3] 

Original values in series
d1    0
d2    1
d3    2
d4    3
dtype: int32

New values in series
d1    100
d2      1
d3      2
d4      3
dtype: int32

New data is the same as the old data: [0 1 2 3]
data[0] did not change because it is independent from values in s


## After creating a pandas series, you can do many common operations and access the functionality of other modules 
* A pd Series behaves similar to a NumPy ndarray, and can be passed to many NumPy functions
* Slicing also works like a ndarray - note that index is also sliced
* Lots of built in methods as well that emulate NumPy functionality

### Can pass pd.Series to most NumPy functions... 

In [28]:
# make a new series...
N = 16
data = np.random.exponential(size=N)

# make some labels
label_prefix = 'Exp'
index=[]
for n in np.arange(N):
    index.append(label_prefix+str(n))
    
# make the series
s = pd.Series(data, index=index)
print('\nMean: ', np.mean(s), 'Max: ', np.max(s))


Mean:  1.16039477001477 Max:  3.25989519135


#### Note that the index labels come along for the ride 

In [40]:
# print our series - set of index labels along with data values
print(s)

# then apply the NumPy cumulative product operation (multiply N with N+1, then that result by N+2, etc)
cp = np.cumprod(s)

print('\nCumproduct\n')
print(print(cp))

# cool part: note that the output also contains the label info, which is handy to keep track of things!
# and you can index into cp using the index labels
print(cp['Exp10'])
print(cp.Exp10)

Exp0     0.922508
Exp1     3.259895
Exp2     0.271713
Exp3     1.029167
Exp4     0.616481
Exp5     2.044292
Exp6     0.467407
Exp7     2.982640
Exp8     0.379235
Exp9     0.319465
Exp10    1.292761
Exp11    0.036772
Exp12    0.979448
Exp13    2.557037
Exp14    1.111187
Exp15    0.296307
dtype: float64

Cumproduct

Exp0     0.922508
Exp1     3.007281
Exp2     0.817117
Exp3     0.840950
Exp4     0.518430
Exp5     1.059822
Exp6     0.495368
Exp7     1.477506
Exp8     0.560321
Exp9     0.179003
Exp10    0.231408
Exp11    0.008509
Exp12    0.008334
Exp13    0.021311
Exp14    0.023681
Exp15    0.007017
dtype: float64
None
0.231408434073
0.231408434073


In [None]:
data = np.random.exponential(N)

In [13]:
# Grab specific values (3rd entry here)
print(s[2])

# find all entries where data > .9
print(s[s>.9])

2
d1    100
d2      1
d3      2
d4      3
dtype: int32


In [None]:
# More advanced slicing - note that index labels come along for the ride 
s[:-1]    #0:end-1

### Series have many built in operations, much like NumPy 
[list of attributes and methods](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [None]:
# attributes
print('Data Type: ', s.dtype)

# basic methods
print('Mean: ', s.mean(), ' Std:', s.std(), 'Max: ', s.max())

# numerical derivative
print('Diff: ', s.diff())

## Can also make series from scalars (assign all indices same value) or from dicts

### Suppose you want a series with all the same values...you can do this using np.repeat

In [None]:
N=4
data = np.repeat(14, N)
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

### Can achieve the same thing like this

In [None]:
# series from scalars
N=4

# don't need repeat cause its a single scalar linked to each index
data = 14
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

### Can also initialize with a dict (keys become index, values become data)

In [None]:
data = {'Bob' : 20, 'Ella' : 17, 'Sam' : 23, 'Jack' : 25.3}
s = pd.Series(data)
print(s)

<div class="alert alert-info">
Note that data type is upcast to highest precision entry
</div>

## DataFrames

* 
* 
* 


## Make a data set that we can play with, will import some real data later on
* just make up some stuff here...lets say responses in 5 different neurons to different stimuli 

In [None]:
# seed random number generator
np.random.RandomState(0)

# dependent variables - 5 neurons...
neurons = ['Nrn1','Nrn2','Nrn3','Nrn4','Nrn5','Nrn6','Nrn7','Nrn8','Nrn9','Nrn10']  

# independent variables...responses in Hz to two stimulus conditions
resp1_hz = [14, 27, 62, 88, 45, 56, 75, 63, 33, 46]

# set up our response to stimulus 2...use random.randint for fun
min_resp = 1  # inclusive
max_resp = 90 # exclusive
resp2_hz = np.random.randint(1, 90, len(resp1_hz))

## New - use 'zip' function to wrap up the data from each list into one list
* does like it sounds like it does - takes three iterators and groups them together into a single iterator with the 1st element in each iterator together, then the second, etc. 
* length of resulting iterator limited by the length of the shortest input iterator

[reference](https://www.w3schools.com/python/ref_func_zip.asp)

In [None]:
neuron_data = list(zip(neurons, resp1_hz, resp2_hz))
print(neuron_data)

## Make a dataframe object to hold the contents of the data set

In [None]:
df = pd.DataFrame(data = neuron_data, columns = ['neuron', 'resp1', 'resp2'])

# take a look at the nice output here...
df

## Saving data in a csv file

In [None]:
 # lets save our header as well so that it doesn't think our first row is the header when we read the file back in
df.to_csv('spike_rates.csv',index=False,header=True)

In [None]:
# use our current working directory to build a path to the file
print(cwd)
file_name = cwd + '/spike_rates.csv'
print(file_name)

df = pd.read_csv(file_name)
df

## Get a high-level summary of the data

In [None]:
df.describe()

## Can also apply a set of more targeted analyses using the df object

* [Pandas doc for all functions](https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats)

In [None]:
df.mean(axis=0)

## Making cooler DataFrame styles (and more useful...although that should take a backseat to coolness)
[Check here for a bunch of neat style options](https://pandas.pydata.org/pandas-docs/stable/style.html)
* Simple demo - can write custom functions that highlight specific aspects of your data - can be very useful for more clearly highlighting/communicating key points in the data within a notebook  

In [None]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]