# Tutorial 11292021: Pandas, series, data frames 
* data structures and data analysis tools

[The official project homepage](https://pandas.pydata.org)

## Basic data structures - start with Series then build up to DataFrames

[Pandas quick start guide for Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Unlike a 1D NumPy array, each entry is **labeled** with an index that keeps track of what each entry is and can be used to look up the corresponding value during analysis.
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specifying usable and informative labels is a good idea, because that's one of the main advantages of organizing your data in this manner. If you just want to fly blind then NumPy is usually fine on its own.
        
<div class="alert alert-warning">
Pandas will allow you to specify non-unique labels. This can be ok for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error, so in general it's good practice to use unique labels!
</div>
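
To make the warning above concrete, here is a minimal sketch of what can happen with repeated labels. The series `dup` and its values are made up for illustration, and the exact error message may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical example data: the label 'a' appears twice
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Quick check you can run before relying on label-based operations
print(dup.index.is_unique)

# Looking up a repeated label returns every match (a Series), not a single value
matches = dup['a']
print(matches)

# Operations that need a one-to-one mapping, such as reindex, raise an error
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print('reindex failed:', err)
```

Checking `index.is_unique` right after building a Series is a cheap way to catch this early.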

## Import libs

In [14]:
# standard numpy module
import numpy as np

# import pandas under its conventional alias
import pandas as pd 

# new - get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

## Create a series from a NumPy ndarray

In [15]:
# make some data and then use pd.Series

# random seed so we get the same thing each time 
np.random.seed(0)

# For this simulation, let's have 12 subjects, and some data
N = 12
data = np.random.random(N)

# make a list of subject names for use as index labels
label_prefix = 'Sub'
index=[]
for n in np.arange(N):
    index.append(label_prefix+str(n))

# print our list of index labels
print('Index labels: ', index, '\n')

# then make our pandas Series by passing in our data array and our index labels
s = pd.Series(data, index=index)
print(s)

Index labels:  ['Sub0', 'Sub1', 'Sub2', 'Sub3', 'Sub4', 'Sub5', 'Sub6', 'Sub7', 'Sub8', 'Sub9', 'Sub10', 'Sub11'] 

Sub0     0.548814
Sub1     0.715189
Sub2     0.602763
Sub3     0.544883
Sub4     0.423655
Sub5     0.645894
Sub6     0.437587
Sub7     0.891773
Sub8     0.963663
Sub9     0.383442
Sub10    0.791725
Sub11    0.528895
dtype: float64
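
As a small companion to the cell above, this sketch contrasts the default index (0 to len(data)-1) with explicit labels, and builds the labels with a list comprehension instead of a loop. The names `values`, `s_default` and `s_labeled` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for three subjects
values = np.array([0.1, 0.2, 0.3])

# No index given: labels default to 0, 1, ..., len(values)-1
s_default = pd.Series(values)
print(s_default.index)

# Same data with informative labels, built in one line
labels = ['Sub' + str(n) for n in range(len(values))]
s_labeled = pd.Series(values, index=labels)
print(s_labeled)
```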


## Note that each subject is now a field in the series and can be used to retrieve the corresponding value...there are a few ways to do this

In [16]:
# access by field (attribute style - only works when the label is a valid Python name)
print(s.Sub11)

# access by index label
print(s['Sub11'])

# will cover more advanced slicing below

0.5288949197529045
0.5288949197529045
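
One caveat on the two access styles above, sketched with made-up labels: attribute-style access only works when the label is a valid Python name that does not collide with an existing Series attribute or method, whereas bracket access works for any label.

```python
import pandas as pd

# Hypothetical labels: one collides with a Series method, one contains a space
s_demo = pd.Series([1.0, 2.0, 3.0], index=['Sub0', 'mean', 'Sub 2'])

# Attribute access is fine for a simple label
print(s_demo.Sub0)

# 'mean' is also a Series method, so s_demo.mean is the method, not the value
print(callable(s_demo.mean))

# Bracket access always refers to the label
print(s_demo['mean'])
print(s_demo['Sub 2'])
```

For that reason bracket access is usually the safer habit in analysis code.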


## Can also use labels to check for membership or to iterate over labels

In [12]:
# check for membership
print('Sub11' in s)
print('\n')

# iterate over index labels, with l==index name
for l in s.index:
    print(l)

print('\n')

# iterate over values...
for l in s.values:
    print(l)    

True


Sub0
Sub1
Sub2
Sub3
Sub4
Sub5
Sub6
Sub7
Sub8
Sub9
Sub10
Sub11


0.5488135039273248
0.7151893663724195
0.6027633760716439
0.5448831829968969
0.4236547993389047
0.6458941130666561
0.4375872112626925
0.8917730007820798
0.9636627605010293
0.3834415188257777
0.7917250380826646
0.5288949197529045


In [13]:
# can also get to the values more directly like this:
for d in s:
    print(d)

0.5488135039273248
0.7151893663724195
0.6027633760716439
0.5448831829968969
0.4236547993389047
0.6458941130666561
0.4375872112626925
0.8917730007820798
0.9636627605010293
0.3834415188257777
0.7917250380826646
0.5288949197529045
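
If you want the label and the value together in a single loop, `Series.items()` yields (label, value) pairs. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical three-subject series
s_demo = pd.Series([0.5, 0.7, 0.6], index=['Sub0', 'Sub1', 'Sub2'])

# Iterate over labels and values at the same time
pairs = []
for label, value in s_demo.items():
    print(label, value)
    pairs.append((label, value))
```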


## Before moving on, there are a few other optional (but important) parameters of the pd.Series call
* dtype - default is to infer the data type (int64, float64, object, etc.) based on the values in data
    * However, you can also explicitly declare the data type
    * This can be good if you want to, for example, re-cast the data to save space or to make types compatible
    * But this may also have important negative consequences if not done thoughtfully (e.g. lost precision or truncated values)!
* copy - if not specified then the default behavior is set to False and the new series will have a 'view' of the data.
    * This can save space, but can sometimes lead to confusion as any change to the values in s will also change the values in the original 'data' array
    * Setting copy=True will make a new copy of the data in 's' that is independent of the input 'data' array
    * The default has changed in some newer Pandas releases (notably with Copy-on-Write enabled), so check the documentation for the version you are running
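
A rough sketch of the dtype trade-off described above, using made-up values: re-casting to float32 halves the memory used, while casting floats to integers silently drops the fractional part.

```python
import numpy as np
import pandas as pd

# Hypothetical data
values = np.array([1.7, 2.2, -1.9])

# Inferred dtype (float64) vs. an explicitly declared, smaller dtype
s64 = pd.Series(values)
s32 = pd.Series(values, dtype='float32')
print(s64.dtype, s64.nbytes)
print(s32.dtype, s32.nbytes)

# A less thoughtful re-cast: converting to int truncates toward zero
truncated = s64.astype('int64')
print(truncated)
```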


### Example for using dtype: declaring dtype can be handy if you want to, for example, do str manipulations with the data array later or if you want to merge with another series of type str

In [5]:
# make a series with the data array from above, but this time make it a str
# instead of the inferred float64 type
s = pd.Series(data, index=index, dtype='str')

# first 4 values in our original data array
print(data[:4])

# first 4 values in our series of type str...preserves info and we're now
# all set to do a bunch of str operations without having to deal with 
# recasting each time we interact with the values in s
print('\n', s[:4])

[0.5488135  0.71518937 0.60276338 0.54488318]

 Sub0    0.5488135039273248
Sub1    0.7151893663724195
Sub2    0.6027633760716439
Sub3    0.5448831829968969
dtype: object


<div class="alert alert-info">
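Note that the dtype of series 's' is now 'object'. This is the general-purpose dtype that Pandas uses for arbitrary Python objects, and it is how Python 'str' values are typically stored (newer Pandas versions may report a dedicated string dtype instead).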
</div>
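
Once the values are strings, the `.str` accessor applies string operations to every entry without re-casting. A minimal sketch, using two made-up values rather than the full data array:

```python
import numpy as np
import pandas as pd

# Hypothetical two-entry series, stored as strings
s_str = pd.Series(np.array([0.5488, 0.7152]), index=['Sub0', 'Sub1'], dtype='str')

# Slice each string: keep only the first 4 characters
first4 = s_str.str[:4]
print(first4)

# Test each string for a literal '.' (regex=False treats '.' as a plain character)
has_dot = s_str.str.contains('.', regex=False)
print(has_dot)
```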

### Now look at the default 'view' behavior (copy=False), and why you might explicitly ask for a 'copy' instead

In [7]:
# Same as before - create a series based on a short data array (0:4 in this case for simplicity)
# let Pandas figure out the dtype, and use the default copy behavior (i.e. copy=False)

N = 4                # number of data points

# make data
data = np.arange(N)

# make index labels
index = ['d1','d2','d3','d4']

# print out the original data array for reference
print('Original data: ', data, '\n')

# make a series with the default behavior of copy=False
s = pd.Series(data, index=index, copy=False)

# print out the new series
print('Original values in series')
print(s)

Original data:  [0 1 2 3] 

Original values in series
d1    0
d2    1
d3    2
d4    3
dtype: int64


In [8]:
# now change the value of the first entry in the series
s['d1'] = 100

# new values in series 's'
print('\nNew values in series')
print(s)

# and then print the corresponding entry in the data array
print('\nNew data:', data, '\ndata[0] changed too!')

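# Note that data[0] changes when the values in s are a view of data, because
# both are referencing the same chunk of memory. Whether you actually get a view
# depends on your pandas version and settings, so check the printed array.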


New values in series
d1    100
d2      1
d3      2
d4      3
dtype: int64

New data: [0 1 2 3] 
data[0] changed too!


<div class="alert alert-danger">
Note that this works in the other direction too, which can be more insidious...if you create a Series based on the values in 'data', and then do more work with 'data', then every time you change a value in the original data array, you will also change the corresponding value in s!!!
</div>
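
To avoid the linkage described above, you can ask for an independent copy with `copy=True`. This is a minimal sketch using its own variable names (`data_demo`, `s_copy`) so it does not disturb the series used in the cells around it.

```python
import numpy as np
import pandas as pd

# Fresh data array and a series that owns its own copy of the values
data_demo = np.arange(4)
s_copy = pd.Series(data_demo, index=['d1', 'd2', 'd3', 'd4'], copy=True)

# Changing the series leaves the original array untouched...
s_copy['d1'] = 100
print(s_copy)
print(data_demo)

# ...and changing the array leaves the series untouched
data_demo[1] = -5
print(s_copy['d2'])
```

The cost is extra memory for the duplicate values, which is usually a fair price for small and medium data sets.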

## After creating a pandas series, you can do many common operations and access the functionality of other modules 
* A pd Series behaves similarly to a NumPy ndarray, and can be passed to many NumPy functions
* Slicing also works like an ndarray - note that index is also sliced
* Lots of built in methods as well that emulate NumPy functionality

### Can pass pd.Series to most NumPy functions... 
* Note that the index labels come along for the ride 

In [8]:
# print our series - set of index labels along with data values
print(s)

# then apply the NumPy cumulative product operation (multiply the 1st entry by the 2nd, then that result by the 3rd, etc)
cp = np.cumprod(s)

print('\nCumproduct\n')
print(cp)

# cool part: note that the output also contains the label info, which is handy to keep track of things,
# e.g. you can index into cp using these labels
print('\nIndex by label')
print(cp['d1'])

d1    100
d2      1
d3      2
d4      3
dtype: int64

Cumproduct

d1    100
d2    100
d3    200
d4    600
dtype: int64

Index by label
100
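
The same holds for element-wise NumPy functions. A short sketch with made-up values, showing that `np.sqrt` returns a Series with the original labels attached:

```python
import numpy as np
import pandas as pd

# Hypothetical series of perfect squares
s_demo = pd.Series([1.0, 4.0, 9.0], index=['d1', 'd2', 'd3'])

# Element-wise NumPy function: result is still a Series with the same index
roots = np.sqrt(s_demo)
print(type(roots))
print(roots)

# so label-based lookup still works on the result
print(roots['d3'])
```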


### Series objects have many built in operations, much like NumPy 
[list of attributes and methods](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [7]:
# attributes
print('Data Type: ', s.dtype)

# basic methods
print('Mean: ', s.mean(), ' Std:', s.std(), 'Max: ', s.max())

# first discrete difference between consecutive entries (a simple numerical derivative)
print('Diff: ', s.diff())

Data Type:  int64
Mean:  1.5  Std: 1.2909944487358056 Max:  3
Diff:  d1    NaN
d2    1.0
d3    1.0
d4    1.0
dtype: float64
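
A few more built-in methods that are handy in practice, sketched on a small made-up series: `idxmax` returns the label (not the position) of the largest value, `cumsum` gives a running total, and `describe` prints summary statistics in one call.

```python
import pandas as pd

# Hypothetical series matching the 0..3 example above
s_demo = pd.Series([0, 1, 2, 3], index=['d1', 'd2', 'd3', 'd4'])

# label of the maximum value
top_label = s_demo.idxmax()
print(top_label)

# running total, labels preserved
running = s_demo.cumsum()
print(running)

# count, mean, std, min, quartiles, max
summary = s_demo.describe()
print(summary)
```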


### Slicing also works like NumPy

In [8]:
# print the series
print(s)
print('\n')

# first 3 values
print('First 3 entries')
print(s[:3])
print('\n')

# Arithmetic with a scalar (applied to every entry)
print('S * 22')
print(s * 22)

d1    0
d2    1
d3    2
d4    3
dtype: int64


First 3 entries
d1    0
d2    1
d3    2
dtype: int64


S * 22
d1     0
d2    22
d3    44
d4    66
dtype: int64
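
Slicing can also be made explicit about whether you mean positions or labels. In this sketch (made-up series), `.iloc` slices by position and excludes the end point like NumPy, while `.loc` slices by label and includes the end label.

```python
import pandas as pd

# Hypothetical series matching the example above
s_demo = pd.Series([0, 1, 2, 3], index=['d1', 'd2', 'd3', 'd4'])

# positional slice: entries 0 and 1 (end point excluded)
by_position = s_demo.iloc[0:2]
print(by_position)

# label slice: 'd1' through 'd3' (end label included)
by_label = s_demo.loc['d1':'d3']
print(by_label)
```

Being explicit with `.loc` and `.iloc` avoids ambiguity when the labels are themselves integers.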


### Note that after slicing, labels stay attached to data...

In [9]:
# Example using conditional indexing: find all entries where data >= 2
print('Values >= 2')
new_s = s[s>=2]
print(new_s)
print('\n')

Values >= 2
d1    100
d3      2
d4      3
dtype: int64




<div class="alert alert-info">
The fact that labels stay attached to the corresponding values is often useful because you don't have to compute and store a separate index for the new data set like you would in Matlab if you wanted to keep track of where the values >= 2 were in the original array.
</div>
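
Conditions can be combined as well. A minimal sketch with made-up values: each condition goes in its own parentheses and they are joined with `&` (and) or `|` (or), not the Python keywords `and` / `or`.

```python
import pandas as pd

# Hypothetical series with one outlier
s_demo = pd.Series([100, 1, 2, 3], index=['d1', 'd2', 'd3', 'd4'])

# keep entries that are >= 2 but below 50
mid_range = s_demo[(s_demo >= 2) & (s_demo < 50)]
print(mid_range)

# the surviving labels tell you where those values came from
kept = list(mid_range.index)
print(kept)
```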

## Although series can be treated much like NumPy arrays, there is one key difference (and often a big advantage)
* When you do an operation on two NumPy arrays, the operation is performed element-by-element based on position
* However, when you do an operation on two pandas series, the operation will be applied to like-labeled values, regardless of their position
* This can save a lot of trouble in terms of lining up corresponding entries in two data arrays when the data sets are initialized in different orders!

<div class="alert alert-info">
The next part is neat and really useful in many real world applications where data sets are messy...Series operations are performed based on matching labels, not on matching positions in an array!
</div>

### Following on the NumPy example in the last cell...Now suppose that you ran a set of subjects in two experiments, but the data from each subject were entered in a different order in each study
* Even though the data were entered in different orders, you want an easy way to perform operations on specific subjects across experiments 
* Using NumPy - or Matlab - you'd probably now try to sort your second data set so that the labels from the second study were in the same order as in the first study.
* Then you would save an index indicating the sort order, and you'd use that index to rearrange the data values from the second data set so that everything lined up with the first data set.
* A series can make life much easier here because operations are aligned on the union of the labels involved (a label that appears in only one series gets NaN in the result)!

In [None]:
# set up two series - as if we have two data sets from the same set of 5 participants
N=5
data0 = np.arange(N)
index0 = ['s0','s1','s2','s3','s4']
s0 = pd.Series(data0, index=index0)

# now do our second 'experiment' but this time the subjects were run in a different order
data1 = np.arange(N)+7
index1 = ['s3','s2','s4','s1','s0']
s1 = pd.Series(data1, index=index1)

# print out our data series
print(s0)
print('\nData from the second experiment - same subjects, but different order\n')
print(s1)

In [None]:
# Do a simple binary operation like addition across data sets
sum_data = s0+s1
print(sum_data)
# Even though the numerical position of each subject differs across experiments, Pandas figured out how 
# to properly perform the operation by aligning based on index labels!
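
To see the difference side by side, this sketch repeats the two-experiment setup, compares positional addition in NumPy with label-aligned addition in Pandas, and then shows what happens when the labels only partly overlap. The series `a` and `b` and the use of `fill_value=0` are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Same setup as above: two experiments, subjects entered in different orders
data0 = np.arange(5)
data1 = np.arange(5) + 7
s0 = pd.Series(data0, index=['s0', 's1', 's2', 's3', 's4'])
s1 = pd.Series(data1, index=['s3', 's2', 's4', 's1', 's0'])

# NumPy adds by position, mixing up subjects
positional = data0 + data1
print(positional)

# Pandas adds by label, so each subject is paired with itself
aligned = s0 + s1
print(aligned)

# Partial overlap: labels found in only one series give NaN...
a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])
partial = a + b
print(partial)

# ...unless you supply a fill value for the missing side
filled = a.add(b, fill_value=0)
print(filled)
```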

## Last notes on creation of series (not covered in class)
* Thus far we've been initializing series with ndarrays
* Can also make series from scalars (assign all indices same value) or from dicts

### Suppose you want a series with all the same values...you can do this using np.repeat

In [10]:
N=4
data = np.repeat(14, N)
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

0    14
1    14
2    14
3    14
dtype: int64

### OR you can achieve the same thing in a more straightforward manner

In [11]:
# series from scalars
N=4

# don't need repeat because it's a single scalar linked to each index
data = 14
index = np.arange(N) 

# make the series
s = pd.Series(data, index=index)

# all entries will have the same value
s

0    14
1    14
2    14
3    14
dtype: int64

### Can also initialize with a dict
* dict keys become index labels
* data become values

In [12]:
data = {'Bob' : 20, 'Ella' : 17, 'Sam' : 23, 'Jack' : 25.3}
s = pd.Series(data)
print(s)

Bob     20.0
Ella    17.0
Sam     23.0
Jack    25.3
dtype: float64


<div class="alert alert-info">
Note that when you create a Series with mixed numerical data types, the values are upcast to a common dtype that can hold every entry (here the ints become float64 because of the 25.3)
</div>
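
Two related behaviors when building a Series from a dict, sketched with a made-up `ages` dict: an all-integer dict keeps an integer dtype, and passing `index` along with the dict selects and orders the keys, filling any label that is missing from the dict with NaN (which in turn forces a float dtype).

```python
import pandas as pd

# Hypothetical dict with integer values only
ages = {'Bob': 20, 'Ella': 17, 'Sam': 23}

# No upcasting needed: dtype stays integer
s_int = pd.Series(ages)
print(s_int)

# Pick and reorder keys with index; 'Zoe' is not in the dict, so it becomes NaN
s_sel = pd.Series(ages, index=['Sam', 'Bob', 'Zoe'])
print(s_sel)
```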