# In-class exercises - Intro to Pandas Series and DataFrames

## Import libs

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# get and store current file path for file i/o later on in tutorial
import os
cwd = os.getcwd()

## First, import the 'response_time_data.csv' data file
* Contains RTs from 800 trials of a simple detection task from each of 20 subjects
* The data were organized into a DataFrame and then saved in csv format
* The index (row) and column labels are encoded in the csv file, so you'll need to read those in explicitly
* Make sure to have a look at the DataFrame - use the df.head() method

In [None]:
# build the path with os.path.join so it works on any operating system
file_name = os.path.join(cwd, 'response_time_data.csv')

# because the row and column labels are already specified, set index_col and header = 0
df = pd.read_csv(file_name, index_col=0, header=0)
df.head()

## Now have a look at the data using built-in Pandas functionality
* Check out the max/min of each row, standard deviation, percentiles, etc.
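
As a starting point, here is a minimal sketch of the summary methods on a small synthetic DataFrame. The subject labels, the values, and the layout (trials as rows, subjects as columns) are assumptions made for illustration; the real file may be oriented differently, in which case the axis arguments swap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 10 trials (rows) x 3 subjects (columns).
# The labels 'Sub0'..'Sub2' and the values are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(500, 50, size=(10, 3)),
                    columns=['Sub0', 'Sub1', 'Sub2'])

summary = demo.describe()                 # count, mean, std, min, quartiles, max per column
col_max = demo.max(axis=0)                # one value per column
row_max = demo.max(axis=1)                # one value per row
col_std = demo.std(axis=0)                # sample standard deviation per column
pcts = demo.quantile([0.05, 0.5, 0.95])   # selected percentiles per column

print(summary)
```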

## Are there missing values (NaNs) in the data?
* one way: use the np.isnan(df) function from numpy
* combine with np.sum (and the appropriate axis) to count the number of NaNs for each subject...
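
The counting step can be sketched as follows on a tiny made-up DataFrame. The subject labels and the layout (subjects as columns) are assumptions; if subjects are rows in the real data, sum along axis=1 instead.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 trials (rows) x 2 subjects (columns), with a few NaNs.
demo = pd.DataFrame({'Sub0': [510.0, np.nan, 495.0, 530.0],
                     'Sub1': [480.0, 502.0, np.nan, np.nan]})

# np.isnan gives a boolean DataFrame; summing down axis=0 counts NaNs per column.
nan_counts_np = np.sum(np.isnan(demo), axis=0)

# Equivalent pandas-only version.
nan_counts_pd = demo.isna().sum()

print(nan_counts_np)
```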

## After you've found the NaNs for each subject, check out this function:
[pandas.DataFrame.interpolate](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate)

* Use this function to interpolate the missing values for each subject (do not interpolate across subjects!)
* Just use linear interpolation...
* reassign to a new df without any NaNs (that is, after you've interpolated across any NaNs)
* Make sure that your new df indeed doesn't have any NaNs in it!
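
A minimal sketch of the interpolation step, again on made-up data and assuming trials are rows and subjects are columns. With that layout, axis=0 interpolates down each column, so values are never mixed across subjects. Note that a NaN on the very first trial has no earlier neighbor and would be left unfilled by the default settings.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 trials (rows) x 2 subjects (columns).
demo = pd.DataFrame({'Sub0': [500.0, np.nan, 520.0, 530.0],
                     'Sub1': [480.0, 490.0, np.nan, 510.0]})

# Linear interpolation within each column (i.e., within each subject).
demo_interp = demo.interpolate(method='linear', axis=0)

# Confirm that no NaNs remain in the new DataFrame.
n_remaining = int(demo_interp.isna().sum().sum())
print(demo_interp)
print('NaNs remaining:', n_remaining)
```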

## You can explore the "Missing Values" page for Pandas to figure out other ways of filling in missing values (or outliers)

[page is here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data)

* Go back to the original data set with NaNs, but this time figure out how to replace the NaNs with the mean of each subject
* Check out the 'fillna' method...
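
One possible way to do this, sketched on made-up data with subjects as columns (an assumption about the layout): passing a Series of column means to fillna fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 trials (rows) x 2 subjects (columns).
demo = pd.DataFrame({'Sub0': [500.0, np.nan, 520.0, 540.0],
                     'Sub1': [480.0, 490.0, np.nan, 500.0]})

# demo.mean() skips NaNs and returns one mean per column (per subject here).
subject_means = demo.mean()

# fillna matches the Series index to the column labels, so each subject's
# NaNs are replaced by that subject's own mean.
demo_filled = demo.fillna(subject_means)
print(demo_filled)
```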

## Use the pandas.DataFrame.sample method to generate bootstrapped confidence intervals for the data from subject 11

[see this page for info about the "sample" method](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.sample.html)


* Take the data from the last step (with NaNs replaced by each subject's mean) and use that for this problem
* Resample Sub11's data with replacement 1000 times, each time pulling N samples (800 in this case)
* On each bootstrap iteration, compute the mean of the data - this will give you a distribution of means across all resamples
* Compute 95% confidence intervals using the pandas "quantile" method or the numpy "percentile" function:


[this page for quantiles](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html)
[this page for percentiles](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html#numpy.percentile)

*Note that percentile and quantile are the same except that with percentile you use values between 0-100 and for quantile you use values between 0-1*
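
The resampling steps above can be sketched as follows. The data here are synthetic (800 random values standing in for Sub11's RTs), and the variable names are hypothetical; substitute the real column or row for the synthetic Series.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for Sub11's 800 trials; the values are made up.
rng = np.random.default_rng(0)
sub11 = pd.Series(rng.normal(500, 50, size=800), name='Sub11')

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw N samples with replacement; random_state=i only makes the sketch reproducible.
    resample = sub11.sample(n=len(sub11), replace=True, random_state=i)
    boot_means[i] = resample.mean()

# 95% confidence interval from the distribution of bootstrapped means.
ci_low, ci_high = pd.Series(boot_means).quantile([0.025, 0.975])   # quantile: 0-1
ci_np = np.percentile(boot_means, [2.5, 97.5])                     # percentile: 0-100

print(ci_low, ci_high)
```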
    
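* Then make a plot using the matplotlib "errorbar" function. Hints: because the lower and upper confidence limits differ, pass the error bars to "yerr" as a 2 x 1 np.array holding the distance from the mean down to the lower limit and the distance from the mean up to the upper limit (errorbar expects offsets relative to the data point, not the limits themselves). And since you have just one data point, you can set the "x" parameter that you pass into errorbar to 1.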

[errorbar page](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.errorbar.html)
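
A minimal sketch of the plotting step. The mean and confidence limits below are made-up placeholder numbers, not results from the data set; replace them with the values from your bootstrap.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values for illustration only (not computed from the real data).
mean_rt = 500.0
ci_low, ci_high = 496.5, 503.4

# errorbar wants offsets from the data point: row 0 = lower, row 1 = upper.
# For a single data point the array has shape (2, 1).
yerr = np.array([[mean_rt - ci_low],
                 [ci_high - mean_rt]])

fig, ax = plt.subplots()
ax.errorbar([1], [mean_rt], yerr=yerr, fmt='o', capsize=5)
ax.set_xticks([1])
ax.set_xticklabels(['Sub11'])
ax.set_ylabel('Mean RT')
```

In a notebook the figure renders when the cell finishes; in a plain script, add a call to plt.show().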