# In-class exercises - Intro to Pandas Series and DataFrames

## Import libs

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# get and store current file path for file i/o later on in tutorial
import os

cwd = os.getcwd()

## First import 'response_time_data.csv' data file
* Contains response times (RTs) from 800 trials of a simple detection task for each of 20 subjects
* The data were organized into a DataFrame and then saved in csv format
* The index (row) and column labels are encoded in the csv file, so you'll need to read those in explicitly
    * To do this, set index_col = 0 (indicating the first column is the index) and set header = 0 (indicating the first row holds the column labels).
* Make sure to have a look at the DataFrame - use the df.head() method
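
As a minimal sketch of what index_col and header do, the example below reads a tiny in-memory csv instead of the real file. The contents of `csv_text` and the name `toy_df` are made up for illustration; only the layout (labels in the first row and first column) mirrors the description above.

```python
import io

import pandas as pd

# Hypothetical miniature of the csv layout: the first row holds the column
# labels and the first column holds the row labels.
csv_text = ",Sub0,Sub1\nTri0,100.0,200.0\nTri1,150.0,250.0\n"

# index_col=0 -> use the first column as the row index
# header=0    -> use the first row as the column labels
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)

print(toy_df.index.tolist())    # row labels taken from the first column
print(toy_df.columns.tolist())  # column labels taken from the first row
```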

In [22]:
# build the path with os.path.join so it works on any operating system
file_name = os.path.join(cwd, 'response_time_data.csv')

df = pd.read_csv(file_name, index_col=0, header=0)

df.head()

Unnamed: 0,Sub0,Sub1,Sub2,Sub3,Sub4,Sub5,Sub6,Sub7,Sub8,Sub9,Sub10,Sub11,Sub12,Sub13,Sub14,Sub15,Sub16,Sub17,Sub18,Sub19
Tri0,2797.22424,1039.571212,4045.345952,3530.93421,2410.276348,6541.494156,1977.919842,2343.555594,143.695964,8147.939691,5183.942423,4548.240971,2076.921296,4230.548795,4134.589984,2067.132295,4087.049471,2704.327437,2790.476384,5141.106292
Tri1,786.895089,3076.223066,1033.310418,3758.043454,4000.805778,2756.802996,2918.768116,2613.934992,2655.684434,7410.337807,3182.903975,4324.103096,1843.506277,1338.453235,2693.772203,7239.094853,1320.715043,4449.372349,1085.884483,3556.231671
Tri2,3516.902396,4632.818016,4874.066155,3031.377402,2485.677228,4929.841314,435.950399,3059.241733,2923.3256,3530.389021,3002.555229,7537.781867,1989.249165,4513.510928,4473.73304,7422.364759,3338.164717,4840.676786,2721.343095,1972.689272
Tri3,333.88183,104.448476,2304.093856,586.098266,4575.178155,2365.682721,1285.101296,5050.566343,2446.870606,5096.855057,1047.603006,5431.187785,2879.554454,311.31906,2814.385809,3396.500194,1324.780081,1518.991979,1676.395223,2051.924695
Tri4,6790.330061,2629.751046,3148.222058,1894.867975,2274.057485,8186.457041,1195.253881,3747.385847,1456.694541,3437.159878,6745.578676,4101.871682,1944.773775,1571.942134,3186.806328,6588.562378,2866.277989,2079.88084,1086.063139,7051.740732


## Now have a look at the data using built-in Pandas functionality
* Check out the max/min of each column (subject), standard deviation, percentiles, etc.

In [29]:
x = df.describe()
# etc...
x[0:3]

Unnamed: 0,Sub0,Sub1,Sub2,Sub3,Sub4,Sub5,Sub6,Sub7,Sub8,Sub9,Sub10,Sub11,Sub12,Sub13,Sub14,Sub15,Sub16,Sub17,Sub18,Sub19
count,800.0,800.0,800.0,800.0,796.0,800.0,800.0,799.0,800.0,798.0,800.0,789.0,800.0,797.0,797.0,800.0,800.0,785.0,793.0,800.0
mean,3492.614323,2549.787915,2498.108943,3502.338174,2489.637962,4583.557298,2587.373753,3528.493482,1587.012676,4367.761563,3435.810762,4549.103034,2692.333031,2552.094429,4462.378792,4534.814089,2478.180462,2583.731375,2495.609643,4454.240975
std,1779.474153,1476.122674,1434.749989,1722.695784,1394.508376,2544.771595,1529.182544,2000.548574,1302.153904,1935.519959,1745.629161,2662.686275,2898.41857,1452.494803,2151.655387,1976.030065,1497.644375,2648.316102,1456.803723,2051.493761
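
The sketch below shows, on a made-up three-trial DataFrame, how describe() summarizes each column and how the axis argument switches between per-subject and per-trial statistics. The values and the names `toy_df`, `summary`, `col_max`, and `row_max` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: 3 trials x 2 subjects, with one missing value.
toy_df = pd.DataFrame(
    {"Sub0": [100.0, 200.0, 300.0], "Sub1": [10.0, np.nan, 30.0]},
    index=["Tri0", "Tri1", "Tri2"],
)

# describe() returns one column of summary statistics per subject;
# "count" only includes the non-NaN entries.
summary = toy_df.describe()
print(summary.loc[["count", "mean", "std"]])

# axis=0 collapses over rows (trials): one value per subject.
col_max = toy_df.max(axis=0)
# axis=1 collapses over columns (subjects): one value per trial.
row_max = toy_df.max(axis=1)
```

Note that "count" drops below the number of trials for any subject with missing data, which is a first hint that NaNs are present.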


## Are there missing values (NaNs) in the data?
* One way: use the np.isnan(df) function from numpy, which returns True wherever a value is NaN
* Combine it with np.sum to count the number of NaNs for each subject, so np.sum(np.isnan(df), axis=0)

In [8]:
np.sum( np.isnan(df), axis=0)

Sub0      0
Sub1      0
Sub2      0
Sub3      0
Sub4      4
Sub5      0
Sub6      0
Sub7      1
Sub8      0
Sub9      2
Sub10     0
Sub11    11
Sub12     0
Sub13     3
Sub14     3
Sub15     0
Sub16     0
Sub17    15
Sub18     7
Sub19     0
dtype: int64

## Alternatively, you can use the df.isnull().sum() approach to get the same thing!
[instructions here](https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/)

In [10]:
df.isnull().sum()

Sub0      0
Sub1      0
Sub2      0
Sub3      0
Sub4      4
Sub5      0
Sub6      0
Sub7      1
Sub8      0
Sub9      2
Sub10     0
Sub11    11
Sub12     0
Sub13     3
Sub14     3
Sub15     0
Sub16     0
Sub17    15
Sub18     7
Sub19     0
dtype: int64
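
To see that the two approaches agree, here is a small sketch on made-up data (the DataFrame `toy_df` and its values are hypothetical, not taken from the response time file):

```python
import numpy as np
import pandas as pd

# Made-up data with a known number of NaNs per column: 1, 2, and 0.
toy_df = pd.DataFrame(
    {
        "Sub0": [1.0, np.nan, 3.0],
        "Sub1": [np.nan, np.nan, 6.0],
        "Sub2": [7.0, 8.0, 9.0],
    }
)

# numpy route: boolean mask of NaNs, summed down each column
nan_counts_np = np.sum(np.isnan(toy_df), axis=0)

# pandas route: same mask via isnull(), summed down each column by default
nan_counts_pd = toy_df.isnull().sum()

print(nan_counts_np)
print(nan_counts_pd)

# total number of NaNs in the whole DataFrame
total_nans = int(toy_df.isnull().sum().sum())
```

One practical difference: np.isnan only works on numeric data, while isnull() also handles columns of strings or other objects.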

## After you've found the NaNs for each subject, check out this function:
[pandas.DataFrame.interpolate](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate)

* Read the manual pages, and then use this function to interpolate the missing values for each subject (do not interpolate across subjects!)
* Just use linear interpolation...
* reassign to a new df without any NaNs (that is, after you've interpolated across any NaNs)
* Make sure that your new df indeed doesn't have any NaNs in it!

In [33]:
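# axis=0 interpolates down each column, i.e. within a subject; linear is the default method
new_df = df.interpolate(method='linear', axis=0)
# confirm that no NaNs remain
np.sum(np.isnan(new_df), axis=0)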

Sub0     0
Sub1     0
Sub2     0
Sub3     0
Sub4     0
Sub5     0
Sub6     0
Sub7     0
Sub8     0
Sub9     0
Sub10    0
Sub11    0
Sub12    0
Sub13    0
Sub14    0
Sub15    0
Sub16    0
Sub17    0
Sub18    0
Sub19    0
dtype: int64
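
The sketch below illustrates linear interpolation within each subject on made-up data. It also shows one caveat: with the default settings, a NaN at the very start of a column has no earlier value to interpolate from and is left in place. Passing limit_direction="both" is one way to handle that case (it fills the leading NaN with the nearest valid value). The data and variable names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: Sub0 has a gap in the middle, Sub1 has a gap at the start too.
toy_df = pd.DataFrame(
    {
        "Sub0": [100.0, np.nan, 300.0, 400.0],
        "Sub1": [np.nan, 20.0, np.nan, 40.0],
    },
    index=["Tri0", "Tri1", "Tri2", "Tri3"],
)

# axis=0 interpolates down each column, so values never mix across subjects.
filled = toy_df.interpolate(method="linear", axis=0)
print(filled)

# The leading NaN in Sub1 survives the default (forward-only) fill.
print(filled.isnull().sum())

# limit_direction="both" also fills NaNs at the start of a column.
filled_both = toy_df.interpolate(method="linear", axis=0, limit_direction="both")
print(filled_both.isnull().sum())
```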

## You can explore the "Missing Values" page for Pandas to figure out other ways of filling in missing values (or outliers)

[page is here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data)

* Go back to the original data set with NaNs, but this time figure out how to replace the NaNs with the mean of each subject
* Check out the 'fillna' method...
* Remember that fillna returns a new DataFrame, so assign the result, e.g. new_df = df.fillna(xxxx)
* Then double check to make sure that there are no nulls left.

In [16]:
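# df.mean(axis=0) gives one mean per subject (NaNs are skipped);
# fillna matches those means to the columns by label
mean_df = df.fillna(df.mean(axis=0))
# equivalent numpy check: np.sum(np.isnan(mean_df), axis=0)
mean_df.isnull().sum()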

Sub0     0
Sub1     0
Sub2     0
Sub3     0
Sub4     0
Sub5     0
Sub6     0
Sub7     0
Sub8     0
Sub9     0
Sub10    0
Sub11    0
Sub12    0
Sub13    0
Sub14    0
Sub15    0
Sub16    0
Sub17    0
Sub18    0
Sub19    0
dtype: int64
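
As a small sketch of why this works, the example below fills NaNs in a made-up DataFrame with each column's own mean. The names `toy_df`, `subject_means`, and `mean_filled` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value per subject.
toy_df = pd.DataFrame(
    {"Sub0": [100.0, np.nan, 300.0], "Sub1": [10.0, 20.0, np.nan]}
)

# mean(axis=0) skips NaNs by default and returns a Series indexed by column label.
subject_means = toy_df.mean(axis=0)

# Passing that Series to fillna fills each column with its own mean,
# because the Series index is matched against the column labels.
mean_filled = toy_df.fillna(subject_means)

print(subject_means)
print(mean_filled)
print(mean_filled.isnull().sum())
```

Unlike interpolation, mean filling ignores trial order, so every gap within a subject receives the same value.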