# Pandas: the Data Analysis Library

Just as the NumPy library centers around the `ndarray` object, the Pandas library centers around the `DataFrame` object.

In NumPy, you want to think about the ndarray as a vector/matrix. In Pandas, you want to think of a DataFrame like an Excel spreadsheet, with rows representing individual 'events' and columns containing data on different 'attributes' of each event.

As with all Python libraries, the documentation will be your friend: https://pandas.pydata.org/docs/index.html

The standard import statement for the pandas library is:

In [1]:
import pandas as pd

import numpy as np

## Initializing a DataFrame

### Importing
Importing data from an Excel spreadsheet (.xlsx, .xls, etc.) or comma-separated value (.csv) file is miraculously easy.

Depending on the file type, you will want to use either:
   - [pd.read_csv('relative_path_to_file', ...)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
   - [pd.read_excel('relative_path_to_file', 'sheet_name', ...)](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
   
There are similar import functions for JSON files, SQL databases, HTML documents, HDF5 files, and more!

Notice that when importing from an Excel document, the sheet name argument gives you a way to handle workbooks with multiple sheets. Let's see some examples of importing:

In [3]:
vg_data   = pd.read_csv('data/vgsales.csv')

FMCE_data = pd.read_excel('data/FMCE_pre_post_Fa7_anon.xlsx')
# Passing a list of sheet names returns a dict of DataFrames keyed by sheet name:
#FMCE_data = pd.read_excel('data/FMCE_pre_post_Fa7_anon.xlsx', ['Pre', 'PostTest', 'Matched Valid'])

type(FMCE_data)

pandas.core.frame.DataFrame
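
If the data files are not at hand, `pd.read_csv` can be tried on an in-memory string instead. The sketch below is only an illustration: the column names echo the video game data, but the rows are invented, and `io.StringIO` stands in for a path on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; these rows are made up.
csv_text = """Name,Platform,NA_Sales
Game A,Wii,1.50
Game B,NES,0.75
"""

# read_csv accepts a file-like object as well as a path.
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))   # the result is a DataFrame, as with a real file
print(demo.dtypes)  # column types are inferred from the text
```

The first line of the text is used as the header row by default, so each column can then be referenced by name.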

### From scratch using iterables (i.e. lists or arrays)

This is less common, but you can create a DataFrame by tying together a set of arrays.

This requires the use of the pd.DataFrame() function.

In [13]:
arr_1 = np.random.randn(10)
arr_2 = np.random.rand(10)
arr_3 = np.arange(0,10, 1)

df = pd.DataFrame([arr_1, arr_2, arr_3], index = ['Normal', 'Flat', 'Linear'])

df

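|        | 0        | 1        | 2         | 3        | 4         | 5        | 6        | 7         | 8         | 9         |
|--------|----------|----------|-----------|----------|-----------|----------|----------|-----------|-----------|-----------|
| Normal | 1.339624 | 0.611129 | -0.203193 | 0.813491 | -0.214602 | 0.764093 | -0.15398 | -0.246819 | -0.899544 | -0.150707 |
| Flat   | 0.755858 | 0.901615 | 0.701756  | 0.478256 | 0.848976  | 0.292371 | 0.653099 | 0.936446  | 0.884592  | 0.076153  |
| Linear | 0.0      | 1.0      | 2.0       | 3.0      | 4.0       | 5.0      | 6.0      | 7.0       | 8.0       | 9.0       |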
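
Passing a list of arrays, as above, places each array in its own row. If each array should instead become a column (the usual spreadsheet layout), one option is to pass a dict that maps column names to arrays. This is a minimal sketch; the name `df_cols` and the seeded generator are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (an assumption for this sketch).
rng = np.random.default_rng(seed=0)

# Each dict key becomes a column name; each array becomes that column's values.
df_cols = pd.DataFrame({
    'Normal': rng.standard_normal(10),
    'Flat': rng.random(10),
    'Linear': np.arange(0, 10, 1),
})

print(df_cols.shape)  # 10 rows, 3 columns
```

With this layout each column also keeps its own dtype, so 'Linear' stays an integer column rather than being converted to floats.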


## Referencing and Finding Slices of Data

Oftentimes you will only want to deal with data in a single column, or some subset of rows. There are many ways to go about accessing this data. To grab a column, use the format: dataframe['ColumnName'].

In [20]:
print(df[1])

vg_data['NA_Sales']

Normal    0.611129
Flat      0.901615
Linear    1.000000
Name: 1, dtype: float64


0        41.49
1        29.08
2        15.85
3        15.75
4        11.27
         ...  
16593     0.01
16594     0.01
16595     0.00
16596     0.00
16597     0.01
Name: NA_Sales, Length: 16598, dtype: float64

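To grab just a single row, use dataframe.loc['RowLabel'] to select by index label, or dataframe.iloc[n] to select by integer position.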
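
A minimal sketch of row selection, using a small made-up frame (`df_demo` is a hypothetical name) with the same kind of labeled index as `df`:

```python
import pandas as pd

# Small illustrative frame with labeled rows; the values are invented.
df_demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [0.1, 0.2, 0.3]],
    index=['Normal', 'Flat'],
)

row_by_label = df_demo.loc['Normal']  # select by index label
row_by_position = df_demo.iloc[0]     # select by integer position
first_rows = df_demo.iloc[0:1]        # a slice returns a DataFrame, not a Series

print(row_by_label)
```

A single row comes back as a Series whose index is the original column labels.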

## Built-in Pandas Functions

It may be tempting to use numpy functions to find things like the mean and standard deviation of some column of values, or a built-in function like len() to find the number of elements in a column of data, but Pandas has its own methods for this (such as .mean() and .std()), which are called directly on the column and are written to be fast. The example below compares the two approaches for the standard deviation:

In [42]:
import numpy as np
import time

def std_with_numpy():
    # time.time_ns() returns wall-clock time in nanoseconds; its resolution
    # can be coarse on some platforms, so a very short call may report 0.
    t0 = time.time_ns()
    std = np.std(vg_data['NA_Sales'])   # NumPy default: ddof=0 (population)
    t1 = time.time_ns()

    print('Std calculated with Numpy as : ' + str(std) + ' in ' + str(t1-t0) + ' nanoseconds.' )
    
def std_with_pandas():
    t0 = time.time_ns()
    std = vg_data['NA_Sales'].std()     # Pandas default: ddof=1 (sample)
    t1 = time.time_ns()

    print('Std calculated with Pandas as: ' + str(std) + ' in ' + str(t1-t0) + ' nanoseconds.' )
    
std_with_numpy()
std_with_pandas()

Std calculated with Numpy as : 0.8166584270779742 in 0 nanoseconds.
Std calculated with Pandas as: 0.8166830292990428 in 0 nanoseconds.
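
The two values printed above agree only to four decimal places. This is expected: `np.std` defaults to `ddof=0` (population standard deviation), while the Pandas `.std()` method defaults to `ddof=1` (sample standard deviation). The sketch below shows the difference on a small made-up Series; `values` and the other variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up data set, used only to show the effect of ddof.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

population_std = np.std(values.to_numpy())  # divides by N (ddof=0)
sample_std = values.std()                   # divides by N - 1 (ddof=1)
matched_std = values.std(ddof=0)            # Pandas asked to match NumPy

print(population_std, sample_std, matched_std)
```

Passing the same `ddof` to both libraries makes the results agree, which is worth doing before comparing their output.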


## Plotting with Pandas