# Intro to Pandas

In [None]:
# imports for this notebook
import numpy as np
import pandas as pd
from numpy.random import randn

## Pandas Series

In [None]:
# We can create a numpy array from a list like this:
series_1 = ['Delaware', 'Georgia', 'New Hampshire', 'Tennessee', 'Arkansas']
np.array(series_1)


In [None]:
# We can also create a pandas Series from a list like this:
pd.Series(data = series_1)


In [None]:
series_2 = [1, 4, 9, 16, 25] 

pd.Series(data = series_2)

In [None]:
pd.Series(data = series_1, index = series_2) # axis labels set to series_2 values

### Interactive Learning Moment:
1. Hover your cursor over "Series" below. You will see information pop up (Documentation).
2. The Python signature shows the parameters (the names inside the parentheses) and their order.
3. When calling a function, method, or class constructor such as `pd.Series`, you can specify which value is assigned to each parameter by name (e.g., pd.Series(data = series_1, index = series_2)).
4. If an argument is not specified by name, Python assigns it by position, following the order given in the definition.
5. Parameters not specified are set to default values.
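
The points above can be sketched with a short comparison of keyword and positional arguments. The names `labels` and `positions` are illustrative stand-ins for `series_1` and `series_2`, not part of the notebook.

```python
import pandas as pd

# Illustrative data (hypothetical names, shortened versions of series_1 / series_2)
labels = ['Delaware', 'Georgia', 'New Hampshire']
positions = [1, 4, 9]

# Keyword arguments: each value is named, so the order does not matter
by_keyword = pd.Series(index=positions, data=labels)

# Positional arguments: the first value is used as data, the second as index
by_position = pd.Series(labels, positions)

# Both calls build the same Series
print(by_keyword.equals(by_position))  # True
```

Naming the parameters is usually the safer habit, since swapping two positional arguments silently produces a different Series.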

In [None]:
pd.Series(series_1, series_2) 


In [None]:
test = pd.Series(series_1, series_2)
type(test) # verify the type of object created (should be a pandas Series)

In [None]:
# We can also pass in a dictionary to create a Series (keys become the index)
# Remember dictionaries are a python data type made up of key:value pairs (like an address book)
# Dictionaries are created with curly braces {}, not square brackets [] or parentheses ()
# Here is an example dictionary:
d = {'1': 'Delaware',
     '4': 'Georgia',
     '9': 'New Hampshire',
     '16': 'Tennessee',
     '25': 'Arkansas'}
type(d) # should be 'dict'

In [None]:
pd.Series(d) # you can pass in a dictionary to create a pandas Series (keys become the index)

In [None]:
# What would you expect the data type to be when using pd.Series on a dictionary?
type(pd.Series(d))
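
One detail worth checking when building a Series from a dictionary: the index labels keep the type of the keys. In the sketch below (a shortened, illustrative copy of `d`) the keys are strings, so the label `'4'` exists but the integer `4` does not.

```python
import pandas as pd

# Shortened copy of the dictionary above; note that the keys are strings
d = {'1': 'Delaware', '4': 'Georgia'}
s = pd.Series(d)

print(s.index.tolist())  # ['1', '4']
print(s['4'])            # Georgia (looked up by the string label '4')
print('4' in s.index)    # True
print(4 in s.index)      # False: the integer 4 is not a label here
```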

## Pandas DataFrame

In [41]:
# We will be using numpy's random number generator
# Set the random seed for reproducibility (so we all get the same random numbers)
np.random.seed(0)

In [None]:
# Hover over `randn` to see what the parameters are (you will have to scroll a bit -- look for "Parameters")
# Here we create a DataFrame with 5 rows and 4 columns of random numbers
# IMPORTANT: the shape is given as (rows, columns), so rows come first, then columns (the same order MATLAB uses)
df = pd.DataFrame(randn(5,4), index = ['A','B','C','D','E'], columns = ['W','X','Y','Z'])
df
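
As a quick sanity check on the rows-then-columns order, the following sketch rebuilds the same kind of DataFrame and inspects its shape. It assumes only NumPy and pandas; `values` is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(0)                # same seed as above, for reproducibility
values = np.random.randn(5, 4)   # 5 rows, 4 columns

df = pd.DataFrame(values, index=list('ABCDE'), columns=list('WXYZ'))

print(values.shape)  # (5, 4)
print(df.shape)      # (5, 4): (number of rows, number of columns)
```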

### Indexing

In [None]:
df.columns

In [None]:
df.rows # raises an AttributeError because DataFrames do not have a 'rows' attribute (use df.index instead)

In [None]:
df.index # shows the row index labels

In [None]:
# Index columns

df['W']  # get a single column

In [None]:
# Select multiple columns

df[['W','Z']] # double brackets for multiple columns

In [None]:
# What is the data type of a single column?

type(df['W'])

In [None]:
# Select a single row

df.loc['A'] # loc --> location (label based)

In [None]:
# Select a single row by integer position
df.iloc[0] # iloc --> integer location (position based)


#### Learning Moment:
- Python indexing can be unintuitive at first.
- Indexing means pulling out elements from a larger collection (list, dict, DataFrame, etc.).
- Python indexing is zero-based: positions start at 0.
- In a list [1,2,3,4,5], the index position of integer 1 is actually 0.
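
A minimal illustration of zero-based indexing with a plain Python list (the variable names are illustrative):

```python
numbers = [1, 2, 3, 4, 5]

first = numbers[0]          # position 0 holds the first element
last = numbers[-1]          # negative positions count from the end
first_three = numbers[0:3]  # slices include the start but exclude the stop

print(first)        # 1
print(last)         # 5
print(first_three)  # [1, 2, 3]
```

The same convention applies to `iloc`, which is why `df.iloc[0]` returns the first row.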

In [None]:
# What is the data type of a single row?
type(df.loc['A'])

A DataFrame is essentially a collection of pandas Series that share the same index: each column, and each row you select, is a Series.
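
To make this concrete, here is a small sketch on a toy DataFrame (the values are made up for illustration). Both a column and a row come back as a Series, and each carries the label it was selected by as its `name`.

```python
import pandas as pd

# Toy DataFrame with made-up values
df = pd.DataFrame({'W': [1.0, 2.0], 'X': [3.0, 4.0]}, index=['A', 'B'])

col = df['W']      # a column is a Series indexed by the row labels
row = df.loc['A']  # a row is a Series indexed by the column labels

print(type(col).__name__, col.name, col.index.tolist())  # Series W ['A', 'B']
print(type(row).__name__, row.name, row.index.tolist())  # Series A ['W', 'X']
```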

In [None]:
# You can select rows and columns together

df.loc['B','Y']  # row B, column Y

In [None]:
df.loc[['A','B'],['W','Y']]  # rows A and B, columns W and Y
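
The same selection can also be written with `iloc` using integer positions. The sketch below uses a toy DataFrame with made-up values to compare the two, and to show one difference that often surprises newcomers: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import pandas as pd

# Toy DataFrame with made-up values
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['A', 'B', 'C'], columns=['W', 'X', 'Y'])

by_label = df.loc[['A', 'B'], ['W', 'Y']]   # rows A and B, columns W and Y
by_position = df.iloc[[0, 1], [0, 2]]       # the same cells, by position

print(by_label.equals(by_position))  # True

# Slicing: loc includes the end label, iloc excludes the end position
print(len(df.loc['A':'B']))  # 2 rows (A and B)
print(len(df.iloc[0:1]))     # 1 row (position 0 only)
```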

### Manipulating DataFrames

In [None]:
# refresh original dataframe view

df

In [None]:
# Add a new column

df['new_column'] = df['W'] + df['Y']
df

In [None]:
df.drop('new_column', axis=1)  # axis=1 means drop a column, while axis=0 would mean drop a row

In [None]:
# DataFrame manipulations do not happen in place (permanently) unless you request it (inplace=True)

df 

In [None]:
df.drop('new_column', axis = 1, inplace = True) # inplace = True modifies df itself; the default (inplace = False) is a safeguard that prevents accidental data loss


In [None]:
df # verify the column has been dropped
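
An alternative to `inplace = True` is to assign the result of `drop` back to a variable. The sketch below uses a toy DataFrame with made-up values; `trimmed` is an illustrative name.

```python
import pandas as pd

# Toy DataFrame with made-up values
df = pd.DataFrame({'W': [1, 2], 'Y': [3, 4]})
df['new_column'] = df['W'] + df['Y']

# drop returns a new DataFrame and leaves the original untouched
trimmed = df.drop('new_column', axis=1)
print('new_column' in df.columns)       # True
print('new_column' in trimmed.columns)  # False

# Reassigning makes the change stick without using inplace=True
df = df.drop(columns='new_column')      # columns=... is equivalent to axis=1
print(df.columns.tolist())              # ['W', 'Y']
```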

## Importing Data as DataFrame

In [65]:
my_data = pd.read_csv('data.csv')

In [None]:
my_data

### Practice Exercises

Each cell below will ask you to perform specific actions. Please save the notebook after you have completed all the exercises and push to GitHub under your branch.

In [None]:
# Confirm that `my_data` is a DataFrame

In [None]:
# Index the 'Pulse' column

In [None]:
# Index the 'Pulse' and 'Maxpulse' columns
my_data[['Pulse','Maxpulse']]

In [None]:
# Index row 2 data

In [None]:
# Add column '% Intensity' which is (Pulse / Maxpulse) * 100

In [None]:
# Drop the 'Calories' column permanently

In [None]:
# Challenge Exercise (Hint: my_data[my_data[...]])

# Index rows where Pulse is greater than 100 and display the raw values
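
As background for the hint, the pattern `frame[frame[...]]` is boolean-mask indexing. The sketch below shows the idea on a toy DataFrame whose column names and values are made up for illustration; it is not the exercise solution.

```python
import pandas as pd

# Toy DataFrame with made-up columns and values
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [90, 120, 105]})

mask = toy['score'] > 100   # a boolean Series, one True/False per row
selected = toy[mask]        # keeps only the rows where the mask is True

print(mask.tolist())                 # [False, True, True]
print(selected['score'].to_numpy())  # [120 105]
```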