![Data Applications](https://www.durhamtech.edu/themes/custom/durhamtech/images/durham-tech-logo-web.svg) 

## Manipulating Data with Pandas – The Fundamentals
The pandas package in Python is an industry standard that allows analysts to work with small to medium data sets.  Pandas enables an analyst to quickly clean data and gather insights.  The purpose of this lecture is to expose you to the core capabilities of the package.  It is perhaps the most important data science package, because it covers much of what people otherwise do in Excel and SQL: viewing, filtering, joining, and summarizing tables.

---

### Set Up
1.	Go to GitHub
2.	Download the 'SPY.csv', 'Inventory_Data.csv' and 'Demand_Plan.csv' files
3.	Move these files to a dedicated folder on your desktop or other location
4.	Open the command terminal in Anaconda Navigator and run 'pip install pandas'



### Needed Packages
1.	pandas
2.  numpy
3.  datetime
---

# Table of Contents

### The basics
#### <a href='#1'>What are Pandas DataFrames?</a>
#### <a href='#2'>DataFrame From 1D Array</a>
#### <a href='#3'>DataFrame From 2D Array</a>
#### <a href='#4'>Create DataFrame From a Dictionary</a>
#### <a href='#5'>Create a Pandas Series Object (i.e. a DataFrame column) using a Python list</a>

### Interacting with Data Frames
#### <a href='#6'>Accessing A DataFrame</a>
#### <a href='#7'>Attribute Access</a>
#### <a href='#8'>Slicing Ranges</a>
#### <a href='#9'>Selection by Position Using .iloc Attribute</a>
#### <a href='#10'>Boolean Indexing</a>

### Examining, Grouping & Describing
#### <a href='#11'>Some Basic Statistics on a DataFrame</a>
#### <a href='#12'>Reading in data from a CSV</a>
#### <a href='#13'>Head & Tails</a>
#### <a href='#14'>Filtering</a>
#### <a href='#15'>Changing Column Attributes</a>
#### <a href='#16'>Grouping</a>
#### <a href='#17'>Exporting Data</a>

### Combining, Editing, and Time
#### <a href='#19'>Concatenating Frames</a>
#### <a href='#20'>Merging Frames</a>
#### <a href='#21'>Renaming Columns</a>
#### <a href='#22'>Dates & Time</a>
#### <a href='#23'>Sorting Columns</a>
#### <a href='#24'>Shifting Columns</a>

### Loops, Functions, and DataFrames
#### <a href='#26'>Resetting an Index</a>
#### <a href='#27'>Creating new columns from old columns</a>
#### <a href='#28'>Lambda Functions</a>
#### <a href='#29'>Looping through Columns</a>

### Pivoting & Misc. Methods
#### <a href='#31'>Rolling columns</a>
#### <a href='#32'>Pivoting</a>
#### <a href='#33'>Transpose</a>
#### <a href='#34'>Removing Duplicates</a>
#### <a href='#35'>Dropping Rows with Null values</a>
#### <a href='#36'>Filling Null Values</a>
#### <a href='#38'>Concluding Remarks</a>
#### <a href='#55'>Weekly Readings/Videos</a>


### Practice
#### <a href='#39'>Exercise Set 1</a>
#### <a href='#40'>Exercise Set 2</a>
#### <a href='#18'>Exercise Set 3</a>
#### <a href='#25'>Exercise Set 4</a>
#### <a href='#30'>Exercise Set 5</a>
#### <a href='#37'>Exercise Set 6</a>



<a id='1'></a>
## What are Pandas DataFrames?
Pandas DataFrames are data structures that contain data organized in two dimensions, rows and columns, which are themselves organized via labels. In most cases, Pandas DataFrames are built using the DataFrame Constructor to which you can pass two-dimensional data (list, tuple and sequences, or NumPy array), dictionaries, or time series data -- to name a few data types.
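
As a minimal sketch of the idea above, the cell below builds a small DataFrame from a list of rows and looks at its labels. The product names and numbers are made up for illustration and do not come from the course files.

```python
import pandas as pd

# Hypothetical data: three products with a price and a stock count.
rows = [["widget", 2.50, 10], ["gadget", 4.00, 3], ["gizmo", 1.25, 7]]

# The constructor takes the 2D data plus optional column and row labels.
df = pd.DataFrame(rows, columns=["name", "price", "stock"], index=["r1", "r2", "r3"])

print(df.shape)               # (number of rows, number of columns)
print(list(df.columns))       # column labels
print(list(df.index))         # row labels
print(df.loc["r2", "price"])  # one cell, looked up by row label and column label
```

Every value in the frame can be reached through a row label and a column label, which is what "organized via labels" means in practice.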

In [None]:
# Ensure this file is in the same folder as 'SPY.csv', 'Demand_Plan.csv' & 'Inventory_Data.csv'
import pandas as pd
import numpy as np
from datetime import timedelta

spy_path = 'SPY.csv'
demand_path =  'Demand_Plan.csv'
inventory_path = 'Inventory_Data.csv'
test_path = 'test.csv'


<a id='2'></a>
### DataFrame From 1D Array

In [None]:
# Create random seed
np.random.seed(58) 

# 3 different 1 dimensional arrays of length 3
a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print (a1)
print (a2)
print (a3)

In [None]:
# Create our first DataFrame with the above numpy array
df0 = pd.DataFrame(a1)
df0

In [None]:
# print() shows a plain-text rendering, while ending a cell with the DataFrame itself displays a formatted table
print(df0)

In [None]:
# Check type
type(df0)

In [None]:
# DataFrame from all 3 numpy arrays
df0 = pd.DataFrame([a1, a2, a3])
df0

In [None]:
# We can set the column and index names
df0 = pd.DataFrame([a1, a2, a3],columns=['col_a1','col_a2','col_a3'],index=['row_a','row_b','row_c'])
df0

In [None]:
# Adding a column requires that its length match the number of rows in the DataFrame
df0['col4']=a2
df0

<a id='3'></a>
### DataFrame From 2D Array

In [None]:
# Create a DataFrame from 2D np.array
np.random.seed(63)
array_2d = np.array(np.random.randn(9)).reshape(3,3)
array_2d

In [None]:
# Again you can label your columns and indexes however you please
df0 = pd.DataFrame(array_2d,columns=['1stColumn','Another_Column','ThirdOne'] \
                   , index=[58,12,725]) 

df0

<a id='4'></a>
### Create DataFrame From a Dictionary

In [None]:
# Create a DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
dict1

In [None]:
# Assign the indexes
df1 = pd.DataFrame(dict1,index=[1,2,3]) 
df1

In [None]:
# We can add a list with strings and ints as a column 
df1['Mixed'] = ["Apples", 92, "Cars"]
df1

<a id='5'></a>
### Create a Pandas Series Object (i.e. a DataFrame column) using a Python list

In [None]:
# Every column is a series object
type(df1['Mixed'])

In [None]:
# View one column
df1['Mixed']

In [None]:
# Different datatypes in a column
print(type(df1['Mixed'][1]), type(df1['Mixed'][2]))

In [None]:
# Create a Series from a Python list
s = pd.Series([21,15,32]) # an automatic index is created in numerical sequence order, 0,1,2...
s

In [None]:
# Create a Series from a Python list, this time with a user-specified index
s2 = pd.Series([21, 15, 32], index = ['h','i','j']) #specific index
s2

In [None]:
# View element
s2['h']

<a id='39'></a>
## -------------PRACTICE-------------

1. In the cell below, create a 3x3 Data Frame with each value as '1' and add unique column names.

2. In the cell below, find the type of the first row of the first column of your new DataFrame.

3. In the cell below, find and print the value of the center cell of your DataFrame.

4. In the cell below, create a list with elements 'a', 'b', and 'c', transform it into a DataFrame, and find the type of the second row of the first column.

5. In the cell below, create a for loop that builds a list of length 10 with the numbers 0 through 9, then convert this new list into a DataFrame.

<a id='6'></a>
### Accessing A DataFrame


In [None]:
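# We can add the Series s to the DataFrame above as a new column
# Values are matched on index labels, not position: s has labels 0,1,2 and df1 has 1,2,3,
# so label 3 has no match and is filled with NaN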
df1['Series'] = s
df1
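
The label matching can be seen more directly in a small sketch. The frame and column names below are hypothetical; the Series has the same values as `s` above.

```python
import pandas as pd

frame = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])
s = pd.Series([21, 15, 32])  # default index labels 0, 1, 2

# Assigning a Series aligns on index labels: label 3 has no partner in s.
frame["by_label"] = s

# Assigning a plain NumPy array ignores labels and fills the rows in order.
frame["by_position"] = s.to_numpy()

print(frame)
```

If the intent is "first value goes in the first row", converting to an array (or giving the Series the same index as the frame) avoids the unexpected NaN.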

In [None]:
# We can rename columns
df1 = df1.rename(columns = {'Mixed':'RenamedColumn'})
df1

In [None]:
# We can delete columns
del df1['RenamedColumn']
df1

In [None]:
# or drop columns with axis=1, which is the form we use the most (axis=0 drops rows)
# drop returns a new DataFrame; df1 itself is unchanged unless we reassign it or set inplace=True
df1.drop('a2',axis=1) # returns a copy

In [None]:
# Sanity Check
df1

In [None]:
# or drop rows
df1.drop(1,axis=0)

In [None]:
# Remove a column with inplace=True
df1.drop('Series',axis=1,inplace=True)
df1

<a id='7'></a>
### Attribute Access

In [None]:
# View 1 column
df1['a1']

In [None]:
# View several columns
df1[['a1','a3']]

<a id='8'></a>
### Slicing Ranges

In [None]:
# Slicing with [start:stop] selects rows by position
# This takes the first three rows, then the first 2 rows of that result
(df1[0:3][0:2])

In [None]:
# Let's print the first 2 elements of column a1
# The result is a new Series (a single labelled column)
df1['a1'][0:2]

In [None]:
# Print the 2 columns and the top 2 values
df1[['a1','a3']][0:2]

<a id='9'></a>
### Selection by Position Using .iloc Attribute

In [None]:
# View element
df1.iloc[0,0]

In [None]:
# Get the 1st and 2nd rows and the 1st and 2nd columns (the end position is excluded)
df1.iloc[0:2,0:2]

In [None]:
# Can also use 2 'lists' of position numbers with iloc
df1.iloc[[0,2],[0,2]]

In [None]:
# Data only from the row at position 1 (the second row, whose index label is 2)
print (df1.iloc[1])
print('\n')
print (df1.iloc[1,:])
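
Because `.iloc` counts positions from 0 while `.loc` uses index labels, the two can return different rows when the index does not start at 0. A minimal sketch with a made-up frame that has the same index labels as `df1`:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=[1, 2, 3])

by_position = df.iloc[1]["a"]  # second row by position
by_label = df.loc[1, "a"]      # row whose index label is 1 (the first row here)

print(by_position, by_label)
```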

<a id='10'></a>
### Boolean Indexing

In [None]:
# return  full rows where a2>0
df1[df1['a2']>0]

# The df1['a2']>0 checks condition and returns boolean (T/F)
# The df1[] outside of it only selects the rows where this is true

In [None]:
# return column a3 values where a2 >0
df1['a3'][df1['a2']>0]
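
To see the two steps separately, the sketch below stores the True/False mask in its own variable before using it. The values are invented so the result is easy to check by eye; `.loc[mask, 'a3']` is shown as an alternative way to write the row-and-column selection in one step.

```python
import pandas as pd

df = pd.DataFrame({"a2": [-1.0, 0.5, 2.0], "a3": [7, 8, 9]})

mask = df["a2"] > 0        # Series of booleans, one per row
print(mask.tolist())

positive_rows = df[mask]               # full rows where the mask is True
a3_when_positive = df.loc[mask, "a3"]  # only column a3 for those rows
print(a3_when_positive.tolist())
```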

In [None]:
# If you want the values in a NumPy array
npg = df1.loc[:,"a2"].values # without .values this returns an indexed Series
print(type(npg))
print()
npg

<a id='40'></a>
## -------------PRACTICE-------------

1. In the cell below, create a new 4x4 DataFrame of random numbers between 0 and 1.

2. In the cell below, print the 2nd row of the 3rd column.

3. In the cell below, create a new DataFrame using only the 1st and 3rd column.

4.  In the cell below, create a new DataFrame from problem 1 where you include only rows where the first column is greater than .4, your answer may return an empty DataFrame.

5. In the cell below, using the iloc attribute, print the 4th row of the DataFrame from problem 1.

<a id='11'></a>
### Some Basic Statistics on a DataFrame

In [None]:
# Show general statistics
df1.describe()

In [None]:
# View only the desired statistics; this works the same as selecting rows and columns in any DataFrame
df1.describe().loc[['mean','std'],['a2','a3']]

In [None]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False).head()

<a id='12'></a>
### Reading in data from a CSV

In [None]:
#The read_csv method requires one argument, the file path, to the CSV file you want to read.
demand_data = pd.read_csv(demand_path)

<a id='13'></a>
### Head & Tails

In [None]:
#The head method defaults to displaying the first 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
#In this case we use 10
demand_data.head(10)

In [None]:
#The tail method defaults to displaying the last 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
demand_data.tail(10)

<a id='14'></a>
### Filtering 

In [None]:
# Filtering in pandas works with the standard Python comparison operators: equal to '==', not equal to '!=',
# greater than '>', less than '<', greater than or equal to '>=', and less than or equal to '<='.  
# The example below shows 'demand_data' being filtered by 'Product_Family' to only include data from the 'PF_1' 
# product family.
pf_1_demand = demand_data[demand_data['Product_Family']=='PF_1']
pf_1_demand.head()

In [None]:
# Multiple filters can be combined using '&' (and).  Each condition must be wrapped in its own parentheses.
# The below gives an example of filtering 'demand_data' 
# to only show the product family 'PF_1' at warehouse 'W_A'.  
pf_1A_demand = demand_data[(demand_data['Product_Family']=='PF_1') & (demand_data['Warehouse']=='W_A')]
pf_1A_demand.head()
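
Besides `&`, conditions can be combined with `|` (or), negated with `~` (not), or replaced by `isin` when checking against a list of values. The sketch below uses a small stand-in table; the column names follow 'demand_data', but the rows are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "Product_Family": ["PF_1", "PF_1", "PF_2", "PF_3"],
    "Warehouse": ["W_A", "W_B", "W_A", "W_C"],
    "Demand": [100, 40, 75, 10],
})

# Rows that are in PF_1 OR stored at warehouse W_C
either = demo[(demo["Product_Family"] == "PF_1") | (demo["Warehouse"] == "W_C")]

# Rows whose warehouse is any of the listed values
in_list = demo[demo["Warehouse"].isin(["W_A", "W_B"])]

# Rows that are NOT in PF_1
not_pf1 = demo[~(demo["Product_Family"] == "PF_1")]

print(len(either), len(in_list), len(not_pf1))
```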

<a id='15'></a>
### Changing Column Attributes


<a href ='https://numpy.org/doc/stable/reference/arrays.dtypes.html'>Data Types</a>

In [None]:
# Columns are automatically assigned a data type when the data is read in, but the type can be changed.  
# The below converts several columns of 'demand_data' from integers to strings ('str').  
# A full list of available types can be found at the link above.
demand_data['Year'] = demand_data['Year'].astype('str')
demand_data['Month'] = demand_data['Month'].astype('str')
demand_data['Weeks in Month'] = demand_data['Weeks in Month'].astype('str')
demand_data['Lookup Value'] = demand_data['Lookup Value'].astype('str')
demand_data['SKU_ID'] = demand_data['SKU_ID'].astype('str')
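
One reason the type matters: `+` adds numbers but joins strings. The following sketch, on a made-up two-row table, checks the types with `dtypes` before converting.

```python
import pandas as pd

demo = pd.DataFrame({"Year": [2021, 2021], "Month": [1, 12], "Demand": [5, 7]})
print(demo.dtypes)  # all three columns start as integers

demo["Year"] = demo["Year"].astype("str")
demo["Month"] = demo["Month"].astype("str")

combined = demo["Year"] + demo["Month"]  # string concatenation, not addition
total = demo["Demand"].sum()             # still numeric, so this is a real sum

print(combined.tolist(), total)
```

Identifier-like columns such as years or SKU numbers are usually safer as strings, so they are never summed or averaged by accident.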

<a id='16'></a>
### Grouping

<a href ='https://pandas.pydata.org/docs/reference/groupby.html'>Groupby Methods</a>

In [None]:

# The groupby method partitions data into groups by specified columns and 
# consolidates the numerical columns using a specified method.  
# The below gives an example grouping 'demand_data' by 'Product_Family', 
# 'Month', and 'Year' and showing the sum of 'Demand' by product family, month, and year.  
# numeric_only=True skips the columns converted to strings above, which would otherwise be concatenated.
# A full list of methods that can be applied to the consolidated data can be found at the link above.
data = demand_data.groupby(['Product_Family','Month','Year']).sum(numeric_only=True)
data.head()
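
The same pattern on a four-row stand-in table makes the arithmetic easy to verify by hand. The rows are invented; selecting `['Demand']` before aggregating is one way to limit the result to a single column, and `agg` is one way to request several statistics at once.

```python
import pandas as pd

demo = pd.DataFrame({
    "Product_Family": ["PF_1", "PF_1", "PF_2", "PF_2"],
    "Month": ["1", "1", "1", "2"],
    "Demand": [10, 20, 5, 7],
})

# Sum of Demand for each (Product_Family, Month) pair
totals = demo.groupby(["Product_Family", "Month"])["Demand"].sum()

# The group keys become the index; reset_index turns them back into columns
flat = totals.reset_index()

# Several statistics per product family
summary = demo.groupby("Product_Family")["Demand"].agg(["sum", "mean"])

print(flat)
print(summary)
```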

<a id='17'></a>
### Exporting Data

In [None]:
# The below exports 'data' to a CSV located at your 'test_path'.
data.to_csv(test_path)

<a id='18'></a>
## -------------PRACTICE-------------

1. In the below cell, store the data from the 'Inventory_Data.csv' in the variable 'inventory_data'. 

2. In the below cell, display the first 10 rows of 'inventory_data'. 

3. In the below cell, create a dataframe from 'demand_data' that shows the average 'Demand' by 'Warehouse', 'Month', and 'Year' for only warehouse 'W_A' and 'W_B'.

4. In the below cell, using 'demand_data', show the basic statistics for the 'Demand' at warehouse 'W_C'.

5. In the below cell, using 'demand_data', find the 'SKU_ID' with the highest demand for each 'Product_Family' at warehouse 'W_B'.

6. In the below cell, retrieve the first row of 'demand_data' and export it to a CSV with the name 'My_First_Export'. 

<a id='19'></a>
### Concatenating Frames

In [None]:
# The concat function stacks frames on top of each other, lining up identically named columns.  The below script stacks
# two product family data sets.
pf_1 = demand_data[demand_data['Product_Family'] == 'PF_1']
pf_2 = demand_data[demand_data['Product_Family'] == 'PF_2']
pf_1_and_2 = pd.concat([pf_1,pf_2])
pf_1_and_2.head()

In [None]:
# Older pandas versions also offered DataFrame.append for stacking two frames.  It was deprecated in pandas 1.4 and
# removed in 2.0, so use concat for any number of frames.  ignore_index=True renumbers the rows from 0.
pf_1_and_2 = pd.concat([pf_1, pf_2], ignore_index=True)
pf_1_and_2.head()
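
A short sketch of stacking more than two frames, using three invented frames. It also shows what happens to the index when it is not renumbered.

```python
import pandas as pd

a = pd.DataFrame({"SKU_ID": ["1", "2"], "Demand": [10, 20]})
b = pd.DataFrame({"SKU_ID": ["3"], "Demand": [5]})
c = pd.DataFrame({"SKU_ID": ["4"], "Demand": [8]})

# Each frame keeps its own index labels, so labels repeat after stacking
stacked = pd.concat([a, b, c])
print(stacked.index.tolist())

# ignore_index=True replaces them with a fresh 0..n-1 index
renumbered = pd.concat([a, b, c], ignore_index=True)
print(renumbered.index.tolist())
```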

<a id='20'></a>
### Merging Frames

In [None]:
# The merge function joins two DataFrames on shared key columns, much like a join in SQL.  The below does a full or
# 'outer' merge, which keeps all rows from both data sets.  'inventory_data' is the frame read from
# 'Inventory_Data.csv' in Exercise Set 3.
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='outer', on=['SKU_ID','Warehouse','Product_Family'])
consol_data.head()

In [None]:
# The below does a left merge, which keeps all rows from month_1_demand, but only those rows from inventory_data
# whose 'SKU_ID', 'Warehouse', and 'Product_Family' all match a row in month_1_demand.
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='left', on=['SKU_ID','Warehouse','Product_Family'])
consol_data.head()

In [None]:
# The below does a right merge, which keeps all rows from inventory_data, but only those rows from month_1_demand
# whose 'SKU_ID', 'Warehouse', and 'Product_Family' all match a row in inventory_data.
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='right', on=['SKU_ID','Warehouse','Product_Family'])
consol_data.head()
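
The difference between the merge types is easiest to see by counting rows. The sketch below uses two tiny invented frames that share only two of their SKUs, and merges on a single key for brevity. Passing `indicator=True` adds a `_merge` column recording whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

demand = pd.DataFrame({"SKU_ID": ["1", "2", "3"], "Demand": [10, 20, 30]})
inventory = pd.DataFrame({"SKU_ID": ["2", "3", "4"], "Inventory": [5, 6, 7]})

counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(demand, inventory, how=how, on="SKU_ID", indicator=True)
    counts[how] = len(merged)

print(counts)
print(merged)  # the outer merge: unmatched cells are NaN
```

SKU '1' exists only in `demand` and SKU '4' only in `inventory`, so the outer merge has one more row than the left or right merge, and the inner merge keeps just the two shared SKUs.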

<a id='21'></a>
### Renaming Columns

In [None]:
# The below renames the 'Lookup Value' column to 'Unique_ID'
demand_data = demand_data.rename(columns={"Lookup Value": "Unique_ID"})

<a id='22'></a>
### Dates & Time

In [None]:
# The 'to_datetime' function converts a column of date strings into pandas datetime values
spy_data = pd.read_csv(spy_path)
spy_data['Date'] = pd.to_datetime(spy_data['Date'])
spy_data.head()

In [None]:
# Datetime objects have many attributes including month, day, and year.  
spy_data['Month'] = pd.DatetimeIndex(spy_data['Date']).month
spy_data['Year'] = pd.DatetimeIndex(spy_data['Date']).year
spy_data['Day'] = pd.DatetimeIndex(spy_data['Date']).day
spy_data.head()

In [None]:
# The below is a quick formula that maps each month to its calendar quarter (months 1-3 give 1, 4-6 give 2, and so on)
spy_data['Quarter'] = (spy_data['Month'] -1) // 3 + 1
spy_data.head()
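
For comparison, pandas also exposes these date parts through the `.dt` accessor on a datetime column, including a built-in `quarter`. The sketch below, on four invented dates, checks that the formula and the accessor agree.

```python
import pandas as pd

dates = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-15", "2021-04-01", "2021-09-30", "2021-12-31"])
})

dates["Month"] = dates["Date"].dt.month
dates["Quarter_formula"] = (dates["Month"] - 1) // 3 + 1  # same formula as above
dates["Quarter_dt"] = dates["Date"].dt.quarter             # built-in equivalent

print(dates)
```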

In [None]:
# Dates can be shifted arithmetically using the timedelta class. The below creates a new column 30 days ahead of the
# date column, prints it, and then drops it again.
spy_data['Date_+_30'] = spy_data['Date'] + timedelta(days=30)
print(spy_data.head())  # print is needed because only the last expression in a cell is displayed
spy_data = spy_data.drop(columns=['Date_+_30'])

<a id='23'></a>
### Sorting Columns

In [None]:
# The 'sort_values' method sorts the data by a provided column name
spy_data = spy_data.sort_values(by=['Date'], ascending=False)
spy_data.head()

<a id='24'></a>
### Shifting Columns

In [None]:
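# The below shows a use of the shift method, which moves a column up or down a specified integer number of rows
# relative to the rest of the data.  Because spy_data is sorted newest first, shift(-1) lines each row up with the
# previous day's price, so the ratio minus 1 is the daily return.  We also remove some unnecessary columns.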
spy_data = spy_data.drop(columns=['Open','High','Low','Close','Volume'])
spy_data['Return_%'] = spy_data['Adj Close']/spy_data['Adj Close'].shift(-1)-1
spy_data.head()
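
A three-row worked example of the return calculation, with invented prices. When the data is sorted oldest first instead, the built-in `pct_change` method gives the same numbers without an explicit shift.

```python
import pandas as pd

# Newest first, as in the cell above (prices are made up)
newest_first = pd.DataFrame({"Adj Close": [110.0, 100.0, 80.0]})
newest_first["Return_%"] = newest_first["Adj Close"] / newest_first["Adj Close"].shift(-1) - 1

# Oldest first: pct_change compares each row with the row before it
oldest_first = pd.DataFrame({"Adj Close": [80.0, 100.0, 110.0]})
oldest_first["Return_%"] = oldest_first["Adj Close"].pct_change()

print(newest_first)
print(oldest_first)
```

The oldest row has no earlier price to compare against, so its return is NaN in both versions.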

<a id='25'></a>
## -------------PRACTICE-------------

1. In the cell below, find all demand_data for 'Product_Family' 'PF_3', then use the merge method to pull in the inventory positions for those SKUs.  Your final data set should contain no rows with blank 'Demand'.  Output your data to a CSV titled 'PF_3_All_Data'.

2. In the cell below, mirror the DataFrame created in the section on shifting, except calculate the two day return instead of the daily return.  You will need to read in the SPY data again, and ensure to drop 'Open', 'High','Low','Close',and 'Volume'.  

3. In the cell below, find the average daily return of SPY for February, 2021.

4.  Pull the demand data for 'Product_Family' 'PF_1' in month '1', and inventory data for 'Product_Family' PF_2.  Concatenate these two Dataframes and include a new column to the resulting DataFrame specifying which data source each row of data is from.

5. Find the daily standard deviation of the returns of SPY for March of 2021. 

<a id='26'></a>
### Resetting an Index

In [None]:
# Note that the index of the below frame is not sequential: filtering keeps each row's original index label.
pf_1 = demand_data[demand_data['Product_Family'] == 'PF_1']
pf_1.head()

In [None]:
# The reset_index method renumbers the index of a frame starting from 0.  By default pandas keeps the old index
# as a new column named 'index'; passing drop=True discards it instead.
pf_1 = pf_1.reset_index(drop=True)
pf_1.head()


<a id='27'></a>
### Creating new columns from old columns

In [None]:
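# Using columns to create new columns works very similarly to standard Python variables.  The below creates a new
# ID combining the 'Year','Month', and 'Weeks in Month' columns.  Because these columns were converted to strings
# earlier, '+' joins the text rather than adding numbers.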
pf_1['New_ID'] = pf_1['Year'] + pf_1['Month'] + pf_1['Weeks in Month']
pf_1.head()

<a id='28'></a>
### Lambda Functions

In [None]:
# Lambda functions are a way to apply a custom operation to every value in a column when no built-in method exists,
# while still avoiding explicit for loops.  
# The below lambda function creates a new column that finds the squared value of 'Adj Close'
spy_data['Price Squared'] = spy_data['Adj Close'].map(lambda x: x ** 2)
spy_data.head()
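
For simple arithmetic like squaring, the same result is available by operating on the whole column at once, which is usually faster on large data. A minimal sketch on an invented price Series:

```python
import pandas as pd

prices = pd.Series([2.0, 3.0, 4.0])

squared_map = prices.map(lambda x: x ** 2)  # applies the lambda to each value
squared_vec = prices ** 2                   # same operation on the whole column

print(squared_map.tolist())
print(squared_map.equals(squared_vec))
```

`map` with a lambda remains the more flexible option when the per-value logic is something pandas has no operator or method for.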

<a id='29'></a>
### Looping through Columns

In [None]:
# A for loop over a list of column names applies the same method or function to several columns quickly
spy_data = pd.read_csv(spy_path)
target_cols = ['Open','High','Low','Close']
for col in target_cols:
    spy_data[col + '_Price Squared'] = spy_data[col].map(lambda x: x ** 2)
    
spy_data.head()

<a id='30'></a>
## -------------PRACTICE-------------

1. In the cell below, design a lambda function that will multiply the column by 4, then add 5.  Apply this function to the 'Low' and 'High' columns in 'spy_data' using a for loop; you will need to read in the data again.  Come up with a naming convention to uniquely identify your new columns.

2. In the cell below, find all demand data for month '1' from 'demand_data', merge the 'inventory_data' onto it, and create a new column that calculates the ratio of demand to inventory for each row.

3. In the cell below, find all of the inventory_data at warehouse 'W_A'.  Then reset the resulting DataFrame's index, and be sure the new DataFrame does not have any new columns.

4. In the cell below, find all of the inventory_data at warehouse 'W_C'. Then reset the resulting DataFrame's index, and be sure the new DataFrame does not have any new columns.  Then, sort the resulting DataFrame by inventory amount in descending order.

5. In the cell below, create a DataFrame from 'spy_data' that only shows data from the 3rd quarter of 2021.  Assume a standard calendar year.

<a id='31'></a>
### Rolling columns

In [None]:
# The rolling method applies an operation over a moving window of rows to create a new column.  The below calculates
# the rolling 30-row average price of SPY, which is a 30-trading-day average when each row is one trading day.
# The first 29 rows are NaN because their window is not yet full.
spy_data = pd.read_csv(spy_path)
spy_data['Date'] = pd.to_datetime(spy_data['Date'])
spy_data = spy_data.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
spy_data = spy_data.drop(columns=['Open','High','Low','Close','Volume'])
spy_data['30-Day Average Price'] = spy_data['Adj Close'].rolling(30).mean()
spy_data = spy_data.sort_values(by=['Date'], ascending=False)
spy_data.head()
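
A window of 3 on four invented prices shows the mechanics on a scale that can be checked by hand. The `min_periods` argument is shown as one way to get a value even before the window is full.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0])

# Each value is the mean of the current row and the 2 rows before it
rolling3 = prices.rolling(3).mean()

# min_periods=1 averages whatever rows are available so far
rolling3_partial = prices.rolling(3, min_periods=1).mean()

print(rolling3.tolist())
print(rolling3_partial.tolist())
```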

<a id='32'></a>
### Pivoting

In [None]:
# The pivot table, just as in Excel, allows you to quickly consolidate data around defined columns.  It is very similar
# in concept to the groupby method.  By default pivot_table takes the mean of 'values' for each group.
pd.pivot_table(demand_data, values = 'Demand', index=['Warehouse','Product_Family']).reset_index()
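
To make the link to groupby concrete, the sketch below computes the same averages both ways on a three-row stand-in table (column names follow 'demand_data'; the rows are invented).

```python
import pandas as pd

demo = pd.DataFrame({
    "Warehouse": ["W_A", "W_A", "W_B"],
    "Product_Family": ["PF_1", "PF_1", "PF_2"],
    "Demand": [10, 30, 5],
})

# pivot_table with its default aggregation (mean)
pivoted = pd.pivot_table(demo, values="Demand",
                         index=["Warehouse", "Product_Family"]).reset_index()

# The equivalent groupby
grouped = demo.groupby(["Warehouse", "Product_Family"])["Demand"].mean().reset_index()

print(pivoted)
print(grouped)
```

A different aggregation can be requested through the `aggfunc` argument of `pivot_table`, for example `aggfunc='sum'`.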

<a id='33'></a>
### Transpose

In [None]:
# The transpose attribute (.T), as in linear algebra, will transpose a dataframe as if it were a matrix
test = pd.DataFrame([[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4]])
test


In [None]:
# Note how the transpose has turned each row into a column

test.T

<a id='34'></a>
### Removing Duplicates

In [None]:
# Note that test has 3 copies of the same row.  The drop_duplicates method keeps the first copy and removes the rest
test

In [None]:
test.drop_duplicates()

<a id='35'></a>
### Dropping Rows with Null values

In [None]:
# The first row below has only four values, so its last column is filled with a null (NaN).
# To remove rows with null values, use the dropna method
test = pd.DataFrame([[0,2,3,4],[0,1,2,3,4],[0,1,2,3,4]])
test

In [None]:
test.dropna()

<a id='36'></a>
### Filling Null Values

In [None]:
# The fillna method allows you to keep rows with null values and control what fills them.  In the below, 0 replaces
# the null values
test.fillna(0)
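
Before dropping or filling, it often helps to count the nulls per column. The sketch below does that on a small invented frame, then compares `dropna` with filling each column by its own mean, which is one common alternative to a constant.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

null_counts = t.isna().sum()       # number of nulls in each column
dropped = t.dropna()               # keeps only rows with no nulls at all
filled_mean = t.fillna(t.mean())   # each null replaced by its column's mean

print(null_counts)
print(dropped)
print(filled_mean)
```

Neither method changes `t` itself; both return a new DataFrame, so assign the result if you want to keep it.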

<a id='37'></a>
## -------------PRACTICE-------------

1. In the cell below, find the rolling 30 day standard deviation of the daily returns for the SPY from January 2021 through September 2021.

2. In the cell below, create a lambda function that will create a new column in spy_data with '1' for days with positive returns, and null for days without.  Drop the rows with null values in this column.

3. Go to finance.yahoo.com, and download the historical data of your favorite stock or ETF for September of 2021.  Then, import the data, sort it in descending order, transpose it, then export it to a CSV. 

4. Read in a fresh pull of your new stock data and SPY data, then do a left merge of the data, with your new stock being the left DataFrame. 

5. Using the DataFrame from question 4, create a new column that shows the difference in returns between SPY and your security. 

<a id='55'></a>
# Weekly Readings/Videos

https://www.thinkful.com/blog/what-is-data-science/
    
https://hbr.org/2013/11/how-to-start-thinking-like-a-data-scientist

http://www.tylervigen.com/spurious-correlations

<a id='38'></a>
# Concluding Remarks
Pandas continues to evolve and offer more and more capabilities.  This lecture covers only the rudimentary aspects of the package; as you work with it more, you will keep finding new methods and new ways to combine it with other Python functionality.  