
## Manipulating Data with Pandas – The Fundamentals
The pandas package in Python is an industry-standard tool for working with small to medium-sized data sets.  It lets an analyst quickly clean data and gather insights.  The purpose of this lecture is to expose you to the core capabilities of the package.  It is perhaps the most important data science package, because it supports the same kinds of tasks that people use Excel and SQL for.

![panda-wave](https://adultspaint.com/wp-content/uploads/2020/12/happy-panda-adult-paint-by-numbers.jpg)

---

### Set Up
1.	Go to GitHub
2.	Download the Pandas lecture folder and make sure it includes the 'SPY.csv', 'Inventory_Data.csv' and 'Demand_Plan.csv' files
3.	Move this folder to a dedicated folder on your desktop or other location
4.	Open the command terminal in Anaconda Navigator and run 'pip install pandas'



### Needed Packages
1.	pandas
2.  numpy
3.  datetime
---

# Table of Contents

### The basics
#### <a href='#1'>What are Pandas DataFrames?</a>
#### <a href='#2'>DataFrame From 1D Array</a>
#### <a href='#3'>DataFrame From 2D Array</a>
#### <a href='#4'>Create DataFrame From a Dictionary</a>
#### <a href='#5'>Create a Pandas Series Object (i.e. a DataFrame column) using a Python list</a>

### Interacting with Data Frames
#### <a href='#6'>Accessing A DataFrame</a>
#### <a href='#7'>Attribute Access</a>
#### <a href='#8'>Slicing Ranges</a>
#### <a href='#9'>Selection by Position Using .iloc Attribute</a>
#### <a href='#10'>Boolean Indexing</a>

### Examining, Grouping & Describing
#### <a href='#11'>Some Basic Statistics on a DataFrame</a>
#### <a href='#12'>Reading in data from a CSV</a>
#### <a href='#13'>Head & Tails</a>
#### <a href='#14'>Filtering</a>
#### <a href='#15'>Changing Column Attributes</a>
#### <a href='#16'>Grouping</a>
#### <a href='#17'>Exporting Data</a>

### Combining, Editing, and Time
#### <a href='#19'>Concatenating Frames</a>
#### <a href='#20'>Merging Frames</a>
#### <a href='#21'>Renaming Cells</a>
#### <a href='#22'>Dates & Time</a>
#### <a href='#23'>Sorting Columns</a>
#### <a href='#24'>Shifting Columns</a>

### Loops, Functions, and DataFrames
#### <a href='#26'>Resetting an Index</a>
#### <a href='#27'>Creating New Columns from Old Columns</a>
#### <a href='#28'>Lambda Functions</a>
#### <a href='#29'>Looping through Columns</a>

### Pivoting & Misc. Methods
#### <a href='#31'>Rolling columns</a>
#### <a href='#32'>Pivoting</a>
#### <a href='#33'>Transpose</a>
#### <a href='#34'>Removing Duplicates</a>
#### <a href='#35'>Dropping Rows with Null Values</a>
#### <a href='#36'>Filling Null Values</a>
#### <a href='#38'>Concluding Remarks</a>
#### <a href='#55'>Weekly Readings/Videos</a>


### Practice
#### <a href='#39'>Exercise Set 1</a>
#### <a href='#40'>Exercise Set 2</a>
#### <a href='#18'>Exercise Set 3</a>
#### <a href='#25'>Exercise Set 4</a>
#### <a href='#30'>Exercise Set 5</a>
#### <a href='#37'>Exercise Set 6</a>



<a id='1'></a>
## What are Pandas DataFrames?
Pandas DataFrames organize data in two dimensions, rows and columns, each identified by labels (like a matrix with column headers and row indices). You can create a DataFrame from one- or two-dimensional data (lists, tuples, other sequences, or NumPy arrays), dictionaries, or time series data. Pandas is built on top of NumPy, which we covered in the previous lecture.
![panda-bamboo](https://images.unsplash.com/photo-1594917668779-f95b4f72059f?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1yZWxhdGVkfDE0fHx8ZW58MHx8fHw%3D&w=1000&q=80)

In [1]:
# Ensure this file is in the same folder as 'SPY.csv', 'Demand_Plan.csv' & 'Inventory_Data.csv'
import pandas as pd
import numpy as np
from datetime import timedelta

spy_path = 'SPY.csv'
demand_path =  'Demand_Plan.csv'
inventory_path = 'Inventory_Data.csv'
test_path = 'test.csv'

<a id='2'></a>
### DataFrame From 1D Array

In [2]:
# Create random seed
np.random.seed(58) 

# 3 different 1 dimensional arrays of length 3
a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print(a1)
print(a2)
print(a3)

[-0.76018615 -2.10158401 -0.80975668]
[-0.00751726 -1.70436077  0.58140398]
[-0.96914406 -1.4809011  -0.52334255]


In [3]:
# Create our first DataFrame with the above numpy array
df0 = pd.DataFrame(a1)
df0

Unnamed: 0,0
0,-0.760186
1,-2.101584
2,-0.809757


In [4]:
# print() shows a plain-text rendering, while a bare expression at the end of a cell shows the notebook's formatted table
print(df0)

          0
0 -0.760186
1 -2.101584
2 -0.809757


In [5]:
# Check type
type(df0)

pandas.core.frame.DataFrame

In [6]:
# DataFrame from all 3 numpy arrays
df0 = pd.DataFrame([a1, a2, a3])
df0

Unnamed: 0,0,1,2
0,-0.760186,-2.101584,-0.809757
1,-0.007517,-1.704361,0.581404
2,-0.969144,-1.480901,-0.523343


In [7]:
# We can set the column and index names
df0 = pd.DataFrame([a1, a2, a3],columns=['col_a1','col_a2','col_a3'],index=['row_a','row_b','row_c'])
df0

Unnamed: 0,col_a1,col_a2,col_a3
row_a,-0.760186,-2.101584,-0.809757
row_b,-0.007517,-1.704361,0.581404
row_c,-0.969144,-1.480901,-0.523343


In [8]:
# A new column must have the same length as the DataFrame's existing rows
df0['col4']=a2
df0

Unnamed: 0,col_a1,col_a2,col_a3,col4
row_a,-0.760186,-2.101584,-0.809757,-0.007517
row_b,-0.007517,-1.704361,0.581404,-1.704361
row_c,-0.969144,-1.480901,-0.523343,0.581404
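
As a minimal sketch of the length rule noted above, the example below assigns one array that matches the row count and one that does not. The frame `demo` and its column names are illustrative assumptions, not part of the lecture files.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row frame used only for illustration
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0]})

# A length-3 array matches the 3 rows, so this assignment works
demo['y'] = np.array([10.0, 20.0, 30.0])

# A length-2 array does not match, so pandas raises a ValueError
try:
    demo['z'] = np.array([1.0, 2.0])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(demo.columns.tolist())
print(type(mismatch_error).__name__)
```

The failed assignment leaves the frame untouched, so only the `x` and `y` columns remain.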


<a id='3'></a>
### DataFrame From 2D Array

In [9]:
# Create a DataFrame from 2D np.array
np.random.seed(63)
array_2d = np.random.randn(9).reshape(3, 3)
array_2d

array([[-2.13897865e+00,  1.11206124e+00,  3.58015526e-02],
       [-6.30157742e-01,  3.54051160e-05, -1.20895742e+00],
       [ 2.91628075e-01, -1.65065515e+00, -1.50712909e+00]])

In [10]:
# Again, you can label your columns and index however you please
df0 = pd.DataFrame(array_2d, columns=['1stColumn', 'Another_Column', 'ThirdOne'],
                   index=[58, 12, 725])

df0

Unnamed: 0,1stColumn,Another_Column,ThirdOne
58,-2.138979,1.112061,0.035802
12,-0.630158,3.5e-05,-1.208957
725,0.291628,-1.650655,-1.507129


<a id='4'></a>
### Create DataFrame From a Dictionary
![panda-2](https://media.kidadl.com/medium_5f89b77ffae2a75827e84c40_a_day_in_the_life_of_a_panda_consists_solely_of_eating_and_sleeping_df016da354.jpeg)

In [11]:
# Create a DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
dict1

{'a1': array([-0.76018615, -2.10158401, -0.80975668]),
 'a2': array([-0.00751726, -1.70436077,  0.58140398]),
 'a3': array([-0.96914406, -1.4809011 , -0.52334255])}

In [12]:
# Assign the indexes
df1 = pd.DataFrame(dict1,index=[1,2,3]) 
df1

Unnamed: 0,a1,a2,a3
1,-0.760186,-0.007517,-0.969144
2,-2.101584,-1.704361,-1.480901
3,-0.809757,0.581404,-0.523343


In [13]:
# We can add a list with strings and ints as a column 
df1['Mixed'] = ["Apples", 92, "Cars"]
df1

Unnamed: 0,a1,a2,a3,Mixed
1,-0.760186,-0.007517,-0.969144,Apples
2,-2.101584,-1.704361,-1.480901,92
3,-0.809757,0.581404,-0.523343,Cars


<a id='5'></a>
### Create a Pandas Series Object (i.e. a DataFrame column) using a Python list

In [14]:
# Every column is a series object
type(df1['Mixed'])

pandas.core.series.Series

In [15]:
# View one column
df1['Mixed']

1    Apples
2        92
3      Cars
Name: Mixed, dtype: object

In [16]:
# Different datatypes in a column
print(type(df1['Mixed'][1]), type(df1['Mixed'][2]))

<class 'str'> <class 'int'>


In [17]:
# Create a Series from a Python list
s = pd.Series([21,15,32]) # an automatic index is created in numerical sequence order, 0,1,2...
s

0    21
1    15
2    32
dtype: int64

In [18]:
# Create a Series from a Python list with a user-specified index
s2 = pd.Series([21, 15, 32], index = ['h','i','j']) #specific index
s2

h    21
i    15
j    32
dtype: int64

In [19]:
# View element
s2['h']

21

<a id='39'></a>
## -------------PRACTICE-------------

1. In the cell below, create a 3x3 Data Frame with each value as '1' and add unique column names.

2. In the cell below, find the type of the first row of the first column of your new DataFrame.

3. In the cell below, find and print the value of the center cell of your DataFrame.

4. In the cell below, create a list with elements 'a', 'b', and 'c', transform it into a DataFrame, and find the type of the second row of the first column.

5. In the cell below, create a for loop that will create a list of length 10 with numbers ranging from 0 to 9, then convert this new list into a DataFrame.

<a id='6'></a>
### Accessing A DataFrame


In [20]:
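# We can add the Series s to the DataFrame above as a new column
# Assignment aligns on index labels: s is labeled 0, 1, 2 while df1 is labeled 1, 2, 3,
# so row 3 has no matching label and receives NaN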
df1['Series'] = s
df1

Unnamed: 0,a1,a2,a3,Mixed,Series
1,-0.760186,-0.007517,-0.969144,Apples,15.0
2,-2.101584,-1.704361,-1.480901,92,32.0
3,-0.809757,0.581404,-0.523343,Cars,
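
The output above shows `15.0`, `32.0`, and a missing value rather than `21`, `15`, `32`, because pandas matches rows by index label when a Series is assigned as a column. The sketch below reproduces this on a small illustrative frame (the names `frame`, `aligned`, and `positional` are assumptions for the example) and shows one way to assign by position instead.

```python
import pandas as pd

# Hypothetical frame whose index labels are 1, 2, 3 (like df1 above)
frame = pd.DataFrame({'a': [0.1, 0.2, 0.3]}, index=[1, 2, 3])
s = pd.Series([21, 15, 32])  # default labels 0, 1, 2

# Label-aligned assignment: only labels 1 and 2 exist in both objects
frame['aligned'] = s

# Positional assignment: convert to a NumPy array to discard the Series index
frame['positional'] = s.to_numpy()

print(frame)
```

Converting to an array is only safe when the values are already in the same row order as the frame.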


In [21]:
# We can rename columns
df1 = df1.rename(columns = {'Mixed':'RenamedColumn'})
df1

Unnamed: 0,a1,a2,a3,RenamedColumn,Series
1,-0.760186,-0.007517,-0.969144,Apples,15.0
2,-2.101584,-1.704361,-1.480901,92,32.0
3,-0.809757,0.581404,-0.523343,Cars,


In [22]:
# We can delete columns
del df1['RenamedColumn']
df1

Unnamed: 0,a1,a2,a3,Series
1,-0.760186,-0.007517,-0.969144,15.0
2,-2.101584,-1.704361,-1.480901,32.0
3,-0.809757,0.581404,-0.523343,


In [23]:
# or drop columns with axis=1, which is the option we use most often
# drop returns a new DataFrame; df1 itself is unchanged unless we set inplace=True or reassign the result
df1.drop('a2',axis=1) # returns a copy

Unnamed: 0,a1,a3,Series
1,-0.760186,-0.969144,15.0
2,-2.101584,-1.480901,32.0
3,-0.809757,-0.523343,


In [24]:
# Sanity Check
df1

Unnamed: 0,a1,a2,a3,Series
1,-0.760186,-0.007517,-0.969144,15.0
2,-2.101584,-1.704361,-1.480901,32.0
3,-0.809757,0.581404,-0.523343,


In [25]:
# or drop rows
df1.drop(1,axis=0)

Unnamed: 0,a1,a2,a3,Series
2,-2.101584,-1.704361,-1.480901,32.0
3,-0.809757,0.581404,-0.523343,


In [26]:
# Remove a column with inplace=True
df1.drop('Series',axis=1,inplace=True)
df1

Unnamed: 0,a1,a2,a3
1,-0.760186,-0.007517,-0.969144
2,-2.101584,-1.704361,-1.480901
3,-0.809757,0.581404,-0.523343


<a id='7'></a>
### Attribute Access

In [27]:
# View 1 column
df1['a1']

1   -0.760186
2   -2.101584
3   -0.809757
Name: a1, dtype: float64

In [28]:
# View several columns
df1[['a1','a3']]

Unnamed: 0,a1,a3
1,-0.760186,-0.969144
2,-2.101584,-1.480901
3,-0.809757,-0.523343


<a id='8'></a>
### Slicing Ranges

In [29]:
# slice of the DataFrame returned
# this slices the first three rows first followed by first 2 rows of the sliced frame
(df1[0:3][0:2])

Unnamed: 0,a1,a2,a3
1,-0.760186,-0.007517,-0.969144
2,-2.101584,-1.704361,-1.480901


In [30]:
# Let's print the first 2 elements of column a1
# This is a new Series (like a new table)
df1['a1'][0:2]

1   -0.760186
2   -2.101584
Name: a1, dtype: float64

In [31]:
# Print the 2 columns and the top 2 values
df1[['a1','a3']][0:2]

Unnamed: 0,a1,a3
1,-0.760186,-0.969144
2,-2.101584,-1.480901


<a id='9'></a>
### Selection by Position Using .iloc Attribute

In [32]:
# View element
df1.iloc[0,0]

-0.7601861461745268

In [33]:
# Get the first 2 rows and the first 2 columns (positions 0 and 1; the end of a slice is excluded)
df1.iloc[0:2,0:2]

Unnamed: 0,a1,a2
1,-0.760186,-0.007517
2,-2.101584,-1.704361


In [34]:
# Can also use 2 'lists' of position numbers with iloc
df1.iloc[[0,2],[0,2]]

Unnamed: 0,a1,a3
1,-0.760186,-0.969144
3,-0.809757,-0.523343


In [35]:
# Data only from the row at position 1 (the second row, whose index label is 2)
print(df1.iloc[1])
print('\n')
print(df1.iloc[1,:])

a1   -2.101584
a2   -1.704361
a3   -1.480901
Name: 2, dtype: float64


a1   -2.101584
a2   -1.704361
a3   -1.480901
Name: 2, dtype: float64
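
Because `df1` is labeled 1, 2, 3, position and label do not coincide, which is easy to trip over. The sketch below contrasts `.iloc` (position) with `.loc` (label) on a small illustrative frame; the values are made up for the example.

```python
import pandas as pd

# Hypothetical frame with index labels 1, 2, 3 and easy-to-read values
frame = pd.DataFrame({'a1': [10, 20, 30], 'a2': [1, 2, 3]}, index=[1, 2, 3])

by_position = frame.iloc[1]  # second row, whatever its label is
by_label = frame.loc[1]      # row whose index label is 1

print(by_position['a1'], by_position.name)
print(by_label['a1'], by_label.name)
```

The `name` of each returned Series is the index label of the row it came from.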


<a id='10'></a>
### Boolean Indexing

In [36]:
# return  full rows where a2>0
df1[df1['a2']>0]

# The df1['a2']>0 checks condition and returns boolean (T/F)
# The df1[] outside of it only selects the rows where this is true

Unnamed: 0,a1,a2,a3
3,-0.809757,0.581404,-0.523343


In [37]:
# return column a3 values where a2 >0
df1['a3'][df1['a2']>0]

3   -0.523343
Name: a3, dtype: float64
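
The same selection can be written as a single `.loc` call that takes the row mask and the column name together. This is a sketch on illustrative data rather than a required change; one `.loc` call is generally the safer pattern when you later want to assign values to the selection.

```python
import pandas as pd

# Hypothetical values loosely modeled on df1
frame = pd.DataFrame(
    {'a2': [-0.5, -1.7, 0.6], 'a3': [-1.0, -1.5, -0.5]},
    index=[1, 2, 3],
)

mask = frame['a2'] > 0            # boolean Series, one True/False per row
selected = frame.loc[mask, 'a3']  # rows where mask is True, column 'a3'

print(selected)
```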

In [38]:
# If you want the values in an np array
npg = df1.loc[:,"a2"].values #otherwise it returns an indexed Series
print(type(npg))
print()
npg

<class 'numpy.ndarray'>



array([-0.00751726, -1.70436077,  0.58140398])

<a id='40'></a>
## -------------PRACTICE-------------

1. In the cell below, create a new 4x4 DataFrame of random numbers between 0 and 1.

2. In the cell below, print the 2nd row of the 3rd column.

3. In the cell below, create a new DataFrame using only the 1st and 3rd column.

4. In the cell below, create a new DataFrame from problem 1 that includes only rows where the first column is greater than .4. Your answer may return an empty DataFrame.

5. In the cell below, using the iloc method, print the 4th row of the DataFrame from problem 1.

<a id='11'></a>
### Some Basic Statistics on a DataFrame

In [39]:
# Show general statistics
df1.describe()

Unnamed: 0,a1,a2,a3
count,3.0,3.0,3.0
mean,-1.223842,-0.376825,-0.991129
std,0.760551,1.18679,0.479158
min,-2.101584,-1.704361,-1.480901
25%,-1.45567,-0.855939,-1.225023
50%,-0.809757,-0.007517,-0.969144
75%,-0.784971,0.286943,-0.746243
max,-0.760186,0.581404,-0.523343


In [40]:
# View only the desired statistics; this is the same as selecting rows and columns in a normal DataFrame
df1.describe().loc[['mean','std'],['a2','a3']]

Unnamed: 0,a2,a3
mean,-0.376825,-0.991129
std,1.18679,0.479158


In [41]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False).head()

Unnamed: 0,a1,a2,a3
3,-0.809757,0.581404,-0.523343
2,-2.101584,-1.704361,-1.480901
1,-0.760186,-0.007517,-0.969144


<a id='12'></a>
### Reading in data from a CSV

In [42]:
#The read_csv method requires one argument, the file path, to the CSV file you want to read.
demand_data = pd.read_csv(demand_path)

<a id='13'></a>
### Head & Tails

In [43]:
#The head method defaults to displaying the first 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
#In this case we use 10
demand_data.head(10)

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
1,11039394,1039394,W_C,PF_2,15438,2022,1,4
2,11975221,1975221,W_A,PF_3,12725,2022,1,4
3,11396615,1396615,W_B,PF_4,38768,2022,1,4
4,11026987,1026987,W_C,PF_0,44662,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
6,11844486,1844486,W_B,PF_2,26850,2022,1,4
7,11633773,1633773,W_C,PF_3,46241,2022,1,4
8,11280204,1280204,W_A,PF_4,49826,2022,1,4
9,11461444,1461444,W_B,PF_0,23768,2022,1,4


In [44]:
#The tail method defaults to displaying the last 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
demand_data.tail(10)

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
1190,121602257,1602257,W_B,PF_1,58554,2022,12,5
1191,121542470,1542470,W_C,PF_2,17861,2022,12,5
1192,121172141,1172141,W_A,PF_3,14832,2022,12,5
1193,121686011,1686011,W_B,PF_4,54873,2022,12,5
1194,121760339,1760339,W_C,PF_0,35569,2022,12,5
1195,121544715,1544715,W_A,PF_1,56244,2022,12,5
1196,121715505,1715505,W_B,PF_2,41897,2022,12,5
1197,121539334,1539334,W_C,PF_3,53392,2022,12,5
1198,121803831,1803831,W_A,PF_4,48499,2022,12,5
1199,121431913,1431913,W_B,PF_0,28570,2022,12,5


<a id='14'></a>
### Filtering 

In [45]:
# Filtering in pandas works with standard python logic symbols for equal to '==', 
# greater than '>', less than '<', greater than or equal to '>=', and less than or equal to '<='.  
# The example below shows 'demand_data' being filtered by 'Product_Family' to only include data from the 'PF_1' 
# product family.
pf_1_demand = demand_data[demand_data['Product_Family']=='PF_1']
pf_1_demand.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
10,11118364,1118364,W_C,PF_1,35654,2022,1,4
15,11950549,1950549,W_B,PF_1,14407,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4


In [46]:
# Multiple filters can be combined using '&' (and); wrap each condition in parentheses.  
# The below gives an example of filtering 'demand_data' 
# to only show the product family 'PF_1' at warehouse 'W_A'.  
pf_1A_demand = demand_data[(demand_data['Product_Family']=='PF_1') & (demand_data['Warehouse']=='W_A')]
pf_1A_demand.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4
35,11838070,1838070,W_A,PF_1,15533,2022,1,4
50,11421180,1421180,W_A,PF_1,43700,2022,1,4
65,11470217,1470217,W_A,PF_1,43925,2022,1,4


<a id='15'></a>
### Changing Column Attributes


<a href ='https://numpy.org/doc/stable/reference/arrays.dtypes.html'>Data Types</a>

In [47]:
# Columns are automatically assigned a data type when the data is read in, but they can be changed.  
# The below converts several columns from 'demand_data' from 'int' to 'string'.  
# A full list of available types can be found at the link above.
demand_data['Year'] = demand_data['Year'].astype('str')
demand_data['Month'] = demand_data['Month'].astype('str')
demand_data['Weeks in Month'] = demand_data['Weeks in Month'].astype('str')
demand_data['Lookup Value'] = demand_data['Lookup Value'].astype('str')
demand_data['SKU_ID'] = demand_data['SKU_ID'].astype('str')
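
The data type of a column changes what operators do with it. As a small sketch on made-up values (the frame `demo` is an assumption for illustration), `+` adds integer columns but concatenates string columns:

```python
import pandas as pd

# Hypothetical frame with integer 'Year' and 'Month' columns
demo = pd.DataFrame({'Year': [2022, 2022], 'Month': [1, 12]})

# With integer columns, '+' performs addition
as_numbers = demo['Year'] + demo['Month']

# After converting to strings, '+' joins the text end to end
as_text = demo['Year'].astype('str') + demo['Month'].astype('str')

print(as_numbers.tolist())
print(as_text.tolist())
```

Checking `demo.dtypes` before combining columns is a quick way to confirm which behavior to expect.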

<a id='16'></a>
### Grouping

![panda-group](https://miro.medium.com/max/1400/1*6d5dw6dPhy4vBp2vRW6uzw.png)

<a href ='https://pandas.pydata.org/docs/reference/groupby.html'>Groupby Methods</a>

In [48]:
# The groupby method partitions data into groups by specified columns and 
# consolidates the numerical columns using a specified method.  
# The below gives an example grouping 'demand_data' by 'Product_Family', 
# 'Month', and 'Year' and showing the sum of 'Demand' by product family, month, and year.  
# A full list of methods that can be applied to the consolidated data can be found at the link above.
# Selecting [['Demand']] restricts the sum to the numeric column; recent pandas versions would
# otherwise also try to sum (concatenate) the text columns.

data = demand_data.groupby(['Product_Family','Month','Year'])[['Demand']].sum()
data.head()

Product_Family,Month,Year,Demand
PF_0,1,2022,644783
PF_0,10,2022,540286
PF_0,11,2022,530712
PF_0,12,2022,643068
PF_0,2,2022,708408
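
A grouped result keeps the grouping columns in its index. The sketch below, on a small made-up frame, shows one way to compute several statistics at once with `agg` and then move the group labels back into ordinary columns with `reset_index`. The data and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows using the lecture's column names
demo = pd.DataFrame({
    'Product_Family': ['PF_0', 'PF_0', 'PF_1', 'PF_1'],
    'Month': ['1', '1', '1', '2'],
    'Demand': [10, 20, 30, 40],
})

summary = (
    demo.groupby(['Product_Family', 'Month'])['Demand']
        .agg(['sum', 'mean'])   # one output column per statistic
        .reset_index()          # turn the group labels back into columns
)

print(summary)
```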


<a id='17'></a>
### Exporting Data

In [49]:
# The below exports 'data' to a CSV located at your 'test_path'.
data.to_csv(test_path)
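
By default `to_csv` writes the index as the first column(s), which is what preserves the group labels in the export above. The sketch below writes to an in-memory buffer instead of a file (an assumption made so the example leaves nothing on disk) to compare the default with `index=False`.

```python
import io
import pandas as pd

# Hypothetical grouped-style frame: labels live in the index
demo = pd.DataFrame(
    {'Demand': [10, 20]},
    index=pd.Index(['PF_0', 'PF_1'], name='Product_Family'),
)

# Default: the index is written as the first column
buffer = io.StringIO()
demo.to_csv(buffer)
with_index = buffer.getvalue()

# index=False: only the data columns are written
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
without_index = buffer.getvalue()

print(with_index)
print(without_index)
```

Use `index=False` only when the index is a plain row counter you do not need to keep.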

<a id='18'></a>
## -------------PRACTICE-------------

1. In the below cell, store the data from the 'Inventory_Data.csv' in the variable 'inventory_data'. 

In [53]:
inventory_data=pd.read_csv(inventory_path)

2. In the below cell, display the first 10 rows of 'inventory_data'. 

3. In the below cell, create a dataframe from 'demand_data' that shows the average 'Demand' by 'Warehouse', 'Month', and 'Year' for only warehouses 'W_A' and 'W_B'.

4. In the below cell, using 'demand_data', show the basic statistics for the 'Demand' at warehouse 'W_C'.

5. In the below cell, using 'demand_data', find the 'SKU_ID' with the highest demand for each 'Product_Family' at warehouse 'W_B'.

6. In the below cell, retrieve the first row of 'demand_data' and export it to a CSV with the name 'My_First_Export'. 

<a id='19'></a>
### Concatenating Frames

In [50]:
# The concat method stacks frames on top of each other lining up identically named columns.  The below script stacks
# two product family data sets.
pf_1 = demand_data[demand_data['Product_Family'] == 'PF_1']
pf_2 = demand_data[demand_data['Product_Family'] == 'PF_2']
pf_1_and_2 = pd.concat([pf_1,pf_2])
pf_1_and_2.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
10,11118364,1118364,W_C,PF_1,35654,2022,1,4
15,11950549,1950549,W_B,PF_1,14407,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4


In [51]:
# Older pandas versions also offered an append method (pf_1.append(pf_2)) for stacking 2 frames.  It was deprecated in
# pandas 1.4 and removed in pandas 2.0, so use concat, which also handles more than 2 frames at once.
pf_1_and_2 = pd.concat([pf_1, pf_2])
pf_1_and_2.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
10,11118364,1118364,W_C,PF_1,35654,2022,1,4
15,11950549,1950549,W_B,PF_1,14407,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4
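
Note that the stacked frame keeps each row's original index label (0, 5, 10, ...). As a sketch on two small made-up frames, passing `ignore_index=True` to `concat` renumbers the rows from 0 instead:

```python
import pandas as pd

# Hypothetical frames with non-sequential index labels, as left behind by filtering
top = pd.DataFrame({'Demand': [1, 2]}, index=[0, 5])
bottom = pd.DataFrame({'Demand': [3, 4]}, index=[1, 6])

kept = pd.concat([top, bottom])                            # original labels kept
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh 0..n-1 labels

print(kept.index.tolist())
print(renumbered.index.tolist())
```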


<a id='20'></a>
### Merging Frames

In [54]:
# The merge method joins two DataFrames on shared key columns, much like a join between database tables.  The below
# does a full or 'outer' merge which will include all rows from both data sets
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='outer', left_on = ['SKU_ID','Warehouse','Product_Family'],
                       right_on = ['SKU_ID','Warehouse','Product_Family'])
consol_data.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month,Inventory as of 1/1/22,Cost
0,11395072,1395072,W_B,PF_1,17631,2022,1,4,78423.33949,8.36
1,11039394,1039394,W_C,PF_2,15438,2022,1,4,131276.4804,8.97
2,11975221,1975221,W_A,PF_3,12725,2022,1,4,23069.57275,8.65
3,11396615,1396615,W_B,PF_4,38768,2022,1,4,53988.69284,7.89
4,11026987,1026987,W_C,PF_0,44662,2022,1,4,23517.17321,8.68


In [55]:
# The below does a left merge which will include all rows from month_1_demand, but only rows from inventory_data
# whose 'SKU_ID', 'Warehouse', and 'Product_Family' match a row in month_1_demand.
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='left', left_on = ['SKU_ID','Warehouse','Product_Family'],
                       right_on = ['SKU_ID','Warehouse','Product_Family'])
consol_data.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month,Inventory as of 1/1/22,Cost
0,11395072,1395072,W_B,PF_1,17631,2022,1,4,78423.33949,8.36
1,11039394,1039394,W_C,PF_2,15438,2022,1,4,131276.4804,8.97
2,11975221,1975221,W_A,PF_3,12725,2022,1,4,23069.57275,8.65
3,11396615,1396615,W_B,PF_4,38768,2022,1,4,53988.69284,7.89
4,11026987,1026987,W_C,PF_0,44662,2022,1,4,23517.17321,8.68


In [56]:
# The below does a right merge which will include all rows from inventory_data, but only rows from month_1_demand
# whose 'SKU_ID', 'Warehouse', and 'Product_Family' match a row in inventory_data.
month_1_demand = demand_data[demand_data['Month'] == '1']
inventory_data['SKU_ID'] = inventory_data['SKU_ID'].astype('str')
consol_data = pd.merge(month_1_demand, inventory_data, how='right', left_on = ['SKU_ID','Warehouse','Product_Family'],
                       right_on = ['SKU_ID','Warehouse','Product_Family'])
consol_data.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month,Inventory as of 1/1/22,Cost
0,11395072,1395072,W_B,PF_1,17631,2022,1,4,78423.33949,8.36
1,11039394,1039394,W_C,PF_2,15438,2022,1,4,131276.4804,8.97
2,11975221,1975221,W_A,PF_3,12725,2022,1,4,23069.57275,8.65
3,11396615,1396615,W_B,PF_4,38768,2022,1,4,53988.69284,7.89
4,11026987,1026987,W_C,PF_0,44662,2022,1,4,23517.17321,8.68
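
The three outputs above look identical in their first rows, so the difference between merge types is easier to see on a tiny example. The sketch below uses two made-up frames that share only some keys. `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames: SKUs '2' and '3' appear in both, '1' and '4' in only one
demand = pd.DataFrame({'SKU_ID': ['1', '2', '3'], 'Demand': [10, 20, 30]})
inventory = pd.DataFrame({'SKU_ID': ['2', '3', '4'], 'Inventory': [5.0, 6.0, 7.0]})

outer = pd.merge(demand, inventory, how='outer', on='SKU_ID', indicator=True)
left = pd.merge(demand, inventory, how='left', on='SKU_ID')
inner = pd.merge(demand, inventory, how='inner', on='SKU_ID')

print(outer)
print(len(outer), len(left), len(inner))
```

Rows without a match on the other side are filled with missing values, which is why an outer or left merge can introduce blanks.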


<a id='21'></a>
### Renaming Cells

In [57]:
# The below renames the 'Lookup Value' column to 'Unique_ID'
demand_data = demand_data.rename(columns={"Lookup Value": "Unique_ID"})

<a id='22'></a>
### Dates & Time

In [58]:
# The 'to_datetime' function converts a column of date strings to pandas datetime values
spy_data = pd.read_csv(spy_path)
spy_data['Date'] = pd.to_datetime(spy_data['Date'])
spy_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-10-30,328.279999,329.690002,322.600006,326.540009,322.004059,120287300
1,2020-11-02,330.200012,332.359985,327.23999,330.200012,325.613251,86068300
2,2020-11-03,333.690002,338.25,330.290009,336.029999,331.362274,93294200
3,2020-11-04,340.859985,347.940002,339.589996,343.540009,338.767914,126959700
4,2020-11-05,349.23999,352.190002,348.859985,350.23999,345.374817,82039700


In [59]:
# Datetime columns expose many attributes through the .dt accessor, including month, year, and day.  
spy_data['Month'] = spy_data['Date'].dt.month
spy_data['Year'] = spy_data['Date'].dt.year
spy_data['Day'] = spy_data['Date'].dt.day
spy_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Month,Year,Day
0,2020-10-30,328.279999,329.690002,322.600006,326.540009,322.004059,120287300,10,2020,30
1,2020-11-02,330.200012,332.359985,327.23999,330.200012,325.613251,86068300,11,2020,2
2,2020-11-03,333.690002,338.25,330.290009,336.029999,331.362274,93294200,11,2020,3
3,2020-11-04,340.859985,347.940002,339.589996,343.540009,338.767914,126959700,11,2020,4
4,2020-11-05,349.23999,352.190002,348.859985,350.23999,345.374817,82039700,11,2020,5


In [60]:
# Integer division maps each month number to its calendar quarter (months 1-3 -> 1, 4-6 -> 2, and so on)
spy_data['Quarter'] = (spy_data['Month'] -1) // 3 + 1
spy_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Month,Year,Day,Quarter
0,2020-10-30,328.279999,329.690002,322.600006,326.540009,322.004059,120287300,10,2020,30,4
1,2020-11-02,330.200012,332.359985,327.23999,330.200012,325.613251,86068300,11,2020,2,4
2,2020-11-03,333.690002,338.25,330.290009,336.029999,331.362274,93294200,11,2020,3,4
3,2020-11-04,340.859985,347.940002,339.589996,343.540009,338.767914,126959700,11,2020,4,4
4,2020-11-05,349.23999,352.190002,348.859985,350.23999,345.374817,82039700,11,2020,5,4
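
As a quick check of the quarter formula, the sketch below applies it to a few made-up dates and compares the result with the built-in `.dt.quarter` attribute, which gives the calendar quarter directly.

```python
import pandas as pd

# Hypothetical dates chosen to fall in different quarters
dates = pd.Series(pd.to_datetime(['2020-10-30', '2021-02-15', '2021-07-01']))

by_formula = (dates.dt.month - 1) // 3 + 1  # the formula used above
by_accessor = dates.dt.quarter              # pandas' built-in equivalent

print(by_formula.tolist())
print(by_accessor.tolist())
```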


In [61]:
# Dates can be shifted arithmetically by adding a timedelta object.  The below creates a new column 30 days after
# the 'Date' column, then drops it again.
spy_data['Date_+_30'] = spy_data['Date'] + timedelta(days=30)
spy_data.head()
spy_data = spy_data.drop(columns=['Date_+_30'])

<a id='23'></a>
### Sorting Columns

In [62]:
# The 'sort_values' method sorts the data by a provided column name
spy_data = spy_data.sort_values(by=['Date'], ascending=False)
spy_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Month,Year,Day,Quarter
251,2021-10-29,455.869995,459.559998,455.559998,459.25,459.25,70108200,10,2021,29,4
250,2021-10-28,455.459991,458.399994,455.450012,458.320007,458.320007,51437900,10,2021,28,4
249,2021-10-27,456.450012,457.160004,453.859985,453.940002,453.940002,72438000,10,2021,27,4
248,2021-10-26,457.200012,458.48999,455.559998,455.959991,455.959991,56075100,10,2021,26,4
247,2021-10-25,454.279999,455.899994,452.390015,455.549988,455.549988,45214500,10,2021,25,4


<a id='24'></a>
### Shifting Columns

In [63]:
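# The below shows a use of the shift method, which moves a column up or down a specified integer number of rows
# relative to the rest of the data.  Because the data is sorted newest first, shift(-1) lines each row up with the
# previous trading day's price.  We also remove some unnecessary columns.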
spy_data = spy_data.drop(columns=['Open','High','Low','Close','Volume'])
spy_data['Return_%'] = spy_data['Adj Close']/spy_data['Adj Close'].shift(-1)-1
spy_data.head()

Unnamed: 0,Date,Adj Close,Month,Year,Day,Quarter,Return_%
251,2021-10-29,459.25,10,2021,29,4,0.002029
250,2021-10-28,458.320007,10,2021,28,4,0.009649
249,2021-10-27,453.940002,10,2021,27,4,-0.00443
248,2021-10-26,455.959991,10,2021,26,4,0.0009
247,2021-10-25,455.549988,10,2021,25,4,0.005363
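
The direction of the shift depends on the sort order. As a sketch on made-up prices sorted oldest first (the reverse of the frame above), `shift(1)` looks back one row, and the built-in `pct_change` method computes the same daily return in one step.

```python
import pandas as pd

# Hypothetical prices, sorted oldest first
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2021-10-27', '2021-10-28', '2021-10-29']),
    'Adj Close': [100.0, 110.0, 99.0],
})

# Manual return: today's price divided by the previous row's price, minus 1
prices['manual'] = prices['Adj Close'] / prices['Adj Close'].shift(1) - 1

# Built-in equivalent for data in ascending date order
prices['builtin'] = prices['Adj Close'].pct_change()

print(prices)
```

The first row has no earlier price to compare against, so its return is missing in both columns.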


<a id='25'></a>
## -------------PRACTICE-------------

1. In the cell below, find all demand_data for 'Product_Family' 'PF_3', then use the merge method to pull in the inventory positions for those SKUs.  Your final data set should contain no rows with blank 'Demand'.  Output your data to a CSV titled 'PF_3_All_Data'.

2. In the cell below, mirror the DataFrame created in the section on shifting, except calculate the two day return instead of the daily return.  You will need to read in the SPY data again, and be sure to drop 'Open', 'High', 'Low', 'Close', and 'Volume'.  

3. In the cell below, find the average daily return of SPY for February, 2021.

4.  Pull the demand data for 'Product_Family' 'PF_1' in month '1', and inventory data for 'Product_Family' PF_2.  Concatenate these two Dataframes and include a new column to the resulting DataFrame specifying which data source each row of data is from.

5. Find the daily standard deviation of the returns of SPY for March of 2021. 

<a id='26'></a>
### Resetting an Index

In [64]:
# Note that the index of the below frame is no longer sequential, because filtering keeps the original row labels. 
pf_1 = demand_data[demand_data['Product_Family'] == 'PF_1']
pf_1.head()

Unnamed: 0,Unique_ID,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
10,11118364,1118364,W_C,PF_1,35654,2022,1,4
15,11950549,1950549,W_B,PF_1,14407,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4


In [65]:
# The reset_index method resets the index of a frame.  By default pandas keeps the old index as a new column
# named 'index'; passing drop=True discards it instead.  Reassigning the result (rather than using inplace=True
# on a filtered frame) also avoids pandas' SettingWithCopyWarning.
pf_1 = pf_1.reset_index(drop=True)
pf_1.head()

Unnamed: 0,Unique_ID,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
1,11885799,1885799,W_A,PF_1,37943,2022,1,4
2,11118364,1118364,W_C,PF_1,35654,2022,1,4
3,11950549,1950549,W_B,PF_1,14407,2022,1,4
4,11633085,1633085,W_A,PF_1,18798,2022,1,4


<a id='27'></a>
### Creating new columns from old columns

In [66]:
# Using columns to create new columns works very similarly to standard python variables.  The below creates a new
# ID combining the 'Year','Month', and 'Weeks in Month' columns.  These columns were converted to strings earlier,
# so '+' joins their text rather than adding numbers.  Taking an explicit copy first tells pandas that pf_1 is an
# independent frame, which avoids the SettingWithCopyWarning.
pf_1 = pf_1.copy()
pf_1['New_ID'] = pf_1['Year'] + pf_1['Month'] + pf_1['Weeks in Month']
pf_1.head()


Unnamed: 0,Unique_ID,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month,New_ID
0,11395072,1395072,W_B,PF_1,17631,2022,1,4,202214
1,11885799,1885799,W_A,PF_1,37943,2022,1,4,202214
2,11118364,1118364,W_C,PF_1,35654,2022,1,4,202214
3,11950549,1950549,W_B,PF_1,14407,2022,1,4,202214
4,11633085,1633085,W_A,PF_1,18798,2022,1,4,202214


<a id='28'></a>
### Lambda Functions

In [67]:
# Lambda functions let you define a small custom operation inline when no built-in method exists for it, while still avoiding explicit for loops.  
# The below lambda function creates a new column that finds the squared value of 'Adj Close'
spy_data['Price Squared'] = spy_data['Adj Close'].map(lambda x: x ** 2)
spy_data.head()

Unnamed: 0,Date,Adj Close,Month,Year,Day,Quarter,Return_%,Price Squared
251,2021-10-29,459.25,10,2021,29,4,0.002029,210910.5625
250,2021-10-28,458.320007,10,2021,28,4,0.009649,210057.228816
249,2021-10-27,453.940002,10,2021,27,4,-0.00443,206061.525416
248,2021-10-26,455.959991,10,2021,26,4,0.0009,207899.513393
247,2021-10-25,455.549988,10,2021,25,4,0.005363,207525.791567
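
For simple arithmetic such as squaring, a lambda is not strictly needed, because pandas applies operators to a whole column at once. The sketch below, on made-up prices, shows that both approaches give the same values. The lambda form remains useful for logic that has no vectorized equivalent.

```python
import pandas as pd

# Hypothetical prices for illustration
prices = pd.Series([2.0, 3.0, 4.0], name='Adj Close')

with_lambda = prices.map(lambda x: x ** 2)  # applies the function to each value
vectorized = prices ** 2                    # applies the operator to the whole column

print(with_lambda.tolist())
print(vectorized.tolist())
```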


<a id='29'></a>
### Looping through Columns

In [68]:
# For loops can be applied to a list of columns in a data frame to apply methods and functions to multiple columns rapidly
spy_data = pd.read_csv(spy_path)
target_cols = ['Open','High','Low','Close']
for col in target_cols:
    spy_data[col + '_Price Squared'] = spy_data[col].map(lambda x: x ** 2)
    
spy_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Open_Price Squared,High_Price Squared,Low_Price Squared,Close_Price Squared
0,10/30/2020,328.279999,329.690002,322.600006,326.540009,322.004059,120287300,107767.757743,108695.497419,104070.763871,106628.377478
1,11/2/2020,330.200012,332.359985,327.23999,330.200012,325.613251,86068300,109032.047925,110463.159629,107086.011055,109032.047925
2,11/3/2020,333.690002,338.25,330.290009,336.029999,331.362274,93294200,111349.017435,114413.0625,109091.490045,112916.160228
3,11/4/2020,340.859985,347.940002,339.589996,343.540009,338.767914,126959700,116185.529374,121062.244992,115321.365383,118019.737784
4,11/5/2020,349.23999,352.190002,348.859985,350.23999,345.374817,82039700,121968.570615,124037.797509,121703.289134,122668.050595
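
As an alternative to the loop, the same transformation can be applied to a whole block of columns at once. This is a minimal sketch on a made-up two-row DataFrame (`ohlc` and its values are assumptions for illustration); it selects the target columns, squares them together, and renames them with `add_suffix`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of SPY.csv
ohlc = pd.DataFrame({
    'Open': [328.28, 330.20],
    'High': [329.69, 332.36],
    'Low': [322.60, 327.24],
    'Close': [326.54, 330.20],
})
target_cols = ['Open', 'High', 'Low', 'Close']

# Square all target columns in one step, then rename them with a suffix
squared = (ohlc[target_cols] ** 2).add_suffix('_Price Squared')

# Attach the new columns next to the originals
ohlc = pd.concat([ohlc, squared], axis=1)
print(ohlc.columns.tolist())
```

The loop remains the more flexible option when each column needs slightly different handling.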


<a id='30'></a>
## -------------PRACTICE-------------

1. In the cell below, design a lambda function that multiplies a column by 4, then adds 5.  Apply this function to the 'Low' and 'High' columns in 'spy_data' using a for loop; you will need to read in the data again.  Come up with a naming convention to uniquely identify your new columns.

2. In the cell below, find all demand data for month '1' from 'demand_data', merge the 'inventory_data' onto it, and create a new column that calculates the ratio of demand to inventory for each row.

3. In the cell below, find all of the inventory_data at warehouse 'W_A'.  Then reset the resulting DataFrame's index, and be sure the new DataFrame does not have any new columns.

4. In the cell below, find all of the inventory_data at warehouse 'W_C'. Then reset the resulting DataFrame's index, and be sure the new DataFrame does not have any new columns.  Then, sort the resulting DataFrame by inventory amount in descending order.

5. In the cell below, create a DataFrame from 'spy_data' that only shows data from the 3rd quarter of 2021.  Assume a standard calendar year.

<a id='31'></a>
### Rolling columns
![pandas-babies](https://i.insider.com/614ade28c2c9630018f5b6b3?width=1000&format=jpeg&auto=webp)

In [69]:
# The rolling method computes a rolling (moving-window) operation on a column, which can be stored as a new column.  The code below
# calculates the rolling 30-day average ETF price of SPY (a window of 30 rows, i.e. 30 trading days).
spy_data = pd.read_csv(spy_path)
spy_data['Date'] = pd.to_datetime(spy_data['Date'])
spy_data = spy_data.sort_values(by=['Date'], ascending=True)
spy_data = spy_data.reset_index(drop=True)
spy_data = spy_data.drop(columns=['Open','High','Low','Close','Volume'])
spy_data['30-Day Average Price'] = spy_data['Adj Close'].rolling(30).mean()
spy_data = spy_data.sort_values(by=['Date'], ascending=False)
spy_data.head()

Unnamed: 0,Date,Adj Close,30-Day Average Price
251,2021-10-29,459.25,442.393334
250,2021-10-28,458.320007,441.798334
249,2021-10-27,453.940002,441.379067
248,2021-10-26,455.959991,441.129391
247,2021-10-25,455.549988,440.689111
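
To make the window behavior easier to see, here is a small sketch with a window of 3 rows on made-up prices (the values and column names are illustrative only). It shows that rows without a full window come back as NaN, and that `min_periods` relaxes that requirement.

```python
import pandas as pd

# Hypothetical prices on five consecutive trading days
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2021-10-25', '2021-10-26', '2021-10-27',
                            '2021-10-28', '2021-10-29']),
    'Adj Close': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Sort oldest to newest so each window looks backward in time
prices = prices.sort_values('Date').reset_index(drop=True)

# Window of 3 rows: the first two rows have no full window, so they are NaN
prices['3-Row Average'] = prices['Adj Close'].rolling(3).mean()

# min_periods=1 lets the early rows use whatever data is available
prices['3-Row Average (partial)'] = prices['Adj Close'].rolling(3, min_periods=1).mean()
print(prices)
```

Sorting by date before calling `rolling` matters: the window is defined by row position, so an unsorted frame would average unrelated days.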


<a id='32'></a>
### Pivoting

In [70]:
# The pivot table, just as in Excel, lets you quickly summarize data around chosen columns.  It is very similar
# in concept to the groupby method.  By default, the values are aggregated with the mean.
pd.pivot_table(demand_data, values = 'Demand', index=['Warehouse','Product_Family']).reset_index()

Unnamed: 0,Warehouse,Product_Family,Demand
0,W_A,PF_0,33830.638889
1,W_A,PF_1,34613.702381
2,W_A,PF_2,34301.791667
3,W_A,PF_3,30861.083333
4,W_A,PF_4,32812.214286
5,W_B,PF_0,31416.654762
6,W_B,PF_1,32435.47619
7,W_B,PF_2,32885.321429
8,W_B,PF_3,32502.486111
9,W_B,PF_4,31580.154762
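
The link between `pivot_table` and `groupby` can be checked on a tiny example. The sketch below uses made-up rows shaped like `demand_data` (the `demand` frame and its numbers are assumptions) to show that the default pivot matches a grouped mean, and how `aggfunc` and `columns` change the result.

```python
import pandas as pd

# Hypothetical rows shaped like demand_data
demand = pd.DataFrame({
    'Warehouse': ['W_A', 'W_A', 'W_A', 'W_B'],
    'Product_Family': ['PF_0', 'PF_0', 'PF_1', 'PF_0'],
    'Demand': [100, 300, 50, 80],
})

# pivot_table averages by default (aggfunc='mean')
pivot_mean = pd.pivot_table(demand, values='Demand',
                            index=['Warehouse', 'Product_Family']).reset_index()

# The equivalent groupby
group_mean = demand.groupby(['Warehouse', 'Product_Family'])['Demand'].mean().reset_index()

# Pass aggfunc to total instead, and columns= to spread a field across the top
pivot_sum = pd.pivot_table(demand, values='Demand', index='Warehouse',
                           columns='Product_Family', aggfunc='sum', fill_value=0)
print(pivot_mean)
print(pivot_sum)
```

The `columns=` form is what most resembles an Excel pivot table, with one field down the side and another across the top.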


<a id='33'></a>
### Transpose

In [71]:
# The Transpose method, as in linear algebra, will transpose a dataframe as if it were a matrix
test = pd.DataFrame([[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4]])
test

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
1,0,1,2,3,4
2,0,1,2,3,4


In [72]:
# Note how the transpose has turned each row into a column
test.T

Unnamed: 0,0,1,2
0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4


<a id='34'></a>
### Removing Duplicates

In [73]:
# Note that test has 3 copies of the same row.  The drop_duplicates method will remove the extra copies
test

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
1,0,1,2,3,4
2,0,1,2,3,4


In [74]:
test.drop_duplicates()

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
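
In practice, duplicates are often defined by a key column rather than by the entire row. The following sketch, on a made-up `orders` frame (names and values are illustrative), contrasts the default behavior with the `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical orders where the same SKU appears more than once
orders = pd.DataFrame({
    'SKU_ID': [1, 1, 2, 2],
    'Warehouse': ['W_A', 'W_A', 'W_B', 'W_C'],
    'Demand': [10, 10, 20, 25],
})

# Default: a row is dropped only if every column matches an earlier row
exact = orders.drop_duplicates()

# subset= compares only the listed columns; keep='last' keeps the final occurrence
by_sku = orders.drop_duplicates(subset=['SKU_ID'], keep='last')

print(exact)
print(by_sku)
```

Note that the surviving rows keep their original index labels, so a `reset_index(drop=True)` is a common follow-up.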


<a id='35'></a>
### Dropping Rows with Null values

In [75]:
# The first row has only four values, so pandas fills its last column with a null (NaN).  To remove rows with null values, use the dropna method
test = pd.DataFrame([[0,2,3,4],[0,1,2,3,4],[0,1,2,3,4]])
test

Unnamed: 0,0,1,2,3,4
0,0,2,3,4,
1,0,1,2,3,4.0
2,0,1,2,3,4.0


In [76]:
test.dropna()

Unnamed: 0,0,1,2,3,4
1,0,1,2,3,4.0
2,0,1,2,3,4.0
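
Dropping every row that contains any null can discard more data than intended. This sketch uses a made-up frame (`data` and its columns are assumptions) to show how `subset`, `axis`, and `how` narrow what `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical data where each row is missing something
data = pd.DataFrame({
    'Demand': [10.0, np.nan, 30.0],
    'Cost': [1.5, 2.5, np.nan],
    'Notes': [np.nan, np.nan, np.nan],
})

# Default drops any row with any null; here every row has one, so nothing is left
all_dropped = data.dropna()

# subset= only checks the listed columns
has_demand = data.dropna(subset=['Demand'])

# axis=1 with how='all' drops columns that are entirely null
no_empty_cols = data.dropna(axis=1, how='all')

print(len(all_dropped), has_demand.index.tolist(), no_empty_cols.columns.tolist())
```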


<a id='36'></a>
### Filling Null Values

In [77]:
# The fillna method allows you to keep rows with null values, and control what fills them.  In the below, 0 replaces
# null values
test.fillna(0)

Unnamed: 0,0,1,2,3,4
0,0,2,3,4,0.0
1,0,1,2,3,4.0
2,0,1,2,3,4.0
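
Zero is not always a sensible replacement for a missing value. The sketch below, on a made-up frame (`data` and its values are illustrative), shows two other options: a per-column fill using a dictionary, and carrying the last known value forward with `ffill`. It also shows that `fillna` returns a new DataFrame rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
data = pd.DataFrame({
    'Price': [100.0, np.nan, 102.0],
    'Volume': [500.0, 450.0, np.nan],
})

# fillna returns a new DataFrame; 'data' itself is unchanged unless reassigned
zero_filled = data.fillna(0)

# A dictionary sets a different fill value per column
per_column = data.fillna({'Price': data['Price'].mean(), 'Volume': 0})

# ffill carries the last known value forward, a common choice for prices
carried = data.ffill()

print(per_column)
print(carried)
```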


## Stacking Data
The basic premise of verticalizing (stacking) a data set is to consolidate columns so that you end up with fewer columns than you started with.  In general, the best way to do this is to find columns that can be grouped in a meaningful way.  In the example below, the inventory and cost columns are folded into a single 'Measure' column that names the quantity and a 'Value' column that holds it.

In [14]:
temp = pd.read_csv('Inventory_Data.csv')
print(temp.head())
temp=temp.set_index(['SKU_ID','Warehouse','Product_Family']).stack().reset_index(name='Value').rename(columns={'level_3': 'Measure'})
print(temp.head())

  Warehouse   SKU_ID Product_Family  Inventory as of 1/1/22  Cost
0       W_B  1395072           PF_1             78423.33949  8.36
1       W_C  1039394           PF_2            131276.48040  8.97
2       W_A  1975221           PF_3             23069.57275  8.65
3       W_B  1396615           PF_4             53988.69284  7.89
4       W_C  1026987           PF_0             23517.17321  8.68
    SKU_ID Warehouse Product_Family                 Measure         Value
0  1395072       W_B           PF_1  Inventory as of 1/1/22   78423.33949
1  1395072       W_B           PF_1                    Cost       8.36000
2  1039394       W_C           PF_2  Inventory as of 1/1/22  131276.48040
3  1039394       W_C           PF_2                    Cost       8.97000
4  1975221       W_A           PF_3  Inventory as of 1/1/22   23069.57275
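
The `melt` method is another common way to get the same long layout, and `pivot` can undo it. This is a minimal sketch on made-up rows shaped like `Inventory_Data.csv` (the `wide` frame, its shortened column names, and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical rows shaped like Inventory_Data.csv
wide = pd.DataFrame({
    'SKU_ID': [1395072, 1039394],
    'Warehouse': ['W_B', 'W_C'],
    'Inventory': [78423.3, 131276.5],
    'Cost': [8.36, 8.97],
})

# melt: id_vars stay as columns, everything else is stacked into name/value pairs
long = wide.melt(id_vars=['SKU_ID', 'Warehouse'], var_name='Measure', value_name='Value')

# pivot reverses the process, giving one column per measure again
back = long.pivot(index=['SKU_ID', 'Warehouse'], columns='Measure',
                  values='Value').reset_index()

print(long)
print(back)
```

Unlike `stack`, `melt` lists all rows for one measure before moving to the next, so the row order differs even though the content is the same.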


<a id='37'></a>
## -------------PRACTICE-------------

1. In the cell below, find the rolling 30 day standard deviation of the daily returns for the SPY from January 2021 through September 2021.

2. In the cell below, create a lambda function that will create a new column in spy_data with '1' for days with positive returns, and null for days without.  Drop the rows with null values in this column.

3. Go to finance.yahoo.com, and download the historical data of your favorite stock or ETF for September of 2021.  Then, import the data, sort it in descending order, transpose it, then export it to a CSV.

4. Read in a fresh pull of your new stock data and SPY data, then do a left merge of the data, with your new stock being the left DataFrame. 

5. Using the DataFrame from question 4, create a new column that shows the difference in returns between SPY and your security. 

<a id='55'></a>
# Weekly Readings/Videos

https://www.thinkful.com/blog/what-is-data-science/
    
https://hbr.org/2013/11/how-to-start-thinking-like-a-data-scientist

http://www.tylervigen.com/spurious-correlations

<a id='38'></a>
# Concluding Remarks
Pandas continues to evolve and offer more and more capabilities.  While this lecture covers the rudimentary aspects of the package, the more you work with it, the more new methods you will find, along with ways to combine it with other Python functionality.
![panda-tree](https://images.fineartamerica.com/images-medium-large-5/juvenile-panda-climbing-a-tree-tony-camacho.jpg)