![Data Applications](https://www.durhamtech.edu/themes/custom/durhamtech/images/durham-tech-logo-web.svg) 

## Manipulating Data with Pandas – The Fundamentals
The pandas package in Python is an industry standard that lets analysts work with small to medium data sets. With pandas, an analyst can quickly clean data and gather insights. The purpose of this lecture is to expose you to the core capabilities of the package.

---

### Set Up
1.	Go to GitHub
2.	Download the 'Inventory_Data.csv' and 'Demand_Plan.csv' files
3.	Move these files to a dedicated folder on your desktop or another location
4.	Open the command terminal in Anaconda Navigator and run 'pip install pandas'



### Needed Packages
1.	pandas
2.	numpy
---

# Table of Contents

### Useful Methods & Practices
#### <a href='#1'>What are Pandas DataFrames?</a>
#### <a href='#2'>DataFrame From 1D Array</a>
#### <a href='#3'>DataFrame From 2D Array</a>
#### <a href='#4'>Create DataFrame From a Dictionary</a>
#### <a href='#5'>Create a Pandas Series Object (i.e. a DataFrame column) using a Python list</a>
#### <a href='#6'>Accessing A DataFrame</a>
#### <a href='#7'>Attribute Access</a>
#### <a href='#8'>Slicing Ranges</a>
#### <a href='#9'>Selection by Position Using .iloc Attribute</a>
#### <a href='#10'>Boolean Indexing</a>
#### <a href='#11'>Some Basic Statistics on a DataFrame</a>
#### <a href='#12'>Reading in data from a CSV</a>
#### <a href='#13'>Head & Tails</a>
#### <a href='#14'>Filtering</a>
#### <a href='#15'>Changing Column Attributes</a>
#### <a href='#16'>Grouping</a>
#### <a href='#17'>Exporting Data</a>


### Practice
#### <a href='#18'>Exercises</a>



<a id='1'></a>
### What are Pandas DataFrames?
Pandas DataFrames are data structures that hold data in two dimensions, rows and columns, each identified by labels. In most cases, a DataFrame is built with the DataFrame constructor, to which you can pass two-dimensional data (lists, tuples, other sequences, or NumPy arrays), dictionaries, or time series data, to name a few.

In [121]:
# Ensure this file is in the same folder as 'Demand_Plan.csv' and 'Inventory_Data.csv'
import pandas as pd
import numpy as np

demand_path =  'Demand_Plan.csv'
inventory_path = 'Inventory_Data.csv'
test_path = 'test.csv'


<a id='2'></a>
### DataFrame From 1D Array

In [122]:
# Try it with an array
np.random.seed(0) # set seed for reproducibility

a1 = np.random.randn(3)
a2 = np.random.randn(3)
a3 = np.random.randn(3)

print(a1)
print(a2)
print(a3)

[1.76405235 0.40015721 0.97873798]
[ 2.2408932   1.86755799 -0.97727788]
[ 0.95008842 -0.15135721 -0.10321885]


In [123]:
# Create our first DataFrame w/ an np.array - it becomes a column
df0 = pd.DataFrame(a1)
df0

Unnamed: 0,0
0,1.764052
1,0.400157
2,0.978738


In [124]:
# print() shows plain text, while the last expression in a cell gets the notebook's rich display
print(df0)

          0
0  1.764052
1  0.400157
2  0.978738


In [125]:
# Check type
type(df0)

pandas.core.frame.DataFrame

In [126]:
# DataFrame from list of np.arrays
df0 = pd.DataFrame([a1, a2, a3])
df0

Unnamed: 0,0,1,2
0,1.764052,0.400157,0.978738
1,2.240893,1.867558,-0.977278
2,0.950088,-0.151357,-0.103219


In [127]:
# We can set column and index names
df0 = pd.DataFrame([a1, a2, a3],columns=['a1','a2','a3'],index=['a','b','c'])
df0

Unnamed: 0,a1,a2,a3
a,1.764052,0.400157,0.978738
b,2.240893,1.867558,-0.977278
c,0.950088,-0.151357,-0.103219


In [128]:
# Add more columns to the DataFrame, as with a dictionary; the new column's length must match the number of rows
df0['col4']=a2
df0

Unnamed: 0,a1,a2,a3,col4
a,1.764052,0.400157,0.978738,2.240893
b,2.240893,1.867558,-0.977278,1.867558
c,0.950088,-0.151357,-0.103219,-0.977278
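A minimal sketch of the "dimensions must match" rule, using a small made-up frame (the names `df`, `ok`, and `bad` are illustrative and not part of the lecture data). Assigning an array of the wrong length raises a `ValueError` instead of adding the column.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with three rows (values are arbitrary)
df = pd.DataFrame({"a1": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Same length as the frame: the assignment works, values are placed by position
df["ok"] = np.array([10, 20, 30])

# Wrong length: pandas refuses the assignment and the column is not created
try:
    df["bad"] = np.array([10, 20])
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df)
print("Error:", length_error)
```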


<a id='3'></a>
### DataFrame From 2D Array

In [129]:
# DataFrame from 2D np.array
np.random.seed(0)
array_2d = np.array(np.random.randn(9)).reshape(3,3)
array_2d

array([[ 1.76405235,  0.40015721,  0.97873798],
       [ 2.2408932 ,  1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721, -0.10321885]])

In [130]:
# Label columns when creating DataFrame
df0 = pd.DataFrame(array_2d,columns=['rand_normal_1','Random Again','Third'] \
                   , index=[100,200,99]) 

df0

Unnamed: 0,rand_normal_1,Random Again,Third
100,1.764052,0.400157,0.978738
200,2.240893,1.867558,-0.977278
99,0.950088,-0.151357,-0.103219


<a id='4'></a>
### Create DataFrame From a Dictionary

In [131]:
# DataFrame from a Dictionary
dict1 = {'a1':a1, 'a2':a2,'a3':a3}
dict1

{'a1': array([1.76405235, 0.40015721, 0.97873798]),
 'a2': array([ 2.2408932 ,  1.86755799, -0.97727788]),
 'a3': array([ 0.95008842, -0.15135721, -0.10321885])}

In [132]:
# Assign index values when creating DataFrame
df1 = pd.DataFrame(dict1,index=[1,2,3]) 

# Note that the column names come from the dictionary keys, with no columns argument needed
df1

Unnamed: 0,a1,a2,a3
1,1.764052,2.240893,0.950088
2,0.400157,1.867558,-0.151357
3,0.978738,-0.977278,-0.103219


In [133]:
# We can add a list with strings and ints as a column 
df1['L'] = ["List", 3, "words"]
df1

Unnamed: 0,a1,a2,a3,L
1,1.764052,2.240893,0.950088,List
2,0.400157,1.867558,-0.151357,3
3,0.978738,-0.977278,-0.103219,words


<a id='5'></a>
### Create a Pandas Series Object (i.e. a DataFrame column) using a Python list

In [134]:
# Every column is a series object
type(df1['L'])

pandas.core.series.Series

In [135]:
# View column
df1['L']

1     List
2        3
3    words
Name: L, dtype: object

In [136]:
# Different datatypes in a column
print(type(df1['L'][1]), type(df1['L'][2]))

<class 'str'> <class 'int'>


In [137]:
# Create a Series from a Python list
s = pd.Series([1,5,3]) # automatic index, 0,1,2...
s

0    1
1    5
2    3
dtype: int64

In [138]:
# same, but now add index
s2 = pd.Series([2, 3, 4], index = ['a','b','c']) #specific index
s2

a    2
b    3
c    4
dtype: int64

In [139]:
# View element
s2['a']

2

<a id='6'></a>
### Accessing A DataFrame


In [140]:
# We can add the Series s to the DataFrame above as a column named 'Series'
# Values are aligned on index labels: s has labels 0-2 and df1 has 1-3, so label 3 gets NaN
df1['Series'] = s
df1

Unnamed: 0,a1,a2,a3,L,Series
1,1.764052,2.240893,0.950088,List,5.0
2,0.400157,1.867558,-0.151357,3,3.0
3,0.978738,-0.977278,-0.103219,words,
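The missing value in the last row comes from index alignment. The sketch below reproduces the effect on a small made-up frame and shows two common ways to avoid it; the column names `aligned`, `reindexed`, and `by_position` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1.0, 2.0, 3.0]}, index=[1, 2, 3])
s = pd.Series([1, 5, 3])  # default index 0, 1, 2

# Assignment aligns on index labels: label 3 has no match in s, so it becomes NaN
df["aligned"] = s

# Option 1: build the Series with the same index as the frame
df["reindexed"] = pd.Series([1, 5, 3], index=df.index)

# Option 2: assign the raw values, which are placed by position
df["by_position"] = s.to_numpy()

print(df)
```

Which option fits depends on whether the labels carry meaning. If they do, alignment is usually what you want; if the values are simply in row order, assigning the raw array is simpler.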


In [141]:
# We can rename columns
df1 = df1.rename(columns = {'L':'RenamedL'})
df1

Unnamed: 0,a1,a2,a3,RenamedL,Series
1,1.764052,2.240893,0.950088,List,5.0
2,0.400157,1.867558,-0.151357,3,3.0
3,0.978738,-0.977278,-0.103219,words,


In [142]:
# We can delete columns
del df1['RenamedL']
df1

Unnamed: 0,a1,a2,a3,Series
1,1.764052,2.240893,0.950088,5.0
2,0.400157,1.867558,-0.151357,3.0
3,0.978738,-0.977278,-0.103219,


In [143]:
# or drop columns, see axis = 1
# does not change df1 if we don't set inplace=True
df1.drop('a2',axis=1) # returns a copy

Unnamed: 0,a1,a3,Series
1,1.764052,0.950088,5.0
2,0.400157,-0.151357,3.0
3,0.978738,-0.103219,


In [144]:
# Sanity Check
df1

Unnamed: 0,a1,a2,a3,Series
1,1.764052,2.240893,0.950088,5.0
2,0.400157,1.867558,-0.151357,3.0
3,0.978738,-0.977278,-0.103219,


In [145]:
# or drop rows
df1.drop(1,axis=0)

Unnamed: 0,a1,a2,a3,Series
2,0.400157,1.867558,-0.151357,3.0
3,0.978738,-0.977278,-0.103219,
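Because `drop` returns a new DataFrame, the result has to be assigned to a name if you want to keep it. A short sketch on made-up data, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

df = pd.DataFrame({"a1": [1, 2, 3], "a2": [4, 5, 6]}, index=[1, 2, 3])

# drop returns a new frame; the original is untouched
without_a2 = df.drop(columns="a2")   # same as df.drop("a2", axis=1)
without_row_1 = df.drop(index=1)     # same as df.drop(1, axis=0)

print(df.shape, without_a2.shape, without_row_1.shape)

# To keep the change, assign the result back to the name
df = df.drop(columns="a2")
print(list(df.columns))
```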


<a id='7'></a>
### Attribute Access

In [146]:
# Example: view only one column
df1['a1']

1    1.764052
2    0.400157
3    0.978738
Name: a1, dtype: float64

In [147]:
# Or view several columns
df1[['a1','a3']]

Unnamed: 0,a1,a3
1,1.764052,0.950088
2,0.400157,-0.151357
3,0.978738,-0.103219
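The cells above read columns with brackets. pandas also exposes a column as an attribute when its name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses a made-up frame; the column named `count` is chosen only to show the clash.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1.0, 2.0], "Random Again": [3.0, 4.0], "count": [5, 6]})

# A column whose name is a valid identifier can be read as an attribute
print(df.a1.equals(df["a1"]))

# Names with spaces only work with brackets
print(df["Random Again"].tolist())

# 'count' collides with the DataFrame.count method, so df.count is the method
print(callable(df.count))
print(df["count"].tolist())
```

Bracket access works for every column name, which is why it is the safer default in scripts.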


<a id='8'></a>
### Slicing Ranges

In [148]:
# A slice of the DataFrame is returned
# This takes the first three rows, then the first 2 rows of that result
(df1[0:3][0:2])

Unnamed: 0,a1,a2,a3,Series
1,1.764052,2.240893,0.950088,5.0
2,0.400157,1.867558,-0.151357,3.0


In [149]:
# Let's print the first 2 elements of column a1
# The result is a new Series
df1['a1'][0:2]

1    1.764052
2    0.400157
Name: a1, dtype: float64

In [150]:
# Let's print 2 columns and their top 2 values; note the list of columns
df1[['a1','a3']][0:2]

Unnamed: 0,a1,a3
1,1.764052,0.950088
2,0.400157,-0.151357
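Selecting columns and then slicing rows takes two indexing steps. As a hedged alternative, `.loc` and `.iloc` can do both in one call. The sketch uses a made-up frame with the same index labels (1, 2, 3) as `df1`.

```python
import pandas as pd

df = pd.DataFrame({"a1": [1.0, 2.0, 3.0], "a3": [7.0, 8.0, 9.0]}, index=[1, 2, 3])

# Plain [start:stop] slices rows by position and excludes the stop
by_position = df[0:2]

# .loc slices by index label and includes the stop label
by_label = df.loc[1:2, ["a1", "a3"]]

# .iloc slices by position, rows and columns together
by_iloc = df.iloc[0:2, [0, 1]]

print(by_position.index.tolist(), by_label.index.tolist(), by_iloc.index.tolist())
```

All three select the rows labelled 1 and 2 here, but for different reasons: positions 0 and 1 in the first and third case, labels 1 through 2 in the second.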


<a id='9'></a>
### Selection by Position Using .iloc Attribute

In [151]:
# View element
df1.iloc[0,0]

1.764052345967664

In [152]:
# Extract the first 2 rows and the first 2 columns (positions 0 and 1; the stop position is excluded)
df1.iloc[0:2,0:2]

Unnamed: 0,a1,a2
1,1.764052,2.240893
2,0.400157,1.867558


In [153]:
# iloc will also accept 2 'lists' of position numbers
df1.iloc[[0,2],[0,2]]

Unnamed: 0,a1,a3
1,1.764052,0.950088
3,0.978738,-0.103219


In [154]:
# Data only from the row at position 1 (index label 2); both forms are equivalent
print(df1.iloc[1])
print()
print(df1.iloc[1,:])

a1        0.400157
a2        1.867558
a3       -0.151357
Series    3.000000
Name: 2, dtype: float64

a1        0.400157
a2        1.867558
a3       -0.151357
Series    3.000000
Name: 2, dtype: float64
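Note that the output is named `2`: `.iloc[1]` picks the row at position 1, whose index label is 2. To select by label instead, use `.loc`. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a1": [10, 20, 30], "a2": [40, 50, 60]}, index=[1, 2, 3])

row_by_position = df.iloc[1]  # second row, whose index label is 2
row_by_label = df.loc[1]      # row whose index label is 1

print(row_by_position.name, row_by_position["a1"])
print(row_by_label.name, row_by_label["a1"])
```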


<a id='10'></a>
### Boolean Indexing

In [155]:
# Return full rows where a2 > 0
df1[df1['a2']>0]

# df1['a2']>0 checks the condition row by row and returns a boolean Series;
# indexing df1 with it keeps only the rows marked True

Unnamed: 0,a1,a2,a3,Series
1,1.764052,2.240893,0.950088,5.0
2,0.400157,1.867558,-0.151357,3.0


In [156]:
# return column a3 values where a2 >0
df1['a3'][df1['a2']>0]

1    0.950088
2   -0.151357
Name: a3, dtype: float64
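To make the mechanism visible, the sketch below stores the boolean mask in a variable and then selects rows and a column in a single `.loc` call. The data is made up, and `mask` and `selected` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"a2": [2.2, 1.9, -1.0], "a3": [0.95, -0.15, -0.10]},
    index=[1, 2, 3],
)

# The comparison yields one True/False value per row
mask = df["a2"] > 0
print(mask.tolist())

# Rows where the mask is True, column 'a3' only, in one step
selected = df.loc[mask, "a3"]
print(selected)
```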

In [157]:
# If you want the values in an np array
npg = df1.loc[:,"a2"].values # otherwise it returns an indexed Series
print(type(npg))
print()
npg

<class 'numpy.ndarray'>



array([ 2.2408932 ,  1.86755799, -0.97727788])

<a id='11'></a>
### Some Basic Statistics on a DataFrame

In [158]:
# Show general statistics
df1.describe()

Unnamed: 0,a1,a2,a3,Series
count,3.0,3.0,3.0,2.0
mean,1.047649,1.043724,0.231837,4.0
std,0.684554,1.760165,0.622489,1.414214
min,0.400157,-0.977278,-0.151357,3.0
25%,0.689448,0.44514,-0.127288,3.5
50%,0.978738,1.867558,-0.103219,4.0
75%,1.371395,2.054226,0.423435,4.5
max,1.764052,2.240893,0.950088,5.0


In [159]:
# View only the desired statistics and columns
df1.describe().loc[['mean','std'],['a2','a3']]

Unnamed: 0,a2,a3
mean,1.043724,0.231837
std,1.760165,0.622489
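Each row of `describe()` can also be computed directly with the matching method. The sketch below uses made-up numbers chosen so the results are easy to check by hand, and shows that `count` skips missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a2": [2.0, 4.0, 6.0], "Series": [5.0, 3.0, np.nan]})

summary = df.describe()

# Individual statistics match the rows of describe()
print(df["a2"].mean(), summary.loc["mean", "a2"])
print(df["a2"].std(), summary.loc["std", "a2"])  # sample standard deviation

# count ignores NaN, so 'Series' has 2 observations instead of 3
print(df["Series"].count())

# agg computes several chosen statistics at once
print(df.agg(["min", "max"]))
```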


In [160]:
# We can change the index sorting
df1.sort_index(axis=0, ascending=False).head() # largest index label first

Unnamed: 0,a1,a2,a3,Series
3,0.978738,-0.977278,-0.103219,
2,0.400157,1.867558,-0.151357,3.0
1,1.764052,2.240893,0.950088,5.0
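`sort_index` orders rows by their labels. To order by the values in a column instead, `sort_values` is the usual choice. A brief sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"a1": [1.76, 0.40, 0.98], "a2": [2.24, 1.87, -0.98]},
    index=[1, 2, 3],
)

# Largest a1 first
by_a1_desc = df.sort_values("a1", ascending=False)

# Smallest a2 first (ascending is the default)
by_a2 = df.sort_values("a2")

print(by_a1_desc.index.tolist())
print(by_a2.index.tolist())
```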


<a id='12'></a>
### Reading in data from a CSV

In [161]:
# read_csv requires one argument: the path to the CSV file you want to read.
demand_data = pd.read_csv(demand_path)
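`read_csv` also accepts optional arguments. As a sketch, the example below reads a two-row sample from an in-memory string so it runs without the course files; the sample mimics the column layout of 'Demand_Plan.csv', and the choice to read 'SKU_ID' as text via `dtype` is an assumption for illustration.

```python
import io
import pandas as pd

# Two-row sample with the same columns as Demand_Plan.csv
csv_text = (
    "Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month\n"
    "11395072,1395072,W_B,PF_1,17631,2022,1,4\n"
    "11039394,1039394,W_C,PF_2,15438,2022,1,4\n"
)

# dtype sets a column's type at read time instead of converting afterwards
sample = pd.read_csv(io.StringIO(csv_text), dtype={"SKU_ID": str})

print(sample.shape)
print(sample.dtypes)
```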

<a id='13'></a>
### Head & Tails

In [162]:
#The head method defaults to displaying the first 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
demand_data.head(10)

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
1,11039394,1039394,W_C,PF_2,15438,2022,1,4
2,11975221,1975221,W_A,PF_3,12725,2022,1,4
3,11396615,1396615,W_B,PF_4,38768,2022,1,4
4,11026987,1026987,W_C,PF_0,44662,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
6,11844486,1844486,W_B,PF_2,26850,2022,1,4
7,11633773,1633773,W_C,PF_3,46241,2022,1,4
8,11280204,1280204,W_A,PF_4,49826,2022,1,4
9,11461444,1461444,W_B,PF_0,23768,2022,1,4


In [163]:
#The tail method defaults to displaying the last 5 rows of a dataframe.  
#Inputting an integer argument will adjust the number of rows displayed.
demand_data.tail(10)

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
1190,121602257,1602257,W_B,PF_1,58554,2022,12,5
1191,121542470,1542470,W_C,PF_2,17861,2022,12,5
1192,121172141,1172141,W_A,PF_3,14832,2022,12,5
1193,121686011,1686011,W_B,PF_4,54873,2022,12,5
1194,121760339,1760339,W_C,PF_0,35569,2022,12,5
1195,121544715,1544715,W_A,PF_1,56244,2022,12,5
1196,121715505,1715505,W_B,PF_2,41897,2022,12,5
1197,121539334,1539334,W_C,PF_3,53392,2022,12,5
1198,121803831,1803831,W_A,PF_4,48499,2022,12,5
1199,121431913,1431913,W_B,PF_0,28570,2022,12,5


<a id='14'></a>
### Filtering 

In [164]:
# Filtering in pandas works with the standard Python comparison operators: equal to '==', not equal to '!=',
# greater than '>', less than '<', greater than or equal to '>=', and less than or equal to '<='.
# The example below shows 'demand_data' being filtered by 'Product_Family' to only include data from the 'PF_1'
# product family.
pf_1_demand = demand_data[demand_data['Product_Family']=='PF_1']
pf_1_demand.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
0,11395072,1395072,W_B,PF_1,17631,2022,1,4
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
10,11118364,1118364,W_C,PF_1,35654,2022,1,4
15,11950549,1950549,W_B,PF_1,14407,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4


In [165]:
# Multiple filters can be combined using '&'; wrap each condition in parentheses.
# The below gives an example of filtering 'demand_data'
# to only show the product family 'PF_1' at warehouse 'W_A'.
pf_1A_demand = demand_data[(demand_data['Product_Family']=='PF_1') & (demand_data['Warehouse']=='W_A')]
pf_1A_demand.head()

Unnamed: 0,Lookup Value,SKU_ID,Warehouse,Product_Family,Demand,Year,Month,Weeks in Month
5,11885799,1885799,W_A,PF_1,37943,2022,1,4
20,11633085,1633085,W_A,PF_1,18798,2022,1,4
35,11838070,1838070,W_A,PF_1,15533,2022,1,4
50,11421180,1421180,W_A,PF_1,43700,2022,1,4
65,11470217,1470217,W_A,PF_1,43925,2022,1,4
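Conditions can also be combined with `|` (or) and negated with `~`, and `isin` is a compact way to match several values. A sketch on a made-up frame that borrows the column names of 'demand_data':

```python
import pandas as pd

demand = pd.DataFrame({
    "Warehouse": ["W_A", "W_B", "W_C", "W_A"],
    "Product_Family": ["PF_1", "PF_1", "PF_2", "PF_3"],
    "Demand": [100, 200, 300, 400],
})

# Rows from warehouse W_A or W_B, written two equivalent ways
either = demand[(demand["Warehouse"] == "W_A") | (demand["Warehouse"] == "W_B")]
same = demand[demand["Warehouse"].isin(["W_A", "W_B"])]

# Rows that are NOT in product family PF_1
not_pf1 = demand[~(demand["Product_Family"] == "PF_1")]

print(either)
print(not_pf1)
```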


<a id='15'></a>
### Changing Column Attributes


<a href ='https://numpy.org/doc/stable/reference/arrays.dtypes.html'>Data Types</a>

In [166]:
# Columns are automatically assigned a data type when the data is read in, but they can be changed.
# The below converts several columns from 'demand_data' from 'int' to 'str'.
# A full list of available types can be found at the link above.
demand_data['Year'] = demand_data['Year'].astype('str')
demand_data['Month'] = demand_data['Month'].astype('str')
demand_data['Weeks in Month'] = demand_data['Weeks in Month'].astype('str')
demand_data['Lookup Value'] = demand_data['Lookup Value'].astype('str')
demand_data['SKU_ID'] = demand_data['SKU_ID'].astype('str')
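Several columns can be converted in one call by passing a dictionary to `astype`. The sketch below uses a made-up frame with the same column names and also shows a conversion back to integers.

```python
import pandas as pd

df = pd.DataFrame({"Year": [2022, 2022], "Month": [1, 12], "Demand": [5, 7]})

# Map each column name to its target type; unlisted columns keep their type
converted = df.astype({"Year": "str", "Month": "str"})

# Text columns can be concatenated, which integers cannot
labels = converted["Month"] + "/" + converted["Year"]
print(labels.tolist())

# Convert back to integers when numeric behaviour is needed again
month_numbers = converted["Month"].astype("int64")
print(month_numbers.tolist())
```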


<a id='16'></a>
### Grouping

<a href ='https://pandas.pydata.org/docs/reference/groupby.html'>Groupby Methods</a>

In [167]:
# The groupby method partitions data into groups by the specified columns and
# consolidates the numerical columns using a specified method.
# The below gives an example grouping 'demand_data' by 'Product_Family',
# 'Month', and 'Year' and showing the sum of 'Demand' by product family, month, and year.
# 'Demand' is selected explicitly because newer pandas versions no longer drop non-numeric columns when summing.
# 'Month' was converted to 'str' above, so the months sort as text (1, 10, 11, 12, 2, ...).
# A full list of methods that can be applied to the consolidated data can be found at the link above.
data = demand_data.groupby(['Product_Family','Month','Year'])[['Demand']].sum()
data.head()

Product_Family,Month,Year,Demand
PF_0,1,2022,644783
PF_0,10,2022,540286
PF_0,11,2022,530712
PF_0,12,2022,643068
PF_0,2,2022,708408
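The months above appear as 1, 10, 11, 12, 2 because they are being sorted as text. The sketch below, on made-up data, contrasts text and numeric group order and uses `agg` to compute more than one statistic at a time; `reset_index` turns the group labels back into ordinary columns.

```python
import pandas as pd

demand = pd.DataFrame({
    "Product_Family": ["PF_0", "PF_0", "PF_0", "PF_1"],
    "Month": [1, 1, 10, 2],
    "Year": [2022, 2022, 2022, 2022],
    "Demand": [10, 20, 30, 40],
})

# Several statistics per group, with the keys as regular columns
summary = (
    demand.groupby(["Product_Family", "Month", "Year"])["Demand"]
    .agg(["sum", "mean"])
    .reset_index()
)
print(summary)

# Integer months sort numerically; months stored as text sort alphabetically
numeric_order = demand.groupby("Month")["Demand"].sum().index.tolist()
text_order = (
    demand.assign(Month=demand["Month"].astype(str))
    .groupby("Month")["Demand"].sum().index.tolist()
)
print(numeric_order, text_order)
```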


<a id='17'></a>
### Exporting Data

In [168]:
# The below exports 'data' to a CSV located at your 'test_path'.
data.to_csv(test_path)
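By default `to_csv` writes the index as the leading column or columns, which is what keeps the group labels in the export. The sketch below writes to an in-memory buffer instead of a file so the result can be inspected; the frames are made up.

```python
import io
import pandas as pd

grouped = pd.DataFrame(
    {"Product_Family": ["PF_0", "PF_1"], "Demand": [30, 40]}
).set_index("Product_Family")

# The index is written as the first column
buffer = io.StringIO()
grouped.to_csv(buffer)
text = buffer.getvalue()
print(text)

# With a default 0, 1, 2... index, index=False leaves it out of the file.
# Without a path, to_csv returns the CSV text as a string.
flat = pd.DataFrame({"x": [1, 2]})
flat_text = flat.to_csv(index=False)
print(flat_text)
```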

<a id='18'></a>
## -------------PRACTICE-------------

1. In the below cell, store the data from the 'Inventory_Data.csv' in the variable 'inventory_data'. 

2. In the below cell, display the first 10 rows of 'inventory_data'. 

3. In the below cell, create a dataframe from 'demand_data' that shows the average 'Demand' by 'Warehouse', 'Month', and 'Year' for warehouses 'W_A' and 'W_B' only.

4. In the below cell, using 'demand_data', show the basic statistics for the 'Demand' at warehouse 'W_C'.

5. In the below cell, using 'demand_data', find the 'SKU_ID' with the highest demand for each 'Product_Family' at warehouse 'W_B'.

6. In the below cell, retrieve the first row of 'demand_data' and export it to a CSV with the name 'My_First_Export'. 