### Pandas Playground

We will start by importing the libraries we need.

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import jinja2 as jnj

Now that we have imported our libraries, let's point Python to the .csv files we will be using. <br>
Notice that we are creating a variable for each file path; this makes the paths easy to reference later.

In [65]:
orders = '/Users/markhinojosa/coffee-shop/Raw Data/ORDERS.csv'
products = '/Users/markhinojosa/coffee-shop/Raw Data/PRODUCTS.csv'
customers = '/Users/markhinojosa/coffee-shop/Raw Data/CUSTOMERS.csv'

Our file path variables are ready to go, so let's feed them through pandas to create our DataFrames.<br>
[Learn More About Dataframes](https://realpython.com/pandas-dataframe/#:~:text=The%20pandas%20DataFrame%20is%20a,with%20in%20Excel%20or%20Calc.)

In [68]:
ordersDF = pd.read_csv(orders)
productsDF = pd.read_csv(products)
customersDF = pd.read_csv(customers)
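
If you do not have the files on disk, the same call can be tried against an in-memory CSV. The sketch below is only an illustration: `csv_text` and `sample_df` are made-up names, and the two rows are copied from the orders preview rather than read from the real file.

```python
import io

import pandas as pd

# Inline text standing in for ORDERS.csv (two rows copied from the preview).
csv_text = """Order ID,Order Date,Customer ID,Product ID,Quantity
QEV-37451-860,2019-09-05,17670-51384-MA,R-M-1,2
FAA-43335-268,2021-06-17,21125-22134-PX,A-L-1,1
"""

# read_csv accepts a file-like object as well as a file path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # column names taken from the header row
```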

#### Data Population Validation
Our data has been loaded, but let's verify that it is populating correctly.

In [143]:
ordersDF.head()

Unnamed: 0,Order ID,Order Date,Customer ID,Product ID,Quantity
0,QEV-37451-860,2019-09-05,17670-51384-MA,R-M-1,2
1,QEV-37451-860,2019-09-05,17670-51384-MA,E-M-0.5,5
2,FAA-43335-268,2021-06-17,21125-22134-PX,A-L-1,1
3,KAC-83089-793,2021-07-15,23806-46781-OU,E-M-1,2
4,KAC-83089-793,2021-07-15,23806-46781-OU,R-L-2.5,2


In [144]:
productsDF.head()

Unnamed: 0,Product ID,Coffee Type,Roast Type,Size,Unit Price,Price per 100g,Profit
0,A-L-0.2,Ara,L,0.2,3.885,1.9425,0.34965
1,A-L-0.5,Ara,L,0.5,7.77,1.554,0.6993
2,A-L-1,Ara,L,1.0,12.95,1.295,1.1655
3,A-L-2.5,Ara,L,2.5,29.785,1.1914,2.68065
4,A-M-0.2,Ara,M,0.2,3.375,1.6875,0.30375


In [145]:
customersDF.head()

Unnamed: 0,Customer ID,Customer Name,Email,Phone Number,Address Line 1,City,Country,Postcode,Loyalty Card
0,17670-51384-MA,Aloisia Allner,aallner0@lulu.com,+1 (862) 817-0124,57999 Pepper Wood Alley,Paterson,United States,7505,Yes
1,73342-18763-UW,Piotr Bote,pbote1@yelp.com,+353 (913) 396-4653,2112 Ridgeway Hill,Crumlin,Ireland,D6W,No
2,21125-22134-PX,Jami Redholes,jredholes2@tmall.com,+1 (210) 986-6806,5214 Bartillon Park,San Antonio,United States,78205,Yes
3,71253-00052-RN,Dene Azema,dazema3@facebook.com,+1 (217) 418-0714,27 Maywood Place,Springfield,United States,62711,Yes
4,23806-46781-OU,Christoffer O' Shea,,+353 (698) 362-9201,38980 Manitowish Junction,Cill Airne,Ireland,N41,No


#### Validating Data Types
We know that we have the right data, but we need to ensure that the data we will be analysing is cast correctly. <br>
The '.dtypes' attribute helps with this.

In [72]:
ordersDF.dtypes

Order ID       object
Order Date     object
Customer ID    object
Product ID     object
Quantity        int64
dtype: object

In [73]:
customersDF.dtypes

Customer ID       object
Customer Name     object
Email             object
Phone Number      object
Address Line 1    object
City              object
Country           object
Postcode          object
Loyalty Card      object
dtype: object

In ordersDF it looks like we have to convert 'Order Date' into a datetime type.

In [146]:
ordersDF['Order Date'] = pd.to_datetime(ordersDF['Order Date'])

^^ Notice that above we used this syntax: [DF Object]['DF Column Name'] <br>
We will use this syntax quite a bit in our analysis.

Let's check the .dtypes attribute once again to verify the change.

In [149]:
ordersDF.dtypes

Order ID               object
Order Date     datetime64[ns]
Customer ID            object
Product ID             object
Quantity                int64
dtype: object

It looks like the change took.
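
The conversion step can be sketched on a tiny stand-in frame. The `demo` and `messy` objects below are made up for illustration, and the `errors="coerce"` option is an extra not used in the notebook; it turns unparseable strings into NaT instead of raising an error.

```python
import pandas as pd

# Small stand-in frame; the dates are stored as plain strings at first.
demo = pd.DataFrame({"Order Date": ["2019-09-05", "2021-06-17"], "Quantity": [2, 1]})

# Same pattern as above: overwrite the column with its datetime version.
demo["Order Date"] = pd.to_datetime(demo["Order Date"])

# Check the result without relying on a specific datetime resolution.
is_datetime = pd.api.types.is_datetime64_any_dtype(demo["Order Date"])
print(is_datetime)
print(demo["Order Date"].dt.year.tolist())

# Optional: errors="coerce" replaces values that cannot be parsed with NaT.
messy = pd.Series(["2019-09-05", "not a date"])
parsed = pd.to_datetime(messy, errors="coerce")
print(parsed.isna().tolist())
```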

#### Selecting Rows and Columns
We may not need all of the data within a dataframe so let's look at the different ways to select specific data.

Below is the syntax for selecting a single column.

In [158]:
ordersDF['Order ID']

0      QEV-37451-860
1      QEV-37451-860
2      FAA-43335-268
3      KAC-83089-793
4      KAC-83089-793
           ...      
995    RLM-96511-467
996    AEZ-13242-456
997    UME-75640-698
998    GJC-66474-557
999    IRV-20769-219
Name: Order ID, Length: 1000, dtype: object

Here is an example of selecting multiple columns. Notice the double brackets: the inner pair is a list of column names.

In [160]:
ordersDF[['Order ID', 'Quantity']]

Unnamed: 0,Order ID,Quantity
0,QEV-37451-860,2
1,QEV-37451-860,5
2,FAA-43335-268,1
3,KAC-83089-793,2
4,KAC-83089-793,2
...,...,...
995,RLM-96511-467,1
996,AEZ-13242-456,5
997,UME-75640-698,4
998,GJC-66474-557,1


Below is an example of selecting specific rows. <br>
Note that pandas assigns a row index (starting at 0) when you load a data frame. <br>
The : symbol defines a slice. The start position is included and the end position is excluded, so the snippet below returns rows 1 through 9.

In [164]:
ordersDF[1 : 10]

Unnamed: 0,Order ID,Order Date,Customer ID,Product ID,Quantity
1,QEV-37451-860,2019-09-05,17670-51384-MA,E-M-0.5,5
2,FAA-43335-268,2021-06-17,21125-22134-PX,A-L-1,1
3,KAC-83089-793,2021-07-15,23806-46781-OU,E-M-1,2
4,KAC-83089-793,2021-07-15,23806-46781-OU,R-L-2.5,2
5,CVP-18956-553,2021-08-04,86561-91660-RB,L-D-1,3
6,IPP-31994-879,2022-01-21,65223-29612-CB,E-D-0.5,3
7,SNZ-65340-705,2022-05-20,21134-81676-FR,L-L-0.2,1
8,EZT-46571-659,2019-01-02,03396-68805-ZC,R-M-0.5,3
9,NWQ-70061-912,2019-09-05,61021-27840-ZN,R-M-0.5,1
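
The slice boundaries are easy to check on a small frame. In this sketch `demo` is a made-up 12-row frame; it compares the plain slice with `.iloc` (position based, end excluded) and `.loc` (label based, end included).

```python
import pandas as pd

# Made-up frame with the default 0..11 row index.
demo = pd.DataFrame({"Quantity": range(12)})

subset = demo[1:10]          # positions 1 through 9
same = demo.iloc[1:10]       # explicit position-based equivalent
by_label = demo.loc[1:10]    # label-based, so label 10 is included

print(subset.index.tolist())
print(len(subset), len(by_label))
```

When the end point matters, `.iloc` or `.loc` states the intent more explicitly than the bare slice.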


#### Filtering
Below are methods for filtering data with pandas. <br>
In the example below we are surfacing rows where the order quantity was greater than 5. <br>
##### Note: I am only using .head() to limit the returned results; it is not part of the filtering syntax.

In [173]:
ordersDF[ordersDF['Quantity'] > 5].head()

Unnamed: 0,Order ID,Order Date,Customer ID,Product ID,Quantity
16,VAU-44387-624,2019-03-20,99643-51048-IQ,A-M-0.2,6
17,RDW-33155-159,2019-10-19,62173-15287-CU,A-L-1,6
21,NUO-20013-488,2020-12-04,03090-88267-BQ,A-D-0.2,6
31,WOQ-36015-429,2021-09-25,51427-89175-QJ,A-D-0.5,6
32,WOQ-36015-429,2021-09-25,51427-89175-QJ,L-M-0.5,6
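
It can help to see what the inner expression produces on its own. In this sketch, with made-up `demo` data, the comparison builds a Boolean Series (a mask), and indexing the frame with that mask keeps only the rows marked True. Combining two conditions with `&` is an extra not shown in the notebook; each condition needs its own parentheses.

```python
import pandas as pd

# Made-up data for illustration.
demo = pd.DataFrame({"Order ID": ["A", "B", "C", "D"], "Quantity": [2, 6, 5, 6]})

# The comparison returns one True/False value per row.
mask = demo["Quantity"] > 5
print(mask.tolist())

# Indexing with the mask keeps the rows where it is True.
filtered = demo[mask]
print(filtered["Order ID"].tolist())

# Two conditions combined with & (note the parentheses).
mid_sized = demo[(demo["Quantity"] > 2) & (demo["Quantity"] < 6)]
print(mid_sized["Order ID"].tolist())
```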


Because we converted 'Order Date' to datetime earlier, we can now filter on dates as well.

In [177]:
ordersDF[ordersDF['Order Date'] > '2022-01-01'].head()

Unnamed: 0,Order ID,Order Date,Customer ID,Product ID,Quantity
6,IPP-31994-879,2022-01-21,65223-29612-CB,E-D-0.5,3
7,SNZ-65340-705,2022-05-20,21134-81676-FR,L-L-0.2,1
12,SZW-48378-399,2022-07-02,34136-36674-OM,R-M-1,5
14,GNZ-46006-527,2022-04-05,95875-73336-RG,L-D-0.2,3
15,FYQ-78248-319,2022-06-07,25473-43727-BY,R-M-2.5,5
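
The date string in the filter is compared against a datetime column, so the cutoff can also be written as an explicit `pd.Timestamp`. The sketch below uses a made-up `demo` frame and additionally shows `Series.between` for a date window, which is not part of the notebook.

```python
import pandas as pd

# Made-up frame with an already-converted datetime column.
demo = pd.DataFrame(
    {"Order Date": pd.to_datetime(["2021-08-04", "2022-01-21", "2022-05-20"])}
)

# Explicit Timestamp instead of a bare string.
cutoff = pd.Timestamp("2022-01-01")
recent = demo[demo["Order Date"] > cutoff]
print(len(recent))

# between() keeps values inside a window (both ends included by default).
window = demo[
    demo["Order Date"].between(pd.Timestamp("2022-01-01"), pd.Timestamp("2022-03-31"))
]
print(len(window))
```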


Below is an example of how a column can be selected and then filtered. <br>
Looking for orders after 2022-01-01.

In [175]:
ordersDF['Order Date'][ordersDF['Order Date'] > '2022-01-01'].head()

6    2022-01-21
7    2022-05-20
12   2022-07-02
14   2022-04-05
15   2022-06-07
Name: Order Date, dtype: datetime64[ns]

Same as above, but with multiple columns selected and then filtered. <br>
Looking for orders with a quantity greater than 4.

In [176]:
ordersDF[['Order ID', 'Quantity', 'Customer ID']][ordersDF['Quantity'] > 4].head()

Unnamed: 0,Order ID,Quantity,Customer ID
1,QEV-37451-860,5,17670-51384-MA
11,VQR-01002-970,5,49315-21985-BB
12,SZW-48378-399,5,34136-36674-OM
15,FYQ-78248-319,5,25473-43727-BY
16,VAU-44387-624,6,99643-51048-IQ
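
As an alternative to chaining two sets of brackets, the row filter and the column list can be passed to `.loc` in a single step. This is a sketch on made-up `demo` data, not a change to the approach above; both forms should return the same rows.

```python
import pandas as pd

# Made-up data for illustration.
demo = pd.DataFrame(
    {
        "Order ID": ["A", "B", "C"],
        "Quantity": [5, 2, 6],
        "Customer ID": ["X", "Y", "Z"],
        "Product ID": ["R-M-1", "A-L-1", "E-M-1"],
    }
)
cols = ["Order ID", "Quantity", "Customer ID"]

# Chained form used in the notebook: select columns, then filter rows.
chained = demo[cols][demo["Quantity"] > 4]

# Single-step form: rows first, columns second.
single_step = demo.loc[demo["Quantity"] > 4, cols]

print(chained.equals(single_step))
```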


Below is an example of how we search for a specific value. <br>
Looking for orders that had a Product ID of 'E-D-0.5'

In [179]:
ordersDF[ordersDF['Product ID']=='E-D-0.5'].head()

Unnamed: 0,Order ID,Order Date,Customer ID,Product ID,Quantity
6,IPP-31994-879,2022-01-21,65223-29612-CB,E-D-0.5,3
118,RFH-64349-897,2019-10-22,61954-61462-RJ,E-D-0.5,3
131,VDZ-76673-968,2020-12-31,82246-82543-DW,E-D-0.5,2
162,CBT-55781-720,2021-11-15,97855-54761-IS,E-D-0.5,3
164,BYU-58154-603,2020-12-17,51971-70393-QM,E-D-0.5,4


#### Aggregation Methods

There are some nifty quick aggregations we can run against our data to understand it better.

The syntax below selects the Order ID column, filters to rows with a Quantity greater than 5, and returns the count of values that meet the criteria. <br>
The .count() method returns the number of non-null values; with a filter applied it counts only the matching rows, and with no filter it counts every non-null value.

In [182]:
ordersDF['Order ID'][ordersDF['Quantity'] > 5].count()

175

The syntax below returns the count of non-null values for each column.

In [180]:
ordersDF.count()

Order ID       1000
Order Date     1000
Customer ID    1000
Product ID     1000
Quantity       1000
dtype: int64

The two statements above can be used in calculations as well. Below is an example of calculating the proportion of order rows that had a quantity greater than 5.

In [184]:
ordersDF['Order ID'][ordersDF['Quantity'] > 5].count() / ordersDF['Order ID'].count()

0.175
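
The same proportion can be reached by taking the mean of the Boolean mask, since True counts as 1 and False as 0. The sketch below uses made-up `demo` data and compares the two routes; it assumes the column has no missing values, otherwise the two denominators would differ.

```python
import pandas as pd

# Made-up data for illustration.
demo = pd.DataFrame({"Quantity": [2, 6, 5, 6]})

# Route used in the notebook: matching count divided by total count.
share_count = demo["Quantity"][demo["Quantity"] > 5].count() / demo["Quantity"].count()

# Shortcut: the mean of a True/False mask is the share of True values.
share_mean = (demo["Quantity"] > 5).mean()

print(share_count, share_mean)
```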

The .sum() method can be used on numeric columns to return the sum of their values.

In [181]:
ordersDF['Quantity'].sum()

3551

#### Null Values
The methods below help you understand the NaN (null) values present in your data.

The .isnull() method returns a table of Boolean values indicating whether each cell is NaN.

In [185]:
customersDF.isnull()

Unnamed: 0,Customer ID,Customer Name,Email,Phone Number,Address Line 1,City,Country,Postcode,Loyalty Card
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False
996,False,False,True,False,False,False,False,False,False
997,False,False,True,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False


If you want to check whether a dataframe contains any null values at all, chain .values.any() onto .isnull().

In [141]:
customersDF.isnull().values.any()

True

^^ We can see that customersDF contains NaN values.

We can chain the .sum() method to .isnull() to count the number of null values in each column.

In [187]:
customersDF.isnull().sum()

Customer ID         0
Customer Name       0
Email             204
Phone Number      130
Address Line 1      0
City                0
Country             0
Postcode            0
Loyalty Card        0
dtype: int64
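
The three null checks can be tried side by side on a small frame. The `demo` data below is made up for illustration; `.isnull().any()` without `.values` is an extra that reports the result per column rather than for the whole frame.

```python
import numpy as np
import pandas as pd

# Made-up customer data with a few missing values.
demo = pd.DataFrame(
    {
        "Customer Name": ["A", "B", "C", "D"],
        "Email": ["a@x.com", np.nan, np.nan, "d@x.com"],
        "Phone Number": ["1", "2", np.nan, "4"],
    }
)

any_at_all = demo.isnull().values.any()  # one True/False for the whole frame
has_nulls = demo.isnull().any()          # one True/False per column
null_counts = demo.isnull().sum()        # number of nulls per column

print(any_at_all)
print(has_nulls.tolist())
print(null_counts.tolist())
```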

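Further, we can do calculations to get a NaN ratio. <br>
Note that .count() excludes NaN values, so the expression below divides the number of nulls by the number of non-null values (204 / 796 for Email), not by the total number of rows.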

In [189]:
customersDF.isnull().sum() / customersDF.count()

Customer ID       0.000000
Customer Name     0.000000
Email             0.256281
Phone Number      0.149425
Address Line 1    0.000000
City              0.000000
Country           0.000000
Postcode          0.000000
Loyalty Card      0.000000
dtype: float64
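
If the goal is the share of all rows that are null, one option is to take the mean of the Boolean table, which divides by the total row count. The sketch below uses made-up `demo` data and contrasts the two calculations.

```python
import pandas as pd

# Made-up column with two missing values out of four rows.
demo = pd.DataFrame({"Email": ["a@x.com", None, None, "d@x.com"]})

# Nulls divided by non-null values: 2 / 2.
ratio_to_non_null = demo.isnull().sum() / demo.count()

# Nulls divided by all rows: 2 / 4.
share_of_rows = demo.isnull().mean()

print(ratio_to_non_null["Email"], share_of_rows["Email"])
```

Applied to the Email column above, the second calculation would give 204 / 1000 rather than 204 / 796.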