# EDA for Pro Data Analyst

A Comprehensive Guide to Exploratory Data Analysis with Real-Life Data, Transforming Beginners into Professionals.

---
[Download PDF version of this notebook](https://github.com/AmitXShukla/RPA/blob/main/SampleData/The%20Ultimate%20Guide%20to%20Data%20Wrangling%20with%20Python%20-%20Rust%20Polars%20Data%20Frame.pdf)

[Video Tutorials](https://www.youtube.com/playlist?list=PLp0TENYyY8lHJaY4t5bAihnFS5TBUQYV1)

    Author: Amit Shukla

[https://github.com/AmitXShukla](https://github.com/AmitXShukla)

[https://twitter.com/ashuklax](https://github.com/AShuklaX)

[https://youtube.com/AmitXShukla](https://youtube.com/@Amit.Shukla)

By the end of this blog, you will have learned techniques for the following:

- Data Discovery using Pandas 2.0
- Create Data ERD diagram with animation (using manim)
- Data Visualization using Matplotlib
- Data Visualization using PlotLy
- Data Visualization using Seaborn
- Analyze Distributions
- Spot Anomalies
- Test Hypothesis
- Data Patterns
- Check Assumptions
- Create Interactive Visualizations
- what-if Analysis
- would, could, should
- Time Travel on Time Series Data
- Linear Regression

---

#### Introduction
I'm Amit Shukla, and I specialize in training neural networks for finance supply chain analysis, enabling them to identify data patterns and make accurate predictions.
During the challenges posed by the COVID-19 pandemic, I successfully trained GL and Supply Chain neural networks to anticipate supply chain shortages. The valuable insights gained from this effort have significantly influenced the content of this tutorial series.
	
#### Objective
In this series, we will master the fundamental techniques of Exploratory Data Analysis. This knowledge is crucial in preparing finance and supply chain data for advanced analytics, visualization, and predictive modeling using neural networks and machine learning.
	
#### Subject
It's important to note that this particular series will concentrate solely on `Exploratory Data Analysis`.
	
#### Following
In future installments, we will explore Data Analytics and delve into machine learning for predictive analytics.

Thank you for joining me. I'm excited to embark on this educational journey together.
	
Let's get started.

---

# Table of Contents
---

- What is EDA
- Installation
- Loading Finance and Supply chain Data
- Data Discovery using Pandas 2.0
- Create Data ERD diagram with animation (using manim)
- Data Visualization using Matplotlib
- Data Visualization using PlotLy
- Data Visualization using Seaborn
- Analyze Distributions
- Spot Anomalies
- Test Hypothesis
- Data Patterns
- Check Assumptions
- Create Interactive Visualizations
- what-if Analysis
- would, could, should
- Time Travel on Time Series Data
- Linear Regression

# What is EDA
---
EDA is often characterized as a tool for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
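
As a rough illustration of what "spotting anomalies" can look like in practice, the sketch below summarizes a numeric column and flags values that fall far outside the interquartile range. The toy values and the 1.5 × IQR threshold are assumptions chosen for illustration; they are not part of the tutorial's dataset or a rule the tutorial prescribes.

```python
import pandas as pd

# Hypothetical toy data; column names borrow from the ledger used later
df = pd.DataFrame({
    "PERIOD": [1, 2, 3, 4, 5, 6],
    "POSTED_TOTAL": [100, 110, 95, 105, 5000, 102],
})

# Summary statistics give a first view of the distribution
summary = df["POSTED_TOTAL"].describe()
print(summary)

# Flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)
q1 = df["POSTED_TOTAL"].quantile(0.25)
q3 = df["POSTED_TOTAL"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["POSTED_TOTAL"] < q1 - 1.5 * iqr) | (df["POSTED_TOTAL"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

A flagged row is only a candidate anomaly. Whether it is an error or a genuine business event still has to be checked against the source.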


# Installation
---

In [None]:
pip install polars pandas numpy matplotlib seaborn

# Loading Finance and Supply chain Data
---

In this section, we first load our dataset.

Make sure that every step in this section runs without errors and that the data is loaded before starting the data discovery journey.

You can ignore the technical content of this section for now; it is needed only to load the data. How the data is loaded is not the point. The point is that this exercise applies EDA to a real-life dataset.

In later sections, you will use EDA to discover data patterns and then confirm those findings against the datasets created here, which is why the technical details of how they are created are intentionally skipped for now.

Let's load this data now.

# Data Discovery
---


The first step in Data Discovery is to build an inventory of all accessible datasets and keep it up to date. The relationships between these datasets must be identified before any data transformation or analytics begins.

This phase matters because it produces a formal diagram similar to an Entity-Relationship Diagram (ERD). The key tasks are pinpointing data types and identifying the fields that carry useful information. This helps in understanding both the data and the business processes behind it.

In this section, we work through the Data Discovery phase by examining the available data and drafting an ERD that captures the dataset structures.

In [None]:
pip install pandas

Let's begin by examining the available dataset.

For now, we won't concern ourselves with its source; I'll provide the scripts used to generate it later.

Our goal is to simulate a real-world project scenario where analysts often receive unfamiliar datasets and initiate data exploration.

The following steps demonstrate this process, and we'll take it one step at a time to learn how to approach data discovery. 

Keep in mind that there's no one-size-fits-all approach; it varies based on data types and quality.

Consider these steps as general guidelines. Let's begin.

In [None]:
# this dataset is generated using Polars DataFrame
# make sure you have polars installed before generating this dataset

pip install polars

In [65]:
# use this script to create sample Finance dataset

import polars as pl
import os

dirPath = "../../../downloads/" # directory where sample csv are generated
sampleSize = 100_000 # generate 100k sample rows

print(os.listdir(dirPath))

# Create DataFrames from dicts.
# let's create a more sophisticated DataFrame
# in the real world, organizations maintain dozens of record structures
# to store different types of locations, like ShipTo Location, Receiving,
# Mailing, Corp. office, head office,
# field office etc.

########################
## LOCATION DataFrame ##
########################
import random
from datetime import datetime

location = pl.DataFrame({
    "ID":  list(range(11, 23)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Boston","New York","Philadelphia","Cleveland","Richmond",
                     "Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
                     "Dallas","San Francisco"],
    "REGION": ["Region A","Region B","Region C","Region D"] * 3,
    "TYPE" : "Physical",
    "CATEGORY" : ["Ship","Recv","Mfg"] * 4
})
location.sample(5).with_row_count("Row #")

########################
## ACCOUNTS DataFrame ##
########################

accounts = pl.DataFrame({
    "ID":  list(range(10000, 45000, 1000)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Operating Expenses","Non Operating Expenses","Assets",
                     "Liabilities","Net worth accounts", "Statistical Accounts",
                     "Revenue"] * 5,
    "REGION": ["Region A","Region B","Region C","Region D", "Region E"] * 7,
    "TYPE" : ["E","E","A","L","N","S","R"] * 5,
    "STATUS" : "Active",
    "CLASSIFICATION" : ["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", 
                        "ASSETS","LIABILITIES","NET_WORTH","STATISTICS",
                        "REVENUE"] * 5,
    "CATEGORY" : [
       		"Travel","Payroll","non-Payroll","Allowance","Cash",
       		"Facility","Supply","Services","Investment","Misc.",
       		"Depreciation","Gain","Service","Retired","Fault.",
       		"Receipt","Accrual","Return","Credit","ROI",
       		"Cash","Funds","Invest","Transfer","Roll-over",
       		"FTE","Members","Non_Members","Temp","Contractors",
       		"Sales","Merchant","Service","Consulting","Subscriptions"
       	],
})
accounts.sample(5).with_row_count("Row #")

##########################
## DEPARTMENT DataFrame ##
##########################

dept = pl.DataFrame({
    "ID":  list(range(1000, 2500, 100)),
    "AS_OF_DATE" : datetime(2022, 1, 1),
    "DESCRIPTION" : ["Sales & Marketing","Human Resource",
                     "Information Technology","Business leaders","other temp"] * 3,
    "REGION": ["Region A","Region B","Region C"] * 5,
    "STATUS" : "Active",
    "CLASSIFICATION" : ["SALES","HR", "IT","BUSINESS","OTHERS"] * 3,
    "TYPE" : ["S","H","I","B","O"] * 3,
    "CATEGORY" : ["sales","human_resource","IT_Staff","business","others"] * 3,
})
dept.sample(5).with_row_count("Row #")

######################
## LEDGER DataFrame ##
######################

org = "ABC Inc."
ledger_type = "ACTUALS" # BUDGET, STATS are other Ledger types
fiscal_year_from = 2020
fiscal_year_to = 2023
random.seed(123)

ledger = pl.DataFrame({
	"LEDGER" : ledger_type,
	"ORG" : org,
	"FISCAL_YEAR": random.choices(list(range(fiscal_year_from, 
                                          fiscal_year_to+1, 1)),k=sampleSize),
	"PERIOD": random.choices(list(range(1, 12+1, 1)),k=sampleSize),
	"ACCOUNT" : random.choices(accounts["ID"], k=sampleSize),
	"DEPT" : random.choices(dept["ID"], k=sampleSize),
	"LOCATION" : random.choices(location["ID"], k=sampleSize),
	"POSTED_TOTAL": random.sample(range(1000000), sampleSize)
})
ledger.sample(5).with_row_count("Row #")

ledger_type = "BUDGET" # ACTUALS, STATS are other Ledger types

ledgerBudget = pl.DataFrame({
	"LEDGER" : ledger_type,
	"ORG" : org,
	"FISCAL_YEAR": random.choices(list(range(fiscal_year_from, fiscal_year_to+1, 1))
                               ,k=sampleSize),
	"PERIOD": random.choices(list(range(1, 12+1, 1)),k=sampleSize),
	"ACCOUNT" : random.choices(accounts["ID"], k=sampleSize),
	"DEPT" : random.choices(dept["ID"], k=sampleSize),
	"LOCATION" : random.choices(location["ID"], k=sampleSize),
	"POSTED_TOTAL": random.sample(range(1000000), sampleSize)
})
ledgerBudget.sample(5).with_row_count("Row #")
#########################################
# combined ledger for Actuals and Budget
#########################################
dfLedger = pl.concat([ledger, ledgerBudget], how="vertical")
dfLedger.sample(5).with_row_count("Row #")

location.write_csv(f"{dirPath}location.csv")
dept.write_csv(f"{dirPath}dept.csv")
accounts.write_csv(f"{dirPath}accounts.csv")
dfLedger.write_csv(f"{dirPath}ledger.csv")

print(os.listdir(dirPath))

['accounts.csv', 'dept.csv', 'earth.jpg', 'ledger.csv', 'ledger.json', 'ledger.parquet', 'location.csv']
['accounts.csv', 'dept.csv', 'earth.jpg', 'ledger.csv', 'ledger.json', 'ledger.parquet', 'location.csv']


## Pandas to load data into DataFrames

In [38]:
import os
dirPath = "../../../downloads/"
# print(os.listdir(dirPath))

# as you can see, there are many files,
# we will focus on loading only csv/xls/xlsx files for now

for filename in os.listdir(dirPath):
    if filename.endswith(".csv"):
        print("Eligible to read into DataFrame: ", filename)

import pandas as pd
dfAccounts = pd.read_csv(dirPath+"accounts.csv")
dfDept = pd.read_csv(dirPath+"dept.csv")
dfLocation = pd.read_csv(dirPath+"location.csv")
dfLedger = pd.read_csv(dirPath+"ledger.csv")

print(dfAccounts.shape, dfDept.shape, dfLocation.shape, dfLedger.shape)
dfLedger.sample(10)

Eligible to read into DataFrame:  accounts.csv
Eligible to read into DataFrame:  dept.csv
Eligible to read into DataFrame:  ledger.csv
Eligible to read into DataFrame:  location.csv
(35, 8) (15, 8) (12, 6) (200000, 8)


| | LEDGER | ORG | FISCAL_YEAR | PERIOD | ACCOUNT | DEPT | LOCATION | POSTED_TOTAL |
|---|---|---|---|---|---|---|---|---|
| 16322 | ACTUALS | ABC Inc. | 2022 | 5 | 39000 | 2400 | 20 | 919779 |
| 41713 | ACTUALS | ABC Inc. | 2021 | 10 | 27000 | 1800 | 22 | 282452 |
| 94363 | ACTUALS | ABC Inc. | 2020 | 5 | 44000 | 2000 | 16 | 622392 |
| 179527 | BUDGET | ABC Inc. | 2020 | 7 | 35000 | 1300 | 21 | 739302 |
| 176177 | BUDGET | ABC Inc. | 2022 | 12 | 39000 | 1200 | 17 | 573450 |
| 37782 | ACTUALS | ABC Inc. | 2022 | 11 | 23000 | 1300 | 22 | 51061 |
| 108868 | BUDGET | ABC Inc. | 2021 | 3 | 38000 | 1300 | 18 | 500272 |
| 112988 | BUDGET | ABC Inc. | 2022 | 7 | 17000 | 1700 | 16 | 102737 |
| 39437 | ACTUALS | ABC Inc. | 2021 | 11 | 15000 | 1000 | 13 | 923997 |
| 72200 | ACTUALS | ABC Inc. | 2023 | 2 | 42000 | 2400 | 22 | 417441 |
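
With the frames loaded, the relationships that an ERD would capture can be checked directly in pandas. The sketch below is one possible approach, not the tutorial's prescribed method: it uses small stand-in frames that mirror the column names above (the values, including the unmatched location `99`, are made up) to inspect data types, test whether `ID` is unique in a lookup table, and find ledger values with no matching lookup row.

```python
import pandas as pd

# Small stand-ins mirroring the tutorial's column names; values are hypothetical
dfLocation = pd.DataFrame({"ID": [11, 12, 13],
                           "DESCRIPTION": ["Boston", "New York", "Philadelphia"]})
dfDept = pd.DataFrame({"ID": [1000, 1100],
                       "DESCRIPTION": ["Sales & Marketing", "Human Resource"]})
dfLedger = pd.DataFrame({
    "LEDGER": ["ACTUALS", "BUDGET", "ACTUALS"],
    "DEPT": [1000, 1100, 1000],
    "LOCATION": [11, 13, 99],  # 99 deliberately has no match in dfLocation
    "POSTED_TOTAL": [500, 700, 300],
})

# 1. Column data types and missing-value counts
print(dfLedger.dtypes)
print(dfLedger.isna().sum())

# 2. Is ID a candidate primary key in the lookup table?
location_id_is_unique = dfLocation["ID"].is_unique

# 3. Do all ledger keys exist in the lookup tables (foreign-key check)?
orphan_locations = sorted(set(dfLedger["LOCATION"]) - set(dfLocation["ID"]))
orphan_depts = sorted(set(dfLedger["DEPT"]) - set(dfDept["ID"]))
print(location_id_is_unique, orphan_locations, orphan_depts)
```

A unique `ID` column plus an empty orphan list is reasonable evidence for drawing a one-to-many line between a lookup table and the ledger in the ERD.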


In [1]:
# while discussing IO operations
# let's take a quick look at how to read JSON, parquet and excel files

import os
import pandas as pd
dirPath = "../../../downloads/"
os.listdir(dirPath)

# read json file
dfTemp = pd.read_json(dirPath+"ledger.json")

# # read parquet file
# dfTemp = pd.read_parquet(dirPath+"ledger.parquet", engine="pyarrow")
# # pip install pyarrow before use
# # fastparquet is another engine used

# # in case of excel files, use read_excel function
# dfTemp = pd.read_excel(dirPath+"ledger.xls", sheet_name="Sheet1")

# # handy snippet to read each Excel sheet into a Python dictionary
# workbook = pd.ExcelFile('ledger.xlsx')
# dictionary = {}
# for sheet_name in workbook.sheet_names:
#     df = workbook.parse(sheet_name)
#     dictionary[sheet_name] = df

dfTemp.sample

<bound method NDFrame.sample of                                              columns
0  {'name': 'LEDGER', 'datatype': 'Utf8', 'values...
1  {'name': 'ORG', 'datatype': 'Utf8', 'values': ...
2  {'name': 'FISCAL_YEAR', 'datatype': 'Int64', '...
3  {'name': 'PERIOD', 'datatype': 'Int64', 'value...
4  {'name': 'ACCOUNT', 'datatype': 'Int64', 'valu...
5  {'name': 'DEPT', 'datatype': 'Int64', 'values'...
6  {'name': 'LOCATION', 'datatype': 'Int64', 'val...
7  {'name': 'POSTED_TOTAL', 'datatype': 'Int64', ...>
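
The output above suggests that `ledger.json` stores one object per column (with `name`, `datatype`, and `values` keys) under a top-level `columns` key, which would explain why `read_json` returns eight rows rather than a ledger table. Assuming that layout, which is inferred from the printed output and not confirmed, a minimal sketch for reshaping it into a regular DataFrame could look like this. The inline JSON string is a hypothetical miniature of the file.

```python
import json
import pandas as pd

# Hypothetical miniature of the layout inferred from the output above
raw = json.dumps({
    "columns": [
        {"name": "LEDGER", "datatype": "Utf8", "values": ["ACTUALS", "BUDGET"]},
        {"name": "FISCAL_YEAR", "datatype": "Int64", "values": [2020, 2021]},
    ]
})

# With the real file, the payload would come from json.load() on an open file handle
payload = json.loads(raw)

# Turn each {"name": ..., "values": ...} entry into one DataFrame column
dfTemp = pd.DataFrame({col["name"]: col["values"] for col in payload["columns"]})
print(dfTemp)
```

If the real file turns out to be row-oriented instead, `pd.read_json` with a suitable `orient` argument would be the simpler route.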

Now that we have had a glimpse of the Finance datasets, let's learn more about the Pandas data structures that hold them and allow us to perform operations on them.

Pandas provides two major data structures to work with.

#### Series and DataFrame
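
As a brief sketch of the difference between the two structures: a Series is a one-dimensional labeled array, and a DataFrame is a table whose columns are Series sharing one index. The values below are illustrative only.

```python
import pandas as pd

# A Series: one-dimensional labeled array
periods = pd.Series([1, 2, 3], name="PERIOD")

# A DataFrame: a table whose columns share a single index
df = pd.DataFrame({
    "PERIOD": [1, 2, 3],
    "POSTED_TOTAL": [919779, 282452, 622392],
})

col = df["POSTED_TOTAL"]    # single brackets select one column as a Series
sub = df[["POSTED_TOTAL"]]  # double brackets keep the result as a DataFrame

print(type(periods).__name__, type(col).__name__, type(sub).__name__)
```

The single versus double bracket distinction matters later, since many plotting and aggregation calls behave differently on a Series than on a one-column DataFrame.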

# Create Data ERD diagram with animation (using manim)
---

# Data Visualization using Matplotlib
---

# Data Visualization using PlotLy
---

# Data Visualization using Seaborn
---