**Table of contents**<a id='toc0_'></a>    
- [What's EDA?](#toc1_)    
- [Why is EDA important?](#toc2_)    
    - [⚠️ **Attention** ⚠️](#toc2_1_1_)    
- [How do we EDA?](#toc3_)    
  - [Software](#toc3_1_)    
  - [Plots / Charts](#toc3_2_)    
  - [Methodology](#toc3_3_)    
- [Let the EDA begin](#toc4_)    
  - [Histogram](#toc4_1_)    
  - [Box plot](#toc4_2_)    
  - [Bar plots](#toc4_3_)    
  - [~~Pie charts~~](#toc4_4_)    
  - [Treemap](#toc4_5_)    
  - [Scatter plot](#toc4_6_)    
  - [Line plot](#toc4_7_)    
  - [💡 Check for understanding](#toc4_8_)    
  - [Common mistakes!](#toc4_9_)    
    - [Plotting without understanding what you're plotting (e.g. `customer_id`)](#toc4_9_1_)    
    - [Doing barplots for numerical continuous data](#toc4_9_2_)    
    - [Doing boxplots for numerical discrete data](#toc4_9_3_)    
    - [Considering histograms on numerical discrete data as normal distributions](#toc4_9_4_)    
    - [Creating noisy plots](#toc4_9_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[What's EDA?](#toc0_)

> **Exploratory data analysis (EDA) is a process of examining and summarizing data sets** without making any formal assumptions or hypotheses. It helps to understand the structure, patterns, variability, and relationships in data, as well as identify potential problems or anomalies.

> It can be carried out at various stages of the data analytics process, but it is usually conducted before a firm hypothesis or end goal is defined.

# <a id='toc2_'></a>[Why is EDA important?](#toc0_)

> It can be used for:
> + data cleaning
> + subgroup analyses
> + understanding data better

> It aims to:
> + spot patterns and trends
> + identify anomalies
> + test early hypotheses

*Example:* A scatterplot can show us different "clusters" or groups (i.e. concentrations of data points)

![clusters](https://imgs.search.brave.com/OzO3GC6FI7JzvoZX8LzeAqso6bjk7bw-H5XwqOdrt00/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9kM21t/MnM5cjE1aXFjdi5j/bG91ZGZyb250Lm5l/dC9lbi93cC1jb250/ZW50L3VwbG9hZHMv/b2xkLWJsb2ctdXBs/b2Fkcy81MDBweC1z/bGluay1nYXVzc2lh/bi1kYXRhLnBuZw)

*Example:* A lineplot can show us how one variable is connected to another

![lineplot](https://imgs.search.brave.com/m8MYbCLtnAGWF1zDpGJa7_pQ0ARO2bWGHlFss-Siet4/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9pMC53/cC5jb20vc3RhdGlz/dGljc2J5amltLmNv/bS93cC1jb250ZW50/L3VwbG9hZHMvMjAx/OS8xMC9vdXRsaWVy/X2NpcmNsZWQucG5n/P3Jlc2l6ZT01NzYs/MzgzJnNzbD0x)

*Example:* A boxplot can show us if our data has outliers or not

![outliers](https://imgs.search.brave.com/OWWypOQK8IXV4gq9m6mBUQHpZ7Y9Bi0-tdtwHhulMnk/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9tZWRp/YS5nZWVrc2Zvcmdl/ZWtzLm9yZy93cC1j/b250ZW50L3VwbG9h/ZHMvMjAyMTA3MDcy/MzMyMjIvb3V0bGll/cnNFREEucG5n)

### <a id='toc2_1_1_'></a>[⚠️ **Attention** ⚠️](#toc0_)

**Data visualization is QUALITATIVE, not quantitative**! Many insights (errors, correlations) that you get from plots need to be confirmed using statistical methods or calculations!
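
As a minimal sketch of what "confirming with a calculation" can look like, the cell below checks a visual impression of a relationship with two correlation coefficients. The data is a hypothetical toy example, not the lesson's dataset, and the rank-based coefficient is computed by hand as a Pearson correlation on ranks.

```python
import pandas as pd

# Hypothetical toy data, not taken from the lesson's dataset
df = pd.DataFrame({
    "revenue": [10, 20, 30, 40, 50, 60],
    "profit": [1, 3, 2, 5, 4, 30],
})

# Pearson measures linear association and is sensitive to extreme values
pearson = df["revenue"].corr(df["profit"])

# Spearman is the Pearson correlation of the ranks, so one extreme
# point (profit = 30) influences it far less
spearman = df["revenue"].rank().corr(df["profit"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two numbers disagree noticeably, that is usually a hint to look at outliers or non-linear patterns before trusting what the plot seems to say.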

# <a id='toc3_'></a>[How do we EDA?](#toc0_)

## <a id='toc3_1_'></a>[Software](#toc0_)

To do EDA, you can use multiple tools, such as:
+ Python libraries: matplotlib, seaborn, plotly, bokeh, etc.
+ Dashboarding software: Tableau, PowerBI, Looker, AWS Quicksight, Dash, etc.
+ The OG of data viz: Excel

## <a id='toc3_2_'></a>[Plots / Charts](#toc0_)

There are many types of plots, but we can group them into:  
+ Basic plots:
    + histograms  
    + box plots  
    + bar plots   
    + ~~pie charts~~  
    + treemap   
    + scatter plots  
    + line plots   
    + heatmaps    
     
+ More advanced plots:   
    + violin plots  
    + candlestick chart  
    + lollipop chart  
    + density plot  
    + PCA  
    + ridge plots  
    + ~~3D plots~~   

For more plot possibilities, please have a look at [Data-to-viz](https://www.data-to-viz.com/).

*🗒️Note:* Please look at pie charts and 3D plots once, then forget they exist.

## <a id='toc3_3_'></a>[Methodology](#toc0_)

The typical process of doing EDA looks something like this:
+ Examining each of the variables for the whole sample (univariate analysis)
+ Examining relationships between two or more variables (bivariate and multivariate analysis)
+ Examining variables across subgroups
+ Iterating over the previous 3 steps
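
The steps listed above can be sketched with plain pandas before any plotting. This is only an illustration on a hypothetical toy DataFrame; the column names are assumptions chosen to resemble the dataset used later.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "sector": ["Energy", "Energy", "Retail", "Retail", "Tech", "Tech"],
    "revenue": [120.0, 80.0, 300.0, 150.0, 260.0, 90.0],
    "profit": [10.0, -5.0, 12.0, 6.0, 55.0, 20.0],
})

# 1. Univariate: summarize each variable on its own
univariate = df["profit"].describe()

# 2. Bivariate: look at how two variables move together
correlation = df["revenue"].corr(df["profit"])

# 3. Subgroups: repeat the summary within each category
by_sector = df.groupby("sector")["profit"].agg(["mean", "min", "max"])

print(univariate, correlation, by_sector, sep="\n\n")
```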

Today we'll run an EDA process using a few different libraries, with a focus on `plotly` - and some `matplotlib` and `seaborn` alongside!

# <a id='toc4_'></a>[Let the EDA begin](#toc0_)

<iframe src="https://giphy.com/embed/LmHFLSnktq4vK" width="480" height="241" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/game-let-begin-LmHFLSnktq4vK">via GIPHY</a></p>

In [None]:
# Un-comment these if you cannot import the libraries
#!pip install matplotlib
#!pip install seaborn
#!pip install plotly



In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
fortune.head()

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
0,Walmart,1,0.0,523964.0,14881.0,2200000,Retailing,Bentonville,AR,no,no,no,yes,1.0,C. Douglas McMillon,https://www.stock.walmart.com,WMT,411690
1,Amazon,2,3.0,280522.0,11588.0,798000,Retailing,Seattle,WA,no,yes,no,yes,5.0,Jeffrey P. Bezos,https://www.amazon.com,AMZN,1637405
2,Exxon Mobil,3,-1.0,264938.0,14340.0,74900,Energy,Irving,TX,no,no,no,yes,2.0,Darren W. Woods,https://www.exxonmobil.com,XOM,177923
3,Apple,4,-1.0,260174.0,55256.0,137000,Technology,Cupertino,CA,no,no,no,yes,3.0,Timothy D. Cook,https://www.apple.com,AAPL,2221176
4,CVS Health,5,3.0,256776.0,6634.0,290000,Health Care,Woonsocket,RI,no,no,yes,yes,8.0,Karen S. Lynch,https://www.cvshealth.com,CVS,98496


*🗒️Note:*
> Market Cap = market capitalization: the total value of all of a company's shares of stock

In [None]:
# Check shape
fortune.shape

In [4]:
# Describe numerical dataset
round(fortune.describe())

Unnamed: 0,rank,rank_change,revenue,profit,num. of employees
count,1000.0,1000.0,1000.0,998.0,1000.0
mean,500.0,0.0,15902.0,1345.0,34616.0
std,289.0,22.0,34763.0,4516.0,92024.0
min,1.0,-186.0,1990.0,-8506.0,51.0
25%,251.0,0.0,3164.0,111.0,6400.0
50%,500.0,0.0,5647.0,381.0,13000.0
75%,750.0,0.0,12820.0,1061.0,29192.0
max,1000.0,224.0,523964.0,81417.0,2200000.0


In [5]:
# Describe categorical columns
fortune.describe(include='object')

Unnamed: 0,company,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
count,1000,1000,1000,1000,500,1000,1000,1000,1000.0,992,1000,938,960
unique,1000,21,402,46,2,2,2,2,477.0,989,999,938,946
top,Walmart,Financials,New York,CA,no,no,no,yes,,Patricia K. Poppe,https://www.rtx.com,WMT,-
freq,1,162,70,121,477,956,932,854,523.0,2,2,1,10


## <a id='toc4_2_'></a>[Box plot](#toc0_)

One of the best ways to see the elements you studied in the descriptive statistics lesson (median, quartiles, outliers) is to visualize them using a boxplot:

In [6]:
# plotly box horizontal
px.box(x=fortune.profit)
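
To connect the box plot back to numbers, here is a hedged sketch of the quantities a box plot is usually built from, using the common 1.5 × IQR rule for outliers. The values are hypothetical, and plotting libraries may interpolate quartiles slightly differently from pandas, so the fences on a chart can differ a little.

```python
import pandas as pd

# Hypothetical profit values (in millions), not the real Fortune 1000 data
profit = pd.Series([-50, 100, 120, 300, 380, 400, 900, 1100, 5000])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = profit.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Common 1.5 * IQR rule for flagging outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = profit[(profit < lower_fence) | (profit > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(outliers)
```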

We can also have a look at the profit per sector using boxplots:

In [20]:
# Boxplot per sector
px.box(data_frame = fortune, x='profit', color='sector', hover_data='company', facet_col='ceo_woman')

In [18]:
# Too much noise! I will select only a few industries
industries = ['Telecommunications', 'Energy', 'Financials', 'Hotels, Restaurants & Leisure']
fortune_selection = fortune[fortune.sector.isin(industries)]
px.box(x=fortune_selection.profit, color=fortune_selection.sector)

In [19]:
# Get plotly to display the outlier companies
px.box(data_frame=fortune_selection, x="profit", color="sector", hover_data="company")

In [13]:
fortune_selection.sector.value_counts()

sector
Financials                       162
Energy                           109
Hotels, Restaurants & Leisure     27
Telecommunications                11
Name: count, dtype: int64

## <a id='toc4_1_'></a>[Histogram](#toc0_)

We use histograms to get a complete view of our numerical continuous data, which we cannot do when using a boxplot:

In [None]:
# Profit and revenue are expressed in millions of dollars, so we convert them to dollars

In [16]:
fortune.profit = fortune.profit * 1_000_000
fortune.revenue = fortune.revenue * 1_000_000

In [None]:
# plotly express histogram
# Reading the chart: 60% of the companies are between 0 and 1B, 13% are negative, 11% are between 1B and 2B
px.histogram(fortune.profit, nbins=100)
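
Percentages read off a histogram are estimates. A minimal sketch of how such shares could be computed directly with `pd.cut` follows; the profit values and the bin edges are hypothetical and chosen by hand for illustration.

```python
import pandas as pd

# Hypothetical profits in dollars, for illustration only
profit = pd.Series([-2e9, -1e8, 2e8, 4e8, 5e8, 7e8, 9e8, 1.5e9, 3e9, 8e9])

# Bin edges chosen by hand: negative, 0 to 1B, 1B to 2B, above 2B
bins = [float("-inf"), 0, 1e9, 2e9, float("inf")]
labels = ["negative", "0-1B", "1B-2B", "over 2B"]

# Share of companies (in %) falling into each bin
shares = (
    pd.cut(profit, bins=bins, labels=labels)
    .value_counts(normalize=True)
    .sort_index()
    .mul(100)
)
print(shares)
```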

We can also review subsets of data:

In [24]:
# How do companies run by female CEOs differ in terms of profit?
px.histogram(x=fortune.profit, facet_row=fortune.ceo_woman)#facet=create a subplot

We can clearly see that far fewer companies have a female CEO, but because both facets share one y-axis we still can't read the smaller distribution. Let's give each facet its own y-axis:

In [25]:
fig = px.histogram(x=fortune.profit, facet_row=fortune.ceo_woman)
fig.update_yaxes(matches=None)
fig.show()

In [None]:
# And finally, strip the "name=" prefix from the facet labels on the side of the graph, keeping only the value
fig = px.histogram(x=fortune.profit, facet_row=fortune.ceo_woman)
fig.update_yaxes(matches=None)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

## <a id='toc4_3_'></a>[Bar plots](#toc0_)

Next, we want to have an idea of the proportions and absolute number of companies within each category:

In [27]:
# Check proportion of CEO founders
px.histogram(fortune['ceo_founder'])
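
The chart gives the absolute numbers; the proportions can be read more precisely from a table. A small sketch with `value_counts`, on a hypothetical yes/no column standing in for `ceo_founder`:

```python
import pandas as pd

# Hypothetical yes/no column, standing in for `ceo_founder`
ceo_founder = pd.Series(["no", "no", "yes", "no", "no", "yes", "no", "no"])

counts = ceo_founder.value_counts()                      # absolute numbers
proportions = ceo_founder.value_counts(normalize=True)   # shares that sum to 1

# Side-by-side table of counts and shares
summary = pd.DataFrame({"count": counts, "share": proportions})
print(summary)
```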

In [28]:
fortune_founders = fortune[fortune['ceo_founder'] == 'yes']

In [31]:
px.histogram(fortune_founders.sector)

In [None]:
#Order our values
fortune_founders_amount = fortune_founders.sector.value_counts().sort_values()
px.histogram(x =fortune_founders_amount.index, y=fortune_founders_amount.values)

In [35]:
px.histogram(y =fortune_founders_amount.index, x=fortune_founders_amount.values)

Which sectors make it to the Fortune 1000 most?

In [38]:
# Try visualizing sector
px.histogram(fortune.sector)

In [None]:
# Better visualized in a horizontal plot and sorted
fortune_sectors = fortune.sector.value_counts().sort_values()
px.histogram(y=fortune_sectors.index, x=fortune_sectors.values)

Which sectors are the most profitable, though?

In [41]:
# sector profit
sector_profit = fortune.groupby('sector').agg({'profit':'mean'})
sector_profit = sector_profit.sort_values(by="profit").round()
sector_profit

sector,profit
Materials,224478300.0
Wholesalers,235662900.0
Engineering & Construction,282030000.0
Motor Vehicles & Parts,413827300.0
Household Products,456523100.0
Apparel,563712500.0
Chemicals,607926900.0
Industrials,673434000.0
Energy,735494500.0
"Hotels, Restaurants & Leisure",805829600.0


In [42]:
# Let's redo the chart
px.histogram(y=sector_profit.index, x=sector_profit.profit)

In [43]:
sector_profit = fortune.groupby('sector').agg({'profit':'sum', 'revenue': 'count'}).rename({'revenue': 'count'}, axis=1)
sector_profit['avg_profit_per_company'] = round(sector_profit['profit']/ sector_profit['count'])
sector_profit.sort_values(by = 'avg_profit_per_company', inplace=True)
px.histogram(y=sector_profit.index, x=sector_profit['avg_profit_per_company'])
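
One caveat with dividing a sum by a count: if `profit` has missing values (the `describe()` output earlier shows 998 profit values for 1000 rows), the result can differ from a plain `mean`. The sketch below shows the difference on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing profit value in sector A
df = pd.DataFrame({
    "sector": ["A", "A", "A", "B", "B"],
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "profit": [1.0, np.nan, 3.0, 4.0, 6.0],
})

grouped = df.groupby("sector")

# mean() skips NaN: sector A is averaged over its 2 known profits
mean_profit = grouped["profit"].mean()

# Sum of profit divided by the number of revenue rows: sector A is divided by 3
sum_over_count = grouped["profit"].sum() / grouped["revenue"].count()

print(pd.DataFrame({"mean": mean_profit, "sum / count": sum_over_count}))
```

Neither number is wrong in itself; they simply answer slightly different questions, so it is worth deciding explicitly how missing profits should be treated.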

In [50]:
# Margin per sector: total profit as a percentage of total revenue
sector_profit = fortune.groupby('sector').agg({'profit':'sum', 'revenue': 'sum'})
sector_profit['op_margin'] = round(sector_profit['profit']*100/ sector_profit['revenue'])
sector_profit.sort_values(by = 'op_margin', inplace=True)
px.histogram(y=sector_profit.index, x=sector_profit['op_margin'])

In [48]:
fortune['op_margin'] = round(fortune['profit']*100 / fortune['revenue'], 2)
px.box(y=fortune.sector, x=fortune['op_margin'])
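
A margin can be aggregated per sector in two ways, and they generally give different numbers. The sketch below uses a hypothetical sector with one large and one small company to show the gap between the average of company margins and the margin of the sector as a whole.

```python
import pandas as pd

# Hypothetical sector with one large and one small company
df = pd.DataFrame({
    "sector": ["A", "A"],
    "revenue": [1000.0, 100.0],
    "profit": [50.0, 30.0],
})
df["margin"] = df["profit"] * 100 / df["revenue"]

# Average of the per-company margins: every company weighs the same
mean_of_margins = df.groupby("sector")["margin"].mean()

# Margin of the sector as a whole: large companies dominate
totals = df.groupby("sector")[["profit", "revenue"]].sum()
sector_margin = totals["profit"] * 100 / totals["revenue"]

print(mean_of_margins["A"], sector_margin["A"])
```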

## <a id='toc4_4_'></a>[~~Pie charts~~](#toc0_)

~~Show proportions~~

In [49]:
# Sector
px.pie(values=fortune.sector.value_counts().values, names=fortune.sector.value_counts().index)

In [51]:
# Check revenue per sector
sector_revenue = fortune.groupby('sector')['revenue'].sum()
px.pie(values=sector_revenue.values, names=sector_revenue.index)

In [52]:
companies_per_ceo = fortune.groupby('ceo_woman')['company'].count()
px.pie(values=companies_per_ceo.values, names=companies_per_ceo.index)

## <a id='toc4_5_'></a>[Treemap](#toc0_)

Treemaps are used to show proportions. They are a bit more difficult to build in matplotlib than a pie chart, but much easier to read.

In [53]:
fig = px.treemap(fortune, path=[px.Constant("all"), 'sector'], values='revenue')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

## <a id='toc4_6_'></a>[Scatter plot](#toc0_)

As we've seen in the field relationships lesson, we can use a scatterplot to visualize the relationship between 2 numerical continuous variables:

In [55]:
px.scatter(x=fortune.profit, y=fortune.revenue)

In [54]:
# Let's get some axis titles and labels
fig = px.scatter(x=fortune.profit, y=fortune.revenue)
fig.update_xaxes(title="profit")
fig.update_yaxes(title="revenue")
fig.show()

## <a id='toc4_7_'></a>[Line plot](#toc0_)

Line plots are particularly useful when dealing with trends over time:

In [None]:
# Run this before the next cell
#!pip install yfinance


In [58]:
# Don't worry about this too much, I'm just interested in getting a time series
import yfinance as yf
import datetime as dt
amazon = yf.Ticker('AMZN')
amazon_data = amazon.history(start='2022-01-01', end=dt.date.today())
amazon_data.tail()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2024-11-07 00:00:00-05:00,207.440002,212.25,207.190002,210.050003,52878400,0.0,0.0
2024-11-08 00:00:00-05:00,209.720001,209.960007,207.440002,208.179993,36075800,0.0,0.0
2024-11-11 00:00:00-05:00,208.5,209.649994,205.589996,206.839996,35456000,0.0,0.0
2024-11-12 00:00:00-05:00,208.369995,209.539993,206.009995,208.910004,38942900,0.0,0.0
2024-11-13 00:00:00-05:00,209.399994,215.089996,209.139999,214.100006,46212900,0.0,0.0


In [59]:
fig = px.line(x=amazon_data.index, y=amazon_data.Close)
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Close")

In [60]:
# Don't use a line plot for unordered data like this (use a scatter plot): the line just connects the points in row order
px.line(x=fortune.profit, y=fortune.revenue)

## <a id='toc4_8_'></a>[💡 Check for understanding](#toc0_)

You will still be working with the Fortune 1000 dataset like last time but this time you will visualize the results!

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
fortune.sample(10)

**Questions**
- Show the number of different companies per state with an appropriate chart. 
- Show the relative proportion of states in the dataset with an appropriate chart.
- Can you do the same to check the overall revenue per state?
- Check the distribution of market cap for the companies in the dataset. What do you see?
- Are there many outlier companies when looking at market cap? Choose an appropriate graph to show this.
- Display the number of companies per sector for the top 10% of companies. Do the same for the bottom 10%.
- Check how market cap changes in relationship to the profit with an appropriate plot. What do you see?
- Lastly, have a look at the sectors that have female CEOs. Which one is the most prevalent?
- What new information did we get through EDA compared to last time? Feel free to look for other things as well 😉

*Notes:*
- You might need to remove NaNs from some of the columns!
- You might need to convert some data types!
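
As a hedged illustration of both notes, the sketch below converts a numeric column that was read as text and drops the values that cannot be parsed. The sample values are hypothetical; the `"-"` placeholder mirrors the most frequent `Market Cap` value shown in the `describe(include='object')` output above.

```python
import pandas as pd

# Hypothetical values mimicking a numeric column stored as text,
# with "-" used as a placeholder for missing data
market_cap = pd.Series(["411690", "1637405", "-", "98496", None])

# errors="coerce" turns anything that cannot be parsed into NaN
market_cap_num = pd.to_numeric(market_cap, errors="coerce")

# Drop the missing values before plotting or summarizing
clean = market_cap_num.dropna()
print(clean)
```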

## <a id='toc4_9_'></a>[Common mistakes!](#toc0_)

In [61]:
import plotly.express as px
customer_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv")

### <a id='toc4_9_1_'></a>[Plotting without understanding what you're plotting (e.g. `customer_id`)](#toc0_)

In [63]:
px.histogram(customer_data, x='Customer')

This graph doesn't give us any meaningful information. `Customer` is an identifier, so the histogram just counts rows per customer ID, and with one bar per ID it is too cluttered to read.
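
One quick way to spot identifier-like columns before plotting is to compare the number of distinct values with the number of rows. This is a rough heuristic sketched on a hypothetical table, not a rule that fits every dataset.

```python
import pandas as pd

# Hypothetical customer table
df = pd.DataFrame({
    "Customer": ["AA1", "BB2", "CC3", "DD4"],
    "State": ["CA", "CA", "WA", "OR"],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = df.nunique() / len(df)

# Columns where every value is unique behave like identifiers
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()
print(id_like)
```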

### <a id='toc4_9_2_'></a>[Doing barplots for numerical continuous data](#toc0_)

In [64]:
px.bar(customer_data, y='Income')

### <a id='toc4_9_3_'></a>[Doing boxplots for numerical discrete data](#toc0_)

In [65]:
px.box(customer_data, x='Number of Open Complaints')

### <a id='toc4_9_4_'></a>[Considering histograms on numerical discrete data as normal distributions](#toc0_)

In [66]:
px.histogram(customer_data, x='Number of Open Complaints')
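
For a discrete variable, a frequency table and a mean/median comparison are often more informative than reading the histogram as a bell curve. A minimal sketch on hypothetical complaint counts:

```python
import pandas as pd

# Hypothetical complaint counts per customer
complaints = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 5])

# Frequency of each distinct value, ordered by the value itself
freq = complaints.value_counts().sort_index()

# A mean well above the median hints at right skew rather than a bell shape
mean, median = complaints.mean(), complaints.median()

print(freq)
print(f"mean={mean}, median={median}")
```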

### <a id='toc4_9_5_'></a>[Creating incorrectly formatted plots](#toc0_)

In [67]:
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')

In [68]:
# The dates aren't properly formatted!
customer_data["Effective To Date"] = pd.to_datetime(customer_data["Effective To Date"])
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



### <a id='toc4_9_5_'></a>[Creating noisy plots](#toc0_)

In [69]:
customer_data["Effective To Date"] = pd.to_datetime(customer_data["Effective To Date"])
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')

In [70]:
# The chart above still isn't very useful for understanding trends over time
pivot = customer_data.groupby('Effective To Date')['Total Claim Amount'].sum().reset_index()
px.line(pivot, x='Effective To Date', y='Total Claim Amount')
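
The date-parsing warning shown earlier can usually be avoided by passing an explicit `format` to `pd.to_datetime`. The sketch below does this on hypothetical rows and then aggregates per day; the `"%m/%d/%y"` format is an assumption about how the dates are written, so check a few raw values before reusing it.

```python
import pandas as pd

# Hypothetical rows; the "%m/%d/%y" format is an assumption about how
# the dates are written in the real file
df = pd.DataFrame({
    "Effective To Date": ["1/15/11", "1/15/11", "2/3/11"],
    "Total Claim Amount": [100.0, 250.0, 80.0],
})

# An explicit format makes parsing consistent across all rows
df["Effective To Date"] = pd.to_datetime(df["Effective To Date"], format="%m/%d/%y")

# One row per date, ready for a line plot over time
daily = df.groupby("Effective To Date")["Total Claim Amount"].sum().reset_index()
print(daily)
```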