**Table of contents**<a id='toc0_'></a>    
- [What's EDA?](#toc1_)    
- [Why is EDA important?](#toc2_)    
    - [⚠️ **Attention** ⚠️](#toc2_1_1_)    
- [How do we EDA?](#toc3_)    
  - [Software](#toc3_1_)    
  - [Plots / Charts](#toc3_2_)    
  - [Methodology](#toc3_3_)    
- [Let the EDA begin](#toc4_)    
  - [Histogram](#toc4_1_)    
  - [Box plot](#toc4_2_)    
  - [Bar plots](#toc4_3_)    
  - [~~Pie charts~~](#toc4_4_)    
  - [Treemap](#toc4_5_)    
  - [Scatter plot](#toc4_6_)    
  - [Line plot](#toc4_7_)    
  - [💡 Check for understanding](#toc4_8_)    
  - [Common mistakes!](#toc4_9_)    
    - [Plotting without understanding what you're plotting (e.g. `customer_id`)](#toc4_9_1_)    
    - [Doing barplots for numerical continuous data](#toc4_9_2_)    
    - [Doing boxplots for numerical discrete data](#toc4_9_3_)    
    - [Considering histograms on numerical discrete data as normal distributions](#toc4_9_4_)    
    - [Creating noisy plots](#toc4_9_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[What's EDA?](#toc0_)

> **Exploratory data analysis (EDA) is a process of examining and summarizing data sets** without making any formal assumptions or hypotheses. It helps to understand the structure, patterns, variability, and relationships in data, as well as identify potential problems or anomalies.

> It can be carried out at various stages of the data analytics process, but it is usually conducted before a firm hypothesis or end goal is defined.

# <a id='toc2_'></a>[Why is EDA important?](#toc0_)

> It can be used for:
> + data cleaning
> + subgroup analyses
> + understanding data better

> It aims to:
> + spot patterns and trends
> + identify anomalies
> + test early hypotheses

*Example:* A scatterplot can show us different "clusters" or groups (i.e. concentrations of data points)

![clusters](https://imgs.search.brave.com/OzO3GC6FI7JzvoZX8LzeAqso6bjk7bw-H5XwqOdrt00/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9kM21t/MnM5cjE1aXFjdi5j/bG91ZGZyb250Lm5l/dC9lbi93cC1jb250/ZW50L3VwbG9hZHMv/b2xkLWJsb2ctdXBs/b2Fkcy81MDBweC1z/bGluay1nYXVzc2lh/bi1kYXRhLnBuZw)

*Example:* A lineplot can show us how one variable is connected to another

![lineplot](https://imgs.search.brave.com/m8MYbCLtnAGWF1zDpGJa7_pQ0ARO2bWGHlFss-Siet4/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9pMC53/cC5jb20vc3RhdGlz/dGljc2J5amltLmNv/bS93cC1jb250ZW50/L3VwbG9hZHMvMjAx/OS8xMC9vdXRsaWVy/X2NpcmNsZWQucG5n/P3Jlc2l6ZT01NzYs/MzgzJnNzbD0x)

*Example:* A boxplot can show us if our data has outliers or not

![outliers](https://imgs.search.brave.com/OWWypOQK8IXV4gq9m6mBUQHpZ7Y9Bi0-tdtwHhulMnk/rs:fit:860:0:0/g:ce/aHR0cHM6Ly9tZWRp/YS5nZWVrc2Zvcmdl/ZWtzLm9yZy93cC1j/b250ZW50L3VwbG9h/ZHMvMjAyMTA3MDcy/MzMyMjIvb3V0bGll/cnNFREEucG5n)

### <a id='toc2_1_1_'></a>[⚠️ **Attention** ⚠️](#toc0_)

**Data visualization is QUALITATIVE, not quantitative**! Many insights (errors, correlations) that you get from plots need to be confirmed using statistical methods or calculations!
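
As a minimal sketch of what "confirming with calculations" can look like, the snippet below builds a small synthetic dataset and backs up an apparent linear trend with a Pearson correlation coefficient. The column names `x` and `y` and all the numbers are made up for illustration; they do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data, purely illustrative: y depends linearly on x plus noise
rng = np.random.default_rng(seed=42)
x = rng.normal(loc=0, scale=1, size=500)
y = 2 * x + rng.normal(loc=0, scale=0.5, size=500)
toy = pd.DataFrame({"x": x, "y": y})

# A scatter plot of `toy` would suggest a positive relationship;
# the correlation coefficient puts a number on that impression
pearson_r = toy["x"].corr(toy["y"])

print(f"Pearson r: {pearson_r:.2f}")
```

A value close to 1 or -1 supports what the plot suggests, while a value near 0 is a hint that the visual pattern may not be a linear relationship at all.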

# <a id='toc3_'></a>[How do we EDA?](#toc0_)

## <a id='toc3_1_'></a>[Software](#toc0_)

To do EDA, you can use multiple tools, such as:
+ Python libraries: matplotlib, seaborn, plotly, bokeh, etc.
+ Dashboarding software: Tableau, PowerBI, Looker, AWS Quicksight, Dash, etc.
+ The OG of data viz: Excel

## <a id='toc3_2_'></a>[Plots / Charts](#toc0_)

There are many types of plots, but we can group them into:  
+ Basic plots:
    + histograms  
    + box plots  
    + bar plots   
    + ~~pie charts~~  
    + treemap   
    + scatter plots  
    + line plots   
    + heatmaps    
     
+ More advanced plots:   
    + violin plots  
    + candlestick chart  
    + lollipop chart  
    + density plot  
    + PCA  
    + ridge plots  
    + ~~3D plots~~   

For more plot possibilities, please have a look at [Data-to-viz](https://www.data-to-viz.com/).

*🗒️Note:* Please look at pie charts and 3D plots once, then forget they exist.

## <a id='toc3_3_'></a>[Methodology](#toc0_)

The typical process of doing EDA looks something like this:
+ Examining each of the variables for the whole sample (univariate analysis)
+ Examining relationships between two or more variables (bivariate and multivariate analysis)
+ Examining variables across subgroups
+ Repeating the previous 3 steps as new questions come up
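
The steps above can be sketched without any plotting at all. The example below uses a tiny hypothetical table (the `group`, `sales` and `costs` columns are invented for illustration) and only shows one possible numeric counterpart for each step, not a complete EDA.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "sales": [10.0, 12.0, 30.0, 28.0, 35.0, 5.0],
    "costs": [7.0, 8.0, 20.0, 21.0, 22.0, 4.0],
})

# 1. Univariate: summarize each variable on its own
sales_summary = toy["sales"].describe()
group_counts = toy["group"].value_counts()

# 2. Bivariate: relationship between two numerical variables
sales_costs_corr = toy["sales"].corr(toy["costs"])

# 3. Subgroups: the same kind of summary, split by a categorical variable
sales_by_group = toy.groupby("group")["sales"].mean()

print(sales_summary, group_counts, sales_costs_corr, sales_by_group, sep="\n\n")
```

In practice each of these numbers would usually be paired with a plot (histogram, scatter plot, grouped box plot), and whatever looks surprising feeds the next iteration.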

Today we'll run an EDA process using a few different libraries, with a focus on `plotly` - and some `matplotlib` and `seaborn` alongside!

# <a id='toc4_'></a>[Let the EDA begin](#toc0_)

<iframe src="https://giphy.com/embed/LmHFLSnktq4vK" width="480" height="241" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/game-let-begin-LmHFLSnktq4vK">via GIPHY</a></p>

In [None]:
# Un-comment these if you cannot import the libraries
# !pip install matplotlib
# !pip install seaborn
# !pip install plotly

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
fortune.head()

*🗒️Note:*
> Market cap (short for market capitalization) refers to the total value of all of a company's shares of stock.

In [None]:
# Check shape
fortune.shape

In [None]:
# Describe numerical columns (rounded for readability)
round(fortune.describe())

In [None]:
# Describe categorical columns
fortune.describe(include='object')

## <a id='toc4_2_'></a>[Box plot](#toc0_)

One of the best ways to see the elements you studied in the descriptive statistics lesson (median, quartiles, outliers) is to visualize them using a boxplot:

In [None]:
# plotly box horizontal
px.box(x=fortune.profit)
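
To make the pieces of a box plot concrete, here is a small sketch that computes them by hand on made-up numbers. It assumes the conventional 1.5 × IQR rule for flagging outliers; plotting libraries may compute quartiles with a slightly different method, so the values shown in a chart's hover labels can differ a little.

```python
import pandas as pd

# Made-up values for illustration, with one obvious extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# The box: first quartile, median, third quartile
q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The fences: points beyond them are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```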

We can also have a look at the profit per sector using boxplots:

In [None]:
# Boxplot per sector
px.box(x=fortune.profit, color=fortune.sector)

In [None]:
# Too much noise! I will select only a few industries
industries = ['Media', 'Energy', 'Financials', 'Aerospace & Defense']
fortune_selection = fortune[fortune.sector.isin(industries)]
px.box(x=fortune_selection.profit, color=fortune_selection.sector)

In [None]:
# Get plotly to display the company name when hovering over a point (e.g. an outlier)
px.box(data_frame=fortune_selection, x="profit", color="sector", hover_data=["company"])

## <a id='toc4_1_'></a>[Histogram](#toc0_)

We use histograms to see the full shape of the distribution of our numerical continuous data, which a boxplot only summarizes:

In [None]:
# plotly express histogram
px.histogram(fortune.revenue)
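
Under the hood, a histogram is just a count of values per bin. The sketch below illustrates this with `numpy.histogram` on a synthetic, right-skewed sample (the sample and the choice of 10 bins are assumptions made for this example only).

```python
import numpy as np

# Synthetic right-skewed sample, purely illustrative
rng = np.random.default_rng(seed=0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)

# Split the value range into 10 equal-width bins and count values per bin
counts, bin_edges = np.histogram(sample, bins=10)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}): {count}")
```

Changing the number of bins changes the picture, which is one reason a histogram should be read as a qualitative view of the distribution rather than an exact description.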

We can also review subsets of data:

In [None]:
# How do companies run by female CEOs differ in terms of revenue?
px.histogram(x=fortune.revenue, facet_row=fortune.ceo_woman)

We can clearly see that far fewer companies have a female CEO, but because both facets share the same y-axis we still can't quite read the smaller distribution, so let's fix that:

In [None]:
fig = px.histogram(x=fortune.revenue, facet_row=fortune.ceo_woman)
fig.update_yaxes(matches=None)
fig.show()

In [None]:
# And finally, to remove the "facet_row=" prefix from the facet labels
fig = px.histogram(x=fortune.revenue, facet_row=fortune.ceo_woman)
fig.update_yaxes(matches=None)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.show()

## <a id='toc4_3_'></a>[Bar plots](#toc0_)

Next, we want to have an idea of the proportions and absolute number of companies within each category:

In [None]:
# Check proportion of CEO founders
px.histogram(fortune['ceo_founder'])

Which sectors make it to the Fortune 1000 most?

In [None]:
# Try visualizing sector
px.histogram(fortune.sector)

In [None]:
# Better visualized in a horizontal plot
px.histogram(y=fortune.sector)

Which sectors are the most profitable, though?

In [None]:
# sector profit
sector_profit = fortune.groupby('sector').agg({'profit':'mean'})
sector_profit = sector_profit.sort_values(by="profit").round()
sector_profit

In [None]:
# Let's redo the chart; the data is already aggregated, so a bar chart fits best
px.bar(y=sector_profit.index, x=sector_profit.profit)

## <a id='toc4_4_'></a>[~~Pie charts~~](#toc0_)

~~Show proportions~~

In [None]:
# Sector
px.pie(values=fortune.sector.value_counts().values, names=fortune.sector.value_counts().index)

In [None]:
# Check revenue per sector
sector_revenue = fortune.groupby('sector')['revenue'].sum()
px.pie(values=sector_revenue.values, names=sector_revenue.index)

## <a id='toc4_5_'></a>[Treemap](#toc0_)

A treemap is a bit harder to build in matplotlib than a pie chart, but it is much easier to read. Like a pie chart, it is used to show proportions.

In [None]:
fig = px.treemap(fortune, path=[px.Constant("all"), 'sector'], values='revenue')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
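
The proportions a treemap encodes as areas can also be computed directly, which is a quick way to double-check what the chart seems to say. The sketch below uses a hypothetical four-row table with invented `sector` and `revenue` values.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "sector": ["Tech", "Tech", "Energy", "Retail"],
    "revenue": [50.0, 30.0, 15.0, 5.0],
})

# Share of total revenue per sector, largest first
sector_share = (
    toy.groupby("sector")["revenue"].sum()
    .div(toy["revenue"].sum())
    .sort_values(ascending=False)
)

print(sector_share)
```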

## <a id='toc4_6_'></a>[Scatter plot](#toc0_)

As we've seen in the field relationships lesson, we can use a scatterplot to visualize the relationship between 2 numerical continuous variables:

In [None]:
px.scatter(x=fortune.profit, y=fortune.revenue)

In [None]:
# Let's get some axis titles and labels
fig = px.scatter(x=fortune.profit, y=fortune.revenue)
fig.update_xaxes(title="profit")
fig.update_yaxes(title="revenue")
fig.show()

## <a id='toc4_7_'></a>[Line plot](#toc0_)

Line plots are particularly useful when dealing with trends over time:

In [None]:
# Run this before the next cell
# !pip install yfinance

In [None]:
# Don't worry about this too much, I'm just interested in getting a time series
import yfinance as yf
import datetime as dt
amazon = yf.Ticker('AMZN')
amazon_data = amazon.history(start='2022-01-01', end=dt.date.today())
amazon_data.head()

In [None]:
fig = px.line(x=amazon_data.index, y=amazon_data.Close)
fig.update_xaxes(title="Date")
fig.update_yaxes(title="Close")

In [None]:
# Don't use a line plot for data with no natural order (use a scatter plot instead)!
px.line(x=fortune.profit, y=fortune.revenue)
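
A line trace joins the points in the order in which the rows appear, which is why it works for a time series and turns into a scribble for unordered data. One rough way to check this before plotting is sketched below, using made-up values rather than the notebook's datasets.

```python
import pandas as pd

# Made-up data: x has no natural order, much like profit in a company table
unordered = pd.DataFrame({"x": [3, 1, 4, 2], "y": [30, 10, 40, 20]})

# A line drawn on this frame would jump 3 -> 1 -> 4 -> 2 along the x-axis
x_is_ordered = unordered["x"].is_monotonic_increasing

# A date range, like the index of a price history, is ordered by construction
dates = pd.date_range("2022-01-03", periods=4, freq="D")
dates_are_ordered = dates.is_monotonic_increasing

print("x ordered:", x_is_ordered)
print("dates ordered:", dates_are_ordered)
```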

## <a id='toc4_8_'></a>[💡 Check for understanding](#toc0_)

You will still be working with the Fortune 1000 dataset, like last time, but this time you will visualize the results!

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
fortune.sample(10)

**Questions**
- Show the number of different companies per state with an appropriate chart. 
- Show the relative proportion of states in the dataset with an appropriate chart.
- Can you do the same to check the overall revenue per state?
- Check the distribution of market cap for the companies in the dataset. What do you see?
- Are there many outlier companies when looking at market cap? Choose an appropriate graph to show this.
- Display the number of companies per sector for the top 10% of companies. Do the same for the bottom 10%.
- Check how market cap changes in relationship to the profit with an appropriate plot. What do you see?
- Lastly, have a look at the sectors that have female CEOs. Which one is the most prevalent?
- What new information did we get through EDA compared to last time? Feel free to look for other things as well 😉

*Notes:*
- You might need to remove NaNs from some of the columns!
- You might need to convert some data types!
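
As a hint for the two notes above, here is a minimal sketch of a common cleaning pattern. The `name` and `value` columns are hypothetical and the values are invented; which columns of the real dataset need this treatment is for you to find out.

```python
import pandas as pd

# Hypothetical raw data: a numeric column that was read in as text
raw = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "value": ["1200.5", "n/a", "300", None],
})

# Convert text to numbers; anything that cannot be parsed becomes NaN
raw["value"] = pd.to_numeric(raw["value"], errors="coerce")

# Drop only the rows where the column we want to plot is missing
clean = raw.dropna(subset=["value"])

print(clean)
```

Dropping rows per column (via `subset`) rather than across the whole table keeps as much data as possible for the other plots.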

## <a id='toc4_9_'></a>[Common mistakes!](#toc0_)

In [None]:
import plotly.express as px
customer_data = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv")

### <a id='toc4_9_1_'></a>[Plotting without understanding what you're plotting (e.g. `customer_id`)](#toc0_)

In [None]:
px.bar(customer_data, x='Customer')

This graph doesn't give us any meaningful information. It simply draws one bar per customer identifier, which is far too cluttered to read and says nothing about the customers themselves.

### <a id='toc4_9_2_'></a>[Doing barplots for numerical continuous data](#toc0_)

In [None]:
px.bar(customer_data, y='Income')

### <a id='toc4_9_3_'></a>[Doing boxplots for numerical discrete data](#toc0_)

In [None]:
px.box(customer_data, x='Number of Open Complaints')

### <a id='toc4_9_4_'></a>[Considering histograms on numerical discrete data as normal distributions](#toc0_)

In [None]:
px.histogram(customer_data, x='Number of Open Complaints')
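
For discrete counts, a frequency table is often more honest than eyeballing the histogram's shape. The sketch below uses made-up complaint counts (not the values from `customer_data`) to show how concentrated and skewed this kind of variable can be.

```python
import pandas as pd

# Made-up count data, e.g. number of complaints per customer
complaints = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 3])

# How often each distinct value occurs, in absolute and relative terms
frequencies = complaints.value_counts().sort_index()
shares = complaints.value_counts(normalize=True).sort_index()

# A clearly positive skewness is a sign the data is not bell-shaped
skewness = complaints.skew()

print(frequencies)
print(shares)
print(f"Skewness: {skewness:.2f}")
```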

### <a id='toc4_9_5_'></a>[Creating incorrectly formatted plots](#toc0_)

In [None]:
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')

In [None]:
# The dates aren't properly formatted!
customer_data["Effective To Date"] = pd.to_datetime(customer_data["Effective To Date"])
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')

### <a id='toc4_9_5_'></a>[Creating noisy plots](#toc0_)

In [None]:
customer_data["Effective To Date"] = pd.to_datetime(customer_data["Effective To Date"])
px.box(customer_data, x='Effective To Date', y='Total Claim Amount')

In [None]:
# The chart above is still not very useful for understanding trends over time, so aggregate per date first
pivot = customer_data.groupby('Effective To Date')['Total Claim Amount'].sum().reset_index()
px.line(pivot, x='Effective To Date', y='Total Claim Amount')
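
If daily totals still look noisy, aggregating to a coarser time step is one option. The sketch below is an assumption-laden illustration on invented daily amounts: it uses pandas `resample("W")`, which groups by weeks ending on Sunday, and is not a step taken in the original analysis.

```python
import pandas as pd

# Made-up daily amounts over two full weeks (2011-01-03 is a Monday)
daily = pd.DataFrame({
    "date": pd.date_range("2011-01-03", periods=14, freq="D"),
    "amount": range(1, 15),
})

# Sum the daily amounts per calendar week
weekly = daily.set_index("date")["amount"].resample("W").sum()

print(weekly)
```

The trade-off is the usual one: coarser aggregation gives a smoother line but hides short-lived spikes, so it is worth looking at both levels.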