<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Exploratory-Data-Analysis-(EDA)" data-toc-modified-id="Introduction-to-Exploratory-Data-Analysis-(EDA)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Exploratory Data Analysis (EDA)</a></span><ul class="toc-item"><li><span><a href="#Data-Types" data-toc-modified-id="Data-Types-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Types</a></span><ul class="toc-item"><li><span><a href="#Numerical-or-Quantitative-Data" data-toc-modified-id="Numerical-or-Quantitative-Data-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Numerical or Quantitative Data</a></span></li><li><span><a href="#Categorical-or-Qualitative-Data" data-toc-modified-id="Categorical-or-Qualitative-Data-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Categorical or Qualitative Data</a></span></li></ul></li><li><span><a href="#Important-note" data-toc-modified-id="Important-note-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Important note</a></span></li><li><span><a href="#Common-EDA-methods" data-toc-modified-id="Common-EDA-methods-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Common EDA methods</a></span><ul class="toc-item"><li><span><a href="#Summary-Statistics" data-toc-modified-id="Summary-Statistics-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Summary Statistics</a></span></li><li><span><a href="#Visualizations" data-toc-modified-id="Visualizations-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Visualizations</a></span><ul class="toc-item"><li><span><a href="#For-numerical-continuous-variables" data-toc-modified-id="For-numerical-continuous-variables-1.3.2.1"><span class="toc-item-num">1.3.2.1&nbsp;&nbsp;</span>For numerical continuous variables</a></span></li><li><span><a href="#For-categorical-or-discrete-variables" data-toc-modified-id="For-categorical-or-discrete-variables-1.3.2.2"><span class="toc-item-num">1.3.2.2&nbsp;&nbsp;</span>For 
categorical or discrete variables</a></span></li></ul></li><li><span><a href="#Relationship-between-variables" data-toc-modified-id="Relationship-between-variables-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Relationship between variables</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Summary</a></span></li></ul></li></ul></div>

# Introduction to Exploratory Data Analysis (EDA)

Today, our focus is on introducing the fundamental concepts of EDA to enable you to apply basic exploratory techniques in this quest, as well as in your Data Wrangling project. It will serve as a foundation for deriving insights from your clean data, setting the stage for more advanced analysis as we progress through the bootcamp.

Exploratory Data Analysis (EDA) is an essential part of data analysis, allowing you to summarize, visualize, and understand the underlying patterns within the data. It's used to:

- Summarize the main aspects of the data
- Understand the structure and relationships within the data
- Detect outliers and anomalies
- Test underlying assumptions
- Develop insights that can inform further analysis


## Data Types

**Data Types: Numerical vs. Categorical**

Data can be broadly classified into different types, and the techniques we apply for analysis depend on the data type at hand. Let's explore the various data types and their corresponding methods for analysis.

<p align="center">
  <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/data_types.png?raw=true"  alt="data-types" width="800">
</p>

<p align="center">
    Source: <a href="https://quizlet.com/189358246/types-of-data-diagram/">Quizlet</a>
</p>



### Numerical or Quantitative Data

Numerical or quantitative data consists of values that **can be measured or counted**. When working with numerical data, **we focus on understanding the central tendency (measures of centrality) and the variability (measures of dispersion)** within the dataset.


### Categorical or Qualitative Data

Categorical or qualitative data represents variables that are **divided into distinct categories or groups**. When working with categorical data, **we focus on understanding the frequency counts and proportions within each category**.


## Important note

1. In some cases, a **numerical or quantitative variable may represent a categorical or qualitative variable**. For example, if a column uses numbers as labels, such as "0" for "male" and "1" for "female," it should be treated as a categorical variable rather than a true numerical variable. Always consider the context and meaning of the data when determining the appropriate data type. *That is why, in the image above, the "Categorical - made of words" box has a balloon that adds "Or numbers (if they represent categories)".*

2. If the data is not clean (for example, missing values written as placeholder text, or numbers with inconsistent formatting), pandas may read a numerical column as `object`, and we might wrongly assume it is categorical.
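
Both situations can be sketched with a small example. The `toy` DataFrame, its column names, and its values are hypothetical and only serve to illustrate the idea; the cleaning steps your own data needs may differ.

```python
import pandas as pd

# Hypothetical toy data: 'gender_code' uses numbers as labels, and
# 'salary' contains a badly formatted entry and a missing value.
toy = pd.DataFrame({
    "gender_code": [0, 1, 1, 0],
    "salary": ["50000", "62,000", None, "48000"],
})

# gender_code is inferred as an integer column, while salary is read as text
# (shown as `object` in most pandas versions) because of its formatting.
print(toy.dtypes)

# Numbers that stand for labels: map them to names and store as a category.
toy["gender"] = toy["gender_code"].map({0: "male", 1: "female"}).astype("category")

# Numbers stored as text: strip the thousands separator, then convert.
# errors="coerce" turns anything that still cannot be parsed into NaN.
toy["salary_clean"] = pd.to_numeric(
    toy["salary"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy.dtypes)
```

After these steps, `gender` is treated as categorical and `salary_clean` as numerical, which matches what the values actually mean.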

## Common EDA methods 

- **Summary Statistics:** 
     - For numerical variables, this includes the mean, median, standard deviation, minimum, and maximum values.
     - For categorical variables, this might include the counts or frequencies of each category.
- **Visualizations:** Graphs and plots to visually represent the data, such as:
    - For numerical variables, histograms are used to show distribution and box plots to visualize spread and skewness.
    - For categorical variables, bar charts are used to represent the frequency or count of categories, and pie charts to show the proportion of categories.
- **Relationship Between Variables:** Understanding the relationships between different types of variables:
    - Between categorical variables: stacked bar charts can be used to compare the counts of one variable's categories within each category of another. The chi-square test and Cramér's V coefficient will be covered later in the course.
    - Between numerical variables: Pearson's correlation coefficient quantifies the linear relationship between two numerical variables. Scatter plots show the relationship visually, and heatmaps display correlation matrices.
    - Between numerical and categorical variables: box plots grouped by category, or violin plots, help visualize the distribution of a numerical variable across the categories of a categorical variable.
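
As an illustration of the categorical-versus-categorical case, the sketch below builds a contingency table with `pd.crosstab` and draws it as a stacked bar chart. The rows are made up for the example and are not taken from the lesson's dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset)
df = pd.DataFrame({
    "experience_level": ["Senior", "Senior", "Mid", "Mid", "Entry", "Senior"],
    "company_size": ["Medium", "Large", "Medium", "Medium", "Small", "Medium"],
})

# Count how often each combination of the two categories occurs
table = pd.crosstab(df["experience_level"], df["company_size"])
print(table)

# One bar per experience level, split into segments by company size.
# In a notebook the figure is displayed when the cell finishes.
ax = table.plot(kind="bar", stacked=True)
ax.set_ylabel("count")
```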


Let's look at an example using the following dataset:

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/data_science_salaries.csv")

# Check the data
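
One possible way to carry out this first check is sketched below. To keep the snippet runnable offline, it builds a tiny hypothetical `df` with made-up values instead of downloading the file; in the notebook, the same calls would be applied to the `df` loaded with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real file so the snippet runs offline
df = pd.DataFrame({
    "work_year": [2022, 2023, 2023],
    "experience_level": ["Senior", "Mid", None],
    "salary_in_usd": [150000, 95000, 120000],
})

print(df.head())        # first rows
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # type pandas inferred for each column
print(df.isna().sum())  # missing values per column
```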

The dataset `data_science_salaries.csv` contains information about the salaries of various positions related to data science. Here's a summary of the columns:

1. **work_year**: The year in which the salary was reported.
2. **experience_level**: The experience level of the employee (e.g., Senior, Mid).
3. **employment_type**: The type of employment (e.g., Full Time, Contract).
4. **job_title**: The specific job title (e.g., Principal Data Scientist, ML Engineer).
5. **salary**: The salary amount in the original currency.
6. **salary_currency**: The currency in which the salary is paid.
7. **salary_in_usd**: The salary amount converted to USD.
8. **employee_residence**: The country code of the employee's residence.
9. **remote_ratio**: The percentage of remote work.
10. **company_location**: The country code of the company's location.
11. **company_size**: The size of the company (e.g., Large, Medium, Small).

Let's proceed with the EDA by exploring the following:

1. Summary statistics of numerical columns.
2. Distribution of experience levels, employment types, and company sizes.
3. Salary distribution based on experience levels.
4. Correlation between numerical features.



### Summary Statistics

If needed, you can review the prework materials for an explanation of the mean, median, mode, standard deviation, range, etc.

In [None]:
# Summary for numerical variables

- **work_year**: The dataset contains data from the years 2020 to 2023.
- **salary**: The salary amounts range from a minimum of 6,000 (in original currency) to a maximum of 30,400,000, with a mean of approximately 190,695. The large range and standard deviation may suggest outliers or different currencies being mixed.
- **salary_in_usd**: The salaries converted to USD range from 5,132 to 450,000, with a mean of approximately 137,570.
- **remote_ratio**: The remote work ratio ranges from 0% to 100%, with a mean of 46.27%. This suggests a mix of entirely on-site, entirely remote, and hybrid working arrangements.
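
Statistics like these are typically obtained with `DataFrame.describe()`. The sketch below runs it on a small frame of made-up values (not the real dataset), so the printed numbers will not match the ones discussed above.

```python
import pandas as pd

# Made-up values for illustration only (not the real dataset)
df = pd.DataFrame({
    "work_year": [2020, 2021, 2022, 2023],
    "salary": [6000, 85000, 120000, 30400000],
    "salary_in_usd": [5132, 95000, 120000, 450000],
    "remote_ratio": [0, 50, 100, 100],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = df.describe()
print(summary)

# A mean far above the median (the "50%" row) hints at a right-skewed
# column or at a few extreme values pulling the mean upward.
print(summary.loc[["mean", "50%"], "salary"])
```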

In [None]:
# Summary for categorical (or object to be exact) variables 

Most employees are classified as "Senior," working "Full Time," and primarily in "Medium" sized companies. The "USD" currency and the "US" location dominate the dataset. The diversity in job titles (93 unique) indicates a wide range of roles within the field.
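
A minimal sketch of how such a summary could be produced is shown below, again on made-up rows rather than the real dataset. Describing only the non-numeric columns reports the count, the number of unique values, the most frequent value, and its frequency; `value_counts` then gives the counts or proportions for a single column.

```python
import pandas as pd

# Made-up rows for illustration only (not the real dataset)
df = pd.DataFrame({
    "experience_level": ["Senior", "Senior", "Mid", "Entry"],
    "company_size": ["Medium", "Medium", "Large", "Medium"],
    "salary_in_usd": [150000, 140000, 95000, 60000],
})

# count, unique, top (most frequent value), and freq for non-numeric columns
print(df.select_dtypes(exclude="number").describe())

# Frequency counts and proportions for one categorical column
counts = df["experience_level"].value_counts()
shares = df["experience_level"].value_counts(normalize=True)
print(counts)
print(shares)
```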

In [None]:
# Summary for categorical variable 'category'

In [None]:
# Summary for categorical variables

### Visualizations

#### For numerical continuous variables

The histogram will provide insights into the underlying distribution of the salaries in USD, including aspects like the central tendency, spread, skewness, and the presence of any outliers.

In [None]:
# For continuous variables: Histogram
import seaborn as sns
import plotly.express as px


The histogram represents the distribution of salaries in USD within the dataset, with the following insights:

1. **Central Tendency**: Most of the salaries appear to be concentrated around the lower to middle range, with fewer occurrences in the higher salary range.
2. **Spread**: The spread of the data is wide, with salaries ranging from very low to very high values.
3. **Skewness**: The distribution appears to be right-skewed, meaning that the tail on the right side is longer or fatter than the left side. This indicates that there are a few individuals with exceptionally high salaries compared to the majority.
4. **Outliers**: The long tail on the right side may also suggest the presence of outliers or exceptionally high salary values that are not typical of the general distribution.
5. **Binning**: The use of 100 bins provides a detailed view of the data but may also lead to some noise, especially if the dataset has a wide range of values.

This distribution could reflect the variations in salaries based on factors such as experience, job role, industry, geography, and other variables. 
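
The cell above imports seaborn and plotly; the sketch below instead uses plain matplotlib, simply to keep the example small, and draws a histogram of a few made-up salaries. With so few values it uses 5 bins rather than the 100 mentioned above. The `skew()` check is an optional numerical complement to reading skewness off the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up salaries with a couple of very high values (not the real dataset)
salaries = pd.Series(
    [40000, 55000, 60000, 62000, 70000, 75000, 80000, 90000, 150000, 400000],
    name="salary_in_usd",
)

# Histogram: more bins show more detail but also more noise.
# In a notebook the figure is displayed when the cell finishes.
fig, ax = plt.subplots()
ax.hist(salaries, bins=5)
ax.set_xlabel("salary_in_usd")
ax.set_ylabel("count")

# A positive skew value means the right tail is longer than the left one;
# in that case the mean is usually pulled above the median.
print(salaries.skew())
print(salaries.mean(), salaries.median())
```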

#### For categorical or discrete variables

Next, we'll visualize the distribution of experience levels, employment types, and company sizes to understand the composition of the dataset.

We've already looked at these variables numerically; now we'll do so visually.

In [None]:
# Bar chart of experience level

In [None]:
# Bar chart of employment type

In [None]:
# bar chart of company size

- **Experience Levels**: The dataset has a higher number of entries for "Mid" and "Senior" levels, with fewer entries for "Junior" and "Lead" levels.
- **Employment Types**: Most of the entries are for "Full Time" positions, followed by "Contract" and a smaller number of "Part Time" positions.
- **Company Sizes**: The distribution of company sizes shows a higher number of entries for "Small" and "Medium" companies, with fewer entries for "Large" companies.
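
A bar chart of a categorical column can be drawn directly from its frequency counts. The sketch below does this for `employment_type` using made-up rows (not the real dataset); the same pattern would apply to the other two columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only (not the real dataset)
df = pd.DataFrame({
    "employment_type": ["Full Time"] * 5 + ["Contract"] * 2 + ["Part Time"],
})

# value_counts() sorts categories from most to least frequent
counts = df["employment_type"].value_counts()
print(counts)

# One bar per category; in a notebook the figure is displayed
# when the cell finishes.
ax = counts.plot(kind="bar")
ax.set_xlabel("employment_type")
ax.set_ylabel("count")
```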

### Relationship between variables

We'll create a scatter plot to visualize the relationship between the original salary and the salary in USD. 

In [None]:
# Scatterplot for numerical variables

The scatter plot does not reveal a clear linear relationship between the original salary and the salary in USD. The spread of data points is wide, and there is no discernible pattern.

Next, we'll calculate the correlation matrix for the numerical features to quantify this relationship and to identify any others.


In [None]:
# Calculate the correlation matrix for numerical variables

In [None]:
# Plot a heatmap for numerical variables

The heatmap illustrates the correlation between numerical features:

- **remote_ratio**: This feature does not show a notable correlation with either salary or salary_in_usd, indicating that the remote work ratio may not have a strong direct impact on salary amounts.
- **salary** and **salary_in_usd**: The correlation coefficient is approximately −0.02, indicating a very weak negative correlation between the two variables. A likely explanation is that `salary` is recorded in each employee's original currency, so its raw amounts are not comparable across rows, whereas `salary_in_usd` puts every salary on the same scale.
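
The sketch below shows one way to compute a correlation matrix and draw it as a heatmap, and illustrates how mixing currencies in `salary` can hide its relationship with `salary_in_usd`. The rows are made up for the example (not the real dataset), and the heatmap uses plain matplotlib rather than a dedicated plotting library.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration (not the real dataset): 'salary' mixes currencies
df = pd.DataFrame({
    "salary": [100000, 120000, 150000, 9000000, 7000000, 60000],
    "salary_currency": ["USD", "USD", "USD", "INR", "JPY", "EUR"],
    "salary_in_usd": [100000, 120000, 150000, 110000, 50000, 65000],
    "remote_ratio": [0, 100, 50, 100, 0, 50],
})

# Pearson correlation between every pair of numeric columns
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# Restricting to a single currency makes the two salary columns comparable
usd_only = df[df["salary_currency"] == "USD"]
usd_corr = usd_only["salary"].corr(usd_only["salary_in_usd"])
print(usd_corr)

# Heatmap of the correlation matrix; in a notebook the figure is
# displayed when the cell finishes.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
```

In this toy data, the correlation across all rows is much weaker than the correlation within the USD rows alone, which is the pattern to look for when a column mixes units.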

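```python
# If we had many numerical continuous variables, we could draw one scatter plot
# per variable against 'salary' using a for loop
import matplotlib.pyplot as plt

# Keep only the numeric columns (assumes `df` is the DataFrame loaded earlier)
df_numerical = df.select_dtypes(include="number")

for col in df_numerical.columns:
    if col == "salary":
        continue  # skip plotting salary against itself
    df_numerical.plot(kind="scatter", x=col, y="salary", title=f"{col} vs. salary")

plt.show()
```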

Next, let's visualize the salary distribution in USD based on experience levels to see how salaries vary with experience.

In [None]:
# Box plot for numerical and categorical variable

The boxplot illustrates the salary distribution in USD based on experience levels:

- **Entry**: Salaries for entry positions have the lowest range and median.
- **Mid**: Mid-level positions show a higher range and median salary compared to entry positions.
- **Senior**: Senior positions have a more extensive range and higher median salary.
- **Expert**: Expert positions have the highest range and median salary, reflecting the higher responsibility and expertise required for these roles.

Some outliers are present in the data, particularly for the "Senior" and "Mid" experience levels. These could be due to factors like industry, location, or specific skill sets.
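
A grouped box plot of this kind can be sketched with pandas alone, as below. The rows are made up for the example (not the real dataset), and the group medians are printed as a numerical complement to the plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only (not the real dataset)
df = pd.DataFrame({
    "experience_level": ["Entry", "Entry", "Mid", "Mid",
                         "Senior", "Senior", "Expert", "Expert"],
    "salary_in_usd": [50000, 60000, 80000, 95000,
                      130000, 150000, 180000, 220000],
})

# Median salary per experience level, from lowest to highest
medians = df.groupby("experience_level")["salary_in_usd"].median().sort_values()
print(medians)

# One box per experience level (groups are ordered alphabetically by default).
# In a notebook the figure is displayed when the cell finishes.
ax = df.boxplot(column="salary_in_usd", by="experience_level")
ax.set_ylabel("salary_in_usd")
```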

## Summary

In this lesson, you have been introduced to the concept of Exploratory Data Analysis (EDA), highlighting its crucial role in understanding the underlying structure of data, identifying outliers, detecting anomalies, testing hypotheses, and checking assumptions. 

- **Data Types**
    - **Numerical or Quantitative Data**: Measurable or countable values, analyzed using measures of centrality and dispersion or visualization techniques.
    - **Categorical or Qualitative Data**: Distinct categories or groups, analyzed using frequency counts and proportions or visualization techniques.

    - **Important Note**
        - Context and meaning of data is vital in determining the data type.
        - Clean data is crucial to correct type determination.
- **Common EDA Methods**
    - **Summary Statistics**: Mean, median, standard deviation for numerical; counts or frequencies for categorical.
    - **Visualizations**: Histograms, box plots for numerical; bar charts, pie charts for categorical.
    - **Relationship Between Variables**: Scatter plots and correlation coefficients for numerical; stacked bar charts for categorical; grouped box plots or violin plots for numerical versus categorical.
