# **Exploratory Data Analysis in Python**

# Course Description

So you’ve got some interesting data - where do you begin your analysis? This course will cover the process of exploring and analyzing data, from understanding what’s included in a dataset to incorporating exploration findings into a data science workflow.


Using data on unemployment figures and plane ticket prices, you’ll leverage Python to summarize and validate data, count, identify, and replace missing values, and clean both numerical and categorical values. Throughout the course, you’ll create beautiful Seaborn visualizations to understand variables and their relationships.


For example, you’ll examine how alcohol use and student performance are related. Finally, the course will show how exploratory findings feed into data science workflows by creating new features, balancing categorical features, and generating hypotheses from findings.


By the end of this course, you’ll have the confidence to perform your own exploratory data analysis (EDA) in Python. You’ll be able to explain your findings visually to others and suggest the next steps for gathering insights from your data!

# **Chapter 1: Getting to Know a Dataset**
What's the best way to approach a new dataset? Learn to validate and summarize categorical and numerical data and create Seaborn visualizations to communicate your findings.

# **Initial exploration**

# Functions for initial exploration
You are researching unemployment rates worldwide and have been given a new dataset to work with. The data has been saved and loaded for you as a pandas DataFrame called unemployment. You've never seen the data before, so your first task is to use a few pandas functions to learn about this new data.

pandas has been imported for you as pd.

Instructions 1/3

## 1
Use a pandas function to print the first five rows of the unemployment DataFrame.


## 2
Use a pandas function to print a summary of column non-missing values and data types from the unemployment DataFrame.

## 3
Print the summary statistics (count, mean, standard deviation, min, max, and quartile values) of each numerical column in unemployment.


In [None]:
# Print the first five rows of unemployment
print(unemployment.head())

In [None]:
<script.py> output:
      country_code          country_name      continent   2010   2011  ...   2017   2018   2019   2020   2021
    0          AFG           Afghanistan           Asia  11.35  11.05  ...  11.18  11.15  11.22  11.71  13.28
    1          AGO                Angola         Africa   9.43   7.36  ...   7.41   7.42   7.42   8.33   8.53
    2          ALB               Albania         Europe  14.09  13.48  ...  13.62  12.30  11.47  13.33  11.82
    3          ARE  United Arab Emirates           Asia   2.48   2.30  ...   2.46   2.35   2.23   3.19   3.36
    4          ARG             Argentina  South America   7.71   7.18  ...   8.35   9.22   9.84  11.46  10.90

    [5 rows x 15 columns]

In [None]:
# Print a summary of non-missing values and data types in the unemployment DataFrame
print(unemployment.info())

In [None]:
<script.py> output:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 182 entries, 0 to 181
    Data columns (total 15 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   country_code  182 non-null    object
     1   country_name  182 non-null    object
     2   continent     177 non-null    object
     3   2010          182 non-null    float64
     4   2011          182 non-null    float64
     5   2012          182 non-null    float64
     6   2013          182 non-null    float64
     7   2014          182 non-null    float64
     8   2015          182 non-null    float64
     9   2016          182 non-null    float64
     10  2017          182 non-null    float64
     11  2018          182 non-null    float64
     12  2019          182 non-null    float64
     13  2020          182 non-null    float64
     14  2021          182 non-null    float64
    dtypes: float64(12), object(3)
    memory usage: 21.5+ KB

In [None]:
# Print summary statistics for numerical columns in unemployment
print(unemployment.describe())

In [None]:
<script.py> output:
              2010     2011     2012     2013     2014  ...     2017     2018     2019     2020     2021
    count  182.000  182.000  182.000  182.000  182.000  ...  182.000  182.000  182.000  182.000  182.000
    mean     8.409    8.315    8.318    8.345    8.180  ...    7.669    7.426    7.244    8.421    8.391
    std      6.249    6.267    6.367    6.416    6.284  ...    5.902    5.819    5.697    6.041    6.067
    min      0.450    0.320    0.480    0.250    0.200  ...    0.140    0.110    0.100    0.210    0.260
    25%      4.015    3.775    3.743    3.692    3.625  ...    3.690    3.625    3.487    4.285    4.335
    50%      6.965    6.805    6.690    6.395    6.450  ...    5.650    5.375    5.240    6.695    6.425
    75%     10.957   11.045   11.285   11.310   10.695  ...   10.315    9.258    9.445   11.155   10.840
    max     32.020   31.380   31.020   29.000   28.030  ...   27.040   26.910   28.470   29.220   33.560

    [8 rows x 12 columns]

Excellent work—you've now learned that unemployment contains 182 rows of country data including country_code, country_name, continent, and unemployment percentages from 2010 through 2021. If you looked very closely, you might have noticed that a few countries are missing information in the continent column! Let's continue exploring this data in the next exercise.
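
The missing continent entries mentioned above can also be counted directly instead of being read off the .info() summary. The following is a minimal sketch on made-up data: the toy DataFrame and its values are hypothetical and only mirror the structure of unemployment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mirrors the structure of unemployment
toy = pd.DataFrame({
    "country_code": ["AAA", "BBB", "CCC", "DDD"],
    "continent": ["Asia", np.nan, "Europe", np.nan],
    "2021": [5.2, 7.9, 3.1, 12.4],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Show only the rows where continent is missing
missing_continent = toy[toy["continent"].isna()]
print(missing_continent)
```

Applied to the real data, the same two calls would list the countries behind the 177 non-null continent count.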

# **Counting categorical values**
Recall from the previous exercise that the unemployment DataFrame contains 182 rows of country data including country_code, country_name, continent, and unemployment percentages from 2010 through 2021.

You'd now like to explore the categorical data contained in unemployment to understand the data that it contains related to each continent.

The unemployment DataFrame has been loaded for you along with pandas as pd.

## Instructions

Use a pandas function to count the values associated with each continent in the unemployment DataFrame.


In [None]:
# Count the values associated with each continent in unemployment
print(unemployment["continent"].value_counts())

In [None]:
<script.py> output:
    Africa           53
    Asia             47
    Europe           39
    North America    18
    South America    12
    Oceania           8
    Name: continent, dtype: int64

Well done! Did you know that there are 23 countries in North America, which includes countries in the Caribbean and Central America? You may have noticed that North America has 18 data points in the unemployment DataFrame, so we are missing information on a few of the countries from our dataset.
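
Note that the six counts above add up to 177 rather than 182, because .value_counts() leaves out missing values by default. A small sketch on a made-up Series (the values are hypothetical) shows how the dropna and normalize arguments change the result:

```python
import numpy as np
import pandas as pd

# Hypothetical continent labels, one of them missing
continents = pd.Series(
    ["Africa", "Asia", "Africa", np.nan, "Europe", "Africa"],
    name="continent",
)

# Default behavior: missing values are not counted
counts = continents.value_counts()

# Include missing values as their own category
counts_with_missing = continents.value_counts(dropna=False)

# Proportions instead of counts (missing values still excluded)
shares = continents.value_counts(normalize=True)

print(counts, counts_with_missing, shares, sep="\n\n")
```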

# Global unemployment in 2021
It's time to explore some of the numerical data in unemployment! What was typical unemployment in a given year? What was the minimum and maximum unemployment rate, and what did the distribution of the unemployment rates look like across the world? A histogram is a great way to get a sense of the answers to these questions.

Your task in this exercise is to create a histogram showing the distribution of global unemployment rates in 2021.

The unemployment DataFrame has been loaded for you along with pandas as pd.

## Instructions

Import the required visualization libraries.
Create a histogram of the distribution of 2021 unemployment percentages across all countries in unemployment; show a full percentage point in each bin.

In [None]:
# Import the required visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Create a histogram of 2021 unemployment; show a full percent in each bin
sns.histplot(data=unemployment, x="2021", binwidth=1)
plt.show()

Nice work—it looks like 2021 unemployment hovered around 3% to 8% for most countries in the dataset, but a few countries experienced very high unemployment of 20% to 35%.
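
To see what "a full percentage point in each bin" means numerically, here is a rough sketch that bins made-up rates with NumPy. The rates are hypothetical, and the bin edges are chosen by hand here, so they will not necessarily coincide with the edges Seaborn picks for binwidth=1.

```python
import numpy as np

# Hypothetical unemployment rates in percent
rates = np.array([2.3, 3.4, 3.9, 5.1, 5.5, 5.8, 7.2, 21.0])

# Bin edges one percentage point apart, covering the whole range
edges = np.arange(np.floor(rates.min()), np.ceil(rates.max()) + 1, 1.0)

# Count how many rates fall into each one-point-wide bin
counts, edges = np.histogram(rates, bins=edges)

for left, count in zip(edges[:-1], counts):
    if count > 0:
        print(f"{left:.0f}% to {left + 1:.0f}%: {count}")
```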

# **Data validation**

# Validating continents
Your colleague has informed you that the data on unemployment from countries in Oceania is not reliable, and you'd like to identify and exclude these countries from your unemployment data. The .isin() method can help with that!

Your task is to use .isin() to identify countries that are not in Oceania. These countries should return True while countries in Oceania should return False. This will set you up to use the results of .isin() to quickly filter out Oceania countries using Boolean indexing.

The unemployment DataFrame is available, and pandas has been imported as pd.

## Instructions 1/2

Define a Series of Booleans describing whether or not each continent is outside of Oceania; call this Series not_oceania.

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])


## Instructions 2/2

Use Boolean indexing to print the unemployment DataFrame without any of the data related to countries in Oceania.

In [None]:
# Define a Series describing whether each continent is outside of Oceania
not_oceania = ~unemployment["continent"].isin(["Oceania"])

# Print unemployment without records related to countries in Oceania
print(unemployment[not_oceania])

In [None]:
<script.py> output:
        country_code          country_name      continent   2010   2011  ...   2017   2018   2019   2020   2021
    0            AFG           Afghanistan           Asia  11.35  11.05  ...  11.18  11.15  11.22  11.71  13.28
    1            AGO                Angola         Africa   9.43   7.36  ...   7.41   7.42   7.42   8.33   8.53
    2            ALB               Albania         Europe  14.09  13.48  ...  13.62  12.30  11.47  13.33  11.82
    3            ARE  United Arab Emirates           Asia   2.48   2.30  ...   2.46   2.35   2.23   3.19   3.36
    4            ARG             Argentina  South America   7.71   7.18  ...   8.35   9.22   9.84  11.46  10.90
    ..           ...                   ...            ...    ...    ...  ...    ...    ...    ...    ...    ...
    175          VNM               Vietnam           Asia   1.11   1.00  ...   1.87   1.16   2.04   2.39   2.17
    178          YEM           Yemen, Rep.           Asia  12.83  13.23  ...  13.30  13.15  13.06  13.39  13.57
    179          ZAF          South Africa         Africa  24.68  24.64  ...  27.04  26.91  28.47  29.22  33.56
    180          ZMB                Zambia         Africa  13.19  10.55  ...  11.63  12.01  12.52  12.85  13.03
    181          ZWE              Zimbabwe         Africa   5.21   5.37  ...   4.78   4.80   4.83   5.35   5.17

    [174 rows x 15 columns]

Well done! You validated categorical data and used your .isin() validation to then exclude data that you weren't interested in! Filtering out data that you don't need at the start of your EDA process is a great way to organize yourself for the exploration yet to come.
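
One detail worth knowing: a missing continent is not "in" the list passed to .isin(), so negating the result keeps those rows. That is consistent with the output above, where 182 rows minus the 8 Oceania rows leaves 174. The sketch below uses a made-up DataFrame (hypothetical names and values) to show the behavior and a stricter filter that also requires a known continent.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one Oceania row and one missing continent
toy = pd.DataFrame({
    "country_name": ["A", "B", "C", "D"],
    "continent": ["Asia", "Oceania", np.nan, "Europe"],
})

# True for every row that is not labeled Oceania, including missing labels
not_oceania = ~toy["continent"].isin(["Oceania"])
kept = toy[not_oceania]

# Stricter variant: the continent must be known and must not be Oceania
known_not_oceania = toy[not_oceania & toy["continent"].notna()]

print(kept)
print(known_not_oceania)
```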

# Validating range
Now it's time to validate our numerical data. We saw in the previous lesson using .describe() that the largest unemployment rate during 2021 was nearly 34 percent, while the lowest was just above zero.

Your task in this exercise is to get much more detailed information about the range of unemployment data using Seaborn's boxplot, and you'll also visualize the range of unemployment rates in each continent to understand geographical range differences.

unemployment is available, and the following have been imported for you: Seaborn as sns, matplotlib.pyplot as plt, and pandas as pd.

## Instructions

Print the minimum and maximum unemployment rates, in that order, during 2021.
Create a boxplot of 2021 unemployment rates, broken down by continent.

In [None]:
# Print the minimum and maximum unemployment rates during 2021
print(unemployment["2021"].min(), unemployment["2021"].max())

# Create a boxplot of 2021 unemployment rates, broken down by continent
sns.boxplot(data=unemployment, x="2021", y="continent")
plt.show()

Nice work! Notice how different the ranges in unemployment are between continents. For example, Africa's 50th percentile is lower than that of North America, but the range is much wider.
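
The quartiles that a box plot draws can also be computed as numbers, which makes comparisons between groups easier to state precisely. This is a minimal sketch on made-up values (the toy data is hypothetical); it uses the default linear interpolation of .quantile().

```python
import pandas as pd

# Hypothetical 2021 rates for two continents
toy = pd.DataFrame({
    "continent": ["Africa"] * 4 + ["Europe"] * 4,
    "2021": [2.0, 6.0, 10.0, 30.0, 4.0, 5.0, 6.0, 7.0],
})

# 25th, 50th and 75th percentiles per continent, one column per quantile
quartiles = (
    toy.groupby("continent")["2021"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range: the width of the box in a box plot
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```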

# **Data summarization**
# Summaries with .groupby() and .agg()
In this exercise, you'll explore the means and standard deviations of the yearly unemployment data. First, you'll find means and standard deviations regardless of the continent to observe worldwide unemployment trends. Then, you'll check unemployment trends broken down by continent.

The unemployment DataFrame is available, and pandas has been imported as pd.

## Instructions 1/2

Print the mean and standard deviation of the unemployment rates for each year.

In [None]:
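# Print the mean and standard deviation of rates by year
# (select numeric columns first; recent pandas versions raise a TypeError on the text columns)
print(unemployment.select_dtypes("number").agg(["mean", "std"]))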

In [None]:
<script.py> output:
           2010   2011   2012   2013   2014  ...   2017   2018   2019   2020   2021
    mean  8.409  8.315  8.318  8.345  8.180  ...  7.669  7.426  7.244  8.421  8.391
    std   6.249  6.267  6.367  6.416  6.284  ...  5.902  5.819  5.697  6.041  6.067

    [2 rows x 12 columns]

## Instructions 2/2
Print the mean and standard deviation of the unemployment rates for each year, grouped by continent.

In [None]:
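# Print yearly mean and standard deviation grouped by continent
# (drop the text columns first; recent pandas versions raise a TypeError on them)
print(unemployment.drop(columns=["country_code", "country_name"]).groupby("continent").agg(["mean", "std"]))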

In [None]:
<script.py> output:
                     2010           2011           2012  ...   2019    2020           2021
                     mean    std    mean    std    mean  ...    std    mean    std    mean    std
    continent                                            ...
    Africa          9.344  7.411   9.369  7.402   9.241  ...  7.455  10.308  7.928  10.474  8.132
    Asia            6.241  5.146   5.942  4.780   5.835  ...  5.254   7.012  5.700   6.906  5.415
    Europe         11.008  6.392  10.948  6.540  11.326  ...  4.125   7.471  4.071   7.415  3.948
    North America   8.663  5.116   8.563  5.377   8.449  ...  4.770   9.298  4.963   9.155  5.076
    Oceania         3.623  2.055   3.647  2.008   4.104  ...  2.369   4.274  2.617   4.280  2.672
    South America   6.871  2.807   6.518  2.802   6.411  ...  3.380  10.275  3.411   9.924  3.612

    [6 rows x 24 columns]

Nicely done! This data is well-summarized, but it's a little long. What if you wanted to focus on a summary for just one year and make it more readable? Give it a go in the next exercise!
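
One way to narrow a summary like this is to index into its two-level column labels. The sketch below is an illustration on made-up data (the toy values are hypothetical), showing how a single year or a single statistic can be pulled out of the result.

```python
import pandas as pd

# Hypothetical rates for two years and two continents
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "2020": [4.0, 6.0, 7.0, 9.0],
    "2021": [5.0, 7.0, 6.0, 10.0],
})

# Columns become (year, statistic) pairs
summary = toy.groupby("continent").agg(["mean", "std"])

# All statistics for a single year
summary_2021 = summary["2021"]

# A single cell: mean 2021 rate for Asia
asia_mean_2021 = summary.loc["Asia", ("2021", "mean")]

print(summary_2021)
print(asia_mean_2021)
```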

# **Named aggregations**
You've seen how .groupby() and .agg() can be combined to show summaries across categories. Sometimes, it's helpful to name new columns when aggregating so that it's clear in the code output what aggregations are being applied and where.

Your task is to create a DataFrame called continent_summary which shows a row for each continent. The DataFrame columns will contain the mean unemployment rate for each continent in 2021 as well as the standard deviation of the 2021 unemployment rate. And of course, you'll rename the columns so that their contents are clear!

The unemployment DataFrame is available, and pandas has been imported as pd.

## Instructions

Create a column called mean_rate_2021 which shows the mean 2021 unemployment rate for each continent.
Create a column called std_rate_2021 which shows the standard deviation of the 2021 unemployment rate for each continent.

In [None]:
continent_summary = unemployment.groupby("continent").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=("2021", "mean"),
    # Create the std_rate_2021 column
    std_rate_2021=("2021", "std")
)
print(continent_summary)

In [None]:
<script.py> output:
                   mean_rate_2021  std_rate_2021
    continent
    Africa                 10.474          8.132
    Asia                    6.906          5.415
    Europe                  7.415          3.948
    North America           9.155          5.076
    Oceania                 4.280          2.672
    South America           9.924          3.612

Super summarizing! Average 2021 unemployment varied widely by continent, and so did the unemployment within those continents.

# Visualizing categorical summaries
As you've learned in this chapter, Seaborn has many great visualizations for exploration, including a bar plot for displaying an aggregated average value by category of data.

In Seaborn, bar plots include a vertical bar indicating the 95% confidence interval for the categorical mean. Since confidence intervals are calculated using both the number of values and the variability of those values, they give a helpful indication of how much the estimated mean can be relied upon.

Your task is to create a bar plot to visualize the means and confidence intervals of unemployment rates across the different continents.

unemployment is available, and the following have been imported for you: Seaborn as sns, matplotlib.pyplot as plt, and pandas as pd.

## Instructions

Create a bar plot showing continents on the x-axis and their respective average 2021 unemployment rates on the y-axis.

In [None]:
# Create a bar plot of continents and their 2021 average unemployment
sns.barplot(data=unemployment, x="continent", y="2021")
plt.show()

A perfect plot! Way to go. While Europe has higher average unemployment than Asia, it also has a smaller confidence interval for that average, so the average value is more reliable.
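
To make the link between sample size, variability, and interval width concrete, here is a rough sketch that computes an approximate 95% interval per group. It swaps in the normal approximation (mean plus or minus 1.96 standard errors) for simplicity; Seaborn, to my understanding, estimates its interval by bootstrapping by default, so its error bars will differ somewhat. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rates: Europe has more values and less spread than Oceania
toy = pd.DataFrame({
    "continent": ["Europe"] * 6 + ["Oceania"] * 3,
    "2021": [6.0, 7.0, 7.5, 8.0, 8.5, 9.0, 2.0, 4.0, 9.0],
})

stats = toy.groupby("continent")["2021"].agg(["mean", "std", "count"])

# Standard error of the mean shrinks with more data and less variability
stats["sem"] = stats["std"] / np.sqrt(stats["count"])

# Normal-approximation 95% interval (an assumption, not Seaborn's method)
stats["ci_low"] = stats["mean"] - 1.96 * stats["sem"]
stats["ci_high"] = stats["mean"] + 1.96 * stats["sem"]
stats["ci_width"] = stats["ci_high"] - stats["ci_low"]

print(stats)
```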

# **Chapter 2: Data Cleaning and Imputation**

Exploring and analyzing data often means dealing with missing values, incorrect data types, and outliers. In this chapter, you’ll learn techniques to handle these issues and streamline your EDA processes!

# **Addressing missing data**
# Dealing with missing data
It is important to deal with missing data before starting your analysis.

One approach is to drop missing values if they account for a small proportion, typically five percent or less, of your data.

Working with a dataset on plane ticket prices, stored as a pandas DataFrame called planes, you'll need to count the number of missing values across all columns, calculate five percent of all values, use this threshold to remove observations, and check how many missing values remain in the dataset.

## Instructions 1/3

Print the number of missing values in each column of the DataFrame.

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

In [None]:
<script.py> output:
    Airline            427
    Date_of_Journey    322
    Source             187
    Destination        347
    Route              256
    Dep_Time           260
    Arrival_Time       194
    Duration           214
    Total_Stops        212
    Additional_Info    589
    Price              616
    dtype: int64


## Instructions 2/3

Calculate how many observations five percent of the planes DataFrame is equal to.

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

In [None]:
<script.py> output:
    Airline            427
    Date_of_Journey    322
    Source             187
    Destination        347
    Route              256
    Dep_Time           260
    Arrival_Time       194
    Duration           214
    Total_Stops        212
    Additional_Info    589
    Price              616
    dtype: int64


## Instructions 3/3

Create cols_to_drop by applying boolean indexing to columns of the DataFrame with missing values less than or equal to the threshold.
Use this filter to remove missing values and save the updated DataFrame.

In [None]:
# Count the number of missing values in each column
print(planes.isna().sum())

# Find the five percent threshold
threshold = len(planes) * 0.05

# Create a filter
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop missing values for columns below the threshold
planes.dropna(subset=cols_to_drop, inplace=True)

print(planes.isna().sum())

In [None]:
<script.py> output:
    Airline            427
    Date_of_Journey    322
    Source             187
    Destination        347
    Route              256
    Dep_Time           260
    Arrival_Time       194
    Duration           214
    Total_Stops        212
    Additional_Info    589
    Price              616
    dtype: int64
    Airline              0
    Date_of_Journey      0
    Source               0
    Destination          0
    Route                0
    Dep_Time             0
    Arrival_Time         0
    Duration             0
    Total_Stops          0
    Additional_Info    300
    Price              368
    dtype: int64

Awesome! By creating a missing values threshold and using it to filter columns, you've managed to remove missing values from all columns except for "Additional_Info" and "Price".
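
The counts for "Additional_Info" and "Price" also fell, even though those columns were not in the subset. This happens because .dropna() removes whole rows, and some of those rows were missing values in other columns too. A minimal sketch on made-up data (40 hypothetical rows, so the five percent threshold is 2) reproduces the effect without modifying the DataFrame in place:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 rows with 1, 2 and 5 missing values per column
toy = pd.DataFrame({
    "Airline": [None] + ["X"] * 39,
    "Route": [None, None] + ["R"] * 38,
    "Price": [np.nan, 100.0, 100.0, 100.0, 100.0] + [np.nan] * 4 + [100.0] * 31,
})

# Five percent of the rows
threshold = len(toy) * 0.05

# Columns whose missing count is at or below the threshold
cols_to_drop = toy.columns[toy.isna().sum() <= threshold]

# Drop the rows that have a missing value in any of those columns
cleaned = toy.dropna(subset=cols_to_drop)

print(toy.isna().sum())
print(cleaned.isna().sum())
```

Row 0 is missing all three values, so dropping it for "Airline" and "Route" also lowers the missing count in "Price" from 5 to 4.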

# Strategies for remaining missing data
The five percent rule has worked nicely for your planes dataset, eliminating missing values from nine out of 11 columns!

Now, you need to decide what to do with the "Additional_Info" and "Price" columns, which are missing 300 and 368 values respectively.

You'll first take a look at what "Additional_Info" contains, then visualize the price of plane tickets by different airlines.

The following imports have been made for you:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
## Instructions 1/3

Print the values and frequencies of "Additional_Info".

In [None]:
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

In [None]:
<script.py> output:
    No info                         6399
    In-flight meal not included     1525
    No check-in baggage included     258
    1 Long layover                    14
    Change airports                    7
    No Info                            2
    Business class                     1
    Red-eye flight                     1
    2 Long layover                     1
    Name: Additional_Info, dtype: int64


## Instructions 2/3

Create a boxplot of "Price" by "Airline".

In [None]:
# Check the values of the Additional_Info column
print(planes["Additional_Info"].value_counts())

# Create a box plot of Price by Airline
sns.boxplot(data=planes, x="Airline", y="Price")

plt.show()
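
If prices differ noticeably between airlines, one common strategy for the remaining gaps is to fill each missing price with the median price of the same airline. The sketch below illustrates that idea on made-up data; it is an assumption about a reasonable next step, not necessarily the approach this exercise goes on to use, and the Price_filled column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one gap per airline
toy = pd.DataFrame({
    "Airline": ["A", "A", "A", "B", "B", "B"],
    "Price": [100.0, 300.0, np.nan, 1000.0, np.nan, 2000.0],
})

# Median price of each row's airline, aligned with the original rows
airline_median = toy.groupby("Airline")["Price"].transform("median")

# Fill only the missing prices with the matching airline median
toy["Price_filled"] = toy["Price"].fillna(airline_median)

print(toy)
```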

# **Converting and analyzing categorical data**

# Finding the number of unique values
You would like to practice some of the categorical data manipulation and analysis skills that you've just seen. To help identify which data could be reformatted to extract value, you are going to find out which non-numeric columns in the planes dataset have a large number of unique values.

pandas has been imported for you as pd, and the dataset has been stored as planes.

## Instructions

Filter planes for columns that are of "object" data type.
Loop through the columns in the dataset.
Add the column iterator to the print statement, then call the function to return the number of unique values in the column.

In [None]:
# Filter the DataFrame for object columns
non_numeric = planes.select_dtypes("object")

# Loop through columns
for col in non_numeric.columns:

  # Print the number of unique values
  print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())

In [None]:
<script.py> output:
    Number of unique values in Airline column:  8
    Number of unique values in Date_of_Journey column:  44
    Number of unique values in Source column:  5
    Number of unique values in Destination column:  6
    Number of unique values in Route column:  122
    Number of unique values in Dep_Time column:  218
    Number of unique values in Duration column:  362
    Number of unique values in Total_Stops column:  5
    Number of unique values in Additional_Info column:  9

Great looping! Interestingly, "Duration" is currently an object column whereas it should be a numeric column, and has 362 unique values! Let's find out more about this column.
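
The same counts can be obtained without an explicit loop, since .nunique() also works on a whole DataFrame. A short sketch on made-up data (hypothetical values); it selects columns with exclude="number" rather than "object", an assumption made so that the selection does not depend on how a given pandas version labels text columns:

```python
import pandas as pd

# Hypothetical data with two text columns and one numeric column
toy = pd.DataFrame({
    "Airline": ["A", "B", "A", "B"],
    "Duration": ["19h", "5h 25m", "4h 45m", "2h 25m"],
    "Price": [100.0, 200.0, 300.0, 400.0],
})

# Unique values per non-numeric column, largest first
unique_counts = (
    toy.select_dtypes(exclude="number")
    .nunique()
    .sort_values(ascending=False)
)
print(unique_counts)
```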

# Flight duration categories
As you saw, there are 362 unique values in the "Duration" column of planes. Calling planes["Duration"].head(), we see the following values:

    0        19h
    1     5h 25m
    2     4h 45m
    3     2h 25m
    4    15h 30m
    Name: Duration, dtype: object

Looks like this won't be simple to convert to numbers. However, you could categorize flights by duration and examine the frequency of different flight lengths!

You'll create a "Duration_Category" column in the planes DataFrame. Before you can do this you'll need to create a list of the values you would like to insert into the DataFrame, followed by the existing values that these should be created from.

## Instructions 1/2

Create a list of categories containing "Short-haul", "Medium", and "Long-haul".

In [None]:
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]


## Instructions 2/2

Create short_flights, a string to capture values of "0h", "1h", "2h", "3h", or "4h" taking care to avoid values such as "10h".
Create medium_flights to capture any values between five and nine hours.
Create long_flights to capture any values from 10 hours to 16 hours inclusive.

In [None]:
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Create short_flights
short_flights = "^0h|^1h|^2h|^3h|^4h"

# Create medium_flights
medium_flights = "^5h|^6h|^7h|^8h|^9h"

# Create long_flights
long_flights = "10h|11h|12h|13h|14h|15h|16h"

Nicely done! Now that you've created your categories and values, it's time to conditionally add the categories into the DataFrame.
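
The "^" in the short and medium patterns anchors the match to the start of the string, which is what keeps "10h" from being picked up by "0h". A small sketch on made-up durations (hypothetical values) compares an unanchored and an anchored pattern:

```python
import pandas as pd

# Hypothetical duration strings
durations = pd.Series(["0h 50m", "4h 45m", "10h", "14h 20m", "20h 5m"])

# Without the anchor, "0h" and "4h" are found anywhere in the string
unanchored = durations.str.contains("0h|4h")

# With the anchor, the string has to start with "0h" or "4h"
anchored = durations.str.contains("^0h|^4h")

print(pd.DataFrame({
    "duration": durations,
    "unanchored": unanchored,
    "anchored": anchored,
}))
```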

# Adding duration categories
Now that you've set up the categories and values you want to capture, it's time to build a new column to analyze the frequency of flights by duration!

The variables flight_categories, short_flights, medium_flights, and long_flights that you previously created are available to you.

Additionally, the following packages have been imported: pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt.

## Instructions

Create conditions, a list containing subsets of planes["Duration"] based on short_flights, medium_flights, and long_flights.
Create the "Duration_Category" column by calling a function that accepts your conditions list and flight_categories, setting values not found to "Extreme duration".
Create a plot showing the count of each category.

In [None]:
# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]

# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions,
                                        flight_categories,
                                        default="Extreme duration")

# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()


Creative categorical transformation work! It's clear that the majority of flights are short-haul, and virtually none are longer than 16 hours! Now let's take a deep dive into working with numerical data.
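
np.select() checks the conditions in order, uses the label of the first condition that is True for each element, and falls back to the default when none match. The sketch below shows this on made-up numeric hours (hypothetical values) instead of the regular-expression conditions used above:

```python
import numpy as np
import pandas as pd

# Hypothetical flight lengths in whole hours
hours = pd.Series([1, 4, 7, 12, 30])

# Overlapping conditions: the first one that is True wins
conditions = [hours < 5, hours < 10, hours <= 16]
labels = ["Short-haul", "Medium", "Long-haul"]

categories = np.select(conditions, labels, default="Extreme duration")
print(categories)
```

A one-hour flight satisfies all three conditions but is still labeled "Short-haul", because that condition comes first in the list.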

# **Working with numeric data**

# Flight duration
You would like to analyze the duration of flights, but unfortunately, the "Duration" column in the planes DataFrame currently contains string values.

You'll need to clean the column and convert it to the correct data type for analysis.

## Instructions 1/4

Print the first five values of the "Duration" column.

In [None]:
# Preview the column
print(planes["Duration"].head())

In [None]:
<script.py> output:
    0                  19.0h
    1     5.416666666666667h
    2                  4.75h
    3    2.4166666666666665h
    4                  15.5h
    Name: Duration, dtype: object


## Instructions 2/4

Remove "h" from the column.

In [None]:
# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")


## Instructions 3/4

Convert the column to float data type.

In [None]:
# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)


## Instructions 4/4

Plot a histogram of "Duration" values.

In [None]:
# Preview the column
print(planes["Duration"].head())

# Remove the string character
planes["Duration"] = planes["Duration"].str.replace("h", "")

# Convert to float data type
planes["Duration"] = planes["Duration"].astype(float)

# Plot a histogram
sns.histplot(data=planes, x="Duration")
plt.show()

Creative cleaning skills! Once the data was in the right format, you were able to plot the distribution of 'Duration' and see that the most common flight length is around three hours.
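If the column might contain entries that cannot be converted, a slightly more defensive variant of the same cleaning steps could look like the sketch below. The duration Series and its "unknown" entry are hypothetical; pd.to_numeric() with errors="coerce" is swapped in here for .astype(float) and is not the method used in the exercise.

```python
import pandas as pd

# Hypothetical values mimicking the preview above, plus one malformed entry
duration = pd.Series(["19.0h", "5.416666666666667h", "4.75h", "unknown"], name="Duration")

# regex=False treats "h" as a literal character rather than a pattern
stripped = duration.str.replace("h", "", regex=False)

# errors="coerce" turns unparseable strings into NaN instead of raising an error
hours = pd.to_numeric(stripped, errors="coerce")
print(hours)
```

Coerced values show up as missing, so it is worth checking hours.isna().sum() afterwards to see how many rows were affected.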

# Adding descriptive statistics
Now that "Duration" and "Price" both contain numeric values in the planes DataFrame, you would like to calculate summary statistics for them that are conditional on values in other columns.

## Instructions 1/3

## 1
Add a column to planes containing the standard deviation of "Price" based on "Airline".

## 2
Calculate the median for "Duration" by "Airline", storing it as a column called "airline_median_duration".
## 3
Find the mean "Price" by "Destination", saving it as a column called "price_destination_mean".

In [None]:
# Price standard deviation by Airline
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

In [None]:
<script.py> output:
    Airline            airline_price_st_dev
    Jet Airways        4230.749                3685
    IndiGo             2266.754                1981
    Air India          3865.872                1686
    Multiple carriers  3763.675                1148
    SpiceJet           1790.852                 787
    Vistara            2864.268                 455
    Air Asia           2016.739                 309
    GoAir              2790.815                 182
    dtype: int64

In [None]:
# Median Duration by Airline
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform(lambda x: x.median())

print(planes[["Airline","airline_median_duration"]].value_counts())

In [None]:
<script.py> output:
    Airline            airline_median_duration
    Jet Airways        13.333                     3685
    IndiGo             2.917                      1981
    Air India          15.917                     1686
    Multiple carriers  10.250                     1148
    SpiceJet           2.500                       787
    Vistara            3.167                       455
    Air Asia           2.833                       309
    GoAir              5.167                       182
    dtype: int64

In [None]:
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform(lambda x: x.mean())

print(planes[["Destination","price_destination_mean"]].value_counts())

In [None]:
<script.py> output:
    Destination  price_destination_mean
    Cochin       10506.993                 4391
    Banglore     9132.225                  2773
    Delhi        5157.794                  1219
    New Delhi    11738.589                  888
    Hyderabad    5025.210                   673
    Kolkata      4801.490                   369
    dtype: int64

Terrific transforming! Looks like Jet Airways has the largest standard deviation in price, Air India has the largest median duration, and New Delhi, on average, is the most expensive destination. Now let's look at how to handle outliers.
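The reason .transform() is used here rather than .agg() is that it returns one value per original row, so the result can be stored as a new column. A minimal sketch on a hypothetical toy DataFrame (the names toy, per_airline, and airline_mean_price are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; not the real planes DataFrame
toy = pd.DataFrame({
    "Airline": ["A", "A", "B", "B", "B"],
    "Price": [100.0, 300.0, 50.0, 70.0, 90.0],
})

# agg collapses each group to a single row
per_airline = toy.groupby("Airline")["Price"].agg("mean")

# transform broadcasts the group statistic back to every original row,
# so the result has the same length as the DataFrame
toy["airline_mean_price"] = toy.groupby("Airline")["Price"].transform("mean")

print(per_airline)
print(toy)
```

Passing the string "mean" should give the same result as lambda x: x.mean(); the same applies to "std" and "median".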

# **Handling outliers**
# Identifying outliers
You've proven that you recognize what to do when presented with outliers, but can you identify them using visualizations?

Try to figure out if there are outliers in the "Price" or "Duration" columns of the planes DataFrame.

matplotlib.pyplot and seaborn have been imported for you as plt and sns respectively.

## Instructions 1/3

Plot the distribution of the "Price" column from planes.

In [None]:
# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()


## Instructions 2/3

Display the descriptive statistics for flight duration.



In [None]:
# Plot a histogram of flight prices
sns.histplot(data=planes, x="Price")
plt.show()

# Display descriptive statistics for flight duration
print(planes["Duration"].describe())

In [None]:
<script.py> output:
    count    10446.000
    mean        10.724
    std          8.472
    min          0.083
    25%          2.833
    50%          8.667
    75%         15.500
    max         47.667
    Name: Duration, dtype: float64

# Removing outliers
While removing outliers isn't always the way to go, for your analysis, you've decided that you will only include flights where the "Price" is not an outlier.

Therefore, you need to find the upper threshold and then use it to remove values above this from the planes DataFrame.

pandas has been imported for you as pd, along with seaborn as sns.

## Instructions 1/4

Find the 75th and 25th percentiles, saving as price_seventy_fifth and price_twenty_fifth respectively.

## 2
Calculate the IQR, storing it as prices_iqr.
## 3
Calculate the upper and lower outlier thresholds.
## 4
Remove the outliers from planes.

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

In [None]:
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)

# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth

# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)

# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]

print(planes["Price"].describe())

In [None]:
<script.py> output:
    count     9959.000
    mean      8875.161
    std       4057.202
    min       1759.000
    25%       5228.000
    50%       8283.000
    75%      12284.000
    max      23001.000
    Name: Price, dtype: float64

Ridiculous outlier removal skills! You managed to create thresholds based on the IQR and used them to filter the planes dataset to eliminate extreme prices. Originally the dataset had a maximum price of almost 55000, but the output of planes["Price"].describe() shows the maximum has been reduced to around 23000, reflecting a less skewed distribution for analysis!
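The threshold steps can be wrapped in a small helper so they are easy to reuse on other columns. This is only a sketch: iqr_bounds and the prices Series are hypothetical names and values, and the multiplier of 1.5 follows the exercise above.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier thresholds based on the IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy prices with one obviously extreme value
prices = pd.Series([10, 12, 13, 15, 16, 18, 100])

lower, upper = iqr_bounds(prices)

# Keep only the values strictly inside the thresholds
kept = prices[(prices > lower) & (prices < upper)]
print(lower, upper)
print(kept.tolist())
```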

# **Chapter 3: Relationships in Data**
Variables in datasets don't exist in a vacuum; they have relationships with each other. In this chapter, you'll look at relationships across numerical, categorical, and even DateTime data, exploring the direction and strength of these relationships as well as ways to visualize them.

# **Patterns over time**

# Importing DateTime data
You'll now work with the entire divorce dataset! The data describes Mexican marriages dissolved between 2000 and 2015. It contains marriage and divorce dates, education level, birthday, income for each partner, and marriage duration, as well as the number of children the couple had at the time of divorce.

The column names and data types are as follows:

divorce_date          object
dob_man               object
education_man         object
income_man           float64
dob_woman             object
education_woman       object
income_woman         float64
marriage_date         object
marriage_duration    float64
num_kids             float64
It looks like there is a lot of date information in this data that is not yet a DateTime data type! Your task is to fix that so that you can explore patterns over time.

pandas has been imported as pd.

## Instructions

Import divorce.csv, saving as a DataFrame, divorce; indicate in the import function that the divorce_date, dob_man, dob_woman, and marriage_date columns should be imported as DateTime values.

In [None]:
# Import divorce.csv, parsing the appropriate columns as dates in the import
divorce = pd.read_csv("divorce.csv", parse_dates=["divorce_date", "dob_man", "dob_woman", "marriage_date"])
print(divorce.dtypes)

In [None]:
<script.py> output:
    divorce_date         datetime64[ns]
    dob_man              datetime64[ns]
    education_man                object
    income_man                  float64
    dob_woman            datetime64[ns]
    education_woman              object
    income_woman                float64
    marriage_date        datetime64[ns]
    marriage_duration           float64
    num_kids                    float64
    dtype: object

Bingo! Nice work parsing those dates at the same time as you imported the data into pandas. Next, have a go at updating DateTime data types in a DataFrame that has already been imported.
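For a DataFrame that is already in memory, pd.to_datetime() does the same job as parse_dates. The sketch below uses a hypothetical toy DataFrame; the year, month, and day columns are assumptions for illustration and are not part of divorce.csv as described above.

```python
import pandas as pd

# Hypothetical toy data with dates stored as strings and as separate parts
toy = pd.DataFrame({
    "marriage_date": ["2000-06-26", "2001-09-02"],
    "year": [2006, 2008],
    "month": [9, 1],
    "day": [13, 2],
})

# Convert an existing string column to a DateTime data type
toy["marriage_date"] = pd.to_datetime(toy["marriage_date"])

# to_datetime can also assemble a date from columns named year, month, and day
toy["divorce_date"] = pd.to_datetime(toy[["year", "month", "day"]])

print(toy.dtypes)
```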

# Visualizing relationships over time
Now that your date data is saved as DateTime data, you can explore patterns over time! Does the year that a couple got married have a relationship with the number of children that the couple has at the time of divorce? Your task is to find out!

The divorce DataFrame (with all dates formatted as DateTime data types) has been loaded for you. pandas has been loaded as pd, matplotlib.pyplot has been loaded as plt, and Seaborn has been loaded as sns.

## Instructions 1/2

Define a column called marriage_year, which contains just the year portion of the marriage_date column.

In [None]:
# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year


## Instructions 2/2

Create a line plot showing the average number of kids a couple had during their marriage, arranged by the year that the couple got married.

In [None]:
# Define the marriage_year column
divorce["marriage_year"] = divorce["marriage_date"].dt.year

# Create a line plot showing the average number of kids by year
sns.lineplot(data=divorce, x="marriage_year", y="num_kids")
plt.show()

Nice! You've discovered a pattern here: it looks like couples who had later marriage years also had fewer children during their marriage. We'll explore this relationship and others further in the next video.
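By default, sns.lineplot() draws the mean of y at each x value, so the numbers behind the line can be checked with a groupby. A minimal sketch on hypothetical toy data (toy and kids_by_year are made-up names):

```python
import pandas as pd

# Hypothetical toy data; not the real divorce DataFrame
toy = pd.DataFrame({
    "marriage_date": pd.to_datetime(
        ["1995-03-01", "1995-08-15", "2005-01-20", "2005-11-30"]
    ),
    "num_kids": [3.0, 1.0, 1.0, None],
})

# Extract the year, then average num_kids within each year
toy["marriage_year"] = toy["marriage_date"].dt.year
kids_by_year = toy.groupby("marriage_year")["num_kids"].mean()
print(kids_by_year)
```

Note that the pandas mean skips missing values, so the 2005 average here is based on a single couple.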

# **Correlation**
# Visualizing variable relationships
In the last exercise, you may have noticed that a longer marriage_duration is correlated with having more children, represented by the num_kids column. The correlation coefficient between the marriage_duration and num_kids variables is 0.45.

In this exercise, you'll create a scatter plot to visualize the relationship between these variables. pandas has been loaded as pd, matplotlib.pyplot has been loaded as plt, and Seaborn has been loaded as sns.

## Instructions

Create a scatterplot showing marriage_duration on the x-axis and num_kids on the y-axis.

In [None]:
# Create the scatterplot
sns.scatterplot(data=divorce, x="marriage_duration", y="num_kids")
plt.show()

Bingo! There is a slight positive relationship in your scatterplot. In the dataset, couples with no children have no value in the num_kids column. If you are confident that all or most of the missing values in num_kids are related to couples without children, you could consider updating these values to 0, which might increase the correlation.
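A quick way to test that idea is to compute the correlation before and after filling the missing values. The numbers below are hypothetical toy data, so the size (and even the direction) of the change on the real divorce DataFrame may differ.

```python
import pandas as pd

# Hypothetical toy data: short marriages have no recorded num_kids
toy = pd.DataFrame({
    "marriage_duration": [1, 2, 5, 10, 15, 20],
    "num_kids": [None, None, 1, 2, 2, 3],
})

# Series.corr() ignores rows where either value is missing
r_missing = toy["marriage_duration"].corr(toy["num_kids"])

# Assumption: every missing value means the couple had no children
filled = toy["num_kids"].fillna(0)
r_filled = toy["marriage_duration"].corr(filled)

print(round(r_missing, 3), round(r_filled, 3))
```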

# Visualizing multiple variable relationships
Seaborn's .pairplot() is excellent for understanding the relationships between several or all variables in a dataset by aggregating pairwise scatter plots in one visual.

Your task is to use a pairplot to compare the relationship between marriage_duration and income_woman. pandas has been loaded as pd, matplotlib.pyplot has been loaded as plt, and Seaborn has been loaded as sns.

## Instructions

Create a pairplot to visualize the relationships between income_woman and marriage_duration in the divorce DataFrame.

In [None]:
# Create a pairplot for income_woman and marriage_duration
sns.pairplot(data=divorce, vars=["income_woman", "marriage_duration"])
plt.show()

Well done! Just as in the correlation matrix, you can see that the relationship between income_woman and marriage_duration is not a strong one. You can also get a sense of the distributions of both variables in the upper left and lower right plots.

# **Factor relationships and distributions**

# Categorical data in scatter plots
In the video, we explored how men's education and age at marriage related to other variables in our dataset, the divorce DataFrame. Now, you'll take a look at how women's education and age at marriage relate to other variables!

Your task is to create a scatter plot of each woman's age and income, layering in the categorical variable of education level for additional context.

The divorce DataFrame has been loaded for you, and woman_age_marriage has already been defined as a column representing an estimate of the woman's age at the time of marriage. pandas has been loaded as pd, matplotlib.pyplot has been loaded as plt, and Seaborn has been loaded as sns.

## Instructions

Create a scatter plot that shows woman_age_marriage on the x-axis and income_woman on the y-axis; each data point should be colored based on the woman's level of education, represented by education_woman.


In [None]:
# Create the scatter plot
sns.scatterplot(data=divorce, x="woman_age_marriage", y="income_woman", hue="education_woman")
plt.show()

Awesome—it looks like there is a positive correlation between professional education and higher salaries, as you might expect. The relationship between women's age at marriage and education level is a little less clear.

# Exploring with KDE plots
Kernel Density Estimate (KDE) plots are a great alternative to histograms when you want to show multiple distributions in the same visual.

Suppose you are interested in the relationship between marriage duration and the number of kids that a couple has. Since values in the num_kids column range only from one to five, you can plot the KDE for each value on the same plot.

The divorce DataFrame has been loaded for you. pandas has been loaded as pd, matplotlib.pyplot has been loaded as plt, and Seaborn has been loaded as sns. Recall that the num_kids column in divorce lists only N/A values for couples with no children, so you'll only be looking at distributions for divorced couples with at least one child.

## Instructions 1/3

Create a KDE plot that shows marriage_duration on the x-axis and a different colored line for each possible number of children that a couple might have, represented by num_kids.

In [None]:
# Create the KDE plot
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids")
plt.show()


## Instructions 2/3

Notice that the plot currently shows marriage durations less than zero; update the KDE plot so that marriage duration cannot be smoothed past the extreme data points.

In [None]:
# Update the KDE plot so that marriage duration can't be smoothed too far
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0)
plt.show()


## Instructions 3/3

Update the code for the KDE plot from the previous step to show a cumulative distribution function for each number of children a couple has.

In [None]:
# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()

Well done! It looks as though there is a positive correlation between longer marriages and more children, but of course, this doesn't indicate causation. You can also see that there is much less data on couples with more than two children; this helps us understand how reliable our findings are.

# **Chapter 4: Turning Exploratory Analysis into Action**

Exploratory data analysis is a crucial step in the data science workflow, but it isn't the end! In this chapter, you'll see how exploratory findings feed into that workflow by checking the balance of categorical features, creating new features, and generating hypotheses from findings.

# **Considerations for categorical data**

# Checking for class imbalance
The 2022 Kaggle Survey captures information about data scientists' backgrounds, preferred technologies, and techniques. It is seen as an accurate view of what is happening in data science based on the volume and profile of responders.

Having looked at the job titles and categorized to align with our salaries DataFrame, you can see the following proportion of job categories in the Kaggle survey:

| Job Category | Relative Frequency |
| --- | --- |
| Data Science | 0.281236 |
| Data Analytics | 0.224231 |
| Other | 0.214609 |
| Managerial | 0.121300 |
| Machine Learning | 0.083248 |
| Data Engineering | 0.075375 |

Thinking of the Kaggle survey results as the population, your task is to find out whether the salaries DataFrame is representative by comparing the relative frequency of job categories.

## Instructions

Print the relative frequency of the "Job_Category" column from salaries DataFrame.

In [None]:
# Print the relative frequency of Job_Category
print(salaries["Job_Category"].value_counts(normalize=True))

In [None]:
<script.py> output:
    Data Science        0.278
    Data Engineering    0.273
    Data Analytics      0.226
    Machine Learning    0.120
    Other               0.069
    Managerial          0.034
    Name: Job_Category, dtype: float64

Fantastic relative frequency calculation! It looks like Data Science is the most popular class and has a similar representation. Still, the other categories have quite different relative frequencies, which might not be surprising given the target audience is data scientists! Given the difference in relative frequencies, can you trust the salaries DataFrame to accurately represent Managerial roles?
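To make the comparison concrete, the two sets of relative frequencies shown above can be placed side by side. This is a minimal sketch; the variable names kaggle, sample, and comparison are made up, and the numbers are copied from the table and output in this exercise.

```python
import pandas as pd

# Relative frequencies from the Kaggle survey (treated as the population)
kaggle = pd.Series({
    "Data Science": 0.281236,
    "Data Analytics": 0.224231,
    "Other": 0.214609,
    "Managerial": 0.121300,
    "Machine Learning": 0.083248,
    "Data Engineering": 0.075375,
})

# Relative frequencies printed from the salaries DataFrame
sample = pd.Series({
    "Data Science": 0.278,
    "Data Engineering": 0.273,
    "Data Analytics": 0.226,
    "Machine Learning": 0.120,
    "Other": 0.069,
    "Managerial": 0.034,
})

# Building a DataFrame from two Series aligns them on the category labels
comparison = pd.DataFrame({"kaggle": kaggle, "salaries": sample})
comparison["difference"] = comparison["salaries"] - comparison["kaggle"]
print(comparison.sort_values("difference"))
```

Positive differences mark categories that are over-represented in salaries relative to the survey; negative differences mark under-represented ones.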

# Cross-tabulation
Cross-tabulation can help identify how observations occur in combination.

Using the salaries dataset, which has been imported as a pandas DataFrame, you'll perform cross-tabulation on multiple variables, including the use of aggregation, to see the relationship between "Company_Size" and other variables.

pandas has been imported for you as pd.

## Instructions 1/3

Perform cross-tabulation, setting "Company_Size" as the index, and the columns to classes in "Experience".

In [None]:
# Cross-tabulate Company_Size and Experience
print(pd.crosstab(salaries["Company_Size"], salaries["Experience"]))

In [None]:
<script.py> output:
    Experience    EN  EX  MI   SE
    Company_Size
    L             24   7  49   44
    M             25   9  58  136
    S             18   1  21   15

## Instructions 2/3
Cross-tabulate "Job_Category" and classes of "Company_Size" as column names.

In [None]:
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"]))

In [None]:
<script.py> output:
    Company_Size       L   M   S
    Job_Category
    Data Analytics    23  61   8
    Data Engineering  28  72  11
    Data Science      38  59  16
    Machine Learning  17  19  13
    Managerial         5   8   1
    Other             13   9   6

## Instructions 3/3
Update pd.crosstab() to return the mean "Salary_USD" values.

In [None]:
# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
            values=salaries["Salary_USD"], aggfunc="mean"))

In [None]:

<script.py> output:
    Company_Size               L           M          S
    Job_Category
    Data Analytics    112851.749   95912.685  53741.877
    Data Engineering  118939.035  121287.061  86927.136
    Data Science       96489.520  116044.456  62241.749
    Machine Learning  140779.492  100794.237  78812.586
    Managerial        190551.449  150713.628  31484.700
    Other              92873.911   89750.579  69871.248

Awesome cross-tabulation! This is a handy function to examine the combination of frequencies, as well as find aggregated statistics. Looks like the largest mean salary is for Managerial data roles in large companies!
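The three uses of pd.crosstab() can be compared on a small example. The toy DataFrame below is hypothetical, and the normalize="index" call is an extra option that is not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data; not the real salaries DataFrame
toy = pd.DataFrame({
    "Company_Size": ["S", "S", "M", "M", "L"],
    "Experience": ["EN", "SE", "EN", "EN", "SE"],
    "Salary_USD": [40000, 60000, 50000, 70000, 120000],
})

# Frequency of each combination
counts = pd.crosstab(toy["Company_Size"], toy["Experience"])

# Mean salary per combination; combinations with no rows become NaN
mean_salary = pd.crosstab(
    toy["Company_Size"], toy["Experience"],
    values=toy["Salary_USD"], aggfunc="mean",
)

# Share of each experience level within a company size (rows sum to 1)
row_share = pd.crosstab(toy["Company_Size"], toy["Experience"], normalize="index")

print(counts)
print(mean_salary)
print(row_share)
```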

# **Generating new features**

# Extracting features for correlation
In this exercise, you'll work with a version of the salaries dataset containing a new column called "date_of_response".

The dataset has been read in as a pandas DataFrame, with "date_of_response" as a datetime data type.

Your task is to extract datetime attributes from this column and then create a heat map to visualize the correlation coefficients between variables.

Seaborn has been imported for you as sns, pandas as pd, and matplotlib.pyplot as plt.

## Instructions

Extract the month from "date_of_response", storing it as a column called "month".
Create the "weekday" column, containing the weekday that the participants completed the survey.
Plot a heat map, including the Pearson correlation coefficient scores.

In [None]:
# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month

# Extract the weekday of the response
salaries["weekday"] = salaries["date_of_response"].dt.weekday

# Create a heatmap
sns.heatmap(salaries.corr(numeric_only=True), annot=True)
plt.show()

Fantastic feature creation! Looks like there aren't any meaningful relationships between our numeric variables, so let's see if converting numeric data into classes offers additional insights.
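The feature extraction and the correlation matrix behind the heat map can be inspected without plotting. This sketch uses a hypothetical toy DataFrame and assumes pandas 1.5 or later, where .corr() accepts numeric_only; in pandas 2.0 and later, calling .corr() on a DataFrame that contains text columns raises an error unless they are excluded.

```python
import pandas as pd

# Hypothetical toy data; not the real salaries DataFrame
toy = pd.DataFrame({
    "date_of_response": pd.to_datetime(
        ["2022-01-03", "2022-02-15", "2022-03-20", "2022-04-08"]
    ),
    "Salary_USD": [50000, 65000, 80000, 95000],
    "Designation": ["A", "B", "C", "D"],
})

# Extract datetime attributes as new numeric features
toy["month"] = toy["date_of_response"].dt.month
toy["weekday"] = toy["date_of_response"].dt.weekday  # Monday is 0, Sunday is 6

# numeric_only=True restricts the calculation to numeric columns
corr = toy.corr(numeric_only=True)
print(corr)
```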

# Calculating salary percentiles
In the video, you saw that the conversion of numeric data into categories sometimes makes it easier to identify patterns.

Your task is to convert the "Salary_USD" column into categories based on its percentiles. First, you need to find the percentiles and store them as variables.

pandas has been imported as pd, and the salaries dataset has been read in as a DataFrame called salaries.

## Instructions

Find the 25th percentile of "Salary_USD".
Store the median of "Salary_USD" as salaries_median.
Get the 75th percentile of salaries.

In [None]:
# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Save the median
salaries_median = salaries["Salary_USD"].median()

# Gather the 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)

In [None]:
<script.py> output:
    60880.691999999995 97488.552 143225.1

Looks like the interquartile range is between 60,881 and 143,225 dollars! Now let's use these variables to add a categorical salary column into the DataFrame!

# Categorizing salaries
Now it's time to make a new category! You'll use the variables twenty_fifth, salaries_median, and seventy_fifth, that you created in the previous exercise, to split salaries into different labels.

The result will be a new column called "salary_level", which you'll incorporate into a visualization to analyze survey respondents' salary levels at companies of different sizes.

pandas has been imported as pd, matplotlib.pyplot as plt, seaborn as sns, and the salaries dataset as a pandas DataFrame called salaries.

## Instructions 1/4

Create salary_labels, a list containing "entry", "mid", "senior", and "exec".

In [None]:
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]


## Instructions 2/4

Finish salary_ranges, adding the 25th percentile, median, 75th percentile, and largest value from "Salary_USD".

In [None]:
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]


## Instructions 3/4

Split "Salary_USD" based on the labels and ranges you've created.

In [None]:
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)


## Instructions 4/4

Use sns.countplot() to visualize the count of "Company_Size", factoring in salary level labels.

In [None]:
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)

# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()

Nice work! By using pd.cut() to split out numeric data into categories, you can see that a large proportion of workers at small companies get paid "entry" level salaries, while more staff at medium-sized companies are rewarded with "senior" level salary. Now let's look at generating hypotheses as you reach the end of the EDA phase!
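One detail of pd.cut() worth knowing: by default each bin includes its right edge but not its left edge, so a value exactly equal to the lowest bin edge is left uncategorized. The sketch below uses hypothetical salaries and bin edges to show this, along with the include_lowest argument that changes it.

```python
import pandas as pd

# Hypothetical toy salaries and bin edges; not the real percentiles
salary = pd.Series([0, 40000, 60000, 90000, 150000])
bins = [0, 50000, 80000, 120000, salary.max()]
labels = ["entry", "mid", "senior", "exec"]

# Default: intervals are (left, right], so the value 0 falls outside the first bin
default_cut = pd.cut(salary, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(salary, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

With real salary data a value of exactly 0 may never occur, but the same edge behavior applies to whatever lower bound is chosen.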

# **Generating hypotheses**

# Comparing salaries
Exploratory data analysis is a crucial step in generating hypotheses!

You've had an idea you'd like to explore—do data professionals get paid more in the USA than they do in Great Britain?

You'll need to subset the data by "Employee_Location" and produce a plot displaying the average salary between the two groups.

The salaries DataFrame has been imported as a pandas DataFrame.

pandas has been imported as pd, matplotlib.pyplot as plt, and seaborn as sns.

## Instructions

Filter salaries where "Employee_Location" is "US" or "GB", saving as usa_and_gb.
Use usa_and_gb to create a barplot visualizing "Salary_USD" against "Employee_Location".

In [None]:
# Filter for employees in the US or GB
usa_and_gb = salaries[salaries["Employee_Location"].isin(["US", "GB"])]

# Create a barplot of salaries by location
sns.barplot(data=usa_and_gb, x="Employee_Location", y="Salary_USD")
plt.show()

Nicely done! By subsetting the data you were able to directly compare salaries between the USA and Great Britain. The visualization suggests you've generated a hypothesis that is worth formally investigating to determine whether a real difference exists or not!

# Choosing a hypothesis
You've seen how visualizations can be used to generate hypotheses, making them a crucial part of exploratory data analysis!

In this exercise, you'll generate a bar plot to inspect how salaries differ based on company size and employment status. For reference, there are four values:

| Value | Meaning |
| --- | --- |
| CT | Contractor |
| FL | Freelance |
| PT | Part-time |
| FT | Full-time |

pandas has been imported as pd, matplotlib.pyplot as plt, seaborn as sns, and the salaries dataset as a pandas DataFrame called salaries.

## Instructions 1/2

Produce a barplot comparing "Salary_USD" by "Company_Size", factoring in "Employment_Status".

In [None]:
# Create a bar plot of salary versus company size, factoring in employment status
sns.barplot(data=salaries, x="Company_Size", y="Salary_USD", hue="Employment_Status")
plt.show()
