# Project 3

## This project is based on a Bureau of Labor Statistics dataset titled 'U.S. Incomes by Occupation and Gender.' The aim of the project is to assess the gender pay gap across various occupations in the U.S. The project focuses on two key metrics:
### 1. Top 10 occupations with the highest gender pay gap 
### 2. The relationship between the percentage of women in an occupation and the gender pay gap

In [62]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

In [63]:
import pandas as pd
import plotly.express as px

In [64]:
df = pd.read_csv("Genderpaygap.csv")  # Load the selected dataset


In [65]:
print("First few rows of the dataset:") #Inspect initial rows of the dataset
print(df.head())

First few rows of the dataset:
                        Occupation  All_workers  All_weekly  M_workers  \
0                  ALL OCCUPATIONS       109080       809.0      60746   
1                       MANAGEMENT        12480      1351.0       7332   
2                 Chief executives         1046      2041.0        763   
3  General and operations managers          823      1260.0        621   
4                      Legislators            8         NaN          5   

   M_weekly  F_workers  F_weekly  
0     895.0      48334     726.0  
1    1486.0       5147    1139.0  
2    2251.0        283    1836.0  
3    1347.0        202    1002.0  
4       NaN          4       NaN  


In [66]:
#Check for missing data and relevant data types
print("\nMissing values in dataset:")
print(df.isnull().sum())

print("\nData types in dataset:")
print(df.dtypes)


Missing values in dataset:
Occupation       0
All_workers      0
All_weekly     236
M_workers        0
M_weekly       326
F_workers        0
F_weekly       366
dtype: int64

Data types in dataset:
Occupation      object
All_workers      int64
All_weekly     float64
M_workers        int64
M_weekly       float64
F_workers        int64
F_weekly       float64
dtype: object


In [67]:
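# Exclude rows with null values in 'All_weekly', 'M_weekly', or 'F_weekly'.
# .copy() makes df_cleaned an independent DataFrame, so adding columns to it
# later does not raise SettingWithCopyWarning.
df_cleaned = df.dropna(subset=['All_weekly', 'M_weekly', 'F_weekly']).copy()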
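
As a minimal sketch of what this cleaning step does, the toy frame below reuses the five rows printed by `df.head()` above. The names `toy` and `toy_cleaned` are illustrative and not part of the notebook. `dropna(subset=...)` drops any row missing one of the three medians, and a trailing `.copy()` makes the result an independent DataFrame, so that adding columns to it is unambiguous.

```python
import numpy as np
import pandas as pd

# Toy data: the five rows shown by df.head() above (illustrative only).
toy = pd.DataFrame({
    "Occupation": ["ALL OCCUPATIONS", "MANAGEMENT", "Chief executives",
                   "General and operations managers", "Legislators"],
    "All_weekly": [809.0, 1351.0, 2041.0, 1260.0, np.nan],
    "M_weekly": [895.0, 1486.0, 2251.0, 1347.0, np.nan],
    "F_weekly": [726.0, 1139.0, 1836.0, 1002.0, np.nan],
})

# Keep only rows where all three medians are present; .copy() detaches the result.
toy_cleaned = toy.dropna(subset=["All_weekly", "M_weekly", "F_weekly"]).copy()

# Add a derived column to the independent copy.
toy_cleaned["Gender_Pay_Gap"] = toy_cleaned["M_weekly"] - toy_cleaned["F_weekly"]

print(f"rows before: {len(toy)}, rows after: {len(toy_cleaned)}")
print(toy_cleaned[["Occupation", "Gender_Pay_Gap"]])
```

In this toy frame only the Legislators row is dropped, because its weekly medians are missing.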

In [68]:
# Calculate the Gender Pay Gap as male minus female median weekly income (USD per week);
# positive values mean men's median is higher, negative values mean women's median is higher
df_cleaned['Gender_Pay_Gap'] = df_cleaned['M_weekly'] - df_cleaned['F_weekly']



In [69]:
# Display summary statistics for Gender Pay Gap
print("\nSummary statistics for Gender Pay Gap:")
print(df_cleaned['Gender_Pay_Gap'].describe())


Summary statistics for Gender Pay Gap:
count    142.000000
mean     190.063380
std      149.089302
min      -99.000000
25%       89.250000
50%      164.500000
75%      264.750000
max      742.000000
Name: Gender_Pay_Gap, dtype: float64
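
The figures above are absolute differences in median weekly earnings, so an identical proportional gap yields a larger dollar figure in a higher-paying occupation. As a complementary view that is not part of the original analysis, the sketch below expresses the gap as a percentage of men's median earnings. The column name `Gap_Pct_of_Male` is an assumption made for illustration, and the two toy rows are taken from the `df.head()` output.

```python
import pandas as pd

# Two rows from the df.head() output above (illustrative only).
toy = pd.DataFrame({
    "Occupation": ["ALL OCCUPATIONS", "Chief executives"],
    "M_weekly": [895.0, 2251.0],
    "F_weekly": [726.0, 1836.0],
})

# Absolute gap in USD per week, as defined in the notebook.
toy["Gender_Pay_Gap"] = toy["M_weekly"] - toy["F_weekly"]

# Hypothetical relative metric: the gap as a share of men's median earnings.
toy["Gap_Pct_of_Male"] = toy["Gender_Pay_Gap"] / toy["M_weekly"] * 100

print(toy.round(1))
```

On these two rows the dollar gaps differ by more than a factor of two (169 versus 415), while the relative gaps are close (about 18.9% versus 18.4%).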


In [70]:
# Calculate the percentage of women in each occupation
df_cleaned['Women_Percentage'] = df_cleaned['F_workers'] / df_cleaned['All_workers'] * 100



In [71]:
# Display summary statistics for Women Percentage
print("\nSummary statistics for Women Percentage:")
print(df_cleaned['Women_Percentage'].describe())


Summary statistics for Women Percentage:
count    142.000000
mean      47.592523
std       21.030421
min        2.394268
25%       34.344675
50%       47.037643
75%       60.198884
max       94.421952
Name: Women_Percentage, dtype: float64


In [72]:
# Graph 1: Bar plot for the top 10 occupations with the highest gender pay gap
df_sorted = df_cleaned.sort_values(by='Gender_Pay_Gap', ascending=False)
fig1 = px.bar(df_sorted.head(10), x='Occupation', y='Gender_Pay_Gap',
              hover_data=['Women_Percentage'],
              labels={'Gender_Pay_Gap': 'Gender Pay Gap (USD)', 'Occupation': 'Occupation'},
              title='Top 10 Occupations with the Highest Gender Pay Gap')

In [73]:
# Show the first plot
fig1.show()

In [74]:
# Graph 2: Scatter plot showing the relationship between the percentage of women and the gender pay gap
fig2 = px.scatter(df_cleaned, x='Women_Percentage', y='Gender_Pay_Gap',
                  hover_data=['Occupation'],
                  title='Gender Pay Gap vs. Percentage of Women in Occupation',
                  labels={'Women_Percentage': 'Percentage of Women (%)', 'Gender_Pay_Gap': 'Gender Pay Gap (USD)'})

In [75]:
# Show the second plot
fig2.show()

In [76]:
#Additional important metrics to note
# Find the occupation with the largest gender pay gap
max_gap_occupation = df_cleaned.loc[df_cleaned['Gender_Pay_Gap'].idxmax()]
print("\nOccupation with the largest gender pay gap:")
print(max_gap_occupation)


Occupation with the largest gender pay gap:
Occupation              LEGAL
All_workers              1346
All_weekly             1391.0
M_workers                 624
M_weekly               1877.0
F_workers                 722
F_weekly               1135.0
Gender_Pay_Gap          742.0
Women_Percentage    53.640416
Name: 133, dtype: object
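
The top entry, LEGAL, is written in capitals, as are ALL OCCUPATIONS and MANAGEMENT in the first rows of the dataset. If capitalised names denote aggregate occupation groups rather than individual occupations (an assumption about this file, not something the notebook verifies), such rows could be set aside before ranking. A rough sketch on toy rows, with `is_aggregate` and `detailed` as illustrative names:

```python
import pandas as pd

# Toy rows built from values printed earlier in the notebook (illustrative only).
toy = pd.DataFrame({
    "Occupation": ["ALL OCCUPATIONS", "MANAGEMENT", "Chief executives",
                   "General and operations managers", "LEGAL"],
    "Gender_Pay_Gap": [169.0, 347.0, 415.0, 345.0, 742.0],
})

# Assumption: an all-uppercase name marks an aggregate group, not a single occupation.
is_aggregate = toy["Occupation"].str.isupper()
detailed = toy[~is_aggregate]

# Largest gap among the remaining detailed occupations in this toy sample.
largest = detailed.loc[detailed["Gender_Pay_Gap"].idxmax()]
print(largest)
```

This only illustrates the filtering step; the ranking on the full dataset may differ from the toy result.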


In [77]:
# Find the occupation with the lowest (most negative) gender pay gap,
# i.e. where women's median weekly income exceeds men's by the most
min_gap_occupation = df_cleaned.loc[df_cleaned['Gender_Pay_Gap'].idxmin()]
print("\nOccupation with the smallest gender pay gap:")
print(min_gap_occupation)


Occupation with the smallest gender pay gap:
Occupation          Wholesale and retail buyers, except farm products
All_workers                                                       142
All_weekly                                                      926.0
M_workers                                                          73
M_weekly                                                        886.0
F_workers                                                          69
F_weekly                                                        985.0
Gender_Pay_Gap                                                  -99.0
Women_Percentage                                            48.591549
Name: 35, dtype: object


In [78]:
# Calculate the correlation between the percentage of women in an occupation and the gender pay gap
correlation = df_cleaned['Women_Percentage'].corr(df_cleaned['Gender_Pay_Gap'])
print("\nCorrelation between Women Percentage and Gender Pay Gap:", correlation)



Correlation between Women Percentage and Gender Pay Gap: -0.14283239103121648
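
For reference, `Series.corr` computes the Pearson coefficient by default. The sketch below spells that formula out in plain Python (the helper `pearson_r` is hypothetical, written for illustration) and squares the value reported above to show how much of the variation in one variable is linearly associated with the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Sanity checks on perfectly linear toy data (not from the dataset).
r_up = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
r_down = pearson_r([1, 2, 3, 4], [40, 30, 20, 10])

# Coefficient reported by the notebook cell above.
reported_r = -0.14283239103121648
r_squared = reported_r ** 2

print(f"r_up={r_up:.3f}, r_down={r_down:.3f}, r_squared={r_squared:.4f}")
```

An r² of about 0.02 means that roughly 2% of the variance in the pay gap is linearly associated with the share of women in an occupation.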


## *Key takeaways:*
### 1. Graph 1 shows the ten occupations with the largest gender pay gaps, and the summary statistics show how widely the gap varies across occupations. While some roles are near parity, many show a substantial gap favoring men. Job roles in industries like law and finance show the highest disparities: the largest gap is in the LEGAL category, where men's median weekly earnings exceed women's by $742. Across the 142 occupations with complete data, the mean gap is about $190 per week with a standard deviation of about $149, and it ranges from -$99 (wholesale and retail buyers, where women's median earnings are higher) to $742. The 25th percentile is about $89, so at least three quarters of these occupations show a gap favoring men. The overall distribution points to systemic inequality, with most occupations favoring male earnings.
### 2. Graph 2 tests the hypothesis that a larger gender pay gap is associated with a smaller share of women in an occupation. The correlation of -0.14 is negative, but it is weak: squaring it gives an r² of about 0.02, so only about 2% of the variance in the pay gap is linearly associated with the share of women. The data therefore give little support to the hypothesis, and a correlation alone cannot establish causation in either direction. This raises another research question for gender studies: what other factors contribute to the underrepresentation of women in certain occupations despite smaller gender pay gaps?