# The Labor Gap in Technical Demand and the Impact of Gender


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc">
    <ul class="toc-item">
        <li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">
            <span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span>
        </li>    
        <li><span><a href="#Data-Manipulation" data-toc-modified-id="Data-Manipulation-2">
            <span class="toc-item-num">2&nbsp;&nbsp;</span>Data Manipulation</a></span>
        </li>
        <li>
            <span><a href="#Tree-Maps" data-toc-modified-id="Tree-Maps-3">
                <span class="toc-item-num">3&nbsp;&nbsp;</span>Tree Maps</a></span>
        </li>
        <li>
            <span><a href="#Heat-Maps" data-toc-modified-id="Heat-Maps-4">
                <span class="toc-item-num">4&nbsp;&nbsp;</span>Heat Maps</a></span>
        </li>
        <li>
            <span><a href="#Time-Series-Analysis" data-toc-modified-id="Time-Series-Analysis-5">
                <span class="toc-item-num">5&nbsp;&nbsp;</span>Time Series Analysis</a></span>
        </li>                 
        <li>
            <span><a href="#Conclusion" data-toc-modified-id="Conclusion-6">
                <span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a>
            </span>
        </li>          
    </ul>
</div>

In [None]:
%pip install -r requirements.txt # for initial use, please run this line to install the required packages

In [1]:

# Import Dependencies
import sys
import os
import pandas as pd
import altair as alt
from squarify import normalize_sizes, squarify
from bokeh.plotting import figure
from bokeh.sampledata.sample_superstore import data
from bokeh.transform import factor_cmap
from bokeh.io import output_notebook
from jupyter_bokeh.widgets import BokehModel # widgets third party extension will be required

# Add the project directory to sys.path so the custom py file can be imported
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)
# Import custom data manipulation helper functions
from data_manipulation import get_short_names, get_format_parameters, get_df_list_final, get_bls_data_2002_to_2015, get_student_record_df


### Docker Container for Reproducibility

If you are running an Anaconda environment, a Python version other than 3.11, or macOS, you can use Docker Desktop to build the environment.

Docker files:
- Dockerfile.yaml
- docker-compose.yaml


Steps:
1. From the project directory, run the following command in your terminal: `docker-compose up -d`

2. Go to the log files in your Docker container instance and search for the keyword "copy".

3. Copy and paste the URL provided by Jupyter Notebook. It should look like this: http://127.0.0.1:8888/tree?token=0397757e1569ed9c4b9a449e4c729de8d297c24c50a16a12.








# Introduction

## High Level Summary

Our project analyzes the distribution of undergraduate majors at a state university against the national labor market, as reported by the U.S. Bureau of Labor Statistics (BLS). A recent study found that the number of bachelor’s degree programs at 1,754 institutions across the U.S. responds dramatically to changes in labor market demand (Conzelmann, 2024). We are interested in whether this university is keeping up with labor market trends.

Additionally, we seek to uncover gender disparities in technical fields. According to the U.S. Census Bureau, women make up around half of the national workforce but are substantially underrepresented in STEM fields (Martinez, 2021). Similarly, on a national scale, women make up over half of bachelor’s degree graduates but only 38.6% of STEM graduates (Singh, 2020). Because of this, we are interested in exploring gender disparities in occupations with a high supply of students but low labor market demand.

## Objectives

By analyzing university enrollment data and demand by major, we are aiming to explore the following questions:

Question 1: Is the university keeping up with the growing demand for professional workers (as opposed to vocational and entry-level workers), with respect to BLS projections?

Question 2: Is the university over- or under-indexing relative to the BLS national average?

Question 3: Do technical fields show gender-related disparities between students and workers? If so, are these disparities trending downward?
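
One way to make Question 2 concrete is a simple index ratio: the university's share of students in an occupation group divided by the national share of workers in that group. Values above 1 suggest over-indexing and values below 1 suggest under-indexing. The sketch below is only an illustration under that assumption; the helper name `index_ratio` and all counts are hypothetical and are not taken from the project data.

```python
def index_ratio(student_counts, worker_counts):
    """Return {group: student share / worker share} for groups with workers."""
    total_students = sum(student_counts.values())
    total_workers = sum(worker_counts.values())
    ratios = {}
    for group, students in student_counts.items():
        workers = worker_counts.get(group, 0)
        if workers == 0 or total_students == 0:
            continue  # the ratio is undefined without a worker or student base
        student_share = students / total_students
        worker_share = workers / total_workers
        ratios[group] = student_share / worker_share
    return ratios


# Hypothetical counts, for illustration only.
students = {"Engineering": 300, "Management": 100, "Arts": 100}
workers = {"Engineering": 200, "Management": 600, "Arts": 200}

ratios = index_ratio(students, workers)
for group, ratio in sorted(ratios.items()):
    label = "over" if ratio > 1 else "under" if ratio < 1 else "at parity"
    print(f"{group}: {ratio:.2f} ({label})")
```

Using shares rather than raw counts keeps the comparison meaningful even though a single university is far smaller than the national workforce.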

# Data Manipulation

In [2]:

# Import the list of level dataframes used to build the treemap visuals.
# Please refer to data_manipulation.py for data manipulation documentation.
# Per DRY principles, the data manipulation functions required across all analyses
# are centralized in a single py file to support reusability, reproducibility, and readability.
# This ensures each contributor works with the same dataset.

df_level_list = get_df_list_final()
df_level_list[0].head(5)

Unnamed: 0,year,l1,number_of_students_all_sum,number_of_students_men_sum,number_of_students_women_sum,number_of_students_unknown_sum,number_of_workers_all_sum,median_weekly_earnings_all_mean,number_of_workers_men_sum,median_weekly_earnings_men_mean,number_of_workers_women_sum,median_weekly_earnings_women_mean
0,2011,Architecture and engineering occupations,1133.0,844.0,289.0,0.0,2493,753.285714,2178,739.904762,315,0.0
1,2011,"Arts, design, entertainment, sports, and media...",433.0,212.0,221.0,0.0,1464,446.368421,853,257.842105,613,136.578947
2,2011,Building and grounds cleaning and maintenance ...,0.0,0.0,0.0,0.0,3339,555.166667,2212,591.166667,1129,216.333333
3,2011,Business and financial operations occupations,2.0,1.0,1.0,0.0,5170,891.035714,2223,609.142857,2944,468.714286
4,2011,Community and social service occupations,0.0,0.0,0.0,0.0,1932,667.375,729,342.125,1202,386.5


# Tree Maps

## Tree Map Data Manipulation

In [13]:

def treemap_data_manipulation(target_metric):
    # Import the BLS hierarchy mapping used to build the treemap block levels.
    df_occupation_level_mapping = pd.read_excel('./data/bls_cpsaat39_2011_to_2015.xlsx', sheet_name='level_mapping_l0', header=0)
    # Keep one row per distinct (l4, l3, l2, l1) combination.
    df_occupation_level_mapping_distinct = df_occupation_level_mapping[['l4', 'l3', 'l2', 'l1']].drop_duplicates().reset_index(drop=True)

    # Define the target metric to build treemap.
    metric = target_metric # metric = 'number_of_workers_all_sum'
    # Extract base metric exclusive of sex to be used by the get_short_names(level, metric) formatting function.
    base_metric = metric.replace('_sum', '').replace('_women', '').replace('_men', '').replace('_all', '')

    # Create level 1 and level 2 dataframes for treemap block data objects.
    l1_grouping = df_level_list[0]
    l2_grouping = df_level_list[1]

    # Merge the hierarchy mapping onto the level 1 block grouping.
    # The mapping is de-duplicated on (l1, l2) so the left join cannot multiply level 1 rows.
    l1_grouping = pd.merge(l1_grouping,
                            df_occupation_level_mapping_distinct[['l1', 'l2']].drop_duplicates(),
                            how='left',
                            on='l1')


    # Keep only the most recent year for analysis by target metric.
    l1_grouping = l1_grouping[l1_grouping['year'] == '2015']
    l1_grouping = l1_grouping[['l1','l2', metric]]
    l2_grouping = l2_grouping[l2_grouping['year'] == '2015']
    l2_grouping = l2_grouping[['l2', metric]]

    # Map BLS long names to short names for readability in the treemap visual.
    # Names without a short form are kept unchanged.
    short_name_l2 = get_short_names('l2', base_metric)
    l1_grouping['l2'] = l1_grouping['l2'].apply(lambda x: short_name_l2.get(x, x))
    l2_grouping['l2'] = l2_grouping['l2'].apply(lambda x: short_name_l2.get(x, x))
    short_name_l1 = get_short_names('l1', base_metric)
    l1_grouping['l1'] = l1_grouping['l1'].apply(lambda x: short_name_l1.get(x, x))

    # Compute total metric value to calculate worker percentages by level category.
    total = l2_grouping[metric].sum() # total seems low, validate later on

    # Concat percentage of workers to level labels for visual.
    l1_grouping['l1'] = l1_grouping.apply(lambda x: x['l1'] + ' | ' + str(int(round(x[metric]/total*100, 0))) + '%', axis=1)
    l2_grouping['l2'] = l2_grouping.apply(lambda x: x['l2'] + ' | ' + str(int(round(x[metric]/total*100, 0))) + '%', axis=1)
    l2_lookup = {k: k + " | " + v for k, v in (x.split(" | ") for x in l2_grouping['l2'].to_list())}
    l1_grouping['l2'] = l1_grouping.apply(lambda x: l2_lookup[x['l2']], axis=1)

    # keep only records with non-zero values, otherwise treemap will throw a divide by zero error
    l1_grouping = l1_grouping[l1_grouping[metric] > 0]
    l2_grouping = l2_grouping[l2_grouping[metric] > 0]

    return l1_grouping, l2_grouping
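
The percentage-label and parent-lookup steps in `treemap_data_manipulation` may be easier to follow on a toy frame. The sketch below assumes pandas is available and uses invented group names and counts; the `with_pct` helper and the `count` column are hypothetical. It reproduces only the labeling logic, not the BLS mapping or the `get_short_names` helper.

```python
import pandas as pd

metric = "count"  # hypothetical metric column name

# Invented parent (level 2) and child (level 1) groupings.
l2 = pd.DataFrame({"l2": ["Professional", "Service"], metric: [75, 25]})
l1 = pd.DataFrame({
    "l1": ["Engineering", "Management", "Cleaning"],
    "l2": ["Professional", "Professional", "Service"],
    metric: [50, 25, 25],
})

total = l2[metric].sum()


def with_pct(name, value):
    """Append the rounded share of the overall total to a label."""
    return f"{name} | {int(round(value / total * 100))}%"


# Label the children, then build a parent lookup keyed by the unlabeled name.
l1["l1"] = [with_pct(n, v) for n, v in zip(l1["l1"], l1[metric])]
l2_labels = {n: with_pct(n, v) for n, v in zip(l2["l2"], l2[metric])}
l2["l2"] = l2["l2"].map(l2_labels)
l1["l2"] = l1["l2"].map(l2_labels)

print(l1)
```

Building the lookup before relabeling the parents avoids splitting the finished label strings apart again, which is one possible simplification of the approach used in the function.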


In [16]:

def build_project_treemap(target_metric='number_of_students_all_sum'):

    # Define the target metric to build treemap.
    metric = target_metric # metric = 'number_of_workers_all_sum'
    # Extract base metric exclusive of sex to be used by the get_short_names(level, metric) formatting function.
    base_metric = metric.replace('_sum', '').replace('_women', '').replace('_men', '').replace('_all', '')

    l1_grouping, l2_grouping = treemap_data_manipulation(target_metric)

    # The following code is derived from the Bokeh treemap example: https://docs.bokeh.org/en/latest/docs/examples/topics/hierarchical/treemap.html
    # This helper builds the blocks required for the treemap visual, i.e. parent level 2 blocks and child level 1 blocks.
    def treemap(df, col, x, y, dx, dy, *, N=100):
        sub_df = df.nlargest(N, col)
        normed = normalize_sizes(sub_df[col], dx, dy)
        blocks = squarify(normed, x, y, dx, dy)
        blocks_df = pd.DataFrame.from_dict(blocks).set_index(sub_df.index)
        return sub_df.join(blocks_df, how='left').reset_index()

    # Get unique level 2 labels
    l2s = tuple(l2_grouping['l2'].unique().tolist())

    # Define the canvas for the level 2 (parent) blocks.
    x, y, w, h = 0, 0, 2000, 1125
    blocks_by_L2 = treemap(l2_grouping, metric, x, y, w, h)

    # Keep only the columns needed to lay out the level 1 blocks inside each parent.
    blocks_by_L2 = blocks_by_L2[['l2', metric, 'x', 'y', 'dx', 'dy']] 

    # Iterate through each level 2 block and build its level 1 child blocks.
    dfs = []
    for _, (l2, _metric_value, x, y, dx, dy) in blocks_by_L2.iterrows():
        df = l1_grouping[l1_grouping.l2==l2]    
        # Lay out the level 1 blocks inside the parent level 2 rectangle.
        dfs.append(treemap(df, metric, x, y, dx, dy, N=100))

    # Concat level 1 blocks
    blocks = pd.concat(dfs)

    # Create Bokeh figure object
    p = figure(width=w, height=h, tooltips="@l1", toolbar_location=None,
            x_axis_location=None, y_axis_location=None)
    p.x_range.range_padding = p.y_range.range_padding = 0
    p.grid.grid_line_color = None


    # Get treemap format parameters
    param_set = get_format_parameters(metric=base_metric, number_of_groups=len(l2_grouping))
    block_params = param_set['block']
    l2_params = param_set['l2']
    l1_params = param_set['l1']

    # Draw the level 1 blocks, colored by their level 2 parent
    p.block('x', 'y', 'dx', 'dy', source=blocks, line_width=block_params['line_width'], line_color=block_params['line_color'],
            fill_alpha=block_params['fill_alpha'],
            fill_color=factor_cmap("l2", block_params['palette'][::-1] if 'women' in metric or 'all' in metric else block_params['palette'], l2s)) # legend_field="l2",  

    # Apply formats from the get_format_parameters() helper function to the level 2 parent labels
    p.text('x', 'y',  x_offset=l2_params['x_offset'], y_offset=l2_params['y_offset'], text="l2", source=blocks_by_L2,
        text_font_size=l2_params['text_font_size'], text_color=l2_params['text_color'])

    blocks["ytop"] = blocks.y + blocks.dy

    # Apply formats from the get_format_parameters() helper function to the level 1 child labels
    p.text('x', 'ytop', x_offset=l1_params['x_offset'], y_offset=l1_params['y_offset'], text="l1", source=blocks,
        text_font_size=l1_params['text_font_size'], text_baseline=l1_params['text_baseline'],
        text_color=l1_params['text_color'])

    # function required to display bokeh plot in jupyter notebook
    output_notebook()
    # display() and BokehModel() required to display visual in jupyter notebook native VSCode IDE
    handle = display(BokehModel(p))
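
The nested layout relies on scaling each group's value to the area of the rectangle it must fill: first the full canvas for the level 2 parents, then each parent rectangle for its level 1 children. The pure-Python sketch below illustrates only that proportional scaling. `scale_to_area` is a hypothetical stand-in written for this example, on the assumption that the normalization step is a plain proportional rescale; it is not the squarify implementation and it computes no rectangle positions. All values are invented.

```python
def scale_to_area(values, width, height):
    """Rescale values so that they sum to width * height."""
    total = sum(values)
    if total <= 0:
        # A zero total has no proportional layout, which is why
        # zero-valued groups are filtered out before plotting.
        raise ValueError("values must have a positive sum")
    area = width * height
    return [v * area / total for v in values]


# Hypothetical parent values on a 2000 x 1125 canvas.
parents = [60, 30, 10]
parent_areas = scale_to_area(parents, 2000, 1125)

# Children of the first parent are rescaled to that parent's area;
# a 1500 x 900 rectangle is assumed here purely for illustration.
children = [3, 1]
child_areas = scale_to_area(children, 1500, 900)

print(parent_areas)
print(child_areas)
```

Because each child set is rescaled to its own parent rectangle, block areas stay comparable across the whole figure.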


In [17]:
build_project_treemap(target_metric='number_of_workers_all_sum')

[Bokeh treemap output: number_of_workers_all_sum]

In [18]:
build_project_treemap(target_metric='number_of_students_all_sum')

[Bokeh treemap output: number_of_students_all_sum]

# Heat Maps

## Heat Map Data Manipulation

## Heat Map Visuals

# Time Series Analysis

## Time Series Data Manipulation

In [6]:

merge_record_df = get_student_record_df()
df_bls_all_filtered = get_bls_data_2002_to_2015()

def data_prep(data1, data2):
    # Student Data Prep
    major_df = pd.read_excel("./data/majors.xlsx")
    student_Occupations_df = pd.merge(
                    data1,
                    major_df, 
                    how='left',
                    left_on='MAJOR1_DESCR',
                    right_on='MAJOR')

    pivot_table = pd.pivot_table(
        student_Occupations_df,
        index=['OCCUPATION', 'MAJOR1_TERM_TERM_YEAR'],  # Rows
        columns='SEX',                                  # Columns
        aggfunc='size',                                 # Count of occurrences
        fill_value=0                                    # Fill missing values with 0
    )
    pivot_table.reset_index(inplace=True)
    pivot_table['MAJOR1_TERM_TERM_YEAR'] = pivot_table['MAJOR1_TERM_TERM_YEAR'].apply(lambda x: int(x))

    student_melted = pd.melt(pivot_table, 
                        id_vars=['MAJOR1_TERM_TERM_YEAR', 'OCCUPATION'],  # Columns to keep
                        value_vars=['F', 'M'],  # Columns to unpivot
                        var_name='SEX',  # Name for new column indicating Sex (Females/Males)
                        value_name='count')  # Name for new column holding the values

    # BLS Data Prep
    # Combine the "16 years and over" and "20 years and over" columns into a single count per sex.
    # Rows whose sum is not a float (for example, non-numeric values) are set to None.
    def getMales(row):
        x = row['male_16_years_and_over'] + row['male_20_years_and_over']
        return x if isinstance(x, float) else None

    data2['Males'] = data2.apply(getMales, axis=1)

    def getFemales(row):
        x = row['female_16_years_and_over'] + row['female_20_years_and_over']
        return x if isinstance(x, float) else None

    data2['Females'] = data2.apply(getFemales, axis=1)

    table = pd.pivot_table(data2, values=['Males', 'Females'], index=['year', 'occupation'],
                           aggfunc="sum")
    table.reset_index(inplace=True)
    table = table[table['occupation'] != 'Unknown']
    table['year'] = table['year'].apply(lambda x: int(x))

    # Melt the BLS data
    bls_melted = pd.melt(table, 
                        id_vars=['year', 'occupation'],  # Columns to keep
                        value_vars=['Females', 'Males'],  # Columns to unpivot
                        var_name='SEX',  # Name for new column indicating Sex (Females/Males)
                        value_name='count')  # Name for new column holding the values
    
    student_melted = student_melted.rename(columns={'MAJOR1_TERM_TERM_YEAR': 'year'})
    bls_melted = bls_melted.rename(columns={'occupation': 'OCCUPATION'})

    return student_melted, bls_melted

# Call the function
student_melted, bls_melted = data_prep(merge_record_df, df_bls_all_filtered)
student_melted.head(5)


Unnamed: 0,year,OCCUPATION,SEX,count
0,1982,Architecture and engineering occupations,F,0
1,1983,Architecture and engineering occupations,F,0
2,1984,Architecture and engineering occupations,F,2
3,1985,Architecture and engineering occupations,F,0
4,1986,Architecture and engineering occupations,F,2
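
The student branch of `data_prep` counts records into a wide table (one column per sex) and then unpivots it into the long format that the charts consume. The sketch below shows that round trip on a handful of invented records. It assumes pandas is available, and the shortened occupation names are hypothetical.

```python
import pandas as pd

# Hypothetical student records, one row per student.
records = pd.DataFrame({
    "OCCUPATION": ["Eng", "Eng", "Eng", "Mgmt"],
    "year": [2014, 2014, 2015, 2015],
    "SEX": ["F", "M", "M", "F"],
})

# Count students per occupation, year, and sex; absent combinations become 0.
wide = pd.pivot_table(
    records,
    index=["OCCUPATION", "year"],
    columns="SEX",
    aggfunc="size",
    fill_value=0,
).reset_index()

# Unpivot back to one row per (occupation, year, sex).
long = pd.melt(
    wide,
    id_vars=["year", "OCCUPATION"],
    value_vars=["F", "M"],
    var_name="SEX",
    value_name="count",
)

print(long)
```

Filling missing combinations with 0 before melting means every occupation and year that appears in the records has both an `F` and an `M` row, which would explain the zero counts in the output above.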


## Time Series Visuals

In [7]:
def TimeSeriesBar(data, occupation, people):
    data = data.copy()
    data = data[(data['year'] > 2008) & (data['year'] <= 2016)]

    # Ensure the 'year' column is in the correct format (datetime)
    data['year'] = pd.to_datetime(data['year'].astype(str))
    data = data[data['OCCUPATION'] == occupation]

    # Define your custom color scheme
    custom_colors = [
        '#FBEA65',  # for Female
        '#081F49',  # for Male
    ]

    # Create the stacked bar chart with specified color order and no legend
    bar_chart = alt.Chart(data).mark_bar(size=15, stroke='black', strokeWidth=1).encode(
        x=alt.X("year:T",
                axis=alt.Axis(grid=False, title="Year"),
                scale=alt.Scale(domain=[pd.to_datetime('2008-01-01'), pd.to_datetime('2016-01-01')])
        ),
        y=alt.Y('count', stack=True, axis=alt.Axis(grid=False, title=people)),
        color=alt.Color("SEX", 
                        scale=alt.Scale(range=custom_colors),  # Use the custom colors for SEX
                        legend=None  # Remove the legend
        )
    ).properties(
        title={
            "text": f"{occupation}",
            "subtitle": f"Distribution by Sex for {people}",
            "anchor": "start"        # Align title to the left
        },
        width=200,    # Set custom width
        height=200   # Set custom height
    )
    
    return bar_chart


In [8]:
TimeSeriesBar(student_melted, 'Life, physical, and social science occupations', "Students")

In [9]:
TimeSeriesBar(bls_melted, 'Life, physical, and social science occupations', "Workers (in thousands)")

In [10]:
TimeSeriesBar(student_melted, 'Architecture and engineering occupations', "Students")

In [11]:
TimeSeriesBar(bls_melted, 'Architecture and engineering occupations', "Workers (in thousands)")

In [12]:
TimeSeriesBar(student_melted, 'Management occupations', "Students")
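
The stacked bars show counts, so a shift in the gender mix can be hard to read when the totals also change from year to year. One complementary view for Question 3 is the female share per year. The sketch below assumes long-format rows shaped like `student_melted` (`year`, `SEX`, `count`) and uses invented counts; `female_share_by_year` is a hypothetical helper, not part of `data_manipulation.py`.

```python
from collections import defaultdict


def female_share_by_year(rows, female_label="F"):
    """Return {year: female count / total count}, skipping empty years."""
    female = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["year"]] += row["count"]
        if row["SEX"] == female_label:
            female[row["year"]] += row["count"]
    return {
        year: female[year] / total[year]
        for year in sorted(total)
        if total[year] > 0
    }


# Hypothetical counts, for illustration only.
rows = [
    {"year": 2014, "SEX": "F", "count": 20},
    {"year": 2014, "SEX": "M", "count": 80},
    {"year": 2015, "SEX": "F", "count": 30},
    {"year": 2015, "SEX": "M", "count": 90},
]

shares = female_share_by_year(rows)
print(shares)
```

For rows shaped like `bls_melted`, the same helper could be called with `female_label="Females"`, since that frame labels sex differently.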