In [None]:
# Install the required libraries
!pip install prophet

In [None]:
# Dependencies
# NOTE: We might not use all of these. I just imported everything I could think of for now. We'll delete the ones we don't need later
import requests
import time
from dotenv import load_dotenv
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import json
from pathlib import Path
import re

In [None]:
#Read data into the notebook
linkedin_postings_df = pd.read_csv('./data_sets/postings.csv').dropna()
machine_learning_jobs_df = pd.read_json('./data_sets/job_data.json', lines=True)

# Normalize, clean, massage, and combine data for ease of processing

# Cast all job skills to lower case strings to standardize string matching later
linkedin_postings_df['job_skills'] = linkedin_postings_df['job_skills'].apply(lambda item: item.lower().split(', '))

***Introduction*** 
The goal of this exploratory data analysis is to characterize and investigate the growth of machine learning as a job skill. We are interested in looking at this topic from a number of angles. TBC....
Our dataset includes 9367 job postings taken from LinkedIn for the 2023 calendar year.

QUESTION Geography 

In [None]:
# Insert Question 1 analysis and visualizations here. Insert new cells if necessary 

Q1 Summary \[INSERT SUMMARY HERE] ... write a little about what the findings above seem to indicate about question 1

**Section 2: Relative Comparisons of Tech Workers with/without Machine Learning**
Another dimension by which to look at the growth of machine learning is to compare software developers *with* ML experience to those *without* it. Layoffs among general tech workers have been pronounced in recent months ([layoffs.fyi](https://layoffs.fyi/)), yet demand for machine learning engineers has reportedly increased 42% on Hired's platform ([Hired](https://hired.com/resources/articles/trends-software-engineer-specializations/)). This suggests that, as a specialization, ML ought to be treated separately from other tech skills.

***Basic metrics***
As a first step, we compare the proportion of machine learning related job listings to those lacking any such term. We achieve this by string matching each listing's skills against a common list of ML-related terms.

In [None]:
# Get all job listings with an AI related keyword  listed as a skill requirement 
terms_to_match = ['machine learning', 'artificial intelligence', 'pytorch', 'langchain', 'ai', 'tensorflow', 'deep learning', 'neural network', 
               'natural language processing', 'nlp', 'computer vision', 'large language models', 'chatbot', 'ai chatbot', 'llm', 'llms', 'generative ai', 'generative models', 'genai', 'bert', 'spacy', 'nltk', 'keras', 'gpt', 'chatgpt', 'prompt development', 'prompt engineering']

linkedin_postings_df['has_ai'] = linkedin_postings_df['job_skills'].apply(
    lambda skills: any(term in skills for term in terms_to_match)
)
# Separate the groupings into two new dataframes; .copy() avoids pandas'
# SettingWithCopyWarning when columns are added to these subsets later
ai_roles = linkedin_postings_df.loc[linkedin_postings_df['has_ai']].copy()
# Job listings without AI keywords will be classified as "general" roles
general_roles = linkedin_postings_df.loc[~linkedin_postings_df['has_ai']].copy()

Relative proportion of ML jobs to non-ML tech jobs

In [None]:
# Raw counts of ML to non-ML
display(linkedin_postings_df['has_ai'].value_counts())
# Proportion of ML to non-ML
display(linkedin_postings_df['has_ai'].value_counts(normalize=True))

Given that x of the y job listings contain an AI related keyword, approximately z% of all listings are AI-related to some degree.

Another dimension we want to investigate is the relative proportion of job levels, or experience requirements. Considering that AI is a newer skill set, we predict a greater proportion of entry level positions compared to generic tech roles based on skills that have been around longer.

In [None]:
print(ai_roles['job level'].value_counts(normalize=True))
print(general_roles['job level'].value_counts(normalize=True))

We find that at least for this data set, the relative demand for associate versus mid-senior level developers for AI is almost identical to the relative demand of the same for general developers. Notably, demand for mid-senior is slightly lower for ML-related jobs, although it remains to be seen if this is statistically significant. 
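
Whether the small gap in mid-senior share is statistically meaningful could be checked with a two-proportion z-test. The sketch below is a minimal illustration and not part of the analysis above: the function name `two_proportion_z_test` and the counts passed to it are hypothetical placeholders, and the real counts would come from `value_counts()` on each dataframe.

```python
import math

def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled proportion under the null hypothesis
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not taken from the data set):
# mid-senior listings out of all listings in each group
z, p_value = two_proportion_z_test(850, 1000, 7300, 8367)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A small p-value would suggest the difference in proportions is unlikely to be due to chance alone, although the test assumes the listings are independent samples.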

Next, within the AI subset, we wanted to see which skills were most in demand. For this we will take our list of AI related job skill keywords and construct a new dataframe with a dictionary using the keywords as keys and the frequencies as values.  

In [None]:
# Use a list comprehension and the dict method to turn tuples into key value pairs
keyword_dict = dict([(keyword, 0) for keyword in terms_to_match])
# Count the frequencies the terms appear in each skill set
def increment_keywords(term_list:list, dictionary: dict): 
    for skill in term_list:
        if skill in dictionary:
            dictionary[skill] += 1
    return term_list
ai_roles['job_skills'].apply(lambda skills: increment_keywords(skills, keyword_dict))
#Convert dict to series 
ai_skill_series = pd.Series(keyword_dict)

# Here we clean up the data by de-duplicating different keyword labels and summing them together
label_mapping = {'artificial intelligence': 'ai', 'natural language processing': 'nlp', 'large language models': 'llms', 'llm': 'llms', 'generative models' : 'genai', 'generative ai': 'genai'}

# Replace the labels in the series index
ai_skill_series.index = ai_skill_series.index.to_series().replace(label_mapping)

# Aggregate the data - sum the values with the same label
ai_skill_series = ai_skill_series.groupby(ai_skill_series.index).sum()

# Display the resulting series
print(ai_skill_series)

In [None]:
ai_skill_series.sort_values(ascending=False).plot(kind='bar')

We are also interested in the more general skill breakdown. Within the AI-related job listings, how in demand are the skills our previous term filter didn't control for?

In [None]:

def count_skills(skills, label_mapping):
    # Get a list of all job skills in the given roles
    skill_list_of_lists = skills.values.tolist()
    # Merge the list of lists into a 1-D list
    merged_skill_lists = sum(skill_list_of_lists, [])
    skill_series = pd.Series(merged_skill_lists)
    # Get the unique labels for skills
    flattened_skill_list = skill_series.tolist()
    unique_skill_list = skill_series.unique().tolist()
    # Create a dictionary to count each occurrence of the skill
    unique_dict = dict([(keyword, 0) for keyword in unique_skill_list])
    increment_keywords(flattened_skill_list, unique_dict)
    skill_count_series = pd.Series(unique_dict).sort_values(ascending=False)
    # Aggregate any specified redundant labels and sum them under the same grouping
    skill_count_series.index = skill_count_series.index.to_series().replace(label_mapping)
    skill_count_series = skill_count_series.groupby(skill_count_series.index).sum()
    return skill_count_series

all_job_skills_series = count_skills(ai_roles['job_skills'], label_mapping)
# Sort and display the top 20 most in-demand skills, including our AI keywords
display(all_job_skills_series.sort_values(ascending=False).head(20).plot(kind='bar', title='Most in Demand Skills for AI Related Jobs'))

Not unexpectedly, python appears to be highly in demand for machine learning roles. Note also that among roles that contain some AI-focused facet, machine learning is the most frequently occurring skill overall.

Perhaps more interesting is that a fraction of job listings including machine learning do not also include python as an additional skill, as the number of counts for the python keyword is less than the counts for machine learning. This raises the question: do those roles including machine learning but excluding python have anything in common or special about them? 

For this query we specify a filter condition where we search for data items in the job skills column that have the keyword "machine learning" but not the keyword "python".
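
The filter described above could look something like the sketch below. It assumes each entry of the `job_skills` column is a list of lower-case skill strings, as prepared earlier in the notebook; the helper name `has_ml_without_python` and the toy listings are hypothetical and only stand in for the real rows.

```python
def has_ml_without_python(skills):
    """True when a skill list names 'machine learning' but not 'python'."""
    return 'machine learning' in skills and 'python' not in skills

# Toy skill lists standing in for rows of the job_skills column (illustrative only)
toy_listings = [
    ['machine learning', 'python', 'sql'],
    ['machine learning', 'matlab'],
    ['java', 'aws'],
]
ml_without_python = [skills for skills in toy_listings if has_ml_without_python(skills)]
print(ml_without_python)
```

On the real data the same predicate could serve as a boolean mask, for example `ai_roles[ai_roles['job_skills'].apply(has_ml_without_python)]`.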

Another point of comparison is to measure the degree of overlap between skills for ai related roles and general software roles when controlling for the AI-specific skills. In other words, how similar are the skill demands for all software jobs when machine learning results are excluded?

In [None]:
# First, take the above results but filter out the AI related keywords by applying a data mask
# Note: this is a substring match, so any skill label that contains one of the terms is dropped
mask = all_job_skills_series.index.to_series().apply(lambda x: not any(keyword in x for keyword in terms_to_match))
filtered_skills = all_job_skills_series[mask]

# Next, apply the skill-counting logic above to the general roles dataframe
all_job_skills_general = count_skills(general_roles['job_skills'], {})

def render_normalized_plot(series1, series2, plot_title, column_names, xlabel, ylabel, display_count):
    # Normalize the series
    normalized_series1 = (series1 / series1.sum()) * 100
    normalized_series2 = (series2 / series2.sum()) * 100
    
    # Create a DataFrame with the normalized data
    df_normalized = pd.DataFrame({
        column_names[0]: normalized_series1,
        column_names[1]: normalized_series2
    }).fillna(0)  # Ensure no NaN values
    
    # Sort the DataFrame based on the first column
    df_normalized = df_normalized.sort_values(by=column_names[0], ascending=False).head(display_count)
    
    # Plot the normalized data
    ax = df_normalized.plot(kind='bar', width=0.8, figsize=(14, 7), color=['skyblue', 'red'], alpha=0.7)
    plt.title(plot_title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend(title='Dataset')
    plt.tight_layout()
    plt.show()


# Assuming filtered_skills and all_job_skills_general are defined
render_normalized_plot(
    filtered_skills, 
    all_job_skills_general, 
    'Normalized Comparison of Top 20 Skills for AI and non-AI SWE jobs', 
    ['Normalized AI Job Skills', 'Normalized General Job Skills'], 
    'Skills', 
    'Percentage of Total Mentions (%)',
    20
)

The above comparative bar chart shows the breakdown of relative demand for desired skills for machine learning engineers compared to general software engineers as a proportion of total mentions, controlled for AI-specific keywords. Skills that show only one bar did not appear in the other data set's skill counts.

What is most striking about this graph is that the most in-demand skills for both data sets are comparable. For both datasets, python and java are the most in-demand skills overall, with aws, javascript and sql appearing close to the top as well, in slightly different rankings.

It is also curious that "software engineering" as a distinct skill ranks higher for the AI roles, as the 4th overall skill, whereas it is ranked 9th for the general developer roles. One possible explanation is that the number of distinct skills for general software roles is likely much greater, because that data set is several times larger than the AI role subset. This would increase the odds that more skills overall are mentioned and could bias the data.

Two cloud computing related keywords, "azure" and "cloud computing" appear for the AI related roles but not the general roles.

Demand for python is notably elevated for the AI data set, but most skills that are shared between both data sets appear with almost the same proportional frequency. It would make sense that web technologies such as html, angular and css might be missing from the AI data set, as ML roles are unlikely to put much emphasis on browser-based frontend web development. A notable exception is that demand for react, a frontend UI framework, is comparable between datasets. Data science and devops also appear for the AI roles only.
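
One way to put a single number on the overlap discussed above is the Jaccard similarity between the two top-skill sets. This metric is not used in the analysis above; it is offered only as a possible complement. The function name `jaccard_similarity` and the example skill lists are hypothetical.

```python
def jaccard_similarity(skills_a, skills_b):
    """Size of the intersection divided by the size of the union of two skill collections."""
    set_a, set_b = set(skills_a), set(skills_b)
    union = set_a | set_b
    if not union:
        # Two empty collections share nothing measurable
        return 0.0
    return len(set_a & set_b) / len(union)

# Illustrative top-skill lists, not the actual results
top_ai_skills = ['python', 'java', 'aws', 'azure', 'sql']
top_general_skills = ['python', 'java', 'aws', 'html', 'sql']
overlap = jaccard_similarity(top_ai_skills, top_general_skills)
print(f"Jaccard similarity: {overlap:.2f}")
```

With the real series, the inputs could be the index labels of the top 20 entries of `filtered_skills` and `all_job_skills_general` after sorting each in descending order.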

For one final point of comparison between AI-related and general SWE roles, we wanted to see if there was a meaningful difference in minimum requirements for years of experience. We would expect that machine learning, being a newer field, would have lower minimum requirements compared to the baseline.

The raw data did not include a nicely isolated years of experience field, but it did contain a job_summary field which follows a somewhat industry-standard template, so it is possible in principle to parse years of experience out of the job description. The caveat is that, given the irregularity of the summary texts, not all information can be reliably extracted. At best we can use a regex heuristic to recover some of the years of experience information.

In [None]:
# We want to do some regex matching for the most common string patterns. For "years of experience" as a first approximation 
# we found that the most common patterns are ranges n-m, as in 3-5 years of experience, a minimum number and a plus sign n+ years of experience, or a single number, n years of experience.  
def match_number_patterns(text):
    # First clear all spaces from the string
    cleaned_str = text.replace(' ', '')
    # Construct the regular expression to account for the above cases
    pattern = r'(\d+-\d+|\d+\+|\d+)'
    # Find all matches in the cleaned string using the re library
    matches = re.findall(pattern, cleaned_str)
    if matches:
        # Join all matches into a single string
        return ''.join(matches)
    # Return None if no matches were found
    return None


def extract_info_from_summaries(summary: str, search_term: str, slice_length: int):
    # Find the index of the first character of the first instance of the search term
    search_term_index = summary.find(search_term)
    # Check if the search term is found
    if search_term_index != -1:
        # Calculate start index based on slice length
        start_index = search_term_index + slice_length if slice_length < 0 else search_term_index
        # Adjust the end index if slice_length is negative
        end_index = search_term_index if slice_length < 0 else search_term_index + slice_length
        # Return the substring
        return match_number_patterns(summary[start_index:end_index])
    else:
        # Return None if the search term is not found
        return None

# Apply the extract_info function to the job summaries of both dataframes 
# Use the same search term for both groups so that the extracted values are comparable
ai_roles['years_exp'] = ai_roles['job_summary'].apply(lambda summary: extract_info_from_summaries(summary, 'years of experience', -4))
general_roles['years_exp'] = general_roles['job_summary'].apply(lambda summary: extract_info_from_summaries(summary, 'years of experience', -4))

# This function will fail to extract meaningful info for many entries, populating indexes of the list with None or if the re.findall condition triggers, 'nan'
# Additional filtering is necessary to remove the nullish entries 
filtered_years_exp_ai = [entry for entry in ai_roles['years_exp'].tolist() if entry and str(entry).lower() != 'nan']
filtered_years_exp_general = [entry for entry in general_roles['years_exp'].tolist() if entry and str(entry).lower() != 'nan']
#Finally, we are left with string character representations of numeric information. Since we want to conduct some basic statistics on this data
# one last round of processing is necessary to remove non-numeric symbols such as '+' signs, and to take the lower value of any ranges
def cast_to_int(string: str):
    # Remove '+' symbols at the end of the string
    if string.endswith('+'):
        return int(string[:-1])

    # If the string contains a range indicated by '-', take the lower bound
    if '-' in string:
        lower_bound = string.split('-')[0]  # Split the string at '-' and take the first part
        return int(lower_bound) 

    # If the string is a plain number, directly convert it to an integer
    return int(string)

finalized_years_ai = [cast_to_int(entry) for entry in filtered_years_exp_ai]
finalized_years_general = [cast_to_int(entry) for entry in filtered_years_exp_general]

years_ai_series = pd.Series(finalized_years_ai)
years_general_series = pd.Series(finalized_years_general)

Having obtained rough years-of-experience values, we found some outliers that required investigation. Impossible values such as 200, and implausible values such as 20, proved on manual inspection of the csv to be references to the company's own years of experience. For example, several Raytheon job listings contained the string "we bring the strength of more than 100 years of experience and renowned engineering expertise..." After filtering out these and other invalid data points identified manually, the data is ready for analysis.

In [None]:
years_ai_series = years_ai_series[years_ai_series < 16].sort_values(ascending=False)
years_general_series = years_general_series[years_general_series < 20].sort_values(ascending=False)
years_of_experience_df = pd.DataFrame({
    'Minimum Years Experience for AI roles': years_ai_series,
    'Minimum Years Experience for General Roles': years_general_series
})
years_of_experience_df.describe()

*Interpretation:* 
As predicted, the average years of experience is lower for ML roles, although only by a small margin. Considering the imperfections of the methods employed, it could be worthwhile to see whether improved extraction techniques would further refine this finding.
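
As one possible refinement, a single regular expression could capture the number directly attached to the phrase instead of slicing a fixed four characters before it. The sketch below is an assumption about what such an improvement might look like, not the method used above; the names `YEARS_PATTERN` and `extract_min_years` and the sample sentences are hypothetical.

```python
import re

# Matches forms such as "3-5 years of experience", "3 to 5 years of experience",
# "5+ years of experience" and "2 years of experience", capturing the lower bound.
# The lookbehind rejects numbers with three or more digits.
YEARS_PATTERN = re.compile(
    r'(?<!\d)(\d{1,2})\s*(?:\+|(?:-|to)\s*\d{1,2})?\s*years?\s+of\s+experience',
    re.IGNORECASE,
)

def extract_min_years(summary):
    """Return the lower bound of the first years-of-experience mention, or None."""
    match = YEARS_PATTERN.search(summary)
    return int(match.group(1)) if match else None

# Made-up sentences for illustration
samples = [
    'Requires 3-5 years of experience with Python.',
    '5+ years of experience in ML.',
    'We bring the strength of more than 100 years of experience.',
]
print([extract_min_years(sample) for sample in samples])
```

Because three-digit figures are rejected, phrases describing a company's long history are less likely to be picked up, though two-digit cases such as a company's "20 years of experience" would still need manual review.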

Q2 Summary \[INSERT SUMMARY HERE]

In [None]:
# Some geography related pre-processing

# Get the initials of each state
state_abbreviations = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
    ]
def get_location(location_str: str):
    # Note: because the job_location field is inconsistent in the data set, we need to do a little data preparation
    suffix = location_str[-2:]
    if suffix == 'om':
        # Handle British jobs, as the last two characters come from "United Kingdom"
        return None
    elif suffix == 'da':
        # Handle Canadian jobs
        return None
    elif suffix not in state_abbreviations:
        # Handle the situation where the string is too heterogeneous to classify within reasonable bounds
        return None
    else:
        # Otherwise simply return the state abbreviation
        return suffix

linkedin_postings_df['State'] = linkedin_postings_df['job_location'].apply(lambda item: get_location(item))
state_counts = linkedin_postings_df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Count']
geo_df = state_counts.copy()
geo_df.set_index('State').plot(kind='bar', figsize=(15, 9), width=0.9, legend=False, title='Tech Jobs by US State')

**Interpretation**
California has the most job listings by a wide margin (about twice as many as its nearest competitor, TX), as expected given that it is the traditional tech hub of the US. The healthy showing of job listings in Texas and Florida might be attributed to generous tax breaks and a business-friendly environment.

Next, we want to see if there is any proportional difference in the distribution of AI related roles to baseline. 

In [None]:
# Apply get_location function to extract states
ai_roles['State'] = ai_roles.loc[:,'job_location'].apply(lambda item: get_location(item))
general_roles['State'] = general_roles.loc[:,'job_location'].apply(lambda item: get_location(item))

# Plot the raw counts for ML jobs separately 
ai_roles.loc[:,'State'].value_counts().plot(kind='bar', figsize=(15, 9), width=0.9, legend=False, title='AI Jobs by US State')

# Count the number of jobs per state
ai_state_counts = ai_roles.loc[:,'State'].value_counts().reset_index()
ai_state_counts.columns = ['State', 'Count']
general_state_counts = general_roles.loc[:,'State'].value_counts().reset_index()
general_state_counts.columns = ['State', 'Count']

# Ensure the Count columns are numeric (sorting is handled later inside render_normalized_plot)
ai_state_counts['Count'] = pd.to_numeric(ai_state_counts['Count'])
general_state_counts['Count'] = pd.to_numeric(general_state_counts['Count'])
# Merge the DataFrames to align the states
merged_counts = pd.merge(ai_state_counts, general_state_counts, on='State', how='outer', suffixes=('_AI', '_General')).fillna(0)
merged_counts.set_index('State', inplace=True)
render_normalized_plot(
    merged_counts['Count_AI'], merged_counts['Count_General'], 
    'Normalized Comparison of Job Distribution by State for AI and Non-AI Related Tech Jobs',
    ['Normalized AI Job Counts', 'Normalized General Job Counts'],
    'State',
    'Percentage of Total Mentions (%)',
    50
)


**Interpretation** 
These findings show a different picture. CA accounts for a far greater share of ML jobs than of non-AI tech jobs. MA has the second-largest count and is also disproportionately favored for ML jobs compared to baseline, followed by WA. One possible explanation is that MA is a major research hub, so demand for ML there might be higher than average given its wide application in R&D. The notably greater proportion of ML roles in WA could be explained by the fact that both Microsoft, which has been investing heavily in AI, and Amazon are headquartered in that state.
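
The idea of a state being "disproportionately favored" can be made concrete by dividing each state's share of AI listings by its share of general listings, a ratio sometimes called a location quotient. The sketch below is illustrative only: the function name `share_ratio_by_state` and the counts are hypothetical, not values from the data set.

```python
def share_ratio_by_state(ai_counts, general_counts):
    """Ratio of each state's share of AI listings to its share of general listings.

    A value above 1 means the state is over-represented among AI listings.
    States with no general listings are skipped to avoid division by zero.
    """
    ai_total = sum(ai_counts.values())
    general_total = sum(general_counts.values())
    ratios = {}
    for state, general_count in general_counts.items():
        if general_count == 0:
            continue
        ai_share = ai_counts.get(state, 0) / ai_total
        general_share = general_count / general_total
        ratios[state] = ai_share / general_share
    return ratios

# Hypothetical counts for illustration only
ai_counts = {'CA': 40, 'MA': 20, 'TX': 20, 'WA': 20}
general_counts = {'CA': 300, 'MA': 100, 'TX': 400, 'WA': 200}
ratios = share_ratio_by_state(ai_counts, general_counts)
print({state: round(value, 2) for state, value in ratios.items()})
```

With the notebook's data, the two dictionaries could be built from `merged_counts['Count_AI'].to_dict()` and `merged_counts['Count_General'].to_dict()`.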

Question Job Skills

In [None]:
# etc 

Q3 Summary \[INSERT SUMMARY HERE]

Question Seniority/Job level

In [None]:
# etc 

Question 5 Industry demand 