In [1]:
# Importing Dependencies
import pandas as pd
import re

# **Exploratory Data Analysis (EDA)**
## Ticket 1.1: Final Target Classification Trials
### Data Ingestion


In [2]:
# Load the data
job_postings = pd.read_csv('../../../../data/job_postings.csv')

display("Job Postings Dataset:")
display(job_postings.head())

'Job Postings Dataset:'

Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type
0,https://www.linkedin.com/jobs/view/senior-mach...,2024-01-21 08:08:48.031964+00,Finished NER,t,t,f,Senior Machine Learning Engineer,Jobs for Humanity,"New Haven, CT",2024-01-14,East Haven,United States,Agricultural-Research Engineer,Mid senior,Onsite
1,https://www.linkedin.com/jobs/view/principal-s...,2024-01-20 04:02:12.331406+00,Finished NER,t,t,f,"Principal Software Engineer, ML Accelerators",Aurora,"San Francisco, CA",2024-01-14,El Cerrito,United States,Set-Key Driver,Mid senior,Onsite
2,https://www.linkedin.com/jobs/view/senior-etl-...,2024-01-21 08:08:31.941595+00,Finished NER,t,t,f,Senior ETL Data Warehouse Specialist,Adame Services LLC,"New York, NY",2024-01-14,Middletown,United States,Technical Support Specialist,Associate,Onsite
3,https://www.linkedin.com/jobs/view/senior-data...,2024-01-20 15:30:55.796572+00,Finished NER,t,t,f,Senior Data Warehouse Developer / Architect,Morph Enterprise,"Harrisburg, PA",2024-01-12,Lebanon,United States,Architect,Mid senior,Onsite
4,https://www.linkedin.com/jobs/view/lead-data-e...,2024-01-21 08:08:58.312124+00,Finished NER,t,t,f,Lead Data Engineer,Dice,"Plano, TX",2024-01-14,McKinney,United States,Maintenance Data Analyst,Mid senior,Onsite


### **Target Jobs Classification Regex**

In [3]:
# target job title regex list
target_job_titles_regex = {
    "MLOps Engineer": r"(?i)(MLOps|Machine Learning Operations|Machine Learning Infrastructure Engineer|ML Infrastructure|ML Platform|ML Systems|ML Platform Engineer|AIML Ops Engineer|Machine Learning Software Developer)\w*[-\s]?",

    "Machine Learning Engineer": r"(?i)(Machine Learning Engineer|ML Engineer|Machine Learning Engineering|ML Developer|Machine Learning Software Engineer|AIML Engineer|AIML Data Scientist|AI Data Science Lead)\w*[-\s]?",

    "Data Architect": r"(?i)(Data Architect|Senior Data Architect|Cloud Data Architect|Big Data Architect|Enterprise Data Architect|Principal Data Architect|Lead Data Architect|Data Warehouse Architect|Data Architecture|Data Lake Architect|Data Streaming Architect)\w*[-\s]?",

    "Database Engineer / Administrator": r"(?i)(Database|Database Architect|DBA\b|Cloud Database|Azure Database|AWS Database|Databases|GCP Database|Oracle Database Engineer)\w*[-\s]?",

    "Data Engineer": r"(?i)(Data Engineer|Senior Data Engineer|Lead Data Engineer|Big Data Engineer|Data Engineering|Data Engineering Manager|Data Engineering Architect|Data Pipeline Engineer|Big Data Developer|Data Engineers|Data Integrations|Data Infrastructure|ETL Developer)\w*[-\s]?",

    "Data Governance & Security": r"(?i)(Data Governance|Data Privacy|Data Steward|Data Protection|Data Security|Master Data Management|Data Governance Manager|Data Compliance|Data Lifecycle Manager)\w*[-\s]?",

    "Data Operations & Management": r"(?i)(Data Manager|Enterprise Data Manager|Data Operations|Data Operations Manager|Data Operations Analyst|Data Management Engineer|Data Strategy Manager|Data Solution Architect|Data Deployment|Data Conversion|Data Replication Engineer|DevOps Engineer|Distributed Systems|Storage)\w*[-\s]?",

    "Data Modeling & Warehousing": r"(?i)(Data Modeling|Data Warehouse|Big Data Developer|Data Warehouse Architect|Cloud Datawarehouse|Data Platform Developer)\w*[-\s]?",

    "Data Specialist": r"(?i)(Data Specialist|Data Processing|Data Consultant|Data Quality Manager|Data Coordinator|Data Entry Specialist)\w*[-\s]?",

    "Data Scientist": r"(?i)(Data Scientist|Data Scientists|Data Science Engineer|Data Science Manager|Data Science Analyst|Data Science Practitioner|Customer Data Scientist)\w*[-\s]?",

    "Data Analyst": r"(?i)(Data Analyst|Data Analysts|Financial Data Analyst|Business Intelligence|BI Analyst|Data Business Analyst|Data Insights Analyst)\w*[-\s]?",

    "Software & Platform Engineering": r"(?i)(Software Engineer|Software Engineering|Software Developer|Software Engineer Data Science|Software Engineer Data Platforms|Platform Engineer|Application Developer|Backend Engineer|Systems Developer)\w*[-\s]?",

    "Cloud & Infrastructure Engineering": r"(?i)(Cloud Data|Cloud Data Architect|Azure Data|AWS Data|Azure Databricks|AWS Databricks|Cloud Engineer|Cloud Platform Engineer|Infrastructure Engineer|Datacenter Technician|Datacenter Engineer|Datacenter Network Engineer|Datacenter Engineering|Site Reliability Engineer|SRE)\w*[-\s]?",

    "Risk & Compliance Analytics": r"(?i)(Risk Analyst|AML\b|BSA|Risk Modeling|Financial Analyst|Hedge Fund|Data Loss Prevention|DLP)\w*[-\s]?"
}


### **Classification Function**

In [4]:
# Function to classify job titles
def classify(job_title, keywords_list=target_job_titles_regex):
    # Patterns are tried in dictionary order; the first one that matches wins
    for industry, pattern in keywords_list.items():
        match = re.search(pattern, str(job_title))
        if match:
            # Use match.group() to return the text that actually matched rather than the regex pattern,
            # then strip non-letter characters and normalize to title case
            keyword = re.sub(r'[^a-zA-Z\s]', '', match.group()).strip().title()
            return industry, keyword
    return "unclassified", "unclassified"
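
Because `classify` returns at the first pattern that matches, the order of the dictionary entries decides which category an overlapping title ends up in. The sketch below illustrates this with a hypothetical two-entry dictionary (`demo_patterns`) and a copy of the function (`classify_demo`). Both names are made up for the example and are not part of the notebook.

```python
import re

# Hypothetical two-entry dictionary; insertion order is the priority order.
demo_patterns = {
    "Data Architect": r"(?i)(Data Architect|Data Warehouse Architect)\w*[-\s]?",
    "Data Modeling & Warehousing": r"(?i)(Data Modeling|Data Warehouse)\w*[-\s]?",
}

def classify_demo(job_title, patterns):
    """Return (category, matched keyword) for the first pattern that matches."""
    for category, pattern in patterns.items():
        match = re.search(pattern, str(job_title))
        if match:
            keyword = re.sub(r"[^a-zA-Z\s]", "", match.group()).strip().title()
            return category, keyword
    return "unclassified", "unclassified"

title = "Data Warehouse Architect"

# Original order: the "Data Architect" entry is tried first.
first = classify_demo(title, demo_patterns)

# Reversed order: the warehousing entry now wins for the same title.
reversed_patterns = dict(reversed(list(demo_patterns.items())))
second = classify_demo(title, reversed_patterns)

print(first)   # ('Data Architect', 'Data Warehouse Architect')
print(second)  # ('Data Modeling & Warehousing', 'Data Warehouse')
```

The same title lands in a different category depending only on which entry comes first, so more specific categories generally need to sit above broader ones.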


In [5]:
# Copy Dataframe and Execute the classification function
structured_job_titles = job_postings.copy()
structured_job_titles['job_classification'], structured_job_titles['job_keyword'] = zip(*structured_job_titles['job_title'].apply(classify))

# Check the results
structured_job_titles.head(10)


Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_classification,job_keyword
0,https://www.linkedin.com/jobs/view/senior-mach...,2024-01-21 08:08:48.031964+00,Finished NER,t,t,f,Senior Machine Learning Engineer,Jobs for Humanity,"New Haven, CT",2024-01-14,East Haven,United States,Agricultural-Research Engineer,Mid senior,Onsite,Machine Learning Engineer,Machine Learning Engineer
1,https://www.linkedin.com/jobs/view/principal-s...,2024-01-20 04:02:12.331406+00,Finished NER,t,t,f,"Principal Software Engineer, ML Accelerators",Aurora,"San Francisco, CA",2024-01-14,El Cerrito,United States,Set-Key Driver,Mid senior,Onsite,Software & Platform Engineering,Software Engineer
2,https://www.linkedin.com/jobs/view/senior-etl-...,2024-01-21 08:08:31.941595+00,Finished NER,t,t,f,Senior ETL Data Warehouse Specialist,Adame Services LLC,"New York, NY",2024-01-14,Middletown,United States,Technical Support Specialist,Associate,Onsite,Data Modeling & Warehousing,Data Warehouse
3,https://www.linkedin.com/jobs/view/senior-data...,2024-01-20 15:30:55.796572+00,Finished NER,t,t,f,Senior Data Warehouse Developer / Architect,Morph Enterprise,"Harrisburg, PA",2024-01-12,Lebanon,United States,Architect,Mid senior,Onsite,Data Modeling & Warehousing,Data Warehouse
4,https://www.linkedin.com/jobs/view/lead-data-e...,2024-01-21 08:08:58.312124+00,Finished NER,t,t,f,Lead Data Engineer,Dice,"Plano, TX",2024-01-14,McKinney,United States,Maintenance Data Analyst,Mid senior,Onsite,Data Engineer,Lead Data Engineer
5,https://www.linkedin.com/jobs/view/senior-data...,2024-01-21 07:14:11.378097+00,Finished NER,t,t,f,Senior Data Engineer,University of Chicago,"Chicago, IL",2024-01-14,East Chicago,United States,Data Base Administrator,Mid senior,Onsite,Data Engineer,Senior Data Engineer
6,https://www.linkedin.com/jobs/view/manager-cyb...,2024-01-21 07:14:09.631476+00,Finished NER,t,t,f,"Manager, Cyber Risk & Analysis (Machine Learning)",Jobs for Humanity,"Boston, MA",2024-01-16,Beverly,United States,Manager Reports Analysis,Mid senior,Onsite,unclassified,unclassified
7,https://www.linkedin.com/jobs/view/principal-a...,2024-01-21 07:39:58.478064+00,Finished NER,t,t,f,"Principal Associate, Data Loss Prevention (DLP...",Jobs for Humanity,"Scranton, PA",2024-01-14,Nanticoke,United States,Architect,Mid senior,Onsite,Risk & Compliance Analytics,Data Loss Prevention
8,https://www.linkedin.com/jobs/view/senior-fina...,2024-01-21 07:14:50.991803+00,Finished NER,t,t,f,Senior Financial Data Analyst,The Walt Disney Company,"Lake Buena Vista, FL",2024-01-15,Avondale,United States,Budget Officer,Mid senior,Onsite,Data Analyst,Financial Data Analyst
9,https://www.linkedin.com/jobs/view/machine-lea...,2024-01-21 07:40:40.017291+00,Finished NER,t,t,f,Machine Learning Infrastructure Engineer,L&T Technology Services,"Sunnyvale, CA",2024-01-14,Redwood City,United States,Test Fixture Designer,Mid senior,Onsite,MLOps Engineer,Machine Learning Infrastructure Engineer


In [6]:
# count the number of each keyword occurrences
target_job_counts = pd.DataFrame(structured_job_titles['job_keyword'].value_counts()).reset_index().set_index('job_keyword')

# display the counts
print("Keyword Counts:")
target_job_counts

Keyword Counts:


job_keyword,count
unclassified,4339
Data Analyst,1820
Data Scientist,799
Data Engineer,770
Database,713
...,...
Data Platform Developer,1
Data Replication Engineer,1
Data Science Practitioner,1
Data Scientists,1


In [7]:
# count the number of each job classification
target_job_classification_counts = pd.DataFrame(structured_job_titles['job_classification'].value_counts()).reset_index().set_index('job_classification')

# display the counts
display("Job Classification Counts:")
display(target_job_classification_counts)

'Job Classification Counts:'

job_classification,count
unclassified,4339
Data Analyst,1920
Data Engineer,1791
Data Scientist,849
Database Engineer / Administrator,748
Machine Learning Engineer,506
Data Architect,370
Data Governance & Security,340
Risk & Compliance Analytics,275
Data Operations & Management,251


### **Check Group**

In [8]:
# check group
job_check = 'Cloud & Infrastructure Engineering'
pd.DataFrame(structured_job_titles[structured_job_titles['job_classification'] == job_check])

Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_classification,job_keyword
418,https://www.linkedin.com/jobs/view/datadog-clo...,2024-01-19 14:08:54.13512+00,Finished NER,t,t,f,Datadog Cloud Engineer Denver Colorado (3 days...,"Anveta, Inc","Denver, CO",2024-01-14,Colorado,United States,Computer Systems Hardware Analyst,Mid senior,Onsite,Cloud & Infrastructure Engineering,Cloud Engineer
686,https://www.linkedin.com/jobs/view/principal-s...,2024-01-19 17:02:42.718423+00,Finished NER,t,t,f,"Principal Site Reliability Engineer, Datastore...",ThousandEyes (part of Cisco),"San Francisco, CA",2024-01-14,Novato,United States,Reliability Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineer
849,https://uk.linkedin.com/jobs/view/manager-site...,2024-01-19 18:49:14.510383+00,Finished NER,t,t,f,Manager Site Reliability Engineering (SRE) - K...,Gaia Labs LLC,"Southampton, England, United Kingdom",2024-01-14,South Hampshire,United Kingdom,Starter,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineering
1069,https://uk.linkedin.com/jobs/view/senior-backe...,2024-01-19 21:27:04.594381+00,Finished NER,t,t,f,Senior Backend and Cloud Engineer - Machine Le...,Scandit,"Bristol, England, United Kingdom",2024-01-14,Newport,United Kingdom,Value Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Cloud Engineer
1215,https://www.linkedin.com/jobs/view/sr-site-rel...,2024-01-19 23:13:49.155128+00,Finished NER,t,t,f,Sr. Site Reliability Engineer (Application Sof...,SpaceX,"Hawthorne, CA",2024-01-14,Malibu,United States,Reliability Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11506,https://www.linkedin.com/jobs/view/datacenter-...,2024-01-21 02:32:47.751331+00,Finished NER,t,t,f,"Datacenter Engineer, Infrastructure Network En...",Tesla,"Sacramento, CA",2024-01-14,Roseville,United States,Value Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Datacenter Engineer
11527,https://ca.linkedin.com/jobs/view/solution-arc...,2024-01-21 02:46:01.83432+00,Finished NER,t,t,f,Solution Architect – Azure Data Integration,Torinit Technologies,"Toronto, Ontario, Canada",2024-01-14,Ontario,Canada,Change Person,Mid senior,Onsite,Cloud & Infrastructure Engineering,Azure Data
11592,https://www.linkedin.com/jobs/view/cloud-data-...,2024-01-21 03:18:41.401982+00,Finished NER,t,t,f,Cloud Data Platform Administrator,Adtalem Global Education,"Chicago, IL",2024-01-14,La Grange,United States,Data Base Administrator,Associate,Onsite,Cloud & Infrastructure Engineering,Cloud Data
11871,https://uk.linkedin.com/jobs/view/senior-backe...,2024-01-21 06:09:24.081787+00,Finished NER,t,t,f,Senior Backend and Cloud Engineer - Machine Le...,Scandit,"Swansea, Wales, United Kingdom",2024-01-14,Swansea,United Kingdom,Agricultural-Research Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Cloud Engineer


### **Reflection: Successfully Standardizing Job Titles into Technical Data Science Roles**  

After a lot of trial and error, I’ve finally managed to **group the varied job title keywords** I had collected earlier into **14 distinct technical data science roles**. This is a big step forward, because it fixes the 14 job categories that will be used for the final MVP visualizations. I selected them by looking at which careers are most representative of the data science industry, which meant each title needed both a sufficient number of job listings and a reasonably general name. To that end, I grouped keywords by discipline or job role and by career growth opportunities, so that each category has at least **50 job listings**, which makes the visualizations much more meaningful.  

The process wasn’t straightforward, but by using a **hierarchical keyword-matching approach** (patterns are checked in a fixed order and the first match wins), I was able to **sort job titles more accurately and move certain specialized roles** into their appropriate categories.
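
The 50-listing floor mentioned above can be checked directly from the classification counts. The following is a minimal sketch on made-up data: `classifications` stands in for `structured_job_titles['job_classification']`, and `MIN_LISTINGS` is a name introduced here for illustration.

```python
import pandas as pd

MIN_LISTINGS = 50  # threshold taken from the reflection above

# Hypothetical stand-in for structured_job_titles['job_classification'].
classifications = pd.Series(
    ["Data Analyst"] * 60 + ["Data Engineer"] * 55 + ["Data Specialist"] * 12
)

# Count listings per category and keep only those under the threshold.
counts = classifications.value_counts()
too_small = counts[counts < MIN_LISTINGS]

print(too_small)
```

Any category that shows up in `too_small` is a candidate for merging into a broader group or for dropping from the visualizations.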

---

### **Final EDA Steps: Experience & Seniority Classification**  

With **job titles now standardized**, the next and final step in **Ticket 1.1** of this **Exploratory Data Analysis (EDA)** is to classify **experience and seniority levels**. Just like with job titles, this will involve:  

- Spotting **patterns in experience-related keywords** (e.g., *Junior, Mid, Senior, Lead, Principal*).  
- Making sure **job levels are assigned correctly** to maintain career progression.  
- Avoiding **misclassification**, especially when job titles contain overlapping terms that could place them in the wrong category.  

Once this step is complete, the job title breakdown in the dataset will be in **its cleanest, most structured form**, setting us up for **clear and reliable job market insights** in the final visualizations.  
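
One way overlapping terms can cause misclassification is plain substring matching: a short seniority keyword can also occur inside an unrelated word. The sketch below compares an unanchored pattern with a word-boundary version. The `strict` pattern is an assumed refinement for illustration, not what the notebook uses.

```python
import re

# Unanchored pattern in the style used for the keyword dictionaries.
loose = re.compile(r"(?i)(Intern|Internship)\w*[-\s]?")
# Assumed refinement: \b anchors so the keyword must be a whole word.
strict = re.compile(r"(?i)\b(Intern|Internship)\b")

titles = [
    "Data Science Intern",
    "International Data Analyst",
    "Internal Audit Data Analyst",
]

loose_hits = [bool(loose.search(t)) for t in titles]
strict_hits = [bool(strict.search(t)) for t in titles]

print(loose_hits)   # [True, True, True]
print(strict_hits)  # [True, False, False]
```

Whether such titles are actually affected depends on which earlier patterns match first, so this is something to spot-check rather than assume.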


### **Seniority Level Classification Regex**

In [9]:
# seniority level regex list
seniority_levels_regex = {
    # 🔹 Principal / Staff-Level Roles (Must Be Checked First)
    "Principal / Staff-Level": r"(?i)(Principal|Staff|Sr[-\s]?Staff|Distinguished|Fellow|Master|L4|Level 4|Chief[-\s]?Architect|Chief[-\s]?Scientist)\w*[-\s]?",

    # 🔹 Lead / Supervisor Roles (Checked Before Senior)
    "Lead": r"(?i)(Lead|Tech[-\s]?Lead|Team[-\s]?Lead|Supervisor|Group[-\s]?Lead|Project[-\s]?Lead|Engineering[-\s]?Lead|Squad[-\s]?Lead|Chapter[-\s]?Lead|Manager|Head[-\s]?of[-\s]?Team)\w*[-\s]?",

    # 🔹 Senior-Level Roles (Checked Before Mid-Level)
    "Senior-Level": r"(?i)(Senior|Sr\.?|SNR|SEN|L3|Level 3|Expert|Specialist|Advanced|Seasoned|Experienced)\w*[-\s]?",

    # 🔹 Mid-Level Roles (Checked Before Junior)
    "Mid-Level": r"(?i)(Mid[-\s]?Level|Intermediate|Mid|L2|Level 2|Professional|Regular)\w*[-\s]?",

    # 🔹 Entry-Level / Junior Roles (Checked After Mid-Level)
    "Entry-Level / Junior": r"(?i)(Junior|Jr\.?|Entry[-\s]?Level|Associate|Graduate|Trainee|Fresher|New Grad|Early[-\s]?Career|L1|Level 1)\w*[-\s]?",

    # 🔹 Intern / Internship Roles (Checked After Junior; "Trainee" is already caught by the Junior pattern above)
    "Intern": r"(?i)(Intern|Internship|Co[-\s]?Op|Apprentice|Trainee)\w*[-\s]?",

    # 🔹 Director / Executive Roles (Checked Last, so any earlier keyword in the title takes precedence)
    "Director / Executive": r"(?i)(Director|Head|VP|Vice[-\s]?President|CIO|CTO|CISO|CEO|Chief|Executive|C[-]?Level|Managing[-\s]?Director|Global[-\s]?Head|President|Founder|Partner)\w*[-\s]?"
}


### **Seniority Level Classification Function**

In [10]:
# Create single use function to classify seniority level
def classify_seniority_level(job_title):
    return classify(job_title, seniority_levels_regex)
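
Since `classify` stops at the first matching entry, a title that carries two seniority keywords is labeled purely by dictionary order. A possible alternative, sketched here with trimmed-down hypothetical patterns and an assumed ranking (`RANKED_LEVELS`), is to collect every matching level and pick the most senior one explicitly.

```python
import re

# Hypothetical, shortened patterns; the list order is an assumed ranking,
# most senior first. It is not the ordering used in the notebook.
RANKED_LEVELS = [
    ("Director / Executive", r"(?i)\b(Director|VP|Head)\b"),
    ("Principal / Staff-Level", r"(?i)\b(Principal|Staff)\b"),
    ("Lead", r"(?i)\b(Lead|Manager)\b"),
    ("Senior-Level", r"(?i)\b(Senior|Sr)\b"),
    ("Entry-Level / Junior", r"(?i)\b(Junior|Jr|Associate)\b"),
]

def all_seniority_matches(job_title):
    """Return every level whose pattern matches, most senior first."""
    return [level for level, pattern in RANKED_LEVELS
            if re.search(pattern, str(job_title))]

def highest_seniority(job_title):
    """Return the most senior matching level, or 'unclassified'."""
    matches = all_seniority_matches(job_title)
    return matches[0] if matches else "unclassified"

print(all_seniority_matches("Senior Lead Data Engineer"))   # ['Lead', 'Senior-Level']
print(highest_seniority("Senior Director, Data Science"))   # Director / Executive
print(highest_seniority("Data Engineer"))                   # unclassified
```

Keeping the full list of matches also makes ambiguous titles easy to audit before a single label is committed.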

In [11]:
# Execute the classification function on the copied dataframe
structured_job_titles['seniority_level'], structured_job_titles['seniority_level_keyword'] = zip(*structured_job_titles['job_title'].apply(classify_seniority_level))

# Check the results
structured_job_titles.head()

Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_classification,job_keyword,seniority_level,seniority_level_keyword
0,https://www.linkedin.com/jobs/view/senior-mach...,2024-01-21 08:08:48.031964+00,Finished NER,t,t,f,Senior Machine Learning Engineer,Jobs for Humanity,"New Haven, CT",2024-01-14,East Haven,United States,Agricultural-Research Engineer,Mid senior,Onsite,Machine Learning Engineer,Machine Learning Engineer,Senior-Level,Senior
1,https://www.linkedin.com/jobs/view/principal-s...,2024-01-20 04:02:12.331406+00,Finished NER,t,t,f,"Principal Software Engineer, ML Accelerators",Aurora,"San Francisco, CA",2024-01-14,El Cerrito,United States,Set-Key Driver,Mid senior,Onsite,Software & Platform Engineering,Software Engineer,Principal / Staff-Level,Principal
2,https://www.linkedin.com/jobs/view/senior-etl-...,2024-01-21 08:08:31.941595+00,Finished NER,t,t,f,Senior ETL Data Warehouse Specialist,Adame Services LLC,"New York, NY",2024-01-14,Middletown,United States,Technical Support Specialist,Associate,Onsite,Data Modeling & Warehousing,Data Warehouse,Senior-Level,Senior
3,https://www.linkedin.com/jobs/view/senior-data...,2024-01-20 15:30:55.796572+00,Finished NER,t,t,f,Senior Data Warehouse Developer / Architect,Morph Enterprise,"Harrisburg, PA",2024-01-12,Lebanon,United States,Architect,Mid senior,Onsite,Data Modeling & Warehousing,Data Warehouse,Senior-Level,Senior
4,https://www.linkedin.com/jobs/view/lead-data-e...,2024-01-21 08:08:58.312124+00,Finished NER,t,t,f,Lead Data Engineer,Dice,"Plano, TX",2024-01-14,McKinney,United States,Maintenance Data Analyst,Mid senior,Onsite,Data Engineer,Lead Data Engineer,Lead,Lead


### **Classify the Unclassified Seniority Titles**

#### Method 1: Using iterrows() and a for loop

In [12]:
# Keyword sets used to assign a default seniority level when the title had no seniority keyword
entry_level_keywords = {
    'Data Analyst', 'Data Security', 'Database', 'Cloud Engineer',
    'Financial Data Analyst', 'Bsa', 'Machine Learning Engineer', 'Data Processing',
    'Backend Engineer', 'Ml Engineering', 'Data Governance', 'Big Data Engineer',
    'Aml', 'Data Privacy', 'Data Business Analyst', 'Data Engineers',
    'Data Engineer', 'Infrastructure Engineer', 'Datacenter Technician',
    'Data Operations', 'Data Science Engineer', 'Data Consultant',
    'Software Developer', 'Data Science Analyst', 'Bi Analyst',
    'Ml Developer', 'Ml Engineer', 'Datacenter Engineer', 'Platform Engineer',
    'Cloud Data', 'Etl Developer', 'Dba', 'Databases',
    'Financial Analyst', 'Devops Engineer', 'Data Insights Analyst',
    'Risk Analyst', 'Data Analysts', 'Cloud Database',
    'Site Reliability Engineer', 'Data Analystat', 'Data Pipeline Engineer',
    'Big Data Engineering',
}
mid_level_keywords = {
    'Data Scientist', 'Data Engineering', 'MLOps Engineer',
    'Business Intelligence', 'Data Coordinator', 'Data Steward',
    'Machine Learning Infrastructure Engineer', 'Machine Learning Software Developer', 'Software Engineer',
    'Customer Data Scientist', 'Data Warehouse Architect', 'Ml Systems',
    'Data Compliance', 'Big Data Architect', 'Aws Databricks',
    'Big Data Developer', 'Azure Data', 'Data Replication Engineer',
    'Data Science Practitioner', 'Data Integrations', 'Data Modeling',
    'Machine Learning Operations', 'Mlops', 'Data Loss Prevention',
    'Ml Infrastructure', 'Machine Learning Software Engineer', 'Data Deployment',
    'Data Architecture', 'Datacenter Network Engineer', 'Azure Databricks',
    'Data Stewardship', 'Ml Platform', 'Data Conversion',
    'Data Management Engineer',
}
senior_level_keywords = {
    'Data Architect', 'Data Warehouse', 'Cloud Data Architect',
    'Data Protection', 'Data Lake Architect', 'Enterprise Data Architect',
    'Data Solution Architect', 'Data Streaming Architect',
}

# Loop over the rows and fill in the seniority level wherever it is still unclassified
for index, row in structured_job_titles.iterrows():
    if row['seniority_level'] == 'unclassified':
        if row['job_keyword'] in entry_level_keywords:
            structured_job_titles.loc[index, 'seniority_level'] = "Entry-Level / Junior"
        elif row['job_keyword'] in mid_level_keywords:
            structured_job_titles.loc[index, 'seniority_level'] = "Mid-Level"
        elif row['job_keyword'] in senior_level_keywords:
            structured_job_titles.loc[index, 'seniority_level'] = "Senior-Level"



#### Method 2: Using a function to classify the unclassified seniority titles

In [13]:
# # Define a function to classify the unclassified seniority titles
# def classify_unclassified_seniority(job_keyword):
#     # Define sets of keywords for each seniority level
#     entry_level_keywords = [
#         "Data Analyst", "Data Security", "Database", "Cloud Engineer",
#         "Financial Data Analyst", "Bsa", "Machine Learning Engineer", "Data Processing", 
#         "Backend Engineer", "Ml Engineering", "Data Governance", "Big Data Engineer",
#         "Aml", "Data Privacy", "Data Business Analyst", "Data Engineers",
#         "Data Engineer", "Infrastructure Engineer", "Datacenter Technician",
#         "Data Operations", "Data Science Engineer", "Data Consultant",
#         "Software Developer", "Data Science Analyst", "Bi Analyst",
#         "Ml Developer", "Ml Engineer", "Datacenter Engineer", "Platform Engineer",
#         "Cloud Data", "Etl Developer", "Dba", "Databases",
#         "Financial Analyst", "Devops Engineer", "Data Insights Analyst",
#         "Risk Analyst", "Data Analysts", "Cloud Database",
#         "Site Reliability Engineer", "Data Analystat", "Data Pipeline Engineer",
#         "Big Data Engineering"
#     ]


#     mid_level_keywords = [
#         "Data Scientist", "Data Engineering", "MLOps Engineer",
#         "Business Intelligence", "Data Coordinator", "Data Steward",
#         "Machine Learning Infrastructure Engineer", "Machine Learning Software Developer", "Software Engineer",
#         "Customer Data Scientist", "Data Warehouse Architect", "Ml Systems",
#         "Data Compliance", "Big Data Architect", "Aws Databricks",
#         "Big Data Developer", "Azure Data", "Data Replication Engineer",
#         "Data Science Practitioner", "Data Integrations", "Data Modeling",
#         "Machine Learning Operations", "Mlops", "Data Loss Prevention",
#         "Ml Infrastructure", "Machine Learning Software Engineer", "Data Deployment",
#         "Data Architecture", "Datacenter Network Engineer", "Azure Databricks",
#         "Data Stewardship", "Ml Platform", "Data Conversion",
#         "Data Management Engineer"
#     ]

#     senior_level_keywords = [
#         "Data Architect", "Data Warehouse", "Cloud Data Architect",
#         "Data Protection", "Data Lake Architect", "Enterprise Data Architect",
#         "Data Solution Architect", "Data Streaming Architect"
#     ]

#     if job_keyword in entry_level_keywords:
#         return "Entry-Level / Junior"
#     elif job_keyword in mid_level_keywords:
#         return "Mid-Level"
#     elif job_keyword in senior_level_keywords:
#         return "Senior-Level"
#     return "unclassified"
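
Method 2 is left commented out and is never applied in this notebook. If a lookup of this kind were used, one way to apply it without a row loop is a boolean mask plus `Series.map`. The sketch below uses a shortened hypothetical lookup (`keyword_to_level`) and a small stand-in frame (`df`) rather than the real data.

```python
import pandas as pd

# Hypothetical, shortened lookup; the full keyword lists are in the cell above.
keyword_to_level = {
    "Data Analyst": "Entry-Level / Junior",
    "Data Scientist": "Mid-Level",
    "Data Architect": "Senior-Level",
}

# Small stand-in for structured_job_titles.
df = pd.DataFrame({
    "job_keyword": ["Data Analyst", "Data Architect", "Data Scientist", "Storage"],
    "seniority_level": ["unclassified", "Lead", "unclassified", "unclassified"],
})

# Only rows the regex pass left unclassified are candidates for the fallback.
mask = df["seniority_level"] == "unclassified"
fallback = df.loc[mask, "job_keyword"].map(keyword_to_level)

# Keywords missing from the lookup map to NaN; keep those rows unclassified.
df.loc[mask, "seniority_level"] = fallback.fillna("unclassified")

print(df)
```

Rows that already have a seniority level from the title, such as the `Lead` row here, are left untouched.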

In [14]:
# Dropping rows with no job classification
structured_job_titles = structured_job_titles[structured_job_titles['job_classification'] != 'unclassified']

# Show unclassified seniority titles
unclassified_seniority = structured_job_titles[structured_job_titles['seniority_level'] == 'unclassified']
unclassified_seniority_df = pd.DataFrame(unclassified_seniority[['job_title', 'job_classification', 'job_keyword', 'seniority_level', 'seniority_level_keyword']])
unclassified_seniority_df


Unnamed: 0,job_title,job_classification,job_keyword,seniority_level,seniority_level_keyword


### **Check Group**

In [15]:
# check group
job_check = 'Cloud & Infrastructure Engineering'
pd.DataFrame(structured_job_titles[structured_job_titles['job_classification'] == job_check].head())

Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_classification,job_keyword,seniority_level,seniority_level_keyword
418,https://www.linkedin.com/jobs/view/datadog-clo...,2024-01-19 14:08:54.13512+00,Finished NER,t,t,f,Datadog Cloud Engineer Denver Colorado (3 days...,"Anveta, Inc","Denver, CO",2024-01-14,Colorado,United States,Computer Systems Hardware Analyst,Mid senior,Onsite,Cloud & Infrastructure Engineering,Cloud Engineer,Entry-Level / Junior,unclassified
686,https://www.linkedin.com/jobs/view/principal-s...,2024-01-19 17:02:42.718423+00,Finished NER,t,t,f,"Principal Site Reliability Engineer, Datastore...",ThousandEyes (part of Cisco),"San Francisco, CA",2024-01-14,Novato,United States,Reliability Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineer,Principal / Staff-Level,Principal
849,https://uk.linkedin.com/jobs/view/manager-site...,2024-01-19 18:49:14.510383+00,Finished NER,t,t,f,Manager Site Reliability Engineering (SRE) - K...,Gaia Labs LLC,"Southampton, England, United Kingdom",2024-01-14,South Hampshire,United Kingdom,Starter,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineering,Lead,Manager
1069,https://uk.linkedin.com/jobs/view/senior-backe...,2024-01-19 21:27:04.594381+00,Finished NER,t,t,f,Senior Backend and Cloud Engineer - Machine Le...,Scandit,"Bristol, England, United Kingdom",2024-01-14,Newport,United Kingdom,Value Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Cloud Engineer,Senior-Level,Senior
1215,https://www.linkedin.com/jobs/view/sr-site-rel...,2024-01-19 23:13:49.155128+00,Finished NER,t,t,f,Sr. Site Reliability Engineer (Application Sof...,SpaceX,"Hawthorne, CA",2024-01-14,Malibu,United States,Reliability Engineer,Mid senior,Onsite,Cloud & Infrastructure Engineering,Site Reliability Engineer,Senior-Level,Sr


### **Drop unclassified rows**



In [16]:
# Dropping rows with no job classification
structured_job_titles = structured_job_titles[structured_job_titles['job_classification'] != 'unclassified']

# check results
structured_job_titles


Unnamed: 0,job_link,last_processed_time,last_status,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_classification,job_keyword,seniority_level,seniority_level_keyword
0,https://www.linkedin.com/jobs/view/senior-mach...,2024-01-21 08:08:48.031964+00,Finished NER,t,t,f,Senior Machine Learning Engineer,Jobs for Humanity,"New Haven, CT",2024-01-14,East Haven,United States,Agricultural-Research Engineer,Mid senior,Onsite,Machine Learning Engineer,Machine Learning Engineer,Senior-Level,Senior
1,https://www.linkedin.com/jobs/view/principal-s...,2024-01-20 04:02:12.331406+00,Finished NER,t,t,f,"Principal Software Engineer, ML Accelerators",Aurora,"San Francisco, CA",2024-01-14,El Cerrito,United States,Set-Key Driver,Mid senior,Onsite,Software & Platform Engineering,Software Engineer,Principal / Staff-Level,Principal
2,https://www.linkedin.com/jobs/view/senior-etl-...,2024-01-21 08:08:31.941595+00,Finished NER,t,t,f,Senior ETL Data Warehouse Specialist,Adame Services LLC,"New York, NY",2024-01-14,Middletown,United States,Technical Support Specialist,Associate,Onsite,Data Modeling & Warehousing,Data Warehouse,Senior-Level,Senior
3,https://www.linkedin.com/jobs/view/senior-data...,2024-01-20 15:30:55.796572+00,Finished NER,t,t,f,Senior Data Warehouse Developer / Architect,Morph Enterprise,"Harrisburg, PA",2024-01-12,Lebanon,United States,Architect,Mid senior,Onsite,Data Modeling & Warehousing,Data Warehouse,Senior-Level,Senior
4,https://www.linkedin.com/jobs/view/lead-data-e...,2024-01-21 08:08:58.312124+00,Finished NER,t,t,f,Lead Data Engineer,Dice,"Plano, TX",2024-01-14,McKinney,United States,Maintenance Data Analyst,Mid senior,Onsite,Data Engineer,Lead Data Engineer,Lead,Lead
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12209,https://www.linkedin.com/jobs/view/data-archit...,2024-01-21 08:08:07.523737+00,Finished NER,t,t,f,Data Architect,General Dynamics Information Technology,"St Louis, MO",2024-01-14,Collinsville,United States,Interior Designer,Mid senior,Onsite,Data Architect,Data Architect,Senior-Level,unclassified
12211,https://ca.linkedin.com/jobs/view/senior-data-...,2024-01-20 05:10:03.58781+00,Finished NER,t,t,f,Senior Data Insights Analyst,CARFAX Canada,"London, Ontario, Canada",2024-01-14,London,Canada,Data Entry Clerk,Mid senior,Onsite,Data Analyst,Data Insights Analyst,Senior-Level,Senior
12213,https://www.linkedin.com/jobs/view/corporate-a...,2024-01-19 15:10:41.177008+00,Finished NER,t,t,f,Corporate AML Alert Investigation Specialist,"Glacier Bancorp, Inc.","Kalispell, MT",2024-01-14,Montana,United States,Teller,Mid senior,Onsite,Risk & Compliance Analytics,Aml,Senior-Level,Specialist
12214,https://www.linkedin.com/jobs/view/senior-data...,2024-01-20 15:20:19.036168+00,Finished NER,t,t,f,Senior Data Scientist,Highnote,"San Francisco, CA",2024-01-16,San Rafael,United States,Mathematician,Mid senior,Onsite,Data Scientist,Data Scientist,Senior-Level,Senior


### **Final Reflection: Job Title & Experience Level Classification**  

After extensive iteration and refinement, I’ve classified **job titles and experience levels**, marking the completion of **Ticket 1.1 – EDA**. The final dataset consists of **5,079 classified job listings**, narrowed down from the original **12,000+ entries** by filtering out ambiguous or highly niche job titles that lacked meaningful representation.  

I’m **proud of the classification rate**, especially given the complexity of long-form job titles and the overlapping nature of technical roles. By structuring job titles using a **hierarchical keyword-matching approach**, I was able to ensure that roles were grouped meaningfully while **minimizing misclassification and cross-contamination**. The addition of **seniority level classification** further enhances the dataset, allowing for better segmentation by career progression.  

However, if I had more time, there are several areas I’d consider improving:  
- **Handling slight misspellings or inconsistencies** in job titles, which may have led to some relevant listings being missed.  
- **Exploring fuzzy matching techniques** to capture more variations in long-form job titles.  
- **Refining the seniority classification process** to better differentiate between ambiguous roles (e.g., distinguishing between "Lead" and "Senior" when both appear in the same title).  
- **Optimizing regex rules further** to get more precise classification or broader coverage, particularly for niche roles that may still be underrepresented.  
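
As a rough illustration of the fuzzy-matching idea in the list above, the standard library's `difflib.get_close_matches` can map a slightly misspelled title onto a known keyword. This is only a sketch under assumptions: `canonical_keywords` is a hypothetical short list, `fuzzy_keyword` is a made-up helper, and the `0.8` cutoff is an arbitrary starting point. It compares whole strings, so long-form titles would need to be split into shorter phrases first.

```python
import difflib

# Hypothetical list of canonical keywords; not the notebook's full vocabulary.
canonical_keywords = [
    "Data Engineer",
    "Data Scientist",
    "Data Analyst",
    "Machine Learning Engineer",
]

def fuzzy_keyword(job_title, keywords=canonical_keywords, cutoff=0.8):
    """Return the closest canonical keyword, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(job_title, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_keyword("Data Enginer"))       # Data Engineer
print(fuzzy_keyword("Data Scientst"))      # Data Scientist
print(fuzzy_keyword("Interior Designer"))  # None
```

A lower cutoff catches more misspellings but also raises the risk of pulling unrelated titles into a category, so it would need tuning against manually checked samples.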

While this step concludes my **EDA for job title and experience level classification**, the next step is integrating this process into the **main data cleaning pipeline**, ensuring that future job postings are automatically categorized with minimal manual intervention. This structured approach lays the foundation for **more accurate and meaningful visualizations**, setting up the project for deeper insights into data science career paths.
