# Aim / Purpose
**Build a skills taxonomy (~200–300 skills with aliases) that can later be used to extract skills from job postings.**


**Do a first manual API pull to test the data pipeline, validate the structure of the data, and save a small sample of job postings.**


**Document observations and insights to guide Phase 2 (automation) and Phase 3 (skill extraction & EDA).**

### Taxonomy Refinement

In [86]:
import json
from pathlib import Path

In [87]:
taxonomy_path = Path("../data/processed/skills_taxonomy.json")
with open(taxonomy_path, "r") as f:
    taxonomy = json.load(f)

print(f"Current taxonomy size: {len(taxonomy)}")


Current taxonomy size: 200


In [88]:
aliases = {
    "python": ["py", "python3"],
    "javascript": ["js", "nodejs", "ecmascript"],
    "r": ["r language"],
    "sql": ["structured query language"]
}

# Merge aliases into taxonomy, extending any aliases already stored for a skill
for skill, alias_list in aliases.items():
    entry = taxonomy.setdefault(skill, {"source": "manual"})
    entry["aliases"] = entry.get("aliases", []) + alias_list


# Ensure no duplicates in aliases
for skill, meta in taxonomy.items():
    if "aliases" in meta:
        meta["aliases"] = sorted(set(meta["aliases"]))

# Sort skills alphabetically
taxonomy = dict(sorted(taxonomy.items()))

with open(taxonomy_path, "w") as f:
    json.dump(taxonomy, f, indent=2)

print(f"Updated taxonomy size: {len(taxonomy)}")



Updated taxonomy size: 200
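
Once aliases are merged, it may be worth checking that no name is claimed by two different skills, since such a collision would make later extraction ambiguous. The sketch below assumes the `{skill: {"source": ..., "aliases": [...]}}` layout used above; the function name `find_alias_conflicts` and the sample entries (including the `node.js` skill) are hypothetical and only illustrate the check.

```python
from collections import defaultdict


def find_alias_conflicts(taxonomy):
    """Return {name: [skills...]} for every name claimed by more than one skill."""
    owners = defaultdict(set)
    for skill, meta in taxonomy.items():
        # A canonical skill name counts as a claim on that name too
        owners[skill.lower().strip()].add(skill)
        for alias in meta.get("aliases", []):
            owners[alias.lower().strip()].add(skill)
    return {name: sorted(skills) for name, skills in owners.items() if len(skills) > 1}


# Hypothetical sample data, not taken from skills_taxonomy.json
sample_taxonomy = {
    "javascript": {"source": "manual", "aliases": ["js", "nodejs"]},
    "node.js": {"source": "stackoverflow", "aliases": ["nodejs"]},
    "python": {"source": "seed", "aliases": ["py", "python3"]},
}

conflicts = find_alias_conflicts(sample_taxonomy)
print(conflicts)  # {'nodejs': ['javascript', 'node.js']}
```

An empty result means every name and alias maps to exactly one skill.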


### First Manual API Pull

In [89]:
# API parameters
import os
import requests
from datetime import datetime

from dotenv import load_dotenv
load_dotenv()  # loads your .env file

# Load API keys from .env 
ADZUNA_APP_ID = os.getenv("ADZUNA_APP_ID")
ADZUNA_APP_KEY = os.getenv("ADZUNA_APP_KEY")

# Adzuna endpoint
BASE_URL = "https://api.adzuna.com/v1/api/jobs/us/search/1"  # US example, page 1

# Query parameters
params = {
    "app_id": ADZUNA_APP_ID,
    "app_key": ADZUNA_APP_KEY,
    "results_per_page": 10,
    "what": "data scientist",  # test role
    "content-type": "application/json",
}

In [90]:
# API request (timeout so a stalled connection cannot hang the notebook)
response = requests.get(BASE_URL, params=params, timeout=30)

if response.status_code == 200:
    jobs_data = response.json().get("results", [])
    print(f"Fetched {len(jobs_data)} job postings")
else:
    print("API request failed with status:", response.status_code)
    jobs_data = []


Fetched 10 job postings
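
For Phase 2, the single request above could be wrapped in a small function that takes the page number as an argument. This is a minimal sketch, not the final `fetch_jobs.py`: the function name `fetch_page`, the `http_get` parameter (so that `requests.get` or a test stub can be passed in), the `country` and `per_page` defaults, and the 30-second timeout are all assumptions. The URL pattern, with the page number as the last path segment, follows `BASE_URL` above.

```python
def fetch_page(http_get, app_id, app_key, what, page=1, country="us", per_page=10):
    """Fetch one page of search results and return the list under "results"."""
    url = f"https://api.adzuna.com/v1/api/jobs/{country}/search/{page}"
    params = {
        "app_id": app_id,
        "app_key": app_key,
        "results_per_page": per_page,
        "what": what,
        "content-type": "application/json",
    }
    response = http_get(url, params=params, timeout=30)
    # Raise on 4xx/5xx instead of silently returning an empty list
    response.raise_for_status()
    return response.json().get("results", [])


# Intended usage in the notebook (needs network access and valid credentials):
# jobs_data = fetch_page(requests.get, ADZUNA_APP_ID, ADZUNA_APP_KEY, "data scientist")
```

Passing the HTTP function in as an argument keeps the sketch testable offline, at the cost of a slightly longer call.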


In [91]:
#jobs_data

In [92]:
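# Validate jobs: require a title, a description of at least 50 characters,
# a location display name, and at least one salary key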
validated_jobs = []
for job in jobs_data:
    if all([
        job.get("title"),
        job.get("description") and len(job["description"]) >= 50,
        job.get("location") and job["location"].get("display_name"),
        "salary_min" in job or "salary_max" in job
    ]):
        validated_jobs.append(job)

print(f"Validated {len(validated_jobs)} jobs out of {len(jobs_data)} fetched")

Validated 10 jobs out of 10 fetched
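
The check above only says whether a job passes. When a posting fails, it can help to know which rule it broke, so that missing fields can be reported as anomalies. The following sketch applies the same four rules and returns the names of the failing ones; the function name `missing_fields` and the sample job dictionaries are hypothetical.

```python
def missing_fields(job, min_description_len=50):
    """Return the names of the validation rules this job fails (empty if valid)."""
    problems = []
    if not job.get("title"):
        problems.append("title")
    if len(job.get("description") or "") < min_description_len:
        problems.append("description")
    if not (job.get("location") or {}).get("display_name"):
        problems.append("location")
    # Same rule as above: the key must exist, its value is not inspected
    if "salary_min" not in job and "salary_max" not in job:
        problems.append("salary")
    return problems


# Hypothetical sample postings, not real API output
good_job = {
    "title": "Data Scientist",
    "description": "x" * 60,
    "location": {"display_name": "Raleigh, Wake County"},
    "salary_min": 90000,
}
bad_job = {"title": "Analyst", "description": "Too short", "location": {}}

print(missing_fields(good_job))  # []
print(missing_fields(bad_job))   # ['description', 'location', 'salary']
```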


In [93]:
# Save Raw Data
raw_dir = Path(f"../data/raw/{datetime.today().strftime('%Y-%m-%d')}")
raw_dir.mkdir(parents=True, exist_ok=True)

raw_file = raw_dir / "jobs_sample.json"
with open(raw_file, "w") as f:
    json.dump(validated_jobs, f, indent=2)

print(f"Saved sample jobs to {raw_file}")


Saved sample jobs to ..\data\raw\2025-10-02\jobs_sample.json


In [94]:
print(f"Total skills in taxonomy: {len(taxonomy)}")
print(f"Jobs fetched: {len(jobs_data)}")
print(f"Jobs validated: {len(validated_jobs)}")

Total skills in taxonomy: 200
Jobs fetched: 10
Jobs validated: 10


In [95]:
import pandas as pd
import json
from collections import Counter
survey = pd.read_csv("../data/raw/survey_results_public.csv")



In [97]:

TOP_N = 200  # Number of top StackOverflow skills to include

# Phase 0 seed skills (must be in taxonomy)
seed_skills = [
    "python", "r", "sql", "pandas", "numpy", "scikit-learn",
    "tensorflow", "keras", "pytorch", "spark", "hadoop",
    "aws", "azure", "gcp", "streamlit", "flask", "fastapi",
    "docker", "kubernetes"
]

skill_cols = [
    "LanguageHaveWorkedWith", "LanguageWantToWorkWith",
    "DatabaseHaveWorkedWith", "DatabaseWantToWorkWith",
    "PlatformHaveWorkedWith", "PlatformWantToWorkWith",
    "WebframeHaveWorkedWith", "WebframeWantToWorkWith",
    "DevEnvsHaveWorkedWith", "DevEnvsWantToWorkWith",
    "SOTagsHaveWorkedWith", "SOTagsWantToWorkWith"
]

# Extract & clean skills
all_skills_list = []
for col in skill_cols:
    if col in survey.columns:
        skills = (
            survey[col]
            .dropna()
            .str.split(";")
            .explode()
            .str.lower()
            .str.strip()
        )
        all_skills_list.extend(skills)

# Count frequency of each skill
skill_counts = Counter(all_skills_list)

# Keep top N most common skills
top_skills = [skill for skill, _ in skill_counts.most_common(TOP_N)]
print(f"Top {TOP_N} StackOverflow skills: {len(top_skills)}")

# Load existing taxonomy
with open("../data/processed/skills_taxonomy.json", "r") as f:
    taxonomy = json.load(f)

# Merge top skills + ensure seed skills
for skill in top_skills + seed_skills:  # seed_skills guaranteed included
    if skill not in taxonomy:
        source = "stackoverflow" if skill in top_skills else "seed"
        taxonomy[skill] = {"source": source}

# Save cleaned taxonomy
with open("../data/processed/skills_taxonomy.json", "w") as f:
    json.dump(taxonomy, f, indent=2)
print(f"Final taxonomy size after cleaning and merge: {len(taxonomy)}")


Top 200 StackOverflow skills: 188
Final taxonomy size after cleaning and merge: 200
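
To make the split-and-count step easier to inspect, here is the same idea on a few hand-written cells, using only the standard library instead of pandas. The function name `count_skills` and the sample rows are hypothetical. Note that `Counter.most_common(n)` returns fewer than `n` items when fewer distinct values exist, which is consistent with the 188 skills reported above for `TOP_N = 200`.

```python
from collections import Counter


def count_skills(cells):
    """Count skills in semicolon-separated survey cells, ignoring missing answers."""
    counts = Counter()
    for cell in cells:
        if not cell:  # skip None or empty strings
            continue
        counts.update(part.strip().lower() for part in cell.split(";") if part.strip())
    return counts


# Hypothetical survey cells, not rows from survey_results_public.csv
rows = ["Python;SQL;R", "python; JavaScript", None, "SQL"]
counts = count_skills(rows)

print(counts["python"])              # 2
print(len(counts.most_common(200)))  # 4, because only four distinct skills exist
```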


# Conclusion
---


## Skills Taxonomy

- Number of skills: 200
- Sources: O*NET + seed list + StackOverflow

---

## Manual API Pull

- Date of pull: 2025-10-02
- Role(s) fetched: "data scientist"
- Number of jobs fetched: 10
- Number of jobs validated: 10
- Any missing fields or anomalies: None

---

## Insights

- Any common skill mentions? ['python', 'r', 'aws', 'c', 'spark']
- Are salary fields mostly missing/present? Mostly missing
- Location parsing observations: ['Raleigh, Wake County', 'Lake Mary, Seminole County', 'The Gap, Chicago', 'West Slope, Washington County', 'Dahlgren, King George County']

---
Skills taxonomy built and saved: `skills_taxonomy.json`  
Manual API pull successful: 10 jobs fetched and validated  
Common skills extracted: ['analyze', 'aws', 'c', 'dig', 'git']  
Unique locations observed: ['Raleigh, Wake County', 'Lake Mary, Seminole County', ...]  

**Next Steps:**  
1. Automate weekly job fetching via `fetch_jobs.py`  
2. Add monitoring and logging for pipeline  
3. Prepare mapping function for skill extraction using updated taxonomy
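
As a starting point for step 3, the mapping function could compile every skill name and alias into one regular expression and map each match back to its canonical skill. This is a rough sketch under assumptions: the names `build_matcher` and `extract_skills`, the character set used for the boundaries, and the sample taxonomy and text are all hypothetical, and the taxonomy layout is the one saved in `skills_taxonomy.json`.

```python
import re


def build_matcher(taxonomy):
    """Return (compiled pattern, {name_or_alias: canonical skill})."""
    index = {skill.lower(): skill for skill in taxonomy}
    for skill, meta in taxonomy.items():
        for alias in meta.get("aliases", []):
            index.setdefault(alias.lower(), skill)
    # Longest names first so "python3" is tried before "python"
    names = sorted(index, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # Lookarounds rather than \b, so names ending in symbols (e.g. "c++") can match
    pattern = re.compile(r"(?<![a-z0-9+#])(" + alternation + r")(?![a-z0-9+#])")
    return pattern, index


def extract_skills(text, pattern, index):
    """Return the sorted canonical skills mentioned in the text."""
    return sorted({index[m.group(1)] for m in pattern.finditer(text.lower())})


# Hypothetical taxonomy excerpt and job text
sample_taxonomy = {
    "python": {"source": "seed", "aliases": ["py", "python3"]},
    "r": {"source": "seed", "aliases": ["r language"]},
    "sql": {"source": "seed"},
    "aws": {"source": "seed"},
}
pattern, index = build_matcher(sample_taxonomy)
text = "We use Python3 and SQL on AWS; R&D experience is a plus."

print(extract_skills(text, pattern, index))  # ['aws', 'python', 'r', 'sql']
```

The sample also shows a limitation: `r` is reported because of "R&D", so single-letter skills such as `r` or `c` would likely need extra context rules before their counts can be trusted.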