# Skill Frequency Analysis

This notebook analyzes curated job description datasets to identify
frequently required skills for different tech roles.

The goal is to validate data assumptions and support explainable
skill gap analysis.

In [1]:
import json
from collections import Counter
from pathlib import Path
import pandas as pd

In [3]:
DATA_DIR = Path("../data/mock_job_descriptions")

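def load_role_data(file_path):
    """Load one role file; expects the keys "role" and "descriptions"."""
    # Explicit encoding avoids platform-dependent defaults
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)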

roles_data = {}

for file in DATA_DIR.glob("*.json"):
    role_data = load_role_data(file)
    role_name = role_data["role"]
    roles_data[role_name] = role_data["descriptions"]

roles_data.keys()

dict_keys(['Data Analyst', 'Machine Learning Engineer', 'Backend Software Engineer'])
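
The loader above relies on each JSON file having a top-level `role` string and a `descriptions` list whose items carry a `skills` list. Since the stated goal is to validate data assumptions, a minimal sketch of a shape check that could run before counting is shown below. The helper name `validate_role_data` and the sample records are hypothetical; only the three keys come from the loading and extraction code. The lowercase check is an assumption based on the skill names seen in the outputs.

```python
def validate_role_data(role_data):
    """Return a list of problems found in one role record (empty list means OK).

    Hypothetical helper: it only checks the keys the notebook already reads
    ("role", "descriptions", "skills").
    """
    if not isinstance(role_data, dict):
        return ["top level must be a JSON object"]

    problems = []
    role = role_data.get("role")
    if not isinstance(role, str) or not role:
        problems.append("missing or empty 'role'")

    descriptions = role_data.get("descriptions")
    if not isinstance(descriptions, list) or not descriptions:
        problems.append("'descriptions' must be a non-empty list")
        return problems

    for i, desc in enumerate(descriptions):
        skills = desc.get("skills") if isinstance(desc, dict) else None
        if not isinstance(skills, list) or not skills:
            problems.append(f"description {i}: 'skills' must be a non-empty list")
        elif any(not isinstance(s, str) for s in skills):
            problems.append(f"description {i}: non-string skill")
        elif any(s != s.strip().lower() for s in skills):
            # Assumption: skills are stored lowercase and trimmed
            problems.append(f"description {i}: skill not lowercase/stripped")
    return problems


# Illustrative records, not taken from the dataset files
good = {"role": "Data Analyst", "descriptions": [{"skills": ["sql", "excel"]}]}
bad = {"role": "Data Analyst", "descriptions": [{"skills": ["SQL ", "excel"]}]}

print(validate_role_data(good))
print(validate_role_data(bad))
```

Calling this on the result of `load_role_data(file)` inside the loading loop would surface malformed files before they distort the counts.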

In [4]:
def extract_skills(descriptions):
    """Flatten the "skills" lists of all descriptions into one list (duplicates kept)."""
    skills = []
    for desc in descriptions:
        skills.extend(desc["skills"])
    return skills

role_skills = {
    role: extract_skills(descriptions)
    for role, descriptions in roles_data.items()
}

role_skills

{'Data Analyst': ['sql',
  'excel',
  'data visualization',
  'statistics',
  'python',
  'sql',
  'tableau',
  'power bi',
  'business analysis',
  'statistics',
  'sql',
  'python',
  'pandas',
  'data cleaning',
  'data storytelling'],
 'Machine Learning Engineer': ['python',
  'machine learning',
  'data preprocessing',
  'model evaluation',
  'sql',
  'python',
  'scikit-learn',
  'deep learning',
  'model deployment',
  'docker',
  'python',
  'machine learning',
  'statistics',
  'experimentation',
  'model tuning'],
 'Backend Software Engineer': ['python',
  'sql',
  'rest api',
  'docker',
  'python',
  'fastapi',
  'postgresql',
  'aws']}

In [5]:
role_skill_frequency = {
    role: Counter(skills)
    for role, skills in role_skills.items()
}

role_skill_frequency

{'Data Analyst': Counter({'sql': 3,
          'statistics': 2,
          'python': 2,
          'excel': 1,
          'data visualization': 1,
          'tableau': 1,
          'power bi': 1,
          'business analysis': 1,
          'pandas': 1,
          'data cleaning': 1,
          'data storytelling': 1}),
 'Machine Learning Engineer': Counter({'python': 3,
          'machine learning': 2,
          'data preprocessing': 1,
          'model evaluation': 1,
          'sql': 1,
          'scikit-learn': 1,
          'deep learning': 1,
          'model deployment': 1,
          'docker': 1,
          'statistics': 1,
          'experimentation': 1,
          'model tuning': 1}),
 'Backend Software Engineer': Counter({'python': 2,
          'sql': 1,
          'rest api': 1,
          'docker': 1,
          'fastapi': 1,
          'postgresql': 1,
          'aws': 1})}
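
Raw counts are hard to compare across roles when the number of descriptions per role differs. One possible complement, sketched below, is the share of descriptions that mention each skill. The function name `skill_share` is hypothetical, and the grouping of the Backend skills into two descriptions is an assumption made for illustration; the notebook output only shows the flattened list.

```python
from collections import Counter


def skill_share(descriptions):
    """Fraction of descriptions that mention each skill (0.0 to 1.0)."""
    n = len(descriptions)
    if n == 0:
        return {}
    # set() guards against a skill listed twice in one description
    counts = Counter(
        skill for desc in descriptions for skill in set(desc["skills"])
    )
    return {skill: count / n for skill, count in counts.items()}


# Same shape as roles_data[role]; the split into two descriptions is assumed
backend_descriptions = [
    {"skills": ["python", "sql", "rest api", "docker"]},
    {"skills": ["python", "fastapi", "postgresql", "aws"]},
]

shares = skill_share(backend_descriptions)
print(shares["python"])  # 1.0
print(shares["sql"])     # 0.5
```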

In [6]:
def frequency_to_df(counter):
    return (
        pd.DataFrame(counter.items(), columns=["skill", "frequency"])
        .sort_values(by="frequency", ascending=False)
        .reset_index(drop=True)
    )

role_skill_dfs = {
    role: frequency_to_df(counter)
    for role, counter in role_skill_frequency.items()
}

role_skill_dfs

{'Data Analyst':                  skill  frequency
 0                  sql          3
 1               python          2
 2           statistics          2
 3   data visualization          1
 4                excel          1
 5              tableau          1
 6             power bi          1
 7    business analysis          1
 8               pandas          1
 9        data cleaning          1
 10   data storytelling          1,
 'Machine Learning Engineer':                  skill  frequency
 0               python          3
 1     machine learning          2
 2   data preprocessing          1
 3     model evaluation          1
 4                  sql          1
 5         scikit-learn          1
 6        deep learning          1
 7     model deployment          1
 8               docker          1
 9           statistics          1
 10     experimentation          1
 11        model tuning          1,
 'Backend Software Engineer':         skill  frequency
 0      python          2
 1         sql          1
 2    rest api          1
 3      docker          1
 4     fastapi          1
 5  postgresql          1
 6         aws          1}
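
In the tables above, skills with equal frequency do not keep the order they had in the `Counter` (for example, `python` moves ahead of `statistics`). This is consistent with `sort_values` using a default sort algorithm that is not guaranteed to be stable. If a reproducible order matters, one option is to break ties alphabetically before building the DataFrame. The sketch below does this in plain Python; `ordered_items` is a hypothetical helper name.

```python
from collections import Counter


def ordered_items(counter):
    """Sort (skill, frequency) pairs by frequency descending, then skill ascending."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))


counts = Counter({"sql": 3, "statistics": 2, "python": 2, "excel": 1})
print(ordered_items(counts))
# [('sql', 3), ('python', 2), ('statistics', 2), ('excel', 1)]
```

The sorted pairs can be passed to `pd.DataFrame(..., columns=["skill", "frequency"])` without a further sort. Within pandas, `sort_values(by=["frequency", "skill"], ascending=[False, True])` should give the same ordering.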

## Backend Software Engineer

In [7]:
role_skill_dfs["Backend Software Engineer"]

        skill  frequency
0      python          2
1         sql          1
2    rest api          1
3      docker          1
4     fastapi          1
5  postgresql          1
6         aws          1


## Data Analyst

In [8]:
role_skill_dfs["Data Analyst"]

                 skill  frequency
0                  sql          3
1               python          2
2           statistics          2
3   data visualization          1
4                excel          1
5              tableau          1
6             power bi          1
7    business analysis          1
8               pandas          1
9        data cleaning          1
10   data storytelling          1


## Machine Learning Engineer

In [9]:
role_skill_dfs["Machine Learning Engineer"]

                 skill  frequency
0               python          3
1     machine learning          2
2   data preprocessing          1
3     model evaluation          1
4                  sql          1
5         scikit-learn          1
6        deep learning          1
7     model deployment          1
8               docker          1
9           statistics          1
10     experimentation          1
11        model tuning          1


In [10]:
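def assign_importance(df):
    """Add an "importance" tier based on frequency relative to the role's top skill."""
    df = df.copy()  # leave the caller's DataFrame untouched
    max_freq = df["frequency"].max()

    def tier(freq):
        # Thresholds are fractions of this role's highest frequency
        if freq >= max_freq * 0.75:
            return "must-have"
        elif freq >= max_freq * 0.4:
            return "common"
        else:
            return "nice-to-have"

    df["importance"] = df["frequency"].apply(tier)
    return df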

role_skill_tiers = {
    role: assign_importance(df)
    for role, df in role_skill_dfs.items()
}

role_skill_tiers

{'Data Analyst':                  skill  frequency    importance
 0                  sql          3     must-have
 1               python          2        common
 2           statistics          2        common
 3   data visualization          1  nice-to-have
 4                excel          1  nice-to-have
 5              tableau          1  nice-to-have
 6             power bi          1  nice-to-have
 7    business analysis          1  nice-to-have
 8               pandas          1  nice-to-have
 9        data cleaning          1  nice-to-have
 10   data storytelling          1  nice-to-have,
 'Machine Learning Engineer':                  skill  frequency    importance
 0               python          3     must-have
 1     machine learning          2        common
 2   data preprocessing          1  nice-to-have
 3     model evaluation          1  nice-to-have
 4                  sql          1  nice-to-have
 5         scikit-learn          1  nice-to-have
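 6        deep learning          1  nice-to-have
 7     model deployment          1  nice-to-have
 8               docker          1  nice-to-have
 9           statistics          1  nice-to-have
 10     experimentation          1  nice-to-have
 11        model tuning          1  nice-to-have,
 'Backend Software Engineer':         skill  frequency  importance
 0      python          2   must-have
 1         sql          1      common
 2    rest api          1      common
 3      docker          1      common
 4     fastapi          1      common
 5  postgresql          1      common
 6         aws          1      common}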
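
Because the thresholds in `assign_importance` are fractions of each role's own maximum frequency, the same raw count can land in different tiers for different roles. The sketch below restates the tier rule with `max_freq` passed explicitly (an adaptation for illustration, not the notebook's function) and applies it to a skill mentioned once.

```python
def tier(freq, max_freq):
    """Same thresholds as assign_importance, with max_freq passed explicitly."""
    if freq >= max_freq * 0.75:
        return "must-have"
    elif freq >= max_freq * 0.4:
        return "common"
    return "nice-to-have"


# A skill seen once, under the two per-role maxima that occur in this dataset
for max_freq in (3, 2):
    print(max_freq, tier(1, max_freq))
# 3 nice-to-have   (1 < 3 * 0.4 = 1.2)
# 2 common         (1 >= 2 * 0.4 = 0.8)
```

For Backend Software Engineer, where the top frequency is 2, every single-mention skill is therefore labeled "common", while single-mention skills for the other two roles are "nice-to-have". Tiers are best compared within a role rather than across roles.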

### Observations

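- Backend roles center on Python, the only skill listed more than once (2 mentions); SQL, REST API, FastAPI, PostgreSQL, Docker, and AWS each appear once.
- Data Analyst roles prioritize SQL (3 mentions), followed by Python and statistics (2 each); no ML-specific skills appear.
- ML Engineer roles blend modeling skills (machine learning, model evaluation, model tuning, deep learning) with engineering skills (Docker, model deployment), led by Python (3 mentions).
- These frequency patterns are consistent with common expectations for each role.

This supports using the curated dataset as a market-signal proxy.
With only 8 to 15 skill mentions per role, the counts serve as a
sanity check on the data rather than as statistical evidence.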

### Why this matters

This notebook demonstrates:
- data-driven reasoning
- explainable analysis
- careful validation before backend implementation

The results here directly inform:
- skill gap prioritization
- resume alignment logic
- role proximity evaluation
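
As an illustration of how the tiers could feed skill gap prioritization, the sketch below lists the skills a candidate lacks for a role, most important tier first. Everything here is hypothetical: the function `skill_gaps`, the `TIER_ORDER` mapping, and the candidate's skills are assumptions, and the tier dictionary is a subset of the Data Analyst output shown earlier.

```python
# Lower number means higher priority
TIER_ORDER = {"must-have": 0, "common": 1, "nice-to-have": 2}


def skill_gaps(candidate_skills, skill_tiers):
    """Return (skill, tier) pairs the candidate lacks, most important first."""
    have = {s.strip().lower() for s in candidate_skills}
    missing = [(skill, t) for skill, t in skill_tiers.items() if skill not in have]
    # Sort by tier priority, then alphabetically for a reproducible order
    return sorted(missing, key=lambda item: (TIER_ORDER[item[1]], item[0]))


# Subset of the Data Analyst tiers from the output above
data_analyst_tiers = {
    "sql": "must-have",
    "python": "common",
    "statistics": "common",
    "excel": "nice-to-have",
    "tableau": "nice-to-have",
}

print(skill_gaps(["Python", "Excel"], data_analyst_tiers))
# [('sql', 'must-have'), ('statistics', 'common'), ('tableau', 'nice-to-have')]
```

A tier dictionary of this shape can be built from a tiered DataFrame with `dict(zip(df["skill"], df["importance"]))`.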