# Phase 0: Setup and Initial Data Extraction

## Purpose
This notebook documents the initial setup for the Job Market Analyzer project. It includes:

1. **Environment Setup**
   - Installed libraries
   - Folder structure confirmation
   - GitHub repository connection
   - API keys stored in `.env`

2. **O*NET Skills Database Extraction**
   - Loading relevant O*NET Excel/CSV files
   - Exploring unique skills, technology skills, and tools
   - Notes on data cleaning and preparation for future use

3. **API Testing**
   - Testing Adzuna API connection
   - Checking if API keys work
   - Extracting sample job postings
   - Inspecting JSON structure and available fields
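
The folder structure confirmation listed under Environment Setup could be sketched roughly as follows. This is a minimal sketch, not the project's actual check: the helper `missing_dirs` and the `EXPECTED_DIRS` list are hypothetical, and the only folders assumed are `data/raw` and `data/processed`, which are the two paths this notebook reads from and writes to.

```python
from pathlib import Path

# Assumption: only these two folders are known from the paths used in this notebook.
EXPECTED_DIRS = ["data/raw", "data/processed"]


def missing_dirs(project_root, expected=EXPECTED_DIRS):
    """Return the expected sub-folders that do not exist under project_root."""
    root = Path(project_root)
    return [d for d in expected if not (root / d).is_dir()]


# The notebook uses ../data/..., so the project root is assumed to be one level up.
print("Missing folders:", missing_dirs(".."))
```

An empty list means both folders are present. Anything listed would need to be created before the later cells write to `../data/processed`.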

## Notes
- This notebook is purely exploratory and preparatory.
- No models or visualizations will be built here.
- All data saved here will be used in later phases for modeling, skill extraction, and dashboard development.


In [9]:
import pandas as pd
import json

# Load the O*NET "Technology Skills" table (one row per occupation/tool pair)
onet_df = pd.read_csv('../data/raw/technology skills.csv')

In [10]:
onet_df.head()

| | O*NET-SOC Code | Title | Example | Commodity Code | Commodity Title | Hot Technology | In Demand |
|---|---|---|---|---|---|---|---|
| 0 | 11-1011.00 | Chief Executives | Adobe Acrobat | 43232202 | Document management software | Y | N |
| 1 | 11-1011.00 | Chief Executives | AdSense Tracker | 43232306 | Data base user interface and query software | N | N |
| 2 | 11-1011.00 | Chief Executives | Atlassian JIRA | 43232201 | Content workflow software | Y | N |
| 3 | 11-1011.00 | Chief Executives | Blackbaud The Raiser's Edge | 43232303 | Customer relationship management CRM software | N | N |
| 4 | 11-1011.00 | Chief Executives | ComputerEase construction accounting software | 43231601 | Accounting software | N | N |


In [11]:
# Step 1: Extract O*NET tool names
onet_skills = onet_df["Example"].dropna().str.lower().unique().tolist()

# Step 2: Add seed list of must-have DS/ML skills
seed_skills = [
    "python", "r", "sql", "pytorch", "tensorflow", "keras", "scikit-learn",
    "pandas", "numpy", "matplotlib", "seaborn", "spark", "hadoop", "aws",
    "azure", "gcp", "streamlit", "flask", "fastapi", "docker", "kubernetes"
]

# Merge & deduplicate
all_skills = sorted(set(onet_skills + seed_skills))

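# Build the taxonomy, tagging each skill with its source.
# A set keeps each membership test O(1) instead of scanning the full list.
onet_skill_set = set(onet_skills)
taxonomy = {skill: {"source": "onet" if skill in onet_skill_set else "seed"}
            for skill in all_skills}

# Save taxonomy
with open("../data/processed/skills_taxonomy.json", "w", encoding="utf-8") as f:
    json.dump(taxonomy, f, indent=2)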

print(f"Total taxonomy size: {len(taxonomy)}")


Total taxonomy size: 8785
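
To sanity-check the merge logic without loading the full O*NET file, the same steps can be wrapped in a small function and run on a handful of values. This is a sketch only: `build_taxonomy` is a hypothetical helper, not part of the notebook, and the sample inputs are made up for illustration. It mirrors the cell above in that a skill found in both lists is tagged `"onet"`.

```python
def build_taxonomy(onet_examples, seed_skills):
    """Map each lower-cased skill to its source; "onet" wins over "seed"."""
    # Skip non-string entries, mirroring the dropna() step in the notebook cell.
    onet = {s.lower() for s in onet_examples if isinstance(s, str)}
    seed = {s.lower() for s in seed_skills}
    return {
        skill: {"source": "onet" if skill in onet else "seed"}
        for skill in sorted(onet | seed)
    }


# Illustrative inputs only; the real O*NET column has thousands of entries.
sample = build_taxonomy(["Adobe Acrobat", "Python", "python", None],
                        ["python", "docker"])
print(sample)
# {'adobe acrobat': {'source': 'onet'}, 'docker': {'source': 'seed'}, 'python': {'source': 'onet'}}
```

Because lower-casing happens before deduplication, `"Python"` and `"python"` collapse into one entry.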


In [13]:
onet_df.columns

Index(['O*NET-SOC Code', 'Title', 'Example', 'Commodity Code',
       'Commodity Title', 'Hot Technology', 'In Demand'],
      dtype='object')

In [4]:
#%pip install python-dotenv

In [5]:
# API TESTING
from dotenv import load_dotenv
import os

load_dotenv()  # loads your .env file

ADZUNA_APP_ID = os.getenv("ADZUNA_APP_ID")
ADZUNA_APP_KEY = os.getenv("ADZUNA_APP_KEY")
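
`os.getenv` returns `None` when a variable is missing, so a misnamed or absent `.env` entry would only surface later as a failed API call. A quick check along these lines can catch that earlier. It is a sketch: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the check assumes `load_dotenv()` has already been called as in the cell above.

```python
import os

# The two variable names used in this notebook.
REQUIRED_VARS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("Both Adzuna credentials are set.")
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real environment.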

In [8]:
import requests

# Example: get 5 job postings for "Data Scientist" in India ("in" in the path is the country code)
url = "https://api.adzuna.com/v1/api/jobs/in/search/1"
params = {
    "app_id": ADZUNA_APP_ID,
    "app_key": ADZUNA_APP_KEY,
    "what": "Data Scientist",
    "where": "India",
    "results_per_page": 5,
    "content-type": "application/json"
}

# requests has no default timeout, so set one to avoid hanging on a stalled connection
response = requests.get(url, params=params, timeout=10)

# Check if the API request was successful
if response.status_code == 200:
    data = response.json() # Convert the JSON response into a Python dictionary
    jobs = data["results"] # Extract the list of job postings from the dictionary
    # Print how many jobs were fetched
    print("Adzuna API works! Number of jobs fetched:", len(jobs))
    # Print the keys (fields) of the first job posting, if any, to understand the structure
    if jobs:
        print("Sample job keys:", list(jobs[0].keys()))
else:
    # If the API request failed, print the status code and the response text for debugging
    print("Adzuna API failed.")
    print("Status code:", response.status_code)
    print("Response message:", response.text)

Adzuna API works! Number of jobs fetched: 5
Sample job keys: ['created', 'id', 'redirect_url', 'title', 'description', '__CLASS__', 'salary_is_predicted', 'adref', 'category', 'location', 'company']
