# Job Market Analysis Tool

## Data-Driven Analysis of the Tech Job Market

**Author:** Computer Science Graduate  
**Date:** November 2025  
**Purpose:** Analyze job market trends, in-demand skills, and salary patterns

---

## 📋 Table of Contents
1. [Setup & Imports](#setup)
2. [Data Collection](#data-collection)
3. [Data Cleaning & Preprocessing](#data-cleaning)
4. [Exploratory Data Analysis](#eda)
5. [Skill Extraction & Analysis](#skill-extraction)
6. [Data Visualization](#visualization)
7. [Machine Learning - Job Clustering](#clustering)
8. [Key Insights & Recommendations](#insights)

---


## 1. Setup & Imports <a id='setup'></a>

Import necessary libraries for data processing, visualization, and analysis.


In [23]:
# Data Processing
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud

# Natural Language Processing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Utilities
from collections import Counter
import os
from datetime import datetime

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully!")
print(f"📊 Pandas Version: {pd.__version__}")
print(f"🔢 NumPy Version: {np.__version__}")


✅ All libraries imported successfully!
📊 Pandas Version: 2.2.3
🔢 NumPy Version: 2.0.2


In [24]:
# Download NLTK data (run once)
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/stopwords')
    print("✅ NLTK data already downloaded")
except LookupError:
    print("📥 Downloading NLTK data...")
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger')
    print("✅ NLTK data downloaded successfully!")


✅ NLTK data already downloaded


## 2. Data Collection <a id='data-collection'></a>

### Dataset Information

**Recommended Dataset:** [LinkedIn Job Postings - 2023](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)

**Instructions:**
1. Download the dataset from Kaggle
2. Place the CSV file in `../data/raw/` folder
3. Update the file path below

**Alternative datasets:**
- [Data Science Job Postings](https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor)
- [Job Posts Data](https://www.kaggle.com/datasets/madhab/jobposts)


In [25]:
# Load the dataset
# Update this path to match your downloaded dataset file name
data_path = '../data/raw/jobs_bayt_3.csv'

# Alternative: if you have a different dataset, update the path
# data_path = '../data/raw/job_postings.csv'

try:
    df = pd.read_csv(data_path)
    print("✅ Dataset loaded successfully!")
    print(f"📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
except FileNotFoundError:
    print("⚠️ Dataset not found!")
    print(f"Please download a job postings dataset and place it at: {data_path}")
    print("\nRecommended: https://www.kaggle.com/datasets/arshkon/linkedin-job-postings")
    
    # Create sample data for demonstration purposes
    print("\n🔄 Creating sample dataset for demonstration...")
    df = pd.DataFrame({
        'job_title': ['Data Scientist', 'Software Engineer', 'ML Engineer', 'Data Analyst', 'Full Stack Developer'] * 200,
        'company': ['Company A', 'Company B', 'Company C', 'Company D', 'Company E'] * 200,
        'location': ['Riyadh, Saudi Arabia', 'Jeddah, Saudi Arabia', 'Dubai, UAE', 'Remote', 'Dammam, Saudi Arabia'] * 200,
        'description': [
            'Looking for Data Scientist with Python, SQL, Machine Learning, TensorFlow experience',
            'Software Engineer needed. Java, Spring Boot, AWS, Docker required',
            'ML Engineer position. Python, PyTorch, Kubernetes, MLOps skills needed',
            'Data Analyst role. SQL, Tableau, Excel, Power BI required',
            'Full Stack Developer. React, Node.js, MongoDB, JavaScript expertise needed'
        ] * 200,
        'salary': np.random.choice([None, '80000-120000', '100000-150000', '60000-90000'], 1000),
        'experience_level': np.random.choice(['Entry Level', 'Mid Level', 'Senior Level', 'Lead'], 1000),
        'posted_date': pd.date_range('2023-01-01', periods=1000, freq='D')
    })
    print(f"✅ Sample dataset created: {df.shape[0]:,} rows × {df.shape[1]} columns")


✅ Dataset loaded successfully!
📊 Shape: 3,527 rows × 26 columns


In [26]:
# Display basic information
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)
df.head(10)


DATASET OVERVIEW


Unnamed: 0,Job ID,Title,Job URL,Company,Company_URL,Date Posted,Job Description,Job Location,Company Industry,Company Type,...,Tags,Gender,Job Country,Job City,Min Years of Experience,Max Years of Experience,Min Age,Max Age,Monthly Salary Min Range,Monthly Salary Max Range
0,4177110,OPERATIONS OFFICER,https://www.bayt.com/en/saudi-arabia/jobs/role...,NBP,https://www.bayt.com/en/company/nbp-1891015/,2020-04-15,Customer Service: Ensure provision of Assist...,"Riyadh, Saudi Arabia",Banking,Employer (Public Sector),...,"['Banking', 'Business Administration', 'Operat...",,Saudi Arabia,Riyadh,1.0,,,,,
1,4177085,Branch Administration Manager,https://www.bayt.com/en/saudi-arabia/jobs/role...,Kinetic Business Solutions,https://www.bayt.com/en/company/kinetic-busine...,2020-04-15,Our client is a conglomerate within the heal...,"Riyadh, Saudi Arabia",Medical Clinic,Recruitment Agency,...,"['Hospital Operations', 'Community Health', 'C...",,Saudi Arabia,Riyadh,5.0,,,,,
2,4174537,Admin Assistant(Analytical Department),https://www.bayt.com/en/saudi-arabia/jobs/role...,Jubail,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,1. Keep a record of appointments and meeting...,"Jubail, Saudi Arabia",Oil & Gas,Employer (Private Sector),...,['Administration'],,Saudi Arabia,Jubail,,,,,,
3,4177016,فنيّ معلومات ومطور برامج,https://www.bayt.com/en/saudi-arabia/jobs/role...,Jeddah,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,فنيّ معلومات ومطور برامج Skills 1.التحدث بال...,"Jeddah, Saudi Arabia",Facilities & Property Management; Corporate Ma...,Employer (Private Sector),...,"['Applications Support', 'Email Management', '...",,Saudi Arabia,Jeddah,1.0,5.0,18.0,40.0,,
4,4177035,Admin Assistant,https://www.bayt.com/en/saudi-arabia/jobs/role...,Jubail,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,provide general administrative and clerical ...,"Jubail, Saudi Arabia",Oil & Gas,Employer (Private Sector),...,['Administration'],,Saudi Arabia,Jubail,,,,,,
5,4177019,Document Controller,https://www.bayt.com/en/saudi-arabia/jobs/role...,Jubail,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,"Responsible for collecting, sorting, storing...","Jubail, Saudi Arabia",Oil & Gas,Employer (Private Sector),...,['Administration'],,Saudi Arabia,Jubail,,,,,,
6,4176692,Airfield Operations Specialist,https://www.bayt.com/en/saudi-arabia/jobs/role...,Riyadh,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-14,Possesses general knowledge and experience i...,"Riyadh, Saudi Arabia",Aviation Support Services,Employer (Private Sector),...,[],Male,Saudi Arabia,Riyadh,4.0,,22.0,55.0,,
7,4175775,سائق ومندوب,https://www.bayt.com/en/saudi-arabia/jobs/role...,Fakieh Group,https://www.bayt.com/en/company/fakieh-group-2...,2020-04-08,توصيل منتجاتللمنازل Skills النظافة الشخصية ر...,"Western Province, Saudi Arabia",Animal Production,Employer (Private Sector),...,[],Male,Saudi Arabia,Western Province,0.0,2.0,20.0,40.0,1000.0,1500.0
8,4175350,Company Available positions,https://www.bayt.com/en/saudi-arabia/jobs/role...,Delicious Food Company,https://www.bayt.com/en/company/delicious-food...,2020-04-06,If you’re looking for a new challenging care...,"Al Olaya, Riyadh , Saudi Arabia","Catering, Food Service, & Restaurant",Employer (Private Sector),...,['Administrative'],,Saudi Arabia,Riyadh,2.0,20.0,,,,
9,3908800,HR Officer,https://www.bayt.com/en/saudi-arabia/jobs/role...,Rentokil Saudi Arabia,https://www.bayt.com/en/company/rentokil-saudi...,2020-04-05,- Assisting with preparation for disciplinar...,"Riyadh, Saudi Arabia",Facilities & Property Management,Employer (Private Sector),...,"['Human Resources', 'Public Relations', 'MIS',...",,Saudi Arabia,Riyadh,2.0,5.0,23.0,45.0,,


In [27]:
# Dataset information
print("\n" + "=" * 80)
print("DATASET INFO")
print("=" * 80)
df.info()



DATASET INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3527 entries, 0 to 3526
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Job ID                    3527 non-null   int64  
 1   Title                     3527 non-null   object 
 2   Job URL                   3527 non-null   object 
 3   Company                   3527 non-null   object 
 4   Company_URL               3527 non-null   object 
 5   Date Posted               3527 non-null   object 
 6   Job Description           3527 non-null   object 
 7   Job Location              3526 non-null   object 
 8   Company Industry          3527 non-null   object 
 9   Company Type              3527 non-null   object 
 10  Job Role                  3527 non-null   object 
 11  Employment Type           3527 non-null   object 
 12  Number of Vacancies       3527 non-null   object 
 13  Career Level              1019 non-null   object 

In [28]:
# Check for missing values
print("\n" + "=" * 80)
print("MISSING VALUES")
print("=" * 80)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
}).sort_values('Missing Count', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0].to_string(index=False))



MISSING VALUES
                  Column  Missing Count  Percentage
Monthly Salary Min Range           3365   95.406861
Monthly Salary Max Range           3365   95.406861
                 Min Age           3315   93.989226
                 Max Age           3312   93.904168
 Max Years of Experience           3197   90.643606
                  Gender           3160   89.594556
      Residence Location           2951   83.668840
                  Degree           2824   80.068046
 Min Years of Experience           2812   79.727814
            Career Level           2508   71.108591
                Job City            856   24.269918
            Job Location              1    0.028353
             Job Country              1    0.028353
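
Several columns in the table above are more than 80% empty, so it can help to separate sparse columns from usable ones before any analysis. The following is a minimal sketch of that split; the function name `split_by_missingness`, the threshold values, and the small `demo` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def split_by_missingness(frame, threshold=0.8):
    """Return (dense_cols, sparse_cols) based on the share of missing values.

    `threshold` is an assumed cut-off: columns whose missing share is
    strictly above it are treated as too sparse to analyze directly.
    """
    missing_share = frame.isna().mean()  # fraction of NaN/None per column
    sparse_cols = missing_share[missing_share > threshold].index.tolist()
    dense_cols = [col for col in frame.columns if col not in sparse_cols]
    return dense_cols, sparse_cols

# Tiny hypothetical frame that mimics the pattern seen in the real data
demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Monthly Salary Min Range": [None, None, None, None, 1000.0],
    "Job City": ["Riyadh", None, "Jeddah", "Riyadh", "Jubail"],
})

dense_cols, sparse_cols = split_by_missingness(demo, threshold=0.5)
print("Sparse columns:", sparse_cols)
print("Dense columns:", dense_cols)
```

On the real dataset, calling `split_by_missingness(df)` with the default threshold would flag the salary, age, and gender columns, which suggests that salary conclusions drawn from this file rest on a small subset of postings.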

## 3. Data Cleaning & Preprocessing <a id='data-cleaning'></a>

Remove duplicates, handle missing values, and normalize text fields.


In [29]:
# Create a copy for cleaning
df_clean = df.copy()

print("🧹 Starting data cleaning process...\n")

# 1. Remove duplicates
initial_rows = len(df_clean)
df_clean = df_clean.drop_duplicates()
duplicates_removed = initial_rows - len(df_clean)
print(f"✅ Removed {duplicates_removed:,} duplicate rows")

# 2. Handle missing values in critical columns
# For job title and description (critical fields)
if 'job_title' in df_clean.columns:
    df_clean = df_clean.dropna(subset=['job_title'])
    
if 'description' in df_clean.columns:
    df_clean['description'] = df_clean['description'].fillna('')
    
# For other columns, fill with 'Unknown' or appropriate value
for col in df_clean.columns:
    if df_clean[col].dtype == 'object':
        df_clean[col] = df_clean[col].fillna('Unknown')

print("✅ Handled missing values")

# 3. Standardize text columns (strip surrounding whitespace)
text_columns = ['job_title', 'company', 'location', 'description']
for col in text_columns:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].astype(str).str.strip()

print("✅ Standardized text columns")

# 4. Clean job titles (normalize similar titles)
if 'job_title' in df_clean.columns:
    df_clean['job_title_clean'] = df_clean['job_title'].str.lower()
    
    # Normalize common variations
    title_mapping = {
        r'data scientist.*': 'Data Scientist',
        r'machine learning.*': 'Machine Learning Engineer',
        r'ml engineer.*': 'Machine Learning Engineer',
        r'software engineer.*': 'Software Engineer',
        r'software developer.*': 'Software Engineer',
        r'data analyst.*': 'Data Analyst',
        r'data engineer.*': 'Data Engineer',
        r'full.?stack.*': 'Full Stack Developer',
        r'frontend.*|front.end.*': 'Frontend Developer',
        r'backend.*|back.end.*': 'Backend Developer',
        r'devops.*': 'DevOps Engineer',
        r'cloud.*engineer.*': 'Cloud Engineer',
        r'product manager.*': 'Product Manager',
        r'business analyst.*': 'Business Analyst'
    }
    
    for pattern, normalized_title in title_mapping.items():
        df_clean.loc[df_clean['job_title_clean'].str.contains(pattern, case=False, na=False), 
                     'job_title_clean'] = normalized_title
    
    print("✅ Normalized job titles")

print("\n✅ Data cleaning complete!")
print(f"📊 Final dataset: {len(df_clean):,} rows × {len(df_clean.columns)} columns")


🧹 Starting data cleaning process...

✅ Removed 0 duplicate rows
✅ Handled missing values
✅ Standardized text columns

✅ Data cleaning complete!
📊 Final dataset: 3,527 rows × 26 columns
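
The output above has no "Normalized job titles" line because the loaded file uses headers such as `Title`, `Job Description`, and `Job Location`, while the cleaning and analysis cells look for `job_title`, `description`, and `location`. One way to bridge the two is to rename the columns before cleaning. The sketch below assumes the mapping shown in `COLUMN_MAP`. In particular, treating `Career Level` as `experience_level` is a guess about intent, and `harmonize_columns` is a hypothetical helper name.

```python
import pandas as pd

# Assumed mapping from the loaded file's headers to the names the
# notebook's cells check for. Adjust it to the dataset actually used.
COLUMN_MAP = {
    "Title": "job_title",
    "Company": "company",
    "Job Location": "location",
    "Job Description": "description",
    "Career Level": "experience_level",
}

def harmonize_columns(frame, column_map=COLUMN_MAP):
    """Rename only the columns that exist and would not collide."""
    present = {
        old: new
        for old, new in column_map.items()
        if old in frame.columns and new not in frame.columns
    }
    return frame.rename(columns=present)

# Hypothetical one-row frame using three of the real headers
demo = pd.DataFrame({
    "Title": ["HR Officer"],
    "Job Description": ["Assisting with preparation for hearings"],
    "Job City": ["Riyadh"],
})

renamed = harmonize_columns(demo)
print(list(renamed.columns))
```

Applying `df_clean = harmonize_columns(df.copy())` before the cleaning steps would let the later `if 'job_title' in df_clean.columns` style checks take effect. Columns outside the mapping, such as `Job City`, are left untouched.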


In [30]:
# Display cleaned data
df_clean.head()


Unnamed: 0,Job ID,Title,Job URL,Company,Company_URL,Date Posted,Job Description,Job Location,Company Industry,Company Type,...,Tags,Gender,Job Country,Job City,Min Years of Experience,Max Years of Experience,Min Age,Max Age,Monthly Salary Min Range,Monthly Salary Max Range
0,4177110,OPERATIONS OFFICER,https://www.bayt.com/en/saudi-arabia/jobs/role...,NBP,https://www.bayt.com/en/company/nbp-1891015/,2020-04-15,Customer Service: Ensure provision of Assist...,"Riyadh, Saudi Arabia",Banking,Employer (Public Sector),...,"['Banking', 'Business Administration', 'Operat...",Unknown,Saudi Arabia,Riyadh,1.0,,,,,
1,4177085,Branch Administration Manager,https://www.bayt.com/en/saudi-arabia/jobs/role...,Kinetic Business Solutions,https://www.bayt.com/en/company/kinetic-busine...,2020-04-15,Our client is a conglomerate within the heal...,"Riyadh, Saudi Arabia",Medical Clinic,Recruitment Agency,...,"['Hospital Operations', 'Community Health', 'C...",Unknown,Saudi Arabia,Riyadh,5.0,,,,,
2,4174537,Admin Assistant(Analytical Department),https://www.bayt.com/en/saudi-arabia/jobs/role...,Jubail,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,1. Keep a record of appointments and meeting...,"Jubail, Saudi Arabia",Oil & Gas,Employer (Private Sector),...,['Administration'],Unknown,Saudi Arabia,Jubail,,,,,,
3,4177016,فنيّ معلومات ومطور برامج,https://www.bayt.com/en/saudi-arabia/jobs/role...,Jeddah,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,فنيّ معلومات ومطور برامج Skills 1.التحدث بال...,"Jeddah, Saudi Arabia",Facilities & Property Management; Corporate Ma...,Employer (Private Sector),...,"['Applications Support', 'Email Management', '...",Unknown,Saudi Arabia,Jeddah,1.0,5.0,18.0,40.0,,
4,4177035,Admin Assistant,https://www.bayt.com/en/saudi-arabia/jobs/role...,Jubail,https://www.bayt.com/en/saudi-arabia/jobs/loca...,2020-04-15,provide general administrative and clerical ...,"Jubail, Saudi Arabia",Oil & Gas,Employer (Private Sector),...,['Administration'],Unknown,Saudi Arabia,Jubail,,,,,,

## 4. Exploratory Data Analysis <a id='eda'></a>

Summarize the cleaned data and look at the distribution of titles, locations, and experience levels.


In [31]:
# Statistical summary
print("=" * 80)
print("STATISTICAL SUMMARY")
print("=" * 80)
df_clean.describe(include='all')


STATISTICAL SUMMARY


Unnamed: 0,Job ID,Title,Job URL,Company,Company_URL,Date Posted,Job Description,Job Location,Company Industry,Company Type,...,Tags,Gender,Job Country,Job City,Min Years of Experience,Max Years of Experience,Min Age,Max Age,Monthly Salary Min Range,Monthly Salary Max Range
count,3527.0,3527,3527,3527,3527,3527,3527.0,3527,3527,3527,...,3527,3527,3527,3527,715.0,330.0,212.0,215.0,162.0,162.0
unique,,3114,3527,510,512,224,3332.0,72,182,5,...,740,3,2,38,,,,,,
top,,Project Manager,https://www.bayt.com/en/saudi-arabia/jobs/role...,Aramco Services Company,https://www.bayt.com/en/company/aramco-service...,2020-01-07,,"Riyadh, Saudi Arabia",Other Business Support Services,Unspecified,...,[],Unknown,Saudi Arabia,Riyadh,,,,,,
freq,,12,1,211,211,330,60.0,968,1070,2510,...,2690,3160,3526,1449,,,,,,
mean,39125620.0,,,,,,,,,,...,,,,,4.39021,8.1,24.966981,39.181395,3129.62963,4447.530864
std,22368290.0,,,,,,,,,,...,,,,,3.487097,5.088786,4.027611,7.149252,3291.979504,5518.632478
min,1403917.0,,,,,,,,,,...,,,,,0.0,0.0,18.0,25.0,0.0,500.0
25%,4174496.0,,,,,,,,,,...,,,,,2.0,5.0,22.0,35.0,1000.0,1500.0
50%,52800090.0,,,,,,,,,,...,,,,,3.0,7.0,25.0,39.0,2000.0,3000.0
75%,54561700.0,,,,,,,,,,...,,,,,5.0,10.0,27.0,40.5,4000.0,5000.0


In [32]:
# Top job titles
if 'job_title_clean' in df_clean.columns:
    print("\n" + "=" * 80)
    print("TOP 15 JOB TITLES")
    print("=" * 80)
    top_jobs = df_clean['job_title_clean'].value_counts().head(15)
    print(top_jobs.to_string())
    
    # Visualization
    fig, ax = plt.subplots(figsize=(12, 6))
    top_jobs.plot(kind='barh', ax=ax, color='steelblue')
    ax.set_xlabel('Number of Job Postings', fontsize=12)
    ax.set_ylabel('Job Title', fontsize=12)
    ax.set_title('Top 15 Most Common Job Titles', fontsize=14, fontweight='bold')
    ax.invert_yaxis()
    plt.tight_layout()
    plt.savefig('../visualizations/top_job_titles.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("⚠️ job_title_clean column not found")


⚠️ job_title_clean column not found


In [33]:
# Location distribution
if 'location' in df_clean.columns:
    print("\n" + "=" * 80)
    print("TOP LOCATIONS")
    print("=" * 80)
    top_locations = df_clean['location'].value_counts().head(10)
    print(top_locations.to_string())
    
    # Visualization
    fig, ax = plt.subplots(figsize=(12, 6))
    top_locations.plot(kind='bar', ax=ax, color='coral')
    ax.set_xlabel('Location', fontsize=12)
    ax.set_ylabel('Number of Job Postings', fontsize=12)
    ax.set_title('Top 10 Locations with Most Job Postings', fontsize=14, fontweight='bold')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('../visualizations/top_locations.png', dpi=300, bbox_inches='tight')
    plt.show()


In [34]:
# Experience level distribution
if 'experience_level' in df_clean.columns:
    print("\n" + "=" * 80)
    print("EXPERIENCE LEVEL DISTRIBUTION")
    print("=" * 80)
    exp_dist = df_clean['experience_level'].value_counts()
    print(exp_dist.to_string())
    
    # Visualization - Pie chart
    fig, ax = plt.subplots(figsize=(10, 8))
    colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
    exp_dist.plot(kind='pie', ax=ax, autopct='%1.1f%%', colors=colors, startangle=90)
    ax.set_ylabel('')
    ax.set_title('Distribution of Jobs by Experience Level', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('../visualizations/experience_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()


## 5. Skill Extraction & Analysis <a id='skill-extraction'></a>

Extract and analyze technical skills from job descriptions.
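
Regex-based skill matching is sensitive to how boundaries are written. A `\b` placed after a symbol such as `+` or `#` only matches when a word character follows, so skills like "C++" and "C#" are easy to miss. The sketch below contrasts that with lookaround-based boundaries; `skill_regex` and the sample sentence are illustrative assumptions rather than the notebook's own code.

```python
import re

def skill_regex(skill_pattern):
    """Compile a skill pattern bounded by 'no word character on either side'.

    Lookarounds are used instead of \\b so that skills ending in a
    symbol (c++, c#) still match when followed by a space or comma.
    """
    return re.compile(r"(?<!\w)" + skill_pattern + r"(?!\w)", re.IGNORECASE)

text = "Requires C++, C# and JavaScript; Java is a plus."

# Word-boundary version: fails because ',' follows '++' (no boundary there)
naive_cpp = re.search(r"\bc\+\+\b", text, re.IGNORECASE)

# Lookaround version: matches "C++" and "C#"
robust_cpp = skill_regex(r"c\+\+").search(text)
robust_csharp = skill_regex(r"c#").search(text)

# "java" must not fire inside "JavaScript", only on the standalone word
java_hits = skill_regex(r"java").findall(text)

print("naive c++ match:", naive_cpp)
print("robust c++ match:", robust_cpp.group(0))
print("robust c# match:", robust_csharp.group(0))
print("java hits:", java_hits)
```

The same boundary choice matters for short tokens such as `r` or `go`, which need strict boundaries on both sides to avoid matching inside longer words.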


In [35]:
# Define comprehensive skills dictionary
SKILLS_DICT = {
    # Programming Languages
    'Programming Languages': [
        'python', 'java', 'javascript', 'typescript', 'c\\+\\+', 'c#', 'csharp',
        'r\\b', 'matlab', 'scala', 'kotlin', 'swift', 'go\\b', 'golang', 'rust',
        'php', 'ruby', 'perl', 'julia', 'bash', 'shell'
    ],
    
    # Data Science & ML
    'Data Science & ML': [
        'machine learning', 'deep learning', 'neural network', 'tensorflow', 
        'pytorch', 'keras', 'scikit-learn', 'sklearn', 'pandas', 'numpy', 
        'matplotlib', 'seaborn', 'nlp', 'computer vision', 'opencv', 'hugging face',
        'transformers', 'bert', 'gpt', 'llm', 'artificial intelligence', 'ai\\b'
    ],
    
    # Databases
    'Databases': [
        'sql', 'mysql', 'postgresql', 'mongodb', 'redis', 'cassandra', 
        'oracle', 'sqlite', 'mssql', 'dynamodb', 'elasticsearch', 'neo4j',
        'mariadb', 'firebase', 'cosmos'
    ],
    
    # Cloud Platforms
    'Cloud Platforms': [
        'aws', 'azure', 'gcp', 'google cloud', 'cloud computing', 'lambda',
        's3\\b', 'ec2', 'kubernetes', 'docker', 'containerization', 'microservices'
    ],
    
    # Web Development
    'Web Development': [
        'react', 'angular', 'vue\\.js', 'node\\.js', 'express', 'django',
        'flask', 'spring', 'spring boot', 'asp\\.net', 'html', 'css', 
        'rest api', 'graphql', 'webpack', 'redux'
    ],
    
    # DevOps & Tools
    'DevOps & Tools': [
        'git', 'github', 'gitlab', 'jenkins', 'ci/cd', 'terraform', 
        'ansible', 'puppet', 'chef', 'circleci', 'travis', 'devops'
    ],
    
    # Big Data
    'Big Data': [
        'hadoop', 'spark', 'kafka', 'airflow', 'hive', 'pig', 'flink',
        'storm', 'etl', 'data pipeline', 'data warehouse', 'snowflake'
    ],
    
    # BI & Visualization
    'BI & Visualization': [
        'tableau', 'power bi', 'powerbi', 'looker', 'qlik', 'excel',
        'd3\\.js', 'plotly', 'dashboarding', 'data visualization'
    ],
    
    # Soft Skills
    'Soft Skills': [
        'agile', 'scrum', 'leadership', 'communication', 'teamwork',
        'problem solving', 'analytical', 'project management'
    ]
}

print("✅ Skills dictionary created with", sum(len(v) for v in SKILLS_DICT.values()), "skill patterns")


✅ Skills dictionary created with 128 skill patterns


In [36]:
# Function to extract skills from text
def extract_skills(text, skills_dict):
    """Extract skills from job description text."""
    if pd.isna(text) or text == '':
        return []
    
    text_lower = str(text).lower()
    found_skills = []
    
    for category, skills in skills_dict.items():
        for skill in skills:
            # Use lookarounds rather than \b: a trailing \b after '+' or '#'
            # only matches when a word character follows, so patterns like
            # 'c\+\+' and 'c#' would miss "c++, " and "c# "
            pattern = r'(?<!\w)' + skill + r'(?!\w)'
            if re.search(pattern, text_lower):
                # Store the skill in a cleaner format
                skill_name = skill.replace('\\b', '').replace('\\+\\+', '++').replace('\\.', '.')
                found_skills.append(skill_name)
    
    return found_skills

# Apply skill extraction
print("🔍 Extracting skills from job descriptions...")
if 'description' in df_clean.columns:
    df_clean['skills'] = df_clean['description'].apply(lambda x: extract_skills(x, SKILLS_DICT))
    df_clean['skills_count'] = df_clean['skills'].apply(len)
    print(f"✅ Skills extracted! Average {df_clean['skills_count'].mean():.1f} skills per job")
else:
    print("⚠️ Description column not found")


🔍 Extracting skills from job descriptions...
⚠️ Description column not found


In [37]:
# Analyze most in-demand skills
if 'skills' in df_clean.columns:
    # Flatten all skills into a single list
    all_skills = [skill for skills_list in df_clean['skills'] for skill in skills_list]
    
    # Count skill frequency
    skill_counts = Counter(all_skills)
    top_skills = pd.DataFrame(skill_counts.most_common(30), columns=['Skill', 'Count'])
    
    print("\n" + "=" * 80)
    print("TOP 30 MOST IN-DEMAND SKILLS")
    print("=" * 80)
    print(top_skills.to_string(index=False))
    
    # Visualize top 20 skills
    fig, ax = plt.subplots(figsize=(12, 8))
    top_20 = top_skills.head(20)
    ax.barh(range(len(top_20)), top_20['Count'], color='mediumseagreen')
    ax.set_yticks(range(len(top_20)))
    ax.set_yticklabels(top_20['Skill'])
    ax.set_xlabel('Number of Job Postings', fontsize=12)
    ax.set_ylabel('Skill', fontsize=12)
    ax.set_title('Top 20 Most In-Demand Skills', fontsize=14, fontweight='bold')
    ax.invert_yaxis()
    plt.tight_layout()
    plt.savefig('../visualizations/top_skills.png', dpi=300, bbox_inches='tight')
    plt.show()


In [38]:
# Skills by category analysis
if 'skills' in df_clean.columns:
    category_skills = {category: [] for category in SKILLS_DICT.keys()}
    
    for skills_list in df_clean['skills']:
        for skill in skills_list:
            for category, skill_patterns in SKILLS_DICT.items():
                for pattern in skill_patterns:
                    clean_pattern = pattern.replace('\\b', '').replace('\\+\\+', '++').replace('\\.', '.')
                    if skill == clean_pattern:
                        category_skills[category].append(skill)
                        break
    
    # Count skills by category
    category_counts = {cat: len(skills) for cat, skills in category_skills.items()}
    category_df = pd.DataFrame(list(category_counts.items()), 
                               columns=['Category', 'Count']).sort_values('Count', ascending=False)
    
    print("\n" + "=" * 80)
    print("SKILLS BY CATEGORY")
    print("=" * 80)
    print(category_df.to_string(index=False))
    
    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(range(len(category_df)), category_df['Count'], color='skyblue', edgecolor='navy')
    ax.set_xticks(range(len(category_df)))
    ax.set_xticklabels(category_df['Category'], rotation=45, ha='right')
    ax.set_xlabel('Skill Category', fontsize=12)
    ax.set_ylabel('Total Mentions', fontsize=12)
    ax.set_title('Distribution of Skills by Category', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('../visualizations/skills_by_category.png', dpi=300, bbox_inches='tight')
    plt.show()


## 6. Data Visualization <a id='visualization'></a>

Create comprehensive visualizations and insights.


In [39]:
# Word Cloud of Skills
if 'skills' in df_clean.columns and len(all_skills) > 0:
    print("☁️ Generating word cloud...")
    
    # Count mentions per skill so multi-word skills ("machine learning") stay intact
    skill_frequencies = Counter(all_skills)
    
    # Generate word cloud
    wordcloud = WordCloud(
        width=1600, 
        height=800, 
        background_color='white',
        colormap='viridis',
        relative_scaling=0.5,
        min_font_size=10
    ).generate_from_frequencies(skill_frequencies)
    
    # Display
    fig, ax = plt.subplots(figsize=(16, 8))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis('off')
    ax.set_title('Skills Word Cloud - Most In-Demand Technologies', 
                 fontsize=16, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.savefig('../visualizations/skills_wordcloud.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("✅ Word cloud generated!")


In [40]:
# Skills required by top job titles
if 'skills' in df_clean.columns and 'job_title_clean' in df_clean.columns:
    print("\n📊 Analyzing skills by job title...")
    
    # Get top 5 job titles
    top_5_jobs = df_clean['job_title_clean'].value_counts().head(5).index
    
    # Create skill frequency for each job title
    job_skills_data = []
    
    for job in top_5_jobs:
        job_df = df_clean[df_clean['job_title_clean'] == job]
        job_skills = [skill for skills_list in job_df['skills'] for skill in skills_list]
        
        if len(job_skills) > 0:
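            # Use a separate name so the global skill_counts Counter
            # (needed later by the co-occurrence cell) is not overwritten
            job_skill_counts = Counter(job_skills).most_common(10)
            for skill, count in job_skill_counts: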
                job_skills_data.append({
                    'Job Title': job,
                    'Skill': skill,
                    'Count': count
                })
    
    if len(job_skills_data) > 0:
        job_skills_df = pd.DataFrame(job_skills_data)
        
        # Create grouped bar chart
        fig, axes = plt.subplots(len(top_5_jobs), 1, figsize=(14, 4*len(top_5_jobs)))
        
        if len(top_5_jobs) == 1:
            axes = [axes]
        
        for idx, job in enumerate(top_5_jobs):
            job_data = job_skills_df[job_skills_df['Job Title'] == job].sort_values('Count', ascending=True)
            
            if len(job_data) > 0:
                axes[idx].barh(job_data['Skill'], job_data['Count'], color=f'C{idx}')
                axes[idx].set_title(f'Top Skills for {job}', fontsize=12, fontweight='bold')
                axes[idx].set_xlabel('Frequency')
        
        plt.tight_layout()
        plt.savefig('../visualizations/skills_by_job_title.png', dpi=300, bbox_inches='tight')
        plt.show()
        print("✅ Skills by job title visualized!")


In [41]:
# Skill co-occurrence heatmap (which skills appear together)
if 'skills' in df_clean.columns:
    print("\n🔥 Creating skill co-occurrence heatmap...")
    
    # Get top 15 skills
    top_skills_list = [skill for skill, count in skill_counts.most_common(15)]
    
    # Create co-occurrence matrix
    cooccurrence = pd.DataFrame(0, index=top_skills_list, columns=top_skills_list)
    
    for skills_list in df_clean['skills']:
        # For each pair of skills in the list
        for i, skill1 in enumerate(skills_list):
            if skill1 in top_skills_list:
                for skill2 in skills_list[i+1:]:
                    if skill2 in top_skills_list:
                        cooccurrence.loc[skill1, skill2] += 1
                        cooccurrence.loc[skill2, skill1] += 1
    
    # Create heatmap
    fig, ax = plt.subplots(figsize=(14, 12))
    sns.heatmap(cooccurrence, annot=True, fmt='d', cmap='YlOrRd', 
                square=True, linewidths=0.5, cbar_kws={"shrink": 0.8}, ax=ax)
    ax.set_title('Skill Co-occurrence Matrix - How Often Skills Appear Together', 
                 fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.savefig('../visualizations/skill_cooccurrence.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("✅ Co-occurrence heatmap created!")
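
The pairwise counting behind a co-occurrence matrix can also be expressed with the standard library, which makes the logic easy to check on a small input. This is a minimal sketch on made-up postings; `pair_counts` and the sample lists are hypothetical and only illustrate the counting step, not the plotting.

```python
from collections import Counter
from itertools import combinations

def pair_counts(skill_lists, keep):
    """Count how often each unordered pair of kept skills shares a posting."""
    keep = set(keep)
    counts = Counter()
    for skills in skill_lists:
        # De-duplicate and sort so each pair has one canonical ordering
        kept = sorted(set(skills) & keep)
        counts.update(combinations(kept, 2))
    return counts

# Hypothetical extracted skills for three postings
postings = [
    ["python", "sql", "aws"],
    ["python", "sql"],
    ["sql", "excel"],
]

counts = pair_counts(postings, keep=["python", "sql", "aws"])
for pair, n in counts.most_common():
    print(pair, n)
```

Each pair is stored once as a sorted tuple, so a symmetric matrix can be filled by writing every count to both `[a, b]` and `[b, a]`.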


## 7. Machine Learning - Job Clustering <a id='clustering'></a>

Use unsupervised learning to group similar jobs based on their requirements.
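
As a small illustration of the idea, the sketch below clusters a handful of made-up job descriptions with TF-IDF features and K-Means, and scores each candidate `k` with the silhouette coefficient. The silhouette score is swapped in here as a complement to the elbow plot and is not something the notebook itself computes. The `docs` list is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical descriptions: three ML-flavored, three web-flavored
docs = [
    "python machine learning tensorflow model training",
    "python pytorch deep learning model deployment",
    "machine learning python scikit-learn statistics",
    "react javascript css frontend web",
    "javascript node.js react web api",
    "html css javascript frontend design",
]

# Turn each description into a sparse TF-IDF vector
X = TfidfVectorizer().fit_transform(docs)

# Score a few cluster counts; higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("k with highest silhouette:", best_k)
```

On real postings the elbow and silhouette criteria do not always agree, so it is worth inspecting the top terms or titles per cluster before settling on a value of `k`.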


In [None]:
# Prepare data for clustering
if 'description' in df_clean.columns:
    print("🤖 Preparing data for machine learning clustering...\n")
    
    # Filter out empty descriptions
    df_ml = df_clean[df_clean['description'].str.len() > 50].copy()
    
    if len(df_ml) < 10:
        print("⚠️ Not enough data for clustering (need at least 10 jobs with descriptions)")
        print("Skipping clustering analysis...")
        df_ml = None
        tfidf_matrix = None
    else:
        print(f"✅ Found {len(df_ml):,} jobs with valid descriptions")
        
        # Create TF-IDF features from job descriptions
        print("📊 Creating TF-IDF features...")
        try:
            tfidf = TfidfVectorizer(
                max_features=100,
                stop_words='english',
                ngram_range=(1, 2),
                min_df=min(5, len(df_ml) // 10),  # Adjust min_df based on data size
                max_df=0.8
            )
            
            tfidf_matrix = tfidf.fit_transform(df_ml['description'])
            print(f"✅ TF-IDF matrix created: {tfidf_matrix.shape}")
            
            # Determine optimal number of clusters using elbow method
            print("\n🔍 Finding optimal number of clusters...")
            inertias = []
            max_k = min(10, len(df_ml) // 2)  # Ensure we don't have more clusters than half the data
            K_range = range(2, max_k + 1)
            
            for k in K_range:
                kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
                kmeans_temp.fit(tfidf_matrix)
                inertias.append(kmeans_temp.inertia_)
            
            # Plot elbow curve
            fig, ax = plt.subplots(figsize=(10, 6))
            ax.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
            ax.set_xlabel('Number of Clusters (k)', fontsize=12)
            ax.set_ylabel('Inertia', fontsize=12)
            ax.set_title('Elbow Method - Finding Optimal k', fontsize=14, fontweight='bold')
            ax.grid(True, alpha=0.3)
            plt.tight_layout()
            # Make sure the output folder exists before saving
            os.makedirs('../visualizations', exist_ok=True)
            plt.savefig('../visualizations/elbow_curve.png', dpi=300, bbox_inches='tight')
            plt.show()
            
            print("✅ Elbow curve generated!")
            
        except Exception as e:
            print(f"❌ Error during clustering preparation: {str(e)}")
            print("Skipping clustering analysis...")
            df_ml = None
            tfidf_matrix = None
else:
    print("⚠️ Description column not found. Skipping clustering analysis.")
    df_ml = None
    tfidf_matrix = None
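
The elbow curve above has to be read by eye. As a complementary check, the sketch below scores each candidate k with the mean silhouette coefficient and picks the highest. It is a minimal illustration on made-up toy descriptions, not part of the original analysis: the helper `best_k_by_silhouette` and the `toy_*` names are hypothetical, and on the real data you would pass `tfidf_matrix` and `K_range` instead.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(matrix, k_values, random_state=42):
    """Return (best_k, scores), where scores maps each k to its mean silhouette."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(matrix)
        # Silhouette is in [-1, 1]; higher means tighter, better separated clusters
        scores[k] = silhouette_score(matrix, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy job descriptions (illustrative only, not from the dataset)
toy_docs = [
    "python pandas sql data analysis dashboards",
    "sql reporting data analyst excel dashboards",
    "data analysis python statistics reporting",
    "react javascript frontend css components",
    "frontend developer javascript html css",
    "typescript react frontend web components",
    "aws docker kubernetes devops pipelines",
    "devops engineer terraform aws monitoring",
    "kubernetes docker cloud infrastructure pipelines",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)
toy_best_k, toy_scores = best_k_by_silhouette(toy_matrix, range(2, 5))

for k, score in toy_scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={toy_best_k}")
```

Silhouette scores on sparse, short text are often low in absolute terms, so treat the ranking between k values as a hint to weigh alongside the elbow plot rather than a definitive answer.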


In [None]:
# Perform K-Means clustering
if df_ml is not None and tfidf_matrix is not None:
    # Pick k from the data size: at most 5, never fewer than 2
    optimal_k = max(2, min(5, len(df_ml) // 10))
    print(f"\n🎯 Performing K-Means clustering with k={optimal_k}...")
    
    try:
        # Apply KMeans
        kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
        df_ml['cluster'] = kmeans.fit_predict(tfidf_matrix)
        
        print(f"✅ Clustering complete with k={optimal_k}!")
        print(f"\nCluster distribution:")
        print(df_ml['cluster'].value_counts().sort_index())
        
        # Analyze each cluster
        print("\n" + "=" * 80)
        print("CLUSTER ANALYSIS")
        print("=" * 80)
        
        for cluster_id in range(optimal_k):
            cluster_jobs = df_ml[df_ml['cluster'] == cluster_id]
            
            print(f"\n{'='*80}")
            print(f"CLUSTER {cluster_id} ({len(cluster_jobs)} jobs)")
            print(f"{'='*80}")
            
            # Most common job titles
            if 'job_title_clean' in cluster_jobs.columns:
                print("\nTop Job Titles:")
                top_titles = cluster_jobs['job_title_clean'].value_counts().head(5)
                if len(top_titles) > 0:
                    print(top_titles.to_string())
                else:
                    print("  No job titles found")
            
            # Most common skills (skip rows whose 'skills' value is not a list)
            if 'skills' in cluster_jobs.columns:
                cluster_skills = [
                    skill
                    for skills_list in cluster_jobs['skills']
                    if isinstance(skills_list, list)
                    for skill in skills_list
                ]
                if len(cluster_skills) > 0:
                    print("\nTop Skills:")
                    top_cluster_skills = Counter(cluster_skills).most_common(10)
                    for skill, count in top_cluster_skills:
                        print(f"  • {skill}: {count}")
                else:
                    print("\nNo skills found in this cluster")
    
    except Exception as e:
        print(f"❌ Error during clustering: {str(e)}")
        print("Skipping cluster analysis...")
else:
    print("\n⚠️ Skipping K-Means clustering - data not available")
    print("This happens when:")
    print("  • Not enough job descriptions (need at least 10)")
    print("  • Description column is missing")
    print("  • Previous clustering step failed")


In [None]:
# Visualize clusters using PCA
if df_ml is not None and tfidf_matrix is not None and 'cluster' in df_ml.columns:
    print("\n📊 Visualizing clusters with PCA...")
    
    try:
        # Reduce dimensions to 2D using PCA
        pca = PCA(n_components=2, random_state=42)
        tfidf_dense = tfidf_matrix.toarray()
        principal_components = pca.fit_transform(tfidf_dense)
        
        # Create scatter plot
        fig, ax = plt.subplots(figsize=(12, 8))
        scatter = ax.scatter(
            principal_components[:, 0],
            principal_components[:, 1],
            c=df_ml['cluster'],
            cmap='viridis',
            alpha=0.6,
            s=50,
            edgecolors='black',
            linewidth=0.5
        )
        
        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)', fontsize=12)
        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)', fontsize=12)
        ax.set_title('Job Clustering Visualization (PCA)', fontsize=14, fontweight='bold')
        
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=ax)
        cbar.set_label('Cluster', fontsize=12)
        
        plt.tight_layout()
        # Make sure the output folder exists before saving
        os.makedirs('../visualizations', exist_ok=True)
        plt.savefig('../visualizations/job_clusters.png', dpi=300, bbox_inches='tight')
        plt.show()
        
        print("✅ Cluster visualization created!")
        print(f"📊 PCA explained variance: {sum(pca.explained_variance_ratio_):.2%}")
    
    except Exception as e:
        print(f"❌ Error creating PCA visualization: {str(e)}")
        print("Skipping cluster visualization...")
else:
    print("\n⚠️ Skipping cluster visualization - clustering data not available")
    print("Reasons:")
    print("  • Clustering was skipped due to insufficient data")
    print("  • Previous clustering step failed")
    print("  • No cluster column found in data")
    print("\n💡 Tip: Make sure you have at least 10 jobs with descriptions for clustering")



## 8. Key Insights & Recommendations <a id='insights'></a>

Summarize findings and provide actionable recommendations.


In [None]:
# Generate comprehensive insights report
print("=" * 80)
print("KEY INSIGHTS & FINDINGS")
print("=" * 80)

# 1. Dataset overview
print(f"\n📊 DATASET OVERVIEW")
print(f"{'─'*80}")
print(f"Total Job Postings Analyzed: {len(df_clean):,}")
print(f"Unique Job Titles: {df_clean['job_title'].nunique() if 'job_title' in df_clean.columns else 'N/A'}")
print(f"Unique Companies: {df_clean['company'].nunique() if 'company' in df_clean.columns else 'N/A'}")
print(f"Unique Locations: {df_clean['location'].nunique() if 'location' in df_clean.columns else 'N/A'}")

# 2. Most in-demand skills
if 'skills' in df_clean.columns and len(all_skills) > 0:
    print(f"\n🔥 TOP 10 MOST IN-DEMAND SKILLS")
    print(f"{'─'*80}")
    top_10_skills = skill_counts.most_common(10)
    for idx, (skill, count) in enumerate(top_10_skills, 1):
        percentage = (count / len(df_clean)) * 100
        print(f"{idx:2d}. {skill:30s} - {count:5,} jobs ({percentage:.1f}%)")

# 3. Most common job titles
if 'job_title_clean' in df_clean.columns:
    print(f"\n💼 TOP 10 MOST COMMON JOB TITLES")
    print(f"{'─'*80}")
    top_10_titles = df_clean['job_title_clean'].value_counts().head(10)
    for idx, (title, count) in enumerate(top_10_titles.items(), 1):
        percentage = (count / len(df_clean)) * 100
        print(f"{idx:2d}. {title:35s} - {count:5,} jobs ({percentage:.1f}%)")

# 4. Location insights
if 'location' in df_clean.columns:
    print(f"\n🌍 TOP 5 LOCATIONS")
    print(f"{'─'*80}")
    top_5_locations = df_clean['location'].value_counts().head(5)
    for idx, (location, count) in enumerate(top_5_locations.items(), 1):
        percentage = (count / len(df_clean)) * 100
        print(f"{idx}. {location:40s} - {count:5,} jobs ({percentage:.1f}%)")

# 5. Experience level distribution
if 'experience_level' in df_clean.columns:
    print(f"\n📈 EXPERIENCE LEVEL DISTRIBUTION")
    print(f"{'─'*80}")
    exp_dist = df_clean['experience_level'].value_counts()
    for level, count in exp_dist.items():
        percentage = (count / len(df_clean)) * 100
        print(f"• {level:20s} - {count:5,} jobs ({percentage:.1f}%)")

print(f"\n{'='*80}")


### 📝 Strategic Recommendations

Based on the analysis, here are actionable recommendations:

#### For Job Seekers:

1. **Focus on Core Skills**: Prioritize learning the top 10 most in-demand skills identified in this analysis
2. **Combination Matters**: Skills often appear together (see co-occurrence matrix) - build complementary skill sets
3. **Location Strategy**: Consider opportunities in cities with highest job demand
4. **Experience Level**: Understand typical requirements for your target roles

#### For Employers:

1. **Competitive Requirements**: Align job postings with market standards for required skills
2. **Skill Trends**: Stay updated on emerging technologies showing growth
3. **Talent Pool**: Consider skills that are frequently bundled together when hiring

#### For Career Development:

1. **Upskilling Path**: Create a learning roadmap based on skill frequency and co-occurrence
2. **Role Transitions**: Use cluster analysis to identify similar roles that match your skillset
3. **Market Positioning**: Understand how your skills align with current market demands
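
The "Role Transitions" recommendation can be made concrete by comparing your own skill set with the most common skills in each cluster. The sketch below ranks clusters by Jaccard overlap. It is only an illustration: `rank_clusters_by_skill_match` and the toy clusters are hypothetical, and with the real data you would build `cluster_skill_lists` from `df_ml.groupby('cluster')['skills']`.

```python
from collections import Counter


def rank_clusters_by_skill_match(my_skills, cluster_skill_lists, top_n=10):
    """Rank clusters by Jaccard overlap between my_skills and each cluster's top skills.

    cluster_skill_lists maps a cluster id to a list of per-job skill lists.
    """
    mine = {skill.lower() for skill in my_skills}
    ranking = []
    for cluster_id, jobs in cluster_skill_lists.items():
        counts = Counter(skill.lower() for job in jobs for skill in job)
        top_skills = {skill for skill, _ in counts.most_common(top_n)}
        union = mine | top_skills
        # Jaccard similarity: shared skills divided by all distinct skills
        score = len(mine & top_skills) / len(union) if union else 0.0
        ranking.append((cluster_id, score))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


# Toy clusters (illustrative only, not from the dataset)
toy_clusters = {
    0: [["python", "sql"], ["python", "pandas"]],
    1: [["java", "spring"], ["java", "sql"]],
}

toy_ranking = rank_clusters_by_skill_match(["Python", "SQL"], toy_clusters)
for cluster_id, score in toy_ranking:
    print(f"Cluster {cluster_id}: overlap {score:.2f}")
```

Jaccard overlap is a deliberately simple choice here; it ignores how frequent each skill is within the cluster, so a weighted score may suit larger datasets better.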


In [None]:
# Save cleaned dataset for future use (including for Streamlit dashboard)
print("💾 Saving processed data...")

# Make a copy for saving
df_export = df_clean.copy()

# Convert skills list to string for CSV compatibility
if 'skills' in df_export.columns:
    df_export['skills'] = df_export['skills'].apply(lambda x: str(x) if isinstance(x, list) else x)

# Save to processed folder
output_path = '../data/processed/job_market_clean.csv'
df_export.to_csv(output_path, index=False)
print(f"✅ Cleaned dataset saved to: {output_path}")
print(f"   • {len(df_export):,} rows × {len(df_export.columns)} columns")

# Save skills analysis
if 'skills' in df_clean.columns and len(all_skills) > 0:
    skills_df = pd.DataFrame(skill_counts.most_common(50), columns=['Skill', 'Frequency'])
    skills_df.to_csv('../data/processed/top_skills.csv', index=False)
    print(f"✅ Top skills saved to: ../data/processed/top_skills.csv")
    print(f"   • {len(skills_df)} skills")

# Create a summary file for quick reference
summary_data = {
    'Metric': [
        'Total Jobs',
        'Unique Job Titles',
        'Unique Companies',
        'Unique Locations',
        'Date Range',
        'Analysis Date'
    ],
    'Value': [
        f"{len(df_clean):,}",
        f"{df_clean['job_title'].nunique() if 'job_title' in df_clean.columns else 'N/A'}",
        f"{df_clean['company'].nunique() if 'company' in df_clean.columns else 'N/A'}",
        f"{df_clean['location'].nunique() if 'location' in df_clean.columns else 'N/A'}",
        f"{df_clean['posted_date'].min() if 'posted_date' in df_clean.columns else 'N/A'} to {df_clean['posted_date'].max() if 'posted_date' in df_clean.columns else 'N/A'}",
        datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    ]
}
summary_df = pd.DataFrame(summary_data)
summary_df.to_csv('../data/processed/analysis_summary.csv', index=False)
print(f"✅ Analysis summary saved to: ../data/processed/analysis_summary.csv")

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE! 🎉")
print("=" * 80)
print("\nAll visualizations have been saved to: ../visualizations/")
print("\nVisualization files created:")
print("  • top_job_titles.png")
print("  • top_locations.png")
print("  • experience_distribution.png")
print("  • top_skills.png")
print("  • skills_by_category.png")
print("  • skills_wordcloud.png")
print("  • skills_by_job_title.png")
print("  • skill_cooccurrence.png")
print("  • elbow_curve.png")
print("  • job_clusters.png")

print("\n📊 Processed data files:")
print("  • job_market_clean.csv - Main dataset")
print("  • top_skills.csv - Top 50 skills")
print("  • analysis_summary.csv - Quick summary")

print("\n🚀 Ready for Streamlit Dashboard!")
print("   Run: streamlit run streamlit_app.py")
print("\n" + "=" * 80)
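
Because the export step stores each `skills` list as its string representation, anything that reads `job_market_clean.csv` later (the dashboard, for example) gets strings like `"['python', 'sql']"` rather than lists. A minimal sketch of parsing them back with `ast.literal_eval` follows; the helper `parse_skills` is hypothetical, and the round trip uses an in-memory buffer with toy rows instead of the real file.

```python
import ast
import io

import pandas as pd


def parse_skills(value):
    """Turn a stringified list such as "['python', 'sql']" back into a list."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str) or not value.strip():
        return []  # NaN or blank cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return parsed if isinstance(parsed, list) else []


# Toy rows (illustrative only), exported the same way as df_export above
toy_original = pd.DataFrame({
    "job_title": ["Data Analyst", "Backend Engineer"],
    "skills": [["python", "sql"], []],
})
toy_export = toy_original.copy()
toy_export["skills"] = toy_export["skills"].apply(str)

buffer = io.StringIO()
toy_export.to_csv(buffer, index=False)
buffer.seek(0)

toy_restored = pd.read_csv(buffer)
toy_restored["skills"] = toy_restored["skills"].apply(parse_skills)
print(toy_restored["skills"].tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from disk.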


---

## 🚀 Next Steps

### To enhance this project further:

1. **Time Series Analysis**: If your dataset has date information, analyze trends over time
2. **Salary Analysis**: If salary data is available, analyze compensation patterns by role, location, and experience
3. **Interactive Dashboard**: Create a Streamlit dashboard for interactive exploration
4. **Web Scraping**: Collect real-time data from job sites for fresh insights
5. **Predictive Models**: Build models to predict job category or salary based on description
6. **Geographic Visualization**: Create maps showing job distribution across regions
7. **Skill Gap Analysis**: Compare required vs. available skills in the market
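
As a starting point for the time series idea, the sketch below counts postings per calendar month. It assumes a `posted_date` column like the one referenced in the summary cell; the helper `monthly_posting_counts` and the toy rows are hypothetical.

```python
import pandas as pd


def monthly_posting_counts(df, date_col="posted_date"):
    """Count postings per calendar month, ignoring unparseable dates."""
    dates = pd.to_datetime(df[date_col], errors="coerce").dropna()
    return dates.dt.to_period("M").value_counts().sort_index()


# Toy rows (illustrative only, not from the dataset)
toy_jobs = pd.DataFrame({
    "posted_date": ["2025-01-05", "2025-01-20", "2025-02-01", "not a date"],
})

toy_counts = monthly_posting_counts(toy_jobs)
print(toy_counts)
```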

### Resources for Learning:
- **Python for Data Analysis** by Wes McKinney
- **Kaggle Learn**: Free data science courses
- **Towards Data Science**: Articles on job market analysis
- **LinkedIn Learning**: Courses on data visualization and NLP

---

## 📬 Contact & Portfolio

**Project Repository**: [Add your GitHub link]  
**LinkedIn**: [Add your LinkedIn profile]  
**Email**: [Your email]

---

*This project demonstrates proficiency in: Python, Data Analysis, Data Visualization, NLP, Machine Learning, and Business Intelligence*
