# Phase2_Supervised_Learning

**Course:** SWE 485  
**Notebook:** Phase2_Supervised_Learning.ipynb

## Dataset Goal & Source
- **Goal:** Analyze relationships between job titles and required skills for recommendation.
- **Source:** https://www.kaggle.com/datasets/batuhanmutlu/job-skill-set?resource=download

In [2]:
# Imports

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


# For building the pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# For the models (Algorithms)
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# For evaluation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
import seaborn as sns # For plotting the confusion matrix

In [3]:
# Load Dataset

DATA_PATH = "../Dataset/jobs_dataset_raw.csv"

try:
    df = pd.read_csv(DATA_PATH)
    print(f"Loaded: {DATA_PATH}")
except FileNotFoundError as e:
    raise FileNotFoundError(f"File not found: '{DATA_PATH}'. Please verify the name and folder.") from e

Loaded: ../Dataset/jobs_dataset_raw.csv
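Later cells read the columns `job_title`, `job_description`, `job_skill_set`, and `category`, so it may help to confirm they exist right after loading. The sketch below is only an illustration: `REQUIRED_COLUMNS`, `missing_columns`, and the stand-in frame `demo` are hypothetical names and data, not part of the original notebook.

```python
import pandas as pd

# Columns referenced later in this notebook (assumed from the cells that follow).
REQUIRED_COLUMNS = ["job_title", "job_description", "job_skill_set", "category"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Tiny stand-in frame (hypothetical data) that deliberately lacks `job_skill_set`.
demo = pd.DataFrame({
    "job_title": ["Data Analyst"],
    "job_description": ["Builds reports."],
    "category": ["IT"],
})

missing = missing_columns(demo)
print(missing)
```

On the real data, calling `missing_columns(df)` and raising an error when the list is non-empty would surface a schema problem before the cleaning step fails with a less direct `KeyError`.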


In [4]:
# 2. Data Cleaning and Preparation

# Strip the list brackets and quotes from the skills column
# e.g., "['Python', 'SQL']" ---> "Python SQL"
df['job_skill_set_cleaned'] = df['job_skill_set'].str.strip("[]'").str.replace("', '", " ", regex=False)

# Combine all text columns into a single feature for the model (the main feature, X)
# .fillna('') prevents a missing value in one column from turning the whole row into NaN
df['all_text'] = df['job_title'].fillna('') + ' ' + \
                 df['job_description'].fillna('') + ' ' + \
                 df['job_skill_set_cleaned'].fillna('')

print("Combined text feature created successfully.")
df[['category', 'all_text']].head()

Combined text feature created successfully.


| | category | all_text |
|---|---|---|
| 0 | HR | Sr Human Resource Generalist SUMMARY\nTHE SR. ... |
| 1 | HR | Human Resources Manager BE PART OF A STELLAR T... |
| 2 | HR | Director of Human Resources OUR CLIENT IS A TH... |
| 3 | HR | Chief Human Resources Officer JOB TITLE: CHIEF... |
| 4 | HR | Human Resources Generalist (Hybrid Role) DESCR... |
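To make the skills-cleaning step concrete, here is a minimal sketch run on a hypothetical three-row Series. It assumes the raw values are stringified Python lists with single-quoted items (as the `strip("[]'")` call suggests); the sample strings are invented for illustration.

```python
import pandas as pd

# Hypothetical raw values in the assumed "['a', 'b']" string format, plus a missing entry.
raw = pd.Series(["['Python', 'SQL']", "['Problem Solving']", None])

# Same two operations as the notebook cell:
# 1) strip leading/trailing '[', ']' and "'" characters,
# 2) replace the "', '" separators between items with a single space.
cleaned = raw.str.strip("[]'").str.replace("', '", " ", regex=False)

# Missing values stay missing until they are filled explicitly.
result = cleaned.fillna("").tolist()
print(result)
```

Note that this approach only handles single-quoted items; a skill stored with double quotes (for example one containing an apostrophe) would keep its quotes and commas, so spot-checking a few cleaned rows is worthwhile.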


In [5]:
# 3. Feature and Target Definition

# X = input feature: the combined text the model reads
X = df['all_text']

# y = target: the category the model should predict
y = df['category']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")

Features (X) shape: (1167,)
Target (y) shape: (1167,)


In [6]:
# 4. Train-Test Split

# Split data: 80% training, 20% testing.
# random_state=1 makes the split reproducible across runs.
# stratify=y keeps the proportion of each category (IT, HR, Sales...) the same in the
# training and test sets. Phase 1 found the categories well balanced, and stratifying
# preserves that balance in both splits, so no category is over- or under-represented
# when the model is evaluated.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print(f"Training set size: {X_train.shape[0]} rows")
print(f"Test set size: {X_test.shape[0]} rows")

Training set size: 933 rows
Test set size: 234 rows
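The effect of `stratify` is easiest to see on deliberately imbalanced labels. The following sketch uses hypothetical data (80 "IT" and 20 "HR" labels, with placeholder texts) rather than the notebook's dataset, and prints the class shares in each split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "IT" rows and 20 "HR" rows.
labels = pd.Series(["IT"] * 80 + ["HR"] * 20)
texts = pd.Series([f"posting {i}" for i in range(100)])

# Same call shape as the notebook: 80/20 split, fixed seed, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels
)

# Share of each class in the two splits; with stratification they match.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, the test set's class mix would depend on the random seed, which matters most for small or imbalanced categories.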


## 5. Algorithm Selection & Justification


### Model 1: Multinomial Naive Bayes (MNB)
* **Justification:** 

### Model 2: Linear Support Vector Machine (LinearSVC)
* **Justification:**

In [7]:
# 6. Model 1: Multinomial Naive Bayes (Baseline)

# Build a two-step pipeline
# Step 1: Convert text to TF-IDF vectors (ngram_range=(1, 2) adds bigrams such as "problem solving")
# Step 2: Train a MultinomialNB classifier on those vectors
pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('model', MultinomialNB())
])

# Train the model on the training data
print("Training Naive Bayes model...")
pipeline_nb.fit(X_train, y_train)

# Get predictions on the test set
print("Getting predictions...")
y_pred_nb = pipeline_nb.predict(X_test)

print("Naive Bayes model trained and predictions are ready.")

Training Naive Bayes model...
Getting predictions...
Naive Bayes model trained and predictions are ready.
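The imports at the top include `classification_report` and `confusion_matrix`, which are the usual next step once predictions exist. The sketch below shows one way they could be applied; it trains the same kind of pipeline on a handful of hypothetical postings (the texts, labels, and the names `toy_pipeline`, `toy_pred`, and `cm` are invented for illustration), so its numbers say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy postings standing in for X_train / X_test.
train_texts = [
    "python developer sql backend",
    "software engineer java cloud",
    "recruiter hiring onboarding payroll",
    "human resources generalist benefits",
]
train_labels = ["IT", "IT", "HR", "HR"]
test_texts = ["senior python engineer", "payroll and benefits specialist"]
test_labels = ["IT", "HR"]

# Same pipeline structure as the notebook cell above.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("model", MultinomialNB()),
])
toy_pipeline.fit(train_texts, train_labels)
toy_pred = toy_pipeline.predict(test_texts)

# Per-class precision, recall, and F1; zero_division=0 avoids warnings on empty classes.
print(classification_report(test_labels, toy_pred, zero_division=0))

# Rows are true classes, columns are predicted classes, ordered as in classes_.
cm = confusion_matrix(test_labels, toy_pred, labels=toy_pipeline.classes_)
print(cm)
```

On the notebook's data, the equivalent calls would take `y_test` and `y_pred_nb`, and the matrix could be passed to `sns.heatmap(cm, annot=True)` for plotting (which additionally requires matplotlib to display the figure).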
