In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/resume_job_matching_dataset.csv')

# Display the first 5 rows
display(df.head())

# Print column names and their data types
# (df.info() prints its summary directly and returns None, so it is not wrapped in print)
df.info()

# Display descriptive statistics of numerical columns
display(df.describe())

Unnamed: 0,job_description,resume,match_score
0,"Data Analyst needed with experience in SQL, Ex...","Experienced professional skilled in SQL, Power...",4
1,Data Scientist needed with experience in Stati...,"Experienced professional skilled in Python, De...",4
2,Software Engineer needed with experience in Sy...,"Experienced professional skilled in wait, Git,...",5
3,"ML Engineer needed with experience in Python, ...","Experienced professional skilled in return, De...",4
4,Software Engineer needed with experience in RE...,"Experienced professional skilled in REST APIs,...",5


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_description  10000 non-null  object
 1   resume           10000 non-null  object
 2   match_score      10000 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 234.5+ KB


Unnamed: 0,match_score
count,10000.0
mean,3.5003
std,1.16899
min,1.0
25%,3.0
50%,4.0
75%,4.0
max,5.0


In [2]:
print("Missing values before handling:")
print(df.isnull().sum())

Missing values before handling:
job_description    0
resume             0
match_score        0
dtype: int64


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

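# Initialize one TF-IDF vectorizer per text column so that each fitted vocabulary is kept
# (calling fit_transform twice on a single vectorizer would overwrite the first vocabulary)
# max_features caps the vocabulary size to keep the dimensionality manageable
tfidf_vectorizer_job = TfidfVectorizer(max_features=5000)
tfidf_vectorizer_resume = TfidfVectorizer(max_features=5000)

# Fit and transform the 'job_description' and 'resume' columns
# Note: fitting on the full dataset before the train/test split lets test-set vocabulary
# and IDF statistics leak into the features; fitting on the training rows only avoids this
tfidf_job_description = tfidf_vectorizer_job.fit_transform(df['job_description'])
tfidf_resume = tfidf_vectorizer_resume.fit_transform(df['resume'])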

print("Shape of TF-IDF vectorized job descriptions:", tfidf_job_description.shape)
print("Shape of TF-IDF vectorized resumes:", tfidf_resume.shape)

Shape of TF-IDF vectorized job descriptions: (10000, 1002)
Shape of TF-IDF vectorized resumes: (10000, 999)
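
To make the TF-IDF step concrete, the sketch below computes TF-IDF weights by hand for a toy corpus. It uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy documents and the helper name `tfidf_rows` are made up for illustration, and tokenization here is plain whitespace splitting, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (scikit-learn's default, smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset
docs = ["python sql excel", "python git docker", "python sql statistics"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

A term that appears in every document (`python` here) receives the lowest weight in a row, while a term unique to one document (`excel`) receives the highest.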


In [4]:
# Display the first few values of the target variable
print("First 5 values of the target variable 'match_score':")
print(df['match_score'].head())

First 5 values of the target variable 'match_score':
0    4
1    4
2    5
3    4
4    5
Name: match_score, dtype: int64


In [5]:
# Problem Type 1: Regression
# Predict the numerical match_score based on the job description and resume.
# match_score is an ordinal integer from 1 to 5; framing this as regression treats it as a continuous value.
# Model: Ridge Regression (or other linear models)
# Reason: Linear models are simple, interpretable, and can handle high-dimensional sparse data like TF-IDF vectors. Ridge regression includes L2 regularization to prevent overfitting.
# Input features: TF-IDF vectorized job_description and resume.
# Target variable: match_score.

# Problem Type 2: Classification
# Classify job-resume pairs into categories based on their match_score.
# We can simplify the problem by classifying into 'High Match' (e.g., score >= 4) and 'Low Match' (score < 4).
# This is a binary classification problem.
# Model: Logistic Regression (or other classification models like RandomForestClassifier)
# Reason: Logistic regression is suitable for binary classification and works well with high-dimensional data. It provides probability estimates.
# Input features: TF-IDF vectorized job_description and resume.
# Target variable: Binary label derived from match_score ('High Match' or 'Low Match').

print("Identified ML Problem Types:")
print("1. Regression: Predict match_score")
print("   Model: Ridge Regression")
print("   Input Features: TF-IDF vectorized job_description and resume")
print("   Target Variable: match_score")
print("\n2. Classification: Classify into 'High Match' or 'Low Match'")
print("   Model: Logistic Regression")
print("   Input Features: TF-IDF vectorized job_description and resume")
print("   Target Variable: Binary label based on match_score (e.g., match_score >= 4)")

Identified ML Problem Types:
1. Regression: Predict match_score
   Model: Ridge Regression
   Input Features: TF-IDF vectorized job_description and resume
   Target Variable: match_score

2. Classification: Classify into 'High Match' or 'Low Match'
   Model: Logistic Regression
   Input Features: TF-IDF vectorized job_description and resume
   Target Variable: Binary label based on match_score (e.g., match_score >= 4)
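
Because `match_score` takes only the integer values 1 through 5, a regression model's continuous predictions can fall between or outside the valid scores. One possible post-processing step, sketched here as an assumption rather than something this notebook does, is to round each prediction and clip it to the valid range. The helper name `to_valid_score` and the sample predictions are illustrative.

```python
import math

def to_valid_score(prediction, low=1, high=5):
    """Round a continuous prediction half-up, then clip it to [low, high]."""
    rounded = math.floor(prediction + 0.5)
    return max(low, min(high, rounded))

# Hypothetical raw regression outputs
raw_predictions = [0.4, 2.5, 3.49, 4.8, 5.7]
valid_scores = [to_valid_score(p) for p in raw_predictions]
print(valid_scores)
```

Rounding changes the error metrics, so MAE and RMSE would need to be recomputed on the rounded values if this step were applied.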


In [6]:
import numpy as np
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, mean_squared_error

# 1. Classification Problem
# Create the binary target variable 'high_match'
df['high_match'] = (df['match_score'] >= 4).astype(int)

# Combine the TF-IDF features
X_classification = hstack([tfidf_job_description, tfidf_resume])
y_classification = df['high_match']

# Split the data
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_classification, y_classification, test_size=0.2, random_state=42
)

# Initialize and train Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=42, solver='liblinear') # Using 'liblinear' solver for simplicity with sparse data
logistic_regression_model.fit(X_train_cls, y_train_cls)

# Optional: Evaluate the classification model (for verification)
y_pred_cls = logistic_regression_model.predict(X_test_cls)
accuracy = accuracy_score(y_test_cls, y_pred_cls)
print(f"Classification Model Accuracy: {accuracy:.4f}")

# 2. Regression Problem
# Combine the TF-IDF features
X_regression = hstack([tfidf_job_description, tfidf_resume])
y_regression = df['match_score']

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_regression, y_regression, test_size=0.2, random_state=42
)

# Initialize and train Ridge Regression model
ridge_regression_model = Ridge(random_state=42)
ridge_regression_model.fit(X_train_reg, y_train_reg)

# Optional: Evaluate the regression model (for verification)
y_pred_reg = ridge_regression_model.predict(X_test_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
print(f"Regression Model RMSE: {rmse:.4f}")

Classification Model Accuracy: 0.8110
Regression Model RMSE: 0.7305


In [7]:
from sklearn.metrics import classification_report, mean_absolute_error, mean_squared_error
import numpy as np

# For Classification Model
print("Classification Model Evaluation:")
print(classification_report(y_test_cls, y_pred_cls))

# For Regression Model
print("\nRegression Model Evaluation:")
mae = mean_absolute_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")

Classification Model Evaluation:
              precision    recall  f1-score   support

           0       0.80      0.77      0.79       905
           1       0.82      0.84      0.83      1095

    accuracy                           0.81      2000
   macro avg       0.81      0.81      0.81      2000
weighted avg       0.81      0.81      0.81      2000


Regression Model Evaluation:
Mean Absolute Error (MAE): 0.5670
Root Mean Squared Error (RMSE): 0.7305
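
These scores are easier to judge against trivial baselines. The sketch below uses only numbers printed earlier in this notebook: the test-set class supports (905 and 1095) and the standard deviation of `match_score` over the full dataset (1.16899). The comparison with the standard deviation is approximate, since it was computed on all 10,000 rows rather than on the test split alone.

```python
# Class supports from the classification report
support_low, support_high = 905, 1095

# Accuracy of always predicting the majority class
majority_accuracy = max(support_low, support_high) / (support_low + support_high)
print(f"Majority-class baseline accuracy: {majority_accuracy:.4f}")

# A model that always predicts the mean has an RMSE close to the target's standard deviation
std_match_score = 1.16899  # from df.describe()
ridge_rmse = 0.7305        # from the Ridge evaluation
rmse_ratio = ridge_rmse / std_match_score
print(f"Ridge RMSE relative to a constant-mean baseline: {rmse_ratio:.2f}")
```

Logistic Regression's accuracy of 0.8110 sits well above the majority-class baseline, and the Ridge RMSE is roughly 62% of what a constant prediction would give.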


In [8]:
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# Choose a pre-trained Sentence-BERT model
# 'all-MiniLM-L6-v2' is a good balance of performance and speed
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for job descriptions and resumes
# This might take some time depending on the dataset size and hardware
print("Generating embeddings for job descriptions...")
sentence_embeddings_job = model.encode(df['job_description'].tolist(), show_progress_bar=True)

print("\nGenerating embeddings for resumes...")
sentence_embeddings_resume = model.encode(df['resume'].tolist(), show_progress_bar=True)

print("\nShape of Sentence-BERT embeddings for job descriptions:", sentence_embeddings_job.shape)
print("Shape of Sentence-BERT embeddings for resumes:", sentence_embeddings_resume.shape)

# Display a small sample of the embeddings (first embedding for the first entry)
print("\nSample Sentence-BERT embedding for the first job description entry:")
print(sentence_embeddings_job[0][:10]) # Display first 10 dimensions

print("\nSample Sentence-BERT embedding for the first resume entry:")
print(sentence_embeddings_resume[0][:10]) # Display first 10 dimensions



Generating embeddings for job descriptions...



Generating embeddings for resumes...



Shape of Sentence-BERT embeddings for job descriptions: (10000, 384)
Shape of Sentence-BERT embeddings for resumes: (10000, 384)

Sample Sentence-BERT embedding for the first job description entry:
[-0.01660672 -0.03866492 -0.0607894   0.05495542 -0.08432894 -0.05802998
 -0.02006814  0.01458226 -0.10330391  0.02945766]

Sample Sentence-BERT embedding for the first resume entry:
[-0.01121742 -0.05243335 -0.05443608  0.02550251 -0.0959694   0.01802167
  0.01004355  0.03893515 -0.01764875  0.00196515]
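
Sentence embeddings are commonly compared with cosine similarity, so one natural sanity check is the similarity between a job description's embedding and its paired resume's embedding. The sketch below implements cosine similarity in plain Python and applies it to the ten printed dimensions of the first job and resume embeddings. Since these are only 10 of the 384 dimensions, the resulting number is illustrative and is not the similarity of the full vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First 10 dimensions as printed in the cell output
job_head = [-0.01660672, -0.03866492, -0.0607894, 0.05495542, -0.08432894,
            -0.05802998, -0.02006814, 0.01458226, -0.10330391, 0.02945766]
resume_head = [-0.01121742, -0.05243335, -0.05443608, 0.02550251, -0.0959694,
               0.01802167, 0.01004355, 0.03893515, -0.01764875, 0.00196515]

sim = cosine_similarity(job_head, resume_head)
print(f"Cosine similarity (first 10 dims only): {sim:.3f}")
```

With the full arrays, the same quantity could be computed per row and used as an extra feature next to the concatenated embeddings; whether that helps on this dataset is untested here.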


In [9]:
# Combine the Sentence-BERT embeddings
# Concatenating the embeddings is a common approach
X_sentence_bert = np.hstack([sentence_embeddings_job, sentence_embeddings_resume])

print("Shape of combined Sentence-BERT features:", X_sentence_bert.shape)

Shape of combined Sentence-BERT features: (10000, 768)


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor

# 1. Split the Sentence-BERT features and target variables
X_train_sbert, X_test_sbert, y_train_cls_sbert, y_test_cls_sbert = train_test_split(
    X_sentence_bert, y_classification, test_size=0.2, random_state=42
)
_, _, y_train_reg_sbert, y_test_reg_sbert = train_test_split(
    X_sentence_bert, y_regression, test_size=0.2, random_state=42
)

print("Shape of Sentence-BERT training features:", X_train_sbert.shape)
print("Shape of Sentence-BERT testing features:", X_test_sbert.shape)
print("Shape of Sentence-BERT classification training target:", y_train_cls_sbert.shape)
print("Shape of Sentence-BERT classification testing target:", y_test_cls_sbert.shape)
print("Shape of Sentence-BERT regression training target:", y_train_reg_sbert.shape)
print("Shape of Sentence-BERT regression testing target:", y_test_reg_sbert.shape)


# 2. Initialize and train RandomForestClassifier
print("\nTraining RandomForestClassifier...")
random_forest_classifier = RandomForestClassifier(random_state=42)
random_forest_classifier.fit(X_train_sbert, y_train_cls_sbert)
print("RandomForestClassifier trained.")

# 3. Initialize and train RandomForestRegressor
print("\nTraining RandomForestRegressor...")
random_forest_regressor = RandomForestRegressor(random_state=42)
random_forest_regressor.fit(X_train_sbert, y_train_reg_sbert)
print("RandomForestRegressor trained.")

# 4. Initialize and train GradientBoostingClassifier
print("\nTraining GradientBoostingClassifier...")
gradient_boosting_classifier = GradientBoostingClassifier(random_state=42)
gradient_boosting_classifier.fit(X_train_sbert, y_train_cls_sbert)
print("GradientBoostingClassifier trained.")

# 5. Initialize and train GradientBoostingRegressor
print("\nTraining GradientBoostingRegressor...")
gradient_boosting_regressor = GradientBoostingRegressor(random_state=42)
gradient_boosting_regressor.fit(X_train_sbert, y_train_reg_sbert)
print("GradientBoostingRegressor trained.")

Shape of Sentence-BERT training features: (8000, 768)
Shape of Sentence-BERT testing features: (2000, 768)
Shape of Sentence-BERT classification training target: (8000,)
Shape of Sentence-BERT classification testing target: (2000,)
Shape of Sentence-BERT regression training target: (8000,)
Shape of Sentence-BERT regression testing target: (2000,)

Training RandomForestClassifier...
RandomForestClassifier trained.

Training RandomForestRegressor...
RandomForestRegressor trained.

Training GradientBoostingClassifier...
GradientBoostingClassifier trained.

Training GradientBoostingRegressor...
GradientBoostingRegressor trained.
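
This cell calls `train_test_split` twice with the same `random_state` so that the classification and regression targets stay aligned with the same feature rows. An alternative that makes the alignment explicit is to split the row indices once and reuse them for every array. The sketch below illustrates the idea with the standard library only; `split_indices` is a hypothetical helper, and its shuffle is not the same permutation that scikit-learn would produce.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices once and return (train_idx, test_idx)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# Toy data standing in for the features and the two targets
features = [[i, i * 2] for i in range(10)]
scores = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
high_match = [int(s >= 4) for s in scores]

train_idx, test_idx = split_indices(len(features))
X_test = [features[i] for i in test_idx]
y_test_reg = [scores[i] for i in test_idx]
y_test_cls = [high_match[i] for i in test_idx]
print(test_idx, y_test_reg, y_test_cls)
```

With scikit-learn itself, the same effect is available by passing the features and both targets to a single `train_test_split` call, which splits all of the arrays consistently.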


In [13]:
from sklearn.metrics import classification_report, mean_absolute_error, mean_squared_error
import numpy as np

# Evaluate RandomForestClassifier
print("RandomForestClassifier Evaluation:")
y_pred_rf_cls = random_forest_classifier.predict(X_test_sbert)
print(classification_report(y_test_cls_sbert, y_pred_rf_cls))

# Evaluate RandomForestRegressor
print("\nRandomForestRegressor Evaluation:")
y_pred_rf_reg = random_forest_regressor.predict(X_test_sbert)
mae_rf_reg = mean_absolute_error(y_test_reg_sbert, y_pred_rf_reg)
rmse_rf_reg = np.sqrt(mean_squared_error(y_test_reg_sbert, y_pred_rf_reg))
print(f"Mean Absolute Error (MAE): {mae_rf_reg:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf_reg:.4f}")

# Evaluate GradientBoostingClassifier
print("\nGradientBoostingClassifier Evaluation:")
y_pred_gb_cls = gradient_boosting_classifier.predict(X_test_sbert)
print(classification_report(y_test_cls_sbert, y_pred_gb_cls))

# Evaluate GradientBoostingRegressor
print("\nGradientBoostingRegressor Evaluation:")
y_pred_gb_reg = gradient_boosting_regressor.predict(X_test_sbert)
mae_gb_reg = mean_absolute_error(y_test_reg_sbert, y_pred_gb_reg)
rmse_gb_reg = np.sqrt(mean_squared_error(y_test_reg_sbert, y_pred_gb_reg))
print(f"Mean Absolute Error (MAE): {mae_gb_reg:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_gb_reg:.4f}")

RandomForestClassifier Evaluation:
              precision    recall  f1-score   support

           0       0.73      0.66      0.69       905
           1       0.74      0.79      0.77      1095

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000


RandomForestRegressor Evaluation:
Mean Absolute Error (MAE): 0.7661
Root Mean Squared Error (RMSE): 0.9372

GradientBoostingClassifier Evaluation:
              precision    recall  f1-score   support

           0       0.76      0.69      0.72       905
           1       0.76      0.82      0.79      1095

    accuracy                           0.76      2000
   macro avg       0.76      0.76      0.76      2000
weighted avg       0.76      0.76      0.76      2000


GradientBoostingRegressor Evaluation:
Mean Absolute Error (MAE): 0.7049
Root Mean Squared Error (RMSE): 0.8783


## Model Comparison and Conclusion

Based on the evaluation metrics, let's compare the performance of the trained models for both classification and regression tasks.

### Classification Models

| Model | Accuracy | Precision (Class 0) | Recall (Class 0) | F1-Score (Class 0) | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) |
|-------|----------|---------------------|------------------|--------------------|---------------------|------------------|--------------------|
| Logistic Regression (TF-IDF) | 0.81 | 0.80 | 0.77 | 0.79 | 0.82 | 0.84 | 0.83 |
| RandomForestClassifier (S-BERT) | 0.73 | 0.73 | 0.66 | 0.69 | 0.74 | 0.79 | 0.77 |
| GradientBoostingClassifier (S-BERT) | 0.76 | 0.76 | 0.69 | 0.72 | 0.76 | 0.82 | 0.79 |

*Note: Metrics for Classification models are based on the 'High Match' (Class 1) and 'Low Match' (Class 0) binary classification.*

**Interpretation of Classification Metrics:**

*   **Accuracy:** The overall fraction of correctly classified instances. Logistic Regression with TF-IDF has the highest accuracy (0.8110).
*   **Precision:** Of all the instances predicted as a given class, the fraction that actually belong to that class. For 'High Match' (Class 1), Logistic Regression with TF-IDF has the highest precision (0.82), meaning its high-match predictions are correct more often than those of the other two models.
*   **Recall (Sensitivity):** Of all the instances that actually belong to a given class, the fraction that were correctly identified. For 'High Match' (Class 1), Logistic Regression with TF-IDF has the highest recall (0.84), meaning it finds a larger share of the actual high matches.
*   **F1-Score:** The harmonic mean of precision and recall, which balances the two metrics. Logistic Regression with TF-IDF has the highest F1-score for both classes (0.79 for Class 0 and 0.83 for Class 1).
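
The metric definitions above can be checked by hand from confusion-matrix counts. The sketch below does so for a made-up set of counts; the values of `tp`, `fp`, `fn`, and `tn` are illustrative and are not taken from any of the models in this notebook.

```python
# Hypothetical confusion-matrix counts for the positive class ('High Match')
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                           # correct among predicted positives
recall = tp / (tp + fn)                              # found among actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)           # correct among all instances

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

The per-class rows printed by `classification_report` follow the same definitions, computed once with each class treated as the positive class.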

Overall, for the classification task, **Logistic Regression with TF-IDF** is the most effective model.

### Regression Models

| Model                     | MAE      | RMSE     |
|---------------------------|----------|----------|
| Ridge Regression (TF-IDF) | 0.5670 | 0.7305 |
| RandomForestRegressor (S-BERT)| 0.7661 | 0.9372 |
| GradientBoostingRegressor (S-BERT)| 0.7049 | 0.8783 |

**Interpretation of Regression Metrics:**

*   **Mean Absolute Error (MAE):** The average absolute difference between the predicted and actual values. A lower MAE indicates better performance. Ridge Regression with TF-IDF has the lowest MAE (0.5670).
*   **Root Mean Squared Error (RMSE):** The square root of the average of the squared differences between the predicted and actual values. RMSE penalizes larger errors more heavily than MAE. A lower RMSE indicates better performance. Ridge Regression with TF-IDF has the lowest RMSE (0.7305).
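
To illustrate how RMSE penalizes large errors more heavily than MAE, the sketch below computes both metrics for two made-up prediction sets that share the same MAE. The target values and predictions are illustrative and are not drawn from the dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [4, 4, 5, 3]
y_pred_uneven = [4.0, 3.0, 5.0, 5.0]    # absolute errors: 0, 1, 0, 2
y_pred_even = [4.75, 4.75, 5.75, 3.75]  # absolute errors: 0.75 each

for name, y_pred in [("uneven", y_pred_uneven), ("even", y_pred_even)]:
    print(f"{name}: MAE={mae(y_true, y_pred):.4f}, RMSE={rmse(y_true, y_pred):.4f}")
```

Both prediction sets have an MAE of 0.75, but the set that concentrates its error in one large miss has the higher RMSE.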

Overall, for the regression task, **Ridge Regression with TF-IDF** is the most effective model, as it has the lowest error in predicting the match score.

### Conclusion

On this dataset, TF-IDF features with linear models (Logistic Regression for classification, Ridge Regression for regression) outperformed Sentence-BERT embeddings with tree-based models (Random Forest and Gradient Boosting) on every reported metric. This suggests that the term-frequency features captured by TF-IDF are highly relevant to the match score and that linear models are sufficient to achieve good performance here. Note that the comparison changes the text representation and the model family at the same time, so the gap cannot be attributed to either one alone; pairing each representation with each model family would be needed to separate the two effects. While Sentence-BERT can capture semantic nuances, its added overhead and complexity do not appear to be justified for this dataset given how well TF-IDF performs.

