# Job Fit Prediction Model for the Australian Job Market with Random Forest

This project was created to streamline my job application process in the competitive Australian market. With limited time to tailor my resume for each role, I wanted a data-driven approach to prioritize opportunities that align with my skills and experience.

This model analyzes job descriptions and cross-references them with my qualifications, using predefined criteria to evaluate the potential fit. The output is a simple, actionable recommendation: "Yes", the job is worth applying for, or "No", it isn’t.

By automating this decision-making process, I can focus my efforts on roles where I have the highest chance of securing an interview, saving time and ensuring precision in my applications. This project reflects not only my technical capabilities but also my commitment to leveraging data science in solving real-world problems.

## Table of Contents

1. [Import libraries](#Import-libraries)
2. [Import dataset](#Import-dataset)
3. [Exploratory data analysis](#Exploratory-data-analysis)
4. [Declare variables](#Declare-variables)
5. [Split data into training and test set](#Split-data-into-training-and-test-set)
6. [Random Forest Classifier model](#Random-Forest-Classifier-model)
7. [Model Evaluation](#Model-Evaluation)
8. [Results and conclusion](#Results-and-conclusion)

## Import libraries

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Import dataset

In [36]:
df = pd.read_csv('DS Job Market Data Set.csv')

## Exploratory data analysis

In [37]:
df.shape

(25, 12)

In [38]:
df.head()

Unnamed: 0,Job Title,Seniority,Job Location,Company,Estimate Base Salary,Company Size minimum,Company Sector,Min Years of experience (YOE),SQL,PYTHON,Minimum Qualification,Appliable
0,Data Analyst/Data Scientist,Junior,NSW,Peoplebank,100000,200,Staffing and Recruitting,2.0,Y,N,BS,N
1,Data Scientist-Advanced Analytics,Junior,NSW,IBM,100000,10000,IT Services and IT Consulting,,N,Y,,Y
2,"Senior Data Scientist - Collaboration, remote ...",Senior,NSW,Canva,150000,1000,Software Development,,Y,Y,,N
3,Data Scientist,Junior,ACT,Calleo,100000,10,Staffing and Recruiting,,Y,Y,,Y
4,"Data Scientist, Innovation Pathways, FaBA",Junior,QLD,The University of Queensland,100000,5000,Higher Education,,N,Y,M,Y


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Job Title                      25 non-null     object 
 1   Seniority                      25 non-null     object 
 2   Job Location                   25 non-null     object 
 3   Company                        25 non-null     object 
 4   Estimate Base Salary           25 non-null     object 
 5   Company Size minimum           25 non-null     int64  
 6   Company Sector                 25 non-null     object 
 7   Min Years of experience (YOE)  7 non-null      float64
 8   SQL                            25 non-null     object 
 9   PYTHON                         25 non-null     object 
 10  Minimum Qualification          14 non-null     object 
 11  Appliable                      25 non-null     object 
dtypes: float64(1), int64(1), object(10)
memory usage: 2.

##### Summary of variables
- The data set has 12 variables. Seven of them are redundant: they either give no extra information or are not related to the target variable, so they are removed to prevent overfitting and to avoid unnecessary complexity and noise. That leaves only 5.
- These 5 variables are Seniority, Min Years of experience (YOE), SQL, PYTHON and Appliable.
- Of these, all are categorical except for Min Years of experience (YOE), which is numeric.
- Appliable is the target variable.

Explore the target variable:

In [40]:
df['Appliable'].value_counts()

Appliable
N    17
Y     8
Name: count, dtype: int64

We can see that the target variable is binary, with 17 'N' values and 8 'Y' values.
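The counts above also show that the two classes are not balanced. As a rough point of comparison for any accuracy figure, the sketch below computes the accuracy of a trivial model that always predicts the most common class. The `appliable` series is rebuilt here from the printed counts, and the variable names are illustrative rather than part of the original notebook.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (17 'N', 8 'Y').
appliable = pd.Series(['N'] * 17 + ['Y'] * 8, name='Appliable')

# Share of each class in the data set.
class_share = appliable.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the most common class.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained model should beat this baseline of 17/25 = 0.68 before its accuracy is considered meaningful.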

Check for missing values in the variables:

In [41]:
df.isnull().sum()

Job Title                         0
Seniority                         0
Job Location                      0
Company                           0
Estimate Base Salary              0
Company Size minimum              0
Company Sector                    0
Min Years of experience (YOE)    18
SQL                               0
PYTHON                            0
Minimum Qualification            11
Appliable                         0
dtype: int64

There are missing values in Min Years of experience (YOE) and Minimum Qualification. In both columns a missing value means that the job description states no requirement: 0 years of experience or no qualification. The nulls are therefore meaningful rather than noise, and they can be filled with explicit values.

In [42]:
df['Min Years of experience (YOE)'] = df['Min Years of experience (YOE)'].fillna(0)

In [43]:
df['Minimum Qualification'] = df['Minimum Qualification'].fillna('No Qualification')

In [44]:
df.isnull().sum()

Job Title                        0
Seniority                        0
Job Location                     0
Company                          0
Estimate Base Salary             0
Company Size minimum             0
Company Sector                   0
Min Years of experience (YOE)    0
SQL                              0
PYTHON                           0
Minimum Qualification            0
Appliable                        0
dtype: int64

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Job Title                      25 non-null     object 
 1   Seniority                      25 non-null     object 
 2   Job Location                   25 non-null     object 
 3   Company                        25 non-null     object 
 4   Estimate Base Salary           25 non-null     object 
 5   Company Size minimum           25 non-null     int64  
 6   Company Sector                 25 non-null     object 
 7   Min Years of experience (YOE)  25 non-null     float64
 8   SQL                            25 non-null     object 
 9   PYTHON                         25 non-null     object 
 10  Minimum Qualification          25 non-null     object 
 11  Appliable                      25 non-null     object 
dtypes: float64(1), int64(1), object(10)
memory usage: 2.

After replacing the NaN values in Min Years of experience (YOE) with 0 (stored as 0.0, since the column is float64) and the NaN values in Minimum Qualification with the string 'No Qualification', all values in the data frame are non-null.

## Declare variables

In [46]:
X = df.drop(['Appliable', 'Job Title', 'Job Location','Company', 'Estimate Base Salary', 'Company Size minimum', 'Company Sector', 'Minimum Qualification'], axis=1)
y = df['Appliable']

## Split data into training and test set

In [65]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [66]:
X_train.shape, X_test.shape

((16, 4), (9, 4))
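With only 8 'Y' rows in 25, a purely random split can leave very few positive examples in the 9-row test set. One option, which is not what the notebook does, is to pass `stratify` to `train_test_split` so that both subsets keep roughly the original class ratio. The sketch below uses a hypothetical stand-in data set (`X_demo`, `y_demo`) with the same 17/8 class split rather than the real features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 25 rows with the same 17/8 class split.
y_demo = pd.Series(['N'] * 17 + ['Y'] * 8, name='Appliable')
X_demo = pd.DataFrame({'row_id': range(len(y_demo))})

# stratify=y_demo asks for the class ratio to be preserved in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print('train:', y_tr.value_counts().to_dict())
print('test: ', y_te.value_counts().to_dict())
```

The split sizes stay the same (16 training rows, 9 test rows); only the way rows are assigned to each subset changes.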

In [67]:
X

Unnamed: 0,Seniority,Min Years of experience (YOE),SQL,PYTHON
0,Junior,2.0,Y,N
1,Junior,0.0,N,Y
2,Senior,0.0,Y,Y
3,Junior,0.0,Y,Y
4,Junior,0.0,N,Y
5,Senior,0.0,Y,Y
6,Junior,1.0,Y,Y
7,Senior,0.0,Y,Y
8,Senior,0.0,Y,Y
9,Junior,0.0,N,N


In [68]:
X_train

Unnamed: 0,Seniority,Min Years of experience (YOE),SQL,PYTHON
5,Senior,0.0,Y,Y
2,Senior,0.0,Y,Y
12,Junior,0.0,N,Y
15,Junior,3.0,Y,Y
3,Junior,0.0,Y,Y
4,Junior,0.0,N,Y
20,Junior,0.0,Y,Y
17,Senior,0.0,Y,Y
21,Senior,5.0,Y,Y
18,Senior,3.0,Y,N


In [69]:
X_test

Unnamed: 0,Seniority,Min Years of experience (YOE),SQL,PYTHON
8,Senior,0.0,Y,Y
16,Junior,5.0,Y,Y
0,Junior,2.0,Y,N
23,Senior,0.0,Y,Y
11,Senior,0.0,Y,Y
9,Junior,0.0,N,N
13,Senior,0.0,Y,Y
1,Junior,0.0,N,Y
22,Senior,5.0,Y,N


### Encode categorical variables

In [76]:
# import category encoders

import category_encoders as ce

In [79]:
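# Pass the column names explicitly; every feature, including the numeric
# Min Years of experience (YOE), is ordinal-encoded
encoder = ce.OrdinalEncoder(cols=list(X.columns))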


X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [80]:
X_train

Unnamed: 0,Seniority,Min Years of experience (YOE),SQL,PYTHON
5,1,1,1,1
2,1,1,1,1
12,2,1,2,1
15,2,2,1,1
3,2,1,1,1
4,2,1,2,1
20,2,1,1,1
17,1,1,1,1
21,1,3,1,1
18,1,2,1,2
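The codes in the encoded table above appear to follow the order in which each value is first seen in the training data (for example, Senior becomes 1 and Junior becomes 2). The sketch below reproduces that numbering with plain pandas for four of the rows shown above. It only mimics the observed output and is not the `category_encoders` implementation; the names `sample` and `encoded` are illustrative.

```python
import pandas as pd

# Four rows copied from the unencoded X_train output above.
sample = pd.DataFrame(
    {
        'Seniority': ['Senior', 'Senior', 'Junior', 'Junior'],
        'Min Years of experience (YOE)': [0.0, 0.0, 0.0, 3.0],
        'SQL': ['Y', 'Y', 'N', 'Y'],
        'PYTHON': ['Y', 'Y', 'Y', 'Y'],
    },
    index=[5, 2, 12, 15],
)

# pd.factorize numbers values by order of first appearance, starting at 0;
# adding 1 gives codes that start at 1, as in the encoded table above.
encoded = pd.DataFrame(
    {col: pd.factorize(sample[col])[0] + 1 for col in sample.columns},
    index=sample.index,
)

print(encoded)
```

Note that Min Years of experience (YOE) is treated as a category here as well, so an encoded value of 2 stands for 3.0 years in this sample, not for 2 years.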


## Random Forest Classifier model

In [None]:
# import Random Forest classifier

from sklearn.ensemble import RandomForestClassifier

In [None]:
# instantiate the classifier 

rfc = RandomForestClassifier(random_state=0)

In [None]:
# fit the model

rfc.fit(X_train, y_train)

In [None]:
# Predict the Test set results

y_pred = rfc.predict(X_test)

In [107]:
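# Test using a new data frame whose values are already ordinal-encoded
# (column names must match the training columns exactly)

X_another_test = pd.DataFrame({
    'Seniority': [2],  # encoded seniority (2 = Junior in the encoded X_train above)
    'Min Years of experience (YOE)': [2],  # encoded experience category, not raw years
    'SQL': [1],  # encoded SQL requirement (1 = Y)
    'PYTHON': [1]  # encoded Python requirement (1 = Y)
})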

In [108]:
X_another_test

Unnamed: 0,Seniority,Min Years of experience (YOE),SQL,PYTHON
0,2,2,1,1


In [106]:
predictions = rfc.predict(X_another_test)
print(predictions)

['N']


The model predicts 'N' for this junior role that requires prior experience. This is consistent with the model having learnt that senior roles and roles requiring any years of experience are not worth applying for, which matches my criteria.
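A single prediction gives limited insight into what a forest has learnt. One way to look further is to inspect the impurity-based `feature_importances_` of the fitted classifier. The sketch below does this on a small hypothetical encoded data set (`X_toy`, `y_toy`), which is not the notebook's data; it follows a made-up rule where only junior roles with no experience requirement are labelled 'Y'.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical encoded training data (NOT the notebook's data), using the same
# column names and coding as above: Seniority 1 = Senior, 2 = Junior.
X_toy = pd.DataFrame({
    'Seniority': [1, 1, 2, 2, 2, 1, 2, 1],
    'Min Years of experience (YOE)': [1, 2, 1, 2, 1, 1, 1, 3],
    'SQL': [1, 1, 2, 1, 1, 1, 2, 1],
    'PYTHON': [1, 2, 1, 1, 1, 1, 1, 1],
})
# Toy rule: only junior roles in the first experience category are labelled 'Y'.
y_toy = ['N', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N']

toy_model = RandomForestClassifier(random_state=0).fit(X_toy, y_toy)

# Impurity-based importances; they sum to 1 across the features.
importances = pd.Series(toy_model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's own `rfc` and `X_train`, the same two lines at the end would show which of the four features drive the predictions most.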

## Model Evaluation

In [86]:
# Check accuracy score and per-class metrics

from sklearn.metrics import accuracy_score, classification_report

print(classification_report(y_test, y_pred))
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

              precision    recall  f1-score   support

           N       1.00      0.86      0.92         7
           Y       0.67      1.00      0.80         2

    accuracy                           0.89         9
   macro avg       0.83      0.93      0.86         9
weighted avg       0.93      0.89      0.90         9

Model accuracy score: 0.8889


From the model accuracy assessment above, we can conclude that the model accuracy is good: it classifies 8 of the 9 test jobs correctly (88.89%).

In [109]:
# Print the confusion matrix (rows: true labels, columns: predicted labels)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

Confusion matrix

 [[6 1]
 [0 2]]


Performance metrics show that the model is highly accurate, where accuracy = (TP + TN) / Total Predictions = 8/9 = 88.89%. Furthermore, the recall is 100%, as the model identifies all instances of the positive class ('Y'). Precision comes to about 66.67%. Overall the model is performing well, especially as recall is a priority. There is one false positive (a job predicted 'Y' whose true label is 'N'); however, the count is low, so I do not waste too much time applying to non-applicable jobs.
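The figures in this paragraph can be recomputed directly from the confusion matrix printed above. The sketch below is a worked calculation that assumes 'Y' is the positive class and that rows and columns follow scikit-learn's sorted label order ['N', 'Y']; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels, columns are
# predicted labels, both in the order ['N', 'Y'].
cm = [[6, 1],
      [0, 2]]

# Treat 'Y' (worth applying for) as the positive class.
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 8 / 9
precision = tp / (tp + fp)                  # 2 / 3
recall = tp / (tp + fn)                     # 2 / 2

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.8889 precision=0.6667 recall=1.0000
```

These values agree with the 'Y' row of the classification report: precision 0.67, recall 1.00, and an overall accuracy of 0.89.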

## Results and conclusion

This project aimed to streamline the job application process by developing a data-driven model to predict whether a job is worth applying for, based on the alignment of job descriptions with personal qualifications and preferences. The model was built to optimize time and effort by providing actionable recommendations: "Yes" (apply) or "No" (do not apply).

Key results:
- Accuracy: The model achieved an accuracy of 88.89%, indicating it correctly predicted the applicability of most jobs.
- Recall (Sensitivity): With a recall of 100% for applicable jobs, the model successfully identified all relevant opportunities, ensuring no potential jobs were overlooked.
- Precision: The precision of 66.67% highlights that while most predictions for applicable jobs were correct, there is room for improvement in minimizing false positives.

The project successfully met its objective of automating the decision-making process for job applications. By leveraging the model, it is possible to focus efforts on roles where there is a high likelihood of securing an interview, significantly improving efficiency.