# PREPROCESSING DATA

## Imports and pandas display options for viewing the data in full

In [4]:
import os
os.environ["OMP_NUM_THREADS"] = "1"

%load_ext autoreload
%autoreload 2
    
import json
import ast
import inspect
from sklearn.preprocessing import MultiLabelBinarizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display, HTML
from reutilizabile.common_imports import *
from reutilizabile.missing_freq_unique import *
from reutilizabile.feature_engineering import *
from reutilizabile.plots import *

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Read the local dataset

In [6]:
df = pd.read_csv("data/job_description.csv")

## Using `info()`, `describe()`, and `head()` to understand the data types, size, and structure

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1408970 entries, 0 to 1408969
Data columns (total 16 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   Experience        1408970 non-null  object
 1   Qualifications    1408970 non-null  object
 2   Salary Range      1408970 non-null  object
 3   location          1408970 non-null  object
 4   Country           1408970 non-null  object
 5   Work Type         1408970 non-null  object
 6   Company Size      1408970 non-null  int64 
 7   Job Posting Date  1408970 non-null  object
 8   Job Title         1408970 non-null  object
 9   Role              1408970 non-null  object
 10  Job Portal        1408970 non-null  object
 11  Job Description   1408970 non-null  object
 12  Benefits          1408970 non-null  object
 13  skills            1408970 non-null  object
 14  Responsibilities  1408970 non-null  object
 15  Company           1408970 non-null  object
dtypes: int64(1), objec

In [9]:
df.describe()

| | Company Size |
|---|---|
| count | 1408970.0 |
| mean | 73709.42 |
| std | 35298.21 |
| min | 12646.0 |
| 25% | 43130.0 |
| 50% | 73634.0 |
| 75% | 104302.8 |
| max | 134834.0 |


In [10]:
df.head()

Unnamed: 0,Experience,Qualifications,Salary Range,location,Country,Work Type,Company Size,Job Posting Date,Job Title,Role,Job Portal,Job Description,Benefits,skills,Responsibilities,Company
0,5 to 15 Years,M.Tech,$59K-$99K,Douglas,Isle of Man,Intern,26801,2022-04-24,Digital Marketing Specialist,Social Media Manager,Snagajob,"Social Media Managers oversee an organizations social media presence. They create and schedule content, engage with followers, and analyze social media metrics to drive brand awareness and engagement.","{'Flexible Spending Accounts (FSAs), Relocation Assistance, Legal Assistance, Employee Recognition Programs, Financial Counseling'}","Social media platforms (e.g., Facebook, Twitter, Instagram) Content creation and scheduling Social media analytics and insights Community engagement Paid social advertising","Manage and grow social media accounts, create engaging content, and interact with the online community. Develop social media content calendars and strategies. Monitor social media trends and engagement metrics.",Icahn Enterprises
1,2 to 12 Years,BCA,$56K-$116K,Ashgabat,Turkmenistan,Intern,100340,2022-12-19,Web Developer,Frontend Web Developer,Idealist,"Frontend Web Developers design and implement user interfaces for websites, ensuring they are visually appealing and user-friendly. They collaborate with designers and backend developers to create seamless web experiences for users.","{'Health Insurance, Retirement Plans, Paid Time Off (PTO), Flexible Work Arrangements, Employee Assistance Programs (EAP)'}","HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)","Design and code user interfaces for websites, ensuring a seamless and visually appealing user experience. Collaborate with UX designers to optimize user journeys. Ensure cross-browser compatibility and responsive design.",PNC Financial Services Group
2,0 to 12 Years,PhD,$61K-$104K,Macao,"Macao SAR, China",Temporary,84525,2022-09-14,Operations Manager,Quality Control Manager,Jobs2Careers,"Quality Control Managers establish and enforce quality standards within an organization. They develop quality control processes, perform inspections, and implement corrective actions to maintain product or service quality.","{'Legal Assistance, Bonuses and Incentive Programs, Wellness Programs, Employee Discounts, Retirement Plans'}","Quality control processes and methodologies Statistical process control (SPC) Root cause analysis and corrective action Quality management systems (e.g., ISO 9001) Compliance and regulatory knowledge",Establish and enforce quality control standards and procedures. Conduct quality audits and inspections. Collaborate with production teams to address quality issues and implement improvements.,United Services Automobile Assn.
3,4 to 11 Years,PhD,$65K-$91K,Porto-Novo,Benin,Full-Time,129896,2023-02-25,Network Engineer,Wireless Network Engineer,FlexJobs,"Wireless Network Engineers design, implement, and maintain wireless network solutions. They optimize wireless connectivity, troubleshoot issues, and ensure reliable and secure wireless communications.","{'Transportation Benefits, Professional Development, Bonuses and Incentive Programs, Profit-Sharing, Employee Discounts'}",Wireless network design and architecture Wi-Fi standards and protocols RF (Radio Frequency) planning and optimization Wireless security protocols Troubleshooting wireless network issues,"Design, configure, and optimize wireless networks, ensuring reliable and secure wireless connectivity. Troubleshoot wireless network issues. Plan and implement wireless network upgrades.",Hess
4,1 to 12 Years,MBA,$64K-$87K,Santiago,Chile,Intern,53944,2022-10-11,Event Manager,Conference Manager,Jobs2Careers,"A Conference Manager coordinates and manages conferences, meetings, and events. They plan logistics, handle budgeting, liaise with vendors, and ensure the smooth execution of events, catering to the needs and expectations of attendees.","{'Flexible Spending Accounts (FSAs), Relocation Assistance, Legal Assistance, Employee Recognition Programs, Financial Counseling'}",Event planning Conference logistics Budget management Vendor coordination Marketing and promotion Client relations,"Specialize in conference and convention planning. Coordinate speaker sessions, exhibitors, and attendee experiences. Oversee event registration and marketing.",Cairn Energy


## Using Custom Functions to Analyze Missing, Frequent, and Unique Values

### Function for analyzing missing data:

In [13]:
print(inspect.getsource(missing_data))

def missing_data(data):
    total = data.isnull().sum()
    percent = (total/data.isnull().count()*100)
    tt = pd.concat([total,percent], axis=1, keys=['Total','Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))



In [14]:
missing = missing_data(df)
print(missing)

        Experience Qualifications Salary Range location Country Work Type  \
Total            0              0            0        0       0         0   
Percent        0.0            0.0          0.0      0.0     0.0       0.0   
Types       object         object       object   object  object    object   

        Company Size Job Posting Date Job Title    Role Job Portal  \
Total              0                0         0       0          0   
Percent          0.0              0.0       0.0     0.0        0.0   
Types          int64           object    object  object     object   

        Job Description Benefits  skills Responsibilities Company  
Total                 0        0       0                0       0  
Percent             0.0      0.0     0.0              0.0     0.0  
Types            object   object  object           object  object  
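The same summary can be sketched more compactly with vectorized pandas calls. This is an illustrative alternative, not the notebook's `missing_data`; the name `missing_summary` and the toy frame are assumptions made for the example.

```python
import pandas as pd


def missing_summary(data):
    """Hypothetical compact variant of missing_data (name is an assumption)."""
    summary = pd.DataFrame({
        "Total": data.isna().sum(),          # number of missing values per column
        "Percent": data.isna().mean() * 100, # share of missing values per column
        "Types": data.dtypes.astype(str),    # dtype of each column as text
    })
    return summary.T  # same orientation as the original: one column per feature


# Toy data, used only to show the shape of the result
toy = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": ["x", "y", None, "z"]})
result = missing_summary(toy)
print(result)
```

Because `isna().mean()` is the fraction of missing rows, it avoids the separate `isnull().count()` division used in the original.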


### Function for analyzing most frequent values:

In [16]:
print(inspect.getsource(most_frequent_values))

def most_frequent_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    items = []
    vals = []
    for col in data.columns:
        try:
            itm = data[col].value_counts().index[0]
            val = data[col].value_counts().values[0]
            items.append(itm)
            vals.append(val)
        except Exception as ex:
            print(ex)
            items.append(0)
            vals.append(0)
            continue
    tt['Most frequent item'] = items
    tt['Frequence'] = vals
    tt['Percent from total'] = np.round(vals / total * 100, 3)
    return(np.transpose(tt))
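`most_frequent_values` calls `value_counts()` twice per column, which repeats work on a dataset of this size. A minimal sketch that computes the counts once per column could look like the following; `most_frequent_summary` and the toy data are hypothetical and introduced only for illustration.

```python
import pandas as pd


def most_frequent_summary(data):
    """Hypothetical variant of most_frequent_values that counts each column once."""
    rows = {}
    for col in data.columns:
        counts = data[col].value_counts()  # sorted by frequency, computed once
        non_null = data[col].count()
        if counts.empty:
            # Column with no non-null values: fall back to neutral entries
            rows[col] = {"Total": non_null, "Most frequent item": None,
                         "Frequence": 0, "Percent from total": 0.0}
            continue
        top_count = int(counts.iloc[0])
        rows[col] = {
            "Total": non_null,
            "Most frequent item": counts.index[0],
            "Frequence": top_count,
            "Percent from total": round(top_count / non_null * 100, 3),
        }
    # Outer keys become columns, inner keys become rows
    return pd.DataFrame(rows)


# Toy data, used only to show the shape of the result
toy = pd.DataFrame({"city": ["Seoul", "Seoul", "Lima"], "kind": ["a", "b", "b"]})
result = most_frequent_summary(toy)
print(result)
```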



In [17]:
frequent_values = most_frequent_values(df)
print(frequent_values)

                      Experience Qualifications Salary Range location  \
Total                    1408970        1408970      1408970  1408970   
Most frequent item  5 to 8 Years            BBA    $62K-$82K    Seoul   
Frequence                  29760         141518         2673    13203   
Percent from total         2.112         10.044         0.19    0.937   

                    Country  Work Type Company Size Job Posting Date  \
Total               1408970    1408970      1408970          1408970   
Most frequent item  Somalia  Part-Time        83541       2021-11-14   
Frequence              6713     282758           28             2081   
Percent from total    0.476     20.068        0.002            0.148   

                         Job Title                  Role Job Portal  \
Total                      1408970               1408970    1408970   
Most frequent item  UX/UI Designer  Interaction Designer   FlexJobs   
Frequence                    42446                 17999    

### Function for unique values:

In [19]:
print(inspect.getsource(unique_values))

def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    unique_vals = {}
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
        unique_vals[col] = data[col].unique()
    tt['Uniques'] = uniques
    return(np.transpose(tt))



In [20]:
unique = unique_values(df)
print(unique)

         Experience  Qualifications  Salary Range  location  Country  \
Total       1408970         1408970       1408970   1408970  1408970   
Uniques          48              10           561       214      216   

         Work Type  Company Size  Job Posting Date  Job Title     Role  \
Total      1408970       1408970           1408970    1408970  1408970   
Uniques          5        122188               731        147      376   

         Job Portal  Job Description  Benefits   skills  Responsibilities  \
Total       1408970          1408970   1408970  1408970           1408970   
Uniques          16              376        11      376               375   

         Company  
Total    1408970  
Uniques      888  
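In `unique_values`, the `unique_vals` dictionary is filled but never returned. Assuming only the counts are needed, the table can be sketched directly from `count()` and `nunique()`; `unique_summary` and the toy frame are hypothetical names for this example.

```python
import pandas as pd


def unique_summary(data):
    """Hypothetical compact variant of unique_values (name is an assumption)."""
    return pd.DataFrame({
        "Total": data.count(),      # non-null values per column
        "Uniques": data.nunique(),  # distinct non-null values per column
    }).T


# Toy data, used only to show the shape of the result
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "z"]})
result = unique_summary(toy)
print(result)
```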


## Feature engineering and custom aggregation

### Extracting Min, Max, and Average Salary in RON from ranged values using a custom function

In [23]:
print(inspect.getsource(salary_interval))

def salary_interval(row):
    salary = row['Salary Range']
    separator = "K"
    min_salary = salary[1:].split(separator)[0]
    max_salary = salary.split(separator)[1]
    max_salary = max_salary[2:].split(separator)[0]

    min_salary = int(min_salary)*1000*4.3
    max_salary = int(max_salary)*1000*4.3 #to make it ron
    average_salary = int((min_salary+max_salary)/2)
    return min_salary, max_salary, average_salary



In [24]:
df[['Min_Salary','Max_Salary','Average_Salary']] = df.apply(salary_interval, axis=1, result_type='expand')
print(df['Min_Salary'].head())
print(df['Max_Salary'].head())
print(df['Average_Salary'].head())

0    253700.0
1    240800.0
2    262300.0
3    279500.0
4    275200.0
Name: Min_Salary, dtype: float64
0    425700.0
1    498800.0
2    447200.0
3    391300.0
4    374100.0
Name: Max_Salary, dtype: float64
0    339700.0
1    369800.0
2    354750.0
3    335400.0
4    324650.0
Name: Average_Salary, dtype: float64
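Row-wise `apply` over 1.4 million rows can be slow. As a hedged alternative, the same three columns can be sketched with a vectorized regular expression. The function name `salary_columns` is hypothetical, and the 4.3 USD-to-RON rate is copied from `salary_interval` as a fixed assumption rather than a live exchange rate. Unlike the original, this sketch does not truncate the average with `int()`.

```python
import pandas as pd

USD_TO_RON = 4.3  # fixed rate taken from salary_interval; an assumption, not a live rate


def salary_columns(salary_range, rate=USD_TO_RON):
    """Hypothetical vectorized variant of salary_interval for a Series like '$59K-$99K'."""
    # Capture the two numbers between '$' and 'K'
    bounds = salary_range.str.extract(r"\$(\d+)K-\$(\d+)K").astype(float)
    out = pd.DataFrame({
        "Min_Salary": bounds[0] * 1000 * rate,
        "Max_Salary": bounds[1] * 1000 * rate,
    })
    out["Average_Salary"] = (out["Min_Salary"] + out["Max_Salary"]) / 2
    return out


# Two values taken from the first rows of the dataset shown earlier
toy = pd.Series(["$59K-$99K", "$56K-$116K"])
result = salary_columns(toy)
print(result)
```

In the notebook this would be used roughly as `df[['Min_Salary', 'Max_Salary', 'Average_Salary']] = salary_columns(df['Salary Range'])`.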


### Analyzing the unique values and distribution of the average, minimum, and maximum salaries

In [26]:
df['Average_Salary'].unique()

array([339700., 369800., 354750., 335400., 324650., 326800., 356900.,
       359050., 301000., 346150., 348300., 402050., 305300., 333250.,
       307450., 378400., 414950., 391300., 374100., 399900., 352600.,
       316050., 320350., 395600., 296700., 387000., 382700., 412800.,
       337550., 298850., 410650., 344000., 417100., 318200., 331100.,
       406350., 371950., 328950., 322500., 380550., 313900., 389150.,
       393450., 303150., 384850., 341850., 404200., 397750., 361200.,
       363350., 408500., 367650., 365500., 350450., 376250., 311750.,
       309600., 292400., 294550., 290250., 419250.])

In [27]:
df['Average_Salary'].value_counts().sort_index()

Average_Salary
290250.0     2549
292400.0     5111
294550.0     7601
296700.0    10089
298850.0    12591
301000.0    14988
303150.0    17487
305300.0    20148
307450.0    22492
309600.0    25324
311750.0    27807
313900.0    27520
316050.0    27955
318200.0    27583
320350.0    27645
322500.0    27464
324650.0    27676
326800.0    28117
328950.0    27344
331100.0    27709
333250.0    27865
335400.0    27575
337550.0    27610
339700.0    27853
341850.0    27790
344000.0    27713
346150.0    27547
348300.0    27444
350450.0    27641
352600.0    27503
354750.0    27509
356900.0    27693
359050.0    27739
361200.0    27581
363350.0    27522
365500.0    27543
367650.0    27637
369800.0    27592
371950.0    27862
374100.0    27587
376250.0    27375
378400.0    27432
380550.0    27806
382700.0    27427
384850.0    27578
387000.0    27329
389150.0    27716
391300.0    27506
393450.0    27439
395600.0    27822
397750.0    27612
399900.0    24897
402050.0    22654
404200.0    20102
406350.0    1

In [28]:
df['Min_Salary'].unique()

array([253700., 240800., 262300., 279500., 275200., 270900., 258000.,
       245100., 236500., 266600., 249400.])

In [29]:
df['Min_Salary'].value_counts().sort_index()

Min_Salary
236500.0    128684
240800.0    127772
245100.0    127509
249400.0    127870
253700.0    128193
258000.0    127785
262300.0    128112
266600.0    128247
270900.0    127834
275200.0    128310
279500.0    128654
Name: count, dtype: int64

In [30]:
df['Max_Salary'].unique()

array([425700., 498800., 447200., 391300., 374100., 399900., 442900.,
       438600., 344000., 421400., 524600., 369800., 430000., 356900.,
       503100., 520300., 550400., 516000., 533200., 460100., 378400.,
       511700., 494500., 352600., 559000., 554700., 382700., 348300.,
       468700., 408500., 417100., 546100., 451500., 528900., 490200.,
       395600., 537500., 477300., 404200., 361200., 412800., 434300.,
       485900., 455800., 541800., 464400., 365500., 507400., 481600.,
       387000., 473000.])

In [31]:
df['Max_Salary'].value_counts().sort_index()

Max_Salary
344000.0    27845
348300.0    27692
352600.0    27655
356900.0    27447
361200.0    27788
365500.0    27756
369800.0    27420
374100.0    27575
378400.0    27602
382700.0    27724
387000.0    27822
391300.0    27823
395600.0    27521
399900.0    27755
404200.0    27512
408500.0    27914
412800.0    27909
417100.0    27755
421400.0    27581
425700.0    27725
430000.0    27794
434300.0    27923
438600.0    27599
442900.0    27543
447200.0    27671
451500.0    27285
455800.0    27608
460100.0    27899
464400.0    27491
468700.0    27759
473000.0    27574
477300.0    27397
481600.0    27499
485900.0    27357
490200.0    27636
494500.0    27450
498800.0    27623
503100.0    27545
507400.0    27466
511700.0    27729
516000.0    27576
520300.0    27482
524600.0    27506
528900.0    27623
533200.0    27557
537500.0    27584
541800.0    27507
546100.0    27520
550400.0    27730
554700.0    27571
559000.0    27645
Name: count, dtype: int64

### After calculating the average salary, I converted it into a **categorical variable** by splitting the values into 3 classes (percentile-based binning):
- `Low` – salaries at or below the 33.33rd percentile
- `Medium` – salaries above the 33.33rd percentile, up to and including the 66.67th percentile
- `High` – salaries above the 66.67th percentile

The custom function I used:

In [34]:
print(inspect.getsource(tercile_label))

def tercile_label(s,low_33,high_67):
    if s <= low_33:
        return 'Low'
    elif s <= high_67:
        return 'Medium'
    else:
        return 'High'



In [35]:
low_33, high_67 = np.percentile(df['Average_Salary'], [33.33, 66.67])
df['Salary_Tercile'] = df['Average_Salary'].apply(tercile_label,args=(low_33, high_67))
df[['Salary_Tercile','Average_Salary']].head()

| | Salary_Tercile | Average_Salary |
|---|---|---|
| 0 | Medium | 339700.0 |
| 1 | Medium | 369800.0 |
| 2 | Medium | 354750.0 |
| 3 | Low | 335400.0 |
| 4 | Low | 324650.0 |


### Because most models require numeric inputs, I map `Low`, `Medium`, and `High` to 0, 1, and 2, which preserves their order

In [37]:
encoding_map = {'Low': 0, 'Medium': 1, 'High': 2}
df["Salary"] = df["Salary_Tercile"].map(encoding_map)
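The percentile binning and the 0/1/2 encoding can also be sketched in one pass with `pd.cut`, which is an alternative to `tercile_label` plus `map`, not the notebook's method. The toy salaries below are a small sample chosen for illustration; the cut points are computed the same way, with `np.percentile`.

```python
import numpy as np
import pandas as pd

# Toy sample of average salaries, used only for illustration
salaries = pd.Series([290250.0, 320350.0, 339700.0, 354750.0, 369800.0, 419250.0])

# Same cut points as in the notebook: the 33.33rd and 66.67th percentiles
low_33, high_67 = np.percentile(salaries, [33.33, 66.67])

# right=True (the default) makes each bin include its upper edge,
# which mirrors the `<=` comparisons in tercile_label
terciles = pd.cut(
    salaries,
    bins=[-np.inf, low_33, high_67, np.inf],
    labels=["Low", "Medium", "High"],
)

# Ordered categories give integer codes in label order: Low=0, Medium=1, High=2
codes = terciles.cat.codes
print(pd.DataFrame({"salary": salaries, "tercile": terciles, "code": codes}))
```

Keeping the result as an ordered categorical keeps the label and its integer code together, so a separate mapping dictionary is not needed.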

### Dropped the `Salary Range` and `Salary_Tercile` columns because they are no longer needed

In [39]:
df.columns

Index(['Experience', 'Qualifications', 'Salary Range', 'location', 'Country',
       'Work Type', 'Company Size', 'Job Posting Date', 'Job Title', 'Role',
       'Job Portal', 'Job Description', 'Benefits', 'skills',
       'Responsibilities', 'Company', 'Min_Salary', 'Max_Salary',
       'Average_Salary', 'Salary_Tercile', 'Salary'],
      dtype='object')

In [40]:
df = df.drop(['Salary Range','Salary_Tercile'], axis=1)

## One-Hot Encoding of Multi-Label Benefits

### The original `Benefits` column stores several benefits in a single string, separated by commas and wrapped in braces (e.g., `{'Health Insurance, Retirement Plans, Paid Time Off (PTO)'}`).
I want to parse the benefits and put each one in its own binary column.
This encoding lets machine learning models treat the presence or absence of each benefit independently.

### One-hot encoded representation for Benefits column

In [44]:
# 1. Create the 'Benefits list' column by parsing the 'Benefits' column
df['Benefits list'] = df['Benefits'].apply(parse_benefits_string)

# 2. Perform One-Hot Encoding
# Explode the 'Benefits list' column to get individual benefits in rows
df_exploded = df.explode('Benefits list')

# Create dummy variables (one-hot encode) from the exploded benefits
benefits_dummies = pd.get_dummies(df_exploded['Benefits list'], prefix='', prefix_sep='')

# Group back by the original DataFrame index and sum the dummy variables
benefits_dummies = benefits_dummies.groupby(level=0).sum()

# 3. Prepare the main DataFrame and join the new columns
# Crucially, drop the original 'Benefits' and the intermediate 'Benefits list' columns
# from 'df' before joining to prevent column name conflicts.
df = df.drop(columns=['Benefits', 'Benefits list'])

# Join the one-hot encoded benefit columns to the cleaned DataFrame
df = df.join(benefits_dummies)
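The source of `parse_benefits_string` is not shown in this notebook. The sketch below is a guess at what such a helper might do, based only on the `Benefits` values visible in `df.head()`; the name `parse_benefits_string_sketch` and its behavior are assumptions. It also uses `max()` instead of `sum()` when collapsing the exploded rows, so a benefit listed twice in one posting would still yield 1.

```python
import ast

import pandas as pd


def parse_benefits_string_sketch(raw):
    """Hypothetical stand-in for parse_benefits_string: returns a list of benefit names."""
    try:
        # "{'A, B'}" is a Python set literal holding one comma-separated string
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = {raw}  # fall back to treating the raw text as one chunk
    if isinstance(parsed, str):
        parsed = {parsed}
    benefits = []
    for chunk in parsed:
        benefits.extend(part.strip() for part in str(chunk).split(",") if part.strip())
    return benefits


# Toy rows in the same format as the Benefits column
toy = pd.DataFrame({"Benefits": [
    "{'Health Insurance, Retirement Plans, Paid Time Off (PTO)'}",
    "{'Legal Assistance, Retirement Plans'}",
]})
toy["Benefits list"] = toy["Benefits"].apply(parse_benefits_string_sketch)

# One row per (posting, benefit), then one indicator column per benefit
exploded = toy["Benefits list"].explode()
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(indicators)
```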


### Save the changes I made so far to the dataset using this function:

In [46]:
print(inspect.getsource(save_changes))

def save_changes(data, filename="data.csv"):
    data.to_csv(filename, index=False)
    return data



In [47]:
df = save_changes(df)

In [48]:
df.head()

Unnamed: 0,Experience,Qualifications,location,Country,Work Type,Company Size,Job Posting Date,Job Title,Role,Job Portal,Job Description,skills,Responsibilities,Company,Min_Salary,Max_Salary,Average_Salary,Salary,Bonuses and Incentive Programs,Casual Dress Code,Childcare Assistance,Employee Assistance Programs (EAP),Employee Discounts,Employee Recognition Programs,Employee Referral Programs,Financial Counseling,Flexible Spending Accounts (FSAs),Flexible Work Arrangements,Health Insurance,Health and Wellness Facilities,Legal Assistance,Life and Disability Insurance,Paid Time Off (PTO),Parental Leave,Professional Development,Profit-Sharing,Relocation Assistance,Retirement Plans,Social and Recreational Activities,Stock Options or Equity Grants,Transportation Benefits,Tuition Reimbursement,Wellness Programs
0,5 to 15 Years,M.Tech,Douglas,Isle of Man,Intern,26801,2022-04-24,Digital Marketing Specialist,Social Media Manager,Snagajob,"Social Media Managers oversee an organizations social media presence. They create and schedule content, engage with followers, and analyze social media metrics to drive brand awareness and engagement.","Social media platforms (e.g., Facebook, Twitter, Instagram) Content creation and scheduling Social media analytics and insights Community engagement Paid social advertising","Manage and grow social media accounts, create engaging content, and interact with the online community. Develop social media content calendars and strategies. Monitor social media trends and engagement metrics.",Icahn Enterprises,253700.0,425700.0,339700.0,1,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,2 to 12 Years,BCA,Ashgabat,Turkmenistan,Intern,100340,2022-12-19,Web Developer,Frontend Web Developer,Idealist,"Frontend Web Developers design and implement user interfaces for websites, ensuring they are visually appealing and user-friendly. They collaborate with designers and backend developers to create seamless web experiences for users.","HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)","Design and code user interfaces for websites, ensuring a seamless and visually appealing user experience. Collaborate with UX designers to optimize user journeys. Ensure cross-browser compatibility and responsive design.",PNC Financial Services Group,240800.0,498800.0,369800.0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0
2,0 to 12 Years,PhD,Macao,"Macao SAR, China",Temporary,84525,2022-09-14,Operations Manager,Quality Control Manager,Jobs2Careers,"Quality Control Managers establish and enforce quality standards within an organization. They develop quality control processes, perform inspections, and implement corrective actions to maintain product or service quality.","Quality control processes and methodologies Statistical process control (SPC) Root cause analysis and corrective action Quality management systems (e.g., ISO 9001) Compliance and regulatory knowledge",Establish and enforce quality control standards and procedures. Conduct quality audits and inspections. Collaborate with production teams to address quality issues and implement improvements.,United Services Automobile Assn.,262300.0,447200.0,354750.0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1
3,4 to 11 Years,PhD,Porto-Novo,Benin,Full-Time,129896,2023-02-25,Network Engineer,Wireless Network Engineer,FlexJobs,"Wireless Network Engineers design, implement, and maintain wireless network solutions. They optimize wireless connectivity, troubleshoot issues, and ensure reliable and secure wireless communications.",Wireless network design and architecture Wi-Fi standards and protocols RF (Radio Frequency) planning and optimization Wireless security protocols Troubleshooting wireless network issues,"Design, configure, and optimize wireless networks, ensuring reliable and secure wireless connectivity. Troubleshoot wireless network issues. Plan and implement wireless network upgrades.",Hess,279500.0,391300.0,335400.0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0
4,1 to 12 Years,MBA,Santiago,Chile,Intern,53944,2022-10-11,Event Manager,Conference Manager,Jobs2Careers,"A Conference Manager coordinates and manages conferences, meetings, and events. They plan logistics, handle budgeting, liaise with vendors, and ensure the smooth execution of events, catering to the needs and expectations of attendees.",Event planning Conference logistics Budget management Vendor coordination Marketing and promotion Client relations,"Specialize in conference and convention planning. Coordinate speaker sessions, exhibitors, and attendee experiences. Oversee event registration and marketing.",Cairn Energy,275200.0,374100.0,324650.0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0


## Standardize experience data
The `Experience` column contains values in a text format such as `'5 to 15 Years'`, which are not directly usable in numerical models.

We extract the minimum and maximum years, compute their average (truncated to an integer), and store it in a new numeric column `Average_experience` using a custom function:

In [50]:
print(inspect.getsource(parse_experience))

def parse_experience(exp):
    years = exp.replace("Years", "").strip().split("to")
    return int((int(years[0]) + int(years[1])) / 2)
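For example, `'4 to 11 Years'` gives (4 + 11) / 2 = 7.5, which `int()` truncates to 7. A vectorized sketch of the same computation is shown here as an alternative to calling `parse_experience` row by row; the name `average_experience` is hypothetical, and the pattern assumes every value follows the `'<min> to <max> Years'` format seen in the data.

```python
import pandas as pd


def average_experience(experience):
    """Hypothetical vectorized variant of parse_experience for a Series of ranges."""
    # Capture the lower and upper bound of '<min> to <max> Years'
    bounds = experience.str.extract(r"(\d+)\s+to\s+(\d+)").astype(int)
    # Floor division matches int() truncation for non-negative values
    return (bounds[0] + bounds[1]) // 2


# Values taken from the Experience column
toy = pd.Series(["5 to 15 Years", "4 to 11 Years", "0 to 12 Years"])
result = average_experience(toy)
print(result.tolist())
```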



In [51]:
df["Experience"].unique()

array(['5 to 15 Years', '2 to 12 Years', '0 to 12 Years', '4 to 11 Years',
       '1 to 12 Years', '4 to 12 Years', '3 to 15 Years', '2 to 8 Years',
       '2 to 9 Years', '1 to 10 Years', '3 to 10 Years', '1 to 8 Years',
       '1 to 9 Years', '5 to 14 Years', '0 to 11 Years', '3 to 12 Years',
       '5 to 9 Years', '0 to 15 Years', '0 to 10 Years', '2 to 14 Years',
       '3 to 9 Years', '4 to 15 Years', '2 to 10 Years', '4 to 8 Years',
       '3 to 8 Years', '1 to 14 Years', '1 to 13 Years', '0 to 8 Years',
       '5 to 10 Years', '2 to 13 Years', '4 to 9 Years', '1 to 15 Years',
       '4 to 10 Years', '5 to 12 Years', '0 to 13 Years', '4 to 14 Years',
       '1 to 11 Years', '4 to 13 Years', '0 to 9 Years', '5 to 8 Years',
       '2 to 15 Years', '5 to 13 Years', '5 to 11 Years', '0 to 14 Years',
       '3 to 13 Years', '2 to 11 Years', '3 to 11 Years', '3 to 14 Years'],
      dtype=object)

In [52]:
df["Average_experience"] = df["Experience"].apply(parse_experience)

In [53]:
df["Average_experience"].head()

0    10
1     7
2     6
3     7
4     6
Name: Average_experience, dtype: int64

In [54]:
df["Average_experience"].unique()

array([10,  7,  6,  8,  9,  5,  4], dtype=int64)

### Dropped the `Experience` column because `Average_experience` replaces it

In [56]:
df = df.drop("Experience",axis=1)

### Save changes

In [58]:
df = save_changes(df)

In [59]:
df.head()

Unnamed: 0,Qualifications,location,Country,Work Type,Company Size,Job Posting Date,Job Title,Role,Job Portal,Job Description,skills,Responsibilities,Company,Min_Salary,Max_Salary,Average_Salary,Salary,Bonuses and Incentive Programs,Casual Dress Code,Childcare Assistance,Employee Assistance Programs (EAP),Employee Discounts,Employee Recognition Programs,Employee Referral Programs,Financial Counseling,Flexible Spending Accounts (FSAs),Flexible Work Arrangements,Health Insurance,Health and Wellness Facilities,Legal Assistance,Life and Disability Insurance,Paid Time Off (PTO),Parental Leave,Professional Development,Profit-Sharing,Relocation Assistance,Retirement Plans,Social and Recreational Activities,Stock Options or Equity Grants,Transportation Benefits,Tuition Reimbursement,Wellness Programs,Average_experience
0,M.Tech,Douglas,Isle of Man,Intern,26801,2022-04-24,Digital Marketing Specialist,Social Media Manager,Snagajob,"Social Media Managers oversee an organizations social media presence. They create and schedule content, engage with followers, and analyze social media metrics to drive brand awareness and engagement.","Social media platforms (e.g., Facebook, Twitter, Instagram) Content creation and scheduling Social media analytics and insights Community engagement Paid social advertising","Manage and grow social media accounts, create engaging content, and interact with the online community. Develop social media content calendars and strategies. Monitor social media trends and engagement metrics.",Icahn Enterprises,253700.0,425700.0,339700.0,1,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,10
1,BCA,Ashgabat,Turkmenistan,Intern,100340,2022-12-19,Web Developer,Frontend Web Developer,Idealist,"Frontend Web Developers design and implement user interfaces for websites, ensuring they are visually appealing and user-friendly. They collaborate with designers and backend developers to create seamless web experiences for users.","HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)","Design and code user interfaces for websites, ensuring a seamless and visually appealing user experience. Collaborate with UX designers to optimize user journeys. Ensure cross-browser compatibility and responsive design.",PNC Financial Services Group,240800.0,498800.0,369800.0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,7
2,PhD,Macao,"Macao SAR, China",Temporary,84525,2022-09-14,Operations Manager,Quality Control Manager,Jobs2Careers,"Quality Control Managers establish and enforce quality standards within an organization. They develop quality control processes, perform inspections, and implement corrective actions to maintain product or service quality.","Quality control processes and methodologies Statistical process control (SPC) Root cause analysis and corrective action Quality management systems (e.g., ISO 9001) Compliance and regulatory knowledge",Establish and enforce quality control standards and procedures. Conduct quality audits and inspections. Collaborate with production teams to address quality issues and implement improvements.,United Services Automobile Assn.,262300.0,447200.0,354750.0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,6
3,PhD,Porto-Novo,Benin,Full-Time,129896,2023-02-25,Network Engineer,Wireless Network Engineer,FlexJobs,"Wireless Network Engineers design, implement, and maintain wireless network solutions. They optimize wireless connectivity, troubleshoot issues, and ensure reliable and secure wireless communications.",Wireless network design and architecture Wi-Fi standards and protocols RF (Radio Frequency) planning and optimization Wireless security protocols Troubleshooting wireless network issues,"Design, configure, and optimize wireless networks, ensuring reliable and secure wireless connectivity. Troubleshoot wireless network issues. Plan and implement wireless network upgrades.",Hess,279500.0,391300.0,335400.0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,7
4,MBA,Santiago,Chile,Intern,53944,2022-10-11,Event Manager,Conference Manager,Jobs2Careers,"A Conference Manager coordinates and manages conferences, meetings, and events. They plan logistics, handle budgeting, liaise with vendors, and ensure the smooth execution of events, catering to the needs and expectations of attendees.",Event planning Conference logistics Budget management Vendor coordination Marketing and promotion Client relations,"Specialize in conference and convention planning. Coordinate speaker sessions, exhibitors, and attendee experiences. Oversee event registration and marketing.",Cairn Energy,275200.0,374100.0,324650.0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,6


## Analyzing Job Title column

### Number of unique values of job titles:

In [62]:
df['Job Title'].nunique()

147

### Unique values for job titles:

In [64]:
df['Job Title'].unique()

array(['Digital Marketing Specialist', 'Web Developer',
       'Operations Manager', 'Network Engineer', 'Event Manager',
       'Software Tester', 'Teacher', 'UX/UI Designer', 'Wedding Planner',
       'QA Analyst', 'Litigation Attorney', 'Mechanical Engineer',
       'Network Administrator', 'Account Manager', 'Brand Manager',
       'Social Worker', 'Social Media Coordinator',
       'Email Marketing Specialist', 'HR Generalist', 'Legal Assistant',
       'Nurse Practitioner', 'Account Director', 'Software Engineer',
       'Purchasing Agent', 'Sales Consultant', 'Civil Engineer',
       'Network Security Specialist', 'UI Developer', 'Financial Planner',
       'Event Planner', 'Psychologist', 'Electrical Designer',
       'Data Analyst', 'Technical Writer', 'Tax Consultant',
       'Account Executive', 'Systems Administrator',
       'Database Administrator', 'Research Analyst', 'Data Entry Clerk',
       'Registered Nurse', 'Investment Analyst', 'Speech Therapist',
       'Sales M

### The `Job Title` column contains 147 unique values. To reduce dimensionality and simplify modeling, I grouped them into 12 broader categories by embedding the titles with a sentence transformer and clustering the embeddings with KMeans

In [66]:
job_titles = df['Job Title'].unique().tolist()

In [67]:
# Initialize the Sentence Transformer model with a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Convert job titles into numerical vector representations (embeddings)
embeddings = model.encode(job_titles)

# Create a K-means clustering model with 12 clusters and a fixed random seed for reproducibility
kmeans = KMeans(n_clusters=12, random_state=42)
# Fit the model to the embeddings and get cluster assignments for each job title
labels = kmeans.fit_predict(embeddings)
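
The number of clusters is a free parameter here. One common way to compare candidate values is the mean silhouette score. The block below is a minimal sketch on synthetic 2-D points, not on the real title embeddings. `best_k_by_silhouette` and `toy_vectors` are hypothetical names introduced for illustration, and this is not necessarily how the value 12 was chosen above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(vectors, candidate_ks, seed=42):
    """Return (best_k, scores), where scores maps each k to its mean silhouette score."""
    scores = {}
    for k in candidate_ks:
        # n_init is set explicitly so the result does not depend on the sklearn default
        km = KMeans(n_clusters=k, random_state=seed, n_init=10)
        cluster_labels = km.fit_predict(vectors)
        scores[k] = silhouette_score(vectors, cluster_labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy stand-in for the embeddings: three well-separated 2-D blobs (assumption)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_vectors = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(toy_vectors, range(2, 7))
print(best_k)
```

On the real data the same function could be called with `embeddings` and a wider range of candidates. The score is only a heuristic, so the resulting clusters should still be inspected by hand.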

### Map each job title to its cluster label

In [69]:
title_to_cluster = dict(zip(job_titles, labels))

### Add a new column representing the broad categories

In [71]:
df['Job Area'] = df['Job Title'].map(title_to_cluster)
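
Cluster ids are arbitrary integers, so it can help to look at which titles ended up together before relying on `Job Area`. The sketch below inverts a title-to-cluster mapping into a cluster-to-titles mapping. `titles_by_cluster` and `toy_mapping` are hypothetical names, and the toy mapping is only an illustrative subset, not the full clustering result.

```python
from collections import defaultdict


def titles_by_cluster(title_to_cluster):
    """Invert {title: cluster_id} into {cluster_id: sorted list of titles}."""
    grouped = defaultdict(list)
    for title, cluster_id in title_to_cluster.items():
        # int() turns NumPy integer labels into plain Python ints
        grouped[int(cluster_id)].append(title)
    return {cid: sorted(titles) for cid, titles in sorted(grouped.items())}


# Illustrative subset of a mapping (assumption, not the full result)
toy_mapping = {
    "Web Developer": 10,
    "Network Engineer": 10,
    "Event Manager": 3,
    "Digital Marketing Specialist": 3,
}

for cid, titles in titles_by_cluster(toy_mapping).items():
    print(cid, titles)
```

Passing the real `title_to_cluster` dictionary instead of `toy_mapping` would list all 147 titles by cluster.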

In [72]:
df.head()

Unnamed: 0,Qualifications,location,Country,Work Type,Company Size,Job Posting Date,Job Title,Role,Job Portal,Job Description,skills,Responsibilities,Company,Min_Salary,Max_Salary,Average_Salary,Salary,Bonuses and Incentive Programs,Casual Dress Code,Childcare Assistance,Employee Assistance Programs (EAP),Employee Discounts,Employee Recognition Programs,Employee Referral Programs,Financial Counseling,Flexible Spending Accounts (FSAs),Flexible Work Arrangements,Health Insurance,Health and Wellness Facilities,Legal Assistance,Life and Disability Insurance,Paid Time Off (PTO),Parental Leave,Professional Development,Profit-Sharing,Relocation Assistance,Retirement Plans,Social and Recreational Activities,Stock Options or Equity Grants,Transportation Benefits,Tuition Reimbursement,Wellness Programs,Average_experience,Job Area
0,M.Tech,Douglas,Isle of Man,Intern,26801,2022-04-24,Digital Marketing Specialist,Social Media Manager,Snagajob,"Social Media Managers oversee an organizations social media presence. They create and schedule content, engage with followers, and analyze social media metrics to drive brand awareness and engagement.","Social media platforms (e.g., Facebook, Twitter, Instagram) Content creation and scheduling Social media analytics and insights Community engagement Paid social advertising","Manage and grow social media accounts, create engaging content, and interact with the online community. Develop social media content calendars and strategies. Monitor social media trends and engagement metrics.",Icahn Enterprises,253700.0,425700.0,339700.0,1,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,10,3
1,BCA,Ashgabat,Turkmenistan,Intern,100340,2022-12-19,Web Developer,Frontend Web Developer,Idealist,"Frontend Web Developers design and implement user interfaces for websites, ensuring they are visually appealing and user-friendly. They collaborate with designers and backend developers to create seamless web experiences for users.","HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX)","Design and code user interfaces for websites, ensuring a seamless and visually appealing user experience. Collaborate with UX designers to optimize user journeys. Ensure cross-browser compatibility and responsive design.",PNC Financial Services Group,240800.0,498800.0,369800.0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,7,10
2,PhD,Macao,"Macao SAR, China",Temporary,84525,2022-09-14,Operations Manager,Quality Control Manager,Jobs2Careers,"Quality Control Managers establish and enforce quality standards within an organization. They develop quality control processes, perform inspections, and implement corrective actions to maintain product or service quality.","Quality control processes and methodologies Statistical process control (SPC) Root cause analysis and corrective action Quality management systems (e.g., ISO 9001) Compliance and regulatory knowledge",Establish and enforce quality control standards and procedures. Conduct quality audits and inspections. Collaborate with production teams to address quality issues and implement improvements.,United Services Automobile Assn.,262300.0,447200.0,354750.0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,6,3
3,PhD,Porto-Novo,Benin,Full-Time,129896,2023-02-25,Network Engineer,Wireless Network Engineer,FlexJobs,"Wireless Network Engineers design, implement, and maintain wireless network solutions. They optimize wireless connectivity, troubleshoot issues, and ensure reliable and secure wireless communications.",Wireless network design and architecture Wi-Fi standards and protocols RF (Radio Frequency) planning and optimization Wireless security protocols Troubleshooting wireless network issues,"Design, configure, and optimize wireless networks, ensuring reliable and secure wireless connectivity. Troubleshoot wireless network issues. Plan and implement wireless network upgrades.",Hess,279500.0,391300.0,335400.0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,7,10
4,MBA,Santiago,Chile,Intern,53944,2022-10-11,Event Manager,Conference Manager,Jobs2Careers,"A Conference Manager coordinates and manages conferences, meetings, and events. They plan logistics, handle budgeting, liaise with vendors, and ensure the smooth execution of events, catering to the needs and expectations of attendees.",Event planning Conference logistics Budget management Vendor coordination Marketing and promotion Client relations,"Specialize in conference and convention planning. Coordinate speaker sessions, exhibitors, and attendee experiences. Oversee event registration and marketing.",Cairn Energy,275200.0,374100.0,324650.0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,6,3


### Save changes

In [74]:
df = save_changes(df)

## Analyzing Roles column

In [76]:
roles = df['Role'].unique().tolist()
roles[:20]

['Social Media Manager',
 'Frontend Web Developer',
 'Quality Control Manager',
 'Wireless Network Engineer',
 'Conference Manager',
 'Quality Assurance Analyst',
 'Classroom Teacher',
 'User Interface Designer',
 'Interaction Designer',
 'Wedding Consultant',
 'Performance Testing Specialist',
 'Family Law Attorney',
 'Mechanical Design Engineer',
 'Network Security Analyst',
 'Sales Account Manager',
 'Product Brand Manager',
 'School Social Worker',
 'Content Creator',
 'Deliverability Analyst',
 'HR Coordinator']

In [77]:
print(len(roles))  # number of unique values for roles

376


### The `Role` column contains 376 unique values, which is too many for effective modeling and analysis. To simplify, I applied the same embedding and clustering approach used for `Job Title`, this time grouping the roles into 30 categories

In [79]:
roles_emb = model.encode(roles)
kmeans_roles = KMeans(n_clusters=30, random_state=42)
labels_roles = kmeans_roles.fit_predict(roles_emb)

In [80]:
role_clusters = dict(zip(roles, labels_roles))

### Add a new column representing the broad categories

In [82]:
df["Role cluster"] = df['Role'].map(role_clusters)

### Sample of 10 job titles and the role clusters they map to

In [84]:
grouped = df.groupby("Job Title")["Role cluster"].unique()

# Use a distinct loop variable so the `roles` list defined earlier is not overwritten
for job, clusters in list(grouped.items())[:10]:
    print(f"Job Title: {job}")
    print(f"Role clusters: {clusters}")
    print()

Job Title: Account Director
Role clusters: [25 15]

Job Title: Account Executive
Role clusters: [25]

Job Title: Account Manager
Role clusters: [25 23]

Job Title: Accountant
Role clusters: [7]

Job Title: Administrative Assistant
Role clusters: [10 17]

Job Title: Aerospace Engineer
Role clusters: [21]

Job Title: Architect
Role clusters: [15 24  2]

Job Title: Architectural Designer
Role clusters: [ 2 15]

Job Title: Art Director
Role clusters: [ 2 22]

Job Title: Art Teacher
Role clusters: [18  2]



### Dropped the `Job Title` and `Role` columns because they are no longer needed

In [None]:
# drop() returns a new DataFrame, so the result must be assigned back
df = df.drop(columns=["Job Title", "Role"])

### Save changes

In [None]:
df = save_changes(df)

## Analyzing Company size column

In [None]:
print(df["Company Size"].nunique())

In [None]:
print(df["Company Size"].unique())

### As with the Salary column, I converted the `Company Size` variable into an ordinal categorical feature using percentile-based binning:

- `Small` – companies with a size below the 33.33rd percentile
- `Medium` – companies with a size between the 33.33rd and 66.67th percentiles
- `Large` – companies with a size above the 66.67th percentile

This transformation reduces company size to three interpretable groups, which may make relationships with salary or other features easier to capture.
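
The binning rule above can be sketched as a small function. This is only an assumption about what `company_aggregation` might look like, since its source is not reproduced here. The name `company_aggregation_sketch`, the parameter names, and the handling of values that fall exactly on a threshold are all hypothetical.

```python
def company_aggregation_sketch(size, low, high):
    """Bucket a numeric company size using two thresholds.

    Boundary handling is an assumption: values equal to `low` or `high`
    are placed in "Medium".
    """
    if size < low:
        return "Small"
    if size <= high:
        return "Medium"
    return "Large"


# Hypothetical thresholds, used only to show the behaviour
low, high = 50_000, 100_000
for size in (26_801, 84_525, 129_896):
    print(size, company_aggregation_sketch(size, low, high))
```

The argument order (value first, then the two thresholds) matches how a function is called by `Series.apply(func, args=(low, high))`.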

In [None]:
print(inspect.getsource(company_aggregation))

In [None]:
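# Thresholds at the 33.33rd and 66.67th percentiles of the raw company sizes
low_33, high_67 = np.percentile(df['Company Size'], [33.33, 66.67])
# Note: the new column 'Company size' differs from the raw 'Company Size' only by the lowercase "s"
df['Company size'] = df['Company Size'].apply(company_aggregation, args=(low_33, high_67))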

### First 10 company sizes and their assigned categories

In [None]:
df[["Company Size","Company size"]].head(10)

### Save changes

In [None]:
df = save_changes(df)

## Analyzing Qualifications column

In [None]:
print(df["Qualifications"].nunique())

In [None]:
df["Qualifications"].unique()

### Since none of these values duplicate one another, I will keep all the degree values as they are.