# **Project Title: Make Job Hunting Easier: LinkedIn Job Recommendation Based on Applicants' Description**

Team members: Shaoying Zheng, Zhongrui Ning, Xiao Pu


# **Overview**

Our project aims to match applicants with the most suitable jobs on LinkedIn based on the information they provide, including their education, skills, preferred industry, and expected salary. Ideally, the model we build will be adaptive: it should still provide suitable job matches when some of these details are missing.

# **Motivation**

The job-hunting process is daunting for college students, especially in today's rapidly evolving labor market, where new graduates face an overwhelming number of job postings. When browsing job application websites or apps like LinkedIn, many applicants spend a considerable amount of time filtering out irrelevant or unsuitable jobs, which leads to inefficiency and frustration. A data-driven recommendation system that tailors job suggestions to each individual's profile would therefore make job hunting more efficient and provide real value to job seekers.

Here are several specific questions we aim to explore:
1. What are the most common skills listed in job postings across various industries?

    What we hope to learn: By identifying the most frequently mentioned skills, we hope to find "universal" skills that are in demand across industries.

2. How do job requirements vary across different industries?

    What we hope to learn: We hope to identify the unique skills and qualifications required in different industries, which can help job seekers better understand the job market and make informed decisions.

3. How can job hunters with different backgrounds find suitable jobs?
  
    What we hope to learn: We hope to build a model that can provide job recommendations based on the applicant's background information, such as education, skills, and industry preference. Also, we hope to explore how the model can adapt to incomplete information.





# **Data Sources**

Source: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)
- A snapshot of the current job market, including company, job, and mapping datasets.

This data source contains a nearly comprehensive record of 124,000+ job postings listed in 2023 and 2024. Each posting carries dozens of attributes for both postings and companies, including the title, job description, salary, location, application URL, and work type (remote, contract, etc.), in addition to separate files containing the benefits, skills, and industries associated with each posting.


We use 9 tables for this project, 8 taken directly from the original data source and 1 extracted from it:

1. `companies.csv`: Basic information about each company
2. `company_industries.csv`: Industries that each company focuses on
3. `employee_counts.csv`: Number of employees and LinkedIn followers of each company
4. `industries.csv`: Industry IDs and their descriptions
5. `skills.csv`: Full names and abbreviations of job skills
6. `job_skills.csv`: Skills required by each posted job
7. `job_industries.csv`: Industries each posted job belongs to
8. `salaries.csv`: Salary information for the posted jobs
9. `job_company_id.csv`: Links each posted job to its company (extracted from `postings.csv`)


# **Data description**

Here is how the data is structured and which columns we used for merging.

![SI_618_proj_ERD.png](attachment:SI_618_proj_ERD.png)


# **Data Manipulation**

## Steps:
**1. Merge dataframes**

**2. Handle missing values**

**3. Standardize format**

**4. Create new columns**

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [52]:
# import the csv files and save as pd dataframes
companies = pd.read_csv('./Projectdata/company/companies.csv',usecols=['company_id','name','company_size','country'])
company_industries = pd.read_csv('./Projectdata/company/company_industries.csv', usecols=['company_id','industry'])
employee_counts = pd.read_csv('./Projectdata/company/employee_counts.csv', usecols=['company_id','employee_count', 'follower_count'])
skills = pd.read_csv('./Projectdata/mappings/skills.csv', usecols=['skill_abr','skill_name'] )
industries = pd.read_csv('./Projectdata/mappings/industries.csv', usecols=['industry_id','industry_name'])
job_skills = pd.read_csv('./Projectdata/jobs/job_skills.csv', usecols=['job_id','skill_abr'])
job_industries = pd.read_csv('./Projectdata/jobs/job_industries.csv', usecols=['job_id','industry_id'])
salaries = pd.read_csv('./Projectdata/jobs/salaries.csv', usecols=['job_id','max_salary','min_salary','med_salary','pay_period'])
job_company_pair = pd.read_csv('./job_company_id.csv', usecols=['job_id','company_id'])

In [53]:
# Because the original postings data is too large, we only keep a subset of it (the job_id and company_id columns)

# job_company_id = pd.read_csv('./Projectdata/postings.csv', usecols=['job_id','company_id'])
# job_company_id.to_csv('./Projectdata/job_company_id.csv', index=False)
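
If the full postings file is too large to load comfortably, one option is to stream it in chunks while keeping only the two key columns. The following is a minimal sketch on an in-memory stand-in for `postings.csv`; the sample rows, the chunk size, and the decision to drop postings without a `company_id` are assumptions for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for postings.csv (illustrative rows only)
csv_text = (
    "job_id,company_id,title,description\n"
    "1,10,Analyst,long text\n"
    "2,,Engineer,long text\n"
    "3,30,Designer,long text\n"
)

# Read only the key columns, two rows at a time, then stitch the chunks together
chunks = pd.read_csv(io.StringIO(csv_text),
                     usecols=['job_id', 'company_id'],
                     chunksize=2)
pairs = pd.concat(chunks, ignore_index=True)

# Assumption: postings without a company cannot be joined later, so drop them;
# with no missing values left, company_id can be stored as an integer again
pairs = pairs.dropna(subset=['company_id']).astype({'company_id': 'int64'})
print(pairs)
```

Casting `company_id` back to an integer keeps the key readable in later outputs, since a column with missing values is otherwise stored as floats.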

In [54]:
# Make sure the ID columns contain no null values in any table
# (this checks for nulls only; it does not check that keys are unique)
for table in [companies, company_industries, employee_counts, skills, industries, job_skills, job_industries, salaries]:
    id_fields = [col for col in table.columns if col.endswith('_id')]
    for field in id_fields:
        assert table[field].isnull().sum() == 0, f'null values found in {field}'
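
The check above covers null IDs only. For the lookup tables, where each key should appear once (for example `skill_abr` in `skills` and `industry_id` in `industries`), uniqueness could be checked separately. Below is a minimal sketch on toy data; the helper name `find_duplicate_keys` and the sample rows are hypothetical.

```python
import pandas as pd

def find_duplicate_keys(table: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose value in `key` occurs more than once."""
    return table[table[key].duplicated(keep=False)]

# Toy stand-ins for the lookup tables (illustrative values only)
skills_demo = pd.DataFrame({
    'skill_abr': ['MRKT', 'PR', 'WRT'],
    'skill_name': ['Marketing', 'Public Relations', 'Writing/Editing'],
})
industries_demo = pd.DataFrame({
    'industry_id': [82, 48, 48],
    'industry_name': ['Name A', 'Name B', 'Name B (repeated key)'],
})

dup_skills = find_duplicate_keys(skills_demo, 'skill_abr')
dup_industries = find_duplicate_keys(industries_demo, 'industry_id')
print(len(dup_skills), len(dup_industries))
```

Tables such as `job_skills` legitimately repeat `job_id` (one row per skill), so this kind of check only makes sense for the mapping tables.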

In [55]:
# First, merge all dataframes related to jobs
job_skills = job_skills.merge(skills, on='skill_abr')

In [56]:
# Merge job_skills with the other job-related dataframes
job_condition = job_skills.merge(job_industries, on='job_id')
job_condition


| | job_id | skill_abr | skill_name | industry_id |
|---|---|---|---|---|
| 0 | 3884428798 | MRKT | Marketing | 82 |
| 1 | 3884428798 | PR | Public Relations | 82 |
| 2 | 3884428798 | WRT | Writing/Editing | 82 |
| 3 | 3887473071 | SALE | Sales | 48 |
| 4 | 3887465684 | FIN | Finance | 41 |
| ... | ... | ... | ... | ... |
| 286880 | 3902876855 | HR | Human Resources | 80 |
| 286881 | 3902878689 | MGMT | Management | 116 |
| 286882 | 3902878689 | MNFC | Manufacturing | 116 |
| 286883 | 3902883233 | SALE | Sales | 44 |


In [57]:
job_condition = job_condition.merge(salaries, on='job_id')
job_condition

| | job_id | skill_abr | skill_name | industry_id | max_salary | med_salary | min_salary | pay_period |
|---|---|---|---|---|---|---|---|---|
| 0 | 3884428798 | MRKT | Marketing | 82 | NaN | 20.0 | NaN | HOURLY |
| 1 | 3884428798 | PR | Public Relations | 82 | NaN | 20.0 | NaN | HOURLY |
| 2 | 3884428798 | WRT | Writing/Editing | 82 | NaN | 20.0 | NaN | HOURLY |
| 3 | 3887470552 | ADM | Administrative | 54 | 25.00 | NaN | 23.0 | HOURLY |
| 4 | 3884431523 | MGMT | Management | 56 | 120000.00 | NaN | 100000.0 | YEARLY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95083 | 3902883232 | ADM | Administrative | 104 | NaN | 25.0 | NaN | HOURLY |
| 95084 | 3902866633 | PROD | Production | 62 | 21.53 | NaN | 21.1 | HOURLY |
| 95085 | 3902879720 | ACCT | Accounting/Auditing | 27 | 125000.00 | NaN | 100000.0 | YEARLY |
| 95086 | 3902878689 | MGMT | Management | 116 | 85862.00 | NaN | 63601.0 | YEARLY |
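
Merging with `salaries` shrinks the table from 286,884 to 95,087 rows, because `merge` defaults to an inner join and drops every job without a salary record. A minimal sketch of how the `indicator` option of `merge` could be used to measure that loss before committing to it is shown below; the toy frames are illustrative only.

```python
import pandas as pd

# Toy stand-ins: job 2 has no salary record
jobs_demo = pd.DataFrame({
    'job_id': [1, 1, 2, 3],
    'skill_abr': ['MRKT', 'PR', 'SALE', 'FIN'],
})
salaries_demo = pd.DataFrame({
    'job_id': [1, 3],
    'max_salary': [25.0, 120000.0],
    'pay_period': ['HOURLY', 'YEARLY'],
})

# A left join with indicator=True labels each row as 'both' or 'left_only'
checked = jobs_demo.merge(salaries_demo, on='job_id', how='left', indicator=True)
match_counts = checked['_merge'].value_counts()
print(match_counts)

# Keeping only the matched rows reproduces the inner-join result
kept = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```

Whether to keep the unmatched jobs depends on the question: salary is needed for salary-based matching, but skill and industry analyses could use the full set.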


In [58]:
job_condition = job_condition.merge(industries, on='industry_id')
job_condition

| | job_id | skill_abr | skill_name | industry_id | max_salary | med_salary | min_salary | pay_period | industry_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3884428798 | MRKT | Marketing | 82 | NaN | 20.0 | NaN | HOURLY | Book and Periodical Publishing |
| 1 | 3884428798 | PR | Public Relations | 82 | NaN | 20.0 | NaN | HOURLY | Book and Periodical Publishing |
| 2 | 3884428798 | WRT | Writing/Editing | 82 | NaN | 20.0 | NaN | HOURLY | Book and Periodical Publishing |
| 3 | 3887470552 | ADM | Administrative | 54 | 25.00 | NaN | 23.0 | HOURLY | Chemical Manufacturing |
| 4 | 3884431523 | MGMT | Management | 56 | 120000.00 | NaN | 100000.0 | YEARLY | Mining |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95083 | 3902883232 | ADM | Administrative | 104 | NaN | 25.0 | NaN | HOURLY | Staffing and Recruiting |
| 95084 | 3902866633 | PROD | Production | 62 | 21.53 | NaN | 21.1 | HOURLY | Railroad Equipment Manufacturing |
| 95085 | 3902879720 | ACCT | Accounting/Auditing | 27 | 125000.00 | NaN | 100000.0 | YEARLY | Retail |
| 95086 | 3902878689 | MGMT | Management | 116 | 85862.00 | NaN | 63601.0 | YEARLY | Transportation, Logistics, Supply Chain and St... |


In [59]:
# merge company dataframes
companies = companies.merge(company_industries, on='company_id')


In [60]:
companies_condition = companies.merge(employee_counts, on='company_id')

In [61]:
companies_condition = companies_condition.merge(industries, left_on='industry', right_on='industry_name')


In [62]:
companies_condition.drop(columns=['industry'], inplace=True)


In [63]:
companies_condition

| | company_id | name | company_size | country | employee_count | follower_count | industry_id | industry_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1009 | IBM | 7.0 | US | 314102 | 16253625 | 96 | IT Services and IT Consulting |
| 1 | 1009 | IBM | 7.0 | US | 313142 | 16309464 | 96 | IT Services and IT Consulting |
| 2 | 1009 | IBM | 7.0 | US | 313147 | 16309985 | 96 | IT Services and IT Consulting |
| 3 | 1009 | IBM | 7.0 | US | 311223 | 16314846 | 96 | IT Services and IT Consulting |
| 4 | 1016 | GE HealthCare | 7.0 | US | 56873 | 2185368 | 14 | Hospitals and Health Care |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35701 | 103463217 | JRC Services | 2.0 | 0 | 0 | 21 | 122 | Facilities Services |
| 35702 | 103466352 | Centent Consulting LLC | NaN | 0 | 0 | 0 | 11 | Business Consulting and Services |
| 35703 | 103467540 | Kings and Queens Productions, LLC | NaN | 0 | 0 | 12 | 36 | Broadcast Media Production and Distribution |
| 35704 | 103468936 | WebUnite | NaN | US | 0 | 1 | 11 | Business Consulting and Services |
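
In this output IBM appears four times with slightly different employee and follower counts, which suggests that `employee_counts` holds several snapshots per company. Left as is, every snapshot multiplies the rows of any job joined to that company. A minimal sketch of keeping one snapshot per company follows; it assumes a timestamp column named `time_recorded`, which the notebook does not load, and uses toy values.

```python
import pandas as pd

# Toy stand-in; `time_recorded` is an assumed column, not loaded in the notebook
employee_counts_demo = pd.DataFrame({
    'company_id': [1009, 1009, 1009, 1016],
    'employee_count': [314102, 313142, 311223, 56873],
    'follower_count': [16253625, 16309464, 16314846, 2185368],
    'time_recorded': [1, 2, 3, 1],
})

# Sort by time and keep only the most recent snapshot of each company
latest_counts = (
    employee_counts_demo
    .sort_values('time_recorded')
    .drop_duplicates(subset='company_id', keep='last')
    .drop(columns='time_recorded')
    .reset_index(drop=True)
)
print(latest_counts)
```

If no timestamp is available, aggregating per company (for example with a mean or maximum) would be an alternative way to reach one row per `company_id`.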


In [64]:
#join job and company dataframes using job_company_pair
job_company_pair = job_company_pair.merge(job_condition, on='job_id',how='inner')
job_company_pair = job_company_pair.merge(companies_condition, on='company_id',how='inner')
job_company_pair

| | job_id | company_id | skill_abr | skill_name | industry_id_x | max_salary | med_salary | min_salary | pay_period | industry_name_x | name | company_size | country | employee_count | follower_count | industry_id_y | industry_name_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 921716 | 2774458.0 | MRKT | Marketing | 44 | 20.0 | NaN | 17.0 | HOURLY | Real Estate | Corcoran Sawyer Smith | 2.0 | US | 402 | 2351 | 44 | Real Estate |
| 1 | 921716 | 2774458.0 | SALE | Sales | 44 | 20.0 | NaN | 17.0 | HOURLY | Real Estate | Corcoran Sawyer Smith | 2.0 | US | 402 | 2351 | 44 | Real Estate |
| 2 | 10998357 | 64896719.0 | MGMT | Management | 32 | 65000.0 | NaN | 45000.0 | YEARLY | Restaurants | The National Exemplar | 1.0 | US | 15 | 40 | 32 | Restaurants |
| 3 | 10998357 | 64896719.0 | MNFC | Manufacturing | 32 | 65000.0 | NaN | 45000.0 | YEARLY | Restaurants | The National Exemplar | 1.0 | US | 15 | 40 | 32 | Restaurants |
| 4 | 23221523 | 766262.0 | OTHR | Other | 9 | 175000.0 | NaN | 140000.0 | YEARLY | Law Practice | Abrams Fensterman, LLP | 2.0 | US | 222 | 2427 | 9 | Law Practice |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307890 | 3906267117 | 56120.0 | BD | Business Development | 9 | 195000.0 | NaN | 120000.0 | YEARLY | Law Practice | Lozano Smith | 2.0 | US | 185 | 2818 | 9 | Law Practice |
| 307891 | 3906267224 | 43325.0 | MRKT | Marketing | 25 | 75000.0 | NaN | 70000.0 | YEARLY | Manufacturing | Solugenix | 5.0 | US | 862 | 79661 | 96 | IT Services and IT Consulting |
| 307892 | 3906267224 | 43325.0 | MRKT | Marketing | 25 | 75000.0 | NaN | 70000.0 | YEARLY | Manufacturing | Solugenix | 5.0 | US | 875 | 81300 | 96 | IT Services and IT Consulting |
| 307893 | 3906267224 | 43325.0 | MRKT | Marketing | 25 | 75000.0 | NaN | 70000.0 | YEARLY | Manufacturing | Solugenix | 5.0 | US | 874 | 81918 | 96 | IT Services and IT Consulting |
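
Both sides of this merge carry `industry_id` and `industry_name`, so pandas appends its default `_x` and `_y` suffixes. Here `_x` is the industry of the job posting and `_y` is the industry of the company, and the two can differ (the Solugenix rows show Manufacturing versus IT Services and IT Consulting). A minimal sketch of naming the two sides explicitly with the `suffixes` argument of `merge` is shown below; the suffix names are an illustrative choice.

```python
import pandas as pd

# One job and its company, taken from the output above
job_demo = pd.DataFrame({
    'job_id': [3906267224],
    'company_id': [43325],
    'industry_id': [25],
    'industry_name': ['Manufacturing'],
})
company_demo = pd.DataFrame({
    'company_id': [43325],
    'name': ['Solugenix'],
    'industry_id': [96],
    'industry_name': ['IT Services and IT Consulting'],
})

# Explicit suffixes make it clear which industry belongs to which side
merged_demo = job_demo.merge(company_demo, on='company_id', how='inner',
                             suffixes=('_job', '_company'))
print(merged_demo.columns.tolist())
```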


In [65]:
# check the data
job_company_pair.isnull().sum()

job_id                  0
company_id              0
skill_abr               0
skill_name              0
industry_id_x           0
max_salary          48150
med_salary         259745
min_salary          48150
pay_period              0
industry_name_x        41
name                    0
company_size         5532
country                 0
employee_count          0
follower_count          0
industry_id_y           0
industry_name_y         0
dtype: int64
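
Raw counts are hard to judge without the table size, so it can help to express them as shares of rows and to check how many rows have no salary figure at all. The sketch below uses a toy frame with the same salary columns; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in: each row has either med_salary or a min/max range
salary_demo = pd.DataFrame({
    'max_salary': [25.0, np.nan, 120000.0, np.nan],
    'med_salary': [np.nan, 20.0, np.nan, 25.0],
    'min_salary': [23.0, np.nan, 100000.0, np.nan],
    'pay_period': ['HOURLY', 'HOURLY', 'YEARLY', 'HOURLY'],
})

# Share of missing values per column, largest first
missing_share = salary_demo.isnull().mean().sort_values(ascending=False)
print(missing_share)

# True for rows that have at least one usable salary figure
salary_cols = ['max_salary', 'med_salary', 'min_salary']
has_any_salary = salary_demo[salary_cols].notnull().any(axis=1)
print(has_any_salary.sum(), 'of', len(salary_demo), 'rows have a salary figure')
```

In the real table the missing counts of `max_salary` and `med_salary` add up to roughly the total number of rows, which hints that most postings report either a range or a single figure rather than both.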

### Handle the missing values

In [71]:
# Missing values are frequent, so each column gets its own placeholder.
# Assigning the result back, instead of calling fillna(..., inplace=True) on a
# selected column, avoids the pandas FutureWarning about chained assignment.
job_company_pair['company_size'] = job_company_pair['company_size'].fillna('unknown')
# Note: filling with the string 'null' turns these numeric columns into object dtype
for col in ['max_salary', 'min_salary', 'med_salary']:
    job_company_pair[col] = job_company_pair[col].fillna('null')
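
For the "standardize format" and "create new columns" steps, one possible approach is to keep the salary columns numeric and derive a single yearly figure per row. The sketch below is an assumption-laden illustration, not the notebook's method: the helper name `add_yearly_salary`, the new column names, the 40-hour week, and the `pay_period` values other than `HOURLY` and `YEARLY` are all assumed. It expects missing salaries as NaN, so it would run before the string fill above.

```python
import numpy as np
import pandas as pd

# Assumed conversion factors (40-hour weeks, 52 weeks per year)
PERIODS_PER_YEAR = {'HOURLY': 40 * 52, 'WEEKLY': 52, 'BIWEEKLY': 26,
                    'MONTHLY': 12, 'YEARLY': 1}

def add_yearly_salary(df: pd.DataFrame) -> pd.DataFrame:
    """Add `salary` (median, else range midpoint) and `yearly_salary` columns."""
    out = df.copy()
    # Row-wise mean of min and max; NaN only when both are missing
    midpoint = out[['min_salary', 'max_salary']].mean(axis=1)
    out['salary'] = out['med_salary'].fillna(midpoint)
    # Unknown pay periods map to NaN instead of raising an error
    out['yearly_salary'] = out['salary'] * out['pay_period'].map(PERIODS_PER_YEAR)
    return out

salary_demo = pd.DataFrame({
    'max_salary': [25.0, np.nan, 120000.0],
    'med_salary': [np.nan, 20.0, np.nan],
    'min_salary': [23.0, np.nan, 100000.0],
    'pay_period': ['HOURLY', 'HOURLY', 'YEARLY'],
})
yearly_demo = add_yearly_salary(salary_demo)
print(yearly_demo[['salary', 'yearly_salary']])
```

A single numeric yearly figure makes hourly and yearly postings comparable, which matters if expected salary is used as a matching feature.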

# **Data visualization**

In [67]:
# 1.

# **Reference**

https://www.kaggle.com/code/muhammadrifqimaruf/top10-recommendation-linkedin-job-posting

