# **Project Title: Make Job Hunting Easier: LinkedIn Job Recommendation Based on Applicants' Description**

Team members: Shaoying Zheng, Zhongrui Ning, Xiao Pu


# **Overview**

Our project aims to match applicants with the most suitable jobs on LinkedIn, based on the information they provide, including their education, skills, preferred industry, and expected salary. Ideally, the model we build will be adaptive: it should still provide suitable job matches when an applicant's details are incomplete.

# **Motivation**

The job-hunting process is daunting for college students, especially in today's rapidly evolving labor market, where new graduates face an overwhelming number of job postings. When browsing job application websites or apps like LinkedIn, many applicants spend a considerable amount of time filtering out irrelevant or unsuitable jobs, which leads to inefficiency and frustration. A data-driven recommendation system that makes job hunting more personalized and efficient, tailored to each individual's profile, would therefore be valuable to job seekers.

Here are several specific questions we aim to explore:
1. What are the most common skills listed in job postings across various industries?

    What we hope to learn: By identifying the most frequently mentioned skills, we hope to find the "universal" skills that employers across industries currently look for.

2. How do job requirements vary across different industries?

    What we hope to learn: We hope to identify the unique skills and qualifications required in different industries, which can help job seekers better understand the job market and make informed decisions.

3. How can job hunters with different backgrounds find suitable jobs?
  
    What we hope to learn: We hope to build a model that can provide job recommendations based on the applicant's background information, such as education, skills, and industry preference. Also, we hope to explore how the model can adapt to incomplete information.





# **Data Sources**

Source: LinkedIn Job Postings (2023 - 2024)
- A snapshot of the current job market, including company, job, and mapping datasets.
- https://www.kaggle.com/datasets/arshkon/linkedin-job-postings

This data source contains a nearly comprehensive record of 124,000+ job postings listed in 2023 and 2024. Each posting comes with dozens of attributes for both the posting and the company, including the title, job description, salary, location, application URL, and work types (remote, contract, etc.). Separate files contain the benefits, skills, and industries associated with each posting.


# **Data Description**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [7]:
# Load the CSV files into pandas DataFrames, keeping only the columns we need
companies = pd.read_csv('./Projectdata/company/companies.csv',usecols=['company_id','name','company_size','country'])
company_industries = pd.read_csv('./Projectdata/company/company_industries.csv', usecols=['company_id','industry'])
employee_counts = pd.read_csv('./Projectdata/company/employee_counts.csv', usecols=['company_id','employee_count', 'follower_count'])
skills = pd.read_csv('./Projectdata/mappings/skills.csv', usecols=['skill_abr','skill_name'])
industries = pd.read_csv('./Projectdata/mappings/industries.csv', usecols=['industry_id','industry_name'])
job_skills = pd.read_csv('./Projectdata/jobs/job_skills.csv', usecols=['job_id','skill_abr'])
job_industries = pd.read_csv('./Projectdata/jobs/job_industries.csv', usecols=['job_id','industry_id'])
salaries = pd.read_csv('./Projectdata/jobs/salaries.csv', usecols=['job_id','max_salary','min_salary','med_salary','pay_period'])

In [None]:
# The original postings file is too large to work with in full, so we only use a subset of it:
# the job_id and company_id columns, extracted once and saved to a smaller file.

# job_company_id = pd.read_csv('./Projectdata/postings.csv', usecols=['job_id','company_id'])
# job_company_id.to_csv('./Projectdata/job_company_id.csv', index=False)
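
If even the two-column read is too heavy for memory, the file can be read in pieces. The following is a minimal sketch, assuming `postings.csv` has `job_id` and `company_id` columns as in the commented code above; the in-memory sample file, the helper name `load_job_company_ids`, and the chunk size are hypothetical and only there to keep the example self-contained.

```python
import io
import pandas as pd

# Hypothetical stand-in for postings.csv; the real file has many more columns and rows.
sample_csv = io.StringIO(
    "job_id,company_id,title,description\n"
    "1,10,Analyst,long text\n"
    "2,11,Engineer,long text\n"
    "3,10,Designer,long text\n"
)

def load_job_company_ids(source, chunksize=2):
    """Read only job_id and company_id, a few rows at a time, then combine."""
    chunks = pd.read_csv(source, usecols=["job_id", "company_id"], chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)

job_company_id = load_job_company_ids(sample_csv)
```

With the real data, `source` would be the path to the postings file and `chunksize` would be much larger.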

Here is how the data is structured:
![image.png](SI_618_proj_ERD.png)

In [13]:
# Make sure ID columns (names ending in '_id') contain no nulls in any table.
# Note: key uniqueness is not checked here, and skills has no '_id' column.
for table in [companies, company_industries, employee_counts, skills, industries, job_skills, job_industries, salaries]:
    id_fields = [col for col in table.columns if col.endswith('_id')]
    for field in id_fields:
        assert table[field].notna().all(), f"null values found in {field}"
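
Key uniqueness can be checked separately from the null check. The sketch below is one possible way to do it, not part of the original notebook: the helper `find_duplicate_keys` is hypothetical, and the toy rows only imitate the shape of `job_skills`. It illustrates why a mapping table such as `job_skills` likely needs a composite key, since one job can list several skills.

```python
import pandas as pd

def find_duplicate_keys(table, key_cols):
    """Return the rows whose key_cols combination appears more than once."""
    return table[table.duplicated(subset=key_cols, keep=False)]

# Toy rows shaped like job_skills: job_id alone repeats,
# but the (job_id, skill_abr) pair should be unique.
toy_job_skills = pd.DataFrame({
    "job_id": [3884428798, 3884428798, 3887473071],
    "skill_abr": ["MRKT", "PR", "SALE"],
})

dupes_single = find_duplicate_keys(toy_job_skills, ["job_id"])
dupes_composite = find_duplicate_keys(toy_job_skills, ["job_id", "skill_abr"])
```

An empty result for the chosen key columns means that key is unique in the table.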

In [14]:
# First, merge the dataframes related to jobs, starting by attaching skill names to job_skills
job_skills = job_skills.merge(skills, on='skill_abr')

|  | job_id | skill_abr | skill_name |
|---|---|---|---|
| 0 | 3884428798 | MRKT | Marketing |
| 1 | 3884428798 | PR | Public Relations |
| 2 | 3884428798 | WRT | Writing/Editing |
| 3 | 3887473071 | SALE | Sales |
| 4 | 3887465684 | FIN | Finance |
| ... | ... | ... | ... |
| 213763 | 3902876855 | HR | Human Resources |
| 213764 | 3902878689 | MGMT | Management |
| 213765 | 3902878689 | MNFC | Manufacturing |
| 213766 | 3902883233 | SALE | Sales |
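
The merged table is a natural starting point for the first question in the Motivation section (the most common skills). A minimal sketch of that count follows, using a few toy rows copied from the output above rather than the full table; the function name `top_skills` and the choice to count distinct postings per skill are assumptions, not the project's settled method.

```python
import pandas as pd

# Toy rows taken from the merged job_skills output above.
job_skills_named = pd.DataFrame({
    "job_id": [3884428798, 3884428798, 3884428798, 3887473071, 3887465684, 3902883233],
    "skill_name": ["Marketing", "Public Relations", "Writing/Editing", "Sales", "Finance", "Sales"],
})

def top_skills(df, n=3):
    """Count how many distinct postings mention each skill, most frequent first."""
    return (
        df.groupby("skill_name")["job_id"]
          .nunique()
          .sort_values(ascending=False)
          .head(n)
    )

skill_counts = top_skills(job_skills_named)
```

On the full table, the same call with a larger `n` would give the ranking to plot or to break down by industry.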


# **Data Manipulation**

## Steps:
**1. Handle missing values:**

**2. Standardize format:**

**3. Merge dataframes:**

**4. Create new columns:**
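
The four steps above could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the toy tables, the `pay_period` labels and yearly multipliers, the midpoint fallback for a missing `med_salary`, the one-row-per-job shape of `salaries`, and the new column names `annual_salary` and `n_skills`. The real choices will depend on what the data actually contains.

```python
import pandas as pd

# Toy stand-ins for the salaries and job_skills tables loaded earlier.
salaries = pd.DataFrame({
    "job_id": [1, 2, 3],
    "max_salary": [30.0, 120000.0, None],
    "min_salary": [20.0, 80000.0, None],
    "med_salary": [None, None, 5000.0],
    "pay_period": ["hourly", "YEARLY", "MONTHLY"],
})
job_skills = pd.DataFrame({"job_id": [1, 1, 2, 4], "skill_abr": ["SALE", "MRKT", "FIN", "HR"]})

# Assumed multipliers for converting each pay period to a yearly figure.
PERIODS_PER_YEAR = {"HOURLY": 2080, "WEEKLY": 52, "BIWEEKLY": 26, "MONTHLY": 12, "YEARLY": 1}

def prepare_jobs(salaries, job_skills):
    sal = salaries.copy()
    # 1. Handle missing values: fall back to the min/max midpoint when med_salary is absent.
    midpoint = (sal["min_salary"] + sal["max_salary"]) / 2
    sal["med_salary"] = sal["med_salary"].fillna(midpoint)
    # 2. Standardize format: normalize the pay_period labels to upper case.
    sal["pay_period"] = sal["pay_period"].str.strip().str.upper()
    # 3. Merge dataframes: one row per job with its skill list, keeping jobs without salary data.
    #    validate raises if salaries has more than one row per job_id.
    skills_per_job = job_skills.groupby("job_id")["skill_abr"].agg(list).reset_index()
    jobs = skills_per_job.merge(sal, on="job_id", how="left", validate="one_to_one")
    # 4. Create new columns: annualized salary and number of listed skills.
    jobs["annual_salary"] = jobs["med_salary"] * jobs["pay_period"].map(PERIODS_PER_YEAR)
    jobs["n_skills"] = jobs["skill_abr"].map(len)
    return jobs

jobs = prepare_jobs(salaries, job_skills)
```

Jobs with no salary row keep a missing `annual_salary` rather than being dropped, which matches the goal of handling incomplete information later in the model.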

# **Data Visualization**

In [None]:
# 1.

# **Reference**

https://www.kaggle.com/code/muhammadrifqimaruf/top10-recommendation-linkedin-job-posting

