# Data Engineering Jobs Exploration and Salary Prediction Project based on Glassdoor Listed Jobs 2023

## I. 🧹 Data Cleaning

In [46]:
import pandas as pd
import numpy as np
import os

### Loading the data from the scraped csv files

Each dataset name ends with the week number and the year in which it was scraped. For example, "glassdoor-data-engineer-19-2023" was scraped in the 19th week of 2023.

In [47]:
def load_datasets(dir_path):

    dfs = []

    # loop over each file in the directory
    for i, filename in enumerate(os.listdir(dir_path)):
        if filename.endswith('.csv'):  # check if file has .csv extension
            file_path = os.path.join(dir_path, filename)  
            # read the CSV file into a DataFrame and give it a name
            df_name = f'df{i+1}'  # generate a name like 'df1', 'df2', etc.
            df = pd.read_csv(file_path)
            # add the DataFrame to the list
            dfs.append((df_name, df))

    # concatenate all the DataFrames together
    df_list = [df for _, df in dfs]  # extract just the DataFrames
    df = pd.concat(df_list, axis=0)

    return df
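
The naming scheme described above can also be parsed programmatically, for instance to tag each row with its scrape week. A minimal sketch, assuming file names always end in `-<week>-<year>` with an optional `.csv` extension; `parse_scrape_week` is a hypothetical helper and not part of the original notebook:

```python
import re

def parse_scrape_week(filename):
    """Return (week, year) parsed from a name like 'glassdoor-data-engineer-19-2023.csv'."""
    # Assumption: the name ends with '-<week>-<year>', optionally followed by '.csv'
    match = re.search(r'-(\d{1,2})-(\d{4})(?:\.csv)?$', filename)
    if match is None:
        return None  # the name does not follow the assumed scheme
    return int(match.group(1)), int(match.group(2))

print(parse_scrape_week('glassdoor-data-engineer-19-2023.csv'))  # (19, 2023)
```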

In [48]:
dir_path = "../data/raw/"

df = load_datasets(dir_path)
df.shape

(2460, 12)

Let's drop the duplicate job listings from our dataset, treating rows with an identical job description as duplicates

In [49]:
df = df.drop_duplicates(subset=['job_description'])
df.shape

(557, 12)

📘 Let's export this uncleaned raw data and share it publicly on Kaggle

In [50]:
data_path = '../data/kaggle/'

df.to_csv(data_path + "glassdoor-data-engineer-kaggle.csv", index=False)

### Cleaning the Data

Checking the null values

In [51]:
df.isnull().sum()

company               3
company_rating      100
location              1
job_title             1
job_description       1
salary_estimate      93
company_size         62
company_type         62
company_sector      164
company_industry    164
company_founded     209
company_revenue      62
dtype: int64

The most important column is "company". If it's null, the job didn't get scraped, and therefore the other columns would also be null.

In [52]:
df = df.dropna(subset=['company'])

In [53]:
df.isnull().sum()

company               0
company_rating       97
location              0
job_title             0
job_description       0
salary_estimate      92
company_size         61
company_type         61
company_sector      163
company_industry    163
company_founded     208
company_revenue      61
dtype: int64

Cleaning the company name by removing the associated rating

In [54]:
df['company'] = df['company'].apply(lambda x: x.split('\n')[0].strip())
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,"$70,000 /yr (est.)",501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,,Unknown / Non-Applicable
1,Futuretech Consultants LLC,,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,$42.50 /hr (est.),,,,,,
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,$67.50 /hr (est.),51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,,Unknown / Non-Applicable
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,1976.0,$10+ billion (USD)
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,$65.00 /hr (est.),1 to 50 Employees,Company - Public,,,,Unknown / Non-Applicable


Correctly formatting the salary estimate and converting hourly rates to annual salaries (the code assumes 1,800 working hours per year)

In [55]:
import re

def clean_salary(salary_string):

    if pd.isnull(salary_string):
        return np.nan
    else:
        # the dot in "(est.)" is escaped so that it only matches a literal period
        match_year = re.search(r'\$(\d{1,3},?\d{0,3},?\d{0,3}) \/yr \(est\.\)', salary_string)
        match_hour = re.search(r'\$(\d+(\.\d+)?) \/hr \(est\.\)', salary_string)

        if match_year:
            salary_amount = float(match_year.group(1).replace(',', ''))
        elif match_hour:
            hourly_salary = float(match_hour.group(1))
            salary_amount = hourly_salary * 1800
        else:
            salary_amount = np.nan

        return salary_amount

In [56]:
df['salary_estimate'] = df['salary_estimate'].apply(clean_salary)

In [57]:
df['salary_estimate'].head()

0     70000.0
1     76500.0
2    121500.0
3         NaN
4    117000.0
Name: salary_estimate, dtype: float64
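
The hourly conversion can be checked by hand against the values above. The sketch below reuses the same factor of 1,800 hours per year; `hourly_to_annual` is a hypothetical helper introduced only for this illustration:

```python
HOURS_PER_YEAR = 1800  # same factor as in clean_salary

def hourly_to_annual(hourly_rate, hours_per_year=HOURS_PER_YEAR):
    """Convert an hourly rate into an annual salary estimate."""
    return hourly_rate * hours_per_year

# Hourly rates taken from the first rows of the raw data
for rate in (42.50, 67.50, 65.00):
    print(rate, '->', hourly_to_annual(rate))
```

A different convention, such as 2,080 hours (40 hours over 52 weeks), would scale every converted salary accordingly, so the factor is worth keeping as an explicit parameter.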

Now let's replace the null salary estimates with the mean

In [58]:
# assign the result back instead of using inplace=True on a column selection
df['salary_estimate'] = df['salary_estimate'].fillna(df['salary_estimate'].mean())

Let's round the clean salary estimate

In [59]:
df['salary_estimate'] = df['salary_estimate'].round().astype(int)

In [60]:
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,70000,501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,,Unknown / Non-Applicable
1,Futuretech Consultants LLC,,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,76500,,,,,,
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,121500,51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,,Unknown / Non-Applicable
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",111739,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,1976.0,$10+ billion (USD)
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,117000,1 to 50 Employees,Company - Public,,,,Unknown / Non-Applicable


Extracting the state from the job location

In [61]:
df['location'] = df['location'].astype(str)
df['job_state'] = df['location'].apply(lambda x: x if x.lower() == 'remote' else x.split(', ')[-1])

In [62]:
df.job_state.value_counts()

Remote            95
TX                56
CA                51
GA                34
NJ                29
NY                23
VA                23
MA                22
United States     20
FL                17
DC                17
IL                15
NC                14
PA                13
MN                12
CO                11
OH                10
MD                10
OR                 7
WI                 7
UT                 6
WA                 6
MI                 5
MO                 4
DE                 4
TN                 4
SC                 4
CT                 3
AZ                 3
California         2
Pennsylvania       2
AR                 2
Minnesota          2
Oregon             2
IA                 2
MS                 2
OK                 2
Illinois           1
North Carolina     1
AL                 1
KS                 1
NE                 1
Virginia           1
KY                 1
Manhattan          1
ME                 1
IN                 1
WV           

Replacing 'United States' in job_state with the most common state, skipping 'Remote' since it is not a state

In [63]:
common_states = df.job_state.value_counts().index.tolist()
common_state = next((state for state in common_states if state != 'Remote'), None)
common_state

'TX'

In [64]:
df['job_state']= df['job_state'].replace('United States', common_state)
df.job_state.value_counts()

Remote            95
TX                76
CA                51
GA                34
NJ                29
VA                23
NY                23
MA                22
FL                17
DC                17
IL                15
NC                14
PA                13
MN                12
CO                11
OH                10
MD                10
WI                 7
OR                 7
UT                 6
WA                 6
MI                 5
MO                 4
DE                 4
TN                 4
SC                 4
CT                 3
AZ                 3
California         2
Pennsylvania       2
AR                 2
Minnesota          2
Oregon             2
IA                 2
MS                 2
OK                 2
Illinois           1
North Carolina     1
AL                 1
KS                 1
NE                 1
Virginia           1
KY                 1
Manhattan          1
ME                 1
IN                 1
WV                 1
Texas        
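
The counts above still list a few full state names next to their two-letter codes. One possible way to merge them is a small lookup table. This is only a sketch: `STATE_ABBREVIATIONS` and `normalize_states` are hypothetical names, and the mapping covers just the full names visible in the output above.

```python
import pandas as pd

# Hypothetical, hand-written mapping limited to the full names seen in the counts
STATE_ABBREVIATIONS = {
    'California': 'CA', 'Pennsylvania': 'PA', 'Minnesota': 'MN', 'Oregon': 'OR',
    'Illinois': 'IL', 'North Carolina': 'NC', 'Virginia': 'VA', 'Texas': 'TX',
}

def normalize_states(states):
    # Series.replace with a dict leaves values that are not keys unchanged
    return states.replace(STATE_ABBREVIATIONS)

sample = pd.Series(['CA', 'California', 'Remote', 'Texas'])
print(normalize_states(sample).tolist())  # ['CA', 'CA', 'Remote', 'TX']
```

Values such as 'Manhattan' are not states and would need a separate decision.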

Replacing company rating null values with the mean, rounded to one decimal place

In [65]:
cr_mean = df.company_rating.mean()
cr_mean = round(cr_mean, 1)
cr_mean

4.0

In [66]:
df['company_rating'] = df['company_rating'].fillna(cr_mean)

Adding a new column that contains the age of the company

In [67]:
df['company_founded'] = df['company_founded'].fillna(-1)
df['company_founded'] = df['company_founded'].astype(int)

In [68]:
import datetime

today = datetime.datetime.now()

df['company_age'] = df.company_founded.apply(lambda x: x if x < 0 else today.year - x)

df['company_age'].head()

0    -1
1    -1
2    -1
3    47
4    -1
Name: company_age, dtype: int64
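
Because the age is computed from `datetime.datetime.now()`, the column changes when the notebook is rerun in a later year. A vectorized sketch with a pinned reference year is shown below; `REFERENCE_YEAR` and `company_age` are hypothetical names, and pinning 2023 is an assumption based on when the data was scraped.

```python
import pandas as pd

REFERENCE_YEAR = 2023  # assumption: the year the listings were scraped

def company_age(founded, reference_year=REFERENCE_YEAR):
    # keep the -1 sentinel for unknown founding years, as in the notebook
    return (reference_year - founded).where(founded >= 0, -1)

founded = pd.Series([-1, 1976, 2010])
print(company_age(founded).tolist())  # [-1, 47, 13]
```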

Simplifying the job title

In [69]:
def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'data analyst' in title.lower():
        return 'data analyst'
    elif 'machine learning' in title.lower():
        return 'mle'
    else:
        return 'na'

In [70]:
df['job_simp'] = df['job_title'].apply(title_simplifier)
df.job_simp.value_counts()

data engineer     450
na                 97
data scientist      3
data analyst        2
mle                 2
Name: job_simp, dtype: int64

In [71]:
df = df[df['job_simp'] == 'data engineer']

df.job_simp.value_counts()

data engineer    450
Name: job_simp, dtype: int64

In [72]:
def seniority(title):
    title = title.lower()
    # substring checks: 'sr' and 'jr' also cover 'sr.' and 'jr.'
    if any(keyword in title for keyword in ('sr', 'senior', 'lead', 'principal')):
        return 'senior'
    elif 'jr' in title:
        return 'junior'
    else:
        return 'na'

In [73]:
df['seniority'] = df['job_title'].apply(seniority)
df.seniority.value_counts()

na        320
senior    129
junior      1
Name: seniority, dtype: int64
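
The checks in `seniority` are plain substring tests, so a short token such as 'sr' or 'lead' also matches inside longer words. A stricter variant based on word boundaries might look like the sketch below. `seniority_strict` is a hypothetical name, matching the spelled-out word 'junior' is an added assumption, and applying it would likely change the counts shown above.

```python
import re

SENIOR_PATTERN = re.compile(r'\b(?:sr|senior|lead|principal)\b', re.IGNORECASE)
JUNIOR_PATTERN = re.compile(r'\b(?:jr|junior)\b', re.IGNORECASE)

def seniority_strict(title):
    # \b requires the keyword to stand alone, so 'leadership' no longer counts as 'lead'
    if SENIOR_PATTERN.search(title):
        return 'senior'
    elif JUNIOR_PATTERN.search(title):
        return 'junior'
    return 'na'

for title in ('Sr. Data Engineer', 'Junior Data Engineer', 'Data Engineer, Leadership Programs'):
    print(title, '->', seniority_strict(title))
```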

In [74]:
df = df[df['seniority'] != "junior"]

df.seniority.value_counts()

na        320
senior    129
Name: seniority, dtype: int64

Extracting relevant skills from job description

In [86]:
prog_languages = ['python', 'java', 'scala', 'go', 'r', 'c++', 'c#', 'sql', 'nosql', 'rust', 'shell']
cloud_tools = ['aws', 'azure', 'google cloud', 'snowflake', 'databricks', 'redshift']
viz_tools = ['power bi', 'tableau', 'excel', 'ssis', 'qlik', 'sap', 'looker']
databases = ['sql server', 'postgresql', 'mongodb', 'mysql', 'oracle', 'cassandra', 'elasticsearch', 'dynamodb', 'snowflake', 'redis', 'neo4j', 'hive', 'dbt']
big_data = ['spark', 'hadoop', 'kafka', 'flink']
devops = ['gitlab', 'terraform', 'docker', 'bash', 'ansible']

In [87]:
import re

def extract_keywords(description, keywords):
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords)))
    matches = set(re.findall(pattern, description.lower(), flags=re.IGNORECASE))
    
    return list(matches)
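
One caveat with the `\b` anchors: a word boundary only exists between a word character and a non-word character, so keywords ending in a symbol, such as 'c++' or 'c#', do not match when followed by a space or punctuation. A possible alternative uses lookarounds instead. This is a sketch, with `extract_keywords_lookaround` as a hypothetical name, and it returns a sorted list purely to make the output deterministic.

```python
import re

def extract_keywords_lookaround(description, keywords):
    alternatives = '|'.join(map(re.escape, keywords))
    # (?<!\w) and (?!\w) only require that no word character touches the keyword
    pattern = r'(?<!\w)(?:{})(?!\w)'.format(alternatives)
    return sorted(set(re.findall(pattern, description.lower())))

text = 'Experience with C++, C# and Python'
print(extract_keywords_lookaround(text, ['python', 'c++', 'c#']))  # ['c#', 'c++', 'python']
```

Short keywords such as 'r' and 'go' can still produce false positives with either approach.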

In [88]:
df['job_languages'] = df['job_description'].apply(lambda x: extract_keywords(x, prog_languages))
df['job_cloud'] = df['job_description'].apply(lambda x: extract_keywords(x, cloud_tools))
df['job_viz'] = df['job_description'].apply(lambda x: extract_keywords(x, viz_tools))
df['job_databases'] = df['job_description'].apply(lambda x: extract_keywords(x, databases))
df['job_bigdata'] = df['job_description'].apply(lambda x: extract_keywords(x, big_data))
df['job_devops'] = df['job_description'].apply(lambda x: extract_keywords(x, devops))

Extracting Education from job description

In [89]:
education = ['associate', 'bachelor', 'master', 'phd']

In [90]:
def extract_degree(description, degrees):
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, degrees)))
    matches = re.findall(pattern, description.lower(), flags=re.IGNORECASE)
    
    if matches:
        return matches[0]
    
    return None
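
Exact keywords miss spelling variants such as "Masters" or "Ph.D.", and "master" can also appear in phrases like "master data management". The sketch below handles a few such cases. The patterns in `DEGREE_PATTERNS` are assumptions rather than an exhaustive list, and `extract_degree_variants` is a hypothetical name. Like `extract_degree`, it returns the degree mentioned first.

```python
import re

# Assumed spelling variants; extend after inspecting the actual listings
DEGREE_PATTERNS = {
    'bachelor': r"\bbachelor(?:'s|s)?\b",
    'master': r"\bmaster(?:'s|s)?\b(?!\s+data)",  # skip 'master data ...'
    'phd': r"\bph\.?\s?d\b",
}

def extract_degree_variants(description):
    text = description.lower()
    hits = []
    for degree, pattern in DEGREE_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            hits.append((match.start(), degree))
    # the smallest start position is the degree mentioned first
    return min(hits)[1] if hits else None

print(extract_degree_variants('Ph.D. preferred'))  # phd
```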

In [91]:
df['job_education'] = df['job_description'].apply(lambda x: extract_degree(x, education))

df['job_education'].value_counts()

bachelor    175
master       28
Name: job_education, dtype: int64

In [92]:
df = df[df['job_education'] != "associate"]
df = df[df['job_education'] != "phd"]

df['job_education'].value_counts()

bachelor    175
master       28
Name: job_education, dtype: int64

Let's extract the experience needed to apply for the job

In [93]:
import re

def extract_experience(description):
    pattern = r'(?:Experience level|experience|\+).*(?:\n.*)*(\d+|\+)\s*(?:year|years|\+ years|\+ years of experience)'
    matches = re.findall(pattern, description, flags=re.IGNORECASE)
    
    if matches:
        experience = matches[0]
        if experience == '+':
            return "+10 years"
        elif int(experience) < 2:
            return "0-2 years"
        elif int(experience) < 5:
            return "2-5 years"
        elif int(experience) < 10:
            return "5-10 years"
        else:
            return "+10 years"
    else:
        return None
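
A note on the pattern above: `.*` is greedy, so the capture group tends to receive only the last character before "year". For "experience: 10 years" it captures "0", and for "5+ years" it captures "+". A narrower pattern that reads the first number of a range directly is sketched below. `YEARS_PATTERN`, `bucket_experience` and `extract_min_experience` are hypothetical names, the buckets copy the thresholds used above, and applying this version would change the counts reported in the notebook.

```python
import re

# First number of an optional range ('5-8'), optional '+', then 'year(s)' or 'yr(s)'
YEARS_PATTERN = re.compile(
    r'(?<!\d)(\d{1,2})(?:\s*(?:-|–|to)\s*\d{1,2})?\s*\+?\s*(?:years?|yrs?)',
    re.IGNORECASE,
)

def bucket_experience(years):
    if years < 2:
        return "0-2 years"
    elif years < 5:
        return "2-5 years"
    elif years < 10:
        return "5-10 years"
    return "+10 years"

def extract_min_experience(description):
    # Assumption: the first 'N years' mention is the experience requirement
    match = YEARS_PATTERN.search(description)
    return bucket_experience(int(match.group(1))) if match else None

print(extract_min_experience('Must have 5-8+ Years of experience'))  # 5-10 years
```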

In [94]:
df['job_experience'] = df['job_description'].apply(lambda x: extract_experience(x))

df['job_experience'].value_counts()

+10 years     106
5-10 years     70
0-2 years      64
2-5 years      62
Name: job_experience, dtype: int64

Some job listings don't mention the education or years of experience needed.

In [84]:
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,...,job_simp,seniority,job_languages,job_cloud,job_viz,job_databases,job_bigdata,job_devops,job_education,job_experience
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,70000,501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,...,data engineer,na,"[sql, java, python]",[],[],[],[],[],,0-2 years
1,Futuretech Consultants LLC,4.0,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,76500,,,,,...,data engineer,na,[sql],[snowflake],[ssis],[snowflake],[],[],bachelor,2-5 years
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,121500,51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,...,data engineer,na,"[sql, python]","[databricks, aws]",[],[],[spark],[],master,0-2 years
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",111739,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,...,data engineer,na,[python],[],[tableau],[],[],[],,
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,117000,1 to 50 Employees,Company - Public,,,...,data engineer,na,[sql],[],[tableau],[oracle],[],[],bachelor,5-10 years


Exporting the cleaned version of the dataframe as a new data file

In [95]:
data_path = '../data/processed/'

df.to_csv(data_path + "glassdoor-data-engineer-cleaned.csv", index=False)