# Data Engineering Jobs Exploration and Salary Prediction Project based on Glassdoor Listed Jobs 2023

## I. Data Cleaning

In [1]:
import pandas as pd
import numpy as np

Importing the first scraped file, which contains the job listings from pages 0 to 20

In [2]:
df1 = pd.read_csv("../data/raw/glassdoor-data-engineer.csv")
df1.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,PCS Global Tech\r\n4.7,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,"$70,000 /yr (est.)",501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,,Unknown / Non-Applicable
1,Futuretech Consultants LLC,,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,$42.50 /hr (est.),,,,,,
2,Clairvoyant\r\n4.4,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,$67.50 /hr (est.),51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,,Unknown / Non-Applicable
3,Apple\r\n4.2,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,1976.0,$10+ billion (USD)
4,Skytech Consultancy Services\r\n5.0,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,$65.00 /hr (est.),1 to 50 Employees,Company - Public,,,,Unknown / Non-Applicable


In [3]:
df1.shape

(600, 12)

Importing the second scraped file, which contains the job listings from pages 21 to 30

In [4]:
df2 = pd.read_csv("../data/raw/glassdoor-data-engineer-after20.csv")
df2.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,Jane Street\r\n4.4,4.4,"New York, NY",Data Engineer,About the Position\r\nWe are looking for a Dat...,"$237,500 /yr (est.)",1001 to 5000 Employees,Company - Private,Management & Consulting,Research & Development,2000.0,Unknown / Non-Applicable
1,"Twitch Interactive, Inc.\r\n3.8",3.8,"San Francisco, CA",Data Engineer,"3+ years of experience in data engineering, so...","$105,700 /yr (est.)",10000+ Employees,Company - Public,Information Technology,Internet & Web Services,1994.0,$10+ billion (USD)
2,Aretec Inc\r\n1.0,1.0,Remote,Junior Data Engineer,POSITION TITLE: Junior Data Engineer YEARS OF ...,"$102,500 /yr (est.)",51 to 200 Employees,Contract,,,,Unknown / Non-Applicable
3,Small Batch Standard\r\n4.1,4.1,Remote,Junior Data Engineer,"We're the premier, remote accounting, tax, and...","$64,000 /yr (est.)",1 to 50 Employees,Company - Private,Financial Services,Accounting & Tax,2010.0,Unknown / Non-Applicable
4,Bose\r\n3.8,3.8,"Framingham, MA",Data Analytics Engineer,"Job Description\r\nBose is about better sound,...","$98,192 /yr (est.)",5001 to 10000 Employees,Company - Private,Manufacturing,Consumer Product Manufacturing,1964.0,$1 to $5 billion (USD)


In [5]:
df2.shape

(300, 12)

Let's concatenate the two dataframes into one

In [6]:
df = pd.concat([df1, df2], ignore_index=True)
df.shape

(900, 12)
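
By default, `pd.concat` keeps each input frame's row labels, so a combined frame can carry repeated index values. A minimal sketch with two toy frames (the names `first` and `second` and their contents are made up for illustration) shows what `ignore_index=True` changes:

```python
import pandas as pd

# Toy frames standing in for the two scraped files (illustrative data only)
first = pd.DataFrame({"company": ["A", "B"]})
second = pd.DataFrame({"company": ["C"]})

# Default behaviour: each frame's original row labels are kept
stacked = pd.concat([first, second])
print(stacked.index.tolist())    # [0, 1, 0]

# ignore_index=True builds a fresh 0..n-1 index for the combined frame
reindexed = pd.concat([first, second], ignore_index=True)
print(reindexed.index.tolist())  # [0, 1, 2]
```

Repeated labels do not affect the boolean filtering used later in this notebook, but they can surprise label-based lookups such as `df.loc[0]`.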

📘 Let's export this uncleaned raw data and share it publicly on Kaggle

In [7]:
data_path = '../data/kaggle/'

df.to_csv(data_path + "glassdoor-data-engineer-kaggle.csv", index=False)

### Let's Begin Cleaning the Data

Checking the null values

In [8]:
df.isnull().sum()

company               0
company_rating      185
location              0
job_title             0
job_description       0
salary_estimate      64
company_size        135
company_type        135
company_sector      391
company_industry    391
company_founded     495
company_revenue     135
dtype: int64

The most important column is "company": if it is null, the job was not scraped and the other columns would also be null.

In [9]:
df = df.dropna(subset=['company'])

In [10]:
df.isnull().sum()

company               0
company_rating      185
location              0
job_title             0
job_description       0
salary_estimate      64
company_size        135
company_type        135
company_sector      391
company_industry    391
company_founded     495
company_revenue     135
dtype: int64

Cleaning the company name by removing the associated rating

In [11]:
df['company'] = df['company'].apply(lambda x: x.split('\n')[0].strip())
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,"$70,000 /yr (est.)",501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,,Unknown / Non-Applicable
1,Futuretech Consultants LLC,,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,$42.50 /hr (est.),,,,,,
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,$67.50 /hr (est.),51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,,Unknown / Non-Applicable
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,1976.0,$10+ billion (USD)
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,$65.00 /hr (est.),1 to 50 Employees,Company - Public,,,,Unknown / Non-Applicable


Correctly formatting the salary estimate and converting hourly rates to annual salaries

In [12]:
import re

def clean_salary(salary_string):

    if pd.isnull(salary_string):
        return np.nan
    else:
        match_year = re.search(r'\$(\d{1,3},?\d{0,3},?\d{0,3}) \/yr \(est\.\)', salary_string)
        match_hour = re.search(r'\$(\d+(\.\d+)?) \/hr \(est\.\)', salary_string)

        if match_year:
            salary_amount = float(match_year.group(1).replace(',', ''))
        elif match_hour:
            hourly_salary = float(match_hour.group(1))
            # Annualize the hourly rate assuming 1,800 working hours per year
            salary_amount = hourly_salary * 1800
        else:
            salary_amount = np.nan

        return salary_amount
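
The hourly branch multiplies the rate by 1,800 hours per year, which is the notebook's choice. The sketch below isolates that arithmetic so the effect of the multiplier is visible. The helper `hourly_to_annual` and its `hours_per_year` parameter are hypothetical names introduced for illustration, and 2,080 (40 hours × 52 weeks) is shown only as another common full-time convention:

```python
import re

HOURS_PER_YEAR = 1800  # multiplier used in clean_salary above


def hourly_to_annual(salary_string, hours_per_year=HOURS_PER_YEAR):
    """Convert a string such as '$42.50 /hr (est.)' to an annual amount."""
    match = re.search(r'\$(\d+(?:\.\d+)?) /hr', salary_string)
    if match is None:
        return None  # not an hourly estimate
    return float(match.group(1)) * hours_per_year


print(hourly_to_annual("$42.50 /hr (est.)"))                       # 76500.0
print(hourly_to_annual("$42.50 /hr (est.)", hours_per_year=2080))  # 88400.0
print(hourly_to_annual("$70,000 /yr (est.)"))                      # None
```

The first result matches the 76,500 shown for the second row after cleaning.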

In [13]:
df['salary_estimate'] = df['salary_estimate'].apply(clean_salary)

Now let's replace the null salary estimates with the mean

In [14]:
df['salary_estimate'] = df['salary_estimate'].fillna(df['salary_estimate'].mean())

Let's round the cleaned salary estimate to whole dollars

In [15]:
df['salary_estimate'] = df['salary_estimate'].round().astype(int)

In [16]:
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,company_founded,company_revenue
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,70000,501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,,Unknown / Non-Applicable
1,Futuretech Consultants LLC,,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,76500,,,,,,
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,121500,51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,,Unknown / Non-Applicable
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",106385,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,1976.0,$10+ billion (USD)
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,117000,1 to 50 Employees,Company - Public,,,,Unknown / Non-Applicable


Extracting the state from the job location

In [17]:
df['location'] = df['location'].astype(str)
df['job_state'] = df['location'].apply(lambda x: x if x.lower() == 'remote' else x.split(', ')[-1])

In [18]:
df.job_state.value_counts()

Remote           149
GA                96
TX                94
CA                87
NJ                80
MN                49
DC                46
VA                44
WI                36
MD                34
IL                34
MS                24
NY                21
MA                19
CT                18
OR                17
United States     14
PA                12
UT                 8
TN                 6
FL                 4
OH                 3
DE                 1
SC                 1
OK                 1
CO                 1
NC                 1
Name: job_state, dtype: int64
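
To see why 'United States' shows up among the states, the same rule can be applied to a few sample locations. This is a small sketch in plain Python. The function name `extract_state` is introduced here for illustration and mirrors the lambda above:

```python
def extract_state(location):
    # Keep 'Remote' as-is; otherwise take the text after the last ', '
    if location.lower() == 'remote':
        return location
    return location.split(', ')[-1]


for loc in ["Riverside, CA", "Remote", "United States"]:
    print(loc, "->", extract_state(loc))
# Riverside, CA -> CA
# Remote -> Remote
# United States -> United States
```

A location without a comma is returned unchanged, so country-level listings end up as their own "state".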

Replacing 'United States' in job_state with the most common state (excluding Remote)

In [19]:
common_states = df.job_state.value_counts().index.tolist()
common_state = next((state for state in common_states if state != 'Remote'), None)
common_state

'GA'
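
The same lookup can also be written by filtering out 'Remote' before counting. A minimal sketch on a toy series (the values are invented for illustration, and `most_common_state` is a name introduced here):

```python
import pandas as pd

# Toy data; the values are invented for illustration
states = pd.Series(['Remote', 'Remote', 'Remote', 'GA', 'GA', 'TX'])

# Drop 'Remote' first, then take the most frequent remaining label
non_remote_counts = states[states != 'Remote'].value_counts()
most_common_state = non_remote_counts.idxmax()
print(most_common_state)  # GA
```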

In [20]:
df['job_state'] = df['job_state'].replace('United States', common_state)
df.job_state.value_counts()

Remote    149
GA        110
TX         94
CA         87
NJ         80
MN         49
DC         46
VA         44
WI         36
IL         34
MD         34
MS         24
NY         21
MA         19
CT         18
OR         17
PA         12
UT          8
TN          6
FL          4
OH          3
DE          1
SC          1
OK          1
CO          1
NC          1
Name: job_state, dtype: int64

Replacing company rating null values with the mean

In [21]:
cr_mean = df.company_rating.mean()
cr_mean = round(cr_mean, 1)
cr_mean

4.2

In [22]:
df['company_rating'] = df['company_rating'].fillna(cr_mean)

Adding a new column that contains the age of the company

In [23]:
df['company_founded'] = df['company_founded'].fillna(-1)
df['company_founded'] = df['company_founded'].astype(int)

In [24]:
import datetime

today = datetime.datetime.now()

df['company_age'] = df.company_founded.apply(lambda x: x if x < 0 else today.year - x)

df['company_age'].head()

0    -1
1    -1
2    -1
3    47
4    -1
Name: company_age, dtype: int64
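
The `-1` placeholder keeps the column integer-typed, but any later statistic would treat it as a real age. One alternative, sketched here under assumptions (a fixed `REFERENCE_YEAR` of 2023 and a toy `founded` series, neither taken from the notebook), is to keep unknown ages as missing values with pandas' nullable `Int64` dtype:

```python
import pandas as pd

REFERENCE_YEAR = 2023  # assumption: a fixed year keeps the result reproducible

# Toy founding years; None marks a company with no founding year listed
founded = pd.Series([1976.0, None, 2010.0])

# Missing years stay missing (<NA>) instead of becoming a -1 sentinel
company_age = (REFERENCE_YEAR - founded).astype("Int64")

print(company_age.iloc[0])           # 47
print(company_age.isna().tolist())   # [False, True, False]
```

Using a fixed year instead of `datetime.datetime.now()` also means the ages do not shift when the notebook is rerun in a later year.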

Simplifying the job title

In [25]:
def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'data analyst' in title.lower():
        return 'data analyst'
    elif 'machine learning' in title.lower():
        return 'mle'
    else:
        return 'na'

In [26]:
df['job_simp'] = df['job_title'].apply(title_simplifier)
df.job_simp.value_counts()

data engineer     838
na                 55
data scientist      7
Name: job_simp, dtype: int64

In [27]:
df = df[df['job_simp'] != 'na']
df = df[df['job_simp'] != 'data scientist']

df.job_simp.value_counts()

data engineer    838
Name: job_simp, dtype: int64

In [28]:
def seniority(title):
    title = title.lower()
    # Substring checks: 'sr' also covers 'sr.'
    if any(keyword in title for keyword in ('sr', 'senior', 'lead', 'principal')):
        return 'senior'
    # 'jr' also covers 'jr.'; titles that spell out 'junior' are not matched
    elif 'jr' in title:
        return 'junior'
    else:
        return 'na'

In [29]:
df['seniority'] = df['job_title'].apply(seniority)
df.seniority.value_counts()

na        626
senior    210
junior      2
Name: seniority, dtype: int64

In [30]:
df = df[df['seniority'] != "junior"]

df.seniority.value_counts()

na        626
senior    210
Name: seniority, dtype: int64

Extracting relevant skills from job description

In [31]:
prog_languages = ['python', 'java', 'scala', 'go', 'r', 'c++', 'c#', 'sql', 'nosql', 'shell', 'rust']
cloud_tools = ['aws', 'azure', 'google cloud', 'snowflake', 'databricks', 'redshift', 'oracle', 'gcp', 'bigquery']
viz_tools = ['power bi', 'tableau', 'excel', 'ssis', 'qlik', 'sap', 'sas', 'dax']
databases = ['sql server', 'postgresql', 'mongodb', 'mysql', 'cassandra', 'elasticsearch', 'dynamodb', 'redis', 'neo4j', 'hive']
librairies = ['spark', 'hadoop', 'kafka', 'airflow']

In [32]:
import re

def extract_keywords(description, keywords):
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords)))
    matches = set(re.findall(pattern, description.lower(), flags=re.IGNORECASE))
    
    return list(matches)
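
One caveat with `\b`: it only matches between a word character and a non-word character, so keywords ending in a symbol, such as 'c++' or 'c#', are not found when followed by a space or punctuation. A possible variant, sketched below, swaps `\b` for lookarounds. The name `extract_keywords_strict` is hypothetical, and the result is sorted only to make the output deterministic:

```python
import re


def extract_keywords_strict(description, keywords):
    # Lookarounds replace \b so keywords ending in '+' or '#' can still match
    alternatives = '|'.join(map(re.escape, keywords))
    pattern = r'(?<!\w)(?:{})(?!\w)'.format(alternatives)
    return sorted(set(re.findall(pattern, description.lower())))


sample = "Experience with C++, C# and Python; Spark is a plus."
print(extract_keywords_strict(sample, ['python', 'r', 'c++', 'c#']))
# ['c#', 'c++', 'python']
```

The single-letter keyword 'r' is still only matched as a standalone token, so the 'r' inside "Spark" or "Experience" is ignored.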

In [33]:
df['job_languages'] = df['job_description'].apply(lambda x: extract_keywords(x, prog_languages))
df['job_cloud'] = df['job_description'].apply(lambda x: extract_keywords(x, cloud_tools))
df['job_viz'] = df['job_description'].apply(lambda x: extract_keywords(x, viz_tools))
df['job_databases'] = df['job_description'].apply(lambda x: extract_keywords(x, databases))
df['job_librairies'] = df['job_description'].apply(lambda x: extract_keywords(x, librairies))

Extracting Education from job description

In [34]:
education = ['associate', 'bachelor', 'master', 'phd']

In [35]:
def extract_degree(description, degrees):
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, degrees)))
    matches = re.findall(pattern, description.lower(), flags=re.IGNORECASE)
    
    if matches:
        return matches[0]
    
    return None

In [36]:
df['job_education'] = df['job_description'].apply(lambda x: extract_degree(x, education))

df['job_education'].value_counts()

bachelor     199
master        81
associate      1
Name: job_education, dtype: int64

In [37]:
df = df[df['job_education'] != "associate"]

df['job_education'].value_counts()

bachelor    199
master       81
Name: job_education, dtype: int64

Let's extract the experience needed to apply for the job

In [38]:
import re

def extract_experience(description):
    pattern = r'(?:Experience level|experience|\+).*(?:\n.*)*(\d+|\+)\s*(?:year|years|\+ years|\+ years of experience)'
    matches = re.findall(pattern, description, flags=re.IGNORECASE)
    
    if matches:
        experience = matches[0]
        if experience == '+':
            return "+10 years"
        elif int(experience) < 2:
            return "0-2 years"
        elif int(experience) < 5:
            return "2-5 years"
        elif int(experience) < 10:
            return "5-10 years"
        else:
            return "+10 years"
    else:
        return None
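
The pattern above is greedy, so the captured group can end up being a single digit far from the word "experience". As a point of comparison, here is a simpler sketch that takes the first "N years" mention and, for a range such as "5-8+ Years", uses the lower bound. `YEARS_PATTERN` and `extract_min_experience` are hypothetical names, this is not the notebook's method, and it can still misfire on phrases like "20 years in business":

```python
import re

# Hypothetical pattern: a 1-2 digit number, an optional "-N" range and "+",
# followed by "year(s)" or "yr(s)"
YEARS_PATTERN = re.compile(
    r'(?<!\d)(\d{1,2})\s*(?:-\s*\d{1,2}\s*)?\+?\s*(?:years?|yrs?)\b',
    flags=re.IGNORECASE,
)


def extract_min_experience(description):
    match = YEARS_PATTERN.search(description)
    if match is None:
        return None  # no "N years" mention found
    years = int(match.group(1))
    # Same bucket labels as extract_experience above
    if years < 2:
        return "0-2 years"
    elif years < 5:
        return "2-5 years"
    elif years < 10:
        return "5-10 years"
    return "+10 years"


print(extract_min_experience("Must have 5-8+ Years of experience"))  # 5-10 years
print(extract_min_experience("3+ years of experience in data engineering"))  # 2-5 years
print(extract_min_experience("Competitive salary and benefits"))  # None
```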

In [39]:
df['job_experience'] = df['job_description'].apply(lambda x: extract_experience(x))

df['job_experience'].value_counts()

0-2 years     217
5-10 years    127
2-5 years     110
+10 years      73
Name: job_experience, dtype: int64

Some job listings don't mention the education or years of experience needed; for those rows, job_education and job_experience are left null.

In [40]:
df.head()

Unnamed: 0,company,company_rating,location,job_title,job_description,salary_estimate,company_size,company_type,company_sector,company_industry,...,company_age,job_simp,seniority,job_languages,job_cloud,job_viz,job_databases,job_librairies,job_education,job_experience
0,PCS Global Tech,4.7,"Riverside, CA",Data Engineer | PAID BOOTCAMP,Responsibilities\r\n· Analyze and organize raw...,70000,501 to 1000 Employees,Company - Private,Information Technology,Information Technology Support Services,...,-1,data engineer,na,"[sql, python, java]",[],[],[],[],,0-2 years
1,Futuretech Consultants LLC,4.2,"Newton, MS",Snowflake Data Engineer,My name is Dileep and I am a recruiter at Futu...,76500,,,,,...,-1,data engineer,na,[sql],[snowflake],[ssis],[],[],bachelor,2-5 years
2,Clairvoyant,4.4,Remote,Data Engineer (MDM),Required Skills:\r\nMust have 5-8+ Years of ex...,121500,51 to 200 Employees,Company - Private,Pharmaceutical & Biotechnology,Biotech & Pharmaceuticals,...,-1,data engineer,na,"[sql, python]","[databricks, aws]",[],[],[spark],master,0-2 years
3,Apple,4.2,"Cupertino, CA",Data Engineer,"Summary\r\nPosted: Dec 22, 2021\r\nWeekly Hour...",106385,10000+ Employees,Company - Public,Information Technology,Computer Hardware Development,...,47,data engineer,na,[python],[],[tableau],[],[],,
4,Skytech Consultancy Services,5.0,"Baltimore, MD",Data Engineer,Description of Work:\r\nTechnical experience i...,117000,1 to 50 Employees,Company - Public,,,...,-1,data engineer,na,[sql],[oracle],[tableau],[],[],bachelor,5-10 years


Exporting the cleaned version of the dataframe as a new data file

In [41]:
data_path = '../data/processed/'

df.to_csv(data_path + "glassdoor-data-engineer-cleaned.csv", index=False)