## Introduction

![EDA_illustration](./doc/images/placeholder.png)

<div style="background-color: rgb(40, 40, 40); padding: 20px">
Data science is an interdisciplinary field that involves the use of statistical, computational, and machine learning techniques to extract insights and knowledge from data.
</div>

<div style="background-color: rgb(40, 40, 40); padding: 20px">
Data Engineers are professionals who specialize in handling and managing large volumes of data. They write computer programs and use various tools to build data pipelines that transform raw data into a more usable format for different teams within a company, including Business Intelligence Analysts, Data Analysts, Data Scientists, Machine Learning Engineers, and Database Administrators.
</div>

The goal of this project was to gather job postings for `"Data Engineer"` positions from www.glassdoor.com and analyze them to identify average salaries, working conditions, benefits, required skills, and trends. The data was collected between `11-04-2023` and `15-04-2023` from `32 countries` across North America (the USA and Canada), Europe, South East Asia, and Oceania.<br><br>The scraper used the Selenium package to drive the browser interaction needed to extract posting information, with a cap of `900 jobs per selected country`. Because of how Glassdoor serves results, the number of job postings often dropped from `900` to `300` or fewer once duplicates were eliminated. Glassdoor also sometimes returned irrelevant positions such as "Android Mobile Developer" or "Outside Sales Representative" for a "Data Engineer" query, which reduced the number further.<br><br>
The final dataset contains `3340` job postings.<br><br>This project was inspired by [Ken Jee's work](https://github.com/PlayingNumbers), and the author would like to extend special thanks to him.
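The deduplication and relevance filtering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper code: the dictionary keys (`company`, `title`, `location`), the `KEYWORDS` tuple, and both helper functions are assumptions made for the example.

```python
# Hypothetical post-processing of scraped postings (not the project's code).
# Each posting is a dict; the keys used here are assumed for illustration.

KEYWORDS = ("data engineer",)  # assumed relevance rule: title must contain this


def drop_duplicates(postings):
    """Keep the first posting for each (company, title, location) triple."""
    seen = set()
    unique = []
    for post in postings:
        key = (post["company"], post["title"], post["location"])
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def is_relevant(post, keywords=KEYWORDS):
    """Return True if the job title mentions any of the keywords."""
    title = post["title"].lower()
    return any(keyword in title for keyword in keywords)


scraped = [
    {"company": "Acme", "title": "Data Engineer", "location": "Berlin"},
    {"company": "Acme", "title": "Data Engineer", "location": "Berlin"},  # duplicate
    {"company": "Beta", "title": "Android Mobile Developer", "location": "Oslo"},
    {"company": "Gamma", "title": "Senior Data Engineer", "location": "Toronto"},
]

cleaned = [post for post in drop_duplicates(scraped) if is_relevant(post)]
print(len(scraped), "->", len(cleaned))
```

A real relevance rule would likely need a broader keyword list, since the cleaned data also contains related titles such as "BI Engineer".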

## About the data

The whole dataset was scraped from [glassdoor.com](https://www.glassdoor.com).

<img src="./doc/images/Glassdoor-Logo.png" width="60%" height="60%">

<div style="background-color: rgb(40, 40, 40); padding: 20px">
"Our site offers millions of the latest job listings, combined with a growing database of company reviews, CEO approval ratings, salary reports, interview reviews and questions, benefits reviews, office photos, and more. Unlike other job sites, all of the information on our site is shared by those who know a company best - the employees."
</div>

### Before Cleaning

In its raw form, each job posting contains the following fields:
- Company_name
- Rating
- Location
- Job_title
- Description
- Job_age
- Easy_apply
- Salary
- Employees
- Type_of_ownership
- Sector
- Founded
- Industry
- Revenue_USD
- Friend_recommend
- CEO_approval
- Career_opportunities
- Comp_&_benefits
- Culture_&_values
- Senior_management
- Work/Life_balance
- Pros
- Cons
- Benefits_rating
- Benefits_reviews

### After Cleaning

After cleaning and enriching the data, the columns are organized in a two-level multi-index:

**Job_details**

| columns | description (examples) | data type |
|---|---|---|
| Title                                             | Data Engineer, BI Engineer...                                     | string |
| Description                                       | The description provided by company                               | string |
| Seniority                                         | Junior, Mid, Senior, Management                                   | string |
| City                                              | e.g. Los Angeles                                                  | string |
| State                                             | e.g. California                                                   | string |
| Country                                           | e.g. United States                                                | string |
| Region                                            | North America, Europe, South East Asia, Oceania                   | string |
| Job_age                                           | 1 to 31 days (31 is the cap and means "31+ days")                 |  int   |
| Easy_apply                                        | Y/N (applying via Glassdoor)                                      |  bool  |
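The `Job_age` cap can be illustrated with a small parser. The raw labels handled here (`"24h"`, `"5d"`, `"30d+"`) are an assumption about how posting ages appear in the scraped data, and `parse_job_age` is a hypothetical helper, not a function from the project.

```python
import re

MAX_JOB_AGE = 31  # stands for "31+ days", as in the Job_age row above


def parse_job_age(raw):
    """Convert an age label such as '24h', '5d' or '30d+' to whole days (1..31)."""
    match = re.fullmatch(r"(\d+)\s*([hd])(\+?)", raw.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized job age: {raw!r}")
    value, unit, plus = int(match.group(1)), match.group(2), match.group(3)
    if plus:
        return MAX_JOB_AGE  # open-ended label, e.g. '30d+'
    if unit == "h":
        return 1  # anything measured in hours counts as day 1
    return max(1, min(value, MAX_JOB_AGE))


for label in ("24h", "5d", "30d+"):
    print(label, "->", parse_job_age(label))
```

Clamping to a single maximum keeps the column an integer, at the cost of losing how much older than a month a posting really is.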

**Salary**<br>
All float values are in USD.

| columns | description (examples) | data type |
|---|---|---|
| Min                                               | Minimum pay in local currency for the position at the company     | float  |
| Max                                               | Maximum pay in local currency for the position at the company     | float  |
| Avg                                               | Midpoint between the minimum and the maximum                      | float  |
| Currency                                          | Three-letter ISO 4217 code of the currency in which the salary is paid:<br>USD, EUR, CAD, DKK, HKD, NZD, NOK, PLN, RON, SGD, SEK, CHF, GBP   | string |
| Employer_provided                                 | Y/N (Does the employer provide pay scale ranges)                  |  bool  |
| Is_hourly                                         | Y/N (Paid by number of worked hours, or monthly)                  |  bool  |
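As a rough sketch of how the salary columns relate to each other, the snippet below splits a salary label into the fields listed above. The label formats (`"$80K - $120K (Employer est.)"`, `"... Per Hour ..."`) are assumptions about the raw Glassdoor text, `parse_salary` is a hypothetical helper, and `Avg` is taken to be the midpoint of `Min` and `Max`. Currency conversion is left out.

```python
import re


def parse_salary(raw):
    """Split a salary label into Min, Max, Avg, Employer_provided and Is_hourly."""
    text = raw.lower()
    amounts = []
    # Each match is (number, 'k' or ''); a trailing 'k' means thousands.
    for number, kilo in re.findall(r"(\d+(?:\.\d+)?)\s*(k?)", text):
        amounts.append(float(number) * (1000 if kilo else 1))
    if not amounts:
        return None  # no salary information in the label
    low, high = min(amounts), max(amounts)
    return {
        "Min": low,
        "Max": high,
        "Avg": (low + high) / 2,
        "Employer_provided": "employer" in text,
        "Is_hourly": "per hour" in text,
    }


yearly = parse_salary("$80K - $120K (Employer est.)")
hourly = parse_salary("$25.50 - $35.00 Per Hour (Glassdoor est.)")
print(yearly)
print(hourly)
```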

**Company_info**

| columns | description (examples) | data type |
|---|---|---|
| Name                                          | The Great company Co., Ausgezeichnete Gmbh...                             | string|
| Rating                                        | 0.0 to 5.0, the rating of the company                                     | float |
| Employees                                     | 1 to 50, 51 to 200, 201 to 500, 501 to 1000, 1001 to 5000, 5001 to 10000, 10000+ | string|
| Type_of_ownership                             | Company - Private, Company - Public, Subs...                              | string|
| Sector                                        | Information Technology, Human Resources & Staffing...                     | string|
| Industry                                      | Information Technology Support Services, HR Consulting...                 | string|
| Company_age                                   | 2, 12, 333... (age in years as of 2023)                                   |  int  |
| Revenue_USD                                   | Less than $1 million, $1 to $5 million                                    | string|
| Friend_recommend                              | 0.00 to 1.00 (0% to 100%)                                                 | float |
| CEO_approval                                  | 0.00 to 1.00 (0% to 100%)                                                 | float |
| Career_opportunities                          | 0.0 to 5.0                                                                | float |
| Comp_&_benefits                               | 0.0 to 5.0                                                                | float |
| Senior_management                             | 0.0 to 5.0                                                                | float |
| Work/Life_balance                             | 0.0 to 5.0                                                                | float |
| Culture_&_values                              | 0.0 to 5.0                                                                | float |
| Pros                                          | "Pay good money, work isn't too difficult" (in 2 reviews), ...            |list[str]|
| Cons                                          | "The pay could be better." (in 7 reviews), "Boss culture" (in 6 reviews), ... |list[str]|
| Benefits_rating                               | 0.0 to 5.0                                                                | float |
| Benefits_reviews                              | 0.0 to 5.0                                                                | float |
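Two of the derived `Company_info` columns are simple transformations of raw fields. The sketch below shows one plausible way to compute them; the helper names are hypothetical, and the assumption that `Friend_recommend` and `CEO_approval` arrive as percentage strings such as `"85%"` is not confirmed by the source.

```python
REFERENCE_YEAR = 2023  # year the data was collected


def company_age(founded, reference_year=REFERENCE_YEAR):
    """Years since founding, or None when the founding year is unknown."""
    if founded is None:
        return None
    return reference_year - int(founded)


def percent_to_fraction(raw):
    """Convert a label such as '85%' to 0.85; return None for missing values."""
    if raw is None or not raw.strip():
        return None
    return round(float(raw.strip().rstrip("%")) / 100, 2)


print(company_age(2011))           # founded in 2011
print(percent_to_fraction("85%"))  # share of employees recommending the company
```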

(Below are the requirements extracted from the job description)

**Education**

| columns | description (examples) | data type |
|---|---|---|
| BA                                            | Y/N                                   |  bool  |
| MS                                            | Y/N                                   |  bool  |
| Phd                                           | Y/N                                   |  bool  |
| Certificate                                | Nanodegree, DataCamp, ..., Other      | string |

**Version_control**

| columns | description (examples) | data type |
|---|---|---|
| Git                                           | Github, GitLab, ..., Git (subset of any previous) | string |

**Cloud_platforms (Top 10)**

| columns | description (examples) | data type |
|---|---|---|
| AWS                                           | Y/N   |  bool  |
| Microsoft_Azure                               | Y/N   |  bool  |
| GPC                                           | Y/N   |  bool  |
| Alibaba                                       | Y/N   |  bool  |
| Oracle                                        | Y/N   |  bool  |
| IBM                                           | Y/N   |  bool  |
| Tencent                                       | Y/N   |  bool  |
| OVHcloud                                      | Y/N   |  bool  |
| DigitalOcean                                  | Y/N   |  bool  |
| Lincode                                       | Y/N   |  bool  |

**RDBMS (Relational Database Management System)**

| columns | description (examples) | data type |
|---|---|---|
| PostgreSQL                                    | Y/N |  bool  |
| Microsoft_SQL_Server                          | Y/N |  bool  |
| IBM_Db2                                       | Y/N |  bool  |
| MySQL                                         | Y/N |  bool  |
| Oracle_PL_SQL                                 | Y/N |  bool  |

**NOSQL (not only SQL)**

| columns | description (examples) | data type |
|---|---|---|
| MongoDB                                       | Y/N |  bool  |
| Cassandra                                     | Y/N |  bool  |
| Amazon_DynamoDB                               | Y/N |  bool  | 
| Neo4j                                         | Y/N |  bool  |

**Search_&_Analytics**

| columns | description (examples) | data type |
|---|---|---|
| Apache_Solr                                   | Y/N |  bool  |
| Amazon_Redshift                               | Y/N |  bool  |
| Google_BigQuery                               | Y/N |  bool  |
| Snowflake                                     | Y/N |  bool  |
| Oracle_Exadata                                | Y/N |  bool  |
| SAP_HANA                                      | Y/N |  bool  |
| Teradata                                      | Y/N |  bool  |

**Data_integration_and_processing**

| columns | description (examples) | data type |
|---|---|---|
| Informatica_PowerCenter                       | Y/N |  bool  |
| Databricks                                    | Y/N |  bool  |
| Presto                                        | Y/N |  bool  |

**Stream_processing_tools**

| columns | description (examples) | data type |
|---|---|---|
| Apache_Kafka                                  | Y/N |  bool  |     
| Apache_Flink                                  | Y/N |  bool  |
| Dataflow                                      | Y/N |  bool  |

**Workflow_orchestration_tools**

| columns | description (examples) | data type |
|---|---|---|
| Apache_Airflow                                | Y/N |  bool  |
| Luigi                                         | Y/N |  bool  |
| SSIS                                          | Y/N |  bool  |

**Big_Data_processing**

| columns | description (examples) | data type |
|---|---|---|
| Apache_Hadoop                                 | Y/N |  bool  |
| Apache_Hive                                   | Y/N |  bool  |
| Apache_Spark                                  | Y/N |  bool  |

**OS**

| columns | description (examples) | data type |
|---|---|---|
| Linux                                         | Y/N |  bool  |

**Programming_languages**

| columns | description (examples) | data type |
|---|---|---|
| Python                                        | Y/N |  bool  |
| R                                             | Y/N |  bool  |
| Scala                                         | Y/N |  bool  |
| Julia                                         | Y/N |  bool  |
| SQL                                           | Y/N |  bool  |
| Java                                          | Y/N |  bool  |
| C++                                           | Y/N |  bool  |
| Go                                            | Y/N |  bool  |
| Bash                                          | Y/N |  bool  |
| PowerShell                                    | Y/N |  bool  |
| CLI                                           | Y/N |  bool  |

**Business_Intelligence_Tools**

| columns | description (examples) | data type |
|---|---|---|
| Tableau                                       | Y/N |  bool  |
| Power_BI                                      | Y/N |  bool  |
| Google_Analytics                              | Y/N |  bool  |
| QlikView                                      | Y/N |  bool  |
| Oracle_BI_server                              | Y/N |  bool  |
| SAS_Analytics                                 | Y/N |  bool  |
| Lumira                                        | Y/N |  bool  |
| Cognos_Impromptu                              | Y/N |  bool  |
| MicroStrategy                                 | Y/N |  bool  |
| InsightSquared                                | Y/N |  bool  |
| Sisense                                       | Y/N |  bool  |
| Dundas_BI                                     | Y/N |  bool  |
| Domo                                          | Y/N |  bool  |
| Looker                                        | Y/N |  bool  |
| Excel                                         | Y/N |  bool  |
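The Y/N skill columns in the tables above could be derived from the job description by keyword matching. The following is a minimal sketch under stated assumptions: the `SKILL_KEYWORDS` map, its spellings, and `skill_flags` are invented for illustration and cover only a handful of columns. Lookarounds are used instead of `\b` so that tokens like `C++` and the single letter `R` are matched as whole tokens.

```python
import re

# Hypothetical keyword map: output column -> spellings to look for.
SKILL_KEYWORDS = {
    "Python": ["python"],
    "SQL": ["sql"],
    "R": ["R"],  # matched case-sensitively, see CASE_SENSITIVE
    "C++": ["c++"],
    "Apache_Spark": ["apache spark", "spark", "pyspark"],
    "Power_BI": ["power bi", "powerbi"],
}
CASE_SENSITIVE = {"R"}


def skill_flags(description):
    """Return {column: bool} saying whether each skill is mentioned."""
    flags = {}
    for column, spellings in SKILL_KEYWORDS.items():
        options = "|".join(re.escape(s) for s in spellings)
        # Reject matches glued to letters, digits, '+' or '#' on either side.
        pattern = rf"(?<![\w+#])(?:{options})(?![\w+#])"
        case = 0 if column in CASE_SENSITIVE else re.IGNORECASE
        flags[column] = re.search(pattern, description, case) is not None
    return flags


wanted = skill_flags("Strong Python, SQL and PySpark skills; R or C++ is a plus.")
unrelated = skill_flags("Experience with MySQL and React required.")
print(wanted)
print(unrelated)
```

Plain substring matching would flag `SQL` for "MySQL" and `R` for "React", which is why the token boundaries matter here.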

### Basic Exploration

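In [1]:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from pathlib import Path

    import warnings
    warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

In [2]:

    # Forward slashes keep the path portable; backslashes in a normal string act as escapes
    file_path = Path("data/clean/Data_Engineer/Data_Engineer_15-04-2023.csv")
    # The first two header rows form the two-level column index
    data = pd.read_csv(file_path, index_col=0, header=[0, 1])

In [3]:

    print("Summary Of The Dataset :")
    data.describe()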

Summary Of The Dataset :

| | Job_details<br>Job_age | Salary<br>Min | Salary<br>Max | Salary<br>Avg | Company_info<br>Rating | Company_info<br>Company_age | Company_info<br>Friend_recommend | Company_info<br>CEO_approval | Company_info<br>Career_opportunities | Company_info<br>Comp_&_benefits | Company_info<br>Senior_management | Company_info<br>Work/Life_balance | Company_info<br>Culture_&_values | Company_info<br>Benefits_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3340.0 | 809.0 | 815.0 | 809.0 | 2773.0 | 1847.0 | 2670.0 | 1810.0 | 2714.0 | 2711.0 | 2710.0 | 2710.0 | 2709.0 | 1272.0 |
| mean | 22.475449 | 82006.084054 | 102759.139877 | 92595.953646 | 3.973819 | 48.734705 | 0.775543 | 0.841492 | 3.716065 | 3.659056 | 3.62059 | 3.806015 | 3.843632 | 3.962736 |
| std | 10.805767 | 40699.305662 | 42593.859461 | 39225.439528 | 0.55893 | 53.817123 | 0.158428 | 0.14799 | 0.589307 | 0.624617 | 0.662821 | 0.595415 | 0.62882 | 0.639528 |
| min | 1.0 | 20240.0 | 20240.0 | 20240.0 | 1.0 | 2.0 | 0.09 | 0.13 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 13.0 | 53064.0 | 71867.0 | 62238.0 | 3.7 | 15.0 | 0.69 | 0.78 | 3.4 | 3.3 | 3.2 | 3.4 | 3.5 | 3.7 |
| 50% | 31.0 | 72976.0 | 93984.0 | 83500.0 | 4.0 | 28.0 | 0.8 | 0.88 | 3.7 | 3.7 | 3.6 | 3.8 | 3.9 | 4.0 |
| 75% | 31.0 | 105763.0 | 130500.0 | 119500.0 | 4.3 | 56.0 | 0.89 | 0.94 | 4.0 | 4.0 | 4.0 | 4.2 | 4.2 | 4.3 |
| max | 31.0 | 361896.0 | 271429.0 | 227541.0 | 5.0 | 333.0 | 1.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
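The `count` row of the summary above shows that many columns are sparsely populated: only 809 of the 3340 postings carry a minimum salary, for instance. A small sketch of how that coverage can be expressed as a share of all postings is given below. The counts are copied from the summary; the `Group/column` key names and the selection of columns are just for illustration.

```python
TOTAL_POSTINGS = 3340  # count of Job_age, which has no missing values

# Non-null counts copied from the `count` row of the summary.
non_null = {
    "Salary/Min": 809,
    "Salary/Max": 815,
    "Salary/Avg": 809,
    "Company_info/Rating": 2773,
    "Company_info/Company_age": 1847,
    "Company_info/Benefits_rating": 1272,
}

# Share of postings that have a value in each column.
coverage = {name: count / TOTAL_POSTINGS for name, count in non_null.items()}

for name, share in coverage.items():
    print(f"{name:<30} {share:6.1%}")
```

Any salary statistic in the analysis therefore rests on roughly a quarter of the postings, which is worth keeping in mind when comparing countries or industries.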


## Questions to ask

1. Countries with Highest Number of Jobs
1. Salaries per Country
1. Age of Job Postings
1. Top 5 Industries with the Highest Number of Jobs
1. Top 5 Industries with the Highest Salaries
1. Companies with Maximum Number of Job Openings
1. Company Ratings
1. Company Size
1. Company Age
1. Company Ownership Type
1. Company Revenue
1. Remote Job
1. Employment Type
1. Experience Level
1. Skills Required
1. Correlation Map
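As a hedged sketch of how the first two questions could be approached with the two-level columns, the example below builds a tiny stand-in frame and selects columns by `(group, column)` tuples. The `sample` values are invented for the example, and the choice of the median as the per-country salary statistic is an assumption rather than the project's method.

```python
import pandas as pd

# Tiny stand-in for `data`; the values are invented for the example.
columns = pd.MultiIndex.from_tuples(
    [("Job_details", "Country"), ("Salary", "Avg")]
)
sample = pd.DataFrame(
    [
        ["United States", 120000.0],
        ["United States", 100000.0],
        ["Canada", 80000.0],
        ["Canada", float("nan")],  # posting without salary information
        ["Germany", 70000.0],
    ],
    columns=columns,
)

# A (group, column) tuple selects one column from the two-level index.
country = sample[("Job_details", "Country")].rename("Country")
avg_salary = sample[("Salary", "Avg")].rename("Avg")

# Question 1: number of postings per country.
jobs_per_country = country.value_counts()

# Question 2: median salary per country; missing salaries are skipped.
median_salary = avg_salary.groupby(country).median().sort_values(ascending=False)

print(jobs_per_country)
print(median_salary)
```

The same selection pattern should apply to the real `data` frame loaded earlier, for example `data[("Salary", "Avg")]`.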