# Job Skills Analysis in Data Analyst

#### Date: 28/03/2022

## Brief Introduction:

This project extracts, cleans and analyzes job description data for three data-related roles, **Data Analyst, Data Engineer and Data Scientist**, posted on the **SEEK** website over the past couple of months. Its purpose is to explore the **hottest technical and soft skills** in the job market and to provide **career path references** for people who would like to become a Data Analyst, Data Engineer, or Data Scientist.

**Programming Environment:** Python 3.8 and Jupyter Notebook

**Tools used:**

- **PyCharm:** Data Extraction, Data Transformation, Data Exploration 

- **DB Browser for SQLite3:** Database Management, Data Analysis

- **Power BI and Tableau:** Data Visualization


## Table of Contents

* [1. Introduction](#sec_1)
  * [1.1 Project Objective](#sec_1.1)
  * [1.2 Target Audience](#sec_1.2)
  * [1.3 Project Assumption](#sec_1.3)
  * [1.4 Key Insights](#sec_1.4)
* [2. Data Wrangling and Methodology](#sec_2)
  * [2.1 Data Cleaning - How to Use Regular Expression to Clean the Job Salary and Job Ad Posted Time?](#sec_2.1)
    * [2.1.1 Job Salary Data Cleaning](#sec_2.1.1)
    * [2.1.2 Job Ad Posted Time Data Cleaning](#sec_2.1.2)
  * [2.2 Dimensional Modeling Method – How to Model and Transform the Data into Job Details Fact table and Job Info Dimension Table?](#sec_2.2)
    * [2.2.1 Identifying Business Objective](#sec_2.2.1)
    * [2.2.2 Identifying Granularity](#sec_2.2.2)
    * [2.2.3 Identifying Dimensions](#sec_2.2.3)
    * [2.2.4 Identifying Facts](#sec_2.2.4)
    * [2.2.5 Building the schema](#sec_2.2.5)
  * [2.3 Data Extracting - How to Filter the Related Recruitment Data for Data Analyst, Data Engineer, Data Scientist from More Than 10,000 Job Descriptions?](#sec_2.3)
  * [2.4 Term Frequency Analysis – How to Use Word Count to Filter the Key Skills?](#sec_2.4)
  * [2.5 Document Frequency Analysis – How to Identify the Importance of Key Skills?](#sec_2.5)
  * [2.6 Requirements & Getting Started](#sec_2.6)
* [3. Exploratory Data Analysis](#sec_3)
* [4. Conclusion](#sec_4)
* [5. References](#sec_5)

## 1. Introduction <a class="anchor" id="sec_1"></a>
### 1.1 Project Objective <a class="anchor" id="sec_1.1"></a>

This project extracts, cleans, explores and analyzes job description data for three popular job titles, Data Analyst, Data Engineer and Data Scientist, posted on the SEEK website over the past couple of months. This GitHub repo contains all the code and datasets for the project. The objectives of this project are:

- **Clean dirty datetime data, then transform them for analytics.**
- **Exploratory data analysis**, e.g.
  - find the hottest technical and soft skills for different job titles -- Data Analyst, Data Engineer and Data Scientist.
  - analyze the salary differences for different job titles and job types.
  - analyze the available positions and required technical skills in different regions.
  - analyze the possible relationship between salary range and required technical skills.
- **Provide career path references for people who would like to be a Data Analyst, Data Engineer or Data Scientist.**


### 1.2 Target Audience <a class="anchor" id="sec_1.2"></a>

**People who would like to be a Data Analyst, Data Engineer, or Data Scientist.**

### 1.3 Project Assumption  <a class="anchor" id="sec_1.3"></a>

- Assuming that the SEEK website represents the current situation and trends of the job market in Australia.

- Assuming that the recruitment data on the SEEK website is accurate.

- Assuming that people put the most important information first, so the position of a skill in the job advertisement reflects its importance. In other words, the earlier a skill appears, the more important it is to the company.

### 1.4 Key Insights  <a class="anchor" id="sec_1.4"></a>

This project extracts job description data for three popular job titles, Data Analyst, Data Engineer and Data Scientist,
posted on the SEEK website over the past couple of months. The job locations cover eight major locations in Australia: Sydney,
Melbourne, Perth, Brisbane, Gold Coast, Adelaide, ACT and Hobart. Among them, there are 2224 job advertisements
for Data Analyst, 1180 job advertisements for Data Engineer, and 529 job advertisements for Data Scientist. From the
analysis results, the main conclusions can be summarized as follows:

**Top hottest required skills for Data Analyst:** **`SQL, Excel, Power BI, Tableau, Python, Cloud, Reporting and Communication Skills`**

**Top hottest required skills for Data Engineer:** **`Cloud, SQL, AWS, Python, Azure, ETL/ELT, Communication, Infrastructure`**

**Top hottest required skills for Data Scientist:** **`Artificial Intelligence, Python, Communication, Machine Learning, Data Science, Research, SQL, Statistics`**




## 2. Data Wrangling and Methodology <a class="anchor" id="sec_2"></a>

### 2.1 Data Cleaning - How to Use Regular Expression to Clean the Job Salary and Job Ad Posted Time? <a class="anchor" id="sec_2.1"></a>

In this project, we did not analyze job salary because there was not yet enough salary data. The
project focuses instead on the analysis of required skills.
Moreover, with recruitment data spanning several years, we could also analyze trends in the data job
market, for example how the hottest required skills change over time.


#### 2.1.1 Job Salary Data Cleaning <a class="anchor" id="sec_2.1.1"></a>

First, we use regular expressions to extract the salary data from the database and convert it into
floating-point numbers.
Then, based on its magnitude, each salary figure is assigned to one of the following three categories:
* If the salary is lower than 200, it is an hourly rate;
* If the salary is greater than 200 and lower than 2000, it is a daily rate;
* If the salary is greater than 2000, it is an annual salary.

In this way, the annual salary can be calculated from the full-time working hours published on the Australian
government website: 251 working days or 2008 working hours per year for a full-time position.
Finally, the data is exported to a CSV file and manually checked for errors.
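The steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names, the regular expression, and the assumption that amounts are written with a leading `$` (optionally with a `k` suffix) are all hypothetical. The text does not say how values of exactly 200 or 2000 are handled, so the sketch arbitrarily treats 200 as a daily rate and 2000 as an annual salary.

```python
import re
from typing import Optional

WORKING_DAYS_PER_YEAR = 251     # full-time figure quoted in this section
WORKING_HOURS_PER_YEAR = 2008   # full-time figure quoted in this section

# Assumed format: "$45", "$650.50", "$85,000" or "$85k".
SALARY_PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)


def extract_salary(text: str) -> Optional[float]:
    """Return the first dollar amount in the text as a float, or None."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing "k" means thousands
        value *= 1000
    return value


def annualise(salary: float) -> float:
    """Convert an hourly, daily or annual figure to an annual salary."""
    if salary < 200:        # treated as an hourly rate
        return salary * WORKING_HOURS_PER_YEAR
    if salary < 2000:       # treated as a daily rate (200 itself is an assumption)
        return salary * WORKING_DAYS_PER_YEAR
    return salary           # already an annual salary


for raw in ["$45 per hour", "$650 per day", "$85k - $95k + super", "Competitive"]:
    amount = extract_salary(raw)
    print(raw, "->", None if amount is None else annualise(amount))
```

Only the first amount in a range is used here; taking the midpoint or the upper bound would be an equally reasonable choice, which is one reason the manual check of the exported CSV remains useful.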

#### 2.1.2 Job Ad Posted Time Data Cleaning <a class="anchor" id="sec_2.1.2"></a>

The job ad posted time usually has one of the following suffixes:
* ‘m’ means the job ad was posted minutes ago;
* ‘h’ means the job ad was posted hours ago;
* ‘d’ means the job ad was posted days ago.

Based on these suffixes, we can use the Python datetime library to calculate the Coordinated Universal Time (UTC)
date on which the job ad was posted.
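A minimal sketch of this conversion is shown below. The function name `posted_date_utc`, the input format (`"3h ago"`), and the `scraped_at` parameter are assumptions made for illustration; the idea is simply to subtract the stated age from the time at which the ad was scraped.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Map each suffix described above to the matching timedelta keyword.
UNIT_TO_KWARG = {"m": "minutes", "h": "hours", "d": "days"}
POSTED_PATTERN = re.compile(r"(\d+)\s*([mhd])\b")


def posted_date_utc(posted: str, scraped_at: Optional[datetime] = None):
    """Return the UTC date on which the ad was posted, or None if unparseable.

    `scraped_at` must be timezone-aware; it defaults to the current UTC time.
    """
    scraped_at = scraped_at or datetime.now(timezone.utc)
    match = POSTED_PATTERN.search(posted.lower())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    posted_at = scraped_at - timedelta(**{UNIT_TO_KWARG[unit]: amount})
    return posted_at.astimezone(timezone.utc).date()


scraped = datetime(2022, 3, 28, 1, 0, tzinfo=timezone.utc)
for label in ["30m ago", "3h ago", "2d ago"]:
    print(label, "->", posted_date_utc(label, scraped))
```

Passing the scrape time explicitly keeps the result reproducible when the data is re-processed later than it was collected.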

### 2.2 Dimensional Modeling Method – How to Model and Transform the Data into Job Details Fact table and Job Info Dimension Table? <a class="anchor" id="sec_2.2"></a>

In this section, the dimensional data modelling (DDM) technique is applied to construct a data warehouse that
can store and retrieve data quickly for further analysis (Saxena & Agarwal, 2014). Applying DDM in this project makes the
data's behavior and domain easier to understand and allows query performance to be optimized.


#### 2.2.1 Identifying Business Objective <a class="anchor" id="sec_2.2.1"></a>

Based on the data we have collected, the business objective is to identify the skills required in various job
positions for people who would like to establish a career in the field of data. Therefore, the keywords
describing skills are extracted from the job details and analyzed.

#### 2.2.2 Identifying Granularity <a class="anchor" id="sec_2.2.2"></a>

Granularity is the lowest level of information stored in the tables of the data warehouse. The grain determines
the level of detail available for the business problem. Here, the grain of the fact table is the combination of job_id, section_id, and
line_id, which identifies each line of job details. The grain of the dimension table is job_id, which
identifies the information for each job.

#### 2.2.3 Identifying Dimensions <a class="anchor" id="sec_2.2.3"></a>

Dimensions are used for categorizing and describing facts and measures. The job information of each job can be
classified as the dimensions in this project. The dimensional table is used for storing the descriptive data and
providing the context to the fact creation. In this case, the Job Info Dimensional Table contains all the
dimensions of each job such as the job title, job company, job area, etc.

#### 2.2.4 Identifying Facts <a class="anchor" id="sec_2.2.4"></a>

Since the aim of this project is to investigate the job skills required for each position, the facts are
the job details posted in each ad. The fact table stores these records along with fields such as
section id, line id, job details, etc.

#### 2.2.5 Building the schema <a class="anchor" id="sec_2.2.5"></a>

The star schema is developed from the DDM analysis above and the Entity Relationship Diagram in Figure 2.2.

**Figure 2.2.** Entity Relationship Diagram of Job Skills Analysis
<img src="images/fact_dim_erd.png">
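As a rough illustration of the two tables described in this section, the sketch below creates them in an in-memory SQLite database. The table names (`job_info_dim`, `job_details_fact`) and column types are assumptions; only the grain columns (`job_id`, `section_id`, `line_id`) and the descriptive fields named in the text are taken from the document, and the actual schema is the one in the diagram.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS job_info_dim (
    job_id      INTEGER PRIMARY KEY,   -- grain of the dimension table
    job_title   TEXT,
    job_company TEXT,
    job_area    TEXT
);

CREATE TABLE IF NOT EXISTS job_details_fact (
    job_id      INTEGER NOT NULL REFERENCES job_info_dim (job_id),
    section_id  INTEGER NOT NULL,
    line_id     INTEGER NOT NULL,
    job_details TEXT,
    PRIMARY KEY (job_id, section_id, line_id)   -- grain of the fact table
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative rows only; they are not taken from the dataset.
conn.execute(
    "INSERT INTO job_info_dim VALUES (1, 'Data Analyst', 'Example Pty Ltd', 'Sydney')"
)
conn.execute("INSERT INTO job_details_fact VALUES (1, 1, 1, 'Strong SQL skills')")

rows = conn.execute(
    """
    SELECT d.job_title, f.line_id, f.job_details
    FROM job_details_fact AS f
    JOIN job_info_dim AS d USING (job_id)
    """
).fetchall()
print(rows)
```

Keeping one row per line of the job description is what later makes it possible to ask at which line a skill first appears.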

### 2.3 Data Extracting - How to Filter the Related Recruitment Data for Data Analyst, Data Engineer, Data Scientist from More Than 10,000 Job Descriptions? <a class="anchor" id="sec_2.3"></a>

The search algorithm on the SEEK website returns many irrelevant job titles, such as business intelligence,
accountant, etc. Therefore, before further analysis, we need to filter for the recruitment data related to Data
Analyst, Data Engineer and Data Scientist. In this project, we use the SQL LIKE and NOT LIKE operators to identify
whether a job title belongs to one of these three titles.
Finally, from `13,000` job advertisements, there are `2224` job advertisements for **Data Analyst**, `1180` job advertisements for **Data Engineer**, and `529` job advertisements for **Data Scientist**.
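The filtering idea can be sketched with a small in-memory SQLite table. The table name `job_info`, the sample titles, and the helper `jobs_for_title` are hypothetical; the exact patterns used in the project are not stated here, so the include and exclude terms below are only examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_info (job_id INTEGER PRIMARY KEY, job_title TEXT)")
conn.executemany(
    "INSERT INTO job_info VALUES (?, ?)",
    [
        (1, "Senior Data Analyst"),
        (2, "Business Intelligence Developer"),
        (3, "Data Engineer"),
        (4, "Accountant"),
        (5, "Lead Data Scientist"),
        (6, "Data Analyst / Data Engineer"),
    ],
)


def jobs_for_title(conn, include, exclude=()):
    """Return job_ids whose title matches `include` and none of `exclude`."""
    clauses = ["job_title LIKE ?"] + ["job_title NOT LIKE ?"] * len(exclude)
    sql = (
        "SELECT job_id FROM job_info WHERE "
        + " AND ".join(clauses)
        + " ORDER BY job_id"
    )
    params = [f"%{include}%"] + [f"%{term}%" for term in exclude]
    return [row[0] for row in conn.execute(sql, params)]


analysts = jobs_for_title(conn, "data analyst", exclude=("data engineer",))
engineers = jobs_for_title(conn, "data engineer")
scientists = jobs_for_title(conn, "data scientist")
print(analysts, engineers, scientists)
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, which is why lowercase patterns match the mixed-case titles; other databases may need `LOWER()` or a case-insensitive operator.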


### 2.4 Term Frequency Analysis – How to Use Word Count to Filter the Key Skills? <a class="anchor" id="sec_2.4"></a>

From the downloaded job details data, we can calculate the frequency of each meaningful word or phrase,
including technical and soft skills. A word cloud then shows which technologies are mentioned more often
than others. We can also use the SQL LIKE operator to identify which lines contain these keywords
and count the number of distinct job ads. Based on this word count and term frequency analysis, we can
single out `46 high-demand technical skills and 12 soft skills` in the job market.

__The 46 high-demand technical skills are as follows:__

SQL, Excel, Power BI, Tableau, Python, Cloud, Azure, AWS, GCP, API, Pipeline, Dimension Modelling,
ETL/ELT, DevOps, CI/CD, Spark, Java, , Scala, Oracle, Kubernetes, Docker, Apache, Kafka, Linux, Snowflake,
Data Warehouse, Data Modelling, Data Visualization, Data Migration, Data Management, Data Integration, Data
Platform, Data Architecture, Data Factory, Databricks, Data Science, Machine Learning, Computer Science,
Research, Statistic, Mathematics, Quantitative, Algorithm, Deep Learning, Statistical Analysis;

__The 12 high-demand soft skills are as follows:__

Communication Skill, Reporting, Stakeholder, Agile, Project Management, Business Intelligence, Decision
Making, Interpersonal, Time Management, Troubleshoot, Tertiary Qualification, PhD
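The word count and the keyword-based job count described in this section can be sketched in plain Python as follows. The small `STOP_WORDS` set is a stand-in for a proper stop-word list (the project lists `nltk.corpus` among its dependencies), and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Stand-in stop-word list; a fuller list would normally be used.
STOP_WORDS = {"a", "and", "are", "in", "is", "of", "the", "to", "with"}


def term_frequency(lines, n=1):
    """Count n-grams (single words by default) across all lines."""
    counts = Counter()
    for line in lines:
        tokens = [
            token
            for token in re.findall(r"[a-z][a-z+#/]*", line.lower())
            if token not in STOP_WORDS
        ]
        # Stop words are dropped before n-grams are built, so an n-gram
        # may join words that were separated by a stop word.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def job_count(job_lines, skill):
    """Number of distinct jobs with at least one line mentioning the skill."""
    needle = skill.lower()
    return len({job_id for job_id, line in job_lines if needle in line.lower()})


# (job_id, line text) pairs; illustrative only.
job_lines = [
    (1, "Strong SQL and Power BI skills"),
    (1, "Experience with SQL Server"),
    (2, "Python and SQL are required"),
    (3, "Excellent communication skills"),
]

lines = [line for _, line in job_lines]
print(term_frequency(lines).most_common(3))
print(term_frequency(lines, n=2)["power bi"])
print(job_count(job_lines, "sql"))
```

The substring test in `job_count` behaves like `LIKE '%skill%'`, so very short keywords would also match inside longer words and need stricter patterns.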


### 2.5 Document Frequency Analysis – How to Identify the Importance of Key Skills? <a class="anchor" id="sec_2.5"></a>


In this section, to further illustrate which skills are more important, we select the 6-10 most in-demand skills for the
following document frequency analysis.
As stated in the project assumptions, people tend to put the most important information first, i.e.
higher up in the job advertisement. We can therefore analyze the average line number at which
these skills appear. We can also categorize their first-occurrence line numbers into three ranges: 1~3, 4~5 and
6+. From these analyses, we can distinguish the order of importance of these skills. The top hottest
required skills for Data Analyst, Data Engineer and Data Scientist are as follows:

**Data Analyst:** `SQL, Excel, Power BI, Tableau, Python, Cloud, Reporting and Communication Skills`

**Data Engineer:** `Cloud, SQL, AWS, Python, Azure, ETL/ELT, Communication, Infrastructure`

**Data Scientist:** `Artificial Intelligence, Python, Communication, Machine Learning, Data Science, Research, SQL, Statistics`
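The line-position analysis described in this section can be sketched as below. The function names and sample rows are hypothetical, and the sketch assumes that the average is taken over each job's first occurrence of the skill; the text does not say whether later occurrences are also averaged.

```python
from collections import Counter
from statistics import mean


def first_occurrence(rows, skill):
    """Map job_id to the smallest line_id whose text mentions the skill."""
    needle = skill.lower()
    first = {}
    for job_id, line_id, text in rows:
        if needle in text.lower():
            first[job_id] = min(line_id, first.get(job_id, line_id))
    return first


def line_bucket(line_id):
    """Assign a line number to one of the three ranges used in this section."""
    if line_id <= 3:
        return "1~3"
    if line_id <= 5:
        return "4~5"
    return "6+"


def summarise(rows, skill):
    """Job count, average first line and range counts for one skill."""
    first = first_occurrence(rows, skill)
    if not first:
        return None
    return {
        "job_count": len(first),
        "average_line": mean(first.values()),
        "buckets": dict(Counter(line_bucket(line) for line in first.values())),
    }


# (job_id, line_id, line text) rows; illustrative only.
rows = [
    (1, 1, "Advanced SQL"),
    (1, 4, "SQL tuning"),
    (2, 5, "SQL and Python"),
    (3, 9, "Clear communication; some SQL"),
    (3, 2, "Python scripting"),
]

print(summarise(rows, "sql"))
print(summarise(rows, "python"))
```

A lower average line and a larger share of jobs in the 1~3 range both indicate, under the stated assumption, that employers treat the skill as more important.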

**Tools used in the project:**

- Programming Environment: Python 3.8 and Jupyter Notebook
- Data Extraction, Data Cleaning and Data Wrangling: PyCharm
- Data Wrangling and Database Management: DB Browser for SQLite3 and Elephant DB for PostgreSQL
- Data Visualization: Power BI and Tableau



### 2.6 Requirements & Getting Started <a class="anchor" id="sec_2.6"></a>


__Dependencies include:__

- json
- glob
- sqlite3
- re
- pandas
- typing 
- nltk.corpus 
- datetime 
- os
- requests
- urllib.parse
- bs4 


## 3. Exploratory Data Analysis<a class="anchor" id="sec_3"></a>

In this part, the average line number of key skills for each job title is explored to gauge skill importance, where skills that appear nearer the top are considered more significant. In addition, the number of jobs mentioning each key skill is counted for the three job titles to identify the hottest skills for each.

### 3.1	Required Skills for Data Analyst <a class="anchor" id="sec_3.1"></a>

#### 3.1.1	Term Frequency and Word Cloud for Data Analyst

From the downloaded job details, we can calculate the frequency of each meaningful word or phrase, including technical and soft skills. The resulting Word Cloud is shown in Figure 3.1.

**Figure 3.1**. Word Cloud for Data Analyst.

<img src="images/da_word_cloud.png">

#### 3.1.2 Document Frequency Analysis and Technical Skills Required to be a Data Analyst

Among Data Analyst skills, sql, management, excel, reporting and communication are the hottest. To discover more, five technical skills and five soft skills are chosen, namely **`sql, excel, power bi, tableau, python, and management, reporting, communication, design, insights.`**

**Figure 3.2**. Skills Required to be a Data Analyst.

<img src="images/da_skill_1.png">

<img src="images/da_skill_2.png">

#### 3.1.3	Ranked Line in the Job Description of Skills 

From these specific analyses, we can distinguish the order of importance of these skills. 

To discover more, five soft skills and five technical skills are chosen, namely sql, excel, power bi, tableau, python, and management, reporting, communication, design, insights.

**Figure 3.3**. Dashboards of Skills for Data Analyst.

<img src="images/da_tech.png">
<img src="images/da_soft.png">


#### 3.1.4	Average Ranked Line in the Job Description of Technical Skills 
We can also calculate the average ranked line in the job description of these skills, as shown in Figure 3.4.
In the figure, statistics and quantitative skills rank at the top, which suggests they are important in Data Analyst jobs. Technical skills such as ssas, r, sql, power bi and python appear on average between lines 3 and 4.2, while skills like aws and azure are less important than these.

**Figure 3.4**.  Average Ranked Line in the Job Description of Skills

<img src="images/da_line.png">



### 3.2	Required Skills for Data Engineer <a class="anchor" id="sec_3.2"></a>

#### 3.2.1	Term Frequency and Word Cloud for Data Engineer

From the job details, we can get the term frequency of each meaningful word or n-gram, including technical and soft skills. The resulting Word Cloud is shown in Figure 3.5.

**Figure 3.5**. Word Cloud for Data Engineer.

<img src="images/de_word_cloud.png">

#### 3.2.2 Document Frequency Analysis and Technical Skills Required to be a Data Engineer

After counting the number of jobs for each skill in Data Engineer ads, we can observe that skills like cloud, sql, aws, azure, python and communication are the hottest. To discover more, twelve soft skills and six technical skills are chosen.

**Figure 3.6**. Skills Required to be a Data Engineer.

<img src="images/de_skill_1.png">
<img src="images/de_skill_2.png">

#### 3.2.3	Ranked Line in the Job Description of Technical Skills 

The order of lines in a job description indicates skill importance: we assume that important information is placed in the first lines, so skills mentioned there are significant.


**Figure 3.7**. Dashboards of Skills for Data Engineer.

<img src="images/de_tech.png">
<img src="images/de_soft.png">


#### 3.2.4	Average Ranked Line in the Job Description of Technical Skills 
We can also calculate the average ranked line in the job description of these skills, as shown in Figure 3.8.
In the figure, data warehouse, data architecture and gcp rank at the top, which means they are significant in Data Engineer jobs. Cloud-related skills such as aws, cloud and api also rank near the top, while soft skills like communication and documentation are less significant and are placed at around line 6.

**Figure 3.8**.  Average Ranked Line in the Job Description of Skills

<img src="images/de_line1.png">
<img src="images/de_line2.png">


### 3.3	Required Skills for Data Scientist <a class="anchor" id="sec_3.3"></a>

#### 3.3.1	Term Frequency and Word Cloud for Data Scientist

From the job descriptions, we can calculate the term frequency of each meaningful word or phrase, including technical and soft skills. The resulting Word Cloud is shown in Figure 3.9.

**Figure 3.9**. Word Cloud for Data Scientist.

<img src="images/ds_word_cloud.png">

#### 3.3.2 Document Frequency Analysis and Technical Skills Required to be a Data Scientist

Among Data Scientist skills, ai appears nearly twice as often as the second skill, python. Skills like machine learning and sql also appear frequently, while skills like aws, spark, power bi and azure appear less often. To explore further, ai, python, machine learning, sql, statistics and cloud are examined as technical skills, and communication, stakeholder, agile, reporting, project management, interpersonal skill, decision making, troubleshooting and business intelligence are examined as soft skills.


**Figure 3.10**. Skills Required to be a Data Scientist.

<img src="images/ds_skill_1.png">

<img src="images/ds_skill_2.png">

#### 3.3.3	Ranked Line in the Job Description of Technical Skills 

From these specific analyses, we can distinguish the order of importance of these skills.

We can see that 244 jobs place ai in the top lines, 171 jobs place machine learning near the front, and nearly 150 jobs place python and data science at the beginning, which shows that these skills are important for Data Scientists. The soft skill communication appears most often in the later lines, less often in lines 4 to 5, and least often in the first three lines, which means most jobs regard it as less important than skills like ai, python and machine learning.


**Figure 3.11**. Dashboard of Skills for Data Scientist.

<img src="images/ds_tech.png">


#### 3.3.4	Average Ranked Line in the Job Description of Technical Skills 
We can also calculate the average ranked line in the job description of these skills, as shown in Figure 3.12.
According to the figure, quantitative, natural language processing, mathematics and machine learning are on average placed in the first three lines, while programming languages like python, sql and r rank between line 3 and line 4, which means they are less important than mathematical skills but still important. As with Data Engineer, soft skills like management and communication rank at the end.

**Figure 3.12**.  Average Ranked Line in the Job Description of Skills

<img src="images/ds_line1.png">
<img src="images/ds_line2.png">


### 3.4 Career Path<a class="anchor" id="sec_3.4"></a>

#### 3.4.1	The Necessary Skills to Become a Data Analyst, Data Engineer and Data Scientist

From the analysis elaborated above, we can conclude that the hottest required skills for Data Analyst are **SQL, Excel, Power BI, Visualization, Tableau, Python, Reporting and Communication Skills**; the hottest required skills for Data Engineer are **Cloud, SQL, AWS, Python, Azure, ETL/ELT, Communication, Infrastructure**; the top hottest required skills for Data Scientist are **Artificial Intelligence, Python, Communication, Machine Learning, Data Science, Research, SQL, Statistics**.


#### 3.4.2	Additional Skills while Switching Between Data Analyst, Data Engineer and Data Scientist

From the following schematic diagram of the career path between Data Analyst, Data Engineer and Data Scientist, we can tell that SQL, Python and communication skills are mandatory skills in the big data field.

If people want to change career from Data Analyst to Data Engineer, they must **obtain Cloud experience**, such as AWS or Azure. They will also need to obtain other necessary skills and experience such as ETL/ELT, Spark, Data Pipeline and DevOps.

If Data Analysts would like to become Data Scientists, they should preferably **have research or algorithms experience** and strong knowledge of fields such as Artificial Intelligence, Machine Learning, Data Science, Statistics and Mathematics.

If people would like to change career from Data Engineer or Data Scientist to Data Analyst, they not only need strong hands-on experience with SQL and Python, but also need to improve their **Data Visualization skills**, e.g. Power BI, Tableau and Excel, and their reporting / storytelling skills.

**Figure 3.22**. Schematic Diagram of the Career Path between Data Analyst, Data Engineer and Data Scientist

<img src="images/career_path.png">

## 4. Conclusion<a class="anchor" id="sec_4"></a>

From the analysis between Data Analyst, Data Engineer and Data Scientist elaborated above, we can conclude that **SQL, Python and communication skills are mandatory skills in the big data field**. However, these three job titles also have their own special technical tendencies depending on different responsibilities:

**Data Analysts require strong data visualization and reporting skills and domain knowledge** because they need to use data to communicate and help companies make business decisions.

**Data Engineers need to master more technical skills**, as their job is to build data pipelines and optimize systems so that data analysts and data scientists can perform their work.

**Data Scientists use statistics and machine learning algorithms** to make predictions, which requires excellent mathematics and statistics knowledge as well as research experience.

## 5. References <a class="anchor" id="sec_5"></a>

* Saxena, G., & Agarwal, B. B. (2014). Data Warehouse Designing: Dimensional Modelling and ER Modelling. International Journal of Engineering Inventions, 3(9), 28-34.
