# Career Path Analysis for Data Analysts, Data Engineers and Data Scientists on SEEK

**Date: 24/03/2022**

**Brief Introduction**:

This project extracts, cleans, explores and analyzes job description data for three popular job titles (Data Analyst, Data Engineer and Data Scientist) posted on the SEEK website over the past couple of months. Its purpose is to take a quick look at the most in-demand technical and soft skills in the job market and to provide career path references for people who would like to become a Data Analyst, Data Engineer or Data Scientist.

**Programming Environment**: Python 3.8 and Jupyter Notebook

**Tools used**:
- Data Extraction, Data Cleaning and Data Wrangling: PyCharm
- Data Wrangling and Database Management: DB Browser for SQLite3 and Elephant DB for PostgreSQL
- Data Visualization: Power BI and Tableau


## Table of Contents

[1.Introduction](#sec_1)

- [1.1 Project Objective](#sec_1.1)
- [1.2 Target Audience](#sec_1.2)
- [1.3 Project Assumption](#sec_1.3)
- [1.4 Key Insights](#sec_1.4)

[2.Data Wrangling and Methodology](#sec_2)

- [2.1 Data Cleaning](#sec_2.1)
- [2.2 Dimensional Modeling Method](#sec_2.2)
- [2.3 Data Extracting](#sec_2.3)
- [2.4 Term Frequency Analysis](#sec_2.4)
- [2.5 Document Frequency Analysis](#sec_2.5)
- [2.6 Requirements & Getting Started](#sec_2.6)

[3.Exploratory Data Analysis](#sec_3)

- [3.1 Required Skills for Data Analyst](#sec_3.1)
- [3.2 Required Skills for Data Engineer](#sec_3.2)
- [3.3 Required Skills for Data Scientist](#sec_3.3)
- [3.4 Career Path](#sec_3.4)

[4.Conclusion](#sec_4)

[5.References](#sec_5)


## 1. Introduction <a class="anchor" id="sec_1"></a>
### 1.1 Project Objective <a class="anchor" id="sec_1.1"></a>
This project extracts, cleans, explores and analyzes job description data for three popular job titles (Data Analyst, Data Engineer and Data Scientist) posted on the SEEK website over the past couple of months. This GitHub repo contains all the code and datasets for the project. The objectives are:

- **Clean dirty salary and datetime data, then transform them for analytics**.

- **Exploratory data analysis**, e.g. 
    - ✨find the hottest technical and soft skills for different job titles -- Data Analyst, Data Engineer and Data Scientist.
    - ✨analyze the salary differences for different job titles and job types.
    - ✨analyze the available positions and required technical skills in different regions.
    - ✨analyze the possible relationship between salary range and required technical skills.

- **Provide career path references for people who would like to be a Data Analyst, Data Engineer or Data Scientist**.

### 1.2 Target Audience <a class="anchor" id="sec_1.2"></a>
#### People who would like to be a Data Analyst, Data Engineer, or Data Scientist.
### 1.3 Project Assumption <a class="anchor" id="sec_1.3"></a>
- Assume that the SEEK website represents the current state and trends of the job market in Australia.
- Assume that the recruitment data on the SEEK website is accurate.
- Assume that employers put the most important information first, so the earlier a skill appears in a job advertisement, the more important it is to the company.

### 1.4 Key Insights <a class="anchor" id="sec_1.4"></a>
The dataset covers job descriptions for three popular job titles (Data Analyst, Data Engineer and Data Scientist) posted on the SEEK website over the past couple of months. The job locations cover eight major locations in Australia: Sydney, Melbourne, Perth, Brisbane, Gold Coast, Adelaide, ACT and Hobart. There are 2224 job advertisements for Data Analyst, 1180 for Data Engineer and 529 for Data Scientist. The main conclusions are as follows:
#### Top hottest required skills for Data Analyst: `SQL, Excel, Power BI, Tableau, Python, Data Visualization, Reporting and Communication Skills`
#### Top hottest required skills for Data Engineer: `Cloud, SQL, AWS, Python, Azure, Data Pipeline, ETL/ELT, Communication Skills`
#### Top hottest required skills for Data Scientist: `Python, Communication Skills, Machine Learning, Data Science, Statistics, Research, Computer Science, SQL`

## 2.	Data Wrangling and Methodology <a class="anchor" id="sec_2"></a>
### 2.1 Data Cleaning - How to Use Regular Expression to Clean the Job Salary and Job Ad Posted Time? <a class="anchor" id="sec_2.1"></a>
The salary and posted-time fields are cleaned as described below, but this project does not analyze job salary, because there is not yet enough salary data. The focus is instead on the required skills. With several years of recruitment data, we could also analyze trends in the data job market, for example how the most in-demand skills change over time.
#### 2.1.1	Job Salary Data Cleaning
First, we use regular expressions to extract the salary figures from the database and convert them to floating-point numbers.
Then, based on its magnitude, each salary figure is assigned to one of three categories:
- If the salary is lower than 200, it is an hourly rate;
- If the salary is greater than 200 and lower than 2000, it is a daily rate;
- If the salary is greater than 2000, it is an annual salary.

The annual salary can then be calculated from the full-time working time published on the Australian government website: 251 working days, or 2008 working hours, per year for a full-time position.
Finally, the data is written to a CSV file, which needs to be checked manually for errors.
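The categorisation above can be sketched as a small function. This is a minimal illustration rather than the project's actual code: the name `annualise` is hypothetical, and because the rules leave the boundary values 200 and 2000 unassigned, the sketch assumes 200 counts as a daily rate and 2000 as an annual salary.

```python
from typing import Optional

HOURS_PER_YEAR = 2008  # full-time working hours per year, as stated above
DAYS_PER_YEAR = 251    # full-time working days per year, as stated above


def annualise(amount: Optional[float]) -> Optional[float]:
    """Convert a scraped salary figure to an annual amount (hypothetical helper)."""
    if amount is None:
        return None
    if amount < 200:       # treated as an hourly rate
        return amount * HOURS_PER_YEAR
    if amount < 2000:      # treated as a daily rate (assumption: 200 itself is daily)
        return amount * DAYS_PER_YEAR
    return amount          # already an annual salary


print(annualise(50))     # 100400
print(annualise(600))    # 150600
print(annualise(95000))  # 95000
```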

#### 2.1.2	Job Ad Posted Time Data Cleaning
The job ad posted time usually carries one of the following suffixes:
- ‘m’ means the job ad was posted minutes ago;
- ‘h’ means the job ad was posted hours ago;
- ‘d’ means the job ad was posted days ago.

Based on these suffixes, we can use the Python datetime library to calculate the posting time of each job ad in Coordinated Universal Time (UTC).
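As a rough sketch of this conversion, the function below subtracts the parsed offset from a reference time. The label format (`'3d ago'`) and the name `posted_at` are assumptions made for illustration; the reference time is passed in explicitly so that the result does not depend on when the code runs.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Map each suffix to the matching timedelta keyword argument.
UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def posted_at(label: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Turn a relative label such as '3d ago' into a UTC timestamp."""
    match = re.match(r"\s*(\d+)\s*([dhm])", label)
    if match is None:
        return None  # label does not start with a number and a known suffix
    now = now or datetime.now(timezone.utc)
    amount, suffix = int(match.group(1)), match.group(2)
    return now - timedelta(**{UNITS[suffix]: amount})


reference = datetime(2022, 3, 24, 12, 0, tzinfo=timezone.utc)
print(posted_at("3d ago", reference))  # 2022-03-21 12:00:00+00:00
print(posted_at("5h ago", reference))  # 2022-03-24 07:00:00+00:00
```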

### 2.2	Dimensional Modeling Method – How to Model and Transform the Data into Job Details Fact table and Job Info Dimension Table? <a class="anchor" id="sec_2.2"></a>
In this section, dimensional data modelling (DDM) is applied to build a data warehouse that can store and retrieve data quickly for further analysis. Applying DDM makes the behavior and domain of the data easier to understand and allows query performance to be optimized.
#### 2.2.1	Identifying Business Objective
Based on the data collected, the business objective is to identify the skills required for various job positions, for people who would like to establish a career in the data field. Therefore, keywords relating to skills are extracted from the job details and analyzed.
#### 2.2.2	 Identifying Granularity
Granularity is the lowest level of information stored in the tables of the data warehouse; the grain determines the level of detail available for the business problem. The grain of the fact table is the combination of job_id, section_id and line_id, which identifies each line of job details. The grain of the dimension table is job_id, which identifies the information for each job.
#### 2.2.3	 Identifying Dimensions
Dimensions are used for categorizing and describing facts and measures. The job information of each job can be classified as the dimensions in this project. The dimensional table is used for storing the descriptive data and providing the context to the fact creation. In this case, the Job Info Dimensional Table contains all the dimensions of each job such as the job title, job company, job area, etc. 
#### 2.2.4	 Identifying Facts
Since the aim of this project is to investigate the skills required for each position, each fact is one line of the job details as posted. The fact table stores these lines together with fields such as section id, line id and job details.
#### 2.2.5	 Building the schema
The star schema is built from the dimensional modelling analysis above and the Entity Relationship Diagram below.

**Figure 2.1.** Entity Relationship Diagram of Skills Gap Analysis
<img src="images/01.png">
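A minimal sketch of such a schema in SQLite is shown below. The table names `job_info` and `job_details` and the sample rows are illustrative assumptions; the key columns (job_id, section_id, line_id) and the descriptive columns follow the grain and dimensions described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_info (              -- dimension: one row per job ad
    job_id      INTEGER PRIMARY KEY,
    job_title   TEXT,
    job_company TEXT,
    job_area    TEXT
);
CREATE TABLE job_details (           -- fact: one row per line of a job ad
    job_id     INTEGER REFERENCES job_info(job_id),
    section_id INTEGER,
    line_id    INTEGER,
    detail     TEXT,
    PRIMARY KEY (job_id, section_id, line_id)
);
""")

# Hypothetical sample data, for illustration only.
conn.execute("INSERT INTO job_info VALUES (1, 'Data Analyst', 'Example Pty Ltd', 'Sydney')")
conn.executemany(
    "INSERT INTO job_details VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "Strong SQL skills"), (1, 1, 2, "Experience with Power BI")],
)

# Join each fact row back to its dimension row.
rows = conn.execute("""
    SELECT i.job_title, d.line_id, d.detail
    FROM job_details AS d JOIN job_info AS i USING (job_id)
    ORDER BY d.line_id
""").fetchall()
print(rows)
```

Keeping the descriptive attributes in one dimension table means a job title or area is stored once per ad rather than once per line of text.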

### 2.3	Data Extracting - How to Filter the Related Recruitment Data for Data Analyst, Data Engineer, Data Scientist from More Than 10,000 Job Descriptions? <a class="anchor" id="sec_2.3"></a>
The SEEK search algorithm returns many irrelevant job titles, such as business intelligence, accountant, etc. Therefore, before further analysis, we need to keep only the recruitment data related to Data Analyst, Data Engineer and Data Scientist. In this project, we use the SQL `LIKE` and `NOT LIKE` operators to decide whether a job title belongs to one of these three titles.

Finally, from `13,000` job advertisements, there are `2224` job advertisements for **Data Analyst**, `1180` job advertisements for **Data Engineer**, and `529` job advertisements for **Data Scientist**. 
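The title filtering described above might look like the following sketch. The table name, the sample titles and the decision to exclude hybrid titles are assumptions for illustration; the exact patterns used in the project are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_info (job_id INTEGER PRIMARY KEY, job_title TEXT)")
conn.executemany(
    "INSERT INTO job_info VALUES (?, ?)",
    [
        (1, "Senior Data Analyst"),
        (2, "Business Intelligence Developer"),
        (3, "Data Engineer (AWS)"),
        (4, "Accountant"),
        (5, "Data Analyst / Data Engineer"),
    ],
)

# Keep titles that mention the target role and drop titles that also name another role.
# SQLite's LIKE is case-insensitive for ASCII characters by default.
analyst_titles = [
    title
    for (title,) in conn.execute(
        """
        SELECT job_title FROM job_info
        WHERE job_title LIKE '%data analyst%'
          AND job_title NOT LIKE '%data engineer%'
        ORDER BY job_id
        """
    )
]
print(analyst_titles)  # ['Senior Data Analyst']
```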

### 2.4	Term Frequency Analysis – How to Use Word Count to Filter the Key Skills? <a class="anchor" id="sec_2.4"></a>
From the downloaded job details, we can calculate how frequently each meaningful word or phrase appears, including technical and soft skills. A word cloud then shows which technologies are more important than others. We can also use the SQL LIKE operator to identify which statements contain these keywords and count the number of distinct job ads. From this word count and term frequency analysis, we shortlist `46 high-demand technical skills and 12 soft skills` in the job market.

**The 46 high-demand technical skills are as follows** (ETL and ELT are counted separately):

SQL, Excel, Power BI, Tableau, Python, Cloud, Azure, AWS, GCP, API, Pipeline, Dimension Modelling, ETL/ELT, DevOps, CI/CD, Spark, Java, Scala, Oracle, Kubernetes, Docker, Apache, Kafka, Linux, Snowflake, Data Warehouse, Data Modelling, Data Visualization, Data Migration, Data Management, Data Integration, Data Platform, Data Architecture, Data Factory, Databricks, Data Science, Machine Learning, Computer Science, Research, Statistic, Mathematics, Quantitative, Algorithm, Deep Learning, Statistical Analysis.

**The 12 high-demand soft skills are as follows**:

Communication Skill, Reporting, Stakeholder, Agile, Project Management, Business Intelligence, Decision Making, Interpersonal, Time Management, Troubleshoot, Tertiary Qualification, PhD
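The word and phrase counting behind this shortlist can be sketched with the standard library alone. The stop-word set and sample sentences below are tiny illustrative assumptions (a fuller list such as NLTK's would be used on real data), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"and", "in", "with", "of", "the", "a", "to"}


def tokens(line: str, gram: int = 1) -> List[str]:
    """Split a line into lowercase words (gram=1) or overlapping phrases (gram>1)."""
    words = re.findall(r"[a-zA-Z]+", line.lower())
    if gram == 1:
        return [w for w in words if w not in STOP_WORDS]
    return [" ".join(words[i:i + gram]) for i in range(len(words) - gram + 1)]


def term_frequency(lines: Iterable[str], gram: int = 1) -> Counter:
    """Count how often each word or phrase occurs across all lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokens(line, gram))
    return counts


sample = [
    "Strong SQL and Python skills",
    "Experience with machine learning in Python",
    "Machine learning and SQL",
]
print(term_frequency(sample)["python"])                    # 2
print(term_frequency(sample, gram=2)["machine learning"])  # 2
```

Counting two-word phrases as well as single words is what lets multi-word skills such as "machine learning" surface alongside single-word ones such as "SQL".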

### 2.5 Document Frequency Analysis – How to Identify the Importance of Key Skills?<a class="anchor" id="sec_2.5"></a>
In this section, to further illustrate which skills are more important, we select the 8 most in-demand skills for each job title for the following document frequency analysis.

As stated in the assumptions (Section 1.3), employers tend to put the most important information first, so a skill that appears higher up in a job advertisement is treated as more important. We can therefore analyze the average line number at which these skills appear. We can also group the line number of each skill's first occurrence into three ranges: 1 ~ 3, 4 ~ 5 and 6+. From these analyses, we can rank the skills by importance. The most in-demand skills for Data Analyst, Data Engineer and Data Scientist are as follows:

**Data Analyst**: `SQL, Excel, Power BI, Tableau, Python, Visualization, Reporting and Communication Skills`; 

**Data Engineer**: `Cloud, SQL, AWS, Python, Azure, Data Pipeline, ETL/ELT, Communication Skills`; 

**Data Scientist**: `Python, Communication Skills, Machine Learning, Computer Science, Data Science, Research, SQL, Statistics`.
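The line-position analysis described above can be sketched as follows. The sample rows are hypothetical, and the sketch assumes that "first occurrence" is taken per job ad; the helper names are illustrative only.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical (job_id, line_id, detail) rows.
rows = [
    (1, 1, "Strong SQL skills"),
    (1, 2, "Python and SQL experience"),
    (2, 4, "Exposure to SQL"),
    (2, 7, "Python scripting"),
    (3, 6, "Python for automation"),
]


def bucket(line_id: int) -> str:
    """Assign a line number to one of the three ranges used in the analysis."""
    if line_id <= 3:
        return "1-3"
    if line_id <= 5:
        return "4-5"
    return "6+"


def first_lines(rows: List[Tuple[int, int, str]], skill: str) -> Dict[int, int]:
    """Map each job_id to the first line that mentions the skill."""
    first: Dict[int, int] = {}
    for job_id, line_id, detail in rows:
        if skill.lower() in detail.lower():
            first[job_id] = min(line_id, first.get(job_id, line_id))
    return first


def summarise(rows: List[Tuple[int, int, str]], skill: str):
    """Return the bucket counts and the average first-occurrence line for a skill."""
    first = first_lines(rows, skill)
    return dict(Counter(bucket(n) for n in first.values())), mean(first.values())


print(summarise(rows, "SQL"))     # ({'1-3': 1, '4-5': 1}, 2.5)
print(summarise(rows, "Python"))
```

A lower average line number and a larger share in the 1-3 range both point to a skill that employers list early.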

### 2.6	Requirements & Getting Started <a class="anchor" id="sec_2.6"></a>
**Dependencies include**:
- json
- glob
- sqlite3
- re
- pandas 
- typing
- nltk.corpus
- datetime
- os
- requests
- urllib.parse
- bs4 

## 3. Exploratory Data Analysis<a class="anchor" id="sec_3"></a>

In this part, the average line number of the key skills in each job title is explored to assess skill importance, where skills that appear nearer the top are considered more significant. In addition, the number of job ads mentioning each key skill is counted for the three job titles, in order to identify the most in-demand skills for each.
### 3.1	Required Skills for Data Analyst <a class="anchor" id="sec_3.1"></a>
#### 3.1.1	Term Frequency and Word Cloud for Data Analyst
From the downloaded job details, we calculate how frequently each meaningful word or phrase appears, including technical and soft skills. We then draw the word cloud shown in Figure 3.1.

**Figure 3.1**. Word Cloud for Data Analyst.
<img src="images/02.png">
#### 3.1.2 Document Frequency Analysis and Technical Skills Required to be a Data Analyst
The word cloud above shows which technologies are more important than others. We then use the SQL LIKE operator to identify which statements contain these keywords and count the number of distinct job ads. This gives the percentage of all Data Analyst JDs that require each technical and soft skill, as shown in Figure 3.2 and Figure 3.3. From the figures, we can conclude that the top 8 most in-demand skills for Data Analyst are **`SQL, Excel, Power BI, Tableau, Python, Data Visualization, Reporting and Communication Skills`**. Each of these skills is required by close to or more than 20 percent of the JDs.

Other important technical skills include `Statistic, Data Warehouse, Cloud, Research, Data Modeling, Deep Learning, Azure, ETL/ELT, Computer Science, AWS`, etc. These skills also matter for becoming a professional Data Analyst. Compared with Data Scientists and Data Engineers, Data Analysts need particularly strong Data Visualization and Reporting skills.
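The percentage of job ads mentioning a skill, as described above, could be computed along the following lines. The table layout and sample rows are assumptions for illustration, and `share_of_ads` is a hypothetical helper rather than the project's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_details (job_id INTEGER, line_id INTEGER, detail TEXT)")
conn.executemany(
    "INSERT INTO job_details VALUES (?, ?, ?)",
    [
        (1, 1, "Advanced SQL and Excel"),
        (1, 2, "SQL reporting"),
        (2, 1, "Tableau dashboards"),
        (3, 1, "Excel modelling"),
        (4, 1, "Stakeholder engagement"),
    ],
)


def share_of_ads(conn: sqlite3.Connection, keyword: str) -> float:
    """Percentage of distinct job ads with at least one line mentioning the keyword."""
    total = conn.execute("SELECT COUNT(DISTINCT job_id) FROM job_details").fetchone()[0]
    hits = conn.execute(
        "SELECT COUNT(DISTINCT job_id) FROM job_details WHERE detail LIKE ?",
        (f"%{keyword}%",),
    ).fetchone()[0]
    return 100.0 * hits / total


for keyword in ("SQL", "Excel", "Tableau"):
    print(keyword, share_of_ads(conn, keyword))
```

Counting distinct `job_id` values rather than matching lines ensures an ad that mentions a skill several times is counted only once.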

**Figure 3.2**. Technical Skills Required to be a Data Analyst.
<img src="images/03.png">
<img src="images/04.png">

**Figure 3.3**. Soft Skills Required to be a Data Analyst.
<img src="images/05.png">
#### 3.1.3	Ranked Line in the Job Description of Technical Skills 
To support these conclusions, we again assume that employers put the most important information first, so a skill that appears higher up in the job advertisement is more important. We look at the line numbers at which these skills appear and group the line number of each skill's first occurrence into three ranges: 1 ~ 3, 4 ~ 5 and 6+, as shown in Figure 3.4 and Figure 3.5.

From the figures, we can see that **`SQL appears 783 times in the first to third lines`**, 219 times in the fourth and fifth lines, and 131 times in later lines. **`Excel appears 510 times in the first to third lines`**, 272 times in the fourth and fifth lines, and 252 times in later lines. The more often a skill appears in the early lines, the more important it is relative to other skills. From these analyses, we can rank the skills by importance.

**Data Analyst: SQL, Excel, Power BI, Tableau, Python, Data Visualization, Reporting and Communication Skills.**
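To make the comparison concrete, the counts quoted above can be turned into the share of mentions that fall in the first three lines. This is only a worked calculation on the reported numbers, not part of the original analysis code.

```python
# First-occurrence counts quoted above: (lines 1-3, lines 4-5, lines 6+)
counts = {
    "SQL": (783, 219, 131),
    "Excel": (510, 272, 252),
}

shares = {}
for skill, (top, middle, rest) in counts.items():
    total = top + middle + rest
    shares[skill] = top / total
    print(f"{skill}: {top}/{total} = {shares[skill]:.1%} of mentions in lines 1-3")
```

On these figures, roughly 69% of SQL mentions and 49% of Excel mentions fall in the first three lines, which is consistent with SQL being ranked ahead of Excel.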

**Figure 3.4**.  Ranked Line in the Job Description of Technical Skills for Data Analyst.
<img src="images/06.png">
<img src="images/07.png">

**Figure 3.5**. Ranked Line in the Job Description of Soft Skills for Data Analyst.
<img src="images/08.png">
#### 3.1.4	Average Ranked Line in the Job Description of Technical Skills 
We can also calculate the average ranked line in the job description of these skills, as shown in Figure 3.6. 

**Figure 3.6**.  Average Ranked Line in the Job Description of Technical Skills
<img src="images/09.png">

### 3.2	Required Skills for Data Engineer<a class="anchor" id="sec_3.2"></a>
#### 3.2.1	Term Frequency and Word Cloud for Data Engineer
Using the same method, we calculate how frequently each meaningful word or phrase appears in the Data Engineer JDs. We then draw the word cloud shown in Figure 3.7.

**Figure 3.7**. Word Cloud for Data Engineer
<img src="images/10.png">
#### 3.2.2 Document Frequency Analysis and Technical Skills Required to be a Data Engineer
The word cloud above shows which technologies are more important than others. We then use the SQL LIKE operator to identify which statements contain these keywords and count the number of distinct job ads. This gives the percentage of all Data Engineer JDs that require each technical and soft skill, as shown in Figure 3.8 and Figure 3.9.

From the figures, we can conclude that the top 8 most in-demand skills for Data Engineer are **`Cloud, SQL, AWS, Python, Azure, Data Pipeline, ETL/ELT, Communication Skills`**. Each of these skills is required by close to or more than 30 percent of the JDs. Other important technical skills include `Data Warehouse, Excel, DevOps, Scala, CI/CD, Data Modeling, Deep Learning, Data Architecture, Spark, Java`, etc. These skills also matter for becoming a professional Data Engineer. Data Engineers need to master more skills than Data Analysts and Data Scientists.

**Figure 3.8**. Technical Skills Required to be a Data Engineer
<img src="images/11.png">
<img src="images/12.png">

**Figure 3.9**. Soft Skills Required to be a Data Engineer
<img src="images/13.png">
#### 3.2.3	Ranked Line in the Job Description of Technical Skills 
We then analyze the line numbers at which these skills appear, grouping the line number of each skill's first occurrence into three ranges: 1 ~ 3, 4 ~ 5 and 6+, as shown in Figure 3.10 and Figure 3.11.

From the figures, we can see that `Cloud appears 501 times in the first to third lines`, 79 times in the fourth and fifth lines, and 66 times in later lines. The more often a skill appears in the early lines, the more important it is relative to other skills, which lets us rank the skills by importance. From Figure 3.10, we can conclude that the most in-demand skills are **Cloud, SQL, AWS, Python, Azure, Data Pipeline, ETL/ELT, Communication Skills**. Other important technical skills include `Excel, DevOps, Scala, CI/CD, Data Modeling, Data Architecture, Data Warehouse, Spark, Data Ingestion, API, Java`, etc.

**Figure 3.10**.  Ranked Line in the Job Description of Technical Skills for Data Engineer
<img src="images/14.png">
<img src="images/15.png">

**Figure 3.11**. Ranked Line in the Job Description of Soft Skills for Data Engineer
<img src="images/16.png">
#### 3.2.4	Average Ranked Line in the Job Description of Technical Skills 
Then we can calculate the average ranked line in the job description of these skills, as shown in Figure 3.12. 

**Figure 3.12**. Average Ranked Line in the Job Description of Technical Skills
<img src="images/17.png">

### 3.3	Required Skills for Data Scientist<a class="anchor" id="sec_3.3"></a>
#### 3.3.1	Term Frequency and Word Cloud for Data Scientist
Using the same method, we calculate how frequently each meaningful word or phrase appears in the Data Scientist JDs. We then draw the word cloud shown in Figure 3.13.

**Figure 3.13**.  Word Cloud for Data Scientist
<img src="images/18.png">

#### 3.3.2 Document Frequency Analysis and Technical Skills Required to be a Data Scientist
Based on the word cloud above, we use the SQL LIKE operator to identify which statements contain these keywords and count the number of distinct job ads. This gives the percentage of all Data Scientist JDs that require each technical and soft skill, as shown in Figure 3.14 and Figure 3.15.

From the figures, we can conclude that the top 8 most in-demand skills for Data Scientist are `Python, Communication Skills, Machine Learning, Data Science, Statistics, Research, Computer Science, SQL`. Each of these skills is required by close to or more than 30 percent of the JDs. Other important technical skills include `Excel, Mathematics, Cloud, Algorithms, Quantitative, Data Modeling, Deep Learning, AWS, Spark`, etc. These skills also matter for becoming a professional Data Scientist. Compared with Data Analysts and Data Engineers, Data Scientists need a stronger understanding of machine learning and its mathematical foundations.

**Figure 3.14**. Technical Skills Required to be a Data Scientist
<img src="images/19.png">
<img src="images/20.png">

**Figure 3.15**. Soft Skills Required to be a Data Scientist
<img src="images/21.png">

#### 3.3.3	Ranked Line in the Job Description of Technical Skills 
We then analyze the line numbers at which these skills appear, grouping the line number of each skill's first occurrence into three ranges: 1 ~ 3, 4 ~ 5 and 6+, as shown in Figure 3.16 and Figure 3.17.

From the figures, we can see that `Python appears 156 times in the first to third lines`, 45 times in the fourth and fifth lines, and 29 times in later lines. The more often a skill appears in the early lines, the more important it is relative to other skills, which lets us rank the skills by importance. From Figure 3.14, we can conclude that the most in-demand skills are `Python, Communication Skills, Machine Learning, Data Science, Statistics, Research, SQL, Computer Science`. Other important technical skills include `Excel, Mathematics, Cloud, Algorithms, Quantitative, Data Modeling, AWS, Spark`, etc.

**Figure 3.16**.  Ranked Line in the Job Description of Technical Skills for Data Scientist
<img src="images/22.png">
<img src="images/23.png">

**Figure 3.17**. Ranked Line in the Job Description of Soft Skills for Data Scientist
<img src="images/24.png">

#### 3.3.4	Average Ranked Line in the Job Description of Technical Skills 
We can also calculate the average ranked line in the job description of these skills, as shown in Figure 3.18.

**Figure 3.18**.  Average Ranked Line in the Job Description of Technical Skills
<img src="images/25.png">

### 3.4 Career Path<a class="anchor" id="sec_3.4"></a>
#### 3.4.1	The Necessary Skills to Become a Data Analyst, Data Engineer and Data Scientist
From the analysis above, we can conclude that the most in-demand skills for Data Analyst are **SQL, Excel, Power BI, Tableau, Python, Data Visualization, Reporting and Communication Skills**; for Data Engineer they are **Cloud, SQL, AWS, Python, Azure, Data Pipeline, ETL/ELT, Communication Skills**; and for Data Scientist they are **Python, Communication Skills, Machine Learning, Data Science, Statistics, Research, Computer Science, SQL**.
The dashboards for the skills needed to be a Data Analyst, Data Engineer and Data Scientist are shown in Figures 3.19 to 3.21.

**Figure 3.19**.  Dashboard for the Necessary Skills to be a Data Analyst
<img src="images/26.png">

**Figure 3.20**.  Dashboard for the Necessary Skills to be a Data Engineer
<img src="images/27.png">

**Figure 3.21**.  Dashboard for Necessary Skills to be a Data Scientist
<img src="images/28.png">

#### 3.4.2	Additional Skills while Switching Between Data Analyst, Data Engineer and Data Scientist
From the schematic diagram of the career paths between Data Analyst, Data Engineer and Data Scientist (Figure 3.22), we can tell that SQL, Python and communication skills are mandatory in the big data field.

People who want to change career from Data Analyst to Data Engineer must gain **Cloud experience, such as AWS or Azure**. They will also need other skills and experience such as **ETL/ELT, Spark, Data Pipeline, DevOps**.

Data Analysts who would like to become Data Scientists should ideally have **research or algorithms experience** and strong knowledge of fields such as **Machine Learning, Data Science, Statistics, Mathematics**.

People who would like to change career from Data Engineer or Data Scientist to Data Analyst not only need **strong hands-on experience with SQL and Python**, but also need to improve their **Data Visualization skills, e.g. Power BI, Tableau and Excel, and their Reporting / storytelling skills**.

**Figure 3.22**. Schematic Diagram of the Career Path between Data Analyst, Data Engineer and Data Scientist
<img src="images/29.png">
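The skill gaps between roles can also be derived directly from the top-8 lists reported above, using set operations. This is an illustrative sketch; `TOP_SKILLS` and `skills_to_add` are hypothetical names, and the secondary skills mentioned in the text (such as Spark or DevOps) are not included because they are outside the top-8 lists.

```python
from typing import List

# Top-8 skills per role, as reported in the analysis above.
TOP_SKILLS = {
    "Data Analyst": {"SQL", "Excel", "Power BI", "Tableau", "Python",
                     "Data Visualization", "Reporting", "Communication Skills"},
    "Data Engineer": {"Cloud", "SQL", "AWS", "Python", "Azure",
                      "Data Pipeline", "ETL/ELT", "Communication Skills"},
    "Data Scientist": {"Python", "Communication Skills", "Machine Learning",
                       "Data Science", "Statistics", "Research",
                       "Computer Science", "SQL"},
}


def skills_to_add(current: str, target: str) -> List[str]:
    """Top skills of the target role that are not in the current role's list."""
    return sorted(TOP_SKILLS[target] - TOP_SKILLS[current])


# Skills shared by all three roles.
common = sorted(set.intersection(*TOP_SKILLS.values()))
print(common)  # ['Communication Skills', 'Python', 'SQL']
print(skills_to_add("Data Analyst", "Data Engineer"))
# ['AWS', 'Azure', 'Cloud', 'Data Pipeline', 'ETL/ELT']
```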

## 4.	Conclusion<a class="anchor" id="sec_4"></a>
From the comparison of Data Analyst, Data Engineer and Data Scientist above, we can conclude that **SQL, Python and communication skills are mandatory skills in the big data field**. However, each of the three job titles also has its own technical emphasis, depending on its responsibilities:

**Data Analysts need strong data visualization and reporting skills and domain knowledge** because they use data to communicate and to help companies make business decisions.

**Data Engineers need to master more technical skills**, as their job is to build data pipelines and optimize the systems that allow Data Analysts and Data Scientists to do their work.

**Data Scientists** use **statistics and machine learning algorithms** to make predictions, which requires excellent knowledge of mathematics and statistics as well as research experience.

## 5.	References<a class="anchor" id="sec_5"></a>

In [27]:
# Salary data cleaning (Section 2.1.1)
import re

import pandas as pd


job_details = pd.read_csv('seek_data_analyst_job_extra_info.csv')

# job_salary data cleaning
salary_up = []
salary_down = []
missing = float('nan')  # keeps both columns numeric so they can be compared and multiplied
for raw_salary in job_details['job_salary']:
    # Use a regular expression to extract the salary figures, then convert them to floats.
    salary = re.findall(r'[\$-]([\d, kK]+)', str(raw_salary))
    salary = [s.replace(',', '').replace(' ', '') for s in salary]
    salary = [s.replace('k', '000').replace('K', '000') for s in salary]
    salary = [s for s in salary if s != '']

    if len(salary) == 1 and salary[0].isdigit():
        salary_down.append(float(salary[0]))
        salary_up.append(missing)
    elif len(salary) == 2:
        salary_down.append(float(salary[0]))
        salary_up.append(float(salary[1]))
    else:
        salary_down.append(missing)
        salary_up.append(missing)

job_details['job_salary_down'] = salary_down
job_details['job_salary_up'] = salary_up

# The lower bound decides the pay period:
#   below 200            -> hourly rate, multiplied by 2008 working hours per year
#   between 200 and 2000 -> daily rate, multiplied by 251 working days per year
#   above 2000           -> already an annual salary
hourly = job_details['job_salary_down'] < 200
daily = (job_details['job_salary_down'] > 200) & (job_details['job_salary_down'] < 2000)

for bound in ('down', 'up'):
    source, target = f'job_salary_{bound}', f'annual_salary_{bound}'
    job_details[target] = job_details[source]
    # .loc assigns in place; chained indexing may write to a temporary copy
    job_details.loc[hourly, target] = job_details.loc[hourly, source] * 2008
    job_details.loc[daily, target] = job_details.loc[daily, source] * 251

# Finally, write the data to a CSV file so it can be checked manually for errors.
job_details.to_csv('seek_data_analyst_job_extra_info_cleaned.csv', index=False)

In [29]:
# Job ad posted time cleaning (Section 2.1.2)
import re
from datetime import datetime, timedelta, timezone

import pandas as pd


def getdate(days_ago: int, hours_ago: int, minutes_ago: int) -> str:
    """Return the UTC timestamp that lies the given offset before now."""
    # Offsets are relative to the time this script runs, not the time of scraping.
    time_now = datetime.now(timezone.utc)
    time_diff = timedelta(days=days_ago, hours=hours_ago, minutes=minutes_ago)
    return (time_now - time_diff).strftime('%Y-%m-%d %H:%M:%S')


job_details = pd.read_csv('seek_data_analyst_job_extra_info_cleaned.csv')
job_posted_time = []

for dates in job_details['job_listing_date']:
    # Match a leading number followed by a d/h/m suffix; anything else (including NaN) becomes ''.
    match = re.match(r'\s*(\d+)\s*([dhm])', str(dates))
    if match is None:
        job_posted_time.append('')
        continue
    amount, suffix = int(match.group(1)), match.group(2)
    if suffix == 'd':
        job_posted_time.append(getdate(amount, 0, 0))
    elif suffix == 'h':
        job_posted_time.append(getdate(0, amount, 0))
    else:
        job_posted_time.append(getdate(0, 0, amount))

print(job_posted_time)
job_details['job_posted_time'] = job_posted_time
job_details.to_csv('seek_data_analyst_job_extra_info_cleaned2.csv', index=False)

In [34]:
# @hidden
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
import re
import pandas as pd
from typing import Tuple, List, Iterable, Any
import nltk
from nltk.corpus import stopwords


keywords = ['SQL', 'Excel', 'PowerBI', 'Tableau', 'Python', 'Cloud'
        , 'Azure', 'AWS', 'GCP', 'API', 'Pipeline', 'Dimension Modelling', 'ETL', 'ELT'
        , 'DevOps', 'CI CD', 'Spark', 'Java', 'Scala', 'Oracle', 'Kubernetes', 'Docker'
        , 'Apache', 'Kafka', 'Linux', 'Snowflake', 'Data Warehouse'
        , 'Data Modelling', 'Data Visualisation', 'Data Migration', 'Data Management'
        , 'Data Integration', 'Data Platform', 'Data Architecture', 'Data Factory', 'Databricks', 'Data Science'
        , 'Machine Learning', 'Computer Science', 'Research', 'Statistic', 'Mathematics'
        , 'Quantitative', 'Algorithm', 'Deep Learning', 'Statistical Analysis'
        , 'Communication Skill', 'Stakeholder', 'Reporting', 'Agile'
        , 'Project Management', 'Business Intelligence', 'Decision Making', 'Interpersonal'
        , 'Time Management', 'Troubleshoot', 'Tertiary Qualification', 'PhD']


# nltk.download('stopwords')
def ngram(l: Iterable[Any], gram: int) -> List[Any]:
    l = list(l)
    return [l[i: i + gram] for i in range(len(l) - gram + 1)]


def get_words(line: str, gram: int) -> List[str]:
    words = re.findall(r'[a-zA-Z]+', line.lower())
    if gram == 1:
        stop_words = stopwords.words('english')
        words = [word for word in words if word not in stop_words]
    else:
        words = [' '.join(word) for word in ngram(words, gram)]
    return words


# Count the occurrences of each word or phrase in one input string
def word_count(stats: dict, sentence: str, gram: int) -> dict:
    # Non-string values (e.g. NaN from empty CSV cells) are skipped
    if isinstance(sentence, str):
        for word in get_words(sentence, gram):
            stats[word] = stats.get(word, 0) + 1
    return stats
    

def sorted_word_count(d: dict) -> List[Tuple[str, int]]:
    # Most frequent words first
    return sorted(d.items(), key=lambda x: x[-1], reverse=True)


# read the job details file
wc = {}
job_uls = pd.read_csv('data_scientist/seek_data_scientist_job_details_filtered.csv')

for ul in job_uls['detail']:
    wc = word_count(wc, ul, gram=1)

for i in sorted_word_count(wc):
    print(i)
    
df = pd.DataFrame(sorted_word_count(wc))

df.to_csv('seek_data_scientist_gram_1.csv', index=False)

 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for the word count - Term Frequency Analysis is hidden.
<a href="javascript:code_toggle()"></a>''')
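
The term-frequency step above can be illustrated on two toy sentences. This is a minimal sketch, not the project's code: `STOP_WORDS` is a small hypothetical stand-in for NLTK's English stop-word list, `tokens` is an illustrative name, and `collections.Counter` replaces the manual dictionary update.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOP_WORDS = {"and", "in", "with", "of", "the", "a"}


def tokens(line: str, gram: int) -> List[str]:
    """Lower-case a line and return its unigrams or space-joined n-grams."""
    words = re.findall(r"[a-zA-Z]+", line.lower())
    if gram == 1:
        # Stop words are dropped only for single words, as in get_words().
        return [w for w in words if w not in STOP_WORDS]
    return [" ".join(words[i:i + gram]) for i in range(len(words) - gram + 1)]


lines = [
    "Experience with machine learning and Python",
    "Strong Python and SQL skills",
]

unigrams = Counter()
bigrams = Counter()
for line in lines:
    unigrams.update(tokens(line, 1))
    bigrams.update(tokens(line, 2))

print(unigrams.most_common(3))      # most frequent single words
print(bigrams["machine learning"])  # count of one two-word phrase
```

Counting bigrams separately is what lets two-word skills such as "machine learning" surface as a unit instead of as two unrelated words.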

In [36]:
# @hidden
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {

import re
import pandas as pd
from typing import Tuple, List, Iterable, Any
from nltk.corpus import stopwords


def filtered_keywords(sentence: str) -> list:
    result = []
    if isinstance(sentence, str):
        for word in keywords:
            if word.lower() in sentence.lower():
                result.append(word)
    return result


def ngram(l: Iterable[Any], gram: int) -> List[Any]:
    l = list(l)
    return [l[i: i + gram] for i in range(len(l) - gram + 1)]
    
    
def get_words(line: str, gram: int) -> List[str]:
    words = re.findall(r'[a-zA-Z]+', line.lower())
    if gram == 1:
        stop_words = stopwords.words('english')
        words = [word for word in words if word not in stop_words]
    else:
        words = [' '.join(word) for word in ngram(words, gram)]
    return words
    
    
keywords = ['SQL', 'Excel', 'PowerBI', 'Tableau', 'Python', 'Cloud'
        , 'Azure', 'AWS', 'GCP', 'API', 'Pipeline', 'Dimension Modelling', 'ETL', 'ELT'
        , 'DevOps', 'CI CD', 'Spark', 'Java', 'Scala', 'Oracle', 'Kubernetes', 'Docker'
        , 'Apache', 'Kafka', 'Linux', 'Snowflake', 'Data Warehouse'
        , 'Data Modelling', 'Data Visualisation', 'Data Migration', 'Data Management'
        , 'Data Integration', 'Data Platform', 'Data Architecture', 'Data Factory', 'Databricks', 'Data Science'
        , 'Machine Learning', 'Computer Science', 'Research', 'Statistic', 'Mathematics'
        , 'Quantitative', 'Algorithm', 'Deep Learning', 'Statistical Analysis'
        , 'Communication Skill', 'Stakeholder', 'Reporting', 'Agile'
        , 'Project Management', 'Business Intelligence', 'Decision Making', 'Interpersonal'
        , 'Time Management', 'Troubleshoot', 'Tertiary Qualification', 'PhD']
    

# One list of line_ids per keyword, built from keywords so the two always share the same keys
keywords_dict = {skill: [] for skill in keywords}


# read the job details file
exsited_keyword = []
job_uls = pd.read_csv('data_scientist/seek_data_scientist_job_details_filtered.csv')
for ul in job_uls['detail']:
    exsited_keyword.append(filtered_keywords(ul))

job_uls['exsited_keyword'] = exsited_keyword

for index, row in job_uls.iterrows():
    if isinstance(row['detail'], str):
        # Single-word keywords are matched against unigrams, two-word keywords against bigrams
        for gram in (1, 2):
            for word in get_words(row['detail'], gram):
                for skill in keywords:
                    if word == skill.lower():
                        keywords_dict[skill].append(row['line_id'])

# Average line_id of the lines on which each skill is mentioned
skill_ranked_line = {}
print(keywords_dict)
for key, value in keywords_dict.items():
    if len(value) != 0:
        skill_ranked_line[key] = sum(value) / len(value)
    print(key, value)

 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for the Average Line_id Count is hidden.
<a href="javascript:code_toggle()"></a>''')
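
The `filtered_keywords` helper above uses plain substring tests, so a short keyword such as `Java` also matches `JavaScript`. The following is a minimal sketch of a stricter alternative, assuming whole-word matching is acceptable for the analysis; the `mentions` helper, the three-skill subset and the sample rows are hypothetical. It also computes the average `line_id` per skill.

```python
import re
from collections import defaultdict
from statistics import mean

SKILLS = ["SQL", "Java", "Machine Learning"]  # small subset for illustration


def mentions(skill: str, text: str) -> bool:
    """True if `skill` occurs in `text` as a whole word or phrase (case-insensitive)."""
    pattern = r"\b" + re.escape(skill) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical rows: (line_id, detail) pairs from job descriptions.
rows = [
    (1, "Strong SQL and JavaScript skills"),
    (2, "Experience with machine learning"),
    (5, "Advanced SQL for reporting"),
]

line_ids = defaultdict(list)
for line_id, detail in rows:
    for skill in SKILLS:
        if mentions(skill, detail):
            line_ids[skill].append(line_id)

# Average line position at which each skill is mentioned.
average_line = {skill: mean(ids) for skill, ids in line_ids.items()}
print(average_line)
```

Whole-word matching trades recall for precision: it no longer counts `scalable` as `Scala`, but it also misses inflected forms such as `statistics` for the keyword `Statistic`, so the choice depends on the keyword list.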

In [43]:
# @hidden
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {

---

```sql
WITH 
enriched AS (
	SELECT job_id, job_title
			, CASE 
			WHEN job_title LIKE '%data%analyst%'  THEN 'data analyst'  
			WHEN job_title LIKE '%bi%'  THEN 'data analyst'  
			WHEN job_title LIKE '%tableau%'  THEN 'data analyst'
			WHEN job_title like '%sql%' THEN 'data analyst'		
			WHEN job_title LIKE '%analytic%'  THEN 'data analyst' 
			WHEN job_title LIKE '%data%modeller%'  THEN 'data analyst' 
			WHEN job_title LIKE '%visualisation%'  THEN 'data analyst' 
			WHEN job_title LIKE '%business%intelligence%'  THEN 'data analyst'  
			WHEN job_title like '%insight%' THEN 'data analyst'
			WHEN job_title like '%reporting%' THEN 'data analyst'
			WHEN job_title like '%model%' THEN 'data analyst'
			
			WHEN job_title like '%data%engineer%' THEN 'data engineer'
			WHEN job_title like '%warehouse%' THEN 'data engineer'
			WHEN job_title like '%architect%' THEN 'data engineer'
			WHEN job_title like '%snowflake%' THEN 'data engineer'
			WHEN job_title like '%etl%' THEN 'data engineer'
			WHEN job_title like '%api%' THEN 'data engineer'
			WHEN job_title like '%cloud%' THEN 'data engineer'
			WHEN job_title like '%aws%' THEN 'data engineer'
			WHEN job_title like '%kafka%' THEN 'data engineer'
			WHEN job_title like '%pipeline%' THEN 'data engineer'
			WHEN job_title like '%migration%' THEN 'data engineer'
			
			WHEN job_title like '%scientist%' THEN 'data scientist'
			WHEN job_title like '%science%' THEN 'data scientist'
			WHEN job_title like '%machine%learning%' THEN 'data scientist'
			WHEN job_title like '% ai%' THEN 'data scientist'
			WHEN job_title like '%ai %' THEN 'data scientist'
			WHEN job_title like '%ml%' THEN 'data scientist'
			WHEN job_title like '%computational%statistics%' then 'data scientist'
			WHEN job_title like '%quantitative%' then 'data scientist'
			WHEN job_title like '%artificial%intelligence%' then 'data scientist'
			
			WHEN job_title like '%business%analyst%' THEN 'else'
			WHEN job_title like '%accountant%' THEN 'else'
			WHEN job_title like '%sales%' THEN 'else'
			WHEN job_title like '%product%' THEN 'else'
			WHEN job_title like '%dev%ops%' THEN 'else'
			WHEN job_title like '%graduate%' THEN 'else'
			WHEN job_title LIKE '%developer%' THEN 'else'
			WHEN job_title LIKE '%functional%' THEN 'else'
			WHEN job_title LIKE '%financial%' THEN 'else'
			WHEN job_title LIKE '%finance%' THEN 'else'
			WHEN job_title LIKE '%manager%' THEN 'else'
			WHEN job_title LIKE '%project%' THEN 'else'
			WHEN job_title LIKE '%program%' THEN 'else'
			
			END AS job_standard_name
	FROM dim_job_info
	ORDER BY 2 DESC
	)

, filtered_job_details AS (
	SELECT *
	FROM dim_job_details
	WHERE job_id IN (
		SELECT job_id FROM enriched 
-- 		WHERE job_standard_name = 'data analyst' )
-- 		WHERE job_standard_name = 'data engineer' )
		WHERE job_standard_name = 'data scientist' )
	)

, enriched_details AS (
	SELECT *
		, CASE WHEN detail LIKE '%sql%' THEN '1' END AS sql_counter
		, CASE WHEN detail LIKE '%excel%' THEN '1' END AS excel_counter
		, CASE WHEN detail LIKE '%power%bi%' THEN '1' END AS powerbi_counter
		, CASE WHEN detail LIKE '%tableau%' THEN '1' END AS tableau_counter
		, CASE WHEN detail LIKE '%python%' THEN '1' END AS python_counter
		, CASE WHEN detail LIKE '%cloud%' THEN '1' END AS cloud_counter
		, CASE WHEN detail LIKE '%azure%' THEN '1' END AS azure_counter
		, CASE WHEN detail LIKE '%aws%' THEN '1' END AS aws_counter
		, CASE WHEN detail LIKE '%gcp%' THEN '1' END AS gcp_counter
		
		, CASE WHEN detail LIKE '%api%' THEN '1' END AS api_counter
		, CASE WHEN detail LIKE '%pipeline%' THEN '1' END AS pipeline_counter
		, CASE WHEN detail LIKE '%dimension%' THEN '1' END AS dimension_counter
		, CASE WHEN detail LIKE '%etl%' THEN '1' END AS etl_counter
		, CASE WHEN detail LIKE '%elt%' THEN '1' END AS elt_counter
		, CASE WHEN detail LIKE '%devops%' THEN '1' END AS devops_counter
		, CASE WHEN detail LIKE '%ci%cd%' THEN '1' END AS cicd_counter
		, CASE WHEN detail LIKE '%spark%' THEN '1' END AS spark_counter
		, CASE WHEN detail LIKE '%java%' THEN '1' END AS java_counter
		, CASE WHEN detail LIKE '%scala%' THEN '1' END AS scala_counter
		, CASE WHEN detail LIKE '%kafka%' THEN '1' END AS kafka_counter
		, CASE WHEN detail LIKE '%linux%' THEN '1' END AS linux_counter
		, CASE WHEN detail LIKE '%snowflake%' THEN '1' END AS snowflake_counter	
		, CASE WHEN detail LIKE '%oracle%' THEN '1' END AS oracle_counter
		, CASE WHEN detail LIKE '%kubernetes%' THEN '1' END AS kubernetes_counter
		, CASE WHEN detail LIKE '%docker%' THEN '1' END AS docker_counter
		, CASE WHEN detail LIKE '%apache%' THEN '1' END AS apache_counter
		
		, CASE WHEN detail LIKE '%data%warehous%' THEN '1' END AS DW_counter
		, CASE WHEN detail LIKE '%data%modelling%' THEN '1' END AS DM_counter
		, CASE WHEN detail LIKE '%data%visualisation%' THEN '1' END AS DV_counter
		, CASE WHEN detail LIKE '%data%migration%' THEN '1' END AS DMI_counter
		, CASE WHEN detail LIKE '%data%management%' THEN '1' END AS DMA_counter
		, CASE WHEN detail LIKE '%data%integration%' THEN '1' END AS DI_counter
		, CASE WHEN detail LIKE '%data%platform%' THEN '1' END AS DP_counter
		, CASE WHEN detail LIKE '%data%architecture%' THEN '1' END AS DA_counter
		, CASE WHEN detail LIKE '%data%factory%' THEN '1' END AS DF_counter
		, CASE WHEN detail LIKE '%databricks%' THEN '1' END AS DB_counter
		
		, CASE WHEN detail LIKE '%data%science%' THEN '1' END AS DS_counter
		, CASE WHEN detail LIKE '%machine%learning%' THEN '1' END AS ML_counter
		, CASE WHEN detail LIKE '% ai %' THEN '1' END AS AI_counter
		, CASE WHEN detail LIKE '%computer%science%' THEN '1' END AS CS_counter
		, CASE WHEN detail LIKE '%research%' THEN '1' END AS research_counter
		, CASE WHEN detail LIKE '%statistic%' THEN '1' END AS statistic_counter
		, CASE WHEN detail LIKE '%mathematics%' THEN '1' END AS mathematics_counter
		, CASE WHEN detail LIKE '%quantitative%' THEN '1' END AS quantitative_counter
		, CASE WHEN detail LIKE '%algorithm%' THEN '1' END AS algorithm_counter
		, CASE WHEN detail LIKE '%deep%learning%' THEN '1' END AS DL_counter
		, CASE WHEN detail LIKE '%statistical%analysis%' THEN '1' END AS SA_counter
        
        , CASE WHEN detail LIKE '%communication%' THEN '1' END AS communication_counter
		, CASE WHEN detail LIKE '%reporting%' THEN '1' END AS reporting_counter
		, CASE WHEN detail LIKE '%agile%' THEN '1' END AS agile_counter
		
		, CASE WHEN detail LIKE '%stakeholder%' THEN '1' END AS stakeholder_counter
		, CASE WHEN detail LIKE '%project%management%' THEN '1' END AS PM_counter
		, CASE WHEN detail LIKE '%decision%making%' THEN '1' END AS DMK_counter
		, CASE WHEN detail LIKE '%interpersonal%skills%' THEN '1' END AS IS_counter
		, CASE WHEN detail LIKE '%time%management%' THEN '1' END AS TM_counter
		, CASE WHEN detail LIKE '%troubleshoot%' THEN '1' END AS troubleshoot_counter
		, CASE WHEN detail LIKE '%tertiary%qualification%' THEN '1' END AS TQ_counter
		, CASE WHEN detail LIKE '%phd%' THEN '1' END AS phd_counter
		, CASE WHEN detail LIKE '%business%intelligence%' THEN '1' END AS BI_counter

	FROM filtered_job_details
	)
, job_details AS (
	SELECT job_id
		, SUM(sql_counter) > 0 AS has_sql
		, SUM(excel_counter) > 0 AS has_excel
		, SUM(powerbi_counter) > 0 AS has_powerbi
		, SUM(tableau_counter) > 0 AS has_tableau
		, SUM(python_counter) > 0 AS has_python
		, SUM(cloud_counter) > 0 AS has_cloud
		, SUM(azure_counter) > 0 AS has_azure
		, SUM(aws_counter) > 0 AS has_aws
		, SUM(gcp_counter) > 0 AS has_gcp

		, SUM(api_counter) > 0 AS has_api
		, SUM(pipeline_counter) > 0 AS has_pipeline
		, SUM(dimension_counter) > 0 AS has_dimension
		, SUM(etl_counter) > 0 AS has_etl
		, SUM(elt_counter) > 0 AS has_elt
		, SUM(devops_counter) > 0 AS has_devops
		, SUM(cicd_counter) > 0 AS has_cicd
		, SUM(spark_counter) > 0 AS has_spark
		, SUM(java_counter) > 0 AS has_java
		, SUM(scala_counter) > 0 AS has_scala
		, SUM(kafka_counter) > 0 AS has_kafka
		, SUM(linux_counter) > 0 AS has_linux
		, SUM(snowflake_counter) > 0 AS has_snowflake
		, SUM(oracle_counter) > 0 AS has_oracle
		, SUM(kubernetes_counter) > 0 AS has_kubernetes
		, SUM(docker_counter) > 0 AS has_docker
		, SUM(apache_counter) > 0 AS has_apache

		, SUM(DW_counter) > 0 AS has_DW
		, SUM(DM_counter) > 0 AS has_DM
		, SUM(DV_counter) > 0 AS has_DV
		, SUM(DMI_counter) > 0 AS has_DMI
		, SUM(DMA_counter) > 0 AS has_DMA
		, SUM(DI_counter) > 0 AS has_DI
		, SUM(DP_counter) > 0 AS has_DP
		, SUM(DA_counter) > 0 AS has_DA
		, SUM(DF_counter) > 0 AS has_DF
		, SUM(DB_counter) > 0 AS has_DB

		, SUM(DS_counter) > 0 AS has_DS
		, SUM(ML_counter) > 0 AS has_ML
		, SUM(AI_counter) > 0 AS has_AI
		, SUM(CS_counter) > 0 AS has_CS
		, SUM(research_counter) > 0 AS has_research
		, SUM(statistic_counter) > 0 AS has_statistic
		, SUM(mathematics_counter) > 0 AS has_mathematics
		, SUM(quantitative_counter) > 0 AS has_quantitative
		, SUM(algorithm_counter) > 0 AS has_algorithm
		, SUM(DL_counter) > 0 AS has_DL
		, SUM(SA_counter) > 0 AS has_SA

		, SUM(communication_counter) > 0 AS has_communication
		, SUM(reporting_counter) > 0 AS has_reporting
		, SUM(agile_counter) > 0 AS has_agile
		, SUM(stakeholder_counter) > 0 AS has_stakeholder
		, SUM(PM_counter) > 0 AS has_PM
		, SUM(DMK_counter) > 0 AS has_DMK
		, SUM(IS_counter) > 0 AS has_IS
		, SUM(TM_counter) > 0 AS has_TM
		, SUM(troubleshoot_counter) > 0 AS has_troubleshoot
		, SUM(TQ_counter) > 0 AS has_TQ
		, SUM(phd_counter) > 0 AS has_phd
		, SUM(BI_counter) > 0 AS has_BI

	FROM enriched_details
	GROUP BY job_id
	)

SELECT 
(SELECT COUNT(DISTINCT job_id) FROM filtered_job_details) AS total_jobs
	, SUM(has_sql) AS sql_jobs
	, SUM(has_excel) AS excel_jobs
	, SUM(has_powerbi) AS powerbi_jobs
	, SUM(has_tableau) AS tableau_jobs
	, SUM(has_python) AS python_jobs
	, SUM(has_cloud) AS cloud_jobs
	, SUM(has_azure) AS azure_jobs
	, SUM(has_aws) AS aws_jobs
	, SUM(has_gcp) AS gcp_jobs

	, SUM(has_api) AS api_jobs
	, SUM(has_pipeline) AS pipeline_jobs
	, SUM(has_dimension) AS dimension_jobs
	, SUM(has_etl) AS etl_jobs
	, SUM(has_elt) AS elt_jobs
	, SUM(has_devops) AS devops_jobs
	, SUM(has_cicd) AS cicd_jobs
	, SUM(has_spark) AS spark_jobs
	, SUM(has_java) AS java_jobs
	, SUM(has_scala) AS scala_jobs
	, SUM(has_kafka) AS kafka_jobs
	, SUM(has_linux) AS linux_jobs
	, SUM(has_snowflake) AS snowflake_jobs
	, SUM(has_oracle) AS oracle_jobs
	, SUM(has_kubernetes) AS kubernetes_jobs
	, SUM(has_docker) AS docker_jobs
	, SUM(has_apache) AS apache_jobs

	, SUM(has_DW) AS datawarehouse_jobs
	, SUM(has_DM) AS datamodelling_jobs
	, SUM(has_DV) AS datavisualisation_jobs
	, SUM(has_DMI) AS datamigration_jobs
	, SUM(has_DMA) AS datamanagement_jobs
	, SUM(has_DI) AS dataintegration_jobs
	, SUM(has_DP) AS dataplatform_jobs
	, SUM(has_DA) AS dataarchitecture_jobs
	, SUM(has_DF) AS datafactory_jobs
	, SUM(has_DB) AS databricks_jobs

	, SUM(has_DS) AS datascience_jobs
	, SUM(has_ML) AS machinelearning_jobs
	, SUM(has_AI) AS Artificialintelligence_jobs
	, SUM(has_CS) AS computerscience_jobs
	, SUM(has_research) AS research_jobs
	, SUM(has_statistic) AS statistic_jobs
	, SUM(has_mathematics) AS mathematics_jobs
	, SUM(has_quantitative) AS quantitative_jobs
	, SUM(has_algorithm) AS algorithm_jobs
	, SUM(has_DL) AS deeplearning_jobs
	, SUM(has_SA) AS statisticalanalysis_jobs

	, SUM(has_communication) AS Communication_Skill
	, SUM(has_reporting) AS Reporting
	, SUM(has_agile) AS Agile
	, SUM(has_stakeholder) AS Stakeholder
	, SUM(has_PM) AS Project_Management
	, SUM(has_BI) AS Business_Intelligence
	, SUM(has_DMK) AS Decision_Making
	, SUM(has_IS) AS Interpersonal_Skill
	, SUM(has_TM) AS Time_Management
	, SUM(has_troubleshoot) AS Troubleshooting
	, SUM(has_TQ) AS Tertiary_Qualification
	, SUM(has_phd) AS PhD
    
	
FROM job_details


```

---

 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for filtering the recruitment data for DA, DE and DS roles and counting the required skills is hidden.
<a href="javascript:code_toggle()"></a>''')
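
The counting pattern in the query above (flag each description line, collapse the flags to one per job, then count jobs) can be tried on a toy table. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the sample rows are invented, only two skills are kept, and the title-based filtering step is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_job_details (job_id INTEGER, line_id INTEGER, detail TEXT)"
)
conn.executemany(
    "INSERT INTO dim_job_details VALUES (?, ?, ?)",
    [
        (1, 1, "Strong SQL skills"),
        (1, 2, "SQL and Python experience"),  # second SQL mention in job 1
        (2, 1, "Python for machine learning"),
        (3, 1, "Excellent communication"),
    ],
)

query = """
WITH enriched_details AS (
    SELECT job_id
        , CASE WHEN detail LIKE '%sql%' THEN 1 END AS sql_counter
        , CASE WHEN detail LIKE '%python%' THEN 1 END AS python_counter
    FROM dim_job_details
)
, job_details AS (
    -- One row per job: a skill counts once however often it is mentioned.
    SELECT job_id
        , SUM(sql_counter) > 0 AS has_sql
        , SUM(python_counter) > 0 AS has_python
    FROM enriched_details
    GROUP BY job_id
)
SELECT (SELECT COUNT(DISTINCT job_id) FROM dim_job_details) AS total_jobs
    , SUM(has_sql) AS sql_jobs
    , SUM(has_python) AS python_jobs
FROM job_details
"""

total_jobs, sql_jobs, python_jobs = conn.execute(query).fetchone()
print(total_jobs, sql_jobs, python_jobs)
conn.close()
```

Job 1 mentions SQL on two lines but contributes only one to `sql_jobs`, which is the reason for the intermediate per-job step. Note that SQLite's `LIKE` is case-insensitive for ASCII letters by default; other databases may need `ILIKE` or `LOWER(detail)`.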

In [42]:
# @hidden
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {

---

```sql

WITH 
enriched AS (
	SELECT job_id, job_title
			, CASE 
			WHEN job_title LIKE '%data%analyst%'  THEN 'data analyst'  
			WHEN job_title LIKE '%bi%'  THEN 'data analyst'  
			WHEN job_title LIKE '%tableau%'  THEN 'data analyst'
			WHEN job_title like '%sql%' THEN 'data analyst'		
			WHEN job_title LIKE '%analytic%'  THEN 'data analyst' 
			WHEN job_title LIKE '%data%modeller%'  THEN 'data analyst' 
			WHEN job_title LIKE '%visualisation%'  THEN 'data analyst' 
			WHEN job_title LIKE '%business%intelligence%'  THEN 'data analyst'  
			WHEN job_title like '%insight%' THEN 'data analyst'
			WHEN job_title like '%reporting%' THEN 'data analyst'
			WHEN job_title like '%model%' THEN 'data analyst'
			
			WHEN job_title like '%data%engineer%' THEN 'data engineer'
			WHEN job_title like '%warehouse%' THEN 'data engineer'
			WHEN job_title like '%architect%' THEN 'data engineer'
			WHEN job_title like '%snowflake%' THEN 'data engineer'
			WHEN job_title like '%etl%' THEN 'data engineer'
			WHEN job_title like '%api%' THEN 'data engineer'
			WHEN job_title like '%cloud%' THEN 'data engineer'
			WHEN job_title like '%aws%' THEN 'data engineer'
			WHEN job_title like '%kafka%' THEN 'data engineer'
			WHEN job_title like '%pipeline%' THEN 'data engineer'
			WHEN job_title like '%migration%' THEN 'data engineer'
			
			WHEN job_title like '%scientist%' THEN 'data scientist'
			WHEN job_title like '%science%' THEN 'data scientist'
			WHEN job_title like '%machine%learning%' THEN 'data scientist'
			WHEN job_title like '% ai%' THEN 'data scientist'
			WHEN job_title like '%ai %' THEN 'data scientist'
			WHEN job_title like '%ml%' THEN 'data scientist'
			WHEN job_title like '%computational%statistics%' then 'data scientist'
			WHEN job_title like '%quantitative%' then 'data scientist'
			WHEN job_title like '%artificial%intelligence%' then 'data scientist'
			END AS job_standard_name
            
	FROM dim_job_info
	ORDER BY 2 DESC
	)

, filtered_job_details AS (
	SELECT *
	FROM dim_job_details
	WHERE job_id IN (
		SELECT job_id FROM enriched 
-- 		WHERE job_standard_name = 'data analyst' )
-- 		WHERE job_standard_name = 'data engineer' )
		WHERE job_standard_name = 'data scientist' )
)
, filtered_skills AS (

	SELECT job_id, 'SQL' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%sql%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Excel' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%excel%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'PowerBI' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%power%bi%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Tableau' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%tableau%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Python' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%python%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Cloud' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%cloud%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Azure' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%azure%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'AWS' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%aws%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'GCP' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%gcp%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'API' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%api%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Pipeline' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%pipeline%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Dimension Modelling' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%dimension%modelling%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'ETL' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%etl%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'ELT' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%elt%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'DevOps' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%devops%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'CI CD' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%ci%cd%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Spark' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%spark%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Java' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%java%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Scala' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%scala%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Oracle' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%oracle%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Kubernetes' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%kubernetes%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Docker' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%docker%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Apache' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%apache%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Kafka' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%kafka%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Linux' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%linux%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Snowflake' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%snowflake%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Warehouse' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%warehouse%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Modelling' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%modelling%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Visualisation' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%visualisation%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Migration' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%migration%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Management' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%management%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Integration' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%integration%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Platform' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%platform%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Architecture' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%architecture%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Factory' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%factory%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Databricks' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%databricks%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Data Science' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%data%science%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Machine Learning' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%machine%learning%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Computer Science' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%computer%science%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Research' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%research%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Statistic' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%statistic%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Mathematics' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%mathematics%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Quantitative' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%quantitative%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Algorithm' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%algorithm%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Deep Learning' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%deep%learning%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Statistical Analysis' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%statistical%analysis%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Communication Skill' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%communication%skill%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Stakeholder' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%stakeholder%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Reporting' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%reporting%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Agile' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%agile%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Project Management' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%project%management%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Business Intelligence' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%business%intelligence%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Decision Making' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%decision%making%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Interpersonal' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%interpersonal%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Time Management' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%time%management%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Troubleshoot' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%troubleshoot%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'Tertiary Qualification' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%tertiary%qualification%'
	GROUP BY job_id

	UNION
	SELECT job_id, 'PhD' AS skill
	, MIN(line_id) AS line_id
	FROM filtered_job_details
	WHERE detail LIKE '%phd%'
	GROUP BY job_id

)
SELECT *
	, CASE 
		WHEN line_id <= 3 THEN '1~3'
		WHEN (line_id > 3 AND line_id <= 5) THEN '4~5'
		ELSE '6+' END AS line_id_category
FROM filtered_skills

```

---

 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for Ranked Line in the Job Description analysis is hidden.
<a href="javascript:code_toggle()"></a>''')
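
The final query keeps, for each job and skill, the first line on which the skill appears and then buckets that line number. The same logic is sketched below in plain Python, with invented sample rows and a two-skill subset; `SKILL_PATTERNS` and `first_line` are illustrative names, while `line_id_category` mirrors the query's output column.

```python
from typing import Dict, Tuple

# Pattern per skill, mirroring the LIKE filters (substring, case-insensitive).
SKILL_PATTERNS = {"SQL": "sql", "Python": "python"}

# Hypothetical rows: (job_id, line_id, detail).
rows = [
    (1, 2, "Advanced SQL"),
    (1, 4, "SQL and Python"),
    (2, 7, "Python scripting"),
]


def line_id_category(line_id: int) -> str:
    """Bucket a line number the same way as the CASE expression."""
    if line_id <= 3:
        return "1~3"
    if line_id <= 5:
        return "4~5"
    return "6+"


# Earliest line per (job_id, skill): the equivalent of MIN(line_id) ... GROUP BY job_id.
first_line: Dict[Tuple[int, str], int] = {}
for job_id, line_id, detail in rows:
    for skill, pattern in SKILL_PATTERNS.items():
        if pattern in detail.lower():
            key = (job_id, skill)
            first_line[key] = min(line_id, first_line.get(key, line_id))

for (job_id, skill), line_id in sorted(first_line.items()):
    print(job_id, skill, line_id, line_id_category(line_id))
```

Keeping only the earliest mention treats the position of a skill in the description as a rough proxy for how prominently the advertiser lists it.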