# Top IT skills in Vietnam for 2024

IT job postings scraped from itviec.com in August 2024 and analyzed for the most in-demand skills.

## Table of Contents
* [01. Identify the problem](#chapter1)
* [02. Extraction](#chapter2)
* [03. Transformation](#chapter3)
* [04. Loading](#chapter4)
* [05. Exploratory Data Analysis (EDA)](#chapter5)
* [06. Visualization and Presentation of Results](#chapter6)
* [07. Decision Making](#chapter7)
* [08. Monitoring and Evaluation](#chapter8)

## 01. Identify the problem <a class="anchor" id="chapter1"></a>

Analysis objectives:
+ Describe the current state of the IT job market in Vietnam.
+ Identify which skills are most in demand among recruiters.
+ Compare the findings with domestic and international reports.

## 02. Extraction <a id="chapter2"></a>

In [51]:
# Import libraries (only the ones the scraper below actually uses)
import random
import time

import undetected_chromedriver as uc
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By



In [52]:
# Accumulators: one list per scraped field, appended to across all result pages
main_skills_1 = []
main_skills_2 = []
main_skills_3 = []
countries = []
company_types = []
company_industries = []
working_days = []
company_sizes = []
overtime_policies = []
posted_dates = []
workplaces = []
addresses = []
companies = []

In [53]:
def main(count):
    # Generate the URL
    URLs = 'https://itviec.com/it-jobs?page=' + str(count)
    print(URLs)
    driver.get(URLs)
    
    # Posted_dates
    posted_dates_temps = driver.find_elements(By.CSS_SELECTOR, 'span[class="small-text text-dark-grey"]')
    for element in posted_dates_temps:
        posted_dates.append(element.get_attribute('innerText'))
        
    # Companies
    companies_temps = driver.find_elements(By.CSS_SELECTOR, 'a[data-controller="utm-tracking"][class="text-rich-grey"]')  
    for element in companies_temps:
        companies.append(element.get_attribute('innerText'))
        
    # Workplaces_and_addresses
    workplaces_and_addresses_temps = driver.find_elements(By.CSS_SELECTOR, 'span[class="ips-2 small-text text-rich-grey"]')
    for i, element in enumerate(workplaces_and_addresses_temps):
        if i % 2 == 0:
            workplaces.append(element.get_attribute('innerText'))
        else:
            addresses.append(element.get_attribute('innerText'))

    # Retrieve elements
    elements = driver.find_elements(By.CSS_SELECTOR, 'div[class="ipy-2"]')
    for element in elements:
        try:
            # Scroll
            actions = ActionChains(driver)
            actions.move_to_element(element).perform()
            
            # Click
            element.click()
            
            time.sleep(1)
            
            # Company_industries
            company_industries_temp1 = driver.find_element(By.CSS_SELECTOR, 'div[class="flex-1 ips-2 ips-md-0"]')
            company_industries_temp2 = company_industries_temp1.find_element(By.CSS_SELECTOR, 'div.d-inline-flex')
            company_industries.append(company_industries_temp2.get_attribute('innerText'))
    
            # Details
            details = driver.find_elements(By.CSS_SELECTOR, 'small[class="normal-text text-it-black col"]')
            company_types.append(details[0].get_attribute('innerText'))
            company_sizes.append(details[1].get_attribute('innerText'))
            countries.append(details[2].get_attribute('innerText'))
            working_days.append(details[3].get_attribute('innerText'))
            if len(details) == 5:
                overtime_policies.append(details[4].get_attribute('innerText'))
            else:
                overtime_policies.append('NULL')
            
            # Main_skill
            skills_temps_1 = driver.find_element(By.CSS_SELECTOR, 'div[class="d-flex align-items-center gap-1"]')
            skills_temps_2 = skills_temps_1.find_elements(By.CSS_SELECTOR, 'div[class="itag itag-light itag-sm"]')
            main_skills_1.append(skills_temps_2[0].get_attribute('innerText'))
            if len(skills_temps_2) == 1:
                main_skills_2.append('NULL')
                main_skills_3.append('NULL')
            elif len(skills_temps_2) == 2:
                main_skills_2.append(skills_temps_2[1].get_attribute('innerText'))
                main_skills_3.append('NULL')
            else:
                main_skills_2.append(skills_temps_2[1].get_attribute('innerText'))
                main_skills_3.append(skills_temps_2[2].get_attribute('innerText'))
        except Exception as e:
            print(f"Unable to click the element: {e}")

if __name__ == "__main__":
    count = 1
    options = uc.ChromeOptions()
    driver = uc.Chrome(options=options)
    URLs = 'https://itviec.com/it-jobs?'
    print(URLs)
    driver.get(URLs)
    pages = driver.find_elements(By.CSS_SELECTOR, 'a[data-controller="search--pagination"][data-action="ajax:success->search--pagination#paginate"]')
    num = pages[1].get_attribute('innerText')
    limit = int(num)
    while count <= limit:
        main(count)
        count += 1
        delay = random.uniform(1, 4) # Random time sleep
        time.sleep(delay) 
    # Turn off Chrome
    driver.quit()

https://itviec.com/it-jobs?
https://itviec.com/it-jobs?page=1
https://itviec.com/it-jobs?page=2
https://itviec.com/it-jobs?page=3
https://itviec.com/it-jobs?page=4
https://itviec.com/it-jobs?page=5
https://itviec.com/it-jobs?page=6
https://itviec.com/it-jobs?page=7
https://itviec.com/it-jobs?page=8
https://itviec.com/it-jobs?page=9
https://itviec.com/it-jobs?page=10
https://itviec.com/it-jobs?page=11
https://itviec.com/it-jobs?page=12
https://itviec.com/it-jobs?page=13
https://itviec.com/it-jobs?page=14
https://itviec.com/it-jobs?page=15
https://itviec.com/it-jobs?page=16
https://itviec.com/it-jobs?page=17
https://itviec.com/it-jobs?page=18
https://itviec.com/it-jobs?page=19
https://itviec.com/it-jobs?page=20
https://itviec.com/it-jobs?page=21
https://itviec.com/it-jobs?page=22
https://itviec.com/it-jobs?page=23
https://itviec.com/it-jobs?page=24
https://itviec.com/it-jobs?page=25
https://itviec.com/it-jobs?page=26
https://itviec.com/it-jobs?page=27
https://itviec.com/it-jobs?page=28
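
The `if`/`elif` chains in `main` that fill `main_skills_1` to `main_skills_3` and `overtime_policies` all do the same job: pad a short list with `'NULL'`. A minimal sketch of that padding as a single helper follows. `pad_fields` is a hypothetical name, not part of the original notebook, and it works on plain strings so it can be tried without a browser.

```python
def pad_fields(values, size, fill="NULL"):
    """Return exactly `size` items: drop extras, pad missing ones with `fill`."""
    values = list(values)[:size]
    return values + [fill] * (size - len(values))


# A posting that lists only two skill tags still yields three columns.
skill_1, skill_2, skill_3 = pad_fields(["ReactJS", "JavaScript"], 3)
print(skill_1, skill_2, skill_3)
```

Inside `main`, the scraped tag texts would be passed in place of the literal list; that wiring is left out here because it depends on the live page.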

In [54]:
# Tester: print a list and its length for manual inspection
def print_list_and_length(title, data_list):
    print(f"{title}:")
    print(f" - Number of elements: {len(data_list)}")
    if not data_list:
        print("The list is empty.")
    else:
        for item in data_list:
            print(f"   - {item}")
    print()

# print_list_and_length("Main Skills 1", main_skills_1)
# print_list_and_length("Main Skills 2", main_skills_2)
# print_list_and_length("Main Skills 3", main_skills_3)
# print_list_and_length("Countries", countries)
# print_list_and_length("Company Types", company_types)
# print_list_and_length("Company Industries", company_industries)
# print_list_and_length("Working Days", working_days)
# print_list_and_length("Company Sizes", company_sizes)
# print_list_and_length("Overtime Policies", overtime_policies)
# print_list_and_length("Posted Dates", posted_dates)
# print_list_and_length("Workplaces", workplaces)
# print_list_and_length("Addresses", addresses)
# print_list_and_length("Companies", companies)
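
`zip` stops at the shortest input, so if any list ends up shorter than the others (for example after a failed click in `main`), the loading step can silently drop or misalign rows. A quick length check before loading makes that visible. This is only a sketch: `length_report` is a hypothetical helper, and the sample lists are made up for illustration.

```python
def length_report(named_lists):
    """Map each list name to its length and say whether all lengths agree."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    aligned = len(set(lengths.values())) <= 1
    return lengths, aligned


# Made-up sample: two posted dates were scraped, but only one company.
sample = {
    "posted_dates": ["Posted 1 hour ago", "Posted 2 days ago"],
    "companies": ["Example Co"],
}
lengths, aligned = length_report(sample)
print(lengths, aligned)
```

In the notebook, the dictionary would hold the thirteen scraped lists keyed by their variable names.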

## 03. Transformation <a id="chapter3"></a>

In [55]:
from datetime import datetime, timedelta
import calendar

def convert_time(posted_date, current_time):
    """Turn a relative label (number in the 2nd word, unit in the 3rd) into a datetime."""
    words = posted_date.split()
    amount = int(words[1])
    unit = words[2]
    if unit in ('minute', 'minutes'):
        return current_time - timedelta(minutes=amount)
    if unit in ('hour', 'hours'):
        return current_time - timedelta(hours=amount)
    # Any other unit is treated as days
    return current_time - timedelta(days=amount)

# Current time
current_time = datetime.now()

# Convert every scraped label in place (no delay is needed for a local computation)
for i, label in enumerate(posted_dates):
    posted_dates[i] = convert_time(label, current_time)
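
`convert_time` relies on the scraped text having the number in the second word and the unit in the third. A worked example with a fixed reference time shows the arithmetic. The input strings below are assumed from that word order, not copied from the site, and `to_timestamp` is a hypothetical stand-in that mirrors the logic above.

```python
from datetime import datetime, timedelta

# Units handled explicitly; anything else is treated as days, as in convert_time.
UNITS = {"minute": "minutes", "hour": "hours"}


def to_timestamp(posted_date, now):
    """Convert text like 'Posted 3 hours ago' into an absolute datetime."""
    _, amount, unit = posted_date.split()[:3]
    keyword = UNITS.get(unit.rstrip("s"), "days")
    return now - timedelta(**{keyword: int(amount)})


now = datetime(2024, 8, 29, 21, 0, 0)
print(to_timestamp("Posted 40 minutes ago", now))  # 2024-08-29 20:20:00
print(to_timestamp("Posted 3 hours ago", now))     # 2024-08-29 18:00:00
print(to_timestamp("Posted 2 days ago", now))      # 2024-08-27 21:00:00
```

Passing the reference time in as an argument, rather than calling `datetime.now()` inside the function, keeps every posting on a page anchored to the same instant.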


## 04. Loading <a id="chapter4"></a>

In [56]:
# Import libraries 
import pandas as pd

In [57]:
# Dictionary
dict1 = {
    "Posted Date": [],  
    "Main Skill 1": [], 
    "Main Skill 2": [], 
    "Main Skill 3": [], 
    "Company": [],
    "Workplace": [], 
    "Address": [],  
    "Company Type": [], 
    "Company Industry": [], 
    "Company Size": [], 
    "Country": [], 
    "Working Days": [], 
    "Overtime Policy": [], 
}
dict2 = {
    "Posted Date": [], 
    "Skill": [],
    "Company": [],
    "Workplace": [], 
    "Company Type": [],
    "Company Industry": [],
    "Company Size": [], 
    "Country": [], 
}

In [58]:
# Build the wide table (dict1): one row per job posting, with up to three skills
for post_date, skill1, skill2, skill3, company, workplace, address, company_type, company_industry, company_size, country, working_day, overtime_policy in zip(posted_dates, main_skills_1, main_skills_2, main_skills_3, companies , workplaces, addresses , company_types, company_industries, company_sizes, countries, working_days, overtime_policies): 
    dict1['Posted Date'].append(post_date)
    dict1['Main Skill 1'].append(skill1)
    dict1['Main Skill 2'].append(skill2)
    dict1['Main Skill 3'].append(skill3)
    dict1['Company'].append(company)
    dict1['Workplace'].append(workplace)
    dict1['Address'].append(address)
    dict1['Company Type'].append(company_type)
    dict1['Company Industry'].append(company_industry)
    dict1['Company Size'].append(company_size)
    dict1['Country'].append(country)
    dict1['Working Days'].append(working_day)
    dict1['Overtime Policy'].append(overtime_policy)
    time.sleep(0.0025)
# Build the long table (dict2): one row per (job posting, skill) pair
def append_to_dict2(posted_dates, skills, companies, workplaces, company_types, company_industries, company_sizes, countries):
    for post_date, skill, company, workplace, company_type, company_industry, company_size, country in zip(posted_dates, skills, companies, workplaces, company_types, company_industries, company_sizes, countries):
        dict2['Posted Date'].append(post_date)
        dict2['Skill'].append(skill)
        dict2['Company'].append(company)
        dict2['Workplace'].append(workplace)
        dict2['Company Type'].append(company_type)
        dict2['Company Industry'].append(company_industry)
        dict2['Company Size'].append(company_size)
        dict2['Country'].append(country)
        time.sleep(0.0025)

# Each set of skills
append_to_dict2(posted_dates, main_skills_1, companies, workplaces, company_types, company_industries, company_sizes, countries)
append_to_dict2(posted_dates, main_skills_2, companies, workplaces, company_types, company_industries, company_sizes, countries)
append_to_dict2(posted_dates, main_skills_3, companies, workplaces, company_types, company_industries, company_sizes, countries)
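
The three `append_to_dict2` calls stack one copy of the postings per skill column. Assuming the wide table is already a DataFrame, pandas' `melt` produces the same long layout in one call. This is an alternative sketch, not what the notebook ran; the two sample rows are taken from the `head()` output later in this notebook, trimmed to a few columns.

```python
import pandas as pd

wide = pd.DataFrame({
    "Company": ["NAB Innovation Centre Vietnam", "Renesas Design Vietnam"],
    "Workplace": ["Hybrid", "Hybrid"],
    "Main Skill 1": ["ReactJS", "Java"],
    "Main Skill 2": ["JavaScript", "JavaScript"],
    "Main Skill 3": ["NodeJS", None],
})

skill_columns = ["Main Skill 1", "Main Skill 2", "Main Skill 3"]
long = (
    wide.melt(
        id_vars=["Company", "Workplace"],
        value_vars=skill_columns,
        value_name="Skill",
    )
    .drop(columns="variable")      # the source column name is not needed
    .dropna(subset=["Skill"])      # postings with fewer than three skills
    .reset_index(drop=True)
)
print(long)
```

Unlike the loop version, this sketch drops the missing skills immediately instead of leaving that step to Power BI.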

    

In [59]:
# Current month
current_month = datetime.now().month
month_name = calendar.month_name[current_month]

# Export DataFrame to CSV file.
file_path1 = f"IT Main in {month_name}.csv"
df1 = pd.DataFrame.from_dict(dict1)
df1.to_csv(file_path1, header=True, index=False, encoding='utf-8')

file_path2 = f"IT Secondary in {month_name}.csv"
df2 = pd.DataFrame.from_dict(dict2)
df2.to_csv(file_path2, header=True, index=False, encoding='utf-8')

## 05. Exploratory Data Analysis (EDA) <a id="chapter5"></a>

### IT Main

In [79]:
df1_read = pd.read_csv('IT Main in August.csv')

In [80]:
df1_read.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015 entries, 0 to 1014
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Posted Date       1015 non-null   object
 1   Main Skill 1      1015 non-null   object
 2   Main Skill 2      1015 non-null   object
 3   Main Skill 3      859 non-null    object
 4   Company           1015 non-null   object
 5   Workplace         1015 non-null   object
 6   Address           1015 non-null   object
 7   Company Type      1015 non-null   object
 8   Company Industry  972 non-null    object
 9   Company Size      1015 non-null   object
 10  Country           1015 non-null   object
 11  Working Days      1015 non-null   object
 12  Overtime Policy   909 non-null    object
dtypes: object(13)
memory usage: 103.2+ KB


The following cleaning steps are performed with the Transform Data feature in Power BI:
- Main Skill 3: missing values are kept as null.
- Company Industry: replace "N/A" with "Other".
- Overtime Policy: replace blank values with "No OT".
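
The same three rules can also be applied in pandas before the file reaches Power BI. Note that `pd.read_csv` treats strings such as `NULL` and `N/A` as missing by default, which would explain the null counts in the `info()` output above. The sketch below is an assumption about how the Power BI steps translate, using a made-up three-row frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Main Skill 3": ["NodeJS", None, "English"],
    "Company Industry": ["Banking", None, "IT Hardware and Computing"],
    "Overtime Policy": ["No OT", "Extra salary for OT", None],
})

# Main Skill 3 is left untouched; the other two columns get a default label.
cleaned = df.fillna({"Company Industry": "Other", "Overtime Policy": "No OT"})
print(cleaned)
```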

In [81]:
df1_read.head()

| | Posted Date | Main Skill 1 | Main Skill 2 | Main Skill 3 | Company | Workplace | Address | Company Type | Company Industry | Company Size | Country | Working Days | Overtime Policy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-08-29 20:21:20.095789 | IT Support | Scrum | | Renesas Design Vietnam | At office | Ho Chi Minh | IT Product | IT Hardware and Computing | 1000+ | Japan | Monday - Friday | Extra salary for OT |
| 1 | 2024-08-29 20:21:20.095789 | English | IT Support | | Renesas Design Vietnam | Hybrid | Ho Chi Minh | IT Product | IT Hardware and Computing | 1000+ | Japan | Monday - Friday | Extra salary for OT |
| 2 | 2024-08-29 19:21:20.095789 | ReactJS | JavaScript | NodeJS | NAB Innovation Centre Vietnam | Hybrid | Ho Chi Minh | IT Product | Banking | 1000+ | Australia | Monday - Friday | No OT |
| 3 | 2024-08-29 19:21:20.095789 | Java | JavaScript | | Renesas Design Vietnam | Hybrid | Ho Chi Minh | IT Product | IT Hardware and Computing | 1000+ | Japan | Monday - Friday | Extra salary for OT |
| 4 | 2024-08-29 19:21:20.095789 | Project Manager | Japanese | English | FUJIFILM Business Innovation Việt Nam | At office | Ho Chi Minh | IT Product | IT Hardware and Computing | 151-300 | Vietnam | Monday - Friday | Extra salary for OT |


In [82]:
df1_read.duplicated().sum()

21

Duplicate rows are kept. Job titles are not scraped, so two different postings from the same company with the same skills produce identical rows; these duplicates are legitimate records and do not corrupt the data.

### IT Secondary

In [83]:
df2_read = pd.read_csv('IT Secondary in August.csv')

In [84]:
df2_read.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3045 entries, 0 to 3044
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Posted Date       3045 non-null   object
 1   Skill             2889 non-null   object
 2   Company           3045 non-null   object
 3   Workplace         3045 non-null   object
 4   Company Type      3045 non-null   object
 5   Company Industry  2916 non-null   object
 6   Company Size      3045 non-null   object
 7   Country           3045 non-null   object
dtypes: object(8)
memory usage: 190.4+ KB


The following cleaning steps are performed with the Transform Data feature in Power BI:
- Remove rows whose Skill is null.
- Company Industry: replace "N/A" with "Other".

In [87]:
df2_read.head()

| | Posted Date | Skill | Company | Workplace | Company Type | Company Industry | Company Size | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024-08-29 20:21:20.095789 | IT Support | Renesas Design Vietnam | At office | IT Product | IT Hardware and Computing | 1000+ | Japan |
| 1 | 2024-08-29 20:21:20.095789 | English | Renesas Design Vietnam | Hybrid | IT Product | IT Hardware and Computing | 1000+ | Japan |
| 2 | 2024-08-29 19:21:20.095789 | ReactJS | NAB Innovation Centre Vietnam | Hybrid | IT Product | Banking | 1000+ | Australia |
| 3 | 2024-08-29 19:21:20.095789 | Java | Renesas Design Vietnam | Hybrid | IT Product | IT Hardware and Computing | 1000+ | Japan |
| 4 | 2024-08-29 19:21:20.095789 | Project Manager | FUJIFILM Business Innovation Việt Nam | At office | IT Product | IT Hardware and Computing | 151-300 | Vietnam |


In [86]:
df2_read.duplicated().sum()

172

Duplicate rows are kept here as well. Job titles are not scraped, so different postings from the same company can legitimately share the same skill and company details; these duplicates do not corrupt the data.

## 06. Visualization and Presentation of Results <a id="chapter6"></a>

![IT in August-1.png](attachment:4a96f1ff-1618-41d2-a38b-c222aad79904.png)

![IT in August-2.png](attachment:7616c33a-5352-4fe6-82bd-92739e535553.png)

#### Top languages used on GitHub in 2022

![top_programming_lang_github.png](attachment:a24406c3-6118-4ff0-b6ff-b47f324f0a4b.png)

#### Key takeaways
- Comparing "Top 10 Programming Languages in Vietnam" with "Top Languages Used in 2022 on GitHub" shows only minor differences between the two rankings.
- Java, JavaScript, and Python remain the three most popular programming languages.
- Ruby and Shell are not popular in Vietnam.
- Golang holds a notable position in Vietnam.



![IT in August-3.png](attachment:559392a4-93a7-4373-aebc-9cfe301e2068.png)

#### Key takeaways
- English, rather than any programming skill, is the most in-demand skill.
- Web development is still in high demand (ReactJS and NodeJS).
- QA/QC and Testing are also good career paths.
- AWS and Cloud Computing are growing rapidly.
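
Rankings like the one in this chart come down to counting how often each skill appears across postings. A minimal sketch with `collections.Counter` follows. It uses only the five postings shown in the `head()` output above rather than the full dataset, so the counts are illustrative and do not reproduce the chart.

```python
from collections import Counter

# Skills per posting, copied from the five rows of the head() output.
postings = [
    ["IT Support", "Scrum"],
    ["English", "IT Support"],
    ["ReactJS", "JavaScript", "NodeJS"],
    ["Java", "JavaScript"],
    ["Project Manager", "Japanese", "English"],
]

skill_counts = Counter(skill for skills in postings for skill in skills)
for skill, count in skill_counts.most_common(3):
    print(f"{skill}: {count}")
```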

![IT in August-4.png](attachment:6c25ed40-f711-4f5e-898f-faaa395d0d56.png)

#### Key takeaways
- Companies with 1000+ employees account for a significant share of postings.
- Hybrid and remote job opportunities increasingly come from product companies.
- Top industry sectors: IT Services and IT Consulting.

![image.png](attachment:7e9d97d1-af0a-4fda-8a11-6218b897c576.png)

#### Key takeaways
- For hybrid jobs: Proficiency in Java, JavaScript, and Python is recommended.
- Important skills: English, ReactJS, AWS, QA/QC, Cloud, Testing.

![image.png](attachment:27554571-28fd-4454-9018-8f4c5a28428e.png)

#### Key takeaways
- For remote jobs: Proficiency in JavaScript and Python is recommended.
- Important skills: English, ReactJS, AWS, QA/QC, Cloud, Testing.
- Top industries for remote work are Software Products and Web Services.

## 07. Decision Making <a id="chapter7"></a>

#### For applicants
- Use the top programming languages and skills above to narrow down your field and career path.
- Improving your English is advisable.
- If you want a hybrid or remote job, front-end development (ReactJS) is a good choice.
- Across all work arrangements, the Cloud and AWS sectors are growing rapidly.

## 08. Monitoring and Evaluation <a id="chapter8"></a>

### Monitoring
- Update monthly.

### Evaluation

#### Advantages
- The report is presented in real-time.
- Suitable for candidates to study, practice, and orient themselves in the IT industry.

#### Disadvantages
- The dataset is not extensive enough: it covers a single site and a single month.
- Data is not collected continuously.