### Introduction

This Jupyter Notebook presents data that I collected from job boards by web scraping. The job offers currently come mainly from LinkedIn; in the future I will add a scraper for the Indeed website to diversify the data. With the data collected so far, I chart the web technologies most commonly listed in job offers for company web projects in EU countries and in some Asian countries.

### Import libraries

In [20]:
from pandas import Series, DataFrame, read_csv
from json import loads
import plotly.express as px

### Loading and filtering data

In [21]:
# data dump from the MySQL database, taken on 12/13/2022
df = read_csv('jobs_offers.csv', encoding='utf-8')

# the technologies column holds JSON arrays stored as strings; parse each one into a Python list
df['technologies'] = df['technologies'].apply(loads)

# set job_offer_id as default index
df.set_index('job_offer_id', inplace=True)

# delete unnecessary columns
df.drop(['description', 'company_url', 'date_time',
        'criteria', 'job_offer_url'], axis=1, inplace=True)

# boolean mask: True for job offers that list at least one technology
technologies_filter = df.technologies.apply(len) > 0

# keep only the rows selected by the mask
df = df[technologies_filter]
df.shape

(967, 5)
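
As a minimal illustration of the parse-and-filter step above, the sketch below uses only the standard library on a few made-up rows. The sample values and the names `rows`, `parsed` and `kept` are assumptions for illustration; they are not taken from `jobs_offers.csv`.

```python
from json import loads

# hypothetical rows: job_offer_id -> technologies column as stored (JSON text)
rows = {
    101: '["React", "TypeScript"]',
    102: '[]',
    103: '["Python"]',
}

# parse each JSON string into a Python list
parsed = {offer_id: loads(raw) for offer_id, raw in rows.items()}

# keep only the offers that list at least one technology
kept = {offer_id: techs for offer_id, techs in parsed.items() if len(techs) > 0}

print(kept)
```

Offer 102 is dropped because its technology list is empty, which is the same rule the notebook applies with `technologies_filter`.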

### Creating DataFrames

In [22]:
# 18 European countries selected: 15 EU member states plus Iceland, Norway and Switzerland (not EU members)
union_european_countries = [
    'FRANCE', 'GERMANY', 'BELGIUM', 'DENMARK', 'ESTONIA',
    'FINLAND', 'GREECE', 'ICELAND', 'IRELAND', 'ITALY',
    'LUXEMBOURG', 'NETHERLANDS', 'NORWAY', 'POLAND',
    'PORTUGAL', 'SPAIN', 'SWEDEN', 'SWITZERLAND',
]
# keep only the job offers whose country appears in union_european_countries
df_eu = df.query('country in @union_european_countries')
df_eu.shape

(545, 5)

In [23]:
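# 3 countries selected for Asia
# NOTE: values must match the country column exactly; 'CHINESE' may be meant as 'CHINA', check against df.country.unique()
asia_countries = ['KOREA', 'CHINESE', 'JAPAN']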
df_asia = df.query('country in @asia_countries')
df_asia.shape

(35, 5)
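
The country selection above relies on `DataFrame.query` resolving `@name` references from the surrounding Python scope. The sketch below shows this on a small made-up frame and compares it with an `isin` mask; the `sample` data and the variable names are assumptions for illustration only.

```python
from pandas import DataFrame

# hypothetical sample, not taken from the dataset
sample = DataFrame({
    'country': ['FRANCE', 'JAPAN', 'CANADA', 'GERMANY'],
    'title': ['Front-end dev', 'Back-end dev', 'Full-stack dev', 'DevOps'],
})
selected = ['FRANCE', 'GERMANY']

# query() looks up @selected in the enclosing Python scope
by_query = sample.query('country in @selected')

# the same selection written as a boolean mask with isin()
by_isin = sample[sample['country'].isin(selected)]

print(by_query.equals(by_isin))
```

Both forms select the same rows; `query` reads closer to SQL, while `isin` avoids parsing an expression string.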

### Utility functions

In [24]:
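def get_freq(S: Series, columns: list = None) -> DataFrame:
    # avoid a mutable default argument
    columns = columns or ['Frequencies']
    # flatten the per-offer technology lists into one list of mentions
    L = [tech for techs in S for tech in techs]
    # share of each technology among all mentions (not among job offers)
    freq = Series(L).value_counts(normalize=True)
    # name the column explicitly, whatever name value_counts assigns to its result
    return freq.to_frame(name=columns[0])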


def get_figure(df: DataFrame, n_rows: int, title: str, labels: dict):
    return px.bar(
        df.head(n_rows),
        title=title,
        template='plotly_dark',
        labels=labels
    )
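
The flattening done by `get_freq` can also be expressed with `Series.explode`, which emits one row per list element. This is only a sketch of an alternative on made-up data, not the notebook's own implementation; the `technologies` sample is an assumption.

```python
from pandas import Series

# hypothetical sample: one list of technologies per job offer
technologies = Series([['React', 'Git'], ['React'], ['Python']])

# explode() emits one row per list element, so no manual loop is needed
mentions = technologies.explode()

# share of each technology among all mentions
freq = mentions.value_counts(normalize=True).to_frame(name='Frequencies')

print(freq)
```

With four mentions in total, React accounts for two of them, so its frequency is 0.5.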

In [25]:
freq_technologies_eu = get_freq(df_eu.technologies)
# print top 10 for df_eu
freq_technologies_eu.head(10)

| Technology | Frequencies |
|---|---|
| React | 0.09981 |
| Angular | 0.07557 |
| TypeScript | 0.05846 |
| Java | 0.057034 |
| Git | 0.056559 |
| C# | 0.056084 |
| .NET | 0.052757 |
| Azure | 0.039449 |
| Docker | 0.037548 |
| Python | 0.035646 |
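
Because `get_freq` normalizes over the flattened list, these values are shares of technology mentions: React's 0.09981 means roughly 10% of all mentions in the selected offers, not 10% of the offers. The sketch below contrasts the two measures on made-up data using only the standard library; the `offers` sample and all variable names are assumptions for illustration.

```python
from collections import Counter

# hypothetical sample: technologies listed by four job offers
offers = [['React', 'Git'], ['React', 'Docker', 'Git'], ['Python'], ['React']]

# count every mention across all offers
mention_counts = Counter(tech for techs in offers for tech in techs)
total_mentions = sum(mention_counts.values())

# share of mentions: what value_counts(normalize=True) yields on the flattened list
mention_share = {t: c / total_mentions for t, c in mention_counts.items()}

# share of offers: fraction of offers that list the technology at least once
offer_share = {
    t: sum(t in set(techs) for techs in offers) / len(offers)
    for t in mention_counts
}

print(mention_share['React'], offer_share['React'])
```

In this sample React makes up 3 of 7 mentions but appears in 3 of 4 offers, so the two measures can differ substantially.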


In [26]:
fig_technologies_freq_eu = get_figure(
    freq_technologies_eu, 30,
    f"Top 30 technologies used in EU web projects, number of job offers: {df_eu.shape[0]}",
    dict(value='Frequency', index='Technology')
)
fig_technologies_freq_eu.show()

In [27]:
freq_technologies_asia = get_freq(df_asia.technologies)
# print top 10 for df_asia
freq_technologies_asia.head(10)

| Technology | Frequencies |
|---|---|
| Python | 0.080247 |
| TypeScript | 0.074074 |
| React | 0.067901 |
| Java | 0.061728 |
| Docker | 0.061728 |
| GitHub | 0.055556 |
| Node.js | 0.049383 |
| AWS | 0.04321 |
| Git | 0.037037 |
| PostgreSQL | 0.030864 |


In [28]:
fig_technologies_freq_asia = get_figure(
    freq_technologies_asia, 30,
    f"Top 30 technologies used in Asia web projects, number of job offers: {df_asia.shape[0]}",
    dict(value='Frequency', index='Technology')
)
fig_technologies_freq_asia.show()

In [29]:
freq_countries = df.country.value_counts(normalize=True)
freq_countries.head(10)

POLAND           0.047570
BELGIUM          0.046536
UNITED STATES    0.045502
CANADA           0.045502
GERMANY          0.044467
SWEDEN           0.043433
TURKEY           0.041365
ITALY            0.041365
FRANCE           0.041365
NETHERLANDS      0.037229
Name: country, dtype: float64
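
Since these are normalized frequencies, multiplying by the number of rows recovers approximate offer counts per country. The sketch below assumes the frequencies were computed on the filtered 967-row `df` reported earlier; `freq_sample` and `counts` are illustrative names, and only three of the values above are copied in.

```python
# filtered DataFrame size reported earlier in the notebook: df.shape == (967, 5)
n_offers = 967

# a few normalized frequencies copied from the output above
freq_sample = {'POLAND': 0.047570, 'BELGIUM': 0.046536, 'GERMANY': 0.044467}

# multiply back by the number of rows and round to the nearest whole offer
counts = {country: round(freq * n_offers) for country, freq in freq_sample.items()}

print(counts)
```

Calling `df.country.value_counts()` without `normalize=True` would give these counts directly.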