### Introduction

This Jupyter Notebook presents data that I collected myself by scraping job boards. The job offers come mainly from LinkedIn; in the future, I plan to write a scraper for Indeed as well, in order to diversify the data. For now, with the data I have, I chart the web technologies most commonly used in companies' web projects in EU countries and some Asian countries.

### Import libraries

In [1]:
from pandas import Series, DataFrame, read_csv
from json import loads
import plotly.express as px

### Loading and filtering data

In [2]:
# data dump from MySQL database at 17/13/2022
df = read_csv('jobs_offers.csv', encoding='utf-8')

# parse the JSON strings stored in the database into Python dicts
df.technologies = df.technologies.apply(loads)

# delete unnecessary columns
df.drop(['description', 'company_url', 'date_time',
        'criteria', 'job_offer_url'], axis=1, inplace=True)

# set job_offer_id as default index
df.set_index('job_offer_id', inplace=True)

# technologies filter: keep only offers that list at least one technology
technologies_filter = df.technologies.apply(lambda d: d is not None and len(d) > 0)

# apply technologies filter to df
df = df[technologies_filter]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6116 entries, 1 to 8142
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         6116 non-null   object
 1   location      6114 non-null   object
 2   country       6116 non-null   object
 3   company_name  6116 non-null   object
 4   technologies  6116 non-null   object
dtypes: object(5)
memory usage: 286.7+ KB
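
To make the parsing and filtering step concrete, here is a minimal sketch on a hand-made three-row table. The rows and the `sample` name are invented for illustration and are not taken from `jobs_offers.csv`. The sketch also assumes that an offer with no detected technology is stored either as an empty JSON object or as `null`, which the original dump may or may not do.

```python
from json import loads
from pandas import DataFrame

# Hypothetical rows: the technologies column holds JSON text, as in the CSV dump.
sample = DataFrame({
    'job_offer_id': [1, 2, 3],
    'country': ['FRANCE', 'JAPAN', 'CANADA'],
    'technologies': [
        '{"Main tech": ["Python"], "Python Frameworks": ["Django"]}',
        '{}',      # assumed encoding of "no technology detected"
        'null',    # assumed encoding of a missing value
    ],
}).set_index('job_offer_id')

# Parse each JSON string into a Python object (a dict, or None for "null").
sample['technologies'] = sample['technologies'].apply(loads)

# Keep only the offers that mention at least one technology.
has_technologies = sample['technologies'].apply(
    lambda d: d is not None and len(d) > 0)
sample = sample[has_technologies]

print(sample.index.tolist())
```

Only the first offer survives the filter, because the other two carry no technology information.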


### Creating DataFrames

In [3]:
# 18 European countries selected (EU members plus Norway and Switzerland)
union_european_countries = [
    'FRANCE', 'GERMANY', 'BELGIUM', 'DENMARK', 'ESTONIA',
    'FINLAND', 'GREECE', 'IRELAND', 'ITALY', 'AUSTRIA',
    'LUXEMBOURG', 'NETHERLANDS', 'NORWAY', 'POLAND',
    'PORTUGAL', 'SPAIN', 'SWEDEN', 'SWITZERLAND'
]
# create DataFrame by filtering only countries listed in union_european_countries list
df_eu = df.query('country in @union_european_countries')
df_eu.shape

(3301, 5)

In [4]:
# 5 countries selected for Asia
asia_countries = ['SOUTH KOREA', 'CHINA', 'JAPAN', 'THAILAND', 'SINGAPORE']
df_asia = df.query('country in @asia_countries')
df_asia.shape

(661, 5)

In [5]:
# 3 countries selected for North America
n_america_countries = ['UNITED STATES', 'CANADA', 'MEXICO']
df_n_america = df.query('country in @n_america_countries')
df_n_america.shape

(600, 5)
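
The three cells above repeat the same `query` call with a different list each time. As a possible alternative, the sketch below keeps the country lists in one dictionary and derives a `region` column with `Series.map`. The `regions` dictionary, the `region` column and the toy rows are assumptions made for this example, and the country lists are shortened.

```python
from pandas import DataFrame

# Shortened, hypothetical version of the country lists used above.
regions = {
    'EU': ['FRANCE', 'GERMANY'],
    'Asia': ['JAPAN', 'SINGAPORE'],
    'North America': ['UNITED STATES', 'CANADA'],
}

# Invert the mapping: country name -> region label.
country_to_region = {country: region
                     for region, countries in regions.items()
                     for country in countries}

# Toy data standing in for df (not real job offers).
toy = DataFrame({'country': ['FRANCE', 'JAPAN', 'CANADA', 'BRAZIL', 'GERMANY']})

# Countries missing from the mapping (here BRAZIL) become NaN.
toy['region'] = toy['country'].map(country_to_region)

# One frame per region, similar in spirit to df_eu, df_asia and df_n_america.
# groupby skips rows whose region is NaN by default.
frames = {region: group for region, group in toy.groupby('region')}
print({region: len(group) for region, group in frames.items()})
```

With this layout, adding a region means adding one dictionary entry rather than a new cell.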

### Utility functions

In [6]:
def get_frequencies(df: DataFrame) -> dict:
    """
    For each technology category, return the relative frequency of
    each technology together with the number of job offers that
    mention the category.
    """
    D = {
        "Python Frameworks": [[], 0],
        "PHP Frameworks": [[], 0],
        "JavaScript Frameworks": [[], 0],
        "Main tech": [[], 0],
        "Java Frameworks": [[], 0],
        "Project management": [[], 0],
        "Hosting services": [[], 0],
        "DBMS": [[], 0], "Tests": [[], 0],
        "Other Frameworks": [[], 0],
        "App container": [[], 0],
        "Cloud computing": [[], 0],
        "CMS": [[], 0],
        "Bundlers": [[], 0],
        "Task runners": [[], 0]
    }
    for dictionary in df.technologies:
        for category in D:
            if category in dictionary:
                D[category][0].extend(dictionary[category])
                D[category][1] += 1
    return {k: (Series(v[0]).value_counts(normalize=True), v[1]) 
            for k, v in D.items() if v[1] > 1}
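
To see what the function returns, the sketch below runs the same counting logic on three invented offers. `get_frequencies_demo` is a cut-down copy written for this example: it takes the category list as a parameter instead of hard-coding it. The toy offers are not part of the scraped data.

```python
from pandas import DataFrame, Series


def get_frequencies_demo(df: DataFrame, categories: list) -> dict:
    """Cut-down copy of get_frequencies with a caller-supplied category list."""
    collected = {category: [[], 0] for category in categories}
    for offer in df.technologies:
        for category in categories:
            if category in offer:
                collected[category][0].extend(offer[category])
                collected[category][1] += 1
    # Same filter as the original: categories seen in a single offer are dropped.
    return {k: (Series(v[0]).value_counts(normalize=True), v[1])
            for k, v in collected.items() if v[1] > 1}


# Three invented offers; none of them comes from the scraped data.
toy = DataFrame({'technologies': [
    {'Main tech': ['Python', 'TypeScript'], 'Python Frameworks': ['Django']},
    {'Main tech': ['Python']},
    {'Main tech': ['Java']},
]})

result = get_frequencies_demo(toy, ['Main tech', 'Python Frameworks'])
shares, n_offers = result['Main tech']
print(n_offers)           # number of offers that list a 'Main tech' entry
print(shares['Python'])   # share of 'Python' among all 'Main tech' mentions
```

Two points are worth noting. The shares are proportions of mentions, not of offers, since one offer can list several technologies in the same category. And `'Python Frameworks'` is absent from `result` here, because only one toy offer mentions it and the `v[1] > 1` condition drops it.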

### Analysis of EU data

In [7]:
freq_eu = get_frequencies(df_eu)
freq_eu

{'Python Frameworks': (Django     0.627660
  Flask      0.255319
  FastAPI    0.063830
  Pyramid    0.031915
  Grok       0.021277
  dtype: float64,
  78),
 'PHP Frameworks': (Laravel        0.415888
  Symfony        0.345794
  Zend           0.065421
  Flight         0.056075
  CodeIgniter    0.028037
  Slim           0.028037
  Laminas        0.023364
  Yii            0.023364
  CakePHP        0.009346
  Lumen          0.004673
  dtype: float64,
  156),
 'JavaScript Frameworks': (React          0.513158
  Angular        0.347368
  Vue.js         0.094298
  Next.js        0.020614
  Svelte         0.005702
  Express.js     0.005263
  Ext            0.003509
  Nuxt.js        0.003509
  Nest.js        0.002193
  Ember.js       0.001754
  Aurelia        0.001316
  Meteor         0.000877
  Backbone.js    0.000439
  dtype: float64,
  1679),
 'Main tech': (Java            0.219621
  TypeScript      0.192793
  C#              0.170700
  Python          0.130195
  PHP             0.086796
  

### Analysis of Asia data

In [8]:
freq_asia = get_frequencies(df_asia)
freq_asia

{'Python Frameworks': (Django     0.48
  Flask      0.36
  FastAPI    0.12
  Falcon     0.04
  dtype: float64,
  17),
 'PHP Frameworks': (Laravel        0.555556
  Flight         0.111111
  CakePHP        0.111111
  CodeIgniter    0.111111
  Symfony        0.111111
  dtype: float64,
  15),
 'JavaScript Frameworks': (React          0.564706
  Angular        0.267647
  Next.js        0.067647
  Vue.js         0.058824
  Nest.js        0.017647
  Express.js     0.008824
  Ext            0.002941
  Svelte         0.002941
  Nuxt.js        0.002941
  Ember.js       0.002941
  Backbone.js    0.002941
  dtype: float64,
  235),
 'Main tech': (Java            0.208333
  Python          0.180208
  TypeScript      0.145833
  C++             0.122917
  C#              0.100000
  PHP             0.057292
  Kotlin          0.050000
  Ruby            0.033333
  Scala           0.029167
  Swift           0.025000
  XML             0.020833
  SCSS            0.011458
  Rust            0.007292
  Object

### Analysis of North America data

In [9]:
freq_n_america = get_frequencies(df_n_america)
freq_n_america

{'Python Frameworks': (Django     0.800000
  FastAPI    0.133333
  Flask      0.066667
  dtype: float64,
  15),
 'PHP Frameworks': (Laravel        0.5625
  Flight         0.1250
  Zend           0.1250
  Yii            0.1250
  CodeIgniter    0.0625
  dtype: float64,
  14),
 'JavaScript Frameworks': (React          0.603491
  Angular        0.299252
  Vue.js         0.057357
  Next.js        0.012469
  Express.js     0.007481
  Svelte         0.007481
  Nuxt.js        0.002494
  Aurelia        0.002494
  Ext            0.002494
  Backbone.js    0.002494
  Ember.js       0.002494
  dtype: float64,
  289),
 'Main tech': (Java            0.233509
  Python          0.163588
  TypeScript      0.133245
  C#              0.118734
  C++             0.097625
  PHP             0.084433
  XML             0.035620
  Ruby            0.034301
  Kotlin          0.026385
  Swift           0.026385
  SCSS            0.019789
  Objective-C     0.011873
  Scala           0.007916
  XHTML           0.0039

### Data volume by country

In [10]:
freq_countries_dict = df.country.value_counts().to_dict()
fig_countries_freq = px.bar(
    x=freq_countries_dict.keys(),
    y=freq_countries_dict.values(),
    labels=dict(y='Number of job offers', x='Country'),
    template='plotly_dark',
    title='Data volume by country'
)
fig_countries_freq.show()
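
The regional results are easier to compare once they sit in a single table. The sketch below is one way to do that, assuming dictionaries shaped like the output of `get_frequencies`. The `compare_regions` helper is hypothetical, and the values are the top three JavaScript framework shares copied from the outputs above, with the other frameworks left out for brevity.

```python
from pandas import DataFrame, Series

# Top three JavaScript framework shares copied from the outputs above.
freq_eu = {'JavaScript Frameworks': (
    Series({'React': 0.513158, 'Angular': 0.347368, 'Vue.js': 0.094298}), 1679)}
freq_asia = {'JavaScript Frameworks': (
    Series({'React': 0.564706, 'Angular': 0.267647, 'Vue.js': 0.058824}), 235)}
freq_n_america = {'JavaScript Frameworks': (
    Series({'React': 0.603491, 'Angular': 0.299252, 'Vue.js': 0.057357}), 289)}


def compare_regions(category: str, **freqs: dict) -> DataFrame:
    """Hypothetical helper: one column of shares per region for a category."""
    return DataFrame({region: freq[category][0]
                      for region, freq in freqs.items()})


comparison = compare_regions(
    'JavaScript Frameworks',
    EU=freq_eu, Asia=freq_asia, North_America=freq_n_america)
print(comparison.round(3))
```

Each column holds the shares for one region, aligned on the framework name, so a framework missing from one region would show up as `NaN` in that column.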