# Рынок вакансий дата- и BI-аналитиков в Европе (по данным Linkedln)

## Описание проекта ##

### Цель ###

Подготовка дашборда с визуалицацией информации о рынке вакансий дата- и BI-аналитиков в Европе.

### Источник данных ###

Информация о вакансиях за неделю в виде csv-файла, полученного парсингом с Linkedln 07.09.2022г.

### Ход выполнения ###

1. Получение из представленного csv-файла необходимых атрибутов:
    - наименование вакансии;
    - город;
    - страна;
    - тип занятости;
    - название компании;
    - количество работников в компании;
    - сфера деятельности компании;
    - требуемые hard-skills;
    - дата публикации вакансии;
    - количество кандидатов на вакансию.
2. Предобработка данных:
    - фильтрация датафрейма по релевантным вакансиям и атрибутам;
    - обработка пропусков и дубликатов.
3. Визуализация данных:
    - построение дашборда в Tableau.
    - содержание дашборда:
        - фильтры: по стране, по типу занятости;
        - количество вакансий (абсолютные значения) – индикатор;
        - количество вакансий по странам (относительные величины) — stack bar chart;
        - количество вакансий по городам - map;
        - тип занятости — pie chart;
        - список нанимающих компаний с указанием количества вакансий, отсортированный в порядке убывания — heat map;
        - ТОП-10 сфер деятельности компаний — barchart;
        - размер компаний и количество вакансий — pie chart;
        - требуемые hard-skills — barchart;
        - зависимость количества кандидатов на вакансию от даты публикации объявления — линейный график.

## Получение из csv-файла необходимых атрибутов ##

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns
import re
from datetime import datetime
pd.set_option('display.float_format', '{:,.2f}'.format)

In [2]:
df = pd.read_csv('D:\Агарев\Yandex_Practicum\Мастерская\masterskaya_yandex_2022_09_07.csv')

### Наименование вакансии

In [3]:
df['title'] = df['html'].apply(lambda x:  BeautifulSoup(x).find('h2').text.strip())

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil..."
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst


### Город

In [5]:
def get_geo(cell):
    try:
        return BeautifulSoup(cell).find('span', class_ = 'jobs-unified-top-card__bullet').text.strip()
    except:
        return np.nan

In [6]:
df['geo'] = df['html'].apply(get_geo)

In [7]:
def get_city(cell):
    if len(cell.split(',')) > 1:
        return cell.split(',')[0].strip()
    elif "Metropolitan" in cell or "Greater" in cell:
        return cell.replace('Greater', '').replace('Metropolitan', '').replace('Area', '').replace('Region','').strip()
    else:
        return np.nan

In [8]:
df['city'] = df['geo'].apply(get_city)

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania


### Страна

In [10]:
def get_country(cell):
    if len(cell.split(',')) > 1:
        return cell.split(',')[-1].strip()
    elif "Metropolitan" in cell or "Greater" in cell or "Region" in cell:
        return np.nan
    else:
        return cell

In [11]:
df['country'] = df['geo'].apply(get_country)

In [12]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany


### Тип занятости

In [13]:
def get_workplace_type(cell):
    try:
        return BeautifulSoup(cell).find('span', class_ = 'jobs-unified-top-card__workplace-type').text.strip()
    except:
        return np.nan

In [14]:
df['workplace_type'] = df['html'].apply(get_workplace_type)

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote


### Название компании

In [16]:
def company_name(cell):
    try:
        return BeautifulSoup(cell).find('span', class_ = 'jobs-unified-top-card__company-name').text.strip()
    except:
        return np.nan

In [17]:
df['company_name'] = df['html'].apply(company_name)

In [18]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions


### Количество работников в компании

In [19]:
def employees(cell):
    try:
        return BeautifulSoup(cell).find('div', class_ = 'mt5 mb2').find_all(
            'li', class_ = 'jobs-unified-top-card__job-insight')[1].text.strip().split('·', 1)[0]
    except:
        return np.nan

In [20]:
df['employees'] = df['html'].apply(employees)

In [21]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès,"10,001+ employees"
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY,"5,001-10,000 employees"
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees"
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees"
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions,"10,001+ employees"


### Сфера деятельности компании

In [22]:
def company_specialization(cell):
    try:
        return BeautifulSoup(cell).find('div', class_ = 'mt5 mb2').find_all(
            'li', class_ = 'jobs-unified-top-card__job-insight')[1].text.strip().split('·', 1)[1]
    except:
        return np.nan

In [23]:
df['company_specialization'] = df['html'].apply(company_specialization)

In [24]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees,company_specialization
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès,"10,001+ employees",Retail Luxury Goods and Jewelry
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY,"5,001-10,000 employees",IT Services and IT Consulting
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting


### Требуемые hard-skills

In [25]:
df['description'] = df['html'].apply(lambda x: BeautifulSoup(x).find('div', {'id':'job-details'}).text.strip())

In [26]:
skills = (['datahub', 'api', 'github', 'google analytics', 'adobe analytics', 'ibm coremetrics', 'omniture'
            'gitlab', 'erwin', 'hadoop', 'spark', 'hive'
           'databricks', 'aws', 'gcp', 'azure','excel',
            'redshift', 'bigquery', 'snowflake',  'hana'
            'grafana', 'kantar', 'spss', 
           'asana', 'basecamp', 'jira', 'dbeaver','trello', 'miro', 'salesforce', 
           'rapidminer', 'thoughtspot',  'power point',  'docker', 'jenkins','integrate.io', 'talend', 'apache nifi','aws glue','pentaho','google data flow',
             'azure data factory','xplenty','skyvia','iri voracity','xtract.io','dataddo', 'ssis',
             'hevo data','informatica','oracle data integrator','k2view','cdata sync','querysurge', 
             'rivery', 'dbconvert', 'alooma', 'stitch', 'fivetran', 'matillion','streamsets','blendo',
             'iri voracity','logstash', 'etleap', 'singer', 'apache camel','actian', 'airflow', 'luidgi', 'datastage',
           'python', 'vba', 'scala', ' r ', 'java script', 'julia', 'sql', 'matlab', 'java', 'html', 'c++', 'sas',
           'data studio', 'tableau', 'looker', 'powerbi', 'cognos', 'microstrategy', 'spotfire',
             'sap business objects','microsoft sql server', 'oracle business intelligence', 'yellowfin',
             'webfocus','sas visual analytics', 'targit', 'izenda',  'sisense', 'statsbot', 'panorama', 'inetsoft',
             'birst', 'domo', 'metabase', 'redash', 'power bi', 'alteryx', 'dataiku', 'qlik sense', 'qlikview'
          ]) 

In [27]:
def get_skills(cell):
    list_skills = []
    for skill in skills:
        if skill in cell.lower().replace('powerbi', 'power bi'):
            list_skills.append(skill)
    return list_skills

In [28]:
df['skills'] = df.description.apply(get_skills)

In [29]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees,company_specialization,description,skills
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès,"10,001+ employees",Retail Luxury Goods and Jewelry,"LA SOCIETE : \nCréée en 1926, la société Beyra...","[api, excel]"
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY,"5,001-10,000 employees",IT Services and IT Consulting,Dans le cadre de la croissance de nos activité...,[matlab]
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[]
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[]
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[]


### Дата публикации вакансии

In [30]:
def publication_date(cell):
    try:
        return BeautifulSoup(cell).find('span', class_ = 'jobs-unified-top-card__posted-date').text.strip()
    except:
        return np.nan

In [31]:
df['publication_date'] = df['html'].apply(publication_date)

In [32]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees,company_specialization,description,skills,publication_date
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès,"10,001+ employees",Retail Luxury Goods and Jewelry,"LA SOCIETE : \nCréée en 1926, la société Beyra...","[api, excel]",13 minutes ago
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY,"5,001-10,000 employees",IT Services and IT Consulting,Dans le cadre de la croissance de nos activité...,[matlab],4 days ago
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],6 days ago
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],6 days ago
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],8 hours ago


### Количество кандидатов на вакансию

In [33]:
def applicants(cell):
    try:
        return BeautifulSoup(cell).find('span', class_ = 'jobs-unified-top-card__applicant-count').text.strip().split(' ', 1)[0]
    except:
        return np.nan

In [34]:
df['applicants'] = df['html'].apply(applicants)

In [35]:
df.head()

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees,company_specialization,description,skills,publication_date,applicants
0,0,"\n <div>\n <div class=""\n jobs-deta...",Stage - Assistant Ingénieur Qualité - Beyrand ...,"Limoges, Nouvelle-Aquitaine, France",Limoges,France,On-site,Hermès,"10,001+ employees",Retail Luxury Goods and Jewelry,"LA SOCIETE : \nCréée en 1926, la société Beyra...","[api, excel]",13 minutes ago,
1,1,"\n <div>\n <div class=""\n jobs-deta...","développeur matlab/simulink, secteur automobil...","Toulouse, Occitanie, France",Toulouse,France,On-site,AUSY,"5,001-10,000 employees",IT Services and IT Consulting,Dans le cadre de la croissance de nos activité...,[matlab],4 days ago,6.0
2,2,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],6 days ago,12.0
3,3,"\n <div>\n <div class=""\n jobs-deta...",Online Data Analyst - Belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],6 days ago,11.0
4,4,"\n <div>\n <div class=""\n jobs-deta...",Data Analyst,"Mecklenburg-West Pomerania, Germany",Mecklenburg-West Pomerania,Germany,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,[],8 hours ago,2.0


In [36]:
df['link'] = df['html'].apply(lambda x: "https://linkedin.com" + BeautifulSoup(x).find('a').get('href'))
#получаем ссылки на вакансий, для последующей группировке по ним в процессе визуализации

## Предобработка данных

### Фильтрация по релевантным вакансиям

Прежде чем приступать к предобработке данных отфильтруем все нерелевантные вакансии. 

In [37]:
title_filter = re.compile('''(analyst)|(bi-analyst)|(bi analyst)|(business intelligence analyst)
|(business intelligence-analyst)|(data-analyst)|(data analyst)|(product analyst)|(product-analyst)''', re.X) #фильтр

df['title'] = df['title'].str.lower() #приведение всех названий к нижнему регистру

df_analyst = df[df['title'].str.contains(title_filter)].reset_index(drop=True)

  df_analyst = df[df['title'].str.contains(title_filter)].reset_index(drop=True)


In [38]:
df_analyst = df_analyst.explode('skills') #разбивка скиллов на отдельные строки

In [39]:
df_analyst['title'].unique()

array(['online data analyst', 'online data analyst - belgium',
       'data analyst', 'alternant/ alternante data analyst m/f',
       'junior test analyst', 'data analyst h/f',
       'stage - data analyst (h/f)', 'data analyst (f/h)',
       'data analyst (tableau)', 'online data analyst | french speaker',
       'online data analyst | flexible work',
       'online data analyst | remote opportunity', 'data analyst (it)',
       'data analyst (m/f/d)', 'data analyst (m/w/d)',
       'remote | data analyst', 'data analyst sa1/sa2',
       'data analyst | deals (m&a) | cdi | h/f',
       'data analyst - adsales & storyworks',
       'data analyst with focus on solution design (m/f/d)',
       'remote| data analyst', 'data analyst - (m/f/d)',
       'alternance - data analyst/dataviz specialist h/f',
       'data analyst  - boursorama',
       'technology strategy & advisory junior analyst',
       'data analyst in forensic technology services team',
       'junior business analyst', 'a

Уберём из полученного датафрейма одну из специальностей явно не связанную с дата-аналитикой - "pressure analyst".

In [40]:
df_analyst = df_analyst.query('title != "pressure analyst"')

In [41]:
def information(df): #функция для получения общих сведений о датафрейме
    print('\033[1m' + 'Общая информация:' + '\033[0m')
    df.info()
    print()
    print('\033[1m' + 'Первые 2 строки:' + '\033[0m')
    display(df.head(2))
    print('\033[1m' + 'Количество дубликатов:' + '\033[0m', df.duplicated().sum())
    print()
    print('\033[1m' + 'Количество пропусков:' + '\033[0m')
    display(df.isna().sum())
    print('\033[1m' + 'Доля пропусков:' + '\033[0m')
    print(df.isna().sum() / len(df))

In [42]:
information(df_analyst)

[1mОбщая информация:[0m
<class 'pandas.core.frame.DataFrame'>
Int64Index: 974 entries, 0 to 353
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Unnamed: 0              974 non-null    int64 
 1   html                    974 non-null    object
 2   title                   974 non-null    object
 3   geo                     974 non-null    object
 4   city                    935 non-null    object
 5   country                 914 non-null    object
 6   workplace_type          772 non-null    object
 7   company_name            974 non-null    object
 8   employees               971 non-null    object
 9   company_specialization  904 non-null    object
 10  description             974 non-null    object
 11  skills                  873 non-null    object
 12  publication_date        974 non-null    object
 13  applicants              875 non-null    object
 14  link                    974 non-

Unnamed: 0.1,Unnamed: 0,html,title,geo,city,country,workplace_type,company_name,employees,company_specialization,description,skills,publication_date,applicants,link
0,2,"\n <div>\n <div class=""\n jobs-deta...",online data analyst,"Skara, Vastra Gotaland County, Sweden",Skara,Sweden,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,,6 days ago,12,https://linkedin.com/jobs/view/3248499929/?alt...
1,3,"\n <div>\n <div class=""\n jobs-deta...",online data analyst - belgium,"West Flanders, Flemish Region, Belgium",West Flanders,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,TELUS International AI-Data Solutions partners...,,6 days ago,11,https://linkedin.com/jobs/view/3248879065/?alt...


[1mКоличество дубликатов:[0m 0

[1mКоличество пропусков:[0m


Unnamed: 0                  0
html                        0
title                       0
geo                         0
city                       39
country                    60
workplace_type            202
company_name                0
employees                   3
company_specialization     70
description                 0
skills                    101
publication_date            0
applicants                 99
link                        0
dtype: int64

[1mДоля пропусков:[0m
Unnamed: 0               0.00
html                     0.00
title                    0.00
geo                      0.00
city                     0.04
country                  0.06
workplace_type           0.21
company_name             0.00
employees                0.00
company_specialization   0.07
description              0.00
skills                   0.10
publication_date         0.00
applicants               0.10
link                     0.00
dtype: float64


В получившемся датафрейме отсутствуют дубликаты, но есть пропуски в нескольких столбцах. Также есть лишние столбцы, а данные о дате размещения вакансии необходимо привести к соответствующему виду, чтобы они были представлены реальной датой, а не временным промежутком с момента публикации до момента парсинга.

In [43]:
df_analyst = df_analyst.drop(['Unnamed: 0', 'html', 'geo', 'description'], axis=1) #удаление ненужных столбцов

### Получение даты публикации вакансии

In [44]:
df_analyst['publication_date'].unique()

array(['6 days ago', '8 hours ago', '5 days ago', '1 day ago',
       '2 days ago', '7 hours ago', '16 hours ago', '3 hours ago',
       '23 hours ago', '2 hours ago', '15 hours ago', '19 hours ago',
       '21 hours ago', '9 hours ago', '14 hours ago', '4 hours ago',
       '3 days ago', '4 days ago', '6 hours ago', '18 hours ago',
       '11 hours ago', '10 minutes ago', '5 hours ago', '1 week ago',
       '10 hours ago', '22 hours ago'], dtype=object)

In [45]:
df_analyst['time'] = df_analyst['publication_date'].str.split(' ').str.get(0)
df_analyst['time_units'] = df_analyst['publication_date'].str.split(' ').str.get(1)
df_analyst['time'] = df_analyst['time'].astype('int')
#разбиваем информацию о временном промежутке на два столбца, и приводим количество единиц времени к целочисленному формату

In [46]:
def time_to_unix(row): #функция для перевода единиц времени в формат unix
    if row['time_units'] == 'day' or row['time_units'] == 'days':
        return row['time'] * 86400
    elif row['time_units'] == 'hours':
        return row['time'] * 3600
    elif row['time_units'] == 'minutes':
        return row['time'] * 60
    else:
        return row['time'] * 604800

In [47]:
df_analyst['publication_date'] = df_analyst.apply(time_to_unix, axis=1)

df_analyst['publication_date'] = 1662570000 - df_analyst['publication_date'] #получение даты публикации в формате unix

In [48]:
df_analyst['publication_date'] = df_analyst['publication_date'].apply(
    lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d')) #перевод даты публикации из формата unix в datetime

In [49]:
df_analyst = df_analyst.drop(['time', 'time_units'], axis=1) #удаление побочных столбцов

### Обработка пропусков и аномальных значений

In [50]:
df_analyst[df_analyst['city'].isna()]

Unnamed: 0,title,city,country,workplace_type,company_name,employees,company_specialization,skills,publication_date,applicants,link
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,api,2022-09-06,30.0,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,excel,2022-09-06,30.0,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,python,2022-09-06,30.0,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,sql,2022-09-06,30.0,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,tableau,2022-09-06,30.0,https://linkedin.com/jobs/view/3248368719/?alt...
13,data analyst,,Poland,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,,2022-09-07,9.0,https://linkedin.com/jobs/view/3257231590/?alt...
32,online data analyst - belgium,,Belgium,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,,2022-09-01,11.0,https://linkedin.com/jobs/view/3248871894/?alt...
66,internet analyst,,Poland,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,,2022-09-07,7.0,https://linkedin.com/jobs/view/3249290419/?alt...
80,data analyst,,Poland,Remote,TELUS International AI Data Solutions,"10,001+ employees",IT Services and IT Consulting,,2022-09-02,9.0,https://linkedin.com/jobs/view/3249557969/?alt...
96,online data analyst,,Finland,Remote,TELUS International,"10,001+ employees",IT Services and IT Consulting,,2022-09-01,15.0,https://linkedin.com/jobs/view/3248881314/?alt...


Все вакансии для которых не указан город, за исключением одной, подразумевают удаленный формат работы, поэтому указывать для них город дислокации нет смысла, ещё одна вакансия находится в Гибраалтаре который сам по себе фактически является городом-государством, т.е. замена пропусков в данном случае не требуется.

In [51]:
df_analyst[df_analyst['country'].isna()]['city'].unique()

array(['Przemyśl', 'Grudziadz', 'Norrköping', 'Radom', 'Zamosc',
       'Edinburgh', 'Lodz', 'Mons', 'Milan', 'Bologna', 'Ghent', 'Bruges',
       'Gijón', 'Kortrijk', 'Rome', 'Zurich', 'Namur', 'Barcelona',
       'Turin', 'Brussels', 'Paris', 'Antwerp', 'Warsaw', 'Genoa',
       'Liege', 'Louvain', 'Athens'], dtype=object)

In [52]:
def country(row): #функция для замены пропусков в стране по городу
    if row['city'] in ['Przemyśl', 'Grudziadz', 'Radom', 'Zamosc', 'Lodz', 'Warsaw']:
        return 'Poland'
    elif row['city'] in ['Norrköping']:
        return 'Sweden'
    elif row['city'] in ['Edinburgh']:
        return 'Scotland'
    elif row['city'] in ['Mons', 'Ghent', 'Bruges', 'Kortrijk', 'Namur', 'Brussels', 'Antwerp', 'Liege', 'Louvain']:
        return 'Belgium'
    elif row['city'] in ['Milan', 'Bologna', 'Rome', 'Turin', 'Genoa']:
        return 'Italy'
    elif row['city'] in ['Gijón', 'Barcelona']:
        return 'Spain'
    elif row['city'] in ['Zurich']:
        return 'Switzerland'
    elif row['city'] in ['Paris']:
        return 'France'
    elif row['city'] in ['Athens']:
        return 'Greece'
    else:
        return row['country']

In [53]:
df_analyst['country'] = df_analyst.apply(country, axis=1)

In [54]:
df_analyst['country'].isna().sum()

0

In [55]:
df_analyst[df_analyst['workplace_type'].isna()]

Unnamed: 0,title,city,country,workplace_type,company_name,employees,company_specialization,skills,publication_date,applicants,link
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,api,2022-09-06,30,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,excel,2022-09-06,30,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,python,2022-09-06,30,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,sql,2022-09-06,30,https://linkedin.com/jobs/view/3248368719/?alt...
10,data analyst (tableau),,Gibraltar,,Guardian Jobs,51-200 employees,Staffing and Recruiting,tableau,2022-09-06,30,https://linkedin.com/jobs/view/3248368719/?alt...
...,...,...,...,...,...,...,...,...,...,...,...
311,systems data analyst,London,United Kingdom,,dns umbrella,1-10 employees,Financial Services,power bi,2022-09-01,,https://linkedin.com/jobs/view/3256666230/?alt...
321,online data analyst - hungary,Budapest,Hungary,,TELUS International,"10,001+ employees",IT Services and IT Consulting,,2022-09-07,36,https://linkedin.com/jobs/view/3156985237/?alt...
348,junior business analyst,Hartlepool,United Kingdom,,Paul Gough Media LLC,11-50 employees,Advertising Services,api,2022-09-02,38,https://linkedin.com/jobs/view/3247376375/?alt...
348,junior business analyst,Hartlepool,United Kingdom,,Paul Gough Media LLC,11-50 employees,Advertising Services,ssis,2022-09-02,38,https://linkedin.com/jobs/view/3247376375/?alt...


Подходящей замены для пропусков типа рабочего места нет, поэтому заменим эти пропуски заглушками 'unknown'.

In [56]:
df_analyst['workplace_type'] = df_analyst['workplace_type'].fillna('unknown')

In [57]:
df_analyst['workplace_type'].isna().sum()

0

In [58]:
df_analyst[df_analyst['company_specialization'].isna()]

Unnamed: 0,title,city,country,workplace_type,company_name,employees,company_specialization,skills,publication_date,applicants,link
148,data analyst (product & operations) - sustaina...,Paris,France,Hybrid,Greenly,51-200 employees,,api,2022-09-06,40,https://linkedin.com/jobs/view/3249141676/?alt...
148,data analyst (product & operations) - sustaina...,Paris,France,Hybrid,Greenly,51-200 employees,,excel,2022-09-06,40,https://linkedin.com/jobs/view/3249141676/?alt...
148,data analyst (product & operations) - sustaina...,Paris,France,Hybrid,Greenly,51-200 employees,,python,2022-09-06,40,https://linkedin.com/jobs/view/3249141676/?alt...
148,data analyst (product & operations) - sustaina...,Paris,France,Hybrid,Greenly,51-200 employees,,scala,2022-09-06,40,https://linkedin.com/jobs/view/3249141676/?alt...
148,data analyst (product & operations) - sustaina...,Paris,France,Hybrid,Greenly,51-200 employees,,sql,2022-09-06,40,https://linkedin.com/jobs/view/3249141676/?alt...
...,...,...,...,...,...,...,...,...,...,...,...
303,application data analyst hybrid working,Birmingham,United Kingdom,unknown,,IT Services and IT Consulting,,sql,2022-09-02,,https://linkedin.com/jobs/view/3249913331/?alt...
319,interim hr data analyst,Worcester,United Kingdom,On-site,Ashley Kate HR &amp; Finance,See recent hiring trends for Ashley Kate HR &a...,,excel,2022-09-01,1,https://linkedin.com/jobs/view/3247334292/?alt...
319,interim hr data analyst,Worcester,United Kingdom,On-site,Ashley Kate HR &amp; Finance,See recent hiring trends for Ashley Kate HR &a...,,ssis,2022-09-01,1,https://linkedin.com/jobs/view/3247334292/?alt...
343,traffic analyst,,Croatia,Remote,VOX SOLUTIONS,51-200 employees,,excel,2022-09-06,10,https://linkedin.com/jobs/view/3253788059/?alt...


In [59]:
df_analyst[df_analyst['company_specialization'].isna()]['company_name'].unique()

array(['Greenly', 'Landeskriminalamt Nordrhein-Westfalen',
       'M&L Aktiengesellschaft', 'Expectoo Nigeria', 'Software * IT',
       'BeTechnology Group', 'HR Proactivity',
       'Randstad Tech Engineering', 'Web Leaders - Winning People',
       'Quantum Advisory', 'Mec-Diesel S.p.A.', 'Viceversa',
       'Birnbach Communications', 'www.TeamQuest.pl', 'Tefors', '',
       'Ashley Kate HR &amp; Finance', 'VOX SOLUTIONS'], dtype=object)

In [60]:
def company_specialization(row): #функция для замены пропусков сфере деятельности по названию компании
    if row['company_name'] in ['Greenly']:
        return 'Environmental Services'
    elif row['company_name'] in ['Landeskriminalamt Nordrhein-Westfalen']:
        return 'Government Administration'
    elif row['company_name'] in ['M&L Aktiengesellschaft']:
        return 'Business Consulting and Services'
    elif row['company_name'] in ['Expectoo Nigeria', '']:
        return 'IT Services and IT Consulting'
    elif row['company_name'] in ['HR Proactivity', 'www.TeamQuest.pl']:
        return 'Human Resources Services'
    elif row['company_name'] in ['Software * IT', 'BeTechnology Group', 'Randstad Tech Engineering']:
        return 'agency vacancy'
    elif row['company_name'] in ['Web Leaders - Winning People']:
        return 'Advertising Services'
    elif row['company_name'] in ['Quantum Advisory']:
        return 'Accounting'
    elif row['company_name'] in ['Mec-Diesel S.p.A.']:
        return 'Wholesale Motor Vehicles and Parts'
    elif row['company_name'] in ['Viceversa']:
        return 'Financial Services'
    elif row['company_name'] in ['Birnbach Communications']:
        return 'Public Relations and Communications Services'
    elif row['company_name'] in ['Tefors']:
        return 'Retail Apparel and Fashion'
    elif row['company_name'] in ['Ashley Kate HR &amp; Finance']:
        return 'Staffing and Recruiting'
    elif row['company_name'] in ['VOX SOLUTIONS']:
        return 'Telecommunications'
    else:
        return row['company_specialization']

In [61]:
df_analyst['company_specialization'] = df_analyst.apply(company_specialization, axis=1)

In [62]:
df_analyst['company_specialization'].isna().sum()

0

In [63]:
df_analyst[df_analyst['employees'].isna()]

Unnamed: 0,title,city,country,workplace_type,company_name,employees,company_specialization,skills,publication_date,applicants,link
282,customer ledger data analyst,Stockport,United Kingdom,On-site,Birnbach Communications,,Public Relations and Communications Services,excel,2022-09-06,,https://linkedin.com/jobs/view/3254329032/?alt...
282,customer ledger data analyst,Stockport,United Kingdom,On-site,Birnbach Communications,,Public Relations and Communications Services,ssis,2022-09-06,,https://linkedin.com/jobs/view/3254329032/?alt...
282,customer ledger data analyst,Stockport,United Kingdom,On-site,Birnbach Communications,,Public Relations and Communications Services,vba,2022-09-06,,https://linkedin.com/jobs/view/3254329032/?alt...


In [64]:
df_analyst['employees'].value_counts()

10,001+ employees                                                                                   274
1,001-5,000 employees                                                                               138
201-500 employees                                                                                   114
501-1,000 employees                                                                                 106
51-200 employees                                                                                     94
5,001-10,000 employees                                                                               90
1-10 employees                                                                                       52
11-50 employees                                                                                      36
1-10 employees                                                                                       18
51-200 employees                                                

В столбце с количеством сотрудников в компании всего 3 пропуска, но также присутствуют аномальные значения.

In [65]:
df_analyst['employees'] = df_analyst['employees'].fillna('unknown') #заглушка на пропуск

employees = re.compile('(employees)') #фильтр

df_analyst[~(df_analyst['employees'].str.contains(employees))]['company_name'].unique() #получение названий компаний

  df_analyst[~(df_analyst['employees'].str.contains(employees))]['company_name'].unique() #получение названий компаний


array(['Landeskriminalamt Nordrhein-Westfalen', 'M&L Aktiengesellschaft',
       'Software * IT', 'BeTechnology Group', 'Randstad Tech Engineering',
       'Birnbach Communications', 'www.TeamQuest.pl', '',
       'Ashley Kate HR &amp; Finance'], dtype=object)

In [66]:
def employees(row): #функция для замены значений в столбце с количеством сотрудников:
    if row['company_name'] in ['Landeskriminalamt Nordrhein-Westfalen', 'Software * IT', 
                               'BeTechnology Group', 'Randstad Tech Engineering', '']:
        return 'unknown'
    elif row['company_name'] in ['M&L Aktiengesellschaft', 'www.TeamQuest.pl', 'Ashley Kate HR &amp; Finance']:
        return '11-50 employees'
    elif row['company_name'] in ['Birnbach Communications']:
        return '1-10 employees'
    else:
        return row['employees']

In [67]:
df_analyst['employees'] = df_analyst.apply(employees, axis=1)

In [68]:
df_analyst['employees'].value_counts()

10,001+ employees          274
1,001-5,000 employees      128
201-500 employees          114
501-1,000 employees        106
51-200 employees            94
5,001-10,000 employees      90
1-10 employees              45
11-50 employees             36
unknown                     36
1-10 employees              21
51-200 employees            15
11-50 employees             12
201-500 employees            3
Name: employees, dtype: int64

Аномальные значения исчезли, но некоторые строки с одинаковыми значениями опрделяются, как разные значения (возможно из-за пробелов). Оставим только числовой диапазон в столбце а слово 'employees' уберем.

In [69]:
df_analyst['employees'] = df_analyst['employees'].str.split(' ').str.get(0)

In [70]:
df_analyst['employees'].value_counts()

10,001+         274
1,001-5,000     128
201-500         117
51-200          109
501-1,000       106
5,001-10,000     90
1-10             66
11-50            48
unknown          36
Name: employees, dtype: int64

In [71]:
df_analyst[df_analyst['applicants'].isna()]

Unnamed: 0,title,city,country,workplace_type,company_name,employees,company_specialization,skills,publication_date,applicants,link
6,data analyst,Przemyśl,Poland,Remote,TELUS International AI Data Solutions,"10,001+",IT Services and IT Consulting,,2022-09-07,,https://linkedin.com/jobs/view/3257247449/?alt...
24,data analyst,Castres,France,On-site,Pierre Fabre Group,"10,001+",Pharmaceutical Manufacturing,excel,2022-09-02,,https://linkedin.com/jobs/view/2811860634/?alt...
24,data analyst,Castres,France,On-site,Pierre Fabre Group,"10,001+",Pharmaceutical Manufacturing,power point,2022-09-02,,https://linkedin.com/jobs/view/2811860634/?alt...
24,data analyst,Castres,France,On-site,Pierre Fabre Group,"10,001+",Pharmaceutical Manufacturing,tableau,2022-09-02,,https://linkedin.com/jobs/view/2811860634/?alt...
57,data analyst with python,Sofia,Bulgaria,unknown,Fourth,"501-1,000",IT Services and IT Consulting,api,2022-09-07,,https://linkedin.com/jobs/view/3249889835/?alt...
...,...,...,...,...,...,...,...,...,...,...,...
311,systems data analyst,London,United Kingdom,unknown,dns umbrella,1-10,Financial Services,data studio,2022-09-01,,https://linkedin.com/jobs/view/3256666230/?alt...
311,systems data analyst,London,United Kingdom,unknown,dns umbrella,1-10,Financial Services,power bi,2022-09-01,,https://linkedin.com/jobs/view/3256666230/?alt...
324,ing. elettronico/data analyst,Rome,Italy,On-site,Herzum,51-200,IT Services and IT Consulting,python,2022-09-06,,https://linkedin.com/jobs/view/3254064694/?alt...
324,ing. elettronico/data analyst,Rome,Italy,On-site,Herzum,51-200,IT Services and IT Consulting,c++,2022-09-06,,https://linkedin.com/jobs/view/3254064694/?alt...


Для вакансий которые были опубликованы непосредственно в день парсинга отсутствия значений в данном столбце можно было бы объснить тем, что претенденты не успели появиться, но пропуски есть и в вакансиях опубликованных позднее (возможно на вакансию действительно никто не откликнулся или данная информация просто скрыта). В данном столбце оставим Nan без изменений.

In [72]:
df_analyst.duplicated().sum() #проверка на дубликаты после проведенных преобразований

0

In [73]:
df_analyst = df_analyst.drop_duplicates().reset_index(drop=True) #удаление дубликата

In [74]:
df_analyst.to_csv('D:\df_analyst.csv', index=False) #выгрузка датафрейма в csv-файл

## Ссылка на дашборд

[https://public.tableau.com/views/Linkedin_project/Dashboard1?:language=en-US&:display_count=n&:origin=viz_share_link]