<a href="https://colab.research.google.com/github/TK-Problem/Python-mokymai/blob/master/Scripts/cvonline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Import packages

# the playwright library is used to load the page and fetch its HTML
!pip install playwright
!playwright install-deps
!playwright install webkit
!pip install nest_asyncio

# in a notebook, playwright is used through its async API;
# nest_asyncio lets asyncio.run() work inside the already running event loop
import nest_asyncio
nest_asyncio.apply()
import asyncio

# import the async playwright API
from playwright.async_api import async_playwright

# bs4 is used to extract the needed information from the HTML
from bs4 import BeautifulSoup

# sometimes the program needs to pause for a while
import time

# packages for working with numbers and data
import pandas as pd
import numpy as np

# clear_output is used to clear the cell output (here, the installation logs)
from IPython.display import clear_output
clear_output()

# Data download

Steps the function performs:

* launches a `playwright` browser (WebKit),
* sets a fake `user_agent` so that the site treats you as a real user,
* builds the `cvonline.lt` search URL with the keyword,
* clicks the pop-up and cookie buttons,
* waits 2 seconds as a precaution,
* downloads the HTML,
* loads it into a `BeautifulSoup` object,
* iterates over the rows and extracts the needed information,
* collects the data into a `pandas` DataFrame and returns it.
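
The URL-building step above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: `build_search_url` is a hypothetical helper, and the query parameters are copied from the URL used in the function below. Encoding the keyword is an added precaution for search terms that contain spaces or non-ASCII characters; how the site treats such terms is not verified here.

```python
from urllib.parse import urlencode

# base address of the search page, taken from the notebook's URL
BASE_URL = "https://cvonline.lt/lt/search"


def build_search_url(keyword, limit=500, offset=0):
    """Build the search URL for a keyword (hypothetical helper)."""
    params = {
        "limit": limit,
        "offset": offset,
        "keywords[0]": keyword,  # urlencode turns the brackets into %5B and %5D
        "fuzzy": "true",
        "suitableForRefugees": "false",
        "isHourlySalary": "false",
        "isRemoteWork": "false",
        "isQuickApply": "false",
    }
    # urlencode escapes keys and values and joins them with "&"
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("python"))
```

For the keyword `python` this should produce the same address as the hard-coded f-string, while a keyword such as `data engineer` is sent with the space escaped.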

In [2]:
#@title CVonline function
async def cvonline(keyword="python"):
    """
    This function returns all available job listings based on search keyword.
    Inputs:
      keyword (str)
    Output:
      returns pandas DataFrame
    """
    async with async_playwright() as p:

        # launch a headless WebKit browser
        browser = await p.webkit.launch()

        # user agent string sent with every request
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'

        # create new page, i.e. a new tab in the browser
        page = await browser.new_page(user_agent=user_agent)
        
        # generate URL with a keyword
        url = f"https://cvonline.lt/lt/search?limit=500&offset=0&keywords%5B0%5D={keyword}&fuzzy=true&suitableForRefugees=false&isHourlySalary=false&isRemoteWork=false&isQuickApply=false"
        
        # visit page
        await page.goto(url)

        # click on pop-up window
        await page.click("//button[@class='jsx-4189752321 close-modal-button']")

        # click cookie button
        await page.click("//button[@class='cookie-consent-button']")

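        # fixed pause so the page can finish rendering
        # (asyncio.sleep does not block the event loop, unlike time.sleep)
        await asyncio.sleep(2)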

        # get page html contents
        page_source = await page.content()

        # convert to bs4 object
        soup = BeautifulSoup(page_source, "lxml")

        # find all rows (ul - unordered list, li - list item)
        rows = soup.find("ul", {"data-gtm-id": "search-results"}).find_all("li")

        # create tmp. list to store data
        lst = list()

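        # iterate over all rows
        for row in rows:
          # default to empty strings so a missing field is not carried over from the previous row
          employer = offer_started = offer_ends = ""

          # find all <a> tags
          for a in row.find_all('a', href=True):
            # the employer link is the one whose href contains "employer"
            if "employer" in a['href']:
              # get element's text
              employer = a.text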
          
          # get all row contents
          _contents = row.find_all("span")

          # get specific info about job title and job location
          job_title = _contents[0].text
          job_location = _contents[2].text[3:]

          # get the posting date and the expiry date of the listing
          for c in _contents[4:]:
            # "Paskelbta" means "posted"
            if "Paskelbta" in c.text:
              # keep only the part before the expiry date
              offer_started = c.text.split("Baigiasi")[0]
            # "Baigiasi" means "ends"
            if "Baigiasi" in c.text:
              # get text value
              offer_ends = c.text

          # get salary (there are two types of employers)
          # use different slicing if there are more columns in an element
          if "TOP Darbdavys" in _contents[-1].text:
            salary = _contents[-4].text
            # if it is hourly rate (/h) read different element
            if salary == "/h":
              salary = _contents[-5].text
          else:
            salary = _contents[-1].text
            # if it is hourly rate (/h) read different element
            if salary == "/h":
              salary = _contents[-2].text
            

          # add data to temp. list
          lst.append([employer, job_title, job_location, offer_started, offer_ends, salary])

        # save a screenshot to your environment (for debugging)
        # this line can be commented out
        await page.screenshot(path="cvonline_status.png")
        
        # close webkit
        await browser.close()

        # return pandas DataFrame
        return pd.DataFrame(lst, columns = ['Employer', 'JobTitle', "Location", "Offered", "AddEnds", "Salary"])

In [3]:
#@title Download data
# search keyword
keyword = 'python' # @param {type:"string"}

# call the function, store the data and display the first 5 listings
df_cvonline = asyncio.run(cvonline(keyword))

# report how many listings were found
print(f"Found {len(df_cvonline)} listings.")

df_cvonline.head()

Found 75 listings.


Unnamed: 0,Employer,JobTitle,Location,Offered,AddEnds,Salary
0,"Auriga Baltics, UAB",Test Automation Engineer (Python),"Vilnius, Vilniaus rajonas, Lietuva",Paskelbta prieš 24 dienas,Baigiasi: 2022.12.09,€ 3300 – 4000
1,DataArt Ltd,Python Lead with AWS,Lietuva,Paskelbta prieš 2 dienas,Baigiasi: 2023.01.01,€ 6000 – 7000
2,DataArt Ltd,Senior Python Developer with AWS,Lietuva,Paskelbta prieš 2 dienas,Baigiasi: 2023.01.01,€ 6000 – 7000
3,Accenture,DevOps with Python development experience,Lietuva,Paskelbta prieš 9 dienas,Baigiasi: 2022.12.24,€ 2600 – 5800
4,"GODEL TECHNOLOGIES EUROPE, UAB",Paid Intership in SDET (Automation QA Engineer...,"Vilnius, Vilniaus rajonas, Lietuva",Paskelbta prieš apie 1 mėnesį,Baigiasi: 2022.12.03,€ 650


# Data processing

The data must be cleaned before it can be analyzed. The salary column `Salary` holds several kinds of values:

* some salaries are given as a range, e.g. € 3300 – 4000. In that case the minimum and maximum salary values have to be extracted and the euro sign removed.
* other listings do not publish a salary and only say "TOP Darbdavys". Such entries have to be converted to NaN values.
* some salaries are hourly, e.g. € 6/h; these are converted to a monthly salary, assuming 22 working days of 8 hours each.

Finally, each salary is reduced to the mean of the minimum and maximum offered values.
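
The three cases above can be illustrated on single strings with plain Python. This is a minimal sketch under stated assumptions, not the notebook's implementation: `parse_salary` is a hypothetical helper, the input strings follow the formats listed above, and the 22 × 8 hours per month figure is the same approximation the cleaning function uses.

```python
import math

# assumption: 22 working days per month, 8 hours per day
HOURS_PER_MONTH = 22 * 8


def parse_salary(text):
    """Return (minimum, maximum, mean) monthly salary in euros (hypothetical helper)."""
    # listings without a published salary become NaN
    if not text or "TOP Darbdavys" in text:
        return (math.nan, math.nan, math.nan)

    hourly = "/h" in text

    # drop the euro sign and the hourly marker, then split a possible range
    cleaned = text.replace("€", "").replace("/h", "").strip()
    parts = [float(p) for p in cleaned.split("–")]

    # a single value is both the minimum and the maximum
    low, high = parts[0], parts[-1]

    # approximate hourly wages as monthly salaries
    if hourly:
        low, high = low * HOURS_PER_MONTH, high * HOURS_PER_MONTH

    return (low, high, (low + high) / 2)


for example in ["€ 3300 – 4000", "€ 650", "€ 6/h", "TOP Darbdavys"]:
    print(example, "->", parse_salary(example))
```

A row-wise helper like this is easier to test than a chain of column operations, at the cost of being slower on large tables; for a few hundred listings the difference should not matter.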

In [4]:
#@title Clean numerical data
def clean_num_cols(df):
  """
  Formats salary columns
  Input:
    df - pandas DataFrame
  Output:
    pandas DataFrame
  """
  # clean empty salaries (the ones with Top darbdavys)
  df.Salary = df.Salary.apply(lambda x: "" if "TOP Darbdavys" in x else x)

  # if there is salary range, e.g. x - y, then extract min and max values
  df['SalaryMin'] = df.Salary.apply(lambda x: x.split(" – ")[0][2:] if " – " in x else x)
  df['SalaryMax'] = df.Salary.apply(lambda x: x.split(" – ")[1] if " – " in x else x)

  # remove euro sign
  df['SalaryMin'] = df['SalaryMin'].str.replace("€", "")
  df['SalaryMax'] = df['SalaryMax'].str.replace("€", "")

  # assume that each month has 22 working days with 8 hours a day
  # approximate hourly wages to monthly
  # condition to select rows
  cond = df.Salary.apply(lambda x: "/h" in x)

  # convert hourly data to month salaries
  df.loc[cond, 'SalaryMin'] = df.loc[cond, 'Salary'].apply(lambda x: float(x.split("/h")[0].replace("€", "")) * 22 * 8)
  df.loc[cond, 'SalaryMax'] = df.loc[cond, 'Salary'].apply(lambda x: float(x.split("/h")[0].replace("€", "")) * 22 * 8)

  # convert missing salaries to NaNs
  df.loc[df.SalaryMin == '', 'SalaryMin'] = np.nan
  df.loc[df.SalaryMax == '', 'SalaryMax'] = np.nan

  # convert to floats (use the function argument, not the global df_cvonline)
  df['SalaryMin'] = df['SalaryMin'].astype(float)
  df['SalaryMax'] = df['SalaryMax'].astype(float)

  # calculate average salary
  df['SalaryMean'] = (df['SalaryMin'] + df['SalaryMax']) / 2
  
  # return cleaned DataFrame
  return df


In [5]:
# clean numerical values
df_c = clean_num_cols(df_cvonline)

# drop ads without salary
df_c = df_c.dropna(subset=['SalaryMean'])
df_c.head()

Unnamed: 0,Employer,JobTitle,Location,Offered,AddEnds,Salary,SalaryMin,SalaryMax,SalaryMean
0,"Auriga Baltics, UAB",Test Automation Engineer (Python),"Vilnius, Vilniaus rajonas, Lietuva",Paskelbta prieš 24 dienas,Baigiasi: 2022.12.09,€ 3300 – 4000,3300.0,4000.0,3650.0
1,DataArt Ltd,Python Lead with AWS,Lietuva,Paskelbta prieš 2 dienas,Baigiasi: 2023.01.01,€ 6000 – 7000,6000.0,7000.0,6500.0
2,DataArt Ltd,Senior Python Developer with AWS,Lietuva,Paskelbta prieš 2 dienas,Baigiasi: 2023.01.01,€ 6000 – 7000,6000.0,7000.0,6500.0
3,Accenture,DevOps with Python development experience,Lietuva,Paskelbta prieš 9 dienas,Baigiasi: 2022.12.24,€ 2600 – 5800,2600.0,5800.0,4200.0
4,"GODEL TECHNOLOGIES EUROPE, UAB",Paid Intership in SDET (Automation QA Engineer...,"Vilnius, Vilniaus rajonas, Lietuva",Paskelbta prieš apie 1 mėnesį,Baigiasi: 2022.12.03,€ 650,650.0,650.0,650.0


# Insights

A few questions that can be answered both visually and numerically.

In [6]:
# average salary for the keyword
print(f'For the search keyword {keyword}, {len(df_c)} listings offer {df_c.SalaryMean.mean():.0f} € on average')

For the search keyword python, 73 listings offer 3784 € on average


In [7]:
# top N listings with the highest salaries
N = 5 # @param {type:"integer"}
cols = ['Employer', 'JobTitle', 'AddEnds', 'Salary', 'SalaryMean']

# sort and return N largest
df_c.sort_values(by="SalaryMean").tail(N)[cols]

Unnamed: 0,Employer,JobTitle,AddEnds,Salary,SalaryMean
50,Danske Bank Lithuania,Chief Service Reliability Engineer (SRE) in Me...,Baigiasi: 2022.12.17,€ 4960 – 7440,6200.0
2,DataArt Ltd,Senior Python Developer with AWS,Baigiasi: 2023.01.01,€ 6000 – 7000,6500.0
1,DataArt Ltd,Python Lead with AWS,Baigiasi: 2023.01.01,€ 6000 – 7000,6500.0
14,Alliance for Recruitment,Senior Data Architect,Baigiasi: 2022.12.14,€ 5785 – 7416,6600.5
21,Alliance for Recruitment,Director of Product,Baigiasi: 2022.12.15,€ 6600 – 8300,7450.0
