
## 1.0 Setup

### 1.1 Folder structure

In [13]:
#!mkdir ScrapyProfesiaRawData ScrapyProfesiaProcessedData ScrapyProfesiaLogs
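The commented-out `mkdir` above creates the three working folders, but it fails if they already exist. A minimal sketch of a re-runnable alternative in Python follows. The helper name `create_project_dirs` and the `base` parameter are assumptions made for illustration; only the folder names come from the notebook.

```python
from pathlib import Path

# Folder names taken from the mkdir command above
PROJECT_DIRS = (
    "ScrapyProfesiaRawData",
    "ScrapyProfesiaProcessedData",
    "ScrapyProfesiaLogs",
)


def create_project_dirs(base="."):
    """Create the working folders under `base`; existing folders are left untouched."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_dirs("/content")` would produce the folder layout that the file paths later in this notebook assume.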

### 1.2 Install libraries and modules

In [44]:
!pip install -r requirements.txt

Collecting bs4 (from -r requirements.txt (line 1))
  Using cached bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [45]:
import json
import re
import subprocess
import sys

import requests
from bs4 import BeautifulSoup

# Install the modules listed in requirements.txt with the pip that belongs to
# the running interpreter, so packages land in the notebook's own environment
pip_install_result = subprocess.run(
    [sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'],
    capture_output=True,
    text=True,
)

# Report the outcome; on failure, show pip's error output for diagnosis
if pip_install_result.returncode == 0:
    print("Pip install successful")
else:
    print("Pip install failed")
    print(pip_install_result.stderr)


Pip install successful


### 1.3 Test

In [46]:
!pytest ScrapyProfesiaSetupTest.py

platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /content
plugins: typeguard-4.4.1, anyio-3.7.1
collecting ... collected 2 items

ScrapyProfesiaSetupTest.py ..                                                                [100%]



## 2.0 Extract Raw Data

In [52]:
LINK = 'https://www.profesia.sk/O4988508'
RAWFILE = '/content/ScrapyProfesiaRawData/O4988508.txt'
PROCESSEDFILE = '/content/ScrapyProfesiaProcessedData/O4988508P.json'
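The three constants above repeat the offer ID `O4988508` by hand. As a sketch of how the two file paths could be derived from the link instead, the helper below takes the last path segment of the URL as the ID. `paths_for_offer`, `RAW_DIR`, and `PROCESSED_DIR` are hypothetical names introduced here; the directory locations and the `.txt` / `P.json` suffixes follow the constants above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Directory locations as used in the constants above
RAW_DIR = "/content/ScrapyProfesiaRawData"
PROCESSED_DIR = "/content/ScrapyProfesiaProcessedData"


def paths_for_offer(link):
    """Return (raw_path, processed_path) for an offer link such as .../O4988508."""
    # The offer ID is assumed to be the last segment of the URL path
    offer_id = PurePosixPath(urlparse(link).path).name
    if not offer_id:
        raise ValueError(f"No offer ID found in link: {link}")
    raw_path = f"{RAW_DIR}/{offer_id}.txt"
    processed_path = f"{PROCESSED_DIR}/{offer_id}P.json"
    return raw_path, processed_path
```

With this helper, `RAWFILE, PROCESSEDFILE = paths_for_offer(LINK)` would give the same two paths as the hard-coded constants.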

### 2.1 Single page - proof of concept

In [53]:
def download_and_save(url, filename):
    """Download a page and save its visible text content to a file."""
    try:
        # A timeout keeps the call from hanging if the server does not respond
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

        soup = BeautifulSoup(response.content, 'html.parser')

        # Keep only the visible text, with each text fragment on its own line
        text_content = soup.get_text(separator='\n', strip=True)

        with open(filename, 'w', encoding='utf-8') as f:
            f.write(text_content)
        print(f"Successfully downloaded and saved to {filename}")

    except requests.exceptions.RequestException as e:
        print(f"Error downloading URL: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")


In [54]:
download_and_save(LINK, RAWFILE)

Successfully downloaded and saved to /content/ScrapyProfesiaRawData/O4988508.txt


## 3.0 Transform Raw Data

In [71]:
def extract_data(inputfile, outputfile):
    """Parse the raw offer text and save the extracted fields as JSON."""

    def find(pattern, text, flags=0):
        # Return the first capture group, or None if the label is missing,
        # instead of raising AttributeError on a failed match
        match = re.search(pattern, text, flags)
        return match.group(1) if match else None

    with open(inputfile, 'r', encoding='utf-8') as f:
        text_content = f.read()

    # Drop the block enclosed by the first two 'Hľadanie práce' markers
    text_content = re.sub(r'Hľadanie práce.*?Hľadanie práce', '', text_content, flags=re.DOTALL)

    # Keep only the text before the first end-of-offer marker
    if 'Odporučiť ponuku známemu' in text_content:
        text_content = text_content.split('Odporučiť ponuku známemu', 1)[0]
    else:
        text_content = text_content.split('Reagovať na ponuku', 1)[0]

    data = {}
    data['ID'] = find(r'ID:\s*(\d+)', text_content)
    data['PublishedDate'] = find(r'Dátum zverejnenia:\s*([\d\.]+)', text_content)
    # TODO: add data['ExtractDate']; this won't work without file metadata
    # - blob storage: to be tested
    # - VM: the file's date is the extract date I am looking for;
    #   it should be almost the same as the blob Creation-Date
    data['Location'] = find(r'lokalita:\s*(.+)', text_content)
    # Text between 'Pozícia:' and 'Spoločnosť:' holds the position names
    positions_text = find(r'Pozícia:\s*(.+?)(?=\nSpoločnosť:)', text_content, re.DOTALL) or ''
    # Keep non-empty lines and skip the lone ',' separators
    data['Positions'] = [line.strip() for line in positions_text.splitlines()
                         if line.strip() and line.strip() != ',']
    data['Company'] = find(r'Spoločnosť:\s*(.+)', text_content)
    data['SalaryBrutto'] = find(r'Základná zložka mzdy \(brutto\):\s*(.+)', text_content)
    data['JobOfferText'] = text_content

    # ensure_ascii=False keeps Slovak characters readable in the output file
    json_data = json.dumps(data, indent=4, ensure_ascii=False)

    with open(outputfile, 'w', encoding='utf-8') as f:
        f.write(json_data)
    print(f'Successfully downloaded and saved to {outputfile}')
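The comments inside `extract_data` leave `ExtractDate` open and suggest that, on a VM, the raw file's own date could serve as the extract date. One possible way to read that date is sketched below. It assumes the file's last-modification time is an acceptable proxy, which only holds if the raw file is not rewritten after download. `file_extract_date` is a hypothetical helper, and the blob storage case is not covered.

```python
import os
from datetime import datetime, timezone


def file_extract_date(path):
    """Return the file's last-modification time as an ISO 8601 UTC string."""
    mtime = os.path.getmtime(path)  # seconds since the epoch
    # Assumption: the modification time stands in for the extract date
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
```

Inside `extract_data`, this could be used as `data['ExtractDate'] = file_extract_date(inputfile)`.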

In [72]:
extract_data(RAWFILE,PROCESSEDFILE)

Successfully downloaded and saved to /content/ScrapyProfesiaProcessedData/O4988508P.json
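
Each field in `extract_data` is located by a regular expression keyed on a Slovak label in the page text. The sketch below shows, on invented sample text, how such patterns behave, including the case where a label is absent. `find_field` and `SAMPLE` are hypothetical names, and the sample values are made up for illustration rather than taken from a real offer.

```python
import re

# Invented sample text that mimics the label layout the patterns expect;
# it is not taken from a real offer
SAMPLE = (
    "ID: 1234567\n"
    "Dátum zverejnenia: 1.2.2025\n"
    "Pozícia:\nData Engineer\n,\nETL Developer\n"
    "Spoločnosť: Example s.r.o.\n"
)


def find_field(pattern, text, flags=0):
    """Return the first capture group, or None when the pattern does not match."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None


offer_id = find_field(r"ID:\s*(\d+)", SAMPLE)
# The salary label does not occur in SAMPLE, so this yields None
salary = find_field(r"Základná zložka mzdy \(brutto\):\s*(.+)", SAMPLE)

# Positions span several lines between 'Pozícia:' and 'Spoločnosť:'
positions_text = find_field(r"Pozícia:\s*(.+?)(?=\nSpoločnosť:)", SAMPLE, re.DOTALL) or ""
positions = [ln.strip() for ln in positions_text.splitlines()
             if ln.strip() and ln.strip() != ","]

print(offer_id, salary, positions)
# prints: 1234567 None ['Data Engineer', 'ETL Developer']
```

Returning `None` for a missing label lets the rest of the record be saved even when one field, such as the salary, is not published in an offer.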
