# 1) Retrieving data 

## Retrieving data from NoFluffJobs

The first part of the exercise is to retrieve data from the NoFluffJobs website. Complete it in the following steps:

- Write a function that takes two parameters (job name and page number), and returns the HTML code of that page,
- Write a function that takes one parameter (website code) and returns whether there are more offers on the page (`True` or `False`),
- Write a function that takes one parameter (job name) and then, in a loop starting from page 1:
    - Retrieves the code of the given page,
    - Checks if there are still ads on the page,
    - If there are, it saves the HTML code to disk and goes to the next page,
    - If there are not, it terminates the operation,
    
Remember to reuse the previously written functions in the third step.

At this stage we do not process the data yet; we only retrieve it as it is available.

Run the script for the following jobs:

- data analyst,
- data scientist,
- data engineer.

### NOTE:
The website generates its entire content only after something on the page has been clicked. In other words, loading the website should look as follows:

- open the job offers page,
- click any object on the page (e.g. accept cookies).
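
The two loading steps above can be sketched as a small helper. This is a minimal sketch, assuming a Selenium-style `browser` object (the hints in this exercise refer to `browser.page_source`). The names `load_full_page` and `settle_seconds` are hypothetical, and clicking the `<body>` element is only one way to "click any object on the page".

```python
import time


def load_full_page(browser, url, settle_seconds=2.0):
    """Open `url`, click an element so the page renders fully, return its HTML.

    `browser` is assumed to be a Selenium WebDriver (or any object exposing
    `get`, `find_element` and `page_source` in the same way).
    """
    browser.get(url)  # step 1: open the job offers page
    # step 2: click any object on the page; "tag name" is the string value
    # of Selenium's By.TAG_NAME locator strategy
    browser.find_element("tag name", "body").click()
    time.sleep(settle_seconds)  # assumption: give the page a moment to render
    return browser.page_source
```

With Selenium installed, the `browser` argument could be created with `webdriver.Chrome()` after `from selenium import webdriver`. Clicking a specific element, such as the cookie consent button, would need a selector taken from the live page, so it is left out here.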

#### File names

We will adopt the following file naming convention:

```{job_name}_{page_number}.html```

For example, `data analyst_1.html` contains the list of data analyst job offers from page one. The files should be saved in the `/data/raw` directory.
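
The naming convention can be captured in a small helper, sketched below. The function names `raw_page_path` and `save_page` are hypothetical. The default `base_dir` assumes the relative `data/raw` folder used by the code cells in this notebook, so adjust it if your raw folder lives elsewhere.

```python
from pathlib import Path


def raw_page_path(job_name, page_number, base_dir="data/raw"):
    """Return the path `{base_dir}/{job_name}_{page_number}.html`."""
    return Path(base_dir) / f"{job_name}_{page_number}.html"


def save_page(html, job_name, page_number, base_dir="data/raw"):
    """Write `html` to disk under the naming convention and return the path."""
    path = raw_page_path(job_name, page_number, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path.write_text(html, encoding="utf-8")
    return path
```

Creating the directory up front avoids a `FileNotFoundError` the first time the script runs on a machine where the raw folder does not exist yet.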

#### Hints
- Remember to add a time interval between every page transition, e.g. 5 seconds,
- As the URL to be opened by the browser, you can use the following template:

```https://nofluffjobs.com/pl/jobs?criteria={job_name}&page={page_number}```

- To retrieve the HTML content of the page, you can use `browser.page_source`,
- Because we do not know how many pages there are for each job, you can use a `while` loop,
- If you want to stop executing the loop, you can use the `break` keyword.
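
The hints above can be combined into one possible shape for the paging loop. This is a sketch only. `build_url` and `walk_pages` are hypothetical names, and `fetch` and `has_offers` are placeholder callables standing in for the page-retrieval and offer-check functions you write yourself.

```python
import time
from urllib.parse import quote

URL_TEMPLATE = "https://nofluffjobs.com/pl/jobs?criteria={job_name}&page={page_number}"


def build_url(job_name, page_number):
    """Fill the URL template, percent-encoding spaces in the job name."""
    return URL_TEMPLATE.format(job_name=quote(job_name), page_number=page_number)


def walk_pages(job_name, fetch, has_offers, delay_seconds=5):
    """Fetch consecutive pages until one has no offers; return the pages kept."""
    pages = []
    page_number = 1
    while True:
        html = fetch(build_url(job_name, page_number))
        if not has_offers(html):
            break  # no more offers: leave the loop
        pages.append(html)
        page_number += 1
        time.sleep(delay_seconds)  # pause between page transitions
    return pages
```

Passing the two functions in as arguments keeps the loop independent of how the HTML is actually retrieved, so it can be tried out with stand-in functions before touching the real website.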


In [1]:
#importing libraries
from bs4 import BeautifulSoup
import requests
import time

In [2]:
#helper function for creating url based on parameters
def url_code(job_name, page_number):
    job_name2 = job_name.replace(" ", "%20")  #replace space with %20 for url
    url = f'https://nofluffjobs.com/pl/jobs?criteria={job_name2}&page={page_number}'  #writing url address with parameters
    return url

#Function that takes one parameter (url built by url_code from job name and page number)
#and returns the HTML code of that page parsed as a BeautifulSoup object
def get_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

In [3]:
#Function that takes one parameter (website code) and returns the information
#saying whether there are more offers on the page (True or False),

def more_offers(website_code):
    #check if there is a "multiple offers" option; find() returns None when it is missing
    more = website_code.find("div", class_="tw-flex tw-flex-wrap tw-justify-center tw-my-8 tw-gap-4 ng-star-inserted")
    return more is not None

In [4]:
#3)Function that takes one parameter (job name), and then in a loop, starting from page 1: retrieves the code
#of the given page, saves the HTML code to disk and checks if there are more pages with ads; if there are,
#it goes to the next page, if there are not, it terminates the operation

def save_html_ads(job_name):
    page_number = 1    #starting from page 1

    while True:  #loop until there are no more pages
        url = url_code(job_name, page_number)  #url address for the job_name and current page
        soup = get_html(url)  #get html code of that page

        #save that html code to 'raw' folder
        filename = f'data/raw/{job_name}_{page_number}.html'
        with open(filename, "w", encoding="utf-8") as file:
            file.write(str(soup))

        if not more_offers(soup):  #while loop breaks if there are no more pages with ads
            print(f"No more pages with job offers for {job_name}.")
            break

        page_number += 1   #next page
        time.sleep(5)  #wait 5 seconds before retrieving html code of next page
    

In [5]:
#save html codes for data scientist, data analyst and data engineer
save_html_ads("data scientist")

No more pages with job offers for data scientist.


In [6]:
save_html_ads("data analyst")

No more pages with job offers for data analyst.


In [7]:
save_html_ads("data engineer")

No more pages with job offers for data engineer.
