# 2) Scraping data

## Retrieving information about the website

Based on the files in the `/data/raw` directory (generated in exercise 1), extract the following information about each offer:

- location - both city and country. For remote work, set `Remote` as the city and `N/A` as the country,
- salary - both lower and upper limits and currency. If there is no pay range, write the same value in both fields (lower limit = upper limit),
- name of position,
- company,
- technology.

Write the results for a single offer into a dictionary with the following structure:

```
{
    'name': 'name of the position',
    'company': 'name of the employer',
    'technology': 'name of the used technology',
    'job': 'name of the search the offer was found under, e.g. data analyst',
    'location': {'city': 'city of employment', 'country': 'country of employment'},
    'salary': {'low': 'lower limit', 'high': 'higher limit', 'currency': 'salary currency'} 
}
``` 

Collect the dictionaries for all offers in a list.

A list of such dictionaries can be loaded with another `Pandas` function, `json_normalize`. It is shown during the workshop because JSON is a commonly used format for communication between modules.

Save the results as a `DataFrame` to `data\interim\job_offers.csv` using the `;` separator, `UTF-8` encoding, and without the index (`index=False`).
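As a minimal sketch of this step, the snippet below flattens two offers with `json_normalize` and serializes them with the `;` separator. The offers themselves are made-up sample data, and the CSV is written to a string rather than to a file so that the snippet has no side effects.

```python
import pandas as pd

# Hypothetical offers that follow the dictionary structure described above.
offers = [
    {
        "name": "Data Analyst",
        "company": "Example Corp",
        "technology": "SQL",
        "job": "data analyst",
        "location": {"city": "Remote", "country": "N/A"},
        "salary": {"low": 10000, "high": 15000, "currency": "PLN"},
    },
    {
        "name": "Data Engineer",
        "company": "Sample Ltd",
        "technology": "Python",
        "job": "data engineer",
        "location": {"city": "Warszawa", "country": "PL"},
        "salary": {"low": 18000, "high": 18000, "currency": "PLN"},
    },
]

# Nested dictionaries become dotted column names such as "location.city".
df = pd.json_normalize(offers)

# Without a path argument, to_csv returns the CSV content as a string.
csv_text = df.to_csv(sep=";", index=False)
print(csv_text)
```

When reading such a file back, keep in mind that `pd.read_csv` treats the string `N/A` as a missing value by default; passing `keep_default_na=False` preserves it as text.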

Complete the exercise by following these steps:

- Write a function that takes the HTML code of a page and returns a list of HTML fragments, each containing the information about a single ad,
- Write a function that takes the HTML code of a single ad and returns a dictionary with the information described above,
- Assemble this into a working script that:
    - Finds all files in the `data\raw` directory,
    - For each file:
        - Divides it into sections corresponding to individual ads,
        - Extracts the necessary information from each section as a dictionary,
        - Adds the dictionary to the previously created list,
    - Loads the list of dictionaries into a dataset using Pandas,
    - Saves the dataset in the `data\interim\` directory with the current date in the file name.
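
The two functions from the steps above could be sketched as follows. This is only an outline under stated assumptions: the function names `split_into_ads` and `parse_ad`, the `offer` class, and the sample markup are hypothetical, and the real pages use different, longer class names. Only the position and company fields are extracted here to keep the example short.

```python
from bs4 import BeautifulSoup


def split_into_ads(page_html: str) -> list[str]:
    """Return one HTML fragment per ad (assumes each ad is an <aside class="offer">)."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [str(tag) for tag in soup.find_all("aside", class_="offer")]


def parse_ad(ad_html: str) -> dict:
    """Extract fields from a single ad (assumes <h3> = position, <h4> = company)."""
    ad = BeautifulSoup(ad_html, "html.parser")
    return {
        "name": ad.find("h3").get_text(strip=True),
        "company": ad.find("h4").get_text(strip=True),
    }


# Hypothetical page with two ads, standing in for a file from data/raw.
page = """
<html><body>
  <aside class="offer"><h3>Data Analyst</h3><h4>Example Corp</h4></aside>
  <aside class="offer"><h3>Data Engineer</h3><h4>Sample Ltd</h4></aside>
</body></html>
"""

offers = [parse_ad(ad) for ad in split_into_ads(page)]
print(offers)
```

Keeping the splitting and the parsing separate means each function can be checked on a small HTML string before the whole script exists.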

#### File names

We will adopt the following file naming convention:

```
'job_offers_{current date}.csv'
```

The `{current date}` placeholder should use the `yyyy_mm_dd` format (year, month, day).
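
A short sketch of how the file name could be built from a date. The helper name `build_output_path` and its default directory are assumptions made for illustration; `pathlib` is used so that the path separator matches the operating system.

```python
from datetime import date
from pathlib import Path


def build_output_path(day: date, directory: str = "data/interim") -> Path:
    """Return the output path for the given day, e.g. .../job_offers_2024_03_07.csv."""
    return Path(directory) / f"job_offers_{day.strftime('%Y_%m_%d')}.csv"


# A fixed date keeps the example reproducible.
example_path = build_output_path(date(2024, 3, 7))
print(example_path.name)  # job_offers_2024_03_07.csv
```

In the script itself, `build_output_path(date.today())` would produce the name for the current day.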

#### Hints

- To get the current date you can use the code: `datetime.today().strftime('%Y_%m_%d')`. Remember to import the appropriate module!
- You can split the data parsing for a single offer into several smaller helper functions. For example, one can retrieve the salary and another can parse the location data. This will make the code easier to maintain.
- To check that your functions work, you can manually pull HTML code from a file and pass it as a parameter. This way you don't need the whole script to test how its parts work.
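
As an illustration of the helper-function hint, here is a minimal sketch of a salary parser that can be tested on plain strings without any HTML. The function name `parse_salary` and the input formats (`10 000 – 15 000 PLN` for a range, `12 000 PLN` for a single value) are assumptions; adjust the patterns to what the pages actually contain.

```python
import re


def parse_salary(text: str) -> dict:
    """Parse a salary string into lower limit, upper limit and currency."""
    # A number is a run of digits, optionally grouped by whitespace
    # (\s also matches non-breaking spaces used as thousands separators).
    raw_numbers = re.findall(r"\d+(?:\s\d+)*", text)
    numbers = [int(re.sub(r"\s", "", raw)) for raw in raw_numbers]
    if not numbers:
        return {"low": None, "high": None, "currency": None}

    # Assumes the currency is written as a three-letter uppercase code.
    currency = re.search(r"[A-Z]{3}", text)
    return {
        "low": numbers[0],
        "high": numbers[-1],  # equals the lower limit when there is no range
        "currency": currency.group() if currency else None,
    }


print(parse_salary("10 000 – 15 000 PLN"))
print(parse_salary("12 000 PLN"))
```

Because the helper takes a string, it can be exercised with a handful of hand-written cases before it is wired into the scraping loop.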

In [1]:
#import libraries
import pandas as pd
from bs4 import BeautifulSoup 
import requests
import re
import glob 
import datetime

In [10]:
file_path = r"data/raw/*.html"  #glob pattern matching all html files in the raw folder

files = glob.glob(file_path)  #all html files in folder raw

ads_information_complete = pd.DataFrame()  #this will be the final dataframe

for file in files:   #for each html file
    
    regex = r"[\\/]\w*\s\w*_"  #regex extracting the job name from the file name (after either a \ or / separator)
    job_name = re.search(regex,file).group().replace("\\","").replace("/","").replace("_","")
    
    ads_information = []  #list of dictionaries, one for each offer found on the page

    with open(file, "r", encoding = "utf-8") as f:
        html_content = f.read()  #reading the html code

    soup = BeautifulSoup(html_content, "html.parser")  
    jobs = soup.find_all("aside", class_ = "tw-w-full")   #here is information about all offers on the page

    for job in jobs:  #for each offer
        
        #city
        city = job.find("span", 
                        class_ = "tw-text-ellipsis tw-inline-block tw-overflow-hidden tw-whitespace-nowrap tw-max-w-[100px] md:tw-max-w-[200px] tw-text-right").text.strip()
        
        #country
        country = job.find("div", "tw-flex tw-items-center ng-star-inserted").contents[1].text.replace(","," ").strip()
        if country == "":  #if missing
            country = "N/A"

        #salary (range) - object
        salary_obj = job.find("span",
                            class_ = "text-truncate badgy salary lg:tw-btn tw-text-ink lg:tw-btn-secondary-outline tw-text-xs lg:tw-py-0.5 lg:tw-px-2 ng-star-inserted")

        #lower salary threshold
        if salary_obj is not None:   #if salary data exists
            salary = salary_obj.text #salary range
            try:  #is it a range?
                regex = r"\d+\s\d+\s*–"
                low = re.search(regex,salary).group().replace("–","") 
                low_int = int(re.sub(r"\s","", low))  #remove whitespace inside the number and convert to integer
            except AttributeError:  #not a range
                regex = r"\d+\s\d+"  #to retrieve the number
                low = re.search(regex,salary).group()
                low_int = int(re.sub(r"\s","", low))

        else:
            low_int = None

        #upper salary threshold
        if salary_obj is not None: 
            salary = salary_obj.text
            try:  #is it a range? 
                regex = r"–\s*\d+\s\d+"
                up = re.search(regex,salary).group().replace("–","") 
                up_int = int(re.sub(r"\s","", up)) 
            except AttributeError:  
                up_int = low_int 
        else:
            up_int = None

        #currency
        if salary_obj is not None: 
            salary = salary_obj.text
            regex = r"[A-Z]+"   #to retrieve the currency
            curr = re.search(regex,salary).group()
        else:
            curr = None

        #position name
        position = job.find("h3").text.strip()

        #company name
        company = job.find("h4").text.strip()

        #technology
        techs0 = job.find_all("span", class_ = "lg:tw-text-gray-60 lg:tw-border-2 lg:tw-border-gray-ddd tw-text-xs lg:tw-py-0.5 lg:tw-px-2 tw-text-gray-60")
        techs = [tech.text.strip() for tech in techs0]  #create a list of technologies

        #write the information into a dictionary 
        information = {
        'name': position,
        'company': company,
        'technology': techs,
        'job': job_name,
        'location': {'city': city, 'country': country},
        'salary': {'low': low_int, 'high': up_int, 'currency': curr} 
        }

        ads_information.append(information)  #append it to the list of dictionaries
        
    ads_information = pd.DataFrame(ads_information)  #transform to a dataframe
    
    #append each dataframe to the previous one
    ads_information_complete = pd.concat([ads_information_complete, ads_information], axis=0, ignore_index = True)

In [12]:
#remove duplicate rows that share the same position name, company and job
#(note: pages are not separate; each subsequent page also contains the offers from the previous ones)

ads_information_complete2 = ads_information_complete.drop_duplicates(subset=["name", "company", "job"])

print(f"Total number of offers before removing duplicates: {len(ads_information_complete)}.")
print(f"Total number of offers after removing duplicates: {len(ads_information_complete2)}.")

Total number of offers before removing duplicates: 1169.
Total number of offers after removing duplicates: 326.


In [13]:
#dataframe ads_information_complete2 contains another kind of duplicate - some offers appear under more than one job search
#we need to remove these duplicates and match each offer to a job by its name

#all duplicates
duplicates = ads_information_complete2[ads_information_complete2.duplicated(subset=["name", "company"], keep = False)]


for index, row in duplicates.iterrows():
    if re.search(r"analyst", row["name"].lower()):
        row["job"] = "data analyst"
    elif re.search(r"engineer", row["name"].lower()):
        row["job"] = "data engineer"
    elif re.search(r"scientist", row["name"].lower()):
        row["job"] = "data scientist"

#finally drop these duplicates based on the columns name, company, job
ads_information_complete3 = duplicates.drop_duplicates(subset = ["name", "company", "job"])

#note: we will keep other duplicates that cannot be assigned to these positions

print(f"Total number of offers after removing all duplicates: {len(ads_information_complete3)}.")

Total number of offers after removing all duplicates: 150.
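
A caveat on the relabeling loop: according to the pandas documentation, `iterrows` may return a copy of each row, so writing to `row["job"]` is not guaranteed to change `duplicates`. In addition, `ads_information_complete3` is derived from `duplicates`, so it contains only the offers that were found under more than one search. The following is a minimal sketch of an alternative, not the method used in this notebook: it relabels the duplicated rows with boolean masks and `.loc` on the full frame, then drops duplicates. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for ads_information_complete2.
offers = pd.DataFrame(
    {
        "name": ["Data Analyst", "Data Analyst", "ML Engineer", "BI Developer"],
        "company": ["Acme", "Acme", "Globex", "Initech"],
        "job": ["data analyst", "data scientist", "data engineer", "data analyst"],
    }
)

# Rows whose (name, company) pair occurs under more than one search.
is_dup = offers.duplicated(subset=["name", "company"], keep=False)

# Relabel duplicated rows by keyword; .loc writes into the frame itself.
# Lowest priority first: later writes win, so "analyst" takes precedence
# over "engineer" and "scientist", as in the if/elif chain.
lowered = offers["name"].str.lower()
for keyword in ["scientist", "engineer", "analyst"]:
    mask = is_dup & lowered.str.contains(keyword)
    offers.loc[mask, "job"] = f"data {keyword}"

# Offers that were not duplicated are kept unchanged.
deduplicated = offers.drop_duplicates(subset=["name", "company", "job"])
print(len(deduplicated))
```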


In [14]:
#save to csv with current date

current_date = datetime.datetime.today().strftime("%Y_%m_%d")
file_name = f"job_offers_{current_date}.csv"
    
ads_information_complete3.to_csv(f"data/interim/{file_name}", sep = ";", encoding = "utf-8", header=True, index=False)