# Scraping a Job Advert to Map Keyword Frequency

This project scrapes text from an online job advertisement and analyzes the frequency of keywords within the ad. The goal is to identify the most emphasized skills and qualifications, helping job seekers tailor their resumes and applications to specific job postings.


##### BeautifulSoup and Requests

In [2]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# The URL of the job advertisement
url = 'https://www.finn.no/job/fulltime/ad.html?finnkode=362413628'

## Step 1: Check the Status of the Webpage

Before scraping the content, we need to ensure that the webpage is accessible by checking the HTTP status code. A status code of 200 indicates the page is available, while a 404 error means the page is not found.


In [3]:

def CheckSoupStatus(url):
    """
    Send a GET request to the URL, print the HTTP status code, and print
    a message indicating whether the request was successful
    (200 for OK, 404 for Not Found).
    """

    # Send an HTTP GET request to the specified URL
    webpage = requests.get(url)
    print(f'Status code: {webpage.status_code}')

    # Provide feedback based on the status code
    if webpage.status_code == 200:
        print('Full of soup! Feel free to retrieve the soup of choice.')
    elif webpage.status_code == 404:
        # 404 is a client-side "Not Found" error, not a server error
        print('Page not found. No soup available at this address.')
    else:
        print('Oh no. Something is not working.')

# Check the status of the webpage
CheckSoupStatus(url)


Status code: 200
Full of soup! Feel free to retrieve the soup of choice.
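
Status codes other than 200 and 404 fall into broader classes determined by their first digit. The sketch below uses only the standard library to label a code by its class. `describe_status` is a hypothetical helper introduced for illustration; it is not part of the notebook or of `requests`.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)' for a status code."""
    # Look up the standard reason phrase; unregistered codes raise ValueError
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = 'Unknown'

    # The first digit of the code determines its class
    if 100 <= code < 200:
        category = 'informational'
    elif 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirection'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'other'
    return f'{code} {phrase} ({category})'

for example_code in (200, 404, 500):
    print(describe_status(example_code))
```

A helper like this could be called with `webpage.status_code` in place of the fixed `if`/`elif` messages.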


## Step 2: Scrape and Parse the Webpage Content

Next, we'll send a request to the URL and parse the HTML content of the page using BeautifulSoup. We'll then extract the title of the webpage and inspect the HTML structure.


In [4]:
# requests.get() returns a Response object,
# which contains the server's response to the request.
# This object includes attributes like .text (the content of the response)
# and .status_code (the HTTP status code).

# Send an HTTP GET request to the URL and get the page content
page = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

# Display the title of the webpage as a string
soup.title.string

'Vikariat som bioingeniør/ avdelingsingeniør ved Kreftgenomikk | FINN.no'

In [11]:
# prettify() re-indents the HTML so that the nesting of tags is readable

# Get the prettified HTML content
pretty_html = soup.prettify()

# Print only the first 1000 characters
print(pretty_html[:1000])

<!DOCTYPE html>
<html lang="nb">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="0" name="nmp:tracking:aurora"/>
  <meta content="FINN" name="nmp:tracking:brand"/>
  <meta content="job-item-web" name="nmp:tracking:app-name"/>
  <meta content="0" property="mbl:login"/>
  <title>
   Vikariat som bioingeniør/ avdelingsingeniør ved Kreftgenomikk | FINN.no
  </title>
  <meta content="Kreftgenomikk er ei eining i Laboratorieklinikken med ansvar for avansert molekylær kreftdiagnostikk. Avdelinga er ei av åtte einingar i Laboratorieklinikken" name="description"/>
  <meta content="Vikariat som bioingeniør/ avdelingsingeniør ved Kreftgenomikk | FINN.no" property="og:title"/>
  <meta content="https://www.finn.no/job/fulltime/ad.html?finnkode=362413628" property="og:url"/>
  <meta content="Kreftgenomikk er ei eining i Laboratorieklinikken med ansvar for avansert moleky

## Step 3: Extract the Job Advertisement Text

The job advertisement text sits in a `div` tag with the class `import-decoration`. We locate that tag, extract its text, and strip the surrounding whitespace.


In [12]:
# find_all returns every div tag with the class 'import-decoration';
# the trailing semicolon suppresses the cell output
soup.find_all('div', class_='import-decoration');

In [13]:
# find returns only the first matching tag (find_all returns a list),
# so .text can be read from it directly
job_ad = soup.find('div', class_='import-decoration').text.strip()

job_ad

'Kreftgenomikk er ei eining i Laboratorieklinikken med ansvar for avansert molekylær kreftdiagnostikk. Avdelinga er ei av åtte einingar i Laboratorieklinikken som rapporterer direkte til klinikkdirektør. Kreftgenomikk samarbeider med Avdeling for patologi, Avdeling for medisinsk genetikk, Avdeling for medisinsk biokjemi og farmakologi, Regionalt kompetansesenter for arveleg kreft, Seksjon for bioinformatikk og Biobank Haukeland. Dei tilsette har ulik yrkesfagleg bakgrunn med høg faglegkompetanse (helsesekretær, bioingeniørar, molekylærbiologar og legar).Innan kreftfeltet har vi for tida ei nasjonal satsing for å bygge opp infrastruktur for presisjonsdiagnostikk (InPreD). Helse Bergen HF er eit nivå 1-sjukehus og er dermed leiande i utviklingsarbeidet. Kreftgenomikk har ei sentral rolle i innføring av nødvendige laboratorieanalysar for presisjonsmedisin og skal i tillegg til rutinediagnostik og leggje til rette infrastruktur, biobanking og studiespesifikke analysar i kliniske studiar. A
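
The difference between `find` and `find_all` can be seen on a small inline HTML snippet. This is a minimal sketch: the HTML string and the helper name `extract_ad_text` are made up for illustration, and only the class name `import-decoration` comes from the page above. The helper also guards against a missing tag, because `find` returns `None` when nothing matches and reading `.text` from `None` raises an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
sample_html = """
<html><body>
  <div class="import-decoration"> First block of ad text. </div>
  <div class="import-decoration">Second block.</div>
  <div class="footer">Not part of the ad.</div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

def extract_ad_text(soup, css_class='import-decoration'):
    """Return the stripped text of the first matching div, or None if absent."""
    tag = soup.find('div', class_=css_class)
    if tag is None:
        return None
    return tag.get_text().strip()

# find_all collects every matching tag; find stops at the first one
all_blocks = sample_soup.find_all('div', class_='import-decoration')
first_text = extract_ad_text(sample_soup)
missing = extract_ad_text(sample_soup, css_class='no-such-class')

print(len(all_blocks))
print(first_text)
print(missing)
```

Checking for `None` may be worthwhile here, since the class name depends on the site's current markup and could change.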

## Step 4: Process the Text Data

We'll split the job ad text into individual words, remove any empty entries, and convert all words to lowercase to standardize them.


In [17]:

# Define the separators as one regex: punctuation and spaces, plus the zero-width
# boundary between a lowercase and an uppercase letter (splits fused words, as in camelCase)
pattern = r'(?<=[a-z])(?=[A-Z])|[;,/.() ]'

# Split the job ad text based on the defined pattern
word_list = re.split(pattern, job_ad)

# Showing a snippet of the list
word_list[:15]

['Kreftgenomikk',
 'er',
 'ei',
 'eining',
 'i',
 'Laboratorieklinikken',
 'med',
 'ansvar',
 'for',
 'avansert',
 'molekylær',
 'kreftdiagnostikk',
 '',
 'Avdelinga',
 'er']

In [18]:
# Process the word list: remove empty list entries and standardize to lowercase
words = [word.lower() for word in word_list if word != '']

# Showing a snippet of the list
words[:15]

['kreftgenomikk',
 'er',
 'ei',
 'eining',
 'i',
 'laboratorieklinikken',
 'med',
 'ansvar',
 'for',
 'avansert',
 'molekylær',
 'kreftdiagnostikk',
 'avdelinga',
 'er',
 'ei']
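
The split, filter, and lowercase steps can be wrapped in one function and tried on short strings. This is a minimal sketch: `split_words` and the sample strings are hypothetical, while `PATTERN` is the pattern used above. Because `[a-z]` and `[A-Z]` cover only ASCII letters, a case boundary next to `æ`, `ø`, or `å` is not split. `PATTERN_NO`, which adds those letters, is an assumption about what may suit Norwegian text and is not part of the original notebook.

```python
import re

# Pattern from the notebook: ASCII case boundary or one separator character
PATTERN = r'(?<=[a-z])(?=[A-Z])|[;,/.() ]'

# Assumed variant that also treats æ, ø, å as letters at a case boundary
PATTERN_NO = r'(?<=[a-zæøå])(?=[A-ZÆØÅ])|[;,/.() ]'

def split_words(text, pattern=PATTERN):
    """Split text on the pattern, drop empty entries, and lowercase each word."""
    return [word.lower() for word in re.split(pattern, text) if word]

# Adjacent separators such as ').' produce an empty entry that gets dropped
ascii_case = split_words('legar).Innan kreftfeltetHar vi')

# The ASCII-only pattern leaves this fused word in one piece
fused_default = split_words('blåBær')

# The extended pattern splits at the å/B boundary
fused_extended = split_words('blåBær', pattern=PATTERN_NO)

print(ascii_case)
print(fused_default)
print(fused_extended)
```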

## Step 5: Analyze Keyword Frequency

Using pandas, we'll count the frequency of each word in the job ad and store the results in a DataFrame, sorted from most to least frequent. Finally, we'll save the word frequency data to a CSV file for further analysis.


In [10]:
# Create a pandas Series from the word list and count the frequency of each word
series = pd.Series(words).value_counts()

# Create a DataFrame with the word counts
df = pd.DataFrame(index=series.index)
df['Count'] = series

# Save the word frequency DataFrame to a CSV file
df.to_csv('word_list.csv')

# Display the DataFrame (only the last expression in a cell is rendered)
df
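
As a cross-check on the pandas result, the same counts can be produced with the standard library. This is a minimal sketch, assuming the `words` list from Step 4; here it is filled with the first 15 entries shown above so the block runs on its own, and `top_two` is just an illustrative name.

```python
from collections import Counter

# First 15 processed words from Step 4, repeated here so the block is self-contained
words = [
    'kreftgenomikk', 'er', 'ei', 'eining', 'i', 'laboratorieklinikken',
    'med', 'ansvar', 'for', 'avansert', 'molekylær', 'kreftdiagnostikk',
    'avdelinga', 'er', 'ei',
]

# Counter maps each distinct word to the number of times it occurs
counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

`Counter.most_common` and `value_counts` both sort by descending count, though the order among words with equal counts may differ between them.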