# Scraping a Job Advert to Map Keyword Frequency

This project scrapes text from an online job advertisement and analyzes the frequency of keywords within the ad. The goal is to identify the most emphasized skills and qualifications, helping job seekers tailor their resumes and applications to specific job postings.


##### BeautifulSoup and Requests

In [3]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# The URL of the job advertisement (this one is fake for preview purposes)
url = 'https://www.techjobs.com/job/fulltime/ad.html?jobcode=123456'

## Step 1: Check the Status of the Webpage

Before scraping the content, we need to ensure that the webpage is accessible by checking the HTTP status code. A status code of 200 indicates the page is available, while a 404 error means the page is not found.


In [4]:
def CheckSoupStatus(url):
    
    """ 
    Check the HTTP status code of the URL response and print a message 
    indicating whether the request was successful.
    (200 for OK, 404 for Not Found).
    """

    # Send an HTTP GET request to the specified URL (give up after 10 seconds)
    webpage = requests.get(url, timeout=10)
    print(f'Status code: {webpage.status_code}')

    # Provide a message/feedback based on the status code
    if webpage.status_code == 200:
        print('Full of soup! Feel free to retrieve the soup of choice.')
    elif webpage.status_code == 404:
        print('Page not found. No soup available at the moment.')
    else:
        print('Oh no. Something is not working.')

# Check the status of the webpage
CheckSoupStatus(url)


Status code: 200
Full of soup! Feel free to retrieve the soup of choice.
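
The check above distinguishes only 200 and 404. As a minimal sketch, the helper below groups any status code by its class (2xx success, 4xx client error, 5xx server error) using the standard library's `http.HTTPStatus`. The function name `describe_status` is hypothetical and not part of the original notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Look up the standard reason phrase; non-standard codes raise ValueError
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"

    # Classify the code by its hundreds range
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "non-standard"

    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```

A 404 falls in the client-error range: the server is reachable, but the requested page does not exist. Codes in the 5xx range are the ones that indicate a server-side failure.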


## Step 2: Scrape and Parse the Webpage Content

Next, we'll send a request to the URL and parse the HTML content of the page using BeautifulSoup. We'll then extract the title of the webpage and inspect the HTML structure.


In [15]:
# requests.get() returns a Response object,
# which contains the server's response to the request.
# This object includes attributes like .text (the content of the response)
# and .status_code (the HTTP status code).

# Send an HTTP GET request to the URL and get the page content
page = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')

> Because the URL above is fictitious, the cells below load the preview page from a local HTML file and treat it as if it had just been fetched from the website.

In [47]:
fake_html_file = 'fictious_job_advertisement.html'

# Read the HTML content from the file
with open(fake_html_file, 'r', encoding='utf-8') as file:
    fake_html_content = file.read()

# Parse the HTML content using BeautifulSoup
fake_soup = BeautifulSoup(fake_html_content, 'html.parser')

# Display the title of the webpage as a string
fake_soup.title.string

'Software Engineer Position | TechJobs.com'

In [22]:
# prettify() re-indents the HTML so that the tag hierarchy is easy to read
# Get the prettified HTML content
pretty_html = fake_soup.prettify()

# Print only the first 1000 characters
print(pretty_html[:1000])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <title>
   Software Engineer Position | TechJobs.com
  </title>
  <meta content="Join our innovative team as a Software Engineer, leading the development of groundbreaking applications." name="description"/>
  <meta content="Software Engineer Position | TechJobs.com" property="og:title"/>
  <meta content="https://www.techjobs.com/job/fulltime/ad.html?jobcode=123456" property="og:url"/>
  <meta content="Join our innovative team as a Software Engineer, leading the development of groundbreaking applications." property="og:description"/>
 </head>
 <body>
  <div id="job-advertisement">
   <div class="job-title">
    Software Engineer Position
   </div>
   <div class="company-name">
    TechJobs.com
   </div>
   <div class="job-location">
    San Francisco, CA
   </div>
   <div class="job-descript

## Step 3: Extract the Job Advertisement Text

We will search for the specific `div` tag containing the job advertisement text. Once found, we will extract and clean the text.


In [24]:
# find_all returns every div tag whose class is 'job-description'
# (the trailing semicolon suppresses the notebook output)
fake_soup.find_all('div', class_='job-description');

In [36]:
# find returns only the first matching tag, so .text can be read from it directly
job_ad = fake_soup.find('div', class_='job-description').text.strip()

# Showing a snippet of the string
job_ad[:200]

'We are looking for a passionate Software Engineer to design, develop, and implement new features for our web applications. The ideal candidate has a solid background in software engineering, experienc'
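
One thing to watch for: `find` returns `None` when no tag matches, so chaining `.text` directly raises an `AttributeError` on pages with a different layout. The sketch below guards against that, using a short inline HTML string as a stand-in for the scraped page. The helper name `extract_text` and the `salary` class are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for the scraped job advert
html = """
<div id="job-advertisement">
  <div class="job-title">Software Engineer Position</div>
  <div class="job-description">  We are looking for a passionate Software Engineer.  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_text(soup, css_class):
    """Return the stripped text of the first div with the given class, or None."""
    tag = soup.find("div", class_=css_class)
    if tag is None:
        # No matching tag: report that instead of raising AttributeError
        return None
    return tag.get_text().strip()


description = extract_text(soup, "job-description")
missing = extract_text(soup, "salary")  # this class does not exist in the snippet

# find_all always returns a list-like result, which is empty when nothing matches
matches = soup.find_all("div", class_="job-description")

print(description)
print(missing)
print(len(matches))
```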

## Step 4: Process the Text Data

We'll split the job ad text into individual words, remove any empty entries, and convert all words to lowercase to standardize them.


In [37]:

# Regex pattern with several separators; the zero-width alternative also splits
# camelCase and PascalCase words at each lowercase-to-uppercase boundary
pattern = r'(?<=[a-z])(?=[A-Z])|[;,/.() ]'

# Split the job ad text based on the defined pattern
word_list = re.split(pattern, job_ad)

# Showing a snippet of the list
word_list[:15]

['We',
 'are',
 'looking',
 'for',
 'a',
 'passionate',
 'Software',
 'Engineer',
 'to',
 'design',
 '',
 'develop',
 '',
 'and',
 'implement']

In [38]:
# Process the word list: remove empty list entries and standardize to lowercase
words = [word.lower() for word in word_list if word != '']

# Showing a snippet of the list
words[:15]

['we',
 'are',
 'looking',
 'for',
 'a',
 'passionate',
 'software',
 'engineer',
 'to',
 'design',
 'develop',
 'and',
 'implement',
 'new',
 'features']
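
To see how the pattern behaves, the sketch below wraps the split, filter, and lowercase steps in one function and runs it on a short made-up sentence. The function name `tokenize` and the sample text are assumptions for illustration. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# Same pattern as above: a zero-width lowercase-to-uppercase boundary,
# or any one of the listed separator characters
PATTERN = r'(?<=[a-z])(?=[A-Z])|[;,/.() ]'


def tokenize(text):
    """Split text on the pattern, drop empty strings, and lowercase each token."""
    return [token.lower() for token in re.split(PATTERN, text) if token]


# Hypothetical sample sentence, not taken from the job advert
sample = "Experience with JavaScript, Python/Django."

raw_tokens = re.split(PATTERN, sample)
tokens = tokenize(sample)

print(raw_tokens)
print(tokens)
```

Note that the camelCase rule also splits product names: `JavaScript` becomes `java` and `script`. If such names should stay whole, one option is to drop the first alternative of the pattern.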

## Step 5: Analyze Keyword Frequency

Using Pandas, we'll count the frequency of each word in the job ad and store the results in a DataFrame. Finally, we'll save the word frequency data to a CSV file for further analysis.


In [46]:
# Create a Pandas Series from the word list and count the frequency of each word
series = pd.Series(words).value_counts()

# Create a DataFrame with the word counts
df = series.to_frame('Count')

# Showing the top 15 words
print(df.head(15))

# Save the word frequency DataFrame to a CSV file
df.to_csv('word_list.csv')

              Count
and              13
in                5
software          4
to                3
with              3
web               3
a                 3
for               3
code              2
technologies      2
the               2
of                2
python            2
script            2
experience        2
