# Which Skill Set Is Most Needed?: Web Scraping from BC Jobs

This project builds a web scraping tool with the `BeautifulSoup` library to extract job posting data from [BC Jobs](https://www.bcjobs.ca/). The goal is to find out which technical skills are most in demand for Data Analyst roles.

The scraper runs one search query per technical skill, using the skill name as the keyword. The results of each query are saved in a structured format, so the number of postings per skill can be compared for Data Analyst roles in British Columbia.

The analysis counts how often each technical skill appears across job postings. These counts can help job seekers pursuing a Data Analysis career in British Columbia, as well as employers looking to hire Data Analysts.

Finally, I summarize the findings and suggest improvements based on what the data shows.


## Building the Web Scraping Model

To build the web scraping model, we first import the necessary libraries, `BeautifulSoup` and `requests`. We then download the search results page with `requests` and parse it with `BeautifulSoup`. We use the keyword 'Python' to search for job postings that mention this skill. To filter out outdated postings, we only retrieve postings published within the last month, via the `freshness=31` search parameter.
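One detail worth checking when a search URL is copied from the browser's address bar: anything after a `#` is a URL fragment, which stays in the browser and is not sent to the server. The sketch below uses only the standard library to show which parameters of such a URL would reach the server, and how to build the query string from a dictionary instead. The shortened URL is illustrative, and whether the site applies `freshness` when it arrives as a query parameter is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Illustrative, shortened address-bar URL: the filters sit after the '#'
url = ("https://www.bcjobs.ca/search-jobs?q=python&location="
       "#page=1&q=python&location=&range=0&freshness=31")

parts = urlsplit(url)

# Only the query part (between '?' and '#') is sent in the HTTP request
sent_to_server = parse_qs(parts.query, keep_blank_values=True)
# The fragment (after '#') is handled by the browser only
fragment_only = parse_qs(parts.fragment, keep_blank_values=True)

print(sent_to_server)
print(fragment_only)

# Building the URL from a dict keeps every filter in the query string
params = {"q": "python", "location": "", "range": "0", "freshness": "31"}
query_url = "https://www.bcjobs.ca/search-jobs?" + urlencode(params)
print(query_url)
```

With `requests`, the same effect comes from passing the dictionary through the `params` argument of `requests.get`.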

In [1]:
# Import libraries
from bs4 import BeautifulSoup
import requests
import json

# Pass the filters as query parameters: anything after '#' in a URL is a
# fragment and is never sent to the server
params = {'q': 'python', 'location': '', 'range': '0', 'freshness': '31'}
html_text = requests.get('https://www.bcjobs.ca/search-jobs', params=params).text
soup = BeautifulSoup(html_text, 'html.parser')
jobs = soup.find_all('a', class_='list-item-wrapper')

print(jobs)

[<a class="list-item-wrapper clearfix odd" href="/jobs/web-developer-burnaby-1056458" target="_self" title="Web Developer">
<div class="col-xs-12 col-sm-8 u_p-none">
<div class="text-16 list-item-title text-limit text-bold">Web Developer</div>
<div class="list-divider flex-w100">
<div class="text-limit">Fortinet</div>
</div>
</div>
<div class="col-xs-12 col-sm-4 text-right xs-text-left u_p-none">
<strong>May. 9, 2023</strong><br/>
Burnaby, BC<br/>
</div>
</a>, <a class="list-item-wrapper clearfix even" href="/jobs/senior-cloud-developer-burnaby-1022693" target="_self" title="Senior Cloud Developer">
<div class="col-xs-12 col-sm-8 u_p-none">
<div class="text-16 list-item-title text-limit text-bold">Senior Cloud Developer</div>
<div class="list-divider flex-w100">
<div class="text-limit">Fortinet</div>
</div>
</div>
<div class="col-xs-12 col-sm-4 text-right xs-text-left u_p-none">
<strong>May. 9, 2023</strong><br/>
Burnaby, BC<br/>
</div>
</a>, <a class="list-item-wrapper clearfix odd" h

For a quick look at the structure, I'll start with the first job posting only.

In [2]:
job = soup.find('a', class_='list-item-wrapper')

print(job)

<a class="list-item-wrapper clearfix odd" href="/jobs/web-developer-burnaby-1056458" target="_self" title="Web Developer">
<div class="col-xs-12 col-sm-8 u_p-none">
<div class="text-16 list-item-title text-limit text-bold">Web Developer</div>
<div class="list-divider flex-w100">
<div class="text-limit">Fortinet</div>
</div>
</div>
<div class="col-xs-12 col-sm-4 text-right xs-text-left u_p-none">
<strong>May. 9, 2023</strong><br/>
Burnaby, BC<br/>
</div>
</a>


In [3]:
job_title = job.find('div', class_='list-item-title').text

print(job_title)

Web Developer


In [4]:
company_name = job.find('div', class_='list-divider flex-w100').text.replace('\n', '')

print(company_name)

Fortinet


In [5]:
published_date = job.find('strong').text
published_date

'May. 9, 2023'
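The published date comes back as plain text. If the dates are needed for filtering or sorting, they can be parsed into `date` objects. This is a minimal sketch that assumes every date follows the `'May. 9, 2023'` pattern seen above (three-letter month, period, day, year) and an English locale; the helper names `parse_published` and `is_recent` are hypothetical, and any other format is returned as `None` rather than guessed.

```python
from datetime import date, datetime, timedelta


def parse_published(text):
    """Parse a listing date such as 'May. 9, 2023'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%b. %d, %Y").date()
    except ValueError:
        return None


def is_recent(text, today, max_age_days=31):
    """True if the posting date is at most max_age_days before `today`."""
    posted = parse_published(text)
    if posted is None:
        return False
    return timedelta(0) <= today - posted <= timedelta(days=max_age_days)


print(parse_published("May. 9, 2023"))              # 2023-05-09
print(is_recent("May. 9, 2023", date(2023, 5, 20)))  # True
```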

In [6]:
link = job['href']

link

'/jobs/web-developer-burnaby-1056458'

This is a relative link. Let's prepend 'https://www.bcjobs.ca' so it becomes a complete URL.

In [7]:
link = "https://www.bcjobs.ca" + job['href']

print(link)

https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458
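String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urljoin` from the standard library resolves a relative link against a base URL and leaves an already absolute link unchanged. The constant name `BASE_URL` is my own.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bcjobs.ca"

# Relative href, as returned by job['href']
link = urljoin(BASE_URL, "/jobs/web-developer-burnaby-1056458")
print(link)  # https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458

# An absolute URL passes through unchanged
same = urljoin(BASE_URL, "https://www.bcjobs.ca/jobs/devops-specialist-burnaby-927734")
print(same)
```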


Let's repeat these steps for each job posting on the first page.

In [8]:
for job in jobs:
    # Extract the same four fields as above for every posting on the page
    job_title = job.find('div', class_='list-item-title').text
    company_name = job.find('div', class_='list-divider flex-w100').text.replace('\n', '')
    published_date = job.find('strong').text
    post_link = "https://www.bcjobs.ca" + job['href']

    print(f"Job Title: {job_title}")
    print(f"Company Name: {company_name}")
    print(f"Published Date: {published_date}")
    print(f"Link: {post_link}")
    print("")

Job Title: Web Developer
Company Name: Fortinet
Published Date: May. 9, 2023
Link: https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458

Job Title: Senior Cloud Developer
Company Name: Fortinet
Published Date: May. 9, 2023
Link: https://www.bcjobs.ca/jobs/senior-cloud-developer-burnaby-1022693

Job Title: Release QA Specialist
Company Name: Fortinet
Published Date: May. 11, 2023
Link: https://www.bcjobs.ca/jobs/release-qa-specialist-burnaby-1079339

Job Title: DevOps Specialist
Company Name: Fortinet
Published Date: May. 8, 2023
Link: https://www.bcjobs.ca/jobs/devops-specialist-burnaby-927734

Job Title: Software Dev QA
Company Name: Fortinet
Published Date: May. 10, 2023
Link: https://www.bcjobs.ca/jobs/software-dev-qa-burnaby-1077727

Job Title: Software Dev QA
Company Name: Fortinet
Published Date: May. 10, 2023
Link: https://www.bcjobs.ca/jobs/software-dev-qa-burnaby-1080005

Job Title: Senior Consultant, Full Stack Developer
Company Name: KPMG
Published Date: May. 8, 2023
Lin
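The loop above assumes every posting contains a title, a company, and a date. `find` returns `None` when an element is missing, so calling `.text` on it would raise an `AttributeError`. The following is a sketch of a more defensive version, not part of the original notebook: the helper `parse_job` is hypothetical, the first sample posting copies the HTML structure shown earlier, and the second one is made up to show the missing-field case.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.bcjobs.ca"


def parse_job(job):
    """Return one posting as a dict; fields missing from the HTML become None."""
    def text_of(tag):
        return tag.get_text(strip=True) if tag is not None else None

    href = job.get("href")
    return {
        "Job Title": text_of(job.find("div", class_="list-item-title")),
        "Company Name": text_of(job.find("div", class_="list-divider")),
        "Published Date": text_of(job.find("strong")),
        "Link": BASE_URL + href if href else None,
    }


# First posting mirrors the real markup; the second is a made-up incomplete one
sample_html = """
<a class="list-item-wrapper clearfix odd" href="/jobs/web-developer-burnaby-1056458">
  <div class="text-16 list-item-title text-limit text-bold">Web Developer</div>
  <div class="list-divider flex-w100"><div class="text-limit">Fortinet</div></div>
  <strong>May. 9, 2023</strong>
</a>
<a class="list-item-wrapper clearfix even" href="/jobs/example-without-company">
  <div class="text-16 list-item-title text-limit text-bold">Example Posting</div>
</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
parsed = [parse_job(a) for a in soup.find_all("a", class_="list-item-wrapper")]
for posting in parsed:
    print(posting)
```

Keeping the extraction in one function also avoids repeating the same four lines in every loop that follows.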

Let's expand the web scraping model to retrieve job postings from every page of the search results. We first read the total number of results to work out how many pages there are, then loop through each page, extract the job data with `BeautifulSoup` and `requests`, and collect it in a Python list named `python_results`.

In [9]:
from bs4 import BeautifulSoup
import requests

base_url = "https://www.bcjobs.ca/search-jobs"
params = {
    "q": "python",
    "location": "",
    "range": "0",
    "freshness": "31",
    "categoryIds": "",
    "employerTypeIds": "",
    "positionTypeIds": "",
    "memberStatusIds": "",
    "careerLevelIds": "",
    "featuredEmployersOnly": "false",
    "trainingPositionsOnly": "false"
}

# Get the last page number
html_text = requests.get(base_url, params=params).text
soup = BeautifulSoup(html_text, "html.parser")
total_results = int(soup.find("strong", {"data-outlet": "total"}).text)
# Ceiling division, assuming 10 postings per results page
num_pages = (total_results + 9) // 10

# Scrape the job postings from all pages and store the results in a variable
python_results = []
for page in range(1, num_pages + 1):
    params["page"] = page
    html_text = requests.get(base_url, params=params).text
    soup = BeautifulSoup(html_text, "html.parser")
    jobs = soup.find_all("a", class_="list-item-wrapper")

    for job in jobs:
        job_title = job.find("div", class_="list-item-title").text.strip()
        company_name = job.find("div", class_="list-divider flex-w100").text.strip()
        published_date = job.find("strong").text.strip()
        post_link = "https://www.bcjobs.ca" + job["href"]
        
        python_results.append({
            "Job Title": job_title,
            "Company Name": company_name,
            "Published Date": published_date,
            "Link": post_link
        })
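The number of pages in the cell above is a ceiling division: the total number of results divided by the page size, rounded up. A small sketch of that arithmetic on its own follows. The function name `page_count` is hypothetical, and the page size of 10 is the value the notebook code assumes.

```python
def page_count(total_results, per_page=10):
    """Number of result pages needed to show total_results postings."""
    return (total_results + per_page - 1) // per_page


print(page_count(115))  # 12 (11 full pages plus one page with 5 postings)
print(page_count(100))  # 10
print(page_count(0))    # 0
```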

In [10]:
print(f"There are {len(python_results)} job postings for Python")

for job in python_results:
    print(json.dumps(job, indent=4))

There are 115 job postings for Python
{
    "Job Title": "Web Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 9, 2023",
    "Link": "https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458"
}
{
    "Job Title": "Senior Cloud Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 9, 2023",
    "Link": "https://www.bcjobs.ca/jobs/senior-cloud-developer-burnaby-1022693"
}
{
    "Job Title": "Release QA Specialist",
    "Company Name": "Fortinet",
    "Published Date": "May. 11, 2023",
    "Link": "https://www.bcjobs.ca/jobs/release-qa-specialist-burnaby-1079339"
}
{
    "Job Title": "DevOps Specialist",
    "Company Name": "Fortinet",
    "Published Date": "May. 8, 2023",
    "Link": "https://www.bcjobs.ca/jobs/devops-specialist-burnaby-927734"
}
{
    "Job Title": "Software Dev QA",
    "Company Name": "Fortinet",
    "Published Date": "May. 10, 2023",
    "Link": "https://www.bcjobs.ca/jobs/software-dev-qa-burnaby-1077727"
}
{
    "Job Title": "

## Scraping for Each Skill Set

Now we can modify the previous code so that the model scrapes all the results for each skill set. I'll search for `Python`, `SQL`, `Tableau`, and `Power BI`. `R` is also widely used in the field, but a single-letter keyword matches too many unrelated postings to give a meaningful count, so it is left out.

In [11]:
from bs4 import BeautifulSoup
import requests

base_url = "https://www.bcjobs.ca/search-jobs"
params = {
    "location": "",
    "range": "0",
    "freshness": "31",
    "categoryIds": "",
    "employerTypeIds": "",
    "positionTypeIds": "",
    "memberStatusIds": "",
    "careerLevelIds": "",
    "featuredEmployersOnly": "false",
    "trainingPositionsOnly": "false"
}

skill_list = ['python', 'sql', 'tableau', 'power bi']

results = {}

for skill in skill_list:
    params['q'] = skill
    # Drop the page number left over from the previous skill's loop
    params.pop('page', None)
    
    html_text = requests.get(base_url, params=params).text
    soup = BeautifulSoup(html_text, "html.parser")
    total_results = int(soup.find("strong", {"data-outlet": "total"}).text)
    # Ceiling division, assuming 10 postings per results page
    num_pages = (total_results + 9) // 10
    
    # Scrape the job postings from all pages and store the results in a variable
    skill_result = []
    for page in range(1, num_pages + 1):
        params["page"] = page
        html_text = requests.get(base_url, params=params).text
        soup = BeautifulSoup(html_text, "html.parser")
        jobs = soup.find_all("a", class_="list-item-wrapper")

        for job in jobs:
            job_title = job.find("div", class_="list-item-title").text.strip()
            company_name = job.find("div", class_="list-divider flex-w100").text.strip()
            published_date = job.find("strong").text.strip()
            post_link = "https://www.bcjobs.ca" + job["href"]

            skill_result.append({
                "Job Title": job_title,
                "Company Name": company_name,
                "Published Date": published_date,
                "Link": post_link
            })

    # Store the result for this skill set in the dictionary
    results[skill] = skill_result
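Because each skill is searched separately, the same posting can appear under more than one keyword. The sketch below shows one way to see that overlap, assuming `results` has the shape built above (skill name mapped to a list of posting dicts) and that the link identifies a posting. The function name `skills_by_link` and the sample data are made up for illustration.

```python
def skills_by_link(results):
    """Map each posting link to the set of skills whose search returned it."""
    mapping = {}
    for skill, postings in results.items():
        for posting in postings:
            mapping.setdefault(posting["Link"], set()).add(skill)
    return mapping


# Made-up sample in the same shape as `results`
sample_results = {
    "python": [
        {"Job Title": "Example A", "Link": "https://www.bcjobs.ca/jobs/example-a"},
        {"Job Title": "Example B", "Link": "https://www.bcjobs.ca/jobs/example-b"},
    ],
    "sql": [
        {"Job Title": "Example B", "Link": "https://www.bcjobs.ca/jobs/example-b"},
    ],
}

mapping = skills_by_link(sample_results)
multi_skill = {link for link, skills in mapping.items() if len(skills) > 1}

print(len(mapping))   # number of distinct postings
print(multi_skill)    # postings that matched more than one skill
```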

In [12]:
for job in results['sql']:
    print(json.dumps(job, indent=4))

{
    "Job Title": "Software Application Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 10, 2023",
    "Link": "https://www.bcjobs.ca/jobs/software-application-developer-burnaby-1079338"
}
{
    "Job Title": "Senior Software Applications Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 9, 2023",
    "Link": "https://www.bcjobs.ca/jobs/senior-software-applications-developer-burnaby-987724"
}
{
    "Job Title": "Software Applications Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 9, 2023",
    "Link": "https://www.bcjobs.ca/jobs/software-applications-developer-burnaby-987725"
}
{
    "Job Title": "Senior .net/C# Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 9, 2023",
    "Link": "https://www.bcjobs.ca/jobs/senior-net-c-developer-burnaby-987726"
}
{
    "Job Title": "Sr. Dev Ops Developer",
    "Company Name": "Fortinet",
    "Published Date": "May. 15, 2023",
    "Link": "https://www.bcjobs

In [13]:
print(f"There are {len(results['python'])} job postings for Python")
print(f"There are {len(results['sql'])} job postings for SQL")
print(f"There are {len(results['tableau'])} job postings for Tableau")
print(f"There are {len(results['power bi'])} job postings for Power BI")

There are 115 job postings for Python
There are 101 job postings for SQL
There are 22 job postings for Tableau
There are 43 job postings for Power BI
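The counts printed above are easier to compare once sorted. A short sketch that ranks the skills by number of postings follows, with the four counts copied from the output above. With the live data, `counts` could instead be built as `{skill: len(jobs) for skill, jobs in results.items()}`.

```python
# Counts copied from the output above
counts = {"python": 115, "sql": 101, "tableau": 22, "power bi": 43}

# Sort skills from most to fewest postings
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for skill, n in ranked:
    # One '#' per 5 postings gives a rough text bar chart
    print(f"{skill:<10} {n:>4} {'#' * (n // 5)}")
```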


## Conclusion

After scraping the BC Jobs website, we can see that Python has the most job postings, with 115 results. However, this search includes not only data-related jobs but also software developer and other tech jobs, so the count overstates the demand within data roles. SQL is the second most requested skill with 101 job postings, followed by Power BI with 43. Tableau has only 22 job postings, roughly half as many as Power BI. This suggests that while Tableau is a valuable tool for data visualization, it may not be as widely used as Power BI in the BC job market.
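One way to reduce the noise from non-data roles is to keep only postings whose title looks data-related. This is a rough sketch, not something the notebook does: the keyword tuple `DATA_KEYWORDS` and the helper `is_data_role` are assumptions, and the "Data Analyst" sample title is made up. Matching on titles alone will miss some relevant postings.

```python
# Hypothetical keywords that mark a title as data-related
DATA_KEYWORDS = ("data", "analyst", "analytics", "business intelligence")


def is_data_role(title):
    """True if the job title contains any of the data-related keywords."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in DATA_KEYWORDS)


# Two titles from the scraped output plus one made-up data title
titles = ["Web Developer", "Senior Cloud Developer", "Data Analyst"]
data_titles = [title for title in titles if is_data_role(title)]
print(data_titles)  # ['Data Analyst']
```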

Overall, this model provides a simple way to compare the demand for different skill sets and to collect the matching job postings and their links. As a future improvement, the scraper could run on a schedule so that the results stay current without manual reruns. A notification system could also alert users when new job postings appear for a specific skill set.
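The core of such a notification step would be comparing the current scrape with the previous one. A minimal sketch follows, assuming postings are identified by their link and that the links from the last run have been stored somewhere (storage and the actual alert are left out). The function name `new_postings` is hypothetical; the sample links come from the output shown earlier.

```python
def new_postings(previous_links, current_postings):
    """Return the postings whose link was not seen in the previous run."""
    seen = set(previous_links)
    return [p for p in current_postings if p["Link"] not in seen]


# Links stored from an earlier run
previous = ["https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458"]

# Postings from the current run, in the same shape as the scraped results
current = [
    {"Job Title": "Web Developer",
     "Link": "https://www.bcjobs.ca/jobs/web-developer-burnaby-1056458"},
    {"Job Title": "Release QA Specialist",
     "Link": "https://www.bcjobs.ca/jobs/release-qa-specialist-burnaby-1079339"},
]

fresh = new_postings(previous, current)
for posting in fresh:
    print(f"New posting: {posting['Job Title']} ({posting['Link']})")
```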