<a href="https://colab.research.google.com/github/Oughty-Otieno/-Web-Scraping-with-Python/blob/main/Web_Scraping_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color='#2F4F4F'>To use this notebook on Colaboratory, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# <font color='#2F4F4F'>AfterWork Data Science: Web Scraping with Python</font>

# Background Information
Together with a team of startup entrepreneurs, you decide to work on an idea that could
change the way people search for jobs. You believe job scraping could be the next
big thing, since many people in the country (in this case, Kenya) are actively looking
for jobs.

# Problem Statement
Many job listings never reach the job seekers they target. Working in a team, your task
as the data scientist on this project is to scrape job titles and links and put them in
a single table that your team members can use to build a job aggregator.
You will need to scrape data from the following three technology job pages:

● PigiaMe: https://www.pigiame.co.ke/it-software-jobs

● MyJobMag: https://www.myjobmag.co.ke/jobs-by-field/information-technology

● KenyaJob: https://www.kenyajob.com/job-vacancies-search-kenya?f%5B0%5D=im_field_offre_secteur%3A133

## <font color='#2F4F4F'>Prerequisites</font>

In [1]:
# We first import the required libraries
# ---
#
import pandas as pd             # library for data manipulation
import requests                 # library for fetching a web page
from bs4 import BeautifulSoup   # library for extracting contents from a web page

## <font color='#2F4F4F'>Step 1: Obtaining our Data</font>

In [2]:
# PigiaMe: https://www.pigiame.co.ke/it-software-jobs
# ---
#
pigia_me = requests.get('https://www.pigiame.co.ke/it-software-jobs')
pigia_me

<Response [200]>

In [3]:
# MyJobMag: https://www.myjobmag.co.ke/jobs-by-field/information-technology
# ---
#
my_job_mag = requests.get('https://www.myjobmag.co.ke/jobs-by-field/information-technology')
my_job_mag

<Response [200]>

In [4]:
# KenyaJob: https://www.kenyajob.com/job-vacancies-search-kenya?f%5B0%5D=im_field_offre_secteur%3A133
# ---
#
kenyan_job = requests.get('https://www.kenyajob.com/job-vacancies-search-kenya?f%5B0%5D=im_field_offre_secteur%3A133')
kenyan_job

<Response [200]>
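
The three requests above are checked by eye: each cell displays `<Response [200]>`. A minimal sketch of a helper that makes the same check in code is shown below. The function name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the HTML text at `url`, raising an exception on HTTP errors.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # timeout avoids hanging forever
    response.raise_for_status()                # raises HTTPError for 4xx/5xx codes
    return response.text


# Example usage (performs a real network request, so it is left commented out):
# pigia_html = fetch_html('https://www.pigiame.co.ke/it-software-jobs')
```

With a helper like this, a failed download stops the notebook at the fetch step instead of passing an error page on to the parser.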

## <font color='#2F4F4F'>Step 2: Parsing</font>

In [5]:
# Parsing our document: pigia_me
# ---
# 
pigia_soup = BeautifulSoup(pigia_me.text, "html.parser")


In [6]:
# Parsing our document: my_job_mag
# ---
#  
my_job_soup = BeautifulSoup(my_job_mag.text, "html.parser")

In [7]:
# Parsing our document: kenyan_job
# ---
# 
kenyan_soup = BeautifulSoup(kenyan_job.text, "html.parser")

## <font color='#2F4F4F'>Step 3: Extracting Required Elements</font>

In [9]:
# 1. Extracting job titles: pigia_me
# ---
# 

pigia_me_results = pigia_soup.find_all('div',attrs={'class':'listing-card__header__title'})
pigia_me_results

[<div class="listing-card__header__title">
 Officer – Solution Developer
 </div>, <div class="listing-card__header__title">
 Data Analyst
 </div>, <div class="listing-card__header__title">
 Manager-System Developer
 </div>, <div class="listing-card__header__title">
 Manager-Database Management
 </div>, <div class="listing-card__header__title">
 Software Developer
 </div>, <div class="listing-card__header__title">
 Assistant Sales Data Analyst
 </div>, <div class="listing-card__header__title">
 Business Intelligence Analyst
 </div>, <div class="listing-card__header__title">
 Software Developer - Server Team Leader
 </div>, <div class="listing-card__header__title">
 Data Processing Officer- Programmer
 </div>, <div class="listing-card__header__title">
 Software Engineer – Karen, Kenya
 </div>]

In [12]:
# Clean the tags: keep only the stripped text of each element
pigia_me_results = [tag.get_text().strip() for tag in pigia_me_results]
pigia_me_results

['Officer – Solution Developer',
 'Data Analyst',
 'Manager-System Developer',
 'Manager-Database Management',
 'Software Developer',
 'Assistant Sales Data Analyst',
 'Business Intelligence Analyst',
 'Software Developer - Server Team Leader',
 'Data Processing Officer- Programmer',
 'Software Engineer – Karen, Kenya']
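
The class-based lookup used for PigiaMe can be written in a few equivalent ways in BeautifulSoup. The sketch below runs on a small inline sample modelled on the output above rather than on the live page. The `listing-card__price` div is a made-up element, added only to show that non-matching tags are skipped.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the PigiaMe output above.
# The last div (listing-card__price) is hypothetical.
sample_html = """
<div class="listing-card__header__title">
 Data Analyst
</div>
<div class="listing-card__header__title">
 Software Developer
</div>
<div class="listing-card__price">Not a title</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Three ways to select the same elements
by_attrs = soup.find_all("div", attrs={"class": "listing-card__header__title"})
by_class = soup.find_all("div", class_="listing-card__header__title")
by_css = soup.select("div.listing-card__header__title")

# get_text(strip=True) removes the surrounding whitespace in one call
titles = [tag.get_text(strip=True) for tag in by_css]
print(titles)
```

All three selections return the same two elements; which form to use is mostly a matter of taste.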

In [10]:
# 2. Extracting job titles: my_job_mag
# ---
# 
my_job_mag_results = my_job_soup.find_all('h2')
my_job_mag_results

[<h2><a href="/job/senior-technical-consultant-london-stock-exchange-group">Senior Technical Consultant at London Stock Exchange Group</a></h2>,
 <h2><a href="/job/it-intern-african-guarantee-fund-agf">IT Intern at African Guarantee Fund (AGF)</a></h2>,
 <h2><a href="/job/head-of-cyber-risk-red-team-equity-bank-kenya">Head of Cyber Risk &amp; Red Team at Equity Bank Kenya</a></h2>,
 <h2><a href="/job/cyber-risk-red-team-specialist-equity-bank-kenya">Cyber Risk &amp; Red Team Specialist at Equity Bank Kenya</a></h2>,
 <h2><a href="/job/it-risk-specialist-equity-bank-kenya">IT Risk Specialist at Equity Bank Kenya</a></h2>,
 <h2><a href="/job/senior-manager-systems-infrastructure-kcb-bank-kenya-1">Senior Manager Systems Infrastructure at KCB Bank Kenya</a></h2>,
 <h2><a href="/job/itil-processes-manager-kcb-bank-kenya">ITIL Processes Manager at KCB Bank Kenya</a></h2>,
 <h2><a href="/job/cybersecurity-analyst-devsecops-kcb-bank-kenya-1">Cybersecurity Analyst, DevSecOps at KCB Bank Kenya</

In [13]:
# Clean the tags: keep only the stripped text of each element
my_job_mag_results = [tag.get_text().strip() for tag in my_job_mag_results]
my_job_mag_results

['Senior Technical Consultant at London Stock Exchange Group',
 'IT Intern at African Guarantee Fund (AGF)',
 'Head of Cyber Risk & Red Team at Equity Bank Kenya',
 'Cyber Risk & Red Team Specialist at Equity Bank Kenya',
 'IT Risk Specialist at Equity Bank Kenya',
 'Senior Manager Systems Infrastructure at KCB Bank Kenya',
 'ITIL Processes Manager at KCB Bank Kenya',
 'Cybersecurity Analyst, DevSecOps at KCB Bank Kenya',
 'Cybersecurity Specialist, Data Security & Privacy at KCB Bank Kenya',
 'Information Technology (IT) Associate at Village Enterprise',
 'Deputy Director, Information Communication Technology at Kenya Medical Supplies Authority (KEMSA)',
 'IT Service Desk Analyst at Penda Health',
 'Graduate Assistant - ICT at Zetech University',
 'Assistant Area Manager at Jamii Telecommunications',
 'Information Technology (IT) Associate at Village Enterprise',
 'Regional ICT Engineer at International Committee of the Red Cross (ICRC)',
 'IP NPI and NW Design Engineer at Nokia',
 'S

In [11]:
# 3. Extracting job titles: kenya_job
# ---
#
kenya_job_results = kenyan_soup.find_all('h5')
kenya_job_results

[<h5><a href="/job-vacancies-kenya/sales-agent-110259">Sales Agent</a></h5>,
 <h5><a href="/job-vacancies-kenya/senior-service-engineering-manager-114237">Senior Service Engineering Manager</a></h5>,
 <h5><a href="/job-vacancies-kenya/account-technology-strategist-cross-114240">Account Technology Strategist (Cross)</a></h5>,
 <h5><a href="/job-vacancies-kenya/payroll-service-delivery-manager-114242">Payroll Service Delivery Manager</a></h5>,
 <h5><a href="/job-vacancies-kenya/software-engineering-lead-114243">Software Engineering Lead</a></h5>,
 <h5><a href="/job-vacancies-kenya/data-contribution-project-kenya-109699">Data Contribution Project - Kenya</a></h5>,
 <h5><a href="/job-vacancies-kenya/customer-service-officer-110222">Customer Service Officer </a></h5>,
 <h5><a href="/job-vacancies-kenya/customer-service-executive-110933">Customer Service Executive</a></h5>,
 <h5><a href="/job-vacancies-kenya/data-labeler-112385">Data Labeler</a></h5>,
 <h5><a href="/job-vacancies-kenya/trave

In [14]:
# Clean the tags: keep only the stripped text of each element
kenya_job_results = [tag.get_text().strip() for tag in kenya_job_results]
kenya_job_results

['Sales Agent',
 'Senior Service Engineering Manager',
 'Account Technology Strategist (Cross)',
 'Payroll Service Delivery Manager',
 'Software Engineering Lead',
 'Data Contribution Project - Kenya',
 'Customer Service Officer',
 'Customer Service Executive',
 'Data Labeler',
 'Travel Safety & Security Manager  and Compliance Coordinator',
 'Remote Sales Staff',
 'Freelance Drupal Developer',
 'Freelance ECommerce Designer',
 'Freelance Digital Designer',
 'Growth Product Manager',
 'SMB Sales Lead',
 'Payroll Service Delivery Manager',
 'Account Technology Strategist (Cross)',
 'Software Engineer II',
 'Engineer- Access  Network Transmission & Support - (22000437)',
 'Freelance ECommerce Designer',
 'SMB Account Specialist - EMEA',
 'Freelance ECommerce Designer',
 'Freelance Web Designer',
 'Global Account Manager (German Speaker)']
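
The problem statement asks for links as well as titles, and the MyJobMag and KenyaJob headings shown above each wrap an `<a href="...">` element. One possible way to collect both is sketched below on an inline sample copied from the MyJobMag output. The helper name `extract_titles_and_links` and the heading without a link are assumptions for illustration. The relative `href` values are resolved against the site's base URL with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Two headings copied from the MyJobMag output above, plus one
# hypothetical heading without a link to show that it is skipped.
sample_html = """
<h2><a href="/job/it-intern-african-guarantee-fund-agf">IT Intern at African Guarantee Fund (AGF)</a></h2>
<h2><a href="/job/it-risk-specialist-equity-bank-kenya">IT Risk Specialist at Equity Bank Kenya</a></h2>
<h2>Heading without a link</h2>
"""


def extract_titles_and_links(soup, heading_tag, base_url):
    """Return a list of {'title', 'link'} dicts, one per linked heading."""
    rows = []
    for heading in soup.find_all(heading_tag):
        anchor = heading.find("a", href=True)
        if anchor is None:
            continue  # skip headings that do not point to a job page
        rows.append({
            "title": anchor.get_text(strip=True),
            "link": urljoin(base_url, anchor["href"]),  # make the link absolute
        })
    return rows


sample_soup = BeautifulSoup(sample_html, "html.parser")
jobs = extract_titles_and_links(sample_soup, "h2", "https://www.myjobmag.co.ke")
for job in jobs:
    print(job["title"], "->", job["link"])
```

The same helper could in principle be called with `my_job_soup` and `'h2'`, or with `kenyan_soup`, `'h5'` and `'https://www.kenyajob.com'`, assuming the live pages keep the structure shown in the outputs above.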

## <font color='#2F4F4F'>Step 4: Saving our Data</font>

In [15]:
# Saving the scraped contents in a dataframe and previewing our data
# ---
#
# PigiaMe
pigia_me_df = pd.DataFrame({"title":pigia_me_results})
pigia_me_df.head()


|   | title |
|---|-------|
| 0 | Officer – Solution Developer |
| 1 | Data Analyst |
| 2 | Manager-System Developer |
| 3 | Manager-Database Management |
| 4 | Software Developer |


In [16]:
# MyJobMag
my_job_mag_df = pd.DataFrame({"title":my_job_mag_results})
my_job_mag_df.head()

|   | title |
|---|-------|
| 0 | Senior Technical Consultant at London Stock Ex... |
| 1 | IT Intern at African Guarantee Fund (AGF) |
| 2 | Head of Cyber Risk & Red Team at Equity Bank K... |
| 3 | Cyber Risk & Red Team Specialist at Equity Ban... |
| 4 | IT Risk Specialist at Equity Bank Kenya |


In [17]:
# KenyaJob
kenya_job_df = pd.DataFrame({"title":kenya_job_results})
kenya_job_df.head()

|   | title |
|---|-------|
| 0 | Sales Agent |
| 1 | Senior Service Engineering Manager |
| 2 | Account Technology Strategist (Cross) |
| 3 | Payroll Service Delivery Manager |
| 4 | Software Engineering Lead |
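
The problem statement calls for a single table, while the cells above build one dataframe per site. A minimal sketch of combining them is given below, using a few titles copied from the outputs above so that it runs without network access. The `source` column and the variable names are assumptions added for illustration. Repeated listings, such as the duplicated KenyaJob titles seen earlier, are dropped per source.

```python
import pandas as pd

# A few titles copied from the outputs above (not the full scraped lists)
pigia_titles = ["Data Analyst", "Software Developer"]
my_job_mag_titles = ["IT Intern at African Guarantee Fund (AGF)"]
kenya_job_titles = [
    "Payroll Service Delivery Manager",
    "Software Engineering Lead",
    "Payroll Service Delivery Manager",  # duplicate, as in the scraped output
]

# One frame per site, tagged with a hypothetical 'source' column
frames = [
    pd.DataFrame({"title": pigia_titles, "source": "PigiaMe"}),
    pd.DataFrame({"title": my_job_mag_titles, "source": "MyJobMag"}),
    pd.DataFrame({"title": kenya_job_titles, "source": "KenyaJob"}),
]

# Stack the frames, then drop listings repeated within the same source
all_jobs = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates(subset=["title", "source"])
    .reset_index(drop=True)
)
print(all_jobs)
```

In the notebook, the sample lists would be replaced by `pigia_me_results`, `my_job_mag_results` and `kenya_job_results`, and the combined table could then be written to disk with `DataFrame.to_csv`.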
