# **Web Scraping Using BeautifulSoup**

## Objectives

I am developing a comprehensive web scraper to collect structured data efficiently.  

**My scraper will:**  

- Navigate multiple pages dynamically, detecting and following pagination links.  
- Extract specific data fields, including titles, dates, and descriptions.  
- Store data in a structured format, such as CSV or JSON.  
- Implement exception handling and logging to manage errors and track the scraping process.  
- Optimize performance using threading or asynchronous programming for faster data extraction.  

This project will ensure scalability, reliability, and efficiency in web scraping.  


# **1. Introduction**
### **1.1 Project Overview**
This project scrapes job postings from **Career Point Kenya**, a well-known Kenyan job listing website. For each posting, the scraper dynamically extracts:

- Job title
- Company name
- Location
- Salary information (if available)

To facilitate analysis, the retrieved data is saved in CSV and JSON formats.

### **1.2 Why Scrape Career Point Kenya?**

- Career Point Kenya is a leading job listing website in Kenya.

- It provides a large number of job listings across different categories.

- Automating data collection allows for trend analysis, job market insights, and personalized job recommendations.

# **2. Setting Up the Environment**

Before we start writing our scraper, we need to install and import necessary Python libraries.

### **2.1 Installing Required Libraries**
The scraper uses the following Python libraries:

- requests: sends HTTP requests to retrieve web pages.
- beautifulsoup4: parses HTML and extracts data from it.
- pandas: stores and organizes the scraped data.
- tqdm: shows progress bars for long-running loops.

In [10]:
!pip install requests
!pip install beautifulsoup4
!pip install tqdm
!pip install pandas 



### **2.2 Importing Necessary Modules**
Now, let's import the installed modules:

In [12]:
import requests 
from bs4 import BeautifulSoup  
import pandas as pd  
import time  # To add delays between requests
from tqdm import tqdm 
import logging  # To handle errors and logs


# **3. Fetching and Parsing the Web Page**
Web scraping begins with retrieving a web page and parsing its content. Parsing converts the raw, unstructured HTML into a structured form that is easier to search and read.

### **3.1 Understanding HTTP Requests**
To retrieve a web page, we send an HTTP GET request to the website's URL.
A GET request carries any parameters in the URL itself as an encoded query string (for example, `?pg=2`) rather than in a request body.

In [None]:
url = "https://www.careerpointkenya.co.ke/category/administration-jobs-in-kenya/"
request= requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

if request.status_code == 200:
    print("Successfully fetched webpage!")
else:
    print(f"Failed to fetch webpage. Status Code: {request.status_code}")


- A status code of 200 means the request succeeded.
- A failed request returns an error code instead, such as 403 (forbidden), 404 (not found), or 500 (internal server error).
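
As a small illustration of the status-code check above, the standard library's `http.HTTPStatus` can turn a numeric code into a readable label. This is only a sketch: the helper name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not a registered HTTP status
        phrase = "Unknown status"

    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

In the scraper itself, `response.raise_for_status()` from `requests` is a shorter way to turn 4xx and 5xx responses into exceptions.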

### **3.2 Extracting HTML Content**
Once the page is fetched, we need to parse its HTML structure using BeautifulSoup.

In [17]:
soup = BeautifulSoup(request.text, "html.parser")
print(soup.prettify()[:1000])  # Print first 1000 characters for preview


<!DOCTYPE html>
<html class="no-js" itemscope="" itemtype="https://schema.org/Blog" lang="en-US" prefix="og: https://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/>
  <link href="https://www.careerpointkenya.co.ke/wp-content/cache/breeze-minification/css/breeze_796d4907fe892e5727116205da472876c5d313e78c8a0b93f30f8305d68f03acb07817aafdab1327dc3464da4438eadcf8e4cc9107b1e549c9fad2c096fc3055.css" media="all" rel="stylesheet" type="text/css"/>
  <title>
   Administration Jobs In Kenya | Career Point Kenya
  </title>
  <meta content="Have you been looking for administration jobs in Kenya? Find a variety of the current administration jobs ranging from Administrative Assistant jobs to HR Manager jobs and so on. We got you covered!" name="description">
   <meta content="follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large" name="robots">
    <link href="https://www.careerpointkenya.co

- `prettify()` formats the HTML with indentation so we can inspect it easily.
- We print only the first 1000 characters to avoid cluttering the notebook output.
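
Once the HTML is parsed, individual elements can be looked up by tag and class. The sketch below runs on a small hand-written HTML fragment rather than the live site. The fragment, the `example.com` links, and the variable name `soup_demo` are assumptions for illustration. Only the class name `kb-section-link-overlay` is taken from the scraper later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modeled on a listing page
sample_html = """
<div class="listing">
  <h2>Office Administrator Job Example Ltd</h2>
  <a class="kb-section-link-overlay" href="https://example.com/jobs/1/"></a>
</div>
<div class="listing">
  <h2>HR Assistant Job Sample Co</h2>
  <a class="kb-section-link-overlay" href="https://example.com/jobs/2/"></a>
</div>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match (or None); find_all() returns every match
first_heading = soup_demo.find("h2").get_text(strip=True)
links = [a["href"] for a in soup_demo.find_all("a", class_="kb-section-link-overlay")]

print(first_heading)
print(links)
```

Because `find()` returns `None` when nothing matches, real scraping code should check the result before reading `.text` or an attribute.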

# **4. Navigating Through Multiple Pages**
### **4.1 Handling Pagination**
Most job websites use pagination, which spreads listings across several pages.

We determine the number of pages from the result count shown on the first page, then iterate through them.

In [20]:
import math  # Import math to round up page numbers

# Extract job count text
job_count_element = soup.find("div", class_="wp-block-kadence-query-result-count")

if job_count_element:
    job_count_text = job_count_element.text.strip()
    print(f"Job Count Found: {job_count_text}")

    # Print the split text to see how it was divided
    split_text = job_count_text.split("of")
    print(f"Split Text: {split_text}")

    try:
        # Extract total jobs
        total_jobs = int(split_text[-1].strip().split(" ")[0].replace(",", ""))
        print(f"Total Jobs: {total_jobs}")

        # Define jobs per page
        jobs_per_page = 15  

        # Calculate total pages to scrape
        total_pages = math.ceil(total_jobs / jobs_per_page)
        print(f" Total Pages to Scrape: {total_pages}")

    except ValueError as e:
        print(f" Error extracting total jobs: {e}")
else:
    print("Job count element not found. Check the class name again.")


Job Count Found: 1-15 of 1,530 results
Split Text: ['1-15 ', ' 1,530 results']
Total Jobs: 1530
 Total Pages to Scrape: 102
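
Splitting on the bare string `"of"` works for the text shown above, but it would also split inside any other word containing those letters. A regular expression is one possible alternative. The sketch below assumes the count text always looks like `1-15 of 1,530 results` and that each page holds 15 jobs, as in the cell above. The function name `pages_needed` is hypothetical.

```python
import math
import re


def pages_needed(count_text: str, per_page: int = 15) -> int:
    """Parse text like '1-15 of 1,530 results' and return the page count."""
    match = re.search(r"of\s+(\d[\d,]*)\s+results", count_text)
    if match is None:
        raise ValueError(f"Unrecognized count text: {count_text!r}")
    total = int(match.group(1).replace(",", ""))
    # Round up so a partially filled last page is still counted
    return math.ceil(total / per_page)


print(pages_needed("1-15 of 1,530 results"))  # 102
```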


### **4.2 Techniques for Multi-Page Scraping**

In [36]:
import requests
from bs4 import BeautifulSoup
import time
import re

# Define job listing page URL
base_url = "https://www.careerpointkenya.co.ke/category/administration-jobs-in-kenya/?pg="

# Headers to mimic a browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}

# List of possible salary keywords
salary_keywords = ["salary", "remuneration", "competitive pay", "negotiable"]
location_keywords = ["Nairobi", "Mombasa", "Kisumu", "Kenya"]  # Add more cities

num_pages = 2  # Number of pages to scrape

# Loop through job listing pages
for page in range(1, num_pages + 1):
    response = requests.get(base_url + str(page), headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract job links
    jobs = soup.find_all("a", class_="kb-section-link-overlay")

    for job in jobs:
        job_link = job["href"]

        # Visit each job detail page
        job_response = requests.get(job_link, headers=headers)
        job_soup = BeautifulSoup(job_response.text, "html.parser")

        # Extract job title
        job_title_elem = job_soup.find("h1", class_="wp-block-kadence-advancedheading")
        job_title = job_title_elem.text.strip() if job_title_elem else "Title not found"

        # Titles follow the pattern "<role> Job <company>", so split on " Job "
        if " Job " in job_title:  
            job_title_part, company_name = job_title.split(" Job ", 1)
        else:
            job_title_part = job_title
            company_name = "Not found"
        
        print(f"Job Title: {job_title_part}")
        print(f"Company Name: {company_name}")

        # Extract job description text
        job_desc = job_soup.get_text().lower()

        # Check for location in job description
        location_found = "Location not specified"
        for loc in location_keywords:
            if loc.lower() in job_desc:
                location_found = loc
                break

        # Check for salary information
        salary_found = "Salary not specified"
        for keyword in salary_keywords:
            match = re.search(rf"\b{keyword}\b.*?:?\s*([\d,]+|\b(negotiable|competitive pay)\b)", job_desc)
            if match:
                salary_found = match.group(1) if match.group(1) else "Negotiable"
                break

        print(f"Job Title: {job_title}")
        print(f"Location: {location_found}")
        print(f"Salary: {salary_found}")
        print(f"Job Link: {job_link}\n")

        time.sleep(1)  # Pause to avoid overloading the server


Job Title: Helpdesk/Sales Support Executive
Company Name: Travelport
Job Title: Helpdesk/Sales Support Executive Job Travelport
Location: Kenya
Salary: ,
Job Link: https://www.careerpointkenya.co.ke/2025/03/22/helpdesk-sales-support-executive-job-travelport/

Job Title: Senior Administrator – Governance and HR
Company Name: KENET
Job Title: Senior Administrator – Governance and HR Job KENET
Location: Kenya
Salary: Salary not specified
Job Link: https://www.careerpointkenya.co.ke/2025/03/22/senior-administrator-governance-and-hr-job-kenet/

Job Title: HR Officer II
Company Name: The Kisumu National Polytechnic
Job Title: HR Officer II Job The Kisumu National Polytechnic
Location: Kisumu
Salary: ,
Job Link: https://www.careerpointkenya.co.ke/2025/03/21/hr-officer-ii-job-the-kisumu-national-polytechnic/

Job Title: Assistant Manager – Data Analytics & Operations
Company Name: CIC Insurance
Job Title: Assistant Manager – Data Analytics & Operations Job CIC Insurance
Location: Kenya
Salary:
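
The `Salary: ,` lines in the output above appear because the character class `[\d,]+` in the salary pattern can match a lone comma. One way to tighten this is to require the amount to start with a digit and to look only a short distance past the keyword. The sketch below is an assumption-laden alternative, not the notebook's method: the 40-character window, the pattern names, and the function `extract_salary` are all chosen for illustration.

```python
import re

# Amount must start with a digit: either grouped thousands (50,000) or plain digits
SALARY_AMOUNT = re.compile(
    r"\b(?:salary|remuneration)\b[^\d\n]{0,40}?(\d{1,3}(?:,\d{3})+|\d+)",
    re.IGNORECASE,
)
SALARY_PHRASE = re.compile(r"\b(?:negotiable|competitive pay)\b", re.IGNORECASE)


def extract_salary(text: str) -> str:
    """Return a numeric amount, a known phrase, or a 'not specified' marker."""
    amount = SALARY_AMOUNT.search(text)
    if amount:
        return amount.group(1)
    phrase = SALARY_PHRASE.search(text)
    if phrase:
        return phrase.group(0).lower()
    return "Salary not specified"


print(extract_salary("Salary: KES 50,000 per month"))
print(extract_salary("Salary, benefits and leave are discussed at interview."))
print(extract_salary("Remuneration is negotiable."))
```

A pattern like this still cannot tell a salary from any other number that happens to follow the keyword, so results should be spot-checked against the source pages.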

# **5. Data Storage**
### **5.1 Saving to CSV**
The CSV (Comma-Separated Values) format is convenient for spreadsheet applications such as Excel and Google Sheets, and pandas reads and writes it directly.

In [43]:
import pandas as pd

scraped_jobs = [
    {"Job Title": "HR Associate, Organizational Development Job", "Company Name": "Aga Khan University Hospital", "Location": "Nairobi", "Salary": "Salary not specified", "Job Link": "https://www.careerpointkenya.co.ke/2025/03/20/hr-associate-organizational-development-job-aga-khan-university-hospital-3/"},
    {"Job Title": "Board Members Job", "Company Name": "Badili Africa", "Location": "Kenya", "Salary": "Salary not specified", "Job Link": "https://www.careerpointkenya.co.ke/2025/03/20/board-members-job-badili-africa/"},
    {"Job Title": "Admissions Associate Job", "Company Name": "Zetech University", "Location": "Kenya", "Salary": "Salary not specified", "Job Link": "https://www.careerpointkenya.co.ke/2025/03/19/admissions-associate-march-job-zetech-university/"}
]

# Convert to DataFrame
df = pd.DataFrame(scraped_jobs)

# Save to CSV
csv_filename = "scraped_jobs.csv"
df.to_csv(csv_filename, index=False)
print(f"Data successfully saved to {csv_filename}")

Data successfully saved to scraped_jobs.csv
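
For larger runs, rows can also be written as they are scraped instead of being collected into a DataFrame first. The sketch below uses the standard library's `csv.DictWriter` and writes to an in-memory buffer so it runs without touching the disk. The helper name `write_rows` is hypothetical. The column names are the ones used in the cell above.

```python
import csv
import io

FIELDS = ["Job Title", "Company Name", "Location", "Salary", "Job Link"]


def write_rows(rows, handle) -> int:
    """Write an iterable of job dicts to an open text handle as CSV."""
    # Missing keys are written as empty cells; unexpected keys are ignored
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


buffer = io.StringIO()
written = write_rows(
    [{"Job Title": "Board Members Job", "Company Name": "Badili Africa", "Location": "Kenya"}],
    buffer,
)
print(written)
print(buffer.getvalue())
```

To write to a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle in place of `buffer`.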


### **5.2 Saving to JSON**
JSON files are useful for web applications and APIs.

In [56]:
import json

# Save to JSON
json_filename = "scraped_jobs.json"
with open(json_filename, "w", encoding="utf-8") as json_file:
    json.dump(scraped_jobs, json_file, indent=4)
print(f"Data successfully saved to {json_filename}")


Data successfully saved to scraped_jobs.json
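
Some job titles on the site contain non-ASCII characters, such as the en dash in "Senior Administrator – Governance and HR". By default, `json.dump` and `json.dumps` escape these as `\uXXXX` sequences. Passing `ensure_ascii=False` keeps them readable in the file. The short sketch below compares the two. The `record` dictionary is sample data for illustration.

```python
import json

record = {"Job Title": "Senior Administrator – Governance and HR", "Company Name": "KENET"}

escaped = json.dumps(record)                        # default: non-ASCII escaped
readable = json.dumps(record, ensure_ascii=False)   # keeps the en dash as-is

print(escaped)
print(readable)

# Both forms decode back to the same data
print(json.loads(escaped) == json.loads(readable))
```

When writing with `ensure_ascii=False`, the file should be opened with `encoding="utf-8"`, as the cell above already does.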


# **6. Error Handling and Logging**
The following common failures are handled with **try-except blocks**:
- Connection errors (such as internet problems)
- HTTP errors (such as page not found, server problems)
- Parsing errors (such as missing fields)


**Logging** is useful for tracking:

- When each request is made
- The errors encountered during the scraping process
- The number of jobs scraped, at whatever level of detail the user needs

This is where Python's **logging** module comes in handy.
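
The three kinds of log entries listed above can be sketched with a named logger. This example writes to an in-memory stream so the output is visible immediately, while a real run would use a file handler. The logger name `scraper_demo`, the helper `record_page`, and the sample page numbers are assumptions for illustration.

```python
import io
import logging

log_stream = io.StringIO()

logger = logging.getLogger("scraper_demo")
logger.setLevel(logging.INFO)
logger.propagate = False      # keep demo messages out of the root logger
logger.handlers.clear()       # avoid duplicate handlers if the cell is re-run

handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)


def record_page(page, jobs_found, error=None):
    """Log one page result: an error if it failed, otherwise the job count."""
    if error is not None:
        logger.error("Page %s failed: %s", page, error)
    else:
        logger.info("Page %s fetched, %s jobs found", page, jobs_found)


record_page(1, 15)
record_page(2, 0, error="HTTP 404")
logger.info("Total jobs scraped: %s", 15)

print(log_stream.getvalue())
```

A named logger also sidesteps a quirk of `logging.basicConfig`: once the root logger has handlers, later `basicConfig` calls in the same session do nothing unless `force=True` is passed.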

In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import logging
import time

# Configure logging
logging.basicConfig(filename="scraper.log", level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Base URL
BASE_URL = "https://www.careerpointkenya.co.ke/page/"

# List to store job details
scraped_jobs = []

# Function to scrape job listings
def scrape_jobs():
    for page in range(1, 103):  # Scraping all 102 pages
        url = f"{BASE_URL}{page}/"
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise HTTP errors

            soup = BeautifulSoup(response.text, "html.parser")
            job_posts = soup.find_all("div", class_="job-listing")  # Adjust this selector as needed

            for job in job_posts:
                try:
                    title = job.find("h2").text.strip()
                    link = job.find("a")["href"] if job.find("a") else "No link"

                    scraped_jobs.append({
                        "Job Title": title,
                        "Job Link": link
                    })
                except Exception as e:
                    logging.warning(f"Skipping job due to parsing error: {e}")

            time.sleep(1)  # Prevents overloading the server

        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to scrape {url}: {e}")

    logging.info(f"Scraping completed. Total jobs scraped: {len(scraped_jobs)}")

# Run the scraper
scrape_jobs()

# Save data to CSV & JSON
df = pd.DataFrame(scraped_jobs)
df.to_csv("scraped_jobs.csv", index=False)
df.to_json("scraped_jobs.json", orient="records", indent=4)

print("Scraping completed. Check 'scraper.log' for details.")


Scraping completed. Check 'scraper.log' for details.


# **7. Performance Optimization**
We improve scraping efficiency using **multi-threading** and **asynchronous requests**.


**Multi-threading**

In [41]:
from concurrent.futures import ThreadPoolExecutor
import requests
import pandas as pd
import json
import logging
from bs4 import BeautifulSoup

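# Configure logging (force=True replaces handlers left by an earlier basicConfig call in the same session; Python 3.8+)
logging.basicConfig(filename="thread_scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s", force=True)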

BASE_URL = "https://www.careerpointkenya.co.ke/page/"
scraped_jobs = []

def fetch_page(page):
    """Fetch a page using requests."""
    try:
        response = requests.get(f"{BASE_URL}{page}/", timeout=10)
        return response.text if response.status_code == 200 else None
    except Exception as e:
        logging.error(f"Error fetching page {page}: {e}")
        return None

def parse_jobs(html):
    """Extract job details from page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = soup.find_all("div", class_="job-listing")  # Adjust selector if needed
    return [
        {
            "Job Title": job.find("h2").text.strip(),
            "Company Name": job.find("p", class_="company-name").text.strip() if job.find("p", class_="company-name") else "Not specified",
            "Location": job.find("p", class_="job-location").text.strip() if job.find("p", class_="job-location") else "Not specified",
            "Salary": job.find("p", class_="salary").text.strip() if job.find("p", class_="salary") else "Salary not specified",
            "Job Link": job.find("a")["href"] if job.find("a") else "No link available"
        }
        for job in jobs
    ]

def scrape_jobs_multithreaded(pages=10, max_threads=5):
    """Scrape jobs using multi-threading."""
    with ThreadPoolExecutor(max_threads) as executor:
        responses = list(executor.map(fetch_page, range(1, pages + 1)))

    # Parse HTML pages in parallel
    global scraped_jobs
    with ThreadPoolExecutor(max_threads) as executor:
        results = list(executor.map(parse_jobs, [html for html in responses if html]))

    # Flatten results
    scraped_jobs = [job for sublist in results for job in sublist]

    # Save data
    pd.DataFrame(scraped_jobs).to_csv("thread_scraped_jobs.csv", index=False)
    with open("thread_scraped_jobs.json", "w") as f:
        json.dump(scraped_jobs, f, indent=4)

# Run scraper
scrape_jobs_multithreaded(pages=10, max_threads=5)
print("Multi-threaded Scraping Complete! Data saved.")


Multi-threaded Scraping Complete! Data saved.
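
One property of `executor.map` worth knowing is that it returns results in the order of the inputs, even when later tasks finish first. The sketch below shows this with a stand-in function that sleeps instead of making a network call. `fake_fetch` and its delays are made up for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page: int) -> str:
    """Stand-in for a network call: higher page numbers finish sooner."""
    time.sleep(0.05 * (4 - page))
    return f"<html>page {page}</html>"


with ThreadPoolExecutor(max_workers=3) as executor:
    pages = list(executor.map(fake_fetch, range(1, 4)))

# Results come back in input order (1, 2, 3), not completion order (3, 2, 1)
print(pages)
```

Threads help here because the work is dominated by waiting on the network. Each extra thread also adds load on the target server, so a small `max_workers` value is the considerate choice.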


**Asynchronous requests**

In [45]:
import asyncio
import aiohttp
import pandas as pd
import json
import logging
from bs4 import BeautifulSoup
import nest_asyncio  # Fixes event loop issues in Jupyter

# Apply fix for Jupyter Notebook
nest_asyncio.apply()

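# Configure logging (force=True replaces handlers left by an earlier basicConfig call in the same session; Python 3.8+)
logging.basicConfig(filename="async_scraper.log", level=logging.ERROR, format="%(asctime)s - %(levelname)s - %(message)s", force=True)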

BASE_URL = "https://www.careerpointkenya.co.ke/page/"
scraped_jobs = []

async def fetch_page(session, page):
    """Asynchronously fetch a page."""
    try:
        async with session.get(f"{BASE_URL}{page}/", timeout=aiohttp.ClientTimeout(total=10)) as response:
            return await response.text() if response.status == 200 else None
    except Exception as e:
        logging.error(f"Error fetching page {page}: {e}")
        return None

def parse_jobs(html):
    """Parse job data from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = soup.find_all("div", class_="job-listing")  # Adjust selector if needed
    return [
        {
            "Job Title": job.find("h2").text.strip(),
            "Company Name": job.find("p", class_="company-name").text.strip() if job.find("p", class_="company-name") else "Not specified",
            "Location": job.find("p", class_="job-location").text.strip() if job.find("p", class_="job-location") else "Not specified",
            "Salary": job.find("p", class_="salary").text.strip() if job.find("p", class_="salary") else "Salary not specified",
            "Job Link": job.find("a")["href"] if job.find("a") else "No link available"
        }
        for job in jobs
    ]

async def scrape_jobs_async(pages=10):
    """Scrape jobs using async HTTP requests."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, page) for page in range(1, pages + 1)]
        responses = await asyncio.gather(*tasks)

    # Process responses
    global scraped_jobs
    scraped_jobs = [job for html in responses if html for job in parse_jobs(html)]

    # Save data
    pd.DataFrame(scraped_jobs).to_csv("async_scraped_jobs.csv", index=False)
    with open("async_scraped_jobs.json", "w") as f:
        json.dump(scraped_jobs, f, indent=4)

# Run scraper in Jupyter
loop = asyncio.get_event_loop()
loop.run_until_complete(scrape_jobs_async(pages=10))

print("Async Scraping Complete! Data saved.")


Async Scraping Complete! Data saved.
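
`asyncio.gather` starts every request at once, which can put a burst of load on the server. A semaphore is one common way to cap how many requests are in flight. The sketch below replaces the real HTTP call with `asyncio.sleep` so it runs offline. The cap of 3 and the names `fake_fetch_async` and `crawl` are assumptions for illustration.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed cap on simultaneous requests


async def fake_fetch_async(page, semaphore, stats):
    """Stand-in for session.get(): waits briefly while holding a semaphore slot."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return page


async def crawl(pages: int):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {"active": 0, "peak": 0}
    tasks = (fake_fetch_async(p, semaphore, stats) for p in range(1, pages + 1))
    results = await asyncio.gather(*tasks)  # results keep input order
    return results, stats["peak"]


results, peak = asyncio.run(crawl(10))
print(results)
print(f"Peak concurrent fetches: {peak}")
```

Inside a notebook without `nest_asyncio`, `asyncio.run` raises an error because an event loop is already running. In that case `await crawl(10)` in a cell does the same job.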


# **8. Conclusion**

This web scraper collects job listings from multiple pages, extracts structured data, and stores it in CSV and JSON formats. It includes logging, error handling, and multi-threaded and asynchronous variants for performance optimization. Although the script satisfies the stated objectives, there is room for improvement in compliance, scalability, and data extraction accuracy. The project demonstrates Python's web scraping capabilities and the value of thoughtful design in balancing speed against fault tolerance.



### **Future Improvements**



**1. Advanced Data Extraction:** 

- Apply natural language processing (NLP) to extract location and salary information from unstructured text more reliably.
- Add new fields, such as the job posting date or the qualifications required.
  
**2. Scalability:** 
- To manage more extensive scraping over several categories or websites, put in place a task queue (for example, with Celery or AIjobs).
- Set up a database (such as PostgreSQL or SQLite) for querying and persistent storage.
  
**3. Robustness:**
- To prevent detection and blocking, incorporate user-agent and proxy rotation.
- For rate-limited queries, use more complex retry logic with exponential backoff.
  
**4. Performance:**
- For webpages that need JavaScript rendering, use a headless browser (like Playwright).
- Instead of storing all data in memory, stream data to files to maximize memory use.
  
**5. Compliance:** 
- To guarantee ethical scraping, include a check for the website's **robots.txt** and terms of service.
- Provide a way to halt or suspend scraping in the event that the server returns a 429 (Too Many Requests) status.
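
Two of the items above, exponential backoff and the robots.txt check, can be sketched with the standard library alone. This is a rough outline under stated assumptions, not a finished implementation: the helper names `backoff_delays` and `is_allowed`, the user agent `my-scraper`, the `example.com` URLs, and the sample robots.txt rules are all hypothetical.

```python
import urllib.robotparser


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Delays in seconds for successive retries: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def is_allowed(robots_lines, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not the real file of any site
sample_robots = [
    "User-agent: *",
    "Disallow: /wp-admin/",
]

print(backoff_delays(5))
print(is_allowed(sample_robots, "my-scraper", "https://example.com/jobs/"))
print(is_allowed(sample_robots, "my-scraper", "https://example.com/wp-admin/"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real robots.txt. The delays from `backoff_delays` would be passed to `time.sleep` between retries after a 429 response.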

