# House Pricing Web Scraper

This notebook automates the collection of house pricing data from real estate listing websites. The goal is to build a dataset for further analysis and for training machine learning models that predict house prices from listing features such as area, number of bedrooms and bathrooms, and location.

## Table of Contents
1. [Introduction](#introduction)
2. [Setup](#setup)
3. [Web Scraping](#web-scraping)
4. [Save Data](#save-data)
5. [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>
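The notebook works in two stages. First, it walks through the paginated listing pages and collects the link to each advertisement. Second, it visits each advertisement page and extracts the price, area, number of bedrooms, number of bathrooms, and location. Both the links and the extracted details are saved as CSV files.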

## Setup <a name="setup"></a>
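In this section, we import the required libraries and load the site configuration (`BASE_DOMAIN`, `BASE_PATH`, and the `User-Agent` header value) from environment variables, which `load_dotenv()` reads from a `.env` file.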

```python
import csv
import requests
from bs4 import BeautifulSoup
import time
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

BASE_DOMAIN = os.getenv("BASE_DOMAIN")
BASE_PATH = os.getenv("BASE_PATH")
HEADERS_USER_AGENT = os.getenv("HEADERS_USER_AGENT")
headers = {"User-Agent": HEADERS_USER_AGENT}
```
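`os.getenv` returns `None` for a variable that is not set, in which case the page URL would be built as `NoneNone1` and the request would fail only later. The following is a minimal sketch of a fail-fast check. The helper `load_config` and the placeholder values are hypothetical and are not part of the original notebook; only the three variable names come from the setup code.

```python
import os

# Variable names taken from the setup code; the helper itself is hypothetical.
REQUIRED_VARS = ("BASE_DOMAIN", "BASE_PATH", "HEADERS_USER_AGENT")


def load_config(env=os.environ):
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Example with placeholder values (not a real site).
example_env = {
    "BASE_DOMAIN": "https://example.com",
    "BASE_PATH": "/houses/page-",
    "HEADERS_USER_AGENT": "Mozilla/5.0",
}
config = load_config(example_env)
```

Passing the environment as an argument keeps the check testable without touching the real `.env` file.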

## Web Scraping <a name="web-scraping"></a>
In this section, we collect the advertisement links from the paginated listing pages. Scraping continues until no more pages are available or the page limit (40 pages) is reached.

```python
MAX_PAGES = 40  # upper bound on the number of listing pages to visit


# Function to get advertisement links from a page
def get_ad_links(soup):
    links = []
    listings = soup.find_all("div", {"class": "_637fa00f"})
    if not listings:
        print("No listings found on the page.")
    for listing in listings:
        link_tag = listing.find("a", href=True)
        if link_tag is None:
            continue  # listing without a link
        links.append(BASE_DOMAIN + link_tag["href"])
    return links


# Scrape links from multiple pages
ad_links = []
for page_number in range(1, MAX_PAGES + 1):
    print(f"Scraping page {page_number}")
    url = f"{BASE_DOMAIN}{BASE_PATH}{page_number}"
    r = requests.get(url, headers=headers, timeout=30)
    soup = BeautifulSoup(r.text, "html.parser")
    links_on_page = get_ad_links(soup)
    if not links_on_page:
        break  # no more pages available
    ad_links.extend(links_on_page)
    print(f"Found {len(links_on_page)} links on page {page_number}")
    time.sleep(1)  # pause between requests to avoid overloading the server

if not ad_links:
    print("No links were found.")
```

```text
Scraping page 1
Found 50 links on page 1
Scraping page 2
Found 50 links on page 2
Scraping page 3
Found 50 links on page 3
Scraping page 4
Found 50 links on page 4
Scraping page 5
Found 50 links on page 5
Scraping page 6
Found 50 links on page 6
Scraping page 7
Found 50 links on page 7
Scraping page 8
Found 50 links on page 8
Scraping page 9
Found 50 links on page 9
Scraping page 10
Found 50 links on page 10
Scraping page 11
Found 50 links on page 11
Scraping page 12
Found 50 links on page 12
Scraping page 13
Found 50 links on page 13
Scraping page 14
Found 50 links on page 14
Scraping page 15
Found 50 links on page 15
Scraping page 16
Found 50 links on page 16
Scraping page 17
Found 50 links on page 17
Scraping page 18
Found 50 links on page 18
Scraping page 19
Found 50 links on page 19
Scraping page 20
Found 50 links on page 20
Scraping page 21
Found 50 links on page 21
Scraping page 22
Found 50 links on page 22
Scraping page 23
Found 50 links on page 23
Scraping page 24
Found 50 lin
```
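The stopping rule described for this section (stop when a page has no listings, or when the page limit is reached) can be separated from the network code so that it can be checked offline. The sketch below is an illustration only: `collect_links`, the `fetch_page` callable, and the `fake_site` data are hypothetical names, and the order-preserving de-duplication is an added assumption that the same advertisement may appear on more than one page.

```python
from urllib.parse import urljoin


def collect_links(fetch_page, base_url, max_pages=40):
    """Collect absolute links page by page.

    fetch_page(page_number) is a hypothetical callable that returns the
    list of hrefs found on that page (an empty list when there are none).
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        hrefs = fetch_page(page_number)
        if not hrefs:
            break  # no more pages available
        # urljoin accepts both relative ("/ad/1") and absolute hrefs
        collected.extend(urljoin(base_url, href) for href in hrefs)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(collected))


# Stand-in for the network: three pages of results, then nothing.
fake_site = {1: ["/ad/1", "/ad/2"], 2: ["/ad/2", "/ad/3"], 3: ["/ad/4"]}
links = collect_links(lambda n: fake_site.get(n, []), "https://example.com")
print(links)
```

In the notebook itself, `fetch_page` would wrap the `requests.get` call and the link extraction for a single page.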

## Save Data <a name="save-data"></a>
In this section, we save the collected links to a CSV file, then visit each advertisement page, extract its details, and save them to a second CSV file for further analysis.

```python
# Save links to CSV
save_path = "../data/ad_links.csv"
with open(save_path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Ad URL"])
    for link in ad_links:
        writer.writerow([link])


# Function to scrape data from an individual ad page
def scrape_ad_page(ad_url):
    r = requests.get(ad_url, headers=headers, timeout=30)
    ad_soup = BeautifulSoup(r.text, "html.parser")
    try:
        # find() returns None for a missing field, so .text raises AttributeError
        price = ad_soup.find("span", {"aria-label": "Price"}).text.strip()
        area = ad_soup.find("span", {"aria-label": "Area"}).text.strip()
        bedrooms = ad_soup.find("span", {"aria-label": "Bedrooms"}).text.strip()
        bathrooms = ad_soup.find("span", {"aria-label": "Bathrooms"}).text.strip()
        location = ad_soup.find("span", {"aria-label": "Location"}).text.strip()
        return [ad_url, price, area, bedrooms, bathrooms, location]
    except AttributeError:
        print(f"Missing data on page: {ad_url}")
        return None


# Load links from the CSV file and scrape data from each ad
scraped_data = []
links_file_path = "../data/ad_links.csv"
with open(links_file_path, "r", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    ad_links = [row[0] for row in reader]

for ad_link in ad_links:
    ad_data = scrape_ad_page(ad_link)
    if ad_data:
        scraped_data.append(ad_data)
    time.sleep(1)  # To avoid overloading the server

# Save scraped data to CSV
save_path = "../data/ad_details.csv"
# Named csv_columns so it does not overwrite the request `headers` dict
csv_columns = ["Ad URL", "Price", "Area (m^2)", "Bedrooms", "Bathrooms", "Location"]
with open(save_path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(csv_columns)
    writer.writerows(scraped_data)
```
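The scraped fields are written to the CSV as text, exactly as they appear on the page, so they usually need to be converted to numbers before analysis. The sketch below shows one possible conversion. It assumes the values are ASCII digits with optional comma thousands separators and surrounding text; the actual format on the target site is not shown in this notebook, and `to_number` is a hypothetical helper.

```python
import re

# Assumed format: a run of ASCII digits with optional comma separators
# and an optional decimal part, surrounded by arbitrary text.
_NUMBER = re.compile(r"[0-9][0-9,]*(?:\.[0-9]+)?")


def to_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


# Made-up sample values, not taken from the scraped data.
samples = ["1,250,000", "120 m²", "3", "N/A"]
cleaned = [to_number(s) for s in samples]
```

Returning `None` for unparseable values keeps the row in the dataset and leaves the decision about missing values to the analysis step.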

## Conclusion <a name="conclusion"></a>
This notebook collects house listing data in two steps: the advertisement links are saved to `../data/ad_links.csv`, and the details extracted from each advertisement (price, area, bedrooms, bathrooms, and location) are saved to `../data/ad_details.csv`. The resulting dataset can be used for further analysis and for training machine learning models to predict house prices.
