# **Data Collection**
***Data Source - BlueMotive Cars***

In [15]:
import os
import re
import urllib.error
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

## **`fetch_cars_details(url)`**

The function **fetches the HTML content** of a given webpage using `urllib.request`. Every later step parses the string it returns, so it is the entry point of the scraper.

---

**Functionality:**

- **Sends an HTTP request** to fetch webpage content.
- **Reads the response body and decodes it** into a string (UTF-8 by default).
- **Returns the HTML** for further data extraction.
- **Handles request failures** by catching `URLError` (unreachable hosts and HTTP error responses), printing a message, and returning `None`.

---


In [16]:
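def fetch_cars_details(url):
    """Return the decoded HTML of `url`, or None if the request fails."""
    try:
        # The context manager closes the connection once the body is read.
        with urllib.request.urlopen(url) as response:
            return response.read().decode()
    except urllib.error.URLError as e:
        # URLError also covers HTTPError (e.g. 404 or 500 responses).
        print(f'Fetching Failed from {url}. Error: {e}')
        return None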
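
The error path can be exercised without touching the network. The sketch below is an illustration under assumptions: `fetch_html` is a hypothetical variant of `fetch_cars_details` that adds a `timeout` argument (not part of the original), and the `file://` URLs point at a temporary file rather than the real site.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Hypothetical variant of fetch_cars_details with a timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        print(f"Fetching failed from {url}. Error: {e}")
        return None


# HTTPError subclasses URLError, so 404/500 responses reach the same handler.
catches_http_errors = issubclass(urllib.error.HTTPError, urllib.error.URLError)

# Offline check: file:// URLs stand in for the real site (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    page = pathlib.Path(tmp) / "page.html"
    page.write_text("<h2>Page 1 of 3</h2>", encoding="utf-8")
    found = fetch_html(page.as_uri())
    missing = fetch_html((pathlib.Path(tmp) / "missing.html").as_uri())
```

Here `found` holds the file's contents and `missing` is `None`, because `urlopen` raises `URLError` for the unreadable path.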

## **`extract_total_pages(html_content)`**

The function extracts the **total number of pages** from the given HTML content. The scraper uses this count to decide how many page URLs to request.

**Functionality:**
- **Parses the HTML** using `BeautifulSoup`.
- **Finds the first `<h2>` tag**, which is expected to contain text of the form `Page X of Y`.
- **Uses a regex** to capture `Y`, the last page number.
- **Returns the total pages** as an integer.
- **Defaults to 1** if no page information is found.


In [17]:
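def extract_total_pages(html_content):
    """Return the total page count from a 'Page X of Y' heading, or 1."""
    soup = BeautifulSoup(html_content, "html.parser")
    # The first <h2> is expected to hold the pagination text.
    page_info = soup.find("h2")

    if page_info:
        match = re.search(r'Page \d+ of (\d+)', page_info.text)
        if match:
            return int(match.group(1))
    # No heading or no match: treat the listing as a single page.
    return 1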
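
The regex step can be looked at in isolation. This is a minimal sketch: `parse_total_pages` is a hypothetical helper that works on the heading text directly, and the sample strings are assumed rather than copied from the site.

```python
import re

# Same pattern as extract_total_pages, compiled once for reuse.
PAGE_PATTERN = re.compile(r"Page \d+ of (\d+)")


def parse_total_pages(heading_text):
    """Hypothetical helper: return the captured page count, or 1 if absent."""
    match = PAGE_PATTERN.search(heading_text)
    return int(match.group(1)) if match else 1


# Assumed sample heading texts; the real site's wording may differ.
with_count = parse_total_pages("Page 3 of 20")
without_count = parse_total_pages("Audi cars for sale")
```

The first call yields `20` from the capture group, and the second falls back to `1` because the pattern does not match.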

## **`extract_car_details(html_content)`**

The function extracts **car details** from the given HTML content, parsing structured data such as **brand, model, price, year, and specifications**.

**Functionality:**
- **Checks if HTML content exists**, otherwise returns an empty list.
- **Parses the HTML** using `BeautifulSoup` to extract structured information.
- **Extracts the brand name** from the `<title>` tag, taking the text after the last hyphen.
- **Finds all car listings (`<li>` elements)** and processes them.
- **Extracts model names** from `<span class="make-model">`.
- **Retrieves car attributes** (e.g., price, year, mileage) from `<table class="car">`, reading each two-cell row as a key/value pair.
- **Stores all extracted details** in a list of dictionaries.
- **Returns the final list of car details** for further processing.


In [18]:
def extract_car_details(html_content):
    if not html_content:
        print("No HTML content to process.")
        return []

    soup = BeautifulSoup(html_content, "html.parser")

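    # Guard against pages without a <title> tag instead of raising AttributeError.
    title_tag = soup.find("title")
    title = title_tag.text.strip() if title_tag else ""
    # The brand is the text after the last hyphen; fall back to "Unknown".
    brand_name = title.split("-")[-1].strip() or "Unknown"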

    car_details = []

    for li in soup.find_all("li"):
        model_tag = li.find("span", class_="make-model")
        car_data = {}

        if model_tag:
            car_data["Model"] = model_tag.text.strip()
            car_data["Brand"] = brand_name

            table = li.find("table", class_="car")
            if table:
                for row in table.find_all("tr"):
                    cells = row.find_all("td")
                    if len(cells) == 2:
                        key = cells[0].text.replace(":", "").strip()
                        value = cells[1].text.strip()
                        car_data[key] = value

            car_details.append(car_data)
    return car_details
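
One detail worth checking is how the brand is derived from the title. The sketch below assumes hypothetical titles of the form `Site Name - Brand` (the real title format is not shown in this notebook) and compares splitting on every hyphen with splitting once on the last ` - ` separator. Both helper names are illustrative.

```python
def brand_from_last_hyphen(title):
    """Mirror of the approach above: text after the last hyphen."""
    return title.split("-")[-1].strip()


def brand_from_last_separator(title, separator=" - "):
    """Hypothetical alternative: split once, from the right, on ' - '."""
    return title.rsplit(separator, 1)[-1].strip()


# Assumed example titles, not copied from the real pages.
simple_title = "BlueMotive Cars - Audi"
hyphenated_title = "BlueMotive Cars - Mercedes-Benz"

simple_brand = brand_from_last_hyphen(simple_title)
truncated_brand = brand_from_last_hyphen(hyphenated_title)
full_brand = brand_from_last_separator(hyphenated_title)
```

If the real titles follow this assumed format, the first approach would label Mercedes-Benz listings as `Benz`, so it may be worth printing one title to confirm before relying on the `Brand` column.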

## **`fetch_all_pages(base_url)`**

The function **fetches and extracts car details** from every page of a brand's listing: it reads the total page count from the first page, then iterates over all pages.

**Functionality:**
- **Formats the first page URL** to initiate data extraction.
- **Fetches the first page's HTML content** to determine the total number of pages.
- **Extracts the total pages** using `extract_total_pages()`.
- **Loops through all detected pages**, dynamically formatting the URL.
- **Fetches HTML content for each page** using `fetch_cars_details()`.
- **Extracts car details** from each page using `extract_car_details()`.
- **Combines all extracted car data** into a single list.
- **Returns the final dataset** containing car listings from all pages.


In [19]:
def fetch_all_pages(base_url):
    first_page_url = base_url.format(1)
    first_page_html = fetch_cars_details(first_page_url)

    if not first_page_html:
        print("Failed to fetch the first page. Exiting.")
        return []

    total_pages = extract_total_pages(first_page_html)
    print(f"Total pages detected: {total_pages}")

    all_cars = []
    for page_num in range(1, total_pages + 1):
        url = base_url.format(page_num)
        print(f"Fetching: {url}")

        html_data = fetch_cars_details(url)
        if html_data:
            all_cars.extend(extract_car_details(html_data))
    print(all_cars)
    return all_cars
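
The page URLs come from filling a `{:02d}` placeholder, which zero-pads the page number to two digits. A short sketch of that step, using the Audi template from this notebook's `urls` list and an assumed page count of 12:

```python
# URL template taken from the notebook's `urls` list.
base_url = "http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page{:02d}.html"

# total_pages would come from extract_total_pages(); 12 is an assumed value.
total_pages = 12
page_urls = [base_url.format(page_num) for page_num in range(1, total_pages + 1)]

first_url = page_urls[0]   # ends with "Audi-page01.html"
last_url = page_urls[-1]   # ends with "Audi-page12.html"
```

Because `range` excludes its upper bound, `total_pages + 1` is needed to include the final page.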

## **`save_to_csv(car_data, brand_name, output_folder)`**

The function **saves the extracted car data** to a CSV file named after the brand, so each brand can be loaded separately for later analysis.

**Functionality:**
- **Creates a pandas DataFrame** from the extracted car details.
- **Formats the CSV filename** using the brand name (e.g., `Audi.csv`).
- **Constructs the file path** using the specified output folder.
- **Saves the DataFrame as a CSV file** without the index column.
- **Prints a confirmation message** upon successful saving.
- **Returns the file path** of the saved CSV file, or `None` if there was no data to save.


In [20]:
def save_to_csv(car_data, brand_name, output_folder="/Users/jiveshdhakate/Documents/UCD Sem 2/Data Science in Python/car_price_predictor/data"):
    if not car_data:
        print("No data to save.")
        return None

    df = pd.DataFrame(car_data)
    csv_filename = f"{brand_name}.csv"
    csv_path = os.path.join(output_folder, csv_filename)
    df.to_csv(csv_path, index=False)

    print(f"{brand_name} Car Data saved to {csv_path}")
    return csv_path
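
`save_to_csv` as written assumes that `output_folder` already exists. A minimal sketch of a path-building step that creates the folder first is shown below; `build_csv_path` is a hypothetical helper, and a temporary directory stands in for the real data folder.

```python
import os
import tempfile


def build_csv_path(brand_name, output_folder):
    """Hypothetical helper: ensure the folder exists, then return the CSV path."""
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(output_folder, exist_ok=True)
    return os.path.join(output_folder, f"{brand_name}.csv")


# A temporary directory is used here instead of the notebook's data folder.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data")
    csv_path = build_csv_path("Audi", target)
    folder_created = os.path.isdir(target)
    file_name = os.path.basename(csv_path)
```

The returned path could then be passed to `df.to_csv(csv_path, index=False)` in place of the inline `os.path.join` call.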

In [21]:
urls = [
    "http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page{:02d}.html",
    "http://mlg.ucd.ie/modules/python/assignment1/cars/BMW-page{:02d}.html",
    "http://mlg.ucd.ie/modules/python/assignment1/cars/Mercedes-Benz-page{:02d}.html",
    "http://mlg.ucd.ie/modules/python/assignment1/cars/Volkswagen-page{:02d}.html"
]

for url in urls:
    car_data = fetch_all_pages(url)

    if car_data:
        brand_name = car_data[0].get("Brand", "Unknown")
        save_to_csv(car_data, brand_name)


Total pages detected: 20
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page01.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page02.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page03.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page04.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page05.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page06.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page07.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page08.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page09.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page10.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page11.html
Fetching: http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page12.html
Fetching: http://mlg.ucd.ie/modules/python/assignme