# COMP41680 Assignment 1

#### Student Name: Aasim Shah
#### Student ID: 24203773
#### Data Source: http://mlg.ucd.ie/modules/python/assignment1/cars/index.html

## Introduction

This notebook covers **Task 1: Data Collection** of COMP41680 Assignment 1.
It scrapes car sale records from the website listed above and saves them to a CSV file for analysis in a separate notebook.

## Task 1: Data Collection

In [41]:
"""
Imports necessary libraries:
- urllib.request: For fetching web pages.
- bs4: For parsing HTML content.
- csv: For handling CSV file operations.
- pandas: For data manipulation and analysis.
"""
import urllib.request as req
import bs4
import csv
import pandas as pd

### Data Source and Endpoint
The data is collected from a website that lists car sale records for several makes and models. The base URL and the entry page are defined below.

In [43]:
"""
Define base URL and endpoint for data source:
- BASE_URL: The base URL of the website.
- ENDPOINT: The specific page to scrape.
"""
BASE_URL = "http://mlg.ucd.ie/modules/python/assignment1/cars"
ENDPOINT = "index.html"

### Retrieving the Main Page HTML

The first step in the data collection process is to retrieve the HTML content of the main page. This HTML will be parsed to extract links to pages for each car make.

In [45]:
"""
Fetch the main page HTML:
- Constructs the URL using BASE_URL and ENDPOINT.
- Sends a request to the URL and retrieves the HTML content.
- Decodes the HTML content into a string.

Attributes:
    url (str): The complete URL of the main page.
    response (http.client.HTTPResponse): The response object from the URL request.
    html (str): The HTML content of the main page.
"""
url = f'{BASE_URL}/{ENDPOINT}'
response = req.urlopen(url)
html = response.read().decode()
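The cell above leaves the response open and decodes with the default UTF-8 codec. A minimal sketch of a more defensive fetch follows. It assumes a hypothetical helper name `fetch_html` and a 10-second timeout, neither of which is part of the original notebook. The demonstration uses a `data:` URL so it runs without network access.

```python
import urllib.request as req


def fetch_html(url, timeout=10):
    """Hypothetical helper: fetch a URL and return its body as text."""
    # The context manager closes the connection even if decoding fails.
    with req.urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header; fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Offline demonstration: urllib resolves data: URLs locally.
sample = fetch_html("data:text/plain;charset=utf-8,Car%20Sale%20Records")
print(sample)
```

For the real site, the call would be `fetch_html(f'{BASE_URL}/{ENDPOINT}')`.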

In [46]:
print(html)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta name="robots" content="noindex">  
  <meta name="description" content="Content on this site is posted for teaching purposes only. Original data is from theguardian.com">
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Car Sale Records</title>
  <link rel="stylesheet"  target="_blank" href="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
  <script src="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
  <link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
    <div class="container mtb">
      <div class="logo">
        <a href="index.html"><img src="logo.png" class="logo" alt="Property Price Register"></a>
      </div>

        <div class="row instructions">
           

### Extracting Car Make Links

The HTML content of the main page is parsed to extract links to pages for each car make. These links will be used to navigate to individual car make pages and extract car details.

In [48]:
"""
Extract links to pages for each car make:
- Uses BeautifulSoup to parse the HTML content.
- Finds all 'h4' tags containing car make links.
- Extracts the links and stores them in a list.

Attributes:
    soup (bs4.BeautifulSoup): The BeautifulSoup object for parsing HTML.
    car_make_links (list): A list of URLs pointing to pages for each car make.
"""
soup = bs4.BeautifulSoup(html, "html.parser")
car_make_links = [
    f'{BASE_URL}/{tag["href"]}'
    for match in soup.find_all("h4")
    for tag in match.find_all("a")
]
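The list comprehension above builds absolute URLs by concatenating `BASE_URL` with each `href`. One possible alternative, sketched here under the assumption that every `href` is relative to the index page, is `urllib.parse.urljoin`, which also copes with hrefs that are already absolute or that start with `/`. The `hrefs` list is illustrative sample input, not scraped output.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (taken from the notebook).
PAGE_URL = "http://mlg.ucd.ie/modules/python/assignment1/cars/index.html"

# Illustrative href values, as they might appear in the <a> tags.
hrefs = ["Audi-page01.html", "BMW-page01.html"]

# urljoin resolves each href relative to the page it came from.
absolute_links = [urljoin(PAGE_URL, href) for href in hrefs]
print(absolute_links)
```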

In [49]:
car_make_links

['http://mlg.ucd.ie/modules/python/assignment1/cars/Audi-page01.html',
 'http://mlg.ucd.ie/modules/python/assignment1/cars/BMW-page01.html',
 'http://mlg.ucd.ie/modules/python/assignment1/cars/Mercedes-Benz-page01.html',
 'http://mlg.ucd.ie/modules/python/assignment1/cars/Volkswagen-page01.html']

### Extracting Total Page & Car Counts

Before extracting individual records, the first page for each make is fetched to read the number of pages and the total number of sale records for that make. These figures show how much data to expect per make.

In [51]:
"""
Extract total page & car counts for each make:

- Iterates through car make links.
- Fetches the HTML content of each make's page.
- Uses BeautifulSoup to parse the HTML.
- Finds the number of pages for each make and the total car count for the make.
- Prints the make and its total car count.

Attributes:
    link (str): The URL of a car make's page.
    response (http.client.HTTPResponse): The response object from the URL request.
    html (str): The HTML content of the car make's page.
    soup (bs4.BeautifulSoup): The BeautifulSoup object for parsing the HTML.
    row (bs4.element.Tag): A 'div' tag with class 'row' containing make information.
    heading (str): The heading containing the make name.
    total (str): The total number of cars for the make.
"""
for link in car_make_links:
    with req.urlopen(link) as response:
        html = response.read().decode()
    
    soup = bs4.BeautifulSoup(html, "html.parser")
    
    for row in soup.find_all('div', class_='row'):
        heading = row.find('h2').get_text(strip=True)
        total = row.find('div', class_='col-md-12 total').get_text(strip=True)
        print(f"{heading}\n{total}\n")

Car Sales: Audi — Page 1 of 20
Showing results 1 to 20 from total of 384 car sale records for this make

Car Sales: BMW — Page 1 of 20
Showing results 1 to 20 from total of 396 car sale records for this make

Car Sales: Mercedes-Benz — Page 1 of 30
Showing results 1 to 20 from total of 593 car sale records for this make

Car Sales: Volkswagen — Page 1 of 17
Showing results 1 to 20 from total of 327 car sale records for this make
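The printed headings contain the page count and record total as plain text. As a rough sketch, they could be parsed into integers with regular expressions. The patterns and the helper name `parse_counts` are assumptions based only on the text shown in the output above. The page size of 20 comes from "Showing results 1 to 20".

```python
import math
import re

# Patterns assumed from the printed output; the separator before "Page" is matched loosely.
HEADING_RE = re.compile(r"Car Sales:\s*(?P<make>.+?)\s+\S\s+Page\s+\d+\s+of\s+(?P<pages>\d+)")
TOTAL_RE = re.compile(r"total of\s+(?P<total>\d+)")


def parse_counts(heading, total_text):
    """Hypothetical helper: return (make, page_count, record_total)."""
    h = HEADING_RE.search(heading)
    t = TOTAL_RE.search(total_text)
    if h is None or t is None:
        raise ValueError("Text does not match the expected layout")
    return h.group("make"), int(h.group("pages")), int(t.group("total"))


heading = "Car Sales: Mercedes-Benz \u2014 Page 1 of 30"
total_text = "Showing results 1 to 20 from total of 593 car sale records for this make"
make, pages, total = parse_counts(heading, total_text)

# Cross-check: with 20 records per page, the page count should equal ceil(total / 20).
print(make, pages, total, math.ceil(total / 20) == pages)
```

The same cross-check holds for the other three makes: 384, 396, and 327 records at 20 per page give 20, 20, and 17 pages.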



### Extracting Detailed Car Data

Now, the detailed information for each car is extracted from the individual car make pages. This includes attributes like make, model, price, mileage, and other specifications.

In [53]:
def extract_car_data(html, make):
    """
    Extract car data from HTML:
    - Parses the HTML content using BeautifulSoup.
    - Iterates through 'li' tags to extract car information.
    - Stores the data in a list of dictionaries.

    Args:
        html (str): The HTML content of the page.
        make (str): The make of the car.

    Attributes:
        data (list): A list of dictionaries, where each dictionary represents a car's data.
        soup (bs4.BeautifulSoup): The BeautifulSoup object for parsing HTML.
        car_info (dict): A dictionary containing information about a single car.

    Returns:
        list: A list of dictionaries, where each dictionary represents a car's data.
    """
    data = []
    soup = bs4.BeautifulSoup(html, "html.parser")
    for li in soup.find_all('li'):
        car_info = {'Make': make}
        car_info['Make/Model'] = li.find('span', class_='make-model').get_text()
        # Each table row holds a field label in the first cell and its value in the second
        for tr in li.find_all('tr'):
            field = tr.find('td', class_='field').get_text(strip=True)
            value = tr.find_all('td')[1].get_text(strip=True)
            # Drop the last character of the label (assumed to be a trailing ':')
            car_info[field[:-1]] = value
        data.append(car_info)
    return data
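The function removes the last character of each field label with `field[:-1]`, which presumably strips a trailing colon. A small sketch of a more forgiving alternative follows. The helper names `normalize_field` and `build_car_info` are hypothetical, and the sample labels with colons are an assumption about the page markup. `rstrip(":")` leaves a label intact when no colon is present.

```python
def normalize_field(label):
    """Hypothetical helper: trim whitespace and any trailing colon from a label."""
    return label.strip().rstrip(":").strip()


def build_car_info(make, make_model, pairs):
    """Hypothetical helper: assemble one record from (label, value) pairs."""
    car_info = {"Make": make, "Make/Model": make_model.strip()}
    for label, value in pairs:
        car_info[normalize_field(label)] = value.strip()
    return car_info


# Illustrative input; values are taken from the first row of the scraped data.
record = build_car_info(
    "Audi",
    " Audi A1 (TFSI) ",
    [("Year:", "2012"), ("Mileage:", "130377"), ("Transmission", "Manual")],
)
print(record)

# Slicing would damage a label that has no colon.
print("Transmission"[:-1])
```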

### Scraping All Car Data

This section iterates through the car make links, fetches the HTML content of each subpage, extracts the car data using the `extract_car_data` function, and stores all extracted data in a list.

In [55]:
"""
Scrape all car data:
- Iterates through car make links.
- Looks up the number of subpages for each make in the hard-coded `subpage_counts` dictionary.
- Fetches HTML content from each subpage.
- Extracts car data using the `extract_car_data` function.
- Stores all extracted data in a list.

Attributes:
    all_car_data (list): A list to store all extracted car data.
    subpage_counts (dict): A dictionary mapping car makes to their subpage counts.
    link (str): The URL of a car make page.
    make (str): The name of the car make.
    num_subpages (int): The number of subpages for the current make.
    page_num (int): The current subpage number.
    subpage_url (str): The URL of the current subpage.
    response (http.client.HTTPResponse): The response object from the subpage request.
    html (str): The HTML content of the subpage.
"""
all_car_data = []
subpage_counts = {
    "audi": 20,
    "bmw": 20,
    "mercedes-benz": 30,
    "volkswagen": 17
}

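for link in car_make_links:
    # Take the make from the file name (e.g. "BMW-page01.html" -> "BMW"), preserving its case;
    # str.title() would turn "BMW" into "Bmw"
    make = link.rsplit('/', 1)[-1].split('-page')[0]
    num_subpages = subpage_counts.get(make.lower(), 1)  # Get subpage count or default to 1

    for page_num in range(1, num_subpages + 1):
        subpage_url = f'{link[:-7]}{page_num:02}.html'  # Replace the trailing "01.html" with the page number
        with req.urlopen(subpage_url) as response:  # Close the connection after reading
            html = response.read().decode()
        all_car_data.extend(extract_car_data(html, make))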
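The loop above relies on fixed slice offsets to derive the make and the subpage URLs. The following sketch shows one way to do the same with explicit pattern matching. It assumes every first-page URL ends in `<Make>-pageNN.html`, as in the `car_make_links` output, and the helper names `subpage_urls` and `make_from_url` are hypothetical.

```python
import re


def subpage_urls(first_page_url, num_pages):
    """Hypothetical helper: list the URLs of pages 1..num_pages for one make."""
    # Split "...<Make>-page01.html" into a prefix and the two-digit page suffix.
    match = re.fullmatch(r"(?P<prefix>.*-page)\d{2}\.html", first_page_url)
    if match is None:
        raise ValueError(f"Unexpected page URL: {first_page_url}")
    return [f"{match.group('prefix')}{n:02}.html" for n in range(1, num_pages + 1)]


def make_from_url(first_page_url):
    """Hypothetical helper: take the make from the file name, preserving its case."""
    filename = first_page_url.rsplit("/", 1)[-1]
    return filename.split("-page", 1)[0]


link = "http://mlg.ucd.ie/modules/python/assignment1/cars/BMW-page01.html"
urls = subpage_urls(link, 20)
print(make_from_url(link), urls[0], urls[-1])
```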

### Saving Data to CSV

The extracted car data is saved to a CSV file named 'car_sales_data.csv'. This file will be used for further analysis in a separate notebook.

In [57]:
"""
Save data to CSV:
- Opens a CSV file for writing.
- Creates a CSV writer object.
- Writes the header row and all car data rows to the file.

Attributes:
    csv_file (str): The name of the CSV file to save data to.
    fieldnames (list): A list of field names for the CSV header.
    writer (csv.DictWriter): The CSV writer object.
"""
csv_file = 'car_sales_data.csv'
with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = all_car_data[0].keys()
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_car_data)
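The cell above takes the header from the first record only, so `csv.DictWriter` would raise a `ValueError` if a later record carried a field the first one lacks. A hedged sketch of a more tolerant approach is to collect the union of keys in first-seen order and fill gaps with an empty string. The helper name `collect_fieldnames` and the two sample rows are illustrative, not scraped data.

```python
import csv
import io


def collect_fieldnames(rows):
    """Hypothetical helper: union of all keys, in order of first appearance."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    return fieldnames


# Illustrative rows; the second has a field the first lacks.
rows = [
    {"Make": "Audi", "Year": "2012"},
    {"Make": "BMW", "Year": "2018", "Fuel Type": "Diesel"},
]

# An in-memory buffer stands in for the CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=collect_fieldnames(rows), restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```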

### Reading Data into pandas DataFrame

The saved CSV file is read into a pandas DataFrame for easier manipulation and analysis in the next notebook.

In [59]:
"""
Read data into pandas DataFrame:
- Reads the CSV file into a pandas DataFrame.
- Displays the DataFrame.

Attributes:
    df (pd.DataFrame): The pandas DataFrame containing the car data.
"""
df = pd.read_csv('car_sales_data.csv')
df

|  | Make | Make/Model | Date of Sale | Sale Price | Year | Mileage | Classification | Transmission | Fuel Type | Description | Sale Location |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Audi | Audi A1 (TFSI) | 06/01/2024 | €4,800.00 | 2012 | 130377 | Hatchback | Manual | Petrol | Grey Audi A11.4 TFSI Sport Euro 5 (s/s) 3dr2 p... | Waterford |
| 1 | Audi | Audi Q7 (S line Plus) | 07/01/2024 | €14,450.00 | 2012 | 91483 | SUV | Automatic | Diesel | Blue Audi Q73.0 TDI S line Plus Tiptronic quat... | Clare |
| 2 | Audi | Audi RS3 (TFSI) | 09/01/2024 | €70,256.00 | 2022 | 3869 | Saloon | Automatic | Petrol | Black Audi RS3 SALOON2.5 RS 3 TFSI QUATTRO VOR... | Mayo |
| 3 | Audi | Audi A3 (S line) | 09/01/2024 | €10,308.00 | 2013 | 83389 | Hatchback | Manual | Diesel | Black Audi A32.0 TDI S line Hatchback 3dr Dies... | Limerick |
| 4 | Audi | Audi A6 Saloon (ultra) | 2024-01-09 | €21,833.00 | 2018 | 42918 | Saloon | Automatic | Diesel | Red Audi A6 Saloon2.0 TDI Ultra Black Edition ... | Dublin |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | Volkswagen | Volkswagen Tiguan (R-Line) | 19/12/2024 | €14,075.00 | 2015 | 73104 | SUV | Automatic | Diesel | Volkswagen Tiguan2.0 R LINE TDI BLUEMOTION TEC... | Tipperary |
| 1696 | Volkswagen | Volkswagen Tiguan (R-Line) | 20/12/2024 | €27,755.00 | 2017 | 26055 | SUV | Automatic | Petrol | Volkswagen Tiguan2.0 TSI R-Line DSG 4Motion (s... | Wexford |
| 1697 | Volkswagen | Volkswagen Passat (BlueMotion Tech) | 21/12/2024 | €13,503.00 | 2015 | 53236 | Saloon | Manual | Diesel | Volkswagen Passat2.0 TDI SE Business 4dr2 prev... | Tipperary |
| 1698 | Volkswagen | Volkswagen Golf (MHEV) | 22/12/2024 | €29,262.00 | 2021 | 3143 | Hatchback | Automatic | Petrol Hybrid | Volkswagen Golf1.5 R-LINE ETSI DSG 5d 148 BHP ... | Galway |
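A quick sanity check after loading is to compare the number of records per make with the totals reported by the site (384, 396, 593, and 327, which sum to 1700). The sketch below assumes the records are available as a list of dictionaries with a `Make` key, such as `all_car_data`. The helper name `count_mismatches` is hypothetical, and makes are compared case-insensitively.

```python
from collections import Counter

# Totals reported on each make's first page, keyed by lower-cased make.
EXPECTED_TOTALS = {"audi": 384, "bmw": 396, "mercedes-benz": 593, "volkswagen": 327}


def count_mismatches(rows, expected):
    """Hypothetical helper: map make -> (actual, expected) where the counts differ."""
    actual = Counter(row["Make"].lower() for row in rows)
    return {
        make: (actual.get(make, 0), total)
        for make, total in expected.items()
        if actual.get(make, 0) != total
    }


# Illustrative input: two Audi rows and one BMW row, checked against small made-up totals.
demo_rows = [{"Make": "Audi"}, {"Make": "Audi"}, {"Make": "Bmw"}]
demo_result = count_mismatches(demo_rows, {"audi": 2, "bmw": 2})
print(demo_result)
```

With the scraped data, `count_mismatches(all_car_data, EXPECTED_TOTALS)` should return an empty dictionary if every page was collected.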
