# **Web Scraping (BeautifulSoup + urllib) Part 1**

**Group:** Group 1 \
**Name:** Camily Tang Jia Lei \
**Matric No.:** A22EC0039

**Group Members:**
  1. Marcus Joey Sayner (A22EC0193)
  2. Muhammad Luqman Hakim bin Mohd Rizaudin (A22EC0086)
  3. Camily Tang Jia Lei (A22EC0039)
  4. Goh Jing Yang (A22EC0052)

**Objective:**
  The objective of this project is to scrape car listings from Carlist.my, extracting key details such as car name, brand, model, price, mileage, transmission, and location. The extracted data is saved to a CSV file for later use in car market analysis.

# 1. Importing Libraries
The first step in this project is to import the necessary Python libraries. These libraries are used for fetching the webpage, parsing the HTML content, and handling data.

*   **urllib.request:** This module sends HTTP requests. We use it to download the HTML content of each listings page.
*   **BeautifulSoup:** A powerful library used to parse HTML and XML documents. It provides methods to navigate and search the parse tree, making it easier to extract specific elements from the page.
*   **csv:** A built-in Python module for reading from and writing to CSV files. It's used here to store the scraped data in a structured tabular format.
*   **time:** The time module is used to add delays between requests, which avoids overwhelming the website with too many requests in a short period, and to measure the total execution time of the script.
*   **json:** This module is used for parsing JSON data. Since the car listings are embedded in a JSON-LD script, this module allows us to extract and handle that data effectively.

In [10]:
import urllib.request
from bs4 import BeautifulSoup, NavigableString
import csv
import time
import json

# 2. Configuration
Here, we set up the configuration for the scraping process, including the base URL for the car listings and the page range that we want to scrape. We also prepare an empty list to store the car listings and specify the output file name.



*   **base_url:** The base URL is a template for the car listings pages on Carlist.my. The {} placeholder is replaced with the current page number during the scraping process. The page size is set to 25 listings per page.
*   **start_page and end_page:** These variables define the inclusive range of pages to scrape. In this run, the scraper fetches pages 5233 through 6000 (768 pages).
*   **car_listings:** This list will store the data extracted from each car listing. Each listing will be appended to this list as a dictionary.
*   **output_file:** The name of the CSV file where the scraped data will be saved.





In [None]:
# Configuration
base_url = "https://www.carlist.my/cars-for-sale/malaysia?page_number={}&page_size=25"
start_page = 5233    # Starting page number to scrape 
end_page = 6000      # Ending page number to scrape
car_listings = []  # List to store car listings data

output_file = 'D:/camily_carlist_listings_1.csv'
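
As a quick check of the configuration, the sketch below shows how the `{}` placeholder is filled in and how many pages and listings the chosen range implies. The name `page_size` is introduced here for illustration only; the original hard-codes 25 in the URL. The listing count is an upper bound that assumes every page returns a full 25 listings.

```python
# Values copied from the configuration cell
base_url = "https://www.carlist.my/cars-for-sale/malaysia?page_number={}&page_size=25"
start_page = 5233
end_page = 6000
page_size = 25  # hypothetical name; matches the page_size value in the URL

# str.format substitutes the page number into the {} placeholder
first_url = base_url.format(start_page)
print(first_url)

# The scraping loop uses range(start_page, end_page + 1), so end_page is included
total_pages = end_page - start_page + 1
max_listings = total_pages * page_size
print(total_pages, max_listings)  # 768 19200
```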

# 3. Helper Functions for Data Extraction
Several helper functions are defined to extract specific details from the HTML content and embedded JSON data.

3.1 Get HTML Attributes <br>
This function retrieves the value of a specified HTML attribute from a given element. If the attribute is not found, or the element itself is missing, it returns an empty string.

In [12]:
# Get HTML attribute or empty string if attribute doesn't exist
def get_attr(element, attr_name):
    return element.get(attr_name, '') if element else ''
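
A small usage sketch may make the three cases concrete: attribute present, attribute missing, and element missing. The HTML fragment is made up for illustration and is only assumed to resemble the real Carlist.my markup.

```python
from bs4 import BeautifulSoup

def get_attr(element, attr_name):
    return element.get(attr_name, '') if element else ''

# Hypothetical fragment; the real page markup may differ
html = '<article class="listing" data-make="Perodua" data-model="Myvi"></article>'
soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article', class_='listing')

make = get_attr(article, 'data-make')                    # attribute present
body = get_attr(article, 'data-body-type')               # attribute missing
none_case = get_attr(soup.find('section'), 'data-make')  # find() returned None
print(repr(make), repr(body), repr(none_case))
```

Returning an empty string in every failure case means the CSV ends up with a blank cell rather than the scraper stopping on one incomplete listing.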

3.2 Extract the First Text Node <br>
This function extracts the first non-empty text node among the direct children of a div element. It skips child tags and whitespace-only strings, and returns only the first piece of text content.

In [13]:
# Get the first text node inside a <div> (ignores non-text elements)
def get_first_text_node(div):
    for node in div.contents if div else []:
        if isinstance(node, NavigableString) and node.strip():
            return node
    return ''  # Return empty string if no text node is found
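
The sketch below runs the helper on a made-up dealer block (the markup is an assumption, not copied from the site). It shows that the leading `<i>` tag is skipped and that the returned text keeps its surrounding whitespace.

```python
from bs4 import BeautifulSoup, NavigableString

def get_first_text_node(div):
    for node in div.contents if div else []:
        if isinstance(node, NavigableString) and node.strip():
            return node
    return ''

# Hypothetical fragment: an icon tag, a text node, then a nested span
html = ('<div class="listing__spec--dealer">'
        '<i class="icon"></i> Dealer <span>Verified</span></div>')
div = BeautifulSoup(html, 'html.parser').find('div', class_='listing__spec--dealer')

first = get_first_text_node(div)
print(repr(first))             # text node as stored, whitespace included
print(repr(str(first).strip()))  # trimmed form, if a clean value is wanted
```

Because the text is returned as found, any trimming has to happen later, for example during data cleaning.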

3.3 Extract Mileage Data <br>
This function searches for the meter icon in each car listing and retrieves the mileage information associated with it. If the icon is found, it returns the text next to it; otherwise, it returns an empty string.

In [14]:
# Extract raw mileage information from the article (after the meter icon)
def get_mileage(article):
    icon = article.find('i', class_='icon--meter')  # Find meter icon
    return str(icon.next_sibling) if icon and icon.next_sibling else ''  # Return mileage text
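
To illustrate what "the text next to the icon" means, here is a sketch on a made-up fragment. The class name `icon--meter` comes from the function above; the rest of the markup and the mileage value are assumptions.

```python
from bs4 import BeautifulSoup

def get_mileage(article):
    icon = article.find('i', class_='icon--meter')
    return str(icon.next_sibling) if icon and icon.next_sibling else ''

# Hypothetical listing with a meter icon followed directly by a text node
html = ('<article class="listing"><div class="item">'
        '<i class="icon icon--meter"></i> 50 - 55K KM </div></article>')
article = BeautifulSoup(html, 'html.parser').find('article')
raw = get_mileage(article)
print(repr(raw))  # raw text, including surrounding whitespace

# A listing without the icon yields an empty string
empty = BeautifulSoup('<article></article>', 'html.parser').find('article')
missing = get_mileage(empty)
print(repr(missing))
```

Only the single node immediately after the icon is read, so the function depends on the mileage text directly following the `<i>` tag.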

3.4 Extract Location Data<br>
This function retrieves the location information for each car listing. It searches for the location icon and collects the text that follows it, including text inside `<span>` elements, stopping at the first tag of any other kind.

In [15]:
# Extract raw location information from the article (after the location icon)
def get_location(article):
    icon = article.find('i', class_='icon--location')  # Find location icon
    if not icon:
        return ''
    text = ''
    for sib in icon.next_siblings:  # Loop through siblings after icon
        if isinstance(sib, NavigableString):
            text += str(sib)
        elif getattr(sib, 'name', None) == 'span':
            text += sib.get_text()
        else:
            break
    return text  # Return extracted location text
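
The sibling walk can be seen on a made-up fragment (the markup is an assumption for illustration): plain text and `<span>` text are concatenated, and the walk stops at the first other tag, here an `<a>`.

```python
from bs4 import BeautifulSoup, NavigableString

def get_location(article):
    icon = article.find('i', class_='icon--location')
    if not icon:
        return ''
    text = ''
    for sib in icon.next_siblings:
        if isinstance(sib, NavigableString):
            text += str(sib)
        elif getattr(sib, 'name', None) == 'span':
            text += sib.get_text()
        else:
            break
    return text

# Hypothetical listing: text node, span, then a link that ends the walk
html = ('<article class="listing"><div class="item">'
        '<i class="icon icon--location"></i> Selangor <span>Petaling Jaya</span>'
        '<a href="#">Map</a> ignored</div></article>')
article = BeautifulSoup(html, 'html.parser').find('article')
loc = get_location(article)
print(repr(loc))
```

Text after the `<a>` tag is never reached, which keeps unrelated trailing content out of the location field.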

3.5 Extract JSON-LD Data <br>
This function extracts the embedded JSON-LD data from the webpage. JSON-LD is used to represent structured data about the car listings. The function looks for the script tag containing this data, parses it, and returns the list of car listings.

In [16]:
# Pull JSON-LD block containing the car listings from the page
def extract_json_ld(soup):
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)  # Parse the JSON-LD data
            if isinstance(data, list):
                for d in data:
                    if 'itemListElement' in d:
                        return d['itemListElement']  # Return list of car listings
        except (TypeError, ValueError):  # empty <script> tag or invalid JSON
            continue
    return None  # Return None if no valid JSON-LD is found
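
The sketch below shows the JSON-LD shape this function expects: a top-level list in which one object carries `itemListElement`, and each element wraps the car details under `item`. The payload is invented for illustration, and the copy of the function here narrows the exception handling to `TypeError` and `ValueError`, which is an assumption about the failures worth skipping.

```python
import json
from bs4 import BeautifulSoup

def extract_json_ld(soup):
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                for d in data:
                    if 'itemListElement' in d:
                        return d['itemListElement']
        except (TypeError, ValueError):  # empty tag or invalid JSON
            continue
    return None

# Hypothetical payload shaped like the structure the function looks for
payload = [
    {"@type": "WebSite", "name": "example"},
    {"@type": "ItemList", "itemListElement": [
        {"@type": "ListItem", "position": 1,
         "item": {"vehicleModelDate": "2019", "fuelType": "Petrol",
                  "offers": {"price": "45000"}}}
    ]},
]
html = ('<script type="application/ld+json">not json</script>'
        '<script type="application/ld+json">' + json.dumps(payload) + '</script>')
soup = BeautifulSoup(html, 'html.parser')

items = extract_json_ld(soup)   # the malformed first script is skipped
car = items[0]['item']
print(car.get('vehicleModelDate', ''), car.get('offers', {}).get('price', ''))
```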

# 4. Tracking Execution Time
This line captures the current time before the main scraping loop begins. The time.time() function returns the time in seconds as a floating-point number. Subtracting this start time from the time recorded at the end of the script gives the total execution time, which is useful for judging how long a run over many pages takes and for comparing different approaches.

In [17]:
# Start time
start_time = time.time()
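
As an alternative sketch, not the approach used in this notebook, `time.perf_counter()` can be used for the same start/stop pattern. It reads a monotonic clock, so the measured duration is not affected if the system clock is adjusted during a long run. The `time.sleep(0.1)` call is only a stand-in for the scraping work.

```python
import time

start = time.perf_counter()
time.sleep(0.1)  # stand-in for the scraping loop
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.2f} seconds")
```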

# 5. Main Scraping Loop
In this section, the main scraping process is performed. We iterate over the specified pages and extract the relevant car listing data.

In [18]:


# Main scraping loop
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)  # Generate URL for the current page
    print(f"\n🔎 Extracting page {page_num} - {url}")

    # Load the page and parse it with BeautifulSoup
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Find all articles (car listings) and extract JSON-LD data
    articles = soup.find_all('article', class_='listing')
    ld_json = extract_json_ld(soup)
    if not ld_json:
        print("⚠️ No JSON-LD found on this page.")
        continue

    page_count = 0
    # Loop through articles and corresponding JSON-LD data
    for article, item in zip(articles, ld_json):
        car = item['item']  # Extract JSON details for the car

        # Extract fields from the <article> element’s attributes (HTML-based fields)
        name = get_attr(article, 'data-title')
        brand = get_attr(article, 'data-make')
        model = get_attr(article, 'data-model')
        body = get_attr(article, 'data-body-type')
        transmission = get_attr(article, 'data-transmission')
        installment = get_attr(article, 'data-installment')

        # Extract exactly what is displayed on the page (visible text-based fields)
        mileage = get_mileage(article)
        sales_channel = get_first_text_node(article.find('div', class_='listing__spec--dealer'))
        location = get_location(article)

        # Extract additional fields from the JSON-LD data
        year = car.get('vehicleModelDate', '')
        fuel = car.get('fuelType', '')
        color = car.get('color', '')
        price = car.get('offers', {}).get('price', '')
        condition = car.get('itemCondition', '')
        seats = car.get('seatingCapacity', '')

        # Append the extracted data to the car_listings list
        car_listings.append({
            'Car Name': name,
            'Car Brand': brand,
            'Car Model': model,
            'Manufacture Year': year,
            'Body Type': body,
            'Fuel Type': fuel,
            'Mileage': mileage,
            'Transmission': transmission,
            'Color': color,
            'Price': price,
            'Installment': installment,
            'Condition': condition,
            'Seating Capacity': seats,
            'Location': location,
            'Sales Channel': sales_channel
        })
        page_count += 1

    print(f"✅ Found {page_count} cars on page {page_num}")
    print(f"📄 Total scraped: {len(car_listings)}")
    time.sleep(2)  # Sleep to avoid making too many requests in a short time


🔎 Extracting page 5233 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5233&page_size=25
✅ Found 25 cars on page 5233
📄 Total scraped: 25

🔎 Extracting page 5234 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5234&page_size=25
✅ Found 25 cars on page 5234
📄 Total scraped: 50

🔎 Extracting page 5235 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5235&page_size=25
✅ Found 25 cars on page 5235
📄 Total scraped: 75

🔎 Extracting page 5236 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5236&page_size=25
✅ Found 25 cars on page 5236
📄 Total scraped: 100

🔎 Extracting page 5237 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5237&page_size=25
✅ Found 25 cars on page 5237
📄 Total scraped: 125

🔎 Extracting page 5238 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5238&page_size=25
✅ Found 25 cars on page 5238
📄 Total scraped: 150

🔎 Extracting page 5239 - https://www.carlist.my/cars-for-sale/malaysia?page_number=5239&pa
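
One detail of the loop worth illustrating is the pairing of `articles` with `ld_json` through `zip`. `zip` stops at the shorter input, so if a page ever had more `<article>` elements than JSON-LD entries (or the reverse), the extra items would be dropped silently, and a difference in ordering would mix fields from different cars. The sketch below uses plain stand-in values rather than real page data, and the mismatch check is a suggestion, not part of the original script.

```python
# Stand-ins for parsed <article> tags and JSON-LD entries (hypothetical data)
articles = ['article-1', 'article-2', 'article-3']
ld_items = [{'item': {'color': 'Red'}}, {'item': {'color': 'Blue'}}]  # one entry short

pairs = list(zip(articles, ld_items))
print(len(pairs))  # 2: 'article-3' has no partner and is dropped

# Optional guard that makes a mismatch visible in the log
if len(articles) != len(ld_items):
    print(f"Count mismatch: {len(articles)} articles vs {len(ld_items)} JSON-LD items")
```

In the run shown above every page reported 25 cars, which is consistent with both sources having the same length on those pages.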

# 6. Saving Data to CSV
The script checks if any car listings were successfully scraped.

It opens the output CSV file for writing and uses csv.DictWriter to write the data.

The writeheader() method writes the column headers, while writerows() writes the data for all the car listings.

In [19]:
# Save the extracted data to a CSV file
if car_listings:
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=car_listings[0].keys())
        writer.writeheader()  # Write the column headers
        writer.writerows(car_listings)  # Write the car listings data
    print(f"\n✅ Saved {len(car_listings)} cars to '{output_file}'")
else:
    print("\n⚠️ No car listings found.")


✅ Saved 19200 cars to 'D:/camily_carlist_listings_1.csv'
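
To show what `csv.DictWriter` produces, the sketch below writes two made-up rows to an in-memory buffer instead of a file. The row values are invented; only the column names are taken from the script. An empty string in a dictionary becomes an empty cell in the output.

```python
import csv
import io

# Hypothetical rows using a subset of the script's column names
rows = [
    {'Car Name': 'Sample Car A', 'Price': '45000', 'Location': 'Selangor'},
    {'Car Name': 'Sample Car B', 'Price': '', 'Location': 'Johor'},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for the output file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()    # header row comes from fieldnames
writer.writerows(rows)  # one CSV line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Taking the field names from the first dictionary works here because every listing dictionary is built with the same keys in the same order.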


# 7. Calculate Total Execution Time
At the end of the script, after all scraping activities and file-saving operations, the following code is added to record the end time and calculate the total execution time:

In [20]:
# End time and elapsed time calculation
end_time = time.time()
execution_time = end_time - start_time
print(f"\n🕒 Total execution time: {execution_time:.2f} seconds")


🕒 Total execution time: 2934.52 seconds
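
The reported total can be broken down with some simple arithmetic. The sketch assumes that all 768 pages were fetched within the timed run and that no significant idle time passed between notebook cells; under those assumptions, more than half of the run time is the deliberate 2-second delay.

```python
execution_time = 2934.52        # seconds, from the output above
total_pages = 6000 - 5233 + 1   # 768 pages in the configured range
sleep_per_page = 2              # seconds, from time.sleep(2) in the loop

per_page = execution_time / total_pages     # average wall time per page
sleep_total = total_pages * sleep_per_page  # time spent sleeping overall
work_per_page = per_page - sleep_per_page   # average fetch + parse time

print(f"{execution_time / 60:.1f} minutes in total")
print(f"{per_page:.2f} s per page, of which {work_per_page:.2f} s is fetching and parsing")
print(f"{sleep_total} s spent in time.sleep")
```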


# 8. Conclusion
The script scrapes car listings from Carlist.my by combining HTML attributes, visible page text, and embedded JSON-LD data, and saves the result to a CSV file for further use. Missing fields default to empty strings, and a 2-second delay between requests limits the load placed on the website.

# Full Code

```python
import urllib.request
from bs4 import BeautifulSoup, NavigableString
import csv
import time
import json

# Configuration
base_url = "https://www.carlist.my/cars-for-sale/malaysia?page_number={}&page_size=25"
start_page = 5233    # Starting page number to scrape
end_page = 6000      # Ending page number to scrape
car_listings = []  # List to store car listings data

output_file = 'D:/camily_carlist_listings_1.csv'

# Get HTML attribute or empty string if attribute doesn't exist
def get_attr(element, attr_name):
    return element.get(attr_name, '') if element else ''

# Get the first text node inside a <div> (ignores non-text elements)
def get_first_text_node(div):
    for node in div.contents if div else []:
        if isinstance(node, NavigableString) and node.strip():
            return node
    return ''  # Return empty string if no text node is found

# Extract raw mileage information from the article (after the meter icon)
def get_mileage(article):
    icon = article.find('i', class_='icon--meter')  # Find meter icon
    return str(icon.next_sibling) if icon and icon.next_sibling else ''  # Return mileage text

# Extract raw location information from the article (after the location icon)
def get_location(article):
    icon = article.find('i', class_='icon--location')  # Find location icon
    if not icon:
        return ''
    text = ''
    for sib in icon.next_siblings:  # Loop through siblings after icon
        if isinstance(sib, NavigableString):
            text += str(sib)
        elif getattr(sib, 'name', None) == 'span':
            text += sib.get_text()
        else:
            break
    return text  # Return extracted location text

# Pull JSON-LD block containing the car listings from the page
def extract_json_ld(soup):
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)  # Parse the JSON-LD data
            if isinstance(data, list):
                for d in data:
                    if 'itemListElement' in d:
                        return d['itemListElement']  # Return list of car listings
        except (TypeError, ValueError):  # empty <script> tag or invalid JSON
            continue
    return None  # Return None if no valid JSON-LD is found

# Start time
start_time = time.time()

# Main scraping loop
for page_num in range(start_page, end_page + 1):
    url = base_url.format(page_num)  # Generate URL for the current page
    print(f"\n🔎 Extracting page {page_num} - {url}")

    # Load the page and parse it with BeautifulSoup
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Find all articles (car listings) and extract JSON-LD data
    articles = soup.find_all('article', class_='listing')
    ld_json = extract_json_ld(soup)
    if not ld_json:
        print("⚠️ No JSON-LD found on this page.")
        continue

    page_count = 0
    # Loop through articles and corresponding JSON-LD data
    for article, item in zip(articles, ld_json):
        car = item['item']  # Extract JSON details for the car

        # Extract fields from the <article> element’s attributes (HTML-based fields)
        name = get_attr(article, 'data-title')
        brand = get_attr(article, 'data-make')
        model = get_attr(article, 'data-model')
        body = get_attr(article, 'data-body-type')
        transmission = get_attr(article, 'data-transmission')
        installment = get_attr(article, 'data-installment')

        # Extract exactly what is displayed on the page (visible text-based fields)
        mileage = get_mileage(article)
        sales_channel = get_first_text_node(article.find('div', class_='listing__spec--dealer'))
        location = get_location(article)

        # Extract additional fields from the JSON-LD data
        year = car.get('vehicleModelDate', '')
        fuel = car.get('fuelType', '')
        color = car.get('color', '')
        price = car.get('offers', {}).get('price', '')
        condition = car.get('itemCondition', '')
        seats = car.get('seatingCapacity', '')

        # Append the extracted data to the car_listings list
        car_listings.append({
            'Car Name': name,
            'Car Brand': brand,
            'Car Model': model,
            'Manufacture Year': year,
            'Body Type': body,
            'Fuel Type': fuel,
            'Mileage': mileage,
            'Transmission': transmission,
            'Color': color,
            'Price': price,
            'Installment': installment,
            'Condition': condition,
            'Seating Capacity': seats,
            'Location': location,
            'Sales Channel': sales_channel
        })
        page_count += 1

    print(f"✅ Found {page_count} cars on page {page_num}")
    print(f"📄 Total scraped: {len(car_listings)}")
    time.sleep(2)  # Sleep to avoid making too many requests in a short time

# Save the extracted data to a CSV file
if car_listings:
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=car_listings[0].keys())
        writer.writeheader()  # Write the column headers
        writer.writerows(car_listings)  # Write the car listings data
    print(f"\n✅ Saved {len(car_listings)} cars to '{output_file}'")
else:
    print("\n⚠️ No car listings found.")

# End time and elapsed time calculation
end_time = time.time()
execution_time = end_time - start_time
print(f"\n🕒 Total execution time: {execution_time:.2f} seconds")
```