Part 1: Setup and Initialization

This section sets up the web scraping environment and initializes the project. It involves:

~Importing the libraries required for data extraction and analysis.

~Creating the project structure from within a Jupyter Notebook (.ipynb) for better organization and reproducibility.

~Sending a test HTTP request to the Cars24 website to verify that the site can be reached.

~Checking the site's robots.txt rules before any scraping is done.

This initial step matters because it gives the rest of the team a consistent, working foundation to build on, which keeps the later stages of the project workflow efficient and easy to integrate.

In [None]:
# Step 0: Install Dependencies (Optional)
# Uncomment and run the following lines to install the required libraries if not already installed.
# This is useful when setting up the environment in a new kernel.

# !pip install requests
# !pip install beautifulsoup4
# !pip install pandas

# Note: It's recommended to create a requirements.txt file for better dependency management.
# To generate a requirements.txt file, run the following command in the terminal:
# pip freeze > requirements.txt


In [None]:
# Step 1: Importing Required Libraries

import requests                      # For sending HTTP requests
from bs4 import BeautifulSoup        # For parsing HTML content
import pandas as pd                  # For data manipulation and analysis
import os                            # For creating the project structure
import sys                           # For reporting the Python version
from urllib import robotparser       # For parsing robots.txt files

# Printing package versions for reproducibility
# Note: the BeautifulSoup class has no __version__ attribute, so the bs4 line
# prints 'N/A'; the version string is exposed by the package itself (bs4.__version__).

print("Library Versions:")
print("requests:", requests.__version__)
print(f"beautifulsoup4(bs4): {getattr(BeautifulSoup, '__version__', 'N/A')}")
print("pandas:", pd.__version__)
print("Python:", sys.version)

Library Versions:
requests: 2.32.3
beautifulsoup4(bs4): N/A
pandas: 2.2.2
Python: 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)]
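
The `N/A` reported for beautifulsoup4 above comes from looking for `__version__` on the `BeautifulSoup` class, which does not define it. As a minimal sketch, assuming Python 3.8 or newer, the installed versions can instead be read from package metadata. The helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'N/A' if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "N/A"


# Distribution names as published on PyPI (note: "beautifulsoup4", not "bs4").
for name in ("requests", "beautifulsoup4", "pandas"):
    print(f"{name}: {installed_version(name)}")
```

Because the lookup uses the distribution name rather than the import name, the same loop could also be used to write pinned entries for only these three packages into a requirements.txt, instead of freezing the whole environment.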


In [None]:
# Step 2: Sending a test HTTP request to verify connectivity (hardened version)

test_url = "https://www.cars24.com/buy-used-hyundai-cars-mumbai/?sort=bestmatch&serveWarrantyCount=true&listingSource=Homepage_Filters"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
})

try:
    response = session.get(test_url, timeout=10)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses

    soup = BeautifulSoup(response.text, "html.parser")
    # Minimal validation: check if the page <title> contains 'Hyundai' and 'Mumbai'
    # soup.title.string is None when <title> is empty or has nested tags, so fall back to ""
    page_title = (soup.title.string or "") if soup.title else ""
    if "Hyundai" in page_title and "Mumbai" in page_title:
        print("Successfully connected and validated Cars24 Mumbai Hyundai page structure.")
    else:
        print("Connected, but page structure/title did not match expectations. Title found:", page_title)
except requests.exceptions.RequestException as e:
    print(f"HTTP request failed: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Successfully connected and validated Cars24 Mumbai Hyundai page structure.
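
The title check in Step 2 is case-sensitive and depends on the title being a plain string. A more tolerant variant might look like the sketch below. The helper name `title_matches` and the sample titles are hypothetical and serve only to illustrate the idea.

```python
from typing import Iterable, Optional


def title_matches(page_title: Optional[str], required_terms: Iterable[str]) -> bool:
    """Return True if every required term occurs in the title, ignoring case."""
    if not page_title:  # covers both None and the empty string
        return False
    lowered = page_title.lower()
    return all(term.lower() in lowered for term in required_terms)


# Made-up titles, used only to show the behaviour.
print(title_matches("Used HYUNDAI cars in mumbai", ["Hyundai", "Mumbai"]))  # True
print(title_matches("Used Hyundai cars in Delhi", ["Hyundai", "Mumbai"]))   # False
print(title_matches(None, ["Hyundai", "Mumbai"]))                           # False
```

Keeping the check in a small pure function makes it easy to reuse for other city or brand pages, and to test without sending a request.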


In [11]:
# Step 2b: Check robots.txt permissions before scraping
# This step demonstrates responsible scraping by checking the site's robots.txt rules.

from urllib import robotparser

robots_url = "https://www.cars24.com/robots.txt"
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()

test_url = "https://www.cars24.com/buy-used-hyundai-cars-mumbai/"
if rp.can_fetch("*", test_url):
    print(f"Allowed to scrape: {test_url}")
else:
    print(f"Not allowed to scrape: {test_url}")

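# Note: This check only reports the robots.txt rule; it does not block anything by itself.
# Later steps should skip any URL for which rp.can_fetch() returns False.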

Not allowed to scrape: https://www.cars24.com/buy-used-hyundai-cars-mumbai/
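
One caveat when reading the result above: `RobotFileParser.read()` downloads robots.txt with urllib's default user agent, and if that download is answered with HTTP 401 or 403 the parser treats every path as disallowed. A "Not allowed" result can therefore mean either an explicit rule or a blocked robots.txt download. The sketch below separates the download from the parsing so the two cases can be told apart. The helper names are hypothetical, and the sample rules are invented for illustration; they are not the actual Cars24 robots.txt.

```python
import urllib.request
from urllib import robotparser


def fetch_robots_text(robots_url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Download robots.txt with an explicit User-Agent header (performs a network call)."""
    req = urllib.request.Request(robots_url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_parser(robots_text: str) -> robotparser.RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


# Invented rules for illustration only; NOT the real Cars24 robots.txt.
sample_rules = """\
User-agent: *
Disallow: /private/
"""
rp = build_parser(sample_rules)
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

In the notebook, the text passed to `build_parser` would come from `fetch_robots_text(robots_url, ...)`; an HTTP error raised there signals a blocked download rather than a disallow rule.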


In [None]:
# Step 3: Creating project structure

project_dir = "cars24_hyundai_mumbai"    # Name of the project folder
if not os.path.exists(project_dir):      # Check whether the directory already exists
    os.makedirs(project_dir)             # If not, create it
    print(f"Project directory '{project_dir}' created successfully.")
else:
    # Report that the directory is already in place
    print(f"Project directory '{project_dir}' already exists.")

Project directory 'cars24_hyundai_mumbai' already exists.
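
If the project later needs subfolders, the same create-if-missing logic can be written once with `pathlib`. This is only a sketch: the helper name `create_project_tree` and the subfolder names `data/raw` and `data/processed` are assumptions, not part of the original project layout. The demo runs in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def create_project_tree(root: Path, subdirs: Iterable[str]) -> List[Path]:
    """Create root and its subfolders; return only the folders newly created."""
    created = []
    for folder in [root, *(root / name for name in subdirs)]:
        if not folder.exists():
            folder.mkdir(parents=True)  # also creates missing parents such as "data"
            created.append(folder)
    return created


# Demo in a throwaway directory; the subfolder names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cars24_hyundai_mumbai"
    first_run = create_project_tree(root, ["data/raw", "data/processed"])
    second_run = create_project_tree(root, ["data/raw", "data/processed"])
    print(len(first_run))   # 3
    print(second_run)       # []
```

Returning the list of newly created folders keeps the function safe to rerun in a fresh kernel while still showing what changed.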


In [None]:
# Step 4: Data Cleaning
# To be completed by the next team members


In [None]:
# Step 5: Data Extraction
# To be completed by the next team members
