Part 1: Setup and Initialization

This section sets up the web scraping environment and initializes the project. It covers:

~Importing the libraries needed for data extraction and analysis.

~Creating the project structure from within a Jupyter Notebook (.ipynb), for better organization and reproducibility.

~Sending a test HTTP request to the Cars24 website to verify that the site is reachable.

This step gives every team member the same working environment to build on, which keeps the rest of the project workflow efficient, consistent, and easy to integrate.

In [None]:
# Step 0: Dependency Installation (Optional - for new environments)
# Uncomment and run the following lines if you need to install dependencies in a new kernel
# !pip install requests beautifulsoup4 pandas
# 
# Note: For production environments, create a requirements.txt file with:
# requests>=2.25.0
# beautifulsoup4>=4.9.0
# pandas>=1.3.0
#
# Then install with: pip install -r requirements.txt

print("Dependency installation cell ready (commented for safety)")

Dependency installation cell ready (commented for safety)
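
The minimum versions listed in the comment above can also be checked from inside the notebook. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing assumes plain dotted numeric versions such as 2.32.5 (pre-release suffixes are ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the requirements comment in the cell above
MINIMUMS = {"requests": "2.25.0", "beautifulsoup4": "4.9.0", "pandas": "1.3.0"}

def version_tuple(text):
    """Turn '4.14.2' into (4, 14, 2); stop at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare numerically, padding with zeros so '2.25' equals '2.25.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    print(f"{package}: {installed} (need >= {minimum}) -> {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "4.14.2" sorts before "4.9.0", which would wrongly flag a newer beautifulsoup4 as too old.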


In [16]:
# Step 1: Importing Required Libraries and Checking Versions

import requests                      # For sending HTTP requests
from bs4 import BeautifulSoup         # For parsing HTML content
import pandas as pd                   # For data manipulation and analysis
import os                             # For creating project structure
import sys                            # For system information
import urllib.robotparser             # For robots.txt checking

# Print package versions for reproducibility
print("=== Package Versions ===")
print(f"requests: {requests.__version__}")
try:
    import bs4
    print(f"beautifulsoup4: {bs4.__version__}")
except AttributeError:
    print("beautifulsoup4: Version not available")
print(f"pandas: {pd.__version__}")
print(f"Python: {sys.version}")
print("\nLibraries imported successfully!")

=== Package Versions ===
requests: 2.32.5
beautifulsoup4: 4.14.2
pandas: 2.3.3
Python: 3.13.5 (tags/v3.13.5:6cb20a2, Jun 11 2025, 16:15:46) [MSC v.1943 64 bit (AMD64)]

Libraries imported successfully!


In [17]:
# Step 2: Robots.txt Compliance Check
# Demonstrating awareness of scraping rules and ethical guidelines

def check_robots_txt(base_url):
    """Check robots.txt for scraping permissions"""
    try:
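        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{base_url}/robots.txt")
        # Note: read() treats a 401/403 response for robots.txt itself as "disallow all",
        # so every can_fetch() call below then returns False
        rp.read()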
        
        print("=== Robots.txt Analysis ===")
        print(f"Checking robots.txt for: {base_url}")
        
        # Check if we can access the main page
        can_fetch = rp.can_fetch("*", "/")
        print(f"Can fetch main page: {can_fetch}")
        
        # Check specific paths we might need
        search_paths = ["/buy-used-hyundai-cars-mumbai/", "/buy-used-cars/"]
        for path in search_paths:
            can_access = rp.can_fetch("*", path)
            print(f"Can access {path}: {can_access}")
        
        # Get crawl delay if specified
        crawl_delay = rp.crawl_delay("*")
        if crawl_delay:
            print(f"Recommended crawl delay: {crawl_delay} seconds")
        else:
            print("No specific crawl delay specified")
            
        return rp
        
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        return None

# Check robots.txt for Cars24
base_url = "https://www.cars24.com"
robots_parser = check_robots_txt(base_url)


=== Robots.txt Analysis ===
Checking robots.txt for: https://www.cars24.com
Can fetch main page: False
Can access /buy-used-hyundai-cars-mumbai/: False
Can access /buy-used-cars/: False
No specific crawl delay specified
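
To inspect the rules themselves, robots.txt text that was fetched separately can be handed to `RobotFileParser.parse`. The sketch below runs on a made-up robots.txt string, so it makes no request. `parse_robots_text`, `SAMPLE_ROBOTS`, and the example.com URLs are hypothetical and do not reflect what Cars24 actually publishes.

```python
import urllib.robotparser

def parse_robots_text(robots_text):
    """Build a parser from robots.txt text that was fetched separately."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

sample_parser = parse_robots_text(SAMPLE_ROBOTS)
allowed_listing = sample_parser.can_fetch("*", "https://example.com/buy-used-cars/")
allowed_account = sample_parser.can_fetch("*", "https://example.com/account/orders")
delay = sample_parser.crawl_delay("*")

print(f"Listing page allowed: {allowed_listing}")
print(f"Account page allowed: {allowed_account}")
print(f"Crawl delay: {delay}")
```

In the notebook, the text could come from a normal request for `{base_url}/robots.txt`, assuming that request succeeds. Reading the returned text alongside the parser's answers makes it easier to tell real rules apart from a failed fetch.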


In [18]:
# Step 3: Basic HTTP Connectivity Test
# Simple connectivity check without detailed exception handling

# Create a session with proper headers
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})

# Make request
response = session.get("https://www.cars24.com/buy-used-hyundai-cars-mumbai/?sort=bestmatch&serveWarrantyCount=true&listingSource=Homepage_Filters", timeout=10)

print("=== Website Connectivity Test ===")
print(f"✓ HTTP Status: {response.status_code}")
print(f"✓ Content Length: {len(response.content)} bytes")
print(f"✓ Content Type: {response.headers.get('content-type', 'Unknown')}")

# Basic page validation
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title')
if title:
    print(f"✓ Page Title: {title.get_text().strip()}")

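# Report success only when the status code indicates success (below 400)
if response.ok:
    print("✓ Successfully connected to Cars24 website")
else:
    print(f"✗ Request failed with HTTP status {response.status_code}")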


=== Website Connectivity Test ===
✓ HTTP Status: 200
✓ Content Length: 1240825 bytes
✓ Content Type: text/html; charset=utf-8
✓ Page Title: 437 Hyundai Used Cars in Mumbai | Second Hand Hyundai Cars in Mumbai starting from ₹0.89 lakh - CARS24
✓ Successfully connected to Cars24 website
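
The values printed above (status, content type, content length) can be turned into an explicit pass/fail check. This is a minimal sketch, not part of the original notebook. `PageCheck`, `validate_response`, and the 1024-byte threshold are assumptions chosen for illustration. The function takes plain values rather than a response object so it can be tried without sending a request.

```python
from dataclasses import dataclass

@dataclass
class PageCheck:
    ok: bool
    reason: str

def validate_response(status_code, content_type, body, min_bytes=1024):
    """Describe whether a response looks like a usable HTML page."""
    if status_code != 200:
        return PageCheck(False, f"unexpected HTTP status {status_code}")
    if not content_type.lower().startswith("text/html"):
        return PageCheck(False, f"unexpected content type {content_type!r}")
    if len(body) < min_bytes:  # min_bytes is an arbitrary, assumed threshold
        return PageCheck(False, f"body too small ({len(body)} bytes)")
    return PageCheck(True, "looks like a full HTML page")

# Locally built values shaped like the output above; no request is sent
good = validate_response(200, "text/html; charset=utf-8", b"<html>" + b"x" * 2000)
blocked = validate_response(403, "text/html", b"Forbidden")
print(good)
print(blocked)
```

With the `response` object from the cell above, the call would be `validate_response(response.status_code, response.headers.get('content-type', ''), response.content)`.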


In [19]:
# Step 4: Project Structure Setup

project_dir = "cars24_hyundai_mumbai"    # Name of the project folder
if not os.path.exists(project_dir):      # Create the folder only if it is missing
    os.makedirs(project_dir)
    print(f"Project directory '{project_dir}' created successfully.")
else:
    print(f"Project directory '{project_dir}' already exists.")

Project directory 'cars24_hyundai_mumbai' already exists.
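
If the project later needs subfolders, the same idea extends naturally with `pathlib`. The sketch below is an assumption about how that might look, not the notebook's actual layout. `create_project_tree` and the subfolder names `data/raw` and `data/processed` are hypothetical. It runs inside a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path

def create_project_tree(root, subfolders):
    """Create root and each subfolder; return the requested folders that were missing."""
    created = []
    for folder in [Path(root), *(Path(root) / name for name in subfolders)]:
        if not folder.exists():
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created

# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cars24_hyundai_mumbai"
    first_run = create_project_tree(root, ["data/raw", "data/processed"])
    second_run = create_project_tree(root, ["data/raw", "data/processed"])
    print(f"Created on first run: {len(first_run)}")
    print(f"Created on second run: {len(second_run)}")
```

Running the function twice creates nothing the second time, which keeps the cell safe to re-execute in a notebook.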


In [20]:
# Step 5: Exception Handling 
# To be completed by the next team members


In [21]:
# Step 6: Data Cleaning
# To be completed by the next team members
