# Web Scraping Bus Routes for Data Analysis

In this notebook, we scrape bus data from https://www.redbus.com/ using saved HTML files.  
We extract **route information** (source and destination) and **all bus listings**, then save them to a CSV file for analysis.


In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import re
import os


## Collect HTML Files

Here, we are listing all the HTML files stored in the `Data` folder.  
This will allow us to process each file one by one for scraping the required data.


In [2]:
html_folder = "Data"
html_files = [os.path.join(html_folder, f) for f in os.listdir(html_folder) if f.endswith(".html")]
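
`os.listdir` returns entries in arbitrary order, so the order in which files are processed can vary between machines. As a minimal sketch (the helper name `list_html_files` is hypothetical and not part of the notebook), the same listing can be done with `pathlib` and sorted for reproducible runs:

```python
from pathlib import Path


def list_html_files(folder):
    """Return the .html files directly inside `folder`, sorted by path."""
    # Hypothetical helper: glob("*.html") only looks at the top level,
    # just like the os.listdir-based comprehension.
    return sorted(str(p) for p in Path(folder).glob("*.html"))
```

Calling `list_html_files("Data")` would then stand in for the list comprehension above.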


## Convert Duration to Minutes

The `duration_to_minutes` function converts a duration string (such as `"2h 30m"`, `"7h"`, or `"45m"`) into total minutes.  
This makes duration values easier to work with numerically in our analysis.


In [3]:
def duration_to_minutes(dur_str):
    h, m = 0, 0
    if 'h' in dur_str:
        parts = dur_str.split('h')
        h = int(parts[0].strip())
        if 'm' in parts[1]:
            m = int(parts[1].replace('m','').strip())
    elif 'm' in dur_str:
        m = int(dur_str.replace('m','').strip())
    return h * 60 + m
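
As a quick check on the formats that appear in the scraped data, here is a regex-based sketch of the same conversion. The name `parse_duration` is hypothetical; unlike the function above, it returns 0 for empty or unrecognized strings instead of raising an error.

```python
import re

# Hypothetical variant of duration_to_minutes: optional hours, optional minutes.
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def parse_duration(dur_str):
    """Convert strings like '6h 55m', '7h', or '45m' to total minutes."""
    match = _DURATION_RE.match(dur_str or "")
    if not match:
        return 0  # unrecognized format
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


print(parse_duration("6h 55m"))  # 415
print(parse_duration("7h"))      # 420
print(parse_duration("45m"))     # 45
```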


## Extract Bus Data from HTML Files
In this step, we loop through all the saved HTML files and extract bus information:  

1. Open each HTML file and parse it using BeautifulSoup.  
2. Extract the **source** and **destination** of the route.  
3. For each bus listed on the page, scrape details like:
   - `Bus_ID`, `Departure`, `Arrival`, `Duration`, `Duration_Minutes`  
   - `Seats`, `Single_Seats`, `Price`, `Onwards`  
   - `Operator`, `Bus_Type`  
   - `Rating`, `Rating_Count`, `Live_Tracking`  
   - `Source`, `Destination` (copied from the route extracted in step 2)  
4. Store all this information as a dictionary and append it to a list `all_data`.  

This gives us a structured collection of all buses across multiple HTML files.
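
The route extraction in step 2 can be tried in isolation. The sketch below wraps the same two regular expressions in a hypothetical helper, `parse_route`; the sample header string is invented for illustration, and the text on the real page may differ.

```python
import re


def parse_route(route_text):
    """Return (source, destination) from a page-header string."""
    # Drop a leading date fragment made of two 3-letter words, e.g. "Jan Mon ".
    text = re.sub(r"^[A-Za-z]{3}\s+[A-Za-z]{3}\s+", "", route_text)
    match = re.search(r"([A-Za-z\s]+?)\s+to\s+([A-Za-z\s]+)", text, re.IGNORECASE)
    if not match:
        return "Unknown", "Unknown"
    source, destination = match.groups()
    # Remove the trailing "Bus..." label that follows the destination.
    return source.strip(), re.sub(r"Bus.*", "", destination).strip()


# Hypothetical header text, not copied from a real page.
print(parse_route("Jan Mon Bangalore to Chennai Bus"))  # ('Bangalore', 'Chennai')
```

When no ` to ` separator is found, the helper falls back to `("Unknown", "Unknown")`, mirroring the `else` branch in the scraping loop.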


In [4]:
all_data = []

for html_file in html_files:
    print(f"\nProcessing: {html_file}")
    if not os.path.exists(html_file):
        print(f"File not found: {html_file}")
        continue

    with open(html_file, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "lxml")

    # Extract source & destination
    route_div = soup.find("div", class_="searchTopSection__ind-search-styles-module-scss-klY3y")
    route_text = route_div.get_text(separator=" ", strip=True) if route_div else ""

    # Remove date prefixes (e.g., Jan Mon) and trailing Bus
    route_text = re.sub(r'^[A-Za-z]{3}\s+[A-Za-z]{3}\s+', '', route_text)
    match = re.search(r'([A-Za-z\s]+?)\s+to\s+([A-Za-z\s]+)', route_text, re.IGNORECASE)
    if match:
        source, destination = match.groups()
        source = source.strip()
        destination = re.sub(r'Bus.*', '', destination).strip()
    else:
        source, destination = "Unknown", "Unknown"

    # Scrape bus entries
    bus_list = soup.find_all("li", class_="tupleWrapper___da903c")
    print(f"Found {len(bus_list)} buses in {html_file}")

    for li in bus_list:
        bus = {}
        bus["Bus_ID"] = li.get("id", "")

        # Trip times
        dep = li.find("p", class_='boardingTime___a78ae0')
        arr = li.find("p", class_='droppingTime___c814da')
        dur = li.find("p", class_='duration___b3a515')
        bus["Departure"] = dep.get_text(strip=True) if dep else ""
        bus["Arrival"] = arr.get_text(strip=True) if arr else ""
        bus["Duration"] = dur.get_text(strip=True) if dur else ""
        bus["Duration_Minutes"] = duration_to_minutes(bus["Duration"]) if bus["Duration"] else 0

        # Seats
        seats = li.find("p", class_='totalSeats___53250b')
        singles = li.find("p", class_='singleSeats___1cb9f1')
        # `or 0` guards against text with no digits, where int("") would raise ValueError
        bus["Seats"] = int(re.sub(r"\D", "", seats.get_text()) or 0) if seats else 0
        bus["Single_Seats"] = int(re.sub(r"\D", "", singles.get_text()) or 0) if singles else 0

        # Price
        price = li.find("p", class_='finalFare___057afc')
        bus["Price"] = int(re.sub(r"[^\d]", "", price.get_text()) or 0) if price else 0

        # Misc
        onwards = li.find("p", class_='postFareText___096de6')
        bus["Onwards"] = onwards.get_text(strip=True) if onwards else ""

        op = li.find("div", class_='travelsName___950ec8')
        bus["Operator"] = op.get_text(strip=True) if op else ""

        btype = li.find("p", class_='busType___675d0a')
        bus["Bus_Type"] = btype.get_text(strip=True) if btype else ""

        rating = li.find("div", class_='rating___24df2f')
        bus["Rating"] = float(rating.get_text(strip=True)) if rating and rating.get_text(strip=True) else 0.0

        rc = li.find("div", class_='ratingCount___618e68')
        bus["Rating_Count"] = int(re.sub(r"\D", "", rc.get_text()) or 0) if rc else 0

        live = li.find("div", class_='liveTracking___b387f3')
        bus["Live_Tracking"] = "Yes" if live else "No"

        # Route info
        bus["Source"] = source
        bus["Destination"] = destination

        all_data.append(bus)
    


Processing: Data\Bangalore_to_Chennai.html
Found 20 buses in Data\Bangalore_to_Chennai.html

Processing: Data\Bangalore_to_Hyderabad.html
Found 49 buses in Data\Bangalore_to_Hyderabad.html

Processing: Data\Chennai_to_Bangalore.html
Found 28 buses in Data\Chennai_to_Bangalore.html

Processing: Data\Chennai_to_Hyderabad.html
Found 12 buses in Data\Chennai_to_Hyderabad.html

Processing: Data\Hyderabad_to_Bangalore.html
Found 62 buses in Data\Hyderabad_to_Bangalore.html

Processing: Data\Hyderabad_to_Chennai.html
Found 7 buses in Data\Hyderabad_to_Chennai.html
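
The seat, price, and rating-count fields all follow one pattern: strip every non-digit character, then convert to `int`. Because `int("")` raises `ValueError`, the digit string needs a fallback for elements that contain no digits. A minimal sketch of that pattern as a reusable helper follows; the name `to_int` and the sample strings are assumptions for illustration, not text taken from the saved pages.

```python
import re


def to_int(text, default=0):
    """Extract the digits from `text` as an int, or return `default`."""
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else default


# Hypothetical cell texts; the exact wording on the page is an assumption.
print(to_int("42 Seats"))  # 42
print(to_int("₹1,250"))    # 1250
print(to_int("New"))       # 0
```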


## Clean and Prepare the DataFrame
After collecting all bus data, we perform some cleaning and preparation:  

1. **Clean Route Names**:  
   - Keep only the last word of `Source` (dropping any leftover prefix text) and remove a trailing `Bus...` suffix from `Destination`.  

2. **Remove Duplicates**:  
   - Keep only the first row for each `Bus_ID`.  

3. **Process the `Onwards` Column**:  
   - Strip whitespace, convert the text to title case, and replace empty values with `'Unknown'`.  
   - Convert to a categorical type for easier analysis in ML models or Power BI.  



In [7]:
df = pd.DataFrame(all_data)

if not df.empty:
    # Ensure Source & Destination exist before cleaning
    if "Source" in df.columns:
        df["Source"] = df["Source"].str.replace(r"^.*?([A-Za-z]+)$", r"\1", regex=True).str.strip()
    if "Destination" in df.columns:
        df["Destination"] = df["Destination"].str.replace(r"Bus.*", "", regex=True).str.strip()

    # Remove duplicates
    df = df.drop_duplicates(subset="Bus_ID", keep="first")

    if "Onwards" in df.columns:
        # Strip whitespace and convert to title case
        df["Onwards"] = df["Onwards"].astype(str).str.strip().str.title()
        # Replace empty strings or single spaces with 'Unknown'
        df.loc[df["Onwards"] == "", "Onwards"] = "Unknown"
        # Convert to categorical type for ML and Power BI
        df["Onwards"] = df["Onwards"].astype("category")
else:
    print("No bus data found in the HTML files.")

df.head(10)

|   | Bus_ID | Departure | Arrival | Duration | Duration_Minutes | Seats | Single_Seats | Price | Onwards | Operator | Bus_Type | Rating | Rating_Count | Live_Tracking | Source | Destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26409654 | 23:15 | 06:10 | 6h 55m | 415 | 42 | 10 | 1250 | Onwards | Jayavin Travels | A/C Seater / Sleeper (2+1) | 4.4 | 575 | Yes | Bangalore | Chennai |
| 1 | 44319378 | 21:55 | 04:30 | 6h 35m | 395 | 36 | 12 | 585 | Onwards | HYBUS | Bharat Benz A/C Sleeper (2+1) | 4.5 | 245 | Yes | Bangalore | Chennai |
| 2 | 36258160 | 21:35 | 05:00 | 7h 25m | 445 | 24 | 8 | 1600 | Onwards | PADMAVATHI TRAVELS | A/C Sleeper (2+1) | 4.3 | 797 | Yes | Bangalore | Chennai |
| 3 | 37971794 | 22:45 | 05:45 | 7h | 420 | 25 | 6 | 1400 | Onwards | Krish Travels | Bharat Benz A/C Seater /Sleeper (2+1) | 4.2 | 1377 | Yes | Bangalore | Chennai |
| 4 | 44319377 | 21:40 | 04:55 | 7h 15m | 435 | 36 | 12 | 585 | Onwards | HYBUS | Bharat Benz A/C Sleeper (2+1) | 4.5 | 166 | Yes | Bangalore | Chennai |
| 5 | rdBoost_44324353 | 11:05 | 17:45 | 6h 40m | 400 | 24 | 8 | 899 | Unknown | VEE VEE BUS | A/C Sleeper (2+1) | 4.4 | 36 | No | Bangalore | Chennai |
| 6 | 32913496 | 23:15 | 05:15 | 6h | 360 | 25 | 6 | 1400 | Onwards | Krish Travels | Bharat Benz A/C Seater /Sleeper (2+1) | 4.2 | 932 | Yes | Bangalore | Chennai |
| 7 | 44324353 | 11:05 | 17:45 | 6h 40m | 400 | 24 | 8 | 899 | Unknown | VEE VEE BUS | A/C Sleeper (2+1) | 4.4 | 36 | No | Bangalore | Chennai |
| 8 | 18582298 | 23:00 | 06:00 | 7h | 420 | 36 | 12 | 1620 | Onwards | Dream Line Travels Pvt Ltd | VE A/C Sleeper (2+1) | 4.1 | 888 | Yes | Bangalore | Chennai |
| 9 | 9698430 | 22:30 | 04:35 | 6h 5m | 365 | 35 | 5 | 1220 | Onwards | YBM Travels(BLM) | A/C Sleeper (2+1) | 4.1 | 600 | Yes | Bangalore | Chennai |
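
In the preview, rows 5 and 7 (`rdBoost_44324353` and `44324353`) have identical values in every other column, so de-duplicating on the raw `Bus_ID` keeps both. Assuming the `rdBoost_` prefix only marks a promoted copy of the same listing (an assumption, not something the site confirms), the IDs could be normalized before de-duplication. The helpers `normalize_bus_id` and `dedupe_by_bus_id` below are hypothetical and use plain Python rather than pandas:

```python
def normalize_bus_id(bus_id, prefix="rdBoost_"):
    """Strip the assumed promotional prefix so both copies share one ID."""
    return bus_id[len(prefix):] if bus_id.startswith(prefix) else bus_id


def dedupe_by_bus_id(rows):
    """Keep the first row seen for each normalized Bus_ID."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize_bus_id(row["Bus_ID"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"Bus_ID": "rdBoost_44324353", "Operator": "VEE VEE BUS"},
    {"Bus_ID": "32913496", "Operator": "Krish Travels"},
    {"Bus_ID": "44324353", "Operator": "VEE VEE BUS"},
]
print([r["Bus_ID"] for r in dedupe_by_bus_id(rows)])
# ['rdBoost_44324353', '32913496']
```

With the DataFrame, the same idea would be to map `normalize_bus_id` over `df["Bus_ID"]` before calling `drop_duplicates`. Doing so would change the row count, so the totals printed by this notebook reflect the original logic.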


## Save Cleaned Data to CSV
- Save the cleaned and prepared bus data into a CSV file named `All_Routes_Buses_Final.csv` inside the `Data` folder.  


In [6]:
final_csv = os.path.join(html_folder, "All_Routes_Buses_Final.csv")
os.makedirs(os.path.dirname(final_csv), exist_ok=True)
df.to_csv(final_csv, index=False)

print("\nScraping & cleaning completed")
print(f"Total unique buses scraped: {len(df)}")
print(f"Final CSV saved at: {final_csv}")


Scraping & cleaning completed
Total unique buses scraped: 178
Final CSV saved at: Data\All_Routes_Buses_Final.csv
