### WEB SCRAPING ###

***Importing required modules***
- 1 . webdriver: starts a browser session, for example with webdriver.Chrome() (Firefox, Edge, and other browser drivers work the same way).
- 2 . By: names the locator strategy (CSS selector, class name, XPath, etc.) used to find HTML elements on the webpage.
- 3 . Service: configures and manages the browser driver process.
- 4 . WebDriverWait: waits until a condition is met or a timeout expires before proceeding, which is needed when content is loaded dynamically.
- 5 . expected_conditions: ready-made conditions to pass to WebDriverWait, such as "all matching elements are present".
- 6 . time: time.sleep() adds fixed pauses so that pages have time to load and the site is not hit with requests too quickly.

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
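
The difference between a fixed time.sleep() and WebDriverWait is that the latter polls a condition repeatedly and returns as soon as the condition yields a truthy value, raising an error only if the timeout expires first. The browser-free sketch below illustrates that polling idea only; wait_until, ready, and attempts are hypothetical names, and the built-in TimeoutError stands in for Selenium's own timeout exception.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)

# Stand-in condition (an assumption for illustration): truthy on the third poll
attempts = []

def ready():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else None

print(wait_until(ready, timeout=2.0, poll_interval=0.01))  # loaded
```

With a wait of this kind the script pauses only as long as the page actually needs, whereas time.sleep(5) always costs five seconds.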

***Extract bus details from the webpages below***
1. https://www.redbus.in/online-booking/apsrtc/?utm_source=rtchometile
2. https://www.redbus.in/online-booking/ksrtc-kerala/?utm_source=rtchometile
3. https://www.redbus.in/online-booking/tsrtc/?utm_source=rtchometile
4. https://www.redbus.in/online-booking/ktcl/?utm_source=rtchometile
5. https://www.redbus.in/online-booking/rsrtc/?utm_source=rtchometile
6. https://www.redbus.in/online-booking/south-bengal-state-transport-corporation-sbstc/?utm_source=rtchometile
7. https://www.redbus.in/online-booking/hrtc/?utm_source=rtchometile
8. https://www.redbus.in/online-booking/uttar-pradesh-state-road-transport-corporation-upsrtc/?utm_source=rtchometile
9. https://www.redbus.in/online-booking/wbtc-ctc/?utm_source=rtchometile
10. https://www.redbus.in/online-booking/chandigarh-transport-undertaking-ctu
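
The scraping cell further down stores links under a per-state key ('apsrtc'). One possible way to derive such a key for each of the URLs above is to take the last path segment of the URL. This is a sketch, not part of the original notebook: state_key, state_urls, and links_per_state are hypothetical names, and only three of the ten URLs are listed to keep it short.

```python
from urllib.parse import urlparse

def state_key(url):
    """Return the last non-empty path segment of a state page URL."""
    path = urlparse(url).path  # the ?utm_source=... query string is not part of the path
    return path.rstrip("/").split("/")[-1]

state_urls = [
    "https://www.redbus.in/online-booking/apsrtc/?utm_source=rtchometile",
    "https://www.redbus.in/online-booking/ksrtc-kerala/?utm_source=rtchometile",
    "https://www.redbus.in/online-booking/chandigarh-transport-undertaking-ctu",
]

# One empty list per state, ready to receive that state's route links
links_per_state = {state_key(url): [] for url in state_urls}
print(list(links_per_state))
# ['apsrtc', 'ksrtc-kerala', 'chandigarh-transport-undertaking-ctu']
```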

***Extracting the bus route links from every page of the selected state link***
- 1 . Initialise the WebDriver and navigate to the state page.
- 2 . Pagination: WebDriverWait waits up to 10 seconds for the pagination tabs to be present, then the loop visits each tab in turn.
- 3 . Scrolling and clicking: JavaScript scrolls each pagination tab into view and clicks it, so that the routes for that page are loaded.
- 4 . Data extraction: on each page, the href attribute of the link inside every element with class route_link is collected.
- 5 . Error handling: a try-except around each page prints the error and stops the loop if navigation or extraction fails.
- 6 . WebDriver cleanup: driver.quit() closes the browser and releases its resources after scraping.

In [6]:
# Start the WebDriver
driver = webdriver.Chrome()
# Open the state page
driver.get("https://www.redbus.in/online-booking/apsrtc/?utm_source=rtchometile")
extracted_links_per_state = {'apsrtc':[]}
# Wait up to 10 seconds for the pagination tabs to be present
pagination_elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".DC_117_paginationTable .DC_117_pageTabs"))
)
# Visit each pagination tab in turn
for page_number, page_element in enumerate(pagination_elements):
    try:
        # Scroll the pagination tab into view
        driver.execute_script("arguments[0].scrollIntoView(true);", page_element)
        time.sleep(1)  # Short pause so the scroll can finish
        # Click the tab through JavaScript
        driver.execute_script("arguments[0].click();", page_element)
        # Wait for the page to load
        time.sleep(5)  # Adjust to the page's loading speed

        # Extract the route links from the current page
        bus_links = driver.find_elements(By.CLASS_NAME, 'route_link')
        for link in bus_links:
            href = link.find_element(By.TAG_NAME, 'a').get_attribute('href')
            extracted_links_per_state['apsrtc'].append(href)

    except Exception as e:
        print(f"Error navigating to page {page_number + 1}: {e}")
        break
# Close the WebDriver once all pages have been visited
driver.quit()


In [7]:
extracted_links_per_state

{'apsrtc': ['https://www.redbus.in/bus-tickets/vijayawada-to-hyderabad',
  'https://www.redbus.in/bus-tickets/hyderabad-to-vijayawada',
  'https://www.redbus.in/bus-tickets/kakinada-to-visakhapatnam',
  'https://www.redbus.in/bus-tickets/visakhapatnam-to-kakinada',
  'https://www.redbus.in/bus-tickets/chittoor-andhra-pradesh-to-bangalore',
  'https://www.redbus.in/bus-tickets/kadapa-to-bangalore',
  'https://www.redbus.in/bus-tickets/ananthapur-to-bangalore',
  'https://www.redbus.in/bus-tickets/tirupathi-to-bangalore',
  'https://www.redbus.in/bus-tickets/visakhapatnam-to-vijayawada',
  'https://www.redbus.in/bus-tickets/ongole-to-hyderabad',
  'https://www.redbus.in/bus-tickets/bangalore-to-tirupathi',
  'https://www.redbus.in/bus-tickets/macherla-to-hyderabad',
  'https://www.redbus.in/bus-tickets/rajahmundry-to-visakhapatnam',
  'https://www.redbus.in/bus-tickets/nandyala-to-hyderabad',
  'https://www.redbus.in/bus-tickets/bangalore-to-kadapa',
  'https://www.redbus.in/bus-tickets/

***Extracting bus details from each of the route links collected above***
- 1 . final_output = []: this list stores one dictionary (res) of scraped data per route.
- 2 . The code iterates through extracted_links_per_state.items(), where each item maps a state key to the route links extracted for that state.
- 3 . Web-scraping loop: for each route URL (v), the script starts a Chrome WebDriver (driver), opens the URL, and waits 20 seconds for the page to load.
- 4 . Scrolling: a while loop sends PAGE_DOWN to the page to trigger loading of more buses. It is capped by max_scroll_attempts (10) and max_time (300 seconds) so that it cannot run indefinitely.
- 5 . Extraction with XPath: bus type, duration, departing time, and the other fields are located with driver.find_elements and XPath queries on the elements' class attributes.
- 6 . Appending extracted data: on every pass, the text of all matching elements is appended to the corresponding list (bus_type, duration, etc.). Elements still on the page from an earlier pass are appended again, so the lists contain repeated entries.
- 7 . Stop condition: the loop breaks when current_bus_count equals previous_bus_count. Because current_bus_count is the cumulative length of bus_name, this happens only once a pass adds no bus names; otherwise the loop runs until max_scroll_attempts or max_time is reached.
- 8 . Storing results: after the loop, a dictionary (res) is built from the lists of scraped attributes (bus_name, duration, etc.), plus bus_route_name and bus_route_link derived from the route URL (v).
- 9 . Appending to final_output: each res dictionary is appended to final_output, and the WebDriver is then closed with driver.quit().
- 10 . final_output is therefore a list of dictionaries, one per route, each holding lists of attributes such as bus name, duration, and price.

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Assuming extracted_links_per_state is defined earlier in your script
final_output = []

for key, value in extracted_links_per_state.items():
    for v in value:
        bus_type = []
        duration = []
        departing_time = []
        bus_name = []
        reaching_time = []
        price = []
        star_rating = []
        seat_availability = []

        route = v.split('/')[-1]
        print("Route:", route)
        driver = webdriver.Chrome()
        driver.get(v)
        time.sleep(20)  # Initial wait for the page to load

        previous_bus_count = 0
        current_bus_count = -1  # Initialize with a different value to enter the loop
        max_scroll_attempts = 10  # To avoid infinite loop in case of errors
        scroll_attempts = 0
        max_time = 300  # Maximum time in seconds to avoid infinite loops
        start_time = time.time()

        while scroll_attempts < max_scroll_attempts and (time.time() - start_time) < max_time:
            previous_bus_count = current_bus_count

            # Extract bus details
            bus_types = driver.find_elements(By.XPATH, '//div[@class="bus-type f-12 m-top-16 l-color evBus"]')
            durations = driver.find_elements(By.XPATH, '//div[@class="dur l-color lh-24"]')
            departing_times = driver.find_elements(By.XPATH, '//div[@class="column-three p-right-10 w-10 fl"]')
            bus_names = driver.find_elements(By.XPATH, '//div[@class="travels lh-24 f-bold d-color"]')
            reaching_times = driver.find_elements(By.XPATH, '//div[@class="column-four p-right-10 w-10 fl"]')
            prices = driver.find_elements(By.XPATH, '//div[@class="fare d-block"]')
            star_ratings = driver.find_elements(By.XPATH, '//div[@class="rating-sec lh-24"]')
            seat_availabilities = driver.find_elements(By.XPATH, '//div[@class="column-eight w-15 fl"]')

           # print(f"Bus Types Found: {len(bus_types)}")
           # print(f"Durations Found: {len(durations)}")
           # print(f"Departing Times Found: {len(departing_times)}")
           # print(f"Bus Names Found: {len(bus_names)}")
           # print(f"Reaching Times Found: {len(reaching_times)}")
           # print(f"Prices Found: {len(prices)}")
           # print(f"Star Ratings Found: {len(star_ratings)}")
           # print(f"Seat Availabilities Found: {len(seat_availabilities)}")

            # Append extracted data to lists
            bus_type += [element.text.strip() for element in bus_types]
            duration += [i.text.strip() for i in durations]
            departing_time += [j.text.strip() for j in departing_times]
            bus_name += [k.text.strip() for k in bus_names]
            reaching_time += [a.text.strip() for a in reaching_times]
            price += [b.text.strip() for b in prices]
            star_rating += [c.text.strip() for c in star_ratings]
            seat_availability += [d.text.strip() for d in seat_availabilities]

            current_bus_count = len(bus_name)  # Update current bus count
            #print(f"Scroll Attempt: {scroll_attempts}, Previous Bus Count: {previous_bus_count}, Current Bus Count: {current_bus_count}")

            # Scroll down
            driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
            time.sleep(3)  # Wait for new data to load

            if current_bus_count == previous_bus_count:
                break  # No new buses loaded, break the loop

            scroll_attempts += 1

        # Store the results
        res = {
            "bus_name": bus_name,
            "bus_route_name": [route] * len(bus_name),
            "bus_route_link": [v] * len(bus_name),
            "bus_type": bus_type,
            "duration": duration,
            "departing_time": departing_time,
            "reaching_time": reaching_time,
            "price": price,
            "star_rating": star_rating,
            "seat_availability": seat_availability
        }
        print("Results:", res)
        final_output.append(res)
        driver.quit()


Route: vijayawada-to-hyderabad
Results: {'bus_name': ['IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', '', '', 'IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', 'AdSri Sanvi Tours and Travels', 'FRESHBUS', 'IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', 'AdSri Sanvi Tours and Travels', 'FRESHBUS', 'FRESHBUS', 'FRESHBUS', 'AdSai RK Travels', 'FRESHBUS', 'FRESHBUS', 'IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', 'AdSri Sanvi Tours and Travels', 'FRESHBUS', 'FRESHBUS', 'FRESHBUS', 'AdSai RK Travels', 'FRESHBUS', 'FRESHBUS', 'IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', 'AdSri Sanvi Tours and Travels', 'FRESHBUS', 'FRESHBUS', 'FRESHBUS', 'AdSai RK Travels', 'FRESHBUS', 'FRESHBUS', 'IntrCity SmartBus', 'IntrCity SmartBus', 'FRESHBUS', 'AdSri Sanvi Tours and Travels', 'FRESHBUS', 'FRESHBUS', 'FRESHBUS', 'AdSai RK Travels', 'FRESHBUS', 'FRESHBUS', 'IntrCity SmartBus', 'GMRK Tours and Travels', 'FRESHBUS', 'IntrCity SmartBus', 'FRESHBUS', 'IntrCity SmartBus', 'Int
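
The output above repeats the same buses several times, because every scroll pass re-reads the elements that are already on the page. One possible remedy, sketched here without a browser, is to keep a set of rows already seen and to use the number of newly added rows as the stop condition. This is an assumption-laden illustration, not the notebook's method: add_new_rows, collected, seen, and passes are hypothetical names, each row is assumed to be available as a (bus_name, departing_time, price) tuple, and the three sample rows are taken from the output tables in this notebook.

```python
def add_new_rows(collected, seen, rows):
    """Append rows not seen before to `collected`; return how many were new."""
    new_count = 0
    for row in rows:
        if row not in seen:
            seen.add(row)
            collected.append(row)
            new_count += 1
    return new_count

collected, seen = [], set()

# Stand-in for three scroll passes over the same page (hypothetical grouping)
passes = [
    [("IntrCity SmartBus", "23:45", "INR 513"), ("IntrCity SmartBus", "23:25", "INR 488")],
    [("IntrCity SmartBus", "23:45", "INR 513"), ("IntrCity SmartBus", "23:25", "INR 488"),
     ("FRESHBUS", "10:00", "450")],
    [("IntrCity SmartBus", "23:25", "INR 488"), ("FRESHBUS", "10:00", "450")],
]

for rows in passes:
    added = add_new_rows(collected, seen, rows)
    if added == 0:
        break  # the last scroll revealed nothing new

print(len(collected))  # 3
```

Two distinct buses whose fields are all identical would be merged by this approach, so the tuple should include enough fields to tell buses apart.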

In [9]:
final_output

[{'bus_name': ['IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   '',
   '',
   'IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   'AdSri Sanvi Tours and Travels',
   'FRESHBUS',
   'IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   'AdSri Sanvi Tours and Travels',
   'FRESHBUS',
   'FRESHBUS',
   'FRESHBUS',
   'AdSai RK Travels',
   'FRESHBUS',
   'FRESHBUS',
   'IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   'AdSri Sanvi Tours and Travels',
   'FRESHBUS',
   'FRESHBUS',
   'FRESHBUS',
   'AdSai RK Travels',
   'FRESHBUS',
   'FRESHBUS',
   'IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   'AdSri Sanvi Tours and Travels',
   'FRESHBUS',
   'FRESHBUS',
   'FRESHBUS',
   'AdSai RK Travels',
   'FRESHBUS',
   'FRESHBUS',
   'IntrCity SmartBus',
   'IntrCity SmartBus',
   'FRESHBUS',
   'AdSri Sanvi Tours and Travels',
   'FRESHBUS',
   'FRESHBUS',
   'FRESHBUS',
   'AdSai RK Travels',
   'FRESHBUS',
   'FRESHBUS',
   'IntrC

***Importing pandas to convert the extracted data into a DataFrame***

In [13]:
import pandas as pd
df = pd.DataFrame(final_output)

In [14]:
dfs = []

# Convert each dictionary to a DataFrame and store in the list
for i in range(len(final_output)):
    series_data = {key: pd.Series(value) for key, value in final_output[i].items()}
    df = pd.DataFrame(series_data)
    dfs.append(df)
# Concatenate all DataFrames in the list into a single DataFrame
final_df = pd.concat(dfs, ignore_index=True)


In [16]:
final_df

Unnamed: 0,bus_name,bus_route_name,bus_route_link,bus_type,duration,departing_time,reaching_time,price,star_rating,seat_availability
0,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),06h 15m,23:45\nBenz Circle,06h 15m,INR 513,4.6,42 Seats available\n9 Single
1,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),07h 00m,23:25\nBenz Circle,07h 00m,INR 488,4.6,42 Seats available\n7 Single
2,FRESHBUS,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,Electric A/C Seater (2+2),07h 15m,10:00\nRtc bus stand,07h 15m,450,4.7,35 Seats available\n17 Window
3,,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,,,,,,,
4,,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
5770,Orange Tours And Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,A/C Seater / Sleeper (2+1),06h 15m,22:45\nAnanthapur By pass,06h 15m,INR 800,,28 Seats available\n1 Single
5771,Sri Balaji Transports,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,A/C Sleeper (2+1),06h 17m,23:10\nThapovanam Circle,06h 17m,INR 1090,,11 Seats available\n2 Single
5772,SRS Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Scania Multi-Axle AC Semi Sleeper (2+2),05h 50m,15:45\nSapthagiri Circle,05h 50m,INR 1000,,25 Seats available\n8 Window
5773,Orange Tours And Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Scania AC Multi Axle Sleeper (2+1),05h 30m,23:30\nAnanthapur By pass,05h 30m,INR 1280,,7 Seats available\n2 Single


In [17]:
final_df.shape

(5775, 10)

***Dropping rows with null values to make the data more meaningful***

In [18]:
final_df = final_df.replace('', pd.NA)

In [19]:
final_df = final_df.dropna(how='any')

In [20]:
final_df = final_df.map(lambda x: x.replace('\n', ' ') if isinstance(x, str) else x)
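
After this step the columns are still plain strings such as 'INR 513', '06h 15m', and '42 Seats available 9 Single'. If numeric values are needed later, helpers along the following lines could extract them. This is a sketch under the assumption that the strings follow the formats visible in the output tables; parse_price, parse_duration_minutes, and parse_seats are hypothetical names, not part of the original notebook.

```python
import re

def parse_price(text):
    """'INR 513' or '450' -> 513.0 / 450.0; None if no number is present."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

def parse_duration_minutes(text):
    """'06h 15m' -> 375; None if the text has a different shape."""
    match = re.fullmatch(r"\s*(\d+)h\s+(\d+)m\s*", text)
    if not match:
        return None
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes

def parse_seats(text):
    """'42 Seats available 9 Single' -> 42; None if the pattern is absent."""
    match = re.search(r"(\d+)\s+Seats? available", text)
    return int(match.group(1)) if match else None

print(parse_price("INR 513"), parse_price("450"))   # 513.0 450.0
print(parse_duration_minutes("06h 15m"))            # 375
print(parse_seats("42 Seats available 9 Single"))   # 42
```

Each helper could then be applied to its column, for example with final_df['price'].map(parse_price).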

***Saving the DataFrame as a CSV file on my local disk***

In [26]:
final_df.to_csv('andhrapradesh_redbus_project_redbus.csv', index=False)


In [25]:
final_df

Unnamed: 0,bus_name,bus_route_name,bus_route_link,bus_type,duration,departing_time,reaching_time,price,star_rating,seat_availability
0,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),06h 15m,23:45 Benz Circle,06h 15m,INR 513,4.6,42 Seats available 9 Single
1,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),07h 00m,23:25 Benz Circle,07h 00m,INR 488,4.6,42 Seats available 7 Single
2,FRESHBUS,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,Electric A/C Seater (2+2),07h 15m,10:00 Rtc bus stand,07h 15m,450,4.7,35 Seats available 17 Window
5,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),06h 15m,23:45 Benz Circle,06h 15m,INR 513,4.6,42 Seats available 9 Single
6,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),07h 00m,23:25 Benz Circle,07h 00m,INR 488,4.6,42 Seats available 7 Single
...,...,...,...,...,...,...,...,...,...,...
5761,Al madeena Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Non A/C Seater / Sleeper (2+1),08h 35m,21:45 Raju Road,08h 35m,INR 750,3.9,20 Seats available
5762,Al madeena Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Non A/C Seater / Sleeper (2+1),07h 15m,22:15 Raju Road,07h 15m,INR 750,3.9,21 Seats available 1 Single
5763,ORM Tours and Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,A/C Sleeper (2+1),07h 00m,22:30 Ananthapur,07h 00m,INR 1090,3.0,5 Seats available 3 Window
5764,Elegance Tours And Travels Pvt Ltd,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Bharat Benz A/C Sleeper (2+1),05h 45m,18:00 Anantpur,05h 45m,1899,4.2,25 Seats available


***The CSV file created above was reordered in Excel and then saved in Excel format in my local folder***

In [27]:
import pandas as pd
df = pd.read_excel(r"C:\Users\Green Gen Tech\Downloads\andhrapradesh_redbus_project_redbus.xls.xlsx")
df

Unnamed: 0,bus_name,bus_route_name,bus_route_link,bus_type,duration,departing_time,reaching_time,price,star_rating,seat_availability
0,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),06h 15m,23:45 Benz Circle,06h 15m,INR 513,4.6,42 Seats available 9 Single
1,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),07h 00m,23:25 Benz Circle,07h 00m,INR 488,4.6,42 Seats available 7 Single
2,FRESHBUS,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,Electric A/C Seater (2+2),07h 15m,10:00 Rtc bus stand,07h 15m,450,4.7,35 Seats available 17 Window
3,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),06h 15m,23:45 Benz Circle,06h 15m,INR 513,4.6,42 Seats available 9 Single
4,IntrCity SmartBus,vijayawada-to-hyderabad,https://www.redbus.in/bus-tickets/vijayawada-t...,A/C Seater / Sleeper (2+1),07h 00m,23:25 Benz Circle,07h 00m,INR 488,4.6,42 Seats available 7 Single
...,...,...,...,...,...,...,...,...,...,...
5604,Al madeena Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Non A/C Seater / Sleeper (2+1),08h 35m,21:45 Raju Road,08h 35m,INR 750,3.9,20 Seats available
5605,Al madeena Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Non A/C Seater / Sleeper (2+1),07h 15m,22:15 Raju Road,07h 15m,INR 750,3.9,21 Seats available 1 Single
5606,ORM Tours and Travels,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,A/C Sleeper (2+1),07h 00m,22:30 Ananthapur,07h 00m,INR 1090,3.0,5 Seats available 3 Window
5607,Elegance Tours And Travels Pvt Ltd,ananthapur-to-hyderabad,https://www.redbus.in/bus-tickets/ananthapur-t...,Bharat Benz A/C Sleeper (2+1),05h 45m,18:00 Anantpur,05h 45m,1899,4.2,25 Seats available
