# DATA WRANGLING HACKATHON

# WEB SCRAPING STEP

### Overview
This data dictionary describes High Volume FHV trip data. Each row represents a single trip in an FHV dispatched by one of NYC’s licensed High Volume FHV bases. On August 14, 2018, Mayor de Blasio signed Local Law 149 of 2018, creating a new license category for TLC-licensed FHV businesses that currently dispatch or plan to dispatch more than 10,000 FHV trips in New York City per day under a single brand, trade, or operating name, referred to as High-Volume For-Hire Services (HVFHS). This law went into effect on Feb 1, 2019.

### Objective
The main goal of this hackathon is to predict whether the client is going to give a tip.
Your submission file should be a CSV file with two columns (see the example in sample_submission.csv):
- ID: ID of the observation
- Tipped: whether the client tipped or not

A dataset spread over several data sources has been provided for you. There are many features, and it's up to you to use as many or as few as you want. Some features might be more relevant than others.
Keep in mind that this is a Data Wrangling specialization. 

### Datasets:
| **Dataset** | **Information**    | **Location**        |
|-------------|--------------------|---------------------|
| API         | Trip Mileage       | https://hckt02-api.lisbondatascience.org/docs#/default/get_data_data_get |
| Webpage     | Taxi Zone Data     | https://s02-infrastructure.s3.eu-west-1.amazonaws.com/hackathon-02-batch8/index.html |
| Files       | Detailed Trip Data | https://drive.google.com/drive/folders/12MhOAVrplggHVTm6-CtjqkkjI9xrVPek?usp=drive_link |
| Database    | Weather Data       | batch-s02.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com |



# Selenium WebDriver API
Selenium is a widely used automation tool primarily designed for testing web applications. It allows developers and testers to simulate browser actions, such as navigating to web pages, interacting with elements, and validating functionality, without manual intervention. 

Beyond testing, Selenium is also commonly employed for web scraping. Its ability to mimic human interactions with dynamic web pages, including those that rely heavily on JavaScript, makes it a powerful tool for extracting data from websites where traditional scraping methods may fall short. This versatility allows users to scrape data even from sites with complex structures or interactive features.

### API docs: 
https://www.selenium.dev/selenium/docs/api/py/api.html

In [15]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

## Code Modularization
### We'll modularize the code to improve maintainability and readability by wrapping each extraction operation in its own Python function.

## Selenium WebDriver initialization
### Initializes the Chrome driver.

In [16]:
def initialize_driver():
    return webdriver.Chrome()

# Webpage Extraction Step
### Opens the page and checks for title

In [17]:
def open_page(driver, url, title_check):
    driver.get(url)
    assert title_check in driver.title
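Because `assert` statements are skipped when Python runs with the `-O` flag, the title check can also be written as an explicit exception. The sketch below is one possible variant and not part of the original notebook: `check_title` and `PageTitleError` are hypothetical names, and the only assumption about the driver is that it exposes a `title` attribute, as a Selenium WebDriver does.

```python
class PageTitleError(RuntimeError):
    """Raised when the loaded page does not have the expected title (hypothetical helper)."""


def check_title(driver, title_check):
    # Works with any object that has a `title` attribute, e.g. a Selenium WebDriver.
    if title_check not in driver.title:
        raise PageTitleError(
            f"Expected {title_check!r} in page title, got {driver.title!r}"
        )
    return True
```

Inside `open_page`, a call such as `check_title(driver, title_check)` after `driver.get(url)` could stand in for the `assert` line.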

## Extract Table Header

In [18]:
def extract_table_header(driver):
    # Extracts the column names from the table's header cells
    headers = driver.find_elements(By.TAG_NAME, "th")
    col_names = [th.text for th in headers]
    # print(col_names)
    return col_names

## Extract Data

In [19]:
def extract_table_data(driver):
    """Extracts the data from the table."""
    elements = driver.find_elements(By.TAG_NAME, "td")
    data = []
    row = []
    for i, td in enumerate(elements):
        row.append(td.text)
        if (i + 1) % 4 == 0:  # The table has 4 columns, so every 4th cell completes the current row.
            data.append(row)
            row = []
    return data
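The row-building loop above hard-codes four columns. As a hedged alternative, the grouping can be driven by the number of header cells, so the logic keeps working if the table gains or loses a column. `chunk_cells` is a hypothetical helper, not part of the original notebook. It takes the flat list of cell texts and the column count, and raises an error if the cells do not divide evenly into rows.

```python
def chunk_cells(cells, n_cols):
    """Group a flat list of cell texts into rows of n_cols items each."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if len(cells) % n_cols != 0:
        raise ValueError(
            f"{len(cells)} cells cannot be split evenly into rows of {n_cols}"
        )
    # Slice the flat list in steps of n_cols; each slice is one table row.
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


# The first two rows of the taxi zone table, flattened as find_elements would return them.
cells = ["1", "EWR", "Newark Airport", "EWR",
         "2", "Queens", "Jamaica Bay", "Boro Zone"]
rows = chunk_cells(cells, 4)
```

In the notebook, this could be called as `chunk_cells([td.text for td in elements], len(col_names))`, assuming the header row and the data rows have the same number of cells.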

## Generating a Pandas DataFrame for the Webpage data
### We need the data as a DataFrame so that it can be merged with the other sources and used for analysis

In [20]:
import pandas as pd
import numpy as np

## Control Function - main()

In [21]:
def main():
    driver = initialize_driver()
    try:
        url = "https://s02-infrastructure.s3.eu-west-1.amazonaws.com/hackathon-02-batch8/index.html"
        open_page(driver, url, "Taxi Zone Data - Full")

        col_names = extract_table_header(driver)
        # print(f"Column Names: {col_names}")
        
        web_data = extract_table_data(driver)
        # print(f"Data: {web_data}")
        
        # generating a data frame with the collected data for later use
        df_webpage = pd.DataFrame(data=web_data, columns=col_names)

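        # Checks for page errors (this must run before the return, otherwise it is never reached)
        assert "No results found." not in driver.page_source

        # returning the scraped data
        return df_webpage
    finally:
        # quit() ends the whole browser session, not just the current window
        driver.quit()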

## Main Execution
### Column header and data will be extracted and added to a data frame for later use.

In [22]:
if __name__ == "__main__":
    df_webpage = main()

## Checking data frame content

In [23]:
df_webpage.head(2)

|   | LocationID | Borough | Zone           | service_zone |
|---|------------|---------|----------------|--------------|
| 0 | 1          | EWR     | Newark Airport | EWR          |
| 1 | 2          | Queens  | Jamaica Bay    | Boro Zone    |


In [17]:
import os

def filesize(filename):
    file_path = filename
    file_size = os.path.getsize(file_path) / (1024 * 1024)
    message = f"physical file size: {file_size:.2f} MB"
    return message
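A quick way to sanity-check `filesize` is to run it on a temporary file of known size. The sketch below repeats the function so that it is self-contained. The temporary file is purely illustrative and is not one of the hackathon files.

```python
import os
import tempfile


def filesize(filename):
    # Size in megabytes, using 1 MB = 1024 * 1024 bytes
    file_size = os.path.getsize(filename) / (1024 * 1024)
    return f"physical file size: {file_size:.2f} MB"


# Write exactly 2 MB of zero bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (2 * 1024 * 1024))
    tmp_path = tmp.name

message = filesize(tmp_path)
os.remove(tmp_path)
print(message)  # physical file size: 2.00 MB
```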

In [35]:
df_webpage.to_parquet(".data/webpage/bronze/webpage_data.parquet")
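The write above is expected to fail if the target folder does not exist yet, since `to_parquet` is not known to create missing directories. A minimal sketch of creating the folder first follows. `ensure_parent_dir` is a hypothetical helper, and the temporary base directory is used here only so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if it is missing and return the path."""
    path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes the call repeatable.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Illustrative base folder; in the notebook this would be the project's data folder.
base = Path(tempfile.mkdtemp())
target = ensure_parent_dir(base / "webpage" / "bronze" / "webpage_data.parquet")
# df_webpage.to_parquet(target) would follow here in the notebook.
```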

In [30]:
import pandas as pd
df = pd.read_parquet(".data/webpage/bronze/webpage_data.parquet")

In [36]:
df.head(1000)

|     | LocationID | Borough       | Zone                    | service_zone |
|-----|------------|---------------|-------------------------|--------------|
| 0   | 1          | EWR           | Newark Airport          | EWR          |
| 1   | 2          | Queens        | Jamaica Bay             | Boro Zone    |
| 2   | 3          | Bronx         | Allerton/Pelham Gardens | Boro Zone    |
| 3   | 4          | Manhattan     | Alphabet City           | Yellow Zone  |
| 4   | 5          | Staten Island | Arden Heights           | Boro Zone    |
| ... | ...        | ...           | ...                     | ...          |
| 260 | 261        | Manhattan     | World Trade Center      | Yellow Zone  |
| 261 | 262        | Manhattan     | Yorkville East          | Yellow Zone  |
| 262 | 263        | Manhattan     | Yorkville West          | Yellow Zone  |
| 263 | 264        | Unknown       | NV                      |              |


## FINISHED WEB SCRAPING COMPUTATION

**Conclusion:** Now we'll work on other data sources to bring everything together and create a single file for the ML model training.

# Appendix

### Other Web Scraping Options:

1. **BeautifulSoup**  
   **Description**: A Python library that simplifies the extraction of information from HTML and XML files.  
   **Use**: Ideal for simple scraping tasks where the content is static.  
   **Limitation**: Not suitable for handling dynamic pages generated by JavaScript.  

2. **Scrapy**  
   **Description**: A robust Python framework for web scraping and crawling.  
   **Use**: Allows the creation of spiders that traverse entire websites to extract structured data.  
   **Advantage**: Extremely efficient and well-suited for complex scraping projects.  
   **Limitation**: Does not include native support for interacting with JavaScript; it focuses more on static content.  

3. **Puppeteer**  
   **Description**: A Node.js library that controls the Google Chrome or Chromium browser via API.  
   **Use**: Ideal for scraping dynamic pages that rely on JavaScript.  
   **Advantage**: Offers full control over the browser with less overhead compared to Selenium.  
   **Limitation**: Written in JavaScript/Node.js, which can be a barrier for Python users.  

4. **Octoparse**  
   **Description**: A visual scraping tool that requires no programming knowledge.  
   **Use**: Designed for users who prefer a graphical interface to configure data extractions.  
   **Advantage**: Very intuitive, with support for both static and dynamic pages.  
   **Limitation**: May be less flexible than programmatic libraries.  

5. **ParseHub**  
   **Description**: Another visual scraping tool, focused on dynamic and interactive websites.  
   **Use**: Popular among users who need a quick, no-code solution.  
   **Advantage**: Easy to use and cloud-based.  
   **Limitation**: May be limited for more customized projects.  

6. **Apify**  
   **Description**: A cloud-based platform for web scraping and automation.  
   **Use**: Provides a user-friendly interface for creating crawlers and automating data extraction.  
   **Advantage**: Integration with multiple programming languages and support for JavaScript scripts.  
   **Limitation**: Higher costs for large data volumes.  
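To illustrate the static-content case mentioned for BeautifulSoup and Scrapy, the sketch below parses an HTML table without a browser. It uses Python's standard-library `html.parser` as a stand-in for those libraries, so it is not their API. `TableParser` is a hypothetical class, and `SAMPLE_HTML` is a hand-written snippet that assumes the table markup is present in the raw HTML, which was not verified for the hackathon page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects <th> texts as headers and <td> texts as rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # "th" or "td" while inside a cell, else None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a header or data cell.
        if self._cell is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            text = "".join(self._text).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr":
            # Header-only rows leave _row empty and are not stored as data.
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hand-written sample mirroring the first two rows of the taxi zone table.
SAMPLE_HTML = """
<table>
  <tr><th>LocationID</th><th>Borough</th><th>Zone</th><th>service_zone</th></tr>
  <tr><td>1</td><td>EWR</td><td>Newark Airport</td><td>EWR</td></tr>
  <tr><td>2</td><td>Queens</td><td>Jamaica Bay</td><td>Boro Zone</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
```

After `feed`, `parser.headers` and `parser.rows` hold the same kind of data that `extract_table_header` and `extract_table_data` return, so they could be passed to `pd.DataFrame` in the same way.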