# DATA WRANGLING HACKATHON

# API STEP

### Overview
This data dictionary describes High Volume FHV trip data. Each row represents a single trip in an FHV dispatched by one of NYC’s licensed High Volume FHV bases. On August 14, 2018, Mayor de Blasio signed Local Law 149 of 2018, creating a new license category for TLC-licensed FHV businesses that currently dispatch or plan to dispatch more than 10,000 FHV trips in New York City per day under a single brand, trade, or operating name, referred to as High-Volume For-Hire Services (HVFHS). This law went into effect on Feb 1, 2019.

### Objective
The main goal of this hackathon is to predict whether the client is going to give a tip.
Your submission file should be a CSV file with two columns (see the example in sample_submission.csv):
* ID: ID of the observation
* Tipped: whether the client tipped or not
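
The two-column layout described above can be sketched as follows. This is only an illustration: the `Tipped` values and their 0/1 encoding are placeholders (check sample_submission.csv for the expected encoding), and the output file name in the final comment is an assumption.

```python
import io

import pandas as pd

# Hypothetical predictions: the Tipped values and their 0/1 encoding are
# placeholders; check sample_submission.csv for the expected encoding.
predictions = pd.DataFrame(
    {
        "ID": [5010374, 6063883, 4941792],
        "Tipped": [1, 0, 1],
    }
)

# index=False keeps the output to exactly the two required columns
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
submission_csv = buffer.getvalue()
print(submission_csv)

# To write a real file instead (the file name is an assumption):
# predictions.to_csv("submission.csv", index=False)
```

Passing `index=False` matters here: without it, pandas writes the row index as an extra unnamed first column.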

A dataset spread over several data sources has been provided for you. There are plenty of features, and it’s up to you to use as many or as few as you want. That said, some features might be more relevant than others. 
Keep in mind that this is a Data Wrangling specialization. 

### Datasets:
| **Dataset** | **Information**    | **Location** |
|-------------|--------------------|--------------|
| API         | Trip Mileage       | https://hckt02-api.lisbondatascience.org/docs#/default/get_data_data_get |
| Webpage     | Taxi Zone Data     | https://s02-infrastructure.s3.eu-west-1.amazonaws.com/hackathon-02-batch8/index.html |
| Files       | Detailed Trip Data | https://drive.google.com/drive/folders/12MhOAVrplggHVTm6-CtjqkkjI9xrVPek?usp=drive_link |
| Database    | Weather Data       | batch-s02.ctq2kxc7kx1i.eu-west-1.rds.amazonaws.com |



## Downloading API Data

### Downloading the data to the "datalake"

**Step:** Here we simulate reading data from the API and saving it into a datalake. In this case, the "datalake" is simply your own laptop's disk.

In [None]:
import requests
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

# Base URL and parameters
base_url = "https://hckt02-api.lisbondatascience.org/data"
page = 1
size = 100000 # number of records per page that we'll read

# Creating a list to store the lazy (delayed) DataFrames
delayed_dfs = []

while True:
    url = f"{base_url}?page={page}&size={size}"
    if page == 1 or page % 10 == 0:
        print(f"Page number: {page} - Records read so far: up to {(page - 1) * size}")
        print(f"Fetching data from: {url}")

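    try:
        # NOTE: verify=False disables TLS certificate verification
        response = requests.get(url, verify=False)
        response.raise_for_status()  # Raises an exception for HTTP errors

    except requests.exceptions.HTTPError as e:
        # raise_for_status() failed, so `response` is defined here
        if response.status_code == 404:
            print(f"Last page reached: {page}")
        else:
            print(f"Error accessing the API: {e}")
        break

    except requests.exceptions.RequestException as e:
        # Connection errors and timeouts never assign `response`
        print(f"Request failed: {e}")
        break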

    # Converting JSON to Python dictionary
    response_json = response.json()

    # Checking for data
    if not response_json.get('data', []):
        print("End of data.")
        break

    # Wraps the current page in a lazy (delayed) pandas DataFrame and adds it to the list
    delayed_df = delayed(pd.DataFrame)(response_json['data'])
    delayed_dfs.append(delayed_df)

    # Sets page to next page
    page += 1

# Converting the list of delayed DataFrames to a single Dask DataFrame
dask_df = dd.from_delayed(delayed_dfs)
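
The stopping conditions of the loop above (a 404 or an empty `data` list) can be separated from the HTTP call, which makes them easy to check offline. A minimal sketch, assuming a hypothetical `fetch_page` callable that wraps the request and returns `None` or an empty list when a page has no data. The names `iter_pages` and `fake_api` are illustrative and not part of the notebook:

```python
from typing import Callable, Iterator, List, Optional


def iter_pages(
    fetch_page: Callable[[int], Optional[List[dict]]],
    start: int = 1,
) -> Iterator[List[dict]]:
    """Yield one list of records per page until the source is exhausted.

    `fetch_page` is a hypothetical callable that returns the records of a
    page, or None / an empty list when there is nothing left to read.
    """
    page = start
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield records
        page += 1


# Fake three-page source used in place of the real API (made-up records)
fake_api = {
    1: [{"ID": 1, "trip_miles": 1.84}, {"ID": 2, "trip_miles": 1.45}],
    2: [{"ID": 3, "trip_miles": 16.732}],
    3: [{"ID": 4, "trip_miles": 2.477}],
}

pages = list(iter_pages(fake_api.get))
print(len(pages), sum(len(p) for p in pages))  # 3 4
```

With the real API, `fetch_page` would perform the `requests.get` call and translate a 404 into `None`, leaving the paging logic unchanged.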

In [46]:
# Checking the results - this cell may take a while to run, since dask_df.head()
# triggers the actual computation of the lazy Dask DataFrame
dask_df.head(10)

Unnamed: 0,ID,trip_miles
0,5010374,1.84
1,6063883,1.45
2,4941792,16.732
3,7765520,2.477
4,7881861,0.64
5,5462344,3.98
6,8334347,11.6
7,7597616,1.51
8,5056493,1.72
9,6600816,0.96


In [47]:
# Saving the data from the Dask Dataframe into our "Data Lake"
dask_df.to_parquet('.data/api/raw/')

In [50]:
# Reading from the "Data Lake" to check the output
dask_df = dd.read_parquet('.data/api/raw/')

In [51]:
# Checking the Dask Dataframe output
dask_df.head(10)

Unnamed: 0,ID,trip_miles
0,5010374,1.84
1,6063883,1.45
2,4941792,16.732
3,7765520,2.477
4,7881861,0.64
5,5462344,3.98
6,8334347,11.6
7,7597616,1.51
8,5056493,1.72
9,6600816,0.96


## FINISHED API COMPUTATION

**Conclusion:** Next, we'll work on the other data sources to bring everything together and create a single file for training the ML model.

* We have collected the API data and stored it in our "Data Lake", where it will be available for the next steps of the hackathon.
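
As a rough sketch of what "bringing everything together" might look like, the example below joins the API data to a second source. It assumes the other sources share the `ID` column, which the notebook has not yet confirmed. `other_df` and its `some_feature` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the API data saved above (values copied from the preview)
api_df = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "ID": [5010374, 6063883, 4941792],
        "trip_miles": [1.84, 1.45, 16.732],
    }
)

# Hypothetical second source; the column `some_feature` is made up
other_df = pd.DataFrame(
    {
        "ID": [5010374, 4941792],
        "some_feature": ["a", "b"],
    }
)

# Drop the leftover index column, then left-join so every API row is kept
combined = api_df.drop(columns=["Unnamed: 0"]).merge(other_df, on="ID", how="left")
print(combined)
```

A left join keeps every trip from the API even when the other source has no matching row, so missing matches show up as `NaN` rather than silently disappearing.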