This Python script automates downloading NYC TLC (Taxi and Limousine Commission) trip data files from a GitHub repository, converting them from CSV to Parquet, and uploading them to a Google Cloud Storage (GCS) bucket. It processes one taxi service and one year per call, looping over the twelve monthly files. The sections below break the functionality down step by step:

```mermaid

graph LR;
    A[Start] --> B[Download CSVs from GitHub];
    B --> C[Convert CSV to Parquet];
    C --> D[Upload Parquet to GCS];
    D --> E[End];

```
### Pre-requisites:
- The script requires the `pandas`, `pyarrow`, `google-cloud-storage`, and `requests` Python packages. They handle data manipulation, Parquet conversion, interaction with Google Cloud Storage, and the HTTP download, respectively. The script also imports `load_dotenv` from `python-dotenv`, although that function is never called.
- It also requires setting up the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to authenticate with Google Cloud using a service account key.
- The `GCP_GCS_BUCKET` environment variable should be set to the target Google Cloud Storage bucket. If it is not set, the default bucket name in the code (`claytor_taxi_warehouse`) is used.
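
As a minimal sketch of how these settings could be checked before any download starts, the snippet below reads both environment variables and collects warnings for anything missing. The helper name `check_environment` is hypothetical and not part of the script; the fallback bucket name is assumed to match the default in the code.

```python
import os


# Hypothetical helper, not part of the original script.
def check_environment(env=None, default_bucket="claytor_taxi_warehouse"):
    """Return (bucket_name, warnings) derived from an environment mapping."""
    env = os.environ if env is None else env
    warnings = []

    # Bucket name: fall back to the default when the variable is unset or empty.
    bucket = env.get("GCP_GCS_BUCKET")
    if not bucket:
        bucket = default_bucket
        warnings.append(f"GCP_GCS_BUCKET not set; falling back to {default_bucket!r}")

    # Service account key: warn if the variable is unset or the file is missing.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        warnings.append("GOOGLE_APPLICATION_CREDENTIALS not set")
    elif not os.path.isfile(key_path):
        warnings.append(f"Service account key not found at {key_path!r}")

    return bucket, warnings
```

Calling `check_environment()` with no arguments inspects the real process environment; passing a plain dictionary makes the check easy to try without changing any settings.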

### Key Components:

#### Global Variables:
- `init_url`: The base URL for downloading the dataset files from GitHub.
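- `BUCKET`: The name of the Google Cloud Storage bucket where the files will be uploaded. It is read from the environment variable `GCP_GCS_BUCKET` and defaults to `claytor_taxi_warehouse`.
- `CREDENTIALS`: The path to the service account key, read from `GOOGLE_APPLICATION_CREDENTIALS` with a local default. The variable is defined but not passed to the GCS client.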

#### `upload_to_gcs` Function:
- Takes `bucket` (the name of the GCS bucket), `object_name` (the GCS object name or path), and `local_file` (the path to the local file to upload).
- It sets the blob module's multipart threshold and default chunk size to 5 MB, initializes a GCS client, gets the specified bucket, creates a blob object with the given object name, and uploads the file from the local path.
- This function encapsulates the GCS upload logic, so callers do not deal with the client, bucket, or blob objects directly.
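
The upload steps above can also be sketched with the client passed in as an argument, which avoids reusing the `bucket` name for two different things and lets the logic be exercised without network access. This is an illustrative variant under that assumption, not the script's implementation; `upload_file` is a hypothetical name, and `client` is assumed to behave like `google.cloud.storage.Client`.

```python
# Hypothetical variant of upload_to_gcs that receives the client explicitly.
def upload_file(client, bucket_name, object_name, local_file):
    """Upload local_file to bucket_name/object_name and return its gs:// URI.

    `client` must provide client.bucket(name).blob(object_name)
    .upload_from_filename(path), as google.cloud.storage.Client does.
    """
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_file)
    return f"gs://{bucket_name}/{object_name}"
```

In real use the first argument would be `storage.Client()`; in a test it can be any stand-in object that exposes the same three methods.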

#### `web_to_gcs` Function:
- Takes `year` and `service` parameters to determine which dataset to process.
- Iterates through all months (1 to 12), constructs the file name for each month's dataset, and downloads the `.csv.gz` file using the `requests` library.
- After downloading, it reads the CSV file into a pandas DataFrame, converts it to Parquet format using the `pyarrow` engine, and saves the Parquet file locally.
- Finally, it uploads the Parquet file to the specified GCS bucket, organizing files by service and maintaining the naming convention.
- This function essentially automates the end-to-end process of acquiring the data, transforming it, and storing it in a cloud-based data lake.
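
To make the naming convention concrete, the sketch below generates the download URL, the local CSV file name, and the GCS object name for each month without downloading anything. The generator name `monthly_targets` and the constant `INIT_URL` are hypothetical; the URL and file name patterns are taken from the script.

```python
INIT_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"


# Hypothetical helper that mirrors the naming logic inside web_to_gcs.
def monthly_targets(year, service, base_url=INIT_URL):
    """Yield (download_url, csv_name, gcs_object_name) for months 01 to 12."""
    for month in range(1, 13):
        # Zero-pad the month so that January becomes "01".
        csv_name = f"{service}_tripdata_{year}-{month:02d}.csv.gz"
        parquet_name = csv_name.replace(".csv.gz", ".parquet")
        yield f"{base_url}{service}/{csv_name}", csv_name, f"{service}/{parquet_name}"


for url, csv_name, object_name in monthly_targets("2019", "green"):
    print(url, "->", object_name)
```

Separating the name construction from the download and upload makes it possible to review the full list of targets before any network traffic happens.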

### Execution Calls:
- The script calls `web_to_gcs` twice, once for each year (`2019` and `2020`) of the `green` taxi service.
- The commented-out calls show how to extend it to the `yellow` service; other services such as `fhv` can be added with further calls in the same form.

### Workflow Summary:
1. Download compressed CSV files from a specified URL.
2. Convert each CSV file to Parquet format for more efficient storage and query performance.
3. Upload the Parquet files to a Google Cloud Storage bucket, organizing them by service type and preserving the naming convention for easy retrieval.

In [1]:
import io
import os
import requests
import pandas as pd
from dotenv import load_dotenv
from google.cloud import storage

# Pre-requisites:
# 1. `pip install pandas pyarrow google-cloud-storage requests python-dotenv`
# 2. Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account key
# 3. Set GCP_GCS_BUCKET to your bucket, or change the default value of BUCKET

# Base URL of the GitHub release assets and the default bucket name.
init_url = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/'
BUCKET = os.environ.get("GCP_GCS_BUCKET", "claytor_taxi_warehouse")
# NOTE: CREDENTIALS is not passed to storage.Client() below; the client reads
# GOOGLE_APPLICATION_CREDENTIALS from the environment on its own.
CREDENTIALS = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "/home/.gcp/claytor-mage.json")

def upload_to_gcs(bucket, object_name, local_file):
    """
    Uploads a file to Google Cloud Storage (GCS).
    
    Args:
        bucket (str): The name of the GCS bucket.
        object_name (str): The name/path for the file in the GCS bucket.
        local_file (str): The path to the local file to upload.
    """
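    # Workaround: lower the multipart threshold and default chunk size to 5 MB,
    # a setting commonly used to avoid timeouts on larger files over slow links.
    # Both names are private attributes of google.cloud.storage.blob.
    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB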
    client = storage.Client()
    bucket = client.bucket(bucket)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)

def web_to_gcs(year, service):
    """
    Downloads data for a given year and service, converts it to parquet, and uploads to GCS.
    
    Args:
        year (str): The year of the data to download.
        service (str): The taxi service (e.g., 'green', 'yellow') of the data.
    """
    for i in range(1, 13):  # Loop through months 1 to 12

        # Zero-pad the month to two digits (e.g. 1 -> "01")
        month = f"{i:02d}"

        # Construct the file name for the CSV file
        file_name = f"{service}_tripdata_{year}-{month}.csv.gz"

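        # Construct the URL and download the file
        request_url = f"{init_url}{service}/{file_name}"
        r = requests.get(request_url)
        r.raise_for_status()  # Fail early instead of saving an HTTP error body
        with open(file_name, 'wb') as f:  # Close the file handle after writing
            f.write(r.content)
        print(f"Local: {file_name}")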

        # Read the downloaded CSV file into a DataFrame and convert to Parquet
        df = pd.read_csv(file_name, compression='gzip')
        file_name = file_name.replace('.csv.gz', '.parquet')
        df.to_parquet(file_name, engine='pyarrow')
        print(f"Parquet: {file_name}")

        # Upload the Parquet file to GCS
        upload_to_gcs(BUCKET, f"{service}/{file_name}", file_name)
        print(f"GCS: {service}/{file_name}")

# Example calls to download data for the 'green' service for years 2019 and 2020
web_to_gcs('2019', 'green')
web_to_gcs('2020', 'green')
# The following lines are commented out, but you could uncomment them to download for other years/services
# web_to_gcs('2019', 'yellow')
# web_to_gcs('2020', 'yellow')


Local: green_tripdata_2019-01.csv.gz
Parquet: green_tripdata_2019-01.parquet


Forbidden: 403 POST https://storage.googleapis.com/upload/storage/v1/b/claytor_taxi_warehouse/o?uploadType=resumable: {
  "error": {
    "code": 403,
    "message": "Access denied.",
    "errors": [
      {
        "message": "Access denied.",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}
: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.CREATED: 201>)