## E-Commerce Data Raw Ingestion Pipeline

This notebook represents the first stage in the E-commerce ETL pipeline. It performs:
1. Acquisition of raw data from the Kaggle Brazilian E-commerce dataset by Olist
2. Secure transfer of data to Azure Data Lake Storage (ADLS)
3. Preparation for subsequent transformation stages

### Environment Setup and Credential Management

The following code:
1. Installs the Kaggle API for data acquisition
2. Initializes the Spark session
3. Securely retrieves Kaggle credentials from Azure Key Vault
4. Configures the environment for Kaggle API authentication

In [28]:
# Install the Kaggle API client (needed once per session)
!pip install kaggle

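# Import necessary libraries
import os
from pyspark.sql import SparkSession
from notebookutils import mssparkutils  # Synapse utilities; preloaded in Synapse notebooks, imported here for clarity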

# Initialize Spark Session
spark = SparkSession.builder.getOrCreate()

# Retrieve Kaggle API credentials from Azure Key Vault
kaggle_credential = mssparkutils.credentials.getSecret("ecom-sales-kv", "KaggleAPI")

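# Write the credentials to a local file for the Kaggle CLI
# (assumes the secret holds the full contents of kaggle.json)
kaggle_json_path = "/tmp/kaggle.json"
with open(kaggle_json_path, "w") as f:
    f.write(kaggle_credential)

# Restrict the file to its owner and point the Kaggle CLI at its directory
os.chmod(kaggle_json_path, 0o600)
os.environ['KAGGLE_CONFIG_DIR'] = os.path.dirname(kaggle_json_path)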
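
The cell above writes the credential file first and restricts its permissions afterwards, so the file briefly exists with default permissions. A minimal sketch of one alternative, which creates the file as owner-only from the start, is shown below. The helper name `write_secret_file` and the placeholder secret are illustrative assumptions, not part of the original notebook.

```python
import os
import stat
import tempfile


def write_secret_file(path: str, content: str) -> None:
    """Create `path` readable and writable by the owner only, then write `content`."""
    # The 0o600 mode passed to os.open applies only when the file is newly created
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    # Tighten permissions explicitly in case the file already existed with a looser mode
    os.chmod(path, 0o600)


# Usage with a placeholder secret (hypothetical values, not a real credential)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "kaggle.json")
write_secret_file(demo_path, '{"username": "example", "key": "placeholder"}')
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

In the notebook, the same helper could be called with `kaggle_json_path` and `kaggle_credential` in place of the demo values.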

### Dataset Acquisition

This section:
1. Downloads the Brazilian E-commerce dataset from Kaggle
2. Extracts the compressed files to a temporary location
3. Prepares the data for upload to the data lake

The Olist dataset contains anonymized commercial data from multiple Brazilian marketplaces.

In [29]:
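download_path = "/synfs/tmp"
# Download and unzip the dataset, stopping the notebook if the Kaggle CLI fails
if os.system(f"kaggle datasets download -d olistbr/brazilian-ecommerce --unzip -p {download_path}") != 0:
    raise RuntimeError("Kaggle download failed; check credentials and dataset name")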
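
The download step can also be wrapped in a function that raises on CLI failure and confirms that files actually arrived before the upload starts. The sketch below is one possible shape, assuming the dataset unpacks to CSV files. The names `download_kaggle_dataset` and `fake_runner` are hypothetical, and the injectable `runner` parameter exists only so the function can be exercised without network access.

```python
import os
import subprocess
import tempfile
from typing import Callable, List


def download_kaggle_dataset(dataset: str, dest: str,
                            runner: Callable = subprocess.run) -> List[str]:
    """Download and unzip a Kaggle dataset into `dest`; return the CSV file names found."""
    os.makedirs(dest, exist_ok=True)
    cmd = ["kaggle", "datasets", "download", "-d", dataset, "--unzip", "-p", dest]
    # With subprocess.run, check=True raises CalledProcessError on a non-zero exit status
    runner(cmd, check=True)
    csv_files = sorted(f for f in os.listdir(dest) if f.endswith(".csv"))
    if not csv_files:
        raise RuntimeError(f"No CSV files found in {dest} after download")
    return csv_files


def fake_runner(cmd, check=True):
    """Stand-in for the Kaggle CLI: writes one empty CSV into the target directory."""
    dest = cmd[cmd.index("-p") + 1]
    open(os.path.join(dest, "example.csv"), "w").close()


# Dry run against a temporary directory using the stand-in runner
demo_files = download_kaggle_dataset(
    "olistbr/brazilian-ecommerce", tempfile.mkdtemp(), runner=fake_runner
)

# In the notebook, the real call would omit `runner`:
# download_kaggle_dataset("olistbr/brazilian-ecommerce", "/synfs/tmp")
```

Passing the command as a list avoids shell quoting issues if the destination path ever contains spaces.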

### Storage Path Configuration

This section defines a utility function that:
1. Standardizes path generation for different data lake containers
2. Ensures consistent access patterns throughout the ETL pipeline
3. Encapsulates storage account information for maintainability

This modular approach simplifies storage path management across all pipeline stages.

In [30]:
# Define a helper function for generating ADLS paths
def get_adls_path(container: str, folder: str) -> str:
    """
    Generate an ADLS path based on the container and folder.
    
    Parameters:
        container (str): The ADLS container (e.g., 'raw', 'processed', 'curated').
        folder (str): The folder path inside the container.
    
    Returns:
        str: The formatted ADLS path.
    """
    storage_account = "ecomsalessa"
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{folder}/"
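
To illustrate the path format, the sketch below shows a variant of the helper that also tolerates leading or trailing slashes in `folder`, so callers cannot accidentally produce `//` in the URI. The name `build_adls_path` and the `storage_account` parameter are assumptions added for illustration. The storage account value is the one used in this notebook.

```python
def build_adls_path(container: str, folder: str,
                    storage_account: str = "ecomsalessa") -> str:
    """Return an abfss:// URI for `folder` in `container`, ending with exactly one '/'."""
    # Strip leading/trailing slashes so "a/b", "/a/b" and "a/b/" all behave the same
    clean_folder = folder.strip("/")
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    return f"{base}{clean_folder}/" if clean_folder else base


raw_path = build_adls_path("raw", "ecommerce-dataset")
same_path = build_adls_path("raw", "/ecommerce-dataset/")
root_path = build_adls_path("raw", "")

print(raw_path)  # abfss://raw@ecomsalessa.dfs.core.windows.net/ecommerce-dataset/
```

Exposing the storage account as a parameter with a default keeps existing calls unchanged while allowing a different account per environment.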


### Data Transfer to Azure Data Lake Storage

This section:
1. Transfers each downloaded file to the `raw` container in ADLS
2. Keeps the original file names and formats (only top-level files are copied; subdirectories are skipped)
3. Prepares the data for transformation in subsequent pipeline stages

This completes the raw data ingestion stage: later pipeline stages read their input from the `raw` container.

In [31]:
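adls_path = get_adls_path("raw", "ecommerce-dataset")

uploaded = 0
for file_name in sorted(os.listdir(download_path)):
    local_file_path = os.path.join(download_path, file_name)
    # adls_path already ends with "/", so plain concatenation builds the target URI
    adls_file_path = f"{adls_path}{file_name}"

    # Copy regular files only; subdirectories are skipped
    if os.path.isfile(local_file_path):
        print(f"Uploading {file_name} to ADLS...")
        mssparkutils.fs.cp(f"file://{local_file_path}", adls_file_path)
        uploaded += 1

print(f"Upload completed: {uploaded} file(s) copied to {adls_path}")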
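
Before copying anything, it can help to compute the full list of source and target URIs and inspect it as a dry run. The sketch below separates that planning step from the copy itself. The function name `plan_uploads` and the demo file names are hypothetical, and the final commented lines show how the result might be fed to `mssparkutils.fs.cp` inside Synapse.

```python
import os
import tempfile
from typing import List, Tuple


def plan_uploads(local_dir: str, adls_base: str) -> List[Tuple[str, str]]:
    """Return (source_uri, target_uri) pairs for every regular file in `local_dir`."""
    base = adls_base.rstrip("/") + "/"
    pairs = []
    for name in sorted(os.listdir(local_dir)):
        local_path = os.path.abspath(os.path.join(local_dir, name))
        if os.path.isfile(local_path):  # subdirectories are skipped
            pairs.append((f"file://{local_path}", f"{base}{name}"))
    return pairs


# Dry run against a temporary directory with two files and one subdirectory
demo_dir = tempfile.mkdtemp()
for demo_name in ("a.csv", "b.csv"):
    open(os.path.join(demo_dir, demo_name), "w").close()
os.mkdir(os.path.join(demo_dir, "nested"))

demo_base = "abfss://raw@ecomsalessa.dfs.core.windows.net/ecommerce-dataset/"
demo_plan = plan_uploads(demo_dir, demo_base)
demo_targets = [target for _, target in demo_plan]

# Inside Synapse, the plan could then be executed with:
# for source_uri, target_uri in plan_uploads(download_path, adls_path):
#     mssparkutils.fs.cp(source_uri, target_uri)
```

Keeping the plan as plain data makes it easy to log or review exactly which files will be written to the `raw` container.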