<a href="https://colab.research.google.com/github/Shazizan/portfolio/blob/master/etl_vault_lambda_distribution_centers_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Method 3 of Using Lambda Fx**

**STEP 1 - Import Required Libraries**

In [39]:
import requests               # make HTTP requests to the GitHub API
import json                   # serialize Python data to JSON
import csv                    # parse CSV files
import base64                 # the GitHub contents API sends and expects base64-encoded file content
from io import StringIO       # lets a string be read like a file object

**STEP 2 - Configuration**

In [40]:
# GitHub Personal Access Token - needed for authentication
GITHUB_TOKEN = "PLACE_YOUR_TOKEN_HERE"

# Source repository configuration (where the CSV file is located)
SOURCE_OWNER = "Shazizan"                           # GitHub username/org of source repo
SOURCE_REPO = "data"                                # Name of source repository
SOURCE_FILE_PATH = "distribution_centers.csv"       # Path to CSV file in source repo

# Destination repository configuration (where the JSON file will be uploaded)
DEST_OWNER = "Shazizan"                              # GitHub username/org of destination repo
DEST_REPO = "pipeline-vault"                         # Name of destination repository
DEST_FILE_PATH = "distribution_centers.json"         # Path where JSON will be saved
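
Hard-coding the token in a notebook makes it easy to commit by accident. A minimal sketch of one alternative, assuming the token is exported as an environment variable; the variable name `GITHUB_TOKEN` and the helper `load_token` are illustrative choices, not part of the original notebook:

```python
import os

# Hypothetical helper: read the token from the environment instead of hard-coding it.
# The default variable name "GITHUB_TOKEN" is an assumption; use whatever name you export.
def load_token(env_var="GITHUB_TOKEN", placeholder="PLACE_YOUR_TOKEN_HERE"):
    token = os.environ.get(env_var, "").strip()
    # Reject both a missing value and the untouched placeholder string
    if not token or token == placeholder:
        raise RuntimeError(f"Set the {env_var} environment variable before running the pipeline")
    return token
```

With this in place, the configuration cell could read `GITHUB_TOKEN = load_token()` and fail early with a clear message when no token is set.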

**STEP 3 - Lambda Functions For ETL Process**

Helpers defined in this step: create_headers > build_url > decode_content > encode_content > parse_csv > to_json

In [41]:
# Lambda fx to create GitHub API headers
# Purpose: Adds authentication & specifies we're sending JSON
create_headers = lambda token: {
    "Authorization": f"Bearer {token}",           # authenticates our request
    "Accept": "application/vnd.github.v3+json",  # specifies GitHub API version
    "Content-Type": "application/json"           # tells GitHub we're sending JSON
}

# Lambda fx to construct GitHub API URL
# Purpose: Builds the correct URL to access files in a repository
build_url = lambda owner, repo, path: f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"

# Lambda fx to decode base64 content
# Purpose: GitHub returns file content in base64, this decodes it to text
decode_content = lambda content: base64.b64decode(content).decode('utf-8')

# Lambda fx to encode content to base64
# Purpose: GitHub requires file uploads to be base64 encoded
encode_content = lambda content: base64.b64encode(content.encode('utf-8')).decode('utf-8')

# Lambda fx to parse CSV to list of dictionaries
# Purpose: Convert CSV rows into Python dictionaries for easy manipulation
parse_csv = lambda csv_text: list(csv.DictReader(StringIO(csv_text)))

# Lambda fx to convert list to JSON String
# Purpose: Transforms python data structure into formatted JSON
to_json = lambda data: json.dumps(data, indent=2)      # indent=2 makes it readable
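
The helpers can be checked offline before any request is made. The sketch below repeats the one-line lambdas so it runs on its own, and uses a made-up two-row CSV; the real file's columns may differ:

```python
import base64
import csv
import json
from io import StringIO

# Same one-line helpers as in the cell above, repeated so this snippet is self-contained
build_url = lambda owner, repo, path: f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
decode_content = lambda content: base64.b64decode(content).decode('utf-8')
encode_content = lambda content: base64.b64encode(content.encode('utf-8')).decode('utf-8')
parse_csv = lambda csv_text: list(csv.DictReader(StringIO(csv_text)))
to_json = lambda data: json.dumps(data, indent=2)

# Made-up sample data for illustration only
sample_csv = "id,name\n1,Memphis TN\n2,Chicago IL\n"

rows = parse_csv(sample_csv)                             # one dict per CSV row
round_trip = decode_content(encode_content(sample_csv))  # encode, then decode again

print(build_url("Shazizan", "data", "distribution_centers.csv"))
print(round_trip == sample_csv)   # True: encoding and decoding are inverses
print(to_json(rows))              # every value is a string, because csv does not infer types
```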

**STEP 4 - Extract Function**

In [42]:
def extract_csv_from_github(owner, repo, file_path, token):
  # A docstring is the triple-quoted text ("""...""") that explains what a function does.
  # Purpose: it documents the code for others (and for future you).
  """
  Extract CSV data from GitHub repository

  Args:
      owner: GitHub username/organization
      repo: Repository name
      file_path: Path to file within repository
      token: GitHub personal access token

  Returns:
      String content of the CSV file
  """

  # build the API URL for the file
  url = build_url(owner, repo, file_path)

  # create authentication headers
  headers = create_headers(token)

  # make GET request to GitHub API
  print(f"📥 Extracting data from: {owner}/{repo}/{file_path}")
  response = requests.get(url, headers=headers)

  # check if request was successful
  if response.status_code == 200:
      # parse the JSON response
      file_data = response.json()

      # decode the base64 content to get actual CSV text
      csv_content = decode_content(file_data['content'])

      print("✅ Extraction successful!")
      return csv_content
  else:
      # if the request failed, raise an error with details
      raise Exception(f"❌ Failed to extract: {response.status_code} - {response.text}")
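
The decoding step can be tried without a network call. The sketch below builds a made-up response dictionary with the two fields the extract function relies on (`content`, plus the `encoding` field that the contents endpoint uses to say how `content` is encoded). The helper `decode_file_data` and its check on `encoding` are assumptions added for illustration, not part of the original function:

```python
import base64

# Made-up CSV text standing in for the real file
csv_text = "id,name\n" + "".join(f"{i},Center {i}\n" for i in range(1, 11))

# Simulated response body. encodebytes() wraps the base64 text across several lines,
# which mimics the line-wrapped content GitHub returns.
fake_response = {
    "encoding": "base64",
    "content": base64.encodebytes(csv_text.encode("utf-8")).decode("ascii"),
}

def decode_file_data(file_data):
    """Decode the 'content' field, refusing anything that is not base64 (hypothetical guard)."""
    if file_data.get("encoding") != "base64":
        raise ValueError(f"Unexpected encoding: {file_data.get('encoding')!r}")
    # b64decode ignores the embedded newlines by default (validate=False)
    return base64.b64decode(file_data["content"]).decode("utf-8")

print(decode_file_data(fake_response) == csv_text)
```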

**STEP 5 - Transform Function**

In [43]:
def transform_csv_to_json(csv_content):
  """
  Transform CSV content to JSON format

  Args:
      csv_content: String content of CSV file

  Returns:
      JSON string representation of the data
  """

  print("🔄Transforming CSV to JSON...")

  # parse CSV text into list of dictionaries using lambda fx
  data = parse_csv(csv_content)

  # convert to JSON string using lambda fx
  json_content = to_json(data)

  print(f"✅Transformation complete! Converted {len(data)} rows")
  return json_content
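
Because `csv.DictReader` returns every value as a string, numbers end up quoted in the JSON output. If typed values are wanted, one option is a per-column conversion step between parsing and serializing. This is a sketch only: the column names (`id`, `latitude`, `longitude`) and the helper `coerce_row` are assumptions, since the notebook does not show the CSV's columns:

```python
import csv
import json
from io import StringIO

# Hypothetical converters: column name -> function applied to that column's string value.
# These column names are assumed for illustration; adjust them to the real header row.
CONVERTERS = {"id": int, "latitude": float, "longitude": float}

def coerce_row(row, converters=CONVERTERS):
    """Return a copy of row with the listed columns converted; other columns stay strings."""
    return {key: converters.get(key, str)(value) for key, value in row.items()}

# Made-up sample row
sample_csv = "id,name,latitude,longitude\n1,Memphis TN,35.1174,-89.9711\n"
rows = [coerce_row(r) for r in csv.DictReader(StringIO(sample_csv))]

print(json.dumps(rows, indent=2))   # id and coordinates are now JSON numbers
```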

**STEP 6 - Load Function**

In [44]:
def load_json_to_github(owner, repo, file_path, json_content, token, commit_message="ETL: Upload transformed data"):
  """
  Load JSON data to GitHub repository

  Args:
      owner: GitHub username/organization
      repo: Repository name
      file_path: Path where file should be saved
      json_content: JSON string to upload
      token: GitHub personal access token
      commit_message: Git commit message

  Returns:
      Response from GitHub API
  """

  print(f"📤Loading data to: {owner}/{repo}/{file_path}")

  # build the API URL for destination / target
  url = build_url(owner, repo, file_path)

  # create authentication headers
  headers = create_headers(token)

  # first, check if file already exists (to get SHA for update)
  check_response = requests.get(url, headers=headers)

  # prepare the payload for GitHub API
  payload = {
      "message": commit_message,                # git commit message
      "content": encode_content(json_content),  # base64 encoded content
      "branch": "main"                          # target branch (change if needed)
  }

  # if file exists, we need to include its SHA for update
  if check_response.status_code == 200:
      payload["sha"] = check_response.json()["sha"]
      print("📝File exists, updating...")
  else:
      print("📝Creating new file...")

  # make PUT request to create/update file
  response = requests.put(url, headers=headers, json=payload)

  # check if upload was successful
  if response.status_code in [200, 201]:
      print("✅Load successful!")
      return response.json()
  else:
      raise Exception(f"❌ Failed to load: {response.status_code} - {response.text}")
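
The create-versus-update logic comes down to whether the payload carries a `sha`. A minimal sketch that isolates that decision in a pure function, so it can be checked without calling the API; `build_payload` is a hypothetical helper and the SHA value is made up:

```python
import base64

def build_payload(json_content, commit_message, branch="main", existing_sha=None):
    """Assemble the request body for a create-or-update call (hypothetical helper)."""
    payload = {
        "message": commit_message,
        "content": base64.b64encode(json_content.encode("utf-8")).decode("utf-8"),
        "branch": branch,
    }
    # Only an update needs the SHA of the file being replaced
    if existing_sha is not None:
        payload["sha"] = existing_sha
    return payload

create = build_payload('[{"id": "1"}]', "ETL: create")
update = build_payload('[{"id": "1"}]', "ETL: update", existing_sha="abc123")  # made-up SHA

print("sha" in create, "sha" in update)
```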

**STEP 7 - Main ETL Pipeline**

In [45]:
def run_etl_pipeline():
  """
  Execute the complete ETL pipeline

  This function orchestrates the Extract, Transform, Load process
  """

  print("\n" + "="*60)
  print("🚀 Starting ETL Pipeline: GitHub CSV → JSON Transfer")
  print("="*60 + "\n")

  try:
      # EXTRACT: Get CSV data from source repository
      csv_data = extract_csv_from_github(
          owner=SOURCE_OWNER,
          repo=SOURCE_REPO,
          file_path=SOURCE_FILE_PATH,
          token=GITHUB_TOKEN
      )

      print()  # empty line for readability

      # TRANSFORM: Convert CSV to JSON
      json_data = transform_csv_to_json(csv_data)

      print()  # empty line for readability

      # LOAD: Upload JSON to destination repository
      result = load_json_to_github(
          owner=DEST_OWNER,
          repo=DEST_REPO,
          file_path=DEST_FILE_PATH,
          json_content=json_data,
          token=GITHUB_TOKEN,
          commit_message="ETL Pipeline: Automated CSV to JSON conversion"
      )

      print("\n" + "="*60)
      print("🎉 ETL Pipeline completed successfully!")
      print("="*60)
      print(f"\n📊 File uploaded to: {result['content']['html_url']}")

  except Exception as e:
      print("\n" + "="*60)
      print(f"💥 ETL Pipeline failed: {str(e)}")
      print("="*60)
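
`run_etl_pipeline` prints any error and returns nothing, so calling code cannot tell a successful run from a failed one. A minimal sketch of one variation, assuming the three steps are passed in as callables and the result (or `None` on failure) is returned. `run_pipeline` and the stand-in steps are hypothetical, and the URL is a placeholder:

```python
import csv
import json
from io import StringIO

def run_pipeline(extract, transform, load):
    """Run extract -> transform -> load; return the load result, or None on failure."""
    try:
        return load(transform(extract()))
    except Exception as exc:
        print(f"ETL Pipeline failed: {exc}")
        return None

# Stand-in steps so the flow can be exercised without a token or network access
fake_extract = lambda: "id,name\n1,Memphis TN\n"
csv_to_json = lambda text: json.dumps(list(csv.DictReader(StringIO(text))), indent=2)
fake_load = lambda body: {"content": {"html_url": "https://example.invalid/file.json"}, "bytes": len(body)}

def failing_extract():
    raise RuntimeError("simulated 404")

ok = run_pipeline(fake_extract, csv_to_json, fake_load)         # returns the load result
failed = run_pipeline(failing_extract, csv_to_json, fake_load)  # prints the error, returns None
```

Passing the steps in as arguments also makes it possible to test the orchestration separately from the GitHub calls.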

**STEP 8 - Execute The Pipeline**

In [46]:
if __name__ == "__main__":
    # Run the ETL pipeline
    run_etl_pipeline()


🚀 Starting ETL Pipeline: GitHub CSV → JSON Transfer

📥 Extracting data from: Shazizan/data/distribution_centers.csv
✅ Extraction successful!

🔄Transforming CSV to JSON...
✅Transformation complete! Converted 10 rows

📤Loading data to: Shazizan/pipeline-vault/distribution_centers.json
📝Creating new file...
✅Load successful!

🎉 ETL Pipeline completed successfully!

📊 File uploaded to: https://github.com/Shazizan/pipeline-vault/blob/main/distribution_centers.json


**Additional Helper Functions (optional but useful)**

In [47]:
# Lambda to validate GitHub token format (prefix check only:
# classic tokens start with 'ghp_', fine-grained tokens with 'github_pat_')
validate_token = lambda token: token.startswith(('ghp_', 'github_pat_'))

# Lambda to get file extension (returns the whole name if there is no dot)
get_extension = lambda filename: filename.split('.')[-1]

# Lambda to create timestamp for unique filenames
from datetime import datetime
create_timestamp = lambda: datetime.now().strftime("%Y%m%d_%H%M%S")

# Example: Create unique output filename with timestamp
# unique_filename = lambda base: f"{base}_{create_timestamp()}.json"
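
A short usage sketch for these helpers, with the lambdas repeated so it runs on its own and `unique_filename` uncommented. The token string is a made-up example, and the `os.path.splitext` comparison is added only to show how a name without a dot behaves:

```python
import os
from datetime import datetime

validate_token = lambda token: token.startswith(('ghp_', 'github_pat_'))
get_extension = lambda filename: filename.split('.')[-1]
create_timestamp = lambda: datetime.now().strftime("%Y%m%d_%H%M%S")
unique_filename = lambda base: f"{base}_{create_timestamp()}.json"

print(validate_token("ghp_example"))               # True: the prefix matches
print(validate_token("PLACE_YOUR_TOKEN_HERE"))     # False: the placeholder was never replaced
print(get_extension("distribution_centers.csv"))   # csv
print(get_extension("README"))                     # README: no dot, so the whole name comes back
print(os.path.splitext("README")[1] == "")         # True: splitext gives an empty extension instead
print(unique_filename("distribution_centers"))     # name with a timestamp suffix; value depends on the clock
```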