<a href="https://colab.research.google.com/github/GerardRagbir/Python-Notebooks/blob/main/etlPipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ETL Pipeline

## Environment

In [None]:
'''Install python 3.11'''
!sudo apt-get update -y # refresh the package index
!sudo apt-get install -y python3.11 # install Python 3.11 (latest as of 2022-Oct); -y skips the confirmation prompt

'''Install required packages'''
!pip install sqlalchemy pyodbc pandas psycopg2-binary # required packages (psycopg2 is the PostgreSQL driver used by SQLAlchemy); modify as needed

Environment variables are normally set through the shell of whichever OS you're on (for example with `export` in Bash). However, since this notebook only needs them inside the Python process, we'll set them with the following script:

In [None]:
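# Write ENV variables to the current process environment via Python:
import os

os.environ['PGPASS'] = 'password'  # placeholder value, replace with your own
os.environ['PGUID'] = 'etluser'    # placeholder value, replace with your own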


'''Alternative for quickly setting env variables'''

# %env PGPASS=password
# %env PGUID=etluser

## Required Modules

<b>Resources</b>

SQL Alchemy: https://towardsdatascience.com/sqlalchemy-python-tutorial-79a577141a91

PyODBC: https://learn.microsoft.com/en-us/sql/connect/python/pyodbc/python-sql-driver-pyodbc?view=sql-server-ver16

Pandas: https://www.w3schools.com/python/pandas/default.asp

In [32]:
from sqlalchemy import create_engine
import pyodbc as odbc
import pandas as pd
import os

## Get variables from OS Environment

REMINDER: Never store credentials (passwords, user IDs) in source code!

In [33]:
''' 
These intermediate variables could be inlined into the lookups below; 
I am just separating them for explanation during live sessions!

e.g. pwd = os.environ['PGPASS'] would give the same result.
'''

ETLPWD = 'PGPASS'  # name of the env variable holding the password
ETLUID = 'PGUID'   # name of the env variable holding the user ID

pwd = os.environ[ETLPWD]
uid = os.environ[ETLUID]
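
A bare `os.environ[...]` lookup raises a `KeyError` when the variable is missing, which can be confusing in a live session. The sketch below shows one way to fail with a clearer message. The helper name `require_env` and the variable `DEMO_ETL_UID` are hypothetical and only used for illustration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper: not part of the original notebook.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Demo only: set a throwaway variable, then read it back.
os.environ["DEMO_ETL_UID"] = "etluser"
demo_uid = require_env("DEMO_ETL_UID")
```

In the notebook itself, `pwd = require_env('PGPASS')` and `uid = require_env('PGUID')` would be drop-in replacements for the direct lookups.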

In [34]:
'''SQL Connection'''
driver = "{ODBC Driver 17 for SQL Server}" # e.g. ODBC Driver 17 for SQL Server
server = "localhost" # use localhost if using a local machine
database = "#Name_of_DB_HERE" # placeholder: name of the source database
port = 5432 # PostgreSQL port, used when loading

table = "#tableName"


# the backslash is doubled so Python does not treat "\S" as an (invalid) escape sequence
CONNECTION_PATH = f'DRIVER={driver};SERVER={server}\\SQLEXPRESS;DATABASE={database};UID={uid};PWD={pwd}'

Refer to: https://www.connectionstrings.com/formating-rules-for-connection-strings/
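
One formatting rule worth illustrating: in ODBC connection strings, a value containing `;` or starting with `{` has to be wrapped in braces, with any `}` inside it doubled. The sketch below applies that rule; the helper names `odbc_value` and `build_connection_string` are hypothetical, and the driver name is passed without braces (unlike the `driver` variable above).

```python
def odbc_value(value: str) -> str:
    """Brace-wrap a value when it contains characters special to ODBC strings."""
    needs_braces = ";" in value or value.startswith("{") or value != value.strip()
    if needs_braces:
        # A closing brace inside a braced value is escaped by doubling it.
        return "{" + value.replace("}", "}}") + "}"
    return value

def build_connection_string(driver: str, server: str, database: str,
                            uid: str, pwd: str) -> str:
    """Assemble a SQL Server ODBC connection string from its parts."""
    parts = {
        "DRIVER": "{" + driver + "}",  # driver names are always braced
        "SERVER": odbc_value(server),
        "DATABASE": odbc_value(database),
        "UID": odbc_value(uid),
        "PWD": odbc_value(pwd),
    }
    return ";".join(f"{key}={val}" for key, val in parts.items())

# Demo with placeholder values; the password deliberately contains ';' and '}'.
demo_conn = build_connection_string(
    "ODBC Driver 17 for SQL Server", "localhost\\SQLEXPRESS", "db", "etluser", "p;w}d"
)
```

Without the bracing step, a password containing `;` would be cut off at that character and the rest parsed as a separate keyword.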

In [35]:
QUERY = """
        SELECT t.name AS table_name
        FROM sys.tables t
        WHERE t.name IN ('DimProduct', 'DimProductSubcategory', 'DimProductCategory',
                         'DimSalesTerritory', 'FactInternetSales');
        """

def extract():
  src_connect = None  # defined up front so the finally block is safe if connect() fails
  try:
    src_connect = odbc.connect(CONNECTION_PATH)
    cursor = src_connect.cursor()

    # fetch the names of the source tables to copy
    cursor.execute(QUERY)
    rows = cursor.fetchall()

    for row in rows:
      tbl = row[0]  # table name returned by QUERY
      df = pd.read_sql_query(f'SELECT * FROM {tbl}', src_connect)
      load(df, tbl)

  except Exception as e:
    print(f"Extraction Error: {e}")

  finally:
    # close the connection only if it was actually opened
    if src_connect is not None:
      src_connect.close()
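
The extract step interpolates table names straight into a `SELECT` statement. The names here come from `sys.tables`, so the risk is small, but a name containing spaces or a reserved word would still break the query. A minimal sketch of bracket-quoting for SQL Server follows; `quote_tsql_identifier` is a hypothetical helper, not part of the notebook.

```python
def quote_tsql_identifier(name: str) -> str:
    """Wrap a SQL Server identifier in brackets, doubling any ']' inside it."""
    if not name:
        raise ValueError("identifier must not be empty")
    return "[" + name.replace("]", "]]") + "]"

# Demo: build a SELECT for one of the tables named in QUERY.
demo_sql = f"SELECT * FROM {quote_tsql_identifier('DimProduct')}"
```

This mirrors what T-SQL's own `QUOTENAME` function does for bracket-delimited names.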

In [36]:
def load(df, tbl):
  try:
    rows_imported = 0
    # assumes the PostgreSQL target database shares the name stored in `database`; change if it differs
    engine = create_engine(f'postgresql://{uid}:{pwd}@{server}:{port}/{database}')
    print(f'Importing rows: {rows_imported} to {rows_imported+len(df)} ... for table {tbl}')

    # write the dataframe to a staging table (PostgreSQL here), replacing any existing one
    df.to_sql(f'stg_{tbl}', engine, if_exists='replace', index=False)
    rows_imported += len(df)
    print("Data imported successfully!")
  except Exception as e:
    print(f"Load Error: {e}")
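
The engine URL is built by plain string interpolation, so a password containing characters such as `@` or `/` would be misparsed. One common workaround is to URL-encode the credentials first. The sketch below uses only the standard library; `postgres_url` is a hypothetical helper and all values are placeholders.

```python
from urllib.parse import quote_plus

def postgres_url(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Build a PostgreSQL connection URL with URL-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{dbname}"
    )

# Demo: the '@' and '/' in the password are percent-encoded.
demo_url = postgres_url("etluser", "p@ss/word", "localhost", 5432, "staging")
```

The resulting string can be passed to `create_engine` in place of the f-string above.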


## Run ETL

In [None]:
try:
  extract()
except Exception as e:
  print(f"Error during ETL: {e}")