<a href="https://colab.research.google.com/github/Idaogah/Data-Science-and-Data-Engineering/blob/main/simple_ETL_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instruction
This notebook is a sample ETL script. It extracts data from Yahoo Finance and Wikipedia, transforms it with pandas, and loads it into a PostgreSQL database through SQLAlchemy.

# Install packages

In [None]:
!pip install sqlalchemy
!pip install requests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Import

In [None]:
# Import necessary libraries
import requests
import pandas as pd
from sqlalchemy import create_engine


# Extract data

In [None]:
# Extract data from Yahoo Finance
url = 'https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1577836800&period2=1609459200&interval=1d&events=history'
data = pd.read_csv(url)

# Extract data from Wikipedia
url = 'https://en.wikipedia.org/wiki/Apple_Inc.'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]  # last table on the page (here, the "Authority control" box, as the dtypes below show)

print(df.dtypes, '\n')
print(data.dtypes)

Authority control      object
Authority control.1    object
dtype: object 

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object
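
The `period1` and `period2` values in the Yahoo Finance URL are Unix timestamps in seconds (UTC). As a minimal sketch, the same URL can be built from readable dates. The helper name `yahoo_history_url` is hypothetical, and the sketch only constructs the string; it does not check whether the endpoint still serves the request.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def yahoo_history_url(symbol, start, end, interval="1d"):
    """Build a download URL in the same shape as the one used in the extract cell."""
    params = {
        "period1": int(start.timestamp()),  # start of range, Unix seconds
        "period2": int(end.timestamp()),    # end of range, Unix seconds
        "interval": interval,
        "events": "history",
    }
    base = "https://query1.finance.yahoo.com/v7/finance/download/"
    return f"{base}{symbol}?{urlencode(params)}"


url = yahoo_history_url(
    "AAPL",
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2021, 1, 1, tzinfo=timezone.utc),
)
print(url)
```

With these two dates the function reproduces the URL hard-coded in the extract cell, so the date range can be changed without computing timestamps by hand.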


# Transform data

In [None]:
# Transform Yahoo data
data['date'] = pd.to_datetime(data['Date'])
data = data.drop(columns=['Date'])
# data = data.reset_index(drop=True)  # optional: discard the old index and renumber from 0
data.index = data.index + 1  # start the index at 1 instead of 0
data['date'] = data['date'].dt.date  # keep only the date part of the timestamp
data['High'] = data['High'].astype(float).round(3)  # cast to float and round to 3 decimal places
data['Low'] = data['Low'].round(3)  # round to 3 decimal places
data = data.round(4)  # round every float column to at most 4 decimal places
print("Stock table", '\n', data)

# Transform Wikipedia data
# rename() silently ignores labels that are missing, so this has no effect unless the columns exist
df = df.rename(columns={'Founded': 'founded', 'Headquarters': 'headquarters'})
df.index = df.index + 1

print('\n', "Wiki table")
df

Stock table 
          Open     High      Low     Close  Adj Close     Volume        date
1     74.0600   75.150   73.798   75.0875    73.4494  135480400  2020-01-02
2     74.2875   75.145   74.125   74.3575    72.7353  146322800  2020-01-03
3     73.4475   74.990   73.188   74.9500    73.3149  118387200  2020-01-06
4     74.9600   75.225   74.370   74.5975    72.9701  108872000  2020-01-07
5     74.2900   76.110   74.290   75.7975    74.1439  132079200  2020-01-08
..        ...      ...      ...       ...        ...        ...         ...
249  131.3200  133.460  131.100  131.9700   130.2058   54930100  2020-12-24
250  133.9900  137.340  133.510  136.6900   134.8627  124486200  2020-12-28
251  138.0500  138.790  134.340  134.8700   133.0670  121047300  2020-12-29
252  135.5800  135.990  133.400  133.7200   131.9324   96452100  2020-12-30
253  134.0800  134.740  131.720  132.6900   130.9162   99116600  2020-12-31

[253 rows x 7 columns]

 Wiki table


Unnamed: 0,Authority control,Authority control.1
1,General,ISNI 1 VIAF 1 WorldCat
2,National libraries,Norway 2 France (data) Argentina Germany Israe...
3,Art research institutes,Artist Names (Getty)
4,Scientific databases,CiNii (Japan)
5,Other,MusicBrainz artist MusicBrainz label RERO (Swi...
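
The Wiki table printed here has only "Authority control" columns, so renaming `Founded` and `Headquarters` changes nothing. One way to avoid depending on table position is to pick a table by the columns it contains. The following is a minimal sketch under that assumption: `find_table` is a hypothetical helper, the two DataFrames are toy stand-ins for the list returned by `pd.read_html`, and whether the page contains a table with those columns is not verified here.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name, else None."""
    required = set(required_columns)
    for table in tables:
        # Column labels may be tuples or non-strings, so compare their string forms
        if required.issubset({str(col) for col in table.columns}):
            return table
    return None


# Toy stand-ins for the result of pd.read_html (hypothetical placeholder values)
tables = [
    pd.DataFrame({"Founded": ["example"], "Headquarters": ["example"]}),
    pd.DataFrame({"Authority control": ["General"], "Authority control.1": ["ISNI"]}),
]

picked = find_table(tables, ["Founded", "Headquarters"])
missing = find_table(tables, ["Ticker"])
print(list(picked.columns))
print(missing)
```

Returning `None` when nothing matches lets the caller decide whether to stop the pipeline or skip the Wikipedia step.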


# Load data

## Connect

In [None]:
# Replace the connection string below with your own database credentials, then uncomment the line
# test_engine = create_engine('postgresql://username:password@host:port/database')

## Load

In [None]:
# Load data into PostgreSQL
from sqlalchemy import text

# Engine.execute() was removed in SQLAlchemy 2.0; run the DDL inside a transaction instead
with test_engine.begin() as conn:
    conn.execute(text('CREATE SCHEMA IF NOT EXISTS "TEST_ETL"'))

data.to_sql('stock_data', test_engine, if_exists='replace', index=True, index_label='id', schema='TEST_ETL')
df.to_sql('wikipedia_data', test_engine, if_exists='replace', index=True, index_label='id', schema='TEST_ETL')

# index=False leaves the DataFrame index out of the SQL table
# index_label='id' names the index column 'id'; to_sql does not create a primary key
# if_exists='fail' aborts when the table already exists; 'append' adds rows to it
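
The effect of the `if_exists` and `index_label` arguments can be checked without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption made only for illustration (pandas accepts a `sqlite3` connection directly); the three `Close` values are taken from the stock table above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite is a stand-in for illustration; the notebook itself targets PostgreSQL
conn = sqlite3.connect(":memory:")

frame = pd.DataFrame({"Close": [75.0875, 74.3575, 74.95]})
frame.index = frame.index + 1  # 1-based index, as in the transform step


def row_count(connection):
    """Count the rows currently stored in stock_data."""
    return connection.execute("SELECT COUNT(*) FROM stock_data").fetchone()[0]


# 'replace' drops any existing table and recreates it
frame.to_sql("stock_data", conn, if_exists="replace", index=True, index_label="id")
after_replace = row_count(conn)

# 'append' keeps the existing rows and adds the new ones
frame.to_sql("stock_data", conn, if_exists="append", index=True, index_label="id")
after_append = row_count(conn)

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
pk_columns = [row[1] for row in conn.execute("PRAGMA table_info(stock_data)") if row[5]]

print(after_replace, after_append, pk_columns)
conn.close()
```

The empty `pk_columns` list shows that `index_label='id'` only names the column. If a primary key is needed, it has to be added separately, for example with an `ALTER TABLE` statement after loading.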



# Fetch from db

In [None]:
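from sqlalchemy import text

# SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
query = text('SELECT * FROM "TEST_ETL".stock_data LIMIT 5')
with test_engine.connect() as conn:  # the connection is closed when the block exits
    result = conn.execute(query)
    column_names = list(result.keys())
    df = pd.DataFrame(result.fetchall(), columns=column_names)
print(df)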

   id     Open    High     Low    Close  Adj Close     Volume        date
0   1  74.0600  75.150  73.798  75.0875    73.4494  135480400  2020-01-02
1   2  74.2875  75.145  74.125  74.3575    72.7353  146322800  2020-01-03
2   3  73.4475  74.990  73.188  74.9500    73.3149  118387200  2020-01-06
3   4  74.9600  75.225  74.370  74.5975    72.9701  108872000  2020-01-07
4   5  74.2900  76.110  74.290  75.7975    74.1439  132079200  2020-01-08


This is a basic example; adjust it to suit your requirements.

It requires the Python packages requests, pandas, and sqlalchemy, which you can install by running pip install requests pandas sqlalchemy. Connecting to PostgreSQL also needs a database driver such as psycopg2, and pd.read_html needs an HTML parser such as lxml.

The script is for demonstration purposes only and is not robust. Handle errors, add logging, and test the script thoroughly before using it in a production environment.
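
As a minimal sketch of what error handling and logging could look like, each ETL step can be wrapped in a small runner that logs every attempt and retries on failure. The names `run_step` and `flaky_extract` are hypothetical, and the failing step is simulated rather than a real network call.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")


def run_step(name, func, retries=3, delay=0.0):
    """Run one ETL step, logging each attempt and retrying when it raises."""
    for attempt in range(1, retries + 1):
        try:
            result = func()
            logger.info("%s succeeded on attempt %d", name, attempt)
            return result
        except Exception:
            # logger.exception records the traceback along with the message
            logger.exception("%s failed on attempt %d of %d", name, attempt, retries)
            if attempt == retries:
                raise  # give up and surface the last error to the caller
            time.sleep(delay)


# Hypothetical flaky step standing in for a network request
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("simulated network error")
    return "payload"


result = run_step("extract", flaky_extract)
print(result, calls["count"])
```

In the notebook, the extract, transform, and load cells could each be passed to such a runner as a function, with a non-zero `delay` for steps that call external services.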