# API-Based ETL Pipeline for COVID-19 Data  

## Overview  
This notebook demonstrates an **ETL (Extract, Transform, Load) pipeline** that extracts COVID-19 data from an API, processes it using pandas, and prepares it for further analysis or storage in a database.  

## Objectives:  
- **Extract** real-time COVID-19 data from an open API.  
- **Transform** the data (filtering, cleaning, formatting).  
- **Load** the structured data into a PostgreSQL database.  

## Technologies Used:  
- **Python** (`requests`, `pandas`, `sqlalchemy`, `psycopg2`)  
- **REST API** (as the data source)  
- **PostgreSQL** (for storing processed data)  

## Expected Outcome:  
By the end of this notebook, structured COVID-19 data will be available in PostgreSQL, ready for querying and analysis.  


### Step 1: Extract data from the API into a pandas DataFrame

In [39]:
# Import Libraries

import pandas as pd  # Data extraction, transformation, and analysis
from sqlalchemy import create_engine  # Creates an engine that manages and reuses database connections
import requests  # Sends HTTP requests to the API (the Extract step in ETL)
import psycopg2  # PostgreSQL driver that SQLAlchemy uses to connect to the database

# Why use requests
# 1. Access Data from APIs:
#    - Many applications provide data through APIs (e.g., REST APIs).
#    - The requests library allows us to easily connect to these APIs and retrieve data in JSON, XML, or other formats.
#
# 2. Make HTTP Requests:
#    - Supports various HTTP methods:
#      - GET: Fetch data (commonly used for ETL).
#      - POST: Send data or parameters to an API.
#      - PUT/DELETE: Update or delete data on a server. 

In [17]:
# Set the URL for the API endpoint
url = "https://disease.sh/v3/covid-19/countries"

# Send a GET request to the API (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Get the JSON data from the response
    data = response.json()
    
    # Create a DataFrame from the data
    df = pd.DataFrame(data)  
    
    # Display the first few rows
    display(df.head())  # One row per country
else:
    print(f"Error: {response.status_code}")




Unnamed: 0,updated,country,countryInfo,cases,todayCases,deaths,todayDeaths,recovered,todayRecovered,active,...,tests,testsPerOneMillion,population,continent,oneCasePerPeople,oneDeathPerPeople,oneTestPerPeople,activePerOneMillion,recoveredPerOneMillion,criticalPerOneMillion
0,1737655687634,Afghanistan,"{'_id': 4, 'iso2': 'AF', 'iso3': 'AFG', 'lat':...",234174,0,7996,0,211080,0,15098,...,1390730,34125,40754388,Asia,174,5097,29,370.46,5179.32,0.0
1,1737655687625,Albania,"{'_id': 8, 'iso2': 'AL', 'iso3': 'ALB', 'lat':...",334863,0,3605,0,330233,0,1025,...,1941032,677173,2866374,Europe,9,795,1,357.59,115209.32,0.0
2,1737655687629,Algeria,"{'_id': 12, 'iso2': 'DZ', 'iso3': 'DZA', 'lat'...",272010,0,6881,0,183061,0,82068,...,230960,5093,45350148,Africa,167,6591,196,1809.65,4036.61,0.0
3,1737655687688,Andorra,"{'_id': 20, 'iso2': 'AD', 'iso3': 'AND', 'lat'...",48015,0,165,0,0,0,47850,...,249838,3225256,77463,Europe,2,469,0,617714.26,0.0,0.0
4,1737655687656,Angola,"{'_id': 24, 'iso2': 'AO', 'iso3': 'AGO', 'lat'...",107327,0,1937,0,103419,0,1971,...,1499795,42818,35027343,Africa,326,18083,23,56.27,2952.52,0.0
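
The extract cell above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the function name `fetch_countries`, the `session` parameter, and the 10-second timeout are assumptions chosen for illustration. It raises an exception on a non-2xx response instead of printing the status code, so a failed extract cannot silently leave `df` undefined.

```python
import pandas as pd
import requests

API_URL = "https://disease.sh/v3/covid-19/countries"


def fetch_countries(url=API_URL, session=None, timeout=10):
    """Fetch the per-country records and return them as a DataFrame.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or any object with a compatible .get() method for testing.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of continuing
    response.raise_for_status()
    # The endpoint returns a JSON list with one object per country
    return pd.DataFrame(response.json())
```

Calling `df = fetch_countries()` would then replace the `if`/`else` block, with any HTTP error surfacing as an exception at the point of failure.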


In [18]:
df.shape  # Check the DataFrame's dimensions: (rows, columns)

(231, 23)

In [19]:
df.columns

Index(['updated', 'country', 'countryInfo', 'cases', 'todayCases', 'deaths',
       'todayDeaths', 'recovered', 'todayRecovered', 'active', 'critical',
       'casesPerOneMillion', 'deathsPerOneMillion', 'tests',
       'testsPerOneMillion', 'population', 'continent', 'oneCasePerPeople',
       'oneDeathPerPeople', 'oneTestPerPeople', 'activePerOneMillion',
       'recoveredPerOneMillion', 'criticalPerOneMillion'],
      dtype='object')

### Step 2: Transform the data (i.e., clean the data)

In [20]:
# Remove columns that are not needed downstream
remove_col = [
    'countryInfo', 'casesPerOneMillion', 'deathsPerOneMillion',
    'testsPerOneMillion', 'continent', 'oneCasePerPeople',
    'oneDeathPerPeople', 'oneTestPerPeople', 'activePerOneMillion',
    'recoveredPerOneMillion', 'criticalPerOneMillion',
]
df = df.drop(columns=remove_col)
df

Unnamed: 0,updated,country,cases,todayCases,deaths,todayDeaths,recovered,todayRecovered,active,critical,tests,population
0,1737655687634,Afghanistan,234174,0,7996,0,211080,0,15098,0,1390730,40754388
1,1737655687625,Albania,334863,0,3605,0,330233,0,1025,0,1941032,2866374
2,1737655687629,Algeria,272010,0,6881,0,183061,0,82068,0,230960,45350148
3,1737655687688,Andorra,48015,0,165,0,0,0,47850,0,249838,77463
4,1737655687656,Angola,107327,0,1937,0,103419,0,1971,0,1499795,35027343
...,...,...,...,...,...,...,...,...,...,...,...,...
226,1737655687788,Wallis and Futuna,3550,0,8,0,438,0,3104,0,20508,10982
227,1737655687803,Western Sahara,10,0,1,0,9,0,0,0,0,626161
228,1737655687759,Yemen,11945,0,2159,0,9124,0,662,0,329592,31154867
229,1737655687621,Zambia,349304,0,4069,0,341316,0,3919,0,4112961,19470234
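
Dropping `countryInfo` also discards the ISO country codes nested inside it. If those codes are useful for later joins, one option is to flatten the nested dictionary before dropping columns. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses two hand-written sample records containing only the keys visible in the Step 1 output, and the short column names `iso2` and `iso3` are hypothetical choices.

```python
import pandas as pd

# Two sample records shaped like the API response. Only keys visible in the
# notebook output are included; the real records carry more fields.
records = [
    {"country": "Afghanistan", "cases": 234174,
     "countryInfo": {"_id": 4, "iso2": "AF", "iso3": "AFG"}},
    {"country": "Albania", "cases": 334863,
     "countryInfo": {"_id": 8, "iso2": "AL", "iso3": "ALB"}},
]

# json_normalize expands nested dicts into columns named "<parent><sep><child>"
flat = pd.json_normalize(records, sep="_")

# Keep the ISO codes under shorter (hypothetical) names and drop the internal id
flat = flat.rename(columns={"countryInfo_iso2": "iso2", "countryInfo_iso3": "iso3"})
flat = flat.drop(columns=["countryInfo__id"])

print(flat)
```

In the notebook, the same idea would apply to the full response via `pd.json_normalize(data, sep="_")` in place of `pd.DataFrame(data)`.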


In [21]:
df.shape

(231, 12)

In [22]:
# Check duplicates
df.duplicated().sum()


np.int64(0)

In [23]:
# Count missing values in each column
df.isnull().sum()

updated           0
country           0
cases             0
todayCases        0
deaths            0
todayDeaths       0
recovered         0
todayRecovered    0
active            0
critical          0
tests             0
population        0
dtype: int64

In [24]:
# Sort by country name in ascending order
df = df.sort_values(by='country', ascending=True)
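
The `updated` column holds large integers such as `1737655687634`, which look like Unix timestamps in milliseconds. Assuming that interpretation is correct, a further transform step could convert them to timezone-aware datetimes so they are readable in SQL queries. This is a sketch on two sample rows; the column name `updated_at` is a hypothetical choice.

```python
import pandas as pd

# Sample rows copied from the notebook output
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "updated": [1737655687634, 1737655687625],
})

# Interpret the integers as milliseconds since the Unix epoch (assumption), in UTC
sample["updated_at"] = pd.to_datetime(sample["updated"], unit="ms", utc=True)

print(sample[["country", "updated_at"]])
```

When written with `to_sql`, a timezone-aware datetime column like this is typically stored as a timestamp type rather than a plain integer.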

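### Step 3: Create a database
Open pgAdmin 4 and create the target database (`Covid19`, the name used in Step 4). Creating the table by hand is optional, because `to_sql` in Step 4 creates the `all_countries` table if it does not already exist.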
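
If you prefer to define the table explicitly rather than let pandas infer it, the schema can be described in SQLAlchemy and rendered as PostgreSQL DDL without a live connection. The sketch below is one possible schema, not the one used in this notebook: the `BIGINT` column types and the primary key on `country` are assumptions based on the cleaned DataFrame, which showed no duplicate rows.

```python
from sqlalchemy import BigInteger, Column, MetaData, Table, Text
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

metadata = MetaData()

# Integer-valued columns kept after the cleaning step
numeric_columns = [
    "updated", "cases", "todayCases", "deaths", "todayDeaths", "recovered",
    "todayRecovered", "active", "critical", "tests", "population",
]

all_countries = Table(
    "all_countries",
    metadata,
    Column("country", Text, primary_key=True),  # assumed unique per row
    *[Column(name, BigInteger) for name in numeric_columns],
)

# Render the CREATE TABLE statement for PostgreSQL; no database is contacted
ddl = str(CreateTable(all_countries).compile(dialect=postgresql.dialect()))
print(ddl)
```

The printed statement can be pasted into the pgAdmin 4 query tool, or `metadata.create_all(engine)` can run it against a connected engine. Note that `to_sql(..., if_exists='replace')` drops and recreates the table, so a predefined schema is only preserved with `if_exists='append'`.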

### Step 4: Load the clean data into the database

In [36]:
# Database credentials
username = "postgres"
password = "********"
host = "localhost"
port = "5432"
db_name = "Covid19"

In [37]:
# Establish a connection
engine = create_engine(f'postgresql://{username}:{password}@{host}:{port}/{db_name}')
try:
    with engine.connect():
        print("Connection successful!")
except Exception as e:
    print(f"Connection failed: {e}")


Connection successful!
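
An f-string connection URL breaks if the password contains characters such as `@` or `/`, and it keeps the password in the notebook. One alternative, sketched here under assumptions, is to build the URL with `sqlalchemy.engine.URL.create` (available in SQLAlchemy 1.4 and later) and read the password from an environment variable. The variable name `PGPASSWORD` is an assumption borrowed from the PostgreSQL client convention; any name would work.

```python
import os

from sqlalchemy.engine import URL

# Build the URL from components; special characters in the password are escaped
db_url = URL.create(
    drivername="postgresql+psycopg2",
    username="postgres",
    password=os.environ.get("PGPASSWORD", ""),  # assumed environment variable
    host="localhost",
    port=5432,
    database="Covid19",
)

# Safe to print or log: the password is masked in the rendered string
print(db_url.render_as_string(hide_password=True))
```

The resulting object can be passed directly to `create_engine(db_url)` in place of the formatted string.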


In [38]:
# Load the DataFrame into the all_countries table (replacing the table if it already exists)
df.to_sql('all_countries', engine, if_exists='replace', index=False)

# Dispose of the engine to release its pooled connections
engine.dispose()
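
A quick way to confirm a load is to read the row count back and compare it with the DataFrame. The sketch below demonstrates the round trip with an in-memory SQLite engine standing in for PostgreSQL, so it runs without a database server; that substitution and the name `demo_engine` are assumptions made for illustration. Against the real database, the same check would run on the PostgreSQL engine before `engine.dispose()`.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for PostgreSQL so the sketch needs no server
demo_engine = create_engine("sqlite://")

# Two sample rows copied from the notebook output
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "cases": [234174, 334863],
})

# Same call as in the load step, pointed at the stand-in engine
sample.to_sql("all_countries", demo_engine, if_exists="replace", index=False)

# Read the row count back and compare it with the source DataFrame
loaded = pd.read_sql("SELECT COUNT(*) AS row_count FROM all_countries", demo_engine)
row_count = int(loaded.loc[0, "row_count"])
assert row_count == len(sample), "Row count mismatch after load"

demo_engine.dispose()
```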

## Summary  

✅ Successfully extracted real-time COVID-19 data from an open API.  
✅ Performed **data cleaning and transformation** for structured storage.  
✅ Loaded the cleaned dataset into the `all_countries` table of a **PostgreSQL database**.  
  
