In this practice project, you will apply the skills acquired through the course to create a complete ETL pipeline that accesses data from a website and processes it to meet the requirements.
You have been recruited as a junior Data Engineer by an international firm that is looking to expand its business in different countries across the world. You are tasked with creating an automated script that can extract the list of all countries in order of their GDPs in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Since the IMF releases this evaluation twice a year, the organization will use this code to extract the information as it is updated.
You can find the required data on this webpage.
The required information needs to be made accessible as a JSON file 'Countries_by_GDP.json' as well as a table 'Countries_by_GDP' in a database file 'World_Economies.db', with attributes 'Country' and 'GDP_USD_billion'.
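A minimal load sketch under these requirements, assuming the transformed data is already in a pandas DataFrame named df with the attributes above (the function and variable names are illustrative, not prescribed by the brief):

```python
import sqlite3
import pandas as pd

def load_to_json(df, json_path):
    # Write the transformed DataFrame to the required JSON file.
    df.to_json(json_path, orient="records")

def load_to_db(df, conn, table_name):
    # Write the DataFrame as a table in the database file,
    # replacing any table left over from a previous run.
    df.to_sql(table_name, conn, if_exists="replace", index=False)

# Usage under these assumptions:
# conn = sqlite3.connect("World_Economies.db")
# load_to_json(df, "Countries_by_GDP.json")
# load_to_db(df, conn, "Countries_by_GDP")
```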
Your boss wants you to demonstrate the success of this code by running a query on the database table to display only the entries with an economy of more than 100 billion USD. Also, log the entire process of execution in a file named 'etl_project_log.txt'.
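A sketch of the query and logging pieces, under the same assumptions; the timestamp and message format in the log file are choices, not requirements:

```python
import sqlite3
import pandas as pd
from datetime import datetime

def run_query(query, conn):
    # Run the demonstration query and display the result.
    print(query)
    print(pd.read_sql(query, conn))

def log_progress(message):
    # Append a timestamped message to the required log file.
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("etl_project_log.txt", "a") as f:
        f.write(f"{timestamp} : {message}\n")

# Entries with an economy of more than 100 billion USD:
# run_query("SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion > 100",
#           sqlite3.connect("World_Economies.db"))
```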
You must create a Python script 'etl_project_gdp.py' that performs all the required tasks.
You have to complete the following tasks for this project:
- Write a data extraction function to retrieve the relevant information from the required URL (a sketch of this function follows the list).
- Transform the available GDP information from 'Million USD' to 'Billion USD'.
- Load the transformed information into the required JSON file and as a table in the required database file.
- Run the required query on the database.
- Log the progress of the code with appropriate timestamps.
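Below is a sketch of the extraction and transformation functions. The position of the GDP table on the page and the cell layout are assumptions about the page structure; inspect the actual webpage and adjust the indices to match:

```python
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

def extract(url):
    # Fetch the page and parse the GDP table into a DataFrame.
    # The tbody index and cell positions are assumptions, not
    # values given in the project brief.
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    table = soup.find_all("tbody")[2]               # assumed table index
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 3 and cells[0].find("a") is not None:
            records.append({"Country": cells[0].a.get_text(strip=True),
                            "GDP_USD_million": cells[2].get_text(strip=True)})
    return pd.DataFrame(records)

def transform(df):
    # Convert the GDP figures from 'Million USD' to 'Billion USD',
    # rounded to 2 decimal places with numpy.
    gdp = df["GDP_USD_million"].str.replace(",", "").astype(float)
    df["GDP_USD_billion"] = np.round(gdp / 1000, 2)
    return df[["Country", "GDP_USD_billion"]]
```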
Before you start building the code, you need to install the required libraries for it.
The libraries needed for the code are as follows; the corresponding import statements are shown after this list:
- requests - The library used for accessing the information from the URL.
- bs4 - The library containing the BeautifulSoup function used for web scraping.
- pandas - The library used for processing the extracted data, storing it in the required formats, and communicating with the databases.
- sqlite3 - The library required to create a connection to the database file.
- numpy - The library required for the mathematical rounding operation, as required in the objectives.
- datetime - The library containing the datetime function used for extracting the timestamp for logging purposes.
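Of these, sqlite3 and datetime ship with Python, while the rest can be installed with pip. Once installed, the top of 'etl_project_gdp.py' would begin with the corresponding imports:

```python
import requests                   # access the information from the URL
from bs4 import BeautifulSoup     # parse the page HTML for web scraping
import pandas as pd               # process and store the extracted data
import sqlite3                    # connect to the database file
import numpy as np                # round GDP values to 2 decimal places
from datetime import datetime     # timestamps for the log entries
```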
In this project, you performed complex Extract, Transform, and Load operations on real-world data. By now, you should be able to:
- Extract relevant information from websites using web scraping and the requests library.
- Transform the data to a required format.
- Load the processed data to a local file or as a database table.
- Query the database table using Python.
- Create detailed logs of all operations conducted.