This repository contains data engineering and data science projects and exercises using open data sources, created as part of the AMSE/SAKI course taught by the FAU Chair for Open-Source Software (OSS) in the Winter '23 semester. This repo is forked from the 2023-amse-template repository.

The task was to build a data engineering project that takes at least two publicly available data sources and processes them with an automated data pipeline in order to report findings from the result.

# Netherlands R&D Expenditure in Relation to Employees from Abroad


Overview

The aim of this project is to investigate whether there is a relationship between R&D expenditure and the number of employees from abroad working in the Netherlands. To do so, it uses openly available data sources.

For details, see the project plan.

Project Structure

```
project/
├── pipeline.py                 # ETL data pipeline implementation
├── pipeline.sh                 # Bash script for running the ETL data pipeline
├── requirements.txt            # External library dependencies
├── test_pipeline.py            # Test cases for component & system testing
├── tests.sh                    # Bash script for running all test cases
├── Know_data_sources.ipynb     # Notebook for data exploration
├── report.ipynb                # Notebook for the final project report
└── project-plan.md             # Project plan and documentation
```

Important files of the project and their roles:

  • project/pipeline.sh: This Bash script runs an automated ETL pipeline that creates two SQLite databases, employees_data.sqlite and R&D_Expenditure.sqlite, containing tables representing the project's two open data sources.
  • project/tests.sh: A Bash script that executes the component- and system-level tests for the project.
  • project/report.ipynb: This Jupyter notebook serves as the final report for the project, providing a comprehensive exploration of all aspects and findings. The report investigates how strongly the increase in employees from abroad relates to the increase in the Netherlands' country-wide R&D expenditure over the years 2013 to 2017, addressing various key questions based on the data in employees_data.sqlite and R&D_Expenditure.sqlite. See the report.
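The repository's actual pipeline.py is not reproduced here, but a single extract-transform-load step of the kind it implements can be sketched as follows. The column names, values, and the `load_to_sqlite` helper are illustrative assumptions, not the project's real code; in the real pipeline the extract step would read from the remote open data source instead of an inline DataFrame.

```python
# Minimal, hypothetical sketch of one ETL step writing to an SQLite database.
import sqlite3
import pandas as pd

def load_to_sqlite(df, table, db_path):
    """Write a cleaned DataFrame into an SQLite table, replacing old contents."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

# Extract: stand-in data; the real pipeline would use pd.read_csv(<source URL>).
raw = pd.DataFrame({"year": [2013, 2017], "employees_abroad": [51000, 74000]})

# Transform: keep only the study period 2013-2017.
clean = raw[(raw["year"] >= 2013) & (raw["year"] <= 2017)]

# Load into the database that the report reads from.
load_to_sqlite(clean, "employees", "employees_data.sqlite")
```

The same pattern repeats per data source, which is why the pipeline produces one database file per source.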

Continuous Integration Pipeline using GitHub Action:

A Continuous Integration pipeline has been implemented using a GitHub Action defined in Continuous Integration. It is triggered whenever changes to the project/ directory (with a few exceptions: Know_data_sources.ipynb, report.ipynb, project-plan.md) are pushed to the GitHub repository, or when a pull request is created and merged into the main branch. The Project_feedback.yml workflow executes the project/tests.sh test script and, in case of any failure, sends an error message to the owner's email, as shown in the sample screenshot below:
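The repository's actual workflow file is not shown here, but a workflow with the trigger rules described above could look roughly like the following sketch. The action versions, Python version, and job layout are assumptions, and the failure-notification step is omitted:

```yaml
# Hypothetical sketch of .github/workflows/Project_feedback.yml
name: Project Feedback
on:
  push:
    paths:
      - "project/**"
      - "!project/Know_data_sources.ipynb"
      - "!project/report.ipynb"
      - "!project/project-plan.md"
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Run project tests
        run: |
          cd project
          pip install -r requirements.txt
          bash tests.sh
```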

*(screenshot: slack-alert)*

Project Setup

  1. Clone this git repository:
     ```bash
     git clone git@github.com:Malik-Naeem-Awan/made-project-FAU.git
     ```
  2. Install Python, then create a virtual environment inside the repo and activate it:
     ```bash
     python3 -m venv <env_name>
     source <env_name>/bin/activate
     ```
  3. Go to the project/ directory, then download and install the required Python packages:
     ```bash
     pip install -r requirements.txt
     ```
  4. To run the project, go to the project/ directory and run the pipeline.sh Bash script. It runs the whole ETL pipeline and generates the SQLite databases employees_data.sqlite and R&D_Expenditure.sqlite, which contain the tables employees and R&D_Expenditure, representing the project's two open data sources:
     ```bash
     chmod +x pipeline.sh
     sh pipeline.sh
     ```
  5. To execute the component- and system-level tests for the project, run:
     ```bash
     chmod +x tests.sh
     sh tests.sh
     ```
  6. Finally, run and explore the project/report.ipynb notebook; feel free to modify it.
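Once the pipeline has produced the two databases, the kind of analysis the report notebook performs can be sketched as follows. This is a hypothetical illustration: the table schemas and numbers below are made up, and the real notebook works on the generated .sqlite files rather than an in-memory database.

```python
# Illustrative sketch: join both series on year and measure co-movement.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the two generated .sqlite files
conn.execute("CREATE TABLE employees (year INT, from_abroad INT)")
conn.execute("CREATE TABLE rd_expenditure (year INT, expenditure REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(2013, 51), (2014, 55), (2015, 60), (2016, 67), (2017, 74)])
conn.executemany("INSERT INTO rd_expenditure VALUES (?, ?)",
                 [(2013, 12.7), (2014, 13.0), (2015, 13.3), (2016, 14.0), (2017, 14.6)])

# Join the two tables on the shared year column.
merged = pd.read_sql_query(
    "SELECT e.year, e.from_abroad, r.expenditure "
    "FROM employees e JOIN rd_expenditure r ON e.year = r.year", conn)

# A simple Pearson correlation over the five study years.
print(merged["from_abroad"].corr(merged["expenditure"]))
```

A high positive correlation alone would not establish causation, which is why the report addresses several complementary questions rather than a single statistic.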

Exercises (not part of the project)

During the semester we had to complete exercises, sometimes using Python, and sometimes using Jayvee. Automated exercise feedback is provided using a GitHub action that is defined in .github/workflows/exercise-feedback.yml.

  1. exercises/exercise1.jv
  2. exercises/exercise2.py
  3. exercises/exercise3.jv
  4. exercises/exercise4.py
  5. exercises/exercise5.jv

The exercise feedback runs whenever we change files in the exercises/ directory and push our local changes to the repository on GitHub. To see the feedback, open the latest GitHub Actions run, then open the exercise-feedback job and its Exercise Feedback step.

About

Template repository for the Methods of Advanced Data Engineering course at FAU
