
UK traffic analysis 🚦

About

The main goal of the project was to create a data warehouse in a star architecture based on data from https://www.cs.put.poznan.pl/kjankiewicz/bigdata/projekt2/uk-trafic.zip. The data come from the Department for Transport of the United Kingdom. The main tasks of the project are listed below:

  • design a data warehouse in a star architecture with a fact table containing 1-5 measures and 3-5 dimension tables, including a mandatory time dimension
  • design 2-3 analyses based on that data warehouse, each using 3 dimensions and one measure
  • implement the data warehouse using Delta Lake tables (a minimal sketch follows this list)
  • implement the ETL processes as Spark programs written in Scala, compiled to executable .jar files that can be run repeatedly and automatically to load the data warehouse directly from the source data
  • orchestrate the data workflow using a tool such as Apache Airflow
  • implement the designed analyses (described in the analysis_description.txt file in this repo) in a notebook with formatted graphs
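
A minimal sketch of how a fact or dimension table can be persisted as a Delta Lake table with Spark (the helper name, DataFrame names and bucket paths below are illustrative assumptions, not the actual warehouse schema):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: write a DataFrame as a Delta Lake table under the given path
// (requires the Delta Lake package on the Spark classpath).
def saveAsDelta(df: DataFrame, path: String): Unit =
  df.write.format("delta").mode("overwrite").save(path)

// Hypothetical usage for the star schema (dimTime and factTraffic are assumed DataFrames):
// saveAsDelta(dimTime, "gs://<bucket_name>/uk-traffic/warehouse/dim_time")
// saveAsDelta(factTraffic, "gs://<bucket_name>/uk-traffic/warehouse/fact_traffic")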

The data were loaded following best practices: RDDs for non-structured data sources and DataFrames for structured ones, as sketched below. All the implemented programs are also available in the scala_programs folder.
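
A rough illustration of the two loading styles (the file names, separator and bucket paths are assumptions for illustration only, not the real source layout):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-uk-traffic").getOrCreate()

// Structured source: read a CSV file straight into a DataFrame.
val trafficDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("gs://<bucket_name>/uk-traffic/traffic_counts.csv") // hypothetical file name

// Non-structured source: start from an RDD of raw lines and parse it manually.
val rawRdd = spark.sparkContext.textFile("gs://<bucket_name>/uk-traffic/regions.txt") // hypothetical file name
val parsedRdd = rawRdd.map(_.split(",").map(_.trim)).filter(_.nonEmpty)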

Getting started and usage

First, upload the source data from the .zip archive to the uk-traffic folder in your private bucket. To run the project you need to set up a cluster (an example run-cluster command for Google Cloud Platform is available in the corresponding file in this repo). Then upload the jars from the jars folder and the project2.py file, which schedules the data workflow in Apache Airflow. Next, run the following commands in the SSH CLI to set up Airflow:

# set the Airflow home directory and install Airflow for the current user
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
export PATH=$PATH:~/.local/bin
# initialise the metadata database and start all Airflow components
airflow db init
airflow standalone

Now save the credentials for the Airflow service, which will be:

  • username: admin
  • password: copy the auto-generated password printed in the SSH CLI

Then stop the Airflow service, create a dags folder inside the airflow directory, move the jars and the project2.py file to their destinations, and start Airflow again:

# create a folder for the DAG and its jar dependencies
mkdir -p ~/airflow/dags/project_files
mv project2.py ~/airflow/dags/
mv *.jar ~/airflow/dags/project_files
# restart the webserver (as a daemon on port 8081) and the scheduler
airflow webserver --port 8081 -D
airflow scheduler

Keep this console open. Finally, configure a tunnel for localhost:8081 between GCP and your local machine; this can be done easily, e.g. in the PuTTY client. Then you can log in to the Airflow service using the previously stored credentials, find your project workflow and trigger it, providing your Google account user_name and your private bucket_name. Building the data warehouse should take approximately 10 minutes (see the duration_of_building_warehouse_pipeline file). To run the analyses on the warehouse, attach the Final_analysis_results.zpln file to the Zeppelin client in GCP. Note that you first need to configure the Spark interpreter to work with Delta Lake tables (a minimal configuration sketch follows). You can also run the commands from the notebook in Databricks and explore the results of the analyses in depth ;)
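
For reference, a minimal sketch of the Spark configuration Delta Lake needs (in Zeppelin the two .config(...) options go into the Spark interpreter settings instead; the table path is an assumption):

import org.apache.spark.sql.SparkSession

// Spark session configured for Delta Lake (requires the Delta Lake package on the classpath).
val spark = SparkSession.builder()
  .appName("uk-traffic-analysis")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Read one of the warehouse tables back from the bucket (the path is a hypothetical example).
val factTraffic = spark.read.format("delta").load("gs://<bucket_name>/uk-traffic/warehouse/fact_traffic")
factTraffic.show(10)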

Warehouse schema

(See the warehouse schema diagram in the repository.)
