Data_Engineering

Standard data engineering practices of Extract, Transform, Load (ETL)

A data warehouse is a large, centralized repository for storing and managing an organization's data from various sources. Its purpose is to provide a single source of truth for all data in an organization, allowing for easy analysis and reporting.
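As a rough illustration of the ETL flow behind a warehouse load, the sketch below extracts records from a JSON file, transforms them with pandas, and loads them into a PostgreSQL table. The file path, table name, and connection settings are hypothetical placeholders, not the exact ones used in this repository.

```python
import json
import pandas as pd
import psycopg2

# Extract: read raw event records from a source file (hypothetical path)
with open("data/events.json") as f:
    records = json.load(f)

# Transform: normalize into a DataFrame and keep only the columns we need
df = pd.DataFrame(records)
df["event_time"] = pd.to_datetime(df["event_time"])
df = df[["user_id", "event_time", "event_type"]].dropna()

# Load: insert the cleaned rows into a warehouse table (hypothetical credentials)
conn = psycopg2.connect("host=127.0.0.1 dbname=warehouse user=student password=student")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id    int,
        event_time timestamp,
        event_type varchar
    );
""")
for row in df.itertuples(index=False):
    cur.execute(
        "INSERT INTO events (user_id, event_time, event_type) VALUES (%s, %s, %s);",
        (int(row.user_id), row.event_time.to_pydatetime(), str(row.event_type)),
    )
conn.commit()
conn.close()
```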


Terraform is an open-source infrastructure-as-code (IaC) tool developed by HashiCorp. It allows developers to manage and provision infrastructure resources such as virtual machines, networks, and storage using code.

Terraform uses a declarative language to define the desired state of infrastructure resources, allowing developers to easily create, modify, and destroy infrastructure resources using version-controlled configuration files. This enables teams to automate infrastructure provisioning and ensure consistency across environments.


Table of Contents

  1. Project Motivation
  2. Requirements
  3. Contents
  4. Licensing, Authors, and Acknowledgements

Project Motivation

This project experiments with several data engineering practices for preparing data from different sources into formats that are readily available for descriptive and predictive analysis.

Requirements

  1. SQL
  2. Python3
  3. Pandas

Contents

This repo contains three folders, each covering a different data modeling scheme:

  1. Relational_DBMS
  2. NoSQL
  3. Data_Warehousing

Relational_DBMS

The relational DB section covers the management of structured, relational database systems using SQL (Structured Query Language) for querying and maintenance. Here, the PostgreSQL engine is used for data modeling operations including table creation, joins, normalization, denormalization, schemas, and warehousing.

Requirements: 'python3', 'postgresql', 'sql', 'pandas', 'numpy' and 'json'
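A minimal sketch of the kind of PostgreSQL operations this section covers, using psycopg2: creating two normalized tables and joining them back together for a report. The table names, sample rows, and connection settings below are hypothetical, not the repository's actual schema.

```python
import psycopg2

# Hypothetical local connection; adjust credentials for your environment
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Normalized schema: customers and orders live in separate tables,
# linked by a foreign key instead of repeating customer details per order
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id int PRIMARY KEY,
        name        varchar
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id    int PRIMARY KEY,
        customer_id int REFERENCES customers (customer_id),
        amount      numeric
    );
""")

cur.execute("INSERT INTO customers VALUES (1, 'Amy') ON CONFLICT DO NOTHING;")
cur.execute("INSERT INTO orders VALUES (10, 1, 25.50) ON CONFLICT DO NOTHING;")

# A JOIN reassembles the normalized data for reporting
cur.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id;
""")
print(cur.fetchall())
conn.close()
```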

NoSQL

The non-relational database section implements a non-tabular schema optimized for the specific requirements of the data being stored. Here, the CQL of the Cassandra engine is used for data modeling operations including table creation, denormalization (CQL does not support joins, so data is duplicated per query), and query clauses.

Requirements: 'python3', 'cassandra', 'psycopg2', 'pandas', 'numpy' and 'json'
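A hedged sketch of the CQL pattern this section follows, using the cassandra Python driver: the table is modeled around a single query (partition key first) and the data is denormalized rather than joined. The keyspace, table, and sample row below are illustrative, not the notebooks' exact schema.

```python
from cassandra.cluster import Cluster

# Hypothetical local Cassandra node
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music_app
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
session.set_keyspace("music_app")

# Table modeled for one query: "songs played in a given session, ordered by item".
# Artist and song details are denormalized into the same row (no joins in CQL).
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      int,
        item_in_session int,
        artist          text,
        song_title      text,
        PRIMARY KEY (session_id, item_in_session)
    );
""")

session.execute(
    "INSERT INTO songs_by_session (session_id, item_in_session, artist, song_title) "
    "VALUES (%s, %s, %s, %s)",
    (338, 4, "Some Artist", "Some Song"),
)

# WHERE clauses must follow the primary key structure (partition key first)
rows = session.execute("SELECT artist, song_title FROM songs_by_session WHERE session_id = 338")
for row in rows:
    print(row.artist, row.song_title)

cluster.shutdown()
```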

Data_Warehousing

The Data_Warehousing section uses PostgreSQL and CQL to manage schemas on the Pagila dataset, covering ETL, fact and dimension tables, OLTP vs. OLAP, and OLAP cubes.

Requirements: 'python3', 'postgresql', 'sql', 'pandas', 'numpy' and 'json'
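A rough sketch of the star-schema idea applied in this section: a fact table of sales referencing dimension tables, queried with a grouped roll-up in the spirit of an OLAP cube. Table and column names are loosely Pagila-flavored but illustrative, not the exact schema used in the notebooks.

```python
import psycopg2

# Hypothetical connection to a database holding the Pagila-derived tables
conn = psycopg2.connect("host=127.0.0.1 dbname=pagila user=student password=student")
cur = conn.cursor()

# Star schema: one fact table (measurements) surrounded by dimension tables (context)
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_key int PRIMARY KEY,
        year     int,
        month    int
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_store (
        store_key int PRIMARY KEY,
        city      varchar
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        sales_key    serial PRIMARY KEY,
        date_key     int REFERENCES dim_date (date_key),
        store_key    int REFERENCES dim_store (store_key),
        sales_amount numeric
    );
""")

# An OLAP-style roll-up: total sales by month and city
cur.execute("""
    SELECT d.month, s.city, SUM(f.sales_amount) AS revenue
    FROM fact_sales f
    JOIN dim_date  d ON f.date_key  = d.date_key
    JOIN dim_store s ON f.store_key = s.store_key
    GROUP BY d.month, s.city
    ORDER BY revenue DESC;
""")
print(cur.fetchall())

conn.commit()
conn.close()
```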

Licensing, Authors, and Acknowledgements

The Pagila PostgreSQL movie rental dataset is used for analysis in this work. Licensing for the data and other descriptive information can be found at the dataset's source. Otherwise, feel free to use the code here as you would like.
