- How to choose a Distributed Database
- CockroachDB Architecture
- Amundsen Review
- Deep Dive - FoundationDB
- The What, Why, and When of Single-Table Design with DynamoDB
- How To Manage And Monitor Apache Spark On Kubernetes
- Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible
- 8 Practical Use Cases of Change Data Capture
- Apache Iceberg - Links
- Kubernetes Port Forwarding Manager
- Querying Parquet with Precision using DuckDB - Much faster compared to Pandas
- What is Apache Pinot - Usecases & Architecture
- Change Data Streaming Patterns in Distributed Systems
- Cuckoo Hashing - An alternative to chaining and linear probing for collision handling
- Riak Database
- Database Indexing
- Parallel Databases using Map Reduce
- REST vs GraphQL
- Linux Namespace & Control Group(cgroup)
- SQL Lexical Structure
- Everything about the Linux kernel
- How #dataengineering gets complicated over time
- What is eBPF - Sandboxing Programs inside #linux Kernel
- Absolute Basic Explanation of SSTable & Log Structured Merge Trees - Sorted String Table & Faster Random Writes
- Getting started with #dataengineering Volume 6
- Getting started with Dataengineering Volume 5
- Getting started with Data Engineering, volume 4
- Getting started with Data Engineering, volume 3
- Getting started with Data Engineering, volume 2
- Getting started with Data Engineering, volume 1
- Getting started with #dataengineering from basics
- Apache Airflow 2.0
- Some Interesting essentials while learning Apache Airflow
- Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release
- #getdbt or Data Build Tool interface across all major Data Workflow Management Platforms
- Apache Superset - An #opensource Fully Featured Business Intelligence Application
- The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration
- Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier!
- Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL
- Apache Spark is NOT a Map but an MPP/MPI Engine
- Apache Hudi - Design Principles
- OpenTelemetry specification V1.0
- Everything Around PySpark Pandas UDF
- Important skill-set of a Dataengineer - Reduce Cost
- Everything on PyFlink - Python with Apache Flink
- Delta Lake Cheat Sheet
- Dataengineering schedule breakdown, a very flexible estimate
- Parquet - Introduction & Design, An OpenSource File Format
- SQL - Avoiding Antipatterns
- Explaining Apache Kafka - In children's book format
- The Perfect #dataengineering: Top INVALID Reasons behind #datapipelines failures
- What is ETL
- What is Proxy & Reverse Proxy
- DataEngg Skills to work with DataScience
- Data Quality, A necessity for Data Driven Projects
- Essential Cloud Skills for Data Engineering
- Open Source Technologies in Data Engineering
- Kubernetes Fundamentals Required as a Data Engineer
- Apache Superset, OSS Business Intelligence for 2021
- #apachekafka as a Database - Summary on both sides, Arguments, Trade-offs & exceptional quotes
- Processing Guarantees in #apachekafka - The best resource
- Change Data Analysis with Debezium and Apache Pinot
- Optimizing Apache Kafka Producers & Consumers
- Redpanda - A NON-JVM Streaming Platform for mission critical workloads
- Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
- Apache Iceberg - an open table format for huge analytic datasets
- Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
- ZooKeeper, a distributed, open-source coordination service for distributed applications
- Apache Iceberg - Partition Evolution, it's simple but it's so amazing
- ApacheKafka without ZooKeeper Sneak Peek
- Why Data Discovery is important for Data Engineering
- Queue vs Log - Event driven Architecture
- Database Indexing
- Multiple criteria search at scale with Apache Pinot & Theta Sketches
- VM vs Containers - Similar but Different
- State of Trino aka PrestoSQL
- ETL is an extremely important component for any modern business
- Top 5 ways to complicate a #dataengineering pipeline/application
- Leader election is commonly used aka Master/Namenode/Leader/Driver
- Dagster vs Airflow - A comparison
- About Single Source of Truth in DataEngineering
- Change Data Capture for Distributed Databases
- Deep Dive on Why Apache Iceberg for Change Data Capture, using Apache Flink
- OpenMetadata is an Open Standard for Metadata. A Single place to Discover, Collaborate, and Get your data right
- About Lakehouse
- etcd - A distributed, reliable key-value store for the most critical data of a distributed system
- What is Redis
- What is Hive
- What is Data Warehouse - An Introduction
- Fundamentals of Designing Data Warehouse
- Database Relational Model - A way of looking at Data
- Data Engineering Infrastructure Notes
- A Data Engineering Story - The Beginning
- Data Engineering - More towards Data Science or Data Analytics or ...
- Data Engineering Interview Patterns
- Basic Checklists while learning Apache Spark
- #apachespark for Distributed Analytics or #businessintelligence Platform - Worth it or not?
- Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem
- Nextflow is a Workflow Manager exclusively for #bioinformatics
- #apachespark Project Zen Update - Making PySpark Better
- Design - Exactly Once Delivery & Transactional Messaging in #apachekafka
- Underrated but important skill of a Data Engineer
- Fallacies of Distributed Systems
- As a Data Engineer, some Essentials I did which really helped Data Scientists and the Team
- A very normal Data Engineering work
- What can go wrong in Distributed Data Systems
- Architect and build a #machinelearning use case end to end using Amazon SageMaker
- Around Data Discovery or Metadata Management Platforms
- Amazon S3 Object Lambda - Provide Different Views of Data to Multiple Applications
- Full Stack Data Engineer
- Data cleaning is Hard but why
- Most exciting things about #dataengineering
- The real impact of Disks on #rocksdb State Backend in Apache Flink
- Tips for Distributed System High Availability
- Interesting way of collaboration between a Data Engineer & Data Scientist
- Building DistributedLog: High-performance replicated log service
- Whiz: Data Analytics Execution Framework based on Intermediate Data
- Adding unlimited Nodes in a #dataengineering platform will eventually drop
- A typical Data Engineering Pipeline
- 'Log' is a fundamental component of a Data Engineering Ecosystem
- Flink CDC
- Readings Around Databases
- Code Review Best Practice, because developers hate code reviews
- Important Performance Criteria to measure DataEngineering Systems
- Database Internals - Storage
- Data Integration for Databases & Data Warehousing - An Introduction
- What is Protocol Buffer - An excellent important data interchange format for serialization, "Zero Copy" format
- Memcached, Redis & Elasticache - To accelerate your data or databases
- What is LSM-Tree
- Tor aka Onion Router - How does it work?
- SQL Database on Kubernetes - Best Practices
- Devtron - An Open Source DevOps on Kubernetes, written in Go
- Most Popular #opensource BI & Data Analytics Platforms
- datapipelines DataFrame API is now available with #apachebeam
- Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink
- Kubernetes API Structure
- Architecting a Kubernetes Infrastructure
- Exploring Kubernetes Operator Pattern
- Docker is an integral part of Data Engineering ML pipeline & that makes security extremely essential
- Rack awareness for #apachekafka Streams Proposal
- Dolt is Git for Data
- Toward Better Data Culture From First Principles by Uber
- Fast and Reliable Schema-Agnostic Log Analytics Platform by Uber
- Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
- Diving Deep on S3 Consistency - Insightful
- Ray - General Purpose ML Infrastructure
- Kubernetes Hardening Guide by National Security Agency
- Everything around Load Balancer
- Data Lakehouse - Is it really the end of Datawarehouse
- Real-Time Exactly-Once Processing with Apache Flink, Kafka, and Pinot @uber
- WTH is Kubernetes Operator? - An Introduction
- Lessons Learned from Sharding Postgres
- What is Kubernetes - An Introduction
- ELK Stack - Introduction of Scalable Monitoring
- What is NGINX - An Introduction
- What is Load Balancer - An Introduction
- What is OAuth 2 - Introduction|API Based Authorization
- Kubernetes & Networks - It's hard because multiple options are available
- Kubernetes Reconciliation
- Troubleshooting Kubernetes App
- Kubernetes Best Practices - Classics
- Paper: Serverless Computing: A Survey of Opportunities, Challenges, and Applications
- Choosing the Kubernetes Local Cluster
- Monitoring Kubernetes - Fundamentals of #kubernetes Infrastructure Monitoring
- Kubernetes Controller Manager
- Kubernetes: Why the Pod is still in the Pending State?
- Kubernetes Liveness & Readiness Probe
- Kubernetes Pod/Node Affinity
- Advanced SQL - Reference CS 564 Database Management Systems
- SQL and Advanced SQL - An asset
- Database Indexing - Almost Everything
- Tuning SQL queries - Tips for writing efficient & faster Queries
- Database Schema Design - Schema Design is a Complicated Necessity
- SQL Query Processing Plan - Basics
- Revisiting SQL Basics - The beginning of Data Science & Data Engineering
- Distributed Advanced Queries - Presto/Trino
- SQL Notes For Professionals, 100+ pages
- Table Partitioning
- SQL complex Queries - Nested Queries & Aggregation
- Gossip Protocols - Designed for Data Consistency & Fault-Tolerance
- Table Partitioning - An Important Concept
- Database Concurrency Control : 2 Phase Locking
- Database Entity Relationship Model
- SQL Join Fundamentals
- Database Indexing
- Database Indexing Notes
- SQL Injection Introduction
- SQL Constraints Fundamentals
- The fundamental of writing SQL queries is different from
- Building a NoSQL Database using Git
- Against SQL - An article on What is not good with #sql
- Using `EXPLAIN` for Data Problems - Things beyond SQL - 10 SQL Queries to Blow Your Mind
- Views, Stored Procedures, Functions & Triggers - SQL
- SQL Transaction & ACID Property
- How to Solve complex SQL queries
- Apache Spark SQL - The Introduction from RDBMS till SparkSQL
- Advanced SQL & Functions
- Basic & Intermediate on Database Sharding
- Complex Database Queries with PostgreSQL
- Query Evaluation - Technical Details "when you execute SQL Query"
- What is a Materialized View & how does it work in Distributed Databases
- Breaking Down NoSQL Sharding, Replication & Consistency
- Database Query Optimization Technique
- Intermediate SQL
- SQL Stored Procedures
- OLAP & OLTP - Datawarehouse Data Mining
- Database Fundamentals
- SQL Subqueries
- NewSQL Introduction - Basic to Intermediate
- SQL Intermediate & basics Deep Dive
- SQL Basics - The Starting point
- Data Warehousing & OLAP Technology
- Snowflake Datawarehouse
- RelationalAlgebra & SQL
- Logical Schema Design: SQL Database
- Kubernetes Pod Internals - Deep Dive
- The Illustrated Children's Guide to Kubernetes
- SQL Subqueries by Example
- What is Write-Ahead-Logging (WAL)
- SQL Transactions - a sequence of database operations
- Linux Productivity Tools - This is a Data Infrastructure necessity
- [NoSQL & MongoDB](https://www.linkedin.com/posts/iamabhishekchoudhary_nosql-mongodb-activity-6874231633654935553-Z66u)
- CouchDB Introduction - Document Storage Database
- Machine Learning Workflow
- Dummy Notes On Machine Learning Infrastructure
- Machine Learning Feature Store
- Deploying #machinelearning model in Production is really HARD but #MLOps can fix that.
- List of #machinelearning & #dataengineering Technologies will be following in 2021
- MLOps - ZenML #machinelearning with reproducible pipelines
- Why? Data Versioning is a complicated problem for Dataengineers
- Explainable AI Cheat Sheet
- Designing Machine Learning infrastructure
- What is Log - Foundation behind Databases & Distributed Systems
- How does the GIT version control work?
- Streamlit Healthcare Machine Learning Data App
- Dstack AI - An open-source tool to develop data applications with Python
- Adversarial Robustness Toolbox - a Python library for #machinelearning Security
- Biopython is a set of freely available tools for biological computation written in #Python
- Time to Know More about DASK
- DataEngineering vs Machine Learning
- A good #machinelearning Model is only possible with a good quality of #data.
- Statistics for #softwareengineer
- Monitoring #machinelearning Applications
- Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven
- Short Notes on - Open source #machinelearning Tracking System
- The best example of Randomness is - #machinelearning model in Production.
- Flyte is declarative, structured, and highly scalable cloud-native workflow orchestration platform for Distributed Machine Learning
- Tips for Distributed System High Availability
- Building DistributedLog: High-performance replicated log service
- How to scale Kubernetes with Assurance
- Apache Calcite - Building Sql Query Processor from Scratch over Lucene
- Database Storage
- ACID is the foundation of Database, BASE is for NoSQL Databases
- Some common elements behind many Distributed Databases
- Failure Recovery in TrinoDB
- What is LLVM
- What is Garbage Collection
- What is Canary Deployment
- The Snowflake Paper - Core idea is to build an enterprise-ready #datawarehouse solution for the #cloud
- Most important points around Distributed #dataengineering Platform
- Fundamental of #distributedsystems Scaling - Avoiding Co-ordination
- Technical Debt in #dataengineering #softwareengineering
- Paper on Wander Join: Online Aggregation via Random Walks - Join problem
- The Delta Lake Paper - High-Performance ACID Table Storage
- Dynamo - AWS Highly Available Key-value Store #distributedsystem
- An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, A Single SQL for all
- Secure & Robust Machine Learning in #healthcare
- Progress in Medical Science using #deeplearning
- The Amazon Redshift Paper - A fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing #businessintelligence tools
- Advancing #drugdiscovery via Artificial Intelligence
- Apache Calcite is a dynamic data management framework
- Lakehouse - A Paper on new Generation of #datawarehouse technology
- Calvin: Fast Distributed Transactions for Partitioned Database Systems
- Presto or Trino - #SQL on Everything (The Design, Motivation & Performance) #presto
- Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
- Apache Kafka Paper : Distributed Messaging System for Log Processing
- Paper: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size
- Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
- Azure Data Lake Store(ADLS) is a fully-managed, elastic, scalable, and secure file system that supports #hadoop distributed file system (HDFS) and Cosmos semantics
- An LFU (Least Frequently Used) Cache eviction algorithm of O(1) Runtime complexity
- The Berkeley View on Cloud Computing - Paper
- The Google File System - The Paper
- Paper: Report on Distributed Deep Learning on Data Systems
- Crystal: A Unified Cache Storage System for Analytical Databases
- VoltDB
- Magnet - Apache Spark Shuffle mechanism to handle petabytes of daily shuffled data and clusters with thousands of nodes
- Paper: Real-time Data Infrastructure @ Uber
- Paper: DBLog, A Watermark Based Change-Data-Capture Framework by Netflix
- Paper: Large Scale Distributed Systems Tracing Infrastructure
- Paper: Paxos vs Raft: Distributed Consensus
- Paper: Sorting in a #distributedsystem
- Paper: A large scale analysis of hundreds of in-memory cache clusters
- Design & Architecture of Amazon Timestream - Streaming at Scale
- Distributed System Synchronization
- Paper: Consistent hashing - Resizing cluster or Load in a #distributedsystems with a simple concept
- Deep Dive - FoundationDB (unbundled database, OLTP, strict serializability, multi-version concurrency control, optimistic concurrency control, simulation testing)
- Distributed Database - ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook Database
- BigData Metadata Management System
- Machine Learning for Database Optimizations
- SingleStore - A Distributed Database Management System. It's really more than a Database
- ArrowSAM, in-memory genomics SAM format based on Apache Arrow
- Realtime Data Processing FB - Deep Dive on #streamprocessing
- ArangoDB - Native multi-model NoSQL Distributed #database, From #sql to NoSQL
- To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
- How to bring robustness while Designing Large Scale Complex Systems
- Facebook Datawarehouse
- Building a performant OLTP system on an open-source columnar format, and supporting near-zero overhead data export to external tools
- Towards Demystifying Serverless Machine Learning Training
- Paper: Scalable Linear Algebra on top of Distributed Databases, this will simplify Machine Learning on Databases
- Paper: Are You Sure You Want to Use MMAP in Your Database Management System
- What is RBAC or Role-Based Access Control
- Vectorization vs. Compilation in Query Execution
- SQLite vs DuckDB
- Glow is an open-source toolkit for working with genomic data at biobank-scale and beyond using #apachespark & #deltalake
- ExPASy - Databases and software tools in proteomics, #genomics, phylogeny, systems biology, evolution, population genetics, and transcriptomics
- What is Metadata - A Data Engineering necessity
- What is Distributed Database
- To Partition, or Not to Partition, That is the Join Question in a Real System
- Paper: Solana- A new architecture for a high performance blockchain-inspired by Distributed Systems
- Scaling Large Production Clusters with Partitioned Synchronization
- Paper: Volcano Operator Model is based on relational algebra
- Paper: Faster and Cheaper Serverless Computing on Harvested Resources
- DBOS: A Paper on DBMS-oriented Operating System
- SSD Storage - Scale Caching without increasing too much cost & Smart Indexing for faster data query
- Paper: Lineage Tracing for General Data Warehouse Transformations
- What Every Programmer Should Know About Memory
- Deployment Archetypes for Cloud Applications
- PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers
- Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask
- Dual use of artificial-intelligence-powered drug discovery