# Data Orchestration
Data orchestration is the process of managing and coordinating data across different systems, tools, and storage locations so that it flows smoothly and is ready for analysis. Think of it as a conductor in an orchestra, guiding various "instruments" (data sources) to work together in harmony, so the right data reaches the right place at the right time. This often involves automating tasks like data extraction, transformation, and loading (ETL) to keep everything running efficiently and consistently across an organization.

## Definitions of ETL and ELT
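- **ETL (Extract, Transform, Load):** Data is transformed before it is loaded into the target system. ETL is ideal when data needs to be standardized and cleaned before storage, and is often used in on-premise systems where the resources for transformation sit outside the target system.
- **ELT (Extract, Load, Transform):** Data is loaded into the target system first and transformed there. ELT is preferred in modern, cloud-based architectures with scalable storage and processing power, where data can be quickly loaded and then transformed in the target system.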
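The difference in ordering can be sketched in a few lines. This is a minimal illustration, assuming an in-memory SQLite database stands in for the warehouse; the sample rows, table names, and the `clean`, `run_etl`, and `run_elt` helpers are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [(" Alice ", "42.50"), ("BOB", "17"), (" Alice ", "42.50")]


def clean(row):
    """Standardize one record: trim and title-case the name, parse the amount."""
    name, amount = row
    return name.strip().title(), float(amount)


def run_etl(conn):
    """ETL: transform in the pipeline process, then load only clean rows."""
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    # Cleaning and de-duplication happen outside the "warehouse".
    cleaned = sorted(set(clean(row) for row in RAW_ROWS))
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows as-is, then transform with SQL inside the warehouse."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT DISTINCT
            UPPER(SUBSTR(TRIM(name), 1, 1)) || LOWER(SUBSTR(TRIM(name), 2)) AS name,
            CAST(amount AS REAL) AS amount
        FROM sales_raw
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same clean table here; what differs is where the transformation runs and whether the raw data is kept in the target system.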

## Efficiency and Drawbacks of Each Method
### ETL 
#### Efficiency
- **Processing Load:** Transformation happens outside the data warehouse, so it requires external processing resources, which can slow down the ETL process if resources are limited.
- **Data Accuracy:** Since data is cleaned and standardized before it reaches the target, it often results in cleaner data being stored initially, reducing downstream processing needs.

#### Drawbacks
- **Scalability:** Scaling ETL can be challenging, especially with large data volumes, since transformation is limited by the external processing environment.
- **Complexity:** ETL requires predefined transformations before loading, making it less flexible when requirements or data structures change.
- **Latency:** ETL can introduce delays in data availability because transformations must complete before loading data into the target system.

### ELT 
#### Efficiency
- **Scalability:** ELT leverages the power of modern cloud data warehouses (like Snowflake, BigQuery), which are highly scalable and can process large volumes of data in parallel.
- **Speed:** Since data is loaded directly into the target system, it’s available sooner, and transformations can be run on-demand or scheduled, reducing latency.
- **Flexibility:** Raw data is stored as-is, allowing for diverse transformations directly in the warehouse, accommodating changing data needs.

#### Drawbacks
- **Data Storage Costs:** Loading raw data can increase storage costs, especially if much of the raw data is unused.
- **Data Quality Risks:** Since transformations happen after loading, raw data might include duplicates, errors, or inconsistencies that need extra management within the warehouse.
- **Performance:** Running transformations in the data warehouse could impact query performance, especially if transformations are complex or unoptimized.

## Tools required 
- Extract data using Fivetran (for automated connectors) or Kafka (for real-time data).
- Transform with dbt or Spark for scalable processing.
- Store in a data warehouse like Snowflake.
- Orchestrate using Apache Airflow to automate the ETL process.
- Monitor with Grafana and send alerts to Slack.
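The core idea behind an orchestrator, running each step only after the steps it depends on, can be sketched with the standard library alone. This is not Airflow's API: the `PIPELINE` mapping, the task names, and `run_pipeline` are hypothetical stand-ins, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_pipeline(dependencies, tasks):
    """Run every task once, only after all of its upstream tasks have run."""
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
# Placeholder task bodies that only record that they ran.
TASKS = {name: (lambda n=name: log.append(f"ran {n}")) for name in PIPELINE}

executed = run_pipeline(PIPELINE, TASKS)
print(executed)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering.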


## Batch and Stream Processing 
- **Batch Processing:** High latency, large volumes, periodic processing (e.g., overnight reporting).
- **Stream Processing:** Low latency, real-time, continuous processing (e.g., fraud detection, live monitoring).
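The contrast can be sketched as follows, assuming a hypothetical list of transaction amounts: the batch function needs the whole dataset before it produces a result, while the stream function emits an updated result after every event.

```python
def batch_total(events):
    """Batch: wait for the full dataset, then process it in one pass."""
    return sum(events)


def stream_totals(events):
    """Stream: update and emit the result as each event arrives."""
    running = 0
    for amount in events:
        running += amount
        yield running


EVENTS = [10, 25, 5]  # hypothetical transaction amounts

batch_result = batch_total(EVENTS)
stream_results = list(stream_totals(iter(EVENTS)))

print(batch_result)
print(stream_results)
```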

# Data Structures and Algorithms (DSA)
## Big O Notation
-
-
-

## Basic data structures in Python 
- Lists
- Tuples
- Sets 
- Arrays
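A short sketch of how the four structures behave, assuming "Arrays" refers to the standard library's `array` module (NumPy arrays are another common reading); the variable names are illustrative.

```python
from array import array

numbers_list = [3, 1, 3]        # list: ordered, mutable, allows duplicates
numbers_list.append(2)

point = (4, 5)                  # tuple: ordered, immutable

unique = set(numbers_list)      # set: unordered, keeps only unique elements

typed = array("i", [1, 2, 3])   # array: compact sequence of a single C type
typed.append(4)

print(numbers_list, point, unique, typed.tolist())
```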

## Sorting Algorithms 
- Bubble Sort 
- Selection Sort 
- Insertion Sort 
- Merge Sort 
- Quick Sort 
- Counting Sort 
- Radix Sort 
- Bucket Sort
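As one worked example from the list above, here is a minimal sketch of merge sort (top-down, returning a new list). The function name and sample input are illustrative.

```python
def merge_sort(items):
    """Return a new sorted list using top-down merge sort (O(n log n) time)."""
    if len(items) <= 1:
        return list(items)

    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal items in their original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


sorted_items = merge_sort([5, 2, 9, 1, 5, 6])
print(sorted_items)
```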

## Search Algorithms 
-
-
-

# Web Scraping Frameworks
Web scraping is the process of automatically extracting information from websites. Instead of copying data manually, a web scraper (a small program) goes through a website's pages and pulls specific data, like prices, product details, or articles, into a structured format such as a spreadsheet or database. It's useful for tasks like tracking prices on e-commerce sites, gathering news articles, or analyzing social media trends.
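The parsing step can be sketched without any third-party package. This minimal example assumes a hard-coded HTML snippet in place of a downloaded page and uses the standard library's `html.parser` as a stand-in for a library such as BeautifulSoup; the markup, class names, and `ProductParser` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">999.00</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">19.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the name and price text from each product list item."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProductParser()
parser.feed(SAMPLE_HTML)

# Convert the scraped text into a structured format.
products = [{"name": r["name"], "price": float(r["price"])} for r in parser.rows]
print(products)
```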

In [None]:
# Httpx and requests

In [None]:
# BeautifulSoup

In [None]:
# Selenium and Playwright

In [2]:
# Scrapy

# Database Queries 
- Find best interview questions on database queries 

# Potential Improvements the Company Can Implement