# Data Extraction in ETL Assignment

**Question 1 : Describe different types of data sources used in ETL with suitable examples.**

**Ans :** In ETL, data sources are the systems from which data is extracted. Common types include:

**1. Relational Databases** : Structured data stored in tables with rows and columns.
- **Examples** : MySQL, PostgreSQL, Oracle, SQL Server
- **Use case** : Customer records, sales transactions.

**2. Flat Files** : Simple file-based data storage formats.
- **Examples** : CSV, TXT, TSV
- **Use case** : Daily sales reports, log files.

**3. Spreadsheets** : Semi-structured data maintained by business users.
- **Examples** : Excel (.xlsx, .xls), Google Sheets
- **Use case** : Budget sheets, manual data entry files.

**4. APIs (Web Services)** : Data accessed programmatically over the internet.
- **Examples** : REST APIs, SOAP APIs
- **Use case** : Weather data, stock prices, social media data.

**5. Cloud Storage & SaaS Platforms** : Data hosted on cloud object storage, cloud data warehouses, or SaaS applications.
- **Examples** : AWS S3 (object storage), Google BigQuery (data warehouse), Salesforce (SaaS CRM)
- **Use case** : CRM data, analytics data.

**Question 2 : What is data extraction? Explain its role in the ETL pipeline.**

**Ans :** Data extraction is the process of collecting data from various source systems and bringing it into the ETL pipeline.

**Role in ETL Pipeline**
- It is the first step of ETL (Extract → Transform → Load)
- Ensures accurate and complete data collection
- Supports both batch and real-time data processing
- Acts as the foundation for transformation and loading

Without proper extraction, the entire ETL process can fail or produce incorrect results.

**Question 3 : Explain the difference between CSV and Excel in terms of extraction and ETL usage.**

**Ans :** CSV (Comma-Separated Values) and Excel are both commonly used data formats, but they differ significantly in how they are used in ETL processes.

**CSV (Comma-Separated Values)**
- CSV is a plain text file format.
- It stores data in a simple row-and-column structure without formatting.
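- CSV files carry no formatting overhead, so they are lightweight and can be read line by line without loading the whole file into memory.
- They are usually faster to read and process than Excel files, especially for large datasets.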
- CSV files are widely supported by almost all ETL tools.
- Best suited for large-scale ETL pipelines.

**Excel Files**
- Excel files are either binary (.xls) or zipped XML (.xlsx).
- They support multiple sheets, formulas, formatting, and charts.
- Excel files are heavier and slower to process.
- They require additional libraries or drivers for extraction.
- Not ideal for very large datasets; an .xlsx worksheet is limited to 1,048,576 rows.
- Mostly used for manual analysis and reporting, not heavy ETL.

CSV is preferred in ETL processes due to its simplicity, speed, and efficiency, while Excel is better suited for reporting and manual data analysis rather than large-scale ETL operations.
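To make the extraction difference concrete, here is a minimal sketch that reads CSV using only Python's standard library. The `read_csv_rows` helper and the sample data are assumptions made for illustration, not part of any particular ETL tool. An .xlsx file could not be read this way; it would need an additional library such as openpyxl, which is not shown here.

```python
import csv
import io


def read_csv_rows(text_stream):
    """Yield each CSV row as a dict keyed by the header line."""
    reader = csv.DictReader(text_stream)
    for row in reader:
        yield row


# Hypothetical sample data standing in for a real file on disk.
sample = io.StringIO("order_id,amount\n1,10.50\n2,7.25\n")
rows = list(read_csv_rows(sample))
print(rows)
```

Every value comes back as a string because CSV stores no data types, so type casting belongs in the transform step.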

**Question 4 : Explain the steps involved in extracting data from a relational database.**

**Ans :** Data extraction from a relational database involves systematically retrieving required data while maintaining accuracy and performance. The steps are:

**1. Understand the Source Database** : Analyze database schema, tables, columns, primary keys, and relationships.

**2. Establish Database Connection** : Connect to the database using JDBC/ODBC or database-specific connectors with proper credentials.

**3. Select Required Data** : Identify tables and columns needed for extraction based on business requirements.

**4. Write SQL Queries** : Use SELECT statements with joins, filters, and conditions to fetch relevant data.

**5. Apply Filters or Incremental Logic** : Use WHERE clauses (e.g., date or ID filters) to extract only new or updated records.

**6. Execute Extraction** : Run queries and retrieve data from the database.

**7. Validate Extracted Data** : Check row counts, null values, and data consistency to ensure correctness.

**8. Store in Staging Area** : Save extracted data temporarily for transformation and loading.
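The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database in place of a real enterprise source, and the `orders` table, its columns, and the helper names are all hypothetical.

```python
import csv
import io
import sqlite3


def extract_incremental(conn, last_seen_id):
    """Steps 4-6: fetch only rows with an id greater than the last extracted id."""
    cursor = conn.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    return cursor.fetchall()


def validate(conn, rows, last_seen_id):
    """Step 7: compare the extracted row count with COUNT(*) on the source."""
    (expected,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE id > ?", (last_seen_id,)
    ).fetchone()
    return len(rows) == expected


def stage_as_csv(rows, out_stream):
    """Step 8: write extracted rows to a staging stream in CSV form."""
    writer = csv.writer(out_stream)
    writer.writerow(["id", "customer", "amount"])
    writer.writerows(rows)


# Step 2: connect (an in-memory database stands in for the real source).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Asha", 120.0), (2, "Ben", 80.5), (3, "Chen", 42.0)],
)

new_rows = extract_incremental(conn, last_seen_id=1)
is_valid = validate(conn, new_rows, last_seen_id=1)
staging = io.StringIO()
stage_as_csv(new_rows, staging)
print(new_rows, is_valid)
```

The `?` placeholders keep the filter value out of the SQL string, and the last extracted id would normally be persisted between runs so that each run picks up where the previous one stopped.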

**Question 5 : Explain three common challenges faced during data extraction.**

**Ans :** During data extraction in ETL, developers often face the following three common challenges:

**1. Data Quality Issues** : Source data may contain missing values, duplicate records, incorrect formats, or inconsistent data. Poor data quality can lead to inaccurate analysis and unreliable reports.

**2. Performance and Scalability Problems** : Extracting large volumes of data can slow down source systems and ETL jobs. Inefficient queries or full-table scans may impact database performance, especially in enterprise environments.

**3. Schema and Source Changes** : Changes in source systems such as renamed columns, modified data types, or deleted tables can break extraction logic and cause ETL failures.
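Two of these challenges can be detected early with simple checks. The sketch below is one possible approach, not a standard method: `check_schema` and `find_quality_issues` are hypothetical helpers, and the sample rows are made up for illustration.

```python
def check_schema(expected_columns, actual_columns):
    """Report columns that disappeared from or were added to the source."""
    missing = [c for c in expected_columns if c not in actual_columns]
    unexpected = [c for c in actual_columns if c not in expected_columns]
    return missing, unexpected


def find_quality_issues(rows, key):
    """Return indexes of rows with empty values and of rows repeating a key."""
    seen = set()
    empty, duplicates = [], []
    for index, row in enumerate(rows):
        if any(value is None or value == "" for value in row.values()):
            empty.append(index)
        if row[key] in seen:
            duplicates.append(index)
        seen.add(row[key])
    return empty, duplicates


# Hypothetical example: the source renamed "email" to "mail".
missing, unexpected = check_schema(["id", "email"], ["id", "mail"])

# Hypothetical extracted rows with one empty value and one duplicate id.
rows = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "1", "email": "b@example.com"},
]
empty, duplicates = find_quality_issues(rows, key="id")
print(missing, unexpected, empty, duplicates)
```

Running the schema check before each extraction lets the job fail with a clear message instead of breaking partway through.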

**Question 6 : What are APIs? Explain how APIs help in real-time data extraction.**

**Ans :** APIs (Application Programming Interfaces) are mechanisms that allow different software applications to communicate with each other and exchange data in a standardized way.

**Role of APIs in Real-Time Data Extraction**

- APIs provide direct access to live data from source systems.
- They enable real-time or near real-time data extraction instead of batch processing.
- Data is usually returned in JSON or XML format, which is easy to process in ETL tools.
- Some APIs support event-driven data flow (for example, webhooks or streaming endpoints), where data is delivered as soon as an event occurs instead of being polled on a schedule.
- Widely used in cloud-based, SaaS, and microservices architectures.

**Example** : Fetching live weather data, stock prices, or payment transactions using REST APIs.
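A minimal sketch of polling a REST API for new records might look like the following. The response shape (an `events` list with numeric `id` fields), the helper names, and the sample body are assumptions for illustration; a real API defines its own schema and usually requires authentication. `fetch_json` is defined but not called here, since it needs a real endpoint.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_new_events(payload, last_seen_id):
    """Keep only events newer than the last id already extracted."""
    return [event for event in payload["events"] if event["id"] > last_seen_id]


# Hypothetical response body standing in for a live API call.
sample_body = '{"events": [{"id": 1, "price": 101.2}, {"id": 2, "price": 101.9}]}'
payload = json.loads(sample_body)
new_events = extract_new_events(payload, last_seen_id=1)
print(new_events)
```

Tracking the last seen id keeps repeated polls from extracting the same record twice, which is what makes near real-time extraction practical.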

**Question 7 : Why are databases preferred for enterprise-level data extraction?**

**Ans :** Databases are preferred for enterprise-level data extraction because they are designed to handle large, critical, and continuously growing datasets efficiently.

**Reasons :**

**1. High Performance & Scalability** : Databases can manage millions of records and support parallel and incremental extraction.

**2. Data Consistency & Integrity** : They enforce constraints, relationships, and transactions, ensuring accurate and reliable data.

**3. Security & Access Control** : Databases provide authentication, authorization, and encryption for sensitive enterprise data.

**4. Efficient Querying** : SQL allows precise data selection using joins, filters, and indexes, reducing unnecessary data extraction.

**5. Reliability & Availability** : Enterprise databases support backups, recovery, and high availability, making them dependable ETL sources.

**Question 8 : What steps should an ETL developer take when extracting data from large CSV files (1GB+)?**

**Ans :** When working with very large CSV files, an ETL developer must ensure efficiency, performance, and data reliability. The following steps should be taken:

**1. Use Chunk-Based Reading** : Read the file in smaller chunks instead of loading the entire file into memory.

**2. Avoid Full Memory Load** : Stream the data line-by-line to prevent memory overflow.

**3. Validate Schema Before Extraction** : Check column names, data types, and delimiters before processing.

**4. Apply Filtering Early** : Remove unnecessary columns or rows during extraction to reduce data size.

**5. Use Parallel Processing** : Split the file and process multiple parts simultaneously, if supported.

**6. Compress and Decompress Efficiently** : Keep files in a compressed format like .gz for storage and transfer, and decompress them as a stream while reading, to reduce I/O time.

**7. Implement Error Handling & Logging** : Capture rejected or corrupted records separately for review.

**8. Schedule During Off-Peak Hours** : Run extraction jobs when system load is low to improve performance.
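Several of these steps (chunked reading, schema validation, compressed input, and separating rejected records) can be sketched with Python's standard library. This is an illustration under assumptions: the helper names, the chunk size, and the sample data are hypothetical, and a small in-memory string stands in for a 1GB+ file.

```python
import csv
import gzip
import io
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed CSV file as a text stream."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def read_in_chunks(stream, chunk_size, expected_columns):
    """Validate the header, then yield lists of at most chunk_size rows."""
    reader = csv.DictReader(stream)
    if list(reader.fieldnames or []) != list(expected_columns):
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


def split_rejects(chunk):
    """Separate rows with too few or too many fields from well-formed rows."""
    good, rejected = [], []
    for row in chunk:
        # DictReader stores extra fields under the key None
        # and fills missing fields with the value None.
        if None in row or None in row.values():
            rejected.append(row)
        else:
            good.append(row)
    return good, rejected


# Hypothetical sample: row 2 is missing a field, row 4 has an extra one.
sample = io.StringIO("id,amount\n1,10\n2\n3,30\n4,40,extra\n5,50\n")
chunk_sizes = []
good_count = rejected_count = 0
for chunk in read_in_chunks(sample, chunk_size=2, expected_columns=["id", "amount"]):
    good, rejected = split_rejects(chunk)
    chunk_sizes.append(len(chunk))
    good_count += len(good)
    rejected_count += len(rejected)
print(chunk_sizes, good_count, rejected_count)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size and not on the file size. For a real file, the stream would come from `open_text(path)` instead of `io.StringIO`.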