# 📖 Understanding Data Ingestion

Welcome to the first tutorial in our **Data Ingestion Pipeline** series! In this notebook, you'll learn the fundamentals of data ingestion and why it's crucial for modern businesses.

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- ✅ Understand what data ingestion is and why it matters
- ✅ Learn about different types of data sources
- ✅ Explore real-world data ingestion challenges
- ✅ Understand the components of a data pipeline
- ✅ See practical examples of data ingestion scenarios

---

## 🔄 What is Data Ingestion?

**Data Ingestion** is the process of collecting data from various sources and importing it into a storage system where it can be accessed, analyzed, and used by applications. In practice, it usually includes validating and cleaning the data along the way.

Think of it as a **digital conveyor belt** that:
1. 📥 **Collects** data from multiple sources
2. 🔍 **Validates** the data quality
3. 🧹 **Cleans** and transforms the data
4. 💾 **Stores** it in a usable format
5. 📊 **Makes** it available for analysis
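
The five steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production design: the raw records, the function names (`collect`, `is_valid`, `clean`, `run_pipeline`), and the in-memory dictionary standing in for a storage system are all assumptions made for this example.

```python
import json

# Hypothetical raw records, standing in for data collected from several sources
RAW_RECORDS = [
    '{"order_id": "A-1", "amount": "19.90"}',
    '{"order_id": "A-2", "amount": "-5"}',       # fails the quality check
    '{"order_id": "A-3", "amount": " 42.00 "}',  # needs cleaning
]

def collect():
    """Step 1: parse the raw JSON strings into dictionaries."""
    return [json.loads(line) for line in RAW_RECORDS]

def is_valid(record):
    """Step 2: keep only records with an ID and a positive, numeric amount."""
    try:
        return bool(record.get("order_id")) and float(record["amount"]) > 0
    except (KeyError, ValueError, TypeError):
        return False

def clean(record):
    """Step 3: convert the amount to a float (float() ignores surrounding spaces)."""
    return {"order_id": record["order_id"], "amount": float(record["amount"])}

def run_pipeline():
    """Steps 1-4: collect, validate, clean, and store records keyed by order ID."""
    store = {}  # Step 4: an in-memory stand-in for a real storage system
    for record in collect():
        if is_valid(record):
            cleaned = clean(record)
            store[cleaned["order_id"]] = cleaned
    return store

store = run_pipeline()

# Step 5: the stored data is now available for analysis
total = sum(record["amount"] for record in store.values())
print(f"Stored {len(store)} orders, total amount {total:.2f}")
```

Validation runs before cleaning here so that bad records are rejected early; real pipelines sometimes clean first and validate afterwards, depending on how messy the input is.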

### 🏪 Real-World Example: E-commerce Store

Imagine you run **"TechStore"** - an electronics retailer. Every day, you receive orders from:
- 🌐 Your website
- 📱 Mobile app
- 🏬 Physical stores
- 📞 Phone orders
- 🤝 Partner retailers

Without data ingestion, you'd have:
- ❌ Data scattered across different systems
- ❌ Inconsistent formats
- ❌ Manual data entry errors
- ❌ Delayed reporting
- ❌ Poor decision making

With data ingestion, you get:
- ✅ Centralized data storage
- ✅ Consistent data formats
- ✅ Automated quality checks
- ✅ Faster, near real-time insights
- ✅ Better business decisions

In [None]:
# Let's start with some basic imports
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import json

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📦 Libraries imported successfully!")
# sys.version starts with the interpreter version, e.g. "3.11.4 (main, ...)"
print(f"🐍 Python version: {sys.version.split()[0]}")
print(f"📊 Pandas version: {pd.__version__}")

## 📊 Types of Data Sources

Data can come from many different sources. Let's explore the most common ones:

In [None]:
# Create a visualization of different data sources
data_sources = {
    'Source Type': ['Files (CSV/JSON)', 'APIs', 'Databases', 'Streams', 'Web Scraping', 'IoT Sensors'],
    'Frequency': ['Daily/Hourly', 'Real-time', 'Batch', 'Continuous', 'On-demand', 'Continuous'],
    'Volume': ['Medium', 'High', 'High', 'Very High', 'Low', 'High'],
    'Complexity': ['Low', 'Medium', 'Medium', 'High', 'High', 'Medium'],
    'Use Cases': [
        'Reports, Exports',
        'Live Data, Integrations', 
        'Business Systems',
        'Real-time Analytics',
        'Market Research',
        'Monitoring, Telemetry'
    ]
}

df_sources = pd.DataFrame(data_sources)
print("📋 Common Data Sources:")
print("=" * 50)
display(df_sources)

### 📁 File-Based Sources

**Most Common Types:**
- **CSV Files** - Comma-separated values, easy to read
- **JSON Files** - JavaScript Object Notation, flexible structure
- **Excel Files** - Spreadsheets with multiple sheets
- **XML Files** - Structured markup language
- **Log Files** - Application and system logs

**Advantages:**
- ✅ Simple to understand and process
- ✅ No network dependencies
- ✅ Can be processed offline
- ✅ Easy to backup and archive

**Challenges:**
- ❌ Manual file transfers
- ❌ File format inconsistencies
- ❌ Large file sizes
- ❌ Timing dependencies
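
One common way to cope with large files is to read them in chunks rather than all at once. The sketch below assumes a CSV with `order_id` and `amount` columns and uses an in-memory string in place of a real file so it runs anywhere; the data and the chunk size are illustrative only.

```python
import io
import pandas as pd

# Hypothetical CSV export; in practice this would be a path to a large file
csv_text = "order_id,amount\n" + "\n".join(
    f"ORD-{i:03d},{i * 10.0}" for i in range(1, 8)
)

running_total = 0.0
rows_seen = 0

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file into memory at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    rows_seen += len(chunk)
    running_total += chunk["amount"].sum()

print(f"Processed {rows_seen} rows in chunks; total amount = {running_total:.2f}")
```

Because only one chunk is held in memory at a time, the same loop works whether the file has seven rows or seven million.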

In [None]:
# Example: Creating sample data that might come from different sources

# 1. CSV-style data (from store POS system)
csv_data = {
    'order_id': ['ORD-001', 'ORD-002', 'ORD-003'],
    'customer_name': ['John Doe', 'Jane Smith', 'Bob Wilson'],
    'product': ['iPhone 15', 'MacBook Pro', 'AirPods Pro'],
    'quantity': [1, 1, 2],
    'price': [999.99, 1999.99, 249.99],
    'store_location': ['New York', 'Los Angeles', 'Chicago']
}

# 2. JSON-style data (from mobile app)
json_data = {
    'app_version': '2.1.0',
    'upload_time': '2024-01-15T12:00:00Z',
    'orders': [
        {
            'order_id': 'APP-001',
            'customer_name': 'Alice Johnson',
            'product': 'iPad Air',
            'quantity': 1,
            'price': 599.99,
            'device_type': 'iOS'
        },
        {
            'order_id': 'APP-002',
            'customer_name': 'Charlie Brown',
            'product': 'Apple Watch',
            'quantity': 1,
            'price': 399.99,
            'device_type': 'iOS'
        }
    ]
}

print("📊 Sample CSV Data (Store POS):")
df_csv = pd.DataFrame(csv_data)
display(df_csv)

print("\n📱 Sample JSON Data (Mobile App):")
print(json.dumps(json_data, indent=2))
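
The two samples above have different shapes: the POS data is already tabular, while the app payload wraps its orders in metadata. A rough sketch of bringing them into one table follows. It uses trimmed-down copies of the samples, and the `channel` column is an assumption added for this example to record where each row came from.

```python
import pandas as pd

# Trimmed-down, hypothetical versions of the two payloads above
store_orders = pd.DataFrame({
    "order_id": ["ORD-001"],
    "customer_name": ["John Doe"],
    "price": [999.99],
    "store_location": ["New York"],
})
app_payload = {
    "app_version": "2.1.0",
    "orders": [
        {"order_id": "APP-001", "customer_name": "Alice Johnson",
         "price": 599.99, "device_type": "iOS"},
    ],
}

# The order records live under the "orders" key; the rest is upload metadata
app_orders = pd.DataFrame(app_payload["orders"])

# Tag each row with its origin so the source survives the merge
store_orders["channel"] = "store"
app_orders["channel"] = "mobile_app"

# concat aligns on column names; columns missing from one source become NaN
combined = pd.concat([store_orders, app_orders], ignore_index=True)
print(combined)
```

The `NaN` values in `store_location` and `device_type` are expected here: each source simply does not carry the other's field, which is exactly the kind of inconsistency the validation stage has to account for.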

### 🌐 API-Based Sources

**Application Programming Interfaces (APIs)** allow systems to request and exchange data programmatically, often in real time.

**Common API Types:**
- **REST APIs** - Most common, uses HTTP methods
- **GraphQL APIs** - Flexible query language
- **WebSocket APIs** - Real-time bidirectional communication
- **Webhook APIs** - Event-driven data push

**Advantages:**
- ✅ Real-time data access
- ✅ Standardized formats
- ✅ Automatic updates
- ✅ Rich metadata

**Challenges:**
- ❌ Network dependencies
- ❌ Rate limiting
- ❌ Authentication complexity
- ❌ API changes and versioning
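
Network failures and rate limits are usually handled by retrying with a growing delay. Here is a minimal sketch of retry with exponential backoff. To keep it runnable offline, it calls a fake endpoint instead of a real API; `TransientAPIError`, `call_with_retry`, and `flaky_fetch` are hypothetical names invented for this example.

```python
import time

class TransientAPIError(Exception):
    """Hypothetical error standing in for timeouts or HTTP 429/503 responses."""

def call_with_retry(fetch, max_attempts=4, base_delay=0.01):
    """Call fetch(); on a transient error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# A fake endpoint that fails twice before succeeding
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientAPIError("simulated rate limit")
    return {"status": "success", "orders": []}

result = call_with_retry(flaky_fetch)
print(f"Succeeded after {calls['count']} attempts: {result['status']}")
```

The delays here are tiny so the example finishes quickly. Real clients typically start around one second, cap the maximum wait, and add random jitter so that many clients do not retry at the same moment.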

In [None]:
# Example: Simulating API response data (no network call is made here,
# so the requests library is not needed yet)
import json

# Simulate an API response (we'll use a real API in later notebooks)
api_response = {
    "status": "success",
    "timestamp": "2024-01-15T14:30:00Z",
    "data": {
        "orders": [
            {
                "id": "WEB-001",
                "customer": {
                    "name": "Sarah Connor",
                    "email": "sarah@example.com",
                    "tier": "premium"
                },
                "items": [
                    {
                        "product": "MacBook Pro",
                        "quantity": 1,
                        "price": 1999.99
                    }
                ],
                "total": 1999.99,
                "created_at": "2024-01-15T14:25:00Z"
            }
        ],
        "pagination": {
            "page": 1,
            "per_page": 10,
            "total": 1,
            "has_more": False
        }
    }
}

print("🌐 Sample API Response:")
print(json.dumps(api_response, indent=2))

# Extract order data from API response
orders = api_response['data']['orders']
print(f"\n📊 Extracted {len(orders)} orders from API response")
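
Nested responses like this one usually need flattening before analysis. The sketch below uses `pd.json_normalize` on a trimmed response of the same shape. The second line item is invented so that the one-row-per-item behavior is visible, and the `line_total` column is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical response with the same shape as the one above, trimmed for brevity
response = {
    "data": {
        "orders": [
            {
                "id": "WEB-001",
                "customer": {"name": "Sarah Connor", "tier": "premium"},
                "items": [
                    {"product": "MacBook Pro", "quantity": 1, "price": 1999.99},
                    {"product": "USB-C Cable", "quantity": 2, "price": 19.99},
                ],
            }
        ]
    }
}

# One output row per entry in "items"; "meta" copies order-level fields onto each row
items = pd.json_normalize(
    response["data"]["orders"],
    record_path="items",
    meta=["id", ["customer", "name"], ["customer", "tier"]],
)
items["line_total"] = items["quantity"] * items["price"]
print(items)
```

Nested meta fields come out as dotted column names such as `customer.name`, which you may want to rename before storing the data.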

## 🏗️ Data Pipeline Components

A complete data ingestion pipeline consists of several key components:

In [None]:
# Visualize the data pipeline components
pipeline_components = {
    'Stage': ['1. Ingestion', '2. Validation', '3. Transformation', '4. Storage', '5. Monitoring'],
    'Purpose': [
        'Collect data from sources',
        'Check data quality',
        'Clean and enrich data', 
        'Save processed data',
        'Track pipeline health'
    ],
    'Key Activities': [
        'Read files, Call APIs, Query DBs',
        'Schema validation, Business rules',
        'Cleaning, Standardization, Enrichment',
        'Database writes, File exports',
        'Logging, Metrics, Alerts'
    ],
    'Tools/Technologies': [
        'Pandas, Requests, SQL',
        'JSON Schema, Custom validators',
        'Pandas, NumPy, Custom logic',
        'SQLite, PostgreSQL, Files',
        'Logging, Prometheus, Grafana'
    ]
}

df_pipeline = pd.DataFrame(pipeline_components)
print("🔄 Data Pipeline Components:")
print("=" * 80)
display(df_pipeline)

### 📥 Stage 1: Data Ingestion

**Purpose:** Collect data from various sources

**Key Activities:**
- 📁 Read files (CSV, JSON, Excel)
- 🌐 Call REST APIs
- 🗄️ Query databases
- 📡 Stream real-time data

**Challenges:**
- Different data formats
- Network connectivity issues
- Large file sizes
- Rate limiting

In [None]:
# Example: Simple data ingestion simulation
def simulate_data_ingestion():
    """Simulate collecting data from multiple sources"""
    
    # Source 1: CSV file data
    csv_orders = pd.DataFrame({
        'order_id': ['CSV-001', 'CSV-002'],
        'source': ['store', 'store'],
        'customer': ['John Doe', 'Jane Smith'],
        'amount': [999.99, 1999.99]
    })
    
    # Source 2: API data
    api_orders = pd.DataFrame({
        'order_id': ['API-001', 'API-002'],
        'source': ['website', 'mobile'],
        'customer': ['Bob Wilson', 'Alice Johnson'],
        'amount': [599.99, 399.99]
    })
    
    # Combine data from all sources
    all_orders = pd.concat([csv_orders, api_orders], ignore_index=True)
    
    return all_orders

# Simulate ingestion
ingested_data = simulate_data_ingestion()
print("📥 Data Ingestion Results:")
display(ingested_data)
print(f"\n✅ Successfully ingested {len(ingested_data)} orders from multiple sources")

### 🔍 Stage 2: Data Validation

**Purpose:** Ensure data quality and completeness

**Key Activities:**
- ✅ Schema validation (correct fields, data types)
- 🔍 Business rule validation (positive prices, valid dates)
- 📊 Data quality scoring
- ⚠️ Error reporting

**Common Issues:**
- Missing required fields
- Invalid data types
- Business rule violations
- Duplicate records
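
Schema validation can start very simply: check that each expected column exists and holds the expected kind of data. The following is a sketch under assumptions, not a full schema library. `EXPECTED_SCHEMA` and `check_schema` are hypothetical names, and the dtype checks come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expected schema: column name -> dtype check function
EXPECTED_SCHEMA = {
    "order_id": ptypes.is_string_dtype,
    "customer": ptypes.is_string_dtype,
    "amount": ptypes.is_numeric_dtype,
}

def check_schema(df, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for column, dtype_ok in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif not dtype_ok(df[column]):
            problems.append(f"wrong type for {column}: {df[column].dtype}")
    return problems

good = pd.DataFrame({"order_id": ["A-1"], "customer": ["John Doe"], "amount": [10.5]})
bad = pd.DataFrame({"order_id": ["A-2"], "amount": ["ten"]})  # text amount, no customer

print("good:", check_schema(good))
print("bad: ", check_schema(bad))
```

Running the schema check before any business rules means later checks, such as comparing `amount` to zero, can safely assume the column exists and is numeric.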

In [None]:
# Example: Simple data validation
def validate_order_data(data):
    """Validate order data quality and score it by the share of valid rows"""

    validation_results = {
        'total_records': len(data),
        'valid_records': 0,
        'issues': []
    }

    # Track validity per row so the score counts records, not issue messages
    valid_rows = pd.Series(True, index=data.index)

    # Check required fields
    required_fields = ['order_id', 'customer', 'amount']
    for field in required_fields:
        if field not in data.columns:
            validation_results['issues'].append(f"Missing required field: {field}")
            valid_rows[:] = False  # every row lacks this field

    # Check for missing values
    missing_values = data.isnull().sum()
    for field, count in missing_values.items():
        if count > 0:
            validation_results['issues'].append(f"{field}: {count} missing values")
    valid_rows &= data.notnull().all(axis=1)

    # Check business rules
    if 'amount' in data.columns:
        non_positive = data['amount'] <= 0
        if non_positive.sum() > 0:
            validation_results['issues'].append(f"Found {non_positive.sum()} orders with non-positive amounts")
        valid_rows &= ~non_positive

    # Calculate valid records (guarding against empty input)
    total = validation_results['total_records']
    validation_results['valid_records'] = int(valid_rows.sum())
    validation_results['quality_score'] = (validation_results['valid_records'] / total) * 100 if total else 0.0

    return validation_results

# Validate our ingested data
validation_results = validate_order_data(ingested_data)

print("🔍 Data Validation Results:")
print(f"Total Records: {validation_results['total_records']}")
print(f"Quality Score: {validation_results['quality_score']:.1f}%")

if validation_results['issues']:
    print("\n⚠️ Issues Found:")
    for issue in validation_results['issues']:
        print(f"  - {issue}")
else:
    print("\n✅ No validation issues found!")

### 🧹 Stage 3: Data Transformation

**Purpose:** Clean, standardize, and enrich data

**Key Activities:**
- 🧽 **Data Cleaning:** Remove duplicates, fix formats
- 📏 **Standardization:** Consistent formats and units
- ➕ **Enrichment:** Add calculated fields and metadata
- 🔄 **Normalization:** Structure data for analysis

**Common Transformations:**
- Date format standardization
- Text cleaning and normalization
- Currency conversion
- Category mapping
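
The four transformations listed above can be sketched together on a tiny table. Everything specific in this example is an assumption made for illustration: the date formats, the exchange rates (made-up numbers, not real rates), and the category mapping.

```python
from datetime import datetime
import pandas as pd

raw = pd.DataFrame({
    "order_date": ["2024-01-15", "15/01/2024", "20240116"],
    "product": ["  iPhone 15 ", "MACBOOK PRO", "AirPods  Pro"],
    "amount": [999.99, 1850.00, 230.00],
    "currency": ["USD", "EUR", "GBP"],
})

KNOWN_DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]  # assumed source formats
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "GBP": 1.25}    # made-up rates
CATEGORY_MAP = {"iphone 15": "Phones", "macbook pro": "Laptops", "airpods pro": "Audio"}

def to_iso_date(text):
    """Try each known format in turn; return an ISO date string or None."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

clean = raw.copy()

# 1. Date format standardization
clean["order_date"] = clean["order_date"].map(to_iso_date)

# 2. Text cleaning: trim, collapse repeated spaces, lowercase
clean["product_key"] = (
    clean["product"].str.strip().str.replace(r"\s+", " ", regex=True).str.lower()
)

# 3. Currency conversion to a single reporting currency
clean["amount_usd"] = (clean["amount"] * clean["currency"].map(RATES_TO_USD)).round(2)

# 4. Category mapping on the cleaned key
clean["category"] = clean["product_key"].map(CATEGORY_MAP)

print(clean[["order_date", "product_key", "amount_usd", "category"]])
```

Listing the accepted date formats explicitly avoids guessing whether `01/02/2024` means January or February; anything that matches no format becomes `None` and can be flagged by validation.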

In [None]:
# Example: Data transformation
def transform_order_data(data):
    """Transform and enrich order data"""
    
    # Make a copy to avoid modifying original data
    transformed_data = data.copy()
    
    # 1. Standardize customer names (Title Case)
    transformed_data['customer'] = transformed_data['customer'].str.title()
    
    # 2. Add calculated fields
    transformed_data['order_size'] = pd.cut(
        transformed_data['amount'], 
        bins=[0, 500, 1000, 2000, float('inf')],
        labels=['Small', 'Medium', 'Large', 'XLarge']
    )
    
    # 3. Add processing metadata
    transformed_data['processed_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
    # 4. Add source category
    source_mapping = {
        'store': 'Physical',
        'website': 'Online',
        'mobile': 'Online'
    }
    transformed_data['source_category'] = transformed_data['source'].map(source_mapping)
    
    return transformed_data

# Transform our data
transformed_data = transform_order_data(ingested_data)

print("🧹 Data Transformation Results:")
display(transformed_data)

print("\n📊 Order Size Distribution:")
print(transformed_data['order_size'].value_counts())

print("\n📈 Source Category Distribution:")
print(transformed_data['source_category'].value_counts())
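
Stage 4 (Storage) from the components table is not walked through in this notebook, so here is a minimal sketch of what it might look like with SQLite. The table name `orders` and the sample rows are assumptions for this example, and the database lives in memory so nothing is written to disk.

```python
import sqlite3
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["CSV-001", "API-001"],
    "customer": ["John Doe", "Bob Wilson"],
    "amount": [999.99, 599.99],
})

# ":memory:" keeps the database in RAM; a file path would persist it to disk
conn = sqlite3.connect(":memory:")
try:
    # if_exists="append" adds rows to the table, creating it on first use
    orders.to_sql("orders", conn, if_exists="append", index=False)

    # Read back an aggregate to confirm the write
    summary = pd.read_sql_query(
        "SELECT COUNT(*) AS n_orders, SUM(amount) AS revenue FROM orders", conn
    )
finally:
    conn.close()

print(summary)
```

Reading the data back right after writing it is a cheap sanity check that the storage step did what the pipeline expected.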

## 🚧 Common Data Ingestion Challenges

Real-world data ingestion comes with many challenges. Let's explore the most common ones:

In [None]:
# Visualize common challenges and their impact
challenges_data = {
    'Challenge': [
        'Data Quality Issues',
        'Format Inconsistencies', 
        'Network Failures',
        'Large Data Volumes',
        'Schema Changes',
        'Rate Limiting',
        'Security & Authentication',
        'Real-time Processing'
    ],
    'Frequency': ['Very High', 'High', 'Medium', 'High', 'Medium', 'Medium', 'High', 'Medium'],
    'Impact': ['High', 'Medium', 'High', 'Medium', 'High', 'Low', 'High', 'Medium'],
    'Solutions': [
        'Validation, Cleaning, Monitoring',
        'Standardization, Schema validation',
        'Retry logic, Circuit breakers',
        'Batch processing, Streaming',
        'Version control, Backward compatibility',
        'Throttling, Queue management',
        'OAuth, API keys, Encryption',
        'Stream processing, Event-driven'
    ]
}

df_challenges = pd.DataFrame(challenges_data)
print("🚧 Common Data Ingestion Challenges:")
print("=" * 80)
display(df_challenges)

### 🔍 Challenge Deep Dive: Data Quality Issues

Let's simulate some common data quality issues and see how to handle them:

In [None]:
# Create sample data with quality issues
problematic_data = pd.DataFrame({
    'order_id': ['ORD-001', '', 'ORD-003', 'ORD-001'],  # Missing and duplicate IDs
    'customer_name': ['john doe', 'JANE SMITH', '', 'Bob Wilson'],  # Inconsistent case, missing
    'product': ['iPhone 15', 'macbook pro', 'AirPods Pro', 'iPad Air'],
    'quantity': [1, -1, 2, 0],  # Negative and zero quantities
    'price': [999.99, 1999.99, -249.99, 599.99],  # Negative price
    'order_date': ['2024-01-15', '2025-12-31', '2024-01-16', 'invalid-date'],  # Future and invalid dates
    'email': ['john@example.com', 'invalid-email', 'bob@example.com', 'alice@example.com']
})

print("🚨 Sample Data with Quality Issues:")
display(problematic_data)

# Analyze the issues
print("\n🔍 Data Quality Analysis:")
print("=" * 40)

# 1. Missing values
missing_values = problematic_data.isnull().sum() + (problematic_data == '').sum()
print("Missing/Empty Values:")
for col, count in missing_values.items():
    if count > 0:
        print(f"  {col}: {count}")

# 2. Duplicates
duplicates = problematic_data.duplicated(subset=['order_id']).sum()
print(f"\nDuplicate Order IDs: {duplicates}")

# 3. Business rule violations
negative_quantities = (problematic_data['quantity'] <= 0).sum()
negative_prices = (problematic_data['price'] <= 0).sum()
print(f"Negative/Zero Quantities: {negative_quantities}")
print(f"Negative/Zero Prices: {negative_prices}")

# 4. Invalid emails (pandas' str.match applies the regex, so `re` is not needed)
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
invalid_emails = ~problematic_data['email'].str.match(email_pattern, na=False)
print(f"Invalid Email Addresses: {invalid_emails.sum()}")
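
The analysis above does not check the `order_date` column, even though the sample contains an unparseable date and one flagged as being in the future. A sketch of one way to check both follows. The reference date is an assumption fixed for reproducibility; a live pipeline would more likely compare against the current time.

```python
import pandas as pd

order_dates = pd.Series(["2024-01-15", "2025-12-31", "2024-01-16", "invalid-date"])

# Assumed processing date; a live pipeline would use pd.Timestamp.now() instead
reference_date = pd.Timestamp("2024-01-31")

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(order_dates, format="%Y-%m-%d", errors="coerce")

invalid_dates = parsed.isna()
future_dates = parsed > reference_date  # comparisons involving NaT are False

print(f"Invalid dates: {invalid_dates.sum()}")
print(f"Future dates:  {future_dates.sum()}")
```

Passing an explicit `format` keeps parsing strict, so a value in an unexpected layout is reported as invalid rather than silently reinterpreted.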

## 📊 Data Ingestion Patterns

Different scenarios require different ingestion patterns:

In [None]:
# Visualize different ingestion patterns
patterns_data = {
    'Pattern': ['Batch Processing', 'Real-time Streaming', 'Micro-batching', 'Event-driven', 'Hybrid'],
    'Frequency': ['Hours/Days', 'Continuous', 'Minutes', 'On-demand', 'Mixed'],
    'Latency': ['High', 'Very Low', 'Low', 'Variable', 'Mixed'],
    'Complexity': ['Low', 'High', 'Medium', 'Medium', 'High'],
    'Use Cases': [
        'Daily reports, ETL jobs',
        'Live dashboards, Alerts',
        'Near real-time analytics',
        'Webhooks, Notifications',
        'Enterprise systems'
    ]
}

df_patterns = pd.DataFrame(patterns_data)
print("🔄 Data Ingestion Patterns:")
print("=" * 70)
display(df_patterns)

# Create a simple visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Complexity vs Latency
complexity_map = {'Low': 1, 'Medium': 2, 'High': 3}
latency_map = {'Very Low': 1, 'Low': 2, 'Variable': 2.5, 'High': 3, 'Mixed': 2.5}

x = [complexity_map[c] for c in df_patterns['Complexity']]
y = [latency_map[l] for l in df_patterns['Latency']]

ax1.scatter(x, y, s=100, alpha=0.7)
for i, pattern in enumerate(df_patterns['Pattern']):
    ax1.annotate(pattern, (x[i], y[i]), xytext=(5, 5), textcoords='offset points')

ax1.set_xlabel('Complexity')
ax1.set_ylabel('Latency')
ax1.set_title('Ingestion Patterns: Complexity vs Latency')
ax1.grid(True, alpha=0.3)

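# Complexity per pattern (the table has one row per pattern, so a pie chart
# of counts would only show five equal slices)
ax2.barh(df_patterns['Pattern'], x, alpha=0.7)
ax2.set_xticks([1, 2, 3])
ax2.set_xticklabels(['Low', 'Medium', 'High'])
ax2.set_xlabel('Complexity')
ax2.set_title('Ingestion Pattern Complexity')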

plt.tight_layout()
plt.show()
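
To make the micro-batching pattern more concrete, here is a small sketch that groups a stream of events into fixed-size batches. Real micro-batching systems usually batch by time window as well as by size; this example uses size only to stay short, and `micro_batches` and the event stream are hypothetical.

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable, in arrival order."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch

# Hypothetical event stream: seven order events arriving one at a time
event_stream = ({"order_id": f"EVT-{i}", "amount": 10 * i} for i in range(1, 8))

batch_totals = []
for batch in micro_batches(event_stream, batch_size=3):
    batch_totals.append(sum(event["amount"] for event in batch))
    print(f"Processed batch of {len(batch)} events")

print(f"Totals per batch: {batch_totals}")
```

Each batch is processed as a unit, which trades a little latency for lower per-event overhead compared with handling every event individually.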

## 🎯 Best Practices for Data Ingestion

Based on industry experience, here are the key best practices:

In [None]:
# Best practices checklist
best_practices = {
    'Category': [
        'Data Quality',
        'Error Handling', 
        'Performance',
        'Security',
        'Monitoring',
        'Documentation',
        'Testing',
        'Scalability'
    ],
    'Key Practices': [
        'Validate early, Clean consistently, Monitor quality',
        'Retry logic, Circuit breakers, Graceful degradation',
        'Batch processing, Parallel execution, Caching',
        'Encrypt data, Secure APIs, Access control',
        'Log everything, Track metrics, Set up alerts',
        'Document schemas, API contracts, Processes',
        'Unit tests, Integration tests, Data tests',
        'Horizontal scaling, Load balancing, Queuing'
    ],
    'Priority': ['Critical', 'Critical', 'High', 'Critical', 'High', 'Medium', 'High', 'Medium']
}

df_practices = pd.DataFrame(best_practices)
print("💡 Data Ingestion Best Practices:")
print("=" * 80)
display(df_practices)

# Priority distribution (fix the order so each color always maps to the same level)
priority_order = ['Critical', 'High', 'Medium']
priority_counts = df_practices['Priority'].value_counts().reindex(priority_order, fill_value=0)
plt.figure(figsize=(10, 6))
colors = ['red', 'orange', 'yellow']
plt.pie(priority_counts.values, labels=priority_counts.index, autopct='%1.1f%%', colors=colors)
plt.title('Best Practices by Priority Level')
plt.show()

print("\n📊 Priority Breakdown:")
for priority, count in priority_counts.items():
    print(f"  {priority}: {count} practices")
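
The monitoring practices in the table ("Log everything, Track metrics, Set up alerts") can be approximated with the standard library alone. This is a sketch under assumptions: the logger name, the `ingest` function, and the 20% alert threshold are all invented for this example, and a real pipeline would send metrics to a monitoring system rather than a `Counter`.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("ingestion")  # hypothetical logger name

REJECTION_ALERT_THRESHOLD = 0.2          # assumed value; tune per pipeline
metrics = Counter()

def ingest(records):
    """Count accepted and rejected records, logging each rejection."""
    for record in records:
        if record.get("amount", 0) > 0:
            metrics["accepted"] += 1
        else:
            metrics["rejected"] += 1
            logger.warning("Rejected record %s: non-positive amount",
                           record.get("order_id"))

ingest([
    {"order_id": "A-1", "amount": 10.0},
    {"order_id": "A-2", "amount": -5.0},
    {"order_id": "A-3", "amount": 20.0},
    {"order_id": "A-4", "amount": 30.0},
])

rejection_rate = metrics["rejected"] / sum(metrics.values())
logger.info("accepted=%d rejected=%d rejection_rate=%.0f%%",
            metrics["accepted"], metrics["rejected"], rejection_rate * 100)

# A simple alert rule: flag the run when too many records were rejected
alert_triggered = rejection_rate > REJECTION_ALERT_THRESHOLD
if alert_triggered:
    logger.error("Rejection rate above threshold; an alert would fire here")
```

Logging the counts at the end of every run gives you a baseline, so a sudden jump in the rejection rate stands out even before an alert threshold is tuned.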

## 🔮 Future of Data Ingestion

The data ingestion landscape is rapidly evolving. Here are key trends to watch:

In [None]:
# Future trends in data ingestion
trends_data = {
    'Trend': [
        'AI-Powered Data Quality',
        'Serverless Ingestion',
        'Real-time Everything',
        'Schema Evolution',
        'Edge Computing',
        'Data Mesh Architecture',
        'Privacy-First Design',
        'No-Code/Low-Code'
    ],
    'Impact': ['High', 'High', 'Very High', 'Medium', 'High', 'Medium', 'High', 'Medium'],
    'Timeline': ['2-3 years', '1-2 years', 'Now', '2-3 years', '1-2 years', '3-5 years', 'Now', '1-2 years'],
    'Description': [
        'ML-based anomaly detection and auto-correction',
        'Cloud functions for event-driven ingestion',
        'Sub-second latency for all data processing',
        'Automatic schema migration and compatibility',
        'Processing data closer to the source',
        'Decentralized data ownership and governance',
        'Built-in privacy and compliance features',
        'Visual pipeline builders for non-technical users'
    ]
}

df_trends = pd.DataFrame(trends_data)
print("🔮 Future Trends in Data Ingestion:")
print("=" * 80)
display(df_trends)

# Timeline visualization
timeline_map = {'Now': 0, '1-2 years': 1.5, '2-3 years': 2.5, '3-5 years': 4}
impact_map = {'Medium': 1, 'High': 2, 'Very High': 3}

x = [timeline_map[t] for t in df_trends['Timeline']]
y = [impact_map[i] for i in df_trends['Impact']]

plt.figure(figsize=(12, 8))
scatter = plt.scatter(x, y, s=150, alpha=0.7, c=y, cmap='viridis')

for i, trend in enumerate(df_trends['Trend']):
    plt.annotate(trend, (x[i], y[i]), xytext=(5, 5), textcoords='offset points', fontsize=9)

plt.xlabel('Timeline (Years from Now)')
plt.ylabel('Expected Impact')
plt.title('Data Ingestion Trends: Timeline vs Impact')
plt.grid(True, alpha=0.3)
plt.colorbar(scatter, label='Impact Level')
plt.show()

## 🎯 Key Takeaways

Congratulations! You've completed the first tutorial in our data ingestion series. Here's what you've learned:

### ✅ **Core Concepts**
- **Data Ingestion** is the foundation of all data analytics
- **Multiple Sources** require different handling strategies
- **Data Quality** is critical for reliable insights
- **Pipeline Stages** each serve a specific purpose

### ✅ **Practical Skills**
- Identifying different data source types
- Understanding common data quality issues
- Recognizing ingestion patterns and their use cases
- Applying best practices for reliable pipelines

### ✅ **Industry Knowledge**
- Common challenges and their solutions
- Best practices from real-world implementations
- Future trends shaping the industry
- Typical tools and technologies used at each pipeline stage

---

## 🚀 What's Next?

In the next tutorial, **"02_reading_files.ipynb"**, you'll learn:
- 📁 How to read different file formats (CSV, JSON, Excel)
- 🔍 File validation and error handling
- 📊 Processing large files efficiently
- 🔄 Automating file monitoring and processing

### 🎯 **Practice Exercise**

Before moving to the next tutorial, try this exercise:

1. **Create sample data** for a different business (e.g., restaurant, hospital, school)
2. **Identify data sources** that business might have
3. **List potential data quality issues** for that domain
4. **Design a simple pipeline** for that use case

---

## 📚 Additional Resources

- **📖 Books**: "Designing Data-Intensive Applications" by Martin Kleppmann
- **🌐 Online**: Apache Kafka documentation, Apache Airflow tutorials
- **🎓 Courses**: Data Engineering courses on Coursera, Udacity
- **💼 Communities**: Data Engineering Slack, Reddit r/dataengineering

---

**Happy Learning! 🚀**