# 📁 Reading Files - Your First Data Ingestion Step

Welcome to the second tutorial in our **Data Ingestion Pipeline** series! In this hands-on notebook, you'll learn how to read different file formats and handle real-world file processing challenges.

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- ✅ Read CSV and JSON files with Python
- ✅ Handle different file encodings and formats
- ✅ Validate file structure and content
- ✅ Process multiple files automatically
- ✅ Handle common file processing errors
- ✅ Build your first file ingestion function

---

## 🛠️ Setup and Imports

Let's start by importing the libraries we'll need and setting up our environment:

In [None]:
# Essential imports for file processing
import pandas as pd
import numpy as np
import json
import os
import glob
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')
sns.set_palette("husl")

print("📦 All libraries imported successfully!")
print(f"🐍 Pandas version: {pd.__version__}")
print(f"📊 Current working directory: {os.getcwd()}")

## 📂 Creating Sample Data Files

Before we learn to read files, let's create some sample data files to work with. This simulates the real-world scenario where you receive data files from different sources.

In [None]:
# Create directories for our sample data
data_dir = Path("sample_data")
data_dir.mkdir(exist_ok=True)

csv_dir = data_dir / "csv"
json_dir = data_dir / "json"
csv_dir.mkdir(exist_ok=True)
json_dir.mkdir(exist_ok=True)

print(f"📁 Created directories:")
print(f"  - {csv_dir}")
print(f"  - {json_dir}")

In [None]:
# Create sample CSV data (simulating store sales)
store_sales_data = {
    'order_id': ['ORD-2024-001', 'ORD-2024-002', 'ORD-2024-003', 'ORD-2024-004', 'ORD-2024-005'],
    'customer_name': ['John Doe', 'Jane Smith', 'Bob Wilson', 'Alice Johnson', 'Charlie Brown'],
    'product': ['iPhone 15', 'MacBook Pro', 'AirPods Pro', 'iPad Air', 'Apple Watch'],
    'quantity': [1, 1, 2, 1, 1],
    'price': [999.99, 1999.99, 249.99, 599.99, 399.99],
    'order_date': ['2024-01-15', '2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17'],
    'store_location': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

# Save as CSV file
df_store = pd.DataFrame(store_sales_data)
csv_file_path = csv_dir / "store_sales_2024_01.csv"
df_store.to_csv(csv_file_path, index=False)

print("📊 Created sample CSV file:")
print(f"  File: {csv_file_path}")
print(f"  Records: {len(df_store)}")
print(f"  Columns: {list(df_store.columns)}")

# Display the data
print("\n📋 Sample CSV Data:")
display(df_store)

In [None]:
# Create sample JSON data (simulating mobile app orders)
mobile_app_data = {
    "app_info": {
        "version": "2.1.0",
        "platform": "iOS",
        "upload_timestamp": "2024-01-15T12:00:00Z"
    },
    "orders": [
        {
            "order_id": "APP-2024-001",
            "customer_name": "Sarah Connor",
            "customer_email": "sarah@example.com",
            "product": "iPhone 15 Pro",
            "quantity": 1,
            "price": 1199.99,
            "order_date": "2024-01-15",
            "device_info": {
                "model": "iPhone 14",
                "os_version": "17.2"
            }
        },
        {
            "order_id": "APP-2024-002",
            "customer_name": "Mike Johnson",
            "customer_email": "mike@example.com",
            "product": "MacBook Air",
            "quantity": 1,
            "price": 1299.99,
            "order_date": "2024-01-15",
            "device_info": {
                "model": "iPhone 13",
                "os_version": "17.1"
            }
        },
        {
            "order_id": "APP-2024-003",
            "customer_name": "Emma Davis",
            "customer_email": "emma@example.com",
            "product": "AirPods Max",
            "quantity": 1,
            "price": 549.99,
            "order_date": "2024-01-16",
            "device_info": {
                "model": "iPhone 15",
                "os_version": "17.2"
            }
        }
    ],
    "summary": {
        "total_orders": 3,
        "total_revenue": 3049.97
    }
}

# Save as JSON file
json_file_path = json_dir / "mobile_orders_2024_01.json"
with open(json_file_path, 'w') as f:
    json.dump(mobile_app_data, f, indent=2)

print("📱 Created sample JSON file:")
print(f"  File: {json_file_path}")
print(f"  Orders: {mobile_app_data['summary']['total_orders']}")
print(f"  Revenue: ${mobile_app_data['summary']['total_revenue']:,.2f}")

# Display the JSON structure
print("\n📋 Sample JSON Structure:")
print(json.dumps(mobile_app_data, indent=2)[:500] + "...")

## 📊 Reading CSV Files

CSV (Comma-Separated Values) is one of the most common formats for data exchange. Let's learn how to read CSV files properly!

In [None]:
# Basic CSV reading
print("📊 Reading CSV File - Basic Method")
print("=" * 40)

# Method 1: Simple read
df_basic = pd.read_csv(csv_file_path)

print(f"✅ Successfully read CSV file!")
print(f"📏 Shape: {df_basic.shape} (rows, columns)")
print(f"📋 Columns: {list(df_basic.columns)}")
print(f"🔢 Data types:")
for col, dtype in df_basic.dtypes.items():
    print(f"  {col}: {dtype}")

print("\n📋 First 3 rows:")
display(df_basic.head(3))

In [None]:
# Advanced CSV reading with options
print("📊 Reading CSV File - Advanced Method")
print("=" * 40)

# Method 2: With specific options
df_advanced = pd.read_csv(
    csv_file_path,
    encoding='utf-8',           # Specify encoding
    parse_dates=['order_date'], # Parse dates automatically
    dtype={                     # Specify data types
        'order_id': 'string',
        'customer_name': 'string',
        'product': 'string',
        'quantity': 'int64',
        'price': 'float64',
        'store_location': 'string'
    }
)

print(f"✅ Successfully read CSV with advanced options!")
print(f"🔢 Improved data types:")
for col, dtype in df_advanced.dtypes.items():
    print(f"  {col}: {dtype}")

print("\n📊 Data Info:")
df_advanced.info()
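
The `dtype` and `parse_dates` options matter because CSV itself stores no type information: every field is text until something converts it. The following is a minimal sketch using only Python's standard `csv` module; the inline sample rows are made up for illustration and mirror the shape of the store sales file. It shows what the raw values look like before pandas infers or applies types.

```python
import csv
import io

# Hypothetical two-row sample in the same shape as the store sales file
raw_text = (
    "order_id,quantity,price,order_date\n"
    "ORD-2024-001,1,999.99,2024-01-15\n"
    "ORD-2024-003,2,249.99,2024-01-16\n"
)

rows = list(csv.DictReader(io.StringIO(raw_text)))

# Every value arrives as a string, including numbers and dates
value_types = {key: type(value).__name__ for key, value in rows[0].items()}
print(value_types)

# Arithmetic only works after an explicit conversion
line_total = int(rows[1]["quantity"]) * float(rows[1]["price"])
print(f"{line_total:.2f}")
```

Declaring types up front, as the advanced read does, makes the conversion explicit instead of leaving it to inference.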

In [None]:
# Create a function to read CSV files safely
def read_csv_safely(file_path, encoding='utf-8'):
    """
    Safely read a CSV file with error handling
    
    Args:
        file_path (str): Path to the CSV file
        encoding (str): File encoding (default: utf-8)
    
    Returns:
        tuple: (success, data_or_error_message, file_info)
    """
    try:
        # Check if file exists
        if not os.path.exists(file_path):
            return False, f"File not found: {file_path}", None
        
        # Get file info
        file_size = os.path.getsize(file_path)
        file_modified = datetime.fromtimestamp(os.path.getmtime(file_path))
        
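        # Try the requested encoding first, then common fallbacks (duplicates removed).
        # 'latin-1' accepts any byte sequence, so it goes last; otherwise 'cp1252' is never tried.
        encodings_to_try = list(dict.fromkeys([encoding, 'utf-8', 'cp1252', 'latin-1']))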
        
        for enc in encodings_to_try:
            try:
                df = pd.read_csv(file_path, encoding=enc)
                
                file_info = {
                    'file_path': str(file_path),
                    'file_size_bytes': file_size,
                    'file_size_mb': round(file_size / (1024*1024), 2),
                    'modified_date': file_modified.strftime('%Y-%m-%d %H:%M:%S'),
                    'encoding_used': enc,
                    'rows': len(df),
                    'columns': len(df.columns),
                    'column_names': list(df.columns)
                }
                
                return True, df, file_info
                
            except UnicodeDecodeError:
                continue
        
        return False, "Could not decode file with any supported encoding", None
        
    except Exception as e:
        return False, f"Error reading CSV file: {str(e)}", None

# Test our safe CSV reader
print("🛡️ Testing Safe CSV Reader")
print("=" * 30)

success, data, info = read_csv_safely(csv_file_path)

if success:
    print("✅ File read successfully!")
    print(f"📁 File Info:")
    for key, value in info.items():
        print(f"  {key}: {value}")
    
    print(f"\n📊 Data Preview:")
    display(data.head(2))
else:
    print(f"❌ Error: {data}")
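
One caveat applies to any encoding fallback list: `latin-1` maps every possible byte to a character, so it never raises `UnicodeDecodeError` and will silently accept text that was really written in another encoding. The sketch below works on raw bytes with a hypothetical helper, `decode_with_fallback` (not part of the notebook's functions), to show how the order of candidates changes the result.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Illustrative sample: text saved by a tool that writes Windows-1252
original = "café,10€"
raw_bytes = original.encode("cp1252")

# With 'latin-1' last, the stricter 'cp1252' gets a chance first
text, used = decode_with_fallback(raw_bytes)
print(used, text == original)

# With 'latin-1' before 'cp1252', decoding "succeeds" but the euro sign is lost
bad_text, bad_used = decode_with_fallback(raw_bytes, ("utf-8", "latin-1", "cp1252"))
print(bad_used, bad_text == original)
```

A successful decode is therefore not proof that the encoding is right; when the source system is known, passing its encoding explicitly is the safer choice.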

## 📱 Reading JSON Files

JSON (JavaScript Object Notation) files are flexible and can contain nested data structures. Let's learn how to handle them!

In [None]:
# Basic JSON reading
print("📱 Reading JSON File - Basic Method")
print("=" * 40)

# Method 1: Using json library
with open(json_file_path, 'r') as f:
    json_data = json.load(f)

print(f"✅ Successfully read JSON file!")
print(f"📋 Top-level keys: {list(json_data.keys())}")
print(f"📊 Number of orders: {len(json_data['orders'])}")
print(f"💰 Total revenue: ${json_data['summary']['total_revenue']:,.2f}")

print("\n📋 First order:")
print(json.dumps(json_data['orders'][0], indent=2))

In [None]:
# Convert JSON to DataFrame
print("📊 Converting JSON to DataFrame")
print("=" * 35)

# Extract orders from JSON
orders_list = json_data['orders']

# Method 1: Simple conversion (top-level fields become columns; nested dicts are not flattened)
df_json_simple = pd.DataFrame(orders_list)

print("📋 Simple conversion (nested fields remain as objects):")
display(df_json_simple)

print(f"\n🔢 Data types:")
for col, dtype in df_json_simple.dtypes.items():
    print(f"  {col}: {dtype}")

In [None]:
# Advanced JSON processing - Flatten nested data
print("📊 Advanced JSON Processing - Flattening Nested Data")
print("=" * 55)

def flatten_json_orders(json_data):
    """
    Flatten JSON orders data into a clean DataFrame
    
    Args:
        json_data (dict): JSON data with orders
    
    Returns:
        pd.DataFrame: Flattened orders data
    """
    flattened_orders = []
    
    for order in json_data['orders']:
        # Create flattened order record
        flat_order = {
            'order_id': order['order_id'],
            'customer_name': order['customer_name'],
            'customer_email': order['customer_email'],
            'product': order['product'],
            'quantity': order['quantity'],
            'price': order['price'],
            'order_date': order['order_date'],
            # Flatten device_info
            'device_model': order['device_info']['model'],
            'device_os_version': order['device_info']['os_version'],
            # Add metadata from app_info
            'app_version': json_data['app_info']['version'],
            'app_platform': json_data['app_info']['platform'],
            'upload_timestamp': json_data['app_info']['upload_timestamp']
        }
        
        flattened_orders.append(flat_order)
    
    return pd.DataFrame(flattened_orders)

# Flatten our JSON data
df_json_flat = flatten_json_orders(json_data)

print("✅ Successfully flattened JSON data!")
print(f"📏 Shape: {df_json_flat.shape}")
print(f"📋 Columns: {list(df_json_flat.columns)}")

print("\n📊 Flattened Data:")
display(df_json_flat)
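
Writing the flattening by hand keeps every column name under your control. For comparison, pandas also provides `pd.json_normalize`, which can expand nested dictionaries and attach file-level metadata in a single call. The sketch below uses a small made-up payload in the same shape as the mobile orders file. Note that the generated column names (for example `device_info_model`) differ from the hand-written ones above.

```python
import pandas as pd

# Hypothetical payload shaped like the mobile orders file
payload = {
    "app_info": {"version": "2.1.0", "platform": "iOS"},
    "orders": [
        {"order_id": "APP-1", "price": 10.0,
         "device_info": {"model": "A", "os_version": "17.2"}},
        {"order_id": "APP-2", "price": 20.0,
         "device_info": {"model": "B", "os_version": "17.1"}},
    ],
}

df = pd.json_normalize(
    payload,
    record_path="orders",                   # one row per order
    meta=[["app_info", "version"],          # repeat file-level fields on every row
          ["app_info", "platform"]],
    sep="_",                                # join nested keys with an underscore
)

print(df.columns.tolist())
print(df.shape)
```

The manual approach fails loudly with a `KeyError` when an expected field is absent, while `json_normalize` adapts to whatever keys are present. Which behavior is preferable depends on how strictly the input schema should be enforced.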

In [None]:
# Create a function to read JSON files safely
def read_json_safely(file_path):
    """
    Safely read a JSON file with error handling
    
    Args:
        file_path (str): Path to the JSON file
    
    Returns:
        tuple: (success, data_or_error_message, file_info)
    """
    try:
        # Check if file exists
        if not os.path.exists(file_path):
            return False, f"File not found: {file_path}", None
        
        # Get file info
        file_size = os.path.getsize(file_path)
        file_modified = datetime.fromtimestamp(os.path.getmtime(file_path))
        
        # Read JSON file
        with open(file_path, 'r', encoding='utf-8') as f:
            json_data = json.load(f)
        
        # Analyze JSON structure
        def analyze_json_structure(data, prefix=""):
            structure = {}
            if isinstance(data, dict):
                for key, value in data.items():
                    full_key = f"{prefix}.{key}" if prefix else key
                    if isinstance(value, (dict, list)):
                        structure[full_key] = type(value).__name__
                        if isinstance(value, list) and len(value) > 0:
                            structure[f"{full_key}_length"] = len(value)
                    else:
                        structure[full_key] = type(value).__name__
            return structure
        
        structure = analyze_json_structure(json_data)
        
        file_info = {
            'file_path': str(file_path),
            'file_size_bytes': file_size,
            'file_size_mb': round(file_size / (1024*1024), 2),
            'modified_date': file_modified.strftime('%Y-%m-%d %H:%M:%S'),
            'json_structure': structure,
            'top_level_keys': list(json_data.keys()) if isinstance(json_data, dict) else 'Not a dict'
        }
        
        return True, json_data, file_info
        
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON format: {str(e)}", None
    except Exception as e:
        return False, f"Error reading JSON file: {str(e)}", None

# Test our safe JSON reader
print("🛡️ Testing Safe JSON Reader")
print("=" * 30)

success, data, info = read_json_safely(json_file_path)

if success:
    print("✅ File read successfully!")
    print(f"📁 File Info:")
    for key, value in info.items():
        if key != 'json_structure':  # Skip detailed structure for now
            print(f"  {key}: {value}")
    
    print(f"\n📊 JSON Structure:")
    for key, value in info['json_structure'].items():
        print(f"  {key}: {value}")
else:
    print(f"❌ Error: {data}")
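
The structure summary above reports only top-level keys. If nested fields need to be inspected as well, a recursive walk is one option. The following is a sketch with a hypothetical helper, `describe_structure`, that records dotted key paths. As a simplifying assumption, it looks only at the first element of each list.

```python
def describe_structure(data, prefix=""):
    """Map dotted key paths to type names, descending into nested dicts and lists."""
    structure = {}
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            structure[path] = type(value).__name__
            structure.update(describe_structure(value, path))
    elif isinstance(data, list) and data:
        # Assumption: list items share one shape, so the first item is representative
        structure.update(describe_structure(data[0], f"{prefix}[0]"))
    return structure


# Illustrative sample in the shape of the mobile orders file
sample = {
    "app_info": {"version": "2.1.0"},
    "orders": [{"order_id": "APP-1", "device_info": {"model": "X"}}],
    "summary": {"total_orders": 1},
}

layout = describe_structure(sample)
for path, type_name in layout.items():
    print(f"{path}: {type_name}")
```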

## 🔍 File Validation and Quality Checks

Before processing files, it's crucial to validate their structure and content. Let's learn how to build robust validation!
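
A cheap first check is to read only the header row and compare it with the columns you expect, before loading the whole file. The sketch below uses the standard `csv` module; `missing_columns` is a hypothetical helper name, and the temporary file stands in for a real input.

```python
import csv
import tempfile
from pathlib import Path


def missing_columns(path, required, encoding="utf-8"):
    """Return the required column names that are absent from the CSV header."""
    with open(path, newline="", encoding=encoding) as handle:
        header = next(csv.reader(handle), [])  # an empty file yields an empty header
    return sorted(set(required) - set(header))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "wrong_header.csv"
    sample.write_text("id,name\n1,A\n", encoding="utf-8")
    missing = missing_columns(sample, ["order_id", "name"])

print(missing)
```

For large files this avoids parsing every row just to discover that a required column is absent.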

In [None]:
# File validation function
def validate_csv_file(df, required_columns=None, min_rows=1):
    """
    Validate CSV file structure and content
    
    Args:
        df (pd.DataFrame): DataFrame to validate
        required_columns (list): List of required column names
        min_rows (int): Minimum number of rows required
    
    Returns:
        dict: Validation results
    """
    validation_results = {
        'is_valid': True,
        'errors': [],
        'warnings': [],
        'stats': {
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'missing_values': df.isnull().sum().sum(),
            'duplicate_rows': df.duplicated().sum()
        }
    }
    
    # Check if DataFrame is empty
    if df.empty:
        validation_results['is_valid'] = False
        validation_results['errors'].append("File is empty")
        return validation_results
    
    # Check minimum rows
    if len(df) < min_rows:
        validation_results['is_valid'] = False
        validation_results['errors'].append(f"File has {len(df)} rows, minimum required: {min_rows}")
    
    # Check required columns
    if required_columns:
        missing_columns = set(required_columns) - set(df.columns)
        if missing_columns:
            validation_results['is_valid'] = False
            validation_results['errors'].append(f"Missing required columns: {list(missing_columns)}")
    
    # Check for missing values
    missing_by_column = df.isnull().sum()
    for col, missing_count in missing_by_column.items():
        if missing_count > 0:
            missing_pct = (missing_count / len(df)) * 100
            if missing_pct > 50:
                validation_results['warnings'].append(f"Column '{col}' has {missing_pct:.1f}% missing values")
    
    # Check for duplicates
    if validation_results['stats']['duplicate_rows'] > 0:
        validation_results['warnings'].append(f"Found {validation_results['stats']['duplicate_rows']} duplicate rows")
    
    return validation_results

# Test validation on our CSV data
print("🔍 Validating CSV File")
print("=" * 25)

required_cols = ['order_id', 'customer_name', 'product', 'quantity', 'price']
validation = validate_csv_file(df_advanced, required_columns=required_cols)

print(f"✅ Validation Status: {'PASSED' if validation['is_valid'] else 'FAILED'}")
print(f"📊 File Statistics:")
for key, value in validation['stats'].items():
    print(f"  {key}: {value}")

if validation['errors']:
    print(f"\n❌ Errors:")
    for error in validation['errors']:
        print(f"  - {error}")

if validation['warnings']:
    print(f"\n⚠️ Warnings:")
    for warning in validation['warnings']:
        print(f"  - {warning}")

if validation['is_valid'] and not validation['warnings']:
    print("\n🎉 File validation passed with no issues!")
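
The checks above look at the table as a whole. Content rules for individual records can be layered on top. The sketch below validates one record at a time; the ID pattern and the value rules are assumptions inferred from the sample data, not requirements stated anywhere in this tutorial.

```python
import re

# Assumed ID format, inferred from the sample data (e.g. ORD-2024-001, APP-2024-001)
ORDER_ID_PATTERN = re.compile(r"^(ORD|APP)-\d{4}-\d{3}$")


def check_record(record):
    """Return a list of problems found in one order record (empty if none)."""
    problems = []
    if not ORDER_ID_PATTERN.match(str(record.get("order_id", ""))):
        problems.append("order_id has an unexpected format")
    try:
        if int(record["quantity"]) <= 0:
            problems.append("quantity must be positive")
    except (KeyError, TypeError, ValueError):
        problems.append("quantity is missing or not an integer")
    try:
        if float(record["price"]) < 0:
            problems.append("price must not be negative")
    except (KeyError, TypeError, ValueError):
        problems.append("price is missing or not a number")
    return problems


good = {"order_id": "ORD-2024-001", "quantity": 1, "price": 999.99}
bad = {"order_id": "1", "quantity": 0, "price": "n/a"}

print(check_record(good))
print(check_record(bad))
```

With a DataFrame, the same function could be applied to each row of `df.to_dict("records")`, at the cost of being slower than vectorized checks.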

## 📁 Processing Multiple Files

In real-world scenarios, you'll often need to process multiple files at once. Let's learn how to handle batch file processing!
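
Which files a batch job picks up depends on the glob pattern. The sketch below builds a throwaway directory tree (the file names are invented for illustration) to contrast `*.csv`, which matches only the given directory, with `**/*.csv`, which also descends into subdirectories. The results are sorted because glob does not guarantee any particular order.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "csv").mkdir()
    (root / "top.csv").write_text("a\n1\n")
    (root / "csv" / "nested_b.csv").write_text("a\n2\n")
    (root / "csv" / "nested_a.csv").write_text("a\n3\n")

    # Non-recursive: only files directly inside root
    flat = sorted(p.name for p in root.glob("*.csv"))

    # Recursive: root itself plus every subdirectory
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.csv"))

print(flat)
print(recursive)
```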

In [None]:
# Create additional sample files for batch processing
print("📁 Creating Additional Sample Files")
print("=" * 35)

# Create more CSV files
additional_csv_data = [
    {
        'filename': 'store_sales_2024_02.csv',
        'data': {
            'order_id': ['ORD-2024-006', 'ORD-2024-007', 'ORD-2024-008'],
            'customer_name': ['David Lee', 'Lisa Wang', 'Tom Brown'],
            'product': ['iPad Pro', 'AirPods Max', 'Mac Studio'],
            'quantity': [1, 1, 1],
            'price': [1099.99, 549.99, 1999.99],
            'order_date': ['2024-02-01', '2024-02-01', '2024-02-02'],
            'store_location': ['Miami', 'Seattle', 'Boston']
        }
    },
    {
        'filename': 'store_sales_2024_03.csv',
        'data': {
            'order_id': ['ORD-2024-009', 'ORD-2024-010'],
            'customer_name': ['Anna Garcia', 'Chris Wilson'],
            'product': ['iPhone 15 Plus', 'MacBook Air'],
            'quantity': [1, 1],
            'price': [1099.99, 1199.99],
            'order_date': ['2024-03-01', '2024-03-01'],
            'store_location': ['Denver', 'Portland']
        }
    }
]

# Save additional CSV files
for file_info in additional_csv_data:
    df_temp = pd.DataFrame(file_info['data'])
    file_path = csv_dir / file_info['filename']
    df_temp.to_csv(file_path, index=False)
    print(f"📄 Created: {file_info['filename']} ({len(df_temp)} records)")

# Create more JSON files
additional_json_data = {
    "app_info": {
        "version": "2.2.0",
        "platform": "Android",
        "upload_timestamp": "2024-02-01T14:30:00Z"
    },
    "orders": [
        {
            "order_id": "APP-2024-004",
            "customer_name": "Kevin Park",
            "customer_email": "kevin@example.com",
            "product": "Samsung Galaxy S24",
            "quantity": 1,
            "price": 899.99,
            "order_date": "2024-02-01",
            "device_info": {
                "model": "Galaxy S23",
                "os_version": "14"
            }
        }
    ],
    "summary": {
        "total_orders": 1,
        "total_revenue": 899.99
    }
}

json_file_path_2 = json_dir / "mobile_orders_2024_02.json"
with open(json_file_path_2, 'w') as f:
    json.dump(additional_json_data, f, indent=2)

print(f"📱 Created: mobile_orders_2024_02.json (1 record)")

print(f"\n📊 Total files created:")
csv_files = list(csv_dir.glob('*.csv'))
json_files = list(json_dir.glob('*.json'))
print(f"  CSV files: {len(csv_files)}")
print(f"  JSON files: {len(json_files)}")

In [None]:
# Batch file processing function
def process_multiple_files(directory, file_pattern, file_type='csv'):
    """
    Process multiple files from a directory
    
    Args:
        directory (str): Directory path
        file_pattern (str): File pattern (e.g., '*.csv')
        file_type (str): Type of files ('csv' or 'json')
    
    Returns:
        dict: Processing results
    """
    results = {
        'successful_files': [],
        'failed_files': [],
        'total_records': 0,
        'combined_data': None,
        'processing_summary': []
    }
    
    # Find all matching files (sorted so the processing order is reproducible)
    file_paths = sorted(glob.glob(os.path.join(directory, file_pattern)))
    
    if not file_paths:
        print(f"⚠️ No files found matching pattern: {file_pattern}")
        return results
    
    print(f"📁 Found {len(file_paths)} files to process")
    
    all_dataframes = []
    
    for file_path in file_paths:
        filename = os.path.basename(file_path)
        print(f"\n📄 Processing: {filename}")
        
        try:
            if file_type == 'csv':
                success, data, info = read_csv_safely(file_path)
            elif file_type == 'json':
                success, data, info = read_json_safely(file_path)
                if success:
                    # Convert JSON to DataFrame
                    data = flatten_json_orders(data)
            else:
                success = False
                data = f"Unsupported file type: {file_type}"
                info = None
            
            if success:
                # Add source file information
                data['source_file'] = filename
                data['processed_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                
                all_dataframes.append(data)
                results['successful_files'].append(filename)
                results['total_records'] += len(data)
                
                summary = {
                    'filename': filename,
                    'status': 'success',
                    'records': len(data),
                    'columns': len(data.columns),
                    'file_size_mb': info['file_size_mb'] if info else 0
                }
                results['processing_summary'].append(summary)
                
                print(f"  ✅ Success: {len(data)} records")
            else:
                results['failed_files'].append({'filename': filename, 'error': data})
                summary = {
                    'filename': filename,
                    'status': 'failed',
                    'error': data
                }
                results['processing_summary'].append(summary)
                print(f"  ❌ Failed: {data}")
                
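        except Exception as e:
            error_msg = f"Unexpected error: {str(e)}"
            results['failed_files'].append({'filename': filename, 'error': error_msg})
            # Record the failure in the summary too, matching the handled-failure branch
            results['processing_summary'].append({'filename': filename, 'status': 'failed', 'error': error_msg})
            print(f"  ❌ Error: {error_msg}")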
    
    # Combine all successful DataFrames
    if all_dataframes:
        try:
            results['combined_data'] = pd.concat(all_dataframes, ignore_index=True)
            print(f"\n🔗 Combined {len(all_dataframes)} files into single dataset")
        except Exception as e:
            print(f"\n⚠️ Warning: Could not combine files: {str(e)}")
    
    return results

# Process all CSV files
print("📊 Processing All CSV Files")
print("=" * 30)

csv_results = process_multiple_files(csv_dir, '*.csv', 'csv')

print(f"\n📈 CSV Processing Summary:")
print(f"  Successful files: {len(csv_results['successful_files'])}")
print(f"  Failed files: {len(csv_results['failed_files'])}")
print(f"  Total records: {csv_results['total_records']:,}")

if csv_results['combined_data'] is not None:
    print(f"  Combined data shape: {csv_results['combined_data'].shape}")
    print(f"\n📋 Sample of combined data:")
    display(csv_results['combined_data'].head())

In [None]:
# Process all JSON files
print("📱 Processing All JSON Files")
print("=" * 30)

json_results = process_multiple_files(json_dir, '*.json', 'json')

print(f"\n📈 JSON Processing Summary:")
print(f"  Successful files: {len(json_results['successful_files'])}")
print(f"  Failed files: {len(json_results['failed_files'])}")
print(f"  Total records: {json_results['total_records']:,}")

if json_results['combined_data'] is not None:
    print(f"  Combined data shape: {json_results['combined_data'].shape}")
    print(f"\n📋 Sample of combined JSON data:")
    display(json_results['combined_data'].head())

## 🔄 Combining Data from Multiple Sources

Now let's combine data from both CSV and JSON sources into a unified dataset!

In [None]:
# Combine CSV and JSON data
print("🔗 Combining Data from Multiple Sources")
print("=" * 40)

def combine_multi_source_data(csv_data, json_data):
    """
    Combine data from CSV and JSON sources
    
    Args:
        csv_data (pd.DataFrame): Data from CSV files
        json_data (pd.DataFrame): Data from JSON files
    
    Returns:
        pd.DataFrame: Combined and standardized data
    """
    combined_datasets = []
    
    # Process CSV data
    if csv_data is not None and not csv_data.empty:
        csv_standardized = csv_data.copy()
        csv_standardized['source_type'] = 'csv'
        csv_standardized['customer_email'] = None  # CSV doesn't have email
        
        # Standardize column order
        csv_columns = ['order_id', 'customer_name', 'customer_email', 'product', 
                      'quantity', 'price', 'order_date', 'source_file', 'source_type']
        
        # Add missing columns with default values
        for col in csv_columns:
            if col not in csv_standardized.columns:
                csv_standardized[col] = None
        
        csv_standardized = csv_standardized[csv_columns + ['store_location', 'processed_at']]
        combined_datasets.append(csv_standardized)
        print(f"📊 CSV data: {len(csv_standardized)} records")
    
    # Process JSON data
    if json_data is not None and not json_data.empty:
        json_standardized = json_data.copy()
        json_standardized['source_type'] = 'json'
        json_standardized['store_location'] = None  # JSON doesn't have store location
        
        # Standardize column order
        json_columns = ['order_id', 'customer_name', 'customer_email', 'product', 
                       'quantity', 'price', 'order_date', 'source_file', 'source_type']
        
        # Add missing columns with default values
        for col in json_columns:
            if col not in json_standardized.columns:
                json_standardized[col] = None
        
        json_standardized = json_standardized[json_columns + ['store_location', 'processed_at']]
        combined_datasets.append(json_standardized)
        print(f"📱 JSON data: {len(json_standardized)} records")
    
    # Combine all datasets
    if combined_datasets:
        final_data = pd.concat(combined_datasets, ignore_index=True)
        
        # Add final processing metadata
        final_data['combined_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        final_data['record_id'] = range(1, len(final_data) + 1)
        
        return final_data
    else:
        return pd.DataFrame()

# Combine the data
combined_data = combine_multi_source_data(
    csv_results['combined_data'], 
    json_results['combined_data']
)

if not combined_data.empty:
    print(f"\n🎉 Successfully combined data!")
    print(f"📏 Final dataset shape: {combined_data.shape}")
    print(f"📋 Columns: {list(combined_data.columns)}")
    
    # Show source breakdown
    source_breakdown = combined_data['source_type'].value_counts()
    print(f"\n📊 Source Breakdown:")
    for source, count in source_breakdown.items():
        print(f"  {source.upper()}: {count} records")
    
    print(f"\n📋 Combined Dataset Sample:")
    display(combined_data[['record_id', 'order_id', 'customer_name', 'product', 'price', 'source_type']].head(10))
else:
    print("❌ No data to combine")
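
When the sources do not share every column, `pd.concat` aligns columns by name and fills the gaps with missing values. Adding placeholder columns by hand, as the function above does, mainly serves to fix the column order and make the target schema explicit. The sketch below shows the default alignment with two tiny made-up frames.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV and JSON results
csv_like = pd.DataFrame({
    "order_id": ["ORD-1", "ORD-2"],
    "price": [10.0, 20.0],
    "store_location": ["Miami", "Boston"],
})
json_like = pd.DataFrame({
    "order_id": ["APP-1"],
    "price": [30.0],
    "customer_email": ["a@example.com"],
})

# Columns are matched by name; cells with no source value become missing
combined = pd.concat([csv_like, json_like], ignore_index=True)

print(combined.columns.tolist())
print(combined.isna().sum().to_dict())
```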

## 📊 Data Analysis and Visualization

Now that we have our combined dataset, let's do some basic analysis to understand our data better!

In [None]:
# Basic data analysis
if not combined_data.empty:
    print("📊 Data Analysis Summary")
    print("=" * 25)
    
    # Basic statistics
    print(f"📈 Dataset Overview:")
    print(f"  Total Orders: {len(combined_data):,}")
    print(f"  Unique Customers: {combined_data['customer_name'].nunique()}")
    print(f"  Unique Products: {combined_data['product'].nunique()}")
    print(f"  Date Range: {combined_data['order_date'].min()} to {combined_data['order_date'].max()}")
    
    # Revenue analysis (an order's value is unit price times quantity)
    order_totals = combined_data['price'] * combined_data['quantity']
    total_revenue = order_totals.sum()
    avg_order_value = order_totals.mean()
    print(f"\n💰 Revenue Analysis:")
    print(f"  Total Revenue: ${total_revenue:,.2f}")
    print(f"  Average Order Value: ${avg_order_value:.2f}")
    print(f"  Highest Order: ${order_totals.max():.2f}")
    print(f"  Lowest Order: ${order_totals.min():.2f}")
    
    # Product analysis
    print(f"\n📱 Top Products:")
    top_products = combined_data['product'].value_counts().head(5)
    for product, count in top_products.items():
        print(f"  {product}: {count} orders")
    
    # Create visualizations
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Revenue by source type
    revenue_by_source = order_totals.groupby(combined_data['source_type']).sum()
    ax1.pie(revenue_by_source.values, labels=revenue_by_source.index, autopct='%1.1f%%')
    ax1.set_title('Revenue by Source Type')
    
    # 2. Order count by product
    top_products.plot(kind='bar', ax=ax2)
    ax2.set_title('Top Products by Order Count')
    ax2.set_xlabel('Product')
    ax2.set_ylabel('Number of Orders')
    ax2.tick_params(axis='x', rotation=45)
    
    # 3. Price distribution
    ax3.hist(combined_data['price'], bins=10, alpha=0.7, edgecolor='black')
    ax3.set_title('Price Distribution')
    ax3.set_xlabel('Price ($)')
    ax3.set_ylabel('Frequency')
    
    # 4. Orders over time (if we have date data)
    combined_data['order_date'] = pd.to_datetime(combined_data['order_date'])
    daily_orders = combined_data.groupby('order_date').size()
    daily_orders.plot(kind='line', marker='o', ax=ax4)
    ax4.set_title('Orders Over Time')
    ax4.set_xlabel('Date')
    ax4.set_ylabel('Number of Orders')
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("❌ No data available for analysis")

## 🛡️ Error Handling and Best Practices

Let's learn how to handle common file processing errors gracefully!

In [None]:
# Create problematic files to test error handling
print("🧪 Creating Test Files with Issues")
print("=" * 35)

# Create an empty CSV file
empty_csv = csv_dir / "empty_file.csv"
empty_csv.touch()
print("📄 Created empty CSV file")

# Create a CSV with missing columns
bad_csv_data = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C']
})
bad_csv = csv_dir / "bad_structure.csv"
bad_csv_data.to_csv(bad_csv, index=False)
print("📄 Created CSV with wrong structure")

# Create an invalid JSON file
invalid_json = json_dir / "invalid.json"
with open(invalid_json, 'w') as f:
    f.write('{"invalid": json content}')
print("📄 Created invalid JSON file")

print("\n🧪 Testing Error Handling")
print("=" * 25)

# Test files with issues
test_files = [
    (empty_csv, 'csv'),
    (bad_csv, 'csv'),
    (invalid_json, 'json'),
    ('nonexistent_file.csv', 'csv')
]

for file_path, file_type in test_files:
    print(f"\n📄 Testing: {os.path.basename(str(file_path))}")
    
    if file_type == 'csv':
        success, data, info = read_csv_safely(file_path)
    else:
        success, data, info = read_json_safely(file_path)
    
    if success:
        print(f"  ✅ Success: {len(data)} records")
        
        # Validate the data
        if file_type == 'csv':
            validation = validate_csv_file(data, required_columns=['order_id', 'customer_name'])
            if not validation['is_valid']:
                print(f"  ⚠️ Validation failed: {validation['errors']}")
    else:
        print(f"  ❌ Failed: {data}")
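
For invalid JSON, the exception itself says where parsing stopped: `json.JSONDecodeError` exposes `msg`, `lineno`, and `colno` attributes. The sketch below uses a hypothetical helper, `parse_json_text`, that works on a string rather than a file, and turns those attributes into a message pointing at the location of the problem.

```python
import json


def parse_json_text(text):
    """Return (True, data) on success or (False, message) with the error location."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"{exc.msg} (line {exc.lineno}, column {exc.colno})"


# Same malformed content as the invalid test file above
ok, result = parse_json_text('{"invalid": json content}')
print(ok, result)

ok_valid, parsed = parse_json_text('{"valid": true}')
print(ok_valid, parsed)
```

Including the line and column in log messages makes it much quicker to locate the fault in a large file.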

## 🎯 Building Your First File Ingestion System

Let's put everything together and build a complete file ingestion system!

In [None]:
# Complete file ingestion system
class SimpleFileIngestion:
    """
    A simple but robust file ingestion system
    """
    
    def __init__(self, input_directory, output_directory=None):
        self.input_directory = Path(input_directory)
        self.output_directory = Path(output_directory) if output_directory else Path("processed_data")
        self.output_directory.mkdir(parents=True, exist_ok=True)
        
        # Processing statistics
        self.stats = {
            'files_processed': 0,
            'files_failed': 0,
            'total_records': 0,
            'processing_errors': []
        }
    
    def process_all_files(self):
        """
        Process all CSV and JSON files in the input directory
        
        Returns:
            pd.DataFrame: Combined processed data
        """
        print(f"🚀 Starting file ingestion from: {self.input_directory}")
        print("=" * 50)
        
        all_data = []
        
        # Process CSV files
        csv_files = list(self.input_directory.glob('**/*.csv'))
        print(f"📊 Found {len(csv_files)} CSV files")
        
        for csv_file in csv_files:
            data = self._process_csv_file(csv_file)
            if data is not None:
                all_data.append(data)
        
        # Process JSON files
        json_files = list(self.input_directory.glob('**/*.json'))
        print(f"📱 Found {len(json_files)} JSON files")
        
        for json_file in json_files:
            data = self._process_json_file(json_file)
            if data is not None:
                all_data.append(data)
        
        # Combine all data
        if all_data:
            combined_data = pd.concat(all_data, ignore_index=True)
            self.stats['total_records'] = len(combined_data)
            
            # Save combined data
            output_file = self.output_directory / f"combined_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
            combined_data.to_csv(output_file, index=False)
            
            print(f"\n💾 Saved combined data to: {output_file}")
            self._print_summary()
            
            return combined_data
        else:
            print("\n⚠️ No data was successfully processed")
            self._print_summary()
            return pd.DataFrame()
    
    def _process_csv_file(self, file_path):
        """Process a single CSV file"""
        print(f"\n📄 Processing CSV: {file_path.name}")
        
        try:
            success, data, info = read_csv_safely(file_path)
            
            if success:
                # Validate the data
                validation = validate_csv_file(data, min_rows=1)
                
                if validation['is_valid']:
                    # Add metadata
                    data['source_file'] = file_path.name
                    data['source_type'] = 'csv'
                    data['processed_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                    
                    self.stats['files_processed'] += 1
                    print(f"  ✅ Success: {len(data)} records")
                    return data
                else:
                    self.stats['files_failed'] += 1
                    error_msg = f"Validation failed: {validation['errors']}"
                    self.stats['processing_errors'].append({'file': file_path.name, 'error': error_msg})
                    print(f"  ❌ Validation failed: {validation['errors']}")
            else:
                self.stats['files_failed'] += 1
                self.stats['processing_errors'].append({'file': file_path.name, 'error': data})
                print(f"  ❌ Failed: {data}")
                
        except Exception as e:
            self.stats['files_failed'] += 1
            error_msg = f"Unexpected error: {str(e)}"
            self.stats['processing_errors'].append({'file': file_path.name, 'error': error_msg})
            print(f"  ❌ Error: {error_msg}")
        
        return None
    
    def _process_json_file(self, file_path):
        """Process a single JSON file"""
        print(f"\n📄 Processing JSON: {file_path.name}")
        
        try:
            success, data, info = read_json_safely(file_path)
            
            if success:
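                # Convert to DataFrame (only dict payloads with an 'orders' key are supported;
                # the isinstance check keeps a top-level JSON array from being misread)
                if isinstance(data, dict) and 'orders' in data: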
                    df = flatten_json_orders(data)
                    
                    # Add metadata
                    df['source_file'] = file_path.name
                    df['source_type'] = 'json'
                    df['processed_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                    
                    self.stats['files_processed'] += 1
                    print(f"  ✅ Success: {len(df)} records")
                    return df
                else:
                    self.stats['files_failed'] += 1
                    error_msg = "No 'orders' key found in JSON"
                    self.stats['processing_errors'].append({'file': file_path.name, 'error': error_msg})
                    print(f"  ❌ {error_msg}")
            else:
                self.stats['files_failed'] += 1
                self.stats['processing_errors'].append({'file': file_path.name, 'error': data})
                print(f"  ❌ Failed: {data}")
                
        except Exception as e:
            self.stats['files_failed'] += 1
            error_msg = f"Unexpected error: {str(e)}"
            self.stats['processing_errors'].append({'file': file_path.name, 'error': error_msg})
            print(f"  ❌ Error: {error_msg}")
        
        return None
    
    def _print_summary(self):
        """Print processing summary"""
        print(f"\n📊 Processing Summary")
        print("=" * 20)
        print(f"✅ Files Processed: {self.stats['files_processed']}")
        print(f"❌ Files Failed: {self.stats['files_failed']}")
        print(f"📈 Total Records: {self.stats['total_records']:,}")
        
        if self.stats['processing_errors']:
            print(f"\n⚠️ Errors:")
            for error in self.stats['processing_errors']:
                print(f"  {error['file']}: {error['error']}")

# Test our complete ingestion system
print("🎯 Testing Complete File Ingestion System")
print("=" * 45)

# Initialize the ingestion system
ingestion_system = SimpleFileIngestion(data_dir)

# Process all files
final_data = ingestion_system.process_all_files()

if not final_data.empty:
    print(f"\n🎉 Ingestion completed successfully!")
    print(f"📏 Final dataset: {final_data.shape}")
    print(f"\n📋 Sample of final data:")
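    # Show only the preview columns that are actually present, to avoid a KeyError
    preview_columns = ['source_file', 'order_id', 'customer_name', 'product', 'price', 'source_type']
    display(final_data[[col for col in preview_columns if col in final_data.columns]].head())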
else:
    print(f"\n❌ No data was processed successfully")

## 🎯 Key Takeaways

Congratulations! You've completed the file reading tutorial. Here's what you've learned:

### ✅ **Core Skills Mastered**
- **📊 CSV Reading**: Basic and advanced CSV processing with pandas
- **📱 JSON Processing**: Handling nested JSON structures and flattening data
- **🔍 File Validation**: Checking file structure and data quality
- **📁 Batch Processing**: Processing multiple files automatically
- **🛡️ Error Handling**: Graceful handling of common file issues
- **🔗 Data Combination**: Merging data from multiple sources
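
To make the data-combination point concrete, here is a minimal sketch of what `pd.concat` does when the sources do not share the same columns. The two small frames and their values are invented for illustration (they are not the tutorial's sample files). Columns that exist in only one source are filled with `NaN` in the combined result, which is worth checking before you analyze the merged data.

```python
import pandas as pd

# Hypothetical frames standing in for a CSV source and a flattened JSON source
csv_orders = pd.DataFrame({
    "order_id": ["ORD-1", "ORD-2"],
    "price": [999.0, 249.0],
    "source_type": "csv",
})
json_orders = pd.DataFrame({
    "order_id": ["APP-1"],
    "device": ["mobile"],  # column that only this source has
    "source_type": "json",
})

# concat keeps the union of columns and fills the gaps with NaN
combined = pd.concat([csv_orders, json_orders], ignore_index=True)

# Count missing values per column to see which fields are source-specific
missing_per_column = combined.isna().sum()
print(combined)
print(missing_per_column)
```

A quick `isna().sum()` like this is a cheap way to spot a file whose layout differs from the rest.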

### ✅ **Best Practices Learned**
- Always validate files before processing
- Handle different encodings gracefully
- Add metadata to track data lineage
- Implement comprehensive error handling
- Combine data sources systematically
- Log processing results for monitoring
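
The `SimpleFileIngestion` class reports its results with `print`. As one possible way to act on the last point, the sketch below writes the same summary through Python's standard `logging` module. The `log_ingestion_stats` helper, the logger name, and the example values are hypothetical; the dictionary only mirrors the shape of `SimpleFileIngestion.stats`.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("file_ingestion")  # hypothetical logger name


def log_ingestion_stats(stats):
    """Log a stats dict shaped like SimpleFileIngestion.stats (hypothetical helper)."""
    logger.info(
        "processed=%d failed=%d records=%d",
        stats["files_processed"], stats["files_failed"], stats["total_records"],
    )
    # One warning per failed file makes problems easy to filter later
    for error in stats["processing_errors"]:
        logger.warning("%s: %s", error["file"], error["error"])
    return len(stats["processing_errors"])


# Illustrative values only, not output from a real run
example_stats = {
    "files_processed": 3,
    "files_failed": 1,
    "total_records": 12,
    "processing_errors": [{"file": "invalid.json", "error": "could not parse"}],
}
warnings_logged = log_ingestion_stats(example_stats)
```

Unlike `print`, log records carry a timestamp and a level, so they can be redirected to a file or filtered by severity without changing the ingestion code.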

### ✅ **Real-World Applications**
- **Business Reports**: Processing daily/weekly sales files
- **Data Integration**: Combining data from multiple systems
- **ETL Pipelines**: First stage of data transformation
- **Data Quality**: Validating incoming data files

---

## 🚀 What's Next?

In the next tutorial, **"03_calling_apis.ipynb"**, you'll learn:
- 🌐 How to call REST APIs to fetch data
- 🔐 Handling API authentication and rate limiting
- 🔄 Processing API responses and pagination
- ⚡ Real-time data ingestion from APIs
- 🛡️ Error handling for network issues

### 🎯 **Practice Exercise**

Before moving to the next tutorial, try this exercise:

1. **Create your own sample data** for a different business (restaurant, library, etc.)
2. **Save it in both CSV and JSON formats** with different structures
3. **Use the ingestion system** we built to process your files
4. **Add custom validation rules** specific to your business domain
5. **Create visualizations** to analyze your processed data
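
For step 4, one way to start is a small rule-based checker that runs on a DataFrame. This is only a sketch: `validate_restaurant_orders`, the column names, and the rules themselves are made up for a hypothetical restaurant dataset, and the return shape is an assumption modelled on the `is_valid` and `errors` keys used with `validate_csv_file` earlier.

```python
import pandas as pd


def validate_restaurant_orders(df):
    """Apply hypothetical business rules; returns {'is_valid': bool, 'errors': list}."""
    errors = []

    # Rule 1: required columns must be present (stop early if they are not)
    required = ["order_id", "dish", "quantity", "price"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")
        return {"is_valid": False, "errors": errors}

    # Rule 2: order IDs must be unique
    duplicates = int(df["order_id"].duplicated().sum())
    if duplicates:
        errors.append(f"{duplicates} duplicate order_id value(s)")

    # Rule 3: quantities and prices must be positive numbers
    for col in ["quantity", "price"]:
        values = pd.to_numeric(df[col], errors="coerce")
        bad = int((values.isna() | (values <= 0)).sum())
        if bad:
            errors.append(f"{bad} row(s) with invalid {col}")

    return {"is_valid": not errors, "errors": errors}


# Invented sample rows: one duplicate ID and one zero quantity
orders = pd.DataFrame({
    "order_id": ["R-001", "R-002", "R-002"],
    "dish": ["Ramen", "Gyoza", "Mochi"],
    "quantity": [1, 0, 2],
    "price": [12.5, 6.0, 4.0],
})
result = validate_restaurant_orders(orders)
print(result)
```

Keeping each rule as a short, separate check makes it easy to add or remove rules as you learn more about your data.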

---

## 🧹 Cleanup

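Let's clean up the sample files we created. Note that this cell removes only the sample data directory; the combined output that the ingestion system wrote to `processed_data` is left in place.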

In [None]:
# Optional: Clean up sample files
import shutil

cleanup = input("Do you want to clean up the sample files? (y/n): ").lower().strip()

if cleanup == 'y':
    try:
        shutil.rmtree(data_dir)
        print(f"🧹 Cleaned up sample data directory: {data_dir}")
    except Exception as e:
        print(f"⚠️ Could not clean up: {e}")
else:
    print(f"📁 Sample files kept in: {data_dir}")
    print(f"   You can explore them or use them for practice!")

---

**Great job completing this tutorial! 🎉**

You now have solid foundations in file-based data ingestion. In the next tutorial, we'll expand your skills to include API-based data collection.

**Happy Learning! 🚀**