# 📁 Reading Files - Mastering File-Based Data Ingestion

Welcome to the second tutorial in our **Data Ingestion Pipeline** series! In this notebook, you'll read and process common file formats, the foundation of most data ingestion pipelines.

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- ✅ Read CSV, JSON, and Excel files with confidence
- ✅ Handle different encodings and file formats
- ✅ Process large files efficiently
- ✅ Implement robust error handling
- ✅ Automate file monitoring and processing
- ✅ Apply best practices for file-based ingestion

---

## 🛠️ Setup and Imports

Let's start by importing the libraries we'll need and setting up our environment:

In [None]:
# Essential imports for file processing
import pandas as pd
import numpy as np
import json
import os
import sys
from pathlib import Path
import glob
import time
from datetime import datetime, timedelta
import warnings

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# File monitoring (we'll simulate this)
import hashlib
import shutil

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📦 All libraries imported successfully!")
print(f"🐍 Python version: {sys.version.split()[0]}")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

## 📁 Setting Up Our File Structure

Let's create a proper directory structure for our file processing examples:

In [None]:
# Create directory structure for our examples
base_dir = Path("tutorial_data")
directories = [
    base_dir / "input" / "csv",
    base_dir / "input" / "json", 
    base_dir / "input" / "excel",
    base_dir / "processed",
    base_dir / "output",
    base_dir / "archive"
]

for directory in directories:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"📁 Created directory: {directory}")

print("\n✅ Directory structure created successfully!")

## 📊 Creating Sample Data Files

Before we can read files, let's create some realistic sample data files to work with:

In [None]:
# Create sample CSV data
def create_sample_csv_files():
    """Create sample CSV files with different characteristics"""
    
    # Sample 1: Clean, well-formatted data
    clean_data = {
        'order_id': ['ORD-2024-001', 'ORD-2024-002', 'ORD-2024-003', 'ORD-2024-004'],
        'customer_name': ['John Doe', 'Jane Smith', 'Bob Wilson', 'Alice Johnson'],
        'product': ['iPhone 15', 'MacBook Pro', 'AirPods Pro', 'iPad Air'],
        'quantity': [1, 1, 2, 1],
        'price': [999.99, 1999.99, 249.99, 599.99],
        'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18'],
        'store_location': ['New York', 'Los Angeles', 'Chicago', 'Houston']
    }
    
    df_clean = pd.DataFrame(clean_data)
    df_clean.to_csv(base_dir / "input" / "csv" / "orders_clean.csv", index=False)
    
    # Sample 2: Data with issues (missing values, inconsistent formats)
    messy_data = {
        'order_id': ['ORD-2024-005', '', 'ORD-2024-007', 'ORD-2024-008'],
        'customer_name': ['mary johnson', 'DAVID BROWN', '', 'Sarah Connor'],
        'product': ['samsung galaxy s24', 'Dell XPS 13', 'Sony WH-1000XM4', 'Nintendo Switch'],
        'quantity': [1, 0, 2, 1],
        'price': [899.99, 1299.99, 349.99, 299.99],
        'order_date': ['01/19/2024', '2024-01-20', 'Jan 21, 2024', '2024-01-22'],
        'store_location': ['miami', 'SEATTLE', 'Phoenix', 'Boston']
    }
    
    df_messy = pd.DataFrame(messy_data)
    df_messy.to_csv(base_dir / "input" / "csv" / "orders_messy.csv", index=False)
    
    # Sample 3: Large dataset (for performance testing)
    np.random.seed(42)
    large_data = {
        'order_id': [f'ORD-2024-{i:06d}' for i in range(1000, 2000)],
        'customer_name': [f'Customer {i}' for i in range(1000)],
        'product': np.random.choice(['iPhone 15', 'MacBook Pro', 'AirPods Pro', 'iPad Air'], 1000),
        'quantity': np.random.randint(1, 5, 1000),
        'price': np.random.uniform(99.99, 1999.99, 1000).round(2),
        'order_date': pd.date_range('2024-01-01', periods=1000, freq='h').strftime('%Y-%m-%d'),  # 'h' = hourly; uppercase 'H' is deprecated in recent pandas
        'store_location': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston'], 1000)
    }
    
    df_large = pd.DataFrame(large_data)
    df_large.to_csv(base_dir / "input" / "csv" / "orders_large.csv", index=False)
    
    return len(df_clean), len(df_messy), len(df_large)

# Create the CSV files
clean_count, messy_count, large_count = create_sample_csv_files()
print(f"📊 Created CSV files:")
print(f"  - orders_clean.csv: {clean_count} records")
print(f"  - orders_messy.csv: {messy_count} records")
print(f"  - orders_large.csv: {large_count} records")

In [None]:
# Create sample JSON files
def create_sample_json_files():
    """Create sample JSON files with different structures"""
    
    # Sample 1: Simple JSON structure
    simple_json = {
        "metadata": {
            "source": "mobile_app",
            "version": "2.1.0",
            "timestamp": "2024-01-15T12:00:00Z"
        },
        "orders": [
            {
                "order_id": "APP-2024-001",
                "customer": {
                    "name": "Emma Watson",
                    "email": "emma@example.com",
                    "tier": "premium"
                },
                "items": [
                    {
                        "product": "iPhone 15 Pro",
                        "quantity": 1,
                        "price": 1199.99
                    }
                ],
                "total": 1199.99,
                "order_date": "2024-01-15T11:30:00Z"
            },
            {
                "order_id": "APP-2024-002",
                "customer": {
                    "name": "Tom Hardy",
                    "email": "tom@example.com",
                    "tier": "standard"
                },
                "items": [
                    {
                        "product": "AirPods Pro",
                        "quantity": 2,
                        "price": 249.99
                    }
                ],
                "total": 499.98,
                "order_date": "2024-01-15T14:15:00Z"
            }
        ]
    }
    
    with open(base_dir / "input" / "json" / "mobile_orders.json", 'w') as f:
        json.dump(simple_json, f, indent=2)
    
    # Sample 2: Flat JSON structure (array of objects)
    flat_json = [
        {
            "order_id": "WEB-2024-001",
            "customer_name": "Chris Evans",
            "customer_email": "chris@example.com",
            "product": "MacBook Air",
            "quantity": 1,
            "price": 1299.99,
            "order_date": "2024-01-16T09:00:00Z",
            "source": "website"
        },
        {
            "order_id": "WEB-2024-002",
            "customer_name": "Scarlett Johansson",
            "customer_email": "scarlett@example.com",
            "product": "iPad Pro",
            "quantity": 1,
            "price": 1099.99,
            "order_date": "2024-01-16T10:30:00Z",
            "source": "website"
        }
    ]
    
    with open(base_dir / "input" / "json" / "website_orders.json", 'w') as f:
        json.dump(flat_json, f, indent=2)
    
    return len(simple_json['orders']), len(flat_json)

# Create the JSON files
mobile_count, web_count = create_sample_json_files()
print(f"📄 Created JSON files:")
print(f"  - mobile_orders.json: {mobile_count} orders")
print(f"  - website_orders.json: {web_count} orders")

## 📊 Reading CSV Files

CSV (Comma-Separated Values) files are among the most common formats for data exchange. Let's read them three ways: with pandas defaults, with explicit data types and options, and in chunks for large files:

In [None]:
# Basic CSV reading
def read_csv_basic():
    """Demonstrate basic CSV reading"""
    
    print("📊 Reading CSV Files - Basic Approach")
    print("=" * 50)
    
    # Read the clean CSV file
    csv_file = base_dir / "input" / "csv" / "orders_clean.csv"
    
    try:
        # Basic read
        df = pd.read_csv(csv_file)
        
        print(f"✅ Successfully read {len(df)} records from {csv_file.name}")
        print(f"📋 Columns: {list(df.columns)}")
        print(f"📏 Shape: {df.shape}")
        print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
        
        print("\n📊 First 3 records:")
        display(df.head(3))
        
        print("\n📈 Data types:")
        print(df.dtypes)
        
        return df
        
    except Exception as e:
        print(f"❌ Error reading CSV file: {e}")
        return None

# Read the basic CSV
df_clean = read_csv_basic()

In [None]:
# Advanced CSV reading with options
def read_csv_advanced():
    """Demonstrate advanced CSV reading with various options"""
    
    print("📊 Reading CSV Files - Advanced Techniques")
    print("=" * 50)
    
    csv_file = base_dir / "input" / "csv" / "orders_messy.csv"
    
    try:
        # Read with specific data types and options
        df = pd.read_csv(
            csv_file,
            dtype={
                'order_id': 'string',
                'customer_name': 'string', 
                'product': 'string',
                'quantity': 'Int64',  # Nullable integer
                'price': 'float64',
                'store_location': 'string'
            },
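            parse_dates=['order_date'],  # This file mixes date formats, so the column may stay as text depending on the pandas version
            na_values=['', 'NULL', 'null', 'N/A'],  # Already among the pandas defaults; listed here for explicitness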
            skipinitialspace=True,  # Skip spaces after delimiter
            encoding='utf-8'  # Specify encoding
        )
        
        print(f"✅ Successfully read {len(df)} records with advanced options")
        
        print("\n📊 Data with issues:")
        display(df)
        
        print("\n🔍 Data quality analysis:")
        print(f"Missing values per column:")
        missing_counts = df.isnull().sum()
        for col, count in missing_counts.items():
            if count > 0:
                print(f"  {col}: {count} missing ({count/len(df)*100:.1f}%)")
        
        print(f"\n📈 Improved data types:")
        print(df.dtypes)
        
        return df
        
    except Exception as e:
        print(f"❌ Error reading CSV file: {e}")
        return None

# Read the messy CSV with advanced options
df_messy = read_csv_advanced()
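
The `order_date` column in `orders_messy.csv` mixes three date formats, which a single `parse_dates` pass may not resolve. One way to normalize such a column is to try a short list of known formats per value. The sketch below uses only the standard library; the `KNOWN_DATE_FORMATS` list and the `parse_order_date` helper are illustrative names, not part of the tutorial's code, and the list of formats is an assumption based on the sample file.

```python
from datetime import datetime

# Formats seen in orders_messy.csv; extend this list for other sources (assumption).
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")


def parse_order_date(value):
    """Return a datetime for the first matching format, or None if none match."""
    if value is None:
        return None
    text = str(value).strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # Try the next known format
    return None


raw_dates = ["01/19/2024", "2024-01-20", "Jan 21, 2024", "2024-01-22", "not a date"]
parsed_dates = [parse_order_date(d) for d in raw_dates]
iso_dates = [d.date().isoformat() if d else None for d in parsed_dates]
print(iso_dates)
```

Values that match none of the formats come back as `None`, so they can be counted and reported instead of silently passing through. If the column is read as text, the helper could be applied with something like `df['order_date'].map(parse_order_date)`.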

In [None]:
# Reading large CSV files efficiently
def read_csv_large_file():
    """Demonstrate efficient reading of large CSV files"""
    
    print("📊 Reading Large CSV Files - Performance Optimization")
    print("=" * 60)
    
    csv_file = base_dir / "input" / "csv" / "orders_large.csv"
    
    # Method 1: Read in chunks
    print("🔄 Method 1: Reading in chunks")
    start_time = time.time()
    
    chunk_size = 100
    chunks = []
    
    try:
        for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
            # Process each chunk (example: add chunk number)
            chunk['chunk_number'] = i + 1
            chunks.append(chunk)
            
            if i < 3:  # Show first few chunks
                print(f"  Processed chunk {i+1}: {len(chunk)} records")
        
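        # Combine all chunks. Note that this rebuilds the full table in memory;
        # for files that truly do not fit, aggregate or write out each chunk instead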
        df_chunked = pd.concat(chunks, ignore_index=True)
        chunk_time = time.time() - start_time
        
        print(f"  ✅ Chunked reading: {len(df_chunked)} records in {chunk_time:.3f}s")
        
    except Exception as e:
        print(f"  ❌ Error with chunked reading: {e}")
        df_chunked = None
    
    # Method 2: Optimized data types
    print("\n⚡ Method 2: Optimized data types")
    start_time = time.time()
    
    try:
        df_optimized = pd.read_csv(
            csv_file,
            dtype={
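                'order_id': 'string',  # Unique per row, so category would add overhead instead of saving memory
                'customer_name': 'string',
                'product': 'category',  # Few distinct values repeated many times
                'quantity': 'int8',  # Smaller integer type (range -128 to 127)
                'price': 'float32',  # Smaller float type (about 7 significant digits)
                'store_location': 'category'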
            },
            parse_dates=['order_date']
        )
        
        optimized_time = time.time() - start_time
        
        print(f"  ✅ Optimized reading: {len(df_optimized)} records in {optimized_time:.3f}s")
        
        # Memory comparison
        if df_chunked is not None:
            chunked_memory = df_chunked.memory_usage(deep=True).sum() / 1024
            optimized_memory = df_optimized.memory_usage(deep=True).sum() / 1024
            
            print(f"\n💾 Memory usage comparison:")
            print(f"  Chunked approach: {chunked_memory:.2f} KB")
            print(f"  Optimized types: {optimized_memory:.2f} KB")
            print(f"  Memory savings: {((chunked_memory - optimized_memory) / chunked_memory * 100):.1f}%")
        
        return df_optimized
        
    except Exception as e:
        print(f"  ❌ Error with optimized reading: {e}")
        return df_chunked

# Read the large CSV file
df_large = read_csv_large_file()
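
Chunked reading saves memory only when each chunk is reduced before the next one is loaded. The sketch below keeps a running total per product instead of collecting the chunks in a list. The inline CSV text and the `revenue_by_product` helper are made up for illustration; the same loop should work with a file path in place of the `StringIO` object.

```python
import io

import pandas as pd

# Small inline dataset standing in for a large file (illustrative values).
csv_text = """product,quantity,price
iPhone 15,1,1000.0
AirPods Pro,2,250.0
iPhone 15,2,1000.0
iPad Air,1,600.0
AirPods Pro,1,250.0
"""


def revenue_by_product(source, chunk_size=2):
    """Sum quantity * price per product, holding one chunk in memory at a time."""
    totals = {}
    for chunk in pd.read_csv(source, chunksize=chunk_size):
        chunk_revenue = (chunk["quantity"] * chunk["price"]).groupby(chunk["product"]).sum()
        for product, revenue in chunk_revenue.items():
            totals[product] = totals.get(product, 0.0) + float(revenue)
    return totals


totals = revenue_by_product(io.StringIO(csv_text))
print(totals)
```

Only the small `totals` dictionary outlives each iteration, so peak memory is governed by `chunk_size` and not by the size of the file.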

## 📄 Reading JSON Files

JSON (JavaScript Object Notation) files are flexible and can contain nested structures. Let's read a flat array directly with pandas, then flatten a nested document by hand:

In [None]:
# Reading JSON files
def read_json_files():
    """Demonstrate different approaches to reading JSON files"""
    
    print("📄 Reading JSON Files")
    print("=" * 30)
    
    # Method 1: Simple JSON structure (flat array)
    print("🔄 Method 1: Flat JSON structure")
    json_file1 = base_dir / "input" / "json" / "website_orders.json"
    
    try:
        # Read directly with pandas
        df_flat = pd.read_json(json_file1)
        
        print(f"  ✅ Read {len(df_flat)} records from flat JSON")
        print(f"  📋 Columns: {list(df_flat.columns)}")
        
        print("\n  📊 Sample data:")
        display(df_flat.head(2))
        
    except Exception as e:
        print(f"  ❌ Error reading flat JSON: {e}")
        df_flat = None
    
    # Method 2: Nested JSON structure
    print("\n🔄 Method 2: Nested JSON structure")
    json_file2 = base_dir / "input" / "json" / "mobile_orders.json"
    
    try:
        # Read with Python's json module first
        with open(json_file2, 'r') as f:
            json_data = json.load(f)
        
        print(f"  📋 JSON structure keys: {list(json_data.keys())}")
        print(f"  📊 Metadata: {json_data['metadata']}")
        
        # Extract orders from nested structure
        orders = json_data['orders']
        
        # Flatten nested customer data
        flattened_orders = []
        for order in orders:
            flat_order = {
                'order_id': order['order_id'],
                'customer_name': order['customer']['name'],
                'customer_email': order['customer']['email'],
                'customer_tier': order['customer']['tier'],
                'product': order['items'][0]['product'],  # Takes only the first item; any further items in the order are dropped
                'quantity': order['items'][0]['quantity'],
                'price': order['items'][0]['price'],
                'total': order['total'],
                'order_date': order['order_date'],
                'source': json_data['metadata']['source'],
                'app_version': json_data['metadata']['version']
            }
            flattened_orders.append(flat_order)
        
        df_nested = pd.DataFrame(flattened_orders)
        
        print(f"  ✅ Flattened {len(df_nested)} records from nested JSON")
        print(f"  📋 Columns: {list(df_nested.columns)}")
        
        print("\n  📊 Sample flattened data:")
        display(df_nested.head(2))
        
    except Exception as e:
        print(f"  ❌ Error reading nested JSON: {e}")
        df_nested = None
    
    return df_flat, df_nested

# Read JSON files
df_json_flat, df_json_nested = read_json_files()
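
The manual flattening above keeps only the first entry of `items`. When an order can hold several items, `pd.json_normalize` with `record_path` and `meta` produces one row per item and repeats the order-level fields on each row. The orders below are hypothetical and are not part of the sample files.

```python
import pandas as pd

# Hypothetical orders, one of them with two line items.
orders = [
    {
        "order_id": "APP-2024-003",
        "customer": {"name": "Example Customer", "tier": "premium"},
        "items": [
            {"product": "iPhone 15 Pro", "quantity": 1, "price": 1199.99},
            {"product": "AirPods Pro", "quantity": 2, "price": 249.99},
        ],
    },
    {
        "order_id": "APP-2024-004",
        "customer": {"name": "Another Customer", "tier": "standard"},
        "items": [{"product": "iPad Air", "quantity": 1, "price": 599.99}],
    },
]

# One row per entry in "items"; the meta fields are copied onto every row.
line_items = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "name"], ["customer", "tier"]],
)
print(line_items)
```

Nested `meta` paths are joined with a dot, so the customer fields appear as `customer.name` and `customer.tier`.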

In [None]:
# Advanced JSON processing with error handling
def read_json_with_error_handling():
    """Demonstrate robust JSON reading with comprehensive error handling"""
    
    print("📄 Advanced JSON Processing with Error Handling")
    print("=" * 55)
    
    def safe_json_read(file_path):
        """Safely read JSON file with multiple fallback strategies"""
        
        try:
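            # Try encodings from strictest to most permissive: 'utf-8-sig' reads
            # UTF-8 with or without a BOM, and 'latin-1' accepts any byte sequence,
            # so it must come last
            encodings = ['utf-8-sig', 'cp1252', 'latin-1']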
            
            for encoding in encodings:
                try:
                    with open(file_path, 'r', encoding=encoding) as f:
                        data = json.load(f)
                    print(f"  ✅ Successfully read with {encoding} encoding")
                    return data, None
                except UnicodeDecodeError:
                    continue
                except json.JSONDecodeError as e:
                    return None, f"JSON decode error: {e}"
            
            return None, "Could not read file with any supported encoding"
            
        except FileNotFoundError:
            return None, f"File not found: {file_path}"
        except Exception as e:
            return None, f"Unexpected error: {e}"
    
    # Test with existing files
    json_files = [
        base_dir / "input" / "json" / "mobile_orders.json",
        base_dir / "input" / "json" / "website_orders.json"
    ]
    
    all_data = []
    
    for json_file in json_files:
        print(f"\n🔄 Processing: {json_file.name}")
        
        data, error = safe_json_read(json_file)
        
        if error:
            print(f"  ❌ Error: {error}")
            continue
        
        # Process based on structure
        if isinstance(data, list):
            # Flat structure
            df = pd.DataFrame(data)
            df['source_file'] = json_file.name
            all_data.append(df)
            print(f"  📊 Processed flat structure: {len(df)} records")
            
        elif isinstance(data, dict) and 'orders' in data:
            # Nested structure
            orders = data['orders']
            df = pd.json_normalize(orders)  # Automatically flatten nested data
            df['source_file'] = json_file.name
            
            # Add metadata if available
            if 'metadata' in data:
                for key, value in data['metadata'].items():
                    df[f'metadata_{key}'] = value
            
            all_data.append(df)
            print(f"  📊 Processed nested structure: {len(df)} records")
        
        else:
            print(f"  ⚠️ Unknown JSON structure in {json_file.name}")
    
    # Combine all data
    if all_data:
        combined_df = pd.concat(all_data, ignore_index=True, sort=False)
        print(f"\n✅ Combined all JSON data: {len(combined_df)} total records")
        print(f"📋 Combined columns: {list(combined_df.columns)}")
        
        return combined_df
    else:
        print("\n❌ No data could be processed")
        return None

# Process JSON files with error handling
df_json_combined = read_json_with_error_handling()
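
The order of the encoding list matters more than it may seem. A UTF-8 file that starts with a byte order mark (BOM) decodes without error as `utf-8`, but `json` then rejects the leading BOM character, so the failure shows up as a JSON error, not an encoding error. The sketch below reproduces that case with a temporary file; `load_json_tolerant` is an illustrative helper, not part of the tutorial's code.

```python
import json
import tempfile
from pathlib import Path


def load_json_tolerant(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Decode with the first encoding that succeeds; return (data, encoding_used)."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next, more permissive encoding
        return json.loads(text), encoding  # JSON syntax errors propagate to the caller
    raise ValueError(f"Could not decode {path} with any of {encodings}")


with tempfile.TemporaryDirectory() as tmp:
    bom_file = Path(tmp) / "orders_bom.json"
    # UTF-8 BOM followed by ordinary JSON, as some Windows tools write it.
    bom_file.write_bytes(b"\xef\xbb\xbf" + json.dumps({"orders": []}).encode("utf-8"))

    try:
        json.loads(bom_file.read_text(encoding="utf-8"))
        plain_utf8_ok = True
    except json.JSONDecodeError:
        plain_utf8_ok = False

    data, used = load_json_tolerant(bom_file)

print(plain_utf8_ok, data, used)
```

Because `utf-8-sig` also reads files without a BOM, putting it first covers both cases with a single attempt.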

## 🔄 File Processing Automation

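In production pipelines, files usually arrive on their own schedule, so the pipeline has to discover and process new files automatically. The class below scans the input directory on demand; it does not watch it continuously: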

In [None]:
# File monitoring and processing automation
class FileProcessor:
    """Automated file processor with monitoring capabilities"""
    
    def __init__(self, input_dir, processed_dir, output_dir):
        self.input_dir = Path(input_dir)
        self.processed_dir = Path(processed_dir)
        self.output_dir = Path(output_dir)
        self.processed_files = set()
        
        # Ensure directories exist
        for directory in [self.processed_dir, self.output_dir]:
            directory.mkdir(parents=True, exist_ok=True)
    
    def get_file_hash(self, file_path):
        """Calculate MD5 hash of file to detect changes"""
        hash_md5 = hashlib.md5()
        try:
            with open(file_path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()
        except Exception as e:
            print(f"  ⚠️ Error calculating hash for {file_path}: {e}")
            return None
    
    def discover_files(self):
        """Discover new files to process"""
        new_files = []
        
        # Look for CSV and JSON files
        patterns = ['**/*.csv', '**/*.json']
        
        for pattern in patterns:
            for file_path in self.input_dir.glob(pattern):
                if file_path.is_file():
                    file_hash = self.get_file_hash(file_path)
                    file_key = f"{file_path.name}_{file_hash}"
                    
                    if file_key not in self.processed_files:
                        new_files.append((file_path, file_hash))
        
        return new_files
    
    def process_csv_file(self, file_path):
        """Process a CSV file"""
        try:
            df = pd.read_csv(file_path)
            
            # Add processing metadata
            df['source_file'] = file_path.name
            df['processed_at'] = datetime.now().isoformat()
            df['file_size_kb'] = file_path.stat().st_size / 1024
            
            return df, None
            
        except Exception as e:
            return None, str(e)
    
    def process_json_file(self, file_path):
        """Process a JSON file"""
        try:
            with open(file_path, 'r') as f:
                data = json.load(f)
            
            # Handle different JSON structures
            if isinstance(data, list):
                df = pd.DataFrame(data)
            elif isinstance(data, dict) and 'orders' in data:
                df = pd.json_normalize(data['orders'])
                # Add metadata
                if 'metadata' in data:
                    for key, value in data['metadata'].items():
                        df[f'metadata_{key}'] = value
            else:
                df = pd.json_normalize(data)
            
            # Add processing metadata
            df['source_file'] = file_path.name
            df['processed_at'] = datetime.now().isoformat()
            df['file_size_kb'] = file_path.stat().st_size / 1024
            
            return df, None
            
        except Exception as e:
            return None, str(e)
    
    def process_file(self, file_path, file_hash):
        """Process a single file based on its extension"""
        print(f"  🔄 Processing: {file_path.name}")
        
        start_time = time.time()
        
        if file_path.suffix.lower() == '.csv':
            df, error = self.process_csv_file(file_path)
        elif file_path.suffix.lower() == '.json':
            df, error = self.process_json_file(file_path)
        else:
            return False, f"Unsupported file type: {file_path.suffix}"
        
        processing_time = time.time() - start_time
        
        if error:
            print(f"    ❌ Error: {error}")
            return False, error
        
        # Save processed data
        output_file = self.output_dir / f"processed_{file_path.stem}.csv"
        df.to_csv(output_file, index=False)
        
        # Move original file to processed directory
        processed_file = self.processed_dir / file_path.name
        shutil.move(str(file_path), str(processed_file))
        
        # Mark as processed
        file_key = f"{file_path.name}_{file_hash}"
        self.processed_files.add(file_key)
        
        print(f"    ✅ Success: {len(df)} records in {processing_time:.3f}s")
        print(f"    📁 Output: {output_file.name}")
        print(f"    📦 Moved to processed: {processed_file.name}")
        
        return True, df
    
    def run_processing_cycle(self):
        """Run one cycle of file discovery and processing"""
        print("🔄 Starting file processing cycle...")
        
        new_files = self.discover_files()
        
        if not new_files:
            print("  📭 No new files found")
            return []
        
        print(f"  📁 Found {len(new_files)} new files")
        
        results = []
        
        for file_path, file_hash in new_files:
            success, result = self.process_file(file_path, file_hash)
            results.append({
                'file': file_path.name,
                'success': success,
                'result': result
            })
        
        return results

# Create and test the file processor
processor = FileProcessor(
    input_dir=base_dir / "input",
    processed_dir=base_dir / "processed",
    output_dir=base_dir / "output"
)

print("🤖 File Processor Initialized")
print(f"📁 Input directory: {processor.input_dir}")
print(f"📦 Processed directory: {processor.processed_dir}")
print(f"📤 Output directory: {processor.output_dir}")

In [None]:
# Run the file processing cycle
print("🚀 Running File Processing Cycle")
print("=" * 40)

results = processor.run_processing_cycle()

# Summary of results
if results:
    print(f"\n📊 Processing Summary:")
    successful = sum(1 for r in results if r['success'])
    failed = len(results) - successful
    
    print(f"  ✅ Successful: {successful}")
    print(f"  ❌ Failed: {failed}")
    
    if failed > 0:
        print(f"\n❌ Failed files:")
        for result in results:
            if not result['success']:
                print(f"  - {result['file']}: {result['result']}")
    
    # Show output files
    output_files = list(processor.output_dir.glob('*.csv'))
    print(f"\n📤 Generated output files:")
    for output_file in output_files:
        file_size = output_file.stat().st_size / 1024
        print(f"  - {output_file.name} ({file_size:.2f} KB)")

else:
    print("\n📭 No files were processed in this cycle")
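
Change detection by content hash can be tried in isolation. The sketch below is a standalone variant of the discovery step, under two assumptions that differ from `FileProcessor`: it uses SHA-256 instead of MD5, and it keys on the path relative to the input directory, which keeps same-named files in different subfolders apart. `file_digest` and `find_changed_files` are illustrative names.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, block_size=65536):
    """Hash a file in fixed-size blocks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_changed_files(input_dir, seen):
    """Return CSV/JSON files whose (relative path, digest) is new, and record them in `seen`."""
    input_dir = Path(input_dir)
    changed = []
    for path in sorted(input_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".csv", ".json"}:
            key = (path.relative_to(input_dir).as_posix(), file_digest(path))
            if key not in seen:
                seen.add(key)
                changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp)
    (inbox / "a").mkdir()
    (inbox / "b").mkdir()
    (inbox / "a" / "orders.csv").write_text("order_id\nORD-1\n")
    (inbox / "b" / "orders.csv").write_text("order_id\nORD-1\n")  # Same name and content, different folder

    seen = set()
    first_scan = [p.relative_to(inbox).as_posix() for p in find_changed_files(inbox, seen)]
    second_scan = find_changed_files(inbox, seen)  # Nothing changed since the first scan
    (inbox / "a" / "orders.csv").write_text("order_id\nORD-1\nORD-2\n")
    third_scan = [p.relative_to(inbox).as_posix() for p in find_changed_files(inbox, seen)]

print(first_scan, second_scan, third_scan)
```

The `seen` set lives only in memory here, as `processed_files` does in `FileProcessor`; a long-running pipeline would typically persist it so that a restart does not reprocess everything.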

## 🔍 File Validation and Quality Checks

Before processing files, it's important to validate their structure and quality:

In [None]:
# File validation system
class FileValidator:
    """Comprehensive file validation system"""
    
    def __init__(self):
        # Define expected schemas
        self.schemas = {
            'orders': {
                'required_columns': ['order_id', 'customer_name', 'product', 'quantity', 'price'],
                'optional_columns': ['order_date', 'store_location', 'customer_email'],
                'data_types': {
                    'quantity': 'numeric',
                    'price': 'numeric'
                }
            }
        }
    
    def validate_file_structure(self, file_path):
        """Validate basic file structure and accessibility"""
        issues = []
        
        # Check if file exists
        if not file_path.exists():
            issues.append(f"File does not exist: {file_path}")
            return issues
        
        # Check file size
        file_size = file_path.stat().st_size
        if file_size == 0:
            issues.append("File is empty")
        elif file_size > 100 * 1024 * 1024:  # 100MB
            issues.append(f"File is very large: {file_size / 1024 / 1024:.2f} MB")
        
        # Check file extension
        if file_path.suffix.lower() not in ['.csv', '.json']:
            issues.append(f"Unsupported file type: {file_path.suffix}")
        
        # Check file permissions
        if not os.access(file_path, os.R_OK):
            issues.append("File is not readable")
        
        return issues
    
    def validate_csv_content(self, file_path, schema_name='orders'):
        """Validate CSV file content against schema"""
        issues = []
        
        try:
            # Try to read the file
            df = pd.read_csv(file_path, nrows=100)  # Read first 100 rows for validation
            
            schema = self.schemas.get(schema_name, {})
            
            # Check required columns
            required_cols = schema.get('required_columns', [])
            missing_cols = [col for col in required_cols if col not in df.columns]
            if missing_cols:
                issues.append(f"Missing required columns: {missing_cols}")
            
            # Check data types
            data_types = schema.get('data_types', {})
            for col, expected_type in data_types.items():
                if col in df.columns:
                    if expected_type == 'numeric':
                        try:
                            pd.to_numeric(df[col], errors='raise')
                        except (ValueError, TypeError):
                            issues.append(f"Column '{col}' contains non-numeric values")
            
            # Check for completely empty columns
            empty_cols = df.columns[df.isnull().all()].tolist()
            if empty_cols:
                issues.append(f"Completely empty columns: {empty_cols}")
            
            # Check data quality
            total_rows = len(df)
            if total_rows == 0:
                issues.append("No data rows found")
            else:
                # Check missing value percentage
                missing_pct = (df.isnull().sum() / total_rows * 100)
                high_missing = missing_pct[missing_pct > 50]
                if not high_missing.empty:
                    issues.append(f"High missing values (>50%): {high_missing.to_dict()}")
            
        except pd.errors.EmptyDataError:
            issues.append("CSV file is empty or has no data")
        except pd.errors.ParserError as e:
            issues.append(f"CSV parsing error: {e}")
        except Exception as e:
            issues.append(f"Unexpected error reading CSV: {e}")
        
        return issues
    
    def validate_json_content(self, file_path):
        """Validate JSON file content"""
        issues = []
        
        try:
            with open(file_path, 'r') as f:
                data = json.load(f)
            
            # Check if it's a valid structure for our use case
            if isinstance(data, list):
                if len(data) == 0:
                    issues.append("JSON array is empty")
                else:
                    # Check if all items have consistent structure
                    first_keys = set(data[0].keys()) if data else set()
                    for i, item in enumerate(data[1:], 1):
                        if set(item.keys()) != first_keys:
                            issues.append(f"Inconsistent structure at item {i}")
                            break
            
            elif isinstance(data, dict):
                if 'orders' in data:
                    orders = data['orders']
                    if not isinstance(orders, list):
                        issues.append("'orders' field should be an array")
                    elif len(orders) == 0:
                        issues.append("No orders found in JSON")
                else:
                    issues.append("JSON structure not recognized (expected 'orders' field)")
            
            else:
                issues.append("JSON should be an object or array")
        
        except json.JSONDecodeError as e:
            issues.append(f"Invalid JSON format: {e}")
        except Exception as e:
            issues.append(f"Unexpected error reading JSON: {e}")
        
        return issues
    
    def validate_file(self, file_path):
        """Comprehensive file validation"""
        file_path = Path(file_path)
        
        print(f"🔍 Validating: {file_path.name}")
        
        all_issues = []
        
        # Structure validation
        structure_issues = self.validate_file_structure(file_path)
        all_issues.extend(structure_issues)
        
        # Content validation (only if structure is OK)
        if not structure_issues:
            if file_path.suffix.lower() == '.csv':
                content_issues = self.validate_csv_content(file_path)
            elif file_path.suffix.lower() == '.json':
                content_issues = self.validate_json_content(file_path)
            else:
                content_issues = ["Unsupported file type for content validation"]
            
            all_issues.extend(content_issues)
        
        # Report results
        if all_issues:
            print(f"  ❌ Validation failed ({len(all_issues)} issues):")
            for issue in all_issues:
                print(f"    - {issue}")
            return False, all_issues
        else:
            print("  ✅ Validation passed")
            return True, []

# Test file validation
validator = FileValidator()

print("🔍 File Validation Tests")
print("=" * 30)

# Test files in the processed directory
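# Keep regular files only; subdirectories would fail validation for the wrong reason
test_files = [p for p in (base_dir / "processed").glob('*') if p.is_file()]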

if test_files:
    for test_file in test_files[:3]:  # Test first 3 files
        is_valid, issues = validator.validate_file(test_file)
        print()  # Empty line for readability
else:
    print("📭 No files found in processed directory to validate")
    
    # Create a test file with issues for demonstration
    test_csv = base_dir / "input" / "csv" / "test_invalid.csv"
    
    # Create CSV with issues
    invalid_data = pd.DataFrame({
        'order_id': ['ORD-001', '', 'ORD-003'],  # Missing value
        'customer_name': ['John', 'Jane', 'Bob'],
        'product': ['iPhone', 'MacBook', 'iPad'],
        'quantity': ['one', '2', 'three'],  # Non-numeric values
        'price': [999.99, 'expensive', 599.99]  # Mixed types
    })
    
    invalid_data.to_csv(test_csv, index=False)
    print(f"📝 Created test file with issues: {test_csv.name}")
    
    # Validate the problematic file
    is_valid, issues = validator.validate_file(test_csv)
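
The `validate_csv_content` method called above is defined earlier in the notebook. As a standalone illustration of the kind of check such a method can perform, here is a minimal sketch that flags values that cannot be parsed as numbers. The helper name `find_non_numeric` and the choice of columns are assumptions made for this example; they are not part of `FileValidator`.

```python
import io

import pandas as pd

def find_non_numeric(df, numeric_columns):
    """Return {column: [row indices]} for values that cannot be parsed as numbers.

    Hypothetical helper for illustration; missing values are not reported here.
    """
    problems = {}
    for column in numeric_columns:
        # errors='coerce' turns unparseable values into NaN instead of raising
        converted = pd.to_numeric(df[column], errors='coerce')
        # A value is bad if it was present originally but became NaN
        bad_rows = df.index[converted.isna() & df[column].notna()].tolist()
        if bad_rows:
            problems[column] = bad_rows
    return problems

# Same kind of data as the test_invalid.csv file created above
sample_csv = io.StringIO(
    "order_id,quantity,price\n"
    "ORD-001,one,999.99\n"
    ",2,expensive\n"
    "ORD-003,three,599.99\n"
)
sample_df = pd.read_csv(sample_csv)
non_numeric = find_non_numeric(sample_df, ['quantity', 'price'])
print(non_numeric)
```

Reporting row indices rather than a bare pass/fail makes it easier to trace a rejected file back to the offending records.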

## 📊 Performance Optimization Techniques

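When files grow large or arrive in high volume, read time and memory use start to matter. The cell below benchmarks a plain `pd.read_csv` call on a 10,000-row file against three alternatives: explicit data types, chunked reading, and column selection.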

In [None]:
# Performance optimization techniques
def demonstrate_performance_techniques():
    """Show various performance optimization techniques for file processing"""
    
    print("⚡ Performance Optimization Techniques")
    print("=" * 45)
    
    # Create a larger test file for performance testing
    large_file = base_dir / "input" / "csv" / "performance_test.csv"
    
    if not large_file.exists():
        print("📝 Creating large test file...")
        np.random.seed(42)
        
        large_data = {
            'order_id': [f'ORD-{i:08d}' for i in range(10000)],
            'customer_name': [f'Customer {i}' for i in range(10000)],
            'product': np.random.choice(['iPhone', 'MacBook', 'iPad', 'AirPods'], 10000),
            'category': np.random.choice(['Electronics', 'Accessories'], 10000),
            'quantity': np.random.randint(1, 10, 10000),
            'price': np.random.uniform(99.99, 1999.99, 10000).round(2),
            'order_date': pd.date_range('2024-01-01', periods=10000, freq='min').strftime('%Y-%m-%d %H:%M:%S'),
            'store_location': np.random.choice(['NY', 'LA', 'Chicago', 'Houston'], 10000)
        }
        
        pd.DataFrame(large_data).to_csv(large_file, index=False)
        print(f"  ✅ Created {large_file.name} with 10,000 records")
    
    # Technique 1: Basic reading (baseline)
    print("\n🔄 Technique 1: Basic Reading (Baseline)")
    start_time = time.time()
    df_basic = pd.read_csv(large_file)
    basic_time = time.time() - start_time
    basic_memory = df_basic.memory_usage(deep=True).sum() / 1024 / 1024  # MB
    
    print(f"  ⏱️ Time: {basic_time:.3f}s")
    print(f"  💾 Memory: {basic_memory:.2f} MB")
    print(f"  📊 Records: {len(df_basic):,}")
    
    # Technique 2: Optimized data types
    print("\n🔄 Technique 2: Optimized Data Types")
    start_time = time.time()
    df_optimized = pd.read_csv(
        large_file,
        dtype={
            'order_id': 'string',
            'customer_name': 'string',
            'product': 'category',  # Repeated values
            'category': 'category',
            'quantity': 'int8',  # Small integers
            'price': 'float32',  # Reduced precision; prefer float64 when exact totals matter
            'store_location': 'category'
        },
        parse_dates=['order_date']
    )
    optimized_time = time.time() - start_time
    optimized_memory = df_optimized.memory_usage(deep=True).sum() / 1024 / 1024
    
    print(f"  ⏱️ Time: {optimized_time:.3f}s ({optimized_time/basic_time:.2f}x)")
    print(f"  💾 Memory: {optimized_memory:.2f} MB ({optimized_memory/basic_memory:.2f}x)")
    print(f"  📊 Records: {len(df_optimized):,}")
    
    # Technique 3: Chunked processing
    print("\n🔄 Technique 3: Chunked Processing")
    start_time = time.time()
    
    chunk_results = []
    chunk_size = 1000
    
    for chunk_num, chunk in enumerate(pd.read_csv(large_file, chunksize=chunk_size)):
        # Process each chunk (example: calculate summary statistics)
        chunk_summary = {
            'chunk': chunk_num,
            'records': len(chunk),
            'total_revenue': chunk['price'].sum(),
            'avg_quantity': chunk['quantity'].mean()
        }
        chunk_results.append(chunk_summary)
    
    chunked_time = time.time() - start_time
    
    print(f"  ⏱️ Time: {chunked_time:.3f}s ({chunked_time/basic_time:.2f}x)")
    print(f"  📊 Processed {len(chunk_results)} chunks")
    print(f"  💰 Total revenue: ${sum(r['total_revenue'] for r in chunk_results):,.2f}")
    
    # Technique 4: Column selection
    print("\n🔄 Technique 4: Column Selection")
    start_time = time.time()
    
    # Only read columns we need
    df_selected = pd.read_csv(
        large_file,
        usecols=['order_id', 'product', 'quantity', 'price'],
        dtype={
            'order_id': 'string',
            'product': 'category',
            'quantity': 'int8',
            'price': 'float32'
        }
    )
    
    selected_time = time.time() - start_time
    selected_memory = df_selected.memory_usage(deep=True).sum() / 1024 / 1024
    
    print(f"  ⏱️ Time: {selected_time:.3f}s ({selected_time/basic_time:.2f}x)")
    print(f"  💾 Memory: {selected_memory:.2f} MB ({selected_memory/basic_memory:.2f}x)")
    print(f"  📊 Columns: {len(df_selected.columns)} vs {len(df_basic.columns)}")
    
    # Performance summary
    print("\n📊 Performance Summary:")
    techniques = [
        ('Basic Reading', basic_time, basic_memory),
        ('Optimized Types', optimized_time, optimized_memory),
        ('Chunked Processing', chunked_time, 0),  # Memory not measured for chunks
        ('Column Selection', selected_time, selected_memory)
    ]
    
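    for name, time_taken, memory_used in techniques:
        is_baseline = name == 'Basic Reading'
        
        # Compare each technique with the basic read; slower runs are labeled as slower
        if is_baseline:
            time_improvement = "baseline"
        elif time_taken < basic_time:
            time_improvement = f"{basic_time/time_taken:.1f}x faster"
        else:
            time_improvement = f"{time_taken/basic_time:.1f}x slower"
        
        if is_baseline:
            memory_improvement = "baseline"
        elif memory_used == 0:
            memory_improvement = "memory not measured"
        elif memory_used < basic_memory:
            memory_improvement = f"{basic_memory/memory_used:.1f}x less memory"
        else:
            memory_improvement = "no memory savings"
        
        print(f"  {name:20} | {time_taken:.3f}s ({time_improvement:12}) | {memory_improvement}")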

# Run performance demonstration
demonstrate_performance_techniques()
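
The summary above records no memory figure for chunked processing. One rough way to fill that gap is to track the largest chunk seen during iteration: when the loop keeps only small summaries, roughly one chunk is held in memory at a time. The sketch below is an illustration under stated assumptions. It writes a throwaway CSV to a temporary directory, and the helper name `peak_chunk_memory_mb` is hypothetical. It measures DataFrame memory only, not parser overhead or total process memory.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def peak_chunk_memory_mb(csv_path, chunk_size):
    """Return (largest chunk size in MB, total rows) for a chunked read.

    Hypothetical helper: reports DataFrame memory per chunk, not process memory.
    """
    peak_bytes = 0
    total_rows = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        chunk_bytes = int(chunk.memory_usage(deep=True).sum())
        peak_bytes = max(peak_bytes, chunk_bytes)
        total_rows += len(chunk)
    return peak_bytes / 1024 / 1024, total_rows

with tempfile.TemporaryDirectory() as tmp_dir:
    # Small demo file, similar in shape to performance_test.csv
    demo_file = Path(tmp_dir) / "chunk_demo.csv"
    rng = np.random.default_rng(42)
    pd.DataFrame({
        'order_id': [f'ORD-{i:08d}' for i in range(5000)],
        'quantity': rng.integers(1, 10, 5000),
        'price': rng.uniform(99.99, 1999.99, 5000).round(2),
    }).to_csv(demo_file, index=False)

    full_mb = pd.read_csv(demo_file).memory_usage(deep=True).sum() / 1024 / 1024
    peak_mb, rows_seen = peak_chunk_memory_mb(demo_file, chunk_size=1000)

print(f"Full read: {full_mb:.2f} MB | largest chunk: {peak_mb:.2f} MB | rows: {rows_seen:,}")
```

The largest-chunk figure scales with `chunk_size` rather than with file size, which is the reason chunking helps with files that do not fit in memory.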

## 🎯 Best Practices Summary

Let's summarize the key best practices for file-based data ingestion:

In [None]:
# Best practices summary
best_practices = {
    'Category': [
        'File Handling',
        'Performance',
        'Error Handling',
        'Data Quality',
        'Security',
        'Monitoring',
        'Maintenance'
    ],
    'Best Practices': [
        'Validate files before processing, Use appropriate encodings, Archive processed files',
        'Optimize data types, Process in chunks, Select only needed columns',
        'Handle encoding issues, Graceful failure recovery, Comprehensive logging',
        'Schema validation, Missing value checks, Data type verification',
        'Secure file permissions, Validate file sources, Sanitize file names',
        'Track processing metrics, Log all operations, Set up alerts',
        'Regular cleanup, Archive old files, Monitor disk space'
    ],
    'Priority': ['Critical', 'High', 'Critical', 'High', 'Medium', 'High', 'Medium']
}

df_practices = pd.DataFrame(best_practices)

print("💡 File Processing Best Practices")
print("=" * 40)
display(df_practices)

# Create a visual summary
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Priority distribution
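priority_counts = df_practices['Priority'].value_counts()
# Map colors by label so each priority keeps its color regardless of count order
priority_colors = {'Critical': 'red', 'High': 'orange', 'Medium': 'yellow'}
colors = [priority_colors[priority] for priority in priority_counts.index]
ax1.pie(priority_counts.values, labels=priority_counts.index, autopct='%1.1f%%', colors=colors)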
ax1.set_title('Best Practices by Priority')

# Category importance (mock data for visualization)
categories = df_practices['Category'].tolist()
importance_scores = [9, 8, 9, 8, 6, 8, 6]  # Mock importance scores

bars = ax2.barh(categories, importance_scores, color='skyblue')
ax2.set_xlabel('Importance Score (1-10)')
ax2.set_title('Category Importance Scores')
ax2.set_xlim(0, 10)

# Add value labels on bars
for bar, score in zip(bars, importance_scores):
    ax2.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2, 
             str(score), va='center')

plt.tight_layout()
plt.show()

print("\n🎯 Key Takeaways:")
print("  1. Always validate files before processing")
print("  2. Optimize for performance with large files")
print("  3. Implement robust error handling")
print("  4. Monitor and log all operations")
print("  5. Maintain clean and organized file systems")
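
The Security row in the table above lists "Sanitize file names" without showing what that might look like. Here is a minimal sketch, assuming that incoming file names are untrusted strings and that only letters, digits, dots, dashes, and underscores should survive. The helper name `sanitize_filename`, the allowed character set, and the `unnamed_file` fallback are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

def sanitize_filename(raw_name, max_length=100):
    """Reduce an untrusted file name to a safe base name.

    Hypothetical helper; the allowed character set is an assumption.
    """
    # Treat backslashes as separators too, then drop any directory components
    base_name = Path(str(raw_name).replace('\\', '/')).name
    # Replace anything outside letters, digits, dot, dash, and underscore
    cleaned = re.sub(r'[^A-Za-z0-9._-]', '_', base_name)
    # Strip leading dots so the result is never hidden or a relative reference
    cleaned = cleaned.lstrip('.')
    # Note: truncation may cut off the extension of very long names
    return cleaned[:max_length] or 'unnamed_file'

for candidate in ['../../etc/passwd', 'sales report (final).csv', '..']:
    print(f"{candidate!r} -> {sanitize_filename(candidate)!r}")
```

Dropping directory components first is the important step: it keeps a crafted name from writing outside the intended input directory.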

## 🧹 Cleanup and Summary

Let's clean up our test files and summarize what we've learned:

In [None]:
# Cleanup function
def cleanup_tutorial_files():
    """Clean up tutorial files and directories"""
    
    print("🧹 Cleaning up tutorial files...")
    
    try:
        # Count files before cleanup
        total_files = 0
        for directory in ["input", "processed", "output", "archive"]:
            dir_path = base_dir / directory
            if dir_path.exists():
                files = list(dir_path.rglob('*'))
                file_count = len([f for f in files if f.is_file()])
                total_files += file_count
                print(f"  📁 {directory}: {file_count} files")
        
        print(f"\n📊 Total files created during tutorial: {total_files}")
        
        # Decide whether to keep the files
        print("\n❓ Keep tutorial files for further exploration?")
        print("   (Files are in the 'tutorial_data' directory)")
        
        # There is no interactive prompt here: files are kept by default.
        # Set keep_files = False to delete the tutorial_data directory instead.
        keep_files = True
        
        if keep_files:
            print("✅ Tutorial files preserved for your exploration!")
            print(f"📁 Location: {base_dir.absolute()}")
        else:
            # Remove tutorial directory (shutil was imported in the setup cell)
            shutil.rmtree(base_dir)
            print("🗑️ Tutorial files cleaned up")
    
    except Exception as e:
        print(f"⚠️ Error during cleanup: {e}")

# Run cleanup
cleanup_tutorial_files()
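
The Maintenance row of the best practices table mentions archiving old files. As a rough sketch of that idea, the function below moves files older than a given age into an archive folder. The function name `archive_old_files`, the 7-day threshold, and the temporary directories are assumptions for this example; the demo backdates one file with `os.utime` so the behavior is visible without waiting.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def archive_old_files(source_dir, archive_dir, max_age_days):
    """Move files older than max_age_days from source_dir into archive_dir.

    Hypothetical helper; age is judged by the file's modification time.
    """
    cutoff = time.time() - max_age_days * 86400
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(archive_dir / path.name))
            moved.append(path.name)
    return moved

with tempfile.TemporaryDirectory() as tmp_dir:
    processed = Path(tmp_dir) / "processed"
    archive = Path(tmp_dir) / "archive"
    processed.mkdir()

    old_file = processed / "old.csv"
    new_file = processed / "new.csv"
    old_file.write_text("a\n1\n")
    new_file.write_text("a\n2\n")

    # Backdate one file by ten days so it crosses the 7-day threshold
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_file, (ten_days_ago, ten_days_ago))

    moved_files = archive_old_files(processed, archive, max_age_days=7)
    remaining = sorted(p.name for p in processed.iterdir())
    archived = sorted(p.name for p in archive.iterdir())

print(f"Moved: {moved_files} | remaining: {remaining} | archived: {archived}")
```

This sketch does not handle name collisions in the archive folder; a production version would likely add a timestamp or counter to the archived name.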

## 🎯 Tutorial Summary

Congratulations! You've completed the **File Reading** tutorial. Here's what you've mastered:

### ✅ **Core Skills Learned**

1. **📊 CSV File Processing**
   - Basic and advanced reading techniques
   - Handling different encodings and formats
   - Optimizing data types for performance
   - Processing large files with chunking

2. **📄 JSON File Processing**
   - Reading flat and nested JSON structures
   - Flattening complex nested data
   - Error handling for malformed JSON
   - Combining multiple JSON sources

3. **🤖 File Processing Automation**
   - Automated file discovery and monitoring
   - Batch processing workflows
   - File archiving and organization
   - Processing status tracking

4. **🔍 File Validation**
   - Structure and accessibility validation
   - Content and schema validation
   - Data quality assessment
   - Error reporting and handling

5. **⚡ Performance Optimization**
   - Memory-efficient data types
   - Chunked processing for large files
   - Column selection optimization
   - Performance benchmarking

### 🛠️ **Practical Tools Built**

- **FileProcessor**: Automated file processing system
- **FileValidator**: Comprehensive validation framework
- **Performance Optimizer**: Techniques for handling large files
- **Error Handler**: Robust error handling patterns

### 📊 **Real-World Applications**

- **Daily Data Imports**: Process daily sales reports, inventory updates
- **Log File Analysis**: Parse and analyze application logs
- **Data Migration**: Move data between systems efficiently
- **ETL Pipelines**: Extract data from various file sources

---

## 🚀 What's Next?

In the next tutorial, **"03_calling_apis.ipynb"**, you'll learn:

- 🌐 **REST API Integration**: Making HTTP requests and handling responses
- 🔐 **Authentication**: API keys, OAuth, and secure access
- 🔄 **Rate Limiting**: Handling API limits and throttling
- 📊 **Data Pagination**: Processing large datasets from APIs
- ⚠️ **Error Handling**: Dealing with network issues and API failures
- 🔄 **Real-time Data**: Polling and webhook integration

### 🎯 **Practice Exercise**

Before moving to the next tutorial, try this challenge:

1. **Create a file monitoring system** that watches a directory for new files
2. **Process different file formats** (CSV, JSON, maybe Excel)
3. **Implement data validation** with custom business rules
4. **Generate processing reports** showing success/failure rates
5. **Optimize for performance** with large files
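
As a possible starting point for step 1, here is a minimal polling sketch that detects new or modified files by comparing content hashes between scans. The function names `file_fingerprint` and `scan_for_changes`, and the use of a temporary directory, are assumptions for this example rather than part of the tutorial's `FileProcessor`.

```python
import hashlib
import tempfile
from pathlib import Path

def file_fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read in blocks so large files are not loaded into memory at once
        for block in iter(lambda: handle.read(65536), b''):
            digest.update(block)
    return digest.hexdigest()

def scan_for_changes(watch_dir, seen):
    """Return files that are new or modified since the last scan.

    `seen` maps file path -> fingerprint and is updated in place.
    """
    changed = []
    for path in sorted(Path(watch_dir).iterdir()):
        if not path.is_file():
            continue
        fingerprint = file_fingerprint(path)
        if seen.get(path) != fingerprint:
            seen[path] = fingerprint
            changed.append(path)
    return changed

with tempfile.TemporaryDirectory() as tmp_dir:
    seen_files = {}
    (Path(tmp_dir) / "orders.csv").write_text("order_id,price\nORD-001,9.99\n")
    first_scan = [p.name for p in scan_for_changes(tmp_dir, seen_files)]

    # Nothing changed, so the second scan should report nothing
    second_scan = [p.name for p in scan_for_changes(tmp_dir, seen_files)]

    (Path(tmp_dir) / "orders.json").write_text('{"orders": []}')
    third_scan = [p.name for p in scan_for_changes(tmp_dir, seen_files)]

print(first_scan, second_scan, third_scan)
```

In a real monitor, the scan would run on a timer (for example with `time.sleep` between scans), and each changed file would be passed on to validation and processing.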

---

## 📚 Additional Resources

- **📖 Pandas Documentation**: [Reading Files](https://pandas.pydata.org/docs/user_guide/io.html)
- **🐍 Python JSON Module**: [JSON Processing](https://docs.python.org/3/library/json.html)
- **⚡ Performance Tips**: [Pandas Performance](https://pandas.pydata.org/docs/user_guide/enhancingperf.html)
- **🔍 File Validation**: [Data Validation Patterns](https://github.com/great-expectations/great_expectations)

---

**Great job! You're now equipped to tackle common file-based data ingestion challenges! 🚀**