# Data Loading & Writing Practice Exercises

Welcome to your comprehensive data loading and writing practice session! These exercises will challenge your pandas skills across different data sources and formats. Each exercise focuses on real-world scenarios you'll encounter in data analytics.

## Instructions
- Complete each exercise independently
- Focus on proper data loading, cleaning, and writing techniques
- Pay attention to data types, missing values, and performance considerations
- Document your approach and any challenges you encounter

## Exercise 1: Complex CSV Analysis - World Bank Data
**Data Source**: World Bank Open Data (CSV format)
**URL**: https://datacatalog.worldbank.org/search/dataset/0037712

**Challenge**: 
Load the World Bank's GDP data CSV file, which contains:
- Multiple header rows
- Country codes and names
- Time series data from 1960-2023
- Missing values represented as ".."

**Tasks**:
1. Load the data properly handling the multi-header structure
2. Clean missing values and convert to appropriate data types
3. Reshape from wide to long format for time series analysis
4. Filter for G7 countries only
5. Export results to both Excel (.xlsx) and Parquet formats
6. Create a summary statistics file in JSON format

**Learning Focus**: Complex CSV parsing, data reshaping, multiple output formats
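
As a starting point, tasks 1 to 4 and 6 might look like the sketch below. It runs on a made-up three-country sample rather than the real download, so the number of metadata lines (`skiprows=2`), the column names, and all values are assumptions to adjust against the actual file.

```python
import io
import json

import pandas as pd

# Made-up miniature of a World Bank style export (values are NOT real GDP
# figures): two metadata lines precede the real header, ".." marks gaps.
RAW = """\
"Data Source","World Development Indicators"
"Last Updated Date","2024-01-01"
"Country Name","Country Code","2021","2022"
"Canada","CAN","100","110"
"France","FRA","200",".."
"Brazil","BRA","300","330"
"""

G7 = {"CAN", "FRA", "DEU", "ITA", "JPN", "GBR", "USA"}

# skiprows drops the metadata lines; na_values turns ".." into NaN.
wide = pd.read_csv(io.StringIO(RAW), skiprows=2, na_values=[".."])

# Wide to long: one row per (country, year) pair.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="year",
    value_name="gdp",
)
long["year"] = long["year"].astype(int)

g7 = long[long["Country Code"].isin(G7)].reset_index(drop=True)

# Per-country mean as a JSON summary; mean() skips NaN values.
means = g7.groupby("Country Code")["gdp"].mean()
summary = {code: float(value) for code, value in means.items()}
summary_json = json.dumps(summary, indent=2)
print(summary_json)
```

For task 5, `g7.to_excel(...)` needs an Excel engine such as openpyxl, and `g7.to_parquet(...)` needs pyarrow or fastparquet installed.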

## Exercise 2: RESTful API Data Collection - OpenWeatherMap
**Data Source**: OpenWeatherMap API (JSON format)
**URL**: https://openweathermap.org/api

**Challenge**: 
Work with live weather data from multiple cities using API calls:
- Nested JSON structures
- Rate limiting considerations
- Real-time data with timestamps

**Tasks**:
1. Set up API authentication (free tier available)
2. Fetch current weather data for 20 major world cities
3. Handle nested JSON structure (weather conditions, coordinates, etc.)
4. Combine all city data into a single DataFrame
5. Add calculated fields (difference between `feels_like` and actual temperature, wind speed categories)
6. Store raw JSON responses in a compressed format (.gz)
7. Export processed data to SQLite database
8. Create a backup in pickle format

**Learning Focus**: API interaction, JSON normalization, database writing, compression
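
The flattening, compression, and database steps can be tried offline before spending API calls. The sketch below uses two hand-written payloads shaped like a current-weather response; the values, the wind speed bins, and the table name `weather` are assumptions, and the real API returns more fields.

```python
import gzip
import json
import sqlite3

import pandas as pd

# Hand-written sample payloads (invented values, reduced field set).
responses = [
    {"name": "London", "coord": {"lon": -0.13, "lat": 51.51},
     "weather": [{"main": "Clouds", "description": "overcast clouds"}],
     "main": {"temp": 10.0, "feels_like": 8.0}, "wind": {"speed": 3.0}},
    {"name": "Tokyo", "coord": {"lon": 139.69, "lat": 35.69},
     "weather": [{"main": "Clear", "description": "clear sky"}],
     "main": {"temp": 20.0, "feels_like": 21.5}, "wind": {"speed": 7.5}},
]

# Keep the raw responses, gzip-compressed.
raw_blob = gzip.compress(json.dumps(responses).encode("utf-8"))

# Flatten nested dicts into columns such as main_temp and coord_lat.
df = pd.json_normalize(responses, sep="_")

# "weather" is a list of dicts; keep the first condition's label.
df["condition"] = df["weather"].map(lambda items: items[0]["main"])
df = df.drop(columns="weather")

# Calculated fields; the bin edges (in m/s) are arbitrary choices.
df["feels_like_diff"] = df["main_feels_like"] - df["main_temp"]
df["wind_category"] = pd.cut(
    df["wind_speed"],
    bins=[0, 5, 10, float("inf")],
    labels=["calm", "breezy", "windy"],
    right=False,
).astype(str)

# Write to SQLite (in memory here) and read back to check the round trip.
conn = sqlite3.connect(":memory:")
df.to_sql("weather", conn, index=False, if_exists="replace")
stored = pd.read_sql_query(
    "SELECT name, condition, feels_like_diff, wind_category "
    "FROM weather ORDER BY name",
    conn,
)
conn.close()
print(stored)
```

Swapping `":memory:"` for a file path persists the database, and `df.to_pickle(path)` covers the backup in task 8.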

## Exercise 3: Web Scraping Challenge - Wikipedia Tables
**Data Source**: Wikipedia (HTML tables)
**URL**: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

**Challenge**: 
Extract and clean economic data from Wikipedia's complex HTML tables:
- Multiple tables on the same page
- Merged cells and footnotes
- Mixed data types and currencies

**Tasks**:
1. Scrape GDP data tables from Wikipedia
2. Handle merged header cells and footnote references
3. Clean currency symbols and convert to numeric values
4. Deal with missing data and "N/A" entries
5. Combine data from multiple years if available
6. Validate data consistency across the sources the page reports (for example IMF, World Bank, and UN estimates)
7. Export cleaned data to CSV with proper encoding (UTF-8)
8. Create a metadata file documenting your cleaning process

**Learning Focus**: HTML parsing, text cleaning, data validation, encoding handling
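
Once a table has been extracted (for example with `pandas.read_html`, which needs an HTML parser such as lxml installed), the text cleaning in tasks 2 to 4 could look roughly like this. The sample rows, placeholder country names, and column labels are invented for illustration and will differ from the live page.

```python
import pandas as pd

# Invented rows imitating scraped cells: footnote markers, currency
# symbols, thousands separators, and an "N/A" entry.
scraped = pd.DataFrame({
    "Country": ["Country A[n 1]", "Country B", "Country C[a]"],
    "GDP (US$ million)": ["$25,000,000", "4,200,000[2]", "N/A"],
})


def strip_footnotes(column: pd.Series) -> pd.Series:
    """Remove bracketed footnote references such as [2] or [n 1]."""
    return column.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()


gdp_text = strip_footnotes(scraped["GDP (US$ million)"])
gdp_text = gdp_text.str.replace(r"[$,]", "", regex=True)

clean = pd.DataFrame({
    "country": strip_footnotes(scraped["Country"]),
    # errors="coerce" maps anything non-numeric (here "N/A") to NaN.
    "gdp_usd_million": pd.to_numeric(gdp_text, errors="coerce"),
})

# to_csv returns a string when no path is given; pass a path and
# encoding="utf-8" to write a file instead.
csv_text = clean.to_csv(index=False)
print(csv_text)
```

Because `errors="coerce"` silently hides unexpected junk, it is worth counting the resulting NaN values and recording that count in the metadata file from task 8.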

## Exercise 4: Multi-Sheet Excel Analysis - Financial Reports
**Data Source**: SEC EDGAR Database (Excel/XBRL format)
**URL**: https://www.sec.gov/edgar/searchedgar/companysearch.html

**Challenge**: 
Work with complex financial Excel files containing:
- Multiple worksheets with different structures
- Merged cells and complex formatting
- Financial formulas and calculated fields

**Tasks**:
1. Download a Fortune 500 company's annual reports (10-K) in Excel format
2. Load multiple sheets into separate DataFrames
3. Handle merged cells in headers and totals
4. Extract and clean financial statements (Balance Sheet, Income Statement, Cash Flow)
5. Create relationships between sheets using common identifiers
6. Calculate financial ratios across multiple periods
7. Export consolidated data to HDF5 format for efficient storage
8. Create separate CSV files for each financial statement

**Learning Focus**: Multi-sheet Excel handling, financial data processing, HDF5 format, data relationships
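
`pd.read_excel(path, sheet_name=None)` returns a dict that maps sheet names to DataFrames. The sketch below builds such a dict by hand, so it runs without a workbook, then links two statements on a shared key and derives ratios. The sheet names, column names, and figures are all hypothetical.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(path, sheet_name=None).
# All figures are invented.
sheets = {
    "Balance Sheet": pd.DataFrame({
        "fiscal_year": [2022, 2023],
        "current_assets": [200.0, 300.0],
        "current_liabilities": [100.0, 120.0],
    }),
    "Income Statement": pd.DataFrame({
        "fiscal_year": [2022, 2023],
        "revenue": [1000.0, 1200.0],
        "net_income": [100.0, 150.0],
    }),
}

# Relate the sheets through a common identifier; validate= raises if a
# fiscal year appears more than once on either side.
combined = sheets["Balance Sheet"].merge(
    sheets["Income Statement"], on="fiscal_year", validate="one_to_one"
)

# Ratios across periods.
combined["current_ratio"] = (
    combined["current_assets"] / combined["current_liabilities"]
)
combined["net_margin"] = combined["net_income"] / combined["revenue"]

# One CSV per statement (returned as strings here; pass a path to write).
csv_by_statement = {
    name: frame.to_csv(index=False) for name, frame in sheets.items()
}
print(combined[["fiscal_year", "current_ratio", "net_margin"]])
```

For task 7, `combined.to_hdf(path, key=...)` relies on the optional PyTables package.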

## Exercise 5: Database Integration - PostgreSQL Public Datasets
**Data Source**: PostgreSQL sample databases
**URL**: https://www.postgresqltutorial.com/postgresql-getting-started/postgresql-sample-database/

**Challenge**: 
Work with a relational database containing:
- Multiple related tables
- Foreign key relationships
- Large datasets requiring chunked processing

**Tasks**:
1. Set up a connection to the DVD Rental sample database (or similar)
2. Explore database schema and relationships
3. Write complex JOIN queries to combine customer, rental, and inventory data
4. Handle large result sets using chunked reading
5. Compare aggregations performed at the database level with the same aggregations done in pandas
6. Create materialized views for common queries
7. Export query results to multiple formats (CSV, JSON, Parquet)
8. Implement incremental data loading for new records

**Learning Focus**: SQL integration, database optimization, chunked processing, incremental loading
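
Chunked reading and incremental loading can be prototyped without a PostgreSQL server. This sketch substitutes an in-memory SQLite database with a heavily simplified, hypothetical two-table schema; the real DVD Rental tables have more columns. A high-water mark on `rental_id` is one simple way to pick up only new records.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rental (
    rental_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount REAL
);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO rental VALUES
    (1, 1, 2.0), (2, 1, 3.0), (3, 2, 4.0), (4, 2, 1.0), (5, 1, 5.0);
""")

QUERY = """
SELECT r.rental_id, c.name, r.amount
FROM rental AS r
JOIN customer AS c ON c.customer_id = r.customer_id
WHERE r.rental_id > ?
ORDER BY r.rental_id
"""


def load_since(conn, last_id, chunksize=2):
    """Sum amounts per customer for rentals newer than last_id, in chunks."""
    parts = []
    chunks = pd.read_sql_query(
        QUERY, conn, params=(last_id,), chunksize=chunksize
    )
    for chunk in chunks:
        if chunk.empty:
            continue
        parts.append(chunk.groupby("name")["amount"].sum())
        last_id = int(chunk["rental_id"].max())
    if not parts:
        return pd.Series(dtype="float64"), last_id
    # Combine the per-chunk partial sums into one total per customer.
    return pd.concat(parts).groupby(level=0).sum(), last_id


totals, watermark = load_since(conn, last_id=0)

# Later run: only rows past the stored watermark are fetched.
conn.execute("INSERT INTO rental VALUES (6, 2, 2.5)")
new_totals, new_watermark = load_since(conn, last_id=watermark)
conn.close()
print(totals.to_dict(), new_totals.to_dict())
```

With PostgreSQL, the connection would typically come from SQLAlchemy's `create_engine`, and the parameter placeholder style depends on the driver.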

## Exercise 6: XML Data Processing - RSS Feeds & SOAP APIs
**Data Source**: Multiple RSS feeds and XML APIs
**URL**: http://rss.cnn.com/rss/edition.rss (and others)

**Challenge**: 
Parse complex XML structures with:
- Nested elements and attributes
- Namespaces and CDATA sections
- Large XML files requiring streaming

**Tasks**:
1. Collect RSS feeds from 5 different news sources
2. Parse XML structure handling namespaces properly
3. Extract article metadata (title, date, author, categories)
4. Handle CDATA sections and HTML content within XML
5. Combine feeds into a unified news dataset
6. Deal with different XML schemas across sources
7. Implement XML schema validation
8. Export to both normalized JSON and flat CSV formats
9. Create an XML output with your processed data

**Learning Focus**: XML parsing, namespace handling, schema validation, streaming processing
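
Namespaces and CDATA can be explored with the standard library before fetching live feeds. The feed below is hand-written; real feeds vary in which elements they include, and the regex tag stripper is a deliberately naive assumption, not a full HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

import pandas as pd

# Hand-written RSS 2.0 sample using the Dublin Core namespace for authors.
FEED = """\
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example feed</title>
    <item>
      <title><![CDATA[First <b>story</b>]]></title>
      <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
      <dc:creator>A. Writer</dc:creator>
      <category>world</category>
      <category>politics</category>
    </item>
    <item>
      <title>Second story</title>
      <pubDate>Tue, 02 Jan 2024 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def parse_feed(xml_text: str) -> list:
    """Return one dict per <item>; missing elements become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iterfind("./channel/item"):
        title = item.findtext("title", default="")
        published = item.findtext("pubDate")
        rows.append({
            # CDATA arrives as plain text; strip embedded tags naively.
            "title": re.sub(r"<[^>]+>", "", title).strip(),
            "published": parsedate_to_datetime(published) if published else None,
            "author": item.findtext("dc:creator", namespaces=NS),
            "categories": [c.text for c in item.findall("category")],
        })
    return rows


rows = parse_feed(FEED)

# Flat table for CSV export: join the category list into one string.
flat = pd.DataFrame(rows)
flat["categories"] = flat["categories"].map("|".join)
print(flat)
```

Only parse feeds you trust this way; for untrusted or very large XML, a hardened or streaming parser (for example `ET.iterparse`) is a better fit.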

## Exercise 7: High-Performance Computing - NASA Climate Data (HDF5)
**Data Source**: NASA Goddard Earth Sciences Data
**URL**: https://disc.gsfc.nasa.gov/

**Challenge**: 
Work with scientific HDF5 files containing:
- Multi-dimensional arrays
- Hierarchical data structures
- Metadata and attributes
- Very large file sizes (GB range)

**Tasks**:
1. Download climate/weather HDF5 files from NASA
2. Explore hierarchical structure and metadata
3. Extract specific datasets from the HDF5 groups
4. Handle multi-dimensional arrays (time, latitude, longitude, altitude)
5. Perform memory-efficient operations on large datasets
6. Subset data by geographic regions and time periods
7. Convert selected data to pandas-friendly formats
8. Create optimized HDF5 output with custom compression
9. Export time series data to multiple CSV files by region

**Learning Focus**: HDF5 format, scientific data, memory management, hierarchical structures
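
Tasks 6 and 7 come down to slicing a gridded array by coordinates and flattening the result. The sketch below assumes the data has already been read into NumPy arrays and uses a tiny synthetic grid; the dimension order (time, latitude, longitude), variable names, and values are assumptions, and real files should be sliced before loading when they do not fit in memory.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one variable from a climate file.
times = pd.date_range("2020-01-01", periods=3, freq="D")
lats = np.array([-10.0, 0.0, 10.0, 20.0])
lons = np.array([100.0, 110.0, 120.0])
values = np.arange(3 * 4 * 3, dtype="float32").reshape(3, 4, 3)


def subset_to_frame(values, times, lats, lons, lat_range, lon_range):
    """Select a lat/lon bounding box and return a long-format DataFrame."""
    lat_idx = np.flatnonzero((lats >= lat_range[0]) & (lats <= lat_range[1]))
    lon_idx = np.flatnonzero((lons >= lon_range[0]) & (lons <= lon_range[1]))
    block = values[:, lat_idx][:, :, lon_idx]
    # from_product iterates time, then lat, then lon, which matches the
    # row-major order of block.reshape(-1).
    index = pd.MultiIndex.from_product(
        [times, lats[lat_idx], lons[lon_idx]], names=["time", "lat", "lon"]
    )
    return pd.DataFrame({"value": block.reshape(-1)}, index=index).reset_index()


region = subset_to_frame(
    values, times, lats, lons, lat_range=(0, 10), lon_range=(100, 110)
)
print(region.head())
```

From here, `region.to_csv(...)` per region covers task 9.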

## Exercise 8: Multi-Source Integration - E-commerce Analytics
**Data Sources**: Mixed formats (CSV, JSON, Excel, Database)
**Scenario**: Synthetic e-commerce business data

**Challenge**: 
Integrate data from multiple business systems:
- Customer data (CSV) with encoding issues
- Order transactions (JSON) with nested product details
- Inventory (Excel) with multiple sheets and formulas
- User activity logs (SQLite database)

**Tasks**:
1. Create realistic sample datasets in each format (or find open datasets)
2. Load each data source handling format-specific challenges
3. Standardize column names and data types across sources
4. Handle different customer ID formats and create unified keys
5. Merge datasets using appropriate join strategies
6. Identify and resolve data quality issues
7. Create a master customer analytics dataset
8. Export results to data warehouse format (Parquet with partitioning)
9. Generate business intelligence reports in Excel format

**Learning Focus**: Data integration, ETL processes, data quality, business analytics
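
One possible approach to tasks 4 and 5 is to reduce every ID variant to its numeric part and merge with `indicator=True`, which shows at a glance which rows matched. The ID formats, column names, and values below are invented, and extracting digits only works if the numeric part really is the shared identifier.

```python
import pandas as pd

# Invented samples: customer IDs arrive in three different spellings.
customers = pd.DataFrame({
    "customer_id": ["C-001", "c002", " 3 "],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "cust": [1, 2, 2, 9],
    "total": [10.0, 5.0, 7.5, 3.0],
})


def unified_key(ids: pd.Series) -> pd.Series:
    """Reduce an ID column to its first run of digits as a nullable integer."""
    digits = ids.astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits).astype("Int64")


merged = customers.assign(
    customer_key=unified_key(customers["customer_id"])
).merge(
    orders.assign(customer_key=unified_key(orders["cust"])),
    on="customer_key",
    how="outer",
    indicator=True,  # adds a _merge column: both / left_only / right_only
)

match_counts = merged["_merge"].value_counts().to_dict()
spend = merged.loc[merged["_merge"] == "both"].groupby("name")["total"].sum()
print(match_counts)
print(spend)
```

Rows marked `left_only` (customers without orders) and `right_only` (orders without a known customer) are good candidates for the data quality review in task 6.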

## Exercise 9: Real-Time Data Processing - Financial Market Feeds
**Data Source**: Alpha Vantage API / Yahoo Finance
**URL**: https://www.alphavantage.co/ or Yahoo Finance API

**Challenge**: 
Handle streaming financial data with:
- Real-time price updates
- High-frequency data points
- Missing data during market closures
- Different market timezones

**Tasks**:
1. Set up real-time stock price feeds for 10 major stocks
2. Implement continuous data collection with error handling
3. Handle rate limiting and API quotas gracefully
4. Store streaming data incrementally (append-only approach)
5. Implement data validation and anomaly detection
6. Create rolling window calculations (moving averages, volatility)
7. Handle market closure periods and timezone conversions
8. Export hourly snapshots to compressed CSV files
9. Maintain a real-time dashboard data feed (JSON format)
10. Implement data archival strategy for historical data

**Learning Focus**: Streaming data, real-time processing, error handling, time series management
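
The rolling calculations, timezone conversion, and append-only storage can be rehearsed on synthetic ticks before connecting to a feed. The prices, the window length of 3, and the file name `ticks.csv` in this sketch are arbitrary assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic one-minute ticks, timestamped in UTC.
ticks = pd.DataFrame(
    {"price": [100.0, 101.0, 102.0, 101.0, 103.0, 104.0]},
    index=pd.date_range(
        "2024-01-02 14:30", periods=6, freq="min", tz="UTC", name="timestamp"
    ),
)

# Rolling statistics; the first window-1 rows are NaN by design.
ticks["sma_3"] = ticks["price"].rolling(window=3).mean()
ticks["return"] = ticks["price"].pct_change()
ticks["volatility_3"] = ticks["return"].rolling(window=3).std()

# Same instants expressed in exchange-local time.
ticks_ny = ticks.tz_convert("America/New_York")


def append_rows(frame: pd.DataFrame, path: Path) -> None:
    """Append rows to a CSV, writing the header only when the file is new."""
    frame.to_csv(path, mode="a", header=not path.exists())


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "ticks.csv"
    append_rows(ticks.iloc[:3], store)   # first batch creates the file
    append_rows(ticks.iloc[3:], store)   # later batches only add rows
    stored = pd.read_csv(store, index_col="timestamp", parse_dates=True)

print(ticks_ny[["price", "sma_3"]])
```

Storing timestamps in UTC and converting only for display keeps data from markets in different timezones comparable.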

## Exercise 10: Advanced Challenge - Social Media Analytics Pipeline
**Data Sources**: Multiple social media APIs
**URLs**: Twitter API v2, Reddit API, Instagram Basic Display API

**Challenge**: 
Build a complete data pipeline handling:
- Multiple API endpoints with different schemas
- Rate limiting across multiple services
- Text data with various encodings
- Image metadata and binary data
- Geolocation and timestamp normalization

**Tasks**:
1. Set up authentication for 3 different social media APIs
2. Collect posts/tweets about a trending topic over 48 hours
3. Handle different JSON schemas and API response formats
4. Process text data (encoding, emojis, special characters)
5. Extract and download linked media files
6. Implement robust error handling and retry logic
7. Store raw data in MongoDB or similar NoSQL database
8. Create processed analytical datasets in multiple formats:
   - Time series data (CSV) for trend analysis
   - User network data (JSON) for social graph analysis
   - Text corpus (compressed text files) for NLP
   - Media metadata (Excel) for content analysis
9. Implement automated data quality reporting
10. Create a comprehensive ETL documentation report

**Learning Focus**: API orchestration, NoSQL databases, text processing, pipeline automation, documentation
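
For task 6, a common pattern is retrying with exponential backoff. The sketch below is generic rather than tied to any of the APIs above: `RateLimited`, `fetch_with_retries`, and `flaky_fetch` are hypothetical names, and which exceptions count as retryable depends on the client library you use.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when told to slow down."""


def fetch_with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a retryable error wait 1x, 2x, 4x... base_delay."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, RateLimited):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited("slow down")
    return {"posts": ["hello"]}


# Passing delays.append as `sleep` records the waits instead of sleeping.
delays = []
result = fetch_with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the function testable; in production the default `time.sleep` applies, and adding random jitter to the delay is a frequent refinement.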

## Bonus Tips & Resources

### Performance Considerations
- Always profile your data loading operations
- Use the `chunksize` parameter for large files
- Consider `usecols` to load only needed columns
- Experiment with different `dtype` specifications
- Prefer binary formats such as Parquet or HDF5 for data you read repeatedly; both support compression and preserve dtypes
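
The `chunksize`, `usecols`, and `dtype` options combine naturally in one `read_csv` call. Here is a small sketch on generated data; the column names and the chunk size of 4 are placeholders chosen so the example stays tiny.

```python
import io

import pandas as pd

# Generated sample: 10 rows, one wide text column we do not need.
CSV = "id,city,amount,notes\n" + "".join(
    f"{i},{'Paris' if i % 2 else 'Lima'},{i * 1.5},free text {i}\n"
    for i in range(10)
)

reader = pd.read_csv(
    io.StringIO(CSV),
    usecols=["city", "amount"],                        # skip unused columns
    dtype={"city": "category", "amount": "float32"},   # smaller dtypes
    chunksize=4,                                       # iterate in pieces
)

totals = {}
n_chunks = 0
for chunk in reader:
    n_chunks += 1
    partial = chunk.groupby("city", observed=True)["amount"].sum()
    for city, amount in partial.items():
        totals[city] = totals.get(city, 0.0) + float(amount)

print(n_chunks, totals)
```

Only one chunk is held in memory at a time, so the same loop scales to files far larger than RAM as long as the running aggregate stays small.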

### Data Quality Checklist
- Check for duplicate records
- Validate data types and ranges
- Handle missing values appropriately
- Document data transformations
- Implement data validation tests
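
Several checklist items can be folded into one small report function. The helper `quality_report`, its arguments, and the sample data are illustrative assumptions rather than a standard API.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key: str, ranges: dict) -> dict:
    """Count duplicate keys, missing values, and out-of-range values."""
    out_of_range = {}
    for column, (low, high) in ranges.items():
        present = df[column].notna()
        inside = df[column].between(low, high)
        # Missing values are reported separately, so exclude them here.
        out_of_range[column] = int((present & ~inside).sum())
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "missing": {c: int(n) for c, n in df.isna().sum().items()},
        "out_of_range": out_of_range,
    }


# Invented sample with one duplicate id, one missing age, one bad age.
people = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [30.0, -5.0, 41.0, None],
})
report = quality_report(people, key="id", ranges={"age": (0, 120)})
print(report)
```

Turning the report's counts into assertions (for example `report["duplicate_keys"] == 0`) gives a simple data validation test to run after every load.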

### Useful Libraries Beyond Pandas
- **fastparquet** or **pyarrow**: For Parquet files
- **h5py**: For low-level HDF5 operations
- **requests**: For API interactions
- **beautifulsoup4**: For HTML parsing
- **lxml**: For XML processing
- **sqlalchemy**: For database connections
- **openpyxl**: For Excel file manipulation

### Additional Practice Resources
- **Kaggle Datasets**: https://www.kaggle.com/datasets
- **Google Dataset Search**: https://datasetsearch.research.google.com/
- **AWS Open Data**: https://aws.amazon.com/opendata/
- **UCI ML Repository**: https://archive.ics.uci.edu/ml/index.php
- **Data.gov**: https://www.data.gov/

Good luck with your pandas data loading journey! 🐼📊