Skip to content

HariTechPath/Python_Question1

Repository files navigation

Advanced Data Engineering Problems - Comprehensive Solutions

This repository contains 4 comprehensive data engineering problems extracted from images and enhanced with professional-grade requirements, test cases, and optimized solutions.

Problems Overview

1. SQL Fruit Sales Analysis (Problem_1_SQL_Fruit_Sales_Analysis.py)

Problem Type: SQL Data Analysis
Complexity: Medium to Hard
Focus: Time-series analysis, aggregation, and date grouping

Description: Analyze fruit sales data to find the difference between the first and last items sold each day.

Key Features:

  • Handles up to 1,000,000 records efficiently
  • Proper timestamp parsing and date grouping
  • Multiple solution approaches (SQL and Pandas)
  • Comprehensive test cases including edge cases

Sample Output:

date        | first_sold | last_sold | diff
2022-01-01  | apple      | orange    | 5
2022-01-02  | apple      | banana    | -1

2. Array Subset Verification (Problem_2_Array_Subset_Check.py)

Problem Type: Data Structure & Algorithms
Complexity: Easy to Hard
Focus: Efficient subset checking, duplicate handling, memory optimization

Description: Determine if one array is a subset of another, handling duplicates correctly.

Key Features:

  • Multiple optimization strategies
  • Memory-efficient solutions for large datasets
  • Handles mixed data types and edge cases
  • Performance benchmarking included

Sample Output:

Input: arr1 = [11, 1, 13, 21, 3, 7], arr2 = [11, 3, 7, 1]
Output: True (arr2 is a subset of arr1)

3. Revenue Categorization (Problem_3_Revenue_Categorization.py)

Problem Type: Pandas Data Processing
Complexity: Easy to Hard
Focus: Data categorization, large dataset processing, analytics

Description: Categorize products based on revenue ranges and perform advanced analytics.

Key Features:

  • Handles up to 10 million records
  • Multiple categorization strategies
  • Chunked processing for memory optimization
  • Advanced analytics and pattern analysis

Sample Output:

product_id | category    | total_revenue | Revenue Category
P001       | Electronics | 15000        | High
P002       | Clothing    | 800          | Low
P003       | Home & Kitchen | 3500     | Medium

4. Item Rule Matching (Problem_4_Item_Rule_Matching.py)

Problem Type: Data Filtering & Analysis
Complexity: Easy to Hard
Focus: Real-time filtering, batch processing, performance optimization

Description: Count items matching specific rules in large datasets with real-time processing capabilities.

Key Features:

  • Handles up to 10 million items
  • Batch rule processing
  • Preprocessing optimization
  • Real-time streaming simulation

Sample Output:

Input: items = [["phone", "blue", "pixel"], ["computer", "silver", "lenovo"], ["phone", "gold", "iphone"]]
       ruleKey = "type", ruleValue = "phone"
Output: 2

Test Case Distribution

Each problem includes 10 comprehensive test cases:

  • 4 Easy Cases: Basic functionality and common scenarios
  • 3 Medium Cases: Performance testing and moderate complexity
  • 2 Hard Cases: Edge cases and advanced scenarios
  • 1 Extremely Hard Case: Real-world complexity and extreme edge cases

Validation Results

All solutions have been validated and tested:

✅ Array Subset Check: 8/8 tests passed
✅ Item Rule Matching: 8/8 tests passed  
✅ Revenue Categorization: 14/14 tests passed
✅ SQL Fruit Sales Analysis: 2/2 tests passed

Overall Results: 4/4 problems passed
🎉 ALL VALIDATIONS PASSED!

Key Technical Features

Performance Optimization

  • Memory Efficiency: Solutions handle datasets exceeding available RAM
  • Time Complexity: Optimized algorithms for large-scale processing
  • Chunked Processing: For datasets up to 10 million records
  • Vectorized Operations: Using NumPy for maximum performance

Data Quality Handling

  • Missing Values: Graceful handling of null/empty values
  • Data Type Validation: Support for mixed data types
  • Edge Case Management: Comprehensive edge case coverage
  • Error Handling: Robust error management strategies

Scalability Features

  • Real-time Processing: Streaming data simulation
  • Batch Operations: Efficient multi-rule processing
  • Incremental Updates: Support for dynamic data changes
  • Distributed Processing: Architecture considerations for scale

Usage Instructions

Running Individual Problems

python Problem_1_SQL_Fruit_Sales_Analysis.py
python Problem_2_Array_Subset_Check.py
python Problem_3_Revenue_Categorization.py
python Problem_4_Item_Rule_Matching.py

Running Validation

python validation_runner.py

Dependencies

pip install pandas numpy

Problem Complexity Analysis

Problem Time Complexity Space Complexity Dataset Size Difficulty
SQL Fruit Sales O(n log n) O(n) 1M records Medium-Hard
Array Subset O(n+m) O(n+m) 10M elements Easy-Hard
Revenue Categorization O(n) O(n) 10M records Easy-Hard
Item Matching O(n) O(n) 10M items Easy-Hard

Real-World Applications

These problems simulate common data engineering challenges:

  1. E-commerce Analytics: Revenue analysis and product categorization
  2. Data Validation: Array subset checking for data quality
  3. Real-time Filtering: Item matching for recommendation systems
  4. Time-series Analysis: Sales pattern analysis

Advanced Features Demonstrated

  • Memory Optimization: Chunked processing for large datasets
  • Performance Benchmarking: Comparative analysis of different approaches
  • Edge Case Handling: Comprehensive edge case coverage
  • Analytics Integration: Advanced pattern analysis and insights
  • Production Readiness: Error handling and scalability considerations

Contributing

These problems are designed to test and demonstrate advanced data engineering skills including:

  • Algorithm optimization
  • Large-scale data processing
  • Performance tuning
  • Memory management
  • Real-world problem solving

Each solution includes multiple approaches to demonstrate different optimization strategies and trade-offs.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •