# Data Engineering Pipeline Project

## Task Overview
As a data engineer, you will build a pipeline for a selected scenario through the following steps:

1. **Select a Scenario** from the options provided below
2. **Document Requirements** using the Five Vs of Data framework:
    - **Value** – What insights can be pulled from the data?
    - **Veracity** – How accurate, precise, and trusted is the data?
    - **Variety** – What types and formats? How many different sources of data?
    - **Velocity** – What is the frequency of new data being generated and ingested?
    - **Volume** – How big is the dataset? How much new data is generated?
3. **Present Results** to the class as directed by your instructor

---

## Scenario: E-Bike Rental Proof of Concept

### Context
Your startup e-bike company has partnered with a local government to pilot an e-bike rental program in several suburban neighborhoods. The government aims to:
- Reduce carbon emissions
- Reduce traffic congestion
- Collect and analyze usage trends
- Determine key metrics for program expansion

### Infrastructure Details
- **Fleet Size:** 50 e-bikes equipped with IoT devices
- **Power Source:** Bike battery
- **Network:** Wireless phone network
- **Data Format:** JSON
- **Transmission Frequency:** Every 2 minutes

### Desired Outcome
Develop a proof-of-concept pipeline to:
- Collect data from IoT devices embedded in e-bikes
- Analyze usage patterns and trends
- Support data-driven decision-making for program expansion


┌─────────────────────────────────────────────────────────────────────────────┐
│                    FIVE Vs OF DATA ANALYSIS                                 │
└─────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────┐
│ VALUE                                                                        │
├──────────────────────────────────────────────────────────────────────────────┤
│ • Peak usage times and high-demand zones for expansion planning              │
│ • User behavior patterns (trip duration, distance, frequency)                │
│ • Battery performance metrics and maintenance needs                          │
│ • Revenue optimization based on usage trends                                 │
│ • Environmental impact (carbon emissions reduced)                            │
└──────────────────────────────────────────────────────────────────────────────┘
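
One way the "peak usage times" insight might be derived is by bucketing message timestamps by hour of day. This is a minimal sketch: it assumes each message carries an ISO 8601 timestamp and that the list has already been filtered to messages sent while a bike was in use. The function name `peak_hours` and the sample values are hypothetical, since the scenario does not define a payload schema.

```python
from collections import Counter
from datetime import datetime


def peak_hours(timestamps, top_n=3):
    """Return the top_n (hour, message_count) pairs, busiest hour first."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return hours.most_common(top_n)


# Hypothetical sample timestamps (illustrative only, not program data).
sample = [
    "2024-05-01T08:02:00",
    "2024-05-01T08:04:00",
    "2024-05-01T08:06:00",
    "2024-05-01T17:30:00",
    "2024-05-01T17:32:00",
    "2024-05-01T12:00:00",
]
print(peak_hours(sample, top_n=2))  # [(8, 3), (17, 2)]
```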

┌──────────────────────────────────────────────────────────────────────────────┐
│ VERACITY                                                                     │
├──────────────────────────────────────────────────────────────────────────────┤
│ • IoT sensors report GPS position and battery level every 2 minutes          │
│ • Data validation checks for outliers (impossible speeds, invalid zones)     │
│ • Potential issues: GPS signal loss, device malfunction, transmission delay  │
│ • Trust level: High for technical metrics; moderate for user behavior        │
└──────────────────────────────────────────────────────────────────────────────┘
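
The validation checks listed above could be sketched as a function that returns the problems found in one message. Everything specific here is an assumption: the field names (`lat`, `lon`, `speed_kmh`, `battery_pct`) and the 45 km/h speed ceiling do not come from the scenario, and a plain coordinate-range check stands in for a real service-zone (geofence) lookup.

```python
MAX_SPEED_KMH = 45.0  # assumed plausibility ceiling, not from the scenario


def validate(msg):
    """Return a list of problems found in one telemetry message (empty if clean)."""
    problems = []

    # GPS signal loss shows up as a missing fix; bad hardware as wild values.
    lat, lon = msg.get("lat"), msg.get("lon")
    if lat is None or lon is None:
        problems.append("missing GPS fix")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates out of range")

    # Speed is optional in this sketch; only check it when present.
    speed = msg.get("speed_kmh")
    if speed is not None and not (0 <= speed <= MAX_SPEED_KMH):
        problems.append("implausible speed")

    battery = msg.get("battery_pct")
    if battery is None or not (0 <= battery <= 100):
        problems.append("invalid battery level")

    return problems


good = {"lat": 40.01, "lon": -75.12, "speed_kmh": 18.0, "battery_pct": 82.5}
bad = {"lat": None, "lon": None, "speed_kmh": 140.0, "battery_pct": 120}
print(validate(good))
print(validate(bad))
```

Messages that fail validation would typically be routed to a separate store for inspection rather than silently dropped, so that device malfunctions can be traced later.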

┌──────────────────────────────────────────────────────────────────────────────┐
│ VARIETY                                                                      │
├──────────────────────────────────────────────────────────────────────────────┤
│ • Format: JSON (semi-structured)                                             │
│ • Data types: Numeric (coordinates, battery %), text (status), timestamps    │
│ • Sources: 50 IoT devices + user rental records                              │
│ • Potential additional sources: Weather, traffic, municipal geofencing data  │
└──────────────────────────────────────────────────────────────────────────────┘
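
To make the mix of data types concrete, the sketch below parses one JSON message into a typed record. The payload layout is hypothetical: the scenario states only that devices send JSON, so the field names, the `Telemetry` class, and the sample values are illustrative assumptions.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Telemetry:
    bike_id: str         # text
    timestamp: datetime  # timestamp
    lat: float           # numeric
    lon: float           # numeric
    battery_pct: float   # numeric
    status: str          # text


def parse_message(raw):
    """Decode one JSON string and coerce each field to its expected type."""
    data = json.loads(raw)
    return Telemetry(
        bike_id=str(data["bike_id"]),
        timestamp=datetime.fromisoformat(data["timestamp"]),
        lat=float(data["lat"]),
        lon=float(data["lon"]),
        battery_pct=float(data["battery_pct"]),
        status=str(data["status"]),
    )


# Hypothetical message (illustrative only).
raw = (
    '{"bike_id": "bike-007", "timestamp": "2024-05-01T08:02:00", '
    '"lat": 40.01, "lon": -75.12, "battery_pct": 82.5, "status": "in_use"}'
)
msg = parse_message(raw)
print(msg.bike_id, msg.timestamp.hour, msg.battery_pct, msg.status)
```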

┌──────────────────────────────────────────────────────────────────────────────┐
│ VELOCITY                                                                     │
├──────────────────────────────────────────────────────────────────────────────┤
│ • Transmission frequency: Every 2 minutes per bike                           │
│ • Message rate: 36,000 messages/day (25 messages/minute aggregate)           │
│ • Near-real-time processing required for alerts (2-minute reporting lag)     │
│ • Continuous ingestion pipeline needed                                       │
└──────────────────────────────────────────────────────────────────────────────┘
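
The message-rate figures follow directly from the fleet size and transmission interval. The short calculation below reproduces them; it assumes every bike transmits around the clock, which makes these upper-bound estimates. The function name `message_rates` is illustrative.

```python
def message_rates(bikes, interval_minutes):
    """Return (messages per day, messages per minute) for the whole fleet."""
    per_bike_per_day = (24 * 60) / interval_minutes  # 720 at a 2-minute interval
    per_day = bikes * per_bike_per_day
    per_minute = bikes / interval_minutes
    return per_day, per_minute


per_day, per_minute = message_rates(bikes=50, interval_minutes=2)
print(f"{per_day:,.0f} messages/day, {per_minute:.0f} messages/minute")
# 36,000 messages/day, 25 messages/minute
```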

┌──────────────────────────────────────────────────────────────────────────────┐
│ VOLUME                                                                       │
├──────────────────────────────────────────────────────────────────────────────┤
│ • Daily: 36,000 messages × ~500 bytes (assumed average) = ~18 MB/day         │
│ • Monthly: ~540 MB (30 days)                                                 │
│ • Annual: ~6.57 GB (365 days)                                                │
│ • Scalable: If expanded to 500 bikes, volume increases 10x (~65.7 GB/year)   │
│ • Storage requirements: Modest (a few GB/year); cloud database suitable      │
└──────────────────────────────────────────────────────────────────────────────┘
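
The storage estimates can be reproduced with a few lines of arithmetic. The sketch below uses decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), a 30-day month, and a 365-day year. The 500-byte message size is an assumed average rather than a measured value, and the function name `storage_estimate` is illustrative.

```python
def storage_estimate(bikes, messages_per_bike_per_day=720, bytes_per_message=500):
    """Estimate raw message volume; bytes_per_message is an assumed average."""
    daily_bytes = bikes * messages_per_bike_per_day * bytes_per_message
    return {
        "daily_mb": daily_bytes / 1e6,
        "monthly_mb": daily_bytes * 30 / 1e6,
        "annual_gb": daily_bytes * 365 / 1e9,
    }


pilot = storage_estimate(bikes=50)
expanded = storage_estimate(bikes=500)
print(pilot)     # pilot fleet of 50 bikes
print(expanded)  # hypothetical expansion to 500 bikes
```

These figures cover raw payloads only. Indexes, replication, and any derived tables would add to the total, so they are best read as a lower bound on actual storage.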

