# 08 - Practical Data Engineering Example

## Introduction

This notebook walks through a small end-to-end data engineering workflow in NumPy: generating simulated sensor readings, cleaning and normalizing them, computing summary statistics, and deriving features.

## Scenario

You work as a data engineer and need to:
1. Process sensor readings from multiple sources
2. Clean and normalize the data
3. Calculate statistics and aggregations
4. Perform feature engineering
5. Generate summary reports


In [1]:
import numpy as np

print("Starting data engineering pipeline...")
print("=" * 50)


Starting data engineering pipeline...
==================================================


## Step 1: Generate Sample Sensor Data


In [2]:
# Generate sample sensor data (temperature readings from 5 sensors over 10 days)
np.random.seed(42)
sensor_data = np.random.normal(loc=25, scale=5, size=(10, 5))  # 10 days, 5 sensors
print("Sensor Data (10 days x 5 sensors):")
print(sensor_data)
print(f"\nShape: {sensor_data.shape}")


Sensor Data (10 days x 5 sensors):
[[27.48357077 24.30867849 28.23844269 32.61514928 23.82923313]
 [23.82931522 32.89606408 28.83717365 22.65262807 27.71280022]
 [22.68291154 22.67135123 26.20981136 15.43359878 16.37541084]
 [22.18856235 19.9358444  26.57123666 20.45987962 17.93848149]
 [32.32824384 23.8711185  25.33764102 17.87625907 22.27808638]
 [25.55461295 19.24503211 26.87849009 21.99680655 23.54153125]
 [21.99146694 34.26139092 24.93251388 19.71144536 29.11272456]
 [18.89578175 26.04431798 15.20164938 18.35906976 25.98430618]
 [28.6923329  25.85684141 24.42175859 23.49448152 17.60739005]
 [21.40077896 22.69680615 30.28561113 26.71809145 16.18479922]]

Shape: (10, 5)
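The cell above seeds NumPy's legacy global random state with `np.random.seed`. As a rough sketch, the same setup can be written with the `Generator` API instead. The names `rng` and `sensor_data_alt` are introduced here for illustration only, and the values drawn will not match the output shown above, because the two APIs use different underlying bit generators.

```python
import numpy as np

# Hypothetical alternative to np.random.seed(42): a local, seeded generator
# that does not touch NumPy's global random state.
rng = np.random.default_rng(42)

# Same distribution and shape as in the notebook: 10 days x 5 sensors.
sensor_data_alt = rng.normal(loc=25, scale=5, size=(10, 5))

print(sensor_data_alt.shape)
```

Passing a generator object around explicitly tends to make a pipeline easier to reproduce, since no other code can reseed it by accident.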


## Step 2: Data Cleaning and Validation


In [3]:
# Inject one outlier and one missing value (simulating real-world data)
sensor_data[2, 1] = 100  # Outlier
sensor_data[5, 3] = np.nan  # Missing value
print("Data with outliers and missing values:")
print(sensor_data)

# Flag out-of-range readings (> 50 or < 0) as missing
mask = (sensor_data > 50) | (sensor_data < 0)
sensor_data[mask] = np.nan
print("\nAfter removing outliers:")
print(sensor_data)

# Fill missing values with the mean of each sensor (column)
for i in range(sensor_data.shape[1]):
    col = sensor_data[:, i]  # a view into sensor_data, so edits apply in place
    mean_val = np.nanmean(col)  # mean that ignores NaN
    col[np.isnan(col)] = mean_val

print("\nAfter filling missing values:")
print(sensor_data)


Data with outliers and missing values:
[[ 27.48357077  24.30867849  28.23844269  32.61514928  23.82923313]
 [ 23.82931522  32.89606408  28.83717365  22.65262807  27.71280022]
 [ 22.68291154 100.          26.20981136  15.43359878  16.37541084]
 [ 22.18856235  19.9358444   26.57123666  20.45987962  17.93848149]
 [ 32.32824384  23.8711185   25.33764102  17.87625907  22.27808638]
 [ 25.55461295  19.24503211  26.87849009          nan  23.54153125]
 [ 21.99146694  34.26139092  24.93251388  19.71144536  29.11272456]
 [ 18.89578175  26.04431798  15.20164938  18.35906976  25.98430618]
 [ 28.6923329   25.85684141  24.42175859  23.49448152  17.60739005]
 [ 21.40077896  22.69680615  30.28561113  26.71809145  16.18479922]]

After removing outliers:
[[27.48357077 24.30867849 28.23844269 32.61514928 23.82923313]
 [23.82931522 32.89606408 28.83717365 22.65262807 27.71280022]
 [22.68291154         nan 26.20981136 15.43359878 16.37541084]
 [22.18856235 19.9358444  26.57123666 20.45987962 17.93848149]
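 [32.32824384 23.8711185  25.33764102 17.87625907 22.27808638]
 [25.55461295 19.24503211 26.87849009         nan 23.54153125]
 [21.99146694 34.26139092 24.93251388 19.71144536 29.11272456]
 [18.89578175 26.04431798 15.20164938 18.35906976 25.98430618]
 [28.6923329  25.85684141 24.42175859 23.49448152 17.60739005]
 [21.40077896 22.69680615 30.28561113 26.71809145 16.18479922]]

After filling missing values:
[[27.48357077 24.30867849 28.23844269 32.61514928 23.82923313]
 [23.82931522 32.89606408 28.83717365 22.65262807 27.71280022]
 [22.68291154 25.45734378 26.20981136 15.43359878 16.37541084]
 [22.18856235 19.9358444  26.57123666 20.45987962 17.93848149]
 [32.32824384 23.8711185  25.33764102 17.87625907 22.27808638]
 [25.55461295 19.24503211 26.87849009 21.92451143 23.54153125]
 [21.99146694 34.26139092 24.93251388 19.71144536 29.11272456]
 [18.89578175 26.04431798 15.20164938 18.35906976 25.98430618]
 [28.6923329  25.85684141 24.42175859 23.49448152 17.60739005]
 [21.40077896 22.69680615 30.28561113 26.71809145 16.18479922]]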
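The per-column loop in the cleaning cell can also be written without a Python loop. The following is a minimal sketch on a small made-up array (not the notebook's data), using the same 0 to 50 validity bounds; the names `data`, `cleaned`, `col_means`, and `filled` are illustrative.

```python
import numpy as np

# Small illustrative array: 3 readings x 2 sensors, with one gap and one outlier.
data = np.array([[20.0, 30.0],
                 [np.nan, 100.0],
                 [24.0, 34.0]])

# Mark out-of-range readings as missing (comparisons with NaN are False,
# so existing gaps are left untouched here).
cleaned = np.where((data > 50) | (data < 0), np.nan, data)

# Per-sensor mean that ignores NaN, broadcast across rows to fill every gap.
col_means = np.nanmean(cleaned, axis=0)
filled = np.where(np.isnan(cleaned), col_means, cleaned)

print(filled)
```

Unlike the in-place loop, this version returns new arrays and leaves `data` unchanged. Note that `np.nanmean` returns NaN (with a warning) for a column that contains no valid readings, so such columns would stay unfilled.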

## Step 3: Data Normalization


In [4]:
# Z-score normalization (standardization), computed per sensor (column)
mean = np.mean(sensor_data, axis=0)
std = np.std(sensor_data, axis=0)  # population std (ddof=0), NumPy's default
normalized_data = (sensor_data - mean) / std  # broadcasts across rows
print("Normalized data (Z-score):")
print(normalized_data[:3])  # Show first 3 rows
print(f"\nMean of normalized data: {np.mean(normalized_data, axis=0)}")
print(f"Std of normalized data: {np.std(normalized_data, axis=0)}")


Normalized data (Z-score):
[[ 7.81546575e-01 -2.48371920e-01  6.52671814e-01  2.28685988e+00
   3.90477907e-01]
 [-1.77214806e-01  1.60844875e+00  8.06096748e-01  1.55753169e-01
   1.24589539e+00]
 [-4.77994968e-01 -7.68190985e-16  1.32834604e-01 -1.38848663e+00
  -1.25134512e+00]]

Mean of normalized data: [-5.21804822e-16 -6.77236045e-16 -1.77635684e-16  3.33066907e-16
 -2.22044605e-16]
Std of normalized data: [1. 1. 1. 1. 1.]
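The normalization above divides by the per-sensor standard deviation, which would produce NaN or infinity for a sensor whose readings never change. A hedged sketch of one way to guard against that is shown below. The helper `zscore` is hypothetical (it is not a NumPy function), and mapping constant columns to 0 is a design choice made here, not something the notebook prescribes.

```python
import numpy as np

def zscore(data, ddof=0):
    """Column-wise z-score; constant columns map to 0 instead of NaN/inf."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=ddof)
    # Replace zero std with 1 so (data - mean) / std is 0 for constant columns.
    safe_std = np.where(std == 0, 1.0, std)
    return (data - mean) / safe_std

# Illustrative input: the second column is constant.
sample = np.array([[1.0, 5.0],
                   [2.0, 5.0],
                   [3.0, 5.0]])
z = zscore(sample)
print(z)
```

The `ddof` argument is passed straight to `ndarray.std`; `ddof=0` matches the notebook's population standard deviation, while `ddof=1` would give the sample estimate.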


## Step 4: Statistical Analysis


In [5]:
# Calculate statistics for each sensor
print("Statistics per sensor:")
print(f"Mean: {np.mean(sensor_data, axis=0)}")
print(f"Std: {np.std(sensor_data, axis=0)}")
print(f"Min: {np.min(sensor_data, axis=0)}")
print(f"Max: {np.max(sensor_data, axis=0)}")
print(f"Median: {np.median(sensor_data, axis=0)}")


Statistics per sensor:
Mean: [24.50475772 25.45734378 25.69143284 21.92451143 22.05647633]
Std: [3.81143382 4.62477919 3.90243579 4.67481105 4.53996696]
Min: [18.89578175 19.24503211 15.20164938 15.43359878 16.18479922]
Max: [32.32824384 34.26139092 30.28561113 32.61514928 29.11272456]
Median: [23.25611338 24.88301114 26.39052401 21.19219553 22.90980881]
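The scenario also calls for summary reports. One possible way to turn statistics like these into a plain-text table is sketched below. The function `summary_report` and the sensor names are assumptions made for illustration, and the demo array is made up rather than taken from the notebook.

```python
import numpy as np

def summary_report(data, names):
    """Format per-column mean, std, min, and max as a plain-text table."""
    # One row per column, one statistic per table column.
    stats = np.column_stack([
        data.mean(axis=0),
        data.std(axis=0),
        data.min(axis=0),
        data.max(axis=0),
    ])
    lines = [f"{'sensor':<10}{'mean':>8}{'std':>8}{'min':>8}{'max':>8}"]
    for name, row in zip(names, stats):
        lines.append(f"{name:<10}" + "".join(f"{v:>8.2f}" for v in row))
    return "\n".join(lines)

# Illustrative input: 3 readings x 2 sensors.
demo = np.array([[20.0, 30.0],
                 [22.0, 32.0],
                 [24.0, 34.0]])
report = summary_report(demo, ["sensor_0", "sensor_1"])
print(report)
```

Calling `summary_report(sensor_data, [f"sensor_{i}" for i in range(5)])` would apply the same layout to the cleaned array from Step 2.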


## Step 5: Feature Engineering


In [6]:
# Calculate daily average across all sensors
daily_avg = np.mean(sensor_data, axis=1)
print("Daily average temperature:")
print(daily_avg)

# Calculate temperature range per day
daily_range = np.max(sensor_data, axis=1) - np.min(sensor_data, axis=1)
print(f"\nDaily temperature range: {daily_range}")

# Calculate pairwise correlation between sensors
# (np.corrcoef treats each row as a variable, so transpose to put sensors in rows)
correlation = np.corrcoef(sensor_data.T)
print("\nCorrelation matrix between sensors:")
print(correlation)


Daily average temperature:
[27.29501487 27.18559625 21.23181526 21.41880091 24.33826976 23.42883557
 26.00190833 20.89702501 24.01456089 23.45721738]

Daily temperature range: [ 8.78591616 10.24343601 10.77621258  8.63275517 14.45198478  7.63345798
 14.54994557 10.84266859 11.08494285 14.10081191]

Correlation matrix between sensors:
[[ 1.         -0.16887086  0.28321931  0.1804704  -0.08974728]
 [-0.16887086  1.         -0.09543941 -0.12274636  0.62896914]
 [ 0.28321931 -0.09543941  1.          0.46816469 -0.28480565]
 [ 0.1804704  -0.12274636  0.46816469  1.          0.00154639]
 [-0.08974728  0.62896914 -0.28480565  0.00154639  1.        ]]
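A correlation matrix is easier to act on once the strongest relationship is pulled out of it. The sketch below does that for the matrix printed above; `strongest_pair` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def strongest_pair(corr):
    """Return (i, j, r) for the off-diagonal entry with the largest |r|."""
    corr = np.asarray(corr, dtype=float)
    # Keep only the strict upper triangle so each pair is counted once
    # and the diagonal of ones is ignored.
    upper = np.triu(np.abs(corr), k=1)
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    return int(i), int(j), float(corr[i, j])

# Correlation matrix copied from the output above.
corr = np.array([
    [1.0, -0.16887086, 0.28321931, 0.1804704, -0.08974728],
    [-0.16887086, 1.0, -0.09543941, -0.12274636, 0.62896914],
    [0.28321931, -0.09543941, 1.0, 0.46816469, -0.28480565],
    [0.1804704, -0.12274636, 0.46816469, 1.0, 0.00154639],
    [-0.08974728, 0.62896914, -0.28480565, 0.00154639, 1.0],
])
pair = strongest_pair(corr)
print(pair)
```

Here the helper picks sensors 1 and 4, with r of about 0.63. Because the readings were drawn independently and there are only 10 days of data, a correlation of that size is best read as sampling noise rather than a real relationship.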


## Summary

In this practical example, you learned how to:

1. **Generate sample data**: Using NumPy random functions
2. **Clean data**: Handle outliers and missing values
3. **Normalize data**: Z-score standardization
4. **Calculate statistics**: Mean, std, min, max, median
5. **Feature engineering**: Daily averages, ranges, correlations

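**Key Takeaways**:
- NumPy covers the core preprocessing steps shown here: masking outliers, imputing missing values, and scaling
- Vectorized operations with an `axis` argument process whole rows or columns at once, without explicit Python loops
- NaN-aware functions such as `np.nanmean` compute statistics without having to drop incomplete rows
- Derived features such as daily averages and ranges come from simple reductions along one axis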

**Next Steps**: Practice with the exercises in the next notebook!
