## 1. Working with Data in Python

The running example is a bottling plant with three production lines (A, B, C). We analyze one year of maintenance logs for its conveyor motors.


<img src="Bottling.png" alt="Bottling plant" width="500">

**Fields**
- `event_date`: date of failure
- `line`: production line (A, B, C)
- `asset_id`: conveyor motor id
- `failure_mode`: Mechanical, Electrical, Misalignment
- `time_to_failure_days`: days since last failure (proxy for MTBF)
- `repair_time_hours`: hours to repair (MTTR component)
- `downtime_hours`: total downtime per event
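
To make the MTBF and MTTR roles of these fields concrete, here is a minimal sketch that averages `time_to_failure_days` and `repair_time_hours` per line. The `events` frame and its values are made up for illustration, and the output column names `mtbf_days` and `mttr_hours` are assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny hypothetical sample using the field names listed above (values are made up)
events = pd.DataFrame({
    "line": ["A", "A", "B", "B", "C"],
    "time_to_failure_days": [30.0, 50.0, 20.0, 40.0, 25.0],
    "repair_time_hours": [4.0, 6.0, 5.0, 7.0, 3.0],
})

# Per-line averages: a rough MTBF proxy (days) and a rough MTTR estimate (hours)
summary = events.groupby("line").agg(
    mtbf_days=("time_to_failure_days", "mean"),
    mttr_hours=("repair_time_hours", "mean"),
)
print(summary)
```

The same `groupby(...).agg(...)` pattern should carry over to the full dataset once it has been generated.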


### Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

### Generate data

> ℹ️ The cell below creates a small synthetic dataset that mimics common reliability patterns (right‑skewed time to failure, varying lines, and multiple failure modes).

In [2]:
# Reproducibility
rng = np.random.default_rng(42)

# Generate synthetic dataset
n_events = 240
lines = rng.choice(list('ABC'), size=n_events, p=[0.4,0.35,0.25])
asset_ids = [f"{ln}-M{rng.integers(1,11)}" for ln in lines]
failure_modes = rng.choice(["Mechanical","Electrical","Misalignment"], size=n_events, p=[0.5,0.3,0.2])

# Right‑skewed time to failure (gamma), different scales per line
scale_map = {'A': 40, 'B': 35, 'C': 30}
time_to_failure_days = np.array([rng.gamma(shape=2.0, scale=scale_map[ln]) for ln in lines])

# Repair and downtime
repair_time_hours = rng.normal(loc=6, scale=1.8, size=n_events).clip(1, 24)
downtime_hours = (repair_time_hours * rng.uniform(1.0, 1.8, size=n_events)).round(2)

# Event dates within a year
start = np.datetime64('2024-01-01')
dates = start + rng.integers(0, 365, size=n_events).astype('timedelta64[D]')

df = pd.DataFrame({
    "event_date": dates,
    "line": lines,
    "asset_id": asset_ids,
    "failure_mode": failure_modes,
    "time_to_failure_days": time_to_failure_days.round(1),
    "repair_time_hours": repair_time_hours.round(2),
    "downtime_hours": downtime_hours
}).sort_values("event_date").reset_index(drop=True)

# Render the first 5 rows of the generated data
df.head(5)

| | event_date | line | asset_id | failure_mode | time_to_failure_days | repair_time_hours | downtime_hours |
|---|---|---|---|---|---|---|---|
| 0 | 2024-01-02 | C | C-M2 | Electrical | 129.9 | 3.48 | 4.94 |
| 1 | 2024-01-02 | A | A-M10 | Mechanical | 19.2 | 3.93 | 6.56 |
| 2 | 2024-01-02 | C | C-M2 | Mechanical | 22.4 | 2.22 | 3.77 |
| 3 | 2024-01-03 | B | B-M5 | Mechanical | 95.9 | 6.46 | 10.79 |
| 4 | 2024-01-09 | C | C-M1 | Mechanical | 64.2 | 4.83 | 5.71 |
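
The generator draws time to failure from a gamma distribution with `shape=2.0`, which is right-skewed (the theoretical skewness of a gamma distribution is 2/√shape, about 1.41 here). A quick way to check this empirically is sketched below. The seed and sample size are arbitrary choices for this sketch and are independent of the dataset above.

```python
import numpy as np
from scipy import stats

# Arbitrary seed and sample size, chosen only for this illustration
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=40, size=5000)

# Positive skewness and mean > median both indicate a long right tail
skewness = stats.skew(sample)
print(f"mean={sample.mean():.1f}, median={np.median(sample):.1f}, skew={skewness:.2f}")
```

The same check could be run on `df["time_to_failure_days"]`. With only 240 events, the estimate will be noisier.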


### Inspect structure

In [3]:
print(df.shape)
df.dtypes

(240, 7)


event_date              datetime64[s]
line                           object
asset_id                       object
failure_mode                   object
time_to_failure_days          float64
repair_time_hours             float64
downtime_hours                float64
dtype: object
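
The `line`, `asset_id`, and `failure_mode` columns hold text labels and are stored with the generic `object` dtype. For columns with few distinct values, one option is to convert them to pandas' `category` dtype. The sketch below uses a small made-up frame (`small`) rather than the full dataset.

```python
import pandas as pd

# Small hypothetical frame with the same label columns (values are made up)
small = pd.DataFrame({
    "line": ["A", "B", "C", "A"],
    "failure_mode": ["Mechanical", "Electrical", "Misalignment", "Mechanical"],
})

# Convert low-cardinality text columns to the categorical dtype
typed = small.astype({"line": "category", "failure_mode": "category"})

print(typed.dtypes)
print(typed["line"].cat.categories.tolist())
```

Categorical columns typically use less memory and make the set of allowed labels explicit. Whether that is worthwhile for a table of this size is a judgment call.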

### Sampling rows

Random samples help us sanity‑check values without printing the whole table.

In [4]:
df.sample(5, random_state=1)

| | event_date | line | asset_id | failure_mode | time_to_failure_days | repair_time_hours | downtime_hours |
|---|---|---|---|---|---|---|---|
| 228 | 2024-12-19 | A | A-M3 | Mechanical | 157.7 | 6.66 | 8.99 |
| 194 | 2024-10-23 | A | A-M3 | Misalignment | 8.3 | 6.32 | 8.37 |
| 88 | 2024-05-08 | C | C-M8 | Mechanical | 56.0 | 2.25 | 3.85 |
| 95 | 2024-05-24 | B | B-M8 | Electrical | 41.5 | 6.04 | 7.18 |
| 214 | 2024-11-29 | C | C-M8 | Mechanical | 82.3 | 6.89 | 10.48 |
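
A plain random sample can over- or under-represent a line by chance. If you want to see every line, one possible variation is to sample within each group using `groupby(...).sample(...)`. The frame below is a made-up stand-in for the dataset, and `n=2` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for the events table (values are made up)
events = pd.DataFrame({
    "line": ["A"] * 4 + ["B"] * 3 + ["C"] * 3,
    "downtime_hours": [8.9, 8.4, 6.6, 7.1, 10.8, 7.2, 5.0, 4.9, 3.8, 5.7],
})

# Draw the same number of rows from each production line
per_line = events.groupby("line").sample(n=2, random_state=1)
print(per_line)
```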


### Save data

In [5]:
import os

# Ensure data folder exists
os.makedirs('data', exist_ok=True)

# Save dataset
csv_path = 'data/bottling_maintenance_events.csv'
df.to_csv(csv_path, index=False)
print('Data saved to', csv_path)

Data saved to data/bottling_maintenance_events.csv


### Load dataset back into a new DataFrame

In [6]:
df_loaded = pd.read_csv(csv_path)
df_loaded.head()

| | event_date | line | asset_id | failure_mode | time_to_failure_days | repair_time_hours | downtime_hours |
|---|---|---|---|---|---|---|---|
| 0 | 2024-01-02 | C | C-M2 | Electrical | 129.9 | 3.48 | 4.94 |
| 1 | 2024-01-02 | A | A-M10 | Mechanical | 19.2 | 3.93 | 6.56 |
| 2 | 2024-01-02 | C | C-M2 | Mechanical | 22.4 | 2.22 | 3.77 |
| 3 | 2024-01-03 | B | B-M5 | Mechanical | 95.9 | 6.46 | 10.79 |
| 4 | 2024-01-09 | C | C-M1 | Mechanical | 64.2 | 4.83 | 5.71 |
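
CSV files store dates as plain text, so a date column comes back from `pd.read_csv` as strings unless you ask pandas to parse it. The sketch below shows the difference using the `parse_dates` parameter. It writes a small made-up frame to an in-memory buffer instead of a file, purely to keep the example self-contained.

```python
import io
import pandas as pd

# Small hypothetical frame (values are made up)
original = pd.DataFrame({
    "event_date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "line": ["C", "B"],
    "downtime_hours": [4.94, 10.79],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Read back without date parsing: event_date is loaded as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read back with date parsing: event_date is a datetime column again
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["event_date"])

print(plain["event_date"].dtype)
print(parsed["event_date"].dtype)
```

Applied to the notebook, this would mean passing `parse_dates=["event_date"]` when reading `csv_path`, assuming date operations are needed later on.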
