### 🧼 Data Quality

This notebook uses `maintenance_data.csv` and demonstrates how to:
- Detect **missing data**
- Impute missing values (numeric → median, categorical → mode)
- Find and remove duplicate records

These steps are essential data quality checks to complete before reliability and maintenance analytics.

In [1]:
import os

import numpy as np
import pandas as pd

# Load the raw maintenance data from the local CSV file
csv_path = 'raw_data/maintenance_data.csv'
df = pd.read_csv(csv_path)
df.head(10)

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,83.7,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,Maintenance,78.9,1069.3
3,4,Equip-4,Motor,09/07/2025,195 hours,ACTIVE,58.9,$6670
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,$9205
5,6,Equip-6,PUMP,2025-06-09,403 hours,ACTIVE,151.4,665.94
6,7,Equip-7,pump,12/17/2024,261 hrs,active,79.5,$9849
7,8,Equip-8,VALVE,2023-13-45,982 hours,active,95.3,493.44
8,9,Equip-9,pump,2023-13-45,773 min,Maint,45.3,$5247
9,10,Equip-10,Valve,2023-13-45,505 hours,ACTIVE,111.2,$1288


### Load Data from a URL

In [None]:
# Optional: uncomment to load the same dataset directly from GitHub
# import pandas as pd

# url = "https://raw.githubusercontent.com/Dr-AlaaKhamis/ISE518/main/6_Data_imperfection/raw_data/maintenance_data.csv"

# df = pd.read_csv(url, encoding="latin1")
# df.head()

## 🧪 Simulate Missing Values & Duplicates
To demonstrate the handling steps, we work on a copy of the data and:
- Introduce NaN into `temperature` (numeric), `status` (categorical), and `cost`
- Append a duplicate of the first row

In [2]:
demo = df.copy()

# Introduce missing values
demo.loc[1,'temperature'] = np.nan
demo.loc[2,'status'] = np.nan
demo.loc[4,'cost'] = np.nan

# Append a duplicate of the first row
demo = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)
demo.head(10)

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,,78.9,1069.3
3,4,Equip-4,Motor,09/07/2025,195 hours,ACTIVE,58.9,$6670
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,
5,6,Equip-6,PUMP,2025-06-09,403 hours,ACTIVE,151.4,665.94
6,7,Equip-7,pump,12/17/2024,261 hrs,active,79.5,$9849
7,8,Equip-8,VALVE,2023-13-45,982 hours,active,95.3,493.44
8,9,Equip-9,pump,2023-13-45,773 min,Maint,45.3,$5247
9,10,Equip-10,Valve,2023-13-45,505 hours,ACTIVE,111.2,$1288


## 🔍 Detect Missing Data
- Count missing values per column
- Show rows with any NaN values

In [3]:
demo.isna().sum()

maintenance_id           0
equipment_name           0
equipment_type           0
last_maintenance         0
maintenance_interval     0
status                   1
temperature              1
cost                    13
dtype: int64

Note: `cost` shows 13 missing values because 12 were already missing in the raw data; only one was introduced in the simulation step.

In [4]:
demo[demo.isna().any(axis=1)]

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,,78.9,1069.3
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,
12,13,Equip-13,pump,10/15/2024,641 hrs,Maintenance,147.9,
14,15,Equip-15,Motor,2023-13-45,749 HRS,DOWN,40.1,
24,25,Equip-25,Pump,04/07/2025,546 min,active,84.7,
30,31,Equip-31,Motor,06/20/2025,774 hrs,maint,73.1,
38,39,Equip-39,motor,2025-08-14,311 minutes,maint,81.6,
41,42,Equip-42,VALVE,07/11/2025,737 min,down,51.5,
46,47,Equip-47,motor,06/21/2025,720 min,DOWN,171.4,
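
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, not part of the original notebook, that reports both per column. The `sample` frame and the `missing_summary` helper are hypothetical names introduced for illustration, and the values are made up to resemble the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the maintenance table (illustrative values only)
sample = pd.DataFrame({
    "status": ["active", np.nan, "maint", "down"],
    "temperature": [53.9, np.nan, 78.9, 58.9],
    "cost": ["6766.31", "$8886", np.nan, np.nan],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (frame.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"n_missing": counts, "pct_missing": percent})
    # Keep only columns that actually have gaps, largest count first
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


summary = missing_summary(sample)
print(summary)
```

The same helper could be called on `demo` to rank its columns by how much data they are missing.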


#### 🧩 Impute Missing Values
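Strategy:
- Numeric columns → median
- Categorical (non-numeric) columns → mode

Note that `cost` is read as text because some entries carry a `$` prefix, so the code below treats it as categorical and fills it with the mode rather than the median.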

In [5]:
numeric_cols = demo.select_dtypes(include=['number']).columns
categorical_cols = [c for c in demo.columns if c not in numeric_cols]

# For numeric columns
for col in numeric_cols:
    if demo[col].isna().any():
        demo[col] = demo[col].fillna(demo[col].median())

# For categorical columns
for col in categorical_cols:
    if demo[col].isna().any():
        mode_val = demo[col].mode().iloc[0]
        demo[col] = demo[col].fillna(mode_val)

demo.head(10)

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,81.95,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,maint,78.9,1069.3
3,4,Equip-4,Motor,09/07/2025,195 hours,ACTIVE,58.9,$6670
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,6766.31
5,6,Equip-6,PUMP,2025-06-09,403 hours,ACTIVE,151.4,665.94
6,7,Equip-7,pump,12/17/2024,261 hrs,active,79.5,$9849
7,8,Equip-8,VALVE,2023-13-45,982 hours,active,95.3,493.44
8,9,Equip-9,pump,2023-13-45,773 min,Maint,45.3,$5247
9,10,Equip-10,Valve,2023-13-45,505 hours,ACTIVE,111.2,$1288
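
In the output above, the missing `cost` in row 4 was filled by the mode branch, because a column that mixes plain numbers with `$`-prefixed strings is not numeric. If median imputation is wanted for `cost`, one option is to convert the column first. The sketch below assumes that stripping the `$` prefix is enough to make the values parseable; the five-row `sample` frame is hypothetical and only mirrors the formats seen in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the mixed cost formats shown above
sample = pd.DataFrame({"cost": ["6766.31", "$8886", "1069.3", np.nan, "$6670"]})

# The column holds strings, so a numeric dtype selection would skip it
was_numeric = pd.api.types.is_numeric_dtype(sample["cost"])

# Strip the currency symbol and convert; unparseable entries become NaN
cost_numeric = pd.to_numeric(
    sample["cost"].str.replace("$", "", regex=False),
    errors="coerce",
)

# Median imputation now applies because the column is numeric
cost_median = cost_numeric.median()
sample["cost"] = cost_numeric.fillna(cost_median)

print(was_numeric)
print(sample)
```

Using `errors="coerce"` turns anything that still cannot be parsed into NaN, so such entries are imputed along with the original gaps; whether that is acceptable depends on the data.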


#### 🧭 Find and Remove Duplicates

In [6]:
duplicates = demo[demo.duplicated(keep=False)]
duplicates

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
100,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31


In [7]:
deduped = demo.drop_duplicates(keep='first')
print('Before:', demo.shape, 'After:', deduped.shape)
deduped.head()

Before: (101, 8) After: (100, 8)


,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,81.95,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,maint,78.9,1069.3
3,4,Equip-4,Motor,09/07/2025,195 hours,ACTIVE,58.9,$6670
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,6766.31
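
`duplicated()` and `drop_duplicates()` compare whole rows by default, so two records for the same equipment that differ only in formatting (for example `active` vs `ACTIVE`) are not flagged. The sketch below shows the `subset` parameter as one way to catch them. It assumes `maintenance_id` is meant to be a unique key, which the notebook does not state, and it uses a hypothetical four-row `sample` frame.

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 share an ID but differ in status casing
sample = pd.DataFrame({
    "maintenance_id": [1, 2, 1, 3],
    "equipment_name": ["Equip-1", "Equip-2", "Equip-1", "Equip-3"],
    "status": ["active", "maint", "ACTIVE", "down"],
})

# Full-row comparison finds nothing because the status strings differ
exact_dupes = sample[sample.duplicated(keep=False)]

# Comparing on the assumed key column flags both rows with maintenance_id == 1
key_dupes = sample[sample.duplicated(subset=["maintenance_id"], keep=False)]

# Keep the first record per ID
deduped_by_key = sample.drop_duplicates(subset=["maintenance_id"], keep="first")

print(len(exact_dupes), len(key_dupes), len(deduped_by_key))
```

Which record to keep (`keep='first'`, `keep='last'`, or a custom rule) is a judgment call that depends on which copy is more trustworthy.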


#### ✅ Save Cleaned Data

In [8]:
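# Create the output folder if it does not exist, then save without the index
os.makedirs('preprocessed_data', exist_ok=True)
deduped.to_csv('preprocessed_data/maintenance_data_quality_clean.csv', index=False)
deduped.head()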

,maintenance_id,equipment_name,equipment_type,last_maintenance,maintenance_interval,status,temperature,cost
0,1,Equip-1,pump,2023-13-45,199 days,active,53.9,6766.31
1,2,Equip-2,Pump,2025-02-20,845 HRS,maint,81.95,$8886
2,3,Equip-3,motor,2025-04-25,573 hours,maint,78.9,1069.3
3,4,Equip-4,Motor,09/07/2025,195 hours,ACTIVE,58.9,$6670
4,5,Equip-5,Motor,2025-09-13,499 HRS,Active,181.1,6766.31
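
A quick way to confirm the saved file is usable is to read it back and repeat the checks from this notebook. The following is a minimal sketch of such a round trip. It writes a hypothetical two-row `sample` frame to a temporary directory rather than touching the notebook's real output path, and the file name `sample_clean.csv` is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame standing in for `deduped`
sample = pd.DataFrame({
    "maintenance_id": [1, 2],
    "equipment_name": ["Equip-1", "Equip-2"],
    "temperature": [53.9, 81.95],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "preprocessed_data" / "sample_clean.csv"
    # Create the output folder before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out_path)

# Repeat the quality checks on the reloaded data
same_shape = reloaded.shape == sample.shape
no_missing = int(reloaded.isna().sum().sum()) == 0
no_dupes = not reloaded.duplicated().any()
print(same_shape, no_missing, no_dupes)
```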
