# SECOM Dataset - EDA

Quick look at the dataset before building the model. Main things to figure out:
- How imbalanced are the classes?
- How bad is the missing data?
- Which features are mostly empty and should be dropped?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../data/uci-secom.csv')
print(f'Shape: {df.shape}')
df.head()

In [None]:
df.dtypes.value_counts()

## Class distribution

Labels are -1 (pass) and 1 (fail). Expecting heavy imbalance.

In [None]:
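# reindex so the order is always pass (-1), then fail (1); the bar colors and tick labels below rely on it
label_counts = df['Pass/Fail'].value_counts().reindex([-1, 1])
print(label_counts)
# .loc makes it explicit that -1 and 1 are labels, not positions
print(f'\nImbalance ratio: {label_counts.loc[-1] / label_counts.loc[1]:.1f}:1 (pass:fail)')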

label_counts.plot(kind='bar', color=['steelblue', 'salmon'])
plt.title('Pass vs Fail')
plt.xticks([0, 1], ['Pass (-1)', 'Fail (1)'], rotation=0)
plt.ylabel('Count')
plt.show()
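
The imbalance ratio translates directly into class weights for a weighted loss. A minimal sketch, assuming the common "balanced" heuristic `n_samples / (n_classes * count_c)`; the `labels` series below is synthetic and only stands in for `df['Pass/Fail']`, its counts are made up:

```python
import pandas as pd

# Synthetic stand-in for df['Pass/Fail']; these counts are illustrative, not SECOM's.
labels = pd.Series([-1] * 140 + [1] * 10)

counts = labels.value_counts()
n_classes = counts.size

# "Balanced" heuristic: each class gets weight n_samples / (n_classes * class_count),
# so the rare class is up-weighted and both classes contribute equally in total.
class_weights = (len(labels) / (n_classes * counts)).to_dict()
print(class_weights)
```

With these weights each class contributes the same total weight, which is one way to keep the majority class from dominating the loss. Whether this beats oversampling on SECOM is something to check during modeling, not something this notebook establishes.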

## Missing values

SECOM is known for having tons of NaNs. Need to figure out which columns are mostly empty so we can drop them.

In [None]:
# drop non-feature columns for this analysis
features = df.drop(columns=['Time', 'Pass/Fail'])

# mean of the boolean mask = fraction missing per column
missing_pct = features.isnull().mean() * 100
print(f'Total features: {len(missing_pct)}')
print(f'Features with any missing: {(missing_pct > 0).sum()}')
print(f'Features with >50% missing: {(missing_pct > 50).sum()}')
print(f'Features with zero missing: {(missing_pct == 0).sum()}')

In [None]:
plt.figure(figsize=(10, 4))
plt.hist(missing_pct, bins=50, edgecolor='black')
plt.xlabel('% Missing')
plt.ylabel('Number of Features')
plt.title('Missing Value Distribution Across Features')
plt.axvline(x=50, color='red', linestyle='--', label='50% threshold')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# worst offenders
missing_pct.sort_values(ascending=False).head(15)

## Takeaways

- ~14:1 class imbalance (pass:fail), so the model will need a weighted loss or oversampling
- Features with >50% missing values (count printed above) are candidates to drop
- Remaining missing values can be imputed with the per-feature median
- All features are numeric, which keeps preprocessing straightforward
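
The drop-then-impute steps from the takeaways can be sketched as follows. This is a minimal illustration, not the final pipeline: `toy` is a made-up frame standing in for `features`, and `MISSING_THRESHOLD` is a hypothetical name for the 50% cutoff used above.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for `features`; all values are made up.
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],        # 25% missing: keep and impute
    'b': [np.nan, np.nan, np.nan, 5.0],  # 75% missing: drop
    'c': [10.0, 20.0, 30.0, 40.0],       # complete
})

MISSING_THRESHOLD = 0.5  # hypothetical constant mirroring the 50% line in the histogram

# Keep only columns whose missing fraction does not exceed the threshold.
missing_frac = toy.isnull().mean()
keep_cols = missing_frac[missing_frac <= MISSING_THRESHOLD].index
reduced = toy[keep_cols]

# Fill the remaining gaps with each column's median (NaNs are skipped by default).
medians = reduced.median()
imputed = reduced.fillna(medians)
print(imputed)
```

On the real data the medians should probably be computed on the training split only and then reused for validation and test rows, so that no information leaks across the split.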