# Step 4 – Reproducing Bosch QA Failure Prediction

**Project**: Early Prediction of QA Failures from Test Log Data  
**Dataset**: Bosch Production Line (Kaggle)

## 🔗 Original Research Reference

- **Repository**: [liamculligan/bosch-production-line-performance](https://github.com/liamculligan/bosch-production-line-performance)
- **Competition**: [Kaggle - Bosch Production Line Performance](https://www.kaggle.com/competitions/bosch-production-line-performance)

I cloned this GitHub repository and reused its XGBoost-based baseline strategy to predict QA failures from Bosch's sensor data.

In [1]:
!git clone https://github.com/liamculligan/bosch-production-line-performance.git

Cloning into 'bosch-production-line-performance'...
remote: Enumerating objects: 264, done.
remote: Total 264 (delta 0), reused 0 (delta 0), pack-reused 264 (from 1)
Receiving objects: 100% (264/264), 629.35 KiB | 2.53 MiB/s, done.
Resolving deltas: 100% (134/134), done.


In [6]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"kavithaancha","key":"<redacted>"}'}

In [7]:
# Place the Kaggle API token in the client's config directory and restrict its permissions
!mkdir -p /root/.config/kaggle
!mv kaggle.json /root/.config/kaggle/
!chmod 600 /root/.config/kaggle/kaggle.json

In [8]:
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile

api = KaggleApi()
api.authenticate()


api.competition_download_files('bosch-production-line-performance')

with zipfile.ZipFile('bosch-production-line-performance.zip', 'r') as zip_ref:
    zip_ref.extractall('bosch_data')

print("Dataset downloaded and extracted.")

✅ Dataset downloaded and extracted.


In [9]:
import pandas as pd

sample_size = 100_000
df = pd.read_csv('bosch_data/train_numeric.csv.zip', nrows=sample_size)

print("Loaded data shape:", df.shape)
df.head()

✅ Loaded data shape: (100000, 970)


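|  | Id | L0_S0_F0 | L0_S0_F2 | L0_S0_F4 | L0_S0_F6 | L0_S0_F8 | L0_S0_F10 | L0_S0_F12 | L0_S0_F14 | L0_S0_F16 | ... | L3_S50_F4245 | L3_S50_F4247 | L3_S50_F4249 | L3_S50_F4251 | L3_S50_F4253 | L3_S51_F4256 | L3_S51_F4258 | L3_S51_F4260 | L3_S51_F4262 | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.03 | -0.034 | -0.197 | -0.179 | 0.118 | 0.116 | -0.015 | -0.032 | 0.02 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 7 | 0.088 | 0.086 | 0.003 | -0.052 | 0.161 | 0.025 | -0.015 | -0.072 | -0.225 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 9 | -0.036 | -0.064 | 0.294 | 0.33 | 0.074 | 0.161 | 0.022 | 0.128 | -0.026 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 4 | 11 | -0.055 | -0.086 | 0.294 | 0.33 | 0.118 | 0.025 | 0.03 | 0.168 | -0.169 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |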


In [10]:
# Drop 'Id' and separate features and label
X = df.drop(columns=['Id', 'Response'])
y = df['Response']

# Fill missing values
X = X.fillna(-999)

In [11]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import matthews_corrcoef

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
# `use_label_encoder` is no longer used by current XGBoost releases, so it is omitted
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mcc = matthews_corrcoef(y_test, y_pred)

print(f"Matthews Correlation Coefficient (MCC): {mcc:.4f}")

✅ Matthews Correlation Coefficient (MCC): 0.1766


## 📊 Results & Observations

- ✅ Trained an XGBoost baseline on the first 100,000 rows of `train_numeric.csv.zip`, with an 80/20 train/test split
- ✅ Achieved Matthews Correlation Coefficient (MCC): **0.1766**
- ⚠️ The model trains with default hyperparameters, but performance is likely constrained by:
  - Severe class imbalance
  - No use of the timestamp or categorical data
  - High sparsity and many missing values
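
One inexpensive way to probe the class-imbalance constraint is to tune the decision threshold instead of relying on the default 0.5 cut-off used by `model.predict`. The sketch below is not part of the run above: it uses synthetic labels and scores as a stand-in for `y_test` and `model.predict_proba(X_test)[:, 1]`, and the 1% positive rate and score distributions are assumptions chosen purely for illustration.

```python
import math
import random

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

random.seed(0)
# Synthetic stand-in data (assumption): ~1% failures, with failures scoring
# higher on average than passes but rarely above 0.5.
y_true = [1 if random.random() < 0.01 else 0 for _ in range(5000)]
scores = [min(1.0, max(0.0, random.gauss(0.35 if t else 0.05, 0.1))) for t in y_true]

# Evaluate MCC at each candidate threshold and keep the best one.
thresholds = [i / 100 for i in range(1, 100)]
results = {th: mcc(y_true, [int(s >= th) for s in scores]) for th in thresholds}
best_threshold = max(results, key=results.get)

print(f"best threshold: {best_threshold:.2f}, MCC: {results[best_threshold]:.4f}")
print(f"MCC at 0.50:    {results[0.5]:.4f}")
```

In the notebook itself, the threshold would need to be chosen on a separate validation split rather than on `X_test`, otherwise the reported MCC would be optimistically biased.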

## 🧠 Insights & Next Steps

### Key Takeaways:
- MCC is more informative than plain accuracy on highly imbalanced data, because a model that never predicts a failure can still score high accuracy.
- This reproduction confirms that baseline modeling is feasible on numeric-only Bosch data.
- XGBoost handles sparsity but may benefit from feature selection or dimensionality reduction.
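
The first takeaway can be made concrete with a small worked example. The confusion-matrix counts below are hypothetical (they are not taken from the run above) and are only meant to show how accuracy and MCC diverge when failures are rare.

```python
import math

def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical test split: 20,000 parts, 120 of them failures (illustrative only).
always_pass = dict(tp=0, tn=19_880, fp=0, fn=120)    # predicts "no failure" for every part
some_skill = dict(tp=30, tn=19_860, fp=20, fn=90)    # finds a quarter of the failures

for name, counts in [("always_pass", always_pass), ("some_skill", some_skill)]:
    acc = accuracy_from_counts(**counts)
    score = mcc_from_counts(**counts)
    print(f"{name:12s} accuracy={acc:.4f}  MCC={score:.4f}")
```

Both hypothetical models have accuracy above 99%, yet only the second one has a non-zero MCC, which is why MCC is the more useful yardstick here.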

### Next Steps for Capstone:
- Integrate timestamp features to add temporal insights.
- Explore LightGBM, which may train faster and so make hyperparameter tuning more practical.
- Evaluate feature importance and reduce dimensionality using PCA or L1 regularization.
- Consider using more rows or the full dataset if memory allows.
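
As a starting point for the timestamp step, the sketch below summarises each part's timestamps into a start time, an end time, and a duration. It is a minimal illustration under assumptions: the toy CSV, its column names, and its values are made up to resemble a date file with an `Id` column, and `add_time_features` is a hypothetical helper, not something from the cloned repository.

```python
import io

import pandas as pd

# Toy stand-in for a timestamp file (assumption): one row per part,
# one column per station timestamp, with gaps where a station was skipped.
toy_csv = io.StringIO(
    "Id,L0_S0_D1,L0_S1_D26,L3_S29_D3316\n"
    "4,82.24,82.24,87.29\n"
    "6,,,1315.75\n"
    "7,1618.70,1618.70,\n"
)

def add_time_features(dates: pd.DataFrame) -> pd.DataFrame:
    """Summarise each part's timestamps as start, end, and duration."""
    stamps = dates.drop(columns=["Id"])
    out = pd.DataFrame({"Id": dates["Id"]})
    out["t_start"] = stamps.min(axis=1)   # missing values are skipped by default
    out["t_end"] = stamps.max(axis=1)
    out["t_duration"] = out["t_end"] - out["t_start"]
    return out

time_features = add_time_features(pd.read_csv(toy_csv))
print(time_features)
```

These columns could then be joined to the numeric features on `Id`, which means the merge has to happen before `Id` is dropped from `X`.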