## Bosch Production Line Performance Dataset Overview

This document provides a detailed overview of the Bosch Production Line Performance dataset, a comprehensive and challenging collection of data designed for the development and evaluation of predictive maintenance and defect analysis models within a manufacturing environment.

### Dataset Characteristics

*   **Total Size:** Approximately 14.3 GB.
*   **Structure:** The dataset is systematically partitioned into distinct training and testing sets. Features are organized across separate files based on their data type: numeric, categorical, and date/timestamp.
*   **Training Set:** Comprises 1,183,747 samples, distributed across the following files:
    *   `train_numeric.csv`: Contains numerical feature data.
    *   `train_categorical.csv`: Contains categorical feature data.
    *   `train_date.csv`: Contains date and timestamp information.
*   **Test Set:** Contains 1,183,748 samples, structured identically to the training set, with corresponding files: `test_numeric.csv`, `test_categorical.csv`, and `test_date.csv`.
*   **Features:** The dataset includes over 4,200 anonymized features spanning numerical, categorical, and timestamp data.

### Key Challenges

The Bosch Production Line Performance dataset presents several characteristics that pose significant challenges for the development and implementation of machine learning models:

*   **High Dimensionality:** The extensive number of features necessitates the application of robust feature selection and dimensionality reduction techniques to manage complexity and improve model performance.
*   **Sparsity:** A substantial amount of missing data is prevalent across numerous features, requiring careful handling through imputation or other missing data strategies.
*   **Class Imbalance:** The target variable ('Response'), which indicates a defective part, represents a small minority class. This imbalance requires specialized techniques for model training and evaluation to avoid biased predictions and ensure effective identification of defects.

This dataset originated from a Kaggle competition and serves as a valuable resource for researchers and practitioners to explore and evaluate advanced machine learning methodologies in the context of industrial anomaly detection and quality control.
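
To make the class imbalance concrete, the short sketch below computes the defect rate from the class counts that the chunked scan of `train_numeric.csv` reports later in this notebook (6,879 defective and 1,176,868 non-defective parts). The variable names are illustrative only.

```python
# Class counts as reported by the chunked scan of train_numeric.csv in this notebook.
n_defective = 6_879
n_non_defective = 1_176_868

n_total = n_defective + n_non_defective
defect_rate = n_defective / n_total
imbalance_ratio = n_non_defective / n_defective

# Accuracy of a trivial model that always predicts "not defective".
majority_baseline_accuracy = 1 - defect_rate

print(f"Total parts: {n_total:,}")
print(f"Defect rate: {defect_rate:.4%}")
print(f"Non-defective parts per defective part: {imbalance_ratio:.0f}")
print(f"Always-negative baseline accuracy: {majority_baseline_accuracy:.4%}")
```

With a defect rate below 1%, plain accuracy says very little about how well a model finds defects, which is why the evaluation step relies on ROC-AUC and precision-recall curves.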

### Goal

To develop an accurate and interpretable predictive model that identifies product failures in a high-dimensional manufacturing process by leveraging presence-based signal extraction and various machine learning techniques, ultimately providing actionable insights for proactive quality control.

### Methodology

*   **Data Acquisition and Sampling:** Obtain the Bosch Production Line Performance dataset and implement a stratified sampling strategy to create a manageable subset while preserving the class distribution of the target variable ('Response').
*   **Exploratory Data Analysis (EDA):** Conduct a thorough EDA on the sampled dataset to understand feature characteristics, missingness patterns, distributions, and relationships with the target variable. This includes analyzing variance, correlations, and unique value distributions, and visualizing key aspects of the data.
*   **Feature Engineering:** Create new features, potentially including presence-based signals from sparse data and features derived from timestamp information (e.g., process duration).
*   **Data Preprocessing:** Handle missing values (e.g., imputation, missing indicators) and encode categorical features.
*   **Feature Selection:** Employ techniques such as variance thresholding and potentially model-based or univariate methods to select a relevant subset of features for modeling.
*   **Addressing Class Imbalance:** Apply techniques like undersampling or oversampling to mitigate the impact of the imbalanced 'Response' variable on model training.
*   **Model Training:** Train various classification models (e.g., Logistic Regression, Decision Trees, Gradient Boosting like XGBoost) on the preprocessed and selected features.
*   **Model Evaluation:** Evaluate model performance using appropriate metrics for imbalanced datasets, such as ROC-AUC and precision-recall curves, in addition to standard classification metrics.
*   **Model Interpretation:** Utilize techniques like permutation importance or SHAP (SHapley Additive exPlanations) to gain interpretable insights into feature contributions to model predictions and failure risk.
*   **Reporting:** Summarize the findings, model performance, and key interpretable insights.
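
As a rough illustration of the feature engineering step, the sketch below derives presence-based signals and a process duration from a toy frame. The column names only imitate the dataset's naming pattern and the values are made up. Selecting columns by the `_F` and `_D` substrings is an assumption made for this example, not a documented convention.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns and values, for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "L0_S0_F0": [0.03, np.nan, -0.016],
    "L0_S1_F24": [np.nan, np.nan, 0.12],
    "L0_S0_D1": [82.24, np.nan, 1618.70],
    "L0_S1_D26": [np.nan, np.nan, 1618.73],
})

numeric_cols = [c for c in toy.columns if "_F" in c]
date_cols = [c for c in toy.columns if "_D" in c]

# Presence-based signals: 1 where a measurement exists, 0 where it is missing.
presence = toy[numeric_cols].notna().astype(int).add_suffix("_present")

# Timestamp-derived feature: span between the first and last recorded timestamp.
# Rows without any timestamp stay NaN.
duration = (toy[date_cols].max(axis=1) - toy[date_cols].min(axis=1)).rename("process_duration")

features = pd.concat([toy[["Id"]], presence, duration], axis=1)
print(features)
```

Presence flags keep the information carried by missingness itself, which imputation alone would discard.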

In [None]:
import pandas as pd
import numpy as np
import random
import re
import os





In [None]:
# suppress scientific notation and display 8 decimal places
pd.set_option('display.float_format', lambda x: '%.8f' % x)


In [None]:
# Kaggle to Colab dataset transfer steps

!pip install kaggle
# Create the .kaggle directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle
# Copy the kaggle.json file to the .kaggle directory
!cp kaggle.json ~/.kaggle/
# Set permissions for the kaggle.json file (read/write for the owner only)
!chmod 600 ~/.kaggle/kaggle.json
# Download the competition archive into the current working directory
!kaggle competitions download -c bosch-production-line-performance
# Unzip the main archive into the data folder
!unzip bosch-production-line-performance.zip -d data

In [None]:
# Unzip the individual CSV zip files located in the 'data' directory

# Path to the data directory
data_dir = 'data'

# Unzip the numeric data
print(f"Unzipping numeric data from {data_dir}/train_numeric.csv.zip...")
!unzip -o {data_dir}/train_numeric.csv.zip -d {data_dir}

# Unzip the categorical data
print(f"Unzipping categorical data from {data_dir}/train_categorical.csv.zip...")
!unzip -o {data_dir}/train_categorical.csv.zip -d {data_dir}

# Unzip the date data
print(f"Unzipping date data from {data_dir}/train_date.csv.zip...")
!unzip -o {data_dir}/train_date.csv.zip -d {data_dir}

print("\nIndividual CSV files should now be available in the 'data' directory.")
# You can use !ls data to verify the files
!ls data

Unzipping numeric data from data/train_numeric.csv.zip...
Archive:  data/train_numeric.csv.zip
  inflating: data/train_numeric.csv  
Unzipping categorical data from data/train_categorical.csv.zip...
Archive:  data/train_categorical.csv.zip
  inflating: data/train_categorical.csv  
Unzipping date data from data/train_date.csv.zip...
Archive:  data/train_date.csv.zip
  inflating: data/train_date.csv     

Individual CSV files should now be available in the 'data' directory.
sample_submission.csv.zip  train_categorical.csv      train_numeric.csv
test_categorical.csv.zip   train_categorical.csv.zip  train_numeric.csv.zip
test_date.csv.zip	   train_date.csv
test_numeric.csv.zip	   train_date.csv.zip


# Task
Develop a stratified sampling strategy for the large `data/train_numeric.csv` file that produces a downsized dataset of approximately 200 MB while preserving the original proportion of 'Response' values. Then load the rows that match the sampled 'Id's from `data/train_numeric.csv`, `data/train_categorical.csv`, and `data/train_date.csv`.

## Develop a stratified sampling strategy for large numeric data

### Subtask:
Create a method to read `train_numeric.csv` in chunks, identify rows with and without defects ('Response'), and collect a stratified sample of 'Id's without loading the entire file into memory.


**Reasoning**:
The first step is to define the file path and determine the total number of rows in the numeric dataset to prepare for chunked reading and sampling.



In [None]:
file_path = r"data/train_numeric.csv"

# Count the data rows by streaming the file line by line (nothing is held in memory).
# This assumes the first line is a header.
with open(file_path, 'r') as f:
    total_rows = sum(1 for _ in f) - 1
print(f"Total number of rows in {file_path}: {total_rows}")

# Initialize lists to store Ids
defective_ids = []
non_defective_ids = []

Total number of rows in data/train_numeric.csv: 1183747


**Reasoning**:
Iterate through the numeric data in chunks to collect 'Id's for defective and non-defective rows without loading the entire file into memory.



In [None]:
chunk_size = 100000  # Adjust chunk size based on memory constraints

for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing chunk {i+1}...")
    defective_ids.extend(chunk[chunk['Response'] == 1]['Id'].tolist())
    non_defective_ids.extend(chunk[chunk['Response'] == 0]['Id'].tolist())

print(f"Collected {len(defective_ids)} defective IDs and {len(non_defective_ids)} non-defective IDs.")

Processing chunk 1...
Processing chunk 2...
Processing chunk 3...
Processing chunk 4...
Processing chunk 5...
Processing chunk 6...
Processing chunk 7...
Processing chunk 8...
Processing chunk 9...
Processing chunk 10...
Processing chunk 11...
Processing chunk 12...
Collected 6879 defective IDs and 1176868 non-defective IDs.
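
The ID collection pass only needs the `Id` and `Response` columns, so parsing can be restricted to them. The sketch below shows the idea on a tiny in-memory stand-in for `train_numeric.csv`; the CSV content is hypothetical. `usecols` is a standard `pandas.read_csv` parameter.

```python
import io
import pandas as pd

# Small in-memory stand-in for train_numeric.csv (hypothetical values).
csv_text = (
    "Id,L0_S0_F0,L0_S0_F2,Response\n"
    "4,0.03,-0.034,0\n"
    "6,,,0\n"
    "7,0.088,0.086,1\n"
    "9,-0.036,-0.064,0\n"
)

defective, non_defective = [], []
# usecols limits parsing to the two columns the class split actually needs.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["Id", "Response"], chunksize=2):
    is_defective = chunk["Response"] == 1
    defective.extend(chunk.loc[is_defective, "Id"].tolist())
    non_defective.extend(chunk.loc[~is_defective, "Id"].tolist())

print(defective, non_defective)
```

On the real file this should reduce parsing time and memory per chunk, since the remaining feature columns are skipped.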


**Reasoning**:
Now that I have collected the IDs for defective and non-defective samples, I need to calculate the target number of rows for each group based on the desired downsized dataset size and the original class distribution, then sample those IDs.



In [None]:
# Estimate the average in-memory row size (rough estimate, can be refined).
# Assumes about 1,000 columns across the numeric, categorical, and date files.
# At 8 bytes per float64/int64 value that would be 8,000 bytes per row (about 0.008 MB).
# Use a more conservative 16 bytes per value to allow for object dtypes and overhead.
# Note: the merged sample ends up with 4,266 columns, so its real footprint exceeds this estimate.
bytes_per_row_estimate = 1000 * 16  # 16,000 bytes per row

target_size_mb = 200
target_size_bytes = target_size_mb * 1024 * 1024

# Calculate the target number of rows
target_total_rows = int(target_size_bytes / bytes_per_row_estimate)
print(f"Target total number of rows for ~{target_size_mb}MB: {target_total_rows}")


# Calculate the original proportion of defective samples
original_defect_proportion = len(defective_ids) / (len(defective_ids) + len(non_defective_ids))

# Calculate the target number of defective and non-defective samples
target_defective_rows = int(target_total_rows * original_defect_proportion)
# Ensure we don't sample more defective rows than available
target_defective_rows = min(target_defective_rows, len(defective_ids))

target_non_defective_rows = target_total_rows - target_defective_rows
# Ensure we don't sample more non-defective rows than available
target_non_defective_rows = min(target_non_defective_rows, len(non_defective_ids))

# Adjust target_total_rows based on actual possible samples
target_total_rows = target_defective_rows + target_non_defective_rows
print(f"Adjusted target total number of rows: {target_total_rows}")
print(f"Target defective rows: {target_defective_rows}")
print(f"Target non-defective rows: {target_non_defective_rows}")


# Randomly sample the IDs. Fixing the seed makes the same sample come out on every run
# (42 is an arbitrary choice).
random.seed(42)
sampled_defective_ids = random.sample(defective_ids, target_defective_rows)
sampled_non_defective_ids = random.sample(non_defective_ids, target_non_defective_rows)

# Combine the sampled IDs
sampled_ids = set(sampled_defective_ids + sampled_non_defective_ids)

print(f"Total sampled IDs: {len(sampled_ids)}")

Target total number of rows for ~200MB: 13107
Adjusted target total number of rows: 13107
Target defective rows: 76
Target non-defective rows: 13031
Total sampled IDs: 13107
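
The numbers printed above can be reproduced by hand. The sketch below repeats the arithmetic with the class counts collected earlier. It then adds a rough dense estimate for a frame with the 4,266 columns that the merged sample reports later, assuming 8 bytes per value. That second figure is only an illustrative lower bound under that assumption, not a measurement.

```python
target_size_bytes = 200 * 1024 * 1024   # 209,715,200 bytes
bytes_per_row_estimate = 1000 * 16      # the notebook's assumed 16,000 bytes per row

target_total_rows = int(target_size_bytes / bytes_per_row_estimate)

defect_proportion = 6879 / (6879 + 1176868)
target_defective = int(target_total_rows * defect_proportion)
target_non_defective = target_total_rows - target_defective

print(target_total_rows, target_defective, target_non_defective)  # 13107 76 13031

# Dense estimate for 13,107 rows x 4,266 columns at 8 bytes per value (assumption).
dense_estimate_mib = target_total_rows * 4266 * 8 / 1024**2
print(f"{dense_estimate_mib:.1f} MiB")
```

Because the row-size estimate assumes 1,000 columns, the in-memory size of the merged sample can be expected to exceed the 200 MB target. Calling `memory_usage(deep=True)` on the loaded frame gives the actual figure.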


## Read sampled data from numeric file

### Subtask:
Read only the rows from `train_numeric.csv` that correspond to the sampled 'Id's.


**Reasoning**:
Initialize an empty list to store sampled numeric data and iterate through the numeric file in chunks, filtering for the sampled IDs and appending the chunks to the list.



In [None]:
# Initialize an empty list to store the sampled numeric data
sampled_numeric_chunks = []

# Iterate through train_numeric.csv in chunks
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing numeric chunk {i+1}...")
    # Filter the chunk to keep only the rows where the 'Id' is in the sampled_ids set
    filtered_chunk = chunk[chunk['Id'].isin(sampled_ids)]
    # Append the filtered chunk to the list
    sampled_numeric_chunks.append(filtered_chunk)

# Concatenate the DataFrames in the list into a single DataFrame
num_sampled = pd.concat(sampled_numeric_chunks, ignore_index=True)

# Display the first few rows of the sampled numeric data
display(num_sampled.head())

# Print the shape of the sampled numeric DataFrame
print(f"\nShape of the sampled numeric data: {num_sampled.shape}")

Processing numeric chunk 1...
Processing numeric chunk 2...
Processing numeric chunk 3...
Processing numeric chunk 4...
Processing numeric chunk 5...
Processing numeric chunk 6...
Processing numeric chunk 7...
Processing numeric chunk 8...
Processing numeric chunk 9...
Processing numeric chunk 10...
Processing numeric chunk 11...
Processing numeric chunk 12...


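| | Id | L0_S0_F0 | L0_S0_F2 | L0_S0_F4 | L0_S0_F6 | L0_S0_F8 | L0_S0_F10 | L0_S0_F12 | L0_S0_F14 | L0_S0_F16 | ... | L3_S50_F4245 | L3_S50_F4247 | L3_S50_F4249 | L3_S50_F4251 | L3_S50_F4253 | L3_S51_F4256 | L3_S51_F4258 | L3_S51_F4260 | L3_S51_F4262 | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 242 | 0.075 | 0.078 | 0.33 | 0.312 | -0.056 | -0.021 | 0.008 | 0.048 | 0.071 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 491 | 0.03 | -0.026 | -0.215 | -0.197 | 0.031 | -0.157 | -0.015 | -0.072 | -0.097 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 575 | -0.016 | -0.026 | -0.015 | 0.003 | -0.013 | 0.116 | 0.008 | 0.008 | 0.03 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 629 | -0.049 | -0.034 | -0.361 | -0.379 | 0.118 | 0.116 | -0.015 | -0.072 | -0.072 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 4 | 859 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |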



Shape of the sampled numeric data: (13107, 970)


## Read sampled data from categorical file

### Subtask:
Read only the rows from `train_categorical.csv` that correspond to the sampled 'Id's.


**Reasoning**:
Read the categorical data file in chunks and filter based on the sampled IDs.



In [None]:
# Define the file path for the categorical data
cat_file_path = r"data/train_categorical.csv"

# Initialize an empty list to store the sampled categorical data chunks
sampled_categorical_chunks = []

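# Iterate through the categorical data file in chunks.
# low_memory=False lets pandas infer each column's dtype from the whole chunk,
# which avoids the mixed-type DtypeWarning on this sparse file.
for i, chunk in enumerate(pd.read_csv(cat_file_path, chunksize=chunk_size, low_memory=False)):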
    print(f"Processing categorical chunk {i+1}...")
    # Filter the chunk to keep only the rows where the 'Id' is in the sampled_ids set
    filtered_chunk = chunk[chunk['Id'].isin(sampled_ids)]
    # Append the filtered chunk to the list
    sampled_categorical_chunks.append(filtered_chunk)

# Concatenate the DataFrames in the list into a single DataFrame
cat_sampled = pd.concat(sampled_categorical_chunks, ignore_index=True)

# Display the first few rows of the sampled categorical data
display(cat_sampled.head())

# Print the shape of the sampled categorical DataFrame
print(f"\nShape of the sampled categorical data: {cat_sampled.shape}")

Collecting IDs from numeric chunk 1...
Collecting IDs from numeric chunk 2...
Collecting IDs from numeric chunk 3...
Collecting IDs from numeric chunk 4...
Collecting IDs from numeric chunk 5...
Collecting IDs from numeric chunk 6...
Collecting IDs from numeric chunk 7...
Collecting IDs from numeric chunk 8...
Collecting IDs from numeric chunk 9...
Collecting IDs from numeric chunk 10...
Collecting IDs from numeric chunk 11...
Collecting IDs from numeric chunk 12...
Generated 13107 sampled IDs.




## Read sampled data from date file

### Subtask:
Read only the rows from `train_date.csv` that correspond to the sampled 'Id's.


**Reasoning**:
The categorical data has been successfully read and filtered based on the sampled IDs. The next step is to read the date data using the same sampled IDs, following the original plan.



In [None]:
# Define the file path for the date data
date_file_path = r"data/train_date.csv"

# Initialize an empty list to store the sampled date data chunks
sampled_date_chunks = []

# Iterate through the date data file in chunks
for i, chunk in enumerate(pd.read_csv(date_file_path, chunksize=chunk_size)):
    print(f"Processing date chunk {i+1}...")
    # Filter the chunk to keep only the rows where the 'Id' is in the sampled_ids set
    filtered_chunk = chunk[chunk['Id'].isin(sampled_ids)]
    # Append the filtered chunk to the list
    sampled_date_chunks.append(filtered_chunk)

# Concatenate the DataFrames in the list into a single DataFrame
date_sampled = pd.concat(sampled_date_chunks, ignore_index=True)

# Display the first few rows of the sampled date data
display(date_sampled.head())

# Print the shape of the sampled date DataFrame
print(f"\nShape of the sampled date data: {date_sampled.shape}")

Processing date chunk 1...
Processing date chunk 2...
Processing date chunk 3...
Processing date chunk 4...
Processing date chunk 5...
Processing date chunk 6...
Processing date chunk 7...
Processing date chunk 8...
Processing date chunk 9...
Processing date chunk 10...
Processing date chunk 11...
Processing date chunk 12...


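| | Id | L0_S0_D1 | L0_S0_D3 | L0_S0_D5 | L0_S0_D7 | L0_S0_D9 | L0_S0_D11 | L0_S0_D13 | L0_S0_D15 | L0_S0_D17 | ... | L3_S50_D4246 | L3_S50_D4248 | L3_S50_D4250 | L3_S50_D4252 | L3_S50_D4254 | L3_S51_D4255 | L3_S51_D4257 | L3_S51_D4259 | L3_S51_D4261 | L3_S51_D4263 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 387 | 653.6 | 653.6 | 653.6 | 653.6 | 653.6 | 653.6 | 653.6 | 653.6 | 653.6 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 440 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | 1111.21 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 570 | 336.87 | 336.87 | 336.87 | 336.87 | 336.87 | 336.87 | 336.87 | 336.87 | 336.87 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 754 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |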



Shape of the sampled date data: (13107, 1157)
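
The numeric, categorical, and date files are all filtered with the same chunked loop, so the pattern can be wrapped in one function. The sketch below is one possible way to do that; `read_sampled_rows` is a hypothetical helper name, and the demo CSV content is made up.

```python
import io
import pandas as pd

def read_sampled_rows(source, ids, chunk_size=100_000, **read_csv_kwargs):
    """Return only the rows whose 'Id' is in ids, reading source chunk by chunk."""
    kept = [
        chunk[chunk["Id"].isin(ids)]
        for chunk in pd.read_csv(source, chunksize=chunk_size, **read_csv_kwargs)
    ]
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for one of the train_*.csv files (hypothetical values).
demo_csv = "Id,L0_S0_D1\n4,82.24\n6,\n7,1618.7\n9,1149.2\n"
demo = read_sampled_rows(io.StringIO(demo_csv), ids={6, 9}, chunk_size=2)
print(demo)
```

With such a helper, each of the three files could be loaded with a single call, for example `read_sampled_rows(date_file_path, sampled_ids)`. Extra keyword arguments are passed straight to `pd.read_csv`.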


**Reasoning**:
The sampled dataframes (`num_sampled`, `cat_sampled`, and `date_sampled`) have been successfully regenerated. Now, I can proceed with merging these dataframes as per the instructions.



## Merge sampled dataframes

### Subtask:
Merge the smaller, sampled numeric, categorical, and date DataFrames.


**Reasoning**:
Merge the sampled numeric, categorical, and date DataFrames based on the 'Id' column and display the head and shape of the resulting merged DataFrame.



In [None]:
# Merge the sampled numeric and categorical DataFrames
merged_sampled = pd.merge(num_sampled, cat_sampled, on='Id', how='inner')

# Merge the result with the sampled date DataFrame
merged_sampled = pd.merge(merged_sampled, date_sampled, on='Id', how='inner')

# Display the head of the final merged DataFrame
print("Head of the final merged sampled DataFrame:")
display(merged_sampled.head())

# Print the shape of the final merged DataFrame
print(f"\nShape of the final merged sampled DataFrame: {merged_sampled.shape}")

Head of the final merged sampled DataFrame:


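| | Id | L0_S0_F0 | L0_S0_F2 | L0_S0_F4 | L0_S0_F6 | L0_S0_F8 | L0_S0_F10 | L0_S0_F12 | L0_S0_F14 | L0_S0_F16 | ... | L3_S50_D4246 | L3_S50_D4248 | L3_S50_D4250 | L3_S50_D4252 | L3_S50_D4254 | L3_S51_D4255 | L3_S51_D4257 | L3_S51_D4259 | L3_S51_D4261 | L3_S51_D4263 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 184 | -0.167 | -0.22 | -0.052 | 0.003 | -0.056 | 0.07 | 0.037 | 0.208 | -0.199 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 192 | 0.069 | 0.131 | 0.33 | 0.275 | -0.056 | -0.203 | 0.0 | 0.008 | -0.005 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 282 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 692 | -0.075 | -0.064 | -0.015 | -0.016 | 0.074 | -0.021 | 0.015 | 0.088 | -0.184 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |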



Shape of the final merged sampled DataFrame: (13107, 4266)
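
A merge on 'Id' can silently drop or duplicate rows if the key is not unique in every frame. The sketch below shows one way to guard against that with the `validate` parameter of `DataFrame.merge`, followed by a column-count check. The three toy frames use hypothetical values.

```python
import pandas as pd

# Toy stand-ins for num_sampled, cat_sampled, and date_sampled (hypothetical values).
num = pd.DataFrame({"Id": [4, 6, 7], "L0_S0_F0": [0.03, None, 0.088], "Response": [0, 0, 1]})
cat = pd.DataFrame({"Id": [4, 6, 7], "L0_S1_F25": [None, "T1", None]})
date = pd.DataFrame({"Id": [4, 6, 7], "L0_S0_D1": [82.24, None, 1618.7]})

# validate="one_to_one" raises MergeError if 'Id' is duplicated on either side.
merged = (
    num.merge(cat, on="Id", how="inner", validate="one_to_one")
       .merge(date, on="Id", how="inner", validate="one_to_one")
)

# 'Id' is shared by all three frames, so it is counted once instead of three times.
expected_cols = num.shape[1] + cat.shape[1] + date.shape[1] - 2
assert merged.shape == (len(num), expected_cols)
print(merged.shape)
```

Applied to the shapes reported in this notebook, the same count gives 970 + 1,157 - 2 = 2,125 columns from the numeric and date frames. The merged width of 4,266 therefore implies 2,141 columns (including 'Id') in the categorical sample.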


In [None]:
# Define the file paths to save the DataFrames
output_dir = 'processed_data'
os.makedirs(output_dir, exist_ok=True) # Create the output directory if it doesn't exist

num_sampled_path = os.path.join(output_dir, 'num_sampled.csv')
cat_sampled_path = os.path.join(output_dir, 'cat_sampled.csv')
date_sampled_path = os.path.join(output_dir, 'date_sampled.csv')
merged_sampled_path = os.path.join(output_dir, 'merged_sampled.csv')

# Save the dataframes to CSV
print(f"Saving num_sampled to {num_sampled_path}...")
num_sampled.to_csv(num_sampled_path, index=False)

print(f"Saving cat_sampled to {cat_sampled_path}...")
cat_sampled.to_csv(cat_sampled_path, index=False)

print(f"Saving date_sampled to {date_sampled_path}...")
date_sampled.to_csv(date_sampled_path, index=False)

print(f"Saving merged_sampled to {merged_sampled_path}...")
merged_sampled.to_csv(merged_sampled_path, index=False)

print("\nDataFrames saved successfully!")

Saving num_sampled to processed_data/num_sampled.csv...
Saving cat_sampled to processed_data/cat_sampled.csv...
Saving date_sampled to processed_data/date_sampled.csv...
Saving merged_sampled to processed_data/merged_sampled.csv...

DataFrames saved successfully!
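
Since the sampled frames are wide and mostly empty, compressing the CSV output and confirming that it reloads cleanly may be worthwhile. The sketch below is a minimal round-trip check on a toy frame written to a temporary directory. The gzip compression and the file name are assumptions for illustration, not part of the original workflow.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for one of the sampled frames (hypothetical values).
frame = pd.DataFrame({"Id": [4, 6, 7], "L0_S0_F0": [0.03, None, 0.088], "Response": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "num_sampled.csv.gz")
    # Write a gzip-compressed CSV, then read it back (compression is inferred from ".gz").
    frame.to_csv(path, index=False, compression="gzip")
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, dtypes, or column order changed in the round trip.
pd.testing.assert_frame_equal(frame, reloaded)
print(reloaded.shape)
```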
