# Data Pre-processing

## 1. Introduction

The objective of this project is to develop a **machine learning model** capable of predicting the **quality of welds on steel materials**. Reliable weld quality assessment is a key challenge in modern manufacturing, with direct implications for **safety**, **cost optimization**, and **production efficiency**. Given that the welding industry represents a **multi-billion-euro sector**, improving prediction accuracy can yield substantial **economic and operational benefits**.

The dataset used in this work originates from the *Department of Materials Science and Metallurgy* at the **University of Cambridge, U.K.** It contains experimental data related to various **welding parameters** and **material characteristics**.  

The following sections describe the **methodology** applied to explore, clean, and pre-process these data before proceeding to model development.

## 2. Setup

In [51]:
# All the imports

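import sys
assert sys.version_info >= (3, 5)
import sklearn
# Compare version numbers numerically: as plain strings, "0.9" would rank above "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)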
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import re


## 3. Initial Data Cleaning

This stage focuses on straightforward corrections and transformations. Each step follows the same workflow:

- Detect issues through targeted analysis.
- Then apply corrective actions accordingly (e.g., type conversions, handling of abnormal or inconsistent values).
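The detect-then-correct workflow can be sketched on a toy column as follows. The column name `nitrogen`, the sample values, and both helper functions are hypothetical. Keeping only the leading number of a malformed entry is an assumption made for illustration, not necessarily the rule applied to the real dataset.

```python
import pandas as pd


def find_non_numeric(series):
    """Detect: return the non-missing entries that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    return series[parsed.isna() & series.notna()]


def leading_number(series):
    """Correct: keep the first numeric token of each entry (assumed rule); others become NaN."""
    extracted = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")


# Hypothetical raw values mixing clean numbers, annotations, and a missing entry
toy = pd.DataFrame({"nitrogen": ["72", "<50", "67tot33res", None]})

problems = find_non_numeric(toy["nitrogen"])             # step 1: detect
toy["nitrogen_clean"] = leading_number(toy["nitrogen"])  # step 2: correct

print(problems.tolist())
print(toy)
```

Separating detection from correction keeps the list of problematic entries available for review before any value is overwritten.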

### a. Brief overview of the Data

In [52]:
# Overview: describe(), info(), head(), tail(), shape, dtypes, unique values, missing values, etc.


#Sarah
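As a possible starting point for this overview, the sketch below gathers dtype, missing-value counts, and distinct-value counts into one table. The helper `summarize_columns` and the toy columns `carbon` and `weld_type` are hypothetical and stand in for the real dataset.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """One row per column: dtype, missing count, missing share (%), distinct non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })


# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "carbon": [0.08, 0.10, np.nan, 0.12],
    "weld_type": ["MMA", "SA", "MMA", None],
})

summary = summarize_columns(toy)
print(toy.shape)
print(summary)
```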

### b. Column specific treatment

In [53]:
# print the problematic columns to show how they must be handled

#Sarah

In [54]:
# handles the problematic columns: nitrogen, hardness etc.

#Sarah

## 4. Advanced feature preparation
Once the data is cleaned, more complex pre-processing tasks are performed to optimize the dataset for machine learning models. These include:
- Feature engineering (to be documented following a literature review)
- Data splitting
- Outlier management
- One-hot encoding
- Imputation and scaling
- Handling multicollinearity and applying PCA when relevant

### a. Feature engineering


**Target: Yield Strength** `yield_strength`

Yield strength is the stress at which a material begins to deform plastically (the end of the elastic region).
It is chosen as the target because:
- It has the fewest missing values among the mechanical properties
- It is critical for structural design and safety assessment
- It is a direct indicator of load-bearing capacity


**Carbon Equivalent (CE)** `ce_iww`  

Formula: `CE = C + Mn/6 + (Cr+Mo+V)/5 + (Ni+Cu)/15`  
Explanation:
- Synthesizes the combined hardenability effect of all alloying elements
- Higher CE -> Harder microstructure -> Higher yield strength (but increased brittleness risk)
- Industry-standard metric (IIW, AWS D1.1)

**!!!** Uses elements with high NaN rates (Cr, Mo, V, Ni, Cu).
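A minimal sketch of the CE formula above, assuming the composition columns are named after their element symbols (`C`, `Mn`, `Cr`, ...), which may differ from the real dataset. Missing values are deliberately left to propagate, so the second toy row shows how a single missing element makes the whole feature NaN.

```python
import numpy as np
import pandas as pd


def carbon_equivalent_iiw(df):
    """CE = C + Mn/6 + (Cr+Mo+V)/5 + (Ni+Cu)/15; NaN in any input propagates to CE."""
    return (
        df["C"]
        + df["Mn"] / 6
        + (df["Cr"] + df["Mo"] + df["V"]) / 5
        + (df["Ni"] + df["Cu"]) / 15
    )


# Hypothetical compositions in wt%; the second row has no Ni measurement
toy = pd.DataFrame({
    "C": [0.10, 0.10], "Mn": [1.2, 1.2], "Cr": [0.5, 0.5], "Mo": [0.25, 0.25],
    "V": [0.0, 0.0], "Ni": [0.3, np.nan], "Cu": [0.15, 0.15],
})

ce = carbon_equivalent_iiw(toy)
print(ce)  # first row: 0.10 + 0.20 + 0.15 + 0.03 = 0.48, second row: NaN
```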

**Carbon Squared (C²)** `carbon_squared`  

Explanation:
- Carbon has a nonlinear effect on yield strength
- Small ΔC -> Large Δ(yield strength) at high C levels
- Captures threshold effects and embrittlement beyond ~0.25% C

Physical basis: Interstitial carbon atoms distort the lattice (strengthening), but excess C forms carbides (brittleness)
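The nonlinearity that the squared term captures can be checked numerically: the same 0.01 step in C changes C² far more at high carbon than at low carbon. The values below are illustrative only.

```python
import pandas as pd

# Two pairs of carbon contents (wt%), each separated by the same 0.01 step
carbon = pd.Series([0.05, 0.06, 0.25, 0.26], name="C")
carbon_squared = carbon ** 2

low_step = carbon_squared.iloc[1] - carbon_squared.iloc[0]   # 0.0036 - 0.0025
high_step = carbon_squared.iloc[3] - carbon_squared.iloc[2]  # 0.0676 - 0.0625

print(low_step, high_step)
```

The step at high carbon is several times larger, which is the behaviour a linear model cannot express with the raw `C` column alone.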

**C/Mn Ratio** `mn_c_ratio`  

Explanation:
- High C/Mn -> High strength, low ductility (carbon-dominated)
- Low C/Mn -> Moderate strength, high ductility (manganese-dominated)
- Balances solid-solution strengthening (Mn) vs. interstitial strengthening (C)
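A minimal sketch of this ratio, following the text and computing C/Mn. Note that the column name `mn_c_ratio` suggests the project code may compute the inverse, so the output column here is named `c_mn_ratio` to avoid confusion. The helper, the column names, and the choice to map a zero Mn content to NaN are all assumptions.

```python
import numpy as np
import pandas as pd


def c_mn_ratio(carbon, manganese):
    """C/Mn; rows with Mn == 0 yield NaN instead of infinity (assumed convention)."""
    return carbon / manganese.replace(0, np.nan)


# Hypothetical compositions in wt%
toy = pd.DataFrame({"C": [0.10, 0.05, 0.08], "Mn": [0.5, 1.0, 0.0]})
toy["c_mn_ratio"] = c_mn_ratio(toy["C"], toy["Mn"])
print(toy)
```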

**Arc Energy (Voltage × Current)** `arc_energy`   

Formula: `Arc_energy = Voltage × Current [Watts]`

Explanation:  
- Proxy for heat input intensity
- High arc energy -> Coarser grain structure -> Lower yield strength
- Optimal energy window exists for maximum strength
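The arc energy feature is a plain element-wise product. The sketch below assumes columns named `voltage` (in volts) and `current` (in amperes), which may not match the real column names.

```python
import pandas as pd

# Hypothetical welding parameters: voltage in V, current in A
toy = pd.DataFrame({"voltage": [25.0, 30.0], "current": [200.0, 450.0]})

# Arc_energy = Voltage x Current, in watts
toy["arc_energy"] = toy["voltage"] * toy["current"]
print(toy)
```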


**HAZ Hardness Estimate** `haz_hardness`  

Formula: `HAZ_hardness = 90 + 1050*C + 45*Si + 30*Mn + 25*Ni + 20*Cr + 60*Cr + 60*Mo + 5*HI`, where `HI` is the heat input

Explanation:  
- Heat-Affected Zone hardness correlates strongly with yield strength
- Captures microstructural hardening from composition
- Depends on Ni, Cr, and Mo, so a simplified variant without these NaN-prone elements may be needed

Physical basis: Düren formula for predicting HAZ properties

**!!!** Uses elements like Nickel, Chromium, and Molybdenum with many NaN values.

**Mn/S Ratio** `mn_s_ratio`    

Formula: `Mn_S_ratio = Mn / (S + 0.0001)`

Explanation: 
- Sulfur forms brittle MnS inclusions -> Crack initiation sites
- High Mn/S -> Fewer harmful inclusions -> Higher ductility AND yield strength

Metallurgical principle: Manganese "neutralizes" sulfur by forming less harmful inclusion morphology
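A short sketch of the Mn/S formula above, showing why the small constant in the denominator matters: a sulfur content of exactly zero still gives a finite ratio. The helper and the column names `Mn` and `S` are assumptions.

```python
import pandas as pd

EPSILON = 0.0001  # offset from the formula above; avoids division by zero when S == 0


def mn_s_ratio(manganese, sulfur, epsilon=EPSILON):
    """Mn / (S + epsilon)."""
    return manganese / (sulfur + epsilon)


# Hypothetical compositions in wt%; the second row has no measurable sulfur
toy = pd.DataFrame({"Mn": [1.2, 1.2], "S": [0.0099, 0.0]})
toy["mn_s_ratio"] = mn_s_ratio(toy["Mn"], toy["S"])
print(toy)
```

The ratio grows very large as S approaches zero, so this feature may need attention during outlier management and scaling.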

**Austenite Stabilizer Index (Robust)** `austenite_stabilizer`  

Formula: `Austenite_stabilizer = Mn/2 + 10*C + Ni` 

Explanation: 
- Austenite -> Ductile, strong microstructure (vs. brittle ferrite)
- Higher index -> More retained austenite -> Better toughness-strength balance  

**!!!** Uses Nickel with high NaN rates.
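The sketch below computes the index in two ways: strictly, where a missing Ni value makes the index NaN, and with missing Ni replaced by a caller-supplied value. Treating an unreported Ni content as 0 is an assumption for illustration, not a documented property of the dataset. The helper and the column names are also hypothetical.

```python
import numpy as np
import pandas as pd


def austenite_stabilizer(df, fill_missing_ni=None):
    """Mn/2 + 10*C + Ni; optionally replace missing Ni with `fill_missing_ni` first."""
    ni = df["Ni"] if fill_missing_ni is None else df["Ni"].fillna(fill_missing_ni)
    return df["Mn"] / 2 + 10 * df["C"] + ni


# Hypothetical compositions in wt%; the second row has no Ni measurement
toy = pd.DataFrame({"Mn": [1.2, 1.2], "C": [0.10, 0.10], "Ni": [0.3, np.nan]})

strict = austenite_stabilizer(toy)                       # NaN propagates
filled = austenite_stabilizer(toy, fill_missing_ni=0.0)  # assumes missing Ni means 0
print(strict.tolist(), filled.tolist())
```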



In [None]:
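# Keep an untouched copy; note the parentheses, since `df.copy` alone only references the method
df_before_feature_engineering = df.copy()

df = add_features(df)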

### b. Data splitting

To properly train and evaluate our model for predicting Yield Strength, we split the dataset into:

- Training set (80%): used to fit the model parameters
- Test set (20%): reserved for final evaluation

In [None]:
X_train, X_test, y_train, y_test = split_data(df, target='yield_strength')
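The body of `split_data` is not shown here. One way such a helper could be written is sketched below, using scikit-learn's `train_test_split`. The name `split_data_sketch`, the fixed `random_state`, and the decision to drop rows with a missing target are assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, test_size=0.2, random_state=42):
    """Drop rows without a target value, then return X_train, X_test, y_train, y_test."""
    labeled = df.dropna(subset=[target])
    X = labeled.drop(columns=[target])
    y = labeled[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Hypothetical toy data with 50 fully labeled rows
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "C": rng.uniform(0.03, 0.15, size=50),
    "Mn": rng.uniform(0.5, 2.0, size=50),
    "yield_strength": rng.uniform(300, 700, size=50),
})

X_train, X_test, y_train, y_test = split_data_sketch(toy, target="yield_strength")
print(X_train.shape, X_test.shape)
```

Splitting before outlier handling, imputation, and scaling means those steps can be fitted on the training set only, which keeps test-set information out of the model.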

### c. Outlier management

In [57]:
#Karina

### d. One-hot encoding

In [58]:
#Karina

### e. Imputation and scaling


In [59]:
#Eliott
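Until this cell is filled in, the sketch below shows one possible arrangement: median imputation followed by standardization, fitted on the training set and then applied unchanged to the test set. The median strategy and the toy columns are assumptions. The strategy actually chosen for the NaN-heavy elements flagged above may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Median imputation, then zero-mean / unit-variance scaling
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Hypothetical numeric features with missing values
X_train = pd.DataFrame({"Ni": [0.1, np.nan, 0.3, 0.5], "Mn": [1.0, 1.2, 1.4, np.nan]})
X_test = pd.DataFrame({"Ni": [np.nan], "Mn": [1.2]})

X_train_prepared = numeric_pipeline.fit_transform(X_train)  # statistics learned here
X_test_prepared = numeric_pipeline.transform(X_test)        # reused, never refitted
print(X_test_prepared)
```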

### f. Handling multicollinearity and PCA

In [60]:
#Eliott