# Data Pre-processing

## 1. Introduction

The objective of this project is to develop a **machine learning model** capable of predicting the **quality of welds on steel materials**. Reliable weld quality assessment is a key challenge in modern manufacturing, with direct implications for **safety**, **cost optimization**, and **production efficiency**. Given that the welding industry represents a **multi-billion-euro sector**, improving prediction accuracy can yield substantial **economic and operational benefits**.

The dataset used in this work originates from the *Department of Materials Science and Metallurgy* at the **University of Cambridge, U.K.** It contains experimental data related to various **welding parameters** and **material characteristics**.  

The following sections describe the **methodology** applied to explore, clean, and pre-process these data before proceeding to model development.

## 2. Setup

In [51]:
# All the imports

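import sys
assert sys.version_info >= (3, 5)
import sklearn
# Compare version components as integers: a plain string comparison would rank "0.9" above "0.20".
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)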
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import re


## 3. Initial Data Cleaning

This stage focuses on straightforward corrections and transformations. Each step follows a clear workflow:

- a. Detect issues through targeted analysis.
- b. Apply corrective actions accordingly (e.g., type conversions, handling of abnormal or inconsistent values).

### a. Brief overview of the data

In [52]:
# describe(), info(), head(), tail(), shape, dtypes, unique values, missing values, etc.


#Sarah
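
Until this cell is filled in, the overview step could look roughly like the sketch below. It runs on a small toy frame rather than the real weld data, and every column name and value in it (`carbon_pct`, `nitrogen_ppm`, `weld_type`) is a hypothetical placeholder, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weld dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "carbon_pct": [0.037, 0.044, np.nan, 0.045],
    "nitrogen_ppm": ["72", "<5", "57", None],
    "weld_type": ["MMA", "MMA", "SA", "SA"],
})

# Size, column types, and a first look at the rows.
print(df.shape)
print(df.dtypes)
print(df.head())

# Summary statistics for numeric and non-numeric columns alike.
print(df.describe(include="all"))

# One row per column: type, number of missing values, number of distinct values.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})
print(overview)
```

A per-column summary of this kind makes it easier to spot columns that should be numeric but were read as text, which is what the column-specific treatment then has to address.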

### b. Column-specific treatment

In [53]:
# print the problematic columns to show how they must be handled

#Sarah

In [54]:
# Handle the problematic columns: nitrogen, hardness, etc.

#Sarah
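
As an illustration of what a type conversion for a problematic column might involve, here is a minimal sketch. It assumes that some numeric columns contain text entries such as `"<5"` or `"n/a"`; these sample values, the column name `nitrogen_ppm`, and the helper `to_number` are hypothetical. Replacing `"<5"` with `5.0` is one possible convention among several and should be checked against the actual data.

```python
import re

import numpy as np
import pandas as pd

# Matches the first integer or decimal number in a string, e.g. "5" in "<5".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def to_number(value):
    """Return the first numeric token in `value` as a float, or NaN if there is none."""
    if pd.isna(value):
        return np.nan
    match = _NUMBER.search(str(value))
    return float(match.group()) if match else np.nan


# Hypothetical raw values for a column that should be numeric.
raw = pd.Series(["72", "<5", "n/a", None, "0.012"], name="nitrogen_ppm")
clean = raw.map(to_number)
print(clean)
```

If the distinction between measured values and detection limits matters for the model, a separate boolean column flagging the `"<"` entries could be kept alongside the converted one.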

## 4. Advanced Feature Preparation

Once the data are cleaned, more complex pre-processing tasks are performed to optimize the dataset for machine learning models. These include:
- Feature engineering (to be documented following a literature review)
- Data splitting
- Outlier management
- One-hot encoding
- Imputation and scaling
- Handling multicollinearity and applying PCA when relevant

### a. Feature engineering

In [55]:
#Albane

### b. Data splitting

In [56]:
#Albane
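
One possible shape for the split is sketched below with scikit-learn's `train_test_split`. The toy features (`current_a`, `voltage_v`), the target name `quality`, the 80/20 ratio, and the seed are all assumptions made for illustration. Splitting before outlier handling, imputation, and scaling keeps statistics computed on the training set from being influenced by the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; feature and target names are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "current_a": rng.uniform(100, 300, size=50),
    "voltage_v": rng.uniform(20, 35, size=50),
    "quality": rng.uniform(0, 1, size=50),
})

X = df.drop(columns="quality")
y = df["quality"]

# Hold out 20% of the rows; a fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```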

### c. Outlier management

In [57]:
#Karina
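
The method for this step has not been chosen yet. As one common option, the sketch below applies the interquartile-range (IQR) rule: bounds are computed on training values, then used to flag or clip extreme points. The helper `iqr_bounds`, the column name `heat_input`, the sample values, and the factor `k=1.5` are illustrative assumptions.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (low, high) bounds located k * IQR beyond the first and third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical training values with one extreme point.
train = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 95.0], name="heat_input")

low, high = iqr_bounds(train)

# Flag values outside the bounds, then clip them back to the bounds.
is_outlier = ~train.between(low, high)
clipped = train.clip(lower=low, upper=high)

print(low, high)
print(is_outlier.tolist())
print(clipped.tolist())
```

Whether to clip, remove, or keep flagged rows depends on whether the extreme values are measurement errors or genuine process conditions, which the data alone may not reveal.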

### d. One-hot encoding

In [58]:
#Karina
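
A minimal sketch of one-hot encoding with scikit-learn's `OneHotEncoder` follows. The encoder is fitted on the training rows only and then reused on the test rows. The column name `weld_type` and its category labels are hypothetical. With `handle_unknown="ignore"`, a category that appears only in the test set is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; "GTAA" appears only in the test rows.
train = pd.DataFrame({"weld_type": ["MMA", "SA", "MMA", "TSA"]})
test = pd.DataFrame({"weld_type": ["SA", "GTAA"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["weld_type"]])

# Build readable column names from the categories learned during fit.
columns = ["weld_type_" + str(cat) for cat in encoder.categories_[0]]

# transform() returns a sparse matrix by default; convert it to a dense frame.
train_encoded = pd.DataFrame(
    encoder.transform(train[["weld_type"]]).toarray(),
    columns=columns,
    index=train.index,
)
test_encoded = pd.DataFrame(
    encoder.transform(test[["weld_type"]]).toarray(),
    columns=columns,
    index=test.index,
)
print(test_encoded)
```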

### e. Imputation and scaling


In [59]:
#Eliott
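
One way to chain imputation and scaling is a scikit-learn `Pipeline`, sketched below on a small hypothetical array. Median imputation and standardization are assumptions here, not choices made in this notebook. The pipeline learns medians, means, and standard deviations from the training data only, then applies them unchanged to the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features with missing entries.
X_train = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 20.0],
])
X_test = np.array([[np.nan, 40.0]])

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

# Fit on the training set only, then reuse the fitted statistics on the test set.
X_train_prepared = numeric_pipeline.fit_transform(X_train)
X_test_prepared = numeric_pipeline.transform(X_test)

print(X_train_prepared)
print(X_test_prepared)
```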

### f. Handling multicollinearity and PCA

In [60]:
#Eliott
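
The sketch below shows two possible treatments on synthetic data: dropping one feature from each highly correlated pair, and projecting standardized features onto principal components. The feature names, the 0.95 correlation threshold, and the 95% explained-variance target are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features: "a_copy" is almost identical to "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": b,
})

# Option 1: inspect absolute correlations above the diagonal and drop
# the later column of every pair whose correlation exceeds the threshold.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)

# Option 2: PCA on standardized features, keeping enough components
# to explain at least 95% of the variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(StandardScaler().fit_transform(X))
print(components.shape, pca.explained_variance_ratio_)
```

Dropping correlated features keeps the remaining columns interpretable, whereas PCA replaces them with linear combinations; which trade-off suits this project is left open here.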