# Data Preprocessing

## Introduction
This notebook walks through the preprocessing steps applied to the wine quality data before modeling: encoding, train/test splitting, feature engineering, scaling, and export. It begins with the required imports.

In [None]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

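# Walk up from the current directory until a folder containing "src" is found.
project_root = Path.cwd()
while not (project_root / "src").exists():
    # Stop at the filesystem root instead of looping forever.
    if project_root == project_root.parent:
        raise FileNotFoundError("No 'src' directory found above the current working directory.")
    project_root = project_root.parent

# Make the helper modules in src (such as util) importable.
sys.path.append(str(project_root / "src"))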

RANDOM_STATE = 42

## Encoding

First, the red and white datasets are combined, and the wine type, the only non-numeric attribute, is encoded as an integer: 0 for red and 1 for white.

In [None]:
red_wine = pd.read_csv('../data/raw/winequality-red.csv', sep=';')
white_wine = pd.read_csv('../data/raw/winequality-white.csv', sep=';')

red_wine['wine type'] = 0
white_wine['wine type'] = 1
wine_data = pd.concat([red_wine, white_wine], axis=0, ignore_index=True)

X = wine_data.drop(columns='quality')
y = wine_data['quality']

## Data Splitting
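The data is then split into training and test sets, stratified by quality. Splitting before any further transformation means that statistics computed later, such as scaling parameters, come from the training set only and do not leak information from the test set.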

In [None]:
from util import split_train_test

X_train, X_test, y_train, y_test = split_train_test(X, y, random_state=RANDOM_STATE, stratify=y)
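
The implementation of `split_train_test` lives in the project's `util` module and is not shown here. As a rough illustration of what a stratified split does, the sketch below samples the same fraction of rows from every quality class for the test set. The function name `stratified_split_sketch`, the `test_size=0.2` default, and the toy data are assumptions made for this example, not the project's code.

```python
import pandas as pd

def stratified_split_sketch(X, y, test_size=0.2, random_state=42):
    """Hypothetical stand-in for util.split_train_test (not the project's code)."""
    # Sample the same fraction of rows from every class of y for the test set.
    test_idx = y.groupby(y).sample(frac=test_size, random_state=random_state).index
    # Everything that was not sampled goes to the training set.
    train_idx = y.index.difference(test_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]

# Toy data: 10 rows of class 5 and 10 rows of class 6.
X_demo = pd.DataFrame({"alcohol": [float(i) for i in range(20)]})
y_demo = pd.Series([5] * 10 + [6] * 10, name="quality")

X_tr, X_te, y_tr, y_te = stratified_split_sketch(X_demo, y_demo)
print(y_te.value_counts().to_dict())
```

Because each class contributes the same fraction, the class proportions in the test set match those of the full dataset, which matters when some quality scores are rare.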

## Feature Engineering
Based on the correlations identified during exploratory analysis, two feature pairs are considered: *free sulfur dioxide* and *total sulfur dioxide*, and *density* and *alcohol*. The first pair has a correlation of 0.72. One way to remove this redundancy is to replace both features with a single one: the ratio of free sulfur dioxide to total sulfur dioxide.

In [None]:
X_train['free sulfur dioxide ratio'] = X_train['free sulfur dioxide'] / X_train['total sulfur dioxide']
X_train = X_train.drop(columns=['total sulfur dioxide', 'free sulfur dioxide'])

X_test['free sulfur dioxide ratio'] = X_test['free sulfur dioxide'] / X_test['total sulfur dioxide']
X_test = X_test.drop(columns=['total sulfur dioxide', 'free sulfur dioxide'])
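
The same transformation is applied to both sets, so it could also be wrapped in a small function. The sketch below is one possible version; the helper name `add_free_so2_ratio` is hypothetical, and the choice to turn a zero total into `NaN` (rather than `inf`) is an assumption about how such rows should be treated, not something the original analysis specifies.

```python
import numpy as np
import pandas as pd

def add_free_so2_ratio(df):
    """Hypothetical helper: replace the two SO2 columns with their ratio."""
    out = df.copy()  # work on a copy so the caller's frame is not modified
    # Assumption: a total of 0 yields NaN instead of a division by zero.
    total = out["total sulfur dioxide"].replace(0, np.nan)
    out["free sulfur dioxide ratio"] = out["free sulfur dioxide"] / total
    return out.drop(columns=["total sulfur dioxide", "free sulfur dioxide"])

# Toy frame with one degenerate row (total sulfur dioxide of 0).
demo = pd.DataFrame({
    "free sulfur dioxide": [10.0, 15.0, 0.0],
    "total sulfur dioxide": [40.0, 60.0, 0.0],
    "alcohol": [9.4, 10.1, 11.0],
})
result = add_free_so2_ratio(demo)
print(result)
```

With such a helper, the two cells above would reduce to `X_train = add_free_so2_ratio(X_train)` and `X_test = add_free_so2_ratio(X_test)`.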

The second pair, density and alcohol, is not expected to cause significant problems, so both features are kept unchanged.

In [None]:
from util import plot_correlation_matrix

plot_correlation_matrix(X_train, 0.6)

A new correlated feature pair has emerged. However, this correlation may prove useful to the model, so no changes are made.
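
The plot shows which pairs exceed the threshold, but it can also help to list them explicitly. The sketch below is a minimal way to do that with pandas; the helper name `correlated_pairs` and the toy data are hypothetical, and it is assumed (not confirmed by the source) that `plot_correlation_matrix` uses the same absolute-value threshold of 0.6.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.6):
    """Hypothetical helper: list column pairs with |Pearson r| >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with "a" and "b"
})
high_pairs = correlated_pairs(demo)
print(high_pairs)
```

Calling `correlated_pairs(X_train, 0.6)` after the feature engineering step would name the newly correlated pair directly instead of reading it off the plot.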

## Scaling

To help model optimization, the features are standardized to zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set.

In [None]:
from util import StandardScaler

standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)
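
The `StandardScaler` used here comes from the project's `util` module, and its implementation is not shown. The export step later calls `.to_csv` on the scaled data, so the scaler presumably returns DataFrames. The sketch below illustrates the same idea with plain pandas, under that assumption; the helper name `standardize_like_train` and the toy values are hypothetical.

```python
import pandas as pd

def standardize_like_train(train, test):
    """Hypothetical helper: scale both frames with the training mean and std."""
    mean = train.mean()
    # Population standard deviation (ddof=0), the convention that
    # scikit-learn's StandardScaler also follows.
    std = train.std(ddof=0)
    return (train - mean) / std, (test - mean) / std

train_demo = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0], "pH": [3.0, 3.2, 3.4]})
test_demo = pd.DataFrame({"alcohol": [10.0], "pH": [3.4]})

train_scaled, test_scaled = standardize_like_train(train_demo, test_demo)
print(train_scaled)
print(test_scaled)
```

Only the training statistics are used for both frames, so the test set does not influence the scaling parameters. The results remain DataFrames with the original column names.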

## Export

In [None]:
X_train.to_csv('../data/processed/X_train.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)