# Step 1

## Preprocessing and Feature Engineering

**Last update: August 8, 2025**

**Copyright (C) 2025 Sukanta Basu**

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

**Overall Strategy**

**Step 1: Preprocess and engineer new features.**

Step 2: Use AutoGluon to generate out-of-fold (OOF) predictions for each target
separately. These predictions serve as additional input features in steps 3 and 4.

Step 3: Train the RealMLP model on the processed inputs (step 1) plus the ten
AutoGluon OOF predictions (step 2). Because each OOF feature summarizes one
target, these extra inputs let the model exploit correlations among the targets.

Step 4: Same as step 3, but with the TabPFN model.

Step 5: Combine predictions from RealMLP (step 3) and TabPFN (step 4).
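
To make the OOF idea in steps 2 to 4 concrete, here is a minimal sketch that uses plain NumPy and a least-squares linear model as a stand-in for AutoGluon. The function `oof_predictions`, the synthetic data, and the fold count are all hypothetical and are not part of this pipeline. The point is only that every training row is predicted by a model that never saw it, so the resulting column can be appended as a feature without leaking the target.

```python
import numpy as np

def oof_predictions(X, y, n_splits=5, seed=7):
    """Return out-of-fold predictions from a least-squares linear model.

    Stand-in for the AutoGluon step: any model could replace the lstsq fit.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.empty(len(X), dtype=float)
    for k in range(n_splits):
        val_idx = folds[k]
        trn_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Fit on the training folds only (last column is the intercept).
        A_trn = np.column_stack([X[trn_idx], np.ones(len(trn_idx))])
        coef, *_ = np.linalg.lstsq(A_trn, y[trn_idx], rcond=None)
        # Predict the held-out fold; each row is predicted exactly once.
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        oof[val_idx] = A_val @ coef
    return oof

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

oof = oof_predictions(X, y)
X_aug = np.column_stack([X, oof])  # original inputs + one OOF feature
```

With ten targets, the same procedure would be repeated per target, giving ten extra columns.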

**Imports**

In [None]:
import numpy as np
import pandas as pd
import random

**Set Random Seeds**

In [None]:
random.seed(7)
np.random.seed(7)

**Input & Output Directories**

In [None]:
ROOT_DIR = '/data/Sukanta/Works_AIML/2025_SHELL_FuelProperty/'
DATA_DIR = ROOT_DIR + 'DATA/'
ExtractedDATA_DIR = ROOT_DIR + 'ExtractedDATA/'
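
The paths above are built by string concatenation and assume that the output directory already exists; `DataFrame.to_csv` does not create missing parent directories. A possible alternative, sketched here with `pathlib` and a temporary directory as a hypothetical root (the lowercase variable names are illustrative, not the notebook's), creates the output directory before anything is written.

```python
from pathlib import Path
import tempfile

# Hypothetical root for illustration; the notebook uses an absolute path
# on the author's machine.
root_dir = Path(tempfile.mkdtemp())
data_dir = root_dir / 'DATA'
extracted_dir = root_dir / 'ExtractedDATA'

# Create the output directory up front; exist_ok avoids an error on re-runs.
extracted_dir.mkdir(parents=True, exist_ok=True)

train_path = data_dir / 'train.csv'
out_path = extracted_dir / 'train_processed.csv'
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so `train_path` and `out_path` could be passed to them directly.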

**Load Training and Testing Data Provided by the Organizers**

In [None]:
df_XyTrnVal_org = pd.read_csv(DATA_DIR + 'train.csv')
df_XTst_org = pd.read_csv(DATA_DIR + 'test.csv')

**Feature Engineering**

In [None]:
# Create empty data frames
df_XyTrnVal_mod = pd.DataFrame()
df_XTst_mod = pd.DataFrame()

# Add component fractions
for comp in range(1, 6):
    fraction_col = f'Component{comp}_fraction'
    df_XyTrnVal_mod[fraction_col] = df_XyTrnVal_org[fraction_col]
    df_XTst_mod[fraction_col] = df_XTst_org[fraction_col]

# Create volume fraction-weighted input features
for prop in range(1, 11):
    for comp in range(1, 6):
        fraction_col = f'Component{comp}_fraction'
        property_col = f'Component{comp}_Property{prop}'
        contribution_col = f'Component{comp}_Contribution_Property{prop}'
        df_XyTrnVal_mod[contribution_col] = (df_XyTrnVal_org[fraction_col] *
                                             df_XyTrnVal_org[property_col])
        df_XTst_mod[contribution_col] = (df_XTst_org[fraction_col] *
                                         df_XTst_org[property_col])

# Create fraction-weighted average input features
# (for each property: the sum over components of fraction * property)
for prop in range(1, 11):
    df_XyTrnVal_mod[f'WeightedAvg_Property{prop}'] = (
        sum(df_XyTrnVal_org[f'Component{comp}_fraction'] *
            df_XyTrnVal_org[f'Component{comp}_Property{prop}']
            for comp in range(1, 6)))
    df_XTst_mod[f'WeightedAvg_Property{prop}'] = (
        sum(df_XTst_org[f'Component{comp}_fraction'] *
            df_XTst_org[f'Component{comp}_Property{prop}']
            for comp in range(1, 6)))

# Add targets (training data only)
for target in range(1, 11):
    target_col = f'BlendProperty{target}'
    df_XyTrnVal_mod[target_col] = df_XyTrnVal_org[target_col]
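
As a small worked example of the two engineered feature families, the sketch below uses a hypothetical two-component, one-property blend (the notebook itself uses five components and ten properties). The toy values are invented for illustration. It shows that `WeightedAvg_Property1` is the row-wise sum of the per-component contributions, which is a true weighted mean only if the fractions in each row sum to 1, as they do in this toy data.

```python
import pandas as pd

# Hypothetical toy blend: 2 components, 1 property, 2 samples.
toy = pd.DataFrame({
    'Component1_fraction': [0.25, 0.5],
    'Component2_fraction': [0.75, 0.5],
    'Component1_Property1': [2.0, 4.0],
    'Component2_Property1': [4.0, 8.0],
})

feats = pd.DataFrame(index=toy.index)

# Per-component contribution: fraction * property value.
for comp in (1, 2):
    feats[f'Component{comp}_Contribution_Property1'] = (
        toy[f'Component{comp}_fraction'] * toy[f'Component{comp}_Property1'])

# Weighted average: sum of the contributions across components.
contribution_cols = [f'Component{comp}_Contribution_Property1'
                     for comp in (1, 2)]
feats['WeightedAvg_Property1'] = feats[contribution_cols].sum(axis=1)

print(feats['WeightedAvg_Property1'].tolist())  # [3.5, 6.0]
```

For the first row, 0.25 * 2.0 + 0.75 * 4.0 = 3.5; for the second, 0.5 * 4.0 + 0.5 * 8.0 = 6.0.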

**Save Processed Data**

In [None]:
df_XyTrnVal_mod.to_csv(ExtractedDATA_DIR + 'train_processed.csv', index=False)
df_XTst_mod.to_csv(ExtractedDATA_DIR + 'test_processed.csv', index=False)
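
Before moving on to step 2, it may be worth checking that the saved files have the expected layout. The helper below is a hypothetical sketch (`expected_columns` is not part of the notebook) that rebuilds the column names in the order the feature-engineering cell creates them: 5 fractions, 50 contributions, 10 weighted averages, and, for the training file only, 10 targets.

```python
def expected_columns(n_comp=5, n_prop=10, n_target=10, with_targets=True):
    """Rebuild the processed column names in creation order."""
    cols = [f'Component{c}_fraction' for c in range(1, n_comp + 1)]
    # Contributions are created with the property loop outermost.
    cols += [f'Component{c}_Contribution_Property{p}'
             for p in range(1, n_prop + 1)
             for c in range(1, n_comp + 1)]
    cols += [f'WeightedAvg_Property{p}' for p in range(1, n_prop + 1)]
    if with_targets:
        cols += [f'BlendProperty{t}' for t in range(1, n_target + 1)]
    return cols

train_cols = expected_columns(with_targets=True)
test_cols = expected_columns(with_targets=False)
print(len(train_cols), len(test_cols))  # 75 65
```

A check such as `list(df_XyTrnVal_mod.columns) == train_cols` could then be run on the data frames or on the files read back with `pd.read_csv`.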