In [53]:
import numpy as np
import pandas as pd

# Predicting Critical Heat Flux

This is a Kaggle competition: [Feature Imputation with a Heat Flux Dataset](https://www.kaggle.com/competitions/playground-series-s3e15/overview).


## Objective

The objective is to predict missing values in this synthetic dataset, using previously established data and scientific principles, as well as predictive machine learning.

Based on the following data:

Zhao, Xingang (2020), “Data for: On the prediction of critical heat flux using a physics-informed machine learning-aided framework”, Mendeley Data, V1, doi: 10.17632/5p5h37tyv7.1

In [54]:
# Convert the CSV file to a Pandas DataFrame

data = pd.read_csv('synthesized_data.csv', index_col=0)
data

id,author,geometry,pressure [MPa],mass_flux [kg/m2-s],x_e_out [-],D_e [mm],D_h [mm],length [mm],chf_exp [MW/m2]
0,Thompson,tube,7.00,3770.0,0.1754,,10.8,432.0,3.6
1,Thompson,tube,,6049.0,-0.0416,10.3,10.3,762.0,6.2
2,Thompson,,13.79,2034.0,0.0335,7.7,7.7,457.0,2.5
3,Beus,annulus,13.79,3679.0,-0.0279,5.6,15.2,2134.0,3.0
4,,tube,13.79,686.0,,11.1,11.1,457.0,2.8
...,...,...,...,...,...,...,...,...,...
31639,Thompson,,,1736.0,0.0886,,7.8,591.0,2.3
31640,,,13.79,,,4.7,4.7,,3.9
31641,Thompson,,18.27,658.0,-0.1224,3.0,3.0,150.0,2.3
31642,Thompson,tube,6.89,3825.0,,23.6,23.6,1972.0,3.7


In [55]:
# View the types of data in each column

data.dtypes

author                  object
geometry                object
pressure [MPa]         float64
mass_flux [kg/m2-s]    float64
x_e_out [-]            float64
D_e [mm]               float64
D_h [mm]               float64
length [mm]            float64
chf_exp [MW/m2]        float64
dtype: object

In [56]:
# Rename the columns for simplicity

columns_map = {'pressure [MPa]': 'pressure',
               'mass_flux [kg/m2-s]': 'mass_flux',
               'x_e_out [-]': 'x_e_out',
               'D_e [mm]': 'D_e',
               'D_h [mm]': 'D_h',
               'length [mm]': 'length',
               'chf_exp [MW/m2]': 'chf_exp'}

data.rename(columns=columns_map, inplace=True)

In [57]:
# Count the NaN values in each column

data.isna().sum()

author        5024
geometry      5500
pressure      4452
mass_flux     4791
x_e_out      10415
D_e           5488
D_h           4589
length        4759
chf_exp          0
dtype: int64

In [58]:
# Select the rows where the feature of interest has a missing value

x_e_out_na = data.loc[data['x_e_out'].isna()]

### Feature imputation

There are 9 features overall, including the feature of particular interest, `x_e_out [-]`.

Most rows of data contain at least one missing value, and many contain up to 4 missing values. 
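
The per-row counts behind this statement can be checked with a row-wise NaN count. The snippet below is a minimal sketch on a small made-up frame (the `toy` values are hypothetical, not taken from the dataset); on the real data the same two calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    'pressure': [7.0, np.nan, 13.79, np.nan],
    'mass_flux': [3770.0, 6049.0, np.nan, np.nan],
    'x_e_out': [0.1754, -0.0416, np.nan, 0.0886],
})

# Number of missing values in each row
missing_per_row = toy.isna().sum(axis=1)

# How many rows have 0, 1, 2, ... missing values
distribution = missing_per_row.value_counts().sort_index()
print(distribution)
```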

We will take two approaches to feature imputation:

### 1. Inference through regression

Using trends found in the original dataset (see the other notebook in this repo), we can fill in some of the NaN values in `D_e` and `D_h`, whenever the other of the two is present, using the least squares linear regression equation observed between them.

In [59]:
# Find how many rows contain at least one of these values

diameters = data[['D_e', 'D_h']]
diameters.dropna(how='all')

id,D_e,D_h
0,,10.8
1,10.3,10.3
2,7.7,7.7
3,5.6,15.2
4,11.1,11.1
...,...,...
31639,,7.8
31640,4.7,4.7
31641,3.0,3.0
31642,23.6,23.6


Out of 31644 rows, 30829 contain either or both values. That's a good sign.

Now we can use the equation of the least squares regression line obtained previously to fill in some NaN values:

`D_h = (m)*D_e + (b)`  
`m = 1.125005293061929`  
`b = 2.4657099242996416`
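
As a rough illustration of how such a line could be obtained and applied, the sketch below fits a least squares line with `np.polyfit` on made-up diameter pairs and uses it in both directions. The toy values and the names `complete`, `toy`, `m_toy` and `b_toy` are assumptions for this example only; the actual fit was done in the other notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical complete pairs that lie exactly on D_h = 2*D_e + 1
complete = pd.DataFrame({'D_e': [1.0, 2.0, 3.0, 4.0],
                         'D_h': [3.0, 5.0, 7.0, 9.0]})

# A degree-1 polyfit returns the least squares slope and intercept
m_toy, b_toy = np.polyfit(complete['D_e'], complete['D_h'], deg=1)

# Hypothetical rows where exactly one of the two diameters is missing
toy = pd.DataFrame({'D_e': [5.0, np.nan], 'D_h': [np.nan, 13.0]})

# Forward: estimate D_h from D_e; inverse: estimate D_e from D_h
toy['D_h'] = toy['D_h'].fillna(m_toy * toy['D_e'] + b_toy)
toy['D_e'] = toy['D_e'].fillna((toy['D_h'] - b_toy) / m_toy)
print(toy)
```

Rows where both diameters are missing stay NaN, because neither direction of the line has an input to work from.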

In [60]:
# Set the values for the LSR line:

m = 1.125005293061929
b = 2.4657099242996416

In [61]:
# Use the LSR line equation to estimate `D_h` from `D_e`
# `assign` returns a new DataFrame, which avoids chained assignment on a slice of `data`

diameters = diameters.assign(D_h=diameters['D_h'].fillna(m*diameters['D_e']+b))


In [62]:
# Reverse the LSR line equation to estimate `D_e` from `D_h`

diameters = diameters.assign(D_e=diameters['D_e'].fillna((diameters['D_h']-b)/m))

In [63]:
# Count how many NaN values remain

diameters.isnull().sum()

D_e    815
D_h    815
dtype: int64

In [64]:
# Place these imputed values into the dataset

data[['D_e', 'D_h']] = diameters

Many of the NaN values in `D_e` and `D_h` have been imputed wherever one of the two was present. However, 815 rows still have neither value.

In addition, none of the other features showed strong direct correlations (through either regression or categorical association) within the dataset, which makes imputing an unknown feature from the known ones tricky. One way around this is to use machine learning, which leads to the second method for feature imputation:

### 2. K-nearest neighbours

The K-nearest neighbours approach is a machine learning method built on the idea that samples which lie close together in feature space tend to have similar values, where closeness is computed as a distance between their features. The same idea can be used for imputation: a missing value is filled in from the values of that feature in the most similar samples.
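
To make the idea concrete, here is a small from-scratch sketch of nearest-neighbour imputation using only NumPy. It is not scikit-learn's implementation, and the helper names `nan_euclidean` and `knn_impute` are made up for this example. It assumes the approach used by default in `KNNImputer`: distances are computed over the coordinates that both rows have (rescaled to account for the missing ones), and a gap is filled with the mean of that feature over the `k` nearest rows that do have it.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over shared coordinates, rescaled for missing ones."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.inf  # no shared coordinates: treat as infinitely far apart
    scale = a.size / both.sum()
    return np.sqrt(scale * np.sum((a[both] - b[both]) ** 2))

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the other rows that have a value in column j
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]
        donors.sort(key=lambda r: nan_euclidean(X[i], X[r]))
        filled[i, j] = np.mean([X[r, j] for r in donors[:k]])
    return filled

# Hypothetical toy matrix with two gaps
X = [[1, 2, np.nan],
     [3, 4, 3],
     [np.nan, 6, 5],
     [8, 8, 7]]

result = knn_impute(X, k=2)
print(result)
```

Distances are always computed on the original matrix, so an imputed value never influences another imputation.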

In [65]:
from sklearn.impute import KNNImputer

In [66]:
imputer = KNNImputer()

As found previously, this dataset contains both numerical and categorical data. The KNN imputer cannot directly predict NaN values for categorical features, as it requires all features to be numeric (`float` or `int`).

Since all features in the dataset describe physical properties of materials except for `author`, this feature will be removed, as there is unlikely to be any meaningful correlation between this feature and the feature of interest.

Simply applying an encoding method to the categorical values (e.g. a label encoder, a one hot encoder, or dummy variables) would not work here, as the NaN values would not be preserved. These methods encode a missing value as an ordinary value, for example as a category of its own or as a row of zeros, which the KNN imputer would no longer recognize as missing. Because of this, we need to find a way to encode these variables while preserving the NaN values. In this case, we will use one hot encoding, as the categorical features are not ordinal.
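
The loss of NaN values can be seen on a tiny example. The sketch below uses `pd.get_dummies` on a made-up three-row column (`toy_geometry` is hypothetical): with default settings the missing entry becomes a row of zeros, and with `dummy_na=True` it becomes an indicator column of its own. In neither case does the encoded frame contain a NaN for the imputer to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry
toy_geometry = pd.Series(['tube', np.nan, 'annulus'], name='geometry')

# Default behaviour: the missing entry is encoded as a row of zeros
default_encoded = pd.get_dummies(toy_geometry, dtype=float)

# With dummy_na=True the missing entry gets an indicator column of its own
with_na_column = pd.get_dummies(toy_geometry, dummy_na=True, dtype=float)

print(default_encoded)
print(with_na_column)
```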

Since there are no categorical variable encoders with built-in options to preserve NaN values, we will take a manual approach:

1. Specify a temporary category for missing values (NaNs)
2. Apply categorical variable transformation to categorical features
3. View the resulting sub-DataFrame to determine corresponding column names from OHE
4. For each row, replace all values with NaN if the missing column contains 1
5. Delete the missing column and merge the sub-DataFrame back into the main DataFrame
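
The five steps above can be sketched on a small made-up column. This is a simplified variant for illustration, not the exact code used below: it assumes a placeholder label `'missing'` for step 1 (a hypothetical name), whereas the cells that follow let `pd.get_dummies` create the NaN column via `dummy_na=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry
toy_geometry = pd.Series(['tube', np.nan, 'annulus', 'tube'], name='geometry')

# 1. Give missing values a temporary category
filled = toy_geometry.fillna('missing')

# 2. One hot encode, using floats so the columns can hold NaN later
encoded = pd.get_dummies(filled, dtype=float)

# 3. Inspect the generated column names
print(list(encoded.columns))

# 4. Blank out every row that was flagged as missing
encoded.loc[encoded['missing'] == 1, :] = np.nan

# 5. Drop the temporary column; the result can be joined back to the main frame
encoded = encoded.drop(columns='missing')
print(encoded)
```

The row that had no geometry is now NaN in every encoded column, so an imputer sees it as missing rather than as a category.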

In [67]:
# Count the distinct categories in the `geometry` column

data['geometry'].value_counts()

tube       21145
annulus     4381
plate        618
Name: geometry, dtype: int64

In [68]:
# Create dummy variables, and add a new column for NaN values
# `dtype=int` keeps the 0/1 encoding regardless of the pandas version

geometry = pd.get_dummies(data['geometry'], dummy_na=True, dtype=int)

In [69]:
geometry.head()

id,annulus,plate,tube,NaN
0,0,0,1,0
1,0,0,1,0
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0


In [70]:
# Separate the temporary DataFrame into two DataFrames
# One will have rows where the NaN column has a 0 (not missing this feature value)
# The other will have rows where the NaN column has a 1 (missing this feature value)

geometry_notna = geometry.loc[geometry[np.nan] == 0]
geometry_na = geometry.loc[geometry[np.nan] == 1].copy()

In [71]:
# Replace all of the known category columns with NaN in the DataFrame with this missing feature value

geometry_na = geometry_na.replace(0, np.nan)

In [72]:
# Concatenate the two DataFrames back into one DataFrame, and drop the NaN column

geometry = pd.concat([geometry_notna, geometry_na]).sort_index()
geometry.drop(np.nan, axis=1, inplace=True)

In [73]:
geometry.head()

id,annulus,plate,tube
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,,,
3,1.0,0.0,0.0
4,0.0,0.0,1.0


In [74]:
# Concatenate this encoded DataFrame back with the original DataFrame
# Drop the categorical `author` and `geometry` columns

data = pd.concat([data, geometry], axis=1)
data.drop(['author', 'geometry'], axis=1, inplace=True)

In [75]:
# Fit the pre-processed dataset with the KNN imputer

imputer.fit(data)

KNNImputer()

In [76]:
# Transform the pre-processed dataset with the pre-fitted KNN imputer

data_array = imputer.transform(data)

In [77]:
# Convert the imputed array back into a DataFrame, keeping the original `id` index

data_imputed = pd.DataFrame(data_array, columns=imputer.feature_names_in_, index=data.index)

In [78]:
# Check for any remaining NaN values

data_imputed.isnull().sum()

pressure     0
mass_flux    0
x_e_out      0
D_e          0
D_h          0
length       0
chf_exp      0
annulus      0
plate        0
tube         0
dtype: int64

In [79]:
# Isolate the feature of interest as a Series

x_e_out = data_imputed['x_e_out']
x_e_out.name = 'x_e_out [-]'

In [80]:
# Select only the rows where the feature of interest was originally missing, and save them to a CSV file

x_e_out.loc[x_e_out_na.index].to_csv(path_or_buf='submission.csv', index_label='id')

And we're done! All missing feature values have been imputed by one of the two imputation methods.