# MICE Feature Imputation
---
## by Diego Garrocho

Feature Imputation with a Heat Flux Dataset
Kaggle competition Playground Series - Season 3, Episode 15

There appear to be quite a few missing values in nearly every column. The competition is scored solely on the RMSE of the predictions for the feature 'x_e_out', so that feature is left to be filled last.
MICE (Multiple Imputation by Chained Equations) is specifically designed to handle datasets in which several variables contain missing values. It works by iteratively imputing the missing values in each variable using regression models that take the other variables as predictors. The process is repeated to create multiple imputed datasets, and the results from these datasets are combined to obtain the final imputations.
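
Since the score is RMSE, it may help to have the metric written out. The snippet below is a minimal sketch; the helper name `rmse` and the toy numbers are assumptions made for illustration and are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only
score = rmse([0.10, -0.05, 0.20], [0.12, -0.01, 0.17])
print(score)
```

Because the errors are squared before averaging, a few badly imputed rows weigh more on the score than many slightly off ones.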

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from catboost import CatBoostRegressor

In [2]:
data = pd.read_csv(r'C:\Users\logan\Desktop\Feature Imputation\data.csv' )

In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   31644 non-null  int64  
 1   author               26620 non-null  object 
 2   geometry             26144 non-null  object 
 3   pressure [MPa]       27192 non-null  float64
 4   mass_flux [kg/m2-s]  26853 non-null  float64
 5   x_e_out [-]          21229 non-null  float64
 6   D_e [mm]             26156 non-null  float64
 7   D_h [mm]             27055 non-null  float64
 8   length [mm]          26885 non-null  float64
 9   chf_exp [MW/m2]      31644 non-null  float64
dtypes: float64(7), int64(1), object(2)
memory usage: 2.4+ MB
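
The non-null counts above can be turned into missing counts directly; for example, 'x_e_out [-]' has 31644 - 21229 = 10415 missing entries. A small sketch of a per-column summary follows. The helper name `missing_summary` and the toy frame are assumptions for illustration; on the real data it would be called on `data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing entries per column, most affected first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"n_missing": counts, "fraction": counts / len(df)})
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```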


In [3]:
# dropping id to begin with
data_noid = data.drop('id',axis=1)

In [4]:
data_noid['author'].unique()

array(['Thompson', 'Beus', nan, 'Peskov', 'Janssen', 'Weatherhead',
       'Inasaka', 'Williams', 'Mortimore', 'Richenderfer', 'Kossolapov'],
      dtype=object)

In [5]:
# Encoding geometry and author features

label_encoder = LabelEncoder()

data_noid['author_encoded'] = label_encoder.fit_transform(data_noid['author'])
data_noid['geometry_encoded'] = label_encoder.fit_transform(data_noid['geometry'])

# Revert encoded missing values back to NaN
# (first attempt, left disabled: the LabelEncoder output contains no nulls, so pd.isnull never matches)

#data_noid['author_encoded'] = np.where(pd.isnull(data_noid['author_encoded']),
                                       #np.nan, data_noid['author_encoded'])
#data_noid['geometry_encoded'] = np.where(pd.isnull(data_noid['geometry_encoded']),
                                         #np.nan, data_noid['geometry_encoded'])


In [6]:
# LabelEncoder gave NaN its own class (10 for author, 3 for geometry); turn those codes back into NaN
data_noid['author_encoded'] = np.where(data_noid['author_encoded'] == 10, np.nan, data_noid['author_encoded'])
data_noid['geometry_encoded'] = np.where(data_noid['geometry_encoded'] == 3, np.nan, data_noid['geometry_encoded'])

# Dropping text features
encoded = data_noid.drop(['author','geometry'],axis=1)
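
As an alternative to hard-coding the NaN class codes (10 and 3), the encoding can be done with pandas categoricals, which mark missing values with the code -1. This is a sketch of a swapped-in technique, not what the notebook ran; the helper name `encode_keep_nan` and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def encode_keep_nan(series):
    """Integer-encode a text column and leave missing entries as NaN."""
    # Categories are sorted, and missing values receive the code -1
    codes = series.astype("category").cat.codes.astype("float64")
    # Keep real codes, replace the -1 placeholder with NaN
    return codes.where(codes != -1)

toy_geometry = pd.Series(["tube", "annulus", None, "tube", "plate"])
toy_encoded = encode_keep_nan(toy_geometry)
print(toy_encoded.tolist())
```

The appeal of this variant is that it does not depend on knowing how many distinct labels a column has.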

In [7]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       27192 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             26156 non-null  float64
 4   D_h [mm]             27055 non-null  float64
 5   length [mm]          26885 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       26620 non-null  float64
 8   geometry_encoded     26144 non-null  float64
dtypes: float64(9)
memory usage: 2.2 MB


In [8]:
data_noid.head(20)

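|   | author | geometry | pressure [MPa] | mass_flux [kg/m2-s] | x_e_out [-] | D_e [mm] | D_h [mm] | length [mm] | chf_exp [MW/m2] | author_encoded | geometry_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thompson | tube | 7.0 | 3770.0 | 0.1754 | NaN | 10.8 | 432.0 | 3.6 | 7.0 | 2.0 |
| 1 | Thompson | tube | NaN | 6049.0 | -0.0416 | 10.3 | 10.3 | 762.0 | 6.2 | 7.0 | 2.0 |
| 2 | Thompson | NaN | 13.79 | 2034.0 | 0.0335 | 7.7 | 7.7 | 457.0 | 2.5 | 7.0 | NaN |
| 3 | Beus | annulus | 13.79 | 3679.0 | -0.0279 | 5.6 | 15.2 | 2134.0 | 3.0 | 0.0 | 0.0 |
| 4 | NaN | tube | 13.79 | 686.0 | NaN | 11.1 | 11.1 | 457.0 | 2.8 | NaN | 2.0 |
| 5 | NaN | NaN | 17.24 | 3648.0 | -0.0711 | NaN | 1.9 | 696.0 | 3.6 | NaN | NaN |
| 6 | Thompson | NaN | 6.89 | 549.0 | 0.1203 | 12.8 | 12.8 | 1930.0 | 2.6 | 7.0 | NaN |
| 7 | Peskov | tube | 18.0 | 750.0 | NaN | 10.0 | 10.0 | 1650.0 | 2.2 | 5.0 | 2.0 |
| 8 | NaN | tube | 12.07 | 4042.0 | -0.0536 | NaN | NaN | 152.0 | 5.6 | NaN | 2.0 |
| 9 | Peskov | tube | 12.0 | 1617.0 | 0.1228 | 10.0 | 10.0 | 520.0 | 2.2 | 5.0 | 2.0 |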


1. Identify variables with missing values: Determine which variables in your dataset have missing values that need to be imputed. This step helps you focus on the specific variables that require imputation.

2. Split the data: Divide your dataset into two parts: one with complete cases (observations without missing values) and one with incomplete cases (observations with missing values). This separation allows you to use the complete cases for imputation model training and the incomplete cases for imputation.

3. Initialize imputations: Create multiple copies (usually 3-10) of the incomplete cases dataset, each representing a complete dataset with imputed values. Start by imputing the missing values in each dataset using simple techniques like mean imputation or mode imputation.

4. Iterate over variables: For each variable with missing values, perform the following steps:

    a. Select the variable to be imputed: Focus on one variable at a time.

    b. Create a regression model: Build a regression model to predict the missing variable based on the observed values of the other variables. The choice of the model depends on the nature of the variable (continuous, categorical, etc.). For continuous variables, linear regression is often used. For categorical variables, logistic regression or classification models can be employed.

    c. Update the missing values: Use the fitted regression model to impute the missing values for the selected variable in each dataset copy. Replace the missing values with the predicted values.

    d. Repeat steps b and c: Iterate over the remaining variables with missing values and update their missing values accordingly in each dataset copy.

5. Repeat the iteration: Repeat steps 4b-4d for a predetermined number of iterations (e.g., 5-10). Each iteration further refines the imputations by incorporating the updated values from the previous iterations.

6. Combine the imputed datasets: After completing the desired number of iterations, combine the multiple imputed datasets into a single dataset. You can either use the imputed datasets separately for analysis or combine them using specific rules, such as using the mean or median of the imputed values.

7. Assess imputation quality: Evaluate the quality of the imputations by examining diagnostic measures, such as convergence of the imputation process, distributional properties, and imputation-specific diagnostics (e.g., fraction of missing information).
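
The iteration described above can be sketched with plain NumPy. This is a deliberately simplified version: it produces a single filled dataset and uses ordinary least squares for every column, so it illustrates the chained-equations cycle, not full multiple imputation. The function name `chained_imputation` and the toy data are assumptions made for illustration.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Fill NaNs in a 2-D float array by cycling linear regressions over columns."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # "Initialize imputations": start every missing entry at its column mean
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            # Predictors: an intercept plus the current values of all other columns
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed, predict where it was missing
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = design[rows] @ coef
    return filled

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3))
toy[:, 2] = 2.0 * toy[:, 0] - toy[:, 1] + 3.0  # third column depends on the first two
toy[:5, 0] = np.nan                             # knock out a few entries
toy[10:15, 2] = np.nan
result = chained_imputation(toy)
```

Observed entries are never overwritten here; only the positions that were NaN in the input are updated on each pass.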

In [9]:
#Splitting into complete and incomplete data
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]


In [10]:
# Make 5 copies of the incomplete rows to hold the imputations
imputed_datasets = [incomplete_data.copy() for _ in range(5)]

# Naming cat features to use mode instead of median
categorical_features = ['author_encoded', 'geometry_encoded']
non_categorical_features = [feature for feature in imputed_datasets[0].columns if feature not in categorical_features]

# Initial fill: median for numeric features (most are skewed or have outliers), mode for categorical ones
for dataset in imputed_datasets:
    dataset[categorical_features] = dataset[categorical_features].fillna(dataset[categorical_features].mode().iloc[0])
    dataset[non_categorical_features] = dataset[non_categorical_features].fillna(dataset[non_categorical_features].median())
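
Once the copies hold different imputations, one simple way to combine them is a cell-by-cell average of the numeric columns. The sketch below assumes all copies share the same index and columns; the helper name `pool_mean` and the toy frames are hypothetical. Averaging is not meaningful for the encoded categorical columns, where a mode would be the closer analogue.

```python
import pandas as pd

def pool_mean(datasets):
    """Average several same-shaped numeric DataFrames cell by cell."""
    stacked = pd.concat(datasets)
    # Rows that share an index label across copies are averaged together
    return stacked.groupby(level=0).mean()

copy_a = pd.DataFrame({"length": [100.0, 200.0]}, index=[4, 7])
copy_b = pd.DataFrame({"length": [110.0, 220.0]}, index=[4, 7])
pooled = pool_mean([copy_a, copy_b])
print(pooled)
```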



The following part is probably murky, since I found little reference material for it.

In [11]:
# Set up for FCS models for each feature with missing values

incomplete_mask = encoded.isnull().any(axis=1)


# Probably a wasteful step
features_geo = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_e [mm]',
 'D_h [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded']


# Split the data into observed and missing rows
observed_data = encoded[complete_mask].copy()
missing_data = encoded[incomplete_mask].copy()

# Define the input and target variables for observed data
X_train = observed_data[features_geo]
y_train = observed_data['geometry_encoded']

# Create an instance of CatBoostClassifier
classifier = CatBoostClassifier()

# Fit the model
classifier.fit(X_train, y_train)

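# Predict only for rows where geometry is actually missing, so observed labels are kept
geo_missing = missing_data['geometry_encoded'].isnull()
missing_X = missing_data.loc[geo_missing, features_geo]
missing_data.loc[geo_missing, 'geometry_encoded'] = classifier.predict(missing_X).ravel()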

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'geometry_encoded'] = missing_data['geometry_encoded']

# Convert the filled missing values to the appropriate data type
encoded['geometry_encoded'] = encoded['geometry_encoded'].astype(np.int64)

Learning rate set to 0.087078
0:	learn: 0.9412862	total: 137ms	remaining: 2m 17s
1:	learn: 0.8157907	total: 141ms	remaining: 1m 10s
2:	learn: 0.7150209	total: 143ms	remaining: 47.5s
...
875:	learn: 0.0031355	total: 2.99s	remaining: 424ms
...

In [12]:
list(encoded)


['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_e [mm]',
 'D_h [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

In [13]:
# Now for author 

# Probably a wasteful step
features_author = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_e [mm]',
 'D_h [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'geometry_encoded']


# Define the input and target variables for observed data
X_train = observed_data[features_author]
y_train = observed_data['author_encoded']

# Create an instance of CatBoostClassifier
classifier = CatBoostClassifier()

# Fit the model
classifier.fit(X_train, y_train)

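# Predict only for rows where author is actually missing, so observed labels are kept
author_missing = missing_data['author_encoded'].isnull()
missing_X = missing_data.loc[author_missing, features_author]
missing_data.loc[author_missing, 'author_encoded'] = classifier.predict(missing_X).ravel()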

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'author_encoded'] = missing_data['author_encoded']

# Convert the filled missing values to the appropriate data type
encoded['author_encoded'] = encoded['author_encoded'].astype(np.int64)

Learning rate set to 0.087078
0:	learn: 1.7345725	total: 9.41ms	remaining: 9.4s
1:	learn: 1.4471923	total: 19.6ms	remaining: 9.79s
2:	learn: 1.2497026	total: 28.3ms	remaining: 9.4s
...
999:	learn: 0.0476252	total: 8.58s	remaining: 0us


In [14]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       27192 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             26156 non-null  float64
 4   D_h [mm]             27055 non-null  float64
 5   length [mm]          26885 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [15]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for length

# Probably a wasteful step
features_length = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_e [mm]',
 'D_h [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_le = encoded[complete_mask].copy()
missing_data_le = encoded[incomplete_mask].copy()

In [16]:
# Ensure the order of features is consistent
X_train_le = observed_data_le[features_length]
y_train_le = observed_data_le['length [mm]']

X_test_le = missing_data_le[features_length] 

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_le, y_train_le, verbose=False)

# Make predictions only where length is actually missing, so observed values are kept
length_missing = missing_data_le['length [mm]'].isnull()
missing_data_le.loc[length_missing, 'length [mm]'] = regressor.predict(X_test_le[length_missing])

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'length [mm]'] = missing_data_le['length [mm]']
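
The length cell above and the cells for the remaining numeric columns repeat one pattern: fit on fully observed rows, predict, write back. A hedged sketch of that pattern as a reusable helper follows. The names `impute_column` and `LeastSquares` are hypothetical; `LeastSquares` is only a stand-in so the example runs without CatBoost, and unlike CatBoost it needs complete predictor values. Any object with `fit` and `predict`, such as `CatBoostRegressor`, could be passed instead.

```python
import numpy as np
import pandas as pd

class LeastSquares:
    """Stand-in model with fit/predict (hypothetical, for illustration only)."""
    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        self.coef_, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        return A @ self.coef_

def impute_column(df, target, make_model):
    """Fit on fully observed rows, then fill only the rows where `target` is NaN."""
    features = [c for c in df.columns if c != target]
    complete = df.notnull().all(axis=1)
    to_fill = df[target].isnull()
    if not to_fill.any():
        return df.copy()
    model = make_model()
    model.fit(df.loc[complete, features], df.loc[complete, target])
    out = df.copy()
    out.loc[to_fill, target] = model.predict(df.loc[to_fill, features])
    return out

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [2.0, 4.0, np.nan, 8.0, np.nan]})
toy_filled = impute_column(toy, "b", LeastSquares)
print(toy_filled)
```

Writing predictions only into the rows selected by `to_fill` is a choice made in this sketch so that observed values in partially complete rows are left untouched.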


In [17]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       27192 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             26156 non-null  float64
 4   D_h [mm]             27055 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [18]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for de

# Probably a wasteful step
features_de = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_h [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_de = encoded[complete_mask].copy()
missing_data_de = encoded[incomplete_mask].copy()

In [19]:
# Ensure the order of features is consistent
X_train_de = observed_data_de[features_de]
y_train_de = observed_data_de['D_e [mm]']

X_test_de = missing_data_de[features_de]  

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_de, y_train_de, verbose=False)

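# Make predictions only where D_e is actually missing, so observed values are kept
de_missing = missing_data_de['D_e [mm]'].isnull()
missing_data_de.loc[de_missing, 'D_e [mm]'] = regressor.predict(X_test_de[de_missing])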

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'D_e [mm]'] = missing_data_de['D_e [mm]']


In [20]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       27192 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             31644 non-null  float64
 4   D_h [mm]             27055 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [21]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for dh

# Probably a wasteful step
features_dh = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_e [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_dh = encoded[complete_mask].copy()
missing_data_dh = encoded[incomplete_mask].copy()

In [22]:
# Ensure the order of features is consistent
X_train_dh = observed_data_dh[features_dh]
y_train_dh = observed_data_dh['D_h [mm]']

X_test_dh = missing_data_dh[features_dh]  

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_dh, y_train_dh, verbose=False)

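# Make predictions only where D_h is actually missing, so observed values are kept
dh_missing = missing_data_dh['D_h [mm]'].isnull()
missing_data_dh.loc[dh_missing, 'D_h [mm]'] = regressor.predict(X_test_dh[dh_missing])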

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'D_h [mm]'] = missing_data_dh['D_h [mm]']


In [23]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       27192 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             31644 non-null  float64
 4   D_h [mm]             31644 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [24]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for pressure

# Probably a wasteful step
features_press = [
 'mass_flux [kg/m2-s]',
 'x_e_out [-]',
 'D_h [mm]',
 'D_e [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_press = encoded[complete_mask].copy()
missing_data_press = encoded[incomplete_mask].copy()

In [25]:
# Ensure the order of features is consistent
X_train_press = observed_data_press[features_press]
y_train_press = observed_data_press['pressure [MPa]']

X_test_press = missing_data_press[features_press]  

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_press, y_train_press, verbose=False)

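# Make predictions only where pressure is actually missing, so observed values are kept
press_missing = missing_data_press['pressure [MPa]'].isnull()
missing_data_press.loc[press_missing, 'pressure [MPa]'] = regressor.predict(X_test_press[press_missing])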

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'pressure [MPa]'] = missing_data_press['pressure [MPa]']


In [26]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       31644 non-null  float64
 1   mass_flux [kg/m2-s]  26853 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             31644 non-null  float64
 4   D_h [mm]             31644 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [27]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for mass

# Probably a wasteful step
features_mass = ['pressure [MPa]',
 'x_e_out [-]',
 'D_h [mm]',
 'D_e [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_mass = encoded[complete_mask].copy()
missing_data_mass = encoded[incomplete_mask].copy()

In [28]:
# Ensure the order of features is consistent
X_train_mass = observed_data_mass[features_mass]
y_train_mass = observed_data_mass['mass_flux [kg/m2-s]']

X_test_mass = missing_data_mass[features_mass]  

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_mass, y_train_mass, verbose=False)

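# Make predictions only where mass flux is actually missing, so observed values are kept
mass_missing = missing_data_mass['mass_flux [kg/m2-s]'].isnull()
missing_data_mass.loc[mass_missing, 'mass_flux [kg/m2-s]'] = regressor.predict(X_test_mass[mass_missing])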

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'mass_flux [kg/m2-s]'] = missing_data_mass['mass_flux [kg/m2-s]']


In [29]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       31644 non-null  float64
 1   mass_flux [kg/m2-s]  31644 non-null  float64
 2   x_e_out [-]          21229 non-null  float64
 3   D_e [mm]             31644 non-null  float64
 4   D_h [mm]             31644 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [30]:
# Mask refresh 
complete_mask = encoded.notnull().all(axis=1)
incomplete_mask = ~complete_mask

complete_data = encoded[complete_mask]
incomplete_data = encoded[incomplete_mask]

# Now for xe

# Probably a wasteful step
features_xe = ['pressure [MPa]',
 'mass_flux [kg/m2-s]',
 'D_h [mm]',
 'D_e [mm]',
 'length [mm]',
 'chf_exp [MW/m2]',
 'author_encoded',
 'geometry_encoded']

# Split the data into observed and missing rows
observed_data_xe = encoded[complete_mask].copy()
missing_data_xe = encoded[incomplete_mask].copy()

In [31]:
# Ensure the order of features is consistent
X_train_xe = observed_data_xe[features_xe]
y_train_xe = observed_data_xe['x_e_out [-]']

X_test_xe = missing_data_xe[features_xe]  

# Create an instance of CatBoostRegressor
regressor = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=6)

# Fit the model
regressor.fit(X_train_xe, y_train_xe, verbose=False)

# Make predictions for the missing values
missing_data_xe['x_e_out [-]'] = regressor.predict(X_test_xe)  

# Update the missing values with the predicted values
encoded.loc[incomplete_mask, 'x_e_out [-]'] = missing_data_xe['x_e_out [-]']


In [32]:
encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31644 entries, 0 to 31643
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   pressure [MPa]       31644 non-null  float64
 1   mass_flux [kg/m2-s]  31644 non-null  float64
 2   x_e_out [-]          31644 non-null  float64
 3   D_e [mm]             31644 non-null  float64
 4   D_h [mm]             31644 non-null  float64
 5   length [mm]          31644 non-null  float64
 6   chf_exp [MW/m2]      31644 non-null  float64
 7   author_encoded       31644 non-null  int64  
 8   geometry_encoded     31644 non-null  int64  
dtypes: float64(7), int64(2)
memory usage: 2.2 MB


In [35]:
# Filter rows with missing values in 'x_e_out [-]'
fix_data = data[data['x_e_out [-]'].isnull()]

# Create 'fix' DataFrame with the 'id' column
fix = fix_data[['id']].copy()
fix['x_e_out [-]'] = missing_data_xe['x_e_out [-]']

# Alternative (disabled): submit 'id' and 'x_e_out [-]' for every row, not only the imputed ones
#submission_df = pd.DataFrame({'id': data['id'], 'x_e_out [-]': encoded['x_e_out [-]']})

# Save the new DataFrame to a CSV file
fix.to_csv('submission_fix.csv', index=False)
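
The assignment into `fix` above works because pandas aligns a Series to the target frame by index label, not by position. A small sketch of that behaviour, with toy ids and predictions that are assumptions for illustration:

```python
import pandas as pd

ids = pd.DataFrame({"id": [10, 11, 12, 13]}, index=[0, 1, 2, 3])
# Predictions stored in a different order than the rows they belong to
preds = pd.Series([0.5, 0.7], index=[3, 1])

subset = ids.loc[[1, 3]].copy()
subset["x_e_out [-]"] = preds  # matched on index labels 1 and 3

# A quick guard before writing a submission: every row received a value
all_filled = bool(subset["x_e_out [-]"].notnull().all())
print(subset, all_filled)
```

If the two indexes did not overlap, the new column would silently contain NaN, which is why a check like `all_filled` is worth running before saving.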

In [36]:
fix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10415 entries, 4 to 31642
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           10415 non-null  int64  
 1   x_e_out [-]  10415 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 244.1 KB
