# Introduction
I use H2O AutoML with simple preprocessing.

With H2O AutoML, we can easily train and compare many models, including ensemble models.

# Import libraries

We need to install h2o first. The installation log is long, so I clear the cell output once the installation finishes.

In [None]:
!pip install h2o
from IPython.display import clear_output
clear_output()

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import matplotlib.pyplot as plt
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import h2o
from h2o.automl import H2OAutoML
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Read data

In [None]:
train_df = pd.read_csv('/kaggle/input/playground-series-s3e9/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s3e9/test.csv')

According to [Concrete Strength Prediction](https://www.kaggle.com/datasets/mchilamwar/predict-concrete-strength), the columns have the following meanings.

- CementComponent: amount of cement in the mix
- BlastFurnaceSlag: amount of blast furnace slag in the mix
- FlyAshComponent: amount of fly ash in the mix
- WaterComponent: amount of water in the mix
- SuperplasticizerComponent: amount of superplasticizer in the mix
- CoarseAggregateComponent: amount of coarse aggregate in the mix
- FineAggregateComponent: amount of fine aggregate in the mix
- AgeInDays: number of days the concrete was left to dry
- Strength: final strength of the concrete (target)

In [None]:
train_df.info()

In [None]:
test_df.info()

As shown above, neither the train nor the test data contains null values.
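
The same check can also be made explicit instead of being read off the `info()` output. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration, with one missing value planted on purpose); on the real data the same two lines would be applied to `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df; the values are made up
# and one NaN is planted deliberately to show what a hit looks like.
toy_df = pd.DataFrame({
    "CementComponent": [300.0, 250.0, np.nan],
    "WaterComponent": [180.0, 175.0, 190.0],
})

# Count missing values per column; all zeros would mean no nulls.
null_counts = toy_df.isna().sum()
has_nulls = bool(null_counts.any())

print(null_counts)
print("any nulls:", has_nulls)
```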

Next, I compare the distributions of the train and test data.

In [None]:
train_df_tmp = train_df.copy() 
test_df_tmp = test_df.copy() 
train_df_tmp['data_label'] = 'train'
test_df_tmp['data_label'] = 'test'
all_df = pd.concat([train_df_tmp, test_df_tmp], axis=0)
all_df.drop(columns='id', inplace=True)
all_df.head()

In [None]:
sns.pairplot(all_df, hue='data_label')

In [None]:
cols = all_df.columns.tolist()
cols.remove('data_label')
cols.remove('Strength')
fig, axes = plt.subplots(2, 4, figsize=(60, 30))
axes = axes.ravel()
for col, ax in zip(cols, axes):
    sns.boxplot(data=all_df, y=col, x='data_label', ax=ax)

plt.show()

The train data distribution is very similar to that of the test data, including the outliers.
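
A visual comparison can be backed up with a few summary statistics per split. This is only a sketch on a hypothetical toy frame (`toy_all_df` and its values are made up); it assumes the same `data_label` column that was added to `all_df` above.

```python
import pandas as pd

# Hypothetical toy data standing in for all_df; the values are made up.
toy_all_df = pd.DataFrame({
    "WaterComponent": [180.0, 190.0, 200.0, 182.0, 188.0, 202.0],
    "AgeInDays": [7, 28, 90, 7, 28, 90],
    "data_label": ["train"] * 3 + ["test"] * 3,
})

# One row per split, one (column, statistic) pair per output column.
summary = toy_all_df.groupby("data_label").agg(["median", "min", "max"])

# Absolute gap between the two splits for every statistic.
gap = (summary.loc["train"] - summary.loc["test"]).abs()

print(summary)
print(gap)
```

Small gaps relative to the spread of each column are consistent with similar distributions, although they do not prove it.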

Next, I check the correlations between the columns.

In [None]:
plot_col = train_df.columns.tolist()
plot_col.remove('id')
plot_col

In [None]:
corr = train_df[plot_col].corr()
mask = np.zeros_like(corr)
mask[np.tril_indices_from(mask)] = True
sns.heatmap(corr, cmap='Blues', annot=True, mask=mask.T)

`SuperplasticizerComponent` is negatively correlated with `WaterComponent`. Superplasticizers are also known as [high range water reducers](https://en.wikipedia.org/wiki/Superplasticizer), so adding superplasticizer may allow the amount of water to be reduced.

`SuperplasticizerComponent` is also correlated with `FlyAshComponent`. According to [this link](https://www.nbmcw.com/product-technology/construction-chemicals-waterproofing/concrete-admixtures/effect-of-blended-fly-ash-and-superplasticizer-on-strength-of-cement.html), fly ash combined with superplasticizer affects the strength.

Finally, `AgeInDays` is the feature most strongly correlated with `Strength`.
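
A reading of the heatmap can be cross-checked by ranking the features by their correlation with the target. The sketch below uses a hypothetical toy frame (`toy_train_df` and its values are made up); note that this is the Pearson correlation, which only captures linear relationships and is not the same thing as feature importance in a fitted model.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df[plot_col]; values are made up.
toy_train_df = pd.DataFrame({
    "AgeInDays": [3, 7, 28, 56, 90],
    "WaterComponent": [200.0, 180.0, 195.0, 185.0, 190.0],
    "Strength": [10.0, 20.0, 35.0, 42.0, 50.0],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy_train_df.corr()["Strength"].drop("Strength")

# Order by absolute value while keeping the sign of each correlation.
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)

print(target_corr)
```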

# Create and fit model

In [None]:
# preprocessing
def preprocessing(df):
    df['amount'] = df[['CementComponent', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent', 'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent']].sum(axis=1)
    df['CementComponent_ratio'] = df['CementComponent'] / (df['amount'] + 1e-6)
    df['BlastFurnaceSlag_ratio'] = df['BlastFurnaceSlag'] / (df['amount'] + 1e-6)
    df['FlyAshComponent_ratio'] = df['FlyAshComponent'] / (df['amount'] + 1e-6)
    df['WaterComponent_ratio'] = df['WaterComponent'] / (df['amount'] + 1e-6)
    df['SuperplasticizerComponent_ratio'] = df['SuperplasticizerComponent'] / (df['amount'] + 1e-6)
    df['CoarseAggregateComponent_ratio'] = df['CoarseAggregateComponent'] / (df['amount'] + 1e-6)
    df['FineAggregateComponent_ratio'] = df['FineAggregateComponent'] / (df['amount'] + 1e-6)
    # ratio to cement
    df['BlastFurnaceSlag_to_Cement_ratio'] = df['BlastFurnaceSlag'] / (df['CementComponent'] + 1e-6)
    df['FlyAshComponent_to_Cement_ratio'] = df['FlyAshComponent'] / (df['CementComponent'] + 1e-6)
    df['WaterComponent_to_Cement_ratio'] = df['WaterComponent'] / (df['CementComponent'] + 1e-6)
    df['SuperplasticizerComponent_to_Cement_ratio'] = df['SuperplasticizerComponent'] / (df['CementComponent'] + 1e-6)
    df['CoarseAggregateComponent_to_Cement_ratio'] = df['CoarseAggregateComponent'] / (df['CementComponent'] + 1e-6)
    df['FineAggregateComponent_to_Cement_ratio'] = df['FineAggregateComponent'] / (df['CementComponent'] + 1e-6)
    # other ratio
    df['SuperplasticizerComponent_to_FlyAshComponent_ratio'] = df['SuperplasticizerComponent'] / (df['FlyAshComponent'] + 1e-6)
    df['SuperplasticizerComponent_to_WaterComponent_ratio'] = df['SuperplasticizerComponent'] / (df['WaterComponent'] + 1e-6)
    df['CoarseAggregateComponent_to_FineAggregateComponent_ratio'] = df['CoarseAggregateComponent'] / (df['FineAggregateComponent'] + 1e-6)
    return df

In [None]:
train_df = preprocessing(train_df)
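
The share-of-total features in `preprocessing` follow one pattern, so they can also be generated in a loop, which avoids copy-and-paste slips. This is a sketch of that idea only, not a replacement for the function above: `add_total_ratios`, `EPS`, and the toy frame are hypothetical names and values introduced for illustration. Unlike `preprocessing`, it returns a copy instead of modifying the input frame.

```python
import pandas as pd

EPS = 1e-6  # same small constant used in preprocessing() to avoid division by zero


def add_total_ratios(df, components):
    """Return a copy of df with an 'amount' column and one '<name>_ratio' column per component."""
    out = df.copy()
    out["amount"] = out[components].sum(axis=1)
    for name in components:
        out[f"{name}_ratio"] = out[name] / (out["amount"] + EPS)
    return out


# Hypothetical toy data with only three components; the values are made up.
toy_df = pd.DataFrame({
    "CementComponent": [300.0, 200.0],
    "WaterComponent": [150.0, 200.0],
    "FlyAshComponent": [50.0, 0.0],
})
components = ["CementComponent", "WaterComponent", "FlyAshComponent"]

toy_features = add_total_ratios(toy_df, components)
ratio_cols = [f"{name}_ratio" for name in components]

print(toy_features[["amount"] + ratio_cols])
```

Because every component gets a ratio, the ratio columns of each row sum to approximately 1, which makes a convenient sanity check.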

Define the feature and target columns.

In [None]:
# feature_col = ['CementComponent', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent', 'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent', 'AgeInDays']
feature_col = train_df.columns.tolist()
feature_col.remove('id')
feature_col.remove('Strength')
target_col = 'Strength'

Start (or connect to) a local H2O cluster.

In [None]:
h2o.init() 

To use H2O AutoML, we need to convert the pandas DataFrame to an H2OFrame.

In [None]:
train_h2o = h2o.H2OFrame(train_df)

In [None]:
aml = H2OAutoML(max_models=30, seed=1)
aml.train(x=feature_col, y=target_col, training_frame = train_h2o)

Check the leaderboard.

In [None]:
aml.leaderboard

Get the best model.

In [None]:
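# aml.predict() uses the leader (best) model, so the AutoML object can be used directly.
# Alternatively, take the leader explicitly:
# model = aml.leader
model = aml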

# Predict

In [None]:
test_df = preprocessing(test_df)
test_h2o = h2o.H2OFrame(test_df[feature_col])

In [None]:
y_pred = h2o.as_list(model.predict(test_h2o), use_pandas=True)['predict'].tolist()

<a id="submit"></a>
# Submit your result

Submit the results using sample_submission.csv.

It is important not to write the DataFrame index as an extra column when saving the submission.
Either pass `index_col='id'` to `read_csv` or pass `index=False` to `to_csv`.
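
The two options can be compared on an in-memory file. This is a minimal sketch: the CSV text, the ids, and `y_pred_toy` are made up for illustration, and `io.StringIO` stands in for the real sample submission file.

```python
import io

import pandas as pd

# Made-up stand-in for sample_submission.csv.
sample_csv = "id,Strength\n5407,35.0\n5408,35.0\n"
y_pred_toy = [41.2, 28.7]  # hypothetical predictions

# Option 1: read 'id' as the index; to_csv then writes it back as the 'id' column.
sub_a = pd.read_csv(io.StringIO(sample_csv), index_col="id")
sub_a["Strength"] = y_pred_toy
csv_a = sub_a.to_csv()

# Option 2: keep 'id' as a regular column and suppress the default integer index.
sub_b = pd.read_csv(io.StringIO(sample_csv))
sub_b["Strength"] = y_pred_toy
csv_b = sub_b.to_csv(index=False)

print(csv_a)
print(csv_a == csv_b)
```

Both routes produce the same two-column file. Mixing them up (reading without `index_col` and writing without `index=False`) would add an unnamed index column to the submission.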

In [None]:
submission_df = pd.read_csv('/kaggle/input/playground-series-s3e9/sample_submission.csv', index_col='id')
submission_df[target_col] = y_pred
submission_df.to_csv('submission.csv')