## Competition [INGV - Volcanic Eruption Prediction](https://www.kaggle.com/competitions/predict-volcanic-eruptions-ingv-oe)

**Thanks to [😸 Basic preprocessing with CatBoost 😸](https://www.kaggle.com/code/miracl16/basic-preprocessing-with-catboost)**

### My upgrade:
* added descriptions of the solution and methods;
* updated and improved the layout.

<a class="anchor" id="0.1"></a>
# Table of Contents

1. [Import libraries](#1)
1. [Download data](#2)
1. [EDA & FE](#3)
1. [Modeling & Baseline](#4)
1. [Prediction & Submission](#5)

## 1. Import libraries<a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Work with Data
import numpy as np 
import pandas as pd 
import os

# Visualization
import matplotlib.pyplot as plt
from tqdm import tqdm

# Modeling and Prediction
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
from catboost import CatBoostRegressor, Pool

## 2. Download data<a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Download training and test data
DIR_PATH = '../input/predict-volcanic-eruptions-ingv-oe/'

train_list = os.listdir(DIR_PATH + 'train')
test_list = os.listdir(DIR_PATH + 'test')

print('Number of train files: {}'.format(len(train_list)))
print('Number of test files: {}'.format(len(test_list)))

In [None]:
# Display the first 5 rows of the example training data set
example_train_df = pd.read_csv(DIR_PATH + 'train/' + train_list[0])
example_train_df.head()

In [None]:
# Display the first 5 rows of the example test data set
example_test_df = pd.read_csv(DIR_PATH + 'test/' + test_list[0])
example_test_df.head()

In [None]:
# Load and display the training targets (segment_id, time_to_eruption)
train_time_to_eruption = pd.read_csv(DIR_PATH + 'train.csv')
train_time_to_eruption

## 3. EDA & FE <a class="anchor" id="3"></a>
[Back to Table of Contents](#0.1)

### Task description

This is a regression task. What does that imply?<br />
Regression models the relationship between a dependent variable and one or more independent variables.<br />
A regression algorithm learns a function that maps the input variables (x) to a continuous output variable (y); here, y is the time to eruption.
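
As a minimal illustration of such a mapping, the sketch below fits a straight line to synthetic data with NumPy. The data, the variable names, and the linear form are assumptions made for this example only; the notebook itself uses CatBoost further down.

```python
import numpy as np

# Hypothetical data: the output depends linearly on the input, plus noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 3.0 * x_demo + 2.0 + rng.normal(scale=0.5, size=x_demo.size)

# Fit the mapping y ~ slope * x + intercept by least squares
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
print(f"y ~ {slope:.2f} * x + {intercept:.2f}")
```

The fitted coefficients should land close to the values used to generate the data (3 and 2).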

In [None]:
# Visualize the sensor signals from the example train data set
example_train_df.plot(figsize=(15,15), subplots=True);

**What should we do with the raw sensor matrix?**<br/>
What should we pass to the algorithm as features?<br/>
<br/>
Each sensor produces a time series with its own characteristics: lows, highs, etc.<br/>
In essence, we compare summary properties of the signals: the number of peaks, frequency, variance, etc.<br/>
For each sensor, we compute a set of such statistics and pass them to the algorithm as features.<br/>
<br/>
Let's see how to do this on an example data set.
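
The idea can be sketched on synthetic data first. The two sensor columns and the four statistics below are assumptions chosen for brevity, not the competition's real layout; the point is only that a whole segment collapses into one row of features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one segment: two hypothetical sensors, 1000 readings each
rng = np.random.default_rng(0)
segment = pd.DataFrame({
    "sensor_1": rng.normal(0, 1, 1000),
    "sensor_2": rng.normal(5, 3, 1000),
})

# One summary value per (statistic, sensor) pair
stats = segment.agg(["mean", "std", "min", "max"])

# Flatten the table into a single row so the segment becomes one training example
features = {
    f"{sensor}_{stat}": stats.loc[stat, sensor]
    for sensor in stats.columns
    for stat in stats.index
}
row = pd.DataFrame([features])
print(row.shape)  # (1, 8)
```

Stacking one such row per segment gives a regular table that any tabular regressor can consume.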

In [None]:
# Prepare and display basic statistics for the example train data set
# replace missing values with 0, drop the 'count' row, rename the value column
process = pd.DataFrame(example_train_df.fillna(0).describe().iloc[1:, :].unstack()).reset_index()
process = process.rename(columns={0: 'value'})
process

In [None]:
# Build the feature names, remove the helper columns, and transpose the data set
process['feature'] = process['level_0'] + '_' + process['level_1']
process = process.drop(['level_0', 'level_1'], axis=1).set_index('feature').T

# Add the time to eruption for the segment this sensor file belongs to
process['time'] = train_time_to_eruption[train_time_to_eruption.segment_id == int(train_list[0].split('.')[0])].time_to_eruption.values[0]
process

#### What Is Skewness?
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data.<br />
If the curve is shifted to the left or to the right, it is said to be skewed.<br />
<br />
For normally distributed data, the skewness should be about zero.<br />
For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.
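
A quick check on synthetic samples makes the sign convention concrete. The two distributions below are assumptions for illustration: a normal sample should give a skewness near zero, and an exponential sample, with its long right tail, a clearly positive one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "symmetric": rng.normal(size=10_000),          # bell-shaped
    "right_skewed": rng.exponential(size=10_000),  # long right tail
})

# Sample skewness of each column
skewness = samples.skew()
print(skewness)
```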

In [None]:
# Computing the sample skewness for each sensor of the example train data set
pd.DataFrame(example_train_df.fillna(0).skew()).T

In [None]:
# Combine FE into a function so that it is convenient to apply to different data sets

def create_frame(data, data_time=None, type_data='train'):
    data = data.fillna(0)
    
    # get basic statistics (drop the 'count' row); copy so rows can be added safely
    data_transform = data.describe().iloc[1:, :].copy()
    
    # Additional params
    
    # Asymmetry coefficient
    data_transform.loc['skew'] = data.skew().tolist()
    
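    # Mean absolute deviation around the mean
    # (DataFrame.mad() was removed in pandas 2.0, so compute it directly)
    data_transform.loc['mad'] = (data - data.mean()).abs().mean().tolist()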
    
    # The kurtosis coefficient (pandas returns excess kurtosis)
    # measures how heavy the tails of the distribution are, often described as peakedness
    data_transform.loc['kurtosis'] = data.kurtosis().tolist()
    
    # Adding quantiles of the raw signal in 5% steps
    # (computed on `data`, not on the statistics table built so far;
    # 25%, 50% and 75% are skipped because describe() already provides them)
    for i in range(0, 100, 5):
        if i in (25, 50, 75):
            continue
        str_col = f"{i}%"
        data_transform.loc[str_col] = data.quantile(i / 100).tolist()
    
    # Combining all features into the one dataset
    data_transform = pd.DataFrame(data_transform.unstack()).reset_index()
    data_transform = data_transform.rename(columns={0: 'value'})
    data_transform['feature'] = data_transform['level_0'] + '_' + data_transform['level_1']
    data_transform = data_transform.drop(['level_0', 'level_1'], axis=1).set_index('feature').T
    
    # For the training data, add the time to eruption of this segment
    if type_data=='train':
        data_transform['time'] = data_time

    return data_transform

In [None]:
# Apply the function for each train data set
train_frames = []

for file in tqdm(train_list):
    df = pd.read_csv(DIR_PATH + 'train/' + file)
    segment_id = int(file.split('.')[0])
    data_time = train_time_to_eruption[train_time_to_eruption.segment_id == segment_id].time_to_eruption.values[0]
    train_frames.append(create_frame(df, data_time, type_data='train'))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_train = pd.concat(train_frames, ignore_index=True)

In [None]:
# Apply the function for each test data set
test_frames = []

for file in tqdm(test_list):
    df = pd.read_csv(DIR_PATH + 'test/' + file)
    test_frames.append(create_frame(df, data_time=None, type_data='test'))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_test = pd.concat(test_frames, ignore_index=True)

In [None]:
all_train.head()

In [None]:
all_test.head()

## 4. Modeling & Baseline<a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

#### What Is Baseline?
To judge whether an algorithm's predictions are good, we need a baseline prediction algorithm to compare against.<br />
A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any predictions for your problem, such as classification accuracy or RMSE.<br />
<br />
The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem.
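
One of the simplest possible baselines is a constant prediction. The sketch below is an illustration under assumed, synthetic targets (the value range and the `_demo` names are hypothetical); it is not the baseline this notebook trains, which is a CatBoost model with default settings.

```python
import numpy as np

# Hypothetical targets standing in for time-to-eruption values
rng = np.random.default_rng(0)
y_train_demo = rng.uniform(1e6, 5e7, size=300)
y_valid_demo = rng.uniform(1e6, 5e7, size=100)

# Constant baseline: always predict the median of the training targets
baseline_pred = np.full_like(y_valid_demo, np.median(y_train_demo))

# Any real model should achieve a lower MAE than this number
baseline_mae = np.mean(np.abs(y_valid_demo - baseline_pred))
print(f"Baseline MAE: {baseline_mae:.0f}")
```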

In [None]:
# Copy and split our data

test = all_test.copy()

X = all_train.drop('time',axis=1)
y = all_train['time']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, random_state=10)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, shuffle=True, random_state=10)

**Mean Absolute Percentage Error** (MAPE) is one of the most commonly used KPIs to measure forecast accuracy.<br/>
For each observation, the absolute error is divided by the actual value; MAPE is the average of these ratios.<br/>
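
A small worked example with made-up numbers shows the arithmetic. The three values below are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

y_true_demo = np.array([100.0, 200.0, 400.0])
y_pred_demo = np.array([110.0, 180.0, 400.0])

# Per-observation relative errors: 10/100, 20/200, 0/400
relative_errors = np.abs((y_pred_demo - y_true_demo) / y_true_demo)

# MAPE is their mean: (0.1 + 0.1 + 0.0) / 3
mape_value = relative_errors.mean()
print(round(mape_value, 4))  # 0.0667
```

Note that the ratio is undefined when an actual value is zero, so MAPE is only suitable for strictly non-zero targets.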

In [None]:
# Define a function for computing a MAPE
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred-y_true)/y_true))

In [None]:
# Train a model
clf = CatBoostRegressor(loss_function='MAPE')

train_dataset = Pool(data=X_train, label=y_train)
eval_dataset = Pool(data=X_val, label=y_val)

clf.fit(train_dataset,
            use_best_model=True,
            verbose = 100,
            eval_set=eval_dataset)

In [None]:
# Evaluate the model on the hold-out split
y_pred = clf.predict(Pool(data=X_test))
    
print(f"MAPE: {mape(y_test, y_pred)}")
print(f"MAE: {mean_absolute_error(y_test, y_pred)}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}")

## 5. Prediction & Submission<a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

**We are going to use KFold cross-validation with a few fixed CatBoostRegressor parameters.**<br/>
We skip grid search to save time.

[CatBoost](https://catboost.ai/docs) is a machine learning method based on [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) over decision trees.
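
The K-fold pattern itself can be sketched independently of CatBoost. In the example below, `LinearRegression` is a stand-in model and all data is synthetic; both are assumptions for illustration. Each fold's model is scored on the rows it did not see, and its predictions for new data are averaged across folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data; X_new stands in for the competition test set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_new = rng.normal(size=(10, 3))

cv_demo = KFold(n_splits=5, shuffle=True, random_state=10)
fold_mae = []
avg_pred = np.zeros(len(X_new))

for train_idx, val_idx in cv_demo.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # Score on the held-out fold only
    val_pred = model.predict(X_demo[val_idx])
    fold_mae.append(np.mean(np.abs(val_pred - y_demo[val_idx])))
    # Accumulate the average of the per-fold predictions on new data
    avg_pred += model.predict(X_new) / cv_demo.get_n_splits()

print(f"CV MAE: {np.mean(fold_mae):.3f} +/- {np.std(fold_mae):.3f}")
```

Averaging the fold models is a simple form of ensembling, and the spread of the per-fold scores gives a rough sense of how stable the estimate is.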

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=10)

In [None]:
n_fold = 5
cv = KFold(n_splits=n_fold, shuffle=True, random_state=10)
prediction = np.zeros(len(test))
mape_, mae, rmse = [], [], []

params = {
            'iterations':1000,
            'learning_rate':0.1,
            'depth':6,
            'eval_metric':'MAPE'
}

# Split the data into KFold blocks
for fold, (train_index, val_index) in enumerate(cv.split(X)):
    # Select the rows of the current fold
    X_train = X.iloc[train_index,:]
    X_val = X.iloc[val_index,:]

    y_train = y.iloc[train_index]
    y_val = y.iloc[val_index]
    
    # Train the model on the training data from the current fold
    clf = CatBoostRegressor(**params)  
    
    train_dataset = Pool(data=X_train, label=y_train)
    
    eval_dataset = Pool(data=X_val, label=y_val)
    
    clf.fit(train_dataset,
              use_best_model=True,
              verbose = 0,
              eval_set=eval_dataset)
   
    # Score on the held-out fold: X_test overlaps the training folds,
    # so evaluating on it would leak training rows into the CV metrics
    y_pred = clf.predict(Pool(data=X_val))
    
    mape_.append(mape(y_val, y_pred))
    mae.append(mean_absolute_error(y_val, y_pred))
    rmse.append(np.sqrt(mean_squared_error(y_val, y_pred)))

    print(f"fold: {fold}, MAPE: {mape_[-1]}")
    print(f"fold: {fold}, MAE: {mae[-1]}")
    print(f"fold: {fold}, RMSE: {rmse[-1]}")

    # test array predictions
    prediction += clf.predict(Pool(data=test))
        
prediction /= n_fold

print('CV mean MAPE:  {0:.4f}, std: {1:.4f}.'.format(np.mean(mape_), np.std(mape_)))
print('CV mean MAE: {0:.4f}, std: {1:.4f}.'.format(np.mean(mae), np.std(mae)))
print('CV mean RMSE: {0:.4f}, std: {1:.4f}.'.format(np.mean(rmse), np.std(rmse)))

In [None]:
# Display the submission dataframe
sub_example = pd.read_csv(DIR_PATH + 'sample_submission.csv')
sub_example.head()

In [None]:
# Display the first 5 segment ids
test_index = [int(i.split('.')[0]) for i in test_list]
test_index[:5]

In [None]:
# Saving the result into submission file
submission = pd.DataFrame()
submission['segment_id'] = test_index
submission['time_to_eruption'] = prediction
submission.to_csv('submission.csv', header=True, index=False) # Competition rules require that no index number be saved

submission

I hope you find this notebook useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0.1)