# Description of the project

The task is to prepare a prototype machine learning model for the Zifry company.
<br> The company develops solutions for the efficient operation of industrial enterprises.

The model should predict the recovery rate of gold from gold ore.
<br> Data with extraction and purification parameters are available.

The model will help optimize production so that plants with unprofitable characteristics are not launched.

Tasks:

- Prepare data;
- Conduct exploratory data analysis;
- Build and train the model.

Gold is obtained from ore as follows:

After primary processing, the mined ore becomes a crushed mixture.
<br> It is sent for `flotation (enrichment)` and `two-stage cleaning`:

1. Flotation
The gold-bearing ore mixture is fed into the flotation plant. Enrichment yields a rougher concentrate and “dump tails” (rougher tails), that is, product residues with a low concentration of valuable metals.
The stability of this process depends on the physical and chemical state of the flotation pulp (a mixture of solid particles and liquid), which can be unstable and non-optimal.
2. Cleaning
The rougher concentrate goes through two cleaning stages. The output is the final concentrate and new tails.

Two quantities need to be predicted at once:
- enrichment efficiency of the rougher concentrate `rougher.output.recovery`;
- enrichment efficiency of the final concentrate `final.output.recovery`.

# Description of data

The data is in three files:

- gold_recovery_train.csv - train sample;
- gold_recovery_test.csv - test sample;
- gold_recovery_full.csv - initial data.

The data is indexed by the date and time the information was received (the date attribute). Parameters adjacent in time are often similar.

Some parameters are not available because they are measured and/or calculated much later. Because of this, the test set lacks some features that may be in the train set. Also, there are no target features in the test set.

The initial dataset contains the train and test sets with all features.

This is raw data: it has just been unloaded from storage. Before building the model, it is necessary to check the data for correctness.


Technological process:
- Rougher feed - feedstock
- Rougher additions (or reagent additions) - flotation reagents: Xanthate, Sulphate, Depressant
  - Xanthate - xanthate (promoter, or flotation activator);
  - Sulphate - sulfate (in this production, sodium sulfide);
  - Depressant - depressant (sodium silicate).
- Rougher process - flotation
- Rougher tails
- Float banks - flotation unit
- Cleaner process - cleaning
- Rougher Au - rough gold concentrate
- Final Au - final gold concentrate

Stage parameters
- air amount — air volume
- fluid levels - fluid level
- feed size - feed granule size
- feed rate - feed rate


Feature names follow the pattern `[stage].[parameter_type].[parameter_name]`
Example: `rougher.input.feed_ag`

Possible values for the `[stage]` block:
- rougher - flotation
- primary_cleaner - primary cleaning
- secondary_cleaner - secondary cleaning
- final - final characteristics

Possible values for the `[parameter_type]` block:
- input — raw material parameters
- output — product parameters
- state — parameters characterizing the current state of the stage
- calculation - calculated characteristics

# Action plan

1. Prepare data
- 1.1. Open files and examine them.
- 1.2. Verify that the enrichment efficiency is calculated correctly. Calculate it on the train sample for the feature `rougher.output.recovery`. Find `MAE` between calculation and feature value. Describe findings.
- 1.3. Analyze features that are not available in the test sample. What are these parameters? What type are they?
- 1.4. Perform data preprocessing.
2. Analyze the data
- 2.1. See how the concentration of metals (Au, Ag, Pb) changes at different stages of purification. Describe findings.
- 2.2. Compare the size distributions of raw material granules on the train and test samples. If the distributions differ substantially, the model evaluation will be unreliable.
- 2.3. Investigate the total concentration of all substances at different stages: in raw materials, in rougher and final concentrates. Are there any anomalous values in the total distribution? If so, should they be removed from both samples? Describe findings and remove anomalies.
3. Build the model
- 3.1. Write a function to calculate the final `sMAPE`.
- 3.2. Train different models and evaluate their quality by cross-validation. Choose the best model and test it on a test set. Describe findings.

# Data preparation

## Import data files, study general information

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
from IPython.display import display
from IPython.display import display_html


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

from scipy import stats as st

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [None]:
df_train = pd.read_csv('datasets/gold_recovery_train.csv', sep=',')
df_test = pd.read_csv('datasets/gold_recovery_test.csv', sep=',')
df_full = pd.read_csv('datasets/gold_recovery_full.csv', sep=',')

# df_train = pd.read_csv('/datasets/gold_recovery_train.csv', sep=',')
# df_test = pd.read_csv('/datasets/gold_recovery_test.csv', sep=',')
# df_full = pd.read_csv('/datasets/gold_recovery_full.csv', sep=',')

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_full.info()

In [None]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

#display_side_by_side(df_train.head(),df_test.head(),df_full.head())

## Verification of enrichment efficiency calculation

Function for calculating enrichment efficiency

In [None]:
def recovery(df):
    
    """
    C — share of gold in concentrate after flotation/refining;
    F — share of gold in raw material/concentrate before flotation/refining;
    T — share of gold in final tailings after flotation/cleaning.
    """
    C = df['rougher.output.concentrate_au']
    F = df['rougher.input.feed_au']
    T = df['rougher.output.tail_au']
    
    return 100 * C * (F - T) / (F * (C - T))
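
For reference, the `recovery` function above evaluates the following expression, written with the same symbols as its docstring. This is only a restatement of the code, not an additional source:

$$
\text{Recovery} = \frac{C \cdot (F - T)}{F \cdot (C - T)} \cdot 100\%
$$

Rows where $F = 0$ or $C = T$ make the denominator zero, which is one way infinite or missing values can appear in the computed series.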

Let's calculate the enrichment efficiency

In [None]:
not_my = df_train['rougher.output.recovery']
my = recovery(df_train)

Let's check if there are missing or infinite values in the received samples

In [None]:
print(np.any(np.isnan(not_my)))
print(np.all(np.isfinite(not_my)))
print()
print(np.any(np.isnan(my)))
print(np.all(np.isfinite(my)))

Let's put both series into one dataframe so that rows with a missing value in either series are dropped from both at once

In [None]:
table = pd.DataFrame()
table['not_my'] = not_my
table['my'] = my
table = table.dropna()

In [None]:
print(np.any(np.isnan(table)))
print(np.all(np.isfinite(table)))

Calculate MAE

In [None]:
MAE = mean_absolute_error(table['not_my'], table['my'])
print('MAE:', MAE)

### Conclusion

The MAE is very close to zero, which means that the recovery values in the data agree with the formula and that the formula has been interpreted correctly.

## Analysis of features of the train set that are not available in the test set

Find features that are missing in the test set

In [None]:
col_train = pd.DataFrame(df_train.columns.to_list())
col_test = pd.DataFrame(df_test.columns.to_list())

In [None]:
col_concat = pd.concat([col_train,col_test]).drop_duplicates(keep=False)
col_concat
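
The same comparison can be sketched with `pandas.Index.difference`, which also makes it easy to see which `[parameter_type]` blocks the missing features belong to. The column names below are a hypothetical miniature of the real files, chosen only to follow the naming scheme described earlier:

```python
import pandas as pd

# Hypothetical miniature column sets that follow the naming scheme above.
train_columns = pd.Index(['date', 'rougher.input.feed_au',
                          'rougher.output.recovery', 'final.output.tail_au',
                          'rougher.calculation.au_pb_ratio'])
test_columns = pd.Index(['date', 'rougher.input.feed_au'])

# Columns present in train but absent from test (returned sorted).
missing = train_columns.difference(test_columns)

# The second dot-separated block is the parameter type.
parameter_types = sorted({name.split('.')[1] for name in missing})

print(list(missing))
print(parameter_types)
```

On the real data, `df_train.columns.difference(df_test.columns)` would play the role of `missing`.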

Let's see which dates the test set covers and compare them with the train set

In [None]:
display(df_train['date'].sort_values())
display(df_test['date'].sort_values())

In [None]:
date_train = pd.DataFrame(df_train['date'].to_list())
date_test = pd.DataFrame(df_test['date'].to_list())

In [None]:
date_concat = pd.concat([date_train,date_test]).drop_duplicates(keep=False)
date_concat
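
A more direct way to measure how much two date columns overlap is `Series.isin`. The sketch below uses made-up timestamps (an assumption for illustration, not values from the files):

```python
import pandas as pd

# Made-up timestamps; the real columns would be df_train['date'] and df_test['date'].
train_dates = pd.Series(pd.to_datetime([
    '2016-01-15 00:00:00', '2016-01-15 01:00:00', '2016-01-15 02:00:00']))
test_dates = pd.Series(pd.to_datetime([
    '2016-01-15 02:00:00', '2016-09-01 00:59:59']))

# Boolean mask: True where a test timestamp also occurs in train.
in_train = test_dates.isin(train_dates)

shared = test_dates[in_train]
share_of_test_in_train = in_train.mean()

print(len(shared), share_of_test_in_train)
```

A share of 1.0 would mean every test date also appears in train, and 0.0 would mean the two samples cover disjoint dates.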

Most likely, all features that are not available in the test set are computed after the fact,
<br>since all of them refer either to product (output) parameters or to calculated characteristics.
<br>The dates in the test sample completely coincide with the dates in the train sample, which again suggests that the unavailable features are computed later.

# Data analysis

## Change in the concentration of metals (Au, Ag, Pb) at different stages of purification

Let's plot the concentrations of Au, Ag and Pb at each stage over the first 50 records (about two days)

In [None]:
hours = 50
figsize = (10,10)

plt.figure(figsize=figsize)

plt.plot(df_train['rougher.output.concentrate_au'][:hours], label='au rougher')
plt.plot(df_train['primary_cleaner.output.concentrate_au'][:hours], label='au primary_cleaner')
plt.plot(df_train['final.output.concentrate_au'][:hours], label='au final')

legend = plt.legend(loc='lower left', shadow=False, fontsize='medium')
plt.title('Au')

plt.show()


plt.figure(figsize=figsize)

plt.plot(df_train['rougher.output.concentrate_ag'][:hours], label='ag rougher')
plt.plot(df_train['primary_cleaner.output.concentrate_ag'][:hours], label='ag primary_cleaner')
plt.plot(df_train['final.output.concentrate_ag'][:hours], label='ag final')

legend = plt.legend(loc='lower left', shadow=False, fontsize='medium')
plt.title('Ag')

plt.show()


plt.figure(figsize=figsize)

plt.plot(df_train['rougher.output.concentrate_pb'][:hours], label='pb rougher')
plt.plot(df_train['primary_cleaner.output.concentrate_pb'][:hours], label='pb primary_cleaner')
plt.plot(df_train['final.output.concentrate_pb'][:hours], label='pb final')

legend = plt.legend(loc='lower left', shadow=False, fontsize='medium')
plt.title('Pb')

plt.show()
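
Besides line plots over a short window, the trend across stages can be summarized with one mean per stage. The sketch below does this for Au on a tiny made-up frame; the numbers are hypothetical and only the column naming follows the dataset:

```python
import pandas as pd

# Made-up Au concentrations for two records.
toy = pd.DataFrame({
    'rougher.output.concentrate_au': [19.0, 21.0],
    'primary_cleaner.output.concentrate_au': [31.0, 33.0],
    'final.output.concentrate_au': [43.0, 45.0],
})

stages = ['rougher', 'primary_cleaner', 'final']

# Mean concentration per stage, in processing order.
mean_by_stage = pd.Series(
    {stage: toy[f'{stage}.output.concentrate_au'].mean() for stage in stages})

print(mean_by_stage)
```

Replacing `toy` with `df_train` and looping over `au`, `ag` and `pb` would give the same summary for the real data.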

### Conclusion

The concentration of gold increases with each purification stage, as expected.
<br> The concentration of silver decreases with each stage, which is also expected.
<br> The concentration of lead, however, grows. This is probably a property of the process: lead is not yet removed at these stages.
<br> Abnormal values are also visible; they may be related to the technological process.

## Comparison of the size distribution of raw material granules on the train and test samples

### Rougher

Let's test whether the two populations have equal means, based on their samples

In [None]:
results = st.ttest_ind(df_train['rougher.input.feed_size'].dropna(), df_test['rougher.input.feed_size'].dropna(), equal_var=False)
print('p-value:',results.pvalue)

The p-value above concerns only the means of `rougher.input.feed_size`; the plot below compares the full distributions.

In [None]:
# hist_data = [df_train['rougher.input.feed_size'].dropna(), df_test['rougher.input.feed_size'].dropna()]
# group_labels = ['train', 'test']

# fig = ff.create_distplot(hist_data, group_labels, bin_size=0.2)
# fig.show()

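# sns.distplot is deprecated; histplot with kde=True draws a comparable density plot
sns.histplot(df_train['rougher.input.feed_size'].dropna(), kde=True, stat='density', color='C0', label='train')
sns.histplot(df_test['rougher.input.feed_size'].dropna(), kde=True, stat='density', color='C1', label='test')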
plt.legend()

The distributions at the rougher stage do not differ much from each other.
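
Since a t-test compares only the means, a two-sample Kolmogorov-Smirnov test is one possible complement when the whole distribution matters. This is a sketch on synthetic samples, not the method used in this notebook; the sample parameters are arbitrary assumptions:

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(12345)

# Two independent samples from the same distribution, and one clearly shifted sample.
sample_a = rng.normal(loc=60.0, scale=5.0, size=500)
sample_b = rng.normal(loc=60.0, scale=5.0, size=500)
sample_shifted = sample_a + 100.0

similar = st.ks_2samp(sample_a, sample_b)
shifted = st.ks_2samp(sample_a, sample_shifted)

# The KS statistic is the largest gap between the two empirical CDFs (0 to 1).
print('similar:', similar.statistic, similar.pvalue)
print('shifted:', shifted.statistic, shifted.pvalue)
```

A small statistic with a large p-value gives no evidence that the distributions differ; the shifted sample produces a statistic near 1.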

### Primary_cleaner

Let's test whether the two populations have equal means, based on their samples

In [None]:
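# Drop missing values first, as in the rougher check: NaN in either sample would make the p-value NaN
results = st.ttest_ind(df_train['primary_cleaner.input.feed_size'].dropna(), df_test['primary_cleaner.input.feed_size'].dropna(), equal_var=False)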
print('p-value:',results.pvalue)

In [None]:
# hist_data = [df_train['primary_cleaner.input.feed_size'], df_test['primary_cleaner.input.feed_size']]
# group_labels = ['train', 'test']

# fig = ff.create_distplot(hist_data, group_labels, bin_size=0.2)
# fig.show()

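# sns.distplot is deprecated; histplot with kde=True draws a comparable density plot
sns.histplot(df_train['primary_cleaner.input.feed_size'].dropna(), kde=True, stat='density', color='C0', label='train')
sns.histplot(df_test['primary_cleaner.input.feed_size'].dropna(), kde=True, stat='density', color='C1', label='test')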
plt.legend()

The distributions in the primary_cleaner stage do not differ much from each other.

## Study of the total concentration of all substances at different stages: in raw materials, in roughing and final concentrates

### Total concentration at each stage

In [None]:
df_train['sum_rougher'] = df_train['rougher.output.concentrate_ag'] + df_train['rougher.output.concentrate_pb'] + \
                          df_train['rougher.output.concentrate_sol'] + df_train['rougher.output.concentrate_au']

df_train['sum_primary'] = df_train['primary_cleaner.output.concentrate_ag'] + df_train['primary_cleaner.output.concentrate_pb'] + \
                          df_train['primary_cleaner.output.concentrate_sol'] + df_train['primary_cleaner.output.concentrate_au']

df_train['sum_final'] = df_train['final.output.concentrate_ag'] + df_train['final.output.concentrate_pb'] + \
                        df_train['final.output.concentrate_sol'] + df_train['final.output.concentrate_au']
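
The three sums above share one pattern, so they could also be built from a list of substance suffixes. In this sketch the frame and its values are made up; `skipna=False` is passed so that a missing component yields a missing total, matching the behavior of `+`:

```python
import numpy as np
import pandas as pd

# Made-up rougher concentrations; the second record has a missing Pb value.
toy = pd.DataFrame({
    'rougher.output.concentrate_ag': [11.0, 12.0],
    'rougher.output.concentrate_pb': [7.0, np.nan],
    'rougher.output.concentrate_sol': [28.0, 29.0],
    'rougher.output.concentrate_au': [20.0, 21.0],
})

substances = ['ag', 'pb', 'sol', 'au']
columns = [f'rougher.output.concentrate_{s}' for s in substances]

# Row-wise total; skipna=False propagates NaN just like chained `+` does.
toy['sum_rougher'] = toy[columns].sum(axis=1, skipna=False)

print(toy['sum_rougher'])
```

Without `skipna=False`, `sum(axis=1)` would silently treat the missing value as zero and understate the total.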

In [None]:
hours = 200
figsize = (5,5)

plt.figure(figsize=figsize)

plt.plot(df_train['sum_rougher'][:hours], label='sum rougher')
plt.plot(df_train['sum_primary'][:hours], label='sum primary_cleaner')
plt.plot(df_train['sum_final'][:hours], label='sum final')

legend = plt.legend(loc='lower left', shadow=False, fontsize='small')
plt.title('Sum concentrations')

plt.show()

sns.displot(df_train['sum_rougher'], kde=True, height=5, aspect=1)
sns.displot(df_train['sum_primary'], kde=True, height=5, aspect=1)
sns.displot(df_train['sum_final'], kde=True, height=5, aspect=1)

The graphs show anomalous values in the total concentration distributions.
<br> This is probably due to the technological process.
<br> For example, anomalous values may appear during maintenance.
<br> These anomalies need to be removed, as they may affect the model's predictions.

Before removing the anomalies, drop the rows with missing values

In [None]:
df_train.dropna(inplace=True)
df_train.info()

We define functions for finding the upper and lower boundaries of the distribution

In [None]:
def bot_line(name):
    
    Q1 = df_train[name].quantile(0.25)
    Q3 = df_train[name].quantile(0.75)
    IQR = Q3 - Q1
    return Q1 - 3*IQR
    
    
def top_line(name):
    
    Q1 = df_train[name].quantile(0.25)
    Q3 = df_train[name].quantile(0.75)
    IQR = Q3 - Q1
    return Q3 + 3*IQR  
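
To make the rule concrete, here is the same `3 * IQR` boundary applied to a small made-up series. The helper `iqr_bounds` and the numbers are illustrative assumptions; it takes the series as an argument instead of reading `df_train`:

```python
import pandas as pd

def iqr_bounds(series, k=3.0):
    """Return (lower, upper) bounds located k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up totals: six typical values and two obvious anomalies (0 and 120).
toy = pd.Series([50.0, 52.0, 51.0, 49.0, 53.0, 50.0, 0.0, 120.0])

low, high = iqr_bounds(toy)

# Keep only the values strictly inside the bounds.
kept = toy[(toy > low) & (toy < high)]

print(low, high)
print(kept.tolist())
```

Here the quartiles are 49.75 and 52.25, so the bounds are 42.25 and 59.75 and both anomalies fall outside them.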

At the flotation stage

In [None]:
print(df_train['sum_rougher'].describe())
print()

index = df_train[(df_train['sum_rougher'] <= bot_line('sum_rougher')) | (df_train['sum_rougher'] >= top_line('sum_rougher'))].index
df_train = df_train.drop(index)
sns.displot(df_train['sum_rougher'], kde=True, height=5, aspect=1)
plt.show()

print(df_train['sum_rougher'].describe())
print()

During the first stage of cleaning

In [None]:
print(df_train['sum_primary'].describe())
print()

index = df_train[(df_train['sum_primary'] <= bot_line('sum_primary')) | (df_train['sum_primary'] >= top_line('sum_primary'))].index
df_train = df_train.drop(index)
sns.displot(df_train['sum_primary'], kde=True, height=5, aspect=1)
plt.show()

print(df_train['sum_primary'].describe())
print()

At the final stage

In [None]:
print(df_train['sum_final'].describe())
print()

index = df_train[(df_train['sum_final'] <= bot_line('sum_final')) | (df_train['sum_final'] >= top_line('sum_final'))].index
df_train = df_train.drop(index)
sns.displot(df_train['sum_final'], kde=True, height=5, aspect=1)
plt.show()

print(df_train['sum_final'].describe())
print()

As the summary statistics and the graphs show, the mean has not changed much, but the distribution now has far fewer extreme values.

# Model building

## Preparing data for the model

Two datasets need to be prepared.
<br> The first is for predicting the recovery of gold after flotation.
<br> The second is for predicting the recovery of gold after cleaning.

<br> All the required non-target features are contained in the train dataset.
<br> For the first dataset, select features of the `rougher` stage and add the target feature `rougher.output.recovery`.
<br> For the second dataset, select all features available in the test set and add the target feature `final.output.recovery`.

Variables with the suffix `_r` in the name refer to the rougher concentrate, and those with `_f` refer to the final concentrate.

In [None]:
col_r = df_test.columns[13:23].to_list()
col_r.append('rougher.output.recovery')
col_r

In [None]:
df_r = df_train[col_r]
df_r.head()

In [None]:
col_f = df_test.columns[1:].to_list()
col_f.append('final.output.recovery')
col_f

In [None]:
df_f = df_train[col_f]
df_f.head()

Let's divide the samples into sets with features and a target feature.

In [None]:
features_r = df_r.drop(['rougher.output.recovery'], axis=1)
target_r = df_r['rougher.output.recovery']

features_f = df_f.drop(['final.output.recovery'], axis=1)
target_f = df_f['final.output.recovery']

Let's divide each sample into two: train, validation in the ratio `3 : 1`.

In [None]:
# Split features and target in one call so that their rows stay aligned
features_train_r, features_valid_r, target_train_r, target_valid_r = train_test_split(
    features_r, target_r, test_size=0.25, random_state=12345)

print(features_r.shape)
print(features_train_r.shape)
print(features_valid_r.shape)
print(target_train_r.shape)
print(target_valid_r.shape)
print()


# Split features and target in one call so that their rows stay aligned
features_train_f, features_valid_f, target_train_f, target_valid_f = train_test_split(
    features_f, target_f, test_size=0.25, random_state=12345)

print(features_f.shape)
print(features_train_f.shape)
print(features_valid_f.shape)
print(target_train_f.shape)
print(target_valid_f.shape)
print()

Scale features

In [None]:
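# Columns to scale: every column except the target, which is the last element.
# col_f starts at the first feature (the date column was already skipped), so nothing else is dropped.
num_r = col_r[:-1]
num_f = col_f[:-1]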


scaler_r = StandardScaler()
scaler_r.fit(features_train_r.loc[:, num_r])
features_train_r.loc[:, num_r] = scaler_r.transform(features_train_r.loc[:, num_r])
features_valid_r.loc[:, num_r] = scaler_r.transform(features_valid_r.loc[:, num_r])
features_r.loc[:, num_r] = scaler_r.transform(features_r.loc[:, num_r])

scaler_f = StandardScaler()
scaler_f.fit(features_train_f.loc[:, num_f])
features_train_f.loc[:, num_f] = scaler_f.transform(features_train_f.loc[:, num_f])
features_valid_f.loc[:, num_f] = scaler_f.transform(features_valid_f.loc[:, num_f])
features_f.loc[:, num_f] = scaler_f.transform(features_f.loc[:, num_f])

In [None]:
features_train_r.head()

## Model training

Define a function for calculating the sMAPE metric

In [None]:
def sMAPE(target, predicted):
    
    part = 100 * abs(target - predicted) / ((abs(target) + abs(predicted)) / 2)
    full = part.sum() / len(target)
    return full
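
As a quick sanity check of the metric, the sketch below evaluates an equivalent NumPy version on two made-up points. The name `smape_np` and the numbers are illustrative assumptions, not part of the project data:

```python
import numpy as np

def smape_np(target, predicted):
    """Symmetric MAPE in percent, using the same formula as sMAPE above."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denominator = (np.abs(target) + np.abs(predicted)) / 2
    return float(np.mean(100 * np.abs(target - predicted) / denominator))

# Made-up values: the first prediction is off by 10, the second is exact.
value = smape_np([100.0, 50.0], [110.0, 50.0])

print(round(value, 2))  # 4.76
```

The first point contributes `100 * 10 / 105`, the second contributes 0, and the mean of the two is about 4.76. If a target and its prediction are both zero, the denominator is zero and the metric is undefined for that row.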

Define a function for calculating model metrics

In [None]:
def model_results(model, features, target):

    predicted = model.predict(features)
    predicted = pd.Series(predicted, index=target.index) 

    # MAE is reported as is; a square root applies to MSE (to get RMSE), not to MAE
    MAE = mean_absolute_error(target, predicted)
    sMAPE_val = sMAPE(target, predicted)
    
    return (MAE, sMAPE_val)
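
The plan calls for cross-validation, and by default scikit-learn scores regressors with R² rather than sMAPE. One possible way to cross-validate on sMAPE directly is sketched below on synthetic data. The names `smape`, `smape_scorer`, `X` and `y` are assumptions for illustration, and the pipeline refits the scaler inside every fold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def smape(target, predicted):
    """Symmetric MAPE in percent."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denominator = (np.abs(target) + np.abs(predicted)) / 2
    return float(np.mean(100 * np.abs(target - predicted) / denominator))

# greater_is_better=False makes scikit-learn negate the metric,
# so that "higher score is better" still holds internally.
smape_scorer = make_scorer(smape, greater_is_better=False)

# Synthetic regression data with a target around 80, loosely like a recovery percentage.
rng = np.random.default_rng(12345)
X = rng.normal(size=(200, 3))
y = 80 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

pipeline = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=5, scoring=smape_scorer)

# Flip the sign back to report sMAPE as a positive percentage.
mean_smape = -scores.mean()
print(mean_smape)
```

The same `smape_scorer` could be passed as `scoring=` to `GridSearchCV` so that hyperparameters are selected by sMAPE.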

### Linear regression

#### Efficiency of crude concentrate enrichment

In [None]:
model_r = LinearRegression()
model_r.fit(features_train_r, target_train_r)

results = model_results(model_r, features_valid_r, target_valid_r)

print('Mean:', '{:,.2f}'.format(target_valid_r.mean()))
print("MAE_valid_r:", results[0])
print("sMAPE_valid_r:", results[1])
print()

results = model_results(model_r, features_r, target_r)

print('Mean:', '{:,.2f}'.format(target_r.mean()))
print("MAE_r:", results[0])
print("sMAPE_r:", results[1])
print()

#### Final concentrate enrichment efficiency

In [None]:
model_f = LinearRegression()
model_f.fit(features_train_f, target_train_f)

results = model_results(model_f, features_valid_f, target_valid_f)

print('Mean:', '{:,.2f}'.format(target_valid_f.mean()))
print("MAE_valid_f:", results[0])
print("sMAPE_valid_f:", results[1])
print()

results = model_results(model_f, features_f, target_f)

print('Mean:', '{:,.2f}'.format(target_f.mean()))
print("MAE_f:", results[0])
print("sMAPE_f:", results[1])
print()

### Decision tree

#### Efficiency of crude concentrate enrichment

In [None]:
param_grid = {'max_depth': range(1,10)}

dtr_r = GridSearchCV(estimator=DecisionTreeRegressor(random_state=12345), param_grid=param_grid, cv=5)
dtr_r.fit(features_train_r, target_train_r)
dtr_r.best_params_

In [None]:
results = model_results(dtr_r, features_valid_r, target_valid_r)

print('Mean:', '{:,.2f}'.format(target_valid_r.mean()))
print("MAE_valid_r:", results[0])
print("sMAPE_valid_r:", results[1])
print()

results = model_results(dtr_r, features_r, target_r)

print('Mean:', '{:,.2f}'.format(target_r.mean()))
print("MAE_r:", results[0])
print("sMAPE_r:", results[1])
print()

#### Final concentrate enrichment efficiency

In [None]:
param_grid = {'max_depth': range(1,10)}

dtr_f = GridSearchCV(estimator=DecisionTreeRegressor(random_state=12345), param_grid=param_grid, cv=5)
dtr_f.fit(features_train_f, target_train_f)
dtr_f.best_params_

In [None]:
results = model_results(dtr_f, features_valid_f, target_valid_f)

print('Mean:', '{:,.2f}'.format(target_valid_f.mean()))
print("MAE_valid_f:", results[0])
print("sMAPE_valid_f:", results[1])
print()

results = model_results(dtr_f, features_f, target_f)

print('Mean:', '{:,.2f}'.format(target_f.mean()))
print("MAE_f:", results[0])
print("sMAPE_f:", results[1])
print()

### Random Forest

#### Efficiency of crude concentrate enrichment

In [None]:
param_grid = {'n_estimators': range(1,10), 'max_depth': range(1,10)}

rfr_r = GridSearchCV(estimator=RandomForestRegressor(random_state=12345), param_grid=param_grid, cv=5)
rfr_r.fit(features_train_r, target_train_r)
rfr_r.best_params_

In [None]:
results = model_results(rfr_r, features_valid_r, target_valid_r)

print('Mean:', '{:,.2f}'.format(target_valid_r.mean()))
print("MAE_valid_r:", results[0])
print("sMAPE_valid_r:", results[1])
print()

results = model_results(rfr_r, features_r, target_r)

print('Mean:', '{:,.2f}'.format(target_r.mean()))
print("MAE_r:", results[0])
print("sMAPE_r:", results[1])
print()

#### Final concentrate enrichment efficiency

In [None]:
param_grid = {'n_estimators': range(1,10), 'max_depth': range(1,10)}

rfr_f = GridSearchCV(estimator=RandomForestRegressor(random_state=12345), param_grid=param_grid, cv=5)
rfr_f.fit(features_train_f, target_train_f)
rfr_f.best_params_

In [None]:
results = model_results(rfr_f, features_valid_f, target_valid_f)

print('Mean:', '{:,.2f}'.format(target_valid_f.mean()))
print("MAE_valid_f:", results[0])
print("sMAPE_valid_f:", results[1])
print()

results = model_results(rfr_f, features_f, target_f)

print('Mean:', '{:,.2f}'.format(target_f.mean()))
print("MAE_f:", results[0])
print("sMAPE_f:", results[1])
print()

## Checking models on a test set

### Preparing test data

Add the target features to the test sample, taking them from the full sample by date.

In [None]:
df_test = df_test.merge(df_full[['date','rougher.output.recovery']], on='date', how='left')
df_test.head()

In [None]:
df_test = df_test.merge(df_full[['date','final.output.recovery']], on='date', how='left')
df_test.head()
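
Both targets can also be attached in a single merge. The sketch below uses made-up frames and adds `validate='one_to_one'`, an optional `merge` argument that raises an error if `date` is duplicated on either side, which would otherwise multiply rows silently:

```python
import pandas as pd

# Made-up miniature versions of the test and full samples.
test_toy = pd.DataFrame({'date': ['d1', 'd2'], 'feature': [1.0, 2.0]})
full_toy = pd.DataFrame({
    'date': ['d1', 'd2', 'd3'],
    'rougher.output.recovery': [85.0, 86.0, 87.0],
    'final.output.recovery': [65.0, 66.0, 67.0],
})

targets = ['rougher.output.recovery', 'final.output.recovery']

# Left join keeps every test row; validate checks that dates are unique on both sides.
merged = test_toy.merge(full_toy[['date'] + targets], on='date',
                        how='left', validate='one_to_one')

print(merged)
```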

In [None]:
def recovery_r(df):
    
    """
    C — share of gold in concentrate after flotation/refining;
    F — share of gold in raw material/concentrate before flotation/refining;
    T — share of gold in final tailings after flotation/cleaning.
    """
    C = df['rougher.output.concentrate_au']
    F = df['rougher.input.feed_au']
    T = df['rougher.output.tail_au']
    
    return 100 * C * (F - T) / (F * (C - T))

In [None]:
def recovery_f(df):
    
    """
    C — share of gold in concentrate after flotation/refining;
    F — share of gold in raw material/concentrate before flotation/refining;
    T — share of gold in final tailings after flotation/cleaning.
    """
    C = df['final.output.concentrate_au']
    F = df['rougher.output.concentrate_au']
    T = df['secondary_cleaner.output.tail_au']
    
    return 100 * C * (F - T) / (F * (C - T))

Let's split the test sample into one part for the rougher concentrate and one for the final concentrate.

In [None]:
df_test.dropna(inplace=True)

df_test_r = df_test[col_r]
df_test_f = df_test[col_f]

Let's divide the sample into sets with features and a target feature.

In [None]:
features_test_r = df_test_r.drop(['rougher.output.recovery'], axis=1)
target_test_r = df_test_r['rougher.output.recovery']

features_test_f = df_test_f.drop(['final.output.recovery'], axis=1)
target_test_f = df_test_f['final.output.recovery']

Scale features

In [None]:
features_test_r.loc[:, num_r] = scaler_r.transform(features_test_r.loc[:, num_r])

features_test_f.loc[:, num_f] = scaler_f.transform(features_test_f.loc[:, num_f])

### Checking models on a test dataset

#### Linear Regression

In [None]:
Total = 0

results = model_results(model_r, features_test_r, target_test_r)
Total += 0.25*results[1]

print('Mean:', '{:,.2f}'.format(target_test_r.mean()))
print("MAE_test_r:", results[0])
print("sMAPE_test_r:", results[1])
print()

results = model_results(model_f, features_test_f, target_test_f)
Total += 0.75*results[1]

print('Mean:', '{:,.2f}'.format(target_test_f.mean()))
print("MAE_test_f:", results[0])
print("sMAPE_test_f:", results[1])
print()

print('Total sMAPE:', Total)
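
The accumulation in `Total` above is a weighted sum: 25% of the rougher sMAPE plus 75% of the final sMAPE. Written as a small helper (the name `total_smape` and the example inputs are illustrative assumptions):

```python
def total_smape(smape_rougher, smape_final):
    """Weighted total: 25% rougher sMAPE plus 75% final sMAPE."""
    return 0.25 * smape_rougher + 0.75 * smape_final

# Made-up stage values, only to show the arithmetic: 2.5 + 4.5.
example_total = total_smape(10.0, 6.0)

print(example_total)  # 7.0
```

Because the final stage carries three times the weight, an improvement there moves the total more than the same improvement at the rougher stage.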

#### Decision tree

In [None]:
Total = 0

results = model_results(dtr_r, features_test_r, target_test_r)
Total += 0.25*results[1]

print('Mean:', '{:,.2f}'.format(target_test_r.mean()))
print("MAE_test_r:", results[0])
print("sMAPE_test_r:", results[1])
print()

results = model_results(dtr_f, features_test_f, target_test_f)
Total += 0.75*results[1]

print('Mean:', '{:,.2f}'.format(target_test_f.mean()))
print("MAE_test_f:", results[0])
print("sMAPE_test_f:", results[1])
print()

print('Total sMAPE:', Total)

#### Random Forest

In [None]:
Total = 0

results = model_results(rfr_r, features_test_r, target_test_r)
Total += 0.25*results[1]

print('Mean:', '{:,.2f}'.format(target_test_r.mean()))
print("MAE_test_r:", results[0])
print("sMAPE_test_r:", results[1])
print()

results = model_results(rfr_f, features_test_f, target_test_f)
Total += 0.75*results[1]

print('Mean:', '{:,.2f}'.format(target_test_f.mean()))
print("MAE_test_f:", results[0])
print("sMAPE_test_f:", results[1])
print()

print('Total sMAPE:', Total)

# Conclusion

The linear regression model showed the best result on the test dataset.
<br> Its total sMAPE of 8.64% is quite good (sMAPE is an error measure, so lower is better).