# Action plan

1. Import files
2. Data processing:
- handling missing values,
- handling outliers,
- converting time columns to datetime format
3. Based on the temperature measurement file, create a dataframe that also contains the columns `active power`, `reactive power`, `gas`, `Bulk`, and `Wire`. To do this, first match `Bulk` with `Bulk_time` and `Wire` with `Wire_time`.
<br/>Next, for each row (`key`, `time`) of the new dataframe and for each column, find the rows in the matched tables with the same `key` and a time less than or equal to `time`, and sum the values found.
<br/>
<br/>The result is a dataframe with temperature as the target feature and the factors that influenced it as features.

Main goal: prediction of liquid metal temperature based on the input data from the sensors

# Description of data files

- `data_arc.csv` - electrode data;
- `data_bulk.csv` - data on the supply of bulk materials (volume);
- `data_bulk_time.csv` - data on the supply of bulk materials (time);
- `data_gas.csv` - data on purging the alloy with gas;
- `data_temp.csv` - temperature measurement results;
- `data_wire.csv` - data on wire materials (volume);
- `data_wire_time.csv` - data on wire materials (time).

# Import files

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import copy
from IPython.display import display
import os

# Only `notebook.tqdm` is used below; the older `tqdm_notebook` entry points are deprecated
from tqdm import notebook

import gc

from itertools import chain

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

In [None]:
data_arc = pd.read_csv('datasets/data_arc.csv')
data_bulk = pd.read_csv('datasets/data_bulk.csv')
data_bulk_time = pd.read_csv('datasets/data_bulk_time.csv')
data_gas = pd.read_csv('datasets/data_gas.csv')
data_temp = pd.read_csv('datasets/data_temp.csv')
data_wire = pd.read_csv('datasets/data_wire.csv')
data_wire_time = pd.read_csv('datasets/data_wire_time.csv')

In [None]:
def print_df(df):
    display(df)
    df.info()
    print('_'*120)
    print('_'*120)

print_df(data_arc)
print_df(data_bulk)
print_df(data_bulk_time)
print_df(data_gas)
print_df(data_temp)
print_df(data_wire)
print_df(data_wire_time)

# Data processing

**In the `data_temp` dataset, about 20% of the values of the target feature, temperature, are missing. The target cannot be filled in reliably, so these rows are removed**

In [None]:
data_temp.dropna(subset=['Temperature'], inplace=True)
data_temp.info()
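
The share of missing target values can be checked directly before dropping them. The sketch below uses a small made-up frame in place of `data_temp`; `toy_temp` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_temp; the values are made up.
toy_temp = pd.DataFrame({
    "key": [1, 1, 2, 2, 3],
    "Temperature": [1571.0, np.nan, 1581.0, 1602.0, 1596.0],
})

# Share of missing values in the target column.
missing_share = toy_temp["Temperature"].isna().mean()

# Drop rows without a target and count how many were lost.
cleaned = toy_temp.dropna(subset=["Temperature"])
rows_dropped = len(toy_temp) - len(cleaned)
print(f"missing: {missing_share:.0%}, dropped rows: {rows_dropped}")
```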

## Check all features for outliers, except time and temperature

**`Bulk`**

In [None]:
sns.set(font_scale=2)

for i in range(1,16):
    column = 'Bulk ' + str(i)
    print(column)
    plt.figure(figsize=(20,20))
    sns.boxplot(data=data_bulk[column])
    plt.show()

**`Wire`**

In [None]:
sns.set(font_scale=2)

for i in range(1,10):
    column = 'Wire ' + str(i)
    print(column)
    plt.figure(figsize=(20,20))
    sns.boxplot(data=data_wire[column])
    plt.show()

**`Gas 1`**

In [None]:
sns.set(font_scale=2)
plt.figure(figsize=(20,20))
sns.boxplot(data=data_gas['Gas 1'])
plt.show()

**We will not remove the `Bulk`, `Wire`, and `Gas 1` outliers yet, because the full technological process and the initial composition are unknown. If the model accuracy turns out to be low, come back and try removing them**

**`Power`**

In [None]:
sns.set(font_scale=2)
plt.figure(figsize=(20,20))
sns.boxplot(data=data_arc['Active power'])
plt.show()

plt.figure(figsize=(20,20))
sns.boxplot(data=data_arc['Reactive power'])
plt.show()

In [None]:
data_arc.loc[data_arc['Reactive power'] < -1]

In [None]:
data_arc.loc[data_arc['key'] == 2116]

**Active power is within the normal range, so to avoid losing data we simply set the outlier to zero**

In [None]:
data_arc.loc[(data_arc['key'] == 2116) & (data_arc['Reactive power'] < -1), ['Reactive power']] = 0

In [None]:
data_arc.loc[data_arc['key'] == 2116]

<div style="background: #ADD8E6">
Delete the entire batch

In [None]:
# Drop every row of batch 2116: the negative value was zeroed above,
# so a filter on `Reactive power < -1` would no longer find it
data_arc = data_arc.loc[data_arc['key'] != 2116].reset_index(drop=True)

<div style="background: #ADD8E6">
Removal of `Gas 1` and `Power` outliers. The `Bulk` and `Wire` outliers are best removed after they have been matched with their timestamps.

In [None]:
data_gas.info()

In [None]:
q1 = data_gas['Gas 1'].quantile(0.25)
q3 = data_gas['Gas 1'].quantile(0.75)
irq = q3 - q1
keys = list(data_gas.loc[(data_gas['Gas 1'] < (q1 - 1.5*irq)) | (data_gas['Gas 1'] > (q3 + 1.5*irq))]['key'].values)
data_gas = data_gas.query('key not in @keys')
data_gas.info()

In [None]:
data_arc.info()

In [None]:
q1 = data_arc['Active power'].quantile(0.25)
q3 = data_arc['Active power'].quantile(0.75)
irq = q3 - q1
keys = list(data_arc.loc[(data_arc['Active power'] < (q1 - 1.5*irq)) | (data_arc['Active power'] > (q3 + 1.5*irq))]['key'].values)
data_arc = data_arc.query('key not in @keys')
data_arc.info()
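
The two cells above apply the same 1.5 × IQR rule and then drop every batch that contains at least one flagged value. A minimal sketch of that rule as reusable helpers is shown below; `iqr_bounds`, `keys_with_outliers`, and the toy data are hypothetical names and values, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def keys_with_outliers(df, column, key="key"):
    """Return the set of keys that have at least one value outside the fences."""
    lower, upper = iqr_bounds(df[column])
    mask = (df[column] < lower) | (df[column] > upper)
    return set(df.loc[mask, key])

# Toy data, not the real gas measurements.
toy = pd.DataFrame({"key": [1, 2, 3, 4, 5],
                    "Gas 1": [10.0, 11.0, 12.0, 13.0, 100.0]})
bad_keys = keys_with_outliers(toy, "Gas 1")

# Remove whole batches, as the notebook does with `query('key not in @keys')`.
filtered = toy[~toy["key"].isin(bad_keys)]
print(bad_keys, len(filtered))
```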

## Fill in missing values

**Replace missing `Bulk` and `Wire` values with zeros**

In [None]:
data_bulk.fillna(0, inplace=True)
data_wire.fillna(0, inplace=True)

## Converting features to the proper format

**Convert all time features to datetime format**

In [None]:
data_arc['Start of arc heating'] = pd.to_datetime(data_arc['Start of arc heating'], format='%Y-%m-%dT%H:%M:%S')
data_arc['End of arc heating'] = pd.to_datetime(data_arc['End of arc heating'], format='%Y-%m-%dT%H:%M:%S')
data_arc.info()

In [None]:
for i in range(1,16):
    column = 'Bulk ' + str(i)
    data_bulk_time[column] = pd.to_datetime(data_bulk_time[column], format='%Y-%m-%dT%H:%M:%S')
data_bulk_time.info()

In [None]:
data_temp['Measurement time'] = pd.to_datetime(data_temp['Measurement time'], format='%Y-%m-%dT%H:%M:%S')
data_temp.info()

In [None]:
for i in range(1,10):
    column = 'Wire ' + str(i)
    data_wire_time[column] = pd.to_datetime(data_wire_time[column], format='%Y-%m-%dT%H:%M:%S')
data_wire_time.info()
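
The column-by-column loops above could also be written as one call over all time columns. The sketch below assumes a toy frame (`toy_time`, made-up values) in place of `data_wire_time` and the same timestamp format as in the cells above.

```python
import pandas as pd

# Toy frame standing in for data_wire_time; values are made up.
toy_time = pd.DataFrame({
    "key": [1, 2],
    "Wire 1": ["2019-05-03T11:11:41", "2019-05-03T11:46:10"],
    "Wire 2": [None, "2019-05-03T11:50:00"],
})

time_cols = [c for c in toy_time.columns if c != "key"]

# Convert every time column in one call; missing entries become NaT.
toy_time[time_cols] = toy_time[time_cols].apply(
    pd.to_datetime, format="%Y-%m-%dT%H:%M:%S"
)
print(toy_time.dtypes)
```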

## Checking temporal features for outliers

**Check the time series of datasets for outliers**

**`Bulk`**

In [None]:
sns.set(font_scale=2)

for i in range(1,16):
    column = 'Bulk ' + str(i)
    print(column)
    plt.figure(figsize=(20,20))
    sns.scatterplot(x = data_bulk_time[column], y=data_bulk_time['key'])
    plt.show()

**`Wire`**

In [None]:
sns.set(font_scale=2)

for i in range(1,10):
    column = 'Wire ' + str(i)
    print(column)
    plt.figure(figsize=(20,20))
    sns.scatterplot(x = data_wire_time[column], y=data_wire_time['key'])
    plt.show()

**`Heating`**

In [None]:
df1 = copy.deepcopy(data_arc[['key','Start of arc heating']])
df1['Type'] = 'Start of heating'
df1.rename(columns={'Start of arc heating':'Heating'},inplace=True)
df2 = copy.deepcopy(data_arc[['key','End of arc heating']])
df2['Type'] = 'End of heating'
df2.rename(columns={'End of arc heating':'Heating'},inplace=True)
df = pd.concat([df1,df2], ignore_index=True,join='outer')

In [None]:
px.scatter(df, x = 'key', y = 'Heating', color='Type')

In [None]:
del df1,df2,df

**No outliers detected in the time features**

## Check the temperature feature for outliers

**Check the temperature measurements for outliers**

In [None]:
sns.set(font_scale=2)
plt.figure(figsize=(20,20))
sns.boxplot(data=data_temp['Temperature'])
plt.show()

**We will not remove the outliers yet, because the full technological process and the initial composition are unknown. If the model accuracy turns out to be low, come back and try removing them**

<div style="background: #ADD8E6">
Removing temperature outliers

It is best to remove temperature outliers in the final dataset

# Create a final dataset for training the model

## Match datasets with `Bulk` and `Wire`

In [None]:
# Match volumes and times on `key` rather than on row position
df_bulk = data_bulk.merge(data_bulk_time, on='key', suffixes=('', '_time')).reset_index(drop=True)

In [None]:
col = ['key']
for i in range(1,16):
    name = 'Bulk ' + str(i)
    col.append(name)
    name = 'Bulk ' + str(i) + '_time'
    col.append(name)

In [None]:
df_bulk = df_bulk[col]
df_bulk

In [None]:
# Match volumes and times on `key` rather than on row position
df_wire = data_wire.merge(data_wire_time, on='key', suffixes=('', '_time')).reset_index(drop=True)

In [None]:
col = ['key']
for i in range(1,10):
    name = 'Wire ' + str(i)
    col.append(name)
    name = 'Wire ' + str(i) + '_time'
    col.append(name)

In [None]:
df_wire = df_wire[col]
df_wire
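
An alternative to the wide `X` / `X_time` layout is a long table with one row per batch and material. The sketch below is only an illustration of that idea on made-up data; `toy_wire`, `toy_wire_time`, and the column names `material`, `volume`, `time` are assumptions, and the rest of the notebook keeps the wide layout.

```python
import pandas as pd

# Toy frames standing in for data_wire and data_wire_time; values are made up.
toy_wire = pd.DataFrame({"key": [1, 2], "Wire 1": [60.0, 96.0], "Wire 2": [0.0, 9.1]})
toy_wire_time = pd.DataFrame({
    "key": [1, 2],
    "Wire 1": pd.to_datetime(["2019-05-03 11:11", "2019-05-03 11:46"]),
    "Wire 2": pd.to_datetime([None, "2019-05-03 11:50"]),
})

# Reshape both tables to one row per (key, material).
volumes = toy_wire.melt(id_vars="key", var_name="material", value_name="volume")
times = toy_wire_time.melt(id_vars="key", var_name="material", value_name="time")

# Match volume and time on the batch and the material name.
long_wire = volumes.merge(times, on=["key", "material"])

# Keep only additions that have a timestamp, i.e. that actually happened.
long_wire = long_wire.dropna(subset=["time"]).reset_index(drop=True)
print(long_wire)
```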

<div style="background: #ADD8E6">
Removing outliers `Bulk`, `Wire`

In [None]:
# `info()` prints its report and returns None, so it does not need `display`
df_bulk.info()
df_wire.info()

**Bulk**

Removing rows with outliers (alternative, commented out)

In [None]:
# for i in range(1,16):
#     col = 'Bulk ' + str(i)
#     q1 = df_bulk[col].quantile(0.25)
#     q3 = df_bulk[col].quantile(0.75)
#     irq = q3 - q1
#     index = df_bulk.loc[(df_bulk[col] < (q1 - 1.5*irq)) | (df_bulk[col] > (q3 + 1.5*irq))].index
#     df_bulk = df_bulk.drop(index).reset_index(drop=True)

Replacing outliers with 0 (alternative, commented out)

In [None]:
# for i in range(1,16):
#     col = 'Bulk ' + str(i)
#     q1 = df_bulk[col].quantile(0.25)
#     q3 = df_bulk[col].quantile(0.75)
#     irq = q3 - q1
#     df_bulk.loc[(df_bulk[col] < (q1 - 1.5*irq)) | (df_bulk[col] > (q3 + 1.5*irq)), col] = 0

Deleting every `key` that has at least one outlier

In [None]:
keys = []
for i in range(1,16):
    col = 'Bulk ' + str(i)
    q1 = df_bulk[col].quantile(0.25)
    q3 = df_bulk[col].quantile(0.75)
    irq = q3 - q1
    keys.append(list(df_bulk.loc[(df_bulk[col] < (q1 - 1.5*irq)) | (df_bulk[col] > (q3 + 1.5*irq))]['key'].values))

new_keys = list(chain.from_iterable(keys))


df_bulk = df_bulk.query('key not in @new_keys')

**Wire**

Removing rows with outliers (alternative, commented out)

In [None]:
# for i in range(1,10):
#     col = 'Wire ' + str(i)
#     q1 = df_wire[col].quantile(0.25)
#     q3 = df_wire[col].quantile(0.75)
#     irq = q3 - q1
#     index = df_wire.loc[(df_wire[col] < (q1 - 1.5*irq)) | (df_wire[col] > (q3 + 1.5*irq))].index
#     df_wire = df_wire.drop(index).reset_index(drop=True)

Replacing outliers with 0 (alternative, commented out)

In [None]:
# for i in range(1,10):
#     col = 'Wire ' + str(i)
#     q1 = df_wire[col].quantile(0.25)
#     q3 = df_wire[col].quantile(0.75)
#     irq = q3 - q1
#     df_wire.loc[(df_wire[col] < (q1 - 1.5*irq)) | (df_wire[col] > (q3 + 1.5*irq)), col] = 0

Deleting every `key` that has at least one outlier

In [None]:
keys = []
for i in range(1,10):
    col = 'Wire ' + str(i)
    q1 = df_wire[col].quantile(0.25)
    q3 = df_wire[col].quantile(0.75)
    irq = q3 - q1
    keys.append(list(df_wire.loc[(df_wire[col] < (q1 - 1.5*irq)) | (df_wire[col] > (q3 + 1.5*irq))]['key'].values))

new_keys = list(chain.from_iterable(keys))

df_wire = df_wire.query('key not in @new_keys')

In [None]:
# `info()` prints its report and returns None, so it does not need `display`
df_bulk.info()
df_wire.info()
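
One property of the IQR rule is worth keeping in mind here, because the `Bulk` and `Wire` gaps were replaced with zeros earlier. If more than three quarters of a column is zero, both quartiles are zero and every non-zero value falls outside the fences. The sketch below shows this on a made-up column; whether it matters for the real data depends on how sparse each column is.

```python
import pandas as pd

# Toy column: material added in only 2 of 10 batches, gaps already filled with 0.
sparse = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 120.0, 95.0], name="Bulk 1")

q1 = sparse.quantile(0.25)
q3 = sparse.quantile(0.75)
iqr = q3 - q1

# With q1 == q3 == 0 the fences collapse to [0, 0],
# so every real addition is flagged as an outlier.
flagged = sparse[(sparse < q1 - 1.5 * iqr) | (sparse > q3 + 1.5 * iqr)]
print(q1, q3, len(flagged))
```

Comparing the row counts printed by `info()` before and after the removal shows how many batches the rule discards in practice.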

## Populate the dataset with all the information

In [None]:
full = copy.deepcopy(data_temp)
full.info()

In [None]:
full['gas'] = 0
full['AP'] = 0
full['RP'] = 0

In [None]:
for i in range(1,16):
    column = 'Bulk ' + str(i)
    full[column] = 0

In [None]:
for i in range(1,10):
    column = 'Wire ' + str(i)
    full[column] = 0

In [None]:
full

### Power

In [None]:
def func_ap(row):
    v = data_arc['Active power'].loc[(data_arc['key'] == row['key'])].sum()
    return v

full['AP'] = full.apply(func_ap, axis=1)

def func_rp(row):
    v = data_arc['Reactive power'].loc[(data_arc['key'] == row['key'])].sum()
    return v

full['RP'] = full.apply(func_rp, axis=1)
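
The row-wise `apply` above rescans `data_arc` for every measurement. A possible faster equivalent, sketched on made-up data, sums the power once per batch and then looks the totals up by `key`; `toy_arc`, `toy_full`, and `power_per_key` are hypothetical names.

```python
import pandas as pd

# Toy frames standing in for data_arc and full; values are made up.
toy_arc = pd.DataFrame({
    "key": [1, 1, 2],
    "Active power": [0.5, 0.7, 0.9],
    "Reactive power": [0.3, 0.4, 0.6],
})
toy_full = pd.DataFrame({"key": [1, 1, 2, 3]})

# Total power per batch, computed once.
power_per_key = toy_arc.groupby("key")[["Active power", "Reactive power"]].sum()

# Look the totals up by key; batches without arc data get 0,
# which matches the sum over an empty selection in the row-wise version.
toy_full["AP"] = toy_full["key"].map(power_per_key["Active power"]).fillna(0)
toy_full["RP"] = toy_full["key"].map(power_per_key["Reactive power"]).fillna(0)
print(toy_full)
```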

### Gas

In [None]:
def func_gas(row):
    v = data_gas['Gas 1'].loc[(data_gas['key'] == row['key'])].sum()
    return v

full['gas'] = full.apply(func_gas, axis=1)

### Bulk

In [None]:
def func_bulk(row,col):
    time_col = col + '_time'
    v = df_bulk[col].loc[(df_bulk['key'] == row['key']) & (df_bulk[time_col] <= row['Measurement time'])].sum()
    return v

for i in notebook.tqdm(range(1,16)):
    col = 'Bulk ' + str(i) 
    full[col] = full.apply(lambda x: func_bulk(x, col), axis=1)

### Wire

In [None]:
def func_wire(row,col):
    time_col = col + '_time'
    v = df_wire[col].loc[(df_wire['key'] == row['key']) & (df_wire[time_col] <= row['Measurement time'])].sum()
    return v

for i in notebook.tqdm(range(1,10)):
    col = 'Wire ' + str(i) 
    full[col] = full.apply(lambda x: func_wire(x, col), axis=1)
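
The time-conditional lookup can also be done without a row-wise `apply`. The sketch below assumes that each `key` appears at most once in the matched table, so a left merge keeps the row order of the measurements; the toy frames and their values are made up.

```python
import pandas as pd

# Toy frames standing in for full and df_bulk; values are made up.
toy_full = pd.DataFrame({
    "key": [1, 1, 2],
    "Measurement time": pd.to_datetime(
        ["2019-05-03 11:10", "2019-05-03 11:30", "2019-05-03 12:00"]
    ),
})
toy_bulk = pd.DataFrame({
    "key": [1, 2],
    "Bulk 1": [40.0, 25.0],
    "Bulk 1_time": pd.to_datetime(["2019-05-03 11:20", "2019-05-03 11:50"]),
})

# Attach the volume and its timestamp to every measurement of the same batch.
merged = toy_full.merge(toy_bulk, on="key", how="left")

# Keep the volume only if it was added no later than the measurement, otherwise 0.
added_in_time = merged["Bulk 1_time"] <= merged["Measurement time"]
toy_full["Bulk 1"] = merged["Bulk 1"].where(added_in_time, 0.0).to_numpy()
print(toy_full)
```

Comparisons against a missing timestamp (`NaT`) evaluate to `False`, so materials that were never added contribute 0, as in the row-wise version.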

## Processing the final dataset

In [None]:
full.describe()

**For each batch, keep only the last measurement; the model is trained on it. The first measurement is kept as the `Initial temperature` feature**

In [None]:
n = copy.deepcopy(full)
n.sort_values(by=['key','Measurement time'], ascending=True,inplace=True)
n.drop_duplicates(subset=['key'], keep='first', inplace=True)
n.reset_index(drop=True,inplace=True)
n = n[['key','Temperature']]
n.rename(columns={'Temperature':'Initial temperature'},inplace=True)
n

In [None]:
f = copy.deepcopy(full)

In [None]:
f.sort_values(by=['key','Measurement time'], ascending=True,inplace=True)

In [None]:
f.drop_duplicates(subset=['key'], keep='last', inplace=True)
f.reset_index(drop=True,inplace=True)

In [None]:
f = f.merge(n,how='left',on='key')

In [None]:
f.head()
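
For the temperature columns alone, the first and last measurement of each batch can be taken in a single aggregation. This is a sketch on made-up data that covers only the two temperatures; the cells above additionally keep all other features of the last row.

```python
import pandas as pd

# Toy frame standing in for full; values are made up and deliberately unsorted.
toy = pd.DataFrame({
    "key": [1, 1, 1, 2, 2],
    "Measurement time": pd.to_datetime([
        "2019-05-03 11:30", "2019-05-03 11:10", "2019-05-03 11:20",
        "2019-05-03 12:00", "2019-05-03 12:15",
    ]),
    "Temperature": [1613.0, 1571.0, 1604.0, 1581.0, 1602.0],
})

# Sort by time within each batch so that "first" and "last" are chronological.
ordered = toy.sort_values(["key", "Measurement time"])

# First and last temperature of every batch in one aggregation.
per_batch = ordered.groupby("key")["Temperature"].agg(
    **{"Initial temperature": "first", "Temperature": "last"}
).reset_index()
print(per_batch)
```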

 <div style="background: #ADD8E6">
Removing rows where the final temperature equals the initial temperature

In [None]:
f.info()

In [None]:
index = f.loc[f['Initial temperature'] == f['Temperature']].index
f = f.drop(index).reset_index(drop=True)
f.info()

<div style="background: #ADD8E6">
Removing rows where there is no final temperature measurement, that is, where the last measurement was taken before the last arc heating ended

In [None]:
semi_ds = data_arc[['key','End of arc heating']].sort_values(by=['End of arc heating'])
semi_ds.drop_duplicates(subset=['key'], keep='last', inplace=True)
semi_ds.reset_index(drop=True,inplace=True)
semi_ds

In [None]:
f = f.merge(semi_ds,how='left',on='key')
f.info()

In [None]:
index = f.loc[f['Measurement time'] < f['End of arc heating']].index
f = f.drop(index).reset_index(drop=True)
f.info()

<div style="background: #ADD8E6">
Removing temperature outliers

In [None]:
f.info()

In [None]:
q1 = f['Temperature'].quantile(0.25)
q3 = f['Temperature'].quantile(0.75)
irq = q3 - q1

In [None]:
index = f.loc[(f['Temperature'] < (q1 - 1.5*irq)) | (f['Temperature'] > (q3 + 1.5*irq))].index
f = f.drop(index).reset_index(drop=True)
f.info()

In [None]:
q1 = f['Initial temperature'].quantile(0.25)
q3 = f['Initial temperature'].quantile(0.75)
irq = q3 - q1

In [None]:
index = f.loc[(f['Initial temperature'] < (q1 - 1.5*irq)) | (f['Initial temperature'] > (q3 + 1.5*irq))].index
f = f.drop(index).reset_index(drop=True)
f.info()

Removing unnecessary features

In [None]:
f.drop(['Measurement time','key','End of arc heating'],inplace=True,axis=1)

## Correlation matrix

In [None]:
#del data_arc, data_bulk, data_bulk_time, data_gas, data_temp, data_wire, data_wire_time, df_bulk, df_wire, n

In [None]:
corr = f.corr()
fig = px.imshow(corr)
fig.show()

In [None]:
for col in list(corr.columns.unique()):
    print(col)
    display(corr.loc[(corr[col] < 1) & (corr[col] >= 0.6),col])
    print()

Remove the following features due to strong correlation: `RP`, `Bulk 14`, `Bulk 15`

In [None]:
f.drop(['RP', 'Bulk 14', 'Bulk 15'], inplace=True,axis=1)
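
Strongly correlated pairs can also be listed once each instead of column by column. The sketch below uses synthetic features (`a`, `b`, `c` are made up) and, as an assumption that differs from the cells above, compares the absolute correlation so that strong negative relationships are reported too.

```python
import numpy as np
import pandas as pd

# Toy features: "b" is almost a copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and keep the strongly related pairs.
pairs = upper.stack()
strong = pairs[pairs.abs() >= 0.6]
print(strong)
```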

In [None]:
corr = f.corr()
for col in list(corr.columns.unique()):
    print(col)
    display(corr.loc[(corr[col] < 1) & (corr[col] >= 0.6),col])
    print()

In [None]:
f

# Building temperature prediction models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error

import lightgbm as lgb

**Let's split the sample into sets with features and a target feature**

In [None]:
features = f.drop(['Temperature'], axis=1)
target = f['Temperature']

**Split the data into training, validation, and test sets**

In [None]:
# Split features and target together so their rows stay aligned
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.20, random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)

print(features.shape)
print(target.shape)
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)
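
The two consecutive splits give a 60% / 20% / 20% partition: the first call holds out 20% for validation, and 25% of the remaining 80% is another 20% of the full set. The sketch below checks this arithmetic on dummy data; `X` and `y` are placeholders, not the notebook's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 dummy rows; only the resulting sizes matter here.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: hold out 20% for validation.
X_rest, X_valid, y_rest, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=12345)

# Step 2: 25% of the remaining 80% is 20% of the full set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```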

**Scaling features**

In [None]:
scaler = StandardScaler()
scaler.fit(features_train)
features_train = scaler.transform(features_train)
features_valid = scaler.transform(features_valid)
features_test = scaler.transform(features_test)
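
The scaler above is fitted once on the whole training set, while the grid searches below split that set into folds again. One way to have the scaler refitted inside each fold is a scikit-learn pipeline. This is a sketch on synthetic data, not the notebook's setup, and for tree-based models scaling is unlikely to change the result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data; nothing here comes from the notebook's files.
rng = np.random.default_rng(12345)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

# The scaler is refitted on the training part of every fold.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```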

## Linear regression

In [None]:
%%time
model_lr = LinearRegression()
model_lr.fit(features_train, target_train)
predictions = model_lr.predict(features_valid)
mae = mean_absolute_error(target_valid, predictions)
print('MAE Linear Regression:', mae)
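
MAE is the mean of the absolute differences between predictions and true values, so it is expressed in the same units as the temperature. A small check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up temperatures, in the same units as the target.
y_true = np.array([1590.0, 1600.0, 1585.0])
y_pred = np.array([1594.0, 1597.0, 1585.0])

# MAE is the mean of the absolute prediction errors: (4 + 3 + 0) / 3.
mae_manual = np.abs(y_true - y_pred).mean()
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```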

## Decision tree

In [None]:
%%time
param_grid = {'max_depth': range(1,100,2)}

dtr = GridSearchCV(estimator=DecisionTreeRegressor(random_state=12345), param_grid=param_grid, cv=5,scoring='neg_mean_absolute_error')
dtr.fit(features_train, target_train)
dtr.best_params_

In [None]:
predictions = dtr.predict(features_valid)
mae = mean_absolute_error(target_valid, predictions)
print('MAE decision tree:', mae)
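
With `scoring='neg_mean_absolute_error'`, `GridSearchCV` maximizes the negated MAE, so its `best_score_` is negative and has to be sign-flipped to read it as an error. The sketch below shows this on synthetic data; the grid and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(12345)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=12345),
    param_grid={"max_depth": [1, 3, 5, 7]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# best_score_ is the negated MAE, so flip the sign to report it.
cv_mae = -search.best_score_
print(search.best_params_, cv_mae)
```

The cross-validated MAE obtained this way can be compared with the MAE on the validation set as a sanity check.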

## Random Forest

In [None]:
%%time
param_grid = {'n_estimators': range(1,150,3), 'max_depth': range(1,150,3)}

rfr = GridSearchCV(estimator=RandomForestRegressor(random_state=12345), param_grid=param_grid, cv=5,verbose=2,scoring='neg_mean_absolute_error')
rfr.fit(features_train, target_train)
rfr.best_params_

In [None]:
predictions = rfr.predict(features_valid)
mae = mean_absolute_error(target_valid, predictions)
print('MAE Random Forest:', mae)

## LightGBM

In [None]:
hyper_params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mae',
    'learning_rate': 0.005,
    'verbose': 0,
    'max_depth': 100,
    # number of boosting rounds; `num_iterations` is an alias of `n_estimators`,
    # so only one of them should be set
    'n_estimators': 20000
}

In [None]:
%%time
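gbm = lgb.LGBMRegressor(**hyper_params)
# LightGBM 4.0 removed the `verbose` argument of `fit`;
# a logging callback with period 0 silences the per-iteration output instead
gbm.fit(features_train, target_train,
        eval_set=[(features_valid, target_valid)],
        eval_metric='mae',
        callbacks=[lgb.log_evaluation(period=0)])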

In [None]:
print('MAE LightGBM:', gbm.best_score_['valid_0']['l1'])

## Feature importances

### Decision tree

In [None]:
feature_importances = pd.DataFrame(dtr.best_estimator_.feature_importances_,
                                   index = features.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

### Random Forest

In [None]:
feature_importances = pd.DataFrame(rfr.best_estimator_.feature_importances_,
                                   index = features.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

### LightGBM

In [None]:
feature_importances = pd.DataFrame(gbm.feature_importances_,
                                   index = features.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

# Checking models on a test dataset

In [None]:
predictions = model_lr.predict(features_test)
mae = mean_absolute_error(target_test, predictions)
print('MAE Linear Regression:', mae)

predictions = dtr.predict(features_test)
mae = mean_absolute_error(target_test, predictions)
print('MAE decision tree:', mae)

predictions = rfr.predict(features_test)
mae = mean_absolute_error(target_test, predictions)
print('MAE Random Forest:', mae)

predictions = gbm.predict(features_test)
mae = mean_absolute_error(target_test, predictions)
print('MAE LightGBM:', mae)

Best score: Random Forest, MAE = 5.84