<h1>Report on the analysis and modeling of the "steel-plates-fault" dataset</h1>

<h4>Introduction</h4>
The goal of the assignment was to analyze and model a selected dataset in order to expand one's knowledge related to data mining.<br> The repository includes a Jupyter nootebook containing the code and report, an analogous report in .pdf format, as well as .txt files and .png charts obtained during program execution.

<h4>Data description</h4>
A dataset named "steel-plates-faults" was used for the task. This dataset contains data on steel plate defects and consists of 1941 rows and 34 columns. Of these columns, 27 contain variables describing the physical parameters of the steel plate, 6 contain information on the occurrence of common defects (binary value 1-defect occurred, 0-defect did not occur, with only one of these columns being non-zero in a row). The last column, on the other hand, is the class of defect that occurred, which can take the value b'1' or b'2'. A value of b'1' means that one of the six most common defects occurred, while a value of b'2' means that another type of defect occurred. There are 1268 records of the first type and 673 of the second type in the collection.</br>
In the .arff file provided for download, the column names are abbreviated and therefore unclear. For processing purposes, the column names have been changed in accordance with information shown on the source page.<br>
Source: <br>
Dataset provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. <br>
https://www.openml.org/search?type=data&sort=runs&status=active&id=1504 <br>
http://archive.ics.uci.edu/ml/datasets/steel+plates+faults <br>

In [1]:
import pandas as pd
from scipy.io.arff import loadarff 

In [2]:
raw_data = loadarff('data\php9xWOpn.arff')
data = pd.DataFrame(raw_data[0])

In [3]:
data

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V25,V26,V27,V28,V29,V30,V31,V32,V33,Class
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,0.8182,-0.2913,0.5822,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,0.7931,-0.1756,0.2984,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,0.6667,-0.1228,0.2150,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,0.8444,-0.1568,0.5212,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,0.9338,-0.1992,1.0000,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,277.0,325780.0,325796.0,273.0,54.0,22.0,35033.0,119.0,141.0,...,-0.4286,0.0026,0.7254,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1937,144.0,175.0,340581.0,340598.0,287.0,44.0,24.0,34599.0,112.0,133.0,...,-0.4516,-0.0582,0.8173,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1938,145.0,174.0,386779.0,386794.0,292.0,40.0,22.0,37572.0,120.0,140.0,...,-0.4828,0.0052,0.7079,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1939,137.0,170.0,422497.0,422528.0,419.0,97.0,47.0,52715.0,117.0,140.0,...,-0.0606,-0.0171,0.9919,0.0,0.0,0.0,0.0,0.0,0.0,b'2'


In [4]:
data.shape

(1941, 34)

In [4]:
# data frame did not display all columns due to size hence this helper function
def print_data(data):
    with pd.option_context('display.max_columns', None):
        print(data)

In [5]:
print_data(data)

          V1      V2         V3         V4      V5    V6     V7        V8  \
0       42.0    50.0   270900.0   270944.0   267.0  17.0   44.0   24220.0   
1      645.0   651.0  2538079.0  2538108.0   108.0  10.0   30.0   11397.0   
2      829.0   835.0  1553913.0  1553931.0    71.0   8.0   19.0    7972.0   
3      853.0   860.0   369370.0   369415.0   176.0  13.0   45.0   18996.0   
4     1289.0  1306.0   498078.0   498335.0  2409.0  60.0  260.0  246930.0   
...      ...     ...        ...        ...     ...   ...    ...       ...   
1936   249.0   277.0   325780.0   325796.0   273.0  54.0   22.0   35033.0   
1937   144.0   175.0   340581.0   340598.0   287.0  44.0   24.0   34599.0   
1938   145.0   174.0   386779.0   386794.0   292.0  40.0   22.0   37572.0   
1939   137.0   170.0   422497.0   422528.0   419.0  97.0   47.0   52715.0   
1940  1261.0  1281.0    87951.0    87967.0   103.0  26.0   22.0   11682.0   

         V9    V10     V11  V12  V13    V14     V15     V16     V17     V18

In [7]:
data = data.rename(columns={
    "V1": "X_Minimum",
    "V2": "X_Maximum",
    "V3": "Y_Minimum",
    "V4": "Y_Maximum",
    "V5": "Pixels_Areas",
    "V6": "X_Perimeter",
    "V7": "Y_Perimeter",
    "V8": "Sum_of_Luminosity",
    "V9": "Minimum_of_Luminosity",
    "V10": "Maximum_of_Luminosity",
    "V11": "Length_of_Conveyer",
    "V12": "TypeOfSteel_A300",
    "V13": "TypeOfSteel_A400",
    "V14": "Steel_Plate_Thickness",
    "V15": "Edges_Index",
    "V16": "Empty_Index",
    "V17": "Square_Index",
    "V18": "Outside_X_Index",
    "V19": "Edges_X_Index",
    "V20": "Edges_Y_Index",
    "V21": "Outside_Global_Index",
    "V22": "LogOfAreas",
    "V23": "Log_X_Index",
    "V24": "Log_Y_Index",
    "V25": "Orientation_Index",
    "V26": "Luminosity_Index",
    "V27": "SigmoidOfAreas",
    "V28": "Pastry",
    "V29": "Z_Scratch",
    "V30": "K_Scatch",
    "V31": "Stains",
    "V32": "Dirtiness",
    "V33": "Bumps",
    "Class": "Other_Faults"
})
print_data(data)

      X_Minimum  X_Maximum  Y_Minimum  Y_Maximum  Pixels_Areas  X_Perimeter  \
0          42.0       50.0   270900.0   270944.0         267.0         17.0   
1         645.0      651.0  2538079.0  2538108.0         108.0         10.0   
2         829.0      835.0  1553913.0  1553931.0          71.0          8.0   
3         853.0      860.0   369370.0   369415.0         176.0         13.0   
4        1289.0     1306.0   498078.0   498335.0        2409.0         60.0   
...         ...        ...        ...        ...           ...          ...   
1936      249.0      277.0   325780.0   325796.0         273.0         54.0   
1937      144.0      175.0   340581.0   340598.0         287.0         44.0   
1938      145.0      174.0   386779.0   386794.0         292.0         40.0   
1939      137.0      170.0   422497.0   422528.0         419.0         97.0   
1940     1261.0     1281.0    87951.0    87967.0         103.0         26.0   

      Y_Perimeter  Sum_of_Luminosity  Minimum_of_Lu

In [8]:
data.Other_Faults.value_counts()

b'1'    1268
b'2'     673
Name: Other_Faults, dtype: int64

In [9]:
data[(data['Pastry']== 1.0) | (data['Z_Scratch']== 1.0) | (data['K_Scatch']== 1.0) | (data['Stains']== 1.0) | (data['Dirtiness']== 1.0) | (data['Bumps']== 1.0)].count()

X_Minimum                1268
X_Maximum                1268
Y_Minimum                1268
Y_Maximum                1268
Pixels_Areas             1268
X_Perimeter              1268
Y_Perimeter              1268
Sum_of_Luminosity        1268
Minimum_of_Luminosity    1268
Maximum_of_Luminosity    1268
Length_of_Conveyer       1268
TypeOfSteel_A300         1268
TypeOfSteel_A400         1268
Steel_Plate_Thickness    1268
Edges_Index              1268
Empty_Index              1268
Square_Index             1268
Outside_X_Index          1268
Edges_X_Index            1268
Edges_Y_Index            1268
Outside_Global_Index     1268
LogOfAreas               1268
Log_X_Index              1268
Log_Y_Index              1268
Orientation_Index        1268
Luminosity_Index         1268
SigmoidOfAreas           1268
Pastry                   1268
Z_Scratch                1268
K_Scatch                 1268
Stains                   1268
Dirtiness                1268
Bumps                    1268
Other_Faul

In [10]:
data[((data['Pastry']== 1.0) | (data['Z_Scratch']== 1.0) | (data['K_Scatch']== 1.0) | (data['Stains']== 1.0) | (data['Dirtiness']== 1.0) | (data['Bumps']== 1.0)) & (data['Other_Faults']== 1.0)].count()

X_Minimum                0
X_Maximum                0
Y_Minimum                0
Y_Maximum                0
Pixels_Areas             0
X_Perimeter              0
Y_Perimeter              0
Sum_of_Luminosity        0
Minimum_of_Luminosity    0
Maximum_of_Luminosity    0
Length_of_Conveyer       0
TypeOfSteel_A300         0
TypeOfSteel_A400         0
Steel_Plate_Thickness    0
Edges_Index              0
Empty_Index              0
Square_Index             0
Outside_X_Index          0
Edges_X_Index            0
Edges_Y_Index            0
Outside_Global_Index     0
LogOfAreas               0
Log_X_Index              0
Log_Y_Index              0
Orientation_Index        0
Luminosity_Index         0
SigmoidOfAreas           0
Pastry                   0
Z_Scratch                0
K_Scatch                 0
Stains                   0
Dirtiness                0
Bumps                    0
Other_Faults             0
dtype: int64

<h4>Description of the process of preparing data for analysis and modeling</h4>
The dataset used is a dedicated dataset for use in machine learning. Therefore, it is complete and correct i.e. there are no missing values and no invalid values.<br>
The data was used for the task proposed for this set, that is, classification into common defects and other defects.

<h4>Data analysis</h4>
First, the entire dataset was divided by strain class. Then Pearson's and Spearman's correlation coefficients were counted for each of them. Based on the results, it was decided to reject columns for which the correlation coefficient in both sets was greater than 0.85. 10 columns were rejected:
<ul>
    <li>TypeOfSteel_A400</li>
    <li>Y_Maximum</li>
    <li>X_Maximum</li>
    <li>Sum_of_Luminosity</li>
    <li>SigmoidOfAreas</li>
    <li>Orientation_Index</li>
    <li>Luminosity_Index</li>
    <li>LogOfAreas</li>
    <li>Log_X_Index</li>
    <li>Y_Perimeter</li>
</ul>
As a result, the dimension of the input vector used for modeling was reduced from 27 to 17 (28 to 18 including the expected value).<br>
Then, based on the values obtained from the z-score function from the SciPy library, outlier records were rejected. Records for which the obtained value was greater than 3 were considered as such. There were 182 of them, which is about 9% of the entire dataset. After discarding the data, 1181 records of class b'1' and 578 of class b'2' remained.<br>
An attempt was also made to use the Isolation Forest, but in that case about 14% of the dataset was rejected. This value was considered too high, so the previously mentioned method was used.

In [11]:
import numpy as np

In [12]:
data_common = data[data.Other_Faults == b'1']
steel_plate_common = data_common.iloc[:, :27]
corr_P_common = steel_plate_common.corr("pearson")
corr_P_tri_common = corr_P_common.where(np.triu(np.ones(corr_P_common.shape, dtype=bool), k=1)).stack().sort_values()
corr_P_tri_common[abs(corr_P_tri_common)>0.85]

TypeOfSteel_A300       TypeOfSteel_A400    -1.000000
Edges_Y_Index          Log_X_Index         -0.903438
Outside_X_Index        Log_X_Index          0.854468
Outside_Global_Index   Orientation_Index    0.859350
Log_Y_Index            SigmoidOfAreas       0.871895
Maximum_of_Luminosity  Luminosity_Index     0.879781
LogOfAreas             SigmoidOfAreas       0.896383
                       Log_Y_Index          0.913523
X_Perimeter            Sum_of_Luminosity    0.916876
                       Y_Perimeter          0.921065
LogOfAreas             Log_X_Index          0.928373
Pixels_Areas           X_Perimeter          0.973272
                       Sum_of_Luminosity    0.977294
X_Minimum              X_Maximum            0.987266
Y_Minimum              Y_Maximum            1.000000
dtype: float64

In [13]:
data_other = data[data.Other_Faults == b'2']
steel_plate_other = data_other.iloc[:, :27]
corr_P_other = steel_plate_other.corr("pearson")
corr_P_tri_other = corr_P_other.where(np.triu(np.ones(corr_P_other.shape, dtype=bool), k=1)).stack().sort_values()
corr_P_tri_other[abs(corr_P_tri_other)>0.85]

TypeOfSteel_A300       TypeOfSteel_A400    -1.000000
X_Perimeter            Sum_of_Luminosity    0.854267
Maximum_of_Luminosity  Luminosity_Index     0.862805
Outside_Global_Index   Orientation_Index    0.870849
Minimum_of_Luminosity  Luminosity_Index     0.876836
LogOfAreas             SigmoidOfAreas       0.879316
X_Minimum              X_Maximum            0.991174
Pixels_Areas           Sum_of_Luminosity    0.996881
Y_Minimum              Y_Maximum            1.000000
dtype: float64

In [15]:
data_common = data[data.Other_Faults == b'1']
steel_plate_common = data_common.iloc[:, :27]
corr_P_common = steel_plate_common.corr("spearman")
corr_P_tri_common = corr_P_common.where(np.triu(np.ones(corr_P_common.shape, dtype=bool), k=1)).stack().sort_values()
corr_P_tri_common[abs(corr_P_tri_common)>0.85]

TypeOfSteel_A300      TypeOfSteel_A400    -1.000000
Sum_of_Luminosity     Outside_X_Index      0.856395
Outside_X_Index       LogOfAreas           0.858522
Pixels_Areas          Outside_X_Index      0.858522
Sum_of_Luminosity     Log_X_Index          0.862438
LogOfAreas            Log_X_Index          0.866258
Pixels_Areas          Log_X_Index          0.866259
X_Perimeter           Log_Y_Index          0.872025
Outside_X_Index       SigmoidOfAreas       0.874942
Log_X_Index           SigmoidOfAreas       0.877920
Outside_Global_Index  Orientation_Index    0.879053
X_Minimum             X_Maximum            0.898179
X_Perimeter           Outside_X_Index      0.903883
                      Log_X_Index          0.904241
Sum_of_Luminosity     Log_Y_Index          0.921797
Log_Y_Index           SigmoidOfAreas       0.924209
X_Perimeter           Sum_of_Luminosity    0.928236
                      Y_Perimeter          0.930760
Pixels_Areas          Log_Y_Index          0.934385
LogOfAreas  

In [16]:
data_other = data[data.Other_Faults == b'2']
steel_plate_other = data_other.iloc[:, :27]
corr_P_other = steel_plate_other.corr("spearman")
corr_P_tri_other = corr_P_other.where(np.triu(np.ones(corr_P_other.shape, dtype=bool), k=1)).stack().sort_values()
corr_P_tri_other[abs(corr_P_tri_other)>0.85]

TypeOfSteel_A300       TypeOfSteel_A400    -1.000000
X_Perimeter            SigmoidOfAreas       0.853838
Outside_Global_Index   Orientation_Index    0.868194
Maximum_of_Luminosity  Luminosity_Index     0.890720
Minimum_of_Luminosity  Luminosity_Index     0.896616
Y_Perimeter            Log_Y_Index          0.902006
                       Sum_of_Luminosity    0.904426
                       LogOfAreas           0.917096
Pixels_Areas           Y_Perimeter          0.917096
Y_Perimeter            SigmoidOfAreas       0.935939
Sum_of_Luminosity      SigmoidOfAreas       0.956281
Pixels_Areas           SigmoidOfAreas       0.969082
LogOfAreas             SigmoidOfAreas       0.969082
Pixels_Areas           Sum_of_Luminosity    0.973690
Sum_of_Luminosity      LogOfAreas           0.973690
Outside_X_Index        Log_X_Index          0.983819
X_Minimum              X_Maximum            0.991711
Y_Minimum              Y_Maximum            1.000000
Pixels_Areas           LogOfAreas           1.

In [19]:
steel_plate = data.loc[:, ~data.columns.isin(["TypeOfSteel_A400", "Y_Maximum", "X_Maximum", "Sum_of_Luminosity", "SigmoidOfAreas", "Orientation_Index", "Luminosity_Index",
                                              "LogOfAreas", "Log_X_Index", "Y_Perimeter", "Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness", "Bumps"])]
steel_plate

Unnamed: 0,X_Minimum,Y_Minimum,Pixels_Areas,X_Perimeter,Minimum_of_Luminosity,Maximum_of_Luminosity,Length_of_Conveyer,TypeOfSteel_A300,Steel_Plate_Thickness,Edges_Index,Empty_Index,Square_Index,Outside_X_Index,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,Log_Y_Index,Other_Faults
0,42.0,270900.0,267.0,17.0,76.0,108.0,1687.0,1.0,80.0,0.0498,0.2415,0.1818,0.0047,0.4706,1.0000,1.0,1.6435,b'1'
1,645.0,2538079.0,108.0,10.0,84.0,123.0,1687.0,1.0,80.0,0.7647,0.3793,0.2069,0.0036,0.6000,0.9667,1.0,1.4624,b'1'
2,829.0,1553913.0,71.0,8.0,99.0,125.0,1623.0,1.0,100.0,0.9710,0.3426,0.3333,0.0037,0.7500,0.9474,1.0,1.2553,b'1'
3,853.0,369370.0,176.0,13.0,99.0,126.0,1353.0,0.0,290.0,0.7287,0.4413,0.1556,0.0052,0.5385,1.0000,1.0,1.6532,b'1'
4,1289.0,498078.0,2409.0,60.0,37.0,126.0,1353.0,0.0,185.0,0.0695,0.4486,0.0662,0.0126,0.2833,0.9885,1.0,2.4099,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,325780.0,273.0,54.0,119.0,141.0,1360.0,0.0,40.0,0.3662,0.3906,0.5714,0.0206,0.5185,0.7273,0.0,1.2041,b'2'
1937,144.0,340581.0,287.0,44.0,112.0,133.0,1360.0,0.0,40.0,0.2118,0.4554,0.5484,0.0228,0.7046,0.7083,0.0,1.2305,b'2'
1938,145.0,386779.0,292.0,40.0,120.0,140.0,1360.0,0.0,40.0,0.2132,0.3287,0.5172,0.0213,0.7250,0.6818,0.0,1.1761,b'2'
1939,137.0,422497.0,419.0,97.0,117.0,140.0,1360.0,0.0,40.0,0.2015,0.5904,0.9394,0.0243,0.3402,0.6596,0.0,1.4914,b'2'


In [21]:
steel_plate_temp = steel_plate.copy()
print(f"data: {steel_plate_temp.shape}")

z_scores = zscore(steel_plate_temp.iloc[:, :17])
abs_z_scores = np.abs(z_scores)
filtered_z_scores = (abs_z_scores < 3).all(axis=1)

steel_plate_wout_outl = steel_plate_temp[filtered_z_scores]
print(f"data without outliers: {steel_plate_wout_outl.shape}")
print(f"lost data (percent):  {((steel_plate.iloc[:, :17].shape[0] - steel_plate_wout_outl.shape[0]) / steel_plate.iloc[:, :17].shape[0])*100}")
print(steel_plate_wout_outl.Other_Faults.value_counts())
steel_plate_wout_outl

data: (1941, 18)
data without outliers: (1759, 18)
lost data (percent):  9.376609994848016
b'1'    1181
b'2'     578
Name: Other_Faults, dtype: int64


Unnamed: 0,X_Minimum,Y_Minimum,Pixels_Areas,X_Perimeter,Minimum_of_Luminosity,Maximum_of_Luminosity,Length_of_Conveyer,TypeOfSteel_A300,Steel_Plate_Thickness,Edges_Index,Empty_Index,Square_Index,Outside_X_Index,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,Log_Y_Index,Other_Faults
0,42.0,270900.0,267.0,17.0,76.0,108.0,1687.0,1.0,80.0,0.0498,0.2415,0.1818,0.0047,0.4706,1.0000,1.0,1.6435,b'1'
1,645.0,2538079.0,108.0,10.0,84.0,123.0,1687.0,1.0,80.0,0.7647,0.3793,0.2069,0.0036,0.6000,0.9667,1.0,1.4624,b'1'
2,829.0,1553913.0,71.0,8.0,99.0,125.0,1623.0,1.0,100.0,0.9710,0.3426,0.3333,0.0037,0.7500,0.9474,1.0,1.2553,b'1'
4,1289.0,498078.0,2409.0,60.0,37.0,126.0,1353.0,0.0,185.0,0.0695,0.4486,0.0662,0.0126,0.2833,0.9885,1.0,2.4099,b'1'
5,430.0,100250.0,630.0,20.0,64.0,127.0,1387.0,0.0,40.0,0.6200,0.3417,0.1264,0.0079,0.5500,1.0000,1.0,1.9395,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,325780.0,273.0,54.0,119.0,141.0,1360.0,0.0,40.0,0.3662,0.3906,0.5714,0.0206,0.5185,0.7273,0.0,1.2041,b'2'
1937,144.0,340581.0,287.0,44.0,112.0,133.0,1360.0,0.0,40.0,0.2118,0.4554,0.5484,0.0228,0.7046,0.7083,0.0,1.2305,b'2'
1938,145.0,386779.0,292.0,40.0,120.0,140.0,1360.0,0.0,40.0,0.2132,0.3287,0.5172,0.0213,0.7250,0.6818,0.0,1.1761,b'2'
1939,137.0,422497.0,419.0,97.0,117.0,140.0,1360.0,0.0,40.0,0.2015,0.5904,0.9394,0.0243,0.3402,0.6596,0.0,1.4914,b'2'


<h4>Data modeling</h4>
The columns that remained after the previous steps were used as the input vector X for the model. The values were previously normalized. The vector of expected values was then converted to a vector of binary values, 0 and 1.<br>
A support vector machine, specifically SVC from the scikit-learn library, was chosen to model the dataset. The optuna package was used to optimize the hyperparameters of the model. The value of the regularization parameter - C, the type of kernel of the support vector machine and the degree of the polynomial were optimized. At the same time, the degree of polynomial only affects the kernel of the "poly" or polynomial type, and is ignored in other cases. The maximized value is the sum of the accuracy on the training set, the precision on the training set, the f1 measure on the training set, the accuracy on the test set, the precision on the test set and the f1 measure on the test set. These values are summed with weights of 0.025, 0.025, 0.2, 0.075, 0.075 and 0.6, respectively. The average values of each metric for each trial were saved to a text file. At the same time, these values are actually averages obtained from 5 fold Cross-Validation.<br>
Optuna was used a total of three times, with 1,000 trials each time.<br>
First, values were checked for C in the range 0.1 to 100.0, kernels of type 'linear', 'rbf' or 'poly', and the degree of the polynomial in the range 2 to 5. The best value was obtained for C = 94.54320781753651, kernel 'rbf' and degree = 5, where the degree does not matter because the kernel of type 'poly' is not used. The obtained values of the metrics for this example are shown in the table below.
<table>
    <tr>
        <td>accuracy train</td>
        <td>accuracy test</td>
        <td>precision train</td>
        <td>precision test</td>
        <td>f1 score train</td>
        <td>f1 score test</td>
    </tr>
        <td>0.974986</td>
        <td>0.778288</td>
        <td>0.938107</td>
        <td>0.613579</td>
        <td>0.960972</td>
        <td>0.644733</td>
</table>
On the second attempt, based on previous results (Slice Plot - see Raport.pdf), the search range of C was changed to 80.0 to 120.0, and the rest remained the same. The best value was obtained for C = 117.95849387058576, kernel 'rbf' and degree = 4, where degree does not matter because no poly kernel is used. The obtained values of the metrics for this example are shown in the table below.
<table>
    <tr>
        <td>accuracy train</td>
        <td>accuracy test</td>
        <td>precision train</td>
        <td>precision test</td>
        <td>f1 score train</td>
        <td>f1 score test</td>
    </tr>
        <td>0.982377</td>
        <td>0.793637</td>
        <td>0.955512</td>
        <td>0.67482</td>
        <td>0.972732</td>
        <td>0.68126</td>
</table>
For the third attempt, the C-search range was changed to 110.0 to 130.0, and the linear kernel was no longer considered, due to the fact that it gave significantly weaker results than the other two. The rest of the settings remained as before. The best value was obtained for C = 128.0280732226804, kernel 'rbf' and degree = 3, where degree does not matter because the "poly" type kernel is not used. The obtained values of the metrics for this example are shown in the table below.
<table>
    <tr>
        <td>accuracy train</td>
        <td>accuracy test</td>
        <td>precision train</td>
        <td>precision test</td>
        <td>f1 score train</td>
        <td>f1 score test</td>
    </tr>
        <td>0.98266</td>
        <td>0.79363</td>
        <td>0.95844</td>
        <td>0.683672</td>
        <td>0.973181</td>
        <td>0.68407</td>
</table>
Due to the small improvement in parameters, the result obtained was considered the final result.

In [None]:
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
import plotly

import optuna
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice

In [23]:
X = steel_plate_wout_outl.iloc[:, :17]
y = steel_plate_wout_outl.iloc[:, -1]

oe = OrdinalEncoder(categories = [[b'1', b'2']],
                   handle_unknown = 'use_encoded_value',
                   unknown_value = np.NaN)

oe.fit(np.asanyarray(y).reshape(-1, 1))
yk = oe.transform(np.asanyarray(y).reshape(-1, 1)).flatten()

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [32]:
def fit_classifier(alg, X_train, X_test, y_train, y_test):
    alg.fit(X_train, y_train)
    y_pred_train = alg.predict(X_train)
    y_pred_test = alg.predict(X_test)
    
    return {
        "ACC_train":  sklearn.metrics.accuracy_score(y_pred_train, y_train),
        "ACC_test": sklearn.metrics.accuracy_score(y_pred_test, y_test),
        "P_train":    sklearn.metrics.precision_score(y_pred_train, y_train, zero_division=0),
        "P_test":   sklearn.metrics.precision_score(y_pred_test, y_test, zero_division=0),
        "F1_train":   sklearn.metrics.f1_score(y_pred_train, y_train),
        "F1_test":  sklearn.metrics.f1_score(y_pred_test, y_test)
    }

In [89]:
def eval_function(alg, X_train, X_test, y_train, y_test):
    series = pd.concat([
        pd.Series(fit_classifier(alg, X_train, X_test, y_train, y_test))], axis=1).T
    return series


def objective(trial):
    
    svc_model = svm.SVC(C = trial.suggest_float("C", 0.1, 100.0),
                        kernel = trial.suggest_categorical("kernel", ["linear", "rbf", "poly"]),
                        degree = trial.suggest_int("degree", 2, 5)
                        #max_iter = trial.suggest_int("max_iter", 100, 10000, step=100)
                       )
    
    kf = KFold(n_splits=5, shuffle=True)

    n_folds = 5
    
    results_cv = [eval_function(svc_model,
                            X_std[train,:],
                            X_std[test,:],
                            yk[train],
                            yk[test]) for train, test in kf.split(X)]

    results = sum(results_cv)/n_folds
    with open("log.txt", 'a') as f:
        f.write(f"{trial.number},")
        f.write(f"{results.to_string()}\n")
    
    
    return(0.025*results.ACC_train + 0.025*results.P_train + 0.2*results.F1_train + 0.075*results.ACC_test + 0.075*results.P_test + 0.6*results.F1_test)


In [None]:
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler()
)

study.optimize(objective, n_trials=1000)

In [92]:
best_params = study.best_params
best_params

{'C': 94.54320781753651, 'kernel': 'rbf', 'degree': 5}

In [93]:
study.best_value

0.7612299220006424

In [None]:
plot_optimization_history(study)

In [None]:
plot_parallel_coordinate(study,  params=["C", "kernel"])

In [None]:
plot_slice(study,  params=["C", "kernel"])

In [None]:
plot_param_importances(study)

In [99]:
def objective2(trial):
    
    svc_model = svm.SVC(C = trial.suggest_float("C", 80.0, 120.0),
                        kernel = trial.suggest_categorical("kernel", ["linear", "rbf", "poly"]),
                        degree = trial.suggest_int("degree", 2, 5)
                        #max_iter = trial.suggest_int("max_iter", 100, 10000, step=100)
                       )
    
    kf = KFold(n_splits=5, shuffle=True)

    n_folds = 5
    
    results_cv = [eval_function(svc_model,
                            X_std[train,:],
                            X_std[test,:],
                            yk[train],
                            yk[test]) for train, test in kf.split(X)]

    results = sum(results_cv)/n_folds
    with open("log2.txt", 'a') as f:
        f.write(f"{trial.number},")
        f.write(f"{results.to_string()}\n")
    
    
    return(0.025*results.ACC_train + 0.025*results.P_train + 0.2*results.F1_train + 0.075*results.ACC_test + 0.075*results.P_test + 0.6*results.F1_test)


In [None]:
study2 = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler()
)

study2.optimize(objective2, n_trials=1000)

In [101]:
best_params2 = study2.best_params
best_params2

{'C': 117.95849387058576, 'kernel': 'rbf', 'degree': 4}

In [102]:
study2.best_value

0.7618838522299926

In [None]:
plot_optimization_history(study2)

In [None]:
plot_parallel_coordinate(study2,  params=["C", "kernel"])

In [None]:
plot_slice(study2,  params=["C", "kernel"])

In [None]:
plot_param_importances(study2)

In [107]:
def objective3(trial):
    
    svc_model = svm.SVC(C = trial.suggest_float("C", 110.0, 130.0),
                        kernel = trial.suggest_categorical("kernel", ["rbf", "poly"]),
                        degree = trial.suggest_int("degree", 2, 5)
                        #max_iter = trial.suggest_int("max_iter", 100, 10000, step=100)
                       )
    
    kf = KFold(n_splits=5, shuffle=True)

    n_folds = 5
    
    results_cv = [eval_function(svc_model,
                            X_std[train,:],
                            X_std[test,:],
                            yk[train],
                            yk[test]) for train, test in kf.split(X)]

    results = sum(results_cv)/n_folds
    with open("log3.txt", 'a') as f:
        f.write(f"{trial.number},")
        f.write(f"{results.to_string()}\n")
    
    
    return(0.025*results.ACC_train + 0.025*results.P_train + 0.2*results.F1_train + 0.075*results.ACC_test + 0.075*results.P_test + 0.6*results.F1_test)


In [None]:
study3 = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler()
)

study3.optimize(objective3, n_trials=1000)

In [116]:
best_params3 = study3.best_params
best_params3

{'C': 128.0280732226804, 'kernel': 'rbf', 'degree': 3}

In [117]:
study3.best_value

0.764403150396015

In [None]:
plot_optimization_history(study3)

In [None]:
plot_parallel_coordinate(study3,  params=["C", "kernel"])

In [None]:
plot_slice(study3,  params=["C", "kernel"])

In [None]:
plot_param_importances(study3)

<h4>Conclusions</h4>
In the end, the best model was obtained for C = 128.0280732226804, kernel 'rbf' and degree = 3, where degree does not matter because the poly kernel is not used. The obtained values of the metrics are as follows:
<table>
    <tr>
        <td>accuracy train</td>
        <td>accuracy test</td>
        <td>precision train</td>
        <td>precision test</td>
        <td>f1 score train</td>
        <td>f1 score test</td>
    </tr>
        <td>0.98266</td>
        <td>0.79363</td>
        <td>0.95844</td>
        <td>0.683672</td>
        <td>0.973181</td>
        <td>0.68407</td>
</table>
Because the project is essentially self-developing, it is hard to determine whether the result obtained is satisfactory and whether discarding columns with correlated values was a good move. In addition to the result obtained, the project demonstrates how much the hyperparameters of the model can impact the final outcome. An unusual error occurred during the project's implementation. Specifically, when running the project with JupyterLab in the Firefox browser, calling "optimize" on the "study" object from the optuna package caused the browser to progressively consume more RAM. Before completing 1,000 trials, the browser exceeded the available RAM on the machine, leading the operating system to shut it down. Eventually, this issue was resolved by using the Microsoft Edge browser.