## Traditional machine learning
Feature(s): 'total_string' column

- this is a string concatenated from various features, both categorical and numerical, and used for the LLM experiments.

Target: grain_size_bin_25.7, second_phase_bin_1.14

- binary classes (the number in each column name is the median of the corresponding target)
- so there are two experiments: one predicting the grain size, one predicting the second phase
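
The median-based binning described above can be sketched as follows. This is a minimal illustration, not the code that produced the dataset: the five values are copied from the `grain_size` column in the preview of `HEREON_final.csv`, the threshold is read off the column name, and the rule "strictly greater than the median maps to class 1" is an assumption that happens to match those rows.

```python
import pandas as pd

# Toy values copied from the grain_size column of the dataset preview (not the full dataset)
toy = pd.DataFrame({"grain_size": [131.4, 24.7, 19.8, 9.5, 914.27]})

# Threshold taken from the column name grain_size_bin_25.7 (median of the full dataset)
threshold = 25.7

# Assumption: values strictly above the threshold become class 1, all others class 0
toy["grain_size_bin_25.7"] = (toy["grain_size"] > threshold).astype(int)

print(toy)
```

How a value exactly equal to the median is treated cannot be inferred from the preview, so the comparison operator should be checked against the original preprocessing.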

Datasets:
- the HEREON_final.csv file contains all entries
- the HEREON_extruded_final.csv file is a subset of the full dataset: every entry in it has 'Prozessbedingung' (process condition) set to 'extruded'
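
A subset like the extruded one could be derived from the full table with a boolean filter. The sketch below uses a hypothetical toy frame whose values mirror the first rows of `HEREON_final.csv`; the case-insensitive comparison is an assumption, added because the capitalization of the stored value may differ from how it is written in prose.

```python
import pandas as pd

# Toy rows mirroring the 'Materials' and 'Prozessbedingung' columns of the full dataset
toy = pd.DataFrame({
    "Materials": ["Mg", "Mg-0.5Nd", "Mg-2Nd", "Mg-5Nd", "Mg-2Zn"],
    "Prozessbedingung": ["extruded", "extruded", "extruded", "extruded", "heat-treated"],
})

# Keep only rows whose process condition is 'extruded' (compared case-insensitively)
is_extruded = toy["Prozessbedingung"].str.lower() == "extruded"
toy_extruded = toy[is_extruded].reset_index(drop=True)

print(len(toy), "->", len(toy_extruded))
```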


In [1]:
import sys
import os

# Append the parent directory of your package to sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..', '..')))

In [2]:
import pandas as pd
import numpy as np
import zipfile

path_to_dataset = 'HEREON_final.csv'
csv_filename = 'HEREON_final.csv'

# Open the file; adjust the encoding and separator if necessary
if path_to_dataset.endswith('.zip'):
    with zipfile.ZipFile(path_to_dataset, 'r') as z:
        # Open the CSV file within the ZIP file
        with z.open(csv_filename) as f:
            # Read the CSV file into a DataFrame
            df = pd.read_csv(f, sep=',', on_bad_lines='warn')
else:
    # Read the CSV file into a DataFrame
    df = pd.read_csv(path_to_dataset, sep=',', on_bad_lines='warn')

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Materials | Prozessbedingung | homogenized_temperature | homogenized_time | solutionized_temperature | solutionized_time | extrution_temperature | extrution_speed | extrusion_ratio | ... | grain_size | grain_size_error | second_phase | second_phase_error | vpd | grain_size_bin | grain_size_bin_25.7 | second_phase_bin_1.14 | concentration_string | total_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Mg | extruded | 440.0 | 0.5 |  |  | 450.0 | 0.6 | 63.0 | ... | 131.4 | 76.0 | 0.0 | 0.0 | 0 | 1 | 1 | 0 | 0.0015 Fe, 0.0001 Cu, 0.0002 Ni, 0.0 Nd, 0.0 Z... | A alloy witht the following paramters; extrude... |
| 1 | 1 | Mg-0.5Nd | extruded | 440.0 | 0.5 |  |  | 450.0 | 0.6 | 63.0 | ... | 24.7 | 9.8 | 0.5 | 1.0 | 60 | 0 | 0 | 0 | 0.0082 Fe, 0.0019 Cu, 0.0003 Ni, 0.68 Nd, 0.0 ... | A alloy witht the following paramters; extrude... |
| 2 | 2 | Mg-2Nd | extruded | 440.0 | 0.5 |  |  | 450.0 | 0.6 | 63.0 | ... | 19.8 | 8.0 | 1.4 | 0.2 | 60 | 0 | 0 | 1 | 0.0026 Fe, 0.0021 Cu, 0.0011 Ni, 2.39 Nd, 0.0 ... | A alloy witht the following paramters; extrude... |
| 3 | 3 | Mg-5Nd | extruded | 440.0 | 0.5 |  |  | 450.0 | 0.6 | 63.0 | ... | 9.5 | 3.6 | 10.3 | 1.0 | 60 | 0 | 0 | 1 | 0.016 Fe, 0.0024 Cu, 0.0038 Ni, 4.2 Nd, 0.0 Zn... | A alloy witht the following paramters; extrude... |
| 4 | 4 | Mg-2Zn | heat-treated | 315.0 | 48.0 | 315.0 | 5.0 |  |  |  | ... | 914.27 | 191.11 | 0.06 | 0.0 | 550 | 1 | 1 | 0 | 0.0 Fe, 0.0 Cu, 0.0 Ni, 0.0 Nd, 2.0 Zn, 0.0 Ca... | A alloy witht the following paramters; heat-tr... |


In [4]:
df_encoded = pd.get_dummies(df, columns=['Prozessbedingung'])
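
To make the effect of this encoding step concrete, here is a small sketch on a hypothetical three-row frame (the values mirror the dataset preview; `toy` and `toy_encoded` are illustrative names). `pd.get_dummies` with `columns=[...]` replaces the named column by one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "Materials": ["Mg", "Mg-2Nd", "Mg-2Zn"],
    "Prozessbedingung": ["extruded", "extruded", "heat-treated"],
})

# One-hot encode only the process-condition column; other columns pass through unchanged
toy_encoded = pd.get_dummies(toy, columns=["Prozessbedingung"])

print(toy_encoded.columns.tolist())
# ['Materials', 'Prozessbedingung_extruded', 'Prozessbedingung_heat-treated']
```

Because the original `Prozessbedingung` column no longer exists after encoding, listing it among columns to exclude later has no effect; the indicator columns are what end up in the feature set.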

In [5]:
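# Columns left out of the feature set: index and text columns, the two ECAE columns,
# the raw targets with their errors, and the binned targets themselves.
# Note: 'Prozessbedingung' no longer exists in df_encoded (it was one-hot encoded above),
# and 'Unnamed: 0.1' is not listed here, so it remains in the feature set.
exclude_columns = ['ECAE_temperature', 'ECAE_pass', 'Unnamed: 0', 'Materials',
                   'grain_size', 'concentration_string', 'total_string',
                   'grain_size_error', 'second_phase', 'second_phase_error', 'vpd',
                   'grain_size_bin', 'grain_size_bin_25.7', 'second_phase_bin_1.14', 'Prozessbedingung']

# Sort so that the feature order is reproducible across runs (set order is not)
feature_columns = sorted(set(df_encoded.columns) - set(exclude_columns))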

In [6]:
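from MLPipeline import MLmodel, BinTheTarget

# Both binary targets are passed together, so one model predicts two outputs
Target = ['grain_size_bin_25.7', 'second_phase_bin_1.14']
Features = feature_columns
# Every remaining column, including the one-hot process columns, is declared numerical
Feature_types = ['numerical']*len(feature_columns)
# Note: this name shadows Python's built-in input(); it is kept because later cells use it
input = df_encoded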


In [7]:
len(input)

81

In [8]:
model = MLmodel(modelType='RandomForestClassifier',
                    df=input,
                    target=Target,
                    features=Features,
                    feature_types=Feature_types,
                    train_count=10,
                    test_count=10)

# get the values (input and output) of the model
X_train, X_test, y_train, y_test = model.getValues()

2024-09-12 17:31:52.549 | INFO     | MLPipeline:__post_init__:134 - ndim y_train: 2
2024-09-12 17:31:52.550 | INFO     | MLPipeline:__post_init__:135 - ndim x_train: 2
2024-09-12 17:31:52.550 | INFO     | MLPipeline:__post_init__:136 - shape y_train: (10, 2)
2024-09-12 17:31:52.550 | INFO     | MLPipeline:__post_init__:137 - shape x_train: (10, 24)
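
The logged shapes show that only 10 of the 81 rows are used for training. How `MLmodel` draws them is not shown in this notebook; one plausible way to get fixed-size splits, sketched here under that assumption with scikit-learn and synthetic stand-in data, is to pass absolute counts to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in with the notebook's dimensions: 81 rows, 24 features, 2 binary targets
X = rng.normal(size=(81, 24))
y = rng.integers(0, 2, size=(81, 2))

# Integer train_size / test_size are interpreted as absolute numbers of samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=10, test_size=10, random_state=0
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

With these settings the remaining 61 rows are used for neither training nor testing.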


In [9]:
model.train()
model.evaluate()

RandomForestClassifier model trained successfully.
Accuracies for each target in RandomForestClassifier: [0.7, 0.9]


[0.7, 0.9]
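
The output above lists one accuracy per target. A minimal sketch of how such per-target accuracies can be computed for a two-output random forest is given below; it uses scikit-learn directly on synthetic data and is not the `MLPipeline` implementation, whose internals are not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 training and 10 test rows, 24 features, 2 binary targets
X_train = rng.normal(size=(10, 24))
X_test = rng.normal(size=(10, 24))
# Targets depend on the first two features so that there is something to learn
y_train = (X_train[:, :2] > 0).astype(int)
y_test = (X_test[:, :2] > 0).astype(int)

# RandomForestClassifier accepts a 2D target and predicts all outputs at once
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# One accuracy per target column
accuracies = [
    float(accuracy_score(y_test[:, i], y_pred[:, i])) for i in range(y_test.shape[1])
]
print(accuracies)
```

With ten test rows each accuracy can only move in steps of 0.1, so the gap between 0.7 and 0.9 corresponds to two test samples.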

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.base import clone

def objective(trial, model_instance):
    """
    Optuna objective: mean cross-validated accuracy for the sampled hyperparameters.

    Higher is better, so the study has to maximize this value.
    """
    # Define hyperparameters to tune
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_categorical('max_depth', [None, 10, 20, 30, 40]),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 15),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 6),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
    }


    # Clone the model to ensure a fresh instance each trial
    model_to_clone = model_instance.model.estimator
    model_clone = clone(model_to_clone)
    model_clone.set_params(**params)
    
    # Define the score metric
    scoring = 'accuracy'

    # Perform cross-validation
    scores = cross_val_score(model_clone, model_instance.X_train, model_instance.y_train, cv=model_instance.cv, scoring=scoring)

    # Return the average score across all folds
    return scores.mean()
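
One detail of `scoring='accuracy'` is worth keeping in mind when `y_train` has two binary columns, as the logged shape `(10, 2)` indicates. Assuming the cloned estimator predicts both targets at once, scikit-learn's accuracy then counts a row as correct only if both targets are right (subset accuracy), which is stricter than the per-target accuracies. The hand-made arrays below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels for 5 samples and 2 binary targets
y_true = np.array([[1, 0], [0, 0], [0, 1], [0, 1], [1, 0]])
y_pred = np.array([[1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])

# Accuracy computed separately for each target column
per_target = [float(accuracy_score(y_true[:, i], y_pred[:, i])) for i in range(2)]

# Accuracy on the 2D arrays: a row counts only if both targets match
subset = float(accuracy_score(y_true, y_pred))

print(per_target)  # [0.8, 0.6]
print(subset)      # 0.4
```

Values returned by this objective are therefore not directly comparable with the per-target accuracies reported by `model.evaluate()`.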

In [11]:
model = MLmodel(modelType='RandomForestClassifier', df=input, target=Target, 
                features=Features, hyperparameter_tuning=True,
                feature_types=Feature_types,
                train_count=10,
                test_count=10,
                optimization_method='optuna', objective=lambda trial: objective(trial, model))

model.train()
predictions = model.predict()
model.evaluate()

2024-09-12 17:31:52.784 | INFO     | MLPipeline:__post_init__:134 - ndim y_train: 2
2024-09-12 17:31:52.785 | INFO     | MLPipeline:__post_init__:135 - ndim x_train: 2
2024-09-12 17:31:52.785 | INFO     | MLPipeline:__post_init__:136 - shape y_train: (10, 2)
2024-09-12 17:31:52.786 | INFO     | MLPipeline:__post_init__:137 - shape x_train: (10, 24)
[I 2024-09-12 17:31:52,787] A new study created in memory with name: no-name-2f1ceb83-c72e-47a9-87f0-7546a6535700
[I 2024-09-12 17:31:53,513] Trial 0 finished with value: 0.4 and parameters: {'n_estimators': 156, 'max_depth': 30, 'min_samples_split': 11, 'min_samples_leaf': 3, 'max_features': 'sqrt', 'bootstrap': True}. Best is trial 0 with value: 0.4.
[I 2024-09-12 17:31:53,729] Trial 1 finished with value: 0.4 and parameters: {'n_estim

Best RandomForestClassifier model trained successfully with hyperparameter tuning using Optuna.
Best hyperparameters: {'n_estimators': 115, 'max_depth': 20, 'min_samples_split': 7, 'min_samples_leaf': 4, 'max_features': 'log2', 'bootstrap': False}
RandomForestClassifier model trained successfully.
Accuracies for each target in RandomForestClassifier: [0.7, 0.7]


[0.7, 0.7]