<a href="https://colab.research.google.com/github/PreyPython123/Master-V24-Semiveiledet-Regresjon/blob/Collagen-Pradeep/Bioco_Collagen_Klassiske_Superveiledet_Regresjonsmetoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing the required libraries and packages

In [1]:
!pip install optuna



Importing the required libraries and packages

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, r2_score

import optuna
from optuna.visualization import plot_optimization_history

Loading the relevant data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

# Use the first column (date and time) as the index
collagen_data = pd.read_csv('/content/drive/MyDrive/MasterV24/Bioco_data/collagen_data.csv',
                            header=0,
                            sep=',',
                            index_col=0)

# Convert the index to the correct format and data type (datetime)
collagen_data.index = pd.to_datetime(collagen_data.index,
                                     format='%Y-%m-%d %H:%M:%S')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
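
The loading step above reads the timestamp column as the index and then converts it to a datetime type. A minimal sketch of the same two steps on an in-memory CSV, so it can be run without Google Drive; the column names and values in `csv_text` are hypothetical and only mimic the layout of the real file:

```python
import io

import pandas as pd

# Hypothetical two-row CSV: first column is the timestamp,
# the remaining columns are measurements (Collagen is often missing).
csv_text = (
    "timestamp,NIRfat,Collagen\n"
    "2022-10-31 17:37:00,1.2,\n"
    "2022-10-31 17:38:00,1.3,81.0\n"
)

# Same arguments as in the notebook, but reading from a string buffer.
demo = pd.read_csv(io.StringIO(csv_text), header=0, sep=",", index_col=0)

# Parse the string index into a DatetimeIndex with an explicit format.
demo.index = pd.to_datetime(demo.index, format="%Y-%m-%d %H:%M:%S")

print(demo.index.dtype)
print(demo)
```

Passing an explicit `format` makes the conversion fail loudly if a timestamp does not match the expected pattern, which is usually preferable to silent guessing.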


Description of the dataset

In [4]:
collagen_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 29136 entries, 2022-10-31 17:37:00 to 2023-06-14 01:06:00
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   EnzymeType_A1  29136 non-null  int64  
 1   EnzymeType_A2  29136 non-null  int64  
 2   EnzymeType_B   29136 non-null  int64  
 3   EnzymeType_C   29136 non-null  int64  
 4   EnzymeType_D   29136 non-null  int64  
 5   EnzymeType_E   29136 non-null  int64  
 6   RawMatFlow     29136 non-null  float64
 7   NIRfat         29136 non-null  float64
 8   NIRash         29136 non-null  float64
 9   NIRwater       29136 non-null  float64
 10  TT07           29136 non-null  float64
 11  TT08           29136 non-null  float64
 12  PT03           29136 non-null  float64
 13  TT20           29136 non-null  float64
 14  TT09           29136 non-null  float64
 15  TT12           29136 non-null  float64
 16  Collagen       89 non-null     float64
dtypes: float64(11), int64(6)
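
The summary above shows that only 89 of the 29136 rows have a measured `Collagen` value, roughly 0.3 % of the data. A small sketch of how the labelled share could be counted, using a synthetic frame; the `demo` DataFrame and its values are hypothetical stand-ins, not the Bioco data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for collagen_data: minute-resolution index,
# one sensor column, and a sparsely measured response column.
index = pd.date_range("2022-10-31 17:37:00", periods=10, freq="min")
demo = pd.DataFrame(
    {
        "NIRfat": np.linspace(1.0, 2.0, 10),
        "Collagen": [np.nan, np.nan, 80.1, np.nan, np.nan,
                     np.nan, np.nan, 82.4, np.nan, np.nan],
    },
    index=index,
)

# Count rows with and without a measured response.
n_labelled = int(demo["Collagen"].notna().sum())
n_unlabelled = int(demo["Collagen"].isna().sum())
labelled_fraction = n_labelled / len(demo)

print(f"labelled: {n_labelled}, unlabelled: {n_unlabelled}, "
      f"fraction labelled: {labelled_fraction:.2%}")
```

With so few labelled rows, any supervised model in this notebook is trained and tested on a very small sample, which is worth keeping in mind when reading the scores further down.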

# Splitting the dataset for training and testing

In [5]:
# Random seed
random_seed = 123

# Keep only the labelled part of the dataset (rows with a measured Collagen value)
collagen_markert = collagen_data.dropna(subset=['Collagen']).copy()

# Derive a categorical enzyme type from the one-hot columns. It is kept in a
# separate Series (used only for stratification), so no column has to be added
# to and dropped from the DataFrame, which avoids the SettingWithCopyWarning
collagen_enzymetypes = (collagen_markert.filter(like='EnzymeType_')
                        .idxmax(axis=1)
                        .str.split('_').str[1]
                        .astype('category'))

# Split into training and test data, stratified by enzyme type
collagen_trening, collagen_test, _, _ = train_test_split(collagen_markert,
                                                         collagen_enzymetypes,
                                                         test_size = 0.20,
                                                         stratify = collagen_enzymetypes,
                                                         random_state = random_seed)

# Split the datasets into predictors and response, for the training and test sets
X_trening = collagen_trening.iloc[:, :-1]
X_test = collagen_test.iloc[:, :-1]
y_trening = collagen_trening.iloc[:, -1]
y_test = collagen_test.iloc[:, -1]
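
The split above is stratified by enzyme type so that each type is represented in similar proportions in the training and test sets. A minimal sketch of how that property could be checked, assuming a synthetic one-hot frame; the class counts and the names `onehot`, `enzyme_type`, `train_share`, and `test_share` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical one-hot frame: 40 rows of type A1, 30 of B, 20 of C.
labels = np.repeat(["A1", "B", "C"], [40, 30, 20])
onehot = pd.get_dummies(pd.Series(labels), prefix="EnzymeType").astype(int)

# Recover a single categorical label from the one-hot columns,
# mirroring the idxmax approach used in the notebook.
enzyme_type = onehot.idxmax(axis=1).str.split("_").str[1]

# Stratified 80/20 split on the recovered label.
train, test = train_test_split(
    onehot, test_size=0.20, stratify=enzyme_type, random_state=123
)

# Class proportions should be close to identical in both subsets.
train_share = enzyme_type.loc[train.index].value_counts(normalize=True).sort_index()
test_share = enzyme_type.loc[test.index].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

With only 89 labelled rows in the real data, the test set holds about 18 observations, so rare enzyme types may contribute only one or two test rows even with stratification.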


# RandomForestRegressor

Hyperparameter optimization with Optuna

In [9]:
def objective(trial):
  # Search space: number of trees and maximum tree depth
  parametere = {
      'n_estimators': trial.suggest_int('n_estimators', 100, 200),
      'max_depth': trial.suggest_int('max_depth', 1, 10)
  }
  # Standardize the predictors, then fit the random forest
  rf_pipeline = Pipeline([
      ('skalerer', StandardScaler()),
      ('modell', RandomForestRegressor(**parametere, random_state=random_seed))
  ])

  # Each trial is fitted on the training set and scored by MSE on the test set
  rf_pipeline.fit(X_trening, y_trening)
  y_test_prediksjon = rf_pipeline.predict(X_test)
  mse = mean_squared_error(y_test, y_test_prediksjon)
  return mse

if __name__ == "__main__":
  study = optuna.create_study(direction='minimize')
  study.optimize(objective, n_trials=100)

plot_optimization_history(study)

[I 2024-02-07 19:15:52,057] A new study created in memory with name: no-name-a7f6a084-cb77-41b5-a28d-966b93eca227
[I 2024-02-07 19:15:52,318] Trial 0 finished with value: 17.18293406377 and parameters: {'n_estimators': 101, 'max_depth': 6}. Best is trial 0 with value: 17.18293406377.
[I 2024-02-07 19:15:52,633] Trial 1 finished with value: 16.814654769059587 and parameters: {'n_estimators': 120, 'max_depth': 4}. Best is trial 1 with value: 16.814654769059587.
[I 2024-02-07 19:15:52,926] Trial 2 finished with value: 16.63570809931393 and parameters: {'n_estimators': 109, 'max_depth': 8}. Best is trial 2 with value: 16.63570809931393.
[I 2024-02-07 19:15:53,398] Trial 3 finished with value: 16.521038034199357 and parameters: {'n_estimators': 182, 'max_depth': 8}. Best is trial 3 with value: 16.521038034199357.
[I 2024-02-07 19:15:53,719] Trial 4 finished with value: 16.379254558961524 and parameters: {'n_estimators': 120, 'max_depth': 9}. Best is trial 4 with value: 16.379254558961524.

Evaluation of the best model

In [14]:
def detailed_objective(trial):
  # Same search space as objective(); for a finished trial, suggest_int
  # returns the value stored in that trial
  parametere = {
      'n_estimators': trial.suggest_int('n_estimators', 100, 200),
      'max_depth': trial.suggest_int('max_depth', 1, 10)
  }
  rf_pipeline = Pipeline([
      ('skalerer', StandardScaler()),
      ('modell', RandomForestRegressor(**parametere, random_state=random_seed))
  ])

  # Refit on the training set and predict the test set
  rf_pipeline.fit(X_trening, y_trening)
  y_test_prediksjon = rf_pipeline.predict(X_test)

  # Report both MSE and R^2 on the test set
  mse = mean_squared_error(y_test, y_test_prediksjon)
  r2 = r2_score(y_test, y_test_prediksjon)

  return mse, r2

# Refit once with the best trial's hyperparameters and unpack both metrics
mse_resultat, r2_resultat = detailed_objective(study.best_trial)

print("Best hyperparameters for RandomForestRegressor: {}".format(study.best_params))
print("MSE: {}, and R^2: {}".format(mse_resultat, r2_resultat))

Best hyperparameters for RandomForestRegressor: {'n_estimators': 125, 'max_depth': 8}
MSE: 16.1279219616204, and R^2: 0.5359083947487182
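
To make the two reported metrics easier to interpret, the sketch below computes MSE, RMSE, and R^2 by hand and checks them against scikit-learn. The arrays `y_true` and `y_pred` are hypothetical values chosen for illustration, not predictions from the notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values (not from the notebook).
y_true = np.array([78.0, 81.5, 85.0, 90.0, 74.5])
y_pred = np.array([80.0, 80.0, 86.0, 87.0, 76.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)  # same units as the response

# R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of
# always predicting the mean of the observed values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Sanity check: the manual values agree with scikit-learn.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

Applied to the result above, an MSE of about 16.13 corresponds to an RMSE of about 4.02 in the units of `Collagen`, and an R^2 of about 0.54 means the model removes roughly half of the squared error that a constant mean prediction would make on the test set.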
