# Estimering av Sykehusopphold

Maskinlæringsmodellen har som mål å predikere den forventede lengden på sykehusoppholdet per pasient. I tillegg skal lengden på sykehusopphold være basert på passende variabler fra data på pasientopplysninger, inkludert fysiologiske, demografiske og sykdomsalvorlighetsdata på tvers av ni sykdomskategorier.

### Importere nødvendinge datapakker

In [41]:
import numpy as np 
import pandas as pd
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline, FeatureUnion, Pipeline
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

### Les inn data

In [42]:
demographic_df = pd.read_csv("./raw_data/demographic.csv")
hospital_df = pd.read_csv("./raw_data/hospital.csv")
phychological_df = pd.read_csv("./raw_data/physiological.txt", sep="\t")
severity_df = pd.read_json("./raw_data/severity.json")

## Preprocessing

### Rydde data

Lager en hjelpefunksjoner som skal rydde skal rydde de ulike datasettene

In [43]:
def clean_demographic_data(demographic_df):
    non_negative_cat = ["alder", "utdanning"]
    
    for col in non_negative_cat:
        df_demographic.loc[df_demographic[col] < 0, col] = np.nan # Bytter alle negative verdier (som ikke skal være negative) til NaN
    
    df_demographic = df_demographic.drop_duplicates() # Ser i datasettet at de to siste radene er duplikater, så fjerner alle duplikater
    
    return demographic_df

In [44]:
def clean_hospital_data(hospital_df):
    non_negative_cat = ["sykehusdød", "oppholdslengde"]
    
    for col in non_negative_cat:
        hospital_df.loc[hospital_df[col] < 0, col] = np.nan # Bytter alle negative verdier (som ikke skal være negative) til NaN
    
    return hospital_df

In [45]:
def clean_severity_data(df):
    severity_var_list = df.columns.tolist()
    df = df.explode(severity_var_list[2:]) # Fra og med "pasient_id" til og med siste kolonne
    
    # Sorterer dataframen slik at pasient_id er først
    new_cols = ["pasient_id"]
    for new_col in df.columns:
        if new_col != "pasient_id":
            new_cols.append(new_col)
    df = df[new_cols]
    
    # Fjerner sykdomskategori_id og sykdomskategori, da sykdom_underkategori forteller oss det samme bare mer detaljert
    df = df.drop(["sykdomskategori_id", "sykdomskategori"], axis=1) 
    
    df = df.sort_values(by="pasient_id")
    df = df.reset_index(drop=True)
    return df

Deretter settes alle dataframene til et enkelt dataframe

In [46]:
def merge_dataframes(hospital_df, demographic_df, phychological_df, severity_df):
    df = pd.merge(hospital_df, demographic_df, "outer", "pasient_id")
    df = pd.merge(df, severity_df, "outer", "pasient_id")
    df = pd.merge(df, severity_df, "outer", "pasient_id")
    assert df["pasient_id"].duplicated().any() == False, "Det er duplikater av pasient_id"
    return df

In [47]:
df = pd.concat([demographic_df, hospital_df, phychological_df, severity_df], axis=1)

### Fjerning av rader / kolonner med unødvendig / manglende data

In [48]:
df = df.dropna(thresh=(df.shape[0] * 0.5), axis=1) # Fjerner kolonner hvor 50% eller mer av verdiene er NaN
# df = df.drop("pasient_id", axis=1) # Fjerner pasient_id, da den ikke har noe direkete innvirkning på oppholdslengde
df = df.dropna(subset=["oppholdslengde"], axis=0) # Fjerner alle rader hvor oppholdslengde mangler, da det er variabelen modellen skal predikere

Fjerne alle kolonner som er målt på dag 7
Fjerne dødsfall ???

Lager en hjelpefunksjon slik at man slepper å gjøre alt på ny når man skal senere importere sample_data

In [49]:
def prepare_data(df_hospital, df_demographic, df_physiological, df_severity):
    
    df_hospital = clean_hospital_data(df_hospital)
    df_demographic = clean_demographic_data(df_demographic)
    df_severity = clean_severity_data(df_severity)
    
    df = merge_dataframes(df_hospital, df_demographic, df_physiological, df_severity)
    
    df = df.dropna(thresh=(df.shape[0] * 0.5), axis=1) # Fjerner kolonner hvor 50% eller mer av verdiene er NaN
    df = df.dropna(subset=["oppholdslengde"], axis=0) # Fjerner alle rader hvor oppholdslengde mangler, da det er variabelen modellen skal predikere
    
    return df 

### Train_test_split
Før man begynner med datamodellering må man splitte data i trenings-, valederings- og testdata

In [50]:
X = df.drop("oppholdslengde", axis=1)
y = df["oppholdslengde"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y , train_size=0.7, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp , test_size=0.5, random_state=42)

In [51]:
print(f'Length of X_train: {len(X_train)}')
print(f'Length of y_train: {len(y_train)}')
print(f'Length of X_val: {len(X_val)}')
print(f'Length of y_val: {len(y_val)}')
print(f'Length of X_test: {len(X_test)}')
print(f'Length of y_test: {len(y_test)}')


Length of X_train: 5418
Length of y_train: 5418
Length of X_val: 1161
Length of y_val: 1161
Length of X_test: 1161
Length of y_test: 1161


### Feature engineering

#### NEWS (National Early Warning Score)

NEWS er skåringssystem for målinger av vitale funksjoner hos syke personer. Høy NEWS score tilsier at alvorligheten av sykdommen også er høy.

Kilde: https://sml.snl.no/NEWS_-_National_Early_Warning_Score

In [52]:
def calculate_news_score(row):
    score = 0
    
    # Respirasjonsfrekvens 
    if row['respirasjonsfrekvens'] <= 8:
        score += 3
    elif row['respirasjonsfrekvens'] <= 11:
        score += 1
    elif row['respirasjonsfrekvens'] <= 20:
        score += 0
    elif row['respirasjonsfrekvens'] <= 24:
        score += 2
    else:
        score += 3
    
    # Blodtrykk
    if row['blodtrykk'] <= 90:
        score += 3
    elif row['blodtrykk'] <= 100:
        score += 2
    elif row['blodtrykk'] <= 110:
        score += 1
    elif row['blodtrykk'] <= 219:
        score += 0
    else:
        score += 3
    
    # Hjertefrekvens
    if row['hjertefrekvens'] <= 40:
        score += 3
    elif row['hjertefrekvens'] <= 50:
        score += 1
    elif row['hjertefrekvens'] <= 90:
        score += 0
    elif row['hjertefrekvens'] <= 110:
        score += 1
    elif row['hjertefrekvens'] <= 130:
        score += 2
    else:
        score += 3
    
    # Kroppstemperatur
    if row['kroppstemperatur'] <= 35.0:
        score += 3
    elif row['kroppstemperatur'] <= 36.0:
        score += 1
    elif row['kroppstemperatur'] <= 38.0:
        score += 0
    elif row['kroppstemperatur'] <= 39.0:
        score += 1
    else:
        score += 2
    
    return score

In [53]:
def apply_news_score(X_org):
    X = X_org.copy()
    X["NEWS_score"] = X.apply(calculate_news_score, axis=1)
    return X

#### Dele inn i "normale" verdier
Vil sjekke om modellen får bedre effekt av å se på "normale" verdier for f.eks. respirasjonsfrekvens og blodtrykk. I praksis blir det å gjøre numeriske variabler om til kategoriske.

In [54]:
def categorize_values(X_org):
    X = X_org.copy()
    X.loc[:, "respirasjonsfrekvens_range"] = pd.cut(X["respirasjonsfrekvens"], [0, 9, 12, 21, 100], right=False)
    X.loc[:, "blodtrykk_range"] = pd.cut(X["blodtrykk"], [0, 91, 101, 111, 140, 160, 180, 250], right=False)
    # X = X.drop(["blodtrykk", "respirasjonsfrekvens"], axis=1)
    return X

In [55]:
def transform_X(X_org):
    X = X_org.copy()
    X = apply_news_score(X)
    X = categorize_values(X)
    return X

def transform_Xs(X_train, X_val, X_test):
    return (transform_X(X_train), transform_X(X_val), transform_X(X_test))

In [56]:
X_train, X_val, X_test = transform_Xs(X_train, X_val, X_test )

#### Kategoriske variabler --> numeriske variabler 

Når man skal trene modeller bør man gjøre kategoriske variabler om til numeriske variabler for best mulig resultat.

In [57]:
categorical_cols = ["sykdom_underkategori", "kreft", "kjønn", "inntekt", "etnisitet", "diabetes", "demens", "sykehusdød", 
                    "dødsfall", "respirasjonsfrekvens_range", "blodtrykk_range"]
numeric_cols = [col for col in X_train.columns if col not in categorical_cols] # Alle kolonner i X_train som ikke er kategoriske

encoder = ColumnTransformer([("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), categorical_cols), # Alle kategoriske variabler vil blir omgjort til numeriske
                                 ("num", "passthrough", numeric_cols)] # Alle numeriske kategorier vil stå uendret
                                )

X_train = pd.DataFrame(encoder.fit_transform(X_train), columns=encoder.get_feature_names_out())
X_val = pd.DataFrame(encoder.transform(X_val), columns=encoder.get_feature_names_out())
X_test = pd.DataFrame(encoder.transform(X_test), columns=encoder.get_feature_names_out())

assert X_train.columns.equals(X_val.columns) and X_val.columns.equals(X_test.columns), "Det er ikke like kolonner i trenings-, valderings og testdata"

ValueError: A given column is not a column of the dataframe

La oss se på de 10 første radene i X_train.

In [18]:
X_train.head(10)

Unnamed: 0,cat__sykdom_underkategori_ARF/MOSF w/Sepsis,cat__sykdom_underkategori_CHF,cat__sykdom_underkategori_COPD,cat__sykdom_underkategori_Cirrhosis,cat__sykdom_underkategori_Colon Cancer,cat__sykdom_underkategori_Coma,cat__sykdom_underkategori_Lung Cancer,cat__sykdom_underkategori_MOSF w/Malig,cat__kreft_metastatic,cat__kreft_no,...,num__antall_komorbiditeter,num__koma_score,num__adl_stedfortreder,num__fysiologisk_score,num__apache_fysiologisk_score,num__overlevelsesestimat_2mnd,num__overlevelsesestimat_6mnd,num__lege_overlevelsesestimat_2mnd,num__lege_overlevelsesestimat_6mnd,num__NEWS_score
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,3,0.0,0.0,21.097656,17.0,0.73291,0.580933,0.9,0.9,5
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,5,44.0,0.0,46.898438,71.0,0.046997,0.004999,0.1,0.001,10
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3,0.0,0.0,20.398438,26.0,0.741943,0.661987,0.9,0.9,3
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1,0.0,0.0,37.195312,22.0,0.404968,0.106995,,,4
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1,0.0,0.0,5.599609,9.0,0.865967,0.700928,,,10
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2,9.0,0.0,33.09375,53.0,0.54895,0.434998,0.4,0.1,8
6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1,0.0,0.0,35.796875,45.0,0.653931,0.555908,0.3,0.2,7
7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2,0.0,,37.898438,60.0,0.534912,0.419983,0.2,0.2,8
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0,0.0,,43.898438,96.0,0.233978,0.132996,0.1,0.001,9
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1,37.0,0.0,31.699219,37.0,0.42395,0.304993,0.4,0.3,7


### Korrelasjon

In [19]:
corr_oppholdslengde = X_train.corrwith(y_train)
corr_oppholdslengde = corr_oppholdslengde.sort_values()

corr_oppholdslengde_df = corr_oppholdslengde.reset_index()
corr_oppholdslengde_df.columns = ["Variabel", "Korrelasjon med oppholdslengde"]

fig = px.bar(corr_oppholdslengde_df, 
             x="Korrelasjon med oppholdslengde", 
             y="Variabel",
             title="Korrelasjon mellom ulike variabler og oppholdslengde",
             color="Korrelasjon med oppholdslengde"
             )

fig.show()

Ut fra figuren, ser man korrelasjonen er relativ lav.

La oss se på gj.snitt av oppholdslengden mot numeriske variabler.

In [20]:
for col in X_train.columns:
    if col.startswith("num") and col != "num__pasient_id":
        fig = px.histogram(x=X_train[col], histfunc="avg", y=y_train,labels={"x": f"{col}", "y": "oppholdslengde"})
        fig.show()

## Datamodellering

Målet er å trene ulike modeller på treningsdata og teste dem på valederingsdata. Den med minst feil på valederingsdata bestemmes som "beste" model. Til slutt ser man på generaliseringsevnen til den beste modellen ved å teste den på testdata. 

### Grunnlinjemodell

Aller først, lager man en grunnlinje modell for se hvor dårlig en modell vil fungere. I likhet som andre modeller må den først trenes vha. fit-metoden og deretter bruker man predict-metoden for å predikere y på nye data av X. I dette tilfelle med DummyRegressor(), predikerer vi gj.snittsverdien av y_val, uavhengig av hva X_val er.

In [21]:
baseline = DummyRegressor(strategy="mean") # Lage modell
baseline.fit(X_train, y_train) # Tilpasse modell
prediction = baseline.predict(X_val) # Prediksjon på valederingsdata
rmse_baseline = root_mean_squared_error(y_val, prediction) # Sjekker RMSE mellom de faktiske dataene og prediksjonen
rmse_baseline

20.603790732142762

Målet nå er å trene ulike modeller med testing av flere, ulike hyperparametre. For å teste hyperparametre kan man bruke GridSearchCV ved hjelp av en "parameter grid" som har en innebygd fit- og score-metode. Parameter grid er en grid med ulike parametre som skal testes. I dette tilfelle bruker jeg RandomizedSearchCV for å minimere kjøretid. Prinsippet er ganske likt, bortsett fra at RandomizedSearchCV ikke sjekker alle mulige kombinasjoner, men heller n kombinasjoner.

Da man skal teste flere ulike modeller samt imputeringstrategier, trenger man aller først en placeholder. I tilegg vil jeg sjekke om modellen predikerer bedre hvis man for hver manglende data legger til en kolonne for hver variabel som tilsier om variablen manglet før imputasjonen. For å gjøre dette brukes MissingIndicator.

Kilde: https://youtu.be/0B5eIE_1vpU?si=EwWVMFx0aYt_b4e6

In [22]:
# Featureunion kombinere både imputerinsstrategi og indikatorer som lar dem kjøre parallelt
transformer = FeatureUnion(
    transformer_list=[
        ('imputer', None),
        ('indicators', MissingIndicator())])

# Placeholder
pipeline = Pipeline([
    ('strat', transformer),
    ('scaler', None),
    ('model', None)  
])

### LinearRegression

In [23]:
lr_params = [
    {
        # SimpleImputer + StandardScaler + LinearRegression
        'strat__imputer': [SimpleImputer()],
        'strat__imputer__strategy': ['mean', 'median'],
        'scaler': [StandardScaler()],
        'model': [LinearRegression()],
        'model__fit_intercept': [True, False],  
        'model__positive': [True, False] 
    },
    {
        # KNNImputer + StandardScaler + LinearRegression
        'strat__imputer': [KNNImputer()],
        'strat__imputer__n_neighbors': [3, 5, 7],
        'scaler': [StandardScaler()],
        'model': [LinearRegression()],
        'model__fit_intercept': [True, False],  
        'model__positive': [True, False] 
    }
    # ,
    # {
    #     # SimpleImputer + PolynomialFeatures + LinearRegression
    #     'strat__imputer': [SimpleImputer()],
    #     'strat__imputer__strategy': ['mean', 'median'], 
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [LinearRegression()],
    #     'model__fit_intercept': [True, False],  
    #     'model__positive': [True, False] 
    # },
    # {
    #     # KNNImputer + PolynomialFeatures + LinearRegression
    #     'strat__imputer': [KNNImputer()],
    #     'strat__imputer__n_neighbors': [3, 5, 7],
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [LinearRegression()],
    #     'model__fit_intercept': [True, False],  
    #     'model__positive': [True, False] 
    # }
    ]

print()
lr_search = RandomizedSearchCV(pipeline, lr_params, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1,random_state=42)
lr_search.fit(X_train, y_train)
best_lr = lr_search.best_estimator_
print(f"Beste hyperparametere for LinearRegression: {lr_search.best_params_}")
lr_rmse = root_mean_squared_error(y_val, best_lr.predict(X_val))
print(f"Beste rmse for LinearRegression: {lr_rmse}")






30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/pipeline.py", line 473, in fit
    self._final_estimator.fit(X

Beste hyperparametere for LinearRegression: {'strat__imputer__strategy': 'median', 'strat__imputer': SimpleImputer(), 'scaler': StandardScaler(), 'model__positive': False, 'model__fit_intercept': True, 'model': LinearRegression()}
Beste rmse for LinearRegression: 19.316927797863045


### ElasticNet

In [25]:
elastic_params = [
    {
        # SimpleImputer + StandardScaler + ElasticNet
        'strat__imputer': [SimpleImputer()],
        'strat__imputer__strategy': ['mean', 'median'], 
        'scaler': [StandardScaler()],
        'model': [ElasticNet()],
        'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
        'model__max_iter': np.arange(1000, 5000, 1000),
        'model__l1_ratio': np.arange(0.0, 1.0, 0.1), 
        'model__tol': [0.0001, 0.00001, 0.001]
    },
    {
        # KNNImputer + StandardScaler + ElasticNet
        'strat__imputer': [KNNImputer()],
        'strat__imputer__n_neighbors': [3, 5, 7],
        'scaler': [StandardScaler()],
        'model': [ElasticNet()],
        'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
        'model__max_iter': np.arange(1000, 5000, 1000),
        'model__l1_ratio': np.arange(0.0, 1.0, 0.1), 
        'model__tol': [0.0001, 0.00001, 0.001]
    }
    # ,
    # {
    #     # SimpleImputer + PolynomialFeatures + ElasticNet
    #     'strat__imputer': [SimpleImputer()],
    #     'strat__imputer__strategy': ['mean', 'median'], 
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [ElasticNet()],
    #     'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    #     'model__max_iter': np.arange(1000, 5000, 1000),
    #     'model__l1_ratio': np.arange(0.0, 1.0, 0.1), 
    #     'model__tol': [0.0001, 0.00001, 0.001]
    # },
    # {
    #     # KNNImputer + PolynomialFeatures + ElasticNet
    #     'strat__imputer': [KNNImputer()],
    #     'strat__imputer__n_neighbors': [3, 5, 7],
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [ElasticNet()],
    #     'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    #     'model__max_iter': np.arange(1000, 5000, 1000),
    #     'model__l1_ratio': np.arange(0.0, 1.0, 0.1), 
    #     'model__tol': [0.0001, 0.00001, 0.001]
    # }
    ]

print()
elastic_search = RandomizedSearchCV(pipeline, elastic_params, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1,random_state=42)
elastic_search.fit(X_train, y_train)
best_elastic = elastic_search.best_estimator_
print(f"Beste hyperparametere for ElasticNet: {elastic_search.best_params_}")
elastic_rmse = root_mean_squared_error(y_val, best_elastic.predict(X_val))
print(f"Beste rmse for ElasticNet: {elastic_rmse}") #19.20785694749137




  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


Beste hyperparametere for ElasticNet: {'strat__imputer__strategy': 'mean', 'strat__imputer': SimpleImputer(), 'scaler': StandardScaler(), 'model__tol': 0.001, 'model__max_iter': 2000, 'model__l1_ratio': 0.9, 'model__alpha': 0.01, 'model': ElasticNet()}
Beste rmse for ElasticNet: 19.27323071354347


### RandomForestRegressor

In [26]:
rf_params = [
    {
        # SimpleImputer + StandardScaler + RandomForestRegressor
        'strat__imputer': [SimpleImputer()],
        'strat__imputer__strategy': ['mean', 'median'],
        'scaler': [StandardScaler()],
        'model': [RandomForestRegressor()],
        'model__max_depth': [10, 20, 30, 50, None],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4, 10],
        'model__max_features': ['auto', 'sqrt', 'log2'],
        'model__bootstrap': [True, False]
    },
    {
        # KNNImputer + StandardScaler + RandomForestRegressor
        'strat__imputer': [KNNImputer()],
        'strat__imputer__n_neighbors': [3, 5, 7],
        'scaler': [StandardScaler()],
        'model': [RandomForestRegressor()],
        'model__n_estimators': [100, 200, 500, 1000],
        'model__max_depth': [10, 20, 30, 50, None],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4, 10],
        'model__max_features': ['auto', 'sqrt', 'log2'],
        'model__bootstrap': [True, False]
    }
    # ,
    # {
    #     # SimpleImputer + PolynomialFeatures + RandomForestRegressor
    #     'strat__imputer': [SimpleImputer()],
    #     'strat__imputer__strategy': ['mean', 'median'],
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [RandomForestRegressor()],
    #     'model__max_depth': [10, 20, 30, 50, None],
    #     'model__min_samples_split': [2, 5, 10],
    #     'model__min_samples_leaf': [1, 2, 4, 10],
    #     'model__max_features': ['auto', 'sqrt', 'log2'],
    #     'model__bootstrap': [True, False]
    # },
    # {
    #     # KNNImputer + PolynomialFeatures + RandomForestRegressor
    #     'strat__imputer': [KNNImputer()],
    #     'strat__imputer__n_neighbors': [3, 5, 7],
    #     'scaler': [PolynomialFeatures()],
    #     'scaler__degree': [1, 2, 5],
    #     'model': [RandomForestRegressor()],
    #     'model__n_estimators': [100, 200, 500, 1000],
    #     'model__max_depth': [10, 20, 30, 50, None],
    #     'model__min_samples_split': [2, 5, 10],
    #     'model__min_samples_leaf': [1, 2, 4, 10],
    #     'model__max_features': ['auto', 'sqrt', 'log2'],
    #     'model__bootstrap': [True, False]
    # }
    ]

print()
rf_search = RandomizedSearchCV(pipeline, rf_params, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1, random_state=42)
rf_search.fit(X_train, y_train)
best_rf = rf_search.best_estimator_
print(f"Beste hyperparametere for RandomForestRegressor: {rf_search.best_params_}")
rf_rmse = root_mean_squared_error(y_val, best_rf.predict(X_val))
print(f"Beste rmse for RandomForestRegressor: {rf_rmse}")






25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sheldondyrdal/opt/miniconda3/envs/INF161/lib/python3.12/site-packages/sklearn/pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt

Beste hyperparametere for RandomForestRegressor: {'strat__imputer__n_neighbors': 5, 'strat__imputer': KNNImputer(), 'scaler': StandardScaler(), 'model__n_estimators': 1000, 'model__min_samples_split': 10, 'model__min_samples_leaf': 4, 'model__max_features': 'sqrt', 'model__max_depth': 50, 'model__bootstrap': False, 'model': RandomForestRegressor()}
Beste rmse for RandomForestRegressor: 18.603982938672846


### Beste model

In [27]:
models_rmse = {
    baseline: rmse_baseline,
    best_lr: lr_rmse,
    best_elastic: elastic_rmse,
    best_rf: rf_rmse
    }

best_model = None
best_rmse = None

for model, rmse in models_rmse.items():
    if best_rmse is None or rmse < best_rmse:
        best_rmse = rmse
        best_model = model

best_model

#### Generaliseringsevne

In [28]:
predictions = best_model.predict(X_test)
rmse = root_mean_squared_error(y_test, predictions)
rmse

20.995546823142323

#### Visualisering av generaliseringsevnen

In [34]:
df_results = pd.DataFrame({
    'pasient_id': X_test["num__pasient_id"].tolist(),
    'Faktiske verdier': y_test,
    'Predikerte verdier': predictions
})

# Plot scatter plot
fig = px.scatter(df_results, x='pasient_id', y=['Faktiske verdier', 'Predikerte verdier'],
                 labels={'x': 'pasient_id', 'value': 'Oppholdslengde'}, title=f'Faktiske verdier vs Predikerte verdier, rmse: {round(rmse, 2)}')

fig.show()