# Machine Learning

Now it's time to practice what you have seen in the previous notebooks. Your task for today is to download the data from the database and train a model in order to predict if a patient has a heart disease or not. 

![](https://www.nicepng.com/png/detail/397-3975460_disease-high-quality-png-heart-disease-cartoon-png.png)

## Task:

1. Import the data from the database. The schema is called `heart`. You can use DBeaver to get an overview over the different tables and think about a good way to join them. 
2. Conduct a brief EDA to become familiar with the data. 
3. Preprocess the data as far as you need it and...
4. ...train a logistic regression model.

## What you should use/keep in mind:
 
* **Scale your data:** Which scaler works best in your case?
* **Tune your model:** Tune the hyperparameter of your model. You can start with a larger parameter grid and a `RandomizedSearchCV` and continue with a narrower parameter grid for your `GridSearchCV`.
* **Choose the right evaluation metric!**


## Data Overview

| column | additional information |
|--------|------------------------|
| age | age of patient |
| sex | gender of patient |
| chest_pain_type  | 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic | 
| resting_blood_pressure |  | 
| fasting_blood_sugar | > 120 mg/dl, 1 = true, 0 = false | 
| thal | 0 = normal, 1 = fixed defect, 2 = reversable defect
| serum_cholestoral | in mg/dl | 
| resting_electrocardiographic_results | 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria | 
| maximum_heartrate_achieved | | 
| exercise_induced_angina | 1 = yes, 0 = no | 
| oldpeak | ST depression induced by exercise relative to rest | 
| slope_of_the_peak_exercise_st_segment | 1= upsloping, 2 = flat, 3 = downsloping | 
| number_of_major_vessels_colored_by_flourosopy | |
| real_data | tag to distinguish between real and made up data | 
| heart_attack | 0 = little risk of heart attack, 1 = high risk of heart attack | 

## Import

In [1]:
import os
from sqlalchemy import create_engine

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as timer

# eye candy plots
plt.style.use('https://github.com/dhaitz/matplotlib-stylesheets/raw/master/pitayasmoothie-light.mplstyle')
# source https://github.com/dhaitz/matplotlib-stylesheets

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix

RSEED = 10

# Feel free to add all the libraries you need

## Getting the Data

The data for this exercise is stored in our postgres database in the schema `heart`. The different features are split thematically into five different tables. Your first task will be to have a look at the tables (e.g. in DBeaver) and figure out a way to join the information you need. As soon as you're happy with your query, you can use the following code cells to import the data into this notebook. 

In previous notebooks you've seen two different approaches to import data from a database into a notebook. The following code will use `sqlalchemy`in combination with pandas `pd.read_sql()` function. For the code to work, you need to copy the `.env` file from the previous repositories into this repository and change the query_string to your own query.

In [2]:
# Save dataframe as .csv file
df = pd.read_csv(r'heart_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   id                                             353 non-null    int64  
 1   age                                            343 non-null    float64
 2   sex                                            334 non-null    float64
 3   chest_pain_type                                303 non-null    float64
 4   patient_id                                     353 non-null    int64  
 5   serum_cholestoral                              303 non-null    float64
 6   fasting_blood_sugar                            303 non-null    float64
 7   thal                                           303 non-null    float64
 8   resting_blood_pressure                         303 non-null    float64
 9   resting_electrocardiographic_results           303 non

In [4]:
df[df.isna().any(axis = 1)]

Unnamed: 0,id,age,sex,chest_pain_type,patient_id,serum_cholestoral,fasting_blood_sugar,thal,resting_blood_pressure,resting_electrocardiographic_results,maximum_heartrate_achieved,exercise_induced_angina,oldpeak,slope_of_the_peak_exercise_st_segment,number_of_major_vessels_colored_by_flourosopy,real_data,heart_attack
1,2,,1.0,2.0,1,250.0,0.0,2.0,130.0,1.0,187.0,0.0,3.5,0.0,0.0,real data,1
4,5,57.0,0.0,0.0,4,354.0,0.0,2.0,120.0,1.0,163.0,1.0,0.6,2.0,0.0,,1
6,7,,0.0,1.0,6,294.0,0.0,2.0,140.0,0.0,153.0,0.0,1.3,1.0,0.0,real data,1
9,10,57.0,1.0,2.0,9,168.0,0.0,2.0,150.0,1.0,174.0,0.0,1.6,2.0,0.0,,1
14,15,58.0,0.0,3.0,14,283.0,1.0,2.0,150.0,0.0,162.0,0.0,1.0,2.0,0.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348,394,97.0,1.0,,393,,,,,,,,,,,,1
349,396,79.0,0.0,,395,,,,,,,,,,,,1
350,398,20.0,1.0,,397,,,,,,,,,,,,0
351,400,90.0,1.0,,399,,,,,,,,,,,,0


In [5]:
df = df.dropna(axis = 0, thresh = 6)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   id                                             303 non-null    int64  
 1   age                                            293 non-null    float64
 2   sex                                            284 non-null    float64
 3   chest_pain_type                                303 non-null    float64
 4   patient_id                                     303 non-null    int64  
 5   serum_cholestoral                              303 non-null    float64
 6   fasting_blood_sugar                            303 non-null    float64
 7   thal                                           303 non-null    float64
 8   resting_blood_pressure                         303 non-null    float64
 9   resting_electrocardiographic_results           303 non-null

In [7]:
df.groupby(df.chest_pain_type)[['age', 'sex']].agg(['median'])

Unnamed: 0_level_0,age,sex
Unnamed: 0_level_1,median,median
chest_pain_type,Unnamed: 1_level_2,Unnamed: 2_level_2
0.0,57.0,1.0
1.0,52.0,1.0
2.0,53.0,1.0
3.0,59.0,1.0


In [8]:
df_sex = df

In [9]:
df_nosex = df

In [10]:
df_nosex.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   id                                             303 non-null    int64  
 1   age                                            293 non-null    float64
 2   sex                                            284 non-null    float64
 3   chest_pain_type                                303 non-null    float64
 4   patient_id                                     303 non-null    int64  
 5   serum_cholestoral                              303 non-null    float64
 6   fasting_blood_sugar                            303 non-null    float64
 7   thal                                           303 non-null    float64
 8   resting_blood_pressure                         303 non-null    float64
 9   resting_electrocardiographic_results           303 non-null

In [11]:
df_nosex = df_nosex.drop(columns = ['real_data', 'sex'])

In [14]:
df_nosex.loc[df_nosex.chest_pain_type == 2.0, 'age'] = df_nosex.loc[df_nosex.chest_pain_type == 2.0, 'age'].fillna(53.0)

In [15]:
df_m = df.groupby(df.age)['maximum_heartrate_achieved'].agg([ 'median']).reset_index()
df_m

Unnamed: 0,age,median
0,29.0,202.0
1,34.0,183.0
2,35.0,165.0
3,37.0,170.0
4,38.0,173.0
5,39.0,165.5
6,40.0,178.0
7,41.0,168.0
8,42.0,167.5
9,43.0,161.5


In [16]:
for _, row_m in df_m.iterrows(): 
    df_nosex.loc[df_nosex.age == row_m['age'], 'maximum_heartrate_achieved'] = \
        df_nosex.loc[df_nosex.age == row_m['age'], 'maximum_heartrate_achieved'].fillna(row_m['median'])

In [17]:
df_nosex.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   id                                             303 non-null    int64  
 1   age                                            303 non-null    float64
 2   chest_pain_type                                303 non-null    float64
 3   patient_id                                     303 non-null    int64  
 4   serum_cholestoral                              303 non-null    float64
 5   fasting_blood_sugar                            303 non-null    float64
 6   thal                                           303 non-null    float64
 7   resting_blood_pressure                         303 non-null    float64
 8   resting_electrocardiographic_results           303 non-null    float64
 9   maximum_heartrate_achieved                     303 non-null

In [18]:
df_sex = df_sex.dropna()

In [19]:
df_sex = df_sex.drop(columns = ['real_data'])

In [22]:
df_sex.loc[df_sex.chest_pain_type == 2.0, 'age'] = df_sex.loc[df_sex.chest_pain_type == 2.0, 'age'].fillna(53.0)

In [23]:
df_sex.info()

<class 'pandas.core.frame.DataFrame'>
Index: 212 entries, 0 to 302
Data columns (total 16 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   id                                             212 non-null    int64  
 1   age                                            212 non-null    float64
 2   sex                                            212 non-null    float64
 3   chest_pain_type                                212 non-null    float64
 4   patient_id                                     212 non-null    int64  
 5   serum_cholestoral                              212 non-null    float64
 6   fasting_blood_sugar                            212 non-null    float64
 7   thal                                           212 non-null    float64
 8   resting_blood_pressure                         212 non-null    float64
 9   resting_electrocardiographic_results           212 non-null

In [24]:
df_sex.dropna(inplace = True)

In [25]:
df_sex.shape

(212, 16)

In [26]:
df_sex = df_sex.drop(columns = ['id', 'patient_id'])

In [27]:
df_nosex = df_nosex.drop(columns = ['id', 'patient_id'])

In [29]:
for _, row_m in df_m.iterrows(): 
    df_nosex.loc[df_nosex.age == row_m['age'], 'maximum_heartrate_achieved'] = \
        df_nosex.loc[df_nosex.age == row_m['age'], 'maximum_heartrate_achieved'].fillna(row_m['median'])

In [30]:
df_nosex.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 13 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   age                                            303 non-null    float64
 1   chest_pain_type                                303 non-null    float64
 2   serum_cholestoral                              303 non-null    float64
 3   fasting_blood_sugar                            303 non-null    float64
 4   thal                                           303 non-null    float64
 5   resting_blood_pressure                         303 non-null    float64
 6   resting_electrocardiographic_results           303 non-null    float64
 7   maximum_heartrate_achieved                     303 non-null    float64
 8   exercise_induced_angina                        303 non-null    float64
 9   oldpeak                                        303 non-null

In [31]:
df_sex.info()

<class 'pandas.core.frame.DataFrame'>
Index: 212 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   age                                            212 non-null    float64
 1   sex                                            212 non-null    float64
 2   chest_pain_type                                212 non-null    float64
 3   serum_cholestoral                              212 non-null    float64
 4   fasting_blood_sugar                            212 non-null    float64
 5   thal                                           212 non-null    float64
 6   resting_blood_pressure                         212 non-null    float64
 7   resting_electrocardiographic_results           212 non-null    float64
 8   maximum_heartrate_achieved                     212 non-null    float64
 9   exercise_induced_angina                        212 non-null

In [32]:
df_nosex.thal.value_counts()

thal
2.0    166
3.0    117
1.0     18
0.0      2
Name: count, dtype: int64

## DATA - PREPROCESSING

In [33]:
df_nosex_d = pd.get_dummies(df_nosex, columns=['chest_pain_type', 'slope_of_the_peak_exercise_st_segment', 'thal','number_of_major_vessels_colored_by_flourosopy','resting_electrocardiographic_results'],
                prefix=['cp', 'st_seg', 'th', 'flo', 're']).astype(float)

In [34]:
df_nosex_d.columns

Index(['age', 'serum_cholestoral', 'fasting_blood_sugar',
       'resting_blood_pressure', 'maximum_heartrate_achieved',
       'exercise_induced_angina', 'oldpeak', 'heart_attack', 'cp_0.0',
       'cp_1.0', 'cp_2.0', 'cp_3.0', 'st_seg_0.0', 'st_seg_1.0', 'st_seg_2.0',
       'th_0.0', 'th_1.0', 'th_2.0', 'th_3.0', 'flo_0.0', 'flo_1.0', 'flo_2.0',
       'flo_3.0', 'flo_4.0', 're_0.0', 're_1.0', 're_2.0'],
      dtype='object')

In [35]:
df_sex_d = pd.get_dummies(df_sex, columns=['chest_pain_type', 'slope_of_the_peak_exercise_st_segment', 'thal','number_of_major_vessels_colored_by_flourosopy','resting_electrocardiographic_results'],
                prefix=['cp', 'st_seg', 'th', 'flo', 're']).astype(float)

In [36]:
df_sex_d.columns

Index(['age', 'sex', 'serum_cholestoral', 'fasting_blood_sugar',
       'resting_blood_pressure', 'maximum_heartrate_achieved',
       'exercise_induced_angina', 'oldpeak', 'heart_attack', 'cp_0.0',
       'cp_1.0', 'cp_2.0', 'cp_3.0', 'st_seg_0.0', 'st_seg_1.0', 'st_seg_2.0',
       'th_0.0', 'th_1.0', 'th_2.0', 'th_3.0', 'flo_0.0', 'flo_1.0', 'flo_2.0',
       'flo_3.0', 'flo_4.0', 're_0.0', 're_1.0', 're_2.0'],
      dtype='object')

## ML ALGORITHM (LOGISTIC REGRESSION)

Creating X and y for no analysis with sex and without sex data

In [37]:
X1 = df_nosex_d.drop(columns = 'heart_attack')

In [38]:
y1 = df_nosex_d.heart_attack

In [39]:
X2 = df_sex_d.drop(columns = 'heart_attack')

In [40]:
y2 = df_sex_d.heart_attack

### SPLITTING THE DATA

In [41]:
# Train-test-split for no sex data
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.4, random_state=42, stratify=y1)

# Train-test-split for sex data
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.4, random_state=42, stratify=y2)


### STANDARDIZING THE DATA

In [42]:
col_scale = ['age', 'serum_cholestoral',
       'resting_blood_pressure', 'maximum_heartrate_achieved', 'oldpeak']


# Scaling with standard scaler for no sex data
scaler1 = StandardScaler()
X1_train_scaled = scaler1.fit_transform(X1_train[col_scale])
X1_test_scaled = scaler1.transform(X1_test[col_scale])


# Scaling with standard scaler for sex data
scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train[col_scale])
X2_test_scaled = scaler2.transform(X2_test[col_scale])

In [43]:
# Concatenating scaled and dummy columns for no sex data
X1_train_preprocessed = np.concatenate([X1_train_scaled, X1_train.drop(col_scale, axis=1)], axis=1)
X1_test_preprocessed = np.concatenate([X1_test_scaled, X1_test.drop(col_scale, axis=1)], axis=1)

# Concatenating scaled and dummy columns for sex data
X2_train_preprocessed = np.concatenate([X2_train_scaled, X2_train.drop(col_scale, axis=1)], axis=1)
X2_test_preprocessed = np.concatenate([X2_test_scaled, X2_test.drop(col_scale, axis=1)], axis=1)


In [44]:
X1_train_preprocessed

array([[ 0.87922035,  0.45098034, -0.71545212, ...,  0.        ,
         1.        ,  0.        ],
       [ 2.44159368, -1.02892665,  0.41948945, ...,  0.        ,
         0.        ,  1.        ],
       [ 1.32561273,  0.68353715,  0.75997193, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 0.20963178,  0.21842353, -0.14798133, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.09803369,  0.3452727 , -0.14798133, ...,  0.        ,
         1.        ,  0.        ],
       [-0.1251625 ,  0.00700824, -0.14798133, ...,  1.        ,
         0.        ,  0.        ]])

## RIDGE and LASSO Regularization for analysis without sex column data 

In [47]:
lr_r1 = LogisticRegression(penalty='l1', solver='liblinear', C=1)
lr_r1.fit(X1_train_preprocessed, y1_train)

lr_r2 = LogisticRegression(penalty='l1', solver='liblinear', C=1)
lr_r2.fit(X2_train_preprocessed, y2_train)

### This is the analysis without sex column data and LASSO REGRESSION

In [49]:
# this is for No sex column data
# Set up the hyperparameter grid to search over
param_grid = {
    'penalty': ['l1'],  # Regularization type
    'solver': ['liblinear'],  # Solvers
    'C': [0.1, 1.0, 5.0, 10.0, 12.0],  # Regularization strength
    'max_iter': [50, 100, 200, 500, 1000]  # Maximum iterations for convergence
}

# Create a logistic regression model
model = LogisticRegression()

# Set up the GridSearchCV with 5-fold cross-validation
grid_search1_lasso = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='f1_macro')

# Fit the GridSearchCV to the training data
grid_search1_lasso.fit(X1_train_preprocessed, y1_train)

# Get the best model from GridSearchCV
best_model1_lasso = grid_search1_lasso.best_estimator_

# Print the best hyperparameters found
print(f"Best Hyperparameters: {grid_search1_lasso.best_params_}")

# Make predictions with the best model
y1_pred_lasso = best_model1_lasso.predict(X1_test)

# Evaluate the model
accuracy1_ns_lasso = accuracy_score(y1_test, y1_pred_lasso)
f1_ns_lasso = f1_score(y1_test, y1_pred_lasso)

print(f"Best Model Accuracy: {accuracy1_ns_lasso:.4f}")
print(f"Best Model F1 Score: {f1_ns_lasso:.4f}")

# Print all the results (i.e., scores for all parameter combinations)
results_df1_lasso = pd.DataFrame(grid_search1_lasso.cv_results_)

# Print the grid search results: the hyperparameters and the corresponding scores
print("\nAll Results from Grid Search:")
print(results_df1_lasso[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']])

# Optionally, if you want to sort by the mean test score to see the best models at the top:
sorted_results1_lasso = results_df1_lasso[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']].sort_values(by='mean_test_score', ascending=False)

print("\nSorted Results by Mean Test Score:")
print(sorted_results1_lasso)


Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best Hyperparameters: {'C': 1.0, 'max_iter': 50, 'penalty': 'l1', 'solver': 'liblinear'}
Best Model Accuracy: 0.4590
Best Model F1 Score: 0.0000

All Results from Grid Search:
   param_penalty param_solver  param_C  mean_test_score  std_test_score
0             l1    liblinear      0.1         0.830969        0.047898
1             l1    liblinear      0.1         0.830969        0.047898
2             l1    liblinear      0.1         0.830969        0.047898
3             l1    liblinear      0.1         0.830969        0.047898
4             l1    liblinear      0.1         0.830969        0.047898
5             l1    liblinear      1.0         0.875865        0.058926
6             l1    liblinear      1.0         0.875865        0.058926
7             l1    liblinear      1.0         0.875865        0.058926
8             l1    liblinear      1.0         0.875865        0.058926
9             l1    liblinear      1.0    



### This is the analysis using sex column data and LASSO REGRESSION

In [50]:
# this is for sex column data

# Set up the hyperparameter grid to search over
param_grid = {
    'penalty': ['l1'],  # Regularization type
    'solver': ['liblinear'],  # Solvers
    'C': [0.1, 1.0, 5.0, 10.0, 12.0],  # Regularization strength
    'max_iter': [50, 100, 200, 500, 1000]  # Maximum iterations for convergence
}

# Create a logistic regression model
model = LogisticRegression()

# Set up the GridSearchCV with 5-fold cross-validation
grid_search2_lasso = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='f1_macro')

# Fit the GridSearchCV to the training data
grid_search2_lasso.fit(X2_train_preprocessed, y2_train)

# Get the best model from GridSearchCV
best_model2_lasso = grid_search2_lasso.best_estimator_

# Print the best hyperparameters found
print(f"Best Hyperparameters: {grid_search2_lasso.best_params_}")

# Make predictions with the best model
y2_pred_lasso = best_model2_lasso.predict(X2_test)

# Evaluate the model
accuracy_ns2_lasso = accuracy_score(y2_test, y2_pred_lasso)
f1_ns2_lasso = f1_score(y2_test, y2_pred_lasso)

print(f"Best Model Accuracy: {accuracy_ns2_lasso:.4f}")
print(f"Best Model F1 Score: {f1_ns2_lasso:.4f}")

# Print all the results (i.e., scores for all parameter combinations)
results_df2_lasso = pd.DataFrame(grid_search2_lasso.cv_results_)

# Print the grid search results: the hyperparameters and the corresponding scores
print("\nAll Results from Grid Search:")
print(results_df2_lasso[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']])

# Optionally, if you want to sort by the mean test score to see the best models at the top:
sorted_results2_lasso = results_df2_lasso[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']].sort_values(by='mean_test_score', ascending=False)

print("\nSorted Results by Mean Test Score:")
print(sorted_results2_lasso)


Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best Hyperparameters: {'C': 5.0, 'max_iter': 50, 'penalty': 'l1', 'solver': 'liblinear'}
Best Model Accuracy: 0.4471
Best Model F1 Score: 0.0000

All Results from Grid Search:
   param_penalty param_solver  param_C  mean_test_score  std_test_score
0             l1    liblinear      0.1         0.726020        0.063980
1             l1    liblinear      0.1         0.726020        0.063980
2             l1    liblinear      0.1         0.726020        0.063980
3             l1    liblinear      0.1         0.726020        0.063980
4             l1    liblinear      0.1         0.726020        0.063980
5             l1    liblinear      1.0         0.820268        0.075713
6             l1    liblinear      1.0         0.820268        0.075713
7             l1    liblinear      1.0         0.820268        0.075713
8             l1    liblinear      1.0         0.820268        0.075713
9             l1    liblinear      1.0    



### This is the analysis without sex column data and RIDGE REGRESSION

In [51]:
# this is for No sex column data
# Set up the hyperparameter grid to search over
param_grid = {
    'penalty': ['l2'],  # Regularization type
    'solver': ['liblinear', 'lbfgs', 'newton-cg', 'saga'],  # Solvers
    'C': [0.1, 1.0, 5.0, 10.0, 12.0],  # Regularization strength
    'max_iter': [50, 100, 200, 500, 1000]  # Maximum iterations for convergence
}

# Create a logistic regression model
model = LogisticRegression()

# Set up the GridSearchCV with 5-fold cross-validation
grid_search1_ridge = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='f1_macro')

# Fit the GridSearchCV to the training data
grid_search1_ridge.fit(X1_train_preprocessed, y1_train)

# Get the best model from GridSearchCV
best_model1_ridge = grid_search1_ridge.best_estimator_

# Print the best hyperparameters found
print(f"Best Hyperparameters: {grid_search1_ridge.best_params_}")

# Make predictions with the best model
y1_pred_ridge = best_model1_ridge.predict(X1_test)

# Evaluate the model
accuracy1_ns_ridge = accuracy_score(y1_test, y1_pred_ridge)
f1_ns_ridge = f1_score(y1_test, y1_pred_ridge)

print(f"Best Model Accuracy: {accuracy1_ns_ridge:.4f}")
print(f"Best Model F1 Score: {f1_ns_ridge:.4f}")

# Print all the results (i.e., scores for all parameter combinations)
results_df1_ridge = pd.DataFrame(grid_search1_ridge.cv_results_)

# Print the grid search results: the hyperparameters and the corresponding scores
print("\nAll Results from Grid Search:")
print(results_df1_ridge[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']])

# Optionally, if you want to sort by the mean test score to see the best models at the top:
sorted_results1_ridge = results_df1_ridge[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']].sort_values(by='mean_test_score', ascending=False)

print("\nSorted Results by Mean Test Score:")
print(sorted_results1_ridge)


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Hyperparameters: {'C': 1.0, 'max_iter': 50, 'penalty': 'l2', 'solver': 'liblinear'}
Best Model Accuracy: 0.4590
Best Model F1 Score: 0.0000

All Results from Grid Search:
   param_penalty param_solver  param_C  mean_test_score  std_test_score
0             l2    liblinear      0.1         0.870297        0.075323
1             l2        lbfgs      0.1         0.870297        0.075323
2             l2    newton-cg      0.1         0.870297        0.075323
3             l2         saga      0.1         0.870297        0.075323
4             l2    liblinear      0.1         0.870297        0.075323
..           ...          ...      ...              ...             ...
95            l2         saga     12.0         0.858598        0.071080
96            l2    liblinear     12.0         0.858598        0.071080
97            l2        lbfgs     12.0         0.858598        0.071080
98            l2    newton-cg     12.0   



### This is the analysis using sex column data and RIDGE REGRESSION

In [52]:
# this is for sex column data

# Set up the hyperparameter grid to search over
param_grid = {
    'penalty': ['l2'],  # Regularization type
    'solver': ['liblinear', 'lbfgs', 'newton-cg', 'saga'],  # Solvers
    'C': [0.1, 1.0, 5.0, 10.0, 12.0],  # Regularization strength
    'max_iter': [50, 100, 200, 500, 1000]  # Maximum iterations for convergence
}

# Create a logistic regression model
model = LogisticRegression()

# Set up the GridSearchCV with 5-fold cross-validation
grid_search2_ridge = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='f1_macro')

# Fit the GridSearchCV to the training data
grid_search2_ridge.fit(X2_train_preprocessed, y2_train)

# Get the best model from GridSearchCV
best_model2_ridge = grid_search2_ridge.best_estimator_

# Print the best hyperparameters found
print(f"Best Hyperparameters: {grid_search2_ridge.best_params_}")

# Make predictions with the best model
y2_pred_ridge = best_model2_ridge.predict(X2_test)

# Evaluate the model
accuracy_ns2_ridge = accuracy_score(y2_test, y2_pred_ridge)
f1_ns2_ridge = f1_score(y2_test, y2_pred_ridge)

print(f"Best Model Accuracy: {accuracy_ns2_ridge:.4f}")
print(f"Best Model F1 Score: {f1_ns2_ridge:.4f}")

# Print all the results (i.e., scores for all parameter combinations)
results_df2_ridge = pd.DataFrame(grid_search2_ridge.cv_results_)

# Print the grid search results: the hyperparameters and the corresponding scores
print("\nAll Results from Grid Search:")
print(results_df2_ridge[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']])

# Optionally, if you want to sort by the mean test score to see the best models at the top:
sorted_results2_ridge = results_df2_ridge[['param_penalty', 'param_solver', 'param_C', 'mean_test_score', 'std_test_score']].sort_values(by='mean_test_score', ascending=False)

print("\nSorted Results by Mean Test Score:")
print(sorted_results2_ridge)


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best Hyperparameters: {'C': 5.0, 'max_iter': 50, 'penalty': 'l2', 'solver': 'liblinear'}
Best Model Accuracy: 0.4471
Best Model F1 Score: 0.0000

All Results from Grid Search:
   param_penalty param_solver  param_C  mean_test_score  std_test_score
0             l2    liblinear      0.1         0.800701        0.090674
1             l2        lbfgs      0.1         0.808165        0.081075
2             l2    newton-cg      0.1         0.808165        0.081075
3             l2         saga      0.1         0.808165        0.081075
4             l2    liblinear      0.1         0.800701        0.090674
..           ...          ...      ...              ...             ...
95            l2         saga     12.0         0.814919        0.076338
96            l2    liblinear     12.0         0.814919        0.076338
97            l2        lbfgs     12.0         0.814919        0.076338
98            l2    newton-cg     12.0   

