## Necessary Imports

In [120]:
#Uncomment the lines below if you need to install the packages

#%pip install numpy
#%pip install matplotlib
#%pip install pandas
#%pip install seaborn
#%pip install scikit-learn
#%pip install imbalanced-learn

In [121]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import numpy as np

## Importing data

In [122]:
uncleaned_stroke_data = pd.read_csv('voorspellen-van-hartinfarct/train.csv')
test_data = pd.read_csv('voorspellen-van-hartinfarct/test.csv')

## Overview of training data

In [None]:
uncleaned_stroke_data.isnull().sum()

In [None]:
uncleaned_stroke_data.duplicated().sum()

There is no missing or duplicated data in the training set. 

In [None]:
uncleaned_stroke_data.dtypes

Basic statistics of the numerical columns: 

In [None]:
print(uncleaned_stroke_data.describe())

Basic statistics of the boolean columns: 

In [None]:
print(uncleaned_stroke_data.describe(include='bool'))

## Visual overview of individual columns

The following histograms provide a short overview of each column and the distribution of its values in the training dataset.

In [None]:
for s in uncleaned_stroke_data.columns:
    sns.histplot(uncleaned_stroke_data[s])
    plt.show()


In [None]:
print(uncleaned_stroke_data['stroke'].value_counts())

The 'stroke' column contains only the values 0 and 1, with 0 being the majority class. This is the binary target the model needs for this classification problem, so no adjustment is needed.

## Irrelevant Information

Not every column is necessarily relevant for predicting strokes. In the following we look for columns in the training dataset that carry little or no information about the target.

In [None]:
corr_matrix = uncleaned_stroke_data.corr()

plt.figure(figsize=(22,22))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
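
A large heatmap can be hard to read when the goal is only to see which columns relate to the target. As a minimal sketch, the target column of the correlation matrix can be sorted by absolute value instead. The toy DataFrame and the helper name `rank_by_target_correlation` below are hypothetical and only stand in for the training data; only the column name 'stroke' comes from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the training set.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.normal(50, 15, n),
    "noise": rng.normal(0, 1, n),
})
# The target depends on 'age' only; 'noise' is unrelated by construction.
toy["stroke"] = (toy["age"] + rng.normal(0, 10, n) > 65).astype(int)

def rank_by_target_correlation(df, target):
    """Return the absolute Pearson correlation of each column with the target, sorted descending."""
    numeric = df.select_dtypes(include=["number", "bool"])
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False)

ranking = rank_by_target_correlation(toy, "stroke")
print(ranking)
```

Pearson correlation only captures linear relationships, so a low value here is a hint rather than proof that a column is irrelevant.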

In [None]:
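def sta_sca(sc, df, cols):
    # Standardize each listed column in place (zero mean, unit variance).
    for i in cols:
        df[i] = sc.fit_transform(df[[i]])

sc = StandardScaler()

# Split features and target; note that X still contains the 'id' column here.
X = uncleaned_stroke_data.drop('stroke', axis=1)
y = uncleaned_stroke_data['stroke']

# Scale the two listed numeric columns so their coefficients are on a comparable scale.
sta_sca(sc, X, ['age', 'avg_glucose_level'])


# Ridge treats the 0/1 target as a continuous value, so the coefficients are only a rough signal.
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)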

ridge_importance = ridge_model.coef_
plt.figure(figsize=(10, 6))
plt.barh(X.columns, ridge_importance)
plt.xlabel('Ridge Coefficient')
plt.ylabel('Features')
plt.title('Feature Importance with Ridge Regression')
plt.show()


In [132]:
#ridge_importance.plot(kind='bar', figsize=(15, 8), color='skyblue')
#plt.title('Feature Importance (Logistic Regression Coefficients)')
#plt.ylabel('Coefficient Value')
#plt.xlabel('Feature')
#plt.xticks(rotation=45)
#plt.show()

The prior analysis allows us to drop the following columns: 

## TO-DO: We still have to agree on which to remove according to our analysis: 

"Some rows and/or columns may not be relevant to machine learning. Clean up the data so
that only relevant rows remain."

Open question: I am not sure whether my feature selection with Ridge was correct, but based on it we could drop some columns.
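
One way to cross-check the Ridge result is to look at the coefficients of a classifier instead, since the target is binary. The sketch below fits a logistic regression on standardized toy features and compares coefficient magnitudes. The data and the column names `informative` and `irrelevant` are made up for illustration; this is not the method used in this notebook, only a possible comparison.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one column drives the target, the other does not.
rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({
    "informative": rng.normal(0, 1, n),
    "irrelevant": rng.normal(0, 1, n),
})
y_toy = (X_toy["informative"] + rng.normal(0, 0.5, n) > 0).astype(int)

# Scaling every feature first makes the coefficient magnitudes comparable.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

# Collect the fitted coefficients, labelled by feature name.
lr_step = model.named_steps["logisticregression"]
coefs = pd.Series(lr_step.coef_[0], index=X_toy.columns)
print(coefs.abs().sort_values(ascending=False))
```

Coefficients close to zero suggest candidates for removal, but correlated features can share or mask each other's weight, so the ranking is best read together with the correlation matrix.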

In [133]:
col_to_drop = ['id']

cleaned_stroke_data=uncleaned_stroke_data.drop(col_to_drop, axis=1)
test_data_cleaned=test_data.drop(col_to_drop, axis=1)

Overview of the remaining columns and their respective types: 

In [None]:
cleaned_stroke_data.dtypes

## Defining X_train and y_train based on the cleaned data

In [135]:
X_train = cleaned_stroke_data.drop('stroke', axis=1)
y_train = cleaned_stroke_data['stroke']

Because stroke cases are the minority class, we oversample them with SMOTE to balance the training set:

In [136]:
smote = SMOTE(sampling_strategy='minority', random_state=0)

X_train, y_train = smote.fit_resample(X_train, y_train)
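
Resampling before cross-validation means synthetic samples can end up in the validation folds, which tends to make the cross-validation score look better than it is. One alternative that creates no new rows is class weighting, which is also what `class_weight='balanced'` does in the grid search. The sketch below shows the weights that setting would assign, using a made-up label vector with 95 negatives and 5 positives.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 95 non-stroke (0) and 5 stroke (1) cases.
y_toy = np.array([0] * 95 + [1] * 5)

classes = np.unique(y_toy)
# 'balanced' uses n_samples / (n_classes * count_of_class) for each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
print(weight_by_class)
```

With these weights a misclassified minority sample costs the model proportionally more, so the imbalance is handled in the loss rather than in the data.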

# Models

Using our cleaned data we can now start to train the models to predict strokes.

## Logistic Regression

### Selecting best hyperparameters

In [137]:
steps = [('lr', LogisticRegression())]
pipeline = Pipeline(steps)

param_grid = {'lr__class_weight': ['balanced', None], 'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', 'newton-cholesky']}
lr_cv = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)

In [None]:
lr_cv.fit(X_train, y_train)
print("Best parameters: ", lr_cv.best_params_)
print("Best cross-validation score: ", lr_cv.best_score_)

### Training Model

In [None]:
lr = LogisticRegression(class_weight='balanced', solver='newton-cholesky')
lr.fit(X_train, y_train)
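
Instead of copying the best parameters into a new model by hand, the fitted search object can be used directly: with the default `refit=True`, `GridSearchCV` retrains the best configuration on all the data it was given and exposes it as `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`. The scaler step, the reduced solver list, and `max_iter=1000` are assumptions chosen for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the training set.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, weights=[0.8, 0.2], random_state=0
)

# Scaling inside the pipeline is refit on each training fold during the search.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "lr__class_weight": ["balanced", None],
    "lr__solver": ["lbfgs", "liblinear"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_toy, y_toy)

# best_estimator_ is already refit on the full data with the best parameters.
best_model = search.best_estimator_
predictions = best_model.predict(X_toy)
print(search.best_params_)
```

This keeps the final model in sync with the search result even if the grid or the data changes later.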

### Predicting 

In [None]:
y_pred = lr.predict(test_data_cleaned)

test_data_cleaned['stroke'] = y_pred

### Editing test_csv

In [97]:
test_data_cleaned.to_csv('voorspellen-van-hartinfarct/test.csv', index=False)
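
Writing the predictions back to `test.csv` replaces the original test file, and the 'id' column was dropped earlier, so re-running the notebook would fail when it tries to drop 'id' again. If the identifiers are needed alongside the predictions (an assumption about how the results are used), a separate output file can be written instead. The sketch below uses a tiny made-up test set, placeholder predictions, and a hypothetical file name `submission.csv` in a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the test set; only 'id' and 'stroke' mirror this notebook's names.
test_toy = pd.DataFrame({"id": [101, 102, 103], "age": [34.0, 71.0, 58.0]})

# Features for prediction exclude the identifier, as in the notebook.
features = test_toy.drop(columns=["id"])
y_pred_toy = [0, 1, 0]  # placeholder predictions, not real model output

# Pair each prediction with its original id and write to a new file.
submission = pd.DataFrame({"id": test_toy["id"], "stroke": y_pred_toy})
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```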