## Necessary Imports

In [3]:
# Uncomment the lines below to install any missing packages

#%pip install numpy
#%pip install matplotlib
#%pip install pandas
#%pip install seaborn
#%pip install scikit-learn
#%pip install imbalanced-learn

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

## Importing data

First, we load the training and test data using pd.read_csv and store each in a variable.

In [5]:
uncleaned_stroke_data = pd.read_csv('Datasets/train.csv')
test_data = pd.read_csv('Datasets/test.csv')

## Overview of training data

In [None]:
uncleaned_stroke_data.isnull().sum()

In [None]:
uncleaned_stroke_data.duplicated().sum()

There is no missing or duplicated data in the training set. 

In [None]:
uncleaned_stroke_data.dtypes

Basic statistics of the numerical columns: 

In [None]:
print(uncleaned_stroke_data.describe())

Basic statistics of the boolean columns: 

In [None]:
print(uncleaned_stroke_data.describe(include='bool'))

## Overview of test data

In [None]:
test_data.isnull().sum()

In [None]:
test_data.duplicated().sum()

There are no missing or duplicated values in the test set either.

In [None]:
test_data.dtypes

Basic statistics of the numerical columns: 

In [None]:
print(test_data.describe())

Basic statistics of the boolean columns: 

In [None]:
print(test_data.describe(include='bool'))

## Conclusion

Both sets already satisfy the minimum requirements for machine learning: they contain no missing values, and every column is numeric or boolean.

## Visual overview of individual columns

The following histograms give a short overview of each column in the training dataset, showing how its values are distributed and how often they occur.

In [None]:
for s in uncleaned_stroke_data.columns:
    sns.histplot(uncleaned_stroke_data[s])
    plt.show()


In [None]:
print(uncleaned_stroke_data['stroke'].value_counts())

The column 'stroke' contains only the values 0 and 1, which is what a binary classifier needs as its target. However, the dataset is imbalanced: 0 (no stroke) is by far the more frequent value, so a model trained on the data as-is will tend to favor the majority class and miss stroke cases. Later on, the training data will be resampled to counter this issue.
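To make the effect of imbalance concrete, here is a small sketch on synthetic labels (the 95/5 split is made up and is not the real class ratio). It shows that a "model" which always predicts 0 reaches high accuracy while detecting no positive case at all.

```python
import numpy as np

# Synthetic labels for illustration only: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy: share of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# Recall: share of actual positives that were detected
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

This is one reason to judge models on such data with a metric that accounts for the positive class, such as F1, rather than accuracy alone.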

## Irrelevant Information

Not every column is necessarily relevant for predicting strokes. Below we look for columns in the training dataset that carry little or no useful information.

In [None]:
correlation_matrix = uncleaned_stroke_data.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix of Features")
plt.show()
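One caveat when reading the heatmap: DataFrame.corr() computes Pearson correlation by default, which only measures linear association. The sketch below uses synthetic data (not the stroke data) in which y is fully determined by x, yet the Pearson correlation is essentially zero.

```python
import numpy as np
import pandas as pd

# Synthetic example: y depends on x exactly, but not linearly
x = np.linspace(-1, 1, 201)
df = pd.DataFrame({"x": x, "y": x ** 2})

# DataFrame.corr() uses Pearson correlation unless another method is given
pearson = df.corr().loc["x", "y"]
print(f"Pearson correlation between x and x^2: {abs(pearson):.3f}")
```

A low value in the matrix is therefore a hint that a column may be uninformative, not proof of it.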

The prior analysis allows us to drop the following columns: 

## TO-DO: We still have to agree on which to remove according to our analysis: 

"Some rows and/or columns may not be relevant to machine learning. Clean up the data so
that only relevant rows remain."

 Consider values like married_yes or married_no: one of them can always be dropped, because if married_yes = 1 the other one is automatically 0, so it carries no additional information.

 Also, if I remember correctly from last term, you always want to drop at least one of the dummy columns of each categorical variable in such a case.

 Personally, I also don't think that relying solely on correlation is enough, because there could be a non-linear relationship. Still, I always removed the column with the lowest correlation in each subset, e.g. work_type_Private for the work-type features.
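The redundancy argument in the notes above can be checked on a toy column. The frame below is hypothetical: it assumes the original categorical feature was called ever_married with the values "Yes" and "No", which the cleaned dataset does not confirm.

```python
import pandas as pd

# Hypothetical toy column standing in for the original categorical feature
toy = pd.DataFrame({"ever_married": ["Yes", "No", "Yes", "Yes", "No"]})

# Full one-hot encoding yields two columns that always sum to 1
full = pd.get_dummies(toy, columns=["ever_married"], dtype=int)
redundant = (full["ever_married_No"] == 1 - full["ever_married_Yes"]).all()
print("One column is fully determined by the other:", redundant)

# drop_first=True removes the first category (alphabetically "No") and keeps the rest
reduced = pd.get_dummies(toy, columns=["ever_married"], drop_first=True, dtype=int)
print(list(reduced.columns))
```

Dropping one column per one-hot group loses no information and avoids perfectly collinear features, which matters most for linear models such as logistic regression.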


In [24]:
col_to_drop = ['id', 'ever_married_No', 'Residence_type_Rural', 'gender_Other', 'work_type_Private', 'smoking_status_smokes']

cleaned_stroke_data = uncleaned_stroke_data.drop(columns=col_to_drop)
test_data_cleaned = test_data.drop(columns=col_to_drop)

Overview of the remaining columns and their respective types: 

In [None]:
cleaned_stroke_data.dtypes

## Defining X_train and y_train based on the cleaned data

In [40]:
X_train = cleaned_stroke_data.drop('stroke', axis=1)
y_train = cleaned_stroke_data['stroke']

Because stroke cases are far rarer than non-stroke cases, we oversample the minority class of the training set with SMOTE:

In [41]:
smote = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = smote.fit_resample(X_train, y_train)
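For intuition, SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that idea on made-up points. It is a simplification (it always uses the single nearest neighbour) and is not the imbalanced-learn implementation; the helper name synthetic_sample is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class samples with two features (synthetic, for illustration)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def synthetic_sample(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the base point to all points; exclude the point itself
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf
    neighbour = points[np.argmin(dists)]
    # Interpolate at a random position on the segment base -> neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

new_point = synthetic_sample(minority, rng)
print(new_point)
```

Because the new values are interpolated, 0/1 indicator columns can end up with fractional values after resampling, which is worth checking in the resampled X_train.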

### Standardizing Data

Finally, we need to standardize our data. The three float columns in our dataset (age, avg_glucose_level, bmi) would otherwise have a higher impact on the models purely because of their larger scale, without actually being more important.

In [42]:
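num_cols = ['age', 'avg_glucose_level', 'bmi']

sc = StandardScaler()

# Fit the scaler on the training data only, then reuse its mean and
# standard deviation for the test data so both sets share the same scale
X_train[num_cols] = sc.fit_transform(X_train[num_cols])
test_data_cleaned[num_cols] = sc.transform(test_data_cleaned[num_cols])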
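Standardizing a column means subtracting its mean and dividing by its standard deviation. The sketch below does this by hand with NumPy on made-up numbers. It follows the common convention of taking both statistics from the training values and reusing them for the test values.

```python
import numpy as np

# Synthetic training and test values for one feature (illustrative numbers only)
train = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
test = np.array([65.0, 95.0])

# Statistics come from the training data only
mean = train.mean()
std = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, which is expected.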


# Models

Using our cleaned data we can now start to train the models to predict strokes.

## K-Nearest-Neighbor

### Selecting best hyperparameters

In [43]:
steps = [('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)

param_grid = {'knn__n_neighbors': np.arange(1, 21), 'knn__weights': ['uniform', 'distance'], 'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
knn_cv = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)

In [None]:
knn_cv.fit(X_train, y_train)
print("Best parameters: ", knn_cv.best_params_)
print("Best cross-validation score: ", knn_cv.best_score_)

### Training model

In [None]:
knn = KNeighborsClassifier(n_neighbors=1, weights='uniform', algorithm='brute')
knn.fit(X_train, y_train)
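As an alternative to copying the best hyperparameters by hand, the fitted search object can be used directly. With the default refit=True, GridSearchCV refits the best configuration on the full training data and exposes it as best_estimator_. The sketch below shows this on synthetic data. The names X_demo, y_demo, demo_pipeline, demo_grid and demo_cv are placeholders, and the grid is deliberately smaller than the one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([('knn', KNeighborsClassifier())])
demo_grid = {'knn__n_neighbors': np.arange(1, 6), 'knn__weights': ['uniform', 'distance']}

demo_cv = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring='f1')
demo_cv.fit(X_demo, y_demo)

# With the default refit=True, the best configuration is already refit on all data
best_knn = demo_cv.best_estimator_
print(demo_cv.best_params_)
print(best_knn.predict(X_demo[:5]))
```

Using best_estimator_ keeps the trained model in sync with the search results if the grid or the data changes later.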

### Predicting

In [None]:
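# Predict on the feature columns only, so the cell also works after a 'stroke' column has been added
knn_pred = knn.predict(test_data_cleaned[X_train.columns])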

test_data_cleaned['stroke'] = knn_pred

### Editing the csv

In [None]:
test_data_cleaned.to_csv('Datasets/Predictions/test_knn_pred.csv', index=False)

## Logistic Regression

### Selecting best hyperparameters

In [24]:
steps = [('lr', LogisticRegression())]
pipeline = Pipeline(steps)

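# Note: not every penalty/solver pair is valid (e.g. 'l1' with 'lbfgs'); by default
# GridSearchCV scores such failed fits as NaN and emits warnings for them.
param_grid = {'lr__penalty': ['l1', 'l2', 'elasticnet', None], 'lr__C': np.logspace(-4, 4, 20), 'lr__class_weight': ['balanced', None], 'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', 'newton-cholesky']}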
lr_cv = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)

In [None]:
lr_cv.fit(X_train, y_train)
print("Best parameters: ", lr_cv.best_params_)
print("Best cross-validation score: ", lr_cv.best_score_)
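The grid above crosses every penalty with every solver, but LogisticRegression only accepts some of those pairs. GridSearchCV also accepts a list of dictionaries, so the search can be restricted to valid pairs. The sketch below is one possible layout, not the grid used for the results here. The names valid_grid, C_values and class_weights are placeholders, 'elasticnet' is left out for brevity, and the exact set of supported pairs depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-4, 4, 20)
class_weights = ['balanced', None]

# Only solver/penalty pairs that LogisticRegression accepts.
# 'elasticnet' is left out here: it needs solver='saga' plus an l1_ratio value.
valid_grid = [
    {'lr__penalty': ['l2'],
     'lr__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga', 'newton-cholesky'],
     'lr__C': C_values, 'lr__class_weight': class_weights},
    {'lr__penalty': ['l1'],
     'lr__solver': ['liblinear', 'saga'],
     'lr__C': C_values, 'lr__class_weight': class_weights},
    # Without a penalty, C has no effect, so it is not varied
    {'lr__penalty': [None],
     'lr__solver': ['newton-cg', 'lbfgs', 'sag', 'saga', 'newton-cholesky'],
     'lr__class_weight': class_weights},
]

# Number of candidate configurations the search would evaluate
print(len(ParameterGrid(valid_grid)))
```

This list can be passed to GridSearchCV in place of param_grid. It yields 330 candidates instead of the 960 in the full cross product, and none of them fail because of an unsupported pair.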

### Training Model

In [None]:
lr = LogisticRegression(class_weight='balanced', solver='newton-cholesky')
lr.fit(X_train, y_train)

### Predicting 

In [27]:
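# Predict on the feature columns only: test_data_cleaned may already
# contain a 'stroke' column written by a previous model
lr_pred = lr.predict(test_data_cleaned[X_train.columns])

test_data_cleaned['stroke'] = lr_pred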

### Editing test_csv

In [28]:
test_data_cleaned.to_csv('Datasets/Predictions/test_lr_pred.csv', index=False)

## Decision Tree

### Selecting the best hyperparameters

In [31]:
steps = [('dt', DecisionTreeClassifier())]
pipeline = Pipeline(steps)

param_grid = {
    'dt__class_weight': ['balanced', None],
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [None, 10, 20, 30],
    'dt__min_samples_split': [2, 10, 20],
    'dt__min_samples_leaf': [1, 5, 10]
}

dt_cv = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)

In [None]:
dt_cv.fit(X_train, y_train)
print("Best parameters: ", dt_cv.best_params_)
print("Best cross-validation score: ", dt_cv.best_score_)

### Training Model

In [None]:
dt = DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=30, min_samples_leaf=1, min_samples_split=10)
dt.fit(X_train, y_train)

### Predicting

In [None]:
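# Predict on the feature columns only: test_data_cleaned may already
# contain a 'stroke' column written by a previous model
dt_pred = dt.predict(test_data_cleaned[X_train.columns])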

test_data_cleaned['stroke'] = dt_pred

### Editing test_csv

In [None]:
test_data_cleaned.to_csv('Datasets/Predictions/test_dt_pred.csv', index=False)
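Since 'id' is removed during cleaning, the saved prediction files contain no identifier. If the predictions need to be matched back to the original rows (an assumption, as the required output format is not stated here), the ids can be taken from the unmodified test frame. The sketch below uses made-up values, and toy_test, toy_pred and submission are placeholder names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_data and a model's predictions (values are made up)
toy_test = pd.DataFrame({'id': [101, 102, 103], 'age': [54.0, 31.0, 77.0]})
toy_pred = np.array([0, 0, 1])

# Pair each prediction with the id from the unmodified test frame
submission = pd.DataFrame({'id': toy_test['id'], 'stroke': toy_pred})

# to_csv() without a path returns the CSV text instead of writing a file
print(submission.to_csv(index=False))
```

This relies on the predictions being in the same row order as the test frame, which holds as long as no rows are dropped or shuffled between loading and predicting.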