# Introduction

A stroke is a medical condition in which poor blood flow to the brain causes cell death. The main risk factor for stroke is high blood pressure. Other risk factors include high blood cholesterol, tobacco smoking, obesity, diabetes mellitus, a previous TIA, end-stage kidney disease, and atrial fibrillation. 
This dataset records whether each person has had a stroke, along with information about their health and social status.

### Understanding the Variables

1. gender: "Male", "Female" or "Other"

2. age: age of the patient

3. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

4. heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease

5. ever_married: "No" or "Yes"

6. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

7. Residence_type: "Rural" or "Urban"

8. avg_glucose_level: average glucose level in blood

9. bmi: body mass index

10. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

11. stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

# Preparing the data

In [None]:
# Import libraries.
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

In [None]:
# List the files available in the input directory to find the dataset name.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the data.
df = pd.read_csv("/kaggle/input/brain-stroke-dataset/brain_stroke.csv")
df.head()

In [None]:
# Check the shape of the data.
df.shape

In [None]:
# Count duplicated rows.
df.duplicated().sum()

In [None]:
# Count null values in each column.
df.isna().sum()

# Exploration and visualization

In [None]:
# Categorical columns.
categorical_features = df[["gender", "hypertension", "heart_disease", "ever_married", 
                       "work_type", "Residence_type", "smoking_status"]]

# Numerical columns.
numerical_features = df[["age", "avg_glucose_level", "bmi"]]

### Insights in categorical columns

In [None]:
# calculate descriptive statistics for categorical values.
categorical_features.astype('object').describe()

In [None]:
# The percentage of each element of the data
for feature in categorical_features:
    categorical_features[feature].value_counts().plot(kind = 'pie', autopct = '%1.1f%%')
    plt.show()

* Most of the patients are women.
* Most do not have hypertension or heart disease.
* Most have been married, and most work in the private sector.
* Residence type is split almost evenly between urban and rural.
* Only about 16% smoke.

In [None]:
# Plot graphs that show the number who had stroke for categorical features.
for feature in categorical_features:
    title = "Stroke count by " + feature
    sns.countplot(data=df, x=feature, hue="stroke")
    plt.title(title)
    plt.show()

* Women account for slightly more stroke cases than men.
* More stroke cases occur among patients without hypertension or heart disease, but this is most likely because the large majority of observations have neither condition. It does not mean that hypertension and heart disease are unrelated to stroke.
* Children almost never have a stroke.
* Residence type has no noticeable effect on stroke: the number of cases is almost equal in rural and urban areas.
* Patients who never smoked have the most stroke cases, again because most patients do not smoke; other factors may explain their strokes.
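
Count plots show absolute numbers, so a large group can dominate the bars even when its stroke rate is low. A minimal sketch of comparing rates instead of counts, using a small made-up frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the dataset.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "stroke":       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Absolute number of stroke cases per group: the larger group wins.
stroke_counts = toy.groupby("hypertension")["stroke"].sum()

# Share of each group that had a stroke: the mean of a 0/1 column is a rate.
stroke_rates = toy.groupby("hypertension")["stroke"].mean()

print(stroke_counts)
print(stroke_rates)
```

In this toy example the group without hypertension has more cases (2 versus 1) but a lower rate (25% versus 50%). On the notebook's data the same idea would be `df.groupby(feature)["stroke"].mean()` for each categorical feature.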

### Insights in numerical columns

In [None]:
# calculate descriptive statistics for numerical values.
numerical_features.describe()

* The average age is about 43.
* The average blood glucose level is about 106, which is within the normal range, taking a normal blood glucose level to be below 140.
* The average BMI is about 28, which is above the commonly cited normal range of 18.5 to 24.9 and falls in the overweight range.

In [None]:
# Plot graphs that show the number who had stroke for numerical features.
for feature in numerical_features:
    title = "Stroke count by " + feature
    sns.histplot(data=df, x=feature, hue="stroke", kde = True)
    plt.title(title)
    plt.show()

* Most stroke patients are over the age of 40.
* Blood glucose level does not appear to have a strong effect on stroke.
* Most stroke patients have a BMI between 20 and 40.

In [None]:
# calculate descriptive statistics for 20<bmi<40
df[(df['bmi'] > 20) & (df['bmi'] < 40)].astype('object').describe() 

A normal BMI is commonly taken to be about 18.5 to 24.9, yet the data contains stroke observations with moderate BMI values. Other factors may be more influential and lead to a stroke even when BMI is moderate, for example being married and a smoker.
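
One way to check how stroke relates to BMI is to group patients into BMI bands and compare the stroke rate in each band. The sketch below uses made-up values and WHO-style adult cut-offs; the band names and the toy frame are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values); the notebook's df has the same column names.
toy = pd.DataFrame({
    "bmi":    [17.0, 22.0, 24.0, 27.0, 29.0, 31.0, 36.0, 42.0],
    "stroke": [0,    0,    1,    0,    1,    1,    0,    1],
})

# WHO-style adult BMI bands: <18.5, 18.5 to <25, 25 to <30, 30 and above.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]
toy["bmi_band"] = pd.cut(toy["bmi"], bins=bins, labels=labels, right=False)

# Number of patients and stroke rate in each band.
summary = toy.groupby("bmi_band", observed=False)["stroke"].agg(["count", "mean"])
print(summary)
```

Applied to the real `df`, the `mean` column would show whether the stroke rate actually rises with BMI, which the histogram of raw counts cannot show on its own.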

# Data processing for the model

We will drop the two columns "hypertension" and "heart_disease" because they are heavily unbalanced.



In [None]:
# Drop hypertension and heart_disease
df.drop(['hypertension', 'heart_disease'], axis=1, inplace=True)

### Encoding and scaling the data

In [None]:
# One-hot encoding for "work_type" and "smoking_status".
df = pd.get_dummies(df, columns=['work_type', 'smoking_status'])
df.head()

In [None]:
# Label Encoding for rest of categorical columns.

# Create label object.
label_encoder = LabelEncoder()

for i in ["gender", "ever_married", "Residence_type"]:
    df[i] = label_encoder.fit_transform(df[i])

df.head()
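
`LabelEncoder` assigns integer codes in sorted label order, and a single encoder reused in a loop only remembers the last column it was fitted on. The sketch below, on hypothetical values, shows how the mapping can be inspected and reversed when one encoder is kept per column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one binary column.
residence = ["Urban", "Rural", "Urban", "Urban", "Rural"]

encoder = LabelEncoder()
codes = encoder.fit_transform(residence)

# classes_ holds the sorted labels; each label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Storing the fitted encoders in a dictionary keyed by column name is one option if the codes need to be decoded later.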

In [None]:
# Create scaler object.
scaler = StandardScaler()

# Scale the numerical columns and write the result back into df,
# so the scaled values are the ones used for modeling.
numerical_columns = ["age", "avg_glucose_level", "bmi"]
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
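
The scaler in this notebook is fitted on every row before the train/test split. A common alternative, sketched here on synthetic numbers (the array below is a hypothetical stand-in for the three numerical columns), is to learn the scaling statistics from the training rows only and reuse them on the held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for age, avg_glucose_level and bmi (hypothetical values).
rng = np.random.default_rng(0)
X_num = rng.normal(loc=[43.0, 106.0, 28.0], scale=[22.0, 45.0, 7.0], size=(200, 3))

X_tr, X_te = train_test_split(X_num, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the held-out rows

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected because the test rows did not influence the statistics.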

### Split the data

In [None]:
# Split data into x and y.
X = df.drop("stroke", axis=1)
y = df["stroke"]

In [None]:
y.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')
plt.show()

The data is unbalanced, so we will resample it, trying two approaches: oversampling and undersampling.


In [None]:
# Split the data into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Split the train data into two subsets, train1 and test1
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train, y_train, test_size=0.25, random_state=10)

# Split the train data into two subsets, train2 and test2
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_train, y_train, test_size=0.25, random_state=20)
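
With a rare positive class, a purely random split can leave a subset with very few stroke cases. A possible refinement, shown here on synthetic labels (the arrays are hypothetical), is to pass `stratify` so each part keeps the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 5% positives (hypothetical, mimicking an imbalanced target).
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the positive share the same in both parts.
X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_a.mean(), y_b.mean())
```

In the notebook this would amount to adding `stratify=y` (or `stratify=y_train` for the inner splits) to the existing `train_test_split` calls.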

# Resampling the data

### Oversampling (SMOTE)

In [None]:
# Instantiate the SMOTE class
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Perform SMOTE oversampling on the dataset
X_overesampled, y_overesampled = smote.fit_resample(X_train1, y_train1)

In [None]:
y_overesampled.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')
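
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line between them. The sketch below illustrates that interpolation step with NumPy only. It is a simplified illustration, not the imblearn implementation, and the points and the `synthesize` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8]])

# Pairwise distances; the diagonal is set to infinity so a point is never its own neighbour.
dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)

k = 2
neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest minority points

def synthesize(i):
    """Create one synthetic point on the segment between sample i and a random neighbour."""
    j = rng.choice(neighbours[i])
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

new_point = synthesize(0)
print(new_point)
```

Because every synthetic point is a blend of two real minority points, it always lies inside the region already occupied by the minority class.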

### Trying the models with oversampling

In [None]:
# Initialize the models.
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
}

In [None]:
# Iterate over each model and evaluate its accuracy using cross-validation.
for model_name, model in models.items():
    scores = cross_val_score(model, X_overesampled, y_overesampled)
    accuracy = scores.mean()
    print(f'{model_name} Accuracy: {accuracy}')
    
    # Fit the model to the full training set and make predictions on the test set
    model.fit(X_overesampled, y_overesampled)
    y_pred1 = model.predict(X_test1)
    
    # Evaluate the model on the test set
    acc = accuracy_score(y_test1, y_pred1)
    prec = precision_score(y_test1, y_pred1)
    
    print(f"Accuracy: {acc:.3f}")
    print(f"Precision: {prec:.3f}")

With unbalanced data, accuracy is not the most appropriate metric for choosing a model. Precision, recall, or another metric suited to your goal is a better choice.

For our data, we focus on precision because false positives carry a high cost.
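
A small worked example makes the point concrete. The labels below are hypothetical: 95 negatives and 5 positives, scored against a model that never predicts a stroke and a model that flags four patients.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# A useless model that always predicts "no stroke".
y_always_negative = [0] * 100

# A model that flags 4 patients: 2 false alarms and 2 real positives.
y_some_positive = [0] * 93 + [1] * 2 + [0] * 3 + [1] * 2

for name, y_hat in [("always negative", y_always_negative),
                    ("some positive", y_some_positive)]:
    acc = accuracy_score(y_true, y_hat)
    prec = precision_score(y_true, y_hat, zero_division=0)
    rec = recall_score(y_true, y_hat, zero_division=0)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Both models reach 95% accuracy, but only precision (0.00 versus 0.50) and recall (0.00 versus 0.40) reveal that the first one never finds a single stroke case.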

### Undersampling (TomekLinks)

In [None]:
# Instantiate the TomekLinks class
tomek_links = TomekLinks(sampling_strategy='auto', n_jobs=-1)

# Perform Tomek Links undersampling on the dataset
X_underesampled, y_underesampled = tomek_links.fit_resample(X_train2, y_train2)

In [None]:
y_underesampled.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')

### Trying the models with undersampling

In [None]:
# Iterate over each model and evaluate its accuracy using cross-validation.
for model_name, model in models.items():
    scores = cross_val_score(model, X_underesampled, y_underesampled)
    accuracy = scores.mean()
    print(f'{model_name} Accuracy: {accuracy}')
    
    # Fit the model to the full training set and make predictions on the test set
    model.fit(X_underesampled, y_underesampled)
    y_pred2 = model.predict(X_test2)
    
    # Evaluate the model on the test set
    acc = accuracy_score(y_test2, y_pred2)
    prec = precision_score(y_test2, y_pred2)
    
    print(f"Accuracy: {acc:.3f}")
    print(f"Precision: {prec:.3f}")

As we have seen, Tomek Links undersampling is not appropriate here: it removes only a few majority-class samples, so the data stays unbalanced.

The experiments also show that, with unbalanced data, accuracy does not measure how good a model is; precision is the better metric in our case.
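
The reason Tomek Links cannot balance the classes follows from its definition: a Tomek link is a pair of mutual nearest neighbours with different labels, and only the majority member of each pair is removed. Each minority point can belong to at most one link, so the number of removed samples can never exceed the minority count. The sketch below demonstrates this on synthetic data; it is a simplified NumPy illustration under those assumptions, not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic imbalanced data (hypothetical): 950 majority and 50 minority points.
X_major = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
X_minor = rng.normal(loc=1.0, scale=1.0, size=(50, 2))
X_toy = np.vstack([X_major, X_minor])
y_toy = np.array([0] * 950 + [1] * 50)

# Nearest other point for every sample (the diagonal is excluded explicitly).
dist = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek link is a pair of mutual nearest neighbours with different labels.
idx = np.arange(len(X_toy))
in_link = (nearest[nearest] == idx) & (y_toy[nearest] != y_toy)

# Undersampling removes only the majority member of each link.
removed = in_link & (y_toy == 0)
print("majority removed:", removed.sum())
print("class counts after:", np.bincount(y_toy[~removed]))
```

Even in the best case the majority class here can only shrink from 950 to 900 samples, which leaves the ratio far from balanced.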

# Modeling

Logistic Regression has the highest precision, so we will use it for the final model and resample the training data with SMOTE.


In [None]:
# Perform SMOTE oversampling on the dataset
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

In [None]:
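# Build the Logistic Regression model.
# The 'liblinear' solver is specified because the default solver does not support the l1 penalty.
logreg = LogisticRegression(C=10, penalty='l1', solver='liblinear')

# Fit the model.
logreg.fit(X_resampled, y_resampled)

# Predict on the test set.
y_pred = logreg.predict(X_test)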

# Evaluate the accuracy and precision of y-predict.
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
    
print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")

In [None]:
# Show the confusion matrix.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
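
For binary labels, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`. The sketch below, on hypothetical labels and predictions, unpacks the four cells and recomputes precision and recall from them:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted strokes that were real
recall = tp / (tp + fn)     # share of real strokes that were found
print(tn, fp, fn, tp, precision, recall)
```

Reading the cells this way shows directly how many false positives the precision score is penalizing and how many stroke cases were missed.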

In [None]:
# Count how often each class appears in the predictions.
unique, counts = np.unique(y_pred, return_counts=True)
print(unique)
print(counts)

# Conclusion

In conclusion, we started by exploring the dataset and understanding its variables. We found that the data contains no duplicated or null values, so it needed little preprocessing. We then explored and visualized the data.

* Most of the patients are women, do not have hypertension or heart disease, are married, and work in the private sector. Residence type is split evenly between urban and rural, and only about 16% smoke.

* More stroke cases occur among those without hypertension or heart disease, but this is most likely because most observations have neither condition. It does not mean that hypertension and heart disease are unrelated to stroke.

* The data is unbalanced, so we tried resampling techniques (oversampling and undersampling) and used Decision Tree, Random Forest, and Logistic Regression models to predict stroke. We compared the accuracy and precision of the models and found that Logistic Regression gave the best results when combined with SMOTE oversampling.