### Dataset Overview

The dataset used in this project is intended for predicting whether a patient has a stroke, based on health-related factors. Each feature describes one aspect of the patient's health, such as age, hypertension status, or smoking habits. The target variable is `stroke`, a binary label indicating whether a stroke occurred.

Key features:
- `age`: The age of the patient.
- `hypertension`: Indicates whether the patient has hypertension (1) or not (0).
- `heart_disease`: Indicates the presence (1) or absence (0) of heart disease.
- `avg_glucose_level`: The average glucose level in the blood.
- `bmi`: Body Mass Index.


# Importing the Libraries

In [None]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, model, and evaluation utilities
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Data Loading

In [None]:
df = pd.read_csv("DATA/healthcare-dataset-stroke-data.csv")

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.columns

# Data Exploration (Visualizations)


In [None]:
plt.figure(figsize=(10, 6))
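# Restrict the correlation to numeric columns; the categorical (object) columns
# are not encoded yet, and recent pandas versions raise an error on them.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')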
plt.title('Correlation Heatmap')
plt.show()



In [None]:
df.hist(bins=30, figsize=(15, 10))
plt.suptitle('Feature Distributions')
plt.show()


# Data Preprocessing


In [None]:
def show_missing_data(df):
    total=df.isnull().sum().sort_values(ascending=False)
    percent=(df.isnull().sum()/df.isnull().count() * 100).sort_values(ascending=False)
    missing_data=pd.concat([total,percent],axis=1,keys=["total","Percent"])
    return missing_data

In [None]:
show_missing_data(df)

In [None]:
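# Fill missing BMI values with the column mean. Assigning the result back avoids
# the in-place call on a selected column, which recent pandas versions warn about.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())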

In [None]:
show_missing_data(df)

In [None]:
df['stroke'].unique()

In [None]:
# drop() returns a new DataFrame, so assign it back to actually remove the id column
df = df.drop('id', axis=1)

In [None]:
X=df.drop('stroke',axis=1)
y=df["stroke"]

In [None]:
X

In [None]:
y

In [None]:
#checking the number of unique values in each column
print("Unique values in each column are:")
for col in df.columns:
    print(col,df[col].nunique())

In [None]:
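# Collect the categorical (object-dtype) columns that need encoding.
# Note: despite its name, colname_num holds the non-numeric columns.
colname_num=[]
for x in df.columns:
    if df[x].dtype=='object':
        colname_num.append(x)
colname_num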

### Encoding Categorical Variables

The dataset contains categorical variables, which must be converted to numbers before the model can use them. We use `LabelEncoder` for this: it assigns each distinct category in a column an integer code.


In [None]:


le = preprocessing.LabelEncoder()

for x in colname_num:
    df[x]=le.fit_transform(df[x])
    
    print()
    le_name_mapping = dict(zip(le.classes_,le.transform(le.classes_)))
    print('Feature',x)
    print('mapping',le_name_mapping)
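
The loop above reuses a single `LabelEncoder`, so after it finishes only the last column's mapping is still available. As a minimal sketch of one alternative, the example below keeps one encoder per column so that codes can be decoded later. The toy frame, its column names, and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical columns and values, for illustration only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "never smoked"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so the position of each class is its integer code
mapping = {
    col: dict(zip(enc.classes_, range(len(enc.classes_))))
    for col, enc in encoders.items()
}
print(mapping)

# Decode a column back to its original labels
decoded = encoders["gender"].inverse_transform(encoded["gender"])
print(list(decoded))
```

Note that `LabelEncoder` assigns codes in sorted order, which implies an ordering between categories that has no real meaning. Tree-based models are usually tolerant of this, but one-hot encoding may be a safer choice for other model types.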

In [None]:
df

In [None]:
X=df.drop('stroke',axis=1)
y=df["stroke"]

In [None]:
y

In [None]:
X

## Model: Decision Tree

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('stroke',axis=1), df['stroke'], test_size=0.3, random_state=30)

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy',max_depth=10,min_samples_leaf=20,min_samples_split=15,splitter='best')
dtree


In [None]:
dtree.fit(X_train, y_train)

In [None]:
print("Training accuracy:",dtree.score(X_train,y_train))

In [None]:
# Accuracy on the held-out test set (the model was already trained above)

dtree.score(X_test, y_test)

In [None]:
dt_pred = dtree.predict(X_test)
accuracy_score(y_test, dt_pred)

In [None]:
cfm =confusion_matrix(y_test,dt_pred)

print(cfm)

print("Classification report: ")

print(classification_report(y_test,dt_pred))

acc=accuracy_score(y_test, dt_pred)
print("Accuracy of the model: ",acc)

In [None]:
for i in df.columns:
    print("-------------{}-------------".format(i))
    print(df[i].value_counts())

# Conclusion

A Decision Tree classifier was trained to predict stroke occurrence from the health indicators described above. The model reached reasonable accuracy on the test set, but accuracy alone can look high when one class dominates, so it should be read together with the confusion matrix and the per-class precision and recall in the classification report. There is room for improvement through hyperparameter tuning, feature engineering, and handling class imbalance. The correlation heatmap and feature distributions give a first indication of which factors relate to stroke risk, and the analysis offers a foundation for more advanced models and applications in predictive healthcare.
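
As a rough sketch of the tuning and class-imbalance steps mentioned above, the example below combines `class_weight="balanced"` with a small grid search and then reads the tree's feature importances. It runs on synthetic data from `make_classification`, not on the stroke dataset. The parameter grid, the F1 scoring choice, and the roughly 5% positive rate are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 5% positives); not the stroke data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Hypothetical grid; the values are illustrative, not recommendations
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [5, 20, 50]}

search = GridSearchCV(
    # class_weight="balanced" reweights samples inversely to class frequency
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",  # F1 on the positive class rather than plain accuracy
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(classification_report(y_te, search.predict(X_te), zero_division=0))

# Impurity-based importances of the best tree, one value per feature
importances = search.best_estimator_.feature_importances_
print(importances)
```

Scoring on F1 instead of accuracy pushes the search toward settings that actually detect the rare class. On the real data, the same pattern would apply to `X_train` and `y_train` from the split above.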
