<h1><u><span style="color:#FF6666;">Stroke Patient Healthcare Using Deep Learning</span></u></h1>

This project examines a healthcare dataset for predicting **strokes**. The dataset contains patient details, including demographic information, medical history, and lifestyle factors. The goal is to estimate the probability of a stroke from these characteristics.

<h4><u>Dataset:</u></h4>

- **Gender and Age**: Basic demographic information.
- **Hypertension and Heart Disease**: Medical history related to stroke risk.
- **Ever Married**: Marital status, which might influence health outcomes.
- **Work Type and Residence Type**: Social factors that could affect health.
- **Avg Glucose Level and BMI**: Indicators of health related to metabolic conditions.
- **Smoking Status**: Lifestyle factor contributing to stroke risk.
- **Stroke**: The target variable indicating whether a stroke has occurred (1) or not (0).

**Dataset Link**: [Click here to access the dataset](https://drive.google.com/file/d/1XyhVIZaKYZczlM2alun_fofilqTBq_9c/view?usp=drive_link)


# **(1) Defining Problem Statement and Analyzing Basic Metrics**
***

The objective of this project is to analyze a healthcare dataset related to **stroke** patients, focusing on identifying key risk factors that contribute to stroke occurrences. This includes analyzing metrics such as age, hypertension, heart disease, BMI, glucose levels, and lifestyle factors like smoking habits. By examining these variables, we aim to uncover patterns that can aid in early diagnosis and stroke prevention. The analysis will provide insights into the prevalence of strokes across different demographics and conditions, helping to better understand the contributing factors.

# **(2) Import Libraries and Load the Dataset**
***

In [14]:
import numpy as np  # Importing NumPy 
import pandas as pd  # Importing Pandas 


In [16]:
# Load the .csv dataset into a pandas DataFrame
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1XyhVIZaKYZczlM2alun_fofilqTBq_9c")

In [18]:
# shows the top 5 records of the dataset
df.head()

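|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |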


# **(3) Data Exploration and Pre-processing**
***

### **Check basic metrics and data types**

Before any analysis, it helps to understand the dataset's structure: the number of rows and columns and the data type of each attribute. This exploration is an essential first step.

In [24]:
df.shape  # Returns a tuple representing the number of rows and columns in the DataFrame

(5110, 12)

In [26]:
df.info()  # Provides a summary of the DataFrame, including data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


<h5 style="color: blue;">Observations:</h5>


- **Dataset Size**: 5,110 entries and 12 columns.
- **Data Types**:
  - 4 integer columns
  - 3 float columns
  - 5 object columns
- **Missing Values**: The **bmi** column has 201 missing values (4,909 non-null out of 5,110); all other columns are complete.
- **Attributes**:
  - Identifier: **id**
  - Demographic information: **gender**, **age**, **ever_married**, **work_type**, **Residence_type**
  - Health-related factors: **hypertension**, **heart_disease**, **avg_glucose_level**, **bmi**, **smoking_status**
  - Target variable: **stroke**
- **Analysis Potential**: The dataset is suitable for exploring relationships between demographic and health-related factors and stroke incidence.
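
The dtype tally and missing-value counts reported by `df.info()` can also be cross-checked programmatically. The snippet below is a minimal sketch on a small hypothetical stand-in frame (`sample` is not the real dataset; only the column names are borrowed from it). The same two expressions should work unchanged on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the stroke DataFrame; the real df has 5,110 rows.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 0, 0],
})

# Number of columns per dtype, the same tally df.info() prints in its summary.
dtype_counts = sample.dtypes.astype(str).value_counts()

# Missing entries per column: total rows minus the non-null count.
missing_per_column = len(sample) - sample.count()

print(dtype_counts)
print(missing_per_column[missing_per_column > 0])
```

Subtracting the non-null count from the row count is exactly the arithmetic behind the missing-value figure for `bmi` (5,110 rows minus 4,909 non-null entries).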


In [31]:
# Statistical summary of the numerical columns
df.describe()

|   | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


<h5 style="color: blue;">Observations:</h5>

- **Age Distribution**: The average age of participants is approximately **43.23 years**, with a wide range from **0.08** to **82 years**, so the sample spans infants through older adults.
  
- **Hypertension and Heart Disease**: About **9.75%** of participants have hypertension, and approximately **5.4%** have a history of heart disease, reflecting relatively low prevalence rates.

- **Average Glucose Level**: The mean average glucose level is **106.15 mg/dL**, with a maximum value of **271.74 mg/dL**, suggesting variability in glucose levels among participants.

- **Body Mass Index (BMI)**: The mean BMI is **28.89 kg/m²**, indicating that many participants may fall into the overweight category; the BMI column has **201 missing values**.

- **Stroke Incidence**: Only about **4.87%** of participants have experienced a stroke, suggesting that stroke occurrences are relatively infrequent in this dataset.
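
The prevalence percentages above come straight from the `mean` row of `df.describe()`: for a column coded 0/1, the mean equals the proportion of ones. A minimal sketch of that calculation follows, using hypothetical toy values (`toy` is illustrative; only the column names match the dataset).

```python
import pandas as pd

# Hypothetical toy data: column names match the dataset, values do not.
toy = pd.DataFrame({
    "hypertension":  [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    "heart_disease": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "stroke":        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

# For 0/1 indicator columns, the mean is the share of rows equal to 1.
prevalence_pct = (toy.mean() * 100).round(2)

print(prevalence_pct)
```

Applied to the real frame, `df[["hypertension", "heart_disease", "stroke"]].mean() * 100` should reproduce the percentages quoted in the observations.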



In [41]:
df.smoking_status.unique()  # Retrieve the unique values in the 'smoking_status' column to analyze different smoking behaviors in the dataset.

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

### **Statistical Summary of categorical type data**

In [44]:
# Statistical summary of categorical type data
df.describe(include = object)

|   | gender | ever_married | work_type | Residence_type | smoking_status |
| --- | --- | --- | --- | --- | --- |
| count | 5110 | 5110 | 5110 | 5110 | 5110 |
| unique | 3 | 2 | 5 | 2 | 4 |
| top | Female | Yes | Private | Urban | never smoked |
| freq | 2994 | 3353 | 2925 | 2596 | 1892 |


<h5 style="color: blue;">Observations:</h5>

- **Gender**: There are **3 unique categories** with **Female** being the most frequent (2,994 occurrences), indicating a higher representation of females in the dataset.

- **Marital Status**: The majority of participants are **ever married** (3,353 occurrences), suggesting that most individuals in the dataset have been married at some point.

- **Work Type**: There are **5 unique work types**, with **Private** employment being the most common (2,925 occurrences), indicating that many participants work in the private sector.

- **Residence Type**: The dataset includes **2 types** of residence, with **Urban** slightly more common (2,596 occurrences, about 50.8%), so urban and rural participants are almost evenly represented.

- **Smoking Status**: There are **4 unique smoking status categories**, with **never smoked** being the most common (1,892 occurrences, about 37%). Because one of the categories is **Unknown**, the true prevalence of smoking cannot be read directly from these counts.
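
`describe(include=object)` reports only the most frequent category per column. To see how every category is distributed, `value_counts(normalize=True)` is a common choice. The sketch below uses a hypothetical toy series (`smoking` is illustrative, not the real column) with the same four labels.

```python
import pandas as pd

# Hypothetical toy column using the dataset's four smoking_status labels.
smoking = pd.Series(
    ["never smoked", "never smoked", "Unknown", "smokes",
     "formerly smoked", "never smoked", "Unknown", "smokes"],
    name="smoking_status",
)

# Share of each category in percent, sorted from most to least frequent.
shares_pct = (smoking.value_counts(normalize=True) * 100).round(1)

print(shares_pct)
```

On the real frame, `df["smoking_status"].value_counts(normalize=True)` would show how large the `Unknown` group is, which matters when interpreting smoking prevalence.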

<h3 style="text-decoration: underline;">Checking null values</h3>


This step serves as both **data cleaning** and **data preprocessing**. Identifying and managing missing values is an essential part of data cleaning, as it addresses the issue of incomplete data. Depending on the severity of the missing data, you may need to choose a method for handling it, such as imputing values or removing the affected rows or columns. 

Additionally, this step is crucial for data preprocessing because the presence of missing values can affect the reliability of subsequent analyses. Addressing these gaps ensures that the data is in an appropriate format for effective analysis.
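
The two strategies mentioned above, removing affected rows or imputing values, can be compared side by side. The following is a minimal sketch on a hypothetical five-row frame (`toy` is illustrative; the real dataset is much larger), not a recommendation of one strategy over the other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing bmi value.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
})

# Option 1: drop every row whose bmi is missing (loses that row's other fields).
dropped = toy.dropna(subset=["bmi"])

# Option 2: keep all rows and fill the gap with the column median.
imputed = toy.assign(bmi=toy["bmi"].fillna(toy["bmi"].median()))

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping rows shrinks the sample, while imputation keeps every row at the cost of introducing an estimated value. With only a few percent of rows affected, either option is often considered acceptable.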

In [48]:
# Display the count of missing values for each column
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [50]:
# Calculate the missing values percentage for each column and round to two decimal places
missing_values_percentage = (df.isnull().mean() * 100).round(2)

# Display the missing values percentage for each column
print("Missing Values Percentage:\n")
print(missing_values_percentage)

Missing Values Percentage:

id                   0.00
gender               0.00
age                  0.00
hypertension         0.00
heart_disease        0.00
ever_married         0.00
work_type            0.00
Residence_type       0.00
avg_glucose_level    0.00
bmi                  3.93
smoking_status       0.00
stroke               0.00
dtype: float64


<h5 style="color: blue;">Observations:</h5>

- The dataset is largely complete, with 11 of the 12 columns showing **0% missing data**.
- The **'bmi'** column is the only one with missing values, accounting for **3.93%** of the entries (201 of 5,110).
- All other columns (**'id'**, **'gender'**, **'age'**, **'hypertension'**, **'heart_disease'**, **'ever_married'**,
  **'work_type'**, **'Residence_type'**, **'avg_glucose_level'**, **'smoking_status'**, and **'stroke'**) have **no missing data**.


<h3 style="text-decoration: underline;">Handling null values</h3>



In [58]:
# Fill missing 'bmi' values with the column median and assign the result back
df['bmi'] = df['bmi'].fillna(df['bmi'].median())

In [60]:
# Display the count of missing values for each column
df.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

<h5 style="color: blue;">Observations:</h5>

- After median imputation of **'bmi'**, the dataset has no missing values in any column.
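
It can be worth checking what median imputation does to the column's distribution. The sketch below uses a hypothetical toy series (`bmi` here is illustrative, not the real column): filling gaps with the median leaves the median itself unchanged but typically shrinks the standard deviation, because the filled values all sit at the center.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bmi values with two gaps.
bmi = pd.Series([36.6, np.nan, 32.5, 34.4, 24.0, np.nan], name="bmi")

# Same approach as in the notebook: fill missing values with the median.
filled = bmi.fillna(bmi.median())

# Compare summary statistics before and after imputation.
summary = pd.DataFrame({
    "before": [bmi.isna().sum(), bmi.median(), bmi.std()],
    "after": [filled.isna().sum(), filled.median(), filled.std()],
}, index=["missing", "median", "std"])

print(summary)
```

Running the same comparison on `df["bmi"]` before and after the fill would show how much the spread changes in the real data.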