<h1>STROKE PATIENT HEALTHCARE</h1> 

**Dataset Description**:

This dataset contains health-related information about individuals, with a focus on factors that may influence the occurrence of strokes. Each entry includes various demographic, medical, and lifestyle attributes.

| **Field**            | **Description**                                                                                                                                               |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **id**               | Unique identifier of the individual.                                                                                                                          |
| **gender**           | Gender of the individual (e.g., Male, Female).                                                                                                                |
| **age**              | Age of the individual in years.                                                                                                                               |
| **hypertension**     | Indicates if the individual has hypertension (0 = No, 1 = Yes).                                                                                               |
| **heart_disease**    | Indicates if the individual has heart disease (0 = No, 1 = Yes).                                                                                              |
| **ever_married**     | Indicates marital status (Yes/No).                                                                                                                            |
| **work_type**        | Type of work the individual is engaged in (e.g., Private, Self-employed, Govt_job).                                                                            |
| **Residence_type**   | Type of residence (Urban/Rural).                                                                                                                              |
| **avg_glucose_level**| Average glucose level of the individual (measured in mg/dL).                                                                                                  |
| **bmi**              | Body Mass Index of the individual (N/A indicates missing data).                                                                                               |
| **smoking_status**   | Smoking status of the individual (e.g., never smoked, smokes, formerly smoked, Unknown).                                                                      |
| **stroke**           | Indicates if the individual has had a stroke (1 = Yes, 0 = No).                                                                                                |


<h1>Import libraries</h1>

In [14]:
import pandas as pd
import numpy as np


<h1>Read the csv file</h1>

In [15]:
df= pd.read_csv("healthcare-dataset-stroke-data.csv")
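The field table says that N/A marks a missing `bmi`. The sketch below shows how `pd.read_csv` turns such a marker into `NaN`. The three-row `sample_csv` text is a made-up stand-in for the real file, and the explicit `na_values` argument is only illustrative, since pandas already treats the string `N/A` as missing by default.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,age,bmi,stroke\n"
    "1,67,36.6,1\n"
    "2,61,N/A,1\n"
    "3,80,32.5,0\n"
)

# "N/A" is one of the strings pandas parses as missing by default;
# na_values is passed here only to make that behaviour explicit.
sample = pd.read_csv(sample_csv, na_values=["N/A"])

print(sample["bmi"].isna().sum())  # number of missing bmi entries
print(sample.dtypes)               # bmi is parsed as a float column
```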

<h1>Data Exploration and Preprocessing</h1>
<h2>Check basic metrics and data types</h2>

Understanding the structure of the dataset, including the number of rows and columns and the data type of each attribute, is a crucial step in data exploration.

In [43]:
df.head()  # show the first 5 records of the dataset

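|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | 28.893237 | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |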


In [44]:
df.shape

(5110, 12)

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                5110 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


<h2>OBSERVATIONS</h2>

The dataset consists of 5110 rows and 12 columns: "id", "gender", "age", "hypertension", "heart_disease", "ever_married", "work_type", "Residence_type", "avg_glucose_level", "bmi", "smoking_status" and "stroke".
Three columns are float64, four are int64 and five are object (string) columns. Every column has 5110 non-null values.
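The split between numeric and text columns reported by `df.info()` can also be obtained programmatically. The following is a minimal sketch using `select_dtypes` on a small made-up frame. The name `toy` and its values are illustrative, not taken from the full dataset.

```python
import pandas as pd

# Hypothetical two-row frame with one float, one int and one text column.
toy = pd.DataFrame({
    "age": [67.0, 61.0],
    "hypertension": [0, 0],
    "gender": ["Male", "Female"],
})

# Columns with a numeric dtype (int64, float64, ...).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, here the text column.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Lists like these are convenient when numeric and categorical columns need different summaries or different missing-value strategies.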


<h3>Statistical summary of numerical columns</h3>

In [58]:
df.describe()

|       | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|-------|----|-----|--------------|---------------|-------------------|-----|--------|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 |
| mean  | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std   | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.698018 | 0.21532 |
| min   | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25%   | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.8 | 0.0 |
| 50%   | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.4 | 0.0 |
| 75%   | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 32.8 | 0.0 |
| max   | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


In [71]:
# Statistical summary of categorical (object) columns

df.describe(include = object)

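|        | gender | ever_married | work_type | Residence_type | smoking_status |
|--------|--------|--------------|-----------|----------------|----------------|
| count  | 5110 | 5110 | 5110 | 5110 | 5110 |
| unique | 3 | 2 | 5 | 2 | 4 |
| top    | Female | Yes | Private | Urban | never smoked |
| freq   | 2994 | 3353 | 2925 | 2596 | 1892 |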


### **Check for missing values**

Checking for missing values is both a **data cleaning** and a **data preprocessing** step. It is **data cleaning** because it addresses incomplete data: depending on how much is missing, you either impute values or remove the affected rows or columns. It is **data preprocessing** because missing values can distort subsequent analyses, so handling them puts the data in a suitable form for analysis.
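The choice between imputing and removing can be sketched on a small example. The frame `toy` below is made up, and the median is used only as one possible imputation value. Neither is taken from the notebook's own pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which half of the bmi values are missing.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Fraction of missing values per column, to judge the extent of the problem.
missing_share = toy.isna().mean()

# Option 1: remove the rows where bmi is missing.
dropped = toy.dropna(subset=["bmi"])

# Option 2: keep every row and impute bmi with the column median.
imputed = toy.assign(bmi=toy["bmi"].fillna(toy["bmi"].median()))

print(missing_share)
print(len(dropped), len(imputed))
```

Dropping rows shrinks the sample, while imputing keeps all rows at the cost of introducing estimated values.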

In [72]:
df['bmi'].isnull().any()

np.False_

In [84]:
# Display the count of missing values for each column
df.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [86]:
set(df.work_type)

{'Govt_job', 'Never_worked', 'Private', 'Self-employed', 'children'}

In [87]:
df.smoking_status.unique()

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

In [88]:
# Methods to deal with NaN values. bmi has no NaN in this file, so the calls below return it unchanged.

In [89]:
df['bmi'].fillna(0)  # returns a copy with null values replaced by 0; df itself is not modified

0       36.600000
1       28.893237
2       32.500000
3       34.400000
4       24.000000
          ...    
5105    28.893237
5106    40.000000
5107    30.600000
5108    25.600000
5109    26.200000
Name: bmi, Length: 5110, dtype: float64

In [90]:
df['bmi'].value_counts()  # null values are excluded from these counts by default

bmi
28.893237    201
28.700000     41
28.400000     38
27.600000     37
26.100000     37
            ... 
47.900000      1
13.000000      1
46.300000      1
54.100000      1
14.900000      1
Name: count, Length: 419, dtype: int64
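Because `value_counts()` skips nulls by default, it cannot reveal how many values are missing. A minimal sketch on a made-up Series shows the `dropna=False` option, which counts `NaN` as its own category.

```python
import numpy as np
import pandas as pd

# Hypothetical bmi values: two valid entries and three missing ones.
bmi = pd.Series([28.4, np.nan, 28.4, np.nan, np.nan], name="bmi")

# Default behaviour: NaN entries are left out of the counts.
counts_default = bmi.value_counts()

# With dropna=False, NaN gets its own row in the result.
counts_with_nan = bmi.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```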

In [91]:
mode_bmi = df['bmi'].mode()[0]  # mode() returns a Series; take its first value as a scalar

In [92]:
df['bmi'].fillna(mode_bmi)

0       36.600000
1       28.893237
2       32.500000
3       34.400000
4       24.000000
          ...    
5105    28.893237
5106    40.000000
5107    30.600000
5108    25.600000
5109    26.200000
Name: bmi, Length: 5110, dtype: float64
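One detail of mode imputation is easy to miss. `Series.mode()` returns a Series, because several values can tie, and `fillna` aligns a Series argument on the index instead of broadcasting it. The sketch below uses a made-up Series to contrast the two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical bmi values with missing entries at index 1 and 3.
bmi = pd.Series([28.4, np.nan, 28.4, np.nan, 30.1], name="bmi")

# mode() returns a Series (index 0 holds the most frequent value).
mode_series = bmi.mode()

# Passing the Series: fillna matches on index labels, and the mode
# Series only has label 0, so the gaps at 1 and 3 stay empty.
filled_by_series = bmi.fillna(mode_series)

# Passing the scalar: every missing entry receives the mode.
filled_by_scalar = bmi.fillna(mode_series.iloc[0])

print(filled_by_series.isna().sum())
print(filled_by_scalar.isna().sum())
```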

<h3>Replacing null values with the mean value</h3>

In [93]:
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())  # fill only the bmi column, not every column of df

In [94]:
np.mean(df.bmi)

np.float64(28.893236911794663)
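Mean imputation should normally be scoped to the column the mean was computed from. The following sketch, on a made-up two-column frame, compares filling the whole frame with a scalar against filling a single column through a dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    "avg_glucose_level": [228.69, np.nan, 105.92],
    "bmi": [36.6, np.nan, 32.5],
})

bmi_mean = toy["bmi"].mean()  # mean() ignores NaN entries

# Scalar fill on the whole frame: the glucose gap also receives the bmi mean.
whole_frame = toy.fillna(bmi_mean)

# Dictionary fill: only the bmi column is imputed.
column_only = toy.fillna({"bmi": bmi_mean})

print(whole_frame)
print(column_only)
```

With the dictionary form, each column can be given its own fill value in a single call.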