<a href="https://colab.research.google.com/github/MasterSlyer10/CSMODEL/blob/main/MCO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Dataset Description



The dataset we are using is called "Stroke Prediction". It is used to predict a patient's likelihood of experiencing a stroke based on parameters such as gender, age, existing diseases, and smoking status.

## Data Collection
The data in this dataset was obtained from a confidential source. Because of this, details about the specific source are not disclosed, both to maintain privacy and to adhere to the terms of use. The data is intended for educational purposes only, and any use in research must properly credit the author, as specified by the source.

The dataset is from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data   

By the author: Federico Soriano Palacios  
LinkedIn: https://www.linkedin.com/in/federico-soriano-palacios/  
Kaggle: https://www.kaggle.com/fedesoriano  
Github: https://github.com/fedesoriano

In [10]:
stroke_df = pd.read_csv('datasets/healthcare-dataset-stroke-data.csv')

In [11]:
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


Each row in the dataset represents a patient, and each column is a characteristic of that patient. The last column, "stroke", indicates whether or not the patient had a stroke.

There are a total of 5110 observations and 12 variables in the dataset.

## Variable Description

The following variable descriptions are taken directly from the dataset's source webpage.

1.) id: Unique identifier  
2.) gender: "Male", "Female", or "Other" to specify the individual's gender  
3.) age: Age of the patient  
4.) hypertension: 1 if the patient has hypertension, 0 if the patient has no hypertension  
5.) heart_disease: 1 if the patient has heart disease, 0 if the patient has no heart disease  
6.) ever_married: "No" or "Yes", indicating whether the patient has ever been married  
7.) work_type: "children", "Govt_job", "Never_worked", "Private", or "Self-employed"  
8.) Residence_type: "Rural" or "Urban"  
9.) avg_glucose_level: average glucose level in the blood of the patient  
10.) bmi: body mass index of the patient  
11.) smoking_status: "formerly smoked", "never smoked", "smokes", or "Unknown" meaning information of the patient was not available  
12.) stroke: 1 if the patient had a stroke, 0 if the patient didn't have a stroke
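
One way to check the categorical values listed above against the data is to print the distinct values of each text column. The sketch below is only an illustration: `sample_df` is a small hypothetical frame whose values mirror the documented categories, not the real dataset, and `categorical_cols` is a hand-picked subset of column names. With the real data, the same loop could be run on `stroke_df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values mirror the documented categories.
sample_df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other"],
    "work_type": ["Private", "Govt_job", "children", "Self-employed"],
    "smoking_status": ["smokes", "Unknown", "never smoked", "formerly smoked"],
    "stroke": [0, 1, 0, 0],
})

# Hand-picked categorical columns (an assumption for this sketch).
categorical_cols = ["gender", "work_type", "smoking_status"]

# Collect the sorted distinct values of each categorical column.
distinct_values = {
    col: sorted(sample_df[col].unique().tolist()) for col in categorical_cols
}

for col, values in distinct_values.items():
    print(f"{col}: {values}")
```

A check like this makes spelling mismatches between the documentation and the data (for example in the `work_type` labels) easy to spot.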


# Data Cleaning

In [12]:
stroke_df.isnull().any()

id                   False
gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
Residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool

In [13]:
stroke_df['bmi'].isnull().sum()

201

In [14]:
stroke_df.shape

(5110, 12)

In [15]:
# Drop the rows where bmi is missing; passing a list to `subset` works across pandas versions
stroke_df = stroke_df.dropna(subset=['bmi'])

In [16]:
stroke_df.shape

(4909, 12)
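
The cells above remove the 201 rows with a missing `bmi`, which is about 3.9% of the 5110 observations (201 / 5110). The following is a minimal, self-contained sketch of the same count-then-drop pattern. `toy_df` and its values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from the dataset.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "bmi": [34.7, np.nan, 16.8, np.nan, 21.0],
    "stroke": [0, 1, 0, 0, 0],
})

# Count missing values per column and the share of rows missing bmi.
missing_per_column = toy_df.isnull().sum()
missing_share = toy_df["bmi"].isnull().mean()

# Keep only the rows where bmi is present.
cleaned_df = toy_df.dropna(subset=["bmi"])

print(missing_per_column)
print(f"Share of rows missing bmi: {missing_share:.1%}")
print(toy_df.shape, "->", cleaned_df.shape)
```

Printing the shape before and after the drop, as the notebook does, confirms that exactly the rows with a missing `bmi` were removed.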

In [17]:
stroke_df.tail(20)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5087,26214,Female,63.0,0,0,Yes,Self-employed,Rural,75.93,34.7,formerly smoked,0
5088,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,Unknown,0
5089,56714,Female,0.72,0,0,No,children,Rural,62.13,16.8,Unknown,0
5090,4211,Male,26.0,0,0,No,Govt_job,Rural,100.85,21.0,smokes,0
5091,6369,Male,59.0,1,0,Yes,Private,Rural,95.05,30.9,never smoked,0
5092,56799,Male,76.0,0,0,Yes,Govt_job,Urban,82.35,38.9,never smoked,0
5094,28048,Male,13.0,0,0,No,children,Urban,82.38,24.3,Unknown,0
5095,68598,Male,1.08,0,0,No,children,Rural,79.15,17.4,Unknown,0
5096,41512,Male,57.0,0,0,Yes,Govt_job,Rural,76.62,28.2,never smoked,0
5097,64520,Male,68.0,0,0,Yes,Self-employed,Urban,91.68,40.8,Unknown,0


In [18]:
stroke_df['age'] = stroke_df['age'].round()

In [19]:
stroke_df['age'] = stroke_df['age'].astype(int)

In [20]:
display(stroke_df.dtypes)

id                     int64
gender                object
age                    int32
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

In [21]:
stroke_df.tail(20)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5087,26214,Female,63,0,0,Yes,Self-employed,Rural,75.93,34.7,formerly smoked,0
5088,22190,Female,64,1,0,Yes,Self-employed,Urban,76.89,30.2,Unknown,0
5089,56714,Female,1,0,0,No,children,Rural,62.13,16.8,Unknown,0
5090,4211,Male,26,0,0,No,Govt_job,Rural,100.85,21.0,smokes,0
5091,6369,Male,59,1,0,Yes,Private,Rural,95.05,30.9,never smoked,0
5092,56799,Male,76,0,0,Yes,Govt_job,Urban,82.35,38.9,never smoked,0
5094,28048,Male,13,0,0,No,children,Urban,82.38,24.3,Unknown,0
5095,68598,Male,1,0,0,No,children,Rural,79.15,17.4,Unknown,0
5096,41512,Male,57,0,0,Yes,Govt_job,Rural,76.62,28.2,never smoked,0
5097,64520,Male,68,0,0,Yes,Self-employed,Urban,91.68,40.8,Unknown,0
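
The two tables above show the effect of the conversion: the fractional ages 0.72 and 1.08 (rows 5089 and 5095) both become 1. The sketch below illustrates why the notebook rounds before casting. The `ages` series is a small example: 0.72, 1.08 and 63.0 appear in the output above, while 0.32 is a hypothetical value added to show that ages below 0.5 would round down to 0.

```python
import pandas as pd

# 0.72, 1.08 and 63.0 appear in the tail output; 0.32 is a hypothetical value.
ages = pd.Series([0.72, 1.08, 0.32, 63.0], name="age")

# Round to the nearest whole year first, then cast to integer.
rounded = ages.round().astype(int)

# Casting directly would truncate toward zero instead of rounding.
truncated = ages.astype(int)

print(rounded.tolist())    # [1, 1, 0, 63]
print(truncated.tolist())  # [0, 1, 0, 63]
```

Rounding first keeps an infant aged 0.72 at 1 rather than 0, which is the behavior seen in the cleaned table.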


# Exploratory Data Analysis

# Research Question