In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# FIRST DATASET

We will use the pre-approved dataset called 'Stroke Prediction Dataset'.

1. Source of data :
The data comes from Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

2. Brief description of data :
This is a healthcare dataset used to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, existing conditions (hypertension, heart disease), and smoking status. Each row provides the relevant information about one patient.

3. What is the target :
The target is the `stroke` column, which has 2 possible classes: 1 if the patient had a stroke, 0 if not.

4. What does one row represent :
Each row represents a specific patient.

5. Is this a classification or regression problem :
This is a binary classification problem.

6. How many features does the data have :
The data contains 12 columns: the `id` identifier, the `stroke` target, and 10 candidate features.

7. How many rows are in the dataset :
The data contains 5110 observations (rows).

8. What challenges do you foresee in cleaning, exploring, or modeling this dataset :
- In cleaning :
Check for inaccurate values, missing data, outliers, and duplicate rows. Here, `bmi` is the only column with missing values (201 of 5110 rows).
Inconsistent or incorrect feature values also need fixing, because in a healthcare setting they could lead to bad decisions that affect a patient's life.
- In Exploring :
Identify relationships between the variables and the distribution of each variable to reveal patterns and points of interest, and so gain more insight into the raw data.
- In Modeling :
Choose a model that copes with the class imbalance (only about 4.9% of patients had a stroke), regularize it by tuning hyperparameters, and compare model performance with metrics suited to imbalanced data rather than accuracy alone.
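
The cleaning checks listed above can be sketched as a small audit helper. This is a minimal sketch, not the notebook's own method: `audit_frame` is a hypothetical helper name, and the toy rows are invented for illustration (only the column names follow the dataset).

```python
import pandas as pd

def audit_frame(df: pd.DataFrame, id_col: str) -> dict:
    """Summarise common data-quality problems in a DataFrame."""
    return {
        # Number of NaN values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that repeat an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # Identifiers that appear more than once
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }

# Hypothetical toy rows (not taken from the real file)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [67.0, 61.0, 61.0, 80.0],
    "bmi": [36.6, 28.1, 28.1, None],
})

report = audit_frame(toy, id_col="id")
print(report)
```

On the real data the same call would be `audit_frame(df, id_col="id")` once the CSV has been loaded.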

In [2]:
# Load & read data from folder structure
fpath = '/content/healthcare-dataset-stroke-data.csv'
df = pd.read_csv(fpath)
# Explore data : missing values, columns, types
df.info()
# Top rows of the dataframe
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
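
The `df.info()` output shows 4909 non-null `bmi` values out of 5110, so 201 rows lack a BMI, including row 1 of the preview. One possible way to handle this, shown here only as a sketch, is median imputation with an indicator column. The `bmi_missing` column name is an assumption made for this example, and the values are the five `bmi` entries from the preview.

```python
import numpy as np
import pandas as pd

# bmi values of the five rows printed by df.head(); row 1 is missing
head = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Keep a flag so a model can still see which values were imputed
# ("bmi_missing" is a hypothetical column name)
head["bmi_missing"] = head["bmi"].isna().astype(int)

# Series.median() skips NaN by default
bmi_median = head["bmi"].median()
head["bmi"] = head["bmi"].fillna(bmi_median)

print(head)
```

When a train/test split is used, the median should be computed on the training rows only and then applied to the test rows, so that no information leaks from the test set.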


In [3]:
# Check target balance
df['stroke'].value_counts()

0    4861
1     249
Name: stroke, dtype: int64

In [4]:
# Check target balance as proportions: the dataset is highly imbalanced
df['stroke'].value_counts(normalize = True)

0    0.951272
1    0.048728
Name: stroke, dtype: float64
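
To make the imbalance concrete, the counts above can be turned into per-class weights. The sketch below applies the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn documents for `class_weight='balanced'`. It is one option among several (resampling is another) and not something the notebook itself applies.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 4861, 1: 249}, name="stroke")

n_samples = counts.sum()
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weight = (n_samples / (n_classes * counts)).to_dict()

# How many negative cases there are for each positive case
imbalance_ratio = counts.max() / counts.min()

print(class_weight)
print(round(imbalance_ratio, 1))
```

The stroke class receives a weight of roughly 10.3 against roughly 0.53 for the majority class, reflecting about 19.5 non-stroke patients per stroke patient.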

# SECOND DATASET

We will use the pre-approved dataset called 'Metabolic Syndrome Prediction'.

1. Source of data :
The data comes from data.world: https://data.world/informatics-edu/metabolic-syndrome-prediction

2. Brief description of data :
This is a healthcare dataset used to predict whether a patient has metabolic syndrome, based on input parameters such as age, sex, race, and blood measurements like triglycerides and blood glucose. Each row provides the relevant information about one patient.

3. What is the target :
The target is the `MetabolicSyndrome` column, which has 2 possible classes: `MetSyn` if the patient has metabolic syndrome, `No MetSyn` if not.

4. What does one row represent :
Each row represents a specific patient.

5. Is this a classification or regression problem :
This is a binary classification problem.

6. How many features does the data have :
The data contains 15 columns: the `seqn` identifier, the `MetabolicSyndrome` target, and 13 candidate features.

7. How many rows are in the dataset :
The data contains 2401 observations (rows).

8. What challenges do you foresee in cleaning, exploring, or modeling this dataset :
- In cleaning :
Check for inaccurate values, missing data, outliers, and duplicate rows. Here, four columns have missing values: `Marital` (208 rows), `Income` (117), `WaistCirc` (85), and `BMI` (26).
Inconsistent or incorrect feature values also need fixing, because in a healthcare setting they could lead to bad decisions that affect a patient's life.
- In Exploring :
Identify relationships between the variables and the distribution of each variable to reveal patterns and points of interest, and so gain more insight into the raw data.
- In Modeling :
Choose a model that copes with the class imbalance (about 34% of patients have metabolic syndrome), regularize it by tuning hyperparameters, and compare model performance.
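
One way to keep the class proportions stable when holding out data for model comparison is a stratified split. The sketch below uses only pandas (`DataFrameGroupBy.sample`, available from pandas 1.1). `stratified_holdout` is a hypothetical helper, and the toy frame is invented for illustration.

```python
import pandas as pd

def stratified_holdout(df, target, frac=0.2, seed=42):
    """Split df into (train, test), sampling `frac` of each class for test."""
    # Sample the same fraction inside every class of the target column
    test = df.groupby(target).sample(frac=frac, random_state=seed)
    # Everything not drawn for the test set stays in the training set
    train = df.drop(index=test.index)
    return train, test

# Hypothetical toy frame: 20 "No MetSyn" rows and 10 "MetSyn" rows
toy = pd.DataFrame({
    "Age": range(30),
    "MetabolicSyndrome": ["No MetSyn"] * 20 + ["MetSyn"] * 10,
})

train, test = stratified_holdout(toy, "MetabolicSyndrome", frac=0.2)
print(test["MetabolicSyndrome"].value_counts())
```

Because each class is sampled separately, the 2:1 ratio of the toy frame is preserved in both parts of the split.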

In [5]:
# Load & read data from folder structure
fpath = '/content/Metabolic Syndrome.csv'
df = pd.read_csv(fpath)
# Explore data : missing values, columns, types
df.info()
# Top rows of the dataframe
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   object 
dtypes: float64(5), int64(6), object(4)
memory usage: 281.5+ KB


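| | seqn | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62161 | 22 | Male | Single | 8200.0 | White | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | No MetSyn |
| 1 | 62164 | 44 | Female | Married | 4500.0 | White | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | No MetSyn |
| 2 | 62169 | 21 | Male | Single | 800.0 | Asian | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | No MetSyn |
| 3 | 62172 | 43 | Female | Single | 2000.0 | Black | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | No MetSyn |
| 4 | 62177 | 51 | Male | Married | NaN | Asian | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | No MetSyn |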
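
The non-null counts in the `df.info()` output above can be converted into missing counts and percentages. The sketch below does that arithmetic from the printed numbers alone, so it does not need the CSV file. On the loaded frame, `df.isna().sum()` would give the same counts directly.

```python
import pandas as pd

n_rows = 2401

# Non-null counts copied from the df.info() output for the columns with gaps
non_null = pd.Series(
    {"Marital": 2193, "Income": 2284, "WaistCirc": 2316, "BMI": 2375}
)

missing = n_rows - non_null
missing_pct = (100 * missing / n_rows).round(2)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

`Marital` has the largest gap, at under 9% of rows, so no column is missing more than a small share of its values.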


In [7]:
# Check target balance
df['MetabolicSyndrome'].value_counts()

No MetSyn    1579
MetSyn        822
Name: MetabolicSyndrome, dtype: int64

In [8]:
# Check target balance as proportions: the dataset is moderately imbalanced
df['MetabolicSyndrome'].value_counts(normalize = True)

No MetSyn    0.657643
MetSyn       0.342357
Name: MetabolicSyndrome, dtype: float64
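
Since the target is stored as the strings `No MetSyn` and `MetSyn`, most modeling code will need it as 0/1. A minimal sketch of that encoding follows. The `label_map` name and the three-row stand-in Series are assumptions for illustration, and treating `MetSyn` as the positive class (1) is a choice made here, not one stated in the notebook.

```python
import pandas as pd

# Label strings exactly as printed by value_counts()
label_map = {"No MetSyn": 0, "MetSyn": 1}

# Small stand-in Series; on the real frame this would be df["MetabolicSyndrome"]
target = pd.Series(["No MetSyn", "MetSyn", "No MetSyn"], name="MetabolicSyndrome")

encoded = target.map(label_map)

# map() yields NaN for any label missing from label_map, so check explicitly
if encoded.isna().any():
    raise ValueError("unexpected label in target column")

encoded = encoded.astype(int)
print(encoded.tolist())
```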