# Metabolic Syndrome Prediction | 1. Dataset Exploration

-> Load the dataset

-> Explore and confirm features and label(s) of this dataset

-> Explore size/shape of dataset

-> Investigate the data type of each feature and label, and choose a better data type for a column where possible

-> Calculate the memory usage differences

-> Explore the statistical facts like mean, median, x percentiles of the columns

## 1. Load the dataset

In [1]:
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

df = pd.read_csv("Metabolic Syndrome.csv") # read the CSV file with pandas; polars can do the same (pl.read_csv, or pl.scan_csv for lazy reading)

---

## 2. Explore and confirm features and label(s) of this dataset

In [2]:
df # displaying the dataset

,seqn,Age,Sex,Marital,Income,Race,WaistCirc,BMI,Albuminuria,UrAlbCr,UricAcid,BloodGlucose,HDL,Triglycerides,MetabolicSyndrome
0,62161,22,Male,Single,8200.0,White,81.0,23.3,0,3.88,4.9,92,41,84,0
1,62164,44,Female,Married,4500.0,White,80.1,23.2,0,8.55,4.5,82,28,56,0
2,62169,21,Male,Single,800.0,Asian,69.6,20.1,0,5.07,5.4,107,43,78,0
3,62172,43,Female,Single,2000.0,Black,120.4,33.3,0,5.22,5.0,104,73,141,0
4,62177,51,Male,Married,,Asian,81.1,20.1,0,8.13,5.0,95,43,126,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2396,71901,48,Female,Married,1000.0,Other,,59.7,0,22.11,5.8,152,57,107,0
2397,71904,30,Female,Single,2000.0,Asian,,18.0,0,2.90,7.9,91,90,91,0
2398,71909,28,Male,Single,800.0,MexAmerican,100.8,29.4,0,2.78,6.2,99,47,84,0
2399,71911,27,Male,Married,8200.0,MexAmerican,106.6,31.3,0,4.15,6.2,100,41,124,1


In [3]:
df.columns # displaying column names/features

Index(['seqn', 'Age', 'Sex', 'Marital', 'Income', 'Race', 'WaistCirc', 'BMI',
       'Albuminuria', 'UrAlbCr', 'UricAcid', 'BloodGlucose', 'HDL',
       'Triglycerides', 'MetabolicSyndrome'],
      dtype='object')

In [4]:
df.head() # displaying first 5 rows

,seqn,Age,Sex,Marital,Income,Race,WaistCirc,BMI,Albuminuria,UrAlbCr,UricAcid,BloodGlucose,HDL,Triglycerides,MetabolicSyndrome
0,62161,22,Male,Single,8200.0,White,81.0,23.3,0,3.88,4.9,92,41,84,0
1,62164,44,Female,Married,4500.0,White,80.1,23.2,0,8.55,4.5,82,28,56,0
2,62169,21,Male,Single,800.0,Asian,69.6,20.1,0,5.07,5.4,107,43,78,0
3,62172,43,Female,Single,2000.0,Black,120.4,33.3,0,5.22,5.0,104,73,141,0
4,62177,51,Male,Married,,Asian,81.1,20.1,0,8.13,5.0,95,43,126,0


In [5]:
df.tail() # displaying last 5 rows

,seqn,Age,Sex,Marital,Income,Race,WaistCirc,BMI,Albuminuria,UrAlbCr,UricAcid,BloodGlucose,HDL,Triglycerides,MetabolicSyndrome
2396,71901,48,Female,Married,1000.0,Other,,59.7,0,22.11,5.8,152,57,107,0
2397,71904,30,Female,Single,2000.0,Asian,,18.0,0,2.9,7.9,91,90,91,0
2398,71909,28,Male,Single,800.0,MexAmerican,100.8,29.4,0,2.78,6.2,99,47,84,0
2399,71911,27,Male,Married,8200.0,MexAmerican,106.6,31.3,0,4.15,6.2,100,41,124,1
2400,71915,60,Male,Single,6200.0,White,106.6,27.5,0,12.82,5.2,91,36,226,1


---

## 3. Explore size/shape of dataset

In [6]:
df.shape # displaying shape of the dataset

(2401, 15)

---

## 4. Investigate the data type of features and labels and choose a better data type for a column where possible

In [7]:
df.info() # displays information about the dataset, with respective datatypes for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   int64  
dtypes: float64(5), int64(7), object(3)
memory usage: 281.5+ KB


The list above shows three data types: float, int, and object. Most ML models require numerical input, so the object columns must be converted to float/int before they can move further along the ML pipeline.

Columns that need this change, with candidate encoding techniques:

-> Sex : binary encoding : one-hot or label encoder

-> Marital : multi-value encoding : label or ordinal encoder

-> Race : multi-value encoding : label encoder

In [8]:
set(df['Sex']) # unique values to be encoded in Sex

{'Female', 'Male'}

In [9]:
set(df['Marital']) # unique values to be encoded in Marital status

{'Divorced', 'Married', 'Separated', 'Single', 'Widowed', nan}

In [10]:
set(df['Race']) # unique values to be encoded in Race

{'Asian', 'Black', 'Hispanic', 'MexAmerican', 'Other', 'White'}
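
The encoding step described above could be sketched as follows. This is a minimal illustration on a toy sample, not the notebook's actual pipeline: the `sample` frame is made up from the category values listed above, and one-hot encoding via `pd.get_dummies` is swapped in for `Marital` and `Race` as one reasonable option, on the assumption that these categories have no inherent order.

```python
import pandas as pd

# Toy sample (hypothetical rows) using the category values seen in the dataset.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female"],
    "Marital": ["Single", "Married", None, "Widowed"],
    "Race": ["White", "Asian", "Black", "MexAmerican"],
})

# Binary column: map the two categories to 0/1.
sample["Sex_encoded"] = sample["Sex"].map({"Female": 0, "Male": 1})

# Multi-valued columns: one indicator column per category.
# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(
    sample, columns=["Marital", "Race"], dummy_na=True, dtype="int8"
)

print(encoded.columns.tolist())
```

A label or ordinal encoder, as listed above, would produce a single integer column instead; that is more compact but implies an ordering between categories that some models may pick up on.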

---

## 5. Calculate the memory usage differences

### pandas

In [11]:
print("Pandas Memory Usage:", df.memory_usage().sum(), 'Bytes')

Pandas Memory Usage: 288248 Bytes


In [12]:
print("Pandas Deep Memory Usage:", df.memory_usage(deep=True).sum(), 'Bytes')

Pandas Deep Memory Usage: 677600 Bytes
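
The gap between the two figures comes from the text columns: the plain call counts only the fixed-size storage per row, while `deep=True` also counts the Python strings themselves. The sketch below shows, on a made-up toy frame, how choosing tighter data types might shrink the deep figure. The column contents are assumptions for illustration, and the savings on the real dataset will differ.

```python
import pandas as pd

# Toy frame (hypothetical): repeated strings and small integers stored as int64.
toy = pd.DataFrame({
    "Sex": ["Male", "Female"] * 500,
    "Age": [22, 44] * 500,
})

before = toy.memory_usage(deep=True).sum()

optimized = toy.copy()
# category stores small integer codes plus one copy of each distinct string.
optimized["Sex"] = optimized["Sex"].astype("category")
# downcast="integer" picks the smallest integer type that can hold the values.
optimized["Age"] = pd.to_numeric(optimized["Age"], downcast="integer")

after = optimized.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```

On the real dataset the same idea would apply to `Sex`, `Marital`, and `Race` (as `category`) and to the integer columns with small ranges.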


### polars

In [13]:
import polars as pl

df1 = pl.read_csv('Metabolic Syndrome.csv')

print("Polars Memory Usage:", df1.estimated_size('kb') , 'KB')

del df1 # deleting polars implementation since pandas implementation is not memory-intensive

Polars Memory Usage: 266.537109375 KB


## Memory usage for dataset

### pandas

deep memory usage (`deep=True`): 661.7 KB

shallow memory usage (default): 281.5 KB

### polars

estimated size (eager `read_csv`): 266.5 KB
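
The KB figures in this summary follow from the byte counts printed in cells 11 and 12, taking 1 KB as 1024 bytes (an assumption that matches the `281.5+ KB` reported by `df.info()`). A small worked conversion:

```python
BYTES_PER_KB = 1024

# Byte counts copied from the pandas outputs above.
shallow_bytes = 288248   # df.memory_usage().sum()
deep_bytes = 677600      # df.memory_usage(deep=True).sum()

shallow_kb = shallow_bytes / BYTES_PER_KB
deep_kb = deep_bytes / BYTES_PER_KB

print(f"shallow: {shallow_kb:.1f} KB")
print(f"deep: {deep_kb:.1f} KB")
print(f"extra counted by deep=True: {deep_kb - shallow_kb:.1f} KB")
```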

---

## 6. Explore the statistical facts like mean, median, x percentiles of the columns

In [14]:
df.describe(percentiles=[0.1,0.25,0.5,0.75,0.99]) # describes numerical interpretations of the dataset in terms of mean, max, min, quartiles, etc.

# percentiles considered: 10, 25, 50, 75, 99

,seqn,Age,Income,WaistCirc,BMI,Albuminuria,UrAlbCr,UricAcid,BloodGlucose,HDL,Triglycerides,MetabolicSyndrome
count,2401.0,2401.0,2284.0,2316.0,2375.0,2401.0,2401.0,2401.0,2401.0,2401.0,2401.0,2401.0
mean,67030.674302,48.691795,4005.25394,98.307254,28.702189,0.154102,43.626131,5.489046,108.247813,53.369429,128.125364,0.342357
std,2823.565114,17.632852,2954.032186,16.252634,6.662242,0.42278,258.272829,1.439358,34.820657,15.185537,95.322477,0.474597
min,62161.0,20.0,300.0,56.2,13.4,0.0,1.4,1.8,39.0,14.0,26.0,0.0
10%,63133.0,25.0,1000.0,78.6,21.4,0.0,3.22,3.7,86.0,36.0,56.0,0.0
25%,64591.0,34.0,1600.0,86.675,24.0,0.0,4.45,4.5,92.0,43.0,75.0,0.0
50%,67059.0,48.0,2500.0,97.0,27.7,0.0,7.07,5.4,99.0,51.0,103.0,0.0
75%,69495.0,63.0,6200.0,107.625,32.1,0.0,13.69,6.4,110.0,62.0,150.0,1.0
99%,71830.0,80.0,9000.0,145.165,49.2,2.0,806.58,9.4,278.0,97.0,511.0,1.0
max,71915.0,80.0,9000.0,176.0,68.7,2.0,5928.0,11.3,382.0,156.0,1562.0,1.0


Note: by default, `describe()` summarizes only the numerical columns.
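
The text columns can be summarized separately. The sketch below uses a toy sample (hypothetical rows, not the real data) and relies on the fact that `describe()` reports count, unique, top, and freq when the frame holds no numerical columns.

```python
import pandas as pd

# Toy sample (hypothetical rows) mixing text and numerical columns.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
    "Race": ["White", "White", "Asian", "Black", "White"],
    "Age": [22, 44, 21, 43, 51],
})

# Drop the numerical columns, then describe what is left:
# count, number of unique values, most frequent value (top), and its frequency.
categorical_summary = sample.select_dtypes(exclude="number").describe()

print(categorical_summary)
```

For a single column, `sample["Race"].value_counts()` gives the full frequency table rather than only the most frequent value.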

---

## 7. Finding NULL/NaN count for imputation

In [15]:
df.isnull().sum() # counting null values and summing them for each column

seqn                   0
Age                    0
Sex                    0
Marital              208
Income               117
Race                   0
WaistCirc             85
BMI                   26
Albuminuria            0
UrAlbCr                0
UricAcid               0
BloodGlucose           0
HDL                    0
Triglycerides          0
MetabolicSyndrome      0
dtype: int64

In addition to encoding, the following columns need imputation because they hold null/NaN values, which most ML models cannot accept:

-> Marital

-> Income

-> WaistCirc

-> BMI

For these columns, we should check whether the missing values are independent of the other columns or depend on one or more of them:

for example, Income could depend on Sex and Marital status, and BMI could depend on Sex, so these dependencies should be kept in mind when filling in the missing values.
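
One way to respect such a dependency is group-wise imputation. The sketch below is an illustration under stated assumptions, not the method this notebook settles on: the toy rows are made up, the BMI-on-Sex dependency is the hypothetical one mentioned above, and median/mode are just common default choices.

```python
import pandas as pd

# Toy sample (hypothetical rows) with gaps in Marital and BMI.
sample = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Marital": ["Single", None, "Married", "Married", "Single", "Single"],
    "BMI": [23.0, None, 30.0, 32.0, 25.0, None],
})

imputed = sample.copy()

# Numerical column: fill each gap with the median BMI of the same Sex group.
group_median = imputed.groupby("Sex")["BMI"].transform("median")
imputed["BMI"] = imputed["BMI"].fillna(group_median)

# Categorical column: fill gaps with the most frequent value.
most_common = imputed["Marital"].mode()[0]
imputed["Marital"] = imputed["Marital"].fillna(most_common)

print(imputed)
```

Whether a grouping column actually helps should be checked on the real data first, for instance by comparing the per-group medians with the overall median.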