# Random Forest: Metabolic Syndrome

**Context**

This dataset contains information on individuals with metabolic syndrome, a complex medical condition associated with a cluster of risk factors for cardiovascular diseases and type 2 diabetes.


**Data Dictionary**

seqn: Sequential identification number.

Age: Age of the individual.

Sex: Gender of the individual (e.g., Male, Female).

Marital: Marital status of the individual.

Income: Income level or income-related information.

Race: Ethnic or racial background of the individual.

WaistCirc: Waist circumference measurement.

BMI: Body Mass Index, a measure of body composition.

Albuminuria: Measurement related to albumin in urine.

UrAlbCr: Urinary albumin-to-creatinine ratio.

UricAcid: Uric acid levels in the blood.

BloodGlucose: Blood glucose levels, an indicator of diabetes risk.

HDL: High-Density Lipoprotein cholesterol levels (the "good" cholesterol).

Triglycerides: Triglyceride levels in the blood.

MetabolicSyndrome: Binary variable indicating the presence (1) or absence (0) of metabolic syndrome.

In [19]:
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import KFold, cross_val_score

# for preprocessing

from sklearn import preprocessing

In [2]:
df=pd.read_csv('/content/Metabolic Syndrome.csv')

In [3]:
df.head()

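| | seqn | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62161 | 22 | Male | Single | 8200.0 | White | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | 0 |
| 1 | 62164 | 44 | Female | Married | 4500.0 | White | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | 0 |
| 2 | 62169 | 21 | Male | Single | 800.0 | Asian | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | 0 |
| 3 | 62172 | 43 | Female | Single | 2000.0 | Black | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | 0 |
| 4 | 62177 | 51 | Male | Married | NaN | Asian | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | 0 |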


In [4]:
df.shape

(2401, 15)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   int64  
dtypes: float64(5), int64(7), object(3)
memory usage: 281.5+ KB


In [6]:
df.dtypes

seqn                   int64
Age                    int64
Sex                   object
Marital               object
Income               float64
Race                  object
WaistCirc            float64
BMI                  float64
Albuminuria            int64
UrAlbCr              float64
UricAcid             float64
BloodGlucose           int64
HDL                    int64
Triglycerides          int64
MetabolicSyndrome      int64
dtype: object

In [8]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| seqn | 2401.0 | 67030.674302 | 2823.565114 | 62161.0 | 64591.0 | 67059.0 | 69495.0 | 71915.0 |
| Age | 2401.0 | 48.691795 | 17.632852 | 20.0 | 34.0 | 48.0 | 63.0 | 80.0 |
| Income | 2284.0 | 4005.25394 | 2954.032186 | 300.0 | 1600.0 | 2500.0 | 6200.0 | 9000.0 |
| WaistCirc | 2316.0 | 98.307254 | 16.252634 | 56.2 | 86.675 | 97.0 | 107.625 | 176.0 |
| BMI | 2375.0 | 28.702189 | 6.662242 | 13.4 | 24.0 | 27.7 | 32.1 | 68.7 |
| Albuminuria | 2401.0 | 0.154102 | 0.42278 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| UrAlbCr | 2401.0 | 43.626131 | 258.272829 | 1.4 | 4.45 | 7.07 | 13.69 | 5928.0 |
| UricAcid | 2401.0 | 5.489046 | 1.439358 | 1.8 | 4.5 | 5.4 | 6.4 | 11.3 |
| BloodGlucose | 2401.0 | 108.247813 | 34.820657 | 39.0 | 92.0 | 99.0 | 110.0 | 382.0 |
| HDL | 2401.0 | 53.369429 | 15.185537 | 14.0 | 43.0 | 51.0 | 62.0 | 156.0 |


In [9]:
df.isnull().sum()

seqn                   0
Age                    0
Sex                    0
Marital              208
Income               117
Race                   0
WaistCirc             85
BMI                   26
Albuminuria            0
UrAlbCr                0
UricAcid               0
BloodGlucose           0
HDL                    0
Triglycerides          0
MetabolicSyndrome      0
dtype: int64

# Imputing null values

In [10]:
a=df['Marital'].mode()[0]
df['Marital']=df['Marital'].fillna(value=a)

In [11]:
a=df['Income'].median()
df['Income']=df['Income'].fillna(value=a)

In [12]:
a=df['WaistCirc'].median()
df['WaistCirc']=df['WaistCirc'].fillna(value=a)

In [13]:
a=df['BMI'].median()
df['BMI']=df['BMI'].fillna(value=a)

In [14]:
df.isnull().sum()

seqn                 0
Age                  0
Sex                  0
Marital              0
Income               0
Race                 0
WaistCirc            0
BMI                  0
Albuminuria          0
UrAlbCr              0
UricAcid             0
BloodGlucose         0
HDL                  0
Triglycerides        0
MetabolicSyndrome    0
dtype: int64

After imputation, there are no null values left in the dataset.
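The four imputation cells above follow one pattern: the mode for the categorical column (`Marital`) and the median for the numeric ones. A minimal sketch of the same idea as a single loop is shown here on a small made-up frame; the `toy` frame and its values are assumptions for illustration, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Marital": ["Single", "Married", None, "Married"],
    "Income": [8200.0, np.nan, 800.0, 2000.0],
    "BMI": [23.3, 23.2, np.nan, 33.3],
})

for col in toy.columns:
    if not toy[col].isnull().any():
        continue  # nothing to impute in this column
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill = toy[col].median()   # numeric column: median
    else:
        fill = toy[col].mode()[0]  # categorical column: most frequent value
    toy[col] = toy[col].fillna(fill)

print(toy)
```

Choosing the fill value by dtype keeps the rule in one place, so a new column with missing values would not need its own cell.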

In [16]:
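# Select the rows flagged as duplicates and sum them column by column;
# all zeros means no duplicated rows were selected
df[df.duplicated()].sum()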

seqn                 0.0
Age                  0.0
Sex                  0.0
Marital              0.0
Income               0.0
Race                 0.0
WaistCirc            0.0
BMI                  0.0
Albuminuria          0.0
UrAlbCr              0.0
UricAcid             0.0
BloodGlucose         0.0
HDL                  0.0
Triglycerides        0.0
MetabolicSyndrome    0.0
dtype: float64

There are no duplicate rows in the dataset.
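A more direct way to count duplicates is to sum the boolean mask that `duplicated()` returns. The sketch below uses a made-up `toy` frame (an assumption for illustration) in which the third row repeats the first.

```python
import pandas as pd

# Toy frame: the last row is an exact copy of the first (values are made up).
toy = pd.DataFrame({"Age": [22, 44, 22], "BMI": [23.3, 23.2, 23.3]})

# duplicated() marks every row that repeats an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```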

In [17]:
df.drop('seqn',axis=1,inplace=True)

In [18]:
df.head()

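| | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | Male | Single | 8200.0 | White | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | 0 |
| 1 | 44 | Female | Married | 4500.0 | White | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | 0 |
| 2 | 21 | Male | Single | 800.0 | Asian | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | 0 |
| 3 | 43 | Female | Single | 2000.0 | Black | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | 0 |
| 4 | 51 | Male | Married | 2500.0 | Asian | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | 0 |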


In [20]:
label_encoder=preprocessing.LabelEncoder()
df['Sex']=label_encoder.fit_transform(df['Sex'])
df['Marital']=label_encoder.fit_transform(df['Marital'])
df['Race']=label_encoder.fit_transform(df['Race'])

In [21]:
df.head()

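| | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 3 | 8200.0 | 5 | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | 0 |
| 1 | 44 | 0 | 1 | 4500.0 | 5 | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | 0 |
| 2 | 21 | 1 | 3 | 800.0 | 0 | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | 0 |
| 3 | 43 | 0 | 3 | 2000.0 | 1 | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | 0 |
| 4 | 51 | 1 | 1 | 2500.0 | 0 | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | 0 |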
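`LabelEncoder` assigns integer codes in sorted order of the labels, which is why `Female` becomes 0 and `Male` becomes 1 above. Because the encoding cell refits one encoder for every column, only the last column's `classes_` are still available afterwards. A minimal sketch that keeps one encoder per column, so the code-to-label mapping can be looked up later, might look like this; the `toy` frame and the `encoders` and `mapping` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (values are made up).
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "Race": ["White", "Asian", "Black"],
})

encoders = {}
for col in ["Sex", "Race"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for later lookups

# Label -> integer code for each column
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)
```

The integer codes carry an arbitrary order. Tree-based models can usually work with that, but a model that reads the codes as magnitudes may be better served by one-hot encoding.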


In [27]:
df.shape

(2401, 14)

# Dividing the dataset into x and y

In [28]:
x=df.iloc[:,0:13]
y=df.iloc[:,13]

In [29]:
x.head()

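| | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 3 | 8200.0 | 5 | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 |
| 1 | 44 | 0 | 1 | 4500.0 | 5 | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 |
| 2 | 21 | 1 | 3 | 800.0 | 0 | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 |
| 3 | 43 | 0 | 3 | 2000.0 | 1 | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 |
| 4 | 51 | 1 | 1 | 2500.0 | 0 | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 |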
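Positional slicing with `iloc` depends on the column order staying fixed. One alternative, sketched here on a made-up `toy` frame (an assumption for illustration), is to split by the target column's name instead.

```python
import pandas as pd

# Toy frame with the target as the last column (values are made up).
toy = pd.DataFrame({
    "Age": [22, 44],
    "BMI": [23.3, 23.2],
    "MetabolicSyndrome": [0, 1],
})

target = "MetabolicSyndrome"
x_toy = toy.drop(columns=target)  # every column except the target
y_toy = toy[target]               # the target column alone

print(list(x_toy.columns), y_toy.tolist())
```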


# Evaluating a bagged decision tree with KFold cross-validation

In [35]:
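kfold = KFold(n_splits=10)       # 10 consecutive folds, no shuffling
cart = DecisionTreeClassifier()  # base learner for the bagging ensemble
num_trees = 100

# Build the bagging ensemble and score it with 10-fold cross-validation
model = BaggingClassifier(estimator=cart, n_estimators=num_trees)
results = cross_val_score(model, x, y, cv=kfold)

print(results.mean())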

0.8829598893499309


In [36]:
results

array([0.89626556, 0.87916667, 0.8625    , 0.90416667, 0.89166667,
       0.8625    , 0.87083333, 0.90416667, 0.87083333, 0.8875    ])
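`KFold` without `shuffle=True` builds its folds from consecutive rows, and neither the folds nor the trees above are seeded, so the scores can differ from run to run. A minimal sketch of a shuffled, stratified and seeded variant follows. It runs on synthetic data from `make_classification` rather than on the notebook's data, and the fold count, tree count and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified folds keep the class ratio similar in every fold;
# shuffle plus random_state makes the split reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=25,
    random_state=0,
)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv)

print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a sense of how much the accuracy varies across folds.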

# Random Forest Classifier

In [37]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

In [38]:
df.head()

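| | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1 | 3 | 8200.0 | 5 | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | 0 |
| 1 | 44 | 0 | 1 | 4500.0 | 5 | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | 0 |
| 2 | 21 | 1 | 3 | 800.0 | 0 | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | 0 |
| 3 | 43 | 0 | 3 | 2000.0 | 1 | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | 0 |
| 4 | 51 | 1 | 1 | 2500.0 | 0 | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | 0 |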


In [39]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

In [40]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((1680, 13), (721, 13), (1680,), (721,))
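The split above is random and unseeded, so the selected rows change on every run. A minimal sketch of a reproducible split that also preserves the class ratio is shown below on made-up arrays; `X_demo`, `y_demo` and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 70 negatives and 30 positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,
    stratify=y_demo,  # keep the 70/30 class ratio in both parts
    random_state=42,  # fixed seed for a repeatable split
)

print(X_tr.shape, X_te.shape, int(y_tr.sum()), int(y_te.sum()))
```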

In [51]:
sel = SelectFromModel(RandomForestClassifier(n_estimators = 200,max_features = 5))
sel.fit(x_train,y_train)

In [52]:
sel.get_support()

array([False, False, False, False, False,  True,  True, False, False,
       False,  True,  True,  True])

In [53]:
selected_feat = x_train.columns[sel.get_support()]
len(selected_feat)

5

In [54]:
selected_feat

Index(['WaistCirc', 'BMI', 'BloodGlucose', 'HDL', 'Triglycerides'], dtype='object')
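When no `threshold` is given and the wrapped estimator is a random forest, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below checks that rule and then fits a forest on the selected columns only, scoring it on held-out data. It uses synthetic data from `make_classification`, and all names, sizes and seeds in it are assumptions for illustration rather than the notebook's own results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

sel_demo = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel_demo.fit(X_tr, y_tr)

importances = sel_demo.estimator_.feature_importances_
mask = sel_demo.get_support()

# The default threshold is the mean importance of the fitted forest
print("threshold:", sel_demo.threshold_, "mean:", importances.mean())
print("selected feature indices:", np.flatnonzero(mask))

# Fit a fresh forest on the selected columns and score it on the test split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(sel_demo.transform(X_tr), y_tr)
acc = rf.score(sel_demo.transform(X_te), y_te)
print("test accuracy on selected features:", acc)
```

Fitting the selector on the training split only, as the notebook does, keeps the test rows out of the feature-selection step.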