# Read the Data and Pick 9 Features for Research

### Our main goal: Among the selected features, identify which are related to diabetes and how they are correlated with it.

### Features we picked:

#### 1. Age (AGE_P)

Type: Continuous variable

Description: Represents the respondent's age in years.

Example values: 18, 35, 60, etc.

Why: Diabetes risk increases with age, especially for Type 2 diabetes.


#### 2. Sex (SEX)

Type: Categorical variable

Possible values:

1: Male

2: Female

7: Refused

9: Don't know

Why: There may be gender differences in diabetes prevalence and risk factors.


#### 3. Body Mass Index (BMI)

Type: Continuous variable

Description: The body mass index (BMI) is calculated from height and weight.

Example values:

<18.5: Underweight

18.5–24.9: Normal weight

25–29.9: Overweight

30 or greater: Obesity

Why: Obesity is one of the strongest risk factors for Type 2 diabetes.


#### 4. Hypertension (HYPEV)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: High blood pressure is often associated with diabetes.


#### 5. Cholesterol Levels (CHLEV)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: High cholesterol is a common comorbidity with diabetes and can lead to related complications.




#### 6. Current Smoking Status (SMKNOW)

Type: Categorical variable

Possible values:

1: Every day

2: Some days

3: Not at all

7: Refused

9: Don’t know

Why: Smoking has been linked to an increased risk of diabetes and other chronic conditions.


#### 7. Family History of Diabetes (DIBREL)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: A family history of diabetes significantly increases an individual's risk.


#### 8. Gestational Diabetes (DIBGDM)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: Women who develop gestational diabetes during pregnancy have a higher risk of developing Type 2 diabetes later in life.


#### 9. Alcohol Consumption (ALC1YR)

Type: Categorical variable

Possible values:

1: Yes (consumed alcohol in the past year)

2: No (did not consume alcohol in the past year)

7: Refused (refused to answer the question)

9: Don't know (unsure if consumed alcohol in the past year)

Why: Excessive alcohol consumption can increase the risk of diabetes through its impact on weight and liver function.


### Target Variable: Diabetes Diagnosis (DIBEV1)

Type: Categorical variable

Possible values:

1: Yes (the respondent has been diagnosed with diabetes)

2: No (the respondent has not been diagnosed with diabetes)

3: Borderline (the respondent has been told they have borderline diabetes)

7: Refused (the respondent refused to answer)

9: Don't know (the respondent is unsure if they have been diagnosed with diabetes)
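The numeric codes above can be turned into readable labels when inspecting the data. The following is a minimal sketch: the mapping is copied from the list above, while the `codes` series is a hypothetical sample and not taken from the survey file.

```python
import pandas as pd

# Code-to-label mapping copied from the DIBEV1 description above.
DIBEV1_LABELS = {
    1: "Yes",
    2: "No",
    3: "Borderline",
    7: "Refused",
    9: "Don't know",
}

# Hypothetical sample of raw codes, for illustration only.
codes = pd.Series([2, 1, 3, 9, 2, 7], name="DIBEV1")

# Series.map looks each code up in the dictionary.
labels = codes.map(DIBEV1_LABELS)
print(labels.value_counts())
```

The same pattern would apply to the other categorical features, each with its own dictionary.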

In [24]:
import pandas as pd

In [25]:
df = pd.read_csv("samadult.csv")  # Do not reuse the name pd, which would shadow the pandas module
df.head()

Unnamed: 0,FPX,FMX,HHX,INTV_QRT,WTIA_SA,WTFA_SA,SEX,HISPAN_I,R_MARITL,MRACRPI2,...,BFWH_05,BFWH_06,BFWH_07,BFWH_08,BFWH_09,BFWH_10,BNRFALL,BINTHI,BINTTR,BINTRS
0,1,1,1,1,11241.0,26100,1,12,8,2,...,2.0,2.0,2.0,2.0,1.0,2.0,5.0,2.0,2.0,2.0
1,1,2,1,1,5620.5,11294,2,12,5,2,...,,,,,,,,,,
2,1,1,2,1,2919.3,2506,1,12,5,1,...,,,,,,,,,,
3,2,1,3,1,8883.8,9267,2,3,2,1,...,,,,,,,,,,
4,1,1,5,1,3300.8,3443,2,3,7,1,...,,,,,,,,,,


In [31]:
# List of columns to keep (9 key features + diabetes indicator column)
columns_to_keep = [
    'AGE_P',     # Age
    'SEX',       # Sex
    'BMI',       # Body Mass Index
    'HYPEV',     # Hypertension diagnosis
    'CHLEV',     # Cholesterol diagnosis
    'SMKNOW',    # Current smoking status
    'ALC1YR',    # Alcohol Consumption
    'DIBREL',    # Family history of diabetes
    'DIBGDM',    # Gestational diabetes history
    'DIBEV1'     # Diabetes diagnosis (Yes/No/Borderline)
]

In [32]:
# Create a new DataFrame with only the selected columns.
# .copy() makes it independent of df, which avoids SettingWithCopyWarning on later assignments.
samadult = df[columns_to_keep].copy()
samadult.head()

Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1,3336,2,2,,1,1,,2
1,24,2,2019,2,2,,1,2,,2
2,76,1,2727,1,1,,2,2,,2
3,36,2,3862,2,2,,1,1,2.0,2
4,35,2,3995,2,2,1.0,1,1,2.0,2


In [33]:
# BMI is stored without a decimal point (e.g. 3336 means 33.36), so divide by 100
samadult['BMI'] = samadult['BMI'] / 100
samadult.head()


Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1,33.36,2,2,,1,1,,2
1,24,2,20.19,2,2,,1,2,,2
2,76,1,27.27,1,1,,2,2,,2
3,36,2,38.62,2,2,,1,1,2.0,2
4,35,2,39.95,2,2,1.0,1,1,2.0,2


# Preprocessing the Dataset

## Reduce Outliers

In [34]:
# BMI contains suspected outliers (such as BMI = 99.99).
# Convert BMI into categories to reduce their impact.

def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif 18.5 <= bmi < 25:
        return 'Normal weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    else:
        return 'Obesity'

samadult['BMI'] = samadult['BMI'].apply(categorize_bmi)
samadult


Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1,Obesity,2,2,,1,1,,2
1,24,2,Normal weight,2,2,,1,2,,2
2,76,1,Overweight,1,1,,2,2,,2
3,36,2,Obesity,2,2,,1,1,2.0,2
4,35,2,Obesity,2,2,1.0,1,1,2.0,2
...,...,...,...,...,...,...,...,...,...,...
33023,56,1,Normal weight,2,2,3.0,1,1,,2
33024,58,1,Obesity,1,1,3.0,1,1,,2
33025,71,2,Overweight,1,1,3.0,2,1,2.0,1
33026,64,1,Overweight,1,1,3.0,2,2,,2


### Can converting BMI number into categories reduce the impact of outliers?

Yes, for the following reasons:

### Outlier Mitigation:

Converting to categories maps a potentially extreme or erroneous value (such as 99.99, which is implausibly high for BMI) into a broad category (e.g., "Obesity"). The analysis then works with predefined ranges instead of exact values, so extreme values have less influence.

### Loss of Extremes:

For example, a BMI of 99.99 falls into the "Obesity" category after conversion. The conversion disregards how extreme the value is and treats it like any other BMI of 30 or greater, which reduces its impact. One caveat: if 99.99 is a placeholder code for an unknown BMI rather than a real measurement, binning labels those respondents as obese, and it would be safer to treat the value as missing.

### Interpretability:

Categories are easier to interpret when the interest is in general patterns (e.g., whether obesity correlates with diabetes). Categorizing BMI shifts the focus to broader trends rather than specific extreme values.
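As a minimal sketch of the binning step, `pd.cut` can produce the same four categories as `categorize_bmi` in a vectorized way. The BMI values below are hypothetical, and treating 99.99 as a code for "unknown" is an assumption made only for this illustration; it should be checked against the survey codebook before being applied.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values; 99.99 stands in for a suspected placeholder code.
bmi = pd.Series([17.2, 18.5, 24.99, 25.0, 33.36, 99.99])

# Assumption: 99.99 means "unknown", so mark it as missing before binning.
bmi_clean = bmi.where(bmi < 99.99, np.nan)

# right=False gives half-open bins [low, high), matching categorize_bmi.
bmi_cat = pd.cut(
    bmi_clean,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal weight", "Overweight", "Obesity"],
    right=False,
)
print(bmi_cat.tolist())
```

With this variant, a suspected placeholder ends up as a missing value that the later imputation step can handle, rather than being counted as "Obesity".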

## Handle Missing Values

In [36]:
import numpy as np

In [39]:
# Besides NaN, some respondents answered 9 ("Don't know").
# Treat code 9 as missing (NaN) in the categorical columns.

# List of columns where the special missing values occur
categorical_columns = ['SEX', 'HYPEV', 'CHLEV', 'SMKNOW', 'DIBREL', 'DIBGDM', 'ALC1YR', 'DIBEV1']

samadult[categorical_columns] = samadult[categorical_columns].replace([9], np.nan)
samadult.head(10)


Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1.0,Obesity,2.0,2.0,,1.0,1.0,,2.0
1,24,2.0,Normal weight,2.0,2.0,,1.0,2.0,,2.0
2,76,1.0,Overweight,1.0,1.0,,2.0,2.0,,2.0
3,36,2.0,Obesity,2.0,2.0,,1.0,1.0,2.0,2.0
4,35,2.0,Obesity,2.0,2.0,1.0,1.0,1.0,2.0,2.0
5,20,2.0,Normal weight,2.0,2.0,,2.0,1.0,,2.0
6,19,2.0,Normal weight,2.0,2.0,,2.0,,,2.0
7,45,2.0,Overweight,2.0,1.0,1.0,1.0,1.0,,2.0
8,18,2.0,Normal weight,2.0,2.0,,1.0,2.0,,2.0
9,20,2.0,Normal weight,2.0,2.0,,2.0,2.0,,2.0


In [48]:
# Next, check how many respondents refused to answer (code 7)
for col in categorical_columns:
    count_refused = samadult[col].value_counts().get(7, 0)  # Get the count of 7 (or 0 if not present)
    print(f"Column {col} has {count_refused} instances of value 7.")

Column SEX has 0 instances of value 7.
Column HYPEV has 21 instances of value 7.
Column CHLEV has 24 instances of value 7.
Column SMKNOW has 3 instances of value 7.
Column DIBREL has 27 instances of value 7.
Column DIBGDM has 5 instances of value 7.
Column ALC1YR has 47 instances of value 7.
Column DIBEV1 has 17 instances of value 7.


In [58]:
# With more than 33,000 observations, the number of refusals (code 7) is very small,
# so replace them with each column's mode.
for col in categorical_columns:
    mode_value = samadult[col].mode()[0]  # Most frequent value in the column
    # Only replace when the mode itself is not the refusal code
    if mode_value != 7:
        samadult[col] = samadult[col].replace(7, mode_value)

samadult


Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1.0,1,2.0,2.0,,1.0,1.0,,2.0
1,24,2.0,0,2.0,2.0,,1.0,2.0,,2.0
2,76,1.0,2,1.0,1.0,,2.0,2.0,,2.0
3,36,2.0,1,2.0,2.0,,1.0,1.0,2.0,2.0
4,35,2.0,1,2.0,2.0,1.0,1.0,1.0,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...
33023,56,1.0,0,2.0,2.0,3.0,1.0,1.0,,2.0
33024,58,1.0,1,1.0,1.0,3.0,1.0,1.0,,2.0
33025,71,2.0,2,1.0,1.0,3.0,2.0,1.0,2.0,1.0
33026,64,1.0,2,1.0,1.0,3.0,2.0,2.0,,2.0


In [59]:
# Check that every 7 has been replaced
for col in categorical_columns:
    count_refused = samadult[col].value_counts().get(7, 0)  # Get the count of 7 (or 0 if not present)
    print(f"Column {col} has {count_refused} instances of value 7.")

Column SEX has 0 instances of value 7.
Column HYPEV has 0 instances of value 7.
Column CHLEV has 0 instances of value 7.
Column SMKNOW has 0 instances of value 7.
Column DIBREL has 0 instances of value 7.
Column DIBGDM has 0 instances of value 7.
Column ALC1YR has 0 instances of value 7.
Column DIBEV1 has 0 instances of value 7.


In [61]:
# Finally, deal with the missing values

# First, check the percentage of missing values per column
print(samadult.isnull().mean() * 100)

AGE_P      0.000000
SEX        0.000000
BMI        0.000000
HYPEV      0.084777
CHLEV      0.196803
SMKNOW    59.125590
ALC1YR     0.060555
DIBREL     1.341286
DIBGDM    58.771346
DIBEV1     0.051471
dtype: float64


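For SMKNOW and DIBGDM, the share of missing values is high (about 59.1% and 58.8%, respectively).

Missingness this high may reflect questionnaire skip patterns (for example, a question asked only of a subset of respondents) rather than random non-response, so it is worth checking before imputing.

We can use KNN imputation to fill in missing values based on similar observations.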
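Before imputing a column with this much missingness, it can help to see whether the gaps concentrate in one subgroup. The sketch below compares the share of missing DIBGDM answers by SEX; the `toy` rows are hypothetical and only reuse the notebook's column names and codes.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the notebook's column names and codes.
toy = pd.DataFrame({
    "SEX":    [1, 1, 1, 2, 2, 2],
    "DIBGDM": [np.nan, np.nan, np.nan, 2.0, 1.0, np.nan],
})

# Share of missing DIBGDM answers within each SEX code (1 = male, 2 = female).
missing_by_sex = toy["DIBGDM"].isna().groupby(toy["SEX"]).mean()
print(missing_by_sex)
```

If one group turned out to be missing almost entirely on the real data, that would suggest the question was not asked of that group, in which case imputed values for it would have little meaning.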

In [62]:
# KNNImputer needs numeric input, so label-encode the BMI categories first
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes to the category names in alphabetical order
label_encoder = LabelEncoder()
samadult['BMI'] = label_encoder.fit_transform(samadult['BMI'])
samadult


Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKNOW,ALC1YR,DIBREL,DIBGDM,DIBEV1
0,22,1.0,1,2.0,2.0,,1.0,1.0,,2.0
1,24,2.0,0,2.0,2.0,,1.0,2.0,,2.0
2,76,1.0,2,1.0,1.0,,2.0,2.0,,2.0
3,36,2.0,1,2.0,2.0,,1.0,1.0,2.0,2.0
4,35,2.0,1,2.0,2.0,1.0,1.0,1.0,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...
33023,56,1.0,0,2.0,2.0,3.0,1.0,1.0,,2.0
33024,58,1.0,1,1.0,1.0,3.0,1.0,1.0,,2.0
33025,71,2.0,2,1.0,1.0,3.0,2.0,1.0,2.0,1.0
33026,64,1.0,2,1.0,1.0,3.0,2.0,2.0,,2.0
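Because `LabelEncoder` sorts class names alphabetically, the codes in the output above (Normal weight = 0, Obesity = 1, Overweight = 2) do not follow the natural BMI order, which matters for a distance-based method such as KNN. One possible alternative, sketched here on hypothetical values, is an ordered categorical with an explicit category order; `BMI_ORDER` is a name introduced only for this example.

```python
import pandas as pd

# Explicit order, from the lowest to the highest BMI range.
BMI_ORDER = ["Underweight", "Normal weight", "Overweight", "Obesity"]

# Hypothetical category values, for illustration only.
bmi_cat = pd.Series(["Obesity", "Normal weight", "Overweight", "Underweight"])

# An ordered categorical yields integer codes 0..3 that follow BMI_ORDER.
bmi_code = pd.Categorical(bmi_cat, categories=BMI_ORDER, ordered=True).codes
print(list(bmi_code))
```

With this encoding, a larger code always means a higher BMI range, so the distance between two codes reflects how far apart the categories are.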


In [63]:
import pandas as pd
from sklearn.impute import KNNImputer

# Use KNNImputer for missing value imputation
imputer = KNNImputer(n_neighbors=5)
samadult_imputed = pd.DataFrame(imputer.fit_transform(samadult), columns=samadult.columns)
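Two details are worth keeping in mind with `KNNImputer`: distances are computed on the raw values, so a wide-ranging column such as AGE_P can dominate, and imputed values are neighbor averages, so a categorical code may come back as a fraction. The sketch below shows one way to address both, on hypothetical rows; scaling with `MinMaxScaler` and rounding afterwards are choices made for this illustration, not steps from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows using the notebook's column names and codes.
toy = pd.DataFrame({
    "AGE_P":  [22, 24, 76, 36, 35, 71],
    "HYPEV":  [2, 2, 1, 2, 2, 1],
    "SMKNOW": [np.nan, 3, 3, 1, 1, np.nan],
})

# Rescale each column to [0, 1] so AGE_P does not dominate the distance.
# MinMaxScaler ignores NaN when fitting and keeps it when transforming.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Undo the scaling, then round the averaged codes back to whole numbers.
restored = pd.DataFrame(scaler.inverse_transform(imputed), columns=toy.columns)
restored["SMKNOW"] = restored["SMKNOW"].round().astype(int)
print(restored)
```

Rounding an average of codes treats them as ordered, which is only a rough approximation for nominal answers; taking the most frequent code among the neighbors may suit such columns better.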

