# Read the Data and Pick 10 Features for Research

### Our main goal: Among the selected features, identify which are related to diabetes and how they are correlated with it.

### 10 features we picked:

#### 1. Age (AGE_P)

Type: Continuous variable

Description: Represents the respondent's age in years.

Example values: 18, 35, 60, etc.

Why: Diabetes risk increases with age, especially for Type 2 diabetes.


#### 2. Sex (SEX)

Type: Categorical variable

Possible values:

1: Male

2: Female

7: Refused

9: Don’t know

Why: There may be gender differences in diabetes prevalence and risk factors.


#### 3. Body Mass Index (BMI)

Type: Continuous variable

Description: The body mass index (BMI) is calculated from height and weight.

Standard BMI categories:

<18.5: Underweight

18.5–24.9: Normal weight

25–29.9: Overweight

30 or greater: Obesity

Why: Obesity is one of the strongest risk factors for Type 2 diabetes.
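The preview at the end of this section shows raw BMI values such as 3336 and 2019 rather than values on the familiar scale. The following is a minimal sketch of rescaling the column and assigning the categories listed above. It assumes that BMI is stored with two implied decimal places and that 9999 marks an unknown value; both assumptions should be checked against the codebook. The helper name `decode_bmi` is hypothetical.

```python
import numpy as np
import pandas as pd

def decode_bmi(raw, unknown_code=9999):
    """Rescale raw BMI codes and label them with the standard categories."""
    raw = pd.Series(raw, dtype="float64")
    # Assumption: unknown_code marks a missing value; everything else is BMI * 100.
    bmi = raw.where(raw != unknown_code) / 100.0
    categories = pd.cut(
        bmi,
        bins=[0, 18.5, 25, 30, np.inf],
        right=False,  # intervals are [low, high), so 25.0 counts as overweight
        labels=["Underweight", "Normal weight", "Overweight", "Obesity"],
    )
    return pd.DataFrame({"bmi": bmi, "bmi_category": categories})

# Raw values taken from the preview, plus one assumed unknown code.
decoded = decode_bmi([3336, 2019, 2727, 9999])
print(decoded)
```

Keeping the rescaled value alongside the category leaves both the continuous and the grouped view available for later analysis.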


#### 4. Hypertension (HYPEV)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: High blood pressure is often associated with diabetes.
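Yes/no items such as HYPEV mix real answers (1, 2) with nonresponse codes (7, 9), so treating the raw numbers as quantities would be misleading. One possible way to recode them is sketched below on made-up rows: it maps Yes to 1.0 and No to 0.0 and lets every other code become missing. The helper name `recode_yes_no` is hypothetical.

```python
import pandas as pd

def recode_yes_no(series):
    """Map 1 -> 1.0 (yes) and 2 -> 0.0 (no); any other code becomes NaN."""
    # Series.map leaves values that are absent from the dict as NaN,
    # which is how 7 (Refused) and 9 (Don't know) are turned into missing.
    return series.map({1: 1.0, 2: 0.0})

# Toy data for illustration only.
toy = pd.DataFrame({"HYPEV": [1, 2, 7, 9, 2]})
toy["HYPEV_bin"] = recode_yes_no(toy["HYPEV"])
print(toy)
```

The same helper could be applied to the other items in this list that share the 1/2/7/9 coding.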


#### 5. Cholesterol Levels (CHLEV)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: High cholesterol is a common comorbidity with diabetes and can lead to related complications.


#### 6. Frequency of Vigorous Physical Activity (VIGFREQW)

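Type: Numeric variable (times per week) with special codes

Possible values (the preview at the end of this section shows values such as 3, 21, and 95):

0: Less than once per week

1–28: Times per week

95: Never

96: Unable to do this type of activity

97: Refused

98: Not ascertained

99: Don’t know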

Why: Regular physical activity can help reduce the risk of diabetes.


#### 7. Current Smoking Status (SMKNOW)

Type: Categorical variable

Possible values:

1: Every day

2: Some days

3: Not at all

7: Refused

9: Don’t know

Why: Smoking has been linked to an increased risk of diabetes and other chronic conditions.
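In the preview at the end of this section, SMKNOW is missing wherever the SMKEV column equals 2. That pattern suggests the question is asked only of respondents who have ever smoked. The sketch below combines the two columns into a single status under that assumption, and further assumes that SMKEV uses 1 for yes and 2 for no. The helper name `smoking_status` and the labels are hypothetical, and the rows are made up.

```python
import numpy as np
import pandas as pd

def smoking_status(smkev, smknow):
    """Combine SMKEV and SMKNOW into never / former / current / unknown."""
    conditions = [
        smkev == 2,                          # assumed: never a smoker
        (smkev == 1) & (smknow == 3),        # smoked before, not at all now
        (smkev == 1) & smknow.isin([1, 2]),  # smokes every day or some days
    ]
    labels = ["never", "former", "current"]
    # Rows matching none of the conditions (e.g. nonresponse) are "unknown".
    return pd.Series(
        np.select(conditions, labels, default="unknown"), index=smkev.index
    )

# Toy data for illustration only.
toy = pd.DataFrame({
    "SMKEV":  [2, 2, 1, 1, 1, 9],
    "SMKNOW": [np.nan, np.nan, 1, 3, 2, np.nan],
})
toy["smoking"] = smoking_status(toy["SMKEV"], toy["SMKNOW"])
print(toy)
```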


#### 8. Family History of Diabetes (DIBREL)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: A family history of diabetes significantly increases an individual's risk.


#### 9. Gestational Diabetes (DIBGDM)

Type: Categorical variable

Possible values:

1: Yes

2: No

7: Refused

9: Don’t know

Why: Women who develop gestational diabetes during pregnancy have a higher risk of developing Type 2 diabetes later in life.


#### 10. Alcohol Consumption (ALC1YR)

Type: Categorical variable

Possible values:

1: Yes (consumed alcohol in the past year)

2: No (did not consume alcohol in the past year)

7: Refused (refused to answer the question)

9: Don’t know (unsure whether alcohol was consumed in the past year)

Why: Excessive alcohol consumption can increase the risk of diabetes through its impact on weight and liver function.

In [1]:
import pandas as pd

In [2]:
samadult = pd.read_csv("samadult.csv")
samadult.head()

Unnamed: 0,FPX,FMX,HHX,INTV_QRT,WTIA_SA,WTFA_SA,SEX,HISPAN_I,R_MARITL,MRACRPI2,...,BFWH_05,BFWH_06,BFWH_07,BFWH_08,BFWH_09,BFWH_10,BNRFALL,BINTHI,BINTTR,BINTRS
0,1,1,1,1,11241.0,26100,1,12,8,2,...,2.0,2.0,2.0,2.0,1.0,2.0,5.0,2.0,2.0,2.0
1,1,2,1,1,5620.5,11294,2,12,5,2,...,,,,,,,,,,
2,1,1,2,1,2919.3,2506,1,12,5,1,...,,,,,,,,,,
3,2,1,3,1,8883.8,9267,2,3,2,1,...,,,,,,,,,,
4,1,1,5,1,3300.8,3443,2,3,7,1,...,,,,,,,,,,
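The preview shows that the file carries many more columns than this analysis needs. A possible alternative is to read only the columns of interest with the `usecols` parameter of `pd.read_csv`. The sketch below uses a small in-memory stand-in with made-up rows instead of `samadult.csv`, so that it runs on its own; the `EXTRA` column and the variable names are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for the real file (made-up rows, for illustration only).
csv_text = """FPX,SEX,AGE_P,BMI,DIBEV1,EXTRA
1,1,22,3336,2,0
1,2,24,2019,2,0
"""

wanted = ["AGE_P", "SEX", "BMI", "DIBEV1"]

# usecols skips every column not listed, which saves memory on wide files.
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# With a list, usecols keeps the file's column order, so reindex explicitly.
subset = subset[wanted]
print(subset)
```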


In [3]:
# List of columns to keep (10 key features + diabetes indicator column)
columns_to_keep = [
    'AGE_P',     # Age
    'SEX',       # Sex
    'BMI',       # Body Mass Index
    'HYPEV',     # Hypertension diagnosis
    'CHLEV',     # Cholesterol diagnosis
    'SMKEV',     # Ever smoked at least 100 cigarettes
    'SMKNOW',    # Current smoking status
    'VIGFREQW',  # Frequency of vigorous physical activity
    'DIBREL',    # Family history of diabetes
    'DIBGDM',    # Gestational diabetes history
    'DIBEV1'     # Diabetes diagnosis (Yes/No/Borderline)
]

In [4]:
# Create a new DataFrame with only the selected columns
# (.copy() makes it independent of the original, so later edits do not warn)
samadult = samadult[columns_to_keep].copy()
samadult.head()

Unnamed: 0,AGE_P,SEX,BMI,HYPEV,CHLEV,SMKEV,SMKNOW,VIGFREQW,DIBREL,DIBGDM,DIBEV1
0,22,1,3336,2,2,2,,3,1,,2
1,24,2,2019,2,2,2,,95,2,,2
2,76,1,2727,1,1,2,,95,2,,2
3,36,2,3862,2,2,2,,21,1,2.0,2
4,35,2,3995,2,2,1,1.0,95,1,2.0,2
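With the columns selected, a first look at how one feature relates to diabetes can be a row-normalized cross-tabulation. The sketch below uses made-up rows and assumes that DIBEV1 follows the same convention as the other items (1 for Yes, 2 for No), with all other codes dropped. It illustrates the idea and is not the analysis itself.

```python
import pandas as pd

# Toy data for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "HYPEV":  [1, 1, 1, 2, 2, 2, 2, 9],
    "DIBEV1": [1, 1, 2, 2, 2, 1, 2, 2],
})

# Keep rows where both answers are a definite yes (1) or no (2).
clean = toy[toy["HYPEV"].isin([1, 2]) & toy["DIBEV1"].isin([1, 2])]

# Share of each diabetes answer within each hypertension group (rows sum to 1).
rates = pd.crosstab(clean["HYPEV"], clean["DIBEV1"], normalize="index")
print(rates)
```

Comparing the share of column `1` across the rows shows whether diabetes is more common in one group than the other. A formal test would still be needed before drawing conclusions.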
