# 🔧 Feature Selection

After preprocessing, feature selection retains the **most relevant features** for predicting the diabetes class.  
This **reduces dimensionality, can improve model performance**, and keeps the focus on clinically significant indicators.

---

## 1️⃣ Run Preprocessing Notebook

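Running the preprocessing notebook makes all preprocessed variables (`X_train_res`, `X_test_scaled`, `y_train_res`, `y_test`, `scaler`) available in this session. The original feature table `X` must also be in scope, because its column names are reused in the next step.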

In [1]:
# Run the preprocessing notebook
%run ./Preprocessing.ipynb

Preprocessed data saved as joblib files!
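
The output above indicates that the preprocessed objects are also written to disk with joblib, so they could be reloaded without re-running the notebook. The sketch below shows the dump/load round trip on a stand-in array; the file name `X_train_res.joblib` is hypothetical, since the actual names used by the preprocessing notebook are not shown here.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Hypothetical file name; the preprocessing notebook's actual names may differ
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train_res.joblib"

    # Stand-in for what the preprocessing notebook would have saved
    joblib.dump(np.arange(6.0).reshape(3, 2), path)

    # Reload the array without re-running the notebook
    X_train_loaded = joblib.load(path)

print(X_train_loaded.shape)
```

Loading from disk only helps if the files are kept in sync with the notebook that produced them.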


## 2️⃣ Convert NumPy Arrays to DataFrames

Converting arrays to **pandas DataFrames** allows easier manipulation, feature selection, and visualization.


In [2]:
import pandas as pd

# Re-attach the original column names; X comes from the preprocessing notebook
X_train_res_df = pd.DataFrame(X_train_res, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)
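
The conversion is needed because scikit-learn transformers return plain NumPy arrays by default, which drops the column names. A minimal sketch on toy data follows; `X_toy` and its values are made up for illustration, and `StandardScaler` is assumed here only as a typical scaler, not necessarily the one used in preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature table standing in for X (hypothetical values)
X_toy = pd.DataFrame(
    {
        "AGE": [50, 26, 33, 45],
        "HbA1c": [4.9, 6.8, 7.1, 5.2],
        "BMI": [24.0, 23.0, 21.0, 30.0],
    }
)

# By default the transformer returns an ndarray without column names
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Re-attach the names; this is only valid if the column order is unchanged
X_toy_scaled_df = pd.DataFrame(X_toy_scaled, columns=X_toy.columns)
print(X_toy_scaled_df.round(2))
```

Note that the rebuilt DataFrame gets a fresh default index, so row labels from the original table are not preserved.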

## 3️⃣ Select Relevant Features

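The features below were chosen based on **domain knowledge, statistical tests, and correlation analysis**:

- `HbA1c` → Glycated hemoglobin, reflecting average blood glucose over roughly the past 2 to 3 months  
- `BMI` → Body Mass Index  
- `AGE` → Patient age  
- `Urea`, `Cr` → Kidney function markers (urea, creatinine)  
- `Chol`, `LDL`, `VLDL`, `TG` → Lipid profile (total cholesterol, LDL, VLDL, triglycerides)  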

The selected features are used for model training, with the aim of improving **accuracy and interpretability**.

In [3]:
# Columns kept for modeling
selected_features = ['HbA1c','BMI','AGE','Urea','Chol','VLDL','TG','Cr','LDL']

# Apply the same subset to train and test so both share one schema
X_train_fs = X_train_res_df[selected_features]
X_test_fs = X_test_scaled_df[selected_features]
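
This section cites statistical tests and correlation analysis as selection criteria, but the notebook does not show them. One common way to run such a screen, offered here as an assumption rather than the method actually used, is an ANOVA F-test per feature plus a pairwise correlation matrix. The data and the names `X_toy` and `y_toy` are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): one informative feature, one pure-noise feature
y_toy = np.repeat([0, 1, 2], 50)
X_toy = pd.DataFrame(
    {
        "informative": y_toy + rng.normal(scale=0.5, size=y_toy.size),
        "noise": rng.normal(size=y_toy.size),
    }
)

# ANOVA F-test: a higher F means class means differ more for that feature
f_scores, p_values = f_classif(X_toy, y_toy)
ranking = pd.DataFrame(
    {"F": f_scores, "p_value": p_values}, index=X_toy.columns
).sort_values("F", ascending=False)
print(ranking)

# Absolute pairwise correlations help spot redundant features
print(X_toy.corr().abs().round(2))
```

To avoid leaking information from the test set, a screen like this would be fit on the training data only.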

## 4️⃣ Inspect Selected Features

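Checking the first few rows confirms that only the selected columns remain. The values are scaled rather than raw clinical measurements, which is why many of them are negative.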


In [4]:
X_train_fs.head()

|   | HbA1c | BMI | AGE | Urea | Chol | VLDL | TG | Cr | LDL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.432651 | -0.30618 | -1.624315 | -0.21272 | 0.208963 | -0.312617 | -0.80631 | -0.213544 | 0.09472 |
| 1 | -1.165377 | -0.905307 | -0.429091 | -0.422238 | -0.192012 | -0.341727 | -0.965386 | -0.281358 | 1.658419 |
| 2 | -0.595479 | -0.705598 | 0.965337 | -0.107961 | 0.289158 | -0.283508 | -0.726772 | -0.465426 | 0.87657 |
| 3 | -1.043256 | -0.505889 | -0.229887 | -0.2651 | 0.048573 | 0.735327 | -1.044923 | -0.300734 | 1.658419 |
| 4 | -0.391944 | -0.30618 | 0.068919 | -0.21272 | 0.609938 | -0.283508 | -0.647234 | -0.23292 | 1.169763 |
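
Beyond eyeballing the first rows, a few programmatic checks can confirm that the training and test frames share the same columns and contain no missing values. The helper `check_selection` below is a hypothetical illustration, not part of the notebook, and is shown on toy frames.

```python
import pandas as pd


def check_selection(train, test, features):
    """Raise ValueError if the selected frames are inconsistent."""
    if list(train.columns) != list(features):
        raise ValueError("training columns do not match the selected features")
    if list(test.columns) != list(train.columns):
        raise ValueError("train and test columns differ")
    if train.isna().any().any() or test.isna().any().any():
        raise ValueError("selected features contain missing values")


# Toy frames standing in for X_train_fs and X_test_fs (hypothetical values)
toy_features = ["HbA1c", "BMI"]
toy_train = pd.DataFrame({"HbA1c": [0.1, -0.2], "BMI": [1.0, 0.5]})
toy_test = pd.DataFrame({"HbA1c": [0.3], "BMI": [-0.7]})

# Passes silently when the frames are consistent
check_selection(toy_train, toy_test, toy_features)
print("selection looks consistent")
```

In the notebook itself, the equivalent call would be `check_selection(X_train_fs, X_test_fs, selected_features)`.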


## ✅ Key Insights

- The selected features represent **key clinical indicators** for diabetes prediction.  
- Dropping the remaining columns reduces noise from less relevant features.  
- The resulting `X_train_fs` and `X_test_fs` are ready for **model training** on a smaller feature set.