## Group member 5: feature selection

### Subtask:
Using the processed data, apply a feature selection technique to identify the features most relevant to predicting stroke. Candidate approaches include statistical tests, model-based feature importance, and dimensionality reduction (PCA is assigned to Member 6). Explain the rationale for the chosen method and display the selected features.


**Reasoning**:
Define the feature matrix X from the df_cleaned DataFrame by dropping the target column 'stroke' and the identifier column 'id', and use 'stroke' as the target variable y.



In [None]:
X = df_cleaned.drop(['stroke', 'id'], axis=1)
y = df_cleaned['stroke']

print("Shape of feature matrix X:", X.shape)
print("Shape of target variable y:", y.shape)

Shape of feature matrix X: (5110, 16)
Shape of target variable y: (5110,)


**Reasoning**:
Train a RandomForestClassifier model on the data and get the feature importances.



In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Feature selection method: impurity-based feature importance from RandomForestClassifier
# Rationale: a random forest scores each feature by how much its splits reduce node
# impurity (Gini impurity by default for classification), averaged over all trees.
# Higher scores suggest more relevance to the prediction task. The method is cheap to
# compute and yields a clear ranking, but impurity-based scores tend to favor continuous
# and high-cardinality features over binary dummy columns.

# Instantiate and train a RandomForestClassifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Get feature importances
feature_importances = model.feature_importances_

# Create a pandas Series to store feature names and their importances
feature_importance_series = pd.Series(feature_importances, index=X.columns)

# Sort feature importances in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

print("Feature Importances (sorted):")
print(sorted_feature_importances)

Feature Importances (sorted):
age                               2.616317e-01
bmi                               2.594563e-01
avg_glucose_level                 2.173858e-01
gender_Male                       3.566488e-02
Residence_type_Urban              3.525226e-02
hypertension                      2.775140e-02
smoking_status_never smoked       2.690523e-02
work_type_Private                 2.563415e-02
heart_disease                     2.459820e-02
smoking_status_formerly smoked    2.228079e-02
work_type_Self-employed           2.162083e-02
smoking_status_smokes             2.114090e-02
ever_married_Yes                  1.933248e-02
work_type_children                1.294418e-03
work_type_Never_worked            5.047267e-05
gender_Other                      1.685171e-07
dtype: float64
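
**Cross-check (sketch)**:
Impurity-based importances are computed on the training data and tend to favor continuous features, so the dominance of age, bmi, and avg_glucose_level is worth cross-checking. One option is permutation importance on a held-out split. The sketch below uses a synthetic dataset as a stand-in for X and y; the names `X_demo`, `y_demo`, `rf`, and `comparison`, along with the 70/30 split, are assumptions for illustration and not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (hypothetical data, not the stroke dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"feature_{i}" for i in range(6)])

# Hold out a test split so importances are measured on unseen rows
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each column in turn and record the mean drop in test accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Put both importance measures side by side for comparison
comparison = pd.DataFrame(
    {
        "impurity_importance": rf.feature_importances_,
        "permutation_importance": result.importances_mean,
    },
    index=X_demo.columns,
).sort_values("permutation_importance", ascending=False)

print(comparison)
```

To apply this to the notebook's data, replace the synthetic arrays with X and y. Because stroke datasets are usually imbalanced, a scorer other than accuracy (passed through the `scoring` argument) may be more informative.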


**Reasoning**:
Select the most important features from the sorted importances and print them. Two common approaches are to keep every feature whose importance exceeds a threshold, or to keep a fixed number of top-ranked features. Here, a small threshold (0.01) is used to drop features with negligible importance; given the scores above, it removes only work_type_children, work_type_Never_worked, and gender_Other.



In [None]:
# Select features based on a threshold (e.g., importance > 0.01)
# You can adjust this threshold based on the desired number of features or domain knowledge.
selected_features = sorted_feature_importances[sorted_feature_importances > 0.01].index.tolist()

print("\nSelected Features based on importance threshold:")
print(selected_features)

# Alternatively, you could select the top N features:
# N = 10  # Example: select the top 10 features
# top_n_features = sorted_feature_importances.head(N).index.tolist()
# print(f"\nTop {N} Selected Features:")
# print(top_n_features)


Selected Features based on importance threshold:
['age', 'bmi', 'avg_glucose_level', 'gender_Male', 'Residence_type_Urban', 'hypertension', 'smoking_status_never smoked', 'work_type_Private', 'heart_disease', 'smoking_status_formerly smoked', 'work_type_Self-employed', 'smoking_status_smokes', 'ever_married_Yes']
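
**Cumulative importance (sketch)**:
A third way to choose a cutoff, alongside a fixed threshold and a top-N rule, is to keep the smallest set of top-ranked features whose importances add up to a target share of the total. The sketch below reuses the importances printed earlier, copied in by hand; the 0.95 target and the variable names are assumptions for illustration only.

```python
# Importances copied from the sorted output printed earlier in this section
importances = {
    "age": 0.2616317,
    "bmi": 0.2594563,
    "avg_glucose_level": 0.2173858,
    "gender_Male": 0.03566488,
    "Residence_type_Urban": 0.03525226,
    "hypertension": 0.02775140,
    "smoking_status_never smoked": 0.02690523,
    "work_type_Private": 0.02563415,
    "heart_disease": 0.02459820,
    "smoking_status_formerly smoked": 0.02228079,
    "work_type_Self-employed": 0.02162083,
    "smoking_status_smokes": 0.02114090,
    "ever_married_Yes": 0.01933248,
    "work_type_children": 0.001294418,
    "work_type_Never_worked": 0.00005047267,
    "gender_Other": 0.0000001685171,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

# Keep top-ranked features until they cover the target share of total importance
coverage_target = 0.95  # hypothetical cutoff, chosen for illustration
cumulative = 0.0
kept = []
for name, score in ranked:
    kept.append(name)
    cumulative += score
    if cumulative >= coverage_target:
        break

# For comparison: the share of importance covered by the 0.01 threshold used above
threshold_selected = [name for name, score in ranked if score > 0.01]
threshold_coverage = sum(importances[name] for name in threshold_selected)

print(f"Features needed for {coverage_target:.0%} coverage: {len(kept)}")
print(f"Coverage of the 0.01-threshold selection: {threshold_coverage:.4f}")
```

On these numbers, the three continuous features alone account for roughly 74% of the total importance, and the 13 features kept by the 0.01 threshold cover almost all of it, so the threshold prunes very little beyond the three near-zero dummy columns.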
