<a href="https://colab.research.google.com/github/JunLiang778/Feature-Selection-Project-Understanding/blob/master/Feature_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is it?

When training a model, we often collect as many variables as we can in the hope that more data helps the machine learn better. But not all of this data is useful.

Feature selection is the `process of selecting the most relevant and important features` (variables) from a dataset for building a predictive model.

It `eliminates irrelevant, redundant, or noisy features` to `improve the model’s accuracy, reduce overfitting, and make the model easier to interpret`.

# The problem it solves

1. `Irrelevant Features`: Not all features in a dataset contribute to the prediction. Irrelevant features add noise that can mislead the model and reduce accuracy.
2. `Overfitting`: With too many features relative to the number of samples, the model can fit noise in the training data, so it performs well on training data but poorly on unseen data.
3. `High Computational Cost`: More features mean more calculations, which increases the time and resources required for model training and predictions.
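
As a rough illustration of the first two points, the sketch below appends columns of pure random noise to the breast cancer dataset used later in this notebook and compares cross-validated accuracy with and without them. The choice of k-NN, the 300 noise columns, and the variable names are assumptions made for this example only; the exact scores depend on the library version and the random seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Append 300 columns of Gaussian noise (an arbitrary, illustrative number).
rng = np.random.default_rng(0)
noise = rng.normal(size=(X.shape[0], 300))
X_noisy = np.hstack([X, noise])

# k-NN is distance-based, so irrelevant dimensions tend to hurt it.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Mean 5-fold cross-validated accuracy on each version of the data.
score_clean = cross_val_score(model, X, y, cv=5).mean()
score_noisy = cross_val_score(model, X_noisy, y, cv=5).mean()

print(f"30 original features:           {score_clean:.3f}")
print(f"30 original + 300 noise columns: {score_noisy:.3f}")
```

The noisy version also takes longer to fit and score, which is the computational cost described in point 3.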

# Methods of Feature Selection
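1. `Filter Methods`: Score each feature with a statistic computed independently of any model (e.g., correlation with the target) and keep the highest-scoring features.
2. `Wrapper Methods`: Train the model on different feature combinations and use its performance to decide which combination to keep (e.g., recursive feature elimination).
3. `Embedded Methods`: Select features during model training (e.g., LASSO regression, whose L1 penalty shrinks some coefficients to exactly zero).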


Feature selection helps keep the model efficient and effective by focusing it on the most informative inputs.
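
The three families above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions, not the method used in the rest of this notebook: `SelectKBest` with `f_classif` stands in for a filter method, `RFE` for a wrapper method, and `SelectFromModel` around an L1-penalized `LinearSVC` for an embedded method (an L1-penalized classifier is used here in place of LASSO regression because the target is a class label). The values `k=5`, `n_features_to_select=5`, and `C=0.05` are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # put features on a common scale
y = data.target
names = data.feature_names

# Filter: rank features by an ANOVA F-test against the target, keep the top 5.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest feature until 5 remain.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: the L1 penalty drives some coefficients to zero during training;
# SelectFromModel keeps the features whose coefficients stay non-zero.
l1_model = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000)
emb = SelectFromModel(l1_model).fit(X, y)

for label, selector in [("Filter", filt), ("Wrapper", wrap), ("Embedded", emb)]:
    print(f"{label}: {list(names[selector.get_support()])}")
```

The three selectors will generally not agree on the same subset, because each one measures "importance" differently.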



---
# 1. Correlation-based Feature Selection in a Data Science Project

In [3]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load Breast Cancer Dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [4]:
# Add target column (0: malignant, 1: benign)
df['target'] = pd.Series(data.target)
df['target']

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 564 | 0 |
| 565 | 0 |
| 566 | 0 |
| 567 | 0 |


In [6]:
# Compute correlation matrix
correlation_matrix = df.corr()

correlation_matrix["target"]

| | target |
|---|---|
| mean radius | -0.730029 |
| mean texture | -0.415185 |
| mean perimeter | -0.742636 |
| mean area | -0.708984 |
| mean smoothness | -0.35856 |
| mean compactness | -0.596534 |
| mean concavity | -0.69636 |
| mean concave points | -0.776614 |
| mean symmetry | -0.330499 |
| mean fractal dimension | 0.012838 |
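
By default, `df.corr()` computes the Pearson correlation coefficient. To make that number less of a black box, the sketch below recomputes the value for `mean radius` by hand and compares it with what pandas returns. The variable names (`x_c`, `y_c`, `r_manual`, `r_pandas`) are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

x = df["mean radius"].to_numpy()
y = df["target"].to_numpy()

# Pearson r = sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c**2).sum() * (y_c**2).sum())

# The same quantity as computed by pandas.
r_pandas = df["mean radius"].corr(df["target"])

print(f"manual: {r_manual:.6f}")
print(f"pandas: {r_pandas:.6f}")
```

The sign is negative because the target is coded 0 for malignant and 1 for benign, so larger radii go with the malignant class.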


In [8]:
# Keep features whose absolute correlation with the target exceeds 0.7
# (note: 'target' itself always passes, because its correlation with itself is 1.0)
threshold = 0.7
relevant_features = correlation_matrix['target'].abs() > threshold

# Display selected features
selected_features = correlation_matrix.index[relevant_features]
print("Selected Features:", selected_features)

Selected Features: Index(['mean radius', 'mean perimeter', 'mean area', 'mean concave points',
       'worst radius', 'worst perimeter', 'worst area', 'worst concave points',
       'target'],
      dtype='object')


In [9]:
# Display correlation values of selected features with the target
print(correlation_matrix['target'][relevant_features])

mean radius            -0.730029
mean perimeter         -0.742636
mean area              -0.708984
mean concave points    -0.776614
worst radius           -0.776454
worst perimeter        -0.782914
worst area             -0.733825
worst concave points   -0.793566
target                  1.000000
Name: target, dtype: float64
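
Two things stand out in this output: `target` selects itself, and several of the remaining features (radius, perimeter, area) measure nearly the same thing. The sketch below is one possible follow-up, not part of the original analysis: it drops `target` from the candidates and then greedily keeps a feature only if it is not too strongly correlated with one that has already been kept. The `redundancy_limit` of 0.9 and the greedy strategy are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
corr = df.corr()

# Step 1: same 0.7 threshold as above, but exclude the target itself.
threshold = 0.7
target_corr = corr["target"].drop("target")
selected = target_corr[target_corr.abs() > threshold]

# Step 2: walk through the candidates from strongest to weakest and keep one
# only if its absolute correlation with every kept feature is <= the limit.
redundancy_limit = 0.9  # hypothetical cut-off for "redundant"
ranked = selected.abs().sort_values(ascending=False).index
kept = []
for name in ranked:
    if all(abs(corr.loc[name, other]) <= redundancy_limit for other in kept):
        kept.append(name)

print("Above threshold (target excluded):", list(selected.index))
print("After removing redundant features:", kept)
```

A stricter or looser `redundancy_limit` changes how many features survive, so it is worth checking the result against model performance rather than treating the cut-off as fixed.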
