
# 1. Variance Thresholding

### Definition
Variance thresholding is a simple baseline approach to feature selection: it removes every feature whose variance does not exceed a chosen threshold. With scikit-learn's default threshold of 0, it removes only zero-variance features, i.e., features that have the same value in all samples.

### Example Table Data

| Feature1 | Feature2 | Feature3 | Constant |
|----------|----------|----------|----------|
| 1        | 2        | 3        | 1        |
| 1        | 3        | 4        | 1        |
| 1        | 4        | 5        | 1        |
| 1        | 5        | 6        | 1        |
| 1        | 6        | 7        | 1        |

In this table, 'Feature1' and 'Constant' both have zero variance because every row holds the same value, while 'Feature2' and 'Feature3' vary from row to row.
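As a worked check on the table, the variance can be computed by hand. This sketch assumes the population formula (dividing by $n$), which is NumPy's default and, as far as I know, what `VarianceThreshold` uses internally:

$$
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
$$

For 'Feature2', the mean is $\bar{x} = 4$, so

$$
\operatorname{Var}(\text{Feature2}) = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} = \frac{10}{5} = 2
$$

For 'Feature1' every deviation from the mean is zero, so its variance is 0.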

In [None]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Sample data
data = {
    'Feature1': [1, 1, 1, 1, 1],  # Zero variance (same value in every row)
    'Feature2': [2, 3, 4, 5, 6],
    'Feature3': [3, 4, 5, 6, 7],
    'Constant': [1, 1, 1, 1, 1]  # Zero variance
}
df = pd.DataFrame(data)

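# Keep only the features whose variance exceeds 0.1
selector = VarianceThreshold(threshold=0.1)
filtered_values = selector.fit_transform(df)
kept_columns = df.columns[selector.get_support()]  # boolean mask of retained features
df_variance_filtered = pd.DataFrame(filtered_values, columns=kept_columns)
print("After Variance Thresholding:\n", df_variance_filtered)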
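The selection above can be cross-checked without scikit-learn. The following is a minimal sketch, not part of the original notebook: it assumes population variance (`ddof=0`) and reuses the same sample data and the same 0.1 cutoff. The names `variances` and `kept` are chosen here for illustration.

```python
import pandas as pd

# Same sample data as the cell above
df = pd.DataFrame({
    'Feature1': [1, 1, 1, 1, 1],
    'Feature2': [2, 3, 4, 5, 6],
    'Feature3': [3, 4, 5, 6, 7],
    'Constant': [1, 1, 1, 1, 1],
})

threshold = 0.1  # same cutoff as above

# ddof=0 gives the population variance (divide by n, not n - 1)
variances = df.var(ddof=0)

# Keep the columns whose variance is strictly greater than the cutoff
kept = variances[variances > threshold].index.tolist()

print(variances)
print("Kept by hand:", kept)
```

Note that pandas defaults to `ddof=1` (sample variance), so leaving out `ddof=0` would give slightly larger values than the ones the selector compares against the threshold.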

# 2. Correlation Matrix Filtering

### Definition
Correlation matrix filtering computes the pairwise correlation matrix of the features and removes one feature from each pair whose absolute correlation exceeds a chosen cutoff. This reduces redundancy in the data.

### Example Table Data

| Feature1 | Feature2 | Feature3 | Feature4 |
|----------|----------|----------|----------|
| 1        | 2        | 2        | 5        |
| 2        | 4        | 4        | 6        |
| 3        | 6        | 6        | 7        |
| 4        | 8        | 8        | 8        |
| 5        | 10       | 10       | 9        |

In this table, 'Feature2' and 'Feature3' are exact multiples of 'Feature1', and 'Feature4' equals 'Feature1' plus 4, so every pair of features has a correlation of 1.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = {
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [2, 4, 6, 8, 10],  # 2 * Feature1, perfectly correlated with it
    'Feature3': [2, 4, 6, 8, 10],  # Identical to Feature2
    'Feature4': [5, 6, 7, 8, 9]  # Feature1 + 4, also perfectly correlated with it
}
df = pd.DataFrame(data)

# Correlation matrix
corr_matrix = df.corr().abs()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find every column whose correlation with an earlier column exceeds 0.9
to_drop = [column for column in upper.columns if (upper[column] > 0.9).any()]

# Drop those columns; with this sample data only Feature1 remains
df_corr_filtered = df.drop(columns=to_drop)
print("After Correlation Matrix Filtering:\n", df_corr_filtered)
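Before dropping columns, it can help to list which pairs cross the cutoff. The following is a minimal sketch, not part of the original notebook; `cutoff` and `high_pairs` are names chosen here for illustration. With this sample data all six pairs are flagged, because 'Feature4' is 'Feature1' plus 4, which is why the filter keeps only 'Feature1'.

```python
from itertools import combinations

import pandas as pd

# Same sample data as the cell above
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4, 5],
    'Feature2': [2, 4, 6, 8, 10],
    'Feature3': [2, 4, 6, 8, 10],
    'Feature4': [5, 6, 7, 8, 9],
})

cutoff = 0.9  # same cutoff as above
corr_matrix = df.corr().abs()

# Visit each unordered pair of columns once and keep those above the cutoff
high_pairs = [
    (a, b, round(float(corr_matrix.loc[a, b]), 2))
    for a, b in combinations(df.columns, 2)
    if corr_matrix.loc[a, b] > cutoff
]

for a, b, r in high_pairs:
    print(f"{a} vs {b}: |r| = {r}")
```

Inspecting the pairs first makes it easier to decide which member of each pair to keep, rather than always keeping whichever column happens to come first.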

# 3. Domain Knowledge

### Definition
Domain knowledge involves using expertise from the specific field or industry to manually select the most relevant features. This method leverages human understanding of which features are likely to be important.

### Example Table Data

| Age | Salary | Height | Weight |
|-----|--------|--------|--------|
| 25  | 50000  | 5.5    | 150    |
| 30  | 60000  | 6.0    | 160    |
| 35  | 70000  | 5.8    | 170    |
| 40  | 80000  | 5.9    | 180    |
| 45  | 90000  | 6.1    | 190    |

In this table, 'Age' and 'Salary' might be selected if domain knowledge indicates that they matter for the prediction task.

In [None]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Height': [5.5, 6.0, 5.8, 5.9, 6.1],
    'Weight': [150, 160, 170, 180, 190]
}
df = pd.DataFrame(data)

# Based on domain knowledge, we know Age and Salary are important
selected_features_domain = df[['Age', 'Salary']]
print("Selected Features based on Domain Knowledge:\n", selected_features_domain)
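When the feature list comes from a person rather than from the data, a typo in a column name is an easy mistake to make. The following is a minimal sketch, not part of the original notebook, that checks the list before selecting; `expert_features` is a hypothetical name for the list a domain expert would supply.

```python
import pandas as pd

# Same sample data as the cell above
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Height': [5.5, 6.0, 5.8, 5.9, 6.1],
    'Weight': [150, 160, 170, 180, 190],
})

# Hypothetical list supplied by a domain expert
expert_features = ['Age', 'Salary']

# Fail early with a clear message if any requested column is absent
missing = [name for name in expert_features if name not in df.columns]
if missing:
    raise KeyError(f"Columns not found in the data: {missing}")

selected = df[expert_features]
print(selected)
```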
