<a href="https://colab.research.google.com/github/RifatMuhtasim/Data_Science_Workflow/blob/main/3.3.Feature_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Identify Important Features

# Variance Threshold

In [None]:
from sklearn.feature_selection import VarianceThreshold

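# Keep only numeric columns; VarianceThreshold cannot handle strings
numerical_df = train_df.select_dtypes(include='number')
variance_threshold = VarianceThreshold(threshold=0.1)
variance_threshold.fit(numerical_df)

# get_support() is True for columns whose variance exceeds the threshold
support_array = variance_threshold.get_support()
low_variance_columns = numerical_df.columns[~support_array]
print("Low-variance columns:", low_variance_columns.tolist())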
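
As a quick illustration of what a threshold of 0.1 filters out, here is a minimal sketch on a made-up DataFrame (`toy_df` and its column names are hypothetical, not part of the original data). It shows that a column does not have to be strictly constant to be dropped: anything with a variance at or below the threshold goes.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data, invented for this illustration
toy_df = pd.DataFrame({
    "constant": [1] * 10,                          # variance 0
    "almost_constant": [0] * 9 + [1],              # variance 0.1 * 0.9 = 0.09
    "varied": [1, 5, 2, 8, 3, 9, 4, 7, 6, 10],     # variance well above 0.1
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(toy_df)

# variances_ holds the per-column variance the selector computed
variances = pd.Series(selector.variances_, index=toy_df.columns)
kept_columns = toy_df.columns[selector.get_support()].tolist()
dropped_columns = toy_df.columns[~selector.get_support()].tolist()

print(variances)
print("Kept:", kept_columns)
print("Dropped:", dropped_columns)
```

Variance depends on the scale of each column, so a single threshold is only comparable across features that share similar units.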

# Select Features using Mutual Info Classification
- Target Variable: Is categorical, like in classification tasks (e.g., predicting if a customer will churn or not).

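- Features: Can be numerical, categorical, or a mix of both. Categorical features must be encoded as numbers first and should be flagged through the `discrete_features` argument.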

In [None]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = df.drop(['output'], axis="columns")
y = df['output']

# Calculate mutual information
mutual_info = mutual_info_classif(X, y)
mutual_info = pd.Series(mutual_info, index=X.columns)

# Sort mutual information values in descending order
mutual_info = mutual_info.sort_values(ascending=False)

# Display the sorted mutual information
mutual_info.reset_index()
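
When the feature set mixes continuous and integer-encoded categorical columns, `mutual_info_classif` can be told which is which. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the column names are assumptions made up for the example) and passes a boolean mask to `discrete_features`; `random_state` is fixed because the estimator adds a small amount of random noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Hypothetical binary target
y_demo = rng.integers(0, 2, size=n)

X_demo = pd.DataFrame({
    # Continuous feature that shifts with the class
    "informative_num": y_demo + rng.normal(0, 0.3, size=n),
    # Integer-encoded categorical feature that matches the class 90% of the time
    "informative_cat": np.where(rng.random(n) < 0.9, y_demo, 1 - y_demo),
    # Continuous feature unrelated to the class
    "noise_num": rng.normal(size=n),
})

# One flag per column: True marks a discrete (categorical) feature
discrete_mask = [False, True, False]

scores = mutual_info_classif(
    X_demo, y_demo, discrete_features=discrete_mask, random_state=0
)
mi_demo = pd.Series(scores, index=X_demo.columns).sort_values(ascending=False)
print(mi_demo)
```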

# Select Features using Mutual Info Regression
- Target Variable: Is numerical, such as predicting house prices, sales figures, or any other continuous variable.

- Features: Can be a mix of numerical and categorical. Categorical features must be encoded as numbers first.


In [None]:
from sklearn.feature_selection import mutual_info_regression

X = df.drop(['output'], axis="columns")
y = df['output']

# Calculate mutual information between each feature and the target
mutual_info = mutual_info_regression(X, y)

# Create a pandas Series to hold the mutual information scores
mutual_info_series = pd.Series(mutual_info, index=X.columns)

# Sort the mutual information scores in descending order
mutual_info_sorted = mutual_info_series.sort_values(ascending=False)

# Display the sorted mutual information scores
mutual_info_sorted.reset_index()
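
Mutual information measures any kind of dependence, not only linear trends. The following sketch, on synthetic data invented for the illustration (`x_curved`, `X_demo`, `y_demo` are hypothetical names), compares the mutual information score with the Pearson correlation for a feature that relates to the target through a square.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: the target depends on the square of "curved"
x_curved = rng.uniform(-1, 1, size=n)
X_demo = pd.DataFrame({
    "curved": x_curved,
    "noise": rng.normal(size=n),
})
y_demo = x_curved ** 2 + rng.normal(0, 0.05, size=n)

mi_demo = pd.Series(
    mutual_info_regression(X_demo, y_demo, random_state=0),
    index=X_demo.columns,
)
# Pearson correlation of each column with the target, for comparison
pearson = X_demo.corrwith(pd.Series(y_demo))

print(pd.DataFrame({"mutual_info": mi_demo, "pearson": pearson}))
```

In this setup the correlation for `curved` should come out close to zero while its mutual information score is clearly higher than that of `noise`.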

# Select Features using Chi2 Statistical Analysis
You would apply the chi-squared feature selection method when:

- Target Variable: Is categorical, such as class labels in a classification problem.

- Features: Are numerical and non-negative, specifically discrete values or counts, rather than continuous values. `chi2` raises an error if any feature value is negative.

In [None]:
from sklearn.feature_selection import chi2

X = df.drop(['output'], axis="columns")
y = df['output']

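# chi2 returns two arrays: the chi-squared statistics and their p-values
chi2_stats, p_values = chi2(X, y)
chi2_results = pd.DataFrame({'chi2': chi2_stats, 'p_value': p_values}, index=X.columns)
chi2_results.sort_values('chi2', ascending=False).reset_index()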

A higher chi-squared score indicates a stronger dependence between a feature and the target. Use it to evaluate categorical or count features in a classification task.
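
To make the two return values of `chi2` concrete, here is a small sketch on synthetic count data (`X_demo`, `y_demo`, and the column names are made up for the example). The first array holds the chi-squared statistics and the second the corresponding p-values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 400

# Hypothetical binary target
y_demo = rng.integers(0, 2, size=n)

X_demo = pd.DataFrame({
    # Counts whose average depends on the class (2 for class 0, 6 for class 1)
    "count_related": rng.poisson(lam=2 + 4 * y_demo),
    # Counts drawn independently of the class
    "count_unrelated": rng.poisson(lam=3, size=n),
})

# First array: chi-squared statistics, second array: p-values
chi2_stats, p_vals = chi2(X_demo, y_demo)

chi2_table = pd.DataFrame(
    {"chi2": chi2_stats, "p_value": p_vals}, index=X_demo.columns
).sort_values("chi2", ascending=False)
print(chi2_table)
```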

# Ranked Best Features using Univariate Analysis
- Your features are numerical (discrete counts or binary).

- Your target variable is categorical.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

X = df.drop(['output'], axis="columns")
y = df['output']

order_ranked_columns = SelectKBest(score_func=chi2, k='all')
order_ranked_features = order_ranked_columns.fit(X, y)

df_scores = pd.DataFrame(order_ranked_features.scores_, columns=['score'])
df_columns = pd.DataFrame(X.columns, columns=['feature'])
feature_rank = pd.concat([df_columns, df_scores], axis="columns")
feature_rank.sort_values("score", ascending=False)
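
With `k='all'` the selector only scores the features. To actually keep the top few, `k` can be set to a number and `get_support()` used to recover the column names. This is a minimal sketch on synthetic count data; `X_demo`, `y_demo`, the column names, and `k=2` are all assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 400

# Hypothetical binary target and non-negative count features
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "related_a": rng.poisson(lam=2 + 4 * y_demo),
    "unrelated_a": rng.poisson(lam=3, size=n),
    "related_b": rng.poisson(lam=1 + 3 * y_demo),
    "unrelated_b": rng.poisson(lam=5, size=n),
})

# Keep the two features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns
selected = X_demo.columns[selector.get_support()].tolist()
X_reduced = X_demo[selected]

print("Selected features:", selected)
```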

# Ranked Best Features using ANOVA
- Features are numerical and the target variable is categorical.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(['output'], axis="columns")
y = df['output']

order_ranked_columns = SelectKBest(f_classif, k='all')
order_ranked_features = order_ranked_columns.fit(X, y)

df_scores = pd.DataFrame(order_ranked_features.scores_, columns=['score'])
df_columns = pd.DataFrame(X.columns, columns=['feature'])
feature_rank = pd.concat([df_columns, df_scores], axis="columns")
feature_rank.sort_values("score", ascending=False)

# Select Features using ExtraTreesClassifier
After fitting, an ExtraTreesClassifier exposes impurity-based importances through `feature_importances_`. A higher value means the feature contributed more to the splits across the trees.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
X = df.drop(['output'], axis="columns")
y = df['output']

# Fit the model on the features (X) and target (y) defined above
model = ExtraTreesClassifier()
model.fit(X, y)

# Extract feature importances
importances = model.feature_importances_

# Create a DataFrame to display the feature importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df
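
Because the trees are built with random splits, the importances change from run to run unless `random_state` is fixed. The sketch below uses data from `make_classification` (the names `X_demo`, `y_demo`, `f0` to `f5`, and the choice of 200 trees are assumptions for the example) and shows that the importances sum to 1, so each value can be read as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: with shuffle=False the informative features come first (f0, f1)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Fixing random_state makes the importances reproducible
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X_demo, y_demo)

importance_demo = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_demo)
print("Sum of importances:", importance_demo.sum())
```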

# Recursive Feature Elimination (RFE)
Recursive Feature Elimination (RFE) is a feature selection technique in machine learning. It works as follows:

- Initial Model Fitting: It starts by training a model (e.g., logistic regression, decision tree) on all features.

- Ranking Features: The model assigns an importance score to each feature.

- Eliminating Least Important Features: The least important features are removed.

- Repeating the Process: The process is repeated recursively with the remaining features until the desired number of features is selected.
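
The steps above can be sketched as a plain loop. This is a simplified illustration, not scikit-learn's own implementation: it removes one feature per round and uses the absolute logistic regression coefficient as the importance score. The data comes from `make_classification`, and names such as `remaining` and `elimination_order` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, scaled so that coefficients are comparable across features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
n_features_to_keep = 3
elimination_order = []                    # indices in the order they were dropped

while len(remaining) > n_features_to_keep:
    # 1. Fit the model on the remaining features
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    # 2. Score each feature by the size of its coefficient
    importance = np.abs(model.coef_).ravel()
    # 3. Drop the feature with the smallest score
    weakest = remaining[int(np.argmin(importance))]
    remaining.remove(weakest)
    elimination_order.append(weakest)
    # 4. Repeat with the features that are left

print("Kept feature indices:", remaining)
print("Dropped, in order:", elimination_order)
```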

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

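# X and y are the features and target defined in the cells above.
# max_iter is raised because the default (100) often fails to converge on unscaled data.
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)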
# Get the ranking of features
ranking = rfe.ranking_

# Create a DataFrame to view the selected features and their rankings
selected_features = pd.DataFrame({'Feature': X.columns, 'Ranking': ranking})
selected_features = selected_features.sort_values('Ranking', ascending = True)
display(selected_features)

In RFE, every selected feature receives a ranking of 1. Eliminated features receive rankings of 2 and above, and a higher number means the feature was removed earlier, so it was judged less important. With `n_features_to_select=5`, exactly five features have a ranking of 1.
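
A small check of how `ranking_` and `support_` relate, on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, `f0` to `f7`, and the choice of three features are assumptions for the illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names f0..f7
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

ranking_demo = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()

# support_ is a boolean mask that is True exactly where ranking_ == 1
selected_demo = X_demo.columns[rfe_demo.support_].tolist()

print(ranking_demo)
print("Selected features:", selected_demo)
```

With the default `step=1`, one feature is removed per round, so the eight rankings here are 1, 1, 1, 2, 3, 4, 5, 6.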

# 2. Remove Features

In [None]:
# Remove the columns listed in remove_columns_list

remove_columns_list = ['B', 'C']
df = my_df.drop(columns=remove_columns_list)