# **Phase 2: Feature Selection**

In [1]:
import pandas as pd

df_encoded = pd.read_csv("cleaned_dataset.csv")


**Information Gain Calculation**

In [2]:
X = df_encoded.drop(columns=["Target"])  # All features except Target
y = df_encoded["Target"]  # Target variable

In [3]:
from sklearn.feature_selection import mutual_info_classif

# Compute Information Gain (IG)
ig_scores = mutual_info_classif(X, y, random_state=42, n_jobs=-1)  # Parallel Processing

# Create a DataFrame to Store IG Values
ig_df = pd.DataFrame({"Feature": X.columns, "Information Gain": ig_scores})

#  Sort Features by IG Score (Descending Order)
ig_df = ig_df.sort_values(by="Information Gain", ascending=False).reset_index(drop=True)

#  Display Results
print("✅ IG Calculation Completed!")
print(ig_df)


✅ IG Calculation Completed!
                   Feature  Information Gain
0                       pH          0.090183
1     Color_Near Colorless          0.080836
2                Manganese          0.070980
3                Turbidity          0.056685
4                 Chloride          0.054655
5                   Copper          0.052396
6                     Odor          0.051209
7       Color_Faint Yellow          0.046724
8             Color_Yellow          0.043236
9                  Nitrate          0.042416
10                Chlorine          0.041500
11                Fluoride          0.041020
12                    Iron          0.040132
13  Total Dissolved Solids          0.035176
14      Color_Light Yellow          0.033709
15                 Sulfate          0.032870
16             Source_Well          0.030178
17             Time of Day          0.025906
18           Source_Stream          0.025121
19             Source_Lake          0.025049
20            Source_River 
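
One detail of the IG computation above: with the default `discrete_features='auto'`, `mutual_info_classif` treats every column of a dense input as continuous, including one-hot columns such as `Color_*` and `Source_*`. The sketch below shows how a boolean mask could flag 0/1 columns as discrete. The data, column values, and the `toy_*` names are hypothetical stand-ins, not the thesis dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for the encoded dataset (hypothetical values)
rng = np.random.default_rng(42)
n = 500
toy_X = pd.DataFrame({
    "pH": rng.normal(7.0, 0.8, n),           # continuous measurement
    "Source_Well": rng.integers(0, 2, n),     # one-hot style column (0/1)
    "Color_Yellow": rng.integers(0, 2, n),    # one-hot style column (0/1)
})
# Target depends on pH only, so pH should rank first
toy_y = (toy_X["pH"] + rng.normal(0, 0.5, n) > 7.0).astype(int)

# Flag columns that only contain the values 0 and 1 as discrete
discrete_mask = toy_X.isin([0, 1]).all(axis=0).to_numpy()

toy_scores = mutual_info_classif(
    toy_X, toy_y, discrete_features=discrete_mask, random_state=42
)
toy_ig = pd.Series(toy_scores, index=toy_X.columns).sort_values(ascending=False)
print(toy_ig)
```

Whether flagging the one-hot columns changes the ranking on the real data would need to be checked by rerunning the cell above with such a mask.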

In [4]:
# Save IG results to CSV
ig_df.to_csv("thesis_main_ig_calculation.csv", index=False)

print("✅ IG Calculation Results Saved Successfully as thesis_main_ig_calculation.csv!")


✅ IG Calculation Results Saved Successfully as thesis_main_ig_calculation.csv!


## **Paper Implementation (For Selecting IG Method)**

**Fixed Threshold Method**

In [3]:
import numpy as np

In [7]:
# Use the 50th percentile (median) as the fixed threshold
fixed_threshold = np.percentile(ig_df["Information Gain"], 50)  

# Select features with IG >= fixed threshold
selected_features_fixed = ig_df.loc[ig_df["Information Gain"] >= fixed_threshold, "Feature"].values

# Display Results
print("\n✅ Features Selected by Fixed Threshold Method (Median-Based):")
print(selected_features_fixed)
print(f"📌 Number of Features Selected: {len(selected_features_fixed)}")



✅ Features Selected by Fixed Threshold Method (Median-Based):
['pH' 'Color_Near Colorless' 'Manganese' 'Turbidity' 'Chloride' 'Copper'
 'Odor' 'Color_Faint Yellow' 'Color_Yellow' 'Nitrate' 'Chlorine'
 'Fluoride' 'Iron' 'Total Dissolved Solids' 'Color_Light Yellow' 'Sulfate'
 'Source_Well' 'Time of Day' 'Source_Stream' 'Source_Lake' 'Source_River']
📌 Number of Features Selected: 21


**Standard Deviation-Based Threshold Approach**

In [8]:
# Calculate the threshold as the standard deviation of IG values
std_threshold = ig_df["Information Gain"].std()

# Select features with IG >= standard deviation threshold
selected_features_std = ig_df.loc[ig_df["Information Gain"] >= std_threshold, "Feature"].values

# Display Results
print("\n✅ Features Selected by Standard Deviation Threshold Method:")
print(selected_features_std)
print(f"📌 Number of Features Selected: {len(selected_features_std)}")



✅ Features Selected by Standard Deviation Threshold Method:
['pH' 'Color_Near Colorless' 'Manganese' 'Turbidity' 'Chloride' 'Copper'
 'Odor' 'Color_Faint Yellow' 'Color_Yellow' 'Nitrate' 'Chlorine'
 'Fluoride' 'Iron' 'Total Dissolved Solids' 'Color_Light Yellow' 'Sulfate'
 'Source_Well' 'Time of Day' 'Source_Stream' 'Source_Lake' 'Source_River'
 'Source_Reservoir' 'Source_Ground' 'Source_Spring' 'Zinc']
📌 Number of Features Selected: 25
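
To make the difference between the two thresholds concrete, here is a small sketch on hypothetical IG scores (the feature names `a` to `f` and their values are invented for illustration). The median keeps roughly the top half of the features by construction, while the standard deviation of the scores can fall lower and let more features through.

```python
import numpy as np
import pandas as pd

# Hypothetical IG scores for six features (illustration only)
toy_ig = pd.DataFrame({
    "Feature": ["a", "b", "c", "d", "e", "f"],
    "Information Gain": [0.09, 0.07, 0.05, 0.04, 0.02, 0.01],
})
scores = toy_ig["Information Gain"]

# Median-based threshold: keeps roughly the top half
median_threshold = np.percentile(scores, 50)

# Standard-deviation threshold: pandas .std() is the sample std (ddof=1)
std_threshold = scores.std()

kept_median = toy_ig.loc[scores >= median_threshold, "Feature"].tolist()
kept_std = toy_ig.loc[scores >= std_threshold, "Feature"].tolist()

# np.std defaults to the population std (ddof=0), which is slightly smaller
population_std = np.std(scores.to_numpy())

print("median:", round(median_threshold, 4), kept_median)
print("std   :", round(std_threshold, 4), kept_std)
print("np.std:", round(population_std, 4))
```

The last line matters when comparing methods: `Series.std()` and `np.std()` use different denominators, so two "standard deviation" thresholds computed with different libraries are not identical.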


**CBFS Method**

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train Random Forest on the dataset
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42, n_jobs=-1)
rf.fit(X, y)

# Apply CBFS: select features with importance >= mean importance.
# prefit=True reuses the forest fitted above instead of training a second, identical copy.
cbfs = SelectFromModel(rf, threshold="mean", prefit=True)
selected_features_cbfs = X.columns[cbfs.get_support()]

# Display Results
print("\n✅ Features Selected by CBFS Method:")
print(selected_features_cbfs)
print(f"📌 Number of Features Selected: {len(selected_features_cbfs)}")



✅ Features Selected by CBFS Method:
Index(['pH', 'Iron', 'Nitrate', 'Chloride', 'Zinc', 'Turbidity', 'Fluoride',
       'Copper', 'Odor', 'Sulfate', 'Chlorine', 'Manganese',
       'Total Dissolved Solids'],
      dtype='object')
📌 Number of Features Selected: 13
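
The selection rule used above (keep a feature when its Random Forest importance is at least the mean importance) can be checked by hand. The sketch below does this on synthetic data; `make_classification`, the `f0`..`f9` column names, and the `toy_*` variables are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical feature names f0..f9)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=42
)
toy_X = pd.DataFrame(toy_X, columns=[f"f{i}" for i in range(10)])

# Fit the forest once
toy_rf = RandomForestClassifier(n_estimators=100, random_state=42)
toy_rf.fit(toy_X, toy_y)

# Let SelectFromModel apply the "mean" threshold to the fitted forest
toy_selector = SelectFromModel(toy_rf, threshold="mean", prefit=True)
toy_selected = toy_X.columns[toy_selector.get_support()]

# The same rule written out explicitly
importances = pd.Series(toy_rf.feature_importances_, index=toy_X.columns)
manual = importances[importances >= importances.to_numpy().mean()].index

print(importances.sort_values(ascending=False))
print("Selected:", list(toy_selected))
```

Because importances sum to 1, the mean threshold equals 1 divided by the number of features, so adding many weak one-hot columns lowers the bar for everything else.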


**FFT Method**

In [4]:
from scipy.fft import fft, ifft
from sklearn.feature_selection import mutual_info_classif
import numpy as np

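# Step 1: Apply FFT along the sample axis and keep only the real part
X_fft = fft(X, axis=0).real

# Step 2: Apply IFFT. Because Step 1 discarded the imaginary part, this does NOT
# restore the original data: each column becomes the average of itself and its
# circularly reversed copy (the even part of the column).
X_ifft = ifft(X_fft, axis=0).real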

# Step 3: Compute Information Gain (IG) on transformed data
information_gain_fft = mutual_info_classif(X_ifft, y, random_state=42, n_jobs=-1)

# Step 4: Compute standard deviation threshold
# (np.std uses ddof=0, whereas the pandas .std() in the previous method uses ddof=1)
fft_threshold = np.std(information_gain_fft)

# Step 5: Select features with IG >= threshold
selected_features_fft = X.columns[information_gain_fft >= fft_threshold]

print("\nFeatures Selected by FFT with Proposed Threshold:")
print(selected_features_fft)
print(f"Number of Features: {len(selected_features_fft)}")



Features Selected by FFT with Proposed Threshold:
Index(['Color_Faint Yellow', 'Color_Light Yellow', 'Color_Near Colorless',
       'Color_Yellow', 'Source_Ground', 'Source_Lake', 'Source_Reservoir',
       'Source_River', 'Source_Spring', 'Source_Stream', 'Source_Well',
       'Month_August', 'Month_December', 'Month_February', 'Month_January',
       'Month_July', 'Month_June', 'Month_March', 'Month_May',
       'Month_November', 'Month_October', 'Month_September'],
      dtype='object')
Number of Features: 22
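
The effect of keeping only the real part of the FFT can be seen on a single toy column. The sketch below (the array `x` is a made-up example) compares a full round trip with the real-only round trip used in the cell above.

```python
import numpy as np
from scipy.fft import fft, ifft

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy column, illustration only

# Full round trip: the complex spectrum is kept, so the input is recovered
full_round_trip = ifft(fft(x)).real

# Real-only round trip: the imaginary part is dropped before the inverse
real_only_round_trip = ifft(fft(x).real).real

# For real input, Re(FFT) is the spectrum of the circularly even part:
# (x[n] + x[(-n) mod N]) / 2
mirrored = np.roll(x[::-1], 1)
even_part = (x + mirrored) / 2

print("original      :", x)
print("full trip     :", np.round(full_round_trip, 6))
print("real-only trip:", np.round(real_only_round_trip, 6))
print("even part     :", even_part)
```

Applied along `axis=0`, this means each row of `X_ifft` is the average of one sample and a mirrored sample from the other end of the dataset, so the IG in Step 3 is computed on blended rows rather than on the original measurements.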



---

# **Feature Selection Analysis Summary**  

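We applied four feature selection methods. Three of them threshold **Information Gain (IG)** scores (median, standard deviation, and standard deviation after an FFT transform), and one (CBFS) thresholds **Random Forest feature importance**. Each emphasizes different types of features. Below is a structured analysis of their outcomes and the next steps.  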

---

## **🔍 Overview of Selected Features**  

| **Method**                         | **Features Selected** | **Key Characteristics** |
|------------------------------------|----------------------|-------------------------|
| **Fixed Threshold (Median-Based)** | 21 features         | Mostly **physicochemical measurements** (`pH`, `Iron`, `Nitrate`, etc.), plus four `Color_*` columns, four `Source_*` columns, and `Time of Day`. |
| **Standard Deviation-Based Threshold** | 25 features     | All 21 Fixed Threshold features plus `Source_Reservoir`, `Source_Ground`, `Source_Spring`, and `Zinc`. |
| **CBFS (Random Forest-Based)**      | 13 features         | Only **physicochemical measurements**; no `Color_*`, `Source_*`, or time-related columns. |
| **FFT-Based Selection**            | 22 features         | Only **one-hot categorical (`Color_*`, `Source_*`) and temporal (`Month_*`) columns**; no chemical indicators. |
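
The overlap between the four feature sets can be quantified directly from the printed outputs. The sketch below copies the selected feature names from the cells above and reports shared-feature counts and Jaccard similarity for each pair; the variable names are illustrative.

```python
from itertools import combinations

# Feature names copied from the notebook outputs above
fixed = {
    "pH", "Color_Near Colorless", "Manganese", "Turbidity", "Chloride",
    "Copper", "Odor", "Color_Faint Yellow", "Color_Yellow", "Nitrate",
    "Chlorine", "Fluoride", "Iron", "Total Dissolved Solids",
    "Color_Light Yellow", "Sulfate", "Source_Well", "Time of Day",
    "Source_Stream", "Source_Lake", "Source_River",
}
std = fixed | {"Source_Reservoir", "Source_Ground", "Source_Spring", "Zinc"}
cbfs = {
    "pH", "Iron", "Nitrate", "Chloride", "Zinc", "Turbidity", "Fluoride",
    "Copper", "Odor", "Sulfate", "Chlorine", "Manganese",
    "Total Dissolved Solids",
}
fft_sel = {
    "Color_Faint Yellow", "Color_Light Yellow", "Color_Near Colorless",
    "Color_Yellow", "Source_Ground", "Source_Lake", "Source_Reservoir",
    "Source_River", "Source_Spring", "Source_Stream", "Source_Well",
    "Month_August", "Month_December", "Month_February", "Month_January",
    "Month_July", "Month_June", "Month_March", "Month_May",
    "Month_November", "Month_October", "Month_September",
}

feature_sets = {"Fixed": fixed, "Std-Dev": std, "CBFS": cbfs, "FFT": fft_sel}

# Pairwise overlap: shared feature count and Jaccard similarity
for (name_a, set_a), (name_b, set_b) in combinations(feature_sets.items(), 2):
    shared = set_a & set_b
    jaccard = len(shared) / len(set_a | set_b)
    print(f"{name_a:8s} vs {name_b:8s}: {len(shared):2d} shared, Jaccard = {jaccard:.2f}")
```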

---

## **📌 Key Observations & Trends**  

### ✅ **What’s Being Prioritized?**  

✔ **Fixed & Std-Dev Threshold-Based Selection:**  
   - Prioritize **core water quality indicators** (`pH`, `Iron`, `Nitrate`, `Turbidity`).  
   - Include some categorical variables (`Color_*`, `Source_*`).  

✔ **CBFS (Random Forest-Based):**  
   - Selects **only physicochemical measurements** (including `Turbidity` and `Odor`), emphasizing direct water quality readings.  

✔ **FFT-Based Selection:**  
   - Selects only one-hot `Color_*`, `Source_*`, and `Month_*` columns over chemical values. Whether this reflects genuine **seasonal patterns & categorical dependencies** has not yet been verified.  

---

### ⚠️ **What’s Being Ignored?**  

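❌ **CBFS:**  
   - **Drops every** `Color_*`, `Source_*`, and time-related column, potentially **missing seasonal and source-related variation**.  

❌ **Fixed & Std-Dev Threshold-Based Selection:**  
   - Keep `Source_*` columns and `Time of Day` but select **no `Month_*` columns**, potentially **missing seasonal variations**.  

❌ **FFT-Based Selection:**  
   - **Fails to capture key chemical indicators** (e.g., `pH`, `Iron`, `Nitrate`). One possible cause is that **chemical values don’t follow strong periodic patterns**; another is that discarding the imaginary part of the FFT blends each row with a mirrored row, which may weaken the link between continuous features and the target.  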

---

## **🚀 Next Steps & Implementation Plan**  

🔹 **Step 1:** Utilize all four Feature Selection (FS) methods.  
🔹 **Step 2:** Convert selected features following the approach outlined in the reference paper.  
🔹 **Step 3:** Apply machine learning models to all feature sets and compare their initial performance.  
🔹 **Step 4:** Analyze performance trends and refine the best FS method for further improvements.  
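
Step 3 could be sketched as a loop that cross-validates one model per feature set. Everything below is a hypothetical illustration: the synthetic data, the `subset_*` names, and the choice of logistic regression are assumptions, not the models or features used in the thesis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical feature names f0..f7)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
toy_X = pd.DataFrame(toy_X, columns=[f"f{i}" for i in range(8)])

# Hypothetical subsets standing in for the four FS outputs
toy_feature_sets = {
    "subset_a": ["f0", "f1", "f2", "f3"],
    "subset_b": ["f4", "f5", "f6", "f7"],
    "all_features": list(toy_X.columns),
}

# One shared model definition so only the feature set varies
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Mean 5-fold accuracy per feature set
results = {
    name: cross_val_score(model, toy_X[cols], toy_y, cv=5, scoring="accuracy").mean()
    for name, cols in toy_feature_sets.items()
}

comparison = pd.Series(results).sort_values(ascending=False)
print(comparison)
```

One caveat for the real comparison: the feature sets above were selected using the full dataset, so cross-validated scores computed afterwards can be optimistic. Running the selection inside each training fold avoids that leakage.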

---
