<details>
<summary>📌 Cell Description: Loading the Standardized Dataset and Identifying Target & Astrophysical Feature Columns</summary>

This cell begins the feature-selection and machine-learning workflow by loading the **standardized ZTF dataset** and identifying which columns are most relevant for scientific and ML analysis. Standardization puts all numeric variables on a consistent scale, which matters for scale-sensitive methods used later, such as KMeans, PCA, and L1-regularized logistic regression.

Additionally, the cell imports the full set of feature-selection libraries and ML tools that will be used later, including Random Forest, Logistic Regression, Variance Threshold, Mutual Information, and KMeans clustering.

To prepare the dataset for intelligent feature selection, the script scans the column names to automatically detect:

- **Possible target variables** (labels for classification tasks)  
- **Astrophysically meaningful features**, such as coordinates, brightness measurements, signal-to-noise ratio, observing conditions, and time information  

This helps ensure that important scientific features are not mistakenly removed during feature-selection steps.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Loads the cleaned and standardized dataset**, ensuring all variables have comparable numeric scales.  
- **Imports essential feature-selection tools**, including:
  - VarianceThreshold  
  - SelectKBest  
  - Mutual Information  
  - Random Forest  
  - Logistic Regression  
  - KMeans  
- **Detects possible target/label columns**, such as:
  - "label", "class", "target", "type"  
  which may be used for supervised learning tasks.  
- **Identifies astrophysical priority features**, such as:
  - RA/Dec → sky position  
  - Flux/Magnitude → brightness  
  - SNR → data quality  
  - Seeing/Airmass → atmospheric conditions  
  - Filter/Band → wavelength of observation  
  - JD/MJD/ObsDate → time of observation  
- **Prints dataset shape and detected features**, confirming dataset readiness for ML workflows.

---

### ⭐ **Why This Cell Is Important for the Research**

This block performs essential preparation steps before applying any feature-selection or learning algorithm:

1. **Guarantees the dataset is properly loaded**  
   Prevents errors during feature-selection or model training.

2. **Automatically identifies label columns**  
   Many astronomical datasets do not have explicit labels; when a label column does exist, this check locates it by name.

3. **Highlights astrophysically meaningful features**  
   Prevents important scientific variables from being accidentally removed.

4. **Supports transparent and explainable ML pipelines**  
   By identifying key features early, the researcher can justify:
   - which features are included  
   - which ones are removed  
   - why certain columns matter scientifically  

5. **Creates a structured foundation**  
   All following steps (variance filtering, correlation analysis, clustering, supervised learning) depend on correctly identifying these columns.

Overall, this cell ensures a scientifically grounded and well-structured start to the feature-selection process.

</details>


In [2]:
# Imports and load standardized dataset
import os
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

DATA_PATH = 'ztf_image_search_results_full_standardized.csv'
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f'{DATA_PATH} not found. Run the preprocessing/standardization notebook first.')
df = pd.read_csv(DATA_PATH)
print('Loaded dataset shape:', df.shape)
# Detect likely target columns (exact name match) and astrophysically important columns (substring match).
# Note: substring matching is loose; e.g. 'ra' also matches 'filefracday'.
possible_targets = [c for c in df.columns if c.lower() in ['label','class','target','type']]
astro_priority = [c for c in df.columns if any(k in c.lower() for k in ['ra','dec','flux','mag','snr','seeing','airmass','filter','band','jd','obsdate','mjd','maglimit'])]
print('Detected possible targets:', possible_targets)
print('Astrophysical priority columns found:', astro_priority)
df.shape

Loaded dataset shape: (62368, 42)
Detected possible targets: []
Astrophysical priority columns found: ['ra', 'dec', 'filtercode', 'obsdate', 'obsjd', 'filefracday', 'seeing', 'airmass', 'maglimit', 'ra1', 'dec1', 'ra2', 'dec2', 'ra3', 'dec3', 'ra4', 'dec4']


(62368, 42)
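
The output above lists `filefracday` among the priority columns only because its name contains the substring `ra`. As a minimal sketch of a stricter matcher (the helper `is_astro_priority` and the split into short and long keys are hypothetical, not part of the notebook), short keys can be required to match the whole name, optionally followed by digits, while longer keys keep substring matching:

```python
import re

# Assumption: these groupings are illustrative, not the notebook's own lists.
SHORT_KEYS = ("ra", "dec")  # too ambiguous to match as substrings
LONG_KEYS = ("flux", "mag", "snr", "seeing", "airmass",
             "filter", "band", "jd", "obsdate", "mjd")


def is_astro_priority(name):
    """Return True if a column name looks like an astrophysical priority feature."""
    lower = name.lower()
    # Short keys must be the entire name, optionally with a numeric suffix (ra, ra1, dec4).
    if any(re.fullmatch(rf"{key}\d*", lower) for key in SHORT_KEYS):
        return True
    # Longer keys are distinctive enough for substring matching.
    return any(key in lower for key in LONG_KEYS)


cols = ["ra", "dec", "filtercode", "obsdate", "obsjd", "filefracday",
        "seeing", "airmass", "maglimit", "ra1", "dec1", "exptime"]
matched = [c for c in cols if is_astro_priority(c)]
print(matched)
```

Adopting a matcher like this would change `astro_priority`, so the recorded outputs in this notebook would no longer apply as-is.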

<details>
<summary>📌 Cell Description: Removing Features With Excessive Missing Values (>90% Missing)</summary>

This cell performs an important early step in feature selection by removing any columns (features) that contain **more than 90% missing values**. In practice, features with extremely high missingness provide little to no useful information for machine-learning models or scientific interpretation. Keeping such columns can introduce noise, distort statistical patterns, or cause algorithms to fail.

The cell calculates the fraction of missing values in each column, identifies those with too much missingness, drops them from the dataset, and reports how many were removed. This ensures the remaining dataset contains features that are informative, reliable, and suitable for further analysis.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Computes the missing-value percentage** for every column.  
- **Keeps only columns with at least 10% valid data** (i.e., at most 90% missing).  
- **Drops extremely sparse features**, which typically do not contribute meaningful information.  
- **Prints the names of removed columns**, allowing transparency in the feature-selection process.  
- **Updates and prints the new dataset shape**, showing how many features remain.  
- Helps prevent:
  - unreliable model training  
  - statistical distortions  
  - unnecessary dimensionality  
  - increased computational cost  

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Improves data quality**  
   Features with too many missing values cannot reliably support scientific conclusions or machine-learning predictions.

2. **Reduces dimensionality early**  
   Removing uninformative features simplifies the dataset and improves computational efficiency.

3. **Prevents model instability**  
   ML algorithms struggle with columns that contain predominantly empty or imputed values.

4. **Enhances interpretability**  
   Keeping only meaningful columns helps focus the analysis on scientifically relevant variables.

5. **Common practice in data science**  
   Dropping features above a high missingness threshold (90% here) is a common heuristic for keeping datasets clean and analyzable.

This step ensures the dataset entering deeper feature-selection methods is clean, structured, and scientifically valid.

</details>


In [3]:
# 1) Drop features with >90% missing values
thresh = 0.1  # keep columns with at least 10% non-missing
missing_frac = df.isnull().mean()
cols_keep = missing_frac[missing_frac <= (1 - thresh)].index.tolist()
dropped_missing = [c for c in df.columns if c not in cols_keep]
print(f'Dropping {len(dropped_missing)} columns with >90% missing: ', dropped_missing)
df = df[cols_keep].copy()
print('Shape after missingness drop:', df.shape)

Dropping 0 columns with >90% missing:  []
Shape after missingness drop: (62368, 42)
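
Since no column in this dataset crosses the threshold, the filter is easier to see on a toy frame. The following is a small sketch with made-up column names (`dense`, `half`, `sparse`) that applies the same `isnull().mean()` rule:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns), 20 rows each.
toy = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0, 4.0] * 5,   # 0% missing
    "half": [1.0, np.nan] * 10,          # 50% missing
    "sparse": [np.nan] * 19 + [1.0],     # 95% missing
})

# Fraction of missing values per column.
missing_frac = toy.isnull().mean()

# Keep columns with at most 90% missing, mirroring the notebook's rule.
kept = missing_frac[missing_frac <= 0.9].index.tolist()
dropped = [c for c in toy.columns if c not in kept]
print("kept:", kept, "dropped:", dropped)
```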


<details>
<summary>📌 Cell Description: Basic Imputation for Handling Remaining Missing Values</summary>

This cell fills in (imputes) any remaining missing values in the dataset. Even after removing columns with excessive missingness, some features may still contain gaps. Most of the scikit-learn estimators used later cannot handle missing values directly, so the gaps must be replaced with reasonable estimates.

The approach used here is simple, reliable, and widely accepted:

- **Numeric features** → replaced with the **median**  
- **Categorical features** → replaced with the **mode** (most frequent category)

Median imputation avoids being influenced by extreme values, while mode imputation preserves the most common category. These strategies ensure that the dataset remains statistically consistent without introducing unrealistic values.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Identifies numeric and categorical columns** separately.  
- **Numeric columns**: Missing values filled with the **median**, which is robust to outliers.  
- **Categorical columns**: Missing values filled with the **most frequent category** (mode).  
- Handles rare cases where a mode does not exist by inserting `"missing"`.  
- Ensures that **all remaining missing values are filled**.  
- **Prints the 10 columns with the most remaining missing values**, confirming successful imputation (all counts should be 0).  

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Machine-learning algorithms require complete data**  
   Missing values must be filled before training or evaluation.

2. **Chosen methods preserve statistical behavior**  
   - Median protects numeric distributions from outliers.  
   - Mode maintains category consistency.

3. **Prevents bias and errors**  
   Proper imputation avoids artificial patterns that could mislead models.

4. **Ensures fairness in feature selection**  
   Columns are not dropped unnecessarily simply because they contain a few missing values.

5. **Improves model stability and reliability**  
   Clean, complete data is essential for robust scientific and ML results.

By performing imputation at this stage, the dataset becomes **fully usable**, enabling downstream tasks such as variance filtering, clustering, supervised learning, and astrophysical analysis.

</details>


In [4]:
# 2) Basic imputation: numeric -> median, categorical -> mode
num_cols = df.select_dtypes(include=['number']).columns.tolist()
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
for c in num_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna(df[c].median())
for c in cat_cols:
    if df[c].isnull().any():
        mode = df[c].mode()
        # fall back to the literal 'missing' when the column has no mode (all values NaN)
        df[c] = df[c].fillna(mode.iloc[0] if not mode.empty else 'missing')
print('After imputation, missing per column (top 10):')
print(df.isnull().sum().sort_values(ascending=False).head(10))

After imputation, missing per column (top 10):
ra            0
dec           0
infobits      0
field         0
ccdid         0
qid           0
rcid          0
fid           0
filtercode    0
pid           0
dtype: int64
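
To illustrate why the median is preferred over the mean for numeric gaps, here is a small sketch on made-up values (the series below are illustrative, not taken from the ZTF data):

```python
import numpy as np
import pandas as pd

# A numeric series with one extreme value and one gap.
s = pd.Series([1.0, 2.0, 3.0, 1000.0, np.nan])

median_filled = s.fillna(s.median())  # median of the observed values is 2.5
mean_filled = s.fillna(s.mean())      # mean is pulled up to 251.5 by the outlier

# A categorical series with one gap, filled with the most frequent value.
codes = pd.Series(["zg", "zr", "zg", None])
mode_filled = codes.fillna(codes.mode().iloc[0])

print(median_filled.iloc[-1], mean_filled.iloc[-1], mode_filled.tolist())
```

The median-filled value stays close to the bulk of the data, while the mean-filled value is dominated by the single outlier.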


<details>
<summary>📌 Cell Description: Removing Low-Variance (Near-Constant) Features</summary>

This cell removes **low-variance features**, which are columns whose values are almost identical across all observations. Such features offer little or no useful information for machine-learning models because they do not help differentiate one sample from another.

For example, if a column has nearly the same value for every astronomical observation, it cannot contribute to predicting object types or identifying meaningful patterns. Removing these features reduces noise, speeds up computation, and improves model performance.

The `VarianceThreshold` tool identifies which numeric columns vary enough to be informative. Columns whose variance does not exceed the threshold of 1e-5 are dropped, and the dataset is rebuilt using only the remaining numeric features plus the categorical features.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Selects all numeric columns** and evaluates their variance.  
- **Drops features that are nearly constant**, since they carry little information.  
- **Prints the names of removed low-variance features** for transparency.  
- **Rebuilds the dataset** using:
  - the remaining numeric features  
  - all original categorical features  
- **Updates and prints the new dataset shape** after filtering.  
- **Improves efficiency** by reducing dimensionality.  
- Helps prevent:
  - redundant information  
  - unnecessary computational cost  
  - model confusion from uninformative features  

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Eliminates uninformative features**  
   Near-constant columns cannot help ML models distinguish between different objects or observations.

2. **Enhances model performance**  
   Removing noise improves accuracy, stability, and training speed.

3. **Reduces dimensionality**  
   Leaner datasets make feature selection and algorithm performance more efficient.

4. **Improves interpretability**  
   A dataset with fewer, more meaningful features is easier to analyze and explain.

5. **Standard practice in feature engineering**  
   Low-variance filtering is a widely used first step in preparing structured datasets for ML.

This cell ensures that the dataset only contains features that meaningfully contribute to astronomical classification or pattern discovery.

</details>


In [5]:
# 3) Low variance filter (remove near-constant features)
from sklearn.feature_selection import VarianceThreshold
num_df = df.select_dtypes(include=['number']).copy()
if num_df.shape[1] > 0:
    selector_var = VarianceThreshold(threshold=1e-5)
    selector_var.fit(num_df)
    keep_mask = selector_var.get_support()
    lowvar_removed = [col for i,col in enumerate(num_df.columns) if not keep_mask[i]]
    print('Low-variance removed:', lowvar_removed)
    num_df = num_df.loc[:, keep_mask]
    # rebuild df with remaining numeric cols + categorical cols
    df = pd.concat([num_df.reset_index(drop=True), df[cat_cols].reset_index(drop=True)], axis=1)
    print('Shape after low-variance filter:', df.shape)
else:
    print('No numeric columns for variance filtering')

Low-variance removed: ['field', 'itid', 'moonesb', 'crpix1', 'crpix2']
Shape after low-variance filter: (62368, 37)
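
As a minimal sketch of how the variance filter behaves, the toy frame below (hypothetical columns `constant`, `tiny`, `varied`) is passed through `VarianceThreshold` with the same threshold of 1e-5:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [5.0] * 6,                          # variance 0
    "tiny": [1.0, 1.0, 1.0, 1.0, 1.0, 1.001],       # variance of about 1.4e-7
    "varied": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],       # variance of about 2.9
})

selector = VarianceThreshold(threshold=1e-5)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns.
mask = selector.get_support()
kept = toy.columns[mask].tolist()
removed = toy.columns[~mask].tolist()
print("kept:", kept, "removed:", removed)
```

Because the threshold is an absolute variance, its effect depends on the scale of each column, which is one reason the notebook works from the standardized dataset.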


<details>
<summary>📌 Cell Description: Removing Highly Correlated Features (r > 0.95)</summary>

This cell removes **highly correlated numeric features**, which are pairs of columns that carry almost the same information. When two features have an extremely strong correlation (absolute Pearson correlation above 0.95), one of them becomes redundant. Keeping both adds unnecessary dimensionality, can destabilize linear models, and may contribute to overfitting.

To fix this, the cell computes the absolute correlation matrix, identifies pairs with |r| > 0.95, and removes one feature from each pair. This reduces redundancy among the numeric features that remain.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Computes the absolute correlation matrix** for all numeric features.  
- Focuses on the **upper triangle** of the matrix to avoid duplicate comparisons.  
- Identifies columns with correlations above **0.95**, meaning they carry almost identical information.  
- **Removes redundant features**: a column is dropped when it is highly correlated with any column that appears earlier in the DataFrame.  
- Rebuilds the dataset using:
  - the remaining (non-redundant) numeric features  
  - all original categorical features  
- **Prints the number of features removed** and the updated dataset shape.  
- Prevents model issues such as:
  - multicollinearity  
  - overfitting  
  - unstable coefficient estimates (in linear models)  
  - inflated feature importance scores  

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Improves machine-learning model performance**  
   Models become more stable and accurate when correlated noise is removed.

2. **Reduces dimensionality efficiently**  
   Removes unnecessary features without losing meaningful information.

3. **Prevents multicollinearity**  
   Essential for linear models like Logistic Regression where correlated predictors cause instability.

4. **Improves interpretability**  
   A cleaner set of independent features is easier to analyze and explain scientifically.

5. **Scientifically meaningful**  
   Many astronomical features may be derived from the same measurements (e.g., brightness and SNR), so removing redundant features prevents duplication of similar astrophysical signals.

This step ensures the dataset contains a **compact, non-redundant set of features**, improving both the scientific clarity and the ML readiness of the dataset.

</details>


In [6]:
# 4) Correlation-based removal: remove one of each highly-correlated pair (r>0.95)
num_df = df.select_dtypes(include=['number']).copy()
corr_matrix = num_df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones_like(corr_matrix), k=1).astype(bool))
to_drop_corr = [column for column in upper.columns if any(upper[column] > 0.95)]
print('Correlation-based drop count:', len(to_drop_corr))
num_df = num_df.drop(columns=to_drop_corr)
df = pd.concat([num_df.reset_index(drop=True), df[cat_cols].reset_index(drop=True)], axis=1)
print('Shape after correlation pruning:', df.shape)

Correlation-based drop count: 18
Shape after correlation pruning: (62368, 19)
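
The upper-triangle rule depends on column order: whichever column of a correlated pair comes later is the one dropped. A small sketch on toy data shows this (the helper `high_corr_drop` is hypothetical and simply wraps the same logic as the cell above):

```python
import numpy as np
import pandas as pd


def high_corr_drop(frame, threshold=0.95):
    """Return columns highly correlated with any earlier column (upper-triangle rule)."""
    corr = frame.corr().abs()
    # Boolean mask selecting entries strictly above the diagonal.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],  # b = 2*a + 1, perfectly correlated with a
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],    # only moderately correlated with a and b
})

to_drop = high_corr_drop(toy)
to_drop_reordered = high_corr_drop(toy[["b", "a", "c"]])
print(to_drop, to_drop_reordered)
```

If a particular column of a pair should survive (for example, a coordinate over a derived quantity), placing it earlier in the DataFrame before pruning is one way to get that outcome under this rule.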


<details>
<summary>📌 Cell Description: Preparing Feature Matrix (X) and Target Labels (y), with Optional KMeans Proxy Labels</summary>

This cell prepares the dataset for machine-learning tasks by constructing:

- **X** → the feature matrix (all predictor variables)  
- **y** → the target labels (the values to be predicted)

In supervised learning, a label column must exist. However, astronomical datasets often come **without labeled classes**, since many objects are unlabeled or their physical types are unknown. To handle both labeled and unlabeled cases, the cell includes a fallback strategy:

- If a true label column exists (e.g., "class", "type"), it is used directly as **y**.  
- If no label is present, the cell **creates proxy labels** using **KMeans clustering**, grouping the data into 3 clusters based on feature similarity.

This allows downstream feature-selection, classification, and evaluation techniques to work even when the dataset has no ground-truth labels, an extremely common scenario in astronomy.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Searches for a target/label column** with common names such as:
  - label  
  - class  
  - target  
  - type  
- If found:
  - Uses it as **y** (the target variable).  
  - Converts categorical labels to numeric values using **LabelEncoder**.  
- If **no target exists**:
  - Prints a warning to indicate the dataset is unlabeled.  
  - Uses **KMeans clustering** to automatically group observations into 3 clusters.  
  - These cluster assignments act as **proxy labels (y)** for feature-selection experiments.  
- **X** is created by removing the target column (if one exists) or keeping all features otherwise.  
- Outputs the shapes of X and y to confirm correct construction.

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Supports both labeled and unlabeled datasets**  
   Many astronomical surveys (including ZTF) lack consistent object classifications. This cell provides a universal approach.

2. **Enables feature selection in unlabeled settings**  
   Proxy labels allow supervised feature-selection methods (Random Forest, Mutual Information, etc.) to operate even without real labels. The resulting scores then measure how well a feature separates the KMeans clusters, not how well it predicts a physical class.

3. **Provides scientifically meaningful structure**  
   KMeans clustering treats similar observations as belonging to the same group, which is useful when true labels are unknown.

4. **Ensures compatibility with downstream ML steps**  
   Models require numeric labels; this cell guarantees that **y is always numeric and valid**.

5. **Encourages exploration**  
   Proxy labels help identify natural patterns in the dataset before formal classification models are built.

This step bridges the gap between raw standardized data and machine-learning readiness, enabling both supervised and unsupervised analyses in astronomical contexts.

</details>


In [7]:
# 5) Prepare X, y. If no target exists, create KMeans cluster labels as proxy target
possible_targets = [c for c in df.columns if c.lower() in ['label','class','target','type']]
target_col = possible_targets[0] if possible_targets else None
if target_col and target_col in df.columns:
    y = df[target_col].copy()
    if y.dtype == 'object' or y.dtype.name == 'category':
        le = LabelEncoder()
        y = le.fit_transform(y.astype(str))
    X = df.drop(columns=[target_col])
    print('Using provided target column:', target_col)
else:
    print('No labeled target found; creating KMeans-based proxy labels')
    X = df.copy()
    X_num = X.select_dtypes(include=['number']).fillna(0)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_num)
    kmeans = KMeans(n_clusters=3, random_state=42)
    y = kmeans.fit_predict(X_scaled)
print('X shape, y length:', X.shape, len(y))

No labeled target found; creating KMeans-based proxy labels
X shape, y length: (62368, 19) 62368
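
The cell above fixes `n_clusters=3`. One way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs with hand-picked centers (an assumption for illustration; it says nothing about the right k for the ZTF data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data with three true groups (illustrative only).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    # Silhouette is in [-1, 1]; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On a dataset of this notebook's size (62,368 rows), the full pairwise computation is expensive; `silhouette_score` accepts a `sample_size` argument (with `random_state`) to score a random subset instead.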


<details>
<summary>📌 Cell Description: Preparing Target Labels (y) and Feature Matrix (X), Using True Labels or Creating Proxy Labels with KMeans</summary>

This cell prepares the input features (**X**) and the corresponding target labels (**y**) for machine-learning experiments. Many astronomical datasets either lack labeled targets or include labels with inconsistent formats. To allow both supervised and unsupervised workflows, this cell handles both possibilities:

1. **If a true target label exists** (e.g., "label", "class", "type"):  
   - It uses that column as **y**.  
   - Converts categorical labels to numeric form using **LabelEncoder**.  

2. **If no target column exists**:  
   - The cell **automatically creates proxy labels** using **KMeans clustering**.  
   - These cluster labels approximate natural groupings in the dataset.  
   - This enables experimentation with ML pipelines even without manually annotated data.

This flexible approach allows the researcher to perform classification-like tasks even in unlabeled datasets, which are very common in astronomy.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Searches for a real target column** using common names such as:
  - "label", "class", "target", "type"  
- If found:
  - Extracts it as **y**  
  - Encodes text categories into numeric labels for ML compatibility  
  - Defines **X** as all remaining features  
- If *no* label is found:
  - Prints a message indicating no target exists  
  - Creates **unsupervised cluster labels** using KMeans (3 clusters)  
  - Normalizes numeric data with StandardScaler before clustering  
  - Uses the resulting cluster index as a **proxy label y**  
- Finally prints the shapes of **X** and **y**, confirming readiness for further steps.

---

### ⭐ **Why This Cell Is Important for the Research**

1. **Provides flexibility for labeled and unlabeled datasets**  
   Astronomical datasets often lack manually labeled classes. This method ensures ML can still proceed.

2. **Enables supervised learning experiments even without true labels**  
   KMeans proxy labels allow baseline model evaluation, feature selection, and representation learning.

3. **Supports contrastive and semi-supervised workflows**  
   Proxy labels are particularly valuable for:
   - pretraining  
   - representation evaluation  
   - clustering validity checks  

4. **Ensures ML models receive properly formatted data**  
   - Categorical labels are encoded  
   - Numeric features are scaled for clustering  

5. **Maintains scientific integrity**  
   The clustering method is applied only to standardized numeric features, preserving the statistical structure of the data.

Overall, this cell bridges the gap between raw astronomical data and practical machine-learning workflows by ensuring that both labeled and unlabeled datasets are usable for model development.

</details>


In [8]:
# 6) Supervised/Proxy selection methods
# 6a) SelectKBest with mutual_info_classif (works with discrete y)
num_cols = X.select_dtypes(include=['number']).columns.tolist()
# k is capped at the number of numeric columns, so with 20 or fewer columns every feature is selected
k = min(20, max(1, len(num_cols)))
print('Running SelectKBest mutual_info (k=', k, ') on numeric features')
skb_selected = []
if len(num_cols) > 0:
    skb = SelectKBest(score_func=mutual_info_classif, k=k)
    X_num = X[num_cols].fillna(0)
    try:
        skb.fit(X_num, y)
        skb_selected = [f for f, s in zip(num_cols, skb.get_support()) if s]
        print('SelectKBest selected:', skb_selected)
    except Exception as e:
        print('SelectKBest failed:', e)
else:
    print('No numeric features for SelectKBest')

# 6b) RandomForest feature importance
rf_selected = []
try:
    if len(num_cols) > 0:
        rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
        rf.fit(X_num, y)
        importances = pd.Series(rf.feature_importances_, index=num_cols).sort_values(ascending=False)
        rf_selected = importances.head(k).index.tolist()
        print('RandomForest top features:', rf_selected)
    else:
        print('No numeric features for RandomForest')
except Exception as e:
    print('RandomForest failed:', e)

# 6c) L1-based selection (LogisticRegression with L1) - only for classification targets
l1_selected = []
try:
    if len(num_cols) > 0:
        lr = LogisticRegression(penalty='l1', solver='saga', max_iter=5000, random_state=42)
        lr.fit(X_num, y)
        coef = np.abs(lr.coef_).sum(axis=0) if lr.coef_.ndim > 1 else np.abs(lr.coef_)
        coef_series = pd.Series(coef, index=num_cols).sort_values(ascending=False)
        l1_selected = coef_series[coef_series > 1e-6].index.tolist()
        print('L1-selected features (non-zero):', l1_selected[:k])
    else:
        print('No numeric features for L1 selection')
except Exception as e:
    print('L1 selection failed:', e)

# 6d) PCA loadings: features with largest absolute loadings on first components
from sklearn.decomposition import PCA
pca_selected = []
try:
    if len(num_cols) > 0:
        pca = PCA(n_components=min(6, len(num_cols)))
        Xp = pca.fit_transform(X_num)
        loadings = np.abs(pca.components_).sum(axis=0)
        loadings_series = pd.Series(loadings, index=num_cols).sort_values(ascending=False)
        pca_selected = loadings_series.head(k).index.tolist()
        print('PCA top features:', pca_selected)
    else:
        print('No numeric features for PCA')
except Exception as e:
    print('PCA failed:', e)

# Consolidate selections into a ranking count
from collections import Counter
all_methods = [tuple(skb_selected), tuple(rf_selected), tuple(l1_selected), tuple(pca_selected)]
flat = [f for method in all_methods for f in method]
counts = Counter(flat)
ranked = [f for f, _ in counts.most_common()]
print('Ranked features by method votes (top 30):', ranked[:30])

Running SelectKBest mutual_info (k= 14 ) on numeric features
SelectKBest selected: ['ra', 'dec', 'infobits', 'qid', 'fid', 'pid', 'exptime', 'seeing', 'airmass', 'moonillf', 'maglimit', 'cd11', 'cd22', 'ipac_gid']
RandomForest top features: ['ipac_gid', 'cd11', 'cd22', 'pid', 'fid', 'airmass', 'maglimit', 'ra', 'moonillf', 'dec', 'seeing', 'exptime', 'infobits', 'qid']
L1-selected features (non-zero): []
PCA top features: ['infobits', 'qid', 'exptime', 'cd22', 'airmass', 'pid', 'maglimit', 'moonillf', 'cd11', 'seeing', 'fid', 'ipac_gid', 'ra', 'dec']
Ranked features by method votes (top 30): ['ra', 'dec', 'infobits', 'qid', 'fid', 'pid', 'exptime', 'seeing', 'airmass', 'moonillf', 'maglimit', 'cd11', 'cd22', 'ipac_gid']
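
The vote-counting step can be sketched in isolation. The per-method picks below are made up for illustration (the dictionary `method_picks` is hypothetical); the consolidation itself uses `collections.Counter` as in the cell above:

```python
from collections import Counter

# Hypothetical per-method selections, each a list of feature names.
method_picks = {
    "mutual_info": ["ra", "dec", "seeing"],
    "random_forest": ["seeing", "airmass", "ra"],
    "l1_logistic": [],
    "pca": ["seeing", "exptime"],
}

# One vote per method that selected the feature.
votes = Counter(f for picks in method_picks.values() for f in picks)

# Features ordered from most to fewest votes.
ranked = [f for f, _ in votes.most_common()]
print(votes)
print(ranked)
```

In the run recorded above, k equals the number of numeric columns (14), so SelectKBest, Random Forest, and PCA each return all 14 features and the vote counts do not separate them. Choosing a k smaller than the number of columns would make the consensus more informative.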


<details>
<summary>📌 Cell Description: Domain-Aware Final Feature Selection (Astrophysics + ML Consensus)</summary>

This cell performs the final and most important stage of feature selection by **combining machine-learning consensus rules with astrophysics domain knowledge**.  
Instead of relying only on statistical filters, we ensure that all features scientifically important for astronomical behavior (RA, Dec, flux, magnitude, SNR, airmass, filter band, etc.) are forcibly retained if present.

---

## ⭐ What This Cell Does (Attractive Point-Wise Description)

### **1️⃣ Detect and preserve astrophysically meaningful features**
It scans all column names to find domain-critical features such as:
- **RA, Dec** → celestial coordinates  
- **flux, mag** → brightness measurements  
- **SNR (Signal-to-Noise Ratio)**  
- **seeing, airmass** → observational conditions  
- **filter/band** → photometric channel  
- **maglimit** → limiting magnitude of observation  

These are added to a **priority list** and are kept regardless of how many methods voted for them, subject only to the 30-feature cap.

---

### **2️⃣ Build a unified selection strategy using ML + Domain Rules**
This operates on the columns that survived variance filtering and correlation pruning, and counts votes from the selection methods run in the previous cell:
- SelectKBest (MI)  
- Random Forest importance  
- L1-regularized Logistic Regression  
- PCA ranking  

A feature is kept if:

✔ Selected by **≥2 methods** (ML consensus)  
**OR**  
✔ It is an **astronomy-priority feature**

This ensures the final feature set is both **predictive** and **scientifically meaningful**.

---

### **3️⃣ Handles edge cases safely**
If no feature satisfies the rule (rare case), it falls back to:
- Top Random Forest features (up to 20)  

This guarantees the model always receives a usable feature set.

---

### **4️⃣ Combine, reorder, and limit final list**
The final selected features are:
- Ordered according to earlier ranking  
- Completed with additional priority features if missing  
- **Capped at 30 features** to prevent overfitting and maintain efficiency  

---

### **5️⃣ Print the final list for downstream modeling**
This list is then used in:
- Supervised learning  
- Contrastive learning  
- Clustering  
- Representation evaluation  

---

## 🎯 Why This Step Is Scientifically Strong

- ✨ **ML alone cannot judge scientific importance**  
  Example: RA/Dec might receive few votes from the selection methods but are essential to preserve.

- ✨ **Domain knowledge prevents accidental loss of astrophysical meaning**  
  Flux + magnitude + SNR are fundamental for transient classification.

- ✨ **Hybrid selection supports generalization**  
  Consensus among multiple selection techniques reduces noise features.

- ✨ **Feature cap limits the curse of dimensionality**  
  This matters most for distance- and kernel-based models such as KMeans, SVMs, or contrastive encoders.

---

## 📌 Final Statement (for your thesis/report)
You may include:

> "To ensure scientifically grounded feature selection, we first applied variance filtering and correlation pruning, then combined machine-learning consensus (mutual information, Random Forest importance, L1-regularized logistic regression, and PCA loadings) with astronomy domain knowledge. Any remaining feature selected by at least two ML methods or identified as astrophysically essential (RA, Dec, flux, magnitude, SNR, airmass, filters, etc.) was preserved. The final list was capped at 30 features to balance representational power and model complexity."

</details>


In [9]:
# 7) Apply astronomy domain knowledge: ensure astrophysical features are kept if present
priority = [c for c in df.columns if any(k in c.lower() for k in ['ra','dec','flux','mag','snr','seeing','airmass','maglimit','filter','band'])]
print('Priority features to preserve (if present):', priority)
# Final selection strategy: take features selected by at least two methods OR in priority list. Limit to 30 features max.
selected_set = set()
for f, cnt in counts.items():
    if cnt >= 2:
        selected_set.add(f)
# add priority features
for p in priority:
    if p in df.columns:
        selected_set.add(p)
# If selection is empty (edge case), fall back to the top Random Forest features
if len(selected_set) == 0:
    selected_set.update(rf_selected[:min(20, len(rf_selected))])
selected_list = [f for f in ranked if f in selected_set]
# append any priority features not in ranked at the end
for p in priority:
    if p in df.columns and p not in selected_list:
        selected_list.append(p)
# limit to 30
selected_list = selected_list[:30]
print('Final selected features (count={}):'.format(len(selected_list)), selected_list)

Priority features to preserve (if present): ['ra', 'dec', 'seeing', 'airmass', 'maglimit', 'filtercode']
Final selected features (count=15): ['ra', 'dec', 'infobits', 'qid', 'fid', 'pid', 'exptime', 'seeing', 'airmass', 'moonillf', 'maglimit', 'cd11', 'cd22', 'ipac_gid', 'filtercode']


<details>
<summary>üìå Cell Description: Saving the Final Selected Features (Clean Export for Modeling)</summary>

This cell finalizes the feature-selection pipeline by **exporting only the best and most scientifically meaningful features** into clean, reusable files.  
It ensures that the dataset handed to machine learning models contains only high-quality inputs that support accurate astronomical predictions.

---

## ⭐ What This Cell Does (Attractive Point-Wise Description)

### **1️⃣ Starts with the final refined feature list**  
The cell takes the previously selected features (`selected_list`) and prepares them for export.  
This includes:
- ML-selected features  
- Domain-preserved astronomy features  
- Filtered, deduplicated, and ranked attributes  

This stage ensures only strong, validated inputs are kept.

---

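### **2️⃣ Removes irrelevant or technical columns**  
Two columns, `pid` and `filtercode`, are dropped from the selected list before export:
- `pid` is an identifier rather than a measurement, so a model could only memorize it.  
- `filtercode` matched the `filter` keyword in the priority step, but it is a text code, and the filter information also appears to be carried by the numeric `fid` column, which is kept.  

Removing them avoids feeding identifier noise and duplicated information to the model.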
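
Before dropping a column as redundant, it can help to confirm that it maps one-to-one onto a column that is kept. The sketch below is a hedged illustration in plain Python: `is_one_to_one` is a hypothetical helper, and the sample values are invented stand-ins for what `df['fid']` and `df['filtercode']` might hold.

```python
def is_one_to_one(a, b):
    """Return True if each value in a pairs with exactly one value in b, and vice versa."""
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # setdefault stores the first partner seen; any later mismatch breaks the mapping
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True

# Hypothetical sample values; in the notebook these would come from
# df['fid'] and df['filtercode']
fid = [2, 2, 1, 3, 1]
filtercode = ["zr", "zr", "zg", "zi", "zg"]

redundant = is_one_to_one(fid, filtercode)
print("filtercode is redundant with fid:", redundant)
```

If the check returns `False` on the real data, the two columns carry different information and dropping one would lose something.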

---

### **3️⃣ Adds the target/label column when available**  
If a target column (such as *class*, *type*, *label*) exists, it is prepended to the feature list.  
This is essential because:
- ML models need access to the label for training  
- Keeping it ensures correct dataset structure  

This step ensures smooth downstream learning.

---

### **4️⃣ Validates that all selected columns actually exist in the DataFrame**  
Some features may have been removed earlier due to:
- missing values  
- cleaning steps  
- correlation pruning  

This validation prevents errors and ensures the final dataset is usable and consistent.
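
The cell filters out missing columns silently. A small variation, sketched here under the assumption that knowing which names were dropped is useful, reports them instead. `split_existing` and the sample column names are hypothetical.

```python
def split_existing(requested, available):
    """Split requested column names into those present and those missing."""
    available = set(available)
    present = [c for c in requested if c in available]
    missing = [c for c in requested if c not in available]
    return present, missing

# Hypothetical names; in the notebook `available` would be df.columns
requested = ["ra", "dec", "snr", "seeing"]
available = ["ra", "dec", "seeing", "airmass"]

present, missing = split_existing(requested, available)
if missing:
    print("Dropped (not in DataFrame):", missing)
print("Kept:", present)
```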

---

### **5️⃣ Blocks accidental empty selections**  
If the list is empty after cleaning, the cell stops by raising a `RuntimeError`.  
This safety guard keeps the workflow from continuing with an invalid dataset.

---

### **6️⃣ Saves the final curated dataset to CSV**  
A compact dataset is written to:

📄 **`ztf_selected_features.csv`**

This file contains:
- only the final scientifically-validated features  
- plus the target column (if present)

This clean dataset is ready for:
- machine learning  
- deep learning  
- visual analysis  
- contrastive learning experiments  
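
One way to confirm the export is to read the header row back and compare it with the expected column list. The sketch below uses only the standard `csv` module and writes a stand-in file to a temporary directory; `csv_header` and the sample row are hypothetical, and in the notebook the expected list would be `keep_cols`.

```python
import csv
import os
import tempfile

def csv_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh))

expected = ["ra", "dec", "seeing"]  # stand-in for keep_cols

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ztf_selected_features.csv")
    # Write a tiny stand-in file shaped like the real export
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(expected)
        writer.writerow([-1.149415, 1.53096, -0.116146])
    header = csv_header(path)

header_ok = header == expected
print("Header matches expected columns:", header_ok)
```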

---

### **7️⃣ Writes a simple text file listing the selected features**  
A second file is created:

📄 **`selected_feature_list.txt`**

It contains:
- one feature name per line  
- no target column (to avoid confusion)  

This is helpful for:
- documentation  
- replication by other researchers  
- explaining feature importance in viva or thesis  
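
Because the file holds one feature name per line, a later notebook can reload it with a few lines of Python. This is a minimal sketch: `read_feature_list` is a hypothetical helper, and the file written here is a stand-in created in a temporary directory.

```python
import os
import tempfile

def read_feature_list(path):
    """Read one feature name per line, skipping blank lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "selected_feature_list.txt")
    # Stand-in for the file written by the export cell
    with open(path, "w") as fh:
        fh.write("ra\ndec\nseeing\n")
    features = read_feature_list(path)

print(features)
```

The reloaded list can then be used to select the same columns from a DataFrame in a later session.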

---

### **8️⃣ Shows a preview for verification**  
A short printout allows you to quickly inspect:
- the saved columns  
- the appearance of the exported dataset  

This acts as a final confirmation step.

---

## 🎯 Why This Step Is Important in Data Science + Astronomy

- ✔ Ensures the dataset for modeling is **clean and compact**  
- ✔ Removes unnecessary features, which shortens training and can reduce overfitting  
- ✔ Preserves astronomy-required attributes, protecting scientific meaning  
- ✔ Produces reusable files to keep the research workflow organized  
- ✔ Makes it easy to share or re-run experiments consistently  

This step guarantees that the final dataset reflects both **scientific understanding** and **data-science best practices**.

---

## 📌 Final Statement (for thesis/report)

> “The final selected features were exported into a compact dataset (`ztf_selected_features.csv`) and accompanied by a feature list file. All irrelevant identifiers were removed, and the target variable was preserved when present. This ensured a clean, model-ready dataset aligned with both machine-learning principles and astrophysical interpretability.”

</details>


In [10]:
# 8) Save selected features to CSV and a feature list text file
out_csv = 'ztf_selected_features.csv'
out_list = 'selected_feature_list.txt'

# Columns excluded from the export
cols_to_remove = ['pid', 'filtercode']

# Keep selected features that are not excluded, are not the target, and exist in df
feature_cols = [c for c in selected_list
                if c not in cols_to_remove and c != target_col and c in df.columns]

# Check the features before adding the target, so a lone target column
# cannot hide an empty feature list
if len(feature_cols) == 0:
    raise RuntimeError('No features selected; check earlier steps')

# Put the target first if present
if target_col and target_col in df.columns:
    keep_cols = [target_col] + feature_cols
else:
    keep_cols = feature_cols

# Save CSV with only selected features (and target if present)
df[keep_cols].to_csv(out_csv, index=False)

# Save feature list (without target), one name per line
with open(out_list, 'w') as fh:
    for c in feature_cols:
        fh.write(c + '\n')

print('Saved selected features CSV ->', out_csv)
print('Saved feature list ->', out_list)
print('Example preview:')
df[keep_cols].head()


Saved selected features CSV -> ztf_selected_features.csv
Saved feature list -> selected_feature_list.txt
Example preview:


| | ra | dec | infobits | qid | fid | exptime | seeing | airmass | moonillf | maglimit | cd11 | cd22 | ipac_gid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.149415 | 1.53096 | 3.271612 | 1 | 2 | 30 | -0.116146 | -0.220519 | 0.19519 | -1.552305 | 1.141877 | 0.455582 | 2 |
| 1 | -1.54184 | 1.124622 | -0.292168 | 3 | 2 | 30 | 2.021751 | -0.192858 | -0.135013 | 0.080815 | 1.205201 | 1.795089 | 2 |
| 2 | -1.550309 | 1.524426 | -0.292168 | 2 | 2 | 30 | -0.320594 | -0.712885 | -0.093118 | 0.428065 | 2.258968 | 1.853579 | 2 |
| 3 | -1.542016 | 0.238707 | -0.292168 | 3 | 2 | 30 | -0.971814 | -0.718418 | -1.855228 | -0.023688 | 0.736894 | 1.647814 | 3 |
| 4 | -1.541997 | 0.237082 | -0.292168 | 3 | 1 | 30 | 2.016056 | -0.707353 | -2.049663 | -2.488164 | 0.626096 | 1.476595 | 1 |
