# Feature Importance Analysis

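In this notebook, we analyze feature importances from a Random Forest classifier trained on gene expression data to predict the `status` label. The features include normalized expression statistics (sum, mean, variance), their windowed averages (`window10_*`, `window100_*`), and boolean chromosome indicators (`chromosome_chr*`). Importances are averaged over 5-fold cross-validation.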

In [None]:
# Load the combined DataFrame from its pickle file
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

df = pd.read_pickle('../combined_data.pkl')

display(df.head())

# Split into features (X) and target (y)
X = df.drop('status', axis=1)
y = df['status']

# 5-fold cross-validation; fixed seeds keep the run reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(random_state=42)

# One importance vector per fold
feature_importances = []

for train_index, _ in kf.split(X):
    # Only the training part of each fold is used; no validation score is computed here
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]

    rf.fit(X_train, y_train)
    feature_importances.append(rf.feature_importances_)

# Mean and standard deviation of each feature's importance across the folds
mean_importances = np.mean(feature_importances, axis=0)
std_importances = np.std(feature_importances, axis=0)

Unnamed: 0,status,sum_gene_expr_normalized,mean_gene_expr_normalized,variance_gene_expr_normalized,window100_sum_gene_expr_avg,window100_mean_gene_expr_avg,window100_variance_gene_expr_avg,window10_sum_gene_expr_avg,window10_mean_gene_expr_avg,window10_variance_gene_expr_avg,...,chromosome_chr22,chromosome_chr3,chromosome_chr4,chromosome_chr5,chromosome_chr6,chromosome_chr7,chromosome_chr8,chromosome_chr9,chromosome_chrX,chromosome_chrY
0,amplified,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
1,amplified,0.006066,4.167977e-07,1.770619e-10,0.003042,2.083989e-07,8.853096e-11,0.003042,2.083989e-07,8.853096e-11,...,False,False,False,False,False,False,False,False,False,False
2,amplified,0.0,0.0,0.0,0.002028,1.389326e-07,5.902064e-11,0.002028,1.389326e-07,5.902064e-11,...,False,False,False,False,False,False,False,False,False,False
3,amplified,0.002203,1.510508e-07,3.707889e-11,0.002072,1.419621e-07,5.35352e-11,0.002072,1.419621e-07,5.35352e-11,...,False,False,False,False,False,False,False,False,False,False
4,amplified,0.0,0.0,0.0,0.001658,1.135697e-07,4.282816e-11,0.001658,1.135697e-07,4.282816e-11,...,False,False,False,False,False,False,False,False,False,False


In [7]:
#show the importance of each feature
importances_df = pd.DataFrame({'feature': X.columns, 'importance': mean_importances, 'std': std_importances})
importances_df = importances_df.sort_values('importance', ascending=False)
display(importances_df)

Unnamed: 0,feature,importance,std
3,window100_sum_gene_expr_avg,0.186346,0.000606
4,window100_mean_gene_expr_avg,0.177921,0.000594
6,window10_sum_gene_expr_avg,0.129314,8e-05
7,window10_mean_gene_expr_avg,0.106499,0.0003
9,dev_gene_expr_normalized_mean,0.102354,0.000107
0,sum_gene_expr_normalized,0.075541,0.00018
10,dev_mean_gene_expr_normalized_mean,0.053614,0.000218
1,mean_gene_expr_normalized,0.04875,0.000188
31,chromosome_chr7,0.030213,4.5e-05
34,chromosome_chrX,0.01757,0.000107


# Analysis of Feature Importance in Random Forest

The table above ranks features by their mean importance from the Random Forest model. The key columns are interpreted below:

## Feature Importance Scores
- The `importance` column is the impurity-based importance (`feature_importances_`) of each feature, averaged over the five cross-validation folds. Within each fold the scores are normalized to sum to 1.
  - **Higher values** indicate features that account for a larger share of the impurity reduction across the forest's splits.
  - `window100_sum_gene_expr_avg` has the highest importance (0.186346), closely followed by `window100_mean_gene_expr_avg` (0.177921).
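
Impurity-based scores are computed from the training data, so it can help to compare them with a measure taken on held-out data. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the notebook's dataset). It checks that `feature_importances_` sums to 1 and computes permutation importances with `sklearn.inspection.permutation_importance`; the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: derived from the training data, normalized to sum to 1
impurity_imp = model.feature_importances_
print("impurity-based:", np.round(impurity_imp, 3), "sum =", round(impurity_imp.sum(), 6))

# Permutation importances: mean drop in held-out accuracy when one column is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
print("permutation:   ", np.round(perm.importances_mean, 3))
```

If the two rankings disagree strongly on the real data, the impurity-based ranking probably deserves a closer look before drawing conclusions from it.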

## Standard Deviation (std)
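- The `std` column is the standard deviation of each feature's importance across the five cross-validation folds (`np.std` over the per-fold `feature_importances_` vectors), not across the individual trees of a forest.
  - **Lower `std` values** indicate that the importance is stable from fold to fold. Every `std` in the table is below 0.001, so the scores barely change between folds.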
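
In the training cell, `std_importances` comes from `np.std` over one importance vector per fold, so it measures fold-to-fold variation. The sketch below contrasts that with the spread across individual trees, which is read from `estimators_` of a single fitted forest. It runs on synthetic data (an assumption, not the notebook's dataset), and `X_demo`, `y_demo`, `per_fold`, and `per_tree` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for the notebook's data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Spread across folds: one forest per fold, one importance vector per forest
kf = KFold(n_splits=5, shuffle=True, random_state=42)
per_fold = []
for train_idx, _ in kf.split(X_demo):
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    per_fold.append(model.feature_importances_)
std_across_folds = np.std(per_fold, axis=0)

# Spread across trees: one importance vector per tree of the last fitted forest
per_tree = np.array([tree.feature_importances_ for tree in model.estimators_])
std_across_trees = per_tree.std(axis=0)

print("std across folds:", np.round(std_across_folds, 4))
print("std across trees:", np.round(std_across_trees, 4))
```

The two quantities answer different questions: the first is about sensitivity to the training split, the second about disagreement among trees inside one model.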

## Key Insights
1. **Top Features**:
   - The windowed expression averages dominate: the four `window100_*` and `window10_*` features in the table account for roughly 0.60 of the total importance, led by `window100_sum_gene_expr_avg` and `window100_mean_gene_expr_avg`.

2. **Chromosomal Features**:
   - Chromosome indicators rank lower. The two highest, `chromosome_chr7` (0.030213) and `chromosome_chrX` (0.01757), each contribute a few percent at most.

3. **Low-Importance Features**:
   - Features such as `chromosome_chr8` and `chromosome_chr15` fall outside the ten rows shown and have very low importance values, suggesting they contribute minimally to the predictions.
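
To make these comparisons concrete, the ten importances listed in the table can be summed by feature family. The sketch below copies the values from the table. The grouping by name prefix and the family labels (`window100`, `window10`, `dev`, `chromosome`, `unwindowed`) are assumptions made for illustration, not categories defined in the notebook.

```python
# Mean importances copied from the ten rows of the table above
top10 = {
    "window100_sum_gene_expr_avg": 0.186346,
    "window100_mean_gene_expr_avg": 0.177921,
    "window10_sum_gene_expr_avg": 0.129314,
    "window10_mean_gene_expr_avg": 0.106499,
    "dev_gene_expr_normalized_mean": 0.102354,
    "sum_gene_expr_normalized": 0.075541,
    "dev_mean_gene_expr_normalized_mean": 0.053614,
    "mean_gene_expr_normalized": 0.04875,
    "chromosome_chr7": 0.030213,
    "chromosome_chrX": 0.01757,
}


def family(name):
    """Assign a feature to a family by its name prefix (illustrative grouping)."""
    for prefix in ("window100_", "window10_", "dev_", "chromosome_"):
        if name.startswith(prefix):
            return prefix.rstrip("_")
    return "unwindowed"


# Sum the importances within each family
totals = {}
for name, value in top10.items():
    totals[family(name)] = totals.get(family(name), 0.0) + value

for fam, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fam:<12}{total:.6f}")
print(f"{'listed total':<12}{sum(totals.values()):.6f}")
```

The listed rows sum to about 0.928. Because each fold's importances sum to 1, all features outside these ten rows share the remaining 0.072 or so.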
