# Aim

The aim of the following permutation analysis is to find which features are the most important for predicting the probability of someone being admitted.

The main assumption is that not all features contribute equally to the final result. Performing a permutation analysis (see [Permutation Analysis](perm_analysis.ipynb#permutation-analysis)) beforehand sheds light on the most important ones, which lets us reduce the number of features and, consequently, the complexity of the neural network.

In [74]:
# import libraries
from sklearn.preprocessing import PowerTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [75]:
dtypes_dict = {
    'Serial No.' : int,
    'GRE Score' : int,
    'TOEFL Score' : int,
    'University Rating' : int,
    'SOP' : float,
    'LOR' : float,
    'CGPA' : float,
    'Research' : int,
    'Chance of Admit' : float
}
admissions_data = pd.read_csv("admissions_data.csv", encoding='utf-8', dtype=dtypes_dict)

# "Research" seems to be categorical
admissions_data["Research"] = admissions_data["Research"].astype('category')
admissions_data.dtypes

Serial No.              int64
GRE Score               int64
TOEFL Score             int64
University Rating       int64
SOP                   float64
LOR                   float64
CGPA                  float64
Research             category
Chance of Admit       float64
dtype: object

In [76]:
admissions_data.sample(5)

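| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 329 | 330 | 297 | 96 | 2 | 2.5 | 1.5 | 7.89 | 0 | 0.43 |
| 471 | 472 | 311 | 103 | 3 | 2.0 | 4.0 | 8.09 | 0 | 0.64 |
| 467 | 468 | 318 | 101 | 5 | 3.5 | 5.0 | 8.78 | 1 | 0.78 |
| 411 | 412 | 313 | 94 | 2 | 2.5 | 1.5 | 8.13 | 0 | 0.56 |
| 487 | 488 | 327 | 115 | 4 | 3.5 | 4.0 | 9.14 | 0 | 0.79 |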


# Permutation Analysis
<a id='#permutation-analysis'></a>
Permutation analysis is a model inspection technique that measures the contribution of each feature to a fitted model's statistical performance on a given tabular dataset.
This technique is particularly useful for non-linear or opaque estimators (i.e., objects that manage the estimation and decoding of a model). It involves randomly shuffling the values of a single feature and observing the resulting degradation of the model's score. The intuition is that breaking the relationship between the feature and the target variable reveals how much the model relies on that feature.

Permutation importance measures how much the model performance decreases when the values of a feature are randomly shuffled. A larger decrease in performance indicates that the feature is more important for the model.
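
This definition can be written compactly, following the scikit-learn user guide listed under Further Reading. The symbols $s$, $s_{k,j}$, and $K$ are notation introduced here for illustration and do not appear elsewhere in this notebook:

$$
i_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}
$$

Here $s$ is the score of the fitted model on the unshuffled data, $s_{k,j}$ is the score after the $k$-th random shuffle of feature $j$, and $K$ is the number of shuffles (the `n_repeats` argument). The standard deviation reported alongside each importance is the spread of $s - s_{k,j}$ over the $K$ shuffles.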

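Permutation importance is model-agnostic, meaning it can be applied to any type of predictive model. However, the scores need more careful interpretation when features are correlated or interact: shuffling one feature leaves its correlated partners intact, so the model can still recover part of the information and that feature's importance may be understated.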
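
To make the shuffling procedure concrete, here is a minimal, dependency-free sketch of the idea. It is not scikit-learn's implementation: the helper names (`r2_score`, `permutation_importance_sketch`), the toy data, and `toy_model` are all hypothetical and exist only for illustration. The toy "model" uses column 0 and ignores column 1, so shuffling column 1 should leave the score untouched.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination (R^2) of the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, n_repeats=20, seed=0):
    """Return (mean, std) of the R^2 drop for each column when it is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    results = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; every other column stays in place.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        mean = sum(drops) / n_repeats
        std = (sum((d - mean) ** 2 for d in drops) / n_repeats) ** 0.5
        results.append((mean, std))
    return results

# Hypothetical toy data: the target depends on column 0 only.
data_rng = random.Random(1)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [3.0 * row[0] for row in X_toy]

def toy_model(row):
    # Stand-in for a fitted estimator: uses column 0, ignores column 1.
    return 3.0 * row[0]

toy_importances = permutation_importance_sketch(toy_model, X_toy, y_toy)
for j, (mean, std) in enumerate(toy_importances):
    print(f"feature {j}: {mean:.3f} +/- {std:.3f}")
```

The ignored column gets an importance of exactly zero because its values never reach the prediction, while the used column shows a large drop. Any callable that maps a row to a prediction could be passed as `predict`, which is what makes the technique model-agnostic.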

In [81]:
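# assumption: every column except the row identifier and the target is a feature
# (this matches the seven features listed in the importance table)
features = admissions_data.drop(columns=['Serial No.', 'Chance of Admit'])
target = admissions_data['Chance of Admit']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)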

# transform the feature columns with a Yeo-Johnson power transform (the default):
# it makes the features more Gaussian-like, stabilises variance, and standardises them
scaler = PowerTransformer()
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

# fit Random Forest model on the training data
model = RandomForestRegressor(random_state=42)
model.fit(X_train_normalized, y_train)

# calculate permutation feature importance
# increase number of repeats for more robust results (1000 is more than enough)
result = permutation_importance(model, X_test_normalized, y_test, n_repeats=1000, random_state=42)

importance_df = pd.DataFrame({
    'Feature': features.columns,
    'Importance': result.importances_mean,
    'Importance_std' : result.importances_std
}).sort_values(by='Importance', ascending=False)

importance_df

| | Feature | Importance | Importance_std |
|---|---|---|---|
| 5 | CGPA | 0.806137 | 0.111436 |
| 0 | GRE Score | 0.03315 | 0.021972 |
| 3 | SOP | 0.02093 | 0.011529 |
| 6 | Research | 0.014922 | 0.011625 |
| 4 | LOR | 0.008711 | 0.008034 |
| 2 | University Rating | 0.007895 | 0.004898 |
| 1 | TOEFL Score | 0.005063 | 0.008558 |


In [78]:
# print each feature with its mean importance and standard deviation
for row in importance_df.itertuples():
    print(f"{row.Feature:<20} {row.Importance:.3f} +/- {row.Importance_std:.3f}")

CGPA                 0.806 +/- 0.111
GRE Score            0.033 +/- 0.022
SOP                  0.021 +/- 0.012
Research             0.015 +/- 0.012
LOR                  0.009 +/- 0.008
University Rating    0.008 +/- 0.005
TOEFL Score          0.005 +/- 0.009


# Conclusions
The permutation analysis ranks the features by how much the model's test score drops when each one is shuffled. Because `permutation_importance` falls back on the estimator's default scorer, which is R² for a `RandomForestRegressor`, each importance is a mean decrease in R². CGPA (Cumulative Grade Point Average) is by far the most influential predictor, with a mean importance of 0.806 and a standard deviation of 0.111. GRE Score comes second but at a much lower level (0.033 ± 0.022), followed by SOP (Statement of Purpose, 0.021 ± 0.012), Research experience (0.015 ± 0.012), and LOR (Letter of Recommendation, 0.009 ± 0.008). University Rating (0.008 ± 0.005) and TOEFL Score (0.005 ± 0.009) have the smallest effect. CGPA is the only feature whose mean importance exceeds zero by more than two standard deviations, so the ranking among the remaining features should be read with caution.

The results above show that `CGPA` is a very important feature: the model relies heavily on it to make predictions. As explained earlier, this means that when the permutation broke the relationship between `CGPA` and the target, the model's performance dropped sharply.
For the neural network we will therefore keep only the four highest-ranked features. We could even keep `CGPA` alone, since it is by far the most important feature, but for the sake of variety we use `CGPA`, `GRE Score`, `SOP`, and `Research`.
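
The selection can also be expressed in a few lines of code. The sketch below copies the means and standard deviations from the importance table above and applies two screens: a "mean minus two standard deviations above zero" heuristic and a plain top-k cut. The two-sigma threshold is an assumption borrowed from a common rule of thumb, not something this analysis prescribes, and the helper names `clearly_nonzero` and `top_k` are hypothetical.

```python
# (mean, std) of the permutation importances, copied from the table above.
reported = {
    "CGPA": (0.806137, 0.111436),
    "GRE Score": (0.03315, 0.021972),
    "SOP": (0.02093, 0.011529),
    "Research": (0.014922, 0.011625),
    "LOR": (0.008711, 0.008034),
    "University Rating": (0.007895, 0.004898),
    "TOEFL Score": (0.005063, 0.008558),
}

def clearly_nonzero(importances, n_std=2.0):
    """Features whose mean importance stays above zero after subtracting n_std stds."""
    return [
        name
        for name, (mean, std) in importances.items()
        if mean - n_std * std > 0
    ]

def top_k(importances, k):
    """The k features with the highest mean importance, best first."""
    ranked = sorted(importances, key=lambda name: importances[name][0], reverse=True)
    return ranked[:k]

robust_features = clearly_nonzero(reported)   # assumption: two-sigma heuristic
selected_features = top_k(reported, 4)        # the four features kept for the network

print(robust_features)
print(selected_features)
```

With these numbers the two-sigma screen keeps only `CGPA`, while the top-4 cut reproduces the list chosen for the neural network. The stricter screen is one way to justify the remark that the model could be reduced to `CGPA` alone.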

# Further Reading
> https://scikit-learn.org/stable/modules/permutation_importance.html

> https://scikit-learn.org/stable/modules/preprocessing.html#yeo-johnson-transform

> https://academic.oup.com/jrsssb/article-abstract/26/2/211/7028064?redirectedFrom=fulltext&login=false

> https://www.jwilber.me/permutationtest/ 