**In this example, we will use permutation importance to extract insights from the features.**

Permutation importance is calculated as follows:

1.   Get a trained model
2.   Shuffle the values in a single column and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
3.   Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
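
The steps above can be sketched by hand. This is a minimal illustration, not the eli5 implementation used later in this notebook: the function name `manual_permutation_importance`, the toy data, and the column names `signal` and `noise` are all made up for the example, and the model's own `score` method (R² for a scikit-learn regressor) stands in for the loss.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: score drop caused by shuffling each column of X."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # step 1: score of the already trained model
    rows = []
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            # step 2: shuffle one column in a copy, so the original order
            # is automatically restored for the next column (step 3)
            X_shuffled = X.copy()
            X_shuffled[col] = rng.permutation(X[col].to_numpy())
            drops.append(baseline - model.score(X_shuffled, y))
        rows.append({"feature": col,
                     "mean_drop": np.mean(drops),
                     "std_drop": np.std(drops)})
    return pd.DataFrame(rows).set_index("feature")


# Toy data (an assumption for illustration): only "signal" drives the target.
toy_rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"signal": toy_rng.normal(size=500),
                      "noise": toy_rng.normal(size=500)})
toy_y = 3.0 * toy_X["signal"] + toy_rng.normal(scale=0.1, size=500)

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0)
toy_model = RandomForestRegressor(n_estimators=20, random_state=0)
toy_model.fit(toy_train_X, toy_train_y)

toy_importance = manual_permutation_importance(toy_model, toy_val_X, toy_val_y)
print(toy_importance)
```

Shuffling `signal` should cost the model far more than shuffling `noise`, which is the same pattern we look for in the taxi data below.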


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

We use the New York City Taxi Fare Prediction dataset from Kaggle: https://www.kaggle.com/datasets/dansbecker/new-york-city-taxi-fare-prediction
 

In [2]:
data = pd.read_csv('/content/train.csv', nrows=50000)

In [3]:
# Remove rows with extreme outlier coordinates or non-positive fares
data = data.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
                  'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
                  'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                  'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                  'fare_amount > 0'
                  )

In [21]:
# Select a few features for simplicity.
y = data.fare_amount

base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

X = data[base_features]

In [5]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=50, random_state=1).fit(train_X, train_y)

In [6]:
train_X.describe()

statistic,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
count,23466.0,23466.0,23466.0,23466.0,23466.0
mean,-73.976827,40.756931,-73.975359,40.757434,1.66232
std,0.014625,0.018206,0.01593,0.018659,1.290729
min,-73.999999,40.700013,-73.999999,40.70002,0.0
25%,-73.987964,40.744901,-73.987143,40.745756,1.0
50%,-73.979629,40.758076,-73.978588,40.758542,1.0
75%,-73.967797,40.769602,-73.966459,40.770406,2.0
max,-73.900062,40.799952,-73.900062,40.799999,6.0


In [7]:
train_y.describe()

count    23466.000000
mean         8.472539
std          4.609747
min          0.010000
25%          5.500000
50%          7.500000
75%         10.100000
max        165.000000
Name: fare_amount, dtype: float64

In [10]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(first_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = base_features)

Weight,Feature
0.8426  ± 0.0168,dropoff_latitude
0.8269  ± 0.0211,pickup_latitude
0.5943  ± 0.0436,pickup_longitude
0.5387  ± 0.0273,dropoff_longitude
-0.0020  ± 0.0013,passenger_count


#### **Interpreting Permutation Importances:**

The values towards the top are the most important features, and those towards the bottom matter least.

The first number in each row shows how much model performance decreased with a random shuffling (in this case, using R², the default score of a scikit-learn regressor, as the performance metric).

Like most things in data science, there is some randomness to the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next.
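
To see where such a spread comes from, here is a small sketch using scikit-learn's own `permutation_importance` helper, which is a different implementation from the eli5 class used above. The toy data and the names `demo_X`, `demo_y`, and `demo_model` are assumptions made only for this illustration. The helper keeps the score drop of every individual shuffle, so the mean and the variation across reshuffles can be inspected directly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy regression data (an assumption): only "signal" influences the target.
demo_rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"signal": demo_rng.normal(size=500),
                       "noise": demo_rng.normal(size=500)})
demo_y = 3.0 * demo_X["signal"] + demo_rng.normal(scale=0.1, size=500)

d_train_X, d_val_X, d_train_y, d_val_y = train_test_split(
    demo_X, demo_y, random_state=0)
demo_model = RandomForestRegressor(n_estimators=20, random_state=0)
demo_model.fit(d_train_X, d_train_y)

# Shuffle each column 10 times; scoring defaults to the model's score (R²).
repeats = permutation_importance(
    demo_model, d_val_X, d_val_y, n_repeats=10, random_state=0)

# One row per feature, one column per reshuffle.
print(repeats.importances.shape)
for name, mean, std in zip(demo_X.columns,
                           repeats.importances_mean,
                           repeats.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

A feature whose mean drop is small compared with its spread, like `passenger_count` in the table above, cannot be distinguished from having no effect.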





On average, the latitude features matter more than the longitude features. What might be the reasons behind this?

Some hypotheses are:
1. Travel might tend to have greater latitude distances than longitude distances. If the longitude values were generally closer together, shuffling them wouldn't matter as much.
2. Different parts of the city might have different pricing rules (e.g. price per mile), and pricing rules could vary more by latitude than longitude.
3. Tolls might be greater on roads going North <-> South (changing latitude) than on roads going East <-> West (changing longitude). Thus latitude would have a larger effect on the prediction because it captures the amount of the tolls.

Without detailed knowledge of New York City, it's difficult to rule out most hypotheses about why latitude features matter more than longitude.

A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.

We will create features for the absolute change in longitude and latitude over each trip, then build a model that adds these new features to the coordinate features we already had.

In [11]:
data['abs_lon_change'] = abs(data.dropoff_longitude - data.pickup_longitude)
data['abs_lat_change'] = abs(data.dropoff_latitude - data.pickup_latitude)

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'abs_lat_change',
               'abs_lon_change']

X = data[features_2]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
second_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)

# Create a PermutationImportance object on second_model and fit it to new_val_X and new_val_y
perm2 = PermutationImportance(second_model, random_state=1).fit(new_val_X, new_val_y)

# show the weights for the permutation importance you just calculated
eli5.show_weights(perm2, feature_names = features_2)

Weight,Feature
0.5979  ± 0.0625,abs_lat_change
0.4485  ± 0.0503,abs_lon_change
0.0810  ± 0.0240,pickup_latitude
0.0766  ± 0.0121,dropoff_latitude
0.0709  ± 0.0103,pickup_longitude
0.0596  ± 0.0135,dropoff_longitude


In [19]:
new_features = X[['abs_lat_change','abs_lon_change']]

In [20]:
new_features.describe()

statistic,abs_lat_change,abs_lon_change
count,31289.0,31289.0
mean,0.014893,0.013039
std,0.012234,0.011644
min,0.0,0.0
25%,0.006038,0.004947
50%,0.01167,0.010049
75%,0.020497,0.017717
max,0.094655,0.094065


We can see that the values of abs_lon_change and abs_lat_change are pretty small (all values are between 0 and 0.1), whereas the other variables have larger values. Do you think this could explain why these two features had larger permutation importance values in this case?

Consider an alternative where you created a version of these features that was 100 times as large, and used that larger version for training and importance calculations. Would this change the resulting permutation importance values?

In fact, the scale of a feature does not directly affect permutation importance. Rescaling a feature can only matter indirectly, if it helps or hurts the ability of the particular learning method we're using to make use of that feature. That won't happen with tree-based models, like the Random Forest used here. If you are familiar with Ridge Regression, you might be able to think of how that would be affected. The absolute change features have high importance because they capture total distance traveled, which is the primary determinant of taxi fares. It is not an artifact of the feature magnitude.
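
The scale argument can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not part of the original analysis: the data, the helper `importance_of_first_feature`, and the choice of `alpha=1.0` are all made up for the example, and it uses scikit-learn's `permutation_importance` rather than eli5. The first feature is multiplied by 100 and the importance of that feature is compared before and after, once for a Random Forest and once for Ridge Regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): two small-valued features, like the
# abs_*_change columns, with a linear effect on the target.
scale_rng = np.random.default_rng(0)
small_X = scale_rng.uniform(0.0, 0.1, size=(1000, 2))
scale_y = (100.0 * small_X[:, 0] + 50.0 * small_X[:, 1]
           + scale_rng.normal(scale=0.5, size=1000))

# Same data, but the first feature is 100 times as large.
scaled_X = small_X.copy()
scaled_X[:, 0] *= 100.0


def importance_of_first_feature(model, X, y):
    """Hypothetical helper: fit on a split, return importance of column 0."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
    model.fit(X_tr, y_tr)
    result = permutation_importance(
        model, X_va, y_va, n_repeats=10, random_state=0)
    return result.importances_mean[0]


forest_raw = importance_of_first_feature(
    RandomForestRegressor(n_estimators=30, random_state=0), small_X, scale_y)
forest_scaled = importance_of_first_feature(
    RandomForestRegressor(n_estimators=30, random_state=0), scaled_X, scale_y)

ridge_raw = importance_of_first_feature(Ridge(alpha=1.0), small_X, scale_y)
ridge_scaled = importance_of_first_feature(Ridge(alpha=1.0), scaled_X, scale_y)

print(f"Random Forest: raw={forest_raw:.3f}  scaled={forest_scaled:.3f}")
print(f"Ridge:         raw={ridge_raw:.3f}  scaled={ridge_scaled:.3f}")
```

The forest's importance should be essentially unchanged, because tree splits depend only on the ordering of values. Ridge penalizes coefficient size, so a small-valued feature needs a large, heavily penalized coefficient; enlarging the feature relaxes that penalty, the model leans on the feature more, and its permutation importance goes up.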