# Permutation Importance
What features does your model think are important?

# 📊 Feature Importance & Permutation Importance

## 🔍 Introduction

A fundamental question in machine learning:  
**Which features have the biggest impact on model predictions?**

This is called **feature importance**.

---

## 💡 Why Permutation Importance?

Compared to other methods, permutation importance is:
- ✅ Fast to calculate
- ✅ Widely used and understood
- ✅ Consistent with desirable properties for measuring importance

---

## ⚙️ How Permutation Importance Works

1. **Train a model** on your dataset.
2. **Choose a validation set** and record the model's baseline performance on it.
3. For each feature:
   - **Randomly shuffle the values** in that feature's column, leaving the target and all other columns in place.
   - **Make predictions** on the shuffled data.
   - Measure how much **performance decreases** (e.g., how much the error increases).
   - **The greater the performance drop, the more important the feature**.
   - **Restore the original order** of the column before moving on to the next feature.
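
The steps above can be sketched from scratch in a few lines. This is a minimal illustration, not the implementation used later in this notebook: the helper name `manual_permutation_importance`, the synthetic data, and the choice of mean absolute error as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X_val, y_val, n_repeats=5, seed=0):
    """Return the mean increase in MAE caused by shuffling each column.

    Hypothetical helper for illustration; X_val must be a 2-D NumPy array.
    """
    rng = np.random.default_rng(seed)
    baseline = mean_absolute_error(y_val, model.predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for col in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_shuffled = X_val.copy()  # work on a copy so the original order is preserved
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            shuffled_error = mean_absolute_error(y_val, model.predict(X_shuffled))
            increases.append(shuffled_error - baseline)
        importances[col] = np.mean(increases)
    return importances


# Synthetic data (an assumption for this sketch):
# column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)

importances = manual_permutation_importance(model, X_val, y_val)
print(importances)  # column 0 should show a much larger error increase than column 1
```

Averaging over several shuffles (`n_repeats`) reduces the randomness of any single permutation.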

---

## 🧠 Intuition via Example

🎯 **Goal**: Predict someone's height at age 20  
📦 **Features available at age 10**:
- `Height at age 10` → useful
- `Socks owned` → likely irrelevant

When we:
- Shuffle `height at age 10` → Model performs poorly → 🔥 Important!
- Shuffle `socks owned` → Little to no change → ❄️ Not important
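
The height and socks example can be simulated to see the effect numerically. The sketch below uses made-up data (the means, spreads, and the 1.25 growth factor are assumptions, not real measurements) and scikit-learn's `sklearn.inspection.permutation_importance`, which by default reports the drop in the model's R² score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Made-up data for illustration only
rng = np.random.default_rng(0)
n = 1000
height_at_10 = rng.normal(138, 6, size=n)   # cm, hypothetical distribution
socks_owned = rng.integers(2, 30, size=n)   # unrelated to adult height
height_at_20 = 1.25 * height_at_10 + rng.normal(0, 3, size=n)

X = pd.DataFrame({"height_at_age_10": height_at_10, "socks_owned": socks_owned})
X_train, X_val, y_train, y_val = train_test_split(X, height_at_20, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column 10 times on the validation set and average the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
scores = dict(zip(X.columns, result.importances_mean))
print(scores)
```

Shuffling `height_at_age_10` should destroy most of the model's accuracy, while shuffling `socks_owned` should leave the score close to unchanged.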

---

## 📌 Summary

Permutation importance asks:  
> _"How much worse does the model perform when I break the relationship between this feature and the target?"_

It provides a clear, interpretable, and robust way to **rank features** based on their predictive power.


In [5]:
# Loading data, dividing, modeling and EDA below
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv('new-york-city-taxi-fare-prediction.csv', nrows=50000)

# Remove rows with extreme outlier coordinates or non-positive fares
data = data.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
                  'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
                  'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                  'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                  'fare_amount > 0'
                  )

y = data.fare_amount

base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

X = data[base_features]


train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=50, random_state=1).fit(train_X, train_y)


# show data
print("Data sample:")
data.head()

Data sample:


| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
| 6 | 2012-11-20 20:35:00.0000001 | 7.5 | 2012-11-20 20:35:00 UTC | -73.980002 | 40.751662 | -73.973802 | 40.764842 | 1 |
| 7 | 2012-01-04 17:22:00.00000081 | 16.5 | 2012-01-04 17:22:00 UTC | -73.9513 | 40.774138 | -73.990095 | 40.751048 | 1 |


In [6]:
train_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 23466.0 | 23466.0 | 23466.0 | 23466.0 | 23466.0 |
| mean | -73.976827 | 40.756931 | -73.975359 | 40.757434 | 1.66232 |
| std | 0.014625 | 0.018206 | 0.01593 | 0.018659 | 1.290729 |
| min | -73.999999 | 40.700013 | -73.999999 | 40.70002 | 0.0 |
| 25% | -73.987964 | 40.744901 | -73.987143 | 40.745756 | 1.0 |
| 50% | -73.979629 | 40.758076 | -73.978588 | 40.758542 | 1.0 |
| 75% | -73.967797 | 40.769602 | -73.966459 | 40.770406 | 2.0 |
| max | -73.900062 | 40.799952 | -73.900062 | 40.799999 | 6.0 |


In [7]:
train_y.describe()

count    23466.000000
mean         8.472539
std          4.609747
min          0.010000
25%          5.500000
50%          7.500000
75%         10.100000
max        165.000000
Name: fare_amount, dtype: float64

## Question 1

The first model uses the following features
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
- passenger_count

Before running any code... which variables seem potentially useful for predicting taxi fares? Do you think permutation importance will necessarily identify these features as important?

**Solution**: It would be helpful to know whether New York City taxis vary prices based on how many passengers they carry. Most places do not change fares based on the number of passengers. If you assume New York City is the same, then only the first four features listed (the pickup and dropoff coordinates) should matter. At first glance, it seems all four should matter equally.

## Question 2

Let's find out with code!

In [9]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(first_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())

| Weight | Feature |
|---|---|
| 0.8426 ± 0.0168 | dropoff_latitude |
| 0.8269 ± 0.0211 | pickup_latitude |
| 0.5943 ± 0.0436 | pickup_longitude |
| 0.5387 ± 0.0273 | dropoff_longitude |
| -0.0020 ± 0.0013 | passenger_count |


## Question 3
Before seeing these results, we might have expected each of the four coordinate features to be equally important.

But, on average, the latitude features matter more than the longitude features. Can you come up with any hypotheses for this?

**Solution**: 
1. Travel might tend to cover greater latitude distances than longitude distances. If the longitude values were generally closer together, shuffling them wouldn't matter as much.
2. Different parts of the city might have different pricing rules (e.g. price per mile), and pricing rules could vary more by latitude than by longitude.
3. Tolls might be greater on roads going North <-> South (changing latitude) than on roads going East <-> West (changing longitude). Latitude would then have a larger effect on the prediction because it captures the amount of the tolls.
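
The first hypothesis is easy to check directly. The sketch below defines a hypothetical helper, `mean_abs_coordinate_changes`, and applies it to the five sample rows displayed by `data.head()` earlier. Five rows are far too few to draw a conclusion; in the notebook you would pass the full `data` frame instead.

```python
import pandas as pd


def mean_abs_coordinate_changes(trips):
    """Average absolute latitude and longitude change per trip (hypothetical helper)."""
    return pd.Series({
        "abs_lat_change": (trips["dropoff_latitude"] - trips["pickup_latitude"]).abs().mean(),
        "abs_lon_change": (trips["dropoff_longitude"] - trips["pickup_longitude"]).abs().mean(),
    })


# The five rows shown in the data sample above, used here as a stand-in for `data`
sample = pd.DataFrame({
    "pickup_longitude": [-73.982738, -73.98713, -73.968095, -73.980002, -73.9513],
    "pickup_latitude": [40.76127, 40.733143, 40.768008, 40.751662, 40.774138],
    "dropoff_longitude": [-73.991242, -73.991567, -73.956655, -73.973802, -73.990095],
    "dropoff_latitude": [40.750562, 40.758092, 40.783762, 40.764842, 40.751048],
})

changes = mean_abs_coordinate_changes(sample)
print(changes)
```

If the latitude changes turn out to be larger on the full dataset as well, that would support hypothesis 1 without ruling out the other two.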


## Question 4

Without detailed knowledge of New York City, it's difficult to confidently explain why latitude features seem to matter more than longitude features.

### 🧠 Objective
A good next step is to **disentangle the effect of being in specific areas of the city** from the effect of **total distance traveled**.

In [10]:
# create new features
data['abs_lon_change'] = abs(data.dropoff_longitude - data.pickup_longitude)
data['abs_lat_change'] = abs(data.dropoff_latitude - data.pickup_latitude)

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'abs_lat_change',
               'abs_lon_change']

X = data[features_2]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
second_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)

# Create a PermutationImportance object on second_model and fit it to new_val_X and new_val_y
# Use a random_state of 1 for reproducible results that match the expected solution.
perm2 =  PermutationImportance(second_model, random_state=1).fit(new_val_X, new_val_y)

# show the weights for the permutation importance you just calculated
eli5.show_weights(perm2, feature_names = features_2)

| Weight | Feature |
|---|---|
| 0.5952 ± 0.0575 | abs_lat_change |
| 0.4485 ± 0.0493 | abs_lon_change |
| 0.0799 ± 0.0241 | pickup_latitude |
| 0.0770 ± 0.0121 | dropoff_latitude |
| 0.0694 ± 0.0115 | pickup_longitude |
| 0.0596 ± 0.0131 | dropoff_longitude |


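--> Distance traveled seems far more important than any location effects: the two distance features have importances of roughly 0.60 and 0.45, while every raw coordinate is below 0.08.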

## Question 5

A colleague observes that the values for `abs_lon_change` and `abs_lat_change` are pretty small (they are absolute values, so all of them fall between 0 and 0.1), whereas other variables have larger values. Do you think this could explain why those features had larger permutation importance values in this case?

Consider an alternative where you created and used a feature that was 100X as large for these features, and used that larger feature for training and importance calculations. Would this change the outputted permutation importance values?

In [12]:
# recreate the distance features, scaled up 100X
data['abs_lon_change'] = 100 * abs(data.dropoff_longitude - data.pickup_longitude)
data['abs_lat_change'] = 100 * abs(data.dropoff_latitude - data.pickup_latitude)

features_3  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'abs_lat_change',
               'abs_lon_change']

X = data[features_3]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
third_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)

# Create a PermutationImportance object on third_model (trained on the rescaled features)
# and fit it to new_val_X and new_val_y.
# Use a random_state of 1 for reproducible results that match the expected solution.
perm3 = PermutationImportance(third_model, random_state=1).fit(new_val_X, new_val_y)

# show the weights for the permutation importance you just calculated
eli5.show_weights(perm3, feature_names = features_3)

| Weight | Feature |
|---|---|
| 0.5959 ± 0.0581 | abs_lat_change |
| 0.4512 ± 0.0497 | abs_lon_change |
| 0.0792 ± 0.0247 | pickup_latitude |
| 0.0750 ± 0.0122 | dropoff_latitude |
| 0.0644 ± 0.0106 | pickup_longitude |
| 0.0587 ± 0.0132 | dropoff_longitude |


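--> The scale of features does not affect permutation importance per se. Rescaling matters only indirectly, if it changes how well the learning method can use the feature. That does not happen with tree-based models like the random forest used here, which is why the two tables agree up to small numerical differences. (If you are familiar with Ridge Regression, you might be able to think of how that would be affected.)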
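
The contrast between tree-based models and Ridge Regression can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the data, the `alpha=1000.0` penalty, and the helper `first_feature_importance` are all made up for the example, and importances are computed on the training data purely for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic data (an assumption): two equally useful features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_scaled = X.copy()
X_scaled[:, 0] *= 100  # same information, values 100X larger


def first_feature_importance(model, features):
    """Fit the model, then return the permutation importance (R^2 drop) of column 0."""
    fitted = model.fit(features, y)
    result = permutation_importance(fitted, features, y, n_repeats=10, random_state=0)
    return result.importances_mean[0]


forest_raw = first_feature_importance(RandomForestRegressor(n_estimators=30, random_state=0), X)
forest_scaled = first_feature_importance(RandomForestRegressor(n_estimators=30, random_state=0), X_scaled)

# A strong L2 penalty shrinks the coefficient of a small-valued feature far more
# than the coefficient of the same feature expressed in larger units.
ridge_raw = first_feature_importance(Ridge(alpha=1000.0), X)
ridge_scaled = first_feature_importance(Ridge(alpha=1000.0), X_scaled)

print(f"Random forest: raw={forest_raw:.3f}, scaled={forest_scaled:.3f}")
print(f"Ridge:         raw={ridge_raw:.3f}, scaled={ridge_scaled:.3f}")
```

The random forest importances should be nearly identical, because tree splits depend only on the ordering of a feature's values. The Ridge importances should differ noticeably, because the penalty acts on coefficient size and coefficient size depends on the feature's units.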

## Question 6

You've seen that the feature importance for latitudinal distance is greater than the importance of longitudinal distance. From this, can we conclude whether traveling a fixed latitudinal distance tends to be more expensive than traveling the same longitudinal distance?

Why or why not?

**Solution**: We cannot tell from the permutation importance results whether traveling a fixed latitudinal distance is more or less expensive than traveling the same longitudinal distance. Possible reasons the latitude features are more important than the longitude features:
1. Latitudinal distances in the dataset tend to be larger.
2. It is more expensive to travel a fixed latitudinal distance.
3. Both of the above.

If the `abs_lon_change` values were very small, longitude could be less important to the model even if the cost per mile of travel in that direction were high.
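
A small simulation makes this point concrete. In the sketch below the fares are entirely hypothetical: longitude travel is deliberately made twice as expensive per unit as latitude travel, but longitude changes are given a much narrower spread. The variable names and all numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
lat_change = rng.uniform(0, 0.10, size=n)  # wide spread of latitude distances
lon_change = rng.uniform(0, 0.01, size=n)  # narrow spread of longitude distances

# Hypothetical fares: longitude costs twice as much per unit as latitude
fare = 100 * lat_change + 200 * lon_change + rng.normal(scale=0.2, size=n)

X = np.column_stack([lat_change, lon_change])
model = LinearRegression().fit(X, fare)
result = permutation_importance(model, X, fare, n_repeats=10, random_state=0)

lat_importance, lon_importance = result.importances_mean
lat_cost, lon_cost = model.coef_
print(f"cost per unit: lat={lat_cost:.1f}, lon={lon_cost:.1f}")
print(f"importance:    lat={lat_importance:.3f}, lon={lon_importance:.3f}")
```

Latitude should come out as the more important feature even though longitude is the more expensive direction per unit, because permutation importance reflects how much a feature's variation in the data moves the predictions, not its per-unit effect.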