# Exercises

## Set Up

You will calculate permutation importance with data from the [Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) competition.

We won't focus on data exploration or model building for now. You can just run the cell below to:
- Load the data
- Divide the data into training and validation
- Build a model that predicts taxi fares
- Print a few rows for you to review

In [53]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


data = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows=50000)

# Remove data with extreme outlier coordinates or negative fares
data = data.query('pickup_latitude > 40 and pickup_latitude < 41 and ' +
                  'dropoff_latitude > 40 and dropoff_latitude < 41 and ' +
                  'pickup_longitude > -74.5 and pickup_longitude < -73 and ' +
                  'dropoff_longitude > -74.5 and dropoff_longitude < -73 and ' +
                  'fare_amount > 0'
                  )

y = data.fare_amount

base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

X = data[base_features]


train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)

print("Data sample:")
data.head()

Data sample:


| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |


The following two cells may also be useful to understand the values in the training data:

In [54]:
train_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 36680.0 | 36680.0 | 36680.0 | 36680.0 | 36680.0 |
| mean | -73.975446 | 40.750812 | -73.974232 | 40.751432 | 1.667121 |
| std | 0.035188 | 0.027497 | 0.035402 | 0.031242 | 1.288414 |
| min | -74.438233 | 40.121653 | -74.429332 | 40.164927 | 0.0 |
| 25% | -73.992356 | 40.736385 | -73.991327 | 40.735866 | 1.0 |
| 50% | -73.982135 | 40.753299 | -73.980537 | 40.754104 | 1.0 |
| 75% | -73.968376 | 40.76776 | -73.965384 | 40.768545 | 2.0 |
| max | -73.350557 | 40.99326 | -73.350557 | 40.97809 | 6.0 |


In [55]:
train_y.describe()

count    36680.000000
mean        11.315906
std          9.514749
min          0.010000
25%          6.000000
50%          8.500000
75%         12.500000
max        180.000000
Name: fare_amount, dtype: float64

## 1

The first model uses the following features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
- passenger_count

Before running any code... what do you expect the importances from this model might look like?

There is no right answer at this point, but run `q1.hint()` to see how you might think about it.

In [2]:
# q1.hint()

## 2

Create a `PermutationImportance` object to show the importances from `first_model`. Run the `fit` method to specify that we will calculate permutation importance using `val_X` and `val_y` as data.

In [39]:
import eli5
from eli5.sklearn import PermutationImportance

perm = _ # Create and fit PermutationImportance object
_ # show resulting weights

# q2.check()


# TODO: Move following to solution
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(first_model).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = base_features)


| Weight | Feature |
|---|---|
| 0.4898 ± 0.0114 | pickup_longitude |
| 0.4000 ± 0.0080 | dropoff_longitude |
| 0.1550 ± 0.0052 | dropoff_latitude |
| 0.0579 ± 0.0056 | pickup_latitude |
| 0 ± 0.0000 | passenger_count |


In [4]:
# q2.hint()
# q2.solution()
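
To make concrete what a permutation importance weight measures, here is a minimal sketch of the idea on synthetic data: shuffle one validation column at a time and record how much the model's score drops. This is not eli5's implementation. The function `manual_permutation_importance`, the `toy_` variables, and the choice of R² (the default `score` of a scikit-learn regressor) as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the score drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # R^2 for scikit-learn regressors
    means, stds = [], []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this column and the target
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - model.score(X_shuffled, y))
        means.append(np.mean(drops))
        stds.append(np.std(drops))
    return np.array(means), np.array(stds)


# Hypothetical toy data: strong dependence on column 0, weaker on column 1, none on column 2
toy_rng = np.random.default_rng(1)
toy_X = toy_rng.normal(size=(600, 3))
toy_y = 5 * toy_X[:, 0] + 2 * toy_X[:, 1] + toy_rng.normal(scale=0.1, size=600)

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(toy_X, toy_y, random_state=1)
toy_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(toy_train_X, toy_train_y)

toy_means, toy_stds = manual_permutation_importance(toy_model, toy_val_X, toy_val_y)
for name, mean, std in zip(["strong", "weak", "noise"], toy_means, toy_stds):
    print(f"{name}: {mean:.4f} (std {std:.4f})")
```

The `toy_` prefix keeps these variables from overwriting `train_X`, `val_X`, and the other names used in the exercise if you run the sketch in the same notebook.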

## 3
Why might the longitude features matter more than the latitude features? Reasonable hypotheses include
- Longitude differences are larger (e.g. most trips travel further East-West than they do North-South)
- More traffic when going East-West than North-South
- More tolls on the East-West roads than North-South roads

Without more information, these all seem plausible.
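
The first hypothesis can be checked directly from the coordinates. The sketch below computes the mean absolute change in longitude and latitude for the five rows printed in the data sample above; the frame name `sample` and the two result variables are illustrative, and five rows are far too few to draw a conclusion from.

```python
import pandas as pd

# The five rows shown in the "Data sample" output
sample = pd.DataFrame({
    "pickup_longitude":  [-73.844311, -74.016048, -73.982738, -73.98713, -73.968095],
    "pickup_latitude":   [40.721319, 40.711303, 40.76127, 40.733143, 40.768008],
    "dropoff_longitude": [-73.84161, -73.979268, -73.991242, -73.991567, -73.956655],
    "dropoff_latitude":  [40.712278, 40.782004, 40.750562, 40.758092, 40.783762],
})

# Mean absolute change in each coordinate, in degrees
mean_abs_lon = (sample.dropoff_longitude - sample.pickup_longitude).abs().mean()
mean_abs_lat = (sample.dropoff_latitude - sample.pickup_latitude).abs().mean()

print(f"mean |longitude change|: {mean_abs_lon:.6f} degrees")
print(f"mean |latitude change|:  {mean_abs_lat:.6f} degrees")
```

To run the same check on the exercise data, replace `sample` with `data`. Keep in mind that the two numbers are in degrees, and near New York a degree of longitude covers less ground than a degree of latitude, so equal changes in degrees are not equal distances.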

Can you come up with any hypotheses for why `pickup_latitude` would matter less than `dropoff_latitude`? After you've thought about it, check here for some possible explanations:


In [None]:
q3.hint()

## 4

Without detailed knowledge of New York City, it's difficult to rule out hypotheses about the pickup vs dropoff asymmetry.

At this point, you should try some new features or models and calculate new importances to sleuth out what is happening.  

A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.  
The code below creates new features for longitudinal and latitudinal distance, as well as direction of travel. It then builds a model that adds these new features to those you already had.

Fill in the bottom two lines to calculate and show the importance weights with this new set of features.

In [59]:
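# create new features
# Note: despite the abs_ prefix, these are signed changes (dropoff minus pickup)
data['abs_lon_change'] = data.dropoff_longitude - data.pickup_longitude
data['abs_lat_change'] = data.dropoff_latitude - data.pickup_latitude
data['going_north'] = (data.dropoff_latitude > data.pickup_latitude).astype(int)
# Longitude increases toward the east, so this flag is 1 when the dropoff is east of the pickup
data['going_west'] = (data.dropoff_longitude > data.pickup_longitude).astype(int)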

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'passenger_count',
               'abs_lat_change',
               'abs_lon_change',
               'going_north',
               'going_west']

X = data[features_2]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
second_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)

# Create a PermutationImportance object on second_model and fit it to new_val_X and new_val_y
perm2 = _ #PermutationImportance(second_model).fit(new_val_X, new_val_y)

# show the weights for the permutation importance you just calculated
_ #eli5.show_weights(perm2, feature_names = features_2)

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | abs_lon_change | abs_lat_change | going_north | going_west |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 | 48907.0 |
| mean | 11.34178 | -73.975465 | 40.750851 | -73.974216 | 40.751363 | 1.669618 | 0.001249 | 0.000512 | 0.499335 | 0.523095 |
| std | 9.528639 | 0.035193 | 0.027453 | 0.035303 | 0.031264 | 1.29058 | 0.040856 | 0.031188 | 0.500005 | 0.499471 |
| min | 0.01 | -74.438233 | 40.121653 | -74.429332 | 40.164927 | 0.0 | -0.447625 | -0.299386 | 0.0 | 0.0 |
| 25% | 6.0 | -73.992303 | 40.736507 | -73.991333 | 40.735891 | 1.0 | -0.011499 | -0.013975 | 0.0 | 0.0 |
| 50% | 8.5 | -73.982128 | 40.753418 | -73.980495 | 40.754102 | 1.0 | 0.001106 | 0.0 | 0.0 | 1.0 |
| 75% | 12.5 | -73.968428 | 40.76773 | -73.965354 | 40.768456 | 2.0 | 0.014036 | 0.014514 | 1.0 | 1.0 |
| max | 180.0 | -73.350557 | 40.99326 | -73.350557 | 40.97809 | 6.0 | 0.361878 | 0.279212 | 1.0 | 1.0 |


How would you interpret these importance scores?  What have we learned from this new model?  After thinking about it, run the cell below for one point-of-view.

In [None]:
# q4.hint()
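
When interpreting distance and direction features like the ones in this section, it helps to be explicit about sign conventions. The sketch below uses two made-up trips to contrast a signed longitude change with its absolute value, and shows which sign corresponds to which direction. The frame `trips` and every column name other than the two longitude columns are hypothetical.

```python
import pandas as pd

# Two hypothetical trips along the same stretch, in opposite directions
trips = pd.DataFrame({
    "pickup_longitude":  [-74.00, -73.95],
    "dropoff_longitude": [-73.95, -74.00],
})

# Signed change: positive when the dropoff is east of the pickup
signed = trips.dropoff_longitude - trips.pickup_longitude
trips["signed_lon_change"] = signed

# Absolute change: distance covered in degrees, regardless of direction
trips["abs_lon_change_demo"] = signed.abs()

# Longitude grows toward the east (and is negative in New York),
# so a positive signed change means heading east
trips["going_east"] = (signed > 0).astype(int)
trips["going_west_demo"] = (signed < 0).astype(int)

print(trips)
```

A signed change mixes distance and direction in one number, while the absolute change plus a direction flag separates them, which is usually easier to reason about when reading importance weights.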

## 5

A colleague observes that the values for `abs_lon_change` and `abs_lat_change` are pretty small (all values are between -0.5 and 0.5), whereas other variables have larger values.  Do you think this could explain why those coordinates had larger permutation importance values in this case?  

Consider an alternative where you multiplied these two features by 100 and used the rescaled versions for both training and the importance calculations. Would this change the resulting permutation importance values?

Why or why not?

After you have thought about your answer, either try it or look up the answer in the cell below.
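
If you would like to try it, one possible experiment is sketched below on synthetic data rather than the taxi data. It trains the same kind of model twice, once on the original features and once with one feature multiplied by 100, and prints both sets of importances for comparison. The names `demo_X`, `scaled_X`, and `importances_for` are illustrative, and `sklearn.inspection.permutation_importance` (available in scikit-learn 0.22 and later) is used here in place of eli5's `PermutationImportance`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical toy data with three features
demo_rng = np.random.default_rng(0)
demo_X = demo_rng.normal(size=(400, 3))
demo_y = 3 * demo_X[:, 0] + demo_X[:, 1] + demo_rng.normal(scale=0.1, size=400)


def importances_for(X, y):
    """Train a random forest and return its mean permutation importances on held-out data."""
    tr_X, va_X, tr_y, va_y = train_test_split(X, y, random_state=1)
    model = RandomForestRegressor(n_estimators=30, random_state=1).fit(tr_X, tr_y)
    result = permutation_importance(model, va_X, va_y, n_repeats=5, random_state=1)
    return result.importances_mean


# Same data, but with the first feature multiplied by 100
scaled_X = demo_X.copy()
scaled_X[:, 0] *= 100

raw_importances = importances_for(demo_X, demo_y)
scaled_importances = importances_for(scaled_X, demo_y)

print("original features:", np.round(raw_importances, 4))
print("first feature x100:", np.round(scaled_importances, 4))
```

Fixing `random_state` everywhere keeps the train/validation split, the forest, and the shuffles identical across the two runs, so any difference in the printed values comes from the rescaling alone.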

In [None]:
# q5.solution()