# Exercises

## Set Up

You will calculate permutation importance with data from the [Taxi Fare Prediction](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) competition.

We won't focus on data exploration or model building for now. Just run the cell below to:
- Load the data
- Divide the data into training and validation
- Build a model that predicts taxi fares
- Print a few rows for you to review

In [22]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


data = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows=50000)

# Remove rows with extreme outlier coordinates or non-positive fares
data = data.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
                  'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
                  'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                  'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                  'fare_amount > 0'
                  )

y = data.fare_amount

base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

X = data[base_features]


train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)
# first_model = LinearRegression().fit(train_X, train_y)
print("Data sample:")
data.head()

Data sample:


| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
| 6 | 2012-11-20 20:35:00.0000001 | 7.5 | 2012-11-20 20:35:00 UTC | -73.980002 | 40.751662 | -73.973802 | 40.764842 | 1 |
| 7 | 2012-01-04 17:22:00.00000081 | 16.5 | 2012-01-04 17:22:00 UTC | -73.9513 | 40.774138 | -73.990095 | 40.751048 | 1 |


The following two cells may also be useful to understand the values in the training data:

In [15]:
train_X.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 25775.0 | 25775.0 | 25775.0 | 25775.0 | 25775.0 |
| mean | -73.973905 | 40.755495 | -73.971926 | 40.754594 | 1.654045 |
| std | 0.02346 | 0.020579 | 0.024684 | 0.023701 | 1.28106 |
| min | -73.999999 | 40.604037 | -73.999999 | 40.602453 | 0.0 |
| 25% | -73.988016 | 40.743961 | -73.986749 | 40.743833 | 1.0 |
| 50% | -73.97941 | 40.757694 | -73.977976 | 40.75774 | 1.0 |
| 75% | -73.96673 | 40.769592 | -73.964015 | 40.770022 | 2.0 |
| max | -73.800652 | 40.799952 | -73.800102 | 40.799999 | 6.0 |


In [23]:
train_y.describe()

count    23466.000000
mean         8.472539
std          4.609747
min          0.010000
25%          5.500000
50%          7.500000
75%         10.100000
max        165.000000
Name: fare_amount, dtype: float64

## 1

The first model uses the following features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
- passenger_count

Before running any code... what do you expect the importances from this model might look like?

There is no right answer at this point, but run `q1.hint()` to see how you might think about it.

In [4]:
# q1.hint()

## 2

Create a `PermutationImportance` object to show the importances from `first_model`. Run the `fit` method to specify that permutation importance will be calculated using `val_X` and `val_y` as data.

In [24]:
import eli5
from eli5.sklearn import PermutationImportance

perm = _ # Create and fit a PermutationImportance object
_ # show resulting weights

# q2.check()


# TODO: Move following to solution
perm = PermutationImportance(first_model).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = base_features)


| Weight | Feature |
|---|---|
| 0.8555 ± 0.0217 | dropoff_latitude |
| 0.8384 ± 0.0197 | pickup_latitude |
| 0.6191 ± 0.0528 | pickup_longitude |
| 0.5369 ± 0.0119 | dropoff_longitude |
| -0.0024 ± 0.0028 | passenger_count |


In [6]:
# q2.hint()
# q2.solution()
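
To make the weights in this section less of a black box, here is a minimal sketch of the general idea behind permutation importance: shuffle one column of the validation data, re-score the model, and record how much the score drops. This is an illustration only. The helper `manual_permutation_importance`, the synthetic data, and the column labels are assumptions made for this sketch, and `eli5` may differ in details such as how the spread after the ± is computed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean and std of the drop in model.score (R^2 here) per shuffled column.

    Hypothetical helper written for this sketch; it is not part of eli5.
    """
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column only, breaking its link to the target.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col, rep] = baseline - model.score(X_shuffled, y)
    return drops.mean(axis=1), drops.std(axis=1)


# Synthetic stand-in data (not the taxi data): the target depends strongly on
# column 0, moderately on column 1, and not at all on column 2.
data_rng = np.random.default_rng(1)
X_demo = data_rng.uniform(size=(1000, 3))
y_demo = 10 * X_demo[:, 0] + 3 * X_demo[:, 1] + data_rng.normal(scale=0.1, size=1000)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=1)
demo_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(
    demo_train_X, demo_train_y)

means, stds = manual_permutation_importance(demo_model, demo_val_X, demo_val_y)
for name, mean, std in zip(["strong", "moderate", "unused"], means, stds):
    print(f"{name}: {mean:.4f} (std {std:.4f})")
```

A weight near zero, or slightly negative like the one for `passenger_count`, means that shuffling the column barely changed the validation score.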

## 3
Before seeing these results, we might have expected each of the four coordinate features to be equally important.

But, on average, the latitude features matter more than the longitude features. Can you come up with any hypotheses for this?

After you've thought about it, check here for some possible explanations:

In [7]:
# q3.hint()


## 4

Without detailed knowledge of New York City, it's difficult to rule out most hypotheses about why latitude features matter more than longitude.

A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.  

The code below creates new features for longitudinal and latitudinal distance. It then builds a model that adds these new features to those you already had.

Fill in the bottom two lines to calculate and show the importance weights with this new set of features.

In [26]:
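# Create new features: coordinate changes (dropoff minus pickup).
# Note: as written, these are signed differences; no absolute value is taken
# despite the `abs_` prefix in the column names.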
data['abs_lon_change'] = data.dropoff_longitude - data.pickup_longitude
data['abs_lat_change'] = data.dropoff_latitude - data.pickup_latitude

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'abs_lat_change',
               'abs_lon_change']

X = data[features_2]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
second_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)

# Create a PermutationImportance object on second_model and fit it to new_val_X and new_val_y
perm2 = _ #PermutationImportance(second_model).fit(new_val_X, new_val_y)

# show the weights for the permutation importance you just calculated
_ # eli5.show_weights(perm2, feature_names = features_2)

# Uncomment and run the following line to check your answer
# q4.check()

| Weight | Feature |
|---|---|
| 0.8095 ± 0.0710 | abs_lat_change |
| 0.6524 ± 0.0253 | abs_lon_change |
| 0.1504 ± 0.0259 | dropoff_latitude |
| 0.1348 ± 0.0240 | dropoff_longitude |
| 0.0694 ± 0.0065 | pickup_latitude |
| 0.0458 ± 0.0074 | pickup_longitude |


How would you interpret these importance scores? Distance traveled seems far more important than any location effects. 

But the location still affects model predictions, and dropoff location now matters slightly more than pickup location. Do you have any hypotheses for why this might be? The techniques used later in the course will help us dive into this more.

## 5

A colleague observes that the values for `abs_lon_change` and `abs_lat_change` are pretty small (all values are between -0.1 and 0.1), whereas other variables have larger values. Do you think this could explain why those features had larger permutation importance values in this case?

Consider an alternative where you multiplied these two features by 100 and used the rescaled features for both training and importance calculations. Would this change the resulting permutation importance values?

Why or why not?

After you have thought about your answer, either try this experiment or look up the answer in the cell below.

In [None]:
# q5.solution()
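
If you want to try the experiment, one possible setup is sketched below on synthetic data rather than the taxi data. It trains the same kind of random forest twice, once on the original features and once with the small-valued feature multiplied by 100, and prints both sets of importances for comparison. The data, the helper `fit_and_score`, and the use of `sklearn.inspection.permutation_importance` in place of `eli5` are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X_small = np.column_stack([
    rng.uniform(-0.1, 0.1, size=n),   # small values, like a coordinate change
    rng.uniform(40.7, 40.8, size=n),  # larger values, like a latitude
])
# Hypothetical target: depends on the size of the change and on the location.
y_fare = (50 * np.abs(X_small[:, 0]) + 20 * (X_small[:, 1] - 40.7)
          + rng.normal(scale=0.1, size=n))


def fit_and_score(X, y):
    """Fit a forest and return mean permutation importances on validation data."""
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)
    result = permutation_importance(model, val_X, val_y, n_repeats=5, random_state=1)
    return result.importances_mean


# Same data, but the first feature is 100X as large.
X_rescaled = X_small.copy()
X_rescaled[:, 0] *= 100

original = fit_and_score(X_small, y_fare)
rescaled = fit_and_score(X_rescaled, y_fare)
print("original features:", original)
print("rescaled features:", rescaled)
```

Swapping in a different model type, such as the commented-out `LinearRegression` from the setup cell, is a natural follow-up experiment.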

## 6

You've seen that the feature importance for latitudinal distance is greater than the importance of longitudinal distance. From this, can we conclude whether traveling a fixed latitudinal distance tends to be more expensive than traveling the same longitudinal distance?

Why or why not? Check your answer below.

In [None]:
# q6.solution()
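
One way to check your reasoning on this question is a small synthetic experiment, sketched below under stated assumptions. It builds two features with a known effect per unit: `wide` has a small per-unit effect but varies a lot, while `narrow` has a larger per-unit effect but varies very little. It then prints the fitted coefficients next to the permutation importances. The feature names, the data, and the use of `sklearn.inspection.permutation_importance` instead of `eli5` are hypothetical choices for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
wide = rng.uniform(-10, 10, size=n)       # varies a lot, effect of 1 per unit
narrow = rng.uniform(-0.1, 0.1, size=n)   # varies little, effect of 5 per unit
X_effects = np.column_stack([wide, narrow])
y_effects = 1.0 * wide + 5.0 * narrow + rng.normal(scale=0.1, size=n)

train_X, val_X, train_y, val_y = train_test_split(X_effects, y_effects, random_state=1)
linear_model = LinearRegression().fit(train_X, train_y)
importance = permutation_importance(
    linear_model, val_X, val_y, n_repeats=5, random_state=1)

print("effect per unit (coefficients):", linear_model.coef_)
print("permutation importances:       ", importance.importances_mean)
```

Comparing the two printed lines shows how the ranking by importance relates to the ranking by per-unit effect in this toy setting.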