# Prediction Project
In the previous workbook you were led through the steps to design and train a model. 

In this workbook you will train a model to predict the Fare for a New York City Taxi. 

This is based on a Kaggle Challenge - https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

The main difference is that we are using a smaller subset of the data.

This workbook is more of a project: it contains less guidance and less supplied code, so work in your teams to solve the challenge.

## Import some dependencies

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers  # needed for layers.Dense when building the model


## Helper Functions 
The following cell contains a set of functions that you may (or may not) use during this project.

In [None]:
# range of longitude for NYC
nyc_min_longitude = -74.05
nyc_max_longitude = -73.75

# range of latitude for NYC
nyc_min_latitude = 40.63
nyc_max_latitude = 40.85

def get_NYC_records_only(dataset):
    return get_records_within_long_lat(dataset, nyc_min_longitude, nyc_max_longitude,
                                      nyc_min_latitude, nyc_max_latitude)
    
def get_records_within_long_lat(dataset, min_long, max_long, min_lat, max_lat):
    # Create a copy and leave the original alone
    df2 = dataset.copy(deep=True)

    for long in ['pickup_longitude', 'dropoff_longitude']:
        df2 = df2[(df2[long] > min_long) & (df2[long] < max_long)]

    for lat in ['pickup_latitude', 'dropoff_latitude']:
        df2 = df2[(df2[lat] > min_lat) & (df2[lat] < max_lat)]
    
    return df2


def plot_lat_long(df, points='Pickup'):
    plt.figure(figsize = (12,12)) # set figure size
    # Compare case-insensitively so both 'Pickup' and 'pickup' select the pick-up points
    if points.lower() == 'pickup':
        plt.plot(list(df.pickup_longitude), list(df.pickup_latitude), '.', markersize=1)
    else:
        plt.plot(list(df.dropoff_longitude), list(df.dropoff_latitude), '.', markersize=1)
    
    plt.title("{} Locations in NYC Illustrated".format(points))

    plt.grid(None)
    # Longitude is plotted on the x-axis and latitude on the y-axis
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.show()
    
def plotHistory(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.legend()


## Obtain the Data 
We will load the data into a pandas DataFrame.

In [None]:
# Read in the data
data = pd.read_csv('./data/taxi.csv', parse_dates=['pickup_datetime'],)
data.head()

## Data Pre-Processing
In the workbook on Data Pre-Processing and Feature Engineering we used this data set and performed a set of pre-processing steps on the data.

Some of the steps are below, but after this next step you will need to make some decisions about what pre-processing and feature engineering actions you want to take.

In [None]:
# Drop any rows with null values
data = data.dropna()

# A passenger count of 0 is not valid, so replace it with the most common value
# Get the mode (the most frequent value)
mode = data['passenger_count'].mode().values[0]
# Set the passenger count to the mode where the passenger count is currently 0
data.loc[data['passenger_count'] == 0, 'passenger_count'] = mode

# Drop the key
data.drop(['key'], axis=1, inplace=True)

data.describe()

### Exercise: Dealing with invalid Fares
As indicated before, the _fare_amount_ feature has some problems. Firstly, there are some negative values; secondly, there are some very large fares, whereas most are very small.

The following graph shows the distribution of fares. Examine the graph and decide which fares you want to ignore. You can then run the cell to drop any training samples where the fare is outside the bounds you specified. 

In [None]:
data['fare_amount'].plot.hist(bins=500)
plt.xlabel('Fare')
plt.title('Histogram of Fares')
plt.show()

In [None]:
# TODO: Change the following value to be the minimum fare you are interested in modelling
min_acceptable_fare = 0

# TODO: Change the following value to be the maximum fare you are interested in modelling
# If you want to include all fares then set this to 500
max_acceptable_fare = 50

# Delete fares that are outside your limits
data = data[(data['fare_amount'] >= min_acceptable_fare) &
            (data['fare_amount'] <= max_acceptable_fare)]

# Describe the data (printed explicitly, as only the last expression in a cell is displayed)
print(data.describe())
# Show the distribution
data['fare_amount'].plot.hist(bins=500)
plt.xlabel('Fare')
plt.title('Histogram of Fares')
plt.show()
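
If the histogram is hard to read because of a few extreme values, percentiles can give a numerical starting point for the bounds. The sketch below is one possible approach and is not part of the original workbook: the helper name `suggest_fare_bounds`, the percentile choices and the sample fares are all assumptions made for illustration.

```python
import numpy as np

def suggest_fare_bounds(fares, lower_pct=1.0, upper_pct=99.0):
    """Return (low, high) percentile bounds, ignoring negative fares."""
    fares = np.asarray(fares, dtype=float)
    valid = fares[fares >= 0]  # negative fares are treated as invalid
    low, high = np.percentile(valid, [lower_pct, upper_pct])
    return float(low), float(high)

# Made-up sample for illustration only; in the workbook you could pass data['fare_amount']
sample_fares = [-2.5, 4.5, 6.0, 7.5, 8.0, 9.5, 12.0, 15.0, 52.0, 450.0]
low, high = suggest_fare_bounds(sample_fares, lower_pct=5, upper_pct=90)
print("Suggested bounds:", low, high)
```

The suggested values are only a guide; you still need to decide which fares make sense for the problem.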

### Exercise: Dealing with invalid Longitude and Latitudes
Previously we noted that some of the longitude and latitude values are outside of the NYC area and so are probably invalid.

So as an initial pre-processing step we will get rid of any records that are outside NYC.

We will then plot our pick-up and drop-off points to see if we want to reduce the dataset further to focus on specific areas.

In [None]:
data = get_NYC_records_only(data)

In [None]:
plot_lat_long(data, points='Pickup')


In [None]:
plot_lat_long(data, points='Drop Off')

Interestingly, just by plotting the pick-up & drop-off points we can see parts of the NYC street layout. We can see that we have certain areas of concentration whereas others are quite sparse.

If you want, you can focus your data on a specific area. We can achieve this by specifying the min and max longitudes and latitudes we want to consider.

In [None]:
# TODO: If you want to reduce the dataset further then specify latitude and longitude ranges
# and run this cell
min_long = -74.05
max_long = -73.75

# range of latitude for NYC
min_lat = 40.63
max_lat = 40.85

data = get_records_within_long_lat(data, min_long, max_long, min_lat, max_lat)

# Plot the pickups to show the difference
plot_lat_long(data, points='Pickup')


## Feature Engineering
We will create some new features:
- A distance feature
- Split out the date features into separate features

In [None]:
# Create the Distance feature
def euclidean_distance(lat1, long1, lat2, long2):
    return (((lat1-lat2)**2 + (long1 - long2) ** 2) ** 0.5)

# Now we will create a new feature called distance that uses this method
data['distance'] = euclidean_distance(data['pickup_latitude'], 
                        data['pickup_longitude'],
                       data['dropoff_latitude'],
                       data['dropoff_longitude'])
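
The Euclidean distance above is measured in degrees, and a degree of longitude covers less ground than a degree of latitude at NYC's latitude. If you want a distance in kilometres, one common alternative is the haversine (great-circle) formula. The following is a minimal sketch, not part of the original workbook: the function name `haversine_km` and the Earth radius constant are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, long1, lat2, long2 = map(np.radians, (lat1, long1, lat2, long2))
    dlat = lat2 - lat1
    dlong = long2 - long1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude versus one degree of longitude near NYC
one_deg_lat = haversine_km(40.0, -74.0, 41.0, -74.0)
one_deg_long = haversine_km(40.7, -74.0, 40.7, -73.0)
print(one_deg_lat, one_deg_long)
```

The function also accepts pandas columns, so it could be called with the same four columns as `euclidean_distance` to create a (hypothetical) `distance_km` feature.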

In [None]:
# Split out the date into parts
data['year'] = data['pickup_datetime'].dt.year
data['month'] = data['pickup_datetime'].dt.month
data['day'] = data['pickup_datetime'].dt.day
data['day_of_week'] = data['pickup_datetime'].dt.dayofweek
data['hour'] = data['pickup_datetime'].dt.hour

# Remove the original _pickup_datetime_ feature
data.drop(['pickup_datetime'], axis=1, inplace=True)
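
Features such as `hour` are cyclical: hour 23 is next to hour 0, but as plain numbers they look far apart. One option you may wish to try is to encode them as a sine/cosine pair. This is a sketch of that idea rather than a step from the original workbook; the helper name `cyclical_encode` is an assumption.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values with the given period onto a circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Hours 0, 6, 12 and 23 encoded with a 24-hour period
hour_sin, hour_cos = cyclical_encode([0, 6, 12, 23], period=24)

# After encoding, hour 23 sits much closer to hour 0 than hour 12 does
gap_23_to_0 = np.hypot(hour_sin[3] - hour_sin[0], hour_cos[3] - hour_cos[0])
gap_12_to_0 = np.hypot(hour_sin[2] - hour_sin[0], hour_cos[2] - hour_cos[0])
print(gap_23_to_0, gap_12_to_0)
```

The same helper could be applied to `month` (period 12) or `day_of_week` (period 7).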

### Exercise: Remove unwanted Features
We now want to remove any unwanted features.

In [None]:
data.head()

In [None]:
# TODO: Review the above columns and decide which columns you want to use for training
# If you don't want to remove any columns then don't run this cell
# 'col' is a placeholder - replace it with the names of the columns you want to remove
features_to_remove = ['col']
data.drop(features_to_remove, axis=1, inplace=True)

## Create Training and Testing Splits
We are now in a position to create our Training and Test splits

First we will create our Feature and Target sets (X and y). Then we will use a method from the scikit-learn library to create a random 20% split for the Testing data.

In [None]:
# Split data in Train and Test sets
X = data.loc[:, data.columns != 'fare_amount']
y = data.loc[:, 'fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Normalise our Features
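We can now normalise our Training and Test Features. We will use a TensorFlow utility, `tf.keras.utils.normalize`, to achieve this. Note that with `axis=1` it rescales each row (one journey) to unit length, rather than rescaling each feature column.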

In [None]:
# Normalise the data
X_train = tf.keras.utils.normalize(X_train, axis=1)
X_test = tf.keras.utils.normalize(X_test, axis=1)
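
If you would rather put each feature column on a similar scale, a common alternative is standardisation: subtract the column mean and divide by the column standard deviation, using statistics computed from the training data only. The sketch below uses NumPy with made-up numbers; the helper names are assumptions and this is not the method used in the cell above. `StandardScaler` from scikit-learn, which is already imported at the top of this workbook, performs the same kind of column-wise scaling.

```python
import numpy as np

def fit_standardiser(X_train):
    """Compute per-column mean and standard deviation from the training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardiser(X, mean, std):
    """Scale any dataset using statistics learned from the training data."""
    return (np.asarray(X, dtype=float) - mean) / std

# Made-up data: two features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 200.0]])

mean, std = fit_standardiser(X_train_demo)
X_train_scaled = apply_standardiser(X_train_demo, mean, std)
X_test_scaled = apply_standardiser(X_test_demo, mean, std)
print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data out of the pre-processing step.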

In [None]:
X_train.head()

In [None]:
print("Training Data Shape: {}".format(X_train.shape))

## Build a model
We are now in a position to build a model. We will provide you with a basic structure, but the rest is up to you.

### Exercise: Design your model
Work in your groups to decide on a set of model designs to try out. The key choices you will need to make are:
 - How many units to include in your Input layer
 - How many Hidden layers to build and how many units in each
    - You can specify a new layer using `layers.Dense(units=32, activation='relu')`

Within your group, try shallow networks (e.g. 1 hidden layer) and deeper networks (such as 5 layers). Also try different combinations of units in the layers (typically we choose values such as 8, 16, 32, 64, 128, 256).

In [None]:
model = keras.Sequential()

# Input Layer
# TODO: Specify how many units you want in your input layer
model.add(layers.Dense(units=None, activation='relu', input_dim=X_train.shape[1]))

# Hidden layers
# TODO: Add one or more hidden layers
model.add(None)

# Output Layer
model.add(layers.Dense(1)) # No need to specify activation as we are predicting a value


# We now compile our model with a Loss Function and an Optimizer
optimizer = tf.keras.optimizers.Adam()

model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
model.summary()

## Train your model
We are now in a position to train and evaluate our model.

### Exercise: Epochs and Batch Size
In the cell below we have specified 200 epochs and a batch_size of 64 - you are free to change these if you want.
- epochs: the number of complete passes over the training data
- batch_size: the number of training samples used for each update of the model's weights

In [None]:
history = model.fit(X_train, y_train, epochs=200,
                    validation_split = 0.2, batch_size=64)
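
To get a feel for how these settings interact, you can work out how many weight updates happen in each epoch. The sketch below is simple arithmetic with a made-up sample count; the function name is an assumption, and it assumes the validation samples are set aside first with the remaining count rounded down.

```python
import math

def updates_per_epoch(n_samples, batch_size, validation_split=0.0):
    """Number of batches (weight updates) in one pass over the training data."""
    n_train = math.floor(n_samples * (1.0 - validation_split))
    return math.ceil(n_train / batch_size)

# Made-up example: 1000 samples, 20% held out for validation, batches of 64
steps = updates_per_epoch(n_samples=1000, batch_size=64, validation_split=0.2)
print(steps)        # 13 batches per epoch
print(steps * 200)  # 2600 weight updates over 200 epochs
```

A smaller batch_size gives more updates per epoch but each epoch takes longer to run.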

## Evaluate your Model
We now want to evaluate how good our model is.

First we will look at the training and validation error during training.

In [None]:
plotHistory(history)

Next we will evaluate our model against the test set and work out what the average error is.

In the case of this model, the Mean Absolute Error is the average amount our prediction is out by on a given fare, so the lower the better.

In [None]:
# Error against the test set
test_loss, mean_abs_error, mean_squared_error = model.evaluate(X_test, y_test)
print("Test Mean Absolute Error:", mean_abs_error)
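
A Mean Absolute Error is easier to judge when you have something to compare it with. One simple reference point is a "model" that always predicts the same fare, such as the median training fare. The sketch below shows the calculation on made-up numbers; the helper names `mae` and `constant_baseline_mae` are assumptions and not part of the original workbook.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def constant_baseline_mae(y_train, y_test):
    """MAE on the test fares when always predicting the median training fare."""
    guess = float(np.median(y_train))
    return mae(y_test, np.full(len(y_test), guess))

# Made-up fares for illustration only
example_mae = mae([5.0, 10.0], [6.0, 8.0])
baseline = constant_baseline_mae([4.0, 6.0, 8.0, 10.0, 30.0], [7.0, 9.0, 12.0])
print(example_mae, baseline)
```

If your trained model's error is not clearly lower than this kind of baseline, it has learned little from the features.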

# Testing your model
Testing a model is different to evaluating a model - with evaluation we are measuring how good the model is against the data. With testing we are interested in evaluating risks that might affect the model.

## Exercise
Consider the purpose of our model, which is to predict the fare price for a taxi journey in New York City between two points on a given date and time. We have an average error from our evaluation of the model against the test data. Is this enough to assess the quality of the model?

In this exercise, work in your teams to build an outline Test Strategy/Approach for the model. In particular consider:
- What additional testing would you recommend? 
    - See the section below on Hypothesis Testing for one possible approach
- What Oracles could you use to check your model's predictions against?
- How should you evaluate your predictions?

### Hypothesis Testing
One approach to consider is to think in terms of what Hypotheses you have about the problem domain. These are statements you think should be true about the model's predictions such as:
- the fare for a journey during Peak Traffic times will be higher than the same journey during low Traffic times. 

Sometimes these are known as _Good-old Fashioned Common Sense_ tests - things that ought to be true and we want to test our model to ensure it is consistent with these views.

If you think about the problem domain you can probably come up with a number of such hypotheses.

For each Hypothesis you can then consider how you would test to disprove it. For example:
- if we pick random journeys and predict the fare during high and low traffic periods, and the low traffic fares are similar to or higher than the high traffic fares, then we have evidence against the hypothesis.

The trick here is to think about how to disprove your Hypothesis rather than try to prove it - the latter can lead to confirmation bias.

When we are evaluating our tests we need to consider how we detect if a problem has occurred. For example, if we predict a low traffic fare as $5.00 and the high traffic fare as $5.10, is that a sufficient difference to accept as evidence for our hypothesis?
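
One way to make that decision explicit is to agree a minimum difference up front and then count how many paired predictions meet it. The sketch below is a possible starting point, not a prescribed method: the function name, the $0.50 margin and the fares are all made up for illustration.

```python
def hypothesis_support(peak_fares, off_peak_fares, min_margin=0.50):
    """Fraction of journeys where the peak fare exceeds the off-peak fare by at least min_margin."""
    if len(peak_fares) != len(off_peak_fares):
        raise ValueError("Each peak fare needs a matching off-peak fare")
    if len(peak_fares) == 0:
        raise ValueError("At least one pair of fares is required")
    supporting = sum(
        1 for peak, off_peak in zip(peak_fares, off_peak_fares)
        if peak - off_peak >= min_margin
    )
    return supporting / len(peak_fares)

# Made-up predictions for the same four journeys at peak and off-peak times
peak = [5.10, 9.00, 12.50, 7.00]
off_peak = [5.00, 8.00, 11.00, 7.20]
support = hypothesis_support(peak, off_peak, min_margin=0.50)
print(support)
```

A low fraction is evidence against the hypothesis; what counts as "low", and what margin is meaningful, are decisions for your test strategy.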


## Exercise
Based on your test approach, select some test ideas that you think the model is most likely to get wrong, then construct and run some tests using the harness below.

Evaluate the results and decide if there is a problem or not.

#### Prediction Test Harness
In the harness below you specify one or more sets of values to test; then run the cell to see the results.

In [None]:
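# Each entry is one journey: a list of feature values in the same column order as X_train,
# prepared in the same way as the training data (including normalisation)
test_values = [
        
]
# Convert to a NumPy array so the model receives a 2-D batch of samples
predictions = model.predict(np.array(test_values))
print(predictions)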

# Summary and Observations
You have now taken a real-world set of data, made decisions about pre-processing and feature engineering and then used that data to train a model.


## Further Questions
- Do you think further Feature Engineering could improve our model (i.e. generate a lower Mean Absolute Error)?
- How did shallow networks perform against deeper networks?
- Do you think the Model is Overfitting the training data?