# Tutorial: Build a regression model with automated machine learning and Open Datasets

In this tutorial, you leverage the convenience of Azure Open Datasets along with the power of Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. Easily download publicly available taxi, holiday and weather data, and configure an automated machine learning experiment using Azure Machine Learning service. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

In this tutorial, you learn how to:

* Configure an Azure Machine Learning service workspace
* Set up a local Python environment
* Access, transform, and join data using Azure Open Datasets
* Train an automated machine learning regression model
* Calculate model accuracy

## Prerequisites

This tutorial requires the following prerequisites.

* An Azure Machine Learning service workspace
* A Python 3.6 environment 

### Create a workspace

Follow the [instructions](https://docs.microsoft.com/azure/machine-learning/service/setup-create-workspace#portal) to create a workspace through the Azure portal, if you don't already have one. After creation, make note of your workspace name, resource group name, and subscription ID.

### Create a Python environment

This example uses an Anaconda environment with Jupyter notebooks, but you can run this code in any Python 3.6.x environment and with any text editor or IDE. Use the following steps to create a new development environment.

1. If you don't already have it, [download](https://www.anaconda.com/distribution/) and install Anaconda, and choose the **Python 3.7 version** of the installer. The environment you create in the next step still uses Python 3.6.
1. Open an Anaconda prompt and create a new environment. It will take several minutes to create the environment while components and packages are downloaded.
```
conda create -n tutorialenv python=3.6.5
```
1. Activate the environment.
```
conda activate tutorialenv
```
1. Enable environment-specific ipython kernels.
```
conda install notebook ipykernel
```
1. Create the kernel.
```
ipython kernel install --user
```
1. Install the packages you need for this tutorial. These packages are large and will take 5-10 minutes to install.
```
pip install azureml-sdk[automl] azureml-contrib-opendatasets
```
1. Start a notebook kernel from your environment.
```
jupyter notebook
```

After you complete these steps, clone the [repo](https://github.com/Azure/OpenDatasetsNotebooks) and open the **tutorials/taxi-automl/01-tutorial-opendatasets-automl.ipynb** notebook to run it.

## Download and prepare data

Import the necessary packages. The Open Datasets package contains a class representing each data source (`NycTlcGreen` for example) to easily filter date parameters before downloading.

In [None]:
from azureml.contrib.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid `MemoryError` with large datasets. To download a year of taxi data, fetch one month at a time. Before appending each month to `green_taxi_df`, randomly sample 2,000 records from it to avoid bloating the dataframe. Then preview the data.

Note: Open Datasets has mirroring classes for working in Spark environments where data size and memory aren't a concern.

In [None]:
green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2016","%m/%d/%Y")
end = datetime.strptime("1/31/2016","%m/%d/%Y")

for sample_month in range(12):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(2000)])
    
green_taxi_df.head(10)
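
To see which date windows the loop above requests, the following standard-library sketch reproduces the month arithmetic without downloading anything. It relies on the fact that adding months to January 31 with `relativedelta` clamps to the last valid day of the target month, so each window ends on that month's final day. `month_windows` is a hypothetical helper name, not part of Open Datasets.

```python
import calendar
from datetime import datetime


def month_windows(year, months=12):
    """Return (start, end) datetimes covering each calendar month of `year`."""
    windows = []
    for month in range(1, months + 1):
        # monthrange returns (weekday of the first day, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        windows.append((datetime(year, month, 1), datetime(year, month, last_day)))
    return windows


windows = month_windows(2016)
for start, end in windows[:3]:
    print(start.date(), "to", end.date())
```

Because 2016 is a leap year, the February window ends on the 29th.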

Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. The function also adds a static feature for the country code to join holiday data. Use the `apply()` function on the dataframe to iteratively apply the `build_time_features()` function to each row in the taxi data.

In [None]:
def build_time_features(vector):
    pickup_datetime = vector[0]
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour
    country_code = "US"
    
    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day, country_code))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day", "country_code"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)
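
Row-wise `apply()` is easy to read but can be slow on large frames. As an alternative sketch (not the approach this notebook uses), the same features can be derived with the vectorized `.dt` accessor. The small `sample_df` below is made-up data standing in for `green_taxi_df`.

```python
import pandas as pd

# Made-up pickup times standing in for the real taxi data
sample_df = pd.DataFrame({
    "lpepPickupDatetime": pd.to_datetime(["2016-01-01 08:15:00", "2016-07-04 23:30:00"])
})

pickup = sample_df["lpepPickupDatetime"].dt
sample_df["month_num"] = pickup.month
sample_df["day_of_month"] = pickup.day
sample_df["day_of_week"] = pickup.dayofweek  # Monday=0, same convention as datetime.weekday()
sample_df["hour_of_day"] = pickup.hour
sample_df["country_code"] = "US"  # static key used later for the holiday join

print(sample_df)
```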

Remove some of the columns that you won't need for modeling or additional feature building. Rename the time field for pickup time, and additionally convert the time to midnight using `pandas.Series.dt.normalize`. You do this to all time features so that the datetime component can be later used as a key when joining datasets together at a daily level of granularity.

In [None]:
columns_to_remove = ["lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID", 
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)
    
green_taxi_df = green_taxi_df.rename(columns={"lpepPickupDatetime": "datetime"})
green_taxi_df["datetime"] = green_taxi_df["datetime"].dt.normalize()
green_taxi_df.head(5)
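
As a quick illustration of what `dt.normalize()` does, this sketch applies it to two made-up timestamps from the same day. Both collapse to midnight, which is what lets them match a daily join key.

```python
import pandas as pd

# Two made-up pickup times on the same calendar day
times = pd.Series(pd.to_datetime(["2016-03-14 07:45:10", "2016-03-14 22:05:00"]))
normalized = times.dt.normalize()

# Both rows now hold the same value: midnight on 2016-03-14
print(normalized)
```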

### Enrich with holiday data

Now that you have taxi data downloaded and roughly prepared, add in holiday data as additional features. Holiday-specific features will assist model accuracy, as major holidays are times where taxi demand increases dramatically and supply becomes limited. The holiday dataset is relatively small, so fetch the full set by using the `PublicHolidays` class constructor with no parameters for filtering. Preview the data to check the format.

In [None]:
from azureml.contrib.opendatasets import PublicHolidays
# call default constructor to download full dataset
holidays_df = PublicHolidays().to_pandas_dataframe()
holidays_df.head(5)

Rename the `countryRegionCode` and `date` columns to match the respective field names from the taxi data, and also normalize the time so it can be used as a key. Next, join the holiday data with the taxi data by performing a left-join using the Pandas `merge()` function. This will preserve all records from `green_taxi_df`, but add in holiday data where it exists for the corresponding `datetime` and `country_code`, which in this case is always `"US"`. Preview the data to verify that they were merged correctly.

In [None]:
holidays_df = holidays_df.rename(columns={"countryRegionCode": "country_code", "date": "datetime"})
holidays_df["datetime"] = holidays_df["datetime"].dt.normalize()
holidays_df.pop("countryOrRegion")
holidays_df.pop("holidayName")

taxi_holidays_df = pd.merge(green_taxi_df, holidays_df, how="left", on=["datetime", "country_code"])
taxi_holidays_df.head(5)
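
The effect of the left join can be seen on a toy example. The rows and values below are made up for illustration, and the holiday column name is an assumption about what remains after the two `pop()` calls. Every trip is kept, and holiday fields are `NaN` on days with no matching holiday.

```python
import pandas as pd

# Made-up trips: one on an ordinary day, one on a holiday
trips = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-07-03", "2016-07-04"]),
    "country_code": ["US", "US"],
    "totalAmount": [12.5, 30.0],
})
# Made-up holiday table with the same key columns
holidays = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-07-04"]),
    "country_code": ["US"],
    "normalizeHolidayName": ["Independence Day"],
})

# how="left" keeps every row of `trips`, matched or not
merged = pd.merge(trips, holidays, how="left", on=["datetime", "country_code"])
print(merged)
```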

### Enrich with weather data

Now you append NOAA surface weather data to the taxi and holiday data. Use a similar approach to fetch the weather data by downloading one month at a time iteratively. Additionally, specify the `cols` parameter with an array of strings to filter the columns you want to download. This is a very large dataset containing weather surface data from all over the world, so before appending each month, filter the lat/long fields to near NYC using the `query()` function on the dataframe. This will ensure the `weather_df` doesn't get too large.

In [None]:
from azureml.contrib.opendatasets import NoaaIsdWeather

weather_df = pd.DataFrame([])
start = datetime.strptime("1/1/2016","%m/%d/%Y")
end = datetime.strptime("1/31/2016","%m/%d/%Y")

for sample_month in range(12):
    tmp_df = NoaaIsdWeather(cols=["temperature", "precipTime", "precipDepth", "snowDepth"], start_date=start + relativedelta(months=sample_month), end_date=end + relativedelta(months=sample_month))\
        .to_pandas_dataframe()
    print("--weather downloaded--")
    
    # filter out coordinates not in NYC to conserve memory
    tmp_df = tmp_df.query("latitude>=40.53 and latitude<=40.88")
    tmp_df = tmp_df.query("longitude>=-74.09 and longitude<=-73.72")
    print("--filtered coordinates--")
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    weather_df = pd.concat([weather_df, tmp_df])
    
weather_df.head(10)
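
The bounding-box filter can also be written with boolean masks instead of `query()`. The sketch below uses made-up station rows and `Series.between()`, which is inclusive on both ends by default and so matches the `>=` and `<=` comparisons above. It is an alternative form, not what the notebook runs.

```python
import pandas as pd

# Made-up weather stations: two near NYC, one far away
stations = pd.DataFrame({
    "latitude": [40.78, 51.48, 40.64],
    "longitude": [-73.97, 0.0, -73.78],
    "temperature": [2.0, 8.0, 3.5],
})

in_lat = stations["latitude"].between(40.53, 40.88)
in_lon = stations["longitude"].between(-74.09, -73.72)

# Keep only rows inside both the latitude and longitude bounds
nyc_stations = stations[in_lat & in_lon]
print(nyc_stations)
```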

Again call `pandas.Series.dt.normalize` on the `datetime` field in the weather data so it matches the time key in `taxi_holidays_df`. Delete the unneeded columns, and filter out records where the temperature is `NaN`. 

Next, group the weather data so that you have daily aggregated weather values. Create a dict named `aggregations` that specifies how to aggregate each field at a daily level. For `snowDepth` and `temperature` take the mean, and for `precipTime` and `precipDepth` take the daily maximum. Use the `groupby()` function along with the aggregations to group the data. Preview the data to ensure there is one record per day.

In [None]:
weather_df["datetime"] = weather_df["datetime"].dt.normalize()
weather_df.pop("usaf")
weather_df.pop("wban")
weather_df.pop("longitude")
weather_df.pop("latitude")

# filter out NaN
weather_df = weather_df.query("temperature==temperature")

# group by datetime
aggregations = {"snowDepth": "mean", "precipTime": "max", "temperature": "mean", "precipDepth": "max"}
weather_df_grouped = weather_df.groupby("datetime").agg(aggregations)
weather_df_grouped.head(10)
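
To make the per-column aggregation concrete, here is a small sketch on made-up readings. Two readings on the same day collapse into one row, with the mean used for temperature and the maximum for precipitation depth.

```python
import pandas as pd

# Made-up readings: two on January 1, one on January 2
readings = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-02"]),
    "temperature": [1.0, 3.0, 5.0],
    "precipDepth": [0.0, 4.0, 2.0],
})

# Each column gets its own aggregation function
daily = readings.groupby("datetime").agg({"temperature": "mean", "precipDepth": "max"})
print(daily)
```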

Note: The examples in this tutorial merge data using Pandas functions and custom aggregations, but the Open Datasets SDK has classes designed to easily merge and enrich data sets. See the [notebook](https://github.com/Azure/OpenDatasetsNotebooks/blob/master/tutorials/data-join/04-nyc-taxi-join-weather-in-pandas.ipynb) for code examples of these design patterns.

### Cleanse data 

Merge the taxi and holiday data you prepared with the new weather data. This time you only need the `datetime` key, and again perform a left-join of the data. Run the `describe()` function on the new dataframe to see summary statistics for each field.

In [None]:
taxi_holidays_weather_df = pd.merge(taxi_holidays_df, weather_df_grouped, how="left", on=["datetime"])
taxi_holidays_weather_df.describe()

From the summary statistics, you see that several fields have outliers or values that will reduce model accuracy. First, filter the lat/long fields to be within the same bounds you used for filtering weather data. The `tripDistance` field has some bad data, because the minimum value is negative. The `passengerCount` field has bad data as well, with the max value being 210 passengers. Lastly, the `totalAmount` field has negative values, which don't make sense in the context of our model.

Filter out these anomalies using query functions, and then remove the last few columns that are unnecessary for training.

In [None]:
final_df = taxi_holidays_weather_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>0 and tripDistance<75")
final_df = final_df.query("passengerCount>0 and passengerCount<100")
final_df = final_df.query("totalAmount>0")

columns_to_remove_for_training = ["datetime", "pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude", "country_code"]
for col in columns_to_remove_for_training:
    final_df.pop(col)
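
The same kind of cleansing can be checked on a few made-up rows. This sketch chains the value filters and removes a column with `drop(columns=...)` rather than a `pop()` loop. It is an equivalent style, not the notebook's own code.

```python
import pandas as pd

# Made-up trips: only the first row is valid on every field
trips = pd.DataFrame({
    "tripDistance": [2.5, -1.0, 3.0, 1.2],
    "passengerCount": [1, 2, 210, 0],
    "totalAmount": [12.0, 8.0, 15.0, -5.0],
    "country_code": ["US"] * 4,
})

cleaned = (
    trips.query("tripDistance > 0 and tripDistance < 75")
         .query("passengerCount > 0 and passengerCount < 100")
         .query("totalAmount > 0")
         .drop(columns=["country_code"])
)
print(cleaned)
```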

Call `describe()` again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi, holiday, and weather data to use for machine learning model training.

In [None]:
final_df.describe()

## Train a model

Now you use the prepared data to train an automated machine learning model. Start by splitting `final_df` into features (X values) and labels (y value), which for this model is the taxi fare cost.

In [None]:
y_df = final_df.pop("totalAmount")
x_df = final_df

Now you split the data into training and test sets by using the `train_test_split()` function in the `scikit-learn` library. The `test_size` parameter sets the fraction of data to allocate to testing (here 0.2, or 20%). The `random_state` parameter seeds the random number generator, so that your train-test splits are deterministic.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=222)
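
The following sketch uses made-up arrays to show both parameters at work. With 10 rows and `test_size=0.2`, two rows go to the test set, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 made-up rows with 2 features each
y = np.arange(10)

X_a_train, X_a_test, y_a_train, y_a_test = train_test_split(X, y, test_size=0.2, random_state=222)
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(X, y, test_size=0.2, random_state=222)

print(len(X_a_train), len(X_a_test))       # 8 training rows, 2 test rows
print(np.array_equal(y_a_test, y_b_test))  # identical seed gives an identical split
```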

### Load workspace and configure experiment

Load your Azure Machine Learning service workspace using the `get()` function with your subscription and workspace information. Create an experiment within your workspace to store and monitor your model runs.

In [None]:
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment

workspace = Workspace.get(subscription_id="<your-subscription-id>", name="<your-workspace-name>", resource_group="<your-resource-group>")
experiment = Experiment(workspace, "opendatasets-ml")

Create a configuration object for the experiment using the `AutoMLConfig` class. You attach your training data, and additionally specify settings and parameters that control the training process. The parameters have the following purposes:

* `task`: the type of experiment to run.
* `X`: training features.
* `y`: training labels.
* `iterations`: number of iterations to run. Each iteration tries combinations of different feature normalization/standardization methods, and different models using multiple hyperparameter settings.
* `primary_metric`: primary metric to optimize during model training. Best fit model will be chosen based on this metric.
* `preprocess`: controls whether the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)
* `n_cross_validations`: Number of cross-validation splits to perform when validation data is not specified.

In [None]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task="regression", 
                             X=X_train.values, 
                             y=y_train.values.flatten(),
                             iterations=20,
                             primary_metric="spearman_correlation",
                             preprocess=True,
                             n_cross_validations=5
                            )

### Submit experiment

Submit the experiment for training. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing the defined accuracy metric. Pass the `automl_config` object to the experiment. Set `show_output` to `True` to view progress during the experiment.

After submitting the experiment you see live output for the training process. For each iteration, you see the model type and feature normalization/standardization method, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type.

In [None]:
training_run = experiment.submit(automl_config, show_output=True)

### Retrieve the fitted model

At the end of all training iterations, the automated machine learning process creates an ensemble algorithm from all individual runs, either with bagging or stacking. Retrieve the fitted ensemble into the variable `fitted_model`, and the best individual run into the variable `best_run`.

In [None]:
best_run, fitted_model = training_run.get_output()
print(best_run)
print(fitted_model)

## Test model accuracy

Use the fitted ensemble model to run predictions on the test dataset to predict taxi fares. The function `predict()` uses the fitted model and predicts the values of y, taxi fare cost, for the `X_test` dataset.

In [None]:
y_predict = fitted_model.predict(X_test.values)

Calculate the root mean squared error of the results. Take the `y_test` values and convert them to a list, `y_actual`, to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares, while heavily weighting large errors.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse
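
A worked example with three made-up fares shows why RMSE weights large errors heavily. The single 6-dollar miss dominates the squared errors, so RMSE comes out higher than the plain mean absolute error on the same data.

```python
from math import sqrt

actual = [10.0, 20.0, 30.0]     # made-up actual fares
predicted = [12.0, 18.0, 36.0]  # made-up predictions

# Squared errors are 4, 4, and 36
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse_example = sqrt(mse)

# Mean absolute error treats every dollar of error equally
mean_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(round(rmse_example, 3), round(mean_abs_error, 3))
```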

Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` datasets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values.

In [None]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
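
The calculation above divides the total absolute error by the total of the actual values, a variant often called weighted MAPE. The sketch below, using made-up numbers and hypothetical function names, contrasts it with the per-row form that averages each row's own percentage error. The two can differ noticeably when errors fall on small fares.

```python
def weighted_mape(actual, predicted):
    """Sum of absolute errors divided by the sum of actual values."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / sum(actual)


def per_row_mape(actual, predicted):
    """Mean of each row's absolute error relative to its own actual value."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 20.0, 30.0]     # made-up actual fares
predicted = [15.0, 20.0, 33.0]  # made-up predictions

weighted = weighted_mape(actual, predicted)  # (5 + 0 + 3) / 60
per_row = per_row_mape(actual, predicted)    # (0.5 + 0.0 + 0.1) / 3

print(round(weighted, 4), round(per_row, 4))
```

The per-row form is undefined when any actual value is zero. The cleansing step earlier keeps only rows where `totalAmount` is greater than 0, so that case does not arise in this dataset.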

Given that we used a fairly small sample of data relative to the full dataset (n=11748), model accuracy is fairly high at 85%, with an RMSE of around ±$4.00 in predicting taxi fare price. As a potential next step to improve accuracy, go back to the second cell of this notebook, increase the sample size from 2,000 records per month, and run the entire experiment again to retrain the model with more data.

## Clean up resources

If you don't plan to use the resources you created, delete them, so you don't incur any charges.

1. In the Azure portal, select **Resource groups** on the far left.
1. From the list, select the resource group you created.
1. Select **Delete resource group**.
1. Enter the resource group name. Then select **Delete**.

## Next steps

* See the Azure Open Datasets [notebooks](https://github.com/Azure/OpenDatasetsNotebooks) for more code examples.
* Follow the [how-to](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train) for more information on automated machine learning in Azure Machine Learning service.