Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Want to *actually* do machine learning? 
## Part 1: Ingest and wrangle data

*Made for Microsoft Build 2019*

This is the first in a series that walks through how Azure Machine Learning service can speed up your machine learning modeling workflow so you can focus on the interesting tasks that matter.

**Goal:**
In this notebook, we'll prepare our data for regression modeling. We'll use a small sample of NYC yellow taxicab data stored in Azure Blob Storage. Our end goal is to wrangle the data so we can use it to predict the cost of a taxi trip.

In particular, we'll showcase:
- **Azure ML Datasets, from the Azure ML Python SDK**, which allow you to
    - connect to Azure storage accounts where your data is, maintain data lineages, take snapshots of data for auditability and reproducibility, and version the set of transformations applied to wrangle the data.
- (optionally) **Azure ML Data Prep SDK**, a companion toolkit to the Python SDK which allows you to
    - wrangle data locally and at scale using the same code artifact, and prepare your data before you have an Azure subscription.

### Pre-requisites

In order to run this notebook, you need:
- A Python 3.6 notebook server with the following installed:
    - Azure Machine Learning SDK for Python
    - (optionally) Azure Machine Learning Data Prep SDK for Python
- An Azure subscription, an Azure ML workspace, and sample data uploaded to an Azure Blob Storage container
    - If you don't have these yet, go to our [configuration](0_configuration.ipynb) notebook to get them set up.

In [7]:
from azureml.core import Workspace, Dataset
import azureml.dataprep as dprep
import numpy as np

In [10]:
# Retrieve an existing workspace
#ws = Workspace.from_config()

subscription_id = "db74d2db-c6a4-4287-b984-fd05ac72cf42"
resource_group = "omg_aml_testing"
workspace_name = "actuallydemosung"
workspace_region = "West Europe"

from azureml.core import Workspace

try:
    ws = Workspace(
        subscription_id = subscription_id, 
        resource_group = resource_group, 
        workspace_name = workspace_name
    )
    print("Workspace configuration succeeded. Skip the workspace creation steps below. subscription_id = "+subscription_id)
except Exception:
    print("Workspace not accessible. Change your parameters or create a new workspace below")

Workspace configuration succeeded. Skip the workspace creation steps below. subscription_id = db74d2db-c6a4-4287-b984-fd05ac72cf42


### Outline
1. [Ingest data](#Ingest-data). Reading in data and parsing file formats can be tricky and frustrating when you encounter issues since it's the first thing you need to do when you start building ML models. We'll show you how Azure ML unifies and simplifies data ingestion across all kinds of file formats.
1. [Sample data](#Sample-data). We're often working with very large data sets, which can be unwieldy when trying to quickly iterate and wrangle. We'll show how easy it is to create a sample of your Dataset so you can experiment on that before applying to your full Dataset.
1. [Cleanse data](#Cleanse-data). You can use your favorite open-source library to wrangle your data, or you can try out an SDK we built to help make wrangling easier regardless of your data's size.
1. [Apply changes to full data](#Apply-changes-to-full-data). Now that you've created your data preparation script on your sampled Dataset, we'll apply that script to our full Dataset to make it ML-ready.

### Ingest data

Datasets makes it incredibly easy to use data in Azure storage for machine learning. It handles some of the most painful parts of getting data: figuring out how to ingest your data, managing credentials, and understanding file formats.

The connection information for our Azure storage account is stored in our workspace's datastore, so we'll ask Datasets to use that information to access our data. Oftentimes, data comes stored across different files that we need to coalesce into one. In our case, each month's data is stored in a separate CSV file. We'll use Datasets to easily append them together with a simple globbing command:

In [22]:
datastore = ws.get_default_datastore()
dataset = Dataset.auto_read_files(path=datastore.path('yellow_tripdata_2018-01.csv'))

dataset.head(5)

ExecutionError: Failed to read files and get data source properties suggestions
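
To make the idea of globbing concrete outside of Azure ML, here is a minimal plain-Python sketch. It is not the Datasets implementation: the in-memory `files` dictionary, the `append_matching` helper, and the `yellow_tripdata_2018-*.csv` pattern are all hypothetical stand-ins for the monthly files in Blob Storage.

```python
import csv
import fnmatch
import io

# Hypothetical in-memory "files" (name -> CSV text) standing in for monthly blobs.
files = {
    "yellow_tripdata_2018-01.csv": "VendorID,trip_distance\n1,10.2\n1,0.7\n",
    "yellow_tripdata_2018-02.csv": "VendorID,trip_distance\n2,2.32\n",
    "readme.txt": "not a data file\n",
}

def append_matching(files, pattern):
    """Concatenate the rows of every CSV whose name matches the glob pattern."""
    rows = []
    for name in sorted(fnmatch.filter(files, pattern)):
        rows.extend(csv.DictReader(io.StringIO(files[name])))
    return rows

# One pattern selects both monthly files and skips everything else.
trips = append_matching(files, "yellow_tripdata_2018-*.csv")
print(len(trips))
```

Sorting the matched names keeps the appended rows in month order, which makes the result reproducible from run to run.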

We'll also register our Dataset to our workspace, which enables any collaborator in that workspace to access our common data artifacts. This makes it easier to work within teams by sharing the right data with the right permissions:

In [4]:
dataset = dataset.register(
    workspace=ws,
    name='nyc_taxi_full',
    description='NYC yellow taxicab data during the first half of 2018.',
    tags={'year':'2018', 'taxi_type':'yellow', 'status':'raw'},
    exist_ok=True,
    update_if_exist=True
)

In [5]:
# run on another compute to make it async
dataset.generate_profile(compute_target='jenren-profiler')

<azureml.data.dataset_action_run.DatasetActionRun at 0x7f718a627b00>

### Sample data

But as we know, data rarely comes cleaned and ML-ready! Immediately, we notice that the first row contains all null values so we likely want to filter that entry out. There might also be other data quality issues that we might need to address. Rather than experimenting on our full Dataset which can take much longer, we'll start with a sampled Dataset, figure out how we need to wrangle it, then apply those transformations to our full Dataset.

In [6]:
dataset_sampled = dataset.sample(
    sample_strategy='simple_random', 
    arguments={'probability':0.25, 'seed': 123})

dataset_sampled.head(5)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,1,2018-01-01 00:20:22,2018-01-01 00:52:51,1,10.2,1,False,140,257,2,33.5,0.5,0.5,0.0,0.0,0.3,34.8
1,1,2018-01-01 00:17:04,2018-01-01 00:22:24,1,0.7,1,False,170,170,2,5.5,0.5,0.5,0.0,0.0,0.3,6.8
2,1,2018-01-01 00:24:42,2018-01-01 00:31:56,2,0.7,1,False,170,162,2,6.0,0.5,0.5,0.0,0.0,0.3,7.3
3,2,2018-01-01 00:31:23,2018-01-01 00:45:38,1,2.32,1,False,186,231,1,11.0,0.5,0.5,3.08,0.0,0.3,15.38
4,2,2018-01-01 00:47:03,2018-01-01 01:26:24,1,9.49,1,False,231,116,1,35.0,0.5,0.5,9.08,0.0,0.3,45.38
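
As a rough mental model of what a `simple_random` strategy with a `probability` and a `seed` does, here is a plain-Python sketch. It is an assumption about the general technique (independent per-row sampling), not a description of how Azure ML implements it, and `simple_random_sample` is a hypothetical helper.

```python
import random

def simple_random_sample(rows, probability, seed):
    """Keep each row independently with the given probability."""
    # A dedicated generator means the seed alone determines the sample.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]

rows = list(range(1000))
sample_a = simple_random_sample(rows, probability=0.25, seed=123)
sample_b = simple_random_sample(rows, probability=0.25, seed=123)

# Same seed, same sample: experiments on the sample are repeatable.
print(sample_a == sample_b, len(sample_a))
```

Because each row is kept independently, the sample size is only approximately 25% of the input; fixing the seed is what makes the sample reproducible.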


### Cleanse data

Because we care about openness and interoperability, we made it possible for you to use the open-source libraries you already know and love. From here, you can convert your Dataset to a pandas or Spark dataframe to wrangle your data:

In [7]:
df_sampled = dataset_sampled.to_pandas_dataframe()

*Bonus:* We created a data preparation SDK to help you wrangle your data. With it, you can focus on writing one data wrangling script that can seamlessly transition between local, scale-up, and scale-out runtimes without needing to do costly rewrites.

> You absolutely don't need to use it if you already have tools that you know and love; we're showing it here in case you'd like to try it out and realize you'd love a solution that doesn't require you to rewrite scripts when you need to scale to larger data sets. You can imagine substituting our wrangling script using the [Azure ML Data Prep SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/dataprep/intro?view=azure-dataprep-py) with your own pandas, dplyr, or Spark code instead.

In [8]:
dflow_sampled = dprep.read_pandas_dataframe(df_sampled, temp_folder='temp')

Now, we want to filter out rows that seem to be completely missing, like the first row with `NA` strings in all columns. We'll then keep only the useful columns for modeling: 

In [9]:
all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "fare_amount", "distance", "pickup_region", "dropoff_region",
    "passenger_count", "pickup_datetime", "vendor", "payment_type"
]

dflow_sampled = (dflow_sampled
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "tpep_pickup_datetime": "pickup_datetime",
        "trip_distance": "distance",
        "PULocationID": "pickup_region",
        "DOLocationID": "dropoff_region"
    })
    .keep_columns(columns=useful_columns))

dflow_sampled.head(5)

Unnamed: 0,vendor,pickup_datetime,passenger_count,distance,pickup_region,dropoff_region,payment_type,fare_amount
0,1,2018-01-08 21:13:45,4,2.9,137,142,1,13.5
1,2,2018-01-08 21:06:43,1,2.05,164,239,1,9.0
2,2,2018-01-08 21:32:34,1,9.03,138,238,1,27.0
3,2,2018-01-08 21:56:39,1,1.16,238,151,2,7.0
4,1,2018-01-08 21:52:00,1,0.8,68,246,2,5.0


We also notice that we have a lot of information in our `pickup_datetime` column that could be split out into multiple features, which might improve our model's performance. We'll use an intelligent transformation that attempts to split our column using example input-outputs to learn:

In [10]:
dflow_sampled = (dflow_sampled
    .split_column_by_example(
        source_column="pickup_datetime",
        example=("2009-01-04 02:52:00", ["2009-01-04", "02:52:00"])
    )
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time"
    }))

dflow_sampled.head(5)

Unnamed: 0,vendor,pickup_datetime,pickup_date,pickup_time,passenger_count,distance,pickup_region,dropoff_region,payment_type,fare_amount
0,1,2018-01-08 21:13:45,2018-01-08,21:13:45,4,2.9,137,142,1,13.5
1,2,2018-01-08 21:06:43,2018-01-08,21:06:43,1,2.05,164,239,1,9.0
2,2,2018-01-08 21:32:34,2018-01-08,21:32:34,1,9.03,138,238,1,27.0
3,2,2018-01-08 21:56:39,2018-01-08,21:56:39,1,1.16,238,151,2,7.0
4,1,2018-01-08 21:52:00,2018-01-08,21:52:00,1,0.8,68,246,2,5.0
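
To give a feel for what "learning from an example" can mean, here is a deliberately simple toy: it reads the literal delimiters out of one input-output example and reuses them on new values. This is only a sketch of the idea, not how the SDK learns its transformation, and `learn_splitter` is a hypothetical helper.

```python
def learn_splitter(source, parts):
    """Learn literal delimiters from one example and return a split function."""
    delimiters = []
    position = 0
    for index, part in enumerate(parts):
        start = source.index(part, position)
        if index > 0:
            # Whatever sits between two consecutive parts is the delimiter.
            delimiters.append(source[position:start])
        position = start + len(part)

    def split(value):
        pieces, rest = [], value
        for delimiter in delimiters:
            head, rest = rest.split(delimiter, 1)
            pieces.append(head)
        pieces.append(rest)
        return pieces

    return split

split_datetime = learn_splitter("2009-01-04 02:52:00", ["2009-01-04", "02:52:00"])
split_date = learn_splitter("2009-01-04", ["2009", "01", "04"])

print(split_datetime("2018-01-08 21:13:45"))
print(split_date("2018-01-08"))
```

The toy breaks down as soon as parts are adjacent with no delimiter between them or the format varies from row to row, which is where a learned transformation earns its keep.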


We can also extract more information as features to feed our ML model. For example, let's use another intelligent transform which creates a column by interpreting what day of the week a date is. We'll also continue splitting our date and time columns to get additional features:

In [11]:
dflow_sampled = (dflow_sampled
    # Add day of the week
    .derive_column_by_example(
        source_columns="pickup_date",
        new_column_name="pickup_weekday",
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    # Add year, month, and day of the month
    .split_column_by_example(
        source_column="pickup_date",
        example=("2009-01-04", ["2009", "01", "04"])
    )
    # Add hour, minute, and second
    .split_column_by_example(
        source_column="pickup_time",
        example=("02:52:58", ["02", "52", "58"])
    )
    # Tidy our dataflow
    .drop_columns(columns=[
        "pickup_datetime", "pickup_date", "pickup_time"
    ])
    .rename_columns(column_pairs={
        "pickup_date_1": "pickup_year",
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1": "pickup_hour",
        "pickup_time_2": "pickup_minute",
        "pickup_time_3": "pickup_second"
    }))

dflow_sampled.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent Missing,Error Count,Empty Count,Unique Values,0.1% Quantile (est.),1% Quantile (est.),5% Quantile (est.),25% Quantile (est.),50% Quantile (est.),75% Quantile (est.),95% Quantile (est.),99% Quantile (est.),99.9% Quantile (est.),Mean,Standard Deviation,Variance,Skewness,Kurtosis
vendor,FieldType.INTEGER,1,4,13483329.0,0.0,13483329.0,0.0,0.0,0.0,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.56641,0.495813,0.24583,-0.263076,-1.90696
pickup_year,FieldType.STRING,2001,2084,13483329.0,0.0,13483329.0,0.0,0.0,0.0,12.0,,,,,,,,,,,,,,
pickup_month,FieldType.STRING,01,12,13483329.0,0.0,13483329.0,0.0,0.0,0.0,12.0,,,,,,,,,,,,,,
pickup_monthday,FieldType.STRING,01,31,13483329.0,0.0,13483329.0,0.0,0.0,0.0,31.0,,,,,,,,,,,,,,
pickup_weekday,FieldType.STRING,Friday,Wednesday,13483329.0,0.0,13483329.0,0.0,0.0,0.0,7.0,,,,,,,,,,,,,,
pickup_hour,FieldType.STRING,00,23,13483329.0,0.0,13483329.0,0.0,0.0,0.0,24.0,,,,,,,,,,,,,,
pickup_minute,FieldType.STRING,00,59,13483329.0,0.0,13483329.0,0.0,0.0,0.0,60.0,,,,,,,,,,,,,,
pickup_second,FieldType.STRING,00,59,13483329.0,0.0,13483329.0,0.0,0.0,0.0,60.0,,,,,,,,,,,,,,
passenger_count,FieldType.INTEGER,0,9,13483329.0,0.0,13483329.0,0.0,0.0,0.0,10.0,0.0,1.0,1.0,1.0,1.0,2.0,5.0,6.0,6.0,1.60003,1.24791,1.55728,2.2482,4.20344
distance,FieldType.DECIMAL,0,189484,13483329.0,0.0,13483329.0,0.0,0.0,0.0,,0.0,0.603187,0.600428,0.956016,1.60301,2.97954,10.7442,18.9668,26.9457,2.91042,51.7372,2676.74,3643.26,13343000.0
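
As a point of comparison for the by-example transforms above, the same features can be derived with Python's standard `datetime` module when the timestamp format is known up front. This is a sketch under that assumption (`%Y-%m-%d %H:%M:%S`); `pickup_features` and `WEEKDAYS` are hypothetical names.

```python
from datetime import datetime

# Explicit names avoid depending on the process locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def pickup_features(pickup_datetime):
    """Derive weekday and date/time components from a timestamp string."""
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {
        "pickup_weekday": WEEKDAYS[ts.weekday()],  # weekday(): Monday is 0
        "pickup_year": ts.year,
        "pickup_month": ts.month,
        "pickup_monthday": ts.day,
        "pickup_hour": ts.hour,
        "pickup_minute": ts.minute,
        "pickup_second": ts.second,
    }

features = pickup_features("2018-01-08 21:13:45")
print(features)
```

Parsing with `datetime` yields integers for the numeric components, whereas splitting a string column leaves its pieces as strings, as the `Type` column of the profile above shows.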


Oops! We've forgotten to specify the types of our columns. Doing so is important to improve our model's accuracy. Rather than doing it manually, we can let our data preparation tool infer what the column types are, then apply those changes if they look accurate:

In [12]:
type_infer = dflow_sampled.builders.set_column_types()
type_infer.learn()
type_infer

Column types conversion candidates:
'pickup_month': [FieldType.INTEGER],
'pickup_weekday': [FieldType.STRING],
'pickup_year': [FieldType.INTEGER],
'pickup_hour': [FieldType.INTEGER],
'passenger_count': [FieldType.INTEGER],
'dropoff_region': [FieldType.INTEGER],
'fare_amount': [FieldType.DECIMAL],
'pickup_monthday': [FieldType.INTEGER],
'pickup_second': [FieldType.INTEGER],
'pickup_region': [FieldType.INTEGER],
'vendor': [FieldType.INTEGER],
'pickup_minute': [FieldType.INTEGER],
'distance': [FieldType.DECIMAL],
'payment_type': [FieldType.INTEGER]
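
The intuition behind type inference can be sketched in a few lines: try the strictest type first and fall back when any value fails to parse. This is an illustrative assumption about the general approach, not the SDK's algorithm; `infer_type` and the type labels are hypothetical.

```python
def infer_type(values):
    """Return the strictest type label that every value in the column satisfies."""
    for caster, label in ((int, "INTEGER"), (float, "DECIMAL")):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue  # at least one value failed; try the next, looser type
        return label
    return "STRING"

columns = {
    "pickup_month": ["01", "02", "12"],
    "distance": ["2.9", "2.05", "9.03"],
    "pickup_weekday": ["Friday", "Monday"],
}
candidates = {name: infer_type(values) for name, values in columns.items()}
print(candidates)
```

Reviewing the candidates before applying them matters: a column such as `pickup_region` parses as an integer even though it is really a categorical identifier.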

This looks right, so now we'll add it to our definition to persist the changes:

In [13]:
dflow_sampled = type_infer.to_dataflow()
dflow_sampled.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent Missing,Error Count,Empty Count,Unique Values,0.1% Quantile (est.),1% Quantile (est.),5% Quantile (est.),25% Quantile (est.),50% Quantile (est.),75% Quantile (est.),95% Quantile (est.),99% Quantile (est.),99.9% Quantile (est.),Mean,Standard Deviation,Variance,Skewness,Kurtosis
vendor,FieldType.INTEGER,1,4,13483329.0,0.0,13483329.0,0.0,0.0,0.0,3.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.56641,0.495813,0.24583,-0.263076,-1.90696
pickup_year,FieldType.INTEGER,2001,2084,13483329.0,0.0,13483329.0,0.0,0.0,0.0,12.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,2018.0,0.0442808,0.00196079,637.576,1129830.0
pickup_month,FieldType.INTEGER,1,12,13483329.0,0.0,13483329.0,0.0,0.0,0.0,12.0,1.0,1.0,1.0,2.0,3.79961,5.0,6.0,6.0,6.0,3.51635,1.68892,2.85246,-0.0223132,-1.23112
pickup_monthday,FieldType.INTEGER,1,31,13483329.0,0.0,13483329.0,0.0,0.0,0.0,31.0,1.0,3.41914,3.01409,8.0,15.1545,23.0,29.0,31.0,31.0,15.5539,8.68134,75.3656,0.0252804,-1.16533
pickup_weekday,FieldType.STRING,Friday,Wednesday,13483329.0,0.0,13483329.0,0.0,0.0,0.0,7.0,,,,,,,,,,,,,,
pickup_hour,FieldType.INTEGER,0,23,13483329.0,0.0,13483329.0,0.0,0.0,0.0,24.0,0.0,5.77168,5.25264,9.20356,14.1305,19.0,22.0,23.0,23.0,13.7901,6.11998,37.4541,-0.467103,-0.585467
pickup_minute,FieldType.INTEGER,0,59,13483329.0,0.0,13483329.0,0.0,0.0,0.0,60.0,0.0,5.56191,5.05439,14.5235,29.6817,44.8087,56.4572,59.0,59.0,29.5863,17.3368,300.566,-0.00937358,-1.2089
pickup_second,FieldType.INTEGER,0,59,13483329.0,0.0,13483329.0,0.0,0.0,0.0,60.0,0.0,5.51769,5.02169,14.5727,29.5653,44.5202,56.5341,59.0,59.0,29.5024,17.3139,299.77,0.00020861,-1.2003
passenger_count,FieldType.INTEGER,0,9,13483329.0,0.0,13483329.0,0.0,0.0,0.0,10.0,0.0,1.0,1.0,1.0,1.0,2.0,5.0,6.0,6.0,1.60003,1.24791,1.55728,2.2482,4.20344
distance,FieldType.DECIMAL,0,189484,13483329.0,0.0,13483329.0,0.0,0.0,0.0,,0.0,0.603187,0.600428,0.956016,1.60301,2.97954,10.7442,18.9668,26.9457,2.91042,51.7372,2676.74,3643.26,13343000.0


Though Datasets comes with a highly scalable data preparation capability, you can also use the languages and packages you already know and love. Every Dataset can be converted to a pandas or Spark dataframe, and you can also run custom Python code as part of your transformations. Here, we use numpy to project latitude and longitude onto Cartesian x, y, and z coordinates. Note that the scripts below assume columns named `pickup_lat`, `pickup_lng`, `dropoff_lat`, and `dropoff_lng`, which do not appear in the sampled output shown earlier:

In [14]:
dflow_sampled = (dflow_sampled
    .new_script_column(
        new_column_name='pickup_x',
        insert_after='fare_amount',
        script="""
def newvalue(row):
    return np.cos(row['pickup_lat']) * np.cos(row['pickup_lng'])
        """
    )
    .new_script_column(
        new_column_name='pickup_y',
        insert_after='pickup_x',
        script="""
def newvalue(row):
    return np.cos(row['pickup_lat']) * np.sin(row['pickup_lng'])
        """
    )
    .new_script_column(
        new_column_name='pickup_z',
        insert_after='pickup_y',
        script="""
def newvalue(row):
    return np.sin(row['pickup_lat'])
        """
    )
    .new_script_column(
        new_column_name='dropoff_x',
        insert_after='pickup_z',
        script="""
def newvalue(row):
    return np.cos(row['dropoff_lat']) * np.cos(row['dropoff_lng'])
        """
    )
    .new_script_column(
        new_column_name='dropoff_y',
        insert_after='dropoff_x',
        script="""
def newvalue(row):
    return np.cos(row['dropoff_lat']) * np.sin(row['dropoff_lng'])
        """
    )
    .new_script_column(
        new_column_name='dropoff_z',
        insert_after='dropoff_y',
        script="""
def newvalue(row):
    return np.sin(row['dropoff_lat'])
        """
    )
    .drop_columns(columns=[
        'pickup_lat', 'pickup_lng', 'dropoff_lat', 'dropoff_lng'
    ])
)
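
The six script columns above all follow one formula: a point given by latitude and longitude is mapped to x, y, and z on the unit sphere. The sketch below shows that formula on its own. It assumes the coordinates are in degrees and converts them to radians first, because the trigonometric functions expect radians; `to_unit_sphere` is a hypothetical helper.

```python
import math

def to_unit_sphere(lat_deg, lng_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    x = math.cos(lat) * math.cos(lng)
    y = math.cos(lat) * math.sin(lng)
    z = math.sin(lat)  # z depends on latitude only
    return x, y, z

# Example coordinates, roughly midtown Manhattan.
x, y, z = to_unit_sphere(40.75, -73.99)
print(x, y, z)
```

Unlike raw latitude and longitude, these coordinates have no wrap-around at the 180th meridian, so nearby points always have nearby feature values.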

Because a Dataflow converts easily to a pandas dataframe, you can use your favorite data visualization libraries to inspect your data as you wrangle it:

In [15]:
%matplotlib inline
import matplotlib.pyplot as plt

df_sampled = dflow_sampled.to_pandas_dataframe()
plt.hist(df_sampled['distance'])
plt.show()

KeyboardInterrupt: 

Let's look again at our Dataset's summary statistics to verify whether we're ready to train our model:

In [None]:
dflow_sampled.get_profile()

By peeking at the profile, we notice that we should apply two final filters to eliminate incorrectly captured data points: no record should have a `fare_amount` or `distance` of 0. Let's apply our filters so we can improve our machine learning model's accuracy:

In [None]:
dflow_sampled = (dflow_sampled
    .filter(dprep.col("distance") > 0)
    .filter(dprep.col("fare_amount") > 0))
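
For readers wrangling with plain Python instead, the same two conditions can be written as a single list comprehension. This is a minimal sketch on hypothetical rows; `drop_invalid_trips` is not part of any SDK.

```python
def drop_invalid_trips(rows):
    """Keep only trips with a positive distance and a positive fare."""
    return [
        row for row in rows
        if float(row["distance"]) > 0 and float(row["fare_amount"]) > 0
    ]

trips = [
    {"distance": 2.9, "fare_amount": 13.5},
    {"distance": 0.0, "fare_amount": 5.0},  # zero distance: dropped
    {"distance": 1.2, "fare_amount": 0.0},  # zero fare: dropped
]
valid = drop_invalid_trips(trips)
print(f"kept {len(valid)} of {len(trips)} rows")
```

Printing how many rows survive is a cheap sanity check that a filter removes bad records rather than most of the data.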

Now that we've wrangled our data into an ML-ready format, we can persist these changes by converting the result to a pandas dataframe and creating a new Dataset from it:

In [None]:
df_sampled_cleaned = dflow_sampled.to_pandas_dataframe()
dataset_sampled_cleaned = Dataset.from_pandas_dataframe(df_sampled_cleaned)

We haven't yet registered our Dataset to our Azure ML workspace. Registering the Dataset allows it to be persisted and used by any experiments and users in the workspace, making it convenient for hand-offs. We'll now register our Dataset with a few descriptive details:

In [None]:
dataset_sampled_cleaned = dataset_sampled_cleaned.register(
    workspace=ws,
    name='nyc_taxi_sampled_cleaned',
    description='Sampled NYC yellow taxicab data during 2018.',
    tags={'year':'2018', 'status':'cleaned'},
    exist_ok=True,
    update_if_exist=True
)

### Apply changes to full data

Now that we've figured out exactly how we need to prepare our data, we can generate our data preparation script by simply coalescing the transformations above into a function. We'll apply this to our full Dataset so we can use it for ML. (Though we wrote our preparation script with our Data Prep SDK, you can use any script written with any library you prefer here.)

In [None]:
# Data preparation script
def prepare_dataframe(df):

    dflow = dprep.read_pandas_dataframe(df, temp_folder='temp-full')

    all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
    drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
    useful_columns = [
        "fare_amount", "distance", "pickup_region", "dropoff_region",
        "passenger_count", "pickup_datetime", "vendor", "payment_type"
    ]

    dflow = (dflow
    # Block 1
        .replace_na(columns=all_columns)
        .drop_nulls(*drop_if_all_null)
        .rename_columns(column_pairs={
            "VendorID": "vendor",
            "tpep_pickup_datetime": "pickup_datetime",
            "trip_distance": "distance",
            "PULocationID": "pickup_region",
            "DOLocationID": "dropoff_region"
        })
        .keep_columns(columns=useful_columns)
    # Block 2
        .split_column_by_example(
            source_column="pickup_datetime",
            example=("2009-01-04 02:52:00", ["2009-01-04", "02:52:00"])
        )
        .rename_columns(column_pairs={
            "pickup_datetime_1": "pickup_date",
            "pickup_datetime_2": "pickup_time"
        })
    # Block 3
        .derive_column_by_example(
            source_columns="pickup_date",
            new_column_name="pickup_weekday",
            example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
        )
        .split_column_by_example(
            source_column="pickup_date",
            example=("2009-01-04", ["2009", "01", "04"])
        )
        .split_column_by_example(
            source_column="pickup_time",
            example=("02:52:58", ["02", "52", "58"])
        )
        .drop_columns(columns=[
            "pickup_datetime", "pickup_date", "pickup_time"
        ])
        .rename_columns(column_pairs={
            "pickup_date_1": "pickup_year",
            "pickup_date_2": "pickup_month",
            "pickup_date_3": "pickup_monthday",
            "pickup_time_1": "pickup_hour",
            "pickup_time_2": "pickup_minute",
            "pickup_time_3": "pickup_second"
        }))

    # Block 4
    type_infer = dflow.builders.set_column_types()
    type_infer.learn()

    dflow = type_infer.to_dataflow()

    dflow = (dflow
    # Block 5
        .new_script_column(
            new_column_name='pickup_x',
            insert_after='fare_amount',
            script="""
def newvalue(row):
    return np.cos(row['pickup_lat']) * np.cos(row['pickup_lng'])
            """
        )
        .new_script_column(
            new_column_name='pickup_y',
            insert_after='pickup_x',
            script="""
def newvalue(row):
    return np.cos(row['pickup_lat']) * np.sin(row['pickup_lng'])
            """
        )
        .new_script_column(
            new_column_name='pickup_z',
            insert_after='pickup_y',
            script="""
def newvalue(row):
    return np.sin(row['pickup_lat'])
            """
        )
        .new_script_column(
            new_column_name='dropoff_x',
            insert_after='pickup_z',
            script="""
def newvalue(row):
    return np.cos(row['dropoff_lat']) * np.cos(row['dropoff_lng'])
            """
        )
        .new_script_column(
            new_column_name='dropoff_y',
            insert_after='dropoff_x',
            script="""
def newvalue(row):
    return np.cos(row['dropoff_lat']) * np.sin(row['dropoff_lng'])
            """
        )
        .new_script_column(
            new_column_name='dropoff_z',
            insert_after='dropoff_y',
            script="""
def newvalue(row):
    return np.sin(row['dropoff_lat'])
            """
        )
        .drop_columns(columns=[
            'pickup_lat', 'pickup_lng', 'dropoff_lat', 'dropoff_lng'
        ])
    # Block 6
        .filter(dprep.col("distance") > 0)
        .filter(dprep.col("fare_amount") > 0))

    return dflow.to_pandas_dataframe()
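
The pattern of coalescing individual steps into one reusable function does not depend on any particular library. Here is a small sketch of the same idea in plain Python, where `make_pipeline` and the two step functions are hypothetical stand-ins for the blocks above.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return one function that applies each step, in order, to its input."""
    def run(rows):
        return reduce(lambda data, step: step(data), steps, rows)
    return run

# Hypothetical stand-in steps; each takes rows and returns rows.
def drop_empty(rows):
    return [row for row in rows if any(value is not None for value in row.values())]

def keep_positive_distance(rows):
    return [row for row in rows if row["distance"] is not None and row["distance"] > 0]

prepare = make_pipeline(drop_empty, keep_positive_distance)

sample = [{"distance": 2.9}, {"distance": None}, {"distance": 0.0}]
full = sample * 3  # the "full" data goes through exactly the same steps

print(prepare(sample))
print(len(prepare(full)))
```

Keeping every step as a function from rows to rows is what lets the script developed on the sample run unchanged on the full data.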

Like before, we'll convert our Dataset into a pandas dataframe so we can transform it. We'll apply our preparation script to it:

In [None]:
df = dataset.to_pandas_dataframe()
df_cleaned = prepare_dataframe(df)

We'll save this dataframe back as a Dataset so we can store and share it within our Azure ML workspace. This makes it easy for any collaborator with access to our workspace to use the same artifact consistently:

In [None]:
dataset_cleaned = Dataset.from_pandas_dataframe(df_cleaned)
dataset_cleaned = dataset_cleaned.register(
    workspace=ws,
    name='nyc_taxi_cleaned',
    description='NYC yellow taxicab data during 2018.',
    tags={'year':'2018', 'status':'cleaned'},
    exist_ok=True,
    update_if_exist=True
)

Now that our Dataset is wrangled and registered, we can use this Dataset to build our ML model. Continue to [Part 2: Build and Train Models](2_build-models.ipynb).