Skip to content

Latest commit

 

History

History
143 lines (105 loc) · 7.1 KB

r-taxi-classification-create-features.md

File metadata and controls

143 lines (105 loc) · 7.1 KB
title titleSuffix description author ms.author ms.date ms.service ms.subservice ms.topic monikerRange
R tutorial: Create data features
SQL machine learning
In part three of this five-part tutorial series, you'll use T-SQL functions to create and store features from sample data with SQL machine learning.
WilliamDAssafMSFT
wiassaf
10/15/2020
sql
machine-learning
tutorial
>=sql-server-2016||>=sql-server-linux-ver15||>=azuresqldb-mi-current

R tutorial: Create data features

[!INCLUDE SQL Server 2016 SQL MI]

In part three of this five-part tutorial series, you'll learn how to create features from raw data by using a [!INCLUDEtsql] function. You'll then call that function from a SQL stored procedure to create a table that contains the feature values.

In this article, you'll:

[!div class="checklist"]

  • Modify a custom function to calculate trip distance
  • Save the features using another custom function

In part one, you installed the prerequisites and restored the sample database.

In part two, you reviewed the sample data and generated some plots.

In part four, you'll load the modules and call the necessary functions to create and train the model using a SQL Server stored procedure.

In part five, you'll learn how to operationalize the models that you trained and saved in part four.

In part five, you'll learn how to operationalize the models that you trained and saved in part four.

About feature engineering

After several rounds of data exploration, you have collected some insights from the data, and are ready to move on to feature engineering. This process of creating meaningful features from the raw data is a critical step in creating analytical models.

In this dataset, the distance values are based on the reported meter distance, and don't necessarily represent geographical distance or the actual distance traveled. Therefore, you'll need to calculate the direct distance between the pick-up and drop-off points, by using the coordinates available in the source NYC Taxi dataset. You can do this by using the Haversine formula in a custom [!INCLUDEtsql] function.

You'll use one custom T-SQL function, fnCalculateDistance, to compute the distance using the Haversine formula, and use a second custom T-SQL function, fnEngineerFeatures, to create a table containing all the features.

The overall process is as follows:

  • Create the T-SQL function that performs the calculations

  • Call the function to generate the feature data

  • Save the feature data to a table

Calculate trip distance using fnCalculateDistance

The function fnCalculateDistance should have been downloaded and registered with [!INCLUDEssNoVersion] as part of the preparation for this tutorial. Take a minute to review the code.

  1. In [!INCLUDEssManStudio], expand Programmability, expand Functions and then Scalar-valued functions.

  2. Right-click fnCalculateDistance, and select Modify to open the [!INCLUDEtsql] script in a new query window.

    CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)  
    -- User-defined function that calculates the direct distance between two geographical coordinates.  
    RETURNS float  
    AS  
    BEGIN  
      DECLARE @distance decimal(28, 10)  
      -- Convert to radians  
      SET @Lat1 = @Lat1 / 57.2958  
      SET @Long1 = @Long1 / 57.2958  
      SET @Lat2 = @Lat2 / 57.2958  
      SET @Long2 = @Long2 / 57.2958  
      -- Calculate distance  
      SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))  
      --Convert to miles  
      IF @distance <> 0  
      BEGIN  
        SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);  
      END  
      RETURN @distance  
    END
    GO
    • The function is a scalar-valued function, returning a single data value of a predefined type.

    • It takes latitude and longitude values as inputs, obtained from trip pick-up and drop-off locations. The Haversine formula converts locations to radians and uses those values to compute the direct distance in miles between those two locations.

Generate the features using fnEngineerFeatures

To add the computed values to a table that can be used for training the model, you'll use another function, fnEngineerFeatures. The new function calls the previously created T-SQL function, fnCalculateDistance, to get the direct distance between pick-up and drop-off locations.

  1. Take a minute to review the code for the custom T-SQL function, fnEngineerFeatures, which should have been created for you as part of the preparation for this walkthrough.

    CREATE FUNCTION [dbo].[fnEngineerFeatures] (  
    @passenger_count int = 0,  
    @trip_distance float = 0,  
    @trip_time_in_secs int = 0,  
    @pickup_latitude float = 0,  
    @pickup_longitude float = 0,  
    @dropoff_latitude float = 0,  
    @dropoff_longitude float = 0)  
    RETURNS TABLE  
    AS
      RETURN
      (
      -- Add the SELECT statement with parameter references here
      SELECT
        @passenger_count AS passenger_count,
        @trip_distance AS trip_distance,
        @trip_time_in_secs AS trip_time_in_secs,
        [dbo].[fnCalculateDistance](@pickup_latitude, @pickup_longitude, @dropoff_latitude, @dropoff_longitude) AS direct_distance
    
      )
    GO
    • This table-valued function that takes multiple columns as inputs, and outputs a table with multiple feature columns.

    • The purpose of this function is to create new features for use in building a model.

  2. To verify that this function works, use it to calculate the geographical distance for those trips where the metered distance was 0 but the pick-up and drop-off locations were different.

        SELECT tipped, fare_amount, passenger_count,(trip_time_in_secs/60) as TripMinutes,
        trip_distance, pickup_datetime, dropoff_datetime,
        dbo.fnCalculateDistance(pickup_latitude, pickup_longitude,  dropoff_latitude, dropoff_longitude) AS direct_distance
        FROM nyctaxi_sample
        WHERE pickup_longitude != dropoff_longitude and pickup_latitude != dropoff_latitude and trip_distance = 0
        ORDER BY trip_time_in_secs DESC

    As you can see, the distance reported by the meter doesn't always correspond to geographical distance. This is why feature engineering is so important. You can use these improved data features to train a machine learning model using R.

Next steps

In this article, you:

[!div class="checklist"]

  • Modified a custom function to calculate trip distance
  • Saved the features using another custom function

[!div class="nextstepaction"] R tutorial: Train and save model