<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Outlier Analysis
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>An outlier is a data point that is significantly different from other data points in a dataset. Outlier removal is important because outliers can skew the overall analysis and conclusions drawn from a dataset, leading to inaccurate results. By removing outliers, the data can better represent the majority of the dataset and improve the accuracy of the analysis.</p>

<!-- <p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the typical process for creating Machine Learning models, a significant amount of time is spent on data preparation and feature selection.  Data scientists and engineers will typically copy data to a tool of choice or data virtualization to perform these tasks.  Moving this data to these tools is impossible at a sufficient scale reflecting typical production volumes.  Even if we can transfer data to another system, the resource requirements to process and analyze this data becomes prohibitively large or expensive.</p>
 -->
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following demonstration will illustrate using native Vantage SQL functions that can provide greater efficiency, ease of use, and the ability to process data at an extreme scale.  Additionally, a new SQL function <b>TD_ColumnTransformer</b> can create a single, efficient data transformation pipeline.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The data for this demonstration consists of New York City taxi trip data and includes 500,000 rows with fare amount, passenger count, and pickup and dropoff latitude/longitude.  The demonstration illustrates various functions for data exploration and outlier removal.  The steps in this demo are as follows:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Perform statistical analysis on the whole data set</li>
    <li>Identify outliers and abnormal data distribution</li>
    <li>Remove outliers and review data</li>
    <li>Perform Bin Coding and Column Transformation to combine Feature Engineering steps</li>
    </ol>

<img src = 'images/Flow_Diagram_Outlier.png' width=100%>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>You will be prompted for your password. Enter it, press Enter, then use the down arrow to move to the next cell.</p>

In [None]:
%connect local, hidewarnings=true

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Set up the session for running this notebook. Run each step with Shift + Enter.</p>


In [None]:
SET query_band='DEMO=Outlier_Analysis_SQL.ipynb;' UPDATE FOR SESSION;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage on your environment, or download the data to local storage, which may yield somewhat faster execution but uses available storage. The following cell contains two statements, one of which is commented out. To switch modes, change which statement is commented out.</p>


In [None]:
-- call get_data('DEMO_NYCTaxi_cloud');    -- Takes about 2 minutes
call get_data('DEMO_NYCTaxi_local');     -- Takes about 3 minutes

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – if you want to see status of databases/tables created and space used.</p>


In [None]:
call space_report();  -- optional, takes about 10 seconds

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Statistical Analysis</b></p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the data</li>
    <li>Call TD_UnivariateStatistics function to gather statistics</li>
    <li>Pivot and select outlier stats</li>
    </ol>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.1 Inspect the Data</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a warm-up, let us look at the tables in our database DEMO_NYCTaxi.</p>       

In [None]:
SELECT 
    DatabaseName,
    TableName
FROM
    DBC.Tables
WHERE
    DatabaseName = 'DEMO_NYCTaxi';

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The query below shows the number of rows in each of the tables in the database.</p>

In [None]:
SELECT
(
    SELECT COUNT(*)
    FROM DEMO_NYCTaxi.trip
) AS trip,
(
    SELECT COUNT(*)
    FROM DEMO_NYCTaxi.trip_fare
) AS trip_fare;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let's look at sample data from the trip and trip_fare tables.</p>

In [None]:
SELECT TOP 5 * FROM DEMO_NYCTaxi.trip;

In [None]:
SELECT TOP 5 * FROM DEMO_NYCTaxi.trip_fare;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We are interested in the following columns from the above tables:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>pickup_datetime</li>
    <li>passenger_count</li>
    <li>pickup_latitude</li>
    <li>pickup_longitude</li>
    <li>dropoff_latitude</li>
    <li>dropoff_longitude</li>
    <li>total_amount</li>
</ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We therefore create a table with the columns of interest, renaming total_amount to fare_amount:</p>

In [None]:
CREATE MULTISET TABLE NYC_FULL_T AS (
    SELECT t.pickup_datetime as pickup_datetime,
    t.passenger_count as passenger_count,
    t.pickup_latitude as pickup_latitude,
    t.pickup_longitude as pickup_longitude,
    t.dropoff_latitude as dropoff_latitude,
    t.dropoff_longitude as dropoff_longitude,
    f.total_amount as fare_amount
FROM "DEMO_NYCTaxi"."trip" t
LEFT JOIN "DEMO_NYCTaxi"."trip_fare" f
    ON f.medallion = t.medallion
    AND f.pickup_datetime = t.pickup_datetime) WITH DATA;

In [None]:
SELECT COUNT(*) FROM NYC_FULL_T;
SELECT TOP 5 * FROM NYC_FULL_T;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above output shows five rows from the table created in the previous step. The table has 480k+ rows.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.2 Gather statistics using TD_UnivariateStatistics</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Univariate analysis is the simplest form of data analysis. Its main purpose is to describe: it takes data, summarizes it, and finds patterns in it.
    <br>
    <a href = 'https://docs.teradata.com/search/all?query=TD_UnivariateStatistics&content-lang=en-US'>TD_UnivariateStatistics</a> displays descriptive statistics for each specified numeric input table column.
</p>
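
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough cross-check, a few of these statistics can also be computed with standard SQL aggregates. The sketch below assumes the NYC_FULL_T table created in section 2.1; the column aliases are illustrative names only. The standard deviation may differ slightly from the TD_UnivariateStatistics value if that function uses a different (sample versus population) formula.</p>

```sql
-- Sketch: cross-check basic fare_amount statistics with plain aggregates.
-- Aggregates other than COUNT(*) ignore NULL fare_amount values.
SELECT
    COUNT(*)                 AS row_count,
    COUNT(fare_amount)       AS non_null_count,
    MIN(fare_amount)         AS min_fare,
    MAX(fare_amount)         AS max_fare,
    AVG(fare_amount)         AS mean_fare,
    STDDEV_SAMP(fare_amount) AS stddev_fare
FROM NYC_FULL_T;
```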

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Note that the next cell might take up to 1 minute to run.</i></b></p>

In [None]:
CREATE MULTISET TABLE AllStats_unpivoted AS (
    SELECT *
        FROM TD_UnivariateStatistics (
        ON NYC_FULL_T AS InputTable
        USING
        TargetColumns('fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude')
        STATS( 
              'BOTTOM5',
              'COEFFICIENT OF VARIATION',
              'CORRECTED SUM OF SQUARES',
              'MEAN',
              'MEDIAN',
              'MODE',           
              'GEOMETRIC MEAN', 
              'HARMONIC MEAN', 
              'TRIMMED MEAN',
              'KURTOSIS',
              'SKEWNESS',
              'STANDARD ERROR',
              'STANDARD DEVIATION',
              'SUM',
              'POSITIVE VALUES COUNT', 
              'TOP5', 
              'INTERQUARTILE RANGE',
              'NEGATIVE VALUES COUNT',
              'NULL COUNT',
              'RANGE',
              'UNCORRECTED SUM OF SQUARES',
              'UNIQUE ENTITY COUNT',
              'VARIANCE',
              'ZERO VALUES COUNT',
              'PERCENTILES',
              'MINIMUM',
              'MAXIMUM'
        )
        ) AS dt
) WITH DATA;

In [None]:
SELECT TOP 5 * FROM AllStats_unpivoted;

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>2.3 Pivot for easier analysis</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href='https://docs.teradata.com/search/all?query=PIVOT&content-lang=en-US'>PIVOT</a> is a relational operator for transforming rows into columns. The function is helpful for reporting purposes, allowing you to aggregate and rotate data to create easy-to-read tables.</p>
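
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the rotation concrete, the same reshaping can be sketched by hand with conditional aggregation. This is only an illustration for two of the five attributes; it assumes the ATTRIBUTE, StatName, and StatValue columns of AllStats_unpivoted used elsewhere in this demo, and the output aliases are hypothetical names.</p>

```sql
-- Sketch: hand-written equivalent of PIVOT for two attributes.
-- Each CASE keeps StatValue only for the matching attribute; MAX collapses
-- the group to the single non-NULL value per statistic.
SELECT
    "StatName",
    MAX(CASE WHEN "ATTRIBUTE" = 'fare_amount'     THEN StatValue END) AS fare_amount_stat,
    MAX(CASE WHEN "ATTRIBUTE" = 'pickup_latitude' THEN StatValue END) AS pickup_latitude_stat
FROM AllStats_unpivoted
GROUP BY "StatName"
ORDER BY 1;
```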

In [None]:
REPLACE VIEW AllStats_pivoted AS
    SELECT *
        FROM AllStats_unpivoted
        PIVOT (
            MAX(StatValue) FOR  ATTRIBUTE IN ('fare_amount', 
                                                 'pickup_longitude',  
                                                 'pickup_latitude',
                                                 'dropoff_longitude',
                                                 'dropoff_latitude')
        ) tmp;

In [None]:
SELECT TOP 5 * FROM AllStats_pivoted ORDER BY 1;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output above shows a cleaner way to look at the various statistics for all the specified columns.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Outlier Identification</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The statistics gathered above reveal some extreme outliers in our numeric data. We can use the <a href='https://docs.teradata.com/search/all?query=TD_HISTOGRAM&content-lang=en-US'>TD_Histogram</a> function to look at the distribution of each data column, which can guide how to remove outliers.</p>
    
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the distribution statistics</li>
    <li>Use TD_Histogram to view column distributions using calculated binning</li>
    <li>Use TD_Histogram using MinMax table as input</li>
    </ol>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.1 Simple SELECT</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Filter the pivoted statistics table. We are interested in the maximum, minimum, and percentiles for all the columns.</p>

In [None]:
SELECT TOP 5 *
FROM AllStats_pivoted
WHERE "StatName" LIKE 'PERCENTILES(%)'
    OR  "StatName" IN ('MINIMUM', 'MAXIMUM')
ORDER BY 1;

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.2 Histograms for distribution analysis</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/r/Teradata-VantageTM-Advanced-SQL-Engine-Analytic-Functions/April-2022/Data-Exploration-Functions/TD_Histogram'>TD_Histogram</a> calculates the frequency distribution of a data set using your choice from these methods:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
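    <li>Sturges, which uses w = r/(1 + log<sub>2</sub>n) for bin width, where r is the range of the data and n is the number of values</li>
    <li>Scott, which uses w = 3.49s/n<sup>1/3</sup> for bin width, where s is the standard deviation of the data</li>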
    <li>Variable-width, which requires a MinMax table</li>   
    <li>Equal-width, which requires a MinMax table</li>
    </ul>
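
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The two algorithmic bin widths can be approximated directly from the formulas above. The query below is a sketch that assumes the NYC_FULL_T table from section 2.1; the aliases are illustrative, and TD_Histogram may round or adjust the bin count internally, so its bins need not match these values exactly.</p>

```sql
-- Sketch: Sturges and Scott bin widths for fare_amount.
-- log2(n) is computed as LN(n) / LN(2); FLOAT literals (1E0 / 3E0) avoid
-- DECIMAL truncation when forming the 1/3 exponent.
SELECT
    COUNT(fare_amount)                  AS n,
    MAX(fare_amount) - MIN(fare_amount) AS r,
    (MAX(fare_amount) - MIN(fare_amount))
        / (1 + LN(CAST(COUNT(fare_amount) AS FLOAT)) / LN(2)) AS sturges_width,
    3.49E0 * STDDEV_SAMP(fare_amount)
        / POWER(CAST(COUNT(fare_amount) AS FLOAT), 1E0 / 3E0) AS scott_width
FROM NYC_FULL_T;
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because r is driven by the single largest and smallest values, a few extreme fares are enough to make the Sturges width very large.</p>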

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>An example using Sturges is below; note that the algorithmic methods work best on normally distributed data, so in this case, the bin widths illustrate how a few extreme outliers can skew the overall data distribution.</p>

In [None]:
SELECT *
FROM TD_Histogram (
    ON NYC_FULL_T AS InputTable
    USING
    TargetColumn('fare_amount')
    MethodType('Sturges')
) AS dt ORDER BY 1;

In [None]:
%chart x = MinValue, y = bin_percent, title = "Histogram of fare_amount"

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The results above indicate that <b>96.6%</b> of fares are in the range 0-50. Almost all the data falls in the 0-100 fare range.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>3.3 Histograms with MinMax table</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>An alternative method for defining bin width is to use a MinMax table. For equal-width bins, the MinMax table needs only two columns, "MinValue" and "MaxValue", representing the column's overall minimum and maximum values. For variable-width bins, the MinMax table needs an extra "label" column, and each row's MinValue and MaxValue represent the value range of one bin.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For the fare amount column, this MinMax table bins only values in the range 0 to 100. Visualizing the results yields a much more manageable distribution.</p>

In [None]:
CREATE MULTISET TABLE fare_amount_minmax ( 
    MinValue INTEGER, 
    MaxValue INTEGER,
    Label VARCHAR(10)
);

In [None]:
%dataload table = fare_amount_minmax, filepath = UseCases/Outlier_Analysis/data/EDA_MinMax.csv

In [None]:
SELECT TOP 5 * FROM fare_amount_minmax ORDER BY MinValue;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The table fare_amount_minmax has 51 bins of width 2 in the range 0 to 100. We will use this table to create a variable-width histogram. Note that you can change the fare_amount_minmax table to declare bins of any width.</p>
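
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As one illustration of declaring bins of a different width, the sketch below builds a small MinMax table with three bins of width 10. The table name fare_amount_minmax_wide and the label values are hypothetical and not part of the demo data; the column layout simply follows fare_amount_minmax. The table is volatile, so it is dropped automatically at the end of the session.</p>

```sql
-- Sketch: hypothetical MinMax table with three bins of width 10.
CREATE MULTISET VOLATILE TABLE fare_amount_minmax_wide (
    MinValue INTEGER,
    MaxValue INTEGER,
    Label VARCHAR(10)
) ON COMMIT PRESERVE ROWS;

-- One row per bin: lower bound, upper bound, label.
INSERT INTO fare_amount_minmax_wide VALUES (0, 10, '00-10');
INSERT INTO fare_amount_minmax_wide VALUES (10, 20, '10-20');
INSERT INTO fare_amount_minmax_wide VALUES (20, 30, '20-30');

SELECT * FROM fare_amount_minmax_wide ORDER BY MinValue;
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Such a table would presumably be passed to TD_Histogram as the minmax DIMENSION input in the same way as fare_amount_minmax, with nbins set to the number of rows.</p>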

In [None]:
SELECT *
FROM TD_Histogram (
    ON NYC_FULL_T AS InputTable
    ON fare_amount_minmax AS minmax DIMENSION
    USING
    MethodType ('variable-width')
    TargetColumn ('fare_amount')
    nbins(51)
) AS dt ORDER BY 2;

In [None]:
%chart x = MinValue, y = bin_percent, title = "Histogram of fare_amount (0 to 100)", width = 500

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We now have a better visualization of the distribution for the majority of trips, those with fares in the range 0 to 100.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Outlier Removal</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>At this point, we've identified that our data has some extreme outliers. The Histogram illustration quantifies this for fare amount, but if we inspect the other columns, we will see similar features in the data.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>TD_OutlierFilterFit and TD_OutlierFilterTransform can selectively replace outlier values or remove the rows that contain them.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create a Fit Table using TD_OutlierFilterFit</li>
    <li>Transform the data set using TD_OutlierFilterTransform</li>
    <li>Compare distributions before and after</li>
</ol>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.1 TD_OutlierFilterFit</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_OutlierFilterFit&content-lang=en-US'>TD_OutlierFilterFit</a> function calculates the lower_percentile, upper_percentile, count of rows, and median for the specified input table columns. The calculated values for each column help the TD_OutlierFilterTransform function detect outliers in the input table.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Selected parameters include:</p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>TargetColumns: takes a list or a range of columns, by name or ordinal</li>
    <li>ReplacementValue: delete, null, median, or a value</li>
    <li>OutlierMethod: Percentile, Tukey, or Carling</li>
    <li>Other outlier identification parameters, depending on OutlierMethod</li>
    </ul>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this case, we will use the Percentile method on our latitude/longitude columns to delete rows containing outlier values. Additional parameter values define the lower and upper percentile values and the percentile calculation method (Discrete).</p>
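
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To get a feel for the thresholds the Percentile method works with, the discrete 2nd and 98th percentiles of one column can be computed with the standard PERCENTILE_DISC aggregate. This is a sketch for pickup_latitude only, assuming the NYC_FULL_T table from section 2.1; the aliases are illustrative, and the values stored by TD_OutlierFilterFit may differ slightly if its percentile calculation is defined differently.</p>

```sql
-- Sketch: discrete 2nd and 98th percentiles of pickup_latitude.
-- Rows outside [lower_bound, upper_bound] are the candidates for removal.
SELECT
    PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY pickup_latitude) AS lower_bound,
    PERCENTILE_DISC(0.98) WITHIN GROUP (ORDER BY pickup_latitude) AS upper_bound
FROM NYC_FULL_T;
```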
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b><i>*Note that the next cell might take up to 1 minute to run.</i></b></p>

In [None]:
CREATE MULTISET TABLE OutlierFitTbl AS(
    SELECT *
        FROM TD_OutlierFilterFit(
            ON NYC_FULL_T AS InputTable
            USING
            TargetColumns('pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude')  
            LowerPercentile(0.02)
            UpperPercentile(0.98)  
            OutlierMethod('Percentile')            
            ReplacementValue('delete')
        ) AS dt
) WITH DATA;

In [None]:
SELECT TOP 5 * FROM OutlierFitTbl;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above calculated values for each column in the OutlierFitTbl will help in removing outliers from the original data set.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.2 TD_OutlierFilterTransform</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/search/all?query=TD_OutlierFilterTransform&content-lang=en-US'>TD_OutlierFilterTransform</a> filters outliers from the input table. The metrics for determining outliers come from the TD_OutlierFilterFit output, which is OutlierFitTbl in our case.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we create a view to save space and allow for a before and after comparison. This view contains no outliers with respect to 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', and 'dropoff_latitude'.</p>

In [None]:
REPLACE VIEW latlong_no_outliers AS (
    SELECT *
        FROM TD_OutlierFilterTransform(
        ON NYC_FULL_T AS InputTable PARTITION BY ANY
        ON OutlierFitTbl AS FitTable DIMENSION 
        ) AS dt
);
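
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A quick way to see how many rows the filter removed is to compare row counts. This sketch reuses the scalar-subquery pattern from section 2.1; the aliases rows_before and rows_after are illustrative names. Because latlong_no_outliers is a view, counting its rows re-runs the transformation.</p>

```sql
-- Sketch: row counts before and after outlier filtering.
SELECT
(
    SELECT COUNT(*)
    FROM NYC_FULL_T
) AS rows_before,
(
    SELECT COUNT(*)
    FROM latlong_no_outliers
) AS rows_after;
```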

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>4.3 Before and After Comparison</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use TD_Histogram to visualize the distributions of data before and after transformation. The <b>Pickup Latitude</b> column serves as a good example.</p>

In [None]:
SELECT *
    FROM TD_Histogram (
    ON NYC_FULL_T  AS InputTable
    USING
    MethodType ('Sturges')
    TargetColumn ('pickup_latitude')
) AS dt ORDER BY 1;

In [None]:
%chart x = MinValue, y = bin_percent, title = "Histogram of pickup_latitude Raw"

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This graph shows that 98.6% of rides have a pickup_latitude between 40 and 45, while nearly 1.4% fall outside that range. These values are likely outliers and could distort results if carried into further analysis. They may point to geographic locations that the company doesn't serve, so we should remove them.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>After Outlier Removal Comparison</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use the <b>View</b> created above instead of the raw data. We will see a striking difference in the distribution of data.</p>

In [None]:
SELECT *
    FROM TD_Histogram (
    ON latlong_no_outliers AS InputTable
    USING
    MethodType ('Sturges')
    TargetColumn ('pickup_latitude')
) AS dt ORDER BY 1;

In [None]:
%chart x = MinValue, y = bin_percent, title = "Histogram of pickup_latitude Filtered"

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>By removing the outliers in the pickup_latitude, we are better able to represent the dataset.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Bin Coding</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Note that we've only filtered out latitude and longitude outliers and still need to transform the fare amount column, which does have extreme values that need to be addressed.  For this demonstration, we will use TD_BinCodeFit to create a fit table and then use TD_ColumnTransformer to combine the Outlier Filtering and Bin Coding steps into a single statement.</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Inspect the rows with max and min values</li>
    <li>Create a fit table with the value ranges and category labels</li>
    <li>Use TD_ColumnTransformer to create a transformation "Pipeline"</li>
    <li>Compare distributions before and after using TD_CategoricalSummary</li>
    </ol>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Min and Max Values</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use <a href = 'https://docs.teradata.com/search/all?query=TD_WhichMin&content-lang=en-US'>TD_WhichMin</a> and <a href='https://docs.teradata.com/search/all?query=TD_WhichMax&content-lang=en-US'>TD_WhichMax</a> to inspect the rows of the raw data table that contain the min and max values.</p>

In [None]:
SELECT * FROM TD_WhichMin (
   ON NYC_FULL_T AS InputTable
   USING
   TargetColumn('fare_amount')
) AS dt;

In [None]:
SELECT * FROM TD_WhichMax (
   ON NYC_FULL_T AS InputTable
   USING
   TargetColumn('fare_amount')
) AS dt;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Above are the rows with the minimum fare (\$2.6) and the maximum fare (\$450).</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 TD_BinCodeFit</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Bin coding is typically used to convert numeric data to categorical data by binning the numeric data into multiple numeric bins (intervals). The bins can have a fixed width with auto-generated labels or specified variable widths and labels.</p>
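
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, bin coding behaves like a CASE expression that maps each numeric value to the label of the interval it falls in. The sketch below shows this for three illustrative bins on the NYC_FULL_T table. The alias fare_bin, the treatment of boundaries (lower bound inclusive, upper bound exclusive), and returning NULL for values outside every bin are assumptions of this sketch, not documented behavior of the Teradata bin-coding functions.</p>

```sql
-- Sketch: hand-written bin coding of fare_amount into three labeled bins.
-- Boundary handling and the NULL fallback are assumptions for illustration.
SELECT TOP 5
    fare_amount,
    CASE
        WHEN fare_amount >= 0  AND fare_amount < 5  THEN '00-05'
        WHEN fare_amount >= 5  AND fare_amount < 10 THEN '05-10'
        WHEN fare_amount >= 10 AND fare_amount < 15 THEN '10-15'
        ELSE NULL
    END AS fare_bin
FROM NYC_FULL_T;
```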

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://docs.teradata.com/search/all?query=TD_BinCodeFit&content-lang=en-US'>TD_BinCodeFit</a> outputs a table of information to input to TD_BinCodeTransform, which bin-codes the specified input table columns.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Above, when we investigated the <b>Fare Amount</b> column using TD_Histogram, we used a custom MinMax table to create custom bins. The process for bin coding is similar. Here we will:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Create a custom Dimension table with the column name, MinValue, MaxValue, and label</li>
    <li>Use that as input to the TD_BinCodeFit function</li>
    <li>Pass additional parameter values such as MethodType and TargetColumns</li> 
    </ul>

In [None]:
CREATE MULTISET TABLE fare_amount_range (
    ColumnName CHAR(15),
    MinValue SMALLINT,
    MaxValue SMALLINT,
    label CHAR(35)
);

In [None]:
INSERT INTO fare_amount_range VALUES ('fare_amount', 0, 5, '00-05');
INSERT INTO fare_amount_range VALUES ('fare_amount', 5, 10, '05-10');
INSERT INTO fare_amount_range VALUES ('fare_amount', 10, 15, '10-15');
INSERT INTO fare_amount_range VALUES ('fare_amount', 15, 20, '15-20');
INSERT INTO fare_amount_range VALUES ('fare_amount', 20, 25, '20-25');
INSERT INTO fare_amount_range VALUES ('fare_amount', 25, 30, '25-30');
INSERT INTO fare_amount_range VALUES ('fare_amount', 30, 35, '30-35');
INSERT INTO fare_amount_range VALUES ('fare_amount', 35, 40, '35-40');
INSERT INTO fare_amount_range VALUES ('fare_amount', 40, 45, '40-45');

In [None]:
CREATE MULTISET TABLE BinCodeFitTbl AS (
    SELECT *
        FROM TD_BincodeFit(
        ON NYC_FULL_T AS InputTable
        ON fare_amount_range as FitInput Dimension
        USING
            TargetColumns('fare_amount')
            MethodType('Variable-Width')
        ) AS dt
) WITH DATA;

In [None]:
SELECT TOP 5 * from BinCodeFitTbl ORDER BY 2;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we can see that we created BinCodeFitTbl using the fare_amount_range table. We will use this in the next step of the transformation.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 TD_ColumnTransformer</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_ColumnTransformer&content-lang=en-US'>TD_ColumnTransformer</a> function transforms the entire dataset in a single operation. You only need to provide the FIT tables to the function, and the function runs all the transformations required in a single operation.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For this demonstration, we will create another view using TD_ColumnTransformer, passing both <b>Outlier Filter Fit</b> and <b>Bin Coding Fit</b> tables.</p>


In [None]:
REPLACE VIEW nyc_outlier_transformed AS (
    SELECT *
        FROM TD_ColumnTransformer(
        ON NYC_FULL_T AS InputTable
            
        ON OutlierFitTbl AS OutlierFilterFitTable DIMENSION
        ON BinCodeFitTbl AS BincodeFitTable DIMENSION
            
    ) AS dt
);

In [None]:
SELECT TOP 5 * FROM nyc_outlier_transformed;

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As shown above, the data has been divided into the bins specified in the fare_amount_range table.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.4 TD_CategoricalSummary</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_CategoricalSummary&content-lang=en-US'>TD_CategoricalSummary</a> function displays the distinct values and their counts for each specified input table column.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Since the fare_amount column has been converted from a numeric column to a categorical one, we will use TD_CategoricalSummary instead of TD_Histogram to list the distinct category values and their counts across the data set.</p>
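
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For comparison, a plain GROUP BY over the transformed view should yield comparable per-category counts. This sketch assumes the nyc_outlier_transformed view from section 5.3, where fare_amount holds the bin label; the aliases fare_bin and bin_count are illustrative names.</p>

```sql
-- Sketch: count rows per fare_amount category with a plain GROUP BY.
SELECT
    fare_amount AS fare_bin,
    COUNT(*)    AS bin_count
FROM nyc_outlier_transformed
GROUP BY fare_amount
ORDER BY 1;
```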

In [None]:
SELECT  *
FROM TD_CategoricalSummary (
    ON nyc_outlier_transformed AS InputTable
USING
    TargetColumns('fare_amount')
) AS dt ORDER BY 2;

In [None]:
%chart x = DistinctValue, y = DistinctValueCount, title = "Histogram of fare_amount range and DistinctValue_count", width = 300

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The above graph shows a better distribution of fare amount.
<br>
<br>
    In this demonstration, we handled the outliers in the latitude and longitude columns as well as in fare_amount. The filtered dataset now better represents typical rides and is a sounder basis for further analysis.
</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>6. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Clean up work tables to prevent errors next time. This section drops the tables and views created during the demonstration.</p>

In [None]:
DROP TABLE NYC_FULL_T;

In [None]:
DROP TABLE AllStats_unpivoted;

In [None]:
DROP VIEW AllStats_pivoted;

In [None]:
DROP TABLE fare_amount_minmax;

In [None]:
DROP TABLE OutlierFitTbl;

In [None]:
DROP VIEW latlong_no_outliers;

In [None]:
DROP VIEW nyc_outlier_transformed;

In [None]:
DROP TABLE fare_amount_range;

In [None]:
DROP TABLE BinCodeFitTbl;

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
call remove_data('DEMO_NYCTaxi');          -- Takes 10 seconds

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023. All Rights Reserved
        </div>
    </div>
</footer>