<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Housing Prices Prediction - PySpark to Teradataml Conversion
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial'><code>teradatamlspk</code> is the Python package name of the Teradata product pyspark2teradataml. The teradatamlspk package is built as an extension of teradataml, the Teradata Package for Python.</p>

<p style = 'font-size:16px;font-family:Arial'>The syntax and user accessibility of the teradatamlspk APIs are kept similar to the PySpark APIs. This allows existing PySpark workloads that run on the Spark engine to run on Teradata Vantage using ClearScape Analytics with minimal changes.</p>

<p style = 'font-size:16px;font-family:Arial'><code>teradatamlspk</code> offers a function called <code>pyspark2teradataml</code> that converts a PySpark script to a teradatamlspk Python script. This function also generates an HTML report for the conversion, which helps users understand the changes made and carry out any manual changes needed in the generated teradatamlspk script so that it can be run on Vantage.</p>


<center><img src="images/Conv_Architecture.png" width=800 height=800/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>The teradatamlspk package supports the following Vantage systems:</p>
<li style = 'font-size:16px;font-family:Arial'>VantageCloud Lake</li>
<li style = 'font-size:16px;font-family:Arial'>VantageCloud Enterprise</li>
<li style = 'font-size:16px;font-family:Arial'>VantageCore</li></p>
<p style = 'font-size:16px;font-family:Arial'>Based on the connection to Vantage, the teradatamlspk package requires the following minimum software versions be installed on Vantage:</p>

<p style = 'font-size:16px;font-family:Arial'>For connection to Teradata Vantage with Analytics Database only:</p>
<p style = 'font-size:16px;font-family:Arial'>On Analytics Database:</p>
<li style = 'font-size:16px;font-family:Arial'>For VantageCore system, Analytics Database 17.20 or later</li>
<li style = 'font-size:16px;font-family:Arial'>For VantageCloud Enterprise system, Analytics Database 17.20 or later</li>
<li style = 'font-size:16px;font-family:Arial'>For VantageCloud Lake system, Analytics Database 20.00 or later</li></p>



<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let us start by checking the installed version of teradataml. The Openml functions used in this notebook require version 20.0.0.0.</p>

In [None]:
pip show teradataml
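
<p style = 'font-size:16px;font-family:Arial'>The same check can also be done programmatically. The following is a minimal sketch, not part of teradataml: the helper names <code>parse_version</code> and <code>meets_minimum</code> are hypothetical, and the comparison assumes purely numeric, dot-separated version strings such as 20.0.0.0.</p>

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = "20.0.0.0"  # minimum teradataml version stated in this notebook


def parse_version(text):
    """Turn '20.0.0.0' into (20, 0, 0, 0); non-numeric parts are skipped."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_minimum(installed, required=REQUIRED):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


try:
    installed = version("teradataml")
    status = "OK" if meets_minimum(installed) else "upgrade needed"
    print(f"teradataml {installed}: {status}")
except PackageNotFoundError:
    print("teradataml is not installed")
```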

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>If the VM has a lower version, execute the cell below to upgrade teradataml. After the cell executes, restart the kernel. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
%%capture
!pip install --upgrade teradataml

<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install teradatamlspk

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>If the above install command is executed, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import json
import getpass
import pandas as pd
import datetime
from teradataml import *

display.max_rows = 5

from teradatamlspk import pyspark2teradataml

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ~/JupyterLabRoot/UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=HousingPrices_pyspark_to_tdml.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
#%run -i ../run_procedure.py "call get_data('DEMO_HousingPrices_cloud');"
 # Takes about 50 seconds
%run -i  ~/JupyterLabRoot/UseCases/run_procedure.py "call get_data('DEMO_HousingPrices_local');"
 # Takes about 50 secs

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i  ~/JupyterLabRoot/UseCases/run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Call function pyspark2teradataml for conversion.</b></p>

<p style = 'font-size:16px;font-family:Arial'>The first step of the process is to convert the PySpark script to a teradatamlspk script.</p>

<p style = 'font-size:20px;font-family:Arial'><b>PySpark Script Details</b></p>
<p style = 'font-size:16px;font-family:Arial'>The PySpark script is a simple script that performs some initial data analysis on the Housing Price dataset, transforms the data using StandardScaler, and then uses LinearRegression to predict the house value from the provided features and display the regression metrics.</p>

<p style = 'font-size:16px;font-family:Arial'>We will take the following steps to achieve this migration.</p>

<li style = 'font-size:16px;font-family:Arial'><b>Step 1:</b> Run pyspark2teradataml with the PySpark script as input</li>
<li style = 'font-size:16px;font-family:Arial'><b>Step 2:</b> Review the HTML Report</li></p>
    
<p style = 'font-size:16px;font-family:Arial'>In this step, we pass the PySpark script as input to the function.</p>



In [None]:
pyspark2teradataml('Predicting_House_Prices_Pyspark.py')

<p style = 'font-size:16px;font-family:Arial'>The utility function has generated the Python script with teradatamlspk syntax and an HTML report for the conversion.</p>

<p style = 'font-size:16px;font-family:Arial'>The generated Python script may or may not run directly on Vantage.</p>

<p style = 'font-size:16px;font-family:Arial'>The utility function pyspark2teradataml takes care of most of the conversion, but there may be some instances where generated script requires additional manual changes.</p>
<p style = 'font-size:16px;font-family:Arial'><i><b>Note: We can see both files in the file browser pane on the left.</b></i></p>
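
<p style = 'font-size:16px;font-family:Arial'>To confirm from code that both files were written, a small check along these lines may help. It is only a sketch: the helper <code>expected_outputs</code> is hypothetical, and the <code>_tdmlspk</code> suffix for the generated script is an assumption inferred from the report file name used later in this notebook.</p>

```python
from pathlib import Path


def expected_outputs(script_path, suffix="_tdmlspk"):
    """Derive the assumed names of the converted script and the HTML report."""
    src = Path(script_path)
    stem = src.stem + suffix  # assumed naming convention, not a documented API
    return src.with_name(stem + ".py"), src.with_name(stem + ".html")


# Report whether each expected file is present in the working directory.
for path in expected_outputs("Predicting_House_Prices_Pyspark.py"):
    print(path, "found" if path.exists() else "missing")
```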

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>4. Understanding the HTML report.</b></p>

<p style = 'font-size:16px;font-family:Arial'>The generated HTML file contains notes for the script, each listed with the line number it applies to. Notes appear in three different colors, as described in the following list:</p>

<li style = 'font-size:16px;font-family:Arial'><code style= 'background-color:#f5f5f5; color:#000000'><b>black</b></code> (Light Theme) / <code style= 'background-color:#000000; color:#ffffff'><b>white</b></code> (Dark Theme): Notes which are colored in normal color (black when the theme is set to JupyterLab Light, white when the theme is set to JupyterLab Dark) do not need any attention. These notes give additional information about the APIs used in script to user.</li>

<li style = 'font-size:16px;font-family:Arial'><code style= 'background-color:#f5f5f5; color:#0000FF'><b>blue:</b></code> Notes colored in blue need user attention. These APIs are supported, but their functionality may differ from PySpark. The notes specify the exact differences so that you can manually change the references to those APIs in the script.</li>

<li style = 'font-size:16px;font-family:Arial'><code style= 'background-color:#f5f5f5; color:#FF0000'><b>red:</b></code> Notes colored in red need user attention. These APIs have no equivalent functionality in teradatamlspk, so you need to achieve the functionality by other means.</li></p>

In [None]:
from IPython.display import HTML
HTML(filename='Predicting_House_Prices_Pyspark_tdmlspk.html')

<p style = 'font-size:16px;font-family:Arial'>In this step, review the HTML report and act on the items accordingly.</p>

<li style = 'font-size:16px;font-family:Arial'><code>Line 19:</code> This line is colored in black, so no action is needed on this line. The report says RegressionMetrics uses RDD and teradatamlspk does not support RDD-based APIs. So, the utility pyspark2teradataml removed it.</li>

<li style = 'font-size:16px;font-family:Arial'><code>Line 58:</code> This line is colored in blue, so it requires user attention. The report says getOrCreate accepts Vantage connection parameters. So, the user should pass connection parameters here.</li>

<li style = 'font-size:16px;font-family:Arial'><code>Line 77:</code> This line is colored in black, so no action is needed on this line.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Line 108:</code> This line is colored in black, so no action is needed on this line.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Line 108:</code> This line is colored in blue, so it requires user attention. The report says a header is mandatory if the script reads the file from the local file system instead of from cloud storage.</li>
<ul style = 'font-size:16px;font-family:Arial'>So,
    <li>If the corresponding CSV file does not have a header, then add a header.</li>
<li>If the file has a header, then the user does not need to take any action, even though this note is shown in blue.</li></ul>
<li style = 'font-size:16px;font-family:Arial'><code>Line 149:</code> This line is colored in blue, so it requires user attention. The report says the sort operation is not propagated to the next API. The script used in this example has no line of code where the output of the sort API is passed as input to another API. So, no action is taken on this line.</li>

<li style = 'font-size:16px;font-family:Arial'><code>Line 237:</code> This line is colored in black, so no action is needed on this line.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Line 256:</code> This line is colored in blue, so it requires user attention. The report says the StandardScaler function needs an additional column called id to be present in the input DataFrame. The report also says the function monotonically_increasing_id can be used to create the column, and advises consulting the User Guide. So, this line is modified in the script to add an id column to the DataFrame. In addition, the report says the outputCol argument is not significant and that transform returns all input columns along with the scaled columns. StandardScaler is an ML function and, unlike PySpark, teradatamlspk returns columns instead of vectors, as mentioned in the "Important Notes" section of the HTML report. So, the script is modified by replacing the vector with the actual columns.</li>

<li style = 'font-size:16px;font-family:Arial'><code>Line 277:</code> This line is colored in black, so no action is needed on this line.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Line 386:</code> This line is colored in red, so it requires user attention and action. RegressionMetrics is not supported, so the user should switch to RegressionEvaluator and use its functions instead. Since the script already uses RegressionEvaluator, line 386 is commented out manually. Note that lines 386, 392, 398 and 404 all depend on RegressionMetrics, so they are all commented out even though lines 392, 398 and 404 are not mentioned in the report.</li></p>

<p style = 'font-size:16px;font-family:Arial'>Apart from these, line 291 needs a change.
As mentioned in the "Important Notes" section, ML functions do not accept vectors; they accept multiple columns instead.</p>

<p style = 'font-size:16px;font-family:Arial'>So, the function LinearRegression in line 291 is changed to accept a list of feature columns. Once all these changes are done, you can run the script on Vantage.</p>
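
<p style = 'font-size:16px;font-family:Arial'>The review above can be summarized as a simple triage by color. The sketch below is illustrative only: the list of notes is transcribed by hand from the walkthrough in this section (it is not read from the HTML report), and the helper <code>lines_needing_attention</code> is a hypothetical name.</p>

```python
# (script line, note color) pairs transcribed from the walkthrough above.
notes = [
    (19, "black"), (58, "blue"), (77, "black"), (108, "black"), (108, "blue"),
    (149, "blue"), (237, "black"), (256, "blue"), (277, "black"), (386, "red"),
]

# What each color asks of the user, as described in the color legend.
ACTION = {
    "black": "informational, no action needed",
    "blue": "review the functional differences",
    "red": "replace the unsupported API",
}


def lines_needing_attention(notes):
    """Return the sorted script lines that carry at least one blue or red note."""
    return sorted({line for line, color in notes if color != "black"})


for line in lines_needing_attention(notes):
    colors = [color for n, color in notes if n == line and color != "black"]
    print(line, "->", ", ".join(ACTION[c] for c in colors))
```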

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>5. Changes to the converted script.</b></p>

<p style = 'font-size:16px;font-family:Arial'>As noted in the HTML report above, we have to make some changes to the converted code.</p>
    
<p style = 'font-size:16px;font-family:Arial'>As noted for line 58, we have to pass the Vantage connection parameters to <code>getOrCreate</code>.</p>

<p style = 'font-size:16px;font-family:Arial'><code> spark = TeradataSession.builder.master("local[2]").appName("Linear-Regression-California-Housing").getOrCreate(host=getpass.getpass('Enter host: '), user=getpass.getpass('Enter user: '), password=getpass.getpass('Enter password: ')) </code></p>
    
<p style = 'font-size:16px;font-family:Arial'>Enter the following parameters for this environment:</p>
<li style = 'font-size:16px;font-family:Arial'><code>host:</code> The machine name. Copy the hostname from the dashboard (for example, <b style = 'font-size:12px;font-family:Arial'>test-12345.env.clearscape.teradata.com</b>).</li>
<li style = 'font-size:16px;font-family:Arial'><code>user:</code> The user name, which is "demo_user".</li>
<li style = 'font-size:16px;font-family:Arial'><code>password:</code> Password for the CSAE environment </li> </p>
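
<p style = 'font-size:16px;font-family:Arial'>One possible way to collect these three values before calling <code>getOrCreate</code> is sketched below. The helper <code>connection_params</code> is hypothetical and not part of teradatamlspk. Unlike the converted line above, which uses <code>getpass</code> for all three prompts, this sketch hides only the password, which is a design choice rather than a requirement.</p>

```python
import getpass


def connection_params(ask=input, ask_secret=getpass.getpass):
    """Prompt for the three values that getOrCreate is given in the converted script."""
    return {
        "host": ask("Enter host: "),
        "user": ask("Enter user: "),
        "password": ask_secret("Enter password: "),  # not echoed to the screen
    }


# Usage (prompts interactively, so it is left commented out here):
# params = connection_params()
# The values can then be passed as host=, user= and password= to getOrCreate,
# as in the converted line shown above.
```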

<hr style="height:1px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>5.1 Fetch Data in DataFrame </b></p>

<p style = 'font-size:16px;font-family:Arial'>On line 108, we have to change how the data is read. The converted script reads the data from Google Cloud Storage:</p>

<p style = 'font-size:16px;font-family:Arial'><code>housing_df = spark.read.csv(path=HOUSING_DATA, schema=schema, header=True).cache()</code></p>
    
<p style = 'font-size:16px;font-family:Arial'>Here, we instead create a DataFrame from the Vantage table that the get_data procedure loaded from Google Cloud Storage.</p>

<p style = 'font-size:16px;font-family:Arial'><code>housing_df = DataFrame(in_schema("DEMO_HousingPrices","Housing_Data"))</code></p>

In [None]:
housing_df = DataFrame(in_schema("DEMO_HousingPrices","Housing_Data"))
housing_df

<hr style="height:1px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>5.2 Changes to StandardScaler function </b></p>

<p style = 'font-size:16px;font-family:Arial'>As mentioned in the HTML report for line 256, the StandardScaler function needs an additional column called id to be present in the input DataFrame. The report suggests using the function monotonically_increasing_id to create the column and advises consulting the User Guide. So, we modify the script to add an id column to the DataFrame using the code below.</p>

<p style = 'font-size:16px;font-family:Arial'><code>from teradatamlspk.sql.functions import monotonically_increasing_id<br>
assembled_df = assembled_df.withColumn("id", monotonically_increasing_id())<br>
assembled_df.show(5)</code></p>


<p style = 'font-size:16px;font-family:Arial'>Use the StandardScaler function</p>

<p style = 'font-size:16px;font-family:Arial'><code>standardScaler = StandardScaler(inputCol=["totbdrms", "pop", "houshlds", "medinc", "rmsperhh", "popperhh", "bdrmsperrm"], outputCol="features_scaled")</code></p>


<p style = 'font-size:16px;font-family:Arial'>Fit the scaler to the DataFrame and transform it</p>
<p style = 'font-size:16px;font-family:Arial'><code>scaled_df = standardScaler.fit(assembled_df).transform(assembled_df)</code></p>

<p style = 'font-size:16px;font-family:Arial'><code>scaled_df.select(["totbdrms", "pop", "houshlds", "medinc", "rmsperhh", "popperhh", "bdrmsperrm"]).show(10, truncate=False)</code></p>

<p style = 'font-size:16px;font-family:Arial'>The changes described in the steps above must be made in the converted script before it is executed on Vantage using the teradatamlspk syntax.</p>
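
<p style = 'font-size:16px;font-family:Arial'>To illustrate what "columns instead of vectors" means for a scaler, the plain-Python sketch below standardizes each named column separately and keeps the id column untouched. It is not teradatamlspk code: the helper <code>scale_columns</code> is hypothetical, the three rows are toy values, and centering on the mean with the sample standard deviation is an assumption, so the defaults of the actual StandardScaler may differ.</p>

```python
from statistics import mean, stdev

# Toy rows; column names are borrowed from the script, values are made up.
rows = [
    {"id": 0, "medinc": 2.0, "pop": 100.0},
    {"id": 1, "medinc": 4.0, "pop": 300.0},
    {"id": 2, "medinc": 6.0, "pop": 500.0},
]


def scale_columns(rows, columns):
    """Return new rows where each listed column is (value - mean) / sample std."""
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), stdev(values))
    scaled = []
    for row in rows:
        new_row = dict(row)  # other columns, such as id, pass through unchanged
        for col, (mu, sigma) in stats.items():
            new_row[col] = (row[col] - mu) / sigma
        scaled.append(new_row)
    return scaled


for row in scale_columns(rows, ["medinc", "pop"]):
    print(row)
```

<p style = 'font-size:16px;font-family:Arial'>Each scaled feature remains an ordinary named column, which is why downstream functions are given a list of column names rather than a single vector column.</p>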

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>6. Executing the updated script.</b></p>

<p style = 'font-size:16px;font-family:Arial'>After making the recommended changes, we have an updated script that can be executed on Vantage.</p>

<p style = 'font-size:16px;font-family:Arial'>An updated script (<code>UpdatedPredicting_House_Prices_Pyspark_tdmlspk.py</code>) is provided here; it executes all the functions on Teradata Vantage using the teradataml function library.</p>

<p style = 'font-size:16px;font-family:Arial'>To execute the updated script, enter the following parameters for this environment:</p>
<li style = 'font-size:16px;font-family:Arial'><code>host:</code> Host Name is  <b>"host.docker.internal"</b> </li>
<li style = 'font-size:16px;font-family:Arial'><code>user_name:</code> User Name is <b>"demo_user" </b></li>
<li style = 'font-size:16px;font-family:Arial'><code>password:</code> Password for the CSAE environment </li> </p>

<p style = 'font-size:12px;font-family:Arial'><b><i>Note: Some comments are added in the updated script manually for better understanding of the output.</i></b></p>

In [None]:
%run -i UpdatedPredicting_House_Prices_Pyspark_tdmlspk.py

<p style = 'font-size:18px;font-family:Arial'><b> Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>Thus, using the function called <code>pyspark2teradataml</code> from the <code>teradatamlspk</code> package we have seen the ease with which we can convert a PySpark script to a teradatamlspk Python script. The generated HTML report makes it easy for users to understand and make any manual changes needed in the generated teradatamlspk script, so that the script can be run on Vantage.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Limitations and Considerations</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following limitations and considerations apply when using the teradatamlspk package:</p>

<li style = 'font-size:16px;font-family:Arial'>PySpark Resilient Distributed Dataset (RDD) based APIs are not applicable to Teradata Vantage, because PySpark stores data in RDD format and Vantage stores data in a different format.</li>

<li style = 'font-size:16px;font-family:Arial'>PySpark streaming APIs are not supported.</li>
<li style = 'font-size:16px;font-family:Arial'>pandas style DataFrame APIs offered by PySpark are not supported.</li></p>
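
<p style = 'font-size:16px;font-family:Arial'>Before running a conversion, it can be useful to scan a PySpark script for signs of these unsupported API families. The sketch below is a rough heuristic and not a feature of teradatamlspk: the patterns are assumptions based on common PySpark module and attribute names, the helper <code>flag_unsupported</code> is hypothetical, and the HTML report remains the authoritative source.</p>

```python
import re

# Heuristic patterns per limitation; they may miss cases or over-match.
PATTERNS = {
    "RDD-based API": re.compile(r"\.rdd\b|pyspark\.mllib|\bparallelize\("),
    "streaming API": re.compile(r"pyspark\.streaming|\.readStream\b|\.writeStream\b"),
    "pandas-style API": re.compile(r"pyspark\.pandas"),
}


def flag_unsupported(source):
    """Return (line number, label) pairs for lines matching any pattern."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# A small hand-written sample; a real script would be read from a file.
sample = (
    "from pyspark.mllib.evaluation import RegressionMetrics\n"
    "import pyspark.pandas as ps\n"
    "df = spark.read.csv('housing.csv')\n"
)
print(flag_unsupported(sample))
```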

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i  ~/JupyterLabRoot/UseCases/run_procedure.py "call remove_data('DEMO_HousingPrices');" 
#Takes 40 seconds

<p style = 'font-size:16px;font-family:Arial'>If you upgraded the teradataml package earlier in this notebook, restore the original version by uncommenting and running the code cell below.</p>

In [None]:
%%capture
#!pip install teradataml==17.20.0.6 --force-reinstall

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>Resources</b>
<p style = 'font-size:16px;font-family:Arial'>Let’s look at the elements we have available for reference for this notebook:</p>
<b style = 'font-size:20px;font-family:Arial'>Dataset Details</b>
<p>
<li style = 'font-size:16px;font-family:Arial'><code>longitude:</code> A measure of how far west a house is; a higher value is farther west</li>
<li style = 'font-size:16px;font-family:Arial'><code>latitude:</code> A measure of how far north a house is; a higher value is farther north</li>
<li style = 'font-size:16px;font-family:Arial'><code>housingMedianAge:</code> Median age of a house within a block; a lower number is a newer building</li>
<li style = 'font-size:16px;font-family:Arial'><code>totalRooms:</code> Total number of rooms within a block</li>
<li style = 'font-size:16px;font-family:Arial'><code>totalBedrooms:</code> Total number of bedrooms within a block</li>
<li style = 'font-size:16px;font-family:Arial'><code>population:</code> Total number of people residing within a block</li>
<li style = 'font-size:16px;font-family:Arial'><code>households:</code> Total number of households, a group of people residing within a home unit, for a block</li>
<li style = 'font-size:16px;font-family:Arial'><code>medianIncome:</code> Median income for households within a block of houses (measured in tens of thousands of US Dollars)</li>
<li style = 'font-size:16px;font-family:Arial'><code>medianHouseValue:</code> Median house value for households within a block (measured in US Dollars)</li>
<li style = 'font-size:16px;font-family:Arial'><code>oceanProximity:</code> Location of the house w.r.t ocean/sea</li></p>
<b style = 'font-size:18px;font-family:Arial'>Filters:</b> 
    <li style = 'font-size:16px;font-family:Arial'><b>Industry:</b> Real Estate</li>
<li style = 'font-size:16px;font-family:Arial'><b>Functionality:</b> PySpark to teradataml conversion</li> 
<li style = 'font-size:16px;font-family:Arial'><b>Use Case:</b> Housing Prices</li></p>
<b style = 'font-size:18px;font-family:Arial'>Related Resources:</b>
<li style = 'font-size:16px;font-family:Arial'><a href = 'https://teradata.seismic.com/app#/doccenter/dc7eb2cf-bd2e-462a-a056-fcc02c9fd2f2/doc/%252Fddb2fe9eb1-f754-d3df-07fa-0724e0ddd3e9%252FdfODEyNmNlZmEtZmM4Mi00ODUyLTgzZTAtOTEzMTBlODQ5YjUw%252CPT0%253D%252CRGVtbw%253D%253D%252FdfYmI3ODY3ZDQtM2Q4Zi00ZTk5LTg2ZDYtNjBlZTk4ODY2YTY4%252CPT0%253D%252CRXh0ZXJuYWwgQXVkaWVuY2Vz%252Flf2e777d25-587b-4009-a9ce-b3a605c3554d//?mode=view&searchId=a8b70941-59fb-4da3-b4e0-69ad0e68b21b'>Migrate PySpark Workloads to Teradata to Fast Track AI/ML and Minimize Costs</a> </li>



<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            © 2024 Teradata. All rights reserved.
        </div>
    </div>
</footer>