<img src="https://github.com/Microsoft/sqlworkshops/blob/master/graphics/solutions-microsoft-logo-small.png?raw=true" alt="Microsoft">
<br>

# **SQL Server 2019 big data cluster Tutorial**
## **06 - Using Spark for ETL**

In this tutorial you will learn how to work with Spark Jobs in a SQL Server big data cluster. 

Spark is often used to transform data at large scale. In this Jupyter Notebook, you'll read a set of product review CSV files into a Spark DataFrame, save the DataFrame as a table, and then query 10 rows from that table using Spark SQL.

> Switch your kernel to PySpark and run any statement, such as `print("hello")`, to activate the Spark context.
>
> You may see an error like this:
> ```
> The code failed because of a fatal error:
>   Error sending http request and maximum retry encountered..
> ```
> If so, switch to another kernel, switch back to PySpark, and run the cell again.

In [3]:
# Run any statement to start the Spark application; the SparkSession is then available as 'spark'
print('hello')

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | application_1577707785736_0005 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


hello

In [5]:
# Read the product review CSV files into a Spark DataFrame, name the columns, and show the top rows
results = spark.read.option("inferSchema", "true").csv('/product_review_data').toDF("Item_ID", "Review")
results.show()

+-------+--------------------+
|Item_ID|              Review|
+-------+--------------------+
|  72621|Works fine. Easy ...|
|  89334|great product to ...|
|  89335|Next time will go...|
|  84259|Great Gift Great ...|
|  84398|After trip to Par...|
|  66434|Simply the best t...|
|  66501|This is the exact...|
|  66587|Not super magnet;...|
|  66680|Installed as bath...|
|  66694|Our home was buil...|
|  84489|Hi ;We are runnin...|
|  79052|Terra cotta is th...|
|  73034|One of my fingern...|
|  73298|We installed thes...|
|  66810|needed silicone c...|
|  66912|Great Gift Great ...|
|  67028|Laguiole knives a...|
|  89770|Good sound timers...|
|  84679|AWESOME FEEDBACK ...|
|  84953|love the retro gl...|
+-------+--------------------+
only showing top 20 rows
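
To make the read step easier to reason about without a cluster, here is a minimal plain-Python sketch of what it does conceptually: parse headerless CSV rows, attach the column names `Item_ID` and `Review`, and convert values that look like integers. The sample data, `infer_value`, and `read_reviews` are hypothetical and exist only for illustration. Spark infers one type per column by scanning the data, so this per-value conversion is a simplification and not how Spark's `inferSchema` is implemented.

```python
import csv
import io

# Hypothetical two-row sample; the real data lives in /product_review_data.
SAMPLE_CSV = (
    '1,hypothetical review A\n'
    '2,"hypothetical review B, with a comma"\n'
)


def infer_value(text):
    """Return an int when the text parses as one, otherwise the original string."""
    try:
        return int(text)
    except ValueError:
        return text


def read_reviews(csv_text, column_names=("Item_ID", "Review")):
    """Parse headerless CSV text into a list of dicts keyed by column_names."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        dict(zip(column_names, (infer_value(cell) for cell in row)))
        for row in reader
    ]


rows = read_reviews(SAMPLE_CSV)
for row in rows:
    print(row)
```

Naming the columns at read time, as `toDF("Item_ID", "Review")` does in the cell above, means later steps can refer to columns by name instead of by position.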

In [6]:
# Save the results in Parquet format and register them as a Hive table
results.write.format("parquet").mode("overwrite").saveAsTable("Top_Product_Reviews")

In [7]:
# Execute Spark SQL commands
sqlDF = spark.sql("SELECT * FROM Top_Product_Reviews LIMIT 10")
sqlDF.show()

+-------+--------------------+
|Item_ID|              Review|
+-------+--------------------+
|  72621|Works fine. Easy ...|
|  89334|great product to ...|
|  89335|Next time will go...|
|  84259|Great Gift Great ...|
|  84398|After trip to Par...|
|  66434|Simply the best t...|
|  66501|This is the exact...|
|  66587|Not super magnet;...|
|  66680|Installed as bath...|
|  66694|Our home was buil...|
+-------+--------------------+
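
Note that `LIMIT 10` without an `ORDER BY` returns 10 rows in no guaranteed order, so the result is a sample and not a ranking. If you want an actual "top 10", one option is to aggregate and sort first. The sketch below ranks items by review count. It uses Python's built-in `sqlite3` module as a local stand-in so that it runs without a cluster, and the sample rows and the `rank_items` helper are hypothetical. The SQL text uses only standard clauses and should also be usable with `spark.sql(...)`, though that has not been verified against this cluster.

```python
import sqlite3

# Hypothetical rows standing in for the saved Top_Product_Reviews table.
SAMPLE_ROWS = [
    (101, "sample review a"),
    (101, "sample review b"),
    (102, "sample review c"),
    (103, "sample review d"),
    (103, "sample review e"),
    (103, "sample review f"),
]

# Count reviews per item, sort by that count, and keep at most 10 items.
# The secondary sort on Item_ID makes ties deterministic.
RANKING_QUERY = """
    SELECT Item_ID, COUNT(*) AS review_count
    FROM Top_Product_Reviews
    GROUP BY Item_ID
    ORDER BY review_count DESC, Item_ID
    LIMIT 10
"""


def rank_items(rows, query=RANKING_QUERY):
    """Load rows into an in-memory SQLite table and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Top_Product_Reviews (Item_ID INTEGER, Review TEXT)"
        )
        conn.executemany("INSERT INTO Top_Product_Reviews VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


ranked = rank_items(SAMPLE_ROWS)
for item_id, review_count in ranked:
    print(item_id, review_count)
```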

## **Next Steps: Continue on to Working with Spark and Machine Learning**

Now you're ready to open the final Python Notebook in this tutorial series, `bdc_tutorial_07.ipynb`, to learn how to use Spark for Machine Learning.