In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('test') \
                    .getOrCreate()

24/04/23 09:07:49 WARN Utils: Your hostname, codespaces-201091 resolves to a loopback address: 127.0.0.1; using 172.16.5.4 instead (on interface eth0)
24/04/23 09:07:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/23 09:07:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Datasets: Type-Safe Structured APIs
The first API we’ll describe is a type-safe version of Spark’s structured API called Datasets, for writing statically typed code in Java and Scala. The Dataset API is not available in Python and R, because those languages are dynamically typed.

# Structured Streaming

Let’s first analyze the data as a static dataset and create a DataFrame to do so. We’ll also capture the schema from this static dataset, because the streaming reader will reuse it later:

In [3]:
staticDataFrame = spark.read.format('csv')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .load('../data/retail-data/by-day/*.csv')

                                                                                

In [4]:
staticDataFrame.createOrReplaceTempView('retail_data')
staticSchema = staticDataFrame.schema

The window function will include all data from each day in the aggregation. It’s simply a window over the time-series column in our data. This is a helpful tool for manipulating dates and timestamps because we can specify our requirements in a more human form (via intervals), and Spark will group all of them together for us:

In [10]:
from pyspark.sql.functions import window, col
staticDataFrame\
    .selectExpr(
        'CustomerId',
        '(UnitPrice * Quantity) as total_cost',
        'InvoiceDate')\
    .groupBy(
        col('CustomerId'), window(col('InvoiceDate'), '1 day'))\
    .sum('total_cost')\
    .show(5)



+----------+--------------------+-----------------+
|CustomerId|              window|  sum(total_cost)|
+----------+--------------------+-----------------+
|   16057.0|{2011-12-05 00:00...|            -37.6|
|   14126.0|{2011-11-29 00:00...|643.6300000000001|
|   13500.0|{2011-11-16 00:00...|497.9700000000001|
|   17160.0|{2011-11-08 00:00...|516.8499999999999|
|   15608.0|{2011-11-11 00:00...|            122.4|
+----------+--------------------+-----------------+
only showing top 5 rows
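
To make the grouping concrete, here is a minimal plain-Python sketch of what a one-day tumbling window does conceptually: each timestamp is mapped to the [start, end) day that contains it, and totals are summed per (customer, window) pair. The helper names (`day_window`, `daily_totals`) and the sample rows are made up for illustration, and the sketch ignores time zones; it is not how Spark implements `window` internally.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_window(ts):
    """Return the [start, end) one-day window that contains ts."""
    start = datetime(ts.year, ts.month, ts.day)
    return start, start + timedelta(days=1)


def daily_totals(rows):
    """Sum UnitPrice * Quantity per (customer, one-day window)."""
    totals = defaultdict(float)
    for customer_id, unit_price, quantity, invoice_date in rows:
        totals[(customer_id, day_window(invoice_date))] += unit_price * quantity
    return dict(totals)


# Hypothetical rows: (CustomerId, UnitPrice, Quantity, InvoiceDate)
sample_rows = [
    (1, 2.0, 3, datetime(2011, 1, 1, 9, 0)),
    (1, 1.0, 4, datetime(2011, 1, 1, 17, 30)),
    (1, 5.0, 1, datetime(2011, 1, 2, 8, 0)),
    (2, 10.0, 2, datetime(2011, 1, 1, 12, 0)),
]
totals = daily_totals(sample_rows)
```

Both purchases by customer 1 on January 1 land in the same window, while the purchase on January 2 opens a new one.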



                                                                                

Because you’re likely running this in local mode, it’s a good practice to set the number of shuffle partitions to something that’s going to be a better fit for local mode. This configuration specifies the number of partitions that should be created after a shuffle. By default, the value is 200, but because there aren’t many executors on this machine, it’s worth reducing this to 5.

In [6]:
spark.conf.set('spark.sql.shuffle.partitions', '5')
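
As a rough illustration of why this setting matters, the sketch below assigns rows to shuffle partitions by hashing the grouping key. Everything here is an assumption made for the example: `zlib.crc32` stands in for Spark's own hash function, and the helper names are hypothetical. The point is only that the same rows are spread over at most `num_partitions` buckets, so 200 partitions on a tiny local dataset means many small tasks.

```python
import zlib
from collections import defaultdict


def shuffle_partition(key, num_partitions):
    """Pick a partition for a key (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(rows, num_partitions):
    """Group (key, value) rows into partitions by hashed key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[shuffle_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


# 1,000 hypothetical (CustomerId, amount) rows
rows = [(customer_id, 1.0) for customer_id in range(1000)]
few = shuffle(rows, 5)     # at most 5 fairly large partitions
many = shuffle(rows, 200)  # up to 200 very small partitions
```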

Now that we’ve seen how that works, let’s take a look at the streaming code! You’ll notice that very little actually changes about the code. The biggest change is that we use readStream instead of read. You’ll also notice the maxFilesPerTrigger option, which simply specifies the number of files we should read in at once. This is to make our demonstration more “streaming,” and in a production scenario this would probably be omitted.

In [None]:
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option('maxFilesPerTrigger', 1)\
    .format('csv')\
    .option('header', 'true')\
    .load('../data/retail-data/by-day/*.csv')

                                                                                

In [None]:
streamingDataFrame.isStreaming

True

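Let’s set up the same business logic as the previous DataFrame manipulation. We’ll perform a summation in the process. (Despite the PerHour suffix in the variable name, the window is still one day, matching the static example.)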

In [12]:
purchaseByCustomerPerHour = streamingDataFrame\
    .selectExpr(
        'CustomerId',
        '(UnitPrice * Quantity) as total_cost',
        'InvoiceDate')\
    .groupBy(
        col('CustomerId'), window(col('InvoiceDate'), '1 day'))\
    .sum('total_cost')

This is still a lazy operation, so we will need to call a streaming action to start the execution of this data flow.
Streaming actions are a bit different from our conventional static actions because we’re going to be populating data somewhere instead of just calling something like count (which doesn’t make any sense on a stream anyway). The action we will use will output to an in-memory table that we will update after each trigger. In this case, each trigger is based on an individual file (the read option that we set). Spark will mutate the data in the in-memory table such that we will always have the highest value as specified in our previous aggregation:

In [13]:
purchaseByCustomerPerHour.writeStream\
    .format('memory')\
    .queryName('customer_purchases')\
    .outputMode('complete')\
    .start()

23/08/17 07:57:40 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-946e8a07-7575-4b53-9ecc-bd2d255665e6. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
23/08/17 07:57:40 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


<pyspark.sql.streaming.query.StreamingQuery at 0x7f4533b42530>

23/08/17 07:57:44 WARN FileStreamSource: Listed 305 file(s) in 3560 ms          
                                                                                

After we start the stream, we can run queries against it to debug what our result will look like if we were to write this out to a production sink:

In [14]:
spark.sql('''
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
''')\
.show(5)



+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|      null|{2011-03-29 00:00...| 33521.39999999998|
|      null|{2010-12-21 00:00...|31347.479999999938|
|   18102.0|{2010-12-07 00:00...|          25920.37|
|      null|{2010-12-10 00:00...|25399.560000000012|
|      null|{2010-12-17 00:00...|25371.769999999768|
+----------+--------------------+------------------+
only showing top 5 rows



                                                                                


You’ll notice that the composition of our table changes as we read in more data! With each file, the results might or might not change, depending on the data. Naturally, because we’re grouping by customer, we hope to see an increase in the top customer purchase amounts over time (and we do, for a period of time!).
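
The behavior we just observed can be imitated with a small plain-Python sketch. It assumes each “file” is a list of (key, amount) pairs and that each trigger consumes `max_files_per_trigger` files; after every trigger the whole result table is re-emitted, which is the idea behind the complete output mode. The function name and the sample files are hypothetical, and this is only a conceptual model, not Spark’s streaming engine.

```python
from collections import defaultdict


def run_micro_batches(files, max_files_per_trigger=1):
    """Yield the full, sorted result table after each trigger."""
    running_totals = defaultdict(float)
    for start in range(0, len(files), max_files_per_trigger):
        batch = files[start:start + max_files_per_trigger]
        for rows in batch:
            for key, amount in rows:
                running_totals[key] += amount
        # "Complete" mode: emit every group, not only the ones that changed.
        yield sorted(running_totals.items(), key=lambda kv: kv[1], reverse=True)


# Three hypothetical input files of (group key, total_cost) pairs
hypothetical_files = [
    [("a", 10.0), ("b", 5.0)],
    [("b", 20.0)],
    [("a", 1.0), ("c", 30.0)],
]
snapshots = list(run_micro_batches(hypothetical_files))
```

The top entry changes from one snapshot to the next as new files arrive, just as the query against the in-memory table returns different rows over time.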

# Machine Learning and Advanced Analytics

In [7]:
staticDataFrame.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



Machine learning algorithms in MLlib require that data is represented as numerical values. Our current data is represented by a variety of different types, including timestamps, integers, and strings. Therefore we need to transform this data into some numerical representation. In this instance, we’ll use several DataFrame transformations to manipulate our date data:

In [8]:
from pyspark.sql.functions import date_format, col

preppedDataFrame = staticDataFrame\
    .na.fill(0)\
    .withColumn('day_of_week', date_format(col('InvoiceDate'), 'EEEE'))\
    .coalesce(5)
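
For readers who want to see what this preparation does to a single record, here is a rough plain-Python equivalent. It assumes a row is a dictionary, replaces missing numeric values with 0, and derives the full English day name from the timestamp (the same kind of value the 'EEEE' pattern produces). The field list, helper name, and example values are illustrative assumptions, not part of the dataset’s documentation.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Numeric columns from the printed schema above
NUMERIC_FIELDS = ("Quantity", "UnitPrice", "CustomerID")


def prep_row(row):
    """Fill missing numeric fields with 0 and add a day_of_week name."""
    prepped = dict(row)
    for field in NUMERIC_FIELDS:
        if prepped.get(field) is None:
            prepped[field] = 0
    prepped["day_of_week"] = DAY_NAMES[prepped["InvoiceDate"].weekday()]
    return prepped


# Illustrative row with a missing customer id
example_row = {
    "Quantity": 6,
    "UnitPrice": 2.55,
    "CustomerID": None,
    "InvoiceDate": datetime(2010, 12, 1, 8, 26),
}
prepped_example = prep_row(example_row)
```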

We are also going to need to split the data into training and test sets. In this instance, we are going to do this manually by the date on which a certain purchase occurred; however, we could also use MLlib’s transformation APIs to create a training and test set via train validation splits or cross validation:

In [9]:
trainDataFrame = preppedDataFrame\
    .where("InvoiceDate < '2011-07-01'")

testDataFrame = preppedDataFrame\
    .where("InvoiceDate >= '2011-07-01'")

With the data prepared and split into training and test sets, let’s check the result. Because this is a time-series dataset, we split by an arbitrary date in the dataset. Although this might not be the optimal split for our training and test sets, for the purposes of this example it will work just fine. We’ll see that this splits our dataset roughly in half:

In [10]:
trainDataFrame.count()

                                                                                

245903

In [11]:
testDataFrame.count()

                                                                                

296006

Spark’s MLlib also provides a number of transformations with which we can automate some of our general transformations. One such transformer is a StringIndexer:

In [14]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer()\
    .setInputCol('day_of_week')\
    .setOutputCol('day_of_week_index')

This will turn our days of the week into corresponding numerical values. For example, Spark might represent Saturday as 6, and Monday as 1. However, with this numbering scheme, we are implicitly stating that Saturday is greater than Monday (by pure numerical values). This is obviously incorrect. To fix this, we therefore need to use a OneHotEncoder to encode each of these values as its own column. These Boolean flags state whether that day of the week is the relevant day of the week:

In [11]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder()\
    .setInputCol('day_of_week_index')\
    .setOutputCol('day_of_week_encoded')
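
The two steps can be sketched in plain Python as follows. The sketch mirrors two defaults of the Spark transformers: StringIndexer assigns index 0 to the most frequent label, and OneHotEncoder drops the last category, so that category is encoded as all zeros. The function names are hypothetical, and plain lists stand in for Spark’s sparse vectors.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a list of 0.0/1.0 flags."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


days = ["Monday", "Saturday", "Monday", "Friday", "Saturday", "Monday"]
label_to_index = fit_string_index(days)
encoded = {
    day: one_hot(idx, len(label_to_index))
    for day, idx in label_to_index.items()
}
```

No encoded value is numerically “greater” than another, which removes the false ordering that the raw indices implied.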

In [12]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler()\
    .setInputCols(['UnitPrice', 'Quantity', 'day_of_week_encoded'])\
    .setOutputCol('features')

In [15]:
from pyspark.ml import Pipeline

transformationPipeline = Pipeline()\
    .setStages([indexer, encoder, vectorAssembler])

We first need to fit our transformers to this dataset. Basically, our StringIndexer needs to know how many unique values there are to be indexed. After those exist, encoding is easy, but Spark must look at all the distinct values in the column to be indexed in order to store those values later on:

In [16]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)

                                                                                

After we fit the training data, we are ready to take that fitted pipeline and use it to transform all
of our data in a consistent and repeatable way:

In [31]:
transformedTraining = fittedPipeline.transform(trainDataFrame)
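
The fit-then-transform pattern can be illustrated with a toy estimator in plain Python. Here `DayIndexer.fit` learns a label-to-index mapping from the training rows and returns a model whose `transform` applies that same mapping to any rows. The class names are hypothetical, labels are ordered alphabetically for simplicity, and raising on an unseen label mimics StringIndexer’s default handling of invalid values.

```python
class DayIndexerModel:
    """Fitted model: applies a fixed label-to-index mapping."""

    def __init__(self, label_to_index):
        self.label_to_index = label_to_index

    def transform(self, rows):
        transformed = []
        for row in rows:
            label = row["day_of_week"]
            if label not in self.label_to_index:
                raise ValueError(f"Unseen label: {label}")
            transformed.append(
                {**row, "day_of_week_index": self.label_to_index[label]}
            )
        return transformed


class DayIndexer:
    """Estimator: learns the mapping from training data only."""

    def fit(self, rows):
        labels = sorted({row["day_of_week"] for row in rows})
        return DayIndexerModel(
            {label: float(i) for i, label in enumerate(labels)}
        )


train_rows = [{"day_of_week": "Monday"}, {"day_of_week": "Friday"}]
test_rows = [{"day_of_week": "Friday"}]

model = DayIndexer().fit(train_rows)
indexed_test = model.transform(test_rows)
```

Because the mapping is learned once and then reused, training and test data are guaranteed to share the same encoding.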

Caching puts a copy of the intermediately transformed dataset into memory, allowing us to repeatedly access it at much lower cost than running the entire pipeline again. Note that cache() is lazy, so the data is materialized the first time an action runs on it:

In [32]:
transformedTraining.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string, day_of_week: string, day_of_week_index: double, day_of_week_encoded: vector, features: vector]

We now have a training set; it’s time to train the model. First we’ll import the relevant model
that we’d like to use and instantiate it:

In [21]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans()\
    .setK(20)\
    .setSeed(1)

In [22]:
kmModel = kmeans.fit(transformedTraining)

23/08/18 11:58:51 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
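
For intuition about what fitting a k-means model involves, here is a minimal plain-Python sketch of the basic iteration: assign each point to its nearest centroid, recompute each centroid as the mean of its points, and stop when nothing changes. It is a simplification under stated assumptions: the first k points serve as initial centroids (MLlib uses a more careful initialization by default), and the sample points are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20):
    """Basic k-means; initial centroids are simply the first k points."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(
                range(k), key=lambda i: squared_distance(point, centroids[i])
            )
            clusters[nearest].append(point)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, k=2)
```

Even though both initial centroids start in the lower-left group, the iteration ends with one centroid near each group of points.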
                                                                                

# Lower-Level APIs

One thing that you might use RDDs for is to parallelize raw data that you have stored in memory on the driver machine. For instance, let’s parallelize some simple numbers and then convert the resulting RDD to a DataFrame so that we can use it with other DataFrames:

In [3]:
from pyspark.sql import Row

spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()

                                                                                

DataFrame[_1: bigint]
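
Conceptually, parallelizing a local collection means cutting it into slices that can be processed independently. The sketch below shows one simple way to do that with contiguous index ranges; the function name is hypothetical, and the exact slicing Spark performs may differ, so treat this purely as an illustration of the idea.

```python
def slice_into_partitions(items, num_slices):
    """Split a list into num_slices contiguous, roughly equal parts."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


partitions = slice_into_partitions([1, 2, 3], 2)
```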