# Managing Big Data for Connected Devices

## 420-N63-NA

## Kawser Wazed Nafi
 ----------------------------------------------------------------------------------------------------------------------------------
    
## StructType & StructField

PySpark's StructType and StructField classes are used to specify a DataFrame's schema programmatically and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructField objects, each of which defines a column name, a column data type, a boolean indicating whether the field can be null, and optional metadata.

When data arrives without an explicit structure, StructType and StructField let us describe that structure and load the data into a PySpark DataFrame. With named, typed columns, the analysis that follows is easier to reason about and less error-prone.

At the time of creating a PySpark Dataframe, we can specify the structure of the data using StructType and StructField.

## StructType Example 1

Let us consider input data that has no structure of its own. Using StructType, we can give each column a name and define its data type.

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

ss = SparkSession.builder.master("local[4]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

# Explanation:
# StructType defines the schema with six fields: firstname, middlename, lastname, id, gender, and salary.

# Each field is defined using StructField, specifying:

# The column name (e.g., firstname)
# The data type (e.g., StringType(), IntegerType())
# The nullable flag (True means the column can have null values)
# The DataFrame is created using createDataFrame(), and the schema is explicitly applied.
schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id", StringType(), True), \
    StructField("gender", StringType(), True), \
    StructField("salary", IntegerType(), True) \
  ])
 
dataframe = ss.createDataFrame(data=data,schema=schema)
dataframe.printSchema()
dataframe.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



## StructType Example 2

Now suppose the same input data stores the first name, middle name, and last name together as a tuple. A tuple inside each record makes the data a nested structure. To describe it in your program, you have to use a nested StructType object.

In [4]:
structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])

dataframe2 = ss.createDataFrame(data=structureData,schema=structureSchema)
dataframe2.printSchema()
dataframe2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



## Exercise 1

From our latest Movie dataset we got the following data:
    
data = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]

The data has the following structure: (userID, rating), (movieID, genre)

Please structure the data and load it into a DataFrame for further study.

     

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType

# Initialize Spark Session
ss = SparkSession.builder.master("local[4]").appName("MovieDataStructuring").getOrCreate()

# Provided Data
data = [
    ((1, 4.2), (1, "funny")),
    ((2, 4.5), (3, "funny")),
    ((1, 4.0), (6, "funny")),
    ((3, 5.0), (47, "action")),
    ((4, 4.3), (50, "romantic")),
    ((3, 3.2), (70, "biography")),
    ((4, 5.0), (101, "biography")),
    ((4, 4.6), (110, "Scientific")),
    ((1, 5.0), (151, "action")),
    ((1, 4.6), (157, "action")),
    ((2, 3.5), (167, "funny")),
    ((1, 4.1), (172, "funny")),
    ((3, 4.7), (181, "action")),
    ((4, 3.9), (192, "romantic")),
    ((3, 3.8), (201, "biography")),
    ((4, 5.0), (211, "biography")),
    ((4, 4.6), (224, "Scientific")),
    ((1, 5.0), (231, "action"))
]

# Defining the Schema
schema = StructType([
    StructField('user_rating', StructType([
        StructField('userID', IntegerType(), True),
        StructField('rating', FloatType(), True)
    ])),
    StructField('movie_info', StructType([
        StructField('movieID', IntegerType(), True),
        StructField('genre', StringType(), True)
    ]))
])

# Creating DataFrame
df = ss.createDataFrame(data, schema=schema)

# Display Schema and Data
df.printSchema()
df.show(truncate=False)

# Flattening the nested structure for easier analysis
df_flat = df.select(
    df.user_rating.userID.alias("userID"),
    df.user_rating.rating.alias("rating"),
    df.movie_info.movieID.alias("movieID"),
    df.movie_info.genre.alias("genre")
)

# Displaying the Flattened Data
df_flat.show(truncate=False)
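
The flattening step can also be done before the data reaches Spark. The sketch below is a plain-Python alternative, not the approach used in the cell above: it unpacks each `((userID, rating), (movieID, genre))` record into one flat tuple. The helper name `flatten_row` is hypothetical, and only three rows from the exercise are used to keep it short.

```python
# A few rows from the exercise, shaped as ((userID, rating), (movieID, genre)).
nested_rows = [
    ((1, 4.2), (1, "funny")),
    ((3, 5.0), (47, "action")),
    ((4, 4.3), (50, "romantic")),
]

def flatten_row(row):
    """Turn ((userID, rating), (movieID, genre)) into one flat 4-tuple."""
    (user_id, rating), (movie_id, genre) = row
    return (user_id, rating, movie_id, genre)

flat_rows = [flatten_row(r) for r in nested_rows]
flat_columns = ["userID", "rating", "movieID", "genre"]

# Show each flat record together with its column names.
for r in flat_rows:
    print(dict(zip(flat_columns, r)))
```

`flat_rows` and `flat_columns` could then be passed to `createDataFrame` as the data and the schema (a list of column names), in which case Spark infers the column types instead of using an explicit StructType.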

## Transform

Another API that Spark provides for preparing a DataFrame for further analysis is pyspark.sql.DataFrame.transform(). It is used to chain custom transformations: it passes the DataFrame to the given function and returns the new DataFrame that the function produces.
The transformations in this section only add or modify columns, so the result keeps the same number of rows.

### Syntax

DataFrame.transform(func: Callable[[…], DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame


In [5]:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession
ss = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
dataframe = ss.createDataFrame(data = simpleData, schema = columns)
dataframe.printSchema()
dataframe.show(truncate=False)


root
 |-- CourseName: string (nullable = true)
 |-- fee: long (nullable = true)
 |-- discount: long (nullable = true)

+----------+----+--------+
|CourseName|fee |discount|
+----------+----+--------+
|Java      |4000|5       |
|Python    |4600|10      |
|Scala     |4100|15      |
|Scala     |4500|15      |
|PHP       |3000|20      |
+----------+----+--------+



We can define custom transformation functions in our program, pass our DataFrame through them one after another, and get the fully transformed data at the end.

In [9]:
# Custom transformation 1
from pyspark.sql.functions import upper
def to_upper_str_columns(dataframe):
    return dataframe.withColumn("CourseName",upper(dataframe.CourseName))

# Custom transformation 2
def reduce_price(dataframe,reduceBy):
    return dataframe.withColumn("new_fee",dataframe.fee - reduceBy)

# Custom transformation 3
def apply_discount(dataframe):
    return dataframe.withColumn("discounted_fee",  \
             dataframe.new_fee - (dataframe.new_fee * dataframe.discount) / 100)

# We are going to reduce the course price by 1000 CAD for all the courses. At the same time, we are going to transform all the course names to uppercase.
dataframe2 =  dataframe.transform(to_upper_str_columns) \
        .transform(reduce_price,1000) \
        .transform(apply_discount)
dataframe2.show()

+----------+----+--------+-------+--------------+
|CourseName| fee|discount|new_fee|discounted_fee|
+----------+----+--------+-------+--------------+
|      JAVA|4000|       5|   3000|        2850.0|
|    PYTHON|4600|      10|   3600|        3240.0|
|     SCALA|4100|      15|   3100|        2635.0|
|     SCALA|4500|      15|   3500|        2975.0|
|       PHP|3000|      20|   2000|        1600.0|
+----------+----+--------+-------+--------------+
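
To check the arithmetic in the table above without Spark, the chain can be modelled in plain Python. This is only a sketch: the `transform` helper below is a hypothetical stand-in that mirrors the idea behind `DataFrame.transform` (call the function on the data, forwarding any extra arguments), and rows are represented as dictionaries instead of a DataFrame.

```python
def transform(rows, func, *args, **kwargs):
    """Plain-Python stand-in for the chaining idea: call func on the data."""
    return func(rows, *args, **kwargs)

def reduce_price(rows, reduce_by):
    # new_fee = fee - reduce_by
    return [{**r, "new_fee": r["fee"] - reduce_by} for r in rows]

def apply_discount(rows):
    # discounted_fee = new_fee - new_fee * discount / 100
    return [
        {**r, "discounted_fee": r["new_fee"] - (r["new_fee"] * r["discount"]) / 100}
        for r in rows
    ]

courses = [
    {"CourseName": "Java", "fee": 4000, "discount": 5},
    {"CourseName": "Python", "fee": 4600, "discount": 10},
    {"CourseName": "Scala", "fee": 4100, "discount": 15},
    {"CourseName": "Scala", "fee": 4500, "discount": 15},
    {"CourseName": "PHP", "fee": 3000, "discount": 20},
]

# Same order of steps as the PySpark chain: reduce the price, then discount.
reduced = transform(courses, reduce_price, 1000)
result = transform(reduced, apply_discount)

for r in result:
    print(r["CourseName"], r["new_fee"], r["discounted_fee"])
```

The printed `new_fee` and `discounted_fee` values should agree with the corresponding columns of the PySpark output, for example 4000 - 1000 = 3000 and 3000 - 3000 * 5 / 100 = 2850.0 for the Java course.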



## Exercise 2

From Exercise 1, we have the following dataset:
    
data = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]

The data has the following structure: (userID, rating), (movieID, genre)

The system had an issue: for some unexplained reason, the ratings of some "funny" genre movies were reduced by 15% and the reduced ratings were recorded. The reduction did not affect all ratings; it happened only for ratings lower than 4.5.

Please increase the ratings by 15% for the "funny" genre movies whose recorded ratings are lower than 4.5. List both the old ratings and new ratings side by side as shown in the examples.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import col, when

# Create SparkSession
ss = SparkSession.builder.appName('Exercise 2').getOrCreate()

# Data
data = [((1,4.2),(1,"funny")),
        ((2,4.5),(3,"funny")),
        ((1,4.0),(6,"funny")),
        ((3,5.0),(47,"action")),
        ((4,4.3),(50,"romantic")),
        ((3,3.2),(70,"biography")),
        ((4,5.0),(101,"biography")),
        ((4,4.6),(110,"Scientific")),
        ((1,5.0),(151,"action")),
        ((1,4.6),(157,"action")),
        ((2,3.5),(167,"funny")),
        ((1,4.1),(172,"funny")),
        ((3,4.7),(181,"action")),
        ((4,3.9),(192,"romantic")),
        ((3,3.8),(201,"biography")),
        ((4,5.0),(211,"biography")),
        ((4,4.6),(224,"Scientific")),
        ((1,5.0),(231,"action"))]

# Schema
schema = StructType([
    StructField('user', StructType([
        StructField('userID', IntegerType(), True),
        StructField('rating', DoubleType(), True)
    ])),
    StructField('movie', StructType([
        StructField('movieID', IntegerType(), True),
        StructField('genre', StringType(), True)
    ]))
])

# Create DataFrame
df = ss.createDataFrame(data, schema)

# Adjust ratings for 'funny' genre where rating < 4.5
df_adjusted = df.withColumn("old_rating", col("user.rating")) \
                .withColumn("new_rating", 
                            when((col("movie.genre") == "funny") & (col("user.rating") < 4.5),
                                 col("user.rating") * 1.15)
                            .otherwise(col("user.rating")))

# Display the result
df_adjusted.select("user.userID", "movie.movieID", "movie.genre", "old_rating", "new_rating").show(truncate=False)
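
The expected `new_rating` values can be worked out by hand with the same rule the `when(...).otherwise(...)` expression applies. The sketch below does this in plain Python for a few rows of the dataset; the function name `adjust` is hypothetical. The last lines are an aside, not part of the exercise: multiplying by 1.15 follows the wording "increase by 15%", but it does not exactly undo an earlier 15% reduction, which would require dividing by 0.85.

```python
import math

FACTOR = 1.15      # "increase by 15%", as the exercise asks
THRESHOLD = 4.5    # only ratings strictly below this value are adjusted

def adjust(rating, genre):
    """Return the new rating under the exercise's rule."""
    if genre == "funny" and rating < THRESHOLD:
        return rating * FACTOR
    return rating

samples = [(4.2, "funny"), (4.5, "funny"), (4.0, "funny"),
           (5.0, "action"), (4.3, "romantic"), (3.5, "funny")]

for old, genre in samples:
    print(genre, old, round(adjust(old, genre), 3))

# Aside: a 15% increase is not the exact inverse of a 15% reduction.
original = 4.0
reduced = original * 0.85
print(round(reduced * 1.15, 3))   # 3.91, slightly below the original 4.0
print(round(reduced / 0.85, 3))   # 4.0, the original value
```

Note that the 4.5 "funny" rating is left unchanged because the condition uses a strict `<`, matching the PySpark expression above.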

## Union, UnionAll and UnionByName

PySpark union() and unionAll() transformations are used to merge two or more DataFrames that have the same schema or structure. The union operation appends the rows of the second DataFrame to the first, matching columns by position; it does not inspect the data, so duplicate rows are kept.




In [11]:
simpleData = [("James","Sales","NY",90000,34,10000), \
    ("Michael","Sales","NY",86000,56,20000), \
    ("Robert","Sales","CA",81000,30,23000), \
    ("Maria","Finance","CA",90000,24,23000) \
  ]

columns= ["employee_name","department","state","salary","age","bonus"]
dataframe = ss.createDataFrame(data = simpleData, schema = columns)
dataframe.printSchema()
dataframe.show(truncate=False)


simpleData2 = [("James","Sales","NY",90000,34,10000), \
    ("Maria","Finance","CA",90000,24,23000), \
    ("Jen","Finance","NY",79000,53,15000), \
    ("Jeff","Marketing","CA",80000,25,18000), \
    ("Kumar","Marketing","NY",91000,50,21000) \
  ]
columns2= ["employee_name","department","state","salary","age","bonus"]

dataframe2 = ss.createDataFrame(data = simpleData2, schema = columns2)

dataframe2.printSchema()
dataframe2.show(truncate=False)


unionDF = dataframe.union(dataframe2)
unionDF.printSchema()
unionDF.show(truncate=False)


unionAllDF = dataframe.unionAll(dataframe2)
unionAllDF.printSchema()
unionAllDF.show(truncate=False)

unionAllDFbyName = dataframe.unionByName(dataframe2)
unionAllDFbyName.printSchema()
unionAllDFbyName.show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
+-------------+----------+-----+------+---+-----+

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----

## Exercise 3

Can you see the differences between Union, UnionAll and UnionByName? Please state them over here.

<h1>Answer</h1>
<p>union(): Combines two DataFrames by column position and keeps duplicate rows (like SQL UNION ALL).</p>
<p>unionAll(): Behaves the same as union(); it is deprecated since Spark 2.0.0 in favour of union().</p>
<p>unionByName(): Combines DataFrames by matching column names, not column order.</p>
<p>Use union() when both DataFrames have the same columns in the same order.
Add distinct() (or dropDuplicates()) after the union when you want to remove duplicates.
Use unionByName() when column orders are different, but the column names are the same.</p>
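
The difference between matching by position and matching by name is easiest to see when the two inputs list their columns in a different order. The following is a plain-Python model of the two behaviours, not Spark's implementation; the helper names and the two-column sample data are hypothetical.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Model of union(): keep the left column names, append rows unchanged."""
    if len(cols_a) != len(cols_b):
        raise ValueError("both sides need the same number of columns")
    return cols_a, rows_a + rows_b

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Model of unionByName(): reorder the right side to the left column order."""
    if set(cols_a) != set(cols_b):
        raise ValueError("both sides need the same column names")
    idx = [cols_b.index(c) for c in cols_a]
    reordered = [tuple(row[i] for i in idx) for row in rows_b]
    return cols_a, rows_a + reordered

cols1 = ["employee_name", "state"]
rows1 = [("James", "NY"), ("Maria", "CA")]
cols2 = ["state", "employee_name"]          # same names, different order
rows2 = [("NY", "Jen"), ("NY", "James")]

_, by_pos = union_by_position(cols1, rows1, cols2, rows2)
_, by_name = union_by_name(cols1, rows1, cols2, rows2)

print(by_pos)    # states end up in the employee_name position
print(by_name)   # values line up with the right columns

# Neither variant removes duplicates; that is a separate, explicit step.
deduped = list(dict.fromkeys(by_name))
print(deduped)
```

In this model the positional union silently puts `"NY"` under `employee_name`, while the by-name union keeps `("James", "NY")` twice until the explicit deduplication step, which plays the role of `distinct()`.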

## Exercise 4
Consider the data given here. Perform the Union, UnionAll and UnionByName operations on these two datasets.

data1 = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]


data2 = [((2,4.1),(1,"funny")),
((1,4.2),(3,"funny")),
((2,4.0),(6,"funny")),
((1,4.0),(47,"action")),
((2,4.7),(50,"romantic")),
((1,3.6),(70,"biography")),
((2,4.2),(101,"biography")),
((4,4.7),(111,"Scientific")),
((3,5.0),(151,"action")),
((2,4.6),( 157,"action")),
((1,3.5),(167,"funny")),
((5,4.1),(172,"funny")),
((2,4.5),(181,"action")),
((3,4.3),(192,"romantic")),
((2,4.2),(201,"biography")),
((3,5.0),(211,"biography")),
((3,4.6),(224,"Scientific")),
((2,5.0),(231,"action"))]



In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("Union Example").getOrCreate()

# Data
data1 = [((1,4.2),(1,"funny")), ((2,4.5),(3,"funny")), ((1,4.0),(6,"funny")), ((3,5.0),(47,"action")),
         ((4,4.3),(50,"romantic")), ((3,3.2),(70,"biography")), ((4,5.0),(101,"biography")),
         ((4,4.6),(110,"Scientific")), ((1,5.0),(151,"action")), ((1,4.6),(157,"action")),
         ((2,3.5),(167,"funny")), ((1,4.1),(172,"funny")), ((3,4.7),(181,"action")),
         ((4,3.9),(192,"romantic")), ((3,3.8),(201,"biography")), ((4,5.0),(211,"biography")),
         ((4,4.6),(224,"Scientific")), ((1,5.0),(231,"action"))]

data2 = [((2,4.1),(1,"funny")), ((1,4.2),(3,"funny")), ((2,4.0),(6,"funny")), ((1,4.0),(47,"action")),
         ((2,4.7),(50,"romantic")), ((1,3.6),(70,"biography")), ((2,4.2),(101,"biography")),
         ((4,4.7),(111,"Scientific")), ((3,5.0),(151,"action")), ((2,4.6),(157,"action")),
         ((1,3.5),(167,"funny")), ((5,4.1),(172,"funny")), ((2,4.5),(181,"action")),
         ((3,4.3),(192,"romantic")), ((2,4.2),(201,"biography")), ((3,5.0),(211,"biography")),
         ((3,4.6),(224,"Scientific")), ((2,5.0),(231,"action"))]

# Schema
schema = StructType([
    StructField("userInfo", StructType([
        StructField("userID", IntegerType(), True),
        StructField("rating", DoubleType(), True)
    ])),
    StructField("movieInfo", StructType([
        StructField("movieID", IntegerType(), True),
        StructField("genre", StringType(), True)
    ]))
])

# Creating DataFrames
df1 = spark.createDataFrame(data1, schema=schema)
df2 = spark.createDataFrame(data2, schema=schema)

# Union followed by distinct(): union() itself keeps duplicates,
# so distinct() is added here to drop repeated rows
union_df = df1.union(df2).distinct()
print("Union Result:")
union_df.show(truncate=False)

# UnionAll (includes duplicates, same as union without distinct)
union_all_df = df1.union(df2)
print("UnionAll Result:")
union_all_df.show(truncate=False)

# UnionByName (matches columns by name)
union_by_name_df = df1.unionByName(df2)
print("UnionByName Result:")
union_by_name_df.show(truncate=False)

Union Result:


Py4JJavaError: An error occurred while calling o54.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4) (172.17.18.224 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
Caused by: java.net.SocketTimeoutException: Accept timed out
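
The `Python worker failed to connect back` error above comes from the environment, not from the union code itself. One commonly suggested remedy, offered here as an assumption about this setup and not a confirmed diagnosis, is to point Spark at the same Python interpreter that runs the notebook before the SparkSession is created.

```python
import os
import sys

# Assumption: the worker could not start because Spark picked a different
# (or missing) Python executable than the one running this notebook.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Show which interpreter Spark will be told to use.
print(os.environ["PYSPARK_PYTHON"])
```

These lines would need to run before `SparkSession.builder...getOrCreate()`; if a session already exists, restarting the kernel first is likely necessary for the setting to take effect.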
