In PySpark, the Row class is imported from pyspark.sql and represents a single record (row) in a DataFrame. You can create a Row object with positional or named arguments, or define a custom Row-like class.

Key Points of Row Class:

    Before Spark 3.0, a Row created with named arguments had its fields sorted alphabetically by name.
    Since Spark 3.0, Rows created from named arguments are no longer sorted; the fields keep the order in which they were entered.
    To restore sorting by name, set the environment variable PYSPARK_ROW_FIELD_SORTING_ENABLED to true for both the driver and the executors.
    The Row class also provides a way to create a struct-type column.

In [1]:
## 1. Create a Row Object

from pyspark.sql import Row

row = Row("Sourav","Data Engineer")

print(f"My Name is {row[0]} and I am a {row[1]}")

My Name is Sourav and I am a Data Engineer


In [5]:
## Alternatively, you can create a Row with named arguments. The benefit of named arguments is that you can access a value by field name, e.g. row1.name.


row1 = Row(name="Sourav",skill="Data Engineer")


print(f"My Name is {row1.name} and I am a {row1.skill}")

My Name is Sourav and I am a Data Engineer


### 2. Create Custom Class from Row

We can also create a Row-like class, for example "Person", and use it the same way as a Row object. This is helpful when you want several records that share the same fields and want to refer to their properties by name. In the example below, we create a Person class and use it like Row.


In [7]:
Person = Row("name", "age")
p1=Person("James", 40)
p2=Person("Alice", 35)
print(p1.name +","+p2.name)

James,Alice


### 3. Using Row class on PySpark RDD

We can use the Row class with a PySpark RDD. When you create an RDD from Row objects, collecting the data returns the results as Row objects.

In [18]:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data1 = [Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
    Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
    Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")]
rdd=spark.sparkContext.parallelize(data1)
print(rdd.collect())

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'), Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'), Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]


In [11]:

collData=rdd.collect()
for row in collData:
    print(row.name + "," +str(row.lang))

James,,Smith,['Java', 'Scala', 'C++']
Michael,Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


In [12]:
## Alternatively, you can do the same by creating a Row-like class "Person"

Person=Row("name","lang","state")
data = [Person("James,,Smith",["Java","Scala","C++"],"CA"), 
    Person("Michael,Rose,",["Spark","Java","C++"],"NJ"),
    Person("Robert,,Williams",["CSharp","VB"],"NV")]
rdd=spark.sparkContext.parallelize(data)

collData=rdd.collect()
for row in collData:
    print(row.name + "," +str(row.lang))

James,,Smith,['Java', 'Scala', 'C++']
Michael,Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


In [16]:
## Create an RDD using Row without named arguments (values are then accessed by index)

data = [Row("James,,Smith",["Java","Scala","C++"],"CA"), 
    Row("Michael,Rose,",["Spark","Java","C++"],"NJ"),
    Row("Robert,,Williams",["CSharp","VB"],"NV")]

rdd=spark.sparkContext.parallelize(data)
collData=rdd.collect()

for row in collData:
    print(row[0]+ "," +str(row[1]))

James,,Smith,['Java', 'Scala', 'C++']
Michael,Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


### 4. Using Row class on PySpark DataFrame


Similarly, the Row class can be used with a PySpark DataFrame. By default, each record in a DataFrame is represented as a Row. To demonstrate, I will use the same data (data1) that was created for the RDD example.

In [19]:
df=spark.createDataFrame(data1)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- lang: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- state: string (nullable = true)

+----------------+------------------+-----+
|            name|              lang|state|
+----------------+------------------+-----+
|    James,,Smith|[Java, Scala, C++]|   CA|
|   Michael,Rose,|[Spark, Java, C++]|   NJ|
|Robert,,Williams|      [CSharp, VB]|   NV|
+----------------+------------------+-----+



In [20]:
## You can also change the column names by using the toDF() function.
## Here data holds Rows without field names, so toDF() supplies the column names.

columns = ["name","languagesAtSchool","currentState"]
df=spark.createDataFrame(data).toDF(*columns)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)



### 5. Create Nested Struct Using Row Class
The example below creates a struct-type column using the Row class. Alternatively, you can create a struct type by providing an explicit schema built from PySpark StructType and StructField.

In [25]:
person = Row("name","prop")
props = Row("hair","eye")   # named props to avoid shadowing Python's built-in property

data3 = [ person("James",props("black","blue")),person("Sourav",props("black","black")) ]

df1=spark.createDataFrame(data3)
df1.printSchema()    

df1.show()

root
 |-- name: string (nullable = true)
 |-- prop: struct (nullable = true)
 |    |-- hair: string (nullable = true)
 |    |-- eye: string (nullable = true)

+------+--------------+
|  name|          prop|
+------+--------------+
| James| {black, blue}|
|Sourav|{black, black}|
+------+--------------+

