In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PySparkLearning').getOrCreate()

### Create a Row Object
The `Row` class extends `tuple`, so it accepts a variable number of arguments. Use `Row()` to create a row object; once it is created, you can retrieve its values by index, just as with a tuple.



In [2]:
from pyspark.sql import Row

row = Row("James", 40)
print(row[0] + "," + str(row[1]))

James,40


Alternatively, you can create a `Row` with named arguments. The benefit is that you can then access each value by its field name.

In [3]:
row = Row(name='Sandeep', age=24)

print(row.name + " " + str(row.age))

Sandeep 24


### Create custom class from Row

We can also create a Row-like class, for example `Person`, and use it the same way as a `Row` object. This is helpful when you want to create many objects that share the same fields and refer to their properties by name. In the example below, we create a `Person` class and use it like `Row`.

In [4]:
Person = Row("name", "age")

p1 = Person('Sandeep', 24)
print(p1.name + " -> " + str(p1.age))

Sandeep -> 24


### Using Row class on PySpark RDD

We can use the `Row` class with a PySpark RDD. When you create an RDD from `Row` objects, collecting the data returns the results as `Row` objects.

In [5]:
data = [
        Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
        Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
        Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")
       ]

rdd=spark.sparkContext.parallelize(data)
print(rdd.collect())

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'), Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'), Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]


In [6]:
for row in rdd.collect():
    print(row.name + "," +str(row.lang))

James,,Smith,['Java', 'Scala', 'C++']
Michael,Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


Alternatively, you can do the same by creating a Row-like class, `Person`:

In [7]:
Person=Row("name","lang","state")

data = [
        Person("James,,Smith",["Java","Scala","C++"],"CA"), 
        Person("Michael,Rose,",["Spark","Java","C++"],"NJ"),
        Person("Robert,,Williams",["CSharp","VB"],"NV")
    ]

rdd = spark.sparkContext.parallelize(data)
print(rdd.collect())

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'), Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'), Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]


### Using Row class on PySpark DataFrame

Similarly, the `Row` class can be used with a PySpark DataFrame. By default, each record in a DataFrame is represented as a `Row`.

In [8]:
data = [
        Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
        Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
        Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")
       ]

df = spark.createDataFrame(data)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- lang: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- state: string (nullable = true)

+----------------+------------------+-----+
|            name|              lang|state|
+----------------+------------------+-----+
|    James,,Smith|[Java, Scala, C++]|   CA|
|   Michael,Rose,|[Spark, Java, C++]|   NJ|
|Robert,,Williams|      [CSharp, VB]|   NV|
+----------------+------------------+-----+



You can also change the column names by using the `toDF()` function:

In [9]:
columns = ["name", "prog_lang", "current_state"]
df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- prog_lang: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- current_state: string (nullable = true)

