## Row and Column objects



In [1]:
from pyspark.sql import SparkSession

In [2]:
spark1 = SparkSession.builder.appName('row_col').getOrCreate()

### Column objects

In [3]:
df = spark1.read.json('file:///home/erin/Downloads/spark-3.0.1-bin-hadoop2.7/SparkFolder/spark/Data/people.json')

### What is the type of a single column?

In [4]:
type(df['age'])

pyspark.sql.column.Column

### Extract a single column as a DataFrame with select()

In [5]:
df.select('age')

DataFrame[age: bigint]

In [6]:
df.select('age').show()

+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



### Row objects

In [7]:
df.head(2)

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]

In [8]:
df.head(2)[0]

Row(age=None, name='Michael')

In [9]:
row0 = df.head(2)[0]

In [10]:
type(row0)

pyspark.sql.types.Row

### A Row object has a very useful asDict() method

In [11]:
row0.asDict()

{'age': None, 'name': 'Michael'}

Remember that in a pandas DataFrame, both a column and a row are represented by a pandas.Series object.
Spark offers separate Column and Row objects because it is designed to work over distributed data, where this distinction comes in handy.

### Create a new columns (after some processing of existing columns)

In [12]:
# Pandas-style item assignment does not work here; the following raises an error
df['newage'] = 2*df['age']

TypeError: 'DataFrame' object does not support item assignment

### Use the withColumn() method instead

In [13]:
df.withColumn('double_age',df['age']*2).show()

+----+-------+----------+
| age|   name|double_age|
+----+-------+----------+
|null|Michael|      null|
|  30|   Andy|        60|
|  19| Justin|        38|
+----+-------+----------+



### To only rename a column, use the withColumnRenamed() method

In [14]:
df.withColumnRenamed('age','my_new_age').show()

+----------+-------+
|my_new_age|   name|
+----------+-------+
|      null|Michael|
|        30|   Andy|
|        19| Justin|
+----------+-------+



### You can perform operations on multiple columns, like a vector sum

In [15]:
df2 = df.withColumn('half_age',df['age']/2)

In [16]:
df2.show()

+----+-------+--------+
| age|   name|half_age|
+----+-------+--------+
|null|Michael|    null|
|  30|   Andy|    15.0|
|  19| Justin|     9.5|
+----+-------+--------+



In [17]:
df2 = df2.withColumn('new_age', df2['age'] + df2['half_age'])
df2.show()

+----+-------+--------+-------+
| age|   name|half_age|new_age|
+----+-------+--------+-------+
|null|Michael|    null|   null|
|  30|   Andy|    15.0|   45.0|
|  19| Justin|     9.5|   28.5|
+----+-------+--------+-------+



Now if you print the schema, you will see that the data types of half_age and new_age are automatically set to double (due to the floating-point operations performed).

In [18]:
df2.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- half_age: double (nullable = true)
 |-- new_age: double (nullable = true)



A DataFrame is immutable, and there is no inplace option as in pandas, so the original DataFrame has not changed.

In [19]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

