# withColumn() Function

PySpark withColumn() is a DataFrame transformation used to change the values of an existing column, convert its data type, or create a new column.

withColumn() takes two arguments: the first is the column name, and the second is a Column expression for the new or modified column.

Let's look at how we can change the values of a column using this function.

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark=SparkSession.builder.appName("Cols").master("local[*]").getOrCreate()

In [2]:
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]

df= spark.createDataFrame(data,columns)
df.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



In [5]:
# Let's modify the data in a column

df1=df.withColumn("salary",df.salary*10) # can also be written as col("salary")*10
df1.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M| 30000|
|  Michael|      Rose|        |2000-05-19|     M| 40000|
|   Robert|          |Williams|1978-09-05|     M| 40000|
|    Maria|      Anne|   Jones|1967-12-01|     F| 40000|
|      Jen|      Mary|   Brown|1980-02-17|     F|   -10|
+---------+----------+--------+----------+------+------+



In [12]:
# Let's change the data type of a column
df1.printSchema()

# To change the data type we use the cast method, which is available on Column objects
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
df1.withColumn("salary",df1.salary.cast("Integer")).printSchema()
df1.withColumn("salary",col("salary").cast(StringType())).printSchema()

# We can refer to columns using dataframe.colname or col('colname')

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: string (nullable = true)



In [14]:
# Now let's look at creating a new column
# We can create a new column either by deriving it from an existing column or by using a constant value across the column with the lit function.
from pyspark.sql.functions import lit
df3= df1.withColumn("NewSalary",col('salary')-10)
df3.show()

df4= df1.withColumn("Country", lit("India"))
df4.show()

+---------+----------+--------+----------+------+------+---------+
|firstname|middlename|lastname|       dob|gender|salary|NewSalary|
+---------+----------+--------+----------+------+------+---------+
|    James|          |   Smith|1991-04-01|     M| 30000|    29990|
|  Michael|      Rose|        |2000-05-19|     M| 40000|    39990|
|   Robert|          |Williams|1978-09-05|     M| 40000|    39990|
|    Maria|      Anne|   Jones|1967-12-01|     F| 40000|    39990|
|      Jen|      Mary|   Brown|1980-02-17|     F|   -10|      -20|
+---------+----------+--------+----------+------+------+---------+

+---------+----------+--------+----------+------+------+-------+
|firstname|middlename|lastname|       dob|gender|salary|Country|
+---------+----------+--------+----------+------+------+-------+
|    James|          |   Smith|1991-04-01|     M| 30000|  India|
|  Michael|      Rose|        |2000-05-19|     M| 40000|  India|
|   Robert|          |Williams|1978-09-05|     M| 40000|  India|
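|    Maria|      Anne|   Jones|1967-12-01|     F| 40000|  India|
|      Jen|      Mary|   Brown|1980-02-17|     F|   -10|  India|
+---------+----------+--------+----------+------+------+-------+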

We saw that withColumn() can both modify existing columns and create new ones. The function checks the column name passed as the first argument: if a column with that name already exists in the DataFrame, it is replaced by the result of the given expression; otherwise a new column is added with that expression.

# withColumnRenamed() Function

This function takes two parameters: the first is the existing column name and the second is the new column name.

In [16]:
df1.withColumnRenamed("dob","Date of Birth").show()
df1.withColumnRenamed(df1.dob,"Date of Birth").show() # this fails: the existing column must be given as a name string, not a Column object

+---------+----------+--------+-------------+------+------+
|firstname|middlename|lastname|Date of Birth|gender|salary|
+---------+----------+--------+-------------+------+------+
|    James|          |   Smith|   1991-04-01|     M| 30000|
|  Michael|      Rose|        |   2000-05-19|     M| 40000|
|   Robert|          |Williams|   1978-09-05|     M| 40000|
|    Maria|      Anne|   Jones|   1967-12-01|     F| 40000|
|      Jen|      Mary|   Brown|   1980-02-17|     F|   -10|
+---------+----------+--------+-------------+------+------+



PySparkTypeError: [NOT_ITERABLE] Column is not iterable.