# PySpark `withColumn()` usage by Aishwarya Raut

PySpark `withColumn()` is a DataFrame transformation used to change the values of an existing column, convert its data type, or create a new column.

In [1]:
from pyspark.sql import SparkSession

# Sample rows: first name, middle name, last name, date of birth, gender, salary
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]

# Create (or reuse) a Spark session and build the DataFrame
spark = SparkSession.builder.appName('SP').getOrCreate()
df = spark.createDataFrame(data=data, schema=columns)

In [4]:
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



# 1. Change DataType

In [6]:
from pyspark.sql.functions import col
a = df.withColumn("salary", col("salary").cast("Integer"))
a.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



# 2. Update The Value of an Existing Column

`withColumn()` can also change the value of an existing column. Pass the existing column name as the first argument and the new value as the second argument. The second argument must be a `Column` expression, not a plain Python value.

In [7]:
df.withColumn("salary",col("salary")*100)

DataFrame[firstname: string, middlename: string, lastname: string, dob: string, gender: string, salary: bigint]

# 3. Create a Column from an Existing Column

In [10]:
a=df.withColumn("new_column",col("salary")*-1)
a.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- new_column: long (nullable = true)



# 4. Add a New Column using withColumn()

In [None]:
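from pyspark.sql.functions import lit

# Add a constant column
df.withColumn("Country", lit("USA")).show()

# Chain several withColumn() calls; the outer parentheses allow the line breaks
(df.withColumn("country", lit("USA"))
    .withColumn("anotherColumn", lit("anothervalue"))
    .show())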

# 5. Rename Column Name

In [11]:
a=df.withColumnRenamed("gender","sex")
a.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- salary: long (nullable = true)



# 6. Drop Column From PySpark DataFrame

In [12]:
a=df.drop("salary")
a.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)



Note: All of these functions return a new DataFrame with the change applied; they do not modify the original DataFrame.