#### *withColumn()*
PySpark's withColumn() is a DataFrame transformation that returns a new DataFrame with one column added or replaced. It is commonly used to change a column's values, convert the datatype of an existing column, or create a new column. This notebook walks through these common DataFrame column operations using withColumn() examples.

In [0]:
# Creating Dataframe

data = [
    ("James", "Developer", 3000),
    ("Michael", "Testing", 2000),
    ("Tony", "Developer", 4000),
    ("Parker", "Support", 3000),
    ("Stephen", "Developer", 4000),
    ("Clark", "Support", 3500),
    ("Bruce", "Testing", 3000),
    ("Allen", "Developer", 3500),
    ("Loki", "Support", 3000),
    ("Buttowski", "Developer", 5000),
]

schema = ["Name", "Role", "Salary"]

In [0]:
df = spark.createDataFrame(data, schema)

In [0]:
df.display()

Name,Role,Salary
James,Developer,3000
Michael,Testing,2000
Tony,Developer,4000
Parker,Support,3000
Stephen,Developer,4000
Clark,Support,3500
Bruce,Testing,3000
Allen,Developer,3500
Loki,Support,3000
Buttowski,Developer,5000


In [0]:
# cast() is called as a Column method below, so it is not imported here.
from pyspark.sql.functions import col, lit

#### *Converting the datatype of a column. The schema printed before the change shows the 'Salary' column as 'long'; after cast("integer") it is 'integer'.*

In [0]:
df.printSchema()
df1 = df.withColumn("Salary", col("Salary").cast("integer"))
df1.display()
df1.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Role: string (nullable = true)
 |-- Salary: long (nullable = true)



Name,Role,Salary
James,Developer,3000
Michael,Testing,2000
Tony,Developer,4000
Parker,Support,3000
Stephen,Developer,4000
Clark,Support,3500
Bruce,Testing,3000
Allen,Developer,3500
Loki,Support,3000
Buttowski,Developer,5000


root
 |-- Name: string (nullable = true)
 |-- Role: string (nullable = true)
 |-- Salary: integer (nullable = true)



#### *The withColumn() function can also be used to change the values of an existing column. To do so, pass the existing column name as the first argument and the expression to assign as the second argument. Note that the second argument must be of Column type.*

In [0]:
# Updating a column value in the table.

df.withColumn("Salary", col("Salary") * 2).display()

Name,Role,Salary
James,Developer,6000
Michael,Testing,4000
Tony,Developer,8000
Parker,Support,6000
Stephen,Developer,8000
Clark,Support,7000
Bruce,Testing,6000
Allen,Developer,7000
Loki,Support,6000
Buttowski,Developer,10000


#### *To create a new column, pass the new column name as the first argument of the withColumn() transformation function. Make sure the new column is not already present on the DataFrame; if it is, withColumn() updates the values of that column instead of adding one.*

In [0]:
# Creating a new column in the table.

df2 = df.withColumn("Country", lit("India"))
df2.display()

Name,Role,Salary,Country
James,Developer,3000,India
Michael,Testing,2000,India
Tony,Developer,4000,India
Parker,Support,3000,India
Stephen,Developer,4000,India
Clark,Support,3500,India
Bruce,Testing,3000,India
Allen,Developer,3500,India
Loki,Support,3000,India
Buttowski,Developer,5000,India


#### *Deriving a new column from an existing column in the table with withColumn().*


In [0]:
df2.withColumn("Bonus",col("Salary")/10).display()

Name,Role,Salary,Country,Bonus
James,Developer,3000,India,300.0
Michael,Testing,2000,India,200.0
Tony,Developer,4000,India,400.0
Parker,Support,3000,India,300.0
Stephen,Developer,4000,India,400.0
Clark,Support,3500,India,350.0
Bruce,Testing,3000,India,300.0
Allen,Developer,3500,India,350.0
Loki,Support,3000,India,300.0
Buttowski,Developer,5000,India,500.0


#### *To rename an existing column, use the withColumnRenamed() function on the DataFrame. Here the 'Country' column is renamed to 'Nation'.*

In [0]:
df2.withColumnRenamed("Country","Nation").display()

Name,Role,Salary,Nation
James,Developer,3000,India
Michael,Testing,2000,India
Tony,Developer,4000,India
Parker,Support,3000,India
Stephen,Developer,4000,India
Clark,Support,3500,India
Bruce,Testing,3000,India
Allen,Developer,3500,India
Loki,Support,3000,India
Buttowski,Developer,5000,India


#### *The drop() function is used to remove a particular column or a group of columns from the table.*

In [0]:
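# Dropping one column.
# display() returns None, so keep the DataFrame in df3 and display it separately.

df.limit(5).display()
df3 = df.drop("Salary")
df3.limit(5).display()

# Dropping multiple columns.
# Column names as strings work across PySpark versions.

df2.limit(5).display()
df4 = df2.drop("Role", "Country")
df4.limit(5).display()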

Name,Role,Salary
James,Developer,3000
Michael,Testing,2000
Tony,Developer,4000
Parker,Support,3000
Stephen,Developer,4000


Name,Role
James,Developer
Michael,Testing
Tony,Developer
Parker,Support
Stephen,Developer


Name,Role,Salary,Country
James,Developer,3000,India
Michael,Testing,2000,India
Tony,Developer,4000,India
Parker,Support,3000,India
Stephen,Developer,4000,India


Name,Salary
James,3000
Michael,2000
Tony,4000
Parker,3000
Stephen,4000
