
* We can rename a column or expression using **alias** as part of **select**.
* We can add a column, or give a name to an expression, using **withColumn** on a DataFrame.
* We can rename one column at a time using **withColumnRenamed** on a DataFrame.
* We typically use **withColumn** to apply a row-level transformation and name the result. If the name matches an existing column, that column is replaced with the new one.
* If we only want to rename a column, it is better to use **withColumnRenamed**.
* If we want to apply a transformation, we need to use either **select** or **withColumn**.
* We can rename a bunch of columns at once using **toDF**, passing a new name for every column.


In [0]:
%run "/Users/surajthallapalli@outlook.com/02 Selecting and Renaming Columns in Spark Dataframe/Creating Spark Dataframe 2023-07-11 20:32:19"



+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord| cvandenoor@etsy.com|{+91 8645879087, ...| [1, 2]|       true|    1000.55|   2021-01-15|2021-02-10 01:15:00|
|  2|      John|        Cena|       john@cena.com|{+91 9886879087, ...| [3, 4]|       true|      900.0|   2022-05-15|2024-03-15 01:16:00|
|  3|     James|        Bond|      james@bond.com|{+91 3245879087, ...|    [2]|      false|      750.6|   2023-01-12|2018-05-05 05:17:02|
|  4|    Robert|      Dowrey|   robert@dowrey.com|                null|     []|       true|        NaN|         null|2019-04-03 08:14:08|
|  5|     Chris|  Hemmsworth|chris

Out[7]: [('id', 'bigint'),
 ('first_name', 'string'),
 ('last_name', 'string'),
 ('email', 'string'),
 ('phone_numbers', 'struct<mobile:string,home:string>'),
 ('courses', 'array<bigint>'),
 ('is_customer', 'boolean'),
 ('amount_paid', 'double'),
 ('customer_from', 'date'),
 ('last_updated_ts', 'timestamp')]

In [0]:
users_df.select('id','first_name','last_name').show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|      John|        Cena|
|  3|     James|        Bond|
|  4|    Robert|      Dowrey|
|  5|     Chris|  Hemmsworth|
+---+----------+------------+



In [0]:
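# Import only the functions used below; a wildcard import would shadow Python builtins such as sum, min and max
from pyspark.sql.functions import col, concat, lit, size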

In [0]:
users_df.select(
    "id",
    "first_name",
    "last_name",
    concat("first_name", lit(","), "last_name").alias("full_name")
).show()

+---+----------+------------+-------------------+
| id|first_name|   last_name|          full_name|
+---+----------+------------+-------------------+
|  1|    Corrie|Van den Oord|Corrie,Van den Oord|
|  2|      John|        Cena|          John,Cena|
|  3|     James|        Bond|         James,Bond|
|  4|    Robert|      Dowrey|      Robert,Dowrey|
|  5|     Chris|  Hemmsworth|   Chris,Hemmsworth|
+---+----------+------------+-------------------+



In [0]:
users_df.select("id", "first_name", "last_name").withColumn("full_name", concat("first_name", lit(","), "last_name")).show()

+---+----------+------------+-------------------+
| id|first_name|   last_name|          full_name|
+---+----------+------------+-------------------+
|  1|    Corrie|Van den Oord|Corrie,Van den Oord|
|  2|      John|        Cena|          John,Cena|
|  3|     James|        Bond|         James,Bond|
|  4|    Robert|      Dowrey|      Robert,Dowrey|
|  5|     Chris|  Hemmsworth|   Chris,Hemmsworth|
+---+----------+------------+-------------------+



In [0]:
# alias("fn") has no effect here: withColumn names the column from its first argument
users_df.select("id", "first_name", "last_name").withColumn("full_name", concat("first_name", lit(","), "last_name").alias("fn")).show()

+---+----------+------------+-------------------+
| id|first_name|   last_name|          full_name|
+---+----------+------------+-------------------+
|  1|    Corrie|Van den Oord|Corrie,Van den Oord|
|  2|      John|        Cena|          John,Cena|
|  3|     James|        Bond|         James,Bond|
|  4|    Robert|      Dowrey|      Robert,Dowrey|
|  5|     Chris|  Hemmsworth|   Chris,Hemmsworth|
+---+----------+------------+-------------------+



In [0]:
# Reusing an existing name ("first_name") replaces that column; alias("fn") is again ignored
users_df.select("id", "first_name", "last_name").withColumn("first_name", concat("first_name", lit(","), "last_name").alias("fn")).show()

+---+-------------------+------------+
| id|         first_name|   last_name|
+---+-------------------+------------+
|  1|Corrie,Van den Oord|Van den Oord|
|  2|          John,Cena|        Cena|
|  3|         James,Bond|        Bond|
|  4|      Robert,Dowrey|      Dowrey|
|  5|   Chris,Hemmsworth|  Hemmsworth|
+---+-------------------+------------+



In [0]:
users_df.select('id','first_name','last_name').withColumn('fn', col('first_name')).show()

+---+----------+------------+------+
| id|first_name|   last_name|    fn|
+---+----------+------------+------+
|  1|    Corrie|Van den Oord|Corrie|
|  2|      John|        Cena|  John|
|  3|     James|        Bond| James|
|  4|    Robert|      Dowrey|Robert|
|  5|     Chris|  Hemmsworth| Chris|
+---+----------+------------+------+



In [0]:
users_df.select('id','first_name','last_name').withColumn('fn', users_df['first_name']).show()

+---+----------+------------+------+
| id|first_name|   last_name|    fn|
+---+----------+------------+------+
|  1|    Corrie|Van den Oord|Corrie|
|  2|      John|        Cena|  John|
|  3|     James|        Bond| James|
|  4|    Robert|      Dowrey|Robert|
|  5|     Chris|  Hemmsworth| Chris|
+---+----------+------------+------+




* Find the number of courses that each id has taken

In [0]:
users_df.select('id','courses').show()

+---+-------+
| id|courses|
+---+-------+
|  1| [1, 2]|
|  2| [3, 4]|
|  3|    [2]|
|  4|     []|
|  5|     []|
+---+-------+



In [0]:
users_df.select('id','courses').withColumn('course_count', size('courses')).show()

+---+-------+------------+
| id|courses|course_count|
+---+-------+------------+
|  1| [1, 2]|           2|
|  2| [3, 4]|           2|
|  3|    [2]|           1|
|  4|     []|           0|
|  5|     []|           0|
+---+-------+------------+

