##### Some screenshots from the lecture video
![today's topic](https://drive.google.com/uc?id=1a0WF6uKcBRZ8_FFIOTkUdmVpoppqfvlY)
![Schema](https://drive.google.com/uc?id=1B3Rh-PASPrFFv65yZSlU9hqiM9MVx3Zb)
![Row_object](https://drive.google.com/uc?id=18B_1slKjgzpLfl3tKKp7kxolmPFKyEWU)
![Row_creation](https://drive.google.com/uc?id=13eh_amnLTGDcMTXuDXxH9A8mwfR81LAc)
![columns](https://drive.google.com/uc?id=1lmu2dONNVFr4nK35kLv8SjzROyUgz8e_)

In [None]:
employee_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("mode", "PERMISSIVE")
    .load("/FileStore/tables/employee_details.csv")
)
employee_df.show()

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|
|  3|Pratisha| 17| 20000|     Kolkata|   India|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|
|  5|  Vikash| 31| 30000|        null|nominee5|
+---+--------+---+------+------------+--------+



In [None]:
employee_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



Print the list of columns

In [None]:
employee_df.columns

Out[5]: ['id', 'name', 'age', 'salary', 'address', 'nominee']

## Row Object
A Row object in PySpark represents a single record in a DataFrame or an element in an RDD.
It is an ordered collection of fields that can be accessed by name or by position, with positions starting at index 0.

We can create a Row object by calling the Row class with keyword arguments.
For example, the following code creates a Row object with three fields:

In [None]:
from pyspark.sql import Row

row = Row(name="Alice", age=25, city="London")

# We can access the fields of a Row object using the attribute syntax.
# For example, the following code prints the name of the person in the Row object:
print(row.name)
# We can also access the fields of a Row object using the index syntax. 
# For example, the following code prints the age of the person in the Row object:
print(row[1])

Alice
25


Row objects can be used in a variety of ways in PySpark.<br>
For example, you can use them to create new DataFrames, and they are the objects you work with when filtering or aggregating records that come from an RDD or from a collected DataFrame.

In [None]:
# Creating dataframe using Row object
from pyspark.sql import Row
rows = [
    Row(id=1, Name="Ramesh", Age=24, Salary=50000, Address="India", Nominee="Santosh"),
    Row(id=2, Name="Suresh", Age=30, Salary=500000, Address="Canada", Nominee="Sailesh"),
]
df = spark.createDataFrame(rows)
df.show()


+---+------+---+------+-------+-------+
| id|  Name|Age|Salary|Address|Nominee|
+---+------+---+------+-------+-------+
|  1|Ramesh| 24| 50000|  India|Santosh|
|  2|Suresh| 30|500000| Canada|Sailesh|
+---+------+---+------+-------+-------+



## Different ways of selecting columns in DataFrame

#####  Using the string form in the select method

In [None]:
# Using String form
employee_df.select("name").show()

+--------+
|    name|
+--------+
|  Soumya|
| Jyotsna|
|Pratisha|
|  Pritam|
|  Vikash|
+--------+



#####  Using the col method within the select method

In [None]:
from pyspark.sql.functions import col
employee_df.select(col("name")).show()

+--------+
|    name|
+--------+
|  Soumya|
| Jyotsna|
|Pratisha|
|  Pritam|
|  Vikash|
+--------+



In [None]:
# Why use the col method?
# Problem with the string form: select treats the whole string as a column name.
# For example, if we try to add 5 to the id column using the string form below, it raises an AnalysisException.
employee_df.select("id + 5").show()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-2659229039539279> in <cell line: 4>()
      2 # Issue in selecting column using string form:
      3 # Suppose my requirement is to add 5 to the id column in the dataframe then and if we use string form in the below way it will lead an analysis exception.
----> 4 employee_df.select("id + 5").show()

/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter

In [None]:
# col method resolves the above issue
employee_df.select(col("id")+5).show()

+--------+
|(id + 5)|
+--------+
|       6|
|       7|
|       8|
|       9|
|      10|
+--------+



In [None]:
from pyspark.sql.functions import expr
# The expr function offers another way to resolve this issue.
# It takes a string argument, parses it as a SQL expression, and returns a PySpark Column.
# Earlier, employee_df.select("id + 5").show() raised an AnalysisException because PySpark
# looked for a column literally named "id + 5" (a plain string passed to select is treated as a column name).
# With expr, the same string is parsed as an expression, and the resulting Column is a valid argument for select.
employee_df.select(expr("id + 5")).show()

+--------+
|(id + 5)|
+--------+
|       6|
|       7|
|       8|
|       9|
|      10|
+--------+



#####  Using the [] operator

Note: Selecting columns from a DataFrame this way is handy when joining tables, for example<br>
when the join keys are named differently in each table or when both tables share a column name. It lets us state explicitly which table each column comes from <br> during the join


In [None]:
employee_df.select(employee_df["salary"]).show()

+------+
|salary|
+------+
| 15000|
| 19000|
| 20000|
|100000|
| 30000|
+------+



#####  Using .(dot)
This is also useful in the same scenario, i.e. during a join operation.

In [None]:
employee_df.select(employee_df.address).show()

+------------+
|     address|
+------------+
|      Odisha|
|      Mumbai|
|     Kolkata|
|Uttarpradesh|
|        null|
+------------+



### Selecting Multiple Columns

In [None]:
employee_df.select("id","name","salary").show()
employee_df.select(col("id"),col("name")).show()
employee_df.select(employee_df["id"],employee_df["address"]).show()
employee_df.select(employee_df.name,employee_df.nominee).show()

+---+--------+------+
| id|    name|salary|
+---+--------+------+
|  1|  Soumya| 15000|
|  2| Jyotsna| 19000|
|  3|Pratisha| 20000|
|  4|  Pritam|100000|
|  5|  Vikash| 30000|
+---+--------+------+

+---+--------+
| id|    name|
+---+--------+
|  1|  Soumya|
|  2| Jyotsna|
|  3|Pratisha|
|  4|  Pritam|
|  5|  Vikash|
+---+--------+

+---+------------+
| id|     address|
+---+------------+
|  1|      Odisha|
|  2|      Mumbai|
|  3|     Kolkata|
|  4|Uttarpradesh|
|  5|        null|
+---+------------+

+--------+--------+
|    name| nominee|
+--------+--------+
|  Soumya|nominee1|
| Jyotsna|nominee2|
|Pratisha|   India|
|  Pritam|   India|
|  Vikash|nominee5|
+--------+--------+



In [None]:
# All ways of selecting columns in a single statement
# Note: These forms can be used interchangeably, whichever is most convenient
employee_df.select("id",col("name"),employee_df["age"],employee_df.salary).show(truncate=False)

+---+--------+---+------+
|id |name    |age|salary|
+---+--------+---+------+
|1  |Soumya  |23 |15000 |
|2  |Jyotsna |23 |19000 |
|3  |Pratisha|17 |20000 |
|4  |Pritam  |22 |100000|
|5  |Vikash  |31 |30000 |
+---+--------+---+------+



In [None]:
# Selecting all the columns in dataframe
employee_df.select("*").show()

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|
|  3|Pratisha| 17| 20000|     Kolkata|   India|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|
|  5|  Vikash| 31| 30000|        null|nominee5|
+---+--------+---+------+------------+--------+



#### Aliasing and concatenation using expr

In [None]:
employee_df.select(expr("id as employee_id"),expr("name as employee_name"),expr("concat(name,nominee)")).show()

+-----------+-------------+---------------------+
|employee_id|employee_name|concat(name, nominee)|
+-----------+-------------+---------------------+
|          1|       Soumya|       Soumyanominee1|
|          2|      Jyotsna|      Jyotsnanominee2|
|          3|     Pratisha|        PratishaIndia|
|          4|       Pritam|          PritamIndia|
|          5|       Vikash|       Vikashnominee5|
+-----------+-------------+---------------------+



## SparkSQL

In [None]:
# To query the DataFrame with Spark SQL, we first need to register it as a table or view
employee_df.createOrReplaceTempView("employee_tbl")

In [None]:
spark.sql(
"""
SELECT * FROM employee_tbl
"""
).show()

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Soumya| 23| 15000|      Odisha|nominee1|
|  2| Jyotsna| 23| 19000|      Mumbai|nominee2|
|  3|Pratisha| 17| 20000|     Kolkata|   India|
|  4|  Pritam| 22|100000|Uttarpradesh|   India|
|  5|  Vikash| 31| 30000|        null|nominee5|
+---+--------+---+------+------------+--------+



In [None]:
spark.sql(
"""
SELECT id,name,salary,nominee FROM employee_tbl
"""
).show()

+---+--------+------+--------+
| id|    name|salary| nominee|
+---+--------+------+--------+
|  1|  Soumya| 15000|nominee1|
|  2| Jyotsna| 19000|nominee2|
|  3|Pratisha| 20000|   India|
|  4|  Pritam|100000|   India|
|  5|  Vikash| 30000|nominee5|
+---+--------+------+--------+

