Hi there,
One of our brothers suggested that an interactive lesson would be useful, so here is an attempt at one.

In this notebook, we will explain **Data frames** in detail.

Each exercise has a few bullet points covering the topic, followed by sample code.
There are some questions at the end of the notebook, which you are supposed to answer and submit as Lab Work.

In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Om_Sairam').getOrCreate()


# DataFrame
- A DataFrame is basically "a distributed collection of data grouped into named columns"
- Unlike Datasets, DataFrames are loosely typed.
- One can create a PySpark DataFrame from different data sources such as TXT, CSV, JSON, ORC, Avro, Parquet, and XML files, read from HDFS or cloud platforms
- There are multiple ways to create a DataFrame; the most generic one is `spark.read`
- In our previous assignment, we used `iris_dataset = spark.read.option("inferSchema","true").option("header","true").csv("irisdata.csv")`
    - This tells Spark to read the file, infer its schema, and treat the first line of the CSV as a header.
- Now we will try to define the schema manually.

### Schema
- A schema defines the column names and types of a DataFrame
- A schema is a `StructType` made up of a number of fields, `StructField`s, each of which has a name, a type, and a Boolean flag specifying whether that column can contain missing or null values
- One can also attach arbitrary metadata to a field in the schema.

In [None]:
from pyspark.sql.types import StructType, StructField, FloatType, StringType
#Iris DataFrame Headers: p_w;p_l;s_w;s_l;type
myManualSchema = StructType([
    StructField("p_w", FloatType(), True),
    StructField("p_l", FloatType(), True),
    StructField("s_w", FloatType(), True),
    StructField("s_l", FloatType(), True),
    StructField("type", StringType(), True, metadata={"hello":"world"})
])
#Now that the schema is defined, let's create the data frame using it
iris_df = spark.read.csv("iris_dataset.csv", schema=myManualSchema, sep=",")
#One can even create a dataframe from an RDD using the createDataFrame method.

## Columns and Expressions
- Columns in Spark are similar to columns in a spreadsheet
- A column cannot be manipulated outside the context of a DataFrame
    - To get a real value from a column, we need a `Row`, and a `Row` lives inside a `DataFrame`

In [None]:
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")
#Different ways of creating columns
#If you want to refer to a specific column of a dataframe, df
iris_df["p_w"]#Just eg.; iris_df.p_w also works
iris_df.columns#Lists all the column names

### Expressions
- An expression is a set of transformations on one or more values in a record in a DataFrame
- The `expr()` function is used to express transformations or computations involving DataFrame columns.
- If this is a bit confusing, just remember the following:
- **Columns are just expressions.**
- **Columns and transformations of those columns compile to the same logical plan as parsed expressions.**

## Records and Rows
- In Spark, each row in a DataFrame is a single record. Spark represents this record as an object of type `Row`
- Spark manipulates Row objects using column expressions in order to produce usable values
- Row objects internally represent arrays of bytes.
- This is hidden behind an abstraction, so we only ever manipulate rows through column expressions.
- It’s important to note that only DataFrames have schemas. Rows themselves do not have schemas.
- When creating a Row manually, one must specify the values in the same order as the schema of the DataFrame to which they might be appended.

In [None]:
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)


In [None]:
# So let us stitch it all together
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
    StructField("some", StringType(), True),
    StructField("col", StringType(), True),
    StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()
#Before Running the code, try guessing the output.

## Noteworthy Points
- By default, Spark is case insensitive; one can make Spark case sensitive by setting the configuration:
    - `set spark.sql.caseSensitive true`
- Sometimes we need to cast Spark columns to different data types. It can be done as follows:
    - `df.withColumn("count2", col("count").cast("long"))`
    - `withColumn` is used to create new columns.
- To rename a column, we use `df.withColumnRenamed("OLD_NAME", "new_name")`
- Take a guess at how we drop columns.
- *Remember to reduce the number of shuffle partitions from 200 to 5*

In [None]:
## Selecting

## Filtering
- To filter rows, we create an expression that evaluates to true or false.
- Rows for which the expression evaluates to false are *filtered out*
- There are two methods to perform this operation: `where` or `filter`

In [None]:
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)

## Unique
- A very common use case is to extract the unique or distinct values in a DataFrame
- We use the `distinct` method for this.
- It is a transformation, so it returns a new DataFrame containing only the unique rows.

In [None]:
df.select("col1", "col2").distinct().count()

## Random
- Sometimes, you just want to sample some random records from your DataFrame.
- This can be performed using the `sample` method on a DataFrame
- It is done as follows:

In [None]:
seed = 5#A fixed seed makes the sample reproducible.
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()

In [None]:
## Sorting
- To sort a df based on a column, one can use `sort` or `orderBy`
- To specify the sort direction explicitly, use the `asc` and `desc` functions when operating on a column

In [None]:
df.sort("col").show(5)
df.orderBy("col2", "col").show(5)
df.orderBy(col("col1"), col("col2")).show(5)#Also fine

In [None]:
from pyspark.sql.functions import asc, col, desc
df.orderBy(desc("col")).show(2)#expr("col desc") would parse "desc" as an alias and sort ascending
df.orderBy(col("col").desc(), col("col2").asc()).show(2)

In [None]:
#Lab Work
- Use the MTCars data set to answer the following questions.
1. Create the dataframe by specifying a manual schema
2. Rename all the columns to names of your liking
3. Show the distinct cars based on the number of cylinders
4. Sort the dataframe based on the mileage of the car.
5. Your friend is planning to buy a new car in a pocket-friendly manner, so assign a score to every car in your data frame
    Eg: - Create a column called `score`.
        - Come up with a formula that provides the score, say:
                - mileage is important, so 0.2 * value of mileage + 0.5 * # of cyl ... and so on
6. Just for fun, add a new Row to the data frame for Nano
 Details: Nano;Manual;25kmpl;2Cyl;
