## Overview
This notebook shows how to create a PySpark RDD and convert it into a DataFrame in several ways, both with and without an explicit schema.

#### **Contents:**

- **Create a PySpark RDD**
- **Convert PySpark RDD to DataFrame**

This is a **Python** notebook, so the default cell language is Python. You can switch languages per cell with a `%LANGUAGE` magic command: `%python`, `%scala`, `%sql` and `%r` are all supported. `%fs` is also available; it runs file system commands rather than switching to another language.

**Spark RDD Documentation Link**
- https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

#### Create a PySpark RDD
When your data is in a Python list, the whole collection sits in the memory of the PySpark driver. Creating an RDD from it with `parallelize()` distributes the elements across partitions so that they can be processed in parallel.

In [0]:
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

print(rdd.collect())

[('Finance', 10), ('Marketing', 20), ('Sales', 30), ('IT', 40)]


#### Convert PySpark RDD to DataFrame
A PySpark RDD can be converted to a DataFrame with either `toDF()` or `createDataFrame()`.

In [0]:
# RDD.toDF() converts an RDD into a DataFrame; without arguments the columns are named _1, _2, ...
df = rdd.toDF()
df.printSchema()
df.show(truncate=False)

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|_1       |_2 |
+---------+---+
|Finance  |10 |
|Marketing|20 |
|Sales    |30 |
|IT       |40 |
+---------+---+



In [0]:
deptColumns = ["dept_name","dept_id"]
df = rdd.toDF(deptColumns)
df.printSchema()
df.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



In [0]:
# SparkSession.createDataFrame() also accepts an RDD, plus an optional schema (here a list of column names).
deptDF = spark.createDataFrame(rdd, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



##### Using createDataFrame() with a StructType schema
When the schema is inferred, each column's data type is derived from the data and nullable is set to true for every column. To change this behavior, supply a schema as a `StructType`, which specifies the name, data type and nullability of each column. In the example below, `dept_id` is declared as a string, so the schema reports it as `string` rather than the inferred `long`.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# dept_id is declared as a string here, overriding the long type that inference would pick.
deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', StringType(), True)
])

deptDF1 = spark.createDataFrame(rdd, schema = deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: string (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

