### 1. Exercise
#### RDD to DataFrame
Like RDDs, DataFrames are immutable, distributed data structures in Spark. Although RDDs are Spark's fundamental data structure, working with data in a DataFrame is usually easier than working with an RDD, so it is important to understand how to convert an RDD to a DataFrame.

In this exercise, you'll first make an RDD from sample_list, a list of the tuples ('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26), where each tuple contains a person's name and age. Next, you'll create a DataFrame from the RDD and a schema (the list of column names 'Name' and 'Age'), and finally confirm that the output is a PySpark DataFrame.

Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.

#### Instructions
1. Create sample_list from the tuples ('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26).
2. Create an RDD from sample_list.
3. Create a PySpark DataFrame from the RDD and the schema.
4. Confirm that the output is a PySpark DataFrame.

In [1]:
import pyspark as sp
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create the SparkContext (or reuse the existing one)
sc = sp.SparkContext.getOrCreate()

# Create spark session
spark = SparkSession.builder.getOrCreate()

In [2]:
# Create a list of tuples
sample_list = [('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26)]

# Create an RDD from the list
rdd = sc.parallelize(sample_list)

# Create a PySpark DataFrame
names_df = spark.createDataFrame(rdd, schema=['Name', 'Age'])

# Check the type of names_df
print("The type of names_df is", type(names_df))

The type of names_df is <class 'pyspark.sql.dataframe.DataFrame'>


### 2. Exercise
#### Loading CSV into DataFrame
In the previous exercise, you saw one way to create a DataFrame, but loading data from a CSV file is generally the most common way. In this exercise, you'll create a PySpark DataFrame from the people.csv file, whose path is provided as file_path, and confirm that the created object is a PySpark DataFrame.

Remember, you already have a SparkSession spark and a file_path variable (the path to the people.csv file) available in your workspace.

#### Instructions
1. Create a DataFrame from the file_path variable, which holds the path to the people.csv file.
2. Confirm that the output is a PySpark DataFrame.

In [3]:
# Path to the people.csv file
file_path = 'data/people.csv'

# Create a DataFrame from file_path
people_df = spark.read.csv(file_path, header=True, inferSchema=True)

# Check the type of people_df
print("The type of people_df is", type(people_df))

The type of people_df is <class 'pyspark.sql.dataframe.DataFrame'>


