## 1. Create DataFrame from RDD

In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkLearning").getOrCreate()

In [3]:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

#### 1.1 Using toDF() function

The RDD `toDF()` method creates a DataFrame from an existing RDD. Because an RDD has no column names, the DataFrame gets the default names `_1` and `_2`, one for each of the two fields in our tuples.


In [4]:
rdd = spark.sparkContext.parallelize(data)

dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)



To give the columns meaningful names, pass a list of column names to `toDF()`, as shown below.

In [5]:
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)
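
Both columns are reported as `string` because the sample counts are written as quoted strings. If numeric counts are wanted, one option is to convert them in plain Python before building the RDD. The following is a minimal sketch that needs no Spark session; `typed_data` is a name introduced here for illustration.

```python
# Sample data as defined earlier: the counts are quoted, so they are Python strings.
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

# Convert the second field of every tuple to int (typed_data is an illustrative name).
typed_data = [(language, int(count)) for language, count in data]

print(typed_data)
```

Passing `typed_data` to `parallelize()` in place of `data` should let PySpark infer a numeric type (`long`) for `users_count` instead of `string`.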



#### 1.2 Using createDataFrame() from SparkSession

`createDataFrame()` from SparkSession is another way to create a DataFrame; it takes the RDD as an argument. Chain it with `toDF()` to name the columns.



In [8]:
rdd = spark.sparkContext.parallelize(data)

dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)
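
Note the difference from section 1.1: `rdd.toDF(columns)` receives the list itself, while the DataFrame method is called as `toDF(*columns)`, which unpacks the list so that each name is passed as a separate positional argument. The plain-Python sketch below illustrates only the unpacking mechanics; `name_columns` is a hypothetical stand-in function, not part of PySpark.

```python
columns = ["language", "users_count"]

def name_columns(*cols):
    """Hypothetical stand-in for a varargs API: returns the positional arguments it received."""
    return cols

# With the star, each list element becomes its own argument.
unpacked = name_columns(*columns)

# Without the star, the whole list arrives as one single argument.
not_unpacked = name_columns(columns)

print(unpacked)
print(not_unpacked)
```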



## 2. Create DataFrame from List Collection

This section shows how to create a PySpark DataFrame from a Python list. The examples mirror the RDD examples above, except that the list `data` is passed directly instead of the `rdd` object.

#### 2.1 Using createDataFrame() from SparkSession

`createDataFrame()` from SparkSession also accepts a list object as an argument. As before, chain it with `toDF()` to name the columns.



In [9]:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

In [10]:
dfFromData1 = spark.createDataFrame(data).toDF(*columns)
dfFromData1.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)



#### 2.2 Using createDataFrame() with the Row type

`createDataFrame()` has another form that takes a collection of `Row` objects together with a list of column names as the schema. To use it, first convert each tuple in `data` to a `Row`.

In [12]:
from pyspark.sql import Row

In [16]:
# Wrap each tuple in a Row; createDataFrame() then applies the column names.
rowData = map(lambda x: Row(*x), data)
dfFromData2 = spark.createDataFrame(rowData, columns)
dfFromData2.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)
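
One detail about the cell above: in Python 3, `map()` returns a lazy, single-use iterator rather than a list. The sketch below shows that behavior in plain Python, using `collections.namedtuple` as a stand-in for `Row` (an assumption made so the example runs without Spark). `Record`, `lazy_rows` and `row_list` are illustrative names.

```python
from collections import namedtuple

data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

# Stand-in for pyspark.sql.Row so the example needs no Spark installation.
Record = namedtuple("Record", ["language", "users_count"])

# map() is lazy: the first pass consumes it, a second pass yields nothing.
lazy_rows = map(lambda x: Record(*x), data)
first_pass = list(lazy_rows)
second_pass = list(lazy_rows)

# A list comprehension materializes the rows once and can be reused.
row_list = [Record(*x) for x in data]

print(len(first_pass), len(second_pass), len(row_list))
```

If the converted rows are needed again after the DataFrame is created, building them with a list comprehension such as `[Row(*x) for x in data]` avoids the exhausted-iterator surprise.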



#### 2.3 Create DataFrame with schema

To specify data types as well as column names, define a `StructType` schema first and pass it to `createDataFrame()` when creating the DataFrame.

In [17]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


In [19]:
data = [("James","","Smith","36636","M",3000),
        ("Michael","Rose","","40288","M",4000),
        ("Robert","","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
    ]

In [24]:
# Each StructField is (column name, data type, nullable).
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

In [25]:
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+
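
Because the schema fixes a type for each field, every tuple in `data` has to line up with it: six values per row, with `salary` as a Python `int`. A minimal pre-flight check in plain Python is sketched below. `expected` and `find_mismatches` are hypothetical names, and the list of `(name, type)` pairs only mirrors the `StructType` above for illustration; it is not a PySpark API.

```python
# Mirror of the schema as (field name, expected Python type) pairs (illustrative only).
expected = [
    ("firstname", str), ("middlename", str), ("lastname", str),
    ("id", str), ("gender", str), ("salary", int),
]

data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

def find_mismatches(rows, fields):
    """Return (row index, field name) pairs for rows that do not match the fields."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != len(fields):
            problems.append((i, "<wrong field count>"))
            continue
        for value, (name, py_type) in zip(row, fields):
            # None is accepted because every field in the schema is nullable.
            if value is not None and not isinstance(value, py_type):
                problems.append((i, name))
    return problems

print(find_mismatches(data, expected))
```

An empty result means every row has six values of the expected Python types; a quoted salary such as `"3000"` would be reported under `salary`.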

