# PySpark – Create DataFrame with Examples
## &copy; [Omkar Mehta](mailto:omehta2@illinois.edu) ##
### Industrial and Enterprise Systems Engineering, The Grainger College of Engineering,  UIUC ###

<hr style="border:2px solid blue">

You can create a PySpark DataFrame manually with the `toDF()` and `createDataFrame()` methods. Each accepts different arguments, so you can build a DataFrame from an existing RDD, list, or DataFrame.

In [0]:
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *

# Sample data: column names and a list of tuples
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

# Create a SparkSession, then build an RDD from the list
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

In [0]:
# Convert the RDD with toDF(); without names, the columns default to _1 and _2
dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()

In [0]:
# Pass the column names to toDF()
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()

In [0]:
# Use createDataFrame() on the RDD, then name the columns with toDF()
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)


In [0]:
# Use createDataFrame() on the list, then name the columns with toDF()
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
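
The sample counts above are strings, so Spark infers `users_count` as a string column. A minimal sketch of one way to get a numeric column instead is to convert the values in plain Python before building the DataFrame. The helper name `build_typed_df` is hypothetical, and it assumes an active `spark` session is passed in.

```python
# Same sample rows as above; note that the counts are strings.
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns = ["language", "users_count"]

# Convert the counts to int so Spark infers a numeric (long) column.
typed_data = [(language, int(count)) for language, count in data]


def build_typed_df(spark):
    """Hypothetical helper: build a DataFrame from the converted rows."""
    # createDataFrame() accepts a list of column names as its schema argument.
    return spark.createDataFrame(typed_data, columns)
```

Calling `build_typed_df(spark).printSchema()` should then report `users_count` as `long` rather than `string`.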


In [0]:
# Use createDataFrame() with Row objects
# A list comprehension builds a concrete list rather than a one-shot map iterator
rowData = [Row(*x) for x in data]
dfFromData3 = spark.createDataFrame(rowData, columns)

In [0]:
# Create DataFrame with schema
data2 = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

# Line continuations are not needed inside brackets
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)
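
As an alternative to `StructType`, `createDataFrame()` also accepts the schema as a DDL-formatted string. The sketch below builds that string for the same six columns. The names `fields`, `ddl_schema`, and `build_df_from_ddl` are hypothetical, and the helper assumes an active `spark` session and rows shaped like `data2`.

```python
# Column names and Spark SQL type names, mirroring the StructType above.
fields = [
    ("firstname", "string"),
    ("middlename", "string"),
    ("lastname", "string"),
    ("id", "string"),
    ("gender", "string"),
    ("salary", "int"),
]

# Join into a DDL string such as "firstname string, ..., salary int".
ddl_schema = ", ".join(f"{name} {dtype}" for name, dtype in fields)


def build_df_from_ddl(spark, rows):
    """Hypothetical helper: create a DataFrame using the DDL string schema."""
    return spark.createDataFrame(rows, schema=ddl_schema)
```

The DDL form is shorter to write, while `StructType` makes per-field nullability explicit.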

In [0]:
# Create a DataFrame from a CSV file
df2 = spark.read.csv("/FileStore/tables/covid_analytics_clinical_data.csv")
#df2.show()
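
By default, `spark.read.csv()` treats the first line as data, names the columns `_c0`, `_c1`, and so on, and reads every column as a string. The sketch below shows the `header` and `inferSchema` options on a small sample file. The file contents, `sample_path`, and the helper `read_csv_with_header` are hypothetical, and the temporary path assumes a local Spark session rather than a cluster file system.

```python
import csv
import os
import tempfile

# Hypothetical sample file written to a local temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["language", "users_count"])  # header row
    writer.writerows([("Java", 20000), ("Python", 100000)])


def read_csv_with_header(spark, path):
    """Hypothetical helper: read a CSV that has a header row."""
    # header=True uses the first line as column names;
    # inferSchema=True lets Spark detect column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)
```

With an active session, `read_csv_with_header(spark, sample_path).printSchema()` should list `language` and `users_count` as the column names.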

In [0]:
# Create a DataFrame from a text (TXT) file; each line becomes a row in a single "value" column
df2 = spark.read.text("/FileStore/tables/data.txt")


In [0]:
# Create a DataFrame from a JSON file
df2 = spark.read.json("/FileStore/tables/example_1.json")
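
By default, `spark.read.json()` expects one JSON object per line (JSON Lines). A file whose single record spans several lines needs the `multiLine` option. The sketch below writes both layouts to hypothetical temporary files. The record, paths, and the helper `read_multiline_json` are illustrative, and the temporary paths assume a local Spark session.

```python
import json
import os
import tempfile

# Hypothetical record and file locations, for illustration only.
record = {"language": "Python", "users_count": 100000}
json_dir = tempfile.mkdtemp()
lines_path = os.path.join(json_dir, "records.jsonl")
pretty_path = os.path.join(json_dir, "record_pretty.json")

# JSON Lines: the whole object sits on one line (the default expected layout).
with open(lines_path, "w") as f:
    f.write(json.dumps(record) + "\n")

# Pretty-printed JSON: the same object spread over several lines.
with open(pretty_path, "w") as f:
    json.dump(record, f, indent=2)


def read_multiline_json(spark, path):
    """Hypothetical helper: read a file whose JSON record spans multiple lines."""
    return spark.read.json(path, multiLine=True)
```

With an active session, `spark.read.json(lines_path)` handles the first file, while the second would be read with `read_multiline_json(spark, pretty_path)`.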
