In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark=SparkSession.builder.appName("DD").master("local[*]").getOrCreate()

A DataFrame can be created in several ways: with the createDataFrame() function, with the toDF() function, or by reading external files such as CSV, JSON, or Parquet.
A PySpark DataFrame can also be created by reading data from RDBMS and NoSQL databases.

1. Now let us see how we can create a DataFrame from an RDD.

In [24]:
# Let us define the column names and the data first.
columns = ["language","users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000)]

# Now we use the parallelize() function of the SparkContext to create an RDD.
rdd= spark.sparkContext.parallelize(data)
df = rdd.toDF()
df.show()

# Without column names, Spark falls back to _1, _2, ... Now let us pass the column names to toDF().

df1=rdd.toDF(columns)
df1.show()

# On an RDD, toDF() accepts an optional schema (a list of column names or a StructType) and an optional sampleRatio.
# When column names are given, there must be one name per column in the RDD.

+------+------+
|    _1|    _2|
+------+------+
|  Java| 20000|
|Python|100000|
| Scala|  3000|
+------+------+

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



We can also create a DataFrame from an RDD using the createDataFrame() method of the SparkSession object by passing the RDD and a schema.
The schema can be just a list of column names or a StructType object.

In [26]:
# Here only the column names are given as the schema, so Spark infers the column types.
df2= spark.createDataFrame(rdd,columns)
df2.show()
df2.printSchema()
# Let's create a new schema: a StructType consisting of StructField items.
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
schema = StructType([StructField("Language",StringType(),True),StructField("users_count",IntegerType(),True)])
df3=spark.createDataFrame(rdd,schema)
df3.show()
df3.printSchema()

# users_count, which Spark inferred as long, is now integer because the schema declares it so.
# StructType takes a list of StructField items.
# StructField takes the column name, the column type, and a nullable flag.
# If the data does not conform to the defined schema, Spark raises an error at runtime.

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+

root
 |-- language: string (nullable = true)
 |-- users_count: long (nullable = true)

+--------+-----------+
|Language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+

root
 |-- Language: string (nullable = true)
 |-- users_count: integer (nullable = true)



We can also use createDataFrame() to create a new DataFrame from a list, Row objects, etc.

In [32]:
data= [(1,"S"),(2,"D")]
cols=["ID","NAME"]

df4=spark.createDataFrame(data,cols)
df4.show()

# Now let us use Row objects; without field names the columns default to _1, _2.
from pyspark.sql import Row
data1=[Row(1,"S"),Row(2,"D")]
df5=spark.createDataFrame(data1)
df5.show()

# We can also create a Row that holds only the field names and use it as a template for the records.

person = Row("ID","Name")
data2=[person(1,"S"),person(2,"D")]
df6=spark.createDataFrame(data2)
df6.show()

+---+----+
| ID|NAME|
+---+----+
|  1|   S|
|  2|   D|
+---+----+

+---+---+
| _1| _2|
+---+---+
|  1|  S|
|  2|  D|
+---+---+

+---+----+
| ID|Name|
+---+----+
|  1|   S|
|  2|   D|
+---+----+



These are the ways to create a DataFrame manually by providing the data.
DataFrames can also be created from source files and databases, using other methods.