## Create spark dataframes using python collections & pandas dataframe

#### Index

[1. Create single column spark dataframe using list](#first) <br>
[2. Create multi column spark dataframe using list](#second) <br>
[3. Overview of Row](#third) <br>
[4. Convert list of list into spark dataframe using Row](#fourth) <br>
[5. Convert list of tuples into spark dataframe using Row](#fifth) <br>
[6. Convert list of dicts into spark dataframe using Row](#sixth) <br>
[7. Create spark dataframe using pandas dataframe](#seventh) <br>

In [10]:
# Import necessary libraries
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType
import pandas as pd
import datetime

In [2]:
# Initiate spark session
spark = SparkSession \
        .builder \
        .appName('CreateSparkDF') \
        .getOrCreate()

In [3]:
spark

### 1. Create single column spark dataframe using list <a id="first"></a>

In [16]:
age_lst = [11, 23, 14, 16, 25, 21]

In [6]:
help(spark.createDataFrame)

Help on method createDataFrame in module pyspark.sql.session:

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True) method of pyspark.sql.session.SparkSession instance
    Creates a :class:`DataFrame` from an :class:`RDD`, a list or a :class:`pandas.DataFrame`.
    
    When ``schema`` is a list of column names, the type of each column
    will be inferred from ``data``.
    
    When ``schema`` is ``None``, it will try to infer the schema (column names and types)
    from ``data``, which should be an RDD of either :class:`Row`,
    :class:`namedtuple`, or :class:`dict`.
    
    When ``schema`` is :class:`pyspark.sql.types.DataType` or a datatype string, it must match
    the real data, or an exception will be thrown at runtime. If the given schema is not
    :class:`pyspark.sql.types.StructType`, it will be wrapped into a
    :class:`pyspark.sql.types.StructType` as its only field, and the field name will be "value".
    Each record will also be wrapped into a tu

In [17]:
# NEED TO PASS SCHEMA 
spark.createDataFrame(age_lst)

TypeError: Can not infer schema for type: <class 'int'>

#### Create dataframe in two ways
1. Pass the datatype as a string (e.g. `'int'`).
2. Call the type constructor (e.g. `IntegerType()`).

In [28]:
# Pass datatype
spark.createDataFrame(age_lst, 'int')

DataFrame[value: int]

In [19]:
# Call constructor
spark.createDataFrame(age_lst, IntegerType())

DataFrame[value: int]

In [20]:
name_lst = ['Joey', 'Ross', 'Chandler', 'Monica', 'Pheobe', 'Racheal']

In [30]:
spark.createDataFrame(name_lst, 'string')

DataFrame[value: string]

In [25]:
spark.createDataFrame(name_lst, StringType())

DataFrame[value: string]

In [31]:
# list of tuple
age_lst = [(11, ), (12, ), (15, )]
spark.createDataFrame(age_lst)

DataFrame[_1: bigint]

In [32]:
spark.createDataFrame(age_lst, 'int')

TypeError: field value: IntegerType can not accept object (11,) in type <class 'tuple'>

In [33]:
spark.createDataFrame(age_lst, 'age int')

DataFrame[age: int]

**NOTE:**
1. If you create a dataframe from a plain **list** of scalars, you need to pass a schema (i.e. a data type).
2. If you create a dataframe from a **list of tuples**, you need not pass a schema; Spark infers the types and names the columns `_1`, `_2`, and so on.
3. If you pass a datatype alone with a **list of tuples**, it throws a `TypeError`.
4. To assign a **datatype** to a **list of tuples**, pass a `column_name datatype` pair such as `'age int'`.
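Notes 1 and 4 can be combined: wrapping each scalar in a one-element tuple lets a flat list take a named column. The following is a minimal sketch, not a definitive recipe. `to_single_column_rows` and `single_column_df` are hypothetical helper names (not part of PySpark), and `single_column_df` assumes an existing `SparkSession` is passed in.

```python
def to_single_column_rows(values):
    """Wrap each scalar in a one-element tuple so it is treated as a record."""
    return [(value,) for value in values]


def single_column_df(spark, values, column_ddl):
    """Build a one-column DataFrame with a named, typed column.

    Assumptions: `spark` is an existing SparkSession and `column_ddl`
    is a 'name type' pair such as 'age int'.
    """
    return spark.createDataFrame(to_single_column_rows(values), column_ddl)


age_lst = [11, 23, 14, 16, 25, 21]
age_rows = to_single_column_rows(age_lst)
print(age_rows)
```

With a running session, `single_column_df(spark, age_lst, 'age int')` should then behave like the `'age int'` cell above.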

### 2. Create multiple column spark dataframe using list <a id="second"></a>


In [35]:
user_lst = [(1, 'Phoebe'), (2, 'Joey'), (3, 'Ross'), (4, 'Monica'), (5, 'Chandler'), (6, 'Rachael')]

In [36]:
spark.createDataFrame(user_lst)

DataFrame[_1: bigint, _2: string]

In [37]:
spark.createDataFrame(user_lst, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]
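The schema string used above is plain text, so it can be assembled from a list of column definitions instead of being typed by hand. A small sketch, assuming the columns are kept as `(name, type)` pairs; `build_ddl_schema` and `user_columns` are hypothetical names introduced only for illustration.

```python
def build_ddl_schema(columns):
    """Join (name, type) pairs into a 'name type, name type' schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


user_columns = [("user_id", "int"), ("user_first_name", "string")]
user_schema = build_ddl_schema(user_columns)
print(user_schema)  # user_id int, user_first_name string
```

The resulting string can be passed as the second argument, as in `spark.createDataFrame(user_lst, user_schema)`.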

### 3. Overview of Row<a id="third"></a>


In [3]:
user_lst = [(1, 'Phoebe'), (2, 'Joey'), (3, 'Ross'), (4, 'Monica'), (5, 'Chandler'), (6, 'Rachael')]
df = spark.createDataFrame(user_lst, 'user_id int, user_first_name string')

In [4]:
df.show()

+-------+---------------+
|user_id|user_first_name|
+-------+---------------+
|      1|         Phoebe|
|      2|           Joey|
|      3|           Ross|
|      4|         Monica|
|      5|       Chandler|
|      6|        Rachael|
+-------+---------------+



In [5]:
# Return list of Row Object
df.collect()

[Row(user_id=1, user_first_name='Phoebe'),
 Row(user_id=2, user_first_name='Joey'),
 Row(user_id=3, user_first_name='Ross'),
 Row(user_id=4, user_first_name='Monica'),
 Row(user_id=5, user_first_name='Chandler'),
 Row(user_id=6, user_first_name='Rachael')]

In [6]:
type(df.collect())

list

In [7]:
from pyspark.sql import Row

In [9]:
r = Row('Pheobe', 21)

In [10]:
type(r)

pyspark.sql.types.Row

In [11]:
help(Row)

Help on class Row in module pyspark.sql.types:

class Row(builtins.tuple)
 |  Row(*args, **kwargs)
 |  
 |  A row in :class:`DataFrame`.
 |  The fields in it can be accessed:
 |  
 |  * like attributes (``row.key``)
 |  * like dictionary values (``row[key]``)
 |  
 |  ``key in row`` will search through row keys.
 |  
 |  Row can be used to create a row object by using named arguments.
 |  It is not allowed to omit a named argument to represent that the value is
 |  None or missing. This should be explicitly set to None in this case.
 |  
 |  NOTE: As of Spark 3.0.0, Rows created from named arguments no longer have
 |  field names sorted alphabetically and will be ordered in the position as
 |  entered. To enable sorting for Rows compatible with Spark 2.x, set the
 |  environment variable "PYSPARK_ROW_FIELD_SORTING_ENABLED" to "true". This
 |  option is deprecated and will be removed in future versions of Spark. For
 |  Python versions < 3.6, the order of named arguments is not guarante

In [14]:
'Pheobe' in r, r[0]

(True, 'Pheobe')

In [15]:
r2 = Row(name='Joey', age=22)

In [16]:
r2['name'], r2.name

('Joey', 'Joey')

**NOTE:**
1. `df.collect()` returns a list of `Row` objects (it converts the dataframe to a list).
2. `df.show()` prints the dataframe; it does not return it.
3. In PySpark, `Row` is a `tuple` subclass: an ordered collection of fields that can be accessed by index, and, when the row was created with named arguments, by key or as an attribute.

**e.g.** Access row elements using index or name conventions like **r[0]** or **r['name'] (or r.name)**
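The `createDataFrame` help text quoted earlier lists `namedtuple` next to `Row` as a record type it can infer a schema from. As a rough analogy that runs without Spark, the standard-library `namedtuple` offers similar access by position and by attribute. `User` is a hypothetical record type used only for this sketch.

```python
from collections import namedtuple

# Hypothetical record type with the same fields as the example dataframe.
User = namedtuple("User", ["user_id", "user_first_name"])

u = User(user_id=1, user_first_name="Phoebe")

print(u[0])               # access by position
print(u.user_first_name)  # access by attribute
print("Phoebe" in u)      # membership checks the values, as with any tuple
print(u._asdict())        # field name -> value mapping
```

One difference from `Row`: a `namedtuple` does not support key access such as `u['user_id']`.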

### 4. Convert list of list into spark dataframe using Row<a id="fourth"></a>


In [17]:
user_lst = [[1, 'Phoebe'], [2, 'Joey'], [3, 'Ross'], [4, 'Monica'], [5, 'Chandler'], [6, 'Rachael']]

In [19]:
type(user_lst), type(user_lst[1])

(list, list)

In [21]:
spark.createDataFrame(user_lst, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]

In [25]:
# Convert list of list into list of rows
user_rows = [Row(*user) for user in user_lst]

In [26]:
spark.createDataFrame(user_rows, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]

### 5. Convert list of tuples into spark dataframe using Row<a id="fifth"></a>

In [27]:
user_lst = [(1, 'Phoebe'), (2, 'Joey'), (3, 'Ross'), (4, 'Monica'), (5, 'Chandler'), (6, 'Rachael')]

In [31]:
type(user_lst), type(user_lst[1])

(list, tuple)

In [32]:
user_rows = [Row(*user) for user in user_lst]

In [34]:
user_rows

[<Row(1, 'Phoebe')>,
 <Row(2, 'Joey')>,
 <Row(3, 'Ross')>,
 <Row(4, 'Monica')>,
 <Row(5, 'Chandler')>,
 <Row(6, 'Rachael')>]

In [35]:
spark.createDataFrame(user_rows, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]

### 6. Convert list of dicts into spark dataframe using Row<a id="sixth"></a>


In [60]:
user_lst = [
            {'user_id': 1, 'user_first_name': 'Phoebe'},
            {'user_id': 2, 'user_first_name': 'Joey'},
            {'user_id': 3, 'user_first_name': 'Ross'},
            {'user_id': 4, 'user_first_name': 'Monica'},
            {'user_id': 5, 'user_first_name': 'Chandler'},
            {'user_id': 6, 'user_first_name': 'Rachael'}
           ]

In [58]:
print(f"\033[41m {'-'*20} Creating dataframe from list of dict is DEPRECATED {'-'*20} \033[0m")

 -------------------- Creating dataframe from list of dict is DEPRECATED -------------------- 


In [37]:
spark.createDataFrame(user_lst)



DataFrame[user_first_name: string, user_id: bigint]

In [67]:
# using args
user_rows = [Row(*user.values()) for user in user_lst]

In [68]:
spark.createDataFrame(user_rows, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]

In [71]:
# using keyword args
user_rows = [Row(**user) for user in user_lst]

In [70]:
user_rows

[Row(user_id=1, user_first_name='Phoebe'),
 Row(user_id=2, user_first_name='Joey'),
 Row(user_id=3, user_first_name='Ross'),
 Row(user_id=4, user_first_name='Monica'),
 Row(user_id=5, user_first_name='Chandler'),
 Row(user_id=6, user_first_name='Rachael')]

In [72]:
spark.createDataFrame(user_rows)

DataFrame[user_id: bigint, user_first_name: string]
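The positional variant, `Row(*user.values())`, relies on every dict listing its keys in the same order. A minimal sketch of a more defensive conversion that picks values by an explicit column list; `dicts_to_records` and `columns` are hypothetical names, and the second dict is reordered on purpose to show the effect.

```python
def dicts_to_records(dicts, columns):
    """Return one tuple per dict, with values ordered by `columns`."""
    return [tuple(d[col] for col in columns) for d in dicts]


columns = ["user_id", "user_first_name"]
user_lst = [
    {"user_id": 1, "user_first_name": "Phoebe"},
    # Keys deliberately listed in a different order.
    {"user_first_name": "Joey", "user_id": 2},
]

user_records = dicts_to_records(user_lst, columns)
print(user_records)  # [(1, 'Phoebe'), (2, 'Joey')]
```

The tuples can then be passed with a matching schema string, for example `spark.createDataFrame(user_records, 'user_id int, user_first_name string')`.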

### 7. Create spark dataframe using pandas dataframe<a id="seventh"></a>


In [24]:
users = [
            {
                "id": 1,
                "first_name": "Pheobe",
                "last_name": "Buffay",
                "email": "pheobebuffay@abc.com",
                "is_customer": True,
                "amount_paid": 1000.55,
                "customer_from": datetime.date(2021, 1, 13),
                "last_updated_ts": datetime.datetime(2021, 2, 10, 1, 15, 0)
            },
            {
                "id": 2,
                "first_name": "Joey",
                "last_name": "Tribbiani",
                "email": "joey@abc.com",
                "is_customer": True,
                "amount_paid": 900.0,
                "customer_from": datetime.date(2021, 2, 14),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 3,
                "first_name": "Monica",
                "last_name": "Geller",
                "email": "monica@abc.com",
                "is_customer": True,
                "amount_paid": 1000.90,
                "customer_from": datetime.date(2021, 2, 22),
                "last_updated_ts": datetime.datetime(2021, 2, 28, 7, 33, 0)
            },
            {
                "id": 4,
                "first_name": "Ross",
                "last_name": "Geller",
                "email": "ross@abc.com",
                "is_customer": True,
                "amount_paid": 1200.55,
                "customer_from": datetime.date(2021, 1, 19),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 1, 10, 0)
            },
            {
                "id": 5,
                "first_name": "Rachel",
                "last_name": "Green",
                "email": "rachel@abc.com",
                "is_customer": True,
                "customer_from": datetime.date(2021, 2, 24),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 6,
                "first_name": "Chandler",
                "last_name": "Bing",
                "email": "chandler@abc.com",
                "is_customer": True,
                "customer_from": datetime.date(2021, 2, 22),
                "last_updated_ts": datetime.datetime(2021, 2, 25, 7, 33, 0)
            }
        ]

**NOTE:**

In the above data, the `amount_paid` field is missing from the 5th and 6th dictionaries. If I try to create a dataframe from it through `Row`, it throws an error, because those rows have fewer values than the schema expects.

But if I create the spark dataframe from a pandas dataframe, it runs without any error, and the missing `amount_paid` values show up as `NaN`.

In [8]:
spark.createDataFrame([Row(**user) for user in users]).show()

Py4JJavaError: An error occurred while calling o60.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3, DESKTOP-0V7FHTA, executor driver): java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 8 fields are required while 6 values are provided.
	at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$$nestedInanonfun$makeFromJava$16$1.applyOrElse(EvaluatePython.scala:186)
	at org.apache.spark.sql.execution.python.EvaluatePython$.nullSafeConvert(EvaluatePython.scala:211)
	at org.apache.spark.sql.execution.python.EvaluatePython$.$anonfun$makeFromJava$16(EvaluatePython.scala:180)
	at org.apache.spark.sql.SparkSession.$anonfun$applySchemaToPythonRDD$2(SparkSession.scala:739)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2697)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2697)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2904)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:832)


In [26]:
spark.createDataFrame(pd.DataFrame(users)).show()

+---+----------+---------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    Green|      rachel@abc.com|       true|        NaN|   2021-02-24|2021-02-18 03:33:00|
|  6|  Chandler|     Bing|    chandler@abc.com|       true|        NaN|   2021-02-22|2021-02-25 07:33:00|
+---+----------+---------+--------------------+-----------+-----------+-------------+-------------------+

In [27]:
spark.createDataFrame(pd.DataFrame(users)).printSchema()

root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- is_customer: boolean (nullable = true)
 |-- amount_paid: double (nullable = true)
 |-- customer_from: date (nullable = true)
 |-- last_updated_ts: timestamp (nullable = true)
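If `None` is preferred over `NaN` for the missing values, one alternative to going through pandas is to give every dict the same keys before building rows. This is only a sketch under stated assumptions: `fill_missing_keys` is a hypothetical helper, the sample records are trimmed versions of the data above, and the expectation that Spark reads `None` as null should be checked against your Spark version.

```python
def fill_missing_keys(dicts):
    """Give every dict the same keys, using None where a key is absent."""
    # Collect keys in first-seen order so the column order stays stable.
    all_keys = []
    for d in dicts:
        for key in d:
            if key not in all_keys:
                all_keys.append(key)
    return [{key: d.get(key) for key in all_keys} for d in dicts]


users = [
    {"id": 4, "first_name": "Ross", "amount_paid": 1200.55},
    {"id": 5, "first_name": "Rachel"},  # amount_paid is missing
]

normalized = fill_missing_keys(users)
print(normalized[1])  # {'id': 5, 'first_name': 'Rachel', 'amount_paid': None}
```

After this step every record has the same number of fields, so `[Row(**user) for user in normalized]` should no longer trip the field-count check seen in the error above.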

