## Create spark dataframes using python collections & pandas dataframe

#### Index

[1. Create single column spark dataframe using list](#first) <br>
[2. Create multi column spark dataframe using list](#second)

In [24]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType

In [3]:
# Initiate spark session
spark = SparkSession \
        .builder \
        .appName('CreateSparkDF') \
        .getOrCreate()

In [4]:
spark

### 1. Create single column spark dataframe using list <a id="first"></a>

In [16]:
age_lst = [11, 23, 14, 16, 25, 21]

In [6]:
help(spark.createDataFrame)

Help on method createDataFrame in module pyspark.sql.session:

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True) method of pyspark.sql.session.SparkSession instance
    Creates a :class:`DataFrame` from an :class:`RDD`, a list or a :class:`pandas.DataFrame`.
    
    When ``schema`` is a list of column names, the type of each column
    will be inferred from ``data``.
    
    When ``schema`` is ``None``, it will try to infer the schema (column names and types)
    from ``data``, which should be an RDD of either :class:`Row`,
    :class:`namedtuple`, or :class:`dict`.
    
    When ``schema`` is :class:`pyspark.sql.types.DataType` or a datatype string, it must match
    the real data, or an exception will be thrown at runtime. If the given schema is not
    :class:`pyspark.sql.types.StructType`, it will be wrapped into a
    :class:`pyspark.sql.types.StructType` as its only field, and the field name will be "value".
    Each record will also be wrapped into a tu

In [17]:
# Fails: a flat list of scalars needs an explicit schema
spark.createDataFrame(age_lst)

TypeError: Can not infer schema for type: <class 'int'>

#### Create dataframe in two ways
1. Pass the data type as a string, e.g. `'int'`
2. Pass a data type instance by calling its constructor, e.g. `IntegerType()`

In [28]:
# Pass datatype
spark.createDataFrame(age_lst, 'int')

DataFrame[value: int]

In [19]:
# Call constructor
spark.createDataFrame(age_lst, IntegerType())

DataFrame[value: int]

In [20]:
name_lst = ['Joey', 'Ross', 'Chandler', 'Monica', 'Phoebe', 'Rachael']

In [30]:
spark.createDataFrame(name_lst, 'string')

DataFrame[value: string]

In [25]:
spark.createDataFrame(name_lst, StringType())

DataFrame[value: string]

In [31]:
# List of tuples (one tuple per row)
age_lst = [(11, ), (12, ), (15, )]
spark.createDataFrame(age_lst)

DataFrame[_1: bigint]

In [32]:
spark.createDataFrame(age_lst, 'int')

TypeError: field value: IntegerType can not accept object (11,) in type <class 'tuple'>

In [33]:
spark.createDataFrame(age_lst, 'age int')

DataFrame[age: int]

**NOTE:**
1. If you create a dataframe from a simple **list** of scalars, you need to pass a schema (i.e. a data type); the resulting column is named `value`.
2. If you create a dataframe from a **list of tuples**, you need not pass a schema; the column names (`_1`, `_2`, ...) and types are inferred automatically.
3. If you pass a data type alone (e.g. `'int'`) with a **list of tuples**, it throws a `TypeError`.
4. To assign a **data type** to a **list of tuples**, pass the column name and data type as a pair, e.g. `'age int'`.
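
Notes 3 and 4 suggest a simple conversion when you start with a flat list but want a named, typed column: wrap each value in a 1-tuple and pass a `name type` schema string. The sketch below is one way to do that. `to_rows` and `named_single_column_df` are hypothetical helper names (not part of PySpark), and `spark` is assumed to be the session created at the top of this notebook.

```python
def to_rows(values):
    """Wrap each scalar in a 1-tuple so that every element represents one row."""
    return [(value,) for value in values]


def named_single_column_df(spark, values, schema):
    """Build a single column dataframe from a flat list.

    `spark` is an existing SparkSession (assumed, not created here);
    `schema` is a 'name type' string such as 'age int'.
    """
    return spark.createDataFrame(to_rows(values), schema)


# The conversion itself is plain Python and needs no Spark session
age_rows = to_rows([11, 23, 14, 16, 25, 21])
print(age_rows)  # [(11,), (23,), (14,), (16,), (25,), (21,)]
```

Calling `named_single_column_df(spark, [11, 23, 14], 'age int')` should then give a dataframe with a single `age` column of type `int`, as in the `'age int'` cell above.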

### 2. Create multi column spark dataframe using list <a id="second"></a>


In [35]:
user_lst = [(1, 'Phoebe'), (2, 'Joey'), (3, 'Ross'), (4, 'Monica'), (5, 'Chandler'), (6, 'Rachael')]

In [36]:
spark.createDataFrame(user_lst)

DataFrame[_1: bigint, _2: string]
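
The `_1`, `_2` names are defaults, since plain tuples carry no field names. The help text above mentions that column names can be inferred from `namedtuple` records, so one option is to convert the tuples first. A minimal sketch, where `User` and `users_df_from_namedtuples` are hypothetical names and `spark` is assumed to be the session created earlier:

```python
from collections import namedtuple

# Field names become the column names when the schema is inferred
User = namedtuple('User', ['user_id', 'user_first_name'])

user_lst = [(1, 'Phoebe'), (2, 'Joey'), (3, 'Ross')]
user_rows = [User(*user) for user in user_lst]


def users_df_from_namedtuples(spark, rows):
    """Create a dataframe without a schema; `spark` is an existing SparkSession."""
    return spark.createDataFrame(rows)


print(user_rows[0])  # User(user_id=1, user_first_name='Phoebe')
```

The column types are still inferred here, so `user_id` would most likely come out as `bigint` rather than `int`. Pass an explicit schema if the exact type matters.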

In [37]:
spark.createDataFrame(user_lst, 'user_id int, user_first_name string')

DataFrame[user_id: int, user_first_name: string]
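
The schema string used above is a comma-separated list of `name type` pairs. When the column definitions already live in a Python collection, the string can be assembled rather than typed by hand. A small sketch, assuming a hypothetical helper named `ddl_schema` (not a PySpark function):

```python
def ddl_schema(columns):
    """Join (column_name, data_type) pairs into a 'name type, name type' string."""
    return ', '.join(f'{name} {data_type}' for name, data_type in columns)


user_columns = [('user_id', 'int'), ('user_first_name', 'string')]
user_schema = ddl_schema(user_columns)
print(user_schema)  # user_id int, user_first_name string
```

The resulting string can be passed as the second argument, e.g. `spark.createDataFrame(user_lst, user_schema)`, which should match the cell above.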