## Read data from files into Spark DataFrame

* Reading files using direct APIs such as `csv`, `json`, etc. under `spark.read`.
* Reading files using `format` and `load` under `spark.read`.
* Specifying options as keyword arguments as well as using functions such as `option` and `options`.
* Supported file formats:
    * `csv`
    * `text`
    * `json`
    * `parquet`
    * `orc`
* Other common file formats:
    * `xml`
    * `avro`
* Important file formats for certification - `csv`, `json`, `parquet`
* Reading compressed files

**NOTE**

* Check if the files are compressed (gz, snappy, bz2, etc.). The most common ones are gz and snappy.
* Understand the file format (text, json, avro, parquet, etc.). Sometimes files will not have extensions.
* If the files do not have extensions, confirm the details by going through the tech spec or by opening the file.
* We typically get the tech specs from our leads or architects while working on real-world projects.
* If the files are in text format, check whether the data is delimited by a specific character.
* Use the appropriate API under `spark.read` to read the data.
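
When a file has no extension and no tech spec is at hand, "opening the file" can be done programmatically. The sketch below uses only the Python standard library; the helper names `sniff_compression` and `sniff_delimiter` are hypothetical, only the gzip and bzip2 signatures are checked, and the sample file is created just for the demonstration.

```python
import csv
import gzip
import os
import tempfile

# Leading "magic" bytes of two common compression formats.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
}

def sniff_compression(path):
    """Return 'gzip', 'bzip2', or None based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None

def sniff_delimiter(sample, candidates=",|\t;"):
    """Guess the field delimiter of a text sample using csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Demo on a throwaway gzip file that deliberately has no extension.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "orders_part")
rows = (
    "1|2013-07-25 00:00:00.0|11599|CLOSED\n"
    "2|2013-07-25 00:00:00.0|256|PENDING_PAYMENT\n"
)
with gzip.open(demo_path, "wt") as f:
    f.write(rows)

compression = sniff_compression(demo_path)
with gzip.open(demo_path, "rt") as f:
    delimiter = sniff_delimiter(f.read())
print(compression, repr(delimiter))
```

A check like this only narrows down the possibilities; the tech spec remains the authoritative source for the format.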

In [34]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

In [2]:
spark = SparkSession. \
        builder. \
        appName('ReadDataSparkDF'). \
        getOrCreate()

In [3]:
schema = """
    order_id INT,
    order_date TIMESTAMP,
    order_customer_id INT,
    order_status STRING
"""

In [4]:
orders = spark.read.schema(schema).csv('../data/orders.csv', header=None)
orders.show()

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
|       6|2013-07-25 00:00:00|             7130|       COMPLETE|
|       7|2013-07-25 00:00:00|             4530|       COMPLETE|
|       8|2013-07-25 00:00:00|             2911|     PROCESSING|
|       9|2013-07-25 00:00:00|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:00|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:00|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:00|             1837|         CLOSED|
|      13|2013-07-25 00:0

In [5]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [6]:
orders.dtypes

[('order_id', 'int'),
 ('order_date', 'timestamp'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

In [7]:
orders_json = spark.read \
        .json('../data/orders.json', multiLine=True)

In [8]:
orders_json.count()

68883

In [9]:
orders_json.printSchema()

root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)



In [10]:
orders_json.show(5)

+-----------------+----------+--------+---------------+
|order_customer_id|order_date|order_id|   order_status|
+-----------------+----------+--------+---------------+
|            11599|   00:00.0|       1|         CLOSED|
|              256|   00:00.0|       2|PENDING_PAYMENT|
|            12111|   00:00.0|       3|       COMPLETE|
|             8827|   00:00.0|       4|         CLOSED|
|            11318|   00:00.0|       5|       COMPLETE|
+-----------------+----------+--------+---------------+
only showing top 5 rows



#### JSON to PARQUET

In [11]:
input_dir = '../data/retail_db_json/'
output_dir = '../data/retail_db_parquet/'

In [12]:
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(input_dir))

In [13]:
# Read each JSON dataset under input_dir and write it as Parquet under output_dir
for file in list_status:
    file_path = file.getPath()
    if not file_path.getName().endswith('sql'):
        print(f"Converting data in `{file_path}` from json to parquet")
        df = spark.read.json(str(file_path))
        # coalesce(1) reduces the data to one partition, so a single output file is written per dataset
        df.coalesce(1).write.parquet(f'{output_dir}/{file_path.getName()}', mode='overwrite')

Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/categories` from json to parquet
Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/customers` from json to parquet
Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/departments` from json to parquet
Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/orders` from json to parquet
Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/order_items` from json to parquet
Converting data in `file:/E:/Practice/PySpark/data/retail_db_json/products` from json to parquet


In [14]:
orders = spark.read.parquet(f'{output_dir}/orders')

In [15]:
orders.dtypes

[('order_customer_id', 'bigint'),
 ('order_date', 'string'),
 ('order_id', 'bigint'),
 ('order_status', 'string')]

In [16]:
orders.show()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|
|              918|2013-07-25 00:00:...|      11| PAYMENT_REVIEW|
|             1837|2013-07-25 00:00:...|      12|         CLOSED|
|         

#### Databricks Code for converting json files to parquet

#### Convert COMMA Separated to PIPE Separated Files

In [17]:
input_dir = '../data/retail_db/'
output_dir = '../data/retail_db_pipe/'

In [18]:
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(input_dir))

In [19]:
for file in list_status:
    file_path = file.getPath()
    file_name = file_path.getName()
    if not file_name.endswith('sql'):
        print(f"Converting data in `{input_dir}` from csv to pipe separated")
        df = spark.read.csv(str(file_path))
        df.coalesce(1).write.mode('overwrite').csv(f'{output_dir}/{file_name}', sep='|')

Converting data in `../data/retail_db/` from csv to pipe separated
Converting data in `../data/retail_db/` from csv to pipe separated
Converting data in `../data/retail_db/` from csv to pipe separated
Converting data in `../data/retail_db/` from csv to pipe separated
Converting data in `../data/retail_db/` from csv to pipe separated
Converting data in `../data/retail_db/` from csv to pipe separated


In [20]:
schema = """
    order_id INT,
    order_date TIMESTAMP,
    order_customer_id INT,
    order_status STRING
"""

In [21]:
orders = spark.read.schema(schema).csv(f'{output_dir}/orders', sep='|')

In [22]:
orders.dtypes

[('order_id', 'int'),
 ('order_date', 'timestamp'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

In [23]:
orders.show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



#### Reading data from CSV files into Spark DataFrame using multiple approaches.

* `spark.read.csv('path_to_folder')`
* `spark.read.format('csv').load('path_to_folder')`
* We can explicitly specify the schema as a `string` or using `StructType`.
* We can also read data that is delimited by characters other than a comma.
* If the files have a header, we can create the dataframe using the `header` and `inferSchema` options. Column names are picked from the header, while data types are inferred from the data.
* If the files do not have a header, we can create the dataframe by passing column names using `toDF` and by using the `inferSchema` option.

In [24]:
orders = spark.read.csv('../data/orders.csv')

In [25]:
orders.columns

['_c0', '_c1', '_c2', '_c3']

In [26]:
orders.dtypes

[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]

In [28]:
schema = """
    order_id INT,
    order_date TIMESTAMP,
    order_customer_id INT,
    order_status STRING
"""

In [31]:
spark.read.schema(schema).csv('../data/orders.csv').show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



In [32]:
spark.read.csv('../data/orders.csv', schema=schema).show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



In [35]:
schema = StructType([
    StructField('order_id', IntegerType()),
    StructField('order_date', TimestampType()),
    StructField('order_customer_id', IntegerType()),
    StructField('order_status', StringType()),
])

In [36]:
spark.read.schema(schema).csv('../data/orders.csv').show(5)

+--------+-------------------+-----------------+---------------+
|order_id|         order_date|order_customer_id|   order_status|
+--------+-------------------+-----------------+---------------+
|       1|2013-07-25 00:00:00|            11599|         CLOSED|
|       2|2013-07-25 00:00:00|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:00|            12111|       COMPLETE|
|       4|2013-07-25 00:00:00|             8827|         CLOSED|
|       5|2013-07-25 00:00:00|            11318|       COMPLETE|
+--------+-------------------+-----------------+---------------+
only showing top 5 rows



#### Using `toDF` and `inferSchema` with CSV to create a Spark DataFrame

In [38]:
spark.read.option('inferSchema', True).csv('../data/orders.csv').dtypes

[('_c0', 'int'), ('_c1', 'string'), ('_c2', 'int'), ('_c3', 'string')]

In [42]:
help(spark.read.option('inferSchema', True).csv('../data/orders.csv').toDF)

Help on method toDF in module pyspark.sql.dataframe:

toDF(*cols) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` that with new specified column names
    
    :param cols: list of new column names (string)
    
    >>> df.toDF('f1', 'f2').collect()
    [Row(f1=2, f2='Alice'), Row(f1=5, f2='Bob')]



In [40]:
spark.read.option('inferSchema', True).csv('../data/orders.csv').toDF('order_id',
'order_date',
'order_customer_id',
'order_status').dtypes

[('order_id', 'int'),
 ('order_date', 'string'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

In [43]:
columns = ['order_id',
'order_date',
'order_customer_id',
'order_status']

In [44]:
spark.read.option('inferSchema', True).csv('../data/orders.csv').toDF(*columns).dtypes

[('order_id', 'int'),
 ('order_date', 'string'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

In [45]:
spark.read.csv('../data/orders.csv', inferSchema=True).toDF(*columns).dtypes

[('order_id', 'int'),
 ('order_date', 'string'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

#### Specifying options in different ways while creating the dataframe

* Using keyword arguments as part of the APIs. We can use keyword arguments with `load` as well as with direct APIs such as `csv`.
* `spark.read.option`
* `spark.read.options`
* If the key of an option is incorrect, that option will be ignored.

The available options and arguments vary depending on the API used for the file format.

In [52]:
orders = spark.read.csv('../data/retail_db_pipe/orders', sep='|', header=None, inferSchema=True).toDF(*columns)

In [54]:
orders.show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [55]:
orders = spark.read.format('csv').load('../data/retail_db_pipe/orders', sep='|', header=None, inferSchema=True).toDF(*columns)

In [56]:
orders.show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [57]:
spark.read.option('sep', '|').option('header', None).option('inferSchema', True). \
csv('../data/retail_db_pipe/orders').toDF(*columns).show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [59]:
spark.read.options(sep='|', header=None, inferSchema=True). \
csv('../data/retail_db_pipe/orders').toDF(*columns).show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [60]:
options = {
    'sep': '|',
    'header': None,
    'inferSchema': True
}

In [61]:
spark.read.options(**options). \
csv('../data/retail_db_pipe/orders').toDF(*columns).show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



#### Reading JSON files 

In [64]:
spark.read.json('../data/orders.json', multiLine=True).show(2)

+-----------------+----------+--------+---------------+
|order_customer_id|order_date|order_id|   order_status|
+-----------------+----------+--------+---------------+
|            11599|   00:00.0|       1|         CLOSED|
|              256|   00:00.0|       2|PENDING_PAYMENT|
+-----------------+----------+--------+---------------+
only showing top 2 rows



In [71]:
df = spark.read.format('json').load('../data/retail_db_json/orders', multiLine=True)

df.inputFiles() # Not working

* If `inferSchema` is used, the entire data needs to be read to infer the schema accurately while creating the dataframe.
* If the data is very large, additional time will be spent inferring the schema.
* When we explicitly specify the schema, data will not be read while creating the dataframe.
* As we have seen, we can explicitly specify the schema using a string or `StructType`.
* Inferring the schema comes in handy to quickly understand the structure of the data as part of proofs of concept as well as design.
* Schema is inferred by default for files of type JSON, Parquet and ORC. For Parquet and ORC, column names and data types come from the metadata stored in the files; for JSON, they are inferred from the field names and values in the data.
* By default, reading CSV files creates dataframes with system-generated column names (`_c0`, `_c1`, ...) and all columns as strings. If `inferSchema` is used, the data types are determined from the data. If the files contain a header, column names can be picked up from it using the `header` option. If not, we need to explicitly pass the column names using `toDF`.

In [76]:
spark.read.parquet('../data/retail_db_parquet/orders').show(2)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
+-----------------+--------------------+--------+---------------+
only showing top 2 rows



In [77]:
spark.read.parquet('../data/retail_db_parquet/orders').dtypes

[('order_customer_id', 'bigint'),
 ('order_date', 'string'),
 ('order_id', 'bigint'),
 ('order_status', 'string')]

In [78]:
schema = """
    order_id INT,
    order_date TIMESTAMP,
    order_customer_id INT,
    order_status STRING
"""

In [79]:
spark.read.schema(schema).parquet('../data/retail_db_parquet/orders').dtypes

[('order_id', 'int'),
 ('order_date', 'timestamp'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]

In [80]:
spark.read.format('parquet').load('../data/retail_db_parquet/orders', schema=schema).dtypes

[('order_id', 'int'),
 ('order_date', 'timestamp'),
 ('order_customer_id', 'int'),
 ('order_status', 'string')]