In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

from lib.logger import Log4j

spark = SparkSession.builder.master("local[3]").appName("Read Formats using API").getOrCreate()
logger = Log4j(spark)
logger.info("Starting HelloSparkSQL")

In [2]:
flightTimeCsvDf = spark.read \
                    .format("csv") \
                    .option("header", "true") \
                    .load("data/flight*.csv")

In [4]:
flightTimeCsvDf.show(5)
logger.info("CSV Schema:" + flightTimeCsvDf.schema.simpleString())

+--------+----------+-----------------+------+----------------+----+--------------+------------+--------+---------+-------+------------+--------+---------+--------+
| FL_DATE|OP_CARRIER|OP_CARRIER_FL_NUM|ORIGIN|ORIGIN_CITY_NAME|DEST|DEST_CITY_NAME|CRS_DEP_TIME|DEP_TIME|WHEELS_ON|TAXI_IN|CRS_ARR_TIME|ARR_TIME|CANCELLED|DISTANCE|
+--------+----------+-----------------+------+----------------+----+--------------+------------+--------+---------+-------+------------+--------+---------+--------+
|1/1/2000|        DL|             1451|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1115|    1113|     1343|      5|        1400|    1348|        0|     946|
|1/1/2000|        DL|             1479|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1315|    1311|     1536|      7|        1559|    1543|        0|     946|
|1/1/2000|        DL|             1857|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1415|    1414|     1642|      9|        1721|    1651|        0|     946|
|1/1/2000|

The reader API picks up the column names from the header correctly; however, the data type of every column is string.
What if we infer the schema? Let's try.

In [5]:
flightTimeCsvDf = spark.read \
                    .format("csv") \
                    .option("header", "true") \
                    .option("inferSchema", "true") \
                    .load("data/flight*.csv")

In [6]:
flightTimeCsvDf.show(5)
logger.info("CSV Schema:" + flightTimeCsvDf.schema.simpleString())

+--------+----------+-----------------+------+----------------+----+--------------+------------+--------+---------+-------+------------+--------+---------+--------+
| FL_DATE|OP_CARRIER|OP_CARRIER_FL_NUM|ORIGIN|ORIGIN_CITY_NAME|DEST|DEST_CITY_NAME|CRS_DEP_TIME|DEP_TIME|WHEELS_ON|TAXI_IN|CRS_ARR_TIME|ARR_TIME|CANCELLED|DISTANCE|
+--------+----------+-----------------+------+----------------+----+--------------+------------+--------+---------+-------+------------+--------+---------+--------+
|1/1/2000|        DL|             1451|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1115|    1113|     1343|      5|        1400|    1348|        0|     946|
|1/1/2000|        DL|             1479|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1315|    1311|     1536|      7|        1559|    1543|        0|     946|
|1/1/2000|        DL|             1857|   BOS|      Boston, MA| ATL|   Atlanta, GA|        1415|    1414|     1642|      9|        1721|    1651|        0|     946|
|1/1/2000|

Now it is a little better: the numeric fields are inferred as integers. However, the date field is still a string.
The point is straightforward: you cannot rely on the inferSchema option.
So you have only two options here:
1. Explicit: set the schema for the DataFrame explicitly, or
2. Implicit: use a data file format that comes with its own schema.
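
As a minimal sketch of the explicit option, the schema can be passed to the reader as a DDL string together with a `dateFormat` option. The column types below are assumptions based on the values shown in the output above, and `FLIGHT_COLUMNS`, `FLIGHT_SCHEMA_DDL` and `read_flights_csv` are hypothetical names used only for illustration.

```python
# Column names come from the CSV header shown above; the types are
# assumptions based on the displayed values, not confirmed by the source.
FLIGHT_COLUMNS = [
    ("FL_DATE", "DATE"),
    ("OP_CARRIER", "STRING"),
    ("OP_CARRIER_FL_NUM", "INT"),
    ("ORIGIN", "STRING"),
    ("ORIGIN_CITY_NAME", "STRING"),
    ("DEST", "STRING"),
    ("DEST_CITY_NAME", "STRING"),
    ("CRS_DEP_TIME", "INT"),
    ("DEP_TIME", "INT"),
    ("WHEELS_ON", "INT"),
    ("TAXI_IN", "INT"),
    ("CRS_ARR_TIME", "INT"),
    ("ARR_TIME", "INT"),
    ("CANCELLED", "INT"),
    ("DISTANCE", "INT"),
]

# Build a DDL string such as "FL_DATE DATE, OP_CARRIER STRING, ..."
FLIGHT_SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in FLIGHT_COLUMNS)


def read_flights_csv(spark, path="data/flight*.csv"):
    """Read the flight CSV files with an explicit schema instead of inferSchema."""
    return (spark.read
            .format("csv")
            .option("header", "true")
            # Dates in the file look like 1/1/2000, so tell the reader how to parse them.
            .option("dateFormat", "M/d/yyyy")
            .schema(FLIGHT_SCHEMA_DDL)
            .load(path))
```

Calling `read_flights_csv(spark)` with the session created in the first cell should return a DataFrame whose types come from the supplied schema rather than from inference. If a value does not match the declared type, the result depends on the reader's `mode` option.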

###### Let's read a JSON file

In [8]:
flightTimeJsonDf = spark.read \
                    .format("json") \
                    .load("data/flight*.json")
flightTimeJsonDf.show(5)
logger.info("Json Schema:" + flightTimeJsonDf.schema.simpleString())

+--------+---------+------------+------------+--------+----+--------------+--------+--------+----------+-----------------+------+----------------+-------+---------+
|ARR_TIME|CANCELLED|CRS_ARR_TIME|CRS_DEP_TIME|DEP_TIME|DEST|DEST_CITY_NAME|DISTANCE| FL_DATE|OP_CARRIER|OP_CARRIER_FL_NUM|ORIGIN|ORIGIN_CITY_NAME|TAXI_IN|WHEELS_ON|
+--------+---------+------------+------------+--------+----+--------------+--------+--------+----------+-----------------+------+----------------+-------+---------+
|    1348|        0|        1400|        1115|    1113| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1451|   BOS|      Boston, MA|      5|     1343|
|    1543|        0|        1559|        1315|    1311| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1479|   BOS|      Boston, MA|      7|     1536|
|    1651|        0|        1721|        1415|    1414| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1857|   BOS|      Boston, MA|      9|     1642|
|    2005|

The problem remains. JSON files do not come with a header, and the reader API infers the schema for JSON by default, so we removed the header and inferSchema options. Still, the date field is a string.

So the only option left is to set an explicit schema.
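
One reason inference cannot help here is that JSON itself has no date type: a date can only be stored as a string (or a number). The small sketch below uses plain Python rather than Spark to show this. The sample record is an assumption built from values in the output above, not a line copied from the data file.

```python
import json

# Hypothetical record using values from the output above.
sample_line = '{"FL_DATE": "1/1/2000", "OP_CARRIER": "DL", "DISTANCE": 946}'

record = json.loads(sample_line)

# Map each field to the Python type the JSON parser produced for it.
field_types = {name: type(value).__name__ for name, value in record.items()}
print(field_types)
# {'FL_DATE': 'str', 'OP_CARRIER': 'str', 'DISTANCE': 'int'}
```

The number comes back as a number, but the date is just text, so a reader has to be told that the field is a date and how it is formatted.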

Before that, let's read a file format that comes with its own schema, viz. Parquet.

###### Let's read a Parquet file

In [12]:
flightTimeParquetDf = spark.read \
                    .format("parquet") \
                    .load("data/flight*.parquet")
flightTimeParquetDf.show(5)
logger.info("Parquet Schema:" + flightTimeParquetDf.schema.simpleString())

+--------+---------+------------+------------+--------+----+--------------+--------+--------+----------+-----------------+------+----------------+-------+---------+
|ARR_TIME|CANCELLED|CRS_ARR_TIME|CRS_DEP_TIME|DEP_TIME|DEST|DEST_CITY_NAME|DISTANCE| FL_DATE|OP_CARRIER|OP_CARRIER_FL_NUM|ORIGIN|ORIGIN_CITY_NAME|TAXI_IN|WHEELS_ON|
+--------+---------+------------+------------+--------+----+--------------+--------+--------+----------+-----------------+------+----------------+-------+---------+
|    1348|        0|        1400|        1115|    1113| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1451|   BOS|      Boston, MA|      5|     1343|
|    1543|        0|        1559|        1315|    1311| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1479|   BOS|      Boston, MA|      7|     1536|
|    1651|        0|        1721|        1415|    1414| ATL|   Atlanta, GA|     946|1/1/2000|        DL|             1857|   BOS|      Boston, MA|      9|     1642|
|    2005|

In [13]:
spark.stop()

Well done. Parquet files carry their own schema, so no header or inferSchema options are needed.

The point is straightforward: use the Parquet format whenever possible. It is the recommended and default file format for Apache Spark.
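
As a rough sketch of what that can look like in practice, a DataFrame that already has the right types can be saved as Parquet once and read back later without any schema options. The helper names `save_as_parquet` and `read_with_default_format` and the output path are hypothetical. Both helpers need an active session, so they would be called before `spark.stop()`.

```python
def save_as_parquet(df, out_path):
    """Write a DataFrame as Parquet so later reads take the schema from the files."""
    (df.write
       .format("parquet")
       .mode("overwrite")  # replace any previous output at out_path
       .save(out_path))


def read_with_default_format(spark, path):
    """Load data without naming a format.

    With no .format(...) call, Spark falls back to the
    spark.sql.sources.default setting, which is "parquet"
    unless the configuration has been changed.
    """
    return spark.read.load(path)
```

For example, `save_as_parquet(flightTimeCsvDf, "data/flight-parquet")` followed by `read_with_default_format(spark, "data/flight-parquet")` should give back a DataFrame with the same column types that were written.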