In [0]:
spark

Core Structure:\
<pre>
DataFrameReader.format(...)\
            .option("key", "value")\
            .schema(...)\
            .load(...)
</pre>

format [Optional] => Data source format, e.g. CSV, JSON, JDBC/ODBC, table, parquet (default)
option [Optional] => Reader settings as key/value pairs, e.g. inferSchema, mode, header
schema [Optional] => A manually defined schema can be passed instead of inferring one
load [Required] => Path where the data resides

DataFrameReader API => Accessed through "spark.read"
spark => The SparkSession object

example:\
<pre>
spark.read.format("csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .option("mode", "FAILFAST")\
    .load(r"c:\user\download\data.csv")  # raw string, so "\u" is not read as an escape sequence
</pre>

Mode:
1. FAILFAST -> Fail the execution as soon as a malformed record is found in the dataset
2. DROPMALFORMED -> Drop the malformed record and continue with the execution
3. PERMISSIVE -> This is the default. Keep the record and set the fields that cannot be parsed to null

File formats:
1. csv -> Comma-Separated Values
2. json -> JavaScript Object Notation
3. parquet -> Column-oriented data file format designed for efficient data storage and retrieval.

In [0]:
flight_df = spark.read.format("csv")\
            .option("header", "false")\
            .option("inferschema", "false")\
            .option("mode", "FAILFAST")\
            .load("/FileStore/tables/flight_data-1.csv")
# Show 5 records. With header set to false, the header line is read as a data row
# and the columns get default names (_c0, _c1, _c2)
flight_df.show(5)

+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
flight_df_header = spark.read.format("csv")\
            .option("header", "true")\
            .option("inferschema", "false")\
            .option("mode", "FAILFAST")\
            .load("/FileStore/tables/flight_data-1.csv")
# Show 5 records
# With header set to true, the first line supplies the column names
flight_df_header.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
# Show schema
# With inferschema set to false, Spark does not scan the data to work out column types.
# That is why count comes back as string rather than integer.
flight_df_header.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)

In [0]:
flight_df_header_schema = spark.read.format("csv")\
            .option("header", "true")\
            .option("inferschema", "true")\
            .option("mode", "FAILFAST")\
            .load("/FileStore/tables/flight_data-1.csv")
# Show 5 records
# With header set to true, the first line supplies the column names
# Here inferschema is true, so the column types are inferred from the data
flight_df_header_schema.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
# Show schema
# With inferschema set to true, Spark scans the data to work out column types.
# That is why count comes back as int: every value in that column parses as an integer.
flight_df_header_schema.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: int (nullable = true)

---