Questions:
1. How do we create a schema in PySpark?
2. What other ways are there to create it?
3. What are StructType and StructField in a schema?
4. What if my data has a header row?

We can create a schema in Spark in either of the following ways:
1. Using StructType and StructField (from pyspark.sql.types import StructType, StructField, StringType, IntegerType)
  - StructType -> Defines the structure of the DF (a list or collection of StructField objects)
  - StructField -> Defines one column of the DF (the field name, the field data type, and whether the field is nullable)
2. Using a DDL (Data Definition Language) string

Create a Schema using StructType and StructField:\
<pre>
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
my_schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("name", StringType(), True),
  StructField("age", IntegerType(), True)
])
</pre>

Create a Schema using DDL:\
<pre>
ddl_my_schema = "id integer, name string, age integer"
</pre>

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
my_schema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", IntegerType(), True)
])
flight_df = spark.read.format("csv")\
            .option("header", "false")\
            .option("skiprows", 1)\
            .option("inferschema", "false")\
            .schema(my_schema)\
            .option("mode", "FAILFAST")\
            .load("/FileStore/tables/flight_data.csv")
# header and inferschema are both "false", so Spark does not treat the first line as column names
# and does not infer the column types from the data in the cells.
# Because we pass our own custom schema, Spark uses that schema while generating the DF:
# the column names and types in the DF come from the fields we defined with StructField.
# With header set to "false", the header line of the file is read as an ordinary data row.
# That row has a string in the count column, which should be an integer, so FAILFAST would raise an error on it.
# With the mode set to PERMISSIVE instead, the offending field would simply be null and the header text would show up as the 1st row.
# Here skiprows skips the 1st row (the header line in the dataset), so no malformed row remains and the mode can stay FAILFAST.
# Note: skiprows appears to be a Databricks reader option; a runtime that does not support it would ignore it and read the header line as data.
flight_df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



Check whether the uploaded file is present in DBFS (the Databricks File System):\
<pre>
%fs
ls /FileStore/tables/flight_data.csv
</pre>