# DAY 05 - Read Files into DataFrame
- Youtube Link: https://www.youtube.com/watch?v=02lSlhwLU4c

### Reading a CSV file into a Spark DataFrame
- Basic read
- With headers
- Inferring the schema

In [None]:
# Declare the path to the file
csv_path = 'Files/property-sales.csv'

# Read the CSV file, treating the first row as column names and inferring column types
# (the option is named "header", not "headers")
df_csv = spark.read.csv(csv_path, header = True, inferSchema = True)

display(df_csv)

In [None]:
df_csv.dtypes
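
Schema inference costs an extra pass over the data, and the inferred types are not always the ones you want. As an alternative, here is a minimal sketch of passing an explicit schema as a DDL string. The column names and types in `SALES_SCHEMA` and the helper `read_sales_csv` are hypothetical and only for illustration; they are not taken from property-sales.csv.

```python
# Hypothetical column names and types; replace them with the real
# columns of property-sales.csv.
SALES_SCHEMA = "address STRING, sale_price DOUBLE, sale_date DATE"


def read_sales_csv(spark, path, schema=SALES_SCHEMA):
    """Read a CSV with a header row using an explicit schema instead of inferSchema."""
    # schema accepts either a StructType or a DDL-formatted string
    return spark.read.csv(path, header=True, schema=schema)
```

In a notebook where `spark` is already defined, `read_sales_csv(spark, csv_path).dtypes` should then report the declared types rather than inferred ones.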

### Writing DataFrames to files (JSON)
- We can write our DataFrame out as JSON by calling df.write.json()
- Spark writes a folder of part files at the given path, not a single JSON file

In [None]:
# Call write.json()
df_csv.write.json("Files/json/property-sales.json", mode = 'overwrite')

### Reading a JSON File into DataFrame

In [None]:
df_json = spark.read.json('Files/json/property-sales.json')
display(df_json)

### Writing out to Parquet

In [None]:
df_json.write.parquet('Files/json/property-sales.parquet', mode = 'overwrite')

### Reading a Parquet into a DataFrame

In [None]:
df_parquet = spark.read.parquet('Files/json/property-sales.parquet')
display(df_parquet)

# Reading Multiple Files in the Same Folder
- First, create multiple Parquet files in the Files/parquet/ subfolder
- Then read all of the Parquet files into one DataFrame
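
The first step is not shown in this notebook, so here is one possible sketch of it, assuming we simply write the same DataFrame several times under Files/parquet/. The helper `write_parquet_copies` and the `property-sales-N.parquet` names are hypothetical. Spark writes each output as a folder of part files, and the `*.parquet` wildcard used in the read then matches those folders.

```python
def write_parquet_copies(df, folder="Files/parquet", copies=3):
    """Write the same DataFrame `copies` times, each to its own .parquet path."""
    # Hypothetical naming scheme: property-sales-1.parquet, property-sales-2.parquet, ...
    paths = [f"{folder}/property-sales-{i}.parquet" for i in range(1, copies + 1)]
    for path in paths:
        # Overwrite so the cell can be re-run without errors
        df.write.parquet(path, mode="overwrite")
    return paths
```

For example, `write_parquet_copies(df_parquet)` would produce three outputs in Files/parquet/ that the wildcard read can pick up together.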

In [None]:
# Read all the parquet files in the 'Files/parquet/' folder into one dataframe.
# The "*" wildcard matches every path in the folder that ends in .parquet.
df_all_parquet = spark.read.parquet('Files/parquet/*.parquet')

### Checking if this has worked using _metadata
- Spark exposes file metadata (such as the source file path and name) in a "hidden" _metadata column.
- The column is not returned by default, so we add it to the dataframe by selecting _metadata explicitly.

In [None]:
# Read all the parquet files, then add the _metadata column.
# A trailing "\" continues the statement on the next line to make it more readable.
# Nothing (not even a comment) may follow the "\" on the same line.
df_all_parquet_plus_metadata = spark.read\
    .parquet('Files/parquet/*.parquet')\
    .select("*", "_metadata")

display(df_all_parquet_plus_metadata)
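
To confirm that rows really came from several files, it can help to pull individual fields out of `_metadata` and count rows per file. The sketch below assumes Spark 3.3 or later, where `_metadata` includes `file_name` and `file_size`. The helper names `with_source_file` and `rows_per_file` are hypothetical.

```python
def with_source_file(df):
    """Add the source file name and size taken from the hidden _metadata column."""
    return df.select("*", "_metadata.file_name", "_metadata.file_size")


def rows_per_file(df):
    """Count how many rows were read from each source file."""
    # Select the file name first, then group on the resulting plain column
    return with_source_file(df).groupBy("file_name").count()
```

Calling `display(rows_per_file(df_all_parquet))` should show one row per source file if the wildcard read picked up more than one file.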

# Further Learning
- Ignoring corrupt/missing files
- Custom path filtering (PathGlobFilter)
- Recursive file reading patterns within complex folder structures
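
As a starting point for these topics, here is a rough sketch that combines the related generic file source options on one reader. The helper `read_parquet_tree` and its defaults are hypothetical. Treat it as something to experiment with rather than a recommended configuration, since skipping corrupt or missing files can silently drop data.

```python
def read_parquet_tree(spark, root, glob="*.parquet"):
    """Read Parquet files under `root`, including nested subfolders."""
    return (
        spark.read
        # Skip files that cannot be read instead of failing the whole job
        .option("ignoreCorruptFiles", "true")
        # Skip files that were deleted after the file listing was built
        .option("ignoreMissingFiles", "true")
        # Only keep files whose names match the glob pattern
        .option("pathGlobFilter", glob)
        # Descend into subfolders (this disables partition discovery)
        .option("recursiveFileLookup", "true")
        .parquet(root)
    )
```

In a notebook where `spark` is already defined, `read_parquet_tree(spark, 'Files/parquet')` would read every matching file below that folder.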