Questions:
1. What is the Parquet file format?
2. Why do we need Parquet?
3. How do we read a Parquet file?
4. What makes Parquet the default choice?
5. What encoding is applied to the data?
6. What compression technique is used?
7. How do we optimize a Parquet file?
8. What are row groups, columns and pages?
9. How do projection pruning and predicate pushdown work?

- Doc - [Link](https://parquet.apache.org/docs/concepts/)
- Parquet is a file format.
- More specifically, Parquet is a columnar (column-oriented) file format.


Columnar Based File Format and Row Based File Format: [Link](https://blog.devgenius.io/big-data-file-formats-d980f5d07e44) | [Link](https://www.linkedin.com/pulse/day10-data-layout-row-based-vs-column-based-farhan-khan/)
- In big data, we typically write once and read many times.
- Hence, a file format that is faster to read is preferable.
- Column-oriented file formats store the values of each column together on disk, which makes it cheaper to read a subset of columns.
- Suppose we want to read column 1 and column 3 of a dataset:
  - With row-oriented data, we read column 1 of record 1, jump to column 3 of record 1, jump to column 1 of record 2, then to column 3 of record 2, and so on for the remaining records. The values we need are not contiguous, so the reader has to make many jumps across the storage, which takes more time.
  - With column-oriented data, we read all of column 1 in one go, jump to where column 3 is stored, and read all of column 3 at once. Far fewer jumps are required.
  - This is why reading a few columns is much faster from column-oriented data than from row-oriented data. By the same logic, row-oriented data is faster to write than column-oriented data, because a whole record is written in one place.
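
The jump-counting argument above can be sketched in plain Python. This is only a toy model, not how Parquet lays out bytes: the table, the `count_jumps` helper and the idea of treating each list index as a storage position are assumptions made for illustration.

```python
# Toy table: 4 records with 4 columns each (values are made up).
rows = [
    ("a1", "b1", "c1", "d1"),
    ("a2", "b2", "c2", "d2"),
    ("a3", "b3", "c3", "d3"),
    ("a4", "b4", "c4", "d4"),
]
num_rows = len(rows)
num_cols = 4

# Row-oriented layout: record after record.
row_layout = [value for row in rows for value in row]
# Column-oriented layout: column after column.
col_layout = [row[c] for c in range(num_cols) for row in rows]

def count_jumps(positions):
    """Count the contiguous runs formed by the positions.
    Each run needs one seek, so fewer runs means fewer jumps."""
    positions = sorted(positions)
    runs = 1
    for prev, cur in zip(positions, positions[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

wanted = [0, 2]  # column 1 and column 3 (0-based indexes)

# Storage positions of the wanted values in each layout.
row_positions = [r * num_cols + c for r in range(num_rows) for c in wanted]
col_positions = [c * num_rows + r for c in wanted for r in range(num_rows)]

row_jumps = count_jumps(row_positions)
col_jumps = count_jumps(col_positions)

# Both layouts hold the same values; only the access pattern differs.
row_values = sorted(row_layout[p] for p in row_positions)
col_values = sorted(col_layout[p] for p in col_positions)

print("row-oriented seeks:", row_jumps)
print("column-oriented seeks:", col_jumps)
```

In this toy model the row-oriented layout needs one seek per value, while the column-oriented layout needs one seek per requested column.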

OLAP vs OLTP:
- OLAP - Online Analytical Processing
  - Only a few columns are read.
  - Column-Oriented file format is used. (Faster to read)
- OLTP - Online Transactional Processing
  - Write -> insert, update, delete
  - Row-Oriented file format is used. (Faster to write)

- At the end of the day, we want to reduce cost and time and improve the performance of our queries. That is why we prefer a column-oriented file format, Parquet.

<pre>
File uploaded to /FileStore/tables/part_r_00000_1a9822ba_b8fb_4d8e_844a_ea30d0801b9e_gz.parquet
</pre>
The gz in the file name indicates gzip, a compression technique.

In [0]:
df = spark.read.parquet("/FileStore/tables/part_r_00000_1a9822ba_b8fb_4d8e_844a_ea30d0801b9e_gz.parquet")
df.show()

# We do not need to pass extra parameters to read Parquet files, because the file stores its schema and other metadata and Spark infers everything from it.
# Parquet is a binary file format, so it is not human-readable when opened as plain text.

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|    1|
|       United States|            Ireland|  264|
|       United States|              India|   69|
|               Egypt|      United States|   24|
|   Equatorial Guinea|      United States|    1|
|       United States|          Singapore|   25|
|       United States|            Grenada|   54|
|          Costa Rica|      United States|  477|
|             Senegal|      United States|   29|
|       United States|   Marshall Islands|   44|
|              Guyana|      United States|   17|
|       United States|       Sint Maarten|   53|
|               Malta|      United States|    1|
|             Bolivia|      United States|   46|
|            Anguilla|      United States|   21|
|Turks and Caicos ...|      United States|  136|
|       United States|        Afghanistan|    2|
|Saint Vincent and..

- Although Parquet is a columnar file format, it first divides the records into chunks/groups so that huge datasets can be read faster. In that sense, Parquet is a hybrid that uses both row-based and columnar storage.
- These chunks of records are called Row Groups.
  - A Row Group is a logical partition that divides our data into smaller chunks for better reads.
  - The default size of a Row Group is 128 MB in Spark (other writers may use a different default).
  - Each Row Group has its own metadata (data about data, such as the number of records and the minimum and maximum values of each column; the file-level metadata also records which writer created the file).
  - Each Row Group contains multiple Columns.
  - Each Column contains multiple Pages.
  - Each Page has metadata and Values.
  - Values are our actual data.
  - All of this metadata is what makes Parquet so efficient to read.
- Parquet is a structured file format.
- It is stored in Binary form.
- [Link](https://karthiksharma1227.medium.com/understanding-parquet-and-its-optimization-opportunities-c7265a360391) | [Link](https://data-mozart.com/parquet-file-format-everything-you-need-to-know/) | [Link](https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705)

- Commands to view and inspect the metadata locally:
<pre>
pip install parquet-tools
parquet-tools show (file_path)
parquet-tools inspect (file_path)
</pre>
- Python commands to read more metadata:
<pre>
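import pyarrow.parquet as pq

parquet_file = pq.ParquetFile(r'(file_path)')
parquet_file.metadata  # file-level metadata: schema, number of rows, number of row groups
parquet_file.metadata.row_group(0)  # metadata of the first row group
parquet_file.metadata.row_group(0).column(0)  # metadata of the first column chunk in that row group
parquet_file.metadata.row_group(0).column(0).statistics  # min, max and null count of that column chunk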
</pre>

- Parquet uses the following encodings to shrink the data:
  - First, Parquet applies Dictionary Encoding.
  - Then it applies RLE (Run Length Encoding) and bit-packing.
- Compression codecs (applied on top of the encoded pages):
  - Gzip
  - Snappy
  - LZO
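
The dictionary and run-length steps listed above can be illustrated with a small pure-Python sketch. It shows the idea only and does not reproduce Parquet's binary layout; the helper names (`dictionary_encode`, `run_length_encode`, `decode`) and the sample column are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indexes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

def run_length_encode(indexes):
    """Collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for i in indexes:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

def decode(dictionary, runs):
    """Reverse both steps to recover the original column."""
    return [dictionary[i] for i, n in runs for _ in range(n)]

# Made-up column with many repeated values, as is common in real data.
column = ["United States"] * 3 + ["Egypt"] * 2 + ["United States"]

dictionary, indexes = dictionary_encode(column)
runs = run_length_encode(indexes)

# Bit-packing idea: each index only needs enough bits to address the dictionary.
bit_width = max(1, (len(dictionary) - 1).bit_length())

print(dictionary)
print(indexes)
print(runs)
print("bits per index:", bit_width)
```

Long strings are stored once in the dictionary, and the column itself shrinks to a few small integers, which a codec such as Gzip or Snappy can then compress further.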

- Data Organization in Parquet:
  - File
    - Row Group (We have metadata at group level too)
      - Column
        - Pages
          - Metadata
            - min
            - max
            - count
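
To make the hierarchy above concrete, here is a minimal sketch that models it with dataclasses. The class names, the `build_file` helper and the tiny group and page sizes are assumptions for illustration; real Parquet sizes row groups and pages in bytes and stores this metadata in a binary form.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Page:
    values: List[int]  # the actual data
    min: int           # page-level metadata
    max: int
    count: int

@dataclass
class ColumnChunk:
    name: str
    pages: List[Page]

@dataclass
class RowGroup:
    num_rows: int                    # row-group-level metadata
    columns: Dict[str, ColumnChunk]

def make_page(values):
    """Build a page and compute its min/max/count metadata."""
    return Page(values=list(values), min=min(values), max=max(values), count=len(values))

def build_file(table, rows_per_group, values_per_page):
    """Split a dict of equal-length columns into row groups, column chunks and pages."""
    total = len(next(iter(table.values())))
    row_groups = []
    for start in range(0, total, rows_per_group):
        columns = {}
        for name, values in table.items():
            chunk = values[start:start + rows_per_group]
            pages = [make_page(chunk[i:i + values_per_page])
                     for i in range(0, len(chunk), values_per_page)]
            columns[name] = ColumnChunk(name=name, pages=pages)
        num_rows = min(rows_per_group, total - start)
        row_groups.append(RowGroup(num_rows=num_rows, columns=columns))
    return row_groups

# Made-up table with two columns and eight records.
table = {
    "age": [12, 15, 17, 21, 25, 30, 34, 40],
    "count": [1, 264, 69, 24, 1, 25, 54, 477],
}
row_groups = build_file(table, rows_per_group=4, values_per_page=2)

for number, group in enumerate(row_groups):
    for page in group.columns["age"].pages:
        print(number, page.values, page.min, page.max, page.count)
```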

- Optimization
<pre>
select * from table where age < 18
</pre>
- If we read this data from a Parquet file, we can quickly determine which Row Groups we need to read (each Row Group stores the min and max age in its metadata) without going through every individual record. This makes reading faster and reduces I/O operations, which saves cost, time and CPU utilization (improved performance).

Projection Pruning:
- This means that, from the very beginning, we should not select/read/scan columns that are not required.
- Parquet's hybrid model helps here: only the column chunks of the selected columns are read, which makes reading faster.

Predicate Pushdown:
- Predicate Pushdown prevents Row Groups whose metadata cannot match our query from being scanned.
- For the above example, Row Groups whose min age is greater than or equal to 18 are discarded without being scanned.
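
The skipping rule for `age < 18` can be sketched as follows. The row groups, the `min_age`/`max_age` keys and the `scan_age_less_than` helper are hypothetical stand-ins for the statistics a real reader would take from the file metadata.

```python
# Hypothetical row groups: each carries min/max statistics for the age column.
row_groups = [
    {"min_age": 5,  "max_age": 17, "ages": [5, 9, 17]},
    {"min_age": 16, "max_age": 40, "ages": [16, 22, 40]},
    {"min_age": 18, "max_age": 65, "ages": [18, 30, 65]},
    {"min_age": 30, "max_age": 80, "ages": [30, 55, 80]},
]

def scan_age_less_than(row_groups, limit):
    """Return matching ages and the number of row groups actually scanned."""
    scanned = 0
    matches = []
    for group in row_groups:
        # If the smallest age is already >= limit, no record can match:
        # skip the whole row group without reading its values.
        if group["min_age"] >= limit:
            continue
        scanned += 1
        matches.extend(age for age in group["ages"] if age < limit)
    return matches, scanned

matches, scanned = scan_age_less_than(row_groups, 18)
print("matches:", matches)
print("row groups scanned:", scanned, "of", len(row_groups))
```

Only the groups whose statistics overlap the predicate are read; the second group still has to be scanned record by record because its range straddles 18.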

Now for another example, consider the below query:
<pre>
select * from table where age = 18
</pre>
- We know that Parquet dictionary-encodes its columns and stores each dictionary alongside the column data in the Row Group.
- So, while going through the Row Groups, any Row Group whose dictionary for the age column does not contain the key 18 is discarded directly.
- This makes reading the data faster.
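
A minimal sketch of this dictionary check for `age = 18`, assuming every row group exposes the dictionary of its age column. The data and the `scan_age_equals` helper are made up for illustration.

```python
# Hypothetical row groups: the age column is stored as a dictionary plus indexes.
row_groups = [
    {"age_dictionary": [12, 15, 17], "age_indexes": [0, 1, 1, 2]},
    {"age_dictionary": [18, 21],     "age_indexes": [0, 0, 1]},
    {"age_dictionary": [30, 45],     "age_indexes": [1, 0]},
]

def scan_age_equals(row_groups, target):
    """Count matching records, skipping row groups whose dictionary lacks the target."""
    scanned = 0
    hits = 0
    for group in row_groups:
        dictionary = group["age_dictionary"]
        if target not in dictionary:
            continue  # the value cannot occur here, so skip the whole row group
        scanned += 1
        key = dictionary.index(target)
        hits += sum(1 for i in group["age_indexes"] if i == key)
    return hits, scanned

hits, scanned = scan_age_equals(row_groups, 18)
print("matching records:", hits)
print("row groups scanned:", scanned, "of", len(row_groups))
```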

---