### Potential Interview Questions:
1. What is the Parquet file format?
2. Why do we need Parquet?
3. How do you read a Parquet file?
4. What makes Parquet the default choice?
5. What encoding is applied to the data?
6. What compression techniques are used?
7. How do you optimize a Parquet file?
8. What are row groups, columns, and pages?
9. How do projection pruning and predicate pushdown work?

**What is the Parquet file format?**
- Parquet is a columnar storage file format that is optimized for use with big data processing frameworks.
- Parquet organizes data by columns rather than by rows. This columnar layout is well suited to analytics
  and data processing tasks where only specific columns need to be read and processed.


**What is row-based file format and column-based file format?**

Row-based and column-based file formats differ in how data is organized and stored within a file.
- In row-based storage, data is organized and stored in rows. Each row contains all the data for a single record or tuple.
  Examples: CSV, TSV, JSON, and Avro.
- In column-based storage, data is organized and stored in columns. Each column contains all the values for a specific attribute
  across all records. Examples: Parquet and ORC.
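
The difference between the two layouts can be sketched in plain Python. This is only an illustration with a few hand-written records (the variable names `row_layout` and `column_layout` are made up for this example); it does not reflect how any real file format lays out bytes.

```python
# Three illustrative flight records (hypothetical in-memory data, not a real file).
records = [
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Romania", "count": 1},
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "Ireland", "count": 264},
    {"DEST_COUNTRY_NAME": "Egypt", "ORIGIN_COUNTRY_NAME": "United States", "count": 24},
]

# Row-based layout: each entry holds every field of one record.
row_layout = [tuple(r.values()) for r in records]

# Column-based layout: each entry holds every value of one attribute.
column_layout = {name: [r[name] for r in records] for name in records[0]}

# Summing "count" in the row layout has to walk over every full record...
total_from_rows = sum(row[2] for row in row_layout)

# ...while the column layout only touches the single list it needs.
total_from_columns = sum(column_layout["count"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; what changes is how much unrelated data has to be passed over to compute them, which is why analytical queries tend to favor the column layout.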

<img src="https://drive.google.com/uc?id=1FrpDlPGea0HL0MMK2qXgFDrcGo1xG9Mj" alt="drawing" style="width:700px;"/>


> *Note: "Write once, read many" is a common principle in big data. Apache Parquet and Apache ORC
> (Optimized Row Columnar) are designed around it: they provide columnar storage and compression, which are well suited to analytical queries.*

|OLAP|OLTP|
|-----|-----|
|Online Analytical Processing|Online Transactional Processing|
|Reads: only a few columns|Writes: insert, update, delete|
|Column-based file format|Row-based file format|

[Click here to check detailed explanation of the mentioned topic](https://chat.openai.com/share/cb02bd87-ca82-4595-8f76-e78110c20a69)

In [None]:
# Read a Parquet file.
# Parquet files are self-describing: the schema and other metadata are stored in the file itself,
# so PySpark usually needs no additional read options.
df = spark.read.parquet('/FileStore/tables/part_r_00000_1a9822ba_b8fb_4d8e_844a_ea30d0801b9e_gz.parquet')
df.show(truncate=False)

+--------------------------------+-------------------+-----+
|DEST_COUNTRY_NAME               |ORIGIN_COUNTRY_NAME|count|
+--------------------------------+-------------------+-----+
|United States                   |Romania            |1    |
|United States                   |Ireland            |264  |
|United States                   |India              |69   |
|Egypt                           |United States      |24   |
|Equatorial Guinea               |United States      |1    |
|United States                   |Singapore          |25   |
|United States                   |Grenada            |54   |
|Costa Rica                      |United States      |477  |
|Senegal                         |United States      |29   |
|United States                   |Marshall Islands   |44   |
|Guyana                          |United States      |17   |
|United States                   |Sint Maarten       |53   |
|Malta                           |United States      |1    |
|Bolivia                


**Parquet is a column-based file format and is highly efficient for analytical workloads, but a purely columnar layout has a
limitation, which is discussed in the video (a screenshot of that portion is attached below).**

> Suppose you have a table with 100 columns, and each column is 10 GB in size. In a purely columnar layout, when you need to read only
> the 1st, 2nd, and last columns, skipping past the 3rd through 99th columns to reach the last one adds cost and time.

<img src="https://drive.google.com/uc?id=12aWtNhZ7xakiZaoSZj8IKlsjvvf18V1u" alt="drawing" style="width:700px;"/>

**To resolve this issue, Parquet uses a hybrid storage format that stores chunks of columns sequentially,
which gives high performance when selecting and filtering data.**

<img src="https://drive.google.com/uc?id=1ANDtigOKXiuea_-6i4XmbIVXHz1s3B9D" alt="drawing" style="width:700px;"/>

**Note**
- Parquet is a structured data file format.
- It is a binary format, so its contents cannot be read in an ordinary text editor.
- It is a columnar file format, but it uses a hybrid approach for better efficiency.
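
As a small illustration of the binary nature of the format: an unencrypted Parquet file begins and ends with the 4-byte marker `PAR1`. The sketch below checks for that marker using only the standard library. The helper name `looks_like_parquet` and the demo files are made up for this example, and this is a quick sanity check under that assumption, not a full validation of the file.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:  # too short to hold both markers
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo with two throwaway files: one that only imitates the markers, one plain CSV.
with tempfile.TemporaryDirectory() as tmp:
    fake_path = os.path.join(tmp, "fake.parquet")
    with open(fake_path, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)

    csv_path = os.path.join(tmp, "data.csv")
    with open(csv_path, "w") as f:
        f.write("a,b\n1,2\n")

    fake_result = looks_like_parquet(fake_path)
    csv_result = looks_like_parquet(csv_path)

print(fake_result, csv_result)
```

The first file passes only because it imitates the markers; reading actual data still requires a Parquet reader such as `spark.read.parquet`.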


*The figure below shows the internal structure of a Parquet file.*

<img src="https://drive.google.com/uc?id=1Ntsvj4Mt99CRw3ndRNFRaIqMCDIXX8HH" alt="drawing" style="width:550px;"/>

- The data in a Parquet file is broken into horizontal slices called RowGroups.
- Each RowGroup contains a single ColumnChunk for each column in the schema.
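
The RowGroup and ColumnChunk idea can be sketched in plain Python. This is a simplified model under stated assumptions: the helper `to_row_groups`, the sample columns, and the row group size of 3 are all made up for illustration, and real Parquet files additionally split each ColumnChunk into pages and store metadata in a footer.

```python
def to_row_groups(columns, row_group_size):
    """Split column-oriented data into row groups, each holding one chunk per column."""
    num_rows = len(next(iter(columns.values())))
    row_groups = []
    for start in range(0, num_rows, row_group_size):
        stop = start + row_group_size
        # One column chunk per column, covering only this slice of rows.
        chunks = {name: values[start:stop] for name, values in columns.items()}
        row_groups.append(chunks)
    return row_groups


# Hypothetical table with three columns and six rows.
columns = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": ["u", "v", "w", "x", "y", "z"],
    "C": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
}

row_groups = to_row_groups(columns, row_group_size=3)
num_chunks = sum(len(group) for group in row_groups)
print(len(row_groups), num_chunks)

# Reading only column "A" touches one chunk per row group and skips B and C.
column_a = [value for group in row_groups for value in group["A"]]
```

With three columns and two row groups, the sketch yields six column chunks, and a reader interested in a single column only needs the matching chunk from each row group.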

**For example, the following diagram illustrates a Parquet file with three columns “A”, “B” and “C” stored in two RowGroups, for a total of 6 ColumnChunks.**

<img src="https://drive.google.com/uc?id=1eLvt23RddsoyLfnkQxuBhx3WbFtVyeGK" alt="drawing" style="width:350px;"/>

*The picture below also visualizes how data is organized in a Parquet file.*

<img src="https://drive.google.com/uc?id=1-hrSNC-BQb5etzP7YWzwfJciufx35H3V" alt="drawing" style="width:800px;"/>


**Compression**

File compression is the act of taking a file and making it smaller. In Parquet, compression is performed column by column,
and the format is built to support flexible compression options and extensible encoding schemes per data type. For example,
different encodings can be used for integer and string data.

Parquet data can be compressed using these encoding methods:

- Dictionary encoding: this is enabled automatically and dynamically for data with a small number of unique values.
- Bit packing: integers are usually stored with a dedicated 32 or 64 bits each. Bit packing instead uses only as many bits
  as the values require, which allows more efficient storage of small integers.
- Run-length encoding (RLE): when the same value occurs multiple times in a row, the value is stored once along with the number
  of occurrences. Parquet implements a combined version of bit packing and RLE, in which the encoding switches based on which
  produces the best compression results.
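
The three encodings above can be sketched in a few lines of plain Python. This is a conceptual illustration only: the function names and the sample column are made up, and the byte layout produced here is not Parquet's actual on-disk format (the real RLE and bit-packing hybrid has its own headers and grouping rules).

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of unique values."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def bit_width(values):
    """Smallest number of bits that can hold every non-negative integer in values."""
    return max(values).bit_length() if values else 0


def bit_pack(values, width):
    """Pack non-negative integers into bytes using `width` bits per value."""
    packed = 0
    for position, value in enumerate(values):
        packed |= value << (position * width)
    num_bytes = (len(values) * width + 7) // 8
    return packed.to_bytes(num_bytes, "little")


# Hypothetical string column with few distinct values.
countries = ["United States", "United States", "United States", "Egypt", "Egypt", "Senegal"]

dictionary, indices = dictionary_encode(countries)
runs = run_length_encode(indices)
width = bit_width(indices)
packed = bit_pack(indices, width)

print(dictionary)
print(indices)
print(runs)
print(width, len(packed))
```

In this example the six strings become six small indices, which need only 2 bits each and fit into 2 bytes when packed, compared with 24 bytes if each index were stored as a 32-bit integer. The run-length view of the same indices is three (value, count) pairs.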

*The picture below visualizes the encodings used in the Parquet file format.*

<img src="https://drive.google.com/uc?id=1o73zy_x_ALacL6mcP6OtTSmORx4SOky5" alt="drawing" style="width:750px;"/>

**Some screenshots from the video:**

<img src="https://drive.google.com/uc?id=1gupCvFH1IAVbnpvbckOVhaFflLteRQwi" alt="drawing" style="width:750px;"/>

<img src="https://drive.google.com/uc?id=1ThWpOD317T3NO4lotetYrwVMQnvZxZ-D" alt="drawing" style="width:750px;"/>

<img src="https://drive.google.com/uc?id=1Mai4qWJhUh63jmphe_fuPWD9bsdFmfe6" alt="drawing" style="width:750px;"/>
