# **Parquet**

## **Introduction**

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient compression and encoding schemes that improve performance when handling complex data in bulk. Parquet libraries are available in multiple languages, including Java, C++, and Python.

### **What Does Column-Oriented Mean?**

It means that data is stored column by column instead of row by row, so all values of one field sit next to each other on disk.

***Example:***

ID     → [1, 2, 3]

Name   → [John, Anna, Peter]

Age    → [30, 28, 35]

Salary → [50000, 60000, 70000]
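To make the layout concrete, here is a minimal sketch in plain Python that pivots row-style records into the column-style layout shown above. The helper `to_columns` and the sample records are hypothetical and only illustrate the idea; Parquet's real on-disk layout is more involved.

```python
# Hypothetical sample records, matching the example above
rows = [
    {"ID": 1, "Name": "John", "Age": 30, "Salary": 50000},
    {"ID": 2, "Name": "Anna", "Age": 28, "Salary": 60000},
    {"ID": 3, "Name": "Peter", "Age": 35, "Salary": 70000},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)
print(columns["Salary"])  # [50000, 60000, 70000]

# An aggregation touches a single column list instead of every full row
average_salary = sum(columns["Salary"]) / len(columns["Salary"])
print(average_salary)  # 60000.0
```

Because the average only needs `columns["Salary"]`, the other fields are never visited, which is the property that makes columnar formats a good fit for analytical queries.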

### **Characteristics:**


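*   Excellent for analytical workloads (OLAP)
*   Fast aggregations and filters, since only the referenced columns need to be read
*   High compression efficiency, since values in a column share a type and often repeat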
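One reason columns compress well is that similar values sit next to each other. The sketch below shows simplified versions of two techniques Parquet is known to use, dictionary encoding and run-length encoding. The function names and the sample `fuel_type` column are assumptions made for illustration, and this is not Parquet's actual implementation.

```python
def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


# Hypothetical column with many repeated values
fuel_type = ["premium", "premium", "premium", "regular", "regular", "premium"]

dictionary, codes = dictionary_encode(fuel_type)
runs = run_length_encode(codes)

print(dictionary)  # ['premium', 'regular']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [[0, 3], [1, 2], [0, 1]]
```

Six strings shrink to a two-entry dictionary plus three short runs. In a row-oriented file the same values would be interleaved with other fields, so runs like these would rarely form.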

### **What Does “Efficient Data Storage and Retrieval” Mean?**

It means storing data in as little space as possible and retrieving only the required data as quickly as possible.

* **Efficient Storage** → Less disk, less memory, less cost.

* **Efficient Retrieval** → Faster reading, lower latency, higher performance.


***Storage Comparison***

| Format  | File Size  |
| ------- | ---------- |
| CSV     | 1 GB       |
| JSON    | 1.2 GB     |
| Excel   | 900 MB     |
| Parquet | 150–300 MB |


### **Who Supports Parquet?**

* Apache Spark
* Apache Hive
* Apache Flink
* Presto







In [2]:
# Export a CSV file to Parquet using pandas

import pandas as pd

# Path to the source CSV file
csv_file_path = r"/content/car_data.csv"

# Read the CSV file into a DataFrame
dataframe = pd.read_csv(csv_file_path)

# Write the DataFrame to Parquet using the pyarrow engine and Snappy compression
dataframe.to_parquet(r"/content/car_data.parquet", engine="pyarrow", compression="snappy")

print("CSV file converted to Parquet file")






CSV file converted to Parquet file


In [3]:
# Verify the Parquet file by reading it back

parquet_file_validation = pd.read_parquet(r"/content/car_data.parquet")

parquet_file_validation.head()

| Unnamed: 0 | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


In [4]:
# Check the file size reduction

import os

# Sizes are reported in units of 1024**2 bytes
csv_size = os.path.getsize(r"/content/car_data.csv") / (1024**2)
parquet_size = os.path.getsize(r"/content/car_data.parquet") / (1024**2)

print(f"CSV size     : {csv_size:.2f} MB")
print(f"Parquet size : {parquet_size:.2f} MB")


CSV size     : 1.41 MB
Parquet size : 0.12 MB
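The two sizes printed above can be turned into a percentage reduction and a compression ratio. This is a small sketch of that arithmetic; the helper name `size_reduction` is hypothetical, and the inputs are the rounded values from the output above.

```python
def size_reduction(original_mb, converted_mb):
    """Return (percent saved, compression ratio) for two file sizes."""
    percent_saved = (1 - converted_mb / original_mb) * 100
    ratio = original_mb / converted_mb
    return percent_saved, ratio


# Rounded sizes taken from the output above
percent_saved, ratio = size_reduction(1.41, 0.12)

print(f"Reduction: {percent_saved:.1f}%")  # Reduction: 91.5%
print(f"Ratio    : {ratio:.2f}x")          # Ratio    : 11.75x
```

For this dataset the Parquet file is roughly one twelfth the size of the CSV. Other datasets will give different numbers depending on their column types and how often values repeat.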


## **Advantages of Using the Parquet File Format**

| Benefit     | Result                     |
| ----------- | -------------------------- |
| Storage     | 70–90% reduction           |
| Speed       | 10–30x faster reads        |
| ML training | Faster dataset loading     |
| Cloud cost  | Lower storage + query cost |


## **Conclusion**

Apache Parquet is a highly efficient, column-oriented file format designed to optimize both data storage and retrieval, making it well suited to large-scale analytics and data science workflows. By storing data column-wise, applying compression and encoding per column, and enabling selective column reading, Parquet significantly reduces file size and improves query performance compared to formats like CSV and Excel. Converting CSV files into Parquet using Python not only saves storage space but also accelerates data processing, analytics, and machine learning pipelines. As a result, Parquet has become a widely adopted standard for big data processing, cloud data lakes, and high-performance analytical systems.