
# Handling Parquet Files with Python
This notebook shows how to read and write Parquet files in Python with pandas. It covers installation and compares Excel and Parquet files in terms of load time and size.

## Install Necessary Libraries

Install pandas together with a Parquet engine (`pyarrow` or `fastparquet`) and `openpyxl`, which pandas uses to read and write `.xlsx` files:

```bash
pip install pandas pyarrow fastparquet openpyxl
```
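
To confirm that the packages are importable before running the rest of the notebook, a quick check along these lines can help. This is a minimal sketch using only the standard library; the names `required` and `status` are illustrative. Note that pandas needs only one of `pyarrow` or `fastparquet` to handle Parquet, so a single missing engine is not necessarily a problem.

```python
from importlib.util import find_spec

# Top-level module names for the packages in the pip command above
required = ['pandas', 'pyarrow', 'fastparquet', 'openpyxl']

# find_spec returns None when a top-level module cannot be found
status = {name: find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```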


In [None]:

import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(42)
data = {
    'id': np.arange(1, 1001),
    'name': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eva'], 1000),
    'surname': np.random.choice(['Smith', 'Johnson', 'Williams', 'Jones', 'Brown'], 1000),
    'email': [f'user{i}@example.com' for i in range(1, 1001)],
    'address': np.random.choice(['123 Main St', '456 Oak St', '789 Pine St', '101 Maple St'], 1000)
}

# Create DataFrame
df = pd.DataFrame(data)

# Save to Excel and Parquet files
df.to_excel('users_info.xlsx', index=False)
df.to_parquet('users_info.parquet', index=False)

df.head()


In [None]:

import time

# Read and time Excel
# (time.perf_counter is a monotonic, high-resolution clock intended for timing)
start_time = time.perf_counter()
df_excel = pd.read_excel('users_info.xlsx')
excel_time = time.perf_counter() - start_time
print(f"Time to load Excel file: {excel_time:.4f} seconds")

# Read and time Parquet
start_time = time.perf_counter()
df_parquet = pd.read_parquet('users_info.parquet')
parquet_time = time.perf_counter() - start_time
print(f"Time to load Parquet file: {parquet_time:.4f} seconds")

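# Compare on-disk file sizes
import os
print(f"Excel file size: {os.path.getsize('users_info.xlsx') / 1024:.2f} KB")
print(f"Parquet file size: {os.path.getsize('users_info.parquet') / 1024:.2f} KB")

# Compare in-memory sizes of the loaded DataFrames (not the files on disk)
print(f"DataFrame from Excel: {df_excel.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")
print(f"DataFrame from Parquet: {df_parquet.memory_usage(deep=True).sum() / (1024 * 1024):.2f} MB")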
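
A single timed read can be skewed by disk caching or other activity on the machine, so repeating the measurement and keeping the fastest run tends to give a steadier comparison. The helper below is a minimal sketch of that idea; `best_of` is a hypothetical name, not part of pandas, and the workload shown is a stand-in so the block runs without the data files.

```python
import time


def best_of(func, *args, repeat=5, **kwargs):
    """Return the fastest wall-clock time, in seconds, over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; any callable and its arguments can be passed instead
elapsed = best_of(sum, range(100_000), repeat=3)
print(f"Best of 3: {elapsed:.6f} seconds")
```

In this notebook the same helper could be called as `best_of(pd.read_excel, 'users_info.xlsx')` and `best_of(pd.read_parquet, 'users_info.parquet')`.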
