# **Advanced Python:** Reduce Size of Datasets
**Name:** Arsalan Ali<br>
**Email:** arslanchaos@gmail.com

Large datasets can take up a lot of memory.<br>
We can shrink a dataset's footprint in several ways.

### **Table of Contents**
* df_shrink
* dtype_diet
* parquet

### 1. dtype_diet
It converts columns to smaller datatypes, where their values allow it, so the DataFrame takes less memory.

In [6]:
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes

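df = pd.read_csv("prices.csv")

# Build a per-column report of proposed smaller dtypes, then apply it
settings = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, settings)

# deep=True also counts the contents of object (string) columns
print(f'Before: {df.memory_usage(deep=True).sum()/1024**2:.2f} MB')
print(f'After: {new_df.memory_usage(deep=True).sum()/1024**2:.2f} MB')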


Before: 2272.44 MB
After: 695.16 MB
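
To make the idea of "smaller datatypes" concrete, here is a minimal hand-written sketch using only pandas and NumPy. It is not how dtype_diet is implemented; the `downcast` helper, the synthetic columns, and the 50% cardinality threshold are assumptions made for illustration. Note that downcasting floats to `float32` can lose precision.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset (hypothetical columns)
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "quantity": rng.integers(0, 100, size=n_rows),                # 64-bit ints
    "price": rng.random(n_rows) * 100,                            # float64
    "store": rng.choice(["north", "south", "east"], size=n_rows),  # strings
})

def downcast(frame):
    """Hypothetical helper: shrink each column to a smaller dtype."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer type that still holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 where pandas considers the values close enough
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].nunique() < 0.5 * len(out):
            # Few distinct values: store them once as categories
            out[col] = out[col].astype("category")
    return out

small = downcast(df)

print(small.dtypes)
print(f"Before: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"After: {small.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
```

The integer and category conversions are lossless here, which is why they are usually the safest place to start.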


### 2. Parquet Format
Parquet is a compressed, columnar file format, so it reduces the dataset's size on disk. Once the file is loaded into a DataFrame, however, its size in memory does not change.

In [None]:
# Just a Decorator to Calculate Time and Memory

import functools
import time
import psutil

def time_and_memory_profiler(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        # Resident set size (RSS) of this process before the call, in bytes
        memory_before = psutil.Process().memory_info().rss

        result = func(*args, **kwargs)

        print(f"Time taken: {time.perf_counter() - start_time:.6f} Seconds")
        memory_used = psutil.Process().memory_info().rss - memory_before
        print(f"Memory usage: {memory_used / 1024**2:.2f} Megabytes")

        return result

    return wrapper
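
The RSS difference used above is measured for the whole process, so it can be noisy when the allocator reuses memory that was freed earlier. As a hedged alternative (a different technique, not the decorator above), the sketch below uses the standard library's `tracemalloc` to report the peak memory allocated while the function runs. The name `peak_memory_profiler` and the `build_list` demo function are hypothetical, and memory allocated by native libraries outside Python's allocator may not be counted.

```python
import functools
import time
import tracemalloc

def peak_memory_profiler(func):
    """Hypothetical variant: report wall time and peak traced allocations."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start_time = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start_time
            # get_traced_memory() returns (current, peak) in bytes
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        print(f"Time taken: {elapsed:.6f} Seconds")
        print(f"Peak memory: {peak / 1024**2:.2f} Megabytes")
        return result
    return wrapper

@peak_memory_profiler
def build_list(n):
    # Demo workload: allocate a list of n integers
    return list(range(n))

numbers = build_list(1_000_000)
```

Tracing slows the profiled call down, so the timing from this variant is best read as an upper bound.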

In [None]:
# Parquet

@time_and_memory_profiler
def load_parquet(file):
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(file)
    return df

load_parquet("prices.parquet")