# <a id='toc1_'></a>[Large DataSets Scaling (LDS) - Using Efficient Datatypes](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Large DataSets Scaling (LDS) - Using Efficient Datatypes](#toc1_)    
    - [Function to build a dataset](#toc1_1_1_)    
    - [Build a DataSet](#toc1_1_2_)    
  - [Load a DataSet to improve datatypes](#toc1_2_)    
  - [Inspect the data types and memory usage to see where we should focus our attention.](#toc1_3_)    
    - [Make a copy of ts](#toc1_3_1_)    
  - [Optimization](#toc1_4_)    
      - [See % of optimization in each case ts2/ts](#toc1_4_1_1_)    
      - [See % of optimization in total ts2/ts](#toc1_4_1_2_)    


In [76]:
import pandas as pd
import numpy as np

### <a id='toc1_1_1_'></a>[Function to build a dataset](#toc0_)

In [77]:
# Build a synthetic time-series DataFrame to experiment with memory usage
def make_timeseries(start="2024-01-01", end="2024-12-31", freq="1D", seed=None):
    """Return a DataFrame of random name, id, x and y columns on a timestamp index."""
    # Build an index
    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(index)
    state = np.random.RandomState(seed)
    columns = {
        "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
        "id": state.poisson(1000, size=n),
        "x": state.rand(n) * 2 - 1,
        "y": state.rand(n) * 2 - 1,
    }
    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
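    # Drop the final row when it lands exactly on `end`. This only triggers when
    # `end` is a Timestamp: a plain string never compares equal to a Timestamp,
    # so with string arguments the last row is kept.
    if df.index[-1] == end:
        df = df.iloc[:-1]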
    return df

### <a id='toc1_1_2_'></a>[Build a DataSet](#toc0_)

In [78]:
ts = make_timeseries(freq="30s", seed=0)
ts.to_parquet("timeseries.parquet")
ts.head(2)

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2024-01-01 00:00:00 | 1041 | Alice | 0.889987 | 0.281011 |
| 2024-01-01 00:00:30 | 988 | Bob | -0.455299 | 0.488153 |


## <a id='toc1_2_'></a>[Load a DataSet to improve datatypes](#toc0_)

In [79]:
ts = pd.read_parquet("timeseries.parquet")
ts.head(2)

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2024-01-01 00:00:00 | 1041 | Alice | 0.889987 | 0.281011 |
| 2024-01-01 00:00:30 | 988 | Bob | -0.455299 | 0.488153 |


## <a id='toc1_3_'></a>[Inspect the data types and memory usage to see where we should focus our attention.](#toc0_)

In [80]:
ts.dtypes

id        int32
name     object
x       float64
y       float64
dtype: object

In [81]:
ts.memory_usage(deep=True)  # memory usage in bytes

Index     8409608
id        4204804
name     56766826
x         8409608
y         8409608
dtype: int64
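
The `deep=True` argument matters mostly for `object` columns such as **name**: without it, pandas counts only the pointer stored per row, not the Python strings those pointers refer to. The sketch below illustrates the difference on a small, made-up `names` series (an assumption for illustration, not the dataset above); exact byte counts depend on the Python and pandas versions.

```python
import pandas as pd

# Hypothetical sample column; dtype=object mirrors the "name" column above.
names = pd.Series(["Alice", "Bob", "Charlie"] * 1000, dtype=object)

# Shallow: only the array of object pointers is counted.
shallow = names.memory_usage(index=False)
# Deep: the size of every Python string is added on top.
deep = names.memory_usage(index=False, deep=True)

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
print(f"average per row (deep): {deep / len(names):.1f} bytes")
```

In the output above, **name** works out to roughly 54 bytes per row (56,766,826 bytes over 1,051,201 rows), compared with 8 bytes per row for each float column.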

### <a id='toc1_3_1_'></a>[Make a copy of ts](#toc0_)
We keep the original so that we can compare memory usage before and after the optimization.

In [82]:
ts2 = ts.copy()

The **name** column takes up far more memory than any other column (about 56.8 MB of the 86.2 MB total). It has only three unique values, so it is a good candidate for conversion to a `pandas.Categorical`. With a `pandas.Categorical`, each unique name is stored once and space-efficient integer codes record which name appears in each row.

In [83]:
ts2["name"] = ts["name"].astype("category")
ts2.memory_usage(deep=True)  # memory usage in bytes

Index    8409608
id       4204804
name     1051471
x        8409608
y        8409608
dtype: int64
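
To see what the categorical conversion stores, the following minimal sketch inspects the categories and integer codes of a small, made-up series (the `names` data here is an illustrative assumption, not the dataset above).

```python
import pandas as pd

# Hypothetical sample with the same three labels as the "name" column.
names = pd.Series(["Alice", "Bob", "Charlie", "Alice", "Bob"] * 1000)
as_cat = names.astype("category")

# Each unique label is stored once...
categories = as_cat.cat.categories.tolist()
# ...and every row holds a small integer code pointing at its label.
first_codes = as_cat.cat.codes.head().tolist()
code_dtype = as_cat.cat.codes.dtype

object_bytes = names.memory_usage(deep=True)
category_bytes = as_cat.memory_usage(deep=True)

print(categories, first_codes, code_dtype)
print(f"before: {object_bytes} bytes, after: {category_bytes} bytes")
```

With so few categories, pandas can use one-byte codes, which is why the **name** column above shrinks to about one byte per row plus a small fixed cost for the labels.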

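We can go a bit further and downcast the integer **id** column to the smallest type that can hold its values using **pandas.to_numeric()**. The values are non-negative and well below 65,536, so the column shrinks from int32 to uint16, halving its memory.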

In [84]:
ts2["id"] = pd.to_numeric(ts["id"], downcast="unsigned")
ts2.memory_usage(deep=True)  # memory usage in bytes

Index    8409608
id       2102402
name     1051471
x        8409608
y        8409608
dtype: int64
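
How far `pd.to_numeric()` can downcast depends on the range of the values. The sketch below uses three small, made-up series (assumptions for illustration) to show which dtype is chosen in each case, including what happens when `downcast="unsigned"` meets negative values.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs, all starting as int64.
small = pd.Series([1, 2, 200], dtype="int64")
medium = pd.Series([1, 2, 60_000], dtype="int64")
signed = pd.Series([-5, 2, 200], dtype="int64")

small_dtype = pd.to_numeric(small, downcast="unsigned").dtype    # fits in uint8
medium_dtype = pd.to_numeric(medium, downcast="unsigned").dtype  # needs uint16
# Negative values cannot be unsigned, so the series is left unchanged.
unsigned_on_negative = pd.to_numeric(signed, downcast="unsigned").dtype
# downcast="integer" picks the smallest signed type that fits instead.
signed_dtype = pd.to_numeric(signed, downcast="integer").dtype

print(small_dtype, medium_dtype, unsigned_on_negative, signed_dtype)
# The upper bounds that decide between uint8 and uint16:
print(np.iinfo(np.uint8).max, np.iinfo(np.uint16).max)
```

The float columns **x** and **y** are left as float64 here; `downcast="float"` would convert them to float32 at the cost of precision.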

## <a id='toc1_4_'></a>[Optimization](#toc0_)

#### <a id='toc1_4_1_1_'></a>[See % of optimization in each case ts2/ts](#toc0_)

In [85]:
ts2.memory_usage(deep=True)/ts.memory_usage(deep=True)

Index    1.000000
id       0.500000
name     0.018523
x        1.000000
y        1.000000
dtype: float64
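
The values above are ratios (optimized size divided by original size), not percentages. As a small worked example, the sketch below converts the byte counts reported earlier into the percentage saved per column; the variable names `before`, `after` and `saved_pct` are illustrative.

```python
import pandas as pd

# Byte counts copied from the memory_usage(deep=True) outputs above.
before = pd.Series(
    {"Index": 8409608, "id": 4204804, "name": 56766826, "x": 8409608, "y": 8409608}
)
after = pd.Series(
    {"Index": 8409608, "id": 2102402, "name": 1051471, "x": 8409608, "y": 8409608}
)

# Percentage of memory saved per column: (1 - after / before) * 100.
saved_pct = ((1 - after / before) * 100).round(2)
print(saved_pct)
```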

#### <a id='toc1_4_1_2_'></a>[See % of optimization in total ts2/ts](#toc0_)

In [86]:
ts2.memory_usage(deep=True).sum()/ts.memory_usage(deep=True).sum()

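0.3292638922760198

In other words, `ts2` needs about 33% of the memory of `ts`, a saving of roughly 67%, almost all of which comes from the **name** column.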
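
The two manual steps used in this notebook can be wrapped in a reusable function. The following is only a sketch under stated assumptions: `optimize_dtypes` and its `max_unique_ratio` threshold are hypothetical names introduced here, not part of pandas, and the demo frame is a small synthetic stand-in for `ts`.

```python
import numpy as np
import pandas as pd


def optimize_dtypes(df, max_unique_ratio=0.5):
    """Return a copy of df with cheaper dtypes (hypothetical helper).

    Integer columns are downcast; string columns with few distinct
    values become categoricals. Float columns are left untouched.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Unsigned types are only valid when no value is negative.
            kind = "unsigned" if series.min() >= 0 else "integer"
            out[col] = pd.to_numeric(series, downcast=kind)
        elif series.dtype == object or pd.api.types.is_string_dtype(series):
            # Categoricals only pay off when labels repeat often.
            if series.nunique() / max(len(series), 1) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Small synthetic frame shaped like the one built by make_timeseries.
rng = np.random.RandomState(0)
n = 10_000
demo = pd.DataFrame(
    {
        "name": rng.choice(["Alice", "Bob", "Charlie"], size=n),
        "id": rng.poisson(1000, size=n),
        "x": rng.rand(n) * 2 - 1,
    }
)

small = optimize_dtypes(demo)
ratio = small.memory_usage(deep=True).sum() / demo.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"optimized / original = {ratio:.3f}")
```

The threshold guards against converting columns whose values are mostly unique, where a categorical would store nearly every label anyway and save little or nothing.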