# Based on: [Scaling to large datasets](https://pandas.pydata.org/docs/user_guide/scale.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Load Less Data

Some *pandas* "readers" allow you to specify the columns to load:

```python
pandas.read_csv("...", usecols=[...])
pandas.read_excel("...", usecols=[...])
pandas.read_feather("...", columns=[...])
pandas.read_hdf("...", columns=[...])
pandas.read_parquet("...", columns=[...])
pandas.read_sql("...", columns=[...])
pandas.read_spss("...", usecols=[...])
pandas.read_stata("...", columns=[...])
pandas.read_table("...", usecols=[...])
```

In [2]:
# Avoid loading petal_length and petal_width
iris_data = pd.read_csv("../data/iris.csv", usecols=["sepal_length", "sepal_width", "species"])
iris_data.head()

,sepal_length,sepal_width,species
0,5.1,3.5,setosa
1,4.9,3.0,setosa
2,4.7,3.2,setosa
3,4.6,3.1,setosa
4,5.0,3.6,setosa
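
To make the saving visible, the following sketch loads the same CSV twice, once in full and once with `usecols`, and compares the memory footprint. The inline `csv_text` is a hypothetical stand-in for a real file and is not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for a wide CSV file: five columns, three rows.
csv_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

# Load every column, then only the three columns we actually need.
full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["sepal_length", "sepal_width", "species"],
)

full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()

print(list(subset.columns))
print(f"full: {full_bytes} bytes, subset: {subset_bytes} bytes")
```

The skipped columns are never materialized in the resulting DataFrame, so the saving grows with the number and width of the columns left out.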


# 2. Use Efficient Data Types


## 2.1 CategoricalDtype

Text data columns with few unique values (low cardinality) can be stored using the [CategoricalDtype][1].

A categorical column stores each unique value only once and represents every row as a small integer code, which takes far less space than repeating the full string.

[1]: https://pandas.pydata.org/docs/reference/api/pandas.CategoricalDtype.html

In [3]:
text = pd.Series(list("ABCD") * 100000)
text.dtype, text.memory_usage(deep=True)

(dtype('O'), 23200128)

In [4]:
# Use .cat.as_ordered() if order is significant
categorical = text.astype("category").cat.as_ordered()
categorical.dtype, categorical.memory_usage(deep=True)

(CategoricalDtype(categories=['A', 'B', 'C', 'D'], ordered=True), 400532)
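
To see where the saving comes from, this small sketch inspects the two parts of a categorical column: the list of unique values and the per-row integer codes. The short series is illustrative only.

```python
import pandas as pd

# A short low-cardinality text column (illustrative data).
text = pd.Series(list("ABCD") * 3)
categorical = text.astype("category")

# The unique values are stored once ...
categories = list(categorical.cat.categories)
# ... and each row holds only a small integer pointing into that list.
codes = categorical.cat.codes

print(categories)
print(codes.dtype)
print(codes.tolist())
```

With only four categories the codes fit in `int8`, so each row costs one byte instead of a full Python string object.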

## 2.2 Downcast Numeric Dtypes

The `downcast` argument of the [pandas.to_numeric][2] function shrinks a numeric column to the smallest dtype that can still hold all of its values.

[2]: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

In [5]:
numeric = pd.Series(range(100000))
numeric.dtype, numeric.memory_usage(deep=True)

(dtype('int64'), 800128)

In [6]:
numeric = pd.to_numeric(numeric, downcast="integer")
numeric.dtype, numeric.memory_usage(deep=True)

(dtype('int32'), 400128)
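
The same idea can be applied to every numeric column of a DataFrame. The helper below, `downcast_numeric`, is a hypothetical name introduced for this sketch and is not a pandas function; the sample data is also made up. Note that downcasting floats to `float32` reduces precision, so it is only appropriate when that loss is acceptable.

```python
import numpy as np
import pandas as pd

# Illustrative data: every column starts out as a 64-bit type.
df = pd.DataFrame({
    "small_int": np.arange(1000, dtype="int64") % 100,     # values 0..99
    "big_int": np.arange(1000, dtype="int64") * 100_000,   # up to 99,900,000
    "halves": np.arange(1000, dtype="float64") * 0.5,      # exact in float32
})


def downcast_numeric(frame):
    """Return a copy with each numeric column downcast where possible."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


compact = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()

print(compact.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

Each column ends up with its own dtype, so a column with a small value range shrinks further than one that needs 32 bits.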

# 3. Use Chunking

Chunking allows you to break up large datasets into manageable portions that can comfortably fit in memory.

> Chunking works best when the operation you're performing requires little or no coordination between chunks, such as a sum or a row count.

Some "readers" accept a `chunksize` parameter and then return an iterator of chunks instead of a single object:

```python
pandas.read_csv("...", chunksize=...)
pandas.read_hdf("...", chunksize=...)
pandas.read_json("...", chunksize=...)
pandas.read_sas("...", chunksize=...)
pandas.read_sql("...", chunksize=...)
pandas.read_stata("...", chunksize=...)
```

In [7]:
pd.Series(range(1000000)).to_csv("../data/data.csv", index=False)

for chunk in pd.read_csv("../data/data.csv", chunksize=200000):
    # Use 2-D positional indexing; chained chunk.iloc[0][0] relies on a deprecated positional fallback
    print(chunk.shape, chunk.iloc[0, 0], chunk.iloc[-1, 0])

(200000, 1) 0 199999
(200000, 1) 200000 399999
(200000, 1) 400000 599999
(200000, 1) 600000 799999
(200000, 1) 800000 999999
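
As an example of an operation that needs almost no coordination between chunks, the sketch below computes a mean by accumulating a running sum and row count. The in-memory CSV and the column name `value` are assumptions made for illustration; a real workload would read from a file on disk.

```python
import io

import pandas as pd

# In-memory stand-in for a large CSV file (illustrative data).
csv_text = pd.DataFrame({"value": range(1_000)}).to_csv(index=False)

total = 0
row_count = 0

# Only one chunk of 250 rows is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total += chunk["value"].sum()
    row_count += len(chunk)

mean = total / row_count
print(total, row_count, mean)
```

Only the two running totals outlive each iteration, so peak memory is bounded by the chunk size rather than by the size of the file.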


# 4. Use Other Libraries

Because of its popularity, `pandas`’ API has become something of a standard that other libraries implement.

Common packages for working with large datasets:

- [Dask](https://docs.dask.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/)