Even when we have 1TB of Disk Storage, 8GB/16GB of RAM still pandas and much other data loading API struggles to load a 2GB file.

This is because when a process requests for memory, memory is allocated in two ways:

Contiguous Memory Allocation (consecutive blocks are assigned)
Non Contiguous Memory Allocation(separate blocks at different locations)
Pandas use Contiguous Memory to load data into RAM because read and write operations are must faster on RAM than Disk(or SSDs).

Reading from SSDs: ~16,000 nanoseconds
Reading from RAM: ~100 nanoseconds

Before going into multiprocessing & GPUs, etc… let us see how to use pd.read_csv() effectively.

## 1. use cols:
    Rather than loading data and removing unnecessary columns that aren’t useful when processing your data.
    load only the useful columns.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('C:/Users/Pranay/Downloads/penguins2.csv')
df

Unnamed: 0,Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
0,Torgersen,39.1,18.7,181.0,3750.0,0
1,Torgersen,39.5,17.4,186.0,3800.0,0
2,Torgersen,40.3,18.0,195.0,3250.0,0
3,Torgersen,0.0,0.0,0.0,0.0,0
4,Torgersen,36.7,19.3,193.0,3450.0,0
...,...,...,...,...,...,...
339,Dream,55.8,19.8,207.0,4000.0,2
340,Dream,43.5,18.1,202.0,3400.0,2
341,Dream,49.6,18.2,193.0,3775.0,2
342,Dream,50.8,19.0,210.0,4100.0,2


In [3]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Columns: 6 entries, Island to Species
dtypes: float64(4), int64(1), object(1)
memory usage: 34.8 KB


In [4]:
len(df.columns)

6

In [5]:
req_cols = ['Island','BodyMass','Species']
len(req_cols)

3

In [6]:
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", usecols=req_cols)


In [7]:
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Columns: 3 entries, Island to Species
dtypes: float64(1), int64(1), object(1)
memory usage: 26.7 KB


## 2. Using correct dtypes for numerical data:
Every column has it’s own dtype in a pandas DataFrame, for example, integers have int64, int32, int16 etc…

int8 can store integers from -128 to 127.

int16 can store integers from -32768 to 32767.

int64 can store integers from -9223372036854775808 to .9223372036854775807.

Pandas assign int64 to integer datatype by default, therefore by defining correct dtypes we can reduce memory usage significantly.



In [8]:
df['BodyMass'].memory_usage(index=False, deep=True)

2752

In [9]:
df['BodyMass'].min()

0.0

In [10]:
df['BodyMass'].max()

6300.0

In [11]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv")

# Fill missing values (NaN) with a specific value or method
df_filled = df.fillna(value=0)  # Fill with 0

# Save the filled dataframe to a new CSV file
df_filled.to_csv("C:/Users/Pranay/Downloads/penguins2.csv", index=False)


In [12]:
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", dtype={"BodyMass": "int16"})

In [13]:
df['BodyMass'].memory_usage(index=False, deep=True)

688

In [14]:
(2752 - 688)/2752*100

75.0

## 3. Using correct dtypes for categorical columns:

In [15]:
df['BodyMass'].memory_usage(index=False, deep=True)

688

In [16]:
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", dtype={"BodyMass": "category"})

In [17]:
df['BodyMass'].memory_usage(index=False, deep=True)

8430

In [18]:
(8430-688)/8430*100

91.83867141162516

If your DataFrame contains lots of empty values or missing values or NANs you can reduce their memory footprint by converting them to Sparse Series.

## 4. nrows, skip rows
nrows The number of rows to read from the file.

In [19]:
import pandas as pd
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", nrows=1000)
len(df)
1000

1000

 skiprows Line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file.

In [20]:
# Can be either list or first N rows.
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", skiprows=[0,2,5]) 
# It might remove headings

## 5. Loading Data in Chunks:
Memory Issues in pandas read_csv() are there for a long time. So one of the best workarounds to load large datasets is in chunks.

In [21]:
len(df)

341

### Check the total length after loading chunk by chunk

In [22]:
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", chunksize=1000)

In [23]:
total_len = 0
for chunk in df:
    # Do some preprocessing to reduce the memory size of each chunk
    total_len += len(chunk)
print(total_len)

344


### concatnate each chunk one by one

In [24]:
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", iterator=True, chunksize=1000)  # gives TextFileReader
df = pd.concat(df, ignore_index=True)

In [25]:
len(df)

344

## 6. Multiprocessing using pandas:
As pandas don’t have njobs variable to make use of multiprocessing power. we can utilize multiprocessinglibrary to handle chunk size operations asynchronously on multi-threads which can reduce the run time by half.

In [26]:
import pandas as pd
import multiprocessing as mp

In [27]:
%%time
df = pd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", chunksize=1000)
total_length = 0
for chunk in df:
    total_length += len(chunk)
print(total_length)

344
Wall time: 13.8 ms


In [None]:
%%time
LARGE_FILE = "C:/Users/Pranay/Downloads/penguins2.csv"
CHUNKSIZE = 100  # processing 1000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_table(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get()  # no timeout specified

    print(f"There are {result} rows of data")


## 7. Dask Instead of Pandas:
It supports parallel computing and loads data faster than pandas

In [29]:
%%time 
import dask.dataframe as dd
data = dd.read_csv("C:/Users/Pranay/Downloads/penguins2.csv", dtype={'MachineHoursCurrentMeter': 'float64'},assume_missing=True)
data.compute()

Wall time: 2.21 s


Unnamed: 0,Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
0,Torgersen,39.1,18.7,181.0,3750.0,0.0
1,Torgersen,39.5,17.4,186.0,3800.0,0.0
2,Torgersen,40.3,18.0,195.0,3250.0,0.0
3,Torgersen,0.0,0.0,0.0,0.0,0.0
4,Torgersen,36.7,19.3,193.0,3450.0,0.0
...,...,...,...,...,...,...
339,Dream,55.8,19.8,207.0,4000.0,2.0
340,Dream,43.5,18.1,202.0,3400.0,2.0
341,Dream,49.6,18.2,193.0,3775.0,2.0
342,Dream,50.8,19.0,210.0,4100.0,2.0
