## Memory Errors When Working with Large Datasets
When you load a large dataset into a pandas DataFrame (for example, with the read_csv function), you are likely to run into a MemoryError. This error means that the process has run out of available RAM. pandas keeps the entire DataFrame in memory, so a dataset that is larger than the available memory cannot be loaded in one go. Operations performed on the DataFrame need additional memory as well, because many of them create intermediate copies.

In [1]:
import pandas as pd
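
If staying with pandas is a requirement, one common workaround is to read the file in chunks, so that only one chunk is held in memory at a time. The sketch below is a minimal illustration of that idea; the inline CSV text and the chunk size of 2 are made up for the example and stand in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file (assumption for this example).
csv_text = (
    "en,fr\n"
    "hello,bonjour\n"
    "cat,chat\n"
    "dog,chien\n"
    "thank you,merci\n"
    "yes,oui\n"
)

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Only this chunk (at most 2 rows) is in memory here.
    total_rows += len(chunk)

print(total_rows)
```

This works for simple aggregations, but every operation has to be written chunk by chunk by hand.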

## Dask library
One way to avoid a MemoryError is to switch to a different library, and this is where Dask comes in handy. Dask is a Python library for parallel computing that scales well-known Python libraries such as pandas, NumPy, and scikit-learn to large datasets.

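Dask splits the dataset into a number of partitions and evaluates work lazily, loading and processing partitions only when a result is requested. The partitions can be processed in parallel across the available CPU cores (or across the workers of a cluster), whereas pandas generally runs on a single core. This makes it possible to work on a larger-than-memory dataset and can also speed up the computations on that dataset.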
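
The partitioning idea can be sketched in plain Python. This is not how Dask is implemented; it is only an illustration of the split, apply, combine pattern, and the helper names (split, partition_sum, partitioned_sum) are hypothetical. Threads are used here for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks get one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def partition_sum(part):
    """Work done independently on a single partition."""
    return sum(part)


def partitioned_sum(rows, n_parts=4):
    """Sum each partition in a worker pool, then combine the partial results."""
    parts = split(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_sums = list(pool.map(partition_sum, parts))
    return sum(partial_sums)


data = list(range(1, 101))
result = partitioned_sum(data)
print(result)
```

Each worker only ever sees its own partition, which is why the full dataset never has to be held by a single task.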

## Dask DataFrame
A Dask DataFrame is a collection of smaller pandas DataFrames, split along the index.

In [8]:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

In [14]:
# Read a single CSV file (lazy: the data itself is not loaded yet)
df = dd.read_csv('en-fr.csv')

# Set up a LocalCluster with four workers
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Check the number of partitions
df.npartitions

Perhaps you already have a cluster running?
Hosting the HTTP server on port 57985 instead


131

In [11]:
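# Change the number of partitions. repartition returns a new DataFrame and
# leaves df unchanged; assign the result (df = df.repartition(...)) to keep it.
df.repartition(npartitions=4)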

Dask DataFrame Structure:
                   en      fr
npartitions=4
               object  object
                  ...     ...
                  ...     ...
                  ...     ...
                  ...     ...


In [11]:
# Save the Dask DataFrame to CSV files (one file per partition).
# The given path is treated as a directory that holds the part files.
df.to_csv('C:Users/Bildad Otieno/Documents/Practice/Eng-French/EF1.csv')

['C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\000.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\001.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\002.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\003.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\004.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\005.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\006.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\007.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\008.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\009.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\010.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-French\\EF1.csv\\011.part',
 'C:Users\\Bildad Otieno\\Documents\\Practice\\Eng-F

In [9]:
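# Save the Dask DataFrame to a single CSV file.
# compute() first collects every partition into one pandas DataFrame,
# so the whole dataset must fit in memory for this to work.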
df.compute().to_csv('C:Users/Bildad Otieno/Documents/Practice/Eng-French/EF2.csv')

### Checking Memory Usage of Each Column

In [12]:
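# memory_usage is lazy: it returns a Dask Series, not the numbers themselves.
# Call .compute() on the result to get the byte count of each column.
df.memory_usage(deep=True)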

Dask Series Structure:
npartitions=1
    int64
      ...
dtype: int64
Dask Name: series-groupby-sum-agg, 545 tasks

In [15]:

# Shut down the Dask cluster
client.close()
cluster.close()
