# Chapter 2

Before installing any library, create a separate virtual environment, for example with conda. The concept and purpose of virtual environments were discussed in detail in **Chapter 1**, *Getting Started with Time Series Analysis*, with multiple examples.

* To install Modin using `conda` (with a Dask backend), run the following:

```
conda install -c conda-forge modin-dask
```
* To install with `pip`, use the following:

```
pip install "modin[dask]"
```

You will measure time and memory usage first with pandas and then with Modin. To measure memory usage, you will need to install the `memory_profiler` library:

```
pip install memory_profiler
```

The `memory_profiler` library provides IPython and Jupyter magics such as `%memit` and `%mprun`, which are the memory counterparts of well-known timing magics such as `%timeit` and `%time`.
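Outside IPython or Jupyter these magics are not available. The following is a minimal sketch of a comparable measurement using only the standard library (`time.perf_counter` and `tracemalloc`). It is a stand-in technique, not what `%memit` does internally: `tracemalloc` tracks allocations made through Python's allocator rather than the process memory that `memory_profiler` reports, so the numbers are not directly comparable. The `measure` helper and the `build_squares` workload are hypothetical names used only for illustration.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds, peak_bytes).

    Hypothetical helper: peak_bytes is the peak size of Python-level
    allocations traced by tracemalloc during the call, not process RSS.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, elapsed, peak


def build_squares(n):
    """Illustrative workload standing in for a read-and-aggregate step."""
    return [i * i for i in range(n)]


result, elapsed, peak = measure(build_squares, 100_000)
print(f"items: {len(result)}, wall time: {elapsed:.4f} s, "
      f"peak traced memory: {peak / 2**20:.2f} MiB")
```

Passing a callable keeps the timing and tracing around exactly one call, which mirrors how `%memit` and `%time` wrap a single statement.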

## Using Modin
* This section continues from the recipe "Reading data from a SAS dataset".
* The recipe introduces concepts for working with large datasets.

> "the **Modin** library acts as a wrapper or, more specifically, an abstraction on top of Dask or Ray that
> uses a similar API to pandas. Modin makes optimizing your pandas' code much more
> straightforward without learning another framework, and all it takes is a single line
> of code." - Chapter 2, Page 62

## Comparing Performance: Memory & Time

In [None]:
#!conda install -c anaconda memory_profiler -y
#!pip install memory-profiler

In [1]:
import memory_profiler 
memory_profiler.__version__

'0.58.0'

In [2]:
%load_ext memory_profiler

In [3]:
path = '../../datasets/Ch2/large_file.csv'

In [4]:
import pandas as pd

In [5]:
%%time
%memit pd.read_csv(path).groupby('label_source').count()

peak memory: 219.71 MiB, increment: 125.79 MiB
CPU times: user 375 ms, sys: 116 ms, total: 491 ms
Wall time: 1.45 s


In [6]:
from modin.config import Engine
Engine.put("dask")  # Modin will use Dask
import modin.pandas as pd
from distributed import Client
client = Client()

In [7]:
%reload_ext memory_profiler

In [8]:
%%time
%memit pd.read_csv(path).groupby('label_source').count()

peak memory: 153.58 MiB, increment: 9.20 MiB
CPU times: user 1.02 s, sys: 253 ms, total: 1.28 s
Wall time: 2.46 s
