# Chapter 2 

Before installing any library, create a separate virtual environment, for example with conda. **Chapter 1**, *Getting Started with Time Series Analysis*, discusses the concept and purpose of virtual environments in detail, with multiple examples.

* To install Modin using `conda` (with a Dask backend), run the following:

```
conda install -c conda-forge modin-dask
```
* To install with `pip`, use the following:

```
pip install "modin[dask]"
```

You will measure the time and memory usage using pandas and again using Modin. To measure memory usage, you will need to install the `memory_profiler` library.

```
pip install memory_profiler
```

The `memory_profiler` library provides IPython and Jupyter magics such as `%memit` and `%mprun`, similar to well-known magics such as `%timeit` and `%time`.
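
Outside a notebook, where these magics are unavailable, a comparable measurement can be approximated with the standard library. The sketch below is a stand-in, not what `%memit` does internally: it uses `time.perf_counter` for wall time and `tracemalloc` for peak Python-level allocations, whereas `memory_profiler` reports process memory. The helper name `measure` and the toy workload `build_squares` are hypothetical.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_bytes).

    Hypothetical helper: peak_bytes counts only memory allocated through
    Python's allocator while func runs, not the whole process footprint.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def build_squares(n):
    """Toy workload standing in for a call such as read_csv."""
    return [i * i for i in range(n)]


squares, seconds, peak_bytes = measure(build_squares, 100_000)
print(f"{len(squares)} items, {seconds:.4f} s, peak {peak_bytes / 2**20:.2f} MiB")
```

Tracing allocations adds overhead, so the elapsed time from this helper will tend to be higher than an untraced run; treat the two numbers as rough indicators rather than benchmarks.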

## Using Modin
* This recipe continues from the recipe "Reading data from a SAS dataset".
* It introduces concepts for working with large datasets.

> "the **Modin** library acts as a wrapper or, more specifically, an abstraction on top of Dask or Ray that
> uses a similar API to pandas. Modin makes optimizing your pandas' code much more
> straightforward without learning another framework, and all it takes is a single line
> of code." (Chapter 2, Page 62)

## Comparing Performance: Memory & Time

In [1]:
#!conda install -c anaconda memory_profiler -y
#!pip install memory-profiler

In [1]:
#import memory_profiler 
%load_ext memory_profiler

In [2]:
import pandas as pd
pd.__version__

'2.2.0'

In [3]:
import modin
modin.__version__

'0.27.0'

In [7]:
from pathlib import Path
from modin.config import Engine
Engine.put("dask")  # Modin will use Dask
# Note: this rebinds `pd`; from here on, `pd` refers to Modin, not pandas
import modin.pandas as pd
# from distributed import Client
# client = Client()

In [8]:
file_path = Path('../../datasets/Ch2/yellow_tripdata_2023.csv')

In [13]:
%%time
%%memit 
df_pd = pd.read_csv(file_path)

peak memory: 346.61 MiB, increment: 165.66 MiB
CPU times: user 1.24 s, sys: 326 ms, total: 1.56 s
Wall time: 8.2 s
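
In this output, `total` is the sum of the `user` and `sys` CPU time of the notebook process (1.24 s + 326 ms ≈ 1.56 s), while wall time is the elapsed clock time. Wall time can exceed CPU time when the process waits on I/O or when work happens in other processes; with the Dask engine that plausibly includes Modin's workers, though the notebook does not confirm this. The sketch below uses only the standard library and a hypothetical `time.sleep` workload to show the two clocks diverging.

```python
import time

wall_start = time.perf_counter()   # elapsed (wall-clock) time
cpu_start = time.process_time()    # user + sys CPU time of this process only

time.sleep(0.2)                    # waiting: wall time advances, CPU time barely moves
checksum = sum(range(200_000))     # computing: both clocks advance

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"CPU time: {cpu:.3f} s, wall time: {wall:.3f} s")
```

The sleep stands in for any period in which the measured process is idle, so the printed wall time should be noticeably larger than the CPU time.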


In [14]:
df_pd.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 16186386 entries, 0 to 16186385
Data columns (total 20 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
 18  Airport_fee            float64
 19  airport_fee            float64
dtypes: float64(13), int64(4), object(3)
memory usage: 2.4+ GB
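
The reported figure can be sanity-checked by hand. The sketch below assumes each of the 20 columns costs 8 bytes per row (true for `int64` and `float64` values, and for the pointers stored in `object` columns) and that the "GB" label uses a divisor of 1024³, as pandas does; whether Modin computes the figure identically is an assumption. The trailing `+` signals that the strings referenced by the `object` columns are not included in the count.

```python
rows = 16_186_386        # from RangeIndex in the info() output
columns = 20             # from "Data columns (total 20 columns)"
bytes_per_value = 8      # assumption: int64, float64, and object pointers

shallow_bytes = rows * columns * bytes_per_value
shallow_gib = shallow_bytes / 1024**3

print(f"{shallow_bytes:,} bytes is about {shallow_gib:.1f} GiB")
# prints: 2,589,821,760 bytes is about 2.4 GiB
```

Because the three `object` columns hold Python strings, the true footprint is larger than this shallow estimate.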


In [15]:
df_pd.shape

(16186386, 20)

In [16]:
df_pd.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-05-01 00:33:13 | 2023-05-01 00:53:01 | 0.0 | 7.8 | 1.0 | N | 138 | 43 | 1 | 33.8 | 7.75 | 0.5 | 8.6 | 0.0 | 1.0 | 51.65 | 0.0 | 1.75 | |
| 1 | 1 | 2023-05-01 00:42:49 | 2023-05-01 01:11:18 | 2.0 | 8.1 | 1.0 | N | 138 | 262 | 1 | 35.9 | 10.25 | 0.5 | 9.5 | 0.0 | 1.0 | 57.15 | 2.5 | 1.75 | |
| 2 | 1 | 2023-05-01 00:56:34 | 2023-05-01 01:13:39 | 2.0 | 9.1 | 1.0 | N | 138 | 141 | 1 | 35.2 | 10.25 | 0.5 | 10.7 | 6.55 | 1.0 | 64.2 | 2.5 | 1.75 | |
| 3 | 2 | 2023-05-01 00:00:52 | 2023-05-01 00:20:12 | 1.0 | 8.21 | 1.0 | N | 138 | 140 | 1 | 33.1 | 6.0 | 0.5 | 2.24 | 0.0 | 1.0 | 47.09 | 2.5 | 1.75 | |
| 4 | 1 | 2023-05-01 00:05:50 | 2023-05-01 00:19:41 | 0.0 | 7.9 | 1.0 | N | 138 | 263 | 1 | 31.0 | 10.25 | 0.5 | 9.85 | 6.55 | 1.0 | 59.15 | 2.5 | 1.75 | |


In [None]:
pd.read_csv(file_path)  # read_csv is a module-level function, not a DataFrame method