
Why does modin.pandas take more time than pandas in df.read_csv()? #625

Closed
pintuiitbhi opened this issue May 21, 2019 · 21 comments
Labels
Performance 🚀 Performance related issues and pull requests. question ❓ Questions about Modin

Comments

@pintuiitbhi

pintuiitbhi commented May 21, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Modin installed from (source or binary): pip install modin
  • Modin version: 0.5.0
  • Python version: 3.6.6
  • Exact command to reproduce:

modin.pandas is taking much longer than pandas to read a 500MB CSV file.

I have included the screenshots and code in the Stack Overflow post.

Here is the link:
Screenshots and code

@devin-petersohn
Collaborator

Hi @pintuiitbhi, thanks for the question. I need more information to be able to answer your question.

Did you run the pandas code first, then Modin in the same environment? Due to the limited memory capacity, you may be exceeding your memory, and the time you see will include some cost for the OS to swap some objects to disk. Typically, pandas' in-memory footprint is much larger than the file size on disk.

Are you also running other memory hungry applications (e.g. internet browser with several tabs open)? This will also hurt performance if you are trying to read a 500MB file into 4GB of memory.

Is your disk an SSD or a HDD? If there is data swapping (which is what my best guess is right now) HDD will take longer than SSD.

Can you share the notebook file in a github gist with some sample of the data so I can also try to reproduce this?

@devin-petersohn devin-petersohn added the question ❓ Questions about Modin label May 21, 2019
@pintuiitbhi
Author

pintuiitbhi commented May 22, 2019

Yes, I ran the pandas code first in the same notebook in which I was importing modin.pandas.
Yes, I was also running an internet browser.
My disk is an HDD. 4 cores, Linux Ubuntu 16.04, 4GB RAM, 1TB hard disk, dual boot.

You can find the code in following link. Any dataset will work.
code

OK, I got you. When I just import modin.pandas in the notebook, df.read_csv takes less time. But when I restart the notebook and clear the cell outputs, it again takes much longer than before. Why is that?

The following is the output of the cell when I import modin.pandas:

2019-05-22 11:47:09,292 INFO node.py:497 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-22_11-47-09_291873_17634/logs.
2019-05-22 11:47:09,427 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:53855 to respond...
2019-05-22 11:47:09,574 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:49556 to respond...
2019-05-22 11:47:09,589 INFO services.py:806 -- Starting Redis shard with 0.81 GB max memory.
2019-05-22 11:47:09,655 INFO node.py:511 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-22_11-47-09_291873_17634/logs.
2019-05-22 11:47:09,660 WARNING services.py:1318 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 957259776 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-05-22 11:47:09,661 INFO services.py:1441 -- Starting the Plasma object store with 2.0 GB memory using /tmp.

I observed that when the object store uses /dev/shm it is fast, but when it falls back to /tmp it is very slow. Why?

I have one more question.
Suppose I have a notebook that used only pandas, and I just change the import statement to modin.pandas. Will it cause problems if the notebook contains an ML model like a neural network?

When I tried to run df.corr() in modin.pandas, it got stuck. I mean, there is a warning like:

UserWarning: DataFrame.corr defaulting to pandas implementation.

@pintuiitbhi
Author

pintuiitbhi commented May 22, 2019

What is the point of modin.pandas being fast if I have to close my internet browser and stop my other work before running it?

On the same system, whether I am browsing the internet or restarting and re-running the notebook, pandas still reads the same CSV file faster with df.read_csv.

Let me know if I am missing something, or whether I have to change some configuration on my system to make modin.pandas fast. I don't want to stop browsing or close the other notebooks I am working on.

@devin-petersohn
Collaborator

@pintuiitbhi The best bet for your situation would probably be to try the out of core feature: https://modin.readthedocs.io/en/latest/out_of_core.html

We try to limit the memory that Modin takes up to roughly 50-60% of the available memory on the machine. That way it plays nicely with other applications (e.g. browser tabs, etc.). In your case we're allocating 2GB (see the last line of the printed log). Unfortunately, it seems that 2GB is not enough memory to process the 500MB dataset, and this is consistent with the recommendations from the creator of pandas as well. Pandas will just use all the memory and doesn't box itself in this way; Modin does it to make sure it doesn't take up too many resources.

Out of core lets Modin use all of the memory and also use the disk as a backup for memory, without interfering with other applications. Please give it a try and let me know if it helps. I agree you should not have to stop using other applications while using Modin; that is why we set up Modin to take roughly half of the available memory.

corr is not parallelized yet. If you're familiar with parallel algorithms, it would make a great first contribution as well.

@pintuiitbhi
Author

What should the specifications of my laptop (or of the server where I deploy my notebook) be for Modin to work well without creating any bottleneck (apart from "default to pandas")? What RAM size, hard disk or SSD size, etc.?

Suppose I deploy my model on a server and switch from pandas to Modin by just changing the import statement. Is it advisable to use Modin for machine learning applications? I found that the main pandas APIs used in ML are not implemented yet in Modin, such as ix, align, assign, at, combine, corr, corrwith, cov, dot, drop_duplicates, get_value, hist, iat, is_copy, last, mask, pivot, pivot_table, reindex_axis, replace, resample, rolling, select, set_value, shift, squeeze, to_dense, to_timestamp, to_xarray, and truncate.

@pintuiitbhi
Author

If I use out of core, then Modin will swap data to the hard disk, and swapping to a hard disk is slow. So I think there is no point in using out of core in my case.

@devin-petersohn
Collaborator

The Out of Core functionality should be more efficient than just letting the OS choose what to swap.

Some of the functionality you've listed has been deprecated in pandas. Those are not a high priority because pandas no longer officially supports them. As for the others, we are definitely interested in supporting them, and if you're interested in contributing, please let me know!

@ddutt
Contributor

ddutt commented May 22, 2019

I tried using the out-of-core option for reading 417M of parquet files. Pandas still wins, by about 30% (12.2-12.8s vs 14.4-15s).

Dinesh

@devin-petersohn
Collaborator

@ddutt Are you using the code from this comment? #624 (comment)

That creates a pandas DataFrame, and would not be run in parallel. Unless the data is extremely small, there should be no reason pandas is faster for reading parquet or CSV files. We have benchmarked out of core for up to 250GB of data, and typically it only has a 50-60% overhead when the data exceeds memory.

@devin-petersohn devin-petersohn added the Performance 🚀 Performance related issues and pull requests. label May 22, 2019
@devin-petersohn
Collaborator

To clarify, the 50-60% overhead is on Modin's performance, not pandas'. As I mentioned, Modin should be faster than pandas even with Out of Core. We do a lot of benchmarking and testing to ensure this is the case, so please share more information about your workflow and system.

@ddutt
Contributor

ddutt commented May 22, 2019

Good catch, @devin-petersohn. Yes, that just creates a pandas dataframe. My bad, sorry. That is why I want native read_parquet() support from Modin, as I mentioned in #624.

Dinesh

@ddutt
Contributor

ddutt commented May 22, 2019

If I leave out the to_pandas() part, would it run in parallel?

@devin-petersohn
Collaborator

If I leave out the to_pandas() part, would it run in parallel?

It can run in parallel, but it would not be using Modin to do it. Link to documentation: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html

Supporting this is a relatively small change, and it should be doable in the next release (happening next week).

@ddutt
Contributor

ddutt commented May 22, 2019

Excellent. I tried some obvious things, such as providing columns to avoid doing ParquetFile in the Ray code, but it didn't work. Obviously, my glance-and-fix approach didn't pan out. I will wait for your fix/release. I am happy to try it before the release if you like — just send me the patch.

Appreciate the fantastic response time,

Dinesh

@ddutt
Contributor

ddutt commented May 25, 2019

OK, coming back to this test after all the fixes to get read_parquet working: Modin and pandas get the same performance for the 417M parquet file read. With out-of-core support, Modin ekes out ~8% better performance (32s vs 34s). Not very dramatic. I have a 4-core i7-8550U at 1.8GHz and about 16GB of RAM.

@devin-petersohn
Collaborator

@ddutt With parquet, the maximum parallelism in the reader is limited to the number of columns, which may mean that utilization is not great, especially with hyperthreading.

There is another conversation about this in the Discourse (Link), and a PR (that has unfortunately gone stale), #528, that would use row groups to maximize utilization and fit Modin's partitioning pattern better. How many columns does your dataset have? Also, is 417M megabytes or millions of rows?

@ddutt
Contributor

ddutt commented May 25, 2019

I tried reading only 3 or 4 columns, given that I have only 4 cores, but that shaved about 8s off both pandas and Modin. They still came in even with the reduced columns.

I have 417MB of parquet files, but they're all small files. I wonder if the small files are part of the problem. There are about 268K rows of 21 columns.

Dinesh

@devin-petersohn
Collaborator

Small files may be part of the problem. I'll try to recreate this and see if I can reduce the time.

@williamma12
Collaborator

Hi @ddutt, I did some benchmarking to determine the performance of read_parquet for #632 and got the following results for a randomly generated integer dataframe of 6000x6000, which is approximately 500MB. Columns col1 and col2 each consist of three randomly generated integer values for partitioning purposes, while every other column is randomly selected from 100 integers.

Without partition and no selected columns:

Modin: 0.83 seconds
Pandas: 1.03 seconds

Without partition and 2 selected columns:

Modin: 0.19 seconds
Pandas: 0.13 seconds

Partitioned and no selected columns:

Modin: 3.03 seconds
Pandas: 4.20 seconds

Partitioned and 2 selected columns:

Modin: 0.85 seconds
Pandas: 0.96 seconds

Of the two selected columns, one was a partitioned column (col2, to be specific) and the other a non-partitioned column.

I ran this on my laptop, which has a 3.1 GHz Intel Core i7 with 16 GB of RAM.

@aregm aregm closed this as completed Aug 11, 2020
@pavanpraneeth

Hi, when I ran Modin for the first time it was quick. I changed it to normal pandas and back to Modin, and then it became dead slow.

@devin-petersohn
Collaborator

@pavanpraneeth I believe you opened issue #2911; let us discuss more there.
