Why does modin.pandas take more time than pandas in df.read_csv()? #625
Hi @pintuiitbhi, thanks for the question. I need more information to be able to answer it. Did you run the pandas code first, then Modin in the same environment? Due to limited memory capacity, you may be exceeding your memory, and the time you see will include some cost for the OS to swap objects to disk. Typically pandas memory costs are much higher than the file itself. Are you also running other memory-hungry applications (e.g. an internet browser with several tabs open)? This will also hurt performance if you are trying to read a 500MB file into 4GB of memory. Is your disk an SSD or an HDD? If there is data swapping (which is my best guess right now), an HDD will take longer than an SSD. Can you share the notebook file in a GitHub gist with a sample of the data so I can also try to reproduce this?
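As a rough sanity check on the swapping theory above, you can compare the memory that is actually free against what pandas is likely to need. This is only an illustrative sketch: the 5-10x multiplier is a common rule of thumb, not a Modin figure, and the sysconf names are Linux-specific.

```python
import os

# Compare free RAM against a rough estimate of pandas' in-memory footprint
# (often several times the on-disk CSV size). Linux-specific sysconf names.
page = os.sysconf("SC_PAGE_SIZE")
total_gb = os.sysconf("SC_PHYS_PAGES") * page / 1e9
avail_gb = os.sysconf("SC_AVPHYS_PAGES") * page / 1e9

csv_gb = 0.5  # the 500MB file from this thread
print(f"RAM: {avail_gb:.1f} GB available of {total_gb:.1f} GB; "
      f"pandas may need roughly {csv_gb * 5:.1f}-{csv_gb * 10:.1f} GB")
```

If the estimated footprint exceeds the available figure, the OS will start swapping and the timings become dominated by disk speed.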
Yes, I ran the pandas code first in the same notebook in which I was importing modin.pandas. You can find the code at the following link. Any dataset will work. OK, I got you. When I just import modin.pandas in the notebook, df.read_csv takes less time. But when I restart and clear the cell outputs of the notebook, it again takes much more time than before. Why is that? Following is the output of the cell when I import modin.pandas:
I observed that when it is /dev/shm it is fast, but when it is /tmp it is too slow. Why? I have one more question: when I tried to run df.corr() in modin.pandas, it gets stuck. I mean, there is a warning like:
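The /dev/shm vs /tmp difference can be checked directly. Ray (Modin's default engine at the time) keeps its object store in shared memory, which on Linux is /dev/shm; if the store ends up on a disk-backed path such as /tmp instead, every object access pays disk latency, which matches the slowdown described above. A small diagnostic sketch:

```python
import os
import shutil

# On Linux, /dev/shm is RAM-backed shared memory; /tmp is usually on disk.
# A small /dev/shm can force Ray's object store onto the slower path.
shm = "/dev/shm" if os.path.isdir("/dev/shm") else "/tmp"  # fallback for non-Linux
usage = shutil.disk_usage(shm)
print(f"{shm}: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")
```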
What is the point of modin.pandas being fast if I have to close my internet browser, stop other work, and then run modin.pandas? On the same system, whether I am browsing the internet or restarting and running the notebook again, pandas still works fast for df.read_csv on the same CSV file. Let me know if I am missing any points, or if I have to do some other configuration on my system to make modin.pandas fast. I don't want to stop internet browsing or the other notebook I am working on.
@pintuiitbhi The best bet for your situation would probably be to try the out-of-core feature: https://modin.readthedocs.io/en/latest/out_of_core.html We try to limit the memory that Modin takes up to ~50-60% of the available memory on the machine. That way it plays nicely with other applications (e.g. browser tabs, etc.). In your case we're allocating 2GB (see the last line of the printed log). Unfortunately, it seems that 2GB is not enough memory to handle the 500MB dataset if you want to process it, and this is consistent with recommendations from the creator of pandas as well. Pandas will just use all the memory and doesn't box itself in this way; Modin does it to make sure it doesn't take up too many resources. Out of core will let you use all of the memory and also gives you the ability to use the disk as a backup for memory, but it should not interfere with other applications. Please give it a try and let me know if it helps. I agree you should not have to stop using other applications while using Modin; that is why we have set up Modin to take roughly half of the available memory.
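Per the linked documentation of that era, out-of-core mode was switched on with an environment variable that had to be set before importing modin.pandas. A minimal sketch (the flag name comes from the linked page; `data.csv` is a placeholder path):

```python
import os

# MODIN_OUT_OF_CORE must be set before modin.pandas is imported for it
# to take effect.
os.environ["MODIN_OUT_OF_CORE"] = "true"

# import modin.pandas as pd    # imported only after the flag is set
# df = pd.read_csv("data.csv") # "data.csv" is a placeholder path
```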
What should the specification of my laptop be, or of a server if I deploy my notebook there, for Modin to work fine without creating any bottleneck (apart from "Default to pandas")? What should the RAM size, hard disk or SSD size, etc. be? Suppose I deploy my model on a server and switch from pandas to Modin by just changing the import statement. Is it advisable to use Modin for machine learning applications? I found that the main pandas APIs used in ML are not implemented yet in Modin, like ix, align, assign, at, combine, corr, corrwith, cov, dot, drop_duplicates, get_value, hist, iat, is_copy, last, mask, pivot, pivot_table, reindex_axis, replace, resample, rolling, select, set_value, shift, squeeze, to_dense, to_timestamp, to_xarray, truncate.
If I use out of core, then Modin will swap data to the hard disk. And swapping from the hard disk is slow. So I think there is no point in using out of core in my case.
The Out of Core functionality should be more efficient than just letting the OS choose what to swap. Some of the functionality you've listed has been deprecated in pandas; those are not a high priority because pandas no longer officially supports them. As for the others, we are definitely interested in supporting them, and if you're interested in contributing, please let me know!
I tried using the out-of-core option for reading 417M of parquet files. Pandas still wins, by about 30%: 12.2-12.8 s vs 14.4-15 s. Dinesh
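When comparing numbers like these, warm-up and disk-cache effects matter. A small self-contained timing harness along these lines keeps the comparison fair (the synthetic CSV, sizes, and function names here are illustrative, not from this thread's actual 417MB parquet benchmark):

```python
import os
import tempfile
import time

import pandas as pd

# Build a small synthetic CSV so the timing is self-contained.
path = os.path.join(tempfile.mkdtemp(), "bench.csv")
pd.DataFrame({"a": range(100_000), "b": range(100_000)}).to_csv(path, index=False)

def best_read_time(read_csv, path, repeats=3):
    # Best-of-N so the first cold-cache read does not dominate the result.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_csv(path)
        best = min(best, time.perf_counter() - start)
    return best

print(f"pandas: {best_read_time(pd.read_csv, path):.4f}s")
# For the Modin side, pass modin.pandas.read_csv through the same harness.
```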
@ddutt Are you using the code from this comment? #624 (comment) That creates a pandas DataFrame, and would not run in parallel. Unless the data is extremely small, there should be no reason pandas is faster for reading parquet or CSV files. We have benchmarked out of core for up to 250GB of data, and typically it only has a 50-60% overhead if the data exceeds memory.
To clarify, the 50-60% overhead is on Modin's performance, not pandas. As I mentioned, the performance of Modin should be faster than pandas, even with out of core. We do a lot of benchmarking and testing to ensure this is the case, so please let us know more information about your workflow/system.
Good catch, @devin-petersohn. Yes, that just creates a pandas dataframe. My bad, sorry. Which is why I want read_parquet() supported natively in Modin, as I mentioned in the bug #624. Dinesh
If I leave out the to_pandas() part, would it run in parallel?
It can run in parallel, but it would not be using Modin to do it. Link to documentation: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html Supporting this is a relatively small change, and it should be doable in the next release (happening next week).
Excellent. I tried some obvious things, such as providing columns to avoid trying to do ParquetFile in the Ray code, etc. But it didn't work. Obviously, my glance-and-fix approach didn't work. Will wait for your fix/release. I am happy to try before the release if you like; just send me the patch. Appreciate the fantastic response time, Dinesh
OK, coming back to this test after all the fixes to get read_parquet to work: Modin and pandas get the same perf for the 417M parquet file read. With out-of-core support, Modin ekes out ~8% better perf (32 vs 34 s). Not very dramatic. I have a 4-core i7-8550U at 1.8 GHz and about 16 GB RAM.
@ddutt With parquet, the maximum parallelism in the reader is limited by the number of columns, which may mean that the utilization is not that great, especially with hyperthreading. There is another conversation about this in the Discourse (Link), and a PR (which has unfortunately gone stale) #528 that would include row groups to maximize utilization and fit Modin's partitioning pattern better. How many columns does your dataset have? Also, is 417M megabytes or millions of rows?
I tried reading only 3 or 4 columns given I have only 4 cores, but that shaved about 8 s off both pandas and Modin. They both came in even with reduced columns. I have 417 MB of parquet files, but they're all small files. I wonder if the small files are part of the problem. There are about 268K rows of 21 columns. Dinesh
Small files may be part of the problem. I'll try to recreate this and see if I can reduce the time. |
Hi @ddutt, I did some benchmarking to determine the performance of the parquet reader in four configurations [timing outputs not preserved]:
Without partition and no selected columns
Without partition and 2 selected columns
Partitioned and no selected columns
Partitioned and 2 selected columns
Of the two selected columns, one was one of the partitioned columns. I ran this on my laptop, which has a 3.1 GHz Intel Core i7 with 16 GB of RAM.
Hi, when I ran Modin for the first time it was quick. I changed to normal pandas and back to Modin, and then it became dead slow.
@pavanpraneeth I believe you opened issue #2911; let us discuss more there.
System information
Modin.pandas is taking too much time to read a 500MB CSV file compared to pandas
I have included the screenshots and code on Stack Overflow.
Here is the link:
Screenshots and code