sync fork #1

austin-chou · 2019-01-08T00:00:40Z

What do these changes do?

Related issue number

passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
passes `black --check modin/`

@devin-petersohn

…246) * modin.pandas.io.read_parquet to partition columns according to CPU cores, reading multiple columns at once using Ray * blank line removed from EOF * column splits recalculated according to @devin-petersohn review * Formatting for consistency and flake8

* squeeze fixes * updated squeeze * fixed rebasing issue * updated squeeze * updating to fix travis fails * black and flake * Update test_dataframe.py * removed unnecessary comments * addressing comments * fixed formatting * changed an ndim parameter Co-Authored-By: adits31 <adityasheth@berkeley.edu>

@devin-petersohn

* modin.pandas.io.read_parquet to partition columns according to CPU cores, reading multiple columns at once using Ray * blank line removed from EOF * column splits recalculated according to @devin-petersohn review * Formatting for consistency and flake8 * issue #204 Use Ray to parallelize `read_hdf` similar to the way `read_parquet` works * reformatted by black * merged from master
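
As a rough illustration of the column partitioning this commit describes, the sketch below splits a list of column names into near-equal contiguous groups, one per worker. The helper name `split_columns` and the `os.cpu_count()` default are assumptions made for illustration; this is not the modin implementation, and the Ray-based parallel read is omitted.

```python
import os


def split_columns(columns, num_splits=None):
    """Split column names into at most num_splits contiguous groups.

    Hypothetical helper: num_splits defaults to the CPU count, mirroring
    the idea of one column group per core.
    """
    columns = list(columns)
    if not columns:
        return []
    if num_splits is None:
        num_splits = os.cpu_count() or 1
    # Never create more groups than there are columns.
    num_splits = max(1, min(num_splits, len(columns)))
    base, extra = divmod(len(columns), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        # The first `extra` groups take one additional column.
        size = base + (1 if i < extra else 0)
        chunks.append(columns[start:start + size])
        start += size
    return chunks
```

Each group could then be handed to a separate task that reads only those columns, so that several columns are read per task instead of one.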

* Scenario that caused the bug: * `query` or a similar filter operation run * Some partitions completely removed from the dataframe * During `get_indices`, the filter optimization on BlockPartitions caused the empty partitions to not count toward the index computation * Fix: Use _partitions_cache, which is not modified when accessed.
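
The scenario above can be pictured with a toy model. In the sketch below, `ToyPartitions` and its method names are hypothetical and do not reproduce modin's `BlockPartitions`; only the attribute name `_partitions_cache` is taken from the commit message. It shows how an accessor that hides empty partitions makes per-partition bookkeeping disagree with the stored layout, while reading the raw cache keeps every partition in the count.

```python
class ToyPartitions:
    """Toy model of a partition container; not modin's BlockPartitions."""

    def __init__(self, blocks):
        # Raw storage: one list of rows per partition, possibly empty.
        self._partitions_cache = [list(b) for b in blocks]

    @property
    def partitions(self):
        # Accessor that hides empty partitions from callers.
        return [b for b in self._partitions_cache if b]

    def lengths_filtered(self):
        # Empty partitions vanish, so positions no longer match storage.
        return [len(b) for b in self.partitions]

    def lengths_from_cache(self):
        # Every stored partition is counted, including empty ones.
        return [len(b) for b in self._partitions_cache]
```

With blocks `[[1, 2], [], [3, 4, 5]]`, the filtered view reports two partitions while the cache reports three, which is the kind of mismatch the fix avoids.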

* Removed fixtures for checking equality Parameterized the different dataframes into a parameterize mark added comments and used parameterization for test_functions Moved test_empty_df to be alphabetical Grouped inter df operations Updated equals functions to one function now in utils.py for test_dataframe.py Alphabetized the tests Continued alphabetized the tests Changed quantiles and query to new testing structure Changed some fillna tests, median, mean, and max to use new structure Updated tests upto test_get Updated test_get Updated test_get_dummies and equality checker updated bool arguments parameterization' updated testing of aggregate functions and created name_contains for cleaner code Updated all and any tests Updated count test Updated cumulative min/max/sum/prod functions Added more apply tests updated test_diff class of functions updated a lot of functions Updated query and rank Updated more tests Updated test_sort_index Updated more tests and give equals a possible error 0.01 Finished test_dataframe.py Updated tests Fixed test_sum in groupby and changed all names of ray_ to modin_ changed ray_df to modin_df Rebased and updated test after clip was implemented Ran black on test_dataframe and test_groupby Fixed ray to modin errors updated test for dropna Fix test_dropna_subset Fix test_groupby errors from rebasing Resolve flake errors Fixed remaining flake8 errors Fixed more remaining flake8 errors Fixed upto but not including clip functions Fixed parts of clip and all of count Fixed tests upto but not including mean Fixed test_mean Removed numeric_only requirement for count Changed abs to use _validate_dtypes function cleaned up count Fixed median and reorganized full_axis_reduce functions Fixed numeric_only = None tests. 
Fixed numeric functions Added larger test data Fixed mode test Fixed test_prod wanting too much TypeErrors but still broken fixing test_clip Added None as an option to arguments Added none argument testing of axis for all and any some fixes for empty df and series testing Fixed sum test and removed print statement for sample Removed default value for min_count in sum Fixed sort_values Forgot to remove debugging print statements Fixed sort_index update inplace and more rigorous testing of nans fix plot tests formatting fixed tests for min and max fix Updated some tests Refactored tests to better test for errors updated test_mean fixed test_equals and made transpose more comprehensive rebased and reset groupby to master Update inter_df_math_helper updated remaining inter df operations Updated inter_df_operations tests Updated error testing Updated testing for inter_df_operations Update test_rank Changed parameterization to create dataframe within tests insteadd fixed test_copy Fixed tests for all and any fixed overeager find and replace misspelled modin Updated jenkins Updated test___bool__ fixed test_agg fixed test_aggregate Fixed tests all and any Updated equality testint Undo travis changes Skip test_apply for now until we are able to properly test UDFs Skip applymap for now execpt for numeric ones Testing applymap for numeric dataframes Update test for at Updated clip tests Updated clip tests Update cumsum test Updated test duplicate, empty_df, fillna_dataframe Fix jenkins Update jenkins Update jenkins Update fillna test functions Skip tests that otherwise default to pandas Updated test insert skip is_copy because defaulting to pandas Test transform only on numeric functions Update travis to run on python backend Skip rename nocopy update jenkins Reset dataframe and querycompiler to master Fixed merge conflicts in test_dataframe Updated test clip and squeeze Update tests to take out axis and integer none testing Updated testing apply_numeric Minor changes Update 
test_inplace" Fixed typo in dropna_inplace * Update test_merge and the all integer test dataframe * Update travis build syntax error * Update install-dependencies.sh * Update how series results are tested for equality * Update test_eval_df_use_case syntax error * Update new tests to new test suite format * Removing the running of mixed dtype dataframes * Update equality testing * Fix set_axis assertion error * Fix set_index assertion error * Update testing dataframes to be positive * Fixed index checking in df_equals * Lint * Expand tests and fix bugs * Fix issue with travis * Fixing nan failure * Update pytest * Run only test_dataframe.py in parallel on travis * Run only test_dataframe.py in parallel on jenkins * Revert version change

* Fixing partitioning issue when doing a reindex/concat * Adds a new parameter to apply for axis_partitions * Allows partitioning to be either recomputed and rebalanced or maintained between runs * Adding parameter to Python and Dask backends

#365) * This allows the partition to return a valid DataFrame and not an Exception * In the future, additional post-processing may be needed to throw the pandas error that gets thrown if no data of that type exists.

* Adding a condition for selective apply operations that will simply return an empty 2D numpy array for partitions if there are no partitions. * This is added to both `apply_func_to_select_indices` and `apply_func_to_select_indices_along_full_axis`

#386) * Converting PandasQueryCompiler.getitem_array to accept numeric indices * Resolves #384 * Make changes to the functions that use getitem_array to now use indices instead * Converting some range to RangeIndex to get indexing * Lint

* Resolves #388 * For now, we have a class that overrides all function calls and converts the arguments to pandas. * The class created is a metaclass with the sole purpose of converting the arguments to pandas compatible args. * This approach allows us to avoid copying a bunch of code and hard-coding the entire module.
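
A minimal sketch of the metaclass approach described above, under stated assumptions: `Wrapped`, `ForwardingMeta`, and `MathProxy` are hypothetical names, the standard-library `math` module stands in for the wrapped library, and the Python 3 `metaclass=` syntax is used. It is not the code from the commit; it only shows how one metaclass can forward every function call and convert arguments on the way, without re-declaring each function.

```python
import math


class Wrapped:
    """Stand-in for a wrapper object (hypothetical)."""

    def __init__(self, value):
        self._value = value

    def to_native(self):
        return self._value


def _convert(arg):
    # Unwrap wrapper objects; pass everything else through unchanged.
    return arg.to_native() if isinstance(arg, Wrapped) else arg


class ForwardingMeta(type):
    """Metaclass that forwards missing class attributes to a target module."""

    def __getattr__(cls, name):
        # Only called when normal lookup on the class fails.
        target = getattr(cls._target, name)
        if not callable(target):
            return target

        def wrapper(*args, **kwargs):
            # Convert every argument before delegating to the real function.
            args = tuple(_convert(a) for a in args)
            kwargs = {k: _convert(v) for k, v in kwargs.items()}
            return target(*args, **kwargs)

        return wrapper


class MathProxy(metaclass=ForwardingMeta):
    # `math` stands in for the wrapped library module.
    _target = math
```

Calling `MathProxy.sqrt(Wrapped(16))` unwraps the argument and delegates to `math.sqrt`, so no function of the target module has to be copied or hard-coded.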

osalpekar and others added 30 commits October 27, 2018 17:24

Perf Data S3 File Path Fix (#226)

a6444c7

* printing aws file path * corrected pathname * public perf tests * remove s3 ls statements

Dropping git revision because it causes problems for some Ubuntu users (#238)

73fcf18

* Dropping git revision because it causes problems for some Ubuntu users * Removing from pandas also * lint

Update README (#242)

4ff641e

* Update README.md * Typo fix for README * Revert "Typo fix for README" This reverts commit dc2023f.

Adding read_gbq as a default from pandas implementation (#244)

c691bec

Fixing issue iterating over groupby with multiple columns (#237)

f58d5b6

* Fixing issue iterating over groupby with multiple columns * formatting * performance improvement

Fixed sample (#249)

e4079d2

Fix to_pandas for Series objects (#254)

142d7fd

Revert some changes in behavior for sum/prod (#253)

cdee5fb

* Revert some changes in behavior for sum/prod * Fix default value for numeric_only * Fix mean in the context of the new sum changes * Add comments

fix all and any to work for python backend (#252)

c5f080f

Adding advanced usage docs (#250)

0041447

* Adding advanced usage docs * Updating docs * Updating phrasing * Addressing comments

Fix minor edge case in repr (#256)

e705a1f

Refactor dataframe (#257)

c138dda

* Refactor dataframe * Fix bug * Fix lint

Bump version 0.2.3 (#258)

afff86c

* Bump version 0.2.3 * Fix issue in init * Removing typing as a dependency

Pin redis until Ray release (#262)

fde541a

* Pin redis until Ray release * Moving requirement order * Add requirements.txt * Pin pytest version on travis * Add

Jenkins perf builds for Master Commits (#251)

6a750cd

* Jenkins perf builds for Master Commits * formatting change * code for adding commit ordering info * formatting * updated deps on Dockerfile * removed extra dependency from Dockerfile

ignore double ray init errors (#261)

bb64c28

Bump version (#263)

a0dd742

pin pytest version to 3.9.3 (#265)

770fe96

Refactor block partitions file to use __constructor__ instead of (#268)

578669a

type(self) in each method

Update documentation (#269)

0442d5f

* Update documentation * Update other supported methods * Resolve comments * minor typo fix

Adding read_table (#270)

1e7a1f3

* Adding read_table * Removing unneeded variable * Removing default * Fix python2 bug * Removing unneeded code * Add read_table to __init__ * Adding read_table to documentation

Dask skeleton (#271)

5894ad2

* Adding skeleton for a Dask runtime implementation * Lint * Revert extra newline * Fix error message * Fix typo

fixed empty rows error in to_pandas (#274)

67275a9

* fixed empty rows error in to_pandas * formatting and code style

make modin dataframe from copy of pandas dataframe (#276)

227b77a

Fixing filter for read_csv args (#240)

6433bb8

* Fixing filter for read_csv args * Remove hardcode * Fix bug * Lint

Update using_modin.rst (#280)

201d2e5

* Update using_modin.rst The blog post link leads to a non-existent page. I suggested the post that was probably mentioned. * Update using_modin.rst

Update ray and make changes to be compatible with API changes (#284)

8e2c936

williamma12 and others added 29 commits December 31, 2018 13:17

Fixes rmod (#344)

be4a6f1

Fix mode when axis=1 and remove a reindex (#333)

277af64

* Fix mode when axis=1 and remove a reindex * formatting

Fix rfloordiv error (#342)

a005454

* Fix rfloordiv error * formatting

Removed pow dtype checking (#346)

d0cb649

Fixed sort_index (#160)

95f2953

* Fixed sort_index * Fixed formatting * Fixed when ascending is None * Update modin/data_management/query_compiler/pandas_query_compiler.py Co-Authored-By: williamma12 <12377941+williamma12@users.noreply.github.com> * Changed insert to keep old index explicitly

Setting min_count default to 0 to match pandas (#352)

ecb5638

Change the way we test whether something is callable in SeriesView (#354)

d55e383

* Change the way we test whether something is callable in SeriesView * This will make sure that all things that are callable are evaluated * Fix bug * Revert test change

Checking for proper types (Series and SeriesView) on insert (#357)

5a0e9b1

* Checking for proper types (Series and SeriesView) on insert * Lint

Filtering default exclude values in describe based on the include passed in (#363)

990f87e

* Filtering default exclude values based on the include passed in * Fix error checking

Adding crosstab that defaults to pandas to modin.pandas (#367)

203bbec

Adding case for operations like align so we can properly convert (#371)

1b0f921

Implementing __abs__ for SeriesView object (#373)

32f4927

Handling class objects in SeriesView as not callable (#375)

ad05c14

Update README.md

208c803

Adjust logo size and text size.

Adding groupby columns and index name when necessary (#380)

f2bce57

* Adding groupby columns and index name when necessary * Lint * Making sure we only change columns when we group by index values

Making drop faster for drop operations (#379)

9576ec7

* Making drop faster for drop operations * Drops the same way that pandas Index gets dropped with duplicates * Also clean up some inefficient and duplicated `dropna` code * Lint * Reverting dropna cleanup to fix in future PR

Checking for type before we check the length to avoid spurious errors. (#383)

a519421

* Checking for type before we check the length to avoid spurious errors. * Resolves #382 * Prevents errors from DataFrames with 2 rows * Lint

Removing data from DataFrame.hist parameter requirements (#387)

71e8e2e

* Resolves #385

Removing is_view from PandasQueryCompilerView and codepath requiring it (#393)

0126674

* Resolves #392 * This creates a view on all objects, even PandasQueryCompilerView objects, when using an Indexer (e.g. LocIndexer) * Allows `df.iloc[...].iloc[...]` to be supported now

Correcting dtypes after iloc issue (#391)

f546487

* Resolves #145 * Performs a `reindex` on the `dtypes_cache` if there is already something there. * Adds a `dtype` property to `PandasQueryCompilerView` objects to be handled differently than parent.

Add tab completion to SeriesView to be identical to Series (#390)

825da14

Adding isnull to modin.pandas (#395)

0c8eddc

* Resolves #394 * Aliasing isnull to isna, which was already implemented.
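
Aliasing here means binding a second name to the same function object. The sketch below uses a simplified, hypothetical `isna` that treats only `None` as missing; it is not the modin or pandas implementation and only shows the pattern.

```python
def isna(values):
    """Return a list of booleans marking missing entries (None only, in this sketch)."""
    return [v is None for v in values]


# Alias: both names refer to the same function object, so behavior cannot drift.
isnull = isna
```

Because `isnull is isna` holds, any later fix to `isna` applies to `isnull` automatically.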

austin-chou merged commit 0c8eddc into austin-chou:master Jan 8, 2019
