Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create using Woodwork with Dask guide #304

Merged
merged 6 commits into from Oct 23, 2020
Merged

Conversation

thehomebrewnerd
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd changed the base branch from main to dask-inference October 22, 2020 21:19
@codecov
Copy link

codecov bot commented Oct 22, 2020

Codecov Report

Merging #304 into dask-inference will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           dask-inference      #304   +/-   ##
================================================
  Coverage          100.00%   100.00%           
================================================
  Files                  24        24           
  Lines                2682      2682           
================================================
  Hits                 2682      2682           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46a0211...3e47b74. Read the comment docs.

@thehomebrewnerd thehomebrewnerd requested review from tamargrey and gsheni and removed request for tamargrey October 22, 2020 21:24
Copy link
Contributor

@gsheni gsheni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor things. Looks good overall and helpful.

"metadata": {},
"outputs": [],
"source": [
"# Will bring Dask dataframe into memory\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment because it kind of looks weird in the code snipped. We already mentioned it above. You could bold or quote above if you want to highlight it.

"metadata": {},
"outputs": [],
"source": [
"# Will bring Dask dataframe into memory\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

]
},
{
"cell_type": "code",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you
Screen Shot 2020-10-22 at 7 36 54 PM
forgot to remove this cell.

"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a Dask dataframe, we can use this dataframe to create a Woodwork DataTable, just as we would with a pandas DataFrame:"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this sentence, we have:

  • Dask dataframe
  • Woodwork DataTable
  • pandas DataFrame

which doesn't seem consistent from a capitalization standpoint. I think we usually capitalize Woodwork and Dask, but not pandas--is there a reason for that?

With DataFrame/DataTable, it's probably not as clear what we want to do. I think I've been trying to stick to pascal case when actually talking about the object and not capitalizing when referring to it as a word, but I'm not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've typically adopted the conventions from the respective libraries. If you look at the pandas documentation, they do not capitalize pandas, whereas Dask does in their documentation.

Regarding DataFrame/DataTable/dataframe - I will review my usage of this as I haven't been too consistent, but my thoughts are similar to yours. If we are referring to a specific object or implementation - like Dask DataFrame, we should use pascal case, but if we are referring to a more general representation that is not tied to a specific implementation, we should follow normal capitalization rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a lot of sense to follow each library's conventions!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through everything and tried to clean up the capitalization. Hopefully I caught everything.

"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Woodwork with Dask Dataframes\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataframes -> DataFrames

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@tamargrey tamargrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@gsheni gsheni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@thehomebrewnerd thehomebrewnerd merged commit 13369be into dask-inference Oct 23, 2020
@thehomebrewnerd thehomebrewnerd deleted the dask-guide branch October 23, 2020 15:02
thehomebrewnerd added a commit that referenced this pull request Oct 27, 2020
* dask logical type inference

* add dask to requirements.txt

* lint fix

* refactor inference tests

* more test refactors

* update release notes

* update test fixtures

* update conftest.py

* Various updates for Dask validation and testing (#260)

* updates for dask validation

* update release notes

* comment out unused code

* update time index test for dask

* parameterize test series and fix tests

* lint fix

* add rows to sample_df and fix head

* Skip Ordinal order validation for Dask dataframes (#270)

* updates for dask validation

* update release notes

* comment out unused code

* update time index test for dask

* parameterize test series and fix tests

* lint fix

* fix ordinal test with Dask

* update release notes

* fix accidental change in conftest.py

* fix changelog

* fix changelog

* Better dask datetime support (#286)

* better dask datetime support

* update release notes

* remove test xfail

* Test numeric time index with Dask input (#288)

* parameterize numeric time index tests

* parameterize additional tests

* update numeric time index test

* update pop method to work with Dask

* Update DataTable.describe to work with Dask (#296)

* update describe to work with Dask

* update release notes

* improve describe performance with Dask

* fix release notes

* Update DataTable.get_mutual_information() to work with Dask (#300)

* update mi to work with Dask

* update release notes

* update test fixture

* Create using Woodwork with Dask guide (#304)

* add using Woodwork with Dask docs guide

* update release notes

* remove comments and empty cell

* update dataframe wording

* remove empty cell

* fix title capitalization

* update release notes

* fix merge issues

* remove unused args from to_pandas util func
@gsheni gsheni mentioned this pull request Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a guide that documents limitations or special considerations when using Dask/Koalas
3 participants