Simplify first module wrap-up quiz to not need a SimpleImputer #361

Closed
lesteve opened this issue Jun 3, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@lesteve
Collaborator

lesteve commented Jun 3, 2021

https://mooc-forums.inria.fr/moocsl/t/m1-wrap-up-quiz-q5-simpleimputer-question/2535
There was some feedback from the beta that it was hard to answer the question because it was not clear that a pipeline could be nested. We tried to give more guidance in the question, but some concerns remain:

  • this is the first module wrap-up quiz, so it should not be too hard
  • using something that we have not seen or mentioned (missing data imputation) is not a great idea in general
  • it requires a complex pipeline, whereas we have only seen simple pipelines

Proposed solutions:

  • do the missing data imputation with pandas in the code we give to load the data (my favourite option personally)
  • alternatives? I think we talked about this and there were other proposals but I can't remember (probably partly because I favour the previous option 😉, feel free to edit my post to add them)
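
A minimal sketch of what the pandas-only option could look like, assuming numeric columns are filled with their median and other columns with their most frequent value. The helper name `impute_with_pandas` and the toy values are hypothetical and are not taken from the quiz code; only the column names echo the dataset discussed in this thread.

```python
import numpy as np
import pandas as pd


def impute_with_pandas(df):
    """Return a copy of `df` with missing values filled in.

    Assumption (not from the issue): numeric columns get their median,
    all other columns get their most frequent value.
    """
    df = df.copy()
    numeric_columns = df.select_dtypes(include="number").columns
    # Filling with a Series of medians matches values to columns by name.
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    for column in df.columns.difference(numeric_columns):
        most_frequent = df[column].mode()  # mode() ignores missing values
        if not most_frequent.empty:
            df[column] = df[column].fillna(most_frequent.iloc[0])
    return df


# Hypothetical toy data standing in for the quiz CSV.
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 80.0, 70.0],
        "Fence": ["MnPrv", np.nan, "MnPrv", "GdWo"],
    }
)
imputed = impute_with_pandas(toy)
```

Note that this computes the fill values on the whole dataset, before any train/test split, which is one reason such a step would be a simplification rather than a substitute for an imputer inside a pipeline.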
@lesteve lesteve added this to the MOOC 2.0 milestone Jun 3, 2021
@lesteve lesteve changed the title Simplify first module wrap-up quiz to not need a SimpleImuter Simplify first module wrap-up quiz to not need a SimpleImputer Jun 3, 2021
@GaelVaroquaux GaelVaroquaux added the enhancement New feature or request label Jul 16, 2021
@GaelVaroquaux
Collaborator

We probably need to create a notebook with a title similar to "Illustration of a rich pipeline: handling missing values"

@lesteve
Collaborator Author

lesteve commented Jul 20, 2021

I think the consensus at the time we discussed it (probably @GaelVaroquaux was not involved, though I don't remember for sure) was not to add more content to module 1 and to do the simplest thing, which was removing missing values with a few lines of pandas.

Whether we should talk about imputing missing data somewhere and where to put it, I have to say I don't know.

@GaelVaroquaux
Collaborator

GaelVaroquaux commented Jul 20, 2021 via email

@lesteve
Collaborator Author

lesteve commented Jul 20, 2021

I would even store a simplified dataset that does not have these missing values, to avoid having to discuss this.

Good point: we are using a local CSV file, so this is probably the simplest thing to do. It would be nice to add a note about this in datasets/README.md.

@ArturoAmorQ
Collaborator

I would even store a simplified dataset that does not have these missing values, to avoid having to discuss this.

There are some features, such as 'Alley', 'PoolQC', 'Fence' and 'MiscFeature', that have more than 500 NA values.
A solution could be to erase those columns in the CSV file and then erase the rows that still have missing values, either in the CSV or with a simple dropna() directly in the notebook. It's a matter of taste.
In any case, that leaves us with 1094 out of the original 1460 entries.
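
The two-step cleanup described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `drop_sparse_columns_then_rows`, the `max_missing` parameter, and the toy values are hypothetical, and the real threshold of 500 is scaled down so the example runs on a four-row frame.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns_then_rows(df, max_missing=500):
    """Drop columns with more than `max_missing` missing values,
    then drop every row that still contains a missing value."""
    missing_per_column = df.isna().sum()
    sparse_columns = missing_per_column[missing_per_column > max_missing].index
    return df.drop(columns=sparse_columns).dropna()


# Hypothetical toy data; the real quiz dataset is a local CSV file.
toy = pd.DataFrame(
    {
        "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
        "SalePrice": [208500, 181500, 223500, 140000],
    }
)
# A threshold of 2 plays the role of 500 on this tiny frame.
cleaned = drop_sparse_columns_then_rows(toy, max_missing=2)
```

Dropping the sparse columns first matters: calling dropna() before removing them would discard almost every row, since those columns are missing in most entries.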

Erasing columns means that we will have to adjust the rest of the questions and hints accordingly. What do you think?

@lesteve
Collaborator Author

lesteve commented Jul 21, 2021

We can erase it directly in the CSV; this way the quiz instructions are a bit simpler (and we don't have to explain that we are dropping NaNs or why we are doing it).

Erasing columns means that we will have to adjust the rest of the questions and hints accordingly. What do you think?

Good points. I guess that means we may need to change quite a lot of the quiz (for example, the correction will change since we won't need a SimpleImputer anymore). We may want to wait before tackling this issue, then. IMO we need to decide on a rough strategy regarding quiz changes; the main question is basically who is going to do the manual updates in FUN. The next meeting is a good occasion to talk about this last point.

@lesteve
Collaborator Author

lesteve commented Jul 23, 2021

So we agreed to:

  • remove SimpleImputer from the wrap-up quiz. We need to recheck the entire quiz and adapt it. This is the point of this issue.
  • move missing values, imputation, and the "advanced pipeline" into a separate module as an ambitious goal, and reevaluate depending on how fast we progress on less complicated things: Add "advanced pipeline", missing value, imputing module, maybe more #414
