Extract more data from practicalplants.json for the companionship algorithm #20

Closed
petteripitkanen opened this issue Dec 14, 2018 · 15 comments

Comments

@petteripitkanen
Collaborator

The function readCrops() in db/practicalplants.js selects useful properties from raw practicalplants.org data and normalizes their content to a format that is easier to handle for the companionship algorithm.

There are still quite a few properties that are not selected and normalized by readCrops(). It would be useful to have more data available for the creation of new goodness functions for the companionship algorithm.

Look for "TODO" in db/practicalplants.js.

@gljivar
Contributor

gljivar commented Aug 10, 2019

Could you prioritize which properties are most important for the algorithm, so that I can start with them?

@petteripitkanen
Collaborator Author

Yes, these would be useful:

  • family
  • genus
  • edible part and use (1)
  • medicinal part and use (1)
  • material part and use (1)
  • fertility
  • salinity
  1. These are objects. The useful properties are part used and part used for; part use details is free-form text.

It might be possible to parse something from the free-form text of these properties, but it is likely difficult:

  • range
  • habitat
  • PFAF cultivation notes
  • PFAF propagation notes
  • PFAF material use notes
  • PFAF toxicity notes
  • PFAF medicinal use notes
  • PFAF edible use notes

The rest look either useless for the algorithm or are missing from most crops.

@gljivar
Contributor

gljivar commented Aug 13, 2019

I have checked: family has 283 unique values, and genus has over 1000. What is your view on how to show these in the UI?
The dropdown concept, with all values listed in powerplants.js (e.g. ALL_FAMILY_VALUES), would not work well in these cases. I tried it, and it is not possible to scroll through the dropdown values, so most likely some kind of popup window would have to be used.

Should I just extract all of these values in this ticket, and leave how they are shown in the UI to another ticket?

@petteripitkanen
Collaborator Author

I think a dropdown is OK to get started; it can then be fixed or improved in other issues and PRs. It is also fine to split this into two PRs: one for getting family and genus extracted and tested, and another for adding them to the UI.

Note that we currently use readCrops to preprocess the raw data, specifically to convert ALL_FAMILY_VALUES to PP_FAMILY_VALUES, and then there are tests for these in test/db/practicalplants.js. It is possible that family and genus contain duplicates that differ only in small ways, such as upper/lower case letters or typos.
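
As a rough illustration of the duplicate problem, here is a minimal sketch of folding case and whitespace differences together while collecting unique values. The helper names `normalizeName` and `uniqueNames`, and the assumption that each crop is a plain object with string properties, are hypothetical and not taken from db/practicalplants.js. Typos would still need a manual mapping.

```javascript
// Hypothetical helper: capitalize the first letter and lower-case the rest,
// so "lamiaceae " and "Lamiaceae" end up as the same value.
function normalizeName(value) {
  const trimmed = value.trim();
  if (trimmed.length === 0) return trimmed;
  return trimmed[0].toUpperCase() + trimmed.slice(1).toLowerCase();
}

// Hypothetical helper: collect the unique normalized values of one property.
function uniqueNames(crops, property) {
  const names = new Set();
  for (const crop of crops) {
    const value = crop[property];
    if (typeof value === "string" && value.trim() !== "") {
      names.add(normalizeName(value));
    }
  }
  return Array.from(names).sort();
}

// Invented sample data for illustration only.
const sampleFamilies = [
  { family: "Lamiaceae" },
  { family: "lamiaceae " },
  { family: "Rosaceae" },
  {},
];
console.log(uniqueNames(sampleFamilies, "family")); // [ 'Lamiaceae', 'Rosaceae' ]
```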

@gljivar
Contributor

gljivar commented Aug 15, 2019

Fertility has these unique values:
"self fertile"
"self fertile, self sterile"
"self sterile"

Should this be converted to just two possible values, "self fertile" and "self sterile"? Fertility would then be an array property, and objects whose fertility value is "self fertile, self sterile" would get both values in that array, the same way pollinator values are preparsed.
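
A minimal sketch of that conversion, assuming the raw value is a comma-separated string. The names `FERTILITY_VALUES` and `parseFertility` are hypothetical and this is not the implementation in readCrops():

```javascript
// Assumed set of allowed values after normalization.
const FERTILITY_VALUES = ["self fertile", "self sterile"];

// Hypothetical sketch: split the raw string and keep only known values.
function parseFertility(raw) {
  if (typeof raw !== "string") return [];
  return raw
    .split(",")
    .map((value) => value.trim().toLowerCase())
    .filter((value) => FERTILITY_VALUES.includes(value));
}

console.log(parseFertility("self fertile, self sterile"));
// [ 'self fertile', 'self sterile' ]
console.log(parseFertility("self sterile")); // [ 'self sterile' ]
console.log(parseFertility(undefined)); // []
```

Returning an empty array for a missing value keeps the property type uniform, so callers can always iterate over it.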

@petteripitkanen
Collaborator Author

Exactly, there are plants that have some opposition to self-fertilization but not total opposition.

@gljivar
Contributor

gljivar commented Aug 16, 2019

Actually, fertility values are already implemented.
I still need to check one thing: in the UI there may be family values showing in lowercase.

@petteripitkanen
Collaborator Author

There could be a test for checking that name properties always start with an upper case letter (#78).
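
A sketch of what such a check might look like. The function name `findLowercaseNames` and the list of name properties are assumptions; in a real test the result would be asserted to be empty.

```javascript
// Assumed list of properties that should start with an upper-case letter.
const NAME_PROPERTIES = ["family", "genus"];

// Hypothetical check: return every name value that does not start with
// an upper-case letter (Unicode-aware).
function findLowercaseNames(crops) {
  const problems = [];
  for (const crop of crops) {
    for (const property of NAME_PROPERTIES) {
      const value = crop[property];
      if (typeof value === "string" && !/^\p{Lu}/u.test(value)) {
        problems.push({ property, value });
      }
    }
  }
  return problems;
}

// Invented sample data for illustration only.
const sampleNames = [
  { family: "Lamiaceae", genus: "Rosmarinus" },
  { family: "rosaceae", genus: "Malus" },
];
console.log(findLowercaseNames(sampleNames));
// [ { property: 'family', value: 'rosaceae' } ]
```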

It might also be related to #73, I tried to debug this but so far only got annoyed, help wanted. :)

@gljivar
Contributor

gljivar commented Aug 18, 2019

After running migrate and deleting the locally indexed db as described in issue #73, it looks like all Genus and Family values are showing properly in the UI with an uppercase first letter.

Edit issue
However, I have noticed that after saving in Edit, the dropdown values are not immediately visible the next time Edit is opened. The change is visible only after the whole page is refreshed. Can you please double-check and raise an issue if needed?

Object properties extraction
Regarding extraction of object properties, I have noticed that this would be the first time object properties are extracted. Would you prefer to keep the same format as in the JSON file (an object with two properties), or to have two separate properties? For "edible part and use", that would mean the properties "edible part" and "edible part used for".
An additional complication is that, besides these properties, there are also crop array properties "edible parts" and "edible uses", and in some cases they contradict "edible part and use". For example, I found this for the crop Rosemary:
"edible part and use":{"part used":"Leaves","part used for":"Herbs"},
"edible parts":"flowers, leaves","edible uses":"Herb, Salad, Dry"

@petteripitkanen
Collaborator Author

There could be three properties edibleParts, medicinalParts, materialParts, and these would all be arrays that contain symmetrical objects that have the properties part and use.
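
A minimal sketch of that shape, assuming the raw value may be either a single object (as in the current JSON file) or an array of objects. The function name `parsePartsAndUses` and the lower-casing of values are assumptions for illustration:

```javascript
// Hypothetical sketch: normalize a raw "... part and use" value into an
// array of { part, use } objects.
function parsePartsAndUses(raw) {
  if (raw === undefined || raw === null) return [];
  const entries = Array.isArray(raw) ? raw : [raw];
  return entries
    .filter((entry) => entry && typeof entry === "object")
    .map((entry) => ({
      part: String(entry["part used"] || "").trim().toLowerCase(),
      use: String(entry["part used for"] || "").trim().toLowerCase(),
    }))
    .filter((entry) => entry.part !== "" || entry.use !== "");
}

// Raw value quoted earlier in this thread for Rosemary.
const rawRosemary = {
  "edible part and use": { "part used": "Leaves", "part used for": "Herbs" },
};
const rosemaryEdibleParts = parsePartsAndUses(rawRosemary["edible part and use"]);
console.log(rosemaryEdibleParts); // [ { part: 'leaves', use: 'herbs' } ]
```

The same function could be reused for the medicinal and material properties, since the objects share the same two keys.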

If you take a look at the document for Rosmarinus officinalis by using the practicalplants MediaWiki API, you can see that the property edible part and use is not a single object but an array, and this seems to be the case for other crops as well. I created #83 for fixing this.

It looks like the properties edible parts and edible uses are redundant. If needed, there could be utility functions getEdibleParts and getEdibleUses that take a Crop as an argument and determine these from the edibleParts property. There could also be a test to check whether these properties actually are redundant, as they seem to be.
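
A sketch of how those two utilities might look, assuming a Crop carries an edibleParts array of { part, use } objects as proposed in this comment. The `unique` helper and the sample crop are invented for illustration:

```javascript
// Helper: de-duplicate and sort a list of strings.
function unique(values) {
  return Array.from(new Set(values)).sort();
}

// Derive the list of edible parts from the edibleParts property.
function getEdibleParts(crop) {
  return unique((crop.edibleParts || []).map((entry) => entry.part));
}

// Derive the list of edible uses from the edibleParts property.
function getEdibleUses(crop) {
  return unique((crop.edibleParts || []).map((entry) => entry.use));
}

// Invented sample crop for illustration only.
const sampleCrop = {
  edibleParts: [
    { part: "leaves", use: "herbs" },
    { part: "flowers", use: "salad" },
    { part: "leaves", use: "tea" },
  ],
};
console.log(getEdibleParts(sampleCrop)); // [ 'flowers', 'leaves' ]
console.log(getEdibleUses(sampleCrop)); // [ 'herbs', 'salad', 'tea' ]
```

A redundancy test could then compare these derived lists against the raw edible parts and edible uses values for every crop and report the mismatches.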

@gljivar
Contributor

gljivar commented Aug 18, 2019

I would assume that in that case this issue could be closed or blocked, and another ticket specifying how to extract these values could be created and worked on after #83 is solved. I have checked, and as you mentioned it is an array in the original extract, but in the JSON file it is an object, so extraction depends on #83.

Rest of properties
I have analyzed the rest of the properties, and they contain interesting data. In some cases it might be useful for the matching algorithm, such as the areas where a plant can be found, or whether it has anything edible.
But the text is practically unique for every plant, so it would need some focus on what to extract and rules defining how. If you think it would be useful to work on this now, please let me know which property is the most useful to start extracting.

Here are statistics about unique records:

  • range has 4000 unique values, but if some rules are defined, where the plant grows could be extracted
  • habitat has 6000 unique values, and it might also be useful to define rules to extract where the plant grows
  • PFAF cultivation notes has more than 6000 unique values, and it might also be useful to define rules to extract how the plant is cultivated
  • PFAF propagation notes has more than 2000 unique values, and it might also be useful to define rules to extract how the plant should be stored and treated
  • PFAF material use notes has more than 2700 unique values; maybe not useful now, as it details how the plant is used as a material
  • PFAF toxicity notes has more than 680 unique values; not sure how useful it is for the algorithm, but a plant that has this note might be poisonous and could be excluded from the algorithm's calculations
  • PFAF medicinal use notes has more than 3200 unique values; maybe not useful now, as it details how the plant is used for medicine
  • PFAF edible use notes has more than 3700 unique values; maybe not useful now, as it details how to prepare the plant for consumption
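
Counts like these could be reproduced with something along the following lines. This is only a sketch: the function name `countUniqueValues` is hypothetical, the sample data is invented, and it assumes the crops have been loaded into an array of plain objects.

```javascript
// Hypothetical sketch: count distinct non-empty string values of a property.
function countUniqueValues(crops, property) {
  const values = new Set();
  for (const crop of crops) {
    const value = crop[property];
    if (typeof value === "string" && value.trim() !== "") {
      values.add(value.trim());
    }
  }
  return values.size;
}

// Invented sample data for illustration only.
const sampleCrops = [
  { range: "Europe", habitat: "Dry hills" },
  { range: "Europe", habitat: "Woodland" },
  { range: "E. Asia" },
];
for (const property of ["range", "habitat"]) {
  console.log(property, countUniqueValues(sampleCrops, property));
}
// range 2
// habitat 2
```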

@petteripitkanen
Collaborator Author

petteripitkanen commented Aug 26, 2019

About these textual properties, I haven't analyzed these completely yet, but especially range looks interesting as it is connected to the geographic location which will be used later for many other things. I opened #92 for this.

Some notes for the other properties:

It is likely that the textual properties need several iterations before the most useful properties and sets of values for them are found.

@petteripitkanen
Collaborator Author

Let's keep this issue open until there are more specific issues that cover all properties that we want to extract. These properties are already covered:

These properties are not yet covered by smaller issues:

  • fertility, @gljivar what was the status of your work on this?
  • Highly used textual properties habitat, PFAF cultivation notes, PFAF propagation notes, PFAF material use notes, PFAF toxicity notes, PFAF medicinal use notes, PFAF edible use notes.

@gljivar
Contributor

gljivar commented Aug 26, 2019

Fertility extraction was already implemented.

Ok sure. Issue can stay open.

@petteripitkanen
Collaborator Author

Sorry, I opened #95 for extracting data from textual properties.

Analyze practicalplants data automation moved this from In progress to Done Aug 26, 2019