Extract more data from practicalplants.json for the companionship algorithm #20

petteripitkanen · 2018-12-14T20:09:38Z

The function readCrops() in db/practicalplants.js selects useful properties from raw practicalplants.org data and normalizes their content to a format that is easier to handle for the companionship algorithm.

There are still quite a few properties that are not selected and normalized by readCrops(). It would be useful to have more data available for the creation of new goodness functions for the companionship algorithm.

Look for "TODO" in db/practicalplants.js.

gljivar · 2019-08-10T09:44:37Z

Could you maybe prioritize which properties are most important for the algorithm, to start with them?

petteripitkanen · 2019-08-10T11:02:44Z

Yes, these would be useful:

family
genus
edible part and use (1)
medicinal part and use (1)
material part and use (1)
fertility
salinity

(1) These are objects. The useful properties are part used and part used for; part use details is free-form text.

It might be possible to parse something from the free-form text of the following properties, but it is likely difficult:

range
habitat
PFAF cultivation notes
PFAF propagation notes
PFAF material use notes
PFAF toxicity notes
PFAF medicinal use notes
PFAF edible use notes

The rest look either useless for the algorithm or are missing from most crops.

gljivar · 2019-08-13T21:08:52Z

I have checked: family has 283 unique values, and genus has over 1000. What is your view on how to show them in the UI?
The dropdown concept, with ALL values in powerplants.js (e.g. ALL_FAMILY_VALUES), would not work well in these cases. I tried it, and it is not possible to scroll through the dropdown values. So most likely some kind of popup window would have to be used.

Should I just extract all these values in this ticket, and then open another ticket about how they would be shown in the UI?

petteripitkanen · 2019-08-13T22:39:28Z

I think a dropdown is OK to get started; it can then be fixed or improved in other issues and PRs. It is also fine to split this into two PRs: one for getting family and genus extracted and tested, and another for adding them to the UI.

Note that we currently use readCrops to preprocess the raw data, specifically to convert ALL_FAMILY_VALUES to PP_FAMILY_VALUES, and then there are tests for these in test/db/practicalplants.js. It is possible that family and genus contain duplicates caused by small differences such as upper/lower case letters and typos.
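
A minimal sketch of how case-only duplicates could be collapsed before building the value lists. The helper names normalizeTaxonName and uniqueTaxonNames are hypothetical and not part of db/practicalplants.js, and the sample values are made up. Typos are not handled here; those would need a separate mapping.

```javascript
// Hypothetical helper, not part of db/practicalplants.js.
// Trims whitespace and forces a "Capitalized" form so that values
// differing only in letter case collapse into one.
function normalizeTaxonName(raw) {
  const name = String(raw).trim().toLowerCase();
  return name.charAt(0).toUpperCase() + name.slice(1);
}

// Returns the sorted list of distinct normalized values.
function uniqueTaxonNames(rawValues) {
  return [...new Set(rawValues.map(normalizeTaxonName))].sort();
}

// Made-up sample values, for illustration only.
const sample = ['Lamiaceae', 'lamiaceae', ' Rosaceae', 'ROSACEAE'];
console.log(uniqueTaxonNames(sample)); // [ 'Lamiaceae', 'Rosaceae' ]
```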

gljivar · 2019-08-15T20:39:40Z

Fertility has these unique values:
"self fertile"
"self fertile, self sterile"
"self sterile"

Should this be converted to have just two possible values, "self fertile" and "self sterile"? Fertility would then be an array property, and crops whose fertility value is "self fertile, self sterile" would get both values in that array, the same way pollinator values are preparsed.
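
The proposal above could look roughly like this. It is only a sketch: parseFertility and FERTILITY_VALUES are hypothetical names, and the actual preprocessing in readCrops() may differ.

```javascript
// Hypothetical names; the real preprocessing in readCrops() may differ.
const FERTILITY_VALUES = ['self fertile', 'self sterile'];

// Splits a raw comma-separated fertility string into an array that
// contains only the two known values.
function parseFertility(raw) {
  if (!raw) return [];
  return raw
    .split(',')
    .map(value => value.trim().toLowerCase())
    .filter(value => FERTILITY_VALUES.includes(value));
}

console.log(parseFertility('self fertile, self sterile'));
// [ 'self fertile', 'self sterile' ]
```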

petteripitkanen · 2019-08-15T22:56:21Z

Exactly, there are plants that have some opposition to self-fertilization but not total opposition.

gljivar · 2019-08-16T18:31:47Z

Actually, fertility values are already implemented.
I still need to check one thing: some family values may be showing in lowercase in the UI.

petteripitkanen · 2019-08-17T13:25:23Z

There could be a test for checking that name properties always start with an upper case letter (#78).

It might also be related to #73, I tried to debug this but so far only got annoyed, help wanted. :)
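
A sketch of the check that such a test could run, assuming crops are plain objects with string properties. The function name findLowercaseNames is hypothetical, and the sample crop is made up.

```javascript
// Hypothetical check; family and genus are the name properties
// discussed in this thread.
function findLowercaseNames(crops, properties) {
  const offenders = [];
  for (const crop of crops) {
    for (const property of properties) {
      const value = crop[property];
      // \p{Lu} matches any Unicode upper case letter.
      if (typeof value === 'string' && !/^\p{Lu}/u.test(value)) {
        offenders.push({ property, value });
      }
    }
  }
  return offenders;
}

// Made-up sample data, for illustration only.
const crops = [{ family: 'Lamiaceae', genus: 'rosmarinus' }];
console.log(findLowercaseNames(crops, ['family', 'genus']));
// [ { property: 'genus', value: 'rosmarinus' } ]
```

A test would then assert that the returned array is empty for the full crop list.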

gljivar · 2019-08-18T11:15:32Z

After running migrate and deleting the locally indexed db as described in issue #73, it looks like all Genus and Family values are showing properly in the UI with an uppercase starting letter.

Edit issue
I have noticed that after saving in Edit, the dropdown values are not immediately visible the next time Edit is opened. The change is visible only after the whole page is refreshed. Can you please double check and raise an issue if needed?

Object properties extraction
Regarding the extraction of object properties: this would be the first time object properties are extracted. Would you prefer to keep the same format as in the json file (an object with two properties), or rather to have two separate properties? For "edible part and use", that would mean the properties "edible part" and "edible part used for".
An additional complication is that, besides these properties, there are also the crop array properties "edible parts" and "edible uses", and in some cases they contradict "edible part and use". For example, I found this for the crop Rosemary:
"edible part and use":{"part used":"Leaves","part used for":"Herbs"},
"edible parts":"flowers, leaves","edible uses":"Herb, Salad, Dry"

petteripitkanen · 2019-08-18T16:21:04Z

There could be three properties edibleParts, medicinalParts, materialParts, and these would all be arrays that contain symmetrical objects that have the properties part and use.

If you take a look at the document for Rosmarinus officinalis by using the practicalplants MediaWiki API, you can see that the property edible part and use is not a single object but an array, and this seems to be the case for other crops as well. I created #83 for fixing this.
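
A rough sketch of how the raw value could be turned into the proposed array of { part, use } objects, accepting either a single object (as in the current json file) or an array (as in the MediaWiki API output). The helper name toPartsAndUses is hypothetical.

```javascript
// Hypothetical helper: accepts a single object, an array of objects,
// or a missing value, and always returns an array of { part, use }.
function toPartsAndUses(raw) {
  if (raw == null) return [];
  const entries = Array.isArray(raw) ? raw : [raw];
  return entries.map(entry => ({
    part: entry['part used'],
    use: entry['part used for'],
  }));
}

// The Rosemary value quoted earlier in this thread.
const rosemary = {
  'edible part and use': { 'part used': 'Leaves', 'part used for': 'Herbs' },
};
console.log(toPartsAndUses(rosemary['edible part and use']));
// [ { part: 'Leaves', use: 'Herbs' } ]
```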

It looks like the properties edible parts and edible uses are redundant, if needed there could be utility functions getEdibleParts and getEdibleUses that take a Crop as an argument and determine these from the edibleParts property. There could also be a test to check if these properties actually are redundant as they seem to be.
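
A sketch of what these utility functions and the redundancy check might look like, assuming edibleParts is an array of { part, use } objects and the raw edible parts value is a comma-separated string. The helpers splitList and missingFromDerived are hypothetical names.

```javascript
// Distinct, lowercased, sorted values of one key in crop.edibleParts.
function distinctValues(crop, key) {
  return [...new Set(crop.edibleParts.map(entry => entry[key].toLowerCase()))].sort();
}

function getEdibleParts(crop) {
  return distinctValues(crop, 'part');
}

function getEdibleUses(crop) {
  return distinctValues(crop, 'use');
}

// Hypothetical helper: splits a raw comma-separated list property.
function splitList(raw) {
  return raw.split(',').map(value => value.trim().toLowerCase()).filter(Boolean);
}

// Hypothetical helper: values of the raw list that cannot be derived.
// An empty result means the raw property is redundant for that crop.
function missingFromDerived(rawList, derived) {
  return splitList(rawList).filter(value => !derived.includes(value));
}

// Rosemary, using the single-object data quoted earlier in this thread.
const rosemary = { edibleParts: [{ part: 'Leaves', use: 'Herbs' }] };
console.log(missingFromDerived('flowers, leaves', getEdibleParts(rosemary)));
// [ 'flowers' ]
```

With only the single object from the json file, "flowers" cannot be derived, so the check is only meaningful once the full array is available.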

gljivar · 2019-08-18T21:49:02Z

I would assume that in that case this issue could be closed or marked as blocked, and another ticket could be created that specifies how to extract these values, to be done after issue #83 is solved. I have checked, and as you mentioned it is an array in the original extract, but in the json file it is an object, so the extraction depends on #83.

Rest of properties
I have analyzed the rest of the properties, and they contain interesting data. Some of it might be useful for the matching algorithm, such as the areas where a plant can be found, or whether it has anything edible.
But the text is practically unique for every plant, so it would need some focus on what to extract and rules for how to extract it. If you think it would be useful to work on this now, please let me know which property is the most useful to start with.

Here are statistics about unique values:

range has 4000 unique values, but if some rules are defined, the areas where the plant grows could be extracted
habitat has 6000 unique values, and rules could be defined to extract where the plant grows
PFAF cultivation notes has more than 6000 unique values, and rules could be defined to extract how the plant is cultivated
PFAF propagation notes has more than 2000 unique values, and rules could be defined to extract how the plant should be stored and treated
PFAF material use notes has more than 2700 unique values, maybe not useful now as it details how the plant is used as material
PFAF toxicity notes has more than 680 unique values; not sure how useful it is for the algorithm, but a plant that has this note may be poisonous and could be excluded from the algorithm calculations
PFAF medicinal use notes has more than 3200 unique values, maybe not useful now as it details how the plant is used for medicine
PFAF edible use notes has more than 3700 unique values, maybe not useful now as it details how to prepare the plant for consumption
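
Counts like these can be reproduced with a small helper along the following lines. This is a sketch that assumes the crops are available as an array of plain objects; countUniqueValues is a hypothetical name and the sample data is made up.

```javascript
// Hypothetical helper: counts distinct non-empty string values of one
// property across all crops.
function countUniqueValues(crops, property) {
  const values = new Set();
  for (const crop of crops) {
    const value = crop[property];
    if (typeof value === 'string' && value.trim() !== '') {
      values.add(value.trim());
    }
  }
  return values.size;
}

// Made-up sample data, for illustration only.
const crops = [
  { range: 'Europe' },
  { range: 'Europe' },
  { range: 'E. Asia' },
  { habitat: 'Woodland' },
];
console.log(countUniqueValues(crops, 'range')); // 2
```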

petteripitkanen · 2019-08-26T17:59:57Z

About these textual properties, I haven't analyzed these completely yet, but especially range looks interesting as it is connected to the geographic location which will be used later for many other things. I opened #92 for this.

Some notes for the other properties:

Maybe PFAF edible use notes, PFAF medicinal use notes, PFAF material use notes can be used in conjunction with edibleParts, medicinalParts, materialParts (the properties from Extract properties: edible part and use, medicinal part and use, material part and use #91), to find more data for these properties.
Maybe PFAF toxicity notes could be graded and/or categorized.

It is likely that the textual properties need several iterations before the most useful properties and sets of values for them are found.

petteripitkanen · 2019-08-26T18:19:28Z

Let's keep this issue open until there are more specific issues that cover all properties that we want to extract. These properties are already covered:

Little used properties (Analyze little used properties from practicalplants.org data #74)
Object properties edible part and use, medicinal part and use, material part and use (Extract properties: edible part and use, medicinal part and use, material part and use #91)
range (Extract property for areas where plant grows #92)

These properties are not yet covered by smaller issues:

fertility, @gljivar what was the status of your work on this?
Highly used textual properties habitat, PFAF cultivation notes, PFAF propagation notes, PFAF material use notes, PFAF toxicity notes, PFAF medicinal use notes, PFAF edible use notes.

gljivar · 2019-08-26T19:48:04Z

Fertility extraction was already implemented.

Ok sure. Issue can stay open.

petteripitkanen · 2019-08-26T21:07:45Z

Sorry, I opened #95 for extracting data from textual properties.

petteripitkanen added the good first issue and companionship algorithm labels Dec 14, 2018

gljivar mentioned this issue Aug 15, 2019

Extract values of family and genus properties and show them in UI #71

Merged

petteripitkanen mentioned this issue Aug 15, 2019

Analyze little used properties from practicalplants.org data #74

Open

gljivar mentioned this issue Aug 17, 2019

Extract salinity values and show them in UI #75

Merged

petteripitkanen added this to To do in Analyze practicalplants data Aug 26, 2019

petteripitkanen moved this from To do to In progress in Analyze practicalplants data Aug 26, 2019

petteripitkanen mentioned this issue Aug 26, 2019

Extract data from practicalplants properties that contain textual data #95

Open

petteripitkanen closed this as completed Aug 26, 2019

Analyze practicalplants data automation moved this from In progress to Done Aug 26, 2019
