
Wikidata #778

Merged
merged 23 commits into from Jul 29, 2019

Conversation

almudenasanz
Collaborator

@almudenasanz almudenasanz commented May 9, 2019

Description

The final objective is to use Wikidata as a new Knowledge Graph for recommendation algorithms, and to extract entity descriptions so that new datasets (like Movielens) can be used with DKN. This is the first step in that direction. I have implemented:

New utility functions to run specific queries against Wikidata:

  • Query the list of related entities from a string representing the name of an entity. The goal is to be able to create a Knowledge Graph from the linked entities in Wikidata
  • Query an entity's description from a string representing the name of an entity

To test the new functions I have added a new notebook. The first section consists of creating a Knowledge Graph from the linked entities in Wikidata and visualising the resulting KG. The second part tests enriching the name of an entity with its description and list of related entities; the goal is to use this enrichment for new datasets (like Movielens) with DKN.
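
As a rough illustration, here is a minimal sketch of how these utilities could be chained for a single entity name (the helper names also appear later in this PR; the import path and the DataFrame column names beyond original_entity are assumptions):

import pandas as pd

# assumed import path: the helpers are added in reco_utils/dataset/wikidata.py in this PR
from reco_utils.dataset.wikidata import (
    find_wikidataID,
    query_entity_links,
    read_linked_entities,
)

name = "The Godfather"
entity_id = find_wikidataID(name)  # a Wikidata ID such as "Q...", or "entityNotFound"
if entity_id != "entityNotFound":
    json_links = query_entity_links(entity_id)  # raw JSON describing the entity's links
    related_entities, related_names = read_linked_entities(json_links)
    kg_edges = pd.DataFrame({
        "original_entity": [entity_id] * len(related_entities),
        "linked_entity": related_entities,
        "linked_entity_name": related_names,
    })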

Related Issues

#525

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests. -> I have added tests in the notebook, should I add more?
  • I have updated the documentation accordingly.

@almudenasanz almudenasanz self-assigned this May 9, 2019
@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/recommenders/pull/778

Visit www.reviewnb.com to know how we simplify your Jupyter Notebook workflows.

@msftclas

msftclas commented May 9, 2019

CLA assistant check
All CLA requirements met.

Collaborator

@yueguoguo yueguoguo left a comment

Thanks for the work!

In general, it would be great to have

  1. More text and background introduction about knowledge graphs. Since it can be a very big topic, it would probably be good to focus the notebook on something that directly relates to recommendation (e.g., DKN), while entity linking can be a core part of the notebook.
  2. In the repository, we have developed a DevOps pipeline that helps test the code in the utility functions and notebooks. Try to understand how it works and how to write good unit tests for your functions. Good examples can be found in the tests folder.
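
A minimal sketch of what such a unit test could look like, assuming find_wikidataID is exposed from reco_utils/dataset/wikidata.py and returns the "entityNotFound" sentinel on failure. Since it hits the live API, it may fit better in the smoke or integration suites than in the fast unit tests:

# hypothetical file: tests/unit/test_wikidata.py
from reco_utils.dataset.wikidata import find_wikidataID


def test_find_wikidataID():
    # a well-known title should resolve to a Wikidata Q-identifier
    assert find_wikidataID("The Godfather").startswith("Q")

    # a nonsense title should return the sentinel value used by the helper
    assert find_wikidataID("zzzz-no-such-title-zzzz") == "entityNotFound"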

entityID: wikidata entityID corresponding to the title string.
'entityNotFound' will be returned if no page is found
"""
url = "https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&format=json&titles="
Collaborator

Hardcoded string may not be desirable. Make it an input variable or constant.

Collaborator Author

Hi, sorry for the delay in answering. Since the request for the new query (I had to make some changes) is a concatenation of strings and a variable:

requests.get("https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch="+name+"&format=json&prop=pageprops&ppprop=wikibase_item")

how do you suggest I do this? Should I make the two substrings constants and concatenate them in the function?

Collaborator

here's an approach

# defined at beginning of module
import urllib.parse

import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def find_wikidataID(name):
    # build the query string from its parts instead of hardcoding one long URL
    url_opts = "&".join([
        "action=query",
        "list=search",
        "srsearch={}".format(urllib.parse.quote(name)),
        "format=json",
        "prop=pageprops",
        "ppprop=wikibase_item",
    ])
    return requests.get("{url}?{opts}".format(url=API_URL, opts=url_opts))

Collaborator Author

I implemented this here: a5338c0 Thanks a lot for the tip!

try:
    entityID = r.json()["query"]["pages"][entityID]["pageprops"]["wikibase_item"]
except:
    entityID = "entityNotFound"
Collaborator

why returning a string vs raising an exception?

Collaborator Author

I return a string for the cases where the function is used in a loop, so we can have a record of that entity not having a response and the code can keep running. Do you recommend doing something else?

Collaborator

this can vary a bit, but often I find it is easier to raise the exception here and let the calling function catch it and handle it as needed. This removes the need to check the output for a specific string defined here and so there is looser coupling across functions and more flexibility in handling errors upstream.
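
A minimal sketch of that pattern, reusing the lookup under review but raising a (hypothetical) EntityNotFoundError instead of returning the sentinel string; the payloads below are fake responses so the sketch runs standalone:

class EntityNotFoundError(Exception):
    """Raised when a title cannot be resolved to a Wikidata entity ID."""


def extract_entity_id(response_json, page_id):
    # mirrors the try/except under review, but raises instead of returning "entityNotFound"
    try:
        return response_json["query"]["pages"][page_id]["pageprops"]["wikibase_item"]
    except KeyError:
        raise EntityNotFoundError(page_id)


# the calling loop keeps running and records failures however it prefers
payloads = {
    "Known film": ("123", {"query": {"pages": {"123": {"pageprops": {"wikibase_item": "Q12345"}}}}}),
    "Unknown film": ("-1", {"query": {"pages": {"-1": {"missing": ""}}}}),
}
entity_ids = {}
for title, (page_id, payload) in payloads.items():
    try:
        entity_ids[title] = extract_entity_id(payload, page_id)
    except EntityNotFoundError:
        entity_ids[title] = None  # recorded; the loop continues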

@miguelgfierro
Collaborator

In the repository, we have developed a DevOps pipeline that helps test the code in the utility functions and notebooks. Try to understand how it works and how to write good unit tests for your functions. Good examples can be found in the tests folder.

@almudenasanz let me know if you need help with this. There is information here https://github.com/microsoft/recommenders/tree/master/tests

@miguelgfierro
Collaborator

Hi @almudenasanz, we were talking internally. @chenhuims is going to work on KG networks with you. He will probably do CKE, which will complement your work on Ripple and KGCN.

We thought it would be interesting if the final output of the notebook you are working on could be the knowledge graph of Movielens as a dumped file (in the format used by the networks), which we would save in a blob. Then the notebooks of the KG networks would start from that saved file. After that, you guys could work in parallel.

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@miguelgfierro
Collaborator

One note on ways of working for @almudenasanz and @chenhuims. In other situations where several people work on the same issue, they agree on the tasks to work on (depending on bandwidth) and then either push to the same branch (in your case it would be wikidata) or one person does a PR to that branch.

@almudenasanz
Collaborator Author

almudenasanz commented May 14, 2019

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@Leavingseason @miguelgfierro In the Query Limits section of the documentation (https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual) they mention: "the service is limited to 5 parallel queries per IP". It seems the only limit is on parallel queries per IP, not on sequential queries.

@gramhagen
Collaborator

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

@miguelgfierro
Collaborator

miguelgfierro commented May 16, 2019

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

The latest release is from 2017, is this maintained?

@gramhagen
Collaborator

Hmm, there are comments from the maintainer on issues that are more recent (Dec 2018); it's possible the API hasn't changed enough to warrant any updates?

@almudenasanz
Collaborator Author

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

Hi @gramhagen, I looked into the package, but for the specific operations that I use in the notebook:

  • Retrieve a Wikipedia page title from a search query
  • Retrieve a Wikidata item identifier from a page title
  • Retrieve all connected Wikidata item identifiers from a Wikidata item identifier

I did not find implementations in the package that are simpler than the ones I implemented. But I'm happy to discuss any suggestions!

@almudenasanz
Collaborator Author

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@miguelgfierro I added to the notebook a new section that extracts the entities for the 100k Movielens version.

I had to reimplement the find_wikidataID method a bit, because I was matching strings to exact Wikipedia page titles. Some of the movie titles did not exactly match the Wikipedia title, so I added a simple text query that retrieves the first matching page title, and it now works well for all movies (a sketch of this two-step lookup is at the end of this comment).

I need to ask you where to put the output file of the MovieLens KG
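
A minimal sketch of that two-step lookup against the MediaWiki API (the helper names are placeholders; the actual implementation in reco_utils/dataset/wikidata.py may differ):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def first_matching_title(text):
    # full-text search: take the title of the first result, if any
    params = {"action": "query", "list": "search", "srsearch": text, "format": "json"}
    hits = requests.get(API_URL, params=params).json()["query"]["search"]
    return hits[0]["title"] if hits else None


def wikibase_item_for_title(title):
    # resolve an exact page title to its Wikidata entity ID (e.g. "Q...")
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API_URL, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("pageprops", {}).get("wikibase_item")


# a Movielens title such as "Toy Story (1995)" is not always an exact page title,
# but the search step recovers a usable one
title = first_matching_title("Toy Story (1995) film")
entity_id = wikibase_item_for_title(title) if title else None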

@gramhagen
Collaborator

ok, no problem, wasn't sure if it would be helpful to leverage that, but sounds like it's not in this case. thanks for checking into it

@miguelgfierro
Collaborator

I need to ask you where to put the output file of the MovieLens KG

we are changing the blobs of recommenders; can you send me the file somehow? Then I'll upload it to the correct place.

@miguelgfierro
Collaborator

hey @almudenasanz @chenhuims, could you please give an update on the state of this PR and the work you are doing with Movielens + Wikidata?

Please let me know if you have any blockers

@miguelgfierro
Collaborator

this looks good @almudenasanz. One question, how long does it take to compute the small KG with movielens 100k?

Can you add a test for the notebook? Depending on the time, it will go in the unit tests or maybe the smoke tests.

Here info on how to add the test https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

Collaborator Author

@almudenasanz almudenasanz left a comment

this looks good @almudenasanz. One question, how long does it take to compute the small KG with movielens 100k?

It takes 1-2 seconds per entry (combining the text query to the Wikidata entity ID and finding related entities), so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins. The time it will take for the other KGs will depend on the number of movies reviewed. According to this link https://grouplens.org/datasets/movielens/ the amounts are:

  • Movielens 100K: ~1,700 movies
  • Movielens 1M: ~4,000 movies
  • Movielens 10M: ~10,000 movies

The API supports up to 5 parallel queries from the same IP, so we could reduce the time by a factor of 5 (see the sketch at the end of this comment).

Can you add a test for the notebook? Depending on the time, it will go in the unit tests or maybe the smoke tests.

Here info on how to add the test https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

I will look into the tests
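
A minimal sketch of that parallelization, capped at the 5 parallel queries per IP allowed by the service (the lookup function here is a stand-in for the real per-movie query):

from concurrent.futures import ThreadPoolExecutor


def lookup(title):
    # stand-in for the real per-movie work (entity ID + linked entities);
    # it just echoes the title so the sketch runs standalone
    return title


titles = ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"]

# the Wikidata query service allows at most 5 parallel queries per IP
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lookup, titles))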

@miguelgfierro
Collaborator

so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins.

wow that's a lot. Maybe we can think of a way of running the first part (without movielens) in the unit tests, and then do some queries (not all) in the movielens part. Ideally we want the unit test to be less than one min and integration less than 15-20min.

Do you have any idea on how to execute the notebook under the times I mentioned?

@almudenasanz
Collaborator Author

so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins.

wow that's a lot. Maybe we can think of a way of running the first part (without movielens) in the unit tests, and then do some queries (not all) in the movielens part. Ideally we want the unit test to be less than one min and integration less than 15-20min.

Do you have any idea on how to execute the notebook under the times I mentioned?

I can create a parameter to run the tests only on a sample of the movielens dataset, would that work?

@miguelgfierro
Collaborator

I can create a parameter to run the tests only on a sample of the movielens dataset, would that work?

yeah that's reasonable. I think a good way of doing this would be to have a unit test that just checks that the first part of the notebook runs, similarly to this example. Then we can have what you suggested in the integration test and check programmatically some outputs like in this example. Maybe check that the first rows of the KG of movielens are correctly created. Kind of similar to what we are doing in the criteo tests.

Under this schema, maybe the integration tests take less than 5min
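
A minimal sketch of that split using papermill (the notebook path, parameter names, and pytest markers are placeholders; the repo's actual test helpers and fixtures may differ):

import papermill as pm
import pytest

NOTEBOOK = "notebooks/01_prepare_data/wikidata.ipynb"  # placeholder path
OUTPUT_NOTEBOOK = "output.ipynb"


@pytest.mark.notebooks
def test_wikidata_runs():
    # unit test: execute the notebook on a tiny sample so it stays well under a minute
    pm.execute_notebook(
        NOTEBOOK,
        OUTPUT_NOTEBOOK,
        parameters=dict(MOVIELENS_SAMPLE=True, MOVIELENS_SAMPLE_SIZE=5),
    )


@pytest.mark.integration
def test_wikidata_integration():
    # integration test: larger sample; recorded outputs (e.g. the first rows of the
    # Movielens KG) can then be read back from the executed notebook and checked
    pm.execute_notebook(
        NOTEBOOK,
        OUTPUT_NOTEBOOK,
        parameters=dict(MOVIELENS_SAMPLE=True, MOVIELENS_SAMPLE_SIZE=50),
    )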

@almudenasanz
Collaborator Author

yeah that's reasonable. I think a good way of doing this would be to have a unit test that just checks that the first part of the notebook runs, similarly to this example. Then we can have what you suggested in the integration test and check programmatically some outputs like in this example.

Great, I already started working on them, and querying the 1M MovieLens dataset. I will commit them when I finish

@almudenasanz
Collaborator Author

I added the tests, and sent you the 1M file by email

I introduced sample parameters so both the unit and integration tests can run in under 1 min, and the integration test checks that the output of the file has the expected number of responses.

@miguelgfierro
Collaborator

miguelgfierro commented Jul 27, 2019

@almudenasanz, I solved the conflicts but one test failed, there is a small error in the code:

E           /anaconda/envs/reco_base/lib/python3.6/site-packages/tqdm/_tqdm.py in wrapper(*args, **kwargs)
E               673                     # take a fast or slow code path; so stop when t.total==t.n
E               674                     t.update(n=1 if not t.total or t.n < t.total else 0)
E           --> 675                     return func(*args, **kwargs)
E               676 
E               677                 # Apply the provided function (in **kwargs)
E           
E           <ipython-input-19-6ccb9974139b> in <lambda>(x)
E                 1 tqdm().pandas(desc="Number of movies completed")
E           ----> 2 result = pd.concat(list(movies.progress_apply(lambda x: wikidata_KG_from_movielens(x), axis=1)))
E           
E           <ipython-input-17-3cd273a9c115> in wikidata_KG_from_movielens(df)
E                 3     entity_id = find_wikidataID(df["Title"] + " film")
E                 4     if entity_id != "entityNotFound":
E           ----> 5         json_links = query_entity_links(entity_id)
E                 6         related_entities,related_names = read_linked_entities(json_links)
E                 7         d = pd.DataFrame({"original_entity":[entity_id]* len(related_entities),
E           
E           /data/home/recocat/cicd/3/s/reco_utils/dataset/wikidata.py in query_entity_links(entityID)
E               100         data = r.json()
E               101     except:
E           --> 102         print(e)
E               103         print("Entity ID not Found in Wikidata")
E               104         return {}
E           
E           NameError: ("name 'e' is not defined", 'occurred at index 21')

just fixed it
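
The failure is a bare except: that prints the undefined name e; a minimal sketch of the likely fix is simply to bind the exception (the helper below is a reduced stand-in for query_entity_links, not the file's exact contents):

def parse_links_response(r):
    # r is a requests.Response from the Wikidata query; on any parsing failure,
    # report the error and return an empty dict so calling loops keep running
    try:
        return r.json()
    except Exception as e:  # binding the exception makes print(e) valid
        print(e)
        print("Entity ID not Found in Wikidata")
        return {}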

@miguelgfierro
Collaborator

@almudenasanz there is a time out:

$ pytest --durations=0 tests/unit/ -m "notebooks and not spark and not gpu"
======================================================= test session starts ========================================================
platform linux -- Python 3.6.8, pytest-4.2.1, py-1.7.0, pluggy-0.8.1
rootdir: /data/home/recocat/notebooks/miguel/Recommenders, inifile:
collected 163 items / 153 deselected / 10 selected

tests/unit/test_notebook_utils.py .                                                                                          [ 10%]
tests/unit/test_notebooks_python.py .........                                                                                [100%]

========================================================= warnings summary =========================================================
/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75
/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75
  /anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
    return _inspect.getargspec(target)

reco_utils/recommender/ncf/dataset.py:321
  /data/home/recocat/notebooks/miguel/Recommenders/reco_utils/recommender/ncf/dataset.py:321: DeprecationWarning: invalid escape sequence \[
    """

/anaconda/envs/reco_base/lib/python3.6/site-packages/nbconvert/exporters/exporter_locator.py:28
  /anaconda/envs/reco_base/lib/python3.6/site-packages/nbconvert/exporters/exporter_locator.py:28: DeprecationWarning: `nbconvert.exporters.exporter_locator` is deprecated in favor of `nbconvert.exporters.base` since nbconvert 5.0.
    DeprecationWarning)

reco_utils/recommender/rbm/rbm.py:452
  /data/home/recocat/notebooks/miguel/Recommenders/reco_utils/recommender/rbm/rbm.py:452: DeprecationWarning: invalid escape sequence \s
    """

tests/unit/test_notebook_utils.py::test_is_jupyter
  /anaconda/envs/reco_base/lib/python3.6/site-packages/jupyter_client/session.py:371: DeprecationWarning: Session._key_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
    def _key_changed(self):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
====================================================== slowest test durations ======================================================
1294.08s call     tests/unit/test_notebooks_python.py::test_wikidata_runs
217.83s call     tests/unit/test_notebooks_python.py::test_vw_deep_dive_runs
120.90s call     tests/unit/test_notebooks_python.py::test_rlrmc_quickstart_runs
45.77s call     tests/unit/test_notebooks_python.py::test_lightgbm
40.65s call     tests/unit/test_notebooks_python.py::test_surprise_deep_dive_runs
19.60s call     tests/unit/test_notebooks_python.py::test_sar_deep_dive_runs
16.61s call     tests/unit/test_notebooks_python.py::test_baseline_deep_dive_runs
12.61s call     tests/unit/test_notebooks_python.py::test_sar_single_node_runs
3.51s call     tests/unit/test_notebooks_python.py::test_template_runs
2.55s call     tests/unit/test_notebook_utils.py::test_is_jupyter

(0.00 durations hidden.  Use -vv to show these durations.)
===================================== 10 passed, 153 deselected, 6 warnings in 1777.51 seconds 

@almudenasanz
Collaborator Author

I have changed the default parameters of the notebook to do the sampling, and the tests have passed. It seems the injection of the parameters for the test was not working.

@miguelgfierro miguelgfierro merged commit d4bb65a into staging Jul 29, 2019
@miguelgfierro miguelgfierro deleted the wikidata branch July 29, 2019 09:16
@gramhagen gramhagen mentioned this pull request Jul 30, 2019
3 tasks
gramhagen added a commit that referenced this pull request Jul 31, 2019
* new file with wikidata functions

* fix in json extraction

* new notebook with wikidata use examples

* retry request with lowercase in case of failure

* WIP: example creating KG from movielens entities

* introduced new step to retrieve first page title from a text query in wikipedia

* updated movielens links extraction using wikidata

* adapted docstrings for sphinx and removed parenthesis from output

* added description and labels to nodes to graph preview

* #778 (comment) new format for queries

* raising exceptions in requests and using get() to retrieve dict values

* moved imports to first cell and movielens size as a parameter

* output file name as paramenter

* DATA: update sum check

* adding unit test for sum to 1 issue

* improved description and adapted to tests

* improved Exception descriptions

* integration tests

* unit tests

* added wikidata_KG to conftest

* changed name notebook

* *NOTE: Adding  shows the computation time of all tests.*

* imports up

* Update wikidata.py

* changed default parameter of sample for tests

* Add sphinx documentation for wikidata

* modified parameter extraction for tests

* added parameters tag to cell

* changed default sampling to test parameters in test

* notebook cleaned cells output

* Docker Support (#718)

* DOCKER: add pyspark docker file

* DOCKER: remove unused line

* DOCKER: remove old file

* DOCKER: add SETUP text

* DOCKER: add azureml`

* DOCKER: udpate dockerfile

* DOCKER: use a branch of the repo

* SETUP: update setup

* DOCKER: update dockerfile

* DOC: update setup

* DOCKER: one that binds all

* SETUP: update docker use

* DOCKER: move to top level

* SETUP: use a different base name

* DOCKER: use the same keywords in the repo for environment arg

* SETUP: update environment variable names

* updating dockerfile to use multistage build and adding readme

* adding full stage

* fixing documentation

* adding info for running full env

* README: update notes for exporting environment on certain platform

* README: updated with example on Windows

* README: fix typo
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019