
Wikidata #778

Merged
merged 23 commits into from Jul 29, 2019

Conversation

almudenasanz
Collaborator

@almudenasanz almudenasanz commented May 9, 2019

Description

The final objective is to use Wikidata as a new Knowledge Graph for recommendation algorithms, and to extract entity descriptions so that new datasets (like Movielens) can be used with DKN. This is the first step in that direction. I have implemented:

New utility functions to run specific queries against Wikidata:

  • Query the list of related entities from a string representing the name of an entity. The goal is to be able to create a Knowledge Graph from the linked entities in Wikidata
  • Query an entity's description from a string representing the name of an entity

To test the new functions I have added a new notebook. The first section consists of creating a Knowledge Graph from the linked entities in Wikidata and visualising the resulting KG. The second part tests enriching the name of an entity with its description and list of related entities; the goal is to use this enrichment for new datasets (like Movielens) with DKN.
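
As a rough illustration, here is a minimal sketch of how these utilities could be chained for a single entity name (the helper names also appear later in this PR; the import path and the DataFrame column names beyond original_entity are assumptions):

import pandas as pd

# assumed import path: the helpers are added in reco_utils/dataset/wikidata.py in this PR
from reco_utils.dataset.wikidata import (
    find_wikidataID,
    query_entity_links,
    read_linked_entities,
)

name = "The Godfather"
entity_id = find_wikidataID(name)  # a Wikidata ID such as "Q...", or "entityNotFound"
if entity_id != "entityNotFound":
    json_links = query_entity_links(entity_id)  # raw JSON describing the entity's links
    related_entities, related_names = read_linked_entities(json_links)
    kg_edges = pd.DataFrame({
        "original_entity": [entity_id] * len(related_entities),
        "linked_entity": related_entities,
        "linked_entity_name": related_names,
    })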

Related Issues

#525

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests. -> I have added tests in the notebook, should I add more?
  • I have updated the documentation accordingly.

@almudenasanz almudenasanz self-assigned this May 9, 2019
@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/recommenders/pull/778

Visit www.reviewnb.com to know how we simplify your Jupyter Notebook workflows.

@msftclas

msftclas commented May 9, 2019

CLA assistant check
All CLA requirements met.

Collaborator

@yueguoguo yueguoguo left a comment

Thanks for the work!

In general, it would be great to have

  1. More text and background introduction about knowledge graphs. Since it can be a very big topic, it would probably be good to focus the notebook on something that directly relates to recommendation (e.g., DKN), while entity linking can be a core part of the notebook.
  2. In the repository, we have developed a DevOps pipeline that helps test the code in the utility functions and notebooks. Try to understand how it works and how to write good unit tests for your functions. Good examples can be found in the tests folder.
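
A minimal sketch of what such a unit test could look like, assuming find_wikidataID is exposed from reco_utils/dataset/wikidata.py and returns the "entityNotFound" sentinel on failure. Since it hits the live API, it may fit better in the smoke or integration suites than in the fast unit tests:

# hypothetical file: tests/unit/test_wikidata.py
from reco_utils.dataset.wikidata import find_wikidataID


def test_find_wikidataID():
    # a well-known title should resolve to a Wikidata Q-identifier
    assert find_wikidataID("The Godfather").startswith("Q")

    # a nonsense title should return the sentinel value used by the helper
    assert find_wikidataID("zzzz-no-such-title-zzzz") == "entityNotFound"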

entityID: wikidata entityID corresponding to the title string.
'entityNotFound' will be returned if no page is found
"""
url = "https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&format=json&titles="
Collaborator

Hardcoded string may not be desirable. Make it an input variable or constant.

Collaborator Author

Hi, sorry for the delay in answering. Since the request for the new query (I had to make some changes) is a concatenation of strings and a variable:

requests.get("https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch="+name+"&format=json&prop=pageprops&ppprop=wikibase_item")

how do you suggest I do this? Should I make the two substrings constants and concatenate them in the function?

Collaborator

here's an approach

# defined at beginning of module
import urllib.parse

import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def find_wikidataID(name):
    # build the query string from its parts instead of hardcoding one long URL
    url_opts = "&".join([
        "action=query",
        "list=search",
        "srsearch={}".format(urllib.parse.quote(name)),
        "format=json",
        "prop=pageprops",
        "ppprop=wikibase_item",
    ])
    return requests.get("{url}?{opts}".format(url=API_URL, opts=url_opts))

Collaborator Author

I implemented this here: a5338c0 Thanks a lot for the tip!

try:
    entityID = r.json()["query"]["pages"][entityID]["pageprops"]["wikibase_item"]
except:
    entityID = "entityNotFound"
Collaborator

why returning a string vs raising an exception?

Collaborator Author

I return a string for the cases where the function is used in a loop, so we can have a record of that entity not having a response and the code can keep running. Do you recommend doing something else?

Collaborator

this can vary a bit, but often I find it is easier to raise the exception here and let the calling function catch it and handle it as needed. This removes the need to check the output for a specific string defined here and so there is looser coupling across functions and more flexibility in handling errors upstream.
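
A minimal sketch of that pattern, reusing the lookup under review but raising a (hypothetical) EntityNotFoundError instead of returning the sentinel string; the payloads below are fake responses so the sketch runs standalone:

class EntityNotFoundError(Exception):
    """Raised when a title cannot be resolved to a Wikidata entity ID."""


def extract_entity_id(response_json, page_id):
    # mirrors the try/except under review, but raises instead of returning "entityNotFound"
    try:
        return response_json["query"]["pages"][page_id]["pageprops"]["wikibase_item"]
    except KeyError:
        raise EntityNotFoundError(page_id)


# the calling loop keeps running and records failures however it prefers
payloads = {
    "Known film": ("123", {"query": {"pages": {"123": {"pageprops": {"wikibase_item": "Q12345"}}}}}),
    "Unknown film": ("-1", {"query": {"pages": {"-1": {"missing": ""}}}}),
}
entity_ids = {}
for title, (page_id, payload) in payloads.items():
    try:
        entity_ids[title] = extract_entity_id(payload, page_id)
    except EntityNotFoundError:
        entity_ids[title] = None  # recorded; the loop continues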

@miguelgfierro
Collaborator

In the repository, we have developed a DevOps pipeline that helps test the code in the utility functions and notebooks. Try to understand how it works and how to write good unit tests for your functions. Good examples can be found in the tests folder.

@almudenasanz let me know if you need help with this. There is information here https://github.com/microsoft/recommenders/tree/master/tests

@miguelgfierro
Collaborator

Hi @almudenasanz, we were talking internally. @chenhuims is going to work on KG networks with you. He will probably do CKE, which will complement your work on Ripple and KGCN.

We thought it would be interesting if the final output of the notebook you are working on could be the knowledge graph of Movielens as a dumped file (in the format used by the networks), which we would save in a blob. Then the notebooks of the KG networks would start from that saved file. After that, you guys could work in parallel.

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@miguelgfierro
Collaborator

One note on ways of working for @almudenasanz and @chenhuims. In other situations where several people work on the same issue, they agree on the tasks to work on (depending on bandwidth) and then either push to the same branch (in your case it would be wikidata) or one person does a PR to that branch.

@almudenasanz
Collaborator Author

almudenasanz commented May 14, 2019

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@Leavingseason @miguelgfierro In the Query Limits section of the documentation (https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual) they mention: "the service is limited to 5 parallel queries per IP". It seems the only limit is on parallel queries per IP, not on sequential queries.

@gramhagen
Collaborator

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

@miguelgfierro
Collaborator

miguelgfierro commented May 16, 2019

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

The latest release is from 2017, is this maintained?

@gramhagen
Collaborator

Hmm, there are comments from the maintainer on issues that are more recent (Dec 2018); it's possible the API hasn't changed enough to warrant any updates?

@almudenasanz
Collaborator Author

just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/

Hi @gramhagen, I looked into the package, but for the specific operations that I use in the notebook:

  • Retrieve a Wikipedia page title from a search query
  • Retrieve a Wikidata item identifier from a page title
  • Retrieve all connected Wikidata item identifiers from a Wikidata item identifier

I did not find implementations in the package that are simpler than the ones I implemented. But I'm happy to discuss any suggestions!

@almudenasanz
Collaborator Author

We would need to have a KG for the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests to the Wikidata API.

@miguelgfierro I added to the notebook a new section that extracts the entities for the 100k Movielens version.

I had to reimplement the find_wikidataID method a bit, because I was matching strings to exact Wikipedia page titles. Some of the movie titles did not exactly match the Wikipedia title, so I added a simple text query that retrieves the first matching page title, and it now works well for all movies (a sketch of this two-step lookup is at the end of this comment).

I need to ask you where to put the output file of the MovieLens KG
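
A minimal sketch of that two-step lookup against the MediaWiki API (the helper names are placeholders; the actual implementation in reco_utils/dataset/wikidata.py may differ):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def first_matching_title(text):
    # full-text search: take the title of the first result, if any
    params = {"action": "query", "list": "search", "srsearch": text, "format": "json"}
    hits = requests.get(API_URL, params=params).json()["query"]["search"]
    return hits[0]["title"] if hits else None


def wikibase_item_for_title(title):
    # resolve an exact page title to its Wikidata entity ID (e.g. "Q...")
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API_URL, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("pageprops", {}).get("wikibase_item")


# a Movielens title such as "Toy Story (1995)" is not always an exact page title,
# but the search step recovers a usable one
title = first_matching_title("Toy Story (1995) film")
entity_id = wikibase_item_for_title(title) if title else None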

@gramhagen
Collaborator

ok, no problem, wasn't sure if it would be helpful to leverage that, but sounds like it's not in this case. thanks for checking into it

@miguelgfierro
Collaborator

I need to ask you where to put the output file of the MovieLens KG

we are changing the blobs of recommenders; can you send me the file somehow? Then I'll upload it to the correct place.

@miguelgfierro
Collaborator

hey @almudenasanz @chenhuims, could you please give an update on the state of this PR and the work you are doing with Movielens + Wikidata?

Please let me know if you have any blockers

@miguelgfierro
Collaborator

this looks good @almudenasanz. One question, how long does it take to compute the small KG with movielens 100k?

Can you add a test for the notebook? Depending on the time, it will go in the unit tests or maybe the smoke tests.

Here info on how to add the test https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

Collaborator Author

@almudenasanz almudenasanz left a comment

this looks good @almudenasanz. One question, how long does it take to compute the small KG with movielens 100k?

It takes 1-2 seconds per entry (combining the text query to the Wikidata entity ID and finding related entities), so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins. The time it will take for the other KGs will depend on the number of movies reviewed. According to this link https://grouplens.org/datasets/movielens/ the amounts are:

  • Movielens 100K: ~1,700 movies
  • Movielens 1M: ~4,000 movies
  • Movielens 10M: ~10,000 movies

The API supports up to 5 parallel queries from the same IP, so we could reduce the time by a factor of 5 (see the sketch at the end of this comment).

Can you add a test for the notebook? Depending on the time, it will go in the unit tests or maybe the smoke tests.

Here info on how to add the test https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

I will look into the tests
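
A minimal sketch of that parallelization, capped at the 5 parallel queries per IP allowed by the service (the lookup function here is a stand-in for the real per-movie query):

from concurrent.futures import ThreadPoolExecutor


def lookup(title):
    # stand-in for the real per-movie work (entity ID + linked entities);
    # it just echoes the title so the sketch runs standalone
    return title


titles = ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"]

# the Wikidata query service allows at most 5 parallel queries per IP
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lookup, titles))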

@miguelgfierro
Collaborator

so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins.

wow that's a lot. Maybe we can think of a way of running the first part (without movielens) in the unit tests, and then do some queries (not all) in the movielens part. Ideally we want the unit test to be less than one min and integration less than 15-20min.

Do you have any idea on how to execute the notebook under the times I mentioned?

@almudenasanz
Collaborator Author

so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins.

wow that's a lot. Maybe we can think of a way of running the first part (without movielens) in the unit tests, and then do some queries (not all) in the movielens part. Ideally we want the unit test to be less than one min and integration less than 15-20min.

Do you have any idea on how to execute the notebook under the times I mentioned?

I can create a parameter to run the tests only on a sample of the movielens dataset, would that work?

@miguelgfierro
Collaborator

I can create a parameter to run the tests only on a sample of the movielens dataset, would that work?

yeah that's reasonable. I think a good way of doing this would be to have a unit test that just checks that the first part of the notebook runs, similarly to this example. Then we can have what you suggested in the integration test and check programmatically some outputs like in this example. Maybe check that the first rows of the KG of movielens are correctly created. Kind of similar to what we are doing in the criteo tests.

Under this schema, maybe the integration tests take less than 5min
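
A minimal sketch of that split using papermill (the notebook path, parameter names, and pytest markers are placeholders; the repo's actual test helpers and fixtures may differ):

import papermill as pm
import pytest

NOTEBOOK = "notebooks/01_prepare_data/wikidata.ipynb"  # placeholder path
OUTPUT_NOTEBOOK = "output.ipynb"


@pytest.mark.notebooks
def test_wikidata_runs():
    # unit test: execute the notebook on a tiny sample so it stays well under a minute
    pm.execute_notebook(
        NOTEBOOK,
        OUTPUT_NOTEBOOK,
        parameters=dict(MOVIELENS_SAMPLE=True, MOVIELENS_SAMPLE_SIZE=5),
    )


@pytest.mark.integration
def test_wikidata_integration():
    # integration test: larger sample; recorded outputs (e.g. the first rows of the
    # Movielens KG) can then be read back from the executed notebook and checked
    pm.execute_notebook(
        NOTEBOOK,
        OUTPUT_NOTEBOOK,
        parameters=dict(MOVIELENS_SAMPLE=True, MOVIELENS_SAMPLE_SIZE=50),
    )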

@almudenasanz
Collaborator Author

yeah that's reasonable. I think a good way of doing this would be to have a unit test that just checks that the first part of the notebook runs, similarly to this example. Then we can have what you suggested in the integration test and check programmatically some outputs like in this example.

Great, I already started working on them, and querying the 1M MovieLens dataset. I will commit them when I finish

@almudenasanz
Collaborator Author

I added the tests, and sent you the 1M file by email

I introduced sample parameters so both the unit and integration tests can run in under 1 min, and the integration test checks that the output of the file has the expected number of responses.

@miguelgfierro
Collaborator

miguelgfierro commented Jul 27, 2019

@almudenasanz, I solved the conflicts but one test failed, there is a small error in the code:

E           /anaconda/envs/reco_base/lib/python3.6/site-packages/tqdm/_tqdm.py in wrapper(*args, **kwargs)
E               673                     # take a fast or slow code path; so stop when t.total==t.n
E               674                     t.update(n=1 if not t.total or t.n < t.total else 0)
E           --> 675                     return func(*args, **kwargs)
E               676 
E               677                 # Apply the provided function (in **kwargs)
E           
E           <ipython-input-19-6ccb9974139b> in <lambda>(x)
E                 1 tqdm().pandas(desc="Number of movies completed")
E           ----> 2 result = pd.concat(list(movies.progress_apply(lambda x: wikidata_KG_from_movielens(x), axis=1)))
E           
E           <ipython-input-17-3cd273a9c115> in wikidata_KG_from_movielens(df)
E                 3     entity_id = find_wikidataID(df["Title"] + " film")
E                 4     if entity_id != "entityNotFound":
E           ----> 5         json_links = query_entity_links(entity_id)
E                 6         related_entities,related_names = read_linked_entities(json_links)
E                 7         d = pd.DataFrame({"original_entity":[entity_id]* len(related_entities),
E           
E           /data/home/recocat/cicd/3/s/reco_utils/dataset/wikidata.py in query_entity_links(entityID)
E               100         data = r.json()
E               101     except:
E           --> 102         print(e)
E               103         print("Entity ID not Found in Wikidata")
E               104         return {}
E           
E           NameError: ("name 'e' is not defined", 'occurred at index 21')

just fixed it
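
The failure is a bare except: that prints the undefined name e; a minimal sketch of the likely fix is simply to bind the exception (the helper below is a reduced stand-in for query_entity_links, not the file's exact contents):

def parse_links_response(r):
    # r is a requests.Response from the Wikidata query; on any parsing failure,
    # report the error and return an empty dict so calling loops keep running
    try:
        return r.json()
    except Exception as e:  # binding the exception makes print(e) valid
        print(e)
        print("Entity ID not Found in Wikidata")
        return {}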

@miguelgfierro
Collaborator

@almudenasanz there is a time out:

$ pytest --durations=0 tests/unit/ -m "notebooks and not spark and not gpu"
======================================================= test session starts ========================================================
platform linux -- Python 3.6.8, pytest-4.2.1, py-1.7.0, pluggy-0.8.1
rootdir: /data/home/recocat/notebooks/miguel/Recommenders, inifile:
collected 163 items / 153 deselected / 10 selected

tests/unit/test_notebook_utils.py .                                                                                          [ 10%]
tests/unit/test_notebooks_python.py .........                                                                                [100%]

========================================================= warnings summary =========================================================
/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75
/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75
  /anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:75: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
    return _inspect.getargspec(target)

reco_utils/recommender/ncf/dataset.py:321
  /data/home/recocat/notebooks/miguel/Recommenders/reco_utils/recommender/ncf/dataset.py:321: DeprecationWarning: invalid escape sequence \[
    """

/anaconda/envs/reco_base/lib/python3.6/site-packages/nbconvert/exporters/exporter_locator.py:28
  /anaconda/envs/reco_base/lib/python3.6/site-packages/nbconvert/exporters/exporter_locator.py:28: DeprecationWarning: `nbconvert.exporters.exporter_locator` is deprecated in favor of `nbconvert.exporters.base` since nbconvert 5.0.
    DeprecationWarning)

reco_utils/recommender/rbm/rbm.py:452
  /data/home/recocat/notebooks/miguel/Recommenders/reco_utils/recommender/rbm/rbm.py:452: DeprecationWarning: invalid escape sequence \s
    """

tests/unit/test_notebook_utils.py::test_is_jupyter
  /anaconda/envs/reco_base/lib/python3.6/site-packages/jupyter_client/session.py:371: DeprecationWarning: Session._key_changed is deprecated in traitlets 4.1: use @observe and @unobserve instead.
    def _key_changed(self):

-- Docs: https://docs.pytest.org/en/latest/warnings.html
====================================================== slowest test durations ======================================================
1294.08s call     tests/unit/test_notebooks_python.py::test_wikidata_runs
217.83s call     tests/unit/test_notebooks_python.py::test_vw_deep_dive_runs
120.90s call     tests/unit/test_notebooks_python.py::test_rlrmc_quickstart_runs
45.77s call     tests/unit/test_notebooks_python.py::test_lightgbm
40.65s call     tests/unit/test_notebooks_python.py::test_surprise_deep_dive_runs
19.60s call     tests/unit/test_notebooks_python.py::test_sar_deep_dive_runs
16.61s call     tests/unit/test_notebooks_python.py::test_baseline_deep_dive_runs
12.61s call     tests/unit/test_notebooks_python.py::test_sar_single_node_runs
3.51s call     tests/unit/test_notebooks_python.py::test_template_runs
2.55s call     tests/unit/test_notebook_utils.py::test_is_jupyter

(0.00 durations hidden.  Use -vv to show these durations.)
===================================== 10 passed, 153 deselected, 6 warnings in 1777.51 seconds 

@almudenasanz
Collaborator Author

I have changed the default parameters of the notebook to do the sampling, and the tests have passed. It seems the injection of the parameters for the test was not working.

@miguelgfierro miguelgfierro merged commit d4bb65a into staging Jul 29, 2019
@miguelgfierro miguelgfierro deleted the wikidata branch July 29, 2019 09:16
@gramhagen gramhagen mentioned this pull request Jul 30, 2019
3 tasks
gramhagen added a commit that referenced this pull request Jul 31, 2019
* new file with wikidata functions

* fix in json extraction

* new notebook with wikidata use examples

* retry request with lowercase in case of failure

* WIP: example creating KG from movielens entities

* introduced new step to retrieve first page title from a text query in wikipedia

* updated movielens links extraction using wikidata

* adapted docstrings for sphinx and removed parenthesis from output

* added description and labels to nodes to graph preview

* #778 (comment) new format for queries

* raising exceptions in requests and using get() to retrieve dict values

* moved imports to first cell and movielens size as a parameter

* output file name as paramenter

* DATA: update sum check

* adding unit test for sum to 1 issue

* improved description and adapted to tests

* improved Exception descriptions

* integration tests

* unit tests

* added wikidata_KG to conftest

* changed name notebook

* *NOTE: Adding  shows the computation time of all tests.*

* imports up

* Update wikidata.py

* changed default parameter of sample for tests

* Add sphinx documentation for wikidata

* modified parameter extraction for tests

* added parameters tag to cell

* changed default sampling to test parameters in test

* notebook cleaned cells output

* Docker Support (#718)

* DOCKER: add pyspark docker file

* DOCKER: remove unused line

* DOCKER: remove old file

* DOCKER: add SETUP text

* DOCKER: add azureml`

* DOCKER: udpate dockerfile

* DOCKER: use a branch of the repo

* SETUP: update setup

* DOCKER: update dockerfile

* DOC: update setup

* DOCKER: one that binds all

* SETUP: update docker use

* DOCKER: move to top level

* SETUP: use a different base name

* DOCKER: use the same keywords in the repo for environment arg

* SETUP: update environment variable names

* updating dockerfile to use multistage build and adding readme

* adding full stage

* fixing documentation

* adding info for running full env

* README: update notes for exporting environment on certain platform

* README: updated with example on Windows

* README: fix typo
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019