Retrieve and visualize relevant information from a collection of web portals with a natural language query, using the IBM Watson Discovery service
There is a vast amount of information on the web, and we are usually interested in only the parts that are relevant to us. Two common scenarios drive this need:
- There is a specific item of interest, and we want to gather all the relevant information about it.
- We have a question and are searching for an answer that lies somewhere in the web pages.
In this code pattern, we address a specific scenario: querying for relevant information across a group of web pages. The IBM Watson Discovery service can crawl web pages and build a queryable collection from them. We will build an application that uses the Discovery service APIs to create, query, get the status of, and delete a document collection of web portals. The application renders the query results on a custom-built web user interface, which gives the end user the flexibility to design and build the UI to suit specific information and visualization requirements.
In this application you can:
- Specify a list of URLs that Discovery will crawl to build the collection.
- Specify a query in natural language and get relevant results with insights from Discovery.
- Visualize the top five matching documents, passages, and entities for the query.
When you have completed this code pattern, you will understand how to:
- Use the Discovery APIs (see the sketch below) to:
  - create a collection using web crawl
  - get the status of a collection
  - query the collection using natural language
  - delete a collection
- Parse, read, and visualize the results from Discovery.
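As a rough illustration, the sketch below shows what these calls can look like with the Watson Discovery v1 Python SDK (ibm-watson). The environment ID, configuration ID, service URL, and API key are placeholders, and the actual application wraps these calls behind its own web endpoints:

```python
# A minimal sketch of the Discovery v1 API calls used in this pattern.
# All IDs, names, and credentials below are placeholders.
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(
    version='2019-04-30',
    authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')

env_id = 'your-environment-id'

# Create a collection (the web-crawl source is defined in a configuration;
# see the configuration note later in this document).
collection = discovery.create_collection(
    environment_id=env_id,
    name='my-webcrawl-collection',
    configuration_id='your-configuration-id').get_result()
collection_id = collection['collection_id']

# Get the status of the collection (for example, 'pending' or 'active').
status = discovery.get_collection(env_id, collection_id).get_result()['status']

# Query the collection using natural language.
results = discovery.query(
    environment_id=env_id,
    collection_id=collection_id,
    natural_language_query='How do I deploy a model to the cloud?',
    passages=True,
    count=5).get_result()

# Delete the collection when it is no longer needed.
discovery.delete_collection(env_id, collection_id)
```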
The flow of the application is:
- The user requests the creation, status, or deletion of a collection, or queries a collection, through a custom-built web UI.
- The request is sent to a server application on the cloud.
- The application invokes an API on the Discovery service using the Watson SDK.
- The Discovery service processes the request and sends the results back to the application. The results are then visualized for the user.
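As an illustration of this flow, a query endpoint in the server application might look roughly like the sketch below. It assumes a Flask-style server; the route, parameter names, and credentials are placeholders, not the actual code in mydiscovery.py:

```python
# Hypothetical Flask route illustrating the flow: the web UI sends a natural
# language query, the server calls Discovery through the Watson SDK, and the
# JSON results are returned to the browser for visualization.
from flask import Flask, request, jsonify
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

app = Flask(__name__)
discovery = DiscoveryV1(version='2019-04-30',
                        authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')
ENV_ID = 'your-environment-id'

@app.route('/query')
def query_collection():
    # The web UI passes the collection id and the natural language query.
    response = discovery.query(
        environment_id=ENV_ID,
        collection_id=request.args.get('collection_id'),
        natural_language_query=request.args.get('query', ''),
        passages=True,
        count=5).get_result()
    return jsonify(response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```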
Open a terminal and run the below command to clone the repo.
git clone https://github.com/IBM/discovery-webcrawl-insights
This will create a folder named discovery-webcrawl-insights.
Create an instance of the Discovery service from the IBM Cloud catalog. Select a region and pricing plan, and click Create.
Once the service is provisioned, open the service page and click Service credentials. Click the Copy icon to copy the credentials.
Go to the repo folder discovery-webcrawl-insights, edit the file credentials.json, and replace its content with the credentials you copied.
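The credentials copied from the Service credentials page typically include an IAM API key and a service URL; a minimal example of what the pasted content can look like is shown below (your instance will have its own values, and the file in the repo may expect additional fields):

```json
{
  "apikey": "your-apikey",
  "url": "https://api.us-south.discovery.watson.cloud.ibm.com/instances/your-instance-id"
}
```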
You can optionally use a virtual environment to avoid having these dependencies clash with those of other Python projects or your operating system.
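If you choose to use one, the usual commands are:

```bash
# Optional: create and activate a virtual environment for this project
python3 -m venv venv
source venv/bin/activate      # on Windows: venv\Scripts\activate
```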
Install the dependencies listed in the requirements.txt file to be able to run the app locally. Open a terminal, go to the repo folder discovery-webcrawl-insights, and run the below command.
pip install -r requirements.txt
Run the application with the below command:
python mydiscovery.py
The application can be accessed at http://localhost:8000.
Open a terminal, go to the repo folder discovery-webcrawl-insights, and run the below commands.
$ ibmcloud login [--sso]
$ ibmcloud cf create-service discovery lite mydiscoveryservice
$ ibmcloud cf push
$ ibmcloud cf bind-service customcollections mydiscoveryservice
$ ibmcloud cf restage customcollections
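The ibmcloud cf push command assumes the repository contains a Cloud Foundry manifest.yml naming the application customcollections (the name used in the bind-service and restage commands above). A hypothetical manifest along those lines could look like:

```yaml
# Hypothetical manifest.yml sketch; the actual file in the repo defines the real values.
applications:
  - name: customcollections
    memory: 256M
    command: python mydiscovery.py
```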
Once the application is deployed and running, go to the IBM Cloud dashboard and click Visit App URL to access the application.
The application is ready. You can now create a collection and then query it using natural language.
Enter a name for the collection, and then enter the URLs, for example:
http://developer.ibm.com/patterns,http://developer.ibm.com/tutorials,http://developer.ibm.com/article
Note: The collection configuration is hardcoded in the source file mydiscovery.py. You can modify the configuration based on your needs and redeploy the application. For more information on configuration, refer to the Discovery service documentation.
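For reference, a web-crawl configuration created through the Python SDK looks roughly like the sketch below; the URLs, hop count, and configuration name are illustrative and may differ from what mydiscovery.py hardcodes:

```python
# Hypothetical web-crawl configuration built with the Discovery v1 SDK.
# The URLs and options are illustrative; mydiscovery.py defines its own values.
from ibm_watson import DiscoveryV1
from ibm_watson.discovery_v1 import Source, SourceOptions, SourceOptionsWebCrawl
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(version='2019-04-30',
                        authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')

# Define a web-crawl source over the starting URLs.
source = Source(
    type='web_crawl',
    options=SourceOptions(urls=[
        SourceOptionsWebCrawl(url='http://developer.ibm.com/patterns', maximum_hops=2),
        SourceOptionsWebCrawl(url='http://developer.ibm.com/tutorials', maximum_hops=2),
    ]))

# Create the configuration; its ID is then passed to create_collection.
configuration = discovery.create_configuration(
    environment_id='your-environment-id',
    name='webcrawl-config',
    source=source).get_result()
print(configuration['configuration_id'])
```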
The collection is now created. It takes some time for the collection to be built; you can check the status of the collection, and then run your query in natural language.
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.