Retrieve and visualize relevant information from a collection of web portals with a natural language query, using the IBM Watson Discovery service
There is a vast amount of information on the web, and we are usually interested in only the parts that are relevant to us. Two common scenarios drive this need:
- There is a specific item of interest, and we want to gather all the relevant information about it.
- We have a question and are searching for an answer that lies somewhere in the web pages.
In this code pattern, we address a specific scenario: querying for relevant information across a group of web pages. The IBM Watson Discovery service can crawl web pages and build a queryable collection from them. We will build an application that uses the Discovery service APIs to create, query, get the status of, and delete a document collection of web portals. The application renders the query results on a custom-built web user interface, which gives the end user the flexibility to design and build the UI to suit specific information and visualization requirements.
In this application you can:
- Specify a list of URLs that Discovery will crawl to build the collection.
- Specify a query in natural language and get relevant results with insights from Discovery.
- Visualize the top five matching documents, passages, and entities for the query.
When you have completed this code pattern, you will understand how to:
- Use the Discovery APIs (see the sketch below) to:
  - create a collection using web crawl
  - get the status of a collection
  - query the collection using natural language
  - delete a collection
- Parse, read, and visualize the results from Discovery.
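As a rough illustration, the sketch below shows what these calls can look like with the Watson Discovery v1 Python SDK (ibm-watson). The environment ID, configuration ID, service URL, and API key are placeholders, and the actual application wraps these calls behind its own web endpoints:

```python
# A minimal sketch of the Discovery v1 API calls used in this pattern.
# All IDs, names, and credentials below are placeholders.
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(
    version='2019-04-30',
    authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')

env_id = 'your-environment-id'

# Create a collection (the web-crawl source is defined in a configuration;
# see the configuration note later in this document).
collection = discovery.create_collection(
    environment_id=env_id,
    name='my-webcrawl-collection',
    configuration_id='your-configuration-id').get_result()
collection_id = collection['collection_id']

# Get the status of the collection (for example, 'pending' or 'active').
status = discovery.get_collection(env_id, collection_id).get_result()['status']

# Query the collection using natural language.
results = discovery.query(
    environment_id=env_id,
    collection_id=collection_id,
    natural_language_query='How do I deploy a model to the cloud?',
    passages=True,
    count=5).get_result()

# Delete the collection when it is no longer needed.
discovery.delete_collection(env_id, collection_id)
```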
The flow of the application is:
- The user requests the creation, status, or deletion of a collection, or queries a collection, through a custom-built web UI.
- The request is sent to a server application on the cloud.
- The application invokes an API on the Discovery service using the Watson SDK.
- The Discovery service processes the request and sends the results back to the application. The results are then visualized for the user.
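As an illustration of this flow, a query endpoint in the server application might look roughly like the sketch below. It assumes a Flask-style server; the route, parameter names, and credentials are placeholders, not the actual code in mydiscovery.py:

```python
# Hypothetical Flask route illustrating the flow: the web UI sends a natural
# language query, the server calls Discovery through the Watson SDK, and the
# JSON results are returned to the browser for visualization.
from flask import Flask, request, jsonify
from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

app = Flask(__name__)
discovery = DiscoveryV1(version='2019-04-30',
                        authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')
ENV_ID = 'your-environment-id'

@app.route('/query')
def query_collection():
    # The web UI passes the collection id and the natural language query.
    response = discovery.query(
        environment_id=ENV_ID,
        collection_id=request.args.get('collection_id'),
        natural_language_query=request.args.get('query', ''),
        passages=True,
        count=5).get_result()
    return jsonify(response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```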
Open a terminal and run the below command to clone the repo.
git clone https://github.com/IBM/discovery-webcrawl-insights
This will create a folder named discovery-webcrawl-insights.
Create an instance of the Discovery service from the IBM Cloud catalog. Select a region and pricing plan, and click Create.
Once the service is provisioned, open the service page and click Service credentials. Click the Copy icon to copy the credentials.
Go to the repo folder discovery-webcrawl-insights, edit the file credentials.json, and replace its content with the credentials you copied.
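The credentials copied from the Service credentials page typically include an IAM API key and a service URL; a minimal example of what the pasted content can look like is shown below (your instance will have its own values, and the file in the repo may expect additional fields):

```json
{
  "apikey": "your-apikey",
  "url": "https://api.us-south.discovery.watson.cloud.ibm.com/instances/your-instance-id"
}
```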
You can optionally use a virtual environment to avoid having these dependencies clash with those of other Python projects or your operating system.
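If you choose to use one, the usual commands are:

```bash
# Optional: create and activate a virtual environment for this project
python3 -m venv venv
source venv/bin/activate      # on Windows: venv\Scripts\activate
```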
Install the dependencies listed in the requirements.txt file to be able to run the app locally. Open a terminal, go to the repo folder discovery-webcrawl-insights, and run the below command.
pip install -r requirements.txt
Run the application with the below command:
python mydiscovery.py
The application can be accessed at http://localhost:8000.
Open a terminal, go to the repo folder discovery-webcrawl-insights, and run the below commands.
$ ibmcloud login [--sso]
$ ibmcloud cf create-service discovery lite mydiscoveryservice
$ ibmcloud cf push
$ ibmcloud cf bind-service customcollections mydiscoveryservice
$ ibmcloud cf restage customcollections
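The ibmcloud cf push command assumes the repository contains a Cloud Foundry manifest.yml naming the application customcollections (the name used in the bind-service and restage commands above). A hypothetical manifest along those lines could look like:

```yaml
# Hypothetical manifest.yml sketch; the actual file in the repo defines the real values.
applications:
  - name: customcollections
    memory: 256M
    command: python mydiscovery.py
```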
Once the application is deployed and running, go to the IBM Cloud dashboard and click Visit App URL to access the application.
The application is ready. You can now create a collection and then query it using natural language.
Enter a name for the collection, and then enter the URLs, for example:
http://developer.ibm.com/patterns,http://developer.ibm.com/tutorials,http://developer.ibm.com/article
Note: The collection configuration is hardcoded in the source file mydiscovery.py. You can modify the configuration based on your needs and redeploy the application. For more information on configuration, refer to the Discovery service documentation.
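For reference, a web-crawl configuration created through the Python SDK looks roughly like the sketch below; the URLs, hop count, and configuration name are illustrative and may differ from what mydiscovery.py hardcodes:

```python
# Hypothetical web-crawl configuration built with the Discovery v1 SDK.
# The URLs and options are illustrative; mydiscovery.py defines its own values.
from ibm_watson import DiscoveryV1
from ibm_watson.discovery_v1 import Source, SourceOptions, SourceOptionsWebCrawl
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(version='2019-04-30',
                        authenticator=IAMAuthenticator('your-apikey'))
discovery.set_service_url('your-service-url')

# Define a web-crawl source over the starting URLs.
source = Source(
    type='web_crawl',
    options=SourceOptions(urls=[
        SourceOptionsWebCrawl(url='http://developer.ibm.com/patterns', maximum_hops=2),
        SourceOptionsWebCrawl(url='http://developer.ibm.com/tutorials', maximum_hops=2),
    ]))

# Create the configuration; its ID is then passed to create_collection.
configuration = discovery.create_configuration(
    environment_id='your-environment-id',
    name='webcrawl-config',
    source=source).get_result()
print(configuration['configuration_id'])
```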
The collection is now created. It takes some time for the collection to be built; you can check the status of the collection, and then run your query in natural language.
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.