This notebook is a walkthrough on using `ConceptModel` methodology to efficiently create models useful for the applications that you have in mind. It is a follow-up to the basic [Concept Modeling notebook](./watsongraph - Concept Modeling.ipynb), which introduces `ConceptModel` logic and operations and which you should read before diving into this one.

## Entropy

You may have heard of the principle of [six degrees of separation](https://en.wikipedia.org/wiki/Six_degrees_of_separation), a fascinating theory from network analysis that states that any two people in the United States are at most six "friend of friends" distant from one another. This concept (the subject of a famous 1960s mail-forwarding experiment) has since entered popular culture in forms such as [Bacon numbers](https://en.wikipedia.org/wiki/Bacon_number) and [Erdős numbers](https://en.wikipedia.org/wiki/Erd%C5%91s_number). Wikipedia has its own [six degrees principle](https://en.wikipedia.org/wiki/Wikipedia:Six_degrees_of_Wikipedia), a popular conception that all *sufficiently mainstream* Wikipedia articles are within six hops of one another, at most.

How much entropy does Watson's cognitive graph have? To illustrate how far Watson's cognitive graph can wander, let's take a few random 5-step walks. Try re-running this code block yourself. Where do you end up?

In [1]:
from watsongraph.conceptmodel import ConceptModel
import random

In [2]:
def jump(concept, level=0):
    c = ConceptModel([concept])
    c.explode(level=level)
    # Pick one of the exploded concepts uniformly at random.
    return random.choice(c.concepts())

jump(jump(jump(jump(jump('IBM')))))

'X Window System'

In [3]:
[jump(jump(jump(jump(jump(i))))) for i in ['IBM'] * 5]

['Source code',
 'Random-access memory',
 'BSD licenses',
 'Internet Explorer',
 'MS-DOS']

In [4]:
[jump(jump(jump(jump(jump(i, level=3), level=3), level=3), level=3), level=3) for i in ['IBM'] * 5]

['IBM System/3',
 'IBM System/34',
 'Processor book',
 'Benjamin B. Redding',
 'Inforex 1300 Systems']

You should see, from the results above and from those of the queries that you run yourself, that the output of the cognitive graph is fairly tight: we can't even say definitively that it deteriorates significantly when we `explode()` at higher `level` parameters.
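The nested `jump()` calls above amount to a fixed-length random walk. A minimal sketch of the same idea written as a loop follows. It uses a small hand-written adjacency dictionary in place of live `explode()` calls, and both `toy_graph` and `random_walk` are illustrative assumptions rather than part of `watsongraph`:

```python
import random

# Hypothetical toy graph standing in for the concepts explode() would return.
toy_graph = {
    'IBM': ['Intel', 'MS-DOS', 'Source code'],
    'Intel': ['IBM', 'Random-access memory'],
    'MS-DOS': ['IBM', 'Source code'],
    'Source code': ['IBM', 'MS-DOS'],
    'Random-access memory': ['Intel'],
}

def random_walk(start, steps, neighbors, rng=random):
    """Return the list of nodes visited on a walk of up to `steps` hops."""
    path = [start]
    for _ in range(steps):
        options = neighbors(path[-1])
        if not options:  # dead end: stop early
            break
        path.append(rng.choice(options))
    return path

# A seeded generator makes the walk repeatable between runs.
walk = random_walk('IBM', 5, lambda node: toy_graph.get(node, []), random.Random(0))
print(walk)
```

Swapping the lambda for a function that wraps `ConceptModel.explode()` would presumably reproduce the live walks, at the cost of one service call per hop.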

## Modeling

If you run the command above yourself you will also notice that, though 25 queries (five walks of five jumps each) doesn't sound like a lot, it really piles on, especially at higher `level` parameters: this query will take at least a minute to fully process. This is a key facet of modeling with `watsongraph`: the `Concept Insights` service, as brilliant as it is, is very time-expensive. For this reason, when constructing models using this service we actually want to *avoid* using it as much as possible. That usually entails building the backbone of your graph locally, using other, faster data sources, and only then expanding the connections amongst nodes to construct the graph.
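One inexpensive way to avoid repeat calls to a slow service is to memoize results locally. The sketch below uses the standard library's `functools.lru_cache`. Here `slow_lookup` is a hypothetical stand-in for an expensive remote query, not a `watsongraph` function, and the `calls` counter exists only to show that the second lookup never reaches the "service":

```python
import functools

calls = {'n': 0}  # counts how often the "remote" lookup really runs

@functools.lru_cache(maxsize=None)
def slow_lookup(concept):
    """Hypothetical stand-in for an expensive remote query."""
    calls['n'] += 1
    # Return a tuple so the cached value cannot be mutated by callers.
    return (concept, concept + ' (related)')

first = slow_lookup('IBM')
second = slow_lookup('IBM')  # served from the cache, no second "remote" call
print(calls['n'])  # 1
```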

For our test use case we will try to construct a "corporate network" showing the strength of the relationships amongst as many technology corporations as we can muster.

The naive way of doing this would be to start with a simple model (say, `ConceptModel(['IBM'])`), `explode()` the graph, and then scrape the category listings of the associated Wikipedia pages (always remember that every concept is the name of a Wikipedia article!) in order to pare away all of the articles that do not belong to companies. Let's time the naive way.

In [5]:
import time
import requests

In [6]:
ibm = ConceptModel(['IBM'])
ibm.explode(level=0)
len(ibm.concepts())

37

In [7]:
def scraper_timer(func):
    def wrapper(model, starting_point):
        start_time = time.time()
        ret = func(model, starting_point)
        print("Runtime: " + str(time.time() - start_time) + " seconds.")
        return ret
    return wrapper

def categories_snapshot(concept):
    dat = requests.get('https://en.wikipedia.org/wiki/' + concept).text
    return dat[dat.find("<div id='catlinks' class='catlinks'>"):]

def select(t, name):
    if t[1] == name:
        return t[2]
    else:
        return t[1]

@scraper_timer
def get_companies_with_scraping(model, starting_point):
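    # Each edge is a (score, concept, concept) tuple: keep the end that is not the starting point.
    top = [(select(edge, starting_point), edge[0]) for edge in model.edges()]
    # Keep only concepts whose Wikipedia category footer mentions companies.
    top_companies = [concept for concept in top if 'companies' in categories_snapshot(concept[0])]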
    return top_companies

get_companies_with_scraping(ibm, 'IBM')

Runtime: 10.709612846374512 seconds.


[('Digital Equipment Corporation', 0.89564085),
 ('Advanced Micro Devices', 0.79349726),
 ('Sun Microsystems', 0.780642),
 ('Oracle Corporation', 0.7744718),
 ('Intel', 0.6541496),
 ('Hewlett-Packard', 0.6270959)]

This is clearly too slow to be of any practical use!

Web scrapers are, in the main, somewhat faster at generating results than the Concept Insights API, but they are definitely not fast enough to avoid getting bogged down under the enormous volume of pages that this operation requires. Rate-unlimited scraping is also highly discouraged by Wikipedia itself, and running large-volume queries for long periods of time is liable to get you in trouble with the webmasters.
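If you do scrape, spacing requests out is the polite minimum. A minimal sketch of a throttle that enforces a gap between successive requests follows. The `Throttle` class and its parameters are hypothetical, and the interval used here is only a placeholder, not a figure taken from Wikipedia's policies:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(0.01)  # placeholder interval, in seconds
for title in ['IBM', 'Intel', 'Oracle Corporation']:
    throttle.wait()  # blocks until the interval has passed
    # a real scraper would fetch the page for `title` here
```

The `clock` and `sleep` arguments are injectable so that the behaviour can be checked without actually waiting.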

So let's apply our paradigm. Wikipedia [has an API](https://www.mediawiki.org/wiki/API:Main_page). It's admittedly a bit crufty, but it's a very powerful way of retrieving only the information that you need, and of retrieving it in fast batches. We use the [mwapi](https://pypi.python.org/pypi/mwapi/0.4.0) library to avoid writing the access code: this is a low-level library that is good for writing fast queries (written by two active Wikimedia Foundation developers for exactly that purpose). You can read the documentation [here](http://pythonhosted.org/mwapi/). You can try out queries and get a sense of how the API works in the [API sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox).
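Batching in practice means joining several page titles with `|` in a single `titles` parameter. The MediaWiki API caps this at 50 titles per request for ordinary clients, so larger models need to be split into chunks first. A minimal sketch, in which `chunked` and `title_batches` are hypothetical helper names:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def title_batches(titles, size=50):
    """Return a list of '|'-joined title strings, one per API request."""
    return ['|'.join(batch) for batch in chunked(list(titles), size)]

batches = title_batches(['Page %d' % n for n in range(120)])
print(len(batches))  # 3 requests: 50 + 50 + 20 titles
```

Each string in `batches` could then be passed as the `titles` argument of a separate query.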

In [8]:
import mwapi

In [9]:
session = mwapi.Session('https://en.wikipedia.org', user_agent='watsongraph notebook')

def api_timer(func):
    def wrapper(model):
        start_time = time.time()
        ret = func(model)
        print("Runtime: " + str(time.time() - start_time) + " seconds.")
        return ret
    return wrapper

# @api_timer
def get_companies_with_api(model):
    companies = []
    cont = {}  # continuation parameters from the previous response, if any
    while True:
        raw = session.get(action='query',
                          prop='categories',
                          clshow='!hidden',
                          cllimit=500,
                          titles='|'.join(model.concepts()),
                          **cont)
        query = raw['query']['pages']
        for p_k in query.keys():
            if query[p_k]['title'] not in companies:
                # Pages with no visible categories carry no 'categories' key.
                for cat in query[p_k].get('categories', []):
                    if 'companies' in cat['title'] or 'Companies' in cat['title']:
                        # print('Spotted: ' + cat['title'])
                        companies.append(query[p_k]['title'])
                        break
        # The API includes a 'continue' block while more results remain.
        if 'continue' not in raw:
            break
        # Feed the continuation values back into the next request.
        cont = raw['continue']
    return companies

start_time = time.time()
ret = get_companies_with_api(ibm)
print("Runtime: " + str(time.time() - start_time) + " seconds.")
ret

Runtime: 0.1670088768005371 seconds.


['Sun Microsystems',
 'Hewlett-Packard',
 'IBM',
 'Advanced Micro Devices',
 'Intel',
 'Digital Equipment Corporation',
 'Oracle Corporation']

*Much* faster! But notice the difference between this output, which is just a `list` of concept names, and what we had previously: edges in a `ConceptModel` object. Following the logic described earlier, it is much, much faster to create this list and then populate its edges, using a new method we will introduce for this purpose, `add_edges()`, than it is to work with edges from the start.

`add_edges()` extends the Concept Insights `getGraphRelationScore` API method and maps edges correlating a single `source_concept` (the first required argument) and an arbitrary `list_of_target_concepts` (the second required argument). If you recall the examples on entropy at the beginning of this notebook, you will remember that the `wikipedia/en-20120601` cognitive graph we are using has very low entropy: for two concepts to be associated with one another at all they must be very tightly associated. This can cause surprising results:

In [10]:
from watsongraph.event_insight_lib import get_relation_scores
get_relation_scores('IBM', ['Interpretive dance',
                            'Apple',
                            'Apple Inc.',
                            'Microsoft',
                            'Java (programming language)',
                            'Intel',
                            'Cloud computing',
                            'Watson (computer)'
                           ])['scores']

[{'concept': '/graphs/wikipedia/en-20120601/concepts/Interpretive_dance',
  'score': 0.5},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Apple', 'score': 0.5},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Apple_Inc.',
  'score': 0.5},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Microsoft', 'score': 0.5},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Java_(programming_language)',
  'score': 0.60449404},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Intel',
  'score': 0.6541496},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Cloud_computing',
  'score': 0.6814177},
 {'concept': '/graphs/wikipedia/en-20120601/concepts/Watson_(computer)',
  'score': 0.899236}]

We as humans know that IBM has a heck of a lot more to do with Apple Inc. and with Microsoft than it does with apples and interpretive dance, but alas, because of the tightness of its cognitive graph, Watson doesn't. Watson always returns a value of `0.5` for edges that it doesn't know enough about to be sure of a correlation, or that it doesn't think are particularly correlated. To account for this systematic limitation `add_edges()` offers an additional optional parameter, `prune`, which is set to `False` by default. If we add a `prune=True` argument, the method will throw out all of these questionably useful edges and keep only those that return a `score` higher than 0.5.
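The pruning rule itself is easy to sketch against the response format shown above. The helper below is a hypothetical illustration of the filter, not the internals of `add_edges()`. It keeps only entries scoring strictly above the threshold and turns each concept path back into a readable label:

```python
# Entries copied from the get_relation_scores() output shown earlier.
scores = [
    {'concept': '/graphs/wikipedia/en-20120601/concepts/Interpretive_dance', 'score': 0.5},
    {'concept': '/graphs/wikipedia/en-20120601/concepts/Apple_Inc.', 'score': 0.5},
    {'concept': '/graphs/wikipedia/en-20120601/concepts/Intel', 'score': 0.6541496},
    {'concept': '/graphs/wikipedia/en-20120601/concepts/Watson_(computer)', 'score': 0.899236},
]

def prune_scores(entries, threshold=0.5):
    """Return (score, label) pairs scoring strictly above `threshold`."""
    kept = []
    for entry in entries:
        if entry['score'] > threshold:
            # The label is the last path segment, with underscores as spaces.
            label = entry['concept'].rsplit('/', 1)[-1].replace('_', ' ')
            kept.append((entry['score'], label))
    return sorted(kept, reverse=True)

print(prune_scores(scores))
# [(0.899236, 'Watson (computer)'), (0.6541496, 'Intel')]
```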

With all of this in mind let's work through a complete construction.

In [11]:
start_time = time.time()
companies = get_companies_with_api(ibm)
companies = ConceptModel(companies)
companies.concepts()
companies.add_edges('IBM', companies.concepts(), prune=True)
print("Runtime: " + str(time.time() - start_time) + " seconds.")
companies.edges()

Runtime: 2.3711349964141846 seconds.


[(0.89564085, 'Digital Equipment Corporation', 'IBM'),
 (0.79349726, 'Advanced Micro Devices', 'IBM'),
 (0.780642, 'Sun Microsystems', 'IBM'),
 (0.7744718, 'IBM', 'Oracle Corporation'),
 (0.6541496, 'Intel', 'IBM'),
 (0.6270959, 'Hewlett-Packard', 'IBM')]

Already much faster! And this difference in execution time rapidly becomes more pronounced as you get to more intensive queries.
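To compare runtimes without repeating the `start_time` boilerplate in every cell, a small context manager can do the bookkeeping. This is a sketch using only the standard library. The name `timed` and the `results` dictionary are illustrative choices, not part of `watsongraph`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the runtime, and optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s runtime: %.4f seconds." % (label, elapsed))

runtimes = {}
with timed('sum of squares', runtimes):
    total = sum(n * n for n in range(100000))
```

Collecting the timings in one dictionary makes it straightforward to compare the scraping and API approaches side by side.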

The moral of the story is, to reiterate, that we should avoid using unstructured IBM Watson output as much as possible when defining our graph, and rely on it only in the end step, when we are linking together the nodes that we have hopefully populated from a different source. That source need not be the Wikipedia API: heaven knows there are a lot of ways of getting lists of companies out there on the Internet. The Wikipedia API is just a very durable multipurpose tool for getting there.

One final note: there are also `explode_edges([prune=True/False])` and `add_edge([source, target, [prune=True/False]])` methods available for your use. The former is like `explode()`, but for edges: it creates every possible edge that it can. The latter creates only a single edge, and is not really recommended, as you should always try to handle this job in batches.
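To see why batching matters here, consider how many candidate edges a model has. The sketch below is plain Python and does not describe how `explode_edges()` is actually implemented. It enumerates every unordered pair of concepts and groups the pairs by source, so that each group could be handled by a single batched call in the style of `add_edges(source, targets)`:

```python
from itertools import combinations

concepts = ['IBM', 'Intel', 'Oracle Corporation', 'Sun Microsystems']

# Every unordered pair is one candidate edge: n * (n - 1) / 2 of them.
pairs = list(combinations(concepts, 2))

# Grouping pairs by their first member gives one batch per source concept
# instead of one request per pair.
batches = {}
for source, target in pairs:
    batches.setdefault(source, []).append(target)

print(len(pairs))    # 6
print(len(batches))  # 3
```

The gap widens quickly: the seven-company model above has 21 candidate edges but would need only six batches.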

## Parameterization

Daily page views are a bit of feature sugar that comes prebuilt in the `watsongraph` library. This data is based on a 30-day average, and is generated by a call against the (just recently introduced!) [Wikipedia Pageview API](https://wikimedia.org/api/rest_v1/?doc).

Fetching view counts adds significant overhead to execution time, so they are not populated for you automatically, but you can populate them yourself at any time using the `set_view_counts()` method. `concepts_by_view_count()` is useful for inspecting the result.

In [12]:
companies.set_view_counts()
companies.concepts_by_view_count()

[(4427, 'IBM'),
 (3734, 'Hewlett-Packard'),
 (2972, 'Oracle Corporation'),
 (2876, 'Intel'),
 (1284, 'Advanced Micro Devices'),
 (1045, 'Sun Microsystems'),
 (617, 'Digital Equipment Corporation')]
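For a sense of what such a lookup involves, here is a sketch of building a per-article request URL and averaging the daily counts. It assumes the per-article endpoint layout and the `items`/`views` response shape described in the Pageview API documentation. The helper names are hypothetical, the sample payload is made up, and the library's own implementation may differ:

```python
from urllib.parse import quote

def pageviews_url(article, start, end, project='en.wikipedia'):
    """Build a per-article daily pageviews URL; dates are YYYYMMDD strings."""
    title = quote(article.replace(' ', '_'), safe='')
    return ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
            '%s/all-access/all-agents/%s/daily/%s/%s' % (project, title, start, end))

def mean_daily_views(payload):
    """Average the 'views' field over the daily items in a response payload."""
    items = payload.get('items', [])
    if not items:
        return 0
    return round(sum(item['views'] for item in items) / len(items))

# Synthetic payload in the assumed response shape (values are made up).
sample = {'items': [{'timestamp': '2015120100', 'views': 4000},
                    {'timestamp': '2015120200', 'views': 4500},
                    {'timestamp': '2015120300', 'views': 5000}]}
print(pageviews_url('Sun Microsystems', '20151201', '20151230'))
print(mean_daily_views(sample))  # 4500
```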

The `view_count` parameter is actually just one of any number of arbitrary properties that a `ConceptModel` can store for its nodes; `relevance` is the other one that's used extensively in-code, in the `User` and `Item` classes.

This means that you can always extend the `ConceptModel` implicitly by defining and storing your own properties. There are two methods that allow you to do so. The first is the per-concept `set_property()` method:

In [13]:
companies.set_property('IBM', 'length', len('IBM'))

A smarter way is to use `map_property()`, which accepts a function that should take a concept string as an argument. This method maps the property for *every* node.

In [14]:
companies.map_property('length', lambda concept: len(concept))

Use `concepts_by_property()` to list out your results.

In [15]:
companies.concepts_by_property('length')

[(29, 'Digital Equipment Corporation'),
 (22, 'Advanced Micro Devices'),
 (18, 'Oracle Corporation'),
 (16, 'Sun Microsystems'),
 (15, 'Hewlett-Packard'),
 (5, 'Intel'),
 (3, 'IBM')]
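The semantics of these three methods can be mimicked with a few lines of plain Python. `ToyModel` below is a toy stand-in written for illustration only. It is not how `ConceptModel` stores its data, but it shows the set, map, and sort pattern in isolation:

```python
class ToyModel:
    """Toy stand-in that stores arbitrary per-concept properties in dicts."""

    def __init__(self, concepts):
        self.props = {concept: {} for concept in concepts}

    def set_property(self, concept, name, value):
        self.props[concept][name] = value

    def map_property(self, name, func):
        # Apply func to every concept name and store the result.
        for concept in self.props:
            self.set_property(concept, name, func(concept))

    def concepts_by_property(self, name):
        # (value, concept) pairs, largest value first.
        pairs = [(p[name], c) for c, p in self.props.items() if name in p]
        return sorted(pairs, reverse=True)

toy = ToyModel(['IBM', 'Intel', 'Hewlett-Packard'])
toy.map_property('length', len)
print(toy.concepts_by_property('length'))
# [(15, 'Hewlett-Packard'), (5, 'Intel'), (3, 'IBM')]
```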