This notebook is a walkthrough on using the `ConceptModel` methodology to efficiently create models for the applications that you have in mind. It is a follow-up to the basic [Concept Model Demo notebook](./watsongraph - Concept Model Demo.ipynb), which introduces `ConceptModel` logic and operations; you should read that notebook **first** before diving into this one.

## Entropy

You may have heard of the principle of [six degrees of separation](https://en.wikipedia.org/wiki/Six_degrees_of_separation), a fascinating theory from network analysis which holds that any two people in the United States are at most six "friend of a friend" links distant from one another. This concept (the subject of a fantastical 1960s mail-based experiment) has since entered popular culture in forms such as [Bacon numbers](https://en.wikipedia.org/wiki/Bacon_number) and [Erdős numbers](https://en.wikipedia.org/wiki/Erd%C5%91s_number). Wikipedia has its own [six degrees principle](https://en.wikipedia.org/wiki/Wikipedia:Six_degrees_of_Wikipedia), a popular conception that all *sufficiently mainstream* Wikipedia articles are within six hops of one another, at most.

How much entropy does Watson's cognitive graph have? To illustrate how far the graph can wander, let's take a few random 5-step walks. Try re-running these code blocks yourself. Where do you end up?

In [20]:
from watsongraph.conceptmodel import ConceptModel
import random

In [10]:
def jump(concept, level=0):
    c = ConceptModel([concept])
    c.explode(level=level)
    l = len(c.concepts())
    return c.concepts()[random.randrange(0, l)]

jump(jump(jump(jump(jump('IBM')))))

'Computer programming'

In [9]:
[jump(jump(jump(jump(jump(i))))) for i in ['IBM'] * 5]

['Microprocessor',
 'Megabyte',
 'MySQL',
 'Random-access memory',
 'Digital Equipment Corporation']

In [11]:
[jump(jump(jump(jump(jump(i, level=3), level=3), level=3), level=3), level=3) for i in ['IBM'] * 5]

['IBM 402',
 'Object access method',
 'IBM 1410',
 'Windows Small Business Server',
 'WebSphere Application Server for z/OS']

You should see, from the results above and from those of the queries that you run yourself, that the output of the cognitive graph is fairly tight: we can't even say definitively that it deteriorates significantly when we `explode()` at higher `level` parameters.
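One way to put a number on how "tight" a set of walks is would be the Shannon entropy of where the walks end up. The sketch below is only an illustration under stated assumptions: `TOY_GRAPH` is a small made-up adjacency map standing in for the concept graph, and `random_walk` and `endpoint_entropy` are hypothetical helpers, not part of `watsongraph`.

```python
import math
import random
from collections import Counter

# Hypothetical toy graph standing in for the concept graph.
TOY_GRAPH = {
    'IBM': ['Intel', 'Unix', 'Microprocessor'],
    'Intel': ['IBM', 'Microprocessor', 'X86'],
    'Unix': ['IBM', 'Linux'],
    'Microprocessor': ['IBM', 'Intel', 'X86'],
    'X86': ['Intel', 'Microprocessor'],
    'Linux': ['Unix'],
}

def random_walk(graph, start, steps, rng):
    """Follow `steps` uniformly random edges from `start`; return the endpoint."""
    node = start
    for _ in range(steps):
        node = rng.choice(graph[node])
    return node

def endpoint_entropy(endpoints):
    """Shannon entropy (in bits) of the empirical endpoint distribution."""
    counts = Counter(endpoints)
    total = len(endpoints)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

rng = random.Random(0)  # seeded so the sketch is repeatable
endpoints = [random_walk(TOY_GRAPH, 'IBM', 5, rng) for _ in range(200)]
print(endpoint_entropy(endpoints))
```

An entropy near zero means the walks keep landing on the same few nodes; the upper bound is `log2` of the number of reachable nodes.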

## Modeling techniques

If you run the commands above yourself you will also notice that although 25 queries (five walks of five jumps each) doesn't sound like a lot, it really piles up, especially once you get to higher `level` parameters: the last query will take at least a minute to fully process. This is a key facet of modeling with `watsongraph`: the `Concept Insights` service, as brilliant as it is, is very expensive in time. For this reason, when constructing models using this service we actually want to *avoid* using it as much as possible. What that usually entails is building the backbone of your graph locally using other, faster data sources, and only then expanding the connections amongst nodes to construct the graph.

For our test use case we will try to construct a "corporate network" showing the strength of the relationships amongst as many technology corporations as we can muster.

The naive way of doing this would be to start with a simple model (say, `ConceptModel(['IBM'])`), `expand()` the graph, and then scrape the categorization of the associated Wikipedia pages (always remember that every concept is the name of a Wikipedia article!) in order to pare away all of the returned articles that do not belong to companies. Web scrapers are, in the main, somewhat faster at generating results than the Concept Insights API, but definitely not fast enough to avoid getting bogged down by the enormous volume of pages that this operation requires. In addition, rate-unlimited scraping is highly discouraged by Wikipedia itself, and running large-volume queries for long periods of time is liable to get you in trouble with the webmasters.

Let's time the naive way.

In [21]:
import time
import requests

In [7]:
ibm = ConceptModel(['IBM'])
ibm.explode(level=0)
len(ibm.concepts())

37

In [47]:
def scraper_timer(func):
    def wrapper(model, starting_point):
        start_time = time.time()
        ret = func(model, starting_point)
        print("Runtime: " + str(time.time() - start_time) + " seconds.")
        return ret
    return wrapper

def categories_snapshot(concept):
    dat = requests.get('https://en.wikipedia.org/wiki/' + concept).text
    return dat[dat.find("<div id='catlinks' class='catlinks'>"):]

def select(t, name):
    if t[1] == name:
        return t[2]
    else:
        return t[1]

@scraper_timer
def get_companies_with_scraping(model, starting_point):
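    # Each edge is a (weight, concept, concept) tuple; keep the endpoint that is not the starting point.
    top = [(select(edge, starting_point), edge[0]) for edge in model.edges()]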
    top_companies = [concept for concept in top if 'companies' in categories_snapshot(concept[0])]
    return top_companies

get_companies_with_scraping(ibm, 'IBM')

Runtime: 7.30610203742981 seconds.


[('Digital Equipment Corporation', 0.89564085),
 ('Advanced Micro Devices', 0.79349726),
 ('Sun Microsystems', 0.780642),
 ('Oracle Corporation', 0.7744718),
 ('Intel', 0.6541496),
 ('Hewlett-Packard', 0.6270959)]
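The timing wrapper above is tied to a two-argument function signature. A minimal sketch of a more general version follows, assuming nothing beyond the standard library; `timed` and `add` are hypothetical names used only for illustration.

```python
import functools
import time

def timed(func):
    """Print the wall-clock runtime of each call to `func`."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print("Runtime: {:.3f} seconds.".format(time.perf_counter() - start))
        return result
    return wrapper

@timed
def add(a, b=0):
    return a + b

add(1, b=2)
```

Accepting `*args` and `**kwargs` lets one decorator time functions of any signature, and `time.perf_counter()` is intended for measuring short durations.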

Let's apply our paradigm of reaching for a faster data source. Wikipedia [has an API](https://www.mediawiki.org/wiki/API:Main_page). It's admittedly a bit crufty, but it's a very powerful way of retrieving only the information that you need, and of retrieving it in fast batches. We use the [mwapi](https://pypi.python.org/pypi/mwapi/0.4.0) library to avoid writing the access code ourselves: this is a low-level library that is good for writing fast queries (written by two active Wikimedia Foundation developers for exactly that purpose). You can read the documentation [here](http://pythonhosted.org/mwapi/). You can try out queries and get a sense of how the API works in the [API sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox).
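One practical detail of batching: for ordinary clients the MediaWiki API accepts at most 50 values in a `titles` parameter, so a longer concept list has to be split across several requests. A minimal sketch of that splitting, where `batches` is a hypothetical helper and the titles are made up:

```python
def batches(titles, size=50):
    """Yield pipe-joined groups of at most `size` titles."""
    for i in range(0, len(titles), size):
        yield '|'.join(titles[i:i + size])

titles = ['Page {}'.format(n) for n in range(120)]  # made-up titles
print([batch.count('|') + 1 for batch in batches(titles)])  # [50, 50, 20]
```

Each yielded string can be passed as the `titles` argument of one query.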

In [24]:
import mwapi

In [49]:
session = mwapi.Session('https://en.wikipedia.org', user_agent='watsongraph notebook')

a = ConceptModel(['Apple'])

def api_timer(func):
    def wrapper(model):
        start_time = time.time()
        ret = func(model)
        print("Runtime: " + str(time.time() - start_time) + " seconds.")
        return ret
    return wrapper

@api_timer
def get_companies_with_api(model):
    companies = []
    cont = {}  # continuation parameters returned by the previous response
    while True:
        raw = session.get(action='query',
                          prop='categories',
                          clshow='!hidden',
                          cllimit=500,
                          titles='|'.join(model.concepts()),
                          **cont)
        query = raw['query']['pages']
        for p_k in query.keys():
            if query[p_k]['title'] not in companies:
                # A page may carry no 'categories' key in a given batch.
                for cat in query[p_k].get('categories', []):
                    if 'companies' in cat['title']:
                        companies.append(query[p_k]['title'])
                        break
        # The response includes a 'continue' block while more results remain.
        if 'continue' not in raw:
            break
        cont = raw['continue']
    return companies

get_companies_with_api(ConceptModel(['Apple']))

Let's put together a more advanced `ConceptModel` use case.

In this notebook we will put together a quick script for constructing a diagram of a "corporate network" linking together a list of companies branched out from IBM, our starting point.

We need some way of categorizing concepts according to whether or not they are companies. The following trick works well enough for our purposes, though as we shall see later, it's by no means foolproof.

In [103]:
def categories_snapshot(concept):
    dat = requests.get('https://en.wikipedia.org/wiki/' + concept).text
    return dat[dat.find("<div id='catlinks' class='catlinks'>"):]

In [104]:
'companies' in categories_snapshot('IBM')

True

In [105]:
def companies(l):
    ret = []
    for item in l:
        if 'companies' in categories_snapshot(item):
            ret.append(item)
    return ret

In [106]:
companies(['IBM', 'Microsoft', 'Apple', 'Apple Inc.', 'Apple pie'])

['IBM', 'Microsoft', 'Apple Inc.']
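A plain substring test on the page's category block is easy to fool. One possible tightening, sketched below, is to work from a list of category titles, match "company" or "companies" as a whole word, and skip list pages. `looks_like_company` is a hypothetical helper, and the category lists in the example are illustrative rather than fetched from Wikipedia.

```python
import re

# Match 'company' or 'companies' as a whole word, ignoring case.
COMPANY_WORD = re.compile(r'\bcompan(?:y|ies)\b', re.IGNORECASE)

def looks_like_company(title, category_titles):
    """Heuristic: reject list pages, then look for a company-like category."""
    if title.startswith('List of '):
        return False
    return any(COMPANY_WORD.search(cat) for cat in category_titles)

# Illustrative category lists (not fetched from Wikipedia).
print(looks_like_company('IBM', ['Category:Computer companies of the United States']))
print(looks_like_company('Apple pie', ['Category:American pies']))
print(looks_like_company('List of example operators', ['Category:Lists of companies']))
```

This is still a heuristic: an article about companies in general can carry a matching category without being a company itself.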

Each step that we take we will need to:

1. Augment the new nodes in the graph.
2. Pare the resulting graph down to companies.

Along the way we also need to keep track of the nodes that we've already augmented, so as not to waste time augmenting them again.
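The bookkeeping itself can be sketched independently of the service. The example below assumes a made-up adjacency map, `TOY_NEIGHBORS`, in place of real augmentation, and `expand_once` is a hypothetical helper. It uses a `set` for the already-expanded nodes so that membership checks stay cheap as the graph grows.

```python
# Made-up adjacency data standing in for the results of augmenting a node.
TOY_NEIGHBORS = {
    'IBM': ['Intel', 'Oracle Corporation'],
    'Intel': ['Advanced Micro Devices'],
    'Oracle Corporation': ['Sun Microsystems'],
}

def expand_once(nodes, expanded, neighbors):
    """Expand every node not yet in `expanded`; return the enlarged node set."""
    frontier = [n for n in sorted(nodes) if n not in expanded]
    result = set(nodes)
    for node in frontier:
        result.update(neighbors.get(node, []))
        expanded.add(node)  # remember it so later passes skip this node
    return result

expanded = set()
nodes = expand_once({'IBM'}, expanded, TOY_NEIGHBORS)
nodes = expand_once(nodes, expanded, TOY_NEIGHBORS)
print(sorted(expanded))  # ['IBM', 'Intel', 'Oracle Corporation']
```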

In [107]:
expanded_concepts = []

def iter_model(G):
    for concept in [concept for concept in G.concepts() if concept not in expanded_concepts]:
        G.augment(concept)
        expanded_concepts.append(concept)
    return G

In [108]:
ibm = ConceptModel(['IBM'])
ibm = iter_model(ibm)
ibm.concepts()

['.NET Framework',
 'ARM architecture',
 'Advanced Micro Devices',
 'Application programming interface',
 'Berkeley Software Distribution',
 'C (programming language)',
 'Central processing unit',
 'Cloud computing',
 'Compiler',
 'Digital Equipment Corporation',
 'Fortran',
 'FreeBSD',
 'Graphical user interface',
 'Hard disk drive',
 'Hewlett-Packard',
 'IBM',
 'Intel',
 'Java (programming language)',
 'Library (computing)',
 'Linux',
 'Microprocessor',
 'MySQL',
 'Object-oriented programming',
 'Operating system',
 'Oracle Corporation',
 'Programming language',
 'SQL',
 'Server (computing)',
 'Solaris (operating system)',
 'Sun Microsystems',
 'Supercomputer',
 'Unix',
 'Unix-like',
 'X Window System',
 'X86',
 'X86-64',
 'XML']

In [109]:
def reduce_model(G):
    # Check the concepts of the model that was passed in, not the global `ibm`.
    keep = companies(G.concepts())
    for concept in [concept for concept in G.concepts() if concept not in keep]:
        G.remove(concept)
    return G

In [110]:
ibm = reduce_model(ibm)

In [111]:
ibm.concepts()

['Advanced Micro Devices',
 'Digital Equipment Corporation',
 'Hewlett-Packard',
 'IBM',
 'Intel',
 'Oracle Corporation',
 'Sun Microsystems']

Let's stitch our two operations together and try it out!

In [118]:
def step(G):
    return reduce_model(iter_model(G))

In [119]:
corporate_network = step(ibm)

In [120]:
corporate_network.concepts()

['Advanced Micro Devices',
 'Apple Inc.',
 'Cisco Systems',
 'Dell',
 'Digital Equipment Corporation',
 'Hewlett-Packard',
 'IBM',
 'Intel',
 'List of mobile network operators of Europe',
 'Motorola',
 'Nokia',
 'Oracle Corporation',
 'Samsung',
 'Sun Microsystems',
 'Texas Instruments',
 'Vodafone']

In [128]:
corporate_network = step(step(corporate_network))

In [129]:
corporate_network.concepts()

['AT&T',
 'Advanced Micro Devices',
 'Apple Inc.',
 'Avex Group',
 'Capcom',
 'Cisco Systems',
 'Comcast',
 'Dell',
 'Digital Equipment Corporation',
 'Fuji Television',
 'Goldman Sachs',
 'Hewlett-Packard',
 'Hudson Soft',
 'IBM',
 'Intel',
 'Konami',
 'Korean Broadcasting System',
 'List of mobile network operators of Europe',
 'London Stock Exchange',
 'Motorola',
 'NASDAQ',
 'Namco',
 'New York Stock Exchange',
 'Nokia',
 'Oracle Corporation',
 'Oricon',
 'Philips',
 'Public company',
 'Samsung',
 'Sega',
 'Seoul Broadcasting System',
 'Siemens',
 'Sony',
 'Sony Computer Entertainment',
 'Sony Music Entertainment Japan',
 'Sun Microsystems',
 'THQ',
 'Tesco',
 'Texas Instruments',
 'Ubisoft',
 'Vodafone']

In [130]:
expanded_concepts

['IBM',
 'Advanced Micro Devices',
 'Digital Equipment Corporation',
 'Hewlett-Packard',
 'Intel',
 'Apple Inc.',
 'Cisco Systems',
 'Dell',
 'Motorola',
 'Oracle Corporation',
 'Sun Microsystems',
 'Texas Instruments',
 'Nokia',
 'Samsung',
 'Vodafone',
 'AT&T',
 'Korean Broadcasting System',
 'List of mobile network operators of Europe',
 'London Stock Exchange',
 'Seoul Broadcasting System',
 'Sony']

Let's get greedy.

In [131]:
corporate_network = step(step(step(step(step(corporate_network)))))

In [133]:
len(corporate_network.concepts())

107
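Deeply nested calls like the one above get unwieldy as the step count grows. A small helper that applies a one-argument function a given number of times is one way around that; `apply_n` is a hypothetical name, and the doubling function merely stands in for a real step.

```python
from functools import reduce

def apply_n(func, value, n):
    """Apply a one-argument function to `value` n times in a row."""
    return reduce(lambda acc, _: func(acc), range(n), value)

print(apply_n(lambda x: x * 2, 1, 5))  # 32
```

With a helper like this, five steps would read as `apply_n(step, corporate_network, 5)`.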

In [136]:
corporate_network.edges()[:20]

[(0.9743249, 'London and North Eastern Railway', 'North Eastern Railway (UK)'),
 (0.9730256, 'Caledonian Railway', 'First ScotRail'),
 (0.97047436, 'Great Eastern Railway', 'London and North Eastern Railway'),
 (0.96971184, 'Advanced Micro Devices', 'Intel'),
 (0.9661626, 'Great Northern Railway (Great Britain)', 'Midland Railway'),
 (0.9650179, 'Shogakukan', 'Viz Media'),
 (0.964935,
  'London and North Eastern Railway',
  'Great Northern Railway (Great Britain)'),
 (0.9637322, 'Korean Broadcasting System', 'Seoul Broadcasting System'),
 (0.9636774, 'Vodafone', 'List of mobile network operators of Europe'),
 (0.9593524, 'First Great Western', 'Network Rail'),
 (0.95859057,
  'Great Northern Railway (Great Britain)',
  'North Eastern Railway (UK)'),
 (0.9582207, 'First Great Western', 'Great Western Railway'),
 (0.95611596, 'Hudson Soft', 'Nintendo'),
 (0.95601696, 'London and North Western Railway', 'Midland Railway'),
 (0.94842815,
  'Great Northern Railway (Great Britain)',
  'Londo

We will stop here for now.

In [140]:
corporate_network.neighborhood('Viz Media')

[('Fuji Television', 0.65638936),
 ('Sony Music Entertainment Japan', 0.60574347),
 ('TV Tokyo', 0.78721863),
 ('Avex Group', 0.50190413),
 ('Bandai', 0.5434743),
 ('Kodansha', 0.80848604),
 ('TV Asahi', 0.61470026),
 ('Tokyopop', 0.8365715),
 ('Square Enix', 0.51347756),
 ('Shueisha', 0.9482551),
 ('Shogakukan', 0.9650179),
 ('Viz Media', 1)]
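The neighborhood comes back in no particular order. Since each entry is a `(concept, weight)` pair, ranking by strength is a one-line sort. The sketch below copies the list from the output above into a literal so that it runs on its own.

```python
# Copied from the neighborhood output above.
neighborhood = [('Fuji Television', 0.65638936),
                ('Sony Music Entertainment Japan', 0.60574347),
                ('TV Tokyo', 0.78721863),
                ('Avex Group', 0.50190413),
                ('Bandai', 0.5434743),
                ('Kodansha', 0.80848604),
                ('TV Asahi', 0.61470026),
                ('Tokyopop', 0.8365715),
                ('Square Enix', 0.51347756),
                ('Shueisha', 0.9482551),
                ('Shogakukan', 0.9650179),
                ('Viz Media', 1)]

# Strongest relationships first.
ranked = sorted(neighborhood, key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
# [('Viz Media', 1), ('Shogakukan', 0.9650179), ('Shueisha', 0.9482551)]
```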

Daily page views are a bit of feature sugar that comes prebuilt in the `watsongraph` library. This data is based on a 30-day average, and is generated by a call against the (just recently introduced!) [Wikipedia Pageview API](https://wikimedia.org/api/rest_v1/?doc).

Fetching view counts adds significant execution time, so they are not populated for you automatically, but you can populate them yourself at any time using the `set_view_counts()` method. `concepts_by_view_count()` is useful for visualizing the result.
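For a sense of what such a lookup involves, here is a rough sketch of building a per-article request against the Wikimedia REST pageviews endpoint and averaging the daily counts in a response. How `watsongraph` does this internally is not shown in this notebook, so `pageviews_url` and `average_daily_views` are hypothetical helpers, and `fake_payload` is a made-up response containing only the `views` field.

```python
from urllib.parse import quote

PAGEVIEWS_URL = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
                 '{project}/all-access/all-agents/{article}/daily/{start}/{end}')

def pageviews_url(article, start, end, project='en.wikipedia'):
    """Build a per-article daily pageviews URL; dates are YYYYMMDD strings."""
    return PAGEVIEWS_URL.format(project=project,
                                article=quote(article.replace(' ', '_'), safe=''),
                                start=start, end=end)

def average_daily_views(payload):
    """Mean of the 'views' field across the 'items' of a decoded response."""
    items = payload.get('items', [])
    return sum(item['views'] for item in items) / len(items) if items else 0.0

fake_payload = {'items': [{'views': 100}, {'views': 200}, {'views': 300}]}
print(average_daily_views(fake_payload))  # 200.0
```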

In [None]:
# `kinda_random` is not defined in this notebook; it is assumed to be a list of concept labels.
# View counts are not populated automatically, so set them explicitly before reading them.
kr = ConceptModel(kinda_random)
kr.set_view_counts()
kr.concepts_by_view_count()

One last technical pointer: `watsongraph` provides default bindings for saving to and loading from JSON. These take the form of the complementary `to_json()` and `from_json()` object methods, and are useful for saving and loading your models.

In [None]:
kr.to_json()
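To persist a model between sessions, the JSON data can be written to disk and read back with the standard `json` module. The sketch below assumes that the data is JSON-serializable; `model_data` is a made-up stand-in, since the exact structure that `to_json()` produces is not shown in this notebook.

```python
import json
import os
import tempfile

# Made-up stand-in for the data produced by a model's JSON export.
model_data = {'nodes': ['IBM', 'Intel'], 'links': [['IBM', 'Intel', 0.65]]}

path = os.path.join(tempfile.mkdtemp(), 'model.json')

with open(path, 'w') as f:
    json.dump(model_data, f)   # save

with open(path) as f:
    restored = json.load(f)    # load

print(restored == model_data)  # True
```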