# Sources of Distant Supervision for NER/IE from Stack Overflow Posts

In this notebook we acquire sources of distant supervision for our information extraction models by running SPARQL queries against Wikidata.

## Wikidata Programming Languages

For the Snorkel example in Chapter 5, we build a programming language extractor from the titles and bodies of Stack Overflow questions. Here we generate the file that example uses by querying Wikidata with SPARQL for a list of programming languages. We then use these language names to label positive examples of programming languages in posts for training our discriminative/network extractor model.

The following SPARQL query returns the English names of all [instances of (P31)](https://www.wikidata.org/wiki/Property:P31) [programming language (Q9143)](https://www.wikidata.org/wiki/Q9143) in Wikidata.

We `SELECT DISTINCT` the item and its label, then filter the label's language to English so that each item appears once instead of once for every language in which it has a label.

In [1]:
!pip install -q jsonlines requests

import json
import jsonlines
import requests

In [2]:
url = 'https://query.wikidata.org/sparql'
query = """
# Get the English labels of all programming languages
SELECT DISTINCT ?item ?item_label
WHERE {
  ?item wdt:P31 wd:Q9143 ;  # P31: instance of, Q9143: programming language
        rdfs:label ?item_label .

  FILTER (LANG(?item_label) = "en") .  # English labels only
}
"""
r = requests.get(url, params={'format': 'json', 'query': query})
r.raise_for_status()  # Stop on an HTTP error rather than failing inside r.json()
data = r.json()

In [6]:
print(json.dumps(data["results"]["bindings"][0:4], indent=4, sort_keys=True))

[
    {
        "item": {
            "type": "uri",
            "value": "http://www.wikidata.org/entity/Q2005"
        },
        "item_label": {
            "type": "literal",
            "value": "JavaScript",
            "xml:lang": "en"
        }
    },
    {
        "item": {
            "type": "uri",
            "value": "http://www.wikidata.org/entity/Q1374139"
        },
        "item_label": {
            "type": "literal",
            "value": "Euphoria",
            "xml:lang": "en"
        }
    },
    {
        "item": {
            "type": "uri",
            "value": "http://www.wikidata.org/entity/Q1334586"
        },
        "item_label": {
            "type": "literal",
            "value": "Emacs Lisp",
            "xml:lang": "en"
        }
    },
    {
        "item": {
            "type": "uri",
            "value": "http://www.wikidata.org/entity/Q1356671"
        },
        "item_label": {
            "type": "literal",
            "value": "GT.M",
            "xml:lang": "en"
        }
    }
]

## Extract the Language Labels from Nested JSON

Nested JSON is awkward to work with in a `DataFrame`, so we flatten each result into a record that keeps only what we need: the name, the entity URL and the entity ID.

In [8]:
languages = [
    {
        'name': x['item_label']['value'],
        'kb_url': x['item']['value'],
        'kb_id': x['item']['value'].split('/')[-1], # Get the ID
    }
    for x in data['results']['bindings']
]

# Filter out an erroneous entry by its Wikidata ID
languages = [x for x in languages if x['kb_id'] != 'Q25111344']

print(f'There were {len(languages):,} languages returned.\n')

for language in languages[0:10]:
    print(language)

There were 1,417 languages returned.

{'name': 'JavaScript', 'kb_url': 'http://www.wikidata.org/entity/Q2005', 'kb_id': 'Q2005'}
{'name': 'Euphoria', 'kb_url': 'http://www.wikidata.org/entity/Q1374139', 'kb_id': 'Q1374139'}
{'name': 'Emacs Lisp', 'kb_url': 'http://www.wikidata.org/entity/Q1334586', 'kb_id': 'Q1334586'}
{'name': 'GT.M', 'kb_url': 'http://www.wikidata.org/entity/Q1356671', 'kb_id': 'Q1356671'}
{'name': 'REBOL', 'kb_url': 'http://www.wikidata.org/entity/Q1359171', 'kb_id': 'Q1359171'}
{'name': 'Embedded SQL', 'kb_url': 'http://www.wikidata.org/entity/Q1335009', 'kb_id': 'Q1335009'}
{'name': 'SystemVerilog', 'kb_url': 'http://www.wikidata.org/entity/Q1387402', 'kb_id': 'Q1387402'}
{'name': 'BETA', 'kb_url': 'http://www.wikidata.org/entity/Q830842', 'kb_id': 'Q830842'}
{'name': 'newLISP', 'kb_url': 'http://www.wikidata.org/entity/Q827233', 'kb_id': 'Q827233'}
{'name': 'Verilog', 'kb_url': 'http://www.wikidata.org/entity/Q827773', 'kb_id': 'Q827773'}


## Write Languages to Disk as JSON Lines

In [10]:
with jsonlines.open('programming_languages.jsonl', mode='w') as writer:
    writer.write_all(languages)
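
JSON Lines stores one JSON object per line, so the file can be read back with the standard library alone. The following is a minimal sketch of such a sanity check; the `read_jsonl` helper, the temporary file and the two sample records are illustrative and not part of the original notebook.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records in the same shape as the `languages` list above.
sample = [
    {"name": "JavaScript", "kb_url": "http://www.wikidata.org/entity/Q2005", "kb_id": "Q2005"},
    {"name": "Verilog", "kb_url": "http://www.wikidata.org/entity/Q827773", "kb_id": "Q827773"},
]

# Write one JSON object per line to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    round_tripped = read_jsonl(path)
```

Assuming the cell above has run, calling `read_jsonl('programming_languages.jsonl')` should give back a list equal to `languages`.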

## Get a List of Operating Systems for Negative LFs

In [11]:
url = 'https://query.wikidata.org/sparql'
query = """
# Get the English labels of all operating systems
SELECT DISTINCT ?item ?item_label
WHERE {
  ?item wdt:P31 wd:Q9135 ;  # P31: instance of, Q9135: operating system
        rdfs:label ?item_label .

  FILTER (LANG(?item_label) = "en") .  # English labels only
}
"""
r = requests.get(url, params={'format': 'json', 'query': query})
r.raise_for_status()  # Stop on an HTTP error rather than failing inside r.json()
data = r.json()

In [12]:
oses = [
    {
        'name': x['item_label']['value'],
        'kb_url': x['item']['value'],
        'kb_id': x['item']['value'].split('/')[-1], # Get the ID
    }
    for x in data['results']['bindings']
]

print(f'There were {len(oses):,} operating systems returned.\n')

for os_record in oses[0:10]:
    print(os_record)

There were 1,066 operating systems returned.

{'name': 'Windows 8', 'kb_url': 'http://www.wikidata.org/entity/Q5046', 'kb_id': 'Q5046'}
{'name': 'Möbius', 'kb_url': 'http://www.wikidata.org/entity/Q3869245', 'kb_id': 'Q3869245'}
{'name': 'ITIX', 'kb_url': 'http://www.wikidata.org/entity/Q3789886', 'kb_id': 'Q3789886'}
{'name': 'TinyKRNL', 'kb_url': 'http://www.wikidata.org/entity/Q3991642', 'kb_id': 'Q3991642'}
{'name': 'Myarc Disk Operating System', 'kb_url': 'http://www.wikidata.org/entity/Q3841260', 'kb_id': 'Q3841260'}
{'name': 'NX-OS', 'kb_url': 'http://www.wikidata.org/entity/Q3869717', 'kb_id': 'Q3869717'}
{'name': 'Unslung', 'kb_url': 'http://www.wikidata.org/entity/Q4006074', 'kb_id': 'Q4006074'}
{'name': 'KnopILS', 'kb_url': 'http://www.wikidata.org/entity/Q3815960', 'kb_id': 'Q3815960'}
{'name': 'Multiuser DOS', 'kb_url': 'http://www.wikidata.org/entity/Q3867065', 'kb_id': 'Q3867065'}
{'name': 'MDOS', 'kb_url': 'http://www.wikidata.org/entity/Q3841258', 'kb_id': 'Q3841258'}


In [13]:
with jsonlines.open('operating_systems.jsonl', mode='w') as writer:
    writer.write_all(oses)
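
The programming language and operating system cells differ only in the Wikidata class ID, so they could be folded into two small helpers. The following is a sketch under that assumption: `QUERY_TEMPLATE`, `build_instance_query` and `flatten_bindings` are hypothetical names, and the example binding is reconstructed from the flattened `Windows 8` record shown above, not copied from a live response.

```python
# Doubled braces are literal braces in str.format; {class_id} is the only field.
QUERY_TEMPLATE = """
SELECT DISTINCT ?item ?item_label
WHERE {{
  ?item wdt:P31 wd:{class_id} ;
        rdfs:label ?item_label .

  FILTER (LANG(?item_label) = "en") .
}}
"""


def build_instance_query(class_id):
    """Return a SPARQL query for the English labels of instances of `class_id`."""
    return QUERY_TEMPLATE.format(class_id=class_id)


def flatten_bindings(bindings):
    """Un-nest SPARQL JSON bindings into flat name / kb_url / kb_id records."""
    return [
        {
            "name": b["item_label"]["value"],
            "kb_url": b["item"]["value"],
            "kb_id": b["item"]["value"].rsplit("/", 1)[-1],  # Trailing path segment
        }
        for b in bindings
    ]


# Illustrative binding in the shape returned by the query service.
example_bindings = [
    {
        "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5046"},
        "item_label": {"type": "literal", "value": "Windows 8", "xml:lang": "en"},
    }
]

os_query = build_instance_query("Q9135")
records = flatten_bindings(example_bindings)
```

The string from `build_instance_query` can be passed as the `query` parameter of the `requests.get` call used above, and `flatten_bindings(data['results']['bindings'])` would then replace the list comprehension.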

## Conclusion

Now we are ready to use our programming languages and operating systems in our labeling functions (LFs) in the Snorkel notebook!
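
As a rough illustration of how these lists might feed labeling functions, here is a sketch in plain Python with no Snorkel dependency. It assumes that candidates are short text spans and that an exact, case-sensitive match is good enough. The label values, the function names and the small name sets are all hypothetical; in practice the sets would be loaded from the two `.jsonl` files written above.

```python
# Hypothetical label values; -1 is used here to mean "no vote".
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Small illustrative subsets of the names retrieved above.
language_names = {"JavaScript", "Emacs Lisp", "Verilog"}
os_names = {"Windows 8", "NX-OS", "Multiuser DOS"}


def lf_known_language(span_text):
    """Vote POSITIVE when the span is exactly a known programming language name."""
    return POSITIVE if span_text.strip() in language_names else ABSTAIN


def lf_known_os(span_text):
    """Vote NEGATIVE when the span is exactly a known operating system name."""
    return NEGATIVE if span_text.strip() in os_names else ABSTAIN


candidates = ["JavaScript", "Windows 8", "pandas"]
labels = [(lf_known_language(s), lf_known_os(s)) for s in candidates]
# labels == [(1, -1), (-1, 0), (-1, -1)]
```

Exact matching is deliberately conservative. Names such as `BETA` or `Euphoria` are also ordinary words, so looser matching (for example, ignoring case) would likely add noisy positive votes.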