### Intro
This notebook shows how to use the Python package `spacy` to extract software-related entities from GitHub repository `README` files and link them to an ontology.

In [2]:
import json
import sys

import pandas

from utils import get_readme
from entity_extraction import extract_entities

### Ontology Linking

We will use [WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page) as our ontology knowledge base. WikiData contains nearly 100 million entities, many of which are linked to other knowledge bases. For this project we are interested in software-related entities, so for simplicity we will only use entities that have a StackOverflow tag.
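As a rough illustration of that restriction, the sketch below keeps only the candidate entities whose WikiData ID appears in a tag table. The column names (`wikidata_id`, `tags`, `text`) and the sample rows are assumptions made for this example; the real tag file used by this project may be laid out differently.

```python
import pandas

# Hypothetical tag table: one row per WikiData entity that has a StackOverflow tag.
tag_table = pandas.DataFrame(
    {
        "wikidata_id": ["Q842014", "Q28865", "Q381"],
        "tags": ["django", "python", "ubuntu"],
    }
)

# Hypothetical linker output, including one entity (Paris) without a tag.
candidates = pandas.DataFrame(
    {
        "wikidata_id": ["Q842014", "Q90", "Q28865"],
        "text": ["Django", "Paris", "python"],
    }
)

# An inner join drops every candidate that is missing from the tag table.
software_entities = candidates.merge(tag_table, on="wikidata_id", how="inner")
print(software_entities)
```

An inner join is only one way to express the filter; an `isin` mask on `wikidata_id` would do the same job when the tag column itself is not needed.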

### Pretrained Model

We will use a pretrained spaCy model to extract the entities. The `spacy-entity-linker` package is trained on the [Kensho Derived Wikimedia Dataset](https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data), a linked dataset between WikiData and Wikipedia hosted on Kaggle.

It uses an alias table and a max-prior method, which simply links each text mention to the entity for which the matching alias is used most frequently.
>Currently the only method for choosing an entity given different possible matches (e.g. Paris - city vs Paris - firstname) is max-prior. This method achieves around 70% accuracy on predicting the correct entities behind link descriptions on wikipedia.

This could be improved by training a context-aware matching model, but the current approach is sufficient for a proof of concept.
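The max-prior rule described above can be sketched in a few lines. This is not the `spacy-entity-linker` implementation; the alias table, the entity keys, and the counts are all made up for illustration.

```python
from collections import Counter

# Hypothetical alias table: lowercased surface form -> how often each
# entity was the target of that alias. Keys and counts are invented.
ALIAS_TABLE = {
    "paris": Counter({"paris_city": 950, "paris_firstname": 50}),
    "python": Counter({"python_language": 800, "python_snake": 200}),
}


def link_max_prior(mention, alias_table):
    """Return the entity most often seen behind this alias, ignoring context."""
    candidates = alias_table.get(mention.lower())
    if not candidates:
        return None  # unknown alias: nothing to link
    entity_id, _count = candidates.most_common(1)[0]
    return entity_id


print(link_max_prior("Paris", ALIAS_TABLE))
print(link_max_prior("Kubernetes", ALIAS_TABLE))
```

Because the surrounding sentence is never consulted, "Paris" resolves to the same entity in every document, which is the limitation a context-aware model would address.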




In [12]:
%pycat entity_extraction.py

import pandas
import spacy  # version 3.0.6

# initialize language model
nlp = spacy.load("en_core_web_md")

# add pipeline (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

def extract_entities(text, filter_file=None):
    if filter_file:
        tagged_entities = pandas.read_csv(

### Test Case
To demonstrate the model's capabilities, we will extract entities from GitHub repository README files. These files often contain descriptions of the project's architecture and technology stack. CMS catalogues GitHub repositories in Snyk, so the entities extracted from a README can be linked to those extracted from other sources, e.g. Snyk dependency scanning or CFACTS.
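The `get_readme` helper used in the next cell comes from `utils` and is not shown in this notebook. As a minimal sketch of how such a helper might prepare its request, the function below builds a call to the GitHub REST API README endpoint. The function name `build_readme_request` and its `token` parameter are hypothetical, and the real helper may fetch or post-process the file differently.

```python
import urllib.request


def build_readme_request(owner, repo, token=None):
    """Build (but do not send) a request for a repository's README."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    # Ask GitHub for the raw file contents instead of base64-encoded JSON.
    headers = {"Accept": "application/vnd.github.raw+json"}
    if token:
        # Authenticated requests get a higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


request = build_readme_request("cmsgov", "bluebutton-web-server")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the download; that step is left out here so the sketch runs without network access.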

In [34]:
readme = get_readme('cmsgov', 'bluebutton-web-server')
print(readme)

  else: warn('Neither GITHUB_TOKEN nor GITHUB_JWT_TOKEN found: running as unauthenticated')


Blue Button Web Server


This server serves as a data provider for sharing Medicare claims data with third parties.
The server connects to Medicare.gov for authentication, and uses OAuth2 to confirm permission
grants to external app developers. The data itself comes from a back end FHIR server
(https://github.com/CMSgov/bluebutton-data-server), which in turn pulls data from the CMS
Chronic Conditions Warehouse (https://www.ccwdata.org)
For more information on how to connect to the API implemented here, check out our
developer documentation at https://cmsgov.github.io/bluebutton-developer-help/. Our most
recent deployment is at https://sandbox.bluebutton.cms.gov, and you can also
check out our Google Group at https://groups.google.com/forum/#!forum/developer-group-for-cms-blue-button-api
for more details.
The information below outlines setting up the server for development or your own environment.
For general information on deploying Django see https://docs.djangoproject.com/en/1.11/how

In [35]:
entities = extract_entities(readme, filter_file='entity_tags.csv')
entities.set_index('wikidata_id').drop(columns=['superclasses'])

| wikidata_id | text | label | description | tags |
|---|---|---|---|---|
| Q11288 | Web Server | web server | server that serves website content to clients | webserver |
| Q212108 | authentication | authentication | act of confirming the truth of an attribute of... | authentication |
| Q131093 | CMS | content management system | software | content-management-system |
| Q165194 | API | application programming interface | set of subroutine definitions, protocols, and ... | api |
| Q842014 | Django | Django | Python web framework | django |
| Q278485 | # | hashtag | word or an unspaced phrase prefixed with the n... | hashtag |
| Q9135 | OS | operating system | software that manages computer hardware resources | operating-system |
| Q381 | Ubuntu | Ubuntu | Debian-based Linux operating system | ubuntu |
| Q400857 | environment variables | environment variable | small piece of data used to store values for s... | environment-variables |
| Q28865 | python | Python | general-purpose, high-level programming language | python |


In [36]:
entities_long = entities.set_index('wikidata_id').explode('superclasses').dropna(subset=['superclasses'])
entities_long['wikidata_parent_id'] = entities_long['superclasses'].apply(lambda x: x['wikidata_id'])
entities_long['wikidata_parent_label'] = entities_long['superclasses'].apply(lambda x: x['label'])
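The reshaping in the cell above can be easier to follow on a toy frame. The sketch below assumes, as that cell implies, that `superclasses` holds a list of dicts with `wikidata_id` and `label` keys; the two rows and the empty list are illustrative data rather than actual output of `extract_entities`.

```python
import pandas

toy = pandas.DataFrame(
    {
        "wikidata_id": ["Q131093", "Q212108"],
        "label": ["content management system", "authentication"],
        "superclasses": [
            [
                {"wikidata_id": "Q17155032", "label": "software category"},
                {"wikidata_id": "Q7397", "label": "software"},
            ],
            [],  # an entity with no superclasses
        ],
    }
)

# explode() emits one row per list element; an empty list becomes NaN,
# which dropna() then removes.
toy_long = (
    toy.set_index("wikidata_id")
    .explode("superclasses")
    .dropna(subset=["superclasses"])
)

# Pull the parent ID and label out of each remaining dict.
toy_long["wikidata_parent_id"] = toy_long["superclasses"].apply(lambda x: x["wikidata_id"])
toy_long["wikidata_parent_label"] = toy_long["superclasses"].apply(lambda x: x["label"])
print(toy_long.drop(columns=["superclasses"]))
```

The entity with two superclasses appears twice in the long frame, once per parent, while the entity with none disappears entirely.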

In [37]:
entities_long.drop(columns=['superclasses']).head(20)

| wikidata_id | text | label | description | tags | wikidata_parent_id | wikidata_parent_label |
|---|---|---|---|---|---|---|
| Q11288 | Web Server | web server | server that serves website content to clients | webserver | Q4485156 | software feature |
| Q131093 | CMS | content management system | software | content-management-system | Q17155032 | software category |
| Q131093 | CMS | content management system | software | content-management-system | Q7397 | software |
| Q131093 | CMS | content management system | software | content-management-system | Q40056 | computer program |
| Q165194 | API | application programming interface | set of subroutine definitions, protocols, and ... | api | Q23808 | interface |
| Q165194 | API | application programming interface | set of subroutine definitions, protocols, and ... | api | Q132364 | communications protocol |
| Q165194 | API | application programming interface | set of subroutine definitions, protocols, and ... | api | Q241317 | computing platform |
| Q842014 | Django | Django | Python web framework | django | Q1330336 | web framework |
| Q278485 | # | hashtag | word or an unspaced phrase prefixed with the n... | hashtag | Q658349 | tag |
| Q9135 | OS | operating system | software that manages computer hardware resources | operating-system | Q241317 | computing platform |
