# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>CSCI 563/INF 558: Building Knowledge Graphs</u> during Spring 2020 at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

First, you need to define what a single entry looks like for each type of record (one record class per dataset):

In [1]:
import rltk
import json

# You can use this tokenizer in case you need to manipulate some data
tokenizer = rltk.CrfTokenizer()

# RLTK IMDB Record
class IMDBRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['name']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

# RLTK AFI Record
class AFIRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['title']
    
    @rltk.cached_property
    def date_string(self):
        return self.raw_object.get('release_date', '')

You can load your JSON Lines files into RLTK as follows:

In [2]:
imdb_file = 'imdb.jl'
afi_file  = 'afi.jl'

# load Datasets
ds_imdb = rltk.Dataset(reader=rltk.JsonLinesReader(imdb_file), record_class=IMDBRecord, adapter=rltk.MemoryKeyValueAdapter())
ds_afi  = rltk.Dataset(reader=rltk.JsonLinesReader(afi_file),  record_class=AFIRecord,  adapter=rltk.MemoryKeyValueAdapter())

And we can inspect a few entries:

In [3]:
# print some entries
print(ds_imdb.generate_dataframe().head(5))
print(ds_afi.generate_dataframe().head(5))

                                      id               name_string  \
0  https://www.imdb.com/title/tt0033467/              Citizen Kane   
1  https://www.imdb.com/title/tt0068646/             The Godfather   
2  https://www.imdb.com/title/tt0032138/          The Wizard of Oz   
3  https://www.imdb.com/title/tt0111161/  The Shawshank Redemption   
4  https://www.imdb.com/title/tt0110912/              Pulp Fiction   

                    name_tokens  
0               {Kane, Citizen}  
1              {Godfather, The}  
2         {Oz, The, Wizard, of}  
3  {The, Shawshank, Redemption}  
4               {Fiction, Pulp}  
                                                  id  \
0  https://catalog.afi.com/#0d4edc56125f42fde0e02...   
1  https://catalog.afi.com/#ca4bcba711e75c8e80216...   
2  https://catalog.afi.com/#514aceac24037dcbcbd5c...   
3  https://catalog.afi.com/#41a24923501602a537e0f...   
4  https://catalog.afi.com/#f0bc574107fa3da879263...   

                  name_string        d

## Field (Attribute) Similarity

Here are two example functions for field (attribute) similarity:

In [4]:
def name_string_similarity_1(r_imdb, r_afi):
    ''' Example dummy similarity function: Jaro-Winkler on the first 3 characters of each name '''
    s1 = r_imdb.name_string[:3]
    s2 = r_afi.name_string[:3]
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def name_string_similarity_2(r_imdb, r_afi):
    ''' Example dummy similarity function: 1 if the names are identical, 0 otherwise '''
    s1 = r_imdb.name_string
    s2 = r_afi.name_string
    
    if s1 == s2:
        return 1
    
    return 0
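
The two functions above compare raw strings. A token-based measure is another common option, and the sketch below shows Jaccard similarity over sets of title tokens. It is plain Python rather than an RLTK call, `tokenize` is a simplistic stand-in for the tokenizer used earlier, and the two titles are hypothetical inputs chosen for illustration.

```python
def tokenize(text):
    """Lowercase a title and split it on whitespace (simplistic stand-in tokenizer)."""
    return set(text.lower().split())


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union; 0.0 if both sets are empty."""
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Hypothetical pair of titles: one dataset keeps the article, the other drops it
score = jaccard_similarity(tokenize('The Godfather'), tokenize('Godfather'))
print(score)  # 1 shared token out of 2 distinct tokens -> 0.5
```

Unlike the exact-match function, a measure like this gives partial credit when the titles share most of their words.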

## Entity Linking

Here's how you can combine multiple similarity functions into a single weighted scoring function:

In [5]:
# threshold above which we consider a pair of records to be a match
MY_TRESH = 0.8 # this number is just an example, you need to change it

# entity linkage scoring function
def rule_based_method(r_imdb, r_afi):
    score_1 = name_string_similarity_1(r_imdb, r_afi)
    score_2 = name_string_similarity_2(r_imdb, r_afi)
    
    total = 0.7 * score_1 + 0.3 * score_2
    
    # return two values: boolean if they match or not, float to determine confidence
    return total > MY_TRESH, total
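
To see how the weights and the threshold interact, here is a small standalone sketch of the same arithmetic. The `combine` helper and the `THRESHOLD` name are introduced only for illustration; the weights and the 0.8 cutoff are the example values from the cell above.

```python
THRESHOLD = 0.8  # same example value as MY_TRESH above


def combine(score_1, score_2, w1=0.7, w2=0.3):
    """Weighted sum of two field similarities, mirroring rule_based_method."""
    total = w1 * score_1 + w2 * score_2
    return total > THRESHOLD, total


# Perfect prefix similarity, but the full names differ:
# the total cannot exceed 0.7, so the pair is rejected.
print(combine(1.0, 0))

# Perfect prefix similarity and identical names: the pair is accepted.
print(combine(1.0, 1))
```

With these example values, a pair whose names are not identical can never pass, because the first score alone contributes at most 0.7. When the names are identical, the pair passes only if the first score is above roughly 0.714. Keep this in mind when tuning the weights and the threshold.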

## EL Evaluation

RLTK has a built-in evaluation module for benchmarking. Let's load our development set:

In [6]:
# load development set data
dev_set_file = 'imdb_afi_el.dev.json'
# use a context manager so the file is closed once the data is loaded
with open(dev_set_file, "r") as devset_file_handle:
    devset_data = json.load(devset_file_handle)

And now build the ground truth from the development set:

In [7]:
gt = rltk.GroundTruth()
for item in devset_data:
    if item['afi_movie'] is not None:
        r_imdb = ds_imdb.get_record(item['imdb_movie'])
        r_afi  = ds_afi.get_record(item['afi_movie']) 
        gt.add_positive(r_imdb.raw_object['url'], r_afi.raw_object['url'])
gt.generate_all_negatives(ds_imdb, ds_afi, range_in_gt=True)
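
The sketch below illustrates, in plain Python, what generating negatives amounts to under one assumption: that `range_in_gt=True` restricts the candidate pairs to ids already present in the ground truth. The ids are hypothetical placeholders (the real ones are URLs), and the eleven positives are chosen only to mirror the counts printed in the evaluation output later in this notebook.

```python
from itertools import product

# Hypothetical positive pairs (imdb_id, afi_id)
positives = {(f'imdb_{i}', f'afi_{i}') for i in range(11)}

# Ids that appear anywhere in the ground truth
imdb_ids = {imdb_id for imdb_id, _ in positives}
afi_ids = {afi_id for _, afi_id in positives}

# Every cross pair of those ids that is not a known positive becomes a negative
negatives = set(product(imdb_ids, afi_ids)) - positives

print(len(positives), len(negatives))  # 11 110
```

If every id appears in exactly one positive pair, 11 positives yield 11 × 11 − 11 = 110 negatives.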

Let's score the candidate pairs defined by the ground truth:

In [8]:
trial = rltk.Trial(gt)
candidate_pairs = rltk.get_record_pairs(ds_imdb, ds_afi, ground_truth=gt)
for r_imdb, r_afi in candidate_pairs:
    result, confidence = rule_based_method(r_imdb, r_afi)
    trial.add_result(r_imdb, r_afi, result, confidence)

Now let's evaluate our trial results:

In [9]:
trial.evaluate()
print('Trial statistics based on Ground-Truth from development set data:')
print(f'tp: {trial.true_positives:.06f} [{len(trial.true_positives_list)}]')
print(f'fp: {trial.false_positives:.06f} [{len(trial.false_positives_list)}]')
print(f'tn: {trial.true_negatives:.06f} [{len(trial.true_negatives_list)}]')
print(f'fn: {trial.false_negatives:.06f} [{len(trial.false_negatives_list)}]')

Trial statistics based on Ground-Truth from development set data:
tp: 0.000000 [0]
fp: 0.000000 [0]
tn: 1.000000 [110]
fn: 1.000000 [11]
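
The raw counts are usually summarized as precision, recall and F1. The helper below is a minimal plain-Python sketch of those formulas, not an RLTK function. Returning 0.0 when a denominator is zero is a convention assumed here, since the ratio is otherwise undefined.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts taken from the trial output above: no pair was predicted as a match
print(precision_recall_f1(tp=0, fp=0, fn=11))  # (0.0, 0.0, 0.0)
```

With the dummy similarity functions, all 11 true matches are missed, so recall is zero. Improving the similarity functions and tuning the threshold should move pairs from the false-negative count to the true-positive count.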


# Task 2: Using RDFLib for Knowledge Representation

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a `Graph` is a Python collection of RDF subject, predicate, object triples.

This notebook introduces simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).

In [10]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF

Let's define some namespaces:

In [11]:
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://inf558.org/myfakenamespace#')

We can create a graph:

In [12]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)

Define a URI, then add a simple triple to the graph:

In [13]:
node_uri = URIRef(MYNS['inf558_production_company'])
my_kg.add((node_uri, RDF.type, MYNS['productionCompany']))

Add an additional triple (which describes the same subject, `node_uri`):

In [14]:
my_kg.add((node_uri, FOAF['name'], Literal('INF 558 Production Company')))

And now let's dump our graph triples into a Turtle (`ttl`) file:

In [15]:
my_kg.serialize('sample_graph.ttl', format="turtle")