# Merge BnF DATA, DBpedia and Wikidata

In this notebook, we apply a method to merge three datasets (BnF, DBpedia and Wikidata)

* First, we drop duplicates of each datasets. 

* Secondly, we merge the three datasets  by removing duplicate data. To realise that, we will use the Linkage toolkit who calculate the proximity (by giving a score) between to string from three dataframes.

* Previously, we have to collect data about economists with SPARQL queries.

In [2]:
# Import libraries usefull

from SPARQLWrapper import SPARQLWrapper, SPARQLWrapper2, JSON, TURTLE, XML, RDFXML
import pprint
import csv
# from bs4 import BeautifulSoup

from collections import Counter
from operator import itemgetter
import pandas as pd

import sqlite3 as sql
import time

from importlib import reload
from shutil import copyfile

In [3]:
import sparql_functions as spqf # It's made-home fonctions created by Francesco Beretta
# so they must to be in the same folder as this file.

In [4]:
### La connextion à une base de données SQLite créé la base de données si elle n'est pas encore existante, 
#  et ce dans le dossier indiqué

cn = sql.connect('data/sparql_queries.db')

In [5]:
## Un curseur est un objet qui permet d'exécuter des requêtes sur la base des données en les isolant proprement
c = cn.cursor()

##  https://www.sqlite.org/lang_datefunc.html
# on vérifie ainsi qu'on est bien connecté à la base et on a l'heure locale actuelle
c.execute("SELECT datetime('now', 'localtime')")
print(c.fetchone())

('2021-04-26 15:25:14',)


# Query economists and jurists from BnF Data

First step, we need datas about economists and jurists with theirs proprieties on 'BnF Data' so we realise a SPARQL query. We need proprieties to realise the merge between three datasets:
  * Name
  
  * Birth date
  
  * Date of death
  
  * Place of Birth
  
  * Place of Death

In addition, we add the biographie to filter the population we need. 

In [9]:
### It's define the database ligne to use
pk_query = 1

# Connexion to the database
original_db = 'data/sparql_queries.db'
conn = sql.connect(original_db)

### It runs the query on the SQLite database to get the row values 
c.execute('SELECT * FROM query WHERE pk_query = ?', [pk_query]) ### a list around argument is needed for a string longer then one
#c.execute('SELECT * FROM query WHERE pk_query = 10')

rc = c.fetchone()

# close connexion
conn.close()


In [10]:
# print(rc[2] +  "\n-----\n" + rc[4] +  "\n-----\n" +   rc[7]+  "\n\n\n------------------\n" +  rc[5] + "\n\n\n------------------\n")

In [11]:
### Execute the SPARQL query wrapped in the function in the library _sparql_functions.py_
# The first setting correspond to SPARQL Endpoint, the seconde to the query
qr = spqf.get_json_sparql_result(rc[4],rc[5])

<class 'dict'>


In [8]:
# Number of rows in the result
len(qr['results']['bindings'])

11132

In [9]:
# Inspect the first five rows
i = 0
for l in qr['results']['bindings']:
    if i < 5:
        print(l)
        i += 1

{'s': {'type': 'uri', 'value': 'http://data.bnf.fr/ark:/12148/cb12981404c#about'}, 'name': {'type': 'literal', 'value': 'Léon Garnier'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/99996033'}, 'birthDate': {'type': 'literal', 'value': '1836-11-10'}, 'deathDate': {'type': 'literal', 'value': '1901-05-06'}, 'bio': {'type': 'literal', 'value': "Juriste. - Administrateur et homme de lettres. - En poste à la Préfecture de la Seine. - Frère de l'explorateur Francis Garnier (1839-1873)"}}
{'s': {'type': 'uri', 'value': 'http://data.bnf.fr/ark:/12148/cb13484444m#about'}, 'name': {'type': 'literal', 'value': 'Gaston de Pawlowski'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/9999219'}, 'birthDate': {'type': 'literal', 'value': '1874-06-14'}, 'deathDate': {'type': 'literal', 'value': '1933-02-02'}, 'placeOfBirth': {'type': 'literal', 'value': 'Joigny (Yonne)'}, 'placeOfDeath': {'type': 'literal', 'value': 'Paris'}, 'bio': {'type': 'literal', 'value': 'Docteur en droit. - Crit

In [10]:
# Transform the result into a list with another fonction of the library
r_bnf = [l for l in spqf.sparql_result_to_list(qr)]
# r

In [11]:
# Inspect the first five of the list
#print(len(r_bnf))
#r_bnf[:5]

# Query economists and jurists from DBpedia

The query is the same as BnF Data, but we add the nationalities, who they don't on BnF Data.

In [77]:
### It's define the database ligne to use

### Query on the economists and jurists from DBpedia (expect the "lawyer recource")
pk_query = 2

# Connexion to the database
original_db = 'data/sparql_queries.db'
conn = sql.connect(original_db)

### It runs the query on the SQLite database to get the row values 
c.execute('SELECT * FROM query WHERE pk_query = ?', [pk_query]) ### a list around argument is needed for a string longer then one
#c.execute('SELECT * FROM query WHERE pk_query = 10')

rc = c.fetchone()

# close connexion
conn.close()


In [78]:
# Print the query
print(rc[2] +  "\n-----\n" + rc[4] +  "\n-----\n" +   rc[7]+  "\n\n\n------------------\n" +  rc[5] + "\n\n\n------------------\n")

This query takes informations about economists and jurists (except 'Lawyers') from DBpedia. The proprieties to take are the URI, the name, the birthdate, the deathdate, the place of birth and the place of death.
-----
https://dbpedia.org/sparql
-----
2021-04-22 10:13:00


------------------
PREFIX  dbo:  <http://dbpedia.org/ontology/>
PREFIX  dbp:  <http://dbpedia.org/property/>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  dbr:  <http://dbpedia.org/resource/>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT  ?s ?uri ?name ?birthDate ?deathDate ?placeOfBirth ?placeOfDeath
WHERE
  {   { ?s  a              dbo:Economist ;
            dbp:birthDate  ?birthDate
        FILTER ( xsd:date(?birthDate) > "1770-01-01"^^xsd:dateTime )
        OPTIONAL
          { ?s  owl:sameAs  ?uri
            FILTER regex(?uri, "viaf", "i")
          }
        OPTIONAL
          { ?s  dbp:name  ?name }
        OPTIONAL
          { ?s  dbp:b

In this query, we have made the choice to aggregate, by a UNION clause, several queries to maximise the results' number. Also we request the "economists" and the "jurists" in only one query. 

Obviously, we chose classes and instances directly related to our population, but also the "professor" instance, because some "economists" or "jurists" are in this instance (we have tried with and without them, and there more result when we use them). 

Also, we exclude all classes because they don't add more result, except the "Economist" class (we keep it) 

For exemple, we exclude the resource "personFunction" and the resource "Jurists" because they add no more data. Additionally, we keep only the "Professor" instance for the jurists (it returns result only for the jurists).

In [79]:
### Execute the SPARQL query wrapped in the function in the library _sparql_functions.py_
# The first setting correspond to SPARQL Endpoint, the seconde to the query
qr = spqf.get_json_sparql_result(rc[4],rc[5])

<class 'dict'>


In [80]:
# Number of rows in the result
len(qr['results']['bindings'])
# Unfortunately, DBpedia has set a row limit of 10,000, see https://wiki.dbpedia.org/public-sparql-endpoint.

7026

In [81]:
# Inspect the first five rows
i = 0
for l in qr['results']['bindings']:
    if i < 5:
        print(l)
        i += 1

{'s': {'type': 'uri', 'value': 'http://dbpedia.org/resource/António_de_Almeida_Santos'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/99921066'}, 'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'António de Almeida Santos'}, 'birthDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '1926-02-15'}, 'deathDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '2016-01-18'}}
{'s': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Carlos_Carvalhas'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/99826658'}, 'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Carlos Carvalhas'}, 'birthDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '1941-11-09'}, 'placeOfBirth': {'type': 'literal', 'value': 'São Pedro do Sul, Portugal'}}
{'s': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Anita_Augspurg'}, 'uri': {'type': 'uri', 'value': 'htt

In [82]:
# Transform the result into a list with another fonction of the library
r_dbp = [l for l in spqf.sparql_result_to_list(qr)]
# r

In [83]:
# Inspect the first five of the list
#print(len(r_dbp))
#r_dbp[:5]

####  "lawyer recource" from DBpedia

In [133]:
### It's define the database ligne to use

### Query on the "lawyer recource" from DBpedia

pk_query = 5

# Connexion to the database
original_db = 'data/sparql_queries.db'
conn = sql.connect(original_db)

### It runs the query on the SQLite database to get the row values 
c.execute('SELECT * FROM query WHERE pk_query = ?', [pk_query]) ### a list around argument is needed for a string longer then one
#c.execute('SELECT * FROM query WHERE pk_query = 10')

rc = c.fetchone()

# close connexion
conn.close()


In [137]:
# Print the query
print(rc[2] +  "\n-----\n" + rc[4] +  "\n-----\n" + rc[6]+  "\n-----\n" +  rc[7]+  "\n\n\n------------------\n" +  rc[5] + "\n\n\n------------------\n")

This query takes informations about 'lawyers' from DBpedia. The proprieties to take are the URI, the name, the birthdate, the deathdate, the place of birth and the place of death.
-----
https://dbpedia.org/sparql
-----
We set the 'lawyer' resource apart because it exceeds the 10,000 rows allowed by DBpedia. Therefore we treat it differently from the others 
-----
2021-04-26 11:54:11


------------------
PREFIX  dbo:  <http://dbpedia.org/ontology/>
PREFIX  dbp:  <http://dbpedia.org/property/>
PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
PREFIX  dbr:  <http://dbpedia.org/resource/>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT  ?s (SAMPLE(?uri1) AS ?uri) ?name ?birthDate ?deathDate (SAMPLE(?pb2) AS ?placeOfBirth) (SAMPLE(?pd2) AS ?placeOfDeath)
WHERE
  { ?s  ?p             dbr:Lawyer ;
        dbo:birthDate  ?birthDate
    FILTER ( xsd:date(?birthDate) > "1770-01-01"^^xsd:dateTime )
    FILTER ( xsd:date(?birthDate) != "-1

In [138]:
### Execute the SPARQL query wrapped in the function in the library _sparql_functions.py_
# The first setting correspond to SPARQL Endpoint, the seconde to the query
qr = spqf.get_json_sparql_result(rc[4],rc[5])

<class 'dict'>


In [139]:
# Number of rows in the result
len(qr['results']['bindings'])

8533

In [140]:
# Inspect the first five rows
i = 0
for l in qr['results']['bindings']:
    if i < 5:
        print(l)
        i += 1

{'s': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Lucjan_Wolanowski'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/54672620'}, 'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Lucjan Wolanowski'}, 'birthDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '1920-02-26'}, 'deathDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '2006-02-20'}, 'placeOfBirth': {'type': 'literal', 'value': 'Poland'}, 'placeOfDeath': {'type': 'literal', 'value': 'Poland'}}
{'s': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Henry_Rogers_Seager'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/41905280'}, 'name': {'type': 'literal', 'xml:lang': 'en', 'value': 'Henry Rogers Seager'}, 'birthDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#date', 'value': '1870-07-21'}, 'deathDate': {'type': 'typed-literal', 'datatype': 'http://www.w3.org/2001/XMLSchema#dat

In [141]:
# Transform the result into a list with another fonction of the library
r_dbp_l = [l for l in spqf.sparql_result_to_list(qr)]
# r

In [142]:
# Inspect the first five of the list
#print(len(r_dbp))
#r_dbp[:5]

# Query economists and jurists from Wikidata

The query is the same as the two others, but we add the BnF URI propriety. This propriety exists on Wikidata , we use it to realize in a simpler way the merge below.

In [19]:
### It's define the database ligne to use (here there correspond to the query from Wikidata)
pk_query = 3

# Connexion to the database
original_db = 'data/sparql_queries.db'
conn = sql.connect(original_db)

### It runs the query on the SQLite database to get the row values 
c.execute('SELECT * FROM query WHERE pk_query = ?', [pk_query]) ### a list around argument is needed for a string longer then one
#c.execute('SELECT * FROM query WHERE pk_query = 10')

rc = c.fetchone()

# close connexion
conn.close()


In [20]:
# Print the query
#print(rc[2] +  "\n-----\n" + rc[4] +  "\n-----\n" +   rc[7]+  "\n\n\n------------------\n" +  rc[5] + "\n\n\n------------------\n")

**SERVICE clause**

The use of the SERVICE clause is very important to display the property label. Also, in the SELECT, It must have "Label" used to work.

**BnF Propriety**

We get the BnF propriety to merge easier but it returns only this URI (ex: 134841632) but we need the complete URL (ex:http://data.bnf.fr/ark:/12148/cb134841632#about') to realise the merge so we use the concatenation.

In [21]:
### Execute the SPARQL query wrapped in the function in the library _sparql_functions.py_
# The first setting correspond to SPARQL Endpoint, the seconde to the query
qr = spqf.get_json_sparql_result(rc[4],rc[5])

<class 'dict'>


In [22]:
# Number of rows in the result
len(qr['results']['bindings'])

56524

In [23]:
# Inspect the first five rows
i = 0
for l in qr['results']['bindings']:
    if i < 5:
        print(l)
        i += 1

{'s': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q72568'}, 'name': {'xml:lang': 'en', 'type': 'literal', 'value': 'Alfred Lichtenstein'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/71423827'}, 'birthPlaceLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Berlin'}, 'birthDate': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime', 'type': 'literal', 'value': '1889-08-23T00:00:00Z'}, 'deathPlaceLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Somme'}, 'deathDate': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime', 'type': 'literal', 'value': '1914-09-25T00:00:00Z'}, 'uri_bnf': {'type': 'literal', 'value': 'http://data.bnf.fr/ark:/12148/cb12123568r#about'}}
{'s': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q72553'}, 'name': {'xml:lang': 'en', 'type': 'literal', 'value': 'Heinrich von Bülow'}, 'uri': {'type': 'uri', 'value': 'http://viaf.org/viaf/62342475'}, 'birthPlaceLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Schwe

In [24]:
# Transform the result into a list with another fonction of the library
r_wk = [l for l in spqf.sparql_result_to_list(qr)]
# r

In [25]:
# Inspect the first five of the list
print(len(r_wk))
r_wk[:5]

56524


[['http://www.wikidata.org/entity/Q72568',
  'Alfred Lichtenstein',
  '1889-08-23T00:00:00Z',
  '1914-09-25T00:00:00Z',
  'Berlin',
  'Somme',
  'http://viaf.org/viaf/71423827',
  'http://data.bnf.fr/ark:/12148/cb12123568r#about'],
 ['http://www.wikidata.org/entity/Q72553',
  'Heinrich von Bülow',
  '1792-09-16T00:00:00Z',
  '1846-02-06T00:00:00Z',
  'Schwerin',
  'Berlin',
  'http://viaf.org/viaf/62342475',
  ''],
 ['http://www.wikidata.org/entity/Q72628',
  'Alfred von Kiderlen-Waechter',
  '1852-07-10T00:00:00Z',
  '1912-12-30T00:00:00Z',
  'Stuttgart',
  'Stuttgart',
  'http://viaf.org/viaf/54958174',
  'http://data.bnf.fr/ark:/12148/cb135097503#about'],
 ['http://www.wikidata.org/entity/Q72535',
  'Rainer Rupp',
  '1945-09-21T00:00:00Z',
  '',
  'Saarlouis',
  '',
  'http://viaf.org/viaf/15698197',
  ''],
 ['http://www.wikidata.org/entity/Q77404',
  'Ingeborg Schwenzer',
  '1951-10-25T00:00:00Z',
  '',
  'Stuttgart',
  '',
  'http://viaf.org/viaf/91748910',
  'http://data.bnf.fr/a

-------------------------------

# Dataframes

The script below serves to change lists into a dataframe.
First, 

### BnF

In [75]:
df_bnf = pd.DataFrame(r_bnf, columns=['uri','name','viaf','birthDate','deathDate' , 'placeOfBirth','placeOfDeath','bio'])
print(len(df_bnf))
df_bnf.fillna('')

df_bnf[:10]

NameError: name 'r_bnf' is not defined

In [484]:
df_bnf.set_index("uri")

Unnamed: 0_level_0,name,viaf,birthDate,deathDate,placeOfBirth,placeOfDeath,bio
uri,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
http://data.bnf.fr/ark:/12148/cb12981404c#about,Léon Garnier,http://viaf.org/viaf/99996033,1836-11-10,1901-05-06,,,Juriste. - Administrateur et homme de lettres....
http://data.bnf.fr/ark:/12148/cb13484444m#about,Gaston de Pawlowski,http://viaf.org/viaf/9999219,1874-06-14,1933-02-02,Joigny (Yonne),Paris,Docteur en droit. - Critique littéraire et thé...
http://data.bnf.fr/ark:/12148/cb134841632#about,Jean-Michel Berton,http://viaf.org/viaf/9999131,1794-07-03,1845-10-20,Cahors (Lot),,"Écrivain politique, avocat à la Cour de cassat..."
http://data.bnf.fr/ark:/12148/cb13379520q#about,Emmanuel Mathieu,http://viaf.org/viaf/9995247,1852-07-19,19..,,,"Docteur en droit (Paris, 1873)"
http://data.bnf.fr/ark:/12148/cb13338312g#about,Josiah Henry Benton,http://viaf.org/viaf/9994322,1843,1917,,,Juriste. - Bibliophile
...,...,...,...,...,...,...,...
http://data.bnf.fr/ark:/12148/cb11475627b#about,Joan Mitchell,,1920-03-15,2014-02-13,,,Économiste. - Professeur d'économie de l'unive...
http://data.bnf.fr/ark:/12148/cb10562770v#about,Kazimierz Zimmermann,,1874,1925,Trzemeszno (Pologne),Cracovie (Pologne),Chanoine. - Economiste. - Recteur de l'Univers...
http://data.bnf.fr/ark:/12148/cb17701366b#about,ʿUmar ʿAzīz,,1949-02-18,2013-02-16,,,Chercheur et professeur d'économie. - Militant...
http://data.bnf.fr/ark:/12148/cb17877820g#about,John Davenport,,1904-09-11,1987-06-08,"Philadelphie (Pennsylvanie, États-Unis)","Red Bank (New Jersey, États-Unis)","Journaliste économiste. - Journaliste à : ""For..."


In [414]:
#count the number of similar URIs
gb_bnf=df_bnf.groupby(['uri_bnf']).size()
print(gb_bnf)

uri_bnf
http://data.bnf.fr/ark:/12148/cb10002423d#about    1
http://data.bnf.fr/ark:/12148/cb10011983m#about    1
http://data.bnf.fr/ark:/12148/cb10012188c#about    1
http://data.bnf.fr/ark:/12148/cb10023428s#about    1
http://data.bnf.fr/ark:/12148/cb10028057g#about    1
                                                  ..
http://data.bnf.fr/ark:/12148/cb17915446h#about    1
http://data.bnf.fr/ark:/12148/cb17915558s#about    1
http://data.bnf.fr/ark:/12148/cb17916121k#about    1
http://data.bnf.fr/ark:/12148/cb179164907#about    1
http://data.bnf.fr/ark:/12148/cb179165043#about    1
Length: 11039, dtype: int64


In [415]:
# count the number of similar URIs greater than one
sort_bnf=gb_bnf.sort_values(ascending=False) > 1
sum(sort_bnf)

93

### DBpedia

In [143]:
df_dbp = pd.DataFrame(r_dbp, columns=['uri_dbp', 'viaf_dbp', 'name_dbp','birthDate_dbp','deathDate_dbp', 'placeOfBirth_dbp','placeOfDeath_dbp'])
print(len(df_dbp))
df_dbp.head(20)

7026


Unnamed: 0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,http://dbpedia.org/resource/António_de_Almeida...,http://viaf.org/viaf/99921066,António de Almeida Santos,1926-02-15,2016-01-18,,
1,http://dbpedia.org/resource/Carlos_Carvalhas,http://viaf.org/viaf/99826658,Carlos Carvalhas,1941-11-09,,"São Pedro do Sul, Portugal",
2,http://dbpedia.org/resource/Anita_Augspurg,http://viaf.org/viaf/9976800,Anita Augspurg,1857-09-22,1943-12-20,,
3,http://dbpedia.org/resource/Mason_Gaffney,http://viaf.org/viaf/9960617,Mason Gaffney,1923-10-18,2020-07-16,,
4,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Lisbon,
5,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Portugal,
6,http://dbpedia.org/resource/Hermann_Heinrich_G...,http://viaf.org/viaf/9939728,Hermann Heinrich Gossen,1810-09-07,1858-02-13,Düren,Cologne
7,http://dbpedia.org/resource/Fernando_Teixeira_...,http://viaf.org/viaf/99275725,Fernando Teixeira dos Santos,1951-09-13,,"Maia, Portugal",
8,http://dbpedia.org/resource/Fernando_Teixeira_...,http://viaf.org/viaf/99275725,Fernando Teixeira dos Santos,1951-09-13,,Portugal,
9,http://dbpedia.org/resource/Gottfried_Haberler,http://viaf.org/viaf/99257315,Gottfried Haberler,1900-07-20,1995-05-06,Purkersdorf,"Washington, D.C."


In [145]:
# The script below is used to concatenate the DBpedia data, adding the resource 'lawyer' to the rest of the data 
df_dbp_l = pd.DataFrame(r_dbp_l, columns=['uri_dbp','viaf_dbp', 'name_dbp','birthDate_dbp','deathDate_dbp', 'placeOfBirth_dbp','placeOfDeath_dbp'])
print(len(df_dbp_l))
df_dbp_l[:10]

8533


Unnamed: 0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,http://dbpedia.org/resource/Lucjan_Wolanowski,http://viaf.org/viaf/54672620,Lucjan Wolanowski,1920-02-26,2006-02-20,Poland,Poland
1,http://dbpedia.org/resource/Henry_Rogers_Seager,http://viaf.org/viaf/41905280,Henry Rogers Seager,1870-07-21,1930-08-23,"Lansing, Michigan",Kiev
2,http://dbpedia.org/resource/Joseph_George_Rose...,http://viaf.org/viaf/49981696,Joseph George Rosengarten,1835-07-14,1921-01-14,Philadelphia,Philadelphia
3,http://dbpedia.org/resource/Dirk_Ballendorf,http://viaf.org/viaf/12443042,Dirk Ballendorf,1939-04-22,2013-02-04,Pennsylvania,Guam
4,http://dbpedia.org/resource/Konstantin_Rodzaevsky,http://viaf.org/viaf/8190795,Konstantin Rodzaevsky,1907-08-11,1946-08-30,Blagoveshchensk,Russian Soviet Federative Socialist Republic
5,http://dbpedia.org/resource/Presley_T._Glass,http://viaf.org/viaf/43885263,Presley Thornton Glass,1824-10-18,2018-10-02,"Halifax County, Virginia",Tennessee
6,http://dbpedia.org/resource/Price_Daniel_Jr.,http://viaf.org/viaf/63003481,Price Daniel Jr.,1941-06-08,1981-01-19,Austin,"Liberty County, Texas"
7,http://dbpedia.org/resource/Rafael_Uribe_Uribe,http://viaf.org/viaf/29567121,Rafael Uribe Uribe,1859-04-12,1914-10-15,"Valparaíso, Antioquia","Bogotá, D.C."
8,http://dbpedia.org/resource/Lyman_Duff,,Sir Lyman Duff,1865-01-07,1955-04-26,Ontario,Ontario
9,http://dbpedia.org/resource/Denison_Kitchel,,Denison Kitchel,1908-03-01,2002-10-10,"Bronxville, New York",Arizona


In [146]:
frames=[df_dbp, df_dbp_l]

In [149]:
df_dbpc=  pd.concat(frames)
print(len(df_dbp))
df_dbpc[:20]

15559


Unnamed: 0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,http://dbpedia.org/resource/António_de_Almeida...,http://viaf.org/viaf/99921066,António de Almeida Santos,1926-02-15,2016-01-18,,
1,http://dbpedia.org/resource/Carlos_Carvalhas,http://viaf.org/viaf/99826658,Carlos Carvalhas,1941-11-09,,"São Pedro do Sul, Portugal",
2,http://dbpedia.org/resource/Anita_Augspurg,http://viaf.org/viaf/9976800,Anita Augspurg,1857-09-22,1943-12-20,,
3,http://dbpedia.org/resource/Mason_Gaffney,http://viaf.org/viaf/9960617,Mason Gaffney,1923-10-18,2020-07-16,,
4,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Lisbon,
5,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Portugal,
6,http://dbpedia.org/resource/Hermann_Heinrich_G...,http://viaf.org/viaf/9939728,Hermann Heinrich Gossen,1810-09-07,1858-02-13,Düren,Cologne
7,http://dbpedia.org/resource/Fernando_Teixeira_...,http://viaf.org/viaf/99275725,Fernando Teixeira dos Santos,1951-09-13,,"Maia, Portugal",
8,http://dbpedia.org/resource/Fernando_Teixeira_...,http://viaf.org/viaf/99275725,Fernando Teixeira dos Santos,1951-09-13,,Portugal,
9,http://dbpedia.org/resource/Gottfried_Haberler,http://viaf.org/viaf/99257315,Gottfried Haberler,1900-07-20,1995-05-06,Purkersdorf,"Washington, D.C."


Unnamed: 0_level_0,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
uri_dbp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
http://dbpedia.org/resource/António_de_Almeida_Santos,http://viaf.org/viaf/99921066,António de Almeida Santos,1926-02-15,2016-01-18,,
http://dbpedia.org/resource/Carlos_Carvalhas,http://viaf.org/viaf/99826658,Carlos Carvalhas,1941-11-09,,"São Pedro do Sul, Portugal",
http://dbpedia.org/resource/Anita_Augspurg,http://viaf.org/viaf/9976800,Anita Augspurg,1857-09-22,1943-12-20,,
http://dbpedia.org/resource/Mason_Gaffney,http://viaf.org/viaf/9960617,Mason Gaffney,1923-10-18,2020-07-16,,
http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Lisbon,
...,...,...,...,...,...,...
http://dbpedia.org/resource/Thomas_Reilly,,Thomas F. Reilly,1942-02-14,,"Springfield, Massachusetts",
http://dbpedia.org/resource/Todd_McKenney_(politician),,Todd McKenney,1963-10-11,,"Akron, Ohio",
http://dbpedia.org/resource/Yaël_Braun-Pivet,,Yaël Braun-Pivet,1970-12-07,,"Nancy, France",
http://dbpedia.org/resource/Medard_Kalemani,,Dr. Medard M. C. Kalemani,1968-03-15,,,


In [107]:
#count the number of similar URIs
gb_dbp=df_dbp.groupby(['uri_dbp']).size()
print(gb_dbp)

uri_dbp
http://dbpedia.org/resource/A._Arthur_Giddon     1
http://dbpedia.org/resource/A._Brown_Moore       3
http://dbpedia.org/resource/A._Bruce_Bielaski    1
http://dbpedia.org/resource/A._C._Gibbs          1
http://dbpedia.org/resource/A._D._Roy            1
                                                ..
http://dbpedia.org/resource/Şirin_Tekeli         1
http://dbpedia.org/resource/Štefan_Osuský        1
http://dbpedia.org/resource/Štefan_Tiso          1
http://dbpedia.org/resource/Željko_Komšić        3
http://dbpedia.org/resource/Željko_Rohatinski    3
Length: 8575, dtype: int64


In [108]:
# count the number of similar URIs greater than one
sort_dbp=gb_dbp.sort_values(ascending=False) > 1
sum(sort_dbp)

3391

### Wikidata

In [1]:
df_wk= pd.DataFrame(r_wk, columns=['uri_wk', 'name_wk', 'birthDate_wk', "deathDate_wk", "placeOfBirth_wk", "placeOfDeath_wk",'viaf_wk', "uri_bnf_wk"])
print(len(df_wk))
df_wk.fillna('')
df_wk

NameError: name 'pd' is not defined

In [497]:
df_wk.set_index('uri_wk')

Unnamed: 0_level_0,name_wk,birthDate_wk,deathDate_wk,placeOfBirth_wk,placeOfDeath_wk,viaf_wk,uri_bnf_wk
uri_wk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
http://www.wikidata.org/entity/Q72568,Alfred Lichtenstein,1889-08-23T00:00:00Z,1914-09-25T00:00:00Z,Berlin,Somme,http://viaf.org/viaf/71423827,http://data.bnf.fr/ark:/12148/cb12123568r#about
http://www.wikidata.org/entity/Q72553,Heinrich von Bülow,1792-09-16T00:00:00Z,1846-02-06T00:00:00Z,Schwerin,Berlin,http://viaf.org/viaf/62342475,
http://www.wikidata.org/entity/Q72628,Alfred von Kiderlen-Waechter,1852-07-10T00:00:00Z,1912-12-30T00:00:00Z,Stuttgart,Stuttgart,http://viaf.org/viaf/54958174,http://data.bnf.fr/ark:/12148/cb135097503#about
http://www.wikidata.org/entity/Q72535,Rainer Rupp,1945-09-21T00:00:00Z,,Saarlouis,,http://viaf.org/viaf/15698197,
http://www.wikidata.org/entity/Q77404,Ingeborg Schwenzer,1951-10-25T00:00:00Z,,Stuttgart,,http://viaf.org/viaf/91748910,http://data.bnf.fr/ark:/12148/cb15067859n#about
...,...,...,...,...,...,...,...
http://www.wikidata.org/entity/Q96449608,,1942-03-21T00:00:00Z,,Hebron,,,
http://www.wikidata.org/entity/Q85322288,,1977-10-09T00:00:00Z,,Pryluky,,,
http://www.wikidata.org/entity/Q85884950,,1923-01-01T00:00:00Z,,,,,
http://www.wikidata.org/entity/Q92295431,,1876-12-01T00:00:00Z,1967-04-10T00:00:00Z,Saint Petersburg,Kharkiv,,


In [499]:
# We drop 'T00:00:00Z' to have the format 'YYYY-MM-DD' and improve the merge below.
df_wk["birthDate_wk"] = df_wk["birthDate_wk"].str.replace("T00:00:00Z", "")
df_wk["deathDate_wk"] = df_wk["deathDate_wk"].str.replace("T00:00:00Z", "")

#print(df_wk)

In [63]:
#count the number of similar URIs
gb_wk=df_wk.groupby(['uri_wk']).size()
print(gb_wk)

uri_wk
http://www.wikidata.org/entity/Q1000023     1
http://www.wikidata.org/entity/Q1000061     1
http://www.wikidata.org/entity/Q1000228     1
http://www.wikidata.org/entity/Q1000324     1
http://www.wikidata.org/entity/Q1000392     1
                                           ..
http://www.wikidata.org/entity/Q999600      1
http://www.wikidata.org/entity/Q99968375    1
http://www.wikidata.org/entity/Q99980497    1
http://www.wikidata.org/entity/Q999844      1
http://www.wikidata.org/entity/Q999985      1
Length: 53361, dtype: int64


In [64]:
# count the number of similar URIs greater than one
sort_wk=gb_wk.sort_values(ascending=False) > 1
sum(sort_wk)

2114

In [2]:
# Drop duplicates lines and keep first
#df_wk.drop_duplicates(subset ='uri_wk', keep = 'first', inplace=True)
#df_wk.fillna('')
#print(len(df_wk))
#df_wk.head()

In [47]:
# We use the dataframes in csv to don't run again requests
df_bnf.to_csv("df_bnf.csv")
df_dbp.to_csv("df_dbp.csv")

In [52]:
BnF_Data = pd.read_csv('df_bnf.csv')
Wikidata = pd.read_csv('df_wk.csv')
DBpedia = pd.read_csv('df_dbp.csv')

# Merge

## 1- Between Wikidata and BnF Data

First, we realize a merge between Wikidata and BnF Data using the BnF URIs in common.

In [491]:
merged_bnf_wk = pd.merge(df_bnf, df_wk , on='uri_bnf', how='inner', sort='uri_bnf')
print("")
print("This method merge", len(merged_bnf_wk), "mentions.")
print("")
merged_bnf_wk[:10]
merged_bnf_wk.loc[:,"uri_bnf","name_bnf","via_bnf","birthDate_bnf","deathDate_bnf","placeOfBirth_bnf","placeOfDeath_bnf","bio_bnf","uri_wk","name_wk","via_wk","birthDate_wk","deathDate_wk","placeOfBirth_wk","placeOfDeath_wk"]


KeyError: 'uri_bnf'

In [476]:
df_bnf[df_bnf.uri_bnf.isin(df_dbp.uri_bnf)]

AttributeError: 'DataFrame' object has no attribute 'uri_bnf'

In [488]:
frames =[df_bnf, df_wk]
result = pd.concat(frames, keys=["x", "y"])
result

Unnamed: 0,Unnamed: 1,uri,name,viaf,birthDate,deathDate,placeOfBirth,placeOfDeath,bio,uri_bnf
x,0,http://data.bnf.fr/ark:/12148/cb12981404c#about,Léon Garnier,http://viaf.org/viaf/99996033,1836-11-10,1901-05-06,,,Juriste. - Administrateur et homme de lettres....,
x,1,http://data.bnf.fr/ark:/12148/cb13484444m#about,Gaston de Pawlowski,http://viaf.org/viaf/9999219,1874-06-14,1933-02-02,Joigny (Yonne),Paris,Docteur en droit. - Critique littéraire et thé...,
x,2,http://data.bnf.fr/ark:/12148/cb134841632#about,Jean-Michel Berton,http://viaf.org/viaf/9999131,1794-07-03,1845-10-20,Cahors (Lot),,"Écrivain politique, avocat à la Cour de cassat...",
x,3,http://data.bnf.fr/ark:/12148/cb13379520q#about,Emmanuel Mathieu,http://viaf.org/viaf/9995247,1852-07-19,19..,,,"Docteur en droit (Paris, 1873)",
x,4,http://data.bnf.fr/ark:/12148/cb13338312g#about,Josiah Henry Benton,http://viaf.org/viaf/9994322,1843,1917,,,Juriste. - Bibliophile,
...,...,...,...,...,...,...,...,...,...,...
y,56519,http://www.wikidata.org/entity/Q96449608,,,1942-03-21,,Hebron,,,
y,56520,http://www.wikidata.org/entity/Q85322288,,,1977-10-09,,Pryluky,,,
y,56521,http://www.wikidata.org/entity/Q85884950,,,1923-01-01,,,,,
y,56522,http://www.wikidata.org/entity/Q92295431,,,1876-12-01,1967-04-10,Saint Petersburg,Kharkiv,,


In [None]:
# Create an id to Bnf Data dataframe
# We felt it was necessary to create new id to realise the comparaisons between dataframes (cf. below for understand).
df_bnf["id_bnf"] = df_bnf.index + 0
df_bnf= pd.DataFrame(df_bnf, columns=['uri_bnf', 'viaf_bnf', 'name_bnf', 'Sname','birthDate_bnf','deathDate_bnf' , 'placeOfBirth_bnf','placeOfDeath_bnf','bio_bnf'],index=df_bnf["id_bnf"])
df_bnf[-20:]

In [217]:
# Create an id to DBpedia dataframe
df_dbp["id_dbp"] = df_dbp.index + 0
df_dbp= pd.DataFrame(df_dbp, columns=['uri_dbp', 'viaf_dbp', 'name_dbp','birthDate_dbp','deathDate_dbp' , 'placeOfBirth_dbp','placeOfDeath_dbp'],index=df_dbp["id_dbp"])
print(len(df_dbp))
df_dbp[-20:]

10000


Unnamed: 0_level_0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
id_dbp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
9980,http://dbpedia.org/resource/Terry_Haskins,,,1955-01-31,2000-10-24,,
9981,http://dbpedia.org/resource/Bernard_M._L._Ernst,,Bernard M. L. Ernst,1879-03-17,--11-28,,
9982,http://dbpedia.org/resource/Miha_Krek,,Miha Krek,1897-09-28,1969-11-18,,
9983,http://dbpedia.org/resource/Leon_Despres,,,1908-02-02,2009-05-06,,
9984,http://dbpedia.org/resource/Benito_Pabón_y_Suá...,,Benito Pabón y Suárez de Urbina,1895-03-25,,Seville,"Colón, Panama"
9985,http://dbpedia.org/resource/John_McCarthy_(Aus...,,John McCarthy,1942-11-29,,United States,
9986,http://dbpedia.org/resource/John_McCarthy_(Aus...,,John McCarthy,1942-11-29,,"Washington, D.C.",
9987,http://dbpedia.org/resource/Dominique_Warluzel,,Dominique Warluzel,1957-05-29,,"Pau, Pyrénées-Atlantiques",
9988,http://dbpedia.org/resource/Hauwa_Ibrahim,,Hauwa Ibrahim,1968-01-20,,"Gombe, Gombe",
9989,http://dbpedia.org/resource/Beatriz_Corredor,,Beatriz Corredor,1968-07-01,,Madrid,


 ## Recordlinkage

In [127]:
import recordlinkage

To have more details and settings, you can read the documentation of the library: https://recordlinkage.readthedocs.io/en/latest/about.html .

And this article, https://pbpython.com/record-linking.html, explains how the library works and compare this method with a another, Fuzzymatcher. Unfortunalty, he doesn't explain how he does after comparing to merge the two dataframes.

### Dataframe of DBpedia with itself

In [150]:
# This first step is essential because it indexes the dataframes (or a dataframe with itself) to then compare them 
# Here, we choose the method 'sortedneighbourhood' for this speed of execution and the number of indexed values
# We use the dataframe with itself to spot duplicates who don't have the same URI.
indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='name_dbp')
candidates = indexer.index(df_dbpc)
print(len(candidates))

ValueError: index of DataFrame is not unique

In [291]:
# The step is the most important, we compare the values of the variables with the other values of the same variable.
# We choose to compare the variable 'birthDate', 'deathDate' because it is a strict format.
# Also, we add 'name' to match date variables and check more easily.
# Finally, the "uri" variable to check if the other variables match a "uri".
compare = recordlinkage.Compare()
compare.exact('birthDate_dbp',
            'birthDate_dbp',
            label='birthDate_dbp')
compare.exact('deathDate_dbp',
            'deathDate_dbp',
            label='deathDate_dbp')
compare.string('name_dbp',
            'name_dbp',
            label='name_dbp')
compare.exact('uri_dbp',
            'uri_dbp',
            label='uri_dbp')
features = compare.compute(candidates, df_dbp)

In [292]:
# display scores for matchings
features.sum(axis=1).value_counts().sort_index(ascending=False)

4.000000    12076
3.986667        1
3.966667       16
3.950000        9
3.923077       36
            ...  
0.052632        1
0.045455        4
0.037037        3
0.030303       12
0.000000      336
Length: 421, dtype: int64

In [295]:
# display scores for matchings only if they are superior to '1.9'
potential_matches = features[features.sum(axis=1) >1.9].reset_index()

In [296]:
# Dataframes of scores of each variables with the cumulative score
# It displays only the score inferior to '1' (it means no matches between both URIs)
potential_matches['Score'] = potential_matches.loc[:, 'birthDate_dbp':'uri_dbp'].sum(axis=1)
rslt_df = potential_matches[potential_matches['uri_dbp'] < 1]
rslt_df
# It seems none person doesn't have many mentions 
# because if we compare the three first variables with the 'uri' (with a score superior at '2'), no result arises. 

Unnamed: 0,level_0,level_1,birthDate_dbp,deathDate_dbp,name_dbp,uri_dbp,Score
61,5984,2326,0,1,0.909091,0,1.909091
62,5984,2327,0,1,0.909091,0,1.909091
2922,1842,1754,0,1,1.0,0,2.0
2923,3467,1754,0,1,1.0,0,2.0
2924,3467,1842,0,1,1.0,0,2.0
2925,3831,1754,0,1,1.0,0,2.0
2926,3831,1842,0,1,1.0,0,2.0
2927,3831,3467,0,1,1.0,0,2.0


In [216]:
df_dbp.loc[7750,:]

uri_dbp             http://dbpedia.org/resource/Luís_Marques_Mendes
viaf_dbp                                                           
name_dbp                                        Luís Marques Mendes
birthDate_dbp                                            1957-11-05
deathDate_dbp                                                      
placeOfBirth_dbp                                           Portugal
placeOfDeath_dbp                                                   
nationality_dbp                                                    
Name: 7750, dtype: object

In [217]:
df_dbp.loc[7748,:]

uri_dbp             http://dbpedia.org/resource/Luís_Marques_Guedes
viaf_dbp                                                           
name_dbp                                        Luís Marques Guedes
birthDate_dbp                                            1957-08-25
deathDate_dbp                                                      
placeOfBirth_dbp                                           Portugal
placeOfDeath_dbp                                                   
nationality_dbp                                                    
Name: 7748, dtype: object

In [3]:
### !!!!! It doesn't works for the moment !!!!! ###


#df_dbp['dbp1_Name_Lookup'] = df_dbp[[
#   'name_dbp', 'birthDate_dbp','deathDate_dbp','placeOfBirth_dbp' ,'placeOfDeath_dbp' 
#]].apply(lambda x: '|'.join(x), axis=1)
#
#df_dbp['dbp2_Name_Lookup'] = df_dbp[[
#  'name_dbp', 'birthDate_dbp','deathDate_dbp','placeOfBirth_dbp' ,'placeOfDeath_dbp' 
#]].apply(lambda x: '|'.join(x), axis=1)

#df_dbp = df_dbp[['dbp1_Name_Lookup']].reset_index()
#df_dbp = df_dbp[['dbp2_Name_Lookup']].reset_index()

### Dataframe of Wikidata with itself

In [249]:
# Create an id to Wikidata dataframe
df_wk["id_wk"] = df_wk.index + 0
df_wk= pd.DataFrame(df_wk, columns=['uri_wk', 'viaf_wk', 'name_wk','birthDate_wk','deathDate_wk' , 'placeOfBirth_wk','placeOfDeath_wk'],index=df_wk["id_wk"])
print(len(df_wk))
df_dbp[-20:]

56524


Unnamed: 0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp,nationality_dbp
9980,http://dbpedia.org/resource/Bautista_Saavedra,,Bautista Saavedra,1870-08-30,1939-05-01,Bolivia,Santiago,Bolivians
9981,http://dbpedia.org/resource/Bautista_Saavedra,,Bautista Saavedra,1870-08-30,1939-05-01,Sorata,Santiago,Bolivians
9982,http://dbpedia.org/resource/Bautista_Saavedra,,Bautista Saavedra,1870-08-30,1939-05-01,La Paz Department (Bolivia),Santiago,Bolivians
9983,http://dbpedia.org/resource/Belisario_Betancur,,Belisario Betancur Cuartas,1923-02-04,2018-12-07,,,
9984,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,Oaxaca,Mexico,Mexican nationality law
9985,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,San Pablo Guelatao,Mexico,Mexican nationality law
9986,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,New Spain,Mexico,Mexican nationality law
9987,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,Oaxaca,Mexico City,Mexican nationality law
9988,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,San Pablo Guelatao,Mexico City,Mexican nationality law
9989,http://dbpedia.org/resource/Benito_Juárez,,Benito Juárez,1806-03-21,1872-07-18,New Spain,Mexico City,Mexican nationality law


In [250]:
indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='name_wk')
candidates = indexer.index(df_wk)
print(len(candidates))

3677385


In [251]:
compare = recordlinkage.Compare()
compare.exact('birthDate_wk',
            'birthDate_wk',
            label='birthDate_wk')
compare.exact('deathDate_wk',
            'deathDate_wk',
            label='deathDate_wk')
compare.string('name_wk',
            'name_wk',
            label='name_wk')
compare.exact('uri_wk',
            'uri_wk',
            label='uri_wk')
features = compare.compute(candidates, df_wk)

In [242]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

4.000000       1286
3.000000       3103
2.956522          1
2.954545          1
2.947368          1
             ...   
0.058824          3
0.055556          1
0.052632          2
0.047619          1
0.000000    2513444
Length: 783, dtype: int64

In [281]:
potential_matches = features[features.sum(axis=1) >=2.9].reset_index()

In [287]:
potential_matches['Score'] = potential_matches.loc[:, 'birthDate_wk':'uri_wk'].sum(axis=1)
rslt_df = potential_matches[potential_matches['uri_wk'] < 1]
print(len(rslt_df),"mentions of the same persons don't have the same URI.")
print("")
rslt_df.sort_values(by='Score')

12 mentions of the same persons don't have the same URI.



Unnamed: 0,id_wk_1,id_wk_2,birthDate_wk,deathDate_wk,name_wk,uri_wk,Score
4393,45716,25511,1,1,0.9,0,2.9
2,44371,42312,1,1,0.931034,0,2.931034
1,44369,42300,1,1,0.933333,0,2.933333
3,53177,46361,1,1,0.941176,0,2.941176
4395,52254,48210,1,1,0.947368,0,2.947368
4394,51804,44349,1,1,0.954545,0,2.954545
0,19619,19618,1,1,0.956522,0,2.956522
1872,37257,18254,1,1,1.0,0,3.0
1984,37921,20048,1,1,1.0,0,3.0
2236,23067,23063,1,1,1.0,0,3.0


We find twelve mentions of people who don't have the same URI. 

To check, we displays two lines below corresponding to the same person but with a different URI.

In [279]:
df_wk.loc[45716,:]

uri_wk             http://www.wikidata.org/entity/Q94808707
viaf_wk                       http://viaf.org/viaf/17974557
name_wk                                Konstantinos Rhalles
birthDate_wk                                     1867-01-01
deathDate_wk                                     1942-01-01
placeOfBirth_wk                                            
placeOfDeath_wk                                            
Name: 45716, dtype: object

In [280]:
df_wk.loc[25511,:]

uri_wk             http://www.wikidata.org/entity/Q12879698
viaf_wk                       http://viaf.org/viaf/64413539
name_wk                                 Konstantinos Rallis
birthDate_wk                                     1867-01-01
deathDate_wk                                     1942-01-01
placeOfBirth_wk                                      Athens
placeOfDeath_wk                                      Athens
Name: 25511, dtype: object

### Dataframe of BnF Data with itself

In [317]:
indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='name_bnf')
candidates = indexer.index(df_bnf)
print(len(candidates))

13011


In [318]:
compare = recordlinkage.Compare()
compare.exact('birthDate_bnf',
            'birthDate_bnf',
            label='birthDate_bnf')
compare.exact('deathDate_bnf',
            'deathDate_bnf',
            label='deathDate_bnf')
compare.string('name_bnf',
            'name_bnf',
            label='name_bnf')
compare.exact('uri_bnf',
            'uri_bnf',
            label='uri_bnf')
features = compare.compute(candidates, df_bnf)

In [319]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

4.000000      92
3.000000       4
2.950000       1
2.944444       1
2.928571       1
            ... 
0.066667       3
0.062500       1
0.058824       2
0.050000       1
0.000000    1595
Length: 395, dtype: int64

In [326]:
potential_matches = features[features.sum(axis=1) >=2.8].reset_index()

In [328]:
potential_matches['Score'] = potential_matches.loc[:, 'birthDate_bnf':'deathDate_bnf'].sum(axis=1)
rslt_df = potential_matches[potential_matches['uri_bnf'] < 1]
print(len(rslt_df),"mentions of the same persons don't have the same URI.")
rslt_df.sort_values(by='name_bnf')

8 mentions of the same persons don't have the same URI.


Unnamed: 0,level_0,level_1,birthDate_bnf,deathDate_bnf,name_bnf,uri_bnf,Score
2,7900,4417,1,1,0.8,0,2
99,3606,2288,1,1,0.882353,0,2
0,4876,1693,1,1,0.928571,0,2
1,4909,2744,1,1,0.944444,0,2
100,10626,1741,1,1,0.95,0,2
15,10954,1166,1,1,1.0,0,2
35,10232,3140,1,1,1.0,0,2
83,11079,8022,1,1,1.0,0,2


We find eight mentions of people who don't have the same URI. 

To check, we displays two lines below corresponding to the same person but with a different URI (this is an exemple).

In [331]:
df_bnf.loc[1741,:]

uri_bnf             http://data.bnf.fr/ark:/12148/cb149780599#about
name_bnf                                       Eugenio Montero Rios
viaf_bnf                              http://viaf.org/viaf/66720534
birthDate_bnf                                            1832-11-13
deathDate_bnf                                            1914-05-12
placeOfBirth_bnf                                                   
placeOfDeath_bnf                                                   
bio_bnf                                  Homme politique et juriste
Name: 1741, dtype: object

In [332]:
df_bnf.loc[10626,:]

uri_bnf             http://data.bnf.fr/ark:/12148/cb14623832q#about
name_bnf                                       Eugenio Montero Ríos
viaf_bnf                                                           
birthDate_bnf                                            1832-11-13
deathDate_bnf                                            1914-05-12
placeOfBirth_bnf                                                   
placeOfDeath_bnf                                                   
bio_bnf                                  Homme politique. - Juriste
Name: 10626, dtype: object

In [341]:
df_bnf_merge = potential_matches.merge(df_bnf, how='left')

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

--------------------
## Matched between BnF Data and Wikidata

After performing it between a dataframe and itself, we can realise the same thing between two dataframes. 

With this method, we can find out if the same person records are found in two different databases. 

In [212]:
indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='name_bnf', right_on='name_wk')
candidates = indexer.index(df_bnf, df_wk)
print(len(candidates))

NameError: name 'df_wk' is not defined

indexer = recordlinkage.Index()
indexer.block(left_on=['name_bnf', 'uri_bnf'],
              right_on=['name_dbp', 'uri_dbp'])
pairs = indexer.index(df_bnf, df_dbp)


candidates = indexer.index(BnF_Data, DBpedia)
print(len(candidates))

In [62]:
# We choose the compare the 'name', the 'birthDate' and deathDate' for the same reasons mentioned above
compare = recordlinkage.Compare()
compare.string('name_bnf',
            'name_wk',
            method='jarowinkler',
            threshold=0.85,
            label='name_bnf_wk')
compare.exact('birthDate_bnf',
            'birthDate_wk',
            label='birthDate_bnf_wk')
compare.exact('deathDate_bnf',
            'deathDate_wk',
            label='deathDate_bnf_wk')
features = compare.compute(candidates, df_bnf, df_wk)

In [63]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

3.0     813
2.0     424
1.0    7382
0.0    6889
dtype: int64

In [73]:
potential_matches = features[features.sum(axis=1) >1].reset_index()
potential_matches['Score'] = potential_matches.loc[:, 'name_bnf_wk':'deathDate_bnf_wk'].sum(axis=1)
potential_matches

Unnamed: 0,level_0,level_1,name_bnf_wk,birthDate_bnf_wk,deathDate_bnf_wk,Score
0,274,6770,1.0,1,1,3.0
1,529,13944,1.0,1,1,3.0
2,529,13947,1.0,1,0,2.0
3,529,13952,1.0,0,1,2.0
4,696,802,1.0,1,1,3.0
...,...,...,...,...,...,...
1232,10220,14340,1.0,1,0,2.0
1233,10373,44984,1.0,1,1,3.0
1234,11083,16132,1.0,1,1,3.0
1235,11116,53736,1.0,1,1,3.0


In [74]:
df_wk.loc[274,:]

index                                                274
uri_wk             http://www.wikidata.org/entity/Q88014
name_wk                                   Gustav Rümelin
dateBirth_wk                                  1848-05-01
dateDeath_wk                                  1907-01-11
placeOfBirth_wk                                Nürtingen
placeOfDeath_wk                     Freiburg im Breisgau
viaf                       http://viaf.org/viaf/95543689
uri_bnf                                              NaN
Name: 274, dtype: object

In [75]:
df_bnf.loc[6770,:]

uri_bnf               http://data.bnf.fr/ark:/12148/cb12163133t#about
Unnamed: 0                                                       6770
name_bnf                                        Edmond Eugène Moeller
viaf                                    http://viaf.org/viaf/22181280
dateBirth_bnf                                                    1909
dateDeath_bnf                                              1991-04-03
placeOfBirth_bnf                                                  NaN
placeOfDeath_bnf                                                  NaN
bio_bnf             Bénédictin de l'Abbaye du Mont-César (Louvain)...
Name: 6770, dtype: object

In [76]:
# We don't understand why, but to work, the script below must have all variables as strings.
# And this is the only way, we can do it.
df_wk['name_wk']=df_wk['name_wk'].astype(str)
df_bnf['name_bnf']=df_bnf['name_bnf'].astype(str)

Wikidata['viaf']=Wikidata['viaf'].astype(str)
df_bnf['viaf']=df_bnf['viaf'].astype(str)

df_wk['uri_wk']=df_wk['uri_wk'].astype(str)
df_bnf['uri_bnf']=df_bnf['uri_bnf'].astype(str)

df_wk['placeOfBirth_wk']=df_wk['placeOfBirth_wk'].astype(str)
df_bnf['placeOfBirth_bnf']=df_bnf['placeOfBirth_bnf'].astype(str)

df_wk['placeOfDeath_wk']=df_wk['placeOfDeath_wk'].astype(str)
df_bnf['placeOfDeath_bnf']=df_bnf['placeOfDeath_bnf'].astype(str)

df_bnf['deathDate_bnf']=df_bnf['deathDate_bnf'].astype(str)
df_wk['deathDate_wk']=df_wk['deathDate_wk'].astype(str)

df_bnf['birthDate_bnf']=df_bnf['birthDate_bnf'].astype(str)
df_wk['birthDate_wk']=df_wk['birthDate_wk'].astype(str)

df_bnf['bio_bnf']=df_bnf['bio_bnf'].astype(str)

In [77]:
# This makes it easy to compare matches 
df_wk['Wikidata_Name_Lookup'] = df_wk[[
   'name_wk', 'dateBirth_wk','dateDeath_wk','placeOfBirth_wk' ,'placeOfDeath_wk' 
]].apply(lambda x: '|'.join(x), axis=1)

BnF_Data['bnf_Name_Lookup'] = BnF_Data[[
   'name_bnf','dateBirth_bnf','dateDeath_bnf','placeOfBirth_bnf','placeOfDeath_bnf','bio_bnf' 
]].apply(lambda x: '|'.join(x), axis=1)

Wikidata_lookup = df_wk[['Wikidata_Name_Lookup']].reset_index()
BnF_Data_lookup = BnF_Data[['bnf_Name_Lookup']].reset_index()

In [78]:
Wikidata_merge = potential_matches.merge(Wikidata_lookup, how='left')

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

In [79]:
final_wk_bnf_merge = Wikidata_merge.merge(BnF_Data_lookup, how='left')

NameError: name 'Wikidata_merge' is not defined

In [22]:
cols = ['id_bnf', 'id_wk', 'Score',
        'bnf_Name_Lookup', 'Wikidata_Name_Lookup']
final_wk_bnf_merge=final_wk_bnf_merge[cols].sort_values(by=[ 'Score'], ascending=True)
print(len(final_wk_bnf_merge))
final_wk_bnf_merge[:50]

1016


Unnamed: 0,id_bnf,id_wk,Score,bnf_Name_Lookup,Wikidata_Name_Lookup
507,5492,27548,2.0,Yvon Linant de Bellefonds|1904-08-27|1994-12-2...,Yvon Linant de Bellefonds|1904-01-01|1994-12-2...
520,5818,26807,2.0,Federico Patellani|1911-12-01|1977-02-10|Monza...,Federico Patellani|1911-12-01|1977-01-01|Monza...
522,5887,30520,2.0,Charles Lyon-Caen|1843-12-25|1935-09-17|Paris ...,Charles Lyon-Caen|1843-12-25|1935-12-17|Paris|...
523,5889,2375,2.0,Charles-Frédéric Rau|1803-08-03|1877-04-10|Sav...,Charles-Frédéric Rau|1803-08-03|1877-04-1|Boux...
525,5897,10622,2.0,Robert Walter|1931-01-30|2010-12-25|nan|nan|Ju...,Robert Walter|1931-01-3|2010-12-25|Vienna|Vienna
526,5899,26491,2.0,Salvatore Riccobono|1864-01-31|1958-04-05|San ...,Salvatore Riccobono|1864-01-31|1958-04-12|San ...
527,5901,52871,2.0,Roman Herzog|1934-04-05|2017-01-10|Landshut (B...,Roman Herzog|1934-04-05|2017-01-1|Landshut|Bad...
518,5729,97,2.0,Hugo Sinzheimer|1875-04-12|1945-09-16|Worms (R...,Hugo Sinzheimer|1875-01-01|1945-09-16|Worms|Bl...
528,5905,3888,2.0,Ottó Bihari|1921-01-11|1983-01-04|Timis̡oara (...,Ottó Bihari|1921-01-13|1983-01-04|Timișoara|Pécs
533,5976,41022,2.0,Otto Weissel|1875-08-31|1955|Vienne|nan|Avocat,Otto Weissel|1875-08-31|1955-12-23|Vienna|Geneva


In [23]:
# We realize a merge between Wikidata and BnF Data using the BnF URIs in common
merged_final_bnf_wk = merged_bnf_wk.merge(final_wk_bnf_merge, on=['id_wk', 'id_bnf'], how='outer')
print("")
print("This method merge", len(merged_final_bnf_wk), "persons.")
print("")
print(len(merged_final_bnf_wk)-len(merged_bnf_wk), "persons have not a BnF URI in common on Wikidata.")
print("")
cols=["id_bnf","id_wk","uri_bnf","uri_wk", "name_bnf", "dateBirth_bnf","dateDeath_bnf","placeOfBirth_bnf","placeOfDeath_bnf","bio_bnf","name_wk","dateBirth_wk","dateDeath_wk","placeOfBirth_wk","placeOfDeath_wk"]
merged_final_bnf_wk[cols][:10]


This method merge 1597 persons.

39 persons have not a BnF URI in common on Wikidata.



Unnamed: 0,id_bnf,id_wk,uri_bnf,uri_wk,name_bnf,dateBirth_bnf,dateDeath_bnf,placeOfBirth_bnf,placeOfDeath_bnf,bio_bnf,name_wk,dateBirth_wk,dateDeath_wk,placeOfBirth_wk,placeOfDeath_wk
0,1713,6505,http://data.bnf.fr/ark:/12148/cb10071436z#about,http://www.wikidata.org/entity/Q1360518,Ernst Walz,1859-07-18,1941-12-18,"Heidelberg (Bade-Wurtemberg, Allemagne)","Heidelberg (Bade-Wurtemberg, Allemagne)","Maire de Heidelberg, Allemagne (1886-). - A ét...",Ernst Walz,1859-07-19,1941-12-18,Heidelberg,Heidelberg
1,6594,28290,http://data.bnf.fr/ark:/12148/cb101827728#about,http://www.wikidata.org/entity/Q15177628,Adrien Calmètes,1800-09-19,1871-02-27,Figueras (Espagne),Montpellier (Hérault),Magistrat. - Président de chambre à Montpellie...,Adrien Calmètes,1800-09-19,1871-02-27,Figueres,Montpellier
2,6600,44446,http://data.bnf.fr/ark:/12148/cb10190862z#about,http://www.wikidata.org/entity/Q56646471,Béla Kun,1861-04-24,1934-09-19,"Sátoraljaújhelyi, Hongrie","Budapest, Hongrie",Juriste. - Conseiller au Ministère de la Justi...,Kun Béla,1861-04-24,1934-09-19,Sátoraljaújhely,Budapest District VII
3,6929,6082,http://data.bnf.fr/ark:/12148/cb102042826#about,http://www.wikidata.org/entity/Q1214240,Imre Nagy,1822-07-01,1894-05-05,"Németkeresztúr, aujourd'hui Deutschkreutz, Aut...",Budapest,Historien. - Juriste. - Académicien,Imre Nagy,1822-06-01,1894-05-05,Deutschkreutz,Budapest
4,6931,46637,http://data.bnf.fr/ark:/12148/cb10207440j#about,http://www.wikidata.org/entity/Q94851244,Wilhelm Gustav Karl Starke,1824-02-26,1903-03-10,"Lubán (Prusse, aujourd'hui Pologne)",Berlin (Allemagne),Juriste. - Parlementaire,Wilhelm Gustav Karl Starke,1824-02-26,1903-03-09,Lubań,Berlin
5,7694,2149,http://data.bnf.fr/ark:/12148/cb102101229#about,http://www.wikidata.org/entity/Q343248,Karel Baxa,1863-06-23,1938-01-05,,,Docteur en droit. - Avocat. - Membre du parti ...,Karel Baxa,1862-06-24,1938-01-05,Sedlčany,Prague
6,3976,28075,http://data.bnf.fr/ark:/12148/cb10217219c#about,http://www.wikidata.org/entity/Q15830136,Jan Heller,1848-11-13,1932-03-20,Vranov u Rokycan (République tchèque),Prague (République tchèque),"Juriste, publiciste. - Rédacteur de la revue j...",Jan Heller,1848-11-13,1932-03-2,,Prague
7,1011,15408,http://data.bnf.fr/ark:/12148/cb102252805#about,http://www.wikidata.org/entity/Q3840074,Luigi Rava,1860-12-01,1938-05-12,"Ravenne, Italie",Rome,"Juriste et historien, professeur de philosophi...",Luigi Rava,1860-11-29,1938-05-12,Ravenna,Rome
8,7407,45126,http://data.bnf.fr/ark:/12148/cb10226211k#about,http://www.wikidata.org/entity/Q57202321,André Roux,1893-10-10,19..,Privas (Ardèche),,Magistrat. - A été avocat à Montpellier (1915-...,André Louis Roux,1893-10-1,,Privas,
9,10383,29226,http://data.bnf.fr/ark:/12148/cb10226945t#about,http://www.wikidata.org/entity/Q13411514,Arveds Švābe,1888-05-25,1959-08-20,,,"Historien, juriste et auteur",Arveds Švābe,1888-05-25,1959-08-2,Q16362410,Stockholm


## Match between BnF Data and DBpedia

indexer = recordlinkage.Index()
indexer.full()

In [218]:
indexer = recordlinkage.Index()
indexer.block(left_on='name_bnf', right_on='name_dbp')
candidates = indexer.index(df_bnf, df_dbp)
print(len(candidates))

28614


In [219]:
compare = recordlinkage.Compare()
compare.string('name_bnf',
            'name_dbp',
            method='jarowinkler',
            threshold=0.85,
            label='name_bnf_dbp')
compare.exact('birthDate_bnf',
            'birthDate_dbp',
            label='birthDate_bnf_dbp')
compare.exact('deathDate_bnf',
            'deathDate_dbp',
            label='deathDate_bnf_dbp')
features = compare.compute(candidates, df_bnf, df_dbp)

In [220]:
features

Unnamed: 0_level_0,Unnamed: 1_level_0,name_bnf_dbp,birthDate_bnf_dbp,deathDate_bnf_dbp
id_bnf,id_dbp,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
20,2,1.0,1,1
47,5,0.0,0,0
47,6,0.0,0,0
47,32,0.0,0,0
47,33,0.0,0,0
...,...,...,...,...
10651,944,1.0,1,0
10651,945,1.0,1,0
10651,946,1.0,1,0
10651,947,1.0,1,0


In [221]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

3.0      269
2.0       73
1.0      104
0.0    28168
dtype: int64

In [222]:

potential_matches = features[features.sum(axis=1) > 1].reset_index()
potential_matches['Score'] = potential_matches.loc[:, 'name_bnf_dbp':'deathDate_bnf_dbp'].sum(axis=1)
potential_matches
# Display the matches up to 

Unnamed: 0,id_bnf,id_dbp,name_bnf_dbp,birthDate_bnf_dbp,deathDate_bnf_dbp,Score
0,20,2,1.0,1,1,3.0
1,85,10,1.0,1,1,3.0
2,85,11,1.0,1,1,3.0
3,85,12,1.0,1,1,3.0
4,85,13,1.0,1,1,3.0
...,...,...,...,...,...,...
337,10651,944,1.0,1,0,2.0
338,10651,945,1.0,1,0,2.0
339,10651,946,1.0,1,0,2.0
340,10651,947,1.0,1,0,2.0


In [223]:
df_bnf.loc[85,:]

uri_bnf               http://data.bnf.fr/ark:/12148/cb123010784#about
viaf_bnf                                 http://viaf.org/viaf/9914155
name_bnf                                    Vittorio Emanuele Orlando
Sname                                                             NaN
birthDate_bnf                                              1860-05-19
deathDate_bnf                                              1952-12-01
placeOfBirth_bnf                                     Palerme (Italie)
placeOfDeath_bnf                                                 Rome
bio_bnf             Juriste, spécialiste de droit public. - Présid...
Name: 85, dtype: object

In [224]:
df_dbp.loc[10,:]

uri_dbp             http://dbpedia.org/resource/Vittorio_Emanuele_...
viaf_dbp                                 http://viaf.org/viaf/9914155
name_dbp                                    Vittorio Emanuele Orlando
birthDate_dbp                                              1860-05-19
deathDate_dbp                                              1952-12-01
placeOfBirth_dbp                          Kingdom of the Two Sicilies
placeOfDeath_dbp                                                Italy
Name: 10, dtype: object

In [226]:
bnf_merge = potential_matches.merge(df_bnf, how='left', on='id_bnf')

In [227]:
final_dbp_wk_merge = bnf_merge .merge(df_dbp, how='left', on='id_dbp')

In [229]:
BnF_merge = pd.merge(potential_matches,df_bnf, how='left', on="id_bnf")

In [230]:
final_bnf_dbp_merge = BnF_merge.merge(df_dbp, how='left', on='id_dbp')

In [233]:
cols = ["uri_dbp","name_dbp", "birthDate_dbp","deathDate_dbp","placeOfBirth_dbp","placeOfDeath_dbp"]
final_bnf_dbp_merge=final_bnf_dbp_merge[cols]
print(len(final_bnf_dbp_merge))
final_bnf_dbp_merge

342


Unnamed: 0,uri_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,http://dbpedia.org/resource/Anita_Augspurg,Anita Augspurg,1857-09-22,1943-12-20,,
1,http://dbpedia.org/resource/Vittorio_Emanuele_...,Vittorio Emanuele Orlando,1860-05-19,1952-12-01,Kingdom of the Two Sicilies,Italy
2,http://dbpedia.org/resource/Vittorio_Emanuele_...,Vittorio Emanuele Orlando,1860-05-19,1952-12-01,Palermo,Italy
3,http://dbpedia.org/resource/Vittorio_Emanuele_...,Vittorio Emanuele Orlando,1860-05-19,1952-12-01,Kingdom of the Two Sicilies,Rome
4,http://dbpedia.org/resource/Vittorio_Emanuele_...,Vittorio Emanuele Orlando,1860-05-19,1952-12-01,Palermo,Rome
...,...,...,...,...,...,...
337,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur,India
338,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur State,India
339,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Presidencies and provinces of British India,New Delhi
340,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur,New Delhi


In [234]:
pd.concat([df_dbp,final_bnf_dbp_merge]).drop_duplicates(keep=False)

Unnamed: 0,uri_dbp,viaf_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,http://dbpedia.org/resource/António_de_Almeida...,http://viaf.org/viaf/99921066,António de Almeida Santos,1926-02-15,2016-01-18,,
1,http://dbpedia.org/resource/Carlos_Carvalhas,http://viaf.org/viaf/99826658,Carlos Carvalhas,1941-11-09,,"São Pedro do Sul, Portugal",
2,http://dbpedia.org/resource/Anita_Augspurg,http://viaf.org/viaf/9976800,Anita Augspurg,1857-09-22,1943-12-20,,
3,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Lisbon,
4,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Portugal,
...,...,...,...,...,...,...,...
337,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur,India
338,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur State,India
339,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Presidencies and provinces of British India,New Delhi
340,http://dbpedia.org/resource/Laxmi_Mall_Singhvi,,Laxmi Mall Singhvi,1931-11-09,2007-10-06,Jodhpur,New Delhi


In [38]:
# We realize a merge between Wikidata, BnF Data and DBpedia using the BnF URIs in common
merged_final_bnf_wk_dbp = merged_final_bnf_wk.merge(final_bnf_dbp_merge, on=['id_bnf'], how="inner")
print("")
print("This method merge", len(merged_final_bnf_wk_dbp), "persons.")
print("")
cols=["id_bnf","id_wk",'id_dbp',"uri_bnf_x","uri_wk", "name_bnf_x", "dateBirth_bnf_x","dateDeath_bnf_x","placeOfBirth_bnf_x","placeOfDeath_bnf_x","bio_bnf","name_wk","dateBirth_wk","dateDeath_wk","placeOfBirth_wk","placeOfDeath_wk","name_dbp", "birthDate_dbp","deathDate_dbp","placeOfBirth_dbp","placeOfDeath_dbp"]
merged_final_bnf_wk_dbp[cols][50:100]


This method merge 39 persons.



Unnamed: 0,id_bnf,uri_bnf,viaf_x,name_bnf,Sname,dateBirth_bnf,dateDeath_bnf,placeOfBirth_bnf,placeOfDeath_bnf,bio_bnf,...,Score,bnf_Name_Lookup,Wikidata_Name_Lookup,id_dbp,uri_dbp,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp


#### Match between DBpedia and Wikidata

In [2]:
import recordlinkage

In [3]:
indexer = recordlinkage.Index()
indexer.sortedneighbourhood(left_on='deathDate_wk', right_on='deathDate_dbp')
candidates = indexer.index(df_wk, df_dbp)
print(len(candidates))

NameError: name 'df_wk' is not defined

indexer = recordlinkage.Index()
indexer.block(left_on=['name_bnf', 'uri_bnf'],
              right_on=['name_dbp', 'uri_dbp'])
pairs = indexer.index(df_bnf, df_dbp)


candidates = indexer.index(BnF_Data, DBpedia)
print(len(candidates))

In [1]:
compare = recordlinkage.Compare()
compare.string('name_wk',
            'name_dbp',
            threshold=0.85,
            label='name_wk_dbp')
compare.exact('dateDeath_wk',
            'deathDate_dbp',
            label='deathDate_wk_dbp')
compare.exact('dateBirth_wk',
            'birthDate_dbp',
            label='birthDate_wk_dbp')
features = compare.compute(candidates, Wikidata, DBpedia)
# !!!! see if use https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.Compare.add works to add method like compare 'exact' on 'name'

NameError: name 'recordlinkage' is not defined

In [47]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

3.0     409
2.0     234
1.0    1110
0.0    8797
dtype: int64

In [48]:
features

Unnamed: 0_level_0,Unnamed: 1_level_0,name_wk_dbp,deathDate_wk_dbp,birthDate_wk_dbp
id_wk,id_dbp,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
37,7331,0.0,0,0
51,1800,1.0,0,1
10006,1800,0.0,0,0
58,6668,0.0,0,0
24839,6668,0.0,0,0
...,...,...,...,...
56294,9738,0.0,0,0
56314,7415,0.0,0,0
56347,687,0.0,0,0
56413,7486,0.0,0,0


In [492]:
potential_matches = features[features.sum(axis=1) > 1].reset_index()
potential_matches['Score'] = potential_matches.loc[:, 'name_wk_dbp':'birthDate_wk_dbp'].sum(axis=1)
potential_matches

KeyError: 'name_wk_dbp'

In [50]:
Wikidata.loc[2255,:]

uri_wk             http://www.wikidata.org/entity/Q350384
name_wk                                       Milan Jelić
dateBirth_wk                                   1956-03-26
dateDeath_wk                                    2007-09-3
placeOfBirth_wk                         Koprivna, Modriča
placeOfDeath_wk                                   Modriča
viaf                                                  NaN
uri_bnf                                               NaN
Name: 2255, dtype: object

In [51]:
DBpedia.loc[156,:]

uri_dbp                    http://dbpedia.org/resource/Boris_Fyodorov
viaf                                    http://viaf.org/viaf/91429025
name_dbp                                               Boris Fyodorov
birthDate_dbp                                              1958-02-13
deathDate_dbp                                              2008-11-20
placeOfBirth_dbp                                         Soviet Union
placeOfDeath_dbp                                               London
dbp_Name_Lookup     Boris Fyodorov|1958-02-13|2008-11-20|Soviet Un...
Name: 156, dtype: object

In [54]:
Wikidata['name_wk']=Wikidata['name_wk'].astype(str)
DBpedia['name_dbp']=DBpedia['name_dbp'].astype(str)

Wikidata['viaf']=Wikidata['viaf'].astype(str)
DBpedia['viaf']=DBpedia['viaf'].astype(str)

Wikidata['uri_wk']=Wikidata['uri_wk'].astype(str)
DBpedia['uri_dbp']=DBpedia['uri_dbp'].astype(str)

Wikidata['placeOfBirth_wk']=Wikidata['placeOfBirth_wk'].astype(str)
DBpedia['placeOfBirth_dbp']=DBpedia['placeOfBirth_dbp'].astype(str)

Wikidata['placeOfDeath_wk']=Wikidata['placeOfDeath_wk'].astype(str)
DBpedia['placeOfDeath_dbp']=DBpedia['placeOfDeath_dbp'].astype(str)

Wikidata['dateDeath_wk']=Wikidata['dateDeath_wk'].astype(str)
DBpedia['deathDate_dbp']=DBpedia['deathDate_dbp'].astype(str)

Wikidata['dateBirth_wk']=Wikidata['dateBirth_wk'].astype(str)
DBpedia['birthDate_dbp']=DBpedia['birthDate_dbp'].astype(str)

In [55]:
Wikidata['Wikidata_Name_Lookup'] = Wikidata[[
   'name_wk', 'dateBirth_wk','dateDeath_wk','placeOfBirth_wk' ,'placeOfDeath_wk' 
]].apply(lambda x: '|'.join(x), axis=1)

DBpedia['dbp_Name_Lookup'] = DBpedia[[
   'name_dbp', 'birthDate_dbp', 'deathDate_dbp','placeOfBirth_dbp', 'placeOfDeath_dbp'
]].apply(lambda x: '|'.join(x), axis=1)

Wikidata_lookup = Wikidata[['Wikidata_Name_Lookup']].reset_index()
DBpedia_lookup = DBpedia[['dbp_Name_Lookup']].reset_index()


In [56]:
Wikidata_merge = potential_matches.merge(Wikidata_lookup, how='left')

In [57]:
final_dbp_wk_merge = Wikidata_merge.merge(DBpedia_lookup, how='left')

In [58]:
cols = ['id_wk', 'id_dbp', 'Score',
        'Wikidata_Name_Lookup', 'dbp_Name_Lookup']
final_dbp_wk_merge=final_dbp_wk_merge[cols].sort_values(by=[ 'Score'], ascending=True)
print(len(final))
final[:20]

117


Unnamed: 0,id_bnf,id_dbp,Score,BnF_Name_Lookup,dbp_Name_Lookup
0,2110,9566,2.0,Mihai A. Antonescu|1907-11-18|1946-06-01|Nucet...,Mihai Antonescu|1904-11-18|1946-06-01|Kingdom ...
84,9181,3201,2.0,Vilfredo Pareto|1848-07-15|1923-08-20|Paris|Cé...,Vilfredo Pareto|1848-07-15|1923-08-19|nan|nan
34,3014,8313,2.0,Gisèle Halimi|1927-07-27|2020-07-28|La Goulett...,Gisèle Halimi|1927-07-28|2020-07-28|French pro...
33,2607,1493,2.0,William Martin Geldart|1870-06-07|1922-02-12|n...,William Martin Geldart|1870-06-07|--02-12|nan|nan
44,3539,8182,2.0,Camille Blaisot|1881-01-19|1945|Valognes (Manc...,Camille Blaisot|1881-01-19|1945-01-24|Valognes...
89,9336,4306,2.0,Jacob Marschak|1898|1977-07-27|nan|nan|Economi...,Jacob Marschak|1898-07-23|1977-07-27|Russian E...
45,3560,1976,2.0,Gyula Wlassics|1852-03-17|1937-04-30|Zalaegers...,Gyula Wlassics|1852-03-17|1937-03-30|Kingdom o...
25,1870,1054,2.0,David Josiah Brewer|1837-06-20|1910-03-27|nan|...,David Josiah Brewer|1837-06-20|1910-03-28|İzmi...
93,9470,1808,2.0,Gustav Cassel|1866-10-20|1945-01-15|Stockholm|...,Gustav Cassel|1866-10-20|1945-01-14|Stockholm|...
48,3669,5796,2.0,Vincenzo Caianiello|1932-10-02|2002-04-26|Aver...,Vincenzo Caianiello|1932-10-02|2002-04-06|Case...


## VIAF method

### Between Wikidata and DBpedia

In [3]:
Wikidata[:10]

Unnamed: 0_level_0,uri_wk,name_wk,dateBirth_wk,dateDeath_wk,placeOfBirth_wk,placeOfDeath_wk,viaf,uri_bnf
id_wk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,http://www.wikidata.org/entity/Q77404,Ingeborg Schwenzer,1951-10-25,,Stuttgart,,http://viaf.org/viaf/91748910,http://data.bnf.fr/ark:/12148/cb15067859n#about
1,http://www.wikidata.org/entity/Q77341,Hans Globke,1898-09-1,1973-02-13,Düsseldorf,Bad Godesberg,http://viaf.org/viaf/54939901,http://data.bnf.fr/ark:/12148/cb15597313f#about
2,http://www.wikidata.org/entity/Q77390,Christoph Ahlhaus,1969-08-28,,Heidelberg,,http://viaf.org/viaf/171463495,
3,http://www.wikidata.org/entity/Q89299,Emil Schlagintweit,1835-07-07,1904-10-29,Munich,Zweibrücken,http://viaf.org/viaf/74008741,http://data.bnf.fr/ark:/12148/cb13519131c#about
4,http://www.wikidata.org/entity/Q72628,Alfred von Kiderlen-Waechter,1852-07-1,1912-12-3,Stuttgart,Stuttgart,http://viaf.org/viaf/54958174,http://data.bnf.fr/ark:/12148/cb135097503#about
5,http://www.wikidata.org/entity/Q65539,Peter Altmaier,1958-06-18,,Ensdorf,,http://viaf.org/viaf/4137633,
6,http://www.wikidata.org/entity/Q89439,Adolf Damaschke,1865-11-24,1935-07-3,Berlin,Berlin,http://viaf.org/viaf/54949744,http://data.bnf.fr/ark:/12148/cb130056891#about
7,http://www.wikidata.org/entity/Q72654,Magnus von Braun,1878-02-07,1972-08-29,Bagrationovsk,Oberaudorf,http://viaf.org/viaf/262459734,
10,http://www.wikidata.org/entity/Q65561,Hans Apel,1932-02-25,2011-09-06,Hamburg,Hamburg,http://viaf.org/viaf/232142151,http://data.bnf.fr/ark:/12148/cb121740063#about
11,http://www.wikidata.org/entity/Q89493,Marianne Beth,1890-03-06,1984-08-19,Vienna,New York City,http://viaf.org/viaf/220617724,


In [8]:
DBpedia[:10]

Unnamed: 0,id_dbp,uri_dbp,viaf,name_dbp,birthDate_dbp,deathDate_dbp,placeOfBirth_dbp,placeOfDeath_dbp
0,0,http://dbpedia.org/resource/António_de_Almeida...,http://viaf.org/viaf/99921066,António de Almeida Santos,1926-02-15,2016-01-18,,
1,1,http://dbpedia.org/resource/Carlos_Carvalhas,http://viaf.org/viaf/99826658,Carlos Carvalhas,1941-11-09,,"São Pedro do Sul, Portugal",
2,2,http://dbpedia.org/resource/Anita_Augspurg,http://viaf.org/viaf/9976800,Anita Augspurg,1857-09-22,1943-12-20,,
3,3,http://dbpedia.org/resource/Paulo_Portas,http://viaf.org/viaf/99455673,Paulo Portas,1962-09-12,,Lisbon,
4,5,http://dbpedia.org/resource/Pedro_Aspe,http://viaf.org/viaf/9928165,,1950-07-07,,Mexico City,
5,7,http://dbpedia.org/resource/Fernando_Teixeira_...,http://viaf.org/viaf/99275725,Fernando Teixeira dos Santos,1951-09-13,,"Maia, Portugal",
6,9,http://dbpedia.org/resource/Xavier_Vives,http://viaf.org/viaf/9920331,Xavier Vives,1955-01-23,,,
7,10,http://dbpedia.org/resource/Vittorio_Emanuele_...,http://viaf.org/viaf/9914155,Vittorio Emanuele Orlando,1860-05-19,1952-12-01,Kingdom of the Two Sicilies,Italy
8,14,http://dbpedia.org/resource/James_M._Poterba,http://viaf.org/viaf/9910825,James M. Poterba,1958-07-13,,,
9,15,http://dbpedia.org/resource/Claus_Roxin,http://viaf.org/viaf/9910548,Claus Roxin,1931-05-15,,,


In [None]:
df_merged_Wiki_DB = Wikidata.merge( DBpedia , on='viaf', how='inner', sort='viaf')
df_merged_Wiki_DB[:10]

In [1]:
print("the number of merged data from DBpedia and Wikidata is ",len(merged_df_wk_dbp), "rows.")
print("")
print("The proportion of the number of merged data from DBpedia with Wikidata is ",((len(merged_df_wk_dbp))/(len(DBpedia))*100),"%")
print("")
print("proportion of the number of merged data from Wikidata with DBpedia is ",((len(merged_df_wk_dbp))/(len(Wikidata))*100),"%")

NameError: name 'merged_df_wk_dbp' is not defined

### Between Wikidata and BnF Data

In [None]:
merged_df_2 = pd.merge( Wikidata, BnF_Data , on='viaf', how='inner', sort='viaf')
print(len(merged_df_2))
merged_df_2[:10]

In [39]:
print("The number of merged data from BnF Data and Wikidata is ",len(merged_df_2), "rows.")

print("")

print("The proportion of the number of merged data from BnF Data with Wikidata is ",((len(merged_df_2))/(len(df_bnf))*100),"%")

print("")

print("The proportion of the number of merged data from Wikidata with Wikidata is ",((len(merged_df_2))/(len(df_wk))*100),"%")

The number of merged data from BnF Data and Wikidata is  112 rows.

The proportion of the number of merged data from BnF Data with Wikidata is  1.224445173280857 %

The proportion of the number of merged data from Wikidata with Wikidata is  0.5179669796050502 %


### Between DBpedia and BnF Data

In [40]:
merged_df_3 = pd.merge( df_bnf, df_dbp , on='viaf', how='inner', sort='viaf')
print(len(merged_df_3))
merged_df_3[:10]

88


Unnamed: 0,uri_bnf,viaf,name_bnf,sName,year_bnf,bio_bnf,uri_dbp,name_dbp,year_dbp
0,http://data.bnf.fr/ark:/12148/cb122145877#about,http://viaf.org/viaf/100966624,John Humphrey,,1905,Juriste. - A été professeur de droit internati...,http://dbpedia.org/resource/John_Peters_Humphrey,John Peters Humphrey,1905
1,http://data.bnf.fr/ark:/12148/cb12327654n#about,http://viaf.org/viaf/107536763,Louis Renault,,1843,Juriste. - Professeur de droit international à...,http://dbpedia.org/resource/Louis_Renault_(jur...,Louis Renault,1843
2,http://data.bnf.fr/ark:/12148/cb122775427#about,http://viaf.org/viaf/108173876,Ronald Myles Dworkin,,1931,Juriste. - Professeur de jurisprudence à la Ya...,http://dbpedia.org/resource/Ronald_Dworkin,,1931
3,http://data.bnf.fr/ark:/12148/cb11927239j#about,http://viaf.org/viaf/108188941,Gordon Tullock,,1922,"Docteur en droit (University of Chicago, Ill.,...",http://dbpedia.org/resource/Gordon_Tullock,Gordon Tullock,1922
4,http://data.bnf.fr/ark:/12148/cb120906270#about,http://viaf.org/viaf/108565309,Paul Abraham Freund,,1908,"Professeur de droit, ""Harvard Law School""",http://dbpedia.org/resource/Paul_A._Freund,Paul Abraham Freund,1908
5,http://data.bnf.fr/ark:/12148/cb119084288#about,http://viaf.org/viaf/108587991,Alexis Jacquemin,,1938,Juriste et économiste. - Professeur à l'Univer...,http://dbpedia.org/resource/Alexis_Jacquemin,Alexis Jacquemin,1938
6,http://data.bnf.fr/ark:/12148/cb128832222#about,http://viaf.org/viaf/108624624,Muḥammad Ẓafr Allāh H̱ān,,1893,"Juriste, diplomate et homme politique",http://dbpedia.org/resource/Muhammad_Zafarulla...,CH Muhammad Zafarullah Khan,1893
7,http://data.bnf.fr/ark:/12148/cb12299375j#about,http://viaf.org/viaf/108794549,Karl Engisch,,1899,Juriste. - Spécialiste de philosophie du droit...,http://dbpedia.org/resource/Karl_Engisch,Karl Engisch,1899
8,http://data.bnf.fr/ark:/12148/cb118935370#about,http://viaf.org/viaf/111389197,Georges Bousquet,,1846,Avocat au Barreau de Paris (en 1866). - Engagé...,http://dbpedia.org/resource/Georges_Hilaire_Bo...,Georges Hilaire Bousquet,1845
9,http://data.bnf.fr/ark:/12148/cb12328362p#about,http://viaf.org/viaf/11396531,John Paul Stevens,,1920,Juriste américain,http://dbpedia.org/resource/John_Paul_Stevens,John Paul Stevens,1920


In [41]:
print("The number of merged data from BnF Data and Wikidata is ",len(merged_df_3), "rows.")

print("")

print("The proportion of the number of merged data from BnF Data with DBpedia is ",((len(merged_df_3))/(len(df_dbp))*100),"%")

print("")

print("The proportion of the number of merged data from DBpedia with BnF Data is ",((len(merged_df_3))/(len(df_bnf))*100),"%")

The number of merged data from BnF Data and Wikidata is  88 rows.

The proportion of the number of merged data from BnF Data with DBpedia is  5.333333333333334 %

The proportion of the number of merged data from DBpedia with BnF Data is  0.9620640647206734 %


### Between Wikidata, BnF Data and DBpedia

In [42]:
merged_df = pd.merge( merged_df_1, df_bnf , on='viaf', how='inner', sort='viaf')
merged_df[:10]

Unnamed: 0,uri_wk,viaf,name_wk,year_wk,uri_dbp,name_dbp,year_dbp,uri_bnf,name_bnf,sName,year_bnf,bio_bnf
0,http://www.wikidata.org/entity/Q518859,http://viaf.org/viaf/108188941,Gordon Tullock,1922,http://dbpedia.org/resource/Gordon_Tullock,Gordon Tullock,1922,http://data.bnf.fr/ark:/12148/cb11927239j#about,Gordon Tullock,,1922,"Docteur en droit (University of Chicago, Ill.,..."
1,http://www.wikidata.org/entity/Q652154,http://viaf.org/viaf/108587991,Alexis Jacquemin,1938,http://dbpedia.org/resource/Alexis_Jacquemin,Alexis Jacquemin,1938,http://data.bnf.fr/ark:/12148/cb119084288#about,Alexis Jacquemin,,1938,Juriste et économiste. - Professeur à l'Univer...
2,http://www.wikidata.org/entity/Q3085838,http://viaf.org/viaf/32062931,François Simiand,1873,http://dbpedia.org/resource/François_Simiand,François Simiand,1873,http://data.bnf.fr/ark:/12148/cb12301152q#about,François Simiand,,1873,Philosophe. - Agrégé de philosophie. - Docteur...
3,http://www.wikidata.org/entity/Q61956,http://viaf.org/viaf/44308789,Lorenz von Stein,1815,http://dbpedia.org/resource/Lorenz_von_Stein,Lorenz von Stein,1815,http://data.bnf.fr/ark:/12148/cb12001622n#about,Lorenz von Stein,,1815,"Juriste et économiste. - Professeur à Kiel, Al..."
4,http://www.wikidata.org/entity/Q231690,http://viaf.org/viaf/44331988,B. R. Ambedkar,1891,http://dbpedia.org/resource/B._R._Ambedkar,Bhimrao Ramji Ambedkar,1891,http://data.bnf.fr/ark:/12148/cb12126992f#about,Bhimrao Ramji Ambedkar,,1891,Homme politique d'origine harijan mahar. - Étu...
5,http://www.wikidata.org/entity/Q215961,http://viaf.org/viaf/50021033,Franz Hermann Schulze-Delitzsch,1808,http://dbpedia.org/resource/Franz_Hermann_Schu...,Hermann Schulze-Delitzsch,1808,http://data.bnf.fr/ark:/12148/cb12088660j#about,Hermann Schulze-Delitzsch,,1808,"Juriste, homme politique et économiste alleman..."
6,http://www.wikidata.org/entity/Q4893263,http://viaf.org/viaf/69263532,Joan Sardà i Dexeus,1910,http://dbpedia.org/resource/Joan_Sardà_i_Dexeus,Joan Sardà i Dexeus,1910,http://data.bnf.fr/ark:/12148/cb158098327#about,Juan Sardá Dexeus,,1910,Docteur en droit. - Économiste
7,http://www.wikidata.org/entity/Q7836141,http://viaf.org/viaf/73921034,Travers Twiss,1809,http://dbpedia.org/resource/Travers_Twiss,Travers Twiss,1809,http://data.bnf.fr/ark:/12148/cb12314495r#about,Travers Twiss,,1809,Juriste. - Spécialiste de droit international


In [43]:
print("The number of merged data from DBpedia, Wikidata and BnF Data is",len(merged_df),"rows.")
print("")
print("The proportion of the number of merged data from DBpedia, Wikidata and BnF Data is ",(len(merged_df))/(len(df_bnf))*100,"%" )

The number of merged data from DBpedia, Wikidata and BnF Data is 8 rows.

The proportion of the number of merged data from DBpedia, Wikidata and BnF Data is  0.08746036952006123 %


In [42]:
df1 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005', 'id006', 'id007'],
                    'first_name': ['Rivi', 'Wynnie', 'Kristos', 'Madalyn', 'Tobe', 'Regan', 'Kristin'],
                    'last_name': ['Valti', 'McMurty', 'Ivanets', 'Max', 'Riddich', 'Huyghe', 'Illis'],
                    'email': ['rvalti0@example.com', 'wmcmurty1@example.com', 'kivanets2@example.com',
                              'mmax3@example.com', 'triddich4@example.com', 'rhuyghe@example.com', 'killis4@example.com']
                    })

In [43]:
df2 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005'],
                    'image_url': ['http://example.com/img/id001.png', 'http://example.com/img/id002.jpg',
                                  'http://example.com/img/id003.bmp', 'http://example.com/img/id004.jpg',
                                  'http://example.com/img/id005.png']
                    })

In [11]:
df3_merged = pd.merge(df1, df2)
df3_merged 

Unnamed: 0,user_id,first_name,last_name,email,image_url
0,id001,Rivi,Valti,rvalti0@example.com,http://example.com/img/id001.png
1,id002,Wynnie,McMurty,wmcmurty1@example.com,http://example.com/img/id002.jpg
2,id003,Kristos,Ivanets,kivanets2@example.com,http://example.com/img/id003.bmp
3,id004,Madalyn,Max,mmax3@example.com,http://example.com/img/id004.jpg
4,id005,Tobe,Riddich,triddich4@example.com,http://example.com/img/id005.png
