# AI4LAM Metadata Working Group
## 11 May 2021

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pymarc
from pymarcspec import MarcSearchParser
import spacy

marc_parser = MarcSearchParser()
with open("data/yale-dvd-records.mrc", "rb") as fo:
    marc_reader = pymarc.MARCReader(fo.read())
marc_records = [r for r in marc_reader]

In [2]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
len(marc_records)

749

In [5]:
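# 511 $a: participant or performer note (free text)
spec511 = marc_parser.parse('511$a')
# 7XX $a and $d: added-entry headings (controlled form) and their dates, where present
spec7xx = marc_parser.parse('7..$a$d')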

In [6]:
print(spec511.search(marc_records[45]))
print(spec7xx.search(marc_records[45]))

[['Fats Bookholane, Sello Motloung, Connie Chiume, Nomsa Nene, Clementine Mosimane.']]
[['Harraway, David.'], ['Rauch, Nicola.'], ['Wa Luruli, Ntshavheni.'], ['Green, Richard.'], ['Cheze, Michael.'], ['Mamlambo']]


In [7]:
print(marc_records[45])

=LDR  02369cgm a2200373 a 4500
=001  7717981
=005  20110303202209.0
=007  vd\cgaiz-
=008  061025p20011998sa\122\\\\\\\\\\\\vleng\d
=035  \\$a(OCoLC)ocn682710482
=035  \\$a7717981
=040  \\$aOI@$cOI@$dOCLCQ$dCtY
=028  42$aADVD01 0004$bAfrican DVD
=043  \\$af-sa---
=050  \4$aPN1997$b.C464 1998
=079  \\$aocm55988746
=245  00$aChikin biznis$h[videorecording] :$b-- the whole story /$cdirector, Ntshaveni wa Luruli.
=246  3\$aChicken business
=260  \\$aParklands, South Africa :$bDigital Content Co. ;$aJohannesburg :$bDistributed by African DVD in assoc. with Film Resource Unit,$cc2001, [c1998]
=300  \\$a1 videodisc (122 min.) :$bsd., col. ;$c4 3/4 in.
=500  \\$aTitle from container.
=500  \\$aIncludes a bonus short film: Mamlambo.
=500  \\$aFrom container: producer, Nicola Rauch ; author, David Harraway.
=511  1\$aFats Bookholane, Sello Motloung, Connie Chiume, Nomsa Nene, Clementine Mosimane.
=508  \\$aProducers, Richard Green, Micheal Cheze ; writer, Mtutuzeli Matshoba ; editor, Micki Strouc

In [10]:
spec508 = marc_parser.parse("508$a")
spec508.search(marc_records[45])

[['Producers, Richard Green, Micheal Cheze ; writer, Mtutuzeli Matshoba ; editor, Micki Stroucken; director of photography, Rod Stewart ; original music, Shaluzamax Mntambo ; executive producer, Leon Rautenbach ; production designer, Sarah Roberts.']]

In [12]:
doc = nlp(spec511.search(marc_records[45])[0][0])

In [13]:
doc?

In [14]:
for ent in doc.ents:
    print(ent, ent.label_)

Fats Bookholane PERSON
Sello Motloung PERSON
Connie Chiume PERSON
Nomsa Nene PERSON
Clementine Mosimane PERSON


In [17]:
doc2 = nlp(spec508.search(marc_records[45])[0][0])
for ent in doc2.ents:
    print(ent, ent.start, ent.end, ent.label_)

Richard Green 2 4 PERSON
Micheal Cheze 5 7 PERSON
Mtutuzeli Matshoba 10 12 PERSON
Micki Stroucken 15 17 PERSON
Rod Stewart 22 24 PERSON
Shaluzamax Mntambo 28 30 ORG
Leon Rautenbach 34 36 PERSON
Sarah Roberts 40 42 PERSON


In [19]:
spec008lang = marc_parser.parse("008/35-37")

In [20]:
spec008lang.search(marc_records[45])

['en']

In [21]:
from spacy import displacy

In [22]:
displacy.render(doc2, style='ent')

In [23]:
displacy.render(doc2)

In [24]:
spec520 = marc_parser.parse("520$a")
doc3 = nlp(spec520.search(marc_records[45])[0][0])

In [25]:
doc3

Chikin Biznis follows the fortunes and blunders of the streetwise Bra Sipho, who after "twenty five years of sophisticated slavery" for a Stock Exchange listed company, retires to set up shop in the lucrative trade of chicken business, hoping that one day he will return to the Stock Exchange as the king of his own business empire: a telling metaphor for the aspirations of many other South Africans in an increasingly competitive economic environment. The bonus film Mamlambo is a touching tale of a street child and his friendship with a prostitute in Hillbrow, Johannesburg.

In [27]:
displacy.render(doc3)

In [28]:
displacy.render(doc3, style='ent')

## Approaches for Entity Matching
Controlled-form headings (e.g. 7XX) versus unstructured text (e.g. 508, 511, 520).

1.  Fuzzy string matching libraries such as FuzzyWuzzy: token-ratio matching and similar. Treating each entity as a bag of words (BOW) means token order wouldn't matter.
1.  Beyond spaCy: [Papers With Code: Entity Linking](https://paperswithcode.com/task/entity-linking)
1.  [Autoregressive Entity Retrieval](https://arxiv.org/pdf/2010.00904v3.pdf)
1.  [NLP Progress](http://nlpprogress.com/)
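
As a rough illustration of the first item, the sketch below matches names found in free text against inverted 7XX headings by comparing sorted tokens, so word order and punctuation are ignored. It uses the standard-library `difflib` as a stand-in for FuzzyWuzzy's token-based ratios. The function names (`tokens`, `token_sort_ratio`, `best_match`) and the 0.85 threshold are assumptions made for this example, not anything settled in the session. The headings and mentions are copied from record 45 above.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Lowercase a name and split it into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[^\W\d_]+", name.lower())


def token_sort_ratio(a, b):
    """Similarity in [0, 1] computed on sorted tokens, so word order is ignored."""
    sorted_a = " ".join(sorted(tokens(a)))
    sorted_b = " ".join(sorted(tokens(b)))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()


def best_match(mention, headings, threshold=0.85):
    """Return (heading, score) for the closest heading, or None below the threshold.

    The default threshold is an arbitrary starting point and would need tuning.
    """
    scored = [(h, token_sort_ratio(mention, h)) for h in headings]
    heading, score = max(scored, key=lambda pair: pair[1])
    return (heading, score) if score >= threshold else None


# Controlled-form headings (7XX $a) and free-text mentions (245 $c, 508 $a)
headings = ["Harraway, David.", "Rauch, Nicola.", "Wa Luruli, Ntshavheni.",
            "Green, Richard.", "Cheze, Michael."]
mentions = ["Ntshaveni wa Luruli", "Micheal Cheze", "Rod Stewart"]

for mention in mentions:
    print(mention, "->", best_match(mention, headings))
```

With these inputs the first two mentions pair with their headings despite the spelling variants ("Ntshaveni"/"Ntshavheni", "Micheal"/"Michael"), while "Rod Stewart" has no heading in this record and returns `None`. A plain character ratio like this cannot tell a variant spelling from a different person with a similar name, which is where the entity-linking approaches in the later items come in.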