In this tutorial we will use the OCLC Classify service to reconcile bibliographic titles and authors. In this scenario we have only the title of the work and the author’s name. Our goal is to enrich the dataset with unique, persistent identifiers for the names and titles provided. Our approach is to start with the Work, which is the highest level of the hierarchy. You can think of a Work as having many instances, or editions, that all belong to it. To get that instance-level information, we first need to reconcile each title to an OCLC Classify Work.


Getting started checklist:
1. A CSV or TSV file with metadata. At minimum it needs to contain the author’s full name and the full title of the work.
2. An OCLC WSKey for the Classify service. This can be generated if you have a subscription or membership with OCLC; you will need to speak with someone at your institution who has access to your organization’s account at http://platform.worldcat.org/wskey/
3. Python with the Pandas, Requests and BeautifulSoup modules installed, and an internet connection.
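
Before going further, it can help to confirm that the input file really has the two required columns. The sketch below is one way to do that using only the standard library. The helper name `missing_columns` and the sample data are illustrative assumptions and are not part of the tutorial's own code.

```python
import csv
import io

def missing_columns(file_obj, required, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(file_obj, delimiter=delimiter)
    header = next(reader, [])
    return [name for name in required if name not in header]

# Hypothetical sample data standing in for a real TSV file
sample = io.StringIO("author\ttitle\nJane Austen\tEmma\n")
print(missing_columns(sample, ["author", "title"]))  # prints []
```

For a real file you would pass an open file handle instead of the `io.StringIO` sample, and `delimiter=","` for a CSV.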


Our first step is to define some variables we will be using. Set the variables below based on your setup:


`path_to_tsv` - the path to the TSV/CSV file you want to run it on

`id_author_name` - the name of the column header in the file that contains author's name

`id_title_name` - the name of the column header in the file that contains title of the work

`id_author_viaf` - the name of the column header that the author's viaf number will be put into

`id_author_authorized_heading` - the name of the column header that the author's authorized heading string will be put into

`id_author_lccn` - the name of the column header that the author's lccn number will be put into

`user_agent` - this is the value put into the headers on each request, it is good practice to identify your client/project

`pause_between_req` - number of seconds to wait between each API call, if you want the script to run slower

`WSkey` - your OCLC WSKey, an 80-character alphanumeric code

We will also load the modules we will be using.

In [3]:
path_to_tsv = "/path/to/the_file.tsv"
id_author_name = 'author'
id_title_name = 'title'
id_author_viaf  = 'author_viaf'
id_author_authorized_heading  = 'author_authorized_heading'
id_author_lccn = 'author_lccn'
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0
WSkey = "WSKeyXXXXXXXXXXXXXX"

import pandas as pd
import requests
import time
from bs4 import BeautifulSoup


Before we start working with the file we need to define two helper functions. The first is the function that each row of the file is run through to modify its values. The second is a helper function that parses the results returned from the Classify service.


In [None]:
def add_classify(d):

    # You can add some logic here to skip rows that already have some data if you previously ran it
    # if 'oclc_classify' in d:
    #     if type(d['oclc_classify']) == str:        
    #         print('Skip',d[id_author_name])
    #         return d


    # make the call out to the Classify service
    headers = {'X-OCLC-API-Key': WSkey}
    params = {'author': d[id_author_name], 'title': d[id_title_name], 'summary' : 'false', 'maxRecs':100}
    r = requests.get('https://metadata.api.oclc.org/classify/', params=params,headers=headers)

    work_parsed = None
    work_unparsed = None

    # different response codes mean different things:
    # 0:	Success. Single-work summary response provided.
    # 2:	Success. Single-work detail response provided.
    # 4:	Success. Multi-work response provided.
    # 100:	No input. The method requires an input argument.
    # 101:	Invalid input. The standard number argument is invalid.
    # 102:	Not found. No data found for the input argument.
    # 200:	Unexpected error.

    if r.text.find('<response code="2"/>') > -1:
        work_parsed = extract_classify(r.text)
        work_unparsed = r.text

    elif r.text.find('<response code="4"/>') > -1:
        
        # this response returned multiple possibilities, so we look through them and select the one with the most holdings
        soup = BeautifulSoup(str(r.text))
        work_soup = soup.find("works")
        largest_count = 0
        largest_work = None
        for work in work_soup.find_all("work"):
            if int(work['holdings']) > largest_count:
                largest_count = int(work['holdings'])
                largest_work = work

        # once we have that one we want to use, make the request again with its ID now
        params = {'owi': largest_work['owi'], 'summary' : 'false', 'maxRecs':100}
        r = requests.get('https://metadata.api.oclc.org/classify/', params=params,headers=headers)

        work_parsed = extract_classify(r.text)
        work_unparsed = r.text

    elif r.text.find('<response code="100"/>') > -1 or r.text.find('<response code=\\"100\\"/>') >-1:
        print(params,'100: No input. The method requires an input argument.')
    elif r.text.find('<response code="101"/>') > -1 or r.text.find('<response code=\\"101\\"/>') >-1:
        print(params,'101: Invalid input. The standard number argument is invalid.')
    elif r.text.find('<response code="102"/>') > -1 or r.text.find('<response code=\\"102\\"/>') >-1:
        print(params,'102: Not found. No data found for the input argument.')
    elif r.text.find('<response code="200"/>') > -1 or r.text.find('<response code=\\"200\\"/>') >-1:
        print(params,'200: Unexpected error.')
    else:
        print("Unknown problem:",r.text)


    if work_parsed != None:
        # we are going to store the raw XML from the response in the file as well as the extracted information
        d['oclc_classify'] = work_unparsed

        # check if the columns we want to add are not yet there in the row, if not add them as null
        if id_author_viaf not in d:
            d[id_author_viaf] = None

        if id_author_authorized_heading not in d:
            d[id_author_authorized_heading] = None

        if id_author_lccn not in d:
            d[id_author_lccn] = None

        # add in the author info if is missing
        if pd.isnull(d[id_author_viaf]) == True and work_parsed['work_author'] != None:
            if work_parsed['work_author']['viaf'] != None:
                d[id_author_authorized_heading] = work_parsed['work_author']['name']
                d[id_author_viaf] = work_parsed['work_author']['viaf']
        
        if pd.isnull(d[id_author_lccn]) == True and work_parsed['work_author'] != None:
            if work_parsed['work_author']['lccn'] != None:                    
                d[id_author_lccn] = work_parsed['work_author']['lccn']
                d[id_author_authorized_heading] = work_parsed['work_author']['name']

        # copy the work-level counts and identifier from the parsed response
        d['oclc_eholdings'] = work_parsed['work_eholdings']
        d['oclc_holdings'] = work_parsed['work_holdings']
        d['oclc_owi'] = work_parsed['work_owi']


    # if we need the script to run slower we can configure it by setting the pause_between_req variable above
    time.sleep(pause_between_req)

    return d
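
The multi-work branch above picks the candidate Work with the most holdings. As a standalone illustration of that selection step, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup (a deliberate swap so the example has no third-party dependencies). The sample XML is a hypothetical, simplified response, and the helper names `response_code` and `work_with_most_holdings` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified multi-work response; real responses contain more
# attributes and may declare an XML namespace, which this sketch ignores.
SAMPLE = """<classify>
  <response code="4"/>
  <works>
    <work owi="1001" holdings="12" title="Emma"/>
    <work owi="1002" holdings="340" title="Emma"/>
    <work owi="1003" holdings="57" title="Emma"/>
  </works>
</classify>"""

def response_code(root):
    """Return the numeric response code, or None if the element is missing."""
    node = root.find("response")
    return int(node.get("code")) if node is not None else None

def work_with_most_holdings(root):
    """Return the <work> element with the largest holdings count, or None."""
    works = root.findall("./works/work")
    if not works:
        return None
    return max(works, key=lambda w: int(w.get("holdings", 0)))

root = ET.fromstring(SAMPLE)
if response_code(root) == 4:
    best = work_with_most_holdings(root)
    print(best.get("owi"))  # prints 1002
```

Returning `None` when no candidates are present lets the caller skip the follow-up request rather than fail on a missing `owi`.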

This is the second helper function. The Classify service returns an XML blob with data that can be parsed, and we use the BeautifulSoup module to help read and parse it.
The function returns a dictionary with different keys:

`work_statement_responsibility` - the string statement of responsibility

`work_editions` - the total number of editions this work has

`work_eholdings` - the total electronic holdings this work has

`work_format` - what format the work is, likely "Book"

`work_holdings` - how many instances of this work exist in all the OCLC member libraries

`work_itemtype` - likely "itemtype-book"

`work_owi` - a Work identifier from OCLC, mostly only used inside this Classify service

`work_title` - title

`authors` - a list of dictionaries that contain `name`, `lccn` and `viaf` for each contributor

`work_author` - the "main" contributor, such as the author as opposed to an illustrator or other contributor

`normalized_ddc` - the most common dewey decimal value for this work

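`normalized_lcc` - the most common Library of Congress Classification number for this work

`headings` - a list of dictionaries with `id`, `src` and `value` for each `heading` element in the response

`editions` - a list of dictionaries for all the editions (instances) that represent this work in OCLC institutions, each dict contains: `author` `eholdings` `format` `holdings` `itemtype` `language` `oclc` `title`

`largest_holding_oclc` - the OCLC number of the first edition listed in the response, set only when at least one edition is present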


In [None]:
def extract_classify(xml):

		soup = BeautifulSoup(str(xml))

		work_soup = soup.find("work")

		if work_soup == None:
			# print("can not parse xml:")
			# print(xml)
			return None

		results = {}

		results['work_statement_responsibility'] = None if work_soup.has_attr('author') == False else work_soup['author']
		results['work_editions'] = None if work_soup.has_attr('editions') == False else int(work_soup['editions'])
		results['work_eholdings'] = None if work_soup.has_attr('eholdings') == False else int(work_soup['eholdings'])
		results['work_format'] = None if work_soup.has_attr('format') == False else work_soup['format']
		results['work_holdings'] = None if work_soup.has_attr('holdings') == False else int(work_soup['holdings'])
		results['work_itemtype'] = None if work_soup.has_attr('itemtype') == False else work_soup['itemtype']
		results['work_owi'] = None if work_soup.has_attr('owi') == False else work_soup['owi']
		results['work_title'] = None if work_soup.has_attr('title') == False else work_soup['title']
		results['main_oclc'] = work_soup.text

				
		authors_soup = soup.find_all("author")
		results['authors'] = []
		for a in authors_soup:
			results['authors'].append({
					"name" : a.text.split('[')[0].strip(),
					"lccn" : None if a.has_attr('lc') == False else a['lc'],
					"viaf" : None if a.has_attr('viaf') == False else a['viaf']
				})

		for a in results['authors']:
			if a['lccn'] == 'null':
				a['lccn'] = None	
			if a['viaf'] == 'null':
				a['viaf'] = None	

		# try to find the first main contributor
		results['work_author'] = None
		if results['work_statement_responsibility'] != None:
			if len(results['work_statement_responsibility'].split("|"))>0:
				first_author = results['work_statement_responsibility'].split("|")[0].strip()
				for a in results['authors']:
					if a['name'].strip() == first_author:
						results['work_author'] = a



		results["normalized_ddc"] = None
		results["normalized_lcc"] = None

		ddc_soup = soup.find("ddc")
		if ddc_soup != None:
			ddc_soup = soup.find("ddc").find("mostpopular")
			if ddc_soup != None:
				if ddc_soup.has_attr('nsfa'):
					results["normalized_ddc"] = ddc_soup['nsfa']

		lcc_soup = soup.find("lcc")
		if lcc_soup != None:
			lcc_soup = soup.find("lcc").find("mostpopular")
			if lcc_soup != None:
				if lcc_soup.has_attr('nsfa'):
					results["normalized_lcc"] = lcc_soup['nsfa']


		results["headings"] = []
		heading_soup = soup.find_all("heading")
		for h in heading_soup:
			results["headings"].append({
					"id" : h['ident'],
					"src": h['src'],
					"value" : h.text
				})
			

		edition_soup = soup.find_all("edition")
		# print(isbn,len(edition_soup))
		results["editions"] = []
		for e in edition_soup:
			edition = {}
			edition['author'] = None if e.has_attr('author') == False else e['author']
			edition['eholdings'] = None if e.has_attr('eholdings') == False else int(e['eholdings'])
			edition['format'] = None if e.has_attr('format') == False else e['format']
			edition['holdings'] = None if e.has_attr('holdings') == False else int(e['holdings'])
			edition['itemtype'] = None if e.has_attr('itemtype') == False else e['itemtype']
			edition['language'] = None if e.has_attr('language') == False else e['language']
			edition['oclc'] = None if e.has_attr('oclc') == False else e['oclc']
			edition['title'] = None if e.has_attr('title') == False else e['title']
			results["editions"].append(edition)
		
		if len(results["editions"]) > 0:
			results["largest_holding_oclc"] = results["editions"][0]['oclc']

		return results
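
The "main contributor" step in the function above can be hard to follow in context, so here is a minimal standalone sketch of the same idea: take the first name from the `|`-separated statement of responsibility and look for a contributor with that name. The helper name `pick_main_author` and all sample values are hypothetical. Unlike the function above, this sketch returns the first match rather than the last one.

```python
def pick_main_author(statement, authors):
    """Return the author dict whose name matches the first name in the statement."""
    if not statement:
        return None
    first_name = statement.split("|")[0].strip()
    for author in authors:
        # drop a trailing role such as "[Illustrator]" before comparing
        if author["name"].split("[")[0].strip() == first_name:
            return author
    return None

# Hypothetical values for illustration only
authors = [
    {"name": "Doe, Jane", "lccn": None, "viaf": "viaf-placeholder"},
    {"name": "Roe, Richard [Illustrator]", "lccn": None, "viaf": None},
]
main = pick_main_author("Doe, Jane | Roe, Richard [Illustrator]", authors)
print(main["name"])  # prints Doe, Jane
```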

Our next step is to load the data we are using with Pandas. You can adjust the `sep` argument to change which delimiter is used (for example, if you are using a CSV file, change it to ","). Once the file is loaded, we pass each record to the `add_classify()` function to add the data to the record.



In [5]:
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
df = df.apply(lambda d: add_classify(d),axis=1 )  

# we are writing the file out to the same location here; you may want to modify the filename to create a new file, and change the sep argument if using a CSV
# index=False keeps the row index from being written out as an extra column
df.to_csv(path_to_tsv, sep='\t', index=False)






The code below does the same thing as the block above, but it breaks the CSV/TSV into multiple chunks and writes the file out after each chunk. This allows recovery from errors, such as an internet timeout, that would otherwise cause you to lose all progress unless the script ran flawlessly. You will likely want to use this approach for larger datasets. Also uncomment the starting check in `add_classify`:
```
def add_classify(d):

    # You can add some logic here to skip rows that already have some data if you previously ran it
    # if 'oclc_classify' in d:
    #     if type(d['oclc_classify']) == str:        
    #         print('Skip',d[id_author_name])
    #         return d

```

This allows the function to check whether it needs to skip the record.
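
The skip check relies on the type of the stored value: Pandas reads an empty cell as `NaN`, which is a float, so only rows that already hold a response string pass the test. A minimal sketch of that check on plain dictionaries follows. The helper name `already_processed` is an assumption made for this example and does not appear in the tutorial's code.

```python
def already_processed(row, column="oclc_classify"):
    """True when the row already holds a stored Classify response string."""
    return isinstance(row.get(column), str)

# Hypothetical rows: one already enriched, one empty (NaN), one without the column
print(already_processed({"oclc_classify": "<classify>...</classify>"}))  # prints True
print(already_processed({"oclc_classify": float("nan")}))  # prints False
print(already_processed({"author": "Doe, Jane"}))  # prints False
```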

In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# we are going to split the dataframe into chunks so we can save our progress as we go without saving the entire file after every record
n = 100  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    # if you want to skip the first X chunks uncomment this, the number is the index of the chunk to skip to
    # if idx < 10:
    #     continue

    print("Working on chunk ", idx, 'of', len(list_df))
    list_df[idx] = list_df[idx].apply(lambda d: add_classify(d),axis=1 )  

    reformed_df = pd.concat(list_df)
    # index=False keeps the row index from being written out as an extra column
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)
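
To see how the chunk boundaries fall, the slicing used above (`range(0, df.shape[0], n)`) can be reproduced without Pandas. This is only an illustrative sketch; `chunk_ranges` is a hypothetical helper that is not part of the tutorial's code.

```python
def chunk_ranges(total_rows, chunk_size):
    """Return (start, stop) row positions for each chunk, in order."""
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]

print(chunk_ranges(250, 100))  # prints [(0, 100), (100, 200), (200, 250)]
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size, so no rows are dropped.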
