# Improved JSTOR-S2 Aligner

The code in this notebook aligns JSTOR articles with Semantic Scholar (S2), using both the title and the DOI as evidence.

Our strategy:

1. Do a search query on title keywords in Semantic Scholar.
2. Iterate through results to find the best match with author and publication date, as well as title and journal title. 
3. If there's also a match to the DOI recorded in JSTOR, we record that as well, and trust it more.
4. We select the paper with the highest trust. Generally, if we found the DOI, this is that paper. Otherwise, if the search produced a match strength higher than 0.7, it's the searched paper. Note: papers with marginal match strengths below 0.85 might still not be very reliable matches. We should check some.
5. In recording citations, we request further pages of results if a paper has more than 1000 citations.
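
Steps 3 and 4 can be condensed into a few lines. The sketch below mirrors the adjustments that `match_a_segment()` applies further down; the function name `choose_source` and the `same_paper` flag are illustrative and do not appear elsewhere in the notebook.

```python
def choose_source(doi_strength, search_strength, same_paper, threshold=0.7):
    '''
    Returns 'DOI', 'search', or None, given the two match strengths and
    whether both lookups returned the same S2 paperId.
    '''
    if doi_strength > threshold:
        doi_strength += 0.05              # DOI matches are inherently stronger
        if search_strength > threshold:
            if same_paper:
                doi_strength += 0.05      # agreement raises confidence in both
                search_strength += 0.05
            else:
                search_strength -= 0.15   # disagreement: favor the DOI

    if doi_strength < threshold and search_strength < threshold:
        return None                       # nothing trustworthy was found

    return 'DOI' if doi_strength > search_strength else 'search'


# The search scored higher at first, but it found a different paper
print(choose_source(0.9, 0.95, same_paper=False))
# No DOI hit (-1), so a good search result is used
print(choose_source(-1, 0.8, same_paper=False))
```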

In checking names, we only consider last names, because the format of first names and initials varies from source to source.

#### How to use it

Run all the cells down to "Match A Segment," so the code is in memory. Load the appropriate metadata file for your discipline.

Then use the function match_a_segment() to step through the metadata file in reasonable-sized chunks. I'm guessing a reasonable chunk is around 4000 or 5000 rows at a time.

So you might run

match_a_segment(inmeta, 0, 5000, 'econ')

and then

match_a_segment(inmeta, 5000, 10000, 'econ')

When you're done with the whole metadata file, the function concat_output() at the end will concatenate all the result files you produced.
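
If you'd rather not type the row boundaries by hand, they can be generated. This is only a sketch: `segment_bounds` is a hypothetical helper, not part of the notebook, and 5000 is the guessed chunk size mentioned above.

```python
def segment_bounds(total_rows, chunk_size=5000):
    '''Yield (startrow, endrow) pairs that cover rows 0 to total_rows.'''
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


# For a metadata file with 12000 rows:
for startrow, endrow in segment_bounds(12000):
    print(startrow, endrow)
```

Each pair could then be passed on as `match_a_segment(inmeta, startrow, endrow, 'econ')`. Running one chunk at a time, rather than in a single loop, makes it easier to resume after a failed request.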

### Necessary imports

In [57]:
import pandas as pd
import urllib, urllib.request
import requests
import time, re
import json, glob
from difflib import SequenceMatcher
from ast import literal_eval
from collections import Counter

### get api key

It's good practice not to save private keys in the GitHub repo.

Instead we'll share the S2 API key as a one-line text file that we place in the same directory as this notebook, and do not add to git. I need to email it to you.

In [2]:
with open('apikey.txt', encoding = 'utf-8') as f:
    # strip() drops the trailing newline that most editors add to the file
    apikey = f.read().strip()
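
A slightly more defensive way to read the key is sketched below. The helper name `load_api_key` is hypothetical. It removes surrounding whitespace, since a stray newline would otherwise be sent as part of the `X-API-KEY` header, and it fails loudly if the file is empty.

```python
from pathlib import Path


def load_api_key(path='apikey.txt'):
    '''Read a one-line key file and return the key without whitespace.'''
    key = Path(path).read_text(encoding='utf-8').strip()
    if not key:
        raise ValueError(f'{path} is empty')
    return key
```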

### Create metadata file for a discipline

If JSTOR provides good full metadata, which they seem to have done for Sarah's Econ corpus, this is very simple:

In [50]:
inmeta = pd.read_csv('../../metadata/econ-jstor-metadata.csv')
inmeta.head()

Unnamed: 0,id,title,isPartOf,publicationYear,doi,docType,docSubType,provider,collection,datePublished,...,creator,publisher,language,pageStart,pageEnd,placeOfPublication,keyphrase,wordCount,pageCount,file
0,http://www.jstor.org/stable/44955205,SUBMISSION OF MANUSCRIPTS TO THE ECONOMETRIC S...,Econometrica,2018,10.2307/44955205,article,misc,jstor,,2018-01-01,...,Jeffrey C. Ely; Donald W. K. Andrews,Wiley,eng,389,389,,pdf format; electronic submissions; manuscript...,352.0,1,part-1.jsonl.gz
1,http://www.jstor.org/stable/1808822,New Books,The American Economic Review,1933,10.2307/1808822,article,research-article,jstor,,1933-03-01,...,,American Economic Association,eng,139,140,,kartelle berlin; heymanns verlag; carl heymann...,117.0,2,part-1.jsonl.gz
2,http://www.jstor.org/stable/1924813,U.S. Evidence on Linear Feedback from Money Gr...,The Review of Economics and Statistics,1985,10.2307/1924813,article,research-article,jstor,,1985-11-01,...,Mary G. McGarvey,The MIT Press,eng,675,680,,relative price; feedback; growth shocks; money...,3585.0,6,part-1.jsonl.gz
3,http://www.jstor.org/stable/1808956,"Tariffs, Intermediate Goods, and Domestic Prot...",The American Economic Review,1969,10.2307/1808956,article,research-article,jstor,,1969-06-01,...,Roy J. Ruffin,American Economic Association,eng,261,269,,effective tariff; tariff rate; effective tarif...,4840.0,9,part-1.jsonl.gz
4,http://www.jstor.org/stable/2728408,L: Industrial Organization,Journal of Economic Literature,1993,10.2307/2728408,article,misc,jstor,,1993-12-01,...,,American Economic Association,eng,2360,2371,,english summary; market; econ societes; amer e...,7775.0,12,part-1.jsonl.gz


#### fall-back option

If there is no metadata file we can iterate through the json provided by JSTOR and extract metadata. Probably not needed.

In [3]:
def extract_metadata(infile, outfile):
    '''
    Iterates through a jsonl file where each article is represented as a line.
    Extracts metadata and saves it to outfile as a tab-separated-value file.
    '''
    jauthors = []
    jtitles = []
    dois = []
    jyears = []
    jdoctypes = []
    jwordcounts = []
    journals = []
    languages = []

    with open(infile, mode = 'r', encoding = 'utf-8') as f:
        lines = f.readlines()
        ctr = 0
        for l in lines:
            j = json.loads(l)
            ctr += 1
            if ctr % 500 == 1:
                print(ctr)
            if 'creator' in j:
                jauthors.append(j['creator'])
            else:
                jauthors.append("['anonymous']")
            doctype = j['docType'] + " | " + j['docSubType']
            jdoctypes.append(doctype)
            jwordcounts.append(j['wordCount'])
            jyears.append(j['publicationYear'])
            jtitles.append(j['title'])
            journals.append(j['isPartOf'])
            ids = j['identifier']
            thedoi = 'no doi'
            for anid in ids:
                if anid['name'] == 'local_doi':
                    thedoi = anid['value']
            dois.append(thedoi)
            languages.append(j['language'])

    themeta = pd.DataFrame({'journal': journals, 'year': jyears, 'authors': jauthors,
                           'title': jtitles, 'language': languages,
                           'wordcount': jwordcounts, 'doctype': jdoctypes, 'doi': dois})
    themeta.to_csv(outfile, sep = '\t', index = False)

#### check document types

This was simple in Econ. We just filter out "misc." If there are a lot more in Ecology, we may want to adjust the code to filter out other non-article doctypes.

In [51]:
doctypes = set(inmeta.docSubType)
doctypes

{'book-review', 'misc', 'research-article'}
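
If another discipline turns out to have more non-article subtypes, one option is to keep the excluded subtypes in a set and count how many rows each subtype accounts for before deciding. This is a sketch on a plain list: `EXCLUDED_SUBTYPES`, `keep_subtype`, and the sample values are illustrative, and the notebook itself only tests for `'misc'`.

```python
from collections import Counter

# Assumption: extend this set if a corpus contains other non-article subtypes
EXCLUDED_SUBTYPES = {'misc'}


def keep_subtype(subtype):
    '''True if rows with this docSubType should be matched against S2.'''
    return subtype not in EXCLUDED_SUBTYPES


# Illustrative values standing in for inmeta.docSubType
subtypes = ['research-article', 'misc', 'book-review', 'research-article', 'misc']

print(Counter(subtypes))          # how many rows each subtype accounts for
kept = [s for s in subtypes if keep_subtype(s)]
print(len(kept), 'of', len(subtypes), 'rows kept')
```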

### Functions that evaluate match strength

We define functions to compare pairs of titles, lists of authors, and publication year.

In [4]:
def compare_titles(title1, title2):
    
    '''We first lowercase both titles using the lower() method for 
    a case-insensitive comparison.
    We check if the lowercased titles are exactly the same. If so, 
    we return 1.0 and the length of the shorter title.
    If the titles aren't exactly the same, we truncate both titles 
    to a maximum of 50 characters using string slicing. The logic is
    that we suspect the trailing characters are optional boilerplate.
    We then determine the length of the shorter title (or 50 if either
    title was truncated).
    Finally, we use SequenceMatcher from the difflib module to 
    calculate the similarity ratio between the truncated titles, 
    and return the similarity ratio along with the length of the 
    shorter title.
    '''
    
    # Lowercase the titles for case-insensitive comparison
    title1 = title1.lower()
    title2 = title2.lower()
    
    # Check for exact match when lowercased
    if title1 == title2:
        return 1.0, min(len(title1), len(title2))
    
    # Truncate titles to 50 characters if necessary
    max_length = 50
    title1_truncated = title1[:max_length]
    title2_truncated = title2[:max_length]
    
    # Get the length of the shorter title (or 50 if truncated)
    shorter_title_length = min(len(title1_truncated), len(title2_truncated))
    
    # Calculate similarity ratio using SequenceMatcher
    similarity_ratio = SequenceMatcher(None, title1_truncated, title2_truncated).ratio()
    
    return similarity_ratio, shorter_title_length

# Usage:
title1 = "A Long Title That Might Have Additional Subtitle"
title2 = "A long title that might have additional subtitle (Translated from the French by Pierre Menard)"
similarity_ratio, shorter_title_length = compare_titles(title1, title2)
print(f'Similarity Ratio: {similarity_ratio}, Shorter Title Length: {shorter_title_length}')


Similarity Ratio: 0.9795918367346939, Shorter Title Length: 48
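
To see what the 50-character truncation buys us, the sketch below compares a title with a copy that carries extra trailing text. `truncated_ratio` is a stripped-down stand-in for `compare_titles()`, and the titles are made up for illustration.

```python
from difflib import SequenceMatcher


def truncated_ratio(title1, title2, max_length=50):
    '''Similarity of two titles after lowercasing and truncation.'''
    a = title1.lower()[:max_length]
    b = title2.lower()[:max_length]
    return SequenceMatcher(None, a, b).ratio()


base = 'Monetary Policy and the Term Structure of Interest Rates in Open Economies'
variant = base + ': A Reply'

# Compared in full, the trailing text lowers the score
full = SequenceMatcher(None, base.lower(), variant.lower()).ratio()
# Compared on the first 50 characters, the two are identical
truncated = truncated_ratio(base, variant)

print(round(full, 3), truncated)
```

The flip side is that two different articles sharing their first 50 characters (an article and a reply to it, say) get a perfect title score, which is one reason authors and dates also feed into the final match strength.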


In [5]:
def compare_author_lists(author_list1, author_list2):
    
    '''
    Our strategy uses only last names, because middle and first names
    may be omitted or reduced to initials.
    
    We extract the last names from author_list1 (a list of strings) and
    author_list2 (a list of S2 author dicts with a 'name' key) by
    splitting each name string on whitespace and taking the last element.
    We collect the lowercased last names in two sets, then iterate
    through last_names_list1 and check whether each last name exists in
    last_names_list2, counting the matches.
    
    Finally, we calculate the fraction of matching last names by 
    dividing matching_last_names_count by the total number of unique last 
    names in last_names_list1, and return this fraction.
    '''
    
    # If the first list is empty, return None
    if not author_list1:
        return None
    
    # Extract last names from the first list
    last_names_list1 = set()
    for name in author_list1:
        if len(name) > 0:
            last_names_list1.add(name.split()[-1].lower())
    
    last_names_list2 = set()
    for author in author_list2:
        if 'name' in author and len(author['name']) > 0:
            last_names_list2.add(author['name'].split()[-1].lower())
    
    # Find the count of matching last names
    matching_last_names_count = sum(1 for last_name in last_names_list1 if last_name in last_names_list2)
    
    # Calculate and return the fraction of matching last names
    matching_fraction = matching_last_names_count / len(last_names_list1)
    return matching_fraction

# Usage:
author_list1 = ['Michael D. Bauer', 'Glenn D. Rudebusch']
author_list2 = [{'authorId': '145421946', 'name': 'M. BAUER'}, {'authorId': '65729671', 'name': 'Glenn D. Rudebusch'}]

matching_fraction = compare_author_lists(author_list1, author_list2)
print(matching_fraction)  # Output: 1.0

1.0
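
It helps to see exactly what the last-name rule extracts. The sketch below isolates it; `last_name` is an illustrative helper and the final two names are made up. Note the last case: a trailing suffix is taken as the last name, so such authors will only match if S2 records the suffix in the same way.

```python
def last_name(full_name):
    '''Lowercased final whitespace-separated token, or '' if none.'''
    parts = full_name.split()
    return parts[-1].lower() if parts else ''


names = ['Glenn D. Rudebusch',       # ordinary case
         'M. BAUER',                 # initials and capitals do no harm
         ' Donald W. K. Andrews',    # leading space left over from split(';')
         'John Smith Jr.']           # suffix is mistaken for the last name

for name in names:
    print(repr(name), '->', repr(last_name(name)))
```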


In [6]:
def compare_dates(date1, date2):
    '''
    We return the absolute difference between two dates, or 7
    if they are difficult to compare.
    
    The logic is that 7 years apart is not a close match but not
    impossible; it's a neutral result.
    '''
    
    if not date1 or not date2:
        return 7 
    elif len(date2) < 4:
        return 7
    else:
        try:
            date2 = int(date2[0:4])
        except ValueError:
            return 7
        
        return abs(date1 - date2)

print(compare_dates(1989, '197o-8-17'))

7


#### Final evaluation of match

This function pulls together all the functions defined above to make an overall evaluation.

In [7]:
def match_strength(title1, title2, author_list1, author_list2,
                   date1, date2):
    ''' Makes three comparisons and returns an overall metric of
    match strength.
    
    Our metric is mostly dependent on the title (0-1.0), but author matches
    can add up to 0.1, and date distance can subtract up to 0.1, or add 0.02.
    
    The final result should be 1.0 or greater to be trusted.
    '''
    
    titlematch, titlelen = compare_titles(title1, title2)
    
    if titlelen < 10:  
        titlematch = titlematch * (titlelen / 10)
    # short titles aren't as informative
    
    authormatch = compare_author_lists(author_list1, author_list2)
    # datedistance = compare_dates(date1, date2)
    
    if date1 is None or date2 is None:
        datedistance = 7
    else:
        datedistance = abs(date1 - date2)
    
    if not authormatch:
        authormatch = 0
    
    authorbonus = (0.1 * authormatch)
    if datedistance == 0:
        datepenalty = -0.02 # an exact match is a negative penalty aka bonus
    elif datedistance < 4:
        datepenalty = 0
    elif datedistance < 7:
        datepenalty = 0.04
    elif datedistance < 12:
        datepenalty = 0.07
    else:
        datepenalty = 0.1
        
    totalmatch = titlematch + authorbonus - datepenalty
    return totalmatch
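
A few worked cases make the arithmetic concrete. The sketch below repeats only the scoring step, taking the three component scores as inputs; `combine_score` is an illustrative name, and the bonus and penalty values are copied from `match_strength()` above.

```python
def combine_score(titlematch, authormatch, datedistance):
    '''Title score plus author bonus minus date penalty.'''
    authorbonus = 0.1 * authormatch
    if datedistance == 0:
        datepenalty = -0.02   # exact year match acts as a small bonus
    elif datedistance < 4:
        datepenalty = 0
    elif datedistance < 7:
        datepenalty = 0.04
    elif datedistance < 12:
        datepenalty = 0.07
    else:
        datepenalty = 0.1
    return titlematch + authorbonus - datepenalty


# Perfect title, all authors matched, same year: 1.0 + 0.1 + 0.02
print(f'{combine_score(1.0, 1.0, 0):.2f}')    # 1.12
# Good title, half the authors, five years apart: 0.9 + 0.05 - 0.04
print(f'{combine_score(0.9, 0.5, 5):.2f}')    # 0.91
# Fair title, no authors, date missing (distance 7): 0.8 + 0 - 0.07
print(f'{combine_score(0.8, 0.0, 7):.2f}')    # 0.73
```

The last case shows that a missing date, which defaults to a distance of 7, costs 0.07, so a paper with no authors and no date needs a title score of more than 0.77 to clear the 0.7 threshold.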

### Functions that do the searching

First a utility that turns a title into a string that can be used in the API url.

Note: this is probably not necessary. I wrote it before I started passing the query in "params" to the requests library. Now, I suspect the requests library does the encoding that is needed. But I haven't tested.

In [8]:
def title_to_url(title):
    '''
    Slightly deprecated.
    '''
    # handle possessives
    title = title.replace("'s ", " ")
    # Replace hyphens, slashes, and apostrophes with spaces
    title = re.sub(r'[-/\'’]', ' ', title)
    # Remove non-alphabetic characters (excluding spaces)
    title = re.sub(r'[^a-zA-Z\s]', '', title)
    # Convert to lower case
    title = title.lower()
    # Replace double spaces
    url_string = re.sub('  ', ' ', title)
    return url_string

# Usage:
title = "The End of 5-O'Clock-Shadow? Shaving, Depilation/Removal, and Gilette's Other Strategies."
url_string = title_to_url(title)
print(url_string)  # Output: the end of o clock shadow shaving depilation removal and gilette other strategies


the end of o clock shadow shaving depilation removal and gilette other strategies
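
For a sense of the encoding that happens when a query is passed as parameters, the standard library's `urlencode` can be used as a stand-in. This is only an illustration: it is an assumption that requests encodes `params` in a comparable way, and we have not checked how the S2 search endpoint treats punctuation in the query.

```python
from urllib.parse import urlencode

# The raw title fragment, punctuation included
params = {'query': "Gilette's Other Strategies", 'limit': 15}

encoded = urlencode(params)
print(encoded)   # query=Gilette%27s+Other+Strategies&limit=15
```

So the characters that `title_to_url()` strips by hand would be escaped rather than removed. Whether the search returns better results with or without them is a separate question that would need testing.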


### Iterate through papers to find the best match

Given metadata for an article, derived from JSTOR, and a list of papers from S2, this finds the best match in the list of papers.

In [11]:
def evaluate_papers(title, authors, journal, year, papers, journalpenalty):
    '''
    Title, author, year, and journal are derived from JSTOR metadata.
    
    Papers is a list of papers we have received from S2.
    
    We iterate through papers and return the best match.
    '''
    bestmatch = -1
    bestmatchstrength = 0

    for idx, paper in enumerate(papers):
        if 'title' in paper:
            foundtitle = paper['title']
        else:
            foundtitle = ''
        if 'authors' in paper:
            foundauthors = paper['authors']
        else:
            foundauthors = []
        if 'year' in paper and not paper['year'] is None:
            founddate = paper['year']
        else:
            founddate = None

        if 'journal' in paper and not paper['journal'] is None:
            if 'name' in paper['journal']:
                foundjournal = paper['journal']['name']
            else:
                foundjournal = ''
        else:
            foundjournal = ''

        totalmatch = match_strength(title, foundtitle, authors,
                                    foundauthors, year, founddate)

        if journal.lower() not in foundjournal.lower():
            totalmatch = totalmatch - journalpenalty

        if totalmatch > 0.7 and totalmatch > bestmatchstrength:
            bestmatchstrength = totalmatch
            bestmatch = idx
    
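    # If no paper cleared the 0.7 threshold (or the list was empty),
    # there is no match to return
    if bestmatch < 0:
        return None, bestmatchstrength

    return papers[bestmatch], bestmatchstrength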

### Combine DOI search and title search

This function tries for an exact match on the DOI, and then also runs a title search. It returns the results of both.

In [65]:
def search_papers(DOI, title, authors, journal, year = None, journalpenalty = 0.05):
    '''
    This function finds the closest match for an article with a given
    title, list of authors, journal of publication, and publication year.
    
    There's a default penalty for appearing in the wrong journal, but
    it allows the user to override that penalty (some journals republish
    articles from others and it's not surprising to have the wrong journal
    title in those cases).
    
    We return the best paper, as a json object, and the match strength.
    
    If there's an exact DOI match we also return that.
    
    If either attempt fails, we return None for that part.
    '''
    
    url="https://api.semanticscholar.org/graph/v1/paper/DOI:"+DOI
    url = url + "?fields=title,year,authors,citationCount,externalIds,citations,journal"
    
    response = requests.get(url, headers={'X-API-KEY': apikey})
    
    foundDOI = False
    if response.status_code == 200:
        jsonobj = response.json()
        papers = [jsonobj]
        foundDOI = True
        doimatch, doimatchstrength = evaluate_papers(title, authors, journal, year, papers, journalpenalty)

    query = title_to_url(title)
    
    response = requests.get('https://api.semanticscholar.org/graph/v1/paper/search',
                           headers={'X-API-KEY': apikey},
                           params={'query': query, 'limit': 15, 
                                   'fields': 'title,authors,year,citationCount,externalIds,journal,citations'})
    
    successfulsearch = False
    
    if response.status_code == 200:
        jsonobj = response.json()
        if 'data' in jsonobj:
            papers = jsonobj['data']
            searchmatch, searchmatchstrength = evaluate_papers(title, authors, journal, year, papers, journalpenalty)
            successfulsearch = True
                  
    if not foundDOI and not successfulsearch:
        return None, -1, None, -1, 'neither'
    elif foundDOI and not successfulsearch:
        return doimatch, doimatchstrength, None, -1, 'DOI'
    elif not foundDOI and successfulsearch:
        return None, -1, searchmatch, searchmatchstrength, 'search'
    else:
        if searchmatchstrength > doimatchstrength:
            thebetter = 'search'
        else:
            thebetter = 'DOI'
            
        return doimatch, doimatchstrength, searchmatch, searchmatchstrength, thebetter
                

### Utility function to enrich JSTOR metadata "row"

As we iterate through the JSTOR metadata file we add new fields to each row.

In [47]:
def extract_metadata_from_paper(row, paper):
    
    new_row = row.copy()
    
    new_row['paperId'] = paper['paperId']
    
    if 'citationCount' in paper:
        new_row['citationCount'] = paper['citationCount']
    else:
        new_row['citationCount'] = 0
                
    if 'title' in paper:
        new_row['foundTitle'] = paper['title']
    else:
        new_row['foundTitle'] = 'NA'
                
    if 'year' in paper:
        new_row['foundYear'] = paper['year']
    else:
        new_row['foundYear'] = 0
                
    if 'authors' in paper:
        new_row['foundAuthors'] = ' | '.join([x['name'] for x in paper['authors']])
    else:
        new_row['foundAuthors'] = 'NA'
    
    # citationCount may be absent or null; treat that as zero
    citation_count = paper.get('citationCount') or 0

    if 'citations' in paper and citation_count < 900:
        citation_list = []
        for x in paper['citations']:
            if 'paperId' in x and x['paperId'] is not None:
                citation_list.append(x['paperId'])
            
        new_row['citations'] = ' | '.join(citation_list)
    elif citation_count >= 900:
        citation_list, got_contexts = get_all_citations(paper['paperId'], paper['citationCount'])
        new_row['citations'] = ' | '.join(citation_list)
    else:
        new_row['citations'] = 'NA'
        citation_list = []
    
    foundCitations = len(citation_list)
    if 'citationCount' in paper:
        citationCount = paper['citationCount']
    else:
        citationCount = 0
    
    if (citationCount - foundCitations) > (citationCount / 3):
        print('SUSPICIOUS CITATION GAP: ', citationCount, foundCitations)
        
    return new_row

### Get citations if more than 1000

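If there are more than 1000 citations we may need to iterate through multiple "pages" of citations and aggregate them. Note that extract_metadata_from_paper() already switches to this function when citationCount is 900 or more.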

In [66]:
def get_all_citations(paperId, citationCount):
    
    url = 'https://api.semanticscholar.org/graph/v1/paper/' + paperId + '/citations'
    citations = []
    got_contexts = False
    for offset in range(0, citationCount, 1000):  
        response = requests.get(url, 
                                headers={'X-API-KEY': apikey},
                               params={'offset': offset, 'limit': 1000, 
                                       'fields': 'contexts,year'})
        if response.status_code == 200:
            jsonobj = response.json()
            data = jsonobj['data']
            for cite in data:
                if 'contexts' in cite:
                    if len(cite['contexts']) > 0:
                        got_contexts = True
                if 'citingPaper' in cite:
                    if 'paperId' in cite['citingPaper']:
                        if cite['citingPaper']['paperId'] is not None:
                            citations.append(cite['citingPaper']['paperId'])
    
    print('Paper with ', len(citations), 'citations.')
    
    return citations, got_contexts 
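
The paging in `get_all_citations()` comes down to the offsets produced by `range(0, citationCount, 1000)`. A small sketch, with `page_offsets` as an illustrative name, shows which requests are made for a given citation count:

```python
def page_offsets(citation_count, page_size=1000):
    '''Offsets of the pages needed to cover citation_count citations.'''
    return list(range(0, citation_count, page_size))


print(page_offsets(950))     # [0]
print(page_offsets(2500))    # [0, 1000, 2000]
```

A paper with 2500 citations therefore needs three requests, the last of which returns at most 500 entries.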
                

## Match A Segment

This is the main show: the function you need to call to match your metadata with S2.

In [72]:
def match_a_segment(metadata, startrow, endrow, outfile_prefix):
    
    '''
    Iterates through metadata from startrow to endrow, and saves
    the results to two outfiles: a new metadata file containing
    rows where matches were found, and a paper file containing the
    jsons for found papers.
    
    The outfiles are both named using a consistent 
    pattern so that all the outfiles can be concatenated automatically
    when you're done.
    
    So e.g. if you say
    match_a_segment(inmeta, 0, 5000, 'econ')
    
    the results will be saved to 
    econ-meta-0-5000.tsv and econ-papers-0-5000.jsonl,
    
    overwriting any existing files with those names.
    '''
    
    # create outfile name
    outmetafile = outfile_prefix + '-meta-' + str(startrow) + '-' + str(endrow) + '.tsv'
    outpaperfile = outfile_prefix + '-papers-' + str(startrow) + '-' + str(endrow) + '.jsonl'
    
    assert endrow > startrow
    
    metalength = metadata.shape[0]
    if endrow > metalength:
        endrow = metalength
    
    itermeta = metadata.iloc[startrow: endrow, : ]
    
    titledist = Counter(metadata.title)
    common_titles = titledist.most_common(120)
    forbidden_titles = set([x[0] for x in common_titles])
    
    print('TITLES TOO COMMON TO SEARCH ON:')
    for title, freq in common_titles:
        print(title, freq)
    print()
    
    all_new_rows = []
    all_papers = []
    
    for idx, row in itermeta.iterrows():
        
        if idx % 100 == 1:
            print('INDEX: ', idx)
            print()
        
        if row.docSubType == 'misc':
            continue
        
        # we don't trust *search* matches on Forbidden Titles
        # they are too common to be meaningful
        
        if row.title in forbidden_titles:
            continue
        
        creators = []
        if not pd.isna(row['creator']):   # pd.isnull is an alias of pd.isna
            creators = row['creator'].split(';')
            
        doimatch, doimatchstrength, searchmatch, searchmatchstrength, thebetter = search_papers(row['doi'], row['title'], 
                                             creators, row['isPartOf'], row['publicationYear'])
        
        # if we found an exact match to the DOI we should be more confident
        # about that match, and the search match as well (if it has the same
        # paper id)
        
        if doimatchstrength > 0.7:
            doimatchstrength += 0.05 # DOI matches are inherently stronger
            if searchmatchstrength > 0.7:
                doipaper = doimatch['paperId']
                searchpaper = searchmatch['paperId']
                if doipaper == searchpaper:
                    doimatchstrength += 0.05
                    searchmatchstrength += 0.05
                else:
                    searchmatchstrength -= 0.15
                    # we favor DOI in case of disagreement
        
        if doimatchstrength < 0.7 and searchmatchstrength < 0.7:
            continue    # initial threshold of 0.7 for matches
        elif doimatchstrength > searchmatchstrength:
            new_row = extract_metadata_from_paper(row, doimatch)
            new_row['paperSource'] = 'DOI'
            all_papers.append(doimatch)
        else:
            new_row = extract_metadata_from_paper(row, searchmatch)
            new_row['paperSource'] = 'search'
            all_papers.append(searchmatch)
        
        new_row['doiMatch'] = doimatchstrength
        new_row['searchMatch'] = searchmatchstrength
        
        all_new_rows.append(new_row)
            
    new_df = pd.DataFrame(all_new_rows)
    new_df.reset_index(drop = False, inplace = True) # keep old indices
    
    new_df.to_csv(outmetafile, sep = '\t', index = False) 
    
    with open(outpaperfile, mode = 'w', encoding = 'utf-8') as f:
        for paper in all_papers:
            outstring = json.dumps(paper)
            f.write(outstring + '\n')
    
    print('Write completed successfully')
    print(len(all_papers), ' papers written.')

### illustrative example of a run

In [71]:
match_a_segment(inmeta, 2000, 3000, 'econ')

TITLES TOO COMMON TO SEARCH ON:
Review Article 21643
Front Matter 3087
Back Matter 2727
New Books 2012
Volume Information 796
Discussion 548
Books Received 392
400: International Economics 264
900: Welfare Programs; Consumer Economics; Urban and Regional Economics 262
100: Economic Growth; Development; Planning; Fluctuations 261
000: General Economics; Theory; History; Systems 261
500: Administration; Business Finance; Marketing; Accounting 260
800: Manpower; Labor; Population 243
600: Industrial Organization; Technological Change; Industry Studies 233
300: Domestic Monetary and Fiscal Theory and Institutions 218
Periodicals 205
700: Agriculture; Natural Resources 201
New Journals 197
200: Quantitative Economic Methods and Data 191
Related Disciplines 145
Recent Publications 141
L: Industrial Organization 134
N: Economic History 134
C: Mathematical and Quantitative Methods 134
K: Law and Economics 134
G: Financial Economics 134
F: International Economics 134
A: General Economics and Te

### Concatenate the output

This function asks for a prefix code like "econ" and then finds relevant output files in the local directory. It checks to make sure they're consecutive, and if so, concatenates them.

Consecutive, here, means that the startrow of each file is the endrow of the last file.

In [63]:
def are_consecutive(listoffiles):
    
    '''
    We accept a list of files in the format
    econ-meta-0-1000.tsv or
    econ-papers-2000-3000.jsonl
    
    and return a boolean (whether consecutive)
    as well as an ordered list of filenames
    '''
    
    tuplelist = []
    for afile in listoffiles:
        parts = afile.split('-')
        startrow = int(parts[2])
        tuplelist.append((startrow, afile))
        
    tuplelist.sort()
    
    lastend = 0
    consecutive = True
    for startrow, afile in tuplelist:
        if startrow != lastend:
            consecutive = False
        parts = afile.split('-')
        lastend = int(parts[3].replace('.tsv', '').replace('.jsonl', ''))
    
    return consecutive, [x[1] for x in tuplelist]
            

def concat_output(prefix):
    metafiles = glob.glob(prefix + '-meta-*.tsv')
    paperfiles = glob.glob(prefix + '-papers-*.jsonl')
    
    consecutive, orderedmeta = are_consecutive(metafiles)
    
    if not consecutive:
        print('Metafiles are not consecutive.')
        return 'failed'
    else:
        print('Concatenating metadata:')
    
    dataframes = []
    for afile in orderedmeta:
        df = pd.read_csv(afile, sep = '\t')
        dataframes.append(df)
        print(afile)
    
    fullmeta = pd.concat(dataframes)
    print()
    
    consecutive, orderedpapers = are_consecutive(paperfiles)
    
    if not consecutive:
        print('Papers are not consecutive.')
        return 'failed'
    else:
        print('Concatenating papers:') 
        
    paperlines = []
    
    for afile in orderedpapers:
        print(afile)
        with open(afile, encoding = 'utf-8') as f:
            lines = f.readlines()
            paperlines.extend(lines)
    print()
    
    outmetaname = 'all-' + prefix + '-S2meta.tsv'
    fullmeta.to_csv(outmetaname, sep = '\t', index = False)
    
    outpapername = 'all-' + prefix + '-S2papers.tsv'
    with open(outpapername, mode = 'w', encoding = 'utf-8') as f:
        for line in paperlines:
            # lines from readlines() already end in a newline
            f.write(line)
    
    print('Done.')
    return 'success'
    

In [69]:
concat_output('econ')

Concatenating metadata:
econ-meta-0-1000.tsv
econ-meta-1000-2000.tsv
econ-meta-2000-3000.tsv

Concatenating papers:
econ-papers-0-1000.jsonl
econ-papers-1000-2000.jsonl
econ-papers-2000-3000.jsonl

Done.


'success'
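
The consecutiveness check can also be written against the row bounds alone. The sketch below is an alternative formulation, not the code used above: `parse_bounds` and `covers_without_gaps` are hypothetical names, and the regular expression assumes the filename pattern produced by `match_a_segment()`.

```python
import re

# Two integers separated by hyphens, just before the file extension
BOUNDS = re.compile(r'-(\d+)-(\d+)\.(?:tsv|jsonl)$')


def parse_bounds(filename):
    '''Return (startrow, endrow) parsed from an output filename.'''
    match = BOUNDS.search(filename)
    if match is None:
        raise ValueError(f'unexpected filename: {filename}')
    return int(match.group(1)), int(match.group(2))


def covers_without_gaps(filenames):
    '''True if the files start at row 0 and each begins where the last ended.'''
    expected_start = 0
    for start, end in sorted(parse_bounds(name) for name in filenames):
        if start != expected_start:
            return False
        expected_start = end
    return True


print(covers_without_gaps(['econ-meta-1000-2000.tsv', 'econ-meta-0-1000.tsv']))
print(covers_without_gaps(['econ-meta-0-1000.tsv', 'econ-meta-2000-3000.tsv']))
```

Because the pattern is anchored at the end of the name, it would also cope with a prefix that itself contains a hyphen (for example a made-up `pol-sci`), which splitting on `'-'` and indexing by position would not.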

### Get SPECTER2 embeddings


In [81]:
from pprint import pprint
def get_embedding(paperId):
    '''
    Requests the SPECTER2 embedding for a paper. Prints the response
    and returns the embedding (a dict with 'model' and 'vector' keys),
    or None if the request fails.
    '''
    
    url = 'https://api.semanticscholar.org/graph/v1/paper/' + paperId 
    
    response = requests.get(url, 
                            headers={'X-API-KEY': apikey},
                           params={'fields': 'embedding.specter_v2'})
    if response.status_code != 200:
        return None
    
    jsonobj = response.json()
    pprint(jsonobj)
    
    return jsonobj.get('embedding')
                

In [82]:
get_embedding('30840cd017722676dcb03d1a322aa7efe6fbe945')

{'embedding': {'model': 'specter_v2',
               'vector': [-0.07971347868442535,
                          0.7958121299743652,
                          -0.9786374568939209,
                          -0.23536469042301178,
                          -0.25757038593292236,
                          -0.11469785124063492,
                          0.09423073381185532,
                          -0.30397799611091614,
                          0.3099800646305084,
                          0.5733435750007629,
                          0.23842251300811768,
                          -0.43257734179496765,
                          -0.49130770564079285,
                          0.0817175954580307,
                          0.07332389801740646,
                          -0.670652449131012,
                          -0.756175696849823,
                          0.32430511713027954,
                          -1.2303245067596436,
                          -0.5261821150779724,