# BibTeX Overlap Finder

For a cumulative thesis, several publications are compiled into one document. This usually means merging the individual bibliographies, which can produce overlaps when the same reference is cited in more than one publication. This script detects overlapping entries based on (a) their identifier and (b) their title.

In [18]:
import re
from difflib import SequenceMatcher

### Load the References

First, load the reference file `references.bib`, which contains the concatenated contents of all individual reference files.

In [19]:
with open('../data/references.bib', 'r') as file:
    rawbib = file.read()

In [20]:
# Every '@' character is counted, including any '@' inside a field value
total_references = re.findall(r'@', rawbib)
print(f'Found a total of {len(total_references)} References')

Found a total of 4 References
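Counting every `@` character is a quick estimate that can overcount, for example when a field value contains an email address or when the file holds `@comment` or `@string` blocks. The following is a minimal sketch of a stricter count, not part of the original script; `sample_bib`, `NON_REFERENCE_TYPES`, and `count_entries` are hypothetical names, and the sample data is invented for illustration.

```python
import re

# Hypothetical sample standing in for references.bib (invented data)
sample_bib = """
@article{doe2020example,
  title = {An Example},
  author = {Doe, Jane},
  note = {Contact: jane@example.org}
}
@comment{this is not a reference}
@inproceedings{roe2021sample,
  title = {A Sample}
}
"""

# Entry types that BibTeX treats specially rather than as references
NON_REFERENCE_TYPES = {"comment", "string", "preamble"}


def count_entries(bib: str) -> int:
    """Count '@type{' headers that start a line, skipping special types."""
    entry_types = re.findall(r'^\s*@(\w+)\s*[{(]', bib, flags=re.MULTILINE)
    return sum(1 for t in entry_types if t.lower() not in NON_REFERENCE_TYPES)


at_signs = len(re.findall(r'@', sample_bib))  # the simple count
entries = count_entries(sample_bib)           # the stricter count
print(f'{at_signs} "@" characters, {entries} reference entries')
```

On a file that contains only regular entries and no `@` inside field values, both counts agree.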


### Overlap identification

Define the overlap identification function. It takes a list of items and an overlap function, and reports every item for which the overlap function returns a count greater than one. By default, the overlap function counts exact duplicates.

In [21]:
def find_overlap(items: list[str], overlap_fun=(lambda l, i: l.count(i))) -> None:
    """Identifies all items that occur more than once in a given list according to an overlap function. The number of overlapping items and the items themselves are printed.
    
    attributes:
        items -- list of items
        overlap_fun -- function that takes the list and one item and returns how many list entries overlap with that item (including the item itself)
        
    """
    # A set keeps each overlapping item only once, however often it occurs
    overlap = {item for item in items if overlap_fun(items, item) > 1}
    
    print(f'Found {len(overlap)} item{"" if len(overlap)==1 else "s"} overlapping items in the list:')
    for item in overlap:
        print(f' - {item}')
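Because the overlap function is a parameter, the same routine can be reused with a looser notion of "same item". The sketch below illustrates this with a case-insensitive counter. It is self-contained, so it uses a hypothetical variant `collect_overlap` that returns a sorted list instead of printing; `count_ignoring_case` and the sample keys are likewise assumptions made for illustration.

```python
from typing import Callable


def collect_overlap(
    items: list[str],
    overlap_fun: Callable[[list[str], str], int] = lambda l, i: l.count(i),
) -> list[str]:
    """Return (rather than print) the overlapping items, sorted for stable output."""
    return sorted({item for item in items if overlap_fun(items, item) > 1})


def count_ignoring_case(items: list[str], item: str) -> int:
    """Count the entries equal to `item` when letter case is ignored."""
    return sum(1 for other in items if other.lower() == item.lower())


# Invented identifiers, not taken from references.bib
keys = ['smith2019', 'Smith2019', 'lee2020', 'lee2020']

exact = collect_overlap(keys)                                    # exact duplicates only
relaxed = collect_overlap(keys, overlap_fun=count_ignoring_case)  # case-insensitive

print(exact)
print(relaxed)
```

With the default function only `lee2020` is reported; with the case-insensitive function both spellings of `smith2019` are reported as well.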

### (a) Overlapping identifiers

First, identify all overlapping identifiers. An identifier is the string by which a BibTeX entry is referenced (its citation key). A duplicated identifier can be caused either by the same reference appearing more than once or by two different references that share an identifier.

In [22]:
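# Capture the citation key between '{' and the first ',' of each entry header;
# excluding commas and whitespace keeps the match from running into later fields
references = re.findall(r'@\w+\s*{\s*([^,\s]+)\s*,', rawbib)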
find_overlap(references)

Found 1 item overlapping items in the list:
 - femmer2018requirements
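To tell the two causes of a duplicated identifier apart, one option is to group the titles by identifier and check whether they agree. The following is a rough sketch under simplifying assumptions: each entry header and each `title` field sits on its own line, and titles are compared after lowercasing only. The names `titles_by_key` and `classify` and the sample entries are hypothetical.

```python
import re
from collections import defaultdict

# Invented sample entries, not taken from references.bib
sample_bib = """
@article{smith2019,
  title = {Requirements Quality},
  year = {2019}
}
@inproceedings{smith2019,
  title = {A Different Paper},
  year = {2019}
}
@article{lee2020,
  title = {Same Paper},
}
@article{lee2020,
  title = {Same paper},
}
"""


def titles_by_key(bib: str) -> dict[str, list[str]]:
    """Map each citation key to the lowercased titles of all entries using it."""
    headers = list(re.finditer(r'^\s*@\w+\s*{\s*([^,\s]+)\s*,', bib, flags=re.MULTILINE))
    grouped = defaultdict(list)
    for idx, header in enumerate(headers):
        # An entry body runs from its header to the next header (or the end)
        end = headers[idx + 1].start() if idx + 1 < len(headers) else len(bib)
        body = bib[header.end():end]
        match = re.search(r'^\s*title\s*=\s*{(.*)}', body, flags=re.MULTILINE)
        grouped[header.group(1)].append(match.group(1).lower() if match else '')
    return dict(grouped)


def classify(titles: list[str]) -> str:
    """Label a key by whether its entries agree on the title."""
    if len(titles) < 2:
        return 'unique'
    return 'duplicate reference' if len(set(titles)) == 1 else 'identifier collision'


result = {key: classify(titles) for key, titles in titles_by_key(sample_bib).items()}
for key, label in result.items():
    print(f' - {key}: {label}')
```

A duplicate reference can usually be removed safely, whereas an identifier collision requires renaming one of the keys and updating the citations that use it.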


### (b) Overlapping titles

Second, identify all overlapping titles. Here, use a different overlap function: instead of counting how often a title appears verbatim in the list, count how many titles are highly similar to it, measured by the `SequenceMatcher` ratio with a default threshold of 0.9. This catches titles that differ only slightly (e.g., in punctuation).

In [23]:
def similars(items: list[str], item: str, threshold: float = 0.9) -> int:
    """Count how many items in a list are similar to a given item, based on a similarity threshold

    attributes:
        items -- list of items
        item -- item against which the similarity is calculated
        threshold -- float between 0 and 1 (where 1 is a perfect match); the SequenceMatcher ratio between the given item and a candidate must exceed the threshold for the candidate to count as similar

    returns: number of items similar to the given item (at least 1 if the item is in the list, since it is a perfect match with itself) """
    
    # Compare the item against every list entry, including itself
    candidates = [candidate for candidate in items if SequenceMatcher(None, item, candidate).ratio() > threshold]
    return len(candidates)

titles = re.findall(r',\s*\n\s*title\s*\=\s*{(.*)}', rawbib)
titles = [title.lower() for title in titles]
find_overlap(titles, overlap_fun=similars)

Found 2 items overlapping items in the list:
 - its the activities, stupid! a new perspective on re quality
 - it's the activities, stupid! a new perspective on re quality
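To see why these two titles pass the 0.9 threshold, it helps to look at the ratio directly. `SequenceMatcher.ratio()` returns `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The sketch below compares the two reported titles with each other and with a third, unrelated title that is invented purely for contrast.

```python
from difflib import SequenceMatcher

a = "its the activities, stupid! a new perspective on re quality"
b = "it's the activities, stupid! a new perspective on re quality"
c = "requirements quality is quality in use"  # hypothetical, for contrast only

ratio_ab = SequenceMatcher(None, a, b).ratio()  # differs by one apostrophe
ratio_ac = SequenceMatcher(None, a, c).ratio()  # different title

print(f'a vs. b: {ratio_ab:.3f}')
print(f'a vs. c: {ratio_ac:.3f}')
```

The first ratio is close to 1 because only a single character differs, while the second stays well below the threshold. Lowering the threshold catches more variants but also raises the risk of flagging titles that are merely alike.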
