# DOI Finder Based on Article Titles

This notebook retrieves DOIs for a list of article titles using the Crossref API. It reads titles from a CSV, queries Crossref, and saves the results.

## Import Libraries

Import all necessary Python libraries for API requests, data handling, and string similarity.

In [None]:
# Import URL handling utilities for encoding and making requests
from urllib.parse import quote_plus, urlencode
from urllib.request import urlopen, Request
# Import json for parsing API responses
import json
# Import Levenshtein ratio for string similarity comparison
from Levenshtein import ratio
# Import HTTPError for handling API request errors
from urllib.error import HTTPError
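# Import random and time for adding randomized delays between requests
import random
import time
# Import concurrent.futures for running the lookups in a thread pool
import concurrent.futures
# Import numpy (for NaN) and pandas for data manipulation
import numpy as np
import pandas as pd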

## Read Article Titles

Load the list of article titles from a CSV file. Update the file path and column name as needed.

In [None]:
# Read titles from a CSV file into a list
# Replace 'path_to_your_file.csv' and 'titles' with your actual file path and column name
# Example: titles = pd.read_csv('my_file.csv')['Title'].tolist()
titles = pd.read_csv('path_to_your_file.csv')['titles'].tolist()

## Define Query and DOI Finder Functions

Create two functions:

- One to query the Crossref API for a given title and return the most similar result.
- Another to find the DOI for a given title using the Crossref query function.
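
As a rough illustration of the request the first function sends, the sketch below builds the query URL without contacting Crossref. The sample title is invented; the endpoint and the `rows` and `query.bibliographic` parameters are the ones used in this notebook.

```python
from urllib.parse import parse_qs, quote_plus, urlencode, urlsplit

# Hypothetical sample title, used only for illustration
sample_title = "Deep learning & graphs: a survey"

params = {"rows": "5", "query.bibliographic": sample_title}
url = "https://api.crossref.org/works?" + urlencode(params, quote_via=quote_plus)
print(url)

# Decoding the query string recovers the original title unchanged
decoded = parse_qs(urlsplit(url).query)
print(decoded["query.bibliographic"][0])
```

Special characters such as `&` and `:` are percent-encoded, so a title containing them cannot be mistaken for extra query parameters.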

In [None]:
# Define a default result for cases where no match is found
EMPTY_RESULT = {
    "crossref_title": "",
    "similarity": 0,
    "doi": ""
}

def crossref_query_title(title):
    """
    Query the Crossref API for a given title and return the most similar result.
    Returns a dictionary with the Crossref title, similarity score, and DOI.
    """
    api_url = "https://api.crossref.org/works?"
    params = {"rows": "5", "query.bibliographic": title}  # Search for up to 5 results
    url = api_url + urlencode(params, quote_via=quote_plus)
    request = Request(url)
    try:
        ret = urlopen(request)
        content = ret.read()
        data = json.loads(content)
        items = data["message"]["items"]
        most_similar = EMPTY_RESULT
        for item in items:
            if not item.get("title"):
                continue  # Skip items with a missing or empty title
            found_title = item["title"][0]  # Crossref returns the title as a list
            result = {
                "crossref_title": found_title,
                # Calculate similarity between the found title and the query
                "similarity": ratio(found_title.lower(), title.lower()),
                "doi": item["DOI"]
            }
            # Keep the result with the highest similarity
            if most_similar["similarity"] < result["similarity"]:
                most_similar = result
        return {"success": True, "result": most_similar}
    except (HTTPError, OSError, ValueError, KeyError):
        # Return an empty result if the request or response parsing fails
        return {"success": False, "result": EMPTY_RESULT}
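
To get a feel for how the similarity score behaves, here is a small offline sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `Levenshtein.ratio`, so the exact scores may differ from the ones this notebook computes. The `similarity` helper and the titles are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for Levenshtein.ratio: 1.0 means identical after lowercasing
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

query = "A Survey of Graph Neural Networks"
candidates = [
    "A survey of graph neural networks",
    "A Survey of Graph Neural Networks.",
    "Convolutional networks for images",
]

# Score every candidate against the query and keep the best one
scores = {c: similarity(query, c) for c in candidates}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Lowercasing makes the first candidate an exact match, while a single trailing period is enough to push the second one below 1.0. That matters because `doi_finder` accepts only a similarity of exactly 1.0.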

In [None]:
def doi_finder(title):
    """
    Return the DOI for a title if Crossref has an exact title match, else NaN.
    """
    # Get the best match from the Crossref query
    result = crossref_query_title(title)['result']

    # Wait a random 3-6 seconds to reduce the risk of being rate limited
    time.sleep(random.randint(3, 6))

    # Accept the DOI only for an exact match (similarity == 1.0)
    if result['similarity'] == 1.0:
        doi = result['doi']
        print(f"DOI for '{title}': {doi}")
    else:
        # No exact match found, so record NaN
        doi = np.nan
        print(f"No exact DOI match found for '{title}'.")
    return doi

## Collect and Find DOIs for All Titles

Initialize an empty list to store the DOIs, then use a thread pool to look up all titles concurrently and measure the total runtime.

In [None]:
# Initialize an empty list to store DOIs
dois = []

In [None]:
# Main execution block: find DOIs for all titles using a thread pool
if __name__ == '__main__':
    start = time.perf_counter()  # Start timer
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map doi_finder over all titles; results come back in input order
        results = executor.map(doi_finder, titles)
        for i, result in enumerate(results):
            print(i, result)
            dois.append(result)
    finish = time.perf_counter()  # End timer
    run_time = round(finish - start, 2)
    print(f"Finished in {run_time} seconds")
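
Pairing each title with its DOI afterwards relies on `executor.map` yielding results in the order of its inputs, even when tasks finish out of order. A minimal sketch of that behavior, using a hypothetical `fake_lookup` function in place of `doi_finder` so that no network access is needed:

```python
import concurrent.futures
import time

def fake_lookup(title):
    # Hypothetical stand-in for doi_finder: shorter titles finish first
    time.sleep(0.01 * len(title))
    return f"doi-for-{title}"

sample_titles = ["a long title", "mid one", "x"]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fake_lookup, sample_titles))

# Results follow the input order, not the completion order
print(results)
```

Because the order is preserved, appending results to a list as they are iterated keeps that list aligned with `titles`.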

## Save Results

Combine the titles and DOIs into a DataFrame and save the results to a CSV file.

In [None]:
# Put the two lists, titles and dois, into one DataFrame
df = pd.DataFrame({"Title": titles, "Doi": dois})
df

In [None]:
# Save the DataFrame to a CSV file (index=False leaves out the row index column)
df.to_csv('your_file_path.csv', index=False)