---
title: Lesson 5. Author Affiliations
format:
  html:
    toc: true
    toc-expand: 2
    toc-title: CONTENTS
---

The Entrez __[E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25500/)__ offer a suite of tools that enable researchers to automate the search and retrieval of scientific information from __[PubMed](https://pubmed.ncbi.nlm.nih.gov/)__ and other databases maintained by the National Center for Biotechnology Information (NCBI). In [Lesson 4](./icite.ipynb) we identified an active NIH funded research project at The Ohio State University and generated a list of PMIDs (PubMed Identifiers) associated with each project.  In Lesson 5, we will use this list of PMIDs with the Entrez __[E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25500/)__ to gather the affiliations of each author listed on the corresponding articles. We will also begin to explore regular expressions, a tool used across programming languages for matching and manipulating string data. 

## Data skills | concepts
- Working with APIs
- Manipulating text
- Regular expressions

## Learning objectives
1. Locate API documentation and identify key components required to formulate an API request
2. Parse an API response and store extracted data.
3. Utilize regular expressions to search, match, and manipulate text.

This tutorial is designed to support multi-session __[workshops](https://library.osu.edu/events?combine=&tid=All&field_location_code_value=10&sort_bef_combine=field_end_date_value_ASC)__ hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts visit the [Python - Mastering the Basics](python_basics.ipynb) tutorial.

<div class="card border-primary mb-3 p-1" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip</div>
  <div class="card-body"><img src="images/microsoft_copilot_icon.svg" alt="" aria-hidden="true">The Enztrez <a href="https://www.ncbi.nlm.nih.gov/books/NBK25500/">E-utilities</a> manual can be overwhelming to read and comprehend at first. The example code is written in Perl script, not Python and the documentation assumes you are familiar working with APIs and programming tools.  Before starting this tutorial, <strong>ask Copilot</strong> to <span class="text-primary">explain entrez e-utilities</span>. Copilot returns a useful summary of the key components and functionalities of these tools, explains how they work, provides an example workflow, and identifies potential use cases. As you become more comfortable working with APIs, you can revisit the Enztrez <a href="https://www.ncbi.nlm.nih.gov/books/NBK25500/">E-utilities</a> manual to learn how to do more complex tasks.
  </div>
</div>

# LESSON 5
Science is constantly evolving, with new disciplines emerging from interdisciplinary research, technological innovation, and global collaboration. Analyzing **research networks** can help researchers identify potential collaborators, track emerging fields, and discover new research directions. 

One effective way to explore these networks is by examining **author affiliations** listed in journal publications.

## What is EFetch?
**EFetch** is a utility provided by NCBI's Entrez system that retrieves detailed records for a list of unique identifiers (like PMIDs) from databases such as PubMed.

## Where are author affiliations?
In PubMed records, author affiliations are embedded in the XML under:

In [None]:
<Author>
¬† <AffiliationInfo>
¬†¬†¬† <Affiliation>...</Affiliation>
¬† </AffiliationInfo>
</Author>

<div class="accordion" id="accordionExercise1">

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseOne" aria-expanded="true" aria-controls="ex1-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 1: Inspect a PubMed record</button></h2><div id="ex1-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex1-headingOne" data-bs-parent="#accordionExercise1"> <div class="accordion-body fs-4"><p>Use the <strong>first PMID</strong> from your list from Lesson 4.</p>
  <ul><li>Search for it on <a href="https://pubmed.ncbi.nlm.nih.gov/">PubMed</a>.</li><li>Observe:<ul><li>The <strong>structure of the URL</strong></li><li>The <strong>location of author affiliations</strong> in the record</li> <li><strong>Inspect</strong> the author elements.</li></ul></li></ul></div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex1-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex1-collapseTwo" aria-expanded="false" aria-controls="ex1-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex1-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex1-headingTwo" data-bs-parent="#accordionExercise1"> <div class="accordion-body"><h2>URL = https://pubmed.ncbi.nlm.nih.gov/<strong>39773557</strong>/</h2><p><strong>Note:</strong> The PMID is at the end of the URL</p> <img src="images/pubmed_1.png" alt="Screenshot of PubMed record showing affiliation list expanded." class="img-fluid" style="max-width: 100%; border-radius: 8px; padding: 1rem;">
  </div>
  </div>
  </div>

</div>

## Step 1. Construct an EFetch request
To retrieve XML data for a PubMed article, use the following components:
- **Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi` 
- **Parameters**:
    - **Database name**: `?db=pubmed`
    - **Unique identifier**: `&id=39773557`
    - **API key**: `&api_key=INSERT YOUR API KEY HERE`
      
Required parameters for an **EFetch** request depend on the specific Entrez [database]((https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly)__) you are querying. For **PubMed**, the default EFetch response format is XML. 

To manage request volume, the **NCBI** enforces rate limits:
- Without an API key: **3 requests per second**
- With an API key: **up to 10 requests per second**

While you can view a single XML record without an API key, completing the exercises in this tutorial requires one. You can obtain an API key by visiting the **Settings page of your [NCBI account](https://www.ncbi.nlm.nih.gov/account/)**.

<div class="accordion" id="accordionExercise2">

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseOne" aria-expanded="true" aria-controls="ex2-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 2: Construct a request</button></h2><div id="ex2-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex2-headingOne" data-bs-parent="#accordionExercise2"> <div class="accordion-body fs-4">Use the <strong>EFetch</strong> utility to retrieve the <strong>XML record</strong> for the <strong>first PMID</strong> from the list you generated in Lesson 4. This XML will contain detailed metadata about the article, including author affiliations (if available). 
  </div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex2-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex2-collapseTwo" aria-expanded="false" aria-controls="ex2-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex2-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex2-headingTwo" data-bs-parent="#accordionExercise2"> <div class="accordion-body"><h2>Example:</h2><p> <a href="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=39773557">https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=39773557</a></p><img src="images/pubmed_2.png" alt="Screenshot of PubMed XML record number 39773557" class="img-fluid" style="max-width: 100%; border-radius: 8px; padding: 1rem;">

  </div>
  </div>
  </div>

</div>

## Step 2. Identify Python libraries for project
The following Python libraries are needed for this project:
- `requests` ‚Äì to make HTTP requests
- `pandas` ‚Äì to manage and store data
- `BeautifulSoup`‚Äì to parse XML and extract affiliations

<div class="accordion" id="accordionExercise3">

  <div class="accordion-item"><h2 class="accordion-header" id="ex3-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex3-collapseOne" aria-expanded="true" aria-controls="ex3-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 3: Write and test code</button></h2><div id="ex3-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex3-headingOne" data-bs-parent="#accordionExercise3"> <div class="accordion-body fs-4"><p>Using the list of PMIDs you generated in Lesson 4, write and test a Python script that:<ol><li>Iterates through your list of PMIDs</li><li>Sends an EFetch request for each
</li><li>Parses the XML to extract author affiliations</li><li>Handles missing or incomplete data gracefully</li></ol></p>
</div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#collapseTwo" aria-expanded="false" aria-controls="collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="collapseTwo" class="accordion-collapse collapse" aria-labelledby="headingTwo" data-bs-parent="#accordionExercise3"> <div class="accordion-body">

```python
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from datetime import date

#1. Create a last_updated variable with today's date.
today = date.today()
last_updated=today

#2. Create list of PMIDs
pmids=['39773557', '39656677', '37398045', '39229161', '39713331', '39315813', '38338688', '36721057', '37322069']

#3.Create a dataframe to store the search results. 
author_affiliations=pd.DataFrame(columns=["pmid","name","affiliation","last_updated"])

#4. Use requests, BeautifulSoup, and the EFetch utility to retrieve author affiliations.
# Store results in a DataFrame.
count=0
for each_record in pmids:
    # try:
    count += 1
    print('starting record '+str(count)+': '+str(each_record))
    search_url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="+str(each_record)+"&api_key=INSERT YOUR API KEY HERE"
    xml_data=requests.get(search_url).text
    soup = bs(xml_data, 'xml')
    records=soup.find('PubmedArticle')
    pmid=records.PMID.text
    authors=records.find_all("Author")
    for each_author in authors:
        if each_author.LastName != None:
            lastname=each_author.LastName.text
        else:
            lastname=''
        if each_author.ForeName != None:
            forename=each_author.ForeName.text
        else:
            forename=''
        if lastname != '' and forename != '':
            name=lastname+', '+forename
        else:
            name=''
        

        if each_author.Affiliation != None:
            affiliation=each_author.Affiliation.text
        else:
            affiliation=''
        print(f"{name}, {affiliation}")
                
        row={
            "pmid": pmid,
            "name": name,
            "affiliation": affiliation,
            "last_updated": last_updated

            }
        author_info=pd.DataFrame(row, index=[0])
        author_affiliations = pd.concat([author_info,author_affiliations], axis=0)
        time.sleep(0.15)
        
#5. Export results to csv        
author_affiliations.to_csv('data/pubmed_author_affiliations.csv')
```
  </div>
  </div>
  </div>

</div>


## Regular Expressions (regex)
Analyzing author and affiliation data can be messy due to:
- Inconsistent naming conventions
- Variations in institutional affiliation formats
- Ambiguities in author identify.

<div class="card border-primary mb-3 p-1" style="max-width: 100%;">
  <div class="card-header" style="font-size: 1.8rem;"><img src="images/idea_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Tip: ORCID</div>
  <div class="card-body"><img src="images/orcid_24x24.png" alt="orcid logo"><p><a href="https://orcid.org/register">Create your ORCID iD</a>to help researchers distinguish your work from others and better track the impact of your work. </p><p>PubMed now embeds the ORCID in <span class="text-primary">Author</span> tags. Several journals and funding agencies also now require ORCID iDs for submissions. See <a href="https://guides.osu.edu/c.php?g=608754&p=4233669">Tracking and Enhacing the Impact of your Research</a> for more information.</p>
  </div>
</div>

**Regular expressions (regex)** match patterns in text.  Often described as wildcards on steroids, regular expressions help:
- **Validate** patterns (e.g., ZIP codes: (^\d{5}$))
- **Extract** variations (e.g., ‚Äúha?ematology‚Äù matches both ‚Äúhematology‚Äù and ‚Äúhaematology‚Äù)
- **Replace** text (e.g., re.sub(r'\bOH\b', 'Ohio', text))

Regular expressions are included in several programming langauges and software programs including Python, JavaScript, and Tableau. 

### Common Regex Patterns

|     Pattern      | Matches  |  
|:--------------|:--------------- |
|`[A-Z]`    |Any uppercase letter   | 
|`[a-z]`    |Any lowercase letter   | 
|`[0-9]{5}`    |Exactly 5 digits  | 
|`^Ohio`    |Starts with "Ohio"  | 
|`State$`    |Ends with "State"  |
|`ha?ematology`    |"hematology" or "haematology"  |
|`Ohio State\|OSU`    |	"Ohio State" or "OSU"  |

__[Metacharacters](https://www.w3schools.com/python/gloss_python_regex_metacharacters.asp)__ are special symbols in regular expressions that represent patterns rather than literal characters. To match them as literal characters, you must **escape them with a backslash ( \ )**. 

### Common metacharacters and their functions
|     Symbol     | Meaning  |  Example  |
|:-------|:--------------- |:--------------- |
|`[ ]`    |A set or range of characters   | `[a-f]` matches any lowercase letter from a to f  |
| `\`    |Starts a special sequence   | `\w` matches any word character (letter, digit, or underscore) |
| `.`   |	Any character except newline   | `d.g` matches "dog", "dig", "dug", etc.  |
| `^`   |Start of a string   | 	`^Ohio` matches any string that starts with "Ohio" |
| `$`   |End of a string  | `State$` matches any string that ends with "State"  |
| `.*`   |Zero or more of the preceding character   |`h*matology` matches "hematology", "haematology", etc. |
| `+`   |One or more of the preceding character   | `spe+d` matches "sped", "speed", etc.  |
| `?`   |Zero or one of the preceding character   | `travel?ling` matches "traveling", "travelling". etc. |
| `{ }`   |Exactly a specified number of repetitions   | `[0-9]{5}` matches any 5-digit number  |
| `( )`  |	Grouping or capturing  | `The (Ohio) State University` extracts "Ohio"  |
| `\|`   |	Logical OR   | `Ohio State\|OSU` matches "Ohio State" or "OSU"  |


### LEARNING RESOURCES
::: {.grid .g-4}
::: {.g-col-12 .g-col-md-6 .g-col-lg-4}
<div class="card bg-light mb-3" style="max-width: 20rem;">
  <div class="card-header">REGEX 101</div>
  <div class="card-body"><img src="images/regex101.png" alt="Screenshot of regex101 homepage" class="d-block mx-auto"><h4 class="card-title"><a href="https://regex101.com/">regular expressions 101: build, test, and debug regex</h4>
  </div>
  <div> <a href="https://regex101.com/">regular expressions 101: build, test, and debug regex</a> is an interactive tool that helps you build, test, and debug regular expressions across multiple programming languages. It lets you test your regex against sample text, provides real-time explanations as you type, and includes a searchable reference for regex syntax.</div>
</div>

:::

::: {.g-col-12 .g-col-md-6 .g-col-lg-4}
<div class="card bg-light mb-3" style="max-width: 20rem;">
  <div class="card-header">EFFECTIVE DATA VISUALIZATION</div>
  <div class="card-body"><img src="images/cover_learning_regular_expressions.png" alt="learning regular expressions book cover" class="d-block mx-auto" style="max-width: 100%; height: auto;"><h4 class="card-title"><a href="https://search.library.osu.edu/permalink/01OHIOLINK_OSU/rr4vai/alma991085516536408507">Learning Regular Expressions</a></h4>
  </div>
  <div><a href="https://search.library.osu.edu/permalink/01OHIOLINK_OSU/rr4vai/alma991085516536408507">Learning Regular Expressions</a> by <strong>Ben Forta</strong>** is available through the Libraries' <a href="https://search.library.osu.edu/permalink/01OHIOLINK_OSU/2g3e0k/alma991060545609708507">O'Reilly Online Learning</a> collection of technical books and videos. Each chapter is structured as a lesson, guiding you through how to match individual characters or sets of characters, use metacharacters, and more‚Äîmaking it a practical resource for mastering regex step by step.
  </div>

</div>

:::
:::

## re module
To work with regular expressions in Python, start by importing the built-in `re` module:

In [None]:
import re

üîó  See __[re](https://docs.python.org/3/library/re.html)__ module documentation.

### Commonly used re methods

### re.match( )

In [None]:
re.match(pattern, string, flags=0)

- Checks for a match **only at the beginning** of the string.
- Returns a match object if found, otherwise None.

**Example**:

In [1]:
import pandas as pd
import re

addresses=pd.read_csv('data/pubmed_author_affiliations.csv')
addresses=addresses.dropna(subset='affiliation') #drops rows with null affiliation values

for idx, row in addresses.iloc[0:10].iterrows():
    affiliation=str(row.affiliation)
    print(affiliation)
    osu_match=re.match(r"Ohio State University",affiliation) 
    if osu_match:
        print(f" MATCH {osu_match.group()}: {affiliation}")

The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA. fu.978@osu.edu.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Center for Cancer Metabolism, The Ohio State University Comprehensive Cancer Center, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, 43210, USA.
The Ohio State Biochemistry Program, The Ohio State University, Columbus, OH, 43210, USA.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA. fu.978@osu.edu.
Department of Biological Chemistry and Pharmacology, The Ohio State University, Columbus, OH, USA.
Department of Physics, Northeastern University, Boston, MA 02115, USA.
Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 4321

üîó  See [re.match()](https://docs.python.org/3/library/re.html#re.match) documentation.

### re.search( )

In [None]:
re.search(pattern, string, flags=0)

- Scans through the string and returns the **first match** of the pattern.
- Returns a match object or None.

**Example**:

In [3]:
import pandas as pd
import re

addresses=pd.read_csv('data/pubmed_author_affiliations.csv')
addresses=addresses.dropna(subset='affiliation')

for idx, row in addresses.iloc[0:10].iterrows():
    affiliation=str(row.affiliation)
    # print(affiliation)
    osu_search=re.search(r"The Ohio State University",affiliation) 
    if osu_search:
        print(osu_search.group())
    else:
        print(f"No match: affiliation = {affiliation}")

The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
The Ohio State University
No match: affiliation = Department of Physics, Northeastern University, Boston, MA 02115, USA.
No match: affiliation = Department of Chemistry and Biochemistry, Center for RNA Biology, Ohio State University, Columbus, OH 43210, USA.


üîó  See [re.search()](https://docs.python.org/3/library/re.html#re.search) documentation.

### re.findall( )

In [None]:
re.findall(pattern, string, flags=0)

- Returns **all non-overlapping matches** of the pattern in the string as a list.

**Example**:

In [5]:
# HOW MANY TORTOISES AND TURTLES ARE IN THIS LIST OF ANIMALS?

import pandas as pd
import re

df=pd.read_csv('data/animals_tortoises.csv')
animals=df.common_name.tolist()
animals=','.join(animals)
print('LIST OF ANIMALS')
print(animals)

tortoises_turtles=re.findall(r"tortoise|turtle", animals)
print('ANSWER')
print(f"There are {len(tortoises_turtles)} tortoises and turtles in this list of animals.")


LIST OF ANIMALS
abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-tortoise,african-plated-lizard,aldabr-tortoise,allens-swamp-monkey,alligator-lizard,alligator-snapping-turtle,alpaca
ANSWER
There are 3 tortoises and turtles in this list of animals.


üîó  See [re.findall()](https://docs.python.org/3/library/re.html#re.findall) documentation.

### re.sub( )

In [None]:
re.sub(pattern, repl, string, count=0, flags=0)

- Replaces all occurrences of the pattern in the string with the replacement text (`repl`).
- You can limit the number of replacements using the count parameter.

**Examples**:

In [6]:
# FIND TORTOISES AT THE NATIONAL ZOO AND REPLACE THE COMMON_NAME WITH "SLOW TORTOISE"
import pandas as pd
import re

df=pd.read_csv('data/animals_tortoises.csv')
animals=df.common_name.tolist()
animals=','.join(animals)
pattern="tortoise|turtle"
tortoises_slow=re.sub(pattern,"SLOW TORTOISE,",animals)
tortoises_slow

'abyssinian-ground-hornbill,addax,african-clawed-frog,african-pancake-SLOW TORTOISE,,african-plated-lizard,aldabr-SLOW TORTOISE,,allens-swamp-monkey,alligator-lizard,alligator-snapping-SLOW TORTOISE,,alpaca'

üîó  See [re.sub()](https://docs.python.org/3/library/re.html#re.sub) documentation.

<div class="accordion" id="accordionExercise4">

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingOne"><button class="accordion-button fs-3" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseOne" aria-expanded="true" aria-controls="ex4-collapseOne"><img src="images/guidepost_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Exercise 4: Use re to normalize affiliations</button></h2><div id="ex4-collapseOne" class="accordion-collapse collapse show fs-4" aria-labelledby="ex4-headingOne" data-bs-parent="#accordionExercise4"> <div class="accordion-body fs-4">Use regular expressions to create a list of institution names from the affiliations list you generated in Exercise 3.</div></div>
  </div>

  <div class="accordion-item"><h2 class="accordion-header" id="ex4-headingTwo"><button class="accordion-button fs-3 collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#ex4-collapseTwo" aria-expanded="false" aria-controls="ex4-collapseTwo"><img src="images/magnifying_glass_standard_icon.png" alt="" aria-hidden="true" style="height: 3rem; vertical-align: middle; margin-right: 0.5rem;">Solution:</button></h2><div id="ex4-collapseTwo" class="accordion-collapse collapse" aria-labelledby="ex4-headingTwo" data-bs-parent="#accordionExercise4"> <div class="accordion-body">

```python
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from datetime import date
import re

#1. Create a last_updated variable with today's date.
today = date.today()
last_updated=today

#2. Create list of PMIDs
pmids=['39773557', '39656677', '37398045', '39229161', '39713331', '39315813', '38338688', '36721057', '37322069']

#3.Create a dataframe to store the search results. 
author_affiliations=pd.DataFrame(columns=["pmid","name","affiliation","institution","last_updated"])

#4. Use requests, BeautifulSoup, and the EFetch utility to retrieve author affiliations.
# Store results in a DataFrame.
count=0
for each_record in pmids:
    # try:
    count += 1
    print('starting record '+str(count)+': '+str(each_record))
    search_url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="+str(each_record)+"&api_key=INSERT API KEY HERE"
    xml_data=requests.get(search_url).text
    soup = bs(xml_data, 'xml')
    records=soup.find('PubmedArticle')
    pmid=records.PMID.text
    authors=records.find_all("Author")
    for each_author in authors:
        if each_author.LastName != None:
            lastname=each_author.LastName.text
        else:
            lastname=''
        if each_author.ForeName != None:
            forename=each_author.ForeName.text
        else:
            forename=''
        if lastname != '' and forename != '':
            name=lastname+', '+forename
        else:
            name=''
        

        if each_author.Affiliation != None:
            affiliation=each_author.Affiliation.text

            ohio_state=re.search(r"Ohio State", affiliation)
            harvard_medical_school=re.search(r"Harvard Medical School", affiliation)
            institut_genetique=re.search(r"Institut de G√©n√©tique et de Biologie Mol√©culaire et Cellulaire", affiliation)
            johns_hopkins=re.search(r"Johns Hopkins University", affiliation)
            mcgill=re.search(r"McGill University", affiliation)
            nci=re.search(r"National Cancer Institute", affiliation)
            nidcd=re.search(r"National Institute on Deafness and Other Communication Disorders", affiliation)
            northeastern=re.search(r"Northeastern University", affiliation)
            u_bristol=re.search(r"University of Bristol", affiliation)
            u_maryland=re.search(r"University of Maryland", affiliation)
            u_virginia=re.search(r"University of Virginia", affiliation)
            vicosa=re.search(r"Universidade Federal de Vi√ßosa", affiliation)
            if ohio_state:
                institution="The Ohio State University"
            elif harvard_medical_school:
                institution=harvard_medical_school.group()
            elif institut_genetique:
                institution=institut_genetique.group()
            elif johns_hopkins:
                institution=johns_hopkins.group()
            elif mcgill:
                institution=mcgill.group()
            elif nci:
                institution=nci.group()
            elif nidcd:
                institution=nidcd.group()
            elif northeastern:
                institution=northeastern.group()
            elif u_bristol:
                institution=u_bristol.group()
            elif u_maryland:
                institution=u_maryland.group()
            elif u_virginia:
                institution=u_virginia.group()
            elif vicosa:
                institution=vicosa.group()

        else:
            affiliation=''
        print(f"{name}, {affiliation}")

        row={
            "pmid": pmid,
            "name": name,
            "affiliation": affiliation,
            "institution": institution,
            "last_updated": last_updated

            }
        author_info=pd.DataFrame(row, index=[0])
        author_affiliations = pd.concat([author_info,author_affiliations], axis=0)
        time.sleep(0.15)
        
#5. Export results to csv        
author_affiliations.to_csv('pubmed_author_affiliations.csv')
```
  </div>
  </div>
  </div>

</div>
