<a href="https://colab.research.google.com/github/atjoelpark/eutilities/blob/main/analysis/bmj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extraction of BMJ Articles (ORIGINAL MADE BY JOE)

This tutorial walks step by step through downloading article metadata from PubMed, with the aim of extracting the authors' first names (for gender analysis) and their countries of affiliation.

**Prerequisites:**

1. A basic foundation in Python 3, including functions
2. Regular expressions
3. Pandas DataFrames for analyzing the data

**Resources:**

- [Introduction to Python at Datacamp](https://learn.datacamp.com/courses/intro-to-python-for-data-science). This website provides free (and paid) resources for learning the fundamentals of Python. A great introductory course in Datacamp is `Introduction to Python`. This will go over the essentials required to program in Python.

- [Regular Expressions at Datacamp](https://learn.datacamp.com/courses/regular-expressions-in-python). This class will provide the foundations for regular expressions. As a data scientist, you will encounter many situations where you will need to extract key information from huge corpora of text, clean messy data containing strings, or detect and match patterns to find useful words. All of these situations are part of text mining.

An alternative to Datacamp is [Dataquest](https://www.dataquest.io), which uses a more text-based format of learning, whereas Datacamp is geared more towards videos. The resources above are paid. If you are looking for a truly free option, Coursera provides great resources as well: [Python for Everyone on Coursera](https://www.coursera.org/specializations/python).

Before moving forward, it is important to have a foundation in Python. The experience will be more meaningful.

In [None]:
# Importing libraries
# These libraries provide functions that will be used throughout this notebook
import numpy as np 
import pandas as pd # Another resource to learn Pandas: https://www.kaggle.com/learn/pandas
import re 

# PubMed API

Extracting data from the appropriate resources is important prior to performing text mining and analysis. Because PubMed is the resource of interest, this section will demonstrate how data is extracted from PubMed.

**References:**
1. [Entrez Programming Utilities Help PDF](https://www.ncbi.nlm.nih.gov/books/NBK25501/pdf/Bookshelf_NBK25501.pdf)
2. [NCBI API Resources](https://www.ncbi.nlm.nih.gov/home/develop/api/)
3. [NCBI Developer Resources](https://www.ncbi.nlm.nih.gov/pmc/tools/developers/)
4. [Bio Entrez Package](https://biopython.org/docs/1.75/api/Bio.Entrez.html)
5. [Entrez Direct: E-utilities on the Unix Command Line](https://www.ncbi.nlm.nih.gov/books/NBK179288/)
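
The Entrez Direct command-line tools used in this notebook wrap the E-utilities web service. As a rough sketch of what an `efetch` request looks like over HTTP, the snippet below only builds the request URL and sends nothing. The helper name `build_efetch_url` is hypothetical and not part of the notebook. The base URL and the `db`, `id`, and `retmode` parameters are those described in the E-utilities help referenced above.

```python
from urllib.parse import urlencode

# Base address of the E-utilities web service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(pmids, db="pubmed", retmode="xml"):
    """Build an efetch URL for a list of PMIDs (hypothetical helper, sends no request)."""
    params = {
        "db": db,
        # efetch accepts several IDs as one comma-separated value
        "id": ",".join(str(p) for p in pmids),
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = build_efetch_url([27219127, 30349085])
print(url)
```

The notebook itself keeps using the `efetch` and `xtract` command-line tools, which also take care of flattening the XML into delimited text.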

In [None]:
# Installing E-utilities Entrez Direct
def e_utilities_install():
  """
  Installs e_utilities
  Reference: https://www.ncbi.nlm.nih.gov/books/NBK179288/
  """
  !sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
  !echo 'export PATH=\$PATH:\$HOME/edirect' >> $HOME/.bash_profile

In [None]:
# Defining Functions
def pull_pmid_metadata(pmid: list) -> pd.DataFrame:
  """
  This is dependent on E-utilities. The input parameter is a list of PMIDs.
  The output is a Pandas DataFrame with the following columns:

  1. PMID
  2. PubMed Article Title
  3. DateCompleted (Year_Month_Day)
  4. DateRevised (Year_Month_Day)
  5. Journal Title
  6. Publication Date (Year_Month_Day)
  7. Abstract
  8. Author FirstName_LastName_Affiliation (Note that the three values are
  separated by "_". Each author entry is separated from the next by " /". If an
  author has affiliations with multiple institutions, the institutions are also
  joined by "_", as in the MIMIC-III example in this notebook.)

  @param pmid: Takes a list of PMIDs produced by function pull_pmid
  @return: Returns a Pandas DataFrame
  @raise: nothing; an error for an individual PMID is caught and printed
  """
  columns = ["PMID", "PubMed_Article_Title", "Date_Completed_Year", 
             "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year", 
             "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
             "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
             "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]
  df = pd.DataFrame(columns=columns)

  for id, i in enumerate(pmid):
    try:
      _temp = f'''$HOME/edirect/efetch -db pubmed -id {i} -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab " /" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation'''

      _result = !{_temp}
      _temp = _result[0].split("|")

      for _count, _value in enumerate(_temp):
        df.loc[id,columns[_count]] = _value
        
    except Exception as e:
      print(f"Error Raised when Querying Unix EDirect for PMID: {i}")
      print(e)

  # Prior to returning df
  # If any cells hold an empty string, convert them to "NULL"
  df = df.replace("", "NULL")

  return df

In [None]:
# Install e-utilities UNIX
# When prompted here, please type `y` to install
e_utilities_install()


Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "export PATH=\${PATH}:/root/edirect" >> ${HOME}/.bashrc

or manually edit the PATH variable assignment in your .bashrc file.

Would you like to do that automatically now? [y/N]
y
OK, done.

To activate EDirect for this terminal session, please execute the following:

export PATH=${PATH}:${HOME}/edirect



In [None]:
# Example of a query of PMID 27219127 using the terminal
# Fun fact: this is Leo's paper on the MIMIC-III database
# Note that the PMID is passed to the -id parameter
!$HOME/edirect/efetch -db pubmed -id 27219127 -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle \
DateCompleted/Year DateCompleted/Month DateCompleted/Day DateRevised/Year DateRevised/Month DateRevised/Day Journal/Title \
PubDate/Year PubDate/Month PubDate/Day AbstractText \
-block Author -tab " /" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation

27219127|MIMIC-III, a freely accessible critical care database.|2016|12|16|2018|11|13|Scientific data|2016|May|24|MIMIC-III ('Medical Information Mart for Intensive Care') is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.|Alistair E W_Johnson_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. /Tom J_Pollard_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachus
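
To make the layout of this output concrete, here is a minimal sketch of turning one such line into a dictionary keyed by the column names that `pull_pmid_metadata` uses. The helper `parse_xtract_line` is hypothetical and not part of the notebook, and the sample line is an abbreviated placeholder rather than real output.

```python
COLUMNS = ["PMID", "PubMed_Article_Title", "Date_Completed_Year",
           "Date_Completed_Month", "Date_Completed_Day", "Date_Revised_Year",
           "Date_Revised_Month", "Date_Revised_Day", "Journal_Title",
           "Publication_Date_Year", "Publication_Date_Month", "Publication_Date_Day",
           "Abstract", "AuthorFirstName_AuthorLastName_Affiliation"]


def parse_xtract_line(line, columns=COLUMNS, sep="|"):
    """Split one xtract output line into a {column: value} dict (hypothetical helper)."""
    # maxsplit keeps any stray "|" inside the last (author) field in one piece
    fields = line.split(sep, len(columns) - 1)
    # Pad short lines so that every column is present
    fields += ["NULL"] * (len(columns) - len(fields))
    return dict(zip(columns, fields))


# Abbreviated placeholder line in the same layout as the output above
sample_line = ("27219127|MIMIC-III, a freely accessible critical care database.|"
               "2016|12|16|2018|11|13|Scientific data|2016|May|24|"
               "Abstract text|Alistair E W_Johnson_Some Laboratory, USA. ")
record = parse_xtract_line(sample_line)
print(record["Journal_Title"], record["Publication_Date_Month"])
```

Padding with `"NULL"` mirrors the `-def "NULL"` option passed to `xtract`, so a short line still yields a complete row.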

In [None]:
# Creating a list of Leo's other papers and creating a dataframe for analysis
list_of_pmids = [27219127, 30349085, 29806057, 29303796, 28328711]
df = pull_pmid_metadata(list_of_pmids)

Error Raised when Querying Unix EDirect for PMID: 28328711
list index out of range


In [None]:
# Showing the df 
df

Unnamed: 0,PMID,PubMed_Article_Title,Date_Completed_Year,Date_Completed_Month,Date_Completed_Day,Date_Revised_Year,Date_Revised_Month,Date_Revised_Day,Journal_Title,Publication_Date_Year,Publication_Date_Month,Publication_Date_Day,Abstract,AuthorFirstName_AuthorLastName_Affiliation
0,27219127,"MIMIC-III, a freely accessible critical care d...",2016,12,16,2018,11,13,Scientific data,2016,May,24.0,MIMIC-III ('Medical Information Mart for Inten...,Alistair E W_Johnson_Laboratory for Computatio...
1,30349085,The Artificial Intelligence Clinician learns o...,2019,5,9,2021,1,9,Nature medicine,2018,11,,Sepsis is the third leading cause of death wor...,Matthieu_Komorowski_Department of Surgery and ...
2,29806057,Transthoracic echocardiography and mortality i...,2019,2,11,2019,2,15,Intensive care medicine,2018,06,,While the use of transthoracic echocardiograph...,Mengling_Feng_Saw Swee Hock School of Public H...
3,29303796,A Comparative Analysis of Sepsis Identificatio...,2019,9,3,2019,9,3,Critical care medicine,2018,04,,To evaluate the relative validity of criteria ...,"Alistair E W_Johnson_MIT Critical Data, Cambri..."
4,28328711,Management of Atrial Fibrillation with Rapid V...,2018,5,17,2020,12,9,"Shock (Augusta, Ga.)",2017,10,,Atrial fibrillation with rapid ventricular res...,"Ari_Moskowitz_*Division of Pulmonary, Critical..."


In [None]:
# As long as we supply a list of pmids as an argument to the pull_pmid_metadata, the function will return a dataframe with our relevant information.
# Let's get all the BMJ articles from 2020. 
# We can go to: https://pubmed.ncbi.nlm.nih.gov/advanced/
# Search by query: ("BMJ (Clinical research ed.)"[Journal]) AND (("2020/01/01"[Date - Publication] : "2020/12/31"[Date - Publication]))
# As of September 13, 2021, there are 2095 articles on PubMed.
# Click on the "Save" button.
# Select "All Results" in Format "PMID".
# You can either import that list of PMIDs in here or you can copy and paste the information here.
# For demonstration's sake, we will use the first 10 PMIDs
list_of_pmids = [33168565, 33148535, 33288500, 33361141, 33184044, 33328153, 33310706, 33229333, 33106289, 33268462]
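
For the full result set, importing the saved file is less error-prone than pasting. The sketch below assumes the saved export is a plain text file with one PMID per line. The helper `read_pmids` and the file name are hypothetical, and a temporary file stands in for the real export.

```python
import tempfile
from pathlib import Path


def read_pmids(path):
    """Read one PMID per line, skipping blank or non-numeric lines (hypothetical helper)."""
    pmids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.isdigit():
            pmids.append(int(line))
    return pmids


# A temporary file stands in for the file saved from PubMed
with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "pmid-export.txt"  # hypothetical file name
    export_path.write_text("33168565\n33148535\n\n33288500\n")
    list_of_pmids = read_pmids(export_path)

print(list_of_pmids)
```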

In [None]:
# Creating dataframe
df = pull_pmid_metadata(list_of_pmids)

In [None]:
# Show head of dataframe
df.head()

Unnamed: 0,PMID,PubMed_Article_Title,Date_Completed_Year,Date_Completed_Month,Date_Completed_Day,Date_Revised_Year,Date_Revised_Month,Date_Revised_Day,Journal_Title,Publication_Date_Year,Publication_Date_Month,Publication_Date_Day,Abstract,AuthorFirstName_AuthorLastName_Affiliation
0,33168565,Treatment of epithelial ovarian cancer.,2020,12,4,2020,12,14,BMJ (Clinical research ed.),2020,11,9,Ovarian cancer is the third most common gyneco...,Lindsay_Kuroki_Division of Gynecologic Oncolog...
1,33148535,Mortality due to cancer treatment delay: syste...,2020,12,4,2020,12,14,BMJ (Clinical research ed.),2020,11,4,To quantify the association of cancer treatmen...,Timothy P_Hanna_Division of Cancer Care and Ep...
2,33288500,Cancer of unknown primary.,2020,12,22,2020,12,22,BMJ (Clinical research ed.),2020,12,7,Cancers of unknown primary (CUPs) are histolog...,Michael S_Lee_Department of Gastrointestinal M...
3,33361141,NICE guideline on long covid.,2021,1,7,2021,1,7,BMJ (Clinical research ed.),2020,12,23,,Manoj_Sivan_Academic Department of Rehabilitat...
4,33184044,Diabetic ketoacidosis with SGLT2 inhibitors.,2020,11,23,2020,11,23,BMJ (Clinical research ed.),2020,11,12,,Giovanni_Musso_Emergency and Intensive Care Me...


# Extracting the Countries

With the data extracted and nearly cleaned, it is time to look at individual fields and refine them further, starting with countries. This information lives in the `AuthorFirstName_AuthorLastName_Affiliation` column, which holds the list of authors and their affiliations. The delimiter between the first name, last name, and affiliation is `_`, whereas different authors (with their affiliations) are separated by `/`. Let's see what this looks like in practice.

In [None]:
# Examining the first row and last column of the first dataframe
df.loc[0,"AuthorFirstName_AuthorLastName_Affiliation"]

'Lindsay_Kuroki_Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Washington University School of Medicine, St Louis, MO, USA. /Saketh R_Guntupalli_Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, University of Colorado School of Medicine, Denver, CO, USA saketh.guntupalli@ucdenver.edu.'

Each author clearly has an associated affiliation. By using regular expressions, we can attempt to extract the countries (either by abbreviations or by their full names).
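
Before looking for countries, it can help to see the two delimiters at work. The following is a minimal sketch, not the notebook's own method: it splits the field on ` /` to separate authors and on `_` to separate name parts from affiliations. The helper `split_authors` is hypothetical, and the sample string is a shortened version of the row shown above. It assumes neither delimiter occurs inside a name or affiliation, which real records may violate.

```python
def split_authors(author_field):
    """Split the combined author field into one dict per author (hypothetical helper)."""
    authors = []
    for chunk in author_field.split(" /"):
        parts = chunk.strip().split("_")
        if len(parts) < 2:
            # Not enough parts for a first and last name; skip this chunk
            continue
        authors.append({
            "first_name": parts[0],
            "last_name": parts[1],
            # Everything after the names is treated as affiliation text
            "affiliations": parts[2:],
        })
    return authors


# Shortened version of the example row shown above
field = ("Lindsay_Kuroki_Washington University School of Medicine, St Louis, MO, USA. "
         "/Saketh R_Guntupalli_University of Colorado School of Medicine, Denver, CO, "
         "USA saketh.guntupalli@ucdenver.edu.")
for author in split_authors(field):
    print(author["first_name"], "|", author["last_name"], "|", author["affiliations"])
```

Working per author like this also makes it possible to look at the first or last author's affiliation separately.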

In [None]:
class Country:

    # This class groups static helper methods for the search phase; the one shown
    # here counts countries in affiliation strings. No fields are declared and the
    # constructor does nothing, so the methods can be called without instantiating
    # the class.

    def __init__(self):
        pass

    @staticmethod
    def countryCount(arrayOfAffiliations):
        # Returns a Counter (a dictionary subclass) with key = country and value = frequency of the country. The single parameter takes a string of affiliations.

        # Libraries utilized
        from collections import Counter
        import pandas as pd
        import re

        # Declare an array with list of all valid countries in the world.
        # Reference: https://textlists.info/geography/countries-of-the-world/
        listOfCountries = ["afghanistan", "albania", "algeria", "andorra", "angola", "antigua and barbuda", "argentina", "armenia", "australia", "austria", "azerbaijan", "the bahamas", "bahrain", "bangladesh", "barbados", "belarus", "belgium", "belize", "benin", "bhutan", "bolivia", "bosnia and herzegovina", "botswana", "brazil", "brunei", "bulgaria", "burkina faso", "burundi", "cambodia", "cameroon", "canada", "cape verde", "central african republic", "chad", "chile", "china", "colombia", "comoros", "congo, republic of the", "congo, democratic republic of the", "costa rica", "cote d'ivoire", "croatia", "cuba", "cyprus", "czech republic", "denmark", "djibouti", "dominica", "dominican republic", "east timor (timor-leste)", "ecuador", "egypt", "el salvador", "equatorial guinea", "eritrea", "estonia", "ethiopia", "fiji", "finland", "france","gabon","the gambia","georgia","germany","ghana","greece","grenada","guatemala","guinea","guinea-bissau","guyana","haiti","honduras","hungary","iceland","india","indonesia","iran","iraq","ireland","israel","italy","jamaica","japan","jordan","kazakhstan","kenya","kiribati","korea, north","korea, south","kosovo","kuwait","kyrgyzstan","laos","latvia","lebanon","lesotho","liberia","libya","liechtenstein","lithuania","luxembourg","macedonia","madagascar","malawi","malaysia","maldives","mali","malta","marshall islands","mauritania","mauritius","mexico","micronesia, federated states of","moldova","monaco","mongolia","montenegro","morocco","mozambique","myanmar (burma)","namibia","nauru","nepal","netherlands","new zealand","nicaragua","niger","nigeria","norway","oman","pakistan","palau","panama","papua new guinea","paraguay","peru","philippines","poland","portugal","qatar","romania","russia","rwanda","saint kitts and nevis","saint lucia","saint vincent and the grenadines","samoa","san marino","sao tome and principe","saudi arabia","senegal","serbia","seychelles","sierra leone","singapore","slovakia","slovenia","solomon islands","somalia","south africa","south sudan","spain","sri lanka","sudan","suriname","swaziland","sweden","switzerland","syria","taiwan","tajikistan","tanzania","thailand","togo","tonga","trinidad and tobago","tunisia","turkey","turkmenistan","tuvalu","uganda","ukraine","united arab emirates","united kingdom","united states of america","uruguay","uzbekistan","vanuatu","vatican city (holy see)","venezuela","vietnam","yemen","zambia","zimbabwe"]
        punc = r'''!|()-[]{};:'"\, <>./?@#$%^&*_~'''  # raw string so the backslash is taken literally

        # Please note that some countries will require mapping
        # i.e. "South Korea" --> "Korea, South"
        # Please note that I account for these edge cases in the main body of the method below
        # This may require further iterations to improve on these edge cases
        # punctuation has also been declared so that they can be removed

        # TODO: Map the country initials to the country's full name. https://datahub.io/core/country-list
        abbrev_dict = {
                "Afghanistan":"AF",
                "Åland Islands":"AX",
                "Albania":"AL",
                "Algeria":"DZ",
                "American Samoa":"AS",
                "Andorra":"AD",
                "Angola":"AO",
                "Anguilla":"AI",
                "Antarctica":"AQ",
                "Antigua and Barbuda":"AG",
                "Argentina":"AR",
                "Armenia":"AM",
                "Aruba":"AW",
                "Australia":"AU",
                "Austria":"AT",
                "Azerbaijan":"AZ",
                "Bahamas":"BS",
                "Bahrain":"BH",
                "Bangladesh":"BD",
                "Barbados":"BB",
                "Belarus":"BY",
                "Belgium":"BE",
                "Belize":"BZ",
                "Benin":"BJ",
                "Bermuda":"BM",
                "Bhutan":"BT",
                "Bolivia, Plurinational State of":"BO",
                "Bonaire, Sint Eustatius and Saba":"BQ",
                "Bosnia and Herzegovina":"BA",
                "Botswana":"BW",
                "Bouvet Island":"BV",
                "Brazil":"BR",
                "British Indian Ocean Territory":"IO",
                "Brunei Darussalam":"BN",
                "Bulgaria":"BG",
                "Burkina Faso":"BF",
                "Burundi":"BI",
                "Cambodia":"KH",
                "Cameroon":"CM",
                "Canada":"CA",
                "Cape Verde":"CV",
                "Cayman Islands":"KY",
                "Central African Republic":"CF",
                "Chad":"TD",
                "Chile":"CL",
                "China":"CN",
                "Christmas Island":"CX",
                "Cocos (Keeling) Islands":"CC",
                "Colombia":"CO",
                "Comoros":"KM",
                "Congo":"CG",
                "Congo, the Democratic Republic of the":"CD",
                "Cook Islands":"CK",
                "Costa Rica":"CR",
                "Côte d'Ivoire":"CI",
                "Croatia":"HR",
                "Cuba":"CU",
                "Curaçao":"CW",
                "Cyprus":"CY",
                "Czech Republic":"CZ",
                "Denmark":"DK",
                "Djibouti":"DJ",
                "Dominica":"DM",
                "Dominican Republic":"DO",
                "Ecuador":"EC",
                "Egypt":"EG",
                "El Salvador":"SV",
                "Equatorial Guinea":"GQ",
                "Eritrea":"ER",
                "Estonia":"EE",
                "Ethiopia":"ET",
                "Falkland Islands (Malvinas)":"FK",
                "Faroe Islands":"FO",
                "Fiji":"FJ",
                "Finland":"FI",
                "France":"FR",
                "French Guiana":"GF",
                "French Polynesia":"PF",
                "French Southern Territories":"TF",
                "Gabon":"GA",
                "Gambia":"GM",
                "Georgia":"GE",
                "Germany":"DE",
                "Ghana":"GH",
                "Gibraltar":"GI",
                "Greece":"GR",
                "Greenland":"GL",
                "Grenada":"GD",
                "Guadeloupe":"GP",
                "Guam":"GU",
                "Guatemala":"GT",
                "Guernsey":"GG",
                "Guinea":"GN",
                "Guinea-Bissau":"GW",
                "Guyana":"GY",
                "Haiti":"HT",
                "Heard Island and McDonald Islands":"HM",
                "Holy See (Vatican City State)":"VA",
                "Honduras":"HN",
                "Hong Kong":"HK",
                "Hungary":"HU",
                "Iceland":"IS",
                "India":"IN",
                "Indonesia":"ID",
                "Iran, Islamic Republic of":"IR",
                "Iraq":"IQ",
                "Ireland":"IE",
                "Isle of Man":"IM",
                "Israel":"IL",
                "Italy":"IT",
                "Jamaica":"JM",
                "Japan":"JP",
                "Jersey":"JE",
                "Jordan":"JO",
                "Kazakhstan":"KZ",
                "Kenya":"KE",
                "Kiribati":"KI",
                "Korea, Democratic People's Republic of":"KP",
                "Korea, Republic of":"KR",
                "Kuwait":"KW",
                "Kyrgyzstan":"KG",
                "Lao People's Democratic Republic":"LA",
                "Latvia":"LV",
                "Lebanon":"LB",
                "Lesotho":"LS",
                "Liberia":"LR",
                "Libya":"LY",
                "Liechtenstein":"LI",
                "Lithuania":"LT",
                "Luxembourg":"LU",
                "Macao":"MO",
                "Macedonia, the Former Yugoslav Republic of":"MK",
                "Madagascar":"MG",
                "Malawi":"MW",
                "Malaysia":"MY",
                "Maldives":"MV",
                "Mali":"ML",
                "Malta":"MT",
                "Marshall Islands":"MH",
                "Martinique":"MQ",
                "Mauritania":"MR",
                "Mauritius":"MU",
                "Mayotte":"YT",
                "Mexico":"MX",
                "Micronesia, Federated States of":"FM",
                "Moldova, Republic of":"MD",
                "Monaco":"MC",
                "Mongolia":"MN",
                "Montenegro":"ME",
                "Montserrat":"MS",
                "Morocco":"MA",
                "Mozambique":"MZ",
                "Myanmar":"MM",
                "Namibia":"NA",
                "Nauru":"NR",
                "Nepal":"NP",
                "Netherlands":"NL",
                "New Caledonia":"NC",
                "New Zealand":"NZ",
                "Nicaragua":"NI",
                "Niger":"NE",
                "Nigeria":"NG",
                "Niue":"NU",
                "Norfolk Island":"NF",
                "Northern Mariana Islands":"MP",
                "Norway":"NO",
                "Oman":"OM",
                "Pakistan":"PK",
                "Palau":"PW",
                "Palestine, State of":"PS",
                "Panama":"PA",
                "Papua New Guinea":"PG",
                "Paraguay":"PY",
                "Peru":"PE",
                "Philippines":"PH",
                "Pitcairn":"PN",
                "Poland":"PL",
                "Portugal":"PT",
                "Puerto Rico":"PR",
                "Qatar":"QA",
                "Réunion":"RE",
                "Romania":"RO",
                "Russian Federation":"RU",
                "Rwanda":"RW",
                "Saint Barthélemy":"BL",
                "Saint Helena, Ascension and Tristan da Cunha":"SH",
                "Saint Kitts and Nevis":"KN",
                "Saint Lucia":"LC",
                "Saint Martin (French part)":"MF",
                "Saint Pierre and Miquelon":"PM",
                "Saint Vincent and the Grenadines":"VC",
                "Samoa":"WS",
                "San Marino":"SM",
                "Sao Tome and Principe":"ST",
                "Saudi Arabia":"SA",
                "Senegal":"SN",
                "Serbia":"RS",
                "Seychelles":"SC",
                "Sierra Leone":"SL",
                "Singapore":"SG",
                "Sint Maarten (Dutch part)":"SX",
                "Slovakia":"SK",
                "Slovenia":"SI",
                "Solomon Islands":"SB",
                "Somalia":"SO",
                "South Africa":"ZA",
                "South Georgia and the South Sandwich Islands":"GS",
                "South Sudan":"SS",
                "Spain":"ES",
                "Sri Lanka":"LK",
                "Sudan":"SD",
                "Suriname":"SR",
                "Svalbard and Jan Mayen":"SJ",
                "Swaziland":"SZ",
                "Sweden":"SE",
                "Switzerland":"CH",
                "Syrian Arab Republic":"SY",
                "Taiwan":"TW",
                "Tajikistan":"TJ",
                "Tanzania":"TZ",
                "Thailand":"TH",
                "Timor-Leste":"TL",
                "Togo":"TG",
                "Tokelau":"TK",
                "Tonga":"TO",
                "Trinidad and Tobago":"TT",
                "Tunisia":"TN",
                "Turkey":"TR",
                "Turkmenistan":"TM",
                "Turks and Caicos Islands":"TC",
                "Tuvalu":"TV",
                "Uganda":"UG",
                "Ukraine":"UA",
                "United Arab Emirates":"AE",
                "United Kingdom":"GB",
                "United States of America":"US",
                "United States Minor Outlying Islands":"UM",
                "Uruguay":"UY",
                "Uzbekistan":"UZ",
                "Vanuatu":"VU",
                "Venezuela":"VE",
                "Vietnam":"VN",
                "Virgin Islands, British":"VG",
                "Virgin Islands, U.S.":"VI",
                "Wallis and Futuna":"WF",
                "Western Sahara":"EH",
                "Yemen":"YE",
                "Zambia":"ZM",
                "Zimbabwe":"ZW"}

        abbrev = pd.DataFrame(list(abbrev_dict.items()), columns=["Name", "Code"])

        #arrayOfAffiliations = re.sub(r'[^\w\s]',' ',arrayOfAffiliations)
        affiliationsSplit = arrayOfAffiliations.split(" ")
        tempList = []                                            

        for word in affiliationsSplit:                           
            
            # Removing punctuations in string 
            # Using loop + punctuation string 
            for ele in word:  
                if ele in punc:  
                    word = word.replace(ele, "")  
                                                                   
            if word.lower() in listOfCountries: 
                tempList.append(word.lower())
            if word.lower() == "south korea": 
                tempList.append("korea, south") # edge case for South Korea
            if word.lower() == "north korea": 
                tempList.append("korea, north") # edge case for North Korea
            if word.lower() == "united states": 
                tempList.append("united states of america") # edge case for US
            if word.lower() == "usa": 
                tempList.append("united states of america") # edge case for US
            if word.lower() == "uk": 
                tempList.append("united kingdom") # edge case for UK

        # List of countries established. Counter function returns a dictionary with country frequency
        return Counter(tempList)


class Tokenization():
    # This class provides static functions in tokenization and frequency counts of words.
    # No constructors are required.

    @staticmethod
    def return_tokens(index: int, aff_string: str) -> list:
        '''
        Returns the lowercased word tokens of aff_string, in their original
        order, with English stop words removed.
        The index parameter is not used in the body; it only identifies the
        row when debugging.
        Dependency: nltk.tokenize, nltk.corpus
        '''
        try:
            from nltk.tokenize import RegexpTokenizer
            from nltk.corpus import stopwords

            tokenizer = RegexpTokenizer(r'\w+')
            tokens = tokenizer.tokenize(aff_string)
            tokens = [word.lower() for word in tokens]
            # Build the stop-word set once instead of once per token
            stop_words = set(stopwords.words('english'))
            return [word for word in tokens if word not in stop_words]
        except Exception:
            # Occasionally the input is null rather than a string.
            # Return an empty list
            # print(f'Null encountered at index {index}')
            return []

    @staticmethod
    def return_frequency_terms(terms: list) -> dict:
        '''
        Returns a dictionary of the terms with its corresponding frequencies
        '''
        wordfreq = [terms.count(w) for w in terms]
        return dict(list(zip(terms,wordfreq)))

    @staticmethod
    def return_sorted_frequency(freqdict: dict) -> list:
        '''
        Returns a sorted list with the most frequent terms
        '''
        aux = [(freqdict[key], key) for key in freqdict]
        aux.sort()
        aux.reverse()
        return aux
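
As a side note, the standard library's `collections.Counter` produces a similar frequency ranking in one step. The sketch below is an alternative, not a replacement for the methods above: `top_terms` is a hypothetical helper, it returns `(term, count)` pairs rather than `(count, term)`, and it keeps first-seen order among ties instead of sorting them.

```python
from collections import Counter


def top_terms(terms, n=None):
    """Return (term, count) pairs, most frequent first (hypothetical helper)."""
    return Counter(terms).most_common(n)


tokens = ["university", "medicine", "university", "oncology", "medicine", "university"]
print(top_terms(tokens, 2))  # [('university', 3), ('medicine', 2)]
```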

In [None]:
# The affiliation string is written as adjacent string literals, one per author
country_count = Country.countryCount(
    'Alistair E W_Johnson_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Tom J_Pollard_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Lu_Shen_Information Systems, Beth Israel Deaconess Medical Center, Boston, Massachusetts 02215, USA. '
    '/Li-Wei H_Lehman_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Mengling_Feng_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA._Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore 138632, Singapore. '
    '/Mohammad_Ghassemi_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Benjamin_Moody_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Peter_Szolovits_Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. '
    '/Leo Anthony_Celi_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA._Information Systems, Beth Israel Deaconess Medical Center, Boston, Massachusetts 02215, USA. '
    '/Roger G_Mark_Laboratory for Computational Physiology, MIT Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA._Information Systems, Beth Israel Deaconess Medical Center, Boston, Massachusetts 02215, USA.'
)
print(country_count)

Counter({'united states of america': 9, 'israel': 3, 'singapore': 2})


In [None]:
# TODO: Bigrams and trigrams need to be considered for country extraction
# The methods within the Country class fail to identify countries with two or more words (e.g. they won't identify South Korea or United States of America;
# ...they will, however, identify usa, etc.)
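
One possible way to address this TODO is to match whole phrases against the text instead of single space-separated words. `countryCount` splits the string on spaces, so its comparisons against `"south korea"` or `"united states"` can never succeed. The sketch below is an assumption-laden alternative, not the notebook's method: the `ALIASES` table is a small hypothetical sample, `count_countries` is a hypothetical helper, and the author names and institutions in the sample text are made up.

```python
import re
from collections import Counter

# Hypothetical alias table: phrase found in the text -> canonical country name
ALIASES = {
    "united states of america": "united states of america",
    "united states": "united states of america",
    "usa": "united states of america",
    "united kingdom": "united kingdom",
    "uk": "united kingdom",
    "south korea": "korea, south",
    "singapore": "singapore",
}

# Longest phrases first, so "united states of america" wins over "united states"
_ordered = sorted(ALIASES, key=len, reverse=True)
COUNTRY_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _ordered) + r")\b",
    re.IGNORECASE,
)


def count_countries(affiliations):
    """Count canonical country names, including multi-word ones (hypothetical helper)."""
    # "_" counts as a word character, so replace it to let \b fire next to it
    text = affiliations.replace("_", " ")
    return Counter(ALIASES[m.group(0).lower()] for m in COUNTRY_PATTERN.finditer(text))


# Made-up sample in the same layout as the author field
sample = ("Jane_Doe_Example University, Seoul, South Korea. "
          "/John_Roe_Example Institute, Boston, United States of America."
          "_Example College, London, UK.")
print(count_countries(sample))
```

Phrase matching does not remove false positives such as the `israel` count in the output above, which comes from the hospital name "Beth Israel Deaconess Medical Center". That would need a separate rule, for example only inspecting the last comma-separated part of each affiliation.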

In [None]:
# **** ARIEL CODE STARTS HERE ****
# Get list of articles published in 2019 in The Lancet. Global health
def pull_pmid_metadata_from_list(pubList: list) -> pd.DataFrame:
  columns = ["PMID", 
             "PubMed_Article_Title", 
             "Journal_Title",
             "Publication_Date_Year",
             "Abstract",
             "ForeName", 
             "Authors"]
  df = pd.DataFrame(columns=columns)

  for id, pub in enumerate(pubList):
    try:
      _temp = pub.split("|")
      for _count, _value in enumerate(_temp):
        df.loc[id,columns[_count]] = _value
        
    except Exception as e:
      print(f"Error raised when parsing the EDirect record: {pub}")
      print(e)

  # Prior to returning df
  # If any cells hold an empty string, convert them to "NULL"
  df = df.replace("", "NULL")

  return df

pubList = !$HOME/edirect/esearch -db pubmed -query "(The Lancet. Global health[Journal]) AND ((2019/01/01[Date - Publication] : 2019/12/31[Date - Publication]))" \
| $HOME/edirect/efetch -format xml \
| $HOME/edirect/xtract -pattern PubmedArticle -tab "|" -def "NULL" -sep "," -element MedlineCitation/PMID ArticleTitle Journal/Title PubDate/Year AbstractText Author/ForeName \
-block Author -tab " /" -def "NULL" -sep "_" -element ForeName,LastName,Affiliation

df = pull_pmid_metadata_from_list(list(pubList));
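
To make the record layout concrete, here is a minimal sketch of how a single line of the xtract output could be split into named fields without pandas. It assumes seven `|`-separated fields in the same order as `columns`; `parse_record` and `FIELDS` are hypothetical names, and the sample line is invented for illustration rather than taken from PubMed.

```python
FIELDS = ["PMID", "PubMed_Article_Title", "Journal_Title",
          "Publication_Date_Year", "Abstract", "ForeName", "Authors"]

def parse_record(line):
    """Split one "|"-delimited line into a dict keyed by column name."""
    values = line.split("|")
    # Pad short records so every column gets a value
    values += ["NULL"] * (len(FIELDS) - len(values))
    # Empty fields become "NULL", mirroring the -def "NULL" flag of xtract
    return {name: (value.strip() or "NULL") for name, value in zip(FIELDS, values)}

# Invented sample line, not real PubMed data
sample = ("123|A title|The Lancet. Global health|2019||Ann,Bo|"
          "Ann_Lee_Dept A, Oslo, Norway. /Bo_Kim_Dept B, Seoul, South Korea.")
record = parse_record(sample)
print(record["PMID"], record["Abstract"], record["ForeName"])
```

A list of such dicts can be passed to `pd.DataFrame` in one call, which avoids growing the frame cell by cell. Because the split is on `|`, a title or abstract that itself contains `|` would shift the later fields; this sketch silently drops any extra pieces.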

In [None]:
# ADD Country column for all authors

def label_country(word):
  # Count the countries mentioned across all author affiliations of one article
  if len(word) > 0:
    return Country.countryCount(word)
  return ''

df['Country'] = [label_country(str(word)) for word in df['Authors']]



Unnamed: 0,PMID,PubMed_Article_Title,Journal_Title,Publication_Date_Year,Abstract,ForeName,Authors,Country,Gender_FirstAuththor,Gender_LastAuththor,Country_FirstAuththor,Country_LastAuththor
0,31879213,Gender-transformative programmes: implications...,The Lancet. Global health,2020,,"Anna,Venkatraman",Anna_Kågesten_Department of Global Public Heal...,"{'sweden': 1, 'switzerland': 1}",female,unknown,Counter({'sweden': 1,'switzerland': 1})
1,31879212,Characteristics of successful programmes targe...,The Lancet. Global health,2020,In the context of the Sustainable Development ...,"Jessica K,Gary L,Caitlin,Mary,Erika,Aishwarya,...",Jessica K_Levy_Brown School at Washington Univ...,{'united states of america': 10},unknown,unknown,Counter({'united states of america': 10}),Counter({'united states of america': 10})
2,31866149,Correction to Lancet Glob Health 2020; 8: e23-24.,The Lancet. Global health,2020,,,,{},unknown,unknown,Counter(),Counter()
3,31864918,Evaluating the impact of Georgia's hepatitis C...,The Lancet. Global health,2020,,"Yvan,Niklas,Philippa","Yvan_Hutin_Global Hepatitis Programme, WHO, 12...",{'switzerland': 3},male,female,Counter({'switzerland': 3}),Counter({'switzerland': 3})
4,31864917,Interim effect evaluation of the hepatitis C e...,The Lancet. Global health,2020,"Georgia has a high prevalence of hepatitis C, ...","Josephine G,Tinatin,David,Hannah,Aaron G,Shaun...","Josephine G_Walker_Population Health Sciences,...","{'united kingdom': 6, 'georgia': 30, 'united s...",unknown,male,Counter({'georgia': 30,'united kingdom': 6})
...,...,...,...,...,...,...,...,...,...,...,...,...
403,30361106,Introducing the World Bank's 2018 Health Equit...,The Lancet. Global health,2019,,"Adam,Patrick,Sven,Marc-Francois","Adam_Wagstaff_Development Research Group, Worl...",{'united states of america': 4},male,unknown,Counter({'united states of america': 4}),Counter({'united states of america': 4})
404,30342924,Correction to Lancet Glob Health 2018; 6: e1139.,The Lancet. Global health,2019,,,,{},unknown,unknown,Counter(),Counter()
405,30297255,Correction to Lancet Glob Health 2018; 6: e107...,The Lancet. Global health,2019,,,,{},unknown,unknown,Counter(),Counter()
406,30245118,Correction to Lancet Glob Health 2018; 6: e113...,The Lancet. Global health,2019,,,,{},unknown,unknown,Counter(),Counter()


In [None]:
# ADD Country column for first and last authors
# NOTE: this splits the string form of the Country counter on ",", so it returns
# the first/last entry of that counter, not the affiliation country of the first/last author
def label_countryFirstLast(word, authorType=''):
  if len(word) > 0:
    countryArr = word.split(",")
    if authorType == 'last':
      countryName = countryArr[-1]
    else:
      countryName = countryArr[0]
    return countryName
  return ''

df['Country_FirstAuththor'] = [label_countryFirstLast(str(word)) for word in df['Country']]
df['Country_LastAuththor'] = [label_countryFirstLast(str(word),'last') for word in df['Country']]


df

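
The `Country_FirstAuththor` and `Country_LastAuththor` values in the output are pieces of the counter's text (for example `Counter({'sweden': 1`). If the aim is the country of the first or last author's own affiliation, one option is to slice the `Authors` string instead. This is a minimal sketch, assuming authors are separated by ` /` and the fields of each author by `_`, as set in the xtract call; `author_affiliation` is a hypothetical helper name and the sample string is invented.

```python
def author_affiliation(authors, which="first"):
    """Return the affiliation text of the first or last author, or '' if none."""
    if authors == "NULL":
        return ""
    chunks = [c for c in authors.split(" /") if c.strip()]
    if not chunks:
        return ""
    chunk = chunks[-1] if which == "last" else chunks[0]
    # maxsplit=2 keeps multiple "_"-joined affiliations together in the third part
    parts = chunk.split("_", 2)
    return parts[2] if len(parts) == 3 else ""

# Invented sample string, not real PubMed data
sample = "Ann_Lee_Dept A, Oslo, Norway. /Bo_Kim_Dept B, Seoul, South Korea."
print(author_affiliation(sample))
print(author_affiliation(sample, "last"))
```

The returned text could then be passed to `Country.countryCount` in place of the full `Authors` string, so that each counter reflects a single author.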


In [None]:
# Add Gender First Author / Last Author columns
!pip install gender_guesser
import gender_guesser.detector as gender
d = gender.Detector()

def label_gender(word, authorType=''):
  # ForeName holds the comma-separated forenames of all authors of one article
  if len(word) > 0:
    authorArr = word.split(",")
    if authorType == 'last':
      authorName = authorArr[-1]
    else:
      authorName = authorArr[0]
    return d.get_gender(authorName)
  return ''

df['Gender_FirstAuththor'] = [label_gender(str(word)) for word in df['ForeName']]
df['Gender_LastAuththor'] = [label_gender(str(word),'last') for word in df['ForeName']]
df



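
Several first authors with a middle initial in `ForeName`, such as `Jessica K`, are labelled `unknown` in the output. One plausible cause, which is an assumption and not verified here, is that the detector receives the forename together with the initial. A minimal sketch of trimming the name before the lookup follows; `pick_given_name` is a hypothetical helper name.

```python
def pick_given_name(forenames, which="first"):
    """Return the first word of the first or last author's forename, or ''."""
    if forenames == "NULL":
        return ""
    names = [n.strip() for n in forenames.split(",") if n.strip()]
    if not names:
        return ""
    name = names[-1] if which == "last" else names[0]
    # "Jessica K" -> "Jessica"; hyphenated names such as "Marc-Francois" stay whole
    return name.split()[0]

print(pick_given_name("Jessica K,Gary L,Caitlin"))
print(pick_given_name("Adam,Patrick,Sven,Marc-Francois", "last"))
```

The result could be passed to `d.get_gender(...)` in `label_gender` instead of the raw forename. Whether this reduces the number of `unknown` labels would need to be checked against the data.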


In [None]:
# Export the DataFrame to a CSV file on Google Drive
# First, authorize Colab to access the Drive account (paste the Google authorization code when prompted)
from google.colab import drive
drive.mount('drive')

df.to_csv('drive/My Drive/pubmedlancet.csv', sep=';')


Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).
