# Overview/goals
This project showcases how *puboracle* can be used to mine PubMed and extract information on the geospatial location of research activity, collaboration networks, and related metrics.

This example tracks publications matching the keyword "connectome" or "connectomics" from the last year and displays key geospatial data and metrics for these publications.

In [1]:
import math
from pathlib import Path
import re

from puboracle.writestoredata import getdata,readwritefun
from puboracle.txtprocess import txt2geo
from puboracle.visualization import visfun
from puboracle.metrics import txtmetrics

Submit query and fetch data in an XML format

In [None]:
save_folder = '/Users/alexandrosgoulas/Data/work-stuff/projects/example_puboracle_connectomics/xmldata/'
query = 'connectomics OR connectome'
days = 365
email = 'arimpos@gmail.com'

getdata.fetch_write_data(
                        query = query,
                        datetype = 'pdat',
                        email = email,
                        days = days,
                        save_folder = save_folder
                        )

Read all the XML files that were downloaded

In [2]:
folder_to_xmls = Path('/Users/alexandrosgoulas/Data/work-stuff/projects/example_puboracle_connectomics/xmldata/')
all_xml_files = readwritefun.get_files_in_folder(folder_to_xmls, 
                                                 order = True
                                                )

Read the XML files and extract the desired info specified by the list keys_to_parse

In [3]:
keys_to_parse = [
                 'affiliations'
                ]

pub_data, xml_filenames = readwritefun.read_xml_to_dict(folder_to_xmls, 
                                                        all_xml_files = all_xml_files,
                                                        keys_to_parse = keys_to_parse
                                                        )


Iterating file...: xml_0.xml

Iterating file...: xml_1.xml
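For readers unfamiliar with the PubMed XML layout, here is a minimal standard-library sketch of pulling affiliation strings out of such a file. It is not puboracle's parser: `extract_affiliations` and the sample record are hypothetical, and joining the affiliations of one article with `;` is an assumption suggested by the `split(';')` step further down.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written record that mimics the PubMed XML layout (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <AuthorList>
          <Author>
            <AffiliationInfo>
              <Affiliation>Department of Examples, Example University, Example City, Exampleland.</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author>
            <AffiliationInfo>
              <Affiliation>Institute of Samples, Sample Town, Exampleland.</Affiliation>
            </AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_affiliations(xml_text):
    """Return one ';'-joined affiliation string per PubmedArticle (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    per_article = []
    for article in root.iter("PubmedArticle"):
        # Collect the text of every <Affiliation> element of this article
        affils = [node.text.strip() for node in article.iter("Affiliation") if node.text]
        per_article.append(";".join(affils))
    return per_article


sample_affiliations = extract_affiliations(SAMPLE_XML)
print(sample_affiliations)
```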


Extract affiliations

In [4]:
affiliations = pub_data[keys_to_parse.index('affiliations')]

Remove unwanted elements from affiliations:
i. email addresses
ii. author names or initials in parentheses or square brackets (this also removes location acronyms given in parentheses, which is acceptable)
iii. the word "and" at the beginning of an affiliation

In [5]:
affiliations_cleaned = []
for i,affil in enumerate(affiliations):
    affil_split = affil.split(';')
    items_to_re_join = []
    for item in affil_split:
        cleaned = re.sub(r"[\(\[].*?[\)\]]", "", item)  # remove text in parentheses or square brackets
        cleaned = re.sub(r"\S*@\S*\s?", "", cleaned).rstrip()  # remove email address
        cleaned = cleaned.replace('electronic address:', '')  # remove 'electronic address:'
        cleaned = cleaned.replace('Electronic address:', '')  # remove 'Electronic address:'
        cleaned = re.sub(r"^\sand", "", cleaned)  # remove 'and' from the beginning (preceded by whitespace)
        items_to_re_join.append(cleaned)    
    affiliations_cleaned.append(';'.join(items_to_re_join))    
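To see what these substitutions do to a single item, the sketch below wraps the same five steps in a small function and applies it to two made-up strings. The helper name `clean_affiliation_item` and the sample inputs are illustrative only.

```python
import re


def clean_affiliation_item(item):
    """Apply the same substitutions as the loop above to one ';'-separated item."""
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", item)          # drop text in () or []
    cleaned = re.sub(r"\S*@\S*\s?", "", cleaned).rstrip()   # drop email addresses
    cleaned = cleaned.replace("electronic address:", "")
    cleaned = cleaned.replace("Electronic address:", "")
    cleaned = re.sub(r"^\sand", "", cleaned)                # drop a leading ' and'
    return cleaned


# Made-up inputs, not taken from the downloaded data
with_email = ("Department of Examples (J.D.), Example University, Example City. "
              "Electronic address: jane@example.org")
with_and = " and Institute of Samples, Sample Town."

cleaned_email = clean_affiliation_item(with_email)
cleaned_and = clean_affiliation_item(with_and)
print(repr(cleaned_email))
print(repr(cleaned_and))
```

Note that the cleaned strings can keep stray whitespace, for example a space before a comma where a parenthesised name was removed, or a leading space after `and` is dropped. The geolocation log further down shows the same residue; a final `.strip()` would be one way to tidy the ends if that matters.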

Remove empty entries

In [6]:
affiliations_cleaned = [affil for affil in affiliations_cleaned if affil]

Get all the unique cleaned affiliations

In [7]:
(all_affiliations_cleaned, 
 unique_affiliations_cleaned, 
 occurences) = txtmetrics.get_unique_strs(affiliations_cleaned,
                                          exclude= ['',' ']
                                          )

Geolocation: extract latitude and longitude from the textual description of the locations in each affiliation

In [None]:
lat,lon,txtforloc = txt2geo.get_lat_lon_from_text_wordwise(
                                                           unique_affiliations_cleaned,
                                                           reverse = False,
                                                           verbose = True,
                                                           user_agent = 'affiliations_geoloc'
                                                           )


Unpacking textual location descriptions...: 1 / 4318

Searching for latitude and longitude for location description: Wellcome Centre for Neuroimaging, University College London, London, UK.

Unpacking textual location descriptions...: 2 / 4318

Searching for latitude and longitude for location description: Division of Neuropathology and Neurochemistry, Department of Neurology, Medical University of Vienna, Vienna, Austria.

Unpacking textual location descriptions...: 3 / 4318

Searching for latitude and longitude for location description: Department of Radiology, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung 83301, Taiwan.

Unpacking textual location descriptions...: 4 / 4318

Searching for latitude and longitude for location description: ATR Neural Information Analysis Laboratories, Kyoto, Japan.

Unpacking textual location descriptions...: 5 / 4318

Searching for latitude and longitude for location description: Department of Psychology,


Unpacking textual location descriptions...: 37 / 4318

Searching for latitude and longitude for location description: Unger-Vetlesen Institute, Lovisenberg Diaconal Hospital, Oslo, Norway.

Unpacking textual location descriptions...: 38 / 4318

Searching for latitude and longitude for location description: Department of Radiology and Biomedical Research Imaging Center , University of North Carolina at Chapel Hill, NC, U.S.A.

Unpacking textual location descriptions...: 39 / 4318

Searching for latitude and longitude for location description: Department of Psychology, University of Georgia, Athens, Georgia, USA.

Unpacking textual location descriptions...: 40 / 4318

Searching for latitude and longitude for location description: MICA Laboratory, Montreal Neurological Institute and Hospital, Montreal, Canada. 

Unpacking textual location descriptions...: 41 / 4318

Searching for latitude and longitude for location description:  Graduate Institute of Clinical Medical Science, China Medic


Unpacking textual location descriptions...: 74 / 4318

Searching for latitude and longitude for location description:  BrainLinks-BrainTools Cluster of Excellence, University of Freiburg, Georges-Köhler-Allee 80, 79110, Freiburg, Germany.

Unpacking textual location descriptions...: 75 / 4318

Searching for latitude and longitude for location description: Division of Brain, Imaging, and Behaviour, Krembil Brain Institute, Krembil Research Institute, Toronto Western Hospital, University Health Network, Toronto, ON, Canada.

Unpacking textual location descriptions...: 76 / 4318

Searching for latitude and longitude for location description:  Department of Neurology, Institute of Memory and Alzheimer's Disease , Pitié-Salpêtrière Hospital, AP-HP, Boulevard de l'hôpital, Paris, France.

Unpacking textual location descriptions...: 77 / 4318

Searching for latitude and longitude for location description: Department of Psychology, Yale University, New Haven, CT.

Unpacking textual location d


Unpacking textual location descriptions...: 108 / 4318

Searching for latitude and longitude for location description: Translational Neuromodeling Unit , Institute for Biomedical Engineering, University of Zurich & ETH Zurich, Wilfriedstrasse 6, 8032 Zurich, Switzerland.

Unpacking textual location descriptions...: 109 / 4318

Searching for latitude and longitude for location description: Section on Quantitative Imaging and Tissue Sciences, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, United States. 

Unpacking textual location descriptions...: 110 / 4318

Searching for latitude and longitude for location description: Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, School of Psychology, Center for Studies of Psychological Application, Guangdong Key Laboratory of Mental Health and Cognitive Science, South China Normal University, Guangzhou, Guangdong Province, 510631, China. 

Unpacking


Unpacking textual location descriptions...: 144 / 4318

Searching for latitude and longitude for location description: Department of Diagnostics, Affiliated Hospital, Medical School, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China.

Unpacking textual location descriptions...: 145 / 4318

Searching for latitude and longitude for location description: Department of Computer Science, University of North Carolina, Chapel Hill, NC, USA.

Unpacking textual location descriptions...: 146 / 4318

Searching for latitude and longitude for location description: Center for Minimally Invasive Neurosurgery, Prince of Wales Private Hospital, Sydney, Australia.

Unpacking textual location descriptions...: 147 / 4318

Searching for latitude and longitude for location description:  McGovern Institute for Brain Research, Peking University, Beijing, China. 

Unpacking textual location descriptions...: 148 / 4318

Searching for latitude and longitude for location description: Department of Rad


Unpacking textual location descriptions...: 180 / 4318

Searching for latitude and longitude for location description: Neuroscience Center Zurich, Zurich, Switzerland.

Unpacking textual location descriptions...: 181 / 4318

Searching for latitude and longitude for location description: Department of Neurosurgery, University of Tuebingen, Tuebingen, Germany.

Unpacking textual location descriptions...: 182 / 4318

Searching for latitude and longitude for location description: Department of Neurology, Drum Tower Hospital, Medical School and The State Key Laboratory of Pharmaceutical Biotechnology, Institute of Brain Science, Nanjing University, 321 Zhongshan Road, Nanjing, Jiangsu, 210008, P. R. China.

Unpacking textual location descriptions...: 183 / 4318

Searching for latitude and longitude for location description: Berenson-Allen Center for Noninvasive Brain Stimulation, Department of Neurology, Harvard Medical School and Beth Israel Deaconess Medical Center, 330 Brookline Avenue,


Unpacking textual location descriptions...: 217 / 4318

Searching for latitude and longitude for location description: Center for the Developing Brain, Child Mind Institute, New York, New York

Unpacking textual location descriptions...: 218 / 4318

Searching for latitude and longitude for location description: Department of Medicine, Massachusetts General Hospital, Boston, MA, USA.

Unpacking textual location descriptions...: 219 / 4318

Searching for latitude and longitude for location description: Department of Neurology, NYU Grossman School of Medicine, New York, NY, USA.

Unpacking textual location descriptions...: 220 / 4318

Searching for latitude and longitude for location description: Department of Statistics, Florida State University, Tallahassee, FL, United States.

Unpacking textual location descriptions...: 221 / 4318

Searching for latitude and longitude for location description: Center for Affective Neuroscience, Development, Learning and Education

Unpacking textual lo


Unpacking textual location descriptions...: 254 / 4318

Searching for latitude and longitude for location description: Department of Ophthalmology, The Jikei University School of Medicine, Tokyo, Japan.

Unpacking textual location descriptions...: 255 / 4318

Searching for latitude and longitude for location description: Medical University of Vienna, Vienna, Austria.

Unpacking textual location descriptions...: 256 / 4318

Searching for latitude and longitude for location description:  Laboratory of Microneurosurgical Anatomy, BP-A Beneficência Portuguesa de São Paulo, São Paulo, SP, Brazil. 

Unpacking textual location descriptions...: 257 / 4318

Searching for latitude and longitude for location description: Azm Center for Research in Biotechnology and Its Application, EDST, Lebanese University, Tripoli, Lebanon.

Unpacking textual location descriptions...: 258 / 4318

Searching for latitude and longitude for location description:  State Key Laboratory of Magnetic Resonance and Atom


Unpacking textual location descriptions...: 290 / 4318

Searching for latitude and longitude for location description:  School of Computer Science, McGill University, Montreal, Canada.

Unpacking textual location descriptions...: 291 / 4318

Searching for latitude and longitude for location description:  Amsterdam UMC, Vrije Universiteit Amsterdam, Department of Psychiatry, Amsterdam Neuroscience, De Boelelaan 1108, 1081 HZ, Amsterdam, the Netherlands

Unpacking textual location descriptions...: 292 / 4318

Searching for latitude and longitude for location description: Department of Psychiatric-Neuroimaging-Genetics and Morbidity Laboratory , Tianjin Mental Health Center, Nankai University Affiliated Anding Hospital, Tianjin, 300222, China.

Unpacking textual location descriptions...: 293 / 4318

Searching for latitude and longitude for location description: Division of Developmental Medicine, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA

Unpacking textual locat


Unpacking textual location descriptions...: 328 / 4318

Searching for latitude and longitude for location description:  Guangdong-Hong Kong-Macao Greater Bay Area Center for Brain Science and Brain-Inspired Intelligence, Guangzhou, China

Unpacking textual location descriptions...: 329 / 4318

Searching for latitude and longitude for location description: Department of Psychology, Sun Yat-sen University, Guangzhou, China.

Unpacking textual location descriptions...: 330 / 4318

Searching for latitude and longitude for location description: Department of Functional Neurosurgery, National Institute of Clinical Neurosciences, Budapest, H-1145, Hungary.

Unpacking textual location descriptions...: 331 / 4318

Searching for latitude and longitude for location description: Department of Pediatrics, Perinatal Institute  and Department of Electronic Engineering and Computing Science, University of Cincinnati, Cincinnati, Ohio

Unpacking textual location descriptions...: 332 / 4318

Searching 


Unpacking textual location descriptions...: 366 / 4318

Searching for latitude and longitude for location description: Imaging Research Laboratories, Robarts Research Institute, London, Ontario, Canada.

Unpacking textual location descriptions...: 367 / 4318

Searching for latitude and longitude for location description: Neuroscience Graduate Program, University of Rochester Medical Center, Rochester, New York 14620

Unpacking textual location descriptions...: 368 / 4318

Searching for latitude and longitude for location description:  Nathan Kline Institute for Psychiatric Research, Orangeburg, NY 10962, USA.

Unpacking textual location descriptions...: 369 / 4318

Searching for latitude and longitude for location description: Department for Clinical and Movement Neurosciences, UCL Queen Square Institute of Neurology, 33 Queen Square, London, WC1N 3BG, UK. 

Unpacking textual location descriptions...: 370 / 4318

Searching for latitude and longitude for location description: Laureate 


Unpacking textual location descriptions...: 402 / 4318

Searching for latitude and longitude for location description: Department of Neurosurgery, University of Oklahoma Health Science Center, Oklahoma City, OK, United States of America.

Unpacking textual location descriptions...: 403 / 4318

Searching for latitude and longitude for location description: Larner College of Medicine, University of Vermont, Burlington, VT, USA.

Unpacking textual location descriptions...: 404 / 4318

Searching for latitude and longitude for location description:  Department of Ophthalmology, Boston Children's Hospital, Harvard Medical School, Boston, MA 02115, USA.

Unpacking textual location descriptions...: 405 / 4318

Searching for latitude and longitude for location description: Institute for Information and Communication Technologies, Electronics and Applied Mathematics , Université catholique de Louvain, Louvain-la-Neuve, Belgium. 

Unpacking textual location descriptions...: 406 / 4318

Searching


Unpacking textual location descriptions...: 439 / 4318

Searching for latitude and longitude for location description:  School of Psychology, Shandong Normal University, Jinan, 250358, China.

Unpacking textual location descriptions...: 440 / 4318

Searching for latitude and longitude for location description: College of Information Science and Technology, Beijing Normal University, Beijing, 100875, China.

Unpacking textual location descriptions...: 441 / 4318

Searching for latitude and longitude for location description: IBM T.J. Watson Research Center, Yorktown Heights, New York, USA.

Unpacking textual location descriptions...: 442 / 4318

Searching for latitude and longitude for location description: Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102, USA.

Unpacking textual location descriptions...: 443 / 4318

Searching for latitude and longitude for location description:  Department of Bioengineering, University of Pennsylvania, Philadelphi


Unpacking textual location descriptions...: 478 / 4318

Searching for latitude and longitude for location description: Department of Radiology, University of Wisconsin-Madison, Madison, WI, USA.

Unpacking textual location descriptions...: 479 / 4318

Searching for latitude and longitude for location description: Aix Marseille Univ, Inserm, INS, Institut de Neurosciences des Systèmes, Marseille, France

Unpacking textual location descriptions...: 480 / 4318

Searching for latitude and longitude for location description: Centre for Mathematical Medicine and Biology, School of Mathematical Sciences, University of Nottingham, Nottingham, UK.

Unpacking textual location descriptions...: 481 / 4318

Searching for latitude and longitude for location description:  Department of Neuroscience, University of Rochester Medical Center, USA

Unpacking textual location descriptions...: 482 / 4318

Searching for latitude and longitude for location description: National Institute of Mental Health, Kl


Unpacking textual location descriptions...: 514 / 4318

Searching for latitude and longitude for location description: Department of Neurosurgery, Assistance Publique-Hopitaux de Paris , U955 INSERM IMRB eq.15, University of Paris 12 UPEC, Faculté de Médecine, Créteil 94010, France.

Unpacking textual location descriptions...: 515 / 4318

Searching for latitude and longitude for location description: Faculty of Engineering, Universidad de Concepción, Concepción, Chile.

Unpacking textual location descriptions...: 516 / 4318

Searching for latitude and longitude for location description:  Departments of Psychology and Neuroscience , University of Colorado at Boulder

Unpacking textual location descriptions...: 517 / 4318

Searching for latitude and longitude for location description: Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06510.

Unpacking textual location descriptions...: 518 / 4318

Searching for latitude and longitude for location description:  M

Remove entries whose latitude or longitude is NaN and visualize the rest on a global map

In [None]:
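# Keep only pairs where both values are valid, so lat and lon stay aligned
coords = [(la, lo) for la, lo in zip(lat, lon) if not (math.isnan(la) or math.isnan(lo))]
lat = [la for la, _ in coords]
lon = [lo for _, lo in coords]
visfun.vis_lon_lat(longitude=lon, latitude=lat)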

The map above gives an overview of where across the globe the research was conducted
that resulted in publications with the keywords "connectome" or "connectomics" in the last year.

Visualize the top 10 affiliations by number of publications, after merging the publication counts of affiliations
whose string similarity exceeds a threshold

In [None]:
(affiliations_nrpubs_topmerged, 
 _) = txtmetrics.add_by_similarity(occurences, 
                                   topN = 10, 
                                   look_ahead = 500,
                                   threshold = 0.8
                                  )
                                                     
visfun.visualize_counter_selection(affiliations_nrpubs_topmerged)

The number of publications per affiliation is highly uneven, with the
Janelia Research Campus, Howard Hughes Medical, USA at the top with more than 80 publications.
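The idea of merging counts across near-duplicate affiliation strings can be sketched with the standard library. This version uses `difflib.SequenceMatcher` as the similarity measure, which is a stand-in: the measure used by `add_by_similarity` is not shown here, and the `look_ahead` parameter is not modelled. The function name `merge_similar_counts` and the sample counts are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_similar_counts(counts, threshold=0.8, top_n=10):
    """Greedily fold each string's count into the first more frequent string it resembles."""
    merged = []  # list of [representative string, summed count]
    for name, n in counts.most_common():
        for entry in merged:
            if SequenceMatcher(None, name, entry[0]).ratio() >= threshold:
                entry[1] += n
                break
        else:
            # No sufficiently similar representative yet: start a new group
            merged.append([name, n])
    merged.sort(key=lambda e: e[1], reverse=True)
    return Counter({name: n for name, n in merged[:top_n]})


# Made-up counts: the first two strings differ only by a trailing period
sample_counts = Counter({
    "Example University, Example City, Exampleland.": 5,
    "Example University, Example City, Exampleland": 3,
    "Institute of Samples, Sample Town": 2,
})
merged_counts = merge_similar_counts(sample_counts)
print(merged_counts.most_common())
```

Processing strings in descending order of frequency makes the most common spelling the representative of each group, which keeps the chart labels readable.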

In [None]:
all_edges = construct_edges_list(unique_items, 
                                 list_coitems=co_items,
                                 exclude=[]
                                 )

In [10]:
len(unique_affiliations_cleaned)

4318

In [None]:
affiliations_cleaned