The purpose of this notebook is to test the Open Education API for retrieving institution information and rake-nltk for keyword extraction.

In [40]:
import requests
import io
import pandas as pd
import regex as re
import numpy as np

from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text
import PyPDF2 as pdf2

import pycountry


In [11]:
%store -r date_df_ok


## Affiliation Retrieval:

During affiliation retrieval testing, the easiest way to extract this information turned out to be a regex capture pattern applied to the PDF text of the article.

Every other method either requires a tedious API process or risks returning incorrect information.

Therefore, two capture patterns will be used on the article:

1- "University name" OR 2- "email@address.pt"

If the country can be derived from the country code of the e-mail address, great; if not, the university name can be processed instead.


### Affiliation Regex Patterns:

In [166]:
# PART 1 - E-mail Capturing:

rec_email = r"([a-zA-Z0-9._-]+@([a-zA-Z0-9._-]+\.([a-zA-Z0-9_-]+)))"

re.findall(re.compile(rec_email, re.IGNORECASE),"bokbo aasghdjhgas merhhaba okokes@ku.edu.tr asdghjasgdash")

#mail_found = 
#cntry_code_dict[mail_found.upper()]

[('okokes@ku.edu.tr', 'ku.edu.tr', 'tr')]

In [169]:
re.search(re.compile(rec_email, re.IGNORECASE),"bokbo aasghdjhgas merhhaba okokes@ku.edu.tr asdghjasgdash").group(2)

'ku.edu.tr'

Email regex is complete!

Now to get the country info from the mail ending:

In [66]:
cntry_name_list = [cntry.name for cntry in list(pycountry.countries)]

cntry_code_dict = {cntry.alpha_2 : cntry.name for cntry in list(pycountry.countries)}

In [222]:
def from_code_to_country(pdftext):
    rec_email = r"([a-zA-Z0-9._-]+@([a-zA-Z0-9._-]+\.([a-zA-Z0-9_-]+)))"
    cntry_code_dict = {cntry.alpha_2 : cntry.name for cntry in list(pycountry.countries)}

    ccode = re.search(re.compile(rec_email, re.IGNORECASE),pdftext).group(3)
    if ccode == "com":
        cntry_found = np.nan

    else:
        if len(ccode)>2:
            ccode = ccode[:2]
        cntry_found = cntry_code_dict[ccode.upper()]

    return cntry_found


Testing:

In [146]:
random_dir_url = date_df_ok.direct_url.sample(1).values[0]

#Manual Testing - PART 2: extract_text()

HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'}

pdf_req = requests.get(random_dir_url,headers=HEADERS)

print(pdf_req)
pdf_io = io.BytesIO(pdf_req.content)
pdf_read = extract_text(pdf_io,page_numbers=[])


<Response [200]>


In [147]:
random_dir_url

'https://sci.bban.top/pdf/10.3390/antibiotics5040034.pdf#view=FitH'

In [223]:
# E-mail match - Option 1:

from_code_to_country(pdf_read)

# Fast, but NOT working for US domains (e.g. .edu), as they have no country code at the end

KeyError: 'ED'
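
The `KeyError: 'ED'` above comes from truncating `edu` to `ED`, which is not a country code. A minimal sketch of a more defensive lookup follows. It assumes a small hand-written `SAMPLE_CODES` mapping as a stand-in for `cntry_code_dict`, and the helper name `tld_to_country` is hypothetical, not part of the notebook.

```python
import re

# Same idea as rec_email, but only the final label (the TLD) is captured.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.([a-zA-Z0-9_-]+)")

# Stand-in for cntry_code_dict (built from pycountry in the notebook).
SAMPLE_CODES = {"TR": "Turkey", "PT": "Portugal", "CO": "Colombia"}


def tld_to_country(text, code_to_country=SAMPLE_CODES):
    """Return the country for the first e-mail's TLD, or None.

    Only two-letter TLDs are treated as country codes, so generic TLDs
    such as "com" or "edu" are never truncated into a wrong code.
    """
    match = EMAIL_RE.search(text)
    if match is None:
        return None  # no e-mail address in the text
    tld = match.group(1)
    if len(tld) != 2:
        return None  # generic TLD: carries no country information
    return code_to_country.get(tld.upper())
```

Returning `None` instead of raising leaves room for a fallback such as the domain or university-name lookups tested further down.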

In [219]:
# E-mail match - Option 2:


# re.escape keeps the dots in the domain names literal
email_opt2_match = re.findall(r"(?=(" + "|".join(map(re.escape, uni_code_country_dict)) + r"))", pdf_read)[0]


uni_code_country_dict[email_opt2_match]

# This option is REALLY slow (8-9 sec) because it scans the entire PDF string, but it does not need any e-mail string to be extracted first.

'United States'

In [236]:
# E-mail match - Option 3:
email_reg = re.compile(r"([a-zA-Z0-9._-]+@([a-zA-Z0-9._-]+\.([a-zA-Z0-9_-]+)))", re.IGNORECASE)

uni_code_country_dict[re.search(email_reg,pdf_read).group(2)]

'United States'
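
Option 3 looks the captured domain up directly, so an address on a subdomain would presumably miss the dict. A minimal sketch of a suffix fallback, assuming a tiny `sample_domains` mapping in place of `uni_code_country_dict`; the function name `lookup_domain_country` is hypothetical.

```python
def lookup_domain_country(domain, domain_to_country):
    """Try the full domain first, then progressively shorter suffixes."""
    labels = domain.lower().split(".")
    # For "cs.umn.edu" this tries "cs.umn.edu", then "umn.edu";
    # the bare TLD ("edu") is never tried.
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in domain_to_country:
            return domain_to_country[candidate]
    return None


# Stand-in for uni_code_country_dict
sample_domains = {"umn.edu": "United States", "ku.edu.tr": "Turkey"}
```

An unknown domain such as `gmail.com` yields `None` rather than a `KeyError`, which also keeps free-mail addresses from being mapped to a country.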

In [144]:
# Country Match - Option 1:
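# Escape the names: some contain regex metacharacters such as parentheses
re.findall(r"(?=(" + "|".join(map(re.escape, cntry_name_list)) + r"))", pdf_read)[0]

# This option is slower, but the matches come back in the order the countries are mentioned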

'Israel'

In [145]:
# Country match - Option 2: 
str_match = [s for s in cntry_name_list if s in pdf_read]
print(str_match)

# This option is faster, but it returns the matched countries in list (alphabetical) order rather than mention order

['Israel']
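
The two country-match options trade speed against ordering. A minimal sketch that keeps the plain substring test of option 2 but sorts the hits by first occurrence; the function name `countries_in_mention_order` is hypothetical and the short name list stands in for `cntry_name_list`.

```python
def countries_in_mention_order(text, country_names):
    """Return the country names found in text, ordered by first occurrence."""
    positions = []
    for name in country_names:
        index = text.find(name)  # -1 when the name is absent
        if index != -1:
            positions.append((index, name))
    return [name for _, name in sorted(positions)]


sample_names = ["Finland", "Israel", "Turkey"]
sample_text = "Based in Israel, with partners in Finland."
found = countries_in_mention_order(sample_text, sample_names)
```

As with option 2, this is a plain substring test, so a short name can match inside a longer one (for example "Niger" inside "Nigeria").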


In [None]:
# University Match - option 1:

# same as country match opt 1, join all uni names & search in whole text

In [None]:
# University Match - option 2:

# same as email match opt 3: create a regex pattern to capture the uni name, then use the dict (or join its keys) to check only the captured parts
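
One possible shape for University Match option 2, as a rough sketch only. The pattern `UNI_RE`, the function name `find_university_country` and the sample dict (a stand-in for `uni_name_country_dict`) are assumptions; real affiliation strings will need a broader pattern.

```python
import re

# One to four capitalised words followed by "University", or
# "University of" followed by one to three capitalised words.
UNI_RE = re.compile(
    r"(?:[A-Z][\w'-]*\s+){1,4}University\b"
    r"|\bUniversity\s+of(?:\s+[A-Z][\w'-]*){1,3}"
)


def find_university_country(text, name_to_country):
    """Return (name, country) for the first known university in text, else None."""
    for match in UNI_RE.finditer(text):
        words = match.group(0).split()
        # Try every contiguous word span, longest first, so that
        # "The Koç University" still resolves to "Koç University".
        spans = [(i, j) for i in range(len(words))
                 for j in range(i + 1, len(words) + 1)]
        for i, j in sorted(spans, key=lambda s: s[0] - s[1]):
            candidate = " ".join(words[i:j])
            if candidate in name_to_country:
                return candidate, name_to_country[candidate]
    return None


# Stand-in for uni_name_country_dict
sample_unis = {
    "Koç University": "Turkey",
    "Michigan State University": "United States",
    "University of Helsinki": "Finland",
}
```

Only the short captured spans are checked against the dict, which avoids scanning the whole PDF string against every university name.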

MUST FIX observations from tests:

* gmail.com e-mails -> COLOMBIA! (presumably "com" truncated to "CO")
* "USA" is not in the country list
* The e-mail and country matches give different results in some cases; a third "University" field would be very useful

Other noteworthy observations:
* Not all authors have e-mail info, but they do have a written address!
* Not all first authors work at a "University"; some work at an "Institute"
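
Two of the MUST FIX items could be handled with small lookup tables. A minimal sketch, assuming hand-picked entries; `COUNTRY_ALIASES`, `FREE_MAIL_DOMAINS` and both function names are hypothetical and the lists are deliberately incomplete.

```python
# Spellings that appear in papers but not in the country name list.
COUNTRY_ALIASES = {"USA": "United States", "UK": "United Kingdom"}

# Free-mail providers whose domain says nothing about the affiliation.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def canonical_country(name):
    """Map an alias such as "USA" to the name used in the country list."""
    return COUNTRY_ALIASES.get(name, name)


def is_free_mail(domain):
    """True when the e-mail domain belongs to a free-mail provider."""
    return domain.lower() in FREE_MAIL_DOMAINS
```

The alias keys would also have to be added to the names searched for in the PDF text, otherwise "USA" is still never found.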

---


The world universities JSON offers the name, domain names and country of universities worldwide. It might therefore be useful for both e-mail capture and university-name capture. It is tested below:

In [150]:
import json

In [155]:
with open('world_universities_and_domains.json') as fp:
    uni_dict = json.load(fp)

In [None]:
uni_code_country_dict = {}
uni_name_country_dict = {}

In [160]:
uni_df = pd.DataFrame(uni_dict)

In [211]:
uni_code_country_dict = {}
uni_name_country_dict = {}

for index, row in uni_df.iterrows():
    uni_code_country_dict.update({dom : row.country for dom in row.domains})
    uni_name_country_dict.update({row["name"] : row.country})


In [213]:
uni_name_country_dict["Koç University"]

'Turkey'

In [212]:
uni_code_country_dict["ku.edu.tr"]

'Turkey'

In [185]:
uni_code_country_dict["unl.pt"]

'Portugal'

In [215]:
list(uni_code_country_dict.keys())

['marywood.edu',
 'upes.ac.in',
 'cstj.qc.ca',
 'lindenwood.edu',
 'davietjal.org',
 'lpu.in',
 'sullivan.edu',
 'fscj.edu',
 'xavier.edu',
 'tusculum.edu',
 'cst.edu',
 'somaiya.edu',
 'columbiasc.edu',
 'clpccd.edu',
 'keller.edu',
 'monroecollege.edu',
 'smccd.edu',
 'losrios.edu',
 'digipen.edu',
 'upmc.edu',
 'upmc.com',
 'cgu.edu',
 'utrgv.edu',
 'mountsaintvincent.edu',
 'uasys.edu',
 'ecpi.edu',
 'umw.edu',
 'bw.edu',
 'csuci.edu',
 'brandman.edu',
 'uscga.edu',
 'athens.edu',
 'scripps.edu',
 'easternflorida.edu',
 'wne.edu',
 'king.edu',
 'ggc.edu',
 'trident.edu',
 'alliant.edu',
 'mvsu.edu',
 'roosevelt.edu',
 'itt-tech.info',
 'itt-tech.edu',
 'iecc.edu',
 'park.edu',
 'mssm.edu',
 'uvu.edu',
 'wlc.edu',
 'rccd.edu',
 'wakehealth.edu',
 'umb.edu',
 'floridapoly.edu',
 'wagner.edu',
 'wilmu.edu',
 'itu.edu',
 'yhc.edu',
 'findlay.edu',
 'pcom.edu',
 'yosemite.edu',
 'coastalalabama.edu',
 'pnw.edu',
 'columbiabasin.edu',
 'seattlecolleges.edu',
 'lipscomb.edu',
 'tiffin.edu

In [218]:
uni_code_country_dict["umn.edu"]

'United States'

The code below shows that some domains occur more than once, so the dict has fewer entries than there are domains in the JSON. One such example is rutgers.edu, which is analysed in more detail below:

In [192]:
#from collections import Counter

#unlist =uni_df.domains.tolist()
#flatun = [item for sublist in unlist for item in sublist]
#Counter(flatun)

#uni_df[uni_df.domains.map(lambda x: "rutgers.edu" in x)]
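
The duplicate check in the commented-out cell can also be sketched without pandas. The records below are made-up stand-ins that only mirror the `name` / `domains` / `country` layout of the JSON; `duplicated_domains` is a hypothetical helper name.

```python
from collections import Counter

# Made-up records mirroring the layout of the world universities JSON.
sample_records = [
    {"name": "Example University A", "domains": ["example.edu"],
     "country": "United States"},
    {"name": "Example University B", "domains": ["example.edu", "b.example.edu"],
     "country": "United States"},
    {"name": "Koç University", "domains": ["ku.edu.tr"], "country": "Turkey"},
]


def duplicated_domains(records):
    """Return {domain: count} for domains listed by more than one entry."""
    counts = Counter(dom for rec in records for dom in rec["domains"])
    return {dom: n for dom, n in counts.items() if n > 1}


dupes = duplicated_domains(sample_records)
```

Because the dict built with `update` keeps only the last entry per domain, duplicates are harmless for the country lookup as long as all entries sharing a domain agree on the country.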

### Scholarly - Affiliation Retrieval Tests:

In [16]:
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.FreeProxies()
scholarly.use_proxy(pg)

search_qry = scholarly.search_author('Steven A Cholewiak')

author = next(search_qry)


In [17]:
author

{'container_type': 'Author',
 'filled': [],
 'source': <AuthorSource.SEARCH_AUTHOR_SNIPPETS: 'SEARCH_AUTHOR_SNIPPETS'>,
 'scholar_id': '4bahYMkAAAAJ',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=4bahYMkAAAAJ',
 'name': 'Steven A. Cholewiak, PhD',
 'affiliation': 'Vision Scientist at Google LLC',
 'email_domain': '@google.com',
 'interests': ['Depth Cues',
  '3D Shape',
  'Shape from Texture & Shading',
  'Naive Physics',
  'Haptics'],
 'citedby': 399}

In [18]:
search_qry = scholarly.search_author("yan dongpeng")
author = next(search_qry)

In [22]:
author

{'container_type': 'Author',
 'filled': [],
 'source': <AuthorSource.SEARCH_AUTHOR_SNIPPETS: 'SEARCH_AUTHOR_SNIPPETS'>,
 'scholar_id': 'CDfzddYAAAAJ',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=CDfzddYAAAAJ',
 'name': 'Dongpeng Yan',
 'affiliation': 'Beijing Normal University, Beijing University of Chemical Technology',
 'email_domain': '@bnu.edu.cn',
 'interests': ['Materials Chemistry',
  'Molecule Science',
  'Chemical Engineering'],
 'citedby': 11105}

In [29]:
date_df_ok.loc["10.1002/aic.12400","subject"]

['General Chemical Engineering', 'Environmental Engineering', 'Biotechnology']

In [30]:
any(intr in " ".join(date_df_ok.loc["10.1002/aic.12400","subject"]) for intr in author["interests"])

True

In [28]:
any(x in author["interests"] for x in date_df_ok.loc["10.1002/aic.12400","subject"])

False

In [27]:
[True if int in author["interests"] in date_df_ok.loc["10.1002/aic.12400","subject"]]

SyntaxError: invalid syntax (<ipython-input-27-b449ccb9a97d>, line 1)
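
The three attempts above differ in what they compare: In [30] tests each interest as a substring of the joined subjects (True), In [28] tests each subject for exact membership in the interests (False), and In [27] is not a valid comprehension. A minimal sketch of the substring variant that also reports which interests matched; `shared_topics` is a hypothetical name and the lists are copied from the outputs above.

```python
def shared_topics(interests, subjects):
    """Return the interests that occur, case-insensitively, inside any subject."""
    lowered = [subject.lower() for subject in subjects]
    return [interest for interest in interests
            if any(interest.lower() in subject for subject in lowered)]


interests = ["Materials Chemistry", "Molecule Science", "Chemical Engineering"]
subjects = ["General Chemical Engineering", "Environmental Engineering",
            "Biotechnology"]
matches = shared_topics(interests, subjects)
```

A non-empty result could serve as a weak signal that the Scholar profile belongs to the article's author rather than to a namesake.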

In [13]:
sm_test = scholarly.search_author("yan dongpeng")

In [23]:
date_df_ok.head()

DOI,reference-count,publisher,published-print,is-referenced-by-count,title,author,published-online,container-title,issued,ISSN,subject,published,Results,direct_url,Rec_date,Acc_date,first_author
10.1002/aic.12400,26,Wiley,"{'date-parts': [[2011, 7]]}",14,[Fabrication of an anionic polythiophene/layer...,"[{'given': 'Dongpeng', 'family': 'Yan', 'seque...","{'date-parts': [[2010, 9, 17]]}",[AIChE Journal],"{'date-parts': [[2010, 9, 17]]}",[0001-1541],"[General Chemical Engineering, Environmental E...","{'date-parts': [[2010, 9, 17]]}",[https://sci.bban.top/pdf/10.1002/aic.12400.pd...,https://sci.bban.top/pdf/10.1002/aic.12400.pdf...,,,"{'given': 'Dongpeng', 'family': 'Yan', 'sequen..."
10.1002/aic.12671,14,Wiley,"{'date-parts': [[2012, 5]]}",0,[Gas phase reaction kinetics in boron fibre pr...,"[{'given': 'Fatih', 'family': 'Fırat', 'sequen...","{'date-parts': [[2011, 6, 1]]}",[AIChE Journal],"{'date-parts': [[2011, 6, 1]]}",[0001-1541],"[General Chemical Engineering, Environmental E...","{'date-parts': [[2011, 6, 1]]}",[https://sci.bban.top/pdf/10.1002/aic.12671.pd...,https://sci.bban.top/pdf/10.1002/aic.12671.pdf...,,,"{'given': 'Fatih', 'family': 'Fırat', 'sequenc..."
10.1002/aic.13810,37,Wiley,"{'date-parts': [[2013, 1]]}",17,[Rapid assembly of polyelectrolyte multilayer ...,"[{'given': 'Haiqi', 'family': 'Tang', 'sequenc...","{'date-parts': [[2012, 4, 23]]}",[AIChE Journal],"{'date-parts': [[2012, 4, 23]]}",[0001-1541],"[General Chemical Engineering, Environmental E...","{'date-parts': [[2012, 4, 23]]}",[https://sci.bban.top/pdf/10.1002/aic.13810.pd...,https://sci.bban.top/pdf/10.1002/aic.13810.pdf...,,,"{'given': 'Haiqi', 'family': 'Tang', 'sequence..."
10.1002/aic.14056,0,Wiley,"{'date-parts': [[2013, 4]]}",19,[A computer-aided methodology to design safe f...,"[{'given': 'Phuong-Mai', 'family': 'Nguyen', '...","{'date-parts': [[2013, 3, 6]]}",[AIChE Journal],"{'date-parts': [[2013, 3, 6]]}",[0001-1541],"[General Chemical Engineering, Environmental E...","{'date-parts': [[2013, 3, 6]]}",[https://sci.bban.top/pdf/10.1002/aic.14056.pd...,https://sci.bban.top/pdf/10.1002/aic.14056.pdf...,"July 3, 2012",,"{'given': 'Phuong-Mai', 'family': 'Nguyen', 's..."
10.1002/aic.14601,37,Wiley,"{'date-parts': [[2014, 11]]}",12,[Cake properties of nanocolloid evaluated by v...,"[{'given': 'Eiji', 'family': 'Iritani', 'seque...","{'date-parts': [[2014, 9, 16]]}",[AIChE Journal],"{'date-parts': [[2014, 9, 16]]}",[0001-1541],"[General Chemical Engineering, Environmental E...","{'date-parts': [[2014, 9, 16]]}",[https://sci.bban.top/pdf/10.1002/aic.14601.pd...,https://sci.bban.top/pdf/10.1002/aic.14601.pdf...,"July 24, 2014",,"{'given': 'Eiji', 'family': 'Iritani', 'sequen..."


In [14]:
next(sm_test)

{'container_type': 'Author',
 'filled': [],
 'source': <AuthorSource.SEARCH_AUTHOR_SNIPPETS: 'SEARCH_AUTHOR_SNIPPETS'>,
 'scholar_id': 'CDfzddYAAAAJ',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=CDfzddYAAAAJ',
 'name': 'Dongpeng Yan',
 'affiliation': 'Beijing Normal University, Beijing University of Chemical Technology',
 'email_domain': '@bnu.edu.cn',
 'interests': ['Materials Chemistry',
  'Molecule Science',
  'Chemical Engineering'],
 'citedby': 11105}

## Abstract Retrieval Tests:

In [1]:
from science_parse_api.api import parse_pdf
import pprint

In [2]:
from pathlib import Path
import tempfile

In [36]:
date_df_ok.loc["10.1002/aic.14601","direct_url"]

'https://sci.bban.top/pdf/10.1002/aic.14601.pdf#view=FitH'

In [12]:
pdf_req = requests.get("https://sci.bban.top/pdf/10.1002/aic.14601.pdf#view=FitH")
pdf_io = io.BytesIO(pdf_req.content)

In [13]:
# If it cannot find that folder, it will get the PDF and
# image from GitHub. This will occur if you are using
# Google Colab.
#pdf_url = ('https://github.com/UCREL/science_parse_py_api/'
#           'raw/master/test_data/example_for_test.pdf')

temp_test_pdf_paper = tempfile.NamedTemporaryFile('rb+')
test_pdf_paper = Path(temp_test_pdf_paper.name)
temp_test_pdf_paper.write(pdf_req.content)


10585

In [17]:
temp_test_pdf_paper.close()

In [18]:
temp_test_pdf_paper

<tempfile._TemporaryFileWrapper at 0x1e3d0ba1eb0>

In [15]:
test_pdf_paper

WindowsPath('C:/Users/oguzk/AppData/Local/Temp/tmpfm66jlm_')

In [14]:
extract_text(temp_test_pdf_paper)

TypeError: Unsupported input type: <class 'tempfile._TemporaryFileWrapper'>

In [3]:
test2_path = Path(r"C:\Users\oguzk\NOVA\THEsis\Codes\main1.pdf")

In [16]:
host = 'http://127.0.0.1'
port = '8080'
output_dict = parse_pdf(host, test_pdf_paper, port=port)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(output_dict)

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\oguzk\\AppData\\Local\\Temp\\tmpfm66jlm_'
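
The `PermissionError` is consistent with a known Windows limitation: a `NamedTemporaryFile` that is still open cannot be opened a second time by name. A minimal sketch of one workaround, writing the bytes to a temporary file that is closed before anything else reads it; `write_temp_pdf` is a hypothetical helper and the placeholder bytes stand in for `pdf_req.content`.

```python
import os
import tempfile
from pathlib import Path


def write_temp_pdf(content):
    """Write bytes to a temporary .pdf file, close it, and return its path."""
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
    # The file is closed here, so other code can open it by name.
    return Path(tmp.name)


pdf_path = write_temp_pdf(b"%PDF-1.4 placeholder bytes")
try:
    data = pdf_path.read_bytes()  # a second open now succeeds
finally:
    os.remove(pdf_path)  # delete=False means cleanup is manual
```

In the notebook, the returned path would take the place of `test_pdf_paper` in the `parse_pdf(host, test_pdf_paper, port=port)` call, with the removal deferred until parsing has finished.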

In [None]:
# Writing bytes requires binary write mode
with test_pdf_paper.open('wb') as test_fp:
    test_fp.write(pdf_req.content)

In [32]:
output_dict

{'id': 'empty'}

### PART 1: Keywords 

There are two possible solutions to the keywords problem: a manual regex pattern capture, or rake-nltk or a similar library. Depending on performance, one of them will be selected and added to the sh.get_dates() pipeline.
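
The manual regex route could look roughly like the sketch below. It assumes the extracted text keeps its line breaks, that the keyword block starts with a "Keywords" heading and ends at a blank line, and that keywords are separated by semicolons, commas or line breaks. `KEYWORDS_RE` and `capture_keywords` are hypothetical names.

```python
import re

# Capture everything after a "Keywords" heading up to the next blank line.
KEYWORDS_RE = re.compile(
    r"Keywords?\s*:?\s*(.+?)(?:\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def capture_keywords(text):
    """Return the author-supplied keywords as a list, or [] if none are found."""
    match = KEYWORDS_RE.search(text)
    if match is None:
        return []
    parts = re.split(r"[;,\n]", match.group(1))
    return [part.strip() for part in parts if part.strip()]


sample = ("Abstract text here.\n\n"
          "Keywords: Peer review; Publishing delay; Open access\n\n"
          "1. Introduction")
keywords = capture_keywords(sample)
```

The pattern takes the first occurrence of the word, so a paper that uses "keywords" in running text before the heading would need a stricter anchor.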

In [30]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [5]:
from rake_nltk import Rake

In [27]:
#PDF_Miner

with open(r"BiblTest\main1.pdf", 'rb') as f:
    pdf_six = extract_text(f,page_numbers=[]).replace("\n", "")

In [21]:
#PyPDF2

with open(r"BiblTest\main1.pdf", 'rb') as f2:
    pdf_pdf = pdf2.PdfFileReader(f2)
    page = pdf_pdf.getPage(0)
    page_text = page.extractText()
    r.extract_keywords_from_text(page_text)
    #print(page.extractText())




In [32]:
pdf_six_sent_tokens = sent_tokenize(pdf_six)

In [38]:
r = Rake()
r.extract_keywords_from_sentences(pdf_six_sent_tokens)
r.get_ranked_phrases_with_scores()

[(193.1721852014278,
  '25 journals 7 journals 6 journals 3 social science journals 3 natural science journals one journal15 journals10 journals one journal 7 journals 14 journals 28 commercial'),
 (130.0310584070018,
  '19971992 1985 – 1999 1997 2002 2004 2004 2005 2009 economics econometrics statistics econ ., management physics'),
 (81.31818181818181,
  'ac ceptedche mistry engin eering biomedicine physics earth science mathema'),
 (71.49940982682381, '1577 /$ – see front matter © http :// dx'),
 (69.82727272727273,
  'analytical chemistry geoscience mainly engineering agriculture biomedicine civil engineering cross'),
 (61.21734410198315,
  'two conventional 26 iranian journals 1980 1986 – 1990 1994'),
 (51.1, 'revised form 3 september 2013accepted 4 september 2013keywords'),
 (47.56033724340177,
  'acceptdiscipline journal size journal article × size discipline total 3'),
 (44.26774465080917,
  'publishdiscipline journal size journal article × size discipline total 5'),
 (42.66953

In [19]:
r = Rake()
r_kw = r.extract_keywords_from_text(pdf)
r.get_ranked_phrases()[:10]

['25 journals 7 journals 6 journals 3 social science journals 3 natural science journals one journal15 journals10 journals one journal 7 journals 14 journals 28 commercial',
 '19971992 1985 – 1999 1997 2002 2004 2004 2005 2009 economics econometrics statistics econ ., management physics',
 'ac ceptedche mistry engin eering biomedicine physics earth science mathema',
 '1577 /$ – see front matter © http :// dx',
 'analytical chemistry geoscience mainly engineering agriculture biomedicine civil engineering cross',
 'two conventional 26 iranian journals 1980 1986 – 1990 1994',
 'revised form 3 september 2013accepted 4 september 2013keywords',
 'acceptdiscipline journal size journal article × size discipline total 3',
 'publishdiscipline journal size journal article × size discipline total 5',
 'publishingin open access journals (“ gold oa ”)']

In [36]:
r.extract_keywords_from_text("The publishing delay in scholarly peer-reviewed journals")

In [37]:
r.get_ranked_phrases_with_scores()

[(4.0, 'scholarly peer'),
 (4.0, 'reviewed journals'),
 (4.0, 'publishing delay')]
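
RAKE gives sensible phrases on a clean sentence, so the noisy phrases from the full PDF probably stem from the input text: removing `"\n"` without inserting a space fuses words across line breaks (for example `2013accepted`). A minimal sketch of a gentler clean-up step, assuming end-of-line hyphens mark split words; `normalise_pdf_text` is a hypothetical name.

```python
import re


def normalise_pdf_text(raw):
    """Flatten extracted PDF text without fusing words across line breaks."""
    # Re-join words that were hyphenated at a line end ("publish-\ning").
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Every remaining line break becomes a word boundary.
    text = text.replace("\n", " ")
    # Collapse runs of whitespace left over from the layout.
    return re.sub(r"\s+", " ", text).strip()


raw_sample = "Accepted 4 September 2013\nKeywords\nPublish-\ning delay"
clean_sample = normalise_pdf_text(raw_sample)
```

The hyphen rule is a heuristic: a genuinely hyphenated compound that happens to break at a line end (such as "peer-" / "reviewed") would be merged into one word.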

In [17]:
pdf_pdf.getPage(0)

ValueError: seek of closed file

In [9]:
r_kw

In [7]:
len(pdf_token)

8360

In [12]:
pdf_token

['Journal',
 'of',
 'Informetrics',
 '7',
 '(',
 '2013',
 ')',
 '914–',
 '923Contents',
 'lists',
 'available',
 'at',
 'ScienceDirectJournal',
 'of',
 'Informetricsj',
 'o',
 'u',
 'r',
 'n',
 'a',
 'l',
 'h',
 'o',
 'm',
 'e',
 'p',
 'a',
 'g',
 'e',
 ':',
 'w',
 'w',
 'w',
 '.',
 'e',
 'l',
 's',
 'e',
 'v',
 'i',
 'e',
 'r',
 '.',
 'c',
 'o',
 'm',
 '/',
 'l',
 'o',
 'c',
 'a',
 't',
 'e',
 '/',
 'j',
 'o',
 'iThe',
 'publishing',
 'delay',
 'in',
 'scholarly',
 'peer-reviewed',
 'journalsBo-Christer',
 'Björk',
 'a',
 ',',
 'David',
 'Solomon',
 'b',
 ',',
 '∗a',
 'Information',
 'Systems',
 'Science',
 ',',
 'Hanken',
 'School',
 'of',
 'Economics',
 ',',
 'P.',
 'B',
 '.',
 '479',
 ',',
 '00101',
 'Helsinki',
 ',',
 'Finlandb',
 'Department',
 'of',
 'Medicine',
 'and',
 'OMERAD',
 ',',
 'A-202',
 'E',
 'Fee',
 'Hall',
 ',',
 'Michigan',
 'State',
 'University',
 ',',
 'East',
 'Lansing',
 ',',
 'MI',
 '48824',
 ',',
 'USAart',
 'iclei',
 'nfoabstract',
 'Article',
 'history',
 

In [11]:
"References" in pdf_token

False

### PART 1: Open Education API 

To get the affiliation information of the (first) authors of articles, the Open Education API or something similar must be used. The OE API looks easy to use once OAuth2 is figured out.

In [1]:
import requests

In [8]:
requests.get("https://open-education-api.github.io")

<Response [404]>