In [10]:
from pybliometrics.scopus import AbstractRetrieval
import time
import pandas as pd
import numpy as np
import json

## 1. Creation of EJOR abstract dataset ##

The list of DOIs is taken from the DOI column of the data provided by EJOR.

In [21]:
EJOR = pd.read_csv(r".\DATASETS\EJOR_Breakdown.csv") #EJOR breakdown csv, without abstracts

In [None]:
DOIlist = list(EJOR['DOI'])

A dictionary is created with the DOI as key and the abstract as value, retrieved with the pybliometrics library. If an entry has no DOI, we store NO ABSTRACT FOUND instead.

In [None]:
DOIdict = dict()
for doi in DOIlist: #loop over every entry, starting with the first one
    if pd.isna(doi):
        DOIdict[doi] = 'NO ABSTRACT FOUND'
    else:
        capture = AbstractRetrieval(identifier = doi, id_type = "doi")
        DOIdict[doi] = capture.description
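
Long retrieval runs can fail halfway on a single request. The sketch below shows one possible way to make the loop more tolerant: it pauses and retries a few times before giving up on a DOI. The function name `fetch_abstracts` and the parameters `fetch`, `pause` and `retries` are hypothetical and not part of pybliometrics; `fetch` stands for any callable that returns the abstract for a DOI.

```python
import time

def fetch_abstracts(dois, fetch, pause=0.5, retries=3):
    """Return {doi: abstract}. `fetch` is any callable mapping a DOI to its abstract."""
    results = {}
    for doi in dois:
        # Missing DOIs (None or NaN) get the same placeholder as in the notebook.
        if doi is None or str(doi) == "nan":
            results[doi] = "NO ABSTRACT FOUND"
            continue
        for attempt in range(1, retries + 1):
            try:
                results[doi] = fetch(doi)
                break
            except Exception:
                if attempt == retries:
                    # Give up on this DOI after the last attempt.
                    results[doi] = "NO ABSTRACT FOUND"
                else:
                    # Wait a little longer after each failed attempt.
                    time.sleep(pause * attempt)
    return results
```

In this notebook, `fetch` could be something like `lambda d: AbstractRetrieval(identifier=d, id_type="doi").description`.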

Write the result to a text file, for safety. Opening the file in append mode ('a') is handy: if the program gets stuck, the parts that have already been retrieved are kept and later results can be added to them.

In [None]:
with open("all_abstracts.txt", 'a', encoding = "utf-8") as f: #the file is closed automatically
    for key in DOIdict:
        f.write(str(key) + "\t" + str(DOIdict[key]) + "\n")
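
When the run is restarted after a crash, appending can write the same DOI twice. A minimal sketch of one way to avoid that, assuming the tab-separated layout used above: read the DOIs that are already in the file and append only the new ones. The helper names `load_done` and `append_new` are hypothetical.

```python
import os

def load_done(path):
    """Return the set of DOIs already present in the tab-separated file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0] for line in f if line.strip()}

def append_new(path, abstracts):
    """Append only the DOIs that are not yet in the file; return how many were written."""
    done = load_done(path)
    written = 0
    with open(path, "a", encoding="utf-8") as f:
        for doi, abstract in abstracts.items():
            if str(doi) in done:
                continue
            # Flatten tabs and newlines so that each record stays on one line.
            text = " ".join(str(abstract).split())
            f.write(f"{doi}\t{text}\n")
            written += 1
    return written
```

Flattening the whitespace is a design choice of this sketch: an abstract that contains a tab or a line break would otherwise split one record over several columns or lines when the file is read back.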

In the next part, we read the table of all DOIs with their abstracts and merge it with the table provided by EJOR. Some papers do not have a DOI. These are checked manually on their Scopus page; if the abstract there is relevant, we add it. Conference proceedings are not added and are thus removed from the database.

In [None]:
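#the text file has no header row, so the column names are supplied here
all_abstracts = pd.read_table("all_abstracts.txt", delimiter = "\t", header = None, names = ["DOI", "Abstract"])

In [None]:
df = EJOR.merge(all_abstracts, on='DOI', how='left', indicator= True)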
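
The `indicator` argument adds a `_merge` column that records, per row, whether a match was found. The toy example below (made-up DOIs and titles, not EJOR data) illustrates how this column can be used to list the entries that did not receive an abstract.

```python
import pandas as pd

# Toy stand-ins for the EJOR table and the retrieved abstracts.
left = pd.DataFrame({"DOI": ["10.1/a", "10.1/b", None], "Title": ["A", "B", "C"]})
right = pd.DataFrame({"DOI": ["10.1/a"], "Abstract": ["Abstract of A"]})

# A left join keeps every row of `left`; `_merge` tells where each row came from.
merged = left.merge(right, on="DOI", how="left", indicator=True)

# Rows without a partner in `right` are flagged "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["DOI", "Title"]])
```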

Entries without a DOI are dropped, except for the indices listed in the next cell, whose abstracts are relevant.

In [None]:
lijstje = [16580, 18801] #indices of DOI-less entries whose abstract is relevant, so we do not remove them.
for index in df[df["DOI"].isna()].index:
    if index not in lijstje:
        df.drop(index,axis = 0, inplace = True)

In [None]:
df.at[16580,"Abstract"] = "In the first part, the preparation of aqueous anionic urethane-urea dispersions is described using different aliphatic and cycloaliphatic diisocyanates, a polyether polyol (PTMG 2000), dimethylolpropionic acid (DMPA), and cyclohexane-diamine as chain extender. In the second part, different polyester and polyether polyols and different neutralizing agents were employed using1, 12-dodecane diisocyanate (C12DDI) as the sole diisocyanate and cyclohexanediamine as chain extender in the preparation of the anionic urethane-urea dispersions. The relationships between the chemical structure of the diisocyanates, polyols, and neutralizing agents on the dispersion, mechanical and thermal properties are being discussed."

In [None]:
df.at[18801,"Abstract"] = "The quantification of the financial benefits of computerized information systems is discussed. It is relatively easy to analyze the clerical applications of computers. In management information systems, however, revenues arise only if computers yield better data and if these data are used to improve decision-making. A new framework is presented plus a few theories and techniques. The framework comprises the sequence transaction-data creation-decision-reaction. Relevant theories are Bayesian Information Economics, Control Theory, and System Dynamics. Relevant techniques are simulation and management gaming."

Open the keyword list, which is keyed by DOI. Then we turn it into a dataframe and merge it with the other dataframe, based on the DOI.

In [2]:
with open(r".\DATASETS\keyworddict.txt", encoding= "utf-8") as f:
    DOIdict = json.load(f)

In [7]:
dois = []
keywordas = []
for key in DOIdict:
    dois.append(key)
    if DOIdict[key] is not None:
        keywordas.append(tuple(DOIdict[key]))
    else:
        keywordas.append(tuple())

In [8]:
df2 = pd.DataFrame(list(zip(dois,keywordas)),columns=['DOI','Keywords'])

In [None]:
df3 = df.merge(df2, on = 'DOI') #default inner join: only entries with a keyword record are kept

Write the result to a CSV file as a temporary save.

In [20]:
#comma-separated (the default), which matches the pd.read_csv call that reads this file back
df3.to_csv(r".\DATASETS\EJOR_DATABASE_ABSTRACT_KEYWORDS.csv", index = False, encoding = 'utf-8')

The next section concatenates the title, the abstract, and the keywords provided by the authors into one text column, and cleans that text for later basic text-mining use.

In [None]:
df = pd.read_csv(r".\DATASETS\EJOR_DATABASE_ABSTRACT_KEYWORDS.csv") #read in data with year
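
Note that the keywords were stored as tuples before the file was written. A CSV file only holds text, so after reading it back each cell is the string form of a tuple. That is fine for the concatenation that follows, but if the individual keywords are ever needed again, the string has to be parsed. A small stdlib-only sketch with a made-up row, using `ast.literal_eval` as one possible way to do this:

```python
import ast
import csv
import io

# Write one toy row whose second field is a tuple, as in the Keywords column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["DOI", "Keywords"])
writer.writerow(["10.1/a", ("scheduling", "integer programming")])

# Read it back: the tuple has become its string representation.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["Keywords"]

# literal_eval safely turns that string back into a Python tuple.
keywords = ast.literal_eval(raw)
print(type(raw).__name__, type(keywords).__name__)
```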

In [None]:
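#merge all text data in one column and then clean it for analysis
#missing values become empty strings, so that one NaN does not turn the whole row into NaN
df['Text'] = df['Title'].fillna('') + ' ' + df['Abstract'].fillna('') + ' ' + df['Keywords'].fillna('')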

In [None]:
#getting ready for text mining
import re
#import nltk
#nltk.download('stopwords') #do this if running code first time
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

#Clean Text: lowercase, keep letters only, collapse repeated whitespace
def clean_text(text):
    text = text.lower()
    text = re.sub("[^a-zA-Z]"," ",text)
    text = ' '.join(text.split())
    return text

#stemming: reduce every word to its stem
stemmer = SnowballStemmer("english")
def stemming(sentence):
    return ' '.join(stemmer.stem(word) for word in sentence.split())

In [None]:
#lowercase and strip punctuation first, so that capitalised stopwords such as "The" are removed as well
df['Text'] = df['Text'].apply(clean_text)
df['Text'] = df['Text'].apply(remove_stopwords)
df['Text'] = df['Text'].apply(stemming)
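
The order of lowercasing and stopword removal matters, because the stopword list is lowercase. The stdlib-only illustration below uses a toy sentence and a tiny stand-in stopword set (an assumption; the notebook itself uses the NLTK English list), and leaves out stemming.

```python
import re

# Tiny stand-in stopword list (assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "of", "a", "is", "for"}

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def clean_text(text):
    # Lowercase, replace everything that is not a letter by a space, collapse whitespace.
    text = re.sub("[^a-zA-Z]", " ", text.lower())
    return " ".join(text.split())

sample = "The scheduling of jobs: a branch-and-bound approach"

# Removing stopwords before lowercasing misses the capitalised "The".
stop_then_clean = clean_text(remove_stopwords(sample))
# Lowercasing first lets every stopword match.
clean_then_stop = remove_stopwords(clean_text(sample))

print(stop_then_clean)
print(clean_then_stop)
```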

In [None]:
df.to_csv(r".\DATASETS\EJOR_DATABASE_ABSTRACT_KEYWORDS.csv")

In [None]:
#the actual analysis is done in the next notebook.