# Preparing files for VOSviewer overlays

In this notebook we will load some files from Web of Science, parse them, and use them to prepare advanced overlay maps in VOSviewer. You have already seen many of these operations earlier in the summer school.

As usual, we start by importing the relevant packages. We need the `pandas` package, which we again call `pd`; the `csv` package for some options; and the `glob` package to easily find the relevant files.

In [1]:
import pandas as pd
import csv
import glob

We start by reading in all files. We already did this in an earlier notebook, and we repeat it here.

In [2]:
files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))
publications_df = pd.concat(pd.read_csv(f, sep='\t', quoting=csv.QUOTE_NONE, 
                                        usecols=range(68), index_col='UT') for f in files)
publications_df = publications_df.sort_index()

We will now prepare files manually for VOSviewer. We will have to prepare two files: 
 1. a so-called corpus file that contains all text for each document.
 2. a so-called scores file that contains "scores" for each document.

## Corpus file

We will first prepare the corpus file. For this purpose we concatenate the title and the abstract. VOSviewer treats each line in the corpus file as a document and uses all text on that line when creating a term map. In other words, you can apply this to any type of text.

In [3]:
publications_df['text'] = publications_df['TI'] + '. ' + publications_df['AB']

We have added the additional full stop (`.`) to make sure that VOSviewer is able to parse the sentences correctly.
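
One thing to watch for: in `pandas`, adding a string to a missing value gives a missing value again, so a record without an abstract would lose its title as well. The following is a minimal sketch on toy data, assuming you would rather keep the title in that case. The dataframe `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy data: the second record has no abstract (hypothetical example values).
toy_df = pd.DataFrame({
    'TI': ['Malaria in children', 'Dengue outbreak'],
    'AB': ['We study malaria.', None],
})

# Plain concatenation propagates the missing value to the result.
naive = toy_df['TI'] + '. ' + toy_df['AB']

# Filling missing values with an empty string keeps the title.
safe = toy_df['TI'].fillna('') + '. ' + toy_df['AB'].fillna('')
```

Here `naive` has a missing value for the second record, while `safe` still contains its title.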

Since VOSviewer expects one document per line, we need to make sure that each title and abstract ends up on a single line. In more technical terms: the text cannot contain any newlines. Newlines are represented by special characters (`\n`, `\r`, or a combination of both), depending on the platform you are using. We simply remove all possible newline characters as follows:

In [4]:
publications_df['text'] = publications_df['text'].str.replace('\n', '').str.replace('\r', '')
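
To see the newline handling on a small scale, here is a sketch on a toy series (the name `toy_text` and its contents are made up). It uses a single regular expression for all newline variants and replaces them with a space rather than removing them outright, which is a design choice that avoids gluing two words together.

```python
import pandas as pd

# Toy texts with Windows, Unix and old Mac style line breaks (hypothetical).
toy_text = pd.Series(['first line\r\nsecond line', 'one\ntwo', 'a\rb'])

# Replace any run of carriage returns and line feeds with a single space.
single_line = toy_text.str.replace(r'[\r\n]+', ' ', regex=True)
```

Note that it is the `.str` accessor that works inside each text; calling `.replace` directly on a series only replaces entries that are equal to the given value as a whole.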

Now we write the text for each document to a corpus file.

In [5]:
publications_df['text'].to_csv('corpus.txt', index=False, header=False)
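
Be aware that `to_csv` writes comma-separated values by default, so any text that contains a comma is wrapped in double quotes in the output. Whether this matters for your term map is not something we check here. If you prefer a file that contains the text exactly as it is, a sketch along the following lines could be used. The names `toy_text` and `buffer` are made up, and the in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Toy texts, one with a comma and one without (hypothetical).
toy_text = pd.Series(['Malaria, dengue and zika. An overview.', 'No comma here'])

# Default CSV writing quotes any field that contains a comma.
csv_output = toy_text.to_csv(index=False, header=False)

# Writing the lines ourselves keeps the text exactly as it is.
buffer = io.StringIO()  # stands in for open('corpus.txt', 'w', encoding='utf-8')
for line in toy_text.fillna(''):
    buffer.write(line + '\n')
plain_output = buffer.getvalue()
```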

## Scores file

Now we have to determine what type of scores we want to project as overlays in VOSviewer. We will show how to do this using journals; you can repeat the exercise for countries.

Scores in VOSviewer work as follows. For each term, VOSviewer calculates the average score of the documents in which that term occurs. It then colors the terms in the term map according to these averages. This highlights the parts of the map where the score is particularly high or low. The objective now is to show this for journals, highlighting which part of the map is particularly relevant to a certain journal.
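
The averaging described above can be illustrated with a few lines of plain Python. This is only a sketch of the idea on made-up documents: terms are found here by splitting on whitespace, whereas the way VOSviewer identifies terms is likely more involved.

```python
# Toy documents with a 0/1 score each (hypothetical values).
documents = [
    ('malaria vaccine trial', 1),
    ('malaria mosquito control', 0),
    ('dengue mosquito control', 0),
    ('malaria vaccine efficacy', 1),
]

# Collect the scores of all documents in which a term occurs.
scores_per_term = {}
for text, score in documents:
    for term in set(text.split()):
        scores_per_term.setdefault(term, []).append(score)

# The overlay value of a term is the average of those scores.
average_score = {term: sum(s) / len(s) for term, s in scores_per_term.items()}
```

In this toy example "vaccine" only occurs in documents with score 1 and "mosquito" only in documents with score 0, while "malaria" occurs in both kinds and ends up in between.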

We will do this for each journal separately. At the moment, the journal is contained in the field `SO`.

In [6]:
publications_df['SO']

UT
WOS:000084372900003                                         ACTA TROPICA
WOS:000085634000001             TROPICAL MEDICINE & INTERNATIONAL HEALTH
WOS:000085825800025    TRANSACTIONS OF THE ROYAL SOCIETY OF TROPICAL ...
WOS:000085825800030    TRANSACTIONS OF THE ROYAL SOCIETY OF TROPICAL ...
WOS:000086145500003    AMERICAN JOURNAL OF TROPICAL MEDICINE AND HYGIENE
                                             ...                        
WOS:000437429300018                         MEDECINE ET SANTE TROPICALES
WOS:000437432400002                         MEDECINE ET SANTE TROPICALES
WOS:000437432400014                         MEDECINE ET SANTE TROPICALES
WOS:000437437700018                         MEDECINE ET SANTE TROPICALES
WOS:000437449800023                         MEDECINE ET SANTE TROPICALES
Name: SO, Length: 2338, dtype: object

You may remember that you can group the dataframe by journal to get an overview per journal.

In [7]:
publications_df.groupby('SO').size().sort_values(ascending=False)

SO
TROPICAL MEDICINE & INTERNATIONAL HEALTH                                                                  1025
AMERICAN JOURNAL OF TROPICAL MEDICINE AND HYGIENE                                                          360
PLOS NEGLECTED TROPICAL DISEASES                                                                           275
MALARIA JOURNAL                                                                                            197
TRANSACTIONS OF THE ROYAL SOCIETY OF TROPICAL MEDICINE AND HYGIENE                                         133
PARASITES & VECTORS                                                                                        105
ACTA TROPICA                                                                                                82
MEDECINE ET SANTE TROPICALES                                                                                18
ANNALS OF TROPICAL MEDICINE AND PARASITOLOGY                                                                1

Now we would like to translate the `SO` column in such a way that VOSviewer can show a separate overlay for each journal. For those of you who are familiar with statistics, we will do this using so-called "dummy" variables. That is, for each journal we create a new column that indicates whether the publication is from that journal (Yes, `1`) or not (No, `0`). When VOSviewer then takes the average, this comes down to showing the share of publications with a certain term that were published in that journal. Fortunately, this is implemented in `pandas`, so we can easily do that.

In [8]:
journal_scores_df = publications_df['SO'].str.get_dummies()
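
To get a feeling for what the dummy variables look like, here is a small sketch on made-up data (the series `toy_so` and its document labels are hypothetical). The mean of a dummy column is the share of documents from that journal.

```python
import pandas as pd

# Five toy documents from two journals (hypothetical).
toy_so = pd.Series(
    ['ACTA TROPICA', 'MALARIA JOURNAL', 'ACTA TROPICA', 'MALARIA JOURNAL', 'ACTA TROPICA'],
    index=['doc1', 'doc2', 'doc3', 'doc4', 'doc5'],
)

# One 0/1 column per journal.
toy_dummies = toy_so.str.get_dummies()

# The column mean is the share of documents published in each journal.
share = toy_dummies.mean()
```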

If we now look at `journal_scores_df`, you will see many column names that represent the journals, and only `0` or `1` in each entry.

In [9]:
journal_scores_df.head()

UT,ACTA TROPICA,AMERICAN JOURNAL OF TROPICAL MEDICINE AND HYGIENE,ANNALS OF TROPICAL MEDICINE AND PARASITOLOGY,ANNALS OF TROPICAL MEDICINE AND PUBLIC HEALTH,ANNALS OF TROPICAL PAEDIATRICS,ASIAN PACIFIC JOURNAL OF TROPICAL BIOMEDICINE,ASIAN PACIFIC JOURNAL OF TROPICAL MEDICINE,BIOMEDICA,BULLETIN DE LA SOCIETE DE PATHOLOGIE EXOTIQUE,IMPACT OF ECOLOGICAL CHANGES ON TROPICAL ANIMAL HEALTH AND DISEASE CONTROL,...,PARASITES & VECTORS,PATHOGENS AND GLOBAL HEALTH,PLOS NEGLECTED TROPICAL DISEASES,PLoS Neglected Tropical Diseases,REVISTA DA SOCIEDADE BRASILEIRA DE MEDICINA TROPICAL,SOUTHEAST ASIAN JOURNAL OF TROPICAL MEDICINE AND PUBLIC HEALTH,TRANSACTIONS OF THE ROYAL SOCIETY OF TROPICAL MEDICINE AND HYGIENE,TROPICAL DOCTOR,TROPICAL MEDICINE & INTERNATIONAL HEALTH,TROPICAL MEDICINE AND HEALTH
WOS:000084372900003,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WOS:000085634000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
WOS:000085825800025,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
WOS:000085825800030,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
WOS:000086145500003,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
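
Note that the preview above lists both `PLOS NEGLECTED TROPICAL DISEASES` and `PLoS Neglected Tropical Diseases` as separate columns. Assuming these two spellings refer to the same journal and you want a single overlay for it, one option is to normalize the journal names before creating the dummy variables. The sketch below shows the idea on a made-up series `toy_so`.

```python
import pandas as pd

# The same journal spelled in two ways, plus one other journal (toy data).
toy_so = pd.Series([
    'PLOS NEGLECTED TROPICAL DISEASES',
    'PLoS Neglected Tropical Diseases',
    'ACTA TROPICA',
])

# Without normalization each spelling gets its own column.
raw_dummies = toy_so.str.get_dummies()

# Upper-casing and trimming whitespace first merges the two spellings.
normalized_dummies = toy_so.str.upper().str.strip().str.get_dummies()
```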


VOSviewer expects a specific column name for scores. In particular, each column should be called `Score<...>`, with the name of the score between the angle brackets. We therefore change the column names:

In [10]:
journal_scores_df.columns = ['Score<{}>'.format(c) for c in journal_scores_df.columns]

Finally, we write the dataframe to a scores file, which should be tab-delimited.

In [11]:
journal_scores_df.to_csv('scores.txt', sep='\t', index=False)
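
Both files are written from the same dataframe in the same order, so the scores on a given row should belong to the document on the corresponding line of the corpus file. That the two files are matched by position is an assumption based on how we build them here. A quick sanity check is to compare the number of lines. The sketch below uses a hypothetical helper `count_lines` and two small temporary files that stand in for `corpus.txt` and `scores.txt`.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file (hypothetical helper)."""
    with open(path, encoding='utf-8') as handle:
        return sum(1 for _ in handle)

# Two small temporary files standing in for corpus.txt and scores.txt.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, 'corpus.txt')
scores_path = os.path.join(tmp_dir, 'scores.txt')
with open(corpus_path, 'w', encoding='utf-8') as handle:
    handle.write('first document\nsecond document\n')
with open(scores_path, 'w', encoding='utf-8') as handle:
    handle.write('Score<A>\tScore<B>\n1\t0\n0\t1\n')

n_documents = count_lines(corpus_path)
n_scores = count_lines(scores_path) - 1  # subtract the header row
```

If `n_documents` and `n_scores` differ for your own files, something went wrong along the way, for example a newline that was left in a text.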

## VOSviewer

You can now create a term map in VOSviewer using the two files you produced, `corpus.txt` and `scores.txt`. To create a term map based on these files, choose "Create a map based on text data" in VOSviewer, and then select "Read data from VOSviewer files."

# Exercise: Document type

<div class="alert alert-info">
    Now repeat the same exercise but using the document type <code>DT</code>.
</div>

In [16]:
dt_scores_df = publications_df['DT'].str.get_dummies()
dt_scores_df.columns = ['Score<{}>'.format(c) for c in dt_scores_df.columns]
dt_scores_df.to_csv('scores.txt', sep='\t', index=False)

<div class="alert alert-info">
    Create the term map in VOSviewer with the document type scores file. Does the category "Meeting Abstract" show a particular pattern? Why (not)? Can you explain your observation?
</div>

<div class="alert alert-info">
    You probably now have two different dataframes. You then cannot see the document type overlay at the same time as the journal overlay. Could you try to combine the two dataframes? (Hint: check out the <code>concat</code> function we encountered earlier.)
</div>

In [14]:
scores_df = pd.concat([journal_scores_df, dt_scores_df], axis=1)
scores_df.to_csv('scores.txt', sep='\t', index=False)
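
To see what `concat` does with `axis=1`, here is a small sketch on two made-up dataframes. The column names and values are hypothetical. The rows are matched on the index and the columns are placed side by side.

```python
import pandas as pd

# Two toy score tables for the same two documents (hypothetical values).
toy_journal = pd.DataFrame({'Score<ACTA TROPICA>': [1, 0]}, index=['doc1', 'doc2'])
toy_dt = pd.DataFrame({'Score<Article>': [1, 1]}, index=['doc1', 'doc2'])

# axis=1 puts the columns next to each other, aligned on the index.
toy_scores = pd.concat([toy_journal, toy_dt], axis=1)
```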