# Web <em>of</em> Science Crawler

This program extends the version created by Santana (2018). It collects data on the publications returned by a search for a term on the Web of Science website.

In this release, authentication can be performed automatically through Selenium, including reuse of a previous authentication, and advanced searches can be run using the WoS proprietary field tags.

Authors: De Souza, Edson Melo; Storópoli, José Eduardo; Alves, Wonder Alexandre Luz<br>
Version: 3.0 (2019)<br>

Original source: Santana, Octavio (2018) - https://github.com/Octavio-Santana/Web-Science

---

## Retrieved Data

*Some data may be incomplete

| Field                   | Description                                      |
|-------------------------|--------------------------------------------------|
| wos_id                  | Web of Science article ID                        |
| doi                     | Digital Object Identifier                        |
| title                   | Article title                                    |
| year                    | Year of publication                              |
| author                  | Author byline                                    |
| n_references            | Number of references in the article              |
| n_cited                 | Citation count                                   |
| journal                 | Journal name                                     |
| impact_factor           | Impact factor (IF)                               |
| impact_factor_year      | Year of IF                                       |
| j_impact_factor_5_years | 5-year impact factor (long-term citation trend)  |
| issn                    | Journal ISSN                                     |
| eissn                   | Electronic ISSN                                  |
| author_keywords         | Keywords supplied by the author                  |
| keywords_plus           | Additional keywords generated by WoS             |
| research_area           | Research area                                    |
| abstract                | Full abstract                                    |


## Instructions

<ul>
    <li>The Web <em>of</em> Science interface <strong>language must be set to English</strong>.</li>
    <li>You can search for any term using WoS (Web of Science) field tags (see the site). See the example in the "search_term" variable below.</li>
    <li>Do not use commas in the search expression.</li>
    <li>When processing finishes, a result file containing the analysis of the data is generated and named after the "<strong>Search Term</strong>".</li>
    <li>For further queries, repeat the procedure.</li>
    <li>This notebook shows an example with the term "<strong>artificial intelligence in medicine</strong>".</li>
</ul>

## Input data to search
Set the search term in the <strong>search_term</strong> variable below.

WoS provides a broad set of field tags for use in searches. For more information, see the instructions at: [WoS Advanced Search](https://apps.webofknowledge.com/WOS_AdvancedSearch_input.do?SID=7BtkvvzULX57U1SFGUR&product=WOS&search_mode=AdvancedSearch)

Usage example: 'TI=author* position'
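As a rough illustration of how a tagged search expression can be assembled, the sketch below joins tag/value pairs with a Boolean operator and rejects commas, following the rule in the instructions. The helper `build_query` is hypothetical and not part of WoSExtractor. `TI` (title) and `PY` (year published) are standard WoS field tags, but whether WoSExtractor passes the parenthesized form through unchanged has not been verified here.

```python
def build_query(fields, operator="AND"):
    """Join tag/value pairs into one advanced-search expression.

    Hypothetical helper for illustration; not part of WoSExtractor.
    """
    if operator not in {"AND", "OR"}:
        raise ValueError("operator must be AND or OR")
    parts = []
    for tag, value in fields.items():
        # The instructions above forbid commas in the search expression
        if "," in value:
            raise ValueError("commas are not allowed in the search expression")
        parts.append(f"{tag}=({value.strip()})")
    return f" {operator} ".join(parts)


search_term = build_query({"TI": "artificial intelligence", "PY": "2015-2019"})
print(search_term)  # TI=(artificial intelligence) AND PY=(2015-2019)
```

Building the expression from a dictionary keeps each tag next to its value, which makes it easier to spot a stray comma before the crawler starts.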

In [None]:
# Search term usage example
search_term = 'TI=artificial intelligence in medicine'

# Path to ChromeDriver (raw string, so the backslash is not read as an escape)
pathChromedriver = r'.\chromedriver.exe'

from WoSExtractor import *
wos = WoSExtractor(pathChromedriver)
df_article = wos.search(search_term)

## Save retrieved data to file

In [None]:
df_article.to_csv(wos.file_name(search_term) + '_data.csv', index=False)
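Search expressions contain characters such as `=` and `*` that are awkward or invalid in file names on some systems. The implementation of `wos.file_name` is not shown in this notebook, so the sketch below is only a guess at what such a conversion might look like. The function `safe_file_name` is a hypothetical stand-in, not the WoSExtractor method.

```python
import re


def safe_file_name(search_term):
    """Reduce a search expression to letters, digits, and underscores.

    Hypothetical stand-in; the real wos.file_name may behave differently.
    """
    # Replace every run of non-alphanumeric characters with one underscore
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", search_term)
    return cleaned.strip("_").lower()


print(safe_file_name("TI=artificial intelligence in medicine"))
# ti_artificial_intelligence_in_medicine
```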

## Convert values to Excel format

In [None]:
import pandas as pd
df = pd.read_csv(wos.file_name(search_term) + '_data.csv')
# The `encoding` argument was removed from to_excel in pandas 2.0; .xlsx files store Unicode text natively
df.to_excel(wos.file_name(search_term) + '_result.xlsx', header=True, index=False)

## Show results

In [None]:
print('Recovered articles:', df.shape[0], '\n')
df.head(5)
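Because some retrieved fields may be incomplete, it can help to check how many values are missing per column before analyzing the data. The sketch below uses a small made-up table in place of `df`. The function name `missing_report` and the sample values are illustrative only.

```python
import pandas as pd

# Made-up sample standing in for the retrieved DataFrame
sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "doi": ["placeholder-doi", None, None],
    "abstract": ["Text", "Text", None],
})


def missing_report(frame):
    """List columns that contain missing values, with counts and percentages."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


print(missing_report(sample))
```

With the real data, calling `missing_report(df)` would show which of the fields in the table above are most affected.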

## Import libraries to show results

In [None]:
# Import libraries
import warnings
warnings.simplefilter("ignore")

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

## Data Analysis
Display the number of publications from the year 2000 up to the most recent year available in Web of Science.

In [None]:
fig, ax1 = plt.subplots(nrows=1, figsize=(10,6))
ax1 = sns.countplot(x=df.loc[df.year>=2000, 'year'], ax=ax1)  # TODO: use a variable for the start year
ax1.set_title('Number of publications between 2000 and 2019', fontsize=12)
ax1.set_xlabel('Year of publication', fontsize=10)
ax1.set_ylabel('Number of publications', fontsize=10)
ax1.grid(True)
plt.savefig("publicacoes_por_ano.png", dpi=300)
plt.show()
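The year range in the cell above is hard-coded. One possible way to parameterize it is sketched below, assuming the `year` column may contain blanks or non-numeric entries. The function `publications_per_year` and the sample data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def publications_per_year(frame, start_year=2000, end_year=None):
    """Count records per publication year within [start_year, end_year]."""
    # Coerce to numbers so blank or malformed years become NaN and are dropped
    years = pd.to_numeric(frame["year"], errors="coerce").dropna().astype(int)
    if end_year is not None:
        years = years[years <= end_year]
    years = years[years >= start_year]
    return years.value_counts().sort_index()


# Made-up sample standing in for the retrieved DataFrame
sample = pd.DataFrame({"year": [1999, 2001, 2001, 2019, None]})
counts = publications_per_year(sample, start_year=2000)
print(counts)
```

The resulting series can be plotted directly with `counts.plot(kind="bar")`, and the chart title can be built from `start_year` and `counts.index.max()` instead of fixed numbers.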

#### Total Authors
Display the total number of distinct authors who have published on the search term.

In [None]:
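# Split each byline into individual names, skipping records with no byline
author = [name.split('; ') for name in df.author.dropna().values]
# The set comprehension keeps each author name only once
nomes = list({name 
              for names in author 
              for name in names})
print("{} authors have published on the search term".format(len(nomes)))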
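Bylines scraped from a web page can contain missing values or irregular spacing around the `;` separator. The sketch below shows one tolerant way to split them using only the standard library. The helper `split_authors` and the author names are made up for illustration.

```python
def split_authors(byline):
    """Split a ';'-separated byline into trimmed names; non-text input gives []."""
    if not isinstance(byline, str):
        return []
    return [name.strip() for name in byline.split(";") if name.strip()]


# Made-up bylines, including a missing value and a NaN
bylines = ["Silva, A; Souza, B", "Souza, B;  Lima, C", None, float("nan")]

unique_authors = {name for byline in bylines for name in split_authors(byline)}
print(len(unique_authors))  # 3
```

Trimming each name before adding it to the set avoids counting "Lima, C" and " Lima, C" as two different authors.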

#### Total Journals Published
Display the number of distinct journals and/or conferences related to the search term.

In [None]:
journals = list(df.journal.dropna().unique())  # distinct names, ignoring missing values
print("{} journals and/or conferences related to the search theme".format(len(journals)))

### Count number of publications
<ul>
    <li>Authors: popular_authors</li>
    <li>Journals: popular_journals</li>
</ul>

#### Top 10 Authors

In [None]:
from collections import Counter
popular_authors = Counter(name
                         for names in author
                         for name in names).most_common()

popular_journals = Counter(df.journal.values).most_common()

def popular(coluna, popular_coluna):
    """Turn a list of (value, count) pairs into a two-column DataFrame."""
    data = {coluna: [], 'count': []}
    for col, count in popular_coluna:
        data[coluna].append(col)
        data['count'].append(count)

    return pd.DataFrame(data=data)

df_popular_authors = popular('author', popular_authors)
df_popular_authors.head(10).set_index('author').sort_values('count', ascending=True).plot(kind='barh', 
                                                                                  figsize=(11,7), 
                                                                                  grid=False, 
                                                                                  color='darkgreen', 
                                                                                  legend=False)
plt.title('The top 10 authors on the subject', fontsize=20)
plt.xlabel('Number of publications', fontsize=20)
plt.ylabel('Author name', fontsize=20)
plt.savefig("top_10_authors.png", dpi=300)
plt.show()
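The ranking table built above can also be produced more compactly, since `pandas.DataFrame` accepts a list of tuples together with column names. The sketch below is an alternative under that assumption, not the notebook's own method. The function `counts_frame` and the journal names are hypothetical.

```python
from collections import Counter

import pandas as pd


def counts_frame(values, column):
    """Return a DataFrame of distinct values and their counts, most frequent first."""
    ranked = Counter(values).most_common()
    return pd.DataFrame(ranked, columns=[column, "count"])


# Made-up journal names for illustration
journals = ["Journal X", "Journal Y", "Journal X", "Journal Z", "Journal X", "Journal Y"]
top = counts_frame(journals, "journal")
print(top.head(10))
```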

#### Top 10 Journals

In [None]:
df_popular_journals = popular('journal', popular_journals)
df_popular_journals.head(10).set_index('journal').sort_values('count', ascending=True).plot(kind='barh', 
                                                                                  figsize=(10,6), 
                                                                                  grid=True, 
                                                                                  color='darkgreen', 
                                                                                  legend=False)
plt.title('Top 10 journals publishing on the search term', fontsize=12)
plt.xlabel('Number of publications', fontsize=10)
plt.ylabel('Journal name', fontsize=10)
plt.savefig("amount_of_periodicals.png", dpi=300)
plt.show()

### Show the 5 articles with the highest journal impact factor

In [None]:
df_cited = df.sort_values('impact_factor', ascending=False)
df_cited.head(5)
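Each row of `df` is an article, so the sorted table above can list the same journal several times. If a ranking of distinct journals is wanted, one option is sketched below. It assumes that `impact_factor` may be stored as text and may be missing for some records. The function `top_journals_by_impact` and the sample values are made up for illustration.

```python
import pandas as pd


def top_journals_by_impact(frame, n=5):
    """Return up to n distinct journals ordered by impact factor, highest first."""
    ranked = frame[["journal", "impact_factor"]].copy()
    # Convert text to numbers; unparseable or missing values become NaN
    ranked["impact_factor"] = pd.to_numeric(ranked["impact_factor"], errors="coerce")
    ranked = ranked.dropna(subset=["impact_factor"]).drop_duplicates(subset="journal")
    ranked = ranked.sort_values("impact_factor", ascending=False)
    return ranked.head(n).reset_index(drop=True)


# Made-up sample standing in for the retrieved DataFrame
sample = pd.DataFrame({
    "journal": ["Journal X", "Journal X", "Journal Y", "Journal Z"],
    "impact_factor": ["3.2", "3.2", "5.1", None],
})
result = top_journals_by_impact(sample)
print(result)
```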