# Data Scraping from the European Medicines Agency (EMA)
## *Miquel Anglada Girotto*


In this notebook, I use the `RSelenium` package to collect the information published for each drug on the European Medicines Agency (EMA) website.

Before each chunk, I summarize what the code below does. Parts of the code that may be confusing are described in more detail in inline comments.

## 0. Load required dependencies

In [16]:
# libraries
library(RSelenium)
library(stringr)
library(xlsx)

# helper functions defined in a local script
source("functions_web_scrapping.R")

## 1. Set parameters

In [None]:
# start automated navigator
initRSelenium()

# Navigate to a list of all drugs
url = "https://www.ema.europa.eu/medicines/field_ema_web_categories%253Aname_field/Human/ema_group_types/ema_medicine"
remDr$navigate(url)

# Define pages to look through
page <- "https://www.ema.europa.eu/en/medicines/field_ema_web_categories%253Aname_field/Human/ema_group_types/ema_medicine?page=" #link to navigate through each page of the list.
n <- as.character(ceiling(1495/25)-1) # index of the last page (1495 drugs, 25 per page, pages counted from 0); ceiling() keeps a partially filled last page; as.character() allows pasting with the 'page' string.
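
The page count above relies on the listing showing 25 drugs per page and on the `?page=` index starting at 0. As a minimal sketch under those two assumptions, the full set of page URLs could be assembled as follows. The helper `buildPageURLs` is hypothetical and is not part of `functions_web_scrapping.R`.

```r
# Hypothetical helper: build one URL per listing page.
# Assumes page indices start at 0 and each page lists `per_page` drugs.
buildPageURLs <- function(base, n_items, per_page = 25) {
  last_page <- ceiling(n_items / per_page) - 1  # index of the last page
  paste0(base, seq(0, last_page))
}

base <- "https://www.ema.europa.eu/en/medicines/field_ema_web_categories%253Aname_field/Human/ema_group_types/ema_medicine?page="
page_urls <- buildPageURLs(base, n_items = 1495)
length(page_urls)  # 60 pages, indexed 0 to 59
```

Using `ceiling()` rather than `round()` ensures that a last page holding fewer than 25 drugs is never dropped.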

## 2. Download information from each drug

In [None]:
# Obtain each drug's link
drug_links = getURLs(page,n)

# Use drug links to retrieve information from EMA's webpage for each drug
# (Slow)
drugsDB = getDrugInfo(drug_links)
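
The bodies of `getURLs()` and `getDrugInfo()` live in `functions_web_scrapping.R` and are not shown here. As a rough illustration of the kind of step involved, the sketch below collects link targets from a single listing page with RSelenium. The function name `getPageLinks`, the `wait` argument, and the CSS selector are all assumptions made for illustration; the real selector depends on the markup of the EMA page.

```r
# Hypothetical sketch: collect the href of every element matching a CSS
# selector on one listing page. `remDr` is an RSelenium remote driver that
# has already been opened. The default selector is only a placeholder.
getPageLinks <- function(remDr, page_url,
                         selector = "a.hypothetical-drug-link", wait = 1) {
  remDr$navigate(page_url)
  Sys.sleep(wait)  # crude pause so the page can finish loading
  elems <- remDr$findElements(using = "css selector", value = selector)
  hrefs <- unlist(lapply(elems, function(e) e$getElementAttribute("href")))
  as.character(unique(hrefs))  # character(0) when nothing matches
}
```

Applying such a function to every page URL and concatenating the results would give one vector of drug links, which is presumably close to what `drug_links` holds.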

## 3. Save information in spreadsheet

In [None]:
# reshape into a data frame, filling in each feature where it is available
drug_df = processTable(drugsDB)
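
How `processTable()` works internally is not shown in this notebook. The sketch below illustrates one way such a reshape could be done in base R, assuming each drug is stored as a named character vector and that different drugs may expose different fields. The names `records` and `recordsToDF` and the toy values are hypothetical.

```r
# Hypothetical input: one named character vector per drug, with differing fields.
records <- list(
  c(name = "DrugA", status = "Authorised", substance = "x"),
  c(name = "DrugB", status = "Withdrawn")
)

# Hypothetical helper: one row per drug, one column per field, NA where missing.
recordsToDF <- function(records) {
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(r) {
    row <- setNames(rep(NA_character_, length(fields)), fields)
    row[names(r)] <- r  # fill only the fields this drug provides
    as.data.frame(as.list(row), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

toy_df <- recordsToDF(records)
```

Here `toy_df` has two rows and three columns, and the `substance` entry for the second drug is `NA` because that field was absent from its record.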

# save
file_name = "drugs_df_EMA.csv"
#saveTable(drug_df,file_name)
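
The behaviour of `saveTable()` is defined in the helper script and is not shown here. As a base R alternative, a data frame can be written to CSV and read back to check the round trip. The sketch below uses a made-up toy data frame in place of `drug_df` and writes to a temporary directory so that no existing file is overwritten.

```r
# Toy data frame standing in for `drug_df`; the columns are made up.
toy_df <- data.frame(name = c("DrugA", "DrugB"),
                     status = c("Authorised", NA),
                     stringsAsFactors = FALSE)

# Write without row names so the file contains only the data columns.
out_file <- file.path(tempdir(), "drugs_df_EMA.csv")
write.csv(toy_df, out_file, row.names = FALSE)

# Read it back to confirm that the columns and missing values survive.
check_df <- read.csv(out_file, stringsAsFactors = FALSE)
```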

## 4. Exploratory Data Analysis

In [21]:
drug_df = read.table('drugs_df_EMA.csv', sep = ',', header = TRUE, stringsAsFactors = FALSE)