ORscraper is an R package designed to extract relevant medical information from clinical reports generated by the Oncomine Reporter software. This package is intended for healthcare professionals and researchers working with genetic data who need to automate the extraction and processing of information from report files. ORscraper provides tools to identify biopsies, extract genetic variants and pathogenicity classifications, filter relevant data, and query databases such as NCBI ClinVar.
Install the released version of ORscraper from CRAN:
install.packages("ORscraper")
You can install ORscraper from GitHub using the following R code:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install ORscraper from GitHub
devtools::install_github("SamuelGonzalez0204/ORscraper")
Below is a basic example of how to use ORscraper to extract information from PDF files:
library(ORscraper)
# Read content from a PDF file
example_pdf <- system.file("extdata", "100.1-example.pdf", package = "ORscraper")
lines <- read_pdf_content(example_pdf)
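# Read the list of genes of interest from the bundled Excel file
# (read_excel() comes from the readxl package, which must be installed)
genesFile <- system.file("extdata", "Genes.xlsx", package = "ORscraper")
genes <- readxl::read_excel(genesFile)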
mutations <- unique(genes$GEN)
# Extract mutation values from the extracted text
genes_mut <- c()
pathogenicities <- c()
tableValues <- extract_values_from_tables(lines, mutations)
genes_mut <- c(genes_mut, tableValues[1])
pathogenicities <- c(pathogenicities, tableValues[2])
# Filter only pathogenic mutations
pathogenic_mutations <- filter_pathogenic_only(pathogenicities, genes_mut)
print(pathogenic_mutations)
The ORscraper package includes several key functions:
- classify_biopsy(): Analyzes biopsy identifiers and categorizes them based on predefined rules.
- extract_chip_id(): Extracts chip values from filenames matching specific patterns.
- extract_fusions(): Identifies and extracts fusion variants from text lines.
- extract_intermediate_values(): Searches for a specific text pattern and extracts consecutive values.
- extract_values_from_tables(): Extracts information such as mutations, pathogenicity, and frequencies from tables in reports.
- extract_values_start_end(): Extracts values based on start and end markers.
- filter_pathogenic_only(): Filters mutations, retaining only those marked as “Pathogenic.”
- read_pdf_content(): Extracts the content of a PDF and splits it into individual lines.
- read_pdf_files(): Scans a directory and retrieves all PDF files.
- search_ncbi_clinvar(): Queries the NCBI ClinVar database for germline classifications.
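
To make two of the ideas above more concrete (extracting values between a start and an end marker, and keeping only pathogenic entries), here is a minimal base R sketch. The helpers `values_between()` and `keep_pathogenic()` are hypothetical names written for illustration; they are not the ORscraper implementations, and the package's own functions may take different arguments and return different structures. The sample report lines are made up.

```r
# Hypothetical helpers in base R, for illustration only.
# They are NOT the ORscraper implementations.

# Return the trimmed lines strictly between the first line matching
# `start` and the next line matching `end` (both regular expressions).
values_between <- function(lines, start, end) {
  start_idx <- grep(start, lines)
  if (length(start_idx) == 0) return(character(0))
  start_idx <- start_idx[1]

  end_idx <- grep(end, lines)
  end_idx <- end_idx[end_idx > start_idx]
  if (length(end_idx) == 0) return(character(0))
  end_idx <- end_idx[1]

  # Nothing to return when the markers are on adjacent lines
  if (end_idx - start_idx < 2) return(character(0))
  trimws(lines[(start_idx + 1):(end_idx - 1)])
}

# Keep only the genes whose classification is exactly "Pathogenic".
# %in% is used so that NA classifications are dropped rather than kept as NA.
keep_pathogenic <- function(genes, classifications) {
  genes[classifications %in% "Pathogenic"]
}

# Made-up report lines, not taken from a real Oncomine Reporter file
report <- c("Header", "Variants", "  KRAS", "  TP53", "End of variants", "Footer")

found <- values_between(report, "^Variants$", "^End of variants$")
print(found)

print(keep_pathogenic(found, c("Pathogenic", "Benign")))
```

In a real workflow the input lines would come from `read_pdf_content()` and the filtering would be done with `filter_pathogenic_only()`, as in the usage example; the sketch only shows the kind of logic those descriptions suggest.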