# Pre-processing texts with spaCy
[spaCy](https://spacy.io/) is an open-source library for advanced natural language processing, written in Python and Cython. This notebook calls it from R through the `spacyr` package.  
The features used here are:
* sentence segmentation
* lemmatisation
* tokenisation
* part-of-speech (POS) tagging
* named entity recognition (NER)

In [1]:
# library to use spaCy from R
library(spacyr)

# initialize spaCy with the English language model
spacyr::spacy_initialize(model = "en")

Finding a python executable with spaCy installed...
spaCy (language model: en) is installed in more than one python
spacyr will use /opt/conda/bin/python (because ask = FALSE)
successfully initialized (spaCy Version: 2.2.3, language model: en)
(python options: type = "python_executable", value = "/opt/conda/bin/python")


### Import example texts

In [2]:
# loads the object `quotations`, a vector of 6 movie quotations
load(file = "Data/quotations.RData")



### Run spaCy with the spacy_parse() function

In [3]:
quotations[1]
spacy_parse(x = quotations[1], lemma = TRUE, pos = TRUE, entity = TRUE)

doc_id,sentence_id,token_id,token,lemma,pos,entity
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
text1,1,1,I,-PRON-,PRON,
text1,1,2,'ve,have,AUX,
text1,1,3,got,get,VERB,
text1,1,4,a,a,DET,
text1,1,5,feeling,feeling,NOUN,
text1,1,6,we,-PRON-,PRON,
text1,1,7,'re,be,AUX,
text1,1,8,not,not,PART,
text1,1,9,in,in,ADP,
text1,1,10,Kansas,Kansas,PROPN,GPE_B


In [4]:
quotations[2]
spacy_parse(x = quotations[2], lemma = TRUE, pos = TRUE, entity = TRUE)

doc_id,sentence_id,token_id,token,lemma,pos,entity
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
text1,1,1,Life,life,NOUN,
text1,1,2,is,be,AUX,
text1,1,3,like,like,SCONJ,
text1,1,4,a,a,DET,
text1,1,5,box,box,NOUN,
text1,1,6,of,of,ADP,
text1,1,7,chocolates,chocolate,NOUN,
text1,1,8,.,.,PUNCT,
text1,2,1,You,-PRON-,PRON,
text1,2,2,never,never,ADV,


In [5]:
quotations[3]
spacy_parse(x = quotations[3], lemma = TRUE, pos = TRUE, entity = TRUE)

doc_id,sentence_id,token_id,token,lemma,pos,entity
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
text1,1,1,I,-PRON-,PRON,
text1,1,2,am,be,AUX,
text1,1,3,gon,go,VERB,
text1,1,4,na,to,PART,
text1,1,5,kill,kill,VERB,
text1,1,6,Bill,Bill,PROPN,PERSON_B
text1,1,7,.,.,PUNCT,
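
The result of `spacy_parse()` is an ordinary data frame, so it can be post-processed with base R. The sketch below is only an illustration: to stay runnable without spaCy, it rebuilds the table for `quotations[3]` by hand from the output shown above. The names `parsed`, `content` and `keywords`, and the choice to keep only verbs and proper nouns, are assumptions made for this example.

```r
# Hand-built copy of the spacy_parse() output shown above for quotations[3],
# so the example runs without spaCy installed.
parsed <- data.frame(
  doc_id      = "text1",
  sentence_id = 1L,
  token_id    = 1:7,
  token       = c("I", "am", "gon", "na", "kill", "Bill", "."),
  lemma       = c("-PRON-", "be", "go", "to", "kill", "Bill", "."),
  pos         = c("PRON", "AUX", "VERB", "PART", "VERB", "PROPN", "PUNCT"),
  entity      = c("", "", "", "", "", "PERSON_B", ""),
  stringsAsFactors = FALSE
)

# Drop punctuation tokens.
content <- parsed[parsed$pos != "PUNCT", ]

# spaCy 2.x lemmatises every pronoun to the placeholder "-PRON-";
# fall back to the lower-cased token in that case.
is_pron <- content$lemma == "-PRON-"
content$lemma[is_pron] <- tolower(content$token[is_pron])

# Keep only verbs and proper nouns (an arbitrary choice for illustration).
keywords <- content$lemma[content$pos %in% c("VERB", "PROPN")]
print(keywords)
```

Filtering on the `pos` column like this is a common way to reduce a text to its content words before counting or modelling.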


### Process multiple texts at the same time

In [6]:
spacy_parse(x = quotations[c(3, 5, 6)], lemma = TRUE, pos = TRUE, entity = TRUE)

doc_id,sentence_id,token_id,token,lemma,pos,entity
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
text1,1,1,I,-PRON-,PRON,
text1,1,2,am,be,AUX,
text1,1,3,gon,go,VERB,
text1,1,4,na,to,PART,
text1,1,5,kill,kill,VERB,
text1,1,6,Bill,Bill,PROPN,PERSON_B
text1,1,7,.,.,PUNCT,
text2,1,1,You,-PRON-,PRON,
text2,1,2,'re,be,AUX,
text2,1,3,a,a,DET,


### Get all named entities

In [7]:
spacy_extract_entity(x = quotations)

doc_id,text,ent_type,start_id,length
<chr>,<chr>,<chr>,<dbl>,<int>
text1,Kansas,GPE,10,1
text3,Bill,PERSON,6,1
text5,Harry,PERSON,6,1
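
The entity table relates directly to the token-level `entity` column returned by `spacy_parse()`: a tag ending in `_B` marks the first token of an entity, and a tag ending in `_I` marks a continuation token. As a rough sketch of that relationship, the base R code below rebuilds the `Kansas` row from a hand-built copy of the parse of `quotations[1]`. The names `parsed`, `tagged` and `entities` are assumptions for this example, and this is not how `spacyr` itself implements the extraction.

```r
# Hand-built copy of the relevant spacy_parse() columns for quotations[1].
parsed <- data.frame(
  doc_id   = "text1",
  token_id = 1:10,
  token    = c("I", "'ve", "got", "a", "feeling",
               "we", "'re", "not", "in", "Kansas"),
  entity   = c(rep("", 9), "GPE_B"),
  stringsAsFactors = FALSE
)

# Keep only tokens that carry an entity tag.
tagged <- parsed[parsed$entity != "", ]

# Each "_B" tag opens a new entity; following "_I" tags continue it.
tagged$ent_no <- cumsum(grepl("_B$", tagged$entity))

# Collapse the tokens of each entity into one row.
entities <- do.call(rbind, lapply(split(tagged, tagged$ent_no), function(e) {
  data.frame(
    doc_id   = e$doc_id[1],
    text     = paste(e$token, collapse = " "),
    ent_type = sub("_[BI]$", "", e$entity[1]),
    start_id = e$token_id[1],
    length   = nrow(e),
    stringsAsFactors = FALSE
  )
}))
print(entities)
```

In practice `spacy_extract_entity()` is the simpler route when only the entities are needed; a reconstruction like this is mainly useful when the full parse is already available.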
