# NLP with spaCy

## Adapted from https://www.kaggle.com/matleonard/intro-to-nlp

In [1]:
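# Import spaCy and load the English language model.
# Note: the 'en' shortcut works in spaCy 2 only; spaCy 3 requires the full
# package name, e.g. spacy.load('en_core_web_sm').
import spacy
nlp = spacy.load('en')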

In [2]:
#processing a line of text
doc = nlp("Tea is healthy and calming, don't you think")

In [3]:
#Tokenization - splitting text into tokens
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think


### Notice how "don't" is split into "do" and "n't".
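For contrast, here is a minimal sketch of what a plain whitespace split does with the same sentence. It uses only Python's built-in `str.split()` and is not part of the original notebook; the variable name `whitespace_tokens` is just illustrative.

```python
text = "Tea is healthy and calming, don't you think"

# str.split() only breaks on whitespace: punctuation stays attached
# to the preceding word and contractions are kept whole.
whitespace_tokens = text.split()

print(whitespace_tokens)
print(len(whitespace_tokens))
```

The split yields 8 items, with `calming,` and `don't` left intact, whereas the spaCy tokenizer above produced 10 tokens by separating the comma and splitting the contraction.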

In [4]:
# Next step: lemmatization (reducing each token to its base form, the lemma)
# and identifying stopwords.

print('Token \t\tLemma \t\tStopword')
print("-"*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False


### Lemmatizing and removing stopwords can sometimes hurt model performance, so treat these preprocessing steps as part of the hyperparameter optimization process.
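One way to keep these choices tunable is to expose them as switches. The sketch below is only an illustration: `preprocess` and `TOKENS` are hypothetical names, and the triples are copied by hand from the table printed above rather than read from a spaCy `Doc`.

```python
# (text, lemma, is_stop) triples copied from the table printed above.
TOKENS = [
    ("Tea", "tea", False),
    ("is", "be", True),
    ("healthy", "healthy", False),
    ("and", "and", True),
    ("calming", "calm", False),
    (",", ",", False),
    ("do", "do", True),
    ("n't", "not", True),
    ("you", "-PRON-", True),
    ("think", "think", False),
]


def preprocess(tokens, lemmatize=True, remove_stopwords=True):
    """Hypothetical helper: return token strings under the chosen options."""
    kept = []
    for text, lemma, is_stop in tokens:
        if remove_stopwords and is_stop:
            continue  # drop stopwords only when the switch is on
        kept.append(lemma if lemmatize else text)
    return kept


# Try every combination, as one would in a hyperparameter search.
for lemmatize in (False, True):
    for remove_stopwords in (False, True):
        print(lemmatize, remove_stopwords,
              preprocess(TOKENS, lemmatize, remove_stopwords))
```

Each combination yields a different input for the downstream model, so the two flags can be searched over like any other hyperparameter. Note that the comma survives stopword removal because it is not flagged as a stopword.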

## Next, let's look at Pattern Matching
### Pattern matching is a common NLP task: searching for specific tokens or phrases within a section of text or a whole document. Regular expressions can do this, but spaCy offers a more convenient way to achieve the same result.
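For comparison, the regular-expression route might look like the sketch below. It uses only Python's `re` module; the term list and the `sample` string (two fragments of the Wikipedia text used in this notebook) are chosen here for illustration.

```python
import re

# Illustrative term list; re.escape guards against regex metacharacters.
terms = ["discovered", "tested", "hospital", "patients"]
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
    re.IGNORECASE,  # case-insensitive matching
)

sample = (
    "Human coronaviruses were discovered in the 1960s. "
    "Scottish virologist June Almeida at St. Thomas Hospital in London."
)

# Each hit is (matched text, start character offset, end character offset).
regex_hits = [(m.group(0), m.start(), m.end()) for m in pattern.finditer(sample)]
print(regex_hits)
```

The regex reports character offsets into the raw string, while spaCy's matcher reports token positions in a `Doc`.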

In [5]:
from spacy.matcher import PhraseMatcher
# attr='LOWER' compares lowercased token text, so matching is case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

In [6]:
# Create a list of terms to match. The phrase matcher needs its patterns as Doc objects;
# the easiest way to get these is a list comprehension that runs each term through the nlp model.

terms = ['Discovered', 'Tested', 'Hospital', 'Patients']
patterns = [nlp(text) for text in terms]

matcher.add("TerminologyList", patterns)

In [7]:
#Check datatype of patterns item
type(patterns[0])

spacy.tokens.doc.Doc

In [8]:
# Borrowed from https://en.wikipedia.org/wiki/Coronavirus

text_doc= nlp('''The first reports of an infection caused, by what would later be determined to be a coronavirus, occurred in the late 1920s, when an acute respiratory infection of domesticated chickens emerged in North America.[15][16] Arthur Schalk and M.C. Hawn in 1931 made the first detailed report which described a new respiratory infection of chickens in North Dakota. The infection of new-born chicks was characterized by gasping and listlessness with high mortality rates of 40–90%.[17] Leland David Bushnell and Carl Alfred Brandly isolated the virus in 1933.[18] The virus was then known as infectious bronchitis virus (IBV). Charles D. Hudson and Fred Robert Beaudette cultivated the virus for the first time in 1937.[19] The specimen came to be known as the Beaudette strain. In the late 1940s, two more animal coronaviruses, JHM that caused brain disease (murine encephalitis) and mouse hepatitis virus (MHV) that caused hepatitis in mice were discovered.[20] It was not realized at the time that these three different viruses were related.[21]

Human coronaviruses were discovered in the 1960s[22][23] using two different methods in the United Kingdom and the United States.[24] E.C. Kendall, Malcolm Bynoe, and David Tyrrell working at the Common Cold Unit of the British Medical Research Council collected a unique common cold virus designated B814 in 1961.[25][26][27] The virus could not be cultivated using standard techniques which had successfully cultivated rhinoviruses, adenoviruses and other known common cold viruses. In 1965, Tyrrell and Bynoe successfully cultivated the novel virus by serially passing it through organ culture of human embryonic trachea.[28] The new cultivating method was introduced to the lab by Bertil Hoorn.[29] The isolated virus when intranasally inoculated into volunteers caused a cold and was inactivated by ether which indicated it had a lipid envelope.[25][30] Dorothy Hamre[31] and John Procknow at the University of Chicago isolated a novel cold from medical students in 1962. They isolated and grew the virus in kidney tissue culture, assigning it as 229E. The novel virus caused a cold in volunteers and was inactivated by ether similarly as B814.[32]


Transmission electron micrograph of organ cultured coronavirus OC43
Scottish virologist June Almeida at St. Thomas Hospital in London, collaborating with Tyrrell, compared the structures of IBV, B814 and 229E in 1967.[33][34] Using electron microscopy the three viruses were shown to be morphologically related by their general shape and distinctive club-like spikes.[35] A research group at the National Institute of Health the same year was able to isolate another member of this new group of viruses using organ culture and named one of the samples OC43 (OC for organ culture).[36] Like B814, 229E, and IBV, the novel cold virus OC43 had distinctive club-like spikes when observed with the electron microscope.
''')

In [9]:
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 189, 190), (3766102292120407359, 392, 393)]


Each match is a tuple of the match ID followed by the start and end token indices of the matched phrase. The end index is exclusive, so `text_doc[start:end]` returns exactly the matched span.
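The indexing convention can be sketched with a plain Python list standing in for the document. The token list and the indices `3, 4` below are made up for illustration; only the ID value is copied from the output above.

```python
# A short list of tokens standing in for a spaCy Doc (illustrative only).
tokens = ["Human", "coronaviruses", "were", "discovered", "in", "the", "1960s"]

# Same (match_id, start, end) shape as the matcher output; indices are hypothetical.
example_match = (3766102292120407359, 3, 4)
match_id, start, end = example_match

# end is exclusive, so the slice holds exactly one token here.
matched = tokens[start:end]

# Two tokens of context on each side; max() keeps the left bound from going negative.
context = tokens[max(start - 2, 0):end + 2]

print(matched)
print(context)
```

Clamping the left bound with `max(start - 2, 0)` is a small safeguard for matches near the beginning of a list, where a negative index would otherwise count from the end.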



In [18]:
for match in matches:
    _,start,end = match
    print(text_doc[start:end])

discovered
Hospital


### Only two matches were found: "discovered" and "Hospital". The terms "Tested" and "Patients" do not occur in the text.

In [14]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start-2:end+2])

TerminologyList coronaviruses were discovered in the
