### Chapter 9 - NLTK & spaCy

### NLTK:

In [1]:
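import pandas as pd
import numpy as np
import nltk
# One-time setup if the tokenizer/tagger data are missing (resource names vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')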


In [30]:
paragraph = """Biotechnology is a broad area of biology, involving the use of living systems and organisms to develop or make products. Depending on the tools and applications, it often overlaps with related scientific fields. In the late 20th and early 21st centuries, biotechnology has expanded to include new and diverse sciences, such as genomics, recombinant gene techniques, applied immunology, and development of pharmaceutical therapies and diagnostic tests. The term biotechnology was first used by Karl Ereky in 1919, meaning the production of products from raw materials with the aid of living organisms."""

In [7]:
paragraph

'Biotechnology is a broad area of biology, involving the use of living systems and organisms to develop or make products. Depending on the tools and applications, it often overlaps with related scientific fields. In the late 20th and early 21st centuries, biotechnology has expanded to include new and diverse sciences, such as genomics, recombinant gene techniques, applied immunology, and development of pharmaceutical therapies and diagnostic tests. The term biotechnology was first used by Karl Ereky in 1919, meaning the production of products from raw materials with the aid of living organisms.'

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(paragraph)
sentences

['Biotechnology is a broad area of biology, involving the use of living systems and organisms to develop or make products.',
 'Depending on the tools and applications, it often overlaps with related scientific fields.',
 'In the late 20th and early 21st centuries, biotechnology has expanded to include new and diverse sciences, such as genomics, recombinant gene techniques, applied immunology, and development of pharmaceutical therapies and diagnostic tests.',
 'The term biotechnology was first used by Karl Ereky in 1919, meaning the production of products from raw materials with the aid of living organisms.']

In [11]:
words = word_tokenize(sentences[0])
words

['Biotechnology',
 'is',
 'a',
 'broad',
 'area',
 'of',
 'biology',
 ',',
 'involving',
 'the',
 'use',
 'of',
 'living',
 'systems',
 'and',
 'organisms',
 'to',
 'develop',
 'or',
 'make',
 'products',
 '.']
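
The token list above still contains punctuation, which usually gets filtered out before counting words. The following is a minimal sketch and not part of the original notebook: it hard-codes the `word_tokenize` output shown above so that it runs without NLTK data, and the names `clean_words` and `word_counts` are illustrative.

```python
from collections import Counter

# Output of word_tokenize(sentences[0]), copied from the cell above
words = ['Biotechnology', 'is', 'a', 'broad', 'area', 'of', 'biology', ',',
         'involving', 'the', 'use', 'of', 'living', 'systems', 'and',
         'organisms', 'to', 'develop', 'or', 'make', 'products', '.']

# Drop punctuation tokens and lowercase the rest
clean_words = [w.lower() for w in words if w.isalpha()]

# Count how often each remaining word occurs
word_counts = Counter(clean_words)

print(len(words), len(clean_words))
print(word_counts.most_common(1))
```

`str.isalpha` is a deliberately crude filter: it would also drop tokens such as `20th` or `1919`, so a real pipeline might test against `string.punctuation` instead.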

In [20]:
# Tag each token of the first sentence with its Penn Treebank part-of-speech tag
tokens = word_tokenize(sentences[0])
tags = nltk.pos_tag(tokens)
tags

[('Biotechnology', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('broad', 'JJ'),
 ('area', 'NN'),
 ('of', 'IN'),
 ('biology', 'NN'),
 (',', ','),
 ('involving', 'VBG'),
 ('the', 'DT'),
 ('use', 'NN'),
 ('of', 'IN'),
 ('living', 'VBG'),
 ('systems', 'NNS'),
 ('and', 'CC'),
 ('organisms', 'NNS'),
 ('to', 'TO'),
 ('develop', 'VB'),
 ('or', 'CC'),
 ('make', 'VB'),
 ('products', 'NNS'),
 ('.', '.')]
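
The tagged pairs above are plain Python tuples, so they can be summarised with the standard library alone. The following is a minimal sketch and not part of the original notebook: it copies the `nltk.pos_tag` output shown above into a literal so that it runs without NLTK installed. The names `tag_counts` and `nouns` are illustrative.

```python
from collections import Counter

# Output of nltk.pos_tag(tokens), copied from the cell above
tags = [('Biotechnology', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('broad', 'JJ'),
        ('area', 'NN'), ('of', 'IN'), ('biology', 'NN'), (',', ','),
        ('involving', 'VBG'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'),
        ('living', 'VBG'), ('systems', 'NNS'), ('and', 'CC'),
        ('organisms', 'NNS'), ('to', 'TO'), ('develop', 'VB'), ('or', 'CC'),
        ('make', 'VB'), ('products', 'NNS'), ('.', '.')]

# Count how many tokens received each tag
tag_counts = Counter(tag for _, tag in tags)

# Keep only tokens whose tag starts with 'NN' (NN, NNS, NNP, NNPS are noun tags)
nouns = [word for word, tag in tags if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```

Matching on the `NN` prefix works because the Penn Treebank tag set groups all noun tags under it; the same trick with `VB` would select the verb forms.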

### spaCy:

In [27]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [31]:
spacy_paragraph = nlp(paragraph)
spacy_paragraph

Biotechnology is a broad area of biology, involving the use of living systems and organisms to develop or make products. Depending on the tools and applications, it often overlaps with related scientific fields. In the late 20th and early 21st centuries, biotechnology has expanded to include new and diverse sciences, such as genomics, recombinant gene techniques, applied immunology, and development of pharmaceutical therapies and diagnostic tests. The term biotechnology was first used by Karl Ereky in 1919, meaning the production of products from raw materials with the aid of living organisms.

In [33]:
print([(X.text, X.label_) for X in spacy_paragraph.ents])

[('the late 20th and early 21st centuries', 'DATE'), ('Karl Ereky', 'PERSON'), ('1919', 'DATE')]
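
A common next step is to tally the recognised entities by label. The following is a minimal sketch and not part of the original notebook: it hard-codes the entity list printed above so that it runs without a spaCy model, and the names `entities`, `label_counts` and `by_label` are illustrative. With a live `Doc`, the same list would come from `[(ent.text, ent.label_) for ent in spacy_paragraph.ents]`.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the output above
entities = [('the late 20th and early 21st centuries', 'DATE'),
            ('Karl Ereky', 'PERSON'),
            ('1919', 'DATE')]

# How many entities of each type were found
label_counts = Counter(label for _, label in entities)

# Group the entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(label_counts)
print(dict(by_label))
```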


In [40]:
sentences = list(spacy_paragraph.sents)
print(sentences)

[Biotechnology is a broad area of biology, involving the use of living systems and organisms to develop or make products., Depending on the tools and applications, it often overlaps with related scientific fields., In the late 20th and early 21st centuries, biotechnology has expanded to include new and diverse sciences, such as genomics, recombinant gene techniques, applied immunology, and development of pharmaceutical therapies and diagnostic tests., The term biotechnology was first used by Karl Ereky in 1919, meaning the production of products from raw materials with the aid of living organisms.]


In [41]:
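# Render the entities of the already-parsed document; str(sentences) would add list brackets to the text
displacy.render(spacy_paragraph, style='ent', jupyter=True)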

In [43]:
# Draw the dependency parse of the first sentence; displacy accepts a Span directly
displacy.render(sentences[0], style='dep', jupyter=True, options={'distance': 120})