
In [None]:
%pip install composable

Collecting composable
  Downloading composable-0.5.4-py3-none-any.whl.metadata (696 bytes)
Collecting python-forge<19.0,>=18.6 (from composable)
  Downloading python_forge-18.6.0-py35-none-any.whl.metadata (6.6 kB)
Collecting toolz<0.12.0,>=0.11.1 (from composable)
  Downloading toolz-0.11.2-py3-none-any.whl.metadata (5.1 kB)
Downloading composable-0.5.4-py3-none-any.whl (8.5 kB)
Downloading python_forge-18.6.0-py35-none-any.whl (31 kB)
Downloading toolz-0.11.2-py3-none-any.whl (55 kB)
Installing collected packages: python-forge, toolz, composable
  Attempting uninstall: toolz
    Found existing installation: toolz 0.12.1
    Uninstalling toolz-0.12.1:
      Successfully uninstalled toolz-0.12.1
Successfully installed composable-0.5.4 python-forge-18.6.0 toolz-0.11.2


In [None]:
from composable import pipeable
from composable.strict import map, filter

## Understanding text via *levels of abstraction*

1. Chapters consist of sections,
2. Sections consist of paragraphs,
3. Paragraphs consist of sentences,
4. Sentences consist of words,
5. Words consist of characters.
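
The hierarchy above can be sketched with plain string operations. This is only a rough illustration on a made-up three-sentence sample: the splitting rules (blank line for paragraphs, end punctuation for sentences, whitespace for words) are simplifying assumptions, not how `nltk` or `SpaCy` tokenize.

```python
import re

# Made-up sample text, not taken from the corpus.
text = "Sir Walter was vain. He read one book.\n\nAnne was not vain."

# Paragraphs: separated by a blank line.
paragraphs = text.split("\n\n")

# Sentences: split after '.', '!' or '?' when followed by whitespace.
sentences = [re.split(r"(?<=[.!?])\s+", p) for p in paragraphs]

# Words: split each sentence on whitespace.
words = [[s.split() for s in para] for para in sentences]

# Characters: a word is just a sequence of characters.
characters = list(words[0][0][0])
```

Each level is a list of the level below it, which is why nested comprehensions (or nested maps) fit text processing so naturally.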

## Python tools for processing natural language

1. `nltk` is the Natural Language Toolkit.
2. [Project Gutenberg](https://www.gutenberg.org/) is a collection of free ebooks/texts.

In [None]:
import nltk

In [None]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

#### **Cleaning up Persuasion with comprehensions**

In [None]:
from nltk.corpus import gutenberg

In [None]:
(persuasion_raw := gutenberg.raw('austen-persuasion.txt')[35:])[:2000]

'Chapter 1\n\n\nSir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,\nfor his own amusement, never took up any book but the Baronetage;\nthere he found occupation for an idle hour, and consolation in a\ndistressed one; there his faculties were roused into admiration and\nrespect, by contemplating the limited remnant of the earliest patents;\nthere any unwelcome sensations, arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century; and there,\nif every other leaf were powerless, he could read his own history\nwith an interest which never failed.  This was the page at which\nthe favourite volume always opened:\n\n           "ELLIOT OF KELLYNCH HALL.\n\n"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\ndaughter of James Stevenson, Esq. of South Park, in the county of\nGloucester, by which lady (who died 1800) he has issue Elizabeth,\nborn June 1, 1785; Anne, born August
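
The slice `[35:]` hard-codes the length of the title line. A minimal sketch of an alternative is to locate the first chapter heading instead. The `drop_before` helper and the sample header below are hypothetical and are not part of `nltk`.

```python
def drop_before(text, marker):
    """Return `text` starting at the first occurrence of `marker`."""
    idx = text.find(marker)
    # Leave the text unchanged when the marker is absent.
    return text if idx == -1 else text[idx:]

# Hypothetical stand-in for the start of a Gutenberg file.
sample = "[A Title by An Author 1800]\n\n\nChapter 1\n\n\nSir Walter Elliot, of Kellynch Hall"
body = drop_before(sample, "Chapter 1")
```

This trades a magic number for a marker string, which may be easier to reuse across books whose title lines differ in length.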

In [None]:
(persuasion_lower := persuasion_raw.lower())



In [None]:
from string import punctuation

punc_map = str.maketrans('', '', punctuation)

(persuasion_no_punc := persuasion_lower.translate(punc_map))
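
As a small illustration of what `str.maketrans` builds, the sketch below contrasts the three-argument form used above (delete characters) with the two-argument form (replace characters pairwise). The sample strings and variable names are made up for the example.

```python
from string import punctuation

# Three arguments: every character in the third string is mapped to None,
# so translate() deletes it.
delete_punc = str.maketrans('', '', punctuation)

# Two equal-length arguments: characters are replaced pairwise (a->x, b->y).
swap = str.maketrans('ab', 'xy')

cleaned = "Esq. of South Park, (who died 1800)".translate(delete_punc)
swapped = "abc".translate(swap)
```

The table itself is an ordinary dictionary keyed by Unicode code points, so it can be inspected directly, for example with `delete_punc[ord(',')]`.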



In [None]:
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


#### **Cleaning up Persuasion with a pipe**

In [None]:
# Define each cleaning step as a pipeable lambda.
# Small, single-purpose steps are easy to test and debug one at a time.

punc_map = str.maketrans('', '', punctuation)

drop_title = pipeable(lambda el, text: text[el:])
make_lower = pipeable(lambda text: text.lower())
remove_punc = pipeable(lambda text: text.translate(punc_map))

In [None]:
(persuasion_clean :=                     # assignment expression
 gutenberg.raw('austen-persuasion.txt')  # data
 >> drop_title(35)
 >> make_lower
 >> remove_punc
)[:2000]

'chapter 1\n\n\nsir walter elliot of kellynch hall in somersetshire was a man who\nfor his own amusement never took up any book but the baronetage\nthere he found occupation for an idle hour and consolation in a\ndistressed one there his faculties were roused into admiration and\nrespect by contemplating the limited remnant of the earliest patents\nthere any unwelcome sensations arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century and there\nif every other leaf were powerless he could read his own history\nwith an interest which never failed  this was the page at which\nthe favourite volume always opened\n\n           elliot of kellynch hall\n\nwalter elliot born march 1 1760 married july 15 1784 elizabeth\ndaughter of james stevenson esq of south park in the county of\ngloucester by which lady who died 1800 he has issue elizabeth\nborn june 1 1785 anne born august 9 1787 a stillborn son\nnovember 5 
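
To see roughly how a pipe like this can work, here is a minimal stand-in built on Python's `__rrshift__` hook. `MiniPipe` is a hypothetical class written for this illustration; it is not the `composable` implementation, and it assumes the piped value is passed as the last positional argument.

```python
class MiniPipe:
    """Hypothetical stand-in for `pipeable` (not the composable source)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __call__(self, *more):
        # Supplying arguments returns a new stage with those arguments fixed.
        return MiniPipe(self.func, *self.args, *more)

    def __rrshift__(self, left):
        # `left >> stage` calls the function with `left` as the final argument.
        return self.func(*self.args, left)

drop_title = MiniPipe(lambda el, text: text[el:])
make_lower = MiniPipe(lambda text: text.lower())

# "[Title]" plus two newlines is 9 characters.
result = "[Title]\n\nChapter 1" >> drop_title(9) >> make_lower
```

Because `str` does not define `>>`, Python falls back to the right-hand operand's `__rrshift__`, which is what lets ordinary data sit on the left of the pipe.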

## **Using `PipeableObject`s for easier piping.**

Did you notice the repeated pattern in the helper functions?

```{Python}
make_lower = pipeable(lambda text: text.lower())
remove_punc = pipeable(lambda text: text.translate(punc_map))
```

In both of these functions, we are just calling a method on the incoming object.  Here is a class that will allow us to use this pattern in our pipes without all the boilerplate code.

In [None]:
class PipeableObject(object):
    def __init__(self, function = lambda x: x, after_method_call = False):
        self._function = function
        self._after_method_call = after_method_call

    def __getattr__(self, name):
        return PipeableObject(lambda x: getattr(self._function(x), name), after_method_call = False)

    def __call__(self, *args, **kwargs):
        if self._after_method_call:
            return self._function(*args, **kwargs)
        else:
            return PipeableObject(lambda x: self._function(x)(*args, **kwargs),
                                  after_method_call = True)

    def __rrshift__(self, other):
        return self._function(other)

obj = PipeableObject()

### **Using the pipeable `obj` in a pipe.**

We can pipe into `obj` to call the methods on the incoming object.  For example, if we want to make a string lowercase, simply pipe into `obj.lower()`.

In [None]:
# The expression piped into must end with a method call, i.e. with ()
"Abc" >> obj.lower()

'abc'

In [None]:
"A,b,c" >> obj.lower() >> obj.split(',')

['a', 'b', 'c']

### **WARNING!  Make sure your `obj` invocation ends in a method call**

It is important that you finish any expression with a call to a method, as expressions that don't end with a call will be "incomplete" and return a function that would need to be subsequently called.

In [None]:
"Abc" >> obj.lower

<function str.lower()>

In [None]:
("Abc" >> obj.lower)()

'abc'

In [None]:
"A,b,c" >> obj.split

<function str.split(sep=None, maxsplit=-1)>

In [None]:
("A,b,c" >> obj.split)(',')

['A', 'b', 'c']

### **Composing the pipeable `obj` with other tools from `composable`**

Note that `obj.lower()` behaves like a function (calling it on a string returns the lowercased string), so it can be used as input to functions like `map` from `composable`.

In [None]:
('A,b,c'
 >> obj.split(',')  # Returns a list of strings
 >> map(obj.lower())
)

['a', 'b', 'c']
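
Method calls on `obj` can also be chained inside a single pipe stage. The sketch below repeats the `PipeableObject` class from above so that it runs on its own, and it uses Python's built-in `map` rather than the one from `composable`; the sample strings are made up.

```python
class PipeableObject(object):
    def __init__(self, function = lambda x: x, after_method_call = False):
        self._function = function
        self._after_method_call = after_method_call

    def __getattr__(self, name):
        # Record an attribute lookup to perform on the incoming object.
        return PipeableObject(lambda x: getattr(self._function(x), name), after_method_call = False)

    def __call__(self, *args, **kwargs):
        if self._after_method_call:
            # Already complete: apply the recorded steps to the argument.
            return self._function(*args, **kwargs)
        else:
            # Record the method call and mark the expression as complete.
            return PipeableObject(lambda x: self._function(x)(*args, **kwargs),
                                  after_method_call = True)

    def __rrshift__(self, other):
        return self._function(other)

obj = PipeableObject()

# strip, then lower, then split, all in one stage.
chained = "  A,b,C  " >> obj.strip().lower().split(',')

# A completed expression is callable, so the built-in map accepts it too.
mapped = list(map(obj.lower(), ['A', 'b', 'C']))
```

Each `.name` records an attribute lookup and each `(...)` records the call, so the chain is only evaluated once a value is piped in or the completed object is called.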

#### Cleaning up Persuasion with pipeable objects

In [None]:
punc_map = str.maketrans('', '', punctuation)
drop_title = pipeable(lambda el, text: text[el:])

In [None]:
(persuasion_clean :=
 gutenberg.raw('austen-persuasion.txt')
 >> drop_title(35)
 >> obj.lower()
 >> obj.translate(punc_map)
)[:2000]

'chapter 1\n\n\nsir walter elliot of kellynch hall in somersetshire was a man who\nfor his own amusement never took up any book but the baronetage\nthere he found occupation for an idle hour and consolation in a\ndistressed one there his faculties were roused into admiration and\nrespect by contemplating the limited remnant of the earliest patents\nthere any unwelcome sensations arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century and there\nif every other leaf were powerless he could read his own history\nwith an interest which never failed  this was the page at which\nthe favourite volume always opened\n\n           elliot of kellynch hall\n\nwalter elliot born march 1 1760 married july 15 1784 elizabeth\ndaughter of james stevenson esq of south park in the county of\ngloucester by which lady who died 1800 he has issue elizabeth\nborn june 1 1785 anne born august 9 1787 a stillborn son\nnovember 5 

## **Using the `SpaCy` library for tokenization**

`SpaCy` is another powerful library for NLP.

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

### `SpaCy` tokens with comprehensions

#### Word tokens

In [None]:
doc = nlp(persuasion_clean)

In [None]:
(persuasion_words := [token.text for token in doc])[:10]

['chapter',
 '1',
 '\n\n\n',
 'sir',
 'walter',
 'elliot',
 'of',
 'kellynch',
 'hall',
 'in']

#### Sentence tokens

In [None]:
# Sentence tokenization needs punctuation. `doc` was built from the
# punctuation-free text, so the sentence boundaries below are poor.

(persuasion_sents := [sent.text for sent in doc.sents])[:2]

['chapter 1\n\n\nsir walter elliot of kellynch hall in somersetshire was a man who\nfor his own amusement never took up any book but the baronetage\nthere he found occupation for an idle hour and consolation in a\ndistressed one there his faculties were roused into admiration and\nrespect by contemplating the limited remnant of the earliest patents\nthere any unwelcome sensations arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century and there\nif every other leaf were powerless he could read his own history\nwith an interest which never failed  this was the page at which\nthe favourite volume always opened\n\n           elliot of kellynch hall\n\nwalter elliot born march 1 1760 married july 15 1784 elizabeth\ndaughter of james stevenson esq of south park in the county of\ngloucester by which lady who died 1800 he has issue elizabeth\nborn june 1 1785 anne born august 9 1787 a stillborn son\nnovember 5
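
To illustrate why the stripped text yields such long "sentences", the sketch below uses a crude regular-expression splitter on a fragment of the passage. This is an assumption-laden toy, not how `SpaCy` finds sentence boundaries; `naive_sentences` is a hypothetical helper.

```python
import re
from string import punctuation

def naive_sentences(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

fragment = ("he could read his own history\nwith an interest which never failed.  "
            "This was the page at which\nthe favourite volume always opened:")

# Same fragment with all punctuation deleted.
stripped = fragment.translate(str.maketrans('', '', punctuation))

with_punc = naive_sentences(fragment)
without_punc = naive_sentences(stripped)
```

With punctuation the fragment splits in two; without it there is no boundary signal left, and the whole fragment comes back as a single piece.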

In [None]:
(newline_map := str.maketrans('\n', " "))

{10: 32}

In [None]:
doc_w_punc = nlp(persuasion_raw)

In [None]:
(persuasion_sents := [sent.text for sent in doc_w_punc.sents])[:3]

['Chapter 1\n\n\nSir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,\nfor his own amusement, never took up any book but the Baronetage;\nthere he found occupation for an idle hour, and consolation in a\ndistressed one; there his faculties were roused into admiration and\nrespect, by contemplating the limited remnant of the earliest patents;\nthere any unwelcome sensations, arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century; and there,\nif every other leaf were powerless, he could read his own history\nwith an interest which never failed.  ',
 'This was the page at which\nthe favourite volume always opened:\n\n           "ELLIOT OF KELLYNCH HALL.\n\n',
 '"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\ndaughter of James Stevenson, Esq. of South Park, in the county of\nGloucester, by which lady (who died 1800) he has issue Elizabeth,\nborn June 1, 1785; Anne, 

### **`SpaCy` tokens with a pipe and pipeable object.**

In [None]:
doc = nlp(persuasion_clean)

In [None]:
(word_tokens :=
 persuasion_clean
 >> pipeable(nlp)
 >> map(lambda token: token.text)
)[:10]

['chapter',
 '1',
 '\n\n\n',
 'sir',
 'walter',
 'elliot',
 'of',
 'kellynch',
 'hall',
 'in']

In [None]:
(sent_tokens :=
 persuasion_raw
 >> pipeable(nlp) # Converts raw text into a Doc object
 >> pipeable(lambda doc: doc.sents) # Extracts sentences from the Doc
 >> map(lambda sent: sent.text) # Converts each sentence span to text
)[:2]


['Chapter 1\n\n\nSir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,\nfor his own amusement, never took up any book but the Baronetage;\nthere he found occupation for an idle hour, and consolation in a\ndistressed one; there his faculties were roused into admiration and\nrespect, by contemplating the limited remnant of the earliest patents;\nthere any unwelcome sensations, arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century; and there,\nif every other leaf were powerless, he could read his own history\nwith an interest which never failed.  ',
 'This was the page at which\nthe favourite volume always opened:\n\n           "ELLIOT OF KELLYNCH HALL.\n\n']

## **Pipeable attributes.**

We couldn't use our pipeable `obj` to extract elements like `doc.sents` or `token.text` because they are plain attributes, not method calls, and `obj` expressions must end in a call.  Instead, we can use the `PipeableAttribute` class.

In [None]:
class PipeableAttribute(object):
    def __init__(self, function = lambda x: x):
        self.function = function

    def __getattr__(self, name):
        return pipeable(lambda x: getattr(x, name))

    def __rrshift__(self, other):
        return self.function(other)

    def __call__(self, *args, **kwargs):
        return self.function(*args, **kwargs)

attr = PipeableAttribute()

In [None]:
class Example(object):
    def __init__(self, a):
        self.a = a

example = Example(5)

In [None]:
example >> attr.a

5

In [None]:
seq = [Example(i) for i in range(5)]

(seq
 >> map(attr.a)
)

[0, 1, 2, 3, 4]
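
For comparison, the standard library's `operator.attrgetter` performs the same attribute lookup, although it cannot be used with `>>` on its own. The sketch below repeats the `Example` class so that it runs standalone and uses the built-in `map`.

```python
from operator import attrgetter

class Example(object):
    def __init__(self, a):
        self.a = a

seq = [Example(i) for i in range(5)]

# attrgetter('a') returns a function equivalent to lambda x: x.a
values = list(map(attrgetter('a'), seq))
```

`attr.a` can be read as a pipe-friendly spelling of `attrgetter('a')`.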

### **Cleaning up the sentence token pipe with `attr`**

In [None]:
(sent_tokens :=
 persuasion_raw
 >> pipeable(nlp)
 >> pipeable(lambda doc: doc.sents)
 >> map(lambda sent: sent.text)
)[:2]

['Chapter 1\n\n\nSir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,\nfor his own amusement, never took up any book but the Baronetage;\nthere he found occupation for an idle hour, and consolation in a\ndistressed one; there his faculties were roused into admiration and\nrespect, by contemplating the limited remnant of the earliest patents;\nthere any unwelcome sensations, arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century; and there,\nif every other leaf were powerless, he could read his own history\nwith an interest which never failed.  ',
 'This was the page at which\nthe favourite volume always opened:\n\n           "ELLIOT OF KELLYNCH HALL.\n\n']

In [None]:
(sent_tokens :=
 persuasion_raw
 >> pipeable(nlp)
 >> attr.sents
 >> map(attr.text)
)[:2]

['Chapter 1\n\n\nSir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,\nfor his own amusement, never took up any book but the Baronetage;\nthere he found occupation for an idle hour, and consolation in a\ndistressed one; there his faculties were roused into admiration and\nrespect, by contemplating the limited remnant of the earliest patents;\nthere any unwelcome sensations, arising from domestic affairs\nchanged naturally into pity and contempt as he turned over\nthe almost endless creations of the last century; and there,\nif every other leaf were powerless, he could read his own history\nwith an interest which never failed.  ',
 'This was the page at which\nthe favourite volume always opened:\n\n           "ELLIOT OF KELLYNCH HALL.\n\n']

## <font color="red"> Exercise 3.2 </font>

Perform the following on Sense and Sensibility by Jane Austen.

1. Download and load the text,
2. Remove the title,
3. Get the sentence tokens using `SpaCy`.

Do this two ways: (1) with comprehensions, and (2) with a pipe.

# **Comprehension Solution**

In [None]:
# Your code here

In [None]:
import nltk
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
# 1. Download and load the text
(sense_raw := gutenberg.raw('austen-sense.txt'))[:2000]

"[Sense and Sensibility by Jane Austen 1811]\n\nCHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\nTheir estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surrounding acquaintance.\nThe late owner of this estate was a single man, who lived\nto a very advanced age, and who for many years of his life,\nhad a constant companion and housekeeper in his sister.\nBut her death, which happened ten years before his own,\nproduced a great alteration in his home; for to supply\nher loss, he invited and received into his house the family\nof his nephew Mr. Henry Dashwood, the legal inheritor\nof the Norland estate, and the person to whom he intended\nto bequeath it.  In the society of his nephew and niece,\nand their children, the old Gentleman's days were\ncomfortably spent.  His attachment to them all increased.\nT

In [None]:
# 2. Removing the title
(sense_raw := gutenberg.raw('austen-sense.txt')[45:])[:2000]

"CHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\nTheir estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surrounding acquaintance.\nThe late owner of this estate was a single man, who lived\nto a very advanced age, and who for many years of his life,\nhad a constant companion and housekeeper in his sister.\nBut her death, which happened ten years before his own,\nproduced a great alteration in his home; for to supply\nher loss, he invited and received into his house the family\nof his nephew Mr. Henry Dashwood, the legal inheritor\nof the Norland estate, and the person to whom he intended\nto bequeath it.  In the society of his nephew and niece,\nand their children, the old Gentleman's days were\ncomfortably spent.  His attachment to them all increased.\nThe constant attention of Mr. and Mrs. Henry Das

In [None]:
# 3. Get the sentence tokens using SpaCy

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')



In [None]:
doc_w_punc_sense = nlp(sense_raw)

In [None]:
(sense_sents := [sent.text for sent in doc_w_punc_sense.sents])[:5]

['CHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\n',
 'Their estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surrounding acquaintance.\n',
 'The late owner of this estate was a single man, who lived\nto a very advanced age, and who for many years of his life,\nhad a constant companion and housekeeper in his sister.\n',
 'But her death, which happened ten years before his own,\nproduced a great alteration in his home; for to supply\nher loss, he invited and received into his house the family\nof his nephew Mr. Henry Dashwood, the legal inheritor\nof the Norland estate, and the person to whom he intended\nto bequeath it.  ',
 "In the society of his nephew and niece,\nand their children, the old Gentleman's days were\ncomfortably spent.  "]

# **Pipe Solution**

In [None]:
(sent_tokens :=
 sense_raw
 >> pipeable(nlp) # Converts raw text into a Doc object
 >> pipeable(lambda doc: doc.sents) # Extracts sentences from the Doc
 >> map(lambda sent: sent.text) # Converts each sentence span to text
)[:2]

['CHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\n',
 'Their estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surrounding acquaintance.\n']

In [None]:
(sent_tokens :=
 sense_raw
 >> pipeable(nlp)
 >> attr.sents
 >> map(attr.text)
)[:2]

['CHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\n',
 'Their estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surrounding acquaintance.\n']