# Exercise 2: Building a "little stemmer"

**For this exercise, we will take a sample of Antoine de Saint-Exupéry's novella <br>*The Little Prince* and use it to demonstrate tokenization and stemming.**

**Here is your sample text, which appears at the beginning of the book:**

In [1]:
text = """
Once when I was six years old I saw a magnificent picture in a book, called 
True Stories from Nature, about the primeval forest. It was a picture of 
a boa constrictor in the act of swallowing an animal. Here is a copy of 
the drawing.
Boa
In the book it said: "Boa constrictors swallow their prey whole, without 
chewing it. After that they are not able to move, and they sleep through 
the six months that they need for digestion."
I pondered deeply, then, over the adventures of the jungle. And after some work
with a colored pencil I succeeded in making my first drawing. 
My Drawing Number One. It looked something like this:
Hat
I showed my masterpiece to the grown-ups, and asked them whether the drawing 
frightened them.
But they answered: "Frighten? Why should any one be frightened by a hat?"
My drawing was not a picture of a hat. It was a picture of a boa constrictor 
digesting an elephant. But since the grown-ups were not able to understand it, 
I made another drawing: I drew the inside of a boa constrictor, so that 
the grown-ups could see it clearly. They always need to have things explained. 
My Drawing Number Two looked like this:
Elephant inside the boa
The grown-ups' response, this time, was to advise me to lay aside my drawings 
of boa constrictors, whether from the inside or the outside, and devote myself 
instead to geography, history, arithmetic, and grammar. That is why, at the age 
of six, I gave up what might have been a magnificent career as a painter. 
I had been disheartened by the failure of my Drawing Number One and 
my Drawing Number Two. Grown-ups never understand anything by themselves, 
and it is tiresome for children to be always and forever explaining things 
to them.
"""

In [2]:
import re
import string

import spacy
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# 'en' is a shortcut name for the installed English model (spaCy 2.x style)
nlp = spacy.load('en')

In [3]:
stop_words = set([w.lower() for w in stopwords.words('english')])

**First let's use NLTK's built-in functions to tokenize and stem this text. <br>Convert the given text into an array of lowercase tokens using NLTK's <br>`word_tokenize` function and `PorterStemmer` class.**


**Source links:**<br>https://www.nltk.org/book/ch03.html<br>https://machinelearningmastery.com/clean-text-machine-learning-python/<br>

In [4]:
tokens = word_tokenize(text)

# remove punctuation from each token (pure-punctuation tokens become '')
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# stem each token
stemmer = PorterStemmer()
singles = [stemmer.stem(word) for word in stripped]

# keep alphabetic stems only, lowercased
words_lower = [t.lower() for t in singles if t.isalpha()]

# filter out stop words
words = [w for w in words_lower if w not in stop_words]
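
To see what the punctuation step in the cell above does to individual tokens, here is a minimal standard-library sketch. The token list `sample_tokens` is hand-written for illustration and is an assumption, not the actual output of `word_tokenize`.

```python
import string

# Hand-written example tokens (an assumption, not real word_tokenize output)
sample_tokens = ["grown-ups", ",", "said", ":", "Boa"]

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
sample_stripped = [t.translate(table) for t in sample_tokens]
print(sample_stripped)

# Pure-punctuation tokens are now empty strings; isalpha() drops them
sample_words = [t.lower() for t in sample_stripped if t.isalpha()]
print(sample_words)
```

Note that stripping punctuation also removes hyphens inside a word, so a hyphenated token ends up as one merged word rather than two.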

## Questions:
###  1. How many unique tokens are there in the text?


In [5]:
len(set(tokens))

170

###  2. How many unique stemmed tokens are in the text? 

In [6]:
len(set(singles))

146

### How many lowercase stemmed tokens are there (alphabetic only, counting repeats)?


In [7]:
len(words_lower)

305

And here is how many lowercase words are left after stemming and after removing punctuation and stop words:

In [8]:
len(words)

159

###  3. What are some examples of words that have surprising stemmed forms?

In [9]:
'''onc, wa, pictur, primev, anim, abl, someth, ani, eleph, thi, advis, anyth'''

'onc, wa, pictur, primev, anim, abl, someth, ani, eleph, thi, advis, anyth'

### Can you explain why?

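Before stemming, these words were complete forms of a dictionary word (or lemma). <br>The Porter stemmer does not look words up in a dictionary. It only applies suffix-stripping rules, so the result does not have to be a real word: `was` loses its final `s` and becomes `wa`, `once` loses its final `e` and becomes `onc`, and `any` becomes `ani`. <br>For some language processing tasks we want to ignore word endings and just deal with word stems, so this is acceptable as long as related forms map to the same stem.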
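
The idea of rule-based stripping without a dictionary check can be sketched in a few lines. The rule list `TOY_RULES` and the function `toy_stem` below are hypothetical names made up for this illustration. They are not the real Porter rule set, which has more steps and extra conditions, but they reproduce some of the surprising stems listed above.

```python
import re

# Toy suffix rules, tried in order. NOT the real Porter rule set.
TOY_RULES = [
    (r"ing$", ""),   # something -> someth
    (r"ant$", ""),   # elephant  -> eleph
    (r"al$", ""),    # animal    -> anim
    (r"y$", "i"),    # any       -> ani
    (r"s$", ""),     # was       -> wa
    (r"e$", ""),     # once      -> onc
]

def toy_stem(word):
    """Apply the first matching rule; the result is never checked against a dictionary."""
    for pattern, replacement in TOY_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

for w in ["was", "once", "animal", "any", "elephant", "something"]:
    print(w, "->", toy_stem(w))
```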

**Now let's try writing our own stemmer. Write a function which takes in a token and returns its stem, <br>by removing common English suffixes (e.g. remove the suffix -ed as in *listened* -> *listen*).<br> Handle at least four such suffixes in English. Then use this custom stemmer to convert the given text <br>to an array of lowercase stemmed tokens.**

In [10]:
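def stem(word):
    """Strip one common English suffix from `word`, if present."""
    # the non-greedy prefix yields the shortest stem whose remainder is a listed suffix
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem_part, suffix = re.findall(regexp, word)[0]
    return stem_part

# lowercase before the stop-word check, since stop_words is all lowercase
stemmed_2 = [stem(word.lower()) for word in stripped
             if word.isalpha() and word.lower() not in stop_words]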

## Questions:
###  4. What are some examples where your stemmer differs from the PorterStemmer on this text?


In [11]:
'''th, som, mak, elephant, able'''

'th, som, mak, elephant, able'

###  5. Can you explain why the differences occur?
 

The differences occur because the function does not handle all possible suffixes.<br>For example, the word `elephant` stays complete here, with its ending `ant`, unlike the output of the built-in stemmer. The words `making` and `thing` lose their ending `ing`, but<br> the word `able` still has its final `e`. The Porter stemmer also checks conditions before removing a suffix and sometimes repairs the stem afterwards, which a single regular expression does not do.<br>The mystery for me: why do we get the stem `som`? It shouldn't be here o_O
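
To check these cases one word at a time, the pattern from the custom stemmer can be applied directly. This is a small sketch; the helper name `split_suffix` is made up for this illustration and is not part of the exercise.

```python
import re

# Same pattern as in the custom stemmer above
SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def split_suffix(word):
    """Return (stem, suffix) as the pattern splits the word; suffix is '' if none matched."""
    match = SUFFIX_RE.match(word)
    return match.group(1), match.group(2) or ''

for w in ["making", "elephant", "able", "adventures"]:
    print(w, "->", split_suffix(w))
```

Because the first group is non-greedy, the pattern prefers the shortest stem that leaves a listed suffix, so `adventures` is split as `adventur` + `es` rather than `adventure` + `s`. Words such as `elephant` and `able` end in nothing from the list and come back unchanged.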

**Finally, we will use the library spaCy to lemmatize the text and compare <br>the output to the stemming performed above. First we load the default spaCy model for English (done above with `spacy.load`).**

**This contains spaCy's saved data about how to process English text. Now we will use this to lemmatize:**

## Question:
###  6. Lemmatize the text and output an array of lemmatized tokens. 

**Source links:**<br>https://stackoverflow.com/questions/55675788/how-can-i-lemmatize-list-of-lists-in-python-using-spacy<br>https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/<br>https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/<br>https://spacy.io/usage/spacy-101

In [12]:
lemmatized = nlp(text)
lemmatized



### How many unique lemmas are in the text? Hint: Use *nlp(text)* as a Python iterator.<br> Each item in the iterator has an attribute *.lemma_*.

In [13]:
lemmatized = [word.lemma_ for word in lemmatized]
len(set(lemmatized))

140

###  7. What is an example of a word which has different lemmatized and stemmed forms?

In [14]:
set(lemmatized)

{'\n',
 '"',
 "'",
 ',',
 '-',
 '-PRON-',
 '.',
 ':',
 '?',
 'Boa',
 'Drawing',
 'Elephant',
 'Hat',
 'Nature',
 'Number',
 'One',
 'Stories',
 'True',
 'a',
 'able',
 'about',
 'act',
 'adventure',
 'advise',
 'after',
 'age',
 'always',
 'an',
 'and',
 'animal',
 'another',
 'answer',
 'any',
 'anything',
 'arithmetic',
 'as',
 'aside',
 'ask',
 'at',
 'be',
 'boa',
 'book',
 'but',
 'by',
 'call',
 'career',
 'chew',
 'child',
 'clearly',
 'colored',
 'constrictor',
 'copy',
 'could',
 'deeply',
 'devote',
 'digest',
 'digestion',
 'dishearten',
 'draw',
 'drawing',
 'elephant',
 'explain',
 'failure',
 'first',
 'for',
 'forest',
 'forever',
 'frighten',
 'from',
 'geography',
 'give',
 'grammar',
 'grown',
 'hat',
 'have',
 'here',
 'history',
 'in',
 'inside',
 'instead',
 'jungle',
 'lay',
 'like',
 'look',
 'magnificent',
 'make',
 'masterpiece',
 'may',
 'month',
 'move',
 'need',
 'never',
 'not',
 'of',
 'old',
 'once',
 'one',
 'or',
 'outside',
 'over',
 'painter',
 'penci

In [15]:
'''able, once, animal, anything, advise, this, picture, primeval, something'''

'able, once, animal, anything, advise, this, picture, primeval, something'

### Why?

Because the two approaches work differently, and therefore each of them returns a different result.


**Stemming** algorithms work by cutting off the end or the beginning of the word,<br> taking into account a list of common prefixes and suffixes that can be found in an inflected word.<br> This indiscriminate cutting can be successful on some occasions, but not always, which is why <br>this approach has some limitations.

**Lemmatization**, on the other hand, takes the morphological analysis of the words into consideration.<br> To do so, it needs detailed dictionaries which the algorithm can look through <br>to link each form back to its lemma. The same example words show the difference: the Porter stemmer returns `pictur` and `onc`, while spaCy returns the dictionary forms `picture` and `once`.
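
The contrast between the two approaches can be sketched with a toy example. The names `TOY_LEMMAS`, `suffix_stem` and `toy_lemmatize` are hypothetical and made up for this illustration. The tiny lookup table only illustrates the dictionary idea and is not how spaCy is implemented.

```python
import re

# Hypothetical mini-dictionary mapping inflected forms to lemmas
TOY_LEMMAS = {"drew": "draw", "made": "make", "children": "child", "was": "be"}

def suffix_stem(word):
    """Stemming idea: cut off a known suffix, with no dictionary involved."""
    return re.sub(r"(ing|ed|s)$", "", word)

def toy_lemmatize(word):
    """Lemmatization idea: look the form up; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["drew", "made", "children", "was"]:
    print(w, "| stem:", suffix_stem(w), "| lemma:", toy_lemmatize(w))
```

Irregular forms such as `drew` or `children` carry no suffix that can be cut, so the suffix rule leaves them alone or damages them (`was` becomes `wa`), while the lookup maps each one to its dictionary form.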


