# Cleaning Text

- Tokenization
- Lemmatization
- Stemming
- Stop words


## Imports

In [1]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Techniques

### Tokenization

#### Manually

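I'm keeping this deliberately naive, just to review the general concept: splitting on a single character ignores abbreviations, decimals, and punctuation attached to words.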

In [2]:
# Paragraph -> Sentences
paragraph = "My name is Edy, I'm Data Scientist, and I like to play guitar. My girlfriend likes to watch TV Series."
sentences = [t for t in paragraph.split('.') if t != '']

print('Paragraphs to sentences:')
print(paragraph)
print(sentences)

Paragraphs to sentences:
My name is Edy, I'm Data Scientist, and I like to play guitar. My girlfriend likes to watch TV Series.
["My name is Edy, I'm Data Scientist, and I like to play guitar", ' My girlfriend likes to watch TV Series']

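The plain `split('.')` above drops the full stops and leaves a leading space on the second sentence. A slightly more careful sketch, using only the standard library, splits on the whitespace that follows sentence-ending punctuation. The pattern is my own assumption, not what NLTK does, and it is still naive: an abbreviation such as "Dr." would be treated as a sentence end.

```python
import re

paragraph = "My name is Edy, I'm Data Scientist, and I like to play guitar. My girlfriend likes to watch TV Series."

# Split on whitespace that comes right after '.', '!' or '?',
# so each sentence keeps its closing punctuation.
sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())

for sentence in sentences:
    print(repr(sentence))
```

Because the split happens on the whitespace rather than on the full stop, no empty strings are produced and nothing needs to be filtered out afterwards.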

In [3]:
# Paragraph to Words
# Split on single spaces and drop the empty strings left by repeated spaces
words = [w for w in paragraph.split(' ') if w != '']
print('Words:', ' | '.join(words))

Words: My | name | is | Edy, | I'm | Data | Scientist, | and | I | like | to | play | guitar. | My | girlfriend | likes | to | watch | TV | Series.
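The whitespace split leaves punctuation glued to the words ("Edy,", "guitar."). One possible cleanup, sketched here with only the standard library, strips punctuation from both ends of each token. This is an illustrative assumption about what "clean" means: it throws the punctuation away instead of keeping it as separate tokens.

```python
import string

paragraph = "My name is Edy, I'm Data Scientist, and I like to play guitar. My girlfriend likes to watch TV Series."

# split() with no argument collapses repeated whitespace,
# so no empty strings appear in the result.
raw_tokens = paragraph.split()

# Strip punctuation from both ends of each token; an apostrophe
# inside a token ("I'm") is not at an end, so it survives.
clean_words = [t.strip(string.punctuation) for t in raw_tokens]

# Drop tokens that consisted only of punctuation.
clean_words = [w for w in clean_words if w]

print('Words:', ' | '.join(clean_words))
```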


#### Using NLTK

Note how each tokenizer treats the apostrophe in "I'm" and the full stops.

In [4]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

In [5]:
# The apostrophe stays attached to the token that follows it ('m), and every full stop is isolated.
words = word_tokenize(paragraph)
print('Words:', ' | '.join(words))

Words: My | name | is | Edy | , | I | 'm | Data | Scientist | , | and | I | like | to | play | guitar | . | My | girlfriend | likes | to | watch | TV | Series | .


In [6]:
# The apostrophe becomes its own token, separated from the words on both sides
words = wordpunct_tokenize(paragraph)
print('Words:', ' | '.join(words))


Words: My | name | is | Edy | , | I | ' | m | Data | Scientist | , | and | I | like | to | play | guitar | . | My | girlfriend | likes | to | watch | TV | Series | .
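To see why the apostrophe ends up on its own, it helps to know that, as far as I can tell from the NLTK source, `wordpunct_tokenize` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`: runs of word characters, or runs of characters that are neither word characters nor whitespace. The sketch below applies that pattern with the standard library only; treat the equivalence as an assumption to check against your NLTK version.

```python
import re

paragraph = "My name is Edy, I'm Data Scientist, and I like to play guitar. My girlfriend likes to watch TV Series."

# \w+       -> a run of letters, digits or underscores
# [^\w\s]+  -> a run of punctuation (not a word character, not whitespace)
regex_tokens = re.findall(r"\w+|[^\w\s]+", paragraph)

print('Words:', ' | '.join(regex_tokens))
```

Since the apostrophe is not a word character, "I'm" splits into three tokens, which matches the output above.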


In [7]:
# A full stop inside the text stays attached to the previous word ('guitar.'); only the final one is split off
words = TreebankWordTokenizer().tokenize(paragraph)
print('Words:', ' | '.join(words))

Words: My | name | is | Edy | , | I | 'm | Data | Scientist | , | and | I | like | to | play | guitar. | My | girlfriend | likes | to | watch | TV | Series | .


### Stemming

In [8]:
from nltk.stem import PorterStemmer, SnowballStemmer

In [9]:
words = [
    'training', 'trained', 'trains', 'gain', 'gained', 'gaining',
    'doing', 'monstruous', 'better', 'bettering', 'bettered', 'betterer'
]

In [10]:
print('PorterStemming:\n')
porter = PorterStemmer()  # create the stemmer once and reuse it
for word in words:
    print(f'{word:<15} -> {porter.stem(word)}')

PorterStemming:

training        -> train
trained         -> train
trains          -> train
gain            -> gain
gained          -> gain
gaining         -> gain
doing           -> do
monstruous      -> monstruou
better          -> better
bettering       -> better
bettered        -> better
betterer        -> better


In [1]:
def element(x):
    return x[0]

def sort_list(t):
    return sorted(t, key=element)

print(sort_list([(2, 5), (1, 2), (4, 4), (2, 3), (2, 1)]))

[(1, 2), (2, 5), (2, 3), (2, 1), (4, 4)]


In [11]:
# Only 'monstruous' differs from the PorterStemmer output
print('SnowballStemming:\n')
snowball = SnowballStemmer('english')  # create the stemmer once and reuse it
for word in words:
    print(f'{word:<15} -> {snowball.stem(word)}')

SnowballStemming:

training        -> train
trained         -> train
trains          -> train
gain            -> gain
gained          -> gain
gaining         -> gain
doing           -> do
monstruous      -> monstruous
better          -> better
bettering       -> better
bettered        -> better
betterer        -> better


### Lemmatization

Lemmatization is similar to stemming, but instead of chopping off suffixes it maps each word to its dictionary form, called the "lemma", so the result is a valid word with the same meaning as the original.

A stemmer can return a non-word, as with "monstruous" becoming "monstruou" in the Porter output above, and an overly aggressive one could even reduce "better" to "bet", which changes the meaning. (The Porter and Snowball stemmers used here leave "better" unchanged.) A lemmatizer, on the other hand, reduces "running" to "run" and, when told that "better" is an adjective, reduces it to "good", which keeps the original meaning.
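The contrast can be sketched with a toy example that uses only the standard library. Everything in it is hypothetical: `toy_stem`, `toy_lemmatize`, and the tiny `LEMMAS` table are made up for illustration and are not how NLTK works internally. A real lemmatizer relies on a full dictionary such as WordNet.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
LEMMAS = {
    ('running', 'verb'): 'run',
    ('better', 'adjective'): 'good',
    ('mice', 'noun'): 'mouse',
}

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in ('ing', 'ed', 's'):
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the (word, part of speech) pair; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

for word, pos in [('running', 'verb'), ('better', 'adjective'), ('mice', 'noun')]:
    print(f'{word:<10} stem: {toy_stem(word):<8} lemma: {toy_lemmatize(word, pos)}')
```

The rule-based function knows nothing about the language, so it turns "running" into "runn" and cannot relate "mice" to "mouse". The lookup needs the part of speech to choose the right entry, which is why lemmatizers usually take a POS argument.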