# **Tokenization using spaCy**


In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')
example1 = nlp("This is an example of tokenization")
for token in example1:
    print(token.text)

This
is
an
example
of
tokenization


In [4]:
example2 = nlp("The quick brown fox jumped over the lazy dog")
for token in example2:
    print(token.text)

The
quick
brown
fox
jumped
over
the
lazy
dog


In [6]:
example3 = nlp("We're the champions")
for token in example3:
    print(token.text)

We
're
the
champions


# **Stemming words using NLTK**



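1.   Tokenization produces a list of words
2.   Stemming reduces each word to a common base form (its stem), so that variants such as "cats" and "cat" are treated as the same term
3.   PorterStemmer is the most commonly used stemmer
     * Its output is not always a real word (for example, "was" becomes "wa")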
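
To see why a stem is not always a readable word, here is a minimal sketch of suffix stripping. The function `toy_stem` is a hypothetical helper written for this illustration. It is not the Porter algorithm, which applies many more rules, but it applies the same kind of mechanical suffix removal.

```python
def toy_stem(word):
    """Hypothetical suffix stripper for illustration; NOT the Porter algorithm."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Collapse a doubled final consonant: "runn" -> "run"
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        # Blindly dropping a final "s" turns "was" into "wa"
        word = word[:-1]
    return word

stems = [toy_stem(w) for w in "Cats running was".split()]
print(" ".join(stems))  # cat run wa
```

The rule that correctly turns "cats" into "cat" also turns "was" into "wa", because the stemmer works on spelling alone and knows nothing about the word's meaning.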





In [7]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [11]:
example = "Cats running was"
example = [stemmer.stem(token) for token in example.split(" ")]
print(' '.join(example))

cat run wa


# **Lemmatization using spaCy**



1.   Finds the dictionary form of a word (its lemma), not just a truncated stem
2.   Uses the word's part of speech and meaning in context to pick the correct lemma
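
The role of part of speech can be sketched with a toy lookup table. The table `TOY_LEMMAS` and the helper `toy_lemma` are hypothetical names invented for this illustration. A real lemmatizer such as spaCy's relies on a trained tagger together with rules and much larger tables.

```python
# Hypothetical lookup table keyed by (word, part of speech).
# This is an assumption for illustration, not how spaCy stores its data.
TOY_LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("is", "AUX"): "be",
    ("are", "AUX"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemma("Animals", "NOUN"))  # animal
print(toy_lemma("meeting", "VERB"))  # meet
print(toy_lemma("meeting", "NOUN"))  # meeting
```

The same spelling ("meeting") maps to different lemmas depending on its part of speech, which is information a stemmer never looks at.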



In [13]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [16]:
example4 = nlp("Animals")
for token in example4:
    print(token.lemma_)

animal


In [17]:
example41 = nlp("is am are")
for token in example41:
    print(token.lemma_)

be
am
be


# **Vectorization using scikit-learn**

**It is the process of turning a document into a numerical vector**
1.  The most basic approach is the 'bag of words' algorithm
    * First, define a fixed-length vocabulary <br />
    Example: ['I', 'am', 'you', 'are', 'john', 'jack']

    * Map each word to an index in this vocabulary <br />
    ['I' => 1, 'am' => 2, 'you' => 3, 'are' => 4, 'john' => 5, 'jack' => 6]

    * Using this index, construct a vector with one entry per vocabulary word: the entry is '*1*' if the word appears in the document (ignoring case), else '*0*' <br />
    Input: "I am John" => [1, 1, 0, 0, 1, 0] <br />
    Input: "You are jack" => [0, 0, 1, 1, 0, 1] <br />
    Input: "I am jack" => [1, 1, 0, 0, 0, 1]
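
The three steps above can be sketched in plain Python. The function name `binary_bow` is a hypothetical helper for this illustration. The sketch assumes whitespace tokenization and lowercasing, and it uses 0-based positions where the list above counts from 1.

```python
# Step 1: a fixed-length vocabulary (lowercased for case-insensitive matching)
vocabulary = ["i", "am", "you", "are", "john", "jack"]

# Step 2: map each word to its position in the vocabulary (0-based here)
index = {word: position for position, word in enumerate(vocabulary)}

def binary_bow(document):
    """Step 3: return a 0/1 vector marking which vocabulary words occur."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in index:  # words outside the vocabulary are ignored
            vector[index[word]] = 1
    return vector

print(binary_bow("I am John"))     # [1, 1, 0, 0, 1, 0]
print(binary_bow("You are jack"))  # [0, 0, 1, 1, 0, 1]
print(binary_bow("I am jack"))     # [1, 1, 0, 0, 0, 1]
```

Every document maps to a vector of the same length, regardless of how many words it contains, and word order is discarded.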

In [20]:
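from sklearn.feature_extraction.text import CountVectorizer
# binary=True records presence (1) or absence (0) rather than word counts.
# The token pattern keeps single-letter words such as "I" and excludes digits.
vectorizer = CountVectorizer(binary=True, token_pattern=r'\b[^\d\W]+\b')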

In [32]:
corpus = ["The dog is on the table", "the cats now are on the table"]
vectorizer.fit(corpus)
print(vectorizer.transform(["The dog is on the table"]).toarray())

[[0 0 1 1 1 1 1]]


In [33]:
vocab = vectorizer.vocabulary_
for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

are: 0
cats: 1
dog: 2
is: 3
on: 4
table: 5
the: 6
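
To make the effect of `binary=True` concrete, here is a plain-Python sketch that compares counts with presence flags for the same sentence. It does not call scikit-learn. It assumes the sorted vocabulary printed above, plus lowercasing and whitespace splitting.

```python
from collections import Counter

# Vocabulary in the index order printed above
vocab = ["are", "cats", "dog", "is", "on", "table", "the"]

tokens = "The dog is on the table".lower().split()
counts = Counter(tokens)

# Count vector: how many times each vocabulary word occurs
count_vector = [counts[word] for word in vocab]
# Binary vector: 1 if the word occurs at all, else 0
binary_vector = [1 if counts[word] else 0 for word in vocab]

print(count_vector)   # [0, 0, 1, 1, 1, 1, 2]
print(binary_vector)  # [0, 0, 1, 1, 1, 1, 1]
```

"the" occurs twice in the sentence, so the two vectors differ only in the last position. The binary vector matches the array printed by the vectorizer above.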


In [34]:
corpus2 = ["I am jack", "You are jjohn", "I am john"]
vectorizer.fit(corpus2)
print(vectorizer.transform(corpus2).toarray())

[[1 0 1 1 0 0 0]
 [0 1 0 0 1 0 1]
 [1 0 1 0 0 1 0]]


In [35]:
vocab = vectorizer.vocabulary_
for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

am: 0
are: 1
i: 2
jack: 3
jjohn: 4
john: 5
you: 6


# **Word Embedding**