## Tokenizing

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [8]:
example_text = "Hello there, how are you doing today? The weather is great and Python is awesome. Sentdex is crazy."

In [9]:
print(sent_tokenize(example_text))

['Hello there, how are you doing today?', 'The weather is great and Python is awesome.', 'Sentdex is crazy.']


In [10]:
print(word_tokenize(example_text))

['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'Sentdex', 'is', 'crazy', '.']


## Stemming

Stemming is the process of stripping a word down to its root. It is a useful normalization technique: for example, writing, written, and write all share the root write. Let's say we are given two sentences:

1. I am riding in the car.
2. I am taking a ride in the car.

Both of the above sentences have the same meaning. However, one uses riding and the other uses ride. Stemming reduces both forms to a common root so that they can be matched.
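
To make the idea concrete, here is a deliberately naive suffix stripper. It is only an illustration, not the Porter algorithm used later in this notebook, and the function name `toy_stem` and its suffix list are assumptions made up for this sketch.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    A toy illustration only: real stemmers such as Porter apply many more rules.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# "riding", "rides" and "ride" all collapse to the same stem.
for w in ["riding", "rides", "ride"]:
    print(w, "->", toy_stem(w))
```

Note that the shared stem here is "rid", which is not a real word. That is typical of stemmers: the goal is a consistent key for matching, not a dictionary entry.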

Stemming is good to know about, but it isn't essential going forward because there are better alternatives, such as lemmatizing with WordNet.

In [14]:
from nltk.stem import PorterStemmer  # other stemmers can be used as well
from nltk.corpus import stopwords

In [15]:
ps = PorterStemmer()
example_words = ["python","pyhtoner","pythoning","pythoned","pythononly","apythons"]
for w in example_words:
    print(ps.stem(w))

python
pyhton
python
python
pythononli
apython


In [17]:
new_sentence = "The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) are working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content)."
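# Split the passage into word and punctuation tokens
words = word_tokenize(new_sentence)
stop_words = set(stopwords.words('english'))
# Drop stop words before stemming (note: this comparison is case-sensitive)
filtered_sentence = [w for w in words if w not in stop_words]
for word in filtered_sentence:
    print(ps.stem(word))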

the
convers
AI
team
,
research
initi
found
jigsaw
googl
(
part
alphabet
)
work
tool
help
improv
onlin
convers
.
one
area
focu
studi
neg
onlin
behavior
,
like
toxic
comment
(
i.e
.
comment
rude
,
disrespect
otherwis
like
make
someon
leav
discuss
)
.
So
far
’
built
rang
publicli
avail
model
serv
perspect
api
,
includ
toxic
.
but
current
model
still
make
error
,
’
allow
user
select
type
toxic
’
interest
find
(
e.g
.
platform
may
fine
profan
,
type
toxic
content
)
.
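
In the output above, `the`, `So`, and `but` survive the filter because the capitalized tokens are compared against NLTK's lowercase stop word list. A minimal sketch of a case-insensitive filter follows; the small `STOP_WORDS` set and the helper name `remove_stop_words` are assumptions for illustration, standing in for `stopwords.words('english')`.

```python
# Tiny hand-picked set; a hypothetical stand-in for stopwords.words('english').
STOP_WORDS = {"the", "is", "and", "but", "so", "a", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "weather", "is", "great", "But", "so", "cold"]
print(remove_stop_words(tokens))  # ['weather', 'great', 'cold']
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain, which can matter for later steps such as named entity recognition.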


## Part of Speech Tagging

## Chunking/Chinking

Generally, the subject of a sentence is a named entity or a noun. A sentence can contain many nouns, and the words surrounding them, called modifiers, tell us something about the noun in question.

One of the main goals of chunking is to group words into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that relate to them. Chinking is the complementary step: it removes selected tokens from a chunk that has already been formed.
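
The grouping idea can be sketched without NLTK by matching a pattern over part-of-speech tags. The example below is a simplified illustration under stated assumptions: the tokens are tagged by hand with Penn Treebank style tags, a noun phrase is taken to be an optional determiner, any number of adjectives, and one or more nouns, and the helper name `chunk_noun_phrases` is made up for this sketch.

```python
import re

# Map each tag to a single letter so a regex can scan the tag sequence.
TAG_CODES = {"DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}


def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into simple noun phrases: DT? JJ* NN+."""
    encoded = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in re.finditer(r"D?J*N+", encoded)
    ]


# Hand-tagged example sentence (tags are assumed, not produced by a tagger).
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dogs", "NNS"),
]
print(chunk_noun_phrases(tagged))
# [['The', 'quick', 'fox'], ['the', 'lazy', 'dogs']]
```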

More here: https://pythonprogramming.net/chunking-nltk-tutorial/

## Lemmatizing

Lemmatizing is similar to stemming, except that the result is a complete, valid word (the lemma) rather than just a truncated root.

You might end up with a very different word, but it will have the same meaning as the original word. For example, geese becomes goose.
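
The difference from stemming can be illustrated with a toy lookup-based lemmatizer. This is only a sketch: the `IRREGULAR` table is a hypothetical stand-in for a real lexicon such as WordNet, and `toy_lemmatize` is a name invented for this example.

```python
# Hypothetical mini-lexicon of irregular forms; a real lemmatizer uses WordNet.
IRREGULAR = {"geese": "goose", "mice": "mouse", "better": "good"}


def toy_lemmatize(word):
    """Return a dictionary form: look up irregulars, else strip a plural 's'."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # Leave words like "cactus" or "glass" alone; they are not plurals.
    if lower.endswith("s") and not lower.endswith(("ss", "us")) and len(lower) > 3:
        return lower[:-1]
    return lower


for w in ["cats", "cactus", "geese", "rock", "reductions"]:
    print(toy_lemmatize(w))
```

Because irregular forms come from a lookup rather than from suffix stripping, the result is always a real word, which is the property that distinguishes a lemma from a stem.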
  

In [10]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
lemmatizer = WordNetLemmatizer()

In [25]:
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cactus"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rock"))
print(lemmatizer.lemmatize("reductions"))

cat
cactus
goose
rock
reduction


In [30]:
# Using it on a longer passage

text = "Photoshare Demo Signup (Monday, Week 2!) Hey everybody,I have linked one signup poll for the Photoshare demos for Monday (12 03). They are 15 minutes each to hopefully reductions the number of demos that run overtime. Be prepared to gsubmit your assignment before you demo (make sure to name your second gsubmission something different than the first). There will be additional demos later in the week, so don't worry if you haven't demoed and can't signup today. However, the vast majority of your project grade will come from the demos so make sure that you do sign up for one when available."
tokenized_list = word_tokenize(text)

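for token in tokenized_list:
    # word_tokenize splits contractions ("don't" -> "do", "n't"); expand the negation
    if token == "n't":
        token = "not"
    # lemmatize() assumes a noun by default, so verbs such as "linked" pass through unchanged
    print(lemmatizer.lemmatize(token))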

Photoshare
Demo
Signup
(
Monday
,
Week
2
!
)
Hey
everybody
,
I
have
linked
one
signup
poll
for
the
Photoshare
demo
for
Monday
(
12
03
)
.
They
are
15
minute
each
to
hopefully
reduction
the
number
of
demo
that
run
overtime
.
Be
prepared
to
gsubmit
your
assignment
before
you
demo
(
make
sure
to
name
your
second
gsubmission
something
different
than
the
first
)
.
There
will
be
additional
demo
later
in
the
week
,
so
do
not
worry
if
you
have
not
demoed
and
ca
not
signup
today
.
However
,
the
vast
majority
of
your
project
grade
will
come
from
the
demo
so
make
sure
that
you
do
sign
up
for
one
when
available
.
