## Getting Familiar with the NLTK Library

NLTK is a leading platform for building Python programs to work with human language data. 

In [1]:
import nltk

In [2]:
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

In [3]:
sentence

"At eight o'clock on Thursday morning\nArthur didn't feel very good."

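In [6]:
# word_tokenize() needs the 'punkt' data; if it is missing, run the download cell that follows first.
tokens = nltk.word_tokenize(sentence)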

In [47]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rakesh_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rakesh_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Rakesh_PC\AppData\Roaming\nltk_data...


True

In [7]:
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.


## Tokenizing

**By tokenizing, you can conveniently split up text by word or by sentence.** This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:

#### Tokenizing by word: 
Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often. For example, if you were analyzing a group of job ads, then you might find that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you’d need to look deeper to know more.
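
As a rough illustration of spotting frequent words, the sketch below tallies tokens with `collections.Counter`. The two job ads are invented for the example, and a simple regular expression stands in for `word_tokenize()` so the snippet runs without any NLTK data.

```python
import re
from collections import Counter

# Hypothetical job ads, invented for this illustration.
job_ads = [
    "Seeking Python developer with Python and SQL experience.",
    "Data analyst role: SQL required, Python preferred.",
]

# Crude word tokenizer: lowercase the text and keep runs of letters.
tokens = []
for ad in job_ads:
    tokens.extend(re.findall(r"[a-z]+", ad.lower()))

word_counts = Counter(tokens)
print(word_counts.most_common(2))  # [('python', 3), ('sql', 2)]
```

A high count only tells you that the word is common in the sample. It says nothing yet about how the word is being used.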

#### Tokenizing by sentence: 

When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
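
To make the idea of sentence-level context concrete, here is a small sketch that lists the words sharing a sentence with "python". The sentences are invented and assumed to be already split, as `sent_tokenize()` would return them. The helper name `context_words` is hypothetical.

```python
import re

# Invented sentences, already split as sent_tokenize() would return them.
sentences = [
    "We need a Python developer for our backend team.",
    "The office is close to the train station.",
    "Our python enclosure houses three adult snakes.",
]

def context_words(sentences, target):
    """Return, for each sentence containing target, the other words in it."""
    contexts = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if target in words:
            contexts.append([w for w in words if w != target])
    return contexts

for words in context_words(sentences, "python"):
    print(words)
```

Words like "developer" and "backend" point to software, while "enclosure" and "snakes" point to herpetology.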

Here’s how to import the relevant parts of NLTK so you can tokenize by word and by sentence:

In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [12]:
example_string = """Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""

In [13]:
example_string

"Muad'Dib learned rapidly because his first training was in how to learn.\nAnd the first lesson of all was the basic trust that he could learn.\nIt's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."

In [14]:
print(example_string)

Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult.


In [16]:
# You can use sent_tokenize() to split up example_string into sentences:
sentences = sent_tokenize(example_string)

In [17]:
sentences

["Muad'Dib learned rapidly because his first training was in how to learn.",
 'And the first lesson of all was the basic trust that he could learn.',
 "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]

In [18]:
type(sentences)

list

In [19]:
# Now try tokenizing example_string by word:
words = word_tokenize(example_string)

In [20]:
words

["Muad'Dib",
 'learned',
 'rapidly',
 'because',
 'his',
 'first',
 'training',
 'was',
 'in',
 'how',
 'to',
 'learn',
 '.',
 'And',
 'the',
 'first',
 'lesson',
 'of',
 'all',
 'was',
 'the',
 'basic',
 'trust',
 'that',
 'he',
 'could',
 'learn',
 '.',
 'It',
 "'s",
 'shocking',
 'to',
 'find',
 'how',
 'many',
 'people',
 'do',
 'not',
 'believe',
 'they',
 'can',
 'learn',
 ',',
 'and',
 'how',
 'many',
 'more',
 'believe',
 'learning',
 'to',
 'be',
 'difficult',
 '.']

You got a list of strings that NLTK considers to be words, such as:

- "Muad'Dib"
- 'training'
- 'how'

But the following strings were also considered to be words:

- "'s"
- ','
- '.'

See how "It's" was split at the apostrophe to give you 'It' and "'s", but "Muad'Dib" was left whole? This happened because NLTK knows that 'It' and "'s" (a contraction of “is”) are two distinct words, so it counted them separately. But "Muad'Dib" isn’t an accepted contraction like "It's", so it wasn’t read as two separate words and was left intact.
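
The split comes from Penn Treebank-style contraction rules rather than from a dictionary lookup. As a sketch, `TreebankWordTokenizer` (a rule-based tokenizer in `nltk.tokenize` that needs no downloaded data) shows the same behavior on these inputs. It is used here as an approximation of what `word_tokenize()` does, not as its exact implementation.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# "It's" ends in a recognized clitic ('s), so it is split in two.
# "Muad'Dib" has an apostrophe too, but no contraction rule matches it.
tokens = tokenizer.tokenize("It's shocking that Muad'Dib learned.")
print(tokens)
```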

### Filtering Stop Words
**Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it**. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [22]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rakesh_PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [24]:
worf_quote = "Sir, I protest. I am not a merry man!"

In [25]:
words_in_quote = word_tokenize(worf_quote)

In [26]:
words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

In [27]:
# Create a set of stop words to filter words_in_quote, using NLTK's English stop word list:
stop_words = set(stopwords.words("english"))

In [30]:
filtered_list = []

for i in words_in_quote:
    if i not in stop_words:
        filtered_list.append(i)

In [29]:
filtered_list

['Sir', ',', 'I', 'protest', '.', 'I', 'merry', 'man', '!']

In [31]:
filtered_words = []
for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_words.append(word)

You used `.casefold()` on `word` so you could ignore whether its letters were uppercase or lowercase. This is worth doing because `stopwords.words('english')` includes only lowercase versions of stop words.

In [32]:
# The same filter written as a list comprehension:
filtered_words = [
    word for word in words_in_quote if word.casefold() not in stop_words
]

In [33]:
filtered_words

['Sir', ',', 'protest', '.', 'merry', 'man', '!']
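
To see the effect of `.casefold()` side by side, the sketch below runs both filters on the same tokens. The stop word set is a tiny hand-picked stand-in for `stopwords.words("english")`, so the snippet runs without the stopwords corpus.

```python
# A tiny hand-picked stop word set, standing in for stopwords.words("english").
stop_words = {"i", "am", "not", "a"}

# Tokens as produced by word_tokenize() for the Worf quote.
words_in_quote = [
    "Sir", ",", "I", "protest", ".", "I", "am", "not", "a", "merry", "man", "!",
]

# Case-sensitive filter: "I" survives because only "i" is in the set.
case_sensitive = [w for w in words_in_quote if w not in stop_words]

# Case-insensitive filter: "I" is casefolded to "i" and removed.
case_insensitive = [w for w in words_in_quote if w.casefold() not in stop_words]

print(case_sensitive)    # ['Sir', ',', 'I', 'protest', '.', 'I', 'merry', 'man', '!']
print(case_insensitive)  # ['Sir', ',', 'protest', '.', 'merry', 'man', '!']
```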

### Stemming

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.

In [34]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [35]:
# you can create a stemmer with PorterStemmer():
stemmer = PorterStemmer()

In [36]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""

In [37]:
words = word_tokenize(string_for_stemming)

In [38]:
words

['The',
 'crew',
 'of',
 'the',
 'USS',
 'Discovery',
 'discovered',
 'many',
 'discoveries',
 '.',
 'Discovering',
 'is',
 'what',
 'explorers',
 'do',
 '.']

In [39]:
# Create a list of the stemmed versions of the words in words by using 
# stemmer.stem() in a list comprehension:

stemmed_words = [stemmer.stem(word) for word in words]

In [40]:
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

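Those results look a little inconsistent. Why would 'Discovery' give you 'discoveri' when 'Discovering' gives you 'discov'? The Porter stemmer strips suffixes according to fixed rules rather than looking words up in a dictionary, so related words can end up with different stems.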

Understemming and overstemming are two ways stemming can go wrong:

- **Understemming** happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
- **Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
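
Both failure modes can be reproduced with `PorterStemmer`, which needs no downloaded data. In this minimal sketch, the understemming pair comes from the output above, while the overstemming trio ("universe", "university", "universal") is a commonly cited example chosen for illustration. The helper name `stems` is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stems(words):
    """Map each word to its Porter stem."""
    return {word: stemmer.stem(word) for word in words}

# Understemming: related words that end up with different stems.
related = stems(["Discovery", "Discovering"])
print(related)

# Overstemming: words with different meanings that collapse to one stem.
unrelated = stems(["universe", "university", "universal"])
print(unrelated)
print(len(set(unrelated.values())) == 1)
```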

### Lemmatizing
Like stemming, **lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.**

Note: **A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.**

For example, if you were to look up the word “blending” in a dictionary, then you’d need to look at the entry for “blend,” but you would find “blending” listed in that entry.

In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when you lemmatize a word, you are reducing it to its lemma.
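
The relationship between a lemma and its lexeme can be pictured as a lookup table. The sketch below is a toy model with hand-written entries and a hypothetical helper, `toy_lemmatize`. It illustrates the concept only and is not how `WordNetLemmatizer` works internally.

```python
# Toy lexemes: each lemma maps to the word forms it represents.
# Hand-written for illustration; WordNetLemmatizer does not use this table.
lexemes = {
    "blend": {"blend", "blends", "blending", "blended"},
    "scarf": {"scarf", "scarves"},
}

# Invert the mapping so any word form can be looked up directly.
form_to_lemma = {
    form: lemma for lemma, forms in lexemes.items() for form in forms
}

def toy_lemmatize(word):
    """Return the lemma for a known word form, or the word unchanged."""
    return form_to_lemma.get(word, word)

print(toy_lemmatize("blending"))  # blend
print(toy_lemmatize("scarves"))   # scarf
print(toy_lemmatize("DeSoto"))    # DeSoto
```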

Here’s how to import the relevant parts of NLTK in order to start lemmatizing:

In [41]:
from nltk.stem import WordNetLemmatizer

In [42]:
lemmatizer = WordNetLemmatizer()

In [48]:
lemmatizer.lemmatize("scarves")
# 'scarf'

'scarf'

In [49]:
string_for_lemmatizing = "The friends of DeSoto love scarves."

In [50]:
words = word_tokenize(string_for_lemmatizing)

In [51]:
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

In [52]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

In [53]:
lemmatized_words

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

In [54]:
lemmatizer.lemmatize("worst")
# 'worst'

'worst'

You got the result 'worst' because `lemmatizer.lemmatize()` assumed that "worst" was a noun. You can make it clear that you want "worst" to be treated as an adjective by passing `pos="a"`:

In [55]:
lemmatizer.lemmatize("worst", pos="a")
# 'bad'

'bad'