## Tokenization:

* Tokenization is a fundamental step in Natural Language Processing (NLP): it divides a text input into smaller units called tokens.
* Tokens can be words, characters, sub-words, or sentences, depending on the task.

```python
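import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize relies on the Punkt sentence models; download them once.
# Depending on the NLTK version, the resource is named "punkt" or "punkt_tab".
nltk.download("punkt")
nltk.download("punkt_tab")

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
print(sent_tokenize(text))

# Output:
# ['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article.']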
```

* To learn more about tokenization and its different types, see: https://www.geeksforgeeks.org/nlp/nlp-how-tokenizing-text-sentence-words-works/
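
The token granularities listed above can be illustrated without any NLTK data files. The sketch below uses only Python's standard library; the regular expressions are simplifying assumptions chosen for this example (they are not NLTK's tokenization rules) and will mishandle abbreviations such as "Dr.".

```python
import re

text = "Hello everyone. Welcome to GeeksforGeeks."

# Sentence tokens: split after ., ! or ? when followed by whitespace (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

# Character tokens: every character of one word
characters = list("NLP")

print(sentences)
print(words)
print(characters)
```

Each call returns a plain Python list, so the three results can be compared directly to see how the same text yields very different token counts at each level.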

## Stemming:

* Stemming is a text-processing method that strips affixes (in practice mostly suffixes) from words to reduce them to a base or root form. The goal is to standardize word variants so that downstream NLP tasks treat them as the same term.
* Stemming is carried out by algorithms called stemmers. Ideally "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve", although real stemmers often return a truncated stem that is not a dictionary word.
* This kind of text normalization is useful in tasks such as text classification, information retrieval, and text summarization.
* Stemming also has drawbacks: stems can be hard to read, and the algorithm sometimes picks the wrong root for a word.
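
To make the idea of suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy sketch, not the Porter algorithm: the suffix list, the function name `toy_stem`, and the minimum stem length of three characters are all assumptions made for this illustration.

```python
# Suffixes tried in order, longest first (hypothetical list for illustration)
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "jumped", "jumping", "bosses", "is"]:
    print(w, "->", toy_stem(w))
```

The toy rules already show the drawbacks mentioned above: `toy_stem("running")` returns `"runn"`, which is why real stemmers add further rules on top of plain suffix removal.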

**Implementation of the Porter stemmer (one stemming algorithm)**:

```python
from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)

# Output:
# Original words: ['running', 'jumps', 'happily', 'running', 'happily']
# Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']
```

* To learn more about stemming and its different types, see: https://www.geeksforgeeks.org/machine-learning/introduction-to-stemming/
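
One common use of stemming, for example in information retrieval, is to group word variants under a shared key. The sketch below does this with NLTK's `PorterStemmer`, which needs no extra data download; the helper name `group_by_stem` is introduced here for illustration and is not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

def group_by_stem(words):
    """Map each Porter stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)

groups = group_by_stem(["walking", "walked", "walks", "running", "jumps"])
for stem, variants in groups.items():
    print(stem, "<-", variants)
```

Regular inflections end up under one key, but irregular forms do not: the Porter stemmer leaves "ran" as "ran", so it would not be grouped with "running".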

## Lemmatization:

* Lemmatization is a text pre-processing technique widely used in natural language processing (NLP) and machine learning.
* Like stemming, it reduces a word to a base form, but it uses a vocabulary and the word's part of speech to return the dictionary form (the lemma), so different inflections of the same word map to one entry.
* Lemmatization is often preferred over stemming when accuracy matters, because it performs morphological analysis of the word instead of only trimming its ending.
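
The role of vocabulary and part of speech can be sketched with a plain dictionary lookup. This is a toy model, not how NLTK's `WordNetLemmatizer` is implemented: the table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical names, and the handful of entries is hand-written for the example.

```python
# (word, part of speech) -> lemma; "n" = noun, "v" = verb, "a" = adjective
LEMMA_TABLE = {
    ("mice", "n"): "mouse",
    ("was", "v"): "be",
    ("better", "a"): "good",
    ("walking", "v"): "walk",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))          # looked up as a noun
print(toy_lemmatize("walking"))       # no noun entry, so returned unchanged
print(toy_lemmatize("walking", "v"))  # verb entry found
```

The same word gives different results depending on the part of speech supplied, which is why a lemmatizer benefits from a correct tag for each word.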

**Implementation of lemmatization**:
```python
import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; download it once.
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# Without a pos argument, words are treated as nouns
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" marks the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))

# Output:
# rocks : rock
# corpora : corpus
# better : good
```
* To learn more about lemmatization and its different types, see: https://www.geeksforgeeks.org/python/python-lemmatization-with-nltk/

In [1]:
import nltk

In [2]:
from nltk.stem import PorterStemmer

In [3]:
porter = PorterStemmer()

In [4]:
porter.stem("Walking")

'walk'

In [5]:
porter.stem("Walked")

'walk'

In [7]:
porter.stem("walks")

'walk'

In [9]:
porter.stem("ran")

'ran'

In [10]:
porter.stem("running")

'run'

In [11]:
porter.stem("bosses")

'boss'

In [12]:
porter.stem("replacement")

'replac'

In [13]:
porter.stem("happily")

'happili'

In [15]:
sentence = "Lemmatization is more sophisticated than stemming".split()
sentence

['Lemmatization', 'is', 'more', 'sophisticated', 'than', 'stemming']

In [21]:
for token in sentence:
    print(porter.stem(token), end=" ")

lemmat is more sophist than stem 

In [22]:
for token in sentence:
    print(porter.stem(token), end="/")

lemmat/is/more/sophist/than/stem/

In [18]:
for token in sentence:
    print(porter.stem(token))

lemmat
is
more
sophist
than
stem


In [23]:
porter.stem("unnecessary")

'unnecessari'

In [24]:
porter.stem("berry")

'berri'

In [25]:
from nltk.stem import WordNetLemmatizer

In [26]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to C:\Users\Deepam
[nltk_data]     Shah\AppData\Roaming\nltk_data...


True

### **WordNet**:
> WordNet is a publicly available lexical database that records semantic relationships between words and groups synonyms into sets called synsets. It was originally built for English, and wordnets now exist for over 200 languages. It is available in Python through the NLTK library, and WordNet-based lemmatization is one of the earliest and most commonly used lemmatization techniques.

In [27]:
from nltk.corpus import wordnet

In [28]:
lemmatizer = WordNetLemmatizer()

In [43]:
lemmatizer.lemmatize("walking")
# The lemmatizer needs a part of speech (pos) for each word. If none is given,
# it treats the word as a noun. Here "walking" is really a verb, so with the
# default noun pos the lemmatizer returns the word unchanged.

'walking'

In [32]:
lemmatizer.lemmatize("walking", pos=wordnet.VERB)

'walk'

In [34]:
lemmatizer.lemmatize("going")

'going'

In [36]:
lemmatizer.lemmatize("going", pos=wordnet.VERB)

'go'

In [37]:
lemmatizer.lemmatize("ran", pos=wordnet.VERB)

'run'

In [38]:
porter.stem("mice")

'mice'

In [41]:
lemmatizer.lemmatize("mice")

'mouse'

In [44]:
porter.stem("was")

'wa'

In [46]:
lemmatizer.lemmatize("was", pos=wordnet.VERB)

'be'

In [47]:
porter.stem("is")

'is'

In [48]:
lemmatizer.lemmatize("is", pos=wordnet.VERB)

'be'

In [49]:
porter.stem("better")

'better'

In [50]:
lemmatizer.lemmatize("better", pos=wordnet.ADJ)

'good'

In [51]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else: 
        return wordnet.NOUN
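
The same mapping can also be written as a lookup table. This variant is a sketch that relies on WordNet's part-of-speech constants being the single-character strings `"a"`, `"v"`, `"n"` and `"r"` (the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`), so it runs without loading the corpus. The names `TAG_PREFIX_TO_POS` and `treebank_to_wordnet` are made up for this example.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(treebank_tag, default="n"):
    """Translate a Treebank tag such as 'VBD' into a WordNet pos code."""
    return TAG_PREFIX_TO_POS.get(treebank_tag[:1], default)

for tag in ["JJ", "VBD", "NNP", "RB", "DT"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as `DT`, fall back to the noun code, matching the `else` branch of `get_wordnet_pos`.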

In [52]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Deepam Shah\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [59]:
sentence = "Donald Trump has a devoted following".split()

In [61]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Deepam Shah\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [62]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [63]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=" ")

Donald Trump have a devote following 

In [64]:
sentence = "The cat was following the bird as it flew by".split()

In [65]:
words_and_tags = nltk.pos_tag(sentence)  # returns a list of (word, tag) tuples, one per token
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [66]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=" ")

The cat be follow the bird a it fly by 
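
In the output above, "as" turned into "a": its tag `IN` has no WordNet equivalent, so `get_wordnet_pos` fell back to noun and the lemmatizer apparently treated the word as a plural. One possible refinement, sketched below, is to leave such words untouched. The function `lemmatize_tagged` and the stand-in `fake_lemmatize` are hypothetical names for this example; in the notebook you would pass `lemmatizer.lemmatize` instead of the stand-in.

```python
# Treebank tag prefixes that have a WordNet part of speech
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def lemmatize_tagged(words_and_tags, lemmatize_fn):
    """Lemmatize only words whose tag maps to a WordNet pos; keep the rest."""
    result = []
    for word, tag in words_and_tags:
        pos = TAG_PREFIX_TO_POS.get(tag[:1])
        result.append(lemmatize_fn(word, pos=pos) if pos else word)
    return result

# Stand-in for lemmatizer.lemmatize so the sketch runs without WordNet data.
# The ("as", "n") entry mimics the unwanted behaviour seen in the notebook.
def fake_lemmatize(word, pos="n"):
    known = {("was", "v"): "be", ("following", "v"): "follow",
             ("flew", "v"): "fly", ("as", "n"): "a"}
    return known.get((word, pos), word)

tagged = [("The", "DT"), ("cat", "NN"), ("was", "VBD"), ("following", "VBG"),
          ("the", "DT"), ("bird", "NN"), ("as", "IN"), ("it", "PRP"),
          ("flew", "VBD"), ("by", "IN")]
print(" ".join(lemmatize_tagged(tagged, fake_lemmatize)))
```

Because words tagged `IN`, `DT` or `PRP` never reach the lemmatizer, "as" stays "as" while the verbs are still reduced to their lemmas.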