
**Stemming**

Stemming is a process in natural language processing (NLP) that reduces words to a root form, called the stem, by stripping suffixes. For example, "running" and "runs" are both reduced to the stem "run." Irregular forms such as "ran" are left unchanged by the Porter stemmer, and a stem need not be a real word ("easily" becomes "easili"). Stemming is useful for tasks like search and text analysis because it lets you treat different forms of the same word as equivalent.


In [None]:
words = ["running", "runs", "ran", "easily", "fairly", "beautiful"]


In [None]:
import nltk

In [None]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [None]:
for word in words:
  print(word + "---->" + stemming.stem(word))


running---->run
runs---->run
ran---->ran
easily---->easili
fairly---->fairli
beautiful---->beauti


**PorterStemmer**

In [None]:
# Words where Porter stemming does not work well
words = ["generous","generation","generously","generate"]
stemming = PorterStemmer()
for word in words:
  print(word + "---->" + stemming.stem(word))


generous---->gener
generation---->gener
generously---->gener
generate---->gener


Stemming can also collapse words with different meanings into the same stem, a problem known as over-stemming. For example, "generous," "generation," "generously," and "generate" all reduce to "gener," so after stemming they are indistinguishable. This is a disadvantage when the task depends on the exact meaning of each word.
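One way to spot such collisions is to group words by the stem they produce. The sketch below assumes NLTK is installed; `group_by_stem` is a hypothetical helper written for illustration, not an NLTK function.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the list of original words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(
    ["generous", "generation", "generously", "generate"], PorterStemmer()
)
for stem, originals in groups.items():
    # A stem with several unrelated originals is a sign of over-stemming.
    print(stem, "<----", originals)
```

Any stem that maps to words with unrelated meanings is a candidate for a different stemmer or for lemmatization.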

**Advantages of Stemming**

Despite some disadvantages, stemming is useful in tasks like text classification and sentiment analysis. It helps to:

*   Reduce the vocabulary size, making models simpler and faster.
*   Group together words with similar meanings, which can improve the accuracy of models that rely on word patterns.
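The vocabulary reduction can be seen on a small token list. This is a minimal sketch, assuming NLTK is installed; the token list is made up for illustration.

```python
from nltk.stem import PorterStemmer

# Illustrative tokens: several inflected forms of two base words.
tokens = [
    "connect", "connected", "connecting", "connection", "connections",
    "run", "running", "runs",
]

stemmer = PorterStemmer()
vocab = set(tokens)
stemmed_vocab = {stemmer.stem(token) for token in tokens}

print("Vocabulary size before stemming:", len(vocab))
print("Vocabulary size after stemming: ", len(stemmed_vocab))
print("Stems:", sorted(stemmed_vocab))
```

A model built on the stemmed vocabulary has fewer distinct features to learn, which is where the speed and simplicity gains come from.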

**RegexpStemmer Class**

The `RegexpStemmer` in NLTK is a stemming tool that uses regular expressions to remove suffixes from words. Unlike rule-based stemmers like the Porter or Snowball stemmers, which apply a predefined set of rules in a specific order, the `RegexpStemmer` allows you to define your own regular expressions to specify which suffixes to remove. This provides more flexibility and control over the stemming process, making it suitable for specific use cases where custom stemming rules are required.

In [None]:
from nltk import RegexpStemmer
stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [None]:
stemmer.stem('eating')

'eat'
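The behaviour of a regular-expression stemmer can be approximated in plain Python with the `re` module. This is a rough sketch of the idea, not NLTK's code; `regexp_stem` and its parameter names are hypothetical.

```python
import re


def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Remove any match of `pattern` from `word`, leaving short words alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


for word in ["eating", "cars", "readable", "is", "this"]:
    print(word + "---->" + regexp_stem(word))
```

Because the rules are purely textual, a word like "this" loses its final "s" even though it is not a plural, so the pattern and minimum length need to be chosen with the target vocabulary in mind.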

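**Snowball Stemmer**

NLTK also provides `SnowballStemmer`, which takes the language as an argument. Here it is applied to the same `words` list that the PorterStemmer reduced to "gener" above.

In [None]: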
from nltk import SnowballStemmer
snowball_stemmer=SnowballStemmer('english')

In [None]:
for word in words:
  print(word + "---->"+snowball_stemmer.stem(word))

generous---->generous
generation---->generat
generously---->generous
generate---->generat


In [None]:
stemming.stem('fairly'), stemming.stem('generously')

('fairli', 'gener')

In [None]:
snowball_stemmer.stem('fairly'),snowball_stemmer.stem('generously')

('fair', 'generous')

**Comparison of Porter and Snowball Stemmers**

As the outputs above show, the Porter and Snowball stemmers can produce different results for the same words, because they apply different suffix-removal rules. The Snowball stemmer (also called Porter2) is a revision of the Porter stemmer. In these examples it is the more conservative of the two: it keeps "generous" and "generously" distinct from "generation" and "generate," and it turns "fairly" into "fair" rather than "fairli." The choice of stemming algorithm therefore affects the final output and should be made based on the requirements of the NLP task.
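A side-by-side table makes the differences easier to scan. The sketch below assumes NLTK is installed and reuses words from the cells above; the column layout is only one possible presentation.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words_to_compare = ["fairly", "generously", "generous", "generation"]

# Each row holds the original word, its Porter stem, and its Snowball stem.
comparison = [
    (word, porter.stem(word), snowball.stem(word)) for word in words_to_compare
]

print(f"{'word':<12}{'porter':<12}{'snowball':<12}")
for word, porter_stem, snowball_stem in comparison:
    print(f"{word:<12}{porter_stem:<12}{snowball_stem:<12}")
```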

**Lemmatization**

Lemmatization is another NLP technique that reduces words to their base or dictionary form, known as the lemma. Unlike stemming, which mostly chops off suffixes, lemmatization looks words up in a vocabulary (here, WordNet) and uses their part of speech to return a valid word. For example, when treated as verbs, "running," "runs," and "ran" all reduce to the lemma "run," including the irregular form "ran" that the Porter stemmer left unchanged. Likewise, lemmatization maps "better" to "good" when "better" is treated as an adjective, a link that a suffix-stripping stemmer cannot make. This makes lemmatization more accurate than stemming at preserving the meaning of words, at the cost of needing a dictionary and part-of-speech information.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Example words
words_for_lemmatization = ["running", "runs", "ran", "better", "goodness"]

# Lemmatize the words
print("Lemmatization examples:")
for word in words_for_lemmatization:
    print(f"{word} ----> {lemmatizer.lemmatize(word)}")

Lemmatization examples:
running ----> running
runs ----> run
ran ----> ran
better ----> better
goodness ----> goodness


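Note that "running," "ran," and "better" come back unchanged. The lemmatizer needs to know the part of speech to give the correct lemma, and by default it assumes every word is a noun. With `pos='v'`, "running" and "ran" both reduce to "run"; with `pos='a'`, "better" reduces to "good." "goodness" is already a noun in its base form, so it stays the same.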

In [None]:
# Lemmatize with part of speech
print("\nLemmatization with part of speech:")
print(f"better (adjective) ----> {lemmatizer.lemmatize('better', pos='a')}")
print(f"goodness (noun) ----> {lemmatizer.lemmatize('goodness', pos='n')}")


Lemmatization with part of speech:
better (adjective) ----> good
goodness (noun) ----> goodness


**Parts of speech**

Parts of speech are categories of words based on their grammatical function and meaning within a sentence. Common parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. Identifying the correct part of speech is crucial for accurate lemmatization, as the lemma of a word can vary depending on its grammatical role.
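Taggers such as `nltk.pos_tag` return Penn Treebank tags (for example `VBD` or `NNS`), while `WordNetLemmatizer.lemmatize` expects a single-letter `pos` value. A small mapping can bridge the two. This is a minimal sketch; `penn_to_wordnet` is a hypothetical helper, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Convert a Penn Treebank tag to the pos letter WordNet expects."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # nouns and everything else


for tag in ["VBD", "JJR", "NNS", "RB", "IN"]:
    print(tag, "---->", penn_to_wordnet(tag))
```

The returned letter can be passed straight through, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair produced by the tagger.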

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards, the Greeks, the Turks, the Mughals, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours. Yet we have not done this to any other nation. We have not conquered anyone. We have not grabbed their land, their culture, their history and tried to enforce our way of life on them. Why? Because we believe in freedom, we believe in respecting the freedom of others.

This is my first vision, freedom. I believe that India got its first taste of this freedom in 1857, when we started the war of independence. It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us.

My second vision for India is development. For fifty years we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top 5 nations of the world in terms of GDP. We have 10 percent growth rate in most areas. Our poverty levels are falling. Our achievements are being globally recognized today. Yet we lack the self-confidence to see ourselves as a developed nation, self-reliant and self-assured. Isn't this incorrect?

I have a third vision. India must stand up to the world. Because I believe that unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.

Why are we in India so embarrassed to recognize our strengths, our achievements? We are such a great nation with so much potential. We have the potential for growth. Yet we are held back by our lack of self-confidence.

My message to the youth is to have courage, courage to think differently, courage to invent, courage to travel the unexplored path, courage to discover the impossible, courage to conquer the problems and succeed. These are great qualities that they must work towards. This is my message to the youth."""

In [None]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import nltk
sentences = nltk.sent_tokenize(paragraph)

In [None]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Mughals, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we believe in freedom, we believe in respecting the freedom of others.',
 'This is my first vision, freedom.',
 'I believe that India got its first taste of this freedom in 1857, when we started the war of independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India is development.',
 'For fifty years we have been a developing nation.',
 'I

In [None]:
from nltk.corpus import stopwords
sentences = nltk.sent_tokenize(paragraph)

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**Named Entity Recognition**

Try this sentence:

"In 2023, over 500,000 tourists visited the Eiffel Tower in Paris, France."

In [1]:
sentence = "In 2023, over 500,000 tourists visited the Eiffel Tower in Paris, France."

In [7]:
import nltk
nltk.download('punkt_tab')
words=nltk.word_tokenize(sentence)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [8]:
words

['In',
 '2023',
 ',',
 'over',
 '500,000',
 'tourists',
 'visited',
 'the',
 'Eiffel',
 'Tower',
 'in',
 'Paris',
 ',',
 'France',
 '.']

In [15]:
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [16]:
tagged_words = nltk.pos_tag(words)

In [18]:
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [21]:
named_entities = nltk.ne_chunk(tagged_words)
print(named_entities)

(S
  In/IN
  2023/CD
  ,/,
  over/IN
  500,000/CD
  tourists/NNS
  visited/VBD
  the/DT
  (ORGANIZATION Eiffel/NNP Tower/NNP)
  in/IN
  (GPE Paris/NNP)
  ,/,
  (GPE France/NNP)
  ./.)


Based on the output from the Named Entity Recognition code, here's an explanation of the identified entities in the sentence "In 2023, over 500,000 tourists visited the Eiffel Tower in Paris, France.":

*   `(ORGANIZATION Eiffel/NNP Tower/NNP)`: This identifies "Eiffel Tower" as an Organization, which is incorrect. It is a landmark and should ideally be tagged as a Facility or Location.
*   `(GPE Paris/NNP)`: This identifies "Paris" as a GPE (Geo-Political Entity). This is correct, as Paris is a city.
*   `(GPE France/NNP)`: This identifies "France" as a GPE (Geo-Political Entity). This is also correct, as France is a country.

The output also includes other tokens and their parts of speech (e.g., `In/IN`, `2023/CD`, `visited/VBD`), which are not named entities but are part of the tokenization and tagging process.
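To work with the entities programmatically, the chunked tree can be walked and its labelled subtrees collected. The sketch below assumes NLTK is installed and rebuilds the tree by hand from the output above, so it runs without downloading any models; `extract_entities` is a hypothetical helper, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the ne_chunk output printed above.
named_entities = Tree("S", [
    ("In", "IN"), ("2023", "CD"), (",", ","), ("over", "IN"),
    ("500,000", "CD"), ("tourists", "NNS"), ("visited", "VBD"), ("the", "DT"),
    Tree("ORGANIZATION", [("Eiffel", "NNP"), ("Tower", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("Paris", "NNP")]),
    (",", ","),
    Tree("GPE", [("France", "NNP")]),
    (".", "."),
])


def extract_entities(tree):
    """Return (label, text) pairs for every labelled chunk under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are subtrees.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


entities = extract_entities(named_entities)
print(entities)
```

The same function can be applied directly to the tree returned by `nltk.ne_chunk(tagged_words)`.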