__Name:__ Amrita Veshin <br>
__Register Number:__ 22122104

------------------------------------------------------------------------
#  <center> NLP LAB-08: Differentiating Stemming and Lemmatizing of Words </center>
------------------------------------------------------------------------

<br>__ABOUT THE CORPUS:__ <br>
The corpus consists of 10 headlines taken from the technology section of the Indian Express website, available at the following link: <br>
https://indianexpress.com/section/technology/  

In [None]:
text='''
1. Generative AI to your smartphone is the next big thing… Qualcomm’s trajectory underlines this
2. Qualcomm leans on ‘Generative AI’ with flagship Snapdragon phone and PC chips
3. China rushes to swap Western tech with domestic options as U.S. cracks down
4. watchOS 10.1 update: Double Tap gesture now available for Apple Watch Series 9, Watch Ultra 2
5. Honor’s Magic 6 eye-tracking feature lets you open apps using your eyes
6. Bill Gates feels Generative AI has plateaued, says GPT-5 will not be any better
7. Humane to unveil GPT-powered AI Pin on November 9
8. Not foldable, but bendable: Motorola’s latest concept phone can be worn like a wristwatch
9. Amazon gets passkey support, allows login using fingerprint and face: Here’s how to set it up
10. Qualcomm announces S7, S7 Pro Gen 1 sound platforms: Brings Wi-Fi to earbuds & headphones
'''


In [None]:
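import spacy

# Load the small English pipeline and parse the headlines
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

# Keep alphabetic tokens that are not stop words (is_alpha already excludes punctuation and digits)
tokens = [token.text for token in doc if token.is_alpha and not token.is_stop]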


In [None]:
import nltk
#from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [None]:
nltk.download('punkt')
#nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the filtered tokens (with no pos argument, every word is treated as a noun)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]


In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Applying stemming to each token
stemmed_tokens = [stemmer.stem(word) for word in tokens]
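
To see what the stemmer does to individual words, the sketch below runs `PorterStemmer` on three tokens taken from the corpus. It only assumes that NLTK is installed (the stemmer needs no downloaded data); the `words` list and `stems` dictionary are names introduced here for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three tokens taken from the headlines above
words = ['Generative', 'trajectory', 'headphones']

# stem() lowercases its input and strips suffixes by rule, without consulting a dictionary
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

The resulting stems match the corresponding rows of the table printed further down.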

In [None]:
# Segregating the original sentences
sentences = list(doc.sents)


In [None]:
import pandas as pd

#column_names=['Tokens', 'Lemmatized Tokens', 'Stemmed Tokens']
data={'Tokens':tokens, 'Lemmatized Tokens':lemmatized_tokens, 'Stemmed Tokens':stemmed_tokens}
dataframe= pd.DataFrame(data)
print(dataframe)


        Tokens Lemmatized Tokens Stemmed Tokens
0   Generative        Generative          gener
1           AI                AI             ai
2   smartphone        smartphone      smartphon
3          big               big            big
4        thing             thing          thing
..         ...               ...            ...
84      Brings            Brings          bring
85          Wi                Wi             wi
86          Fi                Fi             fi
87     earbuds           earbuds         earbud
88  headphones         headphone       headphon

[89 rows x 3 columns]
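
With 89 rows, it can help to keep only the tokens that either technique actually changed. The sketch below shows one possible way to do that with pandas. So that it stays self-contained, it rebuilds a small `sample` frame from rows copied out of the output above instead of reusing `dataframe`; the names `sample`, `lemma_changed`, `stem_changed` and `changed` are introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the printed DataFrame above
sample = pd.DataFrame({
    'Tokens': ['Generative', 'AI', 'smartphone', 'big', 'Brings', 'headphones'],
    'Lemmatized Tokens': ['Generative', 'AI', 'smartphone', 'big', 'Brings', 'headphone'],
    'Stemmed Tokens': ['gener', 'ai', 'smartphon', 'big', 'bring', 'headphon'],
})

# Compare case-insensitively, since the stemmer lowercases its output
original = sample['Tokens'].str.lower()
lemma_changed = sample['Lemmatized Tokens'].str.lower() != original
stem_changed = sample['Stemmed Tokens'] != original

# Keep the rows altered by at least one of the two techniques
changed = sample[lemma_changed | stem_changed]
print(changed)
```

In this sample the lemmatizer alters one token while the stemmer alters four, which is the pattern discussed in the inferences.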


In [None]:
from tabulate import tabulate

pretty_table = tabulate(dataframe, headers='keys', tablefmt='fancy_grid')

# Print the pretty table
print(pretty_table)

╒════╤═════════════╤═════════════════════╤══════════════════╕
│    │ Tokens      │ Lemmatized Tokens   │ Stemmed Tokens   │
╞════╪═════════════╪═════════════════════╪══════════════════╡
│  0 │ Generative  │ Generative          │ gener            │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  1 │ AI          │ AI                  │ ai               │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  2 │ smartphone  │ smartphone          │ smartphon        │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  3 │ big         │ big                 │ big              │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  4 │ thing       │ thing               │ thing            │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  5 │ Qualcomm    │ Qualcomm            │ qualcomm         │
├────┼─────────────┼─────────────────────┼──────────────────┤
│  6 │ trajectory  │ trajectory          │ trajectori       │
├────┼──

## INFERENCES
1. __Lemmatization:__ <br> According to the Stanford NLP Group, lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. <br>
__In the lemmatized tokens extracted from the above corpus, most of the changes reduce a plural noun to its singular form (e.g. headphones becomes headphone, eyes becomes eye). Tokens such as Brings are left unchanged because the lemmatizer was called without a part-of-speech tag and therefore treats every token as a noun.__

2. __Stemming:__ <br> Again, according to the Stanford NLP Group, stemming usually refers to a crude heuristic process that chops off the ends of words and often includes the removal of derivational affixes. It reduces inflected words to their word stem, base or root form.<br> __In the above corpus, the stemmer strips suffixes from the end of each word and lowercases it, which in a significant number of cases leaves behind a string that is not a real word (e.g. announces becomes announc, bendable becomes bendabl, foldable becomes foldabl).__

3. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two techniques differ in their flavor. __Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.__
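
The contrast drawn in the inferences above, rules versus vocabulary, can be sketched with a toy example in plain Python. Everything in it is hypothetical and deliberately tiny: `SUFFIX_RULES`, `TOY_VOCABULARY`, `toy_stem` and `toy_lemmatize` are made up for illustration and are not how `PorterStemmer` or `WordNetLemmatizer` are implemented.

```python
# Toy illustration only: the rules and vocabulary below are hypothetical and
# far smaller than what a real stemmer or lemmatizer uses.
SUFFIX_RULES = ['es', 's', 'e']

def toy_stem(word):
    """Strip the first matching suffix; no vocabulary is consulted."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        # Require at least three characters to remain after stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

TOY_VOCABULARY = {'headphones': 'headphone', 'eyes': 'eye', 'announces': 'announce'}

def toy_lemmatize(word):
    """Look the word up in a vocabulary; unknown words are returned unchanged."""
    return TOY_VOCABULARY.get(word.lower(), word)

for word in ['announces', 'bendable', 'headphones']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based function needs only a short suffix list but produces non-words such as `announc`, while the lookup returns real words but can do nothing with a word that is missing from its vocabulary.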