In [2]:
# NLTK (Natural Language Toolkit) is a popular open-source Python library used for natural language processing (NLP) tasks. 
# It provides a wide range of functionalities and tools for working with human language data. NLTK offers support for tasks 
# such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, semantic reasoning, and more.

# Inside a notebook cell, pip must be invoked through the %pip magic; a bare "pip install" line is not valid Python
%pip install nltk

In [None]:
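# WordNet is a lexical database and semantic network for the English language. It is a widely used resource in natural language 
# processing (NLP) and computational linguistics.

# NLTK uses WordNet as a data package, so it is fetched with nltk.download rather than installed with pip
import nltk
nltk.download('wordnet')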

In [None]:
################################################################################################################################

The Porter Stemmer reduces words to their base form, or stem, by removing suffixes and modifying the word accordingly. It may not always produce a dictionary word, as its main goal is to produce a common stem that captures the essential meaning of the original word. Stemming can be helpful for tasks such as information retrieval, text mining, and other NLP applications where understanding the core meaning of words is more important than their precise form.
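
The suffix-stripping idea described above can be sketched with a toy function. This is not the Porter algorithm: the suffix list, the minimum stem length, and the name toy_stem are assumptions chosen purely for illustration.

```python
# Toy suffix stripper for illustration only. This is NOT the Porter algorithm:
# the suffix list and the minimum stem length are assumptions made for this sketch.
SUFFIXES = ("ness", "ing", "ed", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ["jumps", "jumped", "jumping", "happiness", "running"]:
    print(f"{word} -> {toy_stem(word)}")
```

The sketch returns "happi" for "happiness", which shows why a stem need not be a dictionary word. It also returns "runn" for "running", which is why real stemmers add further rules on top of plain suffix removal (NLTK's PorterStemmer returns "run" for the same input).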

In [3]:
# Example of PorterStemmer

from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
stemmer = PorterStemmer()

# Stemming example words
words = ["running", "ran", "jumps", "jumped", "happiness", "happy"]

for word in words:
    stemmed_word = stemmer.stem(word)
    print(f"Original: {word}, Stemmed: {stemmed_word}")

Original: running, Stemmed: run
Original: ran, Stemmed: ran
Original: jumps, Stemmed: jump
Original: jumped, Stemmed: jump
Original: happiness, Stemmed: happi
Original: happy, Stemmed: happi


In [6]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Louis\AppData\Roaming\nltk_data...


True

In [None]:
################################################################################################################################

The Lancaster Stemmer is more aggressive than the Porter Stemmer and usually produces shorter stems. For example, it reduces "dictionary" to "dict", whereas the Porter Stemmer returns "dictionari". Like the Porter Stemmer, it does not always produce dictionary words; its aim is to reduce related words to a common stem.

In [7]:
# Example of LancasterStemmer

from nltk.stem import LancasterStemmer

# Create an instance of the LancasterStemmer
stemmer = LancasterStemmer()

# Stemming example words
words = ["running", "ran", "jumps", "jumped", "happiness", "happy"]

for word in words:
    stemmed_word = stemmer.stem(word)
    print(f"Original: {word}, Stemmed: {stemmed_word}")

Original: running, Stemmed: run
Original: ran, Stemmed: ran
Original: jumps, Stemmed: jump
Original: jumped, Stemmed: jump
Original: happiness, Stemmed: happy
Original: happy, Stemmed: happy


In [None]:
################################################################################################################################

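When called without a POS argument, the WordNet Lemmatizer treats every word as a noun (the default is pos='n'). That is why "jumps" is lemmatized to "jump" (a plural noun reduced to its singular), "running" stays "running" because it is a valid noun, and "ran" and "jumped" are returned unchanged because no noun lemma is found for them. To recover "run" or "jump" from these verb forms, the lemmatizer has to be told that the word is a verb.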

In [8]:
# Example of WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

# Create an instance of the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatization example words
words = ["running", "ran", "jumps", "jumped", "happiness", "happy"]

for word in words:
    lemma = lemmatizer.lemmatize(word)
    print(f"Original: {word}, Lemmatized: {lemma}")

Original: running, Lemmatized: running
Original: ran, Lemmatized: ran
Original: jumps, Lemmatized: jump
Original: jumped, Lemmatized: jumped
Original: happiness, Lemmatized: happiness
Original: happy, Lemmatized: happy


In [None]:
################################################################################################################################

Here the WordNet Lemmatizer is given the POS (part-of-speech) tag 'v', so each word is lemmatized as a verb. The output shows that "running" becomes "run" and "bought" becomes "buy".

In [9]:
from nltk.stem import WordNetLemmatizer

# Create an instance of the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatization of a word as a verb
word = "running"
word1 = "bought"
pos = "v"  # 'v' represents verb

lemma = lemmatizer.lemmatize(word, pos)
lemma1 = lemmatizer.lemmatize(word1, pos)
print(f"Original: {word}, Lemmatized as verb: {lemma}")
print(f"Original: {word1}, Lemmatized as verb: {lemma1}")

Original: running, Lemmatized as verb: run
Original: bought, Lemmatized as verb: buy
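
In practice the POS tag usually comes from a tagger rather than being typed by hand. Taggers such as nltk.pos_tag emit Penn Treebank tags ('VBD', 'NNS', ...), while lemmatize expects the single-letter WordNet codes 'n', 'v', 'a', or 'r'. A minimal sketch of that conversion follows; penn_to_wordnet is a hypothetical helper written for this example, not an NLTK function, and falling back to 'n' is an assumption that mirrors the lemmatizer's own default.

```python
# Hypothetical helper (not part of NLTK): convert a Penn Treebank POS tag into
# the single-letter code accepted by WordNetLemmatizer.lemmatize(word, pos).
def penn_to_wordnet(tag: str, default: str = "n") -> str:
    """Return 'a', 'v', 'n' or 'r' for a Penn Treebank tag, else the default."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    # Assumption: anything else falls back to noun, like the lemmatizer's default
    return default


for tag in ["VBD", "NNS", "JJR", "RB", "DT"]:
    print(f"{tag} -> {penn_to_wordnet(tag)}")
```

The returned code can then be passed as the second argument of lemmatizer.lemmatize, in the same way as the literal "v" in the cell above.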


In [None]:
################################################################################################################################

The Porter Stemmer and the Lancaster Stemmer are both stemming algorithms used to reduce words to their base or root form.

The Porter Stemmer is the more conservative of the two, so its stems tend to be longer and stay closer to the original word. The Lancaster Stemmer strips suffixes more aggressively and produces shorter stems, at the cost of sometimes truncating a word beyond recognition (in the comparison below, "mistreated" becomes "mist"). A lemmatizer, by contrast, returns a dictionary form of the word.

In [None]:
# Import packages
import pandas as pd
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

# Instantiate stemmers and lemmatiser

porter = PorterStemmer() # Stemmer
lancaster = LancasterStemmer() # Stemmer

lemmatiser = WordNetLemmatizer() # Lemmatiser

# Create function that normalises text using all three techniques
def normalise_text(words, pos='n'):
    """Stem and lemmatise each word in a list. Return output in a dataframe.

    pos is the WordNet part-of-speech code passed to the lemmatiser ('n' noun, 'v' verb).
    """
    normalised_text = pd.DataFrame(index=words, columns=['Porter', 'Lancaster', 'Lemmatiser'])
    for word in words:
        normalised_text.loc[word,'Porter'] = porter.stem(word)
        normalised_text.loc[word,'Lancaster'] = lancaster.stem(word)
        # Pass pos through; without it the lemmatiser treats every word as a noun
        normalised_text.loc[word,'Lemmatiser'] = lemmatiser.lemmatize(word, pos=pos)
    return normalised_text

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aicyb\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
normalise_text(['apples', 'pears', 'tasks', 'children', 'earrings', 'dictionary', 'marriage', 'connections',\
                'universe', 'university'], pos='n')

Word         Porter      Lancaster  Lemmatiser
apples       appl        appl       apple
pears        pear        pear       pear
tasks        task        task       task
children     children    childr     child
earrings     ear         ear        earring
dictionary   dictionari  dict       dictionary
marriage     marriag     marry      marriage
connections  connect     connect    connection
universe     univers     univers    universe
university   univers     univers    university
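
The difference in aggressiveness can be read off the table above with a few lines of plain Python. The sketch below copies the Porter and Lancaster stems from that output by hand (it does not call NLTK), then compares the average stem length and lists the words each stemmer collapses onto the same stem. The helper names mean_length and conflated are assumptions made for this example.

```python
from collections import defaultdict

# (word, Porter stem, Lancaster stem), copied by hand from the table above.
ROWS = [
    ("apples", "appl", "appl"),
    ("pears", "pear", "pear"),
    ("tasks", "task", "task"),
    ("children", "children", "childr"),
    ("earrings", "ear", "ear"),
    ("dictionary", "dictionari", "dict"),
    ("marriage", "marriag", "marry"),
    ("connections", "connect", "connect"),
    ("universe", "univers", "univers"),
    ("university", "univers", "univers"),
]


def mean_length(stems):
    """Average number of characters per stem."""
    return sum(len(stem) for stem in stems) / len(stems)


def conflated(pairs):
    """Group words by stem and keep only stems shared by two or more words."""
    groups = defaultdict(list)
    for word, stem in pairs:
        groups[stem].append(word)
    return {stem: words for stem, words in groups.items() if len(words) > 1}


porter_stems = [porter_stem for _, porter_stem, _ in ROWS]
lancaster_stems = [lancaster_stem for _, _, lancaster_stem in ROWS]

print("Mean Porter stem length:   ", mean_length(porter_stems))
print("Mean Lancaster stem length:", mean_length(lancaster_stems))
print("Conflated by Porter:   ", conflated((w, p) for w, p, _ in ROWS))
print("Conflated by Lancaster:", conflated((w, l) for w, _, l in ROWS))
```

On this small sample the Porter stems average 6.1 characters against 5.1 for Lancaster, and both stemmers merge "universe" and "university" into "univers", a conflation the lemmatiser avoids.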


In [None]:
# Import packages
import pandas as pd
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

# Instantiate stemmers and lemmatiser

porter = PorterStemmer() # Stemmer
lancaster = LancasterStemmer() # Stemmer

lemmatiser = WordNetLemmatizer() # Lemmatizer

# Create function that normalises text using all three techniques
def normalise_text(words, pos='n'):
    """Stem and lemmatise each word in a list. Return output in a dataframe."""
    normalised_text = pd.DataFrame(index=words, columns=['Porter', 'Lancaster', 'Lemmatiser'])
    for word in words:
        normalised_text.loc[word,'Porter'] = porter.stem(word)
        normalised_text.loc[word,'Lancaster'] = lancaster.stem(word)
        normalised_text.loc[word,'Lemmatiser'] = lemmatiser.lemmatize(word, pos=pos)
    return normalised_text

In [None]:
normalise_text(['apples', 'pears', 'tasks', 'children', 'earrings', 'dictionary', 'marriage', 'connections',\
                'universe', 'university'], pos='v')

Word         Porter      Lancaster  Lemmatiser
apples       appl        appl       apples
pears        pear        pear       pears
tasks        task        task       task
children     children    childr     children
earrings     ear         ear        earrings
dictionary   dictionari  dict       dictionary
marriage     marriag     marry      marriage
connections  connect     connect    connections
universe     univers     univers    universe
university   univers     univers    university


In [None]:
normalise_text(['pie', 'globe', 'house', 'knee', 'angle', 'acetone', 'time', 'brownie', 'climate',\
                'independence'], pos='n')

Word          Porter    Lancaster  Lemmatiser
pie           pie       pie        pie
globe         globe     glob       globe
house         hous      hous       house
knee          knee      kne        knee
angle         angl      angl       angle
acetone       aceton    aceton     acetone
time          time      tim        time
brownie       browni    browny     brownie
climate       climat    clim       climate
independence  independ  independ   independence


In [None]:
normalise_text(['wrote', 'thinking', 'remembered', 'relies', 'ate', 'gone', 'won', 'ran', \
                'swimming', 'mistreated'], pos='v')

Word        Porter    Lancaster  Lemmatiser
wrote       wrote     wrot       write
thinking    think     think      think
remembered  rememb    rememb     remember
relies      reli      rely       rely
ate         ate       at         eat
gone        gone      gon        go
won         won       won        win
ran         ran       ran        run
swimming    swim      swim       swim
mistreated  mistreat  mist       mistreat
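
The verb table above also shows where suffix stripping stops helping: irregular forms. The sketch below copies the rows from that output by hand (it does not call NLTK) and lists the words the Porter stemmer returned unchanged even though the lemmatiser found a different base form, along with the words on which all three techniques agree. The variable names are assumptions made for this example.

```python
# (word, Porter stem, Lancaster stem, verb lemma), copied by hand from the table above.
ROWS = [
    ("wrote", "wrote", "wrot", "write"),
    ("thinking", "think", "think", "think"),
    ("remembered", "rememb", "rememb", "remember"),
    ("relies", "reli", "rely", "rely"),
    ("ate", "ate", "at", "eat"),
    ("gone", "gone", "gon", "go"),
    ("won", "won", "won", "win"),
    ("ran", "ran", "ran", "run"),
    ("swimming", "swim", "swim", "swim"),
    ("mistreated", "mistreat", "mist", "mistreat"),
]

# Words Porter left untouched although the lemmatiser mapped them to another form
unchanged_by_porter = [
    word for word, porter_stem, _, lemma in ROWS
    if porter_stem == word and lemma != word
]

# Words for which both stemmers and the lemmatiser return the same result
all_agree = [
    word for word, porter_stem, lancaster_stem, lemma in ROWS
    if porter_stem == lancaster_stem == lemma
]

print("Unchanged by Porter, resolved by lemmatiser:", unchanged_by_porter)
print("All three agree:", all_agree)
```

The first list contains the five irregular past forms ("wrote", "ate", "gone", "won", "ran"), which have no suffix for a stemmer to strip, while the three techniques agree only on the regular forms "thinking" and "swimming".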
