# Introduction:
 - Natural Language Processing (NLP) is a field of artificial intelligence concerned with analyzing and understanding text and other linguistic data. Text preprocessing, including stemming (stripping affixes by rule) and lemmatization (mapping a word to its dictionary form), is an essential step toward accurate results in applications such as text analysis, classification, and sentiment analysis.

 - In this notebook, we explore and implement several text preprocessing techniques using the NLTK library. We compare the Porter, Snowball, and Lancaster stemmers with the WordNet Lemmatizer, and test them on a non-English language, Arabic. Finally, we show how a custom dictionary can map words to their roots.

# Notebook Objectives:
 1. Explore stemming and lemmatization techniques using the NLTK library.
 2. Apply these techniques to both English and Arabic texts.
 3. Compare the output of different stemming algorithms.
 4. Utilize a custom dictionary for root extraction in specific cases.
 5. Build a practical understanding of text preprocessing as a critical step in NLP projects.
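Before turning to NLTK, the difference between the two techniques can be made concrete with a toy sketch. The names `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical and exist only for this illustration; real stemmers and lemmatizers use far richer rules and vocabularies.

```python
# Toy illustration only: not how NLTK implements either technique.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set
LEMMA_TABLE = {"ran": "run", "feet": "foot", "running": "run"}  # hypothetical lookup


def toy_stem(word):
    """Strip the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)


for w in ["running", "ran", "feet", "books"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The rule-based function produces `runn` for `running` and leaves irregular forms such as `feet` untouched, while the lookup handles irregular forms but does nothing for words missing from its table.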

In [1]:
pip install --upgrade nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 17.3 MB/s eta 0:00:00
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.9.1 which is incompatible.
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.
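The log above shows NLTK being upgraded from 3.2.4 to 3.9.1 and a conflict with the `preprocessing` package, which pins `nltk==3.2.4`. After restarting the kernel, it can be worth confirming which versions are actually installed. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of pip or NLTK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages named in the pip output above.
for name in ("nltk", "preprocessing"):
    print(name, "->", installed_version(name))
```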


In [2]:
import nltk

nltk.download('punkt')

nltk.download('wordnet')

from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
text = "I am running towards the park, and I have been running for an hour."

words = nltk.word_tokenize(text)

for word in words:

    print(word+' --> '+p_stemmer.stem(word))

I --> i
am --> am
running --> run
towards --> toward
the --> the
park --> park
, --> ,
and --> and
I --> i
have --> have
been --> been
running --> run
for --> for
an --> an
hour --> hour
. --> .


In [5]:
from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')



words = ['run','runner','running','ran','runs','easily','fairly']



for word in words:

    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


In [6]:
words = ['generous','generation','generously','generate']



for word in words:

    # Label each line so the two stemmers can be told apart in the output.
    print('Snowball: '+word+' --> '+s_stemmer.stem(word))

    print('Porter:   '+word+' --> '+p_stemmer.stem(word))

    print('---------------------------------------')

Snowball: generous --> generous
Porter:   generous --> gener
---------------------------------------
Snowball: generation --> generat
Porter:   generation --> gener
---------------------------------------
Snowball: generously --> generous
Porter:   generously --> gener
---------------------------------------
Snowball: generate --> generat
Porter:   generate --> gener
---------------------------------------
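In the output above, the first line of each pair comes from the Snowball stemmer and the second from the Porter stemmer. One way to read such a comparison is to group the words by the stem they receive: Porter collapses all four words into `gener`, while Snowball keeps `generous` and `generat` apart. The sketch below uses stems copied from that output so it runs without NLTK; `group_by_stem` is a hypothetical helper that accepts any callable, so `s_stemmer.stem` or `p_stemmer.stem` could presumably be passed instead.

```python
from collections import defaultdict


def group_by_stem(words, stem_fn):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


# Stems copied from the printed output above (not recomputed here).
snowball = {"generous": "generous", "generation": "generat",
            "generously": "generous", "generate": "generat"}
porter = {"generous": "gener", "generation": "gener",
          "generously": "gener", "generate": "gener"}

sample = list(porter)
print("Snowball:", group_by_stem(sample, snowball.get))
print("Porter:  ", group_by_stem(sample, porter.get))
```

Fewer groups means more words are treated as the same term, which helps recall but merges unrelated meanings such as "generous" and "generate".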


In [7]:
from nltk.stem import PorterStemmer , LancasterStemmer

from nltk.tokenize import sent_tokenize, word_tokenize



ps = PorterStemmer()

ls =  LancasterStemmer()

In [8]:
words = ["is","was","be","been","are","were"]

In [9]:
for w in words:

    print(f'Word  {w}    has stemming      {ps.stem(w)}')

Word  is    has stemming      is
Word  was    has stemming      wa
Word  be    has stemming      be
Word  been    has stemming      been
Word  are    has stemming      are
Word  were    has stemming      were


In [10]:
for w in words:

    print(f'Word  {w}    has stemming      {ls.stem(w)}')

Word  is    has stemming      is
Word  was    has stemming      was
Word  be    has stemming      be
Word  been    has stemming      been
Word  are    has stemming      ar
Word  were    has stemming      wer
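The two runs above show both stemmers damaging short function words: Porter turns `was` into `wa`, and Lancaster turns `are` into `ar`. A common workaround is to exempt a list of protected words from stemming. The sketch below is one possible shape for that, not something the notebook itself does; `stem_unless_protected`, `PROTECTED`, and the stand-in `strip_final_s` are hypothetical names, and in practice `ps.stem` or `ls.stem` would be passed as `stem_fn`.

```python
def stem_unless_protected(word, stem_fn, protected):
    """Apply stem_fn unless the lowercased word is in the protected set."""
    return word if word.lower() in protected else stem_fn(word)


# Hypothetical protected list; a stop-word list could play this role.
PROTECTED = {"is", "was", "be", "been", "are", "were"}


def strip_final_s(word):
    """Stand-in stemmer (an assumption for this demo): drops one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


for w in ["was", "books", "is"]:
    print(w, "->", stem_unless_protected(w, strip_final_s, PROTECTED))
```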


In [11]:
words = ["book","booking","booked","books","booker","bookstore"]

In [12]:
for w in words:

    print(f'Word  {w}    has stemming      {ps.stem(w)}')

Word  book    has stemming      book
Word  booking    has stemming      book
Word  booked    has stemming      book
Word  books    has stemming      book
Word  booker    has stemming      booker
Word  bookstore    has stemming      bookstor


In [13]:
for w in words:

    print(f'Word  {w}    has stemming      {ls.stem(w)}')

Word  book    has stemming      book
Word  booking    has stemming      book
Word  booked    has stemming      book
Word  books    has stemming      book
Word  booker    has stemming      book
Word  bookstore    has stemming      bookst


In [14]:
sentence = 'had you booked the air booking yet ? if not try to book it ASAP since booking will be out of books'

In [15]:
words = word_tokenize(sentence)

for w in words:

    print(f'Word  {w}    has stemming      {ps.stem(w)}')

Word  had    has stemming      had
Word  you    has stemming      you
Word  booked    has stemming      book
Word  the    has stemming      the
Word  air    has stemming      air
Word  booking    has stemming      book
Word  yet    has stemming      yet
Word  ?    has stemming      ?
Word  if    has stemming      if
Word  not    has stemming      not
Word  try    has stemming      tri
Word  to    has stemming      to
Word  book    has stemming      book
Word  it    has stemming      it
Word  ASAP    has stemming      asap
Word  since    has stemming      sinc
Word  booking    has stemming      book
Word  will    has stemming      will
Word  be    has stemming      be
Word  out    has stemming      out
Word  of    has stemming      of
Word  books    has stemming      book


In [16]:
for w in words:

    print(f'Word  {w}    has stemming      {ls.stem(w)}')

Word  had    has stemming      had
Word  you    has stemming      you
Word  booked    has stemming      book
Word  the    has stemming      the
Word  air    has stemming      air
Word  booking    has stemming      book
Word  yet    has stemming      yet
Word  ?    has stemming      ?
Word  if    has stemming      if
Word  not    has stemming      not
Word  try    has stemming      try
Word  to    has stemming      to
Word  book    has stemming      book
Word  it    has stemming      it
Word  ASAP    has stemming      asap
Word  since    has stemming      sint
Word  booking    has stemming      book
Word  will    has stemming      wil
Word  be    has stemming      be
Word  out    has stemming      out
Word  of    has stemming      of
Word  books    has stemming      book


In [17]:
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding",

             "railroad","moonlight","football"]

print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","Lancaster Stemmer"))

for word in word_list:

    print("{0:20}{1:20}{2:20}".format(word,ps.stem(word),ls.stem(word)))

Word                Porter Stemmer      Lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             
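The table above suggests that Lancaster is the more aggressive of the two stemmers. One rough way to put a number on that impression is the average net length reduction per word. The sketch below uses the rows copied from the table, so it runs without NLTK; `ROWS` and `mean_chars_removed` are hypothetical names, and length reduction is only a crude proxy for aggressiveness, not a standard evaluation metric.

```python
# (word, Porter stem, Lancaster stem) rows copied from the table above.
ROWS = [
    ("friend", "friend", "friend"),
    ("friendship", "friendship", "friend"),
    ("friends", "friend", "friend"),
    ("friendships", "friendship", "friend"),
    ("stabil", "stabil", "stabl"),
    ("destabilize", "destabil", "dest"),
    ("misunderstanding", "misunderstand", "misunderstand"),
    ("railroad", "railroad", "railroad"),
    ("moonlight", "moonlight", "moonlight"),
    ("football", "footbal", "footbal"),
]


def mean_chars_removed(pairs):
    """Average net length reduction (word length minus stem length) per word."""
    pairs = list(pairs)
    return sum(len(word) - len(stem) for word, stem in pairs) / len(pairs)


porter_score = mean_chars_removed((w, p) for w, p, _ in ROWS)
lancaster_score = mean_chars_removed((w, l) for w, _, l in ROWS)
print(f"Porter: {porter_score:.1f}  Lancaster: {lancaster_score:.1f}")
```

On these ten words the script prints `Porter: 0.9  Lancaster: 2.2`, driven mostly by `friendship`, `friendships`, and `destabilize`.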


In [18]:
nltk.download('omw-1.4')
nltk.data.path.append('/path/to/custom/nltk_data')

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


In [19]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [20]:
words = ["cats","cacti","radii","feet","speech",'runner']



for word in words :

    print(lemmatizer.lemmatize(word))

cat
cactus
radius
foot
speech
runner


In [21]:
import nltk

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()



sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

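punctuations = set("?:!.,;")

sentence_words = nltk.word_tokenize(sentence)

# Build a new list rather than removing items while iterating,
# which can silently skip the token after each removal.
sentence_words = [word for word in sentence_words if word not in punctuations]
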

print("{0:20}{1:20}".format("Word","Lemma"))

for word in sentence_words:

    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))

Word                Lemma               
He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 


In [22]:
for word in sentence_words:

    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))

He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
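The two runs above show that the result depends on the `pos` argument: with the default (noun) `was` becomes `wa`, while `pos="v"` gives `be` but leaves `hours` unchanged. In a fuller pipeline the part of speech is usually chosen per token from a tagger's output. The sketch below shows only the mapping step, from Penn Treebank tags to the single-letter codes WordNet uses; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are hand-written stand-ins for what a tagger such as `nltk.pos_tag` might return (using that function would require an additional model download not covered in this notebook).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    WordNet uses 'a' (adjective), 'v' (verb), 'n' (noun), 'r' (adverb).
    Any other tag falls back to 'n', which is the lemmatizer's default.
    """
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"


# Hand-written (token, tag) pairs standing in for the output of a POS tagger.
tagged = [("was", "VBD"), ("running", "VBG"), ("hours", "NNS"), ("bad", "JJ")]
for token, tag in tagged:
    print(f"{token:10}{tag:6}{penn_to_wordnet(tag)}")
```

Each resulting letter could then be passed as `wordnet_lemmatizer.lemmatize(token, pos=...)`, so that verbs and nouns in the same sentence are both reduced correctly.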


In [23]:
words = ["is","was","be","been","are","were"]



for word in words :

    print(lemmatizer.lemmatize(word))

is
wa
be
been
are
were


In [24]:
words = ["is","was","be","been","are","were"]

for word in words :

    print(lemmatizer.lemmatize(word,'v'))

be
be
be
be
be
be


In [25]:
words = ["feet","radii","men","children","carpenter","fighter"]

for word in words :

    print(lemmatizer.lemmatize(word,'n'))

foot
radius
men
child
carpenter
fighter


In [26]:
from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='arabic')



# Arabic forms related to "run": the running, she runs, they run, running, he runs
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words:

    print(word+' --> '+s_stemmer.stem(word))

الجري --> الجر
تجري --> تجر
يجرون --> يجرو
جري --> جر
يجري --> يجر


In [27]:
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words:

    print(word+' --> '+p_stemmer.stem(word))

الجري --> الجري
تجري --> تجري
يجرون --> يجرون
جري --> جري
يجري --> يجري


In [28]:
from nltk.stem import PorterStemmer , LancasterStemmer

from nltk.tokenize import sent_tokenize, word_tokenize



ps = PorterStemmer()

ls =  LancasterStemmer()

In [29]:
words = ['الجري','تجري','يجرون','جري','يجري']

In [30]:
for w in words:

    print(w+'---->'+ps.stem(w))

الجري---->الجري
تجري---->تجري
يجرون---->يجرون
جري---->جري
يجري---->يجري


In [31]:
for w in words:

    print(w+'---->'+ls.stem(w))

الجري---->الجري
تجري---->تجري
يجرون---->يجرون
جري---->جري
يجري---->يجري


In [32]:
words = ['الجري','تجري','يجرون','جري','يجري']

print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","Lancaster Stemmer"))

for word in words:

    print("{0:20}{1:20}{2:20}".format(word,ps.stem(word),ls.stem(word)))

Word                Porter Stemmer      Lancaster Stemmer   
الجري               الجري               الجري               
تجري                تجري                تجري                
يجرون               يجرون               يجرون               
جري                 جري                 جري                 
يجري                يجري                يجري                


In [33]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [34]:
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words :

    print(lemmatizer.lemmatize(word))

الجري
تجري
يجرون
جري
يجري
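The Porter stemmer, Lancaster stemmer, and WordNet lemmatizer all return the Arabic words unchanged in the cells above, while only the Snowball stemmer created with `language='arabic'` alters them. For mixed-language text, one option is to detect the script of each word and pick the stemmer language accordingly. The sketch below is a minimal, standard-library version of that check; `is_arabic` and `stemmer_language` are hypothetical helpers, and the test only covers the basic Arabic Unicode block (U+0600 to U+06FF), which is an assumption that may not suit every text.

```python
def is_arabic(word):
    """True if the word has letters and all of them lie in the basic Arabic block."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all("\u0600" <= ch <= "\u06FF" for ch in letters)


def stemmer_language(word):
    """Pick a language name to pass to SnowballStemmer (hypothetical helper)."""
    return "arabic" if is_arabic(word) else "english"


for w in ["يجري", "running", "العمال"]:
    print(w, "->", stemmer_language(w))
```

The returned name could then select between two pre-built stemmers, for example `SnowballStemmer(language='arabic')` and `SnowballStemmer(language='english')`, both of which appear earlier in this notebook.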


In [35]:
# Custom root dictionary: each key is a root and each value lists the surface forms mapped to it.
# 'عمل' ("work") groups Arabic verb and noun forms such as "he works", "worker", and "the workers".
StemDict = {'run':['run','runs','running','runner','rerun','ran'],
            'book':['book','books','booking','booker','rebook'],
            'عمل' : ['يعمل','عامل','يعملون','عمال','العمال'] }

# T ='he went for running last night as he love ran and he always runs at night , so he can book the booking from the booker'
T = 'ذهب محمد مع العمال كي يعملون في المصنع لانه يعمل  عامل مع عمال آخرين'

NewT = []
for word in T.split():
    # Keep the word as-is unless one of the dictionary entries lists it.
    NewWord = word
    for key in StemDict:
        if word in StemDict[key]:
            NewWord = key
            break
    NewT.append(NewWord)





In [36]:
' '.join(NewT)

'ذهب محمد مع عمل كي عمل في المصنع لانه عمل عمل مع عمل آخرين'
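The loop above scans every list in the dictionary for each word. For larger dictionaries, an alternative is to invert the mapping once so that each surface form points directly at its root, which also makes it easy to detect a form listed under two roots. The following is a sketch under those assumptions; `build_lookup`, `normalize`, and `demo_dict` are hypothetical names, and the demo uses a shortened English dictionary rather than the notebook's `StemDict`.

```python
def build_lookup(stem_dict):
    """Invert {root: [forms]} into {form: root}, rejecting ambiguous forms."""
    lookup = {}
    for root, forms in stem_dict.items():
        for form in forms:
            # setdefault stores the root on first sight and returns the stored root after that.
            if lookup.setdefault(form, root) != root:
                raise ValueError(f"{form!r} is listed under two roots")
    return lookup


def normalize(text, lookup):
    """Replace each whitespace-separated token with its root when one is known."""
    return " ".join(lookup.get(token, token) for token in text.split())


demo_dict = {"run": ["run", "runs", "running", "ran"],
             "book": ["book", "books", "booking"]}
lookup = build_lookup(demo_dict)
print(normalize("he runs at night and books the booking", lookup))
```

Like the original loop, this matches whole tokens only, so punctuation attached to a word or a different letter case would prevent a match.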