# <h1 align=center>**Text Normalization**</h1>
Text normalization is the process of cleaning and preprocessing text data so that it is consistent and usable across different NLP tasks. It includes a variety of techniques, such as case normalization, punctuation removal, stop word removal, stemming, and lemmatization.

- <a href='#Tokenization'>1. Tokenization</a>
    - <a href='#Tokenization Using NLTK EN'>1.1. Tokenization Using NLTK EN</a>
    - <a href='#Tokenization Using NLTK AR'>1.2. Tokenization Using NLTK AR</a>
- <a href='#POS'>2. POS</a>
- <a href='#NER'>3. NER</a>
- <a href='#Text Normalizaion ( Cleaning )'>4. Text Normalization ( Cleaning )</a>
    - <a href='#stop words in EN'>4.1.  Stop words in EN</a>
    - <a href='#punctuation'>4.2.  Punctuation</a>
    - <a href='#stemming using nltk'>4.3.  Stemming using nltk</a>
    - <a href='#stemming in AR'>4.4.  Stemming in AR</a>
    - <a href='#lemmatization using nltk'>4.5.  Lemmatization using nltk</a>
    - <a href='#lemmatization in AR'>4.6.  Lemmatization in AR</a>
    - <a href='#special arabic cleaning functions'>4.7.  Special arabic cleaning functions</a>


# Import Library ( nltk )
- NLTK (Natural Language Toolkit) is a Python library for working with human language data in statistical natural language processing (NLP).
- It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.
- Ref: "https://www.nltk.org/data.html"
- NLTK **Corpora ( Datasets )**: "https://www.nltk.org/nltk_data/"

In [68]:
import nltk
nltk.download()  # opens the interactive NLTK downloader, where you select the packages to install

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [39]:
# list the first ten names exposed by the library
dir(nltk)[0:10]

['ARLSTem',
 'ARLSTem2',
 'AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures']

<b>
<a id='Tokenization'></a>
<font size="7">Tokenization</font>
</b>

##### Tokenization: splitting a sentence into tokens ( **words** ) is done by a **word tokenizer**
##### Tokenization: splitting a paragraph into sentences is done by a **sentence tokenizer**
- Hint: a word tokenizer works like `split`, but it also separates punctuation and symbols from words instead of splitting on whitespace only.
- Applying word tokenization to the text **This product is amazing, but the delivery was late.**
  - **Answer**: ["This", "product", "is", "amazing", ",", "but", "the", "delivery", "was", "late", "."]
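To make the two kinds of tokenizer concrete, here is a minimal regex-based sketch. It is not how NLTK's `word_tokenize` or `sent_tokenize` work internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for illustration, and the sentence splitter is naive (it would wrongly split after abbreviations such as "Mr.").

```python
import re

def simple_word_tokenize(text):
    # A run of word characters becomes one token; every other
    # non-space character (punctuation, symbols) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    # Naive assumption: every such mark ends a sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "This product is amazing, but the delivery was late."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize("It was late. I was not happy!"))
```

The word-level result for `sample` is the same token list as the answer above. Real tokenizers add many language-specific rules on top of this idea, such as handling contractions and abbreviations.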

<b>
<a id='Tokenization Using NLTK EN'></a>
<font size="5">Tokenization Using NLTK EN</font>
</b>

In [69]:
from nltk.tokenize import word_tokenize
#word_tokenize is a function
EXAMPLE_TEXT = """
Hello Mr. Smith, how are you doing today? The weather is great,
and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard.
"""
print(word_tokenize(EXAMPLE_TEXT))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


#### Difference between `split` and `word_tokenize`

In [70]:
text = "my name's hossam and i work as a teacher assistant at bfcai and i have 600$"
print(text.split())
print('------------------------')
print(word_tokenize(text))
print('====================================================')

['my', "name's", 'hossam', 'and', 'i', 'work', 'as', 'a', 'teacher', 'assistant', 'at', 'bfcai', 'and', 'i', 'have', '600$']
------------------------
['my', 'name', "'s", 'hossam', 'and', 'i', 'work', 'as', 'a', 'teacher', 'assistant', 'at', 'bfcai', 'and', 'i', 'have', '600', '$']


In [71]:
from nltk.tokenize import sent_tokenize
s = 'Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'
print(sent_tokenize(s))
#whitespaceTokenizer?

['Good muffins cost $3.88\nin New York.', 'Please buy me two of them.', 'Thanks.']


<b>
<a id='Tokenization Using NLTK AR'></a>
<font size="5">Tokenization Using NLTK AR</font>
</b>

In [72]:
EXAMPLE_TEXT='هل تعلم؟ #نيوكاسل يتفوق بالمواجهات المباشرة على #ارسنال في تاريخ الدوري الممتاز الانجليزي؟ بعد قليل أرقام واحصائيات للفريقين'
print(word_tokenize(EXAMPLE_TEXT))

['هل', 'تعلم؟', '#', 'نيوكاسل', 'يتفوق', 'بالمواجهات', 'المباشرة', 'على', '#', 'ارسنال', 'في', 'تاريخ', 'الدوري', 'الممتاز', 'الانجليزي؟', 'بعد', 'قليل', 'أرقام', 'واحصائيات', 'للفريقين']


In [74]:
print(EXAMPLE_TEXT.split())
print('------------------------')
print(word_tokenize(EXAMPLE_TEXT))
print('====================================================')

['هل', 'تعلم؟', '#نيوكاسل', 'يتفوق', 'بالمواجهات', 'المباشرة', 'على', '#ارسنال', 'في', 'تاريخ', 'الدوري', 'الممتاز', 'الانجليزي؟', 'بعد', 'قليل', 'أرقام', 'واحصائيات', 'للفريقين']
------------------------
['هل', 'تعلم؟', '#', 'نيوكاسل', 'يتفوق', 'بالمواجهات', 'المباشرة', 'على', '#', 'ارسنال', 'في', 'تاريخ', 'الدوري', 'الممتاز', 'الانجليزي؟', 'بعد', 'قليل', 'أرقام', 'واحصائيات', 'للفريقين']


<b>
<a id='POS'></a>
<font size="5">POS</font>
</b>


- POS Tagging **(Part-of-Speech Tagging)**
    - is the process of labeling each word in a text with its part of speech, based on both its definition and its context.
    - The tagger reads the text and assigns a tag (noun, verb, adjective, ...) to each token.
    - It is also called grammatical tagging.

In [75]:
text = ''' POS Tagging (Parts of Speech Tagging) is a process to mark up the words in text format for a particular part of a speech based on its definition and context.
It is responsible for text reading in a language and assigning some specific token (Parts of Speech) to each word. It is also called grammatical tagging.
'''

In [76]:
tokens = nltk.word_tokenize(text)

In [77]:
tagged = nltk.pos_tag(tokens)
print(tagged)

[('POS', 'NNP'), ('Tagging', 'NNP'), ('(', '('), ('Parts', 'NNP'), ('of', 'IN'), ('Speech', 'NNP'), ('Tagging', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('process', 'NN'), ('to', 'TO'), ('mark', 'VB'), ('up', 'RP'), ('the', 'DT'), ('words', 'NNS'), ('in', 'IN'), ('text', 'JJ'), ('format', 'NN'), ('for', 'IN'), ('a', 'DT'), ('particular', 'JJ'), ('part', 'NN'), ('of', 'IN'), ('a', 'DT'), ('speech', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('its', 'PRP$'), ('definition', 'NN'), ('and', 'CC'), ('context', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('responsible', 'JJ'), ('for', 'IN'), ('text', 'JJ'), ('reading', 'NN'), ('in', 'IN'), ('a', 'DT'), ('language', 'NN'), ('and', 'CC'), ('assigning', 'VBG'), ('some', 'DT'), ('specific', 'JJ'), ('token', 'NN'), ('(', '('), ('Parts', 'NNP'), ('of', 'IN'), ('Speech', 'NNP'), (')', ')'), ('to', 'TO'), ('each', 'DT'), ('word', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('also', 'RB'), ('called', 'VBN'), ('grammatical', 'JJ'), ('taggi


# POS tag abbreviations and their meanings
The list below covers the most common tags returned by the NLTK POS tagger (the Penn Treebank tag set). The tagger assigns one of these tags to each word of the sentence.
- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential there
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective (large)
- JJR: adjective, comparative (larger)
- JJS: adjective, superlative (largest)
- LS: list item marker
- MD: modal (could, will)
- NN: noun, singular (cat, tree)
- NNS: noun, plural (desks)
- NNP: proper noun, singular (Sarah)
- NNPS: proper noun, plural (Indians or Americans)
- PDT: predeterminer (all, both, half)
- POS: possessive ending (parent's)
- PRP: personal pronoun (hers, herself, him, himself)
- PRP$: possessive pronoun (her, his, mine, my, our)
- RB: adverb (occasionally, swiftly)
- RBR: adverb, comparative (greater)
- RBS: adverb, superlative (biggest)
- RP: particle (about)
- TO: infinitive marker (to)
- UH: interjection (goodbye)
- VB: verb, base form (ask)
- VBG: verb, gerund/present participle (judging)
- VBD: verb, past tense (pleaded)
- VBN: verb, past participle (reunified)
- VBP: verb, present tense, not 3rd person singular (wrap)
- VBZ: verb, present tense, 3rd person singular (bases)
- WDT: wh-determiner (that, what)
- WP: wh-pronoun (who)
- WRB: wh-adverb (how)
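Because related tags share a prefix (`NN`, `NNS`, `NNP`, `NNPS` are all nouns), tagged tokens can be grouped into coarse classes by looking at the first two characters of the tag. The sketch below assumes input in the `(word, tag)` format returned by `nltk.pos_tag`; the `COARSE` mapping and the `group_by_coarse_tag` helper are hypothetical names used only for illustration, and the sample pairs are hand-copied so the block runs without NLTK.

```python
from collections import defaultdict

# Hand-copied (word, tag) pairs in the format returned by nltk.pos_tag.
tagged_sample = [
    ("WASHINGTON", "NNP"), ("--", ":"), ("In", "IN"), ("the", "DT"),
    ("wake", "NN"), ("of", "IN"), ("a", "DT"), ("string", "NN"),
    ("of", "IN"), ("abuses", "NNS"),
]

# Hypothetical mapping from a two-character tag prefix to a coarse class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def group_by_coarse_tag(tagged):
    # Collect words under their coarse class; unknown prefixes go to "other".
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[COARSE.get(tag[:2], "other")].append(word)
    return dict(groups)

groups = group_by_coarse_tag(tagged_sample)
print(groups["noun"])
```

A grouping like this is a common first step when only certain word classes (for example nouns) are needed downstream.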

<b>
<a id='NER'></a>
<font size="5">NER</font>
</b>


- Named Entity Recognition (NER) identifies **names of organisations, people, and geographic locations** in a text, as well as currency, time, and percentage expressions.
- Today, NER is widely used across many fields and sectors to automate information extraction.

In [78]:
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

In [79]:
tagged[0] # Return : ("WORD" , "PART OF SPEECH")

('WASHINGTON', 'NNP')

In [80]:
tagged[0][0]   , tagged[0][1]

('WASHINGTON', 'NNP')

In [81]:
tagged[0:10]

[('WASHINGTON', 'NNP'),
 ('--', ':'),
 ('In', 'IN'),
 ('the', 'DT'),
 ('wake', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('string', 'NN'),
 ('of', 'IN'),
 ('abuses', 'NNS')]

In [82]:
for chunk in nltk.ne_chunk(tagged):
    # print(chunk)
    if hasattr(chunk, 'label'):  # checks whether the chunk is a named entity (i.e. has a label)
        print(chunk.label(), ' '.join(c[0] for c in chunk))  # label is e.g. PERSON, or GPE for a location

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn


<b>
<a id='Text Normalizaion ( Cleaning )'></a>
<font size="5">Text Normalization ( Cleaning )</font>
</b>

# How to remove all this from the text?
- **Stop Words**: very common words (such as “the”, “a”, “an”, “in”) that carry little meaning on their own and that search engines are typically programmed to ignore.
- **Punctuation**: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', ...]
- **Stemming**: reduces a word by a set of pre-defined rules, such as removing `ing` from verbs.
    - Stemming maps the morphological variants of a word back to a common root/base form (the stem).
    - It is a tool that strips a word of all its affixes and returns it to its original root.
    - Example: (plays, played, playing) -->> play
- **Lemmatization**: reduces a word by looking it up in `WordNet`, where it tries to find the dictionary form (lemma) of the word, for example `rocks` -> `rock`.
    - Lemmatization is similar to stemming, but it brings context to the words.
    - So it links words with similar meanings to one word.
    - It is more powerful and effective because it does not just strip affixes from words; it looks up the meaning and origin of the word.
    - Example: ( rocks - rock ) , ( corpora - corpus ) , ( better - good )
#### Lemmatization vs Stemming
 - The key point is that stemming sometimes destroys the word (the stem may not be a real word), whereas lemmatization keeps a valid word and its meaning.
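The contrast can be sketched with two toy functions. This is an illustration only: `toy_stem` is a crude suffix stripper (not the Porter algorithm) and `toy_lemmatize` uses a three-entry hand-written dictionary (not WordNet). The names `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are all hypothetical.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set
LEMMA_TABLE = {"rocks": "rock", "corpora": "corpus", "better": "good"}  # hypothetical mini dictionary

def toy_stem(word):
    # Rule-based: chop the first matching suffix, keeping at least 3 letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself.
    return LEMMA_TABLE.get(word, word)

for w in ["rocks", "corpora", "better", "languages"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function turns `languages` into the non-word `languag` and cannot relate `better` to `good`, while the dictionary lookup only ever returns real words but knows nothing outside its table.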

<b>
<a id='stop words in EN'></a>
<font size="5">stop words in EN</font>
</b>

In [41]:
# NLTK (Natural Language Toolkit) ships stop word lists for a number of languages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [42]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

## Task 1 :  Removing stop words with NLTK ?

In [15]:
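example_sent = """This is a sample sentence, showing off the stop words filtration."""
stop_words = stopwords.words('english')
word_tokens = word_tokenize(example_sent)
filtered_sentence = []
for w in word_tokens:
    # compare in lowercase, because the stop word list is lowercase
    if w.lower() not in stop_words:
        filtered_sentence.append(w.lower())

print(word_tokens)
print(filtered_sentence)
print(' '.join(filtered_sentence))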

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
sample sentence , showing stop words filtration .
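The same filter can be written more compactly with a set and a list comprehension. Membership tests on a set are much faster than on a list, which matters for long texts. In this sketch `stop_set` is a small hand-written stand-in so the block runs on its own; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Small hand-written stop list for illustration only; in the notebook
# this would be set(stopwords.words('english')).
stop_set = {"this", "is", "a", "off", "the"}

tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']

# Lowercase each token and keep it only if it is not a stop word.
filtered = [t.lower() for t in tokens if t.lower() not in stop_set]
print(filtered)
```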


<b>
<a id='stop words in AR'></a>
<font size="5">stop words in AR</font>
</b>

In [88]:
stop_word_arabic = set(stopwords.words("arabic"))
print(len(stop_word_arabic))

701


In [90]:
' - '.join(stop_word_arabic)

'حار - عل - لن - إياكما - لكن - ت - خاصة - لي - الآن - سبت - بعدا - حدَث - تموز - قام - اثنان - مرّة - أمامكَ - لنا - جير - عَدَسْ - كأنما - ثلاثون - فيه - لاسيما - صباح - إذما - أيّ - إياه - هللة - هَذا - أول - ة - لها - أُفٍّ - ستمئة - طاء - هاء - فلان - نوفمبر - تانِك - سمعا - بهما - بل - هنالك - استحال - إياكم - بين - عليه - هَيْهات - مكانكنّ - إنَّ - تخذ - لكم - لولا - سبعون - فلس - كسا - نفس - أمّا - هيت - غدا - بطآن - حمدا - نحن - راح - همزة - ص - كلما - ومن - أطعم - كاف - كل - ثلاثمائة - ثالث - فرادى - طاق - أن - مذ - ذين - أنتن - جلل - غالبا - ذو - دونك - تلقاء - ما برح - ذينك - قد - لعل - ش - أضحى - كأي - وَيْ - يورو - هَذِه - أيار - أيلول - ع - تفعلان - أين - حجا - كذلك - بهن - هيّا - خاء - ليسوا - شين - ثمّة - أعطى - ثمانون - تبدّل - ثمانين - علق - طالما - هناك - جعل - نحو - بنا - شَتَّانَ - هاهنا - أربع - قاطبة - أل - بئس - هلّا - ذواتي - أمام - عاد - أخذ - لستما - أسكن - أمس - آب - مادام - ذواتا - حين - رُبَّ - فوق - انقلب - مائة - لبيك - غداة - حمٌ - كلَّا - خميس - تلكما

<b>
<a id='punctuation'></a>
<font size="5">punctuation</font>
</b>

In [17]:
from string import punctuation

punctuation = list(punctuation)
print(punctuation)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


## Task 2 : Removing punctuation with tokenization ? 

In [18]:
example_sent = """My Email address is: taneshbalodi8@gmail.com."""
word_tokens = word_tokenize(example_sent)
filtered_sentence = []
  
for w in word_tokens:
    if w not in punctuation:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['My', 'Email', 'address', 'is', ':', 'taneshbalodi8', '@', 'gmail.com', '.']
['My', 'Email', 'address', 'is', 'taneshbalodi8', 'gmail.com']


## Task 3 : Removing punctuation without tokenization

In [19]:
example_sent = """My Email address is: hossamfares180100@gmail.com."""
filtered_sentence = []
  
for w in example_sent:
    if w not in punctuation:
        filtered_sentence.append(w)

print(example_sent)
print(filtered_sentence)
print(''.join(filtered_sentence))

My Email address is: hossamfares180100@gmail.com.
['M', 'y', ' ', 'E', 'm', 'a', 'i', 'l', ' ', 'a', 'd', 'd', 'r', 'e', 's', 's', ' ', 'i', 's', ' ', 'h', 'o', 's', 's', 'a', 'm', 'f', 'a', 'r', 'e', 's', '1', '8', '0', '1', '0', '0', 'g', 'm', 'a', 'i', 'l', 'c', 'o', 'm']
My Email address is hossamfares180100gmailcom
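A character-by-character loop works, but the standard library can do the same job in one call. The sketch below uses `str.maketrans` with `string.punctuation` to build a deletion table and `str.translate` to apply it; the example sentence is a made-up placeholder.

```python
import string

example_sent = "My Email address is: user@example.com."

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)
cleaned = example_sent.translate(table)
print(cleaned)
```

Note that `string.punctuation` only covers ASCII punctuation, so marks such as the Arabic question mark `؟` would need to be added to the deletion string explicitly.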


<b>
<a id='stemming using nltk'></a>
<font size="5">stemming using nltk</font>
</b>

In [21]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
  
ps = PorterStemmer()
  
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
  
for w in words:
    print(w, " : ", ps.stem(w))

program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm


In [22]:
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)


for w in words:
    print(w, " : ", ps.stem(w))

Programmers  :  programm
program  :  program
with  :  with
programming  :  program
languages  :  languag


###  HINT: the Snowball stemmer (also known as Porter2) is somewhat faster and more consistent than the original Porter stemmer.

In [23]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",snowball.stem(word))

generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat


# The Snowball stemmer supports different languages

In [40]:
from nltk.stem.snowball import SnowballStemmer

print(", ".join(SnowballStemmer.languages))

arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish


<b>
<a id='stemming in AR'></a>
<font size="5">stemming in AR</font>
</b>

In [24]:
# PorterStemmer (ps) is designed for English, so the Arabic words come back unchanged
words = ['الجري','تجري','يجرون','جري','يجري']
for word in words:
    print(word+' --> '+ ps.stem(word))

الجري --> الجري
تجري --> تجري
يجرون --> يجرون
جري --> جري
يجري --> يجري


In [25]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='arabic')

words = ['الجري','تجري','يجرون','جري','يجري']
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

الجري --> الجر
تجري --> تجر
يجرون --> يجرو
جري --> جر
يجري --> يجر


In [26]:
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words : 
    print(st.stem(word))
print (st.stem('اعلاميون'))

جري
تجر
يجر
جري
يجر
علم


<b>
<a id='lemmatization using nltk'></a>
<font size="5">Lemmatization using nltk</font>
</b>

- Examples of lemmatization:
    - rocks : rock
    - corpora : corpus
    - better : good

In [92]:
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()


In [94]:
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("better :", lemmatizer.lemmatize("better", pos ="a"))

rocks : rock
corpora : corpus
better : good


In [96]:
from nltk.stem import WordNetLemmatizer

# Without POS (default is noun)
print(lemmatizer.lemmatize("better"))
# With POS as adjective ('a')
print(lemmatizer.lemmatize("better", pos="a"))  
# With POS as verb ('v')
print(lemmatizer.lemmatize("running", pos="v"))  

better
good
run


In [98]:
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 'driving', 'died', 'tried', 'feet']

for words in list1:
    print(words + " ---> " + lemmatizer.lemmatize(words))

kites ---> kite
babies ---> baby
dogs ---> dog
flying ---> flying
smiling ---> smiling
driving ---> driving
died ---> died
tried ---> tried
feet ---> foot


In [30]:
words = ["is","was","be","been","are","were"]

for word in words : 
    print(lemmatizer.lemmatize(word))

is
wa
be
been
are
were


In [31]:
# what if we tag them as verbs?
words = ["is","was","be","been","are","were"]
for word in words : 
    print(lemmatizer.lemmatize(word,'v'))

be
be
be
be
be
be


<b>
<a id='lemmatization in AR'></a>
<font size="5">lemmatization in AR</font>
</b>

In [32]:
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words : 
    print(lemmatizer.lemmatize(word))

الجري
تجري
يجرون
جري
يجري


### Task 4 : Create a method that takes text as input, then : ( 15 MIN )
-  returns a cleaned version by removing stop words and punctuation, then applying stemming and lemmatization:

In [114]:
def clean_text(text):
    filtered_sentence = []

    # note: split(" ") keeps punctuation attached to words (e.g. "understand,")
    for word in text.split(" "):
        if word.lower() not in stop_words and word not in punctuation:
            filtered_sentence.append(word.lower())
    #print(filtered_sentence)

    ps_text = [ps.stem(word) for word in filtered_sentence]
    lemmatizer_text = [lemmatizer.lemmatize(word) for word in filtered_sentence]
    return ps_text , lemmatizer_text

In [118]:
clean_text("Welcome to the NLP Course!. Natural Language Processing (NLP) is an exciting field of Artificial Intelligence that enables computers to understand, interpret, and generate human language.")

(['welcom',
  'nlp',
  'course!.',
  'natur',
  'languag',
  'process',
  '(nlp)',
  'excit',
  'field',
  'artifici',
  'intellig',
  'enabl',
  'comput',
  'understand,',
  'interpret,',
  'gener',
  'human',
  'language.'],
 ['welcome',
  'nlp',
  'course!.',
  'natural',
  'language',
  'processing',
  '(nlp)',
  'exciting',
  'field',
  'artificial',
  'intelligence',
  'enables',
  'computer',
  'understand,',
  'interpret,',
  'generate',
  'human',
  'language.'])
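In the output above, tokens such as `'course!.'` and `'understand,'` still carry punctuation because `split(" ")` only cuts on spaces. One possible variation, sketched here with the standard library only, extracts word characters with a regex so punctuation never reaches the later steps. The function name `clean_tokens`, its `transform` parameter, and the `demo_stop_words` set are assumptions for illustration; in the notebook you could pass the NLTK stop word list and `ps.stem` or `lemmatizer.lemmatize` as `transform`.

```python
import re

def clean_tokens(text, stop_words, transform=None):
    # \w+ matches runs of letters, digits and underscores,
    # so punctuation is dropped during tokenization.
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    # Optionally apply a stemmer or lemmatizer passed in as a callable.
    if transform is not None:
        kept = [transform(t) for t in kept]
    return kept

demo_stop_words = {"to", "the", "is", "an", "of", "that", "and"}  # hypothetical
demo_text = "Welcome to the NLP Course!. Computers understand, interpret, and generate language."
print(clean_tokens(demo_text, demo_stop_words))
```

One trade-off of `\w+` is that contractions and hyphenated words are split apart (`shouldn't` becomes `shouldn` and `t`), so a proper tokenizer is still preferable when those matter.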

# Extra Topics :

## Some works in Arabic

In [33]:
%pip install qalsadi



In [34]:
import qalsadi.lemmatizer
lemmer = qalsadi.lemmatizer.Lemmatizer()
words = ['الجري','تجري','يجرون','جري','يجري']

for word in words : 
    print(lemmer.lemmatize(word),end=' ')
print()
print (st.stem('يقول'),end=' ')
print (st.stem('يقولون'),end=' ')
print (st.stem('تقول'),end=' ')
print (st.stem('مقوله'))

جرة تجرة جرى جرة جرى 
يقل يقل تقل قول


In [35]:
text = """هل تحتاج إلى ترجمة كي تفهم خطاب الملك؟ اللغة "الكلاسيكية" (الفصحى) موجودة في كل اللغات وكذلك اللغة "الدارجة" .. الفرنسية التي ندرس في المدرسة ليست الفرنسية التي يستخدمها الناس في شوارع باريس .. وملكة بريطانيا لا تخطب بلغة شوارع لندن .. لكل مقام مقال"""
lemmas = lemmer.lemmatize_text(text)
print(lemmas)

['هل', 'احتاج', 'إلى', 'ترجمة', 'كي', 'تف', 'خطاب', 'ملك', '؟', 'لغة', '"', 'كلاسيكي', '"(', 'فصحى', ')', 'موجود', 'في', 'كل', 'لغة', 'كذلك', 'لغة', '"', 'دارج', '"..', 'فرنسة', 'التي', 'درس', 'في', 'مدرس', 'ليست', 'فرنسة', 'التي', 'استخدم', 'ناس', 'في', 'شوارع', 'باريس', '..', 'ملك', 'بريطاني', 'لا', 'خطب', 'بلغة', 'شوارع', 'أدان', '..', 'كل', 'مقام', 'مقال']


In [36]:
lemmas = lemmer.lemmatize_text(text, return_pos=True)
print(lemmas)

[('هل', 'stopword'), ('احتاج', 'verb'), ('إلى', 'stopword'), ('ترجمة', 'noun'), ('كي', 'stopword'), ('تف', 'noun'), ('خطاب', 'noun'), ('ملك', 'noun'), ('؟', 'pounct'), ('لغة', 'noun'), ('"', 'pounct'), ('كلاسيكي', 'noun'), ('"(', 'pounct'), ('فصحى', 'noun'), (')', 'pounct'), ('موجود', 'noun'), ('في', 'stopword'), ('كل', 'stopword'), ('لغة', 'noun'), ('كذلك', 'stopword'), ('لغة', 'noun'), ('"', 'pounct'), ('دارج', 'noun'), ('"..', 'pounct'), ('فرنسة', 'noun'), ('التي', 'stopword'), ('درس', 'verb'), ('في', 'stopword'), ('مدرس', 'noun'), ('ليست', 'stopword'), ('فرنسة', 'noun'), ('التي', 'stopword'), ('استخدم', 'verb'), ('ناس', 'noun'), ('في', 'stopword'), ('شوارع', 'noun'), ('باريس', 'all'), ('..', 'pounct'), ('ملك', 'noun'), ('بريطاني', 'noun'), ('لا', 'stopword'), ('خطب', 'verb'), ('بلغة', 'noun'), ('شوارع', 'noun'), ('أدان', 'verb'), ('..', 'pounct'), ('كل', 'stopword'), ('مقام', 'noun'), ('مقال', 'noun')]


In [37]:
# return vocalized lemmas (lemmas with diacritics)
lemmer.set_vocalized_lemma()
lemmas = lemmer.lemmatize_text(text)
print(lemmas)

['هَلْ', 'اِحْتَاجَ', 'إِلَى', 'تَرْجَمَةٌ', 'كَيْ', 'تَفَهُّمٌ', 'خَطَّابٌ', 'مَلَكٌ', '؟', 'لُغَةٌ', '"', 'كِلاَسِيكِيٌّ', '"(', 'فُصْحَى', ')', 'مَوْجُودٌ', 'فِي', 'كُلَّ', 'لُغَةٌ', 'كَذَلِكَ', 'لُغَةٌ', '"', 'دَارِجٌ', '"..', 'فَرَنْسِيّ', 'الَّتِي', 'دَرَسَ', 'فِي', 'مَدْرَسَةٌ', 'لَيْسَتْ', 'فَرَنْسِيّ', 'الَّتِي', 'اِسْتَخْدَمَ', 'نَاسٌ', 'فِي', 'شَوَارِعٌ', 'باريس', '..', 'مَلَكٌ', 'برِيطانِيا', 'لَا', 'خَطَبَ', 'بَلَغَةٌ', 'شَوَارِعٌ', 'أَدَانَ', '..', 'كُلَّ', 'مَقَامٌ', 'مَقَالٌ']



<b>
<a id='special arabic cleaning functions'></a>
<font size="5">special arabic cleaning functions</font>
</b>

In [38]:
# special cleaning functions for Arabic text
import re
arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)


def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text


def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text


def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)

print(' '.join(lemmas))
print('-----------------------------------------------------')
print(remove_diacritics(' '.join(lemmas)))
print('-----------------------------------------------------')
print(normalize_arabic(remove_diacritics(' '.join(lemmas))))

هَلْ اِحْتَاجَ إِلَى تَرْجَمَةٌ كَيْ تَفَهُّمٌ خَطَّابٌ مَلَكٌ ؟ لُغَةٌ " كِلاَسِيكِيٌّ "( فُصْحَى ) مَوْجُودٌ فِي كُلَّ لُغَةٌ كَذَلِكَ لُغَةٌ " دَارِجٌ ".. فَرَنْسِيّ الَّتِي دَرَسَ فِي مَدْرَسَةٌ لَيْسَتْ فَرَنْسِيّ الَّتِي اِسْتَخْدَمَ نَاسٌ فِي شَوَارِعٌ باريس .. مَلَكٌ برِيطانِيا لَا خَطَبَ بَلَغَةٌ شَوَارِعٌ أَدَانَ .. كُلَّ مَقَامٌ مَقَالٌ
-----------------------------------------------------
هل احتاج إلى ترجمة كي تفهم خطاب ملك ؟ لغة " كلاسيكي "( فصحى ) موجود في كل لغة كذلك لغة " دارج ".. فرنسي التي درس في مدرسة ليست فرنسي التي استخدم ناس في شوارع باريس .. ملك بريطانيا لا خطب بلغة شوارع أدان .. كل مقام مقال
-----------------------------------------------------
هل احتاج الي ترجمه كي تفهم خطاب ملك ؟ لغه " كلاسيكي "( فصحي ) موجود في كل لغه كذلك لغه " دارج ".. فرنسي التي درس في مدرسه ليست فرنسي التي استخدم ناس في شوارع باريس .. ملك بريطانيا لا خطب بلغه شوارع ادان .. كل مقام مقال
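The cell above defines `remove_repeating_char` but never calls it. The sketch below shows what that pattern does, together with an assumed variant that only collapses runs of three or more characters so that genuine double letters survive. The names `collapse_all_repeats` and `collapse_long_repeats` are made up for illustration.

```python
import re

# Same pattern as remove_repeating_char: any run of a repeated
# character collapses to a single occurrence.
def collapse_all_repeats(text):
    return re.sub(r'(.)\1+', r'\1', text)

# Assumed variant: only collapse runs of three or more characters,
# so legitimate double letters are left alone.
def collapse_long_repeats(text):
    return re.sub(r'(.)\1{2,}', r'\1', text)

print(collapse_all_repeats("hello"), collapse_long_repeats("hello"))
print(collapse_all_repeats("cooool"), collapse_long_repeats("cooool"))
print(collapse_long_repeats("جمييييل"))
```

The first pattern is aggressive: it also shortens words that legitimately contain a doubled letter. Which behaviour is appropriate depends on the data, for example how common elongated spellings are in social media text.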
