# TP1 spaCy

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pretrained statistical models and word vectors, and currently supports tokenization for 60+ languages.

# install spaCy

In [2]:
!pip install spacy



In [3]:
import spacy
# Import the Arabic language class explicitly instead of using a wildcard import
from spacy.lang.ar import Arabic

In [4]:
!python -m spacy download en_core_web_sm 

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# tokenize text by word


In [7]:
english_text= """Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely 
the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from 
the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowed rational 
numbers, irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new 
development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the 
subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a 
way which had not happened before.
"""
 
arabic_text="""ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي و هي بدايات الجبر، و من المهم فهم كيف كانت هذه الفكرة الجديدة مهمة، فقد كانت خطوة ثورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة، الجبر كان نظرية موحدة تتيح الأعداد الكسرية و الأعداد اللا كسرية، و قدم وسيلة للتنمية في هذا الموضوع مستقبلا. و جانب آخر مهم لإدخال أفكار الجبر و هو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل"""



# tokenize english text by word

In [8]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

In [9]:
# Run the pipeline; the Doc object holds the tokens and their annotations
doc = nlp(english_text)

In [10]:
# Print the text of each token, one per line
for token in doc:
    print(token.text)

Perhaps
one
of
the
most
significant
advances
made
by
Arabic
mathematics
began
at
this
time
with
the
work
of
al
-
Khwarizmi
,
namely


the
beginnings
of
algebra
.
It
is
important
to
understand
just
how
significant
this
new
idea
was
.
It
was
a
revolutionary
move
away
from


the
Greek
concept
of
mathematics
which
was
essentially
geometry
.
Algebra
was
a
unifying
theory
which
allowed
rational


numbers
,
irrational
numbers
,
geometrical
magnitudes
,
etc
.
,
to
all
be
treated
as
"
algebraic
objects
"
.
It
gave
mathematics
a
whole
new


development
path
so
much
broader
in
concept
to
that
which
had
existed
before
,
and
provided
a
vehicle
for
future
development
of
the


subject
.
Another
important
aspect
of
the
introduction
of
algebraic
ideas
was
that
it
allowed
mathematics
to
be
applied
to
itself
in
a


way
which
had
not
happened
before
.
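The blank lines in the output above are not printing glitches: each line break in english_text becomes its own whitespace token. A minimal sketch of filtering them out follows, assuming a blank English pipeline (`spacy.blank("en")`, tokenizer only) and a short snippet of the text; the names `nlp_blank`, `snippet`, and `words` are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model to download.
nlp_blank = spacy.blank("en")

# Snippet of english_text, including the line break after "namely ".
snippet = "the work of al-Khwarizmi, namely \nthe beginnings of algebra."
doc = nlp_blank(snippet)

# token.is_space is True for tokens made only of whitespace, such as "\n".
words = [token.text for token in doc if not token.is_space]
print(words)
```

The same `is_space` check can be added to the loop above to suppress the empty lines.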




# tokenize arabic text by word

In [11]:
# Use a blank Arabic pipeline (tokenizer only) rather than the English model
nlp = spacy.blank("ar")
doc = nlp(arabic_text)
words = [token.text for token in doc]
print(words)

['ربما', 'كانت', 'أحد', 'أهم', 'التطورات', 'التي', 'قامت', 'بها', 'الرياضيات', 'العربية', 'التي', 'بدأت', 'في', 'هذا', 'الوقت', 'بعمل', 'الخوارزمي', 'و', 'هي', 'بدايات', 'الجبر', '،', 'و', 'من', 'المهم', 'فهم', 'كيف', 'كانت', 'هذه', 'الفكرة', 'الجديدة', 'مهمة', '،', 'فقد', 'كانت', 'خطوة', 'ثورية', 'بعيدا', 'عن', 'المفهوم', 'اليوناني', 'للرياضيات', 'التي', 'هي', 'في', 'جوهرها', 'هندسة', '،', 'الجبر', 'كان', 'نظرية', 'موحدة', 'تتيح', 'الأعداد', 'الكسرية', 'و', 'الأعداد', 'اللا', 'كسرية', '،', 'و', 'قدم', 'وسيلة', 'للتنمية', 'في', 'هذا', 'الموضوع', 'مستقبلا', '.', 'و', 'جانب', 'آخر', 'مهم', 'لإدخال', 'أفكار', 'الجبر', 'و', 'هو', 'أنه', 'سمح', 'بتطبيق', 'الرياضيات', 'على', 'نفسها', 'بطريقة', 'لم', 'تحدث', 'من', 'قبل']


# Sentence Tokenizing

# tokenize english text by sentence

In [12]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(english_text)

for sent in doc.sents:
  print(sent.text)
  print("---")

Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely 
the beginnings of algebra.
---
It is important to understand just how significant this new idea was.
---
It was a revolutionary move away from 
the Greek concept of mathematics which was essentially geometry.
---
Algebra was a unifying theory which allowed rational 
numbers, irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects".
---
It gave mathematics a whole new 
development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the 
subject.
---
Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a 
way which had not happened before.

---


# tokenize arabic text by sentence

In [13]:
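# Caveat: en_core_web_sm is trained on English, so the sentence boundaries
# it predicts for Arabic text are unreliable (note the mid-clause splits in the output)
nlp = spacy.load("en_core_web_sm")
doc = nlp(arabic_text)

for sent in doc.sents:
    print(sent.text)
    print("--")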

ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي و هي بدايات الجبر،
--
و من المهم فهم كيف كانت هذه الفكرة الجديدة مهمة، فقد كانت خطوة ثورية بعيدا عن المفهوم اليوناني للرياضيات التي
--
هي في جوهرها هندسة، الجبر
--
كان نظرية موحدة تتيح الأعداد الكسرية و الأعداد اللا كسرية، و قدم وسيلة للتنمية في هذا الموضوع مستقبلا.
--
و جانب آخر مهم لإدخال أفكار الجبر و هو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل
--
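The splits above fall in the middle of clauses because an English parser is deciding where Arabic sentences end. One possible alternative, sketched here under the assumption that punctuation-based splitting is acceptable, is a blank Arabic pipeline with spaCy's rule-based "sentencizer" component. The sample is a fragment of arabic_text; the names `nlp_ar`, `sample`, and `sentences` are illustrative.

```python
import spacy

# Blank Arabic pipeline: tokenizer only, no trained components.
nlp_ar = spacy.blank("ar")

# The sentencizer marks sentence ends at punctuation such as "."
# instead of relying on a dependency parser.
nlp_ar.add_pipe("sentencizer")

# Fragment of arabic_text containing its only full stop.
sample = "و قدم وسيلة للتنمية في هذا الموضوع مستقبلا. و جانب آخر مهم لإدخال أفكار الجبر"
doc = nlp_ar(sample)

sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
    print("--")
```

Because the Arabic passage uses commas where the English one uses full stops, this approach yields far fewer sentences than the English version; that reflects the punctuation of the text rather than a fault in the rule.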


# Lemmatization

Lemmatization groups the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma (its dictionary form). For example, "advances" maps to "advance" and "began" maps to "begin".
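To make the grouping idea concrete, here is a small sketch that collects surface forms under their lemma. The (form, lemma) pairs are a hand-written toy table for illustration, not the output of spaCy's lemmatizer, and the names `pairs` and `forms_by_lemma` are assumptions of this example.

```python
from collections import defaultdict

# Hand-written (surface form, lemma) pairs, for illustration only.
pairs = [
    ("advances", "advance"),
    ("made", "make"),
    ("makes", "make"),
    ("began", "begin"),
    ("is", "be"),
    ("was", "be"),
    ("were", "be"),
]

# Collect every surface form under its lemma.
forms_by_lemma = defaultdict(set)
for form, lemma in pairs:
    forms_by_lemma[lemma].add(form)

for lemma, forms in sorted(forms_by_lemma.items()):
    print(lemma, sorted(forms))
```

With spaCy, the same grouping can be built by replacing the toy table with `(token.text, token.lemma_)` pairs taken from a processed document.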

# Lemmatization english text

In [16]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Run the pipeline; the Doc object holds the tokens and their annotations
doc = nlp(english_text)

# For each token: text, coarse POS tag, lemma, and a constant "n" placeholder
tokens = [[token.text, token.pos_, token.lemma_, "n"] for token in doc]
print(tokens)

 

[['Perhaps', 'ADV', 'perhaps', 'n'], ['one', 'NUM', 'one', 'n'], ['of', 'ADP', 'of', 'n'], ['the', 'DET', 'the', 'n'], ['most', 'ADV', 'most', 'n'], ['significant', 'ADJ', 'significant', 'n'], ['advances', 'NOUN', 'advance', 'n'], ['made', 'VERB', 'make', 'n'], ['by', 'ADP', 'by', 'n'], ['Arabic', 'ADJ', 'arabic', 'n'], ['mathematics', 'NOUN', 'mathematic', 'n'], ['began', 'VERB', 'begin', 'n'], ['at', 'ADP', 'at', 'n'], ['this', 'DET', 'this', 'n'], ['time', 'NOUN', 'time', 'n'], ['with', 'ADP', 'with', 'n'], ['the', 'DET', 'the', 'n'], ['work', 'NOUN', 'work', 'n'], ['of', 'ADP', 'of', 'n'], ['al', 'PROPN', 'al', 'n'], ['-', 'PUNCT', '-', 'n'], ['Khwarizmi', 'PROPN', 'Khwarizmi', 'n'], [',', 'PUNCT', ',', 'n'], ['namely', 'ADV', 'namely', 'n'], ['\n', 'SPACE', '\n', 'n'], ['the', 'DET', 'the', 'n'], ['beginnings', 'NOUN', 'beginning', 'n'], ['of', 'ADP', 'of', 'n'], ['algebra', 'PROPN', 'algebra', 'n'], ['.', 'PUNCT', '.', 'n'], ['It', 'PRON', 'it', 'n'], ['is', 'AUX', 'be', 'n'], ['

# Lemmatization arabic text

In [14]:
# Caveat: this loads the English pipeline, which is trained on English only.
# The POS tags it assigns to Arabic tokens are not meaningful, and each lemma
# in the output is simply the token text.
nlp = spacy.load("en_core_web_sm")

# Run the pipeline on the Arabic text
doc = nlp(arabic_text)

# For each token: text, coarse POS tag, lemma
tokens = [[token.text, token.pos_, token.lemma_] for token in doc]
print(tokens)


[['ربما', 'PROPN', 'ربما'], ['كانت', 'PROPN', 'كانت'], ['أحد', 'PROPN', 'أحد'], ['أهم', 'PROPN', 'أهم'], ['التطورات', 'VERB', 'التطورات'], ['التي', 'PROPN', 'التي'], ['قامت', 'VERB', 'قامت'], ['بها', 'NOUN', 'بها'], ['الرياضيات', 'ADJ', 'الرياضيات'], ['العربية', 'NOUN', 'العربية'], ['التي', 'NOUN', 'التي'], ['بدأت', 'VERB', 'بدأت'], ['في', 'ADP', 'في'], ['هذا', 'ADJ', 'هذا'], ['الوقت', 'NOUN', 'الوقت'], ['بعمل', 'PROPN', 'بعمل'], ['الخوارزمي', 'VERB', 'الخوارزمي'], ['و', 'PROPN', 'و'], ['هي', 'PROPN', 'هي'], ['بدايات', 'PROPN', 'بدايات'], ['الجبر', 'PROPN', 'الجبر'], ['،', 'PROPN', '،'], ['و', 'PROPN', 'و'], ['من', 'PROPN', 'من'], ['المهم', 'NOUN', 'المهم'], ['فهم', 'PROPN', 'فهم'], ['كيف', 'PROPN', 'كيف'], ['كانت', 'PROPN', 'كانت'], ['هذه', 'PROPN', 'هذه'], ['الفكرة', 'PROPN', 'الفكرة'], ['الجديدة', 'PROPN', 'الجديدة'], ['مهمة', 'PROPN', 'مهمة'], ['،', 'PROPN', '،'], ['فقد', 'PROPN', 'فقد'], ['كانت', 'PROPN', 'كانت'], ['خطوة', 'PROPN', 'خطوة'], ['ثورية', 'PROPN', 'ثورية'], ['بعيدا', '

# Removing Stop Words

Textual data contains many words that carry little meaning on their own and add noise to an analysis. These include pronouns (I, you, etc.) and function words such as "are", "is", and "was".

Such words are called stop words. spaCy ships a built-in STOP_WORDS set for each supported language, which we can use to filter our text.

In [45]:
from spacy.lang.en.stop_words import STOP_WORDS
stop = STOP_WORDS
print(stop)

{'what', 'me', 'mine', 'namely', 'their', 'although', 'full', '‘re', 'third', 'wherever', 'empty', 'above', 'put', 'others', 'wherein', 'is', 'when', 'beforehand', 'n‘t', 'against', 'except', 'first', 'nobody', 'around', 'has', 'are', 'afterwards', 'which', 'if', 'had', 'indeed', 'becoming', 'none', 'nowhere', 'alone', 'anything', 'formerly', 'take', 'very', 'those', 'a', 'rather', "'re", 'further', 'our', "'ll", 'nine', 'fifteen', 'than', 'bottom', 'give', 'along', 'how', 'though', 'serious', 'towards', 'of', '’m', 'either', 'noone', 'using', 'am', 'quite', 'among', 'no', 'various', 're', 'or', 'these', 'least', 'onto', 'while', 'two', 'whatever', 'any', "'s", 'seeming', 'ca', 'down', 'nothing', 'regarding', 'often', 'eight', 'anyhow', 'and', 'doing', 'about', 'nor', 'whole', "'ve", 'front', 'did', 'n’t', 'hence', 'itself', 'get', 'why', 'toward', 'once', 'yourselves', 'here', 'behind', 'almost', 'could', 'without', 'may', 'who', 'more', 'some', 'up', 'never', 'only', 'whereas', 'for'
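Since STOP_WORDS is an ordinary Python set, it can also be used directly on a plain list of words, without running a pipeline. The sketch below assumes simple case-insensitive matching is sufficient; the word list is taken from english_text, and the names `words` and `kept` are illustrative.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# A few words from english_text, as plain strings.
words = ["It", "was", "a", "revolutionary", "move", "away", "from",
         "the", "Greek", "concept", "of", "mathematics"]

# STOP_WORDS holds lowercase entries, so lowercase each word before the lookup.
kept = [w for w in words if w.lower() not in STOP_WORDS]
print(kept)
```

When a Doc is available, `token.is_stop` performs the equivalent check per token.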

# Filter_english_text

In [46]:
doce = nlp(english_text)

# All tokens, including stop words
tokens = [token.text for token in doce]

print(tokens)

# Only the tokens that spaCy does not flag as stop words
filtered = [token.text for token in doce if not token.is_stop]

print(filtered)

['Perhaps', 'one', 'of', 'the', 'most', 'significant', 'advances', 'made', 'by', 'Arabic', 'mathematics', 'began', 'at', 'this', 'time', 'with', 'the', 'work', 'of', 'al', '-', 'Khwarizmi', ',', 'namely', '\n', 'the', 'beginnings', 'of', 'algebra', '.', 'It', 'is', 'important', 'to', 'understand', 'just', 'how', 'significant', 'this', 'new', 'idea', 'was', '.', 'It', 'was', 'a', 'revolutionary', 'move', 'away', 'from', '\n', 'the', 'Greek', 'concept', 'of', 'mathematics', 'which', 'was', 'essentially', 'geometry', '.', 'Algebra', 'was', 'a', 'unifying', 'theory', 'which', 'allowed', 'rational', '\n', 'numbers', ',', 'irrational', 'numbers', ',', 'geometrical', 'magnitudes', ',', 'etc', '.', ',', 'to', 'all', 'be', 'treated', 'as', '"', 'algebraic', 'objects', '"', '.', 'It', 'gave', 'mathematics', 'a', 'whole', 'new', '\n', 'development', 'path', 'so', 'much', 'broader', 'in', 'concept', 'to', 'that', 'which', 'had', 'existed', 'before', ',', 'and', 'provided', 'a', 'vehicle', 'for', '

Comparing the two lists shows the effect: the filtered list omits stop words such as "of", "the", and "was".

# Part-of-Speech Tagging (POS)

A word's part of speech describes the grammatical role that word plays in a sentence. For example, in the text "Robin is an astute programmer", "Robin" is a proper noun and "astute" is an adjective.

In [40]:
# Create the Doc object
doc = nlp(english_text)

# Pair each token's text with its coarse-grained POS tag
pos = [[token.text, token.pos_] for token in doc]
print(pos)

[['Perhaps', 'ADV'], ['one', 'NUM'], ['of', 'ADP'], ['the', 'DET'], ['most', 'ADV'], ['significant', 'ADJ'], ['advances', 'NOUN'], ['made', 'VERB'], ['by', 'ADP'], ['Arabic', 'ADJ'], ['mathematics', 'NOUN'], ['began', 'VERB'], ['at', 'ADP'], ['this', 'DET'], ['time', 'NOUN'], ['with', 'ADP'], ['the', 'DET'], ['work', 'NOUN'], ['of', 'ADP'], ['al', 'PROPN'], ['-', 'PUNCT'], ['Khwarizmi', 'PROPN'], [',', 'PUNCT'], ['namely', 'ADV'], ['\n', 'SPACE'], ['the', 'DET'], ['beginnings', 'NOUN'], ['of', 'ADP'], ['algebra', 'PROPN'], ['.', 'PUNCT'], ['It', 'PRON'], ['is', 'AUX'], ['important', 'ADJ'], ['to', 'PART'], ['understand', 'VERB'], ['just', 'ADV'], ['how', 'ADV'], ['significant', 'ADJ'], ['this', 'DET'], ['new', 'ADJ'], ['idea', 'NOUN'], ['was', 'AUX'], ['.', 'PUNCT'], ['It', 'PRON'], ['was', 'AUX'], ['a', 'DET'], ['revolutionary', 'ADJ'], ['move', 'NOUN'], ['away', 'ADV'], ['from', 'ADP'], ['\n', 'SPACE'], ['the', 'DET'], ['Greek', 'ADJ'], ['concept', 'NOUN'], ['of', 'ADP'], ['mathemati
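Once the (text, tag) pairs exist, ordinary Python is enough to summarize them. The sketch below copies the first twelve pairs from the output above into a literal list, so it runs without a model, and counts how often each tag occurs; the names `pos_pairs`, `tag_counts`, and `nouns` are illustrative.

```python
from collections import Counter

# First twelve [text, tag] pairs, copied from the POS output above.
pos_pairs = [
    ["Perhaps", "ADV"], ["one", "NUM"], ["of", "ADP"], ["the", "DET"],
    ["most", "ADV"], ["significant", "ADJ"], ["advances", "NOUN"],
    ["made", "VERB"], ["by", "ADP"], ["Arabic", "ADJ"],
    ["mathematics", "NOUN"], ["began", "VERB"],
]

# Count how many tokens carry each tag.
tag_counts = Counter(tag for _, tag in pos_pairs)
print(tag_counts.most_common())

# Keep only the words tagged as nouns.
nouns = [text for text, tag in pos_pairs if tag == "NOUN"]
print(nouns)
```

Applied to the full `pos` list, the same two lines give a quick profile of the whole passage.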

# Entity Recognition

Entity recognition identifies named entities in the input text, such as places, people, organizations, and nationalities or languages.
We will use the ".ents" attribute of our doc object.

In [17]:
 
# Create the Doc object
eng_doc = nlp(english_text)


# For each entity: the span, its string label (label_) and the label's integer ID (label)
entities = [(ent, ent.label_, ent.label) for ent in eng_doc.ents]
print(entities)

[(Arabic mathematics, 'NORP', 381), (al-Khwarizmi, 'GPE', 384), (Greek, 'NORP', 381)]
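In the output above, al-Khwarizmi is labeled GPE (countries, cities, states), although the text refers to a person. Statistical models make such mistakes on rare names. One possible remedy, sketched here as an assumption rather than as part of the original exercise, is spaCy's rule-based "entity_ruler" component with a hand-written token pattern. The example uses a blank English pipeline so it runs without a model download; `nlp_rules` and the pattern are illustrative.

```python
import spacy

# Blank English pipeline plus a rule-based entity matcher.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Hand-written pattern: the tokenizer splits "al-Khwarizmi" into
# "al", "-", "Khwarizmi", so the pattern lists three tokens.
ruler.add_patterns([
    {"label": "PERSON",
     "pattern": [{"LOWER": "al"}, {"ORTH": "-"}, {"LOWER": "khwarizmi"}]},
])

doc = nlp_rules("the work of al-Khwarizmi, namely the beginnings of algebra.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

In a trained pipeline, the ruler would typically be added before the "ner" component so that the rule takes precedence over the model's prediction.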


# Chunking

Chunking is the process of extracting noun phrases from the text. spaCy exposes noun phrases (noun chunks) through the ".noun_chunks" attribute of the doc object. A noun chunk is a noun plus the words that describe it. Each chunk also gives access to its root (the head noun), the root's dependency label, and the word the root attaches to.

In [18]:
doc = nlp(english_text)
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

the most significant advances advances pobj of
Arabic mathematics mathematics pobj by
this time time pobj at
the work work pobj with
al-Khwarizmi Khwarizmi pobj of
the beginnings beginnings dobj 

algebra algebra pobj of
It It nsubj is
this new idea idea nsubj was
It It nsubj was
a revolutionary move move attr was
the Greek concept concept attr was
mathematics mathematics pobj of
essentially geometry geometry attr was
Algebra Algebra nsubj was
a unifying theory theory attr was
rational 
numbers numbers dobj allowed
irrational numbers numbers conj numbers
geometrical magnitudes magnitudes conj numbers
"algebraic objects objects pobj as
It It nsubj gave
mathematics mathematics dative gave
a whole new 
development path path dobj gave
concept concept pobj in
a vehicle vehicle dobj provided
future development development pobj for
the 
subject subject pobj of
Another important aspect aspect nsubj was
the introduction introduction pobj of
algebraic ideas ideas pobj of
it it nsubj allowed
math

In [20]:
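import re
import spacy

def filtered_chunks(doc, pattern):
    """Yield noun chunks whose fine-grained tag sequence matches pattern."""
    for chunk in doc.noun_chunks:
        # Build a signature such as "<JJ><NN>" from each token's tag_
        signature = "".join(f"<{w.tag_}>" for w in chunk)
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load("en_core_web_sm")
doc = nlp("Great work!")
# Zero or more adjectives followed by one or more nouns
pattern = re.compile(r"(<JJ>)*(<NN>|<NNS>|<NNP>)+")

print(list(filtered_chunks(doc, pattern)))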


[Great work]
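The filter above works on a string "signature" of tags, so its behavior can be checked without loading a model. The sketch below uses hand-assigned Penn Treebank tags for three phrases (an assumption; a real pipeline may tag them differently) and also shows that `match` only anchors at the start of the signature, while `fullmatch` requires the whole signature to fit.

```python
import re

pattern = re.compile(r"(<JJ>)*(<NN>|<NNS>|<NNP>)+")

def signature(tags):
    """Join tags into the "<TAG><TAG>..." form used by the chunk filter."""
    return "".join(f"<{tag}>" for tag in tags)

# Hand-assigned tags for illustration.
examples = {
    "Great work": ["JJ", "NN"],
    "the Greek concept": ["DT", "JJ", "NN"],
    "algebraic ideas": ["JJ", "NNS"],
}

matches = {
    phrase: pattern.match(signature(tags)) is not None
    for phrase, tags in examples.items()
}
print(matches)

# match() accepts a matching prefix; fullmatch() needs the entire signature.
trailing = signature(["NN", "IN", "NN"])
print(pattern.match(trailing) is not None)
print(pattern.fullmatch(trailing) is not None)
```

Chunks that start with a determiner, such as "the Greek concept", are rejected because the pattern allows only adjectives before the noun. Switching to `fullmatch` would additionally reject chunks with unexpected tags after the nouns.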
