<a href="https://colab.research.google.com/github/Nouran-Khallaf/Arabic-Readability-Corpus/blob/main/NoteBooks/1_Visualisation_SpaCy_Tutorial_Modified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualisation with spaCy Tutorial

This notebook demonstrates how to use spaCy for text visualisation, focusing on named entity recognition and syntactic dependency parsing.


### Author: Dr Mahmoud El-Haj (with help from the Internet), as part of the "Visualise My Corpus Tutorial", an event by Lancaster University's UCREL and DSG Seminars
### GitHub repository: https://github.com/drelhaj/NLP_ML_Visualization_Tutorial

## Step 1: Install and Import Libraries

In this step, we will install the required libraries and import them. We will primarily use spaCy for text processing and visualisation.

In [1]:
#installing spaCy
#https://spacy.io/usage
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.5-py2.py3-none-any.whl (98.5 MB)
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.5
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting de-core-news

## Step 2: Load and Preprocess Data

We will load the text data that will be used for visualisation. For this tutorial, we will use a small sample text. Ensure the text is in a suitable format for processing.

In [2]:
# spaCy Tokenizer construction
# If spacy.load() below raises an OSError (model not found), or the alternative import raises
# ModuleNotFoundError, download the language model first (see the cell above)
from spacy.tokenizer import Tokenizer

import spacy

nlp = spacy.load("en_core_web_sm")  # load the language model, e.g. use de_core_news_sm for German

# Alternatively, import the model package directly instead of calling spacy.load():
    #import en_core_web_sm
    #nlp = en_core_web_sm.load()

# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

In [56]:
sentence = "Today is   March 18th 2021 and   Mahmoud, is showing us   how to visualise text online at Lancaster University."

In [57]:
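tokens = tokenizer(sentence)  # the raw sentence is used here; ' '.join(sentence.split()) would collapse the extra whitespace first
# Notice that this basic Tokenizer splits on whitespace only: punctuation stays attached to the
# preceding word ("Mahmoud,", "University.") and each run of extra whitespace becomes a token!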
print('Number of words: ',len(tokens))
print('\n>>>>>>>Tokens<<<<<<<:')
for t in tokens:
    print(t)

Number of words:  21

>>>>>>>Tokens<<<<<<<:
Today
is
  
March
18th
2021
and
  
Mahmoud,
is
showing
us
  
how
to
visualise
text
online
at
Lancaster
University.


In [63]:
import re
sentence = ' '.join(sentence.split())  # remove extra whitespace
sentence = re.sub(r'[^\w\s]', '', sentence)  # use a regex to remove punctuation

tokens = tokenizer(sentence)  # call the tokenizer again on the cleaned sentence

print('Number of words: ',len(tokens))
print('\n>>>>>>>Tokens<<<<<<<:')
for t in tokens:
    print(t)

Number of words:  18

>>>>>>>Tokens<<<<<<<:
Today
is
March
18th
2021
and
Mahmoud
is
showing
us
how
to
visualise
text
online
at
Lancaster
University
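
The two cleaning steps above can be wrapped in a small helper. The sketch below uses only the standard library; the name `clean_sentence` is an assumption made for illustration and is not part of the notebook. It also shows two side effects worth knowing: `\w` is Unicode-aware in Python 3, so Arabic letters are kept while the Arabic comma (،) is removed, and apostrophes inside words are dropped too.

```python
import re

def clean_sentence(text):
    """Collapse runs of whitespace, then strip every character that is
    neither a word character nor whitespace."""
    collapsed = ' '.join(text.split())
    return re.sub(r'[^\w\s]', '', collapsed)

raw = "Today is   March 18th 2021 and   Mahmoud, is showing us   how to visualise text online at Lancaster University."
cleaned = clean_sentence(raw)
print(cleaned)
print(len(cleaned.split()), 'whitespace-separated words')

# Side effects of the same regex:
print(clean_sentence("don't stop"))      # the apostrophe is removed
print(clean_sentence("مرحبا، بالعالم"))   # the Arabic comma is removed, the letters stay
```

If apostrophes or hyphens matter for your corpus, a narrower character class than `[^\w\s]` may be a better fit.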


In [61]:
def tokenise_sentence(sentence):
    doc = nlp(sentence)
    tokens = [token for token in doc if token.text.strip()] # Keep the tokens as spaCy objects
    return tokens

In [62]:
tokens = tokenise_sentence(sentence)
print('Number of words:', len(tokens))
print('\n>>>>>>> Tokens <<<<<<<:')
for t in tokens:
    print(t)

Number of words: 18

>>>>>>> Tokens <<<<<<<:
Today
is
March
18th
2021
and
Mahmoud
is
showing
us
how
to
visualise
text
online
at
Lancaster
University


In [41]:
from prettytable import PrettyTable

# Create a PrettyTable object
table = PrettyTable()
table.field_names = ["Token Index", "Token"]

# Add tokens to the table (converting tokens to strings)
for i, token in enumerate(tokens):
    table.add_row([i + 1, token.text])  # Access the text of the token

# Display the table
print(table)

+-------------+------------+
| Token Index |   Token    |
+-------------+------------+
|      1      |   Today    |
|      2      |     is     |
|      3      |   March    |
|      4      |    18th    |
|      5      |    2021    |
|      6      |    and     |
|      7      |  Mahmoud   |
|      8      |     is     |
|      9      |  showing   |
|      10     |     us     |
|      11     |    how     |
|      12     |     to     |
|      13     | visualise  |
|      14     |    text    |
|      15     |   online   |
|      16     |     at     |
|      17     | Lancaster  |
|      18     | University |
+-------------+------------+


In [65]:
from tabulate import tabulate
# Prepare data for tabulation
table_data = [[i + 1, token] for i, token in enumerate(tokens)]
table_headers = ["Token Index", "Token"]

# Display the table using tabulate
print(tabulate(table_data, headers=table_headers, tablefmt="pretty"))


+-------------+------------+
| Token Index |   Token    |
+-------------+------------+
|      1      |   Today    |
|      2      |     is     |
|      3      |   March    |
|      4      |    18th    |
|      5      |    2021    |
|      6      |    and     |
|      7      |  Mahmoud   |
|      8      |     is     |
|      9      |  showing   |
|     10      |     us     |
|     11      |    how     |
|     12      |     to     |
|     13      | visualise  |
|     14      |    text    |
|     15      |   online   |
|     16      |     at     |
|     17      | Lancaster  |
|     18      | University |
+-------------+------------+


In [66]:
import plotly.express as px

import pandas as pd
from collections import Counter

# Example function to visualize token frequency using Plotly
def visualize_token_frequency_plotly(tokens):
    # Extract token texts for counting and plotting
    token_texts = [token.text for token in tokens]

    # Count token frequencies
    token_counts = Counter(token_texts)  # Count based on token texts

    # Convert to DataFrame for Plotly
    token_df = pd.DataFrame(token_counts.items(), columns=['Token', 'Frequency']).sort_values(by='Frequency', ascending=False)

    # Plot the token frequencies
    fig = px.bar(token_df, x='Token', y='Frequency', title='Token Frequency Distribution')
    fig.update_layout(xaxis_tickangle=-90)
    fig.show()


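tokens = tokenise_sentence(sentence)  # `sentence` is still the cleaned sentence here
# Restore the raw sentence (extra spaces and punctuation) for the cells that follow
sentence = "Today is   March 18th 2021 and   Mahmoud, is showing us   how to visualise text online at Lancaster University."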
print('Number of words:', len(tokens))
visualize_token_frequency_plotly(tokens)


Number of words: 18
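
If a plot is not needed, the same frequency information can be read straight from a `Counter`. The sketch below works on plain strings obtained with `str.split()` from the cleaned sentence, which is an assumption that stands in for the spaCy tokens used above; `most_common` returns the tokens sorted by descending count.

```python
from collections import Counter

cleaned = "Today is March 18th 2021 and Mahmoud is showing us how to visualise text online at Lancaster University"
token_texts = cleaned.split()  # stand-in for [token.text for token in tokens]

token_counts = Counter(token_texts)

# Print the three most frequent tokens with their counts
for text, count in token_counts.most_common(3):
    print(f'{text}: {count}')

# Tokens that occur more than once
repeated = [text for text, count in token_counts.items() if count > 1]
print('Repeated tokens:', repeated)
```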


In [25]:
#what about stop-words?
#SpaCy's English language stop words (for other languages see: https://spacy.io/usage/models)
from spacy.lang.en.stop_words import STOP_WORDS

# Convert stop words set to a DataFrame
stop_words_df = pd.DataFrame(list(STOP_WORDS), columns=['Stop Word'])

# Display the DataFrame
stop_words_df

    Stop Word
0       sixty
1        made
2         see
3     perhaps
4     without
..        ...
321    rather
322       any
323       few
324    twenty


In [67]:
# Let's get the tokens, ignoring stop words and punctuation (remember that we already removed punctuation with a regex).
tokens_no_stopwords = [token.text for token in tokens if not token.is_stop and not token.is_punct]

In [68]:
# Notice that the stop words (is, and, us, how, to, at) disappear in the 2nd output; 'text' is still there because it is not a stop word yet
print('With stop-words\n',*tokens, '>>>>>>',len(tokens), 'words.')
print('Without stop-words\n',*tokens_no_stopwords, '>>>>>>', len(tokens_no_stopwords), 'words.')

With stop-words
 Today is March 18th 2021 and Mahmoud is showing us how to visualise text online at Lancaster University >>>>>> 18 words.
Without stop-words
 Today March 18th 2021 Mahmoud showing visualise text online Lancaster University >>>>>> 11 words.


In [69]:
#what if we want to add/remove to/from the default stop-words list?
#assume the word 'text' is very frequent in our corpus to an extent that it becomes a stop-word
#to add 'text' to the stop words list:
nlp.Defaults.stop_words.add("text")

In [71]:
#print the list, notice 'text' is now an entry
#to remove a word from the list use: nlp.Defaults.stop_words.remove("word_to_be_removed")
# Convert stop words set to a DataFrame
stop_words_df = pd.DataFrame(list(STOP_WORDS), columns=['Stop Word'])

# Display the DataFrame
stop_words_df


    Stop Word
0       sixty
1        made
2         see
3     perhaps
4     without
..        ...
322    rather
323       any
324       few
325    twenty


In [72]:
nlp = spacy.load("en_core_web_sm")  # reload the model so that tokens pick up the updated stop-word list
tokenizer = Tokenizer(nlp.vocab)  # recreate the tokenizer, as the previous one was built before the list was updated
tokens = tokenizer(sentence)  # `sentence` is the raw sentence again, so whitespace tokens and attached punctuation come back
# Loop through the tokens and keep only those that are neither stop words nor punctuation.
tokens_no_stopwords = [token.text for token in tokens if not token.is_stop and not token.is_punct]

In [73]:
print('Without updated-stop-words\n',*tokens_no_stopwords, '>>>>>>', len(tokens_no_stopwords), 'words.')

Without updated-stop-words
 Today    March 18th 2021    Mahmoud, showing    visualise online Lancaster University. >>>>>> 13 words.
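
The effect of extending a stop-word list can be illustrated without spaCy. The sketch below is a simplified stand-in, not spaCy's implementation: `STOP_WORDS_DEMO` is a hypothetical hand-picked set, and tokens are compared in lower case. It uses the cleaned sentence, so the counts differ from the output above. Adding `'text'` to the set removes it from the filtered output, just as in the cells above.

```python
# Hypothetical, hand-picked stop-word set used only for this illustration
STOP_WORDS_DEMO = {'is', 'and', 'us', 'how', 'to', 'at'}

def remove_stop_words(words, stop_words):
    """Keep the words whose lower-cased form is not in stop_words."""
    return [w for w in words if w.lower() not in stop_words]

words = "Today is March 18th 2021 and Mahmoud is showing us how to visualise text online at Lancaster University".split()

before = remove_stop_words(words, STOP_WORDS_DEMO)
print(len(before), 'words:', *before)

# Treat 'text' as a corpus-specific stop word
STOP_WORDS_DEMO.add('text')
after = remove_stop_words(words, STOP_WORDS_DEMO)
print(len(after), 'words:', *after)
```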


In [75]:
# Linguistic annotations (part-of-speech tags and dependencies; see Universal Dependencies: https://universaldependencies.org)
# nlp is a Language object containing all the components and data needed to process text; calling it on a string returns a Doc
doc = nlp(sentence)  # A Doc is a sequence of Token objects: https://spacy.io/api/doc

# Extract token details
data = [(token.text, token.pos_, token.dep_) for token in doc]
df = pd.DataFrame(data, columns=['Token', 'Part of Speech', 'Dependency'])

# Display the DataFrame
print(df)


'''nsubj: nominal subject.\t nummod: numeric modifier ...etc. For more visit: https://universaldependencies.org'''

         Token Part of Speech Dependency
0        Today           NOUN      nsubj
1           is            AUX       ROOT
2                       SPACE        dep
3        March          PROPN   compound
4         18th           NOUN       attr
5         2021            NUM   npadvmod
6          and          CCONJ         cc
7                       SPACE        dep
8      Mahmoud          PROPN       conj
9            ,          PUNCT      punct
10          is            AUX        aux
11     showing           VERB      advcl
12          us           PRON     dative
13                      SPACE        dep
14         how          SCONJ     advmod
15          to           PART        aux
16   visualise           VERB      xcomp
17        text           NOUN       dobj
18      online            ADV     advmod
19          at            ADP       prep
20   Lancaster          PROPN   compound
21  University          PROPN       pobj
22           .          PUNCT      punct


'nsubj: nominal subject.\t nummod: numeric modifier ...etc. For more visit: https://universaldependencies.org'

## Step 3: Visualising Syntactic Dependencies

Syntactic dependency parsing helps to understand the grammatical structure of a sentence. We will visualise the syntactic dependencies of our sample text using spaCy's visualisation tools.

In [78]:
#Let's visualise the annotated sentence above

from spacy import displacy

#nlp = spacy.load("en_core_web_sm") #uncomment if not loaded previously
#doc = nlp(sentence)  # see the previous cell; `sentence` is the raw sentence, extra spaces and punctuation included
displacy.render(doc, style="dep")

In [79]:
#Can we make it look a bit cooler? (for more options https://spacy.io/api/top-level#displacy_options)
options = {"compact": True, "bg": "#ebc334",
           "color": "black", "font": "Source Sans Pro"}
displacy.render(doc, style="dep", options=options)

In [81]:
!mkdir plots

In [82]:
#to save in Scalable Vector Graphics (SVG) so you can view it in full screen:
from pathlib import Path
svg = displacy.render(doc, style="dep", options=options, jupyter=False)  # jupyter=False returns the markup as a string

output_path = Path("./plots/dependency_plot.svg")
output_path.write_text(svg, encoding="utf-8")  # returns the number of characters written

17277

## Step 4: Named Entity Recognition (NER)

Named Entity Recognition is a technique used to identify and classify named entities in text. Here, we will visualise the named entities present in our sample text using spaCy.

In [88]:
#what about named entities (NER)?

from tabulate import tabulate

# Extract entity details
data = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
headers = ['Entity', 'Start Char', 'End Char', 'Label']

# Display the table using tabulate
print(tabulate(data, headers=headers, tablefmt="pretty"))


+----------------------+------------+----------+--------+
|        Entity        | Start Char | End Char | Label  |
+----------------------+------------+----------+--------+
|        Today         |     0      |    5     |  DATE  |
|   March 18th 2021    |     11     |    26    |  DATE  |
|       Mahmoud        |     33     |    40    | PERSON |
| Lancaster University |     90     |   110    |  ORG   |
+----------------------+------------+----------+--------+
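
The `Start Char` and `End Char` columns are character offsets into the raw sentence (extra spaces included), with the end offset exclusive, so slicing the string with them recovers each entity. A small check using the values from the table above, copied by hand:

```python
sentence = "Today is   March 18th 2021 and   Mahmoud, is showing us   how to visualise text online at Lancaster University."

# (entity text, start offset, end offset, label), copied from the table above
entities = [
    ("Today", 0, 5, "DATE"),
    ("March 18th 2021", 11, 26, "DATE"),
    ("Mahmoud", 33, 40, "PERSON"),
    ("Lancaster University", 90, 110, "ORG"),
]

for text, start, end, label in entities:
    # The end offset is exclusive, so a plain slice returns the entity text
    print(f'{label:<6} {sentence[start:end]!r}')
```

Because the offsets refer to the raw text, they stop being valid once the sentence is cleaned or otherwise edited.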


In [84]:
#can we visualise named entities? Well, of course! :-)
displacy.render(doc, style="ent")

In [85]:
#to save as an HTML file so you can view it in full screen:
from pathlib import Path
html = displacy.render(doc, style="ent", jupyter=False)  # jupyter=False makes render() return the markup as a string instead of displaying it

output_path = Path("./plots/ner_plot.html")
output_path.write_text(html, encoding="utf-8")  # returns the number of characters written

1361

## Step 5: Combining Visualisations

In this step, we will combine the visualisations of named entities and syntactic dependencies to get a comprehensive view of the text's structure.
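
One way to combine the two views is to render each one to markup with `jupyter=False` and write both into a single HTML page. The sketch below is an assumption about how this step could look, not the author's code: the helper name `combined_page` and the output file name are made up for illustration, and a tiny hand-annotated `Doc` is used so that the example runs without downloading a model. In the notebook you would pass the `doc` created earlier with `nlp(sentence)` instead.

```python
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

def combined_page(doc, title="Entities and dependencies"):
    """Return one HTML page holding the entity view followed by the dependency view."""
    # page=False returns only the markup fragment; jupyter=False returns it as a string
    ent_html = displacy.render(doc, style="ent", page=False, jupyter=False)
    dep_svg = displacy.render(doc, style="dep", page=False, jupyter=False,
                              options={"compact": True})
    return (
        '<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8">'
        f'<title>{title}</title></head>\n<body>\n'
        f'<h2>Named entities</h2>\n{ent_html}\n'
        f'<h2>Syntactic dependencies</h2>\n{dep_svg}\n'
        '</body>\n</html>'
    )

# Hand-annotated stand-in for doc = nlp(sentence); heads are absolute token indices
nlp = spacy.blank("en")
demo_doc = Doc(
    nlp.vocab,
    words=["Mahmoud", "visits", "Lancaster"],
    pos=["PROPN", "VERB", "PROPN"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
    ents=["B-PERSON", "O", "B-GPE"],
)

page = combined_page(demo_doc)
Path("combined_plot.html").write_text(page, encoding="utf-8")
```

Opening `combined_plot.html` in a browser shows the highlighted entities above the dependency arcs for the same text.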