<a href="https://colab.research.google.com/github/ArefPhD/General-training/blob/main/examples/better-nlp/notebooks/jupyter/better_nlp_summarisers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Better NLP

This is a wrapper program/library that encapsulates a couple of NLP libraries that are popular among the AI and ML communities.

Examples have been used to illustrate the usage as much as possible. Not all the APIs of the underlying libraries have been covered.

The idea is to keep the API language as high-level as possible, so its easier to use and stays human-readable.

Libraries / frameworks covered:

- nltk [site](http://www.nltk.org/) | [docs](https://buildmedia.readthedocs.org/media/pdf/nltk/latest/nltk.pdf)
- numpy [site](https://www.numpy.org/) | [docs](https://docs.scipy.org/doc/)
- networkx [site](https://networkx.github.io/) | [docs](https://networkx.github.io/documentation/stable/index.html)

See [https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp) for more details.

### This notebook will demonstrate the below NLP features / functionalities, using the above mentioned libraries

- Cosine similarity summarisation technique (extractive summarisation)
- Vertex ranking algorithm summarisation technique
- Build a simple text summarisation tool using NLTK
- Summarisation 4 (TODO)
- Summarisation 5 (TODO)

_Summarisation can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning._

### Resources

- [Understand Text Summarization and create your own summarizer in python](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70) or [An Introduction to Text Summarization using the TextRank Algorithm (with Python implementation)](https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/)
- [Beyond bag of words: Using PyTextRank to find Phrases and Summarize text](https://medium.com/@aneesha/beyond-bag-of-words-using-pytextrank-to-find-phrases-and-summarize-text-f736fa3773c5)
- [Build a simple text summarisation tool using NLTK](https://medium.com/@wilamelima/build-a-simple-text-summarisation-tool-using-nltk-ff0984fedb4f)
- [Summarise Text with TFIDF in Python 1/2](https://towardsdatascience.com/tfidf-for-piece-of-text-in-python-43feccaa74f8) and [Summarise Text with TFIDF in Python 2/2](https://medium.com/@shivangisareen/summarise-text-with-tfidf-in-python-bc7ca10d3284)
- [How to Make a Text Summarizer - Intro to Deep Learning #10 by Siraj Raval](https://www.youtube.com/watch?v=ogrJaOIuBx4)

#### Setup and installation ( optional )

In case, this notebook is running in a local environment (Linux/MacOS) or _Google Colab_ environment and in case it does not have the necessary dependencies installed then please execute the steps in the next section.

Otherwise, please SKIP to the **Install Spacy model ( NOT optional )** section.

In [1]:
%%time
%%bash

apt-get install apt-utils dselect dpkg

echo "OSTYPE=$OSTYPE"
if [[ "$OSTYPE" == "cygwin" ]] || [[ "$OSTYPE" == "msys" ]] ; then
    echo "Windows or Windows-like environment detected, script not tested, and may not work."
    echo "Try installing the components mention in the install-[ostype].sh scripts manually."
    echo "Or try running under CGYWIN or git-bash."
    echo "If successfully installed, please contribute back with the solution via a pull request, to https://github.com/neomatrix369/awesome-ai-ml-dl/"
    echo "Please give the file a good name, i.e. install-windows.sh or install-windows.bat depending on what kind of script you end up writing"
    exit 0
elif [[ "$OSTYPE" == "linux-gnu" ]] || [[ "$OSTYPE" == "linux" ]]; then
    TARGET_OS="linux"
else
    TARGET_OS="macos"
fi

if [[ -e ../../library/org/neomatrix369 ]]; then
  echo "Library source found"
  
  cd ../../build
  
  echo "Detected OS: ${TARGET_OS}"
  ./install-${TARGET_OS}.sh || true
else
  if [[ -e awesome-ai-ml-dl/examples/better-nlp/library ]]; then
     echo "Library source found"
  else
     git clone "https://github.com/neomatrix369/awesome-ai-ml-dl"
  fi

  echo "Library source exists"
  cd awesome-ai-ml-dl/examples/better-nlp/build

  echo "Detected OS: ${TARGET_OS}"
  ./install-${TARGET_OS}.sh || true 
fi

Reading package lists...
Building dependency tree...
Reading state information...
apt-utils is already the newest version (1.6.13).
dpkg is already the newest version (1.19.0.5ubuntu2.3).
The following NEW packages will be installed:
  dselect
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 148 kB of archives.
After this operation, 1,667 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 dselect amd64 1.19.0.5ubuntu2.3 [148 kB]
Fetched 148 kB in 1s (103 kB/s)
Selecting previously unselected package dselect.
(Reading database ... (Reading database ... 5%(Reading database ... 10%(Reading database ... 15%(Reading database ... 20%(Reading database ... 25%(Reading database ... 30%(Reading database ... 35%(Reading database ... 40%(Reading database ... 45%(Reading database ... 50%(Reading database ... 55%(Reading database ... 60%(Reading database ... 65%(Reading database ... 70%(Reading databas

Cloning into 'awesome-ai-ml-dl'...
E: Package 'libnode64' has no installation candidate


CPU times: user 365 ms, sys: 50.2 ms, total: 415 ms
Wall time: 1min 18s


#### Install Spacy model ( NOT optional )

Install the large English language model for spaCy - will be needed for the examples in this notebooks.

**Note:** from observation it appears that spaCy model should be installed towards the end of the installation process, it avoid errors when running programs using the model.

In [2]:
%%time
%%bash

python -m spacy download en_core_web_lg
python -m spacy link en_core_web_lg en || true

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py): started
  Building wheel for en-core-web-lg (setup.py): finished with status 'done'
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp37-none-any.whl size=829180945 sha256=41c227269921af9742b75704650b4bf086e252c746469347b2720ee5d7c09cef
  Stored in directory: /tmp/pip-ephem-wheel-cache-r59lqajd/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_lg')

[38;5;1m✘ Link 'en' already exists[0m
To overwrite an existing link, use the --force flag

CPU times: user 832 

## Examples of various summarisation methods

### 1. Cosine similarity summarisation technique (extractive summarisation)

**Abstractive Summarization:** Abstractive methods select words based on semantic understanding, even those words did not appear in the source documents. It aims at producing important material in a new way. They interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text.

**Flow:** Input document → understand context → semantics → create own summary

**Extractive Summarization:** Extractive methods attempt to summarize articles by selecting a subset of words that retain the most important points.

**Flow:** Input document → sentences similarity → weight sentences → select sentences with higher rank

**Cosine similarity** is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. Angle will be 0 if sentences are similar and tend towards 90 as they begin to differ.

Inspired by Praveen Dubey the author of https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

or see [Understand Text Summarization and create your own summarizer in python](https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/)

In [4]:
pip install textacy

Collecting textacy
[?25l  Downloading https://files.pythonhosted.org/packages/fc/de/139b38896a4027bd44ab14594981c4012ff9de4425df93e74d8e3998225c/textacy-0.11.0-py3-none-any.whl (200kB)
[K     |█▋                              | 10kB 14.1MB/s eta 0:00:01[K     |███▎                            | 20kB 8.7MB/s eta 0:00:01[K     |█████                           | 30kB 8.2MB/s eta 0:00:01[K     |██████▌                         | 40kB 7.6MB/s eta 0:00:01[K     |████████▏                       | 51kB 4.1MB/s eta 0:00:01[K     |█████████▉                      | 61kB 4.3MB/s eta 0:00:01[K     |███████████▌                    | 71kB 4.7MB/s eta 0:00:01[K     |█████████████                   | 81kB 5.2MB/s eta 0:00:01[K     |██████████████▊                 | 92kB 3.8MB/s eta 0:00:01[K     |████████████████▍               | 102kB 4.2MB/s eta 0:00:01[K     |██████████████████              | 112kB 4.2MB/s eta 0:00:01[K     |███████████████████▋            | 122kB 4.2MB/s eta 0

In [2]:
pip install pytextrank

Collecting pytextrank
  Downloading https://files.pythonhosted.org/packages/dc/bc/df8b57f1c28e2b66859b7428520115723a7c49ce6e4d1a224307dbe079dd/pytextrank-3.1.1-py3-none-any.whl
Collecting graphviz>=0.13
  Downloading https://files.pythonhosted.org/packages/86/86/89ba50ba65928001d3161f23bfa03945ed18ea13a1d1d44a772ff1fa4e7a/graphviz-0.16-py2.py3-none-any.whl
Collecting icecream>=2.1
  Downloading https://files.pythonhosted.org/packages/1f/c0/8e2bc1b5eab95e5155841c826b431692638c19bf04ee4cdc86b379f85150/icecream-2.1.1-py2.py3-none-any.whl
Collecting executing>=0.3.1
  Downloading https://files.pythonhosted.org/packages/e1/a6/07d28b53b1fab42985cba6b704d685a60a2e3a5efce4cfaaad42a4494bd8/executing-0.6.0-py2.py3-none-any.whl
Collecting asttokens>=2.0.1
  Downloading https://files.pythonhosted.org/packages/16/d5/b0ad240c22bba2f4591693b0ca43aae94fbd77fb1e2b107d54fff1462b6f/asttokens-2.0.5-py2.py3-none-any.whl
Collecting colorama>=0.3.9
  Downloading https://files.pythonhosted.org/packages/44/98/

In [3]:
import sys
sys.path.insert(0, '../../library')
sys.path.insert(0, './awesome-ai-ml-dl/examples/better-nlp/library')

from org.neomatrix369.better_nlp import BetterNLP

import pprint
pp = pprint.PrettyPrinter(indent=4)

This version of Python is 64 bits.
This version of Python is 64 bits.
This version of Python is 64 bits.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
This version of Python is 64 bits.


In [9]:
betterNLP = BetterNLP() ### do not re-run this unless you wish to re-initialise the object
generic_text="""USA is a powerful country. It got lot of money and guns."""

In [10]:
summarised_result = betterNLP.summarise(generic_text)

print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("summarisation_processing_time_in_secs=",summarised_result['summarisation_processing_time_in_secs'])
pp.pprint("summarised_text=" + summarised_result['summarised_text'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
summarisation_processing_time_in_secs= 0.0009243488311767578
'summarised_text=USA is a powerful country'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


In [11]:
print("ranked_sentences=") 
pp.pprint(summarised_result['ranked_sentences'])

ranked_sentences=
[(1.0, ['USA', 'is', 'a', 'powerful', 'country'])]


### 2. Vertex ranking algorithm summarisation technique

Using PyTextRank to find Phrases and Summarize text: Multi-word Phrase Extraction and Sentence Extraction for Summarization

Inspired by the author of https://medium.com/@aneesha/beyond-bag-of-words-using-pytextrank-to-find-phrases-and-summarize-text-f736fa3773c5 
(Notebook: https://github.com/DerwenAI/pytextrank/blob/master/example.ipynb)

Another resource to take a look at: https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [12]:
betterNLP = BetterNLP() ### do not re-run this unless you wish to re-initialise the object

In [13]:
source_file='source.json'
source_json_content='{"id":"777", "text":"In an attempt to build an AI-ready workforce, SmartSoft Corp. announced Smart Colab Program which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Smart Colab Program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Palo Alto giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and AI services such as SmartSoft Corp. Cognitive Services, Bot Services and Machine Learning Services. According to Mark Smith, Country AI Manager, SmartSoft Corp. India, said, ''With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That''s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow.'' The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced SmartSoft Corp. Advanced Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."}'
f = open(source_file, 'w')
f.write("%s" % source_json_content)
f.close()

summarised_result = betterNLP.summarise(source_file, method="pytextrank")

print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("summarisation_processing_time_in_secs=",summarised_result['summarisation_processing_time_in_secs'])
print("summarised_text=",summarised_result['summarised_text'])
print("token_ranks=",summarised_result['token_ranks'])
print("key_phrases=",summarised_result['key_phrases'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
betterNLP.show_graph(summarised_result["graph"])

AttributeError: ignored

In [14]:
!pip install --upgrade pytextrank>=2.0.1

In [15]:
!pip list | grep pytext

pytextrank                    3.1.1              


In [16]:
import pytextrank
import sys

path_stage0 = "dat/mih.json"
path_stage1 = "o1.json"

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        # to view output in this notebook
        print(pretty_print(graf))

AttributeError: ignored

### 3. Build a simple text summarisation tool using NLTK

Inspired by Wilame Lima Vallantin, the author of [Build a simple text summarisation tool using NLTK](https://medium.com/@wilamelima/build-a-simple-text-summarisation-tool-using-nltk-ff0984fedb4f).

We have to break the text into sentences and tokens, remove stop-words. Tokenise words, calculate word frequency to determine if a word is important on the corpus, using the TF-IDF technique.

In [17]:
summarised_result = betterNLP.summarise(generic_text, method="tfidf")

print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("summarisation_processing_time_in_secs=",summarised_result['summarisation_processing_time_in_secs'])
print("summarised_text=")
pp.pprint(summarised_result['summarised_text'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
summarisation_processing_time_in_secs= 2.201470136642456
summarised_text=
[]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


  'stop_words.' % sorted(inconsistent))


In [18]:
print("important_words=")
pp.pprint(summarised_result['important_words'])

important_words=
[   ('gun', 0.3779644730092272),
    ('money', 0.3779644730092272),
    ('lot', 0.3779644730092272),
    ('got', 0.3779644730092272),
    ('country', 0.3779644730092272),
    ('powerful', 0.3779644730092272),
    ('usa', 0.3779644730092272)]


### 4. Summarising text in python using a variation of TF-IDF method


Inspired by Shivangi Sareen from the posts:
[Summarise Text with TFIDF in Python 1](https://towardsdatascience.com/tfidf-for-piece-of-text-in-python-43feccaa74f8) and [Summarise Text with TFIDF in Python 2](https://medium.com/@shivangisareen/summarise-text-with-tfidf-in-python-bc7ca10d3284)

We have to break the text into sentences and tokens, ***we do not remove stop-words*** but do remove special characters. Tokenise words, calculate word TF and IDF frequencies to determine if a word is important on the corpus, using the TF-IDF technique. And then based on the average score method filter out only those sentences that meet the criteria.

We could also use the (average score + 1.5 * std dev) or (average score + 3 * std dev), depending on the size of the target documents to summarise.

In [None]:
summarised_result = betterNLP.summarise(generic_text, method="tfidf-ignore-stopwords")

print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("summarisation_processing_time_in_secs=",summarised_result['summarisation_processing_time_in_secs'])
pp.pprint("summarised_text=" + summarised_result['summarised_text'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

In [None]:
print("scored_documents=")
pp.pprint(summarised_result['scored_documents'])