# Better NLP

This is a wrapper program/library that encapsulates a couple of NLP libraries that are popular among the AI and ML communities.

Examples have been used to illustrate the usage as much as possible. Not all the APIs of the underlying libraries have been covered.

The idea is to keep the API language as high-level as possible, so that it is easier to use and stays human-readable.

Libraries / frameworks covered:

- SpaCy ([site](https://spacy.io/) | [docs](https://spacy.io/usage/))
- Textacy ([github](https://github.com/chartbeat-labs/textacy) | [docs](https://chartbeat-labs.github.io/textacy/))

See [https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/examples/better-nlp) for more details.

#### Setup and installation ( optional )

If this notebook is running in a local environment (Linux/macOS) or in _Google Colab_, and the necessary dependencies are not yet installed, please execute the steps in the next cell.

Otherwise, please SKIP to the **Examples** section.

In [1]:
%%time
%%bash

apt-get install apt-utils dselect dpkg

echo "OSTYPE=$OSTYPE"
if [[ "$OSTYPE" == "cygwin" ]] || [[ "$OSTYPE" == "msys" ]] ; then
    echo "Windows or Windows-like environment detected, script not tested, and may not work."
    echo "Try installing the components mentioned in the install-[ostype].sh scripts manually."
    echo "Or try running under Cygwin or git-bash."
    echo "If successfully installed, please contribute back with the solution via a pull request, to https://github.com/neomatrix369/awesome-ai-ml-dl/"
    echo "Please give the file a good name, i.e. install-windows.sh or install-windows.bat depending on what kind of script you end up writing"
    exit 0
elif [[ "$OSTYPE" == "linux-gnu" ]] || [[ "$OSTYPE" == "linux" ]]; then
    TARGET_OS="linux"
else
    TARGET_OS="macos"
fi

# No trailing slash here: the wget calls below add the separator themselves
BASE_URL="https://raw.githubusercontent.com/neomatrix369/awesome-ai-ml-dl/master/examples/better-nlp/build"
if [[ ! -f "install-${TARGET_OS}.sh" ]]; then
    wget ${BASE_URL}/install-${TARGET_OS}.sh
    chmod +x ./install-${TARGET_OS}.sh
fi


if [[ ! -f "install-dependencies.sh" ]]; then
    wget ${BASE_URL}/install-dependencies.sh
    chmod +x ./install-dependencies.sh
fi

echo "Detected OS: ${TARGET_OS}"
./install-${TARGET_OS}.sh || true

Reading package lists...
Building dependency tree...
Reading state information...
dpkg is already the newest version (1.18.25).
dpkg set to manually installed.
The following NEW packages will be installed:
  apt-utils dselect libapt-inst2.0
0 upgraded, 3 newly installed, 0 to remove and 4 not upgraded.
Need to get 1888 kB of archives.
After this operation, 4168 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian stretch/main amd64 libapt-inst2.0 amd64 1.4.9 [192 kB]
Get:2 http://deb.debian.org/debian stretch/main amd64 apt-utils amd64 1.4.9 [410 kB]
Get:3 http://deb.debian.org/debian stretch/main amd64 dselect amd64 1.18.25 [1285 kB]
Fetched 1888 kB in 1s (1171 kB/s)
Selecting previously unselected package libapt-inst2.0:amd64.
(Reading database ... 36312 files and directories currently installed.)
Preparing to unpack .../libapt-inst2.0_1.4.9_amd64.deb ...
Unpacking libapt-inst2.0:amd64 (1.4.9) ...
Selecting previously unselected package apt-utils.
Preparing to

debconf: delaying package configuration, since apt-utils is not installed
--2019-04-14 18:54:03--  https://raw.githubusercontent.com/neomatrix369/awesome-ai-ml-dl/master/examples/better-nlp/build//install-linux.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-04-14 18:54:04 ERROR 404: Not Found.

chmod: cannot access './install-linux.sh': No such file or directory
--2019-04-14 18:54:04--  https://raw.githubusercontent.com/neomatrix369/awesome-ai-ml-dl/master/examples/better-nlp/build//install-dependencies.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-04-14 18:54:04 ERROR 404: Not Found.

chmod: c

CPU times: user 20 ms, sys: 10 ms, total: 30 ms
Wall time: 5.82 s
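
The OS detection performed by the setup cell can also be sketched in Python, which may be handy when a `%%bash` cell is not available. This is only an illustration: `detect_target_os` is a hypothetical helper, not part of the library, and it assumes the same fallback as the shell script (anything that is neither Linux nor Windows-like is treated as macOS).

```python
import platform


def detect_target_os(system_name=None):
    """Map a platform name to the suffix used by the install-<os>.sh scripts.

    Returns "linux" or "macos", or None for Windows-like systems,
    which the setup cell treats as untested.
    """
    name = (system_name or platform.system()).lower()
    if name.startswith(("windows", "cygwin", "msys")):
        return None
    if name == "linux":
        return "linux"
    # Like the shell script, treat everything else as macOS.
    return "macos"


print("Detected OS:", detect_target_os())
```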


#### Install spaCy model ( NOT optional )

Install the large English language model for spaCy. It is needed for the examples in this notebook.

**Note:** from observation, it appears that the spaCy model should be installed towards the end of the installation process; this avoids errors when running programs that use the model.

In [None]:
%%time
%%bash

python -m spacy download en_core_web_lg
python -m spacy link en_core_web_lg en || true
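
Since the model is not optional, it can be worth checking that it is importable before running the examples. The sketch below relies on the fact that spaCy models downloaded this way are installed as regular Python packages; `is_package_installed` is a hypothetical helper name used for illustration.

```python
import importlib.util


def is_package_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


if not is_package_installed("en_core_web_lg"):
    print("en_core_web_lg not found; run: python -m spacy download en_core_web_lg")
```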

#### Clone the repo with the library code

Eventually this step will not be needed, once the library can be installed directly.

In [None]:
%%bash

if [[ -e awesome-ai-ml-dl/examples/better-nlp/ ]] || [[ -e ../../org/neomatrix369 ]]; then
   echo "Library source exists"
else
    git clone "https://github.com/neomatrix369/awesome-ai-ml-dl"
fi

## Examples

### Extract entities

In [None]:
import sys
sys.path.insert(0, '../../library')

from org.neomatrix369.better_nlp import BetterNLP
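
The relative path above depends on where the notebook is run from. A slightly more defensive variant, sketched below, tries a few candidate locations and uses the first one that exists. The second candidate path is an assumption about the layout of a fresh clone and may need adjusting; `first_existing_dir` is a hypothetical helper.

```python
import os
import sys


def first_existing_dir(candidates):
    """Return the first candidate path that is an existing directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return os.path.abspath(path)
    return None


# '../../library' is the path used in this notebook; the second entry is an
# assumed location inside a fresh clone and may need adjusting.
candidate_dirs = [
    "../../library",
    "awesome-ai-ml-dl/examples/better-nlp/library",
]
library_dir = first_existing_dir(candidate_dirs)
if library_dir is not None and library_dir not in sys.path:
    sys.path.insert(0, library_dir)
```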

In [None]:
# Can be any factual text or any text to experiment with
generic_text = """Denis Guedj (1940 – April 24, 2010) was a French novelist and 
a professor of the History of Science at Paris VIII University. He was born 
in Setif. He spent many years devising courses and games to teach adults 
and children math. He is the author of Numbers: The Universal Language and 
of the novel The Parrot's Theorem. He died in Paris. 
"""

betterNLP = BetterNLP()  # do not re-run this unless you wish to re-initialise the object

In [None]:
model_loading_result = betterNLP.load_nlp_model()
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("model_loading_time_in_secs=",model_loading_result['model_loading_time_in_secs'])
print("model_loading_method=",model_loading_result['model_loading_method'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

model = model_loading_result["model"]
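
The result above is a dictionary that carries the model alongside timing information. How the library measures this internally is not shown here; the following is only a minimal sketch of the general pattern, with a hypothetical `timed_call` helper and hypothetical key names.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return its result together with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return {"result": result, "elapsed_time_in_secs": elapsed}


# Stand-in for an expensive call such as loading a model.
outcome = timed_call(sum, range(1_000_000))
print("result=", outcome["result"])
print("elapsed_time_in_secs=", outcome["elapsed_time_in_secs"])
```

Returning the timing next to the result keeps the calling code short and makes it easy to compare loading methods.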

In [None]:
parsed_generic_text = betterNLP.extract_entities(model, generic_text)
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("extract_entities_processing_time_in_secs=", parsed_generic_text['extract_entities_processing_time_in_secs'])
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

parsed_generic_text = parsed_generic_text['parsed_text']
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
for each_entity in parsed_generic_text.ents:
    # Skip entities whose text has leading or trailing whitespace
    if each_entity.text.strip() == each_entity.text:
        print(f"{each_entity.text} ({each_entity.label_})")
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print(betterNLP.token_entity_types())
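
Printing entities one per line can get long for larger texts. As a possible follow-up, the entities could be grouped by label. The sketch below works on plain `(text, label)` pairs so that it runs without a model; `group_entities_by_label` is a hypothetical helper and the sample labels are illustrative, not actual model output.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Group (text, label) pairs by label, keeping first-seen order and dropping duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With a spaCy Doc, the pairs could be built as:
#   pairs = [(ent.text, ent.label_) for ent in parsed_generic_text.ents]
# The labels below are illustrative, not actual model output.
pairs = [("Denis Guedj", "PERSON"), ("Paris", "GPE"), ("Paris", "GPE")]
print(group_entities_by_label(pairs))
```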

### Noun extraction

In [None]:
chunks = betterNLP.extract_nouns_chunks(model, generic_text)
chunks = chunks.get("noun_chunks")
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
set_of_noun_chunks = set(chunks)
if len(set_of_noun_chunks) == 0:
    print("Did not find words that belong together.")
else:
    print("A list of words that belong together (in lowercase):")

for each_noun_chunk in set_of_noun_chunks:
    # Only print chunks with more words than the configured threshold
    if len(each_noun_chunk.split(" ")) > betterNLP.minimum_occurrence_frequency:
        print(each_noun_chunk)
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
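
The filtering step above (deduplicate, then keep only multi-word chunks) can be isolated into a small function that is easy to test without a model. This is a sketch under assumptions: `multi_word_chunks` and its `min_words` parameter are hypothetical, and the threshold is not necessarily the one the library uses.

```python
def multi_word_chunks(chunks, min_words=2):
    """Return unique chunks with at least min_words words, sorted for stable output."""
    unique = {chunk.strip().lower() for chunk in chunks}
    return sorted(chunk for chunk in unique if len(chunk.split()) >= min_words)


# Illustrative input; real chunks would come from the noun-extraction step.
sample_chunks = ["a French novelist", "He", "a french novelist", "many years"]
for chunk in multi_word_chunks(sample_chunks):
    print(chunk)
```

Sorting the result gives the same order on every run, which iterating over a plain `set` does not guarantee.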

### Gather facts

In [None]:
target_topic = "Denis Guedj"
extracted_facts = betterNLP.extract_facts(model, generic_text, target_topic)

In [None]:
extracted_facts = extracted_facts.get("facts")

print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("Trying to gather details about " + target_topic)

number_of_facts_found = 0
for each_fact_statement in extracted_facts:
    number_of_facts_found += 1
    subject, verb, fact = each_fact_statement
    print(f" - {fact}")

if number_of_facts_found == 0:
    print("There were no facts on " + target_topic)
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
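
The same report can be produced without a manual counter by building the output lines first. The sketch below assumes, as the loop above does, that each statement is a `(subject, verb, fact)` triple; `format_facts` is a hypothetical helper and the sample triple is illustrative.

```python
def format_facts(statements, topic):
    """Turn (subject, verb, fact) triples into printable lines for a topic."""
    lines = [f" - {fact}" for _subject, _verb, fact in statements]
    if not lines:
        return [f"There were no facts on {topic}"]
    return lines


# Illustrative triple; real statements would come from the fact-extraction step.
sample_statements = [("Denis Guedj", "was", "a French novelist")]
for line in format_facts(sample_statements, "Denis Guedj"):
    print(line)
```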

### Obfuscate privacy details

In [None]:
obfuscated_text = betterNLP.obfuscate_text(model, generic_text)
obfuscated_text = obfuscated_text.get("obfuscated_text")
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print("Obfuscated generic text: ", "".join(obfuscated_text))
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
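
To illustrate the general idea behind obfuscation, the sketch below replaces character spans with a placeholder built from a label. It is not the library's implementation: `obfuscate_spans` is a hypothetical helper, and the spans are supplied by hand so that the example runs without a model.

```python
def obfuscate_spans(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "He died in Paris."
# With spaCy, spans could come from (ent.start_char, ent.end_char, ent.label_).
print(obfuscate_spans(sample, [(11, 16, "GPE")]))
```

Working from sorted, non-overlapping spans means the text is rebuilt in a single left-to-right pass, so earlier replacements do not shift the offsets of later ones.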