# Text Analytics and NLP Using Graphs

Use a tagged corpus for:
- Supervised: Classification models to assign documents to predetermined topics
- Unsupervised: Community detection to discover new topics

Chapter covers:
- Provide overview of dataset
- Understand concepts and tools in NLP
- Create graphs from a corpus of documents
- Build a document topic classifier

Dataset:

Reuters-21578: news articles published on the Reuters newswire in 1987. The original dataset has a very skewed label distribution, so we use a modified version, ApteMod, which has a less skewed distribution and consistent labels between the train and test sets. Each document has a set of labels that represents its content, which makes it a good benchmark for testing both supervised and unsupervised algorithms.

In [None]:
import nltk
import numpy as np
import pandas as pd
import networkx as nx

%matplotlib inline
from matplotlib import pyplot as plt

In [None]:
from nltk.corpus import reuters

In [None]:
nltk.download('reuters')

In [None]:
corpus = pd.DataFrame([
    {"id": _id, "clean_text": reuters.raw(_id).replace("\n", ""), "label": reuters.categories(_id)} # remove newline characters
    for _id in reuters.fileids()
]).set_index("id")

In [None]:
corpus.iloc[10]["clean_text"], corpus.iloc[10]["label"]


In [None]:
from collections import Counter
len(Counter([label for document_labels in corpus["label"] for label in document_labels]).most_common())

There are 90 different topics with a large class imbalance: about 37% of documents fall in the most common topic, while each of the five least common topics covers only about 0.01% of documents.
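The imbalance figures above come from counting how often each label occurs. A minimal sketch of that calculation follows, assuming a toy list of labels per document (the `document_labels` values are made up and stand in for `corpus["label"]`):

```python
from collections import Counter

# Toy stand-in for corpus["label"]: one list of topic labels per document.
# These values are invented for illustration only.
document_labels = [
    ["earn"], ["earn"], ["acq"], ["earn", "acq"], ["grain", "wheat"], ["earn"],
]

# Count how many documents carry each label (a document may carry several).
label_counts = Counter(label for labels in document_labels for label in labels)

# Share of documents per label, most common first.
n_documents = len(document_labels)
label_share = {
    label: count / n_documents for label, count in label_counts.most_common()
}

for label, share in label_share.items():
    print(f"{label:>6}: {share:.1%}")
```

Because a document can have several labels, the shares do not need to sum to 100%.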

In [None]:
corpus.sample(1)

## Language Detection
A basic NLP technique is to look for the most common words (stopwords) of each candidate language and build a score based on their frequency in the text. However, many libraries implement more elaborate logic for us.
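As a rough illustration of the stopword-scoring idea, the sketch below scores a text by the fraction of its tokens that are stopwords of each language. The `STOPWORDS` lists are tiny, hand-picked assumptions for the example (real lists are far longer), and the function names are hypothetical; this is not how `langdetect` or fastText work internally.

```python
import re

# Tiny hand-picked stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "pour", "dans"},
}

def stopword_scores(text: str) -> dict:
    """Fraction of tokens in `text` that are stopwords of each language."""
    # Keep runs of letters only (no digits, underscores or punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {lang: 0.0 for lang in STOPWORDS}
    return {
        lang: sum(token in words for token in tokens) / len(tokens)
        for lang, words in STOPWORDS.items()
    }

def guess_language(text: str):
    """Return the best-scoring language, or None if no stopword matched."""
    scores = stopword_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("The price of wheat is expected to rise in the spring"))
print(guess_language("Der Preis ist nicht mit dem Markt gestiegen"))
```

Short or oddly structured texts contain few stopwords, so a score like this becomes unreliable on them.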

In [None]:
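import langdetect
from langdetect.lang_detect_exception import LangDetectException
import numpy as np

def getLanguage(text: str):
    # langdetect raises LangDetectException when it finds no usable features
    # in the text (e.g. an empty string); mark those documents as missing
    try:
        return langdetect.detect(text)
    except LangDetectException:
        return np.nan

corpus["language"] = corpus["clean_text"].apply(getLanguage)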

In [None]:
# many documents are detected as languages other than English
# these are likely documents that are very short or oddly structured, so not actually news articles
corpus["language"].value_counts().head(10)

In [None]:
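# using fasttext to detect language: download the pre-trained language identification model
# (-o writes the response to a file; "-w GET" would append the literal text "GET" to the output)
!curl -L -o lid.176.ftz https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz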

In [None]:
!pip install fasttext

In [None]:
import fasttext

m = fasttext.load_model("lid.176.ftz")
def getLanguage(text: str):
    return m.predict(text)[0][0].replace("__label__", "")

In [None]:
corpus["language"] = corpus["clean_text"].apply(getLanguage)

In [None]:
corpus[corpus["language"]=="ja"].iloc[5]["clean_text"]

### NLP Enrichment

In [None]:
import spacy

# load model
nlp = spacy.load("en_core_web_md")

In [None]:
# apply model to text
corpus["parsed"] = corpus["clean_text"].apply(nlp)

The parsed object has several fields because many models are combined into a single pipeline. These provide different levels of text structuring:
-  **Text segmentation and tokenisation**: Aims to split a document into periods, sentences and single words/tokens. Leverages punctuation, blank spaces and newline characters for segmentation. spaCy works fairly well, but in practice custom rules may be required, such as keeping hashtags together in tweets (e.g. NLTK's TweetTokenizer).
-  **Part-of-Speech Tagger**: Associates each token with a PoS tag, i.e. its grammatical type: noun, verb, adjective, etc. The tagger is a model trained on existing annotated text.
-  **Named Entity Recognition (NER)**: Trained to recognise named entities that appear in text, e.g. organisations, persons, geographic locations. Usually trained on a large tagged dataset, from which it learns common patterns/structures.
-  **Dependency Parser**: Infers relationships between tokens within a sentence, and can build a syntax tree of how words are related.
-  **Lemmatizer**: Reduces words to a common root, a more stable form that is easier to process. **Stemmers** are an alternative that simply remove the last part of a word, keeping only its main part.
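To make the point about custom tokenisation rules concrete, here is a minimal regex-based sketch that keeps hashtags and mentions as single tokens. The pattern and the `tokenize` helper are assumptions for illustration; they are not spaCy's tokenizer nor NLTK's `TweetTokenizer`, both of which handle many more cases.

```python
import re

# Alternatives are tried left to right:
#   1. hashtags and mentions, kept as a single token (#wheat, @reuters)
#   2. words, optionally joined by inner apostrophes or hyphens (don't)
#   3. any single remaining non-space character (punctuation)
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text: str) -> list:
    """Split text into tokens using the hand-written rules above."""
    return TOKEN_PATTERN.findall(text)

sentence = "Wheat prices don't fall, says @reuters #commodities!"
print(sentence.split())    # whitespace split leaves punctuation glued to words
print(tokenize(sentence))  # rule-based split separates it, keeps the hashtag whole
```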

In [None]:
corpus.loc["test/14832"]["clean_text"]

In [None]:
from spacy import displacy

In [None]:
# nice way to utilise entities in text
displacy.render(corpus.loc["test/14832"]["parsed"], style="ent", jupyter=True)

In [None]:
corpus.head(2)

In [None]:
# export
corpus[["clean_text", "label", "language", "parsed"]].to_pickle("corpus.p")

## Graph Generation

We build two kinds of graphs from the corpus of documents and the information extracted in the previous section:
-  **Knowledge-based graph**: Subject-verb-object (triplet) relations are encoded to build a semantic graph
-  **Bipartite graph**: Link documents with entities/keywords appearing therein
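The difference between the two structures can be sketched without any graph library. The example below uses plain dictionaries and invented extraction results (`triplets` and `entities` are hypothetical toy data), so it only illustrates the shape of each graph rather than the notebook's actual construction.

```python
from collections import defaultdict

# Hypothetical extraction results for two documents, for illustration only.
triplets = {
    "doc1": [("bank", "lend", "money"), ("bank", "raise", "rate")],
    "doc2": [("government", "raise", "rate")],
}
entities = {
    "doc1": {"bank", "rate"},
    "doc2": {"government", "rate"},
}

# Knowledge graph: directed subject -> object edges labelled by the verb.
# Keeping a list per node pair allows several relations between the same nodes.
knowledge_graph = defaultdict(list)
for doc_id, doc_triplets in triplets.items():
    for subject, verb, obj in doc_triplets:
        knowledge_graph[(subject, obj)].append({"edge": verb, "id": doc_id})

# Bipartite graph: every edge joins a document to an entity,
# never two documents or two entities.
bipartite_edges = sorted(
    (doc_id, entity)
    for doc_id, doc_entities in entities.items()
    for entity in doc_entities
)

print(dict(knowledge_graph))
print(bipartite_edges)
```

In the knowledge graph all nodes are words, whereas the bipartite graph has two distinct node types (documents and entities).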

### Knowledge base

In [None]:
from subject_object_extraction import findSVOs

In [None]:
corpus["triplets"] = corpus["parsed"].apply(lambda x: findSVOs(x))
corpus.sample(1)

In [None]:
edge_list = []
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for _id, triplets in corpus["triplets"].items():
    for (source, edge, target) in triplets:
        # keep the lowercased lemma of the first token of each triplet element
        edge_list.append({"id": _id, 
        "source": [x.lemma_ for x in nlp(source)][0].lower(), 
        "target": [x.lemma_ for x in nlp(target)][0].lower(), 
        "edge": [x.lemma_ for x in nlp(edge)][0].lower()
        })

In [None]:
edges = pd.DataFrame(edge_list)

In [None]:
# most common are basic predicates
edges["edge"].value_counts().head(10)

In [None]:
# now can create knowledge graph with networkx utility function
import networkx as nx

G=nx.from_pandas_edgelist(edges, "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())
len(G.nodes)

In [None]:
def plotDistribution(serie: pd.Series, nbins: int, minValue=None, maxValue=None):
    _minValue=int(np.floor(np.log10(minValue if minValue is not None else serie.min())))
    _maxValue=int(np.ceil(np.log10(maxValue if maxValue is not None else serie.max())))
    bins = [0] + list(np.logspace(_minValue, _maxValue, nbins)) + [np.inf]
    serie.hist(bins=bins)
    plt.xscale("log")

def graphSummary(graph, bins=10):
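    # nx.info was removed in NetworkX 3.0, so print the summary by hand
    print(f"{type(graph).__name__} with {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")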
    plt.figure(figsize=(20, 8))
    plt.subplot(1,2,1)
    degrees = pd.Series({k: v for k, v in nx.degree(graph)})
    plt.yscale("log")
    plotDistribution(degrees, bins)
    try:
        plt.subplot(1,2,2)
        # edge weights are optional: edges without a "weight" attribute raise KeyError here
        allEdgesWeights = pd.Series({(d[0], d[1]): d[2]["weight"] for d in graph.edges(data=True)})
        plotDistribution(allEdgesWeights, bins)
        plt.yscale("log")
    except KeyError:
        pass

In [None]:
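# nx.info was removed in NetworkX 3.0, so print the summary by hand
print(f"{type(G).__name__} with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")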

In [None]:
graphSummary(G, bins=15)

In [None]:
import numpy as np
np.log10(pd.Series({k: v for k, v in nx.degree(G)}).sort_values(ascending=False)).hist()
plt.yscale("log")

In [None]:
edges.head()

In [None]:
# look at edges for 'lend'
e = edges[(edges["source"]!=" ") & (edges["target"]!=" ") & (edges["edge"]=="lend")]

G=nx.from_pandas_edgelist(e, "source", "target", 
                          edge_attr=True, create_using=nx.MultiDiGraph())

In [None]:
# visualise
import os

plt.figure(figsize=(13, 6))

pos = nx.spring_layout(G, k=1.2) # k regulates the distance between nodes

nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos, font_size=12)

# plt.show()
# plt.savefig(os.path.join(".", "KnowledgeGraph.png"), dpi=300, format="png")

### Bipartite Graph