## Further Text Processing & Re-Topic Modeling
* Earlier topic models suffered from multiple inflected forms of the same word (meet, meeting, meetings, etc.)
    * This diluted our topic model's accuracy
* Resolve this by further processing our text with `SpaCy` and `textacy` to extract more specific parts of the text
    * E.g. lemma forms of these words, named entities, etc.
* Then re-run `sklearn.decomposition.NMF` and see if we can improve on our results from notebooks (3) and (4)

In [7]:
import pandas as pd 
import numpy as np 
import math
import regex as re  
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from tqdm.auto import tqdm
import textacy

In [2]:
# spacy components 
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

# * download spacy medium English model
# !python -m spacy download en_core_web_md

In [3]:
# load the language object, labelled "nlp" from here on by convention 
## disable parser as it won't be needed for now
nlp = spacy.load('en_core_web_md', disable=["parser"])

* Load DF pickle from notebook (1) 

In [48]:
# read pickled DF from notebook #1 (initial exploratory data visuals)
df = pd.read_pickle('df_un_general_debates.pkl')

### The tokenisation problem from earlier notebook: 
* E.g. problems like: "co-operation" is split into "co" and "operation"
* Need lemmatisation to avoid getting functionally identical inflections of the same word (meet, meets, meetings, etc.)

* Below are functions to overcome this problem and handle other tasks like noun phrase extraction



# FURTHER TEXT PROCESSING WITH SPACY



## Customising Re-tokenisation 
* Spacy/NLTK by default split on: hash signs, hyphens and underscores
    * Hence our problem with "co-operation" 
* Spacy's tokeniser is rule-based:
    * splits text on whitespace
    * then uses prefix, suffix, and infix splitting rules defined by regular expressions to further split the remaining tokens
    * exceptions are applied for English-specific oddities, like "can't" (split into "ca" and "n't", with lemmas "can" and "not")
    
    
* Thus, re-apply logic to create new tokens 
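
To make the rule-based splitting concrete, here is a minimal standard-library sketch of the idea. It is not Spacy's actual implementation: `toy_tokenize` and both pattern lists are hypothetical, heavily simplified stand-ins for Spacy's much larger defaults. It only shows why dropping the hyphen infix rule keeps "co-operation" in one piece.

```python
import re

def toy_tokenize(text, infix_patterns):
    """Split on whitespace, then split each chunk on the given infix patterns."""
    if not infix_patterns:
        return text.split()
    # one capturing group around the alternation so re.split keeps the infix as a token
    infix_re = re.compile("(" + "|".join(f"(?:{p})" for p in infix_patterns) + ")")
    tokens = []
    for chunk in text.split():
        tokens.extend(part for part in infix_re.split(chunk) if part)
    return tokens

# hypothetical, simplified stand-in for the default infix rules
default_infixes = [r"-", r"_"]
# same filtering idea as in the notebook: drop every pattern that matches a hyphen
custom_infixes = [p for p in default_infixes if not re.search(p, "xx-xx")]

print(toy_tokenize("international co-operation", default_infixes))
print(toy_tokenize("international co-operation", custom_infixes))
```

With the hyphen rule in place the word is broken into three tokens; with the filtered list it survives as a single token.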

In [16]:
def custom_tokenizer(nlp):
    
    # start from spaCy's default patterns, dropping the prefix/suffix entries for '-', '_' and '#'
    # and any infix pattern that would match the hyphen in 'xx-xx'
    prefixes = [pattern for pattern in nlp.Defaults.prefixes 
                if pattern not in ['-', '_', '#']]
    
    suffixes = [pattern for pattern in nlp.Defaults.suffixes
                if pattern not in ['_']]
    
    infixes  = [pattern for pattern in nlp.Defaults.infixes
                if not re.search(pattern, 'xx-xx')]

    # these are the default params from spacy's "Tokenizer.__init__"
    return Tokenizer(vocab          = nlp.vocab, 
                     rules          = nlp.Defaults.tokenizer_exceptions,
                     prefix_search  = compile_prefix_regex(prefixes).search,
                     suffix_search  = compile_suffix_regex(suffixes).search,
                     infix_finditer = compile_infix_regex(infixes).finditer,
                     token_match    = nlp.Defaults.token_match)
                     #url_match [optional]


In [6]:
# recall: nlp = spacy.load('en_core_web_md', disable=["parser"])
## thus, alter default behaviour 
nlp.tokenizer = custom_tokenizer(nlp)

In [22]:
# print stopwords 
# print(nlp.Defaults.stop_words)

In [7]:
# add our own custom corpus-specific stopwords again 

additional_stopwords = {'dear','regards','also','would','must'}
# exclude_stopwords = {''}

nlp.Defaults.stop_words |= additional_stopwords
# nlp.Defaults.stop_words -= exclude_stopwords

## LEMMATISATION
* Mapping words to their uninflected roots
* Requires a lookup dictionary + knowledge of the part of speech
    * So it can know that the lemma of the noun "meeting" is "meeting", while the lemma of the verb "meeting" is "meet"
    * Spacy can do this automatically for English using its part-of-speech tagger
    
    
* The `pos_` attribute contains the simplified tag from the universal part-of-speech tagset, which remains stable across different models


* The general logic that follows is that: "Nouns, Verbs, Adjectives, Adverbs" are content words (much of the sentence meaning depends on them)
    * And that function words (pronouns, prepositions, conjunctions, determiners) create grammatical relationships within sentences - not always highly relevant to a lot of NLP applications
    
    
* NOTE: lemmatisation could affect sentiment analysis results
    * As in the example below, (best) becomes (good) which may be misleading for sentiment analysis purposes
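
As a rough illustration of why the part of speech matters, the sketch below uses a tiny hand-written lookup table keyed by (word, POS). The table, the hand-assigned tags and the `toy_lemma` helper are all hypothetical; Spacy's real lemmatiser is far more elaborate.

```python
# hypothetical miniature lookup table keyed by (lowercased word, POS tag)
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("meetings", "NOUN"): "meeting",
    ("best", "ADJ"): "good",
    ("plans", "NOUN"): "plan",
}

# content-word tags, following the bullet list above
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

# hand-tagged tokens (an assumption, not tagger output)
tagged = [("our", "PRON"), ("best", "ADJ"), ("plans", "NOUN"),
          ("for", "ADP"), ("meeting", "VERB")]

# keep only content words and map each to its lemma
content_lemmas = [toy_lemma(word, pos) for word, pos in tagged if pos in CONTENT_POS]
print(content_lemmas)
```

The same surface form "meeting" maps to two different lemmas depending on its tag, which is why a plain dictionary of words would not be enough.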

In [34]:
#E.G. filtering on stopwords / punctuation
sample_text = "dear Ryan, we need to sit down and talk about our best and most exciting plans. regards, Pete"
sample_doc = nlp(sample_text)

print(*[t.lemma_ for t in sample_doc if not t.is_stop and not t.is_punct], sep=" | ")

Ryan | need | sit | talk | good | exciting | plan | Pete


In [39]:
#E.G. filtering on POS tag
sample_text = "dear Ryan, we need to sit down and talk about our best and most exciting plans. regards, Pete"
sample_doc = nlp(sample_text)

print(*[t.lemma_ for t in sample_doc if t.pos_ in ['NOUN','PROPN']], sep=" | ")

dear | Ryan | plan | regard | Pete


## Textacy's `extract.words` function
* Utilising POS tags + additional token properties like: 
    * `is_punct` or `is_stop` 
    
    
* For more information, see docs:
    * https://textacy.readthedocs.io/en/0.11.0/api_reference/extract.html

In [60]:
sample_text = "dear Ryan, we need to sit down and talk about our best and most exciting plans. regards, Pete"
doc = nlp(sample_text)

In [61]:
# extract tokens for ['adjectives','nouns'] from passed in nlp("doc") object 
tokens = textacy.extract.words(doc, 
            filter_stops = True,           # default True, no stopwords
            filter_punct = True,           # default True, no punctuation
            filter_nums = True,            # default False, no numbers
            include_pos = ['ADJ', 'NOUN'], # default None = include all
            exclude_pos = None,            # default None = exclude none
            min_freq = 1)                  # minimum frequency of words

In [62]:
print(*[t for t in tokens], sep='|')

best|exciting|plans


* Then create a wrapper around `textacy` by forwarding keyword arguments from (**kwargs)
    * This function then accepts the same parameters as textacy's `extract.words` 

In [63]:
def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [64]:
lemmas = extract_lemmas(doc, include_pos=['ADJ', 'NOUN'])
print(*lemmas, sep='|')

good|exciting|plan


* "best exciting plans" becomes lemmatised as "good exciting plan" 

### Noun phrase extraction 
* based on POS patterns
    * function takes:
        * Doc
        * list of POS tags
        * A separator character to join the words of the noun phrase
        

* The constructed pattern searches for sequences of nouns that are preceded by a token with one of the pre-specified POS tags
    * Returns lemmas
    
    
* Will extract all phrases consisting of an adjective or a noun followed by a sequence of nouns
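
The matching logic can be sketched without Spacy or textacy. The hypothetical `toy_noun_phrases` below works on hand-tagged (lemma, POS) pairs, which are an assumption rather than real tagger output. It emits every match of "one token with a chosen POS, followed by one or more nouns", including overlapping ones.

```python
def toy_noun_phrases(tagged_lemmas, preceding_pos=("NOUN",), sep="_"):
    """Find <preceding_pos token> followed by one or more NOUN tokens."""
    phrases = []
    for start, (_, pos) in enumerate(tagged_lemmas):
        if pos not in preceding_pos:
            continue
        end = start + 1
        # extend over the run of nouns, emitting a phrase at every length
        while end < len(tagged_lemmas) and tagged_lemmas[end][1] == "NOUN":
            end += 1
            phrases.append(sep.join(lemma for lemma, _ in tagged_lemmas[start:end]))
    return phrases

# hand-tagged lemmas for "My best friend Ryan Peters also likes fancy adventure games."
tagged = [("my", "PRON"), ("good", "ADJ"), ("friend", "NOUN"), ("Ryan", "PROPN"),
          ("Peters", "PROPN"), ("also", "ADV"), ("like", "VERB"), ("fancy", "ADJ"),
          ("adventure", "NOUN"), ("game", "NOUN"), (".", "PUNCT")]

print(toy_noun_phrases(tagged, ["NOUN"]))
print(toy_noun_phrases(tagged, ["ADJ"]))
```

Proper nouns are tagged `PROPN`, so "friend Ryan" is not picked up here; only common nouns extend a phrase.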

In [5]:
# slight syntax difference depending on textacy version
# if version >= 0.11, call textacy.extract.matches.token_matches (older versions call textacy.extract.matches directly)
print(textacy.__version__)

0.11.0


In [6]:
def extract_noun_phrases(doc, preceding_pos=['NOUN'], sep='_'):
    patterns = []
    for pos in preceding_pos:
        patterns.append(f"POS:{pos} POS:NOUN:+")

    spans = textacy.extract.matches.token_matches(doc, patterns=patterns)

    return [sep.join([t.lemma_ for t in s]) for s in spans]

In [9]:
# paste example again (the trailing space keeps "plans," and "as" from running together)
sample_text = ("dear Ryan, we need to sit down and talk about our best and most exciting plans, "
               "as well as our fanciest adventures. regards, Pete")

doc = nlp(sample_text)

In [10]:
print(*extract_noun_phrases(doc, ['ADJ', 'NOUN']), sep=' | ')

exciting_plan | fancy_adventure


## Extracting Named Entities
* makes use of SpaCy's `displacy` visualiser

In [44]:
from spacy import displacy 
# from pathlib import Path

In [32]:
df.query('country=="AUS" and year==2015')

Unnamed: 0,session,year,country,country_name,speaker,position,text,length,tokens,len_tokens
7322,70,2015,AUS,Australia,Ms. Julie Bishop,Minister for Foreign Affairs,We meet this day at an important time for the ...,12832,"[meet, day, important, time, united, nations, ...",1095


In [22]:
# pick an example text, E.G. from Australia in 2015
# 7322 is the index label shown in the query result above, so look it up by label
sample_aus_2015 = df.loc[7322, 'text']

In [38]:
# put into spacy // convert object dtype to string
sample_aus_2015_doc = nlp(str(sample_aus_2015))

In [48]:
# note: pass in jupyter=False to avoid premature rendering in notebook
html = displacy.render(sample_aus_2015_doc, style="ent", jupyter=False)

In [51]:
# write and visualise 
with open("sample_aus_2015_vis.html", "w", encoding="utf-8") as f:
    f.write(html)

* see REPO for the example without needing to run code 

## Extracting named entities of certain types using `textacy.extract.entities`
* documentation:
    * https://textacy.readthedocs.io/en/0.11.0/api_reference/extract.html

In [5]:
def extract_entities(doc, include_types=None, sep='_'):

    ents = textacy.extract.entities(doc,
             include_types=include_types, # pass in with function call 
             exclude_types=None,
             drop_determiners=True,  #"the" etc.
             min_freq=1) # remove entities which occur fewer than min_freq() times

    # return the joined lemmas, with " / " separating them from the NER tag (e.g. GPE, PERSON, ORG)
    return [sep.join([t.lemma_ for t in e])+' / '+e.label_ for e in ents]

In [13]:
# random fake string
sample_text = "Henry Kissinger, ex-chairman of Microsoft, now lives in San Francisco, USA."
doc = nlp(sample_text)

In [15]:
extract_entities(doc, ['PERSON', 'GPE', 'ORG', 'LOC'])

['Henry_Kissinger / PERSON',
 'Microsoft / ORG',
 'San_Francisco / GPE',
 'USA / GPE']

## Combined function-call with all sub-elements 
* combines:
    * `extract_lemmas`
    * `extract_noun_phrases`
    * `extract_entities` 
    

In [17]:
def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [18]:
def extract_noun_phrases(doc, preceding_pos=['NOUN'], sep='_'):
    patterns = []
    for pos in preceding_pos:
        patterns.append(f"POS:{pos} POS:NOUN:+")

    spans = textacy.extract.matches.token_matches(doc, patterns=patterns)

    return [sep.join([t.lemma_ for t in s]) for s in spans]

In [19]:
def extract_entities(doc, include_types=None, sep='_'):

    ents = textacy.extract.entities(doc,
             include_types=include_types, # pass in with function call 
             exclude_types=None,
             drop_determiners=True,  #"the" etc.
             min_freq=1) # remove entities which occur fewer than min_freq() times

    # return the joined lemmas, with " / " separating them from the NER tag (e.g. GPE, PERSON, ORG)
    return [sep.join([t.lemma_ for t in e])+' / '+e.label_ for e in ents]

In [41]:
def extract_nlp(doc):
    return {
    'lemmas'          : extract_lemmas(doc, 
                                     exclude_pos = ['PART', 'PUNCT', 
                                        'DET', 'PRON', 'SYM', 'SPACE'],
                                     filter_stops = True),
    'adjs_verbs'      : extract_lemmas(doc, include_pos = ['ADJ', 'VERB']),
    'nouns'           : extract_lemmas(doc, include_pos = ['NOUN', 'PROPN']),
    'noun_phrases'    : extract_noun_phrases(doc, ['NOUN']),
    'adj_noun_phrases': extract_noun_phrases(doc, ['ADJ']),
    'entities'        : extract_entities(doc, ['PERSON', 'ORG', 'GPE', 'LOC'])
    }

In [42]:
additional_stopwords = {'dear','regards','also','would','must'}
# exclude_stopwords = {''}

nlp.Defaults.stop_words |= additional_stopwords

In [43]:
#E.G. 
text = "My best friend Ryan Peters also likes fancy adventure games."
doc = nlp(text)
for col, values in extract_nlp(doc).items():
    print(f"{col}: {values}")

lemmas: ['good', 'friend', 'Ryan', 'Peters', 'like', 'fancy', 'adventure', 'game']
adjs_verbs: ['good', 'like', 'fancy']
nouns: ['friend', 'Ryan', 'Peters', 'adventure', 'game']
noun_phrases: ['adventure_game']
adj_noun_phrases: ['good_friend', 'fancy_adventure', 'fancy_adventure_game']
entities: ['Ryan_Peters / PERSON']


In [38]:
# view the additional grammatical elements to be added to the DF
nlp_columns = list(extract_nlp(nlp.make_doc('')).keys())
print(nlp_columns)

['lemmas', 'adjs_verbs', 'nouns', 'noun_phrases', 'adj_noun_phrases', 'entities']


In [39]:
# initialise empty variables to over-write with data
for col in nlp_columns:
    df[col] = None

In [32]:
# defined at start of notebook but again here for clarity 
nlp.tokenizer = custom_tokenizer(nlp)

In [35]:
batch_size = 50
batches = math.ceil(len(df) / batch_size)

In [36]:
batches

151
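
The batching arithmetic can be checked in isolation. The sketch below uses a hypothetical list of 120 placeholder strings instead of the real speeches, and a hypothetical helper `iter_batches`. It shows that `math.ceil` counts the final, shorter batch, and that `start + j` recovers each item's position in the full list.

```python
import math

def iter_batches(items, batch_size):
    """Yield (start_offset, batch) pairs that cover items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

texts = [f"speech {n}" for n in range(120)]  # hypothetical stand-in for df['text']
batch_size = 50

n_batches = math.ceil(len(texts) / batch_size)
sizes = [len(batch) for _, batch in iter_batches(texts, batch_size)]

# start + j is the position of the j-th item of a batch within the full list
positions = [start + j
             for start, batch in iter_batches(texts, batch_size)
             for j in range(len(batch))]

print(n_batches, sizes)
```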

In [None]:
# WARNING # 1H runtime total
for i in tqdm(range(0, len(df), batch_size), total=batches):
    docs = nlp.pipe(df['text'][i:i+batch_size])
    
    for j, doc in enumerate(docs):
        for col, values in extract_nlp(doc).items():
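            # .at writes straight into df; df[col].iloc[...] = ... is chained assignment and may not update df
            df.at[df.index[i+j], col] = values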

In [46]:
#E.G. 
## some further analysis can be done on these other grammatical structures after 
df.head(3)

Unnamed: 0,session,year,country,country_name,speaker,position,text,length,tokens,len_tokens,lemmas,adjs_verbs,nouns,noun_phrases,adj_noun_phrases,entities
0,25,1970,ALB,Albania,Mr. NAS,,33: May I first convey to our President the co...,51419,"[may, first, convey, president, congratulation...",4125,"[33, convey, President, congratulation, albani...","[convey, albanian, twenty-fifth, take, twenty-...","[President, congratulation, delegation, electi...","[balance_sheet, world_security, world_arena, l...","[albanian_delegation, twenty-fifth_session, fi...","[General_Assembly / ORG, General_Assembly / OR..."
1,25,1970,ARG,Argentina,Mr. DE PABLO PARDO,,177.\t : It is a fortunate coincidence that pr...,29286,"[fortunate, coincidence, precisely, time, unit...",2327,"[177, fortunate, coincidence, precisely, time,...","[fortunate, celebrate, twenty-five, eminent, l...","[coincidence, time, United, Nations, year, exi...","[state_limit, extent_idea, case_discouragement...","[fortunate_coincidence, twenty-five_year, emin...","[United_Nations / ORG, Organization / ORG, Gen..."
2,25,1970,AUS,Australia,Mr. McMAHON,,100.\t It is a pleasure for me to extend to y...,31839,"[pleasure, extend, mr, president, warmest, con...",2545,"[100, pleasure, extend, Mr., President, warm, ...","[extend, warm, distinguished, play, authoritat...","[pleasure, Mr., President, congratulation, Aus...","[anniversary_session, war_wnidi, world_peace, ...","[warm_congratulation, distinguished_part, auth...","[Australia_Government / ORG, General_Assembly ..."


## Lemmatisation & Topic Modeling
* Just restrict our function to lemmatisation (reduces processing time and is more relevant to our topic model)
* SpaCy's medium/large models are generally expected to lemmatise more accurately than the NLTK approach used earlier
* Use these lemmas to retry topic modeling with sklearn

In [59]:
def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

In [73]:
def extract_nlp_simp(doc):
    return {
    'lemmas'          : extract_lemmas(doc, 
                                     include_pos = ['ADJ', 'NOUN'],
                                     filter_stops = True)
    }

In [50]:
# split each text into paragraphs: break at sentence-ending punctuation followed by optional whitespace and a newline
df["paragraphs"] = df["text"].map(lambda text: re.split(r'[.?!]\s*\n', text))

In [51]:
paragraph_df = pd.DataFrame([{ "text": paragraph, "year": year } 
                               for paragraphs, year in zip(df["paragraphs"], df["year"]) 
                                    for paragraph in paragraphs if paragraph])

In [53]:
# rebuild, keeping just the text (this overwrites the version with "year" above)
paragraph_df = pd.DataFrame([{ "text": paragraph} 
                               for paragraphs in (df["paragraphs"]) 
                                    for paragraph in paragraphs if paragraph])
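
The paragraph split used above can be tried on a small hypothetical string. This sketch uses the standard-library `re` rather than the `regex` package imported in the notebook, on the assumption that both behave the same for this simple pattern. Note that the split consumes the closing punctuation and can leave an empty trailing piece, which is why empty paragraphs are filtered out.

```python
import re

# same pattern as in the notebook, written as a raw string
PARAGRAPH_BREAK = re.compile(r"[.?!]\s*\n")

# hypothetical sample text with three newline-terminated paragraphs
sample = "First paragraph ends here.\nSecond paragraph? \nThird paragraph!\n"

pieces = PARAGRAPH_BREAK.split(sample)
paragraphs = [p for p in pieces if p]  # drop the empty trailing piece

print(pieces)
print(paragraphs)
```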

In [56]:
#E.G.
paragraph_df.head(3)

Unnamed: 0,text
0,33: May I first convey to our President the co...
1,34.\tIn taking up the work on the agenda of th...
2,35.\tThe utilization of the United Nations to ...


In [82]:
# 1000 paragraphs per batch, 283 batches to cover the whole DF
batch_size = 1000
batches = math.ceil(len(paragraph_df) / batch_size) 

In [84]:
# initialise empty values to iterate over in loop through nlp.doc objects 
paragraph_df['noun_adj_lemmas'] = None

In [85]:
# loop over the paragraph texts in batches and fill "noun_adj_lemmas" with each paragraph's lemmas
for i in tqdm(range(0, len(paragraph_df), batch_size), total=batches):
    docs = nlp.pipe(paragraph_df['text'][i:i+batch_size])
    
    for j, doc in enumerate(docs):
        # .at writes straight into paragraph_df and avoids chained assignment
        paragraph_df.at[paragraph_df.index[i+j], 'noun_adj_lemmas'] = extract_nlp_simp(doc)['lemmas']


In [86]:
#E.G. 
paragraph_df

Unnamed: 0,text,noun_adj_lemmas
0,33: May I first convey to our President the co...,"[congratulation, albanian, delegation, electio..."
1,34.\tIn taking up the work on the agenda of th...,"[work, agenda, twenty-, fifth, session, eve, t..."
2,35.\tThe utilization of the United Nations to ...,"[utilization, policy, able, hand, aggression, ..."
3,36.\tThe whole of progressive mankind recalls ...,"[progressive, mankind, admiration, heroic, str..."
4,37.\tAll this has had well known consequences ...,"[consequence, damaging, authority, ability, in..."
...,...,...
282205,"For some months now, we have watched heartbrea...","[month, heartbreaking, harrowing, scene, despe..."
282206,This tragic situation could have been avoided ...,"[tragic, situation, respect, independence, cou..."
282207,"My country, Zimbabwe, is committed to a fair, ...","[country, fair, effective, multilateralism, in..."
282208,We invite other countries with which we may ha...,"[country, difference, nature, threat, pressure..."


In [87]:
## update pickle 
# paragraph_df.to_pickle('paragraph_df_lemmas.pkl')

In [4]:
# read pickle 
paragraph_df = pd.read_pickle('paragraph_df_lemmas.pkl')

In [34]:
# add some additional corpus specific stopwords 
additional_stopwords = {'dear','regards','also','would','must', 'congratulation',
                       'world','new','year', 'nation', 'today', 'time', 'challenge', 'great', 'people',
                       'session', 'delegation', 'presidency', 'behalf', 'country', 'co', 'international', 'operation'} 
# exclude_stopwords = {''}

nlp.Defaults.stop_words |= additional_stopwords

In [35]:
## tf-idf vectoriser for the paragraph lemmas, as in notebook #3 for sklearn 

## start with just unigrams 
## stop_words is passed as a list: recent scikit-learn versions reject a set
tfidf_para_vectorizer_lemma = TfidfVectorizer(stop_words=list(nlp.Defaults.stop_words), 
                                                 min_df=6, 
                                                 max_df=0.7, 
                                                 smooth_idf = True, 
                                                 ngram_range=(1,1))

In [18]:
# tfidf vectoriser requires strings, not list (quickly adjust the output) 
paragraph_df['noun_adj_lemma_str'] = paragraph_df['noun_adj_lemmas'].apply(lambda x: ' '.join(map(str, x)))

In [19]:
paragraph_df.head()

Unnamed: 0,text,noun_adj_lemmas,noun_adj_lemma_str
0,33: May I first convey to our President the co...,"[congratulation, albanian, delegation, electio...",congratulation albanian delegation election tw...
1,34.\tIn taking up the work on the agenda of th...,"[work, agenda, twenty-, fifth, session, eve, t...",work agenda twenty- fifth session eve twenty-f...
2,35.\tThe utilization of the United Nations to ...,"[utilization, policy, able, hand, aggression, ...",utilization policy able hand aggression part w...
3,36.\tThe whole of progressive mankind recalls ...,"[progressive, mankind, admiration, heroic, str...",progressive mankind admiration heroic struggle...
4,37.\tAll this has had well known consequences ...,"[consequence, damaging, authority, ability, in...",consequence damaging authority ability incumbe...


In [36]:
%%time 

tfidf_para_lemma_vectors = tfidf_para_vectorizer_lemma.fit_transform(paragraph_df['noun_adj_lemma_str'])



Wall time: 5.19 s


In [23]:
tfidf_para_lemma_vectors.shape

(282210, 11224)
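
The second number in the shape is the vocabulary size left after the `min_df` / `max_df` cut-offs. As a rough standard-library sketch of that filtering, with an integer `min_df` read as an absolute document count and a float `max_df` as a proportion of documents, consider the hypothetical helper and toy corpus below. Neither is part of scikit-learn.

```python
from collections import Counter

def filter_vocabulary(tokenised_docs, min_df, max_df):
    """Keep terms whose document frequency lies between min_df (count) and max_df (proportion)."""
    n_docs = len(tokenised_docs)
    # count each term once per document
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(term for term, freq in doc_freq.items() if min_df <= freq <= max_count)

# hypothetical toy corpus of already-lemmatised paragraphs
docs = [
    ["peace", "security", "nation"],
    ["peace", "nation", "weapon"],
    ["nation", "development"],
    ["nation", "peace", "development", "treaty"],
]

vocab = filter_vocabulary(docs, min_df=2, max_df=0.8)
print(vocab)
```

Here "nation" is dropped for appearing in every document, while "security", "weapon" and "treaty" are dropped for appearing only once.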

## Re-import display topics function from notebook (3)
* Or imported as separate py function in github repo for simplicity 

In [32]:
# function defined in notebook #3 as separate py file 
from display_topics_func import display_topics
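
`display_topics` itself lives in notebook (3) and is not reproduced here. The sketch below is only a guess at the general shape of such a helper, under the assumption that the printed numbers are each word's percentage share of its topic's total weight. The names `top_words_per_topic` and `display_topics_sketch` are hypothetical, and the sketch takes the component rows directly (e.g. `model.components_`) instead of the model object.

```python
def top_words_per_topic(components, features, top_words=5):
    """For each topic row, return (word, percentage share of the topic's weight) pairs."""
    topics = []
    for weights in components:
        total = sum(weights)
        # indices of the largest weights first
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([(features[i], 100.0 * weights[i] / total) for i in ranked[:top_words]])
    return topics

def display_topics_sketch(components, features, top_words=5):
    """Print the top words per topic in a 'Topic NN' layout."""
    for topic_no, words in enumerate(top_words_per_topic(components, features, top_words)):
        print(f"\nTopic {topic_no:02d}")
        for word, share in words:
            print(f"  {word} ({share:.2f})")

# hypothetical 2-topic, 4-word component matrix
components = [[3.0, 1.0, 0.0, 0.0],
              [0.0, 0.5, 1.5, 2.0]]
features = ["peace", "security", "nuclear", "weapon"]

display_topics_sketch(components, features, top_words=2)
```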

In [37]:
# from: from sklearn.decomposition import NMF
# repeat steps from initial model in notebook #3
# try with 10 topics first 
nmf_para_model = NMF(n_components=10, random_state=42)

In [38]:
%%time

# define the W and H matrices from the para model + the previously defined para vectors variable 
W_para_matrix = nmf_para_model.fit_transform(tfidf_para_lemma_vectors)
H_para_matrix = nmf_para_model.components_



Wall time: 26.6 s


In [39]:
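# display topics for paragraph model, set top words = 8 
# change to para vectorizer 
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
display_topics(nmf_para_model, tfidf_para_vectorizer_lemma.get_feature_names_out(),
              top_words=8)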


Topic 00
  problem (11.88)
  solution (6.64)
  question (1.42)
  debt (0.83)
  search (0.77)
  peaceful (0.74)
  effort (0.71)
  concern (0.71)

Topic 01
  economic (4.54)
  development (4.54)
  social (1.92)
  economy (1.09)
  growth (1.03)
  resource (0.90)
  political (0.88)
  sustainable (0.86)

Topic 02
  right (10.51)
  human (8.26)
  freedom (1.68)
  respect (1.49)
  fundamental (1.43)
  violation (1.11)
  protection (1.10)
  democracy (1.08)

Topic 03
  nuclear (7.87)
  weapon (6.12)
  disarmament (3.03)
  arm (2.05)
  proliferation (1.56)
  destruction (1.46)
  mass (1.23)
  race (1.15)

Topic 04
  peace (14.89)
  security (7.20)
  stability (2.25)
  region (2.16)
  maintenance (1.28)
  process (1.24)
  keeping (1.18)
  justice (1.14)

Topic 05
  election (4.22)
  general (3.19)
  work (3.11)
  secretary (3.09)
  predecessor (1.66)
  success (1.64)
  appreciation (1.58)
  tribute (1.49)

Topic 06
  principle (2.17)
  independence (1.57)
  relation (1.31)
  policy (1.23)
  sta

* The bulk of the topics are clearly better defined than in notebook (3), which did not use lemmas, so this looks like a real improvement
    * This shows in how quickly each word's contribution drops off within a topic
    * The exception is topic 8, which also has the highest share of documents assigned to it (see below), probably because it is much broader than the others

In [40]:
# share of the total topic weight in W that falls on each topic (a rough proxy for the % of documents per topic) 
for i, val in enumerate(W_para_matrix.sum(axis=0) / W_para_matrix.sum() * 100.0 ):
    print("Topic {} = {:.2f}%".format(i, val))

Topic 0 = 5.14%
Topic 1 = 13.23%
Topic 2 = 8.60%
Topic 3 = 6.02%
Topic 4 = 10.92%
Topic 5 = 9.15%
Topic 6 = 13.27%
Topic 7 = 7.41%
Topic 8 = 17.74%
Topic 9 = 8.52%
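
The percentages above are shares of the summed weights in W, which is not quite the same as counting documents. The sketch below contrasts the two on a small hypothetical document-topic matrix, using plain lists and two hypothetical helpers, `weight_share` and `dominant_share`.

```python
def weight_share(W):
    """Percentage of the total topic weight that falls on each topic (column sums / total)."""
    n_topics = len(W[0])
    col_sums = [sum(row[k] for row in W) for k in range(n_topics)]
    total = sum(col_sums)
    return [100.0 * s / total for s in col_sums]

def dominant_share(W):
    """Percentage of documents whose largest weight is on each topic."""
    n_topics = len(W[0])
    counts = [0] * n_topics
    for row in W:
        counts[max(range(n_topics), key=lambda k: row[k])] += 1
    return [100.0 * c / len(W) for c in counts]

# hypothetical document-topic weights: 4 documents, 2 topics
W = [[0.9, 0.1],
     [0.2, 0.8],
     [0.6, 0.0],
     [0.3, 0.1]]

print([round(v, 2) for v in weight_share(W)])
print([round(v, 2) for v in dominant_share(W)])
```

The two measures can differ noticeably, so the weight-based figures are best read as an approximation of how many documents each topic dominates.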


## Future Scope of Work


* Try re-iterating with NER, e.g. to see mentions of organisations by year
* Use the other grammatical structures to draw further insights
* Try text summarisation techniques 
