## Find frequent and important phrases with NLTK and YAKE

Extract n-grams from the text data, then rank them by pointwise mutual information (PMI) to find the words that co-occur much more often than they would by chance.
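
To make the PMI ranking concrete, here is a minimal standard-library sketch of the usual definition, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), with probabilities estimated from raw counts. The function name `bigram_pmi`, its `min_freq` parameter, and the toy sentence are illustrative assumptions; the notebook itself relies on NLTK's `BigramAssocMeasures.pmi` rather than this code.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=1):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_freq times."""
    n = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        if pair_count < min_freq:
            continue
        # log2 of the observed pair count over the count expected if the
        # two words were independent: count(w1) * count(w2) / n.
        scores[(w1, w2)] = math.log2(
            pair_count * n / (word_counts[w1] * word_counts[w2])
        )
    return scores


# Toy input (hypothetical, not taken from the dataset).
tokens = "high wind speed alarm high wind speed low water vapor".split()
scores = bigram_pmi(tokens, min_freq=2)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

PMI rewards rare pairs: two words that each occur once, side by side, get the highest possible score. That is why a minimum frequency (the `apply_freq_filter(3)` calls in the cells below) is applied before ranking.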

YAKE identifies the most relevant keywords in a text using statistical features extracted from that single text.
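
The cells below call `yake.KeywordExtractor()` with no arguments and print only the phrase. The sketch below spells out a few options and keeps the score, assuming the `lan`, `n`, and `top` keyword arguments of `KeywordExtractor` and that `extract_keywords` returns `(phrase, score)` pairs in which a lower score means a more relevant phrase. The helper names `extract_keyphrases` and `format_keyphrases` are hypothetical, and the scores in the demo are made up for illustration, not real YAKE output.

```python
def extract_keyphrases(text, max_ngram_size=3, top=20):
    """Return YAKE (phrase, score) pairs for a single text."""
    # Imported here so the formatting helper below also works without yake.
    import yake

    extractor = yake.KeywordExtractor(lan="en", n=max_ngram_size, top=top)
    return extractor.extract_keywords(text)


def format_keyphrases(pairs):
    """Format (phrase, score) pairs as lines, lowest (best) score first."""
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return [f"{score:.4f}  {phrase}" for phrase, score in ordered]


# Illustrative pairs with made-up scores (not real YAKE output).
demo = [("servo failure", 0.5), ("correlator resource conflict", 0.02)]
print("\n".join(format_keyphrases(demo)))
```

In the notebook this could be used as `format_keyphrases(extract_keyphrases(OTHER_TEXT))`. Printing the score next to each phrase makes it easier to see where the ranking stops being informative.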

Resources:
- https://www.analyticsvidhya.com/blog/2022/03/keyword-extraction-methods-from-documents-in-nlp/
- https://community.dataiku.com/t5/Knowledge-Base/How-to-use-spaCy-models-in-DSS/tac-p/12090

In [1]:
#Import libraries
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

import nltk
nltk.download('genesis')
nltk.download('punkt')
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


#hide warnings
import warnings
warnings.filterwarnings('ignore')

##KeyPhrase Extraction
#import spacy
#spacy==2.2.4
#https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
#nltk==3.5
#import pytextrank
#from spacy.lang.en import English
#from spacy.pipeline import SentenceSegmenter, EntityRecognizer, TextCategorizer
#from spacy.pipeline.textrank import TextRank
import yake

[nltk_data] Downloading package genesis to /home/dataiku/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package punkt to /home/dataiku/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

In [3]:
# Read recipe inputs
Text_Cleaned = dataiku.Dataset("Text_Cleaned")
Text_Cleaned_df = Text_Cleaned.get_dataframe()
Text_Cleaned_df

Unnamed: 0,SE_DOWNTIMETYPE,SE_SUBJECT_concat_cleaned
0,Other,pr bl da63 control cppcontainer go shutdown pr...
1,Scheduling,bl ape2 tp statistic waiting end execution inc...
2,Technical,slt test please ignore pr bl totalpowerprocess...
3,Weather,bad wather bad weather bad weather shutdown ac...


__________

### Downtime Type - Other

In [4]:
#Split texts by downtime type
OTHER = Text_Cleaned_df.loc[Text_Cleaned_df['SE_DOWNTIMETYPE'] == 'Other']
#print(OTHER)

OTHER_TEXT = OTHER['SE_SUBJECT_concat_cleaned']
OTHER_TEXT = OTHER_TEXT.values[0]
#OTHER_TEXT

In [5]:
# Find word pairs that co-occur more often than chance would predict
# Build a BigramCollocationFinder from the tokenized text
OTHER_finder = BigramCollocationFinder.from_words(nltk.word_tokenize(OTHER_TEXT))

# Keep only bigrams that appear at least 3 times
OTHER_finder.apply_freq_filter(3)

# Return up to 10 bigrams with the highest PMI
OTHER_finder.nbest(bigram_measures.pmi, 10)

[('resource', 'conflict'),
 ('bl', 'correlator'),
 ('pr', 'bl'),
 ('pr1', 'bl'),
 ('other', 'dv06'),
 (',', 'other'),
 ("'", ',')]
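
The last two pairs above consist partly or wholly of punctuation, which suggests the input text still contains stray quote and comma characters (an assumption, since the text itself is not shown). One possible way to keep such tokens out of the ranking is a word filter on the finder. The predicate name `is_punctuation_only` is hypothetical; `apply_word_filter` is the NLTK finder method that removes every candidate n-gram in which any word matches the predicate.

```python
def is_punctuation_only(token):
    """Return True for tokens with no letter or digit, such as ',' or "'"."""
    return not any(ch.isalnum() for ch in token)


# With the finder from the cell above still in scope, this would be applied
# before calling nbest:
# OTHER_finder.apply_word_filter(is_punctuation_only)

# Stand-alone demonstration on three of the pairs printed above.
pairs = [("resource", "conflict"), (",", "other"), ("'", ",")]
kept = [p for p in pairs if not any(is_punctuation_only(w) for w in p)]
print(kept)
```

Filtering on the finder, rather than deleting tokens before building it, avoids creating bigrams from words that were not actually adjacent in the text.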

In [6]:
#YAKE Key Phrase extraction
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(OTHER_TEXT)
for kw in keywords:
    print(kw[0])

correlator resource conflict
resource conflict receive
receive callbacks expect
conflict receive callbacks
failed activate component
activate component control
container crash array
component control array
container crash time
status failed activate
crash time handover
mount subreflector power
show weird status
exception throw clustercommander.cpp
execute method startsubscansequence
work fdm acs
sudden servo failure
servo failure axis
time handover control
ccl handover control


___________

### Downtime Type - Scheduling

In [7]:
#Split texts by downtime type
SCHEDULING = Text_Cleaned_df.loc[Text_Cleaned_df['SE_DOWNTIMETYPE'] == 'Scheduling']
#print(SCHEDULING)

SCHEDULING_TEXT = SCHEDULING['SE_SUBJECT_concat_cleaned']
SCHEDULING_TEXT = SCHEDULING_TEXT.values[0]
#SCHEDULING_TEXT

In [8]:
# Find word pairs that co-occur more often than chance would predict
# Build a BigramCollocationFinder from the tokenized text
SCHEDULING_finder = BigramCollocationFinder.from_words(nltk.word_tokenize(SCHEDULING_TEXT))

# Keep only bigrams that appear at least 3 times
SCHEDULING_finder.apply_freq_filter(3)

# Return up to 10 bigrams with the highest PMI
SCHEDULING_finder.nbest(bigram_measures.pmi, 10)

[('baseline', 'monitor'),
 ('covid-19', 'protocol'),
 ('isolate', 'mode'),
 ('night', 'aod'),
 ('solar', 'campaign'),
 ('solar', 'campaing'),
 ('high', 'phase'),
 ('execute', 'none'),
 ('handover', 'eng'),
 ('observation', 'isolate')]

In [9]:
#YAKE Key Phrase extraction
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(SCHEDULING_TEXT)
for kw in keywords:
    print(kw[0])

project observe end
project observe scheduling
project observe
project observe waiting
observe end shift
project observe aca
end shift gap
project end shift
shift end shift
science project observe
end shift end
project observe weather
project observe sched
observe waiting project
observe no project
project observe gap
project run end
gap end shift
project observe family
gap waiting project


___________

### Downtime Type - Technical

In [10]:
#Split texts by downtime type
TECHNICAL = Text_Cleaned_df.loc[Text_Cleaned_df['SE_DOWNTIMETYPE'] == 'Technical']
#print(TECHNICAL)

TECHNICAL_TEXT = TECHNICAL['SE_SUBJECT_concat_cleaned']
TECHNICAL_TEXT = TECHNICAL_TEXT.values[0]
#TECHNICAL_TEXT

In [11]:
# Find word pairs that co-occur more often than chance would predict
# Build a BigramCollocationFinder from the tokenized text
TECHNICAL_finder = BigramCollocationFinder.from_words(nltk.word_tokenize(TECHNICAL_TEXT))

# Keep only bigrams that appear at least 3 times
TECHNICAL_finder.apply_freq_filter(3)

# Return up to 10 bigrams with the highest PMI
TECHNICAL_finder.nbest(bigram_measures.pmi, 10)

[('corrthreadsyncguard', 'cpp:142'),
 ('ra', 'dec'),
 ('acserr', 'namevalue'),
 ('primary', 'flux'),
 ('timing', 'pulse'),
 ('trex', 'crazy'),
 ('ml2', 'went'),
 ('selection', 'criterium'),
 ('type=10000', 'code=22'),
 ('circuit', 'braker')]

In [12]:
#YAKE Key Phrase extraction
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(TECHNICAL_TEXT)
for kw in keywords:
    print(kw[0])

aca aca aca
aca aos aca
aca fail aca
fail aca fail
issue aca aca
aca error aca
aca failure aca
aca issue aca
aca recovery aca
datum aca cdpcs
correlation datum aca
aca restart aca
aca cdpcs aca
aca aca corr
aos aca error
issue aca fail
issue aca correlator
aca failed receive
aca error invoke
aca timed wait


___________

### Downtime Type - Weather

In [13]:
#Split texts by downtime type
WEATHER = Text_Cleaned_df.loc[Text_Cleaned_df['SE_DOWNTIMETYPE'] == 'Weather']
#print(WEATHER)

WEATHER_TEXT = WEATHER['SE_SUBJECT_concat_cleaned']
WEATHER_TEXT = WEATHER_TEXT.values[0]
#WEATHER_TEXT

In [14]:
# Find word pairs that co-occur more often than chance would predict
# Build a BigramCollocationFinder from the tokenized text
WEATHER_finder = BigramCollocationFinder.from_words(nltk.word_tokenize(WEATHER_TEXT))

# Keep only bigrams that appear at least 3 times
WEATHER_finder.apply_freq_filter(3)

# Return up to 10 bigrams with the highest PMI
WEATHER_finder.nbest(bigram_measures.pmi, 10)

[('water', 'vapor'),
 ('mix', 'lite'),
 ('da', 'dv'),
 ('dvs', 'das'),
 ('little', 'hr'),
 ('guards', 'inform'),
 ('melt', 'little'),
 ('leave', 'track'),
 ('low', 'frequency'),
 ('mounts', 'simulation')]

In [15]:
#YAKE Key Phrase extraction
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(WEATHER_TEXT)
for kw in keywords:
    print(kw[0])

wind high wind
high wind aos
high wind speed
wind speed wind
wind speed aos
high wind high
aos high wind
wind wind wind
wind aos high
high wind wind
wind wind speed
wind speed high
speed wind speed
speed high wind
antenna high wind
weather high wind
wind speed alarm
high wind antenna
wind wind high
wind aos wind


___________

In [16]:
# Write recipe outputs
# Dataset n-grams renamed to KeyPhraseExtraction by vkb6bn on 2023-03-08 17:31:42
n_grams = dataiku.Dataset("KeyPhraseExtraction")
# Not run: n_grams_df is never built in this notebook
#n_grams.write_with_schema(n_grams_df)