# Title: Exploring Textual Patterns / Performing Information Extractions

This notebook extracts insights from news articles with rule-based information extraction (IE) and explores whether out-of-the-box relationships can be extracted as well. It covers the following steps:
- Text Preprocessing
- Rule 1 for IE: Noun-Verb-Noun Extraction
- Rule 2 for IE: Adjective-Noun Extraction
- Rule 3 for IE: Preposition-Noun Extraction
- Rule 4 for IE: Combination of NVN + AD Extraction based rules

Each step is detailed in its own section below.

## Generic Actions

In [1]:
import os
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'c:\\Users\\manash.jyoti.konwar\\Documents\\AI_Random_Projects\\NLP-Information-Pattern-Finder'

### Libraries Import

In [2]:
import string
import multiprocessing

import spacy
import pandas as pd
import dask.dataframe as dd

from tqdm import tqdm
from dask.diagnostics import ProgressBar
ProgressBar().register()

from sn_textual_preprocessing import *

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_colwidth", 400)

### Notebook Variables

In [3]:
# Input file path
input_filepath = os.path.join('input', 'news_articles_dataset.csv')

# Derived file path
output_path = 'output'
if not os.path.exists(output_path):
    os.makedirs(output_path)

sample_frac = 0.1
# spacy_model_name = 'en_core_web_trf'
spacy_model_name = 'en_core_web_lg'

ouptut_overall_data = os.path.join(output_path, 'df_nvn_news.csv')
ouptut_nvn_sep_data = os.path.join(output_path, 'df_nvn_sep_news.csv')

In [4]:
try:
    nlp_spacy_en_model = spacy.load(spacy_model_name)
except OSError:
    spacy.cli.download(spacy_model_name)
    nlp_spacy_en_model = spacy.load(spacy_model_name)

### Reading data

In [5]:
input_data = pd.read_csv(input_filepath)
input_data.columns = [col_name.upper() for col_name in input_data.columns]
input_data.shape

(2225, 3)

In [6]:
sample_data = input_data.groupby('CATEGORIES', group_keys=False).apply(lambda x: x.sample(frac=sample_frac, random_state=42))
sample_data.shape

(223, 3)

In [7]:
sample_data.CATEGORIES.value_counts()

business         51
sport            51
politics         42
tech             40
entertainment    39
Name: CATEGORIES, dtype: int64

## Text Preprocessing  

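The steps are as follows:
- Remove mentions and hashtags
- Remove URLs
- Remove contractions
- Remove stopwords and punctuations
- Lemmatize all words and lower case each of them
- Remove redundant domain specific words
- Remove extra spaces

The preprocess_text cell below applies only part of this list: stopword and digit removal are switched off, the characters -, % and . are kept, and no lemmatization or lower-casing helper is called.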

In [8]:
def preprocess_text(text):
    result = remove_urls(text)
    result = remove_mentions_hashtags(result)
    result = remove_contractions(result)
    result = remove_stopwords_punc_nos(result, 
                                       remove_stopwords_flag=False, 
                                       punc_2_remove=string.punctuation.replace('-','').replace('%','').replace('.',''), 
                                       remove_digits_flag=False,
                                       remove_pattern_punc_flag=True)
    result = remove_extra_spaces(result)
    return result
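
The helpers called above come from sn_textual_preprocessing, which is not shown in this notebook. The block below is only a rough sketch of how three of them might look, assuming simple regular expressions; every name ending in _sketch is hypothetical and the real helpers may behave differently.

```python
import re

# Hypothetical stand-ins for helpers from sn_textual_preprocessing (not shown here).
URL_RE = re.compile(r'(?:https?://|www\.)\S+')
MENTION_HASHTAG_RE = re.compile(r'[@#]\w+')
MULTISPACE_RE = re.compile(r'[ \t]+')

def remove_urls_sketch(text):
    """Drop http(s):// and www. style links."""
    return URL_RE.sub('', text)

def remove_mentions_hashtags_sketch(text):
    """Drop @mentions and #hashtags."""
    return MENTION_HASHTAG_RE.sub('', text)

def remove_extra_spaces_sketch(text):
    """Collapse runs of spaces or tabs and trim the ends; newlines are left alone."""
    return MULTISPACE_RE.sub(' ', text).strip()

def preprocess_text_sketch(text):
    """Chain the three sketched helpers in the same order as preprocess_text."""
    result = remove_urls_sketch(text)
    result = remove_mentions_hashtags_sketch(result)
    return remove_extra_spaces_sketch(result)

print(preprocess_text_sketch('Read  more at https://example.com/news  via @bbc #retail today'))
```

The sketch skips contraction and punctuation handling, since their behaviour cannot be inferred from this notebook alone.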

In [9]:
sample_data['PREPROCESSED_TEXT'] = (
    dd.from_pandas(sample_data.ARTICLES, npartitions=4*multiprocessing.cpu_count())
      .map_partitions(lambda dframe: dframe.apply(preprocess_text))
      .compute(scheduler='processes')
)

[########################################] | 100% Completed | 7.85 ss


## Experimentation

In [10]:
sample_data.head(5)

Unnamed: 0,ARTICLES,SUMMARIES,CATEGORIES,PREPROCESSED_TEXT
480,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...","""The retail sales figures are very weak, but as Bank of England governor Mervyn King indicated last night, you don't really get an accurate impression of Christmas trading until about Easter,"" said Mr Shaw.The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.A number of retailers have already reported poor figures for December.Retail sales droppe...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...
449,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...","US retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.Sales for the year also broke through the $4 trillion mark for the first time - with annual sales coming in at $4.06 trillion However, if automotives are excluded from December's data, retail sales rose just 0.3% on the month.Retail sales are seen as a major part of consumer spending - which...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...
475,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...","Shares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.Previously, only NCCI has been legally allowed to offer insurance products within Saudi Arabia.The listing was part of the country's plans to open up its insurance market and boost demand in the sector.Saudi Arabia now wants a fully functioning insurance industry and is int...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...
434,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...","Australian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.Fosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share.Fosters bought the 18.8% stake from Reline Investments, the family investment firm for the Oatleys, who founded the Rosemou...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...
368,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...","Inbev was formed in August 2004 when Belgium's Interbrew bought Brazilian brewer Ambev.Brewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).Sun Interbrew, which employs 8,000 staff, owns breweries in eight Russian cities - Klin, Ivanovo, Saransk, Kursk, Volzhsky, Omsk, Perm and Novocheboksarsk.The d...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...


In [11]:
test_text = sample_data.PREPROCESSED_TEXT[480]
test_text

'Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of retailers have already reported poor figures for December. Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth according to the ONS. The last time retailers endured a tougher Christmas was 23 years previously when sales plunged 1.7%. The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures. Some analysts put a positive gloss on the figures pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003.. The November-Decembe

In [12]:
doc = nlp_spacy_en_model(test_text)

for token in doc:
    print(token.text,'->',token.pos_)
    
from spacy import displacy 
displacy.render(doc, style='dep',jupyter=True)

Christmas -> PROPN
sales -> NOUN
worst -> ADV
since -> SCONJ
1981 -> NUM


 -> SPACE
UK -> PROPN
retail -> ADJ
sales -> NOUN
fell -> VERB
in -> ADP
December -> PROPN
failing -> VERB
to -> PART
meet -> VERB
expectations -> NOUN
and -> CCONJ
making -> VERB
it -> PRON
by -> ADP
some -> DET
counts -> NOUN
the -> DET
worst -> ADJ
Christmas -> PROPN
since -> SCONJ
1981 -> NUM
.. -> PUNCT
Retail -> ADJ
sales -> NOUN
dropped -> VERB
by -> ADP
1 -> NUM
% -> NOUN
on -> ADP
the -> DET
month -> NOUN
in -> ADP
December -> PROPN
after -> ADP
a -> DET
0.6 -> NUM
% -> NOUN
rise -> NOUN
in -> ADP
November -> PROPN
the -> DET
Office -> PROPN
for -> ADP
National -> PROPN
Statistics -> PROPN
ONS -> PROPN
said -> VERB
. -> PUNCT
The -> DET
ONS -> PROPN
revised -> VERB
the -> DET
annual -> ADJ
2004 -> NUM
rate -> NOUN
of -> ADP
growth -> NOUN
down -> ADP
from -> ADP
the -> DET
5.9 -> NUM
% -> NOUN
estimated -> VERB
in -> ADP
November -> PROPN
to -> ADP
3.2 -> NUM
% -> NOUN
. -> PUNCT
A -> DET
number -> NOUN

## Rule 1 for IE: NVN Extraction

In [13]:
# Function for rule 1: noun (subject), verb, noun (object)
def rule_nvn(text):
    doc = nlp_spacy_en_model(text)
    sent = []
    for token in doc:
        # only verbs can anchor a subject-verb-object phrase
        if token.pos_ != 'VERB':
            continue
        # noun, proper-noun or pronoun subjects to the left of the verb
        for subj_tok in token.lefts:
            if (subj_tok.dep_ in ['nsubj','nsubjpass']) and (subj_tok.pos_ in ['NOUN','PROPN','PRON']):
                # noun or proper-noun direct objects to the right of the verb
                for obj_tok in token.rights:
                    if (obj_tok.dep_ == 'dobj') and (obj_tok.pos_ in ['NOUN','PROPN']):
                        # subject + verb lemma + object, built fresh for every
                        # pair so earlier matches do not leak into later phrases
                        sent.append(subj_tok.text + ' ' + token.lemma_ + ' ' + obj_tok.text)
    return sent

In [14]:
rule_nvn(test_text)

['ONS revise rate',
 'number report figures',
 'retailers endure Christmas',
 'ONS echo caution',
 'analysts put gloss',
 'figures show performance',
 'measures cut prices',
 'figures have effect',
 'you get impression',
 'Bank keep powder']
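
To make the rule's mechanics visible without loading a spaCy model, the sketch below replays it on a tiny hand-annotated parse. The Tok class and the part-of-speech and dependency labels are assumptions written by hand for illustration, not output from the model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tok:
    """Hypothetical, hand-annotated stand-in for a spaCy token."""
    text: str
    lemma: str
    pos: str
    dep: str
    lefts: List["Tok"] = field(default_factory=list)
    rights: List["Tok"] = field(default_factory=list)

def nvn_from_tokens(tokens):
    """Emit 'subject verb-lemma object' for every subject/object pair of a verb."""
    phrases = []
    for tok in tokens:
        if tok.pos != 'VERB':
            continue
        subjects = [t for t in tok.lefts
                    if t.dep in ('nsubj', 'nsubjpass') and t.pos in ('NOUN', 'PROPN', 'PRON')]
        objects = [t for t in tok.rights
                   if t.dep == 'dobj' and t.pos in ('NOUN', 'PROPN')]
        for subj in subjects:
            for obj in objects:
                phrases.append(f'{subj.text} {tok.lemma} {obj.text}')
    return phrases

# "The ONS revised the annual 2004 rate ..." reduced to its three key tokens
ons = Tok('ONS', 'ONS', 'PROPN', 'nsubj')
rate = Tok('rate', 'rate', 'NOUN', 'dobj')
revised = Tok('revised', 'revise', 'VERB', 'ROOT', lefts=[ons], rights=[rate])

print(nvn_from_tokens([ons, revised, rate]))
```

The printed phrase matches the first entry of the output above, which suggests the model assigned comparable labels to that sentence.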

In [15]:
tqdm.pandas(desc='Extracting NVN Phrases')
sample_data['NVN_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_nvn)
# sample_data['NVN_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_nvn(row))).compute(scheduler='processes')

Extracting NVN Phrases: 100%|██████████| 223/223 [00:30<00:00,  7.22it/s]


In [16]:
final_nvn_list = [x for x in sample_data.NVN_PHRASES if len(x)>0]
len(final_nvn_list)

223

In [17]:
final_nvn_list[:5]

[['ONS revise rate',
  'number report figures',
  'retailers endure Christmas',
  'ONS echo caution',
  'analysts put gloss',
  'figures show performance',
  'measures cut prices',
  'figures have effect',
  'you get impression',
  'Bank keep powder'],
 ['dealers use offers',
  'increase push spending',
  'Harris tell Reuters',
  'which make thirds',
  'sales grow %',
  'analysts expect improvement'],
 ['shares soar Shares',
  'authorities turn eye',
  'Arabia want industry',
  'Arabia sell shares',
  'applicants get shares'],
 ['Fosters buy stake',
  'Fosters buy stake',
  'Fosters buy %',
  'bid value company',
  'firms ask market',
  'Fosters buy stake',
  'who found label',
  'Southcorp employ people',
  'prospect startle investors',
  'It have cash',
  'People scratch heads',
  'Fosters do flip',
  'It spend years',
  'It seize spot',
  'it buy Hardy',
  'it pay 1bn',
  'it make clutch',
  'takeover say analyst'],
 ['giant swallow giant',
  'deal give control',
  'Inbev buy partne
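
Per-article lists like these are easier to read once aggregated. A minimal sketch, assuming plain frequency counts are a reasonable first summary; top_phrases is a hypothetical helper and the input is a small excerpt of the lists above.

```python
from collections import Counter
from itertools import chain

def top_phrases(phrase_lists, n=3):
    """Flatten per-article phrase lists and return the n most frequent phrases."""
    counts = Counter(chain.from_iterable(phrase_lists))
    return counts.most_common(n)

# Excerpt of the per-article lists shown above
example_lists = [
    ['Fosters buy stake', 'Fosters buy stake', 'Fosters buy %', 'bid value company'],
    ['Arabia sell shares', 'applicants get shares'],
]

print(top_phrases(example_lists, n=1))
```

On the full sample the same call could be given final_nvn_list directly.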

## Rule 2 for IE: AN Extraction

In [18]:
# Function for rule 2: adjective noun
def rule_an(text):
    doc = nlp_spacy_en_model(text)
    pat = []

    # iterate over tokens
    for token in doc:
        # only subject or object nouns are considered
        if (token.pos_ == 'NOUN') and (token.dep_ in ['dobj','pobj','nsubj','nsubjpass']):
            phrase = ''
            # collect adjective children and compound modifiers
            for subtoken in token.children:
                if (subtoken.pos_ == 'ADJ') or (subtoken.dep_ == 'compound'):
                    phrase += subtoken.text + ' '

            # keep the noun only when it has at least one modifier
            if phrase:
                pat.append(phrase + token.text)
    return pat

In [19]:
rule_an(test_text)

['Christmas sales',
 'retail sales',
 'Retail sales',
 '% rise',
 'annual rate',
 'poor figures',
 'Clothing retailers',
 'only internet retailers',
 'significant growth',
 'earlier caution',
 'poor December figures',
 'positive gloss',
 'non - figures',
 'comparable performance',
 'December jump',
 'recent averages',
 'serious booms',
 'retail volume',
 'actual spending',
 'Street retailers',
 'festive period',
 'Consortium survey',
 'other retailers',
 'festive sales',
 'last year',
 'poor retail figures',
 'immediate effect',
 'interest rates',
 'retail sales figures',
 'accurate impression',
 'Christmas trading',
 'big picture']
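
The output above contains near-duplicates that differ only in case ('retail sales' and 'Retail sales'). One possible clean-up, sketched under the assumption that case differences carry no meaning here, is an order-preserving case-insensitive de-duplication; dedupe_case_insensitive is a hypothetical helper.

```python
def dedupe_case_insensitive(phrases):
    """Keep the first spelling of each phrase, comparing in lower case."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique

print(dedupe_case_insensitive(['Christmas sales', 'retail sales', 'Retail sales', '% rise']))
```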

In [20]:
tqdm.pandas(desc='Extracting AN Phrases')
sample_data['AN_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_an)
# sample_data['AN_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_an(row))).compute(scheduler='processes')

Extracting AN Phrases: 100%|██████████| 223/223 [00:15<00:00, 13.97it/s]


In [21]:
final_an_list = [x for x in sample_data.AN_PHRASES if len(x)>0]
len(final_an_list)

223

In [22]:
final_an_list[:5]

[['Christmas sales',
  'retail sales',
  'Retail sales',
  '% rise',
  'annual rate',
  'poor figures',
  'Clothing retailers',
  'only internet retailers',
  'significant growth',
  'earlier caution',
  'poor December figures',
  'positive gloss',
  'non - figures',
  'comparable performance',
  'December jump',
  'recent averages',
  'serious booms',
  'retail volume',
  'actual spending',
  'Street retailers',
  'festive period',
  'Consortium survey',
  'other retailers',
  'festive sales',
  'last year',
  'poor retail figures',
  'immediate effect',
  'interest rates',
  'retail sales figures',
  'accurate impression',
  'Christmas trading',
  'big picture'],
 ['retail sales',
  'high note',
  'solid gains',
  'strong car sales',
  '% rise',
  '% jump',
  'auto sales',
  'enhanced offers',
  'sales growth',
  'tough quarter',
  'usual sales boom',
  'total spending',
  'first time',
  'annual sales',
  'Decembers retail sales',
  'Home furnishings',
  'more US consumers',
  'mail

## Rule 3 for IE: PN Extraction

In [23]:
# rule 3 function
def rule_p(text):
    doc = nlp_spacy_en_model(text)
    sent = []
    
    for token in doc:
        # look for prepositions
        if token.pos_=='ADP':
            phrase = ''
            # if its head word is a noun
            if token.head.pos_=='NOUN':
                # append noun and preposition to phrase
                phrase += token.head.text
                phrase += ' '+token.text

                # check the nodes to the right of the preposition
                for right_tok in token.rights:
                    # append if it is a noun or proper noun
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                        phrase += ' '+right_tok.text
                
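                # note: phrase always holds at least "<noun> <preposition>" here, so this
                # check always passes and object-less phrases such as 'surge in' are kept
                if len(phrase)>2:
                    sent.append(phrase)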
    return sent

In [24]:
rule_p(test_text)

['rise in November',
 'rate of growth',
 'number of retailers',
 'caution from King',
 'way below booms',
 'figures for volume',
 'measures of spending indication',
 'weakness of sector',
 'effect on rates',
 'impression of trading']

In [25]:
tqdm.pandas(desc='Extracting PN Phrases')
sample_data['P_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_p)
# sample_data['P_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_p(row))).compute(scheduler='processes')

Extracting PN Phrases: 100%|██████████| 223/223 [00:14<00:00, 15.57it/s]


In [26]:
final_p_list = [x for x in sample_data.P_PHRASES if len(x)>0]
len(final_p_list)

223

In [27]:
final_p_list[:5]

[['rise in November',
  'rate of growth',
  'number of retailers',
  'caution from King',
  'way below booms',
  'figures for volume',
  'measures of spending indication',
  'weakness of sector',
  'effect on rates',
  'impression of trading'],
 ['surge in',
  'note with gains',
  'gains in December',
  'surge in shopping',
  'rise in',
  'jump in sales',
  'end of year',
  'increase in sales',
  'increase during December',
  'Sales for year',
  'order for purchases',
  'purchases with retailers',
  'policy of',
  'Consumers for',
  'part of spending',
  'thirds of output',
  'output in US',
  'recession of decade',
  'improvement in growth',
  'rise in unemployment',
  'number of Americans'],
 ['Shares in Company',
  'day of trading',
  'day in Riyadh',
  'Demand for shares',
  'part of countrys',
  'demand in sector',
  'demand for cover',
  'confidence in system',
  '% of'],
 ['stake in Southcorp',
  'rumours of takeover',
  '% of Southcorp',
  'winemaker behind Lindemans',
  'brand
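
Some phrases in the lists above, such as 'surge in', 'policy of' and '% of', stop at the preposition because no noun was found to its right. A minimal post-filter, assuming a usable phrase needs at least three whitespace-separated words; keep_complete_pn is a hypothetical helper, not part of the notebook.

```python
def keep_complete_pn(phrases, min_words=3):
    """Drop phrases that lack a word after the preposition."""
    return [p for p in phrases if len(p.split()) >= min_words]

print(keep_complete_pn(['surge in', 'note with gains', 'rise in', 'jump in sales', '% of']))
```

Filtering after extraction leaves rule_p and its recorded outputs untouched, at the cost of carrying the incomplete phrases through the apply step.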

## Rule 4 for IE: Combination of NVN + AD Extraction based rules

In [28]:
def rule_ad_mod(doc, text, index):
    # text is unused; it is kept so existing calls keep working
    phrase = ''
    # doc[index] is the token whose .i equals index, so no scan over doc is needed
    for subtoken in doc[index].children:
        if subtoken.pos_ == 'ADJ':
            phrase += ' '+subtoken.text
    return phrase

def rule_nvn_mod(text):
    doc = nlp_spacy_en_model(text)
    sent = []

    for token in doc:
        # only verbs can anchor a subject-verb-object phrase
        if token.pos_ != 'VERB':
            continue
        # noun, proper-noun or pronoun subjects
        for subj_tok in token.lefts:
            if (subj_tok.dep_ in ['nsubj','nsubjpass']) and (subj_tok.pos_ in ['NOUN','PROPN','PRON']):
                # adjectives of the subject, the subject itself, then the verb lemma
                prefix = rule_ad_mod(doc, text, subj_tok.i) + ' ' + subj_tok.text + ' ' + token.lemma_

                # noun or proper-noun direct objects
                for obj_tok in token.rights:
                    if (obj_tok.dep_ == 'dobj') and (obj_tok.pos_ in ['NOUN','PROPN']):
                        # adjectives of the object, then the object; built per pair
                        sent.append(prefix + rule_ad_mod(doc, text, obj_tok.i) + ' ' + obj_tok.text)

    return sent

In [29]:
rule_nvn_mod(test_text)

[' ONS revise annual rate',
 ' number report poor figures',
 ' retailers endure tougher Christmas',
 ' ONS echo earlier caution',
 ' analysts put positive gloss',
 ' non - figures show comparable performance',
 ' measures cut prices',
 ' poor retail figures have immediate effect',
 ' you get accurate impression',
 ' Bank keep powder']
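
Every phrase above starts with a space because the adjective prefix is empty when a subject has no adjective. A small sketch of a whitespace clean-up that could be applied afterwards; tidy_phrase is a hypothetical helper.

```python
def tidy_phrase(phrase):
    """Trim the ends and collapse any internal runs of whitespace."""
    return ' '.join(phrase.split())

print([tidy_phrase(p) for p in [' ONS revise annual rate', ' measures cut prices']])
```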

In [30]:
tqdm.pandas(desc='Extracting NVN with Adjectives based Phrases')
sample_data['NVN_MOD_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_nvn_mod)
# sample_data['NVN_MOD_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_nvn_mod(row))).compute(scheduler='processes')

Extracting NVN with Adjectives based Phrases: 100%|██████████| 223/223 [00:15<00:00, 14.25it/s]


In [31]:
final_nvn_mod_list = [x for x in sample_data.NVN_MOD_PHRASES if len(x)>0]
len(final_nvn_mod_list)

223

In [32]:
final_nvn_mod_list[:5]

[[' ONS revise annual rate',
  ' number report poor figures',
  ' retailers endure tougher Christmas',
  ' ONS echo earlier caution',
  ' analysts put positive gloss',
  ' non - figures show comparable performance',
  ' measures cut prices',
  ' poor retail figures have immediate effect',
  ' you get accurate impression',
  ' Bank keep powder'],
 [' dealers use enhanced offers',
  ' increase push total spending',
  ' Harris tell Reuters',
  ' which make thirds',
  ' sales grow lacklustre %',
  ' analysts expect improvement'],
 [' shares soar Shares',
  ' authorities turn blind eye',
  ' Arabia want industry',
  ' Arabia sell shares',
  ' applicants get shares'],
 [' Fosters buy stake',
  ' Australian Fosters buy large stake',
  ' Fosters buy %',
  ' bid value company',
  ' firms ask market',
  ' Fosters buy stake',
  ' who found label',
  ' Southcorp employ people',
  ' prospect startle investors',
  ' It have available cash',
  ' People scratch heads',
  ' Fosters do flip',
  ' It spe

## Writing Results

## Segregating Outputs