# Title: Exploring Textual Patterns / Performing Information Extractions

This notebook finalizes the extraction of insights from the news articles and explores whether rule-based patterns can also surface less obvious relationships. It covers the following steps:  
- Text Preprocessing  
- Rule 1 for IE: Noun-Verb-Noun Extraction  
- Rule 2 for IE: Adjective-Noun Extraction  
- Rule 3 for IE: Preposition-Noun Extraction  
- Rule 4 for IE: Combination of NVN + AD Extraction based rules

Each step is detailed in its own section below.

## Generic Actions

In [1]:
import os
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'c:\\Users\\manash.jyoti.konwar\\Documents\\AI_Random_Projects\\NLP-Information-Pattern-Finder'

### Libraries Import

In [2]:
import string  # used explicitly by preprocess_text below
import spacy
import multiprocessing
import pandas as pd
import dask.dataframe as dd

from tqdm import tqdm
from dask.diagnostics import ProgressBar
ProgressBar().register()

from sn_textual_preprocessing import *

pd.set_option("display.max_rows", 600)
pd.set_option("display.max_columns", 500)
pd.set_option("max_colwidth", 400)

### Notebook Variables

In [3]:
# Input file path
input_filepath = os.path.join('input', 'news_articles_dataset.csv')

# Derived file path
output_path = 'output'
if not os.path.exists(output_path):
    os.makedirs(output_path)

sample_frac = 0.1
# spacy_model_name = 'en_core_web_trf'
spacy_model_name = 'en_core_web_lg'

output_overall_data = os.path.join(output_path, 'df_news_phrase_extracts.csv')
output_nvn_sep_data = os.path.join(output_path, 'df_nvn_sep_news.csv')
output_an_sep_data = os.path.join(output_path, 'df_an_sep_news.csv')
output_nvn_mod_sep_data = os.path.join(output_path, 'df_nvn_mod_news.csv')

In [4]:
try:
    nlp_spacy_en_model = spacy.load(spacy_model_name)
except OSError:
    spacy.cli.download(spacy_model_name)
    nlp_spacy_en_model = spacy.load(spacy_model_name)

### Reading data

In [5]:
input_data = pd.read_csv(input_filepath)
input_data.columns = [col_name.upper() for col_name in input_data.columns]
input_data.shape

(2225, 3)

In [6]:
sample_data = input_data.groupby('CATEGORIES', group_keys=False).apply(lambda x: x.sample(frac=sample_frac, random_state=42))
sample_data.shape

(223, 3)

In [7]:
sample_data.CATEGORIES.value_counts()

business         51
sport            51
politics         42
tech             40
entertainment    39
Name: CATEGORIES, dtype: int64
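
The per-category counts above come from sampling the same fraction inside every category. A minimal standard-library sketch of that idea follows; the `stratified_sample` helper and the `toy_rows` data are illustrative assumptions, not part of this notebook or its dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, frac, seed=42):
    """Sample roughly `frac` of the rows from each group defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        k = round(frac * len(members))  # per-group sample size
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy data: 50 'business' rows and 30 'tech' rows.
toy_rows = [{'CATEGORIES': 'business', 'id': i} for i in range(50)]
toy_rows += [{'CATEGORIES': 'tech', 'id': i} for i in range(30)]

toy_sample = stratified_sample(toy_rows, 'CATEGORIES', frac=0.1)
sample_counts = Counter(row['CATEGORIES'] for row in toy_sample)
print(sample_counts)
```

Sampling per group keeps the category mix of the sample close to that of the full data, which a single global sample does not guarantee.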

## Text Preprocessing  

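The steps applied by `preprocess_text` are as follows:  
- Remove URLs  
- Remove mentions and hashtags  
- Remove contractions  
- Remove punctuation, keeping `-`, `%` and `.` (stopword and digit removal are switched off through their flags)  
- Remove extra spaces  

Lemmatization, lower-casing and domain-specific word removal are not applied in this notebook, so spaCy receives the original word forms and casing.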
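
The helpers imported from `sn_textual_preprocessing` are not shown in this notebook. As a rough illustration of what such a chain could do, here is a standard-library sketch; the `sketch_preprocess` function and its regular expressions are assumptions, and contraction handling is left out.

```python
import re
import string

# Punctuation to strip: everything except '-', '%' and '.', mirroring the
# `punc_2_remove` argument passed in this notebook.
PUNC_TO_REMOVE = string.punctuation.replace('-', '').replace('%', '').replace('.', '')

def sketch_preprocess(text):
    """Hypothetical stand-in for the notebook's helper chain."""
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)            # drop URLs
    text = re.sub(r'[@#]\w+', ' ', text)                           # drop mentions and hashtags
    text = text.translate(str.maketrans('', '', PUNC_TO_REMOVE))   # drop remaining punctuation
    text = re.sub(r'[ \t]+', ' ', text).strip()                    # collapse extra spaces
    return text

raw = 'Non-food sales fell 1.7%, the ONS said!  #retail @ONS https://example.com/report'
cleaned = sketch_preprocess(raw)
print(cleaned)
```

Keeping `-`, `%` and `.` preserves hyphenated words, percentages and sentence boundaries, all of which the dependency parser relies on later.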

In [8]:
def preprocess_text(text):
    result = remove_urls(text)
    result = remove_mentions_hashtags(result)
    result = remove_contractions(result)
    result = remove_stopwords_punc_nos(result, 
                                       remove_stopwords_flag=False, 
                                       punc_2_remove=string.punctuation.replace('-','').replace('%','').replace('.',''), 
                                       remove_digits_flag=False,
                                       remove_pattern_punc_flag=True)
    result = remove_extra_spaces(result)
    return result

In [9]:
sample_data['PREPROCESSED_TEXT'] = (
    dd.from_pandas(sample_data.ARTICLES, npartitions=4*multiprocessing.cpu_count())
    .map_partitions(lambda dframe: dframe.apply(preprocess_text))
    .compute(scheduler='processes')
)

[########################################] | 100% Completed | 8.25 s


## Experimentation

In [10]:
sample_data.head(5)

Unnamed: 0,ARTICLES,SUMMARIES,CATEGORIES,PREPROCESSED_TEXT
480,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...","""The retail sales figures are very weak, but as Bank of England governor Mervyn King indicated last night, you don't really get an accurate impression of Christmas trading until about Easter,"" said Mr Shaw.The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.A number of retailers have already reported poor figures for December.Retail sales droppe...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...
449,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...","US retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.Sales for the year also broke through the $4 trillion mark for the first time - with annual sales coming in at $4.06 trillion However, if automotives are excluded from December's data, retail sales rose just 0.3% on the month.Retail sales are seen as a major part of consumer spending - which...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...
475,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...","Shares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.Previously, only NCCI has been legally allowed to offer insurance products within Saudi Arabia.The listing was part of the country's plans to open up its insurance market and boost demand in the sector.Saudi Arabia now wants a fully functioning insurance industry and is int...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...
434,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...","Australian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.Fosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share.Fosters bought the 18.8% stake from Reline Investments, the family investment firm for the Oatleys, who founded the Rosemou...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...
368,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...","Inbev was formed in August 2004 when Belgium's Interbrew bought Brazilian brewer Ambev.Brewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).Sun Interbrew, which employs 8,000 staff, owns breweries in eight Russian cities - Klin, Ivanovo, Saransk, Kursk, Volzhsky, Omsk, Perm and Novocheboksarsk.The d...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...


In [11]:
test_text = sample_data.PREPROCESSED_TEXT[480]
test_text

'Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of retailers have already reported poor figures for December. Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth according to the ONS. The last time retailers endured a tougher Christmas was 23 years previously when sales plunged 1.7%. The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures. Some analysts put a positive gloss on the figures pointing out that the non-seasonally-adjusted figures showed a performance comparable with 2003.. The November-Decembe

In [12]:
from spacy import displacy

doc = nlp_spacy_en_model(test_text)

# print the part-of-speech tag of every token
for token in doc:
    print(token.text, '->', token.pos_)

# draw the dependency parse inline
displacy.render(doc, style='dep', jupyter=True)

Christmas -> PROPN
sales -> NOUN
worst -> ADV
since -> SCONJ
1981 -> NUM


 -> SPACE
UK -> PROPN
retail -> ADJ
sales -> NOUN
fell -> VERB
in -> ADP
December -> PROPN
failing -> VERB
to -> PART
meet -> VERB
expectations -> NOUN
and -> CCONJ
making -> VERB
it -> PRON
by -> ADP
some -> DET
counts -> NOUN
the -> DET
worst -> ADJ
Christmas -> PROPN
since -> SCONJ
1981 -> NUM
.. -> PUNCT
Retail -> ADJ
sales -> NOUN
dropped -> VERB
by -> ADP
1 -> NUM
% -> NOUN
on -> ADP
the -> DET
month -> NOUN
in -> ADP
December -> PROPN
after -> ADP
a -> DET
0.6 -> NUM
% -> NOUN
rise -> NOUN
in -> ADP
November -> PROPN
the -> DET
Office -> PROPN
for -> ADP
National -> PROPN
Statistics -> PROPN
ONS -> PROPN
said -> VERB
. -> PUNCT
The -> DET
ONS -> PROPN
revised -> VERB
the -> DET
annual -> ADJ
2004 -> NUM
rate -> NOUN
of -> ADP
growth -> NOUN
down -> ADP
from -> ADP
the -> DET
5.9 -> NUM
% -> NOUN
estimated -> VERB
in -> ADP
November -> PROPN
to -> ADP
3.2 -> NUM
% -> NOUN
. -> PUNCT
A -> DET
number -> NOUN

## Rule 1 for IE: NVN Extraction

In [13]:
# Function for rule 1: noun (subject), verb, noun (object)
def rule_nvn(text):
    doc = nlp_spacy_en_model(text)
    sent = []
    for token in doc:
        # if the token is a verb
        if (token.pos_=='VERB'):
            phrase =''
            
            # only extract noun, proper noun or pronoun subjects
            for sub_tok in token.lefts:
                if (sub_tok.dep_ in ['nsubj','nsubjpass']) and (sub_tok.pos_ in ['NOUN','PROPN','PRON']):
                    # add subject to the phrase
                    phrase += sub_tok.text
                    # add the lemma of the verb to the phrase
                    phrase += ' '+token.lemma_ 
                    # check for noun or proper noun direct objects
                    for obj_tok in token.rights:
                        # save the object in the phrase; `phrase` is not reset,
                        # so a second direct object is appended to the first
                        if (obj_tok.dep_ in ['dobj']) and (obj_tok.pos_ in ['NOUN','PROPN']):
                            phrase += ' '+obj_tok.text
                            sent.append({'phrase': phrase, 'verb': token.lemma_})
    return sent

In [14]:
rule_nvn(test_text)

[{'phrase': 'ONS revise rate', 'verb': 'revise'},
 {'phrase': 'number report figures', 'verb': 'report'},
 {'phrase': 'retailers endure Christmas', 'verb': 'endure'},
 {'phrase': 'ONS echo caution', 'verb': 'echo'},
 {'phrase': 'analysts put gloss', 'verb': 'put'},
 {'phrase': 'figures show performance', 'verb': 'show'},
 {'phrase': 'measures cut prices', 'verb': 'cut'},
 {'phrase': 'figures have effect', 'verb': 'have'},
 {'phrase': 'you get impression', 'verb': 'get'},
 {'phrase': 'Bank keep powder', 'verb': 'keep'}]

In [15]:
tqdm.pandas(desc='Extracting NVN Phrases')
sample_data['NVN_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_nvn)
# sample_data['NVN_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_nvn(row))).compute(scheduler='processes')

Extracting NVN Phrases: 100%|██████████| 223/223 [00:22<00:00, 10.13it/s]


In [16]:
final_nvn_list = [x for x in sample_data.NVN_PHRASES if len(x)>0]
len(final_nvn_list)

223

In [17]:
final_nvn_list[:5]

[[{'phrase': 'ONS revise rate', 'verb': 'revise'},
  {'phrase': 'number report figures', 'verb': 'report'},
  {'phrase': 'retailers endure Christmas', 'verb': 'endure'},
  {'phrase': 'ONS echo caution', 'verb': 'echo'},
  {'phrase': 'analysts put gloss', 'verb': 'put'},
  {'phrase': 'figures show performance', 'verb': 'show'},
  {'phrase': 'measures cut prices', 'verb': 'cut'},
  {'phrase': 'figures have effect', 'verb': 'have'},
  {'phrase': 'you get impression', 'verb': 'get'},
  {'phrase': 'Bank keep powder', 'verb': 'keep'}],
 [{'phrase': 'dealers use offers', 'verb': 'use'},
  {'phrase': 'increase push spending', 'verb': 'push'},
  {'phrase': 'Harris tell Reuters', 'verb': 'tell'},
  {'phrase': 'which make thirds', 'verb': 'make'},
  {'phrase': 'sales grow %', 'verb': 'grow'},
  {'phrase': 'analysts expect improvement', 'verb': 'expect'}],
 [{'phrase': 'shares soar Shares', 'verb': 'soar'},
  {'phrase': 'authorities turn eye', 'verb': 'turn'},
  {'phrase': 'Arabia want industry', 
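
Each article yields a list of `{'phrase', 'verb'}` records, so a quick way to summarise them is to flatten the lists and count. The sketch below uses a few records copied from the output above; the `nvn_records` variable is an illustrative stand-in for `final_nvn_list`.

```python
from collections import Counter

# A shortened copy of the first two articles' records from the output above.
nvn_records = [
    [{'phrase': 'ONS revise rate', 'verb': 'revise'},
     {'phrase': 'ONS echo caution', 'verb': 'echo'},
     {'phrase': 'figures show performance', 'verb': 'show'}],
    [{'phrase': 'Harris tell Reuters', 'verb': 'tell'},
     {'phrase': 'analysts expect improvement', 'verb': 'expect'}],
]

# Flatten the per-article lists into one list of records.
flat_records = [rec for article in nvn_records for rec in article]

# Count verb lemmas and subjects (the first word of each phrase).
verb_counts = Counter(rec['verb'] for rec in flat_records)
subject_counts = Counter(rec['phrase'].split()[0] for rec in flat_records)

print(verb_counts.most_common(3))
print(subject_counts.most_common(3))
```

Taking the first word as the subject works here because `rule_nvn` always places a single subject token at the start of the phrase.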

## Rule 2 for IE: AN Extraction

In [18]:
# Function for rule 2: adjective noun
def rule_an(text):
    doc = nlp_spacy_en_model(text)
    pat = []
    
    # iterate over tokens
    for token in doc:
        phrase = ''
        # if the word is a subject noun or an object noun
        if (token.pos_ == 'NOUN')\
            and (token.dep_ in ['dobj','pobj','nsubj','nsubjpass']):
            
            # iterate over the children nodes
            for subtoken in token.children:
                # if word is an adjective or has a compound dependency
                if (subtoken.pos_ == 'ADJ') or (subtoken.dep_ == 'compound'):
                    phrase += subtoken.text + ' '
                    
            # keep the noun only when at least one modifier was found
            if len(phrase)!=0:
                phrase += token.text
                pat.append({'phrase':phrase, 'noun': token.text})
    return pat

In [19]:
rule_an(test_text)

[{'phrase': 'Christmas sales', 'noun': 'sales'},
 {'phrase': 'retail sales', 'noun': 'sales'},
 {'phrase': 'Retail sales', 'noun': 'sales'},
 {'phrase': '% rise', 'noun': 'rise'},
 {'phrase': 'annual rate', 'noun': 'rate'},
 {'phrase': 'poor figures', 'noun': 'figures'},
 {'phrase': 'Clothing retailers', 'noun': 'retailers'},
 {'phrase': 'only internet retailers', 'noun': 'retailers'},
 {'phrase': 'significant growth', 'noun': 'growth'},
 {'phrase': 'earlier caution', 'noun': 'caution'},
 {'phrase': 'poor December figures', 'noun': 'figures'},
 {'phrase': 'positive gloss', 'noun': 'gloss'},
 {'phrase': 'non - figures', 'noun': 'figures'},
 {'phrase': 'comparable performance', 'noun': 'performance'},
 {'phrase': 'December jump', 'noun': 'jump'},
 {'phrase': 'recent averages', 'noun': 'averages'},
 {'phrase': 'serious booms', 'noun': 'booms'},
 {'phrase': 'retail volume', 'noun': 'volume'},
 {'phrase': 'actual spending', 'noun': 'spending'},
 {'phrase': 'Street retailers', 'noun': 'retai

In [20]:
tqdm.pandas(desc='Extracting AN Phrases')
sample_data['AN_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_an)
# sample_data['AN_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_an(row))).compute(scheduler='processes')

Extracting AN Phrases: 100%|██████████| 223/223 [00:14<00:00, 14.93it/s]


In [21]:
final_an_list = [x for x in sample_data.AN_PHRASES if len(x)>0]
len(final_an_list)

223

In [22]:
final_an_list[:5]

[[{'phrase': 'Christmas sales', 'noun': 'sales'},
  {'phrase': 'retail sales', 'noun': 'sales'},
  {'phrase': 'Retail sales', 'noun': 'sales'},
  {'phrase': '% rise', 'noun': 'rise'},
  {'phrase': 'annual rate', 'noun': 'rate'},
  {'phrase': 'poor figures', 'noun': 'figures'},
  {'phrase': 'Clothing retailers', 'noun': 'retailers'},
  {'phrase': 'only internet retailers', 'noun': 'retailers'},
  {'phrase': 'significant growth', 'noun': 'growth'},
  {'phrase': 'earlier caution', 'noun': 'caution'},
  {'phrase': 'poor December figures', 'noun': 'figures'},
  {'phrase': 'positive gloss', 'noun': 'gloss'},
  {'phrase': 'non - figures', 'noun': 'figures'},
  {'phrase': 'comparable performance', 'noun': 'performance'},
  {'phrase': 'December jump', 'noun': 'jump'},
  {'phrase': 'recent averages', 'noun': 'averages'},
  {'phrase': 'serious booms', 'noun': 'booms'},
  {'phrase': 'retail volume', 'noun': 'volume'},
  {'phrase': 'actual spending', 'noun': 'spending'},
  {'phrase': 'Street retail
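
The output contains near-duplicates such as `retail sales` and `Retail sales`. One possible way to merge them is to lower-case the phrases and group them by head noun, as in this sketch; `an_records` is a shortened copy of the records above and `phrases_by_noun` is a hypothetical name.

```python
from collections import defaultdict

# A shortened copy of the records from the output above.
an_records = [
    {'phrase': 'Christmas sales', 'noun': 'sales'},
    {'phrase': 'retail sales', 'noun': 'sales'},
    {'phrase': 'Retail sales', 'noun': 'sales'},
    {'phrase': 'poor figures', 'noun': 'figures'},
    {'phrase': 'poor December figures', 'noun': 'figures'},
]

# Group lower-cased phrases under their head noun; sets drop the duplicates.
phrases_by_noun = defaultdict(set)
for rec in an_records:
    phrases_by_noun[rec['noun'].lower()].add(rec['phrase'].lower())

for noun, phrases in sorted(phrases_by_noun.items()):
    print(noun, '->', sorted(phrases))
```

Lower-casing also flattens proper nouns such as `Christmas`, so this step suits counting better than display.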

## Rule 3 for IE: NPN Extraction

In [23]:
# Function for rule 3: noun, preposition, noun
def rule_npn(text):
    doc = nlp_spacy_en_model(text)
    sent = []
    
    for token in doc:
        # look for prepositions
        if token.pos_=='ADP':
            phrase = ''
            # if its head word is a noun
            if token.head.pos_=='NOUN':
                # append noun and preposition to phrase
                phrase += token.head.text
                phrase += ' '+token.text

                # check the nodes to the right of the preposition
                for right_tok in token.rights:
                    # append if it is a noun or proper noun
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                        phrase += ' '+right_tok.text
                
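                # always true here: the phrase already holds the head noun and the
                # preposition, so phrases without a noun object are kept as well
                if len(phrase)>2: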
                    sent.append({'phrase':phrase, 'preposition': token.text})
    return sent

In [24]:
rule_npn(test_text)

[{'phrase': 'rise in November', 'preposition': 'in'},
 {'phrase': 'rate of growth', 'preposition': 'of'},
 {'phrase': 'number of retailers', 'preposition': 'of'},
 {'phrase': 'caution from King', 'preposition': 'from'},
 {'phrase': 'way below booms', 'preposition': 'below'},
 {'phrase': 'figures for volume', 'preposition': 'for'},
 {'phrase': 'measures of spending indication', 'preposition': 'of'},
 {'phrase': 'weakness of sector', 'preposition': 'of'},
 {'phrase': 'effect on rates', 'preposition': 'on'},
 {'phrase': 'impression of trading', 'preposition': 'of'}]

In [25]:
tqdm.pandas(desc='Extracting NPN Phrases')
sample_data['NPN_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_npn)
# sample_data['NPN_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_npn(row))).compute(scheduler='processes')

Extracting NPN Phrases: 100%|██████████| 223/223 [00:14<00:00, 14.99it/s]


In [26]:
final_npn_list = [x for x in sample_data.NPN_PHRASES if len(x)>0]
len(final_npn_list)

223

In [27]:
final_npn_list[:5]

[[{'phrase': 'rise in November', 'preposition': 'in'},
  {'phrase': 'rate of growth', 'preposition': 'of'},
  {'phrase': 'number of retailers', 'preposition': 'of'},
  {'phrase': 'caution from King', 'preposition': 'from'},
  {'phrase': 'way below booms', 'preposition': 'below'},
  {'phrase': 'figures for volume', 'preposition': 'for'},
  {'phrase': 'measures of spending indication', 'preposition': 'of'},
  {'phrase': 'weakness of sector', 'preposition': 'of'},
  {'phrase': 'effect on rates', 'preposition': 'on'},
  {'phrase': 'impression of trading', 'preposition': 'of'}],
 [{'phrase': 'surge in', 'preposition': 'in'},
  {'phrase': 'note with gains', 'preposition': 'with'},
  {'phrase': 'gains in December', 'preposition': 'in'},
  {'phrase': 'surge in shopping', 'preposition': 'in'},
  {'phrase': 'rise in', 'preposition': 'in'},
  {'phrase': 'jump in sales', 'preposition': 'in'},
  {'phrase': 'end of year', 'preposition': 'of'},
  {'phrase': 'increase in sales', 'preposition': 'in'},
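
Records such as `surge in` and `rise in` have a head noun and a preposition but no noun object. If only complete noun-preposition-noun phrases are wanted, they can be filtered afterwards. This is a minimal sketch; `has_noun_object` is a hypothetical helper and `npn_records` is a short copy of the output above.

```python
def has_noun_object(record):
    """True when the phrase holds more than the head noun and the preposition."""
    return len(record['phrase'].split()) > 2

# A shortened copy of the second article's records from the output above.
npn_records = [
    {'phrase': 'surge in', 'preposition': 'in'},
    {'phrase': 'note with gains', 'preposition': 'with'},
    {'phrase': 'rise in', 'preposition': 'in'},
    {'phrase': 'jump in sales', 'preposition': 'in'},
]

complete_npn = [rec for rec in npn_records if has_noun_object(rec)]
print([rec['phrase'] for rec in complete_npn])
```

Counting words works because each spaCy token text used in the phrase contains no spaces.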


## Rule 4 for IE: Combination of NVN + AD Extraction based rules

In [28]:
def rule_ad_mod(doc, text, index):
    # `text` is unused; it is kept so existing calls keep working
    phrase = ''
    
    # collect the adjective children of the token at position `index`
    for subtoken in doc[index].children:
        if (subtoken.pos_ == 'ADJ'):
            phrase += ' '+subtoken.text
    return phrase

def rule_nvn_mod(text):
    doc = nlp_spacy_en_model(text)
    sent = []
    
    for token in doc:
        # if the token is a verb
        if (token.pos_=='VERB'):
            phrase =''
            
            # only extract noun, proper noun or pronoun subjects
            for sub_tok in token.lefts:
                if (sub_tok.dep_ in ['nsubj','nsubjpass']) and (sub_tok.pos_ in ['NOUN','PROPN','PRON']):
                    # prepend the subject's adjectives; every phrase therefore starts with a space
                    adj = rule_ad_mod(doc, text, sub_tok.i)
                    phrase += adj + ' ' + sub_tok.text

                    # add the lemma of the verb
                    phrase += ' '+token.lemma_ 

                    # check for noun or proper noun direct objects
                    for obj_tok in token.rights:
                        if (obj_tok.dep_ in ['dobj']) and (obj_tok.pos_ in ['NOUN','PROPN']):
                            adj = rule_ad_mod(doc, text, obj_tok.i)
                            # add the object with its adjectives
                            phrase += adj+' '+obj_tok.text
                            sent.append({'phrase':phrase, 'verb':token.lemma_})
            
    return sent

In [29]:
rule_nvn_mod(test_text)

[{'phrase': ' ONS revise annual rate', 'verb': 'revise'},
 {'phrase': ' number report poor figures', 'verb': 'report'},
 {'phrase': ' retailers endure tougher Christmas', 'verb': 'endure'},
 {'phrase': ' ONS echo earlier caution', 'verb': 'echo'},
 {'phrase': ' analysts put positive gloss', 'verb': 'put'},
 {'phrase': ' non - figures show comparable performance', 'verb': 'show'},
 {'phrase': ' measures cut prices', 'verb': 'cut'},
 {'phrase': ' poor retail figures have immediate effect', 'verb': 'have'},
 {'phrase': ' you get accurate impression', 'verb': 'get'},
 {'phrase': ' Bank keep powder', 'verb': 'keep'}]
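
Every phrase in this output starts with a space, because the adjective string is joined to the subject with a leading separator. A small post-processing step can tidy this without touching the rule itself; `tidy_phrase` below is a hypothetical helper, shown on records copied from the output above.

```python
def tidy_phrase(phrase):
    """Strip the ends and collapse any run of whitespace to a single space."""
    return ' '.join(phrase.split())

# Records copied from the output above, leading spaces included.
nvn_mod_records = [
    {'phrase': ' ONS revise annual rate', 'verb': 'revise'},
    {'phrase': ' poor retail figures have immediate effect', 'verb': 'have'},
]

tidy_records = [{**rec, 'phrase': tidy_phrase(rec['phrase'])} for rec in nvn_mod_records]
print(tidy_records)
```

Cleaning after extraction leaves the stored phrases unchanged, so the outputs shown in this notebook remain reproducible.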

In [30]:
tqdm.pandas(desc='Extracting NVN with Adjectives based Phrases')
sample_data['NVN_MOD_PHRASES'] = sample_data['PREPROCESSED_TEXT'].progress_apply(rule_nvn_mod)
# sample_data['NVN_MOD_PHRASES'] = dd.from_pandas(sample_data.PREPROCESSED_TEXT, npartitions=4*multiprocessing.cpu_count()).map_partitions(lambda dframe: dframe.apply(lambda row: rule_nvn_mod(row))).compute(scheduler='processes')

Extracting NVN with Adjectives based Phrases: 100%|██████████| 223/223 [00:17<00:00, 12.88it/s]


In [31]:
final_nvn_mod_list = [x for x in sample_data.NVN_MOD_PHRASES if len(x)>0]
len(final_nvn_mod_list)

223

In [32]:
final_nvn_mod_list[:5]

[[{'phrase': ' ONS revise annual rate', 'verb': 'revise'},
  {'phrase': ' number report poor figures', 'verb': 'report'},
  {'phrase': ' retailers endure tougher Christmas', 'verb': 'endure'},
  {'phrase': ' ONS echo earlier caution', 'verb': 'echo'},
  {'phrase': ' analysts put positive gloss', 'verb': 'put'},
  {'phrase': ' non - figures show comparable performance', 'verb': 'show'},
  {'phrase': ' measures cut prices', 'verb': 'cut'},
  {'phrase': ' poor retail figures have immediate effect', 'verb': 'have'},
  {'phrase': ' you get accurate impression', 'verb': 'get'},
  {'phrase': ' Bank keep powder', 'verb': 'keep'}],
 [{'phrase': ' dealers use enhanced offers', 'verb': 'use'},
  {'phrase': ' increase push total spending', 'verb': 'push'},
  {'phrase': ' Harris tell Reuters', 'verb': 'tell'},
  {'phrase': ' which make thirds', 'verb': 'make'},
  {'phrase': ' sales grow lacklustre %', 'verb': 'grow'},
  {'phrase': ' analysts expect improvement', 'verb': 'expect'}],
 [{'phrase': ' s

## Writing Overall Results

In [33]:
sample_data.loc[sample_data.CATEGORIES.isin(['politics'])].head(5)

Unnamed: 0,ARTICLES,SUMMARIES,CATEGORIES,PREPROCESSED_TEXT,NVN_PHRASES,AN_PHRASES,NPN_PHRASES,NVN_MOD_PHRASES
1048,"Clarke to unveil immigration plan\n\nNew controls on economic migrants and tighter border patrols will be part of government plans unveiled on Monday.\n\nHome Secretary Charles Clarke wants to introduce a points system for economic migrants and increase deportations of failed asylum seekers. Tony Blair has said people are right to be concerned about abuses of the system but there is no ""magic ...","But he said it was yet to be seen if Mr Clarke could deliver ""a fair and efficient asylum system"".Conservative shadow home secretary David Davis said the government had failed to remove 250,000 failed asylum seekers from the UK and limits on economic migrants had been a ""shambles"".Home Secretary Charles Clarke wants to introduce a points system for economic migrants and increase deportations o...",politics,Clarke to unveil immigration plan\n\nNew controls on economic migrants and tighter border patrols will be part of government plans unveiled on Monday. Home Secretary Charles Clarke wants to introduce a points system for economic migrants and increase deportations of failed asylum seekers. 
Tony Blair has said people are right to be concerned about abuses of the system but there is no magic bull...,"[{'phrase': 'Clarke unveil plan', 'verb': 'unveil'}, {'phrase': 'Clarke unveil plan controls', 'verb': 'unveil'}, {'phrase': 'plans produce system', 'verb': 'produce'}, {'phrase': 'Labour reform immigration', 'verb': 'reform'}, {'phrase': 'it win election', 'verb': 'win'}, {'phrase': 'figure reflect needs', 'verb': 'reflect'}, {'phrase': 'Blair tell Radio', 'verb': 'tell'}, {'phrase': 'Blair t...","[{'phrase': 'immigration plan', 'noun': 'plan'}, {'phrase': 'New controls', 'noun': 'controls'}, {'phrase': 'economic migrants', 'noun': 'migrants'}, {'phrase': 'government plans', 'noun': 'plans'}, {'phrase': 'points system', 'noun': 'system'}, {'phrase': 'economic migrants', 'noun': 'migrants'}, {'phrase': 'asylum seekers', 'noun': 'seekers'}, {'phrase': 'efficient system', 'noun': 'system'}...","[{'phrase': 'controls on migrants', 'preposition': 'on'}, {'phrase': 'part of plans', 'preposition': 'of'}, {'phrase': 'system for migrants', 'preposition': 'for'}, {'phrase': 'deportations of seekers', 'preposition': 'of'}, {'phrase': 'abuses of system', 'preposition': 'of'}, {'phrase': 'action by campaigning', 'preposition': 'by'}, {'phrase': 'part of process', 'preposition': 'of'}, {'phrase...","[{'phrase': ' Clarke unveil plan', 'verb': 'unveil'}, {'phrase': ' Clarke unveil plan New controls', 'verb': 'unveil'}, {'phrase': ' plans produce efficient system', 'verb': 'produce'}, {'phrase': ' Labour reform immigration', 'verb': 'reform'}, {'phrase': ' it win election', 'verb': 'win'}, {'phrase': ' arbitrary figure reflect needs', 'verb': 'reflect'}, {'phrase': ' Blair tell Radio', 'verb..."
1293,"Lib Dems predict 'best ever poll'\n\nThe Lib Dems are set for their best results in both the general election and the local council polls, one of their frontbenchers has predicted.\n\nLocal government spokesman Ed Davey was speaking as the party launched its campaign for the local elections being held in 37 English council areas. The flagship pledge is to replace council tax with a local incom...","""I think we are going to have the best general election results and local election results we have ever had under [party leader] Charles Kennedy.The Lib Dems are set for their best results in both the general election and the local council polls, one of their frontbenchers has predicted.Local government spokesman Ed Davey was speaking as the party launched its campaign for the local elections ...",politics,Lib Dems predict best ever poll\n\nThe Lib Dems are set for their best results in both the general election and the local council polls one of their frontbenchers has predicted. Local government spokesman Ed Davey was speaking as the party launched its campaign for the local elections being held in 37 English council areas. 
The flagship pledge is to replace council tax with a local income tax....,"[{'phrase': 'party launch campaign', 'verb': 'launch'}, {'phrase': 'people pay tax', 'verb': 'pay'}]","[{'phrase': 'best results', 'noun': 'results'}, {'phrase': 'general election', 'noun': 'election'}, {'phrase': 'local elections', 'noun': 'elections'}, {'phrase': 'council areas', 'noun': 'areas'}, {'phrase': 'flagship pledge', 'noun': 'pledge'}, {'phrase': 'council tax', 'noun': 'tax'}, {'phrase': 'local income tax', 'noun': 'tax'}, {'phrase': 'more tax', 'noun': 'tax'}, {'phrase': 'partys su...","[{'phrase': 'results in election', 'preposition': 'in'}, {'phrase': 'campaign for elections', 'preposition': 'for'}, {'phrase': 'tax with tax', 'preposition': 'with'}, {'phrase': 'endorsement of leader', 'preposition': 'of'}]","[{'phrase': ' party launch campaign', 'verb': 'launch'}, {'phrase': ' people pay more tax', 'verb': 'pay'}]"
1307,"'Last chance' warning for voters\n\nPeople in England, Scotland and Wales must have registered by 1700 GMT to be able to vote in the general election if it is held, as expected, on 5 May.\n\nThose who filled in forms last autumn should already be on the register - but those who have moved house or were on holiday may have been left off. There will also be elections for local councils and mayor...","There will also be elections for local councils and mayors in parts of England on 5 May.People in England, Scotland and Wales must have registered by 1700 GMT to be able to vote in the general election if it is held, as expected, on 5 May.Last week Preston City Council reported that more than 14,000 of its voters were not registered.""If you want your voice to be heard on 5 May you will need to...",politics,Last chance warning for voters\n\nPeople in England Scotland and Wales must have registered by 1700 GMT to be able to vote in the general election if it is held as expected on 5 May. Those who filled in forms last autumn should already be on the register - but those who have moved house or were on holiday may have been left off. 
There will also be elections for local councils and mayors in par...,"[{'phrase': 'who move house', 'verb': 'move'}, {'phrase': 'you have say', 'verb': 'have'}]","[{'phrase': 'Last chance', 'noun': 'chance'}, {'phrase': 'general election', 'noun': 'election'}, {'phrase': 'local councils', 'noun': 'councils'}, {'phrase': 'registration forms', 'noun': 'forms'}, {'phrase': 'local authorities', 'noun': 'authorities'}, {'phrase': 'councils polls', 'noun': 'polls'}, {'phrase': 'unitary authorities', 'noun': 'authorities'}, {'phrase': 'electoral roll', 'noun':...","[{'phrase': 'elections for councils', 'preposition': 'for'}, {'phrase': 'councils in parts', 'preposition': 'in'}, {'phrase': 'parts of England', 'preposition': 'of'}, {'phrase': 'day on Friday', 'preposition': 'on'}, {'phrase': 'polls for authorities', 'preposition': 'for'}, {'phrase': 'polls at Isle', 'preposition': 'at'}, {'phrase': 'polls at Tyneside', 'preposition': 'at'}, {'phrase': 'dip...","[{'phrase': ' who move house', 'verb': 'move'}, {'phrase': ' you have say', 'verb': 'have'}]"
990,"Brown hits back in Blair rift row\n\nGordon Brown has criticised a union leader who said conflict between himself and Tony Blair was harming the workings of government.\n\nJonathan Baume, of the top civil servants' union, spoke of ""competing agendas"" between Mr Brown and Mr Blair. But the chancellor said Mr Baume was never at meetings between himself and the prime minister so could not judge. ...","But the chancellor said Mr Baume was never at meetings between himself and the prime minister so could not judge.He also said that as Mr Baume was never present at meetings between himself and the prime minister, he was not in a position to judge.Number 10 said ministers were interested in governing and not a ""soap opera"" about Mr Blair and Mr Brown.Jonathan Baume, of the top civil servants' u...",politics,Brown hits back in Blair rift row\n\nGordon Brown has criticised a union leader who said conflict between himself and Tony Blair was harming the workings of government. Jonathan Baume of the top civil servants union spoke of competing agendas between Mr Brown and Mr Blair. But the chancellor said Mr Baume was never at meetings between himself and the prime minister so could not judge. 
He said ...,"[{'phrase': 'Brown criticise leader', 'verb': 'criticise'}, {'phrase': 'conflict harm workings', 'verb': 'harm'}, {'phrase': 'which threaten jobs', 'verb': 'threaten'}, {'phrase': 'It suit purpose', 'verb': 'suit'}, {'phrase': 'Brown tell programme', 'verb': 'tell'}, {'phrase': 'Blair make decisions', 'verb': 'make'}, {'phrase': 'Baume tell News', 'verb': 'tell'}, {'phrase': 'departments get m...","[{'phrase': 'union leader', 'noun': 'leader'}, {'phrase': 'union leader', 'noun': 'leader'}, {'phrase': 'service reform', 'noun': 'reform'}, {'phrase': 'members jobs', 'noun': 'jobs'}, {'phrase': 'Radio 4s Today programme', 'noun': 'programme'}, {'phrase': 'servants jobs', 'noun': 'jobs'}, {'phrase': 'frontline services', 'noun': 'services'}, {'phrase': 'Baumes judgement', 'noun': 'judgement'}...","[{'phrase': 'conflict between', 'preposition': 'between'}, {'phrase': 'workings of government', 'preposition': 'of'}, {'phrase': 'agendas between Brown', 'preposition': 'between'}, {'phrase': 'meetings between', 'preposition': 'between'}, {'phrase': 'purpose of', 'preposition': 'of'}, {'phrase': 'judgement on matter', 'preposition': 'on'}, {'phrase': 'decisions on reforms', 'preposition': 'on'...","[{'phrase': ' Brown criticise leader', 'verb': 'criticise'}, {'phrase': ' conflict harm workings', 'verb': 'harm'}, {'phrase': ' which threaten jobs', 'verb': 'threaten'}, {'phrase': ' It suit purpose', 'verb': 'suit'}, {'phrase': ' Brown tell programme', 'verb': 'tell'}, {'phrase': ' Blair make same decisions', 'verb': 'make'}, {'phrase': ' Baume tell News', 'verb': 'tell'}, {'phrase': ' depa..."
966,"Visa decision 'every 11 minutes'\n\nVisa processing staff are sometimes expected to rule on an application every 11 minutes, MPs have said.\n\nPressure was placed on staff to be efficient, rather than to do a thorough examination of an application, the Public Accounts Committee found. Every officer had an annual target of 8,000 applications - equivalent to 40 a day or one every 11 minutes. MPs...","Visa processing staff are sometimes expected to rule on an application every 11 minutes, MPs have said.Committee members said the Home Office had been wrong to dismiss concerns from visa staff abroad who feared the system was being abused.Committee chairman Edward Leigh said: ""There is a worrying tension between quick processing and proper control over the visas issued.""Entry clearance staff a...",politics,Visa decision every 11 minutes\n\nVisa processing staff are sometimes expected to rule on an application every 11 minutes MPs have said. Pressure was placed on staff to be efficient rather than to do a thorough examination of an application the Public Accounts Committee found. Every officer had an annual target of 8000 applications - equivalent to 40 a day or one every 11 minutes. 
MPs want res...,"[{'phrase': 'officer have target', 'verb': 'have'}, {'phrase': 'MPs want research', 'verb': 'want'}, {'phrase': 'report discuss scandal', 'verb': 'discuss'}, {'phrase': 'people enter UK', 'verb': 'enter'}, {'phrase': 'who set business', 'verb': 'set'}]","[{'phrase': 'Visa decision', 'noun': 'decision'}, {'phrase': 'minutes Visa processing staff', 'noun': 'staff'}, {'phrase': 'thorough examination', 'noun': 'examination'}, {'phrase': 'annual target', 'noun': 'target'}, {'phrase': 'equivalent applications', 'noun': 'applications'}, {'phrase': 'UK visa holders', 'noun': 'holders'}, {'phrase': 'black market', 'noun': 'market'}, {'phrase': 'quick p...","[{'phrase': 'examination of application', 'preposition': 'of'}, {'phrase': 'target of applications', 'preposition': 'of'}, {'phrase': 'research into', 'preposition': 'into'}, {'phrase': 'end of stays', 'preposition': 'of'}, {'phrase': 'tension between processing', 'preposition': 'between'}, {'phrase': 'processing over visas', 'preposition': 'over'}, {'phrase': 'resignation of Hughes', 'preposi...","[{'phrase': ' officer have annual target', 'verb': 'have'}, {'phrase': ' MPs want research', 'verb': 'want'}, {'phrase': ' report discuss scandal', 'verb': 'discuss'}, {'phrase': ' people enter UK', 'verb': 'enter'}, {'phrase': ' who set valid business', 'verb': 'set'}]"


In [34]:
if not os.path.exists(output_overall_data):
    sample_data.to_csv(output_overall_data, index=False)
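
Note that `to_csv` writes list-valued columns such as `NVN_PHRASES` as plain strings, so they come back from `pd.read_csv` as text rather than as lists of dicts. A minimal sketch of restoring them, assuming the cells hold Python-literal representations; the toy frame and the in-memory buffer are illustrative stand-ins, not part of the notebook:

```python
import ast
import io

import pandas as pd

# Toy frame standing in for sample_data (illustrative only)
toy = pd.DataFrame({
    'CATEGORIES': ['business'],
    'NVN_PHRASES': [[{'phrase': 'ONS revise rate', 'verb': 'revise'}]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# The list column is read back as a string
reloaded = pd.read_csv(buffer)
type_after_reload = type(reloaded.loc[0, 'NVN_PHRASES'])
print(type_after_reload)

# ast.literal_eval turns the string back into a list of dicts
reloaded['NVN_PHRASES'] = reloaded['NVN_PHRASES'].apply(ast.literal_eval)
print(reloaded.loc[0, 'NVN_PHRASES'][0]['verb'])
```

The same `apply(ast.literal_eval)` step would be needed for every phrase column if the saved CSV is reloaded in a later session.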

## Segregating Outputs

### Segregating NVN phrases

In [35]:
# keep only the rows whose NVN output is non-empty
sample_data_copy = sample_data[['ARTICLES','CATEGORIES','PREPROCESSED_TEXT','NVN_PHRASES']].copy().reset_index(drop=True)
print(sample_data_copy.shape)

# collect the positions of the non-empty rows instead of concatenating row by row
non_empty_rows = [
    row for row in tqdm(range(len(sample_data_copy)), desc='Selecting non empty rows')
    if len(sample_data_copy.loc[row,'NVN_PHRASES'])!=0
]

# select the rows in one step and reset the index
nvn_sample_data = sample_data_copy.loc[non_empty_rows].reset_index(drop=True)
print(nvn_sample_data.shape)
nvn_sample_data.head(5)

(223, 4)


Selecting non empty rows: 100%|██████████| 223/223 [00:00<00:00, 517.38it/s]

(223, 4)





Unnamed: 0,ARTICLES,CATEGORIES,PREPROCESSED_TEXT,NVN_PHRASES
0,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,"[{'phrase': 'ONS revise rate', 'verb': 'revise'}, {'phrase': 'number report figures', 'verb': 'report'}, {'phrase': 'retailers endure Christmas', 'verb': 'endure'}, {'phrase': 'ONS echo caution', 'verb': 'echo'}, {'phrase': 'analysts put gloss', 'verb': 'put'}, {'phrase': 'figures show performance', 'verb': 'show'}, {'phrase': 'measures cut prices', 'verb': 'cut'}, {'phrase': 'figures have eff..."
1,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...,"[{'phrase': 'dealers use offers', 'verb': 'use'}, {'phrase': 'increase push spending', 'verb': 'push'}, {'phrase': 'Harris tell Reuters', 'verb': 'tell'}, {'phrase': 'which make thirds', 'verb': 'make'}, {'phrase': 'sales grow %', 'verb': 'grow'}, {'phrase': 'analysts expect improvement', 'verb': 'expect'}]"
2,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...,"[{'phrase': 'shares soar Shares', 'verb': 'soar'}, {'phrase': 'authorities turn eye', 'verb': 'turn'}, {'phrase': 'Arabia want industry', 'verb': 'want'}, {'phrase': 'Arabia sell shares', 'verb': 'sell'}, {'phrase': 'applicants get shares', 'verb': 'get'}]"
3,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...,"[{'phrase': 'Fosters buy stake', 'verb': 'buy'}, {'phrase': 'Fosters buy stake', 'verb': 'buy'}, {'phrase': 'Fosters buy %', 'verb': 'buy'}, {'phrase': 'bid value company', 'verb': 'value'}, {'phrase': 'firms ask market', 'verb': 'ask'}, {'phrase': 'Fosters buy stake', 'verb': 'buy'}, {'phrase': 'who found label', 'verb': 'found'}, {'phrase': 'Southcorp employ people', 'verb': 'employ'}, {'phr..."
4,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...,"[{'phrase': 'giant swallow giant', 'verb': 'swallow'}, {'phrase': 'deal give control', 'verb': 'give'}, {'phrase': 'Inbev buy partner', 'verb': 'buy'}, {'phrase': 'brands include Hoegaarden', 'verb': 'include'}, {'phrase': 'It employ people', 'verb': 'employ'}, {'phrase': 'it own %', 'verb': 'own'}, {'phrase': 'Interbrew buy Ambev', 'verb': 'buy'}, {'phrase': 'which employ staff', 'verb': 'emp..."


In [36]:
verb_dict = dict()
dis_list = []

# iterating over all the articles
for i in range(len(nvn_sample_data)):
    
    # text the phrases were extracted from
    sentence = nvn_sample_data.loc[i,'PREPROCESSED_TEXT']
    # category info
    category = nvn_sample_data.loc[i,'CATEGORIES']
    # NVN phrases extracted from the text
    output = nvn_sample_data.loc[i,'NVN_PHRASES']
    
    # iterating over all the phrases extracted from the text
    for sent in output:
        # separate subject, verb and object; partition on the verb as a whole
        # token so that a verb occurring inside a noun does not break the split
        v = sent['verb']
        n1, _, n2 = sent['phrase'].partition(f' {v} ')
        
        # append to list, along with the text
        dis_list.append({
            'PREPROCESSED_TEXT':sentence,
            'CATEGORY':category,
            'NOUN1':n1.strip(),
            'VERB':v,
            'NOUN2':n2.strip()})
        
        # counting the number of phrases containing the verb
        verb_dict[v] = verb_dict.get(v, 0) + 1

df_nvn_sep = pd.DataFrame(dis_list)
df_nvn_sep.head(5)

Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,VERB,NOUN2
0,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,ONS,revise,rate
1,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,number,report,figures
2,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,retailers,endure,Christmas
3,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,ONS,echo,caution
4,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,analysts,put,gloss
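
Splitting an NVN phrase on the bare verb string is fragile when the verb also occurs inside one of the nouns; partitioning on the verb as a space-delimited token avoids this. The sketch below illustrates the difference. `split_nvn` is a hypothetical helper and `'winner win title'` is a made-up phrase; it assumes every phrase has the form `noun verb noun`, as in the outputs above.

```python
def split_nvn(phrase, verb):
    """Split a 'noun verb noun' phrase around the verb taken as a whole token."""
    noun1, sep, noun2 = phrase.partition(f' {verb} ')
    if not sep:
        raise ValueError(f'verb {verb!r} not found as a token in {phrase!r}')
    return noun1.strip(), verb, noun2.strip()


phrase, verb = 'winner win title', 'win'

# Naive split on the substring: 'win' is also found inside 'winner'
naive = phrase.split(verb)
print(naive)

# Token-aware split keeps both nouns intact
print(split_nvn(phrase, verb))

# Leading whitespace, as in the modifier-based phrases, is stripped as well
print(split_nvn(' Brown criticise leader', 'criticise'))
```

Raising on a missing verb is a design choice for the sketch; inside the notebook loop it may be preferable to skip or log such phrases.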


In [37]:
# name both columns explicitly; the default names produced by
# value_counts().reset_index() differ across pandas versions
df_verb_counter = (
    df_nvn_sep.loc[df_nvn_sep.CATEGORY.isin(['sport']), 'VERB']
    .value_counts()
    .rename_axis('VERB')
    .reset_index(name='COUNTER')
)
df_verb_counter[df_verb_counter.COUNTER>1]

Unnamed: 0,VERB,COUNTER
0,have,49
1,win,19
2,tell,18
3,make,16
4,take,16
5,get,14
6,play,12
7,miss,11
8,beat,9
9,set,7
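
The column names produced by `value_counts().reset_index()` depend on the pandas version: older releases yield `index` plus the original column name, while pandas 2.x yields the original column name plus `count`. Renaming by those default names is therefore version-dependent. A small sketch on toy data (the verbs below are made up) of a form that does not rely on the defaults:

```python
import pandas as pd

# Toy verb column standing in for df_nvn_sep['VERB'] (illustrative only)
verbs = pd.Series(['win', 'have', 'win', 'tell', 'have', 'win'], name='VERB')

counter = (
    verbs.value_counts()
         .rename_axis('VERB')            # name of the column holding the values
         .reset_index(name='COUNTER')    # name of the column holding the counts
)

# Same filter as in the notebook: keep verbs seen more than once
frequent = counter[counter.COUNTER > 1]
print(frequent)
```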


In [38]:
# combine both conditions into a single mask; chaining two boolean selections
# makes pandas reindex the second mask and emit a warning
df_nvn_sep[df_nvn_sep.CATEGORY.isin(['sport']) & (df_nvn_sep['VERB']=='win')]


Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,VERB,NOUN2
1216,Tulu to appear at Caledonian run\n\nTwo-time Olympic 10000 metres champion Derartu Tulu has confirmed she will take part in the BUPA Great Caledonian Run in Edinburgh on 8 May. The 32-year-old Ethiopian is the first star name to enter the event. Tulu has won the Boston London and Tokyo Marathons as well as the world 10000m title in 2001.. We are delighted to have secured the services of one th...,sport,Tulu,win,London
1217,Tulu to appear at Caledonian run\n\nTwo-time Olympic 10000 metres champion Derartu Tulu has confirmed she will take part in the BUPA Great Caledonian Run in Edinburgh on 8 May. The 32-year-old Ethiopian is the first star name to enter the event. Tulu has won the Boston London and Tokyo Marathons as well as the world 10000m title in 2001.. We are delighted to have secured the services of one th...,sport,her,win,medal
1228,Holmes back on form in Birmingham\n\nDouble Olympic champion Kelly Holmes was back to her best as she comfortably won the 1000m at the Norwich Union Birmingham Indoor Grand Prix. The 34-year-old running only her second competitive race of the season shook off the rust to win in two minutes 35.39 seconds. But she is still undecided about competing in the European Championships in Madrid from 4-...,sport,she,win,m
1285,Reyes tricked into Real admission\n\nJose Antonio Reyes has added to speculation linking him with a move from Arsenal to Real Madrid after falling victim to a radio prank. The Spaniard believed he was talking to Real Madrid sporting director Emilio Butragueno when he allegedly berated his team-mates as bad people. I wish I was playing for Real Madrid the 21-year-old told Cadena Cope. Hopefully...,sport,team,win,trophies
1332,England given tough Sevens draw\n\nEngland will have to negotiate their way through a tough draw if they are to win the Rugby World Cup Sevens in Hong Kong next month. The second seeds have been drawn against Samoa France Italy Georgia and Chinese Taipei. The top two sides in each pool qualify but England could face 2001 winners New Zealand in the quarter-finals if they stumble against Samoa. ...,sport,England,win,event
1338,England given tough Sevens draw\n\nEngland will have to negotiate their way through a tough draw if they are to win the Rugby World Cup Sevens in Hong Kong next month. The second seeds have been drawn against Samoa France Italy Georgia and Chinese Taipei. The top two sides in each pool qualify but England could face 2001 winners New Zealand in the quarter-finals if they stumble against Samoa. ...,sport,England,win,Sevens
1356,Pavey focuses on indoor success\n\nJo Pavey will miss Januarys View From Great Edinburgh International Cross Country to focus on preparing for the European Indoor Championships in March. The 31-year-old was third behind Hayley Yelling and Justyna Bak in last weeks European Cross Country Championships but she prefers to race on the track. It was great winning bronze but I am wary of injuries an...,sport,team,win,medal
1370,Beckham rules out management move\n\nReal Madrid midfielder David Beckham has no plans to become a manager when his playing career is over. I am not interested in being a coach but I would like to have football schools the England captain said on television station Canal Plus. I have wanted to do that since I went to the Bobby Charlton school. I am going to open one in London and one in LA. My...,sport,priority,win,title
1402,Roddick into San Jose final\n\nAndy Roddick will play Cyril Saulnier in the final of the SAP Open in San Jose on Sunday. The American top seed and defending champion overcame Germanys Tommy Haas the third seed 7-6 7-3 6-3.. And Saulnier survived an injury scare in his semi-final with seventh-seeded Austrian Jurgen Melzer. The Frenchman twisted his ankle early in the second set but overcame Mel...,sport,Roddick,win,points
1412,Collins to compete in Birmingham\n\nWorld and Commonwealth 100m champion Kim Collins will compete in the 60m at the Norwich Union Grand Prix in Birmingham on 18 February. The St Kitts and Nevis star joins British Olympic relay gold medallists Jason Gardener and Mark Lewis-Francis. Sydney Olympic 100m champion and world indoor record holder Maurice Greene and Athens Olympic 100m silver medallis...,sport,I,win,medal


### Segregating AN phrases

In [39]:
# keep only the rows whose AN output is non-empty
sample_data_copy = sample_data[['ARTICLES','CATEGORIES','PREPROCESSED_TEXT','AN_PHRASES']].copy().reset_index(drop=True)
print(sample_data_copy.shape)

# collect the positions of the non-empty rows instead of concatenating row by row
non_empty_rows = [
    row for row in tqdm(range(len(sample_data_copy)), desc='Selecting non empty rows')
    if len(sample_data_copy.loc[row,'AN_PHRASES'])!=0
]

# select the rows in one step and reset the index
an_sample_data = sample_data_copy.loc[non_empty_rows].reset_index(drop=True)
print(an_sample_data.shape)
an_sample_data.head(5)

(223, 4)


Selecting non empty rows: 100%|██████████| 223/223 [00:00<00:00, 495.58it/s]

(223, 4)





Unnamed: 0,ARTICLES,CATEGORIES,PREPROCESSED_TEXT,AN_PHRASES
0,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,"[{'phrase': 'Christmas sales', 'noun': 'sales'}, {'phrase': 'retail sales', 'noun': 'sales'}, {'phrase': 'Retail sales', 'noun': 'sales'}, {'phrase': '% rise', 'noun': 'rise'}, {'phrase': 'annual rate', 'noun': 'rate'}, {'phrase': 'poor figures', 'noun': 'figures'}, {'phrase': 'Clothing retailers', 'noun': 'retailers'}, {'phrase': 'only internet retailers', 'noun': 'retailers'}, {'phrase': 'si..."
1,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...,"[{'phrase': 'retail sales', 'noun': 'sales'}, {'phrase': 'high note', 'noun': 'note'}, {'phrase': 'solid gains', 'noun': 'gains'}, {'phrase': 'strong car sales', 'noun': 'sales'}, {'phrase': '% rise', 'noun': 'rise'}, {'phrase': '% jump', 'noun': 'jump'}, {'phrase': 'auto sales', 'noun': 'sales'}, {'phrase': 'enhanced offers', 'noun': 'offers'}, {'phrase': 'sales growth', 'noun': 'growth'}, {'..."
2,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...,"[{'phrase': 'NCCIs shares', 'noun': 'shares'}, {'phrase': 'first day', 'noun': 'day'}, {'phrase': 'offer price', 'noun': 'price'}, {'phrase': 'insurers debut shares', 'noun': 'shares'}, {'phrase': 'insurance market', 'noun': 'market'}, {'phrase': 'damage cover', 'noun': 'cover'}, {'phrase': 'insurance products', 'noun': 'products'}, {'phrase': 'blind eye', 'noun': 'eye'}, {'phrase': 'many othe..."
3,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...,"[{'phrase': 'large stake', 'noun': 'stake'}, {'phrase': 'possible takeover', 'noun': 'takeover'}, {'phrase': 'Australian dollars', 'noun': 'dollars'}, {'phrase': 'major corporate announcement', 'noun': 'announcement'}, {'phrase': 'separate statement', 'noun': 'statement'}, {'phrase': 'Sydney stock market', 'noun': 'market'}, {'phrase': 'Southcorps shares', 'noun': 'shares'}, {'phrase': '% stak..."
4,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...,"[{'phrase': 'Beer giant', 'noun': 'giant'}, {'phrase': 'firm Brewing giant', 'noun': 'giant'}, {'phrase': 'Ecos stake', 'noun': 'stake'}, {'phrase': 'largest brewer', 'noun': 'brewer'}, {'phrase': 'capital arm', 'noun': 'arm'}, {'phrase': 'fifth stake', 'noun': 'stake'}, {'phrase': 'beermaker total control', 'noun': 'control'}, {'phrase': 'Russian brewer', 'noun': 'brewer'}, {'phrase': 'Inbev ..."


In [40]:
noun_dict = dict()
dis_list = []

# iterating over all the articles
for i in range(len(an_sample_data)):
    
    # text the phrases were extracted from
    sentence = an_sample_data.loc[i,'PREPROCESSED_TEXT']
    # category info
    category = an_sample_data.loc[i,'CATEGORIES']
    # AN phrases extracted from the text
    output = an_sample_data.loc[i,'AN_PHRASES']
    
    # iterating over all the phrases extracted from the text
    for sent in output:
        # separate modifier and head noun; the noun closes the phrase, so keep
        # everything before its last occurrence
        n = sent['noun']
        adj = sent['phrase'].rpartition(n)[0].strip()
        
        # append to list, along with the text
        dis_list.append({
            'PREPROCESSED_TEXT':sentence,
            'CATEGORY':category,
            'ADJ':adj,
            'NOUN':n})
        
        # counting the number of phrases containing the noun
        noun_dict[n] = noun_dict.get(n, 0) + 1

df_an_sep = pd.DataFrame(dis_list)
df_an_sep.head(5)

Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,ADJ,NOUN
0,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,Christmas,sales
1,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,retail,sales
2,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,Retail,sales
3,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,%,rise
4,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,annual,rate
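
In the AN outputs above the head noun always closes the phrase, so the modifier can be taken as everything before the last occurrence of the noun. A minimal sketch under that assumption; `split_an` is a hypothetical helper and `'top seeded seed'` is a made-up phrase chosen because the noun string also appears inside the modifier:

```python
def split_an(phrase, noun):
    """Return (modifier, noun), assuming the noun is the last word of the phrase."""
    modifier, sep, _ = phrase.rpartition(noun)
    if not sep:
        raise ValueError(f'noun {noun!r} not found in {phrase!r}')
    return modifier.strip(), noun


phrase, noun = 'top seeded seed', 'seed'

# Splitting on every occurrence of the noun and re-joining mangles the modifier
naive = ''.join(item.strip() for item in phrase.split(noun))
print(naive)

# Cutting at the last occurrence keeps the modifier intact
print(split_an(phrase, noun))
print(split_an('opening service game', 'game'))
```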


In [41]:
# name both columns explicitly; the default names produced by
# value_counts().reset_index() differ across pandas versions
df_noun_counter = (
    df_an_sep.loc[df_an_sep.CATEGORY.isin(['sport']), 'NOUN']
    .value_counts()
    .rename_axis('NOUN')
    .reset_index(name='COUNTER')
)
df_noun_counter[df_noun_counter.COUNTER>1]

Unnamed: 0,NOUN,COUNTER
0,game,12
1,title,11
2,record,11
3,injury,11
4,time,10
5,football,10
6,match,9
7,seed,9
8,players,8
9,side,7


In [42]:
# combine both conditions into a single mask; chaining two boolean selections
# makes pandas reindex the second mask and emit a warning
df_an_sep[df_an_sep.CATEGORY.isin(['sport']) & (df_an_sep['NOUN']=='game')]


Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,ADJ,NOUN
3285,Vickery out of Six Nations\n\nEngland tight-head prop Phil Vickery has been ruled out of the rest of the 2005 RBS Six Nations after breaking a bone in his right forearm. Vickery was injured as his club side Gloucester beat Bath 17-16 in the West country derby on Saturday. He could be joined on the sidelines by Bath centre Olly Barkley who sat out the derby due to a leg injury. Barkley will hav...,sport,Ireland,game
3536,Leeds v Saracens Fri\n\nHeadingley\n\nFriday 25 February\n\n2000 GMT\n\nThe Tykes have brought in Newcastle prop Ed Kalman and Tom McGee from the Borders on loan while fly-half Craig McMullen has joined from Narbonne. Raphael Ibanez is named at hooker for Saracens in one of four changes. Simon Raiwalui and Ben Russell are also selected in the pack while Kevin Sorrell comes in at outside centre...,sport,Fridays,game
3554,Johansson takes Adelaide victory\n\nSecond seed Joachim Johansson won his second career title with a 7-5 6-3 win over Taylor Dent at the Australian hardcourt championships in Adelaide. The Swede was made to graft American Dent surviving three break points in the fifth game of the match. But Johansson got the breakthrough with a sublime backhand return winner and won the second set with more ea...,sport,fifth,game
3659,Benitez joy as Reds take control\n\nLiverpool boss Rafael Benitez was satisfied after his teams 3-1 win over Bayer Leverkusen despite conceding a goal in the last minute. Before the game if you had said the score will be 3-1 I would have happily accepted that said Benitez. But you must realise that you have to concentrate right to the very last seconds of a game at this level. I have confidenc...,sport,good,game
3706,Hewitt fights back to reach final\n\nLleyton Hewitt kept his dream of an Australian Open title alive with a four-set win over Andy Roddick in Fridays second semi-final. The home favourite will face Marat Safin in Sundays final after coming through 3-6 7-6 7-3 7-6 7-4 6-1.. Hewitt fought back from a set down and trailed in both tie-breaks but would not be denied thrilling the Melbourne crowd wi...,sport,opening service,game
3830,Robinson out of Six Nations\n\nEngland captain Jason Robinson will miss the rest of the Six Nations because of injury. Robinson stand-in captain in the absence of Jonny Wilkinson had been due to lead England in their final two games against Italy and Scotland. But the Sale full-back pulled out of the squad on Wednesday because of a torn ligament in his right thumb. The 30-year-old will undergo...,sport,best,game
3831,Robinson out of Six Nations\n\nEngland captain Jason Robinson will miss the rest of the Six Nations because of injury. Robinson stand-in captain in the absence of Jonny Wilkinson had been due to lead England in their final two games against Italy and Scotland. But the Sale full-back pulled out of the squad on Wednesday because of a torn ligament in his right thumb. The 30-year-old will undergo...,sport,March,game
3840,Finnan says Irish can win group\n\nSteve Finnan believes the Republic of Ireland can qualify directly for the World Cup finals. After Saturdays superb display in the draw in Paris Ireland face minnows the Faroe Islands in Dublin on Wednesday. The versatile Finnan who starred against the French is confident the group is Irelands for the taking. There is a chance for us now to go on win our home...,sport,Wednesdays,game
3851,Taylor poised for Scotland return\n\nSimon Taylor has been named in the Scotland squad for Saturdays Six Nations clash with Italy. The 25-year-old number eight made a scoring return for Edinburgh at the weekend - his first game in a year for the capital side. Taylor suffered knee ligament damage playing against Ireland in Dublin in the 2004 Six Nations championship. Simon is one of Scotlands t...,sport,first,game
3865,Taylor poised for Scotland return\n\nSimon Taylor has been named in the Scotland squad for Saturdays Six Nations clash with Italy. The 25-year-old number eight made a scoring return for Edinburgh at the weekend - his first game in a year for the capital side. Taylor suffered knee ligament damage playing against Ireland in Dublin in the 2004 Six Nations championship. Simon is one of Scotlands t...,sport,weekends,game
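
The table above lists one row per adjective-noun pair. To see at a glance which adjectives accompany each noun, the pairs can be grouped by noun. This is a minimal sketch on an invented `df_demo` frame that mirrors the `CATEGORY`, `ADJ` and `NOUN` columns; it is an optional extra view, not a step the notebook performs.

```python
import pandas as pd

# Toy stand-in for df_an_sep (values invented for illustration)
df_demo = pd.DataFrame({
    "CATEGORY": ["sport", "sport", "sport", "business"],
    "ADJ": ["fifth", "first", "first", "annual"],
    "NOUN": ["game", "game", "game", "rate"],
})

sport_pairs = df_demo[df_demo["CATEGORY"] == "sport"]
# Distinct adjectives per noun, sorted for a stable display
adj_by_noun = sport_pairs.groupby("NOUN")["ADJ"].unique().map(sorted)
print(adj_by_noun["game"])
```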


### Segregating NPN phrases

In [43]:
# selecting non-empty output rows
sample_data_copy = sample_data[['ARTICLES','CATEGORIES','PREPROCESSED_TEXT','NPN_PHRASES']].copy().reset_index(drop=True)
print(sample_data_copy.shape)
npn_sample_data = pd.DataFrame(columns=sample_data_copy.columns)

for row in tqdm(range(len(sample_data_copy)), desc='Selecting non empty rows'):
    if len(sample_data_copy.loc[row,'NPN_PHRASES'])!=0:
        npn_sample_data = pd.concat([npn_sample_data, pd.DataFrame([sample_data_copy.loc[row,:]])], ignore_index=True)

# reset the index
npn_sample_data = npn_sample_data.reset_index(drop=True)
print(npn_sample_data.shape)
npn_sample_data.head(5)

(223, 4)


Selecting non empty rows: 100%|██████████| 223/223 [00:00<00:00, 346.82it/s]


(223, 4)


Unnamed: 0,ARTICLES,CATEGORIES,PREPROCESSED_TEXT,NPN_PHRASES
0,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,"[{'phrase': 'rise in November', 'preposition': 'in'}, {'phrase': 'rate of growth', 'preposition': 'of'}, {'phrase': 'number of retailers', 'preposition': 'of'}, {'phrase': 'caution from King', 'preposition': 'from'}, {'phrase': 'way below booms', 'preposition': 'below'}, {'phrase': 'figures for volume', 'preposition': 'for'}, {'phrase': 'measures of spending indication', 'preposition': 'of'}, ..."
1,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...,"[{'phrase': 'surge in', 'preposition': 'in'}, {'phrase': 'note with gains', 'preposition': 'with'}, {'phrase': 'gains in December', 'preposition': 'in'}, {'phrase': 'surge in shopping', 'preposition': 'in'}, {'phrase': 'rise in', 'preposition': 'in'}, {'phrase': 'jump in sales', 'preposition': 'in'}, {'phrase': 'end of year', 'preposition': 'of'}, {'phrase': 'increase in sales', 'preposition':..."
2,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...,"[{'phrase': 'Shares in Company', 'preposition': 'in'}, {'phrase': 'day of trading', 'preposition': 'of'}, {'phrase': 'day in Riyadh', 'preposition': 'in'}, {'phrase': 'Demand for shares', 'preposition': 'for'}, {'phrase': 'part of countrys', 'preposition': 'of'}, {'phrase': 'demand in sector', 'preposition': 'in'}, {'phrase': 'demand for cover', 'preposition': 'for'}, {'phrase': 'confidence in..."
3,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...,"[{'phrase': 'stake in Southcorp', 'preposition': 'in'}, {'phrase': 'rumours of takeover', 'preposition': 'of'}, {'phrase': '% of Southcorp', 'preposition': 'of'}, {'phrase': 'winemaker behind Lindemans', 'preposition': 'behind'}, {'phrase': 'brands for dollars', 'preposition': 'for'}, {'phrase': 'dollars per share', 'preposition': 'per'}, {'phrase': 'bid at price', 'preposition': 'at'}, {'phra..."
4,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...,"[{'phrase': 'stake in brewer', 'preposition': 'in'}, {'phrase': 'arm of Group', 'preposition': 'of'}, {'phrase': 'stake in Interbrew', 'preposition': 'in'}, {'phrase': 'control over brewer', 'preposition': 'over'}, {'phrase': 'countries across Europe', 'preposition': 'across'}, {'phrase': '% of shares', 'preposition': 'of'}, {'phrase': '% of shares', 'preposition': 'of'}, {'phrase': 'shares of..."
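
The row-by-row selection above can also be written as a single boolean mask, which avoids calling `pd.concat` once per row. The sketch below assumes, as in the notebook, that the phrase column holds Python lists. The toy `df_demo` frame is invented for illustration.

```python
import pandas as pd

# Toy stand-in for sample_data_copy (values invented for illustration)
df_demo = pd.DataFrame({
    "PREPROCESSED_TEXT": ["text a", "text b", "text c"],
    "NPN_PHRASES": [
        [{"phrase": "rise in November", "preposition": "in"}],
        [],
        [{"phrase": "rate of growth", "preposition": "of"}],
    ],
})

# True where the list of extracted phrases is non-empty
has_phrases = df_demo["NPN_PHRASES"].map(len) > 0
non_empty = df_demo.loc[has_phrases].reset_index(drop=True)
print(non_empty.shape)
```

On larger samples this form should be noticeably faster, since the frame is built once rather than being grown inside a loop.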


In [44]:
preposition_dict = dict()
dis_dict = dict()
dis_list = []

# iterating over all the sentences
for i in range(len(npn_sample_data)):
    
    # sentence containing the output
    sentence = npn_sample_data.loc[i,'PREPROCESSED_TEXT']
    # category info
    category = npn_sample_data.loc[i,'CATEGORIES']
    # output of the sentence
    output = npn_sample_data.loc[i,'NPN_PHRASES']
    
    # iterating over all the outputs from the sentence
    for sent in output:
        # separate the first noun, the preposition and the second noun
        # (NOUN1 and NOUN2 are kept as lists of tokens)
        tokens = sent['phrase'].split()
        n1, prep, n2 = tokens[:1], tokens[1], tokens[2:]
        
        # append to list, along with the sentence
        dis_dict = {
            'PREPROCESSED_TEXT':sentence,
            'CATEGORY':category,
            'NOUN1':n1,
            'PREPOSITION':prep,
            'NOUN2':n2}
        dis_list.append(dis_dict)
        
        # counting the number of phrases containing the preposition
        if prep in preposition_dict:
            preposition_dict[prep]+=1
        else:
            preposition_dict[prep]=1

df_npn_sep = pd.DataFrame(dis_list)
df_npn_sep.head(5)

Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,PREPOSITION,NOUN2
0,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,[rise],in,[November]
1,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,[rate],of,[growth]
2,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,[number],of,[retailers]
3,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,[caution],from,[King]
4,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,[way],below,[booms]
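
Splitting by token position stores `NOUN1` and `NOUN2` as lists, and a two-token phrase such as `'surge in'` yields an empty `NOUN2`. An alternative is to split around the `preposition` value already stored with each phrase, return plain strings, and count with `collections.Counter`. The `split_npn` helper below is a hypothetical sketch of that idea, not the notebook's own method.

```python
from collections import Counter

def split_npn(item):
    """Split an NPN phrase dict into (noun1, preposition, noun2) strings."""
    prep = item["preposition"]
    tokens = item["phrase"].split()
    if prep not in tokens:
        # Fall back to the whole phrase when the preposition token is absent
        return item["phrase"].strip(), prep, ""
    idx = tokens.index(prep)
    return " ".join(tokens[:idx]), prep, " ".join(tokens[idx + 1:])

# Sample items copied from the NPN_PHRASES output shown earlier
phrases = [
    {"phrase": "rise in November", "preposition": "in"},
    {"phrase": "measures of spending indication", "preposition": "of"},
    {"phrase": "surge in", "preposition": "in"},
]

rows = [split_npn(p) for p in phrases]
preposition_counts = Counter(prep for _, prep, _ in rows)
print(rows)
print(preposition_counts.most_common())
```

Phrases with a missing second noun then show up as empty strings, which are easy to filter out before counting.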


In [45]:
df_prep_counter = df_npn_sep.loc[df_npn_sep.CATEGORY.isin(['sport'])].PREPOSITION.value_counts().reset_index()
df_prep_counter = df_prep_counter.rename(columns={'PREPOSITION':'COUNTER', 'index':'PREPOSITION'})
df_prep_counter[df_prep_counter.COUNTER>1]

Unnamed: 0,PREPOSITION,COUNTER
0,of,162
1,in,102
2,for,59
3,at,32
4,with,25
5,to,23
6,on,18
7,from,17
8,over,14
9,against,13


In [46]:
df_npn_sep.loc[df_npn_sep.CATEGORY.isin(['sport']) & (df_npn_sep['PREPOSITION']=='against')]


Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,PREPOSITION,NOUN2
2260,Cudicini misses Carling Cup final\n\nChelsea goalkeeper Carlo Cudicini will miss Sundays Carling Cup final after the club dropped their appeal against his red card against Newcastle. The Italian was sent off for bringing down Shola Ameobi in the final minute of Sundays match. Blues boss Jose Mourinho had promised to pick Cudicini for the final instead of first-choice keeper Petr Cech. The 31-y...,sport,[appeal],against,[card]
2261,Cudicini misses Carling Cup final\n\nChelsea goalkeeper Carlo Cudicini will miss Sundays Carling Cup final after the club dropped their appeal against his red card against Newcastle. The Italian was sent off for bringing down Shola Ameobi in the final minute of Sundays match. Blues boss Jose Mourinho had promised to pick Cudicini for the final instead of first-choice keeper Petr Cech. The 31-y...,sport,[card],against,[Newcastle]
2274,Vickery out of Six Nations\n\nEngland tight-head prop Phil Vickery has been ruled out of the rest of the 2005 RBS Six Nations after breaking a bone in his right forearm. Vickery was injured as his club side Gloucester beat Bath 17-16 in the West country derby on Saturday. He could be joined on the sidelines by Bath centre Olly Barkley who sat out the derby due to a leg injury. Barkley will hav...,sport,[injury],against,[Bath]
2280,Wales hails new superstar\n\nOne game into his Six Nations career and Gavin Henson is already a Welsh legend. A mesmeric display against England topped off by his howitzer of a match-winning penalty has secured life membership of that particular club. At 23 Henson has the rugby world at his silver-booted feet. And if his natural self-assurance and swagger is shared by his Wales team-mates then...,sport,[display],against,[England]
2380,England coach faces rap after row\n\nEngland coach Andy Robinson is facing disciplinary action after criticising referee Jonathan Kaplan in his sides Six Nations defeat to Ireland. The Rugby Football Union RFU will investigate Robinson after deciding not to lodge a complaint against Kaplan. Robinson may even have to apologise for his comments in order to avoid sanction from the International R...,sport,[complaint],against,[Kaplan]
2390,England coach faces rap after row\n\nEngland coach Andy Robinson is facing disciplinary action after criticising referee Jonathan Kaplan in his sides Six Nations defeat to Ireland. The Rugby Football Union RFU will investigate Robinson after deciding not to lodge a complaint against Kaplan. Robinson may even have to apologise for his comments in order to avoid sanction from the International R...,sport,[match],against,[Wales]
2418,Beckham rules out management move\n\nReal Madrid midfielder David Beckham has no plans to become a manager when his playing career is over. I am not interested in being a coach but I would like to have football schools the England captain said on television station Canal Plus. I have wanted to do that since I went to the Bobby Charlton school. I am going to open one in London and one in LA. My...,sport,[off],against,[Argentina]
2446,Roddick into San Jose final\n\nAndy Roddick will play Cyril Saulnier in the final of the SAP Open in San Jose on Sunday. The American top seed and defending champion overcame Germanys Tommy Haas the third seed 7-6 7-3 6-3.. And Saulnier survived an injury scare in his semi-final with seventh-seeded Austrian Jurgen Melzer. The Frenchman twisted his ankle early in the second set but overcame Mel...,sport,[chances],against,[player]
2524,Lions blow to World Cup winners\n\nBritish and Irish Lions coach Clive Woodward says he is unlikely to select any players not involved in next years RBS Six Nations Championship. World Cup winners Lawrence Dallaglio Neil Back and Martin Johnson had all been thought to be in the frame for next summers tour to New Zealand. I do not think you can ever say never said Woodward. But I would have to ...,sport,[performance],against,[France]
2585,Hewitt fights back to reach final\n\nLleyton Hewitt kept his dream of an Australian Open title alive with a four-set win over Andy Roddick in Fridays second semi-final. The home favourite will face Marat Safin in Sundays final after coming through 3-6 7-6 7-3 7-6 7-4 6-1.. Hewitt fought back from a set down and trailed in both tie-breaks but would not be denied thrilling the Melbourne crowd wi...,sport,[challenge],against,[Safin]


### Segregating NVN modified phrases  

(Compound/Adjective Noun 1 + Verb + Compound/Adjective Noun 2)

In [47]:
# selecting non-empty output rows
sample_data_copy = sample_data[['ARTICLES','CATEGORIES','PREPROCESSED_TEXT','NVN_MOD_PHRASES']].copy().reset_index(drop=True)
print(sample_data_copy.shape)
nvn_mod_sample_data = pd.DataFrame(columns=sample_data_copy.columns)

for row in tqdm(range(len(sample_data_copy)), desc='Selecting non empty rows'):
    if len(sample_data_copy.loc[row,'NVN_MOD_PHRASES'])!=0:
        nvn_mod_sample_data = pd.concat([nvn_mod_sample_data, pd.DataFrame([sample_data_copy.loc[row,:]])], ignore_index=True)

# reset the index
nvn_mod_sample_data = nvn_mod_sample_data.reset_index(drop=True)
print(nvn_mod_sample_data.shape)
nvn_mod_sample_data.head(5)

(223, 4)


Selecting non empty rows: 100%|██████████| 223/223 [00:00<00:00, 793.59it/s]

(223, 4)





Unnamed: 0,ARTICLES,CATEGORIES,PREPROCESSED_TEXT,NVN_MOD_PHRASES
0,"Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.\n\nRetail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A num...",business,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,"[{'phrase': ' ONS revise annual rate', 'verb': 'revise'}, {'phrase': ' number report poor figures', 'verb': 'report'}, {'phrase': ' retailers endure tougher Christmas', 'verb': 'endure'}, {'phrase': ' ONS echo earlier caution', 'verb': 'echo'}, {'phrase': ' analysts put positive gloss', 'verb': 'put'}, {'phrase': ' non - figures show comparable performance', 'verb': 'show'}, {'phrase': ' measu..."
1,"US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December, boosted by strong car sales.\n\nSeasonally adjusted sales rose 1.2% in the month, compared to 0.1% a month earlier, boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year, the best performance since an 8.5% rise in 1999, the Commerce Department...",business,US retail sales surge in December\n\nUS retail sales ended the year on a high note with solid gains in December boosted by strong car sales. Seasonally adjusted sales rose 1.2% in the month compared to 0.1% a month earlier boosted by a surge in shopping just before and after Christmas. Sales climbed 8% for the year the best performance since an 8.5% rise in 1999 the Commerce Department added. ...,"[{'phrase': ' dealers use enhanced offers', 'verb': 'use'}, {'phrase': ' increase push total spending', 'verb': 'push'}, {'phrase': ' Harris tell Reuters', 'verb': 'tell'}, {'phrase': ' which make thirds', 'verb': 'make'}, {'phrase': ' sales grow lacklustre %', 'verb': 'grow'}, {'phrase': ' analysts expect improvement', 'verb': 'expect'}]"
2,"Saudi NCCI's shares soar\n\nShares in Saudi Arabia's National Company for Cooperative Insurance (NCCI) soared on their first day of trading in Riyadh.\n\nThey were trading 84% above the offer price on Monday, changing hands at 372 riyals ($99; Â£53) after topping 400 early in the day. Demand for the insurer's debut shares was strong - 12 times what was on sale. The listing was part of the coun...",business,Saudi NCCIs shares soar\n\nShares in Saudi Arabias National Company for Cooperative Insurance NCCI soared on their first day of trading in Riyadh. They were trading 84% above the offer price on Monday changing hands at 372 riyals 99 Â£53 after topping 400 early in the day. Demand for the insurers debut shares was strong - 12 times what was on sale. The listing was part of the countrys plans to...,"[{'phrase': ' shares soar Shares', 'verb': 'soar'}, {'phrase': ' authorities turn blind eye', 'verb': 'turn'}, {'phrase': ' Arabia want industry', 'verb': 'want'}, {'phrase': ' Arabia sell shares', 'verb': 'sell'}, {'phrase': ' applicants get shares', 'verb': 'get'}]"
3,"Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp, sparking rumours of a possible takeover.\n\nFosters bought 18.8% of Southcorp, the global winemaker behind the Penfolds, Lindemans and Rosemount brands, for 4.17 Australian dollars per share. A bid at that price would value the company at A$3.1bn ($2.4bn; Â£1.25bn ). Fosters...",business,Fosters buys stake in winemaker\n\nAustralian brewer Fosters has bought a large stake in Australian winemaker Southcorp sparking rumours of a possible takeover. Fosters bought 18.8% of Southcorp the global winemaker behind the Penfolds Lindemans and Rosemount brands for 4.17 Australian dollars per share. A bid at that price would value the company at A3.1bn 2.4bn Â£1.25bn . Fosters said it was...,"[{'phrase': ' Fosters buy stake', 'verb': 'buy'}, {'phrase': ' Australian Fosters buy large stake', 'verb': 'buy'}, {'phrase': ' Fosters buy %', 'verb': 'buy'}, {'phrase': ' bid value company', 'verb': 'value'}, {'phrase': ' firms ask market', 'verb': 'ask'}, {'phrase': ' Fosters buy stake', 'verb': 'buy'}, {'phrase': ' who found label', 'verb': 'found'}, {'phrase': ' Southcorp employ people',..."
4,"Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer, for up to 259.7m euros ($353.3m; Â£183.75m).\n\nAlfa-Eco, the venture capital arm of Russian conglomerate Alfa Group, has a one-fifth stake in Sun Interbrew. The deal gives Inbev, the world's biggest beermaker, near-total control over the Russian brewer. ...",business,Beer giant swallows Russian firm\n\nBrewing giant Inbev has agreed to buy Alfa-Ecos stake in Sun Interbrew Russias second-largest brewer for up to 259.7m euros 353.3m Â£183.75m. Alfa-Eco the venture capital arm of Russian conglomerate Alfa Group has a one-fifth stake in Sun Interbrew. The deal gives Inbev the worlds biggest beermaker near-total control over the Russian brewer. Inbev bought out...,"[{'phrase': ' giant swallow giant', 'verb': 'swallow'}, {'phrase': ' deal give beermaker control', 'verb': 'give'}, {'phrase': ' Inbev buy partner', 'verb': 'buy'}, {'phrase': ' Inbev brands include Hoegaarden', 'verb': 'include'}, {'phrase': ' It employ people', 'verb': 'employ'}, {'phrase': ' it own %', 'verb': 'own'}, {'phrase': ' Interbrew buy Ambev', 'verb': 'buy'}, {'phrase': ' which emp..."


In [48]:
verb_dict = dict()
dis_dict = dict()
dis_list = []

# iterating over all the sentences
for i in range(len(nvn_mod_sample_data)):
    
    # sentence containing the output
    sentence = nvn_mod_sample_data.loc[i,'PREPROCESSED_TEXT']
    # category info
    category = nvn_mod_sample_data.loc[i,'CATEGORIES']
    # output of the sentence
    output = nvn_mod_sample_data.loc[i,'NVN_MOD_PHRASES']
    
    # iterating over all the outputs from the sentence
    for sent in output:
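        # separate subject, verb and object; partition on the space-delimited verb
        # so that a verb which also occurs inside a noun does not break the split
        n1, _, n2 = sent['phrase'].partition(f" {sent['verb']} ")
        n1, v, n2 = n1.strip(), sent['verb'], n2.strip()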
        
        # append to list, along with the sentence
        dis_dict = {
            'PREPROCESSED_TEXT':sentence,
            'CATEGORY':category,
            'NOUN1':n1,
            'VERB':v,
            'NOUN2':n2}
        dis_list.append(dis_dict)
        
        # counting the number of phrases containing the verb
        verb = sent['verb']
        if verb in verb_dict:
            verb_dict[verb]+=1
        else:
            verb_dict[verb]=1

df_nvn_mod_sep = pd.DataFrame(dis_list)
df_nvn_mod_sep.head(5)

Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,VERB,NOUN2
0,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,ONS,revise,annual rate
1,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,number,report,poor figures
2,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,retailers,endure,tougher Christmas
3,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,ONS,echo,earlier caution
4,Christmas sales worst since 1981\n\nUK retail sales fell in December failing to meet expectations and making it by some counts the worst Christmas since 1981.. Retail sales dropped by 1% on the month in December after a 0.6% rise in November the Office for National Statistics ONS said. The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%. A number of ...,business,analysts,put,positive gloss
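
The nested loops that unpack each list of phrase dictionaries can also be replaced with `DataFrame.explode`, which creates one row per list element. The sketch below is one possible shape for that, under the assumption that every element is a dict with the same keys. The `flatten_phrases` helper and the toy frame are hypothetical.

```python
import pandas as pd

def flatten_phrases(df, phrase_col):
    """Return one row per extracted phrase, with the dict keys as columns."""
    # explode turns empty lists into NaN rows, which are dropped here
    exploded = df.explode(phrase_col).dropna(subset=[phrase_col]).reset_index(drop=True)
    details = pd.DataFrame(exploded[phrase_col].tolist())
    return pd.concat([exploded.drop(columns=[phrase_col]), details], axis=1)

# Toy stand-in for nvn_mod_sample_data (phrases copied from the output above)
df_demo = pd.DataFrame({
    "CATEGORIES": ["business", "sport"],
    "NVN_MOD_PHRASES": [
        [{"phrase": " ONS revise annual rate", "verb": "revise"},
         {"phrase": " analysts put positive gloss", "verb": "put"}],
        [],
    ],
})

flat = flatten_phrases(df_demo, "NVN_MOD_PHRASES")
print(flat)
```

The noun columns would still need to be derived from `phrase` and `verb`, for example with a string split applied row by row.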


In [49]:
df_verb_mod_counter = df_nvn_mod_sep.loc[df_nvn_mod_sep.CATEGORY.isin(['sport'])].VERB.value_counts().reset_index()
df_verb_mod_counter = df_verb_mod_counter.rename(columns={'VERB':'COUNTER', 'index':'VERB'})
df_verb_mod_counter[df_verb_mod_counter.COUNTER>1]

Unnamed: 0,VERB,COUNTER
0,have,49
1,win,19
2,tell,18
3,make,16
4,take,16
5,get,14
6,play,12
7,miss,11
8,beat,9
9,set,7


In [50]:
df_nvn_mod_sep.loc[df_nvn_mod_sep.CATEGORY.isin(['sport']) & (df_nvn_mod_sep['VERB']=='win')]


Unnamed: 0,PREPROCESSED_TEXT,CATEGORY,NOUN1,VERB,NOUN2
1216,Tulu to appear at Caledonian run\n\nTwo-time Olympic 10000 metres champion Derartu Tulu has confirmed she will take part in the BUPA Great Caledonian Run in Edinburgh on 8 May. The 32-year-old Ethiopian is the first star name to enter the event. Tulu has won the Boston London and Tokyo Marathons as well as the world 10000m title in 2001.. We are delighted to have secured the services of one th...,sport,Tulu,win,London
1217,Tulu to appear at Caledonian run\n\nTwo-time Olympic 10000 metres champion Derartu Tulu has confirmed she will take part in the BUPA Great Caledonian Run in Edinburgh on 8 May. The 32-year-old Ethiopian is the first star name to enter the event. Tulu has won the Boston London and Tokyo Marathons as well as the world 10000m title in 2001.. We are delighted to have secured the services of one th...,sport,her,win,medal
1228,Holmes back on form in Birmingham\n\nDouble Olympic champion Kelly Holmes was back to her best as she comfortably won the 1000m at the Norwich Union Birmingham Indoor Grand Prix. The 34-year-old running only her second competitive race of the season shook off the rust to win in two minutes 35.39 seconds. But she is still undecided about competing in the European Championships in Madrid from 4-...,sport,she,win,m
1285,Reyes tricked into Real admission\n\nJose Antonio Reyes has added to speculation linking him with a move from Arsenal to Real Madrid after falling victim to a radio prank. The Spaniard believed he was talking to Real Madrid sporting director Emilio Butragueno when he allegedly berated his team-mates as bad people. I wish I was playing for Real Madrid the 21-year-old told Cadena Cope. Hopefully...,sport,team,win,more trophies
1332,England given tough Sevens draw\n\nEngland will have to negotiate their way through a tough draw if they are to win the Rugby World Cup Sevens in Hong Kong next month. The second seeds have been drawn against Samoa France Italy Georgia and Chinese Taipei. The top two sides in each pool qualify but England could face 2001 winners New Zealand in the quarter-finals if they stumble against Samoa. ...,sport,England,win,first event
1338,England given tough Sevens draw\n\nEngland will have to negotiate their way through a tough draw if they are to win the Rugby World Cup Sevens in Hong Kong next month. The second seeds have been drawn against Samoa France Italy Georgia and Chinese Taipei. The top two sides in each pool qualify but England could face 2001 winners New Zealand in the quarter-finals if they stumble against Samoa. ...,sport,England,win,first Sevens
1356,Pavey focuses on indoor success\n\nJo Pavey will miss Januarys View From Great Edinburgh International Cross Country to focus on preparing for the European Indoor Championships in March. The 31-year-old was third behind Hayley Yelling and Justyna Bak in last weeks European Cross Country Championships but she prefers to race on the track. It was great winning bronze but I am wary of injuries an...,sport,British team,win,medal
1370,Beckham rules out management move\n\nReal Madrid midfielder David Beckham has no plans to become a manager when his playing career is over. I am not interested in being a coach but I would like to have football schools the England captain said on television station Canal Plus. I have wanted to do that since I went to the Bobby Charlton school. I am going to open one in London and one in LA. My...,sport,immediate priority,win,Spanish title
1402,Roddick into San Jose final\n\nAndy Roddick will play Cyril Saulnier in the final of the SAP Open in San Jose on Sunday. The American top seed and defending champion overcame Germanys Tommy Haas the third seed 7-6 7-3 6-3.. And Saulnier survived an injury scare in his semi-final with seventh-seeded Austrian Jurgen Melzer. The Frenchman twisted his ankle early in the second set but overcame Mel...,sport,Roddick,win,last points
1412,Collins to compete in Birmingham\n\nWorld and Commonwealth 100m champion Kim Collins will compete in the 60m at the Norwich Union Grand Prix in Birmingham on 18 February. The St Kitts and Nevis star joins British Olympic relay gold medallists Jason Gardener and Mark Lewis-Francis. Sydney Olympic 100m champion and world indoor record holder Maurice Greene and Athens Olympic 100m silver medallis...,sport,I,win,medal
