# Extract information about breaking news events
The input of this task is the output of the previous task.

Given articles labelled as breaking news events, the goal is to extract key information from them. This is essentially a NER (named entity recognition) task: pull out entities such as `person` or `location`.

Both `ArticleTitle` and `ArticleDescription` are used in this task.

Pipeline:
- Load the data set and prepare the data
- Run NER (named entity recognition)
- Collect the results and write them to a file

## Load the data set

In [1]:
import pandas as pd 
import numpy as np
import torch

In [142]:
# The input is the output of last task
df=pd.read_csv("output/news_dataset_labeled_task1.csv")
df_fill=df.fillna("")

In [143]:
df_fill.head()

| | ArticleId | ArticleURL | ArticleTitle | ArticleDescription | ArticlePublishedTime | EventId |
|---|---|---|---|---|---|---|
| 0 | 5cd7ed707ddacd3b2b3b549e | https://www.bbc.co.uk/news/uk-england-suffolk-... | Lowestoft sea wall fall cyclist rescued by friend | Coastguards praise the boy's friend for his ac... | 1557653530 | -1 |
| 1 | 5cd7e83beb96a44751294217 | https://www.highsnobiety.com/p/met-gala-best-c... | The Met Gala & ‘Game of Thrones’ Feature in Th... | Once again, our ever-sarcastic readership have... | 1557653563 | 0 |
| 2 | 5cd7e99a8e662d1e4435cb3d | https://www.mirror.co.uk/news/uk-news/boy-dies... | Boy dies on prom day after allergic reaction t... | Joe Dale's family have spoken out about losing... | 1557653574 | 1 |
| 3 | 5cd7f6dd7ddacd3b2b3b56ab | https://www.independent.co.uk/voices/paddy-jac... | Paddy Jackson’s return to Rugby is yet more pr... | He may have been found not guilty of rape last... | 1557653588 | -1 |
| 4 | 5cd7e89c8e662d1e4435cb13 | https://www.standard.co.uk/showbiz/celebrity-n... | BAFTA TV Awards 2019: Stars prepare for glitzy... | Stars are preparing for Sunday night s TV Baft... | 1557653610 | -1 |


In [144]:
df_fill[df_fill['EventId']==1]

| | ArticleId | ArticleURL | ArticleTitle | ArticleDescription | ArticlePublishedTime | EventId |
|---|---|---|---|---|---|---|
| 2 | 5cd7e99a8e662d1e4435cb3d | https://www.mirror.co.uk/news/uk-news/boy-dies... | Boy dies on prom day after allergic reaction t... | Joe Dale's family have spoken out about losing... | 1557653574 | 1 |
| 47 | 5cd7fa408e662d1e4435d112 | https://www.independent.co.uk/news/uk/home-new... | Boy collapses and dies after suffering allergi... | Parents speak out over 'heart-wrenching' loss ... | 1557654432 | 1 |


In the later steps, articles labelled `-1` are not processed. Only articles that belong to a labelled event (`0`, `1`, ...) are used for extraction, and the results are written to a CSV file.
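
As a minimal sketch of this filter, the snippet below keeps only rows with a non-negative `EventId`. The toy frame and the names `df_toy`, `df_labelled` and `event_ids` are illustrative and do not appear in the notebook, which gets the same effect implicitly by looping over `range(df_fill['EventId'].max()+1)`.

```python
import pandas as pd

# Toy frame mirroring the EventId values shown in df_fill.head()
df_toy = pd.DataFrame({
    "ArticleTitle": ["a", "b", "c", "d", "e"],
    "EventId": [-1, 0, 1, -1, -1],
})

# Keep only rows that belong to a labelled event (EventId >= 0)
df_labelled = df_toy[df_toy["EventId"] >= 0]

# Event ids that would be processed, in ascending order
event_ids = sorted(df_labelled["EventId"].unique())
```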

## Extract the information from the articles

The task is basically a NER (named entity recognition) problem.

`spacy` is used again in this part, with its pre-trained `en_core_web_sm` model.

The information to extract:
- NewsTimeLength
- RelatedDate
- Person
- EventLocation
- EventSummary
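
The entity-based fields in this list are filled from spaCy entity labels in the extraction code of this notebook: `DATE` for `RelatedDate`, `PERSON` for `Person`, and `LOC` plus `GPE` for `EventLocation`. The sketch below prints spaCy's own short description of each label. The dictionary `field_labels` is an illustrative name, not part of the notebook.

```python
import spacy

# Entity labels used by extract_information, keyed by the output field they feed
field_labels = {
    "RelatedDate": ["DATE"],
    "Person": ["PERSON"],
    "EventLocation": ["LOC", "GPE"],
}

for field, labels in field_labels.items():
    for label in labels:
        # spacy.explain returns a short description of a label, or None if unknown
        print(f"{field:<14} {label:<7} {spacy.explain(label)}")
```

`NewsNumber` and the time-length field do not come from NER; they are computed from the article count and the published timestamps.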

In [145]:
import spacy
from tqdm import tqdm

In [41]:
nlp = spacy.load("en_core_web_sm") # load the pre-trained model

In [148]:
event_ents_list = []

for i in tqdm(range(df_fill['EventId'].max()+1)):
    event_ents = []
    df_event = df_fill[df_fill['EventId'] == i]
    for index, content in df_event.iterrows():
        doc_title = nlp(content['ArticleTitle'])
        doc_description = nlp(content['ArticleDescription'])

        for ent in doc_title.ents:
            event_ents.append((ent.text, ent.label_))
        for ent in doc_description.ents:
            event_ents.append((ent.text, ent.label_))
    event_ents_list.append(event_ents)
print("NER finished in this corpus.")
#     print(event_ents)

#     df_event_ents=df_event_ents.append(pd.Series(event_ents),ignore_index=True)
#     df_event_ents.iloc[i,'EventEnts']=event_ents

100%|██████████| 33/33 [00:04<00:00,  7.73it/s]

NER finished in this corpus.





In [173]:
def count_entity_frequency(ent_label, ents_list):
    '''
    Count how often each entity text occurs for a given entity label.
    args:
        ent_label: the spaCy entity label to count, e.g. 'PERSON'
        ents_list: list of (text, label) tuples extracted by the nlp model
    return:
        list of entity texts sorted by descending frequency; the most frequent is `out[0]`
    '''
    text_ent_frequency={}
    for (text,label) in ents_list:
        if label==ent_label:
            text_ent_frequency[text]= text_ent_frequency.get(text,0)+1
    
    return sorted(text_ent_frequency,key=text_ent_frequency.get,reverse=True)


def extract_information(df_event, event_id, ents_list):
    '''
    Use entity frequency to decide the key information of an event.
    For each entity type, the most frequent entity text is taken as the key information.

    `EventSummary` is not implemented yet and is left empty.
    '''
    info = pd.Series(
        index=['EventId', 'NewsNumber', 'NewsTimeLength', 'RelatedDate', 'Person', 'EventLocation', 'EventSummary'])

    info['EventId'] = event_id
    df_event['ArticlePublishedTime'] = pd.to_datetime(df_event['ArticlePublishedTime'], unit='s')
    info['NewsNumber'] = df_event.shape[0]
    info['NewsTimeLength'] = df_event['ArticlePublishedTime'].max()-df_event['ArticlePublishedTime'].min()
    try:
        # Take the most frequent entity of each type. The first missing type
        # raises IndexError, leaving that field and the ones after it unset (NaN).
        info['RelatedDate']=count_entity_frequency('DATE',ents_list)[0]
        info['Person']=count_entity_frequency('PERSON',ents_list)[0]
        info['EventLocation']=count_entity_frequency('LOC',ents_list)[0]+', '+count_entity_frequency('GPE',ents_list)[0]
        info['EventSummary']=''
    except IndexError:
        pass
    return info
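
Because all fields share one `try` block, a single missing entity type also blanks every field after it. A possible alternative, sketched here under the assumption that each field should fall back independently, is a helper that returns a default when a label has no entities. `top_entity` and the sample pairs are hypothetical and not part of the notebook.

```python
from collections import Counter

def top_entity(ent_label, ents_list, default=""):
    """Return the most frequent entity text for `ent_label`, or `default` if none."""
    counts = Counter(text for text, label in ents_list if label == ent_label)
    return counts.most_common(1)[0][0] if counts else default

# Made-up (text, label) pairs in the shape produced by the NER loop
ents = [("Joe Dale", "PERSON"), ("Joe Dale", "PERSON"),
        ("prom day", "DATE"), ("Manchester City", "GPE")]

person = top_entity("PERSON", ents)   # most frequent PERSON text
location = top_entity("LOC", ents)    # falls back to "" since no LOC entity exists
city = top_entity("GPE", ents)        # still filled even though LOC is missing
```

With this shape, a missing `LOC` no longer prevents `GPE` from being reported.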


In [151]:
info = pd.Series(
        index=['EventId', 'NewsNumber', 'NewsTimeLength', 'RelatedDate', 'Person', 'EventLocation', 'EventSummary'])
info['EventId']=1

In [174]:
columns = ['EventId', 'NewsNumber', 'NewsTimeLength', 'RelatedDate', 'Person', 'EventLocation', 'EventSummary']
# Collect one row per event and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
rows = []
for i in tqdm(range(df_fill['EventId'].max()+1)):
    info=extract_information(df_fill[df_fill['EventId']==i],i,event_ents_list[i])
    rows.append(info)
df_information = pd.DataFrame(rows, columns=columns)
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
100%|██████████| 33/33 [00:04<00:00,  7.12it/s]
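
The message above is pandas' `SettingWithCopyWarning`: `extract_information` assigns a new `ArticlePublishedTime` column to `df_event`, which is a filtered slice of `df_fill`. One way to avoid it, sketched here on a toy frame, is to take an explicit `.copy()` before converting the timestamps. The frame `df_toy` and the name `time_span` are illustrative; the timestamps are the ones shown for event `1` earlier in this notebook.

```python
import pandas as pd

# Two published timestamps of event 1 plus one unlabelled article
df_toy = pd.DataFrame({
    "ArticlePublishedTime": [1557653574, 1557654432, 1557653530],
    "EventId": [1, 1, -1],
})

# .copy() makes df_event an independent frame, so the assignment below
# does not write into a slice of df_toy
df_event = df_toy[df_toy["EventId"] == 1].copy()
df_event["ArticlePublishedTime"] = pd.to_datetime(df_event["ArticlePublishedTime"], unit="s")

# Time between the first and last article of the event
time_span = df_event["ArticlePublishedTime"].max() - df_event["ArticlePublishedTime"].min()
print(time_span)  # 0 days 00:14:18
```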


In [176]:
df_information.head()

| | EventId | NewsNumber | NewsTimeLength | RelatedDate | Person | EventLocation | EventSummary |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 03:27:17 | A decade | Lyft Earnings | | |
| 1 | 1.0 | 2.0 | 00:14:18 | prom day | Joe Dale's | | |
| 2 | 2.0 | 157.0 | 04:21:07 | today | Kick-off | Cross River, Manchester City | |
| 3 | 3.0 | 2.0 | 03:59:22 | | | | |
| 4 | 4.0 | 2.0 | 01:08:00 | today | Sport | | |


## Output the file 

In [178]:
df_information.to_csv("output/extracted_information_task2.csv",index=False)

## Summary

In this part:

NER: `spacy` is used again. The model is already pre-trained, so the entities come directly from calling its API.

Article summary: this part is not implemented. I haven't found a good approach for text summarization yet. One idea is to use the extracted entities to find the relations between them and then generate the summary, but I don't yet know how to implement it and need to read more on this.
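
As a placeholder until a real summarizer is in place, one very simple extractive heuristic is to reuse the article title that mentions the most key entities. This is not the relation-based idea described above, only a rough sketch. The function `pick_summary_title` and the sample titles are hypothetical.

```python
def pick_summary_title(titles, key_entities):
    """Return the title that mentions the most key entities (first one wins ties)."""
    def score(title):
        lowered = title.lower()
        # Count how many non-empty key entities occur in the title
        return sum(1 for ent in key_entities if ent and ent.lower() in lowered)
    return max(titles, key=score) if titles else ""

# Made-up titles and entities, for illustration only
titles = [
    "Boy dies on prom day",
    "Joe Dale dies on prom day after allergic reaction",
]
summary = pick_summary_title(titles, ["Joe Dale", "prom day"])
```

The result could be written into `EventSummary`, but it only selects an existing title and does not generate new text.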


(to be continued)