# Session 12b - Text Mining
## Analysing and summarising collections of text
### Named Entity Recognition

Having learned how to clean and simplify our text for processing, the next stage is to ask what our text is about. This workbook looks at extracting named entities from our text using spaCy.

In [None]:
import pandas as pd
import spacy

In [None]:
df = pd.read_csv('sample_news_large_phrased.csv', index_col='index')

In [None]:
df.head()

In [None]:
# converting this specific data's tokens column back to a list
df['tokens'] = df['tokens'].apply(lambda token_string: token_string.split('|*|'))
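
The split above relies on the tokens having been stored as a single string joined by `'|*|'`. A minimal sketch of that round trip, using made-up tokens rather than rows from our data, and assuming the file was written with the same delimiter:

```python
# Assumed delimiter, matching the split used in the cell above.
DELIMITER = '|*|'

# Made-up example tokens (not taken from the dataset).
tokens = ['new', 'york', 'judge']

# Joining turns the list into one string that fits in a single CSV cell.
token_string = DELIMITER.join(tokens)

# Splitting on the same delimiter recovers the original list.
restored = token_string.split(DELIMITER)

print(token_string)
print(restored)
```

An unusual delimiter like this is less likely to appear inside a token than a comma or a space, which is presumably why it was chosen.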

In [None]:
df.head()

## Top Entities (Named Entity Recognition)

Named entity recognition (NER) is the technique of extracting key entities from a piece of text, such as:
- people
- places
- organisations
- dates
- values
- currencies

spaCy's pipeline examines each token in context and uses a trained statistical model to predict which spans of text likely refer to particular types of entity, such as people, organisations or dates. It does not "look up" entities in a fixed reference list; it identifies them from contextual cues, so it can recognise names it has never seen before, but it can also make mistakes.


In [None]:
nlp = spacy.load('en_core_web_md')

In [None]:
trump = nlp("""A New York judge has ordered President Donald Trump to pay $2m (£1.6m)"""\
            """ for misusing funds from his charity to finance his 2016 political campaign."""\
            """ The Donald J Trump Foundation closed down in 2018. Prosecutors had accused it"""\
            """ of working as "little more than a chequebook" for Mr Trump's interests."""\
            """ Charities such as the one Mr Trump and his three eldest children headed cannot"""\
            """ engage in politics, the judge ruled.""")

# Source: https://www.bbc.co.uk/news/world-us-canada-50338231

In [None]:
# we can access the entities with the .ents attribute
trump.ents

In [None]:
# every entity in .ents has a text attribute and a label_ attribute that tells you what type of entity it is.

for entity in trump.ents:
    print(entity.text, entity.label_)
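
The label codes printed above are abbreviations, and some of them (for example `GPE`) are not self-explanatory. As a small sketch, `spacy.explain` can be used to look up a short description of each one. The list of labels below is our own illustrative choice; the labels a model actually produces depend on the model.

```python
import spacy

# Illustrative labels only; the exact set depends on the loaded model.
labels = ["PERSON", "ORG", "GPE", "DATE", "MONEY"]

# spacy.explain returns a short description string for a known label,
# or None if the term is not in spaCy's glossary.
label_descriptions = {label: spacy.explain(label) for label in labels}

for label, description in label_descriptions.items():
    print(f"{label}: {description}")
```

This does not need a model to be loaded, as the descriptions come from spaCy's built-in glossary.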

In [None]:
# as we're in Jupyter we can also use spaCy's built-in visualiser, displacy
from spacy import displacy

displacy.render(trump, style='ent', jupyter=True)

In [None]:
# if you want to save the annotated version of the
# text, you can write it to an HTML file using this function.

def save_displacy_to_html(doc, filename, style='ent'):
    # pass the caller's style through rather than always rendering entities
    html_data = spacy.displacy.render(doc, style=style, jupyter=False, page=True)
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(html_data)

save_displacy_to_html(trump, 'test.html', style='ent')

In [None]:
# let's create a function that extracts the unique entities of a given type from a text

def entity_extractor(nlp_doc, entity_type=None):
    if entity_type is None:
        ents = [ent.text for ent in nlp_doc.ents]
    else:
        ents = [ent.text for ent in nlp_doc.ents if ent.label_ == entity_type.upper()]
    unique = list(set(ents))
    return unique

In [None]:
entity_extractor(trump, 'person')

In [None]:
docs = nlp.pipe(df['text'])
people = [entity_extractor(doc,'person') for doc in docs]

In [None]:
df['people'] = people
df['people']

In [None]:
# people mentioned in the most articles (each name counts once per article)
df.explode('people')['people'].value_counts()[:10]
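
Because `entity_extractor` returns each name at most once per article, the counts above are article frequencies rather than raw mention counts. A minimal sketch of the difference, using invented names and only the standard library (not our dataframe):

```python
from collections import Counter

# Hypothetical entity mentions per article (invented for illustration).
mentions_per_article = [
    ["Ada", "Ada", "Grace"],
    ["Ada"],
    ["Grace", "Grace", "Grace"],
]

# Raw mention counts: every occurrence of a name is counted.
mention_counts = Counter(
    name for article in mentions_per_article for name in article
)

# Article frequency: each name counts once per article, mirroring the
# de-duplication that entity_extractor performs with set().
article_counts = Counter(
    name for article in mentions_per_article for name in set(article)
)

print(mention_counts.most_common())
print(article_counts.most_common())
```

Article frequency stops one long article that repeats a name many times from dominating the ranking, which is why the plots that follow label the axis as article frequency.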

In [None]:
# top ten people per group
for query,data in df.groupby('query'):
    print(f"****{query}****")
    print(data.explode('people')['people'].value_counts()[:10])
    print()

In [None]:
to_plot = df.explode('people').groupby('people', as_index=False).count().nlargest(10, 'title')
to_plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.barplot(data=to_plot, x='people', y='title')

fig = plt.gcf()
fig.set_size_inches(10, 5)

plt.xlabel('Person')
plt.ylabel('article frequency')
plt.title('Most mentioned people by article frequency')
plt.xticks(rotation=90)
plt.show()

In [None]:
# grouped by query
for query, data in df.groupby('query'):
    to_plot = data.explode('people').groupby('people', as_index=False).count().nlargest(10, 'title')
    
    sns.barplot(data=to_plot, x='people', y='title')

    fig = plt.gcf()
    fig.set_size_inches(10, 5)

    plt.xlabel('Person')
    plt.ylabel('article frequency')
    plt.title(f'{query}: Most mentioned people by article frequency')
    plt.xticks(rotation=90)
    plt.show()