# Introduction to Text Mining
## Part 1: Extracting Key Entities from Text

In [1]:
# For today we're going to use spacy
import spacy

SpaCy uses pre-trained models of language to do a lot of the tasks we need. To create our spacy text tool we need to load in a model. SpaCy has a [load of different models](https://spacy.io/usage/models) for different languages and different types of task.

We're going to use `en_core_web_md`, which means...
- en: English
- core: Can perform the core features of SpaCy but not some of the more specialised techniques.
- web: Trained on content from the web such as blogs, news and comments, making it suitable for similar content.
- md: Medium version. There are also small and large models. Small is trained just on web text data from 2013. Medium is trained on [petabytes of data from the contemporary internet](https://commoncrawl.org/big-picture/) and so is much more up to date in how it understands contemporary language use.

In [2]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [3]:
# we load in our model...
nlp = spacy.load('en_core_web_md')

In [4]:
# and create our new text object by wrapping it in our model

doc = nlp(test_phrase)
doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

### What just happened?

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spacy_pipe.png?raw=true" width="1000">
Our text has just gone on an incredible journey... or been mangled through a ton of processes. 

##### Tokenization
Breaks a string up into individual 'tokens', individual words, pieces of punctuation etc (we'll cover this more later).
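
As a rough illustration of the idea, the sketch below splits a string into word and punctuation tokens with a single regular expression from the standard library. This is not how SpaCy tokenizes: its tokenizer applies language-specific rules, so for example it splits "don't" into "do" and "n't". The `toy_tokenize` name and the pattern are assumptions made for this example.

```python
import re

def toy_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("I don't see my cat.")
print(tokens)
```

Note how the naive pattern breaks "don't" into three pieces, which is the kind of case a rule-based tokenizer is designed to handle better.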

##### Part-of-Speech Tagging (POS)
Tags up each token with a grammatical label such as Noun, Pronoun, Adjective, etc. This tagging is based on the word itself, and the context of each word, i.e. the position of that word in relation to other words, punctuation etc. POS tagging is complex and not directly important for our purposes, but it is fundamental to supporting later processes.

##### Dependency Parsing
Works out how words within each sentence are related to one another. For example in the sentence below...
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spacy_dependency.png?raw=true" width="700">
   - "big" is the *adjectival modifier (amod)* of "cat" as it modifies something about the "cat" object.
   - "home" is the *adverbial modifier (advmod)* of "ran" as it modifies the generic running, to running "home".
   
Dependency parsing is reliant on the preceding steps of tokenizing and POS tagging. Again, the information this step generates is not directly of use to us, but it is necessary for any processes that need to work out what words go together, or that cut up text into semantically meaningful chunks.

##### Named Entity Recognition (NER)
Uses all preceding steps to predict which tokens likely refer to particular types of entities like people, organisations, dates etc. It is not using any limited list or reference to "look up" these entities, but instead identifies them based on all the information generated in the preceding steps.

#### The value of SpaCy

In many Natural Language Processing libraries we would have to code all these steps ourselves, making sure that the output of every step is processed in the right way to fit the input of the next step. SpaCy has a pre-built pipeline that covers all of these steps and then wraps the result in a useful object called a SpaCy `Doc`.

Whilst our `doc` looks like a simple string, it is now a SpaCy `Doc` object, which has a whole range of associated attributes and methods.
See the SpaCy Documentation on the [Doc object](https://spacy.io/api/doc).

In [5]:
doc

I don't see my cat. He has a long tail, fluffy ears and big eyes! He also subscribes to Marxist historical materialism. It's just his way.

In [6]:
type(doc)

spacy.tokens.doc.Doc

### SpaCy Document Methods

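#### Check the Language
`doc.lang_` reports the language of the model that produced the `Doc`; it does not detect the language of the text itself.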

In [7]:
doc.lang_

'en'

#### Break Down into Sentences

In [8]:
list(doc.sents)

[I don't see my cat.,
 He has a long tail, fluffy ears and big eyes!,
 He also subscribes to Marxist historical materialism.,
 It's just his way.]

#### Compare texts for similarity

In [9]:
similar_phrase = nlp("I have my cat. He has a long tail, fluffy ears and big eyes!"\
                     " He subscribes to Neoliberal free market monthly. It's just his way.")

bio_abstract = nlp("Helicobacter pylori pathogenesis and disease outcomes"\
                        " are mediated by a complex interplay between bacterial"\
                         " virulence factors, host, and environmental factors."\
                         " After H. pylori enters the host stomach, four steps are"\
                         " critical for bacteria to establish successful colonization")
                        # extract from https://doi.org/10.1016/j.bj.2015.06.002

    
print(doc.similarity(similar_phrase))

print(doc.similarity(bio_abstract))

0.9850796595679977
0.678126115963195
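
By default SpaCy estimates similarity as the cosine similarity between the averaged word vectors of the two texts. The sketch below reproduces that calculation with the standard library only. The three-dimensional vectors in `toy_vectors` are invented for illustration and are far smaller than the vectors a real model ships with; the helper names are also assumptions.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vectors, invented for this example.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "tail": [0.8, 0.2, 0.1],
    "bacteria": [0.1, 0.9, 0.3],
    "stomach": [0.2, 0.8, 0.4],
}

pet_text = mean_vector([toy_vectors["cat"], toy_vectors["tail"]])
bio_text = mean_vector([toy_vectors["bacteria"], toy_vectors["stomach"]])

same = cosine_similarity(pet_text, pet_text)
different = cosine_similarity(pet_text, bio_text)
print(same, different)
```

Because the score is built from averaged vectors, two texts that share most of their words (like our two cat phrases) score very highly even when their meaning differs in the details.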


#### Break Down into Noun Chunks
Useful for examining phrases that might summarise the document.

In [10]:
#.noun_chunks will produce a generator. To force the generator to run we wrap it in a list.
list(doc.noun_chunks)

[I,
 my cat,
 He,
 a long tail,
 fluffy ears,
 big eyes,
 He,
 Marxist historical materialism,
 It,
 just his way]

In [11]:
list(bio_abstract.noun_chunks)

[Helicobacter pylori pathogenesis,
 disease outcomes,
 a complex interplay,
 bacterial virulence factors,
 host,
 environmental factors,
 H. pylori,
 the host stomach,
 four steps,
 bacteria,
 successful colonization]

#### Find the Entities in the text
Named entity recognition (NER) is the technique of extracting key entities within a piece of text, such as:
- people
- places
- organisations
- dates
- values
- currencies etc.

In [12]:
trump = nlp("""A New York judge has ordered President Donald Trump to pay $2m (£1.6m)"""\
            """ for misusing funds from his charity to finance his 2016 political campaign."""\
            """ The Donald J Trump Foundation closed down in 2018. Prosecutors had accused it"""\
            """ of working as "little more than a chequebook" for Mr Trump's interests."""\
            """ Charities such as the one Mr Trump and his three eldest children headed cannot"""\
            """ engage in politics, the judge ruled.""")

# Source: https://www.bbc.co.uk/news/world-us-canada-50338231

In [13]:
# list the entities (.ents is a tuple of entity spans, which we convert to a list)
trump_ents = list(trump.ents)
trump_ents

[New York,
 Donald Trump,
 $2m,
 £1.6m,
 2016,
 Trump Foundation,
 2018,
 Trump,
 one,
 Trump,
 three]

In [14]:
# every object in the entities list has a text attribute and a label attribute to tell you the type of entity it is.

for entity in trump_ents:
    print(entity.text, entity.label_)

New York GPE
Donald Trump PERSON
$2m MONEY
£1.6m MONEY
2016 DATE
Trump Foundation ORG
2018 DATE
Trump PERSON
one CARDINAL
Trump PERSON
three CARDINAL


In [15]:
# as we're in Jupyter we can also use SpaCy's built in visualiser

spacy.displacy.render(trump,style='ent', jupyter=True)

In [16]:
# if you want to save the annotated version of the
# text you can save to html using this function.

def save_displacy_to_html(doc, filename, style='ent'):
    # pass the style argument through rather than hard-coding 'ent'
    html_data = spacy.displacy.render(doc, style=style, jupyter=False, page=True)
    # 'w' creates the file, or overwrites it if it already exists
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(html_data)

save_displacy_to_html(trump, 'test.html', style='ent')

In [17]:
# let's create a function that can extract specific types of entities from a text

def entity_extractor(nlp_doc, entity_type):
    ents = list(nlp_doc.ents)
    ents_filtered = [ent.text for ent in ents if ent.label_ == entity_type]
    unique = list(set(ents_filtered))
    return unique

In [18]:
entity_extractor(trump, 'PERSON')

['Trump', 'Donald Trump']
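
Wrapping the results in a `set` removes duplicates, but it also discards how often each entity appeared and returns them in no particular order. If the counts matter, a variation along the following lines could be used. This is a sketch: `entity_counter` and the `ToyEntity` stand-in are names made up for this example, and the stand-in only mimics the `text` and `label_` attributes of a SpaCy entity so that the code runs without a model.

```python
from collections import Counter, namedtuple

# Stand-in for a SpaCy entity span: only the two attributes used below.
ToyEntity = namedtuple("ToyEntity", ["text", "label_"])

def entity_counter(ents, entity_type):
    """Count how often each entity text of the given type appears."""
    return Counter(ent.text for ent in ents if ent.label_ == entity_type)

# A few of the entities printed for the example article above.
toy_ents = [
    ToyEntity("New York", "GPE"),
    ToyEntity("Donald Trump", "PERSON"),
    ToyEntity("Trump Foundation", "ORG"),
    ToyEntity("Trump", "PERSON"),
    ToyEntity("Trump", "PERSON"),
]

person_counts = entity_counter(toy_ents, "PERSON")
print(person_counts.most_common())
```

With a real document you would pass its entities instead, for example `entity_counter(trump.ents, 'PERSON')`.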

#### Processing many documents
A collection of pieces of text is referred to as a *Corpus* in Natural Language Processing. The majority of the time you will be analysing a large corpus of text material such as a collection of...

- tweet texts
- forum posts
- documents from an archive

This means it is important that your code is able to process large numbers of documents simultaneously. Different Python libraries vary in how good they are at handling text at scale.

SpaCy is blazing fast if used correctly...

In [19]:
import pandas as pd

In [20]:
news_data = pd.read_pickle('news_data.pkl')

# examine the dataframe as usual, check the info
news_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 972 entries, 0 to 971
Data columns (total 6 columns):
uuid                 972 non-null object
query                972 non-null object
thread.title_full    972 non-null object
text                 972 non-null object
published            972 non-null object
thread.site          972 non-null object
dtypes: object(6)
memory usage: 45.7+ KB


In [21]:
# the head
news_data.head()

|  | uuid | query | thread.title_full | text | published | thread.site |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk |
| 1 | 0b63b531f4ca3eb5f2c6968163787c6e95c1b2f2 | brexit | Stock market news: October 16, 2019 | Stocks were off slightly as investors consider... | 2019-10-16T16:32:00.000+03:00 | yahoo.com |
| 2 | a7386fc9648e01a10d62b0b015a132329b349c86 | brexit | Brexit: What are the backstop options? - BBC News | Image copyright Getty Images A key part of the... | 2019-10-16T13:01:00.000+03:00 | bbc.co.uk |
| 3 | 755f1df8cd75db18c59eb397bada0daa9e24cbf8 | brexit | New IRA says border infrastructure would be ‘l... | Send Load more share options\nThe New IRA has ... | 2019-10-16T03:00:00.000+03:00 | channel4.com |
| 4 | d25f1165001889962c0f8370cded464737309468 | brexit | Brexit: 'No deal tonight', UK government sourc... | The issue of the Irish border - and how to han... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk |


In [22]:
# and in this case we can get a sense of what the stories are about by checking how many stories came from each query
news_data['query'].value_counts()

brexit            352
billionaire       278
Hong Kong         150
Tesla              98
alt-right          41
cryptocurrency     28
bitcoin            25
Name: query, dtype: int64

We can use `nlp.pipe` to process a list of documents all at once. This streamlines the process which speeds everything up.

We feed it our column of texts from the dataframe, and then we wrap the whole thing in a `list`, because `nlp.pipe` is a generator that will not produce anything until it is iterated over. Wrapping a generator in a list forces it to run and actually output the results.

The `%time` command is a special piece of Jupyter "magic" that tells us how long a command took to run.
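
The lazy behaviour of generators can be seen without SpaCy at all. In the sketch below, `toy_pipe` is a made-up generator function standing in for `nlp.pipe`, and the `calls` list records when each text is actually processed.

```python
calls = []  # records the moment each item is actually processed

def toy_pipe(texts):
    """A generator: the body runs only when something iterates over it."""
    for text in texts:
        calls.append(text)
        yield text.upper()

lazy = toy_pipe(["brexit", "tesla"])
calls_before = len(calls)   # no text has been processed at this point

results = list(lazy)        # iterating forces every item through the generator
calls_after = len(calls)    # both texts have now been processed

print(calls_before, calls_after, results)
```

Creating the generator does no work; only the `list(...)` call pulls every item through it, which is why the `list` wrapper is needed before the results can be stored in a column.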

In [23]:
# we can use nlp.pipe to process ALL the texts in our text column and assign the list of processed texts to a new variable
%time processed_docs = list(nlp.pipe(news_data['text'], n_process=-1))

CPU times: user 12 s, sys: 1.15 s, total: 13.1 s
Wall time: 1min 43s


In [24]:
# if we now assign this to a new column, because the list of processed_docs is in the same order as the dataframe, everything will line up
news_data['text_nlp'] = processed_docs

In [25]:
news_data.head()

|  | uuid | query | thread.title_full | text | published | thread.site | text_nlp |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... |
| 1 | 0b63b531f4ca3eb5f2c6968163787c6e95c1b2f2 | brexit | Stock market news: October 16, 2019 | Stocks were off slightly as investors consider... | 2019-10-16T16:32:00.000+03:00 | yahoo.com | (Stocks, were, off, slightly, as, investors, c... |
| 2 | a7386fc9648e01a10d62b0b015a132329b349c86 | brexit | Brexit: What are the backstop options? - BBC News | Image copyright Getty Images A key part of the... | 2019-10-16T13:01:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, A, key, part... |
| 3 | 755f1df8cd75db18c59eb397bada0daa9e24cbf8 | brexit | New IRA says border infrastructure would be ‘l... | Send Load more share options\nThe New IRA has ... | 2019-10-16T03:00:00.000+03:00 | channel4.com | (Send, Load, more, share, options, \n, The, Ne... |
| 4 | d25f1165001889962c0f8370cded464737309468 | brexit | Brexit: 'No deal tonight', UK government sourc... | The issue of the Irish border - and how to han... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (The, issue, of, the, Irish, border, -, and, h... |


In [26]:
# Remember the function we built earlier??
def entity_extractor(nlp_doc, entity_type):
    ents = list(nlp_doc.ents)
    ents_filtered = [ent.text for ent in ents if ent.label_ == entity_type]
    unique = list(set(ents_filtered))
    return unique

In [27]:
# We can now use this function on any value within our new text_nlp column

processed_row = news_data.loc[0, 'text_nlp']
entity_extractor(processed_row, entity_type='PERSON')

['Boris Johnson',
 'Johnson',
 'Deltapoll',
 'Brexit',
 'Dominic Bailey',
 'Panelbase',
 'David Brown',
 'Getty Images UK',
 'Opinium',
 'John Curtice',
 'Kantar']

### Understanding Pandas .apply (Advanced usage)

As the heavy lifting of the processing is already done we can do this on every row of our dataframe using Pandas `.apply`.

First we select the column that has our values we want to use and the `.apply` method built into the column object.

`news_data['text_nlp'].apply()`

However we need to tell `.apply` what function to apply to each row in our column...

`news_data['text_nlp'].apply(entity_extractor)`

Note that we do not add parentheses after the function name to call it; `.apply` will handle that part.

If our function only needed one argument the above would be fine as `.apply` presumes that it should feed the value of the row to the function as the function's first argument. In our case it would feed the value of the row to the function's `nlp_doc` argument, the first argument in the function.

However we have TWO arguments, `nlp_doc` and `entity_type`, both of which are necessary. If we provide the second to `.apply` as a keyword argument (a "named argument"), pandas will pass it on to our function for us, like so...
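
The same mechanism can be tried on a small Series first. In this sketch the data and the `count_longer_than` function are invented for illustration; the point is only that `min_length` is not an argument of `.apply` itself, so pandas forwards it to our function.

```python
import pandas as pd

def count_longer_than(words, min_length):
    """Count the words in a list that have more than min_length characters."""
    return sum(1 for word in words if len(word) > min_length)

toy_series = pd.Series([["Boris", "Johnson"], ["May"], []])

# Each row's value becomes the first argument (words);
# min_length is passed through by .apply as a keyword argument.
counts = toy_series.apply(count_longer_than, min_length=4)
print(counts)
```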


In [28]:
# Note that entity_type is an argument specified in our entity_extractor function, but we're passing it to apply.
# Apply will recognise that it doesn't use an argument called entity_type and instead feed it to the function we
# specified it should use. Apply is clever like that.

news_data['text_nlp'].apply(entity_extractor, entity_type='PERSON')

0      [Boris Johnson, Johnson, Deltapoll, Brexit, Do...
1                                     [Brendan McDermid]
2                           [Boris Johnson, Brexit, May]
3      [Alex Thomson, Ben De Pear, Theresa May, Brexi...
4      [Boris Johnson, Johnson, Tory Brexiteers, Stev...
                             ...                        
967                    [Pepe, Frog, Donald Trump, Wojak]
968    [Jonathan Turton, Turton, Violet Chachki, Jess...
969    [Fuentes, Nick Fuentes, Heather Heyer, Spencer...
970    [Nigel Farage, Mark Reckless, Boris Johnson, N...
971                                                   []
Name: text_nlp, Length: 972, dtype: object

In [29]:
# we can assign the result to a new column, 'entity_person', and
# let's also create the column 'entity_org' using the entity_type argument 'ORG'

news_data['entity_person'] = news_data['text_nlp'].apply(entity_extractor, entity_type='PERSON')
news_data['entity_org'] = news_data['text_nlp'].apply(entity_extractor, entity_type='ORG')


In [30]:
news_data['entity_person']

0      [Boris Johnson, Johnson, Deltapoll, Brexit, Do...
1                                     [Brendan McDermid]
2                           [Boris Johnson, Brexit, May]
3      [Alex Thomson, Ben De Pear, Theresa May, Brexi...
4      [Boris Johnson, Johnson, Tory Brexiteers, Stev...
                             ...                        
967                    [Pepe, Frog, Donald Trump, Wojak]
968    [Jonathan Turton, Turton, Violet Chachki, Jess...
969    [Fuentes, Nick Fuentes, Heather Heyer, Spencer...
970    [Nigel Farage, Mark Reckless, Boris Johnson, N...
971                                                   []
Name: entity_person, Length: 972, dtype: object

In [31]:
news_data['entity_org']

0      [BMG, 27%-28, the European Union, Leave, Strat...
1      [House, Dow, Senate, the New York Stock Exchan...
2      [Theresa May's, Twitter\n, GPS, Theresa, DUP, ...
3      [PSNI, the Police Service of Northern Ireland,...
4      [the European Research Group, the House of Com...
                             ...                        
967    [NGO, Victoria Police, EAD, the Anti-Defamatio...
968    [the Ryerson University, RT, Mélançon-Golden's...
969    [CNN, Twitter, Charlottesville, Guardian, Ex-B...
970    [Tice, https://guardianbookshop.com/decline-an...
971    [Paint, BMP, Caps, Snip & Sketch, Snipping Too...
Name: entity_org, Length: 972, dtype: object

In [32]:
# we can subset the data to just get the articles from the brexit query
brexit_data = news_data[news_data['query'] == 'brexit'].copy()

In [33]:
# we explode the data to transform our single column with rows of lists, to give each item in each list its own row

exploded_df = brexit_data.explode('entity_person')
exploded_df

|  | uuid | query | thread.title_full | text | published | thread.site | text_nlp | entity_person | entity_org |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... | Boris Johnson | [BMG, 27%-28, the European Union, Leave, Strat... |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... | Johnson | [BMG, 27%-28, the European Union, Leave, Strat... |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... | Deltapoll | [BMG, 27%-28, the European Union, Leave, Strat... |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... | Brexit | [BMG, 27%-28, the European Union, Leave, Strat... |
| 0 | e25fb36d29acc983cefbc8b61328b88dd4f7cffe | brexit | Have UK voters changed their minds on Brexit? ... | Image copyright Getty Images UK Prime Minister... | 2019-10-16T03:00:00.000+03:00 | bbc.co.uk | (Image, copyright, Getty, Images, UK, Prime, M... | Dominic Bailey | [BMG, 27%-28, the European Union, Leave, Strat... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 351 | 1faf83bfead571acc2fe725ad5a6d44c9dcee446 | brexit | General election 2019: Farage calls on Johnson... | Media playback is unsupported on your device M... | 2019-11-01T14:16:00.000+02:00 | bbc.co.uk | (Media, playback, is, unsupported, on, your, d... | pro-Brexit | [Plaid Cymru, Brexit Party, the Lib Dems, NHS,... |
| 351 | 1faf83bfead571acc2fe725ad5a6d44c9dcee446 | brexit | General election 2019: Farage calls on Johnson... | Media playback is unsupported on your device M... | 2019-11-01T14:16:00.000+02:00 | bbc.co.uk | (Media, playback, is, unsupported, on, your, d... | Farage | [Plaid Cymru, Brexit Party, the Lib Dems, NHS,... |
| 351 | 1faf83bfead571acc2fe725ad5a6d44c9dcee446 | brexit | General election 2019: Farage calls on Johnson... | Media playback is unsupported on your device M... | 2019-11-01T14:16:00.000+02:00 | bbc.co.uk | (Media, playback, is, unsupported, on, your, d... | Nigel | [Plaid Cymru, Brexit Party, the Lib Dems, NHS,... |
| 351 | 1faf83bfead571acc2fe725ad5a6d44c9dcee446 | brexit | General election 2019: Farage calls on Johnson... | Media playback is unsupported on your device M... | 2019-11-01T14:16:00.000+02:00 | bbc.co.uk | (Media, playback, is, unsupported, on, your, d... | John Curtice | [Plaid Cymru, Brexit Party, the Lib Dems, NHS,... |


In [34]:
# We can see overall the most mentioned 'persons' in the dataset - note that SpaCy thinks Brexit is a person...
# this might be an interesting indicator of the way Brexit is talked about in the press.

exploded_df['entity_person'].value_counts().head(10)

Brexit             228
Boris Johnson      177
Johnson            158
Jeremy Corbyn       81
Tory                51
Boris Johnson’s     44
Boris Johnson's     42
John Bercow         41
Donald Tusk         36
Donald Trump        36
Name: entity_person, dtype: int64

In [35]:
# we could see what the OVERALL most mentioned person entity is...

news_data.explode('entity_person')['entity_person'].value_counts()


# ...but this has issues...

Brexit               238
Boris Johnson        183
Donald Trump         175
Johnson              166
Trump                149
                    ... 
femenoid               1
Rick Rieder            1
Michael Dukakis        1
Andrew Stephenson      1
David Henderson        1
Name: entity_person, Length: 6705, dtype: int64

### Understanding Pandas .groupby (Advanced usage)

An issue with the above is that the top people mentioned in the dataset will be greatly determined by the number of stories gathered per query. Let's say our end goal is to find the top 5 persons mentioned, but this time for each query individually.

We could come up with various filters and loops to cut up the dataset and spit out the results, but `.groupby` is designed for just these kinds of cases.

Groupby works using the process of...

#### SPLIT > APPLY > COMBINE

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/groupby.png?raw=true" width="600">

- Split the Dataframe into separate dataframes based on the values in a particular column - usually some sort of category
- Apply a method or a function to each dataframe
- Combine those dataframes back into a single dataframe and return it...
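
The three stages listed above can be written out by hand on a small dataframe and compared with the one-line `.groupby` version. The data and the `n_people` column are invented for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["brexit", "brexit", "tesla"],
    "n_people": [3, 5, 2],
})

# Split: iterating over a groupby yields (group value, sub-dataframe) pairs.
# Apply: sum one column of each sub-dataframe.
# Combine: collect the per-group results into a single Series.
manual = pd.Series(
    {name: group["n_people"].sum() for name, group in toy.groupby("query")}
)

# The same three stages in one call.
built_in = toy.groupby("query")["n_people"].sum()

print(built_in)
```

Both versions give one total per query; the built-in form is shorter and leaves the looping to pandas.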


First we explode the dataframe on the `entity_person` column

`news_data.explode('entity_person')`

This will give us a dataframe where we convert our lists of persons in each row, to individual rows per person per article, as we did above. Then we move on to the groupby process.
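
To see what `.explode` does in isolation, here is a sketch on a three-row dataframe with invented values. Note the last row: an empty list is not dropped, it becomes a single row with a missing value.

```python
import pandas as pd

toy = pd.DataFrame({
    "uuid": ["a1", "b2", "c3"],
    "entity_person": [["Boris Johnson", "John Curtice"], ["Elon Musk"], []],
})

exploded_toy = toy.explode("entity_person")
print(exploded_toy)

# The row with an empty list is kept, with a missing value in place of a name.
n_missing = exploded_toy["entity_person"].isna().sum()
print(n_missing)
```

The original index is repeated for every item that came from the same row, which is why the exploded dataframes in this notebook show the same index several times.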

In the next step we `.groupby` the 'query' column. This means we have an object that is in effect 7 different dataframes, one for each query (the Split stage). This creates a `DataFrameGroupBy` object. We can't examine its contents yet because we haven't applied any transformations.

In [36]:
exploded_data = news_data.explode('entity_person')

In [37]:
groupby_query = exploded_data.groupby('query')
groupby_query

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x140038400>

At this point no transformation has been applied. The object simply holds our dataframe split into multiple dataframes, one per query, based on our groupby command.

Every command we give to this object will be applied to each of the 7 dataframes at the same time, and the results are then combined into a single object.
If we try `.value_counts()` it will give us the value counts for the `entity_person` column, for each dataframe separately.

In [38]:
groupby_query['entity_person'].value_counts()

query           entity_person
Hong Kong       Donald Trump     26
                Carrie Lam       22
                Trump            16
                Xi Jinping       12
                LeBron James     10
                                 ..
cryptocurrency  Xi                1
                Yang              1
                Zuck              1
                drukken Foto      1
                ’m                1
Name: entity_person, Length: 7485, dtype: int64

We have the value counts of how many times each PERSON entity was mentioned within each of the separate query dataframes. However, because it returns the entire set of results for each dataframe, we can't see much. Currently we're just looking at the top and bottom of the entire set of results: the very top of the first dataframe and the very end of the last dataframe. Everything in between is hidden because there is too much data.

HOWEVER!
If we take this result, and group it AGAIN by the 'query' column, we will be cutting up this set of results into individual dataframes again. We can then just ask for the top 5 results from each, by using `.head(5)`.

This will take the top 5 from each dataframe, and then combine them into a single dataframe and pass it back to us.
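
Here is the same pattern on a tiny invented dataset, small enough to check by eye. The sketch uses `level='query'` to make explicit that the second grouping is on the index level named 'query' rather than on a column, and asks for the top 1 instead of the top 5.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["brexit", "brexit", "brexit", "tesla", "tesla", "tesla"],
    "entity_person": ["Johnson", "Johnson", "Corbyn", "Musk", "Musk", "Tesla"],
})

# Counts per person within each query, sorted from most to least frequent.
counts = toy.groupby("query")["entity_person"].value_counts()

# Group the counts again by query and keep the first row of each group.
top_1 = counts.groupby(level="query").head(1)
print(top_1)
```

This works because `.value_counts()` already sorts each group in descending order, so the first rows of each group are its most frequent entities.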

In [39]:
groupby_query['entity_person'].value_counts().groupby('query').head(5)

query           entity_person                
Hong Kong       Donald Trump                      26
                Carrie Lam                        22
                Trump                             16
                Xi Jinping                        12
                LeBron James                      10
Tesla           Tesla                             27
                Elon Musk                         24
                Musk                              15
                Brexit                             6
                Gigafactory                        5
alt-right       Trump                             14
                Donald Trump                       9
                Richard Spencer                    9
                Breitbart                          7
                Donald Trump Jr.                   6
billionaire     Donald Trump                      97
                Trump                             88
                David Cay Johnston’s DCReport     73


To make life easier we can always create a function!

In [None]:
def top_entity(data, entity_col, groupby_col, top_n=5):
    exploded = data.explode(entity_col)
    g = exploded.groupby(groupby_col)
    result = g[entity_col].value_counts().groupby(groupby_col).head(top_n)
    return result

In [None]:
top_entity(data=news_data, entity_col='entity_person', groupby_col='query', top_n=5)

In [None]:
top_entity(data=news_data, entity_col='entity_org', groupby_col='query', top_n=5)

### From Entity Lists to Entity Networks

One analysis approach we can use is to build a network where each node is the name of an entity, and an edge exists between two entities if they co-occur in the same piece of text. For every co-occurrence the weight of their edge increases by 1.

This is relatively easy to do with a few tricks in Pandas and Networkx.
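
Before reaching for Pandas and Networkx, it may help to see the weighting rule spelled out with the standard library alone. The sketch below counts co-occurrences directly; the article ids and name lists in `toy_articles` are invented, and this is an illustration of the rule rather than the approach taken in the rest of this section.

```python
from collections import Counter
from itertools import combinations

# Hypothetical articles and the PERSON entities found in each.
toy_articles = {
    "article_1": ["Boris Johnson", "Jeremy Corbyn"],
    "article_2": ["Boris Johnson", "Jeremy Corbyn", "John Bercow"],
    "article_3": ["John Bercow"],
}

edge_weights = Counter()
for people in toy_articles.values():
    # sorted() gives every pair a consistent order so (a, b) and (b, a) match
    for pair in combinations(sorted(set(people)), 2):
        edge_weights[pair] += 1

print(edge_weights)
```

Each key is an edge between two entities and each value is the number of articles in which both appear.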

In [None]:
import networkx as nx

In [None]:
# Build an edge list where the source is just an id number for the article and the target is the name in the article.

# by exploding the entity column we get a dataframe where each row is repeated for every item in the entity list.
exploded = news_data.explode('entity_person')

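# we take a copy of just two columns, the newly exploded 'entity_person' column, and the uuid column which is a unique id
# assigned to each article. Articles with an empty entity list explode to a missing value (NaN),
# so we drop those rows to avoid adding a NaN node to the graph.

edge_list = exploded[['uuid', 'entity_person']].dropna().copy()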

#then we just rename the appropriate columns
edge_list = edge_list.rename(columns={'uuid':'source','entity_person':'target'})
edge_list

In [None]:
# create our graph from the edge list

G = nx.from_pandas_edgelist(edge_list)

We now have our graph which consists of nodes representing unique articles, and nodes representing people, with edges going between the articles and people if a person is mentioned in the article.

We could export the graph to a GEXF file and examine the result in a tool such as Gephi...

In [None]:
nx.write_gexf(G, 'article_person.gexf')

#### Bipartite Graphs

Our graph is a bipartite graph.

Bipartite graphs are networks where there are two types of nodes and where there are only edges between nodes of different types, never edges between nodes of the same type.

In our graph we have nodes representing articles, and nodes representing people. Currently there are only edges between people and the articles that they appear in, not edges between people, and not edges between articles.

##### Bipartite Projection
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/bipartite.png?raw=true" align="right" width="300">
Projection can be understood as the process of removing the 'intermediaries' between nodes, and connecting those nodes together directly. In the example image, Nodes A, B and C are not directly connected, but they do share intermediary nodes, 1, 2 and 3. The simple graph just represents that there is some form of connection, the multigraph demonstrates that...

- There are two shared connections between A and B (intermediary nodes 1 and 2).
- There are two shared connections between B and C (intermediary nodes 2 and 3).
- There is only one shared connection between A and C (intermediary node 2).


Image from https://arxiv.org/pdf/1909.10977.pdf
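
The example can be rebuilt in Networkx to check the weights. The edge list below is an assumption: it is chosen so that it matches the three bullet points above, and may not reproduce every detail of the image.

```python
import networkx as nx

B = nx.Graph()
letters = ["A", "B", "C"]

# Edges only run between a letter node and a numbered intermediary node.
B.add_edges_from([
    ("A", 1), ("A", 2),
    ("B", 1), ("B", 2), ("B", 3),
    ("C", 2), ("C", 3),
])

# Keep the letter nodes; each edge weight counts the shared intermediaries.
projected = nx.bipartite.weighted_projected_graph(B, nodes=letters)

for source, target, weight in projected.edges(data="weight"):
    print(source, target, weight)
```

The numbered nodes disappear from the projected graph, and what remains is one weighted edge per pair of letters that shared at least one intermediary.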

In [None]:
# now if we check we can see our graph is a "bipartite graph"

nx.bipartite.is_bipartite(G)

This confirms we can use projection techniques on this graph.
We can use the networkx function `nx.bipartite.weighted_projected_graph` to create our projection.

This function requires

- The graph you want to project from
- A list of the nodes you want to retain - in our case our list of people.

In [None]:
# we can create a list of unique nodes by taking the 'target' column of our edge 
# list (which remember was our list of PERSON entities) and using .unique()

nodes_to_keep = edge_list['target'].unique()
nodes_to_keep

In [None]:
# we create our final graph

projected_G = nx.bipartite.weighted_projected_graph(G,nodes=nodes_to_keep)

We now have a graph of just PERSON entity nodes, with edges weighted by the number of times those PERSON entities co-occurred in an article. It is likely that the majority of edges will have a weight of 1. We're interested to see if entities co-occur a lot so we could filter out these edges.
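
Rather than assuming that most edges have a weight of 1, we could check how the weights are distributed before filtering. The sketch below does this on a small hand-made graph (`toy_G`, with invented weights); on the real data the same two lines could be run on `projected_G`.

```python
from collections import Counter
import networkx as nx

toy_G = nx.Graph()
toy_G.add_weighted_edges_from([
    ("Boris Johnson", "Jeremy Corbyn", 3),
    ("Boris Johnson", "John Bercow", 1),
    ("Jeremy Corbyn", "John Bercow", 1),
    ("Donald Tusk", "Boris Johnson", 2),
])

# .edges(data="weight") yields (source, target, weight) triples
weight_counts = Counter(weight for _, _, weight in toy_G.edges(data="weight"))
share_weight_one = weight_counts[1] / toy_G.number_of_edges()

print(weight_counts)
print(share_weight_one)
```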

In [None]:
# current number of edges
projected_G.number_of_edges()

In [None]:
# let's identify all the edges that have a weight greater than 1
# we'll use this to filter our graph

edges_to_keep = []

# if we iterate over a graph using the .edges() method it gives us two values,
# the source and the target of the edge.

# for source, target in projected_G.edges():
#     Do something in this loop

# if we give .edges the argument data=True, it also gives us the attributes of each edge as a dictionary

for source, target, edge_attributes in projected_G.edges(data=True):
    if edge_attributes['weight'] > 1:
        edge = (source, target)
        edges_to_keep.append(edge)

In [None]:
# here is a quick view of some of our edges
edges_to_keep[1050:1100]

In [None]:
len(edges_to_keep)

In [None]:
# for the adventurous we could have achieved the same as above in a list comprehension,
# though it is not as readable

edges_to_keep = [(source,target) for source,target, edge_attributes in projected_G.edges(data=True) if edge_attributes['weight']>1]
len(edges_to_keep)

In [None]:
# we can easily filter the graph using the nx.edge_subgraph function
# this function requires the graph to be filtered, and a list of edges to keep.

filtered_G = nx.edge_subgraph(projected_G,edges_to_keep)

In [None]:
n_original_edges = projected_G.number_of_edges()
n_original_nodes = projected_G.number_of_nodes()

n_new_edges = filtered_G.number_of_edges()
n_new_nodes = filtered_G.number_of_nodes()

def pct_decrease(original,new):
    decrease = original - new
    return round((decrease/original)*100,2)



print(f"Original Graph - Nodes: {n_original_nodes}")
print(f"Original Graph - Edges: {n_original_edges}")
print(f"New Graph - Nodes: {n_new_nodes}")
print(f"New Graph - Edges: {n_new_edges}")

print(f"Graph Nodes decreased by: {pct_decrease(n_original_nodes, n_new_nodes)}%")
print(f"Graph Edges decreased by: {pct_decrease(n_original_edges, n_new_edges)}%")




In [None]:
# save the filtered graph to a gexf and take a look in Gephi

nx.write_gexf(filtered_G, 'entity_network.gexf')