# Introduction to Text Mining
## Part 1: Extracting Key Entities from Text

In [None]:
# For today we're going to use spaCy
import spacy


SpaCy uses pre-trained models of language to do a lot of the tasks we need. To create our spacy text tool we need to load in a model. SpaCy has a [load of different models](https://spacy.io/usage/models) for different languages and different types of task.

We're going to use `en_core_web_md`, which means...
- en: English
- core: Can perform the core features of SpaCy but not some of the more specialised techniques.
- web: trained on content from the web such as blogs, news and comments, making it suitable for similar content.
- md: medium version. There are also small and large models. Small is trained just on web text data from 2013. Medium also includes word vectors trained on [petabytes of data from the contemporary internet](https://commoncrawl.org/big-picture/) and so is much more up to date in how it understands contemporary language use.

In [None]:
test_phrase = "I don't see my cat. He has a long tail, fluffy ears and big eyes!"\
" He also subscribes to Marxist historical materialism. It's just his way."

In [None]:
# we load in our model...
nlp = spacy.load('en_core_web_md')

In [None]:
# and create our new text object by passing our text to the model

doc = nlp(test_phrase)
doc

### What just happened?

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spacy_pipe.png?raw=true" width="1000">
Our text has just gone on an incredible journey... or been mangled through a ton of processes. 

##### Tokenization
Breaks a string up into individual 'tokens': individual words, pieces of punctuation, etc. (we'll cover this more later).
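
As a quick illustration of tokenization on its own, the sketch below uses `spacy.blank("en")`, a bare English pipeline that contains only the tokenizer, so no trained model is needed. This is not the pipeline we load for the rest of the session, just a way to see the first step in isolation; the variable names are made up for the example.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no tagger, parser or NER).
tokenizer_only = spacy.blank("en")

example = tokenizer_only("I don't see my cat.")

# Each token is an object; .text gives us the underlying string.
tokens = [token.text for token in example]
print(tokens)
```

Notice that "don't" is split into two tokens and the full stop becomes a token of its own.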

##### Part-of-Speech Tagging (POS)
Tags up each token with a grammatical label such as Noun, Pronoun, Adjective, etc. This tagging is based on the word itself, and the context of each word, i.e. the position of that word in relation to other words, punctuation etc. POS tagging is complex and not directly important for our purposes, but it is fundamental to supporting later processes.

##### Dependency Parsing
Works out how words within each sentence are related to one another. For example in the sentence below...
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/spacy_dependency.png?raw=true" width="700">
   - "big" is the *adjectival modifier (amod)* of "cat" as it modifies something about the "cat" object.
   - "home" is the *adverbial modifier (advmod)* of "ran" as it modifies the generic running to running "home".

Dependency parsing relies on the preceding steps of tokenizing and POS tagging. Again, the information this step generates is not directly of use to us, but it is necessary for any processes that need to work out which words go together, or that cut up text into semantically meaningful chunks.

##### Named Entity Recognition (NER)
Uses all preceding steps to predict which tokens likely refer to particular types of entities like people, organisations, dates etc. It is not using any limited list or reference to "look up" these entities, but instead identifies them based on all the information generated in the preceding steps.

#### The value of SpaCy

In many Natural Language Processing libraries we would have to code all these steps ourselves, making sure that the output of every step is processed in the right way to fit the input of the next step. SpaCy has a pre-built pipeline that covers all of these steps and then wraps the result in a useful object called a SpaCy `Doc` (short for document).

Our `doc` looks like a simple string, but it is now a SpaCy `Doc` object, which has a whole range of associated attributes and methods.
See the SpaCy documentation on the [Doc object](https://spacy.io/api/doc).

In [None]:
doc

In [None]:
type(doc)

### SpaCy Document Methods

#### Detect Language

#### Break Down into Sentences

#### Compare texts for similarity

In [None]:
similar_phrase = nlp("I have my cat. He has a long tail, fluffy ears and big eyes!"\
                     " He subscribes to Neoliberal free market monthly. It's just his way.")

bio_abstract = nlp("Helicobacter pylori pathogenesis and disease outcomes"\
                        " are mediated by a complex interplay between bacterial"\
                         " virulence factors, host, and environmental factors."\
                         " After H. pylori enters the host stomach, four steps are"\
                         " critical for bacteria to establish successful colonization")
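# extract from https://doi.org/10.1016/j.bj.2015.06.002

# .similarity() returns a score; higher values mean the two texts are more alike
print(doc.similarity(similar_phrase))
print(doc.similarity(bio_abstract))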
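
By default, SpaCy's `.similarity()` method compares two documents using the cosine similarity of their vectors, where a document's vector is the average of its word vectors. The sketch below shows that calculation in plain Python on made-up three-dimensional vectors. The numbers are purely illustrative and do not come from any model, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 'document vectors' (real ones have hundreds of dimensions).
cat_text = [0.9, 0.1, 0.2]
similar_cat_text = [0.8, 0.2, 0.3]
biology_text = [0.1, 0.9, 0.1]

score_similar = cosine_similarity(cat_text, similar_cat_text)
score_different = cosine_similarity(cat_text, biology_text)

print(score_similar)    # close to 1: the vectors point in nearly the same direction
print(score_different)  # much lower: the vectors point in different directions
```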

    


#### Break Down into Noun Chunks
Useful for examining phrases that might summarise the document.

In [None]:
# .noun_chunks will produce a generator. To force the generator to run we wrap it in a list.
list(doc.noun_chunks)


#### Find the Entities in the text
Named entity recognition (NER) is the technique of extracting key entities from a piece of text, such as:
- people
- places
- organisations
- dates
- values
- currencies etc.

In [None]:
trump = nlp("""A New York judge has ordered President Donald Trump to pay $2m (£1.6m)"""\
            """ for misusing funds from his charity to finance his 2016 political campaign."""\
            """ The Donald J Trump Foundation closed down in 2018. Prosecutors had accused it"""\
            """ of working as "little more than a chequebook" for Mr Trump's interests."""\
            """ Charities such as the one Mr Trump and his three eldest children headed cannot"""\
            """ engage in politics, the judge ruled.""")

# Source: https://www.bbc.co.uk/news/world-us-canada-50338231

In [None]:
# list the entities (doc.ents is a tuple of entity spans, which we convert to a list)
trump_ents = list(trump.ents)
trump_ents

In [None]:
# every object in the entities list has a .text attribute and a .label_ attribute to tell you the type of entity it is.
for ent in trump_ents:
    print(ent.text, ent.label_)



In [None]:
# as we're in Jupyter we can also use SpaCy's built-in visualiser
spacy.displacy.render(trump, style='ent', jupyter=True)



In [None]:
# if you want to save the annotated version of the
# text you can save to html using this function.

def save_displacy_to_html(doc, filename, style='ent'):
    html_data = spacy.displacy.render(doc, style=style, jupyter=False, page=True)
    with open(filename, 'w+', encoding="utf-8") as f:
        f.write(html_data)

save_displacy_to_html(trump, 'test.html', style='ent')

In [None]:
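# let's create a function that can extract specific types of entities from a text

def entity_extractor(nlp_doc, entity_type):
    ents = list(nlp_doc.ents)
    ents_filtered = [ent.text for ent in ents if ent.label_ == entity_type]
    unique = list(set(ents_filtered))
    return unique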

In [None]:
entity_extractor(trump, 'PERSON')

#### Processing many documents
A collection of pieces of text is referred to as a *Corpus* in Natural Language Processing. The majority of the time you will be analysing a large corpus of text material such as a collection of...

- tweet texts
- forum posts
- documents from an archive

This means it is important that your code is able to process large numbers of documents simultaneously. Different Python libraries vary in how good they are at handling text at scale.

SpaCy is blazing fast if used correctly...

In [None]:
#import pandas
import pandas as pd

In [None]:
news_data = pd.read_pickle('news_data.pkl')

# examine the dataframe as usual, check the info
news_data.info()

In [None]:
# the head
news_data.head()

In [None]:
# and in this case we can get a sense of what the stories are about by checking how many stories came from each query
news_data['query'].value_counts()

We can use `nlp.pipe` to process a list of documents all at once. It works through the texts in batches rather than one at a time, which speeds everything up.

We feed it our column of texts from the dataframe, and then we wrap the whole thing in a `list`, because `nlp.pipe` is a generator that will not produce anything until it is iterated over. Wrapping a generator in a list forces it to run and actually output the results.

The `%time` command is a special piece of Jupyter "Magic" that tells us how long a command took to run.
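
The laziness described above is ordinary Python generator behaviour. The sketch below demonstrates it with `fake_pipe`, a hypothetical stand-in written for this example (it is not SpaCy and simply upper-cases each text).

```python
def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: yields one 'processed' text at a time."""
    for text in texts:
        print(f"processing: {text}")
        yield text.upper()

texts = ["first story", "second story"]

lazy = fake_pipe(texts)   # nothing is printed yet: no work has been done
processed = list(lazy)    # wrapping in list() forces every item to be processed
print(processed)
```

Once a generator has been consumed it is empty, which is another reason to store the results in a list straight away.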

In [None]:
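# we can use nlp.pipe to process ALL the texts in our text column and assign the list of processed texts to a new variable
# (this assumes the column holding the article text is named 'text'; adjust if yours differs)
%time processed_docs = list(nlp.pipe(news_data['text']))

In [None]:
# if we now assign this to a new column, because the list of processed_docs is in the same order as the dataframe, everything will line up
news_data['text_nlp'] = processed_docs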

In [None]:
news_data.head()

In [None]:
# Remember the function we built earlier??
def entity_extractor(nlp_doc, entity_type):
    ents = list(nlp_doc.ents)
    ents_filtered = [ent.text for ent in ents if ent.label_ == entity_type]
    unique = list(set(ents_filtered))
    return unique

In [None]:
# We can now use this function on any value within our new text_nlp column

# here we take the first row as an example
processed_row = news_data['text_nlp'].iloc[0]
entity_extractor(processed_row, entity_type='PERSON')

### Understanding Pandas .apply (Advanced usage)

As the heavy lifting of the processing is already done we can do this on every row of our dataframe using Pandas `.apply`.

First we select the column that has our values we want to use and the `.apply` method built into the column object.

`news_data['text_nlp'].apply()`

However we need to tell `.apply` what function to apply to each row in our column...

`news_data['text_nlp'].apply(entity_extractor)`

Note that we do not put parentheses after the function name to call it ourselves; `.apply` will handle that part.

If our function only needed one argument the above would be fine as `.apply` presumes that it should feed the value of the row to the function as the function's first argument. In our case it would feed the value of the row to the function's `nlp_doc` argument, the first argument in the function.

However we have TWO arguments, `nlp_doc` and `entity_type`, both of which are necessary. If we provide this information to the `.apply` function as a "named argument" pandas `.apply` will take the information and feed that into the function for us, like so...
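
To see the mechanism on its own, here is a minimal sketch using a toy two-argument function on a toy Series. The names `add_suffix`, `suffix` and `toy` are invented for this illustration and are not part of our dataset.

```python
import pandas as pd

def add_suffix(value, suffix):
    """Toy two-argument function: the row value arrives first, suffix by name."""
    return f"{value}{suffix}"

toy = pd.Series(["cat", "dog"])

# suffix is not an argument of .apply itself, so pandas forwards it to add_suffix
result = toy.apply(add_suffix, suffix="!")
print(result.tolist())
```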


In [None]:
# Note that entity_type is an argument specified in our entity_extractor function, but we're passing it to apply.
# Apply will recognise that it doesn't use an argument called entity_type and instead feed it to the function we
# specified it should use. Apply is clever like that.
news_data['text_nlp'].apply(entity_extractor, entity_type='PERSON')



In [None]:
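# we can assign the result to a new column, 'entity_person', and
# let's also create the column 'entity_org' using the entity_type argument 'ORG'

news_data['entity_person'] = news_data['text_nlp'].apply(entity_extractor, entity_type='PERSON')
news_data['entity_org'] = news_data['text_nlp'].apply(entity_extractor, entity_type='ORG')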


In [None]:
news_data['entity_person']

In [None]:
news_data['entity_org']

In [None]:
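# we can subset the data to just get the articles from the brexit query
# (this assumes the query value is the string 'brexit'; check news_data['query'].value_counts() for the exact label)
brexit_data = news_data[news_data['query'] == 'brexit']

In [None]:
# we explode the data to transform our single column with rows of lists, to give each item in each list its own row

exploded_df = brexit_data.explode('entity_person')
exploded_df

In [None]:
# We can see the most mentioned 'persons' in the brexit articles - note that SpaCy thinks brexit is a person....
# this might be an interesting indicator of the way brexit is talked about in the press.

exploded_df['entity_person'].value_counts()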


In [None]:
# we could see what the OVERALL most mentioned person entity is...

news_data.explode('entity_person')['entity_person'].value_counts()


# ...but this has issues...

### Understanding Pandas .groupby (Advanced usage)

An issue with the above is that the top people mentioned in the dataset will be greatly determined by the number of stories gathered per query. Let's say our end goal is to find the top 5 persons mentioned, but this time for each query individually.

We could come up with various filters and loops to cut up the dataset and spit out the results, but `.groupby` is designed for just these kinds of cases.

Groupby works using the process of...

#### SPLIT > APPLY > COMBINE

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/groupby.png?raw=true" width="600">

- Split the Dataframe into separate dataframes based on the values in a particular column - usually some sort of category
- Apply a method or a function to each dataframe
- Combine those dataframes back into a single dataframe and return it...
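
The three stages listed above can be sketched on a tiny made-up dataframe. The query labels and the `words` column are invented for the illustration and are not taken from `news_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["brexit", "brexit", "climate", "climate", "climate"],
    "words": [100, 300, 200, 400, 600],
})

grouped = toy.groupby("query")

# SPLIT: iterating over the groupby object shows one small dataframe per query
for name, group in grouped:
    print(name, len(group))

# APPLY + COMBINE: the mean is computed per group and returned as one Series
mean_words = grouped["words"].mean()
print(mean_words)
```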


First we explode the dataframe on the `entity_person` column

`news_data.explode('entity_person')`

This will give us a dataframe where we convert our lists of persons in each row, to individual rows per person per article, as we did above. Then we move on to the groupby process.
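
As a reminder of what `.explode` does, here is a minimal sketch on a two-row toy dataframe (the ids and names are made up).

```python
import pandas as pd

toy = pd.DataFrame({
    "uuid": ["a1", "b2"],
    "entity_person": [["Alice", "Bob"], ["Alice"]],
})

# each item in each list gets its own row; the other columns are repeated
exploded_toy = toy.explode("entity_person")
print(exploded_toy)
```

The first article mentioned two people, so it now occupies two rows, and its original index label is repeated for both.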

Next we `.groupby` the 'query' column. This gives us an object that is in effect 7 different dataframes, one for each query (the Split stage). This creates a `DataFrameGroupBy` object. We can't examine its contents yet because we haven't applied any transformations.

In [None]:
exploded_data = news_data.explode('entity_person')

In [None]:
groupby_query = exploded_data.groupby('query')
groupby_query

The groupby object has split our data into multiple dataframes based on our groupby command.
Once we apply a transformation to it, it returns a single object with the results from all 7 dataframes combined.


Every command we give to this object will be applied to each of the 7 dataframes at the same time.
If we select the `entity_person` column and try `.value_counts()`, it will give us the value counts for each dataframe separately.

We have the value counts of how many times each PERSON entity was mentioned within each of the separate query dataframes. However, because it returns the entire set of results for each dataframe, we can't see much. We only see the very top of the first dataframe and the very end of the last dataframe; everything in between is hidden because there is too much data.

HOWEVER!
If we take this result and group it AGAIN by 'query', we will be cutting up this set of results into individual dataframes again. Because `.value_counts()` sorts each group from most to least frequent, we can then just ask for the top 5 results from each, by using `.head(5)`.

This will take the top 5 from each dataframe, and then combine them into a single dataframe and pass it back to us.
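
Here is a small sketch of that chain on invented data, keeping only the single most mentioned person per query so the output is easy to check by eye. The query labels and names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["q1"] * 4 + ["q2"] * 3,
    "entity_person": ["Alice", "Alice", "Bob", "Carol", "Dave", "Dave", "Erin"],
})

# value_counts per query: a Series indexed by (query, entity_person)
counts = toy.groupby("query")["entity_person"].value_counts()

# group the counts by query again and keep only the top row of each group
top_1 = counts.groupby("query").head(1)
print(top_1)
```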

To make life easier we can always create a function!

In [None]:
def top_entity(data, entity_col, groupby_col, top_n=5):
    exploded = data.explode(entity_col)
    g = exploded.groupby(groupby_col)
    result = g[entity_col].value_counts().groupby(groupby_col).head(top_n)
    return result

In [None]:
top_entity(data=news_data, entity_col='entity_person', groupby_col='query', top_n=5)

In [None]:
top_entity(data=news_data, entity_col='entity_org', groupby_col='query', top_n=5)

### From Entity Lists to Entity Networks

One analysis approach we can use is to build a network where each node is the name of an entity, and an edge exists between two entities if they co-occur in the same piece of text. Each co-occurrence adds 1 to the weight of their edge.

This is relatively easy to do with a few tricks in Pandas and NetworkX.
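
Before reaching for Pandas and NetworkX, the counting idea itself can be sketched in plain Python. This is only an illustration of the "+1 per co-occurrence" rule on made-up data, not the method used in the rest of this section.

```python
from collections import Counter
from itertools import combinations

# Hypothetical articles, each reduced to its list of unique PERSON entities.
articles = [
    ["Alice", "Bob", "Carol"],
    ["Alice", "Bob"],
    ["Carol"],
]

edge_weights = Counter()
for people in articles:
    # sorted() so that (Alice, Bob) and (Bob, Alice) count as the same edge
    for pair in combinations(sorted(people), 2):
        edge_weights[pair] += 1

print(edge_weights)
```

Alice and Bob appear together in two articles, so their edge ends up with a weight of 2, while the article that mentions only Carol contributes no edges at all.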

In [None]:
import networkx as nx

In [None]:
# Build an edge list where the source is just an id number for the article and the target is the name in the article.

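# by exploding the entity column we get a dataframe where each row is repeated for every item in the entity list.
exploded = news_data.explode('entity_person')

# we take a copy of just two columns, the newly exploded 'entity_person' column, and the uuid column which is a unique id
# assigned to each article (articles with no PERSON entities explode to a missing value, so we drop those rows)

edge_list = exploded[['uuid', 'entity_person']].dropna().copy()

# then we just rename the appropriate columns
edge_list = edge_list.rename(columns={'uuid': 'source', 'entity_person': 'target'})
edge_list

In [None]:
# create our graph from the edge list
# (from_pandas_edgelist looks for columns named 'source' and 'target' by default)

G = nx.from_pandas_edgelist(edge_list)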

We now have our graph which consists of nodes representing unique articles, and nodes representing people, with edges going between the articles and people if a person is mentioned in the article.

We could examine this to see the result...

In [None]:
nx.write_gexf(G, 'article_person.gexf')

#### Bipartite Graphs

Our graph is a bipartite graph.

Bipartite graphs are networks where there are two types of nodes and where there are only edges between nodes of different types, never edges between nodes of the same type.

In our graph we have nodes representing articles, and nodes representing people. Currently there are only edges between people and the articles that they appear in, not edges between people, and not edges between articles.

##### Bipartite Projection
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/bipartite.png?raw=true" align="right" width="300">
Projection can be understood as the process of removing the 'intermediaries' between nodes, and connecting those nodes together directly. In the example image, Nodes A, B and C are not directly connected, but they do share intermediary nodes, 1, 2 and 3. The simple graph just represents that there is some form of connection, the multigraph demonstrates that...

- there are two shared connections between A and B (intermediary nodes 1 and 2).
- there are two shared connections between B and C (intermediary nodes 2 and 3).
- there is only one connection between A and C (intermediary node 2).


Image from https://arxiv.org/pdf/1909.10977.pdf
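
The projection described above can be reproduced with NetworkX on a toy graph. The edges below are inferred from the three bullet points (A shares 1 and 2 with B, and so on), so treat them as an assumption about the figure rather than an exact copy of it.

```python
import networkx as nx

# Toy bipartite graph: letters on one side, numbers on the other.
B = nx.Graph()
B.add_edges_from([
    ("A", 1), ("A", 2),
    ("B", 1), ("B", 2), ("B", 3),
    ("C", 2), ("C", 3),
])

# Project onto the letter nodes; each weight counts shared intermediaries.
P = nx.bipartite.weighted_projected_graph(B, ["A", "B", "C"])

for source, target, attrs in P.edges(data=True):
    print(source, target, attrs["weight"])
```

The numbered nodes disappear from the projected graph, and the number of intermediaries each pair shared survives as the edge weight.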

In [None]:
# now if we check we can see our graph is a "bipartite graph"

nx.bipartite.is_bipartite(G)

This confirms we can use projection techniques on this graph.
We can use the networkx function `nx.bipartite.weighted_projected_graph` to create our projection.

This function requires:

- The graph you want to project from
- A list of the nodes you want to retain - in our case our list of people.

In [None]:
# we can create a list of unique nodes by taking the 'target' column of our edge
# list (which remember was our list of PERSON entities) and using .unique()

nodes_to_keep = list(edge_list['target'].unique())
nodes_to_keep

In [None]:
# we create our final graph

projected_G = nx.bipartite.weighted_projected_graph(G, nodes_to_keep)

We now have a graph of just PERSON entity nodes, with edges weighted by the number of times those PERSON entities co-occurred in an article. It is likely that the majority of edges will have a weight of 1. We're interested to see if entities co-occur a lot so we could filter out these edges.

In [None]:
# current number of edges
projected_G.number_of_edges()

In [None]:
# lets identify all the edges that have a weight greater than 1
# we'll use this to filter our graph

edges_to_keep = []

# if we iterate over a graph using the .edges() method it gives us two values
# the source and the target of the edge.

# for source, target in projected_G.edges():
#     Do something in this loop
#
#
#

# if we give .edges the argument data=True, it also gives us the attributes of each edge as a dictionary

for source, target, edge_attributes in projected_G.edges(data=True):
    if edge_attributes['weight'] > 1:
        edges_to_keep.append((source, target))

In [None]:
# here is a quick view of some of our edges
edges_to_keep[1050:1100]

In [None]:
len(edges_to_keep)

In [None]:
# for the adventurous we could have achieved the same as above in a list comprehension,
# though it is not as readable

edges_to_keep = [(source,target) for source,target, edge_attributes in projected_G.edges(data=True) if edge_attributes['weight']>1]
len(edges_to_keep)

In [None]:
# we can easily filter the graph using the nx.edge_subgraph function
# this function requires the graph to be filtered, and a list of edges to keep.

filtered_G = nx.edge_subgraph(projected_G, edges_to_keep)

In [None]:
n_original_edges = projected_G.number_of_edges()
n_original_nodes = projected_G.number_of_nodes()

n_new_edges = filtered_G.number_of_edges()
n_new_nodes = filtered_G.number_of_nodes()

def pct_decrease(original,new):
    decrease = original - new
    return round((decrease/original)*100,2)



print(f"Original Graph - Nodes: {n_original_nodes}")
print(f"Original Graph - Edges: {n_original_edges}")
print(f"New Graph - Nodes: {n_new_nodes}")
print(f"New Graph - Edges: {n_new_edges}")

print(f"Graph Nodes decreased by: {pct_decrease(n_original_nodes, n_new_nodes)}%")
print(f"Graph Edges decreased by: {pct_decrease(n_original_edges, n_new_edges)}%")




In [None]:
# save the filtered graph to a gexf and take a look in Gephi

nx.write_gexf(filtered_G, 'entity_network.gexf')