In [None]:
! pip install transformers

In [None]:
# install Spacy and a language model

!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm


# IF YOU ARE USING A NEW APPLE (M1 CPU) COMPUTER use these lines instead.
# pip install -U pip setuptools wheel
# pip install -U 'spacy[apple]'
# python -m spacy download en_core_web_sm

# SC207 Text Mining
## Sentiment Analysis and Entity Recognition
### Using Pre-Trained Models for quick text insights

The two methods we're using today rely on pre-trained models to quickly pull apart and analyse pieces of text with a high degree of complexity. Trained on millions of examples of text from the internet, archives, books and more, these models go beyond simply looking at which words are used. They consider the placement of words, their immediate and distant context, their role within sentence structure and more, to make inferences about what the text says, what matters in the text, and what it could imply.

#### Tools
Today we're using two packages.
- [Transformers](https://pypi.org/project/transformers/): Allows us to quickly download a pre-trained model designed specifically for sentiment analysis using the HuggingFace 🤗 [AI model repository](https://huggingface.co/)
- [SpaCy](https://spacy.io/): A natural language processing package that relies on its own pre-trained models to provide a large set of text analysis features. Today we'll be using its powerful entity recognition system.

In [None]:
import pandas as pd
from transformers import pipeline
import spacy
import seaborn as sns

# Sentiment Analysis

Sentiment analysis is a tricky area to get right. Prior to pre-trained models, sentiment was determined by matching specific words against a predefined table that gave each word a score depending on how positive or negative the designers felt the word was. Whilst this worked for simple text, sentiment is often context dependent, can be inverted by sarcasm, and changes over time as language evolves. Pre-trained models, which are regularly updated, are instead shown examples of text that human annotators have labelled as either positive or negative, and are then tested to see if they can accurately predict the label a human would give a brand new piece of text.
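To make the older word-table approach concrete, here is a minimal sketch. The `LEXICON` table and its scores are invented for illustration and are not taken from any real sentiment lexicon; `lexicon_sentiment` is a hypothetical helper.

```python
import re

# Hypothetical word scores for illustration; real lexicons hold thousands of entries.
LEXICON = {"love": 2, "brilliant": 2, "happy": 1, "angry": -2}


def lexicon_sentiment(text):
    """Add up the score of every known word; unknown words count as zero."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


# Simple text behaves as expected
print(lexicon_sentiment("I love every brilliant thing right now. Super happy"))

# "not" is not in the table, so the negation is ignored and the text still scores positive
print(lexicon_sentiment("I am not happy"))
```

The second call shows the weakness described above: the table has no notion of context, so a negated phrase is scored as if it were positive.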

We can initialise one of those super complex incredibly difficult to build models, but it will take every bit of our coding skills to do so...

In [None]:
get_sentiment = pipeline('sentiment-analysis')

Done.

In [None]:
get_sentiment("I love every brilliant thing right now. Super happy")

In [None]:
get_sentiment("My name is James")

In [None]:
get_sentiment("I am very angry")

It is worth noting two things...
1. Text can ONLY be positive or negative under this model; there is no neutral label.
2. The score does not indicate strength of sentiment. It indicates how confident the model is in its prediction. We'll look at this more later.
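A sentiment pipeline returns one dictionary per text, holding a `label` and a `score`. The sketch below shows one way to read the score as confidence rather than strength. The helper `describe_prediction`, the 0.9 threshold, and the example values are all hypothetical; the values are hand-written and are not real model output.

```python
def describe_prediction(result, threshold=0.9):
    """Summarise one prediction, flagging it when the model is less sure.

    `result` is a dict with 'label' and 'score' keys. The score is treated
    as the model's confidence in the label, not as how strongly positive
    or negative the text is.
    """
    certainty = "confident" if result["score"] >= threshold else "uncertain"
    return f"{result['label']} ({certainty}, score={result['score']:.2f})"


# Hand-written example predictions (illustrative values only)
example_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.62},
]

for result in example_results:
    print(describe_prediction(result))
```

Both examples carry the same label, but only the first would be treated as a reliable prediction under this threshold.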

## Applying it to a whole dataset

In [None]:
def flatten_nested_dicts(text_data):
    dicts = text_data.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened

We will use our Twitter dataset along with the community assignments we generated using NetworkX. We can use these later to examine whether sentiment differs between the groups in our retweet network.
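Combining the two sources comes down to a join between the tweet table and the community table. The sketch below uses small made-up DataFrames; the column names `user`, `text` and `community` are assumptions for illustration and may not match the real files.

```python
import pandas as pd

# Hypothetical stand-ins for the tweet data and the community assignments
tweets = pd.DataFrame({
    "user": ["ana", "ben", "cat"],
    "text": ["great news", "awful day", "hello"],
})
communities = pd.DataFrame({"community": [0, 1]}, index=["ana", "ben"])

# Match each tweet's author against the index of the community table.
# An inner join keeps only tweets whose author has a community assignment.
merged = tweets.merge(communities, left_on="user", right_index=True, how="inner")
print(merged)
```

With an inner join, the tweet by `cat` is dropped because that user has no community. Using `how="left"` instead would keep it with a missing community value.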

In [None]:
tweets = pd.read_pickle('example_twitter_data.pkl')

In [None]:
tweets =  # remove retweets
tweets =

communities = pd.read_csv('communities.csv', index_col=0)
communities.head()

In [None]:
# Merging together tweet data and community assignments
tweets =
tweets.shape

In [None]:
sample =
sample =

In [None]:
sample['sentiment'] =
sample

In [None]:
label_score =
sample =

sample['community'] =

sample.head()

## Visuals 
### Distribution of Sentiment

In [None]:
#sentiment counts

In [None]:
#countplot

In [None]:
#community/sentiment counts

In [None]:
#countplot community grouped

In [None]:
order =
#countplot community grouped with ordering


### Score Confidence

In [None]:
#describe community scores

In [None]:
#community confidence score boxplots

In [None]:
confidence_data =
confidence_data

In [None]:

#confidence heatmap

# Named Entity Recognition

Named entity recognition (NER) is the technique of extracting key entities from a piece of text, such as:
- people
- places
- organisations
- dates
- values
- currencies etc.

SpaCy's processing examines each word in context and uses this to predict which tokens likely refer to particular types of entities such as people, organisations and dates. It does not rely on a fixed list or reference to "look up" these entities; instead it identifies them from contextual cues.
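Each entity SpaCy finds exposes a `text` attribute and a `label_` attribute (for example `PERSON`, `GPE` or `DATE`), and a processed document lists them under `.ents`. The sketch below filters entities by type. The helper `entities_of_type` is hypothetical, and the document is a hand-built stand-in with hand-written labels so the example runs without downloading a model; it is not real model output.

```python
from types import SimpleNamespace


def entities_of_type(doc, ent_type=None):
    """Return entity texts from `doc`, optionally filtered by entity type.

    The type is matched case-insensitively, so 'person' and 'PERSON' behave the same.
    """
    if ent_type is None:
        return [ent.text for ent in doc.ents]
    wanted = ent_type.upper()
    return [ent.text for ent in doc.ents if ent.label_ == wanted]


# Stand-in for a processed document: only the attributes used above are provided
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Donald Trump", label_="PERSON"),
    SimpleNamespace(text="New York", label_="GPE"),
    SimpleNamespace(text="2016", label_="DATE"),
])

print(entities_of_type(fake_doc))
print(entities_of_type(fake_doc, "person"))
```

Because the helper only touches `.ents`, `.text` and `.label_`, it should work unchanged on a real document returned by a loaded SpaCy model.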


In [None]:
text_data = pd.read_csv('sample_news_large.csv')

In [None]:
text_data.head()

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
trump = nlp("""A New York judge has ordered President Donald Trump to pay $2m (£1.6m)"""\
            """ for misusing funds from his charity to finance his 2016 political campaign."""\
            """ The Donald J Trump Foundation closed down in 2018. Prosecutors had accused it"""\
            """ of working as "little more than a chequebook" for Mr Trump's interests."""\
            """ Charities such as the one Mr Trump and his three eldest children headed cannot"""\
            """ engage in politics, the judge ruled.""")

# Source: https://www.bbc.co.uk/news/world-us-canada-50338231

In [None]:
# we can access the entities with the .ents attribute
trump.ents

In [None]:
# every object in the entities list has a text attribute and a label_ attribute to tell you the type of entity it is.
[(ent.text, ent.label_) for ent in trump.ents]


In [None]:
# as we're in Jupyter we can also use SpaCy's built in visualiser
spacy.displacy.render(trump, style='ent', jupyter=True)


In [None]:
# if you want to save the annotated version of the
# text you can save to html using this function.

def save_displacy_to_html(doc, filename, style='ent'):
    # pass the caller's style through instead of always rendering entities
    html_data = spacy.displacy.render(doc, style=style, jupyter=False, page=True)
    with open(filename, 'w+', encoding="utf-8") as f:
        f.write(html_data)

save_displacy_to_html(trump, 'test.html', style='ent')

In [None]:
# let's create a function that can extract specific types of entities from a text



In [None]:
entity_extractor(trump)

In [None]:
entity_extractor(trump, 'person')

In [None]:
docs =
people =

In [None]:
#assign to people column

In [None]:
people_data =
people_data


In [None]:
# most mentioned people
people_data['people'].value_counts()[:10]

In [None]:
# top ten people per group


In [None]:
top_people =
top_people


In [None]:
#most mentioned bar plot

In [None]:
#Most mentioned per query barplots


## Thinking of Entities Relationally
One of the skills we developed was using NetworkX to start thinking relationally about data, and to apply measures and metrics without relying purely on visualisation (though visualisation helps!).

We can intersect text analysis with network analysis by mapping the co-occurrence of different people within news stories, on the assumption that people whose names co-occur often are probably connected in some way.
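The idea of co-occurrence can be sketched without any network library: for each article, count every pair of people mentioned together. The article names and people below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of article -> people mentioned in it
articles = {
    "article_1": ["Alice", "Bob", "Carol"],
    "article_2": ["Alice", "Bob"],
    "article_3": ["Carol"],
}

pair_counts = Counter()
for people in articles.values():
    # sorted(set(...)) removes repeat mentions and gives each pair a stable order
    for pair in combinations(sorted(set(people)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())
```

Alice and Bob appear together in two articles, so their pair has the highest count. An article mentioning a single person contributes no pairs at all.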

In [None]:
import networkx as nx
from networkx.algorithms import bipartite

import netwulf as nw

In [None]:
def filter_by_degree(G, minimum_degree):
    scores = G.degree()
    to_keep = [node for node,degree in scores if degree >= minimum_degree]
    return G.subgraph(to_keep)

# Sets the size attribute of our graph to whatever scores are passed in
def size_by(G,scores):
    nx.set_node_attributes(G,scores, name='size')
    return G

In [None]:
people_data.head()

In [None]:
edge_list =
edge_list

In [None]:
# rename
# reset index
edge_list

In [None]:
G =

In [None]:
G.number_of_edges()

In [None]:
G.number_of_nodes()

In [None]:
G.edges

In [None]:
# check bipartite
bipartite.is_bipartite(G)

### Bipartite?
<img src='https://github.com/Minyall/sc207_2022-23/blob/master/images/bipartite_projection.png?raw=true' align="right" height="200">

A bipartite graph is one where the nodes fall into two sets, and edges only ever connect a node in one set to a node in the other. So in our case, there are only edges between article nodes and person nodes, never article to article, nor person to person.
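A small check of that definition, using made-up article and person nodes: the graph stays bipartite while edges only run between the two sets, and stops being bipartite as soon as two people are linked directly.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical article-to-person edges
G = nx.Graph()
G.add_edges_from([
    ("article_1", "Alice"),
    ("article_1", "Bob"),
    ("article_2", "Alice"),
])
before = bipartite.is_bipartite(G)
print(before)

# A direct person-to-person edge closes a triangle, which breaks the two-set rule
G.add_edge("Alice", "Bob")
after = bipartite.is_bipartite(G)
print(after)
```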

However, we want a person-to-person graph that still retains the information on how often people co-occur, which comes from their common connections to article nodes. The solution is to create a 'projection' of the original graph that links people who share an article and weights each link by how many articles they share.
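A minimal sketch of such a projection on made-up data, using NetworkX's `bipartite.weighted_projected_graph`. The node names are hypothetical, and the cut-off for dropping weak edges is an assumption chosen for illustration.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical article-to-person edges
G = nx.Graph()
G.add_edges_from([
    ("article_1", "Alice"), ("article_1", "Bob"), ("article_1", "Carol"),
    ("article_2", "Alice"), ("article_2", "Bob"),
])

# Keep only the person nodes; each edge weight is the number of shared articles
people = ["Alice", "Bob", "Carol"]
people_G = bipartite.weighted_projected_graph(G, people)
print(list(people_G.edges(data="weight")))

# Drop pairs that co-occur in a single article only
weak_edges = [(u, v) for u, v, w in people_G.edges(data="weight") if w < 2]
people_G.remove_edges_from(weak_edges)
print(list(people_G.edges(data="weight")))
```

After the weak edges are removed, only the Alice and Bob pair remains, and Carol is left as an isolated node that a degree filter could then remove.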

In [None]:
#First we choose which nodes we want to 'keep' after the projection
keep_nodes =
keep_nodes

In [None]:
# Now we get our projected Graph

people_G = bipartite.weighted_projected_graph(G, keep_nodes)

In [None]:
people_G.edges(data=True)

In [None]:
# Now we find all edges where there was only 1 co-occurrence and drop them

drop_edges = [(u, v) for u, v, w in people_G.edges(data='weight') if w == 1]
drop_edges

In [None]:
people_G.remove_edges_from(drop_edges)  # remove edges

In [None]:
people_G.number_of_edges()

In [None]:
people_G.number_of_edges()

In [None]:
# Now we remove any nodes that have a degree of less than 1, i.e. no connections at all.

people_G = filter_by_degree(people_G, 1)  # filter degree
people_G.number_of_nodes()

In [None]:
# Here we create a degree score that accounts for the edge weights of any edges connected to a node
# we use our size-by function to quickly change the sizing before visualisation
degree_scores = dict(people_G.degree())
weighted_degree_scores = dict(people_G.degree(weight='weight'))

In [None]:
# People who co-occur with the highest number of unique people


In [None]:
# People with the most co-occurrences overall


In [None]:
# Highest co-occurring pairs


In [None]:
# The people most likely to be at the centre of the stories: they relationally connect others together.
betweenness_scores = nx.betweenness_centrality(people_G)


In [None]:
# Size by the betweenness centrality and see if we're right with a visual
people_G = size_by(people_G, betweenness_scores)

stylized_network, config = nw.visualize(people_G)

In [None]:
nw.draw_netwulf(stylized_network, figsize=10)