# Available Variables or Insights

## Added by Text Analysis
- Topic per text
- Sentiment polarity per text
- Sentiment classification
- Entities in the text

## Added by Network Analysis
- Centrality Measures of entity significance
- Relations between entities
- Entity Community 

## Method Driven Questions
I have a method, what can it do for me?
- What are the major topics of these articles?
- What are the major topics of this band's lyrics?
- Who is mentioned in these articles?
- What are the major and minor figures of these stories?
- What figures and organisations are commonly mentioned together?
- What are the sentiments of these documents?
- What are the most significant words of a collection of texts?



# Thinking in building blocks
Generally in social science research you are encouraged to be *question first*. This means you should be choosing data, methods and the frameworks you use to interpret those methods all in service of answering the question. The point of this module is to open up the range of possible sources of data, and methods that you have so you have a bigger toolkit to answer a greater variety of questions.

For the purposes of this course, your methods and sources of data are limited, but even in that limited range there are many options directly taught to you and other options that can come from *combining* those methods.


## Building Blocks
Think of your project as a pipeline with different stages of processing, analysis and interpretation. What goes into the pipeline at the beginning has a direct impact on what kinds of analysis make sense, how effective that analysis will be, and what kinds of interpretations you can draw from it. Each step in the pipeline has knock-on effects on what follows. Planning and being reflective about what it is you are *actually doing* in each step is critical to good research and solid outcomes.

- The data
- The primary method of analysis
- The secondary method of analysis for feature generation and/or greater insight (optional)
- The ordering or segmenting of data for comparison and/or closer analysis



### The Data
- The data is the foundation of your project: it starts your pipeline and has a knock-on effect on every other stage. From the very start, the data you collect, the features it has, and what the collection process included or excluded (for example, through the way you wrote your query to an API) will limit the range of possible questions and interpretations, and the applicability of the techniques you're using.

- You're broadly limited to two sources from the course, Guardian news data, and Lyrics data. Other options are permissible but you need to talk to me and get my approval first. My goal is a good grade for you so if I think you're risking that, I will deny your request.

- Start with what you're interested in. You will find analysis easier and motivation higher if you're personally interested.

- Consider the most appropriate source at your disposal to fit that interest. Generally content from The Guardian, or music lyrics are your options.



#### Choosing the *right* data

- Now for the tricky part: what do you need from those sources to best address your question?

> Example: You want to study crime, so you query 'crime' from the Guardian API and get as many articles using the word 'crime' as possible. You now have articles using the word 'crime' from the news, opinion, arts, and sports pillars across the entire Guardian archive's date range. Some of those articles are about crimes; some are reviews of plays where they declare that the obscurity of the lead actor is 'an absolute crime!'. Your dataset is truly massive, one of the biggest datasets of all time. Nobody's ever seen a dataset this big before, and it's going to be beautiful.
    - What kinds of questions could you actually answer with this dataset? How?

> Example: You want to study crime. You download all the lyrics of songs that you remember talk about doing crime along with their release dates and artist names.
    - What kinds of questions could you actually answer with this dataset? How?

> Example: You want to study instances of mass shootings over time. You search the Guardian website first and read some stories that look relevant to identify terms and phrases that may help you narrow it down. You use the Guardian API to search those terms and phrases using the OR and AND operators such that you receive a relatively small but targeted dataset. You specify that material should only come from the News pillar, and you set a date range to limit it to the last twenty years.
    - What kinds of questions could you actually answer with this dataset? How?
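
The targeted query in the third example can be sketched as a request URL. This is a minimal sketch, assuming the Guardian Content API `search` endpoint and its `q`, `from-date`, `to-date`, `page-size`, `show-fields` and `api-key` parameters. The search terms, the dates, the `build_query` helper and the `YOUR_API_KEY` placeholder are all made up for illustration. The pillar restriction is left to a later filter on the `pillarName` column, which is one assumed way of doing it rather than the only way.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://content.guardianapis.com/search"

def build_query(terms_any, terms_all, from_date, to_date, api_key):
    """Combine search phrases with OR / AND and return a full request URL."""
    any_part = " OR ".join(f'"{t}"' for t in terms_any)    # match any of these phrases
    all_part = " AND ".join(f'"{t}"' for t in terms_all)   # and every one of these
    q = f"({any_part}) AND {all_part}" if all_part else f"({any_part})"
    params = {
        "q": q,
        "from-date": from_date,   # YYYY-MM-DD
        "to-date": to_date,
        "page-size": 50,
        "show-fields": "body,byline,wordcount",
        "api-key": api_key,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"

url = build_query(
    terms_any=["mass shooting", "gunman"],   # hypothetical terms found by reading stories first
    terms_all=["victims"],
    from_date="2005-01-01",
    to_date="2025-01-01",
    api_key="YOUR_API_KEY",                  # placeholder, not a real key
)
print(url)
```

Once the results are in a dataframe, something like `articles[articles['pillarName'] == 'News']` would keep only the News pillar.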

### Greatness in, Greatness out
- The data you start with has a major impact on every other step of your project, so it is important to really think through what you need to collect and how. Whilst getting a good sample size is important, a more carefully targeted or selected dataset with a good rationale makes it easier to understand why you're applying specific analysis techniques, and what the results tell you.



##  Analysis techniques
The techniques we have learned are:
- Document summarisation using TFIDF word significance
- Document similarity using TFIDF vectors
- Document similarity using embeddings i.e. Topic Modelling
- Sentiment analysis
- Entity recognition to extract names and organisations
- Network analysis for entity significance, and entity communities.

Some of these techniques offer a wide range of additional analysis options. For example, topic modelling provides a wide array of additional interpretations of the data around the topics identified such as topics over time, topic clustering and comparison of topic to class (such as newspaper section). 

Some are more direct and specific, offering you one form of insight such as sentiment scores or most representative words. 

The network techniques rely on there being a relation between things, so they move away from the 'document' as your main unit of analysis and focus instead on relations within documents (though we'll see there are ways to map those relational insights back to 'per document').

The choice of technique should balance your own confidence and comfort with the technique, the kind of data you are using and the kinds of insights you want to get out of it.

### Single Dataset Style Questions 
The methods we've learned suggest certain types of questions:
- What are the major topics of these articles?
- What are the major topics of this band's lyrics?
- Who is mentioned in these articles?
- What are the major and minor figures of these stories?
- What figures and organisations are commonly mentioned together?
- What are the sentiments of these documents?
- What are the most significant words of these texts?

These can be reasonable questions, but there may be more insight to be gained by splitting, ordering or grouping your data such that you can make a comparison, or see trends. For example:

### Segmented or Ordered Dataset Style Questions
- What are the major topics of these articles over time?
- What is the average sentiment of a band's lyrics, over time?
- Who are the major and minor figures in these stories, by section?
- What are the most significant words of these different groups of texts?
- Which of these topics have the highest/lowest wordcount?
- Which entities get the highest wordcount?
- Which tags correlate with which topics?

> Variables for segmenting and ordering available from the source

> #### From the Guardian API
> - Pillar and Section (Generally exclusive)
> - Type of publication
> - Publication Date
> - Article Tags (Multiple)
> - Byline (Author)

> #### From the Genius API
> - Track and album release dates
> - Track and album titles
> - Track and album artist names

Rather than treating your dataset as a single lump, finding a way to make a comparison or identify change gives you more interesting results for interpretation and also opens up different, more nuanced kinds of questions.
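
As a rough illustration of segmenting and ordering, the sketch below groups a toy dataset by year and by section. It assumes a `sentiment` column has already been added by a sentiment analysis step; that column and all the values in it are made up for the example.

```python
import pandas as pd

# Toy stand-in for a dataset that already has a sentiment score per article
articles = pd.DataFrame({
    "webPublicationDate": pd.to_datetime(
        ["2023-03-01", "2023-09-15", "2024-02-10", "2024-11-30"], utc=True),
    "sectionName": ["Politics", "World news", "Politics", "World news"],
    "sentiment": [0.2, -0.4, 0.6, -0.2],   # made-up scores for illustration
})

# Ordered: average sentiment per year
articles["year"] = articles["webPublicationDate"].dt.year
by_year = articles.groupby("year")["sentiment"].mean()

# Segmented: average sentiment per section
by_section = articles.groupby("sectionName")["sentiment"].mean()

print(by_year)
print(by_section)
```

The same `groupby` pattern works for any of the segmenting variables listed above, such as `pillarName` or `byline`.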

A project based around these kinds of questions or similar, will do well.

# Mixing Methods
One of the key benefits, in my view, of computational social science is the flexibility to build methods and tools specific to what you want to achieve. If you rely on pre-built software packages for the social sciences, what you can do with the data and which analysis techniques you can apply are often limited by the expectations and design limitations of that software.

## Creating your own features
- When you perform analysis, that is not necessarily the end-point of the pipeline. Doing analysis generates outputs which themselves can be put back into your dataset for use in another analysis technique.

- For example, topic modelling generates a topic assignment for each document.
- Network community analysis assigns a community to key entities within your documents, and so for each document you can have multiple 'entity communities'.
- Sentiment analysis can give a document a sentiment score or classification, or if broken down into paragraphs, a range of sentiments/classifications.
- Groups of documents can be given a list of significant words representing that group. A group could be based on topic (which is how topic modelling works) but also a time period, filtered by mention of a specific entity, grouped by mention of an entity community etc.
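
One way to feed analysis outputs back into the dataset is to summarise paragraph-level results as new article-level columns. The sketch below is a toy example under assumed column names: `original_index` follows the paragraph snippet later in this document, while `topic` and `sentiment` stand in for outputs you would have generated yourself.

```python
import pandas as pd

# Toy per-paragraph table; `topic` and `sentiment` stand in for analysis outputs
paragraphs = pd.DataFrame({
    "original_index": [0, 0, 0, 1, 1],
    "topic":          [2, 2, 5, 1, 1],
    "sentiment":      [0.5, -0.1, 0.2, -0.6, -0.2],
})

# Summarise the paragraph-level outputs as one row of new features per article
article_features = paragraphs.groupby("original_index").agg(
    n_paragraphs=("topic", "size"),
    n_topics=("topic", "nunique"),
    main_topic=("topic", lambda t: t.mode().iloc[0]),            # most frequent topic
    mean_sentiment=("sentiment", "mean"),
    sentiment_range=("sentiment", lambda s: s.max() - s.min()),  # spread within the article
)
print(article_features)
```

Each of these new columns can then be used for segmenting, ordering or as input to another technique.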

## Levels of Analysis
One thing to keep in mind when thinking about your project is that data can be examined at different 'levels'. Each level may tell you different things, and the claims you make may apply to different levels. A simple way to differentiate the levels would be:

- Whole dataset
- Subsets of the dataset
- Each item of the dataset
- A component part of the item

For data from the Guardian API this would be
- Whole corpus of documents
- Collections of documents based on some grouping or segmenting
- Each individual article
- Paragraphs within articles

### Whole Dataset
For example, if you were using topic modelling you may run it across the whole corpus and report on the different topics available across the whole dataset - how many articles per topic, what each topic is about.

### Subset
You may also examine whether certain topics occur more often with different subsets, such as time periods of reporting, or sections, or all documents mentioning specific entities.

### Individual articles
You may qualitatively examine representative articles from each topic, and use the topics to help you sample a smaller number of articles. This then allows you to provide a more in-depth analysis than the topic summaries which tend to be descriptive rather than explanatory.

### Paragraphs
You could choose to split each article into paragraphs (like we did for entity detection) before running topic modelling. This would then mean each document was a paragraph and may provide more nuanced topics and encourage a much closer reading when examining representative documents, comparing with entity presence etc.

### Which level?
- A whole corpus topic modelling will tell you the broad stories being discussed, but not whether there is a pattern in the story publication, or whether there is something of interest in the language used.
- A more granular, focussed (lower) level provides opportunities for finding patterns and qualitative interpretation, but may also need more thought behind the research design.

> #### Example: The Cypherpunks
> In a project with Dr. Amy Stevens (Sheffield), we applied network analysis, topic modelling and qualitative content analysis to explore the discussions of an online group called 'The Cypherpunks' across a ten year period.

> Using their archived mailing list discussions (think group email chains) as our data we...
> 1. Used network representation to rebuild the archives into a network of related messages.
> 2. We used this information to identify key periods of activity and select a specific period of high engagement before a lull of a few years.
> 3. We then used the network representation to identify the most central members of the community and all messages part of discussions.
> 4. We applied topic modelling to the discussions and generated our topics.
> 5. Finally we qualitatively examined all discussions that were considered most representative of a topic AND which were started by high centrality members of the community.
> 6. This allowed us to systematically sample the data so that we reduced the possible data to consider for qualitative analysis from ~190,000 messages from hundreds of users down to just the discussions started by the top 44 that best represent the topics.
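
The sampling logic in step 5 (representative of a topic AND started by a central member) can be sketched with a boolean filter. This is only an illustration of the idea, not the code used in that project: the `threads` table, its column names, the centrality values and `TOP_K` are all hypothetical.

```python
import pandas as pd

# Hypothetical table: one row per discussion thread
threads = pd.DataFrame({
    "thread_id": [1, 2, 3, 4, 5],
    "starter": ["ann", "bob", "cat", "ann", "dan"],
    "topic": [0, 0, 1, 1, 1],
    "is_representative": [True, False, True, True, False],
})
# Hypothetical centrality score per community member
centrality = pd.Series({"ann": 0.9, "bob": 0.7, "cat": 0.2, "dan": 0.1})

TOP_K = 2
central_members = centrality.nlargest(TOP_K).index

# Keep threads that are representative of a topic AND started by a central member
sample = threads[threads["is_representative"] & threads["starter"].isin(central_members)]
print(sample)
```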




# Example Code Snippets
How do you mix some of these different elements? Some short code snippets below show you how.

In [2]:
import pandas as pd

articles = pd.read_parquet('farright_dataset_cleaned.parquet').head(100)
articles.head()

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,tags,isHosted,pillarId,pillarName,byline,body,wordcount,cleaned_text,tokens
0,politics/2025/sep/12/starmer-hope-not-hate-let...,article,politics,Politics,2025-09-12 15:00:45+00:00,Starmer urged to do more to push back against ...,https://www.theguardian.com/politics/2025/sep/...,https://content.guardianapis.com/politics/2025...,"[{'activeSponsorships': None, 'apiUrl': 'https...",False,pillar/news,News,"Jessica Elgot, Rowena Mason and Peter Walker",<p>Senior Labour MPs and the UK’s largest anti...,925,Senior Labour MPs and the UK's largest anti-fa...,senior labour mps uk large anti fascist charit...
1,world/2025/sep/12/brazilians-take-to-the-stree...,article,world,World news,2025-09-12 14:54:33+00:00,Brazilians take to the streets to celebrate Bo...,https://www.theguardian.com/world/2025/sep/12/...,https://content.guardianapis.com/world/2025/se...,"[{'activeSponsorships': None, 'apiUrl': 'https...",False,pillar/news,News,Tom Phillips in Brasília,<p>Thousands of Brazilians have taken to the s...,718,Thousands of Brazilians have taken to the stre...,thousand brazilians take street rejoice jair b...
2,world/2025/sep/12/operation-world-cup-the-plot...,article,world,World news,2025-09-12 12:32:28+00:00,Operation World Cup: the murder plot at the he...,https://www.theguardian.com/world/2025/sep/12/...,https://content.guardianapis.com/world/2025/se...,"[{'activeSponsorships': None, 'apiUrl': 'https...",False,pillar/news,News,Tom Phillips in Brasília and Tiago Rogero in R...,<p>The conspirators used codenames to conceal ...,827,The conspirators used codenames to conceal the...,conspirator codename conceal identity prepare ...
3,world/2025/sep/12/ursula-von-der-leyen-eu-unde...,article,world,World news,2025-09-12 12:17:50+00:00,Europe’s cruel summer: Ursula von der Leyen fa...,https://www.theguardian.com/world/2025/sep/12/...,https://content.guardianapis.com/world/2025/se...,"[{'activeSponsorships': None, 'apiUrl': 'https...",False,pillar/news,News,Jennifer Rankin in Brussels,<p>When Ursula von der Leyen arrived in the va...,1048,When Ursula von der Leyen arrived in the vast ...,ursula von der leyen arrive vast semi circle d...
4,us-news/2025/sep/12/first-thing-new-video-of-s...,article,us-news,US news,2025-09-12 11:41:19+00:00,First Thing: New video of suspect released by ...,https://www.theguardian.com/us-news/2025/sep/1...,https://content.guardianapis.com/us-news/2025/...,"[{'activeSponsorships': None, 'apiUrl': 'https...",False,pillar/news,News,Nicola Slawson,<p>Good morning.</p> <p>US officials have issu...,954,Good morning.\nUS officials have issued an urg...,good morning official issue urgent appeal help...


## 1. Working with Paragraphs

We did a little of this when we generated entities per paragraph in the last networks session. However, if you want to work at the paragraph level in other ways, you'll need to do a little prep.

In [122]:
articles['paragraphs'] = articles['cleaned_text'].str.split('\n')

per_para_data = articles.explode('paragraphs')
per_para_data = per_para_data[per_para_data['paragraphs'].str.len() > 40] # filter out any paragraphs shorter than 40 characters

per_para_data = per_para_data.reset_index(drop=False, names='original_index') # create a column of the original article indexes to keep track
per_para_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   original_index      2133 non-null   int64              
 1   id                  2133 non-null   object             
 2   type                2133 non-null   object             
 3   sectionId           2133 non-null   object             
 4   sectionName         2133 non-null   object             
 5   webPublicationDate  2133 non-null   datetime64[ns, UTC]
 6   webTitle            2133 non-null   object             
 7   webUrl              2133 non-null   object             
 8   apiUrl              2133 non-null   object             
 9   tags                2133 non-null   object             
 10  isHosted            2133 non-null   bool               
 11  pillarId            2133 non-null   object             
 12  pillarName          2133 non-null 

In [123]:
import spacy

nlp = spacy.load('en_core_web_sm')
def tokenise_doc(doc):
    tokens = [w.lemma_.lower() for w in doc if not w.is_stop and w.is_alpha]
    return ' '.join(tokens)

BATCH_SIZE = 150
WORKERS = 1

tokens = []
for doc in nlp.pipe(per_para_data['paragraphs'], # see above for how to make a per-paragraph dataset
                     batch_size=BATCH_SIZE, n_process=WORKERS):
    tokens.append(tokenise_doc(doc))

per_para_data['para_tokens'] = tokens
per_para_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   original_index      2133 non-null   int64              
 1   id                  2133 non-null   object             
 2   type                2133 non-null   object             
 3   sectionId           2133 non-null   object             
 4   sectionName         2133 non-null   object             
 5   webPublicationDate  2133 non-null   datetime64[ns, UTC]
 6   webTitle            2133 non-null   object             
 7   webUrl              2133 non-null   object             
 8   apiUrl              2133 non-null   object             
 9   tags                2133 non-null   object             
 10  isHosted            2133 non-null   bool               
 11  pillarId            2133 non-null   object             
 12  pillarName          2133 non-null 

In [124]:
per_para_data.to_parquet('paragraphed_articles.parquet') 
# if you want to save time in the future, save this version of the dataset and load it in instead of the usual article per line version.

## 2. Topic Modelling Paragraphs

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

per_para_data = pd.read_parquet('paragraphed_articles.parquet') # see snippet 1

cv = CountVectorizer(min_df=2, max_df=0.95, stop_words='english')
topic_model = BERTopic(vectorizer_model=cv, calculate_probabilities=True)

topics, probabilities = topic_model.fit_transform(per_para_data['paragraphs'])
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,331,-1_party_said_people_labour,"[party, said, people, labour, right, governmen...",[Union chiefs have warned that Labour must do ...
1,0,354,0_kirk_charlie_trump_violence,"[kirk, charlie, trump, violence, political, sh...","[Kirk, 31, died after being shot during a pres..."
2,1,170,1_court_lula_democracy_supreme,"[court, lula, democracy, supreme, brazilian, b...",[Bolsonaro was right. A majority of Brazil's s...
3,2,142,2_israel_gaza_israeli_hamas,"[israel, gaza, israeli, hamas, palestinian, is...",[Israel's offensive in Gaza has killed more th...
4,3,61,3_johnson_office_guardian_scheme,"[johnson, office, guardian, scheme, musk, busi...","[Johnson has claimed £182,000 since leaving go..."
5,4,50,4_congress_harris_bidens_trump,"[congress, harris, bidens, trump, republicans,...",[“Many people want to spin up a narrative of s...
6,5,50,5_lachlan_news_empire_fox,"[lachlan, news, empire, fox, business, sun, wa...","[Murdoch, who was practically born with ink in..."
7,6,47,6_gun_hunt_shooters_conservation,"[gun, hunt, shooters, conservation, right, sup...",[“Chris Minns wants the people of NSW to belie...
8,7,41,7_africa_climate_global_africas,"[africa, climate, global, africas, crisis, sol...",[For rich countries to tackle the climate cris...
9,8,40,8_macron_france_divided_political,"[macron, france, divided, political, french, l...",[Since president Emmanuel Macron called a snap...


In [134]:
# save your model

topic_model.save('para_topic_model.pkl',save_ctfidf=True)

# add topics to your paragraph data

per_para_data['topic'] = topics

per_para_data.to_parquet('paragraphed_articles.parquet')



In [126]:
hierarchy = topic_model.hierarchical_topics(per_para_data['paragraphs'])
topic_model.visualize_hierarchy(hierarchical_topics=hierarchy)



In [40]:
topic_model.visualize_hierarchical_documents(docs=per_para_data['paragraphs'],
                                             hierarchical_topics=hierarchy)

In [127]:
# Topics per section of the article they came from
topic_per_section = topic_model.topics_per_class(per_para_data['paragraphs'], classes=per_para_data['sectionName'])
topic_model.visualize_topics_per_class(topic_per_section, top_n_topics=20)
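
Once each paragraph has a `topic` column, the same kind of comparison can be made over time with plain pandas. The sketch below uses a toy table with made-up dates and topics; it is an illustration of the idea rather than a replacement for BERTopic's own `topics_over_time` method, which is worth checking in its documentation.

```python
import pandas as pd

# Toy per-paragraph table with a topic already assigned to each paragraph
paras = pd.DataFrame({
    "webPublicationDate": pd.to_datetime(
        ["2025-07-03", "2025-07-20", "2025-08-05", "2025-08-09", "2025-08-30"], utc=True),
    "topic": [0, 1, 0, 0, 1],
})

# Count paragraphs per topic per calendar month
month = paras["webPublicationDate"].dt.strftime("%Y-%m")
topic_by_month = pd.crosstab(month, paras["topic"])

# Share of each month's paragraphs that belong to each topic
topic_share = topic_by_month.div(topic_by_month.sum(axis=1), axis=0)

print(topic_by_month)
print(topic_share.round(2))
```

Shares are often easier to compare than raw counts when some months contain many more articles than others.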

## 2. Mixing Topics and Entities
Which entities are most common per-topic?

In [135]:
import spacy

per_para_data = pd.read_parquet('paragraphed_articles.parquet')

nlp = spacy.load('en_core_web_sm')
KEEP_ENTS = ('PERSON','ORG', 'GPE') # Note GPE means geopolitical entity: countries, cities and states
BATCH_SIZE = 150

paragraph_entity_lists = []

for para in nlp.pipe(per_para_data['paragraphs'], n_process=1, batch_size=BATCH_SIZE): 
    entities = [e for e in para.ents]
    entities = [e for e in entities if e.label_ in KEEP_ENTS]
    entities = [e.text for e in entities if e.text[0].isupper()]
    entities = list(set(entities))
    paragraph_entity_lists.append(entities)

assert len(paragraph_entity_lists) == len(per_para_data)

per_para_data['entities'] = paragraph_entity_lists

# Save your entities to your paragraph data
per_para_data.to_parquet('paragraphed_articles.parquet')

per_para_data[['original_index','topic','entities']]



Unnamed: 0,original_index,topic,entities
0,0,18,"[Keir Starmer, UK, Nigel Farage, Labour]"
1,0,18,"[Starmer, London]"
2,0,26,[Nick Lowles]
3,0,12,"[Bridget Phillipson, Lucy Powell, Labour]"
4,0,12,[]
...,...,...,...
2128,99,1,"[Francisco Antônio, Bolsonaristas, Belial, Mor..."
2129,99,1,"[Sebastião Coelho, Alexandre de Moraes, Bolson..."
2130,99,1,"[Jair Bolsonaro, Moraes, US, Donald Trump's]"
2131,99,1,"[Bolsonaro, Moraes]"


In [136]:
row_per_entity = per_para_data.explode('entities')

topic_entities = row_per_entity.groupby(['topic','entities'], as_index=False).agg(n_paragraphs=('original_index','count'),
                                                                                    n_articles=('original_index', 'nunique'))

TOPIC = 2
print(topic_model.topic_labels_[TOPIC])
topic_filter = topic_entities['topic'] == TOPIC

topic_entities[topic_filter].sort_values('n_paragraphs',ascending=False)


2_israel_gaza_israeli_hamas


Unnamed: 0,topic,entities,n_paragraphs,n_articles
896,2,Israel,64,13
873,2,Gaza,36,11
881,2,Hamas,32,5
931,2,Qatar,15,5
946,2,UK,15,5
...,...,...,...,...
879,2,Glastonbury,1,1
878,2,Gideon Saar,1,1
877,2,Ghaith Abdul-Ahad,1,1
875,2,Geneva,1,1
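
The entity lists above contain variants of the same name, such as 'Starmer' and 'Keir Starmer', or the possessive "Donald Trump's". A small clean-up step before counting can merge these. The sketch below is one possible approach under stated assumptions: `normalise_entity` and the `ALIASES` table are hypothetical, and the aliases would need writing by hand after inspecting your own entity list.

```python
import pandas as pd

# Hypothetical alias table: write this by hand after inspecting your own entities
ALIASES = {"Starmer": "Keir Starmer", "Trump": "Donald Trump"}

def normalise_entity(name):
    """Strip a trailing possessive and map known short forms to one canonical name."""
    for suffix in ("'s", "’s"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return ALIASES.get(name, name)

entity_lists = pd.Series([
    ["Keir Starmer", "UK"],
    ["Starmer", "London"],
    ["Donald Trump's", "Trump", "US"],
])

# Normalise each list, then de-duplicate so an entity counts once per paragraph
cleaned = entity_lists.apply(lambda ents: sorted({normalise_entity(e) for e in ents}))
print(cleaned.tolist())
```

Be careful with surname-only aliases in datasets where two people share a surname.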


## 3. Ordering by network measure
Which people are the most 'important' in the network for each topic?


In [137]:
import networkx as nx

per_para_data = pd.read_parquet('paragraphed_articles.parquet')
row_per_entity = per_para_data.explode('entities')

# Represent our entities as a network
entity_dummies = pd.get_dummies(row_per_entity['entities'], # see snippet 2 for row_per_entity
                                dtype=int).groupby(level=0).sum()
adjacency = entity_dummies.T.dot(entity_dummies)
G = nx.from_pandas_adjacency(adjacency)

# Remove low weight edges
G.remove_edges_from([(source,target) for source, target, attr_dict in G.edges(data=True) if attr_dict['weight'] < 2])
# Remove low degree nodes (this will also clean up any nodes that were disconnected by the edge filter above)
G.remove_nodes_from([node for node, degree in G.degree if degree < 2])
# Get the giant component
largest_component = max(nx.connected_components(G), key=len)
G = G.subgraph(largest_component)

# Calculate whole network scores with chosen metric
scores = nx.eigenvector_centrality(G)


# Map those scores back to the dataset
topic_entities['eigenvector'] = topic_entities['entities'].map(scores)
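
To see what the `get_dummies` and `.T.dot` steps are doing, here is the same trick on three toy paragraphs. The entity lists are made up. Note that the diagonal of the resulting matrix counts each entity with itself, which becomes a self-loop when turned into a graph; whether to zero it, as shown in the last step, depends on the measure you intend to use.

```python
import numpy as np
import pandas as pd

# Three toy paragraphs and the entities found in each
paras = pd.DataFrame({"entities": [["Israel", "Gaza"], ["Israel", "Hamas", "Gaza"], ["UK"]]})
row_per_entity = paras.explode("entities")

# One row per paragraph, one column per entity (1 = entity appears in that paragraph)
entity_dummies = pd.get_dummies(row_per_entity["entities"], dtype=int).groupby(level=0).sum()

# Entity x entity matrix: each cell counts the paragraphs mentioning both entities
adjacency = entity_dummies.T.dot(entity_dummies)

# Optional: zero the diagonal so entities are not linked to themselves
no_loops = adjacency.mask(np.eye(len(adjacency), dtype=bool), 0)

print(adjacency)
print(no_loops)
```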

In [138]:

TOPIC = 2
print(topic_model.topic_labels_[TOPIC])
topic_filter = topic_entities['topic'] == TOPIC

topic_entities[topic_filter].sort_values('eigenvector',ascending=False)

2_israel_gaza_israeli_hamas


Unnamed: 0,topic,entities,n_paragraphs,n_articles,eigenvector
948,2,US,12,6,0.429865
944,2,Trump,2,2,0.290770
865,2,Donald Trump,2,2,0.213601
866,2,EU,2,2,0.212811
946,2,UK,15,5,0.188896
...,...,...,...,...,...
950,2,Wafa,1,1,
954,2,WhatsApp,1,1,
956,2,Yechiel Leiter,1,1,
957,2,Yellow Dagger,1,1,
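
The blank `eigenvector` values at the bottom of the table are most likely entities that were removed by the edge, degree or giant component filters: they have no entry in `scores`, so `.map` returns `NaN` for them. The toy sketch below, with made-up scores, shows two ways you might handle that.

```python
import pandas as pd

topic_entities = pd.DataFrame({"entities": ["US", "UK", "Wafa"], "n_paragraphs": [12, 15, 1]})
scores = {"US": 0.43, "UK": 0.19}   # made-up scores; "Wafa" is not in the filtered network

# Entities missing from `scores` get NaN
topic_entities["eigenvector"] = topic_entities["entities"].map(scores)

# Option 1: keep only entities that survived the network filtering
in_network = topic_entities.dropna(subset=["eigenvector"])

# Option 2: keep every entity but treat a missing score as zero
filled = topic_entities.fillna({"eigenvector": 0.0})

print(in_network)
print(filled)
```

Which option is appropriate depends on whether "not in the network" should count as "unimportant" for your question.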


## 4. TFIDF Keywords by Subset
Generate a set of keywords representing texts from any subset of the data

In [139]:
from sklearn.feature_extraction.text import TfidfVectorizer

per_para_data = pd.read_parquet('paragraphed_articles.parquet')
tfidf = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.95)
tfidf_vectors = tfidf.fit_transform(per_para_data['para_tokens'])
tfidf_vectors = pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names_out())
tfidf_vectors

Unnamed: 0,aamna,abandon,ability,able,absolute,absolutely,abuse,academic,accept,access,...,year,yellow,yesterday,york,young,younge,youth,youtube,zero,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [140]:
import plotly.express as px

ENTITY = 'Palestin' # str.contains treats this as a regular expression, so 'Palestin' matches Palestine, Palestinian, Palestinians...
TOP_N = 20

row_per_entity = per_para_data.explode('entities').dropna(subset=['entities']) # drops rows where paragraphs have no entities
# indexes of the paragraphs that mention the entity
group_indexes = list(set(row_per_entity[row_per_entity['entities'].str.contains(ENTITY)].index))
# sum the TFIDF scores of those paragraphs and keep the highest scoring terms
top_terms = tfidf_vectors.loc[group_indexes].sum().sort_values(ascending=False).head(TOP_N)

fig = px.bar(y=top_terms.index, x=top_terms.values,
             title=f'Top {TOP_N} terms in paragraphs mentioning "{ENTITY}"',
             labels={'x':'TFIDF Score','y':'Term'},
             height=600)
fig.update_yaxes(categoryorder='total ascending')
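
The same idea works for any subset, not just an entity match. The sketch below wraps it in a small helper that takes a boolean mask. `top_terms_for` is a hypothetical helper, the data is a toy stand-in, and it assumes the TFIDF table and the paragraph table have their rows in the same order.

```python
import pandas as pd

def top_terms_for(tfidf_vectors, mask, top_n=5):
    """Sum TFIDF scores over the rows selected by a boolean mask and return the top terms."""
    return tfidf_vectors[mask.to_numpy()].sum().sort_values(ascending=False).head(top_n)

# Toy stand-ins: a tiny TFIDF table and matching per-paragraph metadata
tfidf_vectors = pd.DataFrame(
    {"ceasefire": [0.9, 0.0, 0.4], "election": [0.0, 0.8, 0.1], "court": [0.1, 0.3, 0.0]})
per_para_data = pd.DataFrame(
    {"topic": [2, 1, 2], "sectionName": ["World news", "Politics", "World news"]})

# Keywords for a topic, then for a section
print(top_terms_for(tfidf_vectors, per_para_data["topic"] == 2))
print(top_terms_for(tfidf_vectors, per_para_data["sectionName"] == "Politics"))
```

Any condition that produces a True/False value per paragraph can be used as the mask, including a date range.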

## Turning topic co-occurrence into a network
If paragraphs of different topics co-occur in the same article, form a relation between them.
### What does this tell us?
Some topics may tend to be raised in relation to a greater number of other topics. If there is a topic called 'who set off the nuclear bomb?' which co-occurs with every other topic from UK politics, to environmental hazards to the price of cheese, you know the setting off of the nuclear bomb is related in some way to all these other topics.
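
A toy version of the idea, assuming a per-paragraph table with `original_index` (the article each paragraph came from) and `topic` columns; the values are made up. Topics that tend to appear in the same articles get a positive correlation, and topics that rarely share an article get a negative one.

```python
import pandas as pd

# Toy per-paragraph table: which article each paragraph came from and its topic
paras = pd.DataFrame({
    "original_index": [0, 0, 1, 1, 2, 2, 3],
    "topic":          [0, 1, 0, 1, 2, 2, 2],
})

# One row per article, one column per topic, cells count paragraphs
topic_counts = pd.crosstab(paras["original_index"], paras["topic"])

# Correlate the topic columns across articles
correlations = topic_counts.corr()
print(correlations.round(2))
```

Here topics 0 and 1 always appear together, while topic 2 only appears in articles without them.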

In [31]:
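per_para_data['topic'] = topics
# One row per article, one column per topic: how many paragraphs of each topic the article contains
dummies = pd.get_dummies(per_para_data['topic'], dtype=int).groupby(per_para_data['original_index']).sum()
dummies = dummies.drop(columns=[-1]) # drop the outlier topic
correlations = dummies.corr()
# Topics most correlated with topic 49 (swap 49 for any topic id in your own model)
correlations.rename(columns=topic_model.topic_labels_).loc[49].sort_values(ascending=False)

px.imshow(correlations)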


In [16]:
import networkx as nx

G = nx.from_pandas_adjacency(correlations)

In [17]:
nx.set_node_attributes(G, topic_model.topic_labels_, name='label')
G.remove_edges_from([(source, target) for source, target, attr in G.edges(data=True) if attr['weight'] <=0])
G.remove_edges_from(nx.selfloop_edges(G))
G.nodes(data=True)

NodeDataView({0: {'label': '0_kirk_charlie_trump_violence'}, 1: {'label': '1_court_lula_democracy_supreme'}, 2: {'label': '2_lachlan_news_empire_fox'}, 3: {'label': '3_gun_hunt_shooters_conservation'}, 4: {'label': '4_guardian_email_app_inbox'}, 5: {'label': '5_gaza_killed_israeli_city'}, 6: {'label': '6_epstein_mandelson_jeffrey_mandy'}, 7: {'label': '7_mps_deputy_phillipson_race'}, 8: {'label': '8_harris_bidens_republicans_house'}, 9: {'label': '9_kyle_millicent_towns_says'}, 10: {'label': '10_macron_france_prime_minister'}, 11: {'label': '11_river_fishing_cunningham_says'}, 12: {'label': '12_labour_needs_think_nigel'}, 13: {'label': '13_nato_ukraine_poland_russia'}, 14: {'label': '14_thomas_police_charges_court'}, 15: {'label': '15_africa_climate_africas_global'}, 16: {'label': '16_heritage_white_age_damage'}, 17: {'label': '17_rights_couples_marriage_citys'}, 18: {'label': '18_protest_block_movement_france'}, 19: {'label': '19_sport_girls_womens_boys'}, 20: {'label': '20_governance

In [18]:
nx.write_gexf(G,'topic_correlations_max.gexf')

In [25]:
# Correlation of topics within articles
import plotly.express as px

px.imshow(correlations)

In [13]:
documents = pd.DataFrame(
            {
                "Document": per_para_data['paragraphs'],
                "ID": per_para_data['paragraphs'].index,
                "Topic": topic_model.topics_,
                "Image": None,
            }
        )

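# Note: _extract_representative_docs is a private BERTopic method (leading underscore), so its signature may change between versions
_, _, _, repr_docs_ids = topic_model._extract_representative_docs(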
            topic_model.c_tf_idf_,
            documents,
            topics=topic_model.topic_labels_,
            nr_repr_docs=10,
        )

In [16]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,491,-1_party_labour_said_people,"[party, labour, said, people, government, refo...",[“Only the Labour party can stop this from hap...
1,0,325,0_kirk_violence_trump_shot,"[kirk, violence, trump, shot, political, shoot...","[US | Charlie Kirk, founder of rightwing youth..."
2,1,171,1_democracy_supreme_court_brazil,"[democracy, supreme, court, brazil, judge, 202...",[What did the judges say? Announcing Bolsonaro...
3,2,56,2_guardian_sign_breaking_finally,"[guardian, sign, breaking, finally, receive, n...","[Sign up: AU Breaking News email, Sign up: AU ..."
4,3,50,3_news_business_family_control,"[news, business, family, control, father, deal...",[Lachlan is said by some to not be as enamoure...
5,4,48,4_right_support_premier_nsw,"[right, support, premier, nsw, control, greens...",[The New South Wales premier has backtracked o...
6,5,42,5_house_joe_administration_questions,"[house, joe, administration, questions, white,...",[Kamala Harris calls Joe Biden's decision to s...
7,6,40,6_eu_sanctions_russia_deal,"[eu, sanctions, russia, deal, oil, western, eu...",[Senior EU and US officials have argued in fav...
8,7,39,7_killed_gaza_israeli_city,"[killed, gaza, israeli, city, israel, bank, pa...","[After the attack, the Israeli military said i..."
9,8,39,8_peter_man_relationship_starmer,"[peter, man, relationship, starmer, told, emer...",[Since the bundle of correspondence was publis...


In [17]:
topic_ids = topic_model.get_topic_info()["Topic"].tolist()
dict(zip(topic_ids, repr_docs_ids))

{-1: [2009, 1626, 2066, 2067, 1885, 1961, 1897, 1988, 10, 1532],
 0: [484, 443, 430, 391, 792, 977, 149, 503, 158, 881],
 1: [417, 373, 358, 1498, 380, 105, 48, 1492, 411, 207],
 2: [512, 686, 408, 1614, 1467, 2219, 1790, 1765, 1935, 834],
 3: [617, 1813, 591, 590, 587, 589, 604, 621, 1812, 394],
 4: [1788, 1786, 1798, 2222, 1785, 2214, 2226, 1792, 1807, 2238],
 5: [2124, 2114, 1270, 449, 1268, 2112, 1269, 1263, 1275, 1265],
 6: [2062, 86, 88, 2049, 2061, 2052, 2058, 1324, 1325, 2054],
 7: [1106, 1203, 1219, 2032, 2037, 1205, 1202, 2039, 2041, 2040],
 8: [1229, 882, 115, 1432, 1232, 1230, 1223, 1841, 393, 1439],
 9: [1108, 98, 1716, 695, 691, 1820, 73, 1609, 1623, 685, 693, 1720, 79, 1611, 687, 1621, 280, 1722, 1615, 1619, 68, 1714, 679, 683, 1617, 75, 1116, 1718, 1639, 63, 689, 1607, 1613, 1606, 66, 678, 681],
 10: [1753, 1747, 1734, 1715, 1731, 1732, 1756, 1748, 1754, 1717],
 11: [736, 745, 746, 716, 744, 742, 71