<a href="https://colab.research.google.com/github/Mel-Anden/Mel-Anden/blob/main/Extract_named_entities_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extract Named Entities Network from Actor Statements

*DTU - Explore the controversy about Energy Island*

**Goal**:
- Extract [named entities](https://en.wikipedia.org/wiki/Named-entity_recognition) from the list of statements
- Build a co-occurrence network (two entities are linked when they appear together in statements more often than chance would predict, as measured by normalized PMI)
- Visualize the network

**Purpose**: useful to identify important markers (persons, places, events...) that can be reused in queries to find more documents related to the same controversy.

**How to use**:
- Edit settings then use "Runtime > Run all"
- Wait for each cell to run
- ⚠️ You may have to restart the runtime when installing libraries
- ⚠️ Allow the script to access your Google Drive data when prompted to
- To read the last visualization properly, you need to run the layout ("Play" button on the left of the widget)

## Settings

In [None]:
# SETTINGS (edit if necessary)
settings = {}
settings['statements_spreadsheet_drive_URL'] = 'https://docs.google.com/spreadsheets/d/1c6U-tF4ZTi-csTkusGFaSclE2-gn3tvi0Xaj4Q8AKvk/edit?usp=sharing'
settings['column_text'] = 'Restated version (the transformed actor statement)'
settings['edge_weight_threshold'] = 0.5 # Keep only edges whose normalized PMI (NPMI) is above this value; only positive NPMI is kept, so use a value in [0,1]

## Code

(You don't have to understand what's going on here, but feel free to take a look)

### Install stuff

In [None]:
# Install necessary libraries
!pip install pandas==2.0.3 gspread==5.10.0 google-auth==2.22.0 google-auth-oauthlib==1.0.0 google-auth-httplib2==0.1.0
!pip install -U spacy networkx ipysigma
!python -m spacy download en_core_web_sm

In [None]:
# Import necessary libraries
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

from collections import defaultdict
import math

import networkx as nx

from ipysigma import Sigma
from google.colab import output
output.enable_custom_widget_manager()

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

### Load data from the spreadsheet

In [None]:
# Extract the spreadsheet key from the URL (the part between '/d/' and '/edit'), then open the spreadsheet by key
spreadsheet_key = settings['statements_spreadsheet_drive_URL'].split('/d/')[1].split('/edit')[0]
sh = gc.open_by_key(spreadsheet_key)

# Select the worksheet
worksheet_name = 'Form Responses'
worksheet = sh.worksheet(worksheet_name)
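
The key extraction above assumes the URL contains both `/d/` and `/edit`. A minimal sketch of a more tolerant alternative is shown below. It assumes the key only contains letters, digits, `-` and `_`, and the helper name `extract_spreadsheet_key` is hypothetical (it is not part of gspread).

```python
import re

def extract_spreadsheet_key(url):
    """Return the key found after '/spreadsheets/d/' in a URL, or None if absent.

    Hypothetical helper: assumes keys use only letters, digits, '-' and '_'.
    """
    match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    return match.group(1) if match else None

# The URL from the settings cell
url = "https://docs.google.com/spreadsheets/d/1c6U-tF4ZTi-csTkusGFaSclE2-gn3tvi0Xaj4Q8AKvk/edit?usp=sharing"
key = extract_spreadsheet_key(url)
```

Returning `None` instead of raising an `IndexError` makes it easier to show a clear message when the URL in the settings is not a spreadsheet link.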

In [None]:
# Get all values from the worksheet as a list of lists
data = worksheet.get_all_values()

# Create a Pandas DataFrame from the list of lists
df = pd.DataFrame(data[1:], columns=data[0])

# Display dataframe for monitoring purposes
df

### Extract named entities

In [None]:
# Get text data
textData = df[settings['column_text']]

# Function to extract named entities
def extract_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

# Create NetworkX graph
graph = nx.Graph()

# Create a dictionary to store entity frequencies
entity_freq = defaultdict(int)

# Create a dictionary to store co-occurrence frequencies
co_occurrence_freq = defaultdict(int)

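# Process textData and update frequencies.
# Each entity is counted at most once per statement and pairs are stored in
# sorted order, so both dictionaries hold numbers of statements. This avoids
# self-pairs and split counts for (A, B) vs. (B, A).
for text in textData:
    entities = sorted(set(extract_entities(text)))
    for entity in entities:
        entity_freq[entity] += 1
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            co_occurrence_freq[(entities[i], entities[j])] += 1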

# Calculate total number of statements
total_statements = len(textData)

# Add edges to the graph with normalized PMI (NPMI) as weight
for (entity1, entity2), freq in co_occurrence_freq.items():
    p_joint = freq / total_statements
    if p_joint >= 1:
        # The pair appears in every statement: the normalization would divide by zero, so skip it
        continue
    # Calculate PMI
    pmi = math.log2(
        p_joint /
        ((entity_freq[entity1] / total_statements) * (entity_freq[entity2] / total_statements))
    )
    # Normalize PMI
    npmi = pmi / -math.log2(p_joint)
    # Add edge with NPMI as weight, keeping only positive values above the threshold
    if npmi > 0 and npmi > settings['edge_weight_threshold']:
        graph.add_edge(entity1, entity2, weight=npmi)

# Print some information about the graph
print(f"Number of nodes: {graph.number_of_nodes()}")
print(f"Number of edges: {graph.number_of_edges()}")
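
The edge weight used above is normalized PMI: PMI(a, b) = log2(p(a, b) / (p(a) · p(b))), divided by -log2(p(a, b)). The sketch below works through that formula on a toy example using only the standard library. The function name `npmi_edges` and the toy statements are hypothetical, and the sketch assumes each entity is counted once per statement and that pairs present in every statement are skipped (the formula is 0/0 in that case).

```python
import math
from collections import Counter
from itertools import combinations

def npmi_edges(statement_entities, threshold=0.0):
    """Return {(a, b): npmi} for entity pairs whose NPMI is above `threshold`.

    `statement_entities` is a list with one list of entity names per statement.
    """
    n = len(statement_entities)
    entity_count = Counter()
    pair_count = Counter()
    for entities in statement_entities:
        unique = sorted(set(entities))  # count each entity once per statement
        entity_count.update(unique)
        pair_count.update(combinations(unique, 2))  # sorted pairs, no self-pairs

    edges = {}
    for (a, b), joint in pair_count.items():
        p_ab = joint / n
        if p_ab >= 1:
            continue  # pair is in every statement: NPMI is undefined, skip it
        pmi = math.log2(p_ab / ((entity_count[a] / n) * (entity_count[b] / n)))
        npmi = pmi / -math.log2(p_ab)
        if npmi > threshold:
            edges[(a, b)] = npmi
    return edges

# Toy data (made up for illustration): 4 statements
toy_statements = [
    ["Energy Island", "Denmark"],
    ["Energy Island", "Denmark"],
    ["Bornholm"],
    ["Denmark", "Bornholm"],
]
toy_edges = npmi_edges(toy_statements)
```

Here "Denmark" and "Energy Island" co-occur in 2 of 4 statements, which gives an NPMI of about 0.415, so the pair is kept. "Bornholm" and "Denmark" co-occur less often than chance would predict, which gives a negative NPMI, so the pair is dropped.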

In [None]:
# Export the network
nx.write_gexf(graph, "named_entities.gexf")

### Visualize the network
Note: push the PLAY button on the left part of the widget to run the layout and see the network's structure.

In [None]:
# Visualize the network: node size = degree, node color = Louvain community
Sigma(
    graph,
    node_size=graph.degree,
    node_metrics={"community": "louvain"},
    node_color="community",
    layout_settings={'edgeWeightInfluence': 0.666},
    edge_weight="weight",
)