## Preparation
This is a Python notebook. To run it, press the "play" button on each code cell.  
Always run the cells in order and never skip one!
If in doubt, you can always restart from the beginning.

First, let's clone the GitHub repository.

In [None]:
!git clone https://github.com/SimoneRebora/CMCLS.git

Then, we extract the named entities with [spaCy](https://spacy.io/).

In [None]:
# define text to be analyzed
my_text = 'CMCLS/corpus/Doyle_Study_1887.txt'

import spacy
import pandas as pd

# Load the small English model
# (if it is missing, install it first with: !python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Read the text to be analyzed
with open(my_text, 'r', encoding='utf-8') as file:
    text = file.read()

# Process the text with spaCy
doc = nlp(text)

# extract entities per sentence
entities_with_sentences = []
sentences = list(doc.sents) # Convert sentences to a list to access by index

for ent in doc.ents:
    for i, sent in enumerate(sentences):
        if ent.start >= sent.start and ent.end <= sent.end:
            entities_with_sentences.append((ent.text, ent.label_, i)) # Store sentence index
            break # Found the sentence, move to the next entity

# Naming the columns here also works when no entity was found
df = pd.DataFrame(entities_with_sentences, columns=['Entity', 'Type', 'Sentence Index'])
df

## Part 1. Network

First, we keep only the "PERSON" entities.  
Then we count how often two persons co-occur within a window of two adjacent sentences (the same sentence or the following one).  
Finally, we save the results to a co-occurrence table and show it.
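
To make the counting rule concrete, here is a minimal sketch on toy data. The dictionary `persons_by_sentence` and the names in it are hypothetical and only stand in for the grouped PERSON entities; the logic is meant to mirror the two-sentence window described above, not to replace the cell below.

```python
import itertools
from collections import Counter

# Hypothetical toy input: sentence index -> persons found in that sentence
persons_by_sentence = {
    0: ["Watson"],
    1: ["Holmes", "Watson"],
    4: ["Holmes", "Lestrade"],
}

counts = Counter()
for idx in sorted(persons_by_sentence):
    # Window = current sentence plus the following one (if it contains persons)
    window = set(persons_by_sentence[idx]) | set(persons_by_sentence.get(idx + 1, []))
    # Sorting first keeps every pair in alphabetical order
    for pair in itertools.combinations(sorted(window), 2):
        counts[pair] += 1

print(dict(counts))
```

In this toy case Holmes and Watson end up with weight 2, because the pair falls into two overlapping windows (sentences 0 and 1 together, then sentence 1 on its own), while Holmes and Lestrade get weight 1.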

In [None]:
import itertools

# filter person entities
person_entities_df = df[df['Type'] == 'PERSON']
# group entities per sentence
grouped_persons = person_entities_df.groupby('Sentence Index')['Entity'].apply(lambda x: list(x.unique())).to_dict()

# calculate co-occurrences
# 1. Initialize an empty dictionary to store co-occurrence relationships and their weights
co_occurrence_counts = {}

# 2. Get a sorted list of unique sentence indices from the grouped_persons dictionary
sorted_sentence_indices = sorted(grouped_persons.keys())

# 3. Iterate through the sentences that contain at least one person
for current_sentence_index in sorted_sentence_indices:
    # a. Persons in the current sentence
    persons_in_current_sentence = grouped_persons[current_sentence_index]

    # b. Persons in the following sentence (empty list if it contains none)
    persons_in_next_sentence = grouped_persons.get(current_sentence_index + 1, [])

    # c. Combine both into a set, so that each person appears once per window
    all_co_occurring_persons = set(persons_in_current_sentence + persons_in_next_sentence)

    # d. Count every unique pair; sorting first keeps each pair in alphabetical order
    # Note: the windows overlap, so a pair inside one sentence is counted again
    # when the previous sentence also contains a person
    for ordered_pair in itertools.combinations(sorted(all_co_occurring_persons), 2):
        co_occurrence_counts[ordered_pair] = co_occurrence_counts.get(ordered_pair, 0) + 1

# save and show the table
co_occurrence_df = pd.DataFrame([
    {'source': pair[0], 'target': pair[1], 'weight': count}
    for pair, count in co_occurrence_counts.items()
])

co_occurrence_df.to_csv('co_occurrence_df.csv', index=False)

co_occurrence_df

We can (or, actually, should) curate the co-occurrences a bit, for example by merging name variants that refer to the same character.  
To do so, download the .csv file to your computer, correct it, and upload it again.

_______________________
**Correction can be done with software like [LibreOffice](https://it.libreoffice.org/download/download/)**
_______________________
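
As an alternative to a spreadsheet, part of the curation could also be done in pandas. The sketch below is only an illustration: the `edges` table and the `aliases` map are hypothetical, and summing the weights of merged rows is a simplifying assumption (it may overcount when two variants of a name appear in the same window).

```python
import pandas as pd

# Hypothetical edge table in the same format as co_occurrence_df.csv
edges = pd.DataFrame({
    "source": ["Holmes", "Sherlock Holmes", "Holmes"],
    "target": ["Watson", "Watson", "Sherlock Holmes"],
    "weight": [3, 2, 1],
})

# Hypothetical alias map: name variant -> name to keep
aliases = {"Sherlock Holmes": "Holmes"}

curated = edges.copy()
curated["source"] = curated["source"].replace(aliases)
curated["target"] = curated["target"].replace(aliases)

# Drop self-loops created by the merge (e.g. Holmes - Holmes)
curated = curated[curated["source"] != curated["target"]].copy()

# Put every pair back in alphabetical order so that duplicates line up
pairs = [tuple(sorted(p)) for p in zip(curated["source"], curated["target"])]
curated["source"] = [p[0] for p in pairs]
curated["target"] = [p[1] for p in pairs]

# Sum the weights of the rows that now describe the same pair
curated = curated.groupby(["source", "target"], as_index=False)["weight"].sum()
print(curated)
```

The result can be written back with `curated.to_csv('co_occurrence_df.csv', index=False)`, so that the next cell reads the corrected table.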

Then we can re-read the co-occurrences file and convert it into "nodes" and "edges" tables.

In [None]:
# re-read the file
co_occurrence_df = pd.read_csv('co_occurrence_df.csv')

# Get all unique entities from 'source' and 'target' columns
unique_entities = pd.concat([co_occurrence_df['source'], co_occurrence_df['target']]).unique()

# Create a new DataFrame with 'ID' and 'label' columns
nodes_df = pd.DataFrame({
    'ID': unique_entities,
    'label': unique_entities
})

# save nodes and edges
nodes_df.to_csv('nodes_df.csv', index=False)
co_occurrence_df.to_csv('edges_df.csv', index=False)

# show them
display(nodes_df)
display(co_occurrence_df)

Finally, we can plot the network (the result is not very pretty, though...).

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty graph
G = nx.Graph()

# Add edges from the co_occurrence_df
for index, row in co_occurrence_df.iterrows():
    G.add_edge(row['source'], row['target'], weight=row['weight'])

# show the plot
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G, k=0.8, iterations=20) # You can experiment with different layouts and parameters
nx.draw_networkx_nodes(G, pos, node_color='skyblue', node_size=2000)
nx.draw_networkx_edges(G, pos, edge_color='gray', width=1.0, alpha=0.7)
nx.draw_networkx_labels(G, pos, font_size=10, font_weight='bold')

plt.title("Co-occurrence Network of PERSON Entities")
plt.axis('off') # Hide axes
plt.show()
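
The plot above stores the co-occurrence weights in the graph but does not show them. One possible refinement, sketched here on a hypothetical toy graph (`toy_edges` and the factor 300 are made up for illustration), is to derive edge widths and node sizes from the weights.

```python
import networkx as nx

# Hypothetical edges in the same (source, target, weight) format as edges_df.csv
toy_edges = [("Holmes", "Watson", 5), ("Holmes", "Lestrade", 2), ("Gregson", "Lestrade", 1)]

G_toy = nx.Graph()
G_toy.add_weighted_edges_from(toy_edges)

# Edge width proportional to the co-occurrence weight (same order as G_toy.edges())
edge_widths = [data["weight"] for _, _, data in G_toy.edges(data=True)]

# Node size proportional to the weighted degree
# (the sum of the weights of all edges attached to a node)
weighted_degree = dict(G_toy.degree(weight="weight"))
node_sizes = [300 * weighted_degree[node] for node in G_toy.nodes()]

print(weighted_degree)
```

These lists can then be passed to the drawing calls as `nx.draw_networkx_edges(..., width=edge_widths)` and `nx.draw_networkx_nodes(..., node_size=node_sizes)`; both follow the default edge and node order of the graph.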

## Part 2. Mapping

First, we extract the "GPE" (countries, cities, states) and "LOC" (other locations, such as mountains or rivers) entities.  
Then we geolocate them (using the [geopy](https://geopy.readthedocs.io/en/stable/) package).

In [None]:
from geopy.geocoders import Nominatim
import pandas as pd
import time

# Keep each distinct GPE/LOC entity once
unique_locations = df[df['Type'].isin(['GPE', 'LOC'])]['Entity'].unique()
location_entities_df = pd.DataFrame(unique_locations, columns=['Location'])

# Nominatim can be slow to answer, so allow up to 10 seconds per request
geolocator = Nominatim(user_agent="colab_geocoder", timeout=10)

# Coordinates stay empty for locations that cannot be geocoded
location_entities_df['Latitude'] = None
location_entities_df['Longitude'] = None

for index, row in location_entities_df.iterrows():
    location_name = row['Location']
    print(f"Geocoding {location_name}...")
    try:
        geodata = geolocator.geocode(location_name)
        if geodata:
            location_entities_df.loc[index, 'Latitude'] = geodata.latitude
            location_entities_df.loc[index, 'Longitude'] = geodata.longitude
    except Exception as e:
        print(f"Error geocoding {location_name}: {e}")
    time.sleep(1) # Wait 1 second between requests (Nominatim allows at most one per second)

# index=False avoids an extra unnamed column when the file is read again
location_entities_df.to_csv("locations_df.csv", index=False)

location_entities_df
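
The geocoding loop above can also be wrapped in a small helper, which makes it easy to try out offline. This is only a sketch: `geocode_all`, `FakeLocation`, and `fake_geocode` are hypothetical names, and the stub returns made-up, rounded coordinates. With geopy, one would pass `geolocator.geocode` as `geocode_fn`.

```python
import time
from collections import namedtuple

def geocode_all(names, geocode_fn, delay=1.0):
    """Return {name: (lat, lon) or None}, calling geocode_fn once per distinct name."""
    results = {}
    for name in names:
        if name in results:
            continue  # already looked up: no second request
        try:
            hit = geocode_fn(name)
        except Exception as error:
            print(f"Error geocoding {name}: {error}")
            hit = None
        results[name] = (hit.latitude, hit.longitude) if hit else None
        time.sleep(delay)  # pause between requests
    return results

# Hypothetical stub that mimics a geocoder: it knows one place only
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])
fake_table = {"London": FakeLocation(51.5, -0.12)}  # made-up, rounded coordinates

def fake_geocode(name):
    return fake_table.get(name)

coords = geocode_all(["London", "Utah", "London"], fake_geocode, delay=0)
print(coords)
```

Names that the geocoder cannot resolve are kept with the value `None`, so they are easy to spot during the manual curation step.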

Again, we can (or, actually, should) curate the geolocations a bit: the geocoder may pick the wrong place for an ambiguous name, or find nothing at all.  
To do so, download the .csv file to your computer, correct it, and upload it again.

_______________________
**Correction can be done with software like [LibreOffice](https://it.libreoffice.org/download/download/)  
Note that you can use [Google Maps](https://www.google.com/maps) to find coordinates**
_______________________
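
After a manual correction, it can help to check the file before mapping it. The following sketch assumes a hand-corrected table with the same three columns as `locations_df.csv`; the rows in `curated_csv` are hypothetical and include one missing and one mistyped coordinate on purpose.

```python
import io
import pandas as pd

# Hypothetical content of a hand-corrected locations_df.csv
curated_csv = io.StringIO(
    "Location,Latitude,Longitude\n"
    "London,51.5,-0.12\n"
    "Utah,,\n"
    "Baker Street,515.2,-0.15\n"
)
locations = pd.read_csv(curated_csv)

# Turn anything non-numeric (typos, stray text) into NaN
lat = pd.to_numeric(locations["Latitude"], errors="coerce")
lon = pd.to_numeric(locations["Longitude"], errors="coerce")

# Valid coordinates: latitude in [-90, 90], longitude in [-180, 180]
# (missing values compare as False, so they are flagged too)
valid = lat.between(-90, 90) & lon.between(-180, 180)

# Rows that still need attention
print(locations[~valid])
```

Here the rows for Utah (no coordinates) and Baker Street (latitude out of range) are listed as needing another look.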


Finally, we can show the map (again, not very pretty).

In [None]:
import folium

# re-read the table
location_entities_df = pd.read_csv("locations_df.csv")

# Create a base map centered around an approximate average of the locations
# You might want to adjust the initial center or zoom level
center_lat = location_entities_df['Latitude'].mean()
center_lon = location_entities_df['Longitude'].mean()
m = folium.Map(location=[center_lat, center_lon], zoom_start=2)

# Add markers for each location
for index, row in location_entities_df.iterrows():
    if pd.notna(row['Latitude']) and pd.notna(row['Longitude']):
        folium.Marker(
            location=[row['Latitude'], row['Longitude']],
            popup=row['Location']
        ).add_to(m)

# Display the map
display(m)