## Step 1 – Load cleaned quotes and assign IDs

**Goal:**  
Load the pre-cleaned quotes from `Art_quotes.txt` (created in the previous notebook) and give each quote a unique numeric ID.  
These IDs will later be used as node IDs in the quote network.

**What we do in this step:**
1. Read `Art_quotes.txt`, assuming one quote per line.
2. Strip empty lines and extra whitespace.
3. Assign an integer ID to each quote: 0, 1, 2, ...
4. Inspect the number of quotes and print a small sample to check everything looks correct.

In [1]:
# Step 1 – Load cleaned quotes and assign IDs

# pandas is not needed to load the file, but every later step works on the quotes_df DataFrame built below.
import pandas as pd

# 1. Load quotes from the text file
quotes = []
with open("Art_quotes.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip empty lines
            quotes.append(line)

# 2. Basic sanity checks
print(f"Total number of quotes loaded: {len(quotes)}\n")

print("First 10 quotes:\n")
for i, q in enumerate(quotes[:10]):
    print(f"[{i}] {q}\n")

# 3. Put quotes into a DataFrame with IDs (used by all later steps)
quotes_df = pd.DataFrame(
    {
        "id": range(len(quotes)),
        "quote": quotes
    }
)

# Display the first few rows of the table (if you're in Jupyter, this will render nicely)
quotes_df.head(10)

Total number of quotes loaded: 321

First 10 quotes:

[0] When you have a cause, the best way to express yourself is artistically.

[1] The coming extinction of art is prefigured in the increasing impossibility of representing historical events.

[2] Art is magic delivered from the lie of being truth.

[3] I have this very what you call today "square" idea that art is something that makes you breathe with a different kind of happiness.

[4] Technique without art is shallow and doomed. Art without technique is insulting.

[5] I make art because it centers me in my body, and by doing so I hope to offer that experience to someone else.

[6] The arts are a wonderful medicine for the soul.

[7] Art would not be important if life were not important, and life is important.

[8] One writes out of one thing only — one's own experience. Everything depends on how relentlessly one forces from this experience the last drop, sweet or bitter, it can possibly give. This is the only real concern of the 

|   | id | quote |
|---|----|-------|
| 0 | 0 | When you have a cause, the best way to express... |
| 1 | 1 | The coming extinction of art is prefigured in ... |
| 2 | 2 | Art is magic delivered from the lie of being t... |
| 3 | 3 | I have this very what you call today "square" ... |
| 4 | 4 | Technique without art is shallow and doomed. A... |
| 5 | 5 | I make art because it centers me in my body, a... |
| 6 | 6 | The arts are a wonderful medicine for the soul. |
| 7 | 7 | Art would not be important if life were not im... |
| 8 | 8 | One writes out of one thing only — one's own e... |
| 9 | 9 | Art is made by the alone for the alone. |


## Step 2 – Extract Improved KeyBERT Subtopics (MMR + n-grams)

In this step we extract richer and more diverse subtopics for each quote using KeyBERT:

- **MMR (Maximal Marginal Relevance)** with `diversity=0.7` for diverse keyphrases
- **n-grams up to 3 words** for more descriptive phrases
- **English stop words removed** before candidate phrases are formed
- **top_n = 10** to increase the chance of overlap later
- **Raw phrases kept** (no heavy cleaning yet)

These raw subtopics are cleaned in Step 3 and then used as input for the network.
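
To make the effect of `use_mmr=True` and `diversity` more concrete, here is a minimal pure-Python sketch of an MMR-style selection on made-up 2-D vectors. It is a simplified illustration of the idea, not KeyBERT's actual implementation; `cosine`, `mmr_select`, and the toy vectors are hypothetical names and values introduced only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(doc_vec, cand_vecs, top_n, diversity):
    """Return indices of candidates chosen by a simplified MMR rule."""
    relevance = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cand_vecs)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(cand_vecs)):
        best_idx, best_score = None, -math.inf
        for i in range(len(cand_vecs)):
            if i in selected:
                continue
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


# Toy 2-D "embeddings" (invented for illustration, not real model output).
doc = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.15],  # 1: near-duplicate of candidate 0
    [0.6, -0.8],  # 2: less relevant, but points in a different direction
]

print(mmr_select(doc, candidates, top_n=2, diversity=0.0))  # → [0, 1]
print(mmr_select(doc, candidates, top_n=2, diversity=0.7))  # → [0, 2]
```

With `diversity=0.0` the near-duplicate wins because only relevance counts; with `diversity=0.7` the redundancy penalty pushes the selection toward the less similar candidate. This is the trade-off that the `diversity` setting in this step is meant to control.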

In [2]:
# Step 2 – Extract improved KeyBERT subtopics (MMR + n-grams + diversification)

from keybert import KeyBERT

# Load KeyBERT with MiniLM (default)
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# Store raw results
all_subtopics_raw = []
all_subtopic_terms_raw = []

# Extract keyphrases for each quote
for q in quotes_df["quote"]:
    keywords = kw_model.extract_keywords(
        q,
        keyphrase_ngram_range=(1, 3),   # allow 1–3 word phrases
        stop_words='english',
        use_mmr=True,                   # diversify the keyphrases
        diversity=0.7,                  # more diverse phrases
        top_n=10                        # extract more candidate subtopics
    )

    all_subtopics_raw.append(keywords)
    all_subtopic_terms_raw.append([kw for kw, score in keywords])

# Add to dataframe
quotes_df["subtopics_raw"] = all_subtopics_raw
quotes_df["subtopic_terms_raw"] = all_subtopic_terms_raw

# Inspect the first few rows
print("Sample of extracted raw subtopics:\n")
for i, row in quotes_df.head(10).iterrows():
    print(f"ID {row['id']}")
    print("Quote:", row["quote"])
    print("Raw subtopics:", row["subtopic_terms_raw"])

Sample of extracted raw subtopics:

ID 0
Quote: When you have a cause, the best way to express yourself is artistically.
Raw subtopics: ['express artistically', 'way express artistically', 'artistically', 'cause best way', 'cause best', 'best way', 'cause', 'best', 'way express', 'way']
ID 1
Quote: The coming extinction of art is prefigured in the increasing impossibility of representing historical events.
Raw subtopics: ['coming extinction art', 'impossibility representing historical', 'representing historical events', 'art', 'art prefigured', 'historical', 'events', 'representing', 'prefigured increasing', 'coming']
ID 2
Quote: Art is magic delivered from the lie of being truth.
Raw subtopics: ['art magic delivered', 'art magic', 'magic delivered lie', 'art', 'magic delivered', 'delivered lie truth', 'lie truth', 'magic', 'lie', 'delivered']
ID 3
Quote: I have this very what you call today "square" idea that art is something that makes you breathe with a different kind of happiness.


## Step 3 – Cleaning of Raw Subtopics

We now lightly clean the raw KeyBERT subtopics:

- Lowercase and strip outer punctuation.
- Remove single-word `"art"` / `"arts"` (too generic).
- Remove exact duplicates within each quote.
- Prefer longer phrases over shorter fragments when one is contained in another.

The result is stored in `subtopics_clean` and will be used as input for the semantic similarity step when building the network.
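
One detail of the "prefer longer phrases" rule is worth spelling out: the cleaning function below tests containment with Python's `in` operator, which is a character-level substring check. The sketch below contrasts that behaviour with a word-level alternative. The helper names (`drop_fragments_substring`, `contains_as_tokens`, `drop_fragments_tokens`) and the example terms are hypothetical and are not part of the notebook's pipeline.

```python
def drop_fragments_substring(terms):
    """Mirror of the substring rule: drop t if it occurs inside a longer phrase."""
    return [
        t for t in terms
        if not any(
            t != other and t in other and len(other.split()) > len(t.split())
            for other in terms
        )
    ]


def contains_as_tokens(short, long):
    """True if the words of `short` form a contiguous run of words in `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def drop_fragments_tokens(terms):
    """Word-level variant: drop t only if it matches whole words of a longer phrase."""
    return [
        t for t in terms
        if not any(contains_as_tokens(t, other) for other in terms)
    ]


# Invented example terms: "pain" is a substring of "painting" but not a word in it.
terms = ["pain", "painting daily", "best way", "cause best way"]

print(drop_fragments_substring(terms))  # → ['painting daily', 'cause best way']
print(drop_fragments_tokens(terms))     # → ['pain', 'painting daily', 'cause best way']
```

Both variants drop true fragments such as `'best way'`. They differ only when a short term happens to be a substring of an unrelated longer word, so which one is preferable depends on how often that occurs in the extracted phrases.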

In [3]:
# Step 3 – Cleaning of Raw Subtopics

import string

art_words = {"art", "arts"}  # too generic to keep alone


def clean_subtopics_light(raw_terms, max_final=None):
    """
    raw_terms: list of keyword strings for one quote (from subtopic_terms_raw)
    returns: lightly cleaned list of keyword strings
    """
    cleaned = []
    seen = set()

    # First pass: normalize, drop pure 'art'/'arts', remove exact duplicates
    for term in raw_terms:
        t = term.lower().strip()
        t = t.strip(string.punctuation).strip('"').strip("'")

        if not t:
            continue

        # drop single-word 'art' / 'arts'
        if t in art_words and len(t.split()) == 1:
            continue

        if t not in seen:
            cleaned.append(t)
            seen.add(t)

    # Second pass: prefer longer phrases over fragments.
    # Drop t1 when it occurs inside another term t2 that has more words.
    # Note: `t1 in t2` is a plain substring test, not a word-boundary match.
    final_terms = []
    for t1 in cleaned:
        keep = True
        for t2 in cleaned:
            if t1 != t2 and t1 in t2 and len(t2.split()) > len(t1.split()):
                keep = False
                break
        if keep:
            final_terms.append(t1)

    # Optionally limit number of terms per quote
    if max_final is not None:
        final_terms = final_terms[:max_final]

    return final_terms


# Apply light cleaning to each quote
quotes_df["subtopics_clean"] = quotes_df["subtopic_terms_raw"].apply(
    lambda terms: clean_subtopics_light(terms, max_final=None)  # keep all for now
)

# Inspect first 10
print("Sample of CLEANED subtopics (light):\n")
for i, row in quotes_df.head(10).iterrows():
    print(f"ID {row['id']}")
    print("Quote:", row["quote"])
    print("Raw subtopics:   ", row["subtopic_terms_raw"])
    print("CLEAN subtopics: ", row["subtopics_clean"]) 

Sample of CLEANED subtopics (light):

ID 0
Quote: When you have a cause, the best way to express yourself is artistically.
Raw subtopics:    ['express artistically', 'way express artistically', 'artistically', 'cause best way', 'cause best', 'best way', 'cause', 'best', 'way express', 'way']
CLEAN subtopics:  ['way express artistically', 'cause best way']
ID 1
Quote: The coming extinction of art is prefigured in the increasing impossibility of representing historical events.
Raw subtopics:    ['coming extinction art', 'impossibility representing historical', 'representing historical events', 'art', 'art prefigured', 'historical', 'events', 'representing', 'prefigured increasing', 'coming']
CLEAN subtopics:  ['coming extinction art', 'impossibility representing historical', 'representing historical events', 'art prefigured', 'prefigured increasing']
ID 2
Quote: Art is magic delivered from the lie of being truth.
Raw subtopics:    ['art magic delivered', 'art magic', 'magic delivered lie

## Step 4 – Build a Semantic Similarity Network

In this step, we build a quote network using semantic similarity:

1. Flatten all cleaned subtopics (`subtopics_clean`) from all quotes.
2. Use a SentenceTransformer model (e.g., `all-MiniLM-L6-v2`) to embed each subtopic phrase into a vector.
3. For each pair of quotes, compute the maximum cosine similarity between any of their subtopics.
4. If this maximum similarity is at or above a threshold (0.70 in the code below), we add an edge between those two quotes.
5. The edge **weight** is the maximum similarity score.

This creates a graph where:
- nodes = quotes,
- edges = strong semantic relationships between quotes,
- weights = strength of semantic similarity.
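
The edge rule from steps 3 to 5 can be sketched on toy data without any embedding model. The example below uses invented 2-D vectors in place of real subtopic embeddings; `cosine`, `max_pair_similarity`, `quote_vecs`, and `threshold` are hypothetical names chosen for this illustration, and the threshold of 0.70 is only an example value.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_pair_similarity(vecs_a, vecs_b):
    """Largest cosine similarity over all (subtopic of A, subtopic of B) pairs."""
    return max(cosine(a, b) for a in vecs_a for b in vecs_b)


# Invented vectors: each quote ID maps to the vectors of its subtopics.
quote_vecs = {
    0: [[1.0, 0.0], [0.0, 1.0]],
    1: [[0.9, 0.1]],
    2: [[-1.0, 0.2]],
}

threshold = 0.70
edges = {}
for i, j in combinations(sorted(quote_vecs), 2):
    sim = max_pair_similarity(quote_vecs[i], quote_vecs[j])
    if sim >= threshold:
        edges[(i, j)] = round(sim, 3)  # edge weight = best subtopic match

print(edges)  # → {(0, 1): 0.994}
```

Quotes 0 and 1 are linked because one subtopic of quote 0 points in almost the same direction as the subtopic of quote 1, even though the other subtopic of quote 0 is unrelated. Taking the maximum means a single strong subtopic match is enough to create an edge.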

In [4]:
# Step 4 – Build a semantic similarity network from cleaned subtopics

import numpy as np
import networkx as nx
from collections import defaultdict
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
# Make sure we use the cleaned subtopics
# (We assume quotes_df["subtopics_clean"] already exists from Step 3)
quotes_df["subtopics_final"] = quotes_df["subtopics_clean"]

# 1. Flatten all subtopics into a list, keeping track of which quote they belong to
all_terms = []
all_quote_ids = []

for _, row in quotes_df.iterrows():
    qid = row["id"]
    terms = row["subtopics_final"]
    for t in terms:
        all_terms.append(t)
        all_quote_ids.append(qid)

print(f"Total number of subtopic phrases: {len(all_terms)}")

# If some quotes have no subtopics, they won't contribute terms but will still be nodes later.


# 2. Embed all subtopics using SentenceTransformer
sent_model = SentenceTransformer("all-MiniLM-L6-v2")
term_embeddings = sent_model.encode(all_terms, convert_to_numpy=True, show_progress_bar=True)

print("Embeddings shape:", term_embeddings.shape)


# 3. Build a mapping: quote_id -> indices of its subtopic embeddings
quote_to_indices = defaultdict(list)
for idx, qid in enumerate(all_quote_ids):
    quote_to_indices[qid].append(idx)

# 4. Create a new graph for semantic similarity
G_sem = nx.Graph()

# Add all quotes as nodes (even if some have no subtopics)
for _, row in quotes_df.iterrows():
    qid = row["id"]
    G_sem.add_node(
        qid,
        quote=row["quote"],
        subtopics=row["subtopics_final"]
    )

print(f"Number of nodes in semantic graph: {G_sem.number_of_nodes()}")

# 5. Connect quotes if their subtopics are semantically similar

sim_threshold = 0.70  # you can experiment with 0.7, 0.8, etc.

for i, j in combinations(quotes_df["id"], 2):
    idx_i = quote_to_indices.get(i, [])
    idx_j = quote_to_indices.get(j, [])

    # Skip if either quote has no subtopics
    if not idx_i or not idx_j:
        continue

    emb_i = term_embeddings[idx_i]
    emb_j = term_embeddings[idx_j]

    # Compute pairwise cosine similarity between subtopics of quote i and quote j
    sims = cosine_similarity(emb_i, emb_j)
    max_sim = sims.max()

    # Edge weight is the best match between any two subtopics
    if max_sim >= sim_threshold:
        G_sem.add_edge(i, j, weight=float(max_sim))

print(f"Number of edges in semantic graph (threshold={sim_threshold}): {G_sem.number_of_edges()}")

Total number of subtopic phrases: 2218


Embeddings shape: (2218, 384)
Number of nodes in semantic graph: 321
Number of edges in semantic graph (threshold=0.7): 691


In [6]:
# Basic inspection of the semantic graph

# 1. Isolated nodes (quotes with no edges)
isolated_sem = list(nx.isolates(G_sem))
print(f"Number of isolated quotes (degree 0): {len(isolated_sem)}")

# 2. Degree stats
degrees_sem = dict(G_sem.degree())
max_degree_sem = max(degrees_sem.values()) if degrees_sem else 0
avg_degree_sem = sum(degrees_sem.values()) / len(degrees_sem) if degrees_sem else 0

print(f"Max degree (semantic graph): {max_degree_sem}")
print(f"Average degree (semantic graph): {avg_degree_sem:.2f}")

print("\nSample semantic edges with their closest subtopics:\n")

edge_sample = list(G_sem.edges(data=True))[:10]

for u, v, data in edge_sample:
    idx_u = quote_to_indices.get(u, [])
    idx_v = quote_to_indices.get(v, [])
    if not idx_u or not idx_v:
        continue

    emb_u = term_embeddings[idx_u]
    emb_v = term_embeddings[idx_v]

    sims = cosine_similarity(emb_u, emb_v)
    i_max, j_max = np.unravel_index(np.argmax(sims), sims.shape)

    term_u = all_terms[idx_u[i_max]]
    term_v = all_terms[idx_v[j_max]]
    sim_val = sims[i_max, j_max]

    print(f"{u} -- {v} | weight={data['weight']:.3f}")
    print(f"   best match: '{term_u}'  <->  '{term_v}'  (sim={sim_val:.3f})")
    print()

Number of isolated quotes (degree 0): 66
Max degree (semantic graph): 56
Average degree (semantic graph): 4.31

Sample semantic edges with their closest subtopics:

0 -- 286 | weight=0.714
   best match: 'way express artistically'  <->  'art way'  (sim=0.714)

2 -- 44 | weight=0.800
   best match: 'art magic delivered'  <->  'creative art magic'  (sim=0.800)

2 -- 199 | weight=0.831
   best match: 'art magic delivered'  <->  'art got magic'  (sim=0.831)

2 -- 201 | weight=0.863
   best match: 'art magic delivered'  <->  'art magic'  (sim=0.863)

4 -- 138 | weight=0.714
   best match: 'doomed art technique'  <->  'art art cure'  (sim=0.714)

5 -- 7 | weight=0.707
   best match: 'make art'  <->  'art important life'  (sim=0.707)

5 -- 12 | weight=0.743
   best match: 'make art'  <->  'art constitutes'  (sim=0.743)

5 -- 22 | weight=0.732
   best match: 'make art'  <->  'art philosophy'  (sim=0.732)

5 -- 23 | weight=0.716
   best match: 'make art'  <->  'art does involve'  (sim=0.716)

5

## Step 5 – Analyze Graph Connectivity

We compute all **connected components** of the semantic graph.  
Each component is a group of quotes connected (directly or indirectly) through semantic similarity edges.

What this cell does:
- Extracts all connected components  
- Sorts them by size  
- Prints the number of components  
- Shows the sizes of the largest few  
- Reports the size of the **largest connected component**, which represents the main semantic cluster of the quotes.
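
For readers who want to see what "connected (directly or indirectly)" means operationally, here is a minimal breadth-first-search sketch on a toy graph. It is an illustration of the concept only; the notebook itself relies on NetworkX, and the function `connected_components` defined here, along with the toy `nodes` and `edges`, are hypothetical stand-ins.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return components (as sets of nodes), largest first, using BFS."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Explore everything reachable from `start`.
        seen.add(start)
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    comp.add(neighbour)
                    queue.append(neighbour)
        components.append(comp)
    return sorted(components, key=len, reverse=True)


# Toy graph: node 2 reaches node 0 only indirectly (via 1); node 5 is isolated.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (3, 4)]

comps = connected_components(nodes, edges)
print([sorted(c) for c in comps])            # → [[0, 1, 2], [3, 4], [5]]
print(f"{len(comps[0]) / len(nodes):.1%}")   # → 50.0%
```

An isolated node forms a component of size 1, which is why the number of components in a sparse similarity graph can be large even when most nodes sit in one big component.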

In [7]:
# Get all connected components (each is a set of node IDs)
components = list(nx.connected_components(G_sem))
components_sorted = sorted(components, key=len, reverse=True)

print(f"Number of connected components: {len(components_sorted)}")

# Size of the largest few components
for i, comp in enumerate(components_sorted[:5]):
    print(f"Component {i}: size={len(comp)}")

# Largest component
largest_comp = components_sorted[0]
print(f"\nLargest component size: {len(largest_comp)} "
      f"({len(largest_comp)/G_sem.number_of_nodes():.1%} of all quotes)")

Number of connected components: 74
Component 0: size=240
Component 1: size=3
Component 2: size=2
Component 3: size=2
Component 4: size=2

Largest component size: 240 (74.8% of all quotes)


In [8]:
# Use the largest component as the main semantic core
largest_comp = components_sorted[0]

G_core = G_sem.subgraph(largest_comp).copy()
print("Nodes in core graph:", G_core.number_of_nodes())
print("Edges in core graph:", G_core.number_of_edges())

Nodes in core graph: 240
Edges in core graph: 683


In [9]:
from networkx.algorithms.community import greedy_modularity_communities

core_communities = list(greedy_modularity_communities(G_core, weight="weight"))
core_communities_sorted = sorted(core_communities, key=len, reverse=True)

print(f"Number of communities in core graph: {len(core_communities_sorted)}")
for i, comm in enumerate(core_communities_sorted[:5]):
    print(f"Community {i}: size={len(comm)}")

Number of communities in core graph: 12
Community 0: size=45
Community 1: size=33
Community 2: size=31
Community 3: size=24
Community 4: size=22


In [10]:
# Map node -> community index (only for core graph)
node_to_core_comm = {}
for idx, comm in enumerate(core_communities_sorted):
    for node in comm:
        node_to_core_comm[node] = idx

quotes_df["core_community"] = quotes_df["id"].map(node_to_core_comm).fillna(-1).astype(int)
quotes_df.head(10)

|   | id | quote | subtopics_raw | subtopic_terms_raw | subtopics_clean | subtopics_final | core_community |
|---|----|-------|---------------|--------------------|-----------------|-----------------|----------------|
| 0 | 0 | When you have a cause, the best way to express... | [(express artistically, 0.6677), (way express ... | [express artistically, way express artisticall... | [way express artistically, cause best way] | [way express artistically, cause best way] | 2 |
| 1 | 1 | The coming extinction of art is prefigured in ... | [(coming extinction art, 0.6345), (impossibili... | [coming extinction art, impossibility represen... | [coming extinction art, impossibility represen... | [coming extinction art, impossibility represen... | -1 |
| 2 | 2 | Art is magic delivered from the lie of being t... | [(art magic delivered, 0.7664), (art magic, 0.... | [art magic delivered, art magic, magic deliver... | [art magic delivered, magic delivered lie, del... | [art magic delivered, magic delivered lie, del... | 2 |
| 3 | 3 | I have this very what you call today "square" ... | [(square idea art, 0.6752), (art makes breathe... | [square idea art, art makes breathe, art, kind... | [square idea art, art makes breathe, kind happ... | [square idea art, art makes breathe, kind happ... | -1 |
| 4 | 4 | Technique without art is shallow and doomed. A... | [(art technique insulting, 0.718), (technique ... | [art technique insulting, technique art shallo... | [art technique insulting, technique art shallo... | [art technique insulting, technique art shallo... | 2 |
| 5 | 5 | I make art because it centers me in my body, a... | [(make art, 0.6742), (art centers, 0.4331), (b... | [make art, art centers, body, experience, make... | [make art, art centers, body doing hope, hope ... | [make art, art centers, body doing hope, hope ... | 0 |
| 6 | 6 | The arts are a wonderful medicine for the soul. | [(arts wonderful medicine, 0.7392), (arts wond... | [arts wonderful medicine, arts wonderful, arts... | [arts wonderful medicine, wonderful medicine s... | [arts wonderful medicine, wonderful medicine s... | -1 |
| 7 | 7 | Art would not be important if life were not im... | [(art important life, 0.803), (art important, ... | [art important life, art important, life impor... | [art important life, important life important,... | [art important life, important life important,... | 4 |
| 8 | 8 | One writes out of one thing only — one's own e... | [(artist recreate disorder, 0.6987), (art, 0.5... | [artist recreate disorder, art, life order, wr... | [artist recreate disorder, life order, writes,... | [artist recreate disorder, life order, writes,... | 8 |
| 9 | 9 | Art is made by the alone for the alone. | [(art, 0.4461)] | [art] | [] | [] | -1 |


In [12]:
def show_core_community(comm_index, n=8):
    print(f"\n=== Core Community {comm_index} ===\n")
    sub_df = quotes_df[quotes_df["core_community"] == comm_index].head(n)
    for _, row in sub_df.iterrows():
        print(f"ID {row['id']} | core_community {row['core_community']}")
        print("Subtopics:", row["subtopics_clean"])
        print("Quote:", row["quote"])
        print("-" * 80)

# Look at the three largest core communities
for ci in range(min(3, len(core_communities_sorted))):
    show_core_community(ci, n=8)


=== Core Community 0 ===

ID 5 | core_community 0
Subtopics: ['make art', 'art centers', 'body doing hope', 'hope offer experience', 'centers body doing']
Quote: I make art because it centers me in my body, and by doing so I hope to offer that experience to someone else.
--------------------------------------------------------------------------------
ID 12 | core_community 0
Subtopics: ['art constitutes', 'action paying freedom', 'free zone outside', 'giving real world', 'outside action', 'real world heavy', 'heavy price', 'minor']
Quote: Art constitutes a minor free zone outside action, paying for its freedom by giving up the real world. A heavy price!
--------------------------------------------------------------------------------
ID 13 | core_community 0
Subtopics: ['art introduced term', 'right french writings', 'titian referring', 'bellini defended bella', 'fantasia expectations', 'personal views religion', 'themes remaining', 'point', 'europe question', 'evidence distinguishes']