# Introducing the Semantic Graph

_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._

[txtai](https://github.com/neuml/txtai) executes machine-learning workflows to transform data and build AI-powered semantic search applications.

One of the main use cases of txtai is semantic search over a corpus of data. Semantic search provides an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords. Within an Embeddings instance sits a wealth of implied knowledge and relationships between rows. Many approximate nearest neighbor (ANN) indexes are even backed by graphs. What if we are able to tap into this knowledge?

Semantic graphs, also known as knowledge graphs or semantic networks, build a graph network with semantic relationships connecting the nodes. In txtai, they can take advantage of the relationships inherently learned within an embeddings index. This opens exciting possibilities for exploring relationships, such as topics and interconnections in a dataset. 

This notebook introduces the semantic graph.



# Install dependencies

Install `txtai` and all dependencies. We'll install the graph extra for graph functionality, pipeline extra for object detection and similarity extra to load models with the sentence-transformers library.

In [1]:
!pip install git+https://github.com/neuml/txtai#egg=txtai[graph,pipeline,similarity] datasets ipyplot


In [2]:
!pip install tqdm


In [3]:
pip install pysbd

Note: you may need to restart the kernel to use updated packages.


In [1]:
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('allenai/longformer-base-4096', max_seq_length=4096)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerModel: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing LongformerModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [2]:
model.save("longformer")

# Graph basics

First we'll build a basic [graph](https://en.wikipedia.org/wiki/Graph_theory) and show how it can be used to explore relationships.

The code below builds a graph of animals and relationships between them. We'll add nodes and relationships and run a couple of analysis functions.

The visualization shows the layout of the graph. Centrality and path functions were also run. Centrality shows the most central or related nodes. In this case, the `wolf` node has the highest score. The path function shows how the graph is traversed from `dog` to `horse`.
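
The same idea can be sketched in plain Python without any graph library. The edges below are assumptions chosen for illustration, and the sketch uses degree centrality and a breadth-first search rather than the txtai graph API, so treat it as a rough model of the analysis described above.

```python
from collections import deque

# Illustrative undirected edges between animals (assumed for this sketch)
edges = [("dog", "fox"), ("dog", "wolf"), ("fox", "wolf"), ("wolf", "zebra"), ("zebra", "horse")]

# Build an adjacency list from the edge list
adjacency = {}
for source, target in edges:
    adjacency.setdefault(source, set()).add(target)
    adjacency.setdefault(target, set()).add(source)

def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    total = len(adjacency) - 1
    return {node: len(neighbors) / total for node, neighbors in adjacency.items()}

def shortest_path(adjacency, source, target):
    """Breadth-first search for a path with the fewest hops."""
    queue, previous = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

centrality = degree_centrality(adjacency)
most_central = max(centrality, key=centrality.get)
path = shortest_path(adjacency, "dog", "horse")
print("Most central:", most_central)
print("Path (dog->horse):", " -> ".join(path))
```

With these assumed edges, `wolf` has the most direct connections and the path from `dog` to `horse` runs through `wolf` and `zebra`.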

# Build a Semantic Graph

While txtai graphs can be standalone, with nodes and relationships manually added, the real power comes in indexing an embeddings instance.

The following section builds an embeddings index over the `DebateSum` dataset (`Hellisotherpeople/DebateSum`), which holds 240,566 rows of debate evidence with fields such as `Full-Document`, `Extract`, `Abstract` and `Citation`. This configuration sets the familiar vector model and content settings.

Column expressions are a feature introduced in txtai 5.0. A column expression gives an expression an alias, so SQL statements can use the alias as shorthand for the full expression.

Next comes the graph. The configuration sets the maximum number of connections to add per node (100) along with a minimum similarity score (0.1). Topic modeling parameters (`terms` and `resolution`) are also added, which we'll cover later.
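
To make the two graph settings concrete, here is a minimal plain-Python sketch of how a per-node connection limit and a minimum score could filter candidate edges. The function `build_edges` and the similarity scores are hypothetical, and this is one reading of the settings rather than txtai's internal implementation.

```python
def build_edges(similarities, limit, minscore):
    """
    Keep at most `limit` of the highest scoring neighbors per node,
    dropping any candidate scoring below `minscore`.

    similarities maps a node id to a list of (other id, score) candidates.
    """
    edges = {}
    for node, candidates in similarities.items():
        ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
        kept = [(other, score) for other, score in ranked if other != node and score >= minscore][:limit]
        for other, score in kept:
            # Store undirected edges once, keyed by the sorted id pair
            edges[tuple(sorted((node, other)))] = score
    return edges

# Hypothetical similarity candidates for four nodes
similarities = {
    0: [(1, 0.92), (2, 0.40), (3, 0.05)],
    1: [(0, 0.92), (2, 0.30), (3, 0.08)],
    2: [(0, 0.40), (1, 0.30), (3, 0.65)],
    3: [(2, 0.65), (0, 0.05), (1, 0.08)],
}

edges = build_edges(similarities, limit=2, minscore=0.10)
for pair, score in sorted(edges.items()):
    print(pair, score)
```

A larger `limit` gives a denser graph, while a higher `minscore` prunes weak relationships before they become edges.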

In [1]:
import pysbd

from txtai.embeddings import Embeddings
import networkx as nx
from tqdm import tqdm

from txtai.graph import GraphFactory

from datasets import load_dataset
dataset = load_dataset("Hellisotherpeople/DebateSum", split = "train")
seg = pysbd.Segmenter(language="en", clean=False)
#from txtai.scoring import ScoringFactory
#scoring = ScoringFactory.create({"method": "bm25", "terms": True, "content" : True})

#sentence-transformers/all-mpnet-base-v2 
#legal-bert
#longformer


# Create embeddings instance with a semantic graph
embeddings = Embeddings({
  "path": "legal-bert",
  "content": True,
  "functions": [
    {"name": "graph", "function": "graph.attribute"},
  ],
  "expressions": [
      {"name": "topic", "expression": "graph(indexid, 'topic')"},
      {"name": "topicrank", "expression": "graph(indexid, 'topicrank')"}
  ],
  "graph": {
      "limit": 100,
      "minscore": 0.10,
      "topics": {
          "terms": 4,
          "resolution" : 100
      }
  }
})

# Load dataset
#dataset = load_dataset("ag_news", split="train")
#rows = dataset["text"]



In [2]:
dataset

Dataset({
    features: ['Unnamed: 0', 'Full-Document', 'Citation', 'Extract', 'Abstract', '#CharsDocument', '#CharsAbstract', '#CharsExtract', '#WordsDocument', '#WordsAbstract', '#WordsExtract', 'AbsCompressionRatio', 'ExtCompressionRatio', 'OriginalDebateFileName', 'DebateCamp', 'Tag', 'Year'],
    num_rows: 240566
})

In [None]:
##TODO: Get some kind of retrieve and rerank pipeline working
##TODO: Think of further columns to add as calculations from what are here? Maybe sentiment?

def make_record(sentence, row, count, index_content):
  # Build the data stored for one sentence; evidence_id links it back to its source row
  if not index_content:
    return {"text": sentence, "evidence_id": count}

  return {
    "text": sentence,
    "full-document": row["Full-Document"], "length_doc": row["#WordsDocument"],
    "extract": row["Extract"], "length_extract": row["#WordsExtract"],
    "Abstract": row["Abstract"], "length_abstract": row["#WordsAbstract"],
    "citation": row["Citation"], "filename": row["OriginalDebateFileName"],
    "debatecamp": row["DebateCamp"], "tag": row["Tag"], "year": row["Year"],
    "evidence_id": count
  }

def stream(dataset, field, limit, index_content=False):
  # Yield one (id, data, tags) tuple per sentence, stopping after the row that reaches the limit
  index = 0
  for count, row in tqdm(enumerate(dataset)):
    for sentence in seg.segment(row[field]):
      yield (index, make_record(sentence, row, count, index_content), None)
      index += 1

    if index >= limit:
      print("Hit Limit, now indexing")
      break

def stream_no_limit(dataset, field, index_content=False):
  # Same as stream, but runs over the entire dataset
  index = 0
  for count, row in tqdm(enumerate(dataset)):
    for sentence in seg.segment(row[field]):
      yield (index, make_record(sentence, row, count, index_content), None)
      index += 1

embeddings.index(stream_no_limit(dataset, "Full-Document", True))

#embeddings.index(stream(dataset, "Full-Document", 10000, True))
#embeddings.save("DebateSum_SemanticGraph")

240566it [3:05:22, 21.63it/s] 


In [2]:
embeddings.index((x, {"text": text["Extract"], "full-document": text["Full-Document"], "length_doc" : text["#WordsDocument"], "extract" : text["Extract"], "length_extract" : text["#WordsExtract"], "Abstract" : text["Abstract"], "length_abstract" : text["#WordsAbstract"], "citation": text["Citation"], "filename" : text["OriginalDebateFileName"], "debatecamp" : text["DebateCamp"], "tag" : text["Tag"], "year" : text["Year"]}, None) for x, text in tqdm(enumerate(dataset)))


240566it [58:19, 68.74it/s]


In [None]:
embeddings.save("DebateSum_SemanticGraph_legal-bert_sentence.tar.gz")

In [33]:
embeddings.load("DebateSum_SemanticGraph_mpnet_extract.tar.gz")

In [None]:
from tqdm import tqdm
# Index dataset

#embeddings.index((x, {"text": text["Abstract"], "extract" : text["Extract"], "document" : text["Full-Document"], "citation": text["Citation"], "filename" : text["OriginalDebateFileName"]}, None) for x, text in tqdm(enumerate(dataset)))


graph = embeddings.graph



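Graph statistics recorded for each model and indexed field. The columns follow the output of the cells after this table: node count, edge count, edges per node and number of topics.

| Configuration | Nodes | Edges | Edges per node | Topics |
|---|---|---|---|---|
| mpnet abstractive | 240566 | 1876918 | 7.802091733661449 | 1692 |
| mpnet extractive | 240566 | 2133792 | 8.869881861942252 | 1203 |
| legalbert extractive | 240566 | 2685362 | 11.16268300591106 | 2576 |
| legalbert abstractive | 240566 | 3006572 | 12.497909097711231 | 5008 |
| Longformer abstractive | 240566 | 2604166 | 10.82516232551566 | |
| Longformer extractive | 240566 | 2128302 | 8.847060681891872 | 1592 |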

In [6]:
print(graph.backend.number_of_nodes(), graph.backend.number_of_edges())
graph.backend.number_of_edges() / float(graph.backend.number_of_nodes())

240566 2128302


8.847060681891872

In [7]:
len(embeddings.graph.topics)

1592

In [84]:
#embeddings.save("test_embeddings") ##Note, need original settings to load embeddings, can't do it in separate embeddings object

In [85]:
#embeddings.load("test_embeddings")

In [8]:
list(graph.topics.keys())

['from_has_have_more',
 'has_its_have_from',
 'from_have_can_more',
 'has_from_have_which',
 'have_has_its_china',
 'from_more_have_than',
 'have_from_can_has',
 'states_its_has_have',
 'which_its_from_has',
 'court_has_have_from',
 'has_have_more_its',
 'its_has_china_states',
 'which_its_can_from',
 'have_has_would_from',
 'which_its_from_we',
 'we_have_can_from',
 'have_from_states_has',
 'his_trump_has_he',
 'social_which_political_from',
 'from_political_which_its',
 'oil_from_has_have',
 'from_have_which_can',
 'can_have_more_has',
 'from_which_can_we',
 'have_from_we_has',
 'have_from_has_who',
 'have_more_we_can',
 'from_can_have_water',
 'have_from_more_who',
 'one_from_can_which',
 'more_have_from_we',
 'has_states_its_united',
 'has_economic_have_from',
 'have_from_has_government',
 'which_from_its_what',
 'have_we_people_from',
 'has_from_have_been',
 'have_house_would_has',
 'from_which_social_has',
 'states_united_have_has',
 'has_have_its_political',
 'from_more_which_ha

In [10]:
import pprint
pprint.pprint(embeddings.search(f"select id, text, evidence_id, score, topic from txtai where similar('China is going to hurt the USA') and topic like '%usa%' "))

[]


In [11]:
import pprint
pprint.pprint(embeddings.search(f"select id, text, evidence_id, score, topic from txtai where similar('The role of the state in hurting the people')"))

[{'evidence_id': None,
  'id': '6757',
  'score': 0.9833215475082397,
  'text': 'the ruling family has pulled the country to the condition where '
          'cries of oppressed people are ignored.',
  'topic': None},
 {'evidence_id': None,
  'id': '22889',
  'score': 0.9824581742286682,
  'text': 'Nothing in the domain of the state can be against the state',
  'topic': None},
 {'evidence_id': None,
  'id': '203698',
  'score': 0.9821363091468811,
  'text': 'we need to see the big picture of the role that immigrants play in '
          'the new global capitalism system',
  'topic': 'world_has_new_war'}]


In [12]:
pprint.pprint(embeddings.search(f"select * from txtai where similar('China is going to hurt the USA') and citation like '%Hague%'"))

[]


In [13]:
pprint.pprint(embeddings.search(f"select id, text, evidence_id, score, topic, tag from txtai where similar('China is going to hurt the USA') and tag like '%Affirmative%'" ))

[{'evidence_id': None,
  'id': '106321',
  'score': 0.9788461327552795,
  'tag': 'Affirmatives',
  'text': 'Kim’s main argument is that China would not accept the grand '
          'bargain',
  'topic': None},
 {'evidence_id': None,
  'id': '27706',
  'score': 0.9782600402832031,
  'tag': 'Affirmatives',
  'text': '  relying on the export growth, the economy recovery would go very '
          'slowly.',
  'topic': None},
 {'evidence_id': None,
  'id': '146847',
  'score': 0.978130042552948,
  'tag': 'Affirmatives',
  'text': 'President Mike Pence? The question is way ahead of where we are',
  'topic': 'we_you_have_about'}]


In [22]:
list(nx.all_simple_paths(graph.backend, 10, 20, cutoff = 4))

KeyboardInterrupt: 

In [23]:
list(nx.all_shortest_paths(graph.backend, 10, 60))

KeyboardInterrupt: 

In [7]:
def david_distance(source, target, attrs):
    distance = max(1.0 - attrs["weight"], 0.0)
    return distance if distance >= 0.15 else 1.00

def david_showpath(source, target, the_graph):
    return nx.shortest_path(the_graph, source, target, david_distance)
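
The effect of the distance rule above can be seen on a toy graph. This is a minimal plain-Python sketch, assuming hypothetical similarity scores and using a hand-written Dijkstra search in place of `nx.shortest_path`. Edges whose distance falls below 0.15 (near-duplicates) are charged a full cost of 1.0, which steers the path away from them.

```python
import heapq

def distance(weight):
    # Same rule as david_distance: 1 - similarity, with near-duplicate edges charged 1.0
    value = max(1.0 - weight, 0.0)
    return value if value >= 0.15 else 1.0

def shortest_path(edges, source, target):
    """Dijkstra search over an undirected graph given as {(a, b): similarity}."""
    adjacency = {}
    for (a, b), weight in edges.items():
        adjacency.setdefault(a, []).append((b, distance(weight)))
        adjacency.setdefault(b, []).append((a, distance(weight)))

    queue, best, previous = [(0.0, source)], {source: 0.0}, {}
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue
        for neighbor, step in adjacency.get(node, []):
            candidate = cost + step
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor], previous[neighbor] = candidate, node
                heapq.heappush(queue, (candidate, neighbor))

    if target not in best:
        return None

    # Walk back from the target to rebuild the path
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]

# Hypothetical similarity scores: A-B is a near-duplicate pair (0.95)
edges = {("A", "B"): 0.95, ("B", "D"): 0.80, ("A", "C"): 0.70, ("C", "D"): 0.80}
path = shortest_path(edges, "A", "D")
print(path)
```

Without the penalty, the route through `B` would be the cheapest. With it, the search prefers the route through `C`, which connects related but distinct nodes.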

The embeddings index is now created. Let's explore!

# Graph analysis

Indexing an embeddings instance into a graph adds the ability to do network analysis. For example, the centrality of the graph can be analyzed to find the most common nodes. Alternatively, pagerank could also be run to rank the importance of nodes within the dataset. 

The section below runs graph centrality and shows the associated topic for the most central nodes. Not surprisingly, many of the most central nodes belong to the top topics.
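
For readers unfamiliar with pagerank, here is a minimal plain-Python sketch of the power-iteration form of the algorithm on a small, hypothetical undirected graph. The node names, damping factor and iteration count are assumptions for illustration, and this does not call the txtai or igraph implementations.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power iteration: each node repeatedly shares its rank with its neighbors."""
    nodes = list(adjacency)
    rank = {node: 1.0 / len(nodes) for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, neighbors in adjacency.items():
            # Nodes without neighbors spread their rank evenly over all nodes
            targets = neighbors if neighbors else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated

    return rank

# Hypothetical graph: "hub" connects to every other node, and "a" and "c" are also linked
adjacency = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "c"],
    "b": ["hub"],
    "c": ["hub", "a"],
}

scores = pagerank(adjacency)
top = max(scores, key=scores.get)
print(top, round(scores[top], 4))
```

On an undirected graph like this one, the ranking tends to follow the number of connections, so it usually agrees with degree centrality on the top nodes.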

# Walk the graph

Given that graphs are nodes and relationships, we can traverse the nodes using those relationships. The graph can be used to show how any two nodes are connected. 

In [18]:
from IPython.display import HTML
import string

def highlight(index, result):
  output = f"{index}. "
  spans = [(token, score, "#fff59d" if score > 0.01 else None) for token, score in result["tokens"]]

  for token, _, color in spans:
    output += f"<span style='background-color: {color}'>{token}</span> " if color else f"{token} "

  return output

def showpath(source, target, strip_punctuation = True, the_graph=graph.backend):
  path = david_showpath(source, target, the_graph)
  print(path)
  path = [graph.attribute(p, "text") for p in path]
  list_of_evidence_ids = []
  for text in path:
    if strip_punctuation:
        text = text.translate(str.maketrans("","", string.punctuation))
    list_of_evidence_ids.append(embeddings.search(f"select evidence_id from txtai where similar('{text}') limit 1")[0]['evidence_id'])
  print(list_of_evidence_ids)

  sections = []
  for x, p in enumerate(path):
      if x == 0:
          # Print start node
          
          sections.append(f"{x + 1}. {p}")
          sections.append(dataset["Abstract"][list_of_evidence_ids[x]])
          sections.append(dataset["Citation"][list_of_evidence_ids[x]])
          sections.append(dataset["Full-Document"][list_of_evidence_ids[x]])

      if x < len(path) - 1:
          # Explain and highlight next path element
          results = embeddings.explain(p, [path[x + 1]], limit=1)[0]
          sections.append(highlight(x + 2, results))
          sections.append(dataset["Abstract"][list_of_evidence_ids[x+1]])
          sections.append(dataset["Citation"][list_of_evidence_ids[x+1]])
          sections.append(dataset["Full-Document"][list_of_evidence_ids[x+1]])

  return HTML("<br/><br/>".join(sections))

def showpath_any(list_of_arguments, strip_punctuation = True, the_graph=graph.backend):
  list_of_paths = []
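  for x, y in zip(list_of_arguments, list_of_arguments[1:]):
    a_path = david_showpath(x, y, the_graph)
    # Skip the first node of later segments, it repeats the last node of the previous segment
    list_of_paths.extend(a_path[1:] if list_of_paths else a_path)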
  #print(list_of_paths)
  path = [graph.attribute(p, "text") for p in list_of_paths]
  list_of_evidence_ids = []
  for text in path:
    if strip_punctuation:
        text = text.translate(str.maketrans("","", string.punctuation))
    list_of_evidence_ids.append(int(embeddings.search(f"select id from txtai where similar('{text}') limit 1")[0]['id']))
  print(list_of_evidence_ids)

  sections = []
  for x, p in enumerate(path):
      if x == 0:
          # Print start node
          
          sections.append(f"{x + 1}. {p}")
          #sections.append(dataset["Abstract"][list_of_evidence_ids[x]])
          #sections.append(dataset["Citation"][list_of_evidence_ids[x+1]])
          #sections.append(dataset["Full-Document"][list_of_evidence_ids[x]])

      if x < len(path) - 1:
          # Explain and highlight next path element
          results = embeddings.explain(p, [path[x + 1]], limit=1)[0]
          sections.append(highlight(x + 2, results))
          #sections.append(dataset["Abstract"][list_of_evidence_ids[x+1]])
          #sections.append(dataset["Citation"][list_of_evidence_ids[x+1]])
          #sections.append(dataset["Full-Document"][list_of_evidence_ids[x+1]])

  return HTML("<br/><br/>".join(sections))

def showpath_list(list_of_path, strip_punctuation = True):
  path = list_of_path
  path = [graph.attribute(p, "text") for p in path]
  list_of_evidence_ids = []
  for text in path:
        if strip_punctuation:
            text = text.translate(str.maketrans("","", string.punctuation))
        list_of_evidence_ids.append(embeddings.search(f"select evidence_id from txtai where similar('{text}') limit 1")[0]['evidence_id'])
  print(list_of_evidence_ids)
  sections = []
  for x, p in enumerate(path):
      if x == 0:
          # Print start node
          
          sections.append(f"{x + 1}. {p}")
          sections.append(dataset["Abstract"][list_of_evidence_ids[x]])
          sections.append(dataset["Citation"][list_of_evidence_ids[x]])
          sections.append(dataset["Full-Document"][list_of_evidence_ids[x]])


      if x < len(path) - 1:
          # Explain and highlight next path element
          results = embeddings.explain(p, [path[x + 1]], limit=1)[0]
          sections.append(highlight(x + 2, results))
          sections.append(dataset["Abstract"][list_of_evidence_ids[x+1]])
          sections.append(dataset["Citation"][list_of_evidence_ids[x+1]])
          sections.append(dataset["Full-Document"][list_of_evidence_ids[x+1]])


  return HTML("<br/><br/>".join(sections))

In [67]:
selected_nodes = [n for n,v in graph.backend.nodes(data=True) if "z" in v['text']] ##also works for topic
selected_nodes
H = graph.backend.subgraph(selected_nodes)
list(H.nodes(data=True))

[(0,
  {'text': 'A) The Affirmatives construction of a “Nation-Space” manifests in distinction to a colonized Other- locking the colonial Other in a violent cycle. ',
   'topic': 'alternative_other_from_only',
   'topicrank': 11}),
 (1,
  {'text': 'Use of the term modernization is a blanket to cover colonization '}),
 (2,
  {'text': 'The projection of power portrayed of the US gave them an entitled sense of superiority. The American ideas of “modernization” are all really outgrowths of what we believed to be progress.'}),
 (131072,
  {'text': 'Their performative locus is a re-elaboration and consolidation of anti-black practices that is unable to challenge militarization and achieve emancipation/agency for black bodies.',
   'topic': 'anti_violence_through_coalitions',
   'topicrank': 14}),
 (131077,
  {'text': 'Suppression of Indigenous cultures causes dehumanization of natives – prioritize ongoing cultural violence '}),
 (131079,
  {'text': 'Providing an equal education to native Ame

In [70]:
selected_nodes = [n for n,v in graph.backend.nodes(data=True) if n%2 == 0] ##also works for topic
selected_nodes
J = graph.backend.subgraph(selected_nodes)
list(J.nodes(data=True))

[(0,
  {'text': 'A) The Affirmatives construction of a “Nation-Space” manifests in distinction to a colonized Other- locking the colonial Other in a violent cycle. ',
   'topic': 'alternative_other_from_only',
   'topicrank': 11}),
 (2,
  {'text': 'The projection of power portrayed of the US gave them an entitled sense of superiority. The American ideas of “modernization” are all really outgrowths of what we believed to be progress.'}),
 (4,
  {'text': 'The assumed cultural norms of economic engagement entrenches a universal market aesthetic reenacting a pattern of domination inherent in colonial thought. '}),
 (6,
  {'text': 'Economic Engagement “towards” other countries is an act violent social mimesis, re-speaking the colonial position. ',
   'topic': 'political_alternative_social_struggle',
   'topicrank': 8}),
 (8,
  {'text': 'Epistemological decolonization is key to liberating those imprisoned by coloniality. '}),
 (10,
  {'text': 'Colonialism is the foothold for biopower and rac

In [76]:
starting_argument = int(question("No Latin American prolif – regional safeguards and lack of tech", "", "", 100)[0]["id"])
ending_argument = int(question("Iran will use Latin America as a jumping-off point for attacks against the U.S.", "", "", 100)[0]["id"])
showpath_any([starting_argument, ending_argument], the_graph=J)




[21417, 14876, 18168, 194590, 9888, 8650, 70257, 13141]


In [51]:
for n, v in H.nodes(data=True):
    print(n, v)

8710 {'text': 'A serially underachieving economy, repeatedly trumped by dynamic Brazil? ', 'topic': 'repeatedly_underachieving_trumped_brazil', 'topicrank': 0}
3433 {'text': 'As, the United States moves to reshape the geopolitical terrain of the world, \'Nith hundreds of military bases in 130 countries added to hundreds of installations stretched across its own territorial confines, the vast majority of Americans refuse to admit their nation possesses anything resembling an Empire. Yet U.S. global expansion is far more ambitious than anything pursued or even imagined by previous imperial powers. It might be argued that the "new militarism" is rooted in a "new imperialism" that aspires to nothing short of world domination, a project earlier outlined by its exuberant proponents and given new life by the Bush II presidency, which has set out to remove all vestiges of ideological and material impediments to worldwide corporate power-by every means at its disposal. It is hard to resist the 

{(0, 493): 0.9345860481262207,
 (0, 4270): 0.9336660504341125,
 (0, 338): 0.9322676062583923,
 (0, 5280): 0.9321309328079224,
 (0, 366): 0.9293317198753357,
 (0, 4860): 0.9284900426864624,
 (0, 3484): 0.9280604124069214,
 (0, 89): 0.9261342287063599,
 (0, 4367): 0.9253823161125183,
 (0, 503): 0.925256073474884,
 (0, 4315): 0.9241577386856079,
 (0, 355): 0.9238387942314148,
 (0, 4395): 0.9236840605735779,
 (0, 349): 0.9233704805374146,
 (0, 6762): 0.923163890838623,
 (0, 4816): 0.9229990243911743,
 (0, 3023): 0.9227140545845032,
 (0, 4256): 0.9225206971168518,
 (0, 502): 0.9224455952644348,
 (0, 98): 0.9223863482475281,
 (0, 106): 0.922357976436615,
 (0, 5907): 0.9223538041114807,
 (0, 4705): 0.9223498106002808,
 (0, 813): 0.9222754836082458,
 (0, 4815): 0.9222575426101685,
 (0, 5456): 0.9221648573875427,
 (0, 4042): 0.9221648573875427,
 (0, 4803): 0.9221494793891907,
 (0, 54): 0.9221062064170837,
 (0, 4591): 0.9220341444015503,
 (0, 4193): 0.9218004941940308,
 (0, 64): 0.92161643505096

In [47]:
starting_argument = int(question("Evidence that global warming is real which is long and detailed", "", "", 100)[0]["id"])
middle_argument = int(question("Airplanes cause pollution","", "", 100)[0]["id"])
#print(starting_argument, middle_argument, ending_argument)
showpath_any([starting_argument, middle_argument])

[116790, 55806, 32391, 44755, 32295, 221833, 132284, 59445]


In [20]:
embeddings.search(f"select evidence_id from txtai where similar('hello') limit 1")[0]['evidence_id']

50743

In [39]:
showpath_list([10, 33, 60], strip_punctuation = True)

[1, 2, 4]


In [10]:
def question(text, rerank_word = "", rerank_topic = "", limit = 100):
  return embeddings.search(f"select id, text, topic, evidence_id, score from txtai where similar('{text}') and text like '%{rerank_word}%' and topic like '%{rerank_topic}%' limit {limit}")


In [43]:
starting_argument = int(question("Capitalism protects the environment", "", "", 100)[0]["id"])

ending_argument = int(question("Communism will damage the environment", "", "", 100)[0]["id"])

In [82]:
starting_argument

540963

In [14]:
question("Capitalism protects the environment", "", "", 100)

[{'id': '2736722',
  'text': 'IS CAPITALISM GOOD FOR THE ENVIRONMENT? ',
  'topic': 'capitalism_ecological_environmental_capitalist',
  'evidence_id': 148178,
  'score': 0.8689208626747131},
 {'id': '2071893',
  'text': "Doesn't capitalism destroy the environment? ",
  'topic': 'capitalism_ecological_environmental_capitalist',
  'evidence_id': 118460,
  'score': 0.85222327709198},
 {'id': '3175460',
  'text': 'If you want to protect the environment, you should love capitalism. ',
  'topic': 'capitalism_ecological_environmental_capitalist',
  'evidence_id': 166810,
  'score': 0.8346022367477417},
 {'id': '4205710',
  'text': 'Truth be told, capitalism has helped cleanse our planet — improving living standards while protecting the environment. ',
  'topic': 'capitalism_ecological_environmental_capitalist',
  'evidence_id': 213100,
  'score': 0.8308689594268799},
 {'id': '2071918',
  'text': 'WHY CAPITALISM PROTECTS THE ENVIRONMENT What has made this vast improvement in environmental qual

In [46]:
showpath(starting_argument, ending_argument)

[406, 194, 465, 65, 184]


In [None]:
graph.centrality().items() ##

In [None]:
dir(graph)

In [None]:
graph.louvain()

In [None]:
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig

This shows how text about a `famous squirrel` and the `Red Sox winning the world series` are connected. Notice how the first match pivots to a node about a squirrel running on the field during a baseball game. From there, it's a relatively logical path to the end node. 

This is reminiscent of the game "six degrees of Kevin Bacon". Try running `showpath` with calls to `random.randint(0, graph.backend.number_of_nodes() - 1)`; it's oddly addicting. This is a fun way to explore the interconnectivity of a dataset.
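
The random exploration idea can be tried on a toy graph first. The sketch below is plain Python with a hypothetical 20-node ring graph plus a few random shortcuts. It picks two random node ids and finds the fewest-hop path between them, standing in for a `showpath` call on the real graph.

```python
import random
from collections import deque

def bfs_path(adjacency, source, target):
    """Breadth-first search for a path with the fewest hops."""
    queue, previous = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical toy graph: 20 nodes in a ring plus a few random shortcuts
rng = random.Random(42)
size = 20
adjacency = {n: {(n - 1) % size, (n + 1) % size} for n in range(size)}
for _ in range(5):
    a, b = rng.randint(0, size - 1), rng.randint(0, size - 1)
    if a != b:
        adjacency[a].add(b)
        adjacency[b].add(a)

# Pick two random nodes and show how they are connected
source, target = rng.randint(0, size - 1), rng.randint(0, size - 1)
path = bfs_path(adjacency, source, target)
print(source, target, path)
```

Changing the seed gives a different pair of nodes each time, which is the "six degrees" game in miniature.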

In [None]:
igraph_graph = ig.Graph.from_networkx(graph.backend)


In [None]:
def score_with_keyword_number(query):
    # Adds a keyword overlap score to each search result for the query
    list_of_results = question(query)
    for row in list_of_results:
        returned_text = row["text"]
        # Note: this is the length in characters, so the score is matching words per character
        length_of_text = len(returned_text)
        matching_words = [i for i in query.split() if i in returned_text.split()]
        row["matching_words_percentage"] = len(matching_words) / length_of_text if length_of_text else 0.0
    return list_of_results

        
sorted_question = sorted(score_with_keyword_number("Education Policy is good"), key=lambda d: d['matching_words_percentage'], reverse = True)
sorted_question

In [None]:
layout = igraph_graph.layout_umap(epochs = 50)

# Pass the computed layout to the plot, otherwise it is never used
ig.plot(igraph_graph, layout=layout, target='myfile.pdf', vertex_size=5, edge_width = igraph_graph.es["weight"])

In [None]:
dir(layout)

In [None]:
vids = igraph_graph.random_walk(0, 7)

In [None]:
vids

In [None]:
igraph_graph.pagerank()