`CypherWeb`

A Cypher layer to query the web

Important Note

This repository is currently a work in progress. You can already use it for basic queries, but I'm still drafting a development roadmap that I can share with interested members of the community.

The reason for this early release is to share the main idea behind the project with the community. Upcoming development will improve the tool and make it usable in production environments.

A detailed article on how cypherweb works will be published soon.

The key problem

The key problem that cypherweb tries to solve is simple: we take HTML data for granted. In other words, we treat it as unstructured text and make no use of its semi-structured format.
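To make the "semi-structured" point concrete, here is a minimal sketch that turns an HTML snippet into nodes and parent-to-child edges using only Python's standard library. It is an illustration of the idea, not cypherweb's own `HtmlToGraph` implementation; the names `HtmlGraphBuilder` and `html_to_graph` and the node id scheme are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlGraphBuilder(HTMLParser):
    """Hypothetical helper: collects elements as nodes and nesting as edges."""

    def __init__(self):
        super().__init__()
        self.nodes = {}   # node id -> {"tag": str, "text": str}
        self.edges = []   # (parent id, child id) pairs
        self._stack = []  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = f"{tag}_{len(self.nodes)}"
        self.nodes[node_id] = {"tag": tag, "text": ""}
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, and anything inside it.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self._stack and data.strip():
            self.nodes[self._stack[-1]]["text"] += data.strip()


def html_to_graph(html):
    builder = HtmlGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


nodes, edges = html_to_graph(
    "<div><h2>Our team</h2><ul><li>Ada</li><li>Grace</li></ul></div>"
)
for parent, child in edges:
    print(parent, "->", child)
```

Once the page is a graph, questions such as "which container holds a heading that mentions the team?" become graph queries instead of string searches.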

Install

Run the following command to install the latest version of cypherweb:

pip install cypherweb -U

Example of usage

If you are familiar with the Cypher language, using cypherweb will be easy.

Let’s say you want to extract team members' data from a set of websites. Instead of building custom templates for each website, you can let cypherweb do the work. Team members are usually presented as grids, which could be a set of li elements or custom HTML elements that share the same style or inner structure.
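The "same inner structure" idea can be sketched as a small heuristic: fingerprint each child by its tag and its children's fingerprints, and flag a parent when enough children share one fingerprint. This is only an illustration under stated assumptions. It is not the logic of cypherweb's `GridClassifier`, it ignores shared styles entirely, and the `(tag, children)` tuple format, the function names, and the `min_items` threshold are all hypothetical.

```python
from collections import Counter


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of its children."""
    tag, children = node
    return (tag, tuple(signature(child) for child in children))


def find_grid_candidates(node, min_items=3, path=None):
    """Return paths of elements with at least min_items same-shaped children."""
    tag, children = node
    path = path or tag
    found = []
    if children:
        counts = Counter(signature(child) for child in children)
        _, repeats = counts.most_common(1)[0]
        if repeats >= min_items:
            found.append(path)
    for index, child in enumerate(children):
        child_path = f"{path}/{child[0]}[{index}]"
        found += find_grid_candidates(child, min_items, child_path)
    return found


# A toy page: a heading followed by a list of three identically shaped cards.
member = ("li", [("img", []), ("h3", []), ("p", [])])
page = ("div", [("h2", []), ("ul", [member, member, member])])

print(find_grid_candidates(page))
```

Because the check looks only at shape, the same function would flag a grid built from custom elements as readily as one built from li elements.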

Let's test the extraction using an example.

# test the new package structure
from cypherweb.core import GraphProcessPipeline, NodeSearchPipeline, Graph

from cypherweb.nodes.html_to_graph import HtmlToGraph
from cypherweb.nodes.connectors import BasicCrawler
from cypherweb.nodes.classifiers import GridClassifier
from cypherweb.nodes.classifiers import LinkListClassifier
from cypherweb.nodes.classifiers import TitleClassifier
from cypherweb.nodes.classifiers import LinkClassifier
from cypherweb.nodes.str_matchers import BaseStrRetriever
from cypherweb.nodes.search import GraphNearestNeighbor
from cypherweb.nodes.cypher import CypherApi
from rich import print as rprint

# step 1: create graph
graph_pipe = GraphProcessPipeline()

graph_pipe.add_node(
    BasicCrawler(),
    "crawler",
)
graph_pipe.add_node(
    HtmlToGraph(),
    "html_to_graph",
)
graph_pipe.add_node(
    GridClassifier(),
    "grid_classifier",
)
graph_pipe.add_node(
    LinkListClassifier(),
    "link_list_classifier",
)
graph_pipe.add_node(
    TitleClassifier(),
    "title_classifier",
)
graph_pipe.add_node(
    LinkClassifier(),
    "link_classifier",
)

# step 2: use graph for retrieval
search_pipe = NodeSearchPipeline()
search_pipe.add_node(
    CypherApi(),
    "cypher_api",
)
search_pipe.add_node(graph_pipe, "graph_process_pipe")
search_pipe.add_node(
    BaseStrRetriever(),
    "str_matcher",
)
search_pipe.add_node(
    GraphNearestNeighbor(),
    "graph_nn",
)


# use your pipeline
search_pipe.run(
"""
USE "https://weaviate.io/company/about-us"
MATCH (a:Grid)-[e*1..2]->(t:Title)
WHERE t.text contains "team"
RETURN a, t
"""
)
rprint(search_pipe.passing_input)

# preview a selected HTML element as a string
from cypherweb.utils.visualizations import render_element_from_root

node_id = "div_66321657"  # example id; use one from your own results
graph = search_pipe.graph._graph
str_agg = ""
render_element_from_root(graph, node_id, str_agg=str_agg, depth=0)

Supported queries

For now, cypherweb supports two types of queries:

Search for a typed/non-typed element using a filter on its text content.

USE "https://weaviate.io/company/about-us"
MATCH (a:Grid)
WHERE a.text contains "feature a"
RETURN a

Search for a typed/non-typed element using its links to other elements.

USE "https://weaviate.io/company/about-us"
MATCH (a:Grid)-[e*1..2]->(t:Title)
WHERE t.text contains "team"
RETURN a, t
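To clarify what the second query asks for, the sketch below spells out the pattern `(a:Grid)-[e*1..2]->(t:Title)` on a toy graph in plain Python: start from every Grid node, follow one or two edges, and keep Title nodes whose text contains the search string. The graph data and the `walk` and `match` helpers are hypothetical, and the case-sensitive substring test mirrors standard Cypher `CONTAINS` rather than anything confirmed about cypherweb's string matcher.

```python
# Hypothetical toy graph: node id -> label and text, plus parent -> children.
NODES = {
    "div_1": {"label": "Grid", "text": ""},
    "div_2": {"label": None, "text": ""},
    "h2_3": {"label": "Title", "text": "Meet the team"},
    "ul_4": {"label": "Grid", "text": ""},
    "h3_5": {"label": "Title", "text": "Pricing"},
}
EDGES = {"div_1": ["div_2"], "div_2": ["h2_3"], "ul_4": ["h3_5"]}


def walk(node, max_hops, hops=0):
    """Yield (descendant, hops) for every path of 1..max_hops edges from node."""
    if hops == max_hops:
        return
    for child in EDGES.get(node, []):
        yield child, hops + 1
        yield from walk(child, max_hops, hops + 1)


def match(source_label, target_label, needle, min_hops=1, max_hops=2):
    """Return (source, target) pairs satisfying the pattern and the text filter."""
    results = []
    for source, props in NODES.items():
        if props["label"] != source_label:
            continue
        for target, hops in walk(source, max_hops):
            candidate = NODES[target]
            if (
                hops >= min_hops
                and candidate["label"] == target_label
                and needle in candidate["text"]
            ):
                results.append((source, target))
    return results


print(match("Grid", "Title", "team"))
```

Here `div_1` matches because its Title sits two hops away, while `ul_4` is dropped because its Title does not mention the team.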

The currently supported elements are:

Grids: `Grid`
Titles (headings for now): `Title`
Links: `Link`
Lists of links: `ListLink`

Work in progress

Add more consumer nodes (HTML consumer, structured consumer for Python dict, JSON, ...)
Add and improve classifier nodes (paragraph detection, titles)
Improve the Cypher node (replace the hardcoded transformation with native graph traversal supported by networkx, such as graph motifs)
