# Big Data Modelling and Management - Lab 3

## Setup

Prepare for the class:

1. Run Jupyter.
2. Run Docker.
3. Run the `docker run` command below in a terminal (Command Prompt on Windows), replacing the paths with the paths to your own folders.
4. Open http://localhost:7474/browser/ in your browser.
5. Log in with username `neo4j` and password `test`.
6. Start exploring the database.

**1. Make sure that you have a Neo4j Docker container running:**
    
docker run --name Neo4JLab -p 7474:7474 -p 7687:7687 -d -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JPlugins":/plugins -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JData/data":/data --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised__address="localhost:7473" --env NEO4J_dbms_connector_http_advertised__address="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised__address="localhost:7687" --env NEO4J_dbms_security_procedures_unrestricted=gds.* --env NEO4J_dbms_security_procedures_allowlist=gds.* neo4j:latest

The variant below uses a different container name and pins the image to `neo4j:4.4.5`. Both commands publish ports 7474 and 7687, so only one of the two containers can run at a time.

docker run --name Neo4JLab2 -p 7474:7474 -p 7687:7687 -d -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JPlugins":/plugins -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JData/data":/data --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised__address="localhost:7473" --env NEO4J_dbms_connector_http_advertised__address="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised__address="localhost:7687" --env NEO4J_dbms_security_procedures_unrestricted="gds.*" --env NEO4J_dbms_security_procedures_allowlist="gds.*" neo4j:4.4.5

**2. We will be using the same database we used in Lab 1.**

## Graph Data Science (GDS) Algorithms

The Neo4j Graph Data Science (GDS) library contains many graph algorithms. The algorithms are divided into categories which represent different problem classes. Please refer to the [documentation](https://neo4j.com/docs/graph-data-science/current/algorithms/) for further details.

In [2]:
from neo4j import GraphDatabase
from pprint import pprint

In [3]:
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="test"

In [4]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

### All the functions you'll need to run queries in Neo4j

In [5]:
def execute_write(driver, query):
    with driver.session(database="neo4j") as session:
        # Write transactions allow the driver to handle retries and transient errors
        result = session.execute_write(lambda tx, query: list(tx.run(query)), query)
    return result

In [6]:
def execute_read(driver, query):    
    with driver.session(database="neo4j") as session:
        result = session.execute_read(lambda tx, query: list(tx.run(query)), query)
    return result

## Centrality Algorithms

Centrality algorithms are used to determine the importance of distinct nodes in a network. As an example, you can use the [PageRank algorithm](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/).

### PageRank

The PageRank algorithm measures the importance of each node within the graph, based on the number of incoming relationships and the importance of the corresponding source nodes. The underlying assumption, roughly speaking, is that a page is only as important as the pages that link to it.


![pagerank.png](img/pagerank.png)
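
To make the idea concrete, here is a minimal pure-Python sketch of the textbook PageRank power iteration on a tiny weighted graph. The function name `pagerank`, its parameters, and the toy edge list are hypothetical and only for illustration. The GDS implementation scales scores differently, so the numbers are not comparable with the `gds.pageRank` output in this lab.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook weighted PageRank by power iteration (illustrative sketch).

    edges: iterable of (source, target, weight) tuples with non-negative weights.
    Returns a dict mapping each node to a score; the scores sum to 1.
    """
    edges = list(edges)
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    if not nodes:
        return {}
    n = len(nodes)

    # Total outgoing weight per node, used to split a node's score among its targets.
    out_weight = {node: 0.0 for node in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing weight spread their score evenly over all nodes.
        dangling = sum(score[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_score = {node: base for node in nodes}
        for s, t, w in edges:
            if w > 0:
                new_score[t] += damping * score[s] * w / out_weight[s]
        score = new_score
    return score


# Hypothetical toy graph: a and b both point to c, and c points back to a.
edges = [("a", "c", 1.0), ("b", "c", 1.0), ("c", "a", 1.0)]
scores = pagerank(edges, iterations=100)
print(sorted(scores, key=scores.get, reverse=True))
```

Node `b` has no incoming relationships, so it only receives the baseline share, while `c` collects score from two sources.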



#### Example: Which user is the most influential when it comes to the tweets made that mention other users? You can use any of the numeric properties of the tweets as the weight.

In [7]:
# Step 0 - Clear graph, graph names need to be unique

try:
    query = """
            CALL gds.graph.drop('example_1') YIELD graphName;
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

ClientError('Failed to invoke procedure `gds.graph.drop`: Caused by: java.util.NoSuchElementException: Graph with name `example_1` does not exist on database `neo4j`. It might exist on another database.')
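
The error above only means that there was no graph to drop. As a possible alternative, `gds.graph.drop` accepts a second argument, `failIfMissing`. Passing `false` should make the call return an empty result instead of raising. The helper below is a sketch under that assumption. The name `drop_graph_if_exists` is hypothetical, and it expects a driver object like the one created earlier in this notebook.

```python
def drop_graph_if_exists(driver, graph_name, database="neo4j"):
    """Drop a projected GDS graph; return its name, or None if it did not exist."""
    # The second procedure argument (failIfMissing = false) asks GDS not to raise
    # an error when no graph with this name is present.
    query = "CALL gds.graph.drop($graph_name, false) YIELD graphName RETURN graphName"
    with driver.session(database=database) as session:
        records = session.execute_read(
            lambda tx: list(tx.run(query, graph_name=graph_name))
        )
    return records[0]["graphName"] if records else None
```

With the `driver` from this notebook, `drop_graph_if_exists(driver, 'example_1')` could replace the `try`/`except` cell in Step 0.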


In [8]:
# Step 1 - Create an appropriate graph (a subgraph)

try:
    query = """
        CALL gds.graph.project.cypher(
            'example_1',
            
            'MATCH (n:User) RETURN id(n) AS id',
            
            'MATCH (u1:User)-[]->(t:Tweet)-[]->(u2:User) 
            RETURN id(u1) AS source, t.replies AS weight, id(u2) AS target'
            )
      """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)
    
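# The node query finds all nodes labelled User and returns only their ids (add a LIMIT when testing it on its own).
# The relationship query returns each User -> Tweet -> User chain as a source/target pair, with the tweet's reply count as the weight.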

ClientError('Failed to invoke procedure `gds.graph.project.cypher`: Caused by: java.lang.IllegalArgumentException: Node-Query returned no nodes')


In [33]:
# Step 1 - part 1
try:
    query = """
            MATCH (n:User) 
            RETURN id(n) AS id
            LIMIT 5     
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)

[<Record id=0>, <Record id=1>, <Record id=3>, <Record id=5>, <Record id=6>]


In [34]:
# Step 1 - part 2
try:
    query = """
            MATCH (u1:User)-[:TWEETED]->(t:Tweet)-[:MENTIONS]->(u2:User) 
            RETURN id(u1) AS source, t.replies as weight, id(u2) AS target
            LIMIT 15
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record source=59399 weight=0 target=3>,
 <Record source=48209 weight=1 target=3>,
 <Record source=65770 weight=12 target=3>,
 <Record source=39143 weight=3 target=3>,
 <Record source=17674 weight=7 target=3>,
 <Record source=12599 weight=0 target=6>,
 <Record source=12599 weight=2 target=6>,
 <Record source=12599 weight=0 target=6>,
 <Record source=12599 weight=0 target=6>,
 <Record source=9965 weight=0 target=6>,
 <Record source=81379 weight=0 target=8>,
 <Record source=79896 weight=0 target=8>,
 <Record source=36197 weight=7 target=8>,
 <Record source=74055 weight=1 target=8>,
 <Record source=21559 weight=8 target=8>]


In [35]:
# Step 2 - Run the algorithm

try:
    query = """
        CALL gds.pageRank.stream('example_1', {relationshipWeightProperty:'weight'})
            
            YIELD nodeId, score
            
            RETURN gds.util.asNode(nodeId).username AS name, score
            
            ORDER BY score desc
            LIMIT 5
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)
    
# YIELD: in the stream execution mode, the algorithm returns the score for each node.
# This allows us to inspect the results directly or post-process them in Cypher without any side effects.
# RETURN: looks up the username for each node id and returns it together with the calculated score.

[<Record name='POTUS' score=27.12628728998999>,
 <Record name='TiloJung' score=14.689286591434579>,
 <Record name='Afelia' score=12.760064006336288>,
 <Record name='elonmusk' score=8.584080794800359>,
 <Record name='Johann_v_d_Bron' score=6.748710790285419>]


## Community detection Algorithms

Community detection algorithms are used to evaluate how groups of nodes are clustered or partitioned, as well as their tendency to strengthen or break apart. As an example you can use the [Label Propagation algorithm](https://neo4j.com/docs/graph-data-science/current/algorithms/label-propagation/).

![community.png](img/community.png)

### Label Propagation algorithm

The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. It detects these communities using network structure alone as its guide, and doesn’t require a pre-defined objective function or prior information about the communities.
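
The core loop can be sketched in a few lines of pure Python. This is a simplified, deterministic variant for illustration only: it treats relationships as undirected, visits nodes in sorted order, and breaks ties by picking the smallest label. The GDS implementation differs in such details, and the function name `label_propagation` and the toy edge list are hypothetical.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_iterations=10):
    """Simplified label propagation on an undirected graph given as (a, b) pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Every node starts in its own community, labelled by its own id.
    labels = {node: node for node in neighbours}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(neighbours):
            # Adopt the most frequent label among the neighbours (smallest label on ties).
            counts = Counter(labels[n] for n in neighbours[node])
            top = max(counts.values())
            best = min(label for label, count in counts.items() if count == top)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # converged: no node changed its label in this pass
    return labels


# Hypothetical toy graph: two separate triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
labels = label_propagation(edges)
community_sizes = Counter(labels.values())
print(len(community_sizes), community_sizes.most_common(1))
```

Counting labels with `Counter` mirrors what the Cypher queries below do with `count(*)` per `communityId`.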

#### Example: When considering that users mention other users in their tweets, how many communities are formed from these relationships? How many users does the biggest community have?

In [36]:
# Step 0 - Clear graph, graph names need to be unique

try:
    query = """
            CALL gds.graph.drop('example_2') YIELD graphName;
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

ClientError('Failed to invoke procedure `gds.graph.drop`: Caused by: java.util.NoSuchElementException: Graph with name `example_2` does not exist on database `neo4j`. It might exist on another database.')


In [37]:
# Step 1 - Create an appropriate graph

try:
    query = """
        CALL gds.graph.project.cypher(
        
            'example_2',
            
            'MATCH (n:User) RETURN id(n) AS id',
            
            'MATCH (u1:User)-[]->(l:Tweet)-[]->(u2:User) 
            
            RETURN id(u1) AS source, id(u2) AS target'
            )
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record nodeQuery='MATCH (n:User) RETURN id(n) AS id' relationshipQuery='MATCH (u1:User)-[]->(l:Tweet)-[]->(u2:User) \n            \n            RETURN id(u1) AS source, id(u2) AS target' graphName='example_2' nodeCount=69974 relationshipCount=44241 projectMillis=347>]


In [38]:
# Step 2 - Check the algorithm stats

try:
    query = """
            CALL gds.labelPropagation.stats(
                'example_2')
            YIELD communityCount, ranIterations, didConverge
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)
    
# In the stats execution mode, the algorithm returns a single row containing a summary of the algorithm result.
# This execution mode does not have any side effects.
# It can be useful for evaluating algorithm performance by inspecting the computeMillis return item (not yielded in the query above).

[<Record communityCount=60543 ranIterations=4 didConverge=True>]


In [39]:
# Step 3 - Run the algorithm

try:
    query = """
            CALL gds.labelPropagation.stream('example_2')
            YIELD nodeId, communityId AS Community
            
            WITH gds.util.asNode(nodeId).username AS Name, Community
            RETURN Community, count(*) as freq
            ORDER BY freq desc
            limit 5
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record Community=25 freq=930>,
 <Record Community=39 freq=267>,
 <Record Community=1484 freq=199>,
 <Record Community=74 freq=139>,
 <Record Community=937 freq=126>]


In [40]:
# Step 4 - Who belongs to the community?

try:
    query = """
            CALL gds.labelPropagation.stream('example_2')
            YIELD nodeId, communityId AS Community
            
            WITH gds.util.asNode(nodeId).username AS Name, Community
            WHERE Community = 25
            RETURN Name
            LIMIT 20
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record Name='Denys_Shmyhal'>,
 <Record Name='POTUS'>,
 <Record Name='CleanCarbonNrg'>,
 <Record Name='TGinja8'>,
 <Record Name='SecYellen'>,
 <Record Name='MoonlightNfts'>,
 <Record Name='Safi_awan'>,
 <Record Name='j_yeo1'>,
 <Record Name='donnyfarmshop'>,
 <Record Name='KikiGPatriot'>,
 <Record Name='CarnegieEndow'>,
 <Record Name='AdultVietson'>,
 <Record Name='Hopeful11212715'>,
 <Record Name='JohnBirdsall15'>,
 <Record Name='Diplomacy140'>,
 <Record Name='xavierkress'>,
 <Record Name='361_3933'>,
 <Record Name='Sunnyand74'>,
 <Record Name='norsedog'>,
 <Record Name='USTreasury'>]


## Similarity Algorithms

Similarity algorithms compute the similarity of pairs of nodes based on their neighborhoods or their properties. Several similarity metrics can be used to compute a similarity score. As an example you can use the [Node Similarity algorithm](https://neo4j.com/docs/graph-data-science/current/algorithms/node-similarity/).

![similarity.png](img/similarity.png)

### Node Similarity algorithm

The Node Similarity algorithm compares a set of nodes based on the nodes they are connected to. Two nodes are considered similar if they share many of the same neighbors. Node Similarity computes pair-wise similarities based on either the Jaccard metric, also known as the Jaccard Similarity Score, or the Overlap coefficient, also known as the Szymkiewicz–Simpson coefficient.
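
Both metrics are simple set operations. The sketch below computes them for two hypothetical users, where each user's neighbour set is the set of hashtags they used. The function names and hashtag values are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: size of the intersection divided by the smaller set's size."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical hashtag sets for two users.
user1_tags = {"Ukraine", "Kyiv"}
user2_tags = {"Ukraine", "Kyiv", "NATO", "EU"}

print(jaccard(user1_tags, user2_tags))
print(overlap(user1_tags, user2_tags))
```

The overlap coefficient reaches 1 whenever one set is contained in the other, whereas Jaccard also penalises the difference in set sizes.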

#### Example: Which users are the most similar in terms of hashtags? Disregard users who used 6 or fewer distinct hashtags.

In [41]:
# Step 0 - Clear graph

try:
    query = """CALL gds.graph.drop('example_3') YIELD graphName;  """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)

ClientError('Failed to invoke procedure `gds.graph.drop`: Caused by: java.util.NoSuchElementException: Graph with name `example_3` does not exist on database `neo4j`. It might exist on another database.')


In [42]:
# Step 1 - Create an appropriate graph

try:
    query = """
        CALL gds.graph.project.cypher(
            'example_3',
            
            "MATCH (n) 
            WHERE head(Labels(n))='User' or head(Labels(n))='Hashtag' 
            RETURN id(n) AS id",
            
            "MATCH (u1:User)-[]->(tw:Tweet)-[]->(t:Hashtag) 
            
                with u1, count(distinct t.name) as freq
                where freq > 6
                
                MATCH (u2:User)-[]->(tw2:Tweet)-[]->(t2:Hashtag) 
                with collect(distinct id(u1)) as old, u2, t2
                where any(x IN old WHERE x = id(u2) )
                return id(u2) AS source, id(t2) AS target"
            )
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)

[<Record nodeQuery="MATCH (n) \n            WHERE head(Labels(n))='User' or head(Labels(n))='Hashtag' \n            RETURN id(n) AS id" relationshipQuery='MATCH (u1:User)-[]->(tw:Tweet)-[]->(t:Hashtag) \n            \n                with u1, count(distinct t.name) as freq\n                where freq > 6\n                \n                MATCH (u2:User)-[]->(tw2:Tweet)-[]->(t2:Hashtag) \n                with collect(distinct id(u1)) as old, u2, t2\n                where any(x IN old WHERE x = id(u2) )\n                return id(u2) AS source, id(t2) AS target' graphName='example_3' nodeCount=74682 relationshipCount=5095 projectMillis=33530>]


In [43]:
# Step 1.1 - Check the node query on its own
try:
    query = """
            MATCH (n) 
            WHERE head(Labels(n))='User' or head(Labels(n))='Hashtag' 
            RETURN id(n) AS id
            LIMIT 20
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)
# Use RETURN DISTINCT labels(n) to check which labels are matched

[<Record id=0>,
 <Record id=1>,
 <Record id=3>,
 <Record id=5>,
 <Record id=6>,
 <Record id=8>,
 <Record id=10>,
 <Record id=12>,
 <Record id=13>,
 <Record id=15>,
 <Record id=17>,
 <Record id=19>,
 <Record id=21>,
 <Record id=23>,
 <Record id=25>,
 <Record id=26>,
 <Record id=28>,
 <Record id=29>,
 <Record id=31>,
 <Record id=32>]


In [44]:
# Step 1.2 A - Check the relationship query on its own
try:
    query = """
                MATCH (u1:User)-[]->(tw:Tweet)-[]->(t:Hashtag) 
                WITH u1, count(distinct t.name) as freq
                WHERE freq > 6
                
                MATCH (u2:User)-[]->(tw2:Tweet)-[]->(t2:Hashtag) 
                
                WITH collect(distinct id(u1)) as old, u2, t2
                WHERE any( x IN old WHERE x = id(u2) )
                RETURN id(u2) AS source, id(t2) AS target
                LIMIT 10
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)

[<Record source=28619 target=189646>,
 <Record source=50215 target=189646>,
 <Record source=2270 target=189646>,
 <Record source=3086 target=189647>,
 <Record source=73976 target=189647>,
 <Record source=82055 target=189647>,
 <Record source=26145 target=189647>,
 <Record source=89436 target=189647>,
 <Record source=85006 target=189647>,
 <Record source=1078 target=189647>]


In [45]:
# Step 1.2 B - Example of output analysis (easier to read in the Neo4j Browser)
try:
    query = """
MATCH (u1:User)-[]->(tw:Tweet)-[]->(t:Hashtag) 
WITH u1, count(distinct t.name) as freq
WHERE freq > 6
                
MATCH (u2:User)-[g:TWEETED]->(tw2:Tweet)-[b:MENTIONS_HASHTAG]->(t2:Hashtag) 
WITH collect(distinct id(u1)) as old, u2, t2, g, b, tw2     
WHERE any( x IN old WHERE x = id(u2) )
RETURN u2 AS source, t2 AS target, g, b, tw2
LIMIT 5
        """
    result = execute_read(driver, query)
    pprint(result)
except Exception as e:
    pprint(e)

[<Record source=<Node element_id='28619' labels=frozenset({'User'}) properties={'tweet_count': 1191, 'followers': 15, 'following': 149, 'verified': False, 'description': "Pas besoin de raconter ce que vivent les ukrainiens, nos médias le font H24, a nous de partager ce qu'ils nous cachent pour mieux comprendre. Une pièce a 2faces", 'profile_image_url': 'https://pbs.twimg.com/profile_images/1407789719686201364/ulW-ovfh_normal.jpg', 'id': '1319331133608308736', 'username': 'VertRed'}> target=<Node element_id='189646' labels=frozenset({'Hashtag'}) properties={'name': 'DenazifyUkraine'}> g=<Relationship element_id='167583' nodes=(<Node element_id='28619' labels=frozenset({'User'}) properties={'tweet_count': 1191, 'followers': 15, 'following': 149, 'verified': False, 'description': "Pas besoin de raconter ce que vivent les ukrainiens, nos médias le font H24, a nous de partager ce qu'ils nous cachent pour mieux comprendre. Une pièce a 2faces", 'profile_image_url': 'https://pbs.twimg.com/prof

In [46]:
# Step 2 - Run the algorithm
try:
    query = """
        CALL gds.nodeSimilarity.stream('example_3')
            YIELD node1, node2, similarity
            
            WITH gds.util.asNode(node1).username AS User1, 
            
            gds.util.asNode(node2).username AS User2, similarity
            
            RETURN User1, User2, similarity
            
            ORDER BY similarity ASCENDING
            LIMIT 10
        """
    result = execute_read(driver, query)
    
    pprint(result)
except Exception as e:
    pprint(e)
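# Note: ORDER BY similarity ASCENDING lists the least similar pairs first; use DESCENDING to see the most similar users.
# Adding WHERE User1 < User2 before RETURN keeps one row per pair, since the stream typically contains each pair in both directions.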

[<Record User1='PreventPapers' User2='_Thirunarayan1' similarity=0.008695652173913044>,
 <Record User1='PreventPapers' User2='ijaydenx' similarity=0.011904761904761904>,
 <Record User1='VeilleCyber3' User2='aarush_patill' similarity=0.016666666666666666>,
 <Record User1='VeilleCyber3' User2='chidambara09' similarity=0.025>,
 <Record User1='epicpitch' User2='Demiurge1165' similarity=0.02631578947368421>,
 <Record User1='CYNews2' User2='Author_JackKost' similarity=0.02857142857142857>,
 <Record User1='thealikecom' User2='ADFCFUTUREFOREX' similarity=0.029411764705882353>,
 <Record User1='wernerkeil' User2='ZAM90s' similarity=0.029411764705882353>,
 <Record User1='epicpitch' User2='RuanoFaxas' similarity=0.029411764705882353>,
 <Record User1='chidambara09' User2='Sergey_Hodor' similarity=0.029411764705882353>]


## Path finding Algorithms

Path finding algorithms find the path between two or more nodes or evaluate the availability and quality of paths. As an example you can use the [Dijkstra Shortest Path algorithm](https://neo4j.com/docs/graph-data-science/current/algorithms/dijkstra-source-target/).

![path2.png](img/path2.png)

### Dijkstra Shortest Path algorithm

The Dijkstra Shortest Path algorithm computes the shortest path between nodes. The algorithm supports weighted graphs with positive relationship weights. The Dijkstra Source-Target algorithm computes the shortest path between a source and a target node. To compute all paths from a source node to all reachable nodes, Dijkstra Single-Source can be used.
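
For intuition, here is a minimal source-target Dijkstra sketch in pure Python using a priority queue. The function name `dijkstra` and the toy graph are hypothetical. In the lab's `example_4` projection no weight property is used, so every hop costs 1 and `totalCost` is simply the number of hops.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (total_cost, path) for the cheapest path, or None if unreachable.

    graph: dict mapping node -> list of (neighbour, weight) with non-negative weights.
    """
    queue = [(0.0, source, [source])]  # entries are (cost so far, node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue  # a cheaper route to this node was already processed
        visited.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Hypothetical toy graph.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
print(dijkstra(graph, "A", "D"))
```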

#### Example: Users mention other users in their tweets, and those users can in turn tweet and mention others. Following this chain, what is the minimum number of users/tweets between the users with usernames `UAWeapons` and `WCKitchen`?

In [47]:
# Step 0 - Clear graph

try:
    query = """
            CALL gds.graph.drop('example_4') YIELD graphName;
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record graphName='example_4'>]


In [48]:
# Step 1 - Create an appropriate graph

try:
    query = """
        CALL gds.graph.project.cypher(
            'example_4',
            'MATCH (n) RETURN id(n) AS id',
            'MATCH (u1)-[]-(u2) RETURN id(u1) AS source, id(u2) AS target'
            )
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record nodeQuery='MATCH (n) RETURN id(n) AS id' relationshipQuery='MATCH (u1)-[]-(u2) RETURN id(u1) AS source, id(u2) AS target' graphName='example_4' nodeCount=194356 relationshipCount=400800 projectMillis=691>]


In [49]:
# Step 2 - Run the algorithm

try:
    query = """
        MATCH (source:User {username: 'UAWeapons'}), (target:User {username: 'WCKitchen'})
        CALL gds.shortestPath.dijkstra.stream('example_4', {
            sourceNode: source,
            targetNode: target
        })
        YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
        RETURN
            index,
            gds.util.asNode(sourceNode).username AS sourceNodeName,
            gds.util.asNode(targetNode).username AS targetNodeName,
            totalCost,
            [nodeId IN nodeIds | Labels(gds.util.asNode(nodeId))] AS nodeNames,
            costs,
            Nodes(path) as path
        ORDER BY index
        """

    result = execute_read(driver, query)

    pprint(result)
except Exception as e:
    pprint(e)

[<Record index=0 sourceNodeName='UAWeapons' targetNodeName='WCKitchen' totalCost=6.0 nodeNames=[['User'], ['Tweet'], ['Hashtag'], ['Tweet'], ['Hashtag'], ['Tweet'], ['User']] costs=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0] path=[<Node element_id='13' labels=frozenset({'User'}) properties={'tweet_count': 2041, 'followers': 481677, 'following': 11, 'verified': False, 'description': '🇷🇺/🇬🇧 Debunking & Tracking Usage/Capture of Materiel in Ukraine. A @CalibreObscura & @ArmoryBazaar project. For commercial inquiries, please email.', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1497287636448382978/wAo98Sgw_normal.jpg', 'id': '1495480590572961792', 'username': 'UAWeapons'}>, <Node element_id='146616' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T16:51:00.000Z', 'replies': 0, 'id': '1517184136489816071', 'text': "@UAWeapons There are about 76 Russian BTG's (Battalion Tactical Groups) in Ukraine with losing 753 tanks, that's about 10% loss. 20%  would make the invasion com

In [50]:
# YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path

result[0]["nodeNames"]  # Labels of the nodes along the path

[['User'], ['Tweet'], ['Hashtag'], ['Tweet'], ['Hashtag'], ['Tweet'], ['User']]
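
To turn this list into an answer, one option is to drop the two endpoints and count the remaining nodes by label. The snippet below is a small sketch that copies the label list from the output above into a literal, so it runs without a database connection.

```python
from collections import Counter

# Labels of the nodes along the shortest path, copied from the output above.
node_labels = [['User'], ['Tweet'], ['Hashtag'], ['Tweet'], ['Hashtag'], ['Tweet'], ['User']]

# Everything between the source and target users.
intermediate = node_labels[1:-1]
label_counts = Counter(labels[0] for labels in intermediate)

print(len(intermediate), dict(label_counts))
```

The path found on `example_4` runs through tweets and hashtags rather than through other users, because that projection includes every node and relationship in the database.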

---
## What's next?

The **first project is almost over!** Post doubts and questions on the Moodle Forum. The **due date** is the **18th of March!**