# Big Data Modelling and Management - Lab 1

Prepare for the class:

0. download Dataset from Moodle Lab 1 -> unzip the folder + rename to Neo4JData -> delete the original zip!
1. run Jupyter 
2. run Docker
3. run the command with paths to your folders in Command Prompt
4. open http://localhost:7474/browser/  
5. log-in with login: neo4j, password: test
6. start exploring the database

bit.ly/BDMM_slido //
osavchuk@novaims.unl.pt 

The default container does not contain any data, so we have to load a database into our Docker container:
- Download and unzip the `Neo4JData` file provided in Moodle. 
- Copy the path to the data inside the unzipped `Neo4JData` folder, e.g. `C:/Users/.../Neo4JData/data`.
- Download and unzip the `Neo4JPlugins` file provided in Moodle.
- Copy the path of the unzipped `Neo4JPlugins` folder, e.g. `C:/Users/.../Neo4Jplugins`.
- Change the command below accordingly, using the paths you copied in the previous steps. 

- Open Command Prompt, make sure that Docker is running, edit the paths, and run this command:

`docker run --user=$(id -u):$(id -g) --name Neo4JLab -p 7474:7474 -p 7687:7687 -d -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JPlugins":/plugins -v "/home/gui-moreira/Documents/NOVA/BDMM/Installation_Guide/Neo4JData/data":/data --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised__address="localhost:7473" --env NEO4J_dbms_connector_http_advertised__address="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised__address="localhost:7687" --env NEO4J_dbms_security_procedures_unrestricted="gds.*" --env NEO4J_dbms_security_procedures_allowlist="gds.*" neo4j:4.4.5`

- Since Neo4j is trying to recognize a new database folder, this might take a bit (let's say 3 minutes), so don't worry.
- You can check that the plugins are correctly installed by running `CALL dbms.procedures() YIELD name RETURN name` in the browser UI and looking for procedures whose names start with `gds`.

## Graphs and Graph databases

![image_1](img/basic_graph.png)

This is a graph and it will be the focus of the next three classes and the first project.      

---
---

Graph databases are less common than relational ones because they model data in a completely different way, one that is very intuitive for data with a graph-like structure.
In graph databases, complex join queries that would normally be slow in relational models become fast and easy to construct.  

Let's take a look at the differences:

In the relational model, a many-to-many relationship between 2 entities is represented with an auxiliary table.  
In the example below, students are mapped to courses with the enrollments table. To list all of the courses a student is enrolled in, you need to join the tables through enrollments. This can be time-consuming when join operations are used frequently on large tables.

![sql_example](img/sql_example.png)  
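The join described above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table and column names and the sample rows are assumptions made for illustration, not taken from the book's figure.

```python
import sqlite3

# In-memory database; all table names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses (course_id INTEGER PRIMARY KEY, title TEXT);
    -- Auxiliary table that resolves the many-to-many relationship
    CREATE TABLE enrollments (student_id INTEGER, course_id INTEGER);

    INSERT INTO students VALUES (1, 'Ana'), (2, 'Rui');
    INSERT INTO courses VALUES (10, 'Databases'), (20, 'Statistics');
    INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

def courses_for(student_name):
    """List the courses a student is enrolled in, via the enrollments table."""
    rows = conn.execute(
        """
        SELECT c.title
        FROM students s
        JOIN enrollments e ON e.student_id = s.student_id
        JOIN courses c ON c.course_id = e.course_id
        WHERE s.name = ?
        ORDER BY c.title
        """,
        (student_name,),
    ).fetchall()
    return [title for (title,) in rows]

print(courses_for("Ana"))
```

Every lookup has to go through `enrollments`, which is the cost the paragraph above refers to.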

---

Meanwhile, in a graph database, relationships between entities become easier to visualize and query.  
Looking at the example below it is easier to see how many students are in a course, and which students are enrolled in which courses.

![graph_example](img/graph_example.png)  

---

In essence, graph databases focus on entities and the relationships between them rather than on properties of the entities themselves.
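To make the idea concrete, here is a minimal in-memory sketch in plain Python where relationships are stored directly instead of being derived through a join table. It only illustrates the modelling idea and says nothing about how Neo4j stores data internally; the names and the `ENROLLED_IN` type are assumptions.

```python
from collections import defaultdict

# Each relationship is a (start node, relationship type, end node) triple.
# The student and course names are illustrative assumptions.
relationships = [
    ("Ana", "ENROLLED_IN", "Databases"),
    ("Ana", "ENROLLED_IN", "Statistics"),
    ("Rui", "ENROLLED_IN", "Databases"),
]

# Index the relationships in both directions so that a traversal can start
# from either end.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for start, rel_type, end in relationships:
    outgoing[start].append(end)
    incoming[end].append(start)

print(outgoing["Ana"])        # courses Ana is enrolled in
print(incoming["Databases"])  # students enrolled in Databases
```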

_All images in this section are extracted from the book __NoSQL for Mere Mortals__ from the recommended readings_

## Social networks!

Social networks are a perfect fit for graph databases.   
Twitter is simple and provides a nice API to extract some data.  
Here is an example of a tweet and its corresponding JSON data:  
_(For those interested, the tweet is from Max Roser, a researcher at Oxford who develops [OurWorldinData](https://ourworldindata.org/), which does great analysis and visualizations on the world's largest problems)_

In [1]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Are the number of confirmed deaths rising faster 
in China, Italy, Spain, South Korea, or the US?<br><br>Our interactive chart shows you the trajectories 
since the day each country had the fifth confirmed death.<br><br>Here is the interactive version: 
    <a href="https://t.co/QiTCXtERj1">https://t.co/QiTCXtERj1</a> <a href="https://t.co/YlguVgdmDH">pic.twitter.com/YlguVgdmDH</a></p>&mdash; Max Roser (@MaxCRoser) <a href="https://twitter.com/MaxCRoser/status/1243252283636416517?ref_src=twsrc%5Etfw">March 26, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

The tweet above would be represented by the following JSON:

``` python
{
 'date': 'Thu Mar 26 19:03:37 +0000 2020',
 'favorites': 166,
 'hashtags': [],
 'retweets': 102,
 'text': 'Are the number of confirmed deaths rising faster in China, Italy, Spain, South Korea, or the US?\n\nOur interactive chart shows you the trajectories since the day each country had the fifth confirmed death.\n\nHere is the interactive version: https://ourworldindata.org/coronavirus#trajectories-since-the-5th-confirmed-death',
 'user_id': 610659001505,
 'user_mentions': [],
 'username': 'MaxCRoser',
 'verified': True
}
```
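One way to picture how such a record maps onto a graph: the user and the tweet become nodes, while authorship, mentions and hashtags become relationships. The sketch below is an assumption about one possible mapping, not the procedure used to build the lab database. The function name `tweet_to_graph` is hypothetical; the labels and relationship types mirror the ones listed by `db.labels()` and `db.relationshipTypes()` in this lab.

```python
def tweet_to_graph(tweet):
    """Split a tweet dictionary into node and relationship tuples (hypothetical helper)."""
    user = ("User", tweet["username"])
    post = ("Tweet", tweet["text"])
    nodes = [user, post]
    relationships = [(user, "TWEETED", post)]

    # Every mentioned user becomes a User node linked from the tweet.
    for name in tweet["user_mentions"]:
        mentioned = ("User", name)
        nodes.append(mentioned)
        relationships.append((post, "MENTIONS", mentioned))

    # Every hashtag becomes a Hashtag node linked from the tweet.
    for tag in tweet["hashtags"]:
        hashtag = ("Hashtag", tag)
        nodes.append(hashtag)
        relationships.append((post, "MENTIONS_HASHTAG", hashtag))

    return nodes, relationships


# Made-up record that reuses some keys of the example above.
sample = {
    "username": "user_1",
    "text": "This is a tweet",
    "user_mentions": ["user_2"],
    "hashtags": ["example"],
}
nodes, relationships = tweet_to_graph(sample)
print(len(nodes), len(relationships))
```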

---

## Neo4J

![neo4j_example](img/neo4j_logo.png)  

There are many graph databases implemented with [different philosophies and syntaxes](https://en.wikipedia.org/wiki/Graph_database#List_of_graph_databases). For the present course we decided to go with Neo4j and its query language Cypher as it is the [most widely used](https://db-engines.com/en/ranking/graph+dbms) and, as a consequence, one of the most mature.
A special mention goes to [SPARQL](https://www.ontotext.com/knowledgehub/fundamentals/what-is-sparql/), a query language for RDF graph data that is used to query the [EU open data portal](https://data.europa.eu/euodp/en/sparql) and [Wikidata (which powers Wikipedia)](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial).


We will go through each major Cypher operation. For now, let's just briefly look at a comparative example with SQL extracted from [Wikipedia](https://en.wikipedia.org/wiki/Graph_database#Examples):
``` SQL
SELECT p2.person_name 
FROM people p1 
JOIN friend ON (p1.person_id = friend.person_id)
JOIN people p2 ON (p2.person_id = friend.friend_id)
WHERE p1.person_name = 'Jack';
```
Becomes
``` Cypher
MATCH (p1:person)-[:FRIEND_WITH]-(p2:person)
WHERE p1.name = "Jack"
RETURN p2.name
```

Easier and more intuitive to read, right?


## Data Model of graph database

### Data Model

The data model describes the labels of nodes, relationships, and properties for the graph. It does not have specific data that will be created in the graph.

A graph data model, however, is important because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.

![sample-data-model.png](img/sample-data-model.png)

### Instance Model

An important part of the graph data modeling process is to test the model against the use cases. To do this, you need to have a set of sample data that you can use to see if the use cases can be answered with the model.

In this instance model, we have created some instances of Person and Movie nodes, as well as their relationships. Having this type of instance model will help us to test our use cases.

![sample-data-instance-model.png](img/sample-data-instance-model.png)

### Fanout

Fanout is when entities are represented not as a single node, but as a network of linked nodes. In the figure below, these additional nodes could have been represented as properties of the 'main' nodes, Person and Residence, respectively.

The main risk of fanout is that it can lead to very dense nodes, or supernodes. These are nodes that have hundreds of thousands of incoming or outgoing relationships. Supernodes need to be handled carefully.

![neo4j_fanout.png](img/neo4j_fanout.png)

## Cypher Language

**Node and Relationship**  
A node can be represented with:
```
()                     // anonymous node (no label or variable); can refer to any node in the database
(u:User)               // variable u, label User
(:Tweet)               // no variable, label Tweet
(tag:Hashtag)          // variable tag, label Hashtag
(u:User{name: "John"}) // node with label User and property name
```
And relationships as:
```
()-[:TWEETED]-()                    // Relationship of type TWEETED between two anonymous nodes, direction ignored
(:Tweet)<-[:MENTIONED_AT]-(u:User)  // Directed relationship from User to Tweet
()-[h:HAS]-()                       // Relationship with variable h and type HAS
```
This example:

![node_relationship.png](img/node_relationship.png)

Would be represented as:  
```
(:User{name: "user_1"})-[:TWEETED]->(:Tweet{text: "This is a tweet"})
```

## Hands on!

For this class, we will continue to use docker with a neo4j database.

There are several libraries for interacting with Neo4j from Python. We will use the `neo4j` Python library, as it is the simplest to use, but there are plenty more.  
Another way to interact with Neo4j is the web interface available at http://localhost:7474/browser/ (pretty graphs there):
```
Connect URL: bolt://localhost:7687
Username and password same as below  
```

Ok, let's start!

In [2]:
! pip install neo4j

Collecting neo4j
  Using cached neo4j-5.17.0.tar.gz (197 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: neo4j
  Building wheel for neo4j (pyproject.toml) ... done
  Created wheel for neo4j: filename=neo4j-5.17.0-py3-none-any.whl size=273834 sha256=ff6395fc255e75ebdd5a37ade6e4feda5bc3ccce7b362a5b823ac37ac8e82016
  Stored in directory: /home/gui-moreira/.cache/pip/wheels/24/cd/ce/048840b064bfabaefdb7fb0d0cf6dd5898e0f7b9b1cb336cca
Successfully built neo4j
Installing collected packages: neo4j
Successfully installed neo4j-5.17.0


In [1]:
from neo4j import GraphDatabase
from pprint import pprint

## Define a Connection to your external Database

In [2]:
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="test"

In [3]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

## All The Functions you'll need to run queries in Neo4J

In [4]:
def execute_query_deprecated(driver, query):
    return list(driver.execute_query(query))

In [5]:
def execute_write(driver, query):
    with driver.session(database="neo4j") as session:
        # Write transactions allow the driver to handle retries and transient errors
        result = session.execute_write(lambda tx, query: list(tx.run(query)), query)
    return result

In [6]:
def execute_read(driver, query):    
    with driver.session(database="neo4j") as session:
        result = session.execute_read(lambda tx, query: list(tx.run(query)), query)
    return result

### Let's test out our functions

## Explore database

In [7]:
query = """
       call db.labels();
    """

result = execute_read(driver, query)

pprint(result)

[<Record label='User'>,
 <Record label='Location'>,
 <Record label='Tweet'>,
 <Record label='Hashtag'>]


In [9]:
query = """
        CALL db.relationshipTypes();
    """

result = execute_read(driver, query)

pprint(result)

[<Record relationshipType='FROM'>,
 <Record relationshipType='TWEETED'>,
 <Record relationshipType='MENTIONS'>,
 <Record relationshipType='MENTIONS_HASHTAG'>]


## CRUD Operations

### CREATE

Each node is implicitly given a unique id, visible as `element_id` in the output below.

In [10]:
import numpy as np

user_1 = f"user-{np.random.randint(1000000)}"
user_2 = f"user-{np.random.randint(1000000)}"

tweet_content  = "Yet another tweet"

query = f"""
        CREATE (u1:User{{username: "{user_1}"}})-[:TWEETED]->\
              (t:Tweet{{text: "{tweet_content}"}})-[:MENTIONS]->\
              (u2:User{{username: "{user_2}"}})
        RETURN *
"""

result = execute_write(driver, query)

pprint(result)

[<Record t=<Node element_id='194356' labels=frozenset({'Tweet'}) properties={'text': 'Yet another tweet'}> u1=<Node element_id='194354' labels=frozenset({'User'}) properties={'username': 'user-309579'}> u2=<Node element_id='194357' labels=frozenset({'User'}) properties={'username': 'user-981712'}>>]


### SET

`SET` creates a property if it does not exist yet and updates it if it does.   
It can be compared with an `UPSERT`, which exists in some SQL dialects such as PostgreSQL.

In [11]:
user = user_1

query = f"""
        MATCH (u1:User{{username: \"{user}\"}})
        SET u1.name="Tiago"
        RETURN *
    """

result = execute_write(driver, query)

pprint(result)

[<Record u1=<Node element_id='194354' labels=frozenset({'User'}) properties={'name': 'Tiago', 'username': 'user-309579'}>>]
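`SET` only changes nodes that `MATCH` has already found. When the node itself may not exist yet, Cypher's `MERGE` clause with `ON CREATE SET` / `ON MATCH SET` is closer to a true upsert. A minimal sketch, assuming the `driver` object and the `neo4j` database used in this notebook; the helper name `upsert_user` and the `created` property are hypothetical, and the values are passed as `$` parameters rather than through an f-string.

```python
UPSERT_USER = """
    MERGE (u:User {username: $username})
    ON CREATE SET u.name = $name, u.created = true
    ON MATCH SET u.name = $name
    RETURN u
"""

def upsert_user(driver, username, name):
    """Create the user if missing, otherwise update its name (hypothetical helper)."""
    with driver.session(database="neo4j") as session:
        return session.execute_write(
            lambda tx: list(tx.run(UPSERT_USER, {"username": username, "name": name}))
        )

# Usage, with the driver defined earlier in this notebook:
# pprint(upsert_user(driver, "user-123", "Tiago"))
```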


### DELETE
**Please be careful when executing delete statements on any shared database**

In [12]:
query = """
      MATCH (u:User) 
      WHERE u.username =~ "user-.*"
      DETACH DELETE u
      RETURN count(u)
    """

result = execute_write(driver, query)

pprint(result)

[<Record count(u)=2>]


## Query Tools

### Match Operator

Roughly equivalent to SQL `SELECT`: it fetches data based on the query specifications. Let's see some examples:

#### Get all nodes in the database, no constraints.

In [13]:
query = """
        MATCH (u)
        RETURN u
        LIMIT 3
    """

result = execute_read(driver, query)

pprint(result)

[<Record u=<Node element_id='0' labels=frozenset({'User'}) properties={'tweet_count': 5481, 'followers': 111, 'following': 134, 'verified': False, 'description': 'Human, trying to make sense of an inhumane world.', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1512892306642325514/pMDPNGJK_normal.jpg', 'id': '1231752363817480199', 'username': 'IsaacsonSchmid'}>>,
 <Record u=<Node element_id='1' labels=frozenset({'User'}) properties={'tweet_count': 27811, 'followers': 81971, 'following': 888, 'verified': True, 'description': 'Chair @TheDemocrats Lawyers Council and @DNC Deputy National Finance Chair. @ObamaWhiteHouse appointee. Trial Lawyer. Music & Hockey Fan. #EndGunViolence', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1482073099285803016/RBcXvS4s_normal.jpg', 'id': '22879254', 'username': 'Weinsteinlaw'}>>,
 <Record u=<Node element_id='2' labels=frozenset({'Location'}) properties={'name': 'Parkland, FL '}>>]


#### Get only Locations

In [14]:
query = """
        MATCH (l:Location)
        RETURN *
        LIMIT 3
    """

result = execute_read(driver, query)

pprint(result)

[<Record l=<Node element_id='2' labels=frozenset({'Location'}) properties={'name': 'Parkland, FL '}>>,
 <Record l=<Node element_id='4' labels=frozenset({'Location'}) properties={'name': 'Washington, DC + The World'}>>,
 <Record l=<Node element_id='7' labels=frozenset({'Location'}) properties={'name': 'United States'}>>]


#### Get a property from a node

In [15]:
query = """
        MATCH (l:Location)
        RETURN l.name as location_name
        LIMIT 3
    """

result = execute_read(driver, query)

pprint(result)

[<Record location_name='Parkland, FL '>,
 <Record location_name='Washington, DC + The World'>,
 <Record location_name='United States'>]


#### Conditional match in 2 different ways

Passing the property on the node

In [16]:
query = """
        MATCH (l:Location{name: 'United States'})
        RETURN l.name as location_name
        LIMIT 3
    """

result = execute_read(driver, query)

pprint(result)

[<Record location_name='United States'>]


Using a `WHERE` clause, as in SQL

In [17]:
query = """
        MATCH (l:Location)
        where l.name='United States'
        RETURN l.name as location_name
        LIMIT 3
    """

result = execute_read(driver, query)

pprint(result)

[<Record location_name='United States'>]


The `=~` is the [regex operator](https://neo4j.com/docs/cypher-manual/current/syntax/operators/#syntax-using-a-regular-expression-to-filter-words). It is similar to the `LIKE` clause in SQL.

In [18]:
query = """
        MATCH (l:Location)
        WHERE l.name =~ 'U.*'
        RETURN l.name as location_name
        LIMIT 10
    """

result = execute_read(driver, query)

pprint(result)

[<Record location_name='United States'>,
 <Record location_name='USA'>,
 <Record location_name='Ukraine'>,
 <Record location_name='Under the Bridge'>,
 <Record location_name='U.S.A.'>,
 <Record location_name='United States of America'>,
 <Record location_name='USA, USA'>,
 <Record location_name='USA 🇺🇸 '>,
 <Record location_name='UK'>,
 <Record location_name='United Kingdom'>]


_For anyone not familiar with the concept, see [regex](https://en.wikipedia.org/wiki/Regular_expression)._
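One detail worth knowing: Cypher's `=~` has to match the entire string, so `'U.*'` means "starts with U" only because of the trailing `.*`. Python's `re.fullmatch` behaves the same way and can be used to try patterns out locally. This is only an approximation, since Neo4j uses Java regular expression syntax, which differs from Python's in some corner cases. The sample names are copied from the outputs above.

```python
import re

names = ["United States", "USA", "Ukraine", "Parkland, FL ", "Washington, DC + The World"]

def matches(pattern, values):
    """Keep the values that the pattern matches in full, as Cypher's =~ does."""
    return [v for v in values if re.fullmatch(pattern, v)]

print(matches(r"U.*", names))  # names starting with "U"
print(matches(r"U", names))    # empty: no name is exactly "U"
```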

#### Maximum number of retweets

In [19]:
query = """
        MATCH (t:Tweet)
        RETURN t
        limit 1
    """

result = execute_read(driver, query)

pprint(result)

[<Record t=<Node element_id='89682' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T17:45:10.000Z', 'replies': 0, 'id': '1517197768347074563', 'text': "RT @afpfr: Le Donbass, que Moscou affirme vouloir libérer, est une région industrielle située dans l'est de l'Ukraine, au cœur d'un conflit…", 'retweets': 7, 'lang': 'fr', 'likes': 0}>>]


In [20]:
query = """
        MATCH (t:Tweet)
        RETURN t.lang, count(*) as count
        order by count desc
        limit 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record t.lang='en' count=85016>,
 <Record t.lang='de' count=6235>,
 <Record t.lang='fr' count=3196>,
 <Record t.lang='und' count=1524>,
 <Record t.lang='ja' count=607>]
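To find the most retweeted tweets, one option is to sort by the `retweets` property seen in the record above. A sketch, assuming that property is numeric and that the `driver` created earlier is available; the helper name `most_retweeted` is hypothetical. The `IS NOT NULL` filter is there because Cypher places null values first when sorting in descending order.

```python
MOST_RETWEETED = """
    MATCH (t:Tweet)
    WHERE t.retweets IS NOT NULL
    RETURN t.text AS text, t.retweets AS retweets
    ORDER BY t.retweets DESC
    LIMIT $limit
"""

def most_retweeted(driver, limit=1):
    """Return the `limit` tweets with the highest retweet count (hypothetical helper)."""
    with driver.session(database="neo4j") as session:
        return session.execute_read(
            lambda tx: list(tx.run(MOST_RETWEETED, {"limit": limit}))
        )

# Usage, with the driver defined earlier in this notebook:
# pprint(most_retweeted(driver, limit=3))
```

If only the number itself is needed, `RETURN max(t.retweets)` is a shorter alternative.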


#### Get Relationship

All node operations shown above can also be applied to relationships

In [21]:
query = """
        MATCH (u:User)-[t:TWEETED]->(twt:Tweet)
        where twt.lang='en'
        RETURN u, t, twt
        LIMIT 1
    """

result = execute_read(driver, query)

pprint(result)

[<Record u=<Node element_id='3295' labels=frozenset({'User'}) properties={'tweet_count': 5358, 'followers': 50, 'following': 172, 'verified': False, 'description': '', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1496523769808105478/Wf5ZFed6_normal.jpg', 'id': '1490202434085531649', 'username': 'GriderPruitt'}> t=<Relationship element_id='42157' nodes=(<Node element_id='3295' labels=frozenset({'User'}) properties={'tweet_count': 5358, 'followers': 50, 'following': 172, 'verified': False, 'description': '', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1496523769808105478/Wf5ZFed6_normal.jpg', 'id': '1490202434085531649', 'username': 'GriderPruitt'}>, <Node element_id='89683' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T17:45:29.000Z', 'replies': 0, 'id': '1517197844972777474', 'text': 'RT @POTUS: We’ve already welcomed tens of thousands of Ukrainians to the United States. And today, I’m announcing “Uniting for Ukraine,” a…', 'retweets': 869, 'l

In [22]:
query = """
        MATCH p=()-[t:TWEETED]->()
        RETURN p
        LIMIT 1
    """

result = execute_read(driver, query)

pprint(result)

[<Record p=<Path start=<Node element_id='0' labels=frozenset({'User'}) properties={'tweet_count': 5481, 'followers': 111, 'following': 134, 'verified': False, 'description': 'Human, trying to make sense of an inhumane world.', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1512892306642325514/pMDPNGJK_normal.jpg', 'id': '1231752363817480199', 'username': 'IsaacsonSchmid'}> end=<Node element_id='167697' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T16:30:47.000Z', 'replies': 0, 'id': '1517179048862142465', 'text': 'RT @TonyHussein4: Most Americans blame Dictator Vladimir Putin and oil companies for higher gas prices, according to an ABC News/Ipsos poll…', 'retweets': 184, 'lang': 'en', 'likes': 0}> size=1>>]


In [23]:
query = """
        MATCH (l:Location)-[]-(u:User)-[t:TWEETED]->(twt:Tweet)
        RETURN l, u, twt
        limit 1
    """

result = execute_read(driver, query)

pprint(result)

[<Record l=<Node element_id='7' labels=frozenset({'Location'}) properties={'name': 'United States'}> u=<Node element_id='5948' labels=frozenset({'User'}) properties={'tweet_count': 94257, 'followers': 2174, 'following': 4997, 'verified': False, 'description': 'scientist, nature lover, medical professional, environmentalist, art enthusiast and recent political activist. Resist. Basta! #BlueWave', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1324740211880923137/wPnnCFCF_normal.jpg', 'id': '779951186271150080', 'username': 'jk_kause'}> twt=<Node element_id='92753' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T17:43:59.000Z', 'replies': 0, 'id': '1517197470618595334', 'text': 'RT @NiskanenCenter: THREAD: Biden JUST announced the launch of “Uniting for Ukraine,” a historic effort to welcome Ukrainians into the U.S.…', 'retweets': 28, 'lang': 'en', 'likes': 0}>>]


In [24]:
tweet = result[0]['twt']['text']
print(tweet)

RT @NiskanenCenter: THREAD: Biden JUST announced the launch of “Uniting for Ukraine,” a historic effort to welcome Ukrainians into the U.S.…


#### The use of **f-strings**, compare the differences

In [25]:
username = 'axpraise1515'

query = f"""
        MATCH (u:User{{username: '{username}'}})-[t:TWEETED]->(twt:Tweet)
        RETURN u.username, twt.date, t
    """

result = execute_read(driver, query)

pprint(result)

[<Record u.username='axpraise1515' twt.date='2022-04-21T17:15:20.000Z' t=<Relationship element_id='92294' nodes=(<Node element_id='12' labels=frozenset() properties={}>, <Node element_id='120782' labels=frozenset() properties={}>) type='TWEETED' properties={}>>,
 <Record u.username='axpraise1515' twt.date='2022-04-21T17:29:51.000Z' t=<Relationship element_id='68897' nodes=(<Node element_id='12' labels=frozenset() properties={}>, <Node element_id='106558' labels=frozenset() properties={}>) type='TWEETED' properties={}>>,
 <Record u.username='axpraise1515' twt.date='2022-04-21T17:45:50.000Z' t=<Relationship element_id='45830' nodes=(<Node element_id='12' labels=frozenset() properties={}>, <Node element_id='92234' labels=frozenset() properties={}>) type='TWEETED' properties={}>>,
 <Record u.username='axpraise1515' twt.date='2022-04-21T17:46:59.000Z' t=<Relationship element_id='45738' nodes=(<Node element_id='12' labels=frozenset() properties={}>, <Node element_id='91224' labels=frozenset(
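F-strings paste the value straight into the query text, which breaks on values containing quotes and opens the door to Cypher injection. The driver also accepts query parameters, written as `$name` inside the query. A sketch of the same lookup with a parameter, assuming the `driver` created earlier; the helper name `execute_read_params` is hypothetical.

```python
def execute_read_params(driver, query, parameters=None):
    """Run a read query, passing values as parameters instead of formatting them in."""
    with driver.session(database="neo4j") as session:
        return session.execute_read(
            lambda tx: list(tx.run(query, parameters or {}))
        )

TWEETS_BY_USER = """
    MATCH (u:User {username: $username})-[:TWEETED]->(twt:Tweet)
    RETURN u.username, twt.date
"""

# Usage, with the driver defined earlier in this notebook:
# pprint(execute_read_params(driver, TWEETS_BY_USER, {"username": "axpraise1515"}))
```

Note that the query needs no doubled braces or escaped quotes, because nothing is interpolated into the string.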

---
---
---
## What's next ?! 

The **first project will be released next week!** Post doubts and questions on the Moodle Forum.



![feb14.png](img/feb14.png)