# Exploring Neo4j: A Tutorial on Cypher, PageRank, and Louvain Community Detection

**Mehdi Boustani** - S221594  
**Nicolas Schneiders** - S203005  
**Maxim Piron** - S211493  
**Andreas Stistrup** - S212891  

*Faculty of Applied Sciences, University of Liège*

April 4, 2025


# Introduction

Traditional relational databases often struggle with highly connected data, leading to slow and complex queries. Neo4j, a graph database first released in 2007, addresses this problem by storing data as nodes and relationships instead of tables. This approach makes it faster and easier to explore connections between entities.

Neo4j is widely used in areas like social networks, recommendation systems, and knowledge graphs, where relationships matter most. In this tutorial, we explore its capabilities by analyzing a startup ecosystem: we load data from a JSON file, query it with Cypher, and apply PageRank and Louvain community detection to uncover key insights.

# Comparison with Relational Databases 

To compare Neo4j and Cypher with traditional relational databases, we will explore the advantages and drawbacks of using a graph-based approach over a tabular one.

### Advantages of Neo4j and Cypher over relational SQL
One of the main advantages of Neo4j and its query language, Cypher, is that data is stored in a **graph structure** composed of nodes and relationships rather than in tables. This allows for the efficient traversal of relationships between entities. In Neo4j, navigating from one node to another via a relationship can be done in constant time, regardless of the graph's size.

In contrast, relational databases require joins between tables to establish relationships, which can become computationally expensive, especially as the number of joins increases or when dealing with deep or complex relationships.

This is particularly true for relationship-heavy queries, where SQL databases often need recursive queries or iterative joins to explore multi-level relationships. Neo4j, by comparison, can leverage its graph-native structure so that the cost of a query depends mainly on the part of the graph it actually traverses, rather than on the total size of the dataset.
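To make the traversal argument concrete, here is a minimal sketch in plain Python. It is not Neo4j's storage engine: the toy `FRIENDS` graph and the `within_hops` helper are illustrative assumptions. Each node keeps a direct list of its neighbours, so every hop only touches the neighbours of the nodes already reached, and the amount of work depends on the part of the graph that is explored.

```python
from collections import deque

# Toy adjacency list (hypothetical data): each person points directly to friends.
FRIENDS = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": ["Frank"],
    "Erin": [],
    "Frank": [],
}

def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for every node reachable in at most max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]  # the start node is not part of the answer
    return seen

print(within_hops(FRIENDS, "Alice", 2))
```

Raising `max_hops` extends the search by one more layer of neighbours without any change to the query logic, which is the property that multi-level joins make awkward in SQL.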

Another key advantage of Neo4j and Cypher is the **simplicity** and **readability** of queries when working with graph data. 
For example, if we want to find the names of people who are friends with someone named Alice, we can write the following Cypher query:

```cypher
MATCH (p:Person)-[:FRIEND_OF]->(:Person {name: 'Alice'})
RETURN p.name
```

In relational SQL, achieving the same result would require:

```sql
SELECT p1.name
FROM Person p1
JOIN Friendship f ON p1.id = f.person_id
JOIN Person p2 ON f.friend_id = p2.id
WHERE p2.name = 'Alice';
```

As shown above, relationship queries are significantly more intuitive in Cypher. 
The graph pattern-matching syntax closely reflects the actual structure of the data, making queries easier to write, read, and reason about.
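For readers who want to try the relational version, the following sketch runs the SQL query above against an in-memory SQLite database using Python's standard `sqlite3` module. The table contents are made-up sample rows, and the `ORDER BY` clause is an addition that only serves to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Friendship (person_id INTEGER, friend_id INTEGER);
    INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- Bob and Carol are friends of Alice; Bob is also a friend of Carol.
    INSERT INTO Friendship VALUES (2, 1), (3, 1), (2, 3);
""")

rows = conn.execute("""
    SELECT p1.name
    FROM Person p1
    JOIN Friendship f ON p1.id = f.person_id
    JOIN Person p2 ON f.friend_id = p2.id
    WHERE p2.name = 'Alice'
    ORDER BY p1.name
""").fetchall()

# Each row is a one-element tuple, so unpack it into a plain list of names.
friends_of_alice = [name for (name,) in rows]
print(friends_of_alice)
```

Note that the relationship has to be modelled as a separate `Friendship` table and joined twice, whereas the Cypher pattern expresses it as a single arrow.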

The third advantage of Neo4j is its **schema-free data model**. In Neo4j, we are not required to define all possible relationships or properties for each node type in advance. This allows the data model to handle irregular or evolving data naturally, offering greater flexibility as your application and data grow over time.

In contrast, SQL databases rely on rigid schemas, where structural modifications often require **schema migrations**, making them less adaptable to dynamic or semi-structured data.

Finally, as you will see throughout this tutorial, Neo4j is both well-suited and optimized for implementing graph algorithms. By using in-memory graph projections, we can efficiently run algorithms to uncover patterns such as communities, influence, shortest paths, and more. This makes Neo4j not just a data store, but a powerful analytical tool.

### Drawbacks of Neo4j and Cypher compared to relational SQL
Interestingly, some of Neo4j’s strengths can also become limitations depending on the use case. 

1. **Transactional Workloads**

    If your data is **highly structured** and the focus is on **transactions** rather than relationships, Neo4j may be less efficient than relational databases. Traditional SQL databases typically perform better for workflows centered around aggregations, batch updates, or structured reporting.

2. **Real-Time Data Challenges**

    Neo4j is not ideal for real-time analysis on rapidly changing data. As you will see in this tutorial, running graph algorithms often requires creating **in-memory projections**, which work best with relatively stable snapshots of the data rather than constantly updating streams.

3. **Higher Memory Overhead**

    Finally, graph databases tend to have a higher memory overhead compared to relational SQL databases. This is due to how they store nodes and relationships in memory to enable **fast traversal**, which can increase resource usage—especially for large-scale datasets.

### Conclusion

As with any decision involving database management systems, it's important to carefully analyze your **use cases** before choosing the right tool. Neo4j excels in certain scenarios, but may not be the best fit for others.

A simple roadmap of requirements that might indicate Neo4j is a good choice includes:

- **Unstructured or evolving data**: When the data model is flexible and may change over time.
- **Relationship-focused queries**: When your workload involves exploring or analyzing relationships and traversals (e.g., social networks, recommendations, graphs of dependencies).
- **Stable datasets or snapshot-based analysis**: When the data is relatively stable, or when it's acceptable to analyze it using periodic snapshots rather than in real-time.


# Installation & configuration

## Installing Neo4j
If you don't have Docker installed, you can get it from the [Docker website](https://www.docker.com/).

First, in the terminal, pull the Neo4j image from Docker:

`docker pull neo4j`

Now, create a Neo4j instance using the provided `docker-compose.yml` file:

`docker compose up -d`

You can now access the Neo4j Browser at [http://localhost:7474](http://localhost:7474), although this is not required for this tutorial.

The default username is `neo4j` and the default password is `password`.



```python
# Install the Neo4j Python driver
!pip install neo4j

# Install the library used to visualize the graph
!pip install yfiles_jupyter_graphs_for_neo4j
```

```python
# Loading the libraries
from neo4j import GraphDatabase

# Library to visualize the graph
from yfiles_jupyter_graphs_for_neo4j import Neo4jGraphWidget

# Library to load the data from the JSON file
import json

# Connecting to the Neo4j database
driver = GraphDatabase.driver(uri="bolt://localhost:7687", auth=("neo4j", "password"))
session = driver.session()

# Creating the graph instance in order to visualize the graph
g = Neo4jGraphWidget(driver)
```

# Dataset

First, let's clear the existing database.

```python
session.run("""
    MATCH (n)
    DETACH DELETE n
""")
```

In Cypher, ***MATCH (n)*** selects every node in the database. Since some nodes may have relationships, we use ***DETACH DELETE n***, which removes a node's relationships before deleting the node itself.

Secondly, we fetch the data from an external JSON file named ***startups.json***. This file contains structured data that we will use to populate our database. *JSON* (JavaScript Object Notation) is a widely used format for storing and exchanging data (especially in APIs) due to its simplicity. In this case, the data was generated with the help of ChatGPT [<sup>1</sup>](#chatgpt) to create realistic but fake startup and investor information. You can learn more about the JSON format here [<sup>5</sup>](#reference-json).
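The layout of ***startups.json*** is not reproduced in this tutorial. The sketch below shows a shape that is inferred from the keys read by the loading code (`technologies`, `startups`, `name`, `country`, `investors`, `sectors`); the sample values are placeholders, not the actual content of the file.

```python
import json

# Hypothetical sample with the same shape the loader expects.
sample_text = """
{
  "technologies": {
    "AI": {
      "startups": [
        {"name": "ExampleStartup A", "country": "Belgium"},
        {"name": "ExampleStartup B", "country": "France"}
      ]
    }
  },
  "investors": [
    {"name": "Example Investor", "sectors": ["AI", "FinTech"]}
  ]
}
"""

data = json.loads(sample_text)

# Walk the structure the same way the loading cell does.
startup_rows = [
    (startup["name"], startup["country"], tech_name)
    for tech_name, tech_data in data["technologies"].items()
    for startup in tech_data["startups"]
]
investor_rows = [
    (investor["name"], ", ".join(investor["sectors"]))
    for investor in data["investors"]
]

print(startup_rows)
print(investor_rows)
```

Each startup inherits its `technology` from the key it is nested under, and an investor's list of sectors is flattened into a single comma-separated string.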


```python
# Open the json file
with open('startups.json', 'r') as file:
    # Load the file
    data = json.load(file)
    
    # Create startups with their technology
    for tech_name, tech_data in data['technologies'].items():
        for startup in tech_data['startups']:
            # Run the Cypher query to create startups
            session.run("""
                CREATE (s:Startup {
                    name: $name,
                    country: $country,
                    technology: $technology
                })
            """, {
                'name': startup['name'],
                'country': startup['country'],
                'technology': tech_name
            })
    
    # Create investors with their sectors
    for investor in data['investors']:
        # Run the Cypher query to create investors
        session.run("""
            CREATE (i:Investor {
                name: $name,
                sector: $sector
            })
        """, {
            'name': investor['name'],
            'sector': ', '.join(investor['sectors'])
        })
```
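The loader above sends one query per startup. A common alternative is to pass all rows as a single list parameter and let Cypher's `UNWIND` clause iterate over them. The sketch below illustrates the idea; `flatten_startups`, `CREATE_STARTUPS`, and the miniature `example` input are hypothetical names and data, and the final call is left as a comment because it needs a running database.

```python
def flatten_startups(data):
    """Turn the nested JSON structure into a flat list of row dictionaries."""
    return [
        {"name": s["name"], "country": s["country"], "technology": tech_name}
        for tech_name, tech_data in data["technologies"].items()
        for s in tech_data["startups"]
    ]

# One query for all rows: UNWIND iterates over the $rows list parameter.
CREATE_STARTUPS = """
    UNWIND $rows AS row
    CREATE (s:Startup {
        name: row.name,
        country: row.country,
        technology: row.technology
    })
"""

# Hypothetical miniature input with the shape assumed by the loader.
example = {
    "technologies": {
        "AI": {"startups": [{"name": "ExampleStartup A", "country": "Belgium"}]},
        "FinTech": {"startups": [{"name": "ExampleStartup B", "country": "France"}]},
    }
}
rows = flatten_startups(example)
print(rows)

# With a live session this would be: session.run(CREATE_STARTUPS, {"rows": rows})
```

For a dataset of this size the difference is negligible, but batching reduces the number of round trips to the server when many nodes are created.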


### Investment relationships between investors and startups across various sectors.

```python
# AI Sector Investments
session.run("""
    MATCH (i1:Investor {name: 'Elon Musk'}), (s1:Startup {name: 'OpenAI'}), (s2:Startup {name: 'Anthropic'}),
          (s3:Startup {name: 'Adept AI'}), (s4:Startup {name: 'DeepMind'})
    CREATE (i1)-[:INVESTS_IN]->(s1),
           (i1)-[:INVESTS_IN]->(s2),
           (i1)-[:INVESTS_IN]->(s3),
           (i1)-[:INVESTS_IN]->(s4)
""")

session.run("""
    MATCH (i2:Investor {name: 'Andreessen Horowitz'}), (s1:Startup {name: 'OpenAI'}), (s2:Startup {name: 'Cohere'}),
          (s3:Startup {name: 'Hugging Face'}), (s4:Startup {name: 'Stability AI'})
    CREATE (i2)-[:INVESTS_IN]->(s1),
           (i2)-[:INVESTS_IN]->(s2),
           (i2)-[:INVESTS_IN]->(s3),
           (i2)-[:INVESTS_IN]->(s4)
""")

# Aerospace Sector Investments
session.run("""
    MATCH (i7:Investor {name: 'SoftBank'}), (s1:Startup {name: 'SpaceX'}), (s2:Startup {name: 'Blue Origin'}),
          (s3:Startup {name: 'Rocket Lab'}), (s4:Startup {name: 'Relativity Space'})
    CREATE (i7)-[:INVESTS_IN]->(s1),
           (i7)-[:INVESTS_IN]->(s2),
           (i7)-[:INVESTS_IN]->(s3),
           (i7)-[:INVESTS_IN]->(s4)
""")

session.run("""
    MATCH (i8:Investor {name: 'Peter Thiel'}), (s1:Startup {name: 'SpaceX'}), (s2:Startup {name: 'Rocket Lab'})
    CREATE (i8)-[:INVESTS_IN]->(s1),
           (i8)-[:INVESTS_IN]->(s2)
""")

# Cross-Sector Investments
session.run("""
    MATCH (i7:Investor {name: 'SoftBank'}), (s1:Startup {name: 'OpenAI'}), (s2:Startup {name: 'SpaceX'}),
          (s3:Startup {name: 'Tesla'}), (s4:Startup {name: 'Revolut'})
    CREATE (i7)-[:INVESTS_IN]->(s1),
           (i7)-[:INVESTS_IN]->(s2),
           (i7)-[:INVESTS_IN]->(s3),
           (i7)-[:INVESTS_IN]->(s4)
""")

session.run("""
    MATCH (i2:Investor {name: 'Andreessen Horowitz'}), (s1:Startup {name: 'OpenAI'}), (s2:Startup {name: 'Stripe'}),
          (s3:Startup {name: 'Coinbase'}), (s4:Startup {name: 'Tesla'})
    CREATE (i2)-[:INVESTS_IN]->(s1),
           (i2)-[:INVESTS_IN]->(s2),
           (i2)-[:INVESTS_IN]->(s3),
           (i2)-[:INVESTS_IN]->(s4)
""")

session.run("""
    MATCH (i9:Investor {name: 'Tiger Global'}), (s1:Startup {name: 'Stripe'}), (s2:Startup {name: 'Binance'}),
          (s3:Startup {name: 'Tesla'}), (s4:Startup {name: 'Hugging Face'})
    CREATE (i9)-[:INVESTS_IN]->(s1),
           (i9)-[:INVESTS_IN]->(s2),
           (i9)-[:INVESTS_IN]->(s3),
           (i9)-[:INVESTS_IN]->(s4)
""")


# FinTech Sector Investments
session.run("""
    MATCH (i3:Investor {name: 'Sequoia Capital'}), (s1:Startup {name: 'Stripe'}), (s2:Startup {name: 'Revolut'}),
          (s3:Startup {name: 'Klarna'}), (s4:Startup {name: 'Brex'})
    CREATE (i3)-[:INVESTS_IN]->(s1),
           (i3)-[:INVESTS_IN]->(s2),
           (i3)-[:INVESTS_IN]->(s3),
           (i3)-[:INVESTS_IN]->(s4)
""")

session.run("""
    MATCH (i9:Investor {name: 'Tiger Global'}), (s1:Startup {name: 'Stripe'}), (s2:Startup {name: 'Klarna'}),
          (s3:Startup {name: 'Brex'})
    CREATE (i9)-[:INVESTS_IN]->(s1),
           (i9)-[:INVESTS_IN]->(s2),
           (i9)-[:INVESTS_IN]->(s3)
""")

# Electric Vehicle Sector Investments
session.run("""
    MATCH (i10:Investor {name: 'Cathie Wood'}), (s1:Startup {name: 'Tesla'}), (s2:Startup {name: 'Nio'}),
          (s3:Startup {name: 'Rivian'})
    CREATE (i10)-[:INVESTS_IN]->(s1),
           (i10)-[:INVESTS_IN]->(s2),
           (i10)-[:INVESTS_IN]->(s3)
""")

session.run("""
    MATCH (i11:Investor {name: 'Mark Cuban'}), (s1:Startup {name: 'Tesla'}), (s2:Startup {name: 'Lucid Motors'})
    CREATE (i11)-[:INVESTS_IN]->(s1),
           (i11)-[:INVESTS_IN]->(s2)
""")

# Blockchain Sector Investments
session.run("""
    MATCH (i5:Investor {name: 'Binance Labs'}), (s1:Startup {name: 'Binance'}), (s2:Startup {name: 'Ledger'})
    CREATE (i5)-[:INVESTS_IN]->(s1),
           (i5)-[:INVESTS_IN]->(s2)
""")

session.run("""
    MATCH (i12:Investor {name: 'Accel Partners'}), (s1:Startup {name: 'Chainalysis'}), (s2:Startup {name: 'Coinbase'})
    CREATE (i12)-[:INVESTS_IN]->(s1),
           (i12)-[:INVESTS_IN]->(s2)
""")
```



#### Establishing relationships between startups: collaboration, competition, and partnerships.

```python
session.run("""
    MATCH (s1:Startup {name: 'OpenAI'}), (s2:Startup {name: 'Tesla'})
    CREATE (s1)-[:COLLABORATES_WITH]->(s2)
""")

session.run("""
    MATCH (s1:Startup {name: 'Revolut'}), (s2:Startup {name: 'Stripe'})
    CREATE (s1)-[:COLLABORATES_WITH]->(s2)
""")

session.run("""
    MATCH (s1:Startup {name: 'Binance'}), (s2:Startup {name: 'Coinbase'})
    CREATE (s1)-[:COMPETES_WITH]->(s2)
""")

session.run("""
    MATCH (s1:Startup {name: 'Tesla'}), (s2:Startup {name: 'Lucid Motors'})
    CREATE (s1)-[:COMPETES_WITH]->(s2)
""")

session.run("""
    MATCH (s1:Startup {name: 'SpaceX'}), (s2:Startup {name: 'Blue Origin'})
    CREATE (s1)-[:COMPETES_WITH]->(s2)
""")

session.run("""
    MATCH (s1:Startup {name: 'DeepMind'}), (s2:Startup {name: 'Mistral AI'})
    CREATE (s1)-[:PARTNERS_WITH]->(s2)
""")
```

### Cypher query to visualize Startups and Investors [<sup>7</sup>](#vizualisation)

```python
g.show_cypher("MATCH (s)-[r]->(t) RETURN s, r, t")
```
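One detail of the relationship cells above is worth noting: `CREATE` always adds a new relationship, so an investor and startup pair that appears in two cells (for example SoftBank and SpaceX) ends up connected by two parallel `INVESTS_IN` relationships, which can be seen in the visualization. If a single relationship per pair is preferred, one option is Cypher's `MERGE`, which only creates the relationship when it does not already exist. The sketch below is an illustration, not the method used in this tutorial; `unique_pairs` and `MERGE_INVESTMENTS` are hypothetical names.

```python
# A few (investor, startup) pairs taken from the cells above; SoftBank -> SpaceX
# is listed twice on purpose to show the de-duplication.
pairs = [
    ("SoftBank", "SpaceX"),
    ("SoftBank", "Blue Origin"),
    ("Peter Thiel", "SpaceX"),
    ("SoftBank", "SpaceX"),
]

def unique_pairs(pairs):
    """Drop repeated pairs while keeping the original order."""
    return list(dict.fromkeys(pairs))

# MERGE creates the relationship only if the same pattern does not exist yet.
MERGE_INVESTMENTS = """
    UNWIND $rows AS row
    MATCH (i:Investor {name: row.investor}), (s:Startup {name: row.startup})
    MERGE (i)-[:INVESTS_IN]->(s)
"""

rows = [{"investor": i, "startup": s} for i, s in unique_pairs(pairs)]
print(rows)

# With a live session this would be: session.run(MERGE_INVESTMENTS, {"rows": rows})
```

Keep in mind that switching to `MERGE` changes the graph, and therefore the PageRank scores and communities computed later, since parallel relationships would no longer be present.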

# PageRank algorithm

The PageRank algorithm ranks the nodes in a graph based on their influence. It is a recursive algorithm in which a node’s score depends on the scores of the nodes linking to it, as well as the number of other nodes those linking nodes connect to. This algorithm is implemented in the ***graph-data-science plugin***, and we will break down its core functionalities.
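The recursive definition above can be sketched in a few lines of plain Python using power iteration. This is a simplified textbook version for illustration, not the plugin's implementation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the toy graph are all assumptions, and the scores returned by the plugin may be scaled differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing links is spread over all nodes.
        dangling = sum(score[n] for n in nodes if not out_links[n])
        new_score = {
            n: (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                # Each node shares its score equally among the nodes it links to.
                new_score[target] += damping * score[source] / len(out_links[source])
        score = new_score
    return score

# Toy graph (hypothetical): A and B both point to C, and C points back to A.
scores = pagerank([("A", "C"), ("B", "C"), ("C", "A")])
for name, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```

In this toy graph, C ranks first because two nodes link to it, A comes second because it receives all of C's score, and B comes last because nothing links to it.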

The first step is to create an **in-memory projection** of the graph. The primary goal of this projection is to streamline the graph and isolate it from live data, allowing for faster execution times by working on a simplified version of the graph. To achieve this, we first need to retrieve the labels of our nodes and the possible relationships between them.

```python
labelsResponse = session.run("""CALL db.labels()""")
labels = labelsResponse.data()
print(labels)

relationshipResponse = session.run("CALL db.relationshipTypes()")
relationships = relationshipResponse.data()
print(relationships)
```

Then, by using the labels and relationship types, we can create the projection.

```python
projection_query = """
CALL gds.graph.project(
  'generalProjection', 
  ['Startup', 'Investor'], 
  {
    INVESTS_IN: {},
    COLLABORATES_WITH: {},
    COMPETES_WITH: {}
  }
)
"""
session.run(projection_query)
```

Now, we can run the PageRank algorithm using **stream mode**. In this mode, the algorithm computes a score for each node, allowing us to post-process the results without modifying the underlying data. Additionally, we limit the query to return only the top 10 nodes.

```python
pagerankGeneralQuery = """
    CALL gds.pageRank.stream('generalProjection')
    YIELD nodeId, score
    RETURN gds.util.asNode(nodeId).name AS name, score
    ORDER BY score DESC
    LIMIT 10
"""
pagerankGeneralRes = session.run(pagerankGeneralQuery)
for record in pagerankGeneralRes:
    print(f"Name: {record['name']}, Score: {record['score']}")
```

Note that we can filter the type of relationship in the projection. For example, if we want to find the most influential node with regard to only the INVESTS_IN relationships, we can use the following projection.

```python
# Drop the graph 'filteredProjection' if it exists already.
# We use YIELD graphName to avoid using deprecated return values like 'schema'.
session.run("""
    CALL gds.graph.drop('filteredProjection', false)
    YIELD graphName
""")

# Define the filtered projection of the graph.
filteredProjectionQuery = """
  CALL gds.graph.project(
    'filteredProjection',
    ['Startup', 'Investor'],
    {
      INVESTS_IN: {}
    }
  )
"""

# Define the PageRank query on the filtered projection.
# It streams the results directly (rather than writing them to the graph), then returns the top 10 nodes with the highest PageRank scores.
pagerankFilteredQuery = """
  CALL gds.pageRank.stream('filteredProjection')
  YIELD nodeId, score
  RETURN gds.util.asNode(nodeId).name AS name, score
  ORDER BY score DESC
  LIMIT 10
"""

# Execute the graph projection query to create the filtered graph.
session.run(filteredProjectionQuery)

# Run the PageRank algorithm on the filtered projection and print results.
pagerankFilteredRes = session.run(pagerankFilteredQuery)
for record in pagerankFilteredRes:
    print(f"Name: {record['name']}, Score: {record['score']}")
```


Similarly, we can filter nodes based on their labels. For example, if we want to focus only on nodes with the label Startup, we can create a projection like this:

```python
# Drop the graph 'filteredProjection2' if it exists already.
# We use YIELD graphName to avoid using deprecated return values like 'schema'.
session.run("""
    CALL gds.graph.drop('filteredProjection2', false)
    YIELD graphName
""")

# Create 'filteredProjection2' with only Startup nodes and multiple relationship types.
filteredProjectionQuery2 = """
  CALL gds.graph.project(
    'filteredProjection2', 
    ['Startup'], 
    {
      INVESTS_IN: {},
      COLLABORATES_WITH: {},
      COMPETES_WITH: {}
    }
  )
"""

# Define the PageRank query that ranks the startups in 'filteredProjection2'.
pagerankFilteredQuery2 = """
  CALL gds.pageRank.stream('filteredProjection2')
  YIELD nodeId, score
  RETURN gds.util.asNode(nodeId).name AS name, score
  ORDER BY score DESC
  LIMIT 10
"""

# Run the graph projection to create 'filteredProjection2'.
session.run(filteredProjectionQuery2)


# Execute PageRank and print top 10 ranked startups by score.
pagerankFilteredRes2 = session.run(pagerankFilteredQuery2)
for record in pagerankFilteredRes2:
    print(f"Name: {record['name']}, Score: {record['score']}")
```
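When a projection keeps only some node labels, a relationship can only be kept if both of its endpoints are kept. To our understanding, this means that in `filteredProjection2` the `INVESTS_IN` relationships cannot contribute, since they all start at `Investor` nodes. The plain-Python sketch below mimics this filtering on a small excerpt of the tutorial graph; the `project` helper is a hypothetical stand-in, not the plugin's code.

```python
def project(nodes, relationships, labels, rel_types):
    """Keep nodes with a selected label and relationships between kept nodes."""
    kept_nodes = {name for name, label in nodes.items() if label in labels}
    kept_rels = [
        (source, rel_type, target)
        for source, rel_type, target in relationships
        if rel_type in rel_types and source in kept_nodes and target in kept_nodes
    ]
    return kept_nodes, kept_rels

# A small excerpt of the tutorial graph.
nodes = {
    "Elon Musk": "Investor",
    "OpenAI": "Startup",
    "Tesla": "Startup",
    "Lucid Motors": "Startup",
}
relationships = [
    ("Elon Musk", "INVESTS_IN", "OpenAI"),
    ("OpenAI", "COLLABORATES_WITH", "Tesla"),
    ("Tesla", "COMPETES_WITH", "Lucid Motors"),
]

kept_nodes, kept_rels = project(
    nodes,
    relationships,
    labels={"Startup"},
    rel_types={"INVESTS_IN", "COLLABORATES_WITH", "COMPETES_WITH"},
)
print(sorted(kept_nodes))
print(kept_rels)
```

Only the startup-to-startup relationships survive, so the ranking on this projection reflects collaboration and competition links rather than investments.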


There are numerous options available to fine-tune and extend the PageRank algorithm. For more details, you can refer to the Neo4j documentation [<sup>9</sup>](#docpr).

## Use cases

The **PageRank algorithm**, initially designed by Google to rank web pages, can be applied to any problem in which the data can be represented as a graph. Its core algorithm can be customized to suit various types of data and domains.

For example, it has been used in the following work:
- [Prediction of influential nodes in social networks based on local communities and users’ reaction information](https://www.nature.com/articles/s41598-024-66277-6)

# Louvain algorithm

The **Louvain algorithm** is used for community detection in graphs by maximizing a metric called modularity. It iteratively groups nodes into communities, ensuring that nodes within the same community are densely connected, while connections between different communities remain sparse. This algorithm is also implemented in the ***graph-data-science plugin***.
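To give a feel for the metric being maximized, the sketch below computes the modularity of a given partition for a small undirected, unweighted graph. It only evaluates a partition and does not implement the Louvain search itself; the `modularity` function and the toy graph are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}   # number of edges with both ends inside each community
    degree = {}     # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Fraction of edges inside each community minus the fraction expected at random.
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
all_together = {node: 0 for node in by_triangle}

print(round(modularity(edges, by_triangle), 3))
print(round(modularity(edges, all_together), 3))
```

Splitting the graph into its two triangles yields a clearly positive modularity, while putting every node into one community yields zero, which is why the algorithm favours the first grouping.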

The Louvain algorithm also operates on an **in-memory graph projection**. In this tutorial, we will reuse the projections created for the PageRank algorithm. Like PageRank, Louvain offers various execution modes, and to maintain consistency with our previous approach, we will use stream mode.

We create a query that orders the communities by the number of nodes they contain. The query also lists the nodes in each community and limits the output to the top 5 communities.

```python
louvainGeneralQuery = """
     CALL gds.louvain.stream('generalProjection')
     YIELD nodeId, communityId
     WITH communityId, 
          collect(gds.util.asNode(nodeId).name) AS nodes, 
          count(*) AS communitySize
     ORDER BY communitySize DESC
     LIMIT 5
     RETURN communityId AS community, communitySize, nodes
"""

louvainGeneralRes = session.run(louvainGeneralQuery)
for record in louvainGeneralRes:
    print(f"community: {record['community']}, communitySize: {record['communitySize']}, nodes: {record['nodes']}")
```


```python
louvainFilteredQuery = """
     CALL gds.louvain.stream('filteredProjection')
     YIELD nodeId, communityId
     WITH communityId, 
          collect(gds.util.asNode(nodeId).name) AS nodes, 
          count(*) AS communitySize
     ORDER BY communitySize DESC
     LIMIT 5
     RETURN communityId AS community, communitySize, nodes
"""

louvainFilteredRes = session.run(louvainFilteredQuery)
for record in louvainFilteredRes:
    print(f"community: {record['community']}, communitySize: {record['communitySize']}, nodes: {record['nodes']}")
```

```python
louvainFilteredQuery2 = """
     CALL gds.louvain.stream('filteredProjection2')
     YIELD nodeId, communityId
     WITH communityId, 
          collect(gds.util.asNode(nodeId).name) AS nodes, 
          count(*) AS communitySize
     ORDER BY communitySize DESC
     LIMIT 5
     RETURN communityId AS community, communitySize, nodes
"""

louvainFilteredRes2 = session.run(louvainFilteredQuery2)
for record in louvainFilteredRes2:
    print(f"community: {record['community']}, communitySize: {record['communitySize']}, nodes: {record['nodes']}")
```

There are also numerous options available for configuring the Louvain algorithm. You can find more details in the Neo4j Graph Data Science documentation[<sup>10</sup>](#doclouvain).

## Use cases

Like PageRank, the Louvain community detection method can be applied to any problem in which the data can be represented as a graph. Its core algorithm can also be adapted to the data it handles.

For example, it has previously been used to:
- [Analyze the hierarchical modularity in human brain functional networks](https://www.frontiersin.org/journals/neuroinformatics/articles/10.3389/neuro.11.037.2009/full);
- [Track the Evolution of Communities in Dynamic Social Networks](http://derekgreene.com/papers/greene10tracking.pdf);
- [Improve profile search results on social media applications](http://jonathanhaynes.org/files/mappingsearchrelevancetosocialnetworks.pdf).

# Cross-analysis

In both the **general** and **filtered** projections, we can see a clear relationship between **PageRank** and **communities**:  

1. **High PageRank often aligns with large or central communities** – Startups like *Lucid Motors, Tesla, and Stripe* score high because they belong to well-connected groups with multiple investors and industry ties. However, community size can sometimes affect PageRank disproportionately, as seen with Tesla compared to Rivian and Nio.

2. **Filtering the graph reduces the impact of wide connections** – *OpenAI*, which was central in the general network, drops out in the filtered version, showing that its PageRank depended on relationships beyond direct startup-investor links.  

3. **PageRank captures the "bridge" effect** – Some companies like *Coinbase* and *Stripe* have high scores because they connect different investment groups, reinforcing their influence beyond just being part of a single strong community.  

This shows that **PageRank and community structure are correlated**, but the strength of a node's connections (connecting different groups) can sometimes be more important than just the size of the community.

# Real-World Use Cases[<sup>11</sup>](#usecases)

## Common applications

Neo4j (and graph databases in general) are commonly used in a variety of fields where data is highly interconnected. These applications are particularly well-suited for graph databases because of their ability to efficiently manage and query complex relationships.

### 1. Knowledge graphs

A **knowledge graph** is a data model which represents information as a network of entities (nodes) and the relationships between them (edges). It is used to make complex data more manageable and easier to understand. 

With Neo4j, building knowledge graphs becomes more intuitive, as real-world relationships map naturally onto the graph structure. The graph also takes few lines of code to build and is quick to update thanks to its flexible data schema.
Additionally, tools like [Neo4j's Knowledge Graph Builder](https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/) help users quickly build knowledge graphs from their data.

### 2. Generative AI

Generative AI systems (such as large language models) often struggle with reasoning. The structured nature of knowledge graphs powered by **Neo4j** gives these systems more efficient access to relevant contextual data. AI-generated responses can therefore become more accurate, as they are grounded in factual data.

By incorporating user preferences and histories, which can be easily integrated into the knowledge graphs, these systems can also generate more personalized responses.

Example (paper): [The Growing Importance of Graph Databases for AI](https://media.bitpipe.com/io_31x/io_316186/item_2812073/ESG-WP-AWS-Neptune-Jun-2024.pdf).

### 3. Fraud detection

When committing bank fraud, fraudsters often operate in rings, which are highly connected networks. Graph databases make fraud-pattern recognition more efficient than SQL databases since the relationships between transactions and individuals can be transposed to a graph, which can then be analyzed to detect suspicious activity.

Example (paper): [Enhancing fraud detection in banking by integration of graph databases with machine learning](https://www.sciencedirect.com/science/article/pii/S2215016124001377).
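As a small illustration of why connected structure matters here, the sketch below groups accounts that are linked, directly or indirectly, through shared identifiers such as a phone number or an address. It is a toy example in plain Python using a union-find structure, not a production fraud-detection method; the account data and the `find_rings` helper are hypothetical.

```python
from collections import defaultdict

def find_rings(accounts):
    """Group accounts that are linked, directly or indirectly, by shared identifiers."""
    parent = {name: name for name in accounts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Link every account to the first account seen with the same identifier.
    first_owner = {}
    for name, identifiers in accounts.items():
        for identifier in identifiers:
            if identifier in first_owner:
                parent[find(name)] = find(first_owner[identifier])
            else:
                first_owner[identifier] = name

    groups = defaultdict(set)
    for name in accounts:
        groups[find(name)].add(name)
    # Only groups with more than one account are worth a closer look.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)

# Hypothetical accounts with the phone numbers and addresses they registered.
accounts = {
    "acc1": {"phone:111", "addr:A"},
    "acc2": {"phone:222", "addr:A"},
    "acc3": {"phone:222", "addr:B"},
    "acc4": {"phone:999", "addr:Z"},
}
print(find_rings(accounts))
```

Here `acc1` and `acc3` share nothing directly, yet they end up in the same group through `acc2`. This kind of indirect link is natural to follow in a graph and requires chained self-joins in a relational model.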

### 4. Identity & Access management

Managing roles and access to various data sets is easily done by using graph databases since relationships between entities are represented as graph edges. Using the Cypher query language, users can easily add or remove permissions and roles dynamically, allowing managers to have fast and flexible access control.

Example (paper): [Comparison of Access Control Approaches for Graph-Structured Data](https://www.researchgate.net/publication/381109230_Comparison_of_Access_Control_Approaches_for_Graph-Structured_Data/link/665d6e1dbc86444c7229431a/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIn19)

### 5. Master data management

Neo4j offers a master data management solution that unifies your data into a single "*360° view*". For example, customer, product, supplier, and logistics data can be brought together to draw insights across datasets. 

Example (white paper): [Rethink Your Master Data - How Connections Will Define the Future of MDM](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j-Master-Data-Management-white-paper-EN-US.pdf)

### 6. Network and IT operations

Graph databases are well suited to integrating network and IT assets, which helps with troubleshooting and analyzing networks. A graph database makes it possible to connect different monitoring tools, allowing users to improve the management of their networks.

Example (white paper): [How Graph Databases Solve Problems in Network & Data Center Management: a Close Look at Two Deployments](https://we-yun.com/doc/neo4j-book/%E5%9B%BE%E6%95%B0%E6%8D%AE%E5%BA%93%E4%BA%94%E5%A4%A7%E5%BA%94%E7%94%A8%E6%A1%88%E4%BE%8B/EMA_Neo4j_Two_Deployments-WP.pdf)

### 7. Real-time recommendations

Graph databases can edit, traverse, and extract connected data quickly enough to support real-time recommendation processing. Additionally, since they efficiently handle complex and deep relationships between entities, they make recommendations like "People who bought this also bought..." possible. 

Example (white paper): [Powering Real-Time Recommendations with Graph Database Technology](https://go.neo4j.com/rs/710-RRC-335/images/Neo4j_WP_Recommendations_EN_BUS.pdf?_ga=2.216021586.597232194.1522833081-1282627512.1522833081)
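The "People who bought this also bought..." pattern is essentially a two-hop traversal: from a product to the customers who bought it, then to the other products those customers bought. The following plain-Python sketch illustrates the idea on made-up data; the `purchases` dictionary and the `also_bought` helper are hypothetical and stand in for what would be a graph query in practice.

```python
from collections import Counter

# Hypothetical purchase relationships: customer -> set of products bought.
purchases = {
    "ann": {"laptop", "mouse", "bag"},
    "ben": {"laptop", "mouse"},
    "cat": {"laptop", "bag", "dock"},
    "dan": {"phone", "case"},
}

def also_bought(purchases, product, top_n=3):
    """Rank other products bought by customers who bought the given product."""
    counts = Counter()
    for basket in purchases.values():
        if product in basket:
            counts.update(basket - {product})
    # Sort by count (descending), then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

print(also_bought(purchases, "laptop"))
```

Products bought by customers with no overlap, such as `phone` and `case` here, never appear in the result because no path connects them to the starting product.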

### 8. Data privacy, risk and compliance

Neo4j's [*Privacy-Shield*](https://neo4j.com/use-cases/gdpr-compliance/?ref=web-solutions-privacy-risk-compliance) helps companies comply with the EU's [General Data Protection Regulation (GDPR)](https://europa.eu/youreurope/business/dealing-with-customers/data-protection/data-protection-gdpr/index_en.htm), which requires companies to control and manage customer data. 

It connects personal data and tracks:
- The location of private information;
- Which systems and apps use the data;
- How and when personal data is used;
- Who views and uses the data;
- What permissions you have to use the data, and when and how they were obtained;
- Where and when personal data moves.

Regulations from other jurisdictions (such as the [California Consumer Privacy Act](https://oag.ca.gov/privacy/ccpa)) are also covered, but the GDPR is the most relevant in this context.

### 9. Supply chain management

Supply chains are complex networks, making graph databases particularly well-suited for managing them. With graph databases, it is then possible to anticipate shifts in demand, predict product handling costs, and adapt to new compliance standards.

Example (paper): [Graph Database to Enhance Supply Chain Resilience for Industry 4.0](https://pdfs.semanticscholar.org/88cf/a5f142b2656090010fdc212260e03098a6f2.pdf)

## Industry use cases

Many industry uses of Neo4j are a combination of the common uses mentioned previously.

Industries use Neo4j for the following applications:

1. Financial services:
    - Detect fraud rings (analyze networks of transactions);
    - Model ever-changing complex assets (track dependencies between entities);
    - Secure user data (identity & access management).
    
    Notable users: [UBS](https://www.ubs.com/us/en.html), [Cerved](https://www.cerved.com/en), [Royal Bank of Scotland](https://www.rbs.co.uk/).

2. Government:
    - Connect different data records in criminal investigations (connecting separate datasets);
    - Manage the equipment of governmental staff (track resources and supply chains);
    - Detect failure causes (track dependencies between entities).
    
    Notable users: [US Army](https://www.army.mil/), [IQT](https://www.iqt.org/), [MITRE](https://www.mitre.org/), [Lockheed Martin Space](https://www.lockheedmartin.com/), [NASA](https://www.nasa.gov/).

3. Healthcare & life sciences:
    - Model connections between genes, proteins, cells and tissues (create and analyze knowledge graphs);
    - Model molecules (create and analyze knowledge graphs);
    - Map patients' journeys (create and analyze knowledge graphs).
    
    Notable users: [Novartis](https://www.novartis.com/), [Boston Scientific](https://www.bostonscientific.com/en-US/home.html), [ChemAxon](https://chemaxon.com/), [Bayer (Monsanto)](https://www.bayer.com/en/).

4. Retail:
    - Provide real-time product recommendations (updating knowledge graphs dynamically);
    - Change prices dynamically (updating knowledge graphs dynamically);
    - Optimize delivery routing (updating knowledge graphs dynamically).
    
    Notable users: [Walmart](https://www.walmart.com/), [eBay](https://www.ebay.com/), [Adidas](https://www.adidas-group.com/en/), [Transparency-One](https://www.transparency-one.com/).

5. Telecommunications:
    - Model customer-related graphs (create and analyze knowledge graphs);
    - Model communication networks (create and analyze knowledge graphs).
    
    Notable users: [Comcast](https://corporate.comcast.com/), [Telenor](https://www.telenor.com/), [Cisco](https://www.cisco.com/).


## Cloud partners

[Neo4j's AuraDB](https://neo4j.com/product/auradb/) is also offered on the cloud platforms of some of the world's largest companies, namely [Amazon Web Services (AWS)](https://neo4j.com/cloud/aura-aws/), [Microsoft Azure](https://neo4j.com/partners/microsoft/), and [Google Cloud](https://neo4j.com/partners/google/).

The integration of Neo4j into these cloud services allows developers around the world to create AI models, analyze complex data, and build real-time applications.

# Conclusion

Throughout this tutorial, we explored Neo4j's capabilities and learned several key lessons. We started by loading data from a **JSON file**, demonstrating how Neo4j can easily integrate with simple data formats and transform structured data into a graph representation. Neo4j's storage of data as nodes and relationships proved ideal for representing our startup ecosystem, while Cypher provided intuitive syntax for creating and querying the network.

Our implementation of **PageRank** revealed influential nodes in our startup network, showing how companies like *Tesla* and *Lucid Motors* gain influence through their investor connections. The **Louvain algorithm** helped detect natural communities, demonstrating how startups group around similar technologies and investors. **Cross-analysis** of both algorithms showed how a node's influence often correlates with its community structure, highlighting the ability to perform **statistical analysis** using graph databases.
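
The cross-analysis idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used earlier in the tutorial: the `rows` list, its node names, scores, and community ids are all hypothetical placeholders standing in for the combined output of the PageRank and Louvain runs, and `summarize_by_community` is a helper name introduced here only for the example.

```python
from collections import defaultdict

# Hypothetical (node name, PageRank score, Louvain community id) rows.
# These values are made up for illustration; they are not tutorial results.
rows = [
    ("startup_a", 0.90, 0),
    ("startup_b", 0.30, 0),
    ("investor_c", 0.60, 1),
    ("startup_d", 0.20, 1),
    ("startup_e", 0.15, 1),
]


def summarize_by_community(rows):
    """Return {community_id: (top_node_name, mean_score)}."""
    groups = defaultdict(list)
    for name, score, community in rows:
        groups[community].append((score, name))

    summary = {}
    for community, members in groups.items():
        # max() compares tuples by score first, so it picks the top-ranked node.
        _, top_name = max(members)
        mean_score = sum(score for score, _ in members) / len(members)
        summary[community] = (top_name, mean_score)
    return summary


summary = summarize_by_community(rows)
for community, (top_name, mean_score) in sorted(summary.items()):
    print(f"community {community}: top node {top_name}, mean score {mean_score:.2f}")
```

Grouping scores by community in this way makes it easy to see which node dominates each community and whether some communities carry more overall influence than others.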

While Neo4j excels with relationship-heavy data, it may not be the best choice for purely transactional workloads. The right tool depends on the specific use case, as there is no one-size-fits-all solution.

This tutorial covers everything from initial data loading to advanced algorithm implementation, giving you a solid foundation for working with graph databases and analyzing connected data.

# References

1. <a id="neo4jVSsql"></a> [Quora 2023 Sumit Sutariya - Pros and Cons of using Graph Databases compared to Traditional Relational Databases](https://www.quora.com/What-are-the-pros-and-cons-of-using-graph-databases-compared-to-traditional-relational-databases-in-modern-web-development)

2. <a id="diffgraphandrelationaldb"></a> [Amazon - Difference between Graph and Relational Database](https://aws.amazon.com/fr/compare/the-difference-between-graph-and-relational-database/)

3. <a id="graphdbscalelimitation"></a> [Thatdot Rob Malnati - Scale Limitations of Graph Databases](https://www.thatdot.com/blog/understanding-the-scale-limitations-of-graph-databases/#:~:text=Graph%20databases%20are%20great%20at,on%20streaming%20data%20are%20desired.)

4. <a id="whygraphdb"></a> [NEBULAGRAPH 2023 Min.WU](https://www.nebula-graph.io/posts/why-use-graph-databases#:~:text=Graph%20databases%20provide%20a%20flexible,based%20on%20the%20collected%20insights.)

5. <a id="reference-json"></a> [Wikipedia - JSON](https://en.wikipedia.org/wiki/JSON)

6. <a id="chatgpt"></a> [ChatGPT](https://chatgpt.com/)

7. <a id="vizualisation"></a> [Nodes 2024 – Advanced Graph Visualizations in Jupyter Notebooks](https://neo4j.com/videos/nodes-2024-advanced-graph-visualizations-in-jupyter-notebooks/)

8. <a id="Neo4Jdoc"></a> [Neo4j - Official documentation](https://neo4j.com/docs/getting-started/)

9. <a id="docpr"></a> [PageRank - Official documentation](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/)

10. <a id="doclouvain"></a> [Louvain - Official documentation](https://neo4j.com/docs/graph-data-science/current/algorithms/louvain/)

11. <a id="usecases"></a> [Neo4j's use cases (and their respective white papers)](https://neo4j.com/use-cases/)


# Process and Work Distribution

Mehdi researched and selected the subject for the project (with input from the other members), distributed the tasks equitably, and then set deadlines for the tutorial:

- **Jupyter Creation, Introduction, Dataset, and Conclusion**: Mehdi
- **PageRank Implementation, Benefits and Drawbacks of Database Technology (Comparing with Relational Databases)**: Nicolas
- **Louvain Algorithm for Communities**: Maxim
- **Use cases and Real-World Examples in Different Industries**: Andreas

After completing our respective sections, each member reviewed and improved on the work of others to ensure the overall correctness and coherence of the tutorial.
