# Neo4j Twitter Trolls Tutorial

**Goal**: This notebook shows how to use PyGraphistry to visualize data from [Neo4j](https://neo4j.com/developer/). It also shows how to run [graph algorithms in Neo4j](https://neo4j.com/developer/graph-algorithms/) and visualize their results with PyGraphistry.

*Prerequisites:*
* You'll need a Graphistry API key, which you can request [here](https://www.graphistry.com/api-request).
* Neo4j. This tutorial uses [Neo4j Sandbox](https://neo4j.com/sandbox-v2/) (free hosted Neo4j instances pre-populated with data), specifically the "Russian Twitter Trolls" sandbox. You can create a Neo4j Sandbox instance [here](https://neo4j.com/sandbox-v2/).
* Python requirements:
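  * [`neo4j-driver`](https://github.com/neo4j/neo4j-python-driver) - `pip install neo4j-driver`
  * [`pygraphistry`](https://github.com/graphistry/pygraphistry/) - `pip install "graphistry[all]"`
  * `python-dotenv`, `beautifulsoup4`, and `matplotlib`, which later cells import - `pip install python-dotenv beautifulsoup4 matplotlib`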
  

## Outline

* Connecting to Neo4j 
  * using neo4j-driver Python client
  * query with Cypher
* Visualizing data in Graphistry from Neo4j 
  * User-User mentions from Twitter data
* Graph algorithms
  * Enhancing our visualization with PageRank

In [1]:
# import required dependencies
from neo4j import GraphDatabase, basic_auth
from pandas import DataFrame
import graphistry

In [3]:
# To specify the Graphistry account & server, use:
import os

# Read the personal key ID and secret from environment variables.
# Avoid printing them: notebook outputs are often shared along with the code.
AUTH_GRAPH = (os.getenv("GRAPH_NAME"), os.getenv("GRAPH_PASSWORD"))
graphistry.register(api=3, protocol='https', server='hub.graphistry.com',
                    personal_key_id=AUTH_GRAPH[0],
                    personal_key_secret=AUTH_GRAPH[1])

# For more options, see https://github.com/graphistry/pygraphistry#configure


## Connect To Neo4j

If you haven't already, create an instance of the Russian Twitter Trolls sandbox on [Neo4j Sandbox](https://neo4j.com/sandbox-v2/). We'll use the [Python driver for Neo4j](https://github.com/neo4j/neo4j-python-driver) to fetch data from Neo4j. To do this we need to instantiate a `Driver` object, passing in the credentials for our Neo4j instance.

If you are using Neo4j Sandbox, you can find the credentials for your instance in the "Details" tab. Specifically, we need the IP address, Bolt port, username, and password. Bolt is the binary protocol used by the Neo4j drivers, so a typical database URL string takes the form `bolt://<IP_ADDRESS>:<BOLT_PORT>`.
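As a minimal sketch of the URL format described above, the helper below assembles a Bolt connection string and parses it back to confirm its parts. The function name `build_bolt_uri` and the address `203.0.113.10` are placeholders for illustration, not values from the sandbox; 7687 is the default Bolt port.

```python
from urllib.parse import urlparse


def build_bolt_uri(host: str, port: int = 7687) -> str:
    """Assemble a Bolt connection string of the form bolt://<host>:<port>."""
    if not host:
        raise ValueError("host must be a non-empty string")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"bolt://{host}:{port}"


# Placeholder address from a documentation-only IP range; use your own instance's values
uri = build_bolt_uri("203.0.113.10", 7687)

# Parse the string back to check the scheme, host, and port
parsed = urlparse(uri)
print(parsed.scheme, parsed.hostname, parsed.port)
```

The resulting string is what you would store in the `URI` environment variable read by the next cell.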


In [2]:
import os
from dotenv import load_dotenv

# Instantiate the Neo4j driver.
# Set URI, AUTH_NAME, and AUTH_PASSWORD in your environment or .env file
# to the connection string and credentials of your own instance.
load_dotenv()
URI = os.getenv("URI")
AUTH = (os.getenv("AUTH_NAME"), os.getenv("AUTH_PASSWORD"))
print(URI)  # print only the URI, never the credentials
driver = GraphDatabase.driver(URI, auth=AUTH, max_connection_pool_size=20)

bolt://54.89.172.229:7687


Once we've instantiated our Driver, we can use `Session` objects to execute queries against Neo4j. Here we'll use `session.run()` to execute a [Cypher query](https://neo4j.com/developer/cypher-query-language/). Cypher is the query language for graphs that we use with Neo4j (you can think of Cypher as SQL for graphs).

In [10]:
# neo4j-driver hello world
# execute a simple query to count the number of nodes in the database and print the result
with driver.session() as session:
    results = session.run("MATCH (a) RETURN COUNT(a) AS num")
    print(results.data())

[{'num': 65071}]


If we inspect the data model in Neo4j, we can see that we have information about Tweets and, specifically, about Users mentioned in Tweets.


Let's use Graphistry to visualize User-User Tweet mention interactions. We'll do this by querying Neo4j for all tweets that mention users.

## Using Graphistry With Neo4j

Currently, PyGraphistry can work with data as a pandas DataFrame, a NetworkX graph, or an igraph graph object. In this section we'll show how to load data from Neo4j into PyGraphistry by converting results from the Python Neo4j driver into a pandas DataFrame.

Our goal is to visualize User-User Tweet mention interactions. We'll create two pandas DataFrames, one representing our nodes (Users) and a second representing the relationships in our graph (mentions).

Each user node carries a `community` property, so we return it alongside `screen_name` and keep only the users for which it is set. This property will be used in our visualization to set the color of the nodes.

In [51]:
# Create the User DataFrame by querying Neo4j and converting the results into a pandas DataFrame
with driver.session() as session:
    results = session.run("""
    MATCH (u:User)
    WITH u.screen_name AS screen_name, u.community AS community
    // after WITH only the projected names are in scope, so filter on the alias
    WHERE community IS NOT NULL
    RETURN screen_name, community""")
    users = DataFrame(results.data())
# show the first 5 rows of the DataFrame
users[:5]

|   | screen_name | community |
|---|---|---|
| 0 | thiccyth0t | 0 |
| 1 | TheFlowHorse | 0 |
| 2 | CL207 | 0 |
| 3 | zoomerfied | 0 |
| 4 | Vida_BWE | 0 |


In [4]:
#users.to_csv("graph/nodes.csv", index=False)
import pandas as pd
users = pd.read_csv("app(server)/graph/nodes.csv")

Next, we need some relationships to visualize. In this case we are interested in user interactions, specifically cases where one user has mentioned another user in a Tweet.
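To make the shape of such an edge list concrete, here is a small sketch that aggregates raw mention pairs into rows with `u1`, `u2`, and `num` (the number of times `u1` mentioned `u2`). The sample pairs are invented for illustration and do not come from the dataset.

```python
from collections import Counter

# Hypothetical raw rows: one (author, mentioned user) pair per mention found in a tweet
raw_mentions = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
]

# Count how often each ordered pair occurs, then emit one edge row per pair
pair_counts = Counter(raw_mentions)
edge_rows = [
    {"u1": u1, "u2": u2, "num": num}
    for (u1, u2), num in sorted(pair_counts.items())
]

for row in edge_rows:
    print(row)
```

A list of dictionaries like `edge_rows` can be passed straight to `DataFrame(...)`, in the same way the driver's `results.data()` output is converted elsewhere in this notebook.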

In [3]:
# Query the profile attributes of every User node and load them into a DataFrame.
# Note: despite the variable name, this result holds one row per user, not u1 -> u2 mention pairs;
# the edge list that is bound as u1 -> u2 for plotting is loaded from edges.csv in a later cell.
with driver.session() as session:
    results = session.run("""
        MATCH (n:User)
        RETURN n.screen_name AS screen_name,
               n.name AS name,
               n.description AS description,
               n.keywords AS keywords,
               n.categories AS categories,
               n.is_crypto_related AS is_crypto_related
    """)
    mentions = DataFrame(results.data())
mentions[:5]

|   | screen_name | name | description | keywords | categories | is_crypto_related |
|---|---|---|---|---|---|---|
| 0 |  |  |  |  |  |  |
| 1 |  |  |  |  |  |  |
| 2 | thiccyth0t | thiccy | cofounder @ScimitarCapital | [ScimitarCapital, crypto coin, DefiSquared, cr... | [trader, investor, analyst] | True |
| 3 | TheFlowHorse | HORSE 🏴‍☠️ | Ex Prop, now pajama trader. Addison Capital De... | [Bitcoin, BTC, Ethereum, ETH, crypto markets, ... | [trader, investor, analyst] | True |
| 4 | CL207 | CL | ex nikkei, gold trader @ Bank of Japan, now VR... | [Metaverse, binance, btc, usdt, hyperliquid, Q... | [trader, investor, analyst] | True |


In [4]:
mentions['url'] = 'https://user-tweets-991943d5bae2b44ccfb0a711279c8720.s3.us-east-1.amazonaws.com/graph/' + mentions['screen_name'] + '.json'
mentions = mentions[mentions['screen_name'].notna() & (mentions['screen_name'] != '')]

In [6]:
mentions = pd.read_csv("app(server)/graph/edges.csv")

Now we can visualize this mentions network using Graphistry. We'll specify the nodes and relationships for our graph, and use the `community` property to color the nodes so that users in the same community share a color.

In [8]:
#mentions.to_csv("graph/edges.csv", index=False)
mentions = pd.read_csv("app(server)/graph/edges.csv")

In [17]:
viz = (
    graphistry
    .bind(source="u1", destination="u2", node="screen_name", point_color="community")
    .nodes(users)
    .edges(mentions)
)
url = viz.plot()

In [24]:
from bs4 import BeautifulSoup

# In a notebook, plot() returns an HTML display object whose markup embeds the graph in an iframe
html_code = url.data

# Extract the visualization URL from the iframe's src attribute
soup = BeautifulSoup(html_code, 'html.parser')
iframe = soup.find('iframe')
url = iframe['src']

print("Graph link:", url)


Graph link: https://hub.graphistry.com/graph/graph.html?dataset=f332a5b02b174eec87291048ea4dbe50&type=arrow&viztoken=4d01f9fb-8dad-4363-8974-036fa139d149&usertag=2560ad89-pygraphistry-0.36.1&splashAfter=1746255060&info=true
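If you would rather not depend on BeautifulSoup for this one step, the same extraction can be sketched with the standard library's `html.parser`. The class name `IframeSrcParser`, the helper `first_iframe_src`, and the sample markup are illustrative assumptions, not part of PyGraphistry. PyGraphistry's `plot()` also accepts a `render` argument, and `render=False` is documented as returning the URL rather than notebook HTML; check this against your installed version before relying on it.

```python
from html.parser import HTMLParser


class IframeSrcParser(HTMLParser):
    """Remember the src attribute of the first <iframe> tag encountered."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs
        if tag == "iframe" and self.src is None:
            self.src = dict(attrs).get("src")


def first_iframe_src(html_text):
    """Return the src of the first iframe in html_text, or None if there is none."""
    parser = IframeSrcParser()
    parser.feed(html_text)
    return parser.src


# Invented sample markup standing in for the HTML returned in a notebook
sample_html = (
    '<div><iframe src="https://example.com/graph.html?dataset=abc" '
    'style="width:100%"></iframe></div>'
)
graph_url = first_iframe_src(sample_html)
print("Graph link:", graph_url)
```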


After running the cells above, you should see an interactive Graphistry visualization at the printed link.

Nodes are colored by their `community` value. By default, the size of each node is proportional to its degree (number of relationships). In the next section we'll see how to run graph algorithms such as PageRank and visualize their results in Graphistry.

In [7]:
from collections import Counter
Counter(users.community)

Counter({0: 2027, 1: 1560, 3: 626, 4: 492, 2: 361, 5: 5})

## Graph Algorithms

The above visualization shows us User-User Tweet mention interactions from the data. What if we wanted to answer the question "Who is the most important user in this network?" One way to answer that would be to look at the degree, or number of relationships, of each node. By default, PyGraphistry uses degree to set the size of each node, allowing us to judge the importance of nodes at a glance.
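As a rough illustration of degree as an importance measure, the sketch below counts relationships per node from a small edge list. The edge data and the function name `degree_by_node` are made up for the example; in the notebook the pairs would come from the `u1` and `u2` columns of the edge DataFrame.

```python
from collections import Counter


def degree_by_node(edges):
    """Count relationships per node, treating each edge as touching both endpoints."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


# Invented (source, target) pairs for illustration
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "alice"),
]

degree = degree_by_node(edges)
top = degree.most_common(1)
print(top)  # [('alice', 3)]
```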

We can also use [graph algorithms](https://github.com/neo4j-contrib/neo4j-graph-algorithms) such as PageRank to determine importance in the network. In this section we show how to [run graph algorithms in Neo4j](https://neo4j.com/developer/graph-algorithms/) and use the results of these algorithms in our Graphistry visualization.

In [4]:
# run PageRank on the projected FOLLOWING graph and write a pagerank property score back to each User node
with driver.session() as session:
    session.run("""
        CALL algo.pageRank("MATCH (t:User) RETURN id(t) AS id",
         "MATCH (u1:User)-[:FOLLOWING]->(u2:User)
         RETURN id(u1) as source, id(u2) as target", {graph:'cypher', write: true})
     """)

ClientError: {code: Neo.ClientError.Procedure.ProcedureNotFound} {message: There is no procedure with the name `algo.pageRank` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.}
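When the `algo.pageRank` procedure is not available, as in the error above, one option is to compute PageRank on the client from an edge list. The following is a minimal power-iteration sketch, not Neo4j's implementation: the function name, the damping factor of 0.85, the fixed iteration count, and the sample edges are all assumptions for illustration, and the scores (normalized here to sum to 1) will not necessarily match values produced inside Neo4j.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        # Every other node splits its damped rank equally among its targets
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented directed edges: three accounts point at "hub", which points back at "a"
sample_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank(sample_edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

The resulting dictionary can be turned into a `pagerank` column on the node DataFrame and bound to `point_size`, which is how the scores are used later in this notebook.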

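The `algo.pageRank` call fails on this instance because the `algo.*` procedures are not installed; they belong to the older Graph Algorithms library, which has since been superseded by the Neo4j Graph Data Science library. Instead of querying Neo4j for the scores, the next cell builds a new pandas DataFrame for our user nodes from a JSON file of per-user `community` and `score` values, using `score` as the PageRank value: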

In [7]:
import json
import pandas as pd
from collections import Counter

with open('app(server)/static/graph/my/modularity_with_score.json', 'r') as f:
    data = json.load(f)
valid_names = set(users['screen_name'])

df = pd.DataFrame([{
    'screen_name': user['screen_name'],
    'community': user['community'],
    'pagerank': user['score']
} for user in data if user['screen_name'] in valid_names])
Counter(df['community'])

Counter({0: 2027, 1: 1560, 3: 626, 4: 492, 2: 361, 5: 5})

In [36]:
#df.to_csv("community.csv", index=False)


In [8]:
import matplotlib.pyplot as plt
from matplotlib import colors

# Build one hex color per community ID from the tab20b colormap
num_colors = max(df['community']) + 1
cmap = plt.get_cmap('tab20b', num_colors)
community_colors = {i: colors.rgb2hex(cmap(i)) for i in range(num_colors)}

# Size nodes by PageRank and color them by community
viz = (
    graphistry
    .bind(source="u1", destination="u2", node="screen_name", point_size="pagerank")
    .nodes(df)
    .edges(mentions)
    .encode_point_color('community', categorical_mapping=community_colors, default_mapping='orange')
)
viz.plot()

Now when we render the Graphistry visualization, node size is proportional to the node's PageRank score. This results in a different set of nodes being identified as most important.

By binding node size to the results of graph algorithms we are able to draw insight from the data at a glance and further explore the interactive visualization.
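The categorical color mapping used in the plot is just a dictionary from community ID to a hex color string. As a sketch of how such a mapping could be built without matplotlib, the helper below spaces hues evenly with the standard library's `colorsys`. The function name `community_palette` and the saturation and brightness values are arbitrary choices for illustration, and the colors will differ from the `tab20b` palette used above.

```python
import colorsys


def community_palette(community_ids, saturation=0.65, value=0.9):
    """Map each distinct community ID to a hex color with evenly spaced hues."""
    ids = sorted(set(community_ids))
    palette = {}
    for index, community in enumerate(ids):
        hue = index / len(ids)
        red, green, blue = colorsys.hsv_to_rgb(hue, saturation, value)
        palette[community] = "#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)
        )
    return palette


# Community IDs as they appear in the Counter output of this notebook
palette = community_palette([0, 1, 2, 3, 4, 5])
for community, color in palette.items():
    print(community, color)
```

A dictionary like `palette` has the same shape as the `categorical_mapping` argument passed to `encode_point_color` in the cell above.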


In [10]:
html = viz.plot()

In [12]:
html