# Monitoring the Cryptocurrency Space with NLP and Knowledge Graphs

Thousands of articles, if not more, are published every day. While a lot of knowledge is hidden in those articles, it is virtually impossible to read all of them. Even if you focus on a single domain, it is still hard to find all the relevant articles and read them to gain valuable insights. However, there are tools that can spare you the manual labor and extract those insights automatically. I am, of course, talking about various NLP tools and services.

# Agenda
* Retrieve articles that talk about cryptocurrency
* Translate foreign articles with Google Translate API
* Import articles into Neo4j
* Extract entities and facts with Diffbot's NLP API
* Import entities and facts into Neo4j
* Graph analysis


# Retrieve articles about cryptocurrencies
We will use the Diffbot APIs to retrieve articles that talk about cryptocurrencies. If you want to follow along with this post, you can create a free trial account on their page, which should be enough to complete all the steps presented here. Once you log in to their portal, you can explore their visual query builder interface and inspect what is available.

A lot of data is available through Diffbot's Knowledge Graph API. Not only can you search for articles, but you can also use the KG APIs to retrieve information about organizations, products, persons, jobs, and more.
This example will retrieve the latest 5000 articles tagged with the label cryptocurrency.

In [2]:
import requests
import pandas as pd

DIFF_TOKEN = "<Insert Diffbot token>"
search_query = 'query=type%3AArticle+tags.label%3A"cryptocurrency"++sortBy%3Adate'
article_count = 5000
articles_per_request = 50

def get_articles(query, offset):
    """
    Fetch relevant articles from Diffbot KG endpoint
    """
    search_host = "https://kg.diffbot.com/kg/dql_endpoint?"
    url = f"{search_host}{query}&token={DIFF_TOKEN}&from={offset}&size={articles_per_request}"
    return requests.get(url).json()['data']

articles = []
for offset in range(0,article_count, articles_per_request):
    articles.extend(get_articles(search_query, offset))

I have constructed the search query in their visual builder and simply copied it to my Python script. That's all the code required to fetch any number of articles that are relevant to your use-case.
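
To make the paging explicit, here is a minimal sketch of how the request URLs are assembled. The `build_url` and `page_offsets` helpers and the placeholder token are illustrative and not part of the original script; only the endpoint and the `from`/`size` parameters are taken from the code above.

```python
SEARCH_HOST = "https://kg.diffbot.com/kg/dql_endpoint?"

def build_url(query, offset, size, token):
    """Assemble one paged request URL (hypothetical helper)."""
    return f"{SEARCH_HOST}{query}&token={token}&from={offset}&size={size}"

def page_offsets(total, size):
    """Return the offsets needed to fetch `total` records in pages of `size`."""
    return list(range(0, total, size))

# 5000 articles in pages of 50, as in the script above
offsets = page_offsets(5000, 50)
print(len(offsets), offsets[:3], offsets[-1])

# "<token>" is a placeholder, not a real credential
first_url = build_url("query=type%3AArticle", offsets[0], 50, "<token>")
print(first_url)
```

With these settings the script issues 100 requests, so it is worth lowering `article_count` while you are still experimenting.
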
# Translate foreign articles with Google Translate API
The retrieved articles come from all over the world and are written in many languages. In the next step, you will use the Google Translate API to translate them to English. You will need to enable the Google Translate API and create an API key. Make sure to check the pricing first, as using the translation API ended up costing me a bit more than I expected. I've checked the pricing of other services, and it is usually between $15 and $20 per million translated characters.
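
Before sending anything to a paid API, it can help to estimate the bill. The following is a rough sketch, assuming a price of $20 per million characters (the upper end of the range mentioned above, not an official quote) and the same 95,000-character cut-off used later in this post. The `estimate_translation_cost` helper and the sample records are made up for illustration.

```python
def estimate_translation_cost(articles, price_per_million=20.0, max_chars=95_000):
    """Return (billable characters, estimated cost in USD) for non-English articles.

    price_per_million is an assumption; check the current pricing page.
    """
    chars = sum(
        min(len(a.get('text') or ''), max_chars)
        for a in articles
        if a.get('language') != 'en'   # English articles are not translated
    )
    return chars, chars / 1_000_000 * price_per_million

# Made-up sample records that only mimic the fields used in this post
sample = [
    {'language': 'en', 'text': 'x' * 4_000},
    {'language': 'de', 'text': 'x' * 250_000},   # truncated to max_chars
    {'language': 'es', 'text': 'x' * 50_000},
]
chars, cost = estimate_translation_cost(sample)
print(f"{chars} characters, about ${cost:.2f}")
```

Running this on the real `articles` list before the translation loop gives a ballpark figure without making any API calls.
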

In [6]:
GOOGLE_API_KEY = "<Insert your Google API key>"
translate_url = "https://translation.googleapis.com/language/translate/v2"

def translate(text, language):
    """
    Translate text to English with Google Translate API.
    If the text is already in English, simply return it.
    """
    if language == 'en':
        return text
    
    data = {'q': text, 'target': 'en', 'format':'text', 'source': language, 'key': GOOGLE_API_KEY}
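    response = None
    try:
        response = requests.post(translate_url, data=data).json()
        return response['data']['translations'][0]['translatedText']
    except Exception as e:
        # `response` stays None if the request itself failed
        print(f"Translation failed: {e!r}; response: {response}")
        return None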

# Google Translate API has a limit of 100kb per request
max_character_length = 95_000

for row in articles:
    row['date'] = row.get('date').get('timestamp') if row.get('date') else None
    row['translatedText'] = translate(row['text'][:max_character_length], row['language'])

Before we move on to the NLP extraction part, we will import the articles into Neo4j.
# Import articles into Neo4j
I suggest you either download Neo4j Desktop or use the free Neo4j Aura cloud instance, which should be enough to store information about these 5000 articles. First, we have to define the connection to the Neo4j instance.

In [4]:
# Define Neo4j connections
from neo4j import GraphDatabase

host = 'bolt://localhost:7687'
user = 'neo4j'
password = 'letmein'
driver = GraphDatabase.driver(host,auth=(user, password))
                                         

def run_query(query, params=None):
    # Avoid a mutable default argument; fall back to an empty parameter map
    with driver.session() as session:
        result = session.run(query, params or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())

We have some metadata about the articles. For example, we know the overall sentiment of each article and when it was published. In addition, for most of the articles, we know who wrote them and on which site they appeared. Lastly, the Diffbot API also returns the categories of an article.

Before continuing, we will define unique constraints in Neo4j, which will speed up the import and subsequent queries.

In [33]:
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Article) ASSERT a.id IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (e:Entity) ASSERT e.id IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (c:Category) ASSERT c.id IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Author) ASSERT a.url IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (r:Region) ASSERT r.name IS UNIQUE;")
run_query("CREATE CONSTRAINT IF NOT EXISTS ON (s:Site) ASSERT s.name IS UNIQUE;")

Now we can go ahead and import articles into Neo4j.

In [35]:
import_articles_query = """
UNWIND $data as row
MERGE (a:Article {id: row.id})
SET a.title = row.title,
    a.url = row.url,
    a.sentiment = row.sentiment,
    a.date = CASE WHEN row.date IS NOT NULL THEN datetime({epochMillis:row.date}) ELSE Null END,
    a.language = row.language,
    a.text = row.text,
    a.translatedText = row.translatedText
MERGE (s:Site {name: row.siteName})
MERGE (a)-[:ON_SITE]->(s)
FOREACH (_ in case WHEN row.authorUrl IS NOT NULL THEN [1] ELSE [] END |
   MERGE (au:Author {url: row.authorUrl})
    ON CREATE SET au.name = row.author
    MERGE (au)-[:WROTE]->(a))
FOREACH (_ in case WHEN row.publisherRegion IS NOT NULL THEN [1] ELSE [] END |
   MERGE (r:Region {name: row.publisherRegion})
   MERGE (s)-[:HAS_REGION]->(r))
WITH a, row.categories as categories
UNWIND categories AS category
MERGE (c:Category {id:category.id})
ON CREATE SET c.name = category.name
MERGE (a)-[:HAS_CATEGORY]->(c)
"""

run_query(import_articles_query, {'data': articles})
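
One thing to watch out for: Neo4j refuses to `MERGE` a node on a null property value, so a single article without a `siteName` (or `id`) would make the whole import query fail. Here is a minimal sketch of a pre-import check. The `split_importable` helper and the sample records are hypothetical; the field names are the ones used in the import query above.

```python
def split_importable(articles, required=('id', 'siteName')):
    """Separate rows that have all MERGE keys from rows that lack one."""
    ok, skipped = [], []
    for row in articles:
        if all(row.get(key) is not None for key in required):
            ok.append(row)
        else:
            skipped.append(row)
    return ok, skipped

# Made-up records for illustration
sample = [
    {'id': 'a1', 'siteName': 'Example News'},
    {'id': 'a2', 'siteName': None},   # would break MERGE (s:Site {name: ...})
    {'id': 'a3'},                     # siteName missing entirely
]
ok, skipped = split_importable(sample)
print(len(ok), "importable,", len(skipped), "skipped")
```

You could then pass `ok` instead of `articles` to `run_query` and inspect `skipped` separately.
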

Before we move on to the analysis part of the post, we will use the NLP API to extract entities and relationships, or as Diffbot calls them, facts. The Diffbot website offers an online NLP demo, where you can input any text and evaluate the results.


The NLP API will identify all the entities that appear in the text and possible relationships between them. In this example, we can see that Jack Dorsey is the CEO of Block, which is based in San Francisco and deals with payments and mining. Jack's coworker at Block is Thomas Templeton, who has a background in computer hardware.
To process the entities in the response and store them in Neo4j, we will use the following code:


In [9]:
FIELDS = "entities,sentiment,facts"
HOST = "nl.diffbot.com"

def nlp_request(payload):
    res = requests.post(f"https://{HOST}/v1/?fields={FIELDS}&token={DIFF_TOKEN}", json=payload)
    return res.json()

In [10]:
allowed_types = ['organization', 'person', 'location', 'product']
entity_confidence = 0.7

entity_query = """
MATCH (a:Article {id: $article_id})
WITH a
UNWIND $entities as e
MERGE (entity:Entity {id: e.id})
ON CREATE SET entity.name = e.name
WITH a, entity, e
CALL apoc.create.addLabels(entity, e.types) YIELD node
MERGE (a)-[m:MENTIONS]->(entity)
SET m.confidence = e.confidence,
    m.salience = e.salience,
    m.sentiment = e.sentiment
"""

def clean_and_store_entities(batch, response):
    for article, nlp in zip(batch, response):
        article_id = article['id']
        # Skip processing if no entities were found
        if ('entities' not in nlp) or (not nlp['entities']):
            continue
            
        entities = pd.DataFrame.from_dict(nlp['entities'])
        # Filter allowed entity types and capitalize type names
        entities['types'] = [[type['name'].capitalize() for type in types if type['name'] in allowed_types] 
                             for types in entities['allTypes']]

        # Keep entities that have at least one allowed type and a confidence of at least entity_confidence
        entity_import = entities[[len(e['types']) > 0 and e['confidence'] >= entity_confidence for i,e in entities.iterrows()]]
        # Drop persons whose name is a single word; copy so the id column can be assigned safely
        entity_import = entity_import[entity_import.apply(lambda x: ('Person' not in x['types']) or (len(x['name'].split(' '))) > 1, axis=1)].copy()
        # Define entity id
        entity_import['id'] = [l['allUris'][0] if 0 < len(l['allUris']) else l['name'] for i, l in entity_import.iterrows()]
        entity_params = {'article_id': article_id, 'entities': entity_import.to_dict('records')} 
        # Import to Neo4j
        run_query(entity_query, entity_params)
    

This example imports only entities that have one of the allowed types (organization, person, product, or location) and a confidence level of at least 0.7. Diffbot's NLP API also features entity linking, where entities are linked to Wikipedia, Crunchbase, or LinkedIn, as far as I have seen.
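
The filtering rules are easier to follow without the pandas machinery. Below is a plain-Python sketch of the same logic, run on a hand-written mock response. The `select_entities` helper, the mock records, and the example URI are invented for illustration; only the field names (`name`, `allTypes`, `confidence`, `allUris`) are taken from the code above.

```python
ALLOWED_TYPES = {'organization', 'person', 'location', 'product'}
MIN_CONFIDENCE = 0.7

def select_entities(entities):
    """Apply the type, confidence, and single-word-person filters."""
    selected = []
    for e in entities:
        types = [t['name'].capitalize() for t in e['allTypes']
                 if t['name'] in ALLOWED_TYPES]
        if not types or e['confidence'] < MIN_CONFIDENCE:
            continue
        # Single-word person names are too ambiguous to keep
        if 'Person' in types and len(e['name'].split(' ')) < 2:
            continue
        uris = e.get('allUris') or []
        # Prefer the linked URI as id, fall back to the name
        selected.append({'id': uris[0] if uris else e['name'],
                         'name': e['name'], 'types': types})
    return selected

# Mock entities, not real API output
mock = [
    {'name': 'Jack Dorsey', 'allTypes': [{'name': 'person'}],
     'confidence': 0.95, 'allUris': ['http://example.com/jack-dorsey']},
    {'name': 'Jack', 'allTypes': [{'name': 'person'}],
     'confidence': 0.90, 'allUris': []},
    {'name': 'Block', 'allTypes': [{'name': 'organization'}],
     'confidence': 0.50, 'allUris': []},
    {'name': 'payments', 'allTypes': [{'name': 'skill'}],
     'confidence': 0.99, 'allUris': []},
    {'name': 'San Francisco', 'allTypes': [{'name': 'location'}, {'name': 'city'}],
     'confidence': 0.80, 'allUris': []},
]
selected = select_entities(mock)
print([e['name'] for e in selected])
```

Only "Jack Dorsey" and "San Francisco" survive here: "Jack" is a single-word person, "Block" falls below the confidence threshold, and "payments" has no allowed type.
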

Next, we have to prepare the function that will clean and import relationships into Neo4j.

In [11]:
skipProperties = ['gender', 'founding date', 'academic degree', 'age', 
                  'cause of death', 'date of birth', 'date of death']

rel_confidence = 0.7

rels_query ="""
UNWIND $data as row
MATCH (s:Entity {id: row.source})
MATCH (t:Entity {id: row.target})
CALL apoc.merge.relationship(s, row.type,
  {},
  {},
  t,
  {}
)
YIELD rel
RETURN distinct 'done';
"""

def clean_and_store_rels(batch, response):
    relParams = []
    for article, nlp in zip(batch, response):
        article_id = article['id']
        if not 'facts' in nlp or len(nlp['facts']) == 0:
            continue
        facts = pd.DataFrame.from_dict(nlp['facts'])
        # define confidence level
        facts = facts[facts['confidence'] >= rel_confidence]
        # Skip unrelated facts
        facts = facts[facts['property'].apply(lambda x: x['name'] not in skipProperties)]
        # skip if facts is empty
        if len(facts) == 0:
            continue

        # Construct data
        for i, row in facts[['entity', 'property', 'value']].iterrows():
            source = row['entity']['name'] if len(row['entity']['allUris']) == 0 else row['entity']['allUris'][0]
            target = row['value']['name'] if len(row['value']['allUris']) == 0 else row['value']['allUris'][0]
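            rel_type = row['property']['name'].replace(' ', '_').upper()
            relParams.append({'source':source,'target':target,'type':rel_type})
    # Import the collected relationships once per batch instead of re-sending them for every article
    if relParams:
        run_query(rels_query, {'data': relParams})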

I have omitted the import of the properties that are defined in the skipProperties list. To me, it makes more sense to store them as node properties rather than relationships between entities. However, in this example, we will simply ignore them during import.
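
To illustrate how a single fact turns into a relationship, here is a small sketch of the mapping. The helper names and the mock fact are hypothetical, and the raw property name `employee or member of` is my assumption about what the API returns, inferred from the `EMPLOYEE_OR_MEMBER_OF` relationship type that ends up in the graph.

```python
SKIP_PROPERTIES = {'gender', 'founding date', 'academic degree', 'age',
                   'cause of death', 'date of birth', 'date of death'}

def to_relationship_type(property_name):
    """'employee or member of' -> 'EMPLOYEE_OR_MEMBER_OF'"""
    return property_name.replace(' ', '_').upper()

def entity_key(entity):
    """Use the first linked URI as key, or the name if nothing is linked."""
    uris = entity.get('allUris') or []
    return uris[0] if uris else entity['name']

def fact_to_relationship(fact):
    """Return the import parameters for one fact, or None if it is skipped."""
    name = fact['property']['name']
    if name in SKIP_PROPERTIES:
        return None
    return {'source': entity_key(fact['entity']),
            'target': entity_key(fact['value']),
            'type': to_relationship_type(name)}

# Mock facts, not real API output
fact = {'entity': {'name': 'Mark Cuban', 'allUris': []},
        'property': {'name': 'employee or member of'},
        'value': {'name': 'Dallas Mavericks', 'allUris': []}}
skipped_fact = {'entity': {'name': 'Mark Cuban', 'allUris': []},
                'property': {'name': 'date of birth'},
                'value': {'name': '1958', 'allUris': []}}

rel = fact_to_relationship(fact)
print(rel)
print(fact_to_relationship(skipped_fact))
```

Because the key falls back to the entity name, it matches the `id` that was assigned to the entity during the entity import, which is what lets the `MATCH` clauses in `rels_query` find both endpoints.
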
Now that the functions for importing entities and relationships are prepared, we can go ahead and process the articles. You can send multiple articles in a single request. I've chosen to send 50 articles per request.

In [12]:
step = 50
total = len(articles)

for offset in range(0, total, step):
    batch = [el for el in articles[offset:offset + step] if el['translatedText']]
    # Prepare payload
    payload = [{'content': el['translatedText']} for el in batch]
    # Make the request to Diffbot API
    nlp_response = nlp_request(payload)
    # Clean and store entities and facts in Neo4j
    clean_and_store_entities(batch, nlp_response)
    clean_and_store_rels(batch, nlp_response)
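
Both import functions pair articles with results using `zip`, which relies on the API returning one result per submitted document, in order. If a request fails and the response is something else, for example an error object, `zip` would silently produce nonsense. The sketch below shows one way to guard against that; the `align_batch` helper and the shape of the error payload are assumptions, not documented API behavior.

```python
def align_batch(batch, response):
    """Pair each article with its NLP result, or return [] if the shapes differ."""
    if not isinstance(response, list) or len(response) != len(batch):
        return []
    return list(zip(batch, response))

# Mock data for illustration
batch = [{'id': 'a1'}, {'id': 'a2'}]
good_response = [{'entities': []}, {'facts': []}]
bad_response = {'error': 'hypothetical error payload'}

pairs = align_batch(batch, good_response)
print(len(pairs))
print(align_batch(batch, bad_response))
```

In the loop above, you could skip a batch (or retry it) whenever the response does not line up with the batch you sent.
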

By following these steps you have successfully constructed a knowledge graph in Neo4j.

# Graph analysis
In the last part of this post, I will walk you through some example applications that you could use with a knowledge graph like this. First, we will evaluate the timeline of the articles.

In [51]:
run_query("""
MATCH (a:Article)
RETURN date(a.date) AS date,
       count(*) AS count
ORDER BY date DESC
LIMIT 10
""")

| | date | count |
|---|---|---|
| 0 | 2022-01-14 | 144 |
| 1 | 2022-01-13 | 256 |
| 2 | 2022-01-12 | 290 |
| 3 | 2022-01-11 | 388 |
| 4 | 2022-01-10 | 365 |
| 5 | 2022-01-09 | 140 |
| 6 | 2022-01-08 | 198 |
| 7 | 2022-01-07 | 316 |
| 8 | 2022-01-06 | 427 |
| 9 | 2022-01-05 | 326 |


There are roughly 150 to 450 articles about cryptocurrencies published around the world per day. Next, we will evaluate which entities are most frequently mentioned in articles.

In [39]:
run_query("""
MATCH (e:Entity)
RETURN e.name AS entity,
       size((e)<-[:MENTIONS]-()) AS articles
ORDER BY articles DESC
LIMIT 5
""")

| | entity | articles |
|---|---|---|
| 0 | cryptocurrency | 3182 |
| 1 | bitcoin | 1928 |
| 2 | United States of America | 1553 |
| 3 | Ethereum | 1137 |
| 4 | blockchain | 998 |


As you would expect from articles revolving around cryptocurrencies, the most frequently mentioned entities are cryptocurrency, bitcoin, Ethereum, and blockchain. The sentiment is available on the article level as well as entity level. For example, we can examine the sentiment regarding bitcoin grouped by region.

In [40]:
run_query("""
MATCH (e:Entity {name:'bitcoin'})<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WITH region.name AS region, m.sentiment AS sentiment
RETURN region, avg(sentiment) AS avgSentiment, 
       stdev(sentiment) AS stdSentiment, 
       max(sentiment) AS maxSentiment, 
       min(sentiment) AS minSentiment, 
       count(*) AS articles
ORDER BY articles DESC
LIMIT 5
""")

| | region | avgSentiment | stdSentiment | maxSentiment | minSentiment | articles |
|---|---|---|---|---|---|---|
| 0 | North America | 0.273279 | 0.658506 | 0.997222 | -0.997094 | 592 |
| 1 | Western Europe | 0.120103 | 0.706812 | 0.996824 | -0.988779 | 174 |
| 2 | Southern Asia | 0.28304 | 0.647172 | 0.994452 | -0.996317 | 105 |
| 3 | South America | 0.346131 | 0.600056 | 0.99645 | -0.993456 | 79 |
| 4 | Northern Europe | 0.157956 | 0.685148 | 0.992812 | -0.994532 | 71 |


The sentiment is positive on average, but judging by the standard deviation values, it fluctuates heavily between articles. We could explore bitcoin sentiment further. Instead, we will examine which persons combine the most mentions with the highest and lowest average sentiment in North American articles.

In [41]:
run_query("""
MATCH (e:Person)<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WHERE region.name = "North America"
RETURN e.name AS entity,
       count(*) AS articles,
       avg(m.sentiment) AS sentiment
ORDER BY sentiment * articles DESC
LIMIT 5
UNION
MATCH (e:Person)<-[m:MENTIONS]-()-[:ON_SITE]->()-[:HAS_REGION]->(region)
WHERE region.name = "North America"
RETURN e.name AS entity,
       count(*) AS articles,
       avg(m.sentiment) AS sentiment
ORDER BY sentiment * articles ASC
LIMIT 5
""")

| | entity | articles | sentiment |
|---|---|---|---|
| 0 | Shiba Inu | 92 | 0.116428 |
| 1 | Elon Musk | 47 | 0.21496 |
| 2 | Jack Dorsey | 23 | 0.170886 |
| 3 | Brandon Brown | 9 | 0.427313 |
| 4 | Mark Cuban | 9 | 0.34788 |
| 5 | Floyd Mayweather | 17 | -0.365198 |
| 6 | Charles Ponzi | 8 | -0.673924 |
| 7 | Paul Pierce | 14 | -0.261665 |
| 8 | Donald Trump | 14 | -0.154147 |
| 9 | Kim Kardashian | 18 | -0.104895 |
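
The ordering in these results follows from the `ORDER BY sentiment * articles` clause: multiplying the average sentiment by the number of mentions gives the summed sentiment, so frequently mentioned persons outrank rarely mentioned ones with a more extreme average. As a quick check, this sketch recomputes the ranking in Python from the numbers in the result above.

```python
# (entity, articles, average sentiment) copied from the query result
rows = [
    ("Shiba Inu", 92, 0.116428),
    ("Elon Musk", 47, 0.21496),
    ("Jack Dorsey", 23, 0.170886),
    ("Brandon Brown", 9, 0.427313),
    ("Mark Cuban", 9, 0.34788),
    ("Floyd Mayweather", 17, -0.365198),
    ("Charles Ponzi", 8, -0.673924),
    ("Paul Pierce", 14, -0.261665),
    ("Donald Trump", 14, -0.154147),
    ("Kim Kardashian", 18, -0.104895),
]

def score(row):
    """Average sentiment weighted by the number of mentions."""
    _, articles, sentiment = row
    return sentiment * articles

top = [name for name, *_ in sorted(rows, key=score, reverse=True)[:5]]
bottom = [name for name, *_ in sorted(rows, key=score)[:5]]
print(top)
print(bottom)
```

This explains why Brandon Brown, who has the highest average sentiment of the group, still ranks below Shiba Inu and Elon Musk.
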


Now, we can explore the titles of articles in which, for example, Mark Cuban appears.

In [55]:
run_query("""
MATCH (site)<-[:ON_SITE]-(a:Article)-[m:MENTIONS]->(e:Entity {name: 'Mark Cuban'})
RETURN a.title AS title, a.language AS language, m.sentiment AS sentiment, site.name AS site
ORDER BY sentiment DESC
LIMIT 5
""")

| | title | language | sentiment | site |
|---|---|---|---|---|
| 0 | The biggest consumer brands that engaged with ... | en | 0.955021 | Cointelegraph |
| 1 | Billionaire Investor Mark Cuban to Share Stage... | en | 0.920241 | Crowdfund Insider |
| 2 | Billionaire Investor Mark Cuban to Share Stage... | en | 0.897312 | Crowdfund Insider |
| 3 | Mark Cuban says this is ‘the least important p... | en | 0.601242 | CNBC |
| 4 | Mark Cuban says 80% of his investments that ar... | en | 0.52063 | CNBC |


While the titles themselves might not be the most descriptive, we can also examine which other entities frequently co-occur in articles where Mark Cuban is mentioned.

In [56]:
run_query("""
MATCH (o:Entity)<-[:MENTIONS]-(a:Article)-[m:MENTIONS]->(e:Entity {name: 'Mark Cuban'})
WITH o, count(*) AS countOfArticles
ORDER BY countOfArticles DESC
LIMIT 5
RETURN o.name AS entity, countOfArticles
""")

| | entity | countOfArticles |
|---|---|---|
| 0 | cryptocurrency | 17 |
| 1 | bitcoin | 16 |
| 2 | Dallas Mavericks | 16 |
| 3 | blockchain | 9 |
| 4 | Ethereum | 7 |


Not surprisingly, various crypto tokens are present. The Dallas Mavericks, the NBA club that Mark owns, also appear. Whether the Dallas Mavericks support crypto or reporters simply like to mention that Mark owns the team, I don't know. You could continue down that route of analysis, but here, we'll also look at the facts we extracted during NLP processing.

In [59]:
run_query("""
MATCH p=(e:Entity {name: "Mark Cuban"})-[r]-(e2:Entity)
WITH distinct e,r,e2
RETURN e.name AS source, type(r) AS relationship, e2.name AS target
""")

| | source | relationship | target |
|---|---|---|---|
| 0 | Mark Cuban | EMPLOYEE_OR_MEMBER_OF | National Basketball Association |
| 1 | Mark Cuban | WORK_RELATIONSHIP | Francis X. Suarez |
| 2 | Mark Cuban | WORK_RELATIONSHIP | Francis X. Suarez |
| 3 | Mark Cuban | EMPLOYEE_OR_MEMBER_OF | Dallas Mavericks |
| 4 | Mark Cuban | EMPLOYEE_OR_MEMBER_OF | AXS TV |


Next, we will quickly evaluate the article titles where Floyd Mayweather appears, as the average sentiment is quite low.

In [57]:
run_query("""
MATCH (a:Article)-[m:MENTIONS]->(e:Entity {name: 'Floyd Mayweather'})
RETURN a.title AS title, a.language AS language, m.sentiment AS sentiment
ORDER BY sentiment ASC
LIMIT 5
""")

| | title | language | sentiment |
|---|---|---|---|
| 0 | Kim Kardashian, Floyd Mayweather Accused of ‘P... | en | -0.960115 |
| 1 | Kim Kardashian and Floyd Mayweather sued over ... | en | -0.947587 |
| 2 | Kim Kardashian, Floyd Mayweather Sued by Crypt... | en | -0.919837 |
| 3 | Kim Kardashian sued over crypto that crashed 9... | en | -0.912651 |
| 4 | Kim Kardashian and Floyd Mayweather Jr ‘misled... | en | -0.901501 |


It seems that Kim Kardashian and Floyd Mayweather are being sued over an alleged crypto scam. The NLP processing also identifies various tokens and stock tickers, so we can analyze which are popular at the moment and their sentiment.

In [49]:
run_query("""
MATCH (e:Entity)<-[m:MENTIONS]-()
WHERE (e)<-[:STOCK_SYMBOL]-()
RETURN e.name AS stock, 
       count(*) as mentions, 
       avg(m.sentiment) AS averageSentiment,
       min(m.sentiment) AS minSentiment,
       max(m.sentiment) AS maxSentiment
ORDER BY mentions DESC
LIMIT 5
""")

| | stock | mentions | averageSentiment | minSentiment | maxSentiment |
|---|---|---|---|---|---|
| 0 | XRP | 156 | 0.167954 | -0.966287 | 0.995899 |
| 1 | DOT | 64 | 0.137851 | -0.981497 | 0.977433 |
| 2 | BUSD | 63 | 0.24585 | -0.884344 | 0.971559 |
| 3 | DOGE | 59 | 0.108355 | -0.978014 | 0.98504 |
| 4 | LINK | 57 | 0.183612 | -0.85945 | 0.990779 |


I have only scratched the surface of the available insights we could extract. Play around and develop your own visualizations!