# Big Data Modelling and Management - Lab 2

## Graphs and Tweets

With the setup done in the previous class, this lab is all about analysing data and extracting information from graphs!

Checking the connection still works...

## Setup

**1. Be sure that you have a neo4j docker container running:**
   
`docker run --name Neo4JLab -p 7474:7474 -p 7687:7687 -d -v "c:\Users\osavc\Documents\Nova_BDMM\_2024\Installations\Neo4JPlugins":/plugins -v "c:\Users\osavc\Documents\Nova_BDMM\_2024\Installations\Neo4JData/data":/data --env NEO4J_AUTH=neo4j/test --env NEO4J_dbms_connector_https_advertised__address="localhost:7473" --env NEO4J_dbms_connector_http_advertised__address="localhost:7474" --env NEO4J_dbms_connector_bolt_advertised__address="localhost:7687" --env NEO4J_dbms_security_procedures_unrestricted="gds.*" --env NEO4J_dbms_security_procedures_allowlist="gds.*" neo4j:4.4.5`

The two quoted host paths are examples; replace them with the plugin and data folders on your own machine.

**2. We will be using the same database we used in Lab 1:**

In [1]:
from neo4j import GraphDatabase
from pprint import pprint

## Define a Connection to your external Database

In [2]:
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="test"

In [3]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

## All The Functions you'll need to run queries in Neo4J

In [4]:
def execute_write(driver, query):
    with driver.session(database="neo4j") as session:
        # Write transactions allow the driver to handle retries and transient errors
        result = session.execute_write(lambda tx, query: list(tx.run(query)), query)
    return result

In [5]:
def execute_read(driver, query):    
    with driver.session(database="neo4j") as session:
        result = session.execute_read(lambda tx, query: list(tx.run(query)), query)
    return result

In [6]:
query = """
        MATCH () RETURN count(*)
    """

result = execute_read(driver, query)

pprint(result)

[<Record count(*)=194356>]


## Today we will answer some questions about our database

In [7]:
query = """
        call db.labels;
    """

result = execute_read(driver, query)

pprint(result)

[<Record label='User'>,
 <Record label='Location'>,
 <Record label='Tweet'>,
 <Record label='Hashtag'>]


In [8]:
query = """
        MATCH (node) 
        RETURN distinct labels(node)
    """

result = execute_read(driver, query)

pprint(result)

[<Record labels(node)=['User']>,
 <Record labels(node)=['Location']>,
 <Record labels(node)=['Tweet']>,
 <Record labels(node)=['Hashtag']>]


### How many Hashtags does the database have?

In [9]:
# How many nodes of Hashtag label are there?

query = """
        MATCH (h:Hashtag)
        RETURN count(h)
    """

result = execute_read(driver, query)

pprint(result)

[<Record count(h)=4708>]


### Directed Graphs

As some of you know, relationships in a graph can be directed, and Neo4j supports this: every relationship is stored with a direction.

For example, a user tweets a tweet! That gives two nodes, `(u:User)` and `(t:Tweet)`, connected by a `[:TWEETED]` relationship that always points from the user to the tweet.

Obviously a Tweet cannot tweet a user!

In [10]:
query = """
        MATCH ()-[]-(t:Tweet)
        RETURN count(t)
    """

result = execute_read(driver, query)

pprint(result)

[<Record count(t)=158269>]


In [11]:
query = """
        MATCH ()<-[]-(t:Tweet)
        RETURN count(t)
    """

result = execute_read(driver, query)

pprint(result)

[<Record count(t)=61376>]
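
The two counts differ because the first pattern ignores direction, while the second only follows relationships that leave a tweet. A minimal sketch of the same idea on a toy edge list; the node names, the edges and the `TAGS` relationship type are made up for illustration and are not taken from the database:

```python
# Toy directed edges as (start, relationship type, end); all values are hypothetical.
edges = [
    ("user1", "TWEETED", "tweet1"),
    ("tweet1", "TAGS", "hashtagA"),  # hypothetical relationship type
    ("tweet1", "MENTIONS", "user2"),
    ("user2", "TWEETED", "tweet2"),
]
tweets = {"tweet1", "tweet2"}

# ()-[]-(t:Tweet): one row per relationship end that is a tweet, in either direction.
undirected = sum((start in tweets) + (end in tweets) for start, _, end in edges)

# ()<-[]-(t:Tweet): only relationships that start at a tweet.
outgoing = sum(1 for start, _, _ in edges if start in tweets)

print(undirected, outgoing)
```

In this toy graph the undirected pattern also picks up the incoming `TWEETED` relationships, which is why it yields more rows than the directed one.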


### Let's count relationships

In [12]:
query = """
        MATCH (t:Tweet)-[r]->(h:Hashtag)
        RETURN count(r)
    """

result = execute_read(driver, query)

pprint(result)

[<Record count(r)=17135>]


### Order by Hashtag name

In [13]:
query = """
        MATCH (t:Tweet)-[r]->(h:Hashtag)
        RETURN h, count(r)
        ORDER BY h.name
        SKIP 100
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record h=<Node element_id='191469' labels=frozenset({'Hashtag'}) properties={'name': 'AlexanderVindman'}> count(r)=1>,
 <Record h=<Node element_id='192467' labels=frozenset({'Hashtag'}) properties={'name': 'AlexeiNavalny'}> count(r)=1>,
 <Record h=<Node element_id='191329' labels=frozenset({'Hashtag'}) properties={'name': 'AliVelshi'}> count(r)=1>,
 <Record h=<Node element_id='191855' labels=frozenset({'Hashtag'}) properties={'name': 'AllCaps'}> count(r)=1>,
 <Record h=<Node element_id='193721' labels=frozenset({'Hashtag'}) properties={'name': 'AllRussiansAreResponsible'}> count(r)=2>]


### Order by Hashtag count: Which is the most used Hashtag in our DB?

In [15]:
query = """
        MATCH (t:Tweet)-[r]->(h:Hashtag)
        RETURN h, count(r)
        ORDER BY count(r) DESC
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record h=<Node element_id='189647' labels=frozenset({'Hashtag'}) properties={'name': 'Ukraine'}> count(r)=2489>,
 <Record h=<Node element_id='189670' labels=frozenset({'Hashtag'}) properties={'name': 'Russia'}> count(r)=624>,
 <Record h=<Node element_id='189652' labels=frozenset({'Hashtag'}) properties={'name': 'ukraine'}> count(r)=317>,
 <Record h=<Node element_id='189656' labels=frozenset({'Hashtag'}) properties={'name': 'UkraineRussiaWar'}> count(r)=249>,
 <Record h=<Node element_id='189701' labels=frozenset({'Hashtag'}) properties={'name': 'StandWithUkraine'}> count(r)=192>]
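
`Ukraine` and `ukraine` show up as separate hashtag nodes, so their counts are split. One possible way to see the combined picture is to merge names case-insensitively. The sketch below does this in Python for the five rows printed above only; in Cypher, grouping by `toLower(h.name)` should give the equivalent over the whole graph (not run here).

```python
from collections import Counter

# The five (name, count) rows from the output above.
top_hashtags = [
    ("Ukraine", 2489),
    ("Russia", 624),
    ("ukraine", 317),
    ("UkraineRussiaWar", 249),
    ("StandWithUkraine", 192),
]

# Sum counts under a lower-cased key so case variants collapse together.
merged = Counter()
for name, count in top_hashtags:
    merged[name.lower()] += count

for name, count in merged.most_common():
    print(name, count)
```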


### Which hashtag gets used the most together with the most used (Ukraine)?

In [16]:
query = """
        MATCH p=()<-[]-()-[]->()
        RETURN p
        limit 1
    """

result = execute_read(driver, query)

pprint(result)

[<Record p=<Path start=<Node element_id='89706' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T17:47:14.000Z', 'replies': 0, 'id': '1517198289115963392', 'text': 'RT @Weinsteinlaw: Russia is using hunger as a weapon in Ukraine, which makes @WCKitchen and #ChefsForUkraine critical soldiers in the fight…', 'retweets': 51, 'lang': 'en', 'likes': 0}> end=<Node element_id='167697' labels=frozenset({'Tweet'}) properties={'date': '2022-04-21T16:30:47.000Z', 'replies': 0, 'id': '1517179048862142465', 'text': 'RT @TonyHussein4: Most Americans blame Dictator Vladimir Putin and oil companies for higher gas prices, according to an ABC News/Ipsos poll…', 'retweets': 184, 'lang': 'en', 'likes': 0}> size=2>>]


In [17]:
query = """
        MATCH (h1:Hashtag{name:"Ukraine"})<-[]-(t)-[]->(h2:Hashtag)
        RETURN h2.name as NAMENAME, count(*) as freq
        order by freq desc
        limit 10
    """

result = execute_read(driver, query)

pprint(result)

[<Record NAMENAME='Russia' freq=554>,
 <Record NAMENAME='UkraineRussiaWar' freq=188>,
 <Record NAMENAME='Putin' freq=154>,
 <Record NAMENAME='Mariupol' freq=133>,
 <Record NAMENAME='StandWithUkraine' freq=119>,
 <Record NAMENAME='UkraineWar' freq=112>,
 <Record NAMENAME='Russian' freq=99>,
 <Record NAMENAME='Ukrainian' freq=78>,
 <Record NAMENAME='USA' freq=75>,
 <Record NAMENAME='NATO' freq=71>]


### Which users (top ten) have the most tweets?

In [18]:
query = """
        MATCH (u:User)-[]->(t:Tweet)
        RETURN id(u), u.username, count(*) as freq
        order by freq desc
        limit 10
    """ 
# The id() function retrieves the internal ID of each node

result = execute_read(driver, query)

pprint(result)

[<Record id(u)=47366 u.username='educwation' freq=88>,
 <Record id(u)=42951 u.username='tryin2dowhatsr1' freq=78>,
 <Record id(u)=9453 u.username='StopVladdyDaddy' freq=71>,
 <Record id(u)=17208 u.username='BaumannPater' freq=68>,
 <Record id(u)=737 u.username='Missy72228463' freq=68>,
 <Record id(u)=240 u.username='NatForTrump2024' freq=64>,
 <Record id(u)=38763 u.username='goribernoob' freq=60>,
 <Record id(u)=650 u.username='anna_kouadio' freq=58>,
 <Record id(u)=5908 u.username='roskirakanang' freq=57>,
 <Record id(u)=62953 u.username='2vprh32texobi' freq=57>]


### Which users (top ten) used the most popular hashtag the most?

In [22]:
query = """
        MATCH (h:Hashtag)<-[]-(t:Tweet)<-[]-(u:User)
        
        RETURN id(u), h.name, count(distinct h) as freq
        order by freq desc
        limit 10
    """

result = execute_read(driver, query)

pprint(result)

[<Record id(u)=70112 h.name='TonerCartridges' freq=2>,
 <Record id(u)=2270 h.name='DenazifyUkraine' freq=1>,
 <Record id(u)=88055 h.name='Ukraine' freq=1>,
 <Record id(u)=89610 h.name='Ukraine' freq=1>,
 <Record id(u)=87656 h.name='Ukraine' freq=1>,
 <Record id(u)=89425 h.name='Ukraine' freq=1>,
 <Record id(u)=750 h.name='Ukraine' freq=1>,
 <Record id(u)=38037 h.name='Ukraine' freq=1>,
 <Record id(u)=3086 h.name='Ukraine' freq=1>,
 <Record id(u)=50215 h.name='DenazifyUkraine' freq=1>]
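
The output above mostly shows `freq=1` because `h.name` is a grouping key and the hashtag is also the thing being counted, and because the query never restricts `h` to the most used hashtag. Below is a hedged sketch of a variant that filters on `Ukraine` and counts tweets per user instead. `top_users_query` is a hypothetical name and the query has not been run against the lab database; the small Python part only mimics the two aggregations on made-up rows.

```python
from collections import Counter

# Untested Cypher sketch: count tweets per user that point at the Ukraine hashtag.
top_users_query = """
        MATCH (h:Hashtag {name: "Ukraine"})<-[]-(t:Tweet)<-[]-(u:User)
        RETURN u.username, count(t) AS freq
        ORDER BY freq DESC
        LIMIT 10
    """

# Made-up (username, hashtag) rows, one per matched tweet, to mimic the grouping.
rows = [
    ("alice", "Ukraine"),
    ("alice", "Ukraine"),
    ("bob", "Ukraine"),
    ("alice", "Russia"),
]

# Filtering on the hashtag and grouping by user gives a usable frequency.
per_user = Counter(user for user, tag in rows if tag == "Ukraine")

# Grouping by (user, hashtag) and counting distinct hashtags in each group gives 1.
per_user_and_tag = {(user, tag): len({tag}) for user, tag in rows}

print(per_user.most_common())
print(per_user_and_tag)
```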


### What are verified users saying?

Let's check the properties of User nodes

In [23]:
query = """
        MATCH (u:User)
        WHERE u.verified=True
        RETURN u
        LIMIT 2
    """

result = execute_read(driver, query)

pprint(result)

[<Record u=<Node element_id='1' labels=frozenset({'User'}) properties={'tweet_count': 27811, 'followers': 81971, 'following': 888, 'verified': True, 'description': 'Chair @TheDemocrats Lawyers Council and @DNC Deputy National Finance Chair. @ObamaWhiteHouse appointee. Trial Lawyer. Music & Hockey Fan. #EndGunViolence', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1482073099285803016/RBcXvS4s_normal.jpg', 'id': '22879254', 'username': 'Weinsteinlaw'}>>,
 <Record u=<Node element_id='3' labels=frozenset({'User'}) properties={'tweet_count': 15034, 'followers': 254914, 'following': 693, 'verified': True, 'description': 'Founded by @chefjoseandres in 2010. Wherever there is a fight so that hungry people may eat, we will be there. #ChefsForTheWorld', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1134164643147132934/yBX6dZH4_normal.jpg', 'id': '156653779', 'username': 'WCKitchen'}>>]


In [24]:
query = """
        MATCH (u:User{verified:True})
        RETURN u
        LIMIT 2
    """

result = execute_read(driver, query)

pprint(result)

[<Record u=<Node element_id='1' labels=frozenset({'User'}) properties={'tweet_count': 27811, 'followers': 81971, 'following': 888, 'verified': True, 'description': 'Chair @TheDemocrats Lawyers Council and @DNC Deputy National Finance Chair. @ObamaWhiteHouse appointee. Trial Lawyer. Music & Hockey Fan. #EndGunViolence', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1482073099285803016/RBcXvS4s_normal.jpg', 'id': '22879254', 'username': 'Weinsteinlaw'}>>,
 <Record u=<Node element_id='3' labels=frozenset({'User'}) properties={'tweet_count': 15034, 'followers': 254914, 'following': 693, 'verified': True, 'description': 'Founded by @chefjoseandres in 2010. Wherever there is a fight so that hungry people may eat, we will be there. #ChefsForTheWorld', 'profile_image_url': 'https://pbs.twimg.com/profile_images/1134164643147132934/yBX6dZH4_normal.jpg', 'id': '156653779', 'username': 'WCKitchen'}>>]


In [25]:
# Note: in this graph MENTIONS points from a Tweet to a User (see the later queries),
# so this User-to-Tweet pattern matches nothing and the result below is empty.
# To see what verified users tweet, match [:TWEETED] and filter on u.verified=True.
query = """
        MATCH (u:User)-[r:MENTIONS]->(t:Tweet)
        WHERE u.verified=False
        RETURN t.retweets as retweets
        SKIP 500
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[]


### What are the most commonly used hashtags by a specific user?

In [26]:
username = "Elliott12345671"

query = f"""
        MATCH (u:User)-[:TWEETED]-(t:Tweet)-[]-(h:Hashtag)
        
        where u.username='{username}' and t.lang='en'
        RETURN h.name, count(*) as freq
        order by freq desc
        limit 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record h.name='Новосибирск' freq=19>,
 <Record h.name='Екатеринбург' freq=19>,
 <Record h.name='Казань' freq=19>,
 <Record h.name='Челябинск' freq=19>,
 <Record h.name='Омск' freq=19>]
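
The cell above splices `username` into the query with an f-string, which breaks if the value contains a quote and leaves the query open to injection. A minimal sketch of a parameterised alternative follows, assuming the neo4j Python driver 5.x, where `tx.run` accepts a dictionary of parameters as its second argument. `execute_read_params` and `user_hashtags_query` are hypothetical names, not part of the lab code.

```python
def execute_read_params(driver, query, **params):
    """Run a read query, sending values as Cypher parameters ($name)."""
    def work(tx):
        # The second argument of tx.run is a dict of query parameters.
        return list(tx.run(query, params))

    with driver.session(database="neo4j") as session:
        return session.execute_read(work)


# Same query as above, with $username instead of string interpolation.
user_hashtags_query = """
        MATCH (u:User)-[:TWEETED]-(t:Tweet)-[]-(h:Hashtag)
        WHERE u.username = $username AND t.lang = 'en'
        RETURN h.name, count(*) AS freq
        ORDER BY freq DESC
        LIMIT 5
    """

# Usage, given the `driver` created earlier:
# result = execute_read_params(driver, user_hashtags_query, username="Elliott12345671")
```

With parameters the query text stays constant, so the same string can be reused for any username.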


### Who is talking about a specific subject?

In [27]:
subject = "Los Angeles"

query = f"""
        MATCH (u:User)-[]->(t:Tweet)
        WHERE t.text contains '{subject}'
        RETURN u.username, t.text
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record u.username='JUkrainian333' t.text='RT @standwithukr_la: This mural in the heart of Los Angeles \nEveryone welcome to see! \nSupport Ukraine: scan QR code - save a life!\n📍 7753…'>,
 <Record u.username='dgershwin' t.text='The latest The Los Angeles, Politics, Sports Daily! https://t.co/oBWSgIfqdd Thanks to @JHWeissmann @lizzieohreally @ChrisCillizza #ukraine #breaking'>,
 <Record u.username='NigelQuartey' t.text='The translation makes it funnier!:\nIn Los Angeles, on the wall of one of the buildings located on Melrose Avenue, a mural in support of Ukraine appeared with the image of American actor Will Smith, who beats Putin.\nFrom https://t.co/r1WLfmdAqG\n#СлаваУкраїні https://t.co/Wg9df62o3D'>,
 <Record u.username='tonernews' t.text="IT'S THE LAW: If you Sell Chinese Toner in Los Angeles, You May be Sued and Lose Your Business!\nhttps://t.co/tB7AZhk5Ad\n#TonerCartridges #InkCartridges #stopthewar #Ukraine #Printing #Toner #Ink #Hp #Canon #Ricoh #Konica #Xerox #Lexmark #Closedl

#### Multiple Subjects

In [40]:
subjects = ['Germany', 'France']

query = f"""
        WITH {subjects} AS words

        MATCH (l:Location)<--(u:User)-[r:TWEETED|RETWEETS]->(t:Tweet)
        

        WHERE all(word IN words WHERE t.text CONTAINS word)

        RETURN l.name, t.text
        ORDER BY l.name
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record l.name='Alabama, USA🇺🇸' t.text="@RT_com's account has been withheld in Portugal, Finland, Sweden, Ireland, Slovenia, Czech Republic, Poland, Slovakia, Hungary, Italy, Malta, Germany, Greece, Romania, Netherlands, Bulgaria, Austria, Luxembourg, Latvia, Denmark, Lithuania, Croatia, Estonia, Cyprus, France, Spain, Belgium in response to a legal demand. Learn more.">,
 <Record l.name='Alabama, USA🇺🇸' t.text="@RT_com's account has been withheld in Portugal, Finland, Sweden, Ireland, Slovenia, Czech Republic, Poland, Slovakia, Hungary, Italy, Malta, Germany, Greece, Romania, Netherlands, Bulgaria, Austria, Luxembourg, Latvia, Denmark, Lithuania, Croatia, Estonia, Cyprus, France, Spain, Belgium in response to a legal demand. Learn more.">,
 <Record l.name='Belgrade/Calgary' t.text="@RT_com's account has been withheld in Portugal, Finland, Sweden, Ireland, Slovenia, Czech Republic, Poland, Slovakia, Hungary, Italy, Malta, Germany, Greece, Romania, Netherlands, Bulgaria, Austria, Luxem
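
`all(word IN words WHERE t.text CONTAINS word)` keeps a tweet only when every word in the list occurs in its text. The same predicate in plain Python, on made-up tweet texts, may make the logic easier to see:

```python
subjects = ["Germany", "France"]

# Hypothetical tweet texts, for illustration only.
texts = [
    "Germany and France agree on new sanctions",
    "Germany sends aid",
    "News from France",
]

# Equivalent of: all(word IN words WHERE t.text CONTAINS word)
matching = [text for text in texts if all(word in text for word in subjects)]

print(matching)
```

Replacing `all` with `any` would keep tweets that mention at least one of the subjects, in Python and in Cypher alike.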

### Who are the people most mentioned in tweets made in the most popular place?

In [33]:
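# Let's obtain the location names and their counts.
# Note: returning r1 and r2 makes them grouping keys, so Count is computed per
# relationship pair rather than per location; drop them from RETURN to count per location.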

query = f"""
        MATCH (l:Location)<--(:User)-[r2:TWEETED]->(t:Tweet)-[r1:MENTIONS]->(u:User)
        
        RETURN distinct l.name as Location, count(*) as Count, r1, r2
        
        ORDER BY Count DESC
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record Location='Microtitre Plates' Count=2 r1=<Relationship element_id='96503' nodes=(<Node element_id='123773' labels=frozenset() properties={}>, <Node element_id='38322' labels=frozenset() properties={}>) type='MENTIONS' properties={}> r2=<Relationship element_id='96484' nodes=(<Node element_id='28814' labels=frozenset() properties={}>, <Node element_id='123773' labels=frozenset() properties={}>) type='TWEETED' properties={}>>,
 <Record Location='United States' Count=1 r1=<Relationship element_id='198937' nodes=(<Node element_id='188566' labels=frozenset() properties={}>, <Node element_id='89333' labels=frozenset() properties={}>) type='MENTIONS' properties={}> r2=<Relationship element_id='198872' nodes=(<Node element_id='89306' labels=frozenset() properties={}>, <Node element_id='188566' labels=frozenset() properties={}>) type='TWEETED' properties={}>>,
 <Record Location='United States' Count=1 r1=<Relationship element_id='198924' nodes=(<Node element_id='188566' labels=frozenset

**Who are the people most mentioned in tweets made in Pennsylvania, the most popular place?**

In [35]:
query = """
        MATCH (l:Location{name:"Pennsylvania"})<--(:User)-[:TWEETED]->(t:Tweet)-[:MENTIONS]->(u:User)
        RETURN u.username, count(*) as Count
        ORDER BY Count DESC
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record u.username='Bigteethyouhave' Count=53>,
 <Record u.username='akuscg' Count=53>,
 <Record u.username='Kittie_Svengali' Count=53>,
 <Record u.username='victoria_roark' Count=53>,
 <Record u.username='ShellyRKirchoff' Count=53>]


### Which hashtags are used in the tweets with the most retweets?

In [36]:
query = """
        MATCH (u:User)-[]->(t:Tweet)-[]->(h:Hashtag)
        RETURN t.retweets, h.name
        ORDER BY t.retweets DESC
        LIMIT 10
    """

result = execute_read(driver, query)

pprint(result)

[<Record t.retweets=489 h.name='Ukraine'>,
 <Record t.retweets=489 h.name='Russian'>,
 <Record t.retweets=299 h.name='Ukraine'>,
 <Record t.retweets=299 h.name='Belarusian'>,
 <Record t.retweets=276 h.name='Ukraine'>,
 <Record t.retweets=276 h.name='Ringtausch'>,
 <Record t.retweets=191 h.name='DemVoice1'>,
 <Record t.retweets=191 h.name='ItsTimeToAdmitThat'>,
 <Record t.retweets=191 h.name='Fresh'>,
 <Record t.retweets=171 h.name='Ukraine'>]


### Top Mentioned User

In [37]:
query = """
        MATCH (:Tweet)-[:MENTIONS]->(u:User)
        RETURN u.username, count(*) as freq
        ORDER BY freq DESC 
        LIMIT 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record u.username='POTUS' freq=1101>,
 <Record u.username='Ukraine' freq=369>,
 <Record u.username='SaveThe_Ukraine' freq=338>,
 <Record u.username='Bigteethyouhave' freq=338>,
 <Record u.username='ShellyRKirchoff' freq=335>]


### Most frequently co-occurring tags

In [38]:
query = """
     MATCH (h1:Hashtag)-[]-(:Tweet)-[]-(h2:Hashtag)  
     WHERE h1.name < h2.name                     
     RETURN h1.name, h2.name, count(*) AS Frequency
     ORDER BY Frequency DESC
     limit 5
    """

result = execute_read(driver, query)

pprint(result)

[<Record h1.name='Russia' h2.name='Ukraine' Frequency=554>,
 <Record h1.name='Ukraine' h2.name='UkraineRussiaWar' Frequency=188>,
 <Record h1.name='Putin' h2.name='Ukraine' Frequency=154>,
 <Record h1.name='Mariupol' h2.name='Ukraine' Frequency=133>,
 <Record h1.name='StandWithUkraine' h2.name='Ukraine' Frequency=119>]
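
The `h1.name < h2.name` filter keeps each unordered pair once; without it every pair would be reported twice, as (A, B) and as (B, A). A minimal sketch of the same counting in Python, on made-up tweets:

```python
from collections import Counter
from itertools import combinations

# Hypothetical tweets, each with its set of hashtags.
tweet_hashtags = [
    {"Ukraine", "Russia"},
    {"Ukraine", "Russia", "Putin"},
    {"Ukraine", "Mariupol"},
]

pair_counts = Counter()
for tags in tweet_hashtags:
    # Sorting first means each pair comes out as (smaller, larger), like h1.name < h2.name.
    for h1, h2 in combinations(sorted(tags), 2):
        pair_counts[(h1, h2)] += 1

for pair, freq in pair_counts.most_common(3):
    print(pair, freq)
```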
