# Practice Session 04: Networks from text

Author: <font color="blue">Jose Giner</font>

E-mail: <font color="blue">joseginer67@gmail.com</font>

Date: <font color="blue">22/02/2022</font>

# 1. Create the directed mention network

In [2]:
import io
import json
import gzip
import csv
import re

In [3]:
# Leave this code as-is

# Input file
COMPRESSED_INPUT_FILENAME = "data/CovidLockdownCatalonia.json.gz"

# These are the output files, leave as-is
OUTPUT_ALL_EDGES_FILENAME = "data/CovidLockdownCatalonia.csv"
OUTPUT_FILTERED_EDGES_FILENAME = "data/CovidLockdownCatalonia-min-weight-filtered.csv"
OUTPUT_CO_MENTIONS_FILENAME = "data/CovidLockdownCatalonia-co-mentions.csv"

## 1.1. Extract mentions

In [4]:
# Leave this code as-is

def extract_mentions(text):
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

print(extract_mentions("RT @Jordi: check this post by @Xavier"))

['Jordi', 'Xavier']
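
To make the length constraint in the pattern concrete, the sketch below reruns the same regular expression on a few made-up strings. Only the pattern comes from the cell above; the sample texts and variable names are hypothetical.

```python
import re

def extract_mentions(text):
    # Same pattern as the provided cell: "@" followed by 5 to 20
    # letters, digits, or underscores
    return re.findall(r"@([a-zA-Z0-9_]{5,20})", text)

# Hypothetical sample texts, not taken from the dataset
short_handle = extract_mentions("thanks @abc for the update")
long_handle = extract_mentions("hello @" + "a" * 25)
repeated = extract_mentions("@salutcat and again @salutcat")

print(short_handle)      # handles shorter than 5 characters are skipped
print(len(long_handle[0]))  # a longer run is cut at 20 characters
print(repeated)          # repeated mentions are returned once per occurrence
```

Since repeated mentions inside one tweet are returned once per occurrence, each occurrence adds to the edge weight in the counting step.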


## 1.2. Count mentions

<font size="+1" color="red">Replace this cell with your code to read the compressed input file and create the mentions_counter dictionary.</font>

In [6]:
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    mentions_counter = {}
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        mentions = extract_mentions(message)
        for m in mentions:
            key = (author,m)
            if key in mentions_counter:
                mentions_counter[key] += 1
            else:
                mentions_counter[key] = 1


In [7]:
mentions_counter[('joanmariapique','catalangov')]

9
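
As an alternative to the explicit `if key in ...` branch, the same counting can be sketched with `collections.Counter`, which treats missing keys as zero. The tweets below are hypothetical and only mimic the `user.screen_name` and `full_text` fields read above; `sample_counter` is an illustrative name, not part of the assignment.

```python
import json
import re
from collections import Counter

def extract_mentions(text):
    return re.findall(r"@([a-zA-Z0-9_]{5,20})", text)

# Hypothetical tweets, one JSON document per line as in the input file
sample_lines = [
    json.dumps({"user": {"screen_name": "alice_bcn"},
                "full_text": "via @salutcat and @govern"}),
    json.dumps({"user": {"screen_name": "alice_bcn"},
                "full_text": "thanks @salutcat"}),
    json.dumps({"user": {"screen_name": "bob_girona"},
                "full_text": "no mentions here"}),
]

sample_counter = Counter()
for line in sample_lines:
    tweet = json.loads(line)
    author = tweet["user"]["screen_name"]
    for mention in extract_mentions(tweet["full_text"]):
        # Missing keys start at 0, so no membership test is needed
        sample_counter[(author, mention)] += 1

print(sample_counter[("alice_bcn", "salutcat")])
```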

In [16]:
# Leave this code as-is
with io.open(OUTPUT_ALL_EDGES_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        writer.writerow([author, mention, weight])

In [24]:
import pandas as pd

# Load the full edge list and keep only edges with weight of at least 2
edges = pd.read_csv(OUTPUT_ALL_EDGES_FILENAME, sep='\t')
edges = edges[edges.Weight >= 2]

In [25]:
edges.head()

| | Source | Target | Weight |
|---|---|---|---|
| 7 | XaviMarti5 | XSalaimartin | 2 |
| 50 | EnriqueTgn | vpartal | 2 |
| 83 | mjorubiomari | elnacionalcat | 2 |
| 93 | baco8baco | Gargotejant | 2 |
| 97 | ERoigM | marfanta | 2 |


In [26]:
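# index=False keeps the pandas row index out of the file,
# so the columns stay Source, Target, Weight
edges.to_csv(OUTPUT_FILTERED_EDGES_FILENAME, sep='\t', index=False)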
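
For comparison, the minimum-weight filter can also be sketched with the standard library alone. The edge list below is hypothetical and held in memory; `MIN_WEIGHT`, `sample_tsv`, and `filtered_tsv` are illustrative names, and the threshold of 2 matches the filter used above.

```python
import csv
import io

MIN_WEIGHT = 2  # same threshold as the filter above

# Hypothetical edge list in the same tab-separated layout as the output file
sample_tsv = (
    "Source\tTarget\tWeight\n"
    "user_a\tuser_b\t1\n"
    "user_a\tuser_c\t3\n"
    "user_d\tuser_b\t2\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
kept_rows = [row for row in reader if int(row["Weight"]) >= MIN_WEIGHT]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Source", "Target", "Weight"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(kept_rows)
filtered_tsv = out.getvalue()
print(filtered_tsv)
```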

# 2. Visualize the directed mention network

![Mentions graph](mentions.png)

This graph consists of several components. The largest one, at the top, contains a few nodes with high in-degree centrality (drawn larger), which represent popular actors such as the government, newspapers, or the health services of Catalonia. There are also many small disconnected components, which may illustrate casual mentions between friends. The largest component contains 699 of the 1600 nodes, about 44% of the total. The second largest component contains 16 nodes, 1% of the nodes in the graph.
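
The component sizes above were read from the visualization tool. As a rough cross-check, the sizes of the weakly connected components could be computed directly from an edge list. The following is a minimal union-find sketch on a hypothetical toy graph; `component_sizes` and `toy_edges` are illustrative names, and edge direction is ignored on purpose.

```python
from collections import Counter

def component_sizes(edges):
    """Return the sizes of the weakly connected components, largest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t  # merge the two components

    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy graph: one component of 3, one of 2, one self-loop
toy_edges = [("a", "b"), ("c", "b"), ("d", "e"), ("f", "f")]
sizes = component_sizes(toy_edges)
print(sizes)

# Share of the largest component, using the node counts reported above
largest_share = 699 / 1600
print(round(100 * largest_share, 1))
```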

![Mentions graph largest connected component](mentions-largest-cc.png)

![Mentions in-degree distribution](mentions-in-degree-distribution.png)

![Mentions out-degree distribution](mentions-out-degree-distribution.png)

The largest component looks like a scale-free graph, in which most edges attach to a few high-degree nodes, often called hubs. In the node groups produced by affinity propagation clustering, the users who mention the same high-degree node clearly tend to fall in the same group. In some cases, however, the hub itself belongs to a different group than the less popular nodes pointing to it.

Both node degree distributions appear to follow a power law. In the in-degree distribution, most nodes have zero incoming edges: these are users who are not mentioned by any other account. The few nodes with a high in-degree are the high-centrality nodes. As noted above, they represent official services, the government, and politicians of Catalonia, who were the centre of attention during the pandemic. The out-degree distribution behaves similarly: many nodes have a low out-degree (many users did not mention others), and a few have a high out-degree, corresponding to casual users who mentioned many other users that day.
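
The distributions in the figures come from the network analysis tool. To illustrate what they count, here is a small sketch that derives in-degree and out-degree distributions from a directed edge list. The toy edges are hypothetical, and nodes with degree zero are included explicitly.

```python
from collections import Counter

# Hypothetical directed edges (source mentions target)
toy_edges = [("u1", "hub"), ("u2", "hub"), ("u3", "hub"),
             ("u3", "u1"), ("hub", "u4")]

nodes = {node for edge in toy_edges for node in edge}
in_degree = Counter(target for _, target in toy_edges)
out_degree = Counter(source for source, _ in toy_edges)

# Distribution: number of nodes per degree value.
# A Counter returns 0 for nodes that never appear as target (or source).
in_distribution = Counter(in_degree[node] for node in nodes)
out_distribution = Counter(out_degree[node] for node in nodes)

print(sorted(in_distribution.items()))
print(sorted(out_distribution.items()))
```

In this toy graph, two of the five nodes have in-degree zero and a single node concentrates three incoming edges, which is the shape described above at a very small scale.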

Interestingly, the diameter of this largest component is 20. I consider this a high value: even though the hubs have a high betweenness centrality, some pairs of nodes are separated by many hops. The characteristic path length is almost 6, which is consistent with the small-world property of this type of network, since the average distance is on the order of log(N), with N = 699 (ln 699 ≈ 6.5).
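
The comparison with log(N) can be checked with a few lines of arithmetic. The sketch assumes the natural logarithm is the intended one; the base-10 value is printed only for contrast.

```python
import math

N = 699  # nodes in the largest component, as reported above

ln_n = math.log(N)       # natural logarithm
log10_n = math.log10(N)  # base-10 logarithm, for contrast

print(round(ln_n, 2))
print(round(log10_n, 2))
```

With the natural logarithm, the value lies between 6.5 and 6.6, close to the reported characteristic path length of almost 6.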


# 3. Create the undirected co-mention network

In [21]:
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    co_mentions_counter = {}
    for line in input_file:
        tweet = json.loads(line)
        message = tweet["full_text"]
        mentions = extract_mentions(message)
        for i in range(len(mentions)-1):
            for j in range(i+1,len(mentions)):
                tup = (mentions[i],mentions[j])
                key = tuple(sorted(tup))
                
                if key in co_mentions_counter:
                    co_mentions_counter[key] += 1
                else:
                    co_mentions_counter[key] = 1

In [25]:
key = ('agriculturacat', 'uniopagesos')
print(co_mentions_counter[key])

8
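
The nested index loops can also be written with `itertools.combinations`. Note that the variant below is not identical to the cell above: it removes duplicate handles within a tweet first, so a pair is counted at most once per tweet and an account is never paired with itself. Whether that is preferable depends on how the weights should be interpreted; `co_mention_pairs` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_mention_pairs(mentions):
    """Return the unordered pairs of distinct accounts mentioned in one tweet."""
    unique = sorted(set(mentions))  # sorting makes (a, b) and (b, a) the same key
    return list(combinations(unique, 2))

# Hypothetical mention lists, one per tweet
tweets_mentions = [
    ["salutcat", "govern", "salutcat"],
    ["govern", "salutcat", "vpartal"],
    ["vpartal"],
]

sample_co_mentions = Counter()
for mentions in tweets_mentions:
    sample_co_mentions.update(co_mention_pairs(mentions))

print(sample_co_mentions[("govern", "salutcat")])
```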


In [27]:
with io.open(OUTPUT_CO_MENTIONS_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in co_mentions_counter:
        user1 = key[0]
        user2 = key[1]
        weight = co_mentions_counter[key]
        writer.writerow([user1, user2, weight])

![Co-mentions graph](co-mentions.png)

Zoom in to the largest component:

![Co-mentions largest component](co-mentions-largest-comp.png)

The co-mentions graph contains several components. The largest has 1514 of the 4044 nodes (37%). Most of the secondary components are highly interconnected or complete subgraphs, suggesting that they represent users who are only ever mentioned together in the same tweet. The same pattern appears inside the largest component, which contains densely connected regions; here, however, some users are also co-mentioned with users outside their region, which creates paths to the rest of the component.

The degree distribution of this largest component appears to follow a power law, typical of scale-free networks, and many nodes have a degree below the average. The high-centrality nodes sit at the centre of the component; they are the accounts that have been mentioned together with many other users, for example the Catalan government, health authorities, or politicians. The more followers an account has, the more users it is expected to be co-mentioned with. The network analyzer statistics show a small characteristic path length of 5.4, so the average distance between nodes is not large, and a clustering coefficient of 0.5, meaning that on average half of the possible edges between the neighbours of a node are present. Together, these two values support the small-world phenomenon in this type of network. The average degree is not very informative here because of the long tail of the degree distribution: the high-degree nodes produce a high variance.
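
To make the interpretation of the clustering coefficient concrete, the sketch below computes the local clustering coefficient on a hypothetical undirected toy graph and averages it over all nodes. It assumes the common convention that nodes with fewer than two neighbours get a coefficient of 0; the tool used for the statistics above may handle that case differently.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of possible edges between the neighbours of `node` that exist."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    links = sum(1 for a, b in combinations(sorted(neighbours), 2)
                if b in adjacency[a])
    return links / (k * (k - 1) / 2)

# Hypothetical undirected graph: a triangle a-b-c plus a pendant node d
toy_adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

coefficients = {node: local_clustering(toy_adjacency, node)
                for node in toy_adjacency}
average_clustering = sum(coefficients.values()) / len(coefficients)
print(coefficients["a"], average_clustering)
```

Node `a` has three neighbours and only one of the three possible edges among them exists, so its coefficient is 1/3, while `b` and `c` each reach 1.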

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, text, and figures were produced by myself.</font>
