# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install some-package
%pip install networkx
%pip install scipy

Note: you may need to restart the kernel to use updated packages.



### Import Python packages

In [2]:
#e.g., import some-package
import networkx as nx
from scipy import stats

---

### Task 1 of 1

Examine the Graph Modelling Language (gml) files "socialmedia_cmt224_reply_network.gml" (reply network) and "socialmedia_cmt224_social_network.gml" (social network) which represent Twitter data between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users).  However, the shared user ids are contained within the "label" attribute in the .gml files, not the node "id" attribute of each individual .gml file.
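
The distinction between the "label" and "id" attributes can be seen on a small example. The sketch below parses a made-up two-node GML string (the ids and labels are hypothetical, not taken from the real files) and shows which attribute ends up as the node key in each case.

```python
import networkx as nx

# Hypothetical two-node GML snippet; the ids and labels are invented for illustration.
toy_gml = """graph [
  directed 1
  node [ id 0 label "1001" ]
  node [ id 1 label "1002" ]
  edge [ source 0 target 1 ]
]"""

# label="label" (the default for read_gml/parse_gml) keys nodes by the shared user id.
by_label = nx.parse_gml(toy_gml, label="label")
# label="id" keys nodes by the file-local id instead.
by_id = nx.parse_gml(toy_gml, label="id")

print(list(by_label.nodes()))
print(list(by_id.nodes()))
```

Because `nx.read_gml` uses `label="label"` by default, loading both files with the default setting should give node keys that are comparable across the two networks.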

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Edges are weighted with the weight representing the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.
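
As a rough illustration of how edge direction maps onto degrees, the sketch below builds a toy reply network by hand. The user names are made up, and the attribute name `weight` is an assumption; the real files may store the reply count under a different key.

```python
import networkx as nx

# Toy reply network: "a" replied twice to "c", "b" replied once to "c".
# The attribute name "weight" is an assumption made for this sketch.
toy_reply = nx.DiGraph()
toy_reply.add_edge("a", "c", weight=2)
toy_reply.add_edge("b", "c", weight=1)

# Out-degree: number of distinct users that "a" replied to.
print(toy_reply.out_degree("a"))
# In-degree: number of distinct users who replied to "c".
print(toy_reply.in_degree("c"))
# Weighted in-degree: total number of replies "c" received.
print(toy_reply.in_degree("c", weight="weight"))
```

The same reading applies to the social network, where out-degree counts the users someone follows and in-degree counts their followers.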

Using these networks, answer the following questions:

##### Q1. What fraction of users do not reply to or follow any other user, but have had others reply to their Tweets?

In [3]:
#CODE:
# 1 Load dataset.
rn = nx.read_gml("socialmedia_cmt224_reply_network.gml")
sn = nx.read_gml("socialmedia_cmt224_social_network.gml")

# 2 Find users who reply to nobody and follow nobody, but have received replies.
# In degree terms: reply in-degree > 0, reply out-degree == 0 and social out-degree == 0.
r_order = list(rn.nodes())
only_replied_to = []
for node in r_order:
    r_in_degree = rn.in_degree(node)
    r_out_degree = rn.out_degree(node)
    s_out_degree = sn.out_degree(node)
    if r_in_degree > 0 and r_out_degree == 0 and s_out_degree == 0:
        only_replied_to.append(node)
# Compute the fraction, using the users in the reply network as the denominator.
fraction = len(only_replied_to) / len(r_order)
print(f"Fraction of users who do not reply to or follow any other user, but have had others reply to their Tweets: {fraction * 100:.2f}%")

Fraction of users who do not reply to or follow any other user, but have had others reply to their Tweets: 0.91%
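
The degree conditions used for Q1 can be checked on a toy pair of graphs. This is a minimal sketch: the helper name `passive_recipients` and the toy edges are hypothetical, and it assumes that a user absent from the social network follows nobody. It also uses the union of both node sets as the denominator, which is one possible choice rather than the one used in the cell above.

```python
import networkx as nx

def passive_recipients(reply_net, social_net):
    """Users with incoming replies but no outgoing replies and no outgoing follows."""
    return [
        n for n in reply_net
        if reply_net.in_degree(n) > 0
        and reply_net.out_degree(n) == 0
        # Assumption: a user missing from the social network follows nobody.
        and (n not in social_net or social_net.out_degree(n) == 0)
    ]

# Hypothetical toy data.
toy_rn = nx.DiGraph([("a", "c"), ("b", "c"), ("a", "d")])
toy_sn = nx.DiGraph([("a", "b"), ("d", "a")])

all_users = set(toy_rn) | set(toy_sn)
found = passive_recipients(toy_rn, toy_sn)
print(found, len(found) / len(all_users))
```

Here "c" qualifies because it only receives replies, while "d" is excluded because it follows "a".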


##### Q2. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [4]:
#CODE:
# Computing the density of each network.
density_reply = nx.density(rn)
density_social = nx.density(sn)
# Computing the number of strongly connected components and weakly connected components of each network.
rn_number_scc_components = nx.number_strongly_connected_components(rn)
rn_number_wcc_components = nx.number_weakly_connected_components(rn)
sn_number_scc_components = nx.number_strongly_connected_components(sn)
sn_number_wcc_components = nx.number_weakly_connected_components(sn)
# Print the results.
print(f"The density of reply network: {density_reply}")
print(f"The density of social network: {density_social}")
print(f"Number of strongly connected components in reply network: {rn_number_scc_components}")
print(f"Number of weakly connected components in reply network: {rn_number_wcc_components}")
print(f"Number of strongly connected components in social network: {sn_number_scc_components}")
print(f"Number of weakly connected components in social network: {sn_number_wcc_components}")

The density of reply network: 4.080533494732535e-05
The density of social network: 0.000992358613762732
Number of strongly connected components in reply network: 16217
Number of weakly connected components in reply network: 5920
Number of strongly connected components in social network: 4648
Number of weakly connected components in social network: 436
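
For reference, the density of a directed graph with n nodes and m edges is m / (n(n - 1)), and strongly connected components require paths in both directions whereas weakly connected components ignore edge direction. The sketch below checks both on a hypothetical five-node graph.

```python
import networkx as nx

# Hypothetical toy graph: a<->b, b->c, and a separate pair d->e.
toy = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("d", "e")])

n = toy.number_of_nodes()
m = toy.number_of_edges()
# Directed density: edges present divided by the n * (n - 1) possible ordered pairs.
manual_density = m / (n * (n - 1))

print(manual_density, nx.density(toy))
# Weak components ignore direction; strong components need mutual reachability.
print(nx.number_weakly_connected_components(toy))
print(nx.number_strongly_connected_components(toy))
```

In the toy graph only "a" and "b" can reach each other, so they form the single strongly connected component with more than one node.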


##### Q3. Does the number of users a user follows in the social network correlate with the number of replies that they make?

In [5]:
#CODE:
# Get the out-degree of each node in the reply network and the social network.
# Note: the unweighted reply out-degree counts the distinct users a user replied to.
r_out_degree_dict = dict(rn.out_degree())
s_out_degree_dict = dict(sn.out_degree())

# Align the two degree sequences node by node: walk the reply network's nodes in order
# and keep only users that also appear in the social network, so both lists have the
# same length and the same ordering.
common_nodes = [node for node in rn.nodes() if node in s_out_degree_dict]
r_out_degree_sequence = [r_out_degree_dict[node] for node in common_nodes]
s_out_degree_sequence_sorted = [s_out_degree_dict[node] for node in common_nodes]

# Compute the Pearson correlation coefficient.
r, p = stats.pearsonr(r_out_degree_sequence, s_out_degree_sequence_sorted)
print(f"pearson r: {r:.2f}, p: {p}")

pearson r: 0.06, p: 2.310792620769861e-15
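
Pearson's r measures linear association and can be dominated by a few very large values, which are common in degree data. As a possible cross-check (a swapped-in technique, not the method used above), the sketch below compares it with Spearman's rank correlation on made-up numbers; the variable names and values are hypothetical.

```python
from scipy import stats

# Hypothetical degree sequences: perfectly monotone, but with one extreme value.
toy_follow_counts = [1, 2, 3, 4, 5]
toy_reply_counts = [1, 4, 9, 16, 1000]

pearson_r, pearson_p = stats.pearsonr(toy_follow_counts, toy_reply_counts)
# Spearman works on ranks, so it only measures whether the relationship is monotone.
spearman_rho, spearman_p = stats.spearmanr(toy_follow_counts, toy_reply_counts)

print(f"pearson r: {pearson_r:.2f}, spearman rho: {spearman_rho:.2f}")
```

On these toy values the rank correlation is perfect while Pearson's r is pulled down by the outlier, which is why reporting both can be informative for skewed data.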


##### Q4. Is a user that replies to another user's Tweet multiple times more likely to follow that user in comparison to if they only replied once?

In [6]:
#CODE:
# Work on a DiGraph copy of the reply network; edge attributes (reply counts) are preserved.
w_rn = nx.DiGraph(rn)

# Initialise the counters.
follow_with_multireplies = 0
sum_multireplies = 0
follow_with_onereply = 0
sum_onereply = 0

# Classify each reply edge (u, v) by its reply count and record whether u also follows v.
for u, v, data in w_rn.edges(data=True):
    # Number of times u replied to v; defaults to 1 if the 'occurrences' attribute is absent.
    weight = data.get('occurrences', 1)
    # Check if u is following v.
    if sn.has_edge(u, v):
        # u replied to v multiple times.
        if weight > 1:
            follow_with_multireplies += 1
        # u replied to v exactly once.
        else:
            follow_with_onereply += 1
    # Count all edges with multiple replies.
    if weight > 1:
        sum_multireplies += 1
    # Count all edges with a single reply.
    else:
        sum_onereply += 1
        
# Computing the probability.
prob_single_reply = follow_with_onereply / sum_onereply if sum_onereply else 0
prob_multi_replies = follow_with_multireplies / sum_multireplies if sum_multireplies else 0

# Print the results.
print(f"Proportion of single-reply pairs where the replier follows the other user: {prob_single_reply:.2f}")
print(f"Proportion of multi-reply pairs where the replier follows the other user: {prob_multi_replies:.2f}")

Proportion of single-reply pairs where the replier follows the other user: 0.84
Proportion of multi-reply pairs where the replier follows the other user: 0.88
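
The two figures above are conditional proportions: among reply edges of a given kind, the share where the replier also follows the recipient. A minimal sketch of that calculation on hand-made data follows; the edge lists and the helper `follow_rate` are hypothetical.

```python
# Hypothetical reply edges (replier, recipient, times_replied) and follow edges.
toy_replies = [("a", "b", 3), ("a", "c", 1), ("b", "c", 1), ("c", "a", 2)]
toy_follows = {("a", "b"), ("c", "a"), ("b", "c")}

def follow_rate(min_count, max_count):
    """Share of reply pairs with a reply count in [min_count, max_count] where the replier follows the recipient."""
    pairs = [(u, v) for u, v, w in toy_replies if min_count <= w <= max_count]
    if not pairs:
        return 0.0
    return sum(pair in toy_follows for pair in pairs) / len(pairs)

single_rate = follow_rate(1, 1)
multi_rate = follow_rate(2, float("inf"))
print(single_rate, multi_rate)
```

In this toy data both multi-reply pairs involve a follow, but only one of the two single-reply pairs does.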


##### Q5. How many users have only mutual following connections (i.e., every user they follow also follows them) and only mutual reply connections with these same users?

In [7]:
#CODE:
def find_mutual_users(sn, rn):
    # Initialise a list to store the users that satisfy the condition.
    sn_rn_mutual_users = []
    # Iterate over all users in the reply network.
    for user in rn.nodes():
        # Users this user follows, and users who follow this user.
        follow_out = set(sn.successors(user))
        follow_in = set(sn.predecessors(user))
        # Users with whom the following relationship is mutual.
        mutual_follows = follow_out & follow_in

        # Check the reply network only if there is at least one mutual follow.
        if mutual_follows:
            # Users this user replied to, and users who replied to this user.
            reply_out = set(rn.successors(user))
            reply_in = set(rn.predecessors(user))
            mutual_reply = reply_out & reply_in
            # Keep the user if their mutual reply partners are exactly their mutual follow partners.
            # Note: this does not check that every follow or every reply is reciprocated.
            if mutual_reply == mutual_follows:
                sn_rn_mutual_users.append(user)
    return sn_rn_mutual_users
mutual_users = find_mutual_users(sn, rn)
print(f"Number of users who have only mutual following connections and only mutual reply connections with these same users: {len(mutual_users)}")

Number of users who have only mutual following connections and only mutual reply connections with these same users: 324
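
The wording of Q5 can also be read more strictly: every user someone follows must follow them back, every reply connection must be reciprocated, and reply partners must be among the users they follow. The sketch below implements that reading on toy data. The function name `strictly_mutual_users` and the toy edges are hypothetical, it is one possible interpretation rather than the method used above, and it may give a different count on the real data.

```python
import networkx as nx

def strictly_mutual_users(social_net, reply_net):
    """Users whose follows are all reciprocated and whose replies are all mutual, within their follows."""
    result = []
    for user in social_net:
        follows = set(social_net.successors(user))
        followers = set(social_net.predecessors(user))
        # Require at least one follow, and every follow to be returned.
        if not follows or not follows <= followers:
            continue
        if user in reply_net:
            replied_to = set(reply_net.successors(user))
            replied_by = set(reply_net.predecessors(user))
        else:
            replied_to, replied_by = set(), set()
        # Every reply connection is mutual and stays within the followed users.
        # Assumption: users with no reply connections at all pass this check.
        if replied_to == replied_by and replied_to <= follows:
            result.append(user)
    return result

# Hypothetical toy data: a and b follow and reply to each other; c follows a without a follow back.
toy_sn = nx.DiGraph([("a", "b"), ("b", "a"), ("c", "a")])
toy_rn = nx.DiGraph([("a", "b"), ("b", "a")])
print(strictly_mutual_users(toy_sn, toy_rn))
```

User "c" is excluded here because "a" does not follow them back, which the intersection-based check would not detect on its own.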
