# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install some-package
%pip install networkx numpy scipy matplotlib

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### Import Python packages

In [1]:
#e.g., import some-package
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
from typing import List, Set
from scipy.stats import pearsonr

---

### Task 1 of 1

Examine the Graph Modelling Language (gml) files "socialmedia_reply_network.gml" (reply network) and "socialmedia_social_network.gml" (social network) which represent Twitter data between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users).  However, the shared user ids are contained within the "label" attribute in the .gml files, not the node "id" attribute of each individual .gml file.

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Edges are weighted with the weight representing the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.
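
The distinction between the file-local "id" and the shared "label" described above can be seen on a small example. The sketch below parses a hypothetical two-node GML snippet (the ids, labels, and weight are invented for illustration and are not taken from the data files) and shows how the `label` argument decides which attribute names the nodes.

```python
import networkx as nx

# Hypothetical GML text: "id" is local to the file, "label" holds the shared user id.
toy_gml = """graph [
  directed 1
  node [
    id 0
    label "101"
  ]
  node [
    id 1
    label "205"
  ]
  edge [
    source 0
    target 1
    weight 3
  ]
]"""

# Key nodes by the shared label (what the notebook needs to compare networks).
by_label = nx.parse_gml(toy_gml, label="label")
# Key nodes by the file-local id (not comparable across files).
by_id = nx.parse_gml(toy_gml, label="id")

print(sorted(by_label.nodes))
print(sorted(by_id.nodes))
print(by_label["101"]["205"]["weight"])
```

Keying both graphs by the label is what makes a lookup such as `social_net.has_edge(u, v)` meaningful for a pair `(u, v)` taken from the reply network.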

Using these networks, answer the following questions:

##### Q1. What fraction of users do not reply to or follow any other user, but have had others reply to their Tweets?

In [2]:
#CODE:

# Read both networks with networkx, keying nodes by the shared "label" attribute.
social_net = nx.read_gml('socialmedia_social_network.gml', label='label')
reply_net = nx.read_gml('socialmedia_reply_network.gml', label='label')


def generate_relevant_set(user_dict: dict) -> Set[int]:
    """Return the users whose count in `user_dict` is greater than zero."""
    return {node for node, value in user_dict.items() if value > 0}

# users whose Tweets received at least one reply
r_in_degree_users = dict(reply_net.in_degree(nbunch=reply_net.nodes))
users_with_replies = generate_relevant_set(r_in_degree_users)

# users that reply to others 
r_out_degree_users = dict(reply_net.out_degree(nbunch=reply_net.nodes))
users_that_reply = generate_relevant_set(r_out_degree_users)

# users that follow other users 
s_out_degree_users = dict(social_net.out_degree(nbunch=social_net.nodes))
users_that_follow_others = generate_relevant_set(s_out_degree_users)

# users that have replies but don't reply to or follow other users
users_that_dont_reply_follow = users_with_replies - users_that_reply - users_that_follow_others

# fraction relative to the number of users (nodes) in the reply network
frac = len(users_that_dont_reply_follow) / reply_net.number_of_nodes()

print(f"Fraction of users: {frac:.4f}")

Fraction of users: 0.0091
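
To make the set logic behind this answer easier to check by hand, here is a minimal sketch on two toy graphs. The users `a`, `b`, `c` and their edges are invented for illustration, and the denominator here is the union of users across both toy networks, which is an assumption that may differ from the notebook's choice of the reply network's node count.

```python
import networkx as nx

# Toy reply network: a and b reply to c, a also replies to b.
reply = nx.DiGraph([("a", "c"), ("b", "c"), ("a", "b")])
# Toy social network: only a follows anyone.
social = nx.DiGraph()
social.add_nodes_from(["a", "b", "c"])
social.add_edge("a", "b")


def with_positive_degree(degree_view):
    """Return the nodes whose degree in the given view is greater than zero."""
    return {node for node, degree in degree_view if degree > 0}


replied_to = with_positive_degree(reply.in_degree())    # received a reply
repliers = with_positive_degree(reply.out_degree())     # replied to someone
followers = with_positive_degree(social.out_degree())   # follow someone

# Users who were replied to but neither reply nor follow.
passive = replied_to - repliers - followers

# Assumed denominator: every user that appears in either network.
all_users = set(reply) | set(social)
fraction = len(passive) / len(all_users)
print(sorted(passive), round(fraction, 4))
```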


##### Q2. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [3]:
#CODE: 

# density = fraction of possible directed edges that exist; a lower value means a sparser network
reply_net_density = nx.density(reply_net)
social_net_density = nx.density(social_net)

# getting average clustering for both networks
avg_clustering_reply = nx.average_clustering(reply_net)
avg_clustering_social = nx.average_clustering(social_net)

# getting the number of weakly connected components in each network 
reply_weak_components = nx.number_weakly_connected_components(reply_net)
social_weak_components = nx.number_weakly_connected_components(social_net)

# getting the number of strongly connected components in each network
reply_strong_components = nx.number_strongly_connected_components(reply_net)
social_strong_components = nx.number_strongly_connected_components(social_net)

print(f"Reply Network Density or Sparsity: {reply_net_density:.6f}")
print(f"Social Network Density or Sparsity: {social_net_density:.6f}")
print(f"Reply Network Average clustering: {avg_clustering_reply:.4f}")
print(f"Social Network Average clustering: {avg_clustering_social:.4f}")
print(f"Social Network Weakly Connected Components:{social_weak_components}")
print(f"Reply Network Weakly Connected Components:{reply_weak_components}")
print(f"Social Network Strongly Connected Components:{social_strong_components}")
print(f"Reply Network Strongly Connected Components:{reply_strong_components}")

Reply Network Density or Sparsity: 0.000041
Social Network Density or Sparsity: 0.000992
Reply Network Average clustering: 0.0012
Social Network Average clustering: 0.1684
Social Network Weakly Connected Components:436
Reply Network Weakly Connected Components:5920
Social Network Strongly Connected Components:4648
Reply Network Strongly Connected Components:16217
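
As a reading aid for the figures above, the following sketch reproduces the same three measurements on a toy directed graph (nodes and edges invented for illustration). It checks that `nx.density` for a directed graph equals m / (n(n-1)), and shows how the weak and strong component counts can differ on the same graph.

```python
import networkx as nx

# Toy directed graph: a and b point at each other, b points at c, d is isolated.
toy = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c")])
toy.add_node("d")

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Directed density: edges present divided by the n * (n - 1) possible edges.
manual_density = m / (n * (n - 1))
library_density = nx.density(toy)

# Weak components ignore edge direction; strong components require
# a directed path in both directions between every pair of members.
weak = nx.number_weakly_connected_components(toy)
strong = nx.number_strongly_connected_components(toy)

print(manual_density, library_density, weak, strong)
```

Here `c` belongs to the same weak component as `a` and `b` but forms its own strong component, because nothing leads back from `c`. A strong component count well above the weak count, as in the output above, is consistent with many one-directional links.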


##### Q3. Does the number of users a user follows in the social network correlate with the number of replies that they make?

In [7]:
#CODE:

# correlation between the number of follows and the number of replies

# dictionary mapping each user to their follow count (out-degree) in the social network
follow_counts = dict(social_net.out_degree(nbunch=social_net.nodes))

# dictionary mapping each user to their out-degree in the reply network;
# the degree is unweighted, so it counts distinct users replied to, not total replies
reply_counts = dict(reply_net.out_degree(nbunch=reply_net.nodes))

# keep only the users present in both networks
common_users = set(follow_counts.keys()) & set(reply_counts.keys())
follows = [follow_counts[user] for user in common_users]
replies = [reply_counts[user] for user in common_users]

# calculate the Pearson correlation coefficient
correlation, _ = pearsonr(follows, replies)
print(f"{correlation:.2f}")



0.06
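
Pearson's coefficient measures linear association only, so a value near zero does not by itself rule out a monotonic relationship. The sketch below uses invented numbers (not the notebook's data) to contrast `pearsonr` with Spearman's rank correlation. `spearmanr` is a different technique from the one used above and is shown only as a possible cross-check; the sketch also keeps the p-values instead of discarding them.

```python
from scipy.stats import pearsonr, spearmanr

# Invented toy data: y always increases with x, but one value is extreme.
follows_toy = [1, 2, 3, 4, 5]
replies_toy = [1, 4, 9, 16, 1000]

# Pearson: strength of the linear relationship.
r, r_pvalue = pearsonr(follows_toy, replies_toy)

# Spearman: Pearson applied to ranks, so it captures any monotonic trend.
rho, rho_pvalue = spearmanr(follows_toy, replies_toy)

print(f"Pearson r = {r:.2f} (p = {r_pvalue:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_pvalue:.3f})")
```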


##### Q4. Is a user that replies to another user's Tweet multiple times more likely to follow that user in comparison to if they only replied once?

In [8]:
#CODE:
# function to check if a user who replies multiple times is more likely to follow the other user
def is_more_likely_to_follow(reply_network, social_network):
    multiple_reply_follow_count = 0
    multiple_reply_count = 0
    single_reply_follow_count = 0
    single_reply_count = 0

    # iterate over edges in the reply network
    for u, v, data in reply_network.edges(data=True):
        # check if the user follows the user being replied to
        follows = social_network.has_edge(u, v)

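        # count replies based on the number of occurrences and follow status;
        # note: edges with several replies are weighted by their reply count,
        # so the "multiple" percentage is per reply rather than per user pair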
        if data['occurrences'] > 1:
            multiple_reply_count += data['occurrences']
            if follows:
                multiple_reply_follow_count += data['occurrences']
        elif data['occurrences'] == 1:
            single_reply_count += 1
            if follows:
                single_reply_follow_count += 1

    # calculate percentages
    multiple_reply_follow_percentage = (multiple_reply_follow_count / multiple_reply_count * 100) if multiple_reply_count else 0
    single_reply_follow_percentage = (single_reply_follow_count / single_reply_count * 100) if single_reply_count else 0

    return multiple_reply_follow_percentage, single_reply_follow_percentage

# check if a user who replies multiple times is more likely to follow the other user
multiple_reply_follow_percentage, single_reply_follow_percentage = is_more_likely_to_follow(reply_net, social_net)

print(f"Multiple replies + follow percentage: {multiple_reply_follow_percentage:.2f}%")
print(f"Single reply + follow percentage: {single_reply_follow_percentage:.2f}%")

Multiple replies + follow percentage: 88.07%
Single reply + follow percentage: 83.82%
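
The percentage for multiple replies depends on whether each (replier, recipient) pair counts once or is weighted by its number of replies. The sketch below uses invented toy graphs to show both conventions side by side. The edge attribute name `occurrences` mirrors the code above, and the helper `follow_share` is hypothetical, introduced only for this illustration.

```python
import networkx as nx

# Toy reply network with a per-edge reply count.
reply = nx.DiGraph()
reply.add_edge("a", "b", occurrences=5)  # a follows b
reply.add_edge("a", "c", occurrences=2)  # a does not follow c
reply.add_edge("b", "c", occurrences=1)  # b follows c
reply.add_edge("c", "a", occurrences=1)  # c does not follow a

# Toy social network.
social = nx.DiGraph([("a", "b"), ("b", "c")])


def follow_share(edges, weighted):
    """Share of reply edges whose replier follows the recipient.

    With weighted=True every edge counts `occurrences` times (per reply);
    with weighted=False every edge counts once (per user pair).
    """
    total = followed = 0
    for u, v, occurrences in edges:
        w = occurrences if weighted else 1
        total += w
        if social.has_edge(u, v):
            followed += w
    return followed / total if total else 0.0


all_edges = list(reply.edges(data="occurrences"))
multi = [e for e in all_edges if e[2] > 1]
single = [e for e in all_edges if e[2] == 1]

per_pair_multi = follow_share(multi, weighted=False)
per_reply_multi = follow_share(multi, weighted=True)
per_pair_single = follow_share(single, weighted=False)

print(f"{per_pair_multi:.2%} {per_reply_multi:.2%} {per_pair_single:.2%}")
```

In this toy case the per-pair share for multiple replies is 50% while the per-reply share is higher, because the pair with five replies dominates the weighted count.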


##### Q5. How many users have only mutual following connections (i.e., every user they follow also follows them) and only mutual reply connections with these same users?

In [9]:
#CODE:
users_with_mutual_connections = set()

# iterate over nodes to find users with only mutual connections
for user in social_net.nodes():
    # users that 'user' follows
    following = set(social_net.successors(user))  
    # users that follow 'user'
    followers = set(social_net.predecessors(user))  
    # check for mutual following (the followed and follower sets must match exactly)
    if following == followers:
        # users that 'user' replied to
        replied_users = set(reply_net.successors(user))
        # users that replied to 'user'
        replied_back_users = set(reply_net.predecessors(user))
        # check for mutual replies (same exact-match test on the reply network)
        if replied_users == replied_back_users:
            # add user if both checks pass
            users_with_mutual_connections.add(user)

# count users with only mutual connections
num_users_with_mutual_connections = len(users_with_mutual_connections)

print(f"{num_users_with_mutual_connections}")


261
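
The question can be read in more than one way, and the count depends on the reading. The sketch below uses invented toy graphs to compare the exact-match test used above with a looser subset reading, in which every followed user follows back, every replied-to user replies back, and replies go only to followed users. The subset reading and the helper `neighbours` are assumptions made for illustration, not the notebook's method.

```python
import networkx as nx

# Toy social network: a and b follow each other; c follows a without a follow-back.
social = nx.DiGraph([("a", "b"), ("b", "a"), ("c", "a")])
# Toy reply network: a and b reply to each other; c is absent from it.
reply = nx.DiGraph([("a", "b"), ("b", "a")])


def neighbours(graph, user):
    """Return (successors, predecessors) as sets; both empty if the user is absent."""
    if user not in graph:
        return set(), set()
    return set(graph.successors(user)), set(graph.predecessors(user))


strict, relaxed = set(), set()
for user in social:
    following, followers = neighbours(social, user)
    replied_to, replied_by = neighbours(reply, user)

    # Exact-match reading: the two sets must be identical in each network.
    if following == followers and replied_to == replied_by:
        strict.add(user)

    # Subset reading (assumption): extra, unreciprocated followers are allowed.
    if (following <= followers
            and replied_to <= replied_by
            and replied_to <= following):
        relaxed.add(user)

print(sorted(strict), sorted(relaxed))
```

User `a` passes only the subset reading, because `c` follows `a` without being followed back. The `neighbours` helper also avoids the error that `successors` raises for a user missing from one of the graphs.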
