# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [1]:
#pip install networkx


### Import Python packages

In [2]:
import networkx as nx
import operator
from scipy import stats

---

### Task 1 of 1

Examine the Graph Modelling Language (gml) files "socialmedia_cmt224_reply_network.gml" (reply network) and "socialmedia_cmt224_social_network.gml" (social network) which represent Twitter data between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users).

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Edges are weighted with the weight representing the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.

Using these networks, answer the following questions:

##### Q1. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [3]:
#CODE:
# Load both directed networks from their GML files
followGraph = nx.read_gml("socialmedia_cmt224_social_network.gml")
replyGraph = nx.read_gml("socialmedia_cmt224_reply_network.gml")
# Density: fraction of all possible directed edges that are present
followDensity = nx.density(followGraph)
replyDensity = nx.density(replyGraph)
# Connected groups of users, counted here as strongly connected components
followStronglyConnected = nx.number_strongly_connected_components(followGraph)
replyStronglyConnected = nx.number_strongly_connected_components(replyGraph)
print("social network graph Density is %0.5f" % followDensity)
print("reply network graph Density is %0.5f" % replyDensity)
print("social network graph connected group of users(using strongly connected components function) " + str(followStronglyConnected))
print("reply network graph connected group of users(using strongly connected components function) " + str(replyStronglyConnected))

social network graph Density is 0.00060
reply network graph Density is 0.00002
social network graph connected group of users(using strongly connected components function) 9146
reply network graph connected group of users(using strongly connected components function) 36132
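
For intuition about the two measures used here: the density of a directed graph with n nodes and m edges is m / (n(n - 1)), and "connected groups" could also be read as weakly connected components (groups that are connected once edge direction is ignored), which NetworkX exposes as `nx.number_weakly_connected_components`. The sketch below is a plain-Python illustration on a made-up edge list, not the analysis that was run above; the helper names `directed_density` and `count_weak_components` are hypothetical.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of the n * (n - 1) possible directed edges that exist."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


def count_weak_components(nodes, edges):
    """Count groups of nodes that are connected when direction is ignored."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the representative, shortening the path on the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two groups
    return len({find(node) for node in nodes})


# Toy data (made up): a -> b -> c, d -> e, and an isolated node f
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]

toy_density = directed_density(len(toy_nodes), len(toy_edges))
toy_components = count_weak_components(toy_nodes, toy_edges)
print("density: %.3f" % toy_density)
print("weakly connected components:", toy_components)
```

In this toy graph no two nodes can reach each other in both directions, so each of the six nodes would be its own strongly connected component. This suggests why a strongly connected count on a sparse directed network can come close to the number of nodes.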


##### Q2. Do the 25 users with highest number of followers also have the highest number of repliers to their Tweets?

In [4]:
#CODE:
# Top 25 users by number of followers (in-degree in the social network)
topFollowed = [i[0] for i in sorted(followGraph.in_degree(), reverse=True, key=operator.itemgetter(1))[:25]]
# Top 25 users by weighted in-degree in the reply network. The weight name must match the
# edge attribute in the GML file; NetworkX counts an edge lacking that attribute as 1.
topRepliedToTweets = [i[0] for i in sorted(replyGraph.in_degree(weight='occurences'), reverse=True, key=operator.itemgetter(1))[:25]]
print("Top 25 followed users\n", topFollowed)
print("Top 25 users with highest repliers\n", topRepliedToTweets)
# Users present in both top-25 lists
overlap = list(set(topFollowed).intersection(topRepliedToTweets))
print("Top 25 followed users who are also among the top 25 most replied-to users\n", overlap, ' is: ', len(overlap))


Top 25 followed users
 ['88', '677', '1503', '220', '352', '206', '317', '301', '3549', '1062', '1988', '138', '383', '979', '465', '8', '960', '2417', '965', '1274', '15', '3419', '407', '396', '205']
Top 25 users with highest repliers
 ['88', '677', '220', '3549', '317', '349', '7690', '1988', '16460', '3369', '2177', '6940', '2014', '4368', '327', '9021', '4259', '9964', '3998', '216', '12281', '5245', '1880', '13808', '2280']
Top 25 followed users who are also among the top 25 most replied-to users
 ['1988', '3549', '220', '677', '317', '88']  is:  6
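
"Number of repliers" and "number of replies" can rank users differently: the first counts distinct users who replied (unweighted in-degree), the second sums the edge weights (weighted in-degree). The following is a minimal plain-Python sketch of that difference on made-up data, assuming each (replier, author) pair appears once, as in a simple directed graph. The names `toy_replies` and `top_k` are hypothetical.

```python
from collections import Counter

# Toy weighted reply edges (made up): (replier, author, number_of_replies)
toy_replies = [
    ("u1", "x", 5),
    ("u2", "y", 1),
    ("u3", "y", 1),
    ("u4", "y", 1),
    ("u1", "z", 1),
]

repliers = Counter()  # distinct users replying to each author (unweighted in-degree)
replies = Counter()   # total replies received by each author (weighted in-degree)
for replier, author, weight in toy_replies:
    repliers[author] += 1
    replies[author] += weight


def top_k(counter, k):
    """Return the k keys with the largest counts."""
    return [key for key, _ in counter.most_common(k)]


top_by_repliers = top_k(repliers, 1)
top_by_replies = top_k(replies, 1)
print("most repliers:", top_by_repliers)
print("most replies:", top_by_replies)
```

Here one author has the most distinct repliers while another receives the most replies overall, so the choice of measure can change which users enter a top-25 list.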


##### Q3. To what extent does the number of followers a user has in the social network correlate with the number of users that they have replied to in the reply network?

In [5]:
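#CODE:
followGraphNodes = followGraph.nodes()
# Followers: in-degree in the social network; users replied to: out-degree in the reply network.
# Both lists follow the order of followGraphNodes, which relies on every node being present in both networks.
numOfFollowers = [degree for _, degree in followGraph.in_degree(followGraphNodes)]
numRepliedTo = [degree for _, degree in replyGraph.out_degree(followGraphNodes)]
r, p = stats.pearsonr(numOfFollowers, numRepliedTo)
print("pearson r: %.3f"%r, "p: %.3f"%p)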

pearson r: -0.042 p: 0.000


Approach:
For every node in the social network, we take its in-degree in the social network (number of followers) and its out-degree in the reply network (number of users replied to).
A Pearson correlation test is then carried out on the two lists.
Answer:
The correlation between the number of followers a user has and the number of users they have replied to is negative and statistically significant, but very weak: r = -0.042, with p printed as 0.000 (that is, p < 0.0005).
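
Degree counts in social networks are often heavily skewed, so a Pearson coefficient can be dominated by a handful of very popular accounts. One possible cross-check, not part of the analysis above, is a rank-based (Spearman) correlation, which SciPy provides as `scipy.stats.spearmanr`. The sketch below spells out both coefficients in plain Python on made-up data; the helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Toy data (made up): one very popular account dominates the follower counts
toy_followers = [1, 2, 3, 4, 1000]
toy_replied_to = [1, 2, 3, 4, 0]

toy_r = pearson_r(toy_followers, toy_replied_to)
toy_rho = spearman_rho(toy_followers, toy_replied_to)
print("pearson r: %.3f" % toy_r)
print("spearman rho: %.3f" % toy_rho)
```

On this toy data the single large account pulls the Pearson coefficient strongly negative, while the rank-based coefficient is zero. Whether the real networks behave similarly would need to be checked on the actual data.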

##### Q4. Do users typically ONLY reply to Tweets, are ONLY replied to, or BOTH?

In [6]:
#CODE:
replyGraphNodes = replyGraph.nodes()
outOnlyUsers, inOnlyUsers, normalUser = [], [], []
for user in replyGraphNodes:
    inDegree = replyGraph.in_degree(user)
    outDegree = replyGraph.out_degree(user)
    if inDegree > 0 and outDegree > 0:
        normalUser.append(user)    # replies and is replied to
    elif inDegree > 0:
        inOnlyUsers.append(user)   # is replied to but never replies
    elif outDegree > 0:
        outOnlyUsers.append(user)  # replies but is never replied to
    else:
        print("error")             # isolated node: no reply edges at all
print("Number of users that only reply to tweets", len(outOnlyUsers))
print("Number of users that are only replied to", len(inOnlyUsers))
print("Number of users who both reply and are replied to(both)", len(normalUser))



Number of users that only reply to tweets 20332
Number of users that are only replied to 11663
Number of users who both reply and are replied to(both) 6923
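
Taken as shares of the 38918 users counted above, roughly 52% only reply, 30% are only replied to, and 18% do both. The sketch below shows an equivalent set-based way to make the same three-way split, followed by the share arithmetic on the reported counts. The toy edge list is made up and the variable names are hypothetical.

```python
# Toy reply edges (made up): (replier, author)
toy_reply_edges = [("a", "b"), ("a", "c"), ("b", "a"), ("d", "c")]

sources = {replier for replier, _ in toy_reply_edges}  # users who reply
targets = {author for _, author in toy_reply_edges}    # users who are replied to

only_reply = sources - targets       # reply but are never replied to
only_replied_to = targets - sources  # are replied to but never reply
both = sources & targets             # do both

# Shares computed from the counts reported in the output above
reported = {"only reply": 20332, "only replied to": 11663, "both": 6923}
total = sum(reported.values())
shares = {label: count / total for label, count in reported.items()}
for label, share in shares.items():
    print("%s: %.1f%%" % (label, 100 * share))
```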


##### Q5. Do the majority of users that ONLY reply to Tweets, ONLY do so to those users they are following?

In [7]:
#CODE:
# Reply-only users who have replied to at least one user they do not follow
other = set()
for node in outOnlyUsers:
    for replier, target in replyGraph.out_edges(node):
        if not followGraph.has_edge(replier, target):
            other.add(node)
            break  # one such reply is enough to exclude this user

print("Number of reply-only users who only reply to users they follow: ", len(outOnlyUsers) - len(other), " out of: ", len(outOnlyUsers))

Number of reply-only users who only reply to users they follow:  17281  out of:  20332
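
The check above amounts to a subset test: a user qualifies when the set of users they replied to is contained in the set of users they follow. With the reported counts, 17281 of 20332 reply-only users (about 85%) qualify. A minimal sketch of the subset test on made-up data follows; the dictionaries and their names are hypothetical.

```python
# Toy data (made up): who each reply-only user follows and replies to
toy_follows = {"a": {"x", "y"}, "b": {"x"}, "c": set()}
toy_reply_targets = {"a": {"x"}, "b": {"x", "z"}, "c": {"y"}}

# A user qualifies when everyone they replied to is someone they follow
replies_within_follows = [
    user for user, replied in toy_reply_targets.items()
    if replied <= toy_follows.get(user, set())
]
toy_share = len(replies_within_follows) / len(toy_reply_targets)
print(replies_within_follows, "%.0f%%" % (100 * toy_share))

# Share implied by the counts reported in the output above
reported_share = 17281 / 20332
print("reported share: %.1f%%" % (100 * reported_share))
```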


##### Q6. How many users have ONLY mutual following connections (i.e., every user they follow also follows them) AND ONLY mutual reply connections with these SAME users?

In [8]:
#CODE:
# Per-node reciprocity equals 1 only when every follow edge touching the user,
# incoming or outgoing, is reciprocated.
onlyMutualFollowing = {k: v for k, v in nx.reciprocity(followGraph, followGraph.nodes()).items() if v == 1}
count = 0

for node in onlyMutualFollowing:
    # Users that this node follows
    follows = [target for _, target in followGraph.edges(node)]
    # True only if the node and every user it follows have replied to each other
    isMutual = all(replyGraph.has_edge(node, follow) and replyGraph.has_edge(follow, node) for follow in follows)
    if isMutual:
        count += 1

print('The number of users with only mutual following connections: ', len(onlyMutualFollowing))
print('The number of users with only mutual following and mutual replies with those same users: ', count)

The number of users with only mutual following connections:  1196
The number of users with only mutual following and mutual replies with those same users:  189
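
"Only mutual following connections" can be read in more than one way. Read literally, the question only requires that everyone a user follows also follows them back; a per-node reciprocity of 1 is stricter, because it also requires the user to follow back every one of their own followers. The sketch below contrasts the two readings on made-up data. Which reading is intended is an assumption either way, and the names used are hypothetical.

```python
# Toy follow edges (made up): (follower, followed)
toy_follow_edges = [("a", "b"), ("b", "a"), ("c", "a")]

follows = {}    # user -> set of users they follow
followers = {}  # user -> set of users following them
for follower, followed in toy_follow_edges:
    follows.setdefault(follower, set()).add(followed)
    followers.setdefault(followed, set()).add(follower)

users = set(follows) | set(followers)

# Literal reading: every user they follow also follows them.
# Assumption: users who follow no one are excluded rather than counted vacuously.
followed_back = {
    u for u in users
    if follows.get(u) and follows[u] <= followers.get(u, set())
}

# Stricter reading (per-node reciprocity of 1): follows and followers coincide
fully_mutual = {
    u for u in users
    if follows.get(u) and follows[u] == followers.get(u, set())
}

print(sorted(followed_back))
print(sorted(fully_mutual))
```

In the toy data, user `a` is followed back by everyone they follow but also has a follower they do not follow back, so `a` passes the literal reading and fails the stricter one.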
