# Social Graph Final project 
> An analysis and visualization for security people using Twitter data.

- toc: true 
- badges: false
- author: Peter Bom Jakobsen & Søren Fritzbøger & Yucheng Ren 
- comments: false
- categories: [data_analysis, network]

> Important: The dataset we used to create the network comes from Twitter. You can view and download it from [here](https://raw.githubusercontent.com/Glorforidor/SocialGraphAssignments/master/twitter_data.zip). The Explainer [notebook]().

As we all know, people working within the field of cyber security are "nerds" who do not have any friends. Or at least that is the common stereotype. You know, greasy bearded men who only care about bits and bytes, and have not seen daylight since they went into puberty and hid in their basements. Furthermore, these "nerds" are split into two communities, offensive and defensive security, and they DO NOT like each other. Right?

The following story will show you that you are actually wrong. That people working within the cyber security realm are real people with real friends (Yes, Twitter friends are real friends. Just like Facebook friends 😉). Furthermore, they are probably not as excluded as you think, and they might even have more friends than you (Just kidding, nobody has more friends than you 😉).

---

In [2]:
# hide
# Standard libraries.
import collections
import csv
from functools import wraps
import math
import os
import os.path

# Third party libraries.
from fa2 import ForceAtlas2
import networkx as nx
import nltk
from nltk import word_tokenize
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import texttable
from wordcloud import WordCloud

In [3]:
# hide
# Filenames of all the data files that make up our dataset.
tweets_filename = "tweets.csv"
id_to_screen_name_filename = "id_to_screen_name.csv"
user_and_friends_filename = "user_and_friends_ids.csv"
user_to_friend_filename = "user_to_friend_screen_names.csv"
bios_filename = "bios.csv"
sentiment_tweets_filename = "sentiment_tweets.csv"
communities_filename = "communities.csv"
top_5_communities_filename = "top_5_communities.csv"

# The saved graph - it is an undirected graph.
graph_filename = "security_network.gml"

## Dataset
To prove to you that cyber security people are real people with real friends, we decided to stalk them on Twitter. By using the following search query, we found all the popular cyber security tweets:

`(infosec OR cve OR cybersec OR cybersecurity OR ransomware) -filter:retweets min_faves:10`

And yes, I know you're thinking "10 LiKes iS nOt PoPuLaR, I gEt 1000 oN mY InStA pOsTs", but 99% of those are bots that you paid for, so calm down "influencer" (😉).
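To make the query a bit more concrete, here is a rough sketch of the same three conditions applied locally to a few made-up tweets. The field names (`text`, `is_retweet`, `favorite_count`) and the sample tweets are assumptions for illustration only, and the keyword matching is a simplification of whatever Twitter's search actually does.

```python
import re

# Keywords and like threshold taken from the search query above.
KEYWORDS = {"infosec", "cve", "cybersec", "cybersecurity", "ransomware"}
MIN_FAVES = 10


def matches_query(tweet):
    """Return True if a (hypothetical) tweet dict satisfies the query's conditions."""
    words = set(re.findall(r"[a-z0-9]+", tweet["text"].lower()))
    return (
        bool(words & KEYWORDS)                   # at least one keyword (OR)
        and not tweet["is_retweet"]              # -filter:retweets
        and tweet["favorite_count"] >= MIN_FAVES  # min_faves:10
    )


# Made-up sample tweets, not taken from the dataset.
sample_tweets = [
    {"text": "New #ransomware campaign spotted", "is_retweet": False, "favorite_count": 42},
    {"text": "infosec is fun", "is_retweet": True, "favorite_count": 100},
    {"text": "CVE-2020-0001 patched", "is_retweet": False, "favorite_count": 3},
    {"text": "Lovely weather today", "is_retweet": False, "favorite_count": 500},
]

kept = [tweet for tweet in sample_tweets if matches_query(tweet)]
print(len(kept))
```

Only the first sample tweet survives: the second is a retweet, the third has too few likes, and the fourth is not about security at all.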

From this very beautiful query we get a long list of tweets from a long list of cyber security people. Furthermore, we get mentions of other people supposedly related to these cyber security people. From this list of important cyber security people we can now extract a looooot of data such as:

* Screen names.
* Friends' ids and their screen names.
* Timeline tweets - 20 unfiltered tweets.
* User descriptions and locations.

## Analysis

And now we show you the revolutionary, myth debunking, science creating, unexplainably beautiful network of security people:

![](imgs/network.png)

As you can clearly see there are 2050 nodes corresponding to Twitter profiles and 18040 edges corresponding to friendships. So yes, "nerds" do have friends (😉).
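For the curious, the node and edge counts can be sketched with nothing but the standard library. The `(user, friend)` pairs below are made up, and `build_undirected_graph` is a hypothetical helper, not the code that produced the figure. For a graph already loaded with networkx, `g.number_of_nodes()` and `g.number_of_edges()` report the same two numbers.

```python
# Hypothetical (user, friend) pairs; the real ones live in the dataset's CSV files.
friend_pairs = [
    ("alice", "bob"),
    ("bob", "alice"),  # the same friendship seen from the other side
    ("alice", "carol"),
    ("dave", "carol"),
]


def build_undirected_graph(pairs):
    """Return an adjacency mapping {node: set of neighbours} for an undirected graph."""
    adjacency = {}
    for user, friend in pairs:
        if user == friend:
            continue  # ignore self-loops
        adjacency.setdefault(user, set()).add(friend)
        adjacency.setdefault(friend, set()).add(user)
    return adjacency


adjacency = build_undirected_graph(friend_pairs)
number_of_nodes = len(adjacency)
# Every undirected edge is stored twice, once per endpoint, so halve the degree sum.
number_of_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
print(number_of_nodes, number_of_edges)
```

Storing neighbours in sets means a friendship listed from both sides still counts as one edge, which is what an undirected graph needs.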

In fact, nerds probably have more friends than you do. In our dataset consisting of ONLY 2050 security people, most have at least 12 friends, and some have even more:

![](imgs/degree_distribution.png)

As you can see, the number of friends in our network clearly follows a power law distribution. Even in the world of "nerds", there are popular people. Now that we have debunked this very real and very serious myth that cyber security people have no friends (😉), let us move on to the next myth. That security people are split into two groups, where they either protect the world from evil forces, or they are evil forces (You know, split into defensive and offensive security 😉).
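A degree distribution like the one plotted above is just a count of how many profiles have each number of friends. Here is a minimal sketch with made-up degrees; `degree_distribution` is a hypothetical helper, and in the notebook the degrees would come from the loaded graph instead.

```python
import collections

# Made-up degrees (number of friends per profile), not the real data.
degrees = [1, 1, 1, 1, 2, 2, 3, 5, 12]


def degree_distribution(degrees):
    """Return {degree: number of nodes with that degree}, sorted by degree."""
    counts = collections.Counter(degrees)
    return dict(sorted(counts.items()))


distribution = degree_distribution(degrees)
for degree, count in distribution.items():
    # A crude text histogram: one '#' per profile with this degree.
    print(f"{degree:>3} friends: {'#' * count}")
```

Many profiles with few friends and a handful with many is the heavy-tailed shape the plot shows. Plotting these counts on log-log axes is the usual visual check, although a roughly straight line there is a hint rather than proof of a power law.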

To explore this, why not discover communities within our network? Surely this will show that only two communities exist, and we can finally be right about something. To do this we have created this awesome histogram that should only show two communities (That is, if we are right, which we almost never are 😉).

In [4]:
# hide
g = nx.read_gml(graph_filename)
node_sizes = [d for __, d in g.degree]

![](imgs/community_size.png)

So clearly, we were wrong again. Within our little Twitter sphere a bunch of communities exist. Most of them are rather small, but a few larger ones exist. Now that we have singlehandedly been wrong about every assumption we have made, let's just stop making any more assumptions and explore the dataset instead.
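Once communities have been detected (this post does not say which algorithm was used, so that step is left out here), the histogram and the "few larger ones" come down to measuring set sizes. A small sketch with made-up communities, where `largest_communities` is a hypothetical helper:

```python
# Made-up communities: each one is a set of screen names.
all_communities = [
    {"a", "b", "c", "d"},
    {"e"},
    {"f", "g", "h"},
    {"i", "j"},
    {"k"},
    {"l", "m", "n", "o", "p"},
    {"q"},
]


def largest_communities(communities, n=5):
    """Return the n largest communities, biggest first."""
    return sorted(communities, key=len, reverse=True)[:n]


top_communities = largest_communities(all_communities)
sizes = [len(community) for community in top_communities]
print(sizes)
```

The list of all sizes is what a community-size histogram is drawn from, and the first five entries are the kind of "top 5" the rest of the analysis looks at.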

The following word clouds show that our communities are spread widely within the cyber security domain:

![](imgs/wordcloud1.png)

We have a community that cares highly about social justice and identity (There goes our assumption about cyber security people being greasy men in a basement 😢), a community related to malware and antivirus, one related to topics such as digital transformation and fintech, one clearly devoted to security within the European Union and one that is probably just profiles related to sharing news about cyber security.

Well, that was boring. To spice things up once again, let us make an assumption! We assume that all of these communities are populated with people living in the USA (🇺🇸), except for community 4. As we all know, people from 'MURICA think that the European Union is a labor union for European people (And we all know that Americans hate labor unions (1) (2) (3)).

![](imgs/wordcloud2.png)

And as the clever reader can see, we were right. All communities are dominated by people living within the United States of America, except for community 4 that is dominated by people living in Europe.

Now, we know what you are thinking. "This whole analysis is wrong because people who tweet about cyber security are robots guided by a sophisticated AI created by Elon Musk". To disprove this, we show you the following homemade table of sentiment values created from the latest 20 tweets of each member of the communities (4).

In [16]:
# hide
url = "https://ndownloader.figstatic.com/files/360592"
words_of_happiness = pd.read_csv(url, delimiter="\t", skiprows=3)


def compute_average_sentiment(tokens):
    """compute_average_sentiment returns the average sentiment value of the tokens.
    
    Each token in tokens must be in lowercase.
    """
    sentiment = 0.0
    if not len(tokens):
        return sentiment

    avg = np.nan_to_num(words_of_happiness[words_of_happiness["word"].isin(tokens)]["happiness_average"].mean())
    return avg


communities = {i: set(members) for i, members in enumerate(top_5_largest_communites)}
text_of_communities = collections.defaultdict(str)
with open(sentiment_tweets_filename, newline="") as f:
    csv_reader = csv.DictReader(f)
    for row in csv_reader:
        for i, members in communities.items():
            if row["screen_name"] in members:
                text_of_communities[i] += f" {row['tweets']}"

sentiment_of_communities = {k: compute_average_sentiment(bag_of_words(v)) for k, v in text_of_communities.items()}

In [17]:
# hide_input
table = texttable.Texttable()
table.set_cols_align(["l", "r"])
table.set_cols_valign(["t", "b"])
table.set_precision(2)
table.add_row(["Community", "Sentiment value"])

for com, sentiment in sorted(sentiment_of_communities.items()):
    table.add_row([com+1, sentiment])

print(table.draw())

+-----------+-----------------+
| Community | Sentiment value |
+-----------+-----------------+
| 1         |            5.46 |
+-----------+-----------------+
| 2         |            5.44 |
+-----------+-----------------+
| 3         |            5.52 |
+-----------+-----------------+
| 4         |            5.51 |
+-----------+-----------------+
| 5         |            5.46 |
+-----------+-----------------+


Aaaaand, we just received an email from "Nole Ksum" (5) telling us to stop this analysis, because he is not happy! (Neither are the communities. They are actually pretty neutral 😉)
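The sentiment values in the table boil down to looking up words in a happiness word list and averaging their scores. Here is a minimal sketch of that idea. The scores are made up, and `simple_bag_of_words` is a hypothetical stand-in for the tokenizer used in the notebook, whose definition is not shown here. As the notebook's `isin` lookup appears to do, each matched word counts once no matter how often it is tweeted.

```python
import re

# Made-up happiness scores; the real ones come from the word list loaded in the notebook.
happiness = {"happy": 8.0, "the": 5.0, "attack": 2.0}


def simple_bag_of_words(text):
    """Lowercase the text and return its alphabetic tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def average_sentiment(tokens, scores):
    """Average the scores of the distinct tokens that have a score."""
    matched = {word for word in tokens if word in scores}
    if not matched:
        return 0.0  # nothing to score
    return sum(scores[word] for word in matched) / len(matched)


tokens = simple_bag_of_words("Happy to report the attack failed. Happy days!")
sentiment = average_sentiment(tokens, happiness)
print(sentiment)
```

Because common filler words tend to sit near the middle of such a word list, long texts often average out close to the middle too, which fits with all five communities landing on nearly the same value.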

In [8]:
# hide_input
# The top 10 Twitter profiles with the most friends - in network terms, the highest degree.
table = texttable.Texttable()
table.set_cols_align(["l", "r"])
table.set_cols_valign(["t", "b"])
table.add_row(["Name", "Friends"])

for screen_name, degree in sorted(g.degree, key=lambda x: x[1], reverse=True)[:10]:
    table.add_row([screen_name, degree])

print(table.draw())

+------------------+---------+
| Name             | Friends |
+------------------+---------+
| @HackingDave     |     216 |
+------------------+---------+
| @AlyssaM_InfoSec |     198 |
+------------------+---------+
| @RayRedacted     |     193 |
+------------------+---------+
| @NicoleBeckwith  |     178 |
+------------------+---------+
| @DfirDiva        |     170 |
+------------------+---------+
| @sherrod_im      |     161 |
+------------------+---------+
| @cybergeekgirl   |     161 |
+------------------+---------+
| @gabsmashh       |     160 |
+------------------+---------+
| @LisaForteUK     |     158 |
+------------------+---------+
| @UK_Daniel_Card  |     154 |
+------------------+---------+


## References

(1) https://www.vox.com/identities/2019/9/30/20891314/elon-musk-tesla-labor-violation-nlrb

(2) https://www.theguardian.com/technology/2020/dec/02/google-labor-laws-nlrb-surveillance-worker-firing

(3) https://www.csmonitor.com/USA/Politics/2020/1106/Uber-Lyft-gig-companies-win-fight-against-labor-unions

## Notes
(4) The 20 latest tweets were gathered on the 2nd of December.

(5) "Nole Ksum" is Elon Musk kinda backwards. We are very surprised you did not get this reference.