# Lab 2: Networking Data

What we will do:

1. Create an edge list for a mention and a hashtag network based on the data collected in Lab 1
2. Explore both networks in Gephi

Again, there will be two versions of this so-called Jupyter Notebook for you to follow along:

* One already filled out for you, in case you would rather pay attention to things other than typing, or want to alter the code to try new things.
* Another one with the code 'cells' emptied, so you can practice your Python typing skills alongside the lecturer (or maybe sometimes find even better solutions to the given problems).

Secret tip: If you want to try this at home, GitHub Copilot (free for students), ChatGPT and Bing Chat have become pretty good at generating code for you. However, you still need to be able to check that the code they produce actually does what you want it to do. So you still have to learn some Python.

But now let's start.

## Read in the data

Make sure you still have the `leo_tweets.csv` file that you were asked to download last time, and upload it if it is missing. Note that the cell below reads it from `../leo_tweets.csv`, that is, from the folder one level above this notebook; adjust the path if you keep the file somewhere else. If you have lost the file, ask the lecturer for it.

In [100]:
# import the necessary packages
import pandas as pd

# load the dataset
df = pd.read_csv("../leo_tweets.csv")

# show the first 5 rows of the dataset
df.head()

Unnamed: 0,query,id,timestamp_utc,local_time,user_screen_name,text,possibly_sensitive,retweet_count,like_count,reply_count,...,media_urls,media_files,media_types,media_alt_texts,mentioned_names,mentioned_ids,hashtags,intervention_type,intervention_text,intervention_url
0,ukraine Germany (tank OR tanks OR leopard) AND...,1639628403018981381,1679752848,2023-03-25T14:00:48,RabiaSalem02,U.S. Faces Timeline Issues Over Delivery Of Ab...,0.0,0,0,0,...,,,,,,,america|bakhmut|canada|germany|ukraine,,,
1,ukraine Germany (tank OR tanks OR leopard) AND...,1639574976050208769,1679740110,2023-03-25T10:28:30,SantanuB01,"How is #USA a superpower, while others are not...",0.0,0,2,1,...,https://pbs.twimg.com/media/FsDwwIuX0AIRgg6.jpg,1639574976050208769_FsDwwIuX0AIRgg6.jpg,photo,,,,germany|ukraine️|usa|usarmy,,,
2,ukraine Germany (tank OR tanks OR leopard) AND...,1639525124469669889,1679728225,2023-03-25T07:10:25,RabiaSalem02,U.S. Faces Timeline Issues Over Delivery Of Ab...,0.0,0,0,0,...,,,,,,,america|bakhmut|canada|germany|ukraine,,,
3,ukraine Germany (tank OR tanks OR leopard) AND...,1639370747498954752,1679691418,2023-03-24T20:56:58,yusr35144430,Brutal Attack !! Ukraine Sends dozen Bayraktar...,0.0,0,0,0,...,,,,,,,bachmut|canada|germany|russia|ukraine,,,
4,ukraine Germany (tank OR tanks OR leopard) AND...,1639347028110057472,1679685763,2023-03-24T19:22:43,DEFENSEEXPRESS,Germany and Finland Deliver Engineering Tanks ...,0.0,0,7,0,...,https://pbs.twimg.com/media/FsAhbrGXoAMWcbr.jpg,1639347028110057472_FsAhbrGXoAMWcbr.jpg,photo,,,,,,,


In [101]:
# list the columns of the dataset
df.columns

Index(['query', 'id', 'timestamp_utc', 'local_time', 'user_screen_name',
       'text', 'possibly_sensitive', 'retweet_count', 'like_count',
       'reply_count', 'impression_count', 'lang', 'to_username', 'to_userid',
       'to_tweetid', 'source_name', 'source_url', 'user_location', 'lat',
       'lng', 'user_id', 'user_name', 'user_verified', 'user_description',
       'user_url', 'user_image', 'user_tweets', 'user_followers',
       'user_friends', 'user_likes', 'user_lists', 'user_created_at',
       'user_timestamp_utc', 'collected_via', 'match_query', 'retweeted_id',
       'retweeted_user', 'retweeted_user_id', 'retweeted_timestamp_utc',
       'quoted_id', 'quoted_user', 'quoted_user_id', 'quoted_timestamp_utc',
       'collection_time', 'url', 'place_country_code', 'place_name',
       'place_type', 'place_coordinates', 'links', 'domains', 'media_urls',
       'media_files', 'media_types', 'media_alt_texts', 'mentioned_names',
       'mentioned_ids', 'hashtags', 'intervention

## Create Mention Network

Now we want to create a so-called edge list for a mention network. This means one column contains the accounts mentioning other accounts (the source) and the other column contains the accounts being mentioned (the target).
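
To make the idea concrete, here is a minimal sketch of what such an edge list looks like. The account names `alice`, `bob` and `carol` are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: alice mentions bob and carol, bob mentions alice
toy_edges = pd.DataFrame(
    [("alice", "bob"), ("alice", "carol"), ("bob", "alice")],
    columns=["Source", "Target"],
)
print(toy_edges)

# Every account that appears in either column becomes a node in the network
toy_nodes = set(toy_edges["Source"]) | set(toy_edges["Target"])
print(toy_nodes)
```

Note that a mention network is directed: `alice -> bob` and `bob -> alice` are two different edges.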

In [102]:
# filter the dataset to possibly relevant columns
mentions = df[["user_screen_name", "to_username", "text", "mentioned_names"]]
mentions.head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
0,RabiaSalem02,,U.S. Faces Timeline Issues Over Delivery Of Ab...,
1,SantanuB01,,"How is #USA a superpower, while others are not...",
2,RabiaSalem02,,U.S. Faces Timeline Issues Over Delivery Of Ab...,
3,yusr35144430,,Brutal Attack !! Ukraine Sends dozen Bayraktar...,
4,DEFENSEEXPRESS,,Germany and Finland Deliver Engineering Tanks ...,


In [103]:
# inspect rows where the to_username is not null
mentions[mentions["to_username"].notnull()].head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,20mmmg151|oryxspioenkop
12,RubyTuesday828,kukashnjudster,@kukashnjudster @Tendar Did Germany finally de...,kukashnjudster|tendar
13,bernielomax,bernielomax,"@kadams190 Also, if you didn't notice. The who...",kadams190
14,TyIertheGiant,chris_pyak,@chris_pyak @MarkHertling @McFaul @general_ben...,chris_pyak|general_ben|markhertling|mcfaul|pmb...
15,tomfinnautor,Koti_Wernyhora,@Koti_Wernyhora @MarinaTrusch @Dan17875040 @Ge...,dan17875040|georgef31318968|koti_wernyhora|kul...


In [104]:
# inspect rows where the mentioned_names is not null
mentions[mentions["mentioned_names"].notnull()].head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,afdimbundestag|linksfraktion|swagenknecht
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,20mmmg151|oryxspioenkop
12,RubyTuesday828,kukashnjudster,@kukashnjudster @Tendar Did Germany finally de...,kukashnjudster|tendar
13,bernielomax,bernielomax,"@kadams190 Also, if you didn't notice. The who...",kadams190
14,TyIertheGiant,chris_pyak,@chris_pyak @MarkHertling @McFaul @general_ben...,chris_pyak|general_ben|markhertling|mcfaul|pmb...


In [105]:
# mentioned_names seems to be the relevant column
# It is a pipe-separated list of the users mentioned in the tweet
# Let's filter the dataset to only include rows where mentioned_names is not null
# (.copy() avoids a SettingWithCopyWarning when we modify the column in the next step)
mentions = mentions[mentions["mentioned_names"].notnull()].copy()
mentions.head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,afdimbundestag|linksfraktion|swagenknecht
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,20mmmg151|oryxspioenkop
12,RubyTuesday828,kukashnjudster,@kukashnjudster @Tendar Did Germany finally de...,kukashnjudster|tendar
13,bernielomax,bernielomax,"@kadams190 Also, if you didn't notice. The who...",kadams190
14,TyIertheGiant,chris_pyak,@chris_pyak @MarkHertling @McFaul @general_ben...,chris_pyak|general_ben|markhertling|mcfaul|pmb...


In [106]:
# Now we have to split the mentioned_names column into multiple rows
# We can do this with the str.split and explode methods
# First we split the mentioned_names column by the pipe character
mentions["mentioned_names"] = mentions["mentioned_names"].str.split("|")
mentions.head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,"[afdimbundestag, linksfraktion, swagenknecht]"
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,"[20mmmg151, oryxspioenkop]"
12,RubyTuesday828,kukashnjudster,@kukashnjudster @Tendar Did Germany finally de...,"[kukashnjudster, tendar]"
13,bernielomax,bernielomax,"@kadams190 Also, if you didn't notice. The who...",[kadams190]
14,TyIertheGiant,chris_pyak,@chris_pyak @MarkHertling @McFaul @general_ben...,"[chris_pyak, general_ben, markhertling, mcfaul..."


In [107]:
# Now we explode the mentioned_names column
mentions = mentions.explode("mentioned_names")
mentions.head()

Unnamed: 0,user_screen_name,to_username,text,mentioned_names
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,afdimbundestag
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,linksfraktion
6,Michalpiotrjace,,............The last reserves of equipmemnt of...,swagenknecht
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,20mmmg151
7,Skeiron6,20mmMG151,@20mmMG151 @oryxspioenkop The German governmen...,oryxspioenkop


In [108]:
# we no longer need the to_username and text columns, so we keep only the other two
mentions = mentions[["user_screen_name", "mentioned_names"]]
mentions.head()

Unnamed: 0,user_screen_name,mentioned_names
6,Michalpiotrjace,afdimbundestag
6,Michalpiotrjace,linksfraktion
6,Michalpiotrjace,swagenknecht
7,Skeiron6,20mmmg151
7,Skeiron6,oryxspioenkop
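
The outputs above suggest that `mentioned_names` is stored in lowercase (`20mmmg151`), while `user_screen_name` keeps its original capitalisation (`Skeiron6`). Since Twitter handles are case-insensitive, an account that both mentions others and is mentioned itself could otherwise end up as two separate nodes. A possible fix, sketched here on hypothetical toy data, is to lowercase the source column as well:

```python
import pandas as pd

# Hypothetical toy edge list mirroring the casing pattern seen in the output above
toy = pd.DataFrame({
    "user_screen_name": ["Skeiron6", "Alice"],
    "mentioned_names": ["alice", "skeiron6"],
})

# Lowercase the source column so both columns use the same spelling
toy["user_screen_name"] = toy["user_screen_name"].str.lower()

# The set of distinct accounts now has two entries instead of four
nodes = set(toy["user_screen_name"]) | set(toy["mentioned_names"])
print(nodes)
```

Whether you want this depends on your analysis: lowercasing loses the display spelling of the handles.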


In [109]:
# are there duplicate rows?
mentions.duplicated().sum()

166

In [110]:
# Duplicate rows mean that the same account mentioned the same target more than once,
# so we will have to take care of edge weights later on
# Let's rename the columns to Source and Target and export to a csv file for Gephi
mentions.columns = ["Source", "Target"]
mentions.to_csv("mentions.csv", index=False)
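
The duplicate rows counted above are repeated mentions between the same pair of accounts. One possible way to turn them into edge weights before exporting is sketched below on hypothetical toy data, assuming the two-column `Source`/`Target` layout from the previous cell. The column name `Weight` is an assumption about what Gephi's spreadsheet import treats as the edge weight, so check the import dialog when loading the file.

```python
import pandas as pd

# Hypothetical toy edge list with one repeated mention (alice -> bob twice)
toy_mentions = pd.DataFrame({
    "Source": ["alice", "alice", "bob", "alice"],
    "Target": ["bob", "carol", "alice", "bob"],
})

# Count how often each (Source, Target) pair occurs and store it as Weight
weighted = (
    toy_mentions
    .groupby(["Source", "Target"])
    .size()
    .reset_index(name="Weight")
)
print(weighted)
```

The result has one row per distinct pair, so the edge list gets shorter while the information about repeated mentions is kept in the weight.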

## Create Hashtag Co-Use Network

In [111]:
# Now we want to create a network of hashtags
# Whenever two hashtags appear in the same tweet, we want to create an edge between them
# Let's start by looking at the possibly relevant columns again
df.columns

Index(['query', 'id', 'timestamp_utc', 'local_time', 'user_screen_name',
       'text', 'possibly_sensitive', 'retweet_count', 'like_count',
       'reply_count', 'impression_count', 'lang', 'to_username', 'to_userid',
       'to_tweetid', 'source_name', 'source_url', 'user_location', 'lat',
       'lng', 'user_id', 'user_name', 'user_verified', 'user_description',
       'user_url', 'user_image', 'user_tweets', 'user_followers',
       'user_friends', 'user_likes', 'user_lists', 'user_created_at',
       'user_timestamp_utc', 'collected_via', 'match_query', 'retweeted_id',
       'retweeted_user', 'retweeted_user_id', 'retweeted_timestamp_utc',
       'quoted_id', 'quoted_user', 'quoted_user_id', 'quoted_timestamp_utc',
       'collection_time', 'url', 'place_country_code', 'place_name',
       'place_type', 'place_coordinates', 'links', 'domains', 'media_urls',
       'media_files', 'media_types', 'media_alt_texts', 'mentioned_names',
       'mentioned_ids', 'hashtags', 'intervention

In [112]:
df[['text', 'hashtags']].head()

Unnamed: 0,text,hashtags
0,U.S. Faces Timeline Issues Over Delivery Of Ab...,america|bakhmut|canada|germany|ukraine
1,"How is #USA a superpower, while others are not...",germany|ukraine️|usa|usarmy
2,U.S. Faces Timeline Issues Over Delivery Of Ab...,america|bakhmut|canada|germany|ukraine
3,Brutal Attack !! Ukraine Sends dozen Bayraktar...,bachmut|canada|germany|russia|ukraine
4,Germany and Finland Deliver Engineering Tanks ...,


In [113]:
# Let's filter the dataset to only include rows where hashtags is not null and keep only the relevant column
# (.copy() lets us safely modify the column in the next step)
hashtags = df.loc[df["hashtags"].notnull(), ["hashtags"]].copy()
hashtags.head()

Unnamed: 0,hashtags
0,america|bakhmut|canada|germany|ukraine
1,germany|ukraine️|usa|usarmy
2,america|bakhmut|canada|germany|ukraine
3,bachmut|canada|germany|russia|ukraine
6,russia|ukraine


In [114]:
# Our goal is to have one row for every pair of hashtags that appear in the same tweet
# First we split the hashtags column by the pipe character again
hashtags["hashtags"] = hashtags["hashtags"].str.split("|")
hashtags.head()

Unnamed: 0,hashtags
0,"[america, bakhmut, canada, germany, ukraine]"
1,"[germany, ukraine️, usa, usarmy]"
2,"[america, bakhmut, canada, germany, ukraine]"
3,"[bachmut, canada, germany, russia, ukraine]"
6,"[russia, ukraine]"


In [115]:
# Now we need to find all possible combinations of hashtags with the itertools package
import itertools

hashtags['hashtag_pairs'] = hashtags['hashtags'].apply(lambda x: list(itertools.combinations(x, 2)))
hashtags.head()

Unnamed: 0,hashtags,hashtag_pairs
0,"[america, bakhmut, canada, germany, ukraine]","[(america, bakhmut), (america, canada), (ameri..."
1,"[germany, ukraine️, usa, usarmy]","[(germany, ukraine️), (germany, usa), (germany..."
2,"[america, bakhmut, canada, germany, ukraine]","[(america, bakhmut), (america, canada), (ameri..."
3,"[bachmut, canada, germany, russia, ukraine]","[(bachmut, canada), (bachmut, germany), (bachm..."
6,"[russia, ukraine]","[(russia, ukraine)]"
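
As a small illustration of what `itertools.combinations` does here: for a list of n hashtags it yields every unordered pair exactly once, so n * (n - 1) / 2 pairs in total, and a list with a single hashtag yields no pairs at all. The hashtags below are taken from the first row of the output above.

```python
import itertools

tags = ["america", "bakhmut", "canada", "germany", "ukraine"]

# Every unordered pair appears once, in the order of the input list
pairs = list(itertools.combinations(tags, 2))
print(len(pairs))  # 5 * 4 / 2 = 10 pairs
print(pairs[:3])

# A tweet with only one hashtag produces an empty list of pairs
single = list(itertools.combinations(["ukraine"], 2))
print(single)
```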


In [116]:
# now we can explode the hashtag_pairs column
hashtags = hashtags.explode("hashtag_pairs")
hashtags.head()

Unnamed: 0,hashtags,hashtag_pairs
0,"[america, bakhmut, canada, germany, ukraine]","(america, bakhmut)"
0,"[america, bakhmut, canada, germany, ukraine]","(america, canada)"
0,"[america, bakhmut, canada, germany, ukraine]","(america, germany)"
0,"[america, bakhmut, canada, germany, ukraine]","(america, ukraine)"
0,"[america, bakhmut, canada, germany, ukraine]","(bakhmut, canada)"


In [117]:
# now we can remove the hashtags column and split the hashtag_pairs column into two columns
# tweets with a single hashtag have no pairs; explode turns them into NaN rows, so we drop those first
pairs = hashtags["hashtag_pairs"].dropna()
hashtags = pd.DataFrame(pairs.tolist(), index=pairs.index)  # one column per element of the pair
hashtags.head()

Unnamed: 0,0,1
0,america,bakhmut
0,america,canada
0,america,germany
0,america,ukraine
0,bakhmut,canada


In [118]:
# Finally, let's rename the columns to Source and Target and export to a csv file for Gephi
hashtags.columns = ["Source", "Target"]
hashtags.to_csv("hashtags.csv", index=False)
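
As with the mention network, the same hashtag pair can occur in many tweets. A minimal sketch of counting those co-occurrences as weights follows, using hypothetical toy data and only the standard library. Sorting each pair is an extra safeguard that is not part of the cells above: it makes sure `(a, b)` and `(b, a)` count as the same edge even if the hashtags of a tweet are not in alphabetical order.

```python
import itertools
from collections import Counter

# Hypothetical toy data: hashtags per tweet, already split into lists
toy_tweets = [
    ["germany", "ukraine"],
    ["ukraine", "germany", "leopard"],
    ["ukraine"],  # a single hashtag yields no pairs
]

pair_counts = Counter()
for tags in toy_tweets:
    # set() drops repeated hashtags within a tweet; sorted() fixes the pair order
    for pair in itertools.combinations(sorted(set(tags)), 2):
        pair_counts[pair] += 1

for (source, target), weight in sorted(pair_counts.items()):
    print(source, target, weight)
```

Because co-use has no direction, it makes sense to treat this network as undirected when exploring it in Gephi.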