# Lab 2: Networking Data

What we will do:

1. Create edge lists for a mention network and a hashtag network from the data collected in Lab 1, again, *but this time with timestamps* ☝️
2. Explore both networks in Gephi

Again, there are two versions of this so-called Jupyter Notebook for you to follow along with:

* One already filled out for you, in case you would rather pay attention to things other than typing, or alter the code to try new things.
* Another with the code 'cells' emptied, so you can practice your Python typing skills alongside the lecturer (or sometimes find even better solutions to the given problems).

Secret tip: If you want to try this at home, GitHub Copilot (free for students), ChatGPT and Bing Chat have become pretty good at generating code for you. However, you should still be able to check that the code they produce actually does what you want it to do. So you still have to learn some Python.

But now let's start.

## Read in the data

Make sure the `leo_tweets.csv` file is still in, or has been uploaded to, the root folder of this notebook. You were asked to download it last time; if you have lost it, ask the lecturer for it.

In [None]:
# import the necessary packages
import pandas as pd

# load the dataset (adjust the relative path if your copy of the file lives elsewhere)
df = pd.read_csv("../leo_tweets.csv")

# show the first 5 rows of the dataset
df.head()
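
Because this lab keeps the timestamps, it can help to check that `local_time` parses as a proper datetime. The following is a minimal sketch on a made-up two-row frame: the user names are invented for illustration, and only the column names and the timestamp format are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (user names are made up for illustration)
toy = pd.DataFrame({
    "local_time": ["2023-03-25T14:00:48", "2023-03-24T15:03:39"],
    "user_screen_name": ["alice", "bob"],
})

# Convert the ISO 8601 strings into pandas datetime values
toy["local_time"] = pd.to_datetime(toy["local_time"])

# With a datetime dtype, min/max give the time span covered by the tweets
first_tweet = toy["local_time"].min()
last_tweet = toy["local_time"].max()
print(first_tweet, last_tweet)
```

The edge lists in this lab keep `local_time` as plain text, so this conversion is only needed if you want to inspect the time range inside Python.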

## Create Mention Network

In [None]:
# filter the dataset to possibly relevant columns
mentions = df[["local_time", "user_screen_name", "mentioned_names"]]
mentions.head()

In [None]:
# as we know from last time, mentioned_names is the relevant column
# It is a list of users mentioned in the tweet
# Let's filter the dataset to only include rows where mentioned_names is not null
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the column later)
mentions = mentions[mentions["mentioned_names"].notnull()].copy()
mentions.head()

In [None]:
# Now we have to split the mentioned_names column into multiple rows
# We can do this with the split and explode functions
# First we split the mentioned_names column by the pipe character
mentions["mentioned_names"] = mentions["mentioned_names"].str.split("|")
mentions.head()

In [None]:
# Now we explode the mentioned_names column
mentions = mentions.explode("mentioned_names")
mentions.head()
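
To see what the split and explode steps do in isolation, here is a small sketch on toy data. The account names are hypothetical; only the column names and the pipe separator come from the cells above.

```python
import pandas as pd

# Toy data: one tweet mentioning two accounts, one mentioning a single account
toy = pd.DataFrame({
    "user_screen_name": ["alice", "bob"],
    "mentioned_names": ["carol|dave", "alice"],
})

# Split the pipe-separated string into a Python list per row
toy["mentioned_names"] = toy["mentioned_names"].str.split("|")

# explode() creates one row per list element and repeats the other columns
edges = toy.explode("mentioned_names")
print(edges)
```

The exploded rows keep their original index label (here `0, 0, 1`), which is why repeated index values appear in exploded outputs.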

In [17]:
# are there duplicate rows?
mentions.duplicated().sum()

0

In [18]:
# Let's rename the columns to Source and Target and export to a csv file for Gephi
mentions.columns = ["local_time", "Source", "Target"]
mentions.to_csv("mentions.csv", index=False)
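
Before opening the file in Gephi, a quick sanity check in pandas can show which accounts are mentioned most often. This is only a sketch on a hypothetical toy edge list in the same `Source`/`Target` layout, not a result from the real data.

```python
import pandas as pd

# Toy edge list in the Source/Target layout (names are made up)
toy_edges = pd.DataFrame({
    "Source": ["alice", "alice", "bob", "dave"],
    "Target": ["carol", "dave", "carol", "carol"],
})

# Number of times each account is mentioned (repeated mentions are counted)
mention_counts = toy_edges["Target"].value_counts()
print(mention_counts)
```

Running the same `value_counts()` call on `mentions["Target"]` should give a rough preview of the most-mentioned accounts.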

## Create Hashtag Co-Use Network

In [25]:
# Let's filter the dataset to only include rows where hashtags is not null and keep only the relevant columns
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the column later)
hashtags = df[df["hashtags"].notnull()][["hashtags", "local_time"]].copy()
hashtags.head()

index,hashtags,local_time
0,america|bakhmut|canada|germany|ukraine,2023-03-25T14:00:48
1,germany|ukraine️|usa|usarmy,2023-03-25T10:28:30
2,america|bakhmut|canada|germany|ukraine,2023-03-25T07:10:25
3,bachmut|canada|germany|russia|ukraine,2023-03-24T20:56:58
6,russia|ukraine,2023-03-24T15:03:39


In [26]:
# Our goal is to have one row for every pair of hashtags in hashtags
# First we split the hashtags column by the pipe character again
hashtags["hashtags"] = hashtags["hashtags"].str.split("|")
hashtags.head()

index,hashtags,local_time
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48
1,"[germany, ukraine️, usa, usarmy]",2023-03-25T10:28:30
2,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T07:10:25
3,"[bachmut, canada, germany, russia, ukraine]",2023-03-24T20:56:58
6,"[russia, ukraine]",2023-03-24T15:03:39


In [27]:
# Now we need to find all possible combinations of hashtags with the itertools package
import itertools

hashtags['hashtag_pairs'] = hashtags['hashtags'].apply(lambda x: list(itertools.combinations(x, 2)))
hashtags.head()

index,hashtags,local_time,hashtag_pairs
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"[(america, bakhmut), (america, canada), (ameri..."
1,"[germany, ukraine️, usa, usarmy]",2023-03-25T10:28:30,"[(germany, ukraine️), (germany, usa), (germany..."
2,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T07:10:25,"[(america, bakhmut), (america, canada), (ameri..."
3,"[bachmut, canada, germany, russia, ukraine]",2023-03-24T20:56:58,"[(bachmut, canada), (bachmut, germany), (bachm..."
6,"[russia, ukraine]",2023-03-24T15:03:39,"[(russia, ukraine)]"
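
As a standalone illustration of what `itertools.combinations` returns, here is a short sketch using a made-up list of three hashtags:

```python
import itertools

tags = ["germany", "russia", "ukraine"]

# All unordered pairs, in the order the items appear in the list
pairs = list(itertools.combinations(tags, 2))
print(pairs)
# [('germany', 'russia'), ('germany', 'ukraine'), ('russia', 'ukraine')]

# A list with a single hashtag yields no pairs at all
single = list(itertools.combinations(["ukraine"], 2))
print(single)
# []
```

A tweet with n hashtags yields n(n-1)/2 pairs, so a tweet with five hashtags contributes ten edges.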


In [28]:
# now we can explode the hashtag_pairs column
hashtags = hashtags.explode("hashtag_pairs")
hashtags.head()

index,hashtags,local_time,hashtag_pairs
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"(america, bakhmut)"
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"(america, canada)"
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"(america, germany)"
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"(america, ukraine)"
0,"[america, bakhmut, canada, germany, ukraine]",2023-03-25T14:00:48,"(bakhmut, canada)"


In [29]:
# Tweets with only one hashtag produce no pairs and become NaN after explode, so drop them
hashtags = hashtags.dropna(subset=["hashtag_pairs"])
# now we can remove the hashtags column and split the hashtag_pairs column into two columns
hashtags = hashtags.drop("hashtags", axis=1)
hashtags[["Source", "Target"]] = pd.DataFrame(hashtags["hashtag_pairs"].tolist(), index=hashtags.index)
hashtags = hashtags.drop("hashtag_pairs", axis=1)
hashtags.head()

index,local_time,Source,Target
0,2023-03-25T14:00:48,america,bakhmut
0,2023-03-25T14:00:48,america,canada
0,2023-03-25T14:00:48,america,germany
0,2023-03-25T14:00:48,america,ukraine
0,2023-03-25T14:00:48,bakhmut,canada


In [30]:
# save to a csv
hashtags.to_csv("hashtags.csv", index=False)
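
In the outputs above, the hashtags appear to be alphabetically sorted within each tweet, so the same pair always comes out in the same order. If that cannot be relied on, one option is to sort (and de-duplicate) each list before building the pairs. This is a sketch on made-up toy rows under that assumption, not a step of the original lab:

```python
import itertools
import pandas as pd

# Toy rows where the same two hashtags appear in a different order
toy = pd.DataFrame({
    "local_time": ["2023-03-25T14:00:48", "2023-03-24T15:03:39"],
    "hashtags": [["ukraine", "russia"], ["russia", "ukraine"]],
})

# Sort each list first so a given pair always has the same Source and Target
toy["hashtag_pairs"] = toy["hashtags"].apply(
    lambda tags: list(itertools.combinations(sorted(set(tags)), 2))
)
pairs = toy.explode("hashtag_pairs")
print(pairs["hashtag_pairs"].tolist())
# [('russia', 'ukraine'), ('russia', 'ukraine')]
```

Without the sorting step, the two toy tweets would yield `('ukraine', 'russia')` and `('russia', 'ukraine')`, which look like two different edges in the edge list.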