# Lab 2: Networking Data

What we will do:

1. Create an edge list for a mention and a hashtag network based on the data collected in Lab 1
2. Explore and analyse both networks in Gephi

Again, there will be two versions of this so-called Jupyter Notebook for you to follow along:

* One already filled out for you, in case you would rather pay attention to things other than typing, or want to alter the code to try new things.
* Another one with the code 'cells' emptied, so you can practice your Python typing skills alongside the lecturer (or maybe sometimes find even better solutions to the given problems).

Secret tip: If you want to try this at home, GitHub Copilot (free for students), ChatGPT and Bing Chat have become pretty good at generating code for you. However, you should still be able to check that the code they produce actually does what you want it to do. So you still have to learn some Python.

But now let's start.

## Read in the data

Make sure the `leo_tweets.csv` file you downloaded last time is in the root folder of this notebook; upload it if it is not there yet. If you have lost it, ask the lecturer for it.

In [None]:
# import the necessary packages
import pandas as pd

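# load the dataset
# (this path expects the file one folder above this notebook; adjust it if your file is somewhere else)
df = pd.read_csv("../leo_tweets.csv")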

# show the first 5 rows of the dataset
df.head()

In [None]:
# list the columns of the dataset
df.columns

## Create Mention Network

Now we want to create a so-called edge list for a mention network. This means one column contains the accounts mentioning other accounts and the other column contains the accounts being mentioned, so every row describes exactly one mention.
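
To make this more concrete, here is a small sketch of what such an edge list looks like. The account names (`alice`, `bob`, `carol`) are made up for illustration and do not come from our dataset. Each row is one directed edge from the mentioning account to the mentioned account.

```python
import pandas as pd

# Hypothetical toy data: each row is one mention, i.e. one directed edge
toy_edges = pd.DataFrame(
    {
        "Source": ["alice", "alice", "bob"],  # account writing the tweet
        "Target": ["bob", "carol", "carol"],  # account being mentioned
    }
)

# How often is each account mentioned? (its in-degree in the network)
in_degree = toy_edges["Target"].value_counts()

print(toy_edges)
print(in_degree)
```

Counting how often an account shows up in the target column is already a first, very simple network measure.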

In [None]:
# filter the dataset to possibly relevant columns
mentions = df[["user_screen_name", "to_username", "text", "mentioned_names"]]
mentions.head()

In [None]:
# inspect rows where the to_username is not null
mentions[mentions["to_username"].notnull()].head()

In [None]:
# inspect rows where the mentioned_names is not null
mentions[mentions["mentioned_names"].notnull()].head()

In [None]:
# mentioned_names seems to be the relevant column
# It is a list of users mentioned in the tweet
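# Let's filter the dataset to only include rows where mentioned_names is not null
# (.copy() gives us an independent DataFrame, so assigning to a column later does not trigger a SettingWithCopyWarning)
mentions = mentions[mentions["mentioned_names"].notnull()].copy()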
mentions.head()

In [None]:
# Now we have to split the mentioned_names column into multiple rows
# We can do this by using the split and explode functions
# First we split the mentioned_names column by the pipe character
mentions["mentioned_names"] = mentions["mentioned_names"].str.split("|")
mentions.head()

In [None]:
# Now we explode the mentioned_names column
mentions = mentions.explode("mentioned_names")
mentions.head()
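
If you want to see the two steps in isolation, here is a minimal sketch on made-up data (the names are hypothetical, only the column names match our dataset). `str.split` turns each string into a list, and `explode` then gives every list element its own row while repeating the other columns.

```python
import pandas as pd

# Hypothetical toy data in the same shape as our mentions table
toy = pd.DataFrame(
    {
        "user_screen_name": ["alice", "bob"],
        "mentioned_names": ["bob|carol", "carol"],
    }
)

# Step 1: split the pipe-separated string into a list of names
toy["mentioned_names"] = toy["mentioned_names"].str.split("|")

# Step 2: one row per list element; the index of the original row is repeated
toy_long = toy.explode("mentioned_names")

print(toy_long)
```

Note that `explode` keeps the original row index, so the two rows coming from the first tweet both carry the index 0.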

In [None]:
# now we can remove the to_username and text columns
mentions = mentions[["user_screen_name", "mentioned_names"]]
mentions.head()

In [None]:
# are there duplicate rows?
mentions.duplicated().sum()

In [None]:
# Duplicate rows mean that some edges occur more than once, so we have to take care of weights later on
# Let's rename the columns to Source and Target and export to a csv file for Gephi
mentions.columns = ["Source", "Target"]
mentions.to_csv("mentions.csv", index=False)
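
If you would rather handle the weights in Python already, one possible approach is to count how often each Source/Target pair occurs. This is only a sketch on made-up data, not part of the lab's required steps. The column name `Weight` is an assumption: Gephi's spreadsheet import should pick up a column with that name as the edge weight, but please double-check this in your Gephi version.

```python
import pandas as pd

# Hypothetical toy edge list with one duplicated edge (alice -> bob)
toy_edges = pd.DataFrame(
    {
        "Source": ["alice", "alice", "bob"],
        "Target": ["bob", "bob", "carol"],
    }
)

# Count the rows per Source/Target pair and store the count as Weight
weighted = (
    toy_edges.groupby(["Source", "Target"])
    .size()
    .reset_index(name="Weight")
)

print(weighted)
```

The result has one row per distinct edge, and the duplicated edge gets a weight of 2.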

## Create Hashtag Co-Use Network

In [None]:
# Now we want to create a network of hashtags
# Whenever two hashtags appear in the same tweet, we want to create an edge between them
# Let's start by looking at the possibly relevant columns again
df.columns

In [None]:
df[['text', 'hashtags']].head()

In [None]:
# Let's filter the dataset to only include rows where hashtags is not null and keep only the relevant column
# (.copy() avoids a SettingWithCopyWarning when we overwrite the column later)
hashtags = df[df["hashtags"].notnull()][["hashtags"]].copy()
hashtags.head()

In [None]:
# Our goal is to have one row for every pair of hashtags in hashtags
# First we split the hashtags column by the pipe character again
hashtags["hashtags"] = hashtags["hashtags"].str.split("|")
hashtags.head()

In [None]:
# Now we need to find all possible pairs of hashtags within each tweet with the itertools package
import itertools

hashtags['hashtag_pairs'] = hashtags['hashtags'].apply(lambda x: list(itertools.combinations(x, 2)))
hashtags.head()
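
To see what `itertools.combinations` does on its own, here is a small sketch with made-up hashtags. It returns every pair exactly once, in the order the items appear in the list, and it returns nothing if there are fewer than two items.

```python
import itertools

# Hypothetical hashtags of a single tweet
tags = ["oscars", "climate", "film"]

# All pairs of two hashtags, each pair only once
pairs = list(itertools.combinations(tags, 2))
print(pairs)

# A tweet with only one hashtag has no pairs at all
single = list(itertools.combinations(["oscars"], 2))
print(single)
```

So tweets with a single hashtag end up with an empty list in the `hashtag_pairs` column.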

In [None]:
# now we can explode the hashtag_pairs column
hashtags = hashtags.explode("hashtag_pairs")
hashtags.head()

In [None]:
# now we can remove the hashtags column and split the hashtag_pairs column into two columns
# tweets with only one hashtag have no pairs (NaN after explode), so we drop those rows first
hashtags = hashtags["hashtag_pairs"].dropna()
hashtags = hashtags.apply(pd.Series) # split the column into two columns (unintuitive, but it works)
hashtags.head()
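
If the `apply(pd.Series)` step feels too magical, a possible alternative is to build a new DataFrame directly from the list of tuples. The sketch below uses made-up pairs and is just one way to do it; on larger datasets it also tends to be faster.

```python
import pandas as pd

# Hypothetical column of hashtag pairs, including one missing value
pairs = pd.Series([("a", "b"), ("a", "c"), None]).dropna()

# Each tuple becomes one row, each tuple position becomes one column
edges = pd.DataFrame(
    pairs.tolist(),
    columns=["Source", "Target"],
    index=pairs.index,
)

print(edges)
```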

In [None]:
# Finally, let's rename the columns to Source and Target and export to a csv file for Gephi
hashtags.columns = ["Source", "Target"]
hashtags.to_csv("hashtags.csv", index=False)
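
One thing to keep in mind: unlike mentions, hashtag co-use has no direction, so the pair (film, climate) means the same as (climate, film). If you want pandas to recognise such pairs as duplicates, one option is to sort the two hashtags within each row first. This is a sketch on made-up data, not a required step.

```python
import pandas as pd

# Hypothetical toy edge list: the first two rows are the same pair in reversed order
toy = pd.DataFrame(
    {
        "Source": ["film", "climate", "climate"],
        "Target": ["climate", "film", "oscars"],
    }
)

# Before sorting, pandas sees no duplicate rows
print(toy.duplicated().sum())

# Sort the two hashtags alphabetically within each row
sorted_pairs = toy.apply(
    lambda row: sorted([row["Source"], row["Target"]]),
    axis=1,
    result_type="expand",
)
sorted_pairs.columns = ["Source", "Target"]

# After sorting, the reversed pair is recognised as a duplicate
print(sorted_pairs.duplicated().sum())
```

Alternatively, you can leave the file as it is and tell Gephi during the import that the edges are undirected.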