# Exploring and analysing the network data in Python

## Part 0: Overview

### The data
For this part of the workshop, you will work with the data export you performed at the end of the Gephi part of the workshop.
If you did not manage to make a good export, or if you wish to take this part on its own, you can use the file located at `files/example_gephi_export.csv`.

### Prior experience with Python
This part of the workshop presumes **no** knowledge of Python. Depending on your experience, try to read the materials below as a book, change parts of the code to your liking, or write whole new analyses. 

### Working with Jupyter Python Notebooks
If you are not used to working with Python Notebooks, note that it is easiest to run the notebooks through Google Colab. The instructors will explain how to do so.
If you wish to run the Notebooks on your local machine, you will need to install a few packages. These are listed in `python/requirements.in`.

## Part 1: Opening and exploring the Gephi export


In [8]:
# we will use the Pandas package for much of this workshop
# you do not need to understand everything that is going on, as long as you are able to follow the steps in a general sense
import pandas as pd

# read the nodes table
nodes = pd.read_csv('../files/example_gephi_export.csv', sep=';')

In [None]:
# display some information about the data
# do you know what each column means? If not, go back to Gephi and try to figure it out
print(nodes.info())

In [None]:
# using .head(), we can display the first N rows (default=5)
print(nodes.head(1))

In [None]:
# inspect a column
print(nodes['author.description'].head(5))

# display the full value of the author.description for the first row
print(nodes.iloc[0]['author.description'])

## Part 2: Extracting information
Inspecting a single user description reveals an interesting detail: it contains hashtags. Hashtags (e.g. #IamAChristian) are a vital part of Twitter/X's vocabulary.
Extracting them from the descriptions will enable easier comparison with other users.
To extract the hashtags, we will use a *Regular Expression*: a pattern that defines part of a text that we are interested in.

In [19]:
# import Python's built-in Regular Expressions package
import re

hashtag_pattern = r'\#\w+'

### Explanation of the hashtag pattern (thanks GenAI)

The regular expression `\#\w+` can be broken down into the following components:

1. **`\#`**: 
   - The backslash (`\`) is an escape character, which means it is used to treat the hash symbol (`#`) as a literal character rather than a special character in regular expressions. So, this part of the expression matches the `#` symbol literally.

2. **`\w+`**: 
   - `\w` is a shorthand character class that matches any "word" character. Specifically, it matches:
     - Any letter (uppercase or lowercase; in Python 3 this includes accented and non-Latin letters),
     - Any digit (0-9),
     - The underscore (`_`).
   - The `+` following `\w` means "one or more" of the preceding character class. So, `\w+` will match one or more word characters.

#### In summary:
The regular expression `\#\w+` matches any string that starts with a `#` symbol, followed by one or more word characters (letters, digits, or underscores).

Example matches:
- `#hello`
- `#123`
- `#word_example`

This pattern could be used to match hashtags or identifiers that start with `#` followed by alphanumeric characters or underscores.
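
To see the pattern in action, here is a minimal sketch that applies it to a made-up sentence. The sample text is invented for illustration and is not taken from the workshop data.

```python
import re

hashtag_pattern = r'\#\w+'

# a made-up description, not taken from the workshop data
sample = 'Coffee lover. #OpenData and #python_3 fan, not a #. Reach me at info#desk'

# findall returns every non-overlapping match as a list of strings
matches = re.findall(hashtag_pattern, sample)
print(matches)
```

Two details are worth noticing when you run this: a lone `#` is not matched because at least one word character must follow it, and the pattern does not require a space before the `#`, so `#desk` inside `info#desk` is matched as well.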

### Extracting hashtags to a separate column

In [45]:
# put the code for extracting a hashtag in a function so we can easily re-use it
def extract_hashtags(text):
    # make sure the input is a string
    text = str(text)

    # make and return a list of all found hashtags
    hashtags = re.findall(hashtag_pattern, text)
    return hashtags

In [46]:
# let's try the pattern on the description we found earlier
description = nodes.iloc[0]['author.description']
print(extract_hashtags(description))

['#IamAChristian', '#ProLife', '#PrayToEndAbortion']


In [None]:
# now that we have a way to extract hashtags, let's do so for our entire data
# we search the author.description column, and make a new column called hashtags with the results
nodes['hashtags'] = nodes['author.description'].apply(extract_hashtags)

# display the first few rows of hashtags, what stands out?
print(nodes['hashtags'].head(20))

In [None]:
# let's say we are only interested in users with hashtags in their description
# first, filter the data

hashtag_users = nodes[nodes['hashtags'].str.len() != 0]

# only selecting hashtag users has a big cost: we ignore a large number of users
print('original amount of users:', len(nodes))
print('amount of users using hashtags:', len(hashtag_users))

original amount of users: 47404
amount of users using hashtags: 7269
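
If the filter step feels opaque, the sketch below repeats it on a tiny table. The table `toy` and its contents are invented for illustration; only the column name mirrors the Gephi export.

```python
import re
import pandas as pd

# toy data invented for illustration; the column name mirrors the Gephi export
toy = pd.DataFrame({
    'author.description': ['#OpenData fan', 'No tags here', None, '#a and #b'],
})

def extract_hashtags(text):
    # str() turns missing values into text, which simply contains no hashtags
    return re.findall(r'\#\w+', str(text))

toy['hashtags'] = toy['author.description'].apply(extract_hashtags)

# .str.len() also works on lists: it gives the number of items in each row
has_hashtags = toy['hashtags'].str.len() != 0
toy_hashtag_users = toy[has_hashtags]

print(has_hashtags.tolist())
print(len(toy), len(toy_hashtag_users))
```

The intermediate `has_hashtags` is a column of True/False values, one per row. Placing it between square brackets keeps only the rows marked True.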


### Counting hashtags
Now that we have extracted the hashtags, let's see if we can detect some patterns in them.

In [None]:
# first, make a flat list of all hashtags
# .explode(column) makes a separate row for each hashtag if a single user has more than one
flat_hashtags = hashtag_users.explode('hashtags')['hashtags']

# now count the occurrences of each hashtag
flat_hashtags.value_counts()
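
To make the effect of `.explode()` visible, here is a small sketch on invented data. The users and hashtags in `toy` are made up.

```python
import pandas as pd

# toy data invented for illustration
toy = pd.DataFrame({
    'user': ['a', 'b', 'c'],
    'hashtags': [['#x', '#y'], ['#x'], ['#y', '#z', '#x']],
})

# one row per (user, hashtag) pair; the user value is repeated as needed
exploded = toy.explode('hashtags')
print(exploded)

# count how often each hashtag occurs across all users
counts = exploded['hashtags'].value_counts()
print(counts)
```

Three users with six hashtags in total become six rows, after which `value_counts()` can treat every hashtag as a single value.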


### Comparing modularity classes
Remember the modularity classes we generated in Gephi?

We can see if users clustered in different classes prefer different hashtags.

In [None]:


# first, get the two biggest classes
classes_counts = hashtag_users['modularity_class'].value_counts()
print(classes_counts)

first_class = classes_counts.index[0]
second_class = classes_counts.index[1]

print('largest class:', first_class)
print('second largest class:', second_class)

In [81]:
# repeat the counting of hashtags for each of the clusters

largest_cluster = hashtag_users[hashtag_users['modularity_class'] == first_class]
second_cluster = hashtag_users[hashtag_users['modularity_class'] == second_class]

largest_cluster_hashtags = largest_cluster.explode('hashtags')['hashtags'].value_counts()
second_cluster_hashtags = second_cluster.explode('hashtags')['hashtags'].value_counts()


### Compare the hashtags for the two largest clusters
For the example data, there is a clear difference in hashtags! We could even identify the clusters based on their hashtags.

In [88]:
print(largest_cluster_hashtags)

hashtags
#MAGA                 93
#Bitcoin              41
#bitcoin              35
#2A                   33
#TrudeauMustGo        30
                      ..
#COVID1984             1
#PVV                   1
#jevoteRN              1
#zemmour               1
#HillaryIndictment     1
Name: count, Length: 1810, dtype: int64


In [89]:
print(second_cluster_hashtags)

hashtags
#BLM                 98
#Resist              53
#BlackLivesMatter    49
#TheResistance       33
#StandWithUkraine    29
                     ..
#VoterRights          1
#rocknroll            1
#comics               1
#starwars             1
#woodworking          1
Name: count, Length: 1739, dtype: int64
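
The first output above lists `#Bitcoin` and `#bitcoin` separately, because counting is case-sensitive. One possible refinement, sketched here on invented toy data rather than the workshop export, is to lowercase the hashtags before counting and to place the counts of both clusters side by side. The names `cluster_a` and `cluster_b` are hypothetical stand-ins.

```python
import pandas as pd

# toy stand-ins for the hashtags of two clusters (invented counts)
cluster_a = pd.Series(['#Bitcoin', '#bitcoin', '#MAGA', '#maga', '#MAGA'])
cluster_b = pd.Series(['#BLM', '#blm', '#Bitcoin'])

# lowercase before counting so that case variants are merged
counts_a = cluster_a.str.lower().value_counts()
counts_b = cluster_b.str.lower().value_counts()

# align both counts on the hashtag; a hashtag missing from a cluster gets 0
comparison = (
    pd.concat([counts_a, counts_b], axis=1, keys=['cluster_a', 'cluster_b'])
    .fillna(0)
    .astype(int)
)
print(comparison)
```

Whether merging case variants is appropriate depends on your research question: it simplifies the comparison, but it also hides how users actually write a hashtag.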
