# Analysis of the 2020 US Elections Hashtag Network

Authors:
- Radu-Andrei Bourceanu
- Juan Arturo Abaurrea Calafell

## 1. Introduction

Social Network Analysis (SNA) allows us to model complex systems by analyzing the interactions between their components. This project applies SNA techniques to analyze the digital conversation surrounding the 2020 United States Presidential Elections.

### 1.1 Dataset Description
The analysis is based on the network file **`hashtags_cleaned.graphml`**. The dataset was collected using a snowball sampling technique starting from the hashtag **#elections2020**, capturing tweets in multiple languages.

In this graph representation:
* **Nodes:** Represent hashtags used in the dataset.
* **Edges:** Connect two nodes if the hashtags appeared together in the same message.
* **Weights:** The edge attribute "weight" represents the number of tweets in which the two hashtags appeared together.
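To make this representation concrete, the sketch below builds a weighted co-occurrence graph of the same shape from a handful of made-up tweets. The `tweets` list and the function name `build_cooccurrence_graph` are illustrative assumptions; they are not part of the original collection pipeline.

```python
from itertools import combinations

import networkx as nx


def build_cooccurrence_graph(tweets):
    """Build a weighted hashtag co-occurrence graph.

    `tweets` is an iterable of hashtag collections, one per tweet.
    """
    G = nx.Graph()
    for hashtags in tweets:
        unique = sorted(set(hashtags))  # count each hashtag once per tweet
        G.add_nodes_from(unique)
        for a, b in combinations(unique, 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1  # one more tweet containing both
            else:
                G.add_edge(a, b, weight=1)
    return G


# Toy data (hypothetical), not taken from the real dataset
tweets = [
    ["elections2020", "Trump", "vote"],
    ["elections2020", "Trump"],
    ["Biden", "vote"],
]
toy = build_cooccurrence_graph(tweets)
print(toy["elections2020"]["Trump"]["weight"])  # 2
```

Hashtags that never share a tweet, such as `Biden` and `Trump` in this toy input, simply have no edge between them.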

### 1.2 Objectives and Methodology
The goal of this project is to characterize the structure of the network and extract semantic insights regarding the political discussion. Following the course methodology, the analysis is divided into three main levels:

1.  **Meso-Analysis (Community Detection):**
    * We will identify communities (groups of densely connected hashtags) to understand the different topics of discussion.
    * We will use the **Leiden Algorithm**, a method designed to improve upon Louvain by guaranteeing well-connected communities.
    * We will then treat these communities as independent graphs and calculate their structural similarity using **Weisfeiler-Lehman Graph Kernels**, which generate feature vectors for graph comparison (similar to a "bag-of-words" for graphs).

2.  **Macro-Analysis (Global Structure):**
    * To handle the complexity of the large graph, we will perform **graph contraction**, collapsing each community into a single super-node to analyze the global topology more efficiently.
    * We will evaluate whether an overlapping community structure would be more appropriate than a non-overlapping partition.

3.  **Micro-Analysis (Centrality & Prediction):**
    * We will calculate centrality metrics (such as **Degree** and **Betweenness**) to identify "Hubs" (central topics) and "Bridges" (connectors between topics).
    * Finally, we will apply **Link Prediction** algorithms (e.g., Jaccard, Adamic-Adar) to determine which communities have the highest probability of becoming connected in the future.
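
The "bag-of-words" analogy for Weisfeiler-Lehman kernels can be illustrated with a small pure-Python sketch. It is not the implementation used in this project: it assumes unlabelled graphs given as adjacency dictionaries, uses each node's degree as its initial label, and the names `wl_features` and `wl_kernel` are hypothetical.

```python
from collections import Counter


def wl_features(adjacency, iterations=2):
    """Count the Weisfeiler-Lehman labels seen over all refinement rounds.

    `adjacency` maps each node to a list of its neighbours.
    Initial labels are node degrees (an assumption for unlabelled graphs).
    """
    labels = {node: str(len(set(neigh))) for node, neigh in adjacency.items()}
    features = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node, neigh in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels
            neighbour_labels = sorted(labels[n] for n in neigh)
            new_labels[node] = f"{labels[node]}({','.join(neighbour_labels)})"
        labels = new_labels
        features.update(labels.values())
    return features


def wl_kernel(adj_a, adj_b, iterations=2):
    """Dot product of the two label-count vectors."""
    fa = wl_features(adj_a, iterations)
    fb = wl_features(adj_b, iterations)
    return sum(count * fb[label] for label, count in fa.items())


triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
triangle_relabelled = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
path = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}

print(wl_features(triangle) == wl_features(triangle_relabelled))  # True
print(wl_kernel(triangle, triangle))  # 27
print(wl_kernel(triangle, path))      # 3
```

Each distinct label plays the role of a "word", and the kernel value grows with the number of labels two graphs share, so structurally similar communities score higher.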

## Imports

In [35]:
# Data Handling & Math
import networkx as nx
import pandas as pd
import numpy as np
from collections import Counter

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Community Detection
from cdlib import algorithms, viz, evaluation
import leidenalg
import community as community_louvain  # python-louvain

# Graph Kernels

# Configuration for clearer plots
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 6]

## Loading the Graph

In [2]:
# 1. Load the graph
# The file 'hashtags_cleaned.graphml' must be in the notebook's working directory
filename = "hashtags_cleaned.graphml"

try:
    # Create the graph object 'G'
    G = nx.read_graphml(filename)
except FileNotFoundError:
    print(f"Error: The file '{filename}' was not found. Please check the path.")
else:
    # 2. Basic verification
    print("Graph loaded successfully!")
    print(f"Type: {type(G)}")
    print(f"Number of nodes: {G.number_of_nodes()}")
    print(f"Number of edges: {G.number_of_edges()}")

    # 3. Check node attributes (to see if hashtags are stored correctly)
    # Print the first node and its attributes without materializing the full list
    first_node = next(iter(G.nodes(data=True)))
    print(f"\nExample Node: {first_node}")

    # 4. Check edge attributes (looking for 'weight')
    first_edge = next(iter(G.edges(data=True)))
    print(f"Example Edge: {first_edge}")

Graph loaded successfully!
Type: <class 'networkx.classes.graph.Graph'>
Number of nodes: 47544
Number of edges: 536124

Example Node: ('Υστερογραφα', {})
Example Edge: ('Υστερογραφα', 'Trump', {'weight': '14'})
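
The example edge shows the weight stored as the string `'14'` rather than as a number, which is what `read_graphml` returns when the file declares the attribute as a string. Weighted algorithms expect numeric values, so a conversion step along the following lines may be needed before any weighted analysis. This is a minimal sketch on a toy graph; the helper name `coerce_edge_weights`, the default of `1.0` for missing weights, and the `vote` node are assumptions.

```python
import networkx as nx


def coerce_edge_weights(G, attr="weight", default=1.0):
    """Convert an edge attribute to float in place; missing values get `default`."""
    for _, _, data in G.edges(data=True):
        data[attr] = float(data.get(attr, default))
    return G


# Toy graph mimicking the loaded data, with the weight stored as a string
toy = nx.Graph()
toy.add_edge("Υστερογραφα", "Trump", weight="14")
toy.add_edge("Trump", "vote")  # hypothetical edge with no weight attribute

coerce_edge_weights(toy)
print(toy["Υστερογραφα"]["Trump"]["weight"])  # 14.0
```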


## 1: Community Detection (Leiden Algorithm)

In this exercise, we perform **meso-analysis** to identify communities within the hashtag network. We use the **Leiden algorithm**, an improvement over the Louvain method. Louvain can produce communities that are internally disconnected; Leiden is designed to avoid this and guarantees that every community it returns is connected.

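We apply the Leiden algorithm to the full graph. The edge weights (frequency of co-occurrence) are available for this step, but the call below does not pass a `weights` argument, so it treats all edges equally.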

In [40]:
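print("\n--- Community Detection using Leiden Algorithm ---")
leiden_coms = algorithms.leiden(G).to_node_community_map()
print("Algorithm applied successfully.")

# Each node maps to a list of community IDs; IDs start at 0
coms = dict(leiden_coms)

# Count distinct IDs; max(coms.values()) would only return the highest ID, wrapped in a list
n_communities = len({cid for cids in coms.values() for cid in cids})
print(f"Total detected communities: {n_communities}")


--- Community Detection using Leiden Algorithm ---
Algorithm applied successfully.
Total detected communities: 64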
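
The connectivity guarantee mentioned above can be checked directly on a partition. The sketch below uses a toy graph and hand-written node-to-community maps in the same shape as `coms` (each node maps to a list of community IDs). Both are assumptions for illustration, and the function name `disconnected_communities` is hypothetical.

```python
from collections import defaultdict

import networkx as nx


def disconnected_communities(G, node_to_coms):
    """Return the IDs of communities whose induced subgraph is not connected."""
    members = defaultdict(set)
    for node, cids in node_to_coms.items():
        for cid in cids:
            members[cid].add(node)
    return sorted(
        cid for cid, nodes in members.items()
        if not nx.is_connected(G.subgraph(nodes))
    )


# Toy graph: a path a-b-c plus a separate edge d-e
toy = nx.Graph([("a", "b"), ("b", "c"), ("d", "e")])
good = {"a": [0], "b": [0], "c": [0], "d": [1], "e": [1]}
bad = {"a": [0], "b": [1], "c": [0], "d": [1], "e": [1]}

print(disconnected_communities(toy, good))  # []
print(disconnected_communities(toy, bad))   # [0, 1]
```

Applied to the real graph, an empty list from `disconnected_communities(G, coms)` would be consistent with the guarantee Leiden is expected to provide.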
