# Polarization behaviors on Twitter: temporal analysis

In this notebook, the following steps are performed:
* Definition of studied time frames
* Computation of polarization factors over each time frame
* Computation of polarization scores (GRAIL) over each time frame
* Identification and characterization of behavioral classes over each time frame
* Identification of polarization dynamics

It can be run for two debates: 
* The COVID-19 debate, opposing the pro-vaccine and anti-vaccine communities.
* The Ukraine conflict debate, opposing the pro-Ukraine and pro-Russia communities. 

At certain stages, some lines of code must be commented and/or uncommented, depending on the debate being studied. 
By default, the Ukraine conflict debate is studied.

# Libraries import

As a first step, all necessary libraries are imported.

In [1]:
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
from tqdm import tnrange, tqdm_notebook
import os
from datetime import datetime, date, timedelta
from scipy.stats import entropy
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import davies_bouldin_score
import matplotlib.cm as cm
import itertools
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from pydlc import dense_lines

# Data import

Then, data are imported. 
For the temporal analysis, two datasets are required:
* edges: contains all the links between standard users and elite users (sources). One link corresponds to one retweet. Each interaction is timestamped.
* factors_data: contains the factors computed for each standard user, for each studied time frame. These factors are computed either directly from the collected dataset (number of retweets, proportion of retweets on the studied debate, frequency of interactions, number of elite users retweeted in each community), or from the graph built from the collected dataset (proximity (closeness) centrality, betweenness centrality, PageRank).

In [2]:
#Vaccine debate 
# edges = pd.read_csv('../data/vaccine_debate/interactions_vaccine_debate.csv', index_col=0)
# factors_data = pd.read_csv('../data/vaccine_debate/temporal_indicators_vaccine_debate.csv', index_col=0)

#Ukraine conflict debate
edges = pd.read_csv('../data/ukraine_debate/interactions_ukraine_debate.csv', index_col=0)
factors_data = pd.read_csv('../data/ukraine_debate/temporal_indicators_ukraine_debate.csv', index_col=0)

In [3]:
#Create a list with the identifier of each standard user
retweeters = edges['Source'].unique().tolist()

In [4]:
#Create lists with identifier of elite users of each community
#Vaccine debate
# provax_usernames = edges[edges['Side']=='provax']['Target'].unique().tolist()
# antivax_usernames = edges[edges['Side']=='antivax']['Target'].unique().tolist()

#Ukraine conflict debate
prorussia_usernames = edges[edges['Side']=='prorussia']['Target'].unique().tolist()
proukraine_usernames = edges[edges['Side']=='proukraine']['Target'].unique().tolist()

In [5]:
#vaccine_usernames = provax_usernames + antivax_usernames
war_usernames = prorussia_usernames + proukraine_usernames

# Time frames definition

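Here we define the time frames on which we perform the temporal analysis. 
Time frames are defined by their length `w` (in weeks) and by the shift `c` (in weeks) between the starts of two consecutive time frames (sliding window). Consecutive time frames therefore share about `w - c` weeks; with `w = 4` and `c = 2`, the overlap is about 2 weeks. 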
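The sliding-window idea can be illustrated with a minimal sketch. It is a simplified variant, not the notebook's `get_timeframes`: the function name `sliding_windows`, the half-open `[start, end)` convention and the toy dates are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def sliding_windows(first, last, length_weeks, step_weeks):
    """Return (start, end) windows of `length_weeks`, shifted by `step_weeks`.

    Hypothetical helper: windows are clipped so that they never end after `last`.
    """
    windows = []
    start = first
    while start < last:
        end = min(start + timedelta(weeks=length_weeks), last)
        windows.append((start, end))
        start += timedelta(weeks=step_weeks)
    return windows

# Toy period of two months, 4-week windows shifted by 2 weeks
windows = sliding_windows(datetime(2022, 1, 1), datetime(2022, 3, 1), 4, 2)
for start, end in windows:
    print(start.date(), "->", end.date())
```

With a length of 4 weeks and a shift of 2 weeks, two consecutive windows share 2 weeks of data, which is what gives the analysis its temporal continuity.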

In [6]:
#The format of time is set up
edges['Timeset'] = pd.to_datetime(edges['Timeset'], infer_datetime_format=True)

## Periods

In [7]:
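def get_timeframes(w,c):
    #w: length of each time frame (in weeks); c: shift between the starts of two consecutive time frames (in weeks)
    start_date = pd.Timestamp('2022-01-01 00:00:00')
    end_date = pd.Timestamp('2022-01-01 00:00:00')
    last_date = pd.Timestamp('2022-07-31 23:59:39')
    list_periods = []
    while end_date < last_date:
        #Each time frame ends at 23:59:59 on the day reached after w weeks
        end_date = start_date + timedelta(weeks=w, hours=23, minutes=59, seconds=59)
        list_periods.append([start_date, end_date])
        if (w==c):
            #Without overlap, the next time frame starts the day after the previous one ends
            start_date = start_date + timedelta(weeks=c) + timedelta(days=1)
        else:
            start_date = start_date + timedelta(weeks=c)
    #The last time frame is truncated to the end of the studied period
    list_periods[-1][1] = last_date
    return list_periods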

In [8]:
#We define time frames of 4 weeks, with 2 weeks of overlap
periods = get_timeframes(4,2)

In [9]:
periods

[[Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-29 23:59:59')],
 [Timestamp('2022-01-15 00:00:00'), Timestamp('2022-02-12 23:59:59')],
 [Timestamp('2022-01-29 00:00:00'), Timestamp('2022-02-26 23:59:59')],
 [Timestamp('2022-02-12 00:00:00'), Timestamp('2022-03-12 23:59:59')],
 [Timestamp('2022-02-26 00:00:00'), Timestamp('2022-03-26 23:59:59')],
 [Timestamp('2022-03-12 00:00:00'), Timestamp('2022-04-09 23:59:59')],
 [Timestamp('2022-03-26 00:00:00'), Timestamp('2022-04-23 23:59:59')],
 [Timestamp('2022-04-09 00:00:00'), Timestamp('2022-05-07 23:59:59')],
 [Timestamp('2022-04-23 00:00:00'), Timestamp('2022-05-21 23:59:59')],
 [Timestamp('2022-05-07 00:00:00'), Timestamp('2022-06-04 23:59:59')],
 [Timestamp('2022-05-21 00:00:00'), Timestamp('2022-06-18 23:59:59')],
 [Timestamp('2022-06-04 00:00:00'), Timestamp('2022-07-02 23:59:59')],
 [Timestamp('2022-06-18 00:00:00'), Timestamp('2022-07-16 23:59:59')],
 [Timestamp('2022-07-02 00:00:00'), Timestamp('2022-07-30 23:59:59')],
 [Time

# Polarization factors computation on each time frame

Here, polarization factors (opinions and sources) are computed on each time frame. The characteristics of each factor are the same as those computed in the holistic analysis.

To compute these factors, it is first necessary to get distributions of interactions on each community and on each source for every time frame. Then, polarization factors can be computed. 
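As a rough illustration of what these distributions look like, the sketch below builds them for a toy retweet log. The users (`u1`, `u2`), the sources (`a`, `b`, `c`) and the community assignment are hypothetical; only the `crosstab` plus row-normalization pattern mirrors the function defined in this section.

```python
import pandas as pd

# Toy retweet log: one row per retweet (hypothetical users and sources)
toy_edges = pd.DataFrame({
    "Source": ["u1", "u1", "u1", "u2", "u2"],
    "Target": ["a", "a", "b", "b", "c"],
})
comm1_sources = ["a", "b"]  # assumed sources of community 1
comm2_sources = ["c"]       # assumed sources of community 2

# Number of retweets of each source by each user
counts = pd.crosstab(toy_edges["Source"], toy_edges["Target"])

# Distribution over sources: share of a user's retweets going to each source
distri_sources = counts.div(counts.sum(axis=1), axis=0)

# Distribution over communities: share of a user's retweets going to each community
distri_communities = pd.DataFrame({
    "comm1": counts[comm1_sources].sum(axis=1),
    "comm2": counts[comm2_sources].sum(axis=1),
}).div(counts.sum(axis=1), axis=0)

print(distri_sources)
print(distri_communities)
```

Here `u1` retweets only community 1 (distribution `[1.0, 0.0]`), whereas `u2` splits its retweets evenly between the two communities (`[0.5, 0.5]`).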

In [10]:
#Lists of ids of standard and elite users. 
standard_users = edges['Source'].unique().tolist()
elite_users = edges['Target'].unique().tolist()

In [11]:
#Function to get distributions of interactions on community and sources of each community.
def get_distributions(periods, comm_1, comm_2, comm_1_usernames, comm_2_usernames):
    df_distri_communities = pd.DataFrame(columns=standard_users)
    df_distri_comm1 = pd.DataFrame(columns=standard_users)
    df_distri_comm2 = pd.DataFrame(columns=standard_users)
    
    for i in tqdm(range(len(periods))):
        edges_period = edges[(edges['Timeset']>= periods[i][0]) & (edges['Timeset']<= periods[i][1])]
        
        df_interactions_period = pd.DataFrame(columns=elite_users, index=standard_users)
        df_interactions_period = pd.crosstab(edges_period['Source'], edges_period['Target']).reindex(index=standard_users, columns=elite_users, fill_value=0)
        
        df_interactions_comm1_period = df_interactions_period[comm_1_usernames]
        df_interactions_comm2_period = df_interactions_period[comm_2_usernames]


        distri_sources_period = df_interactions_period.div(df_interactions_period.sum(axis=1), axis=0)
        distri_sources_comm1_period = df_interactions_comm1_period.div(df_interactions_comm1_period.sum(axis=1), axis=0)
        distri_sources_comm2_period = df_interactions_comm2_period.div(df_interactions_comm2_period.sum(axis=1), axis=0)
        
        #distri_communities_period = pd.DataFrame({comm_1:distri_sources_comm1_period.sum(axis=1).to_list(), comm_2:distri_sources_comm2_period.sum(axis=1).to_list()}, index=standard_users)
        distri_communities_period = pd.DataFrame({comm_1: df_interactions_comm1_period.sum(axis=1)/df_interactions_period.sum(axis=1).to_list(), comm_2:df_interactions_comm2_period.sum(axis=1)/df_interactions_period.sum(axis=1).to_list()})

        for j in tqdm(range(len(standard_users))):
            df_distri_communities.loc[i,standard_users[j]] = distri_communities_period.loc[standard_users[j]].tolist()
            if df_distri_communities.loc[i,standard_users[j]] == [0.0, 0.0] :
                df_distri_communities.loc[i,standard_users[j]] = np.nan

            df_distri_comm1.loc[i,standard_users[j]] = distri_sources_comm1_period.loc[standard_users[j]].tolist() 

            df_distri_comm2.loc[i,standard_users[j]] = distri_sources_comm2_period.loc[standard_users[j]].tolist()
            
    return df_distri_communities, df_distri_comm1, df_distri_comm2

In [12]:
#distri_comm, distri_sources_comm1, distri_sources_comm2 = get_distributions(periods, 'provax', 'antivax', provax_usernames, antivax_usernames)
distri_comm, distri_sources_comm1, distri_sources_comm2 = get_distributions(periods, 'proukraine', 'prorussia', proukraine_usernames, prorussia_usernames)

  0%|          | 0/15 [00:00<?, ?it/s]

100%|██████████| 1000/1000 [00:00<00:00, 1837.28it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1554.22it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1700.21it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2013.23it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2204.32it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2249.47it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2029.75it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2210.64it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2169.40it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2180.31it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2218.96it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2133.03it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1382.01it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1437.10it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1360.18it/s]
100%|██████████| 15/15 [00:10<00:00,  1.46it/s]


In [13]:
distri_comm = distri_comm.astype(str)
distri_sources_comm1 = distri_sources_comm1.astype(str)
distri_sources_comm2 = distri_sources_comm2.astype(str)

## Pre-processing of data

As some users are inactive during certain time frames, the distributions are first processed to handle the resulting missing values. 
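A minimal sketch of the filling strategy used for the community distributions (backward fill, then forward fill) is given here on toy scalar values. The column names and values are made up; the notebook applies the same idea to string-encoded lists.

```python
import numpy as np
import pandas as pd

# Rows are time frames, columns are users (toy values)
toy = pd.DataFrame({
    "u1": [np.nan, 0.8, np.nan, 0.6],
    "u2": [0.2, np.nan, np.nan, np.nan],
})

# A gap first takes the next known value (backward fill);
# trailing gaps then take the last known value (forward fill)
filled = toy.bfill().ffill()
print(filled)
```

Backward filling first means that an inactive time frame inherits the behavior observed in the next active one; forward filling only handles users whose inactivity lasts until the end of the studied period.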

### Necessary functions

In [14]:
def normalized_entropy(distribution):
    return entropy(distribution, base=2)/np.log2(len(distribution))

In [15]:
def complete_nan(df):
    df = df.replace(regex=r'nan', value=np.nan)
    #Missing values take the next known value, then the last known value for trailing gaps
    df = df.bfill()
    df = df.ffill()
    return df

In [16]:
def complete_nan_by_zeros(df):
    df = df.replace(regex=r'nan', value=np.nan)
    df = df.fillna(value= '[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]')
    return df

In [17]:
#Convert the string representation of a list into a list of floats (other types are returned unchanged)
def convert_str_to_list(string):
    if isinstance(string, str): 
        return [float(i) for i in string.strip('][').split(', ')]
    else:
        return string

### Users selection

To avoid inferring too much data while retaining a significant number of users, users who are inactive in 20% or more of the time frames (3 or more of the 15 time frames) are removed from the dataset for the temporal analysis.
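The selection rule can be sketched on a toy table in which `NaN` marks an inactive time frame. The user names and the variable `max_inactive` are hypothetical; the threshold of 3 out of 15 time frames corresponds to the 20% mentioned above.

```python
import numpy as np
import pandas as pd

n_frames = 15
# Rows are time frames, columns are users; NaN marks an inactive time frame
toy = pd.DataFrame({
    "active": [0.5] * n_frames,
    "two_gaps": [np.nan] * 2 + [0.5] * 13,
    "three_gaps": [np.nan] * 3 + [0.5] * 12,
})

max_inactive = 3  # 3 of 15 time frames = 20%
inactive = toy.isna().sum()
kept = inactive[inactive < max_inactive].index.tolist()
print(kept)  # ['active', 'two_gaps']
```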

In [18]:
#Identification of users with fewer than 3 inactive time frames
sum_users = 0
list_users = []
for u in standard_users:
    if distri_comm[u].replace(regex=r'nan', value=np.nan).isna().sum() >= 3:
        sum_users = sum_users + 1
    else:
        list_users.append(u)

In [19]:
print('Number of conserved users:', len(list_users))

Number of conserved users: 685


In [20]:
distri_comm = distri_comm[list_users]
distri_sources_comm1 = distri_sources_comm1[list_users]
distri_sources_comm2 = distri_sources_comm2[list_users]

In [21]:
distri_comm = complete_nan(distri_comm)

In [22]:
distri_sources_comm1 = complete_nan_by_zeros(distri_sources_comm1)
distri_sources_comm2 = complete_nan_by_zeros(distri_sources_comm2)

In [23]:
distri_comm = distri_comm.applymap(convert_str_to_list)
distri_sources_comm1 = distri_sources_comm1.applymap(convert_str_to_list)
distri_sources_comm2 = distri_sources_comm2.applymap(convert_str_to_list)

Once users are selected and data are pre-processed, we compute cumulative distributions. For each time frame, the resulting distribution is the average of the previous cumulative distribution and the new one. 

These cumulative distributions retain a memory of user behavior over previous time frames. Because of the averaging, the weight of a past time frame is halved with each new time frame.
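The effect of this running average can be checked on a toy user who switches sides after the first time frame. The helper `running_average` is a hypothetical, simplified version of the loops in the next cell (it omits the special case in which the source distributions restart after an all-zero cumulative distribution).

```python
def running_average(distributions):
    """Average each new distribution with the previous cumulative one."""
    cumulative = [list(distributions[0])]
    for current in distributions[1:]:
        previous = cumulative[-1]
        cumulative.append([(p + c) / 2 for p, c in zip(previous, current)])
    return cumulative

# Toy user: fully in community 1 at first, then fully in community 2
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(running_average(frames))
# [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
```

The first time frame still contributes 25% of the cumulative distribution after two further time frames, so a change of behavior appears progressively rather than abruptly.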

In [24]:
cumul_distri_comm = pd.DataFrame(columns=distri_comm.columns, index=distri_comm.index)
cumul_distri_comm1 = pd.DataFrame(columns=distri_sources_comm1.columns, index=distri_sources_comm1.index)
cumul_distri_comm2 = pd.DataFrame(columns=distri_sources_comm2.columns, index=distri_sources_comm2.index)

for i in tqdm(range(cumul_distri_comm.shape[1])):
    for j in range(len(distri_comm)): 
        if j == 0:
            cumul_distri_comm.iloc[j,i] = distri_comm.iloc[j,i]
        else:
            cumul_distri_comm.iloc[j,i] = [sum(x)/2 for x in zip(cumul_distri_comm.iloc[j-1,i], distri_comm.iloc[j,i])]


for i in tqdm(range(distri_sources_comm1.shape[1])):
    for j in range(len(distri_sources_comm1)): 
        if j == 0:
            cumul_distri_comm1.iloc[j,i] = distri_sources_comm1.iloc[j,i]
        else:
            if cumul_distri_comm1.iloc[j-1,i] == [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]:
                cumul_distri_comm1.iloc[j,i] = distri_sources_comm1.iloc[j,i]
            else:
                cumul_distri_comm1.iloc[j,i] = [sum(x)/2 for x in zip(cumul_distri_comm1.iloc[j-1,i], distri_sources_comm1.iloc[j,i])]

for i in tqdm(range(distri_sources_comm2.shape[1])):
    for j in range(len(distri_sources_comm2)): 
        if j == 0:
            cumul_distri_comm2.iloc[j,i] = distri_sources_comm2.iloc[j,i]
        else:
            if cumul_distri_comm2.iloc[j-1,i] == [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]:
                cumul_distri_comm2.iloc[j,i] = distri_sources_comm2.iloc[j,i]
            else:
                cumul_distri_comm2.iloc[j,i] = [sum(x)/2 for x in zip(cumul_distri_comm2.iloc[j-1,i], distri_sources_comm2.iloc[j,i])]

100%|██████████| 685/685 [00:01<00:00, 602.50it/s]
100%|██████████| 685/685 [00:01<00:00, 594.11it/s]
100%|██████████| 685/685 [00:01<00:00, 576.93it/s]


Finally, polarization factors are computed based on distributions. 
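As a worked example of the opinion factor computed in the next cell (one minus the normalized entropy, signed by the dominant community), here is a dependency-free sketch. The function `opinion_factor` is a hypothetical name, and the entropy is re-implemented with `math` instead of `scipy.stats.entropy` to keep the example self-contained.

```python
import math

def normalized_entropy(distribution):
    """Shannon entropy in bits, divided by its maximum log2(n)."""
    total = sum(distribution)
    probs = [p / total for p in distribution if p > 0]
    return -sum(p * math.log2(p) for p in probs) / math.log2(len(distribution))

def opinion_factor(distribution):
    """1 - normalized entropy, positive if community 1 dominates, negative otherwise."""
    strength = 1 - normalized_entropy(distribution)
    return strength if distribution[0] >= distribution[1] else -strength

for d in ([1.0, 0.0], [0.5, 0.5], [0.25, 0.75], [0.0, 1.0]):
    print(d, round(opinion_factor(d), 3))
```

A user who only interacts with one community gets a factor of 1 or -1, a perfectly balanced user gets 0, and the `[0.25, 0.75]` user gets about -0.19, which shows that the factor grows slowly near the balanced case.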

In [25]:
#OPINIONS
H_comm_cumul = 1 - cumul_distri_comm.applymap(normalized_entropy)

#Sign according to the community within which the user makes more interactions
H_comm_cumul_oriented = H_comm_cumul.copy(deep=True)
for i in tqdm(range(cumul_distri_comm.shape[1])):
    for j in range(cumul_distri_comm.shape[0]):
        if cumul_distri_comm.iloc[j,i][0] >= cumul_distri_comm.iloc[j,i][1]:
             H_comm_cumul_oriented.iloc[j,i] = H_comm_cumul_oriented.iloc[j,i] #The sign is unchanged
        elif cumul_distri_comm.iloc[j,i][0] < cumul_distri_comm.iloc[j,i][1]:
             H_comm_cumul_oriented.iloc[j,i] = -H_comm_cumul_oriented.iloc[j,i] #The sign is flipped


#SOURCES
H_sources1_cumul = 1-cumul_distri_comm1.applymap(normalized_entropy)
H_sources2_cumul = 1-cumul_distri_comm2.applymap(normalized_entropy)

100%|██████████| 685/685 [00:01<00:00, 408.94it/s]
  pk = 1.0*pk / np.sum(pk, axis=axis, keepdims=True)


In [26]:
H_sources1_cumul = H_sources1_cumul.fillna(0)
H_sources2_cumul = H_sources2_cumul.fillna(0)

In [27]:
#The opinion factor, ranging in [-1, 1] is transformed to be in [0,1]
def transform(x):
    return (x+1)/2

In [28]:
H_comm_cumul_trans = H_comm_cumul_oriented.applymap(transform)

In [29]:
#Dataframes with polarization factors are saved in the results folder

# H_comm_cumul_oriented.to_csv('../results/temporal_analysis/factors_timeframes/vaccine_opinions.csv')
# H_sources1_cumul.to_csv('../results/temporal_analysis/factors_timeframes/vaccine_sources_C1.csv')
# H_sources2_cumul.to_csv('../results/temporal_analysis/factors_timeframes/vaccine_sources_C2.csv')

H_comm_cumul_oriented.to_csv('../results/temporal_analysis/factors_timeframes/ukraine_conflict_opinions.csv')
H_sources1_cumul.to_csv('../results/temporal_analysis/factors_timeframes/ukraine_conflict_sources_C1.csv')
H_sources2_cumul.to_csv('../results/temporal_analysis/factors_timeframes/ukraine_conflict_sources_C2.csv')

# Temporal analysis - polarization behaviors on each time frame

First, we study how polarization behaviors vary over each time frame, both in terms of number and characteristics. 

## Necessary functions

In [30]:
def optimal_k_means(X):
    #One row per value of k; the dataframe is built once at the end (DataFrame.append was removed in pandas 2.0)
    rows = []
    for k in range(2, 13):
        model = KMeans(n_clusters=k)
        model.fit_predict(X)
        db = davies_bouldin_score(X, model.labels_)
        silhouette_avg = silhouette_score(X, model.labels_)
        rows.append({'k':k, 'Silhouette_index': silhouette_avg, 'DaviesBouldin_index':db})
    res = pd.DataFrame(rows, columns=['k','Silhouette_index','DaviesBouldin_index'])
    return res

In [31]:
def apply_k_means(X, k_opt):
    model = KMeans(n_clusters=k_opt)
    model.fit_predict(X)

    cluster_labels = model.labels_
    centers = pd.DataFrame(data=model.cluster_centers_, columns=X.columns)
    clusters = pd.DataFrame(centers, columns = X.columns)

    labels_df = pd.DataFrame(model.labels_.tolist(), columns=['Cluster'])
    size_clusters = labels_df.groupby('Cluster').size()
    size_clusters = size_clusters.to_frame()
    size_clusters.columns=['size']
    sizes = size_clusters['size'].tolist()
    clusters['size'] = sizes
    cols = clusters.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    clusters = clusters[cols]
    clusters = clusters.round(2)
    clusters = clusters.T

    return cluster_labels, centers, clusters

In [32]:
def get_silhouette_plot(X, k_opt):
    model = KMeans(n_clusters=k_opt)
    model.fit_predict(X)
    DB = davies_bouldin_score(X, model.labels_)
    print('Davies Bouldin Index : ', DB)

    silhouette_avg = silhouette_score(X, model.labels_)
    print('Silhouette Index : ', silhouette_avg)

    # Create a figure with a single subplot for the silhouette plot
    fig, (ax1) = plt.subplots(1, 1)
    fig.set_size_inches(12, 6)

    # The silhouette coefficient can range from -1 to 1, but here the x-axis
    # is restricted to [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (model.n_clusters + 1) * 10])

    sample_silhouette_values = silhouette_samples(X, model.labels_)
    y_lower = 10
    for i in range (model.n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[model.labels_ == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / model.n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])


    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                "with n_clusters = %d" % model.n_clusters),
                fontsize=14, fontweight='bold')

In [33]:
def plot_clusters(X, x, y, z, c):
    fig = plt.figure(figsize=(10,10))
    ax = plt.axes(projection='3d')
    # Transparent panes (the w_xaxis/w_yaxis/w_zaxis aliases are no longer available in recent matplotlib versions)
    ax.xaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))
    ax.yaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))
    ax.zaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))

    ax.scatter(X[x], X[y], X[z], c=X[c], s= 120, cmap='tab10')

    ax.set_xlabel(x, fontsize='15', labelpad=10)
    ax.set_ylabel(y, fontsize='15', labelpad=10)
    ax.set_zlabel(z, fontsize='15', labelpad=10)

## Identification of behavioral classes

### Tuning of the number of clusters $k$

Here we tune the number of clusters for each time frame, based on the Silhouette index and Davies-Bouldin index.
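The selection principle (keep the value of $k$ with the highest Silhouette index, and report the corresponding Davies-Bouldin index) can be sketched on synthetic data. The three well-separated blobs, the range of $k$ and the fixed `random_state` are assumptions made so that the example is reproducible; they are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data: three tight, well-separated groups of 50 points
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in true_centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher Silhouette is better, lower Davies-Bouldin is better
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

On such clearly separated groups, the Silhouette index peaks at the true number of clusters; on real data the maximum can be much flatter, so it is worth inspecting the full table of scores.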

In [34]:
#Vaccine
#a = 1/2

#Ukraine
a = 1/3

def f_sigmoid_optim(x):
    return(x**(a)/(x**(a)+(1-x)**(a)))
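To see what this transformation does to values in $[0, 1]$, the sketch below evaluates the same formula with the exponent passed as an explicit argument. The name `power_sigmoid` is hypothetical; the formula is the one of `f_sigmoid_optim`.

```python
def power_sigmoid(x, a):
    """x**a / (x**a + (1 - x)**a), defined for x in [0, 1] and a > 0."""
    return x**a / (x**a + (1 - x)**a)

for x in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(x, round(power_sigmoid(x, 1 / 3), 3))
```

The end points 0 and 1 and the midpoint 0.5 are fixed, and the function is symmetric around 0.5. With an exponent below 1 (here 1/3), values close to 0 or 1 are spread apart (0.1 is mapped to about 0.32), while values around 0.5 are pulled closer together.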

In [35]:
def periodic_optimal_kmeans(df_opinions, df_sources_C1, df_sources_C2, periods_df, alpha):
    optim_clustering_periods = pd.DataFrame(columns=df_opinions.index, index=['k', 'Silhouette', 'Davies-Bouldin'])
    df_opinions = df_opinions.apply(f_sigmoid_optim)
    df_sources_C1 = df_sources_C1.apply(f_sigmoid_optim)
    df_sources_C2 = df_sources_C2.apply(f_sigmoid_optim)

    for p in tqdm(range(len(periods_df))):
        X = pd.DataFrame(columns=['x', 'y_C1', 'y_C2'])
        X['x'] = alpha*df_opinions.iloc[p,:]
        X['y_C1'] = ((1-alpha)/2)*df_sources_C1.iloc[p,:]
        X['y_C2'] = ((1-alpha)/2)*df_sources_C2.iloc[p, :]
        res = optimal_k_means(X)
             
        optim_clustering_periods.loc['k',p] = res[res['Silhouette_index'] == res['Silhouette_index'].max()]['k'].values[0]
        optim_clustering_periods.loc['Silhouette',p] = res[res['Silhouette_index'] == res['Silhouette_index'].max()]['Silhouette_index'].values[0]
        optim_clustering_periods.loc['Davies-Bouldin', p] = res[res['Silhouette_index'] == res['Silhouette_index'].max()]['DaviesBouldin_index'].values[0]  

    return optim_clustering_periods

In [36]:
optim_k_means_periods = periodic_optimal_kmeans(H_comm_cumul_trans, H_sources1_cumul, H_sources2_cumul, periods, alpha=0.6)


In [37]:
optim_k_means_periods

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
k,4.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
Silhouette,0.850687,0.857264,0.858188,0.856534,0.855888,0.855337,0.856391,0.858009,0.853898,0.85162,0.853102,0.852546,0.852067,0.852989,0.856094
Davies-Bouldin,0.317456,0.321213,0.346978,0.199934,0.206543,0.206607,0.199637,0.200749,0.206669,0.204889,0.205119,0.209962,0.214882,0.214394,0.213646


### Clustering 

Here we apply the clustering algorithm with the optimal value of $k$ for each time frame, and save the results.
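Since k-means assigns cluster labels arbitrarily, labels are only comparable across time frames once clusters are re-ordered in a consistent way. A minimal sketch of this re-ordering, sorting clusters by decreasing center coordinates, is given here; the helper `relabel_by_center` and the toy centers are hypothetical.

```python
def relabel_by_center(labels, centers):
    """Rename clusters so that label 0 has the largest center, label 1 the next, etc."""
    # Cluster ids sorted by decreasing center (first coordinate first)
    order = sorted(range(len(centers)), key=lambda i: tuple(centers[i]), reverse=True)
    rank_of = {old: new for new, old in enumerate(order)}
    return [rank_of[label] for label in labels]

# Toy centers (x, y) for three clusters with arbitrary ids 0, 1, 2
centers = [[0.0, 0.1], [0.6, 0.1], [0.3, 0.2]]
labels = [0, 1, 2, 1, 0]
print(relabel_by_center(labels, centers))  # [2, 0, 1, 0, 2]
```

After this step, cluster 0 always designates the cluster with the highest opinion coordinate, whatever ids k-means produced for a given time frame.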

In [38]:
def periodic_kmeans(df_opinions, df_sources_C1, df_sources_C2, periods_df, alpha, debate):
    clusters_periods = pd.DataFrame(index=['size','x', 'y_C1', 'y_C2'])
    clusters_users = pd.DataFrame(index=df_opinions.index, columns=df_opinions.columns)
    clusters_users_names = pd.DataFrame(index=df_opinions.index, columns=df_opinions.columns)

    df_opinions = df_opinions.apply(f_sigmoid_optim)
    df_sources_C1 = df_sources_C1.apply(f_sigmoid_optim)
    df_sources_C2 = df_sources_C2.apply(f_sigmoid_optim)

    for p in tqdm(range(len(periods_df))):
        X = pd.DataFrame(columns=['x', 'y_C1', 'y_C2'])

        X['x'] = alpha*(df_opinions.iloc[p,:])
        X['y_C1'] = ((1-alpha)/2)*(df_sources_C1.iloc[p,:])
        X['y_C2'] = ((1-alpha)/2)*(df_sources_C2.iloc[p, :])

        res = optimal_k_means(X)
        k_opt = int(res[res['Silhouette_index'] == res['Silhouette_index'].max()]['k'].values[0])


        labels, centers, clusters = apply_k_means(X, k_opt)

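        sort_clusters = clusters.sort_values(by=['x','y_C1','y_C2'], ascending=False,axis=1)
        #Map each original cluster label to its rank once clusters are sorted by decreasing x, y_C1, y_C2
        label_to_rank = {}
        for i in range(k_opt):
            label_to_rank[sort_clusters.columns[i]]=clusters.columns[i]
        labels_map = [label_to_rank[i] for i in labels]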

        clusters_users.loc[p, :] = labels_map
        clusters_final = clusters.sort_values(by=['x','y_C1','y_C2'], ascending=False, ignore_index=True,axis=1)

        if debate == 'ukraine':
            for i in range(clusters_final.shape[1]):
                x_value = clusters_final.loc['x',i]
                if abs(0.6 - x_value) <= 0.05:
                    clusters_final.loc['name',i] = 'proU'
                elif abs(0.0 - x_value) <= 0.05:
                    clusters_final.loc['name',i] = 'proR'
                elif (0.25 <= x_value <= 0.35):
                    clusters_final.loc['name',i] = 'inter'
                elif (abs(0.6 - x_value) >= 0.05) and (x_value > 0.35):
                    clusters_final.loc['name',i] = 'interU'
                elif (abs(0.0 - x_value) >= 0.05) and (x_value < 0.25):
                    clusters_final.loc['name',i] = 'interR'

        elif debate == 'vaccine':
            for i in range(clusters_final.shape[1]):
                x_value = clusters_final.loc['x',i]
                if abs(0.6 - x_value) <= 0.05:
                    clusters_final.loc['name',i] = 'pro'
                elif abs(0.0 - x_value) <= 0.05:
                    clusters_final.loc['name',i] = 'anti'
                elif (0.25 <= x_value <= 0.35):
                    clusters_final.loc['name',i] = 'inter'
                elif (abs(0.6 - x_value) >= 0.05) and (x_value > 0.35):
                    clusters_final.loc['name',i] = 'interPro'
                elif (abs(0.0 - x_value) >= 0.05) and (x_value < 0.25):
                    clusters_final.loc['name',i] = 'interAnti'   
        
        map_names = {}
        for i in range(clusters_final.shape[1]):
            map_names[i] = clusters_final.loc['name',i]
        
        labels_names = [map_names[i] for i in labels_map]
        clusters_users_names.loc[p, :] = labels_names

        for k in range(k_opt):
            name = 'P'+str(p)+'_C'+str(k)
            clusters_periods[name]=clusters_final.loc[:,k]

    return clusters_periods, clusters_users, clusters_users_names

In [39]:
#k_means_periods, users_clusters_periods, users_clusters_periods_names = periodic_kmeans(H_comm_cumul_trans, H_sources1_cumul, H_sources2_cumul, periods, 0.6, 'vaccine')
k_means_periods, users_clusters_periods, users_clusters_periods_names = periodic_kmeans(H_comm_cumul_trans, H_sources1_cumul, H_sources2_cumul, periods, 0.6, 'ukraine')



In [40]:
k_means_periods.head()

Unnamed: 0,P0_C0,P0_C1,P0_C2,P0_C3,P1_C0,P1_C1,P1_C2,P1_C3,P2_C0,P2_C1,...,P10_C0,P10_C1,P11_C0,P11_C1,P12_C0,P12_C1,P13_C0,P13_C1,P14_C0,P14_C1
size,219.0,17.0,88.0,361.0,218.0,18.0,100.0,349.0,218.0,18.0,...,228.0,457.0,229.0,456.0,230.0,455.0,230.0,455.0,231.0,454.0
x,0.6,0.37,0.1,0.0,0.6,0.37,0.08,-0.0,0.6,0.37,...,0.59,0.01,0.59,0.01,0.59,0.01,0.59,0.01,0.59,0.01
y_C1,0.11,0.12,0.19,0.0,0.1,0.12,0.19,0.0,0.1,0.12,...,0.12,0.05,0.11,0.06,0.11,0.06,0.11,0.06,0.12,0.06
y_C2,0.0,0.17,0.09,0.1,0.0,0.17,0.09,0.1,-0.0,0.17,...,0.01,0.11,0.01,0.11,0.01,0.11,0.01,0.11,0.02,0.11


In [41]:
#The dataframes with each user's cluster and cluster name for each time frame are saved in the results folder

# users_clusters_periods.to_csv('../results/temporal_analysis/vaccine_clusters.csv')
# users_clusters_periods_names.to_csv('../results/temporal_analysis/vaccine_clusters_labels.csv')

users_clusters_periods.to_csv('../results/temporal_analysis/ukraine_conflict_clusters.csv')
users_clusters_periods_names.to_csv('../results/temporal_analysis/ukraine_conflict_clusters_labels.csv')