# V User Vectors

## Table of Contents

1. [Loading Data and Libraries](#loading-dependencies)
2. [Proximity Prestige & Degree Centrality](#prestige)
3. [Comment Quality](#quality)
4. [Save Results](#save)


## Loading Data and Libraries 
<a class="anchor" id="loading-dependencies"></a>

In [1]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

df_c             = pd.read_parquet('Comments.parquet')
Explicit_links   = pd.read_parquet('explicit_links.parquet')          
vector_values_df = pd.read_parquet('Opinion_rank_scores.parquet')

## Proximity Prestige & Degree Centrality 
<a class="anchor" id="prestige"></a>

A graph user network is constructed for each article. While the graph is in memory,  
Proximity Prestige and Degree Centrality are computed for each user in the graph.
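To make the Proximity Prestige formula concrete, here is a minimal, dependency-free sketch on a toy graph. It follows outgoing edges from each node, as `calculate_proximity_prestige` does with `nx.single_source_shortest_path_length`. The adjacency dictionary `toy_graph` and the helper names `bfs_distances` and `proximity_prestige` are assumptions made for illustration and are not part of the notebook.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` to every node reachable via outgoing edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def proximity_prestige(adj):
    """(share of other nodes reached) / (average distance to them), 0 if none are reached."""
    n = len(adj)
    scores = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        reachable = len(dist) - 1  # exclude the node itself
        if reachable > 0:
            average_distance = sum(dist.values()) / reachable
            scores[node] = (reachable / (n - 1)) / average_distance
        else:
            scores[node] = 0.0
    return scores


# Hypothetical toy graph: a -> b -> c, with d isolated
toy_graph = {"a": ["b"], "b": ["c"], "c": [], "d": []}
toy_scores = proximity_prestige(toy_graph)
print(toy_scores)
```

In this toy graph, `a` reaches two of the three other nodes at an average distance of 1.5, giving (2/3) / 1.5 = 4/9, while `c` and `d` reach no one and score 0.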

In [5]:
'''
- Iterates over each article
- Builds a Graph User Network for each article
- Computes Degree Centrality & Proximity Prestige for each user
- Maps DC & PP back to the users in vector_values_df
'''

def calculate_proximity_prestige(G):
    pp_score = {}
    N = G.number_of_nodes()

    for i in G.nodes():
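        # Shortest-path lengths from i to every node reachable along outgoing edges (i itself has distance 0)
        reachable_nodes = nx.single_source_shortest_path_length(G, i)
        reachable_count = len(reachable_nodes) - 1  # excluding the node itself

        if reachable_count > 0:
            # The 0 entry for i adds nothing, so this is the total distance from i to the nodes it reaches
            total_distance = sum(reachable_nodes.values())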
            average_distance = total_distance / reachable_count
            pp_score[i] = (reachable_count / (N - 1)) / average_distance
        else:
            pp_score[i] = 0

    return pp_score
    
vector_values_df['degree_centrality'] = pd.Series(dtype=float)
vector_values_df['proximity_prestige'] = pd.Series(dtype=float)

for article in tqdm(vector_values_df.articleID.unique()):

    # Edges of this article's user graph; rows with missing values are dropped
    df = Explicit_links[Explicit_links.articleID == article]
    df = df.dropna()

    G = nx.from_pandas_edgelist(df, 'user_ID_a', 'user_ID_b', create_using=nx.DiGraph())

    degree_centrality = nx.out_degree_centrality(G)
    proximity_prestige = calculate_proximity_prestige(G)

    # Map both scores back to this article's rows; users absent from the graph stay NaN
    article_mask = vector_values_df.articleID == article
    vector_values_df.loc[article_mask, "degree_centrality"] = vector_values_df.loc[article_mask, "userID"].map(degree_centrality)
    vector_values_df.loc[article_mask, "proximity_prestige"] = vector_values_df.loc[article_mask, "userID"].map(proximity_prestige)

100%|██████████████████████████████████████████████████████████████████████████| 16787/16787 [2:20:02<00:00,  2.00it/s]


## Comment Quality
<a class="anchor" id="quality"></a>

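The comment_quality combines the previously computed Opinion_rank_score_sum with two per-user, per-article metrics: the user's comment count and the user's average comment length relative to the largest such average on the same article.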
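As a worked example of that formula, the sketch below computes the score for a single user on a single article. The function name `comment_quality` and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def comment_quality(score_sum, comment_count, avg_length, max_avg_length):
    """Mean opinion-rank score per comment, scaled by relative average comment length."""
    mean_score_per_comment = score_sum / comment_count
    relative_length = avg_length / max_avg_length
    return mean_score_per_comment * relative_length


# Hypothetical user: 3 comments with a summed score of 6.0 and an average length of
# 200 characters, on an article whose longest per-user average is 400 characters
example_quality = comment_quality(score_sum=6.0, comment_count=3,
                                  avg_length=200, max_avg_length=400)
print(example_quality)  # (6.0 / 3) * (200 / 400) = 1.0
```

A user whose average comment length equals the article maximum keeps their full mean score, while shorter average comments scale it down proportionally.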

In [6]:
"""
- Reads in df_c and groups by article and user.
- Counts each user's comment count per article.
- Calculates the average length of each user's comments per article.
- Identifies the maximum average comment length across all users per article.
- Merges these metrics back into the vector_values_df for further analysis.
"""

# Character length of every comment; computed before grouping so the groupby sees the column
df_c['CommentLength'] = df_c['commentBody'].str.len()
grouped_df = df_c.groupby(['articleID', 'userID'])

# Number of comments each user wrote per article
number_of_comments_per_user = grouped_df.size().reset_index(name='comment_count')

# Average comment length per user and article
average_length_per_user = grouped_df['CommentLength'].mean()
average_length_user_comment = average_length_per_user.reset_index(name='average_comment_lenght')

# Largest per-user average within each article, used to normalise lengths
max_length_per_article = average_length_user_comment.groupby('articleID')['average_comment_lenght'].max()
max_length_per_article = max_length_per_article.reset_index(name='max_comment_length')

vector_values_df = pd.merge(vector_values_df, number_of_comments_per_user, on=["articleID", "userID"], how="left")
vector_values_df = pd.merge(vector_values_df, average_length_user_comment, on=["articleID", "userID"], how="left")
vector_values_df = pd.merge(vector_values_df, max_length_per_article, on=["articleID"], how="left")

# Mean opinion-rank score per comment, scaled by relative average comment length
vector_values_df['comment_quality'] = ((vector_values_df['Opinion_rank_score_sum']/vector_values_df['comment_count'])*
                                       (vector_values_df['average_comment_lenght']/vector_values_df['max_comment_length']))
vector_values_df.head()

## Save Results
<a class="anchor" id="save"></a>

The dataframe is exported as a Parquet file. NaN values are set to zero rather than dropped, so entries that are isolated in the graph user network are retained together with their Opinion_rank_score_sum, which may still reflect indirect influence.
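The difference between filling and dropping can be seen on a small example. This is a sketch with made-up rows: the frame `toy_vectors` and its values are hypothetical, and only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical rows: user "u3" never appears in the graph, so its graph metrics are NaN
toy_vectors = pd.DataFrame({
    "userID": ["u1", "u2", "u3"],
    "Opinion_rank_score_sum": [1.5, 0.7, 2.1],
    "degree_centrality": [0.5, 0.5, float("nan")],
    "proximity_prestige": [0.5, 0.0, float("nan")],
})

dropped = toy_vectors.dropna()  # removes u3 together with its score
filled = toy_vectors.fillna(0)  # keeps u3, with graph metrics set to 0

print(len(dropped), len(filled))
print(filled.loc[filled.userID == "u3", "Opinion_rank_score_sum"].iloc[0])
```

With `fillna(0)` the isolated user stays in the table and keeps its Opinion_rank_score_sum, whereas `dropna()` would discard that row entirely.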

In [None]:
'''
We set NaN values to zero, which means that we do not drop any comments that are isolated in the GUN.
Dropping them would mean that we lose the Opinion_rank_score_sum, which would still carry information in
case the comment had influenced others indirectly.

- Exports the filtered dataframe to a Parquet file.
'''
df = vector_values_df.fillna(0)
df.to_parquet('user_Vectors.parquet')