# Exploring StackOverflow!

In [326]:
import networkx as nx
from datetime import datetime as dt

## Populating the Graphs

Each graph is build independantly from the provided `.txt` files of temporal network of interactions. \
Users are represented as nodes and answers\comments as edges.

The design choices of the following method are:
- Using undirected graphs for robustness and efficacy of the model. 
- *Simple* graphs, there are no loops in the graphs. Users who answer to themselves are discarded cases.
- Only one attributes is assigned to the edges: weight. The weights are:
    - `1` for answers to questions
    - `2/3` for comments on questions
    - `1/2` for comments to answers
- Time resolution is one day.
- The graphs are build given a time input to avoid such attribute for the sake of simplicity and robustess.
---
The following function is populates a graph:

In [351]:
def PopulateGraph(File, TimeInterval):
   """
   Input: 
      File: txt file of temporal network of interactions
      TimeInterval: a tuple of two dates in ISO format representing an interval of time, e.g. ("2015-12-04", "2016-12-04") 
   
   Output:
      A graph corresponding to the input request   
   """
   
   TimeInterval = tuple(map(dt.fromisoformat, TimeInterval)) # converting time interval into datetime format
   TimeInterval = tuple(map(dt.timestamp, TimeInterval)) # converting time interval into POSIX timestamp 
   StartDate = str(int(TimeInterval[0])) #converting to string to compare with the txt
   EndDate = str(int(TimeInterval[1]))
   
   WeightDict = {"sx-stackoverflow-a2q.txt": 1, "sx-stackoverflow-c2q.txt": 2/3, "sx-stackoverflow-c2a.txt": 1/3} # assigning the 
                                                                                                                  # weights to 
                                                                                                                  # interactions
   WeightValue = WeightDict[File] # picking the corresponding weight value
   
   Graph = nx.Graph() # initializing the graph
   
   with open(File) as f: # opening the file
      while f.readline().split()[-1] < StartDate: # discarding lines with dates before the start date
         pass
      while f.readline().split()[-1] < EndDate: # reading lines with dates until the end date
         Line = f.readline() # reading lines one at each step form the file
         Line = tuple(map(int, Line.split())) # splitting the line to distinguish it's contents and converting string to integers
         Edge = Line[:2] # the first two elements are nodes, adding the edge to graph automatically adds the nodes
         if Edge[0] != Edge[1]: # discarding loops checking if the nodes coincide, i.e. the user answered to himself
            if Graph.has_edge(*Edge): # checking if the edge already exists
               Graph[Edge[0]][Edge[1]]['weight'] += WeightValue # updating the weight
            else: # otherwise...
               Graph.add_edge(*Edge, time = TimeValue, weight = WeightValue) # ...adding the edge to the graph
                                                                  
               
   return(Graph)

In [366]:
a2qGraph = PopulateGraph("sx-stackoverflow-a2q.txt", ("2015-01-01", "2016-01-01"))

### Merging the graphs

In [199]:
#TODO
MergedGraphs = nx.Graph()

## Implementation of the backend

### Functionality 1 - Get the overall features of the graph

Graph density/sprcity is computed as defined in *"Introduction to Algorithms" by Cormern, Leiserson, Rivest, and Stein*: \
A graph $G = (E, V)$, with $E$ denoting the edges and $V$ denoting the vertices,  is sparse if $|E| << |V|^2$ and dense if $|E|$ is close to $|V|$. \
And so if $|E|$ differs an order of magnitude from $|V|$ the graph is considered sparse, otherwise it is dense.
The density degree expression is:

\begin{equation}
2\frac{|V|}{|E|(|E| - 1)} \approx 2\frac{|V|}{|E|^2}
\end{equation}

In [372]:
def F1_OverallFeatures(Graph):
   """
   Input: One of the 3 graphs
   
   Output:
      Whether the graph is directed or not
      Number of users
      Number of answers/comments
      Average number of links per user
      Density degree of the graph
      Whether the graph is sparse or dense
   """

   Directed = Graph.is_directed() # quering if the graph is directed
   if Directed: # using a variable to print the output
      IsDirectedPrint = 'directed'
   else:
      IsDirectedPrint = 'undirected'
   
   NofUsers = Graph.number_of_nodes() # computing the number of users
   NofInteractions = Graph.number_of_edges() # computing the number of interactions
   
   DegreeDict = dict(a2qGraph.degree()) # a dictionary with keys -> nodes, values -> the number of edges incident 
                                        # to the node a.k.a degree
   AvgUserLinks = sum(DegreeDict.values()) / len(DegreeDict) # computing the avarege number of links per user
   
   DensityDegree = 2 * NofInteractions / NofUsers**2 # computing density degree
   
   if NofUsers / NofInteractions**2 < 10: # evaluating sparsity/density
      SparseDense = 'sparse'
   else:
      SparseDense = 'dense'
   
   print(f"The input graph is {IsDirectedPrint}\n\
Number of users: {NofUsers}\n\
Number of answers/comments: {NofInteractions}\n\
Average number of links per user: {AvgUserLinks:.2}\n\
Density degree: {DensityDegree: .2}\n\
The graph is {SparseDense}")

In [373]:
F1_OverallFeatures(a2qGraph)

The input graph is undirected
Number of users: 714167
Number of answers/comments: 1429455
Average number of links per user: 4.0
Density degree:  5.6e-06
The graph is sparse


### Functionality 2 - Find the best users!

#### Degree centrality
As defined in *Introduction to Graph Concepts for Data Science of Aris Anagnostopoulos*, the normalized degree centrality of node $v$ is:

\begin{equation}
\frac{d_v}{|V|-1} \approx \frac{d_v}{|V|}
\end{equation}

with $d_v$ the degree of the node $v$, i.e. the number of edges incident to the node.

In [206]:
def DegreeCentrality(Graph, Node):
   return Graph.degree() / Graph.number_of_edges()

#### Closeness centrality
As defined in *Introduction to Graph Concepts for Data Science of Aris Anagnostopoulos*, the normalized closeness centrality of node $v$ is:


\begin{equation}
\frac{|V|-1}{\sum_{u \epsilon V} d(v,u)} \approx \frac{|V|}{\sum_{u \epsilon V} d(v,u)}
\end{equation}

with $d(v,u)$ the distance between nodes $v$ and $u$, that is the length of a shortest path between $v$ and $u$.\
As a design choice $d(v,u)$ is taken as the inverse of the weight of the edge of $(v,u)$, this leads to an inverse relationship between interaction and distances: the more a user posts, the closer he gets to the comunity.  

In [183]:
def ClosenessCentrality(Graph, Node):
   
   SummedDistances = Graph[Node]
   
   return Graph.number_of_nodes() / SummedDistances

#### Page rank

In [181]:
def PageRank(Graph, Node):
   pass

#### Betweeness

As defined in *Introduction to Graph Concepts for Data Science of Aris Anagnostopoulos*, the Betweeness centrality of node $v$ is:

\begin{equation}
\sum_{u, w \epsilon V \backslash \{v\}} \frac{g_{u w}^v}{g_{uw}} \frac{2}{|V|^2 - 3|V| + 2} 
\end{equation}


with $g_{uw}$ the shortest path that connects nodes $u$ and $w$ and $g_{u w}^v$ the set of those shortest paths between u and w that contain node $v$.

In [195]:
def Betweeness(Graph, Node):
   

In [201]:
def F2_BestUser(Node, TimeInterval, Metric, Graph = MergedGraphs):
   """
   Input:
   A user/node
   A tuple of two dates in ISO format representing an interval of time, e.g. (2015-12-04, 2016-12-04) 
   An integer corresponding to the following metrics: 
      1 -> Betweeness 
      2 -> PageRank
      3 -> ClosenessCentrality 
      4 -> DegreeCentrality
   The graph on which to perform the analysis.
   
   Output:
   The value of the given metric applied over the complete graph for the given interval of time.
   """
   
   TimeInterval = tuple(map(dt.fromisoformat, TimeInterval)) # converting time interval into datetime format
   IntToMetric = {1: Betweeness, 2: PageRank, 3: ClosenessCentrality, 4: DegreeCentrality} # dictionary associating the
                                                                                           # input integer to the
                                                                                           # corresponding metric function
    
   TimeIntervalGraph = Graph.copy() # creating a copy on which to perform the time filtering
   TimeIntervalGraph.remove_edges_from([(n1, n2) for n1, n2, time in TimeIntervalGraph.edges(data="time") 
                                        if (time >= TimeInterval[0] & time <= TimeInterval[1])]) # removing the edges that not belong
                                                                                                 # to the input time interval  
   
   return IntToMetric[Metric](TimeIntervalGraph, Node)