### Things to understand:
- What is the density of the graph? Is the graph sparse or dense?
- Handle the case where the graph is not connected


- Maybe discard unanswered questions?

A 2013 study found that 75% of users ask only one question, 65% answer only one question, and only 8% of users answer more than 5 questions.

# Exploring StackOverflow!

In [15]:
import networkx as nx
from datetime import date as dt
import matplotlib.pyplot as plt

Global variables:

In [2]:
N_LINES_TO_READ = 100000

## Populating the Graphs

Each graph is built independently from the provided `.txt` file of its temporal network of interactions. \
Users are represented as nodes and answers/comments as edges.

The design choices of the following method are:
- Undirected graphs. The order of the nodes in each edge pair is not relevant for the following analysis, and keeping it would make the model less robust.
- *Simple* graphs: there are no loops in the graphs. Cases where a user answers themselves are discarded.
- Two attributes are assigned to the edges: time and weight. The weights are:
    - `1` for answers to questions
    - `2/3` for comments on questions
    - `1/2` for comments on answers
- Time resolution is one day.
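
The design choices above can be sketched as a small line parser. This is only an illustration, not the notebook's loader: the helper name `parse_interaction` and the keys `'c2q'` and `'c2a'` are hypothetical, and the timestamp is converted in UTC (an assumption made here so that the resulting day does not depend on the machine's time zone).

```python
from datetime import date, datetime, timezone

# Weights from the list above; the keys other than 'a2q' are hypothetical labels.
EDGE_WEIGHTS = {'a2q': 1, 'c2q': 2 / 3, 'c2a': 1 / 2}

def parse_interaction(line, kind):
    """Turn one 'source target timestamp' line into (u, v, attributes).

    Returns None for self-loops, so the resulting graph stays simple.
    """
    u, v, timestamp = map(int, line.split())
    if u == v:  # a user interacting with themselves is discarded
        return None
    # Day resolution: keep only the calendar date (UTC is an assumption).
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    return u, v, {'time': day, 'weight': EDGE_WEIGHTS[kind]}

# Illustrative sample line, not read from the data files.
print(parse_interaction('9 8 1217567877', 'a2q'))
```

The returned triple can be passed to an undirected graph as `Graph.add_edge(u, v, **attributes)`.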

### The `a2q` graph: answers of questions as edges

In [84]:
with open('sx-stackoverflow-a2q.txt') as f: # opening the file
    a2qGraph = nx.Graph() # initializing the graph
    for i in range(0, N_LINES_TO_READ): # only a part of the dataset is used
        Line = f.readline() # reading one line from the file at each step
        if not Line: # stopping early if the file has fewer lines than N_LINES_TO_READ
            break
        Line = tuple(map(int, Line.split())) # splitting the line into its fields and converting the strings to integers
        Edge = Line[:2] # the first two elements are nodes, adding the edge to graph automatically adds the nodes
        if Edge[0] != Edge[1]: # discarding loops by checking if the nodes coincide, i.e. the user answered themselves
            TimeValue = Line[2]  # the third element is the time as a Unix timestamp
            TimeValue = dt.fromtimestamp(TimeValue) # converting to a date (one-day resolution)

            a2qGraph.add_edge(*Edge, time = TimeValue, weight = 1) # adding the edge to the graph

## Implementation of the backend

### Functionality 1 - Get the overall features of the graph

In [None]:
def F1OverallFeatures(InputGraph):
   """
   Input: One of the 3 graphs
   
   Output:
      Whether the graph is directed or not
      Number of users
      Number of answers/comments
      Average number of links per user
      Density degree of the graph
      Whether the graph is sparse or dense
   """

   Directed = InputGraph.is_directed() # querying if the graph is directed
   if Directed: # using a variable to print the output
      OutputString = 'Directed'
   else:
      OutputString = 'Undirected'
   
   NofUsers = InputGraph.number_of_nodes() # computing the number of users
   NofInteractions = InputGraph.number_of_edges() # computing the number of interactions
   
   DegreeDict = dict(InputGraph.degree()) # a dictionary with keys -> nodes, values -> the number of edges incident 
                                          # to the node a.k.a degree (taken from the input graph, not the global a2qGraph)
   AvgUserLinks = sum(DegreeDict.values()) / len(DegreeDict) if DegreeDict else 0 # guarding against an empty graph
   
   
   
   print(f"The input graph is {OutputString}\n\
   Number of users: {NofUsers}\n\
   Number of answers/comments: {NofInteractions}\n\
   Average number of links per user: {AvgUserLinks}\n\
   ")