# Homework 5 - Exploring StackOverflow!

![Alt Text](https://www.linuxadictos.com/wp-content/uploads/stack-overflow-1024x244.jpg.webp)

## 1. Data

The first step, as always, is to download the data you will be working on. You can download the data to build the system [here](https://snap.stanford.edu/data/sx-stackoverflow.html). Please download the **3 files** listed under the descriptions 'Answers to questions', 'Comments to questions' and 'Comments to answers'.
  
  In particular, each file will contain the following information:
  * __Answers to questions__ - User u answered user v's question at time t
  * __Comments to questions__ - User u commented on user v's question at time t
  * __Comments to answers__  - User u commented on user v's answer at time t

Unless specified otherwise, we will handle the 3 graphs together, so as a first step please think about a sensible way to merge them. You are free to merge them as you prefer, but we do expect the output graph to be a weighted graph. For instance, imagine that user X has answered a question from user Y and has also commented on one of Y's posts. In the combined graph we expect a weighted link between these two users. How you construct this weight is fully up to you :). If an algorithm we ask you to implement does not have a weighted variant, please mention it clearly and convert the weighted graph into an unweighted one.

Some recommendations that might be helpful for dealing with the data:

 - The date is provided with very high precision; please round it to a reasonable value (e.g. day, hour, whatever you feel makes more sense).
 - You might also see that there are several answers/comments that users make to themselves; please deal with these accordingly and explain what you have decided to do.
 - We are aware that the data is a lot. For this reason we typically ask you to focus only on a smaller interval of time. Please test all your implementations on a sufficiently large interval of time, and use this to your benefit to get the best possible results.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Solution 1

### Preprocessing of the txt files

#### Timestamps
The data need to be cleaned. Starting with the timestamps, I performed two operations:

1) The datasets are huge and the timestamps are too precise, so I decided to consider only time **windows of at least one day**.
    To do this I built a function (```cast_times```) that, for each line of the txt files, replaces the timestamp with the same timestamp minus its remainder modulo 86400 (the number of seconds in a day).
2) The timestamps are relative to 01 Jan 1970, a little bit out of date... So I decided to add to every timestamp the number 1221436800, which is the timestamp of the day when (according to [wikipedia en](https://en.wikipedia.org/wiki/Stack_Overflow)) the first public beta release of Stack Overflow was launched.
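
The two operations can be sketched as follows. This is only a minimal illustration and not the actual `cast_times` from `functions.py`, which works on whole files and is not shown here; the helper names `round_to_day` and `cast_timestamp` are hypothetical.

```python
SECONDS_PER_DAY = 86400
# Offset quoted in the text: 2008-09-15 00:00:00 UTC
BETA_RELEASE_TS = 1221436800


def round_to_day(ts: int) -> int:
    """Drop the intra-day part of a timestamp expressed in seconds."""
    return ts - ts % SECONDS_PER_DAY


def cast_timestamp(ts: int, offset: int = BETA_RELEASE_TS) -> int:
    """Round to the day, then shift by the offset (hypothetical helper)."""
    return round_to_day(ts) + offset


print(round_to_day(90000))    # one day plus one hour is rounded down to 86400
print(cast_timestamp(90000))
```

Since 1221436800 is itself a multiple of 86400, the shifted values stay aligned to day boundaries.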

#### Format
It's much easier to work with csv files, so I've saved the results as csv files.

In [1]:
import functions

In [2]:
print("Starting preprocessing for: answer to questions")
functions.cast_times('data/original/a2q.txt')
print("Done\n")

print("Starting preprocessing for: comment to answers")
functions.cast_times('data/original/c2a.txt')
print("Done\n")

print("Starting preprocessing for: comment to questions")
functions.cast_times('data/original/c2q.txt')
print("Done\n")

Starting preprocessing for: answer to questions
the output file: data/original/a2q_casted.csv already exists!
Done

Starting preprocessing for: comment to answers
the output file: data/original/c2a_casted.csv already exists!
Done

Starting preprocessing for: comment to questions
the output file: data/original/c2q_casted.csv already exists!
Done



### Exploring the data

In the following cells I analyse the date range over which the interactions are distributed and check whether any user is connected to itself.

In [3]:
import pandas as pd

# Reading the processed csv files
a2q = pd.read_csv('data/original/a2q_casted.csv', header='infer')
c2a = pd.read_csv('data/original/c2a_casted.csv', header='infer')
c2q = pd.read_csv('data/original/c2q_casted.csv', header='infer')

In [4]:
from datetime import datetime
print("a2q:")
print(f"\tmin_timestamp: {a2q.time.min()} in_date: {datetime.fromtimestamp(a2q.time.min())}")
print(f"\tmax_timestamp: {a2q.time.max()} in_date: {datetime.fromtimestamp(a2q.time.max())}")
print("c2a:")
print(f"\tmin_timestamp: {c2a.time.min()} in_date: {datetime.fromtimestamp(c2a.time.min())}")
print(f"\tmax_timestamp: {c2a.time.max()} in_date: {datetime.fromtimestamp(c2a.time.max())}")
print("c2q:")
print(f"\tmin_timestamp: {c2q.time.min()} in_date: {datetime.fromtimestamp(c2q.time.min())}")
print(f"\tmax_timestamp: {c2q.time.max()} in_date: {datetime.fromtimestamp(c2q.time.max())}")

a2q:
	min_timestamp: 1221436800 in_date: 2008-09-15 02:00:00
	max_timestamp: 1227398400 in_date: 2008-11-23 01:00:00
c2a:
	min_timestamp: 1221436800 in_date: 2008-09-15 02:00:00
	max_timestamp: 1227398400 in_date: 2008-11-23 01:00:00
c2q:
	min_timestamp: 1221436800 in_date: 2008-09-15 02:00:00
	max_timestamp: 1227398400 in_date: 2008-11-23 01:00:00
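
The hours 02:00 and 01:00 printed above most likely come from `datetime.fromtimestamp`, which converts to the local time zone of the machine when no `tz` argument is given; the rounded timestamps themselves fall on midnight UTC. A small sketch of a time-zone-independent conversion, using only the standard library (the helper name `to_utc_date` is hypothetical):

```python
from datetime import datetime, timezone


def to_utc_date(ts: int) -> datetime:
    """Convert a timestamp in seconds to an aware datetime in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


print(to_utc_date(1221436800))  # 2008-09-15 00:00:00+00:00
print(to_utc_date(1227398400))  # 2008-11-23 00:00:00+00:00
```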


In [5]:
# Self-interactions: rows where the same user appears on both sides
a2q[a2q.user_from == a2q.user_to]

Unnamed: 0,user_from,time,user_to


In [6]:
# Self-interactions: rows where the same user appears on both sides
c2a[c2a.user_from == c2a.user_to]

Unnamed: 0,user_from,time,user_to


In [7]:
# Self-interactions: rows where the same user appears on both sides
c2q[c2q.user_from == c2q.user_to]

Unnamed: 0,user_from,time,user_to
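
One possible way to deal with self-interactions, shown on a toy frame with the same column names as the dataframes above: a row is a self-interaction when `user_from` and `user_to` coincide on that row. Dropping such rows is only one option among several, and it is an assumption of this sketch rather than the choice made in `functions.py`.

```python
import pandas as pd

# Toy data: rows 1 and 3 are self-interactions
toy = pd.DataFrame({
    "user_from": [1, 2, 3, 3],
    "time": [0, 0, 86400, 86400],
    "user_to": [2, 2, 1, 3],
})

# Row-wise comparison of the two user columns
self_loops = toy[toy["user_from"] == toy["user_to"]]
without_self_loops = toy[toy["user_from"] != toy["user_to"]]

print(len(self_loops), len(without_self_loops))  # 2 2
```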


## Getting the graphs
We now have dataframes with the following columns:<br>
```	user_from	time	user_to```<br>
These are, respectively, the id of the user who makes the interaction (comment or answer), the day-rounded timestamp of the interaction (starting from 15 Sep 2008), and the id of the user who receives it on their post (answer or question).

### Processing the dataframes

1) When a graph is requested for a specific time interval, it is more convenient to first **keep only the rows** whose time attribute falls in the given range, so this is the first step.

2) Once we have only the rows of interest, we need to establish how to connect the users. This can be done with a weighted edge whose weight is the number of interactions the two users had in the given interval of time. We therefore go from a dataframe with two users and one timestamp per row to a dataframe with the same two users and a counter that **counts the occurrences of each pair**. This counter is also the weight in the final graph.

3) With this data we have everything we need, so the next step is to build the graph. We **represent the graph as a dictionary** containing the adjacency list of each node that has outgoing edges. This structure is **wrapped in a more general class** called ```MyGraph```, which is built from a dataframe of the form (user, user, weight) and will hold the methods that implement the functions for part 2.
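
The three steps above can be sketched on a toy dataframe. This is a minimal illustration under stated assumptions, not the code in `functions.py`: the helper names `filter_by_dates`, `count_interactions` and `to_adjacency` are hypothetical, the date range is treated as half-open (start included, end excluded) with dates read as UTC, and the counter column is called `weight`.

```python
import pandas as pd


def filter_by_dates(df, start, end):
    """Step 1: keep rows whose time lies in [start, end), dates read as UTC."""
    lo = pd.Timestamp(start, tz="UTC").timestamp()
    hi = pd.Timestamp(end, tz="UTC").timestamp()
    return df[(df["time"] >= lo) & (df["time"] < hi)]


def count_interactions(df):
    """Step 2: one row per (user_from, user_to) pair with its count."""
    return (
        df.groupby(["user_from", "user_to"])
        .size()
        .reset_index(name="weight")
    )


def to_adjacency(weighted):
    """Step 3: dictionary mapping each source node to {target: weight}."""
    adjacency = {}
    for u, v, w in weighted.itertuples(index=False):
        adjacency.setdefault(int(u), {})[int(v)] = int(w)
    return adjacency


# Toy data: the last row falls outside the requested range
toy = pd.DataFrame({
    "user_from": [1, 1, 2, 3],
    "time": [1226534400, 1226620800, 1226534400, 1220000000],
    "user_to": [2, 2, 3, 1],
})

filtered_toy = filter_by_dates(toy, "2008-11-13", "2008-11-15")
adjacency = to_adjacency(count_interactions(filtered_toy))
print(adjacency)  # {1: {2: 2}, 2: {3: 1}}
```

Only nodes with outgoing edges appear as keys, which matches the description of the adjacency structure above; a node that only receives interactions shows up as a value.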

## Examples
In the following cells I'll show how most of these methods work.

In [8]:
filtered = functions.filter_dataframe_dates(c2q, date_range=('2008-11-13', '2008-11-23'))
filtered

Unnamed: 0,user_from,time,user_to
19205128,17389,1226534400,1444433689
19205129,5105044,1226534400,1443530974
19205130,929510,1226534400,1438280294
19205131,5179979,1226534400,1452019156
19205132,885189,1226534400,1438259025
...,...,...,...
20231193,141172,1227312000,1456018680
20231194,871050,1227312000,1454636399
20231195,5884566,1227312000,1454637029
20231196,4389062,1227312000,1454647251


In [9]:
print(f"\tmin_timestamp: {filtered.time.min()} in_date: {datetime.fromtimestamp(filtered.time.min())}")
print(f"\tmax_timestamp: {filtered.time.max()} in_date: {datetime.fromtimestamp(filtered.time.max())}")

	min_timestamp: 1226534400 in_date: 2008-11-13 01:00:00
	max_timestamp: 1227312000 in_date: 2008-11-22 01:00:00


In [10]:
weighted = functions.time_to_weight(filtered)
weighted.sort_values(by='time', ascending=False)

Unnamed: 0,user_from,user_to,time
779730,5336818,1455852631,2
24483,62576,1454325288,2
789760,5355912,1442822575,2
684049,5165935,1452565553,1
684037,5165930,1438359378,1
...,...,...,...
342028,2026276,1440638002,1
342029,2026276,1440778413,1
342030,2026276,1440785486,1
342031,2026276,1441741834,1


In [11]:
graph = functions.MyGraph(weighted)
print(f"Created a graph with {len(graph.nodes)} nodes")

Created a graph with 1160597 nodes


## Putting it all together

We can now use two functions from the [functions.py](functions.py) file that build the graphs directly from the relative path of the csv file and the date_range over which to filter the data:
1) ```get_single_graph```, which returns the graph for a single csv file
2) ```get_global_graph```, which returns the graph that is the union of the three files. Here it is important to understand how the edges are merged: three coefficients (one per type of edge) multiply the weights of the corresponding edges in each dataset. In this way we can merge the three graphs while saying, for example, that an answer to a question is a more important interaction than a comment to an answer, which in turn is more important than a comment to a question; so we can give the first a value of 3, the second 2 and the third 1.
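
The coefficient-based merge can be illustrated on toy data. This is a hedged sketch and not the implementation of `get_global_graph`: the helper name `merge_weighted` and the `weight` column name are hypothetical, and summing the scaled weights when the same pair of users appears in more than one file is an assumption.

```python
import pandas as pd


def merge_weighted(frames, coefficients):
    """Scale each frame's weights by its coefficient, then sum per user pair."""
    scaled = [
        frame.assign(weight=frame["weight"] * coefficient)
        for frame, coefficient in zip(frames, coefficients)
    ]
    merged = pd.concat(scaled, ignore_index=True)
    return merged.groupby(["user_from", "user_to"], as_index=False)["weight"].sum()


# Toy weighted edge lists, one per interaction type
a2q_w = pd.DataFrame({"user_from": [1], "user_to": [2], "weight": [1]})
c2a_w = pd.DataFrame({"user_from": [1, 2], "user_to": [2, 3], "weight": [2, 1]})
c2q_w = pd.DataFrame({"user_from": [2], "user_to": [3], "weight": [1]})

merged = merge_weighted([a2q_w, c2a_w, c2q_w], coefficients=[3, 2, 1])
print(merged)
# Edge (1, 2): 3*1 + 2*2 = 7; edge (2, 3): 2*1 + 1*1 = 3
```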

### WARNING!!!
Using a large time range will **drastically slow down** the execution.

In [12]:
G = functions.get_global_graph(['data/original/a2q_casted.csv','data/original/c2a_casted.csv', 'data/original/c2q_casted.csv'], date_range=('2008-11-13', '2008-11-23'), coefficients=[3,2,1])

reading the files:
	-answer to questions: data/original/a2q_casted.csv	-comment to answers: data/original/c2a_casted.csv	-comment to questions: data/original/c2q_casted.csv
done in 12.7s
filtering the dataframes...
done in 0.975s
generating the weights
- multiplying by the coefficients: [3, 2, 1]
done in 0.78s
putting all togheter, it may require some time...
done in 1.491s
retrieving the graph
done in 12.699s tot time elapsed: 28.648


In [13]:
g_partial = functions.get_single_graph('data/original/a2q_casted.csv', date_range=('2008-11-13', '2008-11-14'))

Reading the file: data/original/a2q_casted.csv
done in 3.464s
Retrieving the graph...
done in 1.782s
