# ADA 2018 - Homework 3



## Understanding the StackOverflow community


Deadline: Nov 7th 2018, 23:59:59

Submission link: Check channel homework-3-public

StackOverflow is the most popular programming-related Q&A website. It serves as a platform for users to ask and answer questions and to vote questions and answers up or down. Users of StackOverflow can earn reputation points and "badges"; for example, a person is awarded 10 reputation points for receiving an "up" vote on an answer given to a question, and 5 points for the "up" vote on a question asked. Also, users receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site. 

[Learn more about StackOverflow on Wikipedia](https://en.wikipedia.org/wiki/Stack_Overflow)

----

Dataset link:

https://drive.google.com/open?id=1POlGjqzw9v_pZ_bUnXGihOgk45kbvNjB

http://iccluster053.iccluster.epfl.ch/Posts.json.zip (mirror 1)

https://iloveadatas.com/datasets/Posts.json.zip (mirror 2)

Dataset description:

* **Id**: Id of the post
* **CreationDate**: Creation date of the post (String format)
* **PostTypeId**: Type of post (Question = 1, Answer = 2)
* **ParentId**: The id of the question. Only present if PostTypeId = 2
* **Score**: Points assigned by the users
* **Tags**: Tags of the question. Only present if PostTypeId = 1
* **Title**: Only present if PostTypeId = 1
* **ViewCount**: Only present if PostTypeId = 1

The dataset format is JSON. Here are examples of a question and an answer:

Question:
```json
{
    "Id": 10130734,
    "CreationDate": "2012-04-12T19:51:25.793+02:00",
    "PostTypeId": 1,
    "Score": 4,
    "Tags": "<python><pandas>",
    "Title": "Best way to insert a new value",
    "ViewCount": 3803
}
```

Answer:
```json
{  
   "CreationDate":"2010-10-26T03:19:05.063+02:00",
   "Id":4020440,
   "ParentId":4020214,
   "PostTypeId":2,
   "Score":1
}
```

----
Useful resources:

**Spark SQL, DataFrames and Datasets Guide**

https://spark.apache.org/docs/latest/sql-programming-guide.html

**Database schema documentation for the public data dump**

https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

----

**Note:** Use Spark where possible. Some computations can take more than 10 minutes on a typical laptop. Consider saving partial results to disk.

In [2]:
# Add your imports here
import pandas as pd
import numpy as np
import scipy as sp
from pyspark.sql import *
%matplotlib inline
from collections import Counter
import re

In [3]:
spark = SparkSession.builder.getOrCreate()

In [5]:
DATA_DIR = 'data/'
POST_PATH = DATA_DIR + "Posts.json"
POST_PQ = "Posts.parquet"
TAGS_PQ = "Tags.parquet"
QUESTIONS_PQ = "Questions.parquet"

### Task A: Convert the dataset to a more convenient format
As a warm-up task (and to avoid warming up your laptop too much), load the dataset into a Spark dataframe, show the content, and save it in the _Parquet_ format. Use this step to convert the fields to a more convenient form.

Answer the following questions:

1. How many questions have been asked on StackOverflow?
2. How many answers have been given?
3. What is the percentage of questions with a score of 0?

**Hint:** The next tasks involve a time difference. Consider storing time in numeric format.
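
One way to follow this hint is to turn the `CreationDate` strings into seconds since the Unix epoch before saving. The sketch below is plain Python rather than Spark, and the helper name `to_epoch_seconds` is hypothetical. It assumes Python 3.7+ and that every timestamp follows the ISO 8601 layout shown in the question and answer examples above.

```python
from datetime import datetime


def to_epoch_seconds(creation_date: str) -> float:
    """Convert an ISO 8601 timestamp with a UTC offset to seconds since the Unix epoch."""
    # fromisoformat keeps the "+02:00" offset, so the datetime is timezone-aware
    return datetime.fromisoformat(creation_date).timestamp()


question_time = to_epoch_seconds("2012-04-12T19:51:25.793+02:00")
one_minute_later = to_epoch_seconds("2012-04-12T19:52:25.793+02:00")
# Numeric timestamps can be subtracted directly to get a duration in seconds
print(one_minute_later - question_time)
```

With the time stored as a number, the later time-difference computations reduce to simple subtraction.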

In [13]:
# Load the dataset into a Spark DataFrame
posts_df = spark.read.json(POST_PATH)
# Save the dataset in Parquet format, using the path constant defined above
posts_df.write.parquet(POST_PQ)
posts_df.show()

+--------------------+-------+--------+----------+-----+--------------------+--------------------+---------+
|        CreationDate|     Id|ParentId|PostTypeId|Score|                Tags|               Title|ViewCount|
+--------------------+-------+--------+----------+-----+--------------------+--------------------+---------+
|2010-10-26T03:17:...|4020437|    null|         1|    0|<asp.net-mvc><jqu...|display jquery di...|      510|
|2010-10-26T03:18:...|4020438|    null|         1|    0|<javascript><html...|Why can only my m...|       62|
|2010-10-26T03:19:...|4020440| 4020214|         2|    1|                null|                null|     null|
|2010-10-26T03:19:...|4020441| 3938154|         2|    0|                null|                null|     null|
|2010-10-26T03:20:...|4020443| 4020419|         2|  324|                null|                null|     null|
|2010-10-26T03:20:...|4020444| 4020433|         2|    0|                null|                null|     null|
|2010-10-26T03:21:.

In [6]:
posts = spark.read.parquet(POST_PQ)

In [7]:
# 1. How many questions have been asked on StackOverflow?
questions = posts.filter(posts['PostTypeId'] == 1)
nb_questions = questions.count()
# Save for later use
questions.write.save(QUESTIONS_PQ)
print("There are {} questions asked on StackOverflow.".format(nb_questions))

There are 15647060 questions asked on StackOverflow.


In [6]:
# 2. How many answers have been given?
nb_answers = posts.filter(posts['PostTypeId'] == 2).count()
print("There are {} answers given on StackOverflow.".format(nb_answers))

There are 25192772 answers given on StackOverflow.


In [7]:
# 3. What is the percentage of questions with a score of 0?
nb_quest_0_score = questions.filter(questions['Score'] == 0).count()
percentage_0_score = float(nb_quest_0_score) / float(nb_questions) * 100
print("{} % of the questions have score of 0.".format(percentage_0_score))

46.54365740273253 % of the questions have score of 0.


**Hint:** Load the dataset from the Parquet file for the next tasks.

### Task B: What are the 10 most popular tags?

What are the most popular tags in StackOverflow? Use Spark to extract the information you need, and answer the following questions with Pandas and Matplotlib (or Seaborn):

1. What is the proportion of tags that appear in fewer than 100 questions?
2. Plot the distribution of the tag counts using an appropriate representation.
3. Plot a bar chart with the number of questions for the 10 most popular tags.

For each task describe your findings briefly.
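
As a rough illustration of the counting behind question 1, the sketch below extracts tags from raw `Tags` strings and computes the share of tags that fall below a threshold. It is plain Python on a tiny made-up sample, and the names `count_tags` and `proportion_below` are hypothetical. On the full dataset the counting would presumably be done in Spark, with only the per-tag counts collected for Pandas.

```python
import re
from collections import Counter

# Hypothetical sample of raw `Tags` strings, in the same format as the dataset
sample_tags = [
    "<python><pandas>",
    "<python><numpy>",
    "<java>",
    "<python>",
]

TAG_PATTERN = re.compile(r"<(.*?)>")


def count_tags(tag_strings):
    """Count in how many tag strings (questions) each tag appears."""
    counts = Counter()
    for tag_string in tag_strings:
        counts.update(TAG_PATTERN.findall(tag_string))
    return counts


def proportion_below(counts, threshold):
    """Fraction of distinct tags that appear in fewer than `threshold` questions."""
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)


tag_counts = count_tags(sample_tags)
print(tag_counts.most_common(1))
# The assignment uses a threshold of 100; 2 is used here only because the sample is tiny
print(proportion_below(tag_counts, threshold=2))
```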

In [18]:
# Read previously saved files
questions_df = spark.read.parquet(QUESTIONS_PQ)
tags_rdd = spark.read.parquet(TAGS_PQ).rdd

In [19]:
# 1. What is the proportion of tags that appear in fewer than 100 questions?
# tags_rows = posts.filter(posts['Tags'] != "null")
# tags = tags_rows.select(tags_rows['Tags'])
# tags_rdd = tags.rdd

In [9]:
all_tags = tags_rdd.flatMap(lambda x: re.findall(r'<(.*?)>', x[0])).distinct()
nb_tags = all_tags.count()
nb_tags, all_tags.take(20)

(52994,
 ['python',
  'share',
  'facebook-opengraph',
  'php',
  'excel-formula',
  'tortoisesvn',
  'spring-cloud-stream',
  'runtime',
  'remote-server',
  'deployment',
  'session',
  'python-3.6',
  'botframework',
  'ember.js',
  'transitions',
  'concurrency',
  'asp.net-core',
  'function',
  'iframe',
  'testing'])

In [None]:
# Counter to count each tag in questions
tags_counter = Counter()
tags_frequence = dict.fromkeys(all_tags.collect())
l = []
def count_frequence(x):
    """Increment counter of the tag by 1"""
    # Get all tags in the line
    tags = re.findall(r'<(.*?)>', x)
    # Increment the frequency of each tag
    for tag in tags:
        l.append(tag)
        tags_counter[tag] += 1
        if tags_frequence[tag] is None:
            tags_frequence[tag] = 1
        else:
            tags_frequence[tag] = tags_frequence[tag] + 1

In [20]:
question_tags_rdd = questions_df.select(questions_df['Tags']).rdd
#question_tags_rdd.map(Counter).reduce(lambda x, y: x + y)
questions_df.show(5)

+--------------------+--------+--------+----------+-----+--------------------+--------------------+---------+
|        CreationDate|      Id|ParentId|PostTypeId|Score|                Tags|               Title|ViewCount|
+--------------------+--------+--------+----------+-----+--------------------+--------------------+---------+
|2017-08-17T16:20:...|45740348|    null|         1|    2|<flash><react-nat...|Is it possible to...|      143|
|2017-08-17T16:20:...|45740355|    null|         1|    1|<postgresql><form...|Remove trailing z...|      444|
|2017-08-17T16:21:...|45740358|    null|         1|    0|<python><websocke...|Python websockets...|      280|
|2017-08-17T16:21:...|45740363|    null|         1|    0|<facebook><facebo...|Image meta tag no...|       97|
|2017-08-17T16:21:...|45740371|    null|         1|    1|    <mongodb><shell>|Using Mongo-cli t...|      185|
+--------------------+--------+--------+----------+-----+--------------------+--------------------+---------+
only showing top 5 rows

In [36]:
question_tags = question_tags_rdd.flatMap(lambda x: re.findall(r'<(.*?)>', x[0]))
question_tags = question_tags.map(lambda x: (x, 1))

In [38]:
question_tags.reduceByKey(lambda x, y: x + y).take(5)

KeyboardInterrupt: 

In [41]:
tags_frequence

{'python': None,
 'share': None,
 'facebook-opengraph': None,
 'php': None,
 'excel-formula': None,
 'tortoisesvn': None,
 'spring-cloud-stream': None,
 'runtime': None,
 'remote-server': None,
 'deployment': None,
 'session': None,
 'python-3.6': None,
 'botframework': None,
 'ember.js': None,
 'transitions': None,
 'concurrency': None,
 'asp.net-core': None,
 'function': None,
 'iframe': None,
 'testing': None,
 'module': None,
 'python-module': None,
 'string-literals': None,
 'nuclide-editor': None,
 'python-sphinx': None,
 'upload': None,
 'multipartform-data': None,
 'sorting': None,
 'loader': None,
 'split': None,
 'lines': None,
 'azure-cloud-services': None,
 'odoo-website': None,
 'postman': None,
 'environment': None,
 'redis': None,
 'smarty': None,
 'nltk': None,
 'data-structures': None,
 'integration': None,
 'vue-component': None,
 'windows-10-universal': None,
 'gpio': None,
 'http': None,
 'iostream': None,
 'pyzo': None,
 'require': None,
 'file-io': None,
 'rust': 

Counter({'apples': 1, 'oranges': 1})

### Task C: View-score relation

We want to investigate the correlation between the view count and the score of questions.

1. Get the view count and score of the questions with tag ```random-effects``` and visualize the relation between these two variables using an appropriate plot.
2. Are these two variables correlated? Use the Pearson coefficient to validate your hypothesis. Discuss your findings in detail.

**Hint:** Inspect the data visually before drawing your conclusions.
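
To make the coefficient concrete, here is a minimal sketch of the Pearson correlation written out by hand. The (view count, score) pairs are made up for illustration and are not taken from the dataset. The function name `pearson` is hypothetical, and in practice a library routine would likely be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (ViewCount, Score) pairs, for illustration only
view_counts = [10, 20, 30, 40, 1000]
scores = [0, 1, 1, 2, 3]

print(pearson(view_counts, scores))
# Dropping the single extreme view count changes the coefficient noticeably
print(pearson(view_counts[:-1], scores[:-1]))
```

The second call illustrates why the hint matters: a single extreme point can shift the coefficient, which a scatter plot makes visible at a glance.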

In [4]:
# Add your code and description here

### Task D: What are the tags with the fastest first answer?

What are the tags that have the fastest response time from the community? We define the response time as the difference in seconds between the timestamps of the question and of the first answer received.
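
The definition above can be sketched in plain Python as follows. The sample records are made up, the function name `first_answer_response_times` is hypothetical, and `CreationDate` is assumed to have already been converted to seconds since the epoch. A Spark version would presumably express the same idea as a group-by on `ParentId` followed by a join with the questions.

```python
# Hypothetical posts with CreationDate already converted to epoch seconds
sample_posts = [
    {"Id": 1, "PostTypeId": 1, "CreationDate": 1000.0},
    {"Id": 2, "PostTypeId": 2, "ParentId": 1, "CreationDate": 1300.0},
    {"Id": 3, "PostTypeId": 2, "ParentId": 1, "CreationDate": 1120.0},
    {"Id": 4, "PostTypeId": 1, "CreationDate": 2000.0},  # no answer yet
]


def first_answer_response_times(posts):
    """Map each answered question Id to the delay, in seconds, until its first answer."""
    question_time = {p["Id"]: p["CreationDate"] for p in posts if p["PostTypeId"] == 1}
    # Earliest answer timestamp per question
    first_answer_time = {}
    for p in posts:
        if p["PostTypeId"] == 2:
            parent = p["ParentId"]
            earliest = first_answer_time.get(parent, float("inf"))
            first_answer_time[parent] = min(earliest, p["CreationDate"])
    # Questions without any answer are left out
    return {
        qid: first_answer_time[qid] - asked
        for qid, asked in question_time.items()
        if qid in first_answer_time
    }


response_times = first_answer_response_times(sample_posts)
print(response_times)
```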

1. Get the response time for the first answer of the questions with the tags ```python``` and ```java```.
2. Plot the two distributions in an appropriate format. What do you observe? Describe your findings and discuss the following distribution properties: mean, median, standard deviation.
3. We believe that the response time is lower for questions related to Python (compared to Java). Contradict or confirm this assumption by estimating the proper statistic with bootstrapping. Visualize the 95% confidence intervals with box plots and describe your findings.
4. Repeat the first analysis (D1) by using the proper statistic to measure the response time for the tags that appear at least 5000 times. Plot the distribution of the 10 tags with the fastest response time.
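
A percentile bootstrap is one way to obtain such confidence intervals. The sketch below resamples with replacement and reads the interval from the sorted estimates. Using the median as the statistic is an assumption made here, not something the assignment prescribes. The response times are made up, and the name `bootstrap_ci` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, statistic=statistics.median, n_resamples=1000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        statistic(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_index = round((1 - confidence) / 2 * n_resamples)
    upper_index = round((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Made-up response times in seconds, for illustration only
python_times = [60, 95, 120, 180, 240, 300, 420, 900, 3600, 86400]
java_times = [90, 130, 200, 260, 330, 480, 700, 1500, 7200, 172800]

print("python:", bootstrap_ci(python_times))
print("java:  ", bootstrap_ci(java_times))
```

If the two intervals overlap substantially, the data would not clearly support the claim that one tag is answered faster than the other.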


In [5]:
# Add your code and description here

### Task E: What's up with PySpark?
The number of questions asked about a specific topic reflects the public's interest in it. We are interested in the popularity of PySpark. Compute and plot the number of questions with the ```pyspark``` tag for 30-day time intervals. Do you notice any trend over time? Is there any correlation between time and number of questions?
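
The 30-day binning can be sketched as below, assuming creation times are available as seconds since the epoch. The timestamps are made up and the name `count_per_interval` is hypothetical. Starting the first interval at the earliest question is a choice made for this sketch, not something the task specifies.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60
INTERVAL_SECONDS = 30 * SECONDS_PER_DAY


def count_per_interval(timestamps, interval=INTERVAL_SECONDS):
    """Count timestamps per fixed-length interval, starting at the earliest one."""
    start = min(timestamps)
    bins = Counter(int((t - start) // interval) for t in timestamps)
    # Include empty intervals so gaps show up as zeros in a plot
    return [bins.get(i, 0) for i in range(max(bins) + 1)]


# Made-up creation times (epoch seconds) of questions tagged `pyspark`
day = SECONDS_PER_DAY
sample_times = [0, 5 * day, 29 * day, 30 * day, 95 * day]

print(count_per_interval(sample_times))
```

The resulting list of counts, indexed by interval number, can be plotted directly and correlated with the interval index to look for a trend.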


In [None]:
# Add your code and description here