# ADA 2018 - Homework 3



## Understanding the StackOverflow community


StackOverflow is the most popular programming-related Q&A website. It serves as a platform where users ask and answer questions and vote questions and answers up or down. Users of StackOverflow can earn reputation points and "badges"; for example, a person is awarded 10 reputation points for receiving an "up" vote on an answer given to a question, and 5 points for an "up" vote on a question asked. Users also receive badges for valued contributions, a form of gamification of the traditional Q&A site.

[Learn more about StackOverflow on Wikipedia](https://en.wikipedia.org/wiki/Stack_Overflow)

----

Dataset link:

https://drive.google.com/open?id=1POlGjqzw9v_pZ_bUnXGihOgk45kbvNjB

Dataset description:

* **Id**: Id of the post
* **CreationDate**: Creation date of the post (String format)
* **PostTypeId**: Type of post (Question = 1, Answer = 2)
* **ParentId**: The id of the question. Only present if PostTypeId = 2
* **Score**: Points assigned by the users
* **Tags**: Tags of the question. Only present if PostTypeId = 1
* **Title**: Only present if PostTypeId = 1
* **ViewCount**: Only present if PostTypeId = 1

The dataset format is JSON. Here are examples of a question and an answer:

Question:
```json
{
    "Id": 10130734,
    "CreationDate": "2012-04-12T19:51:25.793+02:00",
    "PostTypeId": 1,
    "Score": 4,
    "Tags": "<python><pandas>",
    "Title": "Best way to insert a new value",
    "ViewCount": 3803
}
```

Answer:
```json
{  
   "CreationDate":"2010-10-26T03:19:05.063+02:00",
   "Id":4020440,
   "ParentId":4020214,
   "PostTypeId":2,
   "Score":1
}
```

----
Useful resources:

**Spark SQL, DataFrames and Datasets Guide**

https://spark.apache.org/docs/latest/sql-programming-guide.html

**Database schema documentation for the public data dump**

https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

----

**Note:** Use Spark where possible. Some computations can take more than 10 minutes on a typical laptop. Consider saving partial results to disk.

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import findspark
import zipfile
findspark.init('/opt/spark/spark-2.3.2-bin-hadoop2.7/')

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.functions import min

from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

### Task A: Convert the dataset to a more convenient format
As a warm-up task (and to avoid warming up your laptop too much), load the dataset into a Spark DataFrame, show its content, and save it in the _Parquet_ format. Use this step to convert the fields to a more convenient form.

Answer the following questions:

1. How many questions have been asked on StackOverflow?
2. How many answers have been given?
3. What is the percentage of questions with a score of 0?

**Hint:** The next tasks involve a time difference. Consider storing time in numeric format.
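
One way to follow this hint is to turn each `CreationDate` string into seconds since the Unix epoch, so that a time difference becomes a plain subtraction. The sketch below uses only the Python standard library to show the conversion. The helper name `to_epoch_seconds` is hypothetical, the answer timestamp is the one from the dataset description, and the question timestamp is made up for illustration. In Spark, the equivalent conversion would be expressed on the column rather than row by row.

```python
from datetime import datetime

def to_epoch_seconds(creation_date: str) -> float:
    """Convert an ISO 8601 string with a UTC offset to seconds since the epoch."""
    # fromisoformat parses the "+02:00" offset, so the datetime is timezone-aware
    return datetime.fromisoformat(creation_date).timestamp()

# Answer timestamp from the dataset description; the question timestamp is hypothetical
answer_time = to_epoch_seconds("2010-10-26T03:19:05.063+02:00")
question_time = to_epoch_seconds("2010-10-26T03:17:05.063+02:00")

# With numeric timestamps, elapsed time is an ordinary subtraction (in seconds)
print(answer_time - question_time)
```

Storing the numeric value once, when writing the Parquet file, avoids re-parsing the strings in every later task.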

In [2]:
# Add your code and description here

In [2]:
DATA_DIR = 'data/'

In [3]:
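# Extract Posts.json into DATA_DIR, where the next cell expects it.
# The context manager closes the archive; extract() only returns the extracted path.
try:
    with zipfile.ZipFile(DATA_DIR + "Posts.json.zip") as archive:
        archive.extract("Posts.json", path=DATA_DIR)
except zipfile.BadZipFile:
    print("Zip file might have problems")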

Zip file might have problems


In [14]:
#posts_rdd = spark.read.json(DATA_DIR + "Posts.json.zip", multiLine=True)
posts_rdd = spark.read.json(DATA_DIR + "Posts.json")

In [16]:
posts_rdd.show(5)

+--------------------+-------+--------+----------+-----+--------------------+--------------------+---------+---------------+
|        CreationDate|     Id|ParentId|PostTypeId|Score|                Tags|               Title|ViewCount|_corrupt_record|
+--------------------+-------+--------+----------+-----+--------------------+--------------------+---------+---------------+
|2010-10-26T03:17:...|4020437|    null|         1|    0|<asp.net-mvc><jqu...|display jquery di...|      510|           null|
|2010-10-26T03:18:...|4020438|    null|         1|    0|<javascript><html...|Why can only my m...|       62|           null|
|2010-10-26T03:19:...|4020440| 4020214|         2|    1|                null|                null|     null|           null|
|2010-10-26T03:19:...|4020441| 3938154|         2|    0|                null|                null|     null|           null|
|2010-10-26T03:20:...|4020443| 4020419|         2|  324|                null|                null|     null|           null|


In [17]:
posts_rdd.count()

11759909

In [18]:
posts_rdd.write.parquet(DATA_DIR + "posts.parquet")

In [19]:
posts_parquet = spark.read.parquet(DATA_DIR + "posts.parquet")
posts_parquet.show(5)

+--------------------+--------+--------+----------+-----+--------------------+--------------------+---------+---------------+
|        CreationDate|      Id|ParentId|PostTypeId|Score|                Tags|               Title|ViewCount|_corrupt_record|
+--------------------+--------+--------+----------+-----+--------------------+--------------------+---------+---------------+
|2014-11-16T07:04:...|26954698|26901004|         2|    3|                null|                null|     null|           null|
|2014-11-16T07:04:...|26954699|26954620|         2|    1|                null|                null|     null|           null|
|2014-11-16T07:04:...|26954700|    null|         1|    0|<android><android...|Animate sprites u...|       63|           null|
|2014-11-16T07:04:...|26954701|    null|         1|    0|<android><sqlite>...|Not Able to retri...|       60|           null|
|2014-11-16T07:04:...|26954702|26954653|         2|    1|                null|                null|     null|         

__1) How many questions have been asked on StackOverflow?__

In [22]:
# Questions have PostTypeId == 1
posts_parquet.filter('PostTypeId == 1').count()


4030853

__2) How many answers have been given?__

In [23]:
# Answers have PostTypeId == 2
posts_parquet.filter('PostTypeId == 2').count()

7729055

__3) What is the percentage of questions with a score of 0?__

In [25]:
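# This ratio is taken over all posts (questions and answers), not questions only.
# For the question-only percentage, apply filter('PostTypeId == 1') to both counts.
(posts_parquet.filter('Score == 0').count()/posts_parquet.count()) * 100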

33.477002245510576

**Hint:** Load the dataset from the Parquet file for the next tasks.

### Task B: What are the 10 most popular tags?

What are the most popular tags in StackOverflow? Use Spark to extract the information you need, and answer the following questions with Pandas and Matplotlib (or Seaborn):

1. What is the proportion of tags that appear in fewer than 100 questions?
2. Plot the distribution of the tag counts using an appropriate representation.
3. Plot a bar chart with the number of questions for the 10 most popular tags.

For each task, briefly describe your findings.
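
The `Tags` field packs all tags of a question into one string such as `<python><pandas>`, so counting tags starts with splitting that string. The following is a minimal plain-Python sketch of the counting logic on made-up input; the function names are hypothetical, and the toy threshold of 2 stands in for the 100 used in the task. On the full dataset the same steps would be carried out in Spark before handing the aggregated counts to Pandas.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"<([^<>]+)>")

def split_tags(tags: str) -> list:
    """Turn '<python><pandas>' into ['python', 'pandas']."""
    return TAG_PATTERN.findall(tags)

def tag_counts(tag_strings) -> Counter:
    """Count, for every tag, the number of questions that carry it."""
    counts = Counter()
    for tags in tag_strings:
        # A set guards against counting a tag twice for the same question
        counts.update(set(split_tags(tags)))
    return counts

def share_below(counts: Counter, threshold: int) -> float:
    """Proportion of distinct tags that appear in fewer than `threshold` questions."""
    rare = sum(1 for n in counts.values() if n < threshold)
    return rare / len(counts)

# Toy input in the same format as the Tags field (values are made up)
sample = ["<python><pandas>", "<python>", "<java>", "<python><numpy>"]
counts = tag_counts(sample)
print(counts.most_common(1))
print(share_below(counts, threshold=2))
```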

In [None]:
# Add your code and description here

### Task C: View-score relation

We want to investigate the correlation between the view count and the score of questions.

1. Get the view count and score of the questions with tag ```random-effects``` and visualize the relation between these two variables using an appropriate plot.
2. Are these two variables correlated? Use the Pearson coefficient to validate your hypothesis. Discuss your findings in detail.

**Hint:** Inspect the data visually before drawing your conclusions.
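
The hint matters because the Pearson coefficient is sensitive to extreme values. The sketch below computes the coefficient by hand on made-up (view count, score) pairs; the numbers are illustrative only and are not taken from the dataset. It shows how a single extreme question can produce a coefficient close to 1 even though the remaining points are uncorrelated.

```python
import math

def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: four uncorrelated points plus one extreme question
views = [10, 20, 30, 40, 5000]
scores = [2, 0, 3, 1, 90]

print(pearson(views, scores))            # driven almost entirely by the extreme point
print(pearson(views[:-1], scores[:-1]))  # the same data without it
```

A scatter plot, possibly on logarithmic axes, makes this kind of point visible before any coefficient is computed.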

In [None]:
# Add your code and description here

### Task D: What are the tags with the fastest first answer?

What are the tags that have the fastest response time from the community? We define the response time as the difference in seconds between the timestamps of the question and of the first answer received.
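
The definition above can be sketched in plain Python as follows. The function name `first_answer_delays` is hypothetical, the posts are made up, and `CreationDate` is assumed to have already been converted to epoch seconds. In Spark, the same logic would typically become a group-by on `ParentId` with a minimum, followed by a join with the questions.

```python
def first_answer_delays(posts):
    """Map each answered question Id to the seconds until its first answer.

    `posts` is an iterable of dicts using the dataset's field names, with
    CreationDate assumed to be numeric (epoch seconds).
    """
    question_time = {}
    first_answer_time = {}
    for post in posts:
        if post["PostTypeId"] == 1:
            question_time[post["Id"]] = post["CreationDate"]
        elif post["PostTypeId"] == 2:
            parent = post["ParentId"]
            created = post["CreationDate"]
            # Keep only the earliest answer per question
            if parent not in first_answer_time or created < first_answer_time[parent]:
                first_answer_time[parent] = created
    # Questions without any answer are left out of the result
    return {
        qid: first_answer_time[qid] - asked
        for qid, asked in question_time.items()
        if qid in first_answer_time
    }

# Made-up posts: question 1 has two answers, question 2 has none
posts = [
    {"Id": 1, "PostTypeId": 1, "CreationDate": 1000},
    {"Id": 2, "PostTypeId": 1, "CreationDate": 2000},
    {"Id": 3, "PostTypeId": 2, "ParentId": 1, "CreationDate": 1300},
    {"Id": 4, "PostTypeId": 2, "ParentId": 1, "CreationDate": 1120},
]
print(first_answer_delays(posts))
```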

1. Get the response time for the first answer of the questions with the tags ```python``` and ```java```.
2. Plot the two distributions in an appropriate format. What do you observe? Describe your findings and discuss the following distribution properties: mean, median, standard deviation.
3. We believe that the response time is lower for questions related to Python. Contradict or confirm this assumption by estimating the proper statistic with bootstrapping. Visualize the 95% confidence intervals with box plots and describe your findings.
4. Repeat the first analysis (D1) by using the proper statistic to measure the response time for the tags that appear at least 5000 times. Plot the distribution of the 10 tags with the fastest response time.
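
For the bootstrapping step, a percentile bootstrap is one possible approach. The sketch below is a minimal standard-library version under stated assumptions: the median is used as the statistic (an assumption, since the task leaves the choice to you), the function name `bootstrap_ci` and its parameters are hypothetical, and the response times are made up rather than taken from the dataset.

```python
import random
import statistics

def bootstrap_ci(sample, statistic=statistics.median, n_resamples=1000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `sample`."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    estimates = sorted(
        statistic(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Made-up response times in seconds, skewed by a few very slow answers
python_times = [60, 95, 120, 180, 240, 300, 420, 900, 3600, 86400]
java_times = [90, 150, 200, 260, 380, 500, 700, 1500, 5400, 172800]

print(bootstrap_ci(python_times))
print(bootstrap_ci(java_times))
```

If the two intervals overlap substantially, the data do not clearly support a difference between the tags; the resampled estimates themselves can be passed to a box plot.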


In [None]:
# Add your code and description here

### Task E: What's up with PySpark?
The number of questions asked about a specific topic reflects the public’s interest in it. We are interested in the popularity of PySpark. Compute and plot the number of questions with the ```pyspark``` tag for 30-day time intervals. Do you notice any trend over time? Is there any correlation between time and number of questions?
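
Grouping into 30-day intervals amounts to an integer division of each timestamp's offset from the first question. The sketch below shows that binning in plain Python on made-up timestamps; the function name `count_per_interval` is hypothetical, and timestamps are assumed to be epoch seconds.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60

def count_per_interval(timestamps, interval_days=30):
    """Count timestamps (epoch seconds) per consecutive interval, starting at the earliest one."""
    start = min(timestamps)
    width = interval_days * SECONDS_PER_DAY
    counts = Counter(int((t - start) // width) for t in timestamps)
    # Include empty intervals so that gaps show up as zeros in a plot
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

# Made-up question times, written in days for readability
days = [0, 3, 29, 30, 45, 100]
timestamps = [d * SECONDS_PER_DAY for d in days]
print(count_per_interval(timestamps))
```

The index of each interval can then serve as the time variable when checking for a correlation with the number of questions.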


In [None]:
# Add your code and description here