#### Names of people in the group

Please write the names of the people in your group in the next cell.

Oda Colquhoun

Emil Bjørlykke Berglund

In [0]:
# Loading modules that we need
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *


# Add your imports below this line
from pyspark.sql.dataframe import DataFrame
from typing import Any
from pyspark.sql import SparkSession
from pyspark.sql.functions import stddev

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

# Uncomment if you need
# comments_df = load_df("/user/hive/warehouse/comments")
# badges_df = load_df("/user/hive/warehouse/badges")

#### The problem: mining the interests of experts

The primary role of a question-and-answer platform such as Stack Exchange is to connect two types of people: people who have questions in areas such as computer science or data science, and knowledgeable people who can answer those questions reliably. Let's call the first category of people 'knowledge seekers' and the second one 'expert users' or 'experts' for short.

Here we want to use our data to answer a question about the diversity of topics that experts are interested in. We want to know whether expert users only answer questions in a specific set of topics or whether their interests span a wide variety of topics.

To answer the above question, we will compute the correlation between a user's expertise level and the diversity of topics of questions they have answered. The first step is to define two variables (or measures); first for 'user expertise level' and then for 'user interest diversity'. Then we will use the Pearson correlation coefficient to measure the linear correlation between the two variables. We define the variables as:

   - VariableA (the measure of user expertise level). We will use the 'Reputation' column from 'users' table, which according to Stack Exchange's documentation "is a rough measurement of how much the community trusts you; it is earned by convincing your peers that you know what you're talking about" as an indicator of a user's expertise level on the platform. 

   - VariableB (the measure of user interest diversity). We measure the diversity of a user's interests as the number of distinct tags associated with the questions that user has answered, divided by the total number of unique tags, which is 638.
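
As a minimal sketch of the VariableB definition, the plain-Python example below computes the ratio for a few made-up answers. The user ids, the tag strings, and the `tag_ratio` helper are hypothetical and only illustrate the definition. The `<tag1><tag2>` layout of the tag string is an assumption about how the 'Tags' column is stored. The actual solution should still use the Spark API.

```python
import re
from collections import defaultdict

TOTAL_UNIQUE_TAGS = 638  # total number of unique tags, as stated in the task


def tag_ratio(answered, total=TOTAL_UNIQUE_TAGS):
    """Map each user id to (distinct tags over answered questions) / total.

    `answered` is an iterable of (user_id, tags_string) pairs, one per answer,
    where tags_string holds the tags of the question that was answered.
    """
    tags_per_user = defaultdict(set)
    for user_id, tags in answered:
        # Assumed format: "<python><spark>" -> ["python", "spark"]
        tags_per_user[user_id].update(re.findall(r"<([^<>]+)>", tags))
    return {user: len(tags) / total for user, tags in tags_per_user.items()}


# Hypothetical sample: user 1 answered two questions, user 2 answered one
sample = [
    (1, "<python><spark>"),
    (1, "<spark><sql>"),
    (2, "<python>"),
]
ratios = tag_ratio(sample)
print(ratios)
```

Note that a tag shared by two answered questions (here `spark` for user 1) is counted only once, because the tags are collected per user before counting.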

Compute the Pearson correlation coefficient between VariableA and VariableB, and based on the result you've got, answer the following question: 

     Do expert users have specific interests or do they have general interests?

Please explain your reasoning on how you reached your answer.

You should use the Apache Spark API for your implementation. You can use the Spark implementation of the Pearson correlation coefficient.
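
For reference, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below shows that formula in plain Python on made-up numbers. The `pearson_r` helper and the sample values are hypothetical and are not part of the assignment. In Spark the same quantity should be available through `DataFrame.corr(col1, col2)`, which is assumed here to be the built-in implementation the task refers to.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    if sq_dev_x == 0 or sq_dev_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    # The constant factors cancel, so no division by n or n - 1 is needed
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Hypothetical values: reputation rises while the tag ratio falls
reputation = [10, 200, 3000, 45000]
ratio = [0.05, 0.04, 0.02, 0.01]
print(pearson_r(reputation, ratio))
```

A value close to 1 or -1 indicates a strong linear relationship, and a value close to 0 indicates little or no linear relationship.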

In [0]:
"""
Scratch: 

variable a: Reputation of user, from user table
for every user, find reputation
  helper method: return a dataframe with useriD and reputation?
  

variable b: 
for every question per user count unique tags and divide by total
total unique tags = 638

postTypeId [1] = questions
postTypeId [2] = answers


SELECT COUNT(DISTINCT tags)


dataframe a
postTypeId[1]
Cols(postId(int), tags(string))

dataframe b
postTypeId[2] dfa joined on parentId
Cols(postId(int), tags(string))

dataframe c
UserId dfb joined on OwnerId
Cols(UserId(int), tags(string))

"""

In [0]:
#Methods

#Run query from task 1
def run_query(query: "a SQL query string", df: "the DataFrame that the query will be executed on") -> Any:
    df.createOrReplaceTempView("df")
    return spark.sql(query).collect()
  
#Run query from task 2  
def run_query2(query: "a SQL query string", df1: "DataFrame A", df2: "DataFrame B") -> Any:
    df1.createOrReplaceTempView("df1")
    df2.createOrReplaceTempView("df2")
    result = spark.sql(query).collect()
    return result
  
#Max num tags
numberOfDistinctTags = 638

def getVarA(df: "the DataFrame that the query will be executed on") -> DataFrame:
  df = df.select("Id","Reputation")  
  #df.show()
  return df

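def getVarB(df: "the posts DataFrame") -> DataFrame:
  # Questions (PostTypeId 1) carry the tags; answers (PostTypeId 2) point to them via ParentId
  questions = df.filter(col("PostTypeId") == 1).select(col("Id").alias("QuestionId"), "Tags")
  answers = df.filter(col("PostTypeId") == 2).select("OwnerUserId", "ParentId")
  # Give every answer the tags of the question it answers
  answered = answers.join(questions, answers.ParentId == questions.QuestionId)
  # Splitting "<tag1><tag2>" on < and > leaves empty strings, so drop them
  tags = answered.withColumn("tag", explode(split(col("Tags"), "[<>]"))).filter(col("tag") != "")
  # Distinct tags per answering user, divided by the total number of unique tags
  return tags.groupBy("OwnerUserId").agg((countDistinct("tag")/numberOfDistinctTags).alias("Tag_Ratio"))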

def joinDfAandB(df1: "Dataframe A", df2: "Dataframe B") -> DataFrame:
  # Join directly on the DataFrames instead of collecting rows to the driver and rebuilding a DataFrame
  return df1.join(df2, df1.Id == df2.OwnerUserId).select("Reputation", "Tag_Ratio")

# Compute Pearson's r as in task 3: sample covariance divided by the product of the sample standard deviations
def compute_pearsons_r(df: "a DataFrame", col1: "name of column A", col2: "name of column B") -> float:
  covariance = df.cov(col1, col2)
  standardDeviation1 = df.select(stddev(col1)).collect()[0][0]
  standardDeviation2 = df.select(stddev(col2)).collect()[0][0]
  r = covariance/(standardDeviation1*standardDeviation2)
  return r




In [0]:
dfVarA = getVarA(users_df)
dfVarB = getVarB(posts_df)
joined = joinDfAandB(dfVarA,dfVarB)
r = compute_pearsons_r(joined, "Reputation", "Tag_Ratio")
print("Here we have pearsons correlation coefficient: " + str(r))

###### Reflection on question: "Do expert users have specific interests or do they have general interests?"

We first computed variable A, which simply meant extracting the 'Reputation' column from the users table. Variable B was more complex. We first found the tags of every post of type 1 (questions), then joined these with the answers on ParentId so that each answer inherits the tags of the question it answers, and then connected those tags to the user who wrote the answer via OwnerUserId. Finally, we joined the two tables on the user's Id and OwnerUserId, and only then could we compute Pearson's correlation coefficient on the two columns.

The result was not quite what we expected: we expected a higher value. This might indicate either that there is no real correlation or that there is a small bug in the implementation. The value is negative, which indicates a negative correlation: users with higher reputation tend to have more specific interests, while users with lower reputation tend to have more general interests. However, since the value is so small in magnitude, the correlation between the two columns is weak at best.

In [0]:
## YOUR IMPLEMENTATION ##

In [0]:
## YOUR IMPLEMENTATION ##