#### Names of people in the group

Please write the names of the people in your group in the next cell.

Eivind Kjosbakken

In [0]:
# Loading modules that we need
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Add your imports below this line

from pyspark.sql import DataFrame  # used in the return annotation of load_df
import re

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

# Uncomment if you need
# comments_df = load_df("/user/hive/warehouse/comments")
# badges_df = load_df("/user/hive/warehouse/badges")

#### The problem: mining the interests of experts

The primary role of a question-and-answer platform such as Stack Exchange is to connect two types of people: people who have questions in areas such as computer science or data science, and knowledgeable people who can answer those questions reliably. Let's call the first category of people 'knowledge seekers' and the second one 'expert users' or 'experts' for short.

Here we want to use our data to answer a question about the diversity of topics that experts are interested in. We want to know whether expert users only answer questions in a specific set of topics or whether their interests cover a wide variety of topics.

To answer the above question, we will compute the correlation between a user's expertise level and the diversity of topics of questions they have answered. The first step is to define two variables (or measures); first for 'user expertise level' and then for 'user interest diversity'. Then we will use the Pearson correlation coefficient to measure the linear correlation between the two variables. We define the variables as:

   - VariableA (the measure of user expertise level). We will use the 'Reputation' column from the 'users' table as an indicator of a user's expertise level on the platform. According to Stack Exchange's documentation, reputation "is a rough measurement of how much the community trusts you; it is earned by convincing your peers that you know what you're talking about".

   - VariableB (the measure of user interest diversity). We measure the diversity of a user's interests as the number of distinct tags associated with the questions that user has answered, divided by the total number of unique tags, which is 638.
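
As an illustration of VariableB, here is a minimal pure-Python sketch. It assumes that tags are stored as strings of the form `<tag1><tag2>` and that a question without tags has a null value. The function name `tag_diversity` and the example data are hypothetical, and the sketch is not part of the required Spark implementation.

```python
import re

TOTAL_UNIQUE_TAGS = 638  # total number of unique tags, as stated in the task


def tag_diversity(tag_strings):
    """Return the fraction of all unique tags covered by the given tag strings.

    Each element is assumed to look like "<python><spark>", or to be None
    when the question has no tags.
    """
    distinct = set()
    for tags in tag_strings:
        if tags:  # skip None and empty strings
            # Extract every tag name that sits between a '<' and a '>'
            distinct.update(re.findall(r"<([^<>]+)>", tags))
    return len(distinct) / TOTAL_UNIQUE_TAGS


# Hypothetical user who answered three questions, one of them untagged
example = ["<python><pandas>", "<python><apache-spark>", None]
print(tag_diversity(example))  # 3 distinct tags out of 638
```

The repeated `python` tag is counted once, so this user covers 3 of the 638 tags.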

Compute the Pearson correlation coefficient between VariableA and VariableB, and based on the result you've got, answer the following question: 

     Do expert users have specific interests or do they have general interests?

Please explain your reasoning on how you reached your answer.

You should use Apache Spark API for your implementation. You can use the Spark implementation of the Pearson correlation coefficient.
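
For reference, the Pearson correlation coefficient of $n$ paired observations $(x_i, y_i)$ is usually defined as below. Reading $x_i$ as a user's value of VariableA and $y_i$ as the same user's value of VariableB is an interpretation added here, not something the task prescribes.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The value of $r$ always lies between $-1$ and $1$.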

In [0]:
## YOUR IMPLEMENTATION ##
# Counts the unique tags in a concatenated tag string such as "<python><spark>" and
# returns that count divided by 638, which is the total number of unique tags
def forEachRow(x):
  values = x  # x is the Tags string of one row, with every tag wrapped in < >
  if values is None:  # a null Tags value means there are no tags, so the diversity is 0
    return 0.0
  values = re.split(r"<|>", values)  # split on < and >, leaving the tag names and some empty strings
  valueSet = set()  # a set, because only unique tags should be counted
  for value in values:
    if value != "":
      valueSet.add(value)
  numTags = len(valueSet)
  return numTags / 638

  


In [0]:
# NOTE: this implementation only looks at users who have answered at least one post that has tags, in other words only at expert users. I could also have included all users and set the number of tags to 0 for users who had not answered any posts (I tried that too and got a coefficient of approximately 0.7, so almost the same as in the implementation below)

# createOrReplaceTempView returns None, so there is nothing to assign
users_df.createOrReplaceTempView("df1")  # users
posts_df.createOrReplaceTempView("df2")  # posts


q = "select df1.id as userId, parentId, Reputation from df1 inner join df2 on df1.id = df2.owneruserid where posttypeid = 2" #gather all parentposts, and userid where a user has answered a post

ans3 = spark.sql(q)
ans3.createOrReplaceTempView("df3")
q = "select userId, Tags, Reputation from df2 inner join df3 on df3.parentId = df2.id" #get the tags to the posts the users replied to, and the repliers id, as well as the repliers Reputation
ans4 = spark.sql(q)

# A user who has answered several questions has one row per answer. Grouping by userId and Reputation
# and joining the collected Tags strings puts all tags of the questions a user answered into one row.
# collect_list is a Spark aggregate, so the data stays in the DataFrame and is not pulled into a Python list.
# No sorting is needed, because the tags are put into a set when they are counted.
ans5 = ans4.groupBy("userId", "Reputation").agg(array_join(collect_list("Tags"), delimiter="").alias("Tags"))

# x[0] is the userId (not needed here), x[1] is the Reputation column and x[2] is the Tags column of ans5.
# forEachRow (defined in the cell above) replaces the tag string with a float: the number of unique tags divided by 638.
ans6 = ans5.rdd.map(lambda x: (x[1], forEachRow(x[2])))

schema = StructType([StructField('Reputation', IntegerType(), False), StructField('Diversity', FloatType(), False)])  #make schema so I can make the rdd a df again
df5 = spark.createDataFrame(ans6, schema)


corr = df5.corr("Reputation", "Diversity", "pearson")
print("corr is: ", corr)
#with this I get a Pearson correlation coefficient of 0.7217677665044138

corr is:  0.7217677665044138


The Pearson correlation coefficient is a number between -1 and 1. A coefficient close to 1 indicates a strong positive linear relationship between two variables: as the values of one increase, the values of the other tend to increase as well. A coefficient close to -1 indicates the opposite: as the values of one increase, the values of the other tend to decrease. A coefficient close to 0 indicates little or no linear relationship. I got a coefficient of 0.7217677665044138, which indicates a fairly strong positive correlation between expertise level and interest diversity. In other words, the higher a user's reputation, the more distinct tags the questions they have answered tend to cover. My conclusion is therefore that expert users tend to have general, diverse interests rather than specific ones.
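
To make this interpretation concrete, here is a minimal pure-Python sketch of the standard sample Pearson formula, applied to small made-up lists. The function name `pearson` and all numbers are hypothetical illustrations. They are not the Stack Exchange data, and the sketch does not replace the Spark `corr` call used above.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variances up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up reputations and diversities for four hypothetical users
reputation = [10, 200, 1500, 9000]
diversity_up = [1 / 638, 5 / 638, 40 / 638, 120 / 638]  # grows with reputation
diversity_down = list(reversed(diversity_up))           # shrinks with reputation

print(pearson(reputation, diversity_up))    # positive coefficient
print(pearson(reputation, diversity_down))  # negative coefficient
```

When diversity grows with reputation the coefficient is positive, and when it shrinks the coefficient is negative, which is the pattern the explanation above relies on.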