#### Names of people in the group

Please write the names of the people in your group in the next cell.

Name of person A Karl Edvin Undheim

Name of person B

In [0]:
# Loading modules that we need
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Add your imports below this line
from pyspark.sql import DataFrame  # needed for the return annotation of load_df

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

# Uncomment if you need
# comments_df = load_df("/user/hive/warehouse/comments")
# badges_df = load_df("/user/hive/warehouse/badges")

#### The problem: mining the interests of experts

The primary role of a question-and-answer platform such as Stack Exchange is to connect two types of people: people who have questions in areas such as computer science or data science, and knowledgeable people who can answer those questions reliably. Let's call the first category of people 'knowledge seekers' and the second one 'expert users' or 'experts' for short.

Here we want to use our data to answer a question about the diversity of topics that experts are interested in. We want to know whether expert users only answer questions on a specific set of topics or whether their interests span a wide variety of topics.

To answer the above question, we will compute the correlation between a user's expertise level and the diversity of topics of the questions they have answered. The first step is to define two variables (or measures): one for 'user expertise level' and one for 'user interest diversity'. Then we will use the Pearson correlation coefficient to measure the linear correlation between the two variables. We define the variables as:

   - VariableA (the measure of user expertise level). We will use the 'Reputation' column from the 'users' table, which according to Stack Exchange's documentation "is a rough measurement of how much the community trusts you; it is earned by convincing your peers that you know what you're talking about", as an indicator of a user's expertise level on the platform.

   - VariableB (the measure of user interest diversity). We measure the diversity of a user's interests by computing the total number of distinct tags associated with the questions each user has answered, divided by the total number of unique tags, which is 638.
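
To make the definition of VariableB concrete, here is a minimal plain-Python sketch on made-up data. It assumes the tags of a question are stored as one string of the form `<tag1><tag2>`, as in the `Tags` column of the posts table. The helper names `parse_tags` and `interest_diversity` and the example tag strings are hypothetical and only serve as illustration; the actual assignment should be solved with the Spark API.

```python
import re

TOTAL_UNIQUE_TAGS = 638  # total number of unique tags, as stated in the task


def parse_tags(tag_string):
    """Split a string such as '<r><dataset><pca>' into ['r', 'dataset', 'pca']."""
    return re.findall(r"<([^<>]+)>", tag_string)


def interest_diversity(tag_strings):
    """Distinct tags over all answered questions, divided by the total tag count."""
    distinct_tags = set()
    for tag_string in tag_strings:
        distinct_tags.update(parse_tags(tag_string))
    return len(distinct_tags) / TOTAL_UNIQUE_TAGS


# Hypothetical user who answered two questions sharing the tag 'bigdata'.
example_questions = ["<bigdata><scalability>", "<bigdata><svm>"]
diversity = interest_diversity(example_questions)  # 3 distinct tags / 638
```

Because a set is used, a tag that appears on several answered questions is only counted once, which matches the "distinct tags" wording of the definition.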

Compute the Pearson correlation coefficient between VariableA and VariableB, and based on the result you've got, answer the following question: 

     Do expert users have specific interests or do they have general interests?

Please explain your reasoning on how you reached your answer.

You should use the Apache Spark API for your implementation. You can use the Spark implementation of the Pearson correlation coefficient.
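
As a reminder of what the Pearson correlation coefficient measures, the following is a small plain-Python sketch of the textbook formula (covariance divided by the product of the standard deviations). It is not Spark's implementation; the function name `pearson` and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Made-up diversity and reputation values for five hypothetical users.
example_diversity = [0.01, 0.02, 0.05, 0.10, 0.20]
example_reputation = [50, 400, 300, 2500, 9000]
r = pearson(example_diversity, example_reputation)
```

The result always lies between -1 and 1. Values near 1 mean that the two variables tend to grow together along a straight line, values near -1 mean that one falls as the other grows, and values near 0 mean there is little linear relationship.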

In [0]:
# I make tempviews to use them in queries
users_df.createOrReplaceTempView('users')
posts_df.createOrReplaceTempView('posts')

# First I build a DataFrame with the user Id in one column and Tags in the other. Each row holds a user Id and the tag string of one question that user has answered. There is one row per answer, so a user can appear in several rows.

# This can be done in a single query. The subquery C pairs each user Id with the parentId of every answer (PostTypeId = 2) the user has written. C is then joined back to posts on C.parentId = posts.Id, which gives the tags of the answered question. Finally I select the user Id and Tags.
# ORDER BY is used for the final sort because SORT BY only sorts within each partition.

user_tags_query = """
    SELECT C.Id, posts.Tags
    FROM (
        SELECT users.Id, posts.parentId
        FROM users
        INNER JOIN posts ON users.Id = posts.OwnerUserId
        WHERE posts.PostTypeId = 2
    ) C
    INNER JOIN posts ON C.parentId = posts.Id
    ORDER BY C.Id ASC
"""

df1 = spark.sql(user_tags_query)
df1.show()

# This is shown below:

+---+--------------------+
| Id|                Tags|
+---+--------------------+
|  9|<bigdata><scalabi...|
|  9|<recommender-syst...|
| 11|<bigdata><apache-...|
| 11|    <classification>|
| 14|<bigdata><scalabi...|
| 14|<data-mining><clu...|
| 14|<bigdata><statist...|
| 14|<nlp><topic-model...|
| 14|<classification><...|
| 14|               <svm>|
| 14|   <r><dataset><pca>|
| 14|<machine-learning...|
| 14|<data-cleaning><l...|
| 14|<r><classificatio...|
| 14|<machine-learning...|
| 17|<machine-learning...|
| 17|      <keras><tools>|
| 17|<machine-learning...|
| 21|<machine-learning...|
| 21|<bigdata><google>...|
+---+--------------------+
only showing top 20 rows
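
The join logic of the query above can also be followed on a toy example without Spark. The sketch below is only an illustration under assumptions: the rows in `posts` and the Ids in `user_ids` are invented, and the lookup dictionary `questions_by_id` is a hypothetical stand-in for the second join. It relies on the Stack Exchange convention that `PostTypeId` 1 marks a question and 2 marks an answer.

```python
# Toy stand-in for the posts table; every row here is invented.
posts = [
    {"Id": 1, "PostTypeId": 1, "OwnerUserId": 100, "ParentId": None, "Tags": "<bigdata><svm>"},
    {"Id": 2, "PostTypeId": 2, "OwnerUserId": 9, "ParentId": 1, "Tags": None},
    {"Id": 3, "PostTypeId": 1, "OwnerUserId": 9, "ParentId": None, "Tags": "<r><pca>"},
    {"Id": 4, "PostTypeId": 2, "OwnerUserId": 11, "ParentId": 3, "Tags": None},
    {"Id": 5, "PostTypeId": 2, "OwnerUserId": 9, "ParentId": 3, "Tags": None},
]
user_ids = {9, 11, 100}  # toy stand-in for the Id column of the users table

# Index the questions by Id so that an answer's ParentId can be resolved.
questions_by_id = {p["Id"]: p for p in posts if p["PostTypeId"] == 1}

# One (user Id, Tags) pair per answer, sorted by user Id like the query result.
user_tags = sorted(
    (p["OwnerUserId"], questions_by_id[p["ParentId"]]["Tags"])
    for p in posts
    if p["PostTypeId"] == 2
    and p["OwnerUserId"] in user_ids
    and p["ParentId"] in questions_by_id
)
```

In this toy data user 9 wrote two answers and therefore appears twice in `user_tags`, once for each answered question, which mirrors the repeated Ids in the table above.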



In [0]:
# Now we need to count how many distinct tags are associated with each user. To do this I first expand the table so that every user-tag pairing is its own row. I replace every >< with >==< so that the tag string has a separator between the individual tags. Then I split on == and use Spark's explode function to turn the resulting array into one row per tag.
df1 = df1.withColumn('Tags', regexp_replace('Tags', '><', '>==<')).withColumn("Tags", explode(split("Tags", "==")))

df1.show()
df1.createOrReplaceTempView('df1')

# Now I count how many distinct tags are associated with each user with a simple query:
user_tagCount_query = "SELECT df1.Id, COUNT(DISTINCT df1.Tags) AS TagCount FROM df1 GROUP BY df1.Id"

df2 = spark.sql(user_tagCount_query)
df2.show()

df2.createOrReplaceTempView('df2')

+---+--------------------+
| Id|                Tags|
+---+--------------------+
|  9|           <bigdata>|
|  9|       <scalability>|
|  9|        <efficiency>|
|  9|       <performance>|
|  9|<recommender-system>|
|  9|<information-retr...|
| 11|           <bigdata>|
| 11|     <apache-hadoop>|
| 11|    <classification>|
| 14|           <bigdata>|
| 14|       <scalability>|
| 14|        <efficiency>|
| 14|       <performance>|
| 14|       <data-mining>|
| 14|        <clustering>|
| 14|            <octave>|
| 14|           <k-means>|
| 14|  <categorical-data>|
| 14|           <bigdata>|
| 14|        <statistics>|
+---+--------------------+
only showing top 20 rows

+-----+--------+
|   Id|TagCount|
+-----+--------+
|13285|      16|
|43527|       4|
|50223|       5|
|57693|       5|
|46465|       3|
|69478|      66|
|  471|      56|
|45615|      10|
|49717|       5|
|27760|       7|
|91299|       5|
|29054|       4|
|11141|      19|
| 9465|      11|
|80579|       3|
|85321|       4|
|74

In [0]:
# Now I select TagCount from df2 and convert it to diversity by dividing by 638 (the total number of distinct tags), together with Reputation from users.
# This is our final table, which can be used in the Pearson calculation.
# ORDER BY is used because SORT BY only sorts within each partition.
q = "SELECT TagCount/638 AS Diversity, Reputation FROM df2 INNER JOIN users ON df2.Id = users.Id ORDER BY Diversity DESC"

df3 = spark.sql(q)
df3.show()

r1 = df3.corr("Diversity", "Reputation")
print(r1)

# Do expert users have specific interests or do they have general interests?
# Answer:
# The correlation coefficient is about 0.72, which is generally read as a strong positive linear correlation.
# So, among users who have answered at least one question, a higher reputation tends to go together with a larger number of distinct tags on the questions the user has answered.
# This suggests that expert users have general rather than specific interests.

+-------------------+----------+
|          Diversity|Reputation|
+-------------------+----------+
|0.48746081504702193|     10037|
|0.46551724137931033|     10346|
|0.32445141065830724|     11711|
|0.31347962382445144|      8206|
| 0.3056426332288401|      4782|
|0.30094043887147337|      2899|
|0.28213166144200624|      6349|
| 0.2664576802507837|      4611|
| 0.2476489028213166|     24229|
|0.24294670846394983|      4083|
|0.23981191222570533|     11793|
| 0.2335423197492163|      4549|
| 0.2225705329153605|      4679|
| 0.2225705329153605|      4109|
|0.21786833855799373|      5211|
|0.21630094043887146|      7044|
| 0.2115987460815047|      7248|
| 0.2115987460815047|      7613|
|0.21003134796238246|      9821|
| 0.2084639498432602|       674|
+-------------------+----------+
only showing top 20 rows

0.7217677648622982
