#### Names of people in the group

Please write the names of the people in your group in the next cell.

Oda Colquhoun

Emil Bjørlykke Berglund

In [0]:
# We need to install 'ipython_unittest' to run unittests in a Jupyter notebook
!pip install -q ipython_unittest

In [0]:
# Loading modules that we need
from pyspark.sql.dataframe import DataFrame
from collections import Counter
from pyspark.sql.functions import desc

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

#users_df = load_df("/FileStore/tables/users.parquet")
#posts_df = load_df("/FileStore/tables/posts.parquet")

#### Subtask 1: implementing two functions
Implement these two functions:
1. 'compute_pearsons_r' that receives a DataFrame and two column names and returns the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between values of two columns;
2. 'make_tag_graph' that receives the DataFrame containing the records related to 'questions' and returns a DataFrame with two columns 'u' and 'v'; row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.
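
As a minimal sketch of what item 2 asks for, the plain-Python snippet below lists the tag pairs that a single question contributes. It assumes the tags are stored in one string of the form `<tag1><tag2>`; `parse_tags` and `tag_pairs` are hypothetical helper names, and the example tag string is made up for illustration.

```python
from itertools import permutations

def parse_tags(tag_string):
    # "<a><b>" -> ["a", "b"]; assumes the angle-bracket format described above
    return tag_string.strip("<>").split("><")

def tag_pairs(tag_string):
    # Every ordered pair (u, v) of distinct tags that appear on the same question
    tags = sorted(set(parse_tags(tag_string)))
    return list(permutations(tags, 2))

pairs = tag_pairs("<bigdata><machine-learning><libsvm>")
```

Three tags give six ordered pairs, since each unordered pair is emitted in both directions. A question with a single tag yields no pairs in this sketch.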

Please note that you should implement the 'compute_pearsons_r' yourself, so you should not use the 'DataFrame.stat.corr' method. Nevertheless, you can use 'DataFrame.stat.corr' to verify the correctness of your implementation.
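
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it on plain Python lists, only to make the formula concrete; `pearsons_r` is a hypothetical name, and a Spark implementation would compute the same sums with DataFrame aggregations instead of Python loops.

```python
from math import sqrt

def pearsons_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r_linear = pearsons_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing relation
r_inverse = pearsons_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing relation
```

A perfectly increasing linear relation gives r = 1 and a perfectly decreasing one gives r = -1.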

In [0]:
from pyspark.sql.functions import stddev

def compute_pearsons_r(df: "a DataFrame", col1: "name of column A", col2: "name of column B") -> float:
    
  ## YOUR IMPLEMENTATION ##
  # Pearson's r = sample covariance / (product of the sample standard deviations)
  cov = df.cov(col1, col2)
  s1 = df.select(stddev(col1)).collect()[0][0]
  s2 = df.select(stddev(col2)).collect()[0][0]
  r = cov / (s1 * s2)
  return r

def tupling(tag_list):
  # A question with a single tag yields a self-loop so that the tag still appears in the graph
  if len(tag_list) == 1:
    return [(tag_list[0], tag_list[0])]
  # Otherwise emit every ordered pair of distinct tags; the set removes duplicates
  tuples = set()
  for i in tag_list:
    for j in tag_list:
      if i != j:
        tuples.add((i, j))
  return list(tuples)

def make_tag_graph(df: "DataFrame containing question data") -> DataFrame:
  # Tags are stored as "<tag1><tag2>..."; strip the outer brackets and split on "><"
  tags = df.select("Tags").rdd.map(lambda x: x[0].lstrip("<").rstrip(">").split("><")).flatMap(tupling)
  edges_df = spark.createDataFrame(tags).toDF("u", "v")
  return edges_df

#def make_tag_graph(df: "DataFrame containing question data") -> DataFrame:
    
  """
  ## YOUR IMPLEMENTATION ##
  # Receives the posts DataFrame restricted to PostTypeId = 1
  # iterate through the rows
  #    select the string from the Tags column
  # tags[row] = string; split the string with x = txt.split(">"); all combinations are added as rows in a DataFrame(?)
  # return the DataFrame
  """
   
    

In [0]:
from pyspark.sql.functions import stddev

def compute_pearsons_r(df: "a DataFrame", col1: "name of column A", col2: "name of column B") -> float:
  # Pearson's r = sample covariance / (product of the sample standard deviations)
  covariance = df.cov(col1, col2)
  standardDeviation1 = df.select(stddev(col1)).collect()[0][0]
  standardDeviation2 = df.select(stddev(col2)).collect()[0][0]
  r = covariance / (standardDeviation1 * standardDeviation2)
  return r

def tupling(tag_list):
  # A question with a single tag yields a self-loop so that the tag still appears in the graph
  if len(tag_list) == 1:
    return [(tag_list[0], tag_list[0])]
  # Otherwise emit every ordered pair of distinct tags; the set removes duplicates
  tuples = set()
  for i in tag_list:
    for j in tag_list:
      if i != j:
        tuples.add((i, j))
  return list(tuples)

def make_tag_graph(df: "DataFrame containing question data") -> DataFrame:
  # Tags are stored as "<tag1><tag2>..."; strip the outer brackets and split on "><"
  tags = df.select("Tags").rdd.map(lambda x: x[0].lstrip("<").rstrip(">").split("><")).flatMap(tupling)
  edges_df = spark.createDataFrame(tags).toDF("u", "v")
  return edges_df
    
  """
  ## YOUR IMPLEMENTATION ##
  # Receives the posts DataFrame restricted to PostTypeId = 1
  # iterate through the rows
  #    select the string from the Tags column
  # tags[row] = string; split the string with x = txt.split(">"); all combinations are added as rows in a DataFrame(?)
  # return the DataFrame
  """


In [0]:
%load_ext ipython_unittest


In [0]:
%%unittest_main
class TestTask3(unittest.TestCase):
  
  error_threshold = 0.03
  
  def test_corr1(self):
    # Pearson correlation coefficient between 'user reputation' and 'upvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "UpVotes")
    self.assertLessEqual(abs(result-0.5218138310114108), self.error_threshold)
    print(result)
    
  def test_corr2(self):
    # Pearson correlation coefficient between 'user reputation' and 'downvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "DownVotes")
    self.assertLessEqual(abs(result-0.1473558141546844), self.error_threshold)
    print(result)

  def test_corr3(self):
    # Pearson correlation coefficient between 'question score' and the 'number of answers' it received
    result = compute_pearsons_r(posts_df[posts_df["PostTypeId"] == 1], "Score", "AnswerCount")
    self.assertLessEqual(abs(result-0.47855272641249674), self.error_threshold)
    print(result)
    
  def test_make_tag_graph(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    self.assertIsInstance(result, DataFrame)
    
    column_names = Counter(map(str.lower, ['u', 'v']))
    self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)), "Missing column(s) or column name mismatch")
    
    #display(result)
    
    self.assertEqual(result.count(), 228830)


In [0]:
# Importing GraphFrames graph library; make sure you have GraphFrames installed on the cluster
from graphframes import *

#### Subtask 2: implementing three functions
Implement these three functions:
1. 'get_nodes' that, given the result from execution of 'make_tag_graph', returns a DataFrame with one column named 'id' that includes the tags that have appeared in the tag graph;
2. 'get_edges' that, given the result from execution of 'make_tag_graph', returns a DataFrame with two columns 'src' and 'dst' where 'src' is the source node and 'dst' is the destination node.
3. 'compute_pagerank' that receives a GraphFrames graph object and computes the PageRank of the nodes in the graph, returning the result as a DataFrame with two columns named 'id' and 'pagerank'; the rows in the resulting DataFrame should be sorted by the values of the 'pagerank' column.

Note that the term 'tag graph' in this context refers to the DataFrame returned by executing 'make_tag_graph'. Furthermore, 'src' and 'dst' are distinct, so 'src' != 'dst'.
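
To make the PageRank step concrete, here is a minimal power-iteration sketch in plain Python on a tiny, made-up edge list. It is not the GraphFrames implementation: the damping factor of 0.85, the 20 iterations, and the function name `pagerank` are assumptions for illustration, and GraphFrames may scale the resulting values differently.

```python
def pagerank(edges, damping=0.85, iterations=20):
    # Assumes every node has at least one outgoing edge, which holds when
    # each tag pair is stored in both directions
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                # Each node splits its current rank evenly among its out-links
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    # Highest-ranked node first
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical symmetric tag graph centred on 'machine-learning'
edges = []
for tag in ["bigdata", "libsvm", "data-mining"]:
    edges.append(("machine-learning", tag))
    edges.append((tag, "machine-learning"))

ranked = pagerank(edges)
```

In this toy graph the hub tag ends up first, because every other tag links to it.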

In [0]:
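def get_nodes(df: "DataFrame of the tag graph") -> DataFrame:
  ## YOUR IMPLEMENTATION ##
  # Every tag appears in column 'u' (pairs are stored in both directions and
  # single-tag questions as self-loops), so the distinct 'u' values are the nodes.
  # Staying in the DataFrame API avoids collecting all rows to the driver.
  nodes_dataframe = df.select(df.u.alias("id")).distinct()
  return nodes_dataframe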

def get_edges(df: "DataFrame of the tag graph") -> DataFrame:
  ## YOUR IMPLEMENTATION ##
  # The DataFrame is almost identical to the tag graph, just with u = src and v = dst;
  # self-loops are dropped because 'src' and 'dst' must be distinct
  edges_dataframe = df.filter(df.u != df.v).select(df.u.alias("src"), df.v.alias("dst"))
  return edges_dataframe

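def compute_pagerank(graph: "a Graphframes graph") -> DataFrame:
  ## YOUR IMPLEMENTATION ##
  ## Note: We were unable to download and install graphframes, so this method is untested
  
  # pageRank() returns a new GraphFrame whose vertices carry a 'pagerank' column;
  # resetProbability and maxIter are assumed values, not given by the assignment
  results = graph.pageRank(resetProbability=0.15, maxIter=10)
  
  # 'desc' is imported from pyspark.sql.functions at the top of the notebook
  return results.vertices.select("id", "pagerank").orderBy(desc("pagerank"))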
  

In [0]:
# Loading 'ipython_unittest' so we can use '%%unittest_main' magic command
%load_ext ipython_unittest

#### Subtask 3: validating the implementation by running the tests

Run the cell below and make sure that all the tests run successfully.

In [0]:
%%unittest_main
class TestTask3(unittest.TestCase):
  
  error_threshold = 0.03
  
  def test_corr1(self):
    # Pearson correlation coefficient between 'user reputation' and 'upvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "UpVotes")
    self.assertLessEqual(abs(result-0.5218138310114108), self.error_threshold)
    print(result)
  
  def test_corr2(self):
    # Pearson correlation coefficient between 'user reputation' and 'downvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "DownVotes")
    self.assertLessEqual(abs(result-0.1473558141546844), self.error_threshold)
    print(result)

  def test_corr3(self):
    # Pearson correlation coefficient between 'question score' and the 'number of answers' it received
    result = compute_pearsons_r(posts_df[posts_df["PostTypeId"] == 1], "Score", "AnswerCount")
    self.assertLessEqual(abs(result-0.47855272641249674), self.error_threshold)
    print(result)
    
  def test_make_tag_graph(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    self.assertIsInstance(result, DataFrame)
    
    column_names = Counter(map(str.lower, ['u', 'v']))
    self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)), "Missing column(s) or column name mismatch")
    
    display(result)
    
    self.assertEqual(result.count(), 228830)
    
  def test_get_nodes(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    self.assertEqual(n.count(), 638)
    n.show()

  def test_get_edges(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    e = get_edges(result)
    
    column_names = Counter(map(str.lower, ['src', 'dst']))
    self.assertCountEqual(column_names, Counter(map(str.lower, e.columns)), "Missing column(s) or column name mismatch")
    
    self.assertEqual(e.count(), 225290)
    e.show()
    
  def test_compute_pagerank(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    e = get_edges(result)
    g = GraphFrame(n, e)
    ranks = compute_pagerank(g)
    self.assertEqual(ranks.first()[0], 'machine-learning')
    ranks.show()

u,v
machine-learning,machine-learning
open-source,education
education,open-source
definitions,data-mining
data-mining,definitions
databases,databases
bigdata,machine-learning
bigdata,libsvm
machine-learning,bigdata
machine-learning,libsvm


#### Subtask 4: answering questions about Spark-related concepts

Please write a short description for the terms below---one to two short paragraphs for each term. Don't copy-paste; instead, write your own understanding.

1. What do the terms 'User-Defined Functions (UDFs)', 'Data Locality', 'Bucketing', 'Distributed Filesystem' mean in the context of Spark?

Write your descriptions in the next cell.

###### User-Defined Functions (UDFs): 
In Spark, User-Defined Functions (UDFs) let users define their own functions and apply them to DataFrame or Dataset columns, or call them from Spark SQL, when the built-in functions cannot express the required logic. UDFs can be written in languages such as Scala, Java, and Python. Because a UDF is opaque to Spark's query optimizer, and a Python UDF additionally requires moving rows between the JVM and a Python process, built-in functions are usually faster and should be preferred when they can do the same job.
 
###### Data Locality: 
Data locality refers to the principle of moving the computation to the data instead of moving the data to the computation. In a distributed computing environment like Spark, where data is spread across multiple nodes in a cluster, data locality is crucial for performance optimization. By ensuring that the computation is performed on the same node where the data resides, data movement across the network can be minimized, resulting in faster processing times and reduced network traffic.
 
###### Bucketing: 
Bucketing is a data organization technique in Spark that distributes the rows of a table into a fixed number of buckets based on the hash of one or more columns. It is typically used with tables saved to the Hive metastore, where the rows of each bucket are written to their own files. Because rows with the same key always land in the same bucket, Spark can skip buckets that cannot match a filter on the bucketing column, and it can join or aggregate tables that are bucketed the same way without shuffling the data again. Both effects reduce the amount of data read and transferred over the network.
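
The idea of assigning rows to buckets by key can be sketched in plain Python. This is only an illustration under stated assumptions: `crc32` stands in for Spark's own hash function, the bucket count of 4 is arbitrary, and the `users` and `posts` rows are made up.

```python
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Deterministic stand-in for Spark's hash function
    return crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, key_index=0, num_buckets=NUM_BUCKETS):
    # Place each row in the bucket chosen by the hash of its key
    buckets = {b: [] for b in range(num_buckets)}
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets

users = bucketize([(1, "alice"), (2, "bob"), (3, "carol")])
posts = bucketize([(1, "post-a"), (3, "post-b"), (1, "post-c")])

# Rows with the same key share a bucket number, so a join on the key only
# compares bucket b of one table with bucket b of the other
joined = [(u, p) for b in range(NUM_BUCKETS)
          for u in users[b] for p in posts[b] if u[0] == p[0]]
```

Each bucket pair can be joined independently, which is why no extra shuffle is needed when both sides use the same key and bucket count.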
 
###### Distributed Filesystem: 
A distributed filesystem is a filesystem whose data is spread across multiple nodes in a cluster while presenting a single, unified view of the files. In the context of Spark, it stores the input and output data that Spark jobs process. Examples of storage systems used with Spark include the Hadoop Distributed File System (HDFS) and object stores such as Amazon S3, which Spark accesses through a filesystem-like interface. A distributed filesystem typically provides fault tolerance, by replicating data across nodes, and scalability, which lets Spark process large amounts of data in a distributed computing environment.