#### Common warnings:

1. __Back up your solution to the 'work' directory inside the home directory ('/home/jovyan'). It is the only directory whose state is saved between sessions.__

1. Please ensure that you call the right interpreter (python2 or python3). Do not write just "python" without the major version: there is no guarantee that any particular version of Python is set as the default in the Grading system.

1. Each cell must contain only one programming language.
E.g. if a cell contains Python code and you also want to call a bash command (using “!”) in it, move the bash command to another cell.

1. Our IPython converter is an improved version of the standard converter nbconvert, and it can process most of Jupyter's magic commands correctly (e.g. it understands "%%bash" and executes the cell as a bash script). However, we highly recommend avoiding magics wherever possible.

#### Spark specific warnings:

1. It is good practice to run Spark with master "yarn". However, the performance of the containerized system is limited. If you see repeated Py4JJavaError or Py4JNetworkError exceptions that you believe are unrelated to your code, feel free to change the master to “local”.

1. You should eliminate extra symbols in the output (such as quotes, brackets, etc.). When you finally get the resulting dataframe, it is easier to print wiki.take(1) than to traverse the RDD with a for loop. But in that case a lot of junk symbols will be printed, like: `[['Anarchism', 'is', .. ]]`. See the correct output example in the task.
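
A minimal sketch of the difference, using a plain Python list in place of the result of `take()`. The sample rows and the helper name `format_row` are made up for illustration and are not part of the task:

```python
# Hypothetical sample rows standing in for the result of wiki.take(2).
rows = [["Anarchism", "is", "a"], ["political", "philosophy"]]


def format_row(row):
    """Join the fields of one row with single spaces, without quotes or brackets."""
    return " ".join(str(field) for field in row)


# print(rows) would leak Python's repr: [['Anarchism', 'is', 'a'], ...]
# Printing row by row produces clean lines instead.
for row in rows:
    print(format_row(row))
```

The exact separator depends on the output format required by the task; a single space is only an assumption here.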

In [1]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()

In [2]:
graphPath = "/data/graphDFSample"

In [3]:
from pyspark.sql.functions import explode, collect_list, size, col, row_number, sort_array
from pyspark.sql import Window

In [5]:
reversedGraph = spark_session.read.parquet(graphPath) \
    .withColumn("friend", explode('friends')) \
    .groupBy("friend") \
    .agg(collect_list("user").alias("users")) \
    .withColumn("users", sort_array('users')) \
    .where(size("users") > 1)
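
As a reading aid, the cell above can be mirrored in plain Python on a toy graph: each user's friend list is inverted so that every friend maps to the sorted list of users who list them, and friends listed by only one user are dropped. The toy data and the function name `reverse_graph` are assumptions for illustration, not part of the dataset:

```python
from collections import defaultdict

# Hypothetical toy graph: user id -> list of friend ids.
graph = {1: [2, 3], 2: [3], 4: [3, 2]}


def reverse_graph(graph):
    """Map each friend to the sorted list of users who list them as a friend."""
    reversed_lists = defaultdict(list)
    for user, friends in graph.items():       # explode('friends')
        for friend in friends:
            reversed_lists[friend].append(user)  # groupBy + collect_list
    # sort_array, then keep only friends shared by more than one user
    return {
        friend: sorted(users)
        for friend, users in reversed_lists.items()
        if len(users) > 1
    }


print(reverse_graph(graph))  # {2: [1, 4], 3: [1, 2, 4]}
```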

In [10]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

In [12]:
def emit_all_pairs(friends_array):
    ret = []
    friends_len = len(friends_array)
    for i in range(friends_len):
        for j in range(i+1,friends_len):
            ret.append((friends_array[i], friends_array[j]))
    return ret

pair_schema = StructType([
    StructField("u1", IntegerType(), False),
    StructField("u2", IntegerType(), False)
])

emit_all_pairs_udf = udf(emit_all_pairs, ArrayType(pair_schema))
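
The pair-generation logic can be checked in plain Python before it is wrapped in a UDF. The sketch below restates `emit_all_pairs` and compares it with `itertools.combinations`, which yields the same pairs in the same order; the input list is made up for illustration:

```python
from itertools import combinations


def emit_all_pairs(friends_array):
    """Return every pair (a, b) where a comes before b in friends_array."""
    ret = []
    friends_len = len(friends_array)
    for i in range(friends_len):
        for j in range(i + 1, friends_len):
            ret.append((friends_array[i], friends_array[j]))
    return ret


sample_users = [3, 7, 11]  # hypothetical, already sorted user ids
pairs_from_loop = emit_all_pairs(sample_users)
pairs_from_itertools = list(combinations(sample_users, 2))

print(pairs_from_loop)                          # [(3, 7), (3, 11), (7, 11)]
print(pairs_from_loop == pairs_from_itertools)  # True
```

Because the input array is sorted beforehand, each pair always comes out with the smaller id first, so the same two users are never counted under two different orderings.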

In [17]:
from pyspark.sql.functions import count, desc
pairs = reversedGraph.withColumn("pairs", emit_all_pairs_udf("users")) \
    .withColumn("pair", explode("pairs")) \
    .groupBy("pair") \
    .agg(count("pair").alias("pair_count"))

In [18]:
results = pairs.select(col("pair_count"), "pair.*") \
    .orderBy(desc("pair_count"), desc("u1"), desc("u2")) \
    .limit(49) \
    .collect()

In [None]:
spark_session.stop()

#### Final notice:

1. Please take into account that you must __not__ redirect __stderr__ anywhere. Hadoop, Hive, and Spark print their logs to stderr, and the Grading system also reads and analyses it.

1. When checking the code from the notebook, the system runs all of the notebook's cells and reads the output of only the last non-empty cell. Consequently, no cell may raise an exception when run. If you decide to write some text in a cell, change the cell type to Markdown (Cell -> Cell type -> Markdown).

1. The Grader takes into account the output from the sample dataset you have in the notebook. Therefore, you have to "Run All" cells in the notebook before you submit the ipynb solution.

1. The name of the notebook must contain only Latin letters, digits, and the characters “-” or “_”. For example, Windows adds something like " (2)" (with the leading space) to the end of a filename if you download a file with the same name. This is a problem, because the name will then contain a space character and parentheses "(" and ")".
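
A minimal sketch of a local check for the naming rule above. It assumes the rule applies to the file name without its ".ipynb" extension; the function name `is_valid_notebook_name` and the example file names are hypothetical:

```python
import re
from pathlib import Path

# Assumption: only the part before ".ipynb" is restricted to letters, digits, "-" and "_".
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_notebook_name(filename):
    """Return True if the notebook name contains only allowed characters."""
    stem = Path(filename).stem  # drop the ".ipynb" extension
    return ALLOWED_NAME.fullmatch(stem) is not None


print(is_valid_notebook_name("common_friends-v2.ipynb"))   # True
print(is_valid_notebook_name("common_friends (2).ipynb"))  # False
```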

In [None]:
for val in results:
    print('%s %s %s' % val)