In [None]:
%%html
<style>
.h1_cell, .just_text {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-family: "Times New Roman", Georgia, Serif;
    font-size: 125%;
    line-height: 22px; /* 5px +12px + 5px */
    text-indent: 25px;
    background-color: #fbfbea;
    padding: 10px;
}
.code_block {
    box-sizing: border-box;
    padding-top:5px;
    padding-bottom:5px;
    font-size: 75%;
    line-height: 22px; /* 5px +12px + 5px */
    /* text-indent: 25px; */
    /* background-color: #fbfbea; */
    padding: 5px;
}

hr { 
    display: block;
    margin-top: 0.5em;
    margin-bottom: 0.5em;
    margin-left: auto;
    margin-right: auto;
    border-style: inset;
    border-width: 2px;
}
</style>

<h2>
<center>
Can we Parallelize KNN?
</center>
</h2>

In [None]:
import os
import sys
import subprocess

In [None]:
os.environ['SPARK_HOME'] = os.environ['HOME'] + '/spark'
os.environ['PATH'] += ':' + os.environ['SPARK_HOME'] + '/bin'
sys.path.append(os.environ['SPARK_HOME'] + '/python')
sys.path.append(os.environ['SPARK_HOME'] + '/python/lib/py4j-0.10.6-src.zip')

In [None]:
# run start-all.sh
subprocess.call(os.environ['SPARK_HOME'] + "/sbin/start-all.sh", env=os.environ)

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession(SparkContext(master='spark://instance-1:7077'))
spark

<div class=h1_cell>
<p>
Sometimes it's better to think about these problems from the bottom up. Is there a row-wise operation that can be run on the dataset in parallel?
<p>
Yes. Given a row, we can calculate its distance from every other row in the distributed dataset.
<p>
Let's import our dataset using pandas and use our NLP code from pyspark_lsa.ipynb to process the text.
</div>
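
<div class=h1_cell>
<p>
To make the row-wise idea concrete, here is a minimal sketch in plain Python, with no Spark involved. The names euclidean and distances_to and the toy rows are assumptions made for illustration; they are not part of the notebook's code. The point is only that each distance depends on the query and a single row, so the rows can be processed independently.
</div>

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length numeric sequences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distances_to(query, rows):
    # Each distance needs only the query and one row, so the per-row
    # work is independent and could be split across workers.
    return [euclidean(query, row) for row in rows]

# hypothetical toy data, not the notebook's matrix
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
query = [0.0, 0.0]
dists = distances_to(query, rows)
print(dists)
```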

In [None]:
!pip install pandas  # if not already installed

In [None]:
import pandas as pd

pd.set_option('display.max_columns', 500)
gothic_table = pd.read_csv('https://bit.ly/2HVSx3X', encoding='utf-8')
gothic_table.head(5)

In [None]:
!pip install nltk  # if not already installed

<div class=h1_cell>
<p>
Generating the co-occurrence matrix for the first 10 sentences produces a matrix with 180 columns. The first 100 sentences produce 1149 columns. Including all sentences would produce an extremely large, sparse matrix that takes too long to compute here. I'm reading it in from Google Drive, but the code for computing the matrix is below if you want to run it. Getting the first 100 rows with head(100) is plenty for this exercise.
<p>
If you want the entire matrix, the code below needs about 10-15 GB of memory at its peak. At least, that is what happened on my local machine. I suggest you compute it on your local machine, then either upload it to Google Drive and read it in with pandas or send the CSV directly to the head node with scp. The CSV file is just over 1 GB.
<p>
If you want or need to compute it here, I suggest you destroy this node and spin up a new one with a disk of at least 20 GB. Then install Spark and rewire your cluster configuration to connect to the new node as the master. The other, more complex route is to add another disk and merge it into the main partition; see https://cloud.google.com/compute/docs/disks/add-persistent-disk
</div>
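
<div class=h1_cell>
<p>
For readers who have not seen a co-occurrence matrix before, the sketch below counts how often two words appear within a fixed window of each other. It is not the comatrix function imported in the next cell, whose implementation is not shown here. The function name cooccurrence_counts, the meaning of window (positions to either side of a word), and the toy sentences are all assumptions. Every vocabulary word becomes a column, and most word pairs never co-occur, which is why the real matrix is both wide and sparse.
</div>

```python
from collections import Counter

def cooccurrence_counts(sentences, window=3):
    # sentences: list of token lists; returns a Counter keyed by (word, neighbor)
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # neighbors up to `window` positions to either side of position i
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# hypothetical toy sentences, already tokenized
toy = [["the", "raven", "spoke"], ["the", "raven", "flew"]]
counts = cooccurrence_counts(toy, window=1)

# one matrix column (and row) per distinct word
vocab = sorted({w for s in toy for w in s})
print(vocab, counts[("the", "raven")])
```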

In [None]:
"""
from nlp import get_bag_and_tokenize
from lnalg import comatrix

bag, sentences = get_bag_and_tokenize(gothic_table.head(100), 'text')
cm = comatrix(bag, sentences, window=3)
cm.head(5)
"""

cm = pd.read_csv("../three_authors.csv", index_col=0)
cm.head()

In [None]:
cm.to_csv('three_authors.csv')

<div class=h1_cell>
<p>
Let's create a function that calculates the distance between a row and a fixed numpy ndarray. We can pass Python functions to Spark, and Spark will apply them to each row in the distributed dataset.
</div>

In [None]:
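import numpy as np
from random import random

# random reference vector with one entry per column of the co-occurrence matrix
randv = np.random.rand(len(cm.columns))

def distance(x):
    # Euclidean distance from x to randv; Spark ML vectors expose toArray()
    arr = x.toArray() if hasattr(x, 'toArray') else np.asarray(x, dtype=float)
    return float(np.linalg.norm(arr - randv))

# quick check with a plain Python list of the right length
distance([random() for _ in range(len(cm.columns))])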

<div class=h1_cell>
<p>
Let's convert our pandas dataframe to a Spark dataframe. We can pass in a Python data structure or the pandas dataframe itself. However, we will ultimately need a single column that holds all the data for a row, so let's create that.
</div>

In [None]:
# wrap each matrix row in a one-element list so it lands in a single "features" column
df = spark.createDataFrame([[row] for row in cm.values.tolist()], schema=["features"])
df.printSchema()
df.show(5)

<div class=h1_cell>
<p>
This is good, but we can make it better. Spark can use a SparseVector to represent arrays that are mostly zeros, which cuts down on the space needed to store our features vector. We'll pass in the pandas dataframe, build the features vector with VectorAssembler (which stores mostly-zero rows as SparseVectors), and drop the rest of the columns.
</div>
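
<div class=h1_cell>
<p>
As a rough illustration of why a sparse representation saves space, the sketch below stores a mostly-zero row as a size plus the indices and values of its non-zero entries. As far as I know, these are the same three pieces that pyspark.ml.linalg.SparseVector is built from, but the helper names to_sparse and to_dense and the toy row are assumptions made for this example and use no Spark code.
</div>

```python
def to_sparse(dense_row):
    # Keep only the non-zero entries as (size, indices, values)
    indices = [i for i, v in enumerate(dense_row) if v != 0]
    values = [dense_row[i] for i in indices]
    return len(dense_row), indices, values

def to_dense(size, indices, values):
    # Rebuild the full row, filling the gaps with zeros
    row = [0.0] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row

# hypothetical toy row: 8 entries, only 2 of them non-zero
dense = [0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
size, indices, values = to_sparse(dense)
print(size, indices, values)
```

<div class=h1_cell>
<p>
Here 8 stored numbers shrink to a size, 2 indices, and 2 values. The saving grows with the share of zeros, which is large in a co-occurrence matrix.
</div>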

In [None]:
cm['idx'] = cm.index
cmdf = spark.createDataFrame(cm, schema=list(cm.columns))  # pass the pandas dataframe straight in
cmdf.select('idx', 'gold', 'groundwork', 'desk', 'fantastic', 'years').show(5)  # many, many columns

In [None]:
from pyspark.ml.feature import VectorAssembler

cols = list(cmdf.columns)
cols.remove('idx')
vdf = VectorAssembler(inputCols=cols, outputCol="features").transform(cmdf)
vdf = vdf.drop(*cols)
vdf.show(5)

In [None]:
type(vdf.head().features)

In [None]:
type(vdf.head())

In [None]:
distance(vdf.head().features)

<div class=h1_cell>
<p>
Let's Spark-ify that distance function.
</div>

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

dist = udf(distance, FloatType())  # spark user-defined-function for distance

In [None]:
vdf = vdf.withColumn('distance', dist('features'))

In [None]:
vdf.orderBy(vdf.distance).show(5)
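
<div class=h1_cell>
<p>
Once every row has a distance, the nearest neighbors are simply the k rows with the smallest values. The sketch below shows that selection step in plain Python on a handful of (id, distance) pairs. The function name k_nearest and the numbers are made up for illustration; they are not output from the cells above.
</div>

```python
import heapq

def k_nearest(pairs, k):
    # pairs: iterable of (row_id, distance); returns the k closest, nearest first
    return heapq.nsmallest(k, pairs, key=lambda p: p[1])

# hypothetical (idx, distance) pairs, standing in for the distance column
pairs = [("gold", 7.2), ("desk", 3.1), ("years", 9.8), ("fantastic", 4.5)]
nearest = k_nearest(pairs, 2)
print(nearest)
```

<div class=h1_cell>
<p>
On the Spark side, chaining limit(k) after orderBy should give the same k rows as a dataframe rather than only displaying them.
</div>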