# Recipe 1M + Spark

Outline
- Downloads
- Imports
- Loading Data
- Formatting Data
- Pyspark NLP Pipeline
- Spark ML
- Visualize Ingredients/Instructions

# Downloads

Java 8

In [1]:
# !apt-get update
# !apt-get install openjdk-8-jdk-headless -qq > /dev/null

Java 11

In [2]:
# !sudo apt-get update
# !sudo apt-get install openjdk-11-jdk-headless -qq > /dev/null

Python library installs
NOTE: Might need Apache Spark downloads; check later

In [3]:
# !pip install pyspark
# !pip install spark-nlp==3.0.0

# !pip install pandas
# !python -m pip install -U matplotlib #might not need this
# !pip install bokeh

check if GPU has been allocated

In [53]:
!nvidia-smi

Thu Apr 29 19:05:41 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 3070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   40C    P8    18W / 220W |    880MiB /  7979MiB |     26%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

Install RAPIDS on Google Colab

In [56]:
# # Install RAPIDS Google Colab
# !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
# !bash rapidsai-csp-utils/colab/rapids-colab.sh stable

# import sys, os, shutil

# sys.path.append('/usr/local/lib/python3.7/site-packages/')
# os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
# os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
# os.environ["CONDA_PREFIX"] = "/usr/local"
# for so in ['cudf', 'rmm', 'nccl', 'cuml', 'cugraph', 'xgboost', 'cuspatial']:
#     fn = 'lib'+so+'.so'
#     source_fn = '/usr/local/lib/'+fn
#     dest_fn = '/usr/lib/'+fn
#     if os.path.exists(source_fn):
#         print(f'Copying {source_fn} to {dest_fn}')
#         shutil.copyfile(source_fn, dest_fn)
# # fix for BlazingSQL import issue
# # ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/local/lib/python3.7/site-packages/../../libblazingsql-engine.so)
# if not os.path.exists('/usr/lib64'):
#     os.makedirs('/usr/lib64')
# for so_file in os.listdir('/usr/local/lib'):
#     if 'libstdc' in so_file:
#         shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib64/'+so_file)
#         shutil.copyfile('/usr/local/lib/'+so_file, '/usr/lib/x86_64-linux-gnu/'+so_file)

Install RAPIDS Local

CUDA Toolkit to use GPU
https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
RAPIDS instructions: https://rapids.ai/start.html
The command I used: `conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge \
blazingsql=0.19 cuml=0.19 python=3.7 cudatoolkit=11.2`


In [62]:
import cuml

ModuleNotFoundError: No module named 'cuml'

# Imports

In [4]:
# TODO: Check which ones get used
import random
import json
import os

import pandas as pd
import matplotlib.pyplot as plt

import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark.ml.feature import Normalizer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector, Vectors, VectorUDT
from pyspark.ml.feature import PCA
# from org.apache.spark.ml.linal import Vector, Vectors # Typo in John Snow Labs

from pyspark.sql.types import StringType, DoubleType
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.functions import explode
from pyspark.sql.functions import udf, col
from pyspark.ml.functions import vector_to_array
from pyspark.ml.clustering import LDA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# visualizations
from bokeh.plotting import figure, show
from bokeh.palettes import d3
from bokeh.models import ColumnDataSource, CategoricalColorMapper, HoverTool
from bokeh.io import output_notebook

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *              
from sparknlp.pretrained import PretrainedPipeline

Java 8 Environment Setup

In [5]:
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"

Java 11 Environment Setup

In [6]:
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Jupyter Notebook Import

In [7]:
# Expands the width of the notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

Google Colab Import

In [8]:
# from google.colab import drive
# drive.mount('/content/drive')

Spark NLP: Without GPU

In [9]:
# spark = sparknlp.start()
# print('[INFO] Spark type: ', type(spark))

Adjust GPU flag for individual use

In [10]:
spark = sparknlp.start(gpu=True, memory="32G")
print('[INFO] Spark type: ', type(spark))

[INFO] Spark type:  <class 'pyspark.sql.session.SparkSession'>


**Learning Point:** Difference between a _Spark session_ and a _Spark context_

- the Spark context can be accessed through the Spark session (`spark.sparkContext`)
- the session gives a unified view of the separate contexts

Reference: https://medium.com/@achilleus/spark-session-10d0d66d1d24

In [11]:
sparknlp.version()
spark.version

'3.1.1'

# Load Data

In [12]:
def load_to_spark_dataframe(file_path):
    return spark.read.json(file_path)

In [13]:
def drop_extra_colums(df):
    columns_to_drop = ("_corrupt_record", "url", "partition")
    return df.drop(*columns_to_drop).dropna()

In [14]:
def display_dataframe_info(df):
    print('[COUNT] ', df.count())
    print()
    print('[SCHEMA] ')
    df.printSchema()
    print()
    print('[SHOW] ')
    df.show()

In [15]:
def sample_df(df, sample_size):
    return spark.createDataFrame(df.head(sample_size))

In [16]:
# jupyter notebook file path
file_path='layer1.json'
# google colab file path
# file_path = '/content/drive/MyDrive/layer1.json'

In [17]:
# TODO: Change this later and separate it out
def load_data():
    df = load_to_spark_dataframe(file_path)
    df = drop_extra_colums(df)
    display_dataframe_info(df)
    
    sample_size = 100
    df = sample_df(df, sample_size)
    return df

# Format Data

The ingredients and instructions are currently stored as lists, but I want to combine each of them into one string to create a "document". So I will need to change the ingredients/instructions from an array of strings to a single string.

**Learning Point:** Using a User Defined Function (UDF)
- functions that act on one row at a time
- alternatively, you can use a map function
- expensive operations but can help when you need something outside of the typical SQL functions

Reference: 
- https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44
- https://stackoverflow.com/questions/29109916/updating-a-dataframe-column-in-spark
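To make the row-level idea concrete, here is a minimal plain-Python sketch of the kind of function a UDF would wrap. The function name `join_text_fields` and the sample records are hypothetical, and plain dictionaries stand in for the rows Spark would pass in.

```python
def join_text_fields(items, separator=". "):
    """Join the 'text' field of each item into one string."""
    return separator.join(item["text"] for item in items)


# Hypothetical value of one cell in the ingredients column:
# a list of records that each carry a single 'text' field
sample_ingredients = [
    {"text": "6 ounces penne"},
    {"text": "2 cups milk"},
]

document = join_text_fields(sample_ingredients)
print(document)  # 6 ounces penne. 2 cups milk
```

Spark calls a function like this once per row, which is why UDFs are flexible but comparatively expensive.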

In [18]:
df = load_data()

[COUNT]  1029720

[SCHEMA] 
root
 |-- id: string (nullable = true)
 |-- ingredients: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- instructions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- title: string (nullable = true)


[SHOW] 
+----------+--------------------+--------------------+--------------------+
|        id|         ingredients|        instructions|               title|
+----------+--------------------+--------------------+--------------------+
|000018c8a5|[{6 ounces penne}...|[{Preheat the ove...|Worlds Best Mac a...|
|000033e39b|[{1 c. elbow maca...|[{Cook macaroni a...|Dilly Macaroni Sa...|
|000035f7ed|[{8 tomatoes, qua...|[{Add the tomatoe...|            Gazpacho|
|00003a70b1|[{2 12 cups milk}...|[{Preheat oven to...|Crunchy Onion Pot...|
|00004320bb|[{1 (3 ounce) pac...|[{Dissolve Jello ...|Cool 'n Easy Crea...|
|0000

In [19]:
# Returns the text of each list entry joined together with periods as separators
def join_lists_udf(list_to_join):
    return '. '.join([data['text'] for data in list_to_join])

def apply_udf(df, column_name):
    # Wrap the Python function as a Spark UDF that returns a string
    join_udf = UserDefinedFunction(join_lists_udf, StringType())
    # Apply the UDF to the target column only; pass every other column through unchanged
    return df.select(*[join_udf(column).alias(column_name) if column == column_name else column for column in df.columns])

In [20]:
df = apply_udf(df, 'ingredients')
df = apply_udf(df, 'instructions')

The lists have been removed

In [21]:
df.show(10)

+----------+--------------------+--------------------+--------------------+
|        id|         ingredients|        instructions|               title|
+----------+--------------------+--------------------+--------------------+
|000018c8a5|6 ounces penne. 2...|Preheat the oven ...|Worlds Best Mac a...|
|000033e39b|1 c. elbow macaro...|Cook macaroni acc...|Dilly Macaroni Sa...|
|000035f7ed|8 tomatoes, quart...|Add the tomatoes ...|            Gazpacho|
|00003a70b1|2 12 cups milk. 1...|Preheat oven to 3...|Crunchy Onion Pot...|
|00004320bb|1 (3 ounce) packa...|Dissolve Jello in...|Cool 'n Easy Crea...|
|0000631d90|12 cup shredded c...|In a large skille...|Easy Tropical Bee...|
|000075604a|2 Chicken thighs....|Pierce the skin o...|Kombu Tea Grilled...|
|00007bfd16|6 -8 cups fresh r...|Put ingredients i...|Strawberry Rhubar...|
|000095fc1d|8 ounces, weight ...|Layer all ingredi...|     Yogurt Parfaits|
|0000973574|2 cups flour. 1 t...|Sift dry ingredie...|  Zucchini Nut Bread|
+----------+

# Introducing Pyspark NLP Pipeline

There are many things to learn about PySpark NLP. I highly recommend reading some of the reference material before getting started. I will specifically be using BERT sentence embeddings for my pipeline.

**Learning Point:** Implementing PySpark NLP Pipeline
- DocumentAssembler: Allows you to transform your raw data into a format the pipeline can use
- SentenceEmbeddings: 
  - I will use BertSentenceEmbeddings for the purposes of this project
  - Alternative Embeddings are mentioned in John Snow Labs Annotator (see reference below)
  - You may also annotate each word and then aggregate the mean to create sentence embeddings (to provide more flexibility for stemming and other preprocessing)
  
Note: There are some typos in the documentation; I have flagged them with [WARNING] as we go along.
  
Reference:
- https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59 (highly recommended)
- https://nlp.johnsnowlabs.com/docs/en/transformers
- https://nlp.johnsnowlabs.com/docs/en/annotators
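The alternative mentioned above, averaging word vectors into a sentence vector, can be sketched in plain Python. This is only an illustration of mean pooling, not the Spark NLP API; the function name `mean_pool` and the tiny 3-dimensional vectors are made up for the example.

```python
def mean_pool(word_vectors):
    """Average equal-length word vectors element-wise into one sentence vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    count = len(word_vectors)
    # zip(*word_vectors) groups the values of each dimension together
    return [sum(values) / count for values in zip(*word_vectors)]


# Hypothetical 3-dimensional embeddings for a two-word sentence
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

sentence_vector = mean_pool(word_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

Working at the word level first leaves room for preprocessing such as stemming before the vectors are averaged.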

In [22]:
def create_pipeline(column_name):
    # Document Assembler: converts our raw data into the "Document" type
    # that the other stages of the PySpark NLP pipeline expect
    documentAssembler = DocumentAssembler() \
        .setInputCol(column_name) \
        .setOutputCol("document") \
        .setCleanupMode("shrink")

    # [WARNING] Typo in the documentation: 
    # Documentation: BertSentenceembeddings (missing uppercase in the E)
    # Correction: BertSentenceEmbeddings
    sentence_embeddings = BertSentenceEmbeddings.pretrained() \
        .setInputCols("document") \
        .setOutputCol("sentence_embeddings") \
        .setLazyAnnotator(False)
    
    # [WARNING]
    # The documentation states that after SentenceEmbeddings you need to convert
    # the embeddings into Vectors (a "features" column) so they can be used in Spark ML.
    # Correction: the documentation labels its example as Python, but the code is not valid Python;
    # it is a mixture of Scala and typos.
    # Instead, you can use the EmbeddingsFinisher to format the vectors.

    embeddings_finisher = EmbeddingsFinisher() \
                .setInputCols(["sentence_embeddings"]) \
                .setOutputCols("sentence_embeddings_vectors") \
                .setCleanAnnotations(True) \
                .setOutputAsVector(True)
    
    pipeline = Pipeline(
    stages=[
        documentAssembler, 
        sentence_embeddings,
        embeddings_finisher
    ])
    return pipeline

In [23]:
ingredients_pipeline = create_pipeline('ingredients')

sent_small_bert_L2_768 download started this may take some time.
Approximate size to download 139.6 MB
[OK!]


In [24]:
instructions_pipeline = create_pipeline('instructions')

sent_small_bert_L2_768 download started this may take some time.
Approximate size to download 139.6 MB
[OK!]


Below we will fit and transform the data to create a BERT representation of our ingredients and instructions data, respectively.

Each stage of the pipeline creates a new column (named via setOutputCol). Because both pipelines use the same output names, the columns must be renamed after each transform to avoid collisions.

In [25]:
# Create ingredients embeddings
ingredients_pipeline_model = ingredients_pipeline.fit(df)

transformed_df = ingredients_pipeline_model.transform(df)

transformed_df = transformed_df.withColumnRenamed("sentence_embeddings", "ingredients_sentence_embeddings")
transformed_df = transformed_df.withColumnRenamed("sentence_embeddings_vectors", "ingredients_sentence_embeddings_vectors")

# Create instruction embeddings
instructions_pipeline_model = instructions_pipeline.fit(df)

transformed_df = instructions_pipeline_model.transform(transformed_df)

transformed_df = transformed_df.withColumnRenamed("sentence_embeddings", "instructions_sentence_embeddings")
transformed_df = transformed_df.withColumnRenamed("sentence_embeddings_vectors", "instructions_sentence_embeddings_vectors")

In [26]:
transformed_df.show()

+----------+--------------------+--------------------+--------------------+-------------------------------+---------------------------------------+--------------------------------+----------------------------------------+
|        id|         ingredients|        instructions|               title|ingredients_sentence_embeddings|ingredients_sentence_embeddings_vectors|instructions_sentence_embeddings|instructions_sentence_embeddings_vectors|
+----------+--------------------+--------------------+--------------------+-------------------------------+---------------------------------------+--------------------------------+----------------------------------------+
|000018c8a5|6 ounces penne. 2...|Preheat the oven ...|Worlds Best Mac a...|           [{sentence_embedd...|                   [[0.2496253103017...|            [{sentence_embedd...|                    [[0.6857207417488...|
|000033e39b|1 c. elbow macaro...|Cook macaroni acc...|Dilly Macaroni Sa...|           [{sentence_embedd...|     

In [27]:
# Drop the intermediate annotation columns; only the vector columns are needed
embedding_columns = ["ingredients_sentence_embeddings", "instructions_sentence_embeddings"]
transformed_df = transformed_df.drop(*embedding_columns)

In [28]:
transformed_df.show()

+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|        id|         ingredients|        instructions|               title|ingredients_sentence_embeddings_vectors|instructions_sentence_embeddings_vectors|
+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|000018c8a5|6 ounces penne. 2...|Preheat the oven ...|Worlds Best Mac a...|                   [[0.2496253103017...|                    [[0.6857207417488...|
|000033e39b|1 c. elbow macaro...|Cook macaroni acc...|Dilly Macaroni Sa...|                   [[0.3729802668094...|                    [[0.5114012360572...|
|000035f7ed|8 tomatoes, quart...|Add the tomatoes ...|            Gazpacho|                   [[0.2992067039012...|                    [[0.5063223242759...|
|00003a70b1|2 12 cups milk. 1...|Preheat oven to 3...|Crun

Hark! Another issue: the vectors produced in transformed_df are of type list, but we need a DenseVector as input to the PySpark ML side.

In [29]:
# Demonstrate the type here is a list when a Dense vector is desired
type(transformed_df.collect()[0]['ingredients_sentence_embeddings_vectors'])

list

# Spark ML

## Preparing Data

Lark! We will address the challenge mentioned above by using a UDF to change the instruction and ingredient lists to vectors.

In [30]:
# The data for the sentence_embeddings_vectors is in a list of DenseVectors rather than just DenseVectors
# I have no idea why, have to figure that out
# Correction: I use a UDF to convert the list to the DenseVector 

def format_list_to_vector(df, column_name):
    list_to_vector_udf = udf(lambda l: l[0], VectorUDT())
    
#     return df.select(df['id'], list_to_vector_udf(df[column_name]).alias("features"))
#     return df.select(df['id'], list_to_vector_udf(df[column_name]).alias(column_name))
    return list_to_vector_udf(df[column_name]).alias(column_name)

In [31]:
column_name = "ingredients_sentence_embeddings_vectors" 
transformed_df = transformed_df.withColumn(column_name, format_list_to_vector(transformed_df, column_name))

In [32]:
column_name = "instructions_sentence_embeddings_vectors"
transformed_df = transformed_df.withColumn(column_name, format_list_to_vector(transformed_df, column_name))

In [33]:
transformed_df.show()

+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|        id|         ingredients|        instructions|               title|ingredients_sentence_embeddings_vectors|instructions_sentence_embeddings_vectors|
+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|000018c8a5|6 ounces penne. 2...|Preheat the oven ...|Worlds Best Mac a...|                   [0.24962531030178...|                    [0.68572074174880...|
|000033e39b|1 c. elbow macaro...|Cook macaroni acc...|Dilly Macaroni Sa...|                   [0.37298026680946...|                    [0.51140123605728...|
|000035f7ed|8 tomatoes, quart...|Add the tomatoes ...|            Gazpacho|                   [0.29920670390129...|                    [0.50632232427597...|
|00003a70b1|2 12 cups milk. 1...|Preheat oven to 3...|Crun

In [34]:
# Demonstrate the type here is a Dense vector as desired
type(transformed_df.collect()[0]['ingredients_sentence_embeddings_vectors'])

pyspark.ml.linalg.DenseVector

## K-Means

To select the best number of clusters for k-means, I will use the Silhouette Score.

**Learning Point:** K-Means Silhouette Score
- Plot the score for each candidate k to choose the number of clusters

Resources:
- https://towardsdatascience.com/k-means-clustering-using-pyspark-on-big-data-6214beacdc8b
- https://rsandstroem.github.io/sparkkmeans.html
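As a rough illustration of what the score measures, the sketch below computes the mean silhouette coefficient in plain Python: for each point, `a` is the mean distance to its own cluster and `b` is the mean distance to the nearest other cluster, giving `(b - a) / max(a, b)`. The function name and toy points are hypothetical, and the sketch uses plain Euclidean distance, which may differ from the distance measure Spark's evaluator applies.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean distance, at least two clusters)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Group the points by their cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum)
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad_score = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good_score, bad_score)
```

A score near 1 means points sit much closer to their own cluster than to any other; a negative score suggests the assignment cuts across the natural groups.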

### Find the number of clusters

In [35]:
def calculate_silhoutte_scores(df, column_name, max_number_of_clusters):
    df = df.withColumnRenamed(column_name + "_sentence_embeddings_vectors", "features")
    
    silhouette_scores=[]
    evaluator = ClusteringEvaluator()

    for i in range(2, max_number_of_clusters):
        kmeans = KMeans().setK(i).setSeed(1)
        kmeans_model = kmeans.fit(df)
        predictions = kmeans_model.transform(df)
        silhouette = evaluator.evaluate(predictions)
        silhouette_scores.append(silhouette)
        
    max_silhouette = max(silhouette_scores)
    print('[IDEAL CLUSTER ]', silhouette_scores.index(max_silhouette) + 2)
    return silhouette_scores

In [36]:
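def display_silhoutte_scores(silhouette_scores, max_number_of_clusters):
    fig, ax = plt.subplots(1,1, figsize =(8,6))
    # k runs over the same range used when the scores were calculated
    ax.plot(range(2, max_number_of_clusters), silhouette_scores)
    ax.set_xlabel('k')
    ax.set_ylabel('silhouette score')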

Number of "ideal" clusters for Ingredients

In [37]:
# max_number_of_clusters = 10
# silhouette_scores = calculate_silhoutte_scores(transformed_df, "ingredients", max_number_of_clusters)

In [38]:
# display_silhoutte_scores(silhouette_scores, max_number_of_clusters)

Number of "ideal" clusters for Instructions

In [39]:
# max_number_of_clusters = 10
# silhouette_scores = calculate_silhoutte_scores(transformed_df, "instructions", max_number_of_clusters)

In [40]:
# display_silhoutte_scores(silhouette_scores, max_number_of_clusters)

### Create the clusters based on instruction features

In [41]:
def create_clusters(df, column_name, number_of_clusters):
    df = df.withColumnRenamed(column_name + "_sentence_embeddings_vectors", "features")
    
    kmeans = KMeans().setK(number_of_clusters).setSeed(1)
    kmeans_model = kmeans.fit(df)

    # Make predictions and keep only the id and cluster label
    transformed = kmeans_model.transform(df).select('id', 'prediction')

    # Materialize the predictions on the driver and rebuild a DataFrame from them
    rows = transformed.collect()
    df_pred = spark.createDataFrame(rows)

    # Restore the original column name before joining the labels back on
    df = df.withColumnRenamed("features", column_name + "_sentence_embeddings_vectors")

    return df_pred.join(df, 'id')

In [42]:
number_of_clusters = 5
column_name = "instructions"
predicted_df = create_clusters(transformed_df, column_name, number_of_clusters)

In [43]:
predicted_df.show()

+----------+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|        id|prediction|         ingredients|        instructions|               title|ingredients_sentence_embeddings_vectors|instructions_sentence_embeddings_vectors|
+----------+----------+--------------------+--------------------+--------------------+---------------------------------------+----------------------------------------+
|00011fc1f9|         1|1 cup lentils. 12...|Saute the onions,...|Lentils Vegetable...|                   [0.36389616131782...|                    [0.48283424973487...|
|00004320bb|         0|1 (3 ounce) packa...|Dissolve Jello in...|Cool 'n Easy Crea...|                   [0.46526822447776...|                    [0.55120623111724...|
|000033e39b|         3|1 c. elbow macaro...|Cook macaroni acc...|Dilly Macaroni Sa...|                   [0.37298026680946...|                    [0.51140123605

## PCA

Now we will take our vectors for ingredients and instructions and reduce them to two dimensions for visualization. To reduce them, we will use the PCA from PySpark's ml package, and then we will split the features into separate columns.

Goals:
- reduce vectors to 2-D
- split 2-D to two columns

**Learning Point:** PySpark mllib is incompatible with ml
- the ml library is designed for DataFrame objects
- the mllib library is for RDD objects
- pay attention to where you import from

Resources:
- https://stackoverflow.com/questions/41074182/cannot-convert-type-class-pyspark-ml-linalg-sparsevector-into-vector
- MLlib PCA: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction#principal-component-analysis-pca
- ML PCA: https://spark.apache.org/docs/1.5.1/ml-features.html#pca
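For intuition about what the reduction does, here is a small textbook-style sketch in plain Python that finds the first principal component with power iteration and projects the data onto it. It is not Spark's implementation (libraries may differ in details such as how the data is centered), and the function name and toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the top principal direction and the 1-D projection of each row."""
    n, d = len(rows), len(rows[0])

    # Center each dimension on its mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d)
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by the covariance matrix and normalize;
    # the vector converges to the direction of greatest variance
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project each centered row onto the principal direction
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Toy data lying exactly on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projections = first_principal_component(rows)
print(direction, projections)
```

Because the toy points lie on a single line, one component captures all of their variance; the notebook keeps two components so the recipes can be plotted on x and y axes.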

In [44]:
def pca_dimentionality_reduction(df, column_name, number_of_dimensions):
    pca = PCA(k=number_of_dimensions,
              inputCol=column_name + "_sentence_embeddings_vectors",
              outputCol=column_name + "_pca_features")
    pca_model = pca.fit(df)
    return pca_model.transform(df).select("id", "prediction", "title", 
                                          "ingredients", "instructions",
                                         column_name + "_pca_features")

In [45]:
def split_pca_features(df, column_name, number_of_dimensions):
    # Convert the PCA vector to an array, then expose each component as its own column (dim[0], dim[1], ...)
    return df.withColumn("dim", vector_to_array(column_name + "_pca_features")).select(
        ["id", "prediction", "title", "ingredients", "instructions"] +
        [col("dim")[i] for i in range(number_of_dimensions)])

### PCA on Ingredients Features

In [46]:
column_name = "ingredients"
number_of_dimensions = 2
reduced_ingredients_df = pca_dimentionality_reduction(predicted_df, column_name, number_of_dimensions)
reduced_ingredients_df = split_pca_features(reduced_ingredients_df, column_name, number_of_dimensions)

In [47]:
reduced_ingredients_df.show()

+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|        id|prediction|               title|         ingredients|        instructions|              dim[0]|              dim[1]|
+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+
|00011fc1f9|         1|Lentils Vegetable...|1 cup lentils. 12...|Saute the onions,...| -1.5658635956364213|  -1.753411199120299|
|00004320bb|         0|Cool 'n Easy Crea...|1 (3 ounce) packa...|Dissolve Jello in...| -1.0292415508838946| 0.11491375932398373|
|000033e39b|         3|Dilly Macaroni Sa...|1 c. elbow macaro...|Cook macaroni acc...| -0.6938908218629017| 0.20321849206833065|
|00050db874|         1| Peanut Butter Bread|34 cup peanut but...|In a large bowl, ...| -3.5653463062817417|  -1.075541862002314|
|000368efd3|         1|Cheesy Herbed Egg...|2 eggs, scrambled...|COOK the eggs ove...| -0.6741251

### PCA on Instruction Features

In [48]:
column_name = "instructions"
number_of_dimensions = 2
reduced_instructions_df = pca_dimentionality_reduction(predicted_df, column_name, number_of_dimensions)
reduced_instructions_df = split_pca_features(reduced_instructions_df, column_name, number_of_dimensions)

In [49]:
reduced_instructions_df.show()

+----------+----------+--------------------+--------------------+--------------------+--------------------+-------------------+
|        id|prediction|               title|         ingredients|        instructions|              dim[0]|             dim[1]|
+----------+----------+--------------------+--------------------+--------------------+--------------------+-------------------+
|00011fc1f9|         1|Lentils Vegetable...|1 cup lentils. 12...|Saute the onions,...| -1.2722409938104375| -3.342701242340617|
|00004320bb|         0|Cool 'n Easy Crea...|1 (3 ounce) packa...|Dissolve Jello in...|   1.040194973565729|-2.6965306736431547|
|000033e39b|         3|Dilly Macaroni Sa...|1 c. elbow macaro...|Cook macaroni acc...| 0.25938181098541246|-3.7679415654076873|
|00050db874|         1| Peanut Butter Bread|34 cup peanut but...|In a large bowl, ...| -0.6284153726924759|-3.1889812019712473|
|000368efd3|         1|Cheesy Herbed Egg...|2 eggs, scrambled...|COOK the eggs ove...|-0.247285222431617

# Visualize Ingredient Predictions

In [50]:
def visualize_reduced_clusters(df):
    output_notebook()
    df = df.toPandas().set_index('id')
    prediction = [str(x) for x in df["prediction"]]

    source = ColumnDataSource(dict(
        x=df["dim[0]"], 
        y=df["dim[1]"],
        titles=df["title"],
        label=prediction,
    ))


    # Sort the labels so the color assigned to each cluster is stable between runs
    predictions_set = sorted(set(prediction))

    palette = d3['Category10'][len(predictions_set)]

    color_map = CategoricalColorMapper(factors=predictions_set,
                                       palette=palette)

    TOOLTIPS = [
        ('name', "@titles"),
    ]

    p = figure(plot_width=800, plot_height=800, tools='hover', tooltips=TOOLTIPS)
    p.circle(x='x', 
             y='y',
             source=source,
             size=20,
             color={'field': 'label', 'transform': color_map},
             alpha=0.5)
#     hover = p.select(dict(type=HoverTool))
    show(p)

In [51]:
visualize_reduced_clusters(reduced_ingredients_df)

# Visualize Instructions with Ingredient Predictions

Here we are going to use the labels based on the ingredient features of each recipe and apply those labels/colors to the instructions when plotted.

Example:

Ingredient Plot
- Title: Ham and Cheese Sandwich
- Ingredient Vector: [0, 0, 0, 1]
- Ingredient Prediction: 1
- Plot(x,y) = (2,3)

Using the labels from the Ingredient plot, we are going to plot using the Instruction Vector.

Instruction Plot
- Title: Ham and Cheese Sandwich
- Instruction Vector: [0, 1, 0, 0, 1, 0]
- Ingredient Prediction: 1
- Plot(x,y) = (5,3)

In essence, we want to see how the labels from ingredients will be grouped in the instruction space. We want to know if a Turkey and Cheese Sandwich with Ingredient Prediction 1 will plot near the Ham and Cheese Sandwich when only looking at the instruction vectors.
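The idea of carrying labels from one feature space into another can be sketched in plain Python as a join on the recipe id. Everything here is illustrative: the ids, labels, and coordinates are made up, and `attach_labels` is a hypothetical helper rather than part of the notebook.

```python
# Hypothetical cluster labels from one feature space, keyed by recipe id
labels_by_id = {"ham_cheese": 1, "turkey_cheese": 1, "gazpacho": 0}

# Hypothetical 2-D coordinates of the same recipes in the other feature space
coords_by_id = {
    "ham_cheese": (5.0, 3.0),
    "turkey_cheese": (5.2, 2.9),
    "gazpacho": (-1.0, 4.0),
}


def attach_labels(coords, labels):
    """Group the coordinates by the label assigned to the same id."""
    grouped = {}
    for recipe_id, point in coords.items():
        if recipe_id in labels:
            grouped.setdefault(labels[recipe_id], []).append(point)
    return grouped


grouped = attach_labels(coords_by_id, labels_by_id)
for label, points in sorted(grouped.items()):
    print(label, points)
```

If recipes that share a label also land close together in the second space, the two representations agree about which recipes are similar.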


In [57]:
visualize_reduced_clusters(reduced_instructions_df)

In [52]:
!pip install cuml

Collecting cuml
  Downloading cuml-0.6.1.post1.tar.gz (1.1 kB)
Building wheels for collected packages: cuml
  Building wheel for cuml (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/egong/miniconda3/envs/767_py3/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ps2isrsd/cuml_0a331e3df5b740f186ced5f58836b3c8/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ps2isrsd/cuml_0a331e3df5b740f186ced5f58836b3c8/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-p95eexb9
       cwd: /tmp/pip-install-ps2isrsd/cuml_0a331e3df5b740f186ced5f58836b3c8/
  Complete output (25 lines):
  running bdist_wheel
  running build
  installing to build/bdist.linux-x86_64/wheel
  running install
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    

In [None]:
from cuml.manifold import TSNE

# https://medium.com/rapids-ai/tsne-with-gpus-hours-to-seconds-9d9c17c941db
# accelerating TSNE with GPU: 4 optimizations