# TensorFrames  

This chapter will provide a high-level primer on the burgeoning field of Deep Learning and the reasons why it is important. It will provide the fundamentals surrounding feature learning and neural networks required for deep learning. As well, this chapter will provide a quick start for TensorFrames for Apache Spark.  

In this chapter, you will learn about:  
• What is Deep Learning?  
• A primer on feature learning  
• What is feature engineering?  
• What is TensorFlow?  
• Introducing TensorFrames  
• TensorFrames – quick start  

As you can see in the preceding breakdown, we will be initially discussing deep learning – more specifically we will start with neural networks.

## What is Deep Learning?  

Deep Learning is part of a family of machine learning techniques based on learning representations of data. It is loosely based on our brain's own neural networks; the purpose of this structure is to provide a large number of highly interconnected elements (in biological systems, these would be the neurons in our brains). There are approximately 100 billion neurons in our brain, each connected to approximately 10,000 other neurons, resulting in a mind-boggling $10^{15}$ synaptic connections. These elements work together to solve problems through learning processes; examples include pattern recognition and data classification.  

Learning within this architecture involves modifying the connections between
the interconnected elements, similar to how our own brains adjust the
synaptic connections between neurons.  

The traditional algorithmic approach involves programming known steps or
quantities; that is, you already know the steps to solve a specific problem, so you
repeat the solution and make it run faster. Neural networks are an interesting paradigm
because they learn by example and are not actually programmed to
perform a specific task per se. This makes the training process in neural networks
(and Deep Learning) very important: you must provide good examples for
the neural network to learn from; otherwise, it will "learn" the wrong thing (that is,
provide unpredictable results).  

The most common approach to building an artificial neural network involves the
creation of three layers: input, hidden, and output, as noted in the following diagram:  

![image.png](attachment:image.png)  

Each layer comprises one or more nodes with connections (that is, flow of data)
between each of these nodes, as noted in the preceding diagram. Input nodes are
passive in that they receive data, but do not modify the information. The nodes in the
hidden and output layers actively modify the data. For example, the connections
from the three nodes in the input layer to one of the nodes in the first hidden layer are
shown in the following diagram:  

![image-2.png](attachment:image-2.png)  

Referring to a signal processing neural network example, each input (denoted as $x_i$)
has a weight applied to it ($w_i$), which produces a new value. In this case, one of the
hidden nodes ($h_1$) is the result of three modified input nodes:  

![image-3.png](attachment:image-3.png)  

There is also a bias applied to the sum in the form of a constant that also gets adjusted
during the training process. The sum ($h_1$ in our example) passes through a so-called
activation function that determines the output of the neuron. Some examples of such
activation functions are presented in the following image:  

![image-4.png](attachment:image-4.png)  

This process is repeated for each node in the hidden layers as well as the output layer.
The output node is the accumulation of all the weights applied to the input values for
every active layer node. The learning process is the result of many iterations running
in parallel, applying and reapplying these weights (in this scenario).  
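
The weighted sum, bias, and activation described above can be sketched in a few lines of plain Python. This is only an illustration, not the chapter's own code: the input values, weights, bias, and the choice of a sigmoid activation are all assumed for the example.

```python
import math


def sigmoid(z):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical values for three input nodes feeding one hidden node (h1)
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

h1 = neuron_output(x, w, b)
print(h1)
```

Training adjusts `w` and `b`; the structure of the computation stays the same for every node in the hidden and output layers.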

Neural networks come in many different sizes and shapes. The most popular
are single- and multi-layer feedforward networks that resemble the one presented
earlier; such structures are capable of solving problems ranging from simple
regression (such as linear and logistic), even with just two layers and one neuron
in the output layer, to highly complex regression and classification tasks (with
many hidden layers and many neurons). Another commonly used type is the
self-organizing map, sometimes referred to as a Kohonen network after Teuvo Kohonen,
the Finnish researcher who first proposed such structures. These structures are trained
without a teacher, that is, they do not require a target (an unsupervised learning paradigm).
Such structures are most commonly used to solve clustering problems where the aim
is to find an underlying pattern in the data.

### The need for neural networks and Deep Learning  

There are many potential applications with neural networks (and Deep Learning).
Some of the more popular ones include facial recognition, handwritten digit
identification, game playing, speech recognition, language translation, and object
classification. The key aspect here is that they all involve learning and pattern recognition.  

While neural networks have been around for a long time (at least within the context
of the history of computer science), they have become more popular now because
of two overarching themes: advances in and availability of distributed computing, and
advances in research:  
* Advances and availability of distributed computing and hardware:
Distributed computing frameworks such as Apache Spark allow you to
complete more training iterations faster by running more models
in parallel to determine the optimal parameters for your machine learning
models. GPUs (graphics processing units originally designed for displaying
graphics) are now prevalent, and these processors are adept at
performing the resource-intensive mathematical computations required for
machine learning. Together with cloud computing, it becomes easier to harness
the power of distributed computing and GPUs due to the lower up-front costs,
minimal time to deployment, and easier-to-deploy elastic scale.  

* Advances in deep learning research: These hardware advances have helped
return neural networks to the forefront of data sciences with projects such
as TensorFlow as well as other popular ones such as Theano, Caffe, Torch,
Microsoft Cognitive Toolkit (CNTK), mxnet, and DL4J.  

As noted previously, Deep Learning is part of a family of machine learning methods
based on learning representations of data. Learning representations
can also be described as feature learning. What makes Deep Learning so exciting
is that it has the potential to replace or minimize the need for manual feature
engineering. Deep Learning allows the machine to not just learn a specific task,
but also learn the features needed for that task. More succinctly, it automates feature
engineering, or teaches machines to learn how to learn (a great reference on feature
learning is Stanford's Unsupervised Feature Learning and Deep Learning tutorial:
http://deeplearning.stanford.edu/tutorial/).  

Breaking these concepts down to the fundamentals, let's start with a feature. As
observed in Christopher Bishop's *Pattern Recognition and Machine Learning* (Berlin:
Springer. ISBN 0-387-31073-8. 2006) and as noted in the previous chapters on MLlib
and ML, a feature is a measurable property of the phenomenon being observed.  

If you are more familiar with the domain of statistics, a feature corresponds to
the independent variables ($x_1, x_2, \dots, x_n$) within a stochastic linear regression model:  
![image.png](attachment:image.png)  

In this specific example, $y$ is the dependent variable and the $x_i$ are the independent
variables.  
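
Written out in symbols, a model of this kind is commonly expressed as shown below. The coefficient names $\beta_i$ and the error term $\varepsilon$ follow conventional notation and are assumed here; they are not taken from the figure.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
$$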


Feature engineering is about determining which of these features (for example, in
statistics, the independent variables) are important in defining the model that you
are creating. Typically, it involves the process of using domain knowledge to create
the features to allow the ML models to work.  
*Coming up with features is difficult, time-consuming, requires expert knowledge.
"Applied machine learning" is basically feature engineering.  
– Andrew Ng, Machine Learning and AI via Brain simulations (http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10595.pdf)*





#### What is feature engineering?  

Typically, performing feature engineering involves concepts such as feature selection
(selecting a subset of the original feature set) or feature extraction (building a new set
of features from the original feature set):  
* In feature selection, based on domain knowledge, you can filter the variables
that you think define the model (for example, predicting football scores
based on number of turnovers). Often data analysis techniques such as
regression and classification can also be used to help you determine this.  

* In feature extraction, the idea is to transform the data from a high-dimensional
space (that is, many different independent variables) to a smaller space of fewer
dimensions. Continuing the football analogy, this would be the quarterback
rating, which is based on several selected features (for example, completions,
touchdowns, interceptions, and average gain per pass attempt). A common
approach for feature extraction within the linear data transformation space
is principal component analysis (PCA).  
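
To make the difference between the two ideas concrete, the following sketch contrasts feature selection (keeping a subset of columns) with feature extraction via PCA, computed here with NumPy's singular value decomposition. The data set is synthetic and hypothetical, and the choice of which columns to keep and of `k = 2` components is arbitrary.

```python
import numpy as np

# Hypothetical data set: 200 observations of 3 features,
# two of which are strongly correlated
rng = np.random.default_rng(seed=0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Feature selection: keep a subset of the original columns (here, 0 and 2)
X_selected = X[:, [0, 2]]

# Feature extraction with PCA: center the data, then project it onto the
# top-k right singular vectors (the principal directions)
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

k = 2
X_extracted = X_centered @ Vt[:k].T

print(X_selected.shape, X_extracted.shape)
print(explained_variance_ratio)
```

Both results have two columns, but the selected columns are original features, whereas the extracted columns are new features built as linear combinations of all three.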



#### Bridging the data and algorithm  

Let's bridge the feature and feature engineering definitions within the context of
feature selection using the example of restaurant recommendations:    
![image.png](attachment:image.png)  

While this is a simplified model, the analogy describes the basic premise of applied
machine learning. It would be up to a data scientist to analyze the data to determine
the key features of this restaurant recommendation model.  

In our restaurant recommendation case, while it may be easy to presume that
geolocation and cuisine type are major factors, it will require some digging into the
data to understand how the user (that is, restaurant-goer) has chosen their preference
for a restaurant. Different restaurants often have different characteristics or weights
for the model.  

For example, the key features for high-end restaurant catering businesses are often
related to location (that is, proximity to their customer's location), the ability to make
reservations for large parties, and the diversity of the wine list:  

![image-2.png](attachment:image-2.png)  

Meanwhile, for specialty restaurants, often few of those previous factors are
involved; instead, the focus is on the reviews, ratings, social media buzz, and
possibly whether the restaurant is good for kids:  

![image-3.png](attachment:image-3.png)  

The ability to segment these different restaurants (and their target audience) is a
key facet of applied machine learning. It can be an arduous process where you try
different models and algorithms with different variables and weights and then retry
after iteratively training and testing many different combinations. But note how this
time-consuming iterative approach is itself a process that can potentially be
automated. This is the key context for building algorithms that help machines learn
to learn: Deep Learning has the potential to automate the learning process when
building our models.

## What is TensorFlow?  

TensorFlow is a Google open source software library for numerical computation
using data flow graphs; that is, an open source machine learning library focusing on
Deep Learning. Based loosely on neural networks, TensorFlow is the culmination of
the work of Google's Brain Team researchers and engineers to apply Deep Learning
to Google products and build production models for various Google teams including
(but not limited to) search, photos, and speech.  

Built on C++ with a Python interface, it has quickly become one of the most popular
Deep Learning projects.  

As noted previously, TensorFlow performs numerical computation using data flow
graphs. When thinking about a graph (as per the previous chapter on GraphFrames),
the nodes (or vertices) of this graph represent mathematical operations, while
the graph edges represent the multidimensional arrays (that is, tensors) that
flow between the different nodes (that is, mathematical operations).  

Referring to the following diagram, t1 is a 2x3 matrix while t2 is a 3x2 matrix;
these are the tensors (or edges of the tensor graph). The node is the mathematical
operation represented as op1:  

![image.png](attachment:image.png)  

In this example, op1 is a matrix multiplication operation represented by the following
diagram, though this could be any of the many mathematical operations available
in TensorFlow:  

![image-2.png](attachment:image-2.png)  

Together, to perform your numerical computations within the graph, there is a flow
of multidimensional arrays (that is, tensors) between the mathematical operations
(nodes) - that is, the flow of tensors, or TensorFlow.  
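
The idea of tensors flowing along edges into operation nodes can be illustrated without TensorFlow at all. The sketch below is a toy evaluator, not how TensorFlow is implemented: the `Op` class and `matmul` helper are hypothetical names, and the matrix values are made up to match the 2x3 and 3x2 shapes of t1 and t2.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class Op:
    """A graph node: a function plus the edges (inputs) that feed it."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Evaluate upstream nodes first, then apply this node's operation
        values = [i.run() if isinstance(i, Op) else i for i in self.inputs]
        return self.fn(*values)


# Hypothetical tensor values with the shapes from the diagram
t1 = [[1, 2, 3], [4, 5, 6]]      # 2x3
t2 = [[1, 2], [3, 4], [5, 6]]    # 3x2

# Building the node does not compute anything yet
op1 = Op(matmul, t1, t2)

# Running the node makes the tensors "flow" through the operation
result = op1.run()
print(result)
```

Separating graph construction (`Op(...)`) from execution (`run()`) mirrors the define-then-run style discussed in the sections that follow.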



#### Matrix multiplication using constants  

To better describe tensors and how TensorFlow works, let's start with a matrix
multiplication calculation involving two constants. As noted in the following
diagram, we have c1 (1x3 matrix) and c2 (3x1 matrix), where the operation (op1)
is a matrix multiplication:  

![image.png](attachment:image.png)  

We will now define c1 (1x3 matrix) and c2 (3x1 matrix) using the following code:

In [1]:
# Import TensorFlow, TensorFrames, and Row
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# Setup the matrix
# c1: 1x3 matrix
# c2: 3x1 matrix

c1 = tf.constant([[3., 2., 1.]])
c2 = tf.constant([[-1.], [2.], [1.]])

In [2]:
c1

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[3., 2., 1.]], dtype=float32)>

Now that we have our constants, let's run our matrix multiplication using the
following code. Within the context of a TensorFlow graph, recall that the nodes in
the graph are called operations (or ops). The following matrix multiplication is the
op, while the two matrices (c1, c2) are the tensors (typed multi-dimensional arrays).
An op takes zero or more tensors as its input and performs an operation such as a
mathematical calculation, with the output being zero or more tensors in the form
of numpy ndarray objects.

In [3]:
# mp: matrix multiplication (c1 x c2)
mp = tf.matmul(c1, c2)



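Now that this TensorFlow graph has been established, execution of this operation
(in this case, the matrix multiplication) is done within the context of a
session; the session places the graph ops onto the CPU or GPU (that is, devices)
to be executed. This describes the TensorFlow 1.x API; under TensorFlow 2.x, which
produced the eager tensor shown above, ops execute immediately and no session is needed: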

In [5]:
# TensorFlow 1.x: launch the default graph and execute the op in a session
# s = tf.Session()
# r = s.run(mp)
# print(r)

# TensorFlow 2.x: eager execution has already evaluated `mp`, so print it directly
tf.print(mp)

[[2]]


#### Matrix multiplication using placeholders  
Now we will perform the same task as before, except this time, we will use placeholders
instead of constants. As noted in the following diagram, we will start off with two
matrices (m1: 1x3, m2: 3x1) using the same values as in the previous section:

![image.png](attachment:image.png)

Within TensorFlow, we will use placeholder to define our two tensors as per the
following code snippet:

In [7]:
# Set up placeholders for the model (TensorFlow 1.x API)
# t1: placeholder tensor
# t2: placeholder tensor
# t1 = tf.placeholder(tf.float32)
# t2 = tf.placeholder(tf.float32)
# tp: matrix multiplication (t1 x t2)
# tp = tf.matmul(t1, t2)

The advantage of this approach is that, with placeholders, you can use the same
operation (in this case, the matrix multiplication) with tensors of different
sizes and shapes (provided they meet the criteria of the operation). As with the
operations in the previous section, let's define two matrices and execute the graph
(with a simplified session execution).
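
One way to carry this out under TensorFlow 2.x is sketched below. It assumes the TensorFlow 1.x-style API is reached through `tf.compat.v1` and that the graph is built explicitly, since placeholders are not available in eager mode. The names `m1`, `m2`, and `graph` are chosen for illustration, and the 2x2 matrices in the second run are made-up values.

```python
import numpy as np
import tensorflow as tf

# Build the graph explicitly; placeholders only exist in graph mode
graph = tf.Graph()
with graph.as_default():
    # Two-dimensional placeholders whose sizes are left open
    t1 = tf.compat.v1.placeholder(tf.float32, shape=[None, None], name="t1")
    t2 = tf.compat.v1.placeholder(tf.float32, shape=[None, None], name="t2")
    tp = tf.matmul(t1, t2)

# Same values as the constants in the previous section
m1 = np.array([[3., 2., 1.]], dtype=np.float32)        # 1x3
m2 = np.array([[-1.], [2.], [1.]], dtype=np.float32)   # 3x1

with tf.compat.v1.Session(graph=graph) as sess:
    # Feed the matrices into the placeholders and execute the op
    result = sess.run(tp, feed_dict={t1: m1, t2: m2})

    # The same op accepts matrices of a different size
    result_2x2 = sess.run(tp, feed_dict={
        t1: np.eye(2, dtype=np.float32),
        t2: np.array([[1., 2.], [3., 4.]], dtype=np.float32),
    })

print(result)
print(result_2x2)
```

The graph is defined once and then run twice with differently sized inputs, which is the flexibility placeholders are meant to provide.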

#### Discussion
As noted previously, TensorFlow provides users with the ability to perform deep
learning using Python libraries by representing computations as graphs where the
tensors represent the data (edges of the graph) and operations represent what is
to be executed (for example, mathematical computations) (vertices of the graph).

## Introducing TensorFrames  

At the time of writing, TensorFrames is an experimental binding for Apache Spark;
it was introduced in early 2016, shortly after the release of TensorFlow. With
TensorFrames, one can manipulate Spark DataFrames with TensorFlow programs.
Referring to the tensor diagrams in the previous section, we have updated the
figure to show how Spark DataFrames work with TensorFlow, as shown in the
following diagram:  

![image.png](attachment:image.png)  

As noted in the preceding diagram, TensorFrames provides a bridge between
Spark DataFrames and TensorFlow. This allows you to take your DataFrames and
apply them as input into your TensorFlow computation graph. TensorFrames also
allows you to take the TensorFlow computation graph output and push it back into
DataFrames so you can continue your downstream Spark processing.  

In terms of common usage scenarios for TensorFrames, these typically include the
following:  
**Utilize TensorFlow with your data**  
The integration of TensorFlow and Apache Spark with TensorFrames allows
data scientists to expand their analytics, streaming, graph, and machine learning
capabilities to include Deep Learning via TensorFlow. This allows you to both
train and deploy models at scale.  

**Parallel training to determine optimal hyperparameters**  
When building deep learning models, there are several configuration parameters
(that is, hyperparameters) that affect how the model is trained. Common
hyperparameters in Deep Learning/artificial neural networks include the
learning rate (if the rate is high, the model will learn quickly, but it may not take into account
highly variable input; that is, it will not learn well if the rate and the variability in the
data are too high) and the number of neurons in each layer of your neural network
(too many neurons results in noisy estimates, while too few neurons will result in
the network not learning well).  
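
The effect of the learning rate, and the idea of evaluating several candidates in parallel, can be shown with a toy example that needs neither Spark nor TensorFlow. Everything here is assumed for illustration: the loss function $(w - 3)^2$, the three candidate rates, and the use of a local thread pool as a stand-in for a cluster. The number of neurons is not covered by this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def train(learning_rate, steps=50):
    """Gradient descent on the toy loss (w - 3)^2; returns the final loss."""
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient
    return (w - 3.0) ** 2


# Hypothetical candidates: too small, reasonable, and too large
learning_rates = [0.01, 0.1, 1.1]

# Evaluate every candidate in parallel; map() returns results in input order
with ThreadPoolExecutor() as pool:
    losses = list(pool.map(train, learning_rates))

results = dict(zip(learning_rates, losses))
best_rate = min(results, key=results.get)

print(results)
print("best learning rate:", best_rate)
```

The smallest rate converges too slowly to finish in 50 steps, the largest overshoots and diverges, and the middle value reaches the lowest loss. Each trial is independent of the others, which is what makes this kind of search easy to distribute.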



### TensorFrames – quick start  

After all this preamble, let's jump start our use of TensorFrames with this quick start
tutorial.  

* **Using TensorFlow to add a constant to an existing column**  
This is a simple TensorFrames program where the op is to perform a simple addition.  

The first thing we will do is import TensorFlow, TensorFrames, and
`pyspark.sql.Row` to create a DataFrame based on an RDD of floats:


In [8]:
import os
import sys
from os.path import abspath
from pyspark.sql import SparkSession
import pyspark.sql.functions as fn
import pyspark.sql.types as typ
import findspark
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

import pyspark.ml.feature as ft
import glob

In [9]:
os.environ["SPARK_HOME"] = "C:/Program Files/spark-3.5.4-bin-hadoop3"
os.environ["JAVA_HOME"] = "C:/Program Files/Java/jre1.8.0_431" 
os.environ['HADOOP_HOME'] = 'C:/Program Files/hadoop-3.4.0'

spark_python = os.path.join(os.environ.get('SPARK_HOME',None),'python')
py4j = glob.glob(os.path.join(spark_python,'lib','py4j-*.zip'))[0]
graphf = glob.glob(os.path.join(spark_python,'graphframes.zip'))[0]
sys.path[:0]=[spark_python,py4j]
sys.path[:0]=[spark_python,graphf]
os.environ['PYTHONPATH']=py4j+os.pathsep+graphf


In [10]:
import findspark
findspark.init()
findspark.find()

'C:/Program Files/spark-3.5.4-bin-hadoop3'

In [11]:
# Create a SparkSession

spark = SparkSession.builder.appName("TensorFrames for Apache Spark").getOrCreate()

sc = spark.sparkContext  # Access the SparkContext from the SparkSession


In [12]:
# Import TensorFlow, TensorFrames, and Row
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row  

# Create a list of Rows holding floats and convert it into DataFrame `df`
rdd = [Row(x=float(x)) for x in range(10)]
df = spark.createDataFrame(rdd)

To view the df DataFrame generated by the RDD of floats, we can use the show
command:

In [13]:
df.show()

+---+
|  x|
+---+
|0.0|
|1.0|
|2.0|
|3.0|
|4.0|
|5.0|
|6.0|
|7.0|
|8.0|
|9.0|
+---+



* **Executing the Tensor graph**  

As noted previously, this tensor graph consists of adding 3 to the tensor created by
the df DataFrame generated by the RDD of floats. We will now execute the following
code snippet:

* x utilizes tfs.block where block builds a block placeholder based on the content of a column in a dataframe.  
* z is the output tensor from the TensorFlow add method (tf.add)  
* df2 is the new DataFrame which adds extra columns to the df DataFrame with the z tensor block by block  

In [None]:
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row 

# Run the TensorFlow program:
#   The `op` performs the addition (i.e. `x` + `3`)
#   Place the data back into a DataFrame
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    
    # The output that adds 3 to `x`
    z = tf.add(x, 3, name="z")
    
    # The resulting dataframe
    df2 = tfs.map_blocks(z, df)
    
# Note that `z` is the tensor output from the `tf.add` operation
print(z)
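
To see what "block by block" means without a Spark cluster, the following plain-Python sketch emulates the behavior described above. It is not the TensorFrames API: `emulate_map_blocks` is a hypothetical helper, and the block size of 4 is an arbitrary stand-in for however the DataFrame happens to be partitioned.

```python
def emulate_map_blocks(fn, rows, block_size):
    """Apply fn to consecutive blocks of rows and pair each input with its output."""
    result = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        result.extend(zip(block, fn(block)))
    return result


def add_three(block):
    """The "op": add 3 to every element of a block."""
    return [x + 3 for x in block]


# Same values as the `x` column of `df`
x_column = [float(x) for x in range(10)]

# Each tuple holds the original value and the new `z` value
rows_out = emulate_map_blocks(add_three, x_column, block_size=4)
for x, z in rows_out:
    print(x, z)
```

The op sees a whole block of values at a time rather than one row at a time, and the original column is kept alongside the new one.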


In [17]:
# Build a DataFrame of vectors
data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = spark.createDataFrame(data)
df.show()

+-----------+
|          y|
+-----------+
| [0.0, 0.0]|
|[1.0, -1.0]|
|[2.0, -2.0]|
|[3.0, -3.0]|
|[4.0, -4.0]|
|[5.0, -5.0]|
|[6.0, -6.0]|
|[7.0, -7.0]|
|[8.0, -8.0]|
|[9.0, -9.0]|
+-----------+



With a few lines of TensorFlow code with TensorFrames, we can take the data stored
within the df DataFrame and execute a tensor graph to perform an element-wise
sum and min, merge the data back into a DataFrame, and (in our case) print out
the final values.
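
As a quick check of what such a reduction should return for this DataFrame, the element-wise sum and minimum can be computed directly in plain Python. This sketch does not use TensorFrames or TensorFlow; it only reproduces the arithmetic on the same values as the `y` column.

```python
# Same values as the `y` column of `df`
data = [[float(y), float(-y)] for y in range(10)]

# zip(*data) regroups the rows into columns, one per vector position
columns = list(zip(*data))

# Reduce each column independently
elementwise_sum = [sum(col) for col in columns]
elementwise_min = [min(col) for col in columns]

print("Elementwise sum: %s and minimum: %s" % (elementwise_sum, elementwise_min))
```

A distributed version would compute partial sums and minimums per block and then combine them, which gives the same answer because both operations are associative.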

## Summary  
In this chapter, we have reviewed the fundamentals of neural networks and Deep
Learning, including the components of feature engineering. With all this new
excitement in Deep Learning, we introduced TensorFlow and how it can work
closely together with Apache Spark through TensorFrames.  

TensorFrames is a powerful deep learning tool that allows data scientists and
engineers to work with TensorFlow with data stored in Spark DataFrames. This
allows you to expand the capabilities of Apache Spark to a powerful deep learning
toolset that is based on the learning process of neural networks. To help continue
your Deep Learning journey, the following are some great TensorFlow and
TensorFrames resources: