# Hitchhiker's Guide to .NET for Apache Spark

Welcome to the .NET for Apache Spark tutorial! We are glad to have you here. Before we begin, let us cover answers to a few quick questions:

 - #### What is .NET for Apache Spark?
  .NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data.

  .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

 - #### Where can I find more on .NET for Apache Spark?
  https://github.com/dotnet/spark

 - #### I did not know there was a REPL for C#!?
   Great question! :) We collaborated with the .NET team and they built one for us! https://github.com/dotnet/interactive 

Whew! Now that we have covered some basic information, let's begin! 

Since the .NET REPL is something very new, let us start by exploring what you can do with the REPL. 

# Basic Capabilities of the C# REPL

In [None]:
// Simple assignments should just work 
var x = 1 + 25;

In [None]:
// You can either use the traditional approach to printing a variable...
Console.WriteLine(x);

// ... or just type it and execute a cell
x

In [None]:
// You can even play with built-in libraries/functions
Enumerable.Range(1, 5)

In [None]:
// And now for some C# 8.0 features. If you haven't read it,
// here's the link: 
// https://docs.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-8
1..4

In [None]:
// We can even do pattern matching!
public static string RockPaperScissors(string first, string second)
    => (first, second) switch
    {
        ("rock", "paper") => "rock is covered by paper. Paper wins.", // <-- Next cell prints this out
        ("rock", "scissors") => "rock breaks scissors. Rock wins.",
        ("paper", "rock") => "paper covers rock. Paper wins.",
        ("paper", "scissors") => "paper is cut by scissors. Scissors wins.",
        ("scissors", "rock") => "scissors is broken by rock. Rock wins.",
        ("scissors", "paper") => "scissors cuts paper. Scissors wins.",
        (_, _) => "tie"
    };

In [None]:
RockPaperScissors("rock", "paper")

In [None]:
// Now, for the fun part! You can render HTML
display(
    div(
        h1("Our Incredibly Declarative Example"),
        p("Can you believe we wrote this ", b("in C#"), "?"),
        img[src:"https://media.giphy.com/media/xUPGcguWZHRC2HyBRS/giphy.gif"],
        p("What will ", b("you"), " create next?")
    )
);

# Looking at data through Spark.NET


In [None]:
// Let us use some sample data. In this cell, we create this data
// from *scratch* but you can also load it from your storage container.
// For instance,
// var df = spark.Read().Json("wasbs://<container>@<account>.blob.core.windows.net/people.json");

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

var schema = new StructType(new List<StructField>()
    {
        new StructField("id", new IntegerType()),
        new StructField("name", new StringType())
    });

var data = new List<GenericRow>();
data.Add(new GenericRow(new object[] { 0,  "Michael" }));
data.Add(new GenericRow(new object[] { 1,  "Elva"    }));
data.Add(new GenericRow(new object[] { 2,  "Terry"   }));
data.Add(new GenericRow(new object[] { 3,  "Steve"   }));
data.Add(new GenericRow(new object[] { 4,  "Brigit"  }));
data.Add(new GenericRow(new object[] { 5,  "Niharika"}));
data.Add(new GenericRow(new object[] { 6,  "Rahul"   }));
data.Add(new GenericRow(new object[] { 7,  "Tomas"   }));
data.Add(new GenericRow(new object[] { 8,  "Euan"    }));
data.Add(new GenericRow(new object[] { 9,  "Lev"     }));
data.Add(new GenericRow(new object[] { 10, "Saveen"  }));

var df = spark.CreateDataFrame(data, schema);
df.Show();

In [None]:
// Wait, that rendering is old-school plain! Let's spice things up a bit!
// Here we define a formatter for Microsoft.Spark.Sql.DataFrame and
// register it. When we then invoke display() and pass a DataFrame,
// the formatter is invoked and generates the necessary HTML.

Microsoft.DotNet.Interactive.Formatting.Formatter<Microsoft.Spark.Sql.DataFrame>.Register((df, writer) =>
{
    var headers = new List<dynamic>();
    var columnNames = df.Columns();
    headers.Add(th(i("index")));
    headers.AddRange(columnNames.Select(c => th(c)));

    var rows = new List<List<dynamic>>();
    var currentRow = 0;
    // Take(20) returns at most 20 rows, so no separate Count() job is needed.
    var dfRows = df.Take(20);
    foreach (Row dfRow in dfRows)
    {
        var cells = new List<dynamic>();
        cells.Add(td(currentRow));

        foreach (string columnName in columnNames)
        {
            cells.Add(td(dfRow.Get(columnName)));
        }

        rows.Add(cells);
        ++currentRow;
    }

    var t = table[@border: "0.1"](
        thead[@style: "background-color: blue; color: white"](headers),
        tbody[@style: "color: red"](rows.Select(r => tr(r))));

    writer.Write(t);
}, "text/html");

In [None]:
// Now, let's try rendering the Spark DataFrame in two ways...

// ... the regular way ...
df.Show();

// ... and using dotnet-interactive's display method (which invokes the formatter we just defined)
display(df);


In [None]:
// ... and just typing df (equivalent to "display(df);")
df

In [None]:
// Let us now try something more advanced, like defining C# classes on the fly...
public static class A {
    public static readonly string s = "The person named ";
}

In [None]:
// ... and just for illustration, let's define one more simple class
public static class B {
    private static Random _r = new Random();
    private static List<string> _moods = new List<string>{ "happy","funny","awesome","cool"};

    public static string GetMood() {
        return _moods[_r.Next(_moods.Count)];
    }
}

In [None]:
// Let us now define a Spark User-defined Function (UDF) that utilizes
// the classes we just defined above. If you do not recognize the syntax
// below, here's some relevant documentation:
// https://docs.microsoft.com/en-us/dotnet/api/system.func-2?view=netcore-3.1
// https://github.com/dotnet/spark/blob/master/examples/Microsoft.Spark.CSharp.Examples/Sql/Batch/Basic.cs
//
// Note: If you change the class definition above, and execute the cell,
// you should re-execute this cell (i.e., the cell that defines the UDF)
var udf = Udf<string, string>(str => $"{A.s} - {str} - is {B.GetMood()}!");

In [None]:
// Let's use the UDF on our Spark DataFrame
display(
    df
    .Select(
        udf((Microsoft.Spark.Sql.Column)df["name"])));

In [None]:
// Tables are not that interesting, right? :) Let's do some visualizations now.
// Let us start with something simple to illustrate the idea. We highly encourage
// you to look at https://fslab.org/XPlot/ to understand how you can use XPlot's
// full capabilities. While the examples are in F#, it is fairly straightforward
// to rewrite in C#.

using XPlot.Plotly;

var lineChart = Chart.Line(new List<int> { 1, 2, 3, 4, 5, 6, 10, 44 });
lineChart.WithTitle("My awesome chart");
lineChart.WithXTitle("X axis");
lineChart.WithYTitle("Y axis");
lineChart

In [None]:
// Good! Now let us try to visualize the Spark DataFrame we have.
// Now is a good time to refresh your understanding of a Spark DataFrame:
// https://spark.apache.org/docs/latest/sql-programming-guide.html
// Remember that a Spark DataFrame is a distributed representation
// of your dataset (yes, even if your data is only a few KB). Since we
// are using a visualization library, we first need to 'collect'
// (notice how we are using df.Collect().ToArray() below)
// all the data that is distributed on your cluster, and shape it
// appropriately for XPlot.
//
// Note: Visualizations work best for smaller datasets (typically
// a few tens of thousands of data points, amounting to KBs), so if you are
// trying to visualize GBs of data, it is usually a good idea to
// summarize your data first using Spark.NET's APIs. For
// a list of summarization APIs, see here:
// https://docs.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.functions?view=spark-dotnet

var names = new List<string>();
var ids = new List<int>();

foreach (Row row in df.Collect().ToArray())
{
    names.Add(row.GetAs<string>("name"));
    int? id = row.GetAs<int?>("id");
    ids.Add(id ?? 0);
}
var bar = new Graph.Bar
{
    name = "bar chart",
    x = names,
    y = ids
};

var chart = Chart.Plot(new[] {bar});
display(chart);
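
The note above about summarizing data before visualizing it can be sketched roughly as follows. This is only an illustration: it assumes the `df` DataFrame with `id` and `name` columns created in the earlier cell, and grouping by name length is an arbitrary choice made here for demonstration, not something the tutorial prescribes.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Assumption: `df` is the DataFrame with "id" and "name" columns from the earlier cell.
// The aggregation runs on the cluster; "name_length" and "people" are illustrative aliases.
DataFrame summary = df
    .GroupBy(Length(df["name"]).Alias("name_length"))
    .Agg(Count(df["id"]).Alias("people"))
    .OrderBy("name_length");

// Only the small, aggregated result is collected to the driver.
foreach (Row row in summary.Collect())
{
    Console.WriteLine($"{row.Get("name_length")}: {row.Get("people")}");
}
```

The collected rows can then be shaped into lists for XPlot in the same way as `names` and `ids` above.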

In [None]:
// As a final step, let us now plot a histogram of a random dataset

using XPlot.Plotly;

var schema = new StructType(new List<StructField>()
    {
        new StructField("number", new DoubleType())
    });

Random random = new Random();

var data = new List<GenericRow>();
for (int i = 0; i <= 100; i++) {
    data.Add(new GenericRow(new object[] { random.NextDouble() }));
}

var histogramDf = spark.CreateDataFrame(data, schema);
histogramDf.Show();

In [None]:
// Time to use LINQ (or Language Integrated Query) :)
// For those that are not familiar with LINQ, you can read more about it
// here: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/

using System.Linq;

// Let us take the histogramDf we loaded through Spark and sample some data points
// for the histogram. We will then use LINQ to shape the data for our next 
// steps (visualization!)
var sample1 = 
        histogramDf.Sample(0.5, true).Collect().ToArray() // <---- Spark APIs
        .Select(x => x.GetAs<double>("number")); // <---- LINQ APIs
        
// Let us create two more sample sets we can use for plotting
var sample2 = histogramDf.Sample(0.3, false).Collect().ToArray().Select(x => x.GetAs<double>("number"));
var sample3 = histogramDf.Sample(0.6, true).Collect().ToArray().Select(x => x.GetAs<double>("number"));

In [None]:
// Let us plot the histograms now!
var hist1 = new Graph.Histogram{x = sample1, opacity = 0.75};
var hist2 = new Graph.Histogram{x = sample2, opacity = 0.75};
var hist3 = new Graph.Histogram{x = sample3, opacity = 0.75};

In [None]:
Chart.Plot(new[] {hist1})

In [None]:
Chart.Plot(new[] {hist2})

In [None]:
Chart.Plot(new[] {hist3})

In [None]:
// But wait, that's three different graphs and it's impossible to read them
// all together! Let's try an overlay histogram, shall we?
var layout = new XPlot.Plotly.Layout.Layout{barmode="overlay", title="Overlaid Histogram"};
var histogram = Chart.Plot(new[] {hist1, hist2, hist3});
histogram.WithLayout(layout);
histogram

In [None]:
// And for the final touches
using static XPlot.Plotly.Graph;

layout.title = "Overlaid Histogram with cool colors!";
hist1.marker = new Marker {color = "#D65108"};
hist2.marker = new Marker {color = "#ffff00"}; 
hist3.marker = new Marker {color = "#462255"};

histogram

# VectorUdfs using Apache Arrow
Spark .NET supports constructing Arrow-backed VectorUdfs by directly using the [Apache Arrow](https://github.com/apache/arrow) library or by using the [Microsoft DataFrame](https://devblogs.microsoft.com/dotnet/an-introduction-to-dataframe/) library.
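
The Arrow helper in the next cell builds a string array from two buffers: an offsets buffer and a buffer of concatenated UTF-8 bytes. As a rough illustration of that layout, here is a minimal sketch in plain C# that does not use the Arrow library at all; the input values and variable names are chosen here purely for demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative input only; not part of the tutorial's data flow.
string[] values = { "Elva", "Lev", "" };

var offsets = new List<int>();  // start position of each string in the byte buffer
var buffer = new List<byte>();  // UTF-8 bytes of all strings, concatenated

int offset = 0;
foreach (string value in values)
{
    byte[] bytes = Encoding.UTF8.GetBytes(value);
    offsets.Add(offset);
    buffer.AddRange(bytes);
    offset += bytes.Length;
}
offsets.Add(offset); // closing offset, so there are values.Length + 1 entries

Console.WriteLine(string.Join(", ", offsets)); // prints "0, 4, 7, 7"

// String i occupies the bytes in [offsets[i], offsets[i + 1]).
byte[] allBytes = buffer.ToArray();
for (int i = 0; i < values.Length; i++)
{
    int start = offsets[i];
    int length = offsets[i + 1] - start;
    string decoded = Encoding.UTF8.GetString(allBytes, start, length);
    Console.WriteLine($"{i}: \"{decoded}\" (offset {start}, length {length})");
}
```

Because there is one more offset than there are strings, the element count is the number of offsets minus one, which matches the `valueOffsets.Length - 1` passed to `ArrayData` in the helper.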

In [None]:
// Let's construct a VectorUdf by directly using Arrow.
using Apache.Arrow;
using static Microsoft.Spark.Sql.ArrowFunctions;
using Column = Microsoft.Spark.Sql.Column;

// Helper method to construct an ArrowArray from a string[].
public static IArrowArray ToStringArrowArray(string[] array)
{
    var valueOffsets = new ArrowBuffer.Builder<int>();
    var valueBuffer = new ArrowBuffer.Builder<byte>();
    int offset = 0;

    foreach (string str in array)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(str);
        valueOffsets.Append(offset);
        valueBuffer.Append(bytes);
        offset += bytes.Length;
    }

    valueOffsets.Append(offset);
    return new StringArray(
        new ArrayData(
            Apache.Arrow.Types.StringType.Default,
            valueOffsets.Length - 1,
            0,
            0,
            new[] { ArrowBuffer.Empty, valueOffsets.Build(), valueBuffer.Build() }));
}

Func<Int32Array, StringArray, StringArray> arrowUdf =
    (ids, names) => (StringArray)ToStringArrowArray(
        Enumerable.Range(0, names.Length)
            .Select(i => $"id: {ids.GetValue(i)}, name: {names.GetString(i)}")
            .ToArray());

Func<Column, Column, Column> vectorUdf1 = VectorUdf(arrowUdf);

In [None]:
df.Select(vectorUdf1(df["id"], df["name"]))

In [None]:
// Now let's construct a VectorUdf by using Microsoft DataFrame
using Microsoft.Data.Analysis;
using static Microsoft.Spark.Sql.DataFrameFunctions;

Func<Int32DataFrameColumn, ArrowStringDataFrameColumn, ArrowStringDataFrameColumn> msftDfFunc =
    (ids, names) =>
    {
        long i = 0;
        return names.Apply(cur => $"id: {ids[i++]}, name: {cur}");
    };

Func<Column, Column, Column> vectorUdf2 = VectorUdf(msftDfFunc);

In [None]:
df.Select(vectorUdf2(df["id"], df["name"]))

# Running custom NuGet packages in UDFs inside Spark
In .NET for Spark, it is easy to install a library from NuGet and use it in Spark UDFs.

In [None]:
// Use #r to install new packages into the current session

// Installs latest version
#r "nuget: MathNet.Numerics"

// Installs specified version
#r "nuget: NumSharp,0.20.5"

In [None]:
// Let's construct some UDFs that have a dependency on the installed packages.
using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Double;
using NumSharp;

var mathNetUdf = Udf<string, string>(str => {
    Matrix<double> matrix = DenseMatrix.OfArray(new double[,] {
        {1,1,1,1},
        {1,2,3,4},
        {4,3,2,1}});

    return $"{matrix[0, 0]} - {str} - {matrix[1, 1]}!";
});

var numSharpUdf = Udf<string, string>(str => {
    var nd = np.arange(12);

    return $"{nd[1].ToString()} - {str} - {nd[5].ToString()}!";
});

In [None]:
// UDFs run in the Microsoft.Spark.Worker process. The package assemblies
// that a UDF depends on are shipped to the Worker so they are available
// at the time of execution.
df.Select(mathNetUdf(df["name"])).Show();

df.Select(numSharpUdf(df["name"])).Show();

// We can also chain UDFs.
df.Select(mathNetUdf(numSharpUdf(df["name"])))

# Synapse Spark Utility Methods
[Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.MSSparkUtils](https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils)

In [None]:
// Utility for obtaining credentials (tokens and keys) for Synapse resources.
// Credentials methods https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils%2FCredentials.cs
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.MSSparkUtils;

// Note: the help message is currently the one returned by the Scala implementation. It needs to be updated in a future version.
Console.WriteLine($"Help:\n{Credentials.Help()}");

In [None]:
// Utility for obtaining environment metadata for Synapse.
// Env methods https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils%2FEnv.cs
Console.WriteLine($"UserName: {Env.GetUserName()}");
Console.WriteLine($"UserId: {Env.GetUserId()}");
Console.WriteLine($"WorkspaceName: {Env.GetWorkspaceName()}");
Console.WriteLine($"PoolName: {Env.GetPoolName()}");
Console.WriteLine($"ClusterId: {Env.GetClusterId()}");

// Note: the help message is currently the one returned by the Scala implementation. It needs to be updated in a future version.
Console.WriteLine($"Help:\n{Env.Help()}");

In [None]:
// Utility for filesystem operations in Synapse notebook
// FS methods https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils%2FFS.cs
// FileInfo methods https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils%2FFileInfo.cs

// Note: the help message is currently the one returned by the Scala implementation. It needs to be updated in a future version.
FS.Help("");

In [None]:
// Utility for notebook operations (e.g., chaining Synapse notebooks together)
// Notebook methods https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FMSSparkUtils%2FNotebook.cs

// Note: the help message is currently the one returned by the Scala implementation. It needs to be updated in a future version.
Notebook.Help("");

# [Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.Visualization](https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FNotebook%2FVisualization%2FFunctions.cs)

In [None]:
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Notebook.Visualization;
// Construct an HTML fragment from user-supplied HTML content and send it
// to the Synapse notebook front end for rendering.
DisplayHTML("<h1>Hello World</h1>");

# [TokenLibrary](https://dev.azure.com/dnceng/internal/_git/dotnet-spark-extensions?path=%2Fsrc%2FMicrosoft.Spark.Extensions.Azure.Synapse.Analytics%2FUtils%2FTokenLibrary.cs)

[Synapse Analytics TokenLibrary Official Docs](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-secure-credentials-with-tokenlibrary)

In [None]:
using Microsoft.Spark.Extensions.Azure.Synapse.Analytics.Utils;

// Note: the help message is currently the one returned by the Scala implementation. It needs to be updated in a future version.
// TODO: The method name needs to be uppercase.
Console.WriteLine($"Help:\n{TokenLibrary.help()}");

# Miscellaneous Helpers
Learn about some helper functions offered by .NET for Spark and the notebook environment.

In [None]:
// Curious about the version of Spark .NET currently installed?
// Let's use the following method to find out!
using Microsoft.Spark.Experimental.Sql;
spark.GetAssemblyInfo()

In [None]:
// Current version of the dotnet-interactive REPL.
#!about

In [None]:
// We can even run PowerShell Core commands
#!pwsh
cat /etc/hosts

In [None]:
// We can also run F# code
#!fsharp
open System
printfn "Hello World from F#!"

In [None]:
// Whatever code is deemed invalid by the C# compiler is invalid here too
var z = 12345

In [None]:
// You could write code that throws exceptions and they bubble up to the notebook
throw new Exception("watzzz");