# Experimenting with .NET for Apache Spark Using the CreateDataFrame API

A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. 

DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
Another way to create a Spark DataFrame is the `CreateDataFrame` API, which takes the data as a list of `GenericRow` objects together with a schema and returns a `DataFrame` object. Let's look at a simple example:


In [5]:
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

// List of GenericRow objects that contain the data for each row of the DataFrame
var data = new List<GenericRow>();
data.Add(new GenericRow(new object[] { "Alice", 20 }));
data.Add(new GenericRow(new object[] { "Bob", 30 }));

// Schema of the DataFrame
var schema = new StructType(new List<StructField>()
{
    new StructField("Name", new StringType()),
    new StructField("Age", new IntegerType())
});

// Calling CreateDataFrame with the data and schema
DataFrame df = spark.CreateDataFrame(data, schema);

// Displaying the returned dataframe
df.Show();

+-----+---+
| Name|Age|
+-----+---+
|Alice| 20|
|  Bob| 30|
+-----+---+
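
To confirm that the schema was applied as intended, it can help to print it. The following is a minimal sketch, assuming the `df` created in the cell above is still in scope; it relies only on the `DataFrame.PrintSchema()` method.

```csharp
// Assumes `df` from the previous cell is still in scope.
// Prints each column's name, data type, and nullability as a tree.
df.PrintSchema();
```

For the schema defined above, this should report `Name` as a string column and `Age` as an integer column.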

## A more real-life example

Now let's take a look at a more complex example closer to a real-life use case.


In [3]:
using Microsoft.Spark.Sql.Types;

// Data as list of Rows
var searchlogData = new List<GenericRow>();
searchlogData.Add(new GenericRow(new object[] {399266 , "2019-10-15T11:53:04Z" , "en-us" , "how to make nachos" , 73 , "www.nachos.com;www.wikipedia.com" , "NULL"})); 
searchlogData.Add(new GenericRow(new object[] {382045 , "2019-10-15T11:53:25Z" , "en-gb" , "best ski resorts" , 614 , "skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm" , "ski-europe.com;www.travelersdigest.com/ski_resorts.htm"})); 
searchlogData.Add(new GenericRow(new object[] {382045 , "2019-10-16T11:53:42Z" , "en-gb" , "broken leg" , 74 , "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture" , "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture"}));
searchlogData.Add(new GenericRow(new object[] {106479 , "2019-10-16T11:53:10Z" , "en-ca" , "south park episodes" , 24 , "southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall" , "southparkstudios.com"}));
searchlogData.Add(new GenericRow(new object[] {906441 , "2019-10-16T11:54:18Z" , "en-us" , "cosmos" , 1213 , "cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {351530 , "2019-10-16T11:54:29Z" , "en-fr" , "microsoft" , 241 , "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {640806 , "2019-10-16T11:54:32Z" , "en-us" , "wireless headphones" , 502 , "www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com" , "www.amazon.com;store.apple.com"}));
searchlogData.Add(new GenericRow(new object[] {304305 , "2019-10-16T11:54:45Z" , "en-us" , "dominos pizza" , 60 , "dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos" , "dominos.com"})); 
searchlogData.Add(new GenericRow(new object[] {460748 , "2019-10-16T11:54:58Z" , "en-us" , "yelp" , 1270 , "yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp_Inc.;facebook.com/yelp" , "yelp.com"}));
searchlogData.Add(new GenericRow(new object[] {354841 , "2019-10-16T11:59:00Z" , "en-us" , "how to run" , 610 , "running.about.com;ehow.com;go.com" , "running.about.com;ehow.com"}));
searchlogData.Add(new GenericRow(new object[] {354068 , "2019-10-16T12:00:07Z" , "en-mx" , "what is sql" , 422 , "wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL" , "wikipedia.org/wiki/SQL"}));
searchlogData.Add(new GenericRow(new object[] {674364 , "2019-10-16T12:00:21Z" , "en-us" , "mexican food redmond" , 283 , "eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {347413 , "2019-10-16T12:11:34Z" , "en-gr" , "microsoft" , 305 , "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {848434 , "2019-10-16T12:12:14Z" , "en-ch" , "facebook" , 10 , "facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook" , "facebook.com"}));
searchlogData.Add(new GenericRow(new object[] {604846 , "2019-10-16T12:13:18Z" , "en-us" , "wikipedia" , 612 , "wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia" , "wikipedia.org"}));
searchlogData.Add(new GenericRow(new object[] {840614 , "2019-10-16T12:13:41Z" , "en-us" , "xbox" , 1220 , "xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360" , "xbox.com/xbox360"}));
searchlogData.Add(new GenericRow(new object[] {656666 , "2019-10-16T12:15:19Z" , "en-us" , "hotmail" , 691 , "hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {951513 , "2019-10-16T12:17:37Z" , "en-us" , "pokemon" , 63 , "pokemon.com;pokemon.com/us;serebii.net" , "pokemon.com"}));
searchlogData.Add(new GenericRow(new object[] {350350 , "2019-10-16T12:18:17Z" , "en-us" , "wolfram" , 30 , "wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram" , "NULL"}));
searchlogData.Add(new GenericRow(new object[] {641615 , "2019-10-16T12:19:21Z" , "en-us" , "kahn" , 119 , "khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name)" , "khanacademy.org"}));
searchlogData.Add(new GenericRow(new object[] {321065 , "2019-10-16T12:20:19Z" , "en-us" , "clothes" , 732 , "gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars" , "footballfanatics.com/college_washington_state_cougars"}));
searchlogData.Add(new GenericRow(new object[] {651777 , "2019-10-16T12:20:49Z" , "en-us" , "food recipes" , 183 , "allrecipes.com;foodnetwork.com;simplyrecipes.com" , "foodnetwork.com"}));
searchlogData.Add(new GenericRow(new object[] {666352 , "2019-10-16T12:21:16Z" , "en-us" , "weight loss" , 630 , "en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com" , "webmd.com/diet"}));

// Schema for the above data
// For a full list of types you can use, please see the following link:
// https://docs.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.types?view=spark-dotnet
var searchlogSchema = new StructType(new List<StructField>()
{
    new StructField("Id", new IntegerType()),
    new StructField("Time", new StringType()),
    new StructField("Market", new StringType()),
    new StructField("Searchtext", new StringType()),
    new StructField("Latency", new IntegerType()),
    new StructField("Links", new StringType()),
    new StructField("Clickedlinks", new StringType())
});

// Creating a DataFrame using the above data and schema as input to the CreateDataFrame API
DataFrame dfSearchlog = spark.CreateDataFrame(searchlogData, searchlogSchema);

// Displaying the created DataFrame
dfSearchlog.Show();

+------+--------------------+------+--------------------+-------+--------------------+--------------------+
|    Id|                Time|Market|          Searchtext|Latency|               Links|        Clickedlinks|
+------+--------------------+------+--------------------+-------+--------------------+--------------------+
|399266|2019-10-15T11:53:04Z| en-us|  how to make nachos|     73|www.nachos.com;ww...|                NULL|
|382045|2019-10-15T11:53:25Z| en-gb|    best ski resorts|    614|skiresorts.com;sk...|ski-europe.com;ww...|
|382045|2019-10-16T11:53:42Z| en-gb|          broken leg|     74|mayoclinic.com/he...|mayoclinic.com/he...|
|106479|2019-10-16T11:53:10Z| en-ca| south park episodes|     24|southparkstudios....|southparkstudios.com|
|906441|2019-10-16T11:54:18Z| en-us|              cosmos|   1213|cosmos.com;wikipe...|                NULL|
|351530|2019-10-16T11:54:29Z| en-fr|           microsoft|    241|microsoft.com;wik...|                NULL|
|640806|2019-10-16T11:54:32Z

## Casting String type to Timestamp type

We will now convert the `Time` column, which is currently of type `StringType`, to `TimestampType` using the `Column.Cast()` method.
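
The core of the conversion is a single `Cast` call. As a minimal sketch, assuming the `dfSearchlog` DataFrame from the previous cell is still in scope (the variable name `dfTimeInPlace` is only illustrative), the cast can be written directly over the existing column:

```csharp
using Microsoft.Spark.Sql;

// Assumes `dfSearchlog` from the previous cell is still in scope.
// WithColumn with an existing column name replaces that column,
// so "Time" keeps its original position in the DataFrame.
DataFrame dfTimeInPlace = dfSearchlog.WithColumn("Time", dfSearchlog["Time"].Cast("timestamp"));

// Time should now be reported as a timestamp instead of a string.
dfTimeInPlace.PrintSchema();
```

This variant keeps the column order unchanged, whereas the helper in the next cell moves the cast column to the end because it drops and re-adds it.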


In [20]:
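// Helper that casts the column `colName` of `df_` to the Spark SQL type named by `t` (for example "timestamp").
// The cast result is written to a temporary column, the original column is dropped,
// and the temporary column is renamed, so the cast column ends up as the last column.
public DataFrame CastColumn(DataFrame df_, string colName, string t)
{
    df_ = df_.WithColumn("NewCol__", df_[colName].Cast(t));
    df_ = df_.Drop(colName);
    df_ = df_.WithColumnRenamed("NewCol__", colName);
    return df_;
}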

// Calling the CastColumn function to return the new DataFrame
DataFrame dfTimestampCast = CastColumn(dfSearchlog, "Time", "timestamp");

// Display the new DataFrame
dfTimestampCast.Show();

+------+------+--------------------+-------+--------------------+--------------------+-------------------+
|    Id|Market|          Searchtext|Latency|               Links|        Clickedlinks|               Time|
+------+------+--------------------+-------+--------------------+--------------------+-------------------+
|399266| en-us|  how to make nachos|     73|www.nachos.com;ww...|                NULL|2019-10-15 11:53:04|
|382045| en-gb|    best ski resorts|    614|skiresorts.com;sk...|ski-europe.com;ww...|2019-10-15 11:53:25|
|382045| en-gb|          broken leg|     74|mayoclinic.com/he...|mayoclinic.com/he...|2019-10-16 11:53:42|
|106479| en-ca| south park episodes|     24|southparkstudios....|southparkstudios.com|2019-10-16 11:53:10|
|906441| en-us|              cosmos|   1213|cosmos.com;wikipe...|                NULL|2019-10-16 11:54:18|
|351530| en-fr|           microsoft|    241|microsoft.com;wik...|                NULL|2019-10-16 11:54:29|
|640806| en-us| wireless headphones| 