# TweetAnalysis - Data Prep
This Spark .NET notebook prepares the tweet analysis tables `Mentions` and `Topics` in the `BuildTweetAnalysis` database.

It demonstrates the use of Spark .NET, with a focus on how to:
- Use a .NET function (`extract_items`) in a notebook as a Spark UDF
- Use some of the Spark functions from .NET (including different ways to reference columns)
- Call into Spark SQL
- Create the Parquet-backed Spark tables that can be used by other Spark applications and notebooks, and even by SQL engines

**Please replace the `inputfile` path with the location where you placed the Tweet files.** 

If you would like to use your own tweet data, you can use https://tweetdownload.net and save the result as CSV.

## This notebook gives us interactive C#!


In [16]:
bool display = true; 

"a b c".Split(' ')

index,value
0,a
1,b
2,c


## Define a C# function `extract_items`

`extract_items` splits a tweet on spaces and common punctuation characters and returns the words that start with the given `prefix`, with the prefix itself removed.

This means that you get the mentioned Twitter handles if you use `prefix:"@"` and the tweet topics (hashtags) if you use `prefix:"#"`.

Once this cell has run, the function can be used in the later cells of the notebook.

In [6]:
    static IEnumerable<string> extract_items(string tweet, string prefix)
    {
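        // Guard against null or empty tweets (for example, from malformed CSV rows).
        if (string.IsNullOrEmpty(tweet)) { return Enumerable.Empty<string>(); }

        // Split on spaces and common punctuation, keep the tokens that start with
        // the prefix (but are not the bare prefix), and strip the prefix from each.
        return tweet.Split(new char[] { ' ', ',', '.', ':', '!', ';', '"', '“', ')', '?', '\'' })
                    .Where(x => x.StartsWith(prefix) && x != prefix)
                    .Select(x => x.Substring(prefix.Length));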
    }
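
To make the behavior of the function concrete, the following is a minimal stand-alone sketch in plain C# (no Spark required). `ExtractItems` is a copy of the logic above under a different name so that the snippet runs on its own, and the sample tweet is a made-up string, not a row from the data set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-alone copy of the extraction logic so this sketch runs outside the notebook.
static IEnumerable<string> ExtractItems(string tweet, string prefix) =>
    tweet.Split(new char[] { ' ', ',', '.', ':', '!', ';', '"', '“', ')', '?', '\'' })
         .Where(x => x.StartsWith(prefix) && x != prefix)
         .Select(x => x.Substring(prefix.Length));

// Hypothetical sample tweet, not taken from the data set.
var sample = "Thanks @alice and @bob for the #Spark talk!";

var sampleMentions = ExtractItems(sample, "@").ToList();
var sampleTopics = ExtractItems(sample, "#").ToList();

Console.WriteLine(string.Join(",", sampleMentions)); // alice,bob
Console.WriteLine(string.Join(",", sampleTopics));   // Spark
```

The trailing `!` is one of the split characters, so it is neither part of `talk` nor returned as an item.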

## Creating the 'tweets' dataframe

This cell creates the `tweets` dataframe by reading all the tweet files from the specified path (note the file wildcard in the path), using an explicit schema of four string columns.

It then removes duplicate tweets with `Distinct()`.

In [7]:
var inputfile = @"abfss://<container>@<ADLSGen2Acct>.dfs.core.windows.net/<path>/Tweets/*.csv";
long before_count = 0;

var tweets = spark.Read().Schema("date STRING, time STRING, author STRING, tweet STRING").Format("csv").Load(inputfile);
if (display) { before_count = tweets.Count(); }

// Remove rows that are exact duplicates across all four columns.
tweets = tweets.Distinct();
if (display)
{
    var after_count = tweets.Count();
    Console.WriteLine($"Number of distinct tweets: {after_count} - Removed number of duplicates: {before_count - after_count}");
}

Number of distinct tweets: 17570 - Removed number of duplicates: 150
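
`Distinct()` on a dataframe compares whole rows, so two tweets count as duplicates only if all four columns match. The following plain C# sketch mimics that behavior with LINQ on value tuples; it is an analogy only and does not use Spark, and the rows are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical rows shaped like the tweets schema: (date, time, author, tweet).
var rows = new[]
{
    ("01/01/2015", "10:00", "userA", "Hello #Spark"),
    ("01/01/2015", "10:00", "userA", "Hello #Spark"), // exact duplicate of the first row
    ("01/01/2015", "10:05", "userA", "Hello #Spark"), // same text, different time: kept
};

// Value tuples compare field by field, so Distinct() drops only fully identical rows.
var distinctRows = rows.Distinct().ToList();

Console.WriteLine($"Number of distinct tweets: {distinctRows.Count} - Removed number of duplicates: {rows.Length - distinctRows.Count}");
// Number of distinct tweets: 2 - Removed number of duplicates: 1
```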

## Registering the C# function as Spark UDF
The following wraps the `extract_items` function in Spark UDFs that can be applied to dataframe columns.

A Spark UDF takes its arguments from columns, but the `prefix` parameter is a constant string rather than a column value. We therefore define two explicitly named UDFs, `extract_mentions` and `extract_topics`, that hard-code the `prefix` for each use.


In [8]:
Func<Column, Column> extract_mentions = 
        Udf<string, IEnumerable<string>>((tweet) => extract_items(tweet, "@"));
        
Func<Column, Column> extract_topics = 
        Udf<string, IEnumerable<string>>((tweet) => extract_items(tweet, "#"));

## Extracting Mentions and Topics
Now we can use the UDFs to extract the mentions and topics. Both are returned as string arrays (`IEnumerable<string>` on the .NET side).

Note that we can refer to the columns either via the generic `Col(colname)` function or via the dataframe's column indexer `dataframe[colname]`.


In [9]:
var mentionsandtopics = tweets.Select(Col("date"), Col("time"), Col("author")
                                    , extract_mentions(tweets["tweet"]).As("mentions")
                                    , extract_topics(tweets["tweet"]).As("topics")
                );
           
if (display) {mentionsandtopics.Show();}

+----------+-----+---------+--------------------+--------------------------------+
|      date| time|   author|            mentions|                          topics|
+----------+-----+---------+--------------------+--------------------------------+
|18/09/2015|00:01|       iC|         [dahowlett]|                              []|
|01/09/2015|16:56|       iC|[couchbase, museu...|                              []|
|16/08/2015|14:12|  etsurow|                  []|[JqkjcfD�����]|
|26/04/2015|00:04|       iC|              [lfnw]|                              []|
|14/02/2015|22:20|       iC|          [Stanford]|                 [ValentinesDay]|
|11/11/2014|02:48|       iC|[benkepes, jobswo...|                              []|
|30/07/2014|21:12|       iC|             [wsdot]|                              []|
|12/07/2014|02:03|       iC|    [xmlgrrl, UMAWG]|              [pbsnewshour, pii]|
|11/07/2014|04:50|       iC|[Paul_Hofmann, dw...|                              []|
|16/06/2014|19:04|   

## Create mentions and topics dataframes

We want to pivot the arrays into one row per array item. To avoid the cartesian product between mentions and topics, we create one dataframe each.
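
The difference in row counts can be sketched in plain C# with LINQ, where `SelectMany` plays a role similar to exploding an array. This is an analogy rather than Spark code, and the row below is hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical row with two mentions and three topics.
var row = (Author: "userA",
           Mentions: new[] { "alice", "bob" },
           Topics: new[] { "Spark", "Azure", "BigData" });

// Exploding both arrays in the same result pairs every mention with every topic.
var combined = row.Mentions
    .SelectMany(m => row.Topics, (m, t) => (row.Author, Mention: m, Topic: t))
    .ToList();

// Exploding each array on its own yields one row per item.
var mentionRows = row.Mentions.Select(m => (row.Author, Mention: m)).ToList();
var topicRows = row.Topics.Select(t => (row.Author, Topic: t)).ToList();

Console.WriteLine(combined.Count);    // 6
Console.WriteLine(mentionRows.Count); // 2
Console.WriteLine(topicRows.Count);   // 3
```

Keeping the two results separate gives 2 + 3 rows instead of 2 × 3. Spark's `Explode` also produces no output row for an empty array, which is why tweets without mentions or topics do not appear in the respective dataframe.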


In [10]:
var mentions = mentionsandtopics.Select(Col("date"), Col("time"), Col("author"), Col("mentions").As("mention"))
                                .WithColumn("mention", Explode(Col("mention")));
if (display) {mentions.Show();}

+----------+-----+-------+--------------+
|      date| time| author|       mention|
+----------+-----+-------+--------------+
|18/09/2015|00:01|     iC|     dahowlett|
|01/09/2015|16:56|     iC|     couchbase|
|01/09/2015|16:56|     iC|museumofflight|
|01/09/2015|16:56|     iC|      AWScloud|
|01/09/2015|16:56|     iC|   googlecloud|
|26/04/2015|00:04|     iC|          lfnw|
|14/02/2015|22:20|     iC|      Stanford|
|11/11/2014|02:48|     iC|      benkepes|
|11/11/2014|02:48|     iC|     jobsworth|
|11/11/2014|02:48|     iC|    salesforce|
|11/11/2014|02:48|     iC|  parkerharris|
|11/11/2014|02:48|     iC|       Benioff|
|11/11/2014|02:48|     iC|        fscavo|
|30/07/2014|21:12|     iC|         wsdot|
|12/07/2014|02:03|     iC|       xmlgrrl|
|12/07/2014|02:03|     iC|         UMAWG|
|11/07/2014|04:50|     iC|  Paul_Hofmann|
|11/07/2014|04:50|     iC|      dwavesys|
|28/04/2015|10:34|PulsWeb|   sebastianbk|
|23/02/2015|09:07|PulsWeb|   GUSS_FRANCE|
+----------+-----+-------+--------------+

In [11]:
var topics = mentionsandtopics.Select(Col("date"), Col("time"), Col("author"), Col("topics").As("topic"))
                              .WithColumn("topic", Explode(Col("topic")));
if (display) {topics.Show();}

+----------+-----+---------------+------------+
|      date| time|         author|       topic|
+----------+-----+---------------+------------+
|09/03/2015|20:38|       SQLCindy|   HDInsight|
|09/03/2015|20:38|       SQLCindy|       Azure|
|09/03/2015|20:38|       SQLCindy|        MSBI|
|09/03/2015|20:38|       SQLCindy|      Hadoop|
|09/03/2015|20:38|       SQLCindy|    SQLAzure|
|25/06/2013|02:23|       SQLCindy| BigDataCamp|
|25/06/2013|02:23|       SQLCindy|   HDInsight|
|25/06/2013|02:23|       SQLCindy|        MSBI|
|16/04/2013|14:52|        sqlrnnr|      Hadoop|
|16/04/2013|14:52|        sqlrnnr|      SSSOLV|
|18/03/2013|15:53|       SQLCindy|  BigData100|
|19/02/2013|17:04|       SQLBalls|       MVP13|
|19/02/2013|17:04|       SQLBalls|         WIT|
|19/02/2013|17:04|       SQLBalls|     SQLPASS|
|18/02/2013|20:01|       SQLCindy|       mvp13|
|18/02/2013|20:01|       SQLCindy|     mvpbuzz|
|18/02/2013|20:01|       SQLCindy|   sqlserver|
|11/08/2012|00:12|    sqlagentman|     B

## Cleaning up the existing Database

First we drop the tables, because by default Spark refuses to drop a database that still contains tables (unless `CASCADE` is specified on `DROP DATABASE`).

Then we drop and recreate the database and set the default database context to it.


In [12]:
spark.Sql("DROP TABLE IF EXISTS BuildTweetAnalysis.Mentions");
spark.Sql("DROP TABLE IF EXISTS BuildTweetAnalysis.Topics");

In [13]:
spark.Sql("DROP DATABASE IF EXISTS BuildTweetAnalysis");
spark.Sql("CREATE DATABASE BuildTweetAnalysis");
spark.Sql("USE BuildTweetAnalysis");

## Creating the new tables
We create a `Mentions` table from the `mentions` dataframe and a `Topics` table from the `topics` dataframe. The first cell uses an unqualified table name, which resolves against the current database set by `USE`; the second uses the fully qualified name.

Since we only have a few hundred KB of data, we want to avoid overpartitioning the tables into too many files. Thus we repartition the data into a single partition.

Note that this simple form of table creation does not specify partitioning or bucketing (clustering) columns, which you would want to specify for large production tables.

In [14]:
mentions.Repartition(1).Write().SaveAsTable("Mentions");
if (display) {spark.Sql("SELECT COUNT(*) FROM Mentions").Show();}

+--------+
|count(1)|
+--------+
|   25246|
+--------+

In [15]:
topics.Repartition(1).Write().SaveAsTable("BuildTweetAnalysis.Topics");
if (display) {spark.Sql("SELECT COUNT(*) FROM BuildTweetAnalysis.Topics").Show();}

+--------+
|count(1)|
+--------+
|   14587|
+--------+