# TweetAnalysis - Data Analysis

This Spark .NET notebook shows how you can use .NET for Spark and the .NET notebook experience to analyze data.

The data to be analyzed has been previously prepared into the Spark tables `Topics` and `Mentions` in the `BuildTweetAnalysis` database.


## This notebook gives us interactive C#!

In [4]:
"a b c".Split(' ')

index,value
0,a
1,b
2,c


## Show daily count of distinct authors and overall daily count of how often a topic has been tagged
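This cell uses the .NET for Spark DataFrame API from C#. The helper functions `Col`, `Lower`, `Count`, `CountDistinct`, and `Desc` come from `Microsoft.Spark.Sql.Functions` and must be in scope via `using static Microsoft.Spark.Sql.Functions;`.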

In [3]:
var topics = spark.Table("buildtweetanalysis.Topics");

topics.GroupBy(Lower(Col("topic")).As("topic"), Col("date"))
      .Agg(Count("author").As("count"), CountDistinct("author").As("dist_author_count"))
      .OrderBy(Desc("count")).Show();

+------------+----------+-----+-----------------+
|       topic|      date|count|dist_author_count|
+------------+----------+-----+-----------------+
| sqlsatparis|05/09/2015|  128|               10|
|   sqlsat109|03/03/2012|  100|                2|
|     jss2014|02/12/2014|   85|               18|
|     jss2014|01/12/2014|   48|               13|
|   hackfort2|26/03/2015|   47|                3|
|hadoopsummit|26/06/2013|   46|                1|
|  strataconf|29/02/2012|   45|                2|
|     sqlpass|13/10/2011|   38|                2|
|  mstechdays| 11/2/2015|   36|               12|
|     bigdata|29/02/2012|   36|                1|
|    sqlrally|10/05/2012|   34|                1|
|     sqlpass|08/11/2012|   33|                5|
|   build2015|29/04/2015|   32|               27|
|     jss2013|03/12/2013|   31|               10|
|  mstechdays| 12/2/2014|   28|               13|
|     sql2012|07/03/2012|   27|                2|
| sqlsatparis|13/09/2014|   26|               10|
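
To make the difference between the two aggregates concrete, here is a minimal sketch of the same grouping in plain LINQ to Objects. The sample rows are hypothetical and only stand in for the `Topics` table; Spark's `Count("author")` additionally skips null authors, which the sample does not contain.

```csharp
using System;
using System.Linq;

// Hypothetical sample rows standing in for the Topics table (not real data).
var sample = new[]
{
    (topic: "Spark",  date: "01/01/2020", author: "alice"),
    (topic: "spark",  date: "01/01/2020", author: "alice"),
    (topic: "SPARK",  date: "01/01/2020", author: "bob"),
    (topic: "dotnet", date: "01/01/2020", author: "carol"),
};

// Group by lower-cased topic and date, then compute both aggregates per group.
var daily = sample
    .GroupBy(r => (topic: r.topic.ToLowerInvariant(), r.date))
    .Select(g => new
    {
        g.Key.topic,
        g.Key.date,
        // Every tagging counts, even repeated ones by the same author.
        count = g.Count(),
        // Each author counts once per topic and day.
        dist_author_count = g.Select(r => r.author).Distinct().Count()
    })
    .OrderByDescending(x => x.count)
    .ToList();

foreach (var row in daily)
    Console.WriteLine($"{row.topic} {row.date} {row.count} {row.dist_author_count}");
// prints:
// spark 01/01/2020 3 2
// dotnet 01/01/2020 1 1
```

In the sample, `alice` tags the topic twice on the same day, so `count` is 3 while `dist_author_count` is 2.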


## .NET based notebooks can do cool visualizations too!


### Show Michael's top 5 quarterly tweet topics in bar chart format
First, show the data in raw form and with the custom DataFrame formatter.

In [11]:
var michaels_top_5_quarterly_topics = spark.Sql(@"
    WITH michaels_topics_raw AS (SELECT TO_DATE(date,'dd/M/yyyy') AS date, topic FROM BuildTweetAnalysis.Topics WHERE author = 'MikeDoesBigData')
    ,    michaels_topics AS (SELECT YEAR(date) AS year, QUARTER(date) AS quarter, topic FROM michaels_topics_raw )
    ,    michaels_quarterly_topic_count AS (SELECT year, quarter, LOWER(topic) AS topic, COUNT(topic) AS topic_count FROM michaels_topics GROUP BY year, quarter, LOWER(topic))
    ,    all_michaels_topics_ranked_by_quarter AS (SELECT year, quarter, topic, topic_count, ROW_NUMBER () OVER (PARTITION BY year, quarter ORDER BY topic_count DESC) AS quarterly_rank FROM michaels_quarterly_topic_count)
    SELECT CONCAT(year, ' - ', quarter) AS year_quarter, topic, topic_count FROM all_michaels_topics_ranked_by_quarter WHERE quarterly_rank < 6
    ")
    .OrderBy(Col("year_quarter"));
    
michaels_top_5_quarterly_topics.Show();


+------------+----------------+-----------+
|year_quarter|           topic|topic_count|
+------------+----------------+-----------+
|    2010 - 4|         spatial|          2|
|    2010 - 4|          denali|          2|
|    2010 - 4|       sqlserver|          3|
|    2010 - 4|         sqlpass|         13|
|    2010 - 4|       filetable|          3|
|    2011 - 1|           azure|          1|
|    2011 - 1|             w2e|          4|
|    2011 - 1|       webmatrix|          1|
|    2011 - 1|        bizspark|          1|
|    2011 - 2|        sqlazure|          2|
|    2011 - 2|           nosql|          2|
|    2011 - 2|beyondrelational|          5|
|    2011 - 2|       sqlserver|          5|
|    2011 - 2|        msteched|          7|
|    2011 - 3|        sqlazure|          3|
|    2011 - 3|        nosqlnow|         24|
|    2011 - 3|           nosql|          4|
|    2011 - 3|       sqlserver|          4|
|    2011 - 3|          hadoop|          3|
|    2011 - 4|       sqlserver| 
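
The chain of common table expressions in the query (parse the date, derive year and quarter, count per topic, rank within each quarter, keep the top ranks) can be sketched in plain LINQ to Objects as follows. This is only an illustration under stated assumptions: the rows are hypothetical, `topN` is set to 2 to keep the sample small (the query keeps 5 via `quarterly_rank < 6`), and the .NET format string `d/M/yyyy` is used as a stand-in for the Spark pattern.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical (date, topic) rows for a single author; not taken from the real table.
var rows = new[]
{
    (date: "05/09/2015", topic: "SQLSatParis"),
    (date: "11/2/2015",  topic: "mstechdays"),
    (date: "12/02/2015", topic: "MSTechDays"),
    (date: "26/03/2015", topic: "hackfort2"),
    (date: "27/03/2015", topic: "spark"),
};

int topN = 2; // illustrative; the notebook query keeps the top 5

var ranked = rows
    // Parse day/month/year; in .NET, "d" and "M" accept one or two digits.
    .Select(r => (d: DateTime.ParseExact(r.date, "d/M/yyyy", CultureInfo.InvariantCulture),
                  topic: r.topic.ToLowerInvariant()))
    // Quarter 1..4 from the month, combined with the year like the CONCAT call.
    .Select(r => (yq: $"{r.d.Year} - {(r.d.Month - 1) / 3 + 1}", r.topic))
    // COUNT(topic) per quarter and lower-cased topic.
    .GroupBy(r => r)
    .Select(g => (g.Key.yq, g.Key.topic, topic_count: g.Count()))
    // ROW_NUMBER() OVER (PARTITION BY quarter ORDER BY topic_count DESC).
    .GroupBy(r => r.yq)
    .SelectMany(g => g.OrderByDescending(r => r.topic_count)
                      .Select((r, i) => (r.yq, r.topic, r.topic_count, rank: i + 1)))
    .Where(r => r.rank <= topN)
    .OrderBy(r => r.yq, StringComparer.Ordinal)
    .ToList();

foreach (var r in ranked)
    Console.WriteLine($"{r.yq} | {r.topic} | {r.topic_count}");
```

Note that `ROW_NUMBER` assigns distinct ranks even when counts tie, so which of several equally frequent topics makes the cut is not defined by the query. The final `OrderBy` in the notebook sorts only by `year_quarter`, which is why the rows within a quarter in the output above are not ordered by count.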

Now let's use the XPlot.Plotly library to generate a stacked bar chart of the data.

In [12]:
using XPlot.Plotly;

// Prepare the Bar Charts
var barlist = new List<Graph.Bar>();

// Create the full quarter range ("2010 - 1" through "2020 - 4") to provide a complete and ordered X-axis
var fullrange = Enumerable.Range(2010, 11).SelectMany(y => Enumerable.Range(1, 4), (y, q) => $"{y} - {q}");

// Transform the Spark DataFrame into a .NET IEnumerable to avoid hitting the Spark cluster more than once
var topics_table = michaels_top_5_quarterly_topics.Collect().ToArray()
                   .Select(x=>new {topic = x.GetAs<string>("topic"), yq = x.GetAs<string>("year_quarter"), cnt = x.GetAs<int>("topic_count")});

// Add one Bar Chart per topic
foreach (var topic in topics_table.Select(x=>x.topic).Distinct()) 
{
   // Get the data for the specific topic. Join the full range to fill in the missing quarters
   var partialdata = topics_table.Where(x=>x.topic==topic).Select(x => new {x.yq, x.cnt});
   var data = from q in fullrange
          join entry in partialdata on q equals entry.yq into gj
          from subq in gj.DefaultIfEmpty()
          select new { q, cnt = subq?.cnt };

   // LINQ to Objects preserves the order of the outer sequence (fullrange) in a group join,
   // so the x and y values stay correlated even though `data` is enumerated once for each axis.
   var arr_year_quarter = data.Select(x=>x.q); 
   var arr_topic_count = data.Select(x=>x.cnt); 
  
   var bar = new Graph.Bar
   {
     name = topic,
     x = arr_year_quarter,
     y = arr_topic_count
   };
   barlist.Add(bar);
}

var layout = new XPlot.Plotly.Layout.Layout{barmode="stack", title="Michael's Quarterly Top Tweet Topics"};
var chart = Chart.Plot(barlist);
chart.WithLayout(layout);
chart.WithHeight(1000);
chart.WithWidth(1400);
chart.WithXTitle("Yearly Quarters");
chart.WithYTitle("Number of Topic Mentions");
chart
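
The gap-filling step inside the loop is a left outer join of the full quarter range against the quarters that actually have data. The following is a minimal, Spark-free sketch of that step alone, assuming a shortened two-year range and hypothetical counts for a single topic:

```csharp
using System;
using System.Linq;

// Two years of quarters as the complete x-axis (the notebook uses 11 years).
var fullrange = Enumerable.Range(2010, 2)
    .SelectMany(y => Enumerable.Range(1, 4), (y, q) => $"{y} - {q}")
    .ToList();

// Hypothetical counts for one topic; most quarters are missing.
var partialdata = new[]
{
    new { yq = "2010 - 4", cnt = 13 },
    new { yq = "2011 - 2", cnt = 5 },
};

// Left outer join: every quarter is kept, unmatched quarters get a null count.
var data = (from q in fullrange
            join entry in partialdata on q equals entry.yq into gj
            from subq in gj.DefaultIfEmpty()
            select new { q, cnt = subq?.cnt }).ToList();

foreach (var d in data)
    Console.WriteLine($"{d.q}: {(d.cnt.HasValue ? d.cnt.ToString() : "null")}");
```

Every quarter appears exactly once and in range order; only `2010 - 4` and `2011 - 2` carry a value. Materializing the query with `ToList()` is a design choice of this sketch: it runs the join once, so the x and y sequences derived from it cannot drift apart.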

In [8]:
// Run this to refresh the Spark table cache after the tables have been recreated
//spark.Sql("REFRESH TABLE BuildTweetAnalysis.Topics");
//spark.Sql("REFRESH TABLE BuildTweetAnalysis.Mentions");