# GitHub Project Analysis
In this notebook, we'll see how to use .NET for Apache Spark to analyze a file containing info about a set of GitHub projects.

## Let's kick things off by starting a Spark Session
All we need is the notebook's predefined `spark` variable to get our .NET for Spark app started!

In [3]:
spark

SparkContext
{ SparkContext: DefaultParallelism: 32 }


## Now that we have a Spark Session, we can start writing .NET for Spark code
We can read our input file into a DataFrame and set its schema. Then we'll print out our DataFrame.

Update the code below with the Azure storage path to your projects input data.

In [4]:
DataFrame projectsDf = spark
    .Read()
    .Schema("id INT, url STRING, owner_id INT, " +
    "name STRING, descriptor STRING, language STRING, " +
    "created_at STRING, forked_from INT, deleted STRING, " +
    "updated_at STRING")
    .Csv("<path_to_projects.csv>");

projectsDf.Show();

+----+--------------------+--------+--------------------+--------------------+--------+---------------+-----------+-------+-------------------+
|  id|                 url|owner_id|                name|          descriptor|language|     created_at|forked_from|deleted|         updated_at|
+----+--------------------+--------+--------------------+--------------------+--------+---------------+-----------+-------+-------------------+
|   1|https://api.githu...|       1|           ruote-kit|RESTful wrapper f...|    Ruby|12/8/2009 10:17|          2|      0|     11/5/2015 1:15|
|null|                null|    null|                null|                null|    null|           null|       null|   null|               null|
|null|                null|    null|                null|                null|    null|           null|       null|   null|               null|
|   4|https://api.githu...|      24|             basemap|                null|     C++|6/14/2012 14:14|          3|      1|0000-00-00 00

## Clean up our data
Our data is all there, but it's looking a bit crowded: some rows are entirely null, and we won't need every column. Let's do some **data prep** to clean up our data.

### We can drop any rows with null entries.

In [5]:
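// Na() exposes the DataFrame's missing-data helpers.
DataFrameNaFunctions dropEmptyProjects = projectsDf.Na();
// "any" drops every row that has at least one null value.
DataFrame cleanedProjects = dropEmptyProjects.Drop("any");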
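
To make the `"any"` mode concrete, here is a minimal sketch in plain C# (no Spark required) that mimics it with LINQ and contrasts it with Spark's other mode, `"all"`. The `rows` array and its three columns are hypothetical stand-ins for the DataFrame, not part of the original data.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory rows standing in for the DataFrame; null marks a missing value.
var rows = new (string Name, string Language, int? ForkedFrom)[]
{
    ("ruote-kit", "Ruby", 2),
    (null, null, null),
    ("basemap", "C++", null),
};

// "any": a row is dropped if at least one of its columns is null.
var keptAny = rows
    .Where(r => r.Name != null && r.Language != null && r.ForkedFrom != null)
    .ToArray();

// "all": a row is dropped only if every one of its columns is null.
var keptAll = rows
    .Where(r => r.Name != null || r.Language != null || r.ForkedFrom != null)
    .ToArray();

Console.WriteLine(keptAny.Length); // 1
Console.WriteLine(keptAll.Length); // 2
```

With `"any"`, the partially filled `basemap` row is removed along with the empty one, so this mode is the stricter of the two.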

### We can also drop columns we won't need later.


In [6]:
cleanedProjects = cleanedProjects.Drop("id", "url", "owner_id", "descriptor");
cleanedProjects.Show();

+--------------------+-----------+----------------+-----------+-------+-------------------+
|                name|   language|      created_at|forked_from|deleted|         updated_at|
+--------------------+-----------+----------------+-----------+-------+-------------------+
|           ruote-kit|       Ruby| 12/8/2009 10:17|          2|      0|     11/5/2015 1:15|
|           cocos2d-x|        C++| 3/12/2012 16:48|          6|      0|   10/22/2015 17:36|
|           cocos2d-x|          C| 4/23/2012 10:20|          6|      0|    11/1/2015 17:32|
|       rake-compiler|       Ruby|  8/1/2012 18:33|   14556189|      0|    11/3/2015 19:30|
|    cobertura-plugin|       Java| 7/26/2012 18:46|     193522|      0|    11/1/2015 19:55|
|     scala-vs-erlang|     Erlang|12/25/2011 13:51|    1262879|      0|    10/22/2015 4:50|
|              opencv|        C++|  8/2/2012 12:50|         29|      0|    10/26/2015 6:44|
| redmine_git_hosting|       Ruby| 7/30/2012 12:53|         42|      0|   10/28/

## Now our data's looking better! Let's analyze this prepped data.
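We can group our projects by language, and then average the `forked_from` column within each group. The values in `forked_from` look like project IDs rather than fork counts, so treat the result as a demonstration of grouping and aggregation rather than as a popularity ranking.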


In [7]:
DataFrame groupedDF = cleanedProjects
    .GroupBy("language")
    .Agg(Avg(cleanedProjects["forked_from"]));

### Let's order our results so the languages with the highest averages come first.


In [8]:
groupedDF.OrderBy(Desc("avg(forked_from)")).Show();

+------------------+------------------+
|          language|  avg(forked_from)|
+------------------+------------------+
|               GAP|         5497992.4|
|              PAWN|         4883154.0|
|               ooc|         4763738.5|
|            Racket| 3868442.445652174|
|              Haxe|3217677.3333333335|
|               CSS|   2784979.3203125|
|            Nimrod|       2662935.625|
|                eC|         2304409.5|
|              XSLT|      1638888.1875|
|               Bro|1607539.3333333333|
|              HTML|1554805.9310344828|
|               TeX|1534278.6363636365|
|             Logos|         1496027.5|
|              VHDL|        1471550.25|
|Ragel in Ruby Host|         1307989.5|
|     SuperCollider|1301081.8666666667|
|                Io|       1301006.875|
|            Kotlin|         1233764.4|
|           Ruby"""|         1151385.0|
|               Ada|1149265.3333333333|
+------------------+------------------+
only showing top 20 rows
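
The grouping and ordering steps above can be sketched in plain C# with LINQ, which may help if you want to check the logic without a Spark session. The `projects` array is a hypothetical, hand-picked sample, and the names `averages` and `AvgForkedFrom` are made up for illustration. In Spark itself the aggregated column is named `avg(forked_from)` by default, which is why the `OrderBy` call refers to it by that name.

```csharp
using System;
using System.Linq;

// Hypothetical sample of (language, forked_from) pairs.
var projects = new (string Language, int ForkedFrom)[]
{
    ("Ruby", 2), ("Ruby", 42), ("C++", 6), ("C++", 29), ("C", 6),
};

// Group by language, average forked_from per group, then sort descending.
var averages = projects
    .GroupBy(p => p.Language)
    .Select(g => (Language: g.Key, AvgForkedFrom: g.Average(p => p.ForkedFrom)))
    .OrderByDescending(r => r.AvgForkedFrom)
    .ToList();

foreach (var (language, avg) in averages)
{
    Console.WriteLine($"{language}: {avg}");
}
```

For this sample, Ruby comes first with an average of 22, followed by C++ and then C.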

## We can use Spark SQL with user-defined functions (UDFs) and SQL calls in our notebooks, too!
Let's use a UDF that checks whether a given date comes after October 20, 2015. (Despite its name, `BeforeOct` returns `true` for dates after that cutoff.)


### First, we define our UDF, including the type of its input, output, and the functionality it performs.


In [9]:
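// Register a UDF that takes a string and returns a bool.
// The result is true only when the string parses as a date later than
// October 20, 2015; strings that fail to parse yield false.
spark.Udf().Register<string, bool>(
    "BeforeOct",
    (date) => DateTime.TryParse(date, out DateTime convertedDate) &&
        (convertedDate > (new DateTime(2015, 10, 20))));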
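
Before registering a predicate like this, it can be useful to try it on a few sample strings outside Spark. The sketch below does that in plain C#. The name `isAfterOct20` is hypothetical, and unlike the cell above it passes `CultureInfo.InvariantCulture` explicitly. That is an assumption made here so the results do not depend on the machine's culture settings, since `DateTime.TryParse(string, out DateTime)` parses with the current culture.

```csharp
using System;
using System.Globalization;

// Same comparison as the UDF, but parsing with the invariant culture
// (month/day/year) so the outcome is the same on every machine.
Func<string, bool> isAfterOct20 = date =>
    DateTime.TryParse(date, CultureInfo.InvariantCulture,
        DateTimeStyles.None, out DateTime parsed)
    && parsed > new DateTime(2015, 10, 20);

Console.WriteLine(isAfterOct20("10/22/2015 17:36"));    // True
Console.WriteLine(isAfterOct20("12/8/2009 10:17"));     // False
Console.WriteLine(isAfterOct20("0000-00-00 00:00:00")); // False: not a valid date
Console.WriteLine(isAfterOct20("10/20/2015 0:00"));     // False: the comparison is strict
```

The last two cases show two edge behaviors worth knowing about: unparseable values are silently treated as `false`, and midnight on October 20 itself does not count as "after" the cutoff.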

### Now that we have our UDF `BeforeOct` defined, we can call it on our prepped data.
We can use **Spark SQL** to write a query that calls `BeforeOct` on each row of our DataFrame.


In [10]:
cleanedProjects.CreateOrReplaceTempView("dateView");

DataFrame dateDf = spark.Sql(
    "SELECT *, BeforeOct(dateView.updated_at) AS datebefore FROM dateView");

dateDf.Show();

// Keep only the rows where the UDF returned true (updated after October 20, 2015).
DataFrame recentDf = dateDf.Filter("datebefore");

recentDf.Show();

+--------------------+-----------+----------------+-----------+-------+-------------------+----------+
|                name|   language|      created_at|forked_from|deleted|         updated_at|datebefore|
+--------------------+-----------+----------------+-----------+-------+-------------------+----------+
|           ruote-kit|       Ruby| 12/8/2009 10:17|          2|      0|     11/5/2015 1:15|      true|
|           cocos2d-x|        C++| 3/12/2012 16:48|          6|      0|   10/22/2015 17:36|      true|
|           cocos2d-x|          C| 4/23/2012 10:20|          6|      0|    11/1/2015 17:32|      true|
|       rake-compiler|       Ruby|  8/1/2012 18:33|   14556189|      0|    11/3/2015 19:30|      true|
|    cobertura-plugin|       Java| 7/26/2012 18:46|     193522|      0|    11/1/2015 19:55|      true|
|     scala-vs-erlang|     Erlang|12/25/2011 13:51|    1262879|      0|    10/22/2015 4:50|      true|
|              opencv|        C++|  8/2/2012 12:50|         29|      0|  