# DataFrame Lab

In this lab, we will work with data from [Libraries.io](http://Libraries.io), a package manager aggregator. The data consist of two files: one details packages and package managers, the other details the code repositories in which the packages are developed.

In [1]:
data = spark.read.csv("hdfs:///data/projects-1.0.0-2017-06-15.csv",header=True, inferSchema=True, mode="DROPMALFORMED")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
15,application_1566055793802_0013,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Verify that the data was read in correctly using `show`.

In [2]:
data.show(vertical=True)

-RECORD 0------------------------------------------------
 ID                               | 1                    
 Platform                         | Alcatraz             
 Name                             | 21st digital Temp... 
 Created Timestamp                | 2015-01-11 23:56:... 
 Updated Timestamp                | 2017-06-14 01:03:... 
 Description                      | A starting point ... 
 Keywords                         | null                 
 Homepage URL                     | https://github.co... 
 Licenses                         | null                 
 Repository URL                   | https://github.co... 
 Versions Count                   | 0                    
 SourceRank                       | 2                    
 Latest Release Publish Timestamp | 2017-06-14 01:03:... 
 Latest Release Number            | null                 
 Package Manager ID               | null                 
 Dependent Projects Count         | 0                    
 Language     

The rest of the lab consists of answering questions about the data. 

### How many packages are accounted for in this dataset?

Hint: Use `count`

In [3]:
data.count()

2211516

### What package managers are included in the data?

The package manager is in the column named "Platform". Spark resolves column names case-insensitively by default, so 'platform' works as well.

Hint: Extract the package manager names using `select` and then use `distinct`

In [4]:
managers = data.select('platform').distinct()

View the results using `collect`.

In [5]:
managers.collect()

[Row(platform='NuGet'), Row(platform='Emacs'), Row(platform='Sublime'), Row(platform='Meteor'), Row(platform='CRAN'), Row(platform='Hex'), Row(platform='Pub'), Row(platform='Clojars'), Row(platform='Maven'), Row(platform='CocoaPods'), Row(platform='NPM'), Row(platform='Bower'), Row(platform='Packagist'), Row(platform='Homebrew'), Row(platform='Atom'), Row(platform='Elm'), Row(platform='Wordpress'), Row(platform='Julia'), Row(platform='SwiftPM'), Row(platform='Jam'), Row(platform='Pypi'), Row(platform='Inqlude'), Row(platform='Haxelib'), Row(platform='CPAN'), Row(platform='Nimble'), Row(platform='Shards'), Row(platform='PlatformIO'), Row(platform='Go'), Row(platform='Alcatraz'), Row(platform='Rubygems'), Row(platform='Dub'), Row(platform='Hackage'), Row(platform='Carthage'), Row(platform='Cargo')]

### What package has the highest SourceRank?

SourceRank is a Libraries.io measure that combines a package's popularity with how well maintained it is, along with a few other factors.

First, use the `withColumn` method to cast SourceRank to double.

Hint: `.cast("double")` can be called on a column

In [6]:
data = data.withColumn('SourceRank', data.SourceRank.cast("double"))
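
As a rough illustration of what the cast does to values that are not numbers, here is a plain-Python sketch that needs no Spark session. It assumes the lenient (non-ANSI) behavior in which an unparseable string becomes a missing value instead of raising an error; the helper `to_double` and the sample values are hypothetical and not taken from the dataset.

```python
from typing import Optional


def to_double(value: Optional[str]) -> Optional[float]:
    """Mimic a lenient cast: return None when the text is not a number."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


# Hypothetical raw values, as they might look if the column had been read as text.
raw_source_rank = ["2", "31", None, "not-a-number", "7"]

casted = [to_double(v) for v in raw_source_rank]

# Aggregates such as max skip missing values, so drop them before comparing.
highest = max(v for v in casted if v is not None)

print(casted)
print(highest)
```

If this assumption holds for your Spark configuration, rows that fail the cast simply drop out of the later `max` and `filter` steps.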

Use the `agg` method to find the highest SourceRank, then call `collect` and index into the returned row to extract the value.

In [7]:
data.agg({'SourceRank':'max'}).collect()[0]['max(SourceRank)']

31.0

Use `filter` to select the row of the DataFrame that has the highest SourceRank.

In [8]:
maxSR = data.filter(data.SourceRank == data.agg({'SourceRank':'max'}).collect()[0]['max(SourceRank)'])

Use `show` to display the results.

In [9]:
maxSR.show()

+-------+--------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+--------------+----------+--------------------------------+---------------------+------------------+------------------------+----------+------+---------------------+----------------------------+-------------+
|     ID|Platform|               Name|   Created Timestamp|   Updated Timestamp|         Description|            Keywords|        Homepage URL|Licenses|      Repository URL|Versions Count|SourceRank|Latest Release Publish Timestamp|Latest Release Number|Package Manager ID|Dependent Projects Count|  Language|Status|Last synced Timestamp|Dependent Repositories Count|Repository ID|
+-------+--------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+--------------+----------+--------------------------------+----
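
One detail worth keeping in mind: filtering on equality with the maximum keeps every row that reaches it, so ties produce more than one row. The plain-Python sketch below mirrors the two steps (find the maximum, then filter) on a few hypothetical rows; the names and ranks are made up for illustration.

```python
# Hypothetical rows standing in for the DataFrame; None marks a missing rank.
rows = [
    {"Name": "pkg-a", "SourceRank": 12.0},
    {"Name": "pkg-b", "SourceRank": 31.0},
    {"Name": "pkg-c", "SourceRank": 31.0},
    {"Name": "pkg-d", "SourceRank": None},
]

# Step 1: the aggregate. Missing values are ignored, as with agg/max.
highest = max(r["SourceRank"] for r in rows if r["SourceRank"] is not None)

# Step 2: the filter. Every row equal to the maximum is kept.
max_rows = [r for r in rows if r["SourceRank"] == highest]

print([r["Name"] for r in max_rows])
```

Here both `pkg-b` and `pkg-c` survive the filter, which is the same reason `maxSR` may contain several packages.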

### What is the most frequent dependency per package manager?

To answer this question, let's break it down into smaller parts. 

First, we need to cast the "Dependent Repositories Count" column to "double".

In [10]:
data = data.withColumn('Dependent Repositories Count', data["Dependent Repositories Count"].cast("double"))

Next, use `groupBy` and `max` to find the highest "Dependent Repositories Count" per package manager.

In [11]:
max_deps = data.groupBy('platform').max('Dependent Repositories Count')

Use `show` to look at this data.

In [12]:
max_deps.show()

+---------+---------------------------------+
| platform|max(Dependent Repositories Count)|
+---------+---------------------------------+
|    NuGet|                          63559.0|
|    Emacs|                              0.0|
|  Sublime|                              0.0|
|   Meteor|                             72.0|
|     CRAN|                           6037.0|
|      Hex|                           5950.0|
|      Pub|                           2972.0|
|  Clojars|                          26779.0|
|    Maven|                          43397.0|
|CocoaPods|                          15781.0|
|      NPM|                         380978.0|
|    Bower|                         106457.0|
|Packagist|                         122137.0|
| Homebrew|                              0.0|
|     Atom|                              0.0|
|      Elm|                           4655.0|
|Wordpress|                              0.0|
|    Julia|                            540.0|
|  SwiftPM|                       

Notice how the column name includes the name of the aggregate function, in this case `max`. We need to remove this, so run the code below to rename the column.

In [13]:
max_deps = max_deps.withColumnRenamed("max(Dependent Repositories Count)", "Dependent Repositories Count")

Use `filter` to remove any package managers from `max_deps` that have a max of 0.

In [14]:
max_deps = max_deps.filter(max_deps['Dependent Repositories Count'] != 0) 

Next, use `join` to select the rows of `data` that match `max_deps`. Don't forget to specify the `on` keyword.

In [15]:
max_deps_info = data.join(max_deps, on=['Dependent Repositories Count','platform'])

Use `select` to only retain the package manager, the package name, and the "Dependent Repositories Count". Call `show` after to see the results. Pass a number to `show` so that all the results are shown.

In [16]:
# 50 is an arbitrary choice; any number at least as large as the row count shows every row
max_deps_info.select('platform','name','Dependent Repositories Count').show(50)

+---------+--------------------+----------------------------+
| platform|                name|Dependent Repositories Count|
+---------+--------------------+----------------------------+
| Rubygems|                rake|                    481616.0|
|    NuGet|     Newtonsoft.Json|                     63559.0|
|       Go|golang.org/x/net/...|                      5379.0|
|      Dub|              vibe-d|                       203.0|
|      Elm|       elm-lang/core|                      4655.0|
|      Hex|              poison|                      5950.0|
|   Shards|               radix|                        66.0|
|CocoaPods|        AFNetworking|                     15781.0|
|    Bower|              jQuery|                    106457.0|
|Packagist|     phpunit/phpunit|                    122137.0|
|    Julia|              Compat|                       540.0|
|    Cargo|                libc|                      6922.0|
|     Pypi|            requests|                     71110.0|
| Cartha
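
The steps in this section (group and take the maximum, drop zero maxima, join back on both columns) can be sketched in plain Python to make the logic easier to follow. The package names and counts below are hypothetical, and the dictionary stands in for the `max_deps` DataFrame.

```python
# Hypothetical (platform, name, dependent repositories count) rows.
packages = [
    ("NPM", "pkg-a", 10.0),
    ("NPM", "pkg-b", 50.0),
    ("Pypi", "pkg-c", 40.0),
    ("Pypi", "pkg-d", 30.0),
    ("Emacs", "pkg-e", 0.0),
]

# groupBy('platform').max(...): highest count seen per platform.
max_deps = {}
for platform, _, count in packages:
    max_deps[platform] = max(count, max_deps.get(platform, count))

# filter(... != 0): drop platforms whose best count is zero.
max_deps = {p: c for p, c in max_deps.items() if c != 0}

# join on both columns: keep rows whose (platform, count) pair matches a maximum.
max_deps_info = [(p, n, c) for p, n, c in packages if max_deps.get(p) == c]

print(max_deps_info)
```

Joining on the count alone would also match packages from other platforms that happen to share the same number, which is why both columns appear in `on`.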

### Who is the most prolific owner of packages per package manager?

For this next question, we need to consult the second file, which contains detailed information about where each package is developed and by whom. Reading in the data works much as it did above.

In [17]:
repos_data = spark.read.csv("hdfs:///data/repositories-1.0.0-2017-06-15.csv",header=True, inferSchema=True)

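To extract the owner into its own column, we are going to use the function `regexp_extract`. Note that passing group index 0, as the code below does, returns the whole match including the trailing slash, which is why the owners in the results end in "/". Index 1 would return only the captured owner name.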

In [18]:
from pyspark.sql.functions import regexp_extract

In [19]:
repos_data = repos_data.withColumn('owner', regexp_extract(repos_data["Name With Owner"],"(.*)/",0))
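
To see what the group index changes, here is a small sketch using Python's `re` module as a stand-in for `regexp_extract`. The helper `extract` and the sample repository name are hypothetical; the helper returns an empty string when nothing matches, which is how `regexp_extract` is documented to behave.

```python
import re


def extract(text: str, pattern: str, idx: int) -> str:
    """Return capture group `idx` of the first match, or "" if there is no match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else ""


name_with_owner = "octo-org/hello-world"  # hypothetical "Name With Owner" value

whole_match = extract(name_with_owner, r"(.*)/", 0)  # group 0: the entire match
owner_only = extract(name_with_owner, r"(.*)/", 1)   # group 1: only what (.*) captured

print(whole_match)
print(owner_only)
```

Group 0 yields `octo-org/` while group 1 yields `octo-org`. Either works for counting packages per owner, since the trailing slash is applied consistently.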

Next, use `join` to create a joined DataFrame. The columns to join on are "ID" from `repos_data`, and "Repository ID" from `data`.

In [20]:
joined = repos_data.join(data, on =  data['Repository ID'] == repos_data['ID'])

Now that we have the package owner information joined with the package, use `groupby` and `count` to see how many packages each owner has per package manager.

The relevant columns are "Platform" and "owner".

Hint: `groupBy` can take more than one column to group on

In [21]:
counts = joined.groupby('Platform','owner').count()

Next we need to determine the max count per platform. Do this using `groupby` and `max`.

In [22]:
max_count = counts.groupby('Platform').max('count')

Like before, we want to remove the name of the function from the column name, so run the code below to rename it.

In [23]:
max_count = max_count.withColumnRenamed("max(count)","count")

Now, the owner with the most packages per package manager can be found by using `join` on `counts` and `max_count`.

Call `show` after joining to see the results. 

In [24]:
max_count.join(counts,on=['Platform','count']).show()

+---------+-----+-----------------+
| Platform|count|            owner|
+---------+-----+-----------------+
|    Cargo|  428|        retep998/|
|   Meteor|  110|          rzymek/|
|      Hex|   35|  nerves-project/|
|    NuGet|  360|             aws/|
| Homebrew|   19|          google/|
|    Julia|   28|      JuliaStats/|
|      Dub|   25|     DerelictOrg/|
|     CRAN|  117|        ropensci/|
|    Emacs|   45|          syohex/|
|   Shards|    7|      ysbaddaden/|
|      Elm|   51|   elm-community/|
|CocoaPods|   50|    nicklockwood/|
|CocoaPods|   50|       hyperoslo/|
|  Haxelib|   35|      haxe-react/|
|    Maven|  979|        kiegroup/|
|      Pub|  245|       dart-lang/|
|     Pypi|  357|      collective/|
| Rubygems|  301|      jrobertson/|
|Packagist|  212|thecodingmachine/|
|      Jam|   56|        aureooms/|
+---------+-----+-----------------+
only showing top 20 rows
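
CocoaPods appears twice in the output because two owners tie for the maximum. The plain-Python sketch below walks through the same count, max, and join steps on hypothetical (platform, owner) pairs to show where the tie comes from.

```python
from collections import Counter

# Hypothetical (Platform, owner) pairs, one per package in the joined data.
joined_pairs = [
    ("CocoaPods", "owner-a/"), ("CocoaPods", "owner-a/"),
    ("CocoaPods", "owner-b/"), ("CocoaPods", "owner-b/"),
    ("CocoaPods", "owner-c/"),
    ("Cargo", "owner-d/"), ("Cargo", "owner-d/"), ("Cargo", "owner-d/"),
    ("Cargo", "owner-e/"),
]

# groupby('Platform', 'owner').count()
counts = Counter(joined_pairs)

# groupby('Platform').max('count')
max_count = {}
for (platform, _), n in counts.items():
    max_count[platform] = max(n, max_count.get(platform, 0))

# join on ['Platform', 'count']: every owner that reaches the platform maximum.
top_owners = sorted((p, n, o) for (p, o), n in counts.items() if max_count[p] == n)

print(top_owners)
```

Both `owner-a/` and `owner-b/` are returned for CocoaPods, just as two owners share the count of 50 in the results above.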

### What is the correlation between number of GitHub stars and number of times a package is listed as a dependency?

Once again we will be working with the joined DataFrame we made in the previous section. 

First convert the "Stars Count" column to a double using the `withColumn` and `cast` functions.

In [25]:
joined = joined.withColumn('Stars Count', joined['Stars Count'].cast("double"))

Now call `corr` on the DataFrame, passing it the correct column names.

In [26]:
joined.corr('Stars Count', 'Dependent Repositories Count')

0.0662784659838636
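
`corr` computes the Pearson correlation coefficient, so a value near 0.07 suggests only a very weak linear relationship between the two columns. For intuition, the sketch below computes the same statistic by hand on small made-up lists; the function name `pearson` and the sample numbers are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear toy data: one rising together, one moving in opposite directions.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson([1, 2, 3], [3, 2, 1])

print(rising)
print(falling)
```

Values close to 1 or -1 indicate a strong linear relationship, while values close to 0, like the one measured here, indicate that stars alone say little about how often a package is depended on.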

### Which package names are found in both NPM and PyPI?

For the final question, we are going to use set operations.

First we need to find the packages in PyPI and the packages in NPM.

Use `filter` to find all the elements of the DataFrame whose "Platform" is equal to "Pypi".

In [27]:
pypi = data.filter(data['Platform'] == "Pypi")

Use `filter` to find all the elements of the DataFrame whose "Platform" is equal to "NPM".

In [28]:
npm = data.filter(data['Platform'] == "NPM")

Now use `select` to get only the names of the packages in PyPI.

In [29]:
pypi_names = pypi.select("name")

Now use `select` to get only the names of the packages in NPM.

In [30]:
npm_names = npm.select("name")

Use `intersect` to get the names that appear in both.

In [31]:
intersection = npm_names.intersect(pypi_names)

View the names that appear in both by calling `show`.

In [32]:
intersection.show()

+--------------------+
|                name|
+--------------------+
|               anime|
|           arguments|
|             bitlist|
|           bookshelf|
|                 bsw|
|           carbonate|
|                clog|
|cloudfront-log-pa...|
|            collectd|
|               crest|
|                 dbt|
|                earl|
|                elsa|
|                foxy|
|                 gir|
|                guts|
|            habanero|
|              honcho|
|                hope|
|                 hud|
+--------------------+
only showing top 20 rows
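
`intersect` behaves like a set intersection: the result contains each shared name once, even if it appears several times on either side. A plain-Python sketch with hypothetical name lists shows the same idea.

```python
# Hypothetical package names; duplicates are included on purpose.
npm_names = ["alpha", "beta", "beta", "gamma"]
pypi_names = ["beta", "gamma", "delta", "gamma"]

# Converting to sets removes duplicates; & keeps names present in both.
intersection = sorted(set(npm_names) & set(pypi_names))

print(intersection)
```

Only `beta` and `gamma` remain, each listed once. Sorting is used here just to make the printed order predictable; the DataFrame result has no guaranteed order.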