# DataFrame Lab

In this lab, we will be working with data from [Libraries.io](http://Libraries.io), a package manager aggregator. Our data consist of two files, one detailing packages and package managers, the other detailing the code repositories the packages are developed in. 

In [None]:
data = spark.read.csv("hdfs:///data/projects-1.0.0-2017-06-15.csv",header=True, inferSchema=True, mode="DROPMALFORMED")

Verify that the data was read in correctly using `show`.

The rest of the lab consists of answering questions about the data. 

### How many packages are accounted for in this dataset?

Hint: Use `count`

### What package managers are included in the data?

The package manager is in the column named "Platform".

Hint: Extract the package manager names using `select` and then use `distinct`

In [None]:
managers = 

View the results using `collect`

### What package has the highest SourceRank?

SourceRank is a Libraries.io measure that combines a package's popularity with how well maintained it is, along with a few other factors.

First, use the `withColumn` method to convert SourceRank to a numeric type.

Hint: `.cast("double")` can be called on a column

In [None]:
data = 

Use the `agg` method to find the highest SourceRank. Call `show` afterward to see the results.

Use `filter` to select the row of the DataFrame that has the highest SourceRank.

In [None]:
maxSR = 

Use `show` to display the results.

### What is the most frequent dependency per package manager?

To answer this question, let's break it down into smaller parts. 

First, we need to cast the "Dependent Repositories Count" column to "double".

In [None]:
data = 

Next, use `groupBy` and `max` to find the highest "Dependent Repositories Count" per package manager.

In [None]:
max_deps = 

Use `show` to look at this data.

Notice how the column name includes the name of the aggregate function, in this case `max`. We need to remove this, so rename the column in the cell below (the `withColumnRenamed` method is one way to do so).

In [None]:
max_deps = 

Use `filter` to remove any package managers from `max_deps` that have a max of 0.

In [None]:
max_deps =  

Next, use `join` to select the rows of `data` that match `max_deps`. Don't forget to specify the `on` keyword.

In [None]:
max_deps_info = 

Use `select` to only retain the package manager, the package name, and the "Dependent Repositories Count". Call `show` after to see the results. Pass a number to `show` so that all the results are shown.

### Who is the most prolific owner of packages per package manager?

For this next question, we need to consult the second file, which contains detailed information about where each package is developed and by whom. Reading in the data is very similar to what was done above.

In [None]:
repos_data = spark.read.csv("hdfs:///data/repositories-1.0.0-2017-06-15.csv",header=True, inferSchema=True)

To extract the owner into its own column, we are going to use the function `regexp_extract`.

In [None]:
from pyspark.sql.functions import regexp_extract

In [None]:
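# Group index 1 returns only the captured owner; index 0 would return the whole match, including the trailing "/".
repos_data = repos_data.withColumn('owner', regexp_extract(repos_data["Name With Owner"],"(.*)/",1))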

Next, use `join` to create a joined DataFrame. The columns to join on are "ID" from `repos_data`, and "Repository ID" from `data`.

In [None]:
joined =

Now that we have the package owner information joined with the package, use `groupBy` and `count` to see how many packages each owner has per package manager.

The relevant columns are "Platform" and "owner".

Hint: `groupBy` can take more than one column to group on

In [None]:
counts = 

Next, we need to determine the max count per platform. Do this using `groupBy` and `max`.

In [None]:
max_count = 

Like before, we want to remove the name of the function from the column name, so rename the column in the cell below.

In [None]:
max_count = 

Now, the owner with the most packages per package manager can be found by using `join` on `counts` and `max_count`.

Call `show` after joining to see the results. 

### What is the correlation between number of GitHub stars and number of times a package is listed as a dependency?

Once again we will be working with the joined DataFrame we made in the previous section. 

First convert the "Stars Count" column to a double using the `withColumn` and `cast` functions.

In [None]:
joined = 

Now call `corr` on the DataFrame, passing it the correct column names.

### Which package names are found in both npm and PyPI?

For the final question, we are going to use set operations.

First we need to find the packages in PyPI and the packages in npm.

Use `filter` to find all the elements of the DataFrame whose "Platform" is equal to "Pypi".

In [None]:
pypi = 

Use `filter` to find all the elements of the DataFrame whose "Platform" is equal to "NPM".

In [None]:
npm = 

Now use `select` to get only the names of the packages in PyPI.

In [None]:
pypi_names =

Now use `select` to get only the names of the packages in NPM.

In [None]:
npm_names = 

Use `intersect` to get the names that appear in both.

In [None]:
intersection = 

View the names that appear in both by calling `show`.