# RDD Lab

In this lab, we will be working with data from [Libraries.io](http://Libraries.io), a package manager aggregator. Our data consist of two files, one detailing packages and package managers, the other detailing the code repositories the packages are developed in. 

The data is stored as CSV files, so to get started, import the needed Python packages and read in the data.

In [None]:
import csv
from io import StringIO

In [None]:
text = spark.sparkContext.textFile("hdfs:///data/projects-1.0.0-2017-06-15.csv")

Verify the data was read in correctly using `take`.

You may have noticed that what we read in is a list of strings. Furthermore, the first string appears to be the column headers.

In the next cell we are going to split each string into a tuple. We have done this step for you, but make sure you can understand what the code below is doing.
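
As a minimal standalone sketch of the parsing expression used in the next cell, the snippet below parses one made-up line (not taken from the dataset) and compares the result with a naive `split`. It assumes nothing beyond the standard library: `csv.reader` expects a file-like object, which is why the string is wrapped in `StringIO`, and `next` pulls the single parsed row out of the reader.

```python
import csv
from io import StringIO

# A made-up CSV line; the last field is quoted and contains a comma.
line = '42,NPM,left-pad,"pads, on the left"'

# Splitting on commas breaks the quoted field into two pieces.
naive = tuple(line.split(","))

# csv.reader honours the quotes, so the field stays whole.
parsed = tuple(next(csv.reader(StringIO(line))))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields five fields while the CSV parser yields four, which is why the lab parses each line with `csv.reader` rather than `str.split`.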

In [None]:
data = text.map(lambda x: tuple(next(csv.reader(StringIO(x)))))

Now that we have split each line into tuples, we need to remove the first row, which isn't actually part of the data. 

To do this, use the `filter` function and write a lambda that checks if the first element of the tuple is **not** "ID"
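
The same idea on a plain Python list, as a rough local analogue: the rows and column names below are invented for illustration, and on an RDD the lambda would be passed to the RDD's `filter` method rather than to Python's built-in `filter`.

```python
# Invented rows: a header tuple followed by two data tuples.
rows = [
    ("ID", "Platform", "Name"),
    ("1", "NPM", "left-pad"),
    ("2", "Pypi", "requests"),
]

# Keep only the rows for which the predicate returns True.
without_header = list(filter(lambda row: row[0] != "ID", rows))

print(without_header)
```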

In [None]:
data =

Once again, use `take` to look at the format the data is now in.

The rest of the lab consists of answering questions about the data. 

### How many packages are accounted for in this dataset?

Hint: Use `count`

### What package managers are included in the data?

The package manager is the second element in each tuple in the RDD, and can be accessed using `tuple[1]`.

Hint: Extract the package manager names using `map` and then use `distinct`
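
For intuition, here is a small local sketch of "extract one field, then deduplicate" using built-in `map` and a `set`. The rows are invented, and a Python `set` only approximates what `distinct` does on an RDD (both discard duplicates, and neither promises an ordering).

```python
# Invented rows of the form (ID, packageManager, packageName).
rows = [
    ("1", "NPM", "left-pad"),
    ("2", "Pypi", "requests"),
    ("3", "NPM", "express"),
]

# Pull out the field at index 1, then drop duplicates.
manager_names = set(map(lambda row: row[1], rows))

print(sorted(manager_names))
```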

In [None]:
managers = 

View the results using `collect`

### What package has the highest SourceRank?

SourceRank is a Libraries.io measure that combines popularity, how well maintained the package is, and a few other factors.

We've done this one for you to show you a unique way to use Python's built-in `max` function.

By supplying a key to the `max` function, we can compare two tuples using a specific element of each tuple. In this case we are comparing SourceRank, which we can access at index `11`.
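
To see why both the `key` and the `int` conversion matter, consider this small sketch on invented `(name, rank)` tuples, with `functools.reduce` standing in for the RDD's `reduce`. Without a key, tuples are compared element by element; with a key that returns the raw string, `"9"` sorts above `"10"`.

```python
from functools import reduce

# Invented (name, rank) tuples; ranks are strings, as they are after CSV parsing.
packages = [("zeta", "9"), ("alpha", "10"), ("mid", "7")]

# No key: tuples compare element by element, so the name decides.
default_max = max(packages)

# String key: "9" is greater than "10" in lexicographic order.
string_key_max = max(packages, key=lambda tup: tup[1])

# Integer key: compares the ranks numerically.
int_key_max = max(packages, key=lambda tup: int(tup[1]))

# The same pairwise comparison, folded over the list as a reduce would do.
reduced = reduce(lambda x, y: max(x, y, key=lambda tup: int(tup[1])), packages)

print(default_max, string_key_max, int_key_max, reduced)
```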

In [None]:
data.reduce(lambda x,y: max(x,y,key=lambda tup: int(tup[11])))

### What is the most frequent dependency per package manager?

To answer this question, let's break it down into smaller parts. 

First it is a good idea to change the RDD into a Pair RDD, using the package manager as the key. We have done this step for you.

In [None]:
package_manager_keys = data.map(lambda x: (x[1],x))

Next, use `reduceByKey` to find the most frequent dependency in each package manager.

Use Python's built-in `max` function, similarly to what was done above. This time, compare on the tuple element at index `19`.
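
If the behaviour of `reduceByKey` is unclear, the following local sketch may help. `reduce_by_key` is a hypothetical helper written for this illustration (it is not part of Spark), and the pairs and counts are invented. It folds together all values that share a key using a two-argument function, which is the contract `reduceByKey` follows; on a cluster the function should be associative and commutative because the order of combination is not guaranteed.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in: fold values that share a key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Invented (packageManager, (packageName, dependentCount)) pairs.
pairs = [
    ("NPM", ("left-pad", "120")),
    ("Pypi", ("requests", "900")),
    ("NPM", ("express", "4000")),
    ("Pypi", ("flask", "85")),
]

# Keep, for each key, the value with the largest numeric count.
most_used = reduce_by_key(pairs, lambda a, b: max(a, b, key=lambda v: int(v[1])))

print(most_used)
```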

In [None]:
popular_deps = 

Finally, use `map` again to select only the package manager, the package name, and the number of times it is listed as a dependency.

Remember, the RDD elements now have a form of

`(packageManager, (ID, packageManager, packageName, ...))`

The packageName can be accessed using `tupleVar[1][2]` and the number of times it is a dependency can be accessed using `tupleVar[1][19]`.


In [None]:
reformatted = 

Now print the most popular dependencies using `collect`

### Who is the most prolific owner of packages per package manager?

For this next question, we need to consult the second file, which contains detailed information about where each package is developed and by whom. Reading in the data is very similar to what was done above, so we have taken all the steps to split each line and filter the data for you.

In [None]:
repos = spark.sparkContext.textFile("hdfs:///data/repositories-1.0.0-2017-06-15.csv")

In [None]:
repos_data = repos.map(lambda x: tuple(next(csv.reader(StringIO(x),'unix')))).filter(lambda x: x[0] != "ID")

To answer this question we are going to join our two RDDs together. In order to do that, first we need to once again convert them into a Pair RDD, this time with the primary key from the repository dataset

Use `map` to make a (key,value) pair for each element in the `repos_data` RDD. The tuple index for the key is `0`

In [None]:
repos_to_join = 

Use `map` to make a (key, value) pair for each element in the `data` RDD. The tuple index for the key is `-1`

In [None]:
data_to_join = 

Use `join` to join the two RDDs together into a single RDD
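
The shape of a join result is easy to get wrong, so here is a local sketch of an inner join on two lists of `(key, value)` pairs. `inner_join` is a hypothetical helper and all values are invented. The output pairs each key with a tuple `(left_value, right_value)`, which is the same shape an RDD `join` produces, and keys present on only one side are dropped.

```python
def inner_join(left, right):
    """Hypothetical local stand-in for an inner join of (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit one (key, (left_value, right_value)) per matching combination.
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Invented data: packages keyed by repository ID, repositories keyed by their ID.
packages = [
    ("r1", ("NPM", "left-pad")),
    ("r2", ("Pypi", "requests")),
    ("r9", ("NPM", "orphan")),  # no matching repository
]
repositories = [("r1", ("alice/left-pad",)), ("r2", ("bob/requests",))]

joined_local = inner_join(packages, repositories)

for key, (package, repository) in joined_local:
    print(key, package, repository)
```

With this shape in mind, an element `x` of the joined result holds the left value at `x[1][0]` and the right value at `x[1][1]`.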

In [None]:
joined = 

Now that we have a single RDD, we can prepare to count the number of packages each owner has in each package manager. We have done this step for you. 

What the code below is doing is creating a tuple of the form

`((packageManager, repositoryOwner), 1)`

In [None]:
package_owners = joined.map(lambda x: ((x[1][0][1],x[1][1][2].split('/')[0]),1))

Use `reduceByKey` to add up the total number for each `(packageManager, repositoryOwner)` pair.
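
The counting pattern can be tried out locally first. This sketch uses invented rows and a `collections.Counter` in place of `reduceByKey`; it also shows how `split('/')[0]` isolates the owner from an `owner/repository` string.

```python
from collections import Counter

# Invented (packageManager, "owner/repository") rows.
rows = [
    ("NPM", "alice/left-pad"),
    ("NPM", "alice/right-pad"),
    ("Pypi", "bob/requests"),
]

# Build ((packageManager, owner), 1) pairs, mirroring the shape used in the lab.
pairs = [((manager, full_name.split("/")[0]), 1) for manager, full_name in rows]

# Sum the 1s per key.
totals = Counter()
for key, n in pairs:
    totals[key] += n

print(dict(totals))
```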

In [None]:
counts = 

Use `map` to reformat the data so just the package manager is the key now.

The data looked like 

`((packageManager, repositoryOwner), count)` before,

but now it should look like

`(packageManager, (repositoryOwner, count))`

In [None]:
counts_reformated =

Now that the data is in the right format, use `reduceByKey` and `max` to find the most prominent owner per package manager.

Hint: remember that the `max` function in Python can take a `key` argument

In [None]:
maxes =

Use `map` to reformat the data one more time, into a 3-tuple per element.

When you are done, your data should look like

`(packageManager, repositoryOwner, count)`

In [None]:
maxes_reformatted = 

Finally, use `collect` to display the results

### What is the correlation between number of GitHub stars and number of times a package is listed as a dependency?

Once again, we will be working with the joined RDD.
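
For reference, the quantity assembled step by step in this section is what is usually called the Pearson correlation coefficient. Writing $s_i$ and $d_i$ for the star count and dependency count of element $i$, and $\bar{s}$, $\bar{d}$ for their means (these symbols are introduced here for illustration and do not appear elsewhere in the lab):

$$
r = \frac{\sum_i (s_i - \bar{s})(d_i - \bar{d})}{\sqrt{\sum_i (s_i - \bar{s})^2}\,\sqrt{\sum_i (d_i - \bar{d})^2}}
$$

The numerator is the sum of the products of the errors, and the denominator is the product of the square roots of the two sums of squared errors.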

First we need to retrieve the two pieces of information we need from each element in the RDD.

Because some of the elements of the RDD don't have this information, we have written the function below to assign 0 to missing values

In [None]:
def turn_to_int(string):
    # Missing or non-numeric values (empty strings, None) count as 0.
    try:
        return int(string)
    except (ValueError, TypeError):
        return 0

Run the code below to extract the two numbers we need

In [None]:
stars_and_deps = joined.map(lambda x: (turn_to_int(x[1][1][10]),turn_to_int(x[1][0][19])))

Now we will use `reduce` to compute the mean number of stars and the mean number of dependencies. We have done this step for you because calling `reduce` on tuples can be tricky.
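
The tricky part is that the function passed to `reduce` must return something shaped like its inputs, because its result is fed back in as the next `x`. A local sketch with `functools.reduce` on invented pairs shows the one-component-at-a-time style used in the cells that follow, plus a single-pass alternative that sums both components at once (offered as an option, not as the lab's method).

```python
from functools import reduce

# Invented (first, second) pairs.
pairs = [(1, 10), (2, 20), (3, 30)]

# Sum only the first component; the result is a 1-tuple so x[0] keeps working.
sum_first = reduce(lambda x, y: (x[0] + y[0],), pairs)[0]

# Sum only the second component; the 1 is a placeholder so x[1] keeps working.
sum_second = reduce(lambda x, y: (1, x[1] + y[1]), pairs)[1]

# Alternative: sum both components in one pass.
both_sums = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs)

mean_first = sum_first / len(pairs)
mean_second = sum_second / len(pairs)

print(sum_first, sum_second, both_sums, mean_first, mean_second)
```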

In [None]:
mean_stars = stars_and_deps.reduce(lambda x,y: (x[0] + y[0],))[0]/stars_and_deps.count()

In [None]:
mean_deps = stars_and_deps.reduce(lambda x,y: (1,x[1] + y[1]))[1]/stars_and_deps.count()

Use `map` to calculate the errors (the deviations from the mean) for each element in the RDD, by subtracting either `mean_stars` or `mean_deps` from the appropriate value in the tuple.

In [None]:
errors = 

Use `map` to square each value of the tuple

In [None]:
sq_errors =

Use `reduce` to calculate the sums of the squared errors

In [None]:
sums_of_squares = 

Get the denominator by taking the square root of each sum and multiplying them together

In [None]:
import math

In [None]:
denominator =

Next use `map` to multiply together the error for the stars and the error for the dependencies for each element in the RDD

In [None]:
products = 

Get the numerator of the equation by using `reduce` to sum all the products from the previous cell

In [None]:
numerator = 

Finally, get the correlation by dividing the numerator by the denominator
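
As a sanity check for the whole pipeline, the same sequence of steps can be run on a tiny local list where the answer is known in advance. `pearson` is a hypothetical helper written for this sketch, and the input pairs are invented so that the second value is exactly twice the first, which should give a correlation very close to 1.

```python
import math

def pearson(pairs):
    """Hypothetical local helper following the same steps as the lab."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    # Errors: deviation of each value from its mean.
    errs = [(a - mean_a, b - mean_b) for a, b in pairs]
    # Sums of squared errors, one per component.
    sum_sq_a = sum(ea ** 2 for ea, _ in errs)
    sum_sq_b = sum(eb ** 2 for _, eb in errs)
    denominator = math.sqrt(sum_sq_a) * math.sqrt(sum_sq_b)
    # Sum of the products of the paired errors.
    numerator = sum(ea * eb for ea, eb in errs)
    return numerator / denominator

# Invented, perfectly linear data.
linear = [(1, 2), (2, 4), (3, 6), (4, 8)]
correlation_check = pearson(linear)

print(correlation_check)
```

If the result computed on the RDD falls outside the range from -1 to 1, one of the intermediate steps is likely wrong.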

### Which package names are found in both npm and pypi?

For the final question, we are going to use set operations.

First we need to find the packages in pypi and the packages in NPM

Use `filter` to find all the elements of the RDD whose package manager (index `1`) is equal to "Pypi"

In [None]:
pypi = 

Use `filter` to find all the elements of the RDD whose package manager (index `1`) is equal to "NPM"

In [None]:
npm = 

Use `map` to get an RDD of only the names of the pypi packages. The name is at index `2` of each tuple

In [None]:
pypi_names =

Use `map` to get an RDD of only the names of the npm packages. The name is at index `2` of each tuple

In [None]:
npm_names =

Use `intersection` to get the names that appear in both
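
Python sets offer a quick local model of this operation. In the sketch below the names are invented; like the RDD `intersection`, the set version returns each shared name once even if it was duplicated in an input, and the comparison is case-sensitive.

```python
# Invented name lists; "colors" is deliberately duplicated in the first one.
pypi_like = ["requests", "colors", "colors", "flask"]
npm_like = ["colors", "express", "requests", "Flask"]

# Names present in both collections, without duplicates.
shared = set(pypi_like) & set(npm_like)

print(sorted(shared))
```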

In [None]:
intersection = 

View the names that appear in both by calling `collect`