# SparkSQL Lab

In this lab, we will work with data from [Libraries.io](http://Libraries.io), a package manager aggregator. The data consist of two files: one describes packages and their package managers, and the other describes the code repositories in which those packages are developed.

In [None]:
data = spark.read.csv("hdfs:///user/bryan/data/projects-1.0.0-2017-06-15.csv",header=True, inferSchema=True, mode="DROPMALFORMED")

In [None]:
data = data.withColumn('SourceRank', data.SourceRank.cast("double"))
data = data.withColumn('Dependent Repositories Count', data["Dependent Repositories Count"].cast("double"))

Register the DataFrame as a temporary view named `data` using `createTempView` so that it can be queried with SQL.

The rest of the lab consists of answering questions about the data. 

### How many packages are accounted for in this dataset?

Hint: Use `count`
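As a rough illustration of the `COUNT(*)` pattern, here is a sketch on a toy table. Python's built-in `sqlite3` stands in for Spark so the snippet runs without a cluster; the table name `pkgs` and its rows are made up. In the lab, a query of the same shape would be passed to `spark.sql` against the registered view.

```python
import sqlite3

# Toy stand-in for the lab's view; `pkgs` and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (platform TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?)",
    [("NPM", "left-pad"), ("Pypi", "requests"), ("NPM", "express")],
)

# COUNT(*) collapses the whole table into one row holding the number of rows.
n_rows = conn.execute("SELECT COUNT(*) FROM pkgs").fetchone()[0]
print(n_rows)
```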

In [None]:
spark.sql()

### What package managers are included in the data?

The package manager is stored in the column named 'platform'.

Hint: `DISTINCT` can be used right in a `SELECT` statement
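A minimal sketch of `SELECT DISTINCT` on toy data, assuming a hypothetical table `pkgs`; `sqlite3` is used here only so the example is runnable anywhere, and the same SQL shape is what `spark.sql` would receive.

```python
import sqlite3

# Hypothetical toy table: two NPM rows and one Pypi row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (platform TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?)",
    [("NPM", "left-pad"), ("Pypi", "requests"), ("NPM", "express")],
)

# DISTINCT removes duplicate values, so each platform appears once.
platforms = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT platform FROM pkgs")
)
print(platforms)
```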

In [None]:
managers = 

View the results using `collect`

In [None]:
managers.collect()

### What package has the highest SourceRank?

SourceRank is a Libraries.io measure that combines a package's popularity, how well maintained it is, and a few other factors.

Use `max` to find the largest SourceRank.

Use a `WHERE` clause to select the row of the DataFrame that has the highest SourceRank.
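One way to combine the two steps is a scalar subquery inside the `WHERE` clause. The sketch below shows the idea on a made-up table `pkgs` with a hypothetical `score` column, using `sqlite3` as a stand-in for Spark.

```python
import sqlite3

# Hypothetical toy table; `score` plays the role of a ranking column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?)",
    [("a", 3.0), ("b", 7.0), ("c", 5.0)],
)

# The inner query yields one value (the maximum score);
# the outer WHERE keeps only the rows whose score equals it.
top = conn.execute(
    "SELECT name, score FROM pkgs "
    "WHERE score = (SELECT MAX(score) FROM pkgs)"
).fetchall()
print(top)
```

If several rows tie for the maximum, this pattern returns all of them.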

In [None]:
maxSR = 

Use `show` to display the results 

In [None]:
maxSR.show()

### What is the most frequent dependency per package manager?

To answer this question, let's break it down into smaller parts. 

First, use `groupBy` and `max` to find the highest "Dependent Repositories Count" per package manager.

In [None]:
max_deps = 

Use `show` to look at this data

In [None]:
max_deps.show()

Notice how the column name includes the name of the aggregate function, in this case `max`. Re-write the query above using `AS` so this doesn't happen
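The effect of `AS` on an aggregate column can be sketched on toy data. The table `pkgs`, the column `deps`, and the alias `max_deps` are all hypothetical, and `sqlite3` stands in for Spark so the snippet is self-contained.

```python
import sqlite3

# Hypothetical toy table with a dependency count per package.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (platform TEXT, name TEXT, deps INTEGER)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?, ?)",
    [("NPM", "a", 10), ("NPM", "b", 4), ("Pypi", "c", 7)],
)

# AS gives the aggregate a plain name instead of one derived from the expression.
cur = conn.execute(
    "SELECT platform, MAX(deps) AS max_deps "
    "FROM pkgs GROUP BY platform ORDER BY platform"
)
columns = [d[0] for d in cur.description]  # column names of the result
rows = cur.fetchall()
print(columns, rows)
```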

In [None]:
max_deps = 

Use the above query as a subquery, and filter out any rows that have a "Dependent Repositories Count" of 0
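A sketch of filtering on an aggregated column by wrapping the aggregation in a subquery; the names `pkgs`, `deps`, and `max_deps` are made up, and `sqlite3` is only a runnable stand-in for Spark.

```python
import sqlite3

# Hypothetical toy table; every Pypi row has zero dependents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (platform TEXT, name TEXT, deps INTEGER)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?, ?)",
    [("NPM", "a", 10), ("NPM", "b", 4), ("Pypi", "c", 0), ("Pypi", "d", 0)],
)

# The inner query computes one maximum per platform;
# the outer WHERE then drops platforms whose maximum is 0.
rows = conn.execute(
    "SELECT platform, max_deps FROM "
    "(SELECT platform, MAX(deps) AS max_deps FROM pkgs GROUP BY platform) "
    "WHERE max_deps != 0"
).fetchall()
print(rows)
```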

In [None]:
max_deps = 

Check your work using `show`

In [None]:
max_deps.show()

Next, use `join` to join `data` with the above query (we've done this for you).

In [None]:
max_deps_info = spark.sql("""
    SELECT data.platform, name, data.`Dependent Repositories Count` FROM data 
    JOIN (SELECT platform,`Dependent Repositories Count` FROM 
            (SELECT platform, max(`Dependent Repositories Count`) AS `Dependent Repositories Count` 
                    FROM data GROUP BY platform) WHERE `Dependent Repositories Count` != 0) AS X 
    ON X.`Dependent Repositories Count` == data.`Dependent Repositories Count` AND X.platform == data.platform
""") 

Call `show` on `max_deps_info` to see the results. By default `show` prints only the first 20 rows, so pass it a larger number to display every row.

### Who is the most prolific owner of packages per package manager?

For this question, we need the second file, which contains detailed information about where each package is developed and by whom. Reading it in is very similar to what was done above.

In [None]:
repos_data = spark.read.csv("hdfs:///user/bryan/data/repositories-1.0.0-2017-06-15.csv",header=True, inferSchema=True)

In [None]:
repos_data.createTempView("repos_data")

To extract the owner into its own column, we are going to use the function `regexp_extract`

In [None]:
from pyspark.sql.functions import regexp_extract

In [None]:
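# Group index 1 returns only the captured owner; index 0 would return the whole match, including the trailing "/"
repos_data = spark.sql("""SELECT *, regexp_extract(`Name With Owner`, '(.*)/', 1) AS owner
                          FROM repos_data """)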
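The last argument of `regexp_extract` selects which part of the match is returned: 0 is the entire match, and 1 is the first parenthesised group. The sketch below uses Python's `re` module as a stand-in for Spark's regex engine, with a hypothetical sample value, to show what each index yields for the pattern `(.*)/`.

```python
import re

pattern = re.compile(r"(.*)/")
sample = "rails/rails"  # hypothetical value of the `Name With Owner` column

match = pattern.search(sample)
whole_match = match.group(0)  # everything the pattern matched, slash included
first_group = match.group(1)  # only the text captured by the parentheses
print(repr(whole_match), repr(first_group))
```

If the owner name is meant to be free of the trailing slash, group 1 is the one to use.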

Next, use `join` to create a joined DataFrame. The columns to join on are "ID" from the subquery above, and "Repository ID" from `data`. It will be helpful to use a subquery and the keyword `AS`
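The general shape of a join against an aliased subquery can be sketched as follows. The tables `pkgs` and `repos`, their columns, and the aliases `P` and `R` are hypothetical, and `sqlite3` stands in for Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy tables: packages point at repositories by id.
conn.execute("CREATE TABLE repos (id INTEGER, owner TEXT)")
conn.execute("CREATE TABLE pkgs (name TEXT, repo_id INTEGER)")
conn.executemany("INSERT INTO repos VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?)", [("x", 1), ("y", 2), ("z", 99)]
)

# AS names the subquery so its columns can be referenced in the ON clause.
joined_rows = conn.execute(
    "SELECT P.name, R.owner "
    "FROM pkgs AS P "
    "JOIN (SELECT id, owner FROM repos) AS R "
    "ON P.repo_id = R.id "
    "ORDER BY P.name"
).fetchall()
print(joined_rows)
```

Package `z` has no matching repository, so the inner join drops it.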

In [None]:
joined = 

Put this table back into the SQL catalog under the name `joined` by using `createTempView`.

Now that we have the package owner information joined with the package, use `groupBy` and `count` to see how many packages each owner has per package manager.

The relevant columns are "Platform" and "owner"

Hint: `groupBy` can take more than one column to group on. Use `AS` to give the aggregate function a good name
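A small sketch of grouping on two columns with an aliased count; the table `joined_toy` and the alias `n_packages` are made up for illustration, and `sqlite3` is used only so the code runs without Spark.

```python
import sqlite3

# Hypothetical toy table: one row per (platform, owner) package.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joined_toy (platform TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO joined_toy VALUES (?, ?)",
    [("NPM", "alice"), ("NPM", "alice"), ("NPM", "bob"), ("Pypi", "alice")],
)

# Each distinct (platform, owner) pair becomes one group with its own count.
counts = conn.execute(
    "SELECT platform, owner, COUNT(*) AS n_packages "
    "FROM joined_toy GROUP BY platform, owner "
    "ORDER BY platform, owner"
).fetchall()
print(counts)
```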

In [None]:
counts = 

Next we need to determine the max count per platform. Do this using `groupBy` and `max`, using the previous query as a subquery.

In [None]:
max_count =

Like before, we want to remove the name of the function from the column name, so re-write the query above to use `AS`

In [None]:
max_count = 

Now the owner with the most packages per package manager can be found by using `join` on `counts` and `max_count`

Call `show` after joining to see the results. 

In [None]:
spark.sql("""
    SELECT A.platform, A.owner, A.count
    FROM (SELECT platform, owner, count(*) AS count
          FROM joined GROUP BY platform, owner) AS A
    JOIN (SELECT platform, max(count) AS count
          FROM (SELECT platform, owner, count(*) AS count
                FROM joined GROUP BY platform, owner)
          GROUP BY platform) AS B
    ON A.platform = B.platform AND A.count = B.count
""").show()

### What is the correlation between the number of GitHub stars and the number of times a package is listed as a dependency?

Call `corr` on the table, passing it the correct column names
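Spark's `corr` computes the Pearson correlation coefficient. To make clear what that number measures, here is a plain-Python sketch of the formula on made-up values; the helper `pearson` and the sample lists are hypothetical and are not part of the lab data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values in which the second column is exactly twice the first.
stars = [1.0, 2.0, 3.0, 4.0]
dependents = [2.0, 4.0, 6.0, 8.0]
r = pearson(stars, dependents)
print(r)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.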

### Which package names are found in both NPM and Pypi?

For the final question, we are going to use set operations.

First we need to find the packages in Pypi and the packages in NPM.

Use `WHERE` to find all the rows of the DataFrame whose "Platform" is equal to "Pypi", and `SELECT` to return only the names.

Use `WHERE` to find all the rows of the DataFrame whose "Platform" is equal to "NPM", and `SELECT` to return only the names.

Use `INTERSECT` to get the names that appear in both
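`INTERSECT` sits between two complete `SELECT` statements and keeps only the rows that both return. A minimal sketch on made-up rows follows, with the hypothetical table `pkgs` and `sqlite3` standing in for Spark.

```python
import sqlite3

# Hypothetical toy table: "redis" exists on both platforms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pkgs (platform TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO pkgs VALUES (?, ?)",
    [
        ("NPM", "redis"),
        ("Pypi", "redis"),
        ("NPM", "express"),
        ("Pypi", "requests"),
    ],
)

# Each SELECT yields a set of names; INTERSECT keeps names present in both.
common = conn.execute(
    "SELECT name FROM pkgs WHERE platform = 'Pypi' "
    "INTERSECT "
    "SELECT name FROM pkgs WHERE platform = 'NPM'"
).fetchall()
print(common)
```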

In [None]:
intersection = 

View the names that appear in both by calling `show`

In [None]:
intersection.show()