# Analysing Audioscrobbler Data

In this notebook we compute two simple statistics on artists from data provided by Audioscrobbler. We work with three related data sets:

1. A list of users containing their number of plays per artist
2. A list which maps a generic artist-id to its real (band) name
3. A list which maps bad artist ids to good ones (fixing typing errors)

Then we will ask two simple questions:

1. Which artists have the most listeners (in terms of number of unique users)
2. Which artists are played most often (in terms of total play counts)

# 1 Load Data

First of all we have to load the data from S3.

## 1.1 Read User-Artist Data

First we read in the most important data set, which records how often each user played songs by a specific artist. This information is stored in the file at `s3://dimajix-training/data/audioscrobbler/user_artist_data/`. The file has the following characteristics:

* Format: CSV (kind of)
* Separator: Space (`" "`)
* Header: No
* Fields: user_id (INT), artist_id (INT), play_count (INT)

So we need to read in this file and store it in a local variable `user_artist_data`. Since the file itself does not contain a header, we need to specify the schema explicitly.

In [None]:
from pyspark.sql.types import *


schema = StructType([
        StructField("user_id", IntegerType()),
        StructField("artist_id", IntegerType()),
        StructField("play_count", IntegerType())
    ])
    
user_artist_data = spark.read \
    .option("sep", " ") \
    .option("header", False) \
    .schema(schema) \
    .csv("s3://dimajix-training/data/audioscrobbler/user_artist_data/")

### Peek inside
Let us have a look at the first 5 records.

In [None]:
user_artist_data.limit(5).toPandas()

## 1.2 Read in Artist aliases

Now we read in a file containing a mapping of bad artist IDs to good IDs. This fixes typos in the artist names and thereby enables us to merge information stored under different artist IDs that belong to the same band. This information is stored in the file at `s3://dimajix-training/data/audioscrobbler/artist_alias/`. The file has the following characteristics:

* Format: CSV (kind of)
* Separator: Tab (`"\t"`)
* Header: No
* Fields: bad_id (INT), good_id (INT)

So we need to read in this file and store it in a local variable `artist_alias`. Since we do not have any header contained in the file itself, we need to specify the schema explicitly.
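
To illustrate why an explicit schema is needed, here is a minimal plain-Python sketch (not the Spark solution) that parses header-less, tab-separated lines. The sample lines, the `fields` list and the helper `parse` are made up for illustration; the list of names and converters plays the role that the schema plays for the Spark reader.

```python
import csv
import io

# Two made-up lines in the layout described above: bad_id<TAB>good_id
raw = "101\t100\n205\t200\n"

# Field names and converters stand in for an explicit schema,
# because the data itself carries no header row
fields = [("bad_id", int), ("good_id", int)]


def parse(text, fields, sep="\t"):
    """Split each line on `sep` and convert the values according to `fields`."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    return [
        {name: convert(value) for (name, convert), value in zip(fields, row)}
        for row in reader
    ]


records = parse(raw, fields)
print(records)
```

Without the names and types supplied from outside, every value would stay an anonymous string, which is the situation the explicit schema avoids.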

In [None]:
schema = # YOUR CODE HERE

artist_alias = # YOUR CODE HERE

### Peek inside

In [None]:
# YOUR CODE HERE

## 1.3 Read in Artist names

The third file maps each artist ID to the artist's name. We use this file to display results with artist names instead of IDs. This information is stored in the file at `s3://dimajix-training/data/audioscrobbler/artist_data/`. The file has the following characteristics:

* Format: CSV (kind of)
* Separator: Tab (`"\t"`)
* Header: No
* Fields: artist_id (INT), artist_name (STRING)

So we need to read in this file and store it in a local variable `artist_data`. Since we do not have any header contained in the file itself, we need to specify the schema explicitly.

In [None]:
schema = # YOUR CODE HERE

artist_data = # YOUR CODE HERE

### Peek inside

In [None]:
# YOUR CODE HERE

# 2 Clean Data

Before continuing with the analysis, we first create a cleaned version of the `user_artist_data` DataFrame with the `artist_alias` mapping applied. This means that we need to look up each artist ID of the original data set in the alias data set and check whether there is a matching `bad_id` entry with a replacement `good_id`. The result should be stored in a variable `cleaned_user_artist_data`. This DataFrame should contain only the columns `user_id`, `artist_id` (the corrected one) and `play_count`.

Hints:

1. Join the user artist data DataFrame with the artist alias DataFrame containing fixes for some artists. Which join type is appropriate?    
2. Replace the artist-id in the user artists data with the `good_id` from the artist alias DataFrame - if a match is found on `bad_id`    
3. Finally select only the columns `user_id`, `artist_id` (the corrected one) and `play_count`
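
The hints above boil down to a left join followed by picking the replacement ID wherever one exists. The following is a minimal plain-Python sketch of those semantics on made-up toy rows, not the Spark solution itself. The names `user_artist_rows`, `alias` and `clean` are hypothetical; the dictionary lookup with a default plays the role of a left join combined with `coalesce(good_id, artist_id)`.

```python
# Toy rows standing in for user_artist_data: (user_id, artist_id, play_count)
user_artist_rows = [
    (1, 100, 5),
    (1, 200, 2),
    (2, 101, 7),  # assumption: 101 is a misspelled variant of artist 100
]

# Toy mapping standing in for artist_alias: bad_id -> good_id
alias = {101: 100}


def clean(rows, alias):
    """Replace artist_id by good_id where an alias exists, keep it otherwise."""
    # alias.get(artist, artist) returns good_id on a match and falls back
    # to the original artist_id when there is no alias entry
    return [(user, alias.get(artist, artist), plays) for user, artist, plays in rows]


cleaned_rows = clean(user_artist_rows, alias)
print(cleaned_rows)
```

Rows without an alias entry must survive unchanged, which is why an inner join would not be appropriate here: it would drop every artist that has no entry in the alias data.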

In [None]:
from pyspark.sql.functions import *


cleaned_user_artist_data = # YOUR CODE HERE
    
cleaned_user_artist_data.limit(5).toPandas()    

# 3 Question 1: Artist with most unique listeners

Which artists have the most unique listeners? For this question, it is irrelevant how often an individual user has played songs by a specific artist. It only matters how many different users have played each artist's work. Of course we do not want to see the artist IDs, but the real names as provided in the DataFrame `artist_data`!

Hints:
1. Group cleaned data by `artist_id`
2. Perform aggregation by counting unique user ids
3. Join `artist_data`
4. Look up the artist's name
5. Sort result by descending user counts
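
As a rough illustration of what these steps compute, here is a plain-Python sketch on made-up toy data, not the Spark solution. The rows, the `artist_names` dictionary and the helper `unique_listeners` are hypothetical. A set per artist ensures that a user who appears in several rows is counted only once.

```python
from collections import defaultdict

# Toy rows standing in for the cleaned data: (user_id, artist_id, play_count)
cleaned_rows = [
    (1, 100, 5),
    (2, 100, 7),
    (2, 100, 1),  # same user and artist again, must not be counted twice
    (1, 200, 2),
]

# Toy lookup standing in for artist_data: artist_id -> artist_name
artist_names = {100: "Artist A", 200: "Artist B"}


def unique_listeners(rows, names):
    """Return (artist_name, number of distinct users), largest count first."""
    users_per_artist = defaultdict(set)
    for user, artist, _plays in rows:
        users_per_artist[artist].add(user)  # group by artist, collect distinct users
    counts = [(names.get(artist), len(users)) for artist, users in users_per_artist.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


listeners = unique_listeners(cleaned_rows, artist_names)
print(listeners)
```

In Spark, the distinct count per group is typically expressed with `countDistinct` from `pyspark.sql.functions`.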

In [None]:
result = # YOUR CODE HERE

result.limit(10).toPandas()

# 4 Question 2: Artist with most Plays

Now we also take into account how often each user played the work of individual artists. That is, we also include the `play_count` column in our calculations. So which artists have the most plays in total? Of course we do not want to see the artist IDs, but the real names as provided in the DataFrame `artist_data`!

Hints:
1. Group cleaned data by `artist_id`
2. Perform aggregation by summing up play count
3. Join `artist_data`
4. Look up the artist's name
5. Sort result by descending play counts
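
The same pattern with a sum instead of a distinct count can be sketched in plain Python as follows. Again this uses made-up toy data and a hypothetical helper `total_plays`; it is an illustration of the aggregation, not the Spark solution.

```python
from collections import defaultdict

# Toy rows standing in for the cleaned data: (user_id, artist_id, play_count)
cleaned_rows = [
    (1, 100, 5),
    (2, 100, 7),
    (1, 200, 20),
]

# Toy lookup standing in for artist_data: artist_id -> artist_name
artist_names = {100: "Artist A", 200: "Artist B"}


def total_plays(rows, names):
    """Return (artist_name, total play count), largest total first."""
    plays_per_artist = defaultdict(int)
    for _user, artist, plays in rows:
        plays_per_artist[artist] += plays  # group by artist, sum play counts
    totals = [(names.get(artist), total) for artist, total in plays_per_artist.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


plays = total_plays(cleaned_rows, artist_names)
print(plays)
```

In this toy data the artist with fewer listeners ends up first because a single user played it very often, which is exactly the difference between the two questions.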


In [None]:
result = # YOUR CODE HERE

result.limit(5).toPandas()