# CMU Machine Learning with Large Datasets
## Homework 4 - Machine Learning at Scale: Part A

Before starting with this notebook, make sure you have already completed the data conversion step on AWS.

Note that we will not autograde this notebook because of its open-ended nature (although you still have to submit it). To make grading easier and to learn about your thought process, we include questions throughout the notebook that you have to answer in your writeup. We have marked the locations in the notebook corresponding to these questions with a ✰ symbol.

### 0. Start a Spark Session and Install Libraries

As a first step, you should 

- start a Spark session on your cluster, and 

- check how many executor instances you have and whether that matches your configuration
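
As a rough sketch of the second check: the helper below compares the configured executor count with the number you expect. It assumes the count is exposed under the `spark.executor.instances` key; the function name `executor_report` and the expected value `4` are hypothetical and only serve as an illustration. If the key is absent (which may happen, for example, when dynamic allocation is enabled), the helper reports that instead.

```python
def executor_report(conf_pairs, expected):
    """Compare the configured executor count with the expected one.

    conf_pairs: list of (key, value) string pairs, e.g. the result of
                spark.sparkContext.getConf().getAll()
    expected:   number of executors you asked for (hypothetical parameter)
    """
    conf = dict(conf_pairs)
    raw = conf.get("spark.executor.instances")
    if raw is None:
        return "spark.executor.instances is not set"
    actual = int(raw)
    status = "matches" if actual == expected else "differs from"
    return f"{actual} executor instance(s) configured, which {status} the expected {expected}"


# Stand-in configuration pairs, used here only so the sketch runs without a cluster.
print(executor_report([("spark.executor.instances", "4")], expected=4))

# With a live session you would pass the real configuration instead:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("hw4-part-a").getOrCreate()
# print(executor_report(spark.sparkContext.getConf().getAll(), expected=4))
```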

In [None]:
# YOUR CODE HERE

# YOUR CODE HERE

Throughout this assignment, you will be generating plots. `Matplotlib` and other useful Python libraries do not come pre-installed on the cluster. Therefore, you will have to ssh into your master node (think about why it should be the master) using your keypair created earlier and install `matplotlib`. You might have to do this later again for other libraries you use.
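
If you are unsure which libraries still need installing, a small check like the one below may help. It is only a sketch: `missing_packages` is a hypothetical helper name, and it reports on the Python environment in which the notebook code actually runs.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package can be imported.
print(missing_packages(["matplotlib"]))
```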

Run the cell below to ensure your installation was successful. If an error occurs, double-check your installation.

In [3]:
import matplotlib.pyplot as plt

### 1. Data Loading and Preparation

Earlier, we extracted the relevant features from the full raw Million Song Dataset and converted them to CSV. We now want to load the converted dataset from S3.

Use something like this: 

```
df = (
    spark.read.format("csv")
    .option("header", "false")
    .option("inferSchema", "true")
    .load("s3://<bucket_name>/<path>/<file_name>.csv")
)
```

Note that although you can load all chunks of the dataset using `*`, we recommend loading only a subset while developing, so that processing takes less time when you are just verifying your ideas.

In [None]:
# YOUR CODE HERE

# YOUR CODE HERE

Now, if we inspect the `df` we just created by running the cell below:

In [5]:
df.printSchema()

We see a few problems:

- Because the CSV files have no header row, Spark does not know the feature names and falls back to the default column names "_c0", "_c1", ...
- Although we set the `inferSchema` option to `true` when loading the data, all array-valued features were still read as plain strings.

Let's first recover all the names of the features. You could reuse the feature name array you used in your `million_song_reader.py` from the conversion step.
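
One possible way to attach the names is `DataFrame.toDF`, which renames columns by position. The sketch below wraps it with a length check; `with_feature_names` and `feature_names` are hypothetical names, and the name list is assumed to be in the same order as the columns written during the conversion step.

```python
def with_feature_names(df, names):
    """Return a copy of `df` whose columns are renamed positionally to `names`."""
    if len(names) != len(df.columns):
        raise ValueError(f"expected {len(df.columns)} names, got {len(names)}")
    # toDF takes the new column names as positional arguments.
    return df.toDF(*names)


# Usage, assuming `feature_names` is the list from your conversion script:
# df = with_feature_names(df, feature_names)
```

The length check is there because a mismatch usually means the name list and the CSV layout have drifted apart, which is easier to catch here than after the columns have been mislabeled.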

In [6]:
# YOUR CODE HERE

# YOUR CODE HERE

Now, if we run the cell below, we should see proper feature names attached to the columns.

In [7]:
df.printSchema()

Note that there are still a few features, e.g. `artist_latitude`, not being converted to the correct type. Let's do this manually and convert numeric features to `pyspark.sql.types.DoubleType` (Hint: there should be 19 of them). ✰ List the 19 numeric features in your writeup.

Don't worry about array features for now.
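
A minimal sketch of the manual conversion, assuming you have already decided which columns are numeric: `cast_to_double` is a hypothetical helper, and it uses the string alias `"double"`, which refers to the same type as `pyspark.sql.types.DoubleType`.

```python
def cast_to_double(df, names):
    """Return `df` with each column in `names` cast to Spark's double type."""
    for name in names:
        # withColumn with an existing name replaces that column in place.
        df = df.withColumn(name, df[name].cast("double"))
    return df


# Usage with an illustrative (incomplete) selection of columns:
# df = cast_to_double(df, ["artist_latitude", "year"])
```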

In [8]:
from pyspark.sql.types import DoubleType

# YOUR CODE HERE

# YOUR CODE HERE

We are all set for now. Let's run the following cell to check that everything except the arrays looks OK.

In [9]:
df.printSchema()
df.head()

For us to grade your checkpoint, run the following cell and ✰ include the output in your writeup.

Some sanity checks based on our reference solution:
- There should be 19 numeric features
- There should be around 580k data records
- `song_hotttnesss` should be a floating point number between 0 and 1, with mean around 0.36
- `artist_name` and `title` should be human-readable text, rather than undecoded bytes
- `artist_terms` should be a string literal of an array containing human-readable tags, rather than undecoded bytes
- The max of `year` should be 2011 (because MSD was published in 2011)

We will allow some wiggle room in grading because everyone might have processed the data slightly differently.
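
To spot-check the `artist_terms` condition on a single value, something like the following may be useful. It is a sketch under the assumption that your conversion step wrote the array as a Python-style list literal with commas (for example `['rock', 'indie rock']`); `looks_decoded` is a hypothetical helper name. A NumPy-style repr without commas is not handled correctly by this check.

```python
import ast


def looks_decoded(literal):
    """Return True if `literal` parses to a list or tuple containing only str items."""
    try:
        value = ast.literal_eval(literal)
    except (ValueError, SyntaxError):
        return False
    return isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value)


print(looks_decoded("['rock', 'indie rock']"))    # True
print(looks_decoded("[b'rock', b'indie rock']"))  # False: items are undecoded bytes
```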

In [11]:
double_cols = [t for t in df.dtypes if t[1]=='double']
str_cols = [t for t in df.dtypes if t[1] == 'string']
print('total feature {}, numeric feature {}, string feature {}'.format(len(df.dtypes),len(double_cols),len(str_cols)))
print('total {} records'.format(df.count()))
print('\nsample data record:')
head = df.head()
features = ['song_hotttnesss', 'artist_hotttnesss', 'artist_id', 'artist_latitude', 'artist_name',
           'title', 'danceability', 'duration', 'loudness', 'year', 'artist_terms', 'artist_terms_freq']
for f in features:
    print(f'  {f}: {head[f]}')
print()
df.select('song_hotttnesss', 'artist_hotttnesss', 'year').summary().show()

## End of Part A