# Data Modeling With Apache Cassandra
***A project for Udacity's Nanodegree in Data Engineering with AWS***

## The Task
***As provided by Udacity***

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.  

They'd like a data engineer to create an Apache Cassandra database which can create queries on song play data to answer the questions, and wish to bring you on the project. Your role is to create a database for this analysis. You'll be able to test your database by running queries given to you by the analytics team from Sparkify to create the results.  

Specifically, they'd like you to create tables in Apache Cassandra to run the following queries:
1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession  = 4
2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'  

Udacity splits the project into two parts:
- Part I: ETL Pipeline for Pre-Processing the Files
- Part II: Complete the Apache Cassandra Coding Portion of Your Project

For the sake of the reviewer, I stick to this division.

In [3]:
# Packages used
# Standard packages
from pathlib import Path

# Third party packages
import cassandra

# Local packages
from src.preprocessing import preprocess_data

# Auto-reload local packages
%load_ext autoreload


## Part I: ETL Pipeline for Pre-Processing the Files

We need to join the data from the daily CSV files into a single file, which we can then load into the database.  

The target structure is given by Udacity through this image:  

<img src="../images/image_event_datafile_new.jpg">

The folder `event_data` holds one CSV file per day, from `2018-11-01-events.csv` to `2018-11-30-events.csv`. Each file contains the following columns (the ones we need are marked with `*`):
- artist `*`
- auth
- firstName `*`
- gender `*`
- itemInSession `*`
- lastName `*`
- length `*`
- level `*`
- location `*`
- method
- page
- registration
- sessionId `*`
- song `*`
- status
- ts
- userId `*`

Furthermore, `artist` should not be an empty string.
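
The steps described above (join the daily files, keep the starred columns, skip rows without an artist) can be sketched with the standard library alone. This is only an illustration: `combine_event_files` is a hypothetical name, and the notebook's own `preprocess_data` helper is not shown here and may work differently.

```python
import csv
from pathlib import Path

# Columns marked with * in the list above.
COLUMNS = ["artist", "firstName", "gender", "itemInSession", "lastName",
           "length", "level", "location", "sessionId", "song", "userId"]


def combine_event_files(raw_dir, out_file, columns=COLUMNS):
    """Merge all daily event CSVs into one file.

    Keeps only `columns`, skips rows whose artist field is empty,
    and returns the number of rows written.
    """
    written = 0
    with open(out_file, "w", newline="", encoding="utf8") as out:
        # extrasaction="ignore" drops the columns we do not need.
        writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        # Sorted file names give chronological order (YYYY-MM-DD prefix).
        for path in sorted(Path(raw_dir).glob("*-events.csv")):
            with open(path, newline="", encoding="utf8") as f:
                for row in csv.DictReader(f):
                    if row.get("artist"):
                        writer.writerow(row)
                        written += 1
    return written
```

In the notebook this would be called with the raw data folder and the target file, for example `combine_event_files(RAW_DATA_PATH, PREPROCESS_DATA_PATH / "event_datafile_new.csv")`.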

In [4]:
# Global variables
RAW_DATA_PATH = Path("../data/event_data")
PREPROCESS_DATA_PATH = Path("../data/preprocessed_data")
COLUMNS_TO_USE = ["artist", "firstName", "gender", "itemInSession", "lastName", "length", "level", "location", "sessionId", "song", "userId"]

In [15]:
# Get relevant columns into one dataframe
df = preprocess_data(RAW_DATA_PATH)
# Rows without an artist are dropped; a list for `subset` also works on older pandas versions
selected_df = df[COLUMNS_TO_USE].dropna(subset=["artist"])
selected_df

| | artist | firstName | gender | itemInSession | lastName | length | level | location | sessionId | song | userId |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Harmonia | Ryan | M | 0 | Smith | 655.77751 | free | San Jose-Sunnyvale-Santa Clara, CA | 583 | Sehr kosmisch | 26.0 |
| 1 | The Prodigy | Ryan | M | 1 | Smith | 260.07465 | free | San Jose-Sunnyvale-Santa Clara, CA | 583 | The Big Gundown | 26.0 |
| 2 | Train | Ryan | M | 2 | Smith | 205.45261 | free | San Jose-Sunnyvale-Santa Clara, CA | 583 | Marry Me | 26.0 |
| 5 | Sony Wonder | Samuel | M | 0 | Gonzalez | 218.06975 | free | Houston-The Woodlands-Sugar Land, TX | 597 | Blackbird | 61.0 |
| 9 | Van Halen | Tegan | F | 2 | Levine | 289.38404 | paid | Portland-South Portland, ME | 602 | Best Of Both Worlds (Remastered Album Version) | 80.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 382 | Foo Fighters | Rylan | M | 57 | George | 271.38567 | paid | Birmingham-Hoover, AL | 1076 | The Pretender | 16.0 |
| 383 | Timbiriche | Rylan | M | 58 | George | 202.60526 | paid | Birmingham-Hoover, AL | 1076 | Besos De Ceniza | 16.0 |
| 384 | A Perfect Circle | Rylan | M | 59 | George | 206.05342 | paid | Birmingham-Hoover, AL | 1076 | Rose | 16.0 |
| 385 | Anberlin | Rylan | M | 60 | George | 348.68200 | paid | Birmingham-Hoover, AL | 1076 | The Haunting | 16.0 |


In [14]:
# Save preprocessed data
selected_df.to_csv(PREPROCESS_DATA_PATH / "event_datafile_new.csv", index=False)

## Part II: Complete the Apache Cassandra Coding Portion of Your Project

Cassandra is a NoSQL database or, more specifically, a partitioned wide-column store.
To set up the database, it makes sense to think of the use cases in terms of queries first, because each table is designed around the query it has to serve.

Use cases:
1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession  = 4
2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'  

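This translates to the following queries. They are written against a placeholder table `music_app_history`; in Cassandra, each query needs its own table whose primary key matches the `WHERE` clause: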
1. `SELECT artist, song, length FROM music_app_history WHERE sessionId = 338 AND itemInSession = 4`
2. `SELECT artist, song, firstName, lastName FROM music_app_history WHERE userId = 10 AND sessionId = 182`
3. `SELECT firstName, lastName FROM music_app_history WHERE song = 'All Hands Against His Own'`

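Furthermore, use case 2 asks for results sorted by `itemInSession`. Cassandra only orders rows within a partition by their clustering columns, so `itemInSession` has to be a clustering column of the table that serves this query.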
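
One way to turn the three use cases into key designs is sketched below. The table and column names are assumptions made for illustration, not names prescribed by Udacity or used elsewhere in this notebook. The partition key covers the columns filtered in the `WHERE` clause, and the clustering columns provide uniqueness and ordering.

```python
from typing import NamedTuple, Tuple


class KeyDesign(NamedTuple):
    partition_key: Tuple[str, ...]   # columns filtered with "=" in WHERE
    clustering: Tuple[str, ...]      # uniqueness and sort order inside a partition


# Hypothetical table names, one table per use case.
DESIGNS = {
    "song_by_session_item": KeyDesign(("session_id",), ("item_in_session",)),
    "songs_by_user_session": KeyDesign(("user_id", "session_id"), ("item_in_session",)),
    "users_by_song": KeyDesign(("song",), ("user_id",)),
}


def primary_key_clause(design):
    """Render the CQL PRIMARY KEY clause for a key design."""
    parts = ["(" + ", ".join(design.partition_key) + ")"]
    parts.extend(design.clustering)
    return "PRIMARY KEY (" + ", ".join(parts) + ")"


for table, design in DESIGNS.items():
    print(table, "->", primary_key_clause(design))
```

For the third table, `user_id` is added as a clustering column so that two different users who played the same song do not overwrite each other's row.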

#### Creating a Cluster

In [6]:
# Connect to a Cassandra instance on the local machine (127.0.0.1)

from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])

# A session is needed to establish the connection and execute queries
session = cluster.connect()

#### Create Keyspace

In [None]:
# TO-DO: Create a Keyspace 

#### Set Keyspace

In [None]:
# TO-DO: Set KEYSPACE to the keyspace specified above
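
The two keyspace steps above are still open in this notebook. A minimal sketch could look as follows, assuming a single-node local cluster (hence `SimpleStrategy` with a replication factor of 1) and a keyspace name `sparkify`, which is a hypothetical choice. `session` is assumed to be the `cassandra.cluster.Session` created earlier.

```python
KEYSPACE = "sparkify"  # hypothetical keyspace name, not taken from the notebook


def create_and_use_keyspace(session, name=KEYSPACE, replication_factor=1):
    """Create the keyspace if it is missing and make it the session default."""
    # Keyspace names cannot be bound as query parameters, so the name is
    # formatted into the statement; only pass trusted identifiers here.
    session.execute(
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH REPLICATION = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {int(replication_factor)}}}"
    )
    # Has the same effect as issuing `USE <name>` on this session.
    session.set_keyspace(name)
```

Calling `create_and_use_keyspace(session)` once after connecting means later statements can refer to tables without a keyspace prefix.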


### Now we need to create tables to run the following queries. Remember, with Apache Cassandra you model the database tables on the queries you want to run.

## Create queries to ask the following three questions of the data

### 1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession  = 4


### 2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
    

### 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'




In [1]:
## TO-DO: Query 1:  Give me the artist, song title and song's length in the music app history that was heard during \
## sessionId = 338, and itemInSession = 4
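
A possible table for Query 1 is sketched here. The table name `song_by_session_item` and the snake_case column names are assumptions. `session_id` is the partition key and `item_in_session` the clustering column, so the pair identifies exactly one played song.

```python
# Hypothetical table for Query 1; names are illustrative.
CREATE_SONG_BY_SESSION_ITEM = """
CREATE TABLE IF NOT EXISTS song_by_session_item (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length float,
    PRIMARY KEY ((session_id), item_in_session)
)
"""


def create_query1_table(session):
    """Create the Query 1 table on an open Cassandra session."""
    session.execute(CREATE_SONG_BY_SESSION_ITEM)
```

The non-key columns are limited to what the query returns, which keeps the table small.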


                    

In [None]:
# Load the preprocessed CSV into the table for Query 1.
# Assumption: the table is named `song_by_session_item` (hypothetical) and has the
# columns session_id, item_in_session, artist, song, length; adjust to your own schema.
import csv

file = PREPROCESS_DATA_PATH / "event_datafile_new.csv"

# Column positions follow COLUMNS_TO_USE:
# 0 artist, 3 itemInSession, 5 length, 8 sessionId, 9 song
query = "INSERT INTO song_by_session_item (session_id, item_in_session, artist, song, length) "
query = query + "VALUES (%s, %s, %s, %s, %s)"

with open(file, encoding="utf8", newline="") as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip header
    for line in csvreader:
        # csv.reader yields strings, so cast the numeric fields before binding them
        session.execute(query, (int(line[8]), int(line[3]), line[0], line[9], float(line[5])))

#### Do a SELECT to verify that the data have been inserted into each table

In [None]:
## TO-DO: Add in the SELECT statement to verify the data was entered into the table
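
The verification step for Query 1 might look like the sketch below. It assumes a table named `song_by_session_item` with snake_case columns (hypothetical names) and the driver's default row factory, which returns named tuples.

```python
# Hypothetical table and column names; adjust to the schema actually created.
SELECT_QUERY1 = (
    "SELECT artist, song, length FROM song_by_session_item "
    "WHERE session_id = %s AND item_in_session = %s"
)


def fetch_song(session, session_id=338, item_in_session=4):
    """Return (artist, song, length) tuples for one session item."""
    rows = session.execute(SELECT_QUERY1, (session_id, item_in_session))
    return [(row.artist, row.song, row.length) for row in rows]
```

In the notebook, `fetch_song(session)` would be expected to return at most one tuple, because the two key columns identify a single row.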

### COPY AND REPEAT THE ABOVE THREE CELLS FOR EACH OF THE THREE QUESTIONS

In [None]:
## TO-DO: Query 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name)\
## for userid = 10, sessionid = 182
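
For Query 2, a sketch under the same assumptions (hypothetical table name `songs_by_user_session`, CSV columns in the order of `COLUMNS_TO_USE`) is given below. The composite partition key `(user_id, session_id)` matches the `WHERE` clause, and the clustering column `item_in_session` gives the requested sort order.

```python
import csv

CREATE_SONGS_BY_USER_SESSION = """
CREATE TABLE IF NOT EXISTS songs_by_user_session (
    user_id int,
    session_id int,
    item_in_session int,
    artist text,
    song text,
    first_name text,
    last_name text,
    PRIMARY KEY ((user_id, session_id), item_in_session)
) WITH CLUSTERING ORDER BY (item_in_session ASC)
"""

INSERT_QUERY2 = (
    "INSERT INTO songs_by_user_session "
    "(user_id, session_id, item_in_session, artist, song, first_name, last_name) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)

SELECT_QUERY2 = (
    "SELECT artist, song, first_name, last_name FROM songs_by_user_session "
    "WHERE user_id = %s AND session_id = %s"
)


def query2_params(line):
    """Map one CSV row (ordered like COLUMNS_TO_USE) to the INSERT parameters."""
    # userId was written as a float (e.g. "26.0"), so go through float() first.
    return (int(float(line[10])), int(line[8]), int(line[3]),
            line[0], line[9], line[1], line[4])


def load_and_run_query2(session, csv_path, user_id=10, session_id=182):
    """Create the table, load the CSV, and return the rows for one user session."""
    session.execute(CREATE_SONGS_BY_USER_SESSION)
    with open(csv_path, encoding="utf8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for line in reader:
            session.execute(INSERT_QUERY2, query2_params(line))
    return list(session.execute(SELECT_QUERY2, (user_id, session_id)))
```

No `ORDER BY` is needed in the `SELECT`, since rows inside a partition are already stored in clustering order.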


                    

In [None]:
## TO-DO: Query 3: Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'
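
For Query 3, the song title has to be the partition key. The sketch below uses a hypothetical table `users_by_song` and adds `user_id` as a clustering column. This is an assumption about the intended result: each listener appears once per song, and repeated plays by the same user simply overwrite the same row.

```python
CREATE_USERS_BY_SONG = """
CREATE TABLE IF NOT EXISTS users_by_song (
    song text,
    user_id int,
    first_name text,
    last_name text,
    PRIMARY KEY ((song), user_id)
)
"""

INSERT_QUERY3 = (
    "INSERT INTO users_by_song (song, user_id, first_name, last_name) "
    "VALUES (%s, %s, %s, %s)"
)

SELECT_QUERY3 = "SELECT first_name, last_name FROM users_by_song WHERE song = %s"


def query3_params(line):
    """Map one CSV row (ordered like COLUMNS_TO_USE) to the INSERT parameters."""
    # Positions: 1 firstName, 4 lastName, 9 song, 10 userId (stored as "26.0").
    return (line[9], int(float(line[10])), line[1], line[4])


def fetch_listeners(session, song="All Hands Against His Own"):
    """Return (first_name, last_name) for every user who played `song`."""
    rows = session.execute(SELECT_QUERY3, (song,))
    return [(row.first_name, row.last_name) for row in rows]
```

Loading works as for the other tables: create the table, then call `session.execute(INSERT_QUERY3, query3_params(line))` for each CSV row.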


                    

### Drop the tables before closing out the sessions

In [4]:
## TO-DO: Drop the table before closing out the sessions
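
Dropping the tables could be done as in this sketch. The three table names are assumptions and have to match whatever names were used when the tables were created.

```python
# Hypothetical table names; replace with the names actually created.
TABLES_TO_DROP = ("song_by_session_item", "songs_by_user_session", "users_by_song")


def drop_tables(session, tables=TABLES_TO_DROP):
    """Drop each table if it exists; table names cannot be bound as parameters."""
    for table in tables:
        session.execute(f"DROP TABLE IF EXISTS {table}")
```

`IF EXISTS` makes the cell safe to rerun even when some tables were never created.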

### Close the session and cluster connection

In [None]:
session.shutdown()
cluster.shutdown()