## Exploration notebook
This notebook explores the columns in the file `event_datafile_new.csv` to get a deeper understanding of them.
A good understanding of each feature will help to select the columns that should be:
1. The partition key;
2. The clustering columns (if needed);
3. The data columns that will also be added to the partition.

In [1]:
import pandas as pd

In [2]:
# read the datafile 
df = pd.read_csv("event_datafile_new.csv")

In [3]:
df.head()

Unnamed: 0,artist,firstName,gender,itemInSession,lastName,length,level,location,sessionId,song,userId
0,Barry Tuckwell/Academy of St Martin-in-the-Fie...,Mohammad,M,0,Rodriguez,277.15873,paid,"Sacramento--Roseville--Arden-Arcade, CA",961,Horn Concerto No. 4 in E flat K495: II. Romanc...,88
1,Jimi Hendrix,Mohammad,M,1,Rodriguez,239.82975,paid,"Sacramento--Roseville--Arden-Arcade, CA",961,Woodstock Inprovisation,88
2,Building 429,Mohammad,M,2,Rodriguez,300.61669,paid,"Sacramento--Roseville--Arden-Arcade, CA",961,Majesty (LP Version),88
3,The B-52's,Gianna,F,0,Jones,321.54077,free,"New York-Newark-Jersey City, NY-NJ-PA",107,Love Shack,38
4,Die Mooskirchner,Gianna,F,1,Jones,169.29914,free,"New York-Newark-Jersey City, NY-NJ-PA",107,Frisch und g'sund,38


In [4]:
df.shape

(6820, 11)

### Exploring the dataset to create a table able to perform queries to answer the _first question._
"1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession = 4"

In [5]:
# Unique values in sessionId column
df.sessionId.nunique()

776

In [6]:
# Unique values in itemInSession column
df.itemInSession.nunique()

123

In [7]:
type(df.sessionId.values[0])

numpy.int64

In [8]:
# Build a helper column to count the distinct (sessionId, itemInSession) combinations.
# Note: the two values are joined without a separator, so (11, 1) and (1, 11) both become "111".
df["sessionId_&_itemInSession"] = df.sessionId.astype(str) + df.itemInSession.astype(str)

In [9]:
df["sessionId_&_itemInSession"].nunique()

6806

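**Answer 1**: combining `sessionId` and `itemInSession` yields 6806 distinct values out of a total of 6820 rows. Because the two values are concatenated as strings without a separator, different pairs can collapse into the same string (for example, (11, 1) and (1, 11) both give "111"), so the 14 missing values are not necessarily real duplicates. These two columns should be enough to perform the query for the first question.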

Furthermore, the `artist`, `song` and `length` columns must also be added, as they are what the question asks for.
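The uniqueness check can also be done without string concatenation. The following is a minimal sketch on toy data (the values are hypothetical, not taken from `event_datafile_new.csv`) that compares the separator-free concatenation with a direct count of distinct pairs:

```python
import pandas as pd

# Toy data (hypothetical): two different (sessionId, itemInSession) pairs
# whose digits concatenate to the same string, plus one unrelated row.
toy = pd.DataFrame({"sessionId": [11, 1, 2], "itemInSession": [1, 11, 0]})

# Separator-free concatenation: "11" + "1" and "1" + "11" are both "111"
naive_unique = (toy.sessionId.astype(str) + toy.itemInSession.astype(str)).nunique()

# Counting distinct pairs directly keeps the two columns separate
pair_unique = len(toy.drop_duplicates(subset=["sessionId", "itemInSession"]))

print(naive_unique, pair_unique)
```

On the real frame, `df.duplicated(subset=["sessionId", "itemInSession"]).sum()` would report how many rows truly repeat a pair.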

### Exploring the dataset to create a table able to perform queries to answer the _second question._
"2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182"

In [11]:
df.userId.nunique()

96

In [12]:
df.sessionId.nunique()

776

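**Answer 2**: to answer this question, the table's primary key must include `userId` and `sessionId`, since the query filters on both. The pair alone does not identify a single row, because one session contains several songs, so `itemInSession` should be added as a clustering column; it makes the key unique and returns the songs in the requested order.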

It would also be a good idea to use both columns as a composite partition key: each user's data is then spread across more than one node, and a lookup for a specific `sessionId` goes directly to its own partition.

To answer the question fully, the columns `artist`, `song`, `firstName` and `lastName` should be added as data columns.
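The query for the second question can be emulated in pandas to see how the columns interact. This is a minimal sketch on toy rows (hypothetical values), not the actual table definition:

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from event_datafile_new.csv)
toy = pd.DataFrame({
    "userId": [10, 10, 10, 8],
    "sessionId": [182, 182, 182, 50],
    "itemInSession": [2, 0, 1, 0],
    "artist": ["C", "A", "B", "D"],
})

# Emulate: WHERE userId = 10 AND sessionId = 182, sorted by itemInSession
result = toy[(toy.userId == 10) & (toy.sessionId == 182)].sort_values("itemInSession")
artists_in_order = result.artist.tolist()

# Rows per (userId, sessionId): a count above 1 means the pair alone
# does not identify a single row
rows_per_session = toy.groupby(["userId", "sessionId"]).size()

print(artists_in_order)
print(rows_per_session)
```

Running the same `groupby(...).size()` on `df` would show how many songs each session holds in the real data.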

### Exploring the dataset to create a table able to perform queries to answer the _third question._
"3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'."

In [13]:
df.song.nunique()

5190

In [15]:
df.firstName.nunique()

84

In [16]:
df.lastName.nunique()

86

In [17]:
# Build a helper column to count the distinct song + lastName combinations
df["song_&_lastName"] = df.song + df.lastName
df["song_&_lastName"].nunique()

6615

**Answer 3**: there are 5190 unique `song` values and 96 unique `userId` values. Using `song` as the partition key allows a WHERE clause to select the requested song, and adding `userId` to the primary key keeps one row per listener. `userId` is preferable to `lastName` for this purpose: with 96 users but only 86 distinct last names, some users share a last name, and a key built on `song` and `lastName` would let their rows overwrite one another.  
As data columns, `firstName` and `lastName` must be included, as the question asks.
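The effect of the key choice for the third question can be sketched in pandas. The rows below are toy data with hypothetical names and ids; only the song title comes from the question:

```python
import pandas as pd

# Toy rows (hypothetical names and ids, not taken from event_datafile_new.csv)
toy = pd.DataFrame({
    "song": ["All Hands Against His Own", "All Hands Against His Own",
             "All Hands Against His Own", "Another Song"],
    "userId": [1, 1, 2, 1],
    "firstName": ["Ann", "Ann", "Bob", "Ann"],
    "lastName": ["Lee", "Lee", "Lee", "Lee"],
})

target = toy[toy.song == "All Hands Against His Own"]

# One row per (song, userId): repeated plays by the same user collapse
listeners = target.drop_duplicates(subset=["song", "userId"])[["firstName", "lastName"]]

# Keying on lastName instead merges the two users who share a last name
by_last_name = target.drop_duplicates(subset=["song", "lastName"])

print(len(listeners), len(by_last_name))
```

In this toy case the `userId`-based key keeps both listeners, while the `lastName`-based key keeps only one of them.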