## Welcome to the Cassandra project

The goal of this project is to load a set of .csv files into three Cassandra tables, each modeled to serve one specific query.
To get there, we take the following steps:

1. transform the .csv files to a Pandas DataFrame
2. inspect the distribution of the potential partition keys
3. check for uniqueness of the primary keys
4. create the database
5. transform the Pandas DataFrame to lists of tuples, ready for insertion into the database
6. insert the data
7. inspect and validate the queries
8. close the session and cluster connection 

**NOTE**: run all queries, check all queries, validate the idea behind them (order by) and the table names. If satisfied:
- clean this notebook and transfer to PyCharm / git
- write a decent but short readme -> most information comes from the notebook -> copy / paste from Postgres when possible
- check the graph functions in the project
- Check the Project_1B notebook once more
- Submit!

In [None]:
import os
os.chdir('..')

In [None]:
! pip install -e .

### 1. transform the .csv files to a Pandas DataFrame

In [1]:
from pathlib import Path
from src.create_event_data import CreateEventData
from src.data_utils import create_csv_path_list

In [2]:
event_data_path = Path('..') / 'event_data'
csv_list = sorted(create_csv_path_list(event_data_path))

../event_data contains 31 .csv files.


In [3]:
# drop the first entry of the sorted list; 30 of the 31 .csv files are processed below
del csv_list[0]

In [4]:
create_event_instance = CreateEventData(csv_path_list=csv_list)
event_df = create_event_instance.data_pipeline()

INFO 2021-01-04 12:57:15,524 [create_event_data.py:data_pipeline:37] Data pipeline started...
DEBUG 2021-01-04 12:57:15,554 [create_event_data.py:data_pipeline:46] Processed 1 / 30 .csv files
DEBUG 2021-01-04 12:57:15,569 [create_event_data.py:data_pipeline:46] Processed 2 / 30 .csv files
DEBUG 2021-01-04 12:57:15,584 [create_event_data.py:data_pipeline:46] Processed 3 / 30 .csv files
DEBUG 2021-01-04 12:57:15,599 [create_event_data.py:data_pipeline:46] Processed 4 / 30 .csv files
DEBUG 2021-01-04 12:57:15,616 [create_event_data.py:data_pipeline:46] Processed 5 / 30 .csv files
DEBUG 2021-01-04 12:57:15,631 [create_event_data.py:data_pipeline:46] Processed 6 / 30 .csv files
DEBUG 2021-01-04 12:57:15,647 [create_event_data.py:data_pipeline:46] Processed 7 / 30 .csv files
DEBUG 2021-01-04 12:57:15,664 [create_event_data.py:data_pipeline:46] Processed 8 / 30 .csv files
DEBUG 2021-01-04 12:57:15,684 [create_event_data.py:data_pipeline:46] Processed 9 / 30 .csv files
DEBUG 2021-01-04 12:57:1

In [5]:
# compare to the .jpg in /images
event_df.head(n=3)

| | artist | firstName | gender | itemInSession | lastName | length | level | location | sessionId | song | userId |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Des'ree | Kaylee | F | 1 | Summers | 246.30812 | free | Phoenix-Mesa-Scottsdale, AZ | 139 | You Gotta Be | 8 |
| 1 | Mr Oizo | Kaylee | F | 3 | Summers | 144.03873 | free | Phoenix-Mesa-Scottsdale, AZ | 139 | Flat 55 | 8 |
| 2 | Tamba Trio | Kaylee | F | 4 | Summers | 177.18812 | free | Phoenix-Mesa-Scottsdale, AZ | 139 | Quem Quiser Encontrar O Amor | 8 |


In [6]:
# check correct datatypes
event_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6820 entries, 0 to 6819
Data columns (total 11 columns):
artist           6820 non-null object
firstName        6820 non-null object
gender           6820 non-null object
itemInSession    6820 non-null int64
lastName         6820 non-null object
length           6820 non-null float64
level            6820 non-null object
location         6820 non-null object
sessionId        6820 non-null int64
song             6820 non-null object
userId           6820 non-null int64
dtypes: float64(1), int64(3), object(7)
memory usage: 586.2+ KB


### 2. inspect the distribution of the potential partition keys

The candidate partition keys depend on the queries we want to run; they are marked in **bold** in the list below. Each partition key appears in the WHERE clause of the corresponding CQL query. Cassandra distributes rows across nodes by partition key, so the rows should be spread evenly over the key's values: if a few partitions hold most of the rows, the nodes storing them carry most of the load and queries against them can slow down.
To check this, we visually inspect each candidate partition key with matplotlib and seaborn (the code to reproduce these graphs is in src/partition_key_graphs.py).

1. Give me the artist, song title and song's length in the music app history that was heard during **sessionId** = 338 and **itemInSession** = 4
2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for **userId** = 10, **sessionId** = 182
3. Give me every user name (first and last) in my music app history who listened to the **song** 'All Hands Against His Own'
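
Besides the graphs, the spread of a candidate partition key can be summarised numerically. The following is a minimal sketch using only the standard library; `partition_size_summary` is a hypothetical helper (it is not part of `src/`), and the toy rows are illustrative values, not project data. With the real data, `event_df.to_dict("records")` would produce rows in the same shape.

```python
from collections import Counter
from statistics import mean


def partition_size_summary(rows, key_columns):
    """Count rows per partition key value and summarise the spread.

    Hypothetical helper: `rows` is a list of dicts, `key_columns` the
    columns that together form the candidate partition key.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    sizes = list(counts.values())
    return {
        "partitions": len(sizes),   # number of distinct key values
        "min_rows": min(sizes),     # smallest partition
        "max_rows": max(sizes),     # largest partition
        "mean_rows": mean(sizes),   # average partition size
    }


# Toy rows standing in for the event data (illustrative values only).
rows = [
    {"sessionId": 1, "userId": 10, "song": "A"},
    {"sessionId": 1, "userId": 10, "song": "B"},
    {"sessionId": 2, "userId": 10, "song": "A"},
    {"sessionId": 3, "userId": 11, "song": "C"},
]

summary = partition_size_summary(rows, ["sessionId"])
print(summary)
```

A large gap between `max_rows` and `mean_rows` would point to a few oversized partitions, which is the situation the graphs below are meant to rule out.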

#### Query 1

<img src="https://user-images.githubusercontent.com/49920622/103486947-d7f4a380-4e01-11eb-8b81-6c7494649b28.jpg" width=800>

#### Query 2

<img src="https://user-images.githubusercontent.com/49920622/103487010-581b0900-4e02-11eb-821b-32951b566c4b.jpg" width=800>

#### Query 3

<img src="https://user-images.githubusercontent.com/49920622/103487072-d7104180-4e02-11eb-9689-529824ba78d2.jpg" width=800>

### 3. check for uniqueness of the primary keys

Based on the graphs we have chosen our partition keys, but a primary key can also include clustering columns. Together, the partition key and the clustering columns form the primary key, and this combination *must* be unique: Cassandra overwrites an existing row when a new row with the same primary key is inserted. Before we create the tables, we need to verify this constraint.

In [7]:
from src.data_utils import unique_key_check

In [8]:
unique_key_check(event_df, ['sessionId', 'itemInSession'])

True

In [9]:
unique_key_check(event_df, ['userId', 'sessionId', 'itemInSession'])

True

In [10]:
unique_key_check(event_df, ['song'])

False

#### Note

The keys identified for queries 1 and 2 are unique. For query 3, however, song is not unique by itself: people listen to the same song more than once, sometimes even in the same session. We therefore add sessionId *and* itemInSession as clustering columns.

In [11]:
unique_key_check(event_df, ['song', 'sessionId', 'itemInSession'])

True
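
The implementation of `unique_key_check` lives in src/data_utils.py and is not shown here. As an illustration of the idea only, a check like this could be sketched as follows; `is_unique_key` and the toy rows are hypothetical and use plain dicts instead of a DataFrame.

```python
def is_unique_key(rows, key_columns):
    """Return True if no two rows share the same values for key_columns.

    Hypothetical stand-in for the project's unique_key_check; `rows` is a
    list of dicts rather than a pandas DataFrame.
    """
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            # a second row with this key would overwrite the first in Cassandra
            return False
        seen.add(key)
    return True


# Toy rows: the same song is played twice within one session.
plays = [
    {"song": "A", "sessionId": 1, "itemInSession": 0},
    {"song": "A", "sessionId": 1, "itemInSession": 3},
    {"song": "B", "sessionId": 2, "itemInSession": 0},
]

song_only = is_unique_key(plays, ["song"])
full_key = is_unique_key(plays, ["song", "sessionId", "itemInSession"])
print(song_only, full_key)
```

In the toy data, `song` alone collides, while adding `sessionId` and `itemInSession` makes every key distinct, which mirrors the result for query 3 above.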

### 4. create the database

In [12]:
from src.create_database import CreateDatabase
from src.sql_queries import drop_list, create_list

In [13]:
create_db_instance = CreateDatabase(create_queries=create_list, drop_queries=drop_list)

In [14]:
cluster, session = create_db_instance.create_database_pipeline()

INFO 2021-01-04 12:57:29,648 [create_database.py:create_database_pipeline:20] Create database pipeline started...
INFO 2021-01-04 12:57:29,650 [create_database.py:init_cluster_and_session:30] Initializing the local Cassandra cluster and session
INFO 2021-01-04 12:57:29,769 [create_database.py:set_udacity_keyspace:43] Setting up the udacity keyspace
INFO 2021-01-04 12:57:29,777 [create_database.py:drop_tables:59] Dropping tables if exists
INFO 2021-01-04 12:57:30,524 [create_database.py:create_tables:68] Creating tables
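
The actual CREATE statements are imported from src/sql_queries.py and are not shown in this notebook. Given the keys verified in step 3, the three definitions might look roughly like the sketch below. The table names follow the comments in step 5; the column names, types and the name `create_list_sketch` are assumptions. In CQL the first element of `PRIMARY KEY` is the partition key (double parentheses group a composite partition key) and the remaining elements are clustering columns.

```python
# Hypothetical table definitions; the real ones live in src/sql_queries.py.

# Query 1: look up one play by session and position within the session.
CREATE_SESSION_INFO = """
CREATE TABLE IF NOT EXISTS session_info (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    song_length float,
    PRIMARY KEY (session_id, item_in_session)
)
"""

# Query 2: composite partition key (user_id, session_id); rows inside a
# partition are sorted by the clustering column item_in_session.
CREATE_USER_INFO = """
CREATE TABLE IF NOT EXISTS user_info (
    user_id int,
    session_id int,
    item_in_session int,
    artist text,
    song text,
    first_name text,
    last_name text,
    PRIMARY KEY ((user_id, session_id), item_in_session)
) WITH CLUSTERING ORDER BY (item_in_session ASC)
"""

# Query 3: partition by song; the clustering columns make each play unique.
CREATE_USER_INFO_PER_SONG = """
CREATE TABLE IF NOT EXISTS user_info_per_song (
    song text,
    session_id int,
    item_in_session int,
    first_name text,
    last_name text,
    PRIMARY KEY (song, session_id, item_in_session)
)
"""

create_list_sketch = [CREATE_SESSION_INFO, CREATE_USER_INFO, CREATE_USER_INFO_PER_SONG]
```

Ascending is the default clustering order, so the `WITH CLUSTERING ORDER BY` clause is only there to make the sort required by query 2 explicit.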


### 5. transform the Pandas DataFrame to lists of tuples, ready for insertion into the database

In [15]:
# query 1 -> session_info table 
columns_session_info = ['artist', 'song', 'length', 'sessionId', 'itemInSession']
data_session_info = [tuple(row) for row in event_df[columns_session_info].itertuples(index=False)]

In [16]:
# query 2 -> user_info table 
columns_user_info = ['artist', 'song', 'firstName', 'lastName', 'sessionId', 'userId', 'itemInSession']
data_user_info = [tuple(row) for row in event_df[columns_user_info].itertuples(index=False)]

In [17]:
# query 3 -> user_info_per_song table 
columns_user_info_per_song = ['firstName', 'lastName', 'song', 'sessionId', 'itemInSession']
data_user_info_per_song = [tuple(row) for row in event_df[columns_user_info_per_song].itertuples(index=False)]

### 6. insert the data

In [18]:
from src.sql_queries import insert_list

In [23]:
data_list = [data_session_info, data_user_info, data_user_info_per_song]

for idx, (data, query) in enumerate(zip(data_list, insert_list), start=1):
    print(f"inserting file {idx}/{len(data_list)}")
    for row in data:
        try:
            session.execute(query, row)
        except Exception as e:
            print(e)

inserting file 1/3
inserting file 2/3
inserting file 3/3
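
The loop above prints every exception, which can be hard to read when many rows fail. One possible variation is to collect the failures and report a count at the end. This is only a sketch: `insert_rows` and `FakeSession` are hypothetical names, and `FakeSession` is a stand-in so the example runs without a Cassandra cluster. With a real session, the only method used is `session.execute(query, row)`, as in the cell above.

```python
def insert_rows(session, query, rows):
    """Insert rows one by one; return the success count and the failures."""
    inserted = 0
    failed = []
    for row in rows:
        try:
            session.execute(query, row)
            inserted += 1
        except Exception as exc:
            # keep the offending row next to its error for later inspection
            failed.append((row, exc))
    return inserted, failed


class FakeSession:
    """Stand-in for a Cassandra session (assumption, for demonstration only)."""

    def __init__(self):
        self.stored = []

    def execute(self, query, parameters=None):
        if parameters is None or None in parameters:
            raise ValueError(f"missing value in {parameters}")
        self.stored.append(parameters)


fake_session = FakeSession()
demo_query = "INSERT INTO demo (name, value) VALUES (%s, %s)"
inserted, failed = insert_rows(
    fake_session, demo_query, [("a", 1), ("b", None), ("c", 3)]
)
print(f"{inserted} rows inserted, {len(failed)} failed")
```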


### 7. inspect and validate the queries

#### Create the Cassandra tables

##### Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4

In [None]:
create_query_1 = """ 
CREATE TABLE IF NOT EXISTS song_length (
artist text,
song text,
song_length float,
session_id int,
item_in_session int,
PRIMARY KEY (session_id, item_in_session)
)
"""

In [None]:
try:
    session.execute(create_query_1)
except Exception as e:
    print(e)

In [None]:
insert_query_1 = """
INSERT INTO song_length (artist, song, song_length, session_id, item_in_session)
VALUES (%s, %s, %s, %s, %s)
"""

In [None]:
# the column order must match the column order of insert_query_1
columns_query_1 = ['artist', 'song', 'length', 'sessionId', 'itemInSession']
data_query_1 = [tuple(row) for row in event_df[columns_query_1].itertuples(index=False)]

In [None]:
data_query_1[0]

In [None]:
for row in data_query_1:
    try:
        session.execute(insert_query_1, row)
    except Exception as e:
        print(e)

In [None]:
try:
    rows = session.execute("SELECT * FROM song_length")
except Exception as e:
    print(e)
else:
    # only iterate when the query succeeded; otherwise `rows` is undefined
    for row in rows:
        print(row)
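
`SELECT *` only shows that rows arrived in the table. To validate query 1 itself, the SELECT should filter on the full primary key. Below is a minimal sketch, assuming the `song_length` table and the column names from the cells above; `QUERY_1` and `run_query_1` are hypothetical names, and `session` is expected to be the Cassandra session created in step 4.

```python
# Query 1 restricted to one partition (session_id) and one row within it
# (item_in_session), using the same %s placeholders as the INSERT above.
QUERY_1 = """
SELECT artist, song, song_length
FROM song_length
WHERE session_id = %s AND item_in_session = %s
"""


def run_query_1(session, session_id=338, item_in_session=4):
    """Run query 1 and return the matching rows as a list."""
    return list(session.execute(QUERY_1, (session_id, item_in_session)))
```

Calling `run_query_1(session)` in the notebook would then return the rows for sessionId = 338 and itemInSession = 4. Because the primary key is unique, at most one row is expected.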

### NOTES
- rewrite preprocessing, difference create / insert / and DATA -> easier to create the list of tuples per query ... so only return RETURN DF
- from there create functions to create the list of tuples per query

### 8. close the session and cluster connection

**NOTE**: refer to the instance variables of create_db_instance here.

In [None]:
session.shutdown()
cluster.shutdown()