# Some data checks and analysis
First, import and install the necessary modules.

In [1]:
import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType, TimestampType
import boto3
import pandas as pd
import gc
!pip install s3fs



In [2]:
config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']
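
The cell above assumes that `dl.cfg` contains an `[AWS]` section with both keys. A minimal sketch of a guard against a missing section or key follows. The helper name `load_aws_credentials` and the placeholder values are hypothetical, and the demo parses an in-memory string rather than the real file.

```python
import configparser

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_credentials(parser):
    """Return the two AWS keys from an already populated ConfigParser.

    Raises KeyError with a readable message if the section or a key is missing.
    """
    if not parser.has_section("AWS"):
        raise KeyError("section [AWS] not found in config")
    missing = [k for k in REQUIRED_KEYS if not parser.has_option("AWS", k)]
    if missing:
        raise KeyError("missing keys in [AWS]: " + ", ".join(missing))
    return {k: parser.get("AWS", k) for k in REQUIRED_KEYS}


# Demo with placeholder values (not real credentials).
demo = configparser.ConfigParser()
demo.read_string(
    "[AWS]\n"
    "AWS_ACCESS_KEY_ID=EXAMPLEKEY\n"
    "AWS_SECRET_ACCESS_KEY=EXAMPLESECRET\n"
)
creds = load_aws_credentials(demo)
```

With the real file, one would call `config.read('dl.cfg')` first and pass `config` to the helper before setting the environment variables.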

Create the Spark session and increase the broadcast timeout. Whether the higher timeout is needed depends on the size of the cluster or machine used.

In [3]:
def create_spark_session():
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark

In [4]:
spark = create_spark_session()
spark.conf.set("spark.sql.broadcastTimeout",  900)

## Analyse artist table
Let's have a look at the first rows and count the records. According to the count below, there are 10025 artist records available.

In [5]:
artistdf = spark.read.parquet("s3a://christophndde4/artist_table/")

In [6]:
gc.collect()

206

In [7]:
artistdf.limit(2).show()

+------------------+---------------+--------------+--------+---------+
|         artist_id|           name|      location|latitude|longitude|
+------------------+---------------+--------------+--------+---------+
|ARQ6F0E1187FB45427|     Eisbrecher|              |    null|     null|
|ARQ6JKR1187B99CC19|Grzegorz Turnau|Crakow, Poland|50.06007| 19.93259|
+------------------+---------------+--------------+--------+---------+



In [8]:
artistdf.count()

10025

## Analyse user table
There are only 104 records. Please also have a look at readme.md concerning some restrictions of the user table.

In [9]:
userdf = spark.read.parquet("s3a://christophndde4/user_table/")

In [10]:
userdf.limit(5).show()

+-------+----------+---------+------+-----+
|user_id|first_name|last_name|gender|level|
+-------+----------+---------+------+-----+
|     88|  Mohammad|Rodriguez|     M| free|
|     88|  Mohammad|Rodriguez|     M| paid|
|     29|Jacqueline|    Lynch|     F| free|
|     29|Jacqueline|    Lynch|     F| paid|
|     36|   Matthew|    Jones|     M| free|
+-------+----------+---------+------+-----+



In [11]:
userdf.count()

104
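
The five rows shown above contain users 88 and 29 twice, once per `level`, so the row count is not the number of distinct users. A minimal sketch in plain Python over just those five displayed rows illustrates the check. On the full DataFrame, the analogous PySpark call would be `userdf.select("user_id").distinct().count()`, whose result is not shown in this notebook.

```python
from collections import defaultdict

# The five rows displayed by userdf.limit(5).show() above.
sample_rows = [
    (88, "Mohammad", "Rodriguez", "M", "free"),
    (88, "Mohammad", "Rodriguez", "M", "paid"),
    (29, "Jacqueline", "Lynch", "F", "free"),
    (29, "Jacqueline", "Lynch", "F", "paid"),
    (36, "Matthew", "Jones", "M", "free"),
]

# Collect the set of levels seen for each user_id.
levels_per_user = defaultdict(set)
for user_id, _first, _last, _gender, level in sample_rows:
    levels_per_user[user_id].add(level)

distinct_users = len(levels_per_user)
users_with_both_levels = sorted(
    user_id for user_id, levels in levels_per_user.items() if len(levels) > 1
)

print(len(sample_rows), distinct_users, users_with_both_levels)
# prints: 5 3 [29, 88]
```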

## Analyse time table
There are 6820 unique timestamps in this DataFrame, ranging from Nov. 1st to Nov. 30th, 2018.

In [12]:
timedf = spark.read.parquet("s3a://christophndde4/time_table/")

In [13]:
timedf.limit(5).show()

+--------------------+----+---+----+-----+----+-------+
|          start_time|hour|day|week|month|year|weekday|
+--------------------+----+---+----+-----+----+-------+
|2018-11-05 18:36:...|  18|  5|  45|   11|2018|      2|
|2018-11-05 18:37:...|  18|  5|  45|   11|2018|      2|
|2018-11-05 18:41:...|  18|  5|  45|   11|2018|      2|
|2018-11-05 18:41:...|  18|  5|  45|   11|2018|      2|
|2018-11-05 18:44:...|  18|  5|  45|   11|2018|      2|
+--------------------+----+---+----+-----+----+-------+



In [14]:
timedf.count()

6820

In [15]:
timedf.agg(F.min(F.col("start_time")), F.max(F.col("start_time"))).show()

+--------------------+--------------------+
|     min(start_time)|     max(start_time)|
+--------------------+--------------------+
|2018-11-01 21:01:...|2018-11-30 19:54:...|
+--------------------+--------------------+
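
The derived columns can be cross-checked against the timestamp with the standard library. The sketch below recomputes them for 2018-11-05 18:36, taken from the first displayed row (seconds are truncated in the output above and assumed to be zero). It assumes that `week` is the ISO week number and that `weekday` uses a Sunday = 1 numbering, as Spark's `dayofweek` does. How the columns were actually built is not shown in this notebook, and the helper name `time_parts` is hypothetical.

```python
from datetime import datetime


def time_parts(ts):
    """Recompute the time-table columns for a single timestamp."""
    return {
        "hour": ts.hour,
        "day": ts.day,
        # ISO week number (assumption about the week column).
        "week": ts.isocalendar()[1],
        "month": ts.month,
        "year": ts.year,
        # isoweekday(): Monday=1 ... Sunday=7; remapped to Sunday=1 ... Saturday=7.
        "weekday": ts.isoweekday() % 7 + 1,
    }


parts = time_parts(datetime(2018, 11, 5, 18, 36))
print(parts)
# prints: {'hour': 18, 'day': 5, 'week': 45, 'month': 11, 'year': 2018, 'weekday': 2}
```

Under these assumptions the recomputed values agree with the first row of the sample output (week 45, weekday 2 for a Monday).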



## Analyse song table
According to the count below, there are 14896 records in this table.

In [16]:
songdf = spark.read.parquet("s3a://christophndde4/song_table/")

In [17]:
songdf.orderBy("song_id").limit(5).show()

+------------------+--------------------+------------------+----+---------+
|           song_id|               title|         artist_id|year| duration|
+------------------+--------------------+------------------+----+---------+
|SOAAAQN12AB01856D3|Campeones De La Vida|ARAMIDF1187FB3D8D4|   0|153.36444|
|SOAACFC12A8C140567| Supernatural Pt. II|ARNHTE41187B99289A|   0|343.09179|
|SOAACTC12AB0186A20|Christmas Is Comi...|ARXWFZ21187FB43A0B|2008|180.76689|
|SOAADAD12A8C13D5B0|One Shot (Album V...|ARQTC851187B9B03AF|2005|263.99302|
|SOAADJH12AB018BD30|Black Light (Albu...|AR3FKJ61187B990357|1975|385.90649|
+------------------+--------------------+------------------+----+---------+



In [18]:
songdf.count()

14896

## Analyse songplay (fact) table
There are only 333 records in this table. This is because records without a song_id or an artist_id were not included, since such records would not make much sense here.

In [19]:
songplaydf = spark.read.parquet("s3a://christophndde4/songplay_table/")

In [20]:
songplaydf.orderBy("song_id").limit(5).show()

+-----------+--------------------+-------+-----+------------------+------------------+----------+--------------------+--------------------+
|songplay_id|          start_time|user_id|level|           song_id|         artist_id|session_id|            location|          user_agent|
+-----------+--------------------+-------+-----+------------------+------------------+----------+--------------------+--------------------+
|       3903|2018-11-29 20:21:...|     49| paid|SOABIXP12A8C135F75|AR15DJQ1187FB5910C|      1041|         Seattle, WA|Mozilla/5.0 (Wind...|
|        992|2018-11-14 06:19:...|     80| paid|SOACRBY12AB017C757|ARVGCRM11F50C496F4|       548|                    |"Mozilla/5.0 (Mac...|
|       3369|2018-11-24 04:31:...|     29| paid|SOAECHX12A6D4FC3D9|ARX2DLI1187FB4DD03|       709|   Sydney, Australia|"Mozilla/5.0 (Mac...|
| 8589935354|2018-11-04 09:41:...|     44| paid|SOAFQGA12A8C1367FA|AR0IVTL1187B9AD520|       196|     Los Angeles, CA|Mozilla/5.0 (Maci...|
|       1582|2018-11

In [21]:
songplaydf.count()

333
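
One `songplay_id` in the sample above (8589935354) is far larger than the others. That pattern is what Spark's `monotonically_increasing_id` produces: its documentation describes the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits. Whether the table was built with that function is an assumption; the sketch below only decodes the displayed values under that layout, and the helper name `split_id` is hypothetical.

```python
RECORD_BITS = 33  # lower bits hold the record number within a partition


def split_id(value):
    """Split an ID into (partition, record) under the 31/33-bit layout."""
    partition = value >> RECORD_BITS
    record = value & ((1 << RECORD_BITS) - 1)
    return partition, record


print(split_id(8589935354))  # prints: (1, 762)
print(split_id(3903))        # prints: (0, 3903)
```

Under this assumption the IDs are unique but not consecutive, so the largest `songplay_id` says nothing about the row count.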

Please see the last section of the following notebook for an explanation of why the songplay table is that small.
https://github.com/ChristophGmeiner/NDDE3_DataWarehouse_AWS/blob/master/DataChecks.ipynbdde4