# BITCOIN AND ETHEREUM DATA CONSOLIDATION

### Data Engineering Capstone Project

#### Project Summary

This project defines a pipeline that loads historical Bitcoin (BTC) and Ethereum (ETH) data and creates a data lake. The process includes data formatting, cleaning, and transformation. The data lake can be used to analyze BTC and ETH price changes and their correlation over time. The main use cases are running prediction and regression models to identify trends and future prices, and providing advanced visualizations and technical analysis to users of the data. One question the data can help answer: is there any correlation between BTC and ETH price variations over time?


The project follows these steps:

* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Import required libraries
import pandas as pd
import re
import boto3
import zipfile
from pyspark.sql import SparkSession
import os
import glob
import configparser
from datetime import datetime, timedelta, date
from pyspark.sql import types as t
from pyspark.sql.functions import udf, col, monotonically_increasing_id, to_date, to_timestamp, isnan, when, count
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, minute

import matplotlib.pyplot as plt
import seaborn as sns

### Configure global variables

In [2]:
# Configure Java and Hadoop environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = "/opt/conda/bin:/opt/spark-2.4.3-bin-hadoop2.7/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-8-openjdk-amd64/bin"
os.environ["SPARK_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"

In [3]:
# Read configuration file (the with-block closes the file handle afterwards)
config = configparser.ConfigParser()
with open('dl.cfg') as cfg_file:
    config.read_file(cfg_file)

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']

# NOTE: Use these if using AWS S3 as a storage
INPUT_DATA_AWS                = config['AWS']['INPUT_DATA_AWS']
OUTPUT_DATA_AWS               = config['AWS']['OUTPUT_DATA_AWS']

# NOTE: Use these if using local storage
INPUT_DATA_LOCAL              = config['LOCAL']['INPUT_DATA_LOCAL']
OUTPUT_DATA_LOCAL             = config['LOCAL']['OUTPUT_DATA_LOCAL']

# Common configuration parameters
DATA_LOCATION                 = config['COMMON']['DATA_LOCATION']
DATA_STORAGE                  = config['COMMON']['DATA_STORAGE']
INPUT_DATA_BTC_DIRECTORY      = config['COMMON']['INPUT_DATA_BTC_DIRECTORY']
INPUT_DATA_BTC_ZIP_FILENAME   = config['COMMON']['INPUT_DATA_BTC_ZIP_FILENAME']
INPUT_DATA_BTC_FILENAME       = config['COMMON']['INPUT_DATA_BTC_FILENAME']
INPUT_DATA_ETH_DIRECTORY      = config['COMMON']['INPUT_DATA_ETH_DIRECTORY']
INPUT_DATA_ETH_ZIP_FILENAME   = config['COMMON']['INPUT_DATA_ETH_ZIP_FILENAME']
INPUT_DATA_ETH_FILENAME       = config['COMMON']['INPUT_DATA_ETH_FILENAME']
OUTPUT_DATA_BTC_FILENAME      = config['COMMON']['OUTPUT_DATA_BTC_FILENAME']
OUTPUT_DATA_ETH_FILENAME      = config['COMMON']['OUTPUT_DATA_ETH_FILENAME']
OUTPUT_BTC_TABLE_FILENAME     = config['COMMON']['OUTPUT_BTC_TABLE_FILENAME']
OUTPUT_ETH_TABLE_FILENAME     = config['COMMON']['OUTPUT_ETH_TABLE_FILENAME']
OUTPUT_CRYPTO_TABLE_FILENAME  = config['COMMON']['OUTPUT_CRYPTO_TABLE_FILENAME']
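
The cell above expects a `dl.cfg` file with `AWS`, `LOCAL`, and `COMMON` sections. A minimal sketch of that layout follows. The section and key names come from the cell, but every value is a hypothetical placeholder and only a few of the `COMMON` keys are listed.

```python
import configparser

# Hypothetical dl.cfg content: section and key names match the notebook,
# all values are placeholders only.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>
INPUT_DATA_AWS = s3a://example-bucket/input/
OUTPUT_DATA_AWS = s3a://example-bucket/output/

[LOCAL]
INPUT_DATA_LOCAL = data/
OUTPUT_DATA_LOCAL = output/

[COMMON]
DATA_LOCATION = local
DATA_STORAGE = parquet
INPUT_DATA_BTC_DIRECTORY = data/btc_data/
INPUT_DATA_ETH_DIRECTORY = data/eth_data/
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CFG)

# Fail early if a section the notebook relies on is missing.
missing = [s for s in ("AWS", "LOCAL", "COMMON") if not sample.has_section(s)]
if missing:
    raise KeyError(f"dl.cfg is missing sections: {missing}")

print(sample["COMMON"]["DATA_LOCATION"])
```

Checking the sections up front gives a clearer error than the `KeyError` raised by the first missing lookup.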

In [4]:
# Set global configuration variables
if DATA_LOCATION == "local":
    input_data          = INPUT_DATA_LOCAL
    output_data         = OUTPUT_DATA_LOCAL

elif DATA_LOCATION == "aws":
    input_data          = INPUT_DATA_AWS
    output_data         = OUTPUT_DATA_AWS

# The storage format is independent of the data location, so check it separately
if DATA_STORAGE == "parquet":
    data_storage        = DATA_STORAGE

# load variables for BTC data
btc_data_directory      = INPUT_DATA_BTC_DIRECTORY
btc_zip_filename        = INPUT_DATA_BTC_ZIP_FILENAME    
btc_filename            = INPUT_DATA_BTC_FILENAME
btc_table_filename      = OUTPUT_BTC_TABLE_FILENAME

# load variables for ETH data
eth_data_directory      = INPUT_DATA_ETH_DIRECTORY
eth_zip_filename        = INPUT_DATA_ETH_ZIP_FILENAME    
eth_filename            = INPUT_DATA_ETH_FILENAME
eth_table_filename      = OUTPUT_ETH_TABLE_FILENAME

# general variables
crypto_timeseries_table = OUTPUT_CRYPTO_TABLE_FILENAME
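
If `DATA_LOCATION` holds anything other than `local` or `aws`, `input_data` and `output_data` are never assigned and later cells fail with a `NameError`. One way to surface the problem earlier is a small helper. This is only a sketch, and `resolve_data_paths` is a hypothetical name that is not part of the original notebook.

```python
def resolve_data_paths(location, local_paths, aws_paths):
    """Return (input_path, output_path) for the configured storage location."""
    paths = {"local": local_paths, "aws": aws_paths}
    try:
        return paths[location.strip().lower()]
    except KeyError:
        raise ValueError(
            f"DATA_LOCATION must be 'local' or 'aws', got {location!r}"
        ) from None


# Example with placeholder paths (assumed values, not taken from dl.cfg):
input_path, output_path = resolve_data_paths(
    "local", ("data/", "output/"), ("s3a://bucket/in/", "s3a://bucket/out/")
)
print(input_path, output_path)
```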

#### Unzip data. RUN ONLY if files are compressed. ONLY WORKS FOR LOCAL

In [5]:
# Create btc_data directory
#!mkdir {btc_data_directory}

In [6]:
# Unzip BTC data into the btc_data directory
#!unzip {btc_zip_filename} -d {btc_data_directory}

In [7]:
# Create eth_data directory
#!mkdir {eth_data_directory}

In [8]:
# Unzip ETH data into the eth_data directory
#!unzip {eth_zip_filename} -d {eth_data_directory}
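
The shell commands above only work where `mkdir` and `unzip` are available. As a portable alternative, the `zipfile` module that is already imported can do the same job. This is a sketch, and `extract_archive` is a hypothetical helper name.

```python
import os
import zipfile


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    os.makedirs(target_dir, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage with the notebook's variables (local storage only):
# extract_archive(btc_zip_filename, btc_data_directory)
# extract_archive(eth_zip_filename, eth_data_directory)
```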

## Step 1: Scope the Project and Gather Data

#### Project Scope 

Create a data pipeline that processes historical Bitcoin and Ethereum prices from CSV files and loads them into the Data Warehouse, where they can be used to run price prediction models based on machine learning time-series analysis. The pipeline includes data loading, cleaning, transformation, and aggregation. The main goal of this project is to build that pipeline and make the data available for training ML price prediction models.

Apache Spark on AWS is used to build the ETL pipeline, and pandas and matplotlib are used for the exploratory data analysis (EDA). The PySpark API is used to interact with Spark.

#### Describe and Gather Data 

The datasets were obtained from Kaggle, from these repositories:

**Bitcoin Historical Data**

    * Source: https://www.kaggle.com/mczielinski/bitcoin-historical-data
    * Description: Bitcoin data at 1-min intervals from select exchanges, Jan 2012 to March 2021
    * Format: Single CSV file
    * Fields: - Timestamp
              - Open
              - High
              - Low
              - Close
              - Volume_(BTC)
              - Volume_(Currency)
              - Weighted_Price
    * Time period: 2012-01-01 to 2021-03-31

**Ethereum (ETH/USDT) 1m Dataset**

    * Source: https://www.kaggle.com/priteshkeleven/ethereum-ethusdt-1m-dataset
    * Description: Ethereum dataset with 1 minute interval from 17-8-2017 to 03-2-2021
    * Format: CSV for each month
    * Fields: - timestamp
              - open
              - high
              - low
              - close
              - volume
              - close_time
              - quote_av
              - trades
              - tb_base_av
              - tb_quote_av
              - ignore

    * Time period: 2017-08-17 to 2021-02-03

## Step 2: Explore and Assess the Data

### 2.1 Read Bitcoin data

**NOTE**: The original content was a compressed zip file. The files were unzipped and copied to a local directory to run the EDA.

In [9]:
# Read the BTC data from the CSV files and consolidate it into a single DataFrame
path = os.path.join(input_data,'btc_data' ,'*.csv')
print(path)
files = glob.glob(path)
l_data = []

for filename in files:
    e_data = pd.read_csv(filename, index_col=None, header=0)
    l_data.append(e_data)

BTC_data = pd.concat(l_data, axis=0, ignore_index=True)    

data/btc_data/*.csv


In [10]:
BTC_data.head(5)

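|   | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|:-:|:---------:|:----:|:----:|:---:|:-----:|:------------:|:-----------------:|:--------------:|
| 0 | 1325317920 | 4.39 | 4.39 | 4.39 | 4.39 | 0.455581 | 2.0 | 4.39 |
| 1 | 1325317980 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1325318040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1325318100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1325318160 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |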


In [11]:
# Show data columns
BTC_data.columns

Index(['Timestamp', 'Open', 'High', 'Low', 'Close', 'Volume_(BTC)',
       'Volume_(Currency)', 'Weighted_Price'],
      dtype='object')

In [12]:
# Count rows and columns
BTC_data.shape

(4857377, 8)

In [13]:
# Count the null values
BTC_data.isnull().sum(axis = 0)

Timestamp                  0
Open                 1243608
High                 1243608
Low                  1243608
Close                1243608
Volume_(BTC)         1243608
Volume_(Currency)    1243608
Weighted_Price       1243608
dtype: int64

### 2.2 Read ETH data

In [14]:
# Read the ETH data from the CSV files and consolidate it into a single DataFrame
path = os.path.join(eth_data_directory, '*.csv')
files = glob.glob(path)
l_data = []

for filename in files:
    e_data = pd.read_csv(filename, index_col=None, header=0)
    l_data.append(e_data)

ETH_data = pd.concat(l_data, axis=0, ignore_index=True)         

In [15]:
ETH_data.head(5)

|   | timestamp | open | high | low | close | volume | close_time | quote_av | trades | tb_base_av | tb_quote_av | ignore |
|:-:|:---------:|:----:|:----:|:---:|:-----:|:------:|:----------:|:--------:|:------:|:----------:|:-----------:|:------:|
| 0 | 2020-08-19 00:00:00 | 421.92 | 423.56 | 421.51 | 423.55 | 1632.38697 | 1597795259999 | 689035.13322 | 511 | 998.48569 | 421432.15264 | 0.0 |
| 1 | 2020-08-19 00:01:00 | 423.56 | 424.27 | 423.56 | 423.98 | 909.17074 | 1597795319999 | 385519.232088 | 382 | 352.17966 | 149320.871212 | 0.0 |
| 2 | 2020-08-19 00:02:00 | 424.0 | 424.0 | 422.96 | 423.01 | 712.07169 | 1597795379999 | 301541.453675 | 305 | 174.70524 | 74014.364972 | 0.0 |
| 3 | 2020-08-19 00:03:00 | 423.0 | 423.02 | 422.56 | 422.68 | 680.10097 | 1597795439999 | 287561.429434 | 264 | 228.55088 | 96630.8249 | 0.0 |
| 4 | 2020-08-19 00:04:00 | 422.67 | 423.07 | 422.42 | 422.54 | 414.13931 | 1597795499999 | 175058.889421 | 315 | 153.82495 | 65014.349621 | 0.0 |


In [16]:
# Show data columns
ETH_data.columns

Index(['timestamp', 'open', 'high', 'low', 'close', 'volume', 'close_time',
       'quote_av', 'trades', 'tb_base_av', 'tb_quote_av', 'ignore'],
      dtype='object')

In [17]:
# Count rows and columns
ETH_data.shape

(1817149, 12)

In [18]:
# Count the null values
ETH_data.isnull().sum(axis = 0)

timestamp      0
open           0
high           0
low            0
close          0
volume         0
close_time     0
quote_av       0
trades         0
tb_base_av     0
tb_quote_av    0
ignore         0
dtype: int64

### 2.3 Data Quality Analysis

#### After reviewing the BTC and ETH data, these quality issues were identified:

**Historical BTC data:**

  - Of the 4,857,377 rows in the file, 1,243,608 have null values
  - Time is stored as a Unix timestamp (epoch seconds) and must be transformed before the data can be joined to the ETH dataset


**Historical ETH data:**

   - None of the 1,817,149 rows in the files have null values
   - Time is stored as a date-time value and must be transformed before the data can be joined to the BTC dataset

- The datasets have different numbers of columns, so the attributes need to be mapped.
- Rows are not ordered, so the combined table needs sorting.


**Mapping data fields**

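    BTC fields                         ETH fields
    - Timestamp                        - timestamp
    - Open                             - open
    - High                             - high
    - Low                              - low
    - Close                            - close
    - Volume_(BTC)                     - volume
    - Volume_(Currency)                (not mapped)
    - Weighted_Price                   (not mapped)
    (not mapped)                       - close_time
    (not mapped)                       - quote_av
    (not mapped)                       - trades
    (not mapped)                       - tb_base_av
    (not mapped)                       - tb_quote_av
    (not mapped)                       - ignore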

### 2.4 Data cleaning required

The input data requires the following processing:

    - Drop null values from the staging table for BTC data (they cannot be replaced with zero values)
    - Only use fields that appear in both datasets
    - Timestamps need to be split into year, month, day, hour, and minute
    - Data needs to be ordered by the date keys
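
The two sources encode time differently: BTC uses Unix epoch seconds, while ETH uses a date-time string. The sketch below, in plain Python rather than Spark, shows how both reduce to the same calendar keys. It assumes both sources are in UTC, which the datasets do not state explicitly. `split_epoch` and `split_string` are hypothetical helper names.

```python
from datetime import datetime, timezone


def split_epoch(seconds):
    """Split Unix epoch seconds (BTC format) into UTC calendar keys."""
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.year, dt.month, dt.day, dt.hour, dt.minute


def split_string(text):
    """Split a 'YYYY-MM-DD HH:MM:SS[.fff]' string (ETH format) into keys."""
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            dt = datetime.strptime(text, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    return dt.year, dt.month, dt.day, dt.hour, dt.minute


print(split_epoch(1325317920))              # first BTC row in the EDA output
print(split_string("2020-08-19 00:00:00"))  # first ETH row in the EDA output
```

Once both sides share the same tuple of keys, rows from the two datasets can be matched minute by minute.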

## Step 3: Define the Data Model

### 3.1 Conceptual Data Model


![conceptual model](./conceptual_model-spark.png)


The basic model is a consolidated table that serves as the source for training ML models.




    *Table: staging_btc

    *Columns:
        - Timestamp
        - Open
        - High
        - Low
        - Close
        - Volume_btc
        - Volume_currency
        - Weighted_price


    *Table: staging_eth

    *Columns:
        - timestamp
        - open
        - high
        - low
        - close
        - volume
        - close_time
        - quote_av
        - trades
        - tb_base_av
        - tb_quote_av
        - ignore


    *Table: btc_timeseries

    *Columns:
        - timestamp
        - year
        - month
        - day
        - hour
        - btc_open
        - btc_high
        - btc_low
        - btc_close
        - btc_volume


    *Table: eth_timeseries

    *Columns:
        - timestamp
        - year (Partition Key)
        - month
        - day
        - hour
        - eth_open
        - eth_high
        - eth_low
        - eth_close
        - eth_volume


    *Table: crypto_timeseries

    *Columns:
        - year (Partition Key)
        - month
        - day
        - hour
        - btc_open
        - btc_high
        - btc_low
        - btc_close
        - btc_volume
        - eth_open
        - eth_high
        - eth_low
        - eth_close
        - eth_volume

### 3.2 Mapping Out Data Pipelines

  - Define the global variables in the configuration file (dl.cfg)
  - Read the CSV files from the input directories into Spark DataFrames
  - Save the Spark DataFrames to staging parquet files
  - Read the BTC parquet file and drop null values
  - Transform the BTC timestamp into (year, month, day, hour, minute)
  - Transform the ETH timestamp into (year, month, day, hour, minute)
  - Join the BTC and ETH Spark DataFrames
  - Save the data to the crypto_timeseries table
  - Run the quality control checks

## Step 4: Run Pipelines to Model the Data 

#### 4.1 Create the data model
Build the data pipelines to create the data model.

#### 4.1.1 Create Spark session

In [19]:
# Create the spark session
spark = SparkSession.builder.getOrCreate()

#### 4.1.2 Read data from CSV files and write Spark DataFrames to parquet files

##### Create the BTC Data Staging file

In [20]:
# read BTC data to spark
btc_data_staging = spark.read.options(header='True', inferSchema='True').csv(INPUT_DATA_BTC_DIRECTORY)
btc_data_staging.printSchema()
btc_data_staging.show(5, truncate=False)

root
 |-- Timestamp: integer (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume_(BTC): double (nullable = true)
 |-- Volume_(Currency): double (nullable = true)
 |-- Weighted_Price: double (nullable = true)

+----------+----+----+----+-----+------------+-----------------+--------------+
|Timestamp |Open|High|Low |Close|Volume_(BTC)|Volume_(Currency)|Weighted_Price|
+----------+----+----+----+-----+------------+-----------------+--------------+
|1325317920|4.39|4.39|4.39|4.39 |0.45558087  |2.0000000193     |4.39          |
|1325317980|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |
|1325318040|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |
|1325318100|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |
|1325318160|NaN |NaN |NaN |NaN  |NaN         |NaN              |NaN           |
+----------+----+----+----+-----+------------+-----------------+--------------+
only showing top 5 rows

In [21]:
# BTC Rows Count 
btc_data_staging.count()

4857377

In [22]:
# Rename columns containing characters that Parquet does not allow, then write the parquet file
btc_data_staging_temp = btc_data_staging.withColumnRenamed("Volume_(BTC)", "Volume_BTC") \
                                        .withColumnRenamed("Volume_(Currency)", "Volume_Currency")

btc_data_staging_temp.printSchema()
btc_data_staging_temp.write.mode("overwrite").parquet(output_data+"btcstaging.parquet")

root
 |-- Timestamp: integer (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume_BTC: double (nullable = true)
 |-- Volume_Currency: double (nullable = true)
 |-- Weighted_Price: double (nullable = true)



##### Create the ETH Data Staging file

In [23]:
# read ETH data to spark
eth_data_staging = spark.read.options(header='True', inferSchema='True').csv('data/eth_data/eth_data/')
eth_data_staging.printSchema()
eth_data_staging.show(5, truncate=False)

root
 |-- timestamp: timestamp (nullable = true)
 |-- open: double (nullable = true)
 |-- high: double (nullable = true)
 |-- low: double (nullable = true)
 |-- close: double (nullable = true)
 |-- volume: double (nullable = true)
 |-- close_time: long (nullable = true)
 |-- quote_av: double (nullable = true)
 |-- trades: integer (nullable = true)
 |-- tb_base_av: double (nullable = true)
 |-- tb_quote_av: double (nullable = true)
 |-- ignore: double (nullable = true)

+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+
|timestamp             |open  |high  |low   |close |volume  |close_time   |quote_av     |trades|tb_base_av|tb_quote_av  |ignore         |
+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+
|2017-12-13 00:00:20.81|619.4 |621.99|615.11|615.12|34.17637|1513123280809|21085.5846554|42    |22.16912  |13678

In [24]:
# ETH Rows Count 
eth_data_staging.count()

1817149

In [25]:
# Write the parquet file
eth_data_staging.write.mode("overwrite").parquet(output_data+"ethstaging.parquet")

#### 4.1.3 Clean null values from staging BTC

In [26]:
# Read parquet file for BTC data and drop null values
btc_data_staging = spark.read.parquet(output_data+"btcstaging.parquet")
btc_data_staging_temp = btc_data_staging.na.drop()
btc_data_staging_temp.show(5, truncate=False)
btc_data_staging_temp.count()

+----------+------+------+------+------+----------+---------------+--------------+
|Timestamp |Open  |High  |Low   |Close |Volume_BTC|Volume_Currency|Weighted_Price|
+----------+------+------+------+------+----------+---------------+--------------+
|1471105200|583.73|583.73|583.72|583.72|3.0       |1751.1603323   |583.72011078  |
|1471105320|583.81|583.81|583.8 |583.8 |0.2265    |132.231065     |583.80161148  |
|1471105680|584.12|584.17|584.12|584.17|0.4511    |263.51742527   |584.16631628  |
|1471105860|584.16|584.16|583.82|583.82|2.04297896|1193.3083659   |584.10213186  |
|1471105920|584.08|584.08|584.08|584.08|0.0421    |24.589768      |584.08        |
+----------+------+------+------+------+----------+---------------+--------------+
only showing top 5 rows



3613769

#### 4.1.4 Format BTC date , create btc_timeseries table and save parquet file

In [27]:
btc_data_staging_temp = btc_data_staging_temp.withColumn('year',year(to_timestamp('Timestamp')))
btc_data_staging_temp = btc_data_staging_temp.withColumn('month',month(to_timestamp('Timestamp')))
btc_data_staging_temp = btc_data_staging_temp.withColumn('day',dayofmonth(to_timestamp('Timestamp')))
btc_data_staging_temp = btc_data_staging_temp.withColumn('hour',hour(to_timestamp('Timestamp')))
btc_data_staging_temp = btc_data_staging_temp.withColumn('minute',minute(to_timestamp('Timestamp')))

In [28]:
btc_data_staging_temp.show(5, truncate=False)

+----------+------+------+------+------+----------+---------------+--------------+----+-----+---+----+------+
|Timestamp |Open  |High  |Low   |Close |Volume_BTC|Volume_Currency|Weighted_Price|year|month|day|hour|minute|
+----------+------+------+------+------+----------+---------------+--------------+----+-----+---+----+------+
|1471105200|583.73|583.73|583.72|583.72|3.0       |1751.1603323   |583.72011078  |2016|8    |13 |16  |20    |
|1471105320|583.81|583.81|583.8 |583.8 |0.2265    |132.231065     |583.80161148  |2016|8    |13 |16  |22    |
|1471105680|584.12|584.17|584.12|584.17|0.4511    |263.51742527   |584.16631628  |2016|8    |13 |16  |28    |
|1471105860|584.16|584.16|583.82|583.82|2.04297896|1193.3083659   |584.10213186  |2016|8    |13 |16  |31    |
|1471105920|584.08|584.08|584.08|584.08|0.0421    |24.589768      |584.08        |2016|8    |13 |16  |32    |
+----------+------+------+------+------+----------+---------------+--------------+----+-----+---+----+------+
only showing top 5 rows

In [29]:
# btc_timeseries table creation

btc_data_staging_temp.createOrReplaceTempView("btc_timeseries")
btc_timeseries_table = spark.sql("""
    SELECT  DISTINCT Timestamp    AS timestamp,
                     year         AS year, 
                     month        AS month, 
                     day          AS day, 
                     hour         AS hour, 
                     minute       AS minute,
                     Open         AS btc_open, 
                     High         AS btc_high, 
                     Low          AS btc_low,
                     Volume_BTC   AS btc_volume                     
    FROM btc_timeseries
    ORDER BY year, month, day, hour, minute
""")
btc_timeseries_table.printSchema()

root
 |-- timestamp: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- btc_open: double (nullable = true)
 |-- btc_high: double (nullable = true)
 |-- btc_low: double (nullable = true)
 |-- btc_volume: double (nullable = true)



In [30]:
# Write btc_timeseries_table to parquet file:
btc_timeseries_table.write.mode("overwrite").parquet(output_data+btc_table_filename)

#### 4.1.5 Format ETH date , create eth_timeseries table and save parquet file

In [31]:
# Read parquet file for ETH DATA
eth_data_staging_temp = spark.read.parquet(output_data+"ethstaging.parquet")
eth_data_staging_temp.show(5, truncate=False)
eth_data_staging_temp.count()

+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+
|timestamp             |open  |high  |low   |close |volume  |close_time   |quote_av     |trades|tb_base_av|tb_quote_av  |ignore         |
+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+
|2017-12-13 00:00:20.81|619.4 |621.99|615.11|615.12|34.17637|1513123280809|21085.5846554|42    |22.16912  |13678.7625208|396902.76727375|
|2017-12-13 00:01:20.81|615.12|617.86|615.11|616.99|26.96652|1513123340809|16610.1922788|26    |18.68354  |11509.242068 |396673.21816384|
|2017-12-13 00:02:20.81|617.0 |617.0 |613.01|613.01|29.32637|1513123400809|18043.0303292|52    |18.03304  |11102.5676661|396552.02389462|
|2017-12-13 00:03:20.81|613.01|615.11|610.0 |611.76|53.86512|1513123460809|32970.7225753|60    |23.70823  |14534.62275  |396556.05743226|
|2017-12-13 00:04:20.81|610.8 |612

1817149

In [32]:
eth_data_staging_temp = eth_data_staging_temp.withColumn('year',year(to_timestamp('Timestamp')))
eth_data_staging_temp = eth_data_staging_temp.withColumn('month',month(to_timestamp('Timestamp')))
eth_data_staging_temp = eth_data_staging_temp.withColumn('day',dayofmonth(to_timestamp('Timestamp')))
eth_data_staging_temp = eth_data_staging_temp.withColumn('hour',hour(to_timestamp('Timestamp')))
eth_data_staging_temp = eth_data_staging_temp.withColumn('minute',minute(to_timestamp('Timestamp')))

In [33]:
eth_data_staging_temp.show(5, truncate=False)

+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+----+-----+---+----+------+
|timestamp             |open  |high  |low   |close |volume  |close_time   |quote_av     |trades|tb_base_av|tb_quote_av  |ignore         |year|month|day|hour|minute|
+----------------------+------+------+------+------+--------+-------------+-------------+------+----------+-------------+---------------+----+-----+---+----+------+
|2017-12-13 00:00:20.81|619.4 |621.99|615.11|615.12|34.17637|1513123280809|21085.5846554|42    |22.16912  |13678.7625208|396902.76727375|2017|12   |13 |0   |0     |
|2017-12-13 00:01:20.81|615.12|617.86|615.11|616.99|26.96652|1513123340809|16610.1922788|26    |18.68354  |11509.242068 |396673.21816384|2017|12   |13 |0   |1     |
|2017-12-13 00:02:20.81|617.0 |617.0 |613.01|613.01|29.32637|1513123400809|18043.0303292|52    |18.03304  |11102.5676661|396552.02389462|2017|12   |13 |0   |2     |
|2017-12-1

In [34]:
# eth_timeseries table creation

eth_data_staging_temp.createOrReplaceTempView("eth_timeseries")
eth_timeseries_table = spark.sql("""
    SELECT  DISTINCT timestamp    AS timestamp,
                     year         AS year, 
                     month        AS month, 
                     day          AS day, 
                     hour         AS hour, 
                     minute       AS minute,
                     open         AS eth_open, 
                     high         AS eth_high, 
                     low          AS eth_low,
                     volume       AS eth_volume                     
    FROM eth_timeseries
    ORDER BY year, month, day, hour, minute
""")
eth_timeseries_table.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- minute: integer (nullable = true)
 |-- eth_open: double (nullable = true)
 |-- eth_high: double (nullable = true)
 |-- eth_low: double (nullable = true)
 |-- eth_volume: double (nullable = true)



In [35]:
# Write eth_timeseries_table to parquet file:
eth_timeseries_table.write.mode("overwrite").parquet(output_data+eth_table_filename)

### 4.2 Join BTC and ETH data and create crypto_timeseries table

In [36]:
# Read BTC timeseries table data from parquet file
btc_timserie_table = spark.read.parquet(output_data+btc_table_filename)
btc_timserie_table.show(5, truncate=False)

+----------+----+-----+---+----+------+--------+--------+-------+-----------+
|timestamp |year|month|day|hour|minute|btc_open|btc_high|btc_low|btc_volume |
+----------+----+-----+---+----+------+--------+--------+-------+-----------+
|1528880580|2018|6    |13 |9   |3     |6542.0  |6542.0  |6534.5 |10.04486266|
|1528880640|2018|6    |13 |9   |4     |6542.0  |6542.0  |6535.07|1.28037011 |
|1528880700|2018|6    |13 |9   |5     |6534.57 |6541.99 |6527.33|30.8100248 |
|1528880760|2018|6    |13 |9   |6     |6530.36 |6541.9  |6522.74|5.7461329  |
|1528880820|2018|6    |13 |9   |7     |6541.4  |6541.9  |6532.48|0.29524416 |
+----------+----+-----+---+----+------+--------+--------+-------+-----------+
only showing top 5 rows



In [37]:
# Read ETH timeseries table data from parquet file
eth_timserie_table = spark.read.parquet(output_data+eth_table_filename)
eth_timserie_table.show(5, truncate=False)

+-------------------+----+-----+---+----+------+--------+--------+-------+----------+
|timestamp          |year|month|day|hour|minute|eth_open|eth_high|eth_low|eth_volume|
+-------------------+----+-----+---+----+------+--------+--------+-------+----------+
|2021-01-09 02:39:00|2021|1    |9  |2   |39    |1191.2  |1192.31 |1189.73|629.16301 |
|2021-01-09 02:40:00|2021|1    |9  |2   |40    |1191.72 |1194.84 |1190.67|617.54553 |
|2021-01-09 02:41:00|2021|1    |9  |2   |41    |1194.47 |1194.78 |1189.0 |838.08516 |
|2021-01-09 02:42:00|2021|1    |9  |2   |42    |1189.71 |1190.03 |1183.91|1688.38612|
|2021-01-09 02:43:00|2021|1    |9  |2   |43    |1184.1  |1186.3  |1183.0 |1020.06746|
+-------------------+----+-----+---+----+------+--------+--------+-------+----------+
only showing top 5 rows



In [71]:
# Join BTC and ETH tables on the year, month, day, hour, and minute keys
crypto_timeseries_spark = btc_timserie_table.alias('b').join(eth_timserie_table.alias('e'), on=[
                                            btc_timserie_table.year ==  eth_timserie_table.year,
                                            btc_timserie_table.month ==  eth_timserie_table.month,
                                            btc_timserie_table.day ==  eth_timserie_table.day,
                                            btc_timserie_table.hour ==  eth_timserie_table.hour,
                                            btc_timserie_table.minute ==  eth_timserie_table.minute
                                            ]).select("b.year", 
                                                      "b.month",
                                                      "b.day",
                                                      "b.hour",
                                                      "b.minute",
                                                      "b.btc_open",
                                                      "b.btc_high",
                                                      "b.btc_low",
                                                      "b.btc_volume",                                
                                                      "e.eth_open",
                                                      "e.eth_high",
                                                      "e.eth_low",
                                                      "e.eth_volume").sort("year","month","day","hour","minute")

In [72]:
crypto_timeseries_spark.show(5, truncate=False)

+----+-----+---+----+------+--------+--------+-------+----------+--------+--------+-------+----------+
|year|month|day|hour|minute|btc_open|btc_high|btc_low|btc_volume|eth_open|eth_high|eth_low|eth_volume|
+----+-----+---+----+------+--------+--------+-------+----------+--------+--------+-------+----------+
|2017|8    |17 |4   |0     |4279.35 |4279.35 |4274.0 |1.1698521 |301.13  |301.13  |301.13 |0.42643   |
|2017|8    |17 |4   |1     |4270.71 |4270.71 |4270.71|0.09      |301.13  |301.13  |301.13 |2.75787   |
|2017|8    |17 |4   |2     |4271.0  |4275.45 |4271.0 |0.58030593|300.0   |300.0   |300.0  |0.0993    |
|2017|8    |17 |4   |3     |4274.0  |4274.0  |4271.0 |0.12603432|300.0   |300.0   |300.0  |0.31389   |
|2017|8    |17 |4   |4     |4279.35 |4279.35 |4271.0 |0.91548205|301.13  |301.13  |301.13 |0.23202   |
+----+-----+---+----+------+--------+--------+-------+----------+--------+--------+-------+----------+
only showing top 5 rows



In [73]:
# Write crypto_timeseries_table to parquet file:
crypto_timeseries_spark.write.mode("overwrite").parquet(output_data+crypto_timeseries_table)
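
The join above repeats each key comparison and then needs aliases to pick columns. A more compact variant passes the key names as a list so that Spark keeps a single copy of each key. The sketch assumes both DataFrames carry the `year`, `month`, `day`, `hour`, and `minute` columns created earlier. `join_on_minute` is a hypothetical helper, and the source-specific `timestamp` columns are dropped first because they would otherwise collide by name.

```python
JOIN_KEYS = ["year", "month", "day", "hour", "minute"]


def join_on_minute(btc_df, eth_df):
    """Inner-join two PySpark DataFrames on the shared minute-level keys."""
    return (
        btc_df.drop("timestamp")
        .join(eth_df.drop("timestamp"), on=JOIN_KEYS, how="inner")
        .sort(*JOIN_KEYS)
    )


# Usage with the notebook's DataFrames:
# crypto_timeseries_spark = join_on_minute(btc_timserie_table, eth_timserie_table)
```

Because the join is an inner join, minutes that are missing from either source do not appear in the result.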

#### 4.3 Data Quality Checks

**Data quality checks:**
 * Check that the key column of every table (`timestamp`, or `year` for the joined table) has no null values.
 * Check that all tables have more than 0 rows.
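
Both checks can be wrapped in one reusable function instead of repeating the same cell for each table. This is a sketch, not the notebook's original code. `check_table` is a hypothetical name, and it relies only on `DataFrame.count()` and on `DataFrame.filter()` with a SQL expression string.

```python
def check_table(df, key_column, table_name):
    """Raise ValueError if the table is empty or its key column has nulls."""
    total = df.count()
    if total == 0:
        raise ValueError(f"{table_name}: table is empty")

    null_keys = df.filter(f"{key_column} IS NULL").count()
    if null_keys > 0:
        raise ValueError(f"{table_name}: {null_keys} null values in {key_column}")

    return total


# Usage with a table written earlier in the notebook:
# check_table(spark.read.parquet(output_data + btc_table_filename),
#             "timestamp", "btc_timeseries")
```

Raising an exception, rather than only printing counts, lets a scheduled run stop as soon as a check fails.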

##### 4.3.1 Quality checks for staging_btc table

In [43]:
# Verify the table is not empty and the key has no null values
btc_data_staging_check = spark.read.parquet(output_data+"btcstaging.parquet")
btc_data_staging_check.select(count(col('timestamp'))).show()
btc_data_staging_check.select(count(when(isnan('timestamp') | col('timestamp').isNull(), 'timestamp'))).show()

+----------------+
|count(timestamp)|
+----------------+
|         4857377|
+----------------+

+-----------------------------------------------------------------------------+
|count(CASE WHEN (isnan(timestamp) OR (timestamp IS NULL)) THEN timestamp END)|
+-----------------------------------------------------------------------------+
|                                                                            0|
+-----------------------------------------------------------------------------+



##### 4.3.2 Quality checks for staging_eth table

In [44]:
# Verify the table is not empty and the key has no null values
eth_data_staging_check = spark.read.parquet(output_data+"ethstaging.parquet")
eth_data_staging_check.select(count(col('timestamp'))).show()
eth_data_staging_check.select(count(when(col('timestamp').isNull(), 'timestamp'))).show()

+----------------+
|count(timestamp)|
+----------------+
|         1817149|
+----------------+

+-------------------------------------------------------+
|count(CASE WHEN (timestamp IS NULL) THEN timestamp END)|
+-------------------------------------------------------+
|                                                      0|
+-------------------------------------------------------+



##### 4.3.3 Quality checks for btc_timeseries table

In [45]:
# Verify the table is not empty and the key has no null values
btc_data_series_check = spark.read.parquet(output_data+btc_table_filename)
btc_data_series_check.select(count(col('timestamp'))).show()
btc_data_series_check.select(count(when(col('timestamp').isNull(), 'timestamp'))).show()

+----------------+
|count(timestamp)|
+----------------+
|         3613769|
+----------------+

+-------------------------------------------------------+
|count(CASE WHEN (timestamp IS NULL) THEN timestamp END)|
+-------------------------------------------------------+
|                                                      0|
+-------------------------------------------------------+



##### 4.3.4 Quality checks for eth_timeseries table

In [46]:
# Verify that the table is not empty and that the key column has no null values
eth_data_series_check = spark.read.parquet(output_data+eth_table_filename)
eth_data_series_check.select(count(col('timestamp'))).show()
eth_data_series_check.select(count(when(col('timestamp').isNull(), 'timestamp'))).show()

+----------------+
|count(timestamp)|
+----------------+
|         1815899|
+----------------+

+-------------------------------------------------------+
|count(CASE WHEN (timestamp IS NULL) THEN timestamp END)|
+-------------------------------------------------------+
|                                                      0|
+-------------------------------------------------------+



##### 4.3.5 Quality checks for crypto_timeseries table

In [74]:
# Verify that the table is not empty and that the key column has no null values
crypto_timeserie_check = spark.read.parquet(output_data+crypto_timeseries_table)
crypto_timeserie_check.select(count(col('year'))).show()
crypto_timeserie_check.select(count(when(col('year').isNull(), 'year'))).show()

+-----------+
|count(year)|
+-----------+
|    1767716|
+-----------+

+---------------------------------------------+
|count(CASE WHEN (year IS NULL) THEN year END)|
+---------------------------------------------+
|                                            0|
+---------------------------------------------+



#### 4.4 Data dictionary

**Table btc_timeseries**

|    Field   | Data Type | Nullable |                   Description                   |    Source   |
|:----------:|:---------:|:--------:|:-----------------------------------------------:|:-----------:|
| timestamp  | Integer   | NO       | Timestamp from the original source              | staging_btc |
| year       | Integer   | NO       | Year obtained from the timestamp                | staging_btc |
| month      | Integer   | NO       | Month obtained from the timestamp               | staging_btc |
| day        | Integer   | NO       | Day obtained from the timestamp                 | staging_btc |
| hour       | Integer   | NO       | Hour obtained from the timestamp                | staging_btc |
| btc_open   | Double    | NO       | BTC price at the opening of the period          | staging_btc |
| btc_high   | Double    | NO       | Highest BTC price registered for the period     | staging_btc |
| btc_low    | Double    | NO       | Lowest BTC price registered for the period      | staging_btc |
| btc_close  | Double    | NO       | BTC price at the close of the period            | staging_btc |
| btc_volume | Double    | NO       | Amount of BTC traded during the period          | staging_btc |


**Table eth_timeseries**

|    Field   | Data Type | Nullable |                   Description                   |    Source   |
|:----------:|:---------:|:--------:|:-----------------------------------------------:|:-----------:|
| timestamp  | Integer   | NO       | Timestamp from the original source              | staging_eth |
| year       | Integer   | NO       | Year obtained from the timestamp                | staging_eth |
| month      | Integer   | NO       | Month obtained from the timestamp               | staging_eth |
| day        | Integer   | NO       | Day obtained from the timestamp                 | staging_eth |
| hour       | Integer   | NO       | Hour obtained from the timestamp                | staging_eth |
| eth_open   | Double    | NO       | ETH price at the opening of the period          | staging_eth |
| eth_high   | Double    | NO       | Highest ETH price registered for the period     | staging_eth |
| eth_low    | Double    | NO       | Lowest ETH price registered for the period      | staging_eth |
| eth_close  | Double    | NO       | ETH price at the close of the period            | staging_eth |
| eth_volume | Double    | NO       | Amount of ETH traded during the period          | staging_eth |


**Table crypto_timeseries**

|    Field   | Data Type | Nullable |                   Description                   |     Source     |
|:----------:|:---------:|:--------:|:-----------------------------------------------:|:--------------:|
| timestamp  | Integer   | NO       | Timestamp from the original source              | btc_timeseries |
| year       | Integer   | NO       | Year obtained from the timestamp                | btc_timeseries |
| month      | Integer   | NO       | Month obtained from the timestamp               | btc_timeseries |
| day        | Integer   | NO       | Day obtained from the timestamp                 | btc_timeseries |
| hour       | Integer   | NO       | Hour obtained from the timestamp                | btc_timeseries |
| btc_open   | Double    | NO       | BTC price at the opening of the period          | btc_timeseries |
| btc_high   | Double    | NO       | Highest BTC price registered for the period     | btc_timeseries |
| btc_low    | Double    | NO       | Lowest BTC price registered for the period      | btc_timeseries |
| btc_close  | Double    | NO       | BTC price at the close of the period            | btc_timeseries |
| btc_volume | Double    | NO       | Amount of BTC traded during the period          | btc_timeseries |
| eth_open   | Double    | NO       | ETH price at the opening of the period          | eth_timeseries |
| eth_high   | Double    | NO       | Highest ETH price registered for the period     | eth_timeseries |
| eth_low    | Double    | NO       | Lowest ETH price registered for the period      | eth_timeseries |
| eth_close  | Double    | NO       | ETH price at the close of the period            | eth_timeseries |
| eth_volume | Double    | NO       | Amount of ETH traded during the period          | eth_timeseries |
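
The `year`, `month`, `day` and `hour` fields in the tables above are all derived from `timestamp`. A minimal sketch of that derivation follows, assuming the timestamp is a Unix epoch value in seconds interpreted in UTC (the dictionary only states that it is an integer, so both the unit and the time zone are assumptions). `time_fields` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def time_fields(timestamp):
    """Derive the calendar fields from a Unix timestamp (seconds, UTC assumed)."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        "timestamp": timestamp,
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
    }


# 1514782800 corresponds to 2018-01-01 05:00:00 UTC
print(time_fields(1514782800))
```

If the source timestamps were in milliseconds or in a local time zone, the conversion would need to be adjusted accordingly.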

## Step 5: Complete Project Write Up

### 5.1 Rationale for the tool selection:

   - Prior experience developing in Python and PySpark.
   - PySpark is easy to run both locally and in the cloud on AWS Spark clusters.
   - PySpark shares some similarities with pandas, which makes the framework easy to develop with.
   - The DataFrame API and SQL make data access straightforward.
   

### 5.2 How often the ETL script should be run:

   - Because the cryptocurrency market is highly dynamic and new data is added every minute, the pipeline may need to run at least once a day. Only the new data needs to be loaded rather than the full history, so the pipeline can be modified to append records. Alternatively, a data streaming framework such as Kafka may be a better option.
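
The append-only update described above could look roughly like the following. It is a plain-Python sketch of the selection logic on lists of dict rows, assuming `timestamp` increases over time and uniquely identifies a record; `select_new_rows` and `append_increment` are hypothetical names, not functions from the notebook.

```python
def select_new_rows(rows, last_loaded_timestamp):
    """Keep only rows strictly newer than the last timestamp already loaded."""
    return [row for row in rows if row["timestamp"] > last_loaded_timestamp]


def append_increment(existing_rows, incoming_rows):
    """Append the rows from incoming_rows that are not yet in existing_rows."""
    # With an empty lake every incoming row counts as new.
    last_loaded = max(
        (row["timestamp"] for row in existing_rows), default=float("-inf")
    )
    new_rows = select_new_rows(incoming_rows, last_loaded)
    return existing_rows + sorted(new_rows, key=lambda row: row["timestamp"])
```

In Spark, the equivalent step would filter the incoming DataFrame on `timestamp` and write the result with `mode("append")` instead of rewriting the whole Parquet dataset.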
   

### 5.3 Other scenarios (what to consider in them):

   - Huge increase in data (Data 100x): This case can happen if more cryptocurrencies are included, since each one requires additional storage. In this case, we can use a larger Spark cluster on AWS or another cloud service. Other options include using another database for staging and simplifying some columns in the datasets. We can also set up parallel processing on Spark to improve performance.
   
   - The pipelines would be run on a daily basis by 7 am every day: In this case, it would be helpful to use a workflow scheduler such as Apache Airflow to integrate and schedule the job execution. Updating the data every morning also requires integrating with the data providers.

   - The database needed to be accessed by 100+ people: In this case, many users may access the data to create specific queries and views or to run their ML models. A flexible, cloud-hosted database is then required to serve the data with low latency.
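
For the daily 7 am scenario above, the scheduling rule itself is simple to state in code. The sketch below only computes when the next run is due and uses naive local datetimes; `next_run` is a hypothetical helper, and a real deployment would delegate this to a scheduler rather than hand-rolled code.

```python
from datetime import datetime, time, timedelta


def next_run(now, run_at=time(7, 0)):
    """Return the next datetime strictly after `now` at which the job is due."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now): use tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

The same schedule is written `0 7 * * *` in cron syntax, which most schedulers accept.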
   
   
### 5.4 Potential further work:

   - Integrate other data sources, such as exchanges, to obtain more data and more up-to-date price information.
   - That integration may require a data streaming service that appends new data as it arrives.
   - Integrate with AutoML or ML workflow tools to automate the prediction models after the data is updated.
   - Improve the process using MLflow or another pipeline workflow manager.
   - Improve the pipeline design and implementation.