# Bitcoin Price Analysis

My objective here is to analyse the Bitcoin daily price dataset obtained from: https://www.investing.com/crypto/bitcoin/historical-data

Initially, only the daily timeframe is taken into account. In the future, I intend to analyse other timeframes, such as weekly or monthly data.

### 1) Pre-processing with PySpark

In [1]:
# Importing pyspark and creating a new session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("btc")
    .getOrCreate()
)

24/05/21 22:12:17 WARN Utils: Your hostname, Valhalla-Zorin resolves to a loopback address: 127.0.1.1; using 192.168.100.36 instead (on interface enp4s0)
24/05/21 22:12:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/21 22:12:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
# Load the CSV; with no schema given, Spark reads every column as a string
csv_file_path = "Databases/BTC/btc_daily.csv"
data = spark.read.csv(csv_file_path, header=True)
data

DataFrame[Date: string, Price: string, Open: string, High: string, Low: string, Vol.: string, Change %: string]

In [5]:
data.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Vol.: string (nullable = true)
 |-- Change %: string (nullable = true)



In [6]:
data.show()

+----------+--------+--------+--------+--------+-------+--------+
|      Date|   Price|    Open|    High|     Low|   Vol.|Change %|
+----------+--------+--------+--------+--------+-------+--------+
|05/21/2024|70,139.9|71,430.5|71,872.0|69,181.7|108.56K|  -1.80%|
|05/20/2024|71,422.7|66,278.3|71,482.8|66,076.5|112.66K|   7.76%|
|05/19/2024|66,279.1|66,919.0|67,662.5|65,937.3| 36.19K|  -0.95%|
|05/18/2024|66,917.5|67,036.6|67,361.4|66,636.1| 29.68K|  -0.18%|
|05/17/2024|67,036.8|65,231.1|67,420.7|65,121.7| 63.09K|   2.77%|
|05/16/2024|65,231.0|66,219.6|66,643.9|64,623.3| 72.55K|  -1.50%|
|05/15/2024|66,225.1|61,569.4|66,417.1|61,357.5|106.05K|   7.56%|
|05/14/2024|61,569.4|62,936.8|63,102.6|61,156.9| 68.84K|  -2.17%|
|05/13/2024|62,937.2|61,480.5|63,443.2|60,779.0| 70.55K|   2.37%|
|05/12/2024|61,480.0|60,826.6|61,847.7|60,647.1| 27.40K|   1.07%|
|05/11/2024|60,826.6|60,796.8|61,487.5|60,499.3| 27.50K|   0.05%|
|05/10/2024|60,796.9|63,074.3|63,454.3|60,251.8| 79.33K|  -3.61%|
|05/09/202

In [8]:
# Import the type and functions needed to clean this dataset
from pyspark.sql.functions import col, regexp_replace, to_date
from pyspark.sql.types import DoubleType

In [9]:
# Parse "Date" from MM/dd/yyyy strings into DateType
data = data.withColumn("Date", to_date(col("Date"), "MM/dd/yyyy"))

# Strip the thousands separators, then cast the price columns to DoubleType
columns_to_convert = ['Price', 'Open', 'High', 'Low']
for col_name in columns_to_convert:
    data = data.withColumn(
        col_name,
        regexp_replace(col(col_name), ",", "").cast(DoubleType()),
    )
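
As a quick sanity check outside Spark, the same conversion can be sketched in plain Python on a single row copied from the `data.show()` output. The helper name `parse_row` is hypothetical and not part of the notebook. Python's `%m/%d/%Y` plays the role of Spark's `MM/dd/yyyy` pattern here.

```python
from datetime import date, datetime


def parse_row(raw_date, raw_price):
    """Plain-Python sketch of the Spark conversion (hypothetical helper)."""
    # "%m/%d/%Y" is the strptime counterpart of the Spark pattern "MM/dd/yyyy"
    parsed_date = datetime.strptime(raw_date, "%m/%d/%Y").date()
    # Remove the thousands separator before converting, as regexp_replace does
    parsed_price = float(raw_price.replace(",", ""))
    return parsed_date, parsed_price


# Row taken from the data.show() output
print(parse_row("05/21/2024", "70,139.9"))
```

If the Spark cast produced nulls for some rows, running a few raw strings through a helper like this is one way to see which values fail to parse.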

In [167]:
data.show()

+----------+-------+-------+-------+-------+-------+--------+
|      Date|  Price|   Open|   High|    Low|   Vol.|Change %|
+----------+-------+-------+-------+-------+-------+--------+
|2024-03-24|67211.9|64036.5|67587.8|63812.9| 65.59K|   4.96%|
|2024-03-23|64037.8|63785.6|65972.4|63074.9| 35.11K|   0.40%|
|2024-03-22|63785.5|65501.5|66633.3|62328.3| 72.43K|  -2.62%|
|2024-03-21|65503.8|67860.0|68161.7|64616.1| 75.26K|  -3.46%|
|2024-03-20|67854.0|62046.8|68029.5|60850.9|133.53K|   9.35%|
|2024-03-19|62050.0|67594.1|68099.6|61560.6|148.08K|  -8.20%|
|2024-03-18|67594.1|68389.7|68920.1|66601.4| 78.07K|  -1.17%|
|2024-03-17|68391.2|65314.2|68857.7|64605.5| 66.07K|   4.71%|
|2024-03-16|65314.2|69456.5|70037.0|64971.0| 75.82K|  -5.97%|
|2024-03-15|69463.7|71387.1|72398.1|65765.6|148.59K|  -2.69%|
|2024-03-14|71387.5|73066.7|73740.9|68717.2|109.43K|  -2.30%|
|2024-03-13|73066.3|71461.9|73623.5|71338.4| 77.18K|   2.23%|
|2024-03-12|71470.2|72099.1|72916.7|68845.6|105.09K|  -0.87%|
|2024-03

In [168]:
# Drop the volume and percentage-change columns, which are still strings
data = data.drop("Vol.", "Change %")
data.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Price: double (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
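
The cell above drops `Vol.` and `Change %` while they are still strings. If they are needed later, one possible way to make them numeric is sketched below in plain Python. All helper names are hypothetical, and the `M` and `B` suffixes are assumptions, since only `K` appears in the rows shown. On the sample rows, `Change %` appears to agree with the day-over-day change in `Price`, so it could probably be recomputed from the column that is kept.

```python
def parse_volume(raw):
    """Convert strings such as "108.56K" to a float (hypothetical helper)."""
    # Only the "K" suffix appears in the rows shown; "M" and "B" are assumptions
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    suffix = raw[-1].upper()
    if suffix in multipliers:
        return float(raw[:-1].replace(",", "")) * multipliers[suffix]
    return float(raw.replace(",", ""))


def parse_percent(raw):
    """Convert strings such as "-1.80%" to a float in percent units."""
    return float(raw.strip().rstrip("%"))


def daily_change(close, previous_close):
    """Day-over-day percentage change in closing price, rounded to 2 decimals."""
    return round((close - previous_close) / previous_close * 100, 2)


# Values copied from the first data.show() output
print(parse_volume("108.56K"))
print(parse_percent("-1.80%"))
# Price on 05/21/2024 compared with Price on 05/20/2024
print(daily_change(70139.9, 71422.7))
```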

