# **Price Study Notebook**

## Objectives

* Answer business requirement 1:
  - The client is interested in identifying key variables that correlate with significant Bitcoin price changes.

## Inputs

* outputs/datasets/collection/BTCDaily.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/bitcoin-forecast/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/bitcoin-forecast'

# Load Data

In [4]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/BTCDaily.csv")
df.head()

Unnamed: 0,date,open,high,low,close,Volume BTC,Volume USD
0,2014-11-28,363.59,381.34,360.57,376.28,3220878.18,8617.15
1,2014-11-29,376.42,386.6,372.25,376.72,2746157.05,7245.19
2,2014-11-30,376.57,381.99,373.32,373.34,1145566.61,3046.33
3,2014-12-01,376.4,382.31,373.03,378.39,2520662.37,6660.56
4,2014-12-02,378.39,382.86,375.23,379.25,2593576.46,6832.53


---

# Data Exploration

We want to become more familiar with the dataset by examining variable types and distributions, identifying missing values, and understanding the business context of these variables.

In [5]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()


***Observations and Conclusions:***
The analysis of the dataset revealed that there are no missing values, indicating a complete record of daily prices. However, the date variable is currently stored in text format, which is not ideal for time series analysis.

- Next Steps:
1. Convert the 'date' variable to datetime format.

In [6]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2651 entries, 0 to 2650
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        2651 non-null   datetime64[ns]
 1   open        2651 non-null   float64       
 2   high        2651 non-null   float64       
 3   low         2651 non-null   float64       
 4   close       2651 non-null   float64       
 5   Volume BTC  2651 non-null   float64       
 6   Volume USD  2651 non-null   float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 145.1 KB


2. Set the 'date' as the index of the DataFrame.

In [7]:
df.set_index('date', inplace=True)
df.head()

date,open,high,low,close,Volume BTC,Volume USD
2014-11-28,363.59,381.34,360.57,376.28,3220878.18,8617.15
2014-11-29,376.42,386.6,372.25,376.72,2746157.05,7245.19
2014-11-30,376.57,381.99,373.32,373.34,1145566.61,3046.33
2014-12-01,376.4,382.31,373.03,378.39,2520662.37,6660.56
2014-12-02,378.39,382.86,375.23,379.25,2593576.46,6832.53
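Steps 1 and 2 can also be combined at load time. The following is a minimal sketch, assuming the CSV header matches the columns shown by `df.info()` above; the in-memory `sample_csv` is a stand-in for the real file and reuses the first three rows from the `df.head()` output, and `df_sample` is an illustrative name.

```python
import io
import pandas as pd

# Stand-in for BTCDaily.csv: the first three rows shown in df.head() above.
# The header layout is an assumption based on the df.info() output.
sample_csv = io.StringIO(
    "date,open,high,low,close,Volume BTC,Volume USD\n"
    "2014-11-28,363.59,381.34,360.57,376.28,3220878.18,8617.15\n"
    "2014-11-29,376.42,386.6,372.25,376.72,2746157.05,7245.19\n"
    "2014-11-30,376.57,381.99,373.32,373.34,1145566.61,3046.33\n"
)

# parse_dates converts the 'date' column to datetime values;
# index_col then uses that column as the DataFrame index.
df_sample = pd.read_csv(sample_csv, parse_dates=["date"], index_col="date")
print(df_sample.index.dtype)
print(df_sample.head())
```

With the real file, the same two arguments would be passed to the `pd.read_csv` call from the Load Data section.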


3. Verify the integrity of the index and check for missing days in the dataset.
The check below returns an empty DatetimeIndex, so there are no missing dates.

In [8]:
# Source: https://pandas.pydata.org/docs/reference/api/pandas.date_range.html
full_date_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq='D')
# Dates in `full_date_range` that are not in the DataFrame index are missing days.
missing_days = full_date_range[~full_date_range.isin(df.index)]
print(missing_days)

DatetimeIndex([], dtype='datetime64[ns]', freq='D')
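Missing days are only one aspect of index integrity. As a hedged sketch, the same idea can be extended to check ordering and duplicated dates as well. The `toy` DataFrame below is made-up data with one gap and one repeated date, not the Bitcoin dataset.

```python
import pandas as pd

# Made-up daily index with one gap (2021-01-03) and one repeated date (2021-01-02).
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-04"])
toy = pd.DataFrame({"close": [1.0, 2.0, 2.0, 3.0]}, index=idx)

# True when the dates never go backwards (repeated dates are still allowed).
is_sorted = toy.index.is_monotonic_increasing

# Dates that appear more than once in the index.
duplicated_dates = toy.index[toy.index.duplicated()]

# Calendar days between the first and last date that are absent from the index.
expected = pd.date_range(toy.index.min(), toy.index.max(), freq="D")
missing = expected.difference(toy.index)

print("sorted:", is_sorted)
print("duplicated:", list(duplicated_dates))
print("missing:", list(missing))
```

Applied to `df`, an index with no problems would be sorted and would give empty results for both the duplicated and the missing dates.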


# Correlation Study

Here we compute the pairwise correlation between all variables, using Pearson, the default method of .corr().

In [9]:
df.corr()

Unnamed: 0,open,high,low,close,Volume BTC,Volume USD
open,1.0,0.999485,0.998983,0.998798,-0.048598,0.609992
high,0.999485,1.0,0.998901,0.999474,-0.045371,0.617662
low,0.998983,0.998901,1.0,0.999319,-0.05604,0.593531
close,0.998798,0.999474,0.999319,1.0,-0.049489,0.608673
Volume BTC,-0.048598,-0.045371,-0.05604,-0.049489,1.0,-0.160242
Volume USD,0.609992,0.617662,0.593531,0.608673,-0.160242,1.0


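- There is a very strong positive correlation (above 0.99) among the daily price variables (open, high, low, close), which is consistent with the hypothesis that daily closing prices move together with daily high and low prices. Correlation alone does not show that one influences the other.
- The volume in USD has a moderate positive correlation with the price variables (about 0.6), while the volume in BTC has a very weak negative correlation (about -0.05).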
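Because the business requirement concerns price changes, and price levels that share a trend tend to correlate strongly with each other, it may help to repeat the study on daily percentage changes. The sketch below uses made-up random-walk data (the names `toy`, `levels_corr`, and `returns_corr` are illustrative, not part of the notebook) to show how the two views can be compared; it is not a result for the Bitcoin dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Made-up data: a random-walk 'close', a 'high' slightly above it,
# and a volume series that is independent of the price.
close = 100 * np.exp(rng.normal(0, 0.02, n).cumsum())
toy = pd.DataFrame({
    "close": close,
    "high": close * (1 + rng.uniform(0, 0.05, n)),
    "volume": rng.uniform(1000, 5000, n),
})

# Correlation of the raw levels with 'close'.
levels_corr = toy.corr()["close"]

# Correlation of day-over-day percentage changes with the change in 'close'.
returns = toy.pct_change().dropna()
returns_corr = returns.corr()["close"]

print(pd.DataFrame({"levels": levels_corr, "daily changes": returns_corr}))
```

On the real data, `df.pct_change().dropna().corr()['close']` would give the equivalent view for daily changes.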

We use .corr() with the Spearman and Pearson methods and investigate the strongest correlations with the closing price.

In [11]:
corr_spearman = df.corr(method='spearman')['close']
print("\nSpearman Correlation with Closing Price:")
print(corr_spearman)


Spearman Correlation with Closing Price:
open          0.999043
high          0.999490
low           0.999513
close         1.000000
Volume BTC   -0.611318
Volume USD    0.806241
Name: close, dtype: float64


In [None]:
corr_pearson = df.corr(method='pearson')['close']
print("Pearson Correlation with Closing Price:")
print(corr_pearson)
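The two methods can disagree: above, Volume BTC has a Pearson correlation of about -0.05 with the closing price but a Spearman correlation of about -0.61. Spearman works on ranks, so it picks up monotonic relationships that are not linear. As a minimal sketch, the hypothetical helper `rank_correlations` below places both methods side by side and orders the variables by absolute Spearman correlation; the `toy` data is made up for illustration.

```python
import pandas as pd

def rank_correlations(frame, target="close"):
    """Hypothetical helper: Pearson and Spearman correlation of each column with `target`."""
    table = pd.DataFrame({
        "pearson": frame.corr(method="pearson")[target],
        "spearman": frame.corr(method="spearman")[target],
    }).drop(index=target)  # the target always correlates perfectly with itself
    # Strongest relationships first, regardless of sign.
    order = table["spearman"].abs().sort_values(ascending=False).index
    return table.loc[order]

# Made-up data: 'cubic' is a monotonic but non-linear function of 'close',
# 'noisy' follows 'close' only loosely.
toy = pd.DataFrame({
    "close": [1.0, 2.0, 3.0, 4.0, 5.0],
    "cubic": [1.0, 8.0, 27.0, 64.0, 125.0],
    "noisy": [2.0, 1.0, 4.0, 3.0, 5.0],
})

ranked = rank_correlations(toy)
print(ranked)
```

In the toy data, `cubic` has a perfect Spearman correlation with `close` while its Pearson correlation is below 1, which is the same kind of gap seen for the volume variables. Calling `rank_correlations(df)` would produce the corresponding table for the dataset.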

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
