# Exploratory Data Analysis
- Examine the properties of the dataset that will serve both as the basis for the bot's trading policy and as the training set for a return-forecasting model

### 1. Import libraries and load the dataset

In [9]:
# Import necessary libraries
import pandas as pd

In [11]:
# Read the parquet file into a DataFrame
raw_data = pd.read_parquet('raw_data.parquet', engine='pyarrow')

### 2. Shape of the dataset

In [27]:
# Check the size of the dataset
print(f'Number of rows: {raw_data.shape[0]}')
print(f'Number of columns: {raw_data.shape[1]}')

# Check the first and last timestamp of the dataset
print(f'First timestamp: {raw_data.time.iloc[0]}')
print(f'Last timestamp: {raw_data.time.iloc[-1]}')

Number of rows: 1172528
Number of columns: 11
First timestamp: 2021-01-01 00:00:00
Last timestamp: 2023-03-27 00:00:00


#### Takeaway:
- The dataset consists of 1,172,528 rows, each representing a 1-minute candle of trading data for the BTC/BUSD pair, and 11 columns, whose meaning is explained in the next section
---
- The dataset is a time series spanning from 1 January 2021 to 27 March 2023, i.e. roughly 2 1/4 years of data
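
The row count can be cross-checked against the time span. Between 2021-01-01 00:00 and 2023-03-27 00:00 there are 815 days, which gives 1,173,601 one-minute timestamps (endpoints included). That is 1,073 more than the 1,172,528 rows reported above, so at least that many candles appear to be missing, assuming the first and last rows bound the range. The sketch below shows one possible way to locate such gaps; `find_missing_minutes` is a hypothetical helper name, and the demo runs on a small synthetic series rather than on `raw_data`.

```python
import pandas as pd


def find_missing_minutes(times: pd.Series) -> pd.DatetimeIndex:
    """Return the 1-minute timestamps absent between the first and last observation."""
    times = pd.to_datetime(times)
    # Every minute the series should contain if there were no gaps
    expected = pd.date_range(times.min(), times.max(), freq="min")
    return expected.difference(pd.DatetimeIndex(times))


# Synthetic demo: five consecutive minutes with the 00:02 candle removed
demo = pd.Series(pd.date_range("2021-01-01 00:00", periods=5, freq="min")).drop(2)
missing = find_missing_minutes(demo)
print(f"Missing candles: {len(missing)}")
print(missing)

# Number of candles a gap-free dataset would hold for the range printed above
expected_rows = len(pd.date_range("2021-01-01", "2023-03-27", freq="min"))
print(f"Expected rows: {expected_rows}, observed rows: 1172528")
```

On the real data the call would be `find_missing_minutes(raw_data["time"])`. How the gaps should be treated (forward-filling, dropping the affected windows, and so on) depends on the downstream model and is not decided here.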

### 3. Column names

- TIME: open time of the candle (first second of the minute), stored as datetime64[ns]
---
- OPEN: opening price of the candle
---
- HIGH: highest price within the candle
---
- LOW: lowest price within the candle
---
- CLOSE: closing price of the candle
---
- VOLUME: volume traded within the candle, in the base asset (here BTC)
---
- CLOSE_TIME: close time of the candle (last second of the minute), stored as datetime64[ns]
---
- QUOTE_ASSET_VOLUME: volume traded within the candle, expressed in the quote asset (here BUSD)
---
- NUMBER_OF_TRADES: number of trades that occurred within the candle
---
- TAKER_BUY_BASE_ASSET_VOLUME: volume of the base asset (here BTC) bought by takers, i.e. by orders that execute immediately against the order book; the value alone is not a sentiment signal, but the ratio of taker buy volume to taker sell volume is: a ratio above 1 suggests bullish sentiment (https://dataguide.cryptoquant.com/market-data-indicators/taker-buy-sell-volume-ratio)
---
- TAKER_BUY_QUOTE_ASSET_VOLUME: taker buy volume expressed in the quote asset (here BUSD)
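
The taker buy/sell volume ratio from the linked guide is not a column of the dataset, but it can be derived from two of them. The sketch below assumes, as is usual for exchange candle data, that every trade has exactly one taker, so taker sell volume equals `volume - taker_buy_base_asset_volume`. The helper name `taker_buy_sell_ratio` and the demo numbers are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd


def taker_buy_sell_ratio(df: pd.DataFrame) -> pd.Series:
    """Taker buy volume divided by taker sell volume, per candle (base asset)."""
    taker_buy = df["taker_buy_base_asset_volume"]
    # Assumption: the part of the volume not bought by takers was sold by takers
    taker_sell = df["volume"] - taker_buy
    # Candles without taker sells have no defined ratio: use NaN instead of inf
    return taker_buy / taker_sell.replace(0, np.nan)


# Synthetic candles (made-up numbers)
demo = pd.DataFrame({
    "volume": [10.0, 8.0, 4.0, 0.0],
    "taker_buy_base_asset_volume": [6.0, 2.0, 4.0, 0.0],
})
demo["taker_buy_sell_ratio"] = taker_buy_sell_ratio(demo)
print(demo)
```

A ratio above 1 means takers bought more than they sold in that candle. Candles with no taker sells (including candles with no trades at all) are left as NaN so that they do not distort later averages.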

### 4. Null values, duplicates, statistical properties

In [28]:
# Check for null values and number of rows per column
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1172528 entries, 0 to 1172527
Data columns (total 11 columns):
 #   Column                        Non-Null Count    Dtype         
---  ------                        --------------    -----         
 0   time                          1172528 non-null  datetime64[ns]
 1   open                          1172528 non-null  float64       
 2   high                          1172528 non-null  float64       
 3   low                           1172528 non-null  float64       
 4   close                         1172528 non-null  float64       
 5   volume                        1172528 non-null  float64       
 6   close_time                    1172528 non-null  datetime64[ns]
 7   quote_asset_volume            1172528 non-null  float64       
 8   number_of_trades              1172528 non-null  int64         
 9   taker_buy_base_asset_volume   1172528 non-null  float64       
 10  taker_buy_quote_asset_volume  1172528 non-null  float64       
dtypes: datetime64[ns](2), float64(8), int64(1)

#### Takeaway:
- No null values
---
- All columns have the same non-null count (1,172,528), which matches the total number of rows checked earlier

In [35]:
# Check for duplicate rows
duplicate_rows = raw_data[raw_data.duplicated()]

# Count duplicate rows, if any
print(f'Duplicate rows: {len(duplicate_rows)}')
# Display duplicate rows, if any
duplicate_rows.head()

Duplicate rows: 0


| | time | open | high | low | close | volume | close_time | quote_asset_volume | number_of_trades | taker_buy_base_asset_volume | taker_buy_quote_asset_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|


#### Takeaway:
- No duplicate rows
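
`DataFrame.duplicated()` without arguments only flags rows that are identical in every column. For a time series it can also be worth checking whether the same timestamp occurs more than once with different values. The following is a minimal sketch of such a check on synthetic data; `duplicated_timestamps` is a hypothetical helper name, and `time` is the column name shown in the `info()` output.

```python
import pandas as pd


def duplicated_timestamps(df: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Return all rows whose timestamp occurs more than once."""
    # keep=False marks every member of a duplicate group, not only the repeats
    mask = df.duplicated(subset=[time_col], keep=False)
    return df.loc[mask].sort_values(time_col)


# Synthetic demo: the 00:01 candle appears twice with different close prices
demo = pd.DataFrame({
    "time": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:01"]
    ),
    "close": [100.0, 101.0, 101.5],
})
print(f"Fully duplicated rows: {demo.duplicated().sum()}")
print(f"Rows sharing a timestamp: {len(duplicated_timestamps(demo))}")
```

On the real data the call would be `duplicated_timestamps(raw_data)`. An empty result would confirm that each candle appears exactly once.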

In [13]:
# Statistical properties of the dataset's variables
raw_data.describe()

| | open | high | low | close | volume | quote_asset_volume | number_of_trades | taker_buy_base_asset_volume | taker_buy_quote_asset_volume |
|---|---|---|---|---|---|---|---|---|---|
| count | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 | 1172528.0 |
| mean | 36197.11 | 36219.27 | 36175.11 | 36197.09 | 33.75019 | 857396.3 | 763.8513 | 16.78371 | 425651.9 |
| std | 13952.99 | 13961.31 | 13944.5 | 13952.98 | 60.97816 | 1289898.0 | 1000.379 | 30.84987 | 659196.4 |
| min | 15494.55 | 15528.14 | 15461.92 | 15498.45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 22313.99 | 22325.0 | 22300.28 | 22313.94 | 5.007305 | 212159.4 | 220.0 | 2.250323 | 95269.78 |
| 50% | 36712.43 | 36747.79 | 36678.79 | 36712.53 | 12.40295 | 454923.4 | 413.0 | 5.998395 | 218309.0 |
| 75% | 47138.97 | 47164.12 | 47112.93 | 47138.97 | 38.99865 | 1009983.0 | 943.0 | 19.27613 | 502505.7 |
| max | 68999.99 | 69020.0 | 68782.87 | 68999.99 | 2536.149 | 54768290.0 | 51377.0 | 1430.346 | 31207540.0 |


#### Takeaway:
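- The minimum of VOLUME and NUMBER_OF_TRADES is 0, so the dataset contains candles in which no trade took place
---
- According to the 'mean' row, the mean values of the OHLC prices are very similar: mean HIGH and mean LOW differ by only 44.16 BUSD (about 0.12% of the mean price), so the average price range within a 1-minute candle is small
---
- The standard deviations of the four OHLC prices are also nearly identical (about 13,950 BUSD); their size reflects how much the price level changed over the whole period, not volatility within a candle
---
- Volume, however, varies much more in relative terms than the OHLC prices: its 'std' is almost twice the mean, and the median (12.40) is far below the mean (33.75), which points to a right-skewed distribution with occasional very large candles
---
- The mean TAKER_BUY_BASE_ASSET_VOLUME (16.78 BTC) is about half of the mean VOLUME (33.75 BTC), so the aggregate taker buy/sell ratio is about 0.99; taker buying and selling were roughly balanced over the period, which does not by itself indicate a 'bullish' trend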
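
The relative measures behind these observations can be recomputed directly from the `describe()` output. The short calculation below uses values copied from the 'mean' and 'std' rows; the variable names are illustrative. Because the mean is linear, the difference of the mean HIGH and mean LOW equals the mean high-low range per candle. The taker ratio here is a ratio of means, which is not the same as the mean of per-candle ratios.

```python
# Values copied from the describe() output above
mean_high, mean_low, mean_close = 36219.27, 36175.11, 36197.09
mean_volume, std_volume = 33.75019, 60.97816
mean_taker_buy = 16.78371

# Average high-low range of a 1-minute candle, absolute and relative to the mean close
mean_range = mean_high - mean_low
relative_range = mean_range / mean_close

# Coefficient of variation of volume (std / mean)
volume_cv = std_volume / mean_volume

# Aggregate taker buy share and taker buy/sell ratio,
# assuming taker sell volume = volume - taker buy volume
taker_buy_share = mean_taker_buy / mean_volume
taker_ratio = mean_taker_buy / (mean_volume - mean_taker_buy)

print(f"Mean high-low range: {mean_range:.2f} BUSD ({relative_range:.2%} of mean close)")
print(f"Volume coefficient of variation: {volume_cv:.2f}")
print(f"Taker buy share of volume: {taker_buy_share:.2%}")
print(f"Aggregate taker buy/sell ratio: {taker_ratio:.2f}")
```

With these inputs the range comes out at about 44 BUSD (roughly 0.12% of the price), the coefficient of variation of volume at about 1.8, and the aggregate taker buy/sell ratio just under 1.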