# How to preprocess your files

## Setup

Best practice is to create a virtual environment for each new project.
I'm using `pyenv` for this.

In [1]:
# uncomment to install

# create a virtual environment
#!pyenv virtualenv 3.12.7 ctax

# activate a local virtual environment for this project
# make sure to be in the project root directory
#!pyenv local ctax

Next, we install the dependencies.

In [None]:
# uncomment to install

#!pip install pandas yfinance pyarrow

# for jupyter notebook support
#!pip install jupyter
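
To confirm that the packages are importable from the active environment, a quick check along these lines can help. This is only a sketch: `missing_modules` is a hypothetical helper and not part of `ctax`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# the packages installed in the cell above; an empty list means all were found
print(missing_modules(["pandas", "yfinance", "pyarrow"]))
```

If the list is not empty, the notebook kernel is probably not running inside the virtual environment created earlier.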

Alternatively, we can install the project itself in editable mode or install from `requirements.txt`.

In [3]:
# uncomment to install

#!pip install -e .
# or
#!pip install -r requirements.txt

---

## Preprocess
The main entry point is `preprocess_histories()`, which runs all preprocessing steps at once.

In [4]:
# automatically reload changed modules before each cell is executed
%load_ext autoreload
%autoreload 2

In [5]:
# imports
from ctax.preprocess.preprocessing import preprocess_histories

We need to pass:
- `csv_file`: Either a single filename as a string or a list of filenames
- `cex`: The exchange each file was exported from, one entry per file


> Directory paths can be changed in `paths.py`.
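
Because both arguments accept either a single string or a list, it can help to picture how they line up. The sketch below is an assumption about the pairing, not the actual implementation of `preprocess_histories()`; `pair_files_with_exchanges` is a hypothetical helper.

```python
def pair_files_with_exchanges(csv_file, cex):
    """Return (filename, exchange) pairs from string or list input."""
    # wrap single strings so both arguments can be handled as lists
    files = [csv_file] if isinstance(csv_file, str) else list(csv_file)
    exchanges = [cex] if isinstance(cex, str) else list(cex)

    # assumption: every file needs exactly one exchange
    if len(files) != len(exchanges):
        raise ValueError(
            f"got {len(files)} file(s) but {len(exchanges)} exchange(s)"
        )
    return list(zip(files, exchanges))


print(pair_files_with_exchanges("bitpanda_all.csv", "bitpanda"))
print(pair_files_with_exchanges(["bitpanda_all.csv", "kucoin_2021.csv"],
                                ["bitpanda", "kucoin"]))
```

Checking the lengths up front gives a clear error when an exchange is missing for one of the files.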

In [None]:
# using one file only
csv_file = "bitpanda_all.csv"  # change to your filename
cex = "bitpanda"

# preprocess the data
preprocess_histories(csv_file, cex)



Opening /home/knolli/code/Knolli14/pandas/data/csv/bitpanda_all.csv
Start Preprocessing bitpanda data...
...saved to /home/knolli/code/Knolli14/pandas/data/parquet/bitpanda_all.parquet
-> Finished Preprocess of bitpanda_all.csv
--------------------------------------------------
-> Done converting raw csv files to parquet!
--------------------------------------------------
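
The log above suggests that each CSV file is saved as a parquet file with the same name in a separate directory. A minimal sketch of that mapping with `pathlib` follows; `parquet_path_for` and the relative directory names are hypothetical, the real locations are defined in `paths.py`.

```python
from pathlib import Path


def parquet_path_for(csv_path, parquet_dir):
    """Return the path in `parquet_dir` that shares its stem with `csv_path`."""
    # keep the file name, swap the .csv suffix for .parquet
    return Path(parquet_dir) / Path(csv_path).with_suffix(".parquet").name


# hypothetical relative directories, used for illustration only
print(parquet_path_for("data/csv/bitpanda_all.csv", "data/parquet"))
```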


In [13]:
# using multiple files

csv_files = ["bitpanda_all.csv", "kucoin_2021.csv"]
cexs = ["bitpanda", "kucoin"]

preprocess_histories(csv_files, cexs)


Opening /home/knolli/code/Knolli14/pandas/data/csv/bitpanda_all.csv
Start Preprocessing bitpanda data...
...saved to /home/knolli/code/Knolli14/pandas/data/parquet/bitpanda_all.parquet
-> Finished Preprocess of bitpanda_all.csv

Opening /home/knolli/code/Knolli14/pandas/data/csv/kucoin_2021.csv
Start Preprocessing kucoin data...
...saved to /home/knolli/code/Knolli14/pandas/data/parquet/kucoin_2021.parquet
-> Finished Preprocess of kucoin_2021.csv
--------------------------------------------------
-> Done converting raw csv files to parquet!
--------------------------------------------------

Merging all histories into one DataFrame...
...saved to /home/knolli/code/Knolli14/pandas/data/parquet/full_history.parquet


> If more than one file is passed, the histories are automatically merged and saved as `full_history.parquet`.
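
Conceptually, the merge step can be pictured as concatenating the per-exchange DataFrames. The sketch below uses plain pandas and is not the `ctax` implementation: `merge_histories` is a hypothetical helper, sorting by `timestamp` is an assumption, and the kucoin row is made up for illustration.

```python
import pandas as pd


def merge_histories(frames):
    """Concatenate per-exchange histories and sort them chronologically."""
    merged = pd.concat(frames, ignore_index=True)
    # assumption: the merged history is ordered by timestamp
    return merged.sort_values("timestamp").reset_index(drop=True)


bitpanda = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-11 15:14:44"]),
    "tx_type": ["deposit"],
    "asset": ["EUR"],
})
# made-up row, for illustration only
kucoin = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-05 09:00:00"]),
    "tx_type": ["buy"],
    "asset": ["USDT"],
})

full_history = merge_histories([bitpanda, kucoin])
print(full_history)

# writing the result requires pyarrow (installed above):
# full_history.to_parquet("full_history.parquet")
```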

---

### Open history

To open this file, use the `open_parquet()` function.

In [15]:
from ctax.data import open_parquet

history = open_parquet("full_history.parquet")
history.head()

|   | timestamp | tx_type | amount_fiat | fiat | amount_asset | asset |
|---|---|---|---|---|---|---|
| 0 | 2021-03-11 15:14:44 | deposit | 50.0 | EUR | 0.0 | EUR |
| 1 | 2021-03-11 15:34:16 | buy | 16.95 | EUR | 19.957612 | USDT |
| 2 | 2021-03-11 15:43:13 | buy | 23.66 | EUR | 30.0 | BEST |
| 3 | 2021-03-11 18:43:08 | buy | 6.02 | EUR | 5000.0 | BTT |
| 4 | 2021-03-11 18:50:33 | sell | 5.02 | EUR | 6.464104 | BEST |
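
Since the history is a regular pandas DataFrame, the usual filtering and grouping tools apply. As a small illustration, the sketch below rebuilds the five rows shown above by hand (so it runs without the parquet file) and sums the fiat amount spent per asset; the variable names are chosen for this example only.

```python
import pandas as pd

# the five rows from the output above, rebuilt by hand
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-03-11 15:14:44",
        "2021-03-11 15:34:16",
        "2021-03-11 15:43:13",
        "2021-03-11 18:43:08",
        "2021-03-11 18:50:33",
    ]),
    "tx_type": ["deposit", "buy", "buy", "buy", "sell"],
    "amount_fiat": [50.0, 16.95, 23.66, 6.02, 5.02],
    "fiat": ["EUR"] * 5,
    "amount_asset": [0.0, 19.957612, 30.0, 5000.0, 6.464104],
    "asset": ["EUR", "USDT", "BEST", "BTT", "BEST"],
})

# keep only purchases, then total the fiat amount per asset
buys = sample[sample["tx_type"] == "buy"]
spent_per_asset = buys.groupby("asset")["amount_fiat"].sum()
print(spent_per_asset)
```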


#### to be continued...
