## Data Cleaning

The .csv file was obtained from [EthereumETL](https://github.com/blockchain-etl/ethereum-etl), but it contains many columns we are not interested in. We therefore drop them from the final .csv files to reduce memory and storage usage.

The columns kept are the following:

| **Column**      | **Description**                                               |
| --------------- | ------------------------------------------------------------- |
| hash            | Identifier hash of the transaction                            |
| block_number    | Decimal number of the block that contains the transaction     |
| block_timestamp | UNIX timestamp (UTC) of when the block was mined              |
| from_address    | Identifier hash of the account that sent the transaction      |
| to_address      | Identifier hash of the account that received the transaction  |
| value           | Amount transferred in the transaction, in wei                 |

These columns are sufficient for analyses such as:

* Transaction volume as a function of time;
* Graph metrics, by building a graph with accounts as nodes and transactions as edges;
* Weighted graph metrics, using the _value_ column as the edge weight.
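
The analyses listed above can be sketched with the standard library alone. The snippet below is a minimal illustration, not the analysis code of this project: the sample rows are made up, and the helper names `daily_volume` and `weighted_edges` are hypothetical. It assumes that `block_timestamp` holds UNIX seconds and that `value` can be parsed as an integer.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Made-up sample rows using the six columns kept above (illustrative only).
SAMPLE = """hash,block_number,block_timestamp,from_address,to_address,value
0xa1,100,1625097600,0xaaa,0xbbb,5
0xa2,100,1625097600,0xaaa,0xccc,7
0xa3,101,1625184000,0xbbb,0xccc,1
0xa4,102,1625184013,0xaaa,0xbbb,2
"""


def daily_volume(rows):
    """Count transactions per UTC day using the block_timestamp column."""
    counts = Counter()
    for row in rows:
        moment = datetime.fromtimestamp(int(row["block_timestamp"]), tz=timezone.utc)
        counts[moment.date().isoformat()] += 1
    return dict(counts)


def weighted_edges(rows):
    """Aggregate transactions into directed edges weighted by total value."""
    edges = defaultdict(int)
    for row in rows:
        edges[(row["from_address"], row["to_address"])] += int(row["value"])
    return dict(edges)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(daily_volume(rows))
print(weighted_edges(rows))
```

Summing `value` per (sender, receiver) pair is only one possible weighting; counting transactions per pair would be an equally valid choice, depending on the metric of interest.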

---

### Filenames

The raw files are named `MMyy_id.csv`, where `MM` is the month name in lowercase, `yy` is the last two digits of the year, and `id` is a unique identifier for the file (for example, `july21_1.csv`). The lightweight files use the same format with the prefix `light-`.
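
As a small illustration of this naming scheme, the helpers below build and parse file names. They are hypothetical and not part of the notebook; the regular expression assumes that the identifier contains only letters, digits, or underscores.

```python
import re

# Hypothetical pattern for the naming scheme: optional 'light-' prefix,
# lowercase month name, two-digit year, underscore, identifier, '.csv'.
FILENAME_PATTERN = re.compile(r"^(light-)?([a-z]+)(\d{2})_(\w+)\.csv$")


def raw_filename(month_name, year_suffix, dataset_id):
    """Build the raw file name, e.g. 'july21_1.csv'."""
    return f"{month_name.lower()}{year_suffix}_{dataset_id}.csv"


def light_filename(month_name, year_suffix, dataset_id):
    """Build the lightweight file name, e.g. 'light-july21_1.csv'."""
    return "light-" + raw_filename(month_name, year_suffix, dataset_id)


def parse_filename(filename):
    """Split a file name into its parts, or raise ValueError if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    light, month, year, dataset_id = match.groups()
    return {"light": light is not None, "month": month, "year": year, "id": dataset_id}


print(raw_filename("july", "21", "1"))
print(light_filename("july", "21", "1"))
print(parse_filename("light-july21_1.csv"))
```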

### Imports

In [None]:
import dask.dataframe as dd

### Code

In [None]:
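def create_light_df(month_name, year_name, dataset_id, relative_path):
    # Raw file name follows the MMyy_id.csv convention, e.g. 'july21_1.csv'
    dataset_name = month_name + year_name + '_' + dataset_id + '.csv'
    # relative_path is expected to end with a path separator, e.g. './'
    df = dd.read_csv(relative_path + dataset_name)
    # Keep only the columns of interest
    df = df[['hash', 'block_number', 'block_timestamp', 'from_address', 'to_address', 'value']]
    # Write the lightweight copy as a single file next to the raw one
    df.to_csv(relative_path + 'light-' + dataset_name, single_file=True)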

In [None]:
create_light_df('july', '21', '1', './')
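
After the lightweight file is written, it may be worth confirming that its header contains the expected columns. The check below is a sketch using only the standard library; `missing_columns` is a hypothetical helper, and the demo writes a small made-up file to a temporary directory rather than reading real data. Because the written file may also carry an extra index column, the check only reports expected columns that are absent.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = [
    "hash", "block_number", "block_timestamp",
    "from_address", "to_address", "value",
]


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


# Demo on a made-up file; for real data, pass e.g. './light-july21_1.csv'.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "light-demo.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["", *EXPECTED_COLUMNS])  # leading unnamed index column
        writer.writerow([0, "0xa1", 100, 1625097600, "0xaaa", "0xbbb", 5])
    demo_missing = missing_columns(demo_path)

print(demo_missing)
```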