## BER energy data

The data used in this notebook is taken from the public BER dataset.
The aim is to build a structure that can be reused to analyse future datasets that contain fewer of these features. The long-term goal is a robust target variable ("BER") built from the fewest available features. The challenge in aligning this strategy with the data a company will actually hold is understanding how each feature is collected. An open question is whether features recorded before or after the BER assessment give better information on what drives the BER.

Data Source: public search data [website](https://ndber.seai.ie/BERResearchTool/ber/search.aspx). 

- Build a baseline ML classification model (e.g., a boosted tree) to help identify important features
- Perform unsupervised learning to review clusters of variables that have similar characteristics
- Review which variables could be transformed and/or combined to improve model accuracy
- Identify features similar to those shown here that could be collected from external datasets

In [None]:
# Training examples using Jupyter Notebook
# Aim is to understand example code that can be moved to GitHub for future use

# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import sys
import installPack # took the original code that was being used to install a new package and wrapped it in a py script
import polars as pl
import plotly.express as px
import os

In [None]:
os.getcwd()

In [None]:
# List of libraries to import
requirements = ["pyarrow"]
for requirement in requirements:
    installPack.installPackage(requirement)
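
The `installPack` script itself is not shown in this notebook. As a rough sketch only, a helper of this kind might call pip through the running interpreter. The name `install_package` and the skip-if-already-importable behaviour are assumptions for illustration, not the original script.

```python
import importlib.util
import subprocess
import sys


def install_package(name: str) -> bool:
    """Install `name` with pip unless it can already be imported.

    Hypothetical stand-in for installPack.installPackage.
    Returns True only when pip was actually invoked.
    """
    if importlib.util.find_spec(name) is not None:
        return False  # already importable, nothing to do
    # Use this interpreter's pip so the package lands in the kernel's environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", name])
    return True
```

This check compares import names, so it only works when the import name and the pip name match, as they do for `pyarrow`.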

In [None]:
import pyarrow as pa
import tidypolars as tp
from tidypolars import col, desc

In [None]:
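# Text file to scan (tab-separated)
txt_file = "BERPublicsearch.txt"

# Preview the first 100 rows without loading the whole file.
# `fetch` is deprecated in recent Polars releases; `head(n).collect()` is the suggested replacement.
# Note: despite the name, `pl_lazy` holds an eager DataFrame once collected.
pl_lazy = pl.scan_csv(txt_file, separator="\t").head(100).collect()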

In [None]:
pl_lazy

In [None]:
# Import the entire text file
pl_df = pl.read_csv(txt_file, separator="\t", ignore_errors=True)

In [None]:
# Shape of the file
pl_df.shape

In [None]:
import polars.selectors as cs

# pl_df.select(cs.numeric()).head()
pl_df.select(cs.all()).head()

In [None]:
pl_df.columns

In [None]:
from collections import Counter
dtypes = pl_df.dtypes
Counter(dtypes).keys()  # distinct dtypes present in the frame

In [None]:
Counter(dtypes).values() # counts the elements' frequency

In [None]:
pl_df.select(cs.all()).head()

In [None]:
print(pl_df.select(cs.float()).estimated_size("mb"))

In [None]:
try:
    out = pl_df.select(cs.float().cast(pl.Float32))
    print(out)
    print(out.select(cs.float()).estimated_size("mb"))
except Exception as e:
    print(e)
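
Casting to `Float32` halves the storage per value, but it also reduces precision to roughly seven significant decimal digits. The sketch below uses only the standard library to show the effect on single values; `to_float32` is a helper name introduced here for illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a 64-bit Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


exact = 16_777_217.0  # 2**24 + 1: representable at 64 bits, not at 32 bits
print(exact, "->", to_float32(exact))
print(0.1, "->", to_float32(0.1))
print(0.5, "->", to_float32(0.5))  # powers of two survive unchanged
```

Whether this loss matters depends on the column: it is worth checking any feature that carries large magnitudes or many decimal places before downcasting.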

In [None]:
print(pl_df.select(cs.integer()).estimated_size("mb"))

In [None]:
try:
    out = pl_df.select(cs.integer().cast(pl.Int32))
    print(out)
    print(out.select(cs.integer()).estimated_size("mb"))
except Exception as e:
    print(e)

In [None]:
# pl_df.estimated_size("gb")
pl_df.estimated_size("mb")

In [None]:
# Downcast the floats and integers
pl_df_fl = (
    pl_df
    .select(cs.float().cast(pl.Float32))
)

In [None]:
pl_df_int = (
    pl_df
    .select(cs.integer().cast(pl.Int32))
)

In [None]:
pl_df_fl.estimated_size("mb")

In [None]:
pl_df_int.estimated_size("mb")

In [None]:
pl_df_out = pl_df.select(cs.all() - cs.float() - cs.integer())

In [None]:
pl_df_out.estimated_size("mb")

In [None]:
pl_df_out.head()

In [None]:
pl_df_final = pl.concat([pl_df_out, pl_df_fl, pl_df_int], how="horizontal")
pl_df_final.estimated_size("mb")

In [None]:
pl_df_final.head()

In [None]:
# Restore the original column order after the horizontal concat
pl_df_final1 = pl_df_final.select(pl_df.columns)
pl_df_final1.head()

The Polars import of the text file returned 63,093 more rows than importing the same data after converting the file to CSV format. Excel could not process the extra data correctly.
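
One way to investigate a row-count difference like this is to profile the raw file directly. The sketch below, using only the standard library, tallies how many tab-separated fields each physical line contains, so malformed rows show up as unexpected field counts. The helper name `field_count_profile`, the UTF-8 encoding and the sample data are assumptions for illustration.

```python
import csv
import os
import tempfile
from collections import Counter


def field_count_profile(path: str, delimiter: str = "\t") -> Counter:
    """Tally how many rows have each number of fields."""
    with open(path, newline="", encoding="utf-8", errors="replace") as fh:
        # QUOTE_NONE treats quote characters as plain text, so one line is one row
        reader = csv.reader(fh, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return Counter(len(row) for row in reader)


# Hypothetical sample: header plus three rows, one of them short a field
sample = "a\tb\tc\n1\t2\t3\n4\t5\n6\t7\t8\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)
    sample_path = fh.name

profile = field_count_profile(sample_path)
os.remove(sample_path)
print(profile)
```

Running the same function on `BERPublicsearch.txt` and on the converted CSV (with `delimiter=","`) would give two totals and two field-count profiles to compare.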

In [None]:
type(pl_df_final1)

In [None]:
pl_df_final1.describe()

In [None]:
pl_df_final1.write_parquet("ber_publicsearch.parquet", use_pyarrow=True)

### Dask - review of the data

In [None]:
import dask.dataframe as dd

In [None]:
df = dd.read_parquet("ber_publicsearch.parquet")

In [None]:
df = df.repartition(partition_size="500MB")

In [None]:
df.npartitions

In [None]:
df.head()

In [None]:
df.memory_usage(deep=True)

In [None]:
def with_snappy(n):
    return f"part-{n}.snappy.parquet"

df.to_parquet(
    "data/",
    engine="pyarrow",
    write_metadata_file=False,
    compression="snappy",
    name_function=with_snappy,
)
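
The Dask documentation asks that `name_function` preserve the lexicographic order of partitions. With unpadded numbers, `part-10` sorts before `part-2` once there are ten or more partitions. A possible variant with zero padding is sketched below; the name `with_snappy_padded` and the default width of 4 are assumptions.

```python
def with_snappy_padded(n: int, width: int = 4) -> str:
    """Return a partition file name whose text order matches its numeric order."""
    return f"part-{n:0{width}d}.snappy.parquet"


unpadded = [f"part-{n}.snappy.parquet" for n in (2, 10)]
padded = [with_snappy_padded(n) for n in (2, 10)]

print(sorted(unpadded))  # text order differs from numeric order
print(sorted(padded))    # text order matches numeric order
```

It could be passed to `df.to_parquet(..., name_function=with_snappy_padded)` in place of `with_snappy`.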