# Importing the Dataset for the Report Generation Agent

This notebook implements the **data import** for the **Report Generation Agent**, using a single-table relational
data source.

The data source implemented here is an [SQLite](https://sqlite.org/) database, which Python supports
natively and which stores the data on disk.
[SQLAlchemy](https://www.sqlalchemy.org/) is used as the SQL connection layer so that this
connection can easily be swapped for another database.

The SQLAlchemy tool is set up to allow **read-only queries** only, so the agent **cannot run queries** that modify the data in the database.
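
As a point of reference, read-only access to an SQLite file can be enforced at the connection level. The sketch below uses only Python's standard `sqlite3` module and the `mode=ro` URI parameter. It is not necessarily how this project's SQLAlchemy tool is configured, and the helper name `open_read_only` and the demo table `t` are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file in read-only mode via a file URI (hypothetical helper)."""
    return sqlite3.connect(f"file:{db_path.as_posix()}?mode=ro", uri=True)


with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.db"

    # Create a throwaway database with a normal read-write connection.
    rw = sqlite3.connect(demo_db)
    rw.execute("CREATE TABLE t (x INTEGER)")
    rw.execute("INSERT INTO t VALUES (1)")
    rw.commit()
    rw.close()

    # Reads succeed on the read-only connection; writes are rejected by SQLite.
    ro = open_read_only(demo_db)
    rows = ro.execute("SELECT x FROM t").fetchall()
    try:
        ro.execute("INSERT INTO t VALUES (2)")
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    finally:
        ro.close()

print(rows, write_blocked)
```

Enforcing the restriction in the connection itself means a write is rejected by the database engine rather than by inspecting the query text.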

## Setting up

The code below sets the notebook's working directory to the project's root folder, defines the default constants, and checks that the required environment variables are set.

The environment variables can be set in the `.env` file in the root folder of the project.

In [None]:
import os
import ssl
import urllib.request
import zipfile
from pathlib import Path

import certifi
import pandas as pd
from aieng.agent_evals.async_client_manager import AsyncClientManager


# Setting the notebook directory to the project's root folder
if Path("").absolute().name == "eval-agents":
    print(f"Notebook path is already the root path: {Path('').absolute()}")
else:
    os.chdir(Path("").absolute().parent.parent)
    print(f"The notebook path has been set to: {Path('').absolute()}")

client_manager = AsyncClientManager.get_instance()
assert client_manager.configs.report_generation_db.database, (
    "[ERROR] The database path is not set! Please configure the REPORT_GENERATION_DB__DATABASE environment variable."
)

print("All environment variables have been set.")

DATA_FOLDER = Path("implementations/report_generation/data")
DATASET_PATH = DATA_FOLDER / "OnlineRetail.csv"

from implementations.report_generation.data.import_online_retail_data import import_online_retail_data  # noqa: E402

## Dataset

The dataset used in this example is the
**[Online Retail](https://archive.ics.uci.edu/dataset/352/online+retail) dataset**. It contains
information about **invoices** for products purchased by customers, including the
product quantity, the invoice date, and the customer's country of residence. For a more
detailed description of the data structure, see the [OnlineRetail.ddl](http://localhost:8888/lab/tree/implementations/report_generation/data/OnlineRetail.ddl) file.

## Downloading the Dataset

The code below will **download and unzip** the dataset to the `implementations/report_generation/data/` folder.
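
If the notebook is re-run often, the download step could be made idempotent and streamed to disk instead of being read fully into memory. The following is a minimal sketch under those assumptions. The helper name `download_if_missing` and the `.part` temporary suffix are hypothetical, and the demo uses a local `file://` URL so that it runs offline.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path, context=None) -> bool:
    """Stream `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if it was skipped.
    `context` is an optional ssl.SSLContext, used only for HTTPS URLs.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = dest.with_suffix(dest.suffix + ".part")
    with urllib.request.urlopen(url, context=context) as resp, open(partial, "wb") as f:
        shutil.copyfileobj(resp, f)  # copies in chunks
    partial.replace(dest)
    return True


# Offline demo: a local file stands in for the remote archive.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"example bytes")
    target = Path(tmp) / "downloads" / "copy.bin"

    first_call = download_if_missing(source.as_uri(), target)
    second_call = download_if_missing(source.as_uri(), target)
    copied = target.read_bytes()

print(first_call, second_call)
```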

In [None]:
url = "https://archive.ics.uci.edu/static/public/352/online+retail.zip"
zip_file_path = DATA_FOLDER / "online_retail.zip"
xlsx_file_path = DATA_FOLDER / "Online Retail.xlsx"

print("Downloading the dataset...")
ctx = ssl.create_default_context(cafile=certifi.where())
req = urllib.request.Request(url)
with urllib.request.urlopen(req, context=ctx) as resp, open(zip_file_path, "wb") as f:
    f.write(resp.read())

print("Extracting the dataset file...")
with zipfile.ZipFile(zip_file_path, "r") as zf:
    zf.extractall(DATA_FOLDER)

print("Converting the dataset file from .xlsx to .csv...")
df = pd.read_excel(xlsx_file_path)
df.to_csv(DATASET_PATH, index=False)

print("Done!")

## Visualizing the Data

In [None]:
df = pd.read_csv(DATASET_PATH)
df  # noqa: B018

## Importing the Data

The code below will import the `.csv` dataset into the database at the path set by the `REPORT_GENERATION_DB__DATABASE` environment variable.
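
The internals of `import_online_retail_data` are not shown in this notebook. To illustrate the general idea of loading a CSV file into a single SQLite table, here is a rough sketch using only the standard library. The helper `import_csv`, the table name `online_retail_demo`, and the two sample rows are made up for illustration. The column names follow the public description of the Online Retail dataset, and the actual schema is the one defined in `OnlineRetail.ddl`.

```python
import csv
import io
import sqlite3

# Column names as published for the Online Retail dataset.
COLUMNS = [
    "InvoiceNo", "StockCode", "Description", "Quantity",
    "InvoiceDate", "UnitPrice", "CustomerID", "Country",
]


def import_csv(conn: sqlite3.Connection, csv_file, table: str = "online_retail_demo") -> int:
    """Load CSV rows into one table and return the number of rows inserted."""
    reader = csv.DictReader(csv_file)
    cols = ", ".join(f'"{c}"' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    # Untyped columns keep the sketch short; values are stored as text here.
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    rows = [tuple(row[c] for c in COLUMNS) for row in reader]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f'INSERT INTO "{table}" ({cols}) VALUES ({placeholders})', rows
        )
    return len(rows)


# Made-up sample rows, not taken from the real dataset.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "A0001,X1,Example item,2,2020-01-01 10:00:00,1.50,1,Exampleland\n"
    "A0002,X2,Another item,1,2020-01-02 11:00:00,3.00,2,Exampleland\n"
)

conn = sqlite3.connect(":memory:")
imported = import_csv(conn, sample_csv)
row_count = conn.execute('SELECT COUNT(*) FROM "online_retail_demo"').fetchone()[0]
conn.close()

print(imported, row_count)
```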

In [None]:
import_online_retail_data(DATASET_PATH)
print("Done!")

## Conclusion

The data should now be ready for the agent to use in the **next notebook**.