In [21]:
import pandas as pd

In [22]:
pd.read_csv('toy_dataset_1_eng_twitter.csv')

Unnamed: 0,Text,Label1,Label2
0,"I can't stand those people, they don't belong ...",yes,yes
1,We should all work together to build a better ...,no,no
2,"Go back to where you came from, you're not wan...",yes,yes
3,"Everyone deserves equal rights and respect, no...",no,no
4,They're all lazy and don't contribute anything...,yes,yes
5,"I love how diverse our community is, it's so e...",no,no
6,People like them are ruining this country.,yes,yes
7,We need to stop spreading hate and focus on ki...,no,no
8,Why do they always cause problems wherever the...,yes,yes
9,Helping others is the best way to improve the ...,no,no


## Install the Package

```bash
pip install xxx
```

## Generate Config File

The configuration file for the database can either be supplied or created with our `Config` tool. If you supply your own, it is crucial for the later stages that it conforms to the following manual:

Manual for the config file:
- `dataset_file_name`: The **full** file name including the extension, e.g. data.csv, data.tsv. If the dataset is split into several files, separate the names with a comma (`,`).
- `dataset_name`: The short name for the dataset.
- `label_name_definition`: The label names and their corresponding definitions in `JSON` format.
- `source`: For data from a single source, state the source name prefixed with an `@` symbol, e.g. *@Twitter*, *@Facebook*. For multi-source data, provide the name of the column that holds the source.
- `language`: For a single language, ISO 639-2 language codes are recognized, e.g. eng, spa, chi, ger, fre, ita. Prefix the code with an `@` symbol, e.g. *@eng*. For multi-language content, provide the name of the column describing this property, e.g. languages.
- `text`: The name of the column containing the text.

**Note**:
1. Only datasets in CSV (comma-separated) or TSV format are supported.
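
The `@` convention and the comma-separated file names can be illustrated with a short sketch. The helper `resolve_field` and the sample row are hypothetical and only mirror the rules in the manual above; how the tool parses these fields internally is not shown here.

```python
def resolve_field(value, row):
    """Interpret a `source`/`language` config value for one data row.

    Hypothetical helper: a value starting with '@' is a literal that applies
    to the whole dataset; anything else is treated as a column name.
    """
    if value.startswith("@"):
        return value[1:]
    return row[value]


# A sample row from an imagined multi-language dataset
row = {"Text": "Hallo Welt", "languages": "ger"}

single = resolve_field("@eng", row)       # literal language code
multi = resolve_field("languages", row)   # looked up per row

# A dataset split into several files lists them separated by commas
files = [name.strip() for name in "train.csv, test.csv".split(",")]
```

Here `single` is the literal code after the `@`, while `multi` is read from the `languages` column of the row.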

In [23]:
from tool.config.generator import Config

To avoid problems in later stages, we advise creating the config file with our `Config` tool, which is modular, easy to use, and produces files that the following stages can rely on.

The config file generator (`Config`) takes two parameters at initialization: the `name` of the config file and a `mode`, which is either `create` or `append`, for creating and appending dataset entries respectively. Switching modes requires an explicit call to `switch_mode()`. In the example, two toy datasets are added to the config:

In [24]:
config = Config("toy_config", mode = 'create')

In [25]:
config.add_entry(
    dataset_file_name='toy_dataset_1_eng_twitter.csv',
    dataset_name= 'toy_data_1',
    label_name_definition={'Label1':'Definition Label 1',
                           'Label2': 'Definition Label 2'}, # two label columns
    text='Text',
    source='@Twitter',
    language='@eng'
)

In [26]:
config.switch_mode('append')

In [27]:
config.add_entry(
    dataset_file_name='toy_dataset_2_ger_reddit.csv',
    dataset_name='toy_data_2',
    label_name_definition={'Label': 'Definition Label'},  # one label column
    text='Text',
    source='@Reddit',
    language='@ger'
)

## Validate the Datasets
`ConfigValidator` compares the config file against the provided data folder to check that everything listed in the config matches the datasets on disk. If a check fails, a detailed error message helps locate and correct the problem.

In [28]:
from tool.loader.validator import ConfigValidator

In [29]:
config_validator = ConfigValidator('toy_config','../toy_data')
config_validator.final_config()

Filename integrity check complete!


Unnamed: 0,dataset_file_name,dataset_name,label_name_definition,source,language,text
0,toy_dataset_1_eng_twitter.csv,toy_data_1,"{'Label1': 'Definition Label 1', 'Label2': 'De...",@Twitter,@eng,Text
1,toy_dataset_2_ger_reddit.csv,toy_data_2,{'Label': 'Definition Label'},@Reddit,@ger,Text
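
To make the idea of validation concrete, the following is a minimal sketch of the kind of checks such a step might perform: that each listed file exists and that the referenced columns appear in its header. The function `check_entry` is hypothetical and is not the `ConfigValidator` implementation; the actual checks and messages of the tool may differ.

```python
import csv
import tempfile
from pathlib import Path


def check_entry(entry, data_dir):
    """Return a list of problems for one config entry (empty list means OK)."""
    problems = []
    for name in (n.strip() for n in entry["dataset_file_name"].split(",")):
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
            continue
        # Read only the header row; TSV files use a tab delimiter
        delimiter = "\t" if path.suffix == ".tsv" else ","
        with path.open(newline="", encoding="utf-8") as fh:
            header = next(csv.reader(fh, delimiter=delimiter), [])
        # Text and label columns are always required
        required = [entry["text"], *entry["label_name_definition"]]
        # source/language are column names unless prefixed with '@'
        for field in ("source", "language"):
            if not entry[field].startswith("@"):
                required.append(entry[field])
        for column in required:
            if column not in header:
                problems.append(f"{name}: missing column {column!r}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    # One file exists but lacks 'Label2'; the other file is absent
    (Path(tmp) / "toy.csv").write_text("Text,Label1\nhello,yes\n", encoding="utf-8")
    entry = {
        "dataset_file_name": "toy.csv,absent.csv",
        "label_name_definition": {"Label1": "Definition 1", "Label2": "Definition 2"},
        "text": "Text",
        "source": "@Twitter",
        "language": "@eng",
    }
    problems = check_entry(entry, tmp)
```

In this sketch `problems` collects one message per mismatch, so a user can fix the config or the data before loading.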


## Load Data into Database
`DataLoader` takes two inputs: `conn`, a connection to the newly created, empty database, and `validator`, a `ConfigValidator` instance. Its `storage_datasets()` method processes the datasets and commits the changes to the database.

In [46]:
import sqlite3
from tool.database.setup import setupSchema
from tool.loader.loader import DataLoader

path = 'toy_database.db'
conn = sqlite3.connect(path)
setupSchema(conn)

In [47]:
loader = DataLoader(conn=conn, validator=config_validator)

In [48]:
loader.storage_datasets()

Filename integrity check complete!
Row 1: Error occurred: UNIQUE constraint failed: dataset.dataset_name. Transaction rolled back.
Row 2: Error occurred: UNIQUE constraint failed: dataset.dataset_name. Transaction rolled back.
Data insertion process complete.
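
The `UNIQUE constraint failed` messages in the output above are what SQLite reports when a value is inserted twice into a column declared unique. One plausible reading is that the loader was run against a database file that already contained both toy datasets, so each insert was rolled back; starting from a fresh database file should avoid this. The sketch below reproduces the error with plain `sqlite3`. The one-column table is a stand-in assumption, since the schema created by `setupSchema` is not shown here.

```python
import sqlite3

# In-memory database with a stand-in table (the real schema is an assumption)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE dataset (dataset_name TEXT UNIQUE)")
conn_demo.execute("INSERT INTO dataset VALUES ('toy_data_1')")
conn_demo.commit()

try:
    # Inserting the same dataset name again violates the UNIQUE constraint
    conn_demo.execute("INSERT INTO dataset VALUES ('toy_data_1')")
    conn_demo.commit()
    error_message = None
except sqlite3.IntegrityError as exc:
    conn_demo.rollback()  # discard the failed transaction
    error_message = str(exc)

# The original row is still the only one in the table
row_count = conn_demo.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
conn_demo.close()
```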


## Check the Data
`QueryInterface` provides several ways to inspect the database. The `get_dataset_text_labels()` function extracts the text-label pairs. More flexible query options are also provided in the repository.

In [65]:
from tool.database.queryInterface import QueryInterface

In [66]:
queryer = QueryInterface(conn)

In [68]:
queryer.get_dataset_text_labels(10, file_path='result.csv')

(['dataset_name', 'text', 'label_name', 'label_value'],
 [('toy_data_1',
   "I can't stand those people, they don't belong here!",
   'Label1',
   'yes'),
  ('toy_data_1',
   'We should all work together to build a better future.',
   'Label1',
   'no'),
  ('toy_data_1',
   "Go back to where you came from, you're not wanted here!",
   'Label1',
   'yes'),
  ('toy_data_1',
   'Everyone deserves equal rights and respect, no matter what.',
   'Label1',
   'no'),
  ('toy_data_1',
   "They're all lazy and don't contribute anything to society.",
   'Label1',
   'yes'),
  ('toy_data_1',
   "I love how diverse our community is, it's so enriching.",
   'Label1',
   'no'),
  ('toy_data_1',
   'People like them are ruining this country.',
   'Label1',
   'yes'),
  ('toy_data_1',
   'We need to stop spreading hate and focus on kindness instead.',
   'Label1',
   'no'),
  ('toy_data_1',
   'Why do they always cause problems wherever they go?',
   'Label1',
   'yes'),
  ('toy_data_1',
   'Helping ot
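
As the output above shows, the call returns a tuple of column names and row tuples. A minimal sketch of turning that shape into one dictionary per row, using two rows copied from the output; the variable names are illustrative only.

```python
# Shape returned above: (column names, list of row tuples)
columns = ['dataset_name', 'text', 'label_name', 'label_value']
rows = [
    ('toy_data_1', "I can't stand those people, they don't belong here!", 'Label1', 'yes'),
    ('toy_data_1', 'We should all work together to build a better future.', 'Label1', 'no'),
]

# Pair each value with its column name to get one dict per row
records = [dict(zip(columns, row)) for row in rows]

# Example: keep only the texts labelled 'yes'
yes_texts = [r['text'] for r in records if r['label_value'] == 'yes']
```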

## Extras

The `utils` module includes a set of helpful tools for data analysis and selection during the dataset preparation phase:

- **Distribute Tool**: Analyzes the distribution of one column relative to another, helping users identify balanced or imbalanced data points, useful for dataset selection.
- **Fuzzysearch Tool**: Allows approximate matching within the dataset, helping locate relevant data, such as label definitions or metadata, without requiring exact queries.
- **Sampling Tool**: Provides three pre-configured sampling strategies to ensure balanced and representative data subsets for experimental setups.
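
The ideas behind the first two tools can be sketched with the Python standard library. This is not the `utils` module API, which is not shown here; the sample data and variable names are assumptions made for illustration.

```python
from collections import Counter
from difflib import get_close_matches

# Distribution of one column relative to another:
# count how often each (dataset_name, label_value) pair occurs
pairs = [
    ("toy_data_1", "yes"),
    ("toy_data_1", "no"),
    ("toy_data_1", "yes"),
    ("toy_data_2", "no"),
]
distribution = Counter(pairs)

# Approximate matching: find the closest known source name for a misspelled query
sources = ["twitter", "reddit", "facebook"]
matches = get_close_matches("twiter", sources, n=1, cutoff=0.6)
```

A skewed `distribution` points to imbalanced labels within a dataset, and `matches` shows how an inexact query can still locate the intended entry.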

