## Step-by-Step Guide to Reproducing the Experiments

This notebook walks through setting up and reproducing our experiments. By following these instructions, you will download the necessary models, convert their weights, and prepare the data required for training and testing.

### Environment

To set up your environment, first run the following commands (shown for Windows PowerShell):
```powershell
python -m venv venv
Set-ExecutionPolicy Unrestricted -Scope Process
venv\Scripts\Activate
```

Next, navigate to `ebnerd-benchmark` and run:
```bash
pip install .
```

After that, navigate back to the root directory and upgrade PyTorch to a CUDA-enabled build, following the [official installation instructions](https://pytorch.org/get-started/locally/).

Lastly, to add the rest of the required packages with the correct versions, run:
```bash
pip install -r requirements.txt
```
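
As a quick sanity check after installation, the following sketch reports whether PyTorch is importable and whether it can see a CUDA device. The helper name `cuda_status` is hypothetical and not part of this repository; it only relies on `torch.cuda.is_available()` and `torch.version.cuda`.

```python
import importlib.util


def cuda_status():
    """Describe whether PyTorch is installed and whether it can see a CUDA device."""
    # Hypothetical helper for illustration; not part of this repository.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"torch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"torch {torch.__version__} without CUDA support or without a visible GPU"


print(cuda_status())
```

If the last message appears, the installed build is likely CPU-only, or the GPU driver is not visible to the environment.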

### Llama model

#### 1. Download the Llama-2 model

First, download the Llama-2 model from the Meta website: https://llama.meta.com/llama-downloads/. Filling in the form gives you a link that you can use to run the [`download.sh`](https://github.com/meta-llama/llama/blob/main/download.sh) script. The script requires `wget`, so install it first if you do not already have it (for Windows: https://gnuwin32.sourceforge.net/packages/wget.htm).

After downloading, make sure the model is saved in a folder with the following structure:

- `llama`
  - `7B`
    - `checklist.chk`
    - `config.json`
    - `consolidated.00.pth`
    - `params.json`
  - `LICENSE`
  - `tokenizer_checklist.chk`
  - `tokenizer.model`
  - `USE_POLICY.md`
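
The layout above can be checked before starting the conversion. The following is a minimal sketch that assumes the model folder is named `llama` and sits in the current working directory; the helper name `missing_llama_files` is hypothetical.

```python
from pathlib import Path

# Files listed in the folder structure above, relative to the model root.
EXPECTED_FILES = [
    "7B/checklist.chk",
    "7B/config.json",
    "7B/consolidated.00.pth",
    "7B/params.json",
    "LICENSE",
    "tokenizer_checklist.chk",
    "tokenizer.model",
    "USE_POLICY.md",
]


def missing_llama_files(root="llama"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_llama_files("llama")
print("layout looks complete" if not missing else f"missing: {missing}")
```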

#### 2. Convert Llama weights to HuggingFace interface

Next, convert the Llama weights to the Hugging Face format using the `convert_llama_weights_to_hf.py` script with the following arguments:
```bash
python convert_llama_weights_to_hf.py --input_dir llama --model_size 7B --output_dir llama_converted --llama_version 2
```


#### 3. Create the token embeddings

Create the token embeddings from the converted weights using the following script:

```bash
python convert_llama_converted_to_token_npy.py
```

This creates a `llama-token.npy` file.
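
To confirm that the file was written correctly without loading it into memory, its header can be inspected. The sketch below assumes the file is in the standard NumPy `.npy` format; the helper name `read_npy_header` is hypothetical, and the expected shape is an assumption: if the script stores the full Llama-2 7B embedding matrix, a shape of `(32000, 4096)` would be expected.

```python
import ast
import struct
from pathlib import Path


def read_npy_header(path):
    """Return (shape, dtype string) of a .npy file without reading its data."""
    with Path(path).open("rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Format 1.x stores the header length in 2 bytes, later versions in 4.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        # The header is a Python dict literal describing the array.
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
    return header["shape"], header["descr"]


# Usage (assumes the file is in the current directory):
# shape, dtype = read_npy_header("llama-token.npy")
# print(shape, dtype)
```

With NumPy available, `np.load("llama-token.npy", mmap_mode="r").shape` gives the same information.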

### Tokenize data

Choose one of the following tokenization methods, depending on which embedding layer implementation you want to reproduce: BERT, Llama, or a combination of both.
<!-- 
1. Tokenize with BERT
2. Tokenize with Llama
3. Tokenize both with BERT and Llama and combine the embeddings for the news -->

#### 1. Method 1: Tokenize with BERT

To use BERT as the LLM behind the embedding layer, run `process\eb-nerd\processor.py`. Before running the script, update these paths in its main function:
```python
processor = Processor(
    data_dir="ebnerd-benchmark/data/ebnerd_small", # PATH to your data
    store_dir="ebnerd-benchmark/data/tokenized_bert" # PATH to the directory where you want to save the tokenized data
)
```
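
Before running a processor, it can help to confirm that the input directory exists and to create the output directory up front. This is a minimal sketch; `prepare_dirs` is a hypothetical helper, and whether `Processor` creates `store_dir` on its own is not something this guide confirms.

```python
from pathlib import Path


def prepare_dirs(data_dir, store_dir):
    """Check that data_dir exists and create store_dir if needed (hypothetical helper)."""
    data_dir, store_dir = Path(data_dir), Path(store_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data_dir does not exist: {data_dir}")
    # Creating the output directory is harmless if it already exists.
    store_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, store_dir


# Usage with the paths shown above:
# prepare_dirs("ebnerd-benchmark/data/ebnerd_small",
#              "ebnerd-benchmark/data/tokenized_bert")
```

The same check applies to the Llama tokenization paths.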

#### 2. Method 2: Tokenize with Llama

To use Llama as the LLM behind the embedding layer, run `process\eb-nerd\processor_llama.py`. Before running the script, update these paths in its main function:
```python
processor = Processor(
    data_dir="ebnerd-benchmark/data/ebnerd_small", # PATH to your data
    store_dir="ebnerd-benchmark/data/tokenized_llama" # PATH to the directory where you want to save the tokenized data
)
```

#### 3. Method 3: Tokenize with both BERT and Llama

To use both LLMs for the news item representations, run `process\eb-nerd\fusion.py`. Before running the script, update these paths:
```python
news = UniDep(os.path.join("ebnerd-benchmark/data/tokenized_bert", 'news')) # PATH to the directory where data tokenized with BERT is saved
news_llama = UniDep('ebnerd-benchmark/data/tokenized_llama/news-llama') # PATH to the directory where data tokenized with Llama is saved
...
news.export('ebnerd-benchmark/data/news-fusion') # PATH to the directory where you want to save the combined representation of the (news) items
```

Note: To integrate GENRE-generated data, apply the same steps to that data.