<a href="https://colab.research.google.com/github/DanialPahlavan/RAG/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# intro
A basic RAG pipeline, following the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks":

https://arxiv.org/abs/2005.11401

## KoRC from THU-KEG

https://github.com/THU-KEG/KoRC/tree/main

In [1]:
!git clone https://github.com/THU-KEG/KoRC.git

Cloning into 'KoRC'...
remote: Enumerating objects: 132, done.
remote: Counting objects: 100% (132/132), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 132 (delta 49), reused 119 (delta 41), pack-reused 0 (from 0)
Receiving objects: 100% (132/132), 85.80 KiB | 1.13 MiB/s, done.
Resolving deltas: 100% (49/49), done.


In [2]:
!wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz

--2024-08-14 12:50:05--  https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.35.7.128, 13.35.7.38, 13.35.7.82, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.35.7.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4694541059 (4.4G) [application/gzip]
Saving to: ‘psgs_w100.tsv.gz’


2024-08-14 12:50:46 (109 MB/s) - ‘psgs_w100.tsv.gz’ saved [4694541059/4694541059]



In [3]:
!gzip -d psgs_w100.tsv.gz

In [4]:
!pwd

/content


### preprocess

#### small DB

In [5]:
from tqdm import tqdm

# Open the file with a context manager to ensure it's closed properly
with open('psgs_w100.tsv', 'r') as file:
    new_lines = []
    # Use enumerate for counting lines and tqdm for progress bar
    for i, line in enumerate(tqdm(file), start=1):
        # Skip the first line (header)
        if i == 1:
            continue
        # Stop after line 500 of the file, i.e. the header plus 499 passages
        if i > 500:
            break
        # Columns are id, text, title; avoid shadowing the built-in `id`
        passage_id, text, title = line.strip().split("\t")
        new_lines.append(title + '\t' + text + '\n')

# Specify the full path to the file
with open('/content/KoRC/RAG/preprocess/data.csv', 'w') as f:
    f.writelines(new_lines)


500it [00:00, 29964.02it/s]


#### full DB
Run either the small DB cell or the full DB cell, not both: each one writes to the same `data.csv`.

In [None]:
from tqdm import tqdm

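new_lines = []
# Iterate over the file object instead of calling readlines(), so the source
# is read line by line and closed when the block exits
with open('psgs_w100.tsv', 'r') as file:
    next(file)  # skip the header line
    for line in tqdm(file):
        passage_id, text, title = line.strip().split("\t")
        new_lines.append(title + '\t' + text + '\n')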

# Specify the full path to the file
with open('/content/KoRC/RAG/preprocess/data.csv', 'w') as f:
    f.writelines(new_lines)
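
The small and full variants above differ only in how many passages they keep, so they can be folded into one function. The following is a minimal sketch, not part of the KoRC repository: `convert_passages` and its `limit` parameter are hypothetical names, and the input is assumed to be a tab-separated file with an `id`, `text`, `title` header, as in the cells above. It writes each row as it is read, so the full file never has to sit in memory.

```python
from itertools import islice
from typing import Optional


def convert_passages(src: str, dst: str, limit: Optional[int] = None) -> int:
    """Write `title<TAB>text` rows from a DPR-style TSV and return the row count.

    `limit` is the number of passages to keep; None keeps all of them.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        next(fin, None)  # skip the header line
        for line in islice(fin, limit):
            # Columns are id, text, title; only the last two are written out
            _passage_id, text, title = line.rstrip("\r\n").split("\t")
            fout.write(f"{title}\t{text}\n")
            written += 1
    return written
```

Under these assumptions, `convert_passages('psgs_w100.tsv', '/content/KoRC/RAG/preprocess/data.csv', limit=499)` would reproduce the small DB, and omitting `limit` would produce the full one.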


In [None]:
!python /content/KoRC/RAG/preprocess/convert.py

^C


### build_faiss

In [6]:
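# `!` shell commands do not raise Python exceptions, so wrapping them in
# try/except has no effect; check the printed output for errors instead.
!pip install datasets
!apt-get install libomp-dev -y
# faiss-gpu and faiss-cpu both provide the `faiss` module, so one of them is normally enough
!pip install --upgrade faiss-gpu faiss-cpu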

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.3/527.3 kB 10.1 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.6 MB/s eta 0:00:00
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [7]:
!apt install libomp-dev
!python -m pip install --upgrade faiss-gpu
!python -m pip install --upgrade faiss-cpu

#gpu
#!pip install faiss-gpu

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libomp-dev is already the newest version (1:14.0-55~exp2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [8]:
!python /content/KoRC/RAG/preprocess/build_faiss.py --csv_path /content/KoRC/RAG/preprocess/data.csv

INFO:__main__:Step 1 - Create the dataset
Generating train split: 499 examples [00:00, 6543.11 examples/s]
Map: 100% 499/499 [00:00<00:00, 32833.80 examples/s]
config.json: 100% 492/492 [00:00<00:00, 2.15MB/s]
pytorch_model.bin: 100% 438M/438M [00:20<00:00, 21.3MB/s]
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
tokenizer_config.json: 100% 28.0/28.0 [00:

### prepare_openqa_dataset.py

In [9]:
!python /content/KoRC/RAG/prepare_openqa_dataset.py

Traceback (most recent call last):
  File "/content/KoRC/RAG/prepare_openqa_dataset.py", line 35, in <module>
    mrc_train = json.load(open(os.path.join(input_path,'train.json')))
  File "/usr/lib/python3.10/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
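
The traceback shows that `input_path` is `None` when it reaches `os.path.join`, which suggests the script expected a path that was never supplied. How `prepare_openqa_dataset.py` actually receives that path is not shown here. As a general illustration, the sketch below shows how a script can fail early with a readable message instead of a `TypeError`. The `--input_path` flag and the `parse_args` helper are hypothetical names, not the KoRC script's real interface.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Illustrative argument check")
    # `--input_path` is a hypothetical flag name chosen for this sketch
    parser.add_argument("--input_path", type=Path, required=True,
                        help="directory that contains train.json")
    args = parser.parse_args(argv)
    # Report a missing file through argparse rather than a later traceback
    if not (args.input_path / "train.json").is_file():
        parser.error(f"train.json not found under {args.input_path}")
    return args
```

With `required=True`, omitting the flag makes `argparse` print a usage message and exit, rather than passing `None` further into the script.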


In [10]:
import os

try:
    os.chdir('/content/KoRC/RAG/')
    print("Changed to /content/KoRC/RAG/ directory.")
except Exception as e:
    print(f"Error changing directory: {e}")


Changed to /content/KoRC/RAG/ directory.


In [11]:
!python prepare_feature.py -o output_directory --max_tgt_len 128 --max_source_length 256 --max_context_length 512 --question_with_ctx

add special tokens ['<ans>']
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the cla

In [14]:
cd script

/content/KoRC/RAG/script


In [21]:
import os
import subprocess

try:
    os.chdir('/content/KoRC/RAG/')
    print("Changed to /content/KoRC/RAG/ directory.")
except Exception as e:
    print(f"Error changing directory: {e}")

def run(command):
    # os.system never raises on failure; check=True raises
    # CalledProcessError when the command exits with a non-zero status
    subprocess.run(command, shell=True, check=True)

try:
    # Upgrade pytorch_lightning to ensure all required modules are available
    run('pip install --upgrade pytorch-lightning')

    # Ensure necessary directories exist
    run('mkdir -p ../dataset/rag/human/seq1e-5/small_iid_result')
    run('mkdir -p ../dataset/rag/human/seq1e-5/small_ood_result')

    # Run the prepare_feature.py script with the required arguments
    run('python prepare_feature.py -o output_directory --max_tgt_len 128 --max_source_length 256 --max_context_length 512 --question_with_ctx')

    # Run the Bash script
    run('bash seq.human.sh')
except subprocess.CalledProcessError as e:
    print(f"Error running command: {e}")
    print(f"Output: {e.output}")


Changed to /content/KoRC/RAG/ directory.
Error running command: Command 'python prepare_feature.py -o output_directory --max_tgt_len 128 --max_source_length 256 --max_context_length 512 --question_with_ctx' returned non-zero exit status 1.
Output: None


In [20]:
!bash seq.human.sh

bash: seq.human.sh: No such file or directory
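
The error above means `seq.human.sh` is not in the current working directory. Rather than guessing its location, the cloned repository can be searched for it. The sketch below uses a hypothetical helper, `find_script`, and assumes the repository was cloned to `/content/KoRC` as in the first cell.

```python
from pathlib import Path


def find_script(root, name):
    """Return every file under `root` whose name equals `name`, sorted by path."""
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


# Example call (paths depend on where the repository was cloned):
# for path in find_script("/content/KoRC", "seq.human.sh"):
#     print(path)
```

Once the file is located, change into its directory (or pass its full path to `bash`) before running it, since the script may rely on relative paths.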
