<a href="https://colab.research.google.com/github/ChaitaliV/generative-explanation/blob/main/datacollection/unsupervised_data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Create Unsupervised Dataset for Generative Transformer
* Read the PDFs and save the extracted text to a .txt file.
* Manually add more data from blogs and other online resources that are not downloadable.
* Manually read through the data once and remove text that is not relevant.
* Clean the text.
* Create sentences.
* Create an unsupervised learning dataset for the generative transformer model by selectively masking words in all sentences.

In [1]:
!pip install pyPDF2 transformers sentencepiece
!git clone https://github.com/ChaitaliV/generative-explanation

Collecting pyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.1-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-m

### Create Raw Dataset
* Read text from the PDFs, which are textbooks, manuals, and other web resources on depression and its diagnosis. This is our raw data.
* Go through the raw data manually and add % to separate topics. This should work better than splitting the text into sentences, because feeding whole topics as input retains long-term dependencies.
* Also remove garbage text and topics that are not relevant.
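The topic convention above can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: `split_topics` is a hypothetical helper and the sample string is made up for the example.

```python
def split_topics(raw_text, separator="%"):
    """Split manually annotated text into topic strings, dropping empty pieces."""
    pieces = (piece.strip() for piece in raw_text.split(separator))
    return [piece for piece in pieces if piece]

# Made-up sample: two topics, each closed with the % separator.
sample = "Symptoms of depression include low mood.% Diagnosis relies on clinical interviews.%"
topics = split_topics(sample)
print(topics)
```

Dropping empty pieces keeps a trailing separator from producing an empty topic that would later be tokenized and masked.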

In [2]:
from PyPDF2 import PdfReader
import os

In [3]:
folder_path = 'generative-explanation/datasets/unsupervised dataset'

In [4]:
def read_pdf(file_path):
    """Return the text of every page in a PDF file, concatenated."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        # Fall back to '' so a page without extractable text cannot break the join
        return ''.join(page.extract_text() or '' for page in reader.pages)

# get the list of all pdf from the folder
file_list = [file for file in os.listdir(folder_path) if file.endswith('.pdf')]
corpus = ''

# Iterate through PDF files and read text
for file_name in file_list:
    file_path = os.path.join(folder_path, file_name)
    text = read_pdf(file_path)
    corpus += text

# Save the corpus as a .txt file for future use
def save_to_txt(text, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(text)

save_to_txt(corpus, 'generative-explanation/raw_unsupervised_text.txt')

### Process the raw text to create Dataset with masked tokens
* Load the raw data, remove numbers and special characters, and split the data at the % token.
* Mask words to create an unsupervised dataset in which the model predicts the masked words.
* For the masking process, mask individual tokens.
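As a rough, word-level sketch of the masking idea described above (the notebook's `mask_tokens` works on T5 token IDs instead), the example below masks a random 40% of the interior words for the encoder and the remaining interior words for the decoder. The helper name `mask_pair`, the `seed` parameter, and the sample sentence are assumptions made for illustration.

```python
import random

MASK = "<masked_token>"

def mask_pair(tokens, mask_fraction=0.4, seed=None):
    """Return (encoder_tokens, decoder_tokens) with complementary masks."""
    rng = random.Random(seed)
    interior = range(1, len(tokens) - 1)  # never mask the first or last token
    num_to_mask = min(int(mask_fraction * len(tokens)), len(interior))
    encoder_masked = set(rng.sample(interior, num_to_mask))
    encoder = [MASK if i in encoder_masked else tok for i, tok in enumerate(tokens)]
    decoder = [MASK if i in interior and i not in encoder_masked else tok
               for i, tok in enumerate(tokens)]
    return encoder, decoder

# Made-up sample sentence, split on whitespace for simplicity.
tokens = "depression is a common but serious mood disorder".split()
encoder, decoder = mask_pair(tokens, seed=0)
print(encoder)
print(decoder)
```

Every interior position is masked in exactly one of the two sequences, so the decoder side carries precisely the words hidden from the encoder.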

In [56]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import random
import numpy as np
import re
import pandas as pd

In [1]:
raw_data_path = 'generative-explanation/datasets/unsupervised dataset/raw_unsupervised_text.txt'

In [35]:
with open(raw_data_path, 'r', encoding='utf-8') as file:
    corpus = file.read()

In [64]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")

def mask_tokens(sentence, mask_percentage=0.4):
    """Tokenize the sentence and build a complementary pair of masked sequences.

    A random share of the tokens (40% by default) is masked in the encoder input;
    the remaining tokens are masked in the decoder label. The first and last
    tokens are never masked.
    """
    # Tokenize the sentence
    tokenized_sentence = tokenizer(sentence, return_tensors="pt").input_ids[0]
    label = tokenized_sentence.clone()

    # NOTE: "<masked_token>" is never added to the tokenizer in this notebook, so this
    # lookup appears to fall back to the unknown-token ID (the tensor(2) values in the output).
    mask_id = tokenizer.convert_tokens_to_ids("<masked_token>")

    # Calculate the number of tokens to mask
    num_tokens_to_mask = int(mask_percentage * len(tokenized_sentence))

    # Randomly choose interior indices to mask in the encoder; the rest go to the decoder
    candidate_indices = range(1, len(tokenized_sentence) - 1)
    encoder_masked_indices = random.sample(candidate_indices, num_tokens_to_mask)
    decoder_masked_indices = sorted(set(candidate_indices) - set(encoder_masked_indices))

    # Mask the chosen tokens in the encoder input
    for index in encoder_masked_indices:
        tokenized_sentence[index] = mask_id

    # Mask the rest of the tokens in the decoder label
    for index in decoder_masked_indices:
        label[index] = mask_id

    return tokenized_sentence, label

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
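Since `<masked_token>` is looked up with `convert_tokens_to_ids` but never added to the tokenizer, it most likely resolves to the unknown-token ID, which would make masked positions indistinguishable from genuinely unknown tokens. One possible remedy, sketched below under that assumption, is to register the token first. `register_mask_token` is a hypothetical helper; `add_tokens`, `convert_tokens_to_ids`, `unk_token_id`, and `resize_token_embeddings` are existing Hugging Face methods.

```python
def register_mask_token(tokenizer, token="<masked_token>"):
    """Add `token` to the tokenizer if it is missing and return its ID."""
    if tokenizer.convert_tokens_to_ids(token) == tokenizer.unk_token_id:
        tokenizer.add_tokens([token])
    return tokenizer.convert_tokens_to_ids(token)

# Usage with the notebook's tokenizer (downloads the t5-base files):
# from transformers import T5Tokenizer, T5ForConditionalGeneration
# tokenizer = T5Tokenizer.from_pretrained("t5-base")
# mask_id = register_mask_token(tokenizer)
# model = T5ForConditionalGeneration.from_pretrained("t5-base")
# model.resize_token_embeddings(len(tokenizer))  # make room for the new ID
```

An alternative that needs no vocabulary change would be to mask with one of T5's built-in sentinel tokens such as `<extra_id_0>`.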


In [61]:
def generate_data(text):
    """Clean the raw corpus, split it into topic strings, and mask each one."""
    # Remove newline characters
    text = text.replace('\n', ' ')

    # Remove special characters and collapse repeated spaces
    # NOTE: '%' is not in the allowed set, so the topic separators are stripped
    # here and the split below returns a single sequence.
    pattern = re.compile('[^a-zA-Z;?.,\']')
    clean_text = pattern.sub(' ', text)
    clean_text = re.sub(' +', ' ', clean_text)

    # Create topic sequences
    text_strings = clean_text.split('%')

    # Create masks for each topic string
    encoder_data = []
    decoder_data = []
    for string in text_strings:
        encoder, decoder = mask_tokens(string)
        encoder_data.append(encoder)
        decoder_data.append(decoder)

    # Create the final dataframe
    df = pd.DataFrame({'Encoder': encoder_data, 'Decoder': decoder_data})
    return df
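The character class in `generate_data` does not allow `%`, so the cleaning step replaces every topic separator with a space before `split('%')` runs. A minimal sketch of a cleaning step that keeps the separator is shown below; `clean_and_split` is a hypothetical helper and the sample string is made up.

```python
import re

def clean_and_split(text, separator="%"):
    """Clean raw text while preserving the topic separator, then split on it."""
    text = text.replace("\n", " ")
    # Keep letters, basic punctuation, and the separator; blank out everything else
    disallowed = re.compile(r"[^a-zA-Z;?.,'" + re.escape(separator) + "]")
    cleaned = re.sub(r" +", " ", disallowed.sub(" ", text))
    return [topic.strip() for topic in cleaned.split(separator) if topic.strip()]

print(clean_and_split("Topic one: 12 signs!\n% Topic (two), explained."))
```

Each returned string could then be passed to `mask_tokens` in place of the single cleaned corpus.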

In [62]:
unsupervised_dataset = generate_data(corpus)

Token indices sequence length is longer than the specified maximum sequence length for this model (36020 > 512). Running this sequence through the model will result in indexing errors


In [63]:
unsupervised_dataset

Unnamed: 0,Encoder,Decoder
0,"[tensor(23138), tensor(11), tensor(6261), tens...","[tensor(23138), tensor(2), tensor(2), tensor(2..."


In [65]:
text_strings = corpus.split('%')

#create masks for each topic strings
encoder_data = []
decoder_data = []
for string in text_strings:
  encoder, decoder = mask_tokens(string)
  encoder_data.append(encoder)
  decoder_data.append(decoder)

Token indices sequence length is longer than the specified maximum sequence length for this model (701 > 512). Running this sequence through the model will result in indexing errors
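The warning above means some topics exceed the 512-token limit of t5-base. One way to handle this, sketched here as an assumption rather than the notebook's approach, is to split each token-ID sequence into windows of at most 512 IDs before masking. `chunk_token_ids` is a hypothetical helper.

```python
def chunk_token_ids(token_ids, max_length=512):
    """Split a sequence of token IDs into consecutive windows of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[start:start + max_length]
            for start in range(0, len(token_ids), max_length)]

# A 701-token topic, like the one in the warning, becomes two windows.
chunks = chunk_token_ids(list(range(701)))
print([len(chunk) for chunk in chunks])  # [512, 189]
```

Alternatively, calling the tokenizer with `truncation=True` and `max_length=512` caps the length directly, at the cost of discarding everything past the limit.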
