# KAN-42 Textual Data Processing
by Miguel

**Step 1: Install Required Libraries**


First, make sure the required libraries are installed: datasets for loading the dataset, pandas for data manipulation, and nltk for text processing. Step 5 also imports transformers, which the command below does not install. Install any that are missing with pip:

In [3]:
!pip install datasets pandas nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp312-cp312-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.9.11-cp312-cp312-win_amd64.whl (273 kB)
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: regex, joblib, click, nltk
Successfully installed click-8.1.7 joblib-1.4.2 nltk-3.9.1 regex-2024.9.11
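
Before moving on, it can help to confirm which packages are actually importable in the current kernel. The following is a minimal sketch using only the standard library; the helper name check_packages is made up for this illustration, and transformers is included in the list because Step 5 imports it.

```python
from importlib.util import find_spec

def check_packages(names):
    """Map each top-level package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}

# Packages used in this notebook (transformers is only needed in Step 5)
status = check_packages(["datasets", "pandas", "nltk", "transformers"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any package reported as missing can be installed with pip in the same way as above.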


**Step 2: Load the SLAKE Dataset**

Load the SLAKE dataset using the datasets library.

In [4]:
from datasets import load_dataset  
  
# Load the SLAKE dataset  
ds = load_dataset("BoKelvin/SLAKE")  




**Step 3: Extract Question-Answer Pairs**

Inspect the dataset to understand its structure, then extract the question-answer pairs. The printed features show that each record has a question column, an answer column, and an img_name column linking the pair to its image.

In [6]:
# Display dataset structure  
print(ds)  
  
# Display the first example to understand the structure  
print(ds['train'][0])  
  
# Convert to Pandas DataFrame for easier manipulation  
import pandas as pd  
  
# Convert the train split to a DataFrame  
df_train = pd.DataFrame(ds['train'])  
  
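# Extract question-answer pairs (.copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning)  
qa_pairs = df_train[['question', 'answer', 'img_name']].copy()  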
print(qa_pairs.head())  


DatasetDict({
    train: Dataset({
        features: ['img_name', 'location', 'answer', 'modality', 'base_type', 'answer_type', 'question', 'qid', 'content_type', 'triple', 'img_id', 'q_lang'],
        num_rows: 9835
    })
    validation: Dataset({
        features: ['img_name', 'location', 'answer', 'modality', 'base_type', 'answer_type', 'question', 'qid', 'content_type', 'triple', 'img_id', 'q_lang'],
        num_rows: 2099
    })
    test: Dataset({
        features: ['img_name', 'location', 'answer', 'modality', 'base_type', 'answer_type', 'question', 'qid', 'content_type', 'triple', 'img_id', 'q_lang'],
        num_rows: 2094
    })
})
{'img_name': 'xmlab1/source.jpg', 'location': 'Abdomen', 'answer': 'MRI', 'modality': 'MRI', 'base_type': 'vqa', 'answer_type': 'OPEN', 'question': 'What modality is used to take this image?', 'qid': 0, 'content_type': 'Modality', 'triple': ['vhead', '_', '_'], 'img_id': 1, 'q_lang': 'en'}
                                            question   ans
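
Each record carries a q_lang field, which suggests the dataset may hold questions in more than one language. Since the preprocessing in the next step assumes English text, it may be worth filtering on that field first. The sketch below works on plain dictionaries shaped like ds['train'][0]; the first record comes from the output above, while the second record and its 'zh' language code are placeholders invented for illustration, as is the helper name select_qa.

```python
# Records shaped like ds['train'][0]. Only the first one is taken from the
# notebook output; the second is a made-up placeholder.
records = [
    {"img_name": "xmlab1/source.jpg",
     "question": "What modality is used to take this image?",
     "answer": "MRI", "q_lang": "en"},
    {"img_name": "xmlab1/source.jpg",
     "question": "<non-English question>",
     "answer": "<non-English answer>", "q_lang": "zh"},
]

def select_qa(rows, lang="en"):
    """Keep question, answer and image name for rows in the given language."""
    keep = ("question", "answer", "img_name")
    return [{k: r[k] for k in keep} for r in rows if r["q_lang"] == lang]

english_pairs = select_qa(records)
print(english_pairs)
```

On the DataFrame itself, the equivalent filter would be df_train[df_train['q_lang'] == 'en'] before selecting the three columns.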

**Step 4: Clean and Preprocess Text Data**

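We'll perform the following preprocessing steps:
* Lowercase the text
* Remove punctuation
* Tokenize the text
* Remove stop words (left disabled in the code below, because short answers such as yes/no and the meaning of a question can be lost)

We'll use the re library for regular expressions and nltk for stop words and tokenization.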
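
To see what the lowercasing and punctuation steps do on their own, here is a small standard-library sketch. It substitutes str.split for nltk's word_tokenize, so it is an approximation of the pipeline rather than the same function, and simple_preprocess is a name chosen for this illustration. The inputs "T2-weighted" and "0.5 cm" are hypothetical examples, included because the punctuation regex also deletes hyphens and decimal points.

```python
import re

def simple_preprocess(text):
    """Lowercase, drop every character that is not a word character or
    whitespace, then collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

print(simple_preprocess("What is the mr weighting in this image?"))
# what is the mr weighting in this image
print(simple_preprocess("T2-weighted"))  # t2weighted
print(simple_preprocess("0.5 cm"))       # 05 cm
```

If answers in the dataset contain hyphenated terms or decimal values, the regex may need to be relaxed so that those characters survive.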

In [8]:
import re  
import nltk  
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize  
  
# Download NLTK data (you can skip this if already downloaded)  
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
  
# Function to preprocess text  
def preprocess_text(text):  
    # Lowercase the text  
    text = text.lower()  
      
    # Remove punctuation  
    text = re.sub(r'[^\w\s]', '', text)  
      
    # Tokenize the text  
    words = word_tokenize(text)  
      
    # Remove stop words -- commented out: answers may be yes/no, and questions lose context if stop words are removed
    #stop_words = set(stopwords.words('english'))  
    #words = [word for word in words if word not in stop_words]  
      
    return ' '.join(words)  
  
# Apply preprocessing to the question and answer columns  
qa_pairs['processed_question'] = qa_pairs['question'].apply(preprocess_text)  
qa_pairs['processed_answer'] = qa_pairs['answer'].apply(preprocess_text)  
  
# Display the first few rows of the processed text  
print(qa_pairs[['question', 'processed_question', 'answer', 'processed_answer','img_name']].head())  


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\021348\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\021348\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\021348\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


                                            question  \
0          What modality is used to take this image?   
1  Which part of the body does this image belong to?   
2            What is the mr weighting in this image?   
3                    Does the picture contain liver?   
4                   Does the picture contain kidney?   

                                 processed_question   answer processed_answer  \
0          what modality is used to take this image      MRI              mri   
1  which part of the body does this image belong to  Abdomen          abdomen   
2            what is the mr weighting in this image       T2               t2   
3                    does the picture contain liver      Yes              yes   
4                   does the picture contain kidney       No               no   

            img_name  
0  xmlab1/source.jpg  
1  xmlab1/source.jpg  
2  xmlab1/source.jpg  
3  xmlab1/source.jpg  
4  xmlab1/source.jpg  
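
The output above includes answers such as "No", which is why stop-word removal stays disabled. The sketch below illustrates the problem with a small hand-picked stop list; it is a subset chosen for this example, not the full list returned by stopwords.words('english'), although nltk's English list does contain these words. The helper name remove_stop_words is likewise made up here.

```python
# Hand-picked subset for illustration only; nltk's English stop list is
# much longer but includes all of these words.
STOP_WORDS = {"no", "not", "is", "the", "does", "in", "this"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(repr(remove_stop_words("no")))   # ''
print(repr(remove_stop_words("yes")))  # 'yes'
print(remove_stop_words("does the picture contain liver"))
# picture contain liver
```

With the filter applied, a "no" answer becomes an empty string while "yes" survives, so the two classes would no longer be treated symmetrically.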


Note: if qa_pairs is built as a plain column selection of df_train (without .copy()), pandas emits a SettingWithCopyWarning for both column assignments in this cell:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Creating qa_pairs with df_train[['question', 'answer', 'img_name']].copy() in Step 3 avoids the warning.


**Step 5: Tokenize Text for Input into the PubMedCLIP Model**

To tokenize text for the PubMedCLIP model, use the tokenizer that belongs to the model checkpoint you plan to load, so that the token IDs match the vocabulary the model was trained with. PubMedCLIP is a specialized, CLIP-based model, so a tokenizer taken from an unrelated model will not produce valid input.
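
As a rough picture of what encoding with special tokens produces, the toy sketch below maps words to integer IDs, wraps them in start and end markers, and pads to a fixed length. Everything in it is an assumption made for illustration: the whitespace splitting, the special-token names, the vocabulary, and max_length=8. A real CLIP-style tokenizer uses subword units and its own vocabulary, and the original CLIP release uses a context length of 77 tokens.

```python
def build_vocab(texts, specials=("<pad>", "<start>", "<end>", "<unk>")):
    """Assign an integer ID to each special token and each distinct word."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length=8):
    """Toy encoder: word IDs wrapped in start/end markers, truncated and
    padded so every sequence has exactly max_length entries."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.split()]
    ids = ids[: max_length - 2]  # leave room for the two markers
    ids = [vocab["<start>"]] + ids + [vocab["<end>"]]
    return ids + [vocab["<pad>"]] * (max_length - len(ids))

questions = ["does the picture contain liver",
             "does the picture contain kidney"]
vocab = build_vocab(questions)
print(encode(questions[0], vocab))                         # [1, 4, 5, 6, 7, 8, 2, 0]
print(encode("does the picture contain spleen", vocab))    # [1, 4, 5, 6, 7, 3, 2, 0]
```

Calling encode on a Hugging Face tokenizer without padding options returns lists of varying length, so padding and truncation to a common length are usually still needed before the sequences can be batched for a model.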

Below is an example of the pattern using AutoTokenizer from the transformers library; the tokenizer name is a placeholder and must be replaced with the real one.

In [None]:
from transformers import AutoTokenizer  
  
# Load the PubMedCLIP tokenizer (replace 'pubmedclip-tokenizer' with the actual tokenizer name)  
tokenizer = AutoTokenizer.from_pretrained('pubmedclip-tokenizer')  
  
# Tokenize the processed questions and answers  
qa_pairs['tokenized_question'] = qa_pairs['processed_question'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))  
qa_pairs['tokenized_answer'] = qa_pairs['processed_answer'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))  
  
# Display the first few rows of the token
