# RAG System for Survey Variable Resolution

This Jupyter notebook implements a Retrieval-Augmented Generation (RAG) system. Its purpose is to take user-submitted survey variables and find existing variables in the repository that are likely to match them.

## Overview

The RAG system combines a retrieval step with a generative model in a two-step process:

1. **Retrieval**: The system retrieves relevant documents (in this case, survey variables) from a Pinecone vector index based on the user-submitted data, then reorders the candidates with Cohere Rerank.

2. **Generation**: The system passes the retrieved variables to an OpenAI GPT-3.5 chat model through LangChain's `ChatOpenAI` wrapper. The response identifies which existing variables are likely to match the user-submitted data and explains why.
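
The retrieval half of this process can be illustrated with a toy, dependency-free sketch. It is not the notebook's implementation: word overlap stands in for the Pinecone vector search, Jaccard similarity stands in for Cohere Rerank, and every function name here is hypothetical.

```python
def tokens(text):
    """Lower-case a label and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, corpus, top_k):
    """Stage 1 (cheap recall): keep the top_k labels sharing the most words with the query."""
    q = tokens(query)
    return sorted(corpus, key=lambda label: len(q & tokens(label)), reverse=True)[:top_k]


def rerank(query, candidates, top_n):
    """Stage 2 (finer ordering): score candidates by Jaccard similarity and keep top_n."""
    q = tokens(query)

    def jaccard(label):
        t = tokens(label)
        return len(q & t) / len(q | t)

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]


corpus = [
    "PARTICIPANT ID NUMBER",
    "TIME OF DEATH",
    "PARTICIPANT ID",
    "CARDIAC OUTPUT (L/MIN)",
]
query = "participant id"

candidates = retrieve(query, corpus, top_k=3)  # broad candidate set
best = rerank(query, candidates, top_n=1)      # reordered and trimmed
print(candidates)
print(best)
```

The generation step would then receive the reranked candidates as context, which is what the prompt template later in the notebook does.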

## Usage

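To use this notebook, place a tab-separated file of survey variables in the `input_data` directory and point `csv_file_path` at it. The file needs a variable-name column and a label column (`Variable` and `Label` by default). The RAG system will process this data, retrieve relevant variables from the knowledge source, and generate a list of variables that are likely to match your input.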
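
As a rough illustration of the expected input, the sketch below parses a tab-separated data dictionary and checks that the two configured columns are present. The sample rows and the `read_data_dictionary` helper are hypothetical; the notebook itself reads the file with pandas.

```python
import csv
import io

# Column names taken from the notebook's configuration cell.
REQUIRED_COLUMNS = ("Variable", "Label")

# Hypothetical two-row data dictionary; real files live in the input_data directory.
sample_tsv = (
    "Variable\tLabel\n"
    "age_years\tAge at enrollment (years)\n"
    "height_cm\tStanding height (cm)\n"
)


def read_data_dictionary(handle):
    """Parse a tab-separated data dictionary and verify the required columns exist."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return list(reader)


rows = read_data_dictionary(io.StringIO(sample_tsv))
for row in rows:
    print(row["Variable"], "->", row["Label"])
```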

## Benefits

The RAG system supports the harmonization of survey data: it suggests which existing repository variables a new variable can be aligned with, along with a rationale for each suggestion, which can inform research and decision-making.

Please note that the accuracy of the system's output depends on the quality and comprehensiveness of the knowledge source. Therefore, it's crucial to continually update and expand the knowledge source to improve the system's performance.

## Setup

In [87]:
import os
import json
import pandas as pd
from langchain.docstore.document import Document
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.retrievers.multi_query import MultiQueryRetriever

import cohere
from pinecone import Pinecone
from pinecone import ServerlessSpec



from getpass import getpass
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings


#### API KEYS

In [89]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or getpass("OpenAI API Key: ")
COHERE_API_KEY = os.getenv('COHERE_API_KEY') or getpass("COHERE API Key: ")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY") or getpass("Enter your Pinecone API key: ")
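
The `os.getenv(...) or getpass(...)` pattern falls back to an interactive prompt whenever the environment variable is unset or empty. A minimal sketch of the same logic as a reusable helper follows; `resolve_secret` and the `DEMO_*` variable names are hypothetical, and a fake prompt function replaces `getpass` so the example runs non-interactively.

```python
import os


def resolve_secret(name, prompt_fn):
    """Return the environment value for `name`, or call prompt_fn if it is unset or empty."""
    value = os.getenv(name)
    if value:
        return value
    return prompt_fn(f"{name}: ")


# Demonstration with a stand-in prompt; in the notebook, prompt_fn would be getpass.getpass.
os.environ["DEMO_API_KEY"] = "from-env"
os.environ.pop("DEMO_MISSING_KEY", None)

from_env = resolve_secret("DEMO_API_KEY", lambda message: "typed-by-user")
from_prompt = resolve_secret("DEMO_MISSING_KEY", lambda message: "typed-by-user")
print(from_env, from_prompt)
```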



#### Configuration Values

In [91]:

# --- MODEL CONFIG ---

PINECONE_INDEX = "biolincc-labels-001" # Name of the Pinecone index
TEMPERATURE = 0  # lowest-randomness generation
VECTOR_TEXT_FIELD = "text"  # metadata key that holds each vector's text
EMBEDDING_MODEL = "text-embedding-ada-002"

# --- USER SUBMITTED VARIABLE CONFIG ---
USER_SUBMITTED_VARIABLE_COL = "Variable"
USER_SUBMITTED_LABEL_COL = "Label"
# USER_SUBMITTED_VARIABLE_COL = "name"
# USER_SUBMITTED_LABEL_COL = "description"

#### Logging

In [92]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

### Embedding and Vector DB Setup

In [93]:
pc = Pinecone(api_key=PINECONE_API_KEY)
spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

In [94]:
import time

index_name = PINECONE_INDEX
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# The index must already exist: this notebook only queries it and never creates it
if index_name not in existing_indexes:
    raise Exception("Pinecone index does not exist")

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 259880}},
 'total_vector_count': 259880}
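
The `dimension` value of 1536 in the stats above matches the vector size produced by `text-embedding-ada-002`. A small guard like the following sketch could catch a mismatch before any queries are sent. It assumes the stats can be read like a dictionary, and `check_index_dimension` is a hypothetical helper, not part of the Pinecone client.

```python
# Known output size of OpenAI's text-embedding-ada-002 model.
EXPECTED_DIMENSIONS = {"text-embedding-ada-002": 1536}


def check_index_dimension(stats, embedding_model):
    """Raise if the index dimension differs from the embedding model's vector size."""
    expected = EXPECTED_DIMENSIONS[embedding_model]
    actual = stats["dimension"]
    if actual != expected:
        raise ValueError(
            f"Index dimension {actual} does not match {embedding_model} ({expected})"
        )
    return actual


# The stats printed above, copied as a plain dictionary.
stats = {
    "dimension": 1536,
    "index_fullness": 0.0,
    "namespaces": {"": {"vector_count": 259880}},
    "total_vector_count": 259880,
}

dimension = check_index_dimension(stats, "text-embedding-ada-002")
print(dimension)
```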

In [95]:
embed = OpenAIEmbeddings(
    model=EMBEDDING_MODEL, openai_api_key=OPENAI_API_KEY, disallowed_special=()
)

In [96]:
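# Alias the LangChain wrapper so it does not shadow the `Pinecone` client class imported from `pinecone`
from langchain.vectorstores import Pinecone as PineconeVectorStore

vectorstore = PineconeVectorStore(index, embed.embed_query, VECTOR_TEXT_FIELD)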



## Parse User Input

In [97]:
# Specify the path to your tab-separated (TSV) input file
csv_file_path = 'input_data/ozzys_gen3_test.tsv'

# Read the TSV file into a DataFrame
df = pd.read_csv(csv_file_path, sep='\t')

In [98]:
df.describe()

Unnamed: 0,Variable,Label
count,104,104
unique,104,102
top,submitter_id,Measurement of the mass concentration (mcnc) o...
freq,1,2


In [12]:
df = df.head(5)
# The column we need to input into the RAG is the label column (USER_SUBMITTED_LABEL_COL)
# While in test mode, only the first five rows are processed

In [13]:
user_variables = df[USER_SUBMITTED_LABEL_COL]

## Setup Retriever


**Generation Prompt**
```
    You are a data curator whose role is to harmonize biological variables in the NHLBI (National Heart, Lung, and Blood Institute) data repository.
    You are tasked with evaluating input variables from a data dictionary that describes new data that will be added to the existing pool of variables in the repository.
    For each new variable, a vector search engine has returned the three nearest existing variables found in the data repository.
    Your job is to determine which of these existing variables, if any, is the best fit for harmonizing the new variable to the existing variable given your understanding of the underlying scientific principles.
    You must explain why the selected existing variable was chosen over the others and provide context on the relevance of each existing variable to the new variable.
    Then you are to provide the user with as much information as you can on how they can align their new variable with the selected existing variable.
    
    When there is no obvious match, provide additional context for why you can't make a determination.
    
    Contexts:
    {contexts}

    Original question: {query}
```
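
To make the two placeholders concrete, the sketch below fills a shortened stand-in for this prompt using plain `str.format`. The query and labels are copied from one of the notebook's retrieval outputs, and the `"\n---\n"` separator is the one the retrieval step uses to join contexts; the abbreviated template is only an illustration.

```python
# Shortened stand-in for the generation prompt; only the two placeholders matter here.
template = "Contexts:\n{contexts}\n\nOriginal question: {query}"

# Query and retrieved labels copied from one of the notebook's retrieval outputs.
query = "Unique identifier that can be used to retrieve more information for a participant."
retrieved = ["PARTICIPANT IDENTIFICATION", "PARTICIPANT ID", "PARTICIPANT ID NUMBER"]

# Retrieved labels are joined with a visible separator before being inserted.
filled = template.format(contexts="\n---\n".join(retrieved), query=query)
print(filled)
```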

In [99]:
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser

In [100]:
# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)

output_parser = LineListOutputParser()
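
The parser's behavior is easy to check in isolation. The sketch below mirrors its `strip` and `split` logic with a plain dataclass standing in for the pydantic model; `Lines`, `parse_lines`, and the sample response text are all hypothetical. It shows that blank lines inside a response survive as empty strings unless they are filtered out afterwards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Lines:
    """Plain-Python stand-in for the LineList model."""
    lines: List[str]


def parse_lines(text: str) -> Lines:
    """Mirror the parser's logic: trim both ends, then split on newlines."""
    return Lines(lines=text.strip().split("\n"))


# Made-up response with leading/trailing newlines and one blank line in the middle.
response = "\nBest match: PARTICIPANT ID\n\nRationale: both identify a participant.\n"

parsed = parse_lines(response)
non_empty = [line for line in parsed.lines if line.strip()]
print(parsed.lines)
print(non_empty)
```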


### Prompt Templates

#### Output Generation Prompt

In [101]:
# Define the prompt template
template = """
    You are a data curator whose role is to harmonize biological variables in the NHLBI (National Heart, Lung, and Blood Institute) data repository.
    You are tasked with evaluating input variables from a data dictionary that describes new data that will be added to the existing pool of variables in the repository.
    For each new variable, a vector search engine has returned the three nearest existing variables found in the data repository.
    Your job is to determine which of these existing variables, if any, is the best fit for harmonizing the new variable to the existing variable given your understanding of the underlying scientific principles.
    You must explain why the selected existing variable was chosen over the others and provide context on the relevance of each existing variable to the new variable.
    Then you are to provide the user with as much information as you can on how they can align their new variable with the selected existing variable.
    
    When there is no obvious match, provide additional context for why you can't make a determination.
    
    Contexts:
    {contexts}

    Original question: {query}
"""

GEN_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template=template,
)
llm = ChatOpenAI(temperature=TEMPERATURE, openai_api_key=OPENAI_API_KEY)

# Chain
gen_chain = LLMChain(llm=llm, prompt=GEN_PROMPT, output_parser=output_parser)

#### Query Prompt Placeholder

In [17]:
### IS THIS NEEDED? TBD
# query_template = """
#     You are a data curator whos role is to harmonize biological variables in the NHLBI (National Heart Lung and Blood) data repository. 
#     You are tasked with evaluating input variables from data dictionary that describes new data that will be added to the existing pool of variables in the repository.
    
#     Original query: {query}
# """

# QUERY_PROMPT = PromptTemplate(
#     # input_variables=["question"],
#     input_variables=["query"],
#     template=query_template,
# )
# llm = ChatOpenAI(temperature=TEMPERATURE, openai_api_key=OPENAI_API_KEY)

# # Chain
# llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

#### Cohere Rerank Retriever

In [102]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain_community.llms import Cohere

In [104]:
# Rerank the vector-store hits with Cohere before they reach the LLM
cohere_rerank = CohereRerank(cohere_api_key=COHERE_API_KEY)
retriever = ContextualCompressionRetriever(
    base_compressor=cohere_rerank,
    base_retriever=vectorstore.as_retriever(),
)


## Connecting RAG elements

In [105]:
from langchain.chains import TransformChain
from langchain.chains import SequentialChain

In [106]:
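def retrieval_transform(inputs: dict) -> dict:
    # Retrieve and rerank the candidate variables for this question
    docs = retriever.get_relevant_documents(query=inputs["question"])
    # Keep each candidate's UID and rerank score for the mapping table
    metadata = pd.DataFrame([d.metadata for d in docs], columns=['uid', 'relevance_score'])
    docs = [d.page_content for d in docs]
    print(docs)
    # "contexts" is the text block that fills the {contexts} slot of the generation prompt
    docs_dict = {
        "query": inputs["question"],
        "contexts": "\n---\n".join(docs),
        "metadata": metadata
    }
    return docs_dict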

retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts", "metadata"],
    transform=retrieval_transform
)

In [107]:

rag_chain = SequentialChain(
    chains=[retrieval_chain, gen_chain],
    input_variables=["question"],  # we need to name differently to output "query"
    output_variables=["query", "contexts", "text", "metadata"],
)

### Execute the Model

In [108]:
def format_mapped_vars(out, user_var):

    return pd.DataFrame(
        {
            "user_variable": user_var,
            "user_description": out["query"],
            "uid": out["metadata"]["uid"],
            "relevance_score": out["metadata"]["relevance_score"],
        }
    )

In [109]:
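def format_llm_res_df(out, user_var):
    # out["text"] is the parsed LineList object rather than a plain string, which is
    # why the llm_response column displays as "(lines, [...])" in the outputs below
    return pd.DataFrame(
        {
            "user_variable": user_var,
            "user_description": out["query"],
            "llm_response": out["text"]
        }
    )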

In [110]:
def execute_rag_chain(df, mapped_vars_df, llm_res_df):
    total_length = len(df)
    print("Starting execution of RAG chain.")
    print(f"Total length of DataFrame: {total_length}")

    try:
        # max(1, ...) avoids a modulo-by-zero when there are fewer than 4 rows
        step = max(1, total_length // 4)
        # Count positions from 1 so progress does not depend on the DataFrame's index labels
        for position, (_, row) in enumerate(df.iterrows(), start=1):
            question = row[USER_SUBMITTED_LABEL_COL]
            print(question)
            out = rag_chain({"question": question})
            mapped_vars_df = pd.concat([mapped_vars_df, format_mapped_vars(out, row[USER_SUBMITTED_VARIABLE_COL])])
            llm_res_df = pd.concat([llm_res_df, format_llm_res_df(out, row[USER_SUBMITTED_VARIABLE_COL])])

            # Log progress at roughly every 25% interval
            if position % step == 0:
                print(f"Processed {position} rows, which is {position / total_length:.0%} of the total.")

        print("Processing completed.")
        return mapped_vars_df, llm_res_df
    except Exception as e:
        print(f"An error occurred: {e}")
        return mapped_vars_df, llm_res_df

## Map to original data

In [111]:
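mapped_vars_df = pd.DataFrame(columns=[
    "user_variable",  # variable name from the user's data dictionary
    "user_description",  # out["query"]: the label submitted for that variable
    "uid",  # out["metadata"]["uid"]: UID of a retrieved repository variable
    "relevance_score",  # out["metadata"]["relevance_score"]: rerank score for that match
])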


In [112]:
llm_res_df = pd.DataFrame(columns=[
    "user_variable",  # variable name from the user's data dictionary
    "user_description",  # out["query"]: the label submitted for that variable
    "llm_response",  # out["text"]: the parsed LLM response
])


In [113]:
mapped_vars_df, llm_res_df = execute_rag_chain(df, mapped_vars_df, llm_res_df)

A project-specific identifier for a node. This property is the calling card/nickname/alias for a unit of submission. It can be used in place of the UUID for identifying or recalling a node.
['UNIQUE COMMUNITY IDENTIFIER', 'UNIQUE SUBJECT ID', 'UNIQUE SUBJECT IDENTIFIER']


Unique identifier that can be used to retrieve more information for a participant.
['PARTICIPANT IDENTIFICATION', 'PARTICIPANT ID', 'PARTICIPANT ID NUMBER']
Data Use Restrictions that are used to indicate permissions/restrictions for datasets and/or materials, and relates to the purposes for which datasets and/or material might be removed, stored or used.Based on the Data Use Ontology : see http://www.obofoundry.org/ontology/duo.htmlGRU - For general research use for any research purpose. Example of usage: This includes but is not limited to: health/medical/biomedical purposes, fundamental biology research, the study of population origins or ancestry, statistical methods and algorithms development, and social-sciences research.HMB - Use of the data is limited to health/medical/biomedical purposes; does not include the study of population origins or ancestry.DS-X - Use of the data must be related to [disease].NPU - Use of the data is limited to not-for-profit organizations and not-for-p

Query has been truncated from the right to 256 tokens from 1164 tokens.


['IS PATIENT ADDRESS IN ARIC COMMUNITY SURVEILLANCE CATCHMENT AREA? Q5B', 'ARIC SURVEILLANCE EVENT ID (CIR)', 'FRAMINGHAM HEART STUDY PARTICIPANT COHORT IDENTIFIER']
"Indicates whether the time point reference begins at the donor's actual death, presumed death, or when the cross clamp was applied in the case of surgical donors."
['TIME OF DEATH', 'DNR TIME (RELATIVE TO EDARTM)', 'TIME OF CLINICAL DEATH']
If the Donor was an Organ Donor, which Organs were donated.
['BONE MARROW DONOR', 'WHAT ORGAN WAS TRANSPLANTED?', 'BLOOD TYPE OF DONOR']
harmonization unit for site (A "harmonization unit" is a defined group of subjects whose phenotypes can be similarly processed.) (HARMONIZED)
["UNIT HAS A MATCH [SITE'S OPINION]", 'UNIT', 'OBSERVATION UNIT']
RV Cardiac output (l/min) (HARMONIZED)
['RV CARDIAC OUTPUT (ML/MIN)', 'CARDIAC OUTPUT (L/MIN)', 'MEASURED CARDIAC OUTPUT (L/MIN)']
age at measurement of bmi (years) (HARMONIZED)
['AGE AT MEASUREMENT', 'ADULT 1 BMI AT YEAR 1', 'ADULT 4 BMI AT YEAR 

In [114]:
mapped_vars_df['uid'] = mapped_vars_df['uid'].astype(int)
mapped_vars_df.to_csv('llm_output/mapped_vars_df.csv', index=False)

mapped_vars_df.head()


Unnamed: 0,user_variable,user_description,uid,relevance_score
0,submitter_id,A project-specific identifier for a node. This...,242980.0,0.338842
1,submitter_id,A project-specific identifier for a node. This...,242989.0,0.266835
2,submitter_id,A project-specific identifier for a node. This...,242990.0,0.264358
0,participant_id,Unique identifier that can be used to retrieve...,177036.0,0.944074
1,participant_id,Unique identifier that can be used to retrieve...,177033.0,0.941331


In [115]:
llm_res_df.to_csv('llm_output/llm_res_df.csv', index=False)


llm_res_df.head()

Unnamed: 0,user_variable,user_description,llm_response
0,submitter_id,A project-specific identifier for a node. This...,"(lines, [Based on the description provided for..."
0,participant_id,Unique identifier that can be used to retrieve...,"(lines, [Based on the provided context, the be..."
0,consent_codes,Data Use Restrictions that are used to indicat...,"(lines, [Based on the provided context, the ne..."
0,amputation_type,"If amputated, the amputation type for leg, abo...","(lines, [Based on the provided context, the ne..."
0,cohort_id,The study subgroup that the participant belong...,"(lines, [Based on the context provided, the ne..."


In [42]:
llm_res_df


Unnamed: 0,user_variable,user_description,llm_response
0,CONSENT,Consent group as determined by DAC,"(lines, [Based on the provided existing variab..."
0,SUBJECT_SOURCE,Source repository where subjects originate,"(lines, [Based on the context provided, the ne..."
0,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,"(lines, [Based on the provided contexts, the b..."
0,AFFECTION_STATUS,Affection status,"(lines, [Based on the provided existing variab..."
0,SAMPLE_SOURCE,Source repository where samples originate,"(lines, [Based on the information provided, th..."


In [50]:
mapped_vars_df


Unnamed: 0,user_variable,user_description,uid,relevance_score
0,CONSENT,Consent group as determined by DAC,52783,0.454563
1,CONSENT,Consent group as determined by DAC,52786,0.124745
2,CONSENT,Consent group as determined by DAC,52785,0.066448
0,SUBJECT_SOURCE,Source repository where subjects originate,216548,0.084342
1,SUBJECT_SOURCE,Source repository where subjects originate,216503,0.0771
2,SUBJECT_SOURCE,Source repository where subjects originate,226461,0.061876
0,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226592,0.549127
1,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226431,0.45051
2,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,242989,0.208018
0,AFFECTION_STATUS,Affection status,16208,0.858244


### Map to original variables

In [51]:
import pandas as pd

tsv_file_path = './input_data/biolincc_dedupLUT.tsv'
df = pd.read_csv(tsv_file_path, delimiter='\t')

df.head()


Unnamed: 0,Study,Variable,Label,UID
0,AADR,AUGDATE_FZD,"AUGMENTATION DATE, FZD",25106
1,AADR,CENTER,CENTER WHERE TESTING DONE,44791
2,AADR,CLINIC,CLINIC,48834
3,AADR,CLINIC,CLINICAL CENTER,48894
4,AADR,CLINIC,CLINICAL CENTER CODE,48895


In [119]:
# Spot-check a single UID in the lookup table
df[df['UID'] == 52785]

mapped_vars_df

Unnamed: 0,user_variable,user_description,uid,relevance_score,related_vars
0,CONSENT,Consent group as determined by DAC,52783,0.454563,"[{'Study': 'NANBTAH', 'Variable': 'HAVEORIG', ..."
1,CONSENT,Consent group as determined by DAC,52786,0.124745,"[{'Study': 'TAAG', 'Variable': 'DSCB4', 'Label..."
2,CONSENT,Consent group as determined by DAC,52785,0.066448,"[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Lab..."
3,SUBJECT_SOURCE,Source repository where subjects originate,216548,0.084342,"[{'Study': 'NANBTAH', 'Variable': 'HAVEORIG', ..."
4,SUBJECT_SOURCE,Source repository where subjects originate,216503,0.0771,"[{'Study': 'BABYHUG', 'Variable': 'SOURCE', 'L..."
5,SUBJECT_SOURCE,Source repository where subjects originate,226461,0.061876,"[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Lab..."
6,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226592,0.549127,"[{'Study': 'NANBTAH', 'Variable': 'HAVEORIG', ..."
7,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226431,0.45051,"[{'Study': 'ALVEOLI', 'Variable': 'PTID', 'Lab..."
8,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,242989,0.208018,"[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Lab..."
9,AFFECTION_STATUS,Affection status,16208,0.858244,"[{'Study': 'NANBTAH', 'Variable': 'HAVEORIG', ..."


In [128]:
# Reset to a unique 0..n-1 index so each row can be addressed unambiguously
mapped_vars_df = mapped_vars_df.reset_index(drop=True)

related_vars_column = []
for index, row in mapped_vars_df.iterrows():
    uid = row['uid']
    print(f'--- UID: {uid} --  INDEX: {index} --------------------------------------------------------------')
    # All lookup-table rows that share this UID
    matching_rows = df[df['UID'] == uid]

    related_vars = []
    for _, matching_row in matching_rows.iterrows():

        related_vars.append({
            'Study': matching_row['Study'].strip(),
            'Variable': matching_row['Variable'].strip(),
            'Label': matching_row['Label'].strip(),
            'UID': matching_row['UID']
        })
    
    print(related_vars)
    related_vars_column.append(related_vars)

# Assign the whole column at once instead of writing a list into one cell at a time
mapped_vars_df['related_vars'] = related_vars_column


mapped_vars_df

--- UID: 52783 --  INDEX: 0 --------------------------------------------------------------
[{'Study': 'BMT0701', 'Variable': 'CNSTRS', 'Label': 'CONSENT RESEARCH SAMPLE', 'UID': 52783}]
--- UID: 52786 --  INDEX: 1 --------------------------------------------------------------
[{'Study': 'TAAG', 'Variable': 'DSCB4', 'Label': 'CONSENT STATUS', 'UID': 52786}, {'Study': 'TAAG', 'Variable': 'DSCC4', 'Label': 'CONSENT STATUS', 'UID': 52786}]
--- UID: 52785 --  INDEX: 2 --------------------------------------------------------------
[]
--- UID: 216548 --  INDEX: 3 --------------------------------------------------------------
[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Label': 'SOURCE FILE', 'UID': 216548}]
--- UID: 216503 --  INDEX: 4 --------------------------------------------------------------
[{'Study': 'BABYHUG', 'Variable': 'SOURCE', 'Label': 'SOURCE', 'UID': 216503}, {'Study': 'IPPB', 'Variable': 'SOURCE', 'Label': 'SOURCE', 'UID': 216503}, {'Study': 'WLKPHST', 'Variable': 'SOURCE', 'La

Unnamed: 0,user_variable,user_description,uid,relevance_score,related_vars
0,CONSENT,Consent group as determined by DAC,52783,0.454563,"[{'Study': 'BMT0701', 'Variable': 'CNSTRS', 'L..."
1,CONSENT,Consent group as determined by DAC,52786,0.124745,"[{'Study': 'TAAG', 'Variable': 'DSCB4', 'Label..."
2,CONSENT,Consent group as determined by DAC,52785,0.066448,[]
3,SUBJECT_SOURCE,Source repository where subjects originate,216548,0.084342,"[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Lab..."
4,SUBJECT_SOURCE,Source repository where subjects originate,216503,0.0771,"[{'Study': 'BABYHUG', 'Variable': 'SOURCE', 'L..."
5,SUBJECT_SOURCE,Source repository where subjects originate,226461,0.061876,"[{'Study': 'LUNGHIV', 'Variable': 'LOCATION', ..."
6,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226592,0.549127,"[{'Study': 'REDS3ZIKA', 'Variable': 'SUBJECT_I..."
7,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226431,0.45051,"[{'Study': 'ALVEOLI', 'Variable': 'PTID', 'Lab..."
8,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,242989,0.208018,"[{'Study': 'LUNGHIV', 'Variable': 'USUBJID', '..."
9,AFFECTION_STATUS,Affection status,16208,0.858244,"[{'Study': 'ACCESS', 'Variable': 'E132', 'Labe..."


In [127]:
mapped_vars_df

Unnamed: 0,user_variable,user_description,uid,relevance_score,related_vars
0,CONSENT,Consent group as determined by DAC,52783,0.454563,"[{'Study': 'BMT0701', 'Variable': 'CNSTRS', 'L..."
1,CONSENT,Consent group as determined by DAC,52786,0.124745,"[{'Study': 'TAAG', 'Variable': 'DSCB4', 'Label..."
2,CONSENT,Consent group as determined by DAC,52785,0.066448,[]
3,SUBJECT_SOURCE,Source repository where subjects originate,216548,0.084342,"[{'Study': 'CSSCD', 'Variable': 'SOURCE', 'Lab..."
4,SUBJECT_SOURCE,Source repository where subjects originate,216503,0.0771,"[{'Study': 'BABYHUG', 'Variable': 'SOURCE', 'L..."
5,SUBJECT_SOURCE,Source repository where subjects originate,226461,0.061876,"[{'Study': 'LUNGHIV', 'Variable': 'LOCATION', ..."
6,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226592,0.549127,"[{'Study': 'REDS3ZIKA', 'Variable': 'SUBJECT_I..."
7,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,226431,0.45051,"[{'Study': 'ALVEOLI', 'Variable': 'PTID', 'Lab..."
8,SOURCE_SUBJECT_ID,Subject ID used in the Source Repository,242989,0.208018,"[{'Study': 'LUNGHIV', 'Variable': 'USUBJID', '..."
9,AFFECTION_STATUS,Affection status,16208,0.858244,"[{'Study': 'ACCESS', 'Variable': 'E132', 'Labe..."
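
The lookup above filters the whole lookup table once per mapped row. An alternative sketch, shown here with plain Python structures, builds a UID-to-rows index a single time and then resolves each UID with a dictionary lookup. The sample rows are copied from the printed output above; `build_uid_index` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# A few lookup rows copied from the printed output above: (Study, Variable, Label, UID).
lookup_rows = [
    ("BMT0701", "CNSTRS", "CONSENT RESEARCH SAMPLE", 52783),
    ("TAAG", "DSCB4", "CONSENT STATUS", 52786),
    ("TAAG", "DSCC4", "CONSENT STATUS", 52786),
    ("CSSCD", "SOURCE", "SOURCE FILE", 216548),
]


def build_uid_index(rows):
    """Group lookup rows by UID so each UID resolves in a single dictionary lookup."""
    index = defaultdict(list)
    for study, variable, label, uid in rows:
        index[uid].append({
            "Study": study.strip(),
            "Variable": variable.strip(),
            "Label": label.strip(),
            "UID": uid,
        })
    return index


uid_index = build_uid_index(lookup_rows)

# .get() returns an empty list for UIDs with no match, such as 52785 in the output above.
related = [uid_index.get(uid, []) for uid in (52783, 52786, 52785)]
for entry in related:
    print(entry)
```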
