# Jupyter notebook sample

In [1]:
# imports

import os
from tqdm import tqdm
from dotenv import load_dotenv
from huggingface_hub import login
import numpy as np
import pickle
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import chromadb
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from openai import OpenAI
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

In [69]:
%matplotlib inline

In [2]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

In [76]:
import src.profile_scraper as profile_scraper
faculty = profile_scraper.FacultyProfileScraper("https://sph.uth.edu/faculty/?fac=O5VHhdEYnwuzkAQcMguuOA==")
faculty.text

'<< Back\nYunxin Fu\nProfessor\nBiostatistics\n713/500-9813\nReuel Stallones Building\n1200 Pressler Street, Houston, TX 77030\nView CV\nAbout\nI was trained as a biostatistician specialized in computational biology, and have spent much of my career in developing population genetics theory and statistical methods for analyzing population samples of DNA sequences, including algorithms for simulating samples for the analysis of large scale data. Recently I have been involved in the analysis of polymorphism data from the 1000 Genomes Project and related large data sets. In addition to my continuous work on population genetics theory and evolution,  my recent interest include within–individual polymorphism generated by both classical experiments and next generation sequencing for the purpose of understanding the mutational process during individual development, and  population genetics of T. thermorphila.\nCenter Affiliation\nHuman Genetics Center\nResearch Interests\nBig Data/Data Science

In [72]:
from src.summarize import summarize
summarize("https://sph.uth.edu/faculty/?fac=O5VHhdEYnwuzkAQcMguuOA==")

API key found and looks good so far!


'# Summary of Yunxin Fu - Faculty - UTHealth Houston School of Public Health\n\nYunxin Fu is a professor of Biostatistics at UTHealth Houston School of Public Health. He has expertise in computational biology and has contributed significantly to population genetics, developing statistical methods to analyze large-scale DNA sequence data. His recent research focuses on the analysis of polymorphism data from the 1000 Genomes Project and includes studying within-individual polymorphism and the genetics of *T. thermophila*.\n\n## Research Interests\n- Big Data/Data Science\n- Biostatistics\n- Genetics and Omics\n- Infectious Disease\n\n**Contact Information**  \n- Phone: 713/500-9813  \n- Location: Reuel Stallones Building, 1200 Pressler Street, Houston, TX 77030  \n\nThere are no news or announcements mentioned on this website. \n\nThe website does not indicate whether any device has to be bought in the US, as it primarily focuses on academic and research information.'

In [3]:
# load the faculty JSON file into a list of dictionaries
import json

with open('faculty_data.json', 'r', encoding='utf-8') as f:
    faculty_data = json.load(f)

faculty_data[0]

{'id': 'RxApW409EWwpnikiXFRn8g==',
 'first_name': 'Robert',
 'last_name': 'Addy',
 'campus': 'Houston',
 'department': 'Health Promotion & Behavioral Sciences',
 'center': 'Center for Health Promotion and Prevention Research',
 'research_interests': ['Adolescent Health',
  'Biostatistics',
  'Health Education/Behavioral Sciences',
  'Program Evaluation',
  'Sexual Health'],
 'image_url': 'https://web.sph.uth.edu/thumbs/addy.jpg',
 'full_name': 'Robert Addy',
 'profile_url': 'https://sph.uth.edu/faculty/?fac=RxApW409EWwpnikiXFRn8g==',
 'title': 'Faculty Associate',
 'department_url': 'https://sph.uth.edu/dept/hpbs/',
 'cv_url': 'https://web.sph.uth.edu/cv/raddy.pdf',
 'about': 'Dr. Addy has over 25 years of research experience, including collecting, managing, and analyzing data in multiple contexts.  He has served as data manager for multiple projects involving single and multiple sites, as well as cross-sectional and longitudinal community-based studies in diverse settings (e.g. childr

In [4]:
# get unique departments
departments = list(set([faculty['department'] for faculty in faculty_data]))
departments



['Health Promotion & Behavioral Sciences',
 'Environmental & Occupational Health Sciences',
 'Biostatistics',
 'Epidemiology',
 'Management, Policy & Community Health']

In [5]:
# format each faculty record as a single text document
faculty_documents = [
    f'''name: {faculty.get('full_name', 'no information')}
    title: {faculty.get('title', 'no information')}
    about: {faculty.get('about', 'no information')}
    research interests: {faculty.get('research_interests', 'no information')}
    affiliation: {faculty.get('affiliation', 'no information')}'''
    for faculty in faculty_data
]
faculty_documents[0]

"name: Robert Addy\n    title: Faculty Associate\n    about: Dr. Addy has over 25 years of research experience, including collecting, managing, and analyzing data in multiple contexts.  He has served as data manager for multiple projects involving single and multiple sites, as well as cross-sectional and longitudinal community-based studies in diverse settings (e.g. children, adolescents, adults, schools, clinics, and communities) and with diverse populations, including Hispanic, African-American, American Indian and Alaska Native populations.  He has extensive experience in compiling, cleaning, and processing data in preparation for analysis and distribution.  He is part of a teaching team that provides graduate level classes in program evaluation and data management at UTSPH.\n    research interests: ['Adolescent Health', 'Biostatistics', 'Health Education/Behavioral Sciences', 'Program Evaluation', 'Sexual Health']\n    affiliation: no information"

In [6]:
# vectorize the faculty documents
# Option 1 (local, not used below): a SentenceTransformer model
# vectorizer = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Option 2 (used in the rest of the notebook): OpenAI embeddings
vectorizer = OpenAIEmbeddings(model="text-embedding-ada-002",
    openai_api_key=os.getenv('OPENAI_API_KEY'))

In [7]:
DB = "faculties_vectorstore"

In [8]:
client = chromadb.PersistentClient(path=DB)

In [9]:
# Check if the collection exists and delete it if it does
collection_name = "faculties"

# For old versions of Chroma, use this line instead of the subsequent one
# existing_collection_names = [collection.name for collection in client.list_collections()]
existing_collection_names = client.list_collections()

if collection_name in existing_collection_names:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")

collection = client.create_collection(collection_name)

Deleted existing collection: faculties
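
The commented-out line in the cell above reflects that `list_collections()` has returned collection objects in some Chroma releases and plain names in others. A minimal sketch of a helper that tolerates both shapes follows. `collection_names` is a hypothetical name, and the stand-in objects only mimic a `.name` attribute; they are not Chroma classes.

```python
from types import SimpleNamespace


def collection_names(listed) -> list[str]:
    """Return collection names whether items are strings or objects with .name."""
    return [item if isinstance(item, str) else item.name for item in listed]


# Stand-ins for the two shapes list_collections() may return (assumed).
as_names = ["faculties", "other"]
as_objects = [SimpleNamespace(name="faculties"), SimpleNamespace(name="other")]

print(collection_names(as_names))
print(collection_names(as_objects))
```

With such a helper, the membership test could read `collection_name in collection_names(client.list_collections())` regardless of the installed version.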


In [10]:

# Alternative when using the SentenceTransformer model instead of OpenAI:
# vectors = vectorizer.encode(faculty_documents).astype(float).tolist()
vectors = vectorizer.embed_documents(faculty_documents)
# metadata: name, campus, and department for each faculty member
metadatas = [{"name": faculty["first_name"] + " " + faculty["last_name"],
    "campus": faculty["campus"],
    "department": faculty['department']} for faculty in faculty_data]
ids = [f"doc_{j}" for j in range(len(faculty_data))]
collection.add(
    ids=ids,
    documents=faculty_documents,
    embeddings=vectors,
    metadatas=metadatas
)
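
The cell above embeds every document in one `embed_documents` call and stores them with one `collection.add`. For a larger dataset it may be preferable to work in batches. The sketch below shows only the slicing logic; `batched` is a hypothetical helper and the batch size of 100 is an arbitrary illustration, not a Chroma or OpenAI limit.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = [f"document {i}" for i in range(250)]
ids = [f"doc_{i}" for i in range(len(docs))]

for doc_batch, id_batch in zip(batched(docs, 100), batched(ids, 100)):
    # In the notebook, the embedding call and collection.add would go here.
    print(len(doc_batch), id_batch[0], id_batch[-1])
```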

In [11]:
collection = client.get_or_create_collection('faculties')

In [12]:
result = collection.get(include=['embeddings', 'documents', 'metadatas'])

In [13]:
CATEGORIES = ['Epidemiology',
 'Management, Policy & Community Health',
 'Biostatistics',
 'Environmental & Occupational Health Sciences',
 'Health Promotion & Behavioral Sciences']
COLORS = ['red', 'blue', 'brown', 'orange', 'yellow']

In [14]:
# Prework
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
categories = [metadata['department'] for metadata in result['metadatas']]
colors = [COLORS[CATEGORIES.index(c)] for c in categories]
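
`CATEGORIES.index(c)` raises `ValueError` if a record carries a department that is not in `CATEGORIES`. A sketch of a dictionary lookup with a fallback color is shown below. The `'gray'` default and the `'Unknown Department'` value are assumptions added for illustration.

```python
CATEGORIES = ['Epidemiology',
              'Management, Policy & Community Health',
              'Biostatistics',
              'Environmental & Occupational Health Sciences',
              'Health Promotion & Behavioral Sciences']
COLORS = ['red', 'blue', 'brown', 'orange', 'yellow']

# Map each department to its color once, instead of searching the list per record.
COLOR_BY_DEPARTMENT = dict(zip(CATEGORIES, COLORS))


def color_for(department: str, default: str = "gray") -> str:
    """Return the plot color for a department, or `default` if it is unknown."""
    return COLOR_BY_DEPARTMENT.get(department, default)


sample_departments = ["Biostatistics", "Epidemiology", "Unknown Department"]
print([color_for(d) for d in sample_departments])
```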

In [64]:
# Let's try a 2D chart

tsne = TSNE(n_components=2, random_state=42, n_jobs=-1)
reduced_vectors = tsne.fit_transform(vectors)
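
scikit-learn's `TSNE` uses a default `perplexity` of 30 and requires it to be smaller than the number of samples, so the call above can fail on a very small collection. The helper below is a sketch of one way to pick a value. `safe_perplexity` and the "one third of the samples" rule are illustrative assumptions, not a scikit-learn recommendation.

```python
def safe_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Pick a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    # Heuristic (assumed): roughly a third of the samples, at least 1.
    return min(preferred, max(1.0, (n_samples - 1) / 3))


for n in (2, 10, 1000):
    print(n, safe_perplexity(n))
```

The result could then be passed as `TSNE(n_components=2, perplexity=safe_perplexity(len(vectors)), random_state=42)`.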

In [65]:
# Create the 2D scatter plot with hover information
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=3, color=colors, opacity=0.7),
    # Add hover information
    text=[f"Name: {metadata['name']}<br>Department: {metadata['department']}" 
          for metadata in result['metadatas']],
    hoverinfo='text',
    hovertemplate="%{text}<extra></extra>"  # This removes the x,y coordinates from hover
)])

fig.update_layout(
    title='2D Chroma Vectorstore Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=600,
    height=400,
    margin=dict(r=20, b=10, l=10, t=40),
    # Add hover mode
    hovermode='closest'
)

fig.show()

In [17]:
def find_similars(description):
    # Embed the description and fetch the 5 nearest faculty documents
    query_vector = vectorizer.embed_query(description)
    results = collection.query(query_embeddings=[query_vector], n_results=5)
    documents = results['documents'][0]
    names = [m['name'] for m in results['metadatas'][0]]
    return documents, names
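
`collection.query` returns the stored documents whose embeddings lie nearest to the query embedding. As a rough illustration of that nearest-neighbor step, the sketch below ranks toy 3-dimensional vectors by cosine similarity. The vectors and labels are made up, and cosine similarity is an assumption chosen for readability; the distance Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda label: cosine_similarity(query, vectors[label]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (hypothetical), standing in for the stored faculty vectors.
toy_vectors = {
    "epidemiology": [0.9, 0.1, 0.0],
    "biostatistics": [0.2, 0.9, 0.1],
    "health policy": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors))
```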

In [18]:
find_similars("I am looking for a faculty member who is an expert in epidemiology.")

(["name: Elena Feofanova\n    title: Assistant Professor Non-Tenure Instruction\n    about: no information\n    research interests: ['Epidemiology']\n    affiliation: no information",
  "name: Susan Fisher-Hoch\n    title: Professor Non-Tenure Research\n    about: I am trained as a physician with a doctoral degree in epidemiology from London University with Membership of the Royal College of Pathology in virology.  After a long career at the Centers for Disease Control and Prevention working with viral hemorrhagic fevers, in poor communities in developing countries. I designed and directed the French Biosafety 4 Level principally for studies of Ebola and Lassa fever for which I received the Légion d’Honneur.  I joined the Brownsville campus of the School of Public Health to work in health disparity population health in 2001.  In 2004 I helped found the Cameron County Hispanic Cohort (n=5000) which I direct.  This is a randomly selected community based cohort of health disparity Mexican

In [19]:
def make_context(similars):
    message = "To provide some context, here are some faculty members that might be relevant to your description.\n\n"
    documents, names = similars
    for similar, name in zip(documents, names):
        message += f"Potentially related faculty:\n{name}\n{similar}\n\n"
    return message
make_context(find_similars("I am looking for a faculty member who is an expert in epidemiology."))

"To provide some context, here are some faculty members that might be relevant to your description.\n\nPotentially related faculty:\nElena Feofanova\nname: Elena Feofanova\n    title: Assistant Professor Non-Tenure Instruction\n    about: no information\n    research interests: ['Epidemiology']\n    affiliation: no information\n\nPotentially related faculty:\nSusan Fisher-Hoch\nname: Susan Fisher-Hoch\n    title: Professor Non-Tenure Research\n    about: I am trained as a physician with a doctoral degree in epidemiology from London University with Membership of the Royal College of Pathology in virology.  After a long career at the Centers for Disease Control and Prevention working with viral hemorrhagic fevers, in poor communities in developing countries. I designed and directed the French Biosafety 4 Level principally for studies of Ebola and Lassa fever for which I received the Légion d’Honneur.  I joined the Brownsville campus of the School of Public Health to work in health dispar

In [20]:
def messages_for(description, similars):
    system_message = "You are an academic advisor. You estimate the relevance of faculty members to a given description. Suggest relevant faculty members. You should give an explanation for your choice in markdown format."
    user_prompt = f"Here is my description: {description}\n\n"
    user_prompt += make_context(similars)
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt}
    ]
messages_for("I am looking for a faculty member who is an expert in epidemiology.", find_similars("I am looking for a faculty member who is an expert in epidemiology."))

[{'role': 'system',
  'content': 'You are an academic advisor. You estimate the relevance of faculty members to a given description. Suggest relevant faculty members. You should give an explanation for your choice in markdown format.'},
 {'role': 'user',
  'content': "Here is my description: I am looking for a faculty member who is an expert in epidemiology.\n\nTo provide some context, here are some faculty members that might be relevant to your description.\n\nPotentially related faculty:\nElena Feofanova\nname: Elena Feofanova\n    title: Assistant Professor Non-Tenure Instruction\n    about: no information\n    research interests: ['Epidemiology']\n    affiliation: no information\n\nPotentially related faculty:\nSusan Fisher-Hoch\nname: Susan Fisher-Hoch\n    title: Professor Non-Tenure Research\n    about: I am trained as a physician with a doctoral degree in epidemiology from London University with Membership of the Royal College of Pathology in virology.  After a long career at the C

In [21]:
# environment

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')

openai = OpenAI()

In [22]:


def gpt_4o_mini_rag(description):
    similars = find_similars(description=description)
    response = openai.chat.completions.create(
        model="gpt-4o-mini", 
        messages=messages_for(description, similars),
        seed=42,
        stream=True
    )
    full_response = ""
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            full_response += chunk.choices[0].delta.content
    return full_response

# print in markdown format
from IPython.display import Markdown, display

def display_markdown_response(description):
    response = gpt_4o_mini_rag(description)
    display(Markdown(response))

display_markdown_response("I am keen on clinical trials. Can you suggest faculty members who are experts in developing statistical methods for clinical trials?")

Based on your keen interest in clinical trials and your emphasis on the development of statistical methods for clinical trials, I recommend the following faculty members:

### 1. **Samiran Ghosh**
- **Title:** Professor and Vice Chair, Department of Biostatistics and Data Science
- **Expertise:**
  - Focuses on developing statistical methods targeted at precision medicine through clinical trials.
  - Research areas include various methodologies such as Non-inferiority and Bayesian Adaptive trials.
- **Why Relevant:** His extensive experience in methodological research directly aligns with your interest in statistical methods for clinical trials, making him an excellent resource for guidance in this area.

### 2. **Jose-Miguel Yamal**
- **Title:** Professor and Director, Coordinating Center for Clinical Trials
- **Expertise:**
  - Specialized in the design and analysis of clinical and diagnostic trials.
  - Involved in significant NIH-sponsored research and comparative effectiveness trials.
- **Why Relevant:** As a clinical trialist, his comprehensive background in both the design and statistical analysis of trials makes him an ideal faculty member for someone interested in clinical trials.

### 3. **Folefac Atem**
- **Title:** Associate Professor
- **Expertise:**
  - Develops statistical methodologies for both observational and experimental studies, focusing on issues including censored data and causal inference.
- **Why Relevant:** His work on developing statistical methodologies tailored for clinical and translational research directly ties to your interest in clinical trials, especially in handling complexities within trial data.

### 4. **Luis Leon Novelo**
- **Title:** Associate Professor
- **Expertise:**
  - Works on developing statistical methods for analyzing various types of biomedical data, including clinical trials.
- **Why Relevant:** His involvement with clinical trial analysis and his methodological focus on Bayesian non-parametric statistics align closely with your interest in statistical methods for clinical trials, providing a solid resource for your ambitions in this field.

### 5. **Baojiang Chen**
- **Title:** Professor
- **Expertise:**
  - Focuses on methodological research, particularly with missing data and longitudinal data, in the context of clinical applications.
- **Why Relevant:** His expertise in handling complex data structures and applications in clinical trials complements your interest in the statistical aspects of clinical trial design and analysis.

### Summary
Each of these faculty members has demonstrated expertise in clinical trial methodology and statistical methods, making them valuable contacts for your academic and research endeavors. You would benefit from engaging with them to deepen your understanding of statistical methods in clinical trials.
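
`gpt_4o_mini_rag` requests a streamed response but returns only the final concatenated text. If partial output is wanted (for example, to refresh a display as text arrives), the accumulation loop can be written as a generator. The sketch below uses stand-in chunk objects instead of a real API call. `fake_chunk` and `accumulate` are hypothetical names, and the stand-ins copy only the `choices[0].delta.content` attribute path used in the cell above.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build a stand-in object exposing choices[0].delta.content."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream):
    """Yield the growing response text after each non-empty chunk."""
    parts = []
    for chunk in stream:
        piece = chunk.choices[0].delta.content
        if piece:  # skip chunks whose content is None or empty
            parts.append(piece)
            yield "".join(parts)


stream = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
for partial in accumulate(stream):
    print(partial)
```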

# LangChain implementation

In [23]:
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_core.callbacks import StdOutCallbackHandler

In [59]:
# Delete any existing collection so the vectorstore is rebuilt from scratch
if os.path.exists("faculties_vectorstore"):
    Chroma(persist_directory="faculties_vectorstore",
            embedding_function=OpenAIEmbeddings()).delete_collection()


The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the langchain-chroma package and should be used instead. To use it run `pip install -U langchain-chroma` and import as `from langchain_chroma import Chroma`.



In [60]:
from langchain.schema import Document
# Create Chroma vector store using LangChain
# Create documents in the correct format for LangChain
documents = []
for doc, faculty in zip(faculty_documents, faculty_data):
    # Convert research interests list to a comma-separated string
    research_interests = ', '.join(faculty.get('research_interests', []))
    
    # Create a Document object with the faculty information
    document = Document(
        page_content=doc,  # The main content of the document
        metadata={
            "name": f"{faculty['first_name']} {faculty['last_name']}",
            "department": faculty['department'],
            "title": faculty.get('title', 'No title'),
            "research_interests": research_interests,  # Now a string instead of a list
            "about": faculty.get('about', 'No information')
        }
    )
    documents.append(document)
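
The cell above joins `research_interests` into a string because the metadata values are stored as simple scalars rather than lists. A sketch of a more general helper that applies the same idea to every field of a record is shown below. `flatten_metadata` is a hypothetical name, and the choice to drop `None` values and stringify unknown types is an assumption.

```python
def flatten_metadata(record: dict, separator: str = ", ") -> dict:
    """Return a copy of `record` whose values are all str, int, float, or bool."""
    flat = {}
    for key, value in record.items():
        if value is None:
            continue  # assumed: omit missing values instead of storing them
        if isinstance(value, (list, tuple)):
            flat[key] = separator.join(str(v) for v in value)
        elif isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            flat[key] = str(value)
    return flat


example = {
    "name": "Robert Addy",
    "department": "Health Promotion & Behavioral Sciences",
    "research_interests": ["Adolescent Health", "Biostatistics"],
    "cv_url": None,
}
print(flatten_metadata(example))
```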

In [61]:

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=OpenAIEmbeddings(),
    persist_directory="faculties_vectorstore"
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal marginal relevance
    search_kwargs={
        "k": 10,  # Number of documents returned to the chain
        "fetch_k": 20,  # Number of candidates fetched for MMR to choose from
        "lambda_mult": 0.7  # 1 favors relevance, 0 favors diversity
    }
)

# Initialize LLM
llm = ChatOpenAI(temperature=0.7, model_name="gpt-4")

# Set up memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create conversation chain
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    callbacks=[StdOutCallbackHandler()]
)

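def chat(query, history):
    """
    Chat with the faculty database using RAG.

    `history` is accepted to match the chat-interface signature but is unused;
    the chain's ConversationBufferMemory tracks the conversation instead.
    """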
    result = conversation_chain.invoke({"question": query})
    return result["answer"]
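
The retriever above uses `search_type="mmr"`, which trades relevance to the query against similarity to documents already selected. The sketch below is a simplified, pure-Python version of that idea on toy 2-dimensional vectors. It is not LangChain's implementation; the `mmr` and `cosine` helpers and the vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.7):
    """Return indices of up to k candidates chosen by a simple MMR rule."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        # Penalize similarity to anything already picked.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[1.0, 0.05], [1.0, 0.06], [0.6, 0.8]]
print(mmr(query, candidates, k=2, lambda_mult=1.0))
print(mmr(query, candidates, k=2, lambda_mult=0.3))
```

With `lambda_mult=1.0` the two near-duplicates (indices 0 and 1) are selected; with `lambda_mult=0.3` the second pick becomes index 2, the less redundant candidate.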

In [62]:
# Example usage
query = "I am looking for a faculty member who is an expert in epidemiology."
response = chat(query, [])
print("\nResponse:", response)



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
name: Elena Feofanova
    title: Assistant Professor Non-Tenure Instruction
    about: no information
    research interests: ['Epidemiology']
    affiliation: no information

name: Susan Fisher-Hoch
    title: Professor Non-Tenure Research
    about: I am trained as a physician with a doctoral degree in epidemiology from London University with Membership of the Royal College of Pathology in virology.  After a long career at the Centers for Disease Control and Prevention working with viral hemorrhagic fevers, in poor communities in developing countries. I designed and directed the French Biosafety 4 Level prin

In [63]:
import gradio as gr
view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7877
* To create a public link, set `share=True` in `launch()`.




> Entering new ConversationalRetrievalChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: I am looking for a faculty member who is an expert in epidemiology.
Assistant: There are several faculty members who are experts in epidemiology. Here are a few:

1. Elena Feofanova is an Assistant Professor Non-Tenure Instruction. Her research interest includes Epidemiology.

2. Susan Fisher-Hoch is a Professor Non-Tenure Research. She is trained as a physician with a doctoral degree in epidemiology from London University. Her research interests include Cancer, Cardiovascular and Chronic Diseases, Clinical Trials, Epidemiology, and Infectious Disease.

3. Janelle Rios, a Faculty Associate, has primary interests in epidemiology, biosafety, hazardous materials management, and dis