In [1]:
import ollama
import faiss
import numpy as np
import os
from pypdf import PdfReader

In [9]:
# Function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    text = ""
    with open(file_path, "rb") as file:
        reader = PdfReader(file)
        # Iterate through all pages and extract text
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += (page.extract_text() or "") + " "
    return text.strip()

# Directory containing the PDF files
file_path = "C:/Users/kjoshi/Downloads/gemma/pdfs"

# List all PDF files in the directory
pdf_files = [f for f in os.listdir(file_path) if f.lower().endswith('.pdf')]

# Construct full file paths
full_pdf_paths = [os.path.join(file_path, pdf) for pdf in pdf_files]

# Extract text from each PDF
documents = [extract_text_from_pdf(pdf) for pdf in full_pdf_paths]

# Optionally, print the documents to check
for doc in documents:
    print(doc[:500])  # Print the first 500 characters of each document

1 Developmental regulators drive DUX4 expression in facioscapulohumeral muscular dystrophy   1 
 2 
Authors  3 
Amelia Fox1, Jonathan Oliva1, Rajanikanth Vangipurapu1, and Francis M. Sverdrup1* 4 
 5 
Affiliations   6 
1Department of Biochemistry and Molecular Biology, Saint Louis University School of Medicine, Saint 7 
Louis, Missouri.  8 
 9 
*Corresponding Author: Francis M. Sverdrup; email: fran.sverdrup@health.slu.edu. 10 
 11 
Abstract  12 
 13 
Facioscapulohumeral muscular dystrophy (FSHD
CURRENTOPINION The FSHD jigsaw: are we placing the tiles in the
right position?
Valentina Salsia, Gaetano Nicola Alfio Vattemib
and Rossella Ginevra Tuplera,c,d
Purpose of review
Facioscapulohumeral muscular dystrophy (FSHD) is one of the most common myopathies, involving over870,000 people worldwide and over 20 FSHD national registries. Our purpose was to summarize themain objectives of the scientific community on this topic and the moving trajectories of research from the
past to the present.
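
The first preview above shows manuscript line numbers (1, 2, 3, ...) carried over from the PDF at the end of each line. The sketch below is one possible way to strip them before embedding. It is not part of the original notebook: `strip_line_numbers` is a hypothetical helper, and it assumes the numbers run consecutively from 1, so it only removes a trailing number when it matches the next expected value. Ordinary numbers at the end of a line (such as a year) are left alone.

```python
import re

# Optional text, then whitespace, then a trailing integer (or a line holding only an integer)
_NUMBERED = re.compile(r"^\s*(?:(.*\S)\s+)?(\d+)\s*$")

def strip_line_numbers(text):
    """Remove trailing line numbers that count up 1, 2, 3, ... (hypothetical helper)."""
    expected = 1
    cleaned = []
    for line in text.splitlines():
        match = _NUMBERED.match(line)
        if match and int(match.group(2)) == expected:
            # The trailing number continues the sequence, so treat it as a line number
            cleaned.append(match.group(1) or "")
            expected += 1
        else:
            cleaned.append(line.rstrip())
    return "\n".join(cleaned)

sample = "Title line   1 \n 2 \nAuthors  3 \nPublished in 2019\nFox1 and Oliva1* 4 "
print(strip_line_numbers(sample))
```

If a PDF restarts its numbering on every page, or skips a number, this simple counter would stop matching, so it should be checked against each document before relying on it.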

In [3]:
embeddings = []
for doc in documents:
    response = ollama.embeddings(model="nomic-embed-text", prompt=doc)
    embeddings.append(response["embedding"])

# Convert to NumPy array for FAISS
embeddings_np = np.array(embeddings).astype('float32')

# Create a flat (exact, brute-force) FAISS index and add embeddings
index = faiss.IndexFlatL2(embeddings_np.shape[1])  # L2 distance: smaller means more similar
index.add(embeddings_np)
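
The cell above embeds each paper as a single vector. That gives a coarse summary of a long document, and embedding models accept a limited input length, so long inputs may be truncated. A common alternative is to split each document into overlapping chunks and embed those instead. The sketch below is a minimal illustration of that idea and is not part of the original notebook: `chunk_text`, `chunk_documents`, and the default sizes are assumptions chosen for the example.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

def chunk_documents(docs, chunk_size=1000, overlap=200):
    """Chunk every document and remember which document each chunk came from."""
    chunks, sources = [], []
    for doc_id, doc in enumerate(docs):
        for chunk in chunk_text(doc, chunk_size, overlap):
            chunks.append(chunk)
            sources.append(doc_id)
    return chunks, sources

demo_chunks, demo_sources = chunk_documents(["abcdefghij", "xyz"], chunk_size=4, overlap=2)
print(demo_chunks)   # ['abcd', 'cdef', 'efgh', 'ghij', 'xyz']
print(demo_sources)  # [0, 0, 0, 0, 1]
```

With this approach, the embedding loop would iterate over the chunks rather than `documents`, and the source list maps each index hit back to its paper.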

In [None]:
query = "What myogenic enhancers are drivers of DUX4 activation?"

In [11]:
query_embedding = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
query_embedding_np = np.array(query_embedding).reshape(1, -1).astype('float32')
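
The index built earlier is a flat L2 index, so a search compares the query vector against every stored vector and returns the closest ones. The NumPy sketch below mirrors that computation to make the returned `distances` and `indices` easier to interpret. It is an illustration only, not FAISS's implementation: `brute_force_l2_search` is a hypothetical name, and the use of squared distances reflects my understanding of how `IndexFlatL2` reports them.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, k):
    """Return the k smallest squared L2 distances and their row indices per query."""
    vectors = np.asarray(vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise differences, shape (n_queries, n_vectors, dim)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k nearest stored vectors, closest first
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
dists, idx = brute_force_l2_search(stored, [[1.0, 1.0]], k=2)
print(idx)    # [[2 0]]
print(dists)  # [[1. 2.]]
```

As with the FAISS call, each row of the result corresponds to one query, which is why the notebook reads the top hit as `indices[0][0]`.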

In [12]:
# Search the FAISS index for nearest neighbors
k = 1  # Number of results to retrieve
distances, indices = index.search(query_embedding_np, k)

# Retrieve the most relevant document
retrieved_doc = documents[indices[0][0]]
print(f"Retrieved document: {retrieved_doc}")

response = ollama.generate(
    model="mistral",
    prompt=f"Using this data: {retrieved_doc}. Respond to this prompt: {query}"
)

# Print the generated response
print(response['response'])

Retrieved document: 1 Developmental regulators drive DUX4 expression in facioscapulohumeral muscular dystrophy   1 
 2 
Authors  3 
Amelia Fox1, Jonathan Oliva1, Rajanikanth Vangipurapu1, and Francis M. Sverdrup1* 4 
 5 
Affiliations   6 
1Department of Biochemistry and Molecular Biology, Saint Louis University School of Medicine, Saint 7 
Louis, Missouri.  8 
 9 
*Corresponding Author: Francis M. Sverdrup; email: fran.sverdrup@health.slu.edu. 10 
 11 
Abstract  12 
 13 
Facioscapulohumeral muscular dystrophy (FSHD) is a progressive muscle wasting disease caused by 14 
misexpression of the Double Homeobox 4 (DUX4) transcription factor in skeletal muscle.  While 15 
epigenetic derepression of D4Z4 macrosatellite repeats  is recognized to cause DUX4 misexpression in  16 
FSHD, the factors  promoting DUX4 transcription  are unknown. Here, we show that  SIX ( sine oculis ) 17 
transcription factors, critical  during embryonic development , muscle differentiation,  regeneration  and 18 
hom