<a href="https://colab.research.google.com/github/IngeniiImperator/Multiversum/blob/main/Advanced_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install wikipedia-api sentence-transformers faiss-cpu openai



In [None]:
import os
import random

import wikipediaapi

# Set up Wikipedia API with a custom User-Agent
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyResearchBot/1.0 (https://themultiverse.school/; liz@themultiverse.school)', language='en')

# Define categories and sample articles from different domains
categories = {
  "Computers": {
      "Computer science": ["Abstract Data Structure", "Algorithm", "Object Oriented Programming", "Scripting", "Python (programming language)"],
      "CyberSecurity": ["TCP IP", "Internet Protocol", "Computer Network", "TCP Packet", "LoRA", "OSI Model", "Layer 1", "Layer 2", "Layer 3", "Layer 4", "Layer 5", "Layer 6", "Layer 7", "Layer 8", "International Organization for Standardization", "OSHA"],
      "Machine Learning": ["Artificial intelligence", "Machine learning", "Deep learning", "Computer vision", "Natural language processing", "FAISS", "Embedding"]
  },
  "Biology": {
    "Cell biology": ["Cytoskeleton", "Cell membrane", "Endoplasmic reticulum", "Golgi apparatus", "Apoptosis"],
    "Genetics": ["Epigenetics", "Gene expression", "CRISPR", "DNA replication", "Genetic drift"],
    "Food Web": ["Trophic cascade", "Keystone species", "Ecological pyramid", "Energy flow (ecology)", "Biogeochemical cycle"],
    "Microbiology": ["Bacteriophage", "Gram-positive bacteria", "Archaea", "Biofilm", "Extremophiles"],
    "Human anatomy": ["Circulatory system", "Endocrine system", "Nervous system", "Musculoskeletal system", "Digestive system"],
    "Mitochondria": ["ATP synthase", "Mitochondrial DNA", "Oxidative phosphorylation", "Mitochondrial diseases", "Endosymbiotic theory"],
    "Phylogenetics": ["Cladistics", "Evolutionary tree", "Molecular phylogenetics", "Common descent", "Homology (biology)"]
  },
  "Chemistry": {
    "Organic chemistry": ["Functional groups", "Alkene", "Aromaticity", "Polymerization", "Carbohydrates"],
    "Inorganic chemistry": ["Coordination complex", "Transition metal", "Crystallography", "Lanthanides", "Actinides"],
    "Analytical chemistry": ["Chromatography", "Spectroscopy", "Mass spectrometry", "Electrochemical analysis", "Nuclear magnetic resonance"],
    "Physical chemistry": ["Quantum chemistry", "Thermodynamics", "Statistical mechanics", "Chemical kinetics", "Molecular dynamics"],
    "Biochemistry": ["Enzyme kinetics", "Protein folding", "Lipid metabolism", "Glycolysis", "Signal transduction"]
  },
  "Geology": {
    "Plate tectonics": ["Subduction zone", "Mid-ocean ridge", "Continental drift", "Transform fault", "Rift valley"],
    "Mineralogy": ["Silicate minerals", "Feldspar", "Quartz", "Mohs scale of mineral hardness", "Crystal habit"],
    "Volcano": ["Stratovolcano", "Shield volcano", "Pyroclastic flow", "Volcanic explosivity index", "Supervolcano"],
    "Earthquake": ["Seismic wave", "Richter scale", "Fault mechanics", "Liquefaction", "Tsunami"],
    "Geological history of Earth": ["Hadean eon", "Cambrian explosion", "Snowball Earth", "K-Pg extinction event", "Great Oxygenation Event"],
    "Igneous Rock": ["Basalt", "Granite", "Magma differentiation", "Intrusive rock", "Plutonic rock"]
  },
  "History": {
    "World War II": ["Battle of Stalingrad", "Manhattan Project", "D-Day", "Holocaust", "Blitzkrieg"],
    "Ancient Egypt": ["Pharaoh", "Hieroglyphics", "Valley of the Kings", "Mummification", "Great Pyramid of Giza"],
    "Renaissance": ["Humanism (Renaissance)", "Leonardo da Vinci", "Medici family", "Florence during the Renaissance", "Printing press"],
    "Industrial Revolution": ["Steam engine", "Factory system", "Textile industry", "Urbanization", "Luddites"],
    "Cold War": ["Cuban Missile Crisis", "Space Race", "Berlin Wall", "McCarthyism", "NATO"]
  },
  "Art": {
    "Impressionism": ["Claude Monet", "Edgar Degas", "Pierre-Auguste Renoir", "Plein air painting", "Color theory"],
    "Cubism": ["Pablo Picasso", "Georges Braque", "Analytic Cubism", "Synthetic Cubism", "Still Life with Chair Caning"],
    "Renaissance art": ["Michelangelo", "Sistine Chapel ceiling", "Raphael", "Leonardo da Vinci’s notebooks", "Linear perspective"],
    "Sculpture": ["Rodin", "Bronze casting", "Marble sculpture", "Gothic sculpture", "Greek classical sculpture"],
    "Abstract art": ["Wassily Kandinsky", "Color field painting", "Abstract expressionism", "Suprematism", "De Stijl"],
    "Dadaism": ["Marcel Duchamp", "Readymades", "Cabaret Voltaire", "Tristan Tzara", "Anti-art movement"],
    "Absurdism": ["Albert Camus", "The Myth of Sisyphus", "Theatre of the Absurd", "Samuel Beckett", "Waiting for Godot"]
  }
}

# Create the cache directory
cache_dir = "data/wikipedia_cache"
os.makedirs(cache_dir, exist_ok=True)

# Function to save Wikipedia text to cache
def save_to_cache(title, text):
    filename = os.path.join(cache_dir, f"{title.replace(' ', '_')}.txt")
    with open(filename, "w", encoding="utf-8") as file:
        file.write(text)

# Function to load Wikipedia text from cache
def load_from_cache(title):
    filename = os.path.join(cache_dir, f"{title.replace(' ', '_')}.txt")
    if os.path.exists(filename):
        with open(filename, "r", encoding="utf-8") as file:
            return file.read()
    return None

# Function to fetch Wikipedia page content with caching
def get_wikipedia_text(title):
    # Check if the page is already cached
    cached_text = load_from_cache(title)
    if cached_text:
        return cached_text

    page = wiki_wiki.page(title)

    if page.exists():
        # If it's a disambiguation page, try the first few linked pages instead
        if "may refer to:" in page.text[:200]:
            linked_pages = list(page.links.keys())[:5]  # Grab first few related links
            for linked_title in linked_pages:
                sub_page = wiki_wiki.page(linked_title)
                if sub_page.exists() and len(sub_page.text) > 500:  # Ensure meaningful content
                    # Cache under the requested title so the next lookup is a cache hit
                    save_to_cache(title, sub_page.text[:2000])
                    return sub_page.text[:2000]

        # Save the first 2,000 characters of the page to cache and return them
        # (also reached when no linked page had enough content)
        save_to_cache(title, page.text[:2000])
        return page.text[:2000]

    return None

# Function to get additional Wikipedia pages from related categories
def get_category_pages(category_name, max_pages=5):
    category_page = wiki_wiki.page(f"Category:{category_name}")
    pages = []

    if category_page.exists():
        for title, page in category_page.categorymembers.items():
            if page.ns == 0:  # Only fetch articles (not subcategories)
                text = get_wikipedia_text(title)
                if text:
                    pages.append(text)
                if len(pages) >= max_pages:
                    break
    return pages

# Fetch and store documents
documents = []

for category, topics in categories.items():
    for topic in topics:
        print("Getting ", topic)
        for page in topics[topic]:
            text = get_wikipedia_text(page)
            if text:
                documents.append(text)

    # Also pull a few pages from the Wikipedia category
    documents.extend(get_category_pages(category, max_pages=5))


print(f"Collected {len(documents)} Wikipedia documents. Cached files saved in: {cache_dir}")


Getting  Computer science
Getting  CyberSecurity
Getting  Machine Learning
Getting  Cell biology
Getting  Genetics
Getting  Food Web
Getting  Microbiology
Getting  Human anatomy
Getting  Mitochondria
Getting  Phylogenetics
Getting  Organic chemistry
Getting  Inorganic chemistry
Getting  Analytical chemistry
Getting  Physical chemistry
Getting  Biochemistry
Getting  Plate tectonics
Getting  Mineralogy
Getting  Volcano
Getting  Earthquake
Getting  Geological history of Earth
Getting  Igneous Rock
Getting  World War II
Getting  Ancient Egypt
Getting  Renaissance
Getting  Industrial Revolution
Getting  Cold War
Getting  Impressionism
Getting  Cubism
Getting  Renaissance art
Getting  Sculpture
Getting  Abstract art
Getting  Dadaism
Getting  Absurdism
Collected 200 Wikipedia documents. Cached files saved in: data/wikipedia_cache
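
The cache helpers above build a file name by replacing spaces only, so a title containing a path separator (some category members may have one) would point into a directory that does not exist. A minimal sketch of a stricter mapping follows; `cache_filename` is a hypothetical helper, not part of the notebook, and it assumes collisions between near-identical titles are acceptable for a small cache.

```python
import re

def cache_filename(title: str) -> str:
    """Map an article title to a file name without path separators or spaces."""
    # Collapse every run of characters other than word characters, dots,
    # and hyphens into a single underscore, then trim stray underscores.
    safe = re.sub(r"[^\w.\-]+", "_", title.strip()).strip("_")
    return f"{safe}.txt"

print(cache_filename("Energy flow (ecology)"))
print(cache_filename("TCP/IP"))
```

If adopted, the same helper would have to be used in both `save_to_cache` and `load_from_cache` so that reads and writes agree on the name.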


## Load the embedding model

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")


In [None]:
# Generate embeddings
document_embeddings = model.encode(documents, convert_to_numpy=True)

# Print embedding shape
print("Embedding Shape:", document_embeddings.shape)

Embedding Shape: (200, 768)


In [None]:
import faiss

embedding_size = document_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_size)
index.add(document_embeddings)

faiss.write_index(index, "faiss_index.bin")
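
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. To make that concrete, here is a small NumPy sketch of the same computation on toy vectors; `l2_search` and the toy arrays are illustrative names, and this is not how FAISS is implemented internally.

```python
import numpy as np

def l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Return (squared distances, indices) of the k nearest vectors per query."""
    # Pairwise differences via broadcasting: shape (n_queries, n_vectors, dim).
    diffs = queries[:, None, :] - vectors[None, :, :]
    # Squared Euclidean distance for every query/vector pair.
    dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by distance and keep the k closest entries.
    order = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, order, axis=1), order

toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
toy_query = np.array([[0.9, 0.1]], dtype="float32")
distances, neighbours = l2_search(toy_vectors, toy_query, k=2)
print(neighbours)
print(distances)
```

The brute-force version materialises every pairwise difference, which is fine for a few hundred documents but is exactly the cost a dedicated index library is meant to handle efficiently at scale.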


In [None]:
question = "What artists specifically were the most impacted by world war 2?"

In [None]:
import openai
from google.colab import userdata

client = openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

def prompt(q):
  # Ask the chat model to rewrite the user's query into a more detailed search query
  results = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {
              "role": "system",
              "content": "Your role is to improve the query the user has asked, so that we can search for better, more effective results. Please expand the query, make sure the question contains enough detail to find the document the user is requesting."
          },
          {
              "role": "user",
              "content": q
          }
      ]
  )
  return results.choices[0].message.content

In [None]:
expanded_question = prompt(question)
print(expanded_question)

Which artists from various artistic disciplines, such as painting, sculpture, music, and literature, were most significantly impacted by World War II in terms of their work, career, or personal lives? Additionally, I'd like information on how the war influenced their artistic themes, styles, and the overall trajectory of their careers during and after the conflict.


In [None]:
question_embedding = model.encode([question], convert_to_numpy=True)
expanded_question_embedding = model.encode([expanded_question], convert_to_numpy=True)

In [None]:
_, indices = index.search(question_embedding, 5)
docs = [documents[i] for i in indices[0]]

In [None]:
_, indices = index.search(expanded_question_embedding, 5)
expanded_docs = [documents[i] for i in indices[0]]
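
The two searches above produce separate candidate lists for the original and the expanded question. One possible way to use both is to merge the index lists and drop duplicates before re-ranking. The sketch below assumes the two index arrays were kept under separate names (the cells above reuse `indices`); `merge_hits` and the toy index lists are hypothetical.

```python
def merge_hits(*hit_lists):
    """Concatenate ranked index lists, keeping the first occurrence of each index."""
    seen = set()
    merged = []
    for hits in hit_lists:
        for idx in hits:
            idx = int(idx)  # FAISS returns NumPy integers; normalise to int
            if idx not in seen:
                seen.add(idx)
                merged.append(idx)
    return merged

original_hits = [12, 7, 3, 40, 5]   # stand-in for the first search's indices[0]
expanded_hits = [7, 12, 19, 3, 88]  # stand-in for the second search's indices[0]
candidate_ids = merge_hits(original_hits, expanded_hits)
print(candidate_ids)
```

The merged ids could then be mapped back to text with `[documents[i] for i in candidate_ids]` and handed to a re-ranker.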

In [None]:
for doc in docs:
  print("New Doc:")
  print(doc)

New Doc:
Abstract expressionism in the United States emerged as a distinct art movement in the aftermath of World War II and gained mainstream acceptance in the 1950s, a shift from the American social realism of the 1930s influenced by the Great Depression and Mexican muralists. The term was first applied to American art in 1946 by the art critic Robert Coates. Key figures in the New York School, which was the center of this movement, included such artists as Arshile Gorky, Jackson Pollock, Franz Kline, Mark Rothko, Norman Lewis, Willem de Kooning, Adolph Gottlieb, Clyfford Still, Robert Motherwell, Theodoros Stamos and Lee Krasner among others.
The movement was not limited to painting but included influential collagists and sculptors, such as David Smith, Louise Nevelson, and others. Abstract expressionism was notably influenced by the spontaneous and subconscious creation methods of Surrealist artists like André Masson and Max Ernst. Artists associated with the movement combined the 

In [None]:
for doc in expanded_docs:
  print("New Doc:")
  print(doc)

New Doc:
Abstract expressionism in the United States emerged as a distinct art movement in the aftermath of World War II and gained mainstream acceptance in the 1950s, a shift from the American social realism of the 1930s influenced by the Great Depression and Mexican muralists. The term was first applied to American art in 1946 by the art critic Robert Coates. Key figures in the New York School, which was the center of this movement, included such artists as Arshile Gorky, Jackson Pollock, Franz Kline, Mark Rothko, Norman Lewis, Willem de Kooning, Adolph Gottlieb, Clyfford Still, Robert Motherwell, Theodoros Stamos and Lee Krasner among others.
The movement was not limited to painting but included influential collagists and sculptors, such as David Smith, Louise Nevelson, and others. Abstract expressionism was notably influenced by the spontaneous and subconscious creation methods of Surrealist artists like André Masson and Max Ernst. Artists associated with the movement combined the 

In [None]:
from sentence_transformers import CrossEncoder

# Load a neural re-ranking model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Function to re-rank top-k documents
def rerank_documents(query, retrieved_docs):
    # Create query-document pairs for scoring
    pairs = [[query, doc] for doc in retrieved_docs]

    # Get relevance scores
    scores = reranker.predict(pairs)

    # Sort documents by score
    ranked_docs = [doc for _, doc in sorted(zip(scores, retrieved_docs), reverse=True)]

    return ranked_docs

reranked_docs = rerank_documents(question, expanded_docs)

# Print the reranked documents
for doc in reranked_docs:
    print("New Doc:")
    print(doc)

New Doc:
Abstract expressionism in the United States emerged as a distinct art movement in the aftermath of World War II and gained mainstream acceptance in the 1950s, a shift from the American social realism of the 1930s influenced by the Great Depression and Mexican muralists. The term was first applied to American art in 1946 by the art critic Robert Coates. Key figures in the New York School, which was the center of this movement, included such artists as Arshile Gorky, Jackson Pollock, Franz Kline, Mark Rothko, Norman Lewis, Willem de Kooning, Adolph Gottlieb, Clyfford Still, Robert Motherwell, Theodoros Stamos and Lee Krasner among others.
The movement was not limited to painting but included influential collagists and sculptors, such as David Smith, Louise Nevelson, and others. Abstract expressionism was notably influenced by the spontaneous and subconscious creation methods of Surrealist artists like André Masson and Max Ernst. Artists associated with the movement combined the 
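
The re-ranking cell returns only the reordered documents. When inspecting results it can help to keep the scores alongside them and to sort on the score alone. The sketch below uses made-up scores in place of `reranker.predict`; `rank_by_score` and the toy data are illustrative, not part of the notebook.

```python
def rank_by_score(docs, scores):
    """Return (score, doc) pairs sorted from most to least relevant."""
    pairs = zip((float(s) for s in scores), docs)
    # Sort on the score only, so documents themselves are never compared.
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

toy_docs = ["doc about cubism", "doc about abstract expressionism", "doc about basalt"]
toy_scores = [0.4, 3.1, -5.2]  # stand-in for cross-encoder relevance scores
ranked = rank_by_score(toy_docs, toy_scores)
for score, doc in ranked:
    print(f"{score:+.2f}  {doc}")
```

Printing the scores makes it easier to see whether the top document is a clear winner or only marginally ahead of the rest.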

In [None]:
# Imports and client setup are repeated so this cell can run on its own
import openai
from google.colab import userdata

client = openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

def evaluate(question, results):
  # Ask the chat model to answer the question from the retrieved documents
  output = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {
              "role": "system",
              "content": "Your role is to interpret the search results and answer the user's original question."
          },
          {
              "role": "user",
              "content": f"Question: {question}\nResults: {results}\nAnswer:"
          }
      ]
  )
  return output.choices[0].message.content
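
`evaluate` interpolates `results` directly into the prompt, so a Python list is sent as its `repr`, brackets and escaped newlines included. A possible alternative is to render the passages as a numbered block first. `format_results` and its `max_chars` parameter are assumptions for illustration, not part of the notebook.

```python
def format_results(docs, max_chars=1000):
    """Render retrieved passages as a numbered block for the prompt."""
    lines = []
    for number, doc in enumerate(docs, start=1):
        # Collapse internal whitespace and cap the length of each passage.
        snippet = " ".join(doc.split())[:max_chars]
        lines.append(f"[{number}] {snippet}")
    return "\n\n".join(lines)

toy_docs = [
    "Abstract expressionism emerged\nafter World War II.",
    "Cubism is associated with Pablo Picasso and Georges Braque.",
]
print(format_results(toy_docs))
```

The formatted string could be passed as the second argument, for example `evaluate(question, format_results(reranked_docs))`.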

In [None]:
print(evaluate(question, reranked_docs))

World War II had a significant impact on artists, particularly those associated with the Abstract Expressionism movement, which emerged in the United States in its aftermath. This movement became prominent in the 1950s, marking a shift from earlier styles like American social realism. It included key figures such as Jackson Pollock, Mark Rothko, Willem de Kooning, and Franz Kline. These artists, associated with the New York School, were central to the development of Abstract Expressionism. The cultural and artistic shift also saw New York City rise as a new epicenter of the art world, previously held by Paris before the war. 

Other artists, like Marcel Duchamp, who were already influential before World War II, continued to impact art movements during the war and beyond, but were not specifically noted as being impacted by the war itself in the provided search results. They, however, influenced broad avant-garde movements that indirectly informed post-war styles.

Overall, artists clos