# Minecraft Wiki Loader Notebook
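This notebook demonstrates how to fetch and process data from the Minecraft Wiki using Python. The standalone helper functions default to the French wiki (`fr.minecraft.wiki`), while the loader class defaults to the English one. The notebook contains helper functions and a custom loader class for extracting wiki paragraphs, rewriting them with a local model, and storing the result in a vector database.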

## Importing Necessary Libraries
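The libraries used in this notebook include:
- `requests` for making HTTP requests
- `BeautifulSoup` for parsing HTML content
- `rich` for progress bars
- `ollama` for calling a local language model
- Classes from `langchain_core` for handling documents and loaders
- `Chroma` and `OllamaEmbeddings` for the vector store and its embeddings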

In [1]:
from typing import AsyncIterator, Iterator
import requests
from bs4 import BeautifulSoup
from rich.progress import Progress
import json

import re
import unicodedata

import ollama

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

from langchain.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

## Fetching Revision IDs
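The `fetch_rvid` function retrieves the ID of the most recent revision of a wiki page at or before a specified date. This ID makes it possible to load the page as it existed at that time. The function returns `None` if the request fails or the page has no matching revision.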

In [2]:
headers = {"User-Agent": "MyScript/1.0 (myemail@example.com)"}

def fetch_rvid(title, date="2021-01-01T00:00:00.000Z", api = "https://fr.minecraft.wiki/api.php"):
    # API request parameters
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "formatversion": "2",
        "rvprop": "ids",
        "rvlimit": "1",
        "rvstart": date,
        "rvdir": "older"
    }

    # Make the API request
    response = requests.get(api, params=params, headers=headers)

    # Check if the response was successful
    if response.status_code == 200:
        data = response.json()
        
        # Extract the revisions data safely
        pages = data.get("query", {}).get("pages", [])
        if pages:
            page_info = pages[0]
            revisions = page_info.get("revisions", [])
            
            if revisions:
                # Return the revision ID if found
                return revisions[0].get("revid", None)
    
    # Return None if no revisions are found or if there's an issue
    return None
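
The parsing branch of `fetch_rvid` can be checked without network access. The sketch below assumes the `formatversion=2` response shape that the function already relies on (a `query.pages` list whose entries may carry a `revisions` list). The helper name `extract_revid` and both sample payloads are hypothetical.

```python
from typing import Any, Dict, Optional


def extract_revid(data: Dict[str, Any]) -> Optional[int]:
    """Return the first revision ID in a formatversion=2 response, or None."""
    pages = data.get("query", {}).get("pages", [])
    if not pages:
        return None
    revisions = pages[0].get("revisions", [])
    return revisions[0].get("revid") if revisions else None


# Hypothetical payloads shaped like the responses fetch_rvid expects.
found = {"query": {"pages": [{"title": "Alambic", "revisions": [{"revid": 12345, "parentid": 12000}]}]}}
missing = {"query": {"pages": [{"title": "Nope", "missing": True}]}}

print(extract_revid(found))    # revision ID from the first payload
print(extract_revid(missing))  # page without revisions
print(extract_revid({}))       # empty response
```

Keeping the extraction separate from the HTTP call is one way to test the edge cases (missing page, empty response) that the nested `if` statements guard against.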

## Fetching Category Members
The `fetch_category_members` function retrieves the titles of all pages within a specified category. It handles pagination if the number of pages exceeds the API's limit.

In [3]:
def fetch_category_members(category, limit=500 , api = "https://fr.minecraft.wiki/api.php"):
    members = []
    cmcontinue = None

    while True:
        # API request parameters
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmlimit": limit,
            "format": "json",
            "cmtype": "page",
        }
        if cmcontinue:
            params["cmcontinue"] = cmcontinue

        # Make the API request
        response = requests.get(api, params=params, headers=headers)
        data = response.json()

        # Collect members
        members.extend([page["title"] for page in data.get("query", {}).get("categorymembers", [])])
        
        # Check if more pages are available
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break
    return members
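
The continuation loop above can be illustrated offline. This is a minimal sketch in which a dictionary of canned responses stands in for the API; `collect_titles`, `FAKE_BATCHES`, and the `page|2` token are hypothetical, and only the `continue.cmcontinue` convention is taken from the function above.

```python
from typing import Callable, Dict, List, Optional


def collect_titles(fetch_batch: Callable[[Optional[str]], Dict]) -> List[str]:
    """Follow cmcontinue tokens until the API stops returning one."""
    titles: List[str] = []
    token: Optional[str] = None
    while True:
        data = fetch_batch(token)
        titles.extend(m["title"] for m in data.get("query", {}).get("categorymembers", []))
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            return titles


# Hypothetical two-batch response set standing in for the real API.
FAKE_BATCHES = {
    None: {"query": {"categorymembers": [{"title": "Pierre"}, {"title": "Terre"}]},
           "continue": {"cmcontinue": "page|2"}},
    "page|2": {"query": {"categorymembers": [{"title": "Sable"}]}},
}

titles = collect_titles(lambda token: FAKE_BATCHES[token])
print(titles)
```

The first call is made without a token; each later call passes back the token from the previous response, and the loop ends on the first response that has no `continue` block.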

## Fetching Page Content
The `fetch_page_content` function fetches the HTML of a wiki page at a specific revision. It parses the response with `BeautifulSoup` and returns the page URL together with the `#mw-content-text` element, or the URL and `None` if the request fails.

In [4]:
def fetch_page_content(title : str,rvid = "", vrb = True, baseURL = "https://fr.minecraft.wiki"):
    page_url = f"{baseURL}/w/{title.replace(' ', '_')}?oldid={rvid}"
    if vrb:
        print(f"collecting data from : {page_url}")
    
    response = requests.get(page_url, headers=headers)
    
    if response.status_code != 200:
        print(f"Error: Unable to fetch the page. Status code {response.status_code}")
        return page_url, None
    
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Return only the main article body
    return page_url, soup.select_one('#mw-content-text')
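
The URL above is built by string formatting, so a title containing a character such as `?` would be read as the start of the query string. The sketch below shows one way to percent-encode the title with the standard library. The helper name `build_page_url` is hypothetical, and whether any page in the chosen categories actually needs this is an assumption.

```python
from urllib.parse import quote


def build_page_url(title: str, rvid: int, base_url: str = "https://fr.minecraft.wiki") -> str:
    """Build a permalink to one revision, percent-encoding the title."""
    # Spaces become underscores, as in fetch_page_content; '/' and ':' stay
    # literal so subpages and namespaces keep their usual form.
    safe_title = quote(title.replace(" ", "_"), safe="/:")
    return f"{base_url}/w/{safe_title}?oldid={rvid}"


print(build_page_url("Alambic/BS", 12345))
print(build_page_url("Bloc ?", 67890))
```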


## Text Processing Functions
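These utility functions decide whether a paragraph is worth extracting (`is_textable`), strip accents and other non-ASCII characters (`to_ascii`), clean up the text (`textify`), and ask a local `qwen2.5` model through `ollama` to rewrite the text as short, independent sentences (`process`).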

In [7]:
def is_textable(p: BeautifulSoup):
    text = p.get_text()
    if not text:
        return False
    text = text.strip()
    return (
        len(text) > 30  # skip things like "Alambic/BS", which are useless
        and not text.endswith(":")  # skip <p> followed by a table or image
        and not text.startswith("Erreur")  # skip error-message <p>
        and "bug tracker" not in text
    )

def to_ascii(text):
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode()

def textify(p: BeautifulSoup):
    input_text = p.get_text()
    input_text = re.sub(r'\s+', ' ', input_text)
    # Remove bracketed reference markers such as [1] or [12]
    input_text = re.sub(r'\[([a-zA-Z0-9]+)\]', '', input_text)
    return process(to_ascii(input_text))


def process(input_text: str):
    prompt = f"""
Rephrase the following text into clear, simple, independent sentences.
- Each sentence must contain only one subject.
- Replace all pronouns like "they", "this", "other", "the" or "it" with explicit nouns.
- If a sentence contains multiple actions, split it into separate sentences, one action per sentence.
- Do not add or remove any information.
- Sentences should be fully independent; none should require context from another.
- Ignore line breaks in the input; split sentences based on content and meaning, not line endings.
- Write each sentence on its own line without punctuation at the end.

Example:
Input text:
Certain non-hostile mobs such as chickens skeleton horses and spiders can be mounted by hostile mobs on rare occasions and become hostile

Output:
Chickens are non-hostile mobs
Skeleton horses are non-hostile mobs
Spiders are non-hostile mobs
Hostile mobs can mount Chickens on rare occasions
Hostile mobs can mount Skeleton horses on rare occasions
Hostile mobs can mount Spiders on rare occasions
Mounted mobs become hostile

Now rephrase this text:
{input_text}"""
    text = ollama.generate(model='qwen2.5', prompt=prompt).response.split("</think>", 1)
    return [t.strip() for t in text[-1].split('\n') if t.strip() != '']


In [8]:
process("Mobs spawn in various ways. Most mobs spawn naturally, depending on the light level, biome, and their surroundings. For example, most animals are found in bright areas on the surface, while hostile monsters are commonly found in the dark (whether it's a cave, monster room, mansion, or at night). Animals usually spawn upon chunk generation, while hostile monsters spawn and despawn in a certain radius around the player.")

['Mobs spawn in various ways',
 'Most mobs spawn naturally',
 'Light level affects mob spawning',
 'Biome influences mob presence',
 'Surroundings impact mob appearance',
 'Animals are typically found in bright areas on the surface',
 'Hostile monsters commonly appear in dark places',
 'Dark places include caves, monster rooms, mansions, or at night',
 'Animals usually spawn upon chunk generation',
 'Hostile monsters spawn and despawn within a certain radius around the player']
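
The cleanup that happens before the model call needs no model to try out. The sketch below reuses `to_ascii` as defined in this notebook and combines the two regular expressions from `textify` into a hypothetical helper, `clean_text`. It assumes the bracket pattern is meant to remove multi-character markers such as `[12]`, so it uses `+`. The sample sentence is made up.

```python
import re
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters and drop anything outside ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ASCII", "ignore").decode()


def clean_text(raw: str) -> str:
    """Strip bracketed markers such as [1], collapse whitespace, then transliterate."""
    text = re.sub(r"\[[a-zA-Z0-9]+\]", "", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return to_ascii(text)


sample = "L'alambic  est utilisé\npour préparer des potions[12]."
print(clean_text(sample))
```

Accented letters survive as their base letters because NFKD splits them into a letter plus a combining mark. Characters with no such decomposition are dropped entirely, so `to_ascii("cœur")` yields `cur`, which is worth keeping in mind for French text.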

## Loader Class
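The `MinecraftWikiLoader` class collects the pages of the given categories, keeps the paragraphs that pass `is_textable`, and caches them in `data.json`. All fetching happens in the constructor. `lazy_load` then passes each cached paragraph through `process` and yields one `Document` per resulting sentence. `from_file` rebuilds a loader from the cache without touching the network.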

In [12]:
class MinecraftWikiLoader(BaseLoader):
    def __init__(
        self,
        categorys=("Category:Blocks", "Category:Environment", "Category:Gameplay", "Category:Redstone", "Category:Entities"),
        date="2021-01-01T00:00:00.000Z",
        lang="en"
    ) -> None:
        """Initialize the loader by fetching and processing content from the Minecraft wiki."""
        self.documents = []  # Will store all Document objects
        self.baseURL = f"https://{lang}.minecraft.wiki"
        self.api = f"{self.baseURL}/api.php"

        # --- Fetch pages from categories ---
        print("Getting all pages...")
        pages = []
        for cat in categorys:
            for page in fetch_category_members(cat, api=self.api):
                pages.append(page)
        pages = list(set(pages))  # remove duplicates

        self.categorys_page = []
        with Progress() as progress:
            task = progress.add_task("[cyan]Sorting Pages...", total=len(pages))
            for page in pages:
                rvid = fetch_rvid(page, date=date, api=self.api)
                if rvid:
                    self.categorys_page.append({"title": page, "id": rvid})
                progress.update(task, advance=1)
            progress.stop_task(task)

        # --- Fetch content for each page and store as documents ---
        with Progress() as progress:
            task = progress.add_task("[cyan]Getting Pages...", total=len(self.categorys_page))
            for page in self.categorys_page:
                page_url, page_content = fetch_page_content(
                    page["title"], rvid=page["id"], vrb=False, baseURL=self.baseURL
                )
                if page_content:
                    # Select paragraphs from the page content
                    paragraphs = page_content.select('div.mw-parser-output > p')
                    for p in paragraphs:
                        if is_textable(p):
                            self.documents.append(
                                {
                                    "page_content": re.sub(r'\s+', ' ', p.get_text().strip()),  # Clean up text efficiently
                                    "metadata": {"title": page["title"], "source": page_url}
                                }
                            )
                    
                progress.update(task, advance=1)
            progress.stop_task(task)
        with open('data.json', 'w') as f:
            json.dump(self.documents, f, indent=4)
        self.page_count = len(self.documents)

    @classmethod
    def from_file(cls, file_path="data.json") -> "MinecraftWikiLoader":
        instance = cls.__new__(cls)
        with open(file_path, 'r') as f:
            instance.documents = json.load(f)
        instance.page_count = len(instance.documents)
        return instance


    def lazy_load(self) -> Iterator[Document]:
        """Yields the pre-fetched Document objects one by one."""
        with Progress() as progress:
            task = progress.add_task("[cyan]Processing data...", total=len(self.documents))
            for doc in self.documents:
                for s in process(to_ascii(doc["page_content"])):  # one Document per rewritten sentence
                    yield Document(
                        page_content=s,
                        metadata=doc["metadata"]  # Access metadata correctly
                    )
                progress.update(task, advance=1)
            progress.stop_task(task)
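
The cache-then-iterate pattern used by `from_file` and `lazy_load` can be sketched with the standard library alone. Everything named here is hypothetical: `Doc` stands in for `langchain_core`'s `Document`, `lazy_docs` for `lazy_load`, and a naive period splitter replaces the model-backed `process`. Only the record shape (`page_content` plus `metadata`) is taken from the loader above.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core's Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def lazy_docs(records: List[dict], split: Callable[[str], List[str]]) -> Iterator[Doc]:
    """Yield one Doc per sentence; nothing is processed until the caller iterates."""
    for record in records:
        for sentence in split(record["page_content"]):
            yield Doc(page_content=sentence, metadata=record["metadata"])


# Hypothetical cached paragraph, in the same shape the loader writes to data.json.
records = [{"page_content": "Sand falls. Sand is mined with a shovel.",
            "metadata": {"title": "Sand", "source": "https://example.invalid/Sand"}}]

# Round-trip through a temporary JSON file, as from_file does with data.json.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data.json"
    cache.write_text(json.dumps(records), encoding="utf-8")
    loaded = json.loads(cache.read_text(encoding="utf-8"))


def naive_split(text: str) -> List[str]:
    """Split on periods; a placeholder for the model-backed process()."""
    return [s.strip() for s in text.split(".") if s.strip()]


docs = list(lazy_docs(loaded, naive_split))
print([d.page_content for d in docs])
```

Because the generator does its work per paragraph, the expensive rewriting step only runs as far as the consumer iterates, which is the main reason to keep it out of the constructor.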


## Loader Initialization and Testing
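Here, a persistent Chroma collection is created with `mxbai-embed-large` embeddings served by Ollama, and the `MinecraftWikiLoader` is rebuilt from the cached `data.json`. The reported `page_count` is the number of cached paragraphs.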

In [None]:
model_name = "mxbai-embed-large"
db_data = Chroma(persist_directory= "./wiki_db",embedding_function= OllamaEmbeddings(model=model_name),collection_name="data")



In [None]:
loader = MinecraftWikiLoader.from_file()  # may take a moment while the pages are sorted, about 3 min in general
loader.page_count

6720

## Vectorization Process

In [None]:
for document in loader.lazy_load():
    try:
        db_data.add_documents([document])
    except Exception as exc:  # keep going, but report what failed and why
        print("Error on : ", document, exc)
        
db_data.persist()
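
The loop above adds one document per call. Since `add_documents` accepts a list, grouping the generator's output into batches is one possible way to reduce the number of calls. The sketch below shows only the grouping step; the helper name `batched` is hypothetical, and no claim is made here about how much time it would save.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
print(batches)
```

With the notebook's objects this would read `for batch in batched(loader.lazy_load(), 64): db_data.add_documents(batch)`, where the batch size of 64 is arbitrary. The trade-off is that an error then affects a whole batch rather than a single document.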


In [None]:
db_data._collection.count()

In [None]:
retriever = db_data.as_retriever(search_kwargs={"k": 100})

In [None]:
retriever.invoke("they")

In [None]:
from langchain.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
import ollama

def search(query):
    """Retrieve matching sentences and ask the model to summarize them."""
    data = retriever.invoke(query)
    data_result = ''.join([i.page_content+'\n' for i in data])
    print(data_result)
    prompt = (
        "You are an API that summarizes data relevant to answering the request. "
        "If there is no useful data, respond with 'No data found.' \n"
        f" Request: {query} \n Data: {data_result} \n"
        " Provide a concise and relevant summary based on the available data."
    )
    return ollama.generate(model='qwen2.5', prompt=prompt).response
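
With `k` set to 100, the retrieved sentences may repeat and the joined text can grow long. The sketch below shows one way to deduplicate and cap the context before it goes into the prompt. The helper name `build_context` and the `max_chars` limit are hypothetical, and whether the model benefits from a shorter context is an assumption.

```python
from typing import Iterable, List


def build_context(snippets: Iterable[str], max_chars: int = 2000) -> str:
    """Join unique snippets, one per line, stopping before max_chars is exceeded."""
    seen = set()
    kept: List[str] = []
    used = 0
    for snippet in snippets:
        line = snippet.strip()
        if not line or line in seen:
            continue  # skip blanks and exact repeats
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        seen.add(line)
        kept.append(line)
        used += cost
    return "\n".join(kept)


context = build_context(["Sand falls", "Sand falls", "Gravel falls", ""], max_chars=30)
print(context)
```

In the final cell this could replace the `''.join(...)` step, for example `build_context(doc.page_content for doc in data)`.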