# Minecraft Wiki Loader Notebook
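This notebook demonstrates how to fetch and process data from the Minecraft Wiki using Python. The helper functions default to the French wiki (`fr.minecraft.wiki`), while the loader class defaults to the English one (`lang = "en"`). It contains several helper functions and a custom loader class for extracting and processing wiki data.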

## Importing Necessary Libraries
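The libraries used in this notebook include:
- `requests` for making HTTP requests
- `BeautifulSoup` for parsing HTML content
- `rich` for progress bars
- `ollama` for querying a local language model
- Classes from `langchain_core` for handling documents and loaders
- `Chroma` and `OllamaEmbeddings` for embedding and storing the documents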

In [1]:
from typing import AsyncIterator, Iterator
import requests
from bs4 import BeautifulSoup
from rich.progress import Progress

import ollama

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

from langchain.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

## Fetching Revision IDs
The `fetch_rvid` function retrieves the revision ID of a wiki page as of a specified date. This is essential for accessing historical content.

In [2]:
headers = {"User-Agent": "MyScript/1.0 (myemail@example.com)"}

def fetch_rvid(title, date="2021-01-01T00:00:00.000Z", api="https://fr.minecraft.wiki/api.php"):
    # API request parameters
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "formatversion": "2",
        "rvprop": "ids",
        "rvlimit": "1",
        "rvstart": date,
        "rvdir": "older"
    }

    # Make the API request
    response = requests.get(api, params=params, headers=headers)

    # Check if the response was successful
    if response.status_code == 200:
        data = response.json()
        
        # Extract the revisions data safely
        pages = data.get("query", {}).get("pages", [])
        if pages:
            page_info = pages[0]
            revisions = page_info.get("revisions", [])
            
            if revisions:
                # Return the revision ID if found
                return revisions[0].get("revid", None)
    
    # Return None if no revisions are found or if there's an issue
    return None
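
To make the response handling concrete, the sketch below applies the same extraction logic to hand-written payloads in the `formatversion=2` shape that `fetch_rvid` assumes. The helper name `extract_revid` and the sample values are illustrative assumptions, not real API output.

```python
from typing import Optional


def extract_revid(data: dict) -> Optional[int]:
    """Return the first revision ID in a formatversion=2 query response, or None."""
    pages = data.get("query", {}).get("pages", [])
    if not pages:
        return None
    revisions = pages[0].get("revisions", [])
    return revisions[0].get("revid") if revisions else None


# Hand-written sample payload (illustrative values, not real wiki data).
sample = {
    "query": {
        "pages": [
            {"pageid": 1, "title": "Example", "revisions": [{"revid": 12345, "parentid": 12000}]}
        ]
    }
}
# Assumed shape for a page without a matching revision: no "revisions" key at all.
missing = {"query": {"pages": [{"title": "Example", "missing": True}]}}

print(extract_revid(sample))   # 12345
print(extract_revid(missing))  # None
```

Keeping the parsing separate from the HTTP call in this way makes the `None` fallback easy to check without network access.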

## Fetching Category Members
The `fetch_category_members` function retrieves the titles of all pages within a specified category. It handles pagination if the number of pages exceeds the API's limit.

In [3]:


def fetch_category_members(category, limit=500, api="https://fr.minecraft.wiki/api.php"):
    members = []
    cmcontinue = None

    while True:
        # API request parameters
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmlimit": limit,
            "format": "json",
            "cmtype": "page",
        }
        if cmcontinue:
            params["cmcontinue"] = cmcontinue

        # Make the API request
        response = requests.get(api, params=params, headers=headers)
        data = response.json()

        # Collect members
        members.extend([page["title"] for page in data.get("query", {}).get("categorymembers", [])])
        
        # Check if more pages are available
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break
    return members
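
The continuation loop can be exercised offline. The following is a minimal sketch that replaces the HTTP call with a hypothetical `fake_fetch` function returning two canned batches; the names `collect_members`, `fake_fetch`, and `FAKE_BATCHES`, as well as the token value, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def collect_members(fetch_page: Callable[[Optional[str]], Dict]) -> List[str]:
    """Follow `cmcontinue` tokens until the response no longer contains one."""
    members: List[str] = []
    token: Optional[str] = None
    while True:
        data = fetch_page(token)
        members.extend(m["title"] for m in data.get("query", {}).get("categorymembers", []))
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            return members


# Hypothetical two-batch response sequence standing in for the real API.
FAKE_BATCHES = {
    None: {
        "query": {"categorymembers": [{"title": "Stone"}, {"title": "Dirt"}]},
        "continue": {"cmcontinue": "page|2"},
    },
    "page|2": {"query": {"categorymembers": [{"title": "Sand"}]}},
}


def fake_fetch(token: Optional[str]) -> Dict:
    """Return the canned batch that matches the continuation token."""
    return FAKE_BATCHES[token]


titles = collect_members(fake_fetch)
print(titles)  # ['Stone', 'Dirt', 'Sand']
```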

## Fetching Page Content
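The `fetch_page_content` function fetches the HTML of a wiki page at a specific revision. It parses the response with `BeautifulSoup` and returns the page URL together with the `#mw-content-text` element, or `None` in place of the element if the request fails.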

In [4]:
def fetch_page_content(title: str, rvid="", vrb=True, baseURL="https://fr.minecraft.wiki"):
    page_url = f"{baseURL}/w/{title.replace(' ', '_')}?oldid={rvid}"
    if vrb:
        print(f"collecting data from : {page_url}")
    
    response = requests.get(page_url, headers=headers)
    
    if response.status_code != 200:
        print(f"Error: Unable to fetch the page. Status code {response.status_code}")
        return page_url, None
    
    # Parse the HTML and keep only the main content element
    soup = BeautifulSoup(response.content, 'html.parser')

    return page_url, soup.select_one('#mw-content-text')
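
The revision URL is built by replacing spaces with underscores only, so a title containing a character such as `?` would be cut off at the query string. As a hedged sketch, the hypothetical helper `build_revision_url` below adds percent-encoding with the standard library; it assumes the same `/w/<title>?oldid=<id>` URL layout that the function above uses.

```python
from urllib.parse import quote


def build_revision_url(title: str, rvid, base_url: str = "https://fr.minecraft.wiki") -> str:
    """Build the revision URL, percent-encoding the page title."""
    # Spaces become underscores as in fetch_page_content; quote() then escapes
    # reserved characters such as '?' or '&' while leaving '/' and ':' untouched.
    slug = quote(title.replace(" ", "_"), safe="/:")
    return f"{base_url}/w/{slug}?oldid={rvid}"


print(build_revision_url("Bloc de commande", 123))
# https://fr.minecraft.wiki/w/Bloc_de_commande?oldid=123
```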


## Text Processing Functions
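These utility functions determine whether a paragraph is worth extracting (`is_textable`) and clean up the text (`textify`). `is_textable` rejects paragraphs of 30 characters or fewer, paragraphs ending with a colon, and error messages. Paragraphs longer than 100 characters are accepted directly; anything in between is passed to a local Ollama model that answers `true` or `false`.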

In [5]:
def is_textable(p : BeautifulSoup):
    text = p.get_text().strip()
    if not text:
        return False
    length = len(text)
    if length <= 30:  # skip things like "Alambic/BS", which are useless
        return False
    if text.endswith(":"):  # skip <p> followed by a table or image
        return False
    if text.startswith("Erreur"):  # skip error-message <p>
        return False
    if length > 100:
        return True
    # For short paragraphs (31 to 100 characters), ask a local model.
    # The prompt must embed the paragraph text, not the built-in `input`.
    answer = ollama.generate(
        model='qwen2.5:3b',
        prompt=f"You are a JSON API, responding with either true or false only to the question: \"Does this contain useful information?\" Input : {text}",
    )["response"]
    return answer.strip().lower() == "true"

def textify(p : BeautifulSoup):
    # Collapse newlines and repeated whitespace into single spaces, then lowercase
    text = " ".join(p.get_text().split())
    return text.lower()
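
The deterministic part of these checks can be tested without a running model. The sketch below isolates it in two hypothetical helpers, `passes_static_rules` and `normalize`, which work on plain strings; both names and the sample sentences are assumptions made for illustration.

```python
def passes_static_rules(text: str, min_length: int = 30) -> bool:
    """Apply the length, trailing-colon and 'Erreur' checks without calling a model."""
    stripped = text.strip()
    return (
        len(stripped) > min_length
        and not stripped.endswith(":")
        and not stripped.startswith("Erreur")
    )


def normalize(text: str) -> str:
    """Collapse every run of whitespace to a single space and lowercase the result."""
    return " ".join(text.split()).lower()


print(passes_static_rules("Alambic/BS"))  # False: too short
print(passes_static_rules("The following recipes are available in the game:"))  # False: trailing colon
print(passes_static_rules("Stone is a block found underground in the Overworld."))  # True
print(normalize("  Stone   is\na  Block "))  # stone is a block
```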

## Loader Class
The `MinecraftWikiLoader` class processes wiki pages within specified categories. It lazily loads paragraphs as `Document` objects for further use.

In [6]:
class MinecraftWikiLoader(BaseLoader):
    # French category names, for use with lang = "fr":
    # ["Catégorie:Bloc","Catégorie:Environnement","Catégorie:Gameplay","Catégorie:Objets","Catégorie:Redstone","Catégorie:Entitée"]
    def __init__(self, categorys = ["Category:Blocks", "Category:Environment", "Category:Gameplay", "Category:Redstone", "Category:Entities"],date="2021-01-01T00:00:00.000Z", lang = "en") -> None:
        """Initialize the loader
        """
        self.categorys_page = []
        pages = []
        self.baseURL = f"https://{lang}.minecraft.wiki"
        self.api = f"{self.baseURL}/api.php"

        print("Getting all pages...")
        for category in categorys:
            for page in fetch_category_members(category, api=self.api):
                pages.append(page)
        pages = list(set(pages))  # remove duplicates
        with Progress() as progress:
            task = progress.add_task("[cyan]Sorting Pages...", total=len(pages))
            for page in pages:
                rvid = fetch_rvid(page,date=date, api= self.api)
                if rvid:
                    self.categorys_page.append({"title":page,"id":rvid}) 
                progress.update(task, advance=1)
            progress.stop_task(task)
        self.page_count = len(self.categorys_page)

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a page <p> by <p>
        """
        with Progress() as progress:
            task = progress.add_task("[cyan]Processing Pages...", total=self.page_count)
            for page in self.categorys_page:
                page_url, page_content = fetch_page_content(page["title"],rvid=page["id"],vrb=False, baseURL= self.baseURL)
                paragraphs = []
                if page_content:
                    paragraphs = page_content.select('div.mw-parser-output > p')
                paragraph_number = 0
                for p in paragraphs:
                    if is_textable(p):
                        yield Document(
                            page_content=textify(p),
                            metadata={"title": page["title"], "source": page_url},
                        )
                        paragraph_number += 1
                progress.update(task, advance=1)
            progress.stop_task(task)
                

## Loader Initialization and Testing
Here, a Chroma vector store is created with Ollama embeddings, the `MinecraftWikiLoader` is initialized, and the paragraphs it yields are embedded and stored. A retriever is then built on top of the store.

In [7]:
model_name = "mxbai-embed-large"
db_data = Chroma(persist_directory= "./chroma_db_en_V2",embedding_function= OllamaEmbeddings(model=model_name,show_progress=True),collection_name="data")




In [None]:
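loader = MinecraftWikiLoader()  # may take a while, since the pages have to be sorted first: about 3 minutes in general
loader.page_count
# add_documents expects a list, so materialize the generator first.
# This is an alternative to the per-document loop in the next cell: running both inserts every document twice.
db_data.add_documents(list(loader.lazy_load()))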

Getting all pages...



In [None]:
for document in loader.lazy_load():
    try:
        db_data.add_documents([document])
    except Exception as exc:  # keep going if a single document fails to embed or store
        print("Error on : ", document, "->", exc)

db_data.persist()
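
Adding one document per call keeps failures isolated but issues one embedding request per paragraph. A possible middle ground, sketched here under the assumption that the store accepts lists of any length, is to group the generator output into small batches. The helper `batched` is hypothetical and uses only the standard library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any generator works.
print(list(batched((n for n in range(5)), 2)))  # [[0, 1], [2, 3], [4]]
```

With the objects defined in this notebook, each batch could be passed to `db_data.add_documents(batch)` inside the same `try`/`except`; the batch size is a tuning choice, not something the original notebook specifies.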


In [8]:
retriever = db_data.as_retriever(search_kwargs={"k": 10})

In [None]:
retriever.invoke("")