# Acquire Wikipedia Pages

## Overview 

Acquire the content of the Wikipedia pages and save it locally, then split it into chunks and load them into the Qdrant DB.

All the Wikipedia content is acquired through the `wikipedia_urls.txt` file, which contains the list of Wikipedia links the information is taken from.
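
As a minimal sketch of the expected file format, assuming one URL per line: the helper name `read_url_list`, the blank-line filtering, and the temporary sample file are illustrative assumptions and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

def read_url_list(file_path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(file_path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]

# Hypothetical sample file, written to a temporary directory for illustration
sample = (
    "https://it.wikipedia.org/wiki/Giochi_olimpici\n"
    "\n"
    "https://it.wikipedia.org/wiki/Giochi_olimpici_estivi\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "wikipedia_urls.txt"
    sample_path.write_text(sample, encoding="utf-8")
    sample_urls = read_url_list(sample_path)

print(sample_urls)
```

Skipping blank lines avoids passing an empty title to the Wikipedia API when the file ends with an extra newline.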

# Prerequisites

A conda environment is needed. 

For example: 
```
cd path/to/conda/dir
conda env create -f wiki_rag_notebooks.yaml
conda activate wiki_rag_notebooks
python -m ipykernel install --user --name wiki_rag_notebooks --display-name "wiki_rag_notebooks"
```
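
Once the environment is active, a quick sanity check can confirm that the kernel sees the packages imported later in this notebook. This is only a sketch: the helper `find_missing_modules` is hypothetical, and the module list is taken from the import statements below rather than from the YAML file.

```python
import importlib.util

# Top-level module names imported later in this notebook
REQUIRED_MODULES = [
    "requests", "nltk", "wikipediaapi", "langchain",
    "sentence_transformers", "qdrant_client",
]

def find_missing_modules(module_names):
    """Return the module names that the import system cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```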

In [8]:
# Import useful libraries
import requests
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import json
import os

In [9]:
import wikipediaapi
from urllib.parse import unquote, urlparse

### Test wikipediaapi library

Repo [here](https://github.com/martin-majlis/Wikipedia-API)

In [10]:
# Initialize the Wikipedia API
wiki_wiki = wikipediaapi.Wikipedia(
    'WikiRag (mauo.andretta222@gmail.com)', 
    'it',
    extract_format=wikipediaapi.ExtractFormat.WIKI)

In [11]:
# Function to read the Wikipedia URLs from an external file
def load_wikipedia_urls(file_path):
    with open(file_path, 'r') as file:
        urls = [line.strip() for line in file.readlines()]
    return urls

urls = load_wikipedia_urls('../wikipedia_urls.txt')
urls

['https://it.wikipedia.org/wiki/Giochi_olimpici',
 'https://it.wikipedia.org/wiki/Giochi_olimpici_estivi']

In [12]:
# Function to extract the title from a Wikipedia URL
def get_title_from_url(url):
    path = urlparse(url).path
    title = path.split('/')[-1]
    return unquote(title)

# List with all the Wikipedia page titles
titles = [get_title_from_url(url) for url in urls]
titles

['Giochi_olimpici', 'Giochi_olimpici_estivi']
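
The `unquote` call matters for titles with non-ASCII characters, which appear percent-encoded in URLs. A small sketch of that behaviour follows; the function name `title_from_url` and the sample URL are chosen only for illustration.

```python
from urllib.parse import unquote, urlparse

def title_from_url(url):
    """Return the last path segment of a URL, percent-decoded."""
    return unquote(urlparse(url).path.split("/")[-1])

# Hypothetical URL with a percent-encoded character (%C3%A0 is "à" in UTF-8)
encoded_url = "https://it.wikipedia.org/wiki/Citt%C3%A0_del_Vaticano"
decoded_title = title_from_url(encoded_url)
print(decoded_title)

# Caveat: a trailing slash leaves an empty last segment
empty_title = title_from_url("https://it.wikipedia.org/wiki/Giochi_olimpici/")
```

Because a URL ending in `/` yields an empty title, it may be worth stripping trailing slashes before splitting.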

In [13]:
# Fetch the first Wikipedia page
p_wiki = wiki_wiki.page(titles[0])
# Print the content of the Wikipedia page
print(f"Wikipedia content: {p_wiki.text[:100]}")
# Print the Wikipedia page title
print(f"Wikipedia title: {p_wiki.title}")
# Print the Wikipedia page URL
print(f"Wikipedia URL: {p_wiki.fullurl}")
# Print the Wikipedia language
print(f"Wikipedia language: {p_wiki.language}")

Wikipedia content: I Giochi olimpici dell'era moderna sono un evento sportivo quadriennale che prevede la competizione 
Wikipedia title: Giochi olimpici
Wikipedia URL: https://it.wikipedia.org/wiki/Giochi_olimpici
Wikipedia language: it


In [14]:
def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)

print_sections(p_wiki.sections)

*: Storia - 
**: Antichità - I primi giochi olimpici si svolsero nel 
**: La rinascita dei Giochi olimpici - La memoria degli antichi Giochi olimpici
*: Interferenze con le Olimpiadi - 
**: Guerra - Contrariamente alle speranze del barone 
**: Politica - La politica interferì sullo svolgimento 
**: Pandemia di COVID-19 - Il 24 marzo 2020 è stato annunciato il r
*: Il Comitato olimpico internazionale - Il Movimento Olimpico racchiude tutte qu
**: Contestazioni al CIO - Il CIO è stato più volte oggetto di cont
*: Simboli olimpici - Il movimento olimpico utilizza diversi s
*: Cerimonie - 
**: Cerimonia di apertura - La cerimonia di apertura di un'Olimpiade
**: Cerimonia di chiusura - La cerimonia di chiusura è più semplice 
**: Consegna delle medaglie - Al termine di ogni evento olimpico si ti
*: Sport olimpici - Ai Giochi di Sydney 2000 erano presenti 
**: Competizioni artistiche - L'inserimento delle competizioni d'arte 
*: Doping - Già dagli inizi del XX secolo si iniziar
*: Atleti oli

### Step 1: Acquiring the Wikipedia Pages

In [15]:
# Function to read the Wikipedia URLs from an external file
def load_wikipedia_urls(file_path):
    with open(file_path, 'r') as file:
        urls = [line.strip() for line in file.readlines()]
    return urls

urls = load_wikipedia_urls('../wikipedia_urls.txt')
urls

['https://it.wikipedia.org/wiki/Giochi_olimpici',
 'https://it.wikipedia.org/wiki/Giochi_olimpici_estivi']

In [16]:
# Function to extract the title from a Wikipedia URL
def get_title_from_url(url):
    path = urlparse(url).path
    title = path.split('/')[-1]
    return unquote(title)

# List with all the Wikipedia page titles
titles = [get_title_from_url(url) for url in urls]
titles

['Giochi_olimpici', 'Giochi_olimpici_estivi']

In [17]:
# Cell code used to clean the text

# Function to clean the text
# Useful link on text preprocessing: https://stackoverflow.com/questions/72214118/preprocessing-data-to-remove-italian-stopwords-for-text-analysis
def clean_text(text):
    # Remove references (e.g., [1], [2])
    text = re.sub(r'\[\d+\]', '', text)
    # Remove hyperlinks (everything up to the next whitespace)
    text = re.sub(r'https?://\S+', '', text)
    # Remove words of one or two characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace (including newline characters) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    # To lowercase
    text = text.lower()

    return text

Example usage of the `clean_text()` function:

In [18]:
test_text = """
This is a sample text with various elements that need to be cleaned.

Here is a reference [1] that should be removed.

We should also remove this hyperlink: https://example.com/page and any short words like "a", "is", "to".

Furthermore, punctuation, such as commas, periods, and exclamation marks, should be removed!

This is the "See also" section that should be excluded.

References:
[2] Another reference to remove.
External links:
[3] And another.
"""


In [19]:
# Test the clean_text function
cleaned_text = clean_text(test_text)
cleaned_text

'this sample text with various elements that need cleaned here reference that should removed should also remove this hyperlink and any short words like furthermore punctuation such commas periods and exclamation marks should removed this the see also section that should excluded references another reference remove external links and another'
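
One detail worth checking when removing hyperlinks is how far the pattern reaches. A greedy pattern such as `https?://.*/\w*` can consume text up to the last slash on the same line, while a whitespace-bounded pattern stops at the end of the URL. The sketch below compares the two on a made-up line; both the sample text and the variable names are illustrative.

```python
import re

# Made-up line with a URL followed by unrelated text that contains a slash
line = "see https://example.com/page and also a/b ratio here"

# Greedy: ".*" runs to the last "/" on the line, so extra words are lost
greedy_result = re.sub(r'https?://.*/\w*', '', line)

# Whitespace-bounded: the match ends at the first whitespace after the URL
bounded_result = re.sub(r'https?://\S+', '', line)

print(repr(greedy_result))
print(repr(bounded_result))
```

The bounded variant keeps the words after the URL, which is usually the intent when cleaning running text.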

In [20]:
# Function to extract the wikipediaapi object from a Wikipedia Page Title
def scrape_wikipedia(title):

    p_wiki = wiki_wiki.page(title)
           
    return p_wiki

In [21]:
# Function used to get a dictionary with the sections of a Wikipedia page
def get_sections_dict(sections):
    section_dict = {}
    for section in sections:
        section_dict[section.title] = clean_text(section.text)
        # Recursively add subsections
        if section.sections:
            section_dict.update(get_sections_dict(section.sections))
    return section_dict
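
Keying the dictionary by section title alone means that two subsections with the same title (for example a "Note" subsection under different parents) overwrite each other. A possible variation, sketched here with a stand-in `FakeSection` class instead of real `wikipediaapi` objects, keys each entry by its full title path. The class, the function name `sections_to_path_dict`, and the `" / "` separator are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FakeSection:
    """Stand-in for a wikipediaapi section: a title, a text and subsections."""
    title: str
    text: str = ""
    sections: List["FakeSection"] = field(default_factory=list)

def sections_to_path_dict(sections, parent=""):
    """Flatten nested sections, keying each one by its full title path."""
    result: Dict[str, str] = {}
    for section in sections:
        path = f"{parent} / {section.title}" if parent else section.title
        result[path] = section.text
        # Recursively add subsections under the extended path
        result.update(sections_to_path_dict(section.sections, path))
    return result

tree = [
    FakeSection("Storia", "", [FakeSection("Note", "first")]),
    FakeSection("Cerimonie", "", [FakeSection("Note", "second")]),
]
flat = sections_to_path_dict(tree)
print(list(flat))
```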

In [22]:
# Scrape the content from each URL
documents = {}
for title in titles:
    p_wiki = scrape_wikipedia(title)
    documents[title] = {
        'title': p_wiki.title,
        'url': p_wiki.fullurl,
        'language': p_wiki.language,
        'content': clean_text(p_wiki.text),
        #'sections': get_sections_dict(p_wiki.sections),
    }

# Check the content length and language for all the documents
for title, doc in documents.items():
    print(f'{title} ({len(doc["content"])} characters, language: {doc["language"]})')

Giochi_olimpici (36334 characters, language: it)
Giochi_olimpici_estivi (14208 characters, language: it)


In [23]:
# Print a snippet of the cleaned text
for title, document in documents.items():
    print(f"Snippet from {title}:\n{document['content'][:500]}\n")

Snippet from Giochi_olimpici:
giochi olimpici dell era moderna sono evento sportivo quadriennale che prevede competizione tra migliori atleti del mondo quasi tutte discipline sportive praticate nei cinque continenti abitati essi pur essendo comunemente chiamati anche olimpiadi non sono confondere con olimpiade quest ultima indica intervallo tempo quattro anni che intercorre tra edizione dei giochi olimpici estivi successiva per questo anche giochi del 1916 1940 1944 non sono stati disputati continuato conteggiare olimpiadi c

Snippet from Giochi_olimpici_estivi:
giochi olimpici estivi sono una manifestazione sportiva multidisciplinare internazionale prevista negli anni multipli organizzata dal comitato olimpico internazionale olimpiadi sono più prestigioso mondo tra gli eventi questo tipo presentano una varietà sport superiore quella altre manifestazioni simili tutti gli sport vittoria olimpica viene generalmente considerata come risultato più prestigioso conseguibile qualsiasi sport e

In [None]:
# Print the sections of the first document
# (requires the 'sections' key, which is commented out in the cell above)
for section, text in documents[titles[0]].get('sections', {}).items():
    print(f"Section: {section}\n{text[:500]}\n")

In [25]:
# Check whether the content still contains an external link
def has_external_urls(content):
    return 'https' in content

# Check each document for leftover external URLs
for title, document in documents.items():
    print(f"{title}: {has_external_urls(document['content'])}")

Giochi_olimpici: False
Giochi_olimpici_estivi: False


### Step 3: Tokenization

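Tokenize the cleaned text and remove the stop words.

1. Stop word removal: stop words are words that are extremely common in all sorts of texts and carry little or no useful information, so they are dropped.

2. Tokenization: the text is split into individual tokens (words) with NLTK's `word_tokenize`; this happens before the stop-word filter is applied.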
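
The idea can be sketched without NLTK, using a plain whitespace split and a tiny hand-written stop-word set. Both are simplifications made for this illustration: the notebook itself relies on `word_tokenize` and on NLTK's Italian stop-word list, and the names below are hypothetical.

```python
# Hypothetical, tiny stop-word set used only for this illustration
ITALIAN_STOP_WORDS = {"i", "il", "la", "di", "e", "che", "un", "una", "sono"}

def simple_remove_stopwords(text, stop_words):
    """Split on whitespace and drop tokens that appear in stop_words."""
    tokens = text.split()
    return " ".join(token for token in tokens if token.lower() not in stop_words)

sentence = "i giochi olimpici sono un evento sportivo"
filtered = simple_remove_stopwords(sentence, ITALIAN_STOP_WORDS)
print(filtered)
```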

In [26]:
# Download the tokenizer models and the stopwords
nltk.download('punkt', force=True)
nltk.download('punkt_tab', force=True)
nltk.download('stopwords', force=True)

[nltk_data] Downloading package punkt to C:\Users\Mauro
[nltk_data]     Andretta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to C:\Users\Mauro
[nltk_data]     Andretta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to C:\Users\Mauro
[nltk_data]     Andretta\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [27]:
# Function used to tokenize and remove stopwords from a text
def remove_stopwords(content, language):
    # Get the stopwords for the specified language
    if language == 'en':
        stop_words = set(stopwords.words('english'))
    elif language == 'it':
        stop_words = set(stopwords.words('italian'))
    else:
        raise ValueError(f"Unsupported language: {language}")
        
    # Tokenize the content
    tokens = word_tokenize(content)
    
    # Remove the stopwords
    filtered_content = [token for token in tokens if token.lower() not in stop_words]
    
    return ' '.join(filtered_content)

In [28]:
# remove the stopwords from the content and tokenize it
for title, document in documents.items():
    documents[title]['content'] = remove_stopwords(document['content'], document['language'])

In [29]:
# Print a snippet of the cleaned text
for title, document in documents.items():
    print(f"Snippet from {title}:\n{document['content'][:500]}\n")

Snippet from Giochi_olimpici:
giochi olimpici moderna evento sportivo quadriennale prevede competizione migliori atleti mondo quasi tutte discipline sportive praticate cinque continenti abitati essi pur comunemente chiamati olimpiadi confondere olimpiade quest ultima indica intervallo tempo quattro anni intercorre edizione giochi olimpici estivi successiva giochi 1916 1940 1944 stati disputati continuato conteggiare olimpiadi cosicché giochi parigi 2024 stati trentatreesima edizione nome giochi olimpici stato scelto ricordar

Snippet from Giochi_olimpici_estivi:
giochi olimpici estivi manifestazione sportiva multidisciplinare internazionale prevista anni multipli organizzata comitato olimpico internazionale olimpiadi prestigioso mondo eventi tipo presentano varietà sport superiore altre manifestazioni simili sport vittoria olimpica viene generalmente considerata risultato prestigioso conseguibile qualsiasi sport eccezione calcio pochi altri sport prevalentemente squadra quali vengono d

### Step 4: Storage

After tokenization, the data can be stored.

In [30]:
# Directory where the JSON files will be stored
output_dir = 'wikipedia_documents'
os.makedirs(output_dir, exist_ok=True)

# Save each document in an individual JSON file
for title, doc in documents.items():
    # Sanitize title to create a valid filename
    filename = f"{title.replace(' ', '_').replace('/', '_')}.json"
    filepath = os.path.join(output_dir, filename)
    
    # Write the document to a JSON file
    with open(filepath, 'w', encoding='utf-8') as json_file:
        json.dump(doc, json_file, ensure_ascii=False, indent=4)

    print(f"Document for '{title}' saved as '{filepath}'")

Document for 'Giochi_olimpici' saved as 'wikipedia_documents\Giochi_olimpici.json'
Document for 'Giochi_olimpici_estivi' saved as 'wikipedia_documents\Giochi_olimpici_estivi.json'
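
A quick way to confirm that the files are readable is to load one back and compare it with the original dictionary. The sketch below does this in a temporary directory with a made-up document; it also shows that `ensure_ascii=False` keeps accented characters as-is in the file.

```python
import json
import os
import tempfile

# Made-up document with the same keys as the ones saved above
doc = {
    "title": "Giochi olimpici",
    "url": "https://it.wikipedia.org/wiki/Giochi_olimpici",
    "language": "it",
    "content": "varietà sport superiore",  # non-ASCII character on purpose
}

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "Giochi_olimpici.json")
    with open(filepath, "w", encoding="utf-8") as json_file:
        json.dump(doc, json_file, ensure_ascii=False, indent=4)

    # Read the raw text back, then parse it
    with open(filepath, "r", encoding="utf-8") as json_file:
        raw = json_file.read()
    loaded = json.loads(raw)

print(loaded == doc)
```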


## Create chunks

Once the .json documents are created, the text of each document can be split into chunks.

The expected chunk schema is:

{
    "id": str,             # UUID string (e.g. str(uuid.uuid4())), used as the point ID in the vector DB
    "vector": List[float], # the embedding of the chunk
    "payload": {
        "content": str,    # the textual content of the chunk
        "language": str,   # the language of the text
        "title": str,      # the title of the Wikipedia page
        "url": str         # the URL used to retrieve the Wikipedia page
    }
}
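
The schema can also be written down as type hints, with a small check to run before a chunk is saved. This is a sketch under assumptions: the `Chunk` and `ChunkPayload` classes and the `is_valid_chunk` helper are hypothetical names, and the example uses a 3-dimensional vector only to stay short.

```python
import uuid
from typing import List, TypedDict

class ChunkPayload(TypedDict):
    content: str
    language: str
    title: str
    url: str

class Chunk(TypedDict):
    id: str
    vector: List[float]
    payload: ChunkPayload

def is_valid_chunk(chunk, vector_size):
    """Check the id, the vector length and the payload keys of a chunk dictionary."""
    try:
        uuid.UUID(chunk["id"])
    except (KeyError, ValueError, TypeError, AttributeError):
        return False
    vector = chunk.get("vector", [])
    payload = chunk.get("payload", {})
    return (
        len(vector) == vector_size
        and all(isinstance(value, float) for value in vector)
        and set(payload) == {"content", "language", "title", "url"}
    )

example: Chunk = {
    "id": str(uuid.uuid4()),
    "vector": [0.1, -0.2, 0.3],  # toy vector, not a real embedding
    "payload": {
        "content": "giochi olimpici",
        "language": "it",
        "title": "Giochi olimpici",
        "url": "https://it.wikipedia.org/wiki/Giochi_olimpici",
    },
}
print(is_valid_chunk(example, vector_size=3))
```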

Sentence transformer installation problem: [solution](https://stackoverflow.com/questions/78808745/the-python-module-sentence-transformers-is-not-found-even-though-the-package-is)

In [31]:
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from typing import List

In [44]:
def process_documents(input_dir: str):
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
    
    # Initialize the embedding model
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # List to hold all the chunks
    all_chunks = []
    
    # Iterate over all JSON files in the directory
    for filename in os.listdir(input_dir):
        if filename.endswith('.json'):
            filepath = os.path.join(input_dir, filename)
            
            with open(filepath, 'r', encoding='utf-8') as json_file:
                doc = json.load(json_file)
                
                # Extract the necessary information
                title = doc['title']
                content = doc['content']
                language = doc['language']
                url = doc['url']
                
                # Split the content into chunks
                texts = text_splitter.split_text(content)
                
                # Create embeddings for each chunk with the SentenceTransformer model
                embeddings = embedding_model.encode(texts)
                
                # Create a chunk for each piece of text
                for i, (text, vector) in enumerate(zip(texts, embeddings)):
                    chunk = {
                        "id": str(uuid.uuid4()),  # Unique identifier
                        "vector": vector.tolist(),  # Convert NumPy array to list
                        "payload": {
                            "content": text,
                            "language": language,
                            "title": title,
                            "url": url,
                        }
                    }
                    all_chunks.append(chunk)
    
    return all_chunks
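
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbour. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` prefers to split on separators and does not behave exactly like this, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample_text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=2)
print(pieces)
```

With an overlap, the last characters of one chunk reappear at the start of the next, which helps keep a sentence that straddles a boundary retrievable from either side.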


In [45]:
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [46]:
# Example usage
input_dir = 'wikipedia_documents'
os.makedirs(input_dir, exist_ok=True)
chunks = process_documents(input_dir)



In [47]:
chunks[0]

{'id': '9bcca9b1-76dd-4597-9ecd-7aae8e62d085',
 'vector': [-0.0050218962132930756,
  0.03419070690870285,
  -0.045247167348861694,
  -0.0019858398009091616,
  -0.11844921112060547,
  0.03562400862574577,
  0.03216155990958214,
  0.067304328083992,
  -0.06848309189081192,
  0.06326217204332352,
  0.06979662925004959,
  -0.013670475222170353,
  0.020547524094581604,
  0.02239847369492054,
  -0.00953915249556303,
  -0.03514229878783226,
  0.010729718953371048,
  0.009665064513683319,
  -0.09331295639276505,
  -0.05155130475759506,
  0.0034140448551625013,
  -0.029114410281181335,
  0.01524844579398632,
  0.037869472056627274,
  -0.13054262101650238,
  0.0058988528326153755,
  -0.06963079422712326,
  0.004497836343944073,
  -0.018714364618062973,
  0.007497597485780716,
  0.013199086301028728,
  0.08543937653303146,
  0.05496099591255188,
  0.02430768869817257,
  0.02171754650771618,
  0.007798492908477783,
  -0.04191609099507332,
  -0.053675610572099686,
  -0.032413508743047714,
  0.07667

In [48]:
def save_chunk_to_json(chunk, output_dir):
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Generate a filename using the chunk ID
    filename = f"{chunk['id']}.json"
    filepath = os.path.join(output_dir, filename)
    
    # Write the chunk to a JSON file
    with open(filepath, 'w', encoding='utf-8') as json_file:
        json.dump(chunk, json_file, ensure_ascii=False, indent=4)
    
    print(f"Chunk saved as {filepath}")

In [49]:
# Example usage
output_dir = 'chunked_documents'

for chunk in chunks:
    save_chunk_to_json(chunk, output_dir)

Chunk saved as chunked_documents\9bcca9b1-76dd-4597-9ecd-7aae8e62d085.json
Chunk saved as chunked_documents\0164dfd0-beae-46bf-9da1-6e4e3022ad8c.json
Chunk saved as chunked_documents\c185fd48-53ea-4285-a51f-3765a990f0e1.json
Chunk saved as chunked_documents\a9793504-9151-4458-86ac-0df6fb0edf7f.json
Chunk saved as chunked_documents\497af23b-fb58-476e-b443-06d3343e89f5.json
Chunk saved as chunked_documents\44d2854c-7e9f-4075-a337-7cdfc9238d10.json
Chunk saved as chunked_documents\2bee4bdc-33a5-466f-a4dc-e6c2d3fea385.json
Chunk saved as chunked_documents\070c7a2e-abb4-4c5a-b32e-d5de455eab38.json
Chunk saved as chunked_documents\73ac319c-96b9-42b7-81d5-d9fff8913c04.json
Chunk saved as chunked_documents\b94b4d5d-db88-47be-b795-416d2d107bcd.json
Chunk saved as chunked_documents\cd451ddd-5a19-4ed7-9905-71b3d3b6107d.json
Chunk saved as chunked_documents\083c776b-f0e2-4464-88ce-6f19461e1eaf.json
Chunk saved as chunked_documents\465f9c67-df80-435b-8df3-9e63c7183d9b.json
Chunk saved as chunked_do

## Load the chunks as Qdrant points

In [52]:
# Import the official classes from the Qdrant client library
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams,
    Distance,
    Filter,
    FieldCondition,
    MatchValue,
    PointStruct,
)

In [61]:
def load_chunks_to_qdrant(chunks_dir: str, collection_name: str):
    # Connect to Qdrant instance
    qdrant_client = QdrantClient(host="localhost", port=6333)  # Adjust as necessary

    processed_chunks = []
    unprocessed_chunks = []
    
    # Create the collection if it doesn't exist
    if not qdrant_client.collection_exists(collection_name):
        qdrant_client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE)
        )
        print(f"Collection '{collection_name}' created in Qdrant.")
    else:
        print(f"Collection '{collection_name}' already exists in Qdrant.")
    
    # Iterate over all chunk files in the directory
    for filename in os.listdir(chunks_dir):
        if filename.endswith('.json'):
            filepath = os.path.join(chunks_dir, filename)

            print(f"Processing file '{filename}'...")

            try:
                with open(filepath, 'r', encoding='utf-8') as json_file:
                    chunk = json.load(json_file)
                    point_id = chunk['id']
                    

                    # Insert the point into Qdrant
                    qdrant_client.upsert(
                        collection_name=collection_name,
                        points=[
                            PointStruct(
                                id=point_id,
                                vector=chunk['vector'],
                                payload=chunk['payload']
                            )
                        ]
                    )
                    print(f"Point {point_id} inserted into Qdrant.")
                    processed_chunks.append(filepath)
            except Exception as e:
                print(f"Error processing file '{filename}': {e}")
                unprocessed_chunks.append(filepath)
    
    print(f"Processed {len(processed_chunks)} chunks.")
    print(f"Failed to process {len(unprocessed_chunks)} chunks.")
    print(f"Failed chunks: {unprocessed_chunks}")
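
The function above sends one point per `upsert` call. Since `upsert` takes a list of points, one possible variation is to group the points into batches first and send each batch in a single call. The helper below is a hypothetical, Qdrant-independent sketch of the grouping step only; the batch size is an arbitrary example value.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the last, possibly shorter, batch
    if batch:
        yield batch

batches = list(batched(range(7), batch_size=3))
print(batches)
```

Each yielded list could then be passed as the `points` argument of a single `upsert` call, which reduces the number of round trips to the server.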


In [62]:
# Example usage
chunks_dir = 'chunked_documents'
collection_name = 'test_collection'

# Load chunks into Qdrant
load_chunks_to_qdrant(chunks_dir, collection_name)

Collection 'test_collection' already exists in Qdrant.
Processing file '0164dfd0-beae-46bf-9da1-6e4e3022ad8c.json'...
Point 0164dfd0-beae-46bf-9da1-6e4e3022ad8c inserted into Qdrant.
Processing file '01a7234f-3198-42a6-a101-e3bd90dd56e8.json'...
Point 01a7234f-3198-42a6-a101-e3bd90dd56e8 inserted into Qdrant.
Processing file '03325c45-e6aa-485e-abea-6f446da7dce3.json'...
Point 03325c45-e6aa-485e-abea-6f446da7dce3 inserted into Qdrant.
Processing file '070c7a2e-abb4-4c5a-b32e-d5de455eab38.json'...
Point 070c7a2e-abb4-4c5a-b32e-d5de455eab38 inserted into Qdrant.
Processing file '0778d780-b2e4-40e9-827d-41afd8087ad0.json'...
Point 0778d780-b2e4-40e9-827d-41afd8087ad0 inserted into Qdrant.
Processing file '083c776b-f0e2-4464-88ce-6f19461e1eaf.json'...
Point 083c776b-f0e2-4464-88ce-6f19461e1eaf inserted into Qdrant.
Processing file '10344e99-32d1-4242-8eed-80aeaa2fb67f.json'...
Point 10344e99-32d1-4242-8eed-80aeaa2fb67f inserted into Qdrant.
Processing file '2bee4bdc-33a5-466f-a4dc-e6c2d3fe