In [1]:
import wikipediaapi

In [4]:
import nltk
from nltk.tokenize import word_tokenize


nltk.download('punkt')  # tokenizer data for splitting text into sentences and words
nltk.download('punkt_tab')

def nb_tokens(text):
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [5]:
word_tokenize("I am happy!")

['I', 'am', 'happy', '!']
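If the NLTK tokenizer data cannot be downloaded, a rough token count can still be obtained with a regular expression. The sketch below is an approximation and not NLTK's algorithm; `approx_tokens` and `approx_nb_tokens` are hypothetical helper names, and counts may differ from `word_tokenize` on contractions and abbreviations.

```python
import re

def approx_tokens(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def approx_nb_tokens(text):
    """Approximate token count; a stand-in for nb_tokens when NLTK is unavailable."""
    return len(approx_tokens(text))

print(approx_tokens("I am happy!"))  # ['I', 'am', 'happy', '!']
```

For simple sentences like the one above, this matches the `word_tokenize` output shown in the previous cell.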

In [6]:
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='Knowledge/1.0 (example@google.com)'
)

In [9]:
topic = "Retrieval-Augmented Generation"
filename = "RAG"
maxl = 100

In [10]:
import textwrap

page = wiki.page(topic)
if page.exists():
    print("Page - Exists: %s" % page.exists())

    summary = page.summary
    nbt = nb_tokens(summary)
    print(f"Number of tokens: {nbt}")

    wrapped_text = textwrap.fill(summary, width=100)
    print(wrapped_text)
else:
    print("Page does not exist")

Page - Exists: True
Number of tokens: 372
Retrieval-augmented generation (RAG) is a technique that enables generative artificial intelligence
(Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large
language model (LLM) so that the model responds to user queries with reference to a specified set of
documents, using this information to supplement information from its pre-existing training data.
This allows LLMs to use domain-specific and/or updated information. Use cases include providing
chatbot access to internal company data or generating responses based on authoritative sources. RAG
improves large language models (LLMs) by incorporating information retrieval before generating
responses. Unlike traditional LLMs that rely on static training data, RAG pulls relevant text from
databases, uploaded documents, or web sources. According to Ars Technica, "RAG is a way of improving
LLM performance, in essence by blending the LLM process with a web sea

In [11]:
print(page.fullurl)

https://en.wikipedia.org/wiki/Retrieval-augmented_generation


In [21]:
links = page.links

urls = []
counter = 0
for link in links:
    # Fetch each linked page once and reuse the object
    linked_page = wiki.page(link)
    if not linked_page.exists():
        continue  # skip links that point to missing pages
    counter += 1
    print(f"Link {counter}: {link}")
    summary = linked_page.summary
    print(f"Link: {link}")
    print(linked_page.fullurl)
    urls.append(linked_page.fullurl)
    print(f"Summary: {summary}")
    if counter >= maxl:
        break

Link 1: 01.AI
Link: 01.AI
https://en.wikipedia.org/wiki/01.AI
Summary: 01.AI (Chinese: 零一万物; pinyin: Língyī Wànwù) is an artificial intelligence (AI) company based in Beijing, China. It focuses on developing open source products.
Link 2: 15.ai
Link: 15.ai
https://en.wikipedia.org/wiki/15.ai
Summary: 15.ai was a free non-commercial web application that used artificial intelligence to generate text-to-speech voices of fictional characters from popular media. Created by an anonymous artificial intelligence researcher known as 15, who began developing the technology as a freshman during their undergraduate research at the Massachusetts Institute of Technology, the application allowed users to make characters from video games, television shows, and movies speak custom text with emotional inflections faster than real-time. The platform was notable for its ability to generate convincing voice output using minimal training data—the name "15.ai" referenced the creator's claim that a voice could
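The loop above caps the number of links with a manual counter. As an alternative sketch, `itertools.islice` can take the first `n` keys directly, assuming `page.links` behaves like a dict keyed by link title. `first_n` and `sample_links` are hypothetical names used only for illustration; the sample dict stands in for real link data.

```python
from itertools import islice

def first_n(items, n):
    """Return at most the first n items of any iterable (dict keys included)."""
    return list(islice(items, n))

# Stand-in for page.links: titles mapped to placeholder values.
sample_links = {"01.AI": None, "15.ai": None, "Action selection": None}
print(first_n(sample_links, 2))  # ['01.AI', '15.ai']
```

Note that this caps the number of links attempted, not the number of pages that actually exist.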

In [22]:
print(counter)
print(urls)

100
['https://en.wikipedia.org/wiki/01.AI', 'https://en.wikipedia.org/wiki/15.ai', 'https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)', 'https://en.wikipedia.org/wiki/Amazon_Web_Services', 'https://en.wikipedia.org/wiki/Action_selection', 'https://en.wikipedia.org/wiki/Activation_function', 'https://en.wikipedia.org/wiki/Adobe_Firefly', 'https://en.wikipedia.org/wiki/Adversarial_machine_learning', 'https://en.wikipedia.org/wiki/Alan_Turing', 'https://en.wikipedia.org/wiki/AlexNet', 'https://en.wikipedia.org/wiki/Alex_Graves_(computer_scientist)', 'https://en.wikipedia.org/wiki/Alex_Krizhevsky', 'https://en.wikipedia.org/wiki/Alibaba_Group', 'https://en.wikipedia.org/wiki/Allen_Newell', 'https://en.wikipedia.org/wiki/AlphaFold', 'https://en.wikipedia.org/wiki/AlphaGo', 'https://en.wikipedia.org/wiki/AlphaZero', 'https://en.wikipedia.org/wiki/Alphabet_Inc.', 'https://en.wikipedia.org/wiki/Andrej_Karpathy', 'https://en.wikipedia.org/wiki/Andrew_Ng', 'https://en.wikiped
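Different link titles can resolve to the same URL (for example through redirects), so the list may contain repeats. A minimal sketch for removing duplicates while keeping the original order follows; `unique_in_order` is a hypothetical helper, and the sample list is illustrative rather than taken from the run above.

```python
def unique_in_order(items):
    """Drop repeated items while preserving first-seen order."""
    return list(dict.fromkeys(items))

sample_urls = [
    "https://en.wikipedia.org/wiki/01.AI",
    "https://en.wikipedia.org/wiki/15.ai",
    "https://en.wikipedia.org/wiki/01.AI",  # deliberate duplicate
]
print(unique_in_order(sample_urls))
```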

In [25]:
from datetime import datetime

links = page.links

fname = filename + "_citations.txt"
with open(fname, "w", encoding="utf-8") as file:  # summaries may contain non-ASCII text
    file.write(f"Citation. In Wikipedia, The Free Encyclopedia. Pages retrieved from the following Wikipedia contributors on {datetime.now()}\n")
    file.write("Root page: " + page.fullurl + "\n")

    counter = 0
    urls = []
    urls.append(page.fullurl)

    for link in links:
        page_detail = wiki.page(link)
        if not page_detail.exists():
            continue  # skip links that point to missing pages

        counter += 1
        summary = page_detail.summary

        file.write(f"Link {counter}: {link}\n")
        file.write(f"Link: {link}\n")
        file.write(f"{page_detail.fullurl}\n")
        urls.append(page_detail.fullurl)
        file.write(f"Summary: {summary}\n")

        if counter >= maxl:
            break
    
    file.write(f"Total links processed: {counter}\n")
    file.write("URLs:\n")
    file.write("\n".join(urls))
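The file-writing step can also be separated from the network calls, which makes it easier to test. The sketch below assumes the entries have already been collected as `(title, url, summary)` tuples; `write_citations` and `demo_path` are hypothetical names, the output layout is simplified, and the summary text is a placeholder rather than fetched data.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile

def write_citations(path, root_url, entries):
    """Write one numbered block per (title, url, summary) entry; return all URLs."""
    urls = [root_url]
    retrieved = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Root page: {root_url} (retrieved {retrieved})\n")
        for i, (title, url, summary) in enumerate(entries, start=1):
            f.write(f"Link {i}: {title}\n{url}\nSummary: {summary}\n")
            urls.append(url)
        f.write(f"Total links processed: {len(entries)}\n")
    return urls

# Illustrative call that writes to the system temp directory.
demo_path = Path(tempfile.gettempdir()) / "demo_citations.txt"
demo_urls = write_citations(
    demo_path,
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    [("01.AI", "https://en.wikipedia.org/wiki/01.AI", "placeholder summary")],
)
print(demo_urls)
```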

In [26]:
urls

['https://en.wikipedia.org/wiki/Retrieval-augmented_generation',
 'https://en.wikipedia.org/wiki/01.AI',
 'https://en.wikipedia.org/wiki/15.ai',
 'https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)',
 'https://en.wikipedia.org/wiki/Amazon_Web_Services',
 'https://en.wikipedia.org/wiki/Action_selection',
 'https://en.wikipedia.org/wiki/Activation_function',
 'https://en.wikipedia.org/wiki/Adobe_Firefly',
 'https://en.wikipedia.org/wiki/Adversarial_machine_learning',
 'https://en.wikipedia.org/wiki/Alan_Turing',
 'https://en.wikipedia.org/wiki/AlexNet',
 'https://en.wikipedia.org/wiki/Alex_Graves_(computer_scientist)',
 'https://en.wikipedia.org/wiki/Alex_Krizhevsky',
 'https://en.wikipedia.org/wiki/Alibaba_Group',
 'https://en.wikipedia.org/wiki/Allen_Newell',
 'https://en.wikipedia.org/wiki/AlphaFold',
 'https://en.wikipedia.org/wiki/AlphaGo',
 'https://en.wikipedia.org/wiki/AlphaZero',
 'https://en.wikipedia.org/wiki/Alphabet_Inc.',
 'https://en.wikipedia.org/wiki/A

In [27]:
ufname = filename + "_urls.txt"
with open(ufname, 'w') as file:
    for url in urls:
        file.write(url + '\n')

print(f"URLs have been written to {ufname}")

URLs have been written to RAG_urls.txt


In [28]:
with open(ufname, 'r') as file:
    urls = [line.strip() for line in file]

print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
https://en.wikipedia.org/wiki/01.AI
https://en.wikipedia.org/wiki/15.ai
https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)
https://en.wikipedia.org/wiki/Amazon_Web_Services
https://en.wikipedia.org/wiki/Action_selection
https://en.wikipedia.org/wiki/Activation_function
https://en.wikipedia.org/wiki/Adobe_Firefly
https://en.wikipedia.org/wiki/Adversarial_machine_learning
https://en.wikipedia.org/wiki/Alan_Turing
https://en.wikipedia.org/wiki/AlexNet
https://en.wikipedia.org/wiki/Alex_Graves_(computer_scientist)
https://en.wikipedia.org/wiki/Alex_Krizhevsky
https://en.wikipedia.org/wiki/Alibaba_Group
https://en.wikipedia.org/wiki/Allen_Newell
https://en.wikipedia.org/wiki/AlphaFold
https://en.wikipedia.org/wiki/AlphaGo
https://en.wikipedia.org/wiki/AlphaZero
https://en.wikipedia.org/wiki/Alphabet_Inc.
https://en.wikipedia.org/wiki/Andrej_Karpathy
https://en.wikipedia.org/wiki/Andrew_Ng
https://en.w
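When reading the list back, it can help to drop blank or malformed lines before passing the URLs to a later step. This is a minimal sketch using `urllib.parse.urlparse`; `parse_url_lines` is a hypothetical helper, and the check only confirms that a line has an http(s) scheme and a host, not that the page exists.

```python
from urllib.parse import urlparse

def parse_url_lines(lines):
    """Keep stripped, non-empty lines that look like http(s) URLs."""
    urls = []
    for line in lines:
        candidate = line.strip()
        parts = urlparse(candidate)
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(candidate)
    return urls

sample_lines = ["https://en.wikipedia.org/wiki/01.AI\n", "\n", "not a url\n"]
print(parse_url_lines(sample_lines))  # ['https://en.wikipedia.org/wiki/01.AI']
```

The same function accepts an open file object, since iterating over a file yields its lines.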