# URL Vector Database Generator

## Install Requirements

In [None]:
!pip install openai
!pip install langchain
!pip install tiktoken
!pip install faiss-cpu
!pip install requests beautifulsoup4

## Specify main URL sources

In this project, we build a vector database from content published on www.telkom.co.id. We start from three parent URLs and collect the pages linked beneath each of them. These parent URLs are:

- https://www.telkom.co.id/sites/about-telkom/id_ID
- https://www.telkom.co.id/sites/enterprise/id_ID
- https://www.telkom.co.id/sites/wholesale/id_ID

To do this, we use the `BeautifulSoup` library to extract every link (the `href` attribute of each `a` tag) from the parent pages and keep only the URLs that sit beneath a parent URL. The result is a list of all subdirectory URLs.

In [43]:
main_url = ["https://www.telkom.co.id/sites/about-telkom/id_ID", "https://www.telkom.co.id/sites/enterprise/id_ID", "https://www.telkom.co.id/sites/wholesale/id_ID"]

In [44]:
subdirectories = []

In [45]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

for url in main_url:

  # Send an HTTP GET request to the main URL (time out after 30 seconds instead of hanging)
  response = requests.get(url, timeout=30)

  # Check if the request was successful (status code 200)
  if response.status_code == 200:
      # Parse the HTML content of the page
      soup = BeautifulSoup(response.text, "html.parser")

      # Find all anchor (a) tags that contain href attributes
      links = soup.find_all("a", href=True)

      # Loop through the links and extract subdirectories
      for link in links:
          # Get the href attribute of the link
          href = link["href"]

          # Join the URL with the main domain to create an absolute URL
          absolute_url = urljoin(url, href)

          # Check if the URL is a subdirectory of the main domain
          if absolute_url.startswith(url) and absolute_url != url:
              subdirectories.append(absolute_url)
  else:
      print(f"Failed to retrieve {url} (status code {response.status_code}).")

In [46]:
subdirectories = list(set(subdirectories))

In [47]:
print("First five url:")
for subdirectory in subdirectories[:5]:
  print(subdirectory)

print(f"\nTotal url: {len(subdirectories)}")

First five url:
https://www.telkom.co.id/sites/about-telkom/id_ID/news/batic-2023-transformasi-dan-inovasi-jadi-strategi-jitu-di-tengah-evolusi-teknologi-digital-yang-dinamis-2101
https://www.telkom.co.id/sites/wholesale/id_ID/page/homepage-network-connectivity-973
https://www.telkom.co.id/sites/enterprise/id_ID/page/digital-financial-banking-solution-809
https://www.telkom.co.id/sites/about-telkom/id_ID/page/ir-informasi-atau-fakta-material-lain-174
https://www.telkom.co.id/sites/enterprise/id_ID/news/partisipasi-telkomgroup-dukung-pendanaan-startup-nasional-melalui-peresmian-merah-putih-fund-2105

Total url: 141
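
The filtering step above can be isolated into a small helper. The following is a minimal sketch, not the notebook's own code: the function name `collect_subdirectories` is hypothetical, and stripping `#fragment` suffixes before deduplicating is an added assumption, so that links such as `page#top` and `page` count as one URL.

```python
from urllib.parse import urljoin, urldefrag


def collect_subdirectories(base_url, hrefs):
    """Return the sorted, unique absolute URLs that sit beneath base_url.

    Hypothetical helper: hrefs is any iterable of raw href strings,
    for example the values pulled from BeautifulSoup anchor tags.
    """
    found = set()
    for href in hrefs:
        # Resolve relative links, then drop any "#fragment" suffix (assumption)
        absolute_url, _fragment = urldefrag(urljoin(base_url, href))
        # Keep only URLs under the parent, excluding the parent itself
        if absolute_url.startswith(base_url) and absolute_url != base_url:
            found.add(absolute_url)
    return sorted(found)


parent = "https://www.telkom.co.id/sites/enterprise/id_ID"
sample_hrefs = [
    "/sites/enterprise/id_ID/page/example-1",      # relative link under the parent
    "/sites/enterprise/id_ID/page/example-1#top",  # same page with a fragment
    "https://example.com/elsewhere",               # external link, dropped
    "/sites/enterprise/id_ID",                     # the parent itself, dropped
]
print(collect_subdirectories(parent, sample_hrefs))
```

Returning a sorted list rather than `list(set(...))` keeps the order stable between runs, which makes the printed sample reproducible.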


## Extract Text from HTML Webpages and Convert It into Document Format

Next, we use LangChain's `WebBaseLoader` to extract the text content of each HTML page and convert it into LangChain documents. We then split the documents into smaller chunks so that each piece stays well within OpenAI's token limits.

In [49]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(subdirectories)
data = loader.load()

In [50]:
import warnings
warnings.filterwarnings("ignore")

In [51]:
from langchain.text_splitter import TokenTextSplitter
from langchain.text_splitter import CharacterTextSplitter

# CharacterTextSplitter measures both values in characters, not tokens
chunk_size_limit = 850
max_chunk_overlap = 20

# Token-based alternative:
# text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=25)
text_splitter = CharacterTextSplitter(chunk_size=chunk_size_limit, chunk_overlap=max_chunk_overlap)
docs = text_splitter.split_documents(data)
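
To illustrate what chunk size and overlap mean, here is a simplified sliding-window sketch in plain Python. It is not LangChain's algorithm (`CharacterTextSplitter` first splits on a separator and then merges the pieces); the function name `chunk_text` is hypothetical, and the defaults simply mirror the values used in this notebook.

```python
def chunk_text(text, chunk_size=850, overlap=20):
    """Split text into fixed-size character windows that share `overlap` characters.

    Simplified illustration only; it ignores separators and word boundaries.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample_text = "".join(str(i % 10) for i in range(2000))
sample_chunks = chunk_text(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

Each chunk starts 830 characters after the previous one, so the last 20 characters of one chunk are repeated at the start of the next. That shared context is what the overlap setting is for.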



In [52]:
print(f'First content of our data:\n {docs[0]}')

First content of our data:
 page_content='Telkom | BATIC 2023: Transformasi dan Inovasi Jadi Strategi Jitu di Tengah Evolusi Teknologi Digital yang Dinamis\n\n\nTentang TelkomEnterpriseWholesale\n\n\nProfil\n\n\nProfil dan Riwayat Singkat\nDewan Komisaris\nDireksi\nStruktur Group Perusahaan\nPenghargaan\nAnggaran Dasar\nLogo Telkom Indonesia\nAsean Summit 2023\nHut Telkom\n\n\nHubungan Investor\n\nLaporan-Laporan\n\n\nInformasi Saham dan Obligasi\n\n\nBerita dan Kegiatan\n\n\nInformasi Lainnya\n\n\nLaporan SEC\n\n\nLaporan Keuangan\n\n\nLaporan Tahunan\n\n\nInfo Memo\n\n\nLaporan Keberlanjutan\n\n\nIkhtisar Keuangan\n\n\nHarga dan Volume Saham\n\n\nKomposisi Pemegang Saham\n\n\nKebijakan Dividen\n\n\nKronologis Pencatatan Saham\n\n\nInformasi Obligasi\n\n\nProspektus Penawaran Umum\n\n\nRUPS\n\n\nKalender Investor\n\n\nEarnings Call\n\n\nInformasi Kepada Investor\n\n\nInformasi Aksi Korporasi\n\n\nInformasi atau Fakta Material Lain' metadata={'source': 'https://www.telkom.co.id/sites/a

In [53]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

total_word_count = sum(len(doc.page_content.split()) for doc in docs)
total_token_count = sum(len(enc.encode(doc.page_content)) for doc in docs)

print(f"\nTotal word count: {total_word_count}")
print(f"\nEstimated tokens: {total_token_count}")
# Assumes a flat rate of $0.0004 per 1,000 tokens; check current OpenAI embedding pricing
print(f"\nEstimated cost of embedding: ${total_token_count * 0.0004 / 1000}")


Total word count: 12506

Estimated tokens: 25293

Estimated cost of embedding: $0.0101172
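
The cost figure above is a simple proportion: tokens divided by 1,000, multiplied by the price per 1,000 tokens. A minimal sketch follows, with the hypothetical helper `estimate_embedding_cost` and the rate from the cell above passed as a default. The rate is an assumption taken from this notebook and may not match current pricing.

```python
def estimate_embedding_cost(token_count, usd_per_1k_tokens=0.0004):
    """Estimate embedding cost in USD for a given token count.

    The default rate is the one hardcoded in this notebook (assumption).
    """
    return token_count / 1000 * usd_per_1k_tokens


# 25,293 tokens at $0.0004 per 1K tokens
print(f"${estimate_embedding_cost(25293):.7f}")  # prints $0.0101172
```

Keeping the rate as a parameter makes it easy to rerun the estimate when the price or the embedding model changes.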


## Create LLM and Embedding Model

In [54]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

embeddings_openai = OpenAIEmbeddings(openai_api_key='API_KEY_HERE')

llm_openai = ChatOpenAI(openai_api_key='API_KEY_HERE', temperature=0)
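
Rather than pasting the key into the notebook, it can be read from an environment variable. This is a minimal sketch: the helper name `load_api_key` is hypothetical, and it assumes the key has been exported as `OPENAI_API_KEY` before the notebook starts.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message if the variable is unset or empty.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return api_key


# Example use with the objects from the cell above (not executed here):
# embeddings_openai = OpenAIEmbeddings(openai_api_key=load_api_key())
# llm_openai = ChatOpenAI(openai_api_key=load_api_key(), temperature=0)
```

This keeps the key out of the saved notebook, which matters if the file is shared or committed to version control.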

## Save data document as vector database

In [55]:
from langchain.vectorstores import FAISS

vector_store = FAISS.from_documents(docs, embeddings_openai)

In [56]:
# This step creates a folder named "Telkom_URL_vectorstore", which holds our vector database.

vector_store.save_local('Telkom_URL_vectorstore')

## Zip vector database for local download

To avoid rebuilding the vector database from the same data source every time, we zip the `Telkom_URL_vectorstore` folder. The resulting archive can be downloaded and reused later.

In [57]:
import shutil

folder_path = "/content/Telkom_URL_vectorstore"
output_path = "/content/Telkom_URL_vectorstore"

shutil.make_archive(output_path, 'zip', folder_path)


'/content/Telkom_URL_vectorstore.zip'
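
To reuse the archive later, it has to be unpacked again before the vector store can be loaded. The following is a minimal sketch using only the standard library. The helper name `restore_vector_store` is hypothetical, and the loading step is left as a comment because the exact arguments of `FAISS.load_local` depend on the installed LangChain version.

```python
import shutil
from pathlib import Path


def restore_vector_store(zip_path, target_dir):
    """Unpack a zipped vector store folder and return the file names it contains."""
    shutil.unpack_archive(zip_path, target_dir, "zip")
    return sorted(path.name for path in Path(target_dir).iterdir())


# Example use (paths are assumptions; adjust to where the zip file was downloaded):
# files = restore_vector_store("Telkom_URL_vectorstore.zip", "Telkom_URL_vectorstore")
# The restored folder can then be passed to FAISS.load_local together with
# the same embeddings object that was used to build the index.
```

Loading a saved index only makes sense with the same embedding model that created it, since queries must be embedded into the same vector space.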