### WEB SCRAPING USING BEAUTIFUL SOUP

In [2]:
import requests
from bs4 import BeautifulSoup

# Step 1: Scrape data from Wikipedia
url = "https://en.wikipedia.org/wiki/India"
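response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page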
soup = BeautifulSoup(response.text, 'html.parser')

# Extract data - for example, the first 10 paragraphs
paragraphs = soup.find_all('p', limit=10)
documents = []

for paragraph in paragraphs:
    # Keep only the visible text, without leading/trailing whitespace
    documents.append(paragraph.text.strip())

# Print the extracted documents, separated by blank lines
for doc in documents:
    print(doc)
    print("\n")




India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia.  It is the seventh-largest country by area; the most populous country as of June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] It is physiographically bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, the Bay of Bengal on the southeast, and High-mountain Asia on the northeast. It shares land borders with Pakistan to the northwest;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.


Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[27][28][29]
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diver

In [3]:
documents

['',
 "India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia.  It is the seventh-largest country by area; the most populous country as of June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] It is physiographically bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, the Bay of Bengal on the southeast, and High-mountain Asia on the northeast. It shares land borders with Pakistan to the northwest;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.",
 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[27][28][29]\nTheir long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highl
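
The output above shows two artifacts of the raw scrape: an empty first string and inline citation markers such as `[21]` and `[j]`. A minimal sketch of a cleanup pass follows. The helper names `clean_paragraph` and `clean_documents` are hypothetical (they are not part of the notebook), and the regular expression assumes markers are either digits or one or two lowercase letters in square brackets.

```python
import re

# Matches Wikipedia-style inline markers such as [21] or [j].
# Assumption: a marker is digits, or one or two lowercase letters, in brackets.
CITATION_RE = re.compile(r"\[(?:\d+|[a-z]{1,2})\]")


def clean_paragraph(text: str) -> str:
    """Remove citation markers and collapse runs of whitespace."""
    without_citations = CITATION_RE.sub("", text)
    return re.sub(r"\s+", " ", without_citations).strip()


def clean_documents(docs: list[str]) -> list[str]:
    """Clean every paragraph and drop the ones that end up empty."""
    cleaned = (clean_paragraph(d) for d in docs)
    return [d for d in cleaned if d]


sample = [
    "",
    "India is a country in South Asia.[21]  It borders Pakistan;[j] China to the north.",
]
print(clean_documents(sample))
```

Applying something like this before indexing would keep bracketed markers out of the embedded text. Whether that helps retrieval quality for this page is untested here.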

### BUILDING OPEN SOURCE LLM WITH FINE TUNING & EMBEDDING USING HUGGING FACE

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts.prompts import SimpleInputPrompt
import torch





In [5]:
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
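# Wrapper applied around each query before it is sent to the model.
# Note: the <|USER|>/<|ASSISTANT|> tags are StableLM-style; Llama 2 chat models
# were trained on the [INST] ... [/INST] format, so this wrapper may be suboptimal.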
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
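
To make the wrapper concrete, here is a small sketch of the string substitution it implies, next to the single-turn Llama 2 chat layout for comparison. It uses plain `str.format` only. How `HuggingFaceLLM` combines the system prompt and the wrapper internally is not shown in this notebook, so the helper names `wrap_query` and `wrap_llama2` are hypothetical illustrations, not library calls.

```python
# The wrapper string from the cell above; {query_str} is replaced by the user's query.
QUERY_WRAPPER = "<|USER|>{query_str}<|ASSISTANT|>"

# Single-turn Llama 2 chat layout. The tokenizer normally adds the <s> BOS token
# itself, so it is left out of this template (assumption for this sketch).
LLAMA2_TEMPLATE = "[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{query_str} [/INST]"


def wrap_query(query: str) -> str:
    """Fill the notebook's wrapper with a user query."""
    return QUERY_WRAPPER.format(query_str=query)


def wrap_llama2(system_prompt: str, query: str) -> str:
    """Fill the Llama 2 chat layout with a system prompt and a user query."""
    return LLAMA2_TEMPLATE.format(system_prompt=system_prompt.strip(), query_str=query)


print(wrap_query("about India in 1 sentence"))
print(wrap_llama2("You are a Q&A assistant.", "about India in 1 sentence"))
```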

In [6]:
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    # Uncomment this when running on CUDA to reduce memory usage
    # (8-bit loading requires the bitsandbytes package):
    # model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True}
)

Loading checkpoint shards: 100%|██████████| 2/2 [01:47<00:00, 53.75s/it] 
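
The commented-out `model_kwargs` trade precision for memory. As a rough back-of-the-envelope sketch, the weights-only footprint can be estimated as parameter count times bytes per parameter. The figure of 7 billion parameters is an assumption read off the "7b" in the model name, and the estimate ignores activations, the KV cache, and framework overhead.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 7e9  # assumption: "7b" taken as 7 billion parameters

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under these assumptions, half precision needs roughly 13 GiB for the weights and 8-bit loading roughly half of that.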


In [7]:

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.core import ServiceContext
from llama_index.legacy.embeddings.langchain import LangchainEmbedding

embed_model=LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))



In [8]:
service_context=ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model
)
     

service_context



ServiceContext(llm_predictor=LLMPredictor(system_prompt=None, query_wrapper_prompt=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>), prompt_helper=PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None, separator=' '), embed_model=LangchainEmbedding(model_name='sentence-transformers/all-mpnet-base-v2', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001566188D710>), transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000001566188D710>, id_func=<function default_id_func at 0x000001565733D6C0>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?')], llama_logger=<llama_index.core.service_context_elements.llama_logger.LlamaLogger object at 0x000001566A46F390>, callback_manager=<llama_index.c
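
The repr above reports `chunk_size=1024` and `chunk_overlap=200`. The idea of overlapping chunks can be sketched with a plain sliding window. This is a simplification: the `SentenceSplitter` shown in the output counts tokenizer tokens and prefers sentence boundaries, whereas the hypothetical `chunk_words` helper below just slices a list of words.

```python
def chunk_words(words: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split words into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_words(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a sentence split by the boundary can still be retrieved with some of its context.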

In [9]:
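from llama_index.core import Document

# Wrap each non-empty scraped paragraph in a LlamaIndex Document
documents1 = [Document(text=doc) for doc in documents if doc.strip()]

# Embed the documents and build an in-memory vector index
index = VectorStoreIndex.from_documents(documents1, service_context=service_context)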
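
Conceptually, a vector index stores one embedding per chunk and answers a query by ranking chunks by similarity to the query embedding. The sketch below shows that ranking step with cosine similarity on toy 3-dimensional vectors. The helper names and `k=2` are illustrative assumptions, the real embeddings from `all-mpnet-base-v2` have far more dimensions, and this is not how `VectorStoreIndex` is implemented internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for three document chunks
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_k([1.0, 0.0, 0.0], doc_vecs, k=2))
```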
     
     

### PROMPT ENGINE

In [10]:
query_engine = index.as_query_engine()

response = query_engine.query("about India in 1 sentence")



In [11]:
print(response)

India is a federal republic in South Asia, with a population of almost 1.4 billion people, a growing middle class, and a diverse culture influenced by its history, geography, and religion.
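
The answer above comes from a retrieve-then-generate flow: the most relevant chunks are fetched from the index, placed into a prompt together with the question, and passed to the LLM. A minimal sketch of the prompt-assembly step follows. `CONTEXT_TEMPLATE` and `build_prompt` are hypothetical names, and the template wording is an illustration, not the exact prompt the query engine uses.

```python
# Hypothetical template: retrieved text first, then the question.
CONTEXT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only this context, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(retrieved_chunks: list[str], query: str) -> str:
    """Join the retrieved chunks and insert them, with the query, into the template."""
    context = "\n\n".join(retrieved_chunks)
    return CONTEXT_TEMPLATE.format(context_str=context, query_str=query)


chunks = [
    "India is a country in South Asia.",
    "It is the seventh-largest country by area.",
]
print(build_prompt(chunks, "about India in 1 sentence"))
```

Because the model only sees the retrieved chunks, the quality of the answer depends on what the first 10 scraped paragraphs contain.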
