# LlamaWebChat
#### Query your web page data using LlamaIndex

Install all the necessary libraries.

In [None]:
!pip -q install llama-index
!pip -q install unstructured
!pip install "transformers[torch]" "huggingface_hub[inference]"
!pip install transformers
!pip install beautifulsoup4

This notebook explores the Gemini-Pro model from Google, the GPT-3.5 model from OpenAI, and open-source Hugging Face models. You only need one of them, together with its API key; there is no need to supply all three keys for your experiments.

Get the API keys here:

[Google API key](https://ai.google.dev/),
[OpenAI API key](https://openai.com/),
[Hugging Face token](https://huggingface.co/settings/tokens)

In [2]:
import os
os.environ["GOOGLE_API_KEY"]="Insert_your_Google_API_Key"
HF_TOKEN="Insert_your_HF_token"
os.environ["OPENAI_API_KEY"]="Insert_your_OpenAI_API_Key"
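Because only one provider is needed, a quick check of which keys have actually been filled in can save a confusing authentication error later. The sketch below is a hypothetical helper (`configured_providers` is not part of LlamaIndex) and assumes the placeholder strings in the cell above all begin with `Insert_your_`.

```python
# Placeholder prefix used in the cell above; treated as "not configured".
PLACEHOLDER_PREFIX = "Insert_your_"


def configured_providers(keys):
    """Return the names of providers whose key is set and not a placeholder."""
    return [
        name
        for name, value in keys.items()
        if value and not value.startswith(PLACEHOLDER_PREFIX)
    ]


# Example with made-up values; in the notebook you would pass
# os.environ.get("GOOGLE_API_KEY"), HF_TOKEN, and os.environ.get("OPENAI_API_KEY").
example_keys = {
    "gemini": "Insert_your_Google_API_Key",
    "huggingface": "hf_example_token",
    "openai": None,
}
print(configured_providers(example_keys))
```

Any provider missing from the returned list should not be selected as the LLM later in the notebook.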

### Import all the necessary libraries

In [3]:
from llama_index.llms import HuggingFaceInferenceAPI, HuggingFaceLLM, Gemini, OpenAI
from llama_index import VectorStoreIndex, download_loader, ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding
from IPython.display import Markdown, display
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

### Data Loading

There are various data loaders you can use to fetch web page data. Select the one that best fits your use case; code demonstrating each method is provided below.

#### SimpleWebPage Loader

In [17]:
# Uncomment this code if you want to use SimpleWebPage Reader
'''
SimpleWebPageReader = download_loader("SimpleWebPageReader")
SimpleWeb_loader = SimpleWebPageReader()
documents = SimpleWeb_loader.load_data(urls=['https://en.wikipedia.org/wiki/Taylor_Swift'])
'''

This is a basic way of loading data from the web. If the loaded data looks noisy or the results are poor, you can clean it first; some data-cleaning code is provided later in this notebook.

#### BeautifulSoupWebReader

In [4]:
# BeautifulSoupWebReader can also be used to load content from web pages, as an alternative to SimpleWebPageReader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
BeautifulSoup_loader = BeautifulSoupWebReader()
documents = BeautifulSoup_loader.load_data(urls=['https://en.wikipedia.org/wiki/Taylor_Swift'])

#### WikipediaReader

If you are only interested in Wikipedia pages, you can use WikipediaReader. It takes a page name rather than a URL, and it works very well on Wikipedia pages compared with the other loaders.

In [None]:
# !pip install wikipedia

In [None]:
# Uncomment this code to use WikipediaReader
'''
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()
documents = loader.load_data(pages=['india'])

'''

Alternatively, the next cell uses Beautiful Soup directly to extract the page text, clean it, and strip unwanted characters. You can adjust the HTML tags it selects to suit your web page and extract more relevant content.

In [None]:
# Uncomment if you need it
'''
import requests
from bs4 import BeautifulSoup
import re

def scrape_and_clean(url):
    response = requests.get(url)

    if response.status_code == 200:
        html_content = response.text
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
        return []

    # Parse HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find and extract relevant text data
    paragraphs = soup.find_all('p')
    extracted_text = [paragraph.get_text() for paragraph in paragraphs]

    # Clean each paragraph
    cleaned_paragraphs = [clean_text(para) for para in extracted_text]

    return cleaned_paragraphs

def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text

# Example usage:
url_to_scrape = 'https://en.wikipedia.org/wiki/Taylor_Swift'
cleaned_data = scrape_and_clean(url_to_scrape)

cleaned_data

'''
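To see what `clean_text` keeps and discards, here is a self-contained run on a few made-up strings. Note that the first regular expression removes every character that is not an ASCII letter, digit, or whitespace, so punctuation and accented letters disappear; loosen the pattern if your pages need them. The sample strings and the empty-paragraph filter are illustrative additions, not part of the cell above.

```python
import re


def clean_text(text):
    # Drop everything except ASCII letters, digits, and whitespace.
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into single spaces.
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text


sample_paragraphs = [
    "Fearless (2008) explored country pop.\n",
    "   ",
    "Beyoncé & Swift!",
]

# Clean each paragraph and skip the ones that end up empty.
cleaned = [clean_text(p) for p in sample_paragraphs]
cleaned = [p for p in cleaned if p]
print(cleaned)
```

To index the cleaned paragraphs, each string would still need to be wrapped in a LlamaIndex `Document` before being passed to `VectorStoreIndex.from_documents`.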

### Embedding Model

This notebook uses Hugging Face embeddings to avoid the API limits that come with OpenAI and Google embeddings. They work well, are sometimes faster than the other two, and offer a wide range of models to choose from.

In [None]:
embed_model_uae = HuggingFaceEmbedding(model_name="WhereIsAI/UAE-Large-V1")

You can use other embedding models as well. The embedding models leaderboard is here:
https://huggingface.co/spaces/mteb/leaderboard
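An embedding model turns each chunk of text into a vector, and retrieval works by comparing the query vector with the chunk vectors, commonly via cosine similarity. The sketch below uses tiny hand-written vectors rather than real model output, so the numbers are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real embedding models
# produce vectors with many more dimensions.
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk about albums": [0.9, 0.1, 0.8],
    "chunk about tours": [0.0, 1.0, 0.1],
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```

A better embedding model places related texts closer together in this vector space, which is why the choice of model affects retrieval quality.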

### LLM

The LLM receives the context retrieved from the index store, together with the query, and generates the final answer.

Use whichever LLM you prefer. The Hugging Face model is recommended here: in this notebook's experiments it gave better responses than Gemini and OpenAI.
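Conceptually, the query engine retrieves the most relevant chunks and places them in a prompt together with the question before calling the LLM. The following is a simplified sketch of that idea; `build_prompt` and its template are hypothetical and are not the prompt LlamaIndex actually uses.

```python
def build_prompt(context_chunks, question):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the index would return.
chunks = [
    "Fearless was released in 2008.",
    "Red was released in 2012.",
]
prompt = build_prompt(chunks, "When was Red released?")
print(prompt)
```

Whichever LLM is selected below only ever sees a string like this, so answer quality depends on both the retrieved chunks and the model.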

In [6]:
# Gemini
from llama_index.llms import Gemini

Gemini_llm=Gemini()

In [7]:
# Hugging Face model

from llama_index.llms import HuggingFaceInferenceAPI
hf_llm = HuggingFaceInferenceAPI(
    model_name="HuggingFaceH4/zephyr-7b-alpha", token=HF_TOKEN
)

In [8]:
# Open AI Model

from llama_index.llms import OpenAI
openAI_llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)

In [9]:
# Set the service context
# using hf_llm in here for demonstration

service_context = ServiceContext.from_defaults(llm=hf_llm, chunk_size=800, chunk_overlap=20,embed_model=embed_model_uae)
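The `chunk_size` and `chunk_overlap` arguments control how documents are split before embedding: each chunk holds up to `chunk_size` units, and consecutive chunks share `chunk_overlap` units so that text cut at a boundary still appears with some context. The sketch below illustrates the sliding-window idea on whitespace-separated words; it is a simplification, since LlamaIndex counts tokens and tries to respect sentence boundaries.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Larger chunks give the LLM more context per retrieved piece but make each embedding less specific; the values 800 and 20 used above are one reasonable starting point to tune from.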

In [10]:
# Indexing the data

index = VectorStoreIndex.from_documents(documents,service_context=service_context,show_progress=True)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/101 [00:00<?, ?it/s]

In [11]:
# save the storage context

index.storage_context.persist()
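Once the index has been persisted, a later session can reload it instead of re-embedding every chunk. The helper below is a minimal sketch, assuming the same pre-0.10 `llama_index` import layout used in this notebook and the default `./storage` directory that `persist()` writes to when no path is given; `load_persisted_index` is a hypothetical name.

```python
import os


def load_persisted_index(service_context, persist_dir="./storage"):
    """Reload a previously persisted index, or return None if nothing is saved."""
    if not os.path.isdir(persist_dir):
        return None
    # Imported lazily so the helper can be defined before llama-index is installed.
    from llama_index import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    # Pass the same service context so the same embedding model and LLM are used.
    return load_index_from_storage(storage_context, service_context=service_context)
```

Calling `load_persisted_index(service_context)` in a fresh session would return the saved index, or `None` if the directory does not exist yet, in which case the index has to be built from the documents again.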

In [12]:
# query_engine for retrieval
query_engine = index.as_query_engine()

response = query_engine.query("who is Taylor swift")

display(Markdown(f"{response}"))



Taylor Swift is an American singer-songwriter, producer, director, businesswoman, and actress. She began professional songwriting at age 14 and signed with Big Machine Records in 2005 to become a country singer. She released six studio albums under the label, four of them to country radio, starting with Taylor Swift (2006). Her next, Fearless (2008), explored country pop, and its singles "Love Story" and "You Belong with Me" catapulted her to mainstream fame. Speak Now (2010) infused rock influences, while Red (2012) experimented with electronic elements and featured Swift's first Billboard Hot 100 number-one song, "We Are Never Ever Getting Back Together". She departed from her country image with 1989 (2014), a synth-pop album supported by the chart-topping songs "Shake It Off", "Blank Space", and "Bad Blood". Media scrutiny inspired the hip-hop-flavored Reputation (2017) and its number-one single "Look What You

### Using Chroma DB instead of the default vector store

In [13]:
!pip -q install chromadb


In [14]:
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

In [15]:
import chromadb

In [16]:
# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
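`EphemeralClient` keeps the collection in memory, so the embeddings are lost when the kernel restarts, and `create_collection` raises an error if the cell is run twice in the same session. A possible alternative, sketched below, uses Chroma's `PersistentClient` with `get_or_create_collection`; the helper name and the `./chroma_db` path are assumptions for illustration.

```python
def get_chroma_collection(name="quickstart", path="./chroma_db"):
    """Open (or create) an on-disk Chroma collection."""
    import chromadb  # imported lazily; requires `pip install chromadb`

    client = chromadb.PersistentClient(path=path)
    # Unlike create_collection, this does not fail if the collection already exists.
    return client.get_or_create_collection(name)
```

The returned collection can be passed to `ChromaVectorStore(chroma_collection=...)` in the same way as the in-memory one.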

In [18]:
# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=hf_llm, chunk_size=800, chunk_overlap=20,embed_model=embed_model_uae)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

In [19]:
# Query Data
query_engine = index.as_query_engine()

In [20]:
response = query_engine.query("what are taylor swift albums?")
display(Markdown(f"<b>{response}</b>"))

<b>

Taylor Swift is an American singer-songwriter who has released several albums throughout her career. Some of her notable albums include:

1. Taylor Swift (2006)
2. Fearless (2008)
3. Speak Now (2010)
4. Red (2012)
5. 1989 (2014)
6. Reputation (2017)
7. Lover (2019)
8. Folklore (2020)
9. Evermore (2020)

Each of these albums has its unique style and genre, ranging from country to pop, rock, and indie.</b>