# Personal Generative AI with LLM Learning

## 1 Dependencies

- langchain
- huggingface_hub
- sentence_transformers
- faiss-cpu
- unstructured
- youtube_transcript_api

## 2 Environment Setup

In [39]:
import os
import requests
from langchain.document_loaders import TextLoader, UnstructuredPDFLoader, ArxivLoader, SeleniumURLLoader, OnlinePDFLoader
from youtube_transcript_api import YouTubeTranscriptApi
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

import textwrap


os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [4]:
with open('hf_api.txt') as f:
    # Read the first line and strip the trailing newline so the token is valid
    hf_key = f.readline().strip()
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_key

In [17]:

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text
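
As a quick check of the helper above, the sketch below repeats the function so that it runs on its own and wraps a short sample string at a width of 10 characters. The sample text and the narrow width are assumptions chosen only to make the wrapping visible.

```python
import textwrap


def wrap_text_preserve_newlines(text, width=110):
    # Wrap each line on its own so existing newlines (including blank lines) survive
    lines = text.split('\n')
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    return '\n'.join(wrapped_lines)


# Hypothetical sample text, not taken from any of the documents used later
sample = "alpha beta gamma\n\ndelta"
wrapped = wrap_text_preserve_newlines(sample, width=10)
print(wrapped)
```

The blank line between the two paragraphs is kept, which a single call to `textwrap.fill` on the whole string would not do.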

## 3 Loading of Different Types of Document

### 3.1 Loading Local Text File

In [6]:
def loadlocaltxtfile(txtfile):

    # Load the text document using TextLoader
    loader = TextLoader('./'+txtfile)
    loaded_docs = loader.load()
    return loaded_docs

In [None]:
"local_text_file.txt"

### 3.2 Loading Text from URL

In [7]:
def loadtxtfromURL(text_url, output_txt_name):
    # Fetch the text file and fail early on an HTTP error
    response = requests.get(text_url)
    response.raise_for_status()
    with open(output_txt_name, "w", encoding='utf-8') as file:
        file.write(response.text)

    # Load the text document using TextLoader
    loader = TextLoader('./'+output_txt_name)
    loaded_docs = loader.load()
    return loaded_docs

In [None]:
"https://raw.githubusercontent.com/vashAI/AnsweringQuestionsWithHuggingFaceAndLLM/main/url_text_file.txt"

### 3.3 Loading Local PDF

In [8]:
def loadlocalPDF(pdf_file):
    loader = UnstructuredPDFLoader(pdf_file)
    loaded_docs = loader.load()
    return loaded_docs

In [None]:
"Eurovision_Song_Contest_2023.pdf"

### 3.4 Loading Text from Website

In [26]:
def loadwebsitetext(url):
    # loader = UnstructuredURLLoader(urls=[url])
    loader = SeleniumURLLoader(urls=[url])
    loaded_docs = loader.load()
    return loaded_docs

In [None]:
"https://saturncloud.io/blog/breaking-the-data-barrier-how-zero-shot-one-shot-and-few-shot-learning-are-transforming-machine-learning/"


### 3.5 Reading Text from YouTube Video

In [10]:
def loadyoutubetext(youtube_video_id="eg9qDjws_bU"):
    transcript = YouTubeTranscriptApi.get_transcript(youtube_video_id)

    # Join the caption snippets into a single space-separated string
    transcript_text = ' '.join(entry['text'] for entry in transcript)

    youtube_txt_file = "youtube_transcript.txt"
    with open('./'+youtube_txt_file, "w", encoding='utf-8') as file:
        file.write(transcript_text)

    # Load the text document using TextLoader
    loader = TextLoader('./'+youtube_txt_file)
    loaded_docs = loader.load()
    return loaded_docs
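
With the `get_transcript` call used above, the transcript arrives as a list of dictionaries, each holding one caption snippet under the `text` key next to timing fields. The sketch below flattens such a list without calling the API. The `sample_transcript` entries are invented for illustration and `join_transcript` is a hypothetical helper name.

```python
def join_transcript(entries):
    # Keep only the caption text and join the snippets with single spaces
    return " ".join(entry["text"].strip() for entry in entries)


# Invented entries shaped like the transcript output (text, start, duration)
sample_transcript = [
    {"text": "welcome to the video", "start": 0.0, "duration": 2.5},
    {"text": "today we talk about LLMs", "start": 2.5, "duration": 3.0},
]

transcript_text = join_transcript(sample_transcript)
print(transcript_text)
```

The timing fields are dropped here because the question-answering pipeline only needs the spoken text.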

### 3.6 Loading arXiv Paper

In [36]:
def loadtextfromArxiv(query):
    loaded_docs = ArxivLoader(query=query, load_max_docs=5).load()
    return loaded_docs

### 3.7 Loading Online PDF

In [40]:
def loadonlinePDF(pdf_url):
    loaded_docs = OnlinePDFLoader(pdf_url).load()
    return loaded_docs

## 4 Preprocessing

### 4.1 Splitting Documents into Chunks

In [11]:
def documentsplitter(loaded_docs):
    # Splitting documents into chunks
    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
    chunked_docs = splitter.split_documents(loaded_docs)
    return chunked_docs
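
`CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default) and then merges the pieces up to `chunk_size` characters. A piece that contains no separator is not cut any further, which is why the splitter can report chunks longer than the requested size. The following is a simplified sketch of that behaviour, not LangChain's implementation: `split_on_separator` is a hypothetical name, the sample text is made up, and overlap handling is left out.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks (no overlap)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a lone oversized piece
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up text: two short paragraphs and one long paragraph without blank lines
text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 150
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])
```

The two short paragraphs are merged into one chunk, while the 150-character paragraph stays whole even though it exceeds the limit of 100.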

### 4.2 Convert Documents to Embeddings

In [12]:
def createEmbeddings(chunked_docs):
    # Create embeddings and store them in a FAISS vector store
    embedder = HuggingFaceEmbeddings()
    vector_store = FAISS.from_documents(chunked_docs, embedder)
    return vector_store
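
A vector store answers a query by comparing the query's embedding with the stored chunk embeddings and returning the closest ones. The toy sketch below illustrates the idea with hand-written 3-dimensional vectors and squared Euclidean distance as one common choice of metric. The vectors, chunk labels, and the `nearest_chunks` helper are all invented for illustration; real embeddings have hundreds of dimensions and come from the embedding model.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vec, store, k=2):
    # Rank stored (label, vector) pairs by distance to the query, closest first
    ranked = sorted(store, key=lambda item: squared_l2(query_vec, item[1]))
    return [label for label, _ in ranked[:k]]


# Invented toy "embeddings"
store = [
    ("chunk about Eurovision", [0.9, 0.1, 0.0]),
    ("chunk about zero-shot learning", [0.1, 0.9, 0.1]),
    ("chunk about radio sources", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.8, 0.0]  # pretend embedding of a question on zero-shot learning

top = nearest_chunks(query_vec, store, k=2)
print(top)
```

A dedicated index such as FAISS does the same kind of ranking, but with data structures that stay fast for large collections.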

## 5 Building the Model

### 5.1 Use Embeddings to Feed the LLM and Answer Questions

In [13]:
def loadLLMModel():
    llm=HuggingFaceHub(repo_id="declare-lab/flan-alpaca-large", model_kwargs={"temperature":0.0, "max_length":512})
    chain = load_qa_chain(llm, chain_type="stuff")
    return chain

def askQuestions(vector_store, chain, question):
    # Ask a question using the QA chain
    similar_docs = vector_store.similarity_search(question)
    response = chain.run(input_documents=similar_docs, question=question)
    return response
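
With `chain_type="stuff"`, the retrieved chunks are all placed ("stuffed") into a single prompt together with the question, so the combined text has to fit in the model's context window. The sketch below shows that idea with plain strings. The prompt wording and the `build_stuff_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    # Concatenate every retrieved chunk into one context block
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the results of similarity_search
retrieved = [
    "Zero-shot learning handles classes that were not seen during training.",
    "Few-shot learning uses a handful of labeled examples per class.",
]
prompt = build_stuff_prompt(retrieved, "What is zero-shot learning?")
print(prompt)
```

Because every chunk ends up in the same prompt, the chunk size chosen in section 4.1 and the number of retrieved chunks together determine how long the prompt becomes.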

In [14]:
chain = loadLLMModel()

### 5.2 Test with Local File

In [15]:
local_loaded_docs = loadlocaltxtfile('local_text_file.txt')
local_chunked_docs = documentsplitter(local_loaded_docs)
local_vector_store = createEmbeddings(local_chunked_docs)

In [18]:
local_response = askQuestions(local_vector_store, chain, "Explain me how ChatGPT and Plugin are empowering Citizen Data Scientists?")
print(wrap_text_preserve_newlines(local_response))
# print(LOCAL_response)

ChatGPT and plugins are helping Citizen Data Scientists by providing them with the tools they need to analyze
and interpret data. By enabling them to use natural language, they are able to ask questions and get answers
in plain English, without knowing complex programming languages or statistical techniques. Additionally,
ChatGPT is a personal expert who is always available to help them turn their idea into reality.


### 5.3 Test with File from URL

In [19]:
url_loaded_docs = loadtxtfromURL("https://raw.githubusercontent.com/vashAI/AnsweringQuestionsWithHuggingFaceAndLLM/main/url_text_file.txt",
                                  "urltext.txt")
url_chunked_docs = documentsplitter(url_loaded_docs)
url_vector_store = createEmbeddings(url_chunked_docs)

In [20]:
url_response = askQuestions(url_vector_store, chain, "What are 5 examples of chatgpt and plugin applications?")
print(wrap_text_preserve_newlines(url_response))

1. Data visualization and analysis using ChatGPT and Plugins 2. Content creation and summarization 3.
Personalized learning and skill development 4. Collaboration and knowledge sharing


### 5.4 Test with local PDF

In [21]:
pdf_loaded_docs = loadlocalPDF(pdf_file="Eurovision_Song_Contest_2023.pdf")
pdf_chunked_docs = documentsplitter(pdf_loaded_docs)
pdf_vector_store = createEmbeddings(pdf_chunked_docs)

Created a chunk of size 1019, which is longer than the specified 1000
Created a chunk of size 1316, which is longer than the specified 1000
Created a chunk of size 1425, which is longer than the specified 1000
Created a chunk of size 1352, which is longer than the specified 1000


In [22]:
pdf_response = askQuestions(pdf_vector_store, chain, "Why is it that the 2023 Eurovision Songcontest didn't hold in Ukraine?")
print(wrap_text_preserve_newlines(pdf_response))

The 2023 Eurovision Songcontest didn't hold in Ukraine due to security concerns caused by the Russian invasion
of Ukraine.


### 5.5 Test with Website

In [31]:
web_url = "https://deepchecks.com/glossary/zero-shot-learning/"
web_loaded_docs = loadwebsitetext(web_url)
web_chunked_docs = documentsplitter(web_loaded_docs)
web_vector_store = createEmbeddings(web_chunked_docs)

In [32]:
web_response = askQuestions(web_vector_store, chain, "What is Zero-shot learning?")
print(wrap_text_preserve_newlines(web_response))

Zero-shot learning is a machine learning approach that uses a labeled training set of seen classes and unseen
classes to build models for classes that have not yet been labeled for training. It transfers information from
source classes to labeled samples using class properties as a part of information.


### 5.6 Test with text from video

In [33]:
vid_loaded_docs = loadyoutubetext(youtube_video_id="DIU48QL5Cyk")
vid_chunked_docs = documentsplitter(vid_loaded_docs)
vid_vector_store = createEmbeddings(vid_chunked_docs)

In [34]:
vid_response = askQuestions(vid_vector_store, chain, "What are the trending LLMs?")
print(wrap_text_preserve_newlines(vid_response))

The trending LLMs are chat GPT, Google's Bard AI, and Adobe's AI art generator.


### 5.7 Test with arXiv Paper

In [37]:
Arxiv_loaded_docs = loadtextfromArxiv(query="2104.12520")
Arxiv_chunked_docs = documentsplitter(Arxiv_loaded_docs)
Arxiv_vector_store = createEmbeddings(Arxiv_chunked_docs)

In [38]:
Arxiv_response = askQuestions(Arxiv_vector_store, chain, "What are the common radio sources?")
print(wrap_text_preserve_newlines(Arxiv_response))

The common radio sources are OB stars, Be stars, flares from M dwarfs, and Ultra Compact HII regions.


### 5.8 Test with Online PDF

In [41]:
opdf_loaded_docs = loadonlinePDF(pdf_url="https://arxiv.org/pdf/2104.12520.pdf")
opdf_chunked_docs = documentsplitter(opdf_loaded_docs)
opdf_vector_store = createEmbeddings(opdf_chunked_docs)

Created a chunk of size 1152, which is longer than the specified 1000
Created a chunk of size 1110, which is longer than the specified 1000
Created a chunk of size 2650, which is longer than the specified 1000
Created a chunk of size 1125, which is longer than the specified 1000
Created a chunk of size 1058, which is longer than the specified 1000
Created a chunk of size 1009, which is longer than the specified 1000
Created a chunk of size 1169, which is longer than the specified 1000


In [42]:
opdf_response = askQuestions(opdf_vector_store, chain, "Summarize the paper?")

print(wrap_text_preserve_newlines(str(opdf_response)))

This paper provides a comprehensive analysis of the evolution of stars in a galaxy. It uses the Besançon model
to predict the distribution of stars at each location in the galaxy, based on the thin disk, the thick disk,
the halo, and the bulge. The model is gridded to a distance step size specified by the user and produces a
table of stars with their parameter bases on the input selection provided by the user.
