
In [29]:
%pip install --quiet --upgrade langchain langchain-core langchain-text-splitters langchain-community langgraph faiss-cpu sentence-transformers google-generativeai beautifulsoup4 pandas requests pypdf langchain-google-vertexai google-cloud-aiplatform


In [30]:
from langchain_community.document_loaders import PyPDFLoader
import os

pdf_directory = "/content/drive/MyDrive/RCDH/"
pdf_docs = []

for filename in os.listdir(pdf_directory):
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(os.path.join(pdf_directory, filename))
        pdf_docs.extend(loader.load())




In [31]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only the article body (the element with id="bodyContent") instead of the full HTML.
bs4_strainer = bs4.SoupStrainer(id="bodyContent")
loader = WebBaseLoader(
    web_paths=("https://en.wikipedia.org/wiki/Saudi_Authority_for_Data_and_Artificial_Intelligence",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load() # 1 doc (1 page)

assert len(docs) == 1
print(f"Total characters: {len(docs[0].page_content)}")

Total characters: 2442


In [92]:
import os

import pandas as pd
import requests
from langchain_core.documents import Document

url = "https://bcg.trial.opendatasoft.com/api/explore/v2.1/catalog/datasets/saudi-aircraft-traffic-passenger-and-cargo-by-domestic-airports/exports/csv?lang=en&timezone=Africa%2FLagos&use_labels=true&delimiter=%3B"

# Session credentials are read from the environment rather than hardcoded.
# ODS_CSRF_TOKEN and ODS_SESSION_ID are placeholder variable names.
headers = {
    "X-CSRFToken": os.environ["ODS_CSRF_TOKEN"],
    "User-Agent": "Mozilla/5.0",
}
cookies = {"sessionid": os.environ["ODS_SESSION_ID"]}

# Download the CSV export; stop on an HTTP error instead of reading a stale or missing file.
response = requests.get(url, headers=headers, cookies=cookies, timeout=30)
response.raise_for_status()

csv_path = "/content/drive/MyDrive/RCDH/data.csv"
with open(csv_path, "wb") as f:
    f.write(response.content)

df = pd.read_csv(csv_path, delimiter=";")
print("CSV loaded into DataFrame.")

# Convert each DataFrame row into a LangChain `Document` with one "column: value" line per field
csv_docs = []
for _, row in df.iterrows():
    content = "\n".join(f"{col}: {val}" for col, val in row.items())
    csv_docs.append(Document(page_content=content))


CSV loaded into DataFrame.
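
The row-to-text conversion can be tried without network access or Google Drive. The following is a minimal sketch using only the standard library, assuming an inline sample in the same semicolon-delimited layout. The two data rows reuse values printed elsewhere in this notebook, and `rows_to_texts` is a hypothetical helper, not part of LangChain or pandas.

```python
import csv
import io

# Inline sample in the same semicolon-delimited layout as the downloaded file.
SAMPLE_CSV = (
    "Year;Arr/Dep;Domestic Airport;Aircraft Traffic;Lat, Long\n"
    "2013;Arr. Cargo (kgs);Riyadh;117251302.0;24.9427121, 46.7123544\n"
    "2016;Arr. Cargo (kgs);Riyadh;93127333.0;24.9427121, 46.7123544\n"
)


def rows_to_texts(csv_text: str, delimiter: str = ";") -> list[str]:
    """Render each CSV row as 'column: value' lines, one text per row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return ["\n".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


texts = rows_to_texts(SAMPLE_CSV)
print(texts[0])
```

Each resulting string could then be wrapped in a `Document`, as the cell above does for DataFrame rows.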


In [95]:
print(csv_docs[0].page_content[:1000])

Year: 2013
Arr/Dep: Arr. Cargo (kgs)
Domestic Airport: Riyadh
Aircraft Traffic : 117251302.0
Lat, Long: 24.9427121, 46.7123544


In [94]:
all_docs = docs + pdf_docs + csv_docs
print(f"Total documents: {len(all_docs)}")
print(len(docs))
print(len(pdf_docs))
print(len(csv_docs))


Total documents: 57
1
20
36


In [96]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""],
    add_start_index=True,
)
all_splits = text_splitter.split_documents(all_docs)
print(f"Split into {len(all_splits)} chunks")

Split into 94 chunks
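
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch using a fixed-width sliding window. It is not the separator-aware algorithm that `RecursiveCharacterTextSplitter` uses, and `sliding_chunks` is a hypothetical helper written only for illustration. It does show why consecutive chunks share text and how a start index can be recorded for each chunk.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs using a fixed-width sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2500-character text, split with the same settings as the cell above.
sample = "".join(str(i % 10) for i in range(2500))
for start, chunk in sliding_chunks(sample, chunk_size=1000, chunk_overlap=100):
    print(start, len(chunk))
```

With these settings the window advances 900 characters at a time, so the last 100 characters of one chunk reappear at the start of the next.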


In [100]:
print(all_splits[91])

page_content='Year: 2016
Arr/Dep: Arr. Cargo (kgs)
Domestic Airport: Riyadh
Aircraft Traffic : 93127333.0
Lat, Long: 24.9427121, 46.7123544' metadata={'start_index': 0}


In [36]:
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/gen-lang-client-0716959645-1c6625495b1d.json"

In [37]:
# !cat {os.getenv("GOOGLE_APPLICATION_CREDENTIALS")}

In [38]:
# from langchain_google_vertexai import VertexAIEmbeddings
# from langchain.vectorstores import FAISS
# from langchain.docstore.document import Document
# from langchain_core.vectorstores import InMemoryVectorStore


# # os.environ["GOOGLE_CLOUD_REGION"] = "me-central2"

# # Initialize Vertex AI Embeddings
# embedding_model = VertexAIEmbeddings(
#     model_name="textembedding-gecko@001",  # Required!
# )

# # Initialize Vertex AI Embeddings
# embedding_model = VertexAIEmbeddings()

# # Sample documents
# documents = [
#     Document(page_content="LangChain is a framework for building LLM-powered apps."),
#     Document(page_content="Google Vertex AI provides powerful models like Gemini."),
# ]

# # Initialize with an embedding model
# vector_store = InMemoryVectorStore(embedding=embedding_model)


In [101]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize a local embedding model (runs through sentence-transformers)
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Index the chunks; from_documents keeps each chunk's metadata alongside its text
vector_store = FAISS.from_documents(all_splits, embedding_model)
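
Conceptually, a similarity search embeds the query and returns the stored texts whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over tiny hand-made vectors. The three-dimensional vectors and the `top_k` helper are invented for illustration: real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the FAISS index may rank by a different metric, such as Euclidean distance.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_index = [
    ("airport cargo statistics", [0.9, 0.1, 0.0]),
    ("personal data protection law", [0.0, 0.2, 0.9]),
    ("passenger traffic by airport", [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], toy_index, k=2))
```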


In [102]:
def make_rag_prompt(query, relevant_passage):
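    # Accept either one string or a list of passages; joining a plain string
    # would insert a space between every character.
    if not isinstance(relevant_passage, str):
        relevant_passage = ' '.join(relevant_passage)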
    return (
        "You are a helpful chatbot on a Saudi government website, answering inquiries from Saudi Arabian citizens. "
        "Your responses should be based **only on the retrieved reference content** provided below.\n\n"
        "Instructions:\n"
        "1. The question may be in Arabic, colloquial Arabic, or English — adapt accordingly.\n"
        "2. Some terminology may be loosely translated (e.g., 'نظام حماية البيانات الشخصية' might appear as 'personal data protection system'). "
        "Use your judgment to interpret such terms accurately if they are clearly related.\n"
        "3. Do not guess or add information not found in the provided content.\n"
        "4. If the context doesn't contain relevant information, respond politely and state that the answer is not available.\n"
        "5. Use a clear, friendly, and easy-to-understand tone suitable for a general audience.\n\n"
        f"QUESTION: '{query}'\n"
        f"CONTEXT: '{relevant_passage}'\n\n"
        "ANSWER:"
    )


In [103]:
def make_translation_prompt(query):
    translation_prompt = (
        "You will be given a question that may be in Arabic, Saudi dialect, or English.\n"
        "Your task is to return a clear and grammatically correct version of the question in English, suitable for use in a search engine.\n"
        "If the question is in Arabic or dialect, translate it to English.\n"
        "If it's already in English, fix any grammar, typos, or unclear phrases but keep it in English.\n\n"
        f"QUESTION:\n{query}\n\n"
        "REWRITE IN ENGLISH:"
    )
    return translation_prompt


In [43]:
# import google.generativeai as genai

# genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
# def generate_response(user_prompt):
#     model = genai.GenerativeModel('gemini-2.0-flash-lite')
#     answer = model.generate_content(user_prompt)
#     return answer.text

In [104]:
import os

import google.generativeai as genai

def init_model():
    # Read the API key from the environment instead of hardcoding it in the notebook.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    return genai.GenerativeModel('gemini-1.5-flash-002')

model = init_model()
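
Keeping the API key out of the notebook source avoids leaking it when the notebook is shared. Below is a minimal sketch, assuming the key is exported in an environment variable. The variable name `GOOGLE_API_KEY` and the helper `load_api_key` are illustrative choices, not something the SDK requires.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in env_var, failing early with a readable message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the model.")
    return key
```

The returned value could be passed to `genai.configure(api_key=load_api_key())`, so a missing key fails with a clear message before any request is made.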



In [74]:
def retrieve(query: str):
    # Normalize the question to English before searching the vector store
    translation_prompt = make_translation_prompt(query)
    english_query = model.generate_content(translation_prompt).text.strip()
    return vector_store.similarity_search(english_query)


In [75]:
def get_embedding(text: str) -> list[float]:
    return embedding_model.embed_query(text)


In [76]:
def generate_response(user_prompt):
    answer = model.generate_content(user_prompt)
    return answer.text

In [106]:
def generate_answer(query):
    # Retrieve supporting chunks, then ask the model to answer from them only
    relevant_docs = retrieve(query)
    passages = [doc.page_content for doc in relevant_docs]
    prompt = make_rag_prompt(query, relevant_passage=passages)
    return generate_response(prompt)

answer = generate_answer("ين هي الهيئة السعودية للبيانات والذكاء الاصطناعي")
print(answer)


The Saudi Data and AI Authority (SDAIA) is a government agency in Saudi Arabia established on August 30, 2019, by a royal decree.  It oversees digital platforms such as Nafath (Unified National Access) and Tawakkalna.  The agency is directly linked to the Prime Minister and governed by a board of directors chaired by the Deputy Prime Minister.  It also has three other bodies linked to it: The National Center for Artificial Intelligence, The National Data Management Office, and the National Information Center.
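
The flow behind this answer (normalize the question, retrieve chunks, build a prompt, generate) can be exercised offline by passing each step in as a callable. The sketch below makes that assumption explicit: `answer_with_rag` and the three `fake_*` stubs are hypothetical stand-ins for the translation call, the vector store, and the Gemini model, and the prompt is shortened for readability.

```python
from typing import Callable


def answer_with_rag(
    query: str,
    translate: Callable[[str], str],
    search: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Run the translate -> retrieve -> prompt -> generate steps in order."""
    english_query = translate(query)      # normalize the question for retrieval
    passages = search(english_query)      # fetch the most similar chunks
    context = "\n\n".join(passages)
    prompt = f"QUESTION: {query}\nCONTEXT: {context}\nANSWER:"
    return generate(prompt)


# Stubs that replace the network-backed steps so the flow runs anywhere.
def fake_translate(question: str) -> str:
    return question.lower()


def fake_search(english_query: str) -> list[str]:
    return [f"passage about {english_query}"]


def fake_generate(prompt: str) -> str:
    return "stub answer for: " + prompt.splitlines()[0]


print(answer_with_rag("What is SDAIA?", fake_translate, fake_search, fake_generate))
```

Injecting the steps this way keeps the orchestration testable without an API key or an index. Note that the original question, not the normalized one, goes into the final prompt, matching `generate_answer` above.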



In [46]:
answer = generate_answer("وش مزوّدي الخدمة الموجودين في السعودية؟")
print(answer)

Based on the provided text, Alibaba Cloud, Google Cloud (in Dammam), and Oracle (in Riyadh/ Jeddah) are service providers with a presence in Saudi Arabia.



In [47]:
answer = generate_answer("وش اسم نموذج علي بابا؟")
print(answer)

The Alibaba model mentioned in the text is called Qwen.



In [48]:
answer = generate_answer("وش أكبر النماذج المتوفرة؟ وإذا تقدر، عطِني أرقام.")
print(answer)


The passage mentions several large language models, including Google's Gemini, OpenAI's ChatGPT, Oracle's AI, and Alibaba's Qwen.  However, it doesn't provide specific numerical details about their sizes.



In [49]:
answer = generate_answer("وش نقاط القوة والضعف في كل نموذج؟")
print(answer)


I'm sorry, but this document focuses on the deployment of AI models in Saudi Arabia and doesn't contain information about the strengths and weaknesses of different AI models.  Therefore, I cannot answer your question using the provided text.



In [89]:
answer = generate_answer("كيف قاعدين يستخدمون نماذج الذكاء الاصطناعي التوليدي في السعودية؟")
print(answer)

مرحباً بك!  يستخدم نموذج جوجل Gemini في السعودية من خلال شراكة بين صندوق الاستثمارات العامة (PIF) و جوجل كلاود.  هذا المركز  يُدمج نماذج جوجل Gemini مع بيانات عربية لخلق قدرات ذكاء اصطناعي باللغة العربية،  مُصممة خصيصاً للتطبيقات السعودية.  كما تم إطلاق نموذج Alibaba's Qwen 2 في السعودية، ويركز على تقديم أداء ذكاء اصطناعي متطور مُصمم للشركات المحلية ودعم اللغة العربية.  أيضاً، تستخدم منصة Zoom  بنية تحتية سحابية من Oracle في المملكة لتشغيل مساعدين ذكاء اصطناعي محلياً، لضمان بقاء البيانات داخل المملكة والامتثال للوائح السعودية.



In [81]:
answer = generate_answer("could you summarize the first article of the Personal Data Protection Law")
print(answer)

For the purpose of implementing this Law, the following terms shall have the meanings assigned thereto, unless the context requires otherwise:  The definitions include the Law itself (Personal Data Protection Law), Regulations (Implementing Regulations of the Law), Competent Authority (to be determined by a resolution of the Council of Ministers), Personal Data (any data that may lead to identifying an individual, including name, personal identification number, addresses, etc.), and Processing (any operation carried out on Personal Data by any means, whether manual or automated).



In [87]:
answer = generate_answer("what is the first مادة من نظام حماية البيانات الشخصية؟")
print(answer)

مرحباً بك!  المادة الأولى من نظام حماية البيانات الشخصية تُعرّف المصطلحات المستخدمة في القانون، مثل "القانون" نفسه، و"اللوائح"، و"الجهة المختصة"، و"البيانات الشخصية"، و"المعالجة".



In [88]:
answer = generate_answer("what is the punishment for data violation")
print(answer)

Based on the provided text, publishing sensitive data with the intent to harm or for personal benefit is punishable by imprisonment not exceeding two years, a fine not exceeding three million Riyals, or both.  The Public Prosecution is responsible for investigating and prosecuting these violations.  Further details regarding other data violations and their punishments are not available in this document.



In [107]:
answer = generate_answer("give me some informations about domestic airports")
print(answer)

Based on the provided data, I can tell you about the number of departing passengers from Riyadh's domestic airport in several years.  In 2008, there were 39,046,870 departing passengers; in 2009, 41,848,150; in 2014, 65,644,640; and in 2016, 70,341,350.  The latitude and longitude coordinates for Riyadh's domestic airport are 24.9427121, 46.7123544.  I do not have information about other domestic airports.

