Assignment Test: Chatbot Development from Website Data. The data is at https://www.presight.io/privacy-policy.html
<div style="margin-left: 20px;">

**1. Data Access and Indexing:** <br>
Begin by crawling the specified website to gather the relevant data, then preprocess and structure it into a searchable index that is ready for query retrieval.

</div>

<div style="margin-left: 20px;">

**2. Chatbot Development:** <br>
Develop a chatbot that uses natural language processing to understand user questions, searches the indexed data from Part 2.1 for the best match, and returns the most accurate response drawn from the website's information.

</div>

### **Crawling data**

In [66]:
import importlib
import crawling_data
importlib.reload(crawling_data)
from crawling_data import crawl_page

url = 'https://www.presight.io/privacy-policy.html'

crawl_page(url)

Data has been saved to structured_data.json
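The `crawling_data` module is not shown in this notebook. As a rough idea of how page markup could be turned into the `sections` layout that is loaded later, here is a minimal sketch using only the standard library. The class name `SectionParser`, the tag mapping (`h2` for section titles, `p` for content, `li` for items), and the sample HTML are all assumptions for illustration, not the actual `crawl_page` implementation.

```python
import json
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Hypothetical parser: groups <p>/<li> text under the preceding <h2>."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._tag = None     # tag currently being collected (h2, p or li)
        self._buffer = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p", "li"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse newlines and repeated spaces into single spaces
        text = " ".join("".join(self._buffer).split())
        self._tag = None
        if not text:
            return
        if tag == "h2":
            self.sections.append({"section_title": [text], "content": [],
                                  "subsections": [], "items": []})
        elif self.sections:
            key = "content" if tag == "p" else "items"
            self.sections[-1][key].append(text)


def parse_sections(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {"sections": parser.sections}


# Made-up sample markup, not taken from the real page
sample_html = """
<h2>Cookies</h2>
<p>We use cookies to
   improve the site.</p>
<ul><li>Session cookies</li><li>Preference cookies</li></ul>
"""
structured = parse_sections(sample_html)
print(json.dumps(structured, indent=1))
```

A real crawler would also need to fetch the page and handle subsections, which this sketch leaves out.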


### **Building Chatbot**

#### Preparation

In [67]:
import json
import logging
import torch
import re
from sentence_transformers import SentenceTransformer, util
import os
import random

In [68]:
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
device = "cuda" if torch.cuda.is_available() else "cpu"
sbert_model = sbert_model.to(device)

In [69]:
# Reload the environment
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv(dotenv_path="environment.env")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
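`os.getenv` returns `None` when a variable is missing, so a misnamed or absent `environment.env` may only surface later, when the first API call is made. A minimal sketch of an early check follows; the helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and used only so the example runs without a real key.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check environment.env")
    return value


# Demonstration with a dummy variable, so no real credentials are involved
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the cell above, the same idea would apply to `GOOGLE_API_KEY` before calling `genai.configure`.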

#### **Part 2.1** Embedding data

In [70]:
logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')

def load_json_from_file(file_path):
    """
    Load JSON data from a file. Returns None if the file cannot be read or parsed.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:  # Read explicitly as UTF-8
            data = json.load(f)
            return data
    except FileNotFoundError:
        logging.error(f"File not found at '{file_path}'")
        return None
    except json.JSONDecodeError:
        logging.error(f"Invalid JSON format in file '{file_path}'")
        return None
    except UnicodeDecodeError as e:  # More specific exception
        logging.error(f"UnicodeDecodeError occurred: {e} in file '{file_path}'")
        return None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None

In [71]:
file_path = "structured_data.json"
loaded_data = load_json_from_file(file_path)

# if loaded_data:
#     print("Successfully loaded JSON data from file:")
#     print(json.dumps(loaded_data, indent=1))
# else:
#     print("Failed to load JSON data")
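Because `load_json_from_file` returns `None` on failure, it can help to check the shape of the result before preprocessing. The sketch below assumes the layout visible in the printed output of this notebook (a top-level `sections` list whose entries carry `section_title`, `content`, `subsections`, and `items` lists); the helper name `check_sections` is hypothetical.

```python
def check_sections(data):
    """Return a list of problems found in the assumed {"sections": [...]} layout."""
    if not isinstance(data, dict) or not isinstance(data.get("sections"), list):
        return ["top-level object must contain a 'sections' list"]
    problems = []
    for i, section in enumerate(data["sections"]):
        if not isinstance(section, dict):
            problems.append(f"sections[{i}] should be an object")
            continue
        for key in ("section_title", "content", "subsections", "items"):
            if not isinstance(section.get(key), list):
                problems.append(f"sections[{i}].{key} should be a list")
    return problems


# First entry as it appears in the printed output of this notebook
sample = {"sections": [{"section_title": ["PRIVACY POLICY"], "content": [],
                        "subsections": [], "items": []}]}
print(check_sections(sample))   # no problems reported for a well-formed entry
print(check_sections(None))     # what a failed load would produce
```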

In [72]:
# loaded_data['sections'][0]['content'] = loaded_data['sections'][1]['content']

# new_data = {
#         "section_title": None,
#         "content": None,
#         "subsections": [],
#         "items": []
#     }

# loaded_data['sections'].insert(1, new_data)

# loaded_data['sections'][1]['section_title'] = ["Latest Version"]
# loaded_data['sections'][1]['content'] = loaded_data['sections'][2]['section_title']

# del loaded_data['sections'][2]

In [73]:
print(json.dumps(loaded_data, indent=1))

{
 "sections": [
  {
   "section_title": [
    "PRIVACY POLICY"
   ],
   "content": [],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Last updated 15 Sep 2023"
   ],
   "content": [
    "At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Information Collection and Use"
   ],
   "content": [
    "We collect several different types of information for various purposes to provide and improve our Service to you."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Types of Data Collected"
   ],
   "content": [],
   "subsections": [
    {
     "subsection_title": [
      "Personal Data"
     ],
     "content": [
      "While using our Service, we may ask you to provide us with certain personally identif

In [74]:
def clean_text(text):
    """
    Clean a text string: replace newlines with spaces and collapse extra whitespace
    """
    if not isinstance(text, str):
        return ""
    text = text.replace('\n', ' ').strip()
    text = re.sub(r'\s+', ' ', text)

    return text

In [75]:
def preprocess_privacy_policy(data):
    """
    Process the privacy policy data from its JSON structure
    """
    
    processed_texts = []

    for section in data.get("sections", []):
        section_texts = []
        section_title = section.get("section_title", [])
        print(section_title)
        if isinstance(section_title, list):
            section_title = " ".join(item.strip() for item in section_title if isinstance(item, str))
        section_title = section_title.strip()

        section_content = section.get("content", [])
        if isinstance(section_content, list):
            section_content = " ".join(item.strip() for item in section_content if isinstance(item, str))
        section_content = section_content.strip()

        items = section.get("items", [])
        if isinstance(items, list):
            items = [item.strip() for item in items if isinstance(item, str)]
            if items:
                section_content += " " +  ", ".join(items)

        section_text = " ".join(filter(None, [section_title, section_content])).strip()

        if section_text:
            section_texts.append(section_text)  

        for subsection in section.get("subsections", []):
            subsection_title = subsection.get("subsection_title", [])
            if isinstance(subsection_title, list):
                subsection_title = " ".join(item.strip() for item in subsection_title if isinstance(item, str))
            subsection_title = subsection_title.strip()

            subsection_content = subsection.get("content", [])
            if isinstance(subsection_content, list):
                subsection_content = " ".join(item.strip() for item in subsection_content if isinstance(item, str))
            subsection_content = subsection_content.strip()

            subsection_items = subsection.get("items", [])
            if isinstance(subsection_items, list):
                subsection_items = [item.strip() for item in subsection_items if isinstance(item, str)]
                if subsection_items:
                    subsection_content += " " + ", ".join(subsection_items)

            subsection_text = " ".join(filter(None, [subsection_title, subsection_content])).strip()

            if subsection_text:
                section_texts.append(subsection_text)  

        if not (section_content or items or section.get("subsections")):
            continue

        processed_texts.append(section_texts)

    cleaned_texts = [text for text in processed_texts if text]
    return cleaned_texts

In [76]:
texts = preprocess_privacy_policy(loaded_data)
embeddings = [sbert_model.encode(text) for text in texts]

['PRIVACY POLICY']
['Last updated 15 Sep 2023']
['Information Collection and Use']
['Types of Data Collected']
['Use of Data']
['Consent']
['Access to Personal Information']
['Accessing Your Personal Information']
['Automated Edit Checks']
['Disclosure of Information']
['Sharing of Personal Data']
['Google User Data and Google Workspace APIs']
['Data Security']
['Data Retention & Disposal']
["Quality, Including Data Subjects' Responsibilities for Quality"]
['Monitoring and Enforcement']
['Cookies']
['Third-Party Websites']
['Changes to Privacy Policy']
['Contact Us']
['Purposeful Use Only']
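`preprocess_privacy_policy` returns one list of strings per section: the section's own text first, followed by one string per subsection. Encoding each inner list therefore gives one group of vectors per section, and a chunk is addressed by a pair (section index, chunk index). The sketch below makes that addressing explicit on toy data; the helper name `flatten_index` and the placeholder strings are assumptions for illustration.

```python
def flatten_index(texts):
    """Map each chunk to its (section index, chunk index) position."""
    return [
        (i, j, chunk)
        for i, section in enumerate(texts)
        for j, chunk in enumerate(section)
    ]


# Toy stand-in for the nested output of preprocess_privacy_policy
toy_texts = [
    ["Section A title and content"],
    ["Section B title", "Subsection B1 text", "Subsection B2 text"],
]

for i, j, chunk in flatten_index(toy_texts):
    print((i, j), chunk)
```

Index `j == 0` always refers to the section-level text, which is why the lookup later treats it as a special case.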


#### **Part 2.2** Chatbot Development

##### Finding the best match paragraph based on index

In [77]:
def find_match(user_query, texts, embeddings, model):
    """
    Return the indexed text that best matches the query by cosine similarity.
    """
    query_embedding = model.encode(user_query)

    best_score = -1
    best_match_index = None

    # embeddings[i][j] is the vector for texts[i][j]
    for i in range(len(embeddings)):
        for j in range(len(embeddings[i])):
            score = util.cos_sim(query_embedding, embeddings[i][j]).item()
            if score > best_score:
                best_score = score
                best_match_index = (i, j)

    if best_match_index is None:
        return ""  # nothing was indexed

    i, j = best_match_index
    if j == 0:
        # The section-level text matched: return the whole section with its subsections
        return "\n".join(item.strip() for item in texts[i] if isinstance(item, str))

    # A subsection matched: return only that subsection.
    # texts[i][j] is a single string, so it must not be joined character by character.
    return texts[i][j].strip()
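The search above scores every chunk against the query with cosine similarity and keeps the highest score. A small worked example with hand-made 2-D vectors, written in plain Python rather than with `util.cos_sim`, shows the arithmetic; the function names and toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, nested_vecs):
    """Return ((section index, chunk index), score) for the closest vector."""
    best_score, best_index = -1.0, None
    for i, section in enumerate(nested_vecs):
        for j, vec in enumerate(section):
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_index = score, (i, j)
    return best_index, best_score


# Two "sections": the first has one chunk, the second has two
toy_vectors = [
    [[1.0, 0.0]],
    [[0.0, 1.0], [0.6, 0.8]],
]
index, score = best_match([0.5, 0.9], toy_vectors)
print(index, round(score, 3))
```

The query points mostly along the second axis, so the chunk at position `(1, 1)` wins because its direction is closest to the query's.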

##### Using Gemini API to answer queries

In [78]:
def chatbot_response(user_query, texts, embeddings, sbert_model):
    match = find_match(user_query, texts, embeddings, sbert_model)

    # Reload the environment
    from dotenv import load_dotenv
    import google.generativeai as genai
    import os

    load_dotenv(dotenv_path="environment.env")
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    # print(GOOGLE_API_KEY)
    genai.configure(api_key=GOOGLE_API_KEY)

    model = genai.GenerativeModel("gemini-2.0-flash")
    # Explicit newlines keep each requirement on its own line in the prompt
    prompt = (
        f"With text {match}, your task is to answer the question {user_query}.\n"
        "Requirements:\n"
        "1. Full information from the text\n"
        "2. Answer in natural English\n"
        "Output format: break lines that are too long"
    )

    response = model.generate_content(prompt)
    print(match)
    return response.text
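Adjacent string literals in Python are joined with no separator, so a prompt spread over several source lines needs explicit newlines or spaces between its parts. The sketch below builds the prompt from a list of lines instead; `build_prompt` is a hypothetical helper and the wording is only an approximation of the prompt used in this notebook.

```python
def build_prompt(context, question):
    """Assemble the prompt line by line so each requirement stays separate."""
    lines = [
        f"With text {context}, your task is to answer the question {question}.",
        "Requirements:",
        "1. Full information from the text",
        "2. Answer in natural English",
        "Output format: break lines that are too long",
    ]
    return "\n".join(lines)


# Adjacent literals are concatenated exactly as written, with nothing in between
glued = "Requirements:" "1. Full information"
print(glued)

print(build_prompt("<matched text>", "What do you use cookies for?"))
```

Keeping the prompt in one helper also makes it easy to print and inspect before sending it to the model.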

##### Communicating with chatbot

In [79]:
def greeting_response(text):
    text = text.lower()

    bot_greetings = ['hi', 'hello', 'hey']
    user_greetings = ['hi', 'hey', 'hello']

    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)

In [80]:
lists = ['Who do you share data with?',
'How do you use my personal information?',
'What types of personal data do you collect?',
'What about the privacy policy changes?',
'How about third-party websites?',
'What do you use cookies for?',
'What types of personal data might the service ask you to provide?',
'How can I access personal information?',
'When was the last version updated?']

List of sample questions:<br>
- Who do you share data with?
- How do you use my personal information?
- What types of personal data do you collect?
- What about the privacy policy changes?
- How about third-party websites?
- What do you use cookies for?
- What types of personal data might the service ask you to provide?
- How can I access personal information?
- When was the last version updated?

In [82]:
print('I will answer the queries about Privacy Policy of Presight. If you want to exit, type bye.')

exit_list = ['exit', 'see you later', 'bye', 'quit', 'break', 'thanks']

# Interactive version:
# while True:
#     user_input = input()
#     print('User:', user_input)
#     if user_input.lower() in exit_list:
#         print('Chatbot: Bye Bye! Chat with you later!')
#         break
#     else:
#         greeting = greeting_response(user_input)
#         if greeting is not None:
#             print('Chatbot: ' + greeting + '\n')
#         else:
#             response = chatbot_response(user_input, texts, embeddings, sbert_model)
#             print('Chatbot: ' + response)

# Batch version: run through the sample questions
for q in lists:
    print(q)
    user_input = q
    if user_input.lower() in exit_list:
        print('Chatbot: Bye Bye! Chat with you later!')
        break
    else:
        # Call greeting_response once; its reply is chosen at random
        greeting = greeting_response(user_input)
        if greeting is not None:
            print('Chatbot: ' + greeting + '\n')
        else:
            response = chatbot_response(user_input, texts, embeddings, sbert_model)
            print('Chatbot: ' + response)



I will answer the queries about Privacy Policy of Presight. If you want to exit, type bye.
Who do you share data with?
Sharing of Personal Data Your personal data will not be subject to sharing, transfer, rental or exchange for the benefit of third parties, including AI models.
Chatbot: According to the text, your personal data will not be shared, transferred, rented, or exchanged for the benefit of third parties, including AI models.

How do you use my personal information?
Accessing Your Personal Information You have the right to access all of your personal information that we hold. Through the application, you can correct, amend, or append your personal information by logging into the application and navigating to your settings and profile.
Chatbot: The provided text focuses on your right to access and modify your personal information, but it **does not explain how the company uses your personal information.** It only states that you have the right to see all of it that they hold, 

RetryError: Timeout of 600.0s exceeded, last exception: 503 IOCP/Socket: Connection aborted (An established connection was aborted by the software in your host machine.
 -- 10053)
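The run above stopped with a `RetryError` after the connection was dropped, so the remaining questions were never asked. One possible way to make a batch more tolerant of transient failures is a small retry wrapper with exponential backoff around each call. This is only a sketch: `with_retries`, its parameters, and the simulated `flaky` function are hypothetical, and it is separate from whatever retry logic the Google client applies internally.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # in practice, catch only transient errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Simulated call that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly
delays = []
result = with_retries(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the question loop, the wrapped call could be `lambda: chatbot_response(q, texts, embeddings, sbert_model)`, with a `try`/`except` around it so that one failed question does not end the whole run.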