# Semantic Search

Semantic search is an information retrieval technique that matches on the meaning and context of a user query 
rather than on its literal keywords. Unlike traditional keyword-based search, it aims to return more relevant results by 
modeling the intent and nuances of the query.

### Key Concepts

1. **Meaning Understanding**:
   - Semantic search leverages natural language processing (NLP) techniques to interpret the meaning of words and phrases within a query.
   - It considers synonyms, antonyms, related concepts, and the overall context to provide more accurate results.

2. **Vector Representations**:
   - Words and sentences are represented as vectors in a high-dimensional space using models like Word2Vec, GloVe, or more advanced 
     transformer-based models such as BERT, GPT, and Sentence-BERT.
   - These vector representations capture semantic relationships between words and phrases.

3. **Semantic Similarity**:
   - Semantic search calculates the similarity between query vectors and document vectors based on their proximity in the vector space.
   - Documents with higher similarity scores are considered more relevant to the query.

4. **Contextual Understanding**:
   - Semantic search takes into account the context in which words appear, allowing it to better understand the intent behind a query.
   - This is particularly useful for queries that involve multiple meanings or ambiguous terms.
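
The similarity step described above is often implemented with cosine similarity. The following is a minimal sketch in plain Python; the three-dimensional vectors and the names `toy_vectors`, `query_vector`, and `rank_by_similarity` are made up for illustration and are not produced by any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: real models use hundreds of dimensions).
toy_vectors = {
    "top restaurants": [0.9, 0.1, 0.0],
    "fine dining": [0.8, 0.3, 0.1],
    "car repair": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

def rank_by_similarity(query, vectors):
    """Return the vector names ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    return [name for _, name in sorted(scored, reverse=True)]

ranking = rank_by_similarity(query_vector, toy_vectors)
print(ranking)
```

In this toy setup the two dining-related phrases rank ahead of the unrelated one because their vectors point in nearly the same direction as the query vector.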

### Advantages

1. **Improved Relevance**:
   - Semantic search can deliver more relevant results because it matches on the underlying meaning of a query rather than its exact wording.
   - It can reduce the number of irrelevant documents returned, giving users more accurate information.

2. **Handling Synonyms and Related Terms**:
   - Semantic search can identify synonyms and related terms, expanding the scope of a query to include semantically similar concepts.
   - This helps in capturing a broader range of relevant content.

3. **Better Understanding of Context**:
   - By considering the context in which words appear, semantic search can interpret queries more accurately, especially those involving 
     complex or nuanced language.

4. **Enhanced User Experience**:
   - Users are more likely to find what they are looking for with fewer clicks and less time spent filtering through irrelevant results.
   - This leads to a better overall user experience on search platforms.

### Applications

Semantic search is widely used in various applications, including:

1. **Search Engines**:
   - Major search engines like Google use semantic understanding to provide more relevant search results.
   
2. **E-commerce Platforms**:
   - Enhances product discovery by understanding customer queries and suggesting products that match their intent.

3. **Content Management Systems (CMS)**:
   - Helps in finding the most relevant content for users, improving user engagement and satisfaction.

4. **Chatbots and Virtual Assistants**:
   - Provides more accurate responses to user queries by understanding the context and meaning of the input.

5. **Academic Research**:
   - Assists researchers in finding relevant papers and articles by comprehending complex research questions.

### Example

Consider a query for "best restaurants in New York City":

- **Keyword-Based Search**: Returns only results that contain the literal words "best," "restaurants," and "New York City," so a page 
  about top-rated dining in NYC that uses none of those words could be missed.
- **Semantic Search**: Understands that the user is looking for top-rated dining establishments in NYC. It might also consider synonyms 
  like "top restaurants" or related concepts like "high-end eateries."
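
The keyword-based half of this comparison can be sketched in a few lines. This is an illustrative assumption about how a naive keyword matcher behaves, not a description of any particular search engine; `keyword_match` and the two sample documents are hypothetical.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

query = "best restaurants in New York City"
docs = [
    "The best restaurants in New York City ranked by critics",
    "Top-rated dining establishments in NYC",
]

# The second document is about the same topic but shares almost no words with the query.
matches = [keyword_match(query, d) for d in docs]
print(matches)  # [True, False]
```

A semantic approach would instead compare embeddings of the query and each document, so the second document could still score as relevant.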

In [1]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Collecting tqdm (from sentence-transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Using cached torch-2.5.1-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.6.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.15.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-0.28.0-py3-none-any.whl.metadata (13 kB)
Collecting Pillow (from sentence-transformers)
  Using cached pillow-11.1.0-cp312-cp312-win_amd64.whl.metadata (9.3 kB)
Collecting filelock (fr

In [2]:
!pip install faiss-cpu

Collecting faiss-cpu
  Using cached faiss_cpu-1.9.0.post1-cp312-cp312-win_amd64.whl.metadata (4.5 kB)
Using cached faiss_cpu-1.9.0.post1-cp312-cp312-win_amd64.whl (13.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [3]:
!pip install protobuf==3.20.*

Collecting protobuf==3.20.*
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Using cached protobuf-3.20.3-py2.py3-none-any.whl (162 kB)
Installing collected packages: protobuf
Successfully installed protobuf-3.20.3


In [5]:
!pip install langchain

Collecting langchain
  Downloading langchain-0.3.16-py3-none-any.whl.metadata (7.1 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Using cached SQLAlchemy-2.0.37-cp312-cp312-win_amd64.whl.metadata (9.9 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Using cached aiohttp-3.11.11-cp312-cp312-win_amd64.whl.metadata (8.0 kB)
Collecting langchain-core<0.4.0,>=0.3.32 (from langchain)
  Downloading langchain_core-0.3.32-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.3 (from langchain)
  Using cached langchain_text_splitters-0.3.5-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.4,>=0.1.17 (from langchain)
  Downloading langsmith-0.3.2-py3-none-any.whl.metadata (14 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from langchain)
  Downloading pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)
Collecting tenacity!=8.4.0,<10,>=8.1.0 (from langchain)
  Using cached tenacity-9.0.0-py3-none-any.whl.metadata (1.2 kB)
Collecting aiohappyeyeballs>=2.3.0 (

In [6]:
import re
import numpy as np
from sentence_transformers import SentenceTransformer

import faiss


In [7]:

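from typing import List, Optional

class Document:
    """Minimal container for a text passage plus optional metadata and id."""
    def __init__(self, content: str, metadata: Optional[dict] = None, id: Optional[int] = None):
        self.content = content
        self.metadata = metadata
        self.id = id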

class SemanticSearch:
    def __init__(self, documents: List[Document], model_name: str = 'sentence-transformers/all-MiniLM-L6-v2'):
        """
        Initialize the semantic search system with a pre-trained Sentence-BERT model.
        
        Args:
            documents (List[Document]): A list of Document objects.
            model_name (str): Name of the Sentence-BERT model to use.
        """
        self.documents = documents
        self.model = SentenceTransformer(model_name)
        self.doc_vectors = None
        self.vector_store = None

    def preprocess_text(self, text: str) -> str:
        """
        Preprocess the input text.
        
        Args:
            text (str): Input text.
            
        Returns:
            str: Processed text.
        """
        # Lowercase
        text = text.lower()
        
        # Remove special characters
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        
        # Remove extra whitespaces
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text

    def load_documents(self):
        """
        Preprocess and encode documents into vectors.
        """
        processed_docs = [self.preprocess_text(doc.content) for doc in self.documents]
        self.doc_vectors = self.model.encode(processed_docs, convert_to_numpy=True)
        
        # Print the shape of the document vectors to verify dimensions
        print(f"Document vectors shape: {self.doc_vectors.shape}")
        
        # Create FAISS index
        d = self.doc_vectors.shape[1]  # Dimension of the vectors
        self.vector_store = faiss.IndexFlatL2(d)
        self.vector_store.add(self.doc_vectors)

    def search(self, query: str, top_k: int = 5) -> List[str]:
        """
        Perform semantic search.
        
        Args:
            query (str): Search query.
            top_k (int): Number of top documents to retrieve.
            
        Returns:
            List[str]: Retrieved documents.
        """
        # Preprocess the query
        processed_query = self.preprocess_text(query)
        
        # Encode the query
        query_vector = self.model.encode(processed_query, convert_to_numpy=True).reshape(1, -1)
        
        # Print the shape of the query vector to verify dimensions
        print(f"Query vector shape: {query_vector.shape}")
        
        # Search in FAISS index. Cap top_k at the number of indexed documents:
        # FAISS pads missing results with index -1, which Python would silently
        # resolve to the last document and return as a duplicate.
        top_k = min(top_k, len(self.documents))
        distances, indices = self.vector_store.search(query_vector, top_k)
        
        # Retrieve and return the most similar documents, skipping any -1 padding
        retrieved_docs = [self.documents[idx].content for idx in indices[0] if idx != -1]
        return retrieved_docs



In [8]:
# Example usage
documents = [
    Document("The quick brown fox jumps over the lazy dog"),
    Document("Never jump over the lazy dog quickly"),
    Document("Quickly brown dogs never jump over fences")
]

semantic_search = SemanticSearch(documents)
semantic_search.load_documents()

query = "quick brown dog"
# Only three documents are indexed, so at most three results come back
top_docs = semantic_search.search(query, top_k=5)

for doc in top_docs:
    print(doc)

Document vectors shape: (3, 384)
Query vector shape: (1, 384)
The quick brown fox jumps over the lazy dog
Quickly brown dogs never jump over fences
Never jump over the lazy dog quickly
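
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. The sketch below reproduces that idea in plain Python on made-up two-dimensional vectors, only to show what the index computes; `squared_l2`, `brute_force_search`, and `toy_doc_vectors` are hypothetical names, and the code makes no attempt at FAISS's performance. It caps `top_k` at the number of stored vectors, since FAISS itself pads missing results with index -1 when more neighbours are requested than vectors exist.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query_vector, doc_vectors, top_k):
    """Return (distance, index) pairs for the nearest vectors, closest first."""
    # Never ask for more neighbours than there are vectors.
    top_k = min(top_k, len(doc_vectors))
    scored = sorted((squared_l2(query_vector, v), i) for i, v in enumerate(doc_vectors))
    return scored[:top_k]

# Hypothetical toy vectors standing in for 384-dimensional sentence embeddings.
toy_doc_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]

results = brute_force_search([0.9, 1.2], toy_doc_vectors, top_k=5)
nearest_ids = [i for _, i in results]
print(nearest_ids)
```

Only three ids are returned even though five were requested, and they are ordered from the closest vector to the farthest.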


In [9]:
document_1 = Document(
    content="""The core architecture of Informatica Cloud Integration (IICS) includes a cloud infrastructure, data management layer, integration 
services layer, analytics and reporting layer, and user interface.""",
    metadata={"source": "informatica_cloud_architecture"},
    id=1,
)

document_2 = Document(
    content="""Informatica Cloud differs from traditional on-premises Informatica PowerCenter in terms of deployment model, scalability, accessibility, 
and cost management.""",
    metadata={"source": "informatica_cloud_vs_powercenter"},
    id=2,
)

document_3 = Document(
    content="""The various deployment models in Informatica Cloud include Public Cloud, Private Cloud, and Hybrid Cloud, each offering different levels 
of control and security.""",
    metadata={"source": "deployment_models"},
    id=3,
)

document_4 = Document(
    content="""My approach to designing complex cloud data integration mappings involves modular design, data flow analysis, 
and performance testing.""",
    metadata={"source": "integration_design"},
    id=4,
)

document_5 = Document(
    content="""To optimize performance in large-scale cloud data integration projects, I utilize parallel processing, efficient resource allocation, and 
caching mechanisms.""",
    metadata={"source": "performance_optimization"},
    id=5,
)

document_6 = Document(
    content="""Batch processing involves processing large volumes of data at scheduled intervals, while real-time processing enables immediate data 
processing with low latency.""",
    metadata={"source": "batch_vs_real_time"},
    id=6,
)

document_7 = Document(
    content="""My experience with Intelligent Data Lake includes understanding its key features such as automated data cataloging, intelligent search, 
and advanced analytics capabilities.""",
    metadata={"source": "intelligent_data_lake"},
    id=7,
)

document_8 = Document(
    content="""Handling complex data transformations using Informatica Cloud's mapping design involves utilizing complex expressions, conditional logic, 
and advanced functions.""",
    metadata={"source": "complex_transformations"},
    id=8,
)

document_9 = Document(
    content="""The process of creating and managing connection configurations in IICS includes defining them using the UI or programmatically via APIs, 
and securing sensitive information using encryption.""",
    metadata={"source": "connection_configurations"},
    id=9,
)

document_10 = Document(
    content="""A challenging data migration project I completed involved migrating data from on-premises systems to a cloud environment while ensuring 
minimal downtime and data integrity.""",
    metadata={"source": "data_migration_project"},
    id=10,
)

document_11 = Document(
    content="""To design a scalable, fault-tolerant cloud integration solution, I implemented redundancy in processing components, used load balancing 
techniques, and employed automated failover mechanisms.""",
    metadata={"source": "scalable_solution_design"},
    id=11,
)

document_12 = Document(
    content="""Strategies for handling data quality and data validation in cloud integrations include utilizing data quality tools within IICS and 
implementing error handling and retry logic.""",
    metadata={"source": "data_quality_validation"},
    id=12,
)

document_13 = Document(
    content="""The key differences between Informatica Cloud Data Integration and Application Integration involve the focus on moving and transforming 
large volumes of data versus integrating business processes and applications through APIs and messaging protocols.""",
    metadata={"source": "data_vs_application_integration"},
    id=13,
)

document_14 = Document(
    content="""My knowledge of security configurations in Informatica Cloud includes implementing role-based access control (RBAC), utilizing encryption 
for data at rest and in transit, and configuring secure connections using SSL/TLS.""",
    metadata={"source": "security_configurations"},
    id=14,
)

document_15 = Document(
    content="""My experience with pre-built connectors and custom connector development involves leveraging pre-built connectors for common systems like 
Salesforce, SAP, and Oracle, as well as developing custom connectors for unique integration requirements using Informatica’s connector SDK.""",
    metadata={"source": "connectors"},
    id=15,
)

document_16 = Document(
    content="""To diagnose and resolve performance bottlenecks in cloud integration workflows, I use profiling tools to identify slow-running tasks and 
optimize data flow paths, while increasing resource allocation where necessary.""",
    metadata={"source": "performance_bottlenecks"},
    id=16,
)

document_17 = Document(
    content="""The monitoring and logging capabilities in Informatica Cloud include utilizing the built-in monitoring dashboard to track integration 
activities and performance metrics, as well as configuring logging for detailed error tracking and auditing purposes.""",
    metadata={"source": "monitoring_logging"},
    id=17,
)

document_18 = Document(
    content="""My strategies for error handling and retry mechanisms involve implementing robust error handling logic within mappings and workflows, as 
well as setting up retry mechanisms with exponential backoff to handle transient errors gracefully.""",
    metadata={"source": "error_handling_retry"},
    id=18,
)

document_19 = Document(
    content="""My approach to implementing real-time data synchronization across multiple cloud platforms involves using real-time data streaming 
services like Apache Kafka or AWS Kinesis, as well as implementing change data capture (CDC) techniques.""",
    metadata={"source": "real_time_synchronization"},
    id=19,
)

document_20 = Document(
    content="""Ensuring data governance and compliance in cloud integration projects involves establishing data governance policies and procedures for 
managing metadata and access rights, while complying with industry regulations like GDPR, HIPAA, and CCPA through secure data handling practices.""",
    metadata={"source": "data_governance_compliance"},
    id=20,
)

document_21 = Document(
    content="""The process of creating and managing intelligent data services in Informatica Cloud includes defining them using the IICS UI or 
programmatically via APIs, as well as managing service configurations and deploying them to different environments.""",
    metadata={"source": "intelligent_data_services"},
    id=21,
)

document_22 = Document(
    content="""My experience with complex mapping transformations like lookup, router, and aggregator involves utilizing lookup transformations for 
reference data lookups, implementing router transformations to route data based on conditional logic, and using aggregators for summarizing data across 
multiple records.""",
    metadata={"source": "complex_mapping_transformations"},
    id=22,
)

document_23 = Document(
    content="""Handling incremental data loading and change data capture (CDC) involves configuring CDC settings in source systems to capture changes, 
as well as implementing incremental load strategies using timestamps or change keys in mappings.""",
    metadata={"source": "incremental_data_loading"},
    id=23,
)

document_24 = Document(
    content="""My experience integrating Informatica Cloud with major cloud platforms like AWS, Azure, and Google Cloud includes leveraging their 
respective services for seamless data integration.""",
    metadata={"source": "cloud_platform_integration"},
    id=24,
)

document_25 = Document(
    content="""Managing cloud data integration across different SaaS applications involves utilizing pre-built connectors for popular systems like 
Salesforce, ServiceNow, and MarketMuse, as well as implementing custom integrations where required using APIs and webhooks.""",
    metadata={"source": "saas_integration"},
    id=25,
)

document_26 = Document(
    content="""My proficiency with Informatica Cloud REST APIs includes using them to automate tasks such as workflow execution, asset management, and 
monitoring, while integrating external systems using API calls for real-time data exchange.""",
    metadata={"source": "rest_apis"},
    id=26,
)

document_27 = Document(
    content="""Using PowerCenter mappings in a cloud integration context involves leveraging them within IICS for complex ETL processes while ensuring 
compatibility between on-premises and cloud environments.""",
    metadata={"source": "powercenter_mappings"},
    id=27,
)

document_28 = Document(
    content="""My experience with custom scripting for complex integration scenarios includes writing scripts using Informatica’s scripting language to 
handle unique integration requirements, as well as integrating third-party tools and libraries to extend functionality.""",
    metadata={"source": "custom_scripting"},
    id=28,
)

document_29 = Document(
    content="""Designing an end-to-end data integration workflow for a complex business scenario involves identifying key data sources, targets, and 
transformation needs, as well as designing mappings and workflows that meet business objectives and performance requirements.""",
    metadata={"source": "end_to_end_workflow"},
    id=29,
)

document_30 = Document(
    content="""A project where I solved a critical data integration challenge using Informatica Cloud involved addressing issues related to real-time 
data synchronization across multiple systems, while implementing solutions using IICS features like CDC and real-time streaming services.""",
    metadata={"source": "critical_integration_challenge"},
    id=30,
)

document_31 = Document(
    content="""My understanding of the internal architecture of Intelligent Cloud Services comes from exploring the underlying mechanisms that support IICS, including service 
discovery, load balancing, and failover mechanisms, as well as analyzing how data is processed and managed within the cloud environment.""",
    metadata={"source": "internal_architecture"},
    id=31,
)

document_32 = Document(
    content="""Advanced techniques for managing large-scale data transformations include employing distributed processing techniques to handle large 
volumes of data efficiently, while utilizing parallel execution and caching strategies to optimize performance.""",
    metadata={"source": "large_scale_transformations"},
    id=32,
)

document_33 = Document(
    content="""Ensuring data privacy and security in multi-tenant cloud environments involves implementing isolation mechanisms to protect 
tenant-specific data, as well as using encryption, access controls, and audit logging to maintain security standards.""",
    metadata={"source": "data_privacy_security"},
    id=33,
)

document_34 = Document(
    content="""Integrating AI and machine learning capabilities with Informatica Cloud involves exploring how AI/ML can be used for predictive 
analytics, anomaly detection, and automated decision-making in integrations, while planning strategies that leverage machine learning models to enhance 
data processing.""",
    metadata={"source": "ai_ml_integration"},
    id=34,
)

document_35 = Document(
    content="""The future of cloud data integration is expected to evolve with the continued growth of serverless architectures and event-driven 
integration patterns, as well as increased adoption of AI/ML for automation and intelligence in cloud integrations.""",
    metadata={"source": "future_of_cloud_integration"},
    id=35,
)

document_36 = Document(
    content="""
### Chapter 1: Taskflows and Linear Taskflows

#### Section 2.1: Taskflows

1. **Taskflow Steps**
   - The different types of taskflow steps available in Informatica IICS include Data Task, Notification Task, Decision Step, File Watch Task, etc.

2. **Creating a Taskflow**
   - To create a new taskflow in Informatica IICS, you typically start by selecting the appropriate template (e.g., Single Task, Sequential Tasks with 
Decision) and then configure the steps and properties as needed.

3. **Taskflow Templates**
   - The types of taskflow templates available include Single Task for running one task on a schedule, and Sequential Tasks with Decision for running tasks 
in sequence followed by a decision step based on output. Each template is suited to different automation needs.

4. **Setting Taskflow Properties**
   - Important properties that can be set for a taskflow include the name, location, input fields, output fields, temporary fields, advanced properties, 
and notes. These are configured through the Properties section in the taskflow designer.

5. **Setting Taskflow Step Properties**
   - For various taskflow steps like Data Task, Notification Task, and Decision step, you set properties by accessing the Properties section for each step 
within the taskflow designer.

6. **Runtime Parameters**
   - Runtime parameters are variables that can be overridden at runtime to customize task behavior without changing the underlying code. They are used to 
make tasks more flexible and adaptable to different scenarios.

7. **Parameter Set**
   - A parameter set is a collection of parameters that can be reused across multiple tasks or steps within a taskflow, promoting consistency and reducing 
redundancy.

8. **The Expression Editor**
   - The Expression Editor is used to create expressions within a taskflow for conditional logic, calculations, or data transformations. It allows users to 
define complex operations using drag-and-drop components and scripting.

9. **Taskflow Functions**
   - Common taskflow functions include IF-THEN-ELSE, FOR-EACH, WHILE, etc., which are used to control the flow of execution based on specific conditions or 
loops.

10. **Running a Taskflow**
    - Methods to run a taskflow include manually from the taskflow designer, using APIs like REST, or through scheduled jobs. The RunAJob utility can also 
be used to execute taskflows with specified inputs.

11. **Taskflow Example**
    - A simple taskflow example might involve a Data Task step to extract data from a source system, followed by a Notification Task step to send an email 
if the extraction fails, and finally a Decision Step to determine whether to proceed with further processing based on the success or failure of the initial 
steps.

#### Section 2.1.12: Running a Taskflow

1. **Running a Taskflow from the Taskflow Designer**
   - To run a taskflow directly from the taskflow designer, you typically click the Run button within the designer interface.

2. **Using Taskflow Inputs**
   - Taskflow inputs are created and used by adding input fields to the taskflow and configuring them with values or expressions that will be passed to the 
tasks when executed.

3. **Publishing a Taskflow**
   - Publishing a taskflow involves saving it in a state where it can be accessed and run by other users or systems. It is important for making changes 
available and ensuring consistency across different environments.

4. **Running a Taskflow as an API**
   - To run a taskflow using REST APIs, you make HTTP requests to the appropriate endpoints, passing inputs as required, and handle responses to monitor 
execution status and results.

5. **Adding a Custom Name to a Taskflow Name**
   - A custom name can be added to a taskflow when running it by specifying an override API name in the taskflow properties or through the RunAJob utility.

6. **Scheduling a Taskflow**
   - Scheduling a taskflow involves creating a schedule and associating it with the desired taskflow. It is important for automating repetitive tasks at 
specific times or intervals.

7. **Monitoring Taskflow Status**
   - The status of a running taskflow can be monitored using APIs that provide real-time updates on execution progress, success/failure outcomes, and other 
relevant metrics.

#### Section 2.1.15: Taskflow Log Files

1. **Downloading Taskflow Log File**
   - To download a taskflow log file from Data Integration, you navigate to the My Jobs page, select the taskflow job, and use the provided options to 
access and download the log resource.

2. **Taskflow Log File Contents**
   - A taskflow log file contains information such as the asset name, type, duration, start/end times, location, run ID, URLs, runtime environment, status, 
subtask details, and error messages. This information is crucial for troubleshooting and auditing purposes.
    """,
    metadata={"source": "chapter1"},
    id=36,
)

document_37 = Document(
    content="""
### Chapter 3: Linear Taskflows

#### Section 3.1: Scheduling Linear Taskflow Jobs

1. **Configuring a Linear Taskflow**
   - Configuring a linear taskflow involves adding tasks in sequence, setting their order using sequence numbers, configuring error handling options like 
Stop on Error, and optionally setting email notification preferences.

2. **Running a Linear Taskflow**
   - Steps to run a scheduled linear taskflow include navigating to the Explore page, selecting the taskflow, clicking Actions, and choosing Run. 
Alternatively, you can configure a schedule for the taskflow to run automatically at specified intervals.

3. **Stopping a Linear Taskflow or Subtask**
   - To stop a running linear taskflow or its subtasks, you navigate to the My Jobs page, select the job, and use the Stop button. The behavior upon 
stopping depends on whether the Stop on Error option is enabled for the taskflow.
    """,
    metadata={"source": "chapter3"},
    id=37,
)
document_38 = Document(
    content="""Question: What is Informatica Cloud and what are its capabilities?
Answer: Informatica Cloud is a comprehensive platform for data integration and business intelligence. Its capabilities include ETL 
(Extract, Transform, Load) services, data replication, analytics tools, and more. It supports various connectors to integrate with 
different data sources, perform data transformations, and manage data pipelines efficiently."""
    ,
    metadata={"source": "informatica"},
    id=38,
)

document_39 = Document(
    content="""Question: What is ETL in the context of Informatica Cloud Data Integration?
Answer: In Informatica Cloud, ETL stands for Extract, Transform, and Load. It refers to the process of extracting data from source 
systems, transforming it into a desired format, and loading it into target systems or data warehouses. Informatica Cloud provides tools 
and services to automate this process, ensuring data accuracy and consistency across different environments."""
    ,
    metadata={"source": "informatica"},
    id=39,
)

document_40 = Document(
    content="""Question: Explain ETL concepts in detail with respect to Informatica Cloud.
Answer: ETL is the process of extracting data from various sources, transforming it into a usable format, and loading it into a data 
warehouse or another target system. In Informatica Cloud, this involves several key steps:
1. **Extract**: Gathering data from sources using extractors.
2. **Transform**: Cleaning, enriching, and formatting data to meet business requirements using transformation mappings.
3. **Load**: Storing the transformed data in the target systems using loaders. Informatica Cloud supports a wide range of connectors for 
seamless integration with different data sources and targets."""
    ,
    metadata={"source": "informatica"},
    id=40,
)

document_41 = Document(
    content="""Question: Explain Data Warehousing.
Answer: Data warehousing is the process of consolidating, organizing, and storing large amounts of historical data for analysis. It 
involves creating a centralized repository where data from multiple sources is integrated and made available for reporting, analytics, 
and decision-making. Data warehouses are designed to support complex queries and provide insights that drive business strategies."""
    ,
    metadata={"source": "database"},
    id=41,
)

document_42 = Document(
    content="""Question: Explain Dimensional Modeling in detail.
Answer: Dimensional modeling is a data modeling approach used in data warehouses. It involves creating dimension tables and fact tables 
to represent business concepts and relationships. Dimensions contain descriptive attributes, while facts store the actual measured 
values."""
    ,
    metadata={"source": "database"},
    id=42,
)

document_43 = Document(
    content="""Question: What are the different types of SCD (Slowly Changing Dimension) in Informatica Cloud?
Answer: SCD is a technique used to handle changes in dimension data. Four commonly cited types are:
1. **Type 0 SCD**: Retain the original values; the attribute is never updated after the initial load.
2. **Type 1 SCD**: Overwrite old records with new ones, keeping only the latest values.
3. **Type 2 SCD**: Keep a history of all changes by adding new records without deleting old ones, typically with effective dates or a current-row flag.
4. **Type 3 SCD**: Keep limited history by storing the previous value in an additional column alongside the current value."""
    ,
    metadata={"source": "informatica"},
    id=43,
)

document_44 = Document(
    content="""Question: How do you load Reference Tables and Hierarchies in Informatica Cloud?
Answer: Reference tables and hierarchies can be loaded similarly to dimensions, using one-time or incremental loads. For example:
- **Customer Dim**: Loading customer details with attributes like name, address, etc.
- **Time Dim**: Creating a comprehensive time dimension for date-related queries.
- **Region Dim**: Defining regions with hierarchies and attributes like region manager, country, etc."""
    ,
    metadata={"source": "informatica"},
    id=44,
)

document_45 = Document(
    content="""Question: What are fact tables in Informatica Cloud?
Answer: Fact tables are the core of a data warehouse. They contain the measurable values or metrics about business processes and events. 
Each fact table is linked to dimension tables, providing context and attributes for the facts. In Informatica Cloud, fact tables store 
large volumes of transactional data that can be analyzed using OLAP (Online Analytical Processing) tools."""
    ,
    metadata={"source": "database"},
    id=45,
)

document_46 = Document(
    content="""Question: What are the different load strategies for Fact Tables in Informatica Cloud?
Answer: There are several strategies to load fact tables in Informatica Cloud:
1. **Full Load**: Replacing all existing data with new data.
2. **Incremental Load**: Adding only new or changed records without deleting old ones.
3. **Append-Only (Insert) Load**: Appending changes as new rows rather than overwriting existing ones.
4. **Overwrite (Upsert) Load**: Overwriting existing fact rows with new values, analogous to Type 1 SCD handling for dimensions.
5. **History-Preserving Load**: Maintaining a history of all changes to fact rows, analogous to Type 2 SCD handling for dimensions."""
    ,
    metadata={"source": "informatica"},
    id=46,
)

document_47 = Document(
    content="""Question: Explain in a step by step approach how to build a mapping in Informatica Cloud to implement a TYPE1 
dimension (Customer Dimension).
Answer: To create a TYPE1 dimension for customers:
1. **Create the Customer Dimension Table**: Design and create a table with attributes like customer_id, name, address, etc.
2. **Prepare Source Data**: Ensure your source data is clean and ready for loading into the dimension table.
3. **Build the Mapping**:
   - Open Informatica Cloud Studio.
   - Create a new mapping for the Customer Dimension.
4. **Configure Source and Target**:
   - Set up the source connector (e.g., Oracle, SQL Server).
   - Configure the target as your customer dimension table.
5. **Add Transformation Logic**: 
   - Use the Expression transformation to handle any necessary transformations.
6. **Load Data**: 
   - Configure the target to upsert (update existing rows, insert new ones) on customer_id so that changed attributes overwrite the old values.
7. **Test and Validate**: 
   - Execute the mapping and validate that all data is loaded correctly into the dimension table."""
    ,
    metadata={"source": "informatica"},
    id=47,
)

document_48 = Document(
    content="""Question: Explain in a step by step approach how to build a mapping in Informatica Cloud to implement a TYPE2 
dimension (Customer Dimension).
Answer: To create a TYPE2 dimension for customers:
1. **Create the Customer Dimension Table**: Design and create a table with attributes like customer_id, name, address, start_date, 
end_date, etc.
2. **Prepare Source Data**: Ensure your source data is clean and ready for loading into the dimension table.
3. **Build the Mapping**:
   - Open Informatica Cloud Studio.
   - Create a new mapping for the Customer Dimension.
4. **Configure Source and Target**:
   - Set up the source connector (e.g., Oracle, SQL Server).
   - Configure the target as your customer dimension table.
5. **Add Transformation Logic**: 
   - Use the Expression transformation to handle any necessary transformations.
6. **Load Data**: 
   - Insert a new row for each new or changed customer instead of overwriting the existing row.
7. **Maintain History**: 
   - For a changed customer, set end_date on the previously current row and insert the new version with its start_date and an open end_date.
8. **Test and Validate**: 
   - Execute the mapping and validate that all data is loaded correctly into the dimension table with history tracking."""
    ,
    metadata={"source": "informatica"},
    id=48,
)

document_49 = Document(
    content="""Question: Explain in a step by step approach how to build a mapping in Informatica Cloud to implement a TYPE1 fact 
(Invoice Line Fact).
Answer: To create a TYPE1 fact for invoice lines:
1. **Create the Invoice Line Fact Table**: Design and create a table with attributes like invoice_id, line_number, product_id, quantity, 
amount, etc.
2. **Prepare Source Data**: Ensure your source data is clean and ready for loading into the fact table.
3. **Build the Mapping**:
   - Open Informatica Cloud Studio.
   - Create a new mapping for the Invoice Line Fact.
4. **Configure Source and Target**:
   - Set up the source connector (e.g., Oracle, SQL Server).
   - Configure the target as your invoice line fact table.
5. **Add Transformation Logic**: 
   - Use the Expression transformation to handle any necessary transformations.
6. **Load Data**: 
   - Apply a full load strategy using the Update Strategy set to "INSERT_ONLY".
7. **Test and Validate**: 
   - Execute the mapping and validate that all data is loaded correctly into the fact table."""
    ,
    metadata={"source": "informatica"},
    id=49,
)

document_50 = Document(
    content="""Question: Explain in detail the Lookup transformation with at least 5 examples.
Answer: The Lookup transformation retrieves data from a lookup source based on the current row's values. Here are five examples:
1. **Simple Lookup**: Retrieve customer details from a customer dimension table using a customer_id.
2. **Dynamic Lookup**: Use dynamic SQL to fetch product prices based on product IDs and dates.
3. **Multi-Valued Lookup**: Fetch multiple addresses for customers based on their IDs.
4. **Hierarchical Lookup**: Retrieve product categories and subcategories from a hierarchical structure.
5. **Conditional Lookup**: Apply conditions to filter data before lookup, e.g., only fetch active customers."""
    ,
    metadata={"source": "informatica"},
    id=50,
)

document_51 = Document(
    content="""Question: Explain in detail the Joiner transformation with at least 5 examples.
Answer: The Joiner transformation combines data from two sources based on matching keys; several Joiners can be chained to combine 
more than two. Here are five examples:
1. **Basic Join**: Combine customer and order details to create a single view of sales.
2. **Multi-way Join**: Chain Joiner transformations to join three tables (e.g., customers, orders, products) into a comprehensive sales report.
3. **Outer Joins**: Use Left or Right Outer Joins to include all records from one source even if there is no match in another.
4. **Self-Join**: Combine data within the same table based on different conditions.
5. **Dynamic Join**: Apply conditions dynamically to join sources, e.g., join only for recent orders."""
    ,
    metadata={"source": "informatica"},
    id=51,
)

document_52 = Document(
    content="""Question: Explain in detail the Router transformation with at least 5 examples.
Answer: The Router transformation routes data based on specific conditions to different target tables or processes. Here are five 
examples:
1. **Conditional Routing**: Route sales data based on regions (e.g., East, West) to separate tables.
2. **Priority-based Routing**: Prioritize data routing based on urgency levels (e.g., high priority orders first).
3. **Filtering and Routing**: Filter data before routing it to specific targets, e.g., route only active customers.
4. **Error Handling Routing**: Route error records to a dedicated table for further investigation.
5. **Dynamic Routing**: Apply dynamic conditions at runtime based on current state or external inputs."""
    ,
    metadata={"source": "informatica"},
    id=52,
)

document_53 = Document(
    content="""Question: Explain in detail Parameters with at least 5 examples.
Answer: Parameters allow you to pass values between mappings and taskflows dynamically. Here are five examples:
1. **Dynamic SQL**: Use parameters in SQL queries to filter data based on runtime values.
2. **Conditional Logic**: Apply conditional logic based on parameter values to alter mapping behavior.
3. **Taskflow Execution**: Pass parameters from one taskflow to another for seamless execution.
4. **Parameterized Taskflows**: Create reusable taskflows by passing inputs and outputs through parameters.
5. **Dynamic Update Strategies**: Adjust update strategies in mappings dynamically based on parameter values."""
    ,
    metadata={"source": "informatica"},
    id=53,
)

document_54 = Document(
    content="""
Scenario: You have an ETL process that is running well but needs to be optimized further to reduce execution time.
Problem: How would you set up performance monitoring and tuning in Informatica?
Response: 
1. **Performance Metrics:** Define key performance indicators (KPIs) such as data load times, error rates, and resource utilization.
2. **Monitoring Tools:** Utilize Informatica’s built-in monitoring tools to track performance metrics in real-time.
3. **Tuning Strategies:** Identify bottlenecks based on monitoring data and apply tuning strategies (e.g., parallelism, partitioning).
4. **Performance Reports:** Generate regular reports to measure the effectiveness of performance improvements.""",
    metadata={"source": "informatica"},
    id=54,
)

document_55 = Document(
    content="""
Scenario: Your company is implementing a new ETL solution that involves handling sensitive customer data. You need to ensure compliance 
with relevant regulations (e.g., GDPR).
Problem: How would you address data security and compliance requirements in your ETL processes?
Response:
1. **Data Masking:** Implement data masking to protect sensitive information during development and testing.
2. **Encryption:** Use encryption for secure data storage and transmission.
3. **Access Controls:** Define strict access controls based on user roles and responsibilities.
4. **Audit Trails:** Maintain audit trails for all data operations, including ETL processes.
5. **Compliance Checks:** Regularly perform compliance checks and audits to ensure adherence to regulatory requirements.""",
    metadata={"source": "informatica"},
    id=55,
)

document_56 = Document(
    content="""
Scenario: Your company is planning an ETL process that will handle a large volume of customer transaction data (e.g., billions of 
records).
Problem: How would you approach handling such large volumes of data efficiently?
Response:
1. **Partitioning:** Partition data at the source system to reduce the amount of data processed during each run.
2. **Parallelism:** Increase parallel processing capabilities to handle the large volume of data more efficiently.
3. **Batch Processing:** Break the ETL process into smaller, manageable batches to avoid overwhelming the system.
4. **Resource Management:** Ensure that sufficient resources (e.g., memory, CPU) are allocated for handling large volumes of data.
5. **Performance Tuning:** Optimize mapping and workflow configurations to improve performance.""",
    metadata={"source": "informatica"},
    id=56,
)

document_57 = Document(
    content="""
Scenario: You need to ensure the quality of the data being loaded into your ETL process, particularly in terms of completeness and 
accuracy.
Problem: How would you implement robust data quality checks in your Informatica mappings?
Response:
1. **Data Profiling:** Use Informatica’s Data Quality tools to profile data and identify missing values, duplicates, and anomalies.
2. **Validation Rules:** Define validation rules within ETL mappings to enforce business logic (e.g., date ranges, numeric constraints).
3. **Error Handling:** Implement error handling mechanisms to capture and log invalid records for further review.
4. **Data Cleansing:** Use Informatica’s data cleansing functions to correct or remove invalid data before loading.
5. **Quality Reporting:** Generate reports summarizing the results of data quality checks for stakeholders.""",
    metadata={"source": "informatica"},
    id=57,
)

document_58 = Document(
    content="""
### Scenario 1: ETL Design for Customer Database

**Scenario:** You are tasked with designing an ETL process for a customer database that includes customer details, transactions, and 
products. The system needs to handle updates frequently and ensure data consistency across different systems.

**Problem:** How would you design the ETL architecture for this scenario?

**Response:**
1. **Source Definition:**
   - Identify and configure all source systems (e.g., Customer Database, Transaction Logs, Product Catalog).
2. **Dimension Tables Design:**
   - **Customer Dimension:** Create a TYPE2 dimension to handle historical data.
     - Steps:
       1. Define the schema for the customer dimension table.
       2. Implement a surrogate key for each customer record.
       3. Include columns for customer attributes (e.g., name, address, phone number).
       4. Add start and end dates to track changes over time.
   - **Product Dimension:** Similarly, create a TYPE2 dimension for product details with historical versioning.
     - Steps:
       1. Define the schema for the product dimension table.
       2. Implement a surrogate key for each product record.
       3. Include columns for product attributes (e.g., name, category, price).
       4. Add start and end dates to track changes over time.
3. **Fact Table Design:**
   - **Transaction Fact:** Design a fact table capturing transaction details, which will be linked to the customer and product 
dimensions.
     - Steps:
       1. Define the schema for the transaction fact table.
       2. Include foreign keys linking to the customer and product dimension tables.
       3. Add columns for transaction attributes (e.g., date, amount, quantity).
4. **ETL Process Flow:**
   - Extract data from sources using Informatica PowerCenter or Informatica Cloud Data Integration.
     - Steps:
       1. Set up source connectors for each system.
       2. Define extraction queries to fetch relevant data.
   - Transform data to match schema requirements of dimension tables (e.g., normalization, type conversion).
     - Steps:
       1. Use transformation components to clean and standardize data.
       2. Apply necessary transformations (e.g., date formatting, data type conversions).
   - Load transformed data into appropriate dimension and fact tables using TYPE2 incremental loads.
     - Steps:
       1. Configure the load process to handle historical changes.
       2. Update existing records with new information and mark old records as inactive.
5. **Error Handling and Logging:**
   - Implement robust error handling mechanisms, logging errors, and notifications for corrective actions.
     - Steps:
       1. Set up error handling components in mappings.
       2. Configure logging to capture detailed error messages.
       3. Send notifications (e.g., email alerts) when errors occur.
    """,
    metadata={"source": "informatica"},
    id=58,
)

document_59 = Document(
    content="""
### Scenario 2: Build and Test ETL Process

**Scenario:** You need to build an ETL process in Informatica PowerCenter to load data into a data warehouse from various sources. The 
task includes loading customer details, sales transactions, and product information.

**Problem:** How would you structure your Informatica project and execute the ETL process?

**Response:**
1. **Project Setup:**
   - Create an Informatica repository and set up a new project.
     - Steps:
       1. Log in to Informatica PowerCenter.
       2. Create a new repository if one does not exist.
       3. Set up a new project within the repository.
   - Define source systems, target tables, and dimensions in the repository.
     - Steps:
       1. Add source connectors for each data source (e.g., databases, files).
       2. Define target tables in the data warehouse.
       3. Create dimension tables as required.
2. **Mapping Design:**
   - Design mappings for extracting data from each source system (e.g., Customer Database, Transaction Logs).
     - Steps:
       1. Open the Informatica Designer and create a new mapping.
       2. Connect source connectors to target tables.
       3. Define transformation logic as needed.
   - Apply transformations as necessary (e.g., normalization, type conversion).
     - Steps:
       1. Use transformations such as Expression, Filter, and Sorter.
       2. Implement data cleansing and validation rules.
3. **Workflow Creation:**
   - Create a workflow to orchestrate the execution of multiple mappings.
     - Steps:
       1. Open Informatica Workflow Manager.
       2. Create a new workflow.
       3. Add tasks for each mapping.
   - Set dependencies and synchronization points to ensure proper order of operations.
     - Steps:
       1. Define task dependencies based on data flow.
       2. Use control objects like Decision, Sequence, etc., to manage execution order.
4. **Testing Strategy:**
   - Develop unit tests for each mapping to validate data extraction and transformation logic.
     - Steps:
       1. Use the Debugger in the PowerCenter Designer to step through each mapping.
       2. Run test cases with sample data.
   - Perform integration testing by running the entire workflow with sample data.
     - Steps:
       1. Execute the workflow with a subset of production data.
       2. Verify that all mappings run successfully and produce expected results.
5. **Validation and Deployment:**
   - Validate the ETL process using a subset of production data.
     - Steps:
       1. Test the entire workflow end-to-end.
       2. Ensure data integrity and consistency across target tables.
   - Deploy the validated workflow to production and monitor its performance.
     - Steps:
       1. Schedule the workflow in Informatica PowerCenter.
       2. Monitor execution logs for any issues.
    """,
    metadata={"source": "informatica"},
    id=59,
)

document_60 = Document(
    content="""
### Scenario 3: Troubleshoot Problems in Production

**Scenario:** You are monitoring your ETL processes and notice that the customer transaction load is significantly delayed. The logs 
indicate errors related to data quality issues.

**Problem:** What steps would you take to diagnose and resolve the issue?

**Response:**
1. **Review Logs:**
   - Analyze recent logs for specific error messages indicating data quality issues.
     - Steps:
       1. Access Informatica PowerCenter’s log files.
       2. Search for error codes and messages related to customer transactions.
2. **Data Profiling:**
   - Use Informatica Data Quality tools to profile customer and transaction data, identifying discrepancies or anomalies.
     - Steps:
       1. Open Informatica Data Quality tool.
       2. Load the relevant datasets (customer details, transactions).
       3. Run profiling tasks to identify issues like missing values, duplicates, etc.
3. **Re-run with Validation:**
   - Rerun the ETL process with enhanced validation logic to filter out invalid records.
     - Steps:
       1. Modify mappings to include additional validation rules.
       2. Execute the workflow again and monitor for errors.
4. **Correct Source Data:**
   - If the issue is due to incorrect source data, work with the source systems team to correct or update the data.
     - Steps:
       1. Identify the root cause of the data quality issues.
       2. Communicate with the source system team for corrections.
5. **Improve Error Handling:**
   - Enhance error handling in mappings and workflows to prevent similar issues in future runs.
     - Steps:
       1. Implement more robust error handling mechanisms.
       2. Configure notifications for recurring errors.
    """,
    metadata={"source": "informatica"},
    id=60,
)

document_61 = Document(
    content="""
### Scenario 4: Performance Optimization Techniques

**Scenario:** Your ETL process for loading transaction data from a large database is experiencing performance bottlenecks. The current 
solution is taking over an hour, and you need to optimize it.

**Problem:** How would you approach optimizing the performance of this ETL process?

**Response:**
1. **Partitioning:**
   - Identify key partition columns in source tables (e.g., date) and configure Informatica to extract only relevant partitions.
     - Steps:
       1. Analyze source data for suitable partition columns.
       2. Configure the Extract stage in mappings to use partitioning.
2. **Parallelism:**
   - Increase parallelism by adding more sessions or distributing data across multiple physical nodes.
     - Steps:
       1. Adjust the number of sessions in the workflow.
       2. Use distributed processing if available.
3. **Index Management:**
   - Ensure that appropriate indexes are created on source and target tables for faster query execution.
     - Steps:
       1. Create indexes on key columns in source databases.
       2. Optimize indexes on target tables to improve load performance.
4. **Incremental Loads:**
   - Use incremental (delta) loads, for example driven by a last-updated timestamp or change data capture, to avoid re-reading full tables during each run.
     - Steps:
       1. Configure mappings to extract only changed data.
       2. Update dimension and fact tables accordingly.
5. **Profiling and Tuning:**
   - Utilize Informatica’s profiling tools to identify bottlenecks and make necessary tuning adjustments.
     - Steps:
       1. Use Informatica Profiler to analyze performance metrics.
       2. Adjust configurations based on profiling results.
    """,
    metadata={"source": "informatica"},
    id=61,
)

document_62 = Document(
    content="""
### Scenario 5: Best Practices in ETL Design

**Scenario:** You are designing an ETL solution for a company with multiple departments, including finance, operations, and marketing. 
Each department has different data requirements and constraints.

**Problem:** What best practices would you follow to ensure the success of this multi-departmental ETL project?

**Response:**
1. **Standardization:**
   - Establish a common schema and data dictionary across all departments.
     - Steps:
       1. Define a standardized set of tables and columns.
       2. Document the schema for reference.
2. **Reusability:**
   - Design reusable components (e.g., libraries, transformation templates) to promote consistency.
     - Steps:
       1. Create reusable mappings and workflows.
       2. Use parameterization where applicable.
3. **Security:**
   - Ensure that access controls are in place for different departments based on their requirements.
     - Steps:
       1. Implement role-based access control.
       2. Restrict data access to authorized users only.
4. **Documentation:**
   - Document the ETL architecture, mappings, and workflows comprehensively.
     - Steps:
       1. Maintain detailed documentation for each component.
       2. Use diagrams and flowcharts where necessary.
5. **Governance:**
   - Establish a data governance framework to ensure data quality, lineage, and compliance.
     - Steps:
       1. Define policies and procedures for data management.
       2. Monitor and audit data usage regularly.
    """,
    metadata={"source": "informatica"},
    id=62,
)

document_63 = Document(
    content="""
### Scenario 6: Implementing Informatica MDM

**Scenario:** Your company is looking to implement an Informatica MDM solution for managing customer records across multiple systems.

**Problem:** How would you approach the implementation of an Informatica MDM system?

**Response:**
1. **Business Requirements Analysis:**
   - Gather requirements from business stakeholders, including data sources, master record definition, and governance rules.
     - Steps:
       1. Conduct meetings with departments to understand their needs.
       2. Document all functional and non-functional requirements.
2. **MDM Solution Design:**
   - Design a logical data model for master records, incorporating attributes relevant to customer management (e.g., identity, 
demographics).
     - Steps:
       1. Define the schema for master records.
       2. Identify key attributes to be included in each entity.
3. **ETL Mapping Development:**
   - Develop ETL mappings to extract, transform, and load data from source systems into the MDM repository.
     - Steps:
       1. Set up source connectors for each system.
       2. Create mappings to clean and standardize data.
       3. Load data into the MDM hub.
4. **Data Quality Management:**
   - Implement Informatica Data Quality tools to cleanse and standardize data in the MDM environment.
     - Steps:
       1. Use Data Quality components to identify and correct errors.
       2. Apply validation rules to ensure data integrity.
5. **User Training and Support:**
   - Provide training for end-users on how to use the MDM solution effectively.
     - Steps:
       1. Develop training materials and conduct sessions.
       2. Offer ongoing support and assistance.
    """,
    metadata={"source": "informatica"},
    id=63,
)

document_64 = Document(
    content="""
### Scenario 7: Performance Monitoring and Tuning

**Scenario:** You have an ETL process that is running well but needs to be optimized further to reduce execution time.

**Problem:** How would you set up performance monitoring and tuning in Informatica?

**Response:**
1. **Performance Metrics:**
   - Define key performance indicators (KPIs) such as data load times, error rates, and resource utilization.
     - Steps:
       1. Identify critical metrics to monitor.
       2. Set thresholds for acceptable performance levels.
2. **Monitoring Tools:**
   - Utilize Informatica’s built-in monitoring tools to track performance metrics in real-time.
     - Steps:
       1. Access the Informatica PowerCenter Monitor.
       2. Configure dashboards to display relevant metrics.
3. **Tuning Strategies:**
   - Identify bottlenecks based on monitoring data and apply tuning strategies (e.g., parallelism, partitioning).
     - Steps:
       1. Analyze performance reports for slow operations.
       2. Adjust configurations to optimize processing.
4. **Performance Reports:**
   - Generate regular reports to measure the effectiveness of performance improvements.
     - Steps:
       1. Schedule periodic report generation.
       2. Review reports and make necessary adjustments.
    """,
    metadata={"source": "informatica"},
    id=64,
)

document_65 = Document(
    content="""
### Scenario 8: Data Security and Compliance

**Scenario:** Your company is implementing a new ETL solution that involves handling sensitive customer data. You need to ensure 
compliance with relevant regulations (e.g., GDPR).

**Problem:** How would you address data security and compliance requirements in your ETL processes?

**Response:**
1. **Data Masking:**
   - Implement data masking to protect sensitive information during development and testing.
     - Steps:
       1. Use Informatica’s Data Masking tool.
       2. Define masking rules for sensitive fields.
2. **Encryption:**
   - Use encryption for secure data storage and transmission.
     - Steps:
       1. Encrypt data at rest using database features or file systems.
       2. Implement SSL/TLS for data in transit.
3. **Access Controls:**
   - Define strict access controls based on user roles and responsibilities.
     - Steps:
       1. Set up role-based access control (RBAC).
       2. Grant permissions only to authorized users.
4. **Audit Trails:**
   - Maintain audit trails for all data operations, including ETL processes.
     - Steps:
       1. Enable auditing in Informatica PowerCenter.
       2. Store logs securely for compliance purposes.
5. **Compliance Checks:**
   - Regularly perform compliance checks and audits to ensure adherence to regulatory requirements.
     - Steps:
       1. Conduct regular internal audits.
       2. Engage external auditors as needed.
    """,
    metadata={"source": "informatica"},
    id=65,
)

document_66 = Document(
    content="""
### Scenario 9: Handling Large Volumes of Data

**Scenario:** Your company is planning an ETL process that will handle a large volume of customer transaction data (e.g., billions of 
records).

**Problem:** How would you approach handling such large volumes of data efficiently?

**Response:**
1. **Partitioning:**
   - Partition data at the source system to reduce the amount of data processed during each run.
     - Steps:
       1. Identify suitable partition columns (e.g., date).
       2. Configure the Extract stage in mappings to use partitioning.
2. **Parallelism:**
   - Increase parallel processing capabilities by adding more sessions or distributing data across multiple physical nodes.
     - Steps:
       1. Adjust the number of sessions in the workflow.
       2. Use distributed processing if available.
3. **Batch Processing:**
   - Break the ETL process into smaller, manageable batches to avoid overwhelming the system.
     - Steps:
       1. Define batch sizes based on system capacity.
       2. Schedule workflows to run in batches.
4. **Resource Management:**
   - Ensure that sufficient resources (e.g., memory, CPU) are allocated for handling large volumes of data.
     - Steps:
       1. Monitor resource usage during ETL runs.
       2. Allocate additional resources as needed.
5. **Performance Tuning:**
   - Optimize mapping and workflow configurations to improve performance.
     - Steps:
       1. Analyze and optimize SQL queries.
       2. Use Informatica’s tuning tools for further optimization.
    """,
    metadata={"source": "informatica"},
    id=66,
)

document_67 = Document(
    content="""
### Scenario 10: Implementing Data Quality Checks

**Scenario:** You need to ensure the quality of the data being loaded into your ETL process, particularly in terms of completeness and 
accuracy.

**Problem:** How would you implement robust data quality checks in your Informatica mappings?

**Response:**
1. **Data Profiling:**
   - Use Informatica Data Quality tools to profile data and identify discrepancies or anomalies.
     - Steps:
       1. Open the Data Quality tool.
       2. Load datasets for profiling.
       3. Run profiling tasks to detect issues.
2. **Validation Rules:**
   - Define validation rules within ETL mappings to enforce business logic (e.g., date ranges, numeric constraints).
     - Steps:
       1. Use the Expression Editor or Validation components in mappings.
       2. Implement checks for data integrity and consistency.
3. **Error Handling:**
   - Implement error handling mechanisms to capture and log invalid records.
     - Steps:
       1. Set up error handling components in mappings.
       2. Configure logging to capture detailed error messages.
4. **Data Cleansing:**
   - Use Informatica’s data cleansing functions to correct or remove invalid data before loading.
     - Steps:
       1. Apply cleansing rules using the Cleansing component.
       2. Validate and verify cleansed data.
5. **Quality Reporting:**
   - Generate reports summarizing the results of data quality checks for stakeholders.
     - Steps:
       1. Use Informatica’s reporting tools to create dashboards.
       2. Schedule regular report generation.
       3. Review reports and take corrective actions as needed.
    """,
    metadata={"source": "informatica"},
    id=67,
)
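With this many hand-numbered documents, a duplicated or skipped `id` is an easy copy-paste slip. The following is a minimal sketch of a sanity check, not part of the original notebook. The `Document` dataclass here is a hypothetical stand-in for the notebook's own class, `check_ids` is an illustrative helper name, and the assumption that ids run from 1 to the number of documents is inferred from the variable names.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the notebook's Document class."""
    content: str
    metadata: dict = field(default_factory=dict)
    id: int = 0


def check_ids(docs):
    """Return (duplicate_ids, missing_ids), assuming ids should run 1..len(docs)."""
    seen = set()
    duplicates = set()
    for doc in docs:
        if doc.id in seen:
            duplicates.add(doc.id)
        seen.add(doc.id)
    missing = set(range(1, len(docs) + 1)) - seen
    return sorted(duplicates), sorted(missing)


# Toy data: the third document reuses id 2, so id 3 is never assigned.
sample = [
    Document(content="a", metadata={"source": "informatica"}, id=1),
    Document(content="b", metadata={"source": "informatica"}, id=2),
    Document(content="c", metadata={"source": "database"}, id=2),
]
duplicates, missing = check_ids(sample)
print("duplicate ids:", duplicates)
print("missing ids:", missing)
```

On the toy data this reports id 2 as duplicated and id 3 as missing; the same helper could be run against the real list once it has been built.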

In [10]:
documents = [
    document_1, document_2, document_3, document_4, document_5,
    document_6, document_7, document_8, document_9, document_10,
    document_11, document_12, document_13, document_14, document_15,
    document_16, document_17, document_18, document_19, document_20,
    document_21, document_22, document_23, document_24, document_25,
    document_26, document_27, document_28, document_29, document_30,
    document_31, document_32, document_33, document_34, document_35,
    document_36, document_37, document_38, document_39, document_40,
    document_41, document_42, document_43, document_44, document_45,
    document_46, document_47, document_48, document_49, document_50,
    document_51, document_52, document_53, document_54, document_55,
    document_56, document_57, document_58, document_59, document_60,
    document_61, document_62, document_63, document_64, document_65,
    document_66, document_67,
]
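The search step ranks documents by how close their vectors are to the query vector. The sketch below illustrates one common way to do that, cosine similarity followed by a top-k cut. It is not the `SemanticSearch` implementation, which is defined elsewhere in the notebook. Random vectors stand in for real embeddings, the 67 x 384 dimensions are chosen only to resemble a corpus of this size, and `cosine_similarity` and `top_k_indices` are illustrative names. Plain Python is used to keep the example dependency-free; a real implementation would more likely use NumPy.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(doc_vectors, query_vector, top_k=5):
    """Return (row_index, score) pairs for the top_k most similar rows, best first."""
    scored = [
        (cosine_similarity(vec, query_vector), idx)
        for idx, vec in enumerate(doc_vectors)
    ]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


rng = random.Random(0)
# Stand-in "embeddings": 67 random vectors of dimension 384.
doc_vectors = [[rng.gauss(0.0, 1.0) for _ in range(384)] for _ in range(67)]
# The query is a slightly noisy copy of row 59, so that row should rank first.
query_vector = [x + rng.gauss(0.0, 0.01) for x in doc_vectors[59]]

results = top_k_indices(doc_vectors, query_vector, top_k=5)
for idx, score in results:
    print(f"row {idx}: cosine similarity {score:.3f}")
```

Because the query is built from row 59, that row comes out on top with a score close to 1, while unrelated random vectors score near 0. With real embeddings the gaps between scores are usually much smaller.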

In [11]:
semantic_search = SemanticSearch(documents)
semantic_search.load_documents()

query = "Troubleshooting an ETL Job"
top_5_docs = semantic_search.search(query, top_k=5)

for doc in top_5_docs:
    print(doc)

Document vectors shape: (67, 384)
Query vector shape: (1, 384)

### Scenario 3: Troubleshoot Problems in Production

**Scenario:** You are monitoring your ETL processes and notice that the customer transaction load is significantly delayed. The logs 
indicate errors related to data quality issues.

**Problem:** What steps would you take to diagnose and resolve the issue?

**Response:**
1. **Review Logs:**
   - Analyze recent logs for specific error messages indicating data quality issues.
     - Steps:
       1. Access Informatica PowerCenter’s log files.
       2. Search for error codes and messages related to customer transactions.
2. **Data Profiling:**
   - Use Informatica Data Quality tools to profile customer and transaction data, identifying discrepancies or anomalies.
     - Steps:
       1. Open Informatica Data Quality tool.
       2. Load the relevant datasets (customer details, transactions).
       3. Run profiling tasks to identify issues like missing values, duplicates, 