# Generative AI: AI-Powered Data Extraction and Content Retrieval with GenAI Stack from AI_Planet_Hub

![Project Image](genaistack.png)

Link to the data used: https://github.com/DataTalksClub/machine-learning-zoomcamp

# Install Required Packages

In [None]:
!pip install -q -U git+https://github.com/aiplanethub/genai-stack.git
!pip install -q -U langchain


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


# Set Up OpenAI API Key
To use the OpenAI API in this project, you need an API key. The following snippet prompts for the key without echoing it to the screen and stores it in the `OPENAI_API_KEY` environment variable.

In [None]:
import os
from getpass import getpass

api_key = getpass("Enter OpenAI API Key:")
os.environ['OPENAI_API_KEY'] = api_key

Enter OpenAI API Key:··········
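
When the notebook is re-run, it can be convenient to reuse a key that is already set instead of prompting again. The following is a minimal sketch of that idea; the helper name `get_api_key` and its parameters are hypothetical and not part of GenAI Stack or the OpenAI client.

```python
import os
from getpass import getpass


def get_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the key from `env` if present, otherwise prompt once and cache it.

    `get_api_key` is a hypothetical helper written for this notebook.
    """
    key = env.get(name)
    if not key:
        # Only ask interactively when the variable is missing or empty.
        key = prompt(f"Enter {name}: ")
        env[name] = key
    return key
```

Calling `api_key = get_api_key()` would then behave like the cell above on the first run and skip the prompt on later runs in the same session.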


### GenAI Stack Modules

The code below imports the GenAI Stack components used in this notebook. Each one covers a stage of the pipeline: `Stack` ties the components together, `LangchainETL` loads the source data, `LangchainEmbedding` turns text into vectors, `ChromaDB` stores those vectors, `PromptEngine` builds prompts, `OpenAIGpt35Model` wraps the language model, `LangChainRetriever` answers queries, and `ConversationBufferMemory` keeps the conversation history.


In [None]:
from genai_stack.stack.stack import Stack
from genai_stack.etl.langchain import LangchainETL
from genai_stack.embedding.langchain import LangchainEmbedding
from genai_stack.vectordb.chromadb import ChromaDB
from genai_stack.prompt_engine.engine import PromptEngine
from genai_stack.model.gpt3_5 import OpenAIGpt35Model
from genai_stack.retriever.langchain import LangChainRetriever
from genai_stack.memory.langchain import ConversationBufferMemory

### Data Extraction and Transformation (ETL)

The code sets up an ETL (Extract, Transform, Load) step using the `LangchainETL` module from the GenAI Stack. Passing `name="WebBaseLoader"` selects the LangChain loader that fetches web pages, and the `web_path` field gives it the URLs listed in the `websites` variable. This step prepares the web content for the embedding and retrieval stages that follow.


In [None]:
# Create a list of websites for ETL
websites = [
    "https://github.com/DataTalksClub/machine-learning-zoomcamp"
    ]

etl = LangchainETL.from_kwargs(name="WebBaseLoader",
                               fields={"web_path": websites
                                       }
                               )
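
To give a feel for the "extract" part of this step, here is a rough sketch that turns an HTML string into plain text using only the standard library. It is an illustration under simplifying assumptions, not the implementation of LangChain's `WebBaseLoader` (which relies on BeautifulSoup and also handles downloading); the names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# A literal page is used here so the sketch runs without network access.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>ML Zoomcamp</h1><p>Free course.</p></body></html>"
)
print(html_to_text(sample))
```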

### Embedding Configuration

The code defines the embedding configuration in the `config` dictionary. `model_name` is set to `sentence-transformers/all-mpnet-base-v2`, `model_kwargs` sets `device` to `cpu`, and `encode_kwargs` sets `normalize_embeddings` to `False`, so the vectors are not rescaled to unit length. The configuration is passed to `LangchainEmbedding.from_kwargs` with `name="HuggingFaceEmbeddings"`, which selects the embedding class to use. The resulting embeddings are the numeric vectors that the vector database compares when searching for relevant text.


In [None]:
config = {
    "model_name": "sentence-transformers/all-mpnet-base-v2",
    "model_kwargs": {"device": "cpu"},
    "encode_kwargs": {"normalize_embeddings": False},
    }

embedding = LangchainEmbedding.from_kwargs(name="HuggingFaceEmbeddings", fields=config)
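
The effect of `normalize_embeddings` can be shown with a small standalone example. The sketch below uses plain Python lists in place of real embeddings and is only meant to illustrate the idea: once vectors are scaled to unit length, their dot product equals their cosine similarity. The helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [6.0, 8.0]

# The raw dot product grows with vector length.
print(dot(a, b))

# After normalization, the dot product matches the cosine similarity.
print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

With `normalize_embeddings` left at `False`, as in the configuration above, any rescaling is left to the similarity measure used downstream.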

### ChromaDB Initialization

The code snippet initializes ChromaDB, the vector database that stores the embedded text. It calls `ChromaDB.from_kwargs()` with no arguments, so the database is created with its default settings. A vector database stores embeddings alongside the original text and can return the entries most similar to a query vector.


In [None]:
chromadb = ChromaDB.from_kwargs()
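
As a conceptual aid, the following sketch shows the kind of operation a vector database performs: store vectors with their text, then return the texts closest to a query vector. It is a toy in-memory stand-in written for this notebook, not ChromaDB's API or implementation, and `ToyVectorStore` is a hypothetical name.

```python
import math


class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        """Store one vector together with the text it represents."""
        self._items.append((vector, text))

    def query(self, vector, k=1):
        """Return the k stored texts most similar to `vector` by cosine similarity."""

        def cosine(a, b):
            num = sum(x * y for x, y in zip(a, b))
            den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return num / den if den else 0.0

        ranked = sorted(
            self._items, key=lambda item: cosine(vector, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "syllabus")
store.add([0.0, 1.0], "certificate requirements")
print(store.query([0.1, 0.9], k=1))
```

A real vector database adds persistence and indexing so that the search does not have to compare the query against every stored vector.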

### OpenAI GPT-3.5 Model Initialization

The code initializes an instance of the OpenAI GPT-3.5 model wrapper. It calls `OpenAIGpt35Model.from_kwargs()` and passes the OpenAI API key in `parameters`, so the model can authenticate its requests to the OpenAI API. GPT-3.5 is the language model that generates the answers in this notebook.


In [None]:
llm = OpenAIGpt35Model.from_kwargs(parameters={"openai_api_key": api_key})

### Creating an AI Stack

This code block is responsible for creating a comprehensive AI stack. It assembles various components required for natural language understanding, processing, and generation tasks. The stack comprises the following elements:
- `etl`: Data extraction, transformation, and loading (ETL) components.
- `embedding`: Language embeddings for understanding and encoding text.
- `vectordb`: A vector database (ChromaDB) for efficient storage and retrieval of vectorized data.
- `model`: The language model (GPT-3.5) by OpenAI, used for generating human-like text.
- `prompt_engine`: An engine that assists in formulating appropriate prompts for the model.
- `retriever`: The component that takes a query and returns an answer; it is called as `retriever.retrieve(prompt)` in the evaluation section.
- `memory`: A memory buffer for storing and managing conversations or interactions.

Together, these components form an AI stack suitable for various language-related tasks, from data processing to conversational AI.


In [None]:
retriever = LangChainRetriever.from_kwargs()

Stack(
    etl=etl,
    embedding=embedding,
    vectordb=chromadb,
    model=llm,
    prompt_engine=PromptEngine.from_kwargs(should_validate=False),
    retriever=retriever,
    memory=ConversationBufferMemory.from_kwargs(),
    )

<genai_stack.stack.stack.Stack at 0x79649658f220>
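
The roles listed above can be pictured as one retrieve-then-generate turn. The sketch below is a conceptual outline only: it takes plain callables in place of the real components and does not reflect how GenAI Stack wires them internally. The function `answer` and its parameters `search`, `generate`, and `history` are hypothetical.

```python
def answer(question, search, generate, history, k=2):
    """Run one retrieve-then-generate turn using caller-supplied callables.

    search(question, k) -> list of context strings (vectordb / retriever role)
    generate(prompt)    -> reply string              (model role)
    history             -> list of (question, reply) (memory role)
    """
    context = search(question, k)
    past = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)

    # Prompt assembly stands in for the prompt_engine role.
    prompt = (
        "Use the context to answer.\n"
        f"Context: {' '.join(context)}\n"
        f"History:\n{past}\n"
        f"Question: {question}"
    )

    reply = generate(prompt)
    history.append((question, reply))  # remember this turn for the next one
    return reply
```

Passing stub functions for `search` and `generate` is enough to try the flow without an API key.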

In [None]:
etl.run()

# Evaluation

The provided code segment performs the following actions:

- It defines a list of prompts, which are questions or queries related to "ML Zoomcamp."
- It iterates through each prompt and retrieves answers to these prompts using the retriever object.
- For each prompt, it prints the prompt itself and the retrieved answer to the console, separating them with "PROMPT" and "ANSWER" labels.

In [None]:

prompts = [
    "Could you provide an overview of ML Zoomcamp? What is its primary focus, and what can participants expect to learn from the program?",
    "What are the key concepts and topics covered in ML Zoomcamp's curriculum?",
    "To obtain a certificate from ML Zoomcamp, what are the specific requirements that participants need to fulfill?"
]

for prompt in prompts:
    response = retriever.retrieve(prompt)
    print("PROMPT:", prompt)
    print("ANSWER:", response['output'])
    print("\n")

PROMPT: Could you provide an overview of ML Zoomcamp? What is its primary focus, and what can participants expect to learn from the program?
ANSWER: ML Zoomcamp is a free online program offered by DataTalksClub that focuses on teaching participants about machine learning engineering. The program is designed to be completed in four months and covers various topics related to machine learning. Participants can expect to learn about the fundamentals of machine learning, regression and classification techniques, evaluation metrics, deploying machine learning models, decision trees and ensemble learning, neural networks and deep learning, serverless deep learning, and Kubernetes and TensorFlow serving. The program also includes hands-on projects and homework assignments to reinforce the learning. By the end of the program, participants will have gained practical skills in machine learning engineering and be able to apply their knowledge to real-world projects.


PROMPT: What are the key con
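
Because each prompt triggers a network call, one failed request would stop the loop above. A possible variation, sketched here under the assumption that the response is a dictionary with an `output` key as in the cell above, collects results and errors instead of printing directly. The helper `ask_all` is hypothetical.

```python
def ask_all(retrieve, prompts):
    """Run each prompt and collect (prompt, answer_or_None, error_or_None) tuples."""
    results = []
    for prompt in prompts:
        try:
            response = retrieve(prompt)
            results.append((prompt, response["output"], None))
        except Exception as exc:
            # Record the failure and continue with the remaining prompts.
            results.append((prompt, None, str(exc)))
    return results
```

In this notebook it could be called as `ask_all(retriever.retrieve, prompts)`, and the returned list can then be printed or saved for later comparison.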

# Conclusion

This project showed how the GenAI Stack can be used for data extraction and question answering. By combining an ETL loader, an embedding model, a vector database, and a language model, it provided a single workflow for retrieving content and generating answers to natural language queries. The GenAI Stack's versatility and adaptability make it a useful tool for AI-driven data processing and content generation.

In the context of the ML Zoomcamp curriculum, the project effectively retrieved concise and informative answers to a set of specific questions. The model demonstrated an understanding of the program's overview, key concepts and topics covered, and the requirements for obtaining a certificate from ML Zoomcamp. This highlights the potential of using NLP models for educational and informational purposes, enabling users to quickly access relevant information.

............................................................................

Follow me on Twitter 🐦, connect with me on LinkedIn 🔗, and check out my GitHub 🐙. You won't be disappointed!

👉 Twitter: https://twitter.com/NdiranguMuturi1  
👉 LinkedIn: https://www.linkedin.com/in/isaac-muturi-3b6b2b237  
👉 GitHub: https://github.com/Isaac-Ndirangu-Muturi-749  