**Step 1: Install Required Libraries**

In [9]:
!pip install langchain langchain-community langchain-core chromadb groq pypdf sentence-transformers



**Step 2: Import Required Libraries**

In [10]:
import os
import json
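# Loaders, vector stores, and embeddings live in langchain-community (installed in Step 1)
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings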
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from groq import Groq

**Step 3: Extract Text from PDF**

In [11]:
# Suppress pypdf warnings
import logging
logging.getLogger("pypdf").setLevel(logging.ERROR)

# Define the dataset directory
dataset_path = "/content/drive/MyDrive/Dataset"

# Get a list of all PDF files in the directory
pdf_files = [os.path.join(dataset_path, f) for f in os.listdir(dataset_path) if f.endswith(".pdf")]

# Load all PDFs and combine their documents
documents = []
for pdf_file in pdf_files:
    loader = PyPDFLoader(pdf_file)
    documents.extend(loader.load())

print(f"Extracted {len(documents)} pages from {len(pdf_files)} PDFs.")

Extracted 322 pages from 8 PDFs.
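
The `f.endswith(".pdf")` filter above is case-sensitive, so a file named `report.PDF` would be skipped. As a minimal sketch, assuming that case-insensitive matching and a stable ordering are wanted, the listing could be written with `pathlib`. The helper name `list_pdfs` and the temporary files are hypothetical and exist only for this demonstration.

```python
import tempfile
from pathlib import Path

def list_pdfs(directory):
    """Return the PDF paths in a directory, sorted, ignoring extension case."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demonstration on a throwaway directory (not the notebook's dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["b.pdf", "A.PDF", "notes.txt"]:
        (Path(tmp_dir) / name).touch()
    found = [Path(p).name for p in list_pdfs(tmp_dir)]

print(found)
```

Sorting the paths keeps the order of `documents` reproducible between runs, which `os.listdir` alone does not guarantee.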


**Step 4: Preprocess the Data**

In [12]:
# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

print(f"Split the documents into {len(texts)} chunks.")

Split the documents into 345 chunks.
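
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over a string. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph and line boundaries). The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.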


**Step 5: Embed the Data Using Chroma DB**

In [13]:
# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create a vector store from the documents and persist it to disk
vectorstore_path = "vectorstore_db"  # Directory to save the vector store
vectorstore = Chroma.from_documents(texts, embeddings, persist_directory=vectorstore_path)
vectorstore.persist()  # Save the vector store to disk

print(f"Documents embedded and stored in Chroma DB at '{vectorstore_path}'.")

Documents embedded and stored in Chroma DB at 'vectorstore_db'.
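
Conceptually, the vector store ranks stored chunks by how close their embedding vectors are to the query's embedding. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, labels, and helper names are made up for illustration, and Chroma's actual distance metric and index are configurable and may differ from this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the k labels whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Toy "embeddings" (real ones from all-MiniLM-L6-v2 have many more dimensions).
toy_vectors = {
    "clustering": [0.9, 0.1, 0.0],
    "regression": [0.1, 0.9, 0.0],
    "missing_data": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
nearest = top_k(query, toy_vectors, k=2)
print(nearest)
```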


**Step 6: Set Up the Groq API and LLM**

In [14]:
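# Set up the Groq API. Do not hard-code the key in the notebook:
# read it from the environment, or prompt for it if it is not set.
from getpass import getpass

if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])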

print("Groq API initialized.")

Groq API initialized.


**Step 7: Create Memory Buffer for Chat History**

In [15]:
# Initialize memory for chat history
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

print("Memory buffer for chat history created.")

Memory buffer for chat history created.
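
With `return_messages=True`, the buffer hands back a list of message objects rather than a string, so interpolating that list directly into an f-string prompt (as Step 8 does) inserts the objects' Python representation. The sketch below shows one way to render a history as a plain transcript first. It uses `(role, content)` tuples as stand-ins for message objects; the helper name `render_history` and the sample messages are hypothetical.

```python
def render_history(messages):
    """Turn (role, content) pairs into a plain-text transcript, one turn per line."""
    return "\n".join(f"{role.capitalize()}: {content}" for role, content in messages)

# Placeholder turns; a real history would come from the memory buffer.
toy_history = [
    ("human", "what are the steps for K means clustering?"),
    ("ai", "(model answer goes here)"),
]
transcript = render_history(toy_history)
print(transcript)
```

For the real buffer, the pairs could be built as `[(m.type, m.content) for m in chat_history]`, using the same `type` and `content` attributes that Step 10 reads.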


**Step 8: Define Functions for Querying Groq and Performing RAG**

In [16]:
# Function to query the Groq API
def query_groq(prompt):
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Function to perform RAG
def perform_rag(question, vectorstore, memory):
    # Retrieve relevant documents from the vector store
    relevant_docs = vectorstore.similarity_search(question, k=3)
    context = "\n".join([doc.page_content for doc in relevant_docs])

    # Add chat history to the prompt
    chat_history = memory.load_memory_variables({})["chat_history"]
    prompt = f"Context:\n{context}\n\nChat History:\n{chat_history}\n\nQuestion: {question}"

    # Query the Groq API
    answer = query_groq(prompt)

    # Update memory
    memory.save_context({"input": question}, {"output": answer})

    return answer

# Function to format the answer
def format_answer(answer):
    """
    Formats the answer into a user-friendly structure (e.g., bullet points or numbered lists).
    """
    if "step" in answer.lower():  # also matches "steps"
        # Split the answer into lines and format as a numbered list
        lines = answer.split("\n")
        formatted_answer = "\n".join([f"{i+1}. {line.strip()}" for i, line in enumerate(lines) if line.strip()])
        return formatted_answer
    else:
        # Return the answer as is if no steps are detected
        return answer
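
Two quirks of `format_answer` show up in the Step 9 output. When the model already numbers its lines, they get a second marker (for example `3. 1. ...`), and because the blank lines are filtered out only after `enumerate` has counted them, the added numbers skip values. A minimal alternative sketch, assuming the desired behavior is to leave already-numbered answers alone, follows; `format_steps` and `NUMBERED_LINE` are hypothetical names, not part of the notebook.

```python
import re

# A line that already starts with a marker such as "1." or "2)".
NUMBERED_LINE = re.compile(r"^\s*\d+[.)]\s+")

def format_steps(answer):
    """Number non-empty lines consecutively unless the answer is already numbered."""
    lines = [line.strip() for line in answer.split("\n") if line.strip()]
    if any(NUMBERED_LINE.match(line) for line in lines):
        return answer  # keep the model's own numbering untouched
    return "\n".join(f"{i}. {line}" for i, line in enumerate(lines, start=1))

already_numbered = "Steps:\n\n1. Pick K\n2. Assign points"
plain = "Pick K\n\nAssign points"
print(format_steps(already_numbered))
print(format_steps(plain))
```

Filtering blank lines before calling `enumerate` is what keeps the numbering consecutive.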

**Step 9: Ask a Question and Get a Well-Formatted Answer**

In [17]:
# Define the question
question = "what are the steps for K means clustering?"

# Get the answer
answer = perform_rag(question, vectorstore, memory)

# Print the formatted answer
print("Question:", question)
print("\nAnswer:")
print(format_answer(answer))

Question: what are the steps for K means clustering?

Answer:
1. Based on the provided context, here are the steps for K-means clustering:
3. 1. **Specify the number of clusters (K)**: You need to decide how many clusters you want to form.
4. 2. **Choose initial centroids**: The initial centroids are often chosen randomly, but you can also use other methods such as hierarchical clustering or manually specify them.
5. 3. **Assign each point to a cluster**: Each point is assigned to the cluster with the closest centroid.
6. 4. **Calculate new centroids**: The centroid is typically the mean of the points in the cluster.
7. 5. **Update clusters**: If the cluster membership has changed, update the centroids and assign each point to the new cluster with the closest centroid.
8. 6. **Repeat steps 4 and 5**: Continue updating clusters and centroids until convergence or until a stopping condition is met (e.g., until relatively few points change clusters).
10. It's worth noting that the basic al

In [18]:
# Define the question
question = "What are the Major Tasks in Data Pre-processing"

# Get the answer
answer = perform_rag(question, vectorstore, memory)

# Print the formatted answer
print("Question:", question)
print("\nAnswer:")
print(format_answer(answer))

Question: What are the Major Tasks in Data Pre-processing

Answer:
According to the provided context, the Major Tasks in Data Pre-processing are:

1. **Data Cleaning**: This involves removing or correcting any errors or inconsistencies in the data.
2. **Data Integration**: This task involves combining data from multiple sources into a single, consistent dataset.
3. **Data Reduction**: This is the process of reducing the complexity or size of the data, often to make it more manageable or to reduce noise.
4. **Data Transformation and Data Discretization**: This includes converting data into a more suitable format for analysis by performing mathematical transformations such as normalization or binning.

These Major Tasks form a crucial part of the Data Pre-processing process, which helps to transform raw, unstructured data into a format that can be used for meaningful insights and analysis.


In [19]:
# Define the question
question = "How to Handle Missing Data?"

# Get the answer
answer = perform_rag(question, vectorstore, memory)

# Print the formatted answer
print("Question:", question)
print("\nAnswer:")
print(format_answer(answer))

Question: How to Handle Missing Data?

Answer:
Based on the provided context, here are several strategies to deal with missing data:

1. **Data Imputation**: 
   - Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the respective attribute.
   - Regression Imputation: Use a regression model to estimate missing values based on the relationships observed in the rest of the data.
   - Attribute mean: Replace missing values with the mean of the respective attribute for all samples.
   - Attribute mean for all samples belonging to the same class: Replace missing values with the mean of the respective attribute for all samples within the same class.

2. **Deletion Strategies**: 
   - Listwise Deletion/Dropping Rows: Remove rows with missing values.
   - Column Deletion: Remove entire columns with missing values.

3. **Interpolation**: 
   - Time Series Interpolation (for time-series data): Estimate missing values based on the values before and after them.



In [21]:
# Define the question
question = "in eda what is SKEWNESS? and what its equation"

# Get the answer
answer = perform_rag(question, vectorstore, memory)

# Print the formatted answer
print("Question:", question)
print("\nAnswer:")
print(format_answer(answer))

Question: in eda what is SKEWNESS? and what its equation

Answer:
**What is Skewness?**

According to the provided context, skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In simpler terms, it tells you the amount and direction of skew (departure from horizontal symmetry) in the data distribution.

**Skewness Equation:**

There are three common measures of skewness:

1. **Pearson's First Skewness Coefficient**:
   The formula is: 
   1/n * ∑[(x_i - μ) / σ]^3
   
   where
   - n: sample size
   - x_i: individual data point
   - μ: sample mean
   - σ: sample standard deviation

2. **Fisher-Pearson Skewness Coefficient**: This is the most widely used measure of skewness.
   Its formula is the same as the Pearson's First Skewness Coefficient formula.

3. **Bowley Skewness**: 
   The formula is the difference between the distance of the 75th percentile from the mean and the distance of the 25th percentile from the mean.

The equati

In [22]:
# Define the question
question = "what is the Correlation Analysis (Nominal Data) equation ?"
# Get the answer
answer = perform_rag(question, vectorstore, memory)
# Print the formatted answer
print("Question:", question)
print("\nAnswer:")
print(format_answer(answer))

Question: what is the Correlation Analysis (Nominal Data) equation ?

Answer:
1. According to the provided context, the Correlation Analysis for Nominal Data is not a straightforward equation like the Pearson's Coefficient or other statistical measures.
3. Instead, it involves calculating the chi-square statistic (Χ²) and using a contingency table to analyze the relationship between two nominal attributes.
5. Here's a brief summary of the steps involved in Nominal Data Correlation Analysis:
7. 1. **Contingency Table**: Create a table to record the frequencies of different attribute combinations (Oij).
8. 2. **χ² Calculation**: Calculate the chi-square statistic using the formula:
10. χ² = Σ [(observed frequency - expected frequency)^2 / expected frequency]
12. where expected frequency = (row total × column total) / total number of observations.
14. 3. **Degrees of Freedom**: Determine the number of degrees of freedom (df) associated with the chi-square test, which is equal to (i-1) × (

In [23]:
# Ask a follow-up question to test memory
follow_up_question = "Can you explain the linear regression more detail?"
follow_up_answer = perform_rag(follow_up_question, vectorstore, memory)

# Print the follow-up question and answer
print("\nFollow-up Question:", follow_up_question)
print("\nFollow-up Answer:")
print(format_answer(follow_up_answer))



Follow-up Question: Can you explain the linear regression more detail?

Follow-up Answer:
Based on the provided context, here's a detailed explanation of linear regression:

**What is Linear Regression?**

Linear regression is a statistical method that models the relationship between a dependent variable (y) and one or more independent variables (x) using a linear equation. The goal is to create a model that can predict the value of y for new, unseen data.

**What is Linear in Linear Regression?**

Linear in linear regression refers to the fact that the model uses a linear equation to describe the relationship between x and y. In other words, the model assumes that the slope of the relationship between x and y is constant, and that the relationship can be represented by a straight line. This does not mean that the data must be a straight line; rather, it means that the model uses a linear equation to describe the relationship.

**The Simplest Case: Simple Linear Regression**

Simple l

In [24]:
# Ask a follow-up question to test memory
follow_up_question = "generate 5 questions about multiple linear regression?"
follow_up_answer = perform_rag(follow_up_question, vectorstore, memory)

# Print the follow-up question and answer
print("\nFollow-up Question:", follow_up_question)
print("\nFollow-up Answer:")
print(format_answer(follow_up_answer))



Follow-up Question: generate 5 questions about multiple linear regression?

Follow-up Answer:
Here are five questions about multiple linear regression:

1. What is the term for the change in the dependent variable (y) when there is a one-unit change in an independent variable (x), while holding all other independent variables constant in a multiple linear regression model?

2. What is the value in a multiple linear regression model that predicts where the regression plane crosses the Y-axis?

3. Suppose we are fitting a regression model to a dataset with two independent variables (x1 and x2) and one dependent variable (y). Which of the following is true about the slope (β2) for variable x2?

   a) It predicts the change in y per unit x2 holding x1 and x2 constant.
   b) It predicts the change in y per unit x2 holding x1 constant.
   c) It predicts the change in x1 per unit x2 holding x2 constant.

4. In a multiple linear regression model, how can the relationship between the dependent

**Step 10: Save and Display Chat History**

In [25]:
# Save chat history to an external file
chat_history_path = "chat_history.json"
with open(chat_history_path, "w") as f:
    json.dump([{"type": message.type, "content": message.content} for message in memory.chat_memory.messages], f, indent=4)

print(f"\nChat history saved to '{chat_history_path}'.")


Chat history saved to 'chat_history.json'.
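
The saved file can be read back with `json.load`. The sketch below round-trips a small list of records in the same `{"type", "content"}` shape through a temporary file; the helper names and sample records are hypothetical and stand in for the notebook's real history.

```python
import json
import os
import tempfile

def save_history(records, path):
    """Write a list of {"type", "content"} dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)

def load_history(path):
    """Read the list of message dicts back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Placeholder records; a real run would use the messages from the memory buffer.
toy_records = [
    {"type": "human", "content": "How to Handle Missing Data?"},
    {"type": "ai", "content": "(model answer goes here)"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "chat_history.json")
    save_history(toy_records, demo_path)
    restored = load_history(demo_path)

print(len(restored), "messages restored")
```

To continue a conversation in a new session, the restored records could be replayed into a fresh memory object, for instance through the `add_user_message` and `add_ai_message` methods of its chat history; treat that as an assumption to verify against the installed LangChain version.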


In [26]:
# Display chat history
print("\nChat History:")
print("-------------")

for i, message in enumerate(memory.chat_memory.messages):
    print(f"{i+1}. {message.type.capitalize()}: {message.content}")


Chat History:
-------------
1. Human: what are the steps for K means clustering?
2. Ai: Based on the provided context, here are the steps for K-means clustering:

1. **Specify the number of clusters (K)**: You need to decide how many clusters you want to form.
2. **Choose initial centroids**: The initial centroids are often chosen randomly, but you can also use other methods such as hierarchical clustering or manually specify them.
3. **Assign each point to a cluster**: Each point is assigned to the cluster with the closest centroid.
4. **Calculate new centroids**: The centroid is typically the mean of the points in the cluster.
5. **Update clusters**: If the cluster membership has changed, update the centroids and assign each point to the new cluster with the closest centroid.
6. **Repeat steps 4 and 5**: Continue updating clusters and centroids until convergence or until a stopping condition is met (e.g., until relatively few points change clusters).

It's worth noting that the basi