<a href="https://colab.research.google.com/github/NeoRedcraft/nlp-project-1/blob/main/Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 1: Introduction to the Problem

# Section 2: Dataset Description

## Section 2.1: Brief Description

[Provide a brief description of the knowledge sources used (e.g., PDFs, web pages, text
files, databases).]

## Section 2.2: Source of Documents

[State the source of the documents and how they were collected.]

## Section 2.3: Dataset Structure

[Explain the dataset structure (number of documents, file types, size, domains).]

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset (path relative to the notebook's working directory;
# the absolute path '/dataset/Mental_Health_FAQ.csv' raised FileNotFoundError)
file_path = 'dataset/Mental_Health_FAQ.csv'
df = pd.read_csv(file_path)

# Fit a TF-IDF vectorizer on the non-null questions, dropping English stop words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['Questions'].dropna())

# Sum each term's TF-IDF score across all questions and keep the top N terms
feature_names = vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
top_n = 20
tfidf_sum = df_tfidf.sum().sort_values(ascending=False).head(top_n)

# Plot the top terms as a horizontal bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=tfidf_sum.values, y=tfidf_sum.index, palette='viridis')
plt.title(f'Top {top_n} Words in Questions by TF-IDF Score')
plt.xlabel('TF-IDF Score')
plt.ylabel('Words')
plt.show()

### Add Explanation of the Distribution of Question and Answer Lengths
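
As a starting point for this explanation, the word-count distribution of each text column can be summarized with pandas. The sketch below is illustrative only: it assumes the CSV has an `Answers` column alongside `Questions` (only `Questions` is used elsewhere in this notebook), and the sample rows are made up rather than taken from the dataset.

```python
import pandas as pd


def length_summary(df: pd.DataFrame, columns=("Questions", "Answers")) -> pd.DataFrame:
    """Return descriptive statistics of word counts for each text column."""
    word_counts = pd.DataFrame({
        # Split on whitespace and count the resulting tokens per row
        col: df[col].dropna().astype(str).str.split().str.len()
        for col in columns
    })
    return word_counts.describe()


# Tiny illustrative frame (hypothetical rows, not taken from the dataset)
sample = pd.DataFrame({
    "Questions": ["What is anxiety?", "How can I find help?"],
    "Answers": ["Anxiety is a feeling of worry.", "Talk to a professional you trust."],
})
summary = length_summary(sample)
print(summary.loc[["mean", "min", "max"]])
```

Calling `length_summary(df)` on the loaded dataset would give the mean, spread, and extremes needed to describe both distributions, provided the assumed column names match the file.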


## Section 2.4: Preprocessing

[Discuss any preprocessing steps applied (cleaning, chunking strategy, token limits,
metadata tagging, document filtering)]

## Section 2.5: Embedding Process

[Describe the embedding process (model used, chunk size, overlap).]

In [None]:
# RAG Preprocessing for Qwen/Qwen3 Model
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Load Data
loader = CSVLoader(file_path='dataset/Mental_Health_FAQ.csv', source_column='Questions', encoding='utf-8')
documents = loader.load()

# 2. Text Splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)

# 3. Embedding Model (Preparing for Qwen Retrieval)
embedding_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# 4. Vector Store Creation
persist_directory = './chroma_db'
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_directory
)
print(f'Vector store created successfully at {persist_directory} with {len(splits)} chunks.')

# Section 3: Requirements

## Section 3.1: LLM Frameworks

[LLM frameworks (e.g., LangChain, LlamaIndex)]

## Section 3.2: Embedding Model

[Embedding models]

## Section 3.3: Vector Database

[Vector databases (e.g., FAISS, Chroma, Pinecone)]

## Section 3.4: Backend and UI Tools

[Backend and UI tools (Streamlit, Gradio)]


## Section 3.5: Additional Utilities

[Any additional utilities (PDF loaders, web scrapers, etc.)]


# Section 4: System Architecture

## Section 4.1: System Architecture

[Describe the overall architecture (retriever, vector store, LLM, prompt template).]


## Section 4.2: Pipeline

[Explain the pipeline (query → embedding → similarity search → context injection).]
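
The four stages named above can be illustrated end to end with a toy example. This is a minimal sketch under stated assumptions, not the notebook's actual retriever: the bag-of-words `embed` function stands in for a real sentence encoder such as the one loaded in Section 2.5, the in-memory `DOCS` list stands in for the Chroma vector store, and all names and sample texts are hypothetical.

```python
import numpy as np

# Hypothetical mini knowledge base standing in for the FAQ chunks
DOCS = [
    "Anxiety is a feeling of worry or fear.",
    "Sleep problems can affect mental health.",
    "A therapist can help you manage stress.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(".,?!").lower() for w in text.split()]


VOCAB = sorted({w for d in DOCS for w in tokenize(d)})


def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding (assumption: stand-in for a real encoder)."""
    tokens = tokenize(text)
    vec = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


# Embed the documents once, as a vector store would at indexing time
DOC_VECS = np.stack([embed(d) for d in DOCS])


def retrieve(query: str, k: int = 2) -> list[str]:
    """Similarity search: rank documents by cosine similarity to the query."""
    scores = DOC_VECS @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [DOCS[i] for i in top]


def build_prompt(query: str, k: int = 2) -> str:
    """Context injection: place the retrieved chunks ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("What is anxiety?", k=1))
```

In the real system the prompt string would be passed to the LLM; the sketch stops at prompt construction so that it runs without any model download.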

## Section 4.3: Prompt Design and Grounding Strategy

[Present prompt design and grounding strategy]
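
One possible grounding strategy is to instruct the model to answer only from the retrieved context and to give a fixed fallback reply otherwise. The template below is a hypothetical sketch: the wording, the fallback sentence, and the `format_prompt` helper are assumptions for illustration, not the prompt used by this project.

```python
# Hypothetical grounded prompt template; the wording is illustrative only
GROUNDED_TEMPLATE = (
    "You are a mental health FAQ assistant.\n"
    "Answer using only the context below. If the context does not contain "
    "the answer, reply exactly: \"I don't know based on the provided documents.\"\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def format_prompt(chunks: list[str], question: str) -> str:
    """Number each retrieved chunk and insert it into the grounded template."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return GROUNDED_TEMPLATE.format(context=context, question=question)


example = format_prompt(
    ["Anxiety is a feeling of worry or fear."],  # made-up retrieved chunk
    "What is anxiety?",
)
print(example)
```

Numbering the chunks makes it easier to compare the generated answer against the specific passage it should be grounded in.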

## Section 4.4: System Flow Diagram

[Include system flow diagrams or pseudocode (if applicable).]

# Section 5: System Evaluation (Unseen Queries)



## Section 5.1: Evaluation Setup

[Describe the evaluation setup (manual testing, benchmark questions, user simulation).]

## Section 5.2: Report Metrics

[Report relevant metrics (e.g., response relevance, accuracy, faithfulness, latency).]
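
Two of these metrics are simple to compute once benchmark questions have known relevant documents. The sketch below shows one way to measure retrieval hit rate at k and mean latency. The document ids and query results are hypothetical, and faithfulness or relevance scoring would need a separate, likely manual, procedure.

```python
import time
from statistics import mean


def hit_rate_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant document id appears in the top-k results."""
    hits = [rel in docs[:k] for docs, rel in zip(retrieved, relevant)]
    return sum(hits) / len(hits)


def mean_latency(fn, queries: list[str]) -> float:
    """Average wall-clock seconds per call of fn over the given queries."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Hypothetical ranked document ids per query, and the expected id for each query
retrieved = [["d1", "d4"], ["d3", "d2"], ["d5", "d6"]]
relevant = ["d1", "d2", "d9"]

print(hit_rate_at_k(retrieved, relevant, k=1))
print(hit_rate_at_k(retrieved, relevant, k=2))
```

Here `mean_latency` accepts any callable that takes a query string, so the same helper could time the retriever alone or the full question-answering chain.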

## Section 5.3: Hallucination Handling

[Discuss hallucination handling and failure cases.]

## Section 5.4: Retrieved Context vs. Final Generated Answers

[Compare retrieved context vs. final generated answers]


# Section 6: Web Deployment



## Section 6.1: Streamlit Interface

[Develop a Streamlit or Gradio interface.]

## Section 6.2: User Input

[Allow users to input questions or prompts]

## Section 6.3: Retrieved Context

[Display retrieved context (optional but encouraged).]


## Section 6.4: Chatbot Response

[Show chatbot responses in real time.]

# Section 7: Results and Analysis

## Section 7.1: Qualitative Results

[Present qualitative results (sample Q&A interactions). Discuss strengths, weaknesses, edge cases, and observed limitations. Analyze how retrieval quality affects response quality.]

## Section 7.2: Quantitative Results

[Present quantitative or structured evaluation results (if applicable). Discuss strengths, weaknesses, edge cases, and observed limitations. Analyze how retrieval quality affects response quality.]

# Section 8: Documentation

[Insert Link to the IEEE Paper]

# Section 9: Insights and Conclusions

[Summarize what your group learned about building an LLM chatbot. Discuss system strengths,
limitations (e.g., retrieval errors, hallucinations), and propose areas for future improvement such
as better embeddings, reranking, or hybrid retrieval.]

# Section 10: References

## Section 10.1: Scholarly Articles

[Cite in APA format, and put a description of how you used it for your work]

## Section 10.2: Online References

[Put the website, blog, or article title, link, and how you incorporated it into your
work]

## Section 10.3: Artificial Intelligence Tools

[Put the model used (e.g., ChatGPT, Gemini), the complete transcript of your
conversations with the model (including your prompts and its responses), and a
description of how you used it for your work]