### **Installing Dependencies**

In [None]:
# Installing dependencies. A few of these may be redundant; they were added while debugging, so they are left in place.
# This takes about 4-6 minutes and 2 session restarts.
!pip install langchain
!pip install langchain_google_genai
!pip install google-generativeai
!pip install -U langchain-community
!pip install unstructured
!pip install "unstructured[pdf]"
!pip install google-ai-generativelanguage
!pip install "langchain[docarray]"
!pip install python-dotenv
#!pip install PyPDF2
!pip install PyMuPDF
# !pip install pdfminer.six
!pip install chromadb
!pip install nltk spacy
!python -m spacy download en_core_web_sm

In [None]:
# Connect to Drive in case you have uploaded the PDFs to your own Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
#importing stuff
from langchain_google_genai import ChatGoogleGenerativeAI
import os
import fitz
import nltk
import spacy
import google.generativeai as genai
from google.colab import userdata
from dotenv import load_dotenv
from IPython.display import display
from IPython.display import Markdown
from google.colab import files
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import urllib
from pdfminer.high_level import extract_text

# Community integrations are imported from langchain_community (installed above)
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import Chroma
import re

# Downloading the NLTK tokenizer data and the stop-word list
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### **Setting up the API Key**
Create a `.env` file modeled on the `.env.example` file and insert your Gemini API key after the `GOOGLE_API_KEY=` field.
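
A minimal sketch of a guard that fails early when a key is missing, assuming the `.env` file has already been loaded into the process environment (for example by `load_dotenv`). The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and exist only for this illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check the uploaded .env file")
    return value


# Placeholder variable and value for illustration only.
os.environ["DEMO_API_KEY"] = "abcdef123456"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same check could be applied to `GOOGLE_API_KEY` right after `load_dotenv`, so that a malformed `.env` file is caught before the model is initialized.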

In [3]:
#the file uploaded here is a .env having the gemini api key formatted as GOOGLE_API_KEY=abcdef123456
uploaded = files.upload()
for fn in uploaded.keys():
    with open(fn, 'wb') as f:
        f.write(uploaded[fn])

load_dotenv(fn)
api_ke = os.getenv('GOOGLE_API_KEY')

#print(api_ke)

genai.configure(api_key=api_ke)

Saving .env to .env


## **Initializing The Gemini Model**

In [4]:
model = ChatGoogleGenerativeAI(model="gemini-1.5-flash",google_api_key=api_ke,
                             temperature=0.2,convert_system_message_to_human=True)

## **Processing PDFs**
### **Uploading PDFs**
First extract the zip archive and upload the extracted folder. Adjust the file path as necessary.

### **Pre-Processing PDFs**
First we load the stop-word list from NLTK; these are words that carry negligible semantic meaning. Then we tokenize the text with spaCy, lemmatize each remaining token, and join the result into a single string with leading and trailing whitespace stripped. I had earlier also removed punctuation marks, but that was interfering with the CG cutoff values.
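
A small illustration of why stripping punctuation can damage cutoff values. It applies the pattern from the commented-out line in `preprocess_text` to a made-up sample sentence (the sentence is an assumption, not text from the PDFs): the decimal point is removed, so `7.5` turns into `75`.

```python
import re

sample = "CG Cutoff - 7.5+ for software roles."

# Pattern from the commented-out line: drop everything except
# word characters, whitespace and '+'.
stripped = re.sub(r"[^\w\s+]", "", sample)

print(stripped)
```

The `+` survives because the pattern explicitly keeps it, but the `.` does not, which changes the number itself.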

### **Extracting Text From PDFs**
We use the PyMuPDF library to extract text from the PDFs. Then we store the text based on the title of each PDF: a combined list in `all_pages`, and two separate lists for the SI Chronicles and the Placement Chronicles.

In [6]:
#if taking from mydrive
# pdf_folder_path = "/content/drive/MyDrive/pdf_files"

#make a folder with name pdf_files upload both pdfs into it
pdf_folder_path = "/content/pdf_files"

pdf_files = [f for f in os.listdir(pdf_folder_path) if f.endswith('.pdf')]

all_pages = []
pi_pages = []
si_pages = []

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # text = re.sub(r'[^\w\s+]', '', text)

    doc = nlp(text)
    # Remove stopwords and lemmatize
    processed_words = []
    for token in doc:
        if token.text.lower() not in stop_words:
            processed_words.append(token.lemma_)

    processed_text = ' '.join(processed_words).strip()
    return processed_text


def extract_text_from_pdf(pdf_path):
    text = ""
    document = fitz.open(pdf_path)
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text += page.get_text()
    text = preprocess_text(text)
    return text

for pdf_file in pdf_files:
    file_name = os.path.splitext(pdf_file)[0]

    pdf_path = os.path.join(pdf_folder_path, pdf_file)
    text = extract_text_from_pdf(pdf_path)
    all_pages.append(text)

    if "placement chronicles" in file_name.lower():
        pi_pages.append(text)
    elif "si chronicles" in file_name.lower():
        si_pages.append(text)

#print(all_pages[28].page_content)
print(len(all_pages))
print(len(si_pages))
print(len(pi_pages))


2
1
1


In [None]:
# print(pi_pages[0])

### **Text Processing**
First we combine the pages into a single string, removing leading and trailing whitespace in the process. Then we break the text into chunks of the desired size.
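
To show what chunk size and overlap mean, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

With a 50% overlap, as in the `chunk_size=1500, chunk_overlap=750` setting, every character away from the edges appears in two chunks, which makes it less likely that an answer is split across a chunk boundary at the cost of roughly doubling the number of chunks.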

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=750)

context_all = "\n\n".join(str(p) for p in all_pages)
context_pi = "\n\n".join(str(p) for p in pi_pages)
context_si = "\n\n".join(str(p) for p in si_pages)

context_all = context_all.strip()
context_pi = context_pi.strip()
context_si = context_si.strip()

texts = text_splitter.split_text(context_all)
pi_texts = text_splitter.split_text(context_pi)
si_texts = text_splitter.split_text(context_si)

print(len(texts))
print(len(pi_texts))
print(len(si_texts))

# print("texts:", texts)
# print("pi_texts:", pi_texts)
# print("si_texts:", si_texts)

202
82
120


## **Making Corresponding Embeddings**
First we initialize the Gemini embeddings model. Then we create the corresponding embeddings and a retriever for each set of chunks using Chroma.
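
Conceptually, a retriever with `k` set to 5 returns the five chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with toy two-dimensional vectors and cosine similarity; it is an assumption-laden illustration, not Chroma's implementation, since a real vector store uses its own index and distance metric. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for real chunk embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k([1.0, 0.0], doc_vecs, k=2)
print(best)
```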

In [8]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001",google_api_key=api_ke)

vector_index = Chroma.from_texts(texts, embeddings).as_retriever(search_kwargs={"k":5})
vector_pi = Chroma.from_texts(pi_texts, embeddings).as_retriever(search_kwargs={"k":5})
vector_si = Chroma.from_texts(si_texts, embeddings).as_retriever(search_kwargs={"k":5})

## **Processing Queries**
### **Setting a Template**
We send a set of instructions to the LLM along with each query, which helps it generate better answers.
### **Selecting Appropriate Retriever**
We create a function that identifies keywords in the user's query and assigns the retriever that has access to the corresponding document. If no keywords are identified, we use the retriever built over both documents combined.
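
One thing to watch with keyword routing is the difference between substring matching and whole-word matching. The sketch below compares the two on a short keyword such as `si`; the helper names are hypothetical, and the word-boundary regex is one possible approach rather than the only one.

```python
import re

SI_KEYWORDS = ["summer internship", "summer internships", "si"]


def has_keyword_substring(question, keywords):
    """Substring test: fires whenever a keyword appears anywhere, even inside a word."""
    lowered = question.lower()
    return any(keyword in lowered for keyword in keywords)


def has_keyword_word(question, keywords):
    """Whole-word test: keywords must be delimited by word boundaries."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, question, flags=re.IGNORECASE) is not None


question = "What are the relevant courses for cohesity ?"
print(has_keyword_substring(question, SI_KEYWORDS))  # "si" hides inside "cohesity"
print(has_keyword_word(question, SI_KEYWORDS))
print(has_keyword_word("What is the SI cutoff?", SI_KEYWORDS))
```

With substring matching, a question about Cohesity that mentions neither placements nor internships would still be routed to the summer internship retriever.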
### **Emphasize Number Function**
This function puts emphasis on any numbers in the user's query. We are more likely to get a good result if we can find the exact number in the documents than by relying on semantic search alone.
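
When extracting numbers with `re.findall`, note that a capturing group changes what is returned: only the group's text comes back, not the whole match. The sketch below shows the difference and one way to tag each full number with the `^2` suffix; the function name `emphasize_numbers_sketch` is hypothetical.

```python
import re

question = "which company had CG cutoff as more than 8.5 in placements"

# Capturing group: findall returns only the fractional part.
captured = re.findall(r"\d+(\.\d+)?", question)
# Non-capturing group: findall returns the full number.
full = re.findall(r"\d+(?:\.\d+)?", question)
print(captured, full)


def emphasize_numbers_sketch(text):
    """Append '^2' to every integer or decimal number in a single pass."""
    return re.sub(r"\d+(?:\.\d+)?", lambda m: f"{m.group(0)}^2", text)


print(emphasize_numbers_sketch(question))
```

Using `re.sub` with a callback also avoids tagging the same number twice when it occurs more than once in the query.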
### **Building A Q/A Retrieval Chain**
In this function we first call the emphasize-number and select-retriever functions to process the query. Then we build the retrieval Q/A chain with the corresponding model, retriever and template.

In [9]:
template = """Use the following context to answer the question at the end.Look into the full context carefully and Don't try to make up answers. Give answers in detail only if asked. You can reference interviewee names and tell their experiences as general answers.Always end your response with "thanks for asking!".
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
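pi_keywords = ["placement", "placements"]
si_keywords = ["summer internship", "si", "summer internships"]

def contains_keyword(question, keywords):
    # Match whole words only, so that "si" does not fire inside words such as "cohesity"
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, question, flags=re.IGNORECASE) is not None

def select_retriever(question):
    # if contains_keyword(question, pi_keywords) and contains_keyword(question, si_keywords):
    #     return vector_index
    # Placement keywords take priority; questions without any keyword use the combined index
    if contains_keyword(question, pi_keywords):
        return vector_pi
    elif contains_keyword(question, si_keywords):
        return vector_si
    else:
        return vector_index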
def emphasize_numbers(question):
    # Use a non-capturing group: with a capturing group, re.findall returns only the
    # fractional part (or an empty string) instead of the whole number.
    emphasized_query = re.sub(r'\d+(?:\.\d+)?', lambda m: f"{m.group(0)}^2", question)

    return emphasized_query


def get_qa_chain(question):
    # The emphasized question is used here only to choose the retriever;
    # the returned chain is later called with whatever query the caller passes in.
    emphasized_question = emphasize_numbers(question)
    selected_retriever = select_retriever(emphasized_question)
    qa_chain = RetrievalQA.from_chain_type(
        model,
        retriever=selected_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
    )
    return qa_chain

In [None]:
# def remove_keywords(input_string, keywords):
#     # Combine all keywords into a single regex pattern
#     pattern = '|'.join(r'\b{}\b'.format(re.escape(keyword)) for keyword in keywords)

#     # Use re.sub to replace all occurrences with an empty string
#     return re.sub(pattern, '', input_string, flags=re.IGNORECASE)

# def check_query(question):
#     pi_keywords = ["placement", "placements"]
#     si_keywords = ["summer internship", "si", "summer internships"]
#     keywords = pi_keywords + si_keywords
#     question_lower = question.lower()

#     if ((any(word in question_lower for word in pi_keywords) and any(word in question_lower for word in si_keywords))):
#         question = remove_keywords(question_lower, keywords)
#         print(question)
#         result = qa_chain({"query": question})
#     elif any(word in question_lower for word in pi_keywords):
#         question = remove_keywords(question_lower, pi_keywords)
#         print(question)
#         result = pi_chain({"query": question})
#     elif any(word in question_lower for word in si_keywords):
#         question = remove_keywords(question_lower, si_keywords)
#         print(question)
#         result = si_chain({"query": question})
#     else:
#         print(question)
#         result = qa_chain({"query": question})

#     return result["result"]

## **Bot Testing**
This is general testing of the bot; you can look through it to get an idea of how well the bot works. I have hidden the output for the queries, but you can press "show hidden output" to view the source documents retrieved by the bot.
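
Because the chain is built with `return_source_documents=True`, each result carries the retrieved chunks under the `source_documents` key. A minimal sketch of a helper that prints a short preview of each chunk follows; the name `summarize_sources` is hypothetical, and a `SimpleNamespace` stands in for the real document objects, on the assumption that they expose a `page_content` attribute as seen in the outputs.

```python
from types import SimpleNamespace


def summarize_sources(result, width=80):
    """Return one shortened, single-line preview per retrieved source document."""
    previews = []
    for doc in result.get("source_documents", []):
        # Collapse newlines and repeated spaces before truncating.
        text = " ".join(doc.page_content.split())
        previews.append(text[:width])
    return previews


# Stand-in for a chain result, for illustration only.
fake_result = {
    "result": "example answer",
    "source_documents": [
        SimpleNamespace(page_content="CG Cutoff - 8 + \n recruitment process"),
    ],
}
previews = summarize_sources(fake_result)
print(previews)
```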

In [None]:
question = "Elaborate on the selection process for Microsoft in summer internship in detail?"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'Elaborate on the selection process for Microsoft in summer internship in detail?',
 'result': "The selection process for Microsoft's summer internship program is quite rigorous and involves multiple rounds. Here's a breakdown based on the information provided:\n\n**Eligibility:**\n\n* **CG Cutoff:**  The minimum CGPA requirement is 7+ for software engineering internships. For hardware-related roles, the cutoff is 8+.\n* **Branch:** Open to students from A7, A3, A8, and AA branches (single and dual degree).\n\n**Recruitment Process:**\n\n**Round 1: Online Assessment (Software Engineering)**\n\n* **Duration:** 1.5 hours.\n* **Content:** Two questions based on basic concepts covered in CS F211 - Data Structures and Algorithms. The difficulty level is rated as easy to medium.\n* **Key:**  Time management is crucial. Aim to finish as quickly as possible while ensuring accuracy.\n\n**Round 2: Technical Interview 1 (Software Engineering)**\n\n* **Duration:** Approximately one hour.

In [None]:
Markdown(result["result"])

The selection process for Microsoft's summer internship program is quite rigorous and involves multiple rounds. Here's a breakdown based on the information provided:

**Eligibility:**

* **CG Cutoff:**  The minimum CGPA requirement is 7+ for software engineering internships. For hardware-related roles, the cutoff is 8+.
* **Branch:** Open to students from A7, A3, A8, and AA branches (single and dual degree).

**Recruitment Process:**

**Round 1: Online Assessment (Software Engineering)**

* **Duration:** 1.5 hours.
* **Content:** Two questions based on basic concepts covered in CS F211 - Data Structures and Algorithms. The difficulty level is rated as easy to medium.
* **Key:**  Time management is crucial. Aim to finish as quickly as possible while ensuring accuracy.

**Round 2: Technical Interview 1 (Software Engineering)**

* **Duration:** Approximately one hour.
* **Content:**
    * **Introduction:**  Begins with an introduction and discussion about your projects and internship experiences.
    * **Technical Depth:**  In-depth questions about the tech stack and concepts used in your projects.
    * **Design Patterns:** Basic to moderate questions on design patterns covered in CS F213 - Object-Oriented Programming.
    * **Coding:**  A coding question related to concepts covered in CS F211 - Data Structures and Algorithms. You need to pass specific test cases.

**Round 3: DSA Interview (Software Engineering)**

* **Content:**
    * **Project Discussion:**  The interviewer will delve into your resume projects.
    * **DSA Segment:**  In-depth discussion on sorting algorithms, likely covering your approach to preparing for DSA interviews using LeetCode and the CS F211 course content.  Expect detailed discussions on quick sort and merge sort.

**Round 4: CS Fundamentals Interview (Software Engineering)**

* **Content:**  This round focuses on your understanding of fundamental computer science concepts.

**Round 1: Online Assessment (Hardware Engineering)**

* **Duration:**  Not specified.
* **Content:**  Three challenging coding questions. Examples include:
    * Implementing a spiral output given a starting corner.
    * A question based on graphs, described as particularly tricky.
    * A problem involving dynamic programming, binary search, and bitmasking.

**Round 2: Online Assessment 2 (Hardware Engineering)**

* **Content:**  Shortlisting round, likely based on the previous round and your resume.  The questions are relatively easy and short in duration.  They focus on implementation-based problems, comparable to medium-level LeetCode questions.  The questions are more design-based than purely algorithmic.

**Round 2: Technical Interview (Hardware Engineering)**

* **Content:**  The round mainly covers questions based on ECE/EEE/INSTR F215 - Digital Design. It also includes basic questions related to the third-year course CDC - EEE/INSTR F313 - Analog Digital VLSI Design.  While not a specific course requirement, it's helpful to understand basic concepts like Static Time Analysis.  The interview will focus on topics covered in the courses.

**Round 3: Personal Interview (Hardware Engineering)**

* **Duration:**  Approximately 40 minutes.
* **Content:**  This round is like a personality test.  You'll be presented with situations and asked questions to gauge your professional ethics and how you would respond to different situations.  The interview also assesses your ability to handle challenging situations.

**General Advice:**

* **Confidence:**  Be confident in your answers, even if you get stuck.
* **Communication:**  Keep interacting with the interviewer and share your thought process.
* **Preparation:**  Thoroughly prepare for OOP concepts, especially for software engineering roles.
* **Fundamentals:**  Ensure you have a strong understanding of fundamental concepts, especially for hardware engineering roles.
* **Resources:**  Use online resources like GeeksforGeeks for general preparation and hdlbits to get comfortable with Verilog for hardware engineering.

Thanks for asking! 


In [None]:
question = "How can I prepare for personality interview in Nvidia?"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'How can I prepare for personality interview in Nvidia?',
 'result': "Based on the provided information, the personality interview at Nvidia focuses on assessing your communication skills, problem-solving abilities, and cultural fit. Here's how you can prepare:\n\n* **Practice answering common HR questions:**  The interviewee, Satvik Jain, suggests practicing answers to HR questions in advance. This includes questions about your interest in the company, your career goals, and your strengths and weaknesses.\n* **Highlight your relevant skills:**  Nvidia's technical interview focuses on your understanding of computer architecture, digital design, and microprocessors. Be prepared to discuss your experience with these concepts and how they relate to the role you're applying for.\n* **Demonstrate your problem-solving skills:**  The technical interview may involve coding challenges or questions that require you to think critically and apply your knowledge. Practice solving problems

In [None]:
Markdown(result["result"])

Based on the provided information, the personality interview at Nvidia focuses on assessing your communication skills, problem-solving abilities, and cultural fit. Here's how you can prepare:

* **Practice answering common HR questions:**  The interviewee, Satvik Jain, suggests practicing answers to HR questions in advance. This includes questions about your interest in the company, your career goals, and your strengths and weaknesses.
* **Highlight your relevant skills:**  Nvidia's technical interview focuses on your understanding of computer architecture, digital design, and microprocessors. Be prepared to discuss your experience with these concepts and how they relate to the role you're applying for.
* **Demonstrate your problem-solving skills:**  The technical interview may involve coding challenges or questions that require you to think critically and apply your knowledge. Practice solving problems using different approaches and be prepared to explain your thought process.
* **Show your enthusiasm and passion:**  Nvidia values candidates who are passionate about technology and eager to learn. Express your interest in the company and the specific role you're applying for.
* **Research the company and the role:**  Before the interview, take the time to learn about Nvidia's culture, values, and current projects. This will help you tailor your responses and demonstrate your genuine interest.

Remember, the personality interview is an opportunity to showcase your personality and connect with the interviewer on a personal level. Be confident, enthusiastic, and authentic.

Thanks for asking! 


In [None]:
question = "which company had CG cutoff as more than 8 in placements"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'which company had CG cutoff as more than 8 in placements',
 'result': 'The company with a CG cutoff of more than 8 in placements is **Millennium Management**. \n\nthanks for asking! \n',
 'source_documents': [Document(page_content='CG Cutoff - 8 + \n recruitment process - \u2028\n Round 1 - Online assessment \n\n round mcq base Math Probability . besides , \n code question . \n\n Round 2 - Puzzle - Based Interview \n\n ask various mathematical puzzle brain teaser 25 minute . \n find even practice book video \n mention next page . \n\n Round 3 - Technical interview \n\n round start fairly easy puzzle . however , move \n technical segment fairly quickly , puzzle standard . \n part , ask code run three question . likely \n correspond medium hard difficulty level Leetcode . \n Millennium Management \n 117 \n Round 4 - hr Interview \n\n round call hr round , start \n moderately hard mathematical puzzle . solve , proceed \n hr segment . , ask general question around resume , \n we

In [None]:
Markdown(result["result"])

The company with a CG cutoff of more than 8 in placements is **Millennium Management**. 

thanks for asking! 


### **A Tricky Question for companies which hire in multiple domains**
Ola Cabs hires in both chemical and tech placements, so we can ask the model about each by mentioning which field we mean.

In [None]:
question = "What are the relevant courses for ola cabs in chemical placements?"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'What are the relevant courses for ola cabs in chemical placements?',
 'result': 'The relevant courses for Ola Cabs in chemical placements are:\n\n* **CHE f213 - Chemical Engineering Thermodynamics**\n* **CHE F212 - Fluid Mechanics**\n* **CHE F241 - Heat Transfer**\n\nThese courses cover the core chemical engineering concepts that are essential for a job in the chemical industry. \n\nThanks for asking! \n',
 'source_documents': [Document(page_content='comprise 30 question answer 30 minute . question \n lengthy gate - level , cover knowledge thermodynamic , \n fluid mechanic , heat , etc . one well prepared online \n assessment computer - adaptive test , mean \n difficulty level question would adjust per one ’s performance . \n moreover , question design company , \n point search internet . \n Round 2 - Technical Interview \n\n test , 5 student shortlist whole Chemical \n department two round interview conduct . \n discussion base resume . question cover ML project , \n Chemic

In [None]:
Markdown(result["result"])

The relevant courses for Ola Cabs in chemical placements are:

* **CHE f213 - Chemical Engineering Thermodynamics**
* **CHE F212 - Fluid Mechanics**
* **CHE F241 - Heat Transfer**

These courses cover the core chemical engineering concepts that are essential for a job in the chemical industry. 

Thanks for asking! 


In [None]:
question = "What are the relevant courses for ola cabs in tech placements?"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'What are the relevant courses for ola cabs in tech placements?',
 'result': 'The relevant courses for Ola Cabs tech placements are:\n\n* **CS F213 - Object-Oriented Programming:** This course is particularly helpful as it covers the fundamentals of object-oriented programming, which is essential for software development.\n* **CS F211 - Data Structures and Algorithms:** This course is crucial for understanding the underlying principles of data structures and algorithms, which are essential for efficient software development.\n* **CS F212 - Database Systems:** This course provides knowledge of database systems, which are essential for managing and storing large amounts of data, a critical aspect of many tech roles.\n* **CS F303 - Computer Networks:** This course covers the fundamentals of computer networks, which are essential for understanding how data is transmitted and received over the internet.\n* **CS F342 - Computer Architecture:** This course provides knowledge of comp

In [None]:
Markdown(result["result"])

The relevant courses for Ola Cabs tech placements are:

* **CS F213 - Object-Oriented Programming:** This course is particularly helpful as it covers the fundamentals of object-oriented programming, which is essential for software development.
* **CS F211 - Data Structures and Algorithms:** This course is crucial for understanding the underlying principles of data structures and algorithms, which are essential for efficient software development.
* **CS F212 - Database Systems:** This course provides knowledge of database systems, which are essential for managing and storing large amounts of data, a critical aspect of many tech roles.
* **CS F303 - Computer Networks:** This course covers the fundamentals of computer networks, which are essential for understanding how data is transmitted and received over the internet.
* **CS F342 - Computer Architecture:** This course provides knowledge of computer architecture, which is essential for understanding how computers work at a low level.
* **CS F372 - Operating Systems:** This course covers the fundamentals of operating systems, which are essential for understanding how software interacts with hardware.
* **CS F464 - Machine Learning:** This course is particularly relevant for roles involving machine learning and data science.

Thanks for asking! 


In [11]:
question = "What are the relevant courses for cohesity ?"
qa_chain = get_qa_chain(question)
result = qa_chain({"query": question})
result



{'query': 'What are the relevant courses for cohesity ?',
 'result': 'The relevant courses for Cohesity are:\n\n* **EEE F111 - Electrical Sciences:** This course provides a fundamental understanding of electrical engineering principles, which are relevant to software engineering roles in companies like Cohesity that deal with hardware and infrastructure.\n* **EEE/INSTR F244 - Microelectronic Circuits:** This course focuses on the design and analysis of microelectronic circuits, which are essential for understanding the hardware components used in data storage and management systems.\n* **EEE F313 - Analog Digital VLSI Design:** This course delves into the design and implementation of Very Large Scale Integration (VLSI) circuits, which are crucial for building high-performance storage systems.\n* **CS F111 - Computer Programming:** This course provides a foundation in programming concepts and techniques, which are essential for software development roles.\n* **Math Courses:**  Mathemati

In [12]:
Markdown(result["result"])

The relevant courses for Cohesity are:

* **EEE F111 - Electrical Sciences:** This course provides a fundamental understanding of electrical engineering principles, which are relevant to software engineering roles in companies like Cohesity that deal with hardware and infrastructure.
* **EEE/INSTR F244 - Microelectronic Circuits:** This course focuses on the design and analysis of microelectronic circuits, which are essential for understanding the hardware components used in data storage and management systems.
* **EEE F313 - Analog Digital VLSI Design:** This course delves into the design and implementation of Very Large Scale Integration (VLSI) circuits, which are crucial for building high-performance storage systems.
* **CS F111 - Computer Programming:** This course provides a foundation in programming concepts and techniques, which are essential for software development roles.
* **Math Courses:**  Mathematics courses like MATH F111, MATH F112, and MATH F211 are important for understanding algorithms, data structures, and other fundamental concepts used in software engineering.

These courses provide a strong foundation in the technical skills required for software engineering roles at Cohesity. 

Thanks for asking! 
