### Installing and Importing the necessary libraries

In [1]:
pip install scikit-learn nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK sentence tokenizer data used by sent_tokenize
# ('punkt_tab' for newer NLTK releases, 'punkt' for older ones)
nltk.download('punkt_tab')
nltk.download('punkt')

[nltk_data] Downloading package punkt_tab to C:\Users\Mussaddiq
[nltk_data]     Khan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Mussaddiq
[nltk_data]     Khan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Define load_text() to read a file and return its contents as a string
def load_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

text = load_text('python.txt')
print(text)
# The file contents are now stored in the variable 'text'; see the printed text below

Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991.
Key Features of Python: Easy to Read & Write Syntax is clean and similar to English, Interpreted Language Python code is executed line-by-line which makes debugging easier, dynamically Typed You don’t need to declare data types explicitly, Object-Oriented & Functional Supports multiple programming paradigms, Large Standard Library Comes with many modules and packages (like math, datetime, os, etc.), Cross-Platform Works on Windows, Mac, and Linux.
Common Uses of Python: Web Development (with frameworks like Django, Flask), Data Science & Machine Learning (using NumPy, pandas, scikit-learn), Automation & Scripting, Game Development, APIs and Backend Services, Artificial Intelligence & Deep Learning.



In [4]:
# Split the text into groups of a few sentences, controlled by chunk_size (e.g., 3–5 at a time).
# How the list comprehension in chunk_text works, for six sentences and chunk_size=2:
#
#   range(0, len(sentences), chunk_size) gives i = 0, 2, 4
#
#   For each i, sentences[i:i+chunk_size] slices the list:
#     sentences[0:2] → ["Sentence 1.", "Sentence 2."]
#     sentences[2:4] → ["Sentence 3.", "Sentence 4."]
#     sentences[4:6] → ["Sentence 5.", "Sentence 6."]
#
#   " ".join(...) turns each of those slices into one string:
#     "Sentence 1. Sentence 2.", and so on
def chunk_text(text, chunk_size=5): 
    sentences = sent_tokenize(text)
    return [" ".join(sentences[i:i+chunk_size]) for i in range(0, len(sentences), chunk_size)]    
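
To see the slicing logic in isolation, here is a minimal sketch that applies the same list comprehension to a hand-written list of sentences. The helper name `chunk_sentences` and the sample sentences are illustrative assumptions; the sketch skips `sent_tokenize` so it runs without NLTK.

```python
def chunk_sentences(sentences, chunk_size):
    """Group already-split sentences into strings of up to chunk_size sentences."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

# Hypothetical input: five sentences that have already been split
sample = ["S1.", "S2.", "S3.", "S4.", "S5."]

pairs = chunk_sentences(sample, 2)
print(pairs)       # the last chunk holds the single leftover sentence

whole = chunk_sentences(sample, 5)
print(len(whole))  # a chunk size >= the sentence count yields one chunk
```

When the chunk size is at least as large as the number of sentences, the whole text ends up in a single chunk, which matters later when chunks are ranked against a question.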

In [5]:
# Split the text into chunks of two sentences each and store them in 'chunks'
chunks = chunk_text(text, chunk_size=2)
# enumerate() yields each item together with its index
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:
Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991.

Chunk 2:
Key Features of Python: Easy to Read & Write Syntax is clean and similar to English, Interpreted Language Python code is executed line-by-line which makes debugging easier, dynamically Typed You don’t need to declare data types explicitly, Object-Oriented & Functional Supports multiple programming paradigms, Large Standard Library Comes with many modules and packages (like math, datetime, os, etc. ), Cross-Platform Works on Windows, Mac, and Linux.

Chunk 3:
Common Uses of Python: Web Development (with frameworks like Django, Flask), Data Science & Machine Learning (using NumPy, pandas, scikit-learn), Automation & Scripting, Game Development, APIs and Backend Services, Artificial Intelligence & Deep Learning.



In [6]:
# Vectorize the text chunks:
# convert each chunk of text into a numerical vector (a list of numbers)
# so that an algorithm (such as similarity search) can process and compare them.

def vectorize_chunks(chunks):
    # Algorithms operate on numbers, not raw text, so each chunk is converted
    # into a vector using TF-IDF (Term Frequency-Inverse Document Frequency).

    # TfidfVectorizer (scikit-learn) weights each word by how important it is
    # in one chunk relative to the other chunks.
    vectorizer = TfidfVectorizer()

    # Learn the vocabulary and transform each chunk into a TF-IDF vector.
    tfidf_matrix = vectorizer.fit_transform(chunks)
    return vectorizer, tfidf_matrix

# Call the function with our chunks list and store the outputs.
vectorizer, tfidf_matrix = vectorize_chunks(chunks)

# Print the feature names
print("TF-IDF Feature Names:")
print(vectorizer.get_feature_names_out())

# Print the matrix shape
print("\nTF-IDF Matrix Shape:")
print(tfidf_matrix.shape)

# Print the matrix as a dense array
print("\nTF-IDF Matrix as Array:")
print(tfidf_matrix.toarray())

TF-IDF Feature Names:
['1991' 'and' 'apis' 'artificial' 'automation' 'backend' 'by' 'clean'
 'code' 'comes' 'common' 'created' 'cross' 'data' 'datetime' 'debugging'
 'declare' 'deep' 'development' 'django' 'don' 'dynamically' 'easier'
 'easy' 'english' 'etc' 'executed' 'explicitly' 'features' 'first' 'flask'
 'for' 'frameworks' 'functional' 'game' 'guido' 'high' 'in' 'intelligence'
 'interpreted' 'is' 'it' 'its' 'key' 'known' 'language' 'large' 'learn'
 'learning' 'level' 'library' 'like' 'line' 'linux' 'mac' 'machine'
 'makes' 'many' 'math' 'modules' 'multiple' 'need' 'numpy' 'object' 'of'
 'on' 'oriented' 'os' 'packages' 'pandas' 'paradigms' 'platform'
 'programming' 'python' 'read' 'readability' 'released' 'rossum' 'science'
 'scikit' 'scripting' 'services' 'similar' 'simplicity' 'standard'
 'supports' 'syntax' 'to' 'typed' 'types' 'uses' 'using' 'van' 'was' 'web'
 'which' 'windows' 'with' 'works' 'write' 'you']

TF-IDF Matrix Shape:
(3, 101)

TF-IDF Matrix as Array:
[[0.21498599 0.
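
To make the numbers in the matrix less opaque, the following sketch computes TF-IDF by hand on a tiny, made-up corpus. It assumes scikit-learn's default weighting (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per row) and uses pre-tokenized input, so it does not reproduce `TfidfVectorizer`'s own tokenization. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Hand-rolled TF-IDF for pre-tokenized documents.

    Assumed weighting: raw counts, idf = ln((1 + n) / (1 + df)) + 1,
    followed by L2 normalisation of each document vector.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors

# Made-up corpus: "python" appears everywhere, "readable" only once
docs = [
    ["python", "is", "readable"],
    ["python", "is", "interpreted"],
    ["guido", "created", "python"],
]
vocab, vectors = tfidf_vectors(docs)
row = dict(zip(vocab, vectors[0]))

print(len(vectors), len(vocab))                      # rows x columns, like tfidf_matrix.shape
print(row["python"] < row["is"] < row["readable"])   # rarer terms get larger weights
```

The matrix has one row per chunk and one column per vocabulary term, which is why the shape above is `(3, 101)`: three chunks and 101 distinct terms.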

In [7]:
def answer_question(question, chunks, vectorizer, tfidf_matrix, top_n=1):
    # Transform the question into a TF-IDF vector using the same fitted vectorizer
    question_vec = vectorizer.transform([question])

    # cosine_similarity compares the question vector with every chunk vector;
    # for TF-IDF vectors the scores lie between 0 and 1
    similarities = cosine_similarity(question_vec, tfidf_matrix).flatten()

    # argsort() sorts ascending, so [-top_n:] takes the indices of the top n
    # scores and [::-1] puts the best one first
    best_chunk_indices = similarities.argsort()[-top_n:][::-1]

    # Dive deeper into the best chunks to find the best sentence in each
    best_sentences = []
    for idx in best_chunk_indices:
        chunk = chunks[idx]
        sentences = sent_tokenize(chunk)
        sentence_vecs = vectorizer.transform(sentences)
        sentence_similarities = cosine_similarity(question_vec, sentence_vecs).flatten()
        best_sentence_idx = sentence_similarities.argmax()
        best_sentences.append(sentences[best_sentence_idx])
    return best_sentences
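
The ranking step can be illustrated without scikit-learn. The sketch below computes cosine similarity by hand and mimics the `argsort()[-top_n:][::-1]` idiom with plain lists. The helper names `cosine` and `top_n_indices` and the toy vectors are assumptions made for illustration, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_indices(scores, top_n=1):
    """Indices of the top_n highest scores, best first."""
    # Sort indices by ascending score, as argsort() does
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Keep the last top_n indices and reverse them so the best comes first
    return order[-top_n:][::-1]

# Toy vectors over a three-term vocabulary
question_vec = [1.0, 0.0, 1.0]
chunk_vecs = [
    [1.0, 0.0, 0.0],  # shares one term with the question
    [0.0, 1.0, 0.0],  # shares no terms
    [1.0, 0.0, 1.0],  # points in the same direction as the question
]

scores = [cosine(question_vec, v) for v in chunk_vecs]
best = top_n_indices(scores, top_n=2)
print(scores)
print(best)
```

Cosine similarity ignores vector length and measures only direction, so a short question can still match a long chunk that uses the same terms.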

In [11]:
# Answer user questions by returning the most relevant chunk(s)
def answer_question(question, chunks, vectorizer, tfidf_matrix, top_n=1):
    question_vec = vectorizer.transform([question])
    similarities = cosine_similarity(question_vec, tfidf_matrix).flatten()
    best_chunk_indices = similarities.argsort()[-top_n:][::-1]
    
    # Return the most relevant chunk(s)
    return [chunks[i] for i in best_chunk_indices]

chunks = chunk_text(text, chunk_size=1)
vectorizer, tfidf_matrix = vectorize_chunks(chunks)

# === Ask a question ===
question = "Who created Python?"
answers = answer_question(question, chunks, vectorizer, tfidf_matrix)

# === Print result ===
print("Answer:")
for ans in answers:
    print(ans)

Answer:
It was created by Guido van Rossum and first released in 1991.
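
One case this approach does not handle is a question that shares no vocabulary with the document: every similarity score is then 0, yet `argsort()` still returns an index, so an unrelated chunk is printed. A possible guard is sketched below on plain Python lists; the function name `pick_answer` and the `min_score` parameter are hypothetical additions, not part of the notebook.

```python
def pick_answer(scores, chunks, min_score=0.0):
    """Return the best-scoring chunk, or None if no score exceeds min_score."""
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] <= min_score:
        return None  # nothing in the document overlaps with the question
    return chunks[best]

# Hypothetical scores for two chunks
no_match = pick_answer([0.0, 0.0], ["chunk A", "chunk B"])
match = pick_answer([0.1, 0.7], ["chunk A", "chunk B"])
print(no_match)
print(match)
```

Returning `None` lets the caller print a "no answer found" message instead of a misleading chunk.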


answers = answer_question(question, chunks, vectorizer, tfidf_matrix)
print("\nBest answer:\n", "\n".join(answers), "\n")

In [None]:
# === MAIN EXECUTION ===
if __name__ == "__main__":
    file_path = "python.txt"  # Put your .txt file here
    text = load_text(file_path)
    # Note: with the default chunk_size=5, a short document can end up as a
    # single chunk, in which case every answer is the whole text
    chunks = chunk_text(text)
    vectorizer, tfidf_matrix = vectorize_chunks(chunks)

    print("Ask a question (type 'exit' to quit):")
    while True:
        question = input(">> ")
        # strip() so that stray spaces around 'exit' do not prevent quitting
        if question.strip().lower() == 'exit':
            break
        answers = answer_question(question, chunks, vectorizer, tfidf_matrix)
        print("\nAnswer based on document:\n", answers[0], "\n")

Ask a question (type 'exit' to quit):


>>  what are the uses of python



Answer based on document:
 Python is a high-level, interpreted programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Key Features of Python: Easy to Read & Write Syntax is clean and similar to English, Interpreted Language Python code is executed line-by-line which makes debugging easier, dynamically Typed You don’t need to declare data types explicitly, Object-Oriented & Functional Supports multiple programming paradigms, Large Standard Library Comes with many modules and packages (like math, datetime, os, etc. ), Cross-Platform Works on Windows, Mac, and Linux. Common Uses of Python: Web Development (with frameworks like Django, Flask), Data Science & Machine Learning (using NumPy, pandas, scikit-learn), Automation & Scripting, Game Development, APIs and Backend Services, Artificial Intelligence & Deep Learning. 

