
**Part 1: Importing Libraries and Reading Input**


In [3]:
# Import libraries
import nltk
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Read input from file
with open('input.txt', 'r', encoding='utf-8') as file:
    input_text = file.read()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Part 2: Preprocessing Function

In [4]:
# Preprocessing function
def preprocess_text(text):
    # Remove HTML tags and comments first so markup never reaches the tokenizer
    soup = BeautifulSoup(text, "html.parser")
    clean_text = soup.get_text(separator=" ")

    # Remove digits, punctuation, and other non-alphabetic characters
    clean_text = re.sub(r"[^a-zA-Z\s]", "", clean_text)

    # Lowercase and tokenize the cleaned text
    tokens = word_tokenize(clean_text.lower())

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the filtered tokens (not the raw, unfiltered words)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return " ".join(tokens)


# Tokenize and preprocess
processed_input = preprocess_text(input_text)
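
To see the order of the cleaning steps on a small input, here is a minimal, dependency-free sketch. It is not the notebook's function: `simple_preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the tag removal is a crude regular expression rather than BeautifulSoup, the stop-word list is a tiny stand-in for NLTK's, non-letters are replaced with a space rather than deleted, and lemmatization is skipped.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much larger).
DEMO_STOP_WORDS = {"the", "is", "a", "of", "and", "in", "on"}


def simple_preprocess(text: str) -> str:
    """Strip tags, keep letters only, lowercase, and drop stop words."""
    # Crude tag removal: fine for a demo, not a substitute for an HTML parser.
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Replace everything except letters and whitespace with a space.
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", no_tags)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)


print(simple_preprocess("<p>The cat is on the mat in 2024!</p>"))  # prints: cat mat
```

Stripping markup before tokenizing matters: if the raw text is tokenized first, tag names such as `p` or `div` can survive as ordinary tokens.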

Part 3: Context Window Slicing Algorithm



In [5]:
# Context Window Slicing Algorithm
def generate_slices(input_text, context_window_size=128):
    # Convert context_window_size (treated here as megabytes) to bytes
    context_window_bytes = context_window_size * 1024 * 1024

    # No slicing is needed if the whole text already fits
    if len(input_text.encode('utf-8')) <= context_window_bytes:
        return [input_text]

    # Split into slices on word boundaries so that no word is cut in half
    slice_size = context_window_bytes
    words = input_text.split()  # use the argument, not the global processed_input
    slices = []
    current_slice = ""

    for word in words:
        # Include the joining space when measuring the candidate slice
        candidate = f"{current_slice} {word}" if current_slice else word
        if len(candidate.encode('utf-8')) <= slice_size:
            current_slice = candidate
        else:
            if current_slice:
                slices.append(current_slice)
            current_slice = word

    if current_slice:
        slices.append(current_slice)

    # Keep a slice only if it is similar enough to the last slice that was kept
    final_slices = [slices[0]]
    for i in range(1, len(slices)):
        # Compare adjacent slices using cosine similarity of their TF-IDF vectors
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform([final_slices[-1], slices[i]])
        similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
        # Print debugging information
        print(f"Slice {i + 1}: Cosine Similarity = {similarity}")

        # Adjust the threshold as needed; slices at or below it are dropped
        if similarity > 0.2:
            final_slices.append(slices[i])

    return final_slices
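
The two ideas in this cell, packing whole words into a byte budget and comparing neighbouring slices, can be tried on a small input with the standard library alone. The sketch below is illustrative only: `pack_words` and `count_cosine` are hypothetical helpers, the byte budget is tiny on purpose, and the similarity uses raw term counts as a simpler stand-in for the TF-IDF vectors used in the notebook.

```python
import math
from collections import Counter


def pack_words(text: str, max_bytes: int) -> list[str]:
    """Greedily pack whole words into slices of at most max_bytes UTF-8 bytes."""
    slices: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                slices.append(current)
            # A single word longer than max_bytes still gets its own slice.
            current = word
    if current:
        slices.append(current)
    return slices


def count_cosine(a: str, b: str) -> float:
    """Cosine similarity of raw term counts (a simplification of TF-IDF)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


demo = pack_words("alpha beta gamma delta epsilon", max_bytes=11)
print(demo)  # prints: ['alpha beta', 'gamma delta', 'epsilon']
print(count_cosine("alpha beta", "gamma delta"))  # prints: 0.0
```

Slices that share no vocabulary score 0.0, so a threshold filter like the one above would drop them. Whether dropping low-similarity slices is desirable depends on the input, since it discards text rather than merging it.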


Part 4: Generate Slices and Save Them to a New Text File


In [8]:
# Generate slices from the preprocessed text
slices = generate_slices(processed_input)

# Save slices to a file
with open('slices_output.txt', 'w', encoding='utf-8') as output_file:
    for i, slice_text in enumerate(slices):
        output_file.write(f"Slice {i + 1}: {slice_text}\n")
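
Because each slice is written on its own line with a `Slice N:` prefix, the file can be read back into a list when the slices need to be handled one at a time. The following is a minimal sketch under that assumption; `parse_slices` and `SLICE_LINE` are hypothetical names, and the approach only works when no slice contains a newline of its own.

```python
import re

# Matches lines shaped like "Slice 3: some text".
SLICE_LINE = re.compile(r"^Slice (\d+): (.*)$")


def parse_slices(lines):
    """Return the slice texts from an iterable of 'Slice N: text' lines."""
    slices = []
    for line in lines:
        match = SLICE_LINE.match(line.rstrip("\n"))
        if match:
            slices.append(match.group(2))
    return slices


sample = ["Slice 1: first chunk of text\n", "Slice 2: second chunk\n"]
print(parse_slices(sample))  # prints: ['first chunk of text', 'second chunk']
```

An open file object can be passed directly, for example `parse_slices(open('slices_output.txt', encoding='utf-8'))`, since iterating over a text file yields its lines.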


**Part 2: Feeding the Sliced Text to the Model**

In [9]:
# Cell 1: Install the replicate library
!pip install replicate



Part-1: Import the necessary libraries, authenticate with the Replicate API, define the model, and read the sliced text.


In [10]:
import os
import replicate

# Read the API token from the environment instead of hard-coding it in the notebook
REPLICATE_API_TOKEN = os.environ["REPLICATE_API_TOKEN"]
client = replicate.Client(api_token=REPLICATE_API_TOKEN)

model_name = "meta/llama-2-70b-chat"

with open('slices_output.txt', 'r', encoding='utf-8') as input_file:
    slice_text = input_file.read()

Part-2: Provide the sliced text as input, ask a question based on the stored input, and run the model.


In [12]:
# Provide initial input to the model and collect the full streamed reply.
# Each event is only one chunk of the output, so the chunks are concatenated.
user_input = ""
for event in client.stream(
    model_name,
    input={
        "prompt": f"Initial Input:\n\n{slice_text}\n\nUser Input:"
    },
):
    user_input += str(event)

# Ask a question based on the stored input
user_question = input("You: ")

# Run the model with the stored input and the user's question
for event in client.stream(
    model_name,
    input={
        "prompt": f"{user_input}\n\nUser Question: {user_question}"
    },
):
    print(str(event), end="")


You: what is philosophy 

Philosophy is a field of study that involves critical thinking, analysis, and systematic inquiry into fundamental questions about existence, reality, knowledge, values, reason, and ethics. It is a discipline that seeks to understand and clarify the nature of things, the meaning of life, and the principles that govern human conduct. Philosophers often engage in debates, discussions, and arguments about various topics, and they use logical reasoning, empirical evidence, and historical context to support their views.

Philosophy encompasses many subfields, including metaphysics, epistemology,