### Section 1: Library Installations

> Install the libraries needed for Wikipedia scraping, natural language processing, and interaction with the Llama LLM through Replicate.


In [1]:
!pip install wikipedia-api
!pip install wikipedia
!pip install nltk
!pip install scikit-learn
!pip install replicate

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting replicate
  Downloading replicate-0.23.1-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.21.0 (from replicate)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting pydantic>1 (from replicate)
  Downloading pydantic-2.6.0-py3-none-any.whl.metadata (81 kB)



### Section 2: Importing Libraries


In [2]:
import json
import wikipedia
import wikipediaapi
import re
import os
import replicate
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Section 3: Fetching Wikipedia Content

> Fetch Wikipedia content for the search term 'Medical Science' using the Wikipedia API, store it in a dictionary keyed by page title, and save it to a JSON file named 'medical_data_2MB.json'. Fetching stops as soon as the next page would push the dataset past 2 MB.






In [11]:
# Define the topic of interest
medical_title = 'Medical Science'

# Assign a user_agent for Wikipedia API requests
user_agent = 'hailegabriel/1.0 (bentitan98@gmail.com)'
wiki_wiki = wikipediaapi.Wikipedia(user_agent, 'en', wikipediaapi.ExtractFormat.WIKI)

# Initialize variables for data storage and size control
my_file = {}
total_size_mb = 0

# Fetch Wikipedia content related to the medical_title
for title in wikipedia.search(medical_title, results=500):  # Adjust result count as needed
    # Retrieve page content
    page_content = wiki_wiki.page(title).text

    # Calculate content size in megabytes
    content_size_mb = len(page_content.encode('utf-8')) / (1024 ** 2)

    # Limit the dataset size to 2 MB
    if total_size_mb + content_size_mb > 2:
        break

    # Store page content in dictionary, keyed by page title
    my_file[title] = page_content

    # Update total size
    total_size_mb += content_size_mb

# Create a JSON file and write the content
with open('medical_data_2MB.json', 'w', encoding='utf-8') as json_file:
    json.dump(my_file, json_file, ensure_ascii=False, indent=4)
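The size-control logic can be tried without network access. The following is a minimal sketch, assuming a hypothetical helper `take_within_budget` and made-up sample pages; it mirrors the loop above by stopping at the first page that would exceed the budget.

```python
def take_within_budget(pages, budget_bytes):
    """Keep (title, text) pairs in order until the next one would exceed the budget."""
    kept, used = {}, 0
    for title, text in pages:
        size = len(text.encode("utf-8"))  # size is measured in UTF-8 bytes
        if used + size > budget_bytes:
            break  # stop at the first page that does not fit
        kept[title] = text
        used += size
    return kept, used


# Made-up pages of 600, 300, and 200 bytes against a 1000-byte budget
sample_pages = [("A", "x" * 600), ("B", "y" * 300), ("C", "z" * 200)]
kept, used = take_within_budget(sample_pages, budget_bytes=1000)
print(list(kept), used)
```

Because the loop uses `break`, later pages that would still fit are not considered; replacing `break` with `continue` would fill the budget more tightly at the cost of more requests.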


### Section 4: Text Cleaning and Conversion

> Define text cleaning functions that remove newline characters, section headings, reference-style links, extra whitespace, backslashes, and double quotes. A second function cleans every page in the JSON file, joins the results into one string, and saves it to a text file.

In [12]:
def clean_wikipedia_text(wiki_text):
    # Replace newlines and '== Heading ==' section markers with a space
    cleaned_text = re.sub(r'\n|==.*?==', ' ', wiki_text)

    # Remove any remaining reference-style links (e.g., [1], [2])
    cleaned_text = re.sub(r'\[\d+\]', '', cleaned_text)

    # Remove extra whitespaces
    cleaned_text = ' '.join(cleaned_text.split())

    # Remove backslashes and double-quotes
    cleaned_text = cleaned_text.replace('\\', '').replace('"', '')

    return cleaned_text

def clean_and_convert_to_string(input_file):
    # Read the JSON file
    with open(input_file, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Clean each value in the JSON using clean_wikipedia_text function
    cleaned_data = {key: clean_wikipedia_text(value) for key, value in data.items()}

    # Convert the cleaned data to a single string
    cleaned_text = ' '.join(cleaned_data.values())

    # Specify the path using double backslashes or a raw string
    cleaned_medical_data_path = 'cleaned_medical_data.txt'  # or r'C:\Users\Admin\Documents\nlp\cleaned_medical_data.txt'

    # Write cleaned text to a new file
    with open(cleaned_medical_data_path, 'w', encoding='utf-8') as file:
        file.write(cleaned_text)

    return cleaned_medical_data_path
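To see what each cleaning step does, the same four operations can be applied one at a time to a short made-up string. This is only an illustration; the sample text is an assumption, not data from the notebook.

```python
import re

# Made-up sample containing newlines, a section heading, a reference marker, and quotes
sample = 'Medicine\n\n== History ==\nAncient texts[1] describe "healing".'

step1 = re.sub(r'\n|==.*?==', ' ', sample)        # newlines and heading -> spaces
step2 = re.sub(r'\[\d+\]', '', step1)             # drop [1]-style references
step3 = ' '.join(step2.split())                   # collapse runs of whitespace
step4 = step3.replace('\\', '').replace('"', '')  # drop backslashes and double quotes

print(step4)
```

Note that the heading text itself ("History") is removed together with its `==` markers.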




> Run the cleaning step on the JSON file, then read the resulting text file of cleaned medical data into a variable.



In [14]:
# Use the wikipedia medical json file

input_json_file = r'medical_data_2MB.json'
cleaned_text_path = clean_and_convert_to_string(input_json_file)

# Read the cleaned text file
with open(cleaned_text_path, 'r', encoding='utf-8') as file:
    input_text = file.read()

### Section 5: Preprocessing Functions

> Preprocess text by replacing punctuation and other special characters with spaces (this also strips the '#' and '@' from hashtags and mentions), tokenizing with NLTK, converting to lowercase, removing stopwords, and lemmatizing with WordNetLemmatizer.

In [15]:
# Preprocessing function

def preprocess_text(text):
    # Special character removal: replace every non-word, non-space character with a space
    clean_text = re.sub(r"(?<=\w)[^\w\s]+(?=\w)|[^\w\s]", ' ', text)

    # Tokenization: Use nltk.word_tokenize or explore other methods
    tokens = nltk.word_tokenize(clean_text)

    # Lowercasing: Convert text to lowercase
    tokens = [token.lower() for token in tokens]

    # Stopword Removal: Customize the list of stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization: Use WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return " ".join(tokens)
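The effect of the special-character pattern is easiest to check in isolation. The sketch below applies the same regular expression to a made-up string and splits on whitespace (a stand-in for NLTK tokenization, used here only so the example runs without downloads).

```python
import re

# Same pattern as in preprocess_text
pattern = r"(?<=\w)[^\w\s]+(?=\w)|[^\w\s]"

sample = "#health @who state-of-the-art care!"
words = re.sub(pattern, " ", sample).split()
print(words)
```

The `#` and `@` prefixes are removed and hyphenated words are split apart, so hashtags and mentions survive only as plain words.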

### Section 6: Cosine Similarity Calculation



> Check the cosine similarity between two text slices using TF-IDF vectorization. The function returns True if the similarity is equal to or greater than a specified threshold.

In [16]:
def cosine_similarity_checker(slice1, slice2, threshold=0.7):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([slice1, slice2])
    similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    return similarity >= threshold
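For intuition about the score being thresholded, here is a simplified stand-in that computes cosine similarity over raw word counts using only the standard library. It is not the TF-IDF weighting used above, so its values will differ from those of `cosine_similarity_checker`; the helper name `count_cosine` is hypothetical.

```python
import math
from collections import Counter


def count_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors (no IDF weighting)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two of three words are shared, so the score is 2 / (sqrt(3) * sqrt(3)) = 2/3
score = count_cosine("heart disease treatment", "heart disease prevention")
print(round(score, 3))
```

Texts with no words in common score 0, and identical texts score 1 (up to floating-point rounding).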

### Section 7: Text Slicing


> Generate text slices from the input text, subject to a size limit in kilobytes. The function tokenizes the input into sentences and packs them into slices so that no slice exceeds the limit. A finished slice is kept only if, after preprocessing, its cosine similarity to every previously kept slice is below the threshold used by cosine_similarity_checker (0.7 by default); otherwise it is discarded as a near-duplicate.



In [18]:
def generate_slices_by_size(input_text, size_limit=20):
    # Convert size_limit to bytes
    size_limit_bytes = size_limit * 1024

    # Tokenize the input text
    sentences = sent_tokenize(input_text)

    # Generate slices
    slices = []
    current_slice = b""

    for sentence in sentences:
        # Encode the current sentence as UTF-8 bytes
        sentence_bytes = sentence.encode('utf-8')

        # Check whether adding the next sentence would exceed the size limit
        if len(current_slice) + len(sentence_bytes) + 1 <= size_limit_bytes:
            current_slice += b" " + sentence_bytes
        else:
            # Check for duplicate before adding the new slice
            current_preprocessed_slice = preprocess_text(current_slice.decode('utf-8'))

            # Preprocess each previous slice before comparison
            if not any(cosine_similarity_checker(current_preprocessed_slice, preprocess_text(previous_slice.decode('utf-8'))) for previous_slice in slices):
                slices.append(current_slice)

            # Start a new slice with the current sentence
            current_slice = b" " + sentence_bytes

    # Check for the last slice after the loop
    if current_slice:
        current_preprocessed_slice = preprocess_text(current_slice.decode('utf-8'))
        # Preprocess each previous slice before comparison
        if not any(cosine_similarity_checker(current_preprocessed_slice, preprocess_text(previous_slice.decode('utf-8'))) for previous_slice in slices):
            slices.append(current_slice)

    return slices
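The packing step on its own, without the similarity filter, can be sketched as follows. The helper name `pack_sentences` and the sample sentences are assumptions for illustration; unlike the function above, this version works on strings and does not add a leading space to each chunk.

```python
def pack_sentences(sentences, limit_bytes):
    """Greedily group sentences into chunks of at most limit_bytes UTF-8 bytes."""
    chunks, current, current_size = [], [], 0
    for sentence in sentences:
        size = len(sentence.encode("utf-8"))
        extra = size + (1 if current else 0)  # +1 for the joining space
        if current and current_size + extra > limit_bytes:
            chunks.append(" ".join(current))  # close the full chunk
            current, current_size = [sentence], size
        else:
            current.append(sentence)
            current_size += extra
    if current:
        chunks.append(" ".join(current))  # flush the final chunk
    return chunks


chunks = pack_sentences(["aaaa", "bbbb", "cccc"], limit_bytes=9)
print(chunks)
```

A single sentence longer than the limit still becomes its own chunk in this sketch, so the limit is a target rather than a hard guarantee.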


> Generate text slices from the provided input text and save them to a text file. Each slice is written on a new line in the file.

In [19]:
slices = generate_slices_by_size(input_text)

# Save slices to a txt file, decoding each bytes slice so the file holds plain text
with open('slices.txt', 'w', encoding='utf-8') as output_file:
    for text_slice in slices:
        output_file.write(text_slice.decode('utf-8') + "\n")

### Section 8: Regeneration and Similarity Evaluation



> Reconstruct the text by joining the retained slices and save the result to a file. Because near-duplicate slices were discarded during slicing, the regenerated text is shorter than the original; the cells that follow measure how similar the two texts remain.



In [20]:
def regenerate_sliced_file(slices, output_file):
    # Join the list of slices to reconstruct the original text
    reconstructed_text = b"".join(slices).decode('utf-8')

    # Write the reconstructed text to the output file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(reconstructed_text)

# Specify the output file path
regenerated_file_path = r'regenerated_medical_data.txt'  # or r'C:\Users\Admin\Documents\nlp\regenerated_medical_data.txt'

# Call the function to regenerate the sliced file
regenerate_sliced_file(slices, regenerated_file_path)

# Read the regenerated file back to confirm it was written correctly
with open(regenerated_file_path, 'r', encoding='utf-8') as file:
    regenerated_text = file.read()


> This code calculates the cosine similarity between the original and regenerated texts after preprocessing.

In [22]:
def calculate_similarity(original_text, regenerated_text):
    # Preprocess the texts
    preprocessed_original = preprocess_text(original_text)
    preprocessed_regenerated = preprocess_text(regenerated_text)

    # Calculate cosine similarity
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([preprocessed_original, preprocessed_regenerated])
    similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]

    return similarity

> Compute the similarity score between the original and regenerated text, together with their sizes and word counts, to gauge how much content the slicing step removed.

In [23]:
# Calculate similarity between original and regenerated text
similarity_score = calculate_similarity(input_text, regenerated_text)

# Print the results
print("Similarity Score:", similarity_score)
print("Original Text Size (Bytes):", len(input_text.encode('utf-8')))
print("Regenerated Text Size (Bytes):", len(regenerated_text.encode('utf-8')))
print("Original Text Word Count:", len(word_tokenize(input_text)))
print("Regenerated Text Word Count:", len(word_tokenize(regenerated_text)))

Similarity Score: 0.9866304669360232
Original Text Size (Bytes): 2088620
Regenerated Text Size (Bytes): 1619998
Original Text Word Count: 352710
Regenerated Text Word Count: 272120
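The printed figures can be turned into retention ratios with a little arithmetic. The numbers below are copied from the output above; the variable names are chosen for illustration.

```python
# Values copied from the printed results
original_bytes, regenerated_bytes = 2088620, 1619998
original_words, regenerated_words = 352710, 272120

byte_retention = regenerated_bytes / original_bytes
word_retention = regenerated_words / original_words

print(f"Bytes retained: {byte_retention:.1%}")
print(f"Words retained: {word_retention:.1%}")
```

Roughly 78% of the bytes and 77% of the words survive, so about a fifth of the text was dropped, presumably by the similarity filter, while the TF-IDF cosine similarity to the original stays near 0.99.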


### Section 9: Model Interaction with Replicate

> This code demonstrates interaction with Meta's Llama 2 chat model hosted on Replicate. It loads the text slices from a file, sends them to the model as initial input, and then asks a question supplied by the user.
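The slices file holds well over a megabyte of text, which is far more than a Llama 2 model's 4,096-token context window can take in a single prompt. One possible way to keep a prompt bounded is sketched below; the helper `build_prompt`, its prompt layout, and the 6,000-character budget are all assumptions for illustration and are not part of the Replicate API.

```python
def build_prompt(context, question, max_context_chars=6000):
    """Combine a trimmed context and a question into one prompt string."""
    trimmed = context[:max_context_chars]  # naive truncation by character count
    return f"Context:\n\n{trimmed}\n\nUser Question: {question}"


# Made-up context that is much longer than the budget
long_context = "Medicine is the science of healing. " * 1000
prompt = build_prompt(long_context, "what is medicine?")
print(len(prompt))
```

Truncating by characters is crude; selecting the slices most relevant to the question would make better use of the same budget.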

In [25]:
import os
import replicate

# Read the API token from the environment instead of hard-coding it in the notebook
REPLICATE_API_TOKEN = os.environ["REPLICATE_API_TOKEN"]

# Initialize the Replicate client with your API token
client = replicate.Client(api_token=REPLICATE_API_TOKEN)

# Define your model
model_name = "meta/llama-2-70b-chat"

# Read slice text from slices.txt
with open('slices.txt', 'r', encoding='utf-8') as input_file:
    slice_text = input_file.read()

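# Provide initial input to the model and collect the full streamed response
response_parts = []
for event in client.stream(
    model_name,
    input={
        "prompt": f"Initial Input:\n\n{slice_text}\n\nUser Input:"
    },
):
    response_parts.append(str(event))
user_input = "".join(response_parts)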

# Ask questions based on the stored input
user_question = input("You: ")

# Run the model with the stored input and the user's question
for event in client.stream(
    model_name,
    input={
        "prompt": f"{user_input}\n\nUser Question: {user_question}"
    },
):
    print(str(event), end="")

You: what is medicine?


Hello! I'm happy to help you with your question.

Medicine is a broad term that refers to the science and practice of preventing, diagnosing, and treating diseases or medical conditions. It encompasses various fields such as pharmacology, surgery, psychiatry, and epidemiology, among others. The primary goal of medicine is to promote health, well-being, and quality of life for individuals and communities.

There are many different types of medicine, including:

1. Western medicine: Also known as allopathic medicine, this