In [81]:
from pprint import pprint
from dotenv import dotenv_values
import openai
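# Load key-value pairs from the local .env file without modifying os.environ.
env_vars = dotenv_values('.env')
# Pass the key straight to the client; the module-level openai.api_key is not needed.
openai_client = openai.OpenAI(api_key=env_vars.get('OPENAI_API_KEY'))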
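
If `.env` is missing or does not define the key, `env_vars.get('OPENAI_API_KEY')` returns `None`, and the problem is only reported later by the client. A minimal sketch of an earlier check, assuming the key may also be supplied through the process environment; `resolve_api_key` is a hypothetical helper, not part of `openai` or `python-dotenv`:

```python
import os


def resolve_api_key(env_vars, environ=None, name="OPENAI_API_KEY"):
    """Return the key from the .env mapping, else from the environment.

    Hypothetical helper: raises a clear error instead of passing None along.
    """
    environ = os.environ if environ is None else environ
    key = env_vars.get(name) or environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in .env or the environment")
    return key


# Plain dicts stand in for dotenv_values('.env') and os.environ.
print(resolve_api_key({"OPENAI_API_KEY": "sk-example"}, environ={}))
```

The returned value could then be passed as `api_key` when the client is created.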

In [82]:
def fullopenai_model_answer(query, retrieved_documents, model="gpt-4-turbo-preview"):
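    # Accept either a single string or a list of document strings;
    # joining a bare string would insert the separator between characters.
    if isinstance(retrieved_documents, str):
        retrieved_documents = [retrieved_documents]
    information = "\n\n".join(retrieved_documents)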

    messages = [
        {
            "role": "system",
            "content": "As an assistant, your role is to be helpful and provide answers in a concise and summary-style manner. Your objective is to draw from the available document and your knowledge to deliver accurate and concise information. If you encounter a question for which you lack the answer or cannot find the relevant information, it's best to respond with a brief summary or let the user know that the answer is not available. Avoid including specific code examples and instead focus on offering assistance whenever possible. Your primary aim is to deliver clear and precise responses, ensuring that the information you provide is helpful to the user."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
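
`fullopenai_model_answer` places every retrieved document in a single prompt, so a large retrieval can exceed the model's context window. A rough sketch of capping the input before the call, assuming a character budget as a crude stand-in for a token count; `select_documents` and the `max_chars` value are illustrative and not part of the notebook:

```python
def select_documents(documents, max_chars):
    """Keep documents in order until the next one would exceed max_chars.

    The two-character separator used when joining counts against the budget.
    """
    selected = []
    used = 0
    for doc in documents:
        # Every document after the first costs two extra separator characters.
        extra = len(doc) + (2 if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(doc)
        used += extra
    return selected


docs = ["a" * 10, "b" * 10, "c" * 10]
kept = select_documents(docs, max_chars=25)
information = "\n\n".join(kept)
print(len(kept), len(information))  # 2 22
```

A token-based budget would be more accurate than a character count, at the cost of an extra tokenizer dependency.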

In [83]:
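def openai_model_answer(query, retrieved_documents, model="gpt-4-turbo-preview"):
    # Split the raw text into chunks of roughly max_chunk_length characters.
    information_chunks = split_into_chunks(retrieved_documents, max_chunk_length=4096)
    conversation = create_initial_conversation(query, information_chunks[0])
    # Only the first two chunks are sent; later chunks are ignored.
    for chunk in information_chunks[:2]:
        # Add the chunk before the request so the model actually sees it.
        conversation['messages'].append(create_user_message(chunk))
        response = openai_client.chat.completions.create(
            model=model,
            messages=conversation['messages']
        )
        process_response(response)
        # Store the reply so the last message in the conversation is the answer.
        conversation['messages'].append(
            {'role': 'assistant', 'content': response.choices[0].message.content}
        )

    final_response = get_final_response(conversation)
    return final_response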

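def split_into_chunks(text, max_chunk_length):
    # Group whole lines into chunks of at most max_chunk_length characters.
    # A single line longer than the limit is kept intact as its own chunk.
    chunks = []
    current_chunk = ""
    for line in text.splitlines():
        # Start a new chunk when this line would overflow the current one.
        # The current_chunk check avoids emitting an empty chunk when a
        # single line exceeds the limit.
        if current_chunk and len(current_chunk) + len(line) > max_chunk_length:
            chunks.append(current_chunk.strip())
            current_chunk = ""
        current_chunk += line + "\n"
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks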

def create_initial_conversation(query, initial_chunk):
    return {
        'messages': [
            {'role': 'system', 'content': "As an assistant, your role is to be helpful and provide answers in a concise and summary-style manner. Your objective is to draw from the available document and your knowledge to deliver accurate and concise information. If you encounter a question for which you lack the answer or cannot find the relevant information, it's best to respond with a brief summary or let the user know that the answer is not available. Avoid including specific code examples and instead focus on offering assistance whenever possible. Your primary aim is to deliver clear and precise responses, ensuring that the information you provide is helpful to the user."},
            {'role': 'user', 'content': f'Question: {query}'},
            {'role': 'assistant', 'content': initial_chunk}
        ]
    }

def create_user_message(content):
    return {'role': 'user', 'content': content}

def process_response(response):
    print(response.choices[0].message.content)

def get_final_response(conversation):
    return conversation['messages'][-1]['content']

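# Read explicitly as UTF-8, since the text may contain non-ASCII (Amharic) characters.
with open('../txt/w7/cluster_4_texts.txt', 'r', encoding='utf-8') as file: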
    retrieved_documents = file.read()
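
The loop in `openai_model_answer` looks at no more than the first two chunks. One alternative, sketched here under assumptions rather than taken from the notebook, is a map-reduce pass: ask the question against each chunk, then ask once more over the partial answers. The `ask` parameter is a hypothetical callable that takes a prompt string and returns the model's text, so the control flow can be tried without network access:

```python
def answer_over_chunks(query, chunks, ask):
    """Map-reduce style answering over all chunks.

    ask: callable(prompt: str) -> str, for example a thin wrapper around a
    chat completion call (assumed; not defined in the notebook).
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Map step: one partial answer per chunk.
    partial_answers = [
        ask(f"Question: {query}\nInformation: {chunk}") for chunk in chunks
    ]
    if len(partial_answers) == 1:
        return partial_answers[0]
    # Reduce step: combine the partial answers into a single response.
    combined = "\n\n".join(partial_answers)
    return ask(f"Question: {query}\nCombine these partial answers:\n{combined}")


# Stand-in for the model so the sketch runs offline: it numbers its replies.
calls = []


def fake_ask(prompt):
    calls.append(prompt)
    return f"answer {len(calls)}"


result = answer_over_chunks("What is covered?", ["chunk one", "chunk two"], fake_ask)
print(result)  # answer 3 (two map calls, then one reduce call)
```

With the real client, `ask` could wrap `openai_client.chat.completions.create` and return `response.choices[0].message.content`. The number of requests grows with the number of chunks, so cost and latency grow with document length.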


In [84]:
query = "Evaluate the document and provide an assessment of its overall content"
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The document showcases a series of Python code blocks aimed at processing textual data for various tasks such as cleaning, normalizing, and vectorizing text, primarily focused on Amharic, a language spoken in Ethiopia. The text processing includes removing null values, hashtags, emojis, symbols, links, mentions, and standardizing characters for consistency. Additionally, it demonstrates tokenizing texts, extracting features like symbols and emojis, and then applying a Word2Vec model to tokenize sentences for embedding. Towards the end, it shifts towards downloading and utilizing models from Hugging Face, a popular repository for machine learning models, specifically for natural language processing tasks.

Here's a brief overview of the main processes covered:

1. **Preprocessing and Cleaning Data:** This includes reading CSV files, handling missing values, normalizing text (like replacing variants of letters with a consistent version), and extracting specific features such as hashtags,

In [85]:
query = "Evaluate the complexity level of the Python code collection and provide an assessment of its overall difficulty, ranging from beginner-friendly to advanced. Highlight any notable patterns or structures that contribute to its complexity."
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The Python code collection presents a moderate-to-advanced difficulty level, catering to those with some familiarity with Python, data manipulation using pandas, natural language processing (NLP), and machine learning (ML) models. Here's a rundown of the aspects contributing to its complexity:

1. **Use of External Libraries**: The code extensively uses external libraries such as pandas for data manipulation, Gensim for Word2Vec models, and Hugging Face's transformers for working with pre-trained language models. Familiarity with these libraries is essential.

2. **Data Processing and NLP Techniques**: Demonstrated are various preprocessing steps typical in NLP tasks, such as emoji and URL extraction and removal, text normalization, and handling of special symbols and mentions. Such operations require understanding of both Python and basic NLP principles.

3. **Working with Word2Vec Models**: Part of the code involves training a Word2Vec model using the Gensim library. This involves un

In [86]:
query = 'Perform a comprehensive analysis of the document and provide insights on its content, structure, and potential improvements'
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The document outlines a multifaceted process involving data extraction, cleaning, transformation, and machine learning model training. Presented as a series of Python code snippets, it encompasses reading, cleaning, preprocessing text data, and applying Word2Vec for generating word embeddings, followed by the integration of a model from Hugging Face Hub. Below is a breakdown analysis and suggested improvements:

### Content Analysis
1. **Data Processing and Cleaning:**
   - Reading CSV and JSON data.
   - Handling missing values and duplicate data.
   - Text normalization, including replacing specific characters, removing emojis, links, mentions, and non-alphanumeric symbols.

2. **Feature Extraction:**
   - Extracting hashtags, emojis, symbols, URLs, and mentions from the text data.

3. **Text Preprocessing:**
   - Standardizing certain Amharic letters.
   - Removing unnecessary whitespace and punctuation.

4. **Word2Vec Model Training:**
   - Tokenizing text data.
   - Training a Wor

In [87]:
query = "Please evaluate the supplied Python code, concentrating on Python and SQL keywords: how classes are used, how many functions are defined, and other related syntax."
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The provided Python script spans several tasks, primarily focused on text preprocessing, data cleaning, and utilising machine learning for text vectorization. Let's break it down focusing on the Python and SQL keywords and syntax used, particularly classes, functions, and other related syntax:

1. **Import Statements**:
   - The script imports various modules and functions at different points, indicating dependencies on libraries like pandas, gensim, transformers, and a custom module (utils.preprocessing).

2. **Variables and Path Settings**:
   - Variables like `parsed_dir`, `cleaned_dir`, and `file_name` are set, which are used for specifying file paths.

3. **Classes and Methods**:
   - A class `MyPreprocessing` from the custom module `utils.preprocessing` is instantiated multiple times in different parts of the script. This class is presumably equipped with methods for text preprocessing.
   - Various methods from pandas (`read_csv`, `dropna`, `replace`, `apply`, `str`, and `to_csv
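
Counts such as how many functions or classes a script defines can also be checked deterministically, which gives a baseline to compare the model's description against. A small sketch using the standard-library `ast` module; it covers Python only (not SQL), and the sample source string is made up for illustration rather than taken from the analysed document:

```python
import ast


def count_definitions(source):
    """Count function, class, and import nodes in Python source code.

    Methods defined inside classes are counted as functions too.
    """
    counts = {"functions": 0, "classes": 0, "imports": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            counts["functions"] += 1
        elif isinstance(node, ast.ClassDef):
            counts["classes"] += 1
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            counts["imports"] += 1
    return counts


sample = '''
import re

class Cleaner:
    def strip_hashtags(self, text):
        return text.replace("#", "")

def tokenize(text):
    return text.split()
'''
print(count_definitions(sample))
```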