In [9]:
import pickle
from pprint import pprint

import openai
from dotenv import dotenv_values

# Read the API key from the local .env file rather than hard-coding it.
env_vars = dotenv_values('.env')
openai.api_key = env_vars.get('OPENAI_API_KEY')
openai_client = openai.OpenAI(api_key=openai.api_key)
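`env_vars.get('OPENAI_API_KEY')` returns `None` when the key is absent from `.env`, and the problem would then only surface later, when the client is created or first used. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_api_key` (not part of the original notebook):

```python
def require_api_key(env, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    `env` is any dict-like object, such as the result of dotenv_values('.env').
    The helper name and the error message are illustrative assumptions.
    """
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return key

# Possible use in the cell above (not executed here):
# openai_client = openai.OpenAI(api_key=require_api_key(env_vars))
```

Raising early keeps a configuration mistake separate from errors that come back from the API itself.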

In [32]:
with open("../txt/pickle/cluster_lists.pkl", 'rb') as file:
    loaded_clusters = pickle.load(file)

retrieved_documents = loaded_clusters[0]
print(len(retrieved_documents))

14


In [35]:
def openai_model_answer(query, retrieved_documents):
    """Send the same query once per cluster and return one answer per cluster.

    Each element of retrieved_documents is expected to be an iterable of strings.
    """
    cluster_analysis = []

    for cluster in retrieved_documents:
        messages = [
            {"role": "system", "content": "As an assistant, your role is to be helpful and provide answers in a concise and summary-style manner. Your objective is to draw from the available document and your knowledge to deliver accurate and concise information. If you encounter a question for which you lack the answer or cannot find the relevant information, it's best to respond with a brief summary or let the user know that the answer is not available. Avoid including specific code examples and instead focus on offering assistance whenever possible. Your primary aim is to deliver clear and precise responses, ensuring that the information you provide is helpful to the user."},
            {"role": "user", "content": '\n'.join(cluster) + '\n' + query}
        ]

        response = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
        )

        analysis = response.choices[0].message.content.strip()
        cluster_analysis.append(analysis)
    
    return cluster_analysis
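The call `'\n'.join(cluster)` only behaves as intended when each cluster is a sequence of strings. If a cluster is a single string, `join` iterates over its characters and puts a newline between every one of them. The structure of `cluster_lists.pkl` is not shown here, so the following is a sketch under that uncertainty, with `cluster_to_text` as a hypothetical helper name:

```python
def cluster_to_text(cluster):
    """Turn one cluster into a single prompt string.

    Assumption: a cluster is either one string or an iterable of
    string-like items. A plain string is returned unchanged so that
    it is not split into individual characters by str.join.
    """
    if isinstance(cluster, str):
        return cluster
    return "\n".join(str(part) for part in cluster)

# What goes wrong without the guard:
broken = "\n".join("abc")      # 'a\nb\nc'
fixed = cluster_to_text("abc")  # 'abc'
```

Inside `openai_model_answer`, `cluster_to_text(cluster)` could stand in for `'\n'.join(cluster)` if the pickled data turns out to mix both shapes.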

In [34]:
query = "Analyse the data passed to you and give a general overview of it."
response = openai_model_answer(query, retrieved_documents)
pprint(response)

['The provided script appears to be a complex Python application designed for '
 'language processing and retrieval tasks, integrating various components and '
 "libraries to accomplish its objectives. Here's a general overview of its "
 'components and functionalities:\n'
 '\n'
 '1. **Environment and Dependency Management**: The application utilizes '
 '`python-dotenv` for managing environment variables, which is a common '
 'practice for configuring different aspects of an application outside the '
 'code base.\n'
 '\n'
 "2. **Language Model Integration**: It integrates OpenAI's language models "
 'through `ChatOpenAI` for generating text-based responses. This suggests the '
 'application might be used for chatbot or conversational AI purposes.\n'
 '\n'
 '3. **Text Processing and Management**: It involves text processing '
 'utilities, including a text loader and splitter, indicating the application '
 'processes and analyzes large texts or documents, possibly to extract '
 'relevant

In [None]:
query = "Evaluate the document and provide an assessment of its overall content"
response = openai_model_answer(query, retrieved_documents)
pprint(response)

In [85]:
query = "Evaluate the complexity level of the Python code collection and provide an assessment of its overall difficulty, ranging from beginner-friendly to advanced. Highlight any notable patterns or structures that contribute to its complexity."
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The Python code collection presents a moderate-to-advanced difficulty level, catering to those with some familiarity with Python, data manipulation using pandas, natural language processing (NLP), and machine learning (ML) models. Here's a rundown of the aspects contributing to its complexity:

1. **Use of External Libraries**: The code extensively uses external libraries such as pandas for data manipulation, Gensim for Word2Vec models, and Hugging Face's transformers for working with pre-trained language models. Familiarity with these libraries is essential.

2. **Data Processing and NLP Techniques**: Demonstrated are various preprocessing steps typical in NLP tasks, such as emoji and URL extraction and removal, text normalization, and handling of special symbols and mentions. Such operations require understanding of both Python and basic NLP principles.

3. **Working with Word2Vec Models**: Part of the code involves training a Word2Vec model using the Gensim library. This involves un

In [86]:
query = 'Perform a comprehensive analysis of the document and provide insights on its content, structure, and potential improvements'
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The document outlines a multifaceted process involving data extraction, cleaning, transformation, and machine learning model training. Presented as a series of Python code snippets, it encompasses reading, cleaning, preprocessing text data, and applying Word2Vec for generating word embeddings, followed by the integration of a model from Hugging Face Hub. Below is a breakdown analysis and suggested improvements:

### Content Analysis
1. **Data Processing and Cleaning:**
   - Reading CSV and JSON data.
   - Handling missing values and duplicate data.
   - Text normalization, including replacing specific characters, removing emojis, links, mentions, and non-alphanumeric symbols.

2. **Feature Extraction:**
   - Extracting hashtags, emojis, symbols, URLs, and mentions from the text data.

3. **Text Preprocessing:**
   - Standardizing certain Amharic letters.
   - Removing unnecessary whitespace and punctuation.

4. **Word2Vec Model Training:**
   - Tokenizing text data.
   - Training a Wor

In [87]:
query = "Please evaluate the supplied Python code, concentrating on Python and SQL keywords and syntax, such as how classes are used and how many functions are defined."
response = openai_model_answer(query, retrieved_documents)
pprint(response)

The provided Python script spans several tasks, primarily focused on text preprocessing, data cleaning, and utilising machine learning for text vectorization. Let's break it down focusing on the Python and SQL keywords and syntax used, particularly classes, functions, and other related syntax:

1. **Import Statements**:
   - The script imports various modules and functions at different points, indicating dependencies on libraries like pandas, gensim, transformers, and a custom module (utils.preprocessing).

2. **Variables and Path Settings**:
   - Variables like `parsed_dir`, `cleaned_dir`, and `file_name` are set, which are used for specifying file paths.

3. **Classes and Methods**:
   - A class `MyPreprocessing` from the custom module `utils.preprocessing` is instantiated multiple times in different parts of the script. This class is presumably equipped with methods for text preprocessing.
   - Various methods from pandas (`read_csv`, `dropna`, `replace`, `apply`, `str`, and `to_csv

In [None]:
def fullopenai_model_answer(query, retrieved_documents, model="gpt-4-turbo-preview"):
    """Answer the query once, using all retrieved documents joined into one prompt."""
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "As an assistant, your role is to be helpful and provide answers in a concise and summary-style manner. Your objective is to draw from the available document and your knowledge to deliver accurate and concise information. If you encounter a question for which you lack the answer or cannot find the relevant information, it's best to respond with a brief summary or let the user know that the answer is not available. Avoid including specific code examples and instead focus on offering assistance whenever possible. Your primary aim is to deliver clear and precise responses, ensuring that the information you provide is helpful to the user."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
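Because `fullopenai_model_answer` joins every document into one prompt, the request can grow large. One way to keep that step inspectable without calling the API is to build the message list in a separate function. The sketch below assumes a hypothetical helper `build_messages` and a character budget `max_chars`. The budget is only a crude stand-in for a token limit, and its default value is arbitrary rather than taken from the notebook.

```python
def build_messages(query, documents, system_prompt, max_chars=20000):
    """Build the chat message list for a single combined request.

    Assumptions: `documents` is a list of strings, and `max_chars` is a
    rough character budget (not a token count) chosen for illustration.
    """
    information = "\n\n".join(documents)
    if len(information) > max_chars:
        # Keep the beginning of the joined text and drop the rest.
        information = information[:max_chars]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]

# Example with toy inputs; no API call is made.
example = build_messages("What is this?", ["first doc", "second doc"], "Be concise.")
```

The returned list has the same shape as the `messages` argument passed to `openai_client.chat.completions.create` above, so it can be printed and checked before any request is sent.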