# Fine-Tune Large Language Model for Behavioral Activation Chatbot

## 1.Research Question
Behavioral Activation is a therapy method that helps reduce symptoms of depression and mood disorders by promoting involvement in rewarding activities. Recently, Large Language Models (LLMs) like GPT have introduced more intelligent chatbot capabilities. 

However, these LLMs are generalized for a wide range of conversations and aren't tailored specifically for tasks like Behavioral Activation. The challenge is to adapt these advanced LLMs to effectively assist in Behavioral Activation through chatbot interactions.

How can we develop a chatbot that bridges the gap between the general capabilities of LLMs and the specific requirements of Behavioral Activation?

We chose ChatGPT as our LLM. The dataset was provided by Florian Onur Kuhlmeier and Sven Scheu.
We start with data preprocessing.

## 2.Data Preprocessing for Fine Tuning (final version: messages.jsonl)
### Data understanding
•	id: the ID of the message (one ID per row) -> ignore
•	conversation_id: an ID that signals which conversation the message belongs to.
•	flow_id: the ID of the therapy session (Behavioral Activation was made up of three sessions / flows: verhaltensaktivierung-1, verhaltensaktivierung-2, verhaltensaktivierung-3).
•	step_id: every flow consists of multiple steps -> ignore
•	direction: SEND (by chatbot) vs. RECEIVE (by user)
•	payload: the message content
•	content_type: the type of content the message has (image, text, question etc.)
•	message_order: can be used to order the messages (best cross-checked against created_at)
•	created_at: date and time of the message -> best column for ordering the messages
•	interaction_order -> ignore
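
Since both message_order and created_at carry ordering information, it can help to check how often several messages in one conversation share a timestamp. The sketch below uses a made-up toy DataFrame with the same column names; treating message_order as the tiebreaker is an assumption, not something the dataset description confirms.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns; all values are made up for illustration.
toy = pd.DataFrame({
    "conversation_id": ["c1", "c1", "c1", "c1"],
    "direction": ["SEND", "SEND", "RECEIVE", "SEND"],
    "message_order": [5, 2, 0, 2],
    "created_at": pd.to_datetime([
        "2023-06-11 18:32:54.221356",
        "2023-06-11 18:32:54.221356",
        "2023-06-11 18:32:54.221356",
        "2023-06-11 18:32:24.503132",
    ]),
})

# Rows that share a timestamp within a conversation cannot be ordered by created_at alone.
tied = toy.duplicated(subset=["conversation_id", "created_at"], keep=False)
n_tied = int(tied.sum())

# Assumption: message_order is a sensible tiebreaker for rows with equal timestamps.
ordered = toy.sort_values(["conversation_id", "created_at", "message_order"])
order_check = ordered["message_order"].tolist()
print(n_tied, order_check)
```

If many rows are tied, sorting by created_at alone leaves their relative order up to the sort algorithm, so an explicit tiebreaker makes the result reproducible.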

This dataset comes from a rule-based chatbot built by Florian Onur Kuhlmeier and Sven Scheu. To extract the useful data, we follow these steps:
1. Select the rows where flow_id equals 'verhaltensaktivierung-2', which contain the data most relevant for fine-tuning.
2. Narrow down to rows where content_type is 'text', 'question', or 'payload', as these are the key components of prompt construction, and sort the dataset by 'created_at'.
3. Extract the message content from 'payload' and map 'direction' to roles with {'SEND': 'assistant', 'RECEIVE': 'user'}, so that the conversations are in a format the LLM can use.
4. Run format error checks and data analysis following the OpenAI cookbook guidelines.
5. Remove examples that violate OpenAI content policies when we upload the data for fine-tuning.
 

In [1]:
path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\verhaltensaktivierung.parquet.gzip"

import pandas as pd

df = pd.read_parquet(path)

### First look at the dataset

In [2]:
print(df.head())

                                     id                       conversation_id  \
0  12d29e68-e636-4fe7-abb8-3d1e3dc661c3  3cc89a19-5742-4f66-a93a-86cad116bea1   
1  cbee3172-53b5-4e3f-8bda-9e6a34d5280f  3cc89a19-5742-4f66-a93a-86cad116bea1   
2  2bfa2b69-d75d-4e43-8ded-29a7998a101b  3cc89a19-5742-4f66-a93a-86cad116bea1   
3  5e019e59-9c04-4e66-a3dc-cd28875b365c  3cc89a19-5742-4f66-a93a-86cad116bea1   
4  dc3518c7-f7b0-408d-890b-fc94c23d7af7  3cc89a19-5742-4f66-a93a-86cad116bea1   

                   flow_id step_id direction  \
0  verhaltensaktivierung-1   start      SEND   
1  verhaltensaktivierung-1   start   RECEIVE   
2  verhaltensaktivierung-1   start   RECEIVE   
3  verhaltensaktivierung-1   start      SEND   
4  verhaltensaktivierung-1   start      SEND   

                                             payload  content_type  \
0  {"content":{"url":"https://media0.giphy.com/me...         image   
1  {"content":{"flow_id":"verhaltensaktivierung-1...  flow_trigger   
2  {"content":

In [3]:
print("column names are ",df.columns)

column names are  Index(['id', 'conversation_id', 'flow_id', 'step_id', 'direction', 'payload',
       'content_type', 'message_order', 'interaction_order', 'created_at'],
      dtype='object')


In [4]:
print("column numbers are",df.count())

column numbers are id                   20137
conversation_id      20137
flow_id              20137
step_id              20137
direction            20137
payload              20137
content_type         20137
message_order        20137
interaction_order    20137
created_at           20137
dtype: int64


### Export the dataset as a CSV file
This lets us inspect the dataset file directly.

In [5]:
# Specify the file path and name
file_path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\verhaltensaktivierung.parquet.csv"
# Write DataFrame to CSV with UTF-8 encoding
df.to_csv(file_path, index=False, encoding='utf-8')

### Step 1 and 2: Filter
This Python code uses pandas to filter a DataFrame df in two steps:
1. Select the rows where flow_id equals 'verhaltensaktivierung-2', because they contain the data most relevant for fine-tuning.
2. Narrow these down further to rows where content_type is 'text', 'question', or 'payload', as these are the key components of prompt construction.

In [6]:
filtered_df = df[df['flow_id'] == 'verhaltensaktivierung-2']

filtered_df = filtered_df[filtered_df['content_type'].isin(['text', 'question', 'payload'])]


In [7]:
filtered_df.head()

Unnamed: 0,id,conversation_id,flow_id,step_id,direction,payload,content_type,message_order,interaction_order,created_at
59,b416224d-57f9-46b3-9fe2-8a4c3d5245c1,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""text"":""Lass uns mal versuchen, ei...",text,2,0,2023-06-11 18:32:24.503132
61,93988c30-1843-40f8-b6bf-790689a17b1e,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,RECEIVE,"{""content"":{""payload"":""👍""},""content_type"":""pay...",payload,0,0,2023-06-11 18:32:35.093439
63,89fa05bc-116d-4471-a3de-a2091026f7cd,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,RECEIVE,"{""content"":{""payload"":""Geht eigentlich""},""cont...",payload,0,0,2023-06-11 18:32:54.221356
64,a478b7d3-ab3d-484b-94b3-18d379838215,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""text"":""Aber da bist du nicht alle...",text,2,0,2023-06-11 18:32:54.221356
65,b2625dd7-2a92-4b5a-9c8f-576c79ceb85d,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""text"":""Aber: Positive Aktivitäten...",text,5,0,2023-06-11 18:32:54.221356


### Selected Dataset Generation
All downstream tasks are based on this dataset: sorted_df.

In [8]:
filtered_df = filtered_df.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
filtered_df['created_at'] = pd.to_datetime(filtered_df['created_at'])

sorted_df = filtered_df.sort_values(by='created_at')
sorted_df.head()

Unnamed: 0,id,conversation_id,flow_id,step_id,direction,payload,content_type,message_order,interaction_order,created_at
59,b416224d-57f9-46b3-9fe2-8a4c3d5245c1,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""text"":""Lass uns mal versuchen, ei...",text,2,0,2023-06-11 18:32:24.503132
61,93988c30-1843-40f8-b6bf-790689a17b1e,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,RECEIVE,"{""content"":{""payload"":""👍""},""content_type"":""pay...",payload,0,0,2023-06-11 18:32:35.093439
68,362cd471-2cb8-4a8b-95cc-48c60cbf971b,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""buttons"":[{""content"":{""accepts"":[...",question,2,0,2023-06-11 18:32:35.093439
63,89fa05bc-116d-4471-a3de-a2091026f7cd,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,RECEIVE,"{""content"":{""payload"":""Geht eigentlich""},""cont...",payload,0,0,2023-06-11 18:32:54.221356
64,a478b7d3-ab3d-484b-94b3-18d379838215,fb0749b3-3391-4f35-9ad6-958a1b9a931c,verhaltensaktivierung-2,start,SEND,"{""content"":{""text"":""Aber da bist du nicht alle...",text,2,0,2023-06-11 18:32:54.221356


In [9]:
sorted_df.payload.head()

59    {"content":{"text":"Lass uns mal versuchen, ei...
61    {"content":{"payload":"👍"},"content_type":"pay...
68    {"content":{"buttons":[{"content":{"accepts":[...
63    {"content":{"payload":"Geht eigentlich"},"cont...
64    {"content":{"text":"Aber da bist du nicht alle...
Name: payload, dtype: object

In [10]:
print("sorted_df count: ", len(sorted_df))

sorted_df count:  8991


In [11]:
print("unique conversation_id",len(sorted_df['conversation_id'].unique()))

unique conversation_id 119


### Step 3: JSONL Transformation
Convert the selected dataset into the conversation format expected by the LLM. OpenAI provides the following JSONL example:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
Source: https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

In [12]:
import json

role_mapping = {'SEND': 'assistant', 'RECEIVE': 'user'}
sorted_df['role'] = sorted_df['direction'].map(role_mapping)

# Function to extract the message text from a JSON payload string
def extract_content(payload):
    try:
        payload_json = json.loads(payload)
    except json.JSONDecodeError:
        # Not valid JSON: keep the raw payload as the message content
        return payload
    # Guard against payloads without a 'content' object (the direct lookup would raise KeyError)
    content = payload_json.get('content') if isinstance(payload_json, dict) else None
    if not isinstance(content, dict):
        return 'Content cannot be extracted'
    # Same priority order as before: 'title', then 'payload', then 'text'
    for key in ('title', 'payload', 'text'):
        if key in content:
            return content[key]
    return 'Content cannot be extracted'

#Apply this function to the payload column
sorted_df['content'] = sorted_df['payload'].apply(extract_content)

# Function for converting a single conversation to JSON
def conversation_to_json(group):
    # Add fixed system message
    system_message = {"role": "system", "content": "You are a helpful chatbot that based on Behavioural activation treatment."}
    messages = [system_message] + group[['role', 'content']].to_dict(orient='records')
    return {'messages': messages}

# Group by conversation_id and transform each group
conversations_json = sorted_df.groupby('conversation_id').apply(conversation_to_json)

#Specify the path to save the JSONL file
output_file_path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\converted_messages.jsonl"

# Write each conversation to a JSONL file
with open(output_file_path, 'w', encoding='utf-8') as file:
    for conversation in conversations_json:
        json.dump(conversation, file, ensure_ascii=False)
        file.write('\n')

output_file_path


'C:\\Users\\Li\\Desktop\\Engineering Seminar Human-Centered Systems\\data\\converted_messages.jsonl'
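
To make the extraction step easier to follow, here is a small self-contained sketch that runs the same key lookup on a few sample payloads. The sample strings are made up, and the shape of the question payload (text stored under 'title' next to 'buttons') is an assumption inferred from the extraction code. Unlike extract_content, this hypothetical helper extract_text returns the raw payload when nothing can be extracted.

```python
import json

def extract_text(payload):
    """Return the first of 'title', 'payload', 'text' found under 'content', else the raw payload."""
    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        return payload
    content = parsed.get("content") if isinstance(parsed, dict) else None
    if not isinstance(content, dict):
        return payload
    for key in ("title", "payload", "text"):
        if key in content:
            return content[key]
    return payload

# Made-up sample payloads, one per assumed shape, plus a non-JSON string
samples = [
    '{"content": {"text": "Lass uns starten!"}, "content_type": "text"}',
    '{"content": {"payload": "Geht eigentlich"}, "content_type": "payload"}',
    '{"content": {"title": "Wie geht es dir?", "buttons": []}, "content_type": "question"}',
    'plain text, not JSON',
]
extracted = [extract_text(s) for s in samples]
for text in extracted:
    print(text)
```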

### Step 4: Data analysis for chat model fine-tuning
Based on the OpenAI cookbook guide "Data preparation and analysis for chat model fine-tuning": https://cookbook.openai.com/examples/chat_finetuning_data_prep

In [13]:
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [14]:
data_path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\converted_messages.jsonl" 

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 119
First example:
{'role': 'system', 'content': 'You are a helpful chatbot that based on Behavioural activation treatment.'}
{'role': 'assistant', 'content': 'Lass uns mal versuchen, ein paar Aktivitäten zu finden, die dir Spaß machen! 🙌'}
{'role': 'user', 'content': '👍'}
{'role': 'assistant', 'content': 'Fällt es dir schwer, eine positive Aktivität in deinen Alltag einzubauen?'}
{'role': 'user', 'content': 'Geht eigentlich'}
{'role': 'assistant', 'content': 'Aber da bist du nicht allein! Viele haben genug für die Schule, Universität oder Arbeit 🧑💼 zu tun und nur wenig Freizeit.'}
{'role': 'assistant', 'content': 'Aber: Positive Aktivitäten müssen keine große Sache sein!'}
{'role': 'assistant', 'content': 'Manchmal kann es schon helfen, wenn du duschen gehst und dich danach frisch fühlst. 🚿'}
{'role': 'assistant', 'content': 'Auch solche kleinen Aktivitäten können helfen, deine Stimmung zu verbessern und auf bessere Gedanken zu kommen.'}
{'role': 'user', 'content': 'Ich 

In [15]:
# Format error checks
format_errors = defaultdict(int)

# Add a list to record the index of the wrong example
missing_assistant_examples = []

for i, ex in enumerate(dataset):
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1
        missing_assistant_examples.append(i)  # record the index of the offending example

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
    if missing_assistant_examples:
        print("Missing assistant messages in examples:", missing_assistant_examples)
else:
    print("No errors found")

No errors found


In [16]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: the label says p5 / p95, but the quantiles computed here are 0.1 and 0.9 (p10 / p90)
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

In [17]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 1

#### Distribution of num_messages_per_example:
min / max: 2, 83
mean / median: 76.5546218487395, 75.0
p5 / p95: 75.0, 83.0

#### Distribution of num_total_tokens_per_example:
min / max: 50, 1790
mean / median: 1565.2689075630253, 1548.0
p5 / p95: 1508.8, 1707.4

#### Distribution of num_assistant_tokens_per_example:
min / max: 25, 1261
mean / median: 1142.5126050420167, 1120.0
p5 / p95: 1109.0, 1247.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
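
The statistics above report one example without any user message. Whether to drop such an example is a judgment call; the sketch below shows one possible filter on a made-up toy dataset. The helper has_role is hypothetical and not part of the cookbook code.

```python
def has_role(example, role):
    """True if at least one message in the example has the given role."""
    return any(m.get("role") == role for m in example.get("messages", []))

# Made-up examples; the second one mimics a conversation without any user turn.
toy_dataset = [
    {"messages": [{"role": "system", "content": "s"},
                  {"role": "assistant", "content": "Hallo"},
                  {"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "system", "content": "s"},
                  {"role": "assistant", "content": "Hallo"}]},
]

# Keep only conversations that contain both a user and an assistant message.
kept = [ex for ex in toy_dataset if has_role(ex, "user") and has_role(ex, "assistant")]
dropped = len(toy_dataset) - len(kept)
print(f"kept {len(kept)}, dropped {dropped}")
```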


In [18]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~186267 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~558801 tokens
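
As a cross-check of the numbers above, the following sketch re-implements the epoch heuristic as a function and recomputes the billed tokens for 119 examples. The price per 1K tokens is a placeholder for illustration only and should be replaced with the current value from the OpenAI pricing page.

```python
def default_epochs(n_examples, target_epochs=3, min_target=100, max_target=25000,
                   min_epochs=1, max_epochs=25):
    """Mirror the epoch heuristic from the cell above."""
    if n_examples * target_epochs < min_target:
        return min(max_epochs, min_target // n_examples)
    if n_examples * target_epochs > max_target:
        return max(min_epochs, max_target // n_examples)
    return target_epochs

n_examples = 119
tokens_in_dataset = 186267           # reported by the analysis above
epochs = default_epochs(n_examples)  # 119 * 3 = 357 lies within [100, 25000]
billed_tokens = epochs * tokens_in_dataset

# Hypothetical price, for illustration only
PRICE_PER_1K_TOKENS_USD = 0.008
estimated_cost = billed_tokens / 1000 * PRICE_PER_1K_TOKENS_USD
print(epochs, billed_tokens, round(estimated_cost, 2))
```

With very small datasets the heuristic raises the number of epochs, and with very large ones it lowers it; for this dataset the target of 3 epochs is kept.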


### Step 5: Remove examples that violate OpenAI content policies when we upload the data for fine-tuning

See 5.Fine-tuning

## 3.Design Chatbot
Behavioral Activation (BA) is a psychotherapy method that works by:
•	taking part in psychologically beneficial activities,
•	keeping away from psychologically harmful activities, and
•	addressing the mechanisms that hinder access to rewards or strengthen negative control.

### What should the chatbot do?
1. BA introduction: the chatbot introduces itself and explains BA in an understandable way.
2. Mood tracking: ask how the user is feeling today.
3. Activity recommendation: find activities that the user likes and encourage the user to take part in them.
4. Activity management: schedule the activities (as PST or ICS files if possible) and check in on them.
5. Incentive mechanisms: keep users from skipping their activities.
    1) Public declaration: encourage users to share their goals and activities publicly, for example on Twitter. Social pressure and public accountability can be a strong incentive.
    2) Partner supervision: encourage users to share their goals and activities with friends and family, who can then keep an eye on the activities and help users not to avoid them.
    3) Schedule check-in history: remind users to review their completed and uncompleted scheduled activities.
    4) Compliments and praise: encourage users when they complete activities, and remind them to remember and share the joy of completing what they scheduled.
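
The activity management item mentions ICS files. As a rough illustration of what such a file contains, the sketch below builds a minimal iCalendar document with a single event using only the standard library. The function name build_ics, the PRODID value, and the UID domain are hypothetical, and the sketch does not escape commas, semicolons, or line breaks in the summary.

```python
from datetime import datetime, timezone
import uuid

def build_ics(summary, start, end, uid=None):
    """Return a minimal iCalendar document with one event; times are written in UTC."""
    fmt = "%Y%m%dT%H%M%SZ"
    uid = uid or f"{uuid.uuid4()}@ba-chatbot.example"  # hypothetical UID domain
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//BA Chatbot//Sketch//EN",  # hypothetical product identifier
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{datetime.now(timezone.utc).strftime(fmt)}",
        f"DTSTART:{start.astimezone(timezone.utc).strftime(fmt)}",
        f"DTEND:{end.astimezone(timezone.utc).strftime(fmt)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated with CRLF
    return "\r\n".join(lines) + "\r\n"

ics_text = build_ics(
    "Go for a walk",
    datetime(2024, 3, 4, 17, 0, tzinfo=timezone.utc),
    datetime(2024, 3, 4, 17, 30, tzinfo=timezone.utc),
    uid="demo-1@ba-chatbot.example",
)
print(ics_text)
```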
  
### Knowledge Hub loading
The Knowledge Hub contains knowledge relevant to BA.

In [19]:
file_path = r'C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\knowledge.csv'
knowledge_df = pd.read_csv(file_path, sep='\\|\\|', engine='python',encoding='UTF-8')
knowledge_df

Unnamed: 0,entities,descriptions
0,Zest,"Great enthusiasm and energy, often marked by ..."
1,Zenith,A feeling of being at the peak or highest poi...
2,Yearning,"A deep longing, especially for something or s..."
3,Wonder,"A feeling of amazement and admiration, caused..."
4,Wistfulness,"A feeling of vague or regretful longing, ofte..."
...,...,...
95,BA Introduction,Explain BA understandably at first and chatbo...
96,Activity Recommendation,Find activities that user likes and encourage...
97,add schedule,add an activity
98,get schedule,get an activity


### Schedule List loading
User Information such as Schedule and User Mood.

In [20]:
file_path = r'C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\calendar.csv'
schedule_df = pd.read_csv(file_path, sep='\\|\\|', engine='python',encoding='UTF-8')
schedule_df

Unnamed: 0,event_id,summary,start_date_time,end_date_time,time_zone,location,status,attendees,creation_date,last_modified_date,reminders


### Basic functions for the Knowledge Hub: sort_dataframe, add_sort_entity, delete_matching_entity, search_description, update_entity_description

In [21]:
def sort_dataframe(df):
    # Sort by the first word of entities column
    df.sort_values(by='entities', key=lambda x: x.str.split().str[0], inplace=True)

    return df

In [22]:
# test
df = sort_dataframe(knowledge_df)
df

Unnamed: 0,entities,descriptions
64,"""Exuberance","The quality of being full of energy, exciteme..."
61,"""Foreboding",A feeling that something bad will happen; fea...
49,"""Insecurity",A feeling of uncertainty or anxiety about one...
29,"""Pessimism",A tendency to see the worst aspect of things ...
96,Activity Recommendation,Find activities that user likes and encourage...
...,...,...
1,Zenith,A feeling of being at the peak or highest poi...
0,Zest,"Great enthusiasm and energy, often marked by ..."
97,add schedule,add an activity
99,delete schedule,delete an activity


In [23]:
def add_sort_entity(df, new_entity, new_description):

    new_row = pd.DataFrame({'entities': [new_entity], 'descriptions': [new_description]})
    
    df = pd.concat([df, new_row], ignore_index=True)
    
    df = sort_dataframe(df)
    
    return df

In [24]:
def delete_matching_entity(df, entity_to_delete):

    index_to_delete = df[df['entities'] == entity_to_delete].index.min()

    if pd.notna(index_to_delete):
        df = df.drop(index_to_delete)

    return df

In [25]:
def search_description(df, entity_to_search):
    # Remove extra spaces
    cleaned_search = entity_to_search.strip()

    # Use a more flexible matching method
    matching_rows = df[df['entities'].str.contains(cleaned_search, case=False, na=False, regex=False)]

    # Check if a matching line is found
    if not matching_rows.empty:
        # Return the description value of the first matching item
        return matching_rows.iloc[0]['descriptions']
    else:
        return None

In [26]:
search_description(df,"Behavioral Activation")

'Behavioral Activation is a method for a psychical therapy by: Taking part in psychologically beneficial activities, keeping away from psychologically harmful activities, and solving mechanisms problems that hinder access to rewards or enhance negative control.'

In [27]:
def update_entity_description(df, original_entity, updated_entity, updated_description):
    # Find the first row whose entity matches original_entity
    for index, row in df.iterrows():
        if row['entities'] == original_entity :
            df.at[index, 'entities'] = updated_entity
            df.at[index, 'descriptions'] = updated_description
            break  # Assume only the first matching row is updated
    else:
        print("No matching row found to update")

    return df

### Why do we need RAG?
1. Update data in real time (mood detection)
2. Search support
3. Generate more accurate answers instead of making up confusing ones
4. Reduce noise and the impact of information overload


In [28]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd

def normalize_embeddings(embeddings):
    """
    Normalize the embedding vector so that it becomes a unit vector.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def extract_knowledges_from_df(df: pd.DataFrame, question, similarity_threshold=0.1):
    """
    Look up the entities column of a DataFrame and return a DataFrame with the entities and
    descriptions columns for the entity most relevant to a single question,
    subject to a specified similarity threshold.
    """
    model_name='all-MiniLM-L6-v2'
    # Make sure the entities column exists
    if 'entities' not in df.columns:
        raise ValueError("DataFrame must have an 'entities' column")

    # Make sure the descriptions column exists
    if 'descriptions' not in df.columns:
        raise ValueError("DataFrame must have a 'descriptions' column")

    #Initialize SentenceTransformer model
    model = SentenceTransformer(model_name)

    # Get the value of entities column
    entities = df['entities'].tolist()

    # Calculate the embedding vector of the entity
    entity_embeddings = model.encode(entities, convert_to_tensor=False)
    entity_embeddings = normalize_embeddings(entity_embeddings)

    # Create faiss index (use inner product to simulate cosine similarity)
    index = faiss.IndexFlatIP(entity_embeddings.shape[1])
    index.add(entity_embeddings)

    # Calculate the embedding vector of the question and normalize it
    question_embedding = model.encode(question, convert_to_tensor=False)
    question_embedding = normalize_embeddings(question_embedding.reshape(1, -1))

    # Search for all entities whose similarity to the question is higher than the threshold
    distances, indices = index.search(question_embedding, len(entities))

    # Keep only the entities whose similarity is greater than or equal to the threshold
    filtered_indices = [idx for idx, score in zip(indices[0], distances[0]) if score >= similarity_threshold]

    # Results are sorted by descending similarity, so keep only the single best match
    top_indices = filtered_indices[:1]
    result_df = df.iloc[top_indices]

    return result_df[['entities', 'descriptions']]
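
The retrieval function relies on the fact that the inner product of unit vectors equals the cosine similarity of the original vectors. The sketch below checks this on made-up three-dimensional vectors using only the standard library, then applies the same threshold and best-match selection. The vectors and entity names are illustrative and are not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up "embeddings" for illustration only
entities = {"joy": [2.0, 0.0, 0.0], "calm": [1.0, 1.0, 0.0], "fear": [0.0, 0.0, 3.0]}
question = [1.0, 0.2, 0.0]

q_unit = normalize(question)
scores = {name: dot(normalize(vec), q_unit) for name, vec in entities.items()}

# Inner product of unit vectors equals the cosine similarity of the raw vectors
matches_cosine = all(math.isclose(scores[n], cosine(entities[n], question)) for n in entities)

threshold = 0.1
ranked = sorted((n for n in scores if scores[n] >= threshold), key=scores.get, reverse=True)
best = ranked[:1]
print(matches_cosine, ranked, best)
```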


### Timestamp for Prompt

In [29]:
from datetime import datetime as dt

def get_current_timestamp():
    return dt.now().strftime("%Y-%m-%dT%H:%M:%S")


In [30]:
get_current_timestamp()

'2024-03-03T01:35:46'
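
One way a timestamp like this could enter the prompt is sketched below: a base instruction, the current time, and any retrieved Knowledge Hub rows are joined into one system prompt. This is only an assumed layout; the function build_system_prompt and the wording of the prompt lines are hypothetical.

```python
from datetime import datetime

def build_system_prompt(base_prompt, knowledge_rows, now=None):
    """Combine a base prompt, the current timestamp, and retrieved (entity, description) pairs."""
    timestamp = (now or datetime.now()).strftime("%Y-%m-%dT%H:%M:%S")
    parts = [base_prompt, f"Current time: {timestamp}"]
    for entity, description in knowledge_rows:
        parts.append(f"Background on {entity}: {description}")
    return "\n".join(parts)

prompt = build_system_prompt(
    "You are a chatbot for Behavioral Activation.",
    [("add schedule", "add an activity")],
    now=datetime(2024, 3, 3, 1, 35, 46),
)
print(prompt)
```

Giving the model the current time lets it resolve relative expressions such as "tomorrow at 5 pm" into concrete dates for scheduling.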

### Schedule Management

#### Google Calendar 
Supported by the official Python quickstart from Google Workspace:
https://developers.google.com/calendar/api/quickstart/python?hl=en
Test users (2): mesenlee123@gmail.com, flo.kuhlmeier@gmail.com

In [31]:
import datetime  # needed by is_valid_datetime below
import re
import pandas as pd
from googleapiclient.discovery import build
from google.oauth2.credentials import Credentials

def google_calendar_service():
    creds = Credentials.from_authorized_user_file('token.json')
    service = build('calendar', 'v3', credentials=creds)
    return service

def get_user_email():
    creds = Credentials.from_authorized_user_file('token.json')
    service = build('calendar', 'v3', credentials=creds)
    user_info = service.calendarList().get(calendarId='primary').execute()
    user_email = user_info['summary']
    return user_email

def is_valid_datetime(dt_str):
    try:
        datetime.datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S')
        return True
    except ValueError:
        return False

def is_valid_email(email):
    return re.match(r"[^@]+@[^@]+\.[^@]+", email)
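
The two validators can be tried in isolation. The sketch below repeats the datetime check and adds a stricter e-mail check as an illustration: re.match only requires the pattern to match a prefix of the string, so trailing characters slip through, whereas fullmatch requires the whole string to match. The name is_valid_email_strict and its pattern are hypothetical and still far from a complete e-mail validator.

```python
import re
from datetime import datetime

def is_valid_datetime(dt_str):
    """True if the string matches YYYY-MM-DDTHH:MM:SS and is a real calendar date."""
    try:
        datetime.strptime(dt_str, "%Y-%m-%dT%H:%M:%S")
        return True
    except ValueError:
        return False

# Hypothetical stricter pattern: no whitespace, and the whole string must match
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_email_strict(email):
    return EMAIL_PATTERN.fullmatch(email) is not None

print(is_valid_datetime("2024-03-03T01:35:46"))
print(is_valid_datetime("2024-02-30T00:00:00"))
print(is_valid_email_strict("user@example.com"))
print(is_valid_email_strict("user@example.com@"))
```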


#### add event to Google Calendar

In [32]:
def add_event(schedule_df):
    event_info = {
        'summary': input('Enter event summary: '),
        'location': input('Enter event location: '),
        'start': {
            'dateTime': input('Enter start date and time (YYYY-MM-DDTHH:MM:SS),skip when now: '),
            'timeZone': input('Enter start time zone (e.g., America/Los_Angeles), skip when Europe/Berlin: '),
        },
        'end': {
            'dateTime': input('Enter end date and time (YYYY-MM-DDTHH:MM:SS),skip when now: '),
            'timeZone': input('Enter end time zone (e.g., America/Los_Angeles),skip when Europe/Berlin: '),
        },
        'attendees': [],
        'reminders': {
            'useDefault': False,
            'overrides': [
                {'method': 'email', 'minutes': 24 * 60},
                {'method': 'popup', 'minutes': 10},
            ],
        },
    }

    # Fill in defaults for skipped inputs, then validate
    if event_info['summary'] == '':
        event_info['summary'] = 'new event'
    if event_info['location'] == '':
        event_info['location'] = 'KIT'
    if event_info['start']['dateTime'] == '':
        event_info['start']['dateTime'] = get_current_timestamp()
    if event_info['end']['dateTime'] == '':
        event_info['end']['dateTime'] = get_current_timestamp()
    if event_info['start']['timeZone'] == '':
        event_info['start']['timeZone'] = 'Europe/Berlin'
    if event_info['end']['timeZone'] == '':
        event_info['end']['timeZone'] = 'Europe/Berlin'
    
    if not is_valid_datetime(event_info['start']['dateTime']):
        print("Invalid start date and time format.")
        return schedule_df, None

    if not is_valid_datetime(event_info['end']['dateTime']):
        print("Invalid end date and time format.")
        return schedule_df, None

    attendee_input = input('Enter attendees emails separated by comma (skip): ').strip()
    if attendee_input == '':
        # No attendees given: fall back to the calendar owner's own address
        user_email = get_user_email()
        print("user_email: ", user_email)
        attendee_emails = [user_email]
    else:
        attendee_emails = attendee_input.split(',')

    for email in attendee_emails:
        email = email.strip()
        if not email:
            continue  # ignore empty entries such as a trailing comma
        if not is_valid_email(email):
            print(f"Invalid email format: {email}")
            return schedule_df, None
        # The Calendar API expects attendees as a list of {'email': ...} objects
        event_info['attendees'].append({'email': email})

    try:
        service = google_calendar_service()
        event = service.events().insert(calendarId='primary', body=event_info).execute()
    
        # Update DataFrame with response from API
        new_event = {
            'event_id': [event['id']],
            'summary': [event['summary']],
            'start_date_time': [event['start']['dateTime']],
            'end_date_time': [event['end']['dateTime']],
            'time_zone': [event['start']['timeZone']],
            'location': [event.get('location', '')],
            'status': [event.get('status', '')],
            'attendees': [', '.join([attendee['email'] for attendee in event.get('attendees', [])])],
            'creation_date': [event['created']],
            'last_modified_date': [event['updated']],
            'reminders': [str(event['reminders']['overrides'])]
        }
        new_event_df = pd.DataFrame(new_event)
    
        # Ensure the columns in both DataFrames match
        for column in new_event_df.columns:
            if column not in schedule_df.columns:
                schedule_df[column] = None
        
        schedule_df = pd.concat([schedule_df, new_event_df], ignore_index=True)
        return schedule_df, event['id']

    except Exception as e:
        print(f"An error occurred with the API call: {e}")
        return schedule_df, None

In [33]:
def add_event_to_schedule(schedule_df):

    schedule_df, event_id = add_event(schedule_df)
    print("Event ID:", event_id)
    print("schedule_df:", schedule_df)
    return schedule_df

#### get event from Google Calendar

In [34]:
def get_event_with_eventId(eventId):  # example: 'f5seohftfpg8kcmt6mlfdcti44'
    # Load the stored OAuth token and build the Calendar client
    creds = Credentials.from_authorized_user_file('token.json')
    service = build('calendar', 'v3', credentials=creds)
    try:
        # events().instances() lists the occurrences of a recurring event
        instances = service.events().instances(calendarId='primary', eventId=eventId).execute()
        return instances
    except Exception as e:
        return f"An error occurred: {e}"

In [35]:
get_event_with_eventId("f5seohftfpg8kcmt6mlfdcti44")

{'kind': 'calendar#events',
 'etag': '"p33g9hvn6jra880o"',
 'summary': 'mesenlee123@gmail.com',
 'description': '',
 'updated': '2024-03-01T23:42:13.372Z',
 'timeZone': 'Europe/Berlin',
 'accessRole': 'owner',
 'defaultReminders': [{'method': 'popup', 'minutes': 30}],
 'nextSyncToken': 'COCY_uae1IQDEOCY_uae1IQDGAUg2tKJogIo2tKJogI=',
 'items': []}
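
The call above uses `events().instances()`, which lists the occurrences of a recurring event. For looking up one event by its ID, the Calendar API also offers `events().get()`. A minimal sketch, assuming `service` is built as in the cell above; the function name `fetch_single_event` and the `calendar_id` default are illustrative and not part of the notebook:

```python
def fetch_single_event(service, event_id, calendar_id="primary"):
    """Return one event resource as a dict, or None if the lookup fails.

    `service` is assumed to be a Calendar v3 client, e.g. the result of
    build('calendar', 'v3', credentials=creds).
    """
    try:
        # events().get() retrieves a single event by its ID
        return service.events().get(calendarId=calendar_id, eventId=event_id).execute()
    except Exception as exc:
        print(f"Could not fetch event {event_id}: {exc}")
        return None
```

Returning `None` instead of an error string lets the caller tell a failed lookup apart from a real event.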

In [36]:
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
import pandas as pd

def delete_event_with_eventId(eventId, schedule_df):  # Example eventId = 'f5seohftfpg8kcmt6mlfdcti44'
    creds = Credentials.from_authorized_user_file('token.json')
    service = build('calendar', 'v3', credentials=creds)

    try:
        # Attempt to delete the event from Google Calendar
        service.events().delete(calendarId='primary', eventId=eventId).execute()

        # If the event is successfully deleted from Google Calendar, 
        # also remove it from the DataFrame
        if eventId in schedule_df['event_id'].values:
            schedule_df = schedule_df[schedule_df['event_id'] != eventId]
            return schedule_df, f"Event with ID {eventId} has been deleted from both Google Calendar and DataFrame."
        else:
            return schedule_df, f"Event ID {eventId} not found in DataFrame but deleted from Google Calendar."

    except Exception as e:
        return schedule_df, f"An error occurred: {e}"

# Example usage:
# updated_df, message = delete_event_with_eventId('event_id_to_delete', schedule_df)
# print(message)
# print(updated_df)


In [37]:
delete_event_with_eventId("ralsh9mrefmvb86o4f31lo3s08",schedule_df)

(Empty DataFrame
 Columns: [event_id, summary, start_date_time, end_date_time, time_zone, location, status, attendees, creation_date, last_modified_date, reminders]
 Index: [],
 'An error occurred: <HttpError 410 when requesting https://www.googleapis.com/calendar/v3/calendars/primary/events/ralsh9mrefmvb86o4f31lo3s08? returned "Resource has been deleted". Details: "[{\'domain\': \'global\', \'reason\': \'deleted\', \'message\': \'Resource has been deleted\'}]">')

In [38]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

def get_event_without_eventId(schedule_df, summary, location, startTime, endTime):
    # Load the sentence transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Create embeddings for the input parameters
    input_embedding = model.encode(f"{summary} {location} {startTime} {endTime}", convert_to_tensor=True)

    # Function to embed DataFrame row
    def embed_row(row):
        row_str = f"{row['summary']} {row['location']} {row['start_date_time']} {row['end_date_time']}"
        return model.encode(row_str, convert_to_tensor=True)

    # Compute embeddings for each row in the DataFrame
    schedule_df['embedding'] = schedule_df.apply(embed_row, axis=1)

    # Calculate cosine similarity and find the row with the highest similarity
    highest_similarity = -1
    closest_event_id = None
    for index, row in schedule_df.iterrows():
        row_embedding = row['embedding']
        similarity = util.pytorch_cos_sim(input_embedding, row_embedding)[0][0].item()
        print(f"Event ID: {row['event_id']}, Similarity: {similarity}")  # Debug print
        if similarity > highest_similarity:
            highest_similarity = similarity
            closest_event_id = row['event_id']

    if closest_event_id is None:
        # Nothing to look up (for example, an empty schedule_df)
        print("No closely matching event found.")
        return None
    return get_event_with_eventId(closest_event_id)

# Example usage
# closest_event = get_event_without_eventId(schedule_df, 'new event', 'KIT', '2024-01-31T21:20:21+01:00', '2024-01-31T21:20:21+01:00')
# print("Closest event:", closest_event)


In [39]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

def delete_event_without_eventId(schedule_df, summary, location, startTime, endTime):
    # Load the sentence transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Create embeddings for the input parameters
    input_embedding = model.encode(f"{summary} {location} {startTime} {endTime}", convert_to_tensor=True)

    # Function to embed DataFrame row
    def embed_row(row):
        row_str = f"{row['summary']} {row['location']} {row['start_date_time']} {row['end_date_time']}"
        return model.encode(row_str, convert_to_tensor=True)

    # Compute embeddings for each row in the DataFrame
    schedule_df['embedding'] = schedule_df.apply(embed_row, axis=1)

    # Calculate cosine similarity and find the row with the highest similarity
    highest_similarity = -1
    closest_event_id = None
    for index, row in schedule_df.iterrows():
        row_embedding = row['embedding']
        similarity = util.pytorch_cos_sim(input_embedding, row_embedding)[0][0].item()
        print(f"Event ID: {row['event_id']}, Similarity: {similarity}")  # Debug print
        if similarity > highest_similarity:
            highest_similarity = similarity
            closest_event_id = row['event_id']

    if closest_event_id is None:
        print("No closely matching event found.")
    return delete_event_with_eventId(closest_event_id,schedule_df)

# Example usage
# updated_df, message = delete_event_without_eventId(schedule_df, 'new event', 'KIT', '2024-01-31T21:20:21+01:00', '2024-01-31T21:20:21+01:00')
# print(message)


In [40]:
delete_event_without_eventId(schedule_df,'new event','KIT','2024-01-31T21:20:21+01:00','2024-01-31T21:20:21+01:00')

No closely matching event found.


(Empty DataFrame
 Columns: [event_id, summary, start_date_time, end_date_time, time_zone, location, status, attendees, creation_date, last_modified_date, reminders, embedding]
 Index: [],
 'An error occurred: Missing required parameter "eventId"')
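
The run above shows what happens with an empty schedule: no row is scored, the ID stays `None`, and the delete call fails with a missing `eventId`. A minimal sketch of the matching step with an explicit guard, written with plain Python lists instead of SentenceTransformer tensors so that it runs on its own; `best_match` and `min_similarity` are hypothetical names, and the threshold value is an assumption:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query_vec, candidates, min_similarity=0.5):
    """Return the ID of the most similar candidate, or None.

    `candidates` maps an event ID to its embedding vector. None is returned
    when there are no candidates or the best score is below min_similarity,
    so the caller can skip the Calendar API call.
    """
    best_id, best_score = None, -1.0
    for event_id, vec in candidates.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_id, best_score = event_id, score
    if best_id is None or best_score < min_similarity:
        return None
    return best_id
```

A caller would check for `None` before passing the result to `delete_event_with_eventId`.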

### Convert DataFrame to a string to construct the prompt

In [41]:
def dataframe_to_string(df):
    """
    Convert each row of the DataFrame to a string and add a period after the end of each row.
    If the DataFrame is empty, return an empty string.
    """
    # Check whether the DataFrame is empty
    if df.empty:
        return ''

    # Convert each row to a space-separated string and append a period
    lines = [' '.join(map(str, row)) + '.' for row in df.itertuples(index=False, name=None)]
    
    # Join all rows into a single string
    context = ' '.join(lines)
    return " Given retrievaled context: " + context


In [42]:
dataframe_to_string(extract_knowledges_from_df(knowledge_df,"what should I do today?",0.1))

' Given retrievaled context: Activity Recommendation  Find activities that user likes and encourage the user to take part in them.'

#### Save messages to the specified folder
Set folder_path to the target folder, for example folder_path = r"folder_path".
Example: save_messages_to_jsonl(messages, folder_path)

In [43]:
import os
import json
from datetime import datetime as dt  # alias used by dt.now() below
folder_path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems"
def save_messages_to_jsonl(messages, folder_path):
    # Create a filename based on the current timestamp
    timestamp = dt.now().strftime("%Y%m%d%H%M%S")
    filename = f"{timestamp}.jsonl"

    # Make sure the folder path exists
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

    #Create full file path
    file_path = os.path.join(folder_path, filename)

    # Save messages to JSONL file
    with open(file_path, 'w', encoding='utf-8') as file:
        for message in messages:
            # Convert the dict to a JSON string and write it as one line
            json_str = json.dumps(message)
            file.write(json_str + '\n')

    print(f"Messages saved to {file_path}")
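
To reuse a saved conversation, the file can be read back line by line. A minimal sketch, assuming the file was written by `save_messages_to_jsonl` with one JSON object per line; `load_messages_from_jsonl` is a hypothetical helper that is not part of the notebook:

```python
import json

def load_messages_from_jsonl(file_path):
    """Read a JSONL file (one JSON object per line) and return a list of dicts."""
    messages = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                messages.append(json.loads(line))
    return messages
```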

### Single-Answer Method
get_answer_with_single_question sends one question plus the retrieved context to the model and returns a short answer. Other functions, such as mood tracking, build on it.

In [44]:
import os
from openai import OpenAI
import openai

def get_answer_with_single_question(df, question, similarity_threshold = 0.1):

    # Extract relevant information as context
    context = dataframe_to_string(extract_knowledges_from_df(df,question,similarity_threshold))

    # Set OpenAI API key
    api_key = os.environ.get('OPENAI_API_KEY')
    openai.api_key = api_key

    # Initialize OpenAI client
    client = OpenAI(api_key=api_key)

    # Set up the model
    model="gpt-3.5-turbo-1106"

    try:
        messages=[
                    {"role": "system", "content": "You are a helpful assistant. Given context, answer in a noun or a phrase with noun."},
                    # instruction
                    {"role": "user", "content": question + " " + context}
        ]
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        
        # Extract and return the answer
        answer = response.choices[0].message.content
        return answer

    except Exception as e:
        print(f"An error occurred while processing the problem: {e}")
        return "Unable to get answer"



In [45]:
get_answer_with_single_question(df,"what should I do today?")

'hobbies and interests'

### Mood and Activity Track

In [46]:
def mood_track(df, similarity_threshold = 0.1):
    # Use the last five entries of the running conversation (global `messages`) as history
    history = str(messages[-5:])
    mood = get_answer_with_single_question(knowledge_df,"what is user's emotion ? Answer in a noun. given chat history: "+history,similarity_threshold)
    timestamp = get_current_timestamp()
    full_entity = f"user emotion {timestamp}"
    new_description = mood
    # Add and sort the new entry
    df = add_sort_entity(df, full_entity, new_description)
    return full_entity + ", " + new_description

In [47]:
def dataframe_to_entity(df):
    """
    Convert each row of the DataFrame to a string and add a period after the end of each row.
    """
    # Convert each row of the DataFrame to a string separated by spaces and add periods
    lines = [' '.join(map(str, row)) + '.' for row in df.itertuples(index=False, name=None)]

    # Concatenate all lines into a single string
    entity =  ' '.join(lines)
    return str(entity)


In [48]:
def extract_activity_from_df(df: pd.DataFrame, question, similarity_threshold=0.1):
    """
    Look up the 'entities' column of a DataFrame and return, as a string, the single entity
    most similar to the question, provided its similarity reaches the given threshold.
    """
    model_name='all-MiniLM-L6-v2'
    # Make sure the entities column exists
    if 'entities' not in df.columns:
        raise ValueError("DataFrame must have an 'entities' column")

    # Make sure the descriptions column exists
    if 'descriptions' not in df.columns:
        raise ValueError("DataFrame must have a 'descriptions' column")

    # Initialize SentenceTransformer model
    model = SentenceTransformer(model_name)

    # Get the value of entities column
    entities = df['entities'].tolist()

    # Compute the embedding vector of an entity
    entity_embeddings = model.encode(entities, convert_to_tensor=False)
    entity_embeddings = normalize_embeddings(entity_embeddings)

    # Create faiss index (use inner product to simulate cosine similarity)
    index = faiss.IndexFlatIP(entity_embeddings.shape[1])
    index.add(entity_embeddings)

    # Compute the embedding vector of the question and normalize it
    question_embedding = model.encode(question, convert_to_tensor=False)
    question_embedding = normalize_embeddings(question_embedding.reshape(1, -1))

    # Rank all entities by similarity to the question
    distances, indices = index.search(question_embedding, len(entities))

    # Keep only entities whose similarity is greater than or equal to the threshold
    filtered_indices = [idx for idx, distance in zip(indices[0], distances[0]) if distance >= similarity_threshold]

    # Return the row containing the single entity with the highest similarity
    top_indices = filtered_indices[:1]
    result_df = df.iloc[top_indices]
    
    return dataframe_to_entity(result_df[['entities']])
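
The function relies on the fact that the inner product of L2-normalized vectors equals their cosine similarity, which is why `IndexFlatIP` can stand in for a cosine index. A small worked check in plain Python (no FAISS needed); the helper names are illustrative, and `l2_normalize` is only assumed to mirror what `normalize_embeddings` does:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

u, v = [3.0, 4.0], [4.0, 3.0]
ip_of_normalized = inner_product(l2_normalize(u), l2_normalize(v))
print(round(ip_of_normalized, 6), round(cosine(u, v), 6))  # 0.96 0.96
```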

## 4.Fine-tuning
### Step 5: Removing examples that violate OpenAI content policies (Data Preprocessing)
This section documents how the flagged examples were removed. You can skip it and go straight to the fine-tuned model.
Note:
Standard version with the policy-violating examples removed (use this one): messages.jsonl, file_object.id: file-MfSaHHIx7s75dywsofhSQIpc
Standard converted version: converted_messages.jsonl, File ID: file-flA9y8B28JIGeZSq1nM8fPh9
Given test version: test_messages.jsonl, File ID: file-nSLhAXDwgLhFbA5OVnVqRuTi

Fine-tuning on converted_messages.jsonl failed with the following message:

The job failed due to an invalid training file. This file failed moderation safety checks. The OpenAI Moderation API identifies fine-tuning examples that violate our content policies. To fine tune on this data, please try removing the flagged lines and uploading the file again. Flagged lines: 1, 2, 8, 10, 13, 14, 16, 19, 21, 22, 25, 27, 32, 33, 37, 38, 43, 50, 51, 52, 53, 54, 61, 62, 63, 64, 67, 69, 70, 72, 74, 75, 76, 81, 85, 87, 90, 91, 92, 94, 95, 96, 98, 99, 100, 105, 106, 107, 110, 112, 115, 117

In [49]:
"""
def remove_lines_and_save(input_file_path, output_file_path, lines_to_remove):
    
    with open(input_file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()


    lines_to_remove = [line - 1 for line in lines_to_remove]
    lines = [line for index, line in enumerate(lines) if index not in lines_to_remove]

    with open(output_file_path, 'w', encoding='utf-8') as file:
        file.writelines(lines)
"""
# open when you want to upload new data for fine-tuning

In [50]:
input_file = r'C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\converted_messages.jsonl'  
output_file = r'C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\messages.jsonl' 

In [51]:
"""
 
lines_to_remove = [1, 2, 8, 10, 13, 14, 16, 19, 21, 22, 25, 27, 32, 33, 37, 38, 43, 50, 51, 52, 53, 54, 61, 62, 63, 64, 67, 69, 70, 72, 74, 75, 76, 81, 85, 87, 90, 91, 92, 94, 95, 96, 98, 99, 100, 105, 106, 107, 110, 112, 115, 117]

remove_lines_and_save(input_file, output_file, lines_to_remove)
"""


The job failed due to an invalid training file. This file failed moderation safety checks. The OpenAI Moderation API identifies fine tuning examples that violate our content policies. To fine tune on this data, please try removing the flagged lines and uploading the file again. Flagged lines: 0, 5, 6, 8, 9, 11, 12, 14, 15, 19, 22, 26, 32, 38, 40, 41, 42, 43, 47, 50, 51, 53, 54, 55, 59, 61, 62, 64, 65

In [52]:
'''

lines_to_remove = [0, 5, 6, 8, 9, 11, 12, 14, 15, 19, 22, 26, 32, 38, 40, 41, 42, 43, 47, 50, 51, 53, 54, 55, 59, 61, 62, 64, 65]

remove_lines_and_save(output_file, output_file, lines_to_remove)

'''


The job failed due to an invalid training file. This file failed moderation safety checks. The OpenAI Moderation API identifies fine tuning examples that violate our content policies. To fine tune on this data, please try removing the flagged lines and uploading the file again. Flagged lines: 0, 4, 5, 6, 7, 10, 12, 15, 20, 25, 26, 29, 31, 32, 35, 36, 37

In [53]:
'''
lines_to_remove = [0, 4, 5, 6, 7, 10, 12, 15, 20, 25, 26, 29, 31, 32, 35, 36, 37]

remove_lines_and_save(output_file, output_file, lines_to_remove)
'''


The job failed due to an invalid training file. This file failed moderation safety checks. The OpenAI Moderation API identifies fine tuning examples that violate our content policies. To fine tune on this data, please try removing the flagged lines and uploading the file again. Flagged lines: 0, 3, 5, 6, 8, 12, 16, 18, 19, 21

In [54]:
'''
lines_to_remove = [0, 3, 5, 6, 8, 12, 16, 18, 19, 21]

remove_lines_and_save(output_file, output_file, lines_to_remove)
'''

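
The flagged-line lists above differ in one detail: the first starts at 1, while the later ones include 0. With the `line - 1` conversion used in `remove_lines_and_save`, a flagged 0 becomes -1 and matches no line, and every other index is shifted by one if the list was already zero-based. Whether the moderation message counts from 0 or from 1 is not established here, so the sketch below makes the base an explicit parameter; `filter_flagged_lines` and `one_based` are hypothetical names:

```python
def filter_flagged_lines(lines, flagged, one_based=False):
    """Return `lines` without the entries whose positions are listed in `flagged`.

    one_based=False treats flagged numbers as 0-based indices;
    one_based=True subtracts 1 first. Out-of-range numbers raise an error
    instead of being silently ignored.
    """
    offset = 1 if one_based else 0
    to_remove = {n - offset for n in flagged}
    invalid = sorted(i for i in to_remove if i < 0 or i >= len(lines))
    if invalid:
        raise ValueError(f"Flagged positions out of range after conversion: {invalid}")
    return [line for i, line in enumerate(lines) if i not in to_remove]

sample = ["a\n", "b\n", "c\n", "d\n"]
print(filter_flagged_lines(sample, [0, 2]))                  # ['b\n', 'd\n']
print(filter_flagged_lines(sample, [1, 3], one_based=True))  # ['b\n', 'd\n']
```

Raising on out-of-range positions makes a wrong base visible immediately instead of leaving flagged lines in the file.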

In [55]:
file_path = r"C:\Users\Li\Desktop\Engineering Seminar Human-Centered Systems\data\messages.jsonl"

In [56]:
'''
#
from openai import OpenAI
client = OpenAI()
# use English version: converted_messages_en.jsonl
file_object  = client.files.create(
  file=open(file_path, "rb"),# could be: #messages.jsonl,#converted_messages.jsonl,#test_messages.jsonl
  purpose="fine-tune"
)
print("file_object.id:",file_object.id)
file_object_id = file_object.id
file_object 

'''
# open when you want to upload new data for fine-tuning

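
Before uploading, it can help to check locally that every line of the training file parses as JSON and follows the chat fine-tuning layout: a `messages` list of objects with `role` and `content`. This does not replace the moderation check that rejected the files above; it only catches formatting problems. A minimal sketch; `validate_chat_jsonl` is a hypothetical helper, and restricting roles to system, user, and assistant is an assumption that fits this dataset:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption for this dataset

def validate_chat_jsonl(lines):
    """Return a list of (line_index, problem) tuples; an empty list means no problems."""
    problems = []
    for i, raw in enumerate(lines):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((i, "missing or empty 'messages' list"))
            continue
        for m in messages:
            valid = (
                isinstance(m, dict)
                and m.get("role") in ALLOWED_ROLES
                and isinstance(m.get("content"), str)
            )
            if not valid:
                problems.append((i, "message needs a valid 'role' and string 'content'"))
                break
    return problems
```

The reported indices are 0-based positions in the list passed in.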

In [57]:
file_object_id = 'file-zDC09wFGIufN4eZaXd5FWFKv'#file-MI68lqdGDLIgx217LTbtFgzG


### Fine-tuning model
Fine-tuning job ID: ftjob-YtoxfhaeCv5EDCjMt1PGbvHZ
Base model: gpt-3.5-turbo-0613

In [58]:
'''
# open when you want to upload new data for fine-tuning
from openai import OpenAI
client = OpenAI()

fine_tuning_job = client.fine_tuning.jobs.create(
  training_file = file_object_id, 
  model="gpt-3.5-turbo"
)
print(fine_tuning_job.id)
fine_tuning_job

#FineTuningJob(id='ftjob-2Lpyr3aaKh1qmUa2PiqOr9ma', created_at=1704830407, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-1RBrqOHK4MGbSBFmx0Tqvb1b', result_files=[], status='validating_files', trained_tokens=None, training_file='file-flA9y8B28JIGeZSq1nM8fPh9', validation_file=None)
'''


In [59]:
fine_tuning_job_id = 'ftjob-YtoxfhaeCv5EDCjMt1PGbvHZ'

In [60]:
'''
from openai import OpenAI
import time

client = OpenAI()

# Loop to check the status of the fine-tuning job
while True:
    fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job_id)
    if fine_tuning_job.status == 'succeeded':
        # The fine-tuning job is complete; read the name of the fine-tuned model
        fine_tuned_model_name = fine_tuning_job.fine_tuned_model
        print("Fine-tuned model name:", fine_tuned_model_name)
        break
    elif fine_tuning_job.status in ('failed', 'cancelled'):
        # Stop polling on any unsuccessful terminal status
        print(f"Fine-tuning job {fine_tuning_job.status}.")
        break
    print("Waiting for the fine-tuning job to complete...")
    time.sleep(60)
# open when you want to upload new data for fine-tuning
'''

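
The polling loop above runs until the job reports a final status. A sketch of the same idea with an upper bound on the waiting time, written against a plain callable so it can be tried without an API key; `wait_for_job`, its parameters, and the `"timeout"` return value are hypothetical:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_status, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll `retrieve_status()` until it returns a terminal status or the timeout passes.

    `retrieve_status` is any zero-argument callable returning the status string,
    for example: lambda: client.fine_tuning.jobs.retrieve(job_id).status
    Returns the final status, or "timeout" if the deadline is reached first.
    """
    waited = 0
    while True:
        status = retrieve_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            return "timeout"
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the function easy to exercise with a no-op in place of a real delay.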

In [61]:
fine_tuned_model_name = 'ft:gpt-3.5-turbo-0613:personal::8fDseZ4U'

In [75]:
import os
import time
from openai import OpenAI
import openai
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
import datetime
# If the authorization scopes are modified, delete the old token.json file
"""if os.path.exists("token.json"):
    os.remove("token.json")
"""
SCOPES = ["https://www.googleapis.com/auth/calendar"]
creds = None
# Check whether a stored authorization token is available
if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
# If there are no valid credentials, let the user log in
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            "client_secret_2_452665542359-3fc9a1q5lcg222qt6uvh3dgbes2n9ktr.apps.googleusercontent.com.json", SCOPES
        )
        creds = flow.run_local_server(port=0)
    # Save the credentials for the next run
    with open("token.json", "w") as token:
        token.write(creds.to_json())

def get_answer_with_context(client, model, messages):
    """
    Using the given OpenAI client and model, generate answers based on the provided list of messages.
    """
    try:
        response = client.chat.completions.create(model=model, messages=messages,max_tokens=300)
        return response.choices[0].message.content
    except Exception as e:
        print(f"An error occurred while processing the problem: {e}")
        return "Unable to get answer"
def format_output(context, user_mood, current_timestamp):
    
    output = ''

    if context:
        output += '. background knowledge: ' + context

    output += '. user mood: ' + user_mood


    output += '. ' + current_timestamp

    return output

# Set OpenAI API key
api_key = os.environ.get('OPENAI_API_KEY')
openai.api_key = api_key

# Initialize OpenAI client
client = OpenAI(api_key=api_key)

# Set up the model
model = fine_tuned_model_name

#Initial conversation message
print("Hi, I'm a chatbot for Behavioral Activation Therapy, focusing on positive activities and problem-solving. Talk to me about anything. 😊")
#Initial conversation message
messages = [
    {"role": "system", "content": "You are a helpful chatbot based on Behavioural Activation (BA) treatment. Your answer must be fewer than 3 sentences. First explain BA in understandable terms and introduce yourself. You use the Google Calendar API so that the user can give commands to add, get, and delete events. The add-event command is add schedule. The get-event command is get schedule. The delete-event command is delete schedule. Find activities that the user likes and encourage the user to take part in them. Encourage users to share their goals and activities publicly, such as on Twitter. The social pressure and potential for public accountability can be a strong incentive. Encourage users to share their goals and activities with their friends and family so that they can supervise users’ activities and help prevent users from avoiding activities. Remind users to check their finished and uncompleted schedules. Encourage users when they complete activities and remind them to remember and share the joy of successfully completing scheduled activities. If depression or anxiety is detected, offer comforting words, suggestions, or techniques for emotion regulation. Given some context, you must give a suitable response based on the context."},
    {"role": "assistant", "content": "Hi, I'm a chatbot for Behavioral Activation Therapy, focusing on positive activities and problem-solving. Talk to me about anything. 😊"},
]
user_input_count = 0
user_mood = ""

while True:
    # Ask the user to enter a question
    user_input = input("This is a fine-tuned model.\nPlease enter your text (or enter 'exit' to end): ").strip()
    if user_input.lower() == 'exit' or user_input.lower() == '':
        if user_input_count > 4: 
            save_messages_to_jsonl(messages,folder_path)
        print('messages: ',messages)
        chat_messages = messages
        break
    # increment counter
    user_input_count += 1
    print("user_input " + str(user_input_count) + ": " + user_input)

    print("###############################")
    # Execute mood_track every five user inputs
    if user_input_count % 5 == 0 and user_input_count>4 :
        user_mood = mood_track(knowledge_df, 0.1)
        print('user_mood: ',user_mood)
        print("###############################")
       
    current_timestamp = get_current_timestamp()
    context = dataframe_to_string(extract_knowledges_from_df(knowledge_df, user_input, 0.3))
    #schedule = dataframe_to_string(extract_knowledges_from_df(schedule_df, user_input, 0.1))

    formatted_output = format_output(context, user_mood, current_timestamp)

    print('context: ',context)
    print("###############################")
    # Add the user message, with retrieved context, mood, and timestamp, to the conversation
    messages.append({"role": "user", "content": "user_input: " + user_input + formatted_output})
    # Get and print answers
    answer = get_answer_with_context(client, model, messages)
    print("Answer:", answer)
    print("###############################")
    # Add assistant's answers to conversation
    messages.append({"role": "assistant", "content": answer})
    
    # Implement schedule related operations: use
    #extract_knowledges_from_df(df,"new schedule",0.3)
    if extract_activity_from_df(knowledge_df,user_input,0.15)=="add schedule .":
        schedule_df,event_id = add_event(schedule_df)
        messages.append({"role": "user", "content": "added schedule: "+str(get_event_with_eventId(event_id)) + "what is the added schedule?"})
        answer = get_answer_with_context(client, model, messages)
        print("Answer:", answer)
        print("###############################")
        # Add assistant's answers to conversation
        messages.append({"role": "assistant", "content": answer})
    elif extract_activity_from_df(knowledge_df,user_input,0.15)=="get schedule .":
        summary = input("Enter the event summary: ")
        location = input("Enter the event location: ")
        startTime = input("Enter the event start time: ")
        endTime = input("Enter the event end time: ")
        get_schedule = get_event_without_eventId(schedule_df, summary, location, startTime, endTime)
        messages.append({"role": "user", "content": "get schedule: " + str(get_schedule) + " what has the user gotten?"})
        answer = get_answer_with_context(client, model, messages)
        print("Answer:", answer)
        print("###############################")
        # Add assistant's answers to conversation
        messages.append({"role": "assistant", "content": answer})
        
    if extract_activity_from_df(knowledge_df,user_input,0.3)=="delete schedule .":
        summary = input("Enter the event summary: ")
        location = input("Enter the event location: ")
        startTime = input("Enter the event start time: ")
        endTime = input("Enter the event end time: ")
        # Keep the returned DataFrame so the deleted row does not linger in schedule_df
        schedule_df, delete_message = delete_event_without_eventId(schedule_df, summary, location, startTime, endTime)
        messages.append({"role": "user", "content": "removed schedule: " + delete_message + " what has the user deleted?"})
        answer = get_answer_with_context(client, model, messages)
        print("Answer:", answer)
        print("###############################")
        # Add assistant's answers to conversation
        messages.append({"role": "assistant", "content": answer})
    time.sleep(1)

Hi, I'm a chatbot for Behavioral Activation Therapy, focusing on positive activities and problem-solving. Talk to me about anything. 😊
user_input 1: Can you give me an advice what activity should I take
###############################
context:   Given retrievaled context: Activity Recommendation  Find activities that user likes and encourage the user to take part in them.
###############################
Answer: Tell me, what are you interested in? 🙃
###############################
user_input 2: I like running and swimming
###############################
context:  
###############################
Answer: Sports and being outdoor are good activities to lift the mood. 🏃‍♂️💪
###############################
user_input 3: what is your suggestion?
###############################
context:   Given retrievaled context: Activity Recommendation  Find activities that user likes and encourage the user to take part in them.
###############################
Answer: How about doing some workout outside?
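
In the loop above, the schedule branches call `extract_activity_from_df` up to three times per turn, and each call constructs the embedding model again. A sketch of classifying the input once and routing it through a lookup table; the handlers here are stand-ins, the label strings are copied from the loop, and using a single threshold for all three commands is an assumption (the loop uses 0.15 and 0.3):

```python
def route_schedule_command(label, handlers):
    """Call the handler registered for `label`; return None if there is none."""
    handler = handlers.get(label)
    if handler is None:
        return None
    return handler()

# Stand-in handlers; in the notebook these would wrap add_event,
# get_event_without_eventId, and delete_event_without_eventId.
handlers = {
    "add schedule .": lambda: "added",
    "get schedule .": lambda: "fetched",
    "delete schedule .": lambda: "deleted",
}

# label = extract_activity_from_df(knowledge_df, user_input, 0.15)  # computed once per turn
print(route_schedule_command("get schedule .", handlers))  # fetched
print(route_schedule_command("", handlers))                 # None
```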

In [65]:
schedule_df

Unnamed: 0,event_id,summary,start_date_time,end_date_time,time_zone,location,status,attendees,creation_date,last_modified_date,reminders,embedding


## Finished tasks
1. Mood tracking (passive and active)
2. BA Introduction: the chatbot first explains BA in understandable terms and introduces itself.
3. Public Declaration: encourage users to share their goals and activities publicly, such as on Twitter. The social pressure and potential for public accountability can be a strong incentive.
4. Partner Supervision: encourage users to share their goals and activities with their friends and family so that they can supervise users’ activities and help prevent users from avoiding activities.
5. Implementation of Activity Recommendation
6. Implementation of the incentive mechanism (Public Declaration, Partner Supervision, Schedule Check-in History, Compliments and Praise)
7. Schedule Management (possible solution: use two DataFrames: 1. Knowledge (read-only), 2. Schedule (read and write))
8. Compliments and Praise: encourage users when they complete activities and remind them to remember and share the joy of successfully completing scheduled activities.




### Updated functions
1. Modified schedule management
2. Modified the schedule table, which contained too little information. New columns: 
3. New prompt construction: Mood Intervention: if depression or anxiety is detected, the chatbot needs to offer comforting words, suggestions, or techniques for emotion regulation.
4. Google Calendar API
5. Reduced information overload