In [1]:
import pandas as pd

In [12]:
# read json file
data = pd.read_json("data/combined_dataset.json", lines=True)

In [17]:
data

Unnamed: 0,Context,Response
0,I'm going through some things with my feelings...,"If everyone thinks you're worthless, then mayb..."
1,I'm going through some things with my feelings...,"Hello, and thank you for your question and see..."
2,I'm going through some things with my feelings...,First thing I'd suggest is getting the sleep y...
3,I'm going through some things with my feelings...,Therapy is essential for those that are feelin...
4,I'm going through some things with my feelings...,I first want to let you know that you are not ...
...,...,...
3507,My grandson's step-mother sends him to school ...,Absolutely not! It is never in a child's best ...
3508,My boyfriend is in recovery from drug addictio...,I'm sorry you have tension between you and you...
3509,The birth mother attempted suicide several tim...,"The true answer is, ""no one can really say wit..."
3510,I think adult life is making him depressed and...,How do you help yourself to believe you requir...


In [27]:
# Display dataset info
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2752 entries, 0 to 3511
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Context   2752 non-null   object
 1   Response  2752 non-null   object
dtypes: object(2)
memory usage: 64.5+ KB


### Data cleaning
#### Remove duplicates

In [18]:
data.duplicated().sum()

760

In [20]:
data.drop_duplicates(inplace=True)

In [21]:
data.duplicated().sum()

0

#### Check for missing values

In [22]:
data.isna().sum()

Context     0
Response    0
dtype: int64

#### Clean the Context Text

In [23]:
import re

In [29]:
# Remove Extra Spaces, Tabs, and Newlines
data['Context'] = data['Context'].str.replace(r"\s+", " ", regex=True).str.strip()

In [34]:
# Standardize Capitalization
data['Context'] = data['Context'].str.lower()
data['Context'] = data['Context'].str.replace(r"(^\w|\.\s*\w)", lambda m: m.group().upper(), regex=True)
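Lowercasing the whole string before re-capitalizing sentence starts also lowercases the pronoun "I" in mid-sentence, and the pattern above treats only a period as a sentence boundary. The following is a minimal sketch of one possible adjustment, written with plain `re` on a single string. The helper name `sentence_case` is hypothetical and not part of the notebook.

```python
import re

def sentence_case(text: str) -> str:
    """Lowercase text, then restore sentence-initial capitals and the pronoun 'I'."""
    text = text.lower()
    # Capitalize the first character of the string and the first word
    # character that follows a '.', '!' or '?'
    text = re.sub(r"(^\w|[.!?]\s*\w)", lambda m: m.group().upper(), text)
    # Restore the standalone pronoun "i", including contractions such as i'd
    text = re.sub(r"\bi\b", "I", text)
    return text

print(sentence_case("FIRST thing I'd suggest is sleep. i think THAT helps! does it?"))
```

Abbreviations such as "i.e." and proper nouns would still need separate handling; this sketch only addresses the pronoun and the extra sentence terminators.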


In [35]:
# Remove Sensitive Data
data['Context'] = data['Context'].str.replace(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]", regex=True)
data['Context'] = data['Context'].str.replace(r"\b\d{10}\b", "[PHONE NUMBER]", regex=True)
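The phone pattern above matches only ten consecutive digits, so a number written as `555-123-4567` or `(555) 123-4567` passes through unmasked. A broader pattern could look like the sketch below. `PHONE_RE` and `mask_phones` are hypothetical names, and the pattern assumes North American style numbers with an optional `+1` prefix.

```python
import re

# Assumed format: optional +1 / 1 prefix, then 3-3-4 digits separated by an
# optional space, dot, or hyphen, with optional parentheses around the area code.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

def mask_phones(text: str) -> str:
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_RE.sub("[PHONE NUMBER]", text)

for sample in ["call 5551234567 now", "call 555-123-4567 now", "call (555) 123-4567 now"]:
    print(mask_phones(sample))
```

The digit lookarounds keep longer digit runs, such as a 12-digit identifier, from being partially masked.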


In [36]:
# Normalize Punctuation
data['Context'] = data['Context'].str.replace(r"[?!]+", lambda m: m.group()[0], regex=True)
data['Context'] = data['Context'].str.replace(r"([.,!?])(\w)", r"\1 \2", regex=True)
data['Context'] = data['Context'].str.replace(r"\s([.,!?])", r"\1", regex=True)
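To see what the three punctuation steps do, it can help to run them on a single string. The sketch below repeats the same patterns with plain `re`; the function name `normalize_punctuation` and the sample sentences are made up for illustration.

```python
import re

def normalize_punctuation(text: str) -> str:
    # Collapse a run of '?' and '!' to the first character of the run
    text = re.sub(r"[?!]+", lambda m: m.group()[0], text)
    # Insert a space after . , ! ? when a word character follows directly
    text = re.sub(r"([.,!?])(\w)", r"\1 \2", text)
    # Drop one whitespace character that precedes . , ! ?
    text = re.sub(r"\s([.,!?])", r"\1", text)
    return text

print(normalize_punctuation("Why me?!! I tried ,really.Nothing helps ."))
# Side effect: the second step also splits decimals
print(normalize_punctuation("I sleep 3.5 hours"))
```

The second call shows that decimals (and, by the same rule, URLs or file names) get a space inserted after the dot, which may or may not matter for this dataset.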


#### Clean the Response Text

In [41]:
data['Response'] = data['Response'].str.replace(r"\s+", " ", regex=True).str.strip()


In [43]:
data['Response'] = data['Response'].str.lower()
data['Response'] = data['Response'].str.replace(r"(^\w|\.\s*\w)", lambda m: m.group().upper(), regex=True)


In [44]:
data['Response'] = data['Response'].str.replace(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]", regex=True)
data['Response'] = data['Response'].str.replace(r"\b\d{10}\b", "[PHONE NUMBER]", regex=True)


In [45]:
data['Response'] = data['Response'].str.replace(r"[?!]+", lambda m: m.group()[0], regex=True)
data['Response'] = data['Response'].str.replace(r"([.,!?])(\w)", r"\1 \2", regex=True)
data['Response'] = data['Response'].str.replace(r"\s([.,!?])", r"\1", regex=True)
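The Response cells repeat the Context cells step for step, so the whole sequence could live in one function that is applied to both columns. The sketch below chains the same patterns in the same order on a plain string; `clean_text` and the sample row are hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Apply the notebook's cleaning steps, in order, to one string."""
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    text = text.lower()                                  # standardize case
    text = re.sub(r"(^\w|\.\s*\w)", lambda m: m.group().upper(), text)
    text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]", text)
    text = re.sub(r"\b\d{10}\b", "[PHONE NUMBER]", text)
    text = re.sub(r"[?!]+", lambda m: m.group()[0], text)  # collapse ?! runs
    text = re.sub(r"([.,!?])(\w)", r"\1 \2", text)       # space after punctuation
    text = re.sub(r"\s([.,!?])", r"\1", text)            # no space before punctuation
    return text

rows = [{"Context": "  HELLO   there!!  ", "Response": "Write to me@example.com .Thanks"}]
cleaned = [{col: clean_text(value) for col, value in row.items()} for row in rows]
print(cleaned)
```

With the DataFrame from this notebook, the same function could be applied per column with `data[col].map(clean_text)` for `col` in `["Context", "Response"]`.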


In [47]:
# Save to a new JSON file
data.to_json("data/cleaned_dataset.json", orient="records", lines=True)

print("Dataset cleaned and saved as 'data/cleaned_dataset.json'.")


Dataset cleaned and saved as 'data/cleaned_dataset.json'.


In [48]:
clean_data = pd.read_json("data/cleaned_dataset.json", lines=True)

In [49]:
clean_data

Unnamed: 0,Context,Response
0,I'm going through some things with my feelings...,"If everyone thinks you're worthless, then mayb..."
1,I'm going through some things with my feelings...,"Hello, and thank you for your question and see..."
2,I'm going through some things with my feelings...,First thing i'd suggest is getting the sleep y...
3,I'm going through some things with my feelings...,Therapy is essential for those that are feelin...
4,I'm going through some things with my feelings...,I first want to let you know that you are not ...
...,...,...
2747,"After first meeting the client, what is the pr...",Hi. This is an excellent question! i think tha...
2748,My boyfriend is in recovery from drug addictio...,I'm sorry you have tension between you and you...
2749,The birth mother attempted suicide several tim...,"The true answer is, ""no one can really say wit..."
2750,I think adult life is making him depressed and...,How do you help yourself to believe you requir...


### Tokenization

In [50]:
from transformers import AutoTokenizer

In [55]:
# Initialize the tokenizer
model_name = "gpt2"  # Replace with your desired model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assign the eos_token as the padding token
tokenizer.pad_token = tokenizer.eos_token


In [56]:
# Tokenize each Context/Response pair as a single sequence. A causal LM
# needs input_ids and labels of the same shape, so the two columns cannot
# be tokenized and padded separately.
texts = [
    context + tokenizer.eos_token + response + tokenizer.eos_token
    for context, response in zip(clean_data["Context"], clean_data["Response"])
]
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

In [58]:
# Combine into a single dataset
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, encodings):
        self.input_ids = encodings.input_ids
        self.attention_mask = encodings.attention_mask
        # Labels mirror input_ids; padded positions get -100 so the loss skips them
        self.labels = self.input_ids.clone()
        self.labels[self.attention_mask == 0] = -100

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx],
        }

# Create the dataset object
dataset = TextDataset(encodings)
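A causal language model such as GPT-2 computes its loss position by position, so `labels` must have the same length as `input_ids`. One common arrangement joins the prompt and the response into one sequence and masks the prompt positions so that only the response contributes to the loss. The sketch below shows the idea with plain lists and made-up token ids. `build_example` is a hypothetical helper, and `-100` is the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_example(prompt_ids, response_ids, eos_id, pad_id, max_length):
    """Join prompt and response into one padded sequence with aligned labels."""
    input_ids = (list(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    # Only the response tokens and the closing EOS contribute to the loss
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    attention_mask = [1] * len(input_ids)
    # Right-pad all three lists to max_length
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    attention_mask += [0] * pad
    labels += [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Made-up ids: prompt = [11, 12, 13], response = [21, 22]
example = build_example([11, 12, 13], [21, 22], eos_id=0, pad_id=0, max_length=8)
print(example)
```

Because the pad id equals the EOS id here (as it does after `tokenizer.pad_token = tokenizer.eos_token`), the attention mask, not the token id, is what distinguishes the real EOS from padding.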

#### Fine-Tune the Model

In [60]:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10,
    save_total_limit=2,
    logging_dir="./logs",
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()





ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`

#### Save the Fine-Tuned Model


In [61]:
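# Note: trainer.train() above stopped with an ImportError, so at this point
# the saved weights are still the pretrained "gpt2" checkpoint. Rerun this
# cell after training completes.
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
print("Model and tokenizer saved!")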


Model and tokenizer saved!


#### Deploy as a Chatbot

In [81]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_saved_model(model_path):
    """
    Load the saved model and tokenizer.
    """
    try:
        # Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        # Load the model
        model = AutoModelForCausalLM.from_pretrained(model_path)
        
        # Move model to GPU if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = model.to(device)
        
        return model, tokenizer
    
    except Exception as e:
        print(f"Error loading saved model: {e}")
        return None, None

def generate_response(model, tokenizer, prompt, max_length=200):
    """
    Generate a response using the saved model.
    """
    try:
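        # Prepare the input (input_ids plus attention_mask)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        # Generate response; passing the attention mask and an explicit
        # pad_token_id avoids the "not set" warnings from generate()
        outputs = model.generate(
            **inputs,
            max_length=max_length, 
            num_return_sequences=1,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )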
        
        # Decode the response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        return response
    
    except Exception as e:
        print(f"Error generating response: {e}")
        return ""

def interactive_chat():
    """
    Interactive chat interface to input questions and get responses.
    """
    # Path to the saved model
    model_path = "./fine_tuned_model"
    
    # Load the saved model and tokenizer
    model, tokenizer = load_saved_model(model_path)
    
    if not model or not tokenizer:
        print("Failed to load the saved model.")
        return
    
    print("🤖 Interactive Model Response")
    print("Type 'exit' to quit the program")
    
    while True:
        try:
            # Get user input
            user_input = input("\nEnter your question: ").strip()
            
            # Check for exit condition
            if user_input.lower() in ['exit', 'quit', 'bye']:
                print("Goodbye! 👋")
                break
            
            # Generate and print response
            if user_input:
                response = generate_response(model, tokenizer, user_input)
                print("\nModel's Response:", response)
        
        except KeyboardInterrupt:
            print("\n\nChat interrupted. Goodbye! 👋")
            break
        except Exception as e:
            print(f"An error occurred: {e}")

if __name__ == "__main__":
    interactive_chat()

🤖 Interactive Model Response
Type 'exit' to quit the program


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Model's Response: i am sad to say, but I am sorry for being so sad about this. I was never happy with myself. I had to make a choice.

I can't help but be sad that my wife was not able to go through this. She was diagnosed with HIV, so I wanted to help her.

The day before my appointment, my family and I had a baby girl. She was just 3 months old. I'm now 21 years old. I took her to the emergency room and my doctor told me she was in a good condition.

I had to walk her through the front door to the hospital because it was so close to where I'm now. I told her the doctor I was going to take her to the emergency room and she said, "Oh my God, this is so difficult. I'm so sorry. I'm so sorry. I can't wait to get back to my life."

The day after, my family and I got





Model's Response: i am sad to say that I am not the one to be blamed. I am not a one who will be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed.

I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am not the one to be blamed. I am





Model's Response: i am deppressed by the other side. I am not the first to notice the lack of a good reason for this.

In the first place, I would have been surprised if the only reason why this group was not on the top of the list was because they weren't strong enough to handle the other side's attacks.

But, the second thing is that the rest of the group is the same as well. The one who was stronger than the others had to be the strongest, and that was the reason they were ranked No. 2.

If we compare them to the other side's members, we'll see that they are also the strongest.

But, their strength would also be different.

Even if the other side was stronger than them, they would still be ranked No. 3.

"Well, it's not like that. But, if we compare them to the other side's members, we'll see that they are also





Model's Response: i am depressed, and I'm sorry," she said.

"I'm not sure I want to leave. I don't want to leave my house," she said.

The couple then left, leaving the couple in the living room, where they were still in the car with the couple's daughter.

The couple was driving home to a family member's home in south Winnipeg when they heard a loud bang, the woman said.

She said she got out of her car and saw a man in a black shirt running. She said she heard a woman screaming, "What are you doing? I'm a child," she said.

She said she thought she saw the man in the car and then saw a man in the street. She thought she saw him running, and then she saw a man outside the house.

She said she saw the man running toward the house. She saw the man's car, and then saw a man in the house
Goodbye! 👋
