## Capstone Project Title: Development of a Natural Language Multilingual Chatbot for the Kenyan Market

## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview 

This project aims to develop a natural language chatbot capable of generating human-like responses and understanding informal customer feedback expressed in English, Kenyan Swahili and Sheng. Designed for a startup expanding into the Kenyan market, the chatbot will help the company engage users more naturally and analyze feedback from social platforms and online conversations. By training on locally relevant dialogue data, including YouTube comments and Kenyan media, the system is intended to capture the linguistic and cultural nuances often missed by standard models.

## 1.1 Problem Statement
Startups entering new markets often struggle to understand customer feedback when it is expressed in local dialects or informal language. In Kenya, much of this communication occurs in Swahili and Sheng, the latter blending local slang, English, and Swahili in a fluid, often unstructured manner. Existing chatbot systems trained on formal English often fail to grasp the tone, intent, or meaning behind such messages. This project aims to fill that gap by building a chatbot trained specifically on real-world Kenyan conversations to interpret and respond to customer queries and feedback with local context and relevance.

## 1.2 Objectives

- Collect and preprocess Kenyan user dialogue from YouTube, social media, and local content featuring Swahili and Sheng

- Fine-tune the chatbot with foundational data for conversational structure, while emphasizing local language patterns

- Build a sequence-to-sequence model capable of handling informal, code-switched dialogue

- Evaluate the chatbot’s performance with emphasis on contextual relevance and local understanding

- Present a working prototype that simulates real customer feedback scenarios 

In [1]:
import re, os, csv
import pandas as pd 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# 2.0 Data Scraping and Understanding

The Cornell Movie Dialogues Corpus contains fictional conversations from movie scripts. It is made up of several text files, but the two we are using here are:

- `movie_lines.txt` – contains individual lines of dialogue.

- `movie_conversations.txt` – contains sequences of line IDs, showing how those lines form a conversation.
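
To make the file layout concrete, here is a minimal sketch of how one record from each file splits on the ` +++$+++ ` separator. The two sample records are illustrative stand-ins, not lines copied from the files, and they assume the field order of the public corpus release: line ID, character ID, movie ID, character name and text for `movie_lines.txt`, and two character IDs, a movie ID and a list of line IDs for `movie_conversations.txt`.

```python
import re

SEPARATOR = " +++$+++ "

# Illustrative records (assumed to follow the corpus field order)
sample_line = "L194 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Can we make this quick?"
sample_conversation = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# A dialogue record has five fields; the last one is the spoken text
line_id, character_id, movie_id, character_name, text = sample_line.split(SEPARATOR)

# A conversation record stores its line IDs as a list literal in the fourth field
line_ids = re.findall(r"L\d+", sample_conversation.split(SEPARATOR)[3])

print(line_id, "->", text)
print(line_ids)
```

A regular expression is used here to pull out the IDs so that nothing read from the file is ever executed as code.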

In [2]:
# load the cornell `movie_lines.txt` dataset
# step 1 - Loading and Parsing movie_lines.txt
# GOAL: Create a dict `id2line` that maps a `line ID` to its actual dialogue.
movie_lines_path = r'original-data\movie_lines.txt'
# the dictionary
id2line = {}
# parsing
with open(movie_lines_path, encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.strip().split(" +++$+++ ")
        if len(parts) == 5:
            line_id, _, _, _, text = parts
            id2line[line_id] = text
# step 2 - Load and Parse `movie_conversations.txt`
# GOAL: To create a list of conversations, where each conversation 
# is a list of line IDs i.e. Extract conversations - list of line IDs
movie_conversations_path = r"original-data\movie_conversations.txt"
# list of conversations
conversations = []
# parsing
with open(movie_conversations_path, encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.strip().split(" +++$+++ ")
        if len(parts) == 4:
            line_ids_str = parts[3]
            # Extract the line IDs with a regex instead of eval(),
            # which would execute whatever text the file contains
            line_ids = re.findall(r"L\d+", line_ids_str)
            conversations.append(line_ids)
# step 3 - Reconstruct Conversations from Line IDs
# GOAL: Convert line IDs back into actual text using the 
# `id2line` dictionary.
# list that will hold the reconstructed conversations
reconstructed_conversations = []
#  parsing
for conv in conversations:
    dialogue = []
    for line_id in conv:
        line_text = id2line.get(line_id, "")
        dialogue.append(line_text)
    reconstructed_conversations.append(dialogue)
# step 4 - View Sample Conversations
# first 3 conversations
for i, convo in enumerate(reconstructed_conversations[:3]):
    print(f"\nConversation {i+1}")
    for turn in convo:
        print(turn)
# Save to Text File
with open("reconstructed_conversations.txt", "w", encoding="utf-8") as f:
    for convo in reconstructed_conversations:
        for turn in convo:
            f.write(turn + "\n")
        f.write("\n" + "-"*40 + "\n")


Conversation 1
Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
Not the hacking and gagging and spitting part.  Please.
Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

Conversation 2
You're asking me out.  That's so cute. What's your name again?
Forget it.

Conversation 3
No, no, it's my fault -- we didn't have a proper introduction ---
Cameron.
The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
Seems like she could get a date easy enough...


### 2.1 Saving the `reconstructed_conversations` file

In [3]:
# Target folder 
output_folder = "chatbot-repo"
os.makedirs(output_folder, exist_ok=True)

# Full path to save the file
output_file_path = os.path.join(output_folder, "reconstructed_conversations.txt")

# Save the reconstructed conversations into the file
with open(output_file_path, "w", encoding="utf-8") as f:
    for convo in reconstructed_conversations:
        for turn in convo:
            f.write(turn + "\n")
        f.write("\n" + "-"*40 + "\n")

# confirmation
## print(f"File saved to: {os.path.abspath(output_file_path)}")
# Confirm that the file exists
if os.path.exists(output_file_path):
    print("✅ Yes, file successfully saved!")
    # print("📁 Location:", os.path.abspath(output_file_path))
else:
    print("❌ Failed to save the file.")


✅ Yes, file successfully saved!


### Summary

- `id2line` has mapped line IDs to actual dialogue text.

- `conversations` then holds the sequences of line IDs per conversation.

- `reconstructed_conversations` gives you the full text of the conversations.

### 2.2 YouTube datasets

In [4]:
yt_file_path_pamela = r'original-data\pamela-youtube_comments.csv'
yt_file_path_brian = r'original-data\brian_youtube_data_comments.csv'

# Pamela's file
yt_1 = pd.read_csv(yt_file_path_pamela)
print("Pamela's file has the shape:", yt_1.shape)

# Brian's file
yt_2 = pd.read_csv(yt_file_path_brian)
print("Brian's file has the shape:", yt_2.shape)
print('\n')

print("Pamela's YouTube File")
print(yt_1.head(), end = '\n\n\n')

print("Brian's YouTube File")
print(yt_2.head(), end = '\n\n\n')

Pamela's file has the shape: (25629, 2)
Brian's file has the shape: (2853, 2)


Pamela's YouTube File
      video_id                                            comment
0  qlZM3McwO1Q  What an incredible victory. I agree the Kenyan...
1  qlZM3McwO1Q                                                  ❤
2  qlZM3McwO1Q  “Claudia is an amazonian goddess with a beauti...
3  qlZM3McwO1Q  Proud of my motherland Kenya ❤❤❤and Africa.at ...
4  qlZM3McwO1Q                                               Damn


Brian's YouTube File
                                         Top Comment Reply
0  Apple missed the boat on AI OR... Apple is doi...   NaN
1  Who added the background music to the video it...   NaN
2                    16:26  FEMI KUTI !!! RAAHHH !!!   NaN
3            The greatest AI scam in history, is AI.   NaN
4  if I only knew Siri is a mess, I would bought ...   NaN




In [5]:
# Cleaning yt_1 file
## i.e dropping first column
yt_1 = yt_1.drop(yt_1.columns[0], axis=1)

# Cleaning yt_2 file
yt_2 = yt_2.drop(yt_2.columns[1], axis=1)

print("Pamela's YouTube File")
print("Now, Pamela's file has the shape:", yt_1.shape)
print(yt_1.head(), end = '\n\n\n')

print("Brian's YouTube File")
print("Now, Brian's file has the shape:", yt_2.shape)
print(yt_2.head(), end = '\n\n\n')


Pamela's YouTube File
Now, Pamela's file has the shape: (25629, 1)
                                             comment
0  What an incredible victory. I agree the Kenyan...
1                                                  ❤
2  “Claudia is an amazonian goddess with a beauti...
3  Proud of my motherland Kenya ❤❤❤and Africa.at ...
4                                               Damn


Brian's YouTube File
Now, Brian's file has the shape: (2853, 1)
                                         Top Comment
0  Apple missed the boat on AI OR... Apple is doi...
1  Who added the background music to the video it...
2                    16:26  FEMI KUTI !!! RAAHHH !!!
3            The greatest AI scam in history, is AI.
4  if I only knew Siri is a mess, I would bought ...




In [6]:
# Combining the two DataFrames (their column names differ, so concat keeps both columns)
yt_combined = pd.concat([yt_1, yt_2], axis=0, ignore_index=True)
# Thus...
print("The Shape of the combine YouTube data is:", yt_combined.shape)

# Confirming successful stacking
# print(yt_1.shape[0] + yt_2.shape[0] == yt_combined.shape[0])

The Shape of the combine YouTube data is: (28482, 2)
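
The second dimension of 2 in the shape above suggests that `pd.concat` kept `comment` and `Top Comment` as separate columns, because the two frames use different column names; each row would then hold a value in one column and `NaN` in the other. If a single text column is wanted, one option is to rename before stacking. The sketch below uses small stand-in frames rather than the real files, and the shared column name `comment` is an assumption.

```python
import pandas as pd

# Stand-in frames mirroring the two column names seen above
yt_a = pd.DataFrame({"comment": ["Proud of my motherland Kenya", "Damn"]})
yt_b = pd.DataFrame({"Top Comment": ["The greatest AI scam in history, is AI."]})

# Give both frames the same column name so the rows stack into one column
yt_b = yt_b.rename(columns={"Top Comment": "comment"})
stacked = pd.concat([yt_a, yt_b], axis=0, ignore_index=True)

print(stacked.shape)
print(stacked["comment"].isna().sum(), "missing values")
```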


In [7]:
# Folder and file path parameters
folder_path = 'chatbot-repo'
file_path = os.path.join(folder_path, 'yt_combined.txt')

# Save the DataFrame to a txt file
yt_combined.to_csv(file_path, sep='\t', index=False)

# Confirmation message
print(f"Success, `yt_combined` has been saved to {folder_path}.")

Success, `yt_combined` has been saved to chatbot-repo.


In [8]:
# File names
file1 = r'chatbot-repo\reconstructed_conversations.txt' # Cornell-uni
file2 = r'chatbot-repo\yt_combined.txt' # youtube-comments
nlp_data = r'chatbot-repo\nlp_data.txt'  # Output file is literally named "nlp_data.txt"

# Read both files
with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
    content1 = f1.read()
    content2 = f2.read()

# Combine contents
combined_data = content1 + "\n" + content2

# Write to output file
with open(nlp_data, "w", encoding="utf-8") as out:
    out.write(combined_data)

print(f"Files combined and saved to {nlp_data}")


Files combined and saved to chatbot-repo\nlp_data.txt


# 3.0 Data Cleaning

In [9]:
# preprocessing
# Function to clean the data
def clean_text(text):

    if pd.isnull(text):
        return ""
    
    text = text.lower()                          # Lowercase all text
    text = re.sub(r"http\S+|www\S+", "", text)   # Remove URLs
    text = re.sub(r"@\w+", "", text)             # Remove mentions
    text = re.sub(r"#\w+", "", text)             # Remove hashtags
    text = re.sub(r"[^\w\s]", "", text)          # Remove punctuation
    text = re.sub(r"\d+", "", text)              # Remove digits
    text = re.sub(r"\s+", " ", text).strip()     # Remove extra whitespace
    
    return text

In [10]:
# Clean a whole text file by reusing clean_text

def clean_text_file(input_file, output_file="cleaned_output.txt"):
    # Read the input file
    with open(input_file, 'r', encoding='utf-8') as f:
        original_text = f.read()
    
    # Clean the text with the same steps as clean_text (defined in the previous cell)
    cleaned_text = clean_text(original_text)

    # Write the cleaned text to a new file in the same directory
    input_dir = os.path.dirname(input_file)
    output_path = os.path.join(input_dir, output_file)
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(cleaned_text)

    print(f"Cleaned text written to: {output_path}")
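
Because the `\s+` substitution also matches newlines and the punctuation rule strips hyphens, running these steps over a whole file joins every line into one and drops the dashed conversation separators. If line boundaries matter for later pairing, a line-by-line variant is one option. This is only a sketch: `clean_line`, `clean_lines` and `SEPARATOR` are hypothetical names, and the regex steps simply mirror the ones used above.

```python
import re

SEPARATOR = "-" * 40  # conversation delimiter written earlier in the notebook

def clean_line(text):
    """Apply the notebook's regex cleaning steps to a single line."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)             # mentions
    text = re.sub(r"#\w+", "", text)             # hashtags
    text = re.sub(r"[^\w\s]", "", text)          # punctuation
    text = re.sub(r"\d+", "", text)              # digits
    return re.sub(r"\s+", " ", text).strip()     # extra whitespace

def clean_lines(raw_text):
    """Clean each line separately, keeping separators and dropping empty results."""
    cleaned = []
    for line in raw_text.splitlines():
        if line.strip() == SEPARATOR:
            cleaned.append(SEPARATOR)
            continue
        result = clean_line(line)
        if result:
            cleaned.append(result)
    return "\n".join(cleaned)

sample = "Forget it.\n\n" + SEPARATOR + "\nSasa @user, check www.example.com #Kenya 2024!"
print(clean_lines(sample))
```

With this layout each cleaned turn stays on its own line, so the separator can still be used to split conversations afterwards.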


# 4.0 Data Preprocessing

In [11]:
# Cleaning nlp_data.txt; the result is written to cleaned_output.txt in the same folder

nlp_data_file = r'chatbot-repo\nlp_data.txt'

clean_text_file(nlp_data_file)

Cleaned text written to: chatbot-repo\cleaned_output.txt


In [12]:
nlp_data_file

'chatbot-repo\\nlp_data.txt'

In [13]:
# viewing the data 
file_path = 'chatbot-repo\\nlp_data.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    for i in range(20):  # Change 20 to however many lines you want to preview
        print(file.readline().strip())

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
Not the hacking and gagging and spitting part.  Please.
Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

----------------------------------------
You're asking me out.  That's so cute. What's your name again?
Forget it.

----------------------------------------
No, no, it's my fault -- we didn't have a proper introduction ---
Cameron.
The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
Seems like she could get a date easy enough...

----------------------------------------
Why?
Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.
That's a shame.



# 5.0 Creating the Baseline Model

In [14]:
# Loading our dataset from a plain text file (one input per line)
with open(r"chatbot-repo\nlp_data.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Create a DataFrame from the lines
df = pd.DataFrame({'input': [line.strip() for line in lines if line.strip()]})

# Clean the dataset
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

# Sample a manageable dataset or take all if less than 1000 lines
sample_df = df.sample(n=min(1000, len(df)), random_state=42).reset_index(drop=True)

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), lowercase=True)
X_sample = vectorizer.fit_transform(sample_df['input'])

# Fit Nearest Neighbors Model
nn_model = NearestNeighbors(n_neighbors=1, metric='cosine').fit(X_sample)

# Define Chatbot Function
def baseline_chatbot(query):
    """
    Baseline chatbot using TF-IDF and Nearest Neighbors.
    Returns the sampled line closest to the query; accepts Swahili and English input.
    """
    query_vec = vectorizer.transform([query])
    distance, index = nn_model.kneighbors(query_vec)
    return sample_df.loc[index[0][0], 'input']

In [15]:
# Running - test
print("EN:", baseline_chatbot("What is AI?"))
print("SW:", baseline_chatbot("Habari yako kuhusu teknolojia?"))


EN: What?  What is it?
SW: But maybe you'd better not.  I've got a witch mad at me, and you might get into trouble.
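
In the Swahili test above the nearest neighbour is an unrelated English line, and in both tests the bot returns the matched line itself rather than a reply to it. One possible refinement is to index the prompts of prompt-response pairs, return the paired response, and fall back to a default message when the match is weak. The sketch below assumes a few toy pairs; `retrieve_response`, the `max_distance` value of 0.9 and the fallback text are illustrative choices, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy prompt-response pairs standing in for the real dialogue data
toy_pairs = [
    {"prompt": "what is your name", "response": "my name is cameron"},
    {"prompt": "where are you going", "response": "to the market"},
    {"prompt": "habari yako", "response": "nzuri sana"},
]

# Index only the prompts; responses are looked up by position
prompt_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), lowercase=True)
prompt_matrix = prompt_vectorizer.fit_transform([p["prompt"] for p in toy_pairs])
prompt_index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(prompt_matrix)

def retrieve_response(query, max_distance=0.9,
                      fallback="Sorry, I did not understand that."):
    """Return the response paired with the closest prompt, or a fallback."""
    query_vec = prompt_vectorizer.transform([query])
    distances, indices = prompt_index.kneighbors(query_vec)
    # A cosine distance near 1 means the query shares almost no terms with any prompt
    if distances[0][0] > max_distance:
        return fallback
    return toy_pairs[indices[0][0]]["response"]

print("SW:", retrieve_response("Habari yako kuhusu teknolojia?"))
print("??:", retrieve_response("xyz"))
```

The threshold would need tuning on real data; it is set loosely here only to separate "some overlap" from "no overlap".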


### Converting the data into a structured format

Converting the file from .txt to JSON format

In [16]:
import json

input_path = 'chatbot-repo\\nlp_data.txt'
output_path = 'chatbot-repo\\dialogues.json'

with open(input_path, 'r', encoding='utf-8') as file:
    raw_data = file.read()

# Split the text by the separator line
blocks = [block.strip() for block in raw_data.split('----------------------------------------') if block.strip()]

# Convert each block to a prompt-response pair
pairs = []
for i in range(0, len(blocks) - 1, 2):  # loop in steps of 2
    prompt = " ".join(blocks[i].splitlines()).strip()
    response = " ".join(blocks[i + 1].splitlines()).strip()
    pairs.append({"prompt": prompt, "response": response})

# Save as JSON
with open(output_path, 'w', encoding='utf-8') as out_file:
    json.dump(pairs, out_file, indent=2, ensure_ascii=False)

print(f"✅ Converted {len(pairs)} prompt-response pairs and saved to {output_path}")

✅ Converted 41554 prompt-response pairs and saved to chatbot-repo\dialogues.json
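
The loop above appears to pair each conversation block with the block that follows it, so a prompt and its response can come from two different conversations. An alternative worth considering is to pair consecutive turns within the same conversation. The sketch below shows that idea on one conversation from the sample output; `turns_to_pairs` is a hypothetical helper name.

```python
def turns_to_pairs(turns):
    """Pair each turn with the turn that follows it in the same conversation."""
    turns = [t.strip() for t in turns if t.strip()]
    return [{"prompt": a, "response": b} for a, b in zip(turns, turns[1:])]

conversation = [
    "You're asking me out.  That's so cute. What's your name again?",
    "Forget it.",
]
turn_pairs = turns_to_pairs(conversation)
print(turn_pairs)
```

A conversation with N non-empty turns yields N - 1 pairs, and a single-turn conversation yields none.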


## Tokenizing the chatbot data

The following are used:

- Hugging Face's `transformers` library

- `datasets` to load the JSON

- A tokenizer compatible with the base model (e.g., DialoGPT)

In [38]:
# install the libraries
! pip install transformers datasets



In [39]:
# import the libraries
from datasets import load_dataset
from transformers import AutoTokenizer

OSError: [WinError 127] The specified procedure could not be found. Error loading "c:\Users\hp\anaconda3\envs\learn-env\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

In [33]:
!  pip install torch



In [34]:
import torch
print(torch.__version__)

OSError: [WinError 127] The specified procedure could not be found. Error loading "c:\Users\hp\anaconda3\envs\learn-env\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

In [35]:
# Load the JSON file
data = load_dataset('json', data_files='chatbot-repo/dialogues.json')
data


DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 41554
    })
})

In [36]:
# Tokenizer using DialoGPT:
model_checkpoint = "microsoft/DialoGPT-medium"  # or DialoGPT-small/large
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

NameError: name 'AutoTokenizer' is not defined

In [41]:
tokenizer.pad_token = tokenizer.eos_token

## Concatenate Prompt and Response

In [42]:
## combine the prompt and response into one sequence (with end-of-sequence tokens):
def preprocess(example):
    # Combine prompt and response with special tokens
    combined = example['prompt'] + tokenizer.eos_token + example['response'] + tokenizer.eos_token
    return tokenizer(combined, truncation=True, padding="max_length", max_length=128)
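
To show what the concatenated sequence looks like before tokenization, here is a small sketch that builds the same string without loading a tokenizer. It assumes the end-of-sequence token is `<|endoftext|>`, the value used by GPT-2 style tokenizers such as DialoGPT's; `build_training_text` is a hypothetical helper name.

```python
EOS = "<|endoftext|>"  # assumed value of tokenizer.eos_token for GPT-2 style tokenizers

def build_training_text(prompt, response, eos=EOS):
    """Join prompt and response, ending each with the end-of-sequence token."""
    return prompt + eos + response + eos

example = {"prompt": "What's your name again?", "response": "Forget it."}
training_text = build_training_text(example["prompt"], example["response"])
print(training_text)
# What's your name again?<|endoftext|>Forget it.<|endoftext|>
```

Since the pad token is set to the same end-of-sequence token, the attention mask is what distinguishes padding from real end-of-turn markers.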

### Apply tokenization to the dataset

In [43]:
# apply tokenization to the data set
tokenized_data = data.map(preprocess, batched=False)
tokenized_data.set_format(type='torch', columns=['input_ids', 'attention_mask'])


ValueError: PyTorch needs to be installed to be able to return PyTorch tensors.
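
The error above follows from the earlier PyTorch loading failure. One way to keep the notebook inspectable while the install is repaired is to check whether `torch` imports cleanly and otherwise request NumPy arrays from `set_format`. This is a sketch under that assumption; `usable_tensor_format` is a hypothetical helper, and training would still need a working PyTorch install.

```python
def usable_tensor_format():
    """Return "torch" if PyTorch imports cleanly, otherwise "numpy"."""
    try:
        import torch  # noqa: F401
    except (ImportError, OSError):
        # OSError covers DLL loading failures such as the WinError 127 shown earlier
        return "numpy"
    return "torch"

tensor_format = usable_tensor_format()
print("Tensor format:", tensor_format)

# Intended use with the tokenized dataset from the cell above:
# tokenized_data.set_format(type=tensor_format, columns=["input_ids", "attention_mask"])
```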

# Next Steps

- Preprocess the dialogues

- Tokenize

- Pair conversations into input-output for chatbot modeling

- Use with Seq2Seq models, such as an encoder-decoder, for chatbot training

- Store as CSV with two columns: `input_line`, `target_line`
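
As a starting point for the CSV step, the sketch below writes prompt-response pairs to a two-column CSV using the standard `csv` module. It writes to an in-memory buffer for illustration; `write_pairs_csv` is a hypothetical helper, and the sample pair is taken from the conversation preview shown earlier.

```python
import csv
import io

def write_pairs_csv(pairs, handle):
    """Write prompt-response pairs as rows with input_line and target_line columns."""
    writer = csv.DictWriter(handle, fieldnames=["input_line", "target_line"])
    writer.writeheader()
    for pair in pairs:
        writer.writerow({"input_line": pair["prompt"], "target_line": pair["response"]})

sample_pairs = [{"prompt": "Why?", "response": "Unsolved mystery."}]
buffer = io.StringIO()
write_pairs_csv(sample_pairs, buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` avoids blank rows on Windows and keeps non-ASCII characters intact.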