## Capstone Project Title: Development of a Natural Language Multi-lingual Chatbot for the Kenyan Market

## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview 

This project aims to develop a natural language chatbot capable of generating human-like responses and understanding informal customer feedback expressed in English, Kenyan Swahili and Sheng. Designed for a startup expanding into the Kenyan market, the chatbot will help the company engage users more naturally and analyze feedback from social platforms and online conversations. By training on locally relevant dialogue data, including YouTube comments and Kenyan media, the system is intended to capture the linguistic and cultural nuances often missed by standard models.

## 1.1 Stakeholders

Stakeholders in the technology industry who want to launch a start-up in Kenya.

## 1.2 Problem Statement
Startups entering new markets often struggle to understand customer feedback when it's expressed in local dialects or informal language. In Kenya, much of this communication occurs in Swahili and Sheng, the latter blending local slang, English, and Swahili in a fluid, often unstructured manner. Existing chatbot systems trained on formal English fail to grasp the tone, intent, or meaning behind such messages. This project aims to fill that gap by building a chatbot trained specifically on real-world Kenyan conversations to interpret and respond to customer queries and feedback with local context and relevance.

## 1.3 Objectives

- Collect and preprocess Kenyan user dialogue from YouTube, social media, and local content featuring Swahili and Sheng

- Fine-tune the chatbot with foundational data for conversational structure, while emphasizing local language patterns

- Build a sequence-to-sequence model  capable of handling informal, code-switched dialogue

- Evaluate the chatbot's performance with emphasis on contextual relevance and local understanding

- Present a working prototype that simulates real customer feedback scenarios 

In [3]:
import re, os, csv
import pandas as pd 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# 2.0 Data Scraping and understanding

The Cornell Movie Dialogues Corpus contains fictional conversations from movie scripts. It is made up of several text files, but the two we are using here are:

- `movie_lines.txt` – contains individual lines of dialogue.

- `movie_conversations.txt` – contains sequences of line IDs, showing how those lines form a conversation.
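
Both files separate their fields with the literal string ` +++$+++ `. The sketch below parses one record of each kind to show the field layout the loading code relies on. The two sample strings are illustrative records written in the corpus format rather than read from disk, and `ast.literal_eval` is used here as an assumption about how best to turn the bracketed ID list into a Python list.

```python
import ast

SEP = " +++$+++ "

# Illustrative records in the corpus format (not read from the real files).
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
sample_conversation = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# movie_lines.txt: line ID, character ID, movie ID, character name, text
line_id, character_id, movie_id, character_name, text = sample_line.split(SEP)

# movie_conversations.txt: two character IDs, movie ID, list of line IDs
*_, line_ids_str = sample_conversation.split(SEP)
line_ids = ast.literal_eval(line_ids_str)  # parses the list without executing code

print(line_id, "->", text)
print(line_ids)
```

The fifth field of a line record is the dialogue text, and the fourth field of a conversation record is the ordered list of line IDs that make up one exchange.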

In [5]:
# load the cornell `movie_lines.txt` dataset
# step 1 - Loading and Parsing movie_lines.txt
# GOAL: Create a dict `id2line` that maps a `line ID` to its actual dialogue.
movie_lines_path = r'original-data\movie_lines.txt'
# the dictionary
id2line = {}
# parsing
with open(movie_lines_path, encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.strip().split(" +++$+++ ")
        if len(parts) == 5:
            line_id, _, _, _, text = parts
            id2line[line_id] = text
# step 2 - Load and Parse `movie_conversations.txt`
# GOAL: Create a list of conversations, where each conversation
# is a list of line IDs.
import ast  # literal_eval parses the list literal without executing code

movie_conversations_path = r"original-data\movie_conversations.txt"
# list of conversations
conversations = []
# parsing
with open(movie_conversations_path, encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.strip().split(" +++$+++ ")
        if len(parts) == 4:
            line_ids_str = parts[3]
            # Convert the string "['L194', 'L195', ...]" to an actual list of line IDs
            line_ids = ast.literal_eval(line_ids_str)
            conversations.append(line_ids)
# step 3 - Reconstruct Conversations from Line IDs
# GOAL: Convert line IDs back into actual text using the 
# `id2line` dictionary.
# dictionary container variable
reconstructed_conversations = []
#  parsing
for conv in conversations:
    dialogue = []
    for line_id in conv:
        line_text = id2line.get(line_id, "")
        dialogue.append(line_text)
    reconstructed_conversations.append(dialogue)
# step 4 - View Sample Conversations
# first 3 conversations
for i, convo in enumerate(reconstructed_conversations[:3]):
    print(f"\nConversation {i+1}")
    for turn in convo:
        print(turn)
# Save to Text File
with open("reconstructed_conversations.txt", "w", encoding="utf-8") as f:
    for convo in reconstructed_conversations:
        for turn in convo:
            f.write(turn + "\n")
        f.write("\n" + "-"*40 + "\n")


Conversation 1
Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
Not the hacking and gagging and spitting part.  Please.
Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

Conversation 2
You're asking me out.  That's so cute. What's your name again?
Forget it.

Conversation 3
No, no, it's my fault -- we didn't have a proper introduction ---
Cameron.
The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
Seems like she could get a date easy enough...


### 2.1 Saving the `reconstructed_conversations` file

In [6]:
# Target folder 
output_folder = "chatbot-repo"
os.makedirs(output_folder, exist_ok=True)

# Full path to save the file
output_file_path = os.path.join(output_folder, "reconstructed_conversations.txt")

# Save the reconstructed conversations into the file
with open(output_file_path, "w", encoding="utf-8") as f:
    for convo in reconstructed_conversations:
        for turn in convo:
            f.write(turn + "\n")
        f.write("\n" + "-"*40 + "\n")

# confirmation
## print(f"File saved to: {os.path.abspath(output_file_path)}")
# Confirm that the file exists
if os.path.exists(output_file_path):
    print("✅ Yes, file successfully saved!")
    # print("📁 Location:", os.path.abspath(output_file_path))
else:
    print("❌ Failed to save the file.")


✅ Yes, file successfully saved!


### Summary

- `id2line` has mapped line IDs to actual dialogue text.

- `conversations` then holds the sequences of line IDs per conversation.

- `reconstructed_conversations` gives you the full text of the conversations.

### 2.2 YouTube datasets

In [7]:
yt_file_path_pamela = r'original-data\pamela-youtube_comments.csv'
yt_file_path_brian = r'original-data\brian_youtube_data_comments.csv'

# Pamela's file
yt_1 = pd.read_csv(yt_file_path_pamela)
print("Pamela's file has the shape:", yt_1.shape)

# Brian's file
yt_2 = pd.read_csv(yt_file_path_brian)
print("Brian's file has the shape:", yt_2.shape)
print('\n')

print("Pamela's YouTube File")
print(yt_1.head(), end = '\n\n\n')

print("Brian's YouTube File")
print(yt_2.head(), end = '\n\n\n')

Pamela's file has the shape: (25629, 2)
Brian's file has the shape: (2853, 2)


Pamela's YouTube File
      video_id                                            comment
0  qlZM3McwO1Q  What an incredible victory. I agree the Kenyan...
1  qlZM3McwO1Q                                                  ❤
2  qlZM3McwO1Q  “Claudia is an amazonian goddess with a beauti...
3  qlZM3McwO1Q  Proud of my motherland Kenya ❤❤❤and Africa.at ...
4  qlZM3McwO1Q                                               Damn


Brian's YouTube File
                                         Top Comment Reply
0  Apple missed the boat on AI OR... Apple is doi...   NaN
1  Who added the background music to the video it...   NaN
2                    16:26  FEMI KUTI !!! RAAHHH !!!   NaN
3            The greatest AI scam in history, is AI.   NaN
4  if I only knew Siri is a mess, I would bought ...   NaN




In [8]:
# Cleaning yt_1 file
## i.e dropping first column
yt_1 = yt_1.drop(yt_1.columns[0], axis=1)

# Cleaning yt_2 file
yt_2 = yt_2.drop(yt_2.columns[1], axis=1)

print("Pamela's YouTube File")
print("Now, Pamela's file has the shape:", yt_1.shape)
print(yt_1.head(), end = '\n\n\n')

print("Brian's YouTube File")
print("Now, Brian's file has the shape:", yt_2.shape)
print(yt_2.head(), end = '\n\n\n')


Pamela's YouTube File
Now, Pamela's file has the shape: (25629, 1)
                                             comment
0  What an incredible victory. I agree the Kenyan...
1                                                  ❤
2  “Claudia is an amazonian goddess with a beauti...
3  Proud of my motherland Kenya ❤❤❤and Africa.at ...
4                                               Damn


Brian's YouTube File
Now, Brian's file has the shape: (2853, 1)
                                         Top Comment
0  Apple missed the boat on AI OR... Apple is doi...
1  Who added the background music to the video it...
2                    16:26  FEMI KUTI !!! RAAHHH !!!
3            The greatest AI scam in history, is AI.
4  if I only knew Siri is a mess, I would bought ...




In [9]:
# combining the two together
yt_combined = pd.concat([yt_1, yt_2], axis=0, ignore_index=True)
# Thus...
print("The shape of the combined YouTube data is:", yt_combined.shape)

# confirming successful stacking
# print(yt_1.shape[0] + yt_2.shape[0] == yt_combined.shape[0])

The shape of the combined YouTube data is: (28482, 2)
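
The combined frame has two columns because the two inputs use different column names (`comment` and `Top Comment`): `pd.concat` aligns on column names, so every row is filled with `NaN` in the column it did not come from. A minimal sketch of one way to stack them into a single column, using toy frames in place of `yt_1` and `yt_2`. Renaming to a shared `comment` column is an assumption about the intended result.

```python
import pandas as pd

# Toy stand-ins for yt_1 and yt_2 after the extra columns were dropped.
yt_1_demo = pd.DataFrame({"comment": ["Proud of my motherland Kenya", "Damn"]})
yt_2_demo = pd.DataFrame({"Top Comment": ["The greatest AI scam in history, is AI."]})

# Concatenating as-is keeps both column names and fills the gaps with NaN.
misaligned = pd.concat([yt_1_demo, yt_2_demo], axis=0, ignore_index=True)

# Renaming to a shared column name first yields one column of comments.
aligned = pd.concat(
    [yt_1_demo, yt_2_demo.rename(columns={"Top Comment": "comment"})],
    axis=0,
    ignore_index=True,
)

print(misaligned.shape)
print(aligned.shape)
```

With a single shared column, the saved file contains one comment per row and no empty cells.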


In [10]:
# Folder and file path parameters
folder_path = 'chatbot-repo'
file_path = os.path.join(folder_path, 'yt_combined.txt')

# Save the DataFrame to a txt file
yt_combined.to_csv(file_path, sep='\t', index=False)

# Confirmation message
print(f"Success, `yt_combined` has been saved to {folder_path}.")

Success, `yt_combined` has been saved to chatbot-repo.


In [11]:
# File names
file1 = r'chatbot-repo\reconstructed_conversations.txt' # Cornell-uni
file2 = r'chatbot-repo\yt_combined.txt' # youtube-comments
nlp_data = r'chatbot-repo\nlp_data.txt'  # Output file is literally named "nlp_data.txt"

# Read both files
with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
    content1 = f1.read()
    content2 = f2.read()

# Combine contents
combined_data = content1 + "\n" + content2

# Write to output file
with open(nlp_data, "w", encoding="utf-8") as out:
    out.write(combined_data)

print(f"Files combined and saved to {nlp_data}")


Files combined and saved to chatbot-repo\nlp_data.txt


# 3.0 Data Cleaning

In [12]:
# preprocessing
# Function to clean the data
def clean_text(text):

    if pd.isnull(text):
        return ""
    
    text = text.lower()                          # Lowercase all text
    text = re.sub(r"http\S+|www\S+", "", text)   # Remove URLs
    text = re.sub(r"@\w+", "", text)             # Remove mentions
    text = re.sub(r"#\w+", "", text)             # Remove hashtags
    text = re.sub(r"[^\w\s]", "", text)          # Remove punctuation
    text = re.sub(r"\d+", "", text)              # Remove digits
    text = re.sub(r"\s+", " ", text).strip()     # Remove extra whitespace
    
    return text

In [13]:
# Clean a whole text file with `clean_text` and write the result next to it
def clean_text_file(input_file, output_file="cleaned_output.txt"):
    # Read the input file
    with open(input_file, 'r', encoding='utf-8') as f:
        original_text = f.read()

    # Reuse the cleaning steps defined in `clean_text` above.
    # Note: the steps run on the whole file at once, so line breaks and the
    # "-" * 40 conversation separators are removed along with other punctuation.
    cleaned_text = clean_text(original_text)

    # Write the cleaned text to a new file in the same directory
    input_dir = os.path.dirname(input_file)
    output_path = os.path.join(input_dir, output_file)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(cleaned_text)

    print(f"Cleaned text written to: {output_path}")
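
Cleaning the whole file as one string collapses every line break and strips the dashed separators, so the conversation boundaries are lost in the cleaned output. The sketch below shows one possible alternative: apply the same regex steps line by line and pass separator lines through untouched. `clean_line` and `clean_lines` are hypothetical helper names, and the sample text is made up for illustration.

```python
import re

SEPARATOR = "-" * 40  # same separator written between conversations

def clean_line(line):
    """Apply the same cleaning steps as clean_text to a single line."""
    line = line.lower()
    line = re.sub(r"http\S+|www\S+", "", line)  # remove URLs
    line = re.sub(r"@\w+", "", line)            # remove mentions
    line = re.sub(r"#\w+", "", line)            # remove hashtags
    line = re.sub(r"[^\w\s]", "", line)         # remove punctuation
    line = re.sub(r"\d+", "", line)             # remove digits
    return re.sub(r"\s+", " ", line).strip()    # collapse whitespace

def clean_lines(raw_text):
    """Clean line by line, keeping separators and dropping lines left empty."""
    cleaned = []
    for line in raw_text.splitlines():
        if line.strip() == SEPARATOR:
            cleaned.append(SEPARATOR)
            continue
        line = clean_line(line)
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)

sample = "Forget it.\n\n" + SEPARATOR + "\nNiaje @msee, check www.example.com leo 2025!\n"
print(clean_lines(sample))
```

Keeping one turn per line means the cleaned file can still be split into conversations and prompt-response pairs.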


# 4.0 Data Preprocessing

In [14]:
# cleaning the combined nlp_data file; the output is written to chatbot-repo

nlp_data_file = r'chatbot-repo\nlp_data.txt'

clean_text_file(nlp_data_file)

Cleaned text written to: chatbot-repo\cleaned_output.txt


In [15]:
nlp_data_file

'chatbot-repo\\nlp_data.txt'

In [19]:
print(nlp_data_file)

chatbot-repo\nlp_data.txt


In [16]:
# viewing the data 
file_path = 'chatbot-repo\\nlp_data.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    for i in range(20):  # Change 20 to however many lines you want to preview
        print(file.readline().strip())

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
Not the hacking and gagging and spitting part.  Please.
Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

----------------------------------------
You're asking me out.  That's so cute. What's your name again?
Forget it.

----------------------------------------
No, no, it's my fault -- we didn't have a proper introduction ---
Cameron.
The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
Seems like she could get a date easy enough...

----------------------------------------
Why?
Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.
That's a shame.



# 5.0 Baseline Model - TF-IDF + Cosine Similarity

Converting file into prompt-response pairs:

In [1]:
file_path = 'chatbot-repo\\nlp_data.txt'

# Load and split into blocks
with open(file_path, 'r', encoding='utf-8') as file:
    content = file.read()

# Split the content by scene/block separator
blocks = content.split('----------------------------------------')

prompt_response_pairs = []

for block in blocks:
    # Clean and split into lines
    lines = [line.strip() for line in block.strip().split('\n') if line.strip()]
    
    # Pair consecutive lines as prompt-response
    for i in range(len(lines) - 1):
        prompt = lines[i]
        response = lines[i + 1]
        prompt_response_pairs.append((prompt, response))

# Preview a few pairs
for i, (prompt, response) in enumerate(prompt_response_pairs[:10]):
    print(f"Prompt: {prompt}\nResponse: {response}\n")

Prompt: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Response: Well, I thought we'd start with pronunciation, if that's okay with you.

Prompt: Well, I thought we'd start with pronunciation, if that's okay with you.
Response: Not the hacking and gagging and spitting part.  Please.

Prompt: Not the hacking and gagging and spitting part.  Please.
Response: Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

Prompt: You're asking me out.  That's so cute. What's your name again?
Response: Forget it.

Prompt: No, no, it's my fault -- we didn't have a proper introduction ---
Response: Cameron.

Prompt: Cameron.
Response: The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.

Prompt: The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.
Resp

Vectorizing all prompts

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming prompt_response_pairs has already been extracted
prompts = [pair[0] for pair in prompt_response_pairs]

Instantiate model

In [5]:
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the prompts
tfidf_matrix = vectorizer.fit_transform(prompts)

# Output shape: (n_prompts, n_features)
print(f"TF-IDF Matrix shape: {tfidf_matrix.shape}")

TF-IDF Matrix shape: (254094, 58575)


In [6]:
print(tfidf_matrix.shape)

(254094, 58575)


Compute the cosine similarity between the user's TF-IDF vector and every stored prompt, then return the response paired with the best-matching prompt. If the best score falls below `threshold`, return a fallback message instead.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

def get_best_response(user_input, vectorizer, tfidf_matrix, prompt_response_pairs, threshold=0.2):
    
    # Vectorize the user input
    user_vec = vectorizer.transform([user_input])

    # Compute cosine similarity with all prompts
    similarity_scores = cosine_similarity(user_vec, tfidf_matrix)

    # Get the index of the best-matching prompt
    best_index = similarity_scores.argmax()
    best_score = similarity_scores[0, best_index]

    if best_score < threshold:
        return "Sorry, I don't understand."

    # Return the corresponding response
    return prompt_response_pairs[best_index][1]

Example

In [13]:
# Assume prompt_response_pairs = [(prompt1, response1), (prompt2, response2), ...]
user_input = "Why can't she go out with someone?"
response = get_best_response(user_input, vectorizer, tfidf_matrix, prompt_response_pairs)

print("Bot:", response)

Bot: I don't know...  One of us has to be here in case Arbogast's on the way.


In [12]:
# Assume prompt_response_pairs = [(prompt1, response1), (prompt2, response2), ...]
user_input = "What is your name?"
response = get_best_response(user_input, vectorizer, tfidf_matrix, prompt_response_pairs)

print("Bot:", response)

Bot: Berger, Norwegian, and at your service, sir.


In [11]:
# Assume prompt_response_pairs = [(prompt1, response1), (prompt2, response2), ...]
user_input = "How can i be of help today?"
response = get_best_response(user_input, vectorizer, tfidf_matrix, prompt_response_pairs)

print("Bot:", response)

Bot: Paul, can you hand me the olives? Ruth, I need you to, what was it?
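
The replies above are only loosely related to the inputs, and it can help to see which stored prompt was matched and how strong the match was. The sketch below is a hypothetical `top_matches` helper, run on three toy pairs rather than the full corpus, that returns the k best prompts with their cosine scores. The pair texts and the query are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy prompt-response pairs standing in for prompt_response_pairs.
demo_pairs = [
    ("What is your name?", "Cameron."),
    ("Can we make this quick?", "Well, I thought we'd start with pronunciation."),
    ("Where are you going?", "Forget it."),
]

demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform([prompt for prompt, _ in demo_pairs])

def top_matches(user_input, vectorizer, tfidf_matrix, pairs, k=2):
    """Return the k most similar stored prompts as (score, prompt, response)."""
    user_vec = vectorizer.transform([user_input])
    scores = cosine_similarity(user_vec, tfidf_matrix).ravel()
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(float(scores[i]), pairs[i][0], pairs[i][1]) for i in best]

matches = top_matches("Tell me your name", demo_vectorizer, demo_matrix, demo_pairs)
for score, prompt, response in matches:
    print(f"{score:.2f} | {prompt} -> {response}")
```

Inspecting the matched prompt and score for a handful of queries is one way to choose a sensible `threshold`.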


Optional Optimization (for very large datasets)

- Use `sklearn.metrics.pairwise.linear_kernel` instead of `cosine_similarity` for a faster dot-product similarity (same result for L2-normalized vectors, which is the `TfidfVectorizer` default).

- Index the TF-IDF matrix with FAISS or Annoy for approximate nearest neighbors if performance becomes an issue.

In [14]:
from sklearn.metrics.pairwise import linear_kernel

def get_best_response_fast(user_input, vectorizer, tfidf_matrix, prompt_response_pairs, threshold=0.2):
    user_vec = vectorizer.transform([user_input])
    cosine_similarities = linear_kernel(user_vec, tfidf_matrix).flatten()

    best_index = cosine_similarities.argmax()
    best_score = cosine_similarities[best_index]

    if best_score < threshold:
        return "Sorry, I don't understand."

    return prompt_response_pairs[best_index][1]
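
The notebook's first cell imports `NearestNeighbors`, which offers another way to run the same lookup. The sketch below fits a brute-force cosine index on toy data and queries it. Note that this search is exact, not approximate like FAISS or Annoy, and the function name `get_best_response_nn` and the toy pairs are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy prompt-response pairs standing in for prompt_response_pairs.
demo_pairs = [
    ("What is your name?", "Cameron."),
    ("Can we make this quick?", "Well, I thought we'd start with pronunciation."),
    ("Where are you going?", "Forget it."),
]

demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform([prompt for prompt, _ in demo_pairs])

# Brute-force cosine search works directly on the sparse TF-IDF matrix.
index = NearestNeighbors(n_neighbors=1, metric="cosine", algorithm="brute")
index.fit(demo_matrix)

def get_best_response_nn(user_input, vectorizer, index, pairs, threshold=0.2):
    """Look up the closest stored prompt and return its paired response."""
    user_vec = vectorizer.transform([user_input])
    distances, indices = index.kneighbors(user_vec)
    best_score = 1.0 - distances[0, 0]  # cosine distance = 1 - cosine similarity
    if best_score < threshold:
        return "Sorry, I don't understand."
    return pairs[indices[0, 0]][1]

print(get_best_response_nn("Tell me your name", demo_vectorizer, index, demo_pairs))
```

Fitting the index once and reusing it keeps the query function free of the TF-IDF matrix argument.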

Tips:

- `threshold=0.2` avoids returning weak matches; tune it based on your dataset.

- Strip punctuation and lowercase your prompts and user input for better matching.

Saving the model

In [15]:
! pip install joblib



Save the TF-IDF vectorizer, matrix, and data

In [16]:
import joblib

# Save vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

# Save TF-IDF matrix (can be large; joblib.dump only compresses when `compress` is set)
joblib.dump(tfidf_matrix, 'tfidf_matrix.pkl')

# Save prompt-response pairs (list of tuples)
joblib.dump(prompt_response_pairs, 'prompt_response_pairs.pkl')

['prompt_response_pairs.pkl']

Organize the files in a folder called model/
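
A minimal sketch of writing the three artifacts straight into a `model/` folder and loading them back, using a toy vectorizer and toy pairs in place of the fitted objects. The file names follow the ones used above, while the `artifacts` dictionary and the `demo_*` names are assumptions for illustration.

```python
import os

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy objects standing in for vectorizer, tfidf_matrix and prompt_response_pairs.
demo_pairs = [
    ("What is your name?", "Cameron."),
    ("Where are you going?", "Forget it."),
]
demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform([prompt for prompt, _ in demo_pairs])

model_dir = "model"
os.makedirs(model_dir, exist_ok=True)

artifacts = {
    "tfidf_vectorizer.pkl": demo_vectorizer,
    "tfidf_matrix.pkl": demo_matrix,
    "prompt_response_pairs.pkl": demo_pairs,
}
for filename, obj in artifacts.items():
    joblib.dump(obj, os.path.join(model_dir, filename))

# Load everything back, as a serving script might do at start-up.
loaded_vectorizer = joblib.load(os.path.join(model_dir, "tfidf_vectorizer.pkl"))
loaded_matrix = joblib.load(os.path.join(model_dir, "tfidf_matrix.pkl"))
loaded_pairs = joblib.load(os.path.join(model_dir, "prompt_response_pairs.pkl"))

print(loaded_matrix.shape, len(loaded_pairs))
```

The vectorizer, matrix and pairs should be saved and loaded together, since the matrix columns only make sense with the vocabulary of the vectorizer that produced them.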