## Capstone Project Title: Development of a Natural Language Multilingual Chatbot for the Kenyan Market

## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview 

This project aims to develop a natural language chatbot capable of generating human-like responses and understanding informal customer feedback expressed in English, Kenyan Swahili, and Sheng. Designed for a startup expanding into the Kenyan market, the chatbot will help the company engage users more naturally and analyze feedback from social platforms and online conversations. By training on locally relevant dialogue data, including YouTube comments and Kenyan media, the system is intended to capture the linguistic and cultural nuances often missed by standard models.


## 1.1 Problem Statement
Startups entering new markets often struggle to understand customer feedback when it is expressed in local dialects or informal language. In Kenya, much of this communication occurs in Swahili and Sheng; Sheng in particular blends local slang, English, and Swahili in a fluid, often unstructured manner. Existing chatbot systems trained on formal English fail to grasp the tone, intent, or meaning behind such messages. This project aims to fill that gap by building a chatbot trained specifically on real-world Kenyan conversations, so that it can interpret and respond to customer queries and feedback with local context and relevance.

## 1.2 Objectives

- Collect and preprocess Kenyan user dialogue from YouTube, social media, and local content featuring Swahili and Sheng

- Fine-tune the chatbot with foundational data for conversational structure, while emphasizing local language patterns

- Build a sequence-to-sequence model  capable of handling informal, code-switched dialogue

- Evaluate the chatbot’s performance with emphasis on contextual relevance and local understanding

- Present a working prototype that simulates real customer feedback scenarios 

## 2.0 Data Understanding 

## 2.1 Data Sources
This project draws on two sources:

- The Cornell Movie-Dialogs Corpus
- YouTube comments from local Kenyan content

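## 2.2 Loading the datasets
The Cornell corpus is read from two files, movie_lines.txt and movie_conversations.txt. Both files are read with encoding='ISO-8859-1', and their fields are separated by the literal string ' +++$+++ '.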

In [None]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import random
import ast
import re
import matplotlib.pyplot as plt
import seaborn as sns
import string

# Read the lines
lines = pd.read_csv('movie_lines.txt', sep=r' \+\+\+\$\+\+\+ ', engine='python', header=None,
                    names=['lineID', 'characterID', 'movieID', 'character', 'text'],  encoding='ISO-8859-1')

# Read the conversations
conversations = pd.read_csv('movie_conversations.txt', sep=r' \+\+\+\$\+\+\+ ', engine='python', header=None,
                            names=['char1', 'char2', 'movieID', 'utteranceIDs'],  encoding='ISO-8859-1')


In [None]:
# Load scraped YouTube comments
yt_comments_one= pd.read_csv('youtube_comments.csv')

# Rename if needed
yt_comments_one.rename(columns={'comment': 'text'}, inplace=True)

#preview the dataset
yt_comments_one.head()


Unnamed: 0,video_id,text
0,qlZM3McwO1Q,What an incredible victory. I agree the Kenyan...
1,qlZM3McwO1Q,❤
2,qlZM3McwO1Q,“Claudia is an amazonian goddess with a beauti...
3,qlZM3McwO1Q,Proud of my motherland Kenya ❤❤❤and Africa.at ...
4,qlZM3McwO1Q,Damn
...,...,...
25624,q4sWUJxjc4g,Welcome home❤️
25625,q4sWUJxjc4g,"Uweeeeeh... All the best, lakini...😒"
25626,q4sWUJxjc4g,🥵woii finally our girl Lynn is here to ask goo...
25627,q4sWUJxjc4g,As usual 🦻


In [3]:
# Load scraped YouTube comments
yt_comments_two= pd.read_csv('data-from-youtube.csv')

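# Rename if needed
# Note: this file's columns are 'Top Comment' and 'Reply' (see the preview below),
# so there is no 'comment' column and this rename has no effect
yt_comments_two.rename(columns={'comment': 'text'}, inplace=True)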

#preview the dataset
yt_comments_two

Unnamed: 0,Top Comment,Reply
0,Apple missed the boat on AI OR... Apple is doi...,
1,Who added the background music to the video it...,
2,16:26 FEMI KUTI !!! RAAHHH !!!,
3,"The greatest AI scam in history, is AI.",
4,"if I only knew Siri is a mess, I would bought ...",
...,...,...
2848,Still waiting for Apple Intelligence to be mor...,
2849,Today's apple is full of failures.\r\nI'm swit...,Yes! It was once a superior design product but...
2850,Finally,
2851,first comment ðŸŽ‰,"Wow, . . . You Were First To Attain OMEGA Stat..."


In [4]:
yt_comments_one.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25629 entries, 0 to 25628
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   video_id  25629 non-null  object
 1   text      25628 non-null  object
dtypes: object(2)
memory usage: 400.6+ KB


In [5]:
yt_comments_two.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2853 entries, 0 to 2852
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Top Comment  2850 non-null   object
 1   Reply        467 non-null    object
dtypes: object(2)
memory usage: 44.7+ KB


In [6]:
lines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304713 entries, 0 to 304712
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   lineID       304713 non-null  object
 1   characterID  304713 non-null  object
 2   movieID      304713 non-null  object
 3   character    304670 non-null  object
 4   text         304446 non-null  object
dtypes: object(5)
memory usage: 11.6+ MB


In [7]:
conversations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83097 entries, 0 to 83096
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   char1         83097 non-null  object
 1   char2         83097 non-null  object
 2   movieID       83097 non-null  object
 3   utteranceIDs  83097 non-null  object
dtypes: object(4)
memory usage: 2.5+ MB
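
The `info()` summaries above show a few missing values in the text columns (for example, 304446 non-null out of 304713 entries in `lines['text']`). The block below is a minimal sketch, on a toy DataFrame with hypothetical values, of how such rows could be counted and dropped before sampling. The names `demo`, `missing_per_column`, and `demo_clean` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a table with a text column (values are hypothetical)
demo = pd.DataFrame({
    'lineID': ['L1', 'L2', 'L3'],
    'text': ['hello', np.nan, 'sasa'],
})

# Count missing values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop rows whose text is missing and rebuild a clean 0..n-1 index
demo_clean = demo.dropna(subset=['text']).reset_index(drop=True)
print(demo_clean)
```

Whether to drop these rows or keep them is a judgment call; the `clean_text` function defined later already maps missing values to an empty string.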


Because the datasets are large, each one is sampled so that the merged dataset stays manageable.

In [8]:
# Convert the string representation of lists to actual Python lists
conversations['utteranceIDs'] = conversations['utteranceIDs'].apply(ast.literal_eval)

# Preview
print(conversations.head())


  char1 char2 movieID              utteranceIDs
0    u0    u2      m0  [L194, L195, L196, L197]
1    u0    u2      m0              [L198, L199]
2    u0    u2      m0  [L200, L201, L202, L203]
3    u0    u2      m0        [L204, L205, L206]
4    u0    u2      m0              [L207, L208]


Sample 10,000 lines from the movie_lines dataset, 5,000 from yt_comments_one, and 2,500 from yt_comments_two.

In [9]:
# Shuffle the data and select 5,000 comments
sampled_yt1 = yt_comments_one.sample(n=5000, random_state=42).reset_index(drop=True)

# Preview
print(sampled_yt1.head())

      video_id                                               text
0  _b6D5wMzKZQ  God is hearing your cries.  You will overcome....
1  q4sWUJxjc4g           I feel for Sarah . This is so annoying……
2  _b6D5wMzKZQ  I think he suffocated the child till he had no...
3  _b6D5wMzKZQ  It is so emotional 😭😭😭😭..may the lord hear her...
4  _b6D5wMzKZQ                  This is too painful to be serious


In [10]:
# Shuffle the data and select 2,500 comments
sampled_yt2= yt_comments_two.sample(n=2500, random_state=42).reset_index(drop=True)

# Preview
print(sampled_yt2.head())

                                         Top Comment  \
0  The only thing Apple has been successful at is...   
1  looks like its tim cooks inability to manage A...   
2  The irony of using generative AI on your visua...   
3  I set the charging to 80% and in some cases, i...   
4  No offence, but this is clickbait shite. All s...   

                                               Reply  
0                     that is why they are brilliant  
1                                                NaN  
2  The #1 easiest way to lose respect as a creato...  
3                                                NaN  
4                                                NaN  


In [11]:
# Shuffle the data and select 10,000 lines
sampled_lines = lines.sample(n=10000, random_state=42).reset_index(drop=True)

# Preview
print(sampled_lines.head())


    lineID characterID movieID character  \
0  L350675       u1846    m121     NICKY   
1   L69531       u3916    m259       MAX   
2  L101238       u4195    m280  CINNABAR   
3  L359643       u6415    m427     SIDRA   
4  L155329       u4758    m316      REED   

                                                text  
0                              My mom wasn't a goat?  
1  I must protect my interests, Ms. Kyle.  And In...  
2  You said bad things hurt places.  So maybe goo...  
3                            She hates all freshmen.  
4                             What are you gonna do?  


Clean and Normalize Text

Preprocess Text Function

In [12]:
def clean_text(text):
    if pd.isnull(text):
        return ""
    
    text = text.lower()                          # Lowercase all text
    text = re.sub(r"http\S+|www\S+", "", text)   # Remove URLs
    text = re.sub(r"@\w+", "", text)             # Remove mentions
    text = re.sub(r"#\w+", "", text)             # Remove hashtags
    text = re.sub(r"[^\w\s]", "", text)          # Remove punctuation and symbols (incl. emoji)
    text = re.sub(r"\d+", "", text)              # Remove digits
    text = re.sub(r"\s+", " ", text).strip()     # Remove extra whitespace
    return text
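
To make the effect of each cleaning step concrete, the sketch below repeats the function so the block runs on its own and applies it to three inputs. The first string is taken from the YouTube preview earlier in the notebook; the second is a made-up example with a URL, mention, hashtag, and digits. Note that the punctuation step also removes emoji, which may or may not be desirable for informal feedback.

```python
import re
import pandas as pd

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so this block is self-contained
    if pd.isnull(text):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)             # mentions
    text = re.sub(r"#\w+", "", text)             # hashtags
    text = re.sub(r"[^\w\s]", "", text)          # punctuation and symbols, including emoji
    text = re.sub(r"\d+", "", text)              # digits
    text = re.sub(r"\s+", " ", text).strip()     # extra whitespace
    return text

cleaned_comment = clean_text("Uweeeeeh... All the best, lakini...😒")
cleaned_noisy = clean_text("Check www.example.com @user #Kenya 2024!")
cleaned_missing = clean_text(None)

print(repr(cleaned_comment))  # 'uweeeeeh all the best lakini'
print(repr(cleaned_noisy))    # 'check'
print(repr(cleaned_missing))  # ''
```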


In [13]:
sampled_lines['clean_text'] = sampled_lines['text'].apply(clean_text)
sampled_yt1['clean_text'] = sampled_yt1['text'].apply(clean_text)
sampled_yt2['clean_text'] = sampled_yt2['Top Comment'].apply(clean_text)

Link Lines to Conversations
We want to expand the utteranceIDs list so we can join each ID to its actual text from movie_lines.

In [14]:
#flatten conversation utterance IDs into pairs of (conversationID, lineID)
conversation_data = []

for idx, row in conversations.iterrows():
    for line_id in row['utteranceIDs']:
        conversation_data.append({
            'conversationID': idx,  # Assign index as a unique conversation ID
            'lineID': line_id
        })

#convert to DataFrame
conv_expanded = pd.DataFrame(conversation_data)

#preview
print(conv_expanded.head())


   conversationID lineID
0               0   L194
1               0   L195
2               0   L196
3               0   L197
4               1   L198
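
The same flattening could also be written without an explicit loop. The sketch below uses `Series.explode` on a small toy table (the `_demo` names and their contents are illustrative stand-ins, not the notebook's data) and should produce the same two columns as the loop above.

```python
import pandas as pd

# Toy stand-in for `conversations` after ast.literal_eval has been applied
conversations_demo = pd.DataFrame({
    'char1': ['u0', 'u0'],
    'char2': ['u2', 'u2'],
    'movieID': ['m0', 'm0'],
    'utteranceIDs': [['L194', 'L195', 'L196', 'L197'], ['L198', 'L199']],
})

conv_expanded_demo = (
    conversations_demo['utteranceIDs']
    .explode()                        # one row per utterance ID, index repeated
    .rename('lineID')                 # name the values column
    .rename_axis('conversationID')    # the original row index becomes the conversation ID
    .reset_index()
)

print(conv_expanded_demo)
```

On the full table this is likely faster than `iterrows`, though the loop is easier to read and the difference only matters at scale.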


Merge Expanded Conversations with Sampled Lines

This shows which lines from the conversations appear in our 10,000-line sample.

In [15]:
#Merge sampled lines with conversation data
merged = pd.merge(conv_expanded, sampled_lines, on='lineID')

#Preview merged data
print(merged.head())


   conversationID lineID characterID movieID character  \
0               5   L275          u0      m0    BIANCA   
1              19   L862          u0      m0    BIANCA   
2              37   L759          u4      m0      JOEY   
3              44   L543          u5      m0       KAT   
4              48   L898          u0      m0    BIANCA   

                                            text  \
0                                 Forget French.   
1                    do you listen to this crap?   
2  Listen, I want to talk to you about the prom.   
3                                 Can we go now?   
4                              But you hate Joey   

                                    clean_text  
0                                forget french  
1                   do you listen to this crap  
2  listen i want to talk to you about the prom  
3                                can we go now  
4                            but you hate joey  
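
The default inner merge keeps only the conversation lines that were sampled. If it is useful to see how much of each conversation survived sampling (for instance, when consecutive lines are needed later), a left merge with `indicator=True` is one option. The sketch below uses toy tables with hypothetical values; the `_demo` names are illustrative.

```python
import pandas as pd

# Toy stand-ins for conv_expanded and sampled_lines (hypothetical values)
conv_expanded_demo = pd.DataFrame({
    'conversationID': [0, 0, 1, 1],
    'lineID': ['L1', 'L2', 'L3', 'L4'],
})
sampled_lines_demo = pd.DataFrame({
    'lineID': ['L2', 'L3'],
    'text': ['sasa', 'niko poa'],
})

# A left merge keeps every conversation line; `_merge` records whether it was in the sample
coverage = pd.merge(conv_expanded_demo, sampled_lines_demo,
                    on='lineID', how='left', indicator=True)

# Number of sampled lines available in each conversation
sampled_per_conv = (coverage['_merge'] == 'both').groupby(coverage['conversationID']).sum()
print(sampled_per_conv)
```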


In [16]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   conversationID  10000 non-null  int64 
 1   lineID          10000 non-null  object
 2   characterID     10000 non-null  object
 3   movieID         10000 non-null  object
 4   character       9999 non-null   object
 5   text            9992 non-null   object
 6   clean_text      10000 non-null  object
dtypes: int64(1), object(6)
memory usage: 625.0+ KB


Combine all the datasets into a single table

In [None]:
import pandas as pd

cornell = merged[['text']].copy()
cornell['response'] = None
cornell['source'] = 'cornell'


# Select and rename columns for consistency
yt1 = sampled_yt1[['text']].copy()
yt1['response'] = None
yt1['source'] = 'youtube_one'

# Rename 'Top Comment' to 'text' so these rows land in the shared text column
yt2 = sampled_yt2[['Top Comment']].rename(columns={'Top Comment': 'text'})
yt2['response'] = None
yt2['source'] = 'youtube_two'

final_chatbot_data = pd.concat([cornell, yt1, yt2], ignore_index=True)

print(final_chatbot_data.head())
print(final_chatbot_data['source'].value_counts())



                                            text response   source
0                                 Forget French.     None  cornell
1                    do you listen to this crap?     None  cornell
2  Listen, I want to talk to you about the prom.     None  cornell
3                                 Can we go now?     None  cornell
4                              But you hate Joey     None  cornell
cornell        10000
youtube_one     5000
youtube_two     2500
Name: source, dtype: int64
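
The `response` column is still empty at this point. One possible way to fill it for the Cornell rows, sketched below on toy data, is to treat each pair of consecutive utterances in a conversation as a prompt and its response. This is an assumption about how pairs might be built, not a step the notebook performs; the `_demo` tables, `id_to_text`, and `pairs_df` are hypothetical names, and pairs are skipped when either line is unavailable.

```python
import pandas as pd

# Toy stand-ins (hypothetical values); L5 is deliberately absent from the lines table
lines_demo = pd.DataFrame({
    'lineID': ['L1', 'L2', 'L3', 'L4'],
    'text': ['niaje', 'poa sana', 'uko wapi', 'hello'],
})
conversations_demo = pd.DataFrame({
    'utteranceIDs': [['L1', 'L2', 'L3'], ['L4', 'L5']],
})

# Lookup from line ID to its text
id_to_text = dict(zip(lines_demo['lineID'], lines_demo['text']))

pairs = []
for ids in conversations_demo['utteranceIDs']:
    # zip(ids, ids[1:]) walks consecutive (prompt, response) IDs within one conversation
    for prompt_id, response_id in zip(ids, ids[1:]):
        if prompt_id in id_to_text and response_id in id_to_text:
            pairs.append({
                'text': id_to_text[prompt_id],
                'response': id_to_text[response_id],
                'source': 'cornell',
            })

pairs_df = pd.DataFrame(pairs)
print(pairs_df)
```

Because the notebook samples individual lines at random, few conversations are likely to keep two consecutive lines; building pairs from the full `lines` table before sampling may work better.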
