## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview 

This project aims to develop a natural language chatbot capable of generating human-like responses and understanding informal customer feedback expressed in English, Kenyan Swahili and Sheng. Designed for a startup expanding into the Kenyan market, the chatbot will help the company engage users more naturally and analyze feedback from social platforms and online conversations. By training on locally relevant dialogue data  including YouTube comments and Kenyan media the system will capture the linguistic and cultural nuances often missed by standard models.



## 1.1 Problem Statement
Startups entering new markets often struggle to understand customer feedback when it's expressed in local dialects or informal language. In Kenya, much of this communication occurs in Swahili and Sheng, which combine local slang, English, and Swahili in a fluid, often unstructured manner. Existing chatbot systems trained on formal English fail to grasp the tone, intent, or meaning behind such messages. This project aims to fill that gap by building a chatbot trained specifically on real-world Kenyan conversations to interpret and respond to customer queries and feedback with local context and relevance.

## 1.2 Objectives

- Collect and preprocess Kenyan user dialogue from YouTube, social media, and local content featuring Swahili and Sheng

- Fine-tune the chatbot with foundational data for conversational structure, while emphasizing local language patterns

- Build a sequence-to-sequence model  capable of handling informal, code-switched dialogue

- Evaluate the chatbot’s performance with emphasis on contextual relevance and local understanding

- Present a working prototype that simulates real customer feedback scenarios 

## 2. Youtube Data Scrapping 

In [1]:
! pip install google-api-python-client



In [12]:
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import csv

# Replace with your YouTube API key
API_KEY = 'AIzaSyAL-zoRJClDhT9CszYblZbf7CdmAn3NJxI'

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=API_KEY)

In [13]:
def get_comments_with_replies(video_id):
    comments = []
    
    try:
        request = youtube.commentThreads().list(
            part="snippet,replies",
            videoId=video_id,
            maxResults=100,
            textFormat="plainText"
        )
        response = request.execute()
        
        while request:
            for item in response.get("items", []):
                top_comment = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                comments.append({"video_id": video_id, "comment": top_comment})
                
                # Add replies if any
                replies = item.get("replies", {}).get("comments", [])
                for reply in replies:
                    reply_text = reply["snippet"]["textDisplay"]
                    comments.append({"video_id": video_id, "comment": reply_text})
            
            # Pagination
            if "nextPageToken" in response:
                request = youtube.commentThreads().list(
                    part="snippet,replies",
                    videoId=video_id,
                    maxResults=100,
                    pageToken=response["nextPageToken"],
                    textFormat="plainText"
                )
                response = request.execute()
            else:
                break
                
    except HttpError as e:
        print(f"Failed to fetch comments for video {video_id}: {e}")
    
    return comments

# List of video IDs
video_ids = [
    'qlZM3McwO1Q',
    '-voTKRBOEd0',
    '7JwKc6r5fAQ',
    '_b6D5wMzKZQ',
    'IB12MAwLs58',
    'QzIkndzWYU4',
    'P0cwqhA-YCk',
    'q4sWUJxjc4g'
]

# Collect comments for all videos
all_comments = []

for vid in video_ids:
    print(f"Fetching comments for video: {vid}")
    video_comments = get_comments_with_replies(vid)
    all_comments.extend(video_comments)

print(f"\nTotal comments collected: {len(all_comments)}")

Fetching comments for video: qlZM3McwO1Q
Fetching comments for video: -voTKRBOEd0
Fetching comments for video: 7JwKc6r5fAQ
Fetching comments for video: _b6D5wMzKZQ
Fetching comments for video: IB12MAwLs58
Fetching comments for video: QzIkndzWYU4
Fetching comments for video: P0cwqhA-YCk
Fetching comments for video: q4sWUJxjc4g

Total comments collected: 25629


In [14]:
# ✅ Export comments to a CSV file
csv_filename = "youtube_comments.csv"

if all_comments and isinstance(all_comments[0], dict):
    with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=["video_id", "comment"])
        writer.writeheader()
        writer.writerows(all_comments)

    print(f"✅ Comments exported to '{csv_filename}' successfully.")
else:
    print("⚠️ No structured comment data available to export.")

✅ Comments exported to 'youtube_comments.csv' successfully.


In [None]:
# import the pandas library
import pandas as pd

In [None]:
# importing the data
df = pd.read_csv('youtube_comments.csv')

In [32]:
# top rows
df.head(10)

Unnamed: 0,col1,col2,col3,col4,col5,col6
0,m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']"
1,m1,1492: conquest of paradise,1992,6.2,10421,"['adventure', 'biography', 'drama', 'history']"
2,m2,15 minutes,2001,6.1,25854,"['action', 'crime', 'drama', 'thriller']"
3,m3,2001: a space odyssey,1968,8.4,163227,"['adventure', 'mystery', 'sci-fi']"
4,m4,48 hrs.,1982,6.9,22289,"['action', 'comedy', 'crime', 'drama', 'thrill..."
5,m5,the fifth element,1997,7.5,133756,"['action', 'adventure', 'romance', 'sci-fi', '..."
6,m6,8mm,1999,6.3,48212,"['crime', 'mystery', 'thriller']"
7,m7,a nightmare on elm street 4: the dream master,1988,5.2,13590,"['fantasy', 'horror', 'thriller']"
8,m8,a nightmare on elm street: the dream child,1989,4.7,11092,"['fantasy', 'horror', 'thriller']"
9,m9,the atomic submarine,1959,4.9,513,"['sci-fi', 'thriller']"


In [None]:
# information of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25629 entries, 0 to 25628
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   video_id  25629 non-null  object
 1   comment   25628 non-null  object
dtypes: object(2)
memory usage: 400.6+ KB


## Overall EDA

In [3]:
# importing relevant notebooks 
import pandas as pd 
import numpy as np
import random
import ast
import re
import matplotlib.pyplot as plt
import seaborn as sns
import string


In [None]:
#Reading movie_lines data 

with open('movie_lines.txt', encoding='utf-8') as f:
    for _ in range(5):
        print(f.readline())


L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!

L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.

L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?

L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.



In [65]:
with open("youtube_comments.csv", encoding="utf-8") as f:
    for _ in range(5):
        print(f.readline())


video_id,comment

qlZM3McwO1Q,What an incredible victory. I agree the Kenyans should have been celebrated at the end. This was an incredible performance.

qlZM3McwO1Q,❤

qlZM3McwO1Q,“Claudia is an amazonian goddess with a beautiful clam!” - Bruce Wayne

qlZM3McwO1Q,Proud of my motherland Kenya ❤❤❤and Africa.at large



In [66]:
with open ("data-from-youtube.csv", encoding ="utf-8") as f:
    for _ in range(5):
        print(f.readline())

Top Comment,Reply

"Apple missed the boat on AI OR... Apple is doing what it always does, waiting for others to prove a new technology, then ride in on their massive platform and take over. Time will tell which statement is true.",

"Who added the background music to the video its so fucking distracting. It sounds like nier automata bgm, that makes it impossible to focus",

16:26  FEMI KUTI !!! RAAHHH !!!,

"The greatest AI scam in history, is AI.",



In [11]:
columns = ['lineID', 'characterID', 'movieID', 'character', 'text']

movie_lines_data = pd.read_csv(
    'movie_lines.txt',
    sep=' \+\+\+\$\+\+\+ ',
    engine='python',
    names=columns,
    encoding='ISO-8859-1'  
)

movie_lines_data.head(10)


Unnamed: 0,lineID,characterID,movieID,character,text
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.
5,L924,u2,m0,CAMERON,Wow
6,L872,u0,m0,BIANCA,Okay -- you're gonna need to learn how to lie.
7,L871,u2,m0,CAMERON,No
8,L870,u0,m0,BIANCA,I'm kidding. You know how sometimes you just ...
9,L869,u0,m0,BIANCA,Like my fear of wearing pastels?


In [None]:
## Reading the conversation data

with open("movie_conversations.txt", encoding="ISO-8859-1") as f:
    for _ in range(5):
        print(f.readline())


u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']



In [14]:
columns = ['character1ID', 'character2ID', 'movieID', 'utteranceIDs']

conversation_data = pd.read_csv(
    "movie_conversations.txt",
    sep=' \+\+\+\$\+\+\+ ',
    engine='python',
    names=columns,
    encoding='ISO-8859-1'
)

conversation_data.head()

Unnamed: 0,character1ID,character2ID,movieID,utteranceIDs
0,u0,u2,m0,"['L194', 'L195', 'L196', 'L197']"
1,u0,u2,m0,"['L198', 'L199']"
2,u0,u2,m0,"['L200', 'L201', 'L202', 'L203']"
3,u0,u2,m0,"['L204', 'L205', 'L206']"
4,u0,u2,m0,"['L207', 'L208']"


In [15]:
# Reading Movie_charaters data 

with open ("movie_characters_metadata.txt" ,encoding="ISO- 8859-1") as f: 
    for _  in range(5):
        print(f.readline())

u0 +++$+++ BIANCA +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ f +++$+++ 4

u1 +++$+++ BRUCE +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ ? +++$+++ ?

u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3

u3 +++$+++ CHASTITY +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ ? +++$+++ ?

u4 +++$+++ JOEY +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 6



In [17]:

columns = ['characterID', 'name', 'movie', 'gender', 'position']

character_data = pd.read_csv(
    "movie_characters_metadata.txt",
    sep=' \+\+\+\$\+\+\+ ',
    engine='python',
    names=columns,
    encoding='ISO-8859-1'
)

character_data.head(10)


Unnamed: 0,characterID,name,movie,gender,position
u0,BIANCA,m0,10 things i hate about you,f,4
u1,BRUCE,m0,10 things i hate about you,?,?
u2,CAMERON,m0,10 things i hate about you,m,3
u3,CHASTITY,m0,10 things i hate about you,?,?
u4,JOEY,m0,10 things i hate about you,m,6
u5,KAT,m0,10 things i hate about you,f,2
u6,MANDELLA,m0,10 things i hate about you,f,7
u7,MICHAEL,m0,10 things i hate about you,m,5
u8,MISS PERKY,m0,10 things i hate about you,?,?
u9,PATRICK,m0,10 things i hate about you,m,1


In [18]:
# Reading movie titles data 

with open ("movie_titles_metadata.txt", encoding= "ISO-8859-1") as f:
    for _ in range(5):
        print(f.readline())

m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']

m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']

m2 +++$+++ 15 minutes +++$+++ 2001 +++$+++ 6.10 +++$+++ 25854 +++$+++ ['action', 'crime', 'drama', 'thriller']

m3 +++$+++ 2001: a space odyssey +++$+++ 1968 +++$+++ 8.40 +++$+++ 163227 +++$+++ ['adventure', 'mystery', 'sci-fi']

m4 +++$+++ 48 hrs. +++$+++ 1982 +++$+++ 6.90 +++$+++ 22289 +++$+++ ['action', 'comedy', 'crime', 'drama', 'thriller']



In [28]:
columns = ["MovieID", "Movie Title","Year", "Rating", "no_votes", "Genre"]

movie_titles = pd.read_csv(
    "movie_titles_metadata.txt",
    sep=' \+\+\+\$\+\+\+ ',
    engine='python',
    names=columns,
    encoding='ISO-8859-1'
)

movie_titles.head()

Unnamed: 0,MovieID,Movie Title,Year,Rating,no_votes,Genre
0,m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']"
1,m1,1492: conquest of paradise,1992,6.2,10421,"['adventure', 'biography', 'drama', 'history']"
2,m2,15 minutes,2001,6.1,25854,"['action', 'crime', 'drama', 'thriller']"
3,m3,2001: a space odyssey,1968,8.4,163227,"['adventure', 'mystery', 'sci-fi']"
4,m4,48 hrs.,1982,6.9,22289,"['action', 'comedy', 'crime', 'drama', 'thrill..."


In [24]:
# Reading raw_script data

with open ("raw_script_urls.txt", encoding = "ISO-8859=1") as f:
        for _ in range(5):
              print(f.readline())

m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html

m1 +++$+++ 1492: conquest of paradise +++$+++ http://www.hundland.org/scripts/1492-ConquestOfParadise.txt

m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html

m3 +++$+++ 2001: a space odyssey +++$+++ http://www.scifiscripts.com/scripts/2001.txt

m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt



In [29]:
columns = ['movieID', 'title', 'year', 'rating', 'no_votes', 'genres']

raw_script = pd.read_csv(
    "movie_titles_metadata.txt",
    sep=' \+\+\+\$\+\+\+ ',
    engine='python',
    names=columns,
    encoding='ISO-8859-1'
)

raw_script.head()

Unnamed: 0,movieID,title,year,rating,no_votes,genres
0,m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']"
1,m1,1492: conquest of paradise,1992,6.2,10421,"['adventure', 'biography', 'drama', 'history']"
2,m2,15 minutes,2001,6.1,25854,"['action', 'crime', 'drama', 'thriller']"
3,m3,2001: a space odyssey,1968,8.4,163227,"['adventure', 'mystery', 'sci-fi']"
4,m4,48 hrs.,1982,6.9,22289,"['action', 'comedy', 'crime', 'drama', 'thrill..."


In [None]:

# List of  dataset files and how many columns are in each
files_info = {
    "movie_lines.txt": 5,
    "movie_conversations.txt": 4,
    "movie_characters_metadata.txt": 5,
    "movie_titles_metadata.txt": 6,  
}

# Matching delimiter
delimiter = r' \+\+\+\$\+\+\+ '

for file, col_count in files_info.items():
    if not os.path.exists(file):
        print(f"{file} not found.")
        continue

    try:
        df = pd.read_csv(
            file,
            sep=delimiter,
            engine='python',
            encoding='ISO-8859-1',
            header=None,
            names=[f'col{i+1}' for i in range(col_count)]
        )
        print(f"{file} → Rows: {df.shape[0]}, Columns: {df.shape[1]}")
    except Exception as e:
        print(f"Error reading {file}: {e}")


movie_lines.txt → Rows: 304713, Columns: 5
movie_conversations.txt → Rows: 83097, Columns: 4
movie_characters_metadata.txt → Rows: 9035, Columns: 5
movie_titles_metadata.txt → Rows: 617, Columns: 6


## Pairing conversion_data & movie_lines_data

In [74]:
# Safely converting strings that looks like lists into actual list objects 

def safe_eval(val):
    if isinstance(val, str):
        return ast.literal_eval(val)
    return val

conversation_data['utteranceIDs'] = conversation_data['utteranceIDs'].apply(safe_eval)
import ast

# a lookup dictionary from lineID → text
line_map = dict(zip(movie_lines_data['lineID'], movie_lines_data['text']))

# input–response pairs 
pairs = []
for utterances in conversation_data['utteranceIDs']:
    for i in range(len(utterances) - 1):
        input_text = line_map.get(utterances[i])
        response_text = line_map.get(utterances[i + 1])
        if input_text and response_text:
            pairs.append((input_text, response_text))

# Converting to DataFrame
dialogue_df = pd.DataFrame(pairs, columns=['input', 'response'])


dialogue_df.sample(10)


Unnamed: 0,input,response
70683,I'll introduce you.,Ick. And those foul chemicals in the pots--
93684,Yeah. Way north.,What unit were you with ?
53506,"So, is this like a Japanese restaurant?",I'd better get in there.
129352,Jack.,Frank. I'm here. I always get here. Don't s...
88786,Only six?,What is this? 'Twenty Questions'?
68607,Let's talk about the work that you care so muc...,Sure. Where would you like to start?
89187,Then what would be enough? If we were married?,I wouldn't want you to marry me just to prove ...
96887,This kind of heat. It's pathetic.,"Well, I guess you pick your poison."
164927,Do you know where you're going ?,Yes.
99293,"In a day or two, yes.",Eve is going to stay. The house will not be cl...


## Language detection

In [None]:
# Generic set of common Swahili or Sheng words that the function will check for in the comments

swahili_sheng_vocab = {
    'watu', 'sana', 'ni', 'kweli', 'yaani', 'ndio', 'hapana', 'wewe',
    'walicheza', 'fiti', 'mambo', 'uko', 'niko', 'bro', 'manze', 'nani',
    'mbogi', 'gava', 'safi', 'poa', 'niko', 'shida', 'leo', 'kesho',
    'mathe', 'buda', 'manze', 'dem', 'jamo', 'mpango', 'sijui', 'nashindwa'
}


# Function to extract Swahili/Sheng words

def find_swahili_sheng(comment):
    comment = re.sub(r'[^\w\s]', '', str(comment)).lower()
    words_in_comment = comment.split()
    swahili_words = [word for word in words_in_comment if word in swahili_sheng_vocab]
    return swahili_words

# Applying to DataFrame
df['swahili_words'] = df['comment'].apply(find_swahili_sheng)
df['swahili_word_count'] = df['swahili_words'].apply(len)

# Filtering only those with Swahili/Sheng detected
df_swahili = df[df['swahili_word_count'] > 0]

print(df_swahili[['comment', 'swahili_words', 'swahili_word_count']].head())





                                               comment swahili_words  \
60                             Watu wa kasongo hoyeee.        [watu]   
102  Alafu msichana wa watu akishakimbia hivi boyfr...        [watu]   
281  Peperusha bendera nanii kenya 🇰🇪 to the world ...        [sana]   
356  Amazing, watu wa nguvu, always doing Kenya proud.        [watu]   
560                      Obiri and kipyego wee ni moto          [ni]   

     swahili_word_count  
60                    1  
102                   1  
281                   1  
356                   1  
560                   1  


In [72]:
 # Language detection function on Youtube_comments dataset

def detect_language(text):
    text = re.sub(r'[^\w\s]', '', str(text)).lower()
    tokens = text.split()
    matches = sum(word in swahili_sheng_vocab for word in tokens)

    if matches == 0:
        return 'english'
    elif matches >= 3:
        return 'swahili_sheng'
    else:
        return 'mixed'

# Applying to YouTube comments
df['lang'] = df['comment'].apply(detect_language)

df[['comment', 'lang']].sample(10)

Unnamed: 0,comment,lang
10624,Yes sure the guardian angel in him was sending...,english
23454,"My heart goes to serah,so sad,Abraham God is t...",english
23002,Serah is living every woman's greatest nightma...,english
10121,May he run mad and be a pain to himself and hi...,english
9186,"Hugs mama God is with you,Justice for Kingsley!",english
17389,this story is hard.serah is not as she has sa...,english
20569,I wonder if the roles were reversed whether Ma...,english
9157,May God give her strength the pain i too . Hav...,english
2919,How do someone kill a one year old child?😢😢,english
8907,So so sad,english


In [None]:
# Mixed language

mixed_comments = df[df['lang'] == 'mixed']
print("Mixed-language comments:")
print(mixed_comments[['comment']].head(10))


Mixed-language comments:
                                               comment
60                             Watu wa kasongo hoyeee.
102  Alafu msichana wa watu akishakimbia hivi boyfr...
281  Peperusha bendera nanii kenya 🇰🇪 to the world ...
356  Amazing, watu wa nguvu, always doing Kenya proud.
560                      Obiri and kipyego wee ni moto
562      Moto aliyouwasha Kipyegon ulikuwa mambo yote.
581  Congratulations to our athletic ladies,  l lov...
637           Proud to be kenyan 🇰🇪🇰🇪🇰🇪🇰🇪kenya ni home
655  .... und man sieht sogar auf dem ersten Blick ...
700     Kama wewe n mkenya weka like tukisonga ❤❤❤🎉🎉🎉🎉


## Data Cleaning & Preprocessing

In [75]:
import nltk
nltk.download('words')
from nltk.corpus import words
english_words = set(words.words())

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\helle\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [None]:
# Youtube Data 

youtube_df = pd.read_csv('youtube_comments.csv')

# Drop rows with missing comments
youtube_df.dropna(subset=['comment'], inplace=True)

# Remove duplicates
youtube_df.drop_duplicates(subset='comment', inplace=True)

# Basic text cleaning function
def clean_comment(text):
    text = re.sub(r"http\S+|www\S+|https\S+", '', str(text))  # Remove URLs
    text = re.sub(r"[^\w\s]", '', text)                       # Remove punctuation
    text = text.lower()                                       # Lowercase
    return text.strip()

# Apply cleaning
youtube_df['clean_comment'] = youtube_df['comment'].apply(clean_comment)

# Check result
youtube_df[['comment', 'clean_comment']].sample(10)


Unnamed: 0,comment,clean_comment
11733,True...such a beast!,truesuch a beast
16986,The only winner is the man.,the only winner is the man
23087,Hapa iko shida Sarah got sooo hurt 🤕 she is st...,hapa iko shida sarah got sooo hurt she is sti...
4977,Oooiiieee God will be your justice and your co...,oooiiieee god will be your justice and your co...
22591,Allow me to say Abraham ni kadinya wa kawaida....,allow me to say abraham ni kadinya wa kawaidaa...
2664,LNN NGUGI is truly God sent 🙏 He shall carry y...,lnn ngugi is truly god sent he shall carry yo...
23107,No One can love two women equally .it's a lie ...,no one can love two women equally its a lie ad...
1572,May God give you strength,may god give you strength
20130,Nobody can love two ladies at same time never....,nobody can love two ladies at same time neverh...
10196,May he never know any peace may the cry of his...,may he never know any peace may the cry of his...


In [58]:
# Dialogue Dataframe

# Drop missing or null input/response
dialogue_df.dropna(subset=['input', 'response'], inplace=True)

# Drop duplicates
dialogue_df.drop_duplicates(inplace=True)

# text cleaning function
def clean_text(text):
    text = re.sub(r"http\S+|www\S+", '', str(text))     # Remove URLs
    text = re.sub(r"[^\w\s]", '', text)                 # Remove punctuation
    text = re.sub(r"\s+", " ", text)                    # Remove extra spaces
    return text.lower().strip()

# Apply cleaning to input and response
dialogue_df['input_clean'] = dialogue_df['input'].apply(clean_text)
dialogue_df['response_clean'] = dialogue_df['response'].apply(clean_text)

# Remove rows where cleaned input or response is empty
dialogue_df = dialogue_df[(dialogue_df['input_clean'] != '') & (dialogue_df['response_clean'] != '')]

# Reset index
dialogue_df.reset_index(drop=True, inplace=True)

dialogue_df[['input_clean', 'response_clean']].sample(10)


Unnamed: 0,input_clean,response_clean
217098,the thing is ill need a first mate,i know where you can find any number of naked ...
82110,im afraid thats not possible,why not
190678,what is it a military spacecraft like a shuttl...,something like that that doesnt surprise you
218077,hochmut,vain proud such a person is hochmutsnarr he is...
193661,is it loaded,no i dont think so
31379,its probably going to need stitches,im going to throw up
64261,nick the ice is,get to the bridge hey hey down here
96206,think i got em,i dont know
194540,you could do it,i could
130460,such as,keeping an eye on dr duval
