## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview

This project aims to develop a natural language chatbot that generates human-like responses by learning from real-world conversations. The chatbot will be trained on dialogue data from comment threads on YouTube videos and from movie scripts, so that it learns conversational flow and context. The Cornell Movie-Dialogs Corpus will serve as the primary training dataset, exposing the model to dialogue structure, character interactions, and contextual relevance.

### 1.1 Problem Statement

Traditional rule-based chatbots often produce rigid, context-insensitive responses that break the natural flow of conversation. To build a more engaging and realistic conversational experience, this project will leverage deep learning techniques on real-world dialogue data. The goal is to develop a chatbot capable of understanding and generating coherent, context-aware responses in multi-turn conversations.

### 1.2 Objectives

- Scrape and preprocess real dialogue data from YouTube comments and integrate it with the Cornell Movie-Dialogs Corpus

- Structure the dataset for effective training and evaluation

- Train a sequence-to-sequence chatbot model using recurrent networks (vanilla RNNs or LSTMs)
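
As a rough illustration of the second objective, the sketch below turns (input, response) text pairs into integer sequences of the kind a sequence-to-sequence model consumes. The tokenizer, the special tokens (`<pad>`, `<sos>`, `<eos>`, `<unk>`), the two toy pairs, and the helper names `tokenize`, `build_vocab`, and `encode` are all assumptions made for illustration; they are not part of the project code.

```python
import re

# Special tokens are an assumption; any consistent set would do.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(pairs):
    """Map every token seen in the pairs to an integer id."""
    vocab = {PAD: 0, SOS: 1, EOS: 2, UNK: 3}
    for source, target in pairs:
        for token in tokenize(source) + tokenize(target):
            # New tokens get the next free id; known tokens keep theirs.
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to ids, wrapped in start and end markers."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokenize(text)]
    return [vocab[SOS]] + ids + [vocab[EOS]]


# Toy pairs written by hand for the example (not project data).
pairs = [("How are you?", "I am fine."), ("Who are you?", "I am a bot.")]
vocab = build_vocab(pairs)
encoded_pairs = [(encode(s, vocab), encode(t, vocab)) for s, t in pairs]
```

Words that never appeared while the vocabulary was built fall back to the `<unk>` id, which keeps encoding from failing on unseen text at inference time.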

## YouTube Data Scraping

In [1]:
! pip install google-api-python-client



In [12]:
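import csv
import os

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Read the YouTube API key from the environment instead of hard-coding it.
# The variable name YOUTUBE_API_KEY is an assumption; use whichever name you export.
API_KEY = os.environ["YOUTUBE_API_KEY"]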

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=API_KEY)

In [13]:
def get_comments_with_replies(video_id):
    """Return the top-level comments of a video and the replies included with each thread."""
    comments = []

    try:
        request = youtube.commentThreads().list(
            part="snippet,replies",
            videoId=video_id,
            maxResults=100,
            textFormat="plainText"
        )

        # list_next() returns None once the last page has been read
        while request is not None:
            response = request.execute()

            for item in response.get("items", []):
                top_comment = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                comments.append({"video_id": video_id, "comment": top_comment})

                # Add replies if any
                replies = item.get("replies", {}).get("comments", [])
                for reply in replies:
                    reply_text = reply["snippet"]["textDisplay"]
                    comments.append({"video_id": video_id, "comment": reply_text})

            # Pagination: build the request for the next page, if there is one
            request = youtube.commentThreads().list_next(request, response)

    except HttpError as e:
        # Comments gathered before the failure are still returned
        print(f"Failed to fetch comments for video {video_id}: {e}")

    return comments

# List of video IDs
video_ids = [
    'qlZM3McwO1Q',
    '-voTKRBOEd0',
    '7JwKc6r5fAQ',
    '_b6D5wMzKZQ',
    'IB12MAwLs58',
    'QzIkndzWYU4',
    'P0cwqhA-YCk',
    'q4sWUJxjc4g'
]

# Collect comments for all videos
all_comments = []

for vid in video_ids:
    print(f"Fetching comments for video: {vid}")
    video_comments = get_comments_with_replies(vid)
    all_comments.extend(video_comments)

print(f"\nTotal comments collected: {len(all_comments)}")

Fetching comments for video: qlZM3McwO1Q
Fetching comments for video: -voTKRBOEd0
Fetching comments for video: 7JwKc6r5fAQ
Fetching comments for video: _b6D5wMzKZQ
Fetching comments for video: IB12MAwLs58
Fetching comments for video: QzIkndzWYU4
Fetching comments for video: P0cwqhA-YCk
Fetching comments for video: q4sWUJxjc4g

Total comments collected: 25629
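
The `replies` field of a `commentThreads` item may contain only a subset of a thread's replies. If complete threads matter for training, one option is a second call to `comments().list` with `parentId` set to the ID of the top-level comment. The sketch below assumes the caller passes in a client built as in the cell above; `fetch_all_replies` is a hypothetical helper and is not part of this notebook.

```python
def fetch_all_replies(youtube, parent_id):
    """Return the text of every reply to one top-level comment.

    `youtube` is assumed to be a client created with
    googleapiclient.discovery.build('youtube', 'v3', ...).
    """
    replies = []
    request = youtube.comments().list(
        part="snippet",
        parentId=parent_id,
        maxResults=100,
        textFormat="plainText",
    )

    # list_next() returns None when there are no further pages
    while request is not None:
        response = request.execute()
        for reply in response.get("items", []):
            replies.append(reply["snippet"]["textDisplay"])
        request = youtube.comments().list_next(request, response)

    return replies
```

The parent ID for a thread is available as `item["snippet"]["topLevelComment"]["id"]`. Each extra call consumes API quota, so it may be worth making the call only for threads whose `totalReplyCount` exceeds the number of replies already returned.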


In [14]:
# ✅ Export comments to a CSV file
csv_filename = "youtube_comments.csv"

if all_comments and isinstance(all_comments[0], dict):
    with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=["video_id", "comment"])
        writer.writeheader()
        writer.writerows(all_comments)

    print(f"✅ Comments exported to '{csv_filename}' successfully.")
else:
    print("⚠️ No structured comment data available to export.")

✅ Comments exported to 'youtube_comments.csv' successfully.
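
The exported CSV holds one comment per row, so it does not record which comment a reply answers. For dialogue training, a comment and its reply form a natural (input, response) pair. The sketch below shows one way to keep that link, assuming items shaped like the `commentThreads` responses used above. `thread_to_pairs` and the `input`/`response` field names are hypothetical, and `sample_item` is hand-written rather than real data. Treating every reply as a response to the top-level comment is a simplification, since replies can also answer each other.

```python
def thread_to_pairs(item):
    """Turn one commentThreads item into (comment, reply) text pairs."""
    parent = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
    replies = item.get("replies", {}).get("comments", [])
    return [
        {"input": parent, "response": reply["snippet"]["textDisplay"]}
        for reply in replies
    ]


# A hand-written item that mimics only the fields read above (not real data)
sample_item = {
    "snippet": {
        "topLevelComment": {"snippet": {"textDisplay": "Who won the race?"}}
    },
    "replies": {
        "comments": [
            {"snippet": {"textDisplay": "The Kenyan team did."}},
            {"snippet": {"textDisplay": "It was very close."}},
        ]
    },
}

pairs = thread_to_pairs(sample_item)
```

Threads without replies produce an empty list, so they contribute no training pairs under this scheme.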


In [None]:
# import the pandas library
import pandas as pd

In [None]:
# importing the data
df = pd.read_csv('youtube_comments.csv')

In [None]:
# top rows
df.head()

|   | video_id | comment |
|---|----------|---------|
| 0 | qlZM3McwO1Q | What an incredible victory. I agree the Kenyan... |
| 1 | qlZM3McwO1Q | ❤ |
| 2 | qlZM3McwO1Q | “Claudia is an amazonian goddess with a beauti... |
| 3 | qlZM3McwO1Q | Proud of my motherland Kenya ❤❤❤and Africa.at ... |
| 4 | qlZM3McwO1Q | Damn |


In [None]:
# information of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25629 entries, 0 to 25628
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   video_id  25629 non-null  object
 1   comment   25628 non-null  object
dtypes: object(2)
memory usage: 400.6+ KB
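
The `df.info()` output lists 25628 non-null `comment` values for 25629 rows, so one row has no text. A minimal sketch of one way to drop such rows before training follows. It runs on a small hand-made frame rather than the real CSV, `clean_comments` is a hypothetical helper, and also removing whitespace-only comments is an assumption about what counts as empty.

```python
import pandas as pd


def clean_comments(frame):
    """Drop rows whose comment is missing or blank after stripping whitespace."""
    cleaned = frame.dropna(subset=["comment"]).copy()
    cleaned["comment"] = cleaned["comment"].str.strip()
    cleaned = cleaned[cleaned["comment"] != ""]
    return cleaned.reset_index(drop=True)


# Hand-made sample with one missing and one whitespace-only comment
sample = pd.DataFrame(
    {
        "video_id": ["a", "a", "b", "b"],
        "comment": ["Great race!", None, "   ", " Well done "],
    }
)
cleaned = clean_comments(sample)
```

Applied to the notebook's data, the call would be `df = clean_comments(df)`.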
