## Team Members 

1. Jessica Mutiso
2. Brian Waweru
3. Pamela Godia
4. Hellen Mwaniki

## 1. Project Overview 


This project aims to develop a natural language chatbot that can generate human-like responses by learning from real-world conversations. Leveraging dialogue data from YouTube videos and movie scripts, the chatbot will be trained to understand conversational flow and context. The Cornell Movie Dialogues Corpus will serve as the primary training dataset, enabling the model to grasp nuances in dialogue structure, character interactions, and contextual relevance.

## 1.1 Problem Statement
Traditional rule-based chatbots often produce rigid, context-insensitive responses that break the natural flow of conversation. To build a more engaging and realistic conversational experience, this project will leverage deep learning techniques on real-world dialogue data. The goal is to develop a chatbot capable of understanding and generating coherent, context-aware responses in multi-turn conversations.


## 1.2 Objectives

- Scrape and preprocess real dialogue data from YouTube and integrate it with the Cornell Movie Dialogues Corpus

- Structure the dataset for effective training and evaluation

- Train a sequence-to-sequence chatbot model using recurrent networks (plain RNNs or LSTMs)
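
As a rough illustration of the "structure the dataset" objective, the sketch below turns Cornell Movie Dialogues rows into (prompt, response) pairs. It assumes the standard corpus layout, where `movie_lines.txt` and `movie_conversations.txt` separate fields with ` +++$+++ `. The function names `parse_lines` and `conversation_pairs` are hypothetical helpers and are not part of this notebook.

```python
import ast

SEP = " +++$+++ "  # field separator assumed from the standard corpus files


def parse_lines(raw_lines):
    """Map line IDs (e.g. 'L194') to utterance text from movie_lines.txt rows."""
    id_to_text = {}
    for row in raw_lines:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) == 5:  # lineID, characterID, movieID, character name, text
            id_to_text[parts[0]] = parts[4]
    return id_to_text


def conversation_pairs(raw_conversations, id_to_text):
    """Return (prompt, response) pairs from consecutive lines of each conversation."""
    pairs = []
    for row in raw_conversations:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) != 4:  # characterID, characterID, movieID, list of line IDs
            continue
        line_ids = ast.literal_eval(parts[3])  # e.g. "['L194', 'L195']"
        for first, second in zip(line_ids, line_ids[1:]):
            if first in id_to_text and second in id_to_text:
                pairs.append((id_to_text[first], id_to_text[second]))
    return pairs
```

Both functions accept any iterable of strings, so they can be fed open file handles or small in-memory samples. Pairing each line with the one that follows it is only one possible design; a multi-turn model might instead keep the whole conversation as context.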

## YouTube Data Scraping

In [1]:
! pip install google-api-python-client



In [2]:
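import csv
import os

from googleapiclient.discovery import build

# Read the YouTube Data API key from the environment instead of hardcoding it.
# YOUTUBE_API_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ['YOUTUBE_API_KEY']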

# Initialize YouTube API client
youtube = build('youtube', 'v3', developerKey=API_KEY)

In [3]:
# ID of the video to scrape; change this to target a different video
VIDEO_ID = 'qlZM3McwO1Q'

In [4]:
# Function to fetch top-level comments and the replies embedded in each thread
def get_comments_with_replies(video_id):
    all_comments = []

    request = youtube.commentThreads().list(
        part="snippet,replies",
        videoId=video_id,
        maxResults=100,
        textFormat="plainText"
    )

    # list_next() builds the request for the next page and returns None at the end
    while request is not None:
        response = request.execute()

        for item in response['items']:
            top_comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment_thread = {
                'top_comment': top_comment,
                'replies': []
            }

            # Extract embedded replies if they exist (may be a subset for long threads)
            if 'replies' in item:
                for reply in item['replies']['comments']:
                    reply_text = reply['snippet']['textDisplay']
                    comment_thread['replies'].append(reply_text)

            all_comments.append(comment_thread)

        request = youtube.commentThreads().list_next(request, response)

    return all_comments
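
The replies embedded in a `commentThreads` response can be only a subset of a thread's replies. One way to retrieve the rest is to page through `comments().list` with `parentId`, sketched below. `fetch_all_replies` is a hypothetical helper and not part of the original notebook. It takes the API client as a parameter so it can be exercised without network access.

```python
def fetch_all_replies(youtube, parent_id):
    """Return every reply text under one top-level comment, following pagination."""
    replies = []
    request = youtube.comments().list(
        part="snippet",
        parentId=parent_id,
        maxResults=100,
        textFormat="plainText",
    )
    while request is not None:
        response = request.execute()
        for reply in response.get("items", []):
            replies.append(reply["snippet"]["textDisplay"])
        # list_next() returns None when there is no further page
        request = youtube.comments().list_next(request, response)
    return replies
```

A possible use is to call `fetch_all_replies(youtube, item['snippet']['topLevelComment']['id'])` only for threads where `item['snippet']['totalReplyCount']` exceeds the number of embedded replies, since every extra request counts against the API quota.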

In [5]:
# Call the function and get data 
comments_data = get_comments_with_replies(VIDEO_ID)
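
To gather threads from more than one video, the same function can be called in a loop. The sketch below is one hedged way to do that. `collect_threads` and the `video_id` field it adds are illustrative names, and the broad `except` is an assumption to keep one failing video (for example, one with comments disabled) from stopping the whole run.

```python
def collect_threads(video_ids, fetch):
    """Call fetch(video_id) for each ID and tag every thread with its source video."""
    collected = []
    failed = []
    for video_id in video_ids:
        try:
            threads = fetch(video_id)
        except Exception as exc:  # e.g. comments disabled or quota exceeded
            failed.append((video_id, str(exc)))
            continue
        for thread in threads:
            # Copy the thread and record which video it came from
            collected.append({"video_id": video_id, **thread})
    return collected, failed
```

For example, `all_threads, failed = collect_threads(['qlZM3McwO1Q'], get_comments_with_replies)` would reproduce the single-video run while also reporting any IDs that could not be fetched.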

In [6]:

# Save the comments to a CSV file
with open('youtube_comments_replies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Top Comment', 'Reply'])

    for item in comments_data:
        if item['replies']:
            for reply in item['replies']:
                writer.writerow([item['top_comment'], reply])
        else:
            writer.writerow([item['top_comment'], ''])

print("✅ Done! Comments and replies saved to 'youtube_comments_replies.csv'")

✅ Done! Comments and replies saved to 'youtube_comments_replies.csv'


In [7]:
import pandas as pd

In [8]:
df = pd.read_csv('youtube_comments_replies.csv')

In [9]:
df.head()

|   | Top Comment | Reply |
|---|-------------|-------|
| 0 | Well done team Kenya❤ | NaN |
| 1 | This channel is racist! Report the channel, it... | NaN |
| 2 | Congs Kenya, lots of love from Uganda | NaN |
| 3 | Just WOW Kenya, what a run, what a stamina! | NaN |
| 4 | I'm a south African love kenyans | NaN |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1088 entries, 0 to 1087
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Top Comment  1088 non-null   object
 1   Reply        139 non-null    object
dtypes: object(2)
memory usage: 17.1+ KB
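
The `df.info()` output shows that only 139 of the 1088 rows have a reply, so roughly 13% of the scraped rows can serve as comment and reply training pairs. The sketch below is one possible way to keep only those pairs, working directly on the `comments_data` structure returned by `get_comments_with_replies`. `build_pairs` and its `min_chars` threshold are hypothetical and not part of the original notebook.

```python
def build_pairs(threads, min_chars=2):
    """Turn scraped threads into de-duplicated (comment, reply) training pairs."""
    seen = set()
    pairs = []
    for thread in threads:
        # Collapse runs of whitespace and newlines into single spaces
        prompt = " ".join(thread["top_comment"].split())
        for reply in thread["replies"]:
            response = " ".join(reply.split())
            if len(prompt) < min_chars or len(response) < min_chars:
                continue  # skip empty or near-empty text
            if (prompt, response) not in seen:
                seen.add((prompt, response))
                pairs.append((prompt, response))
    return pairs
```

Calling `build_pairs(comments_data)` would drop threads without replies automatically, because their `replies` list is empty. Further cleaning, such as removing spam or non-English text, is left out of this sketch.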
