# TF-IDF + Cosine Similarity Chatbot (Cornell Movie Dialogs Corpus)
A simple NLP chatbot using TF-IDF and cosine similarity for conversational response matching.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Load the Cornell Movie Dialogs dataset (fields are separated by " +++$+++ ")
SEP = r'\s*\+\+\+\$\+\+\+\s*'
conversations = pd.read_csv('original-data/movie_conversations.txt', sep=SEP, engine='python',
                            names=["character1", "character2", "movieID", "utteranceIDs"],
                            encoding='iso-8859-1')
lines = pd.read_csv('original-data/movie_lines.txt', sep=SEP, engine='python',
                    names=["lineID", "characterID", "movieID", "character", "text"],
                    encoding='iso-8859-1')

# YouTube comment datasets (loaded here but not used in the cells below)
yt1 = pd.read_csv('original-data/brian_youtube_data_comments.csv')
yt2 = pd.read_csv('original-data/pamela-youtube_comments.csv')

# Convert utteranceIDs from string to list
conversations['utteranceIDs'] = conversations['utteranceIDs'].apply(ast.literal_eval)
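
To make the parsing step concrete, here is a minimal standard-library sketch of what the separator regex and `ast.literal_eval` do to a single record. The sample line follows the layout of `movie_conversations.txt`, but its IDs should be treated as illustrative values rather than as data verified against the file.

```python
import ast
import re

# Illustrative record in the layout of movie_conversations.txt (IDs are sample values)
raw = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

# Same separator pattern as the pd.read_csv calls: "+++$+++" with optional whitespace around it
fields = re.split(r'\s*\+\+\+\$\+\+\+\s*', raw)
character1, character2, movie_id, utterance_ids_str = fields

# The last field is a string that looks like a Python list; literal_eval parses it
# without executing arbitrary code
utterance_ids = ast.literal_eval(utterance_ids_str)

print(type(utterance_ids_str).__name__, "->", type(utterance_ids).__name__)
print(utterance_ids)
```

Until this conversion runs, `utteranceIDs` holds plain strings, and iterating over one would yield single characters instead of line IDs.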

In [10]:
#Extract Conversational Pairs (input-response)
line_dict = dict(zip(lines['lineID'], lines['text']))

pairs = []

for conv in conversations['utteranceIDs']:
    for i in range(len(conv) - 1):
        # Get both lines from the dictionary
        in_line = line_dict.get(conv[i])
        out_line = line_dict.get(conv[i+1])

        # Ensure both exist and are strings
        if isinstance(in_line, str) and isinstance(out_line, str):
            in_line = in_line.strip()
            out_line = out_line.strip()
            if in_line and out_line:
                pairs.append((in_line, out_line))

# Create DataFrame
chat_df = pd.DataFrame(pairs, columns=["input", "response"])

# Sanity check
print(f"✅ Extracted {len(chat_df)} conversation pairs.")
if not chat_df.empty:
    print(chat_df.sample(5))
else:
    print("❌ No valid pairs found.")

✅ Extracted 221282 conversation pairs.
                                                    input  \
92511   Oh, okay, forgive me.  Your neighbors are here...   
77948                                 So you're a lawyer.   
18704                                             Oh, no.   
7333    He suspects I know something. I think he was s...   
128637  Don't recognize him. You were trapped by Morga...   

                                                 response  
92511              Exactly what I mean.  It's all ruined.  
77948     That's right. What are you doing in Bodega Bay?  
18704                      I knew you'd understand. Here.  
7333                                                Why?!  
128637  ...Gawain and Perceval, Bors and Bohort, Carad...  
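
The pairing logic can be traced on a toy conversation. This sketch uses hypothetical line IDs and texts (not corpus data) and the same filtering rules as the loop above; `zip(conv, conv[1:])` is simply an equivalent way to walk consecutive utterances.

```python
# Hypothetical utterance IDs and texts, for illustration only
line_dict = {
    "L1": "Hi.",
    "L2": "Hello. How are you?",
    "L3": "Fine, thanks.",
    "L4": "   ",  # whitespace-only line, dropped after strip()
}
conv = ["L1", "L2", "L3", "L4", "L5"]  # "L5" is deliberately missing from line_dict

pairs = []
for prev_id, next_id in zip(conv, conv[1:]):
    in_line, out_line = line_dict.get(prev_id), line_dict.get(next_id)
    # Keep the pair only if both lines exist, are strings, and are non-empty
    if isinstance(in_line, str) and isinstance(out_line, str):
        in_line, out_line = in_line.strip(), out_line.strip()
        if in_line and out_line:
            pairs.append((in_line, out_line))

print(pairs)
```

Every utterance except the first and last in a conversation appears twice: once as a response and once as the next input.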


In [11]:
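# Preprocess the text
def clean_text(text):
    """Lowercase, keep only ASCII letters, digits and whitespace, and collapse repeated spaces."""
    text = text.lower()
    # After lower() only a-z needs to be kept; apostrophes are removed too ("you're" -> "youre")
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text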

chat_df['clean_input'] = chat_df['input'].apply(clean_text)
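
A quick check of what the cleaning step produces, using two utterances from the sample output above. The function is repeated here only so the snippet runs on its own; the character class is behaviorally equivalent to the one in the cell because the text is lowercased first.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop punctuation and any non-ASCII characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

print(clean_text("Exactly what I mean.  It's all ruined."))  # exactly what i mean its all ruined
print(clean_text("So you're a lawyer."))                     # so youre a lawyer
```

Contractions lose their apostrophe ("it's" becomes "its"), so user queries need the same cleaning before they are vectorized, otherwise they will not line up with the stored inputs.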

In [12]:
# Create TF-IDF Matrix for User Inputs
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(chat_df['clean_input'])
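
For reference, the weighting behind `tfidf_matrix` can be written out. This assumes scikit-learn's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), none of which the cell above overrides:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
$$

Here $n$ is the number of stored inputs, $\mathrm{df}(t)$ is the number of inputs containing term $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in input $d$. A "term" is any unigram or bigram that remains after stop-word removal. Because each row $\hat{w}_{\cdot,d}$ has unit length, the cosine similarity between two rows reduces to their dot product.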

In [13]:
# Chatbot Function Using Cosine Similarity
def chatbot_response(user_input):
    cleaned_input = clean_text(user_input)
    user_vec = vectorizer.transform([cleaned_input])
    similarities = cosine_similarity(user_vec, tfidf_matrix).flatten()
    best_match_idx = similarities.argmax()
    if similarities[best_match_idx] > 0:
        return chat_df.iloc[best_match_idx]['response']
    else:
        return "I'm not sure how to respond to that."

In [14]:
import gradio as gr

# Function that takes user input and returns chatbot response
def chat_with_bot(user_input):
    # Reuse chatbot_response so the query is cleaned the same way as the stored
    # inputs and zero-similarity queries get the fallback message instead of row 0
    return chatbot_response(user_input)

# Create Gradio interface
iface = gr.Interface(fn=chat_with_bot,
                     inputs=gr.Textbox(lines=2, placeholder="Type a message..."),
                     outputs="text",
                     title="TF-IDF and cosine similarity baseline model",
                     description="Ask something and get a response based on similarity!")

iface.launch()


* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




## Why This Baseline Model Needs Further Advancement

The current chatbot uses a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer combined with cosine similarity to find the stored input most similar to the user's message, then returns the response paired with that input. While this method is simple and works as a baseline, it has several limitations that make it unsuitable for production-level conversational AI:

### 1. **Lack of Context Awareness**
The model treats each input-response pair as independent. It doesn't remember prior messages in the conversation, making it unable to maintain coherent multi-turn interactions.
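
One hedged way to approximate context without changing the model is to fold the last few user turns into the query string before it is vectorized. The sketch below is an assumption, not part of the notebook: `ContextWindow` and `max_turns` are hypothetical names, and since the stored inputs are single utterances, longer queries may well match worse rather than better.

```python
from collections import deque

class ContextWindow:
    """Keeps the last `max_turns` user messages (hypothetical helper, not in the notebook)."""

    def __init__(self, max_turns=3):
        # deque with maxlen silently discards the oldest turn once it is full
        self.turns = deque(maxlen=max_turns)

    def build_query(self, user_input):
        # Record the new message, then join the retained turns into one query string
        self.turns.append(user_input)
        return " ".join(self.turns)

ctx = ContextWindow(max_turns=2)
print(ctx.build_query("Do you like movies?"))
print(ctx.build_query("Which one is your favorite?"))
print(ctx.build_query("Why?"))
```

The string returned by `build_query` would then be passed to `chatbot_response` in place of the raw message.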

### 2. **No Semantic Understanding**
TF-IDF weights words by how often they appear; it has no notion of what they mean and cannot recognize synonyms. For instance, "How are you doing?" and "How's it going?" share no tokens after cleaning ("How's" becomes "hows"), so their cosine similarity is zero and they are treated as unrelated.
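
The effect is easy to reproduce with plain word counts. This sketch uses a pure-Python bag-of-words cosine as a stand-in for the TF-IDF vectors (an assumption made to keep it dependency-free); IDF weighting rescales shared terms but cannot create overlap where none exists, and stop-word removal only takes tokens away.

```python
import math
import re
from collections import Counter

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def bow_cosine(a, b):
    """Cosine similarity between the raw word-count vectors of two strings."""
    va, vb = Counter(clean_text(a).split()), Counter(clean_text(b).split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

same_meaning = bow_cosine("How are you doing?", "How's it going?")  # no shared tokens
shared_words = bow_cosine("How are you doing?", "How are you?")     # three shared tokens
print(same_meaning, round(shared_words, 3))
```

The two greetings with the same meaning score exactly zero, while the pair that merely shares words scores high.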

### 3. **Rigid Matching**
The model responds poorly to paraphrased or typo-ridden inputs, since matching depends on exact overlap of words and word pairs (`ngram_range=(1, 2)`). A single misspelled word no longer matches its correctly spelled form.
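
The typo case can be illustrated by comparing word-level and character-level features. This is a dependency-free sketch with hypothetical helper names; padded character trigrams are shown only as one possible mitigation (scikit-learn's `TfidfVectorizer` offers a comparable `analyzer='char_wb'` option), not as what the notebook does.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Count character n-grams of a word padded with one space on each side."""
    padded = f" {word} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(va, vb):
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Word-level features: the misspelling is a completely different token
word_level = cosine(Counter(["library"]), Counter(["libary"]))
# Character trigrams: most of the word's fragments still match
char_level = cosine(char_ngrams("library"), char_ngrams("libary"))
print(word_level, round(char_level, 3))
```

With whole words the typo scores zero, whereas four of the trigrams survive the misspelling and keep the similarity well above zero.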

### 4. **Scalability Issues**
Each query is compared against every stored input (221,282 rows in this notebook), so lookup time grows linearly with the dataset, and adding bigrams enlarges the vocabulary and the TF-IDF matrix further. This limits performance in real-time applications.
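
One common way to reduce the per-query work is an inverted index that narrows the search to rows sharing at least one token with the query. Rows with no shared token have a cosine similarity of exactly zero, so skipping them cannot change the best match. The sketch below is a toy illustration under that assumption; the inputs and the `candidate_rows` helper are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the cleaned corpus inputs (hypothetical examples)
inputs = ["so youre a lawyer", "oh no", "are you a lawyer", "why"]

# Build the index once: token -> set of row ids that contain it
index = defaultdict(set)
for row_id, text in enumerate(inputs):
    for token in set(text.split()):
        index[token].add(row_id)

def candidate_rows(query):
    """Return the ids of rows sharing at least one token with the query."""
    rows = set()
    for token in set(query.split()):
        # .get avoids adding empty entries to the defaultdict for unseen tokens
        rows |= index.get(token, set())
    return sorted(rows)

print(candidate_rows("is she a lawyer"))  # [0, 2]
print(candidate_rows("hello"))            # []
```

Only the candidate rows would then need exact cosine scoring. Very common tokens still produce large candidate sets, so this helps most once stop words have been removed.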

### 5. **No Personalization or Dynamic Learning**
The model cannot adapt to different users, preferences, or conversation styles. It also doesn’t improve over time unless retrained.


## Future Improvements

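To build a more intelligent and natural chatbot, future work should address the limitations listed above: tracking conversation context, matching on meaning rather than exact words, tolerating rephrasing and typos, speeding up retrieval, and adapting to individual users.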

This baseline model is a great starting point for understanding chatbot mechanics, but it's only the first step toward building a truly intelligent conversational agent.
