<a href="https://colab.research.google.com/github/AliciaFalconCaro/LLM_Chatbot_Movies/blob/main/DataAnalysisForMoviesChatBotLLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data downloaded from:
https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows/data?select=imdb_top_1000.csv


In [2]:
import pandas as pd
df = pd.read_csv("imdb_top_1000.csv")
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [3]:
# Let's first explore the data.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


The majority of the data information is non-numeric/text (categorical)
There is a total of 1000 entries.

In [4]:
#Let's check null values:
print("Missing values:", df.isnull().sum())

Missing values: Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64


The features 'certificate', 'meta_score' and 'Gross' contains a high number of null values. In this case, for simplicity, we drop all NaN values (missing values).

In [5]:
df_cleaned = df.dropna()
print("Missing values:", df_cleaned.isnull().sum())

Missing values: Poster_Link      0
Series_Title     0
Released_Year    0
Certificate      0
Runtime          0
Genre            0
IMDB_Rating      0
Overview         0
Meta_score       0
Director         0
Star1            0
Star2            0
Star3            0
Star4            0
No_of_Votes      0
Gross            0
dtype: int64


We use a freely available LLM for this example. In this case, we use DialoGPT

We need to create question-answers pairs based on the available dataset.

In [6]:
# Function to generate question-answer pairs
def generate_qa_pairs(row):
    qa_pairs = []

    # Example questions for each column
    if "Series_Title" in row and "Genre" in row:
        qa_pairs.append({
            "instruction": f"What is the genre of {row['Series_Title']}?",
            "response": row["Genre"]
        })
    if "Series_Title" in row and "IMDB_Rating" in row:
        qa_pairs.append({
            "instruction": f"What is the IMDb rating of {row['Series_Title']}?",
            "response": str(row["IMDB_Rating"])  # Convert numeric values to string
        })
    if "Series_Title" in row and "Gross" in row:
        qa_pairs.append({
            "instruction": f"How much did {row['Series_Title']} earn?",
            "response": str(row["Gross"]) if pd.notna(row["Gross"]) else "Unknown"
        })
    if "Released_Year" in row:
        qa_pairs.append({
            "instruction": f"In what year was {row['Series_Title']} released?",
            "response": str(row["Released_Year"])
        })
    if "Series_Title" in row and "Certificate" in row:
        qa_pairs.append({
            "instruction": f"What is the certificate for {row['Series_Title']}?",
            "response": row["Certificate"]
        })
    if "Series_Title" in row and "Runtime" in row:
        qa_pairs.append({
            "instruction": f"How long is {row['Series_Title']}?",
            "response": str(row["Runtime"])
        })
    if "Series_Title" in row and "Overview" in row:
        qa_pairs.append({
            "instruction": f"What is the overview of {row['Series_Title']}?",
            "response": row["Overview"]
        })
    if "Series_Title" in row and "Meta_score" in row:
        qa_pairs.append({
            "instruction": f"What is the meta score of {row['Series_Title']}?",
            "response": str(row["Meta_score"])
        })
    if "Series_Title" in row and "Director" in row:
        qa_pairs.append({
            "instruction": f"Who is the director of {row['Series_Title']}?",
            "response": row["Director"]
        })

    return qa_pairs

# Iterate over rows to create question-answer pairs
qa_data = []
for _, row in df_cleaned.iterrows():
    qa_data.extend(generate_qa_pairs(row))

# Display a few examples
print(qa_data[:5])


[{'instruction': 'What is the genre of The Shawshank Redemption?', 'response': 'Drama'}, {'instruction': 'What is the IMDb rating of The Shawshank Redemption?', 'response': '9.3'}, {'instruction': 'How much did The Shawshank Redemption earn?', 'response': '28,341,469'}, {'instruction': 'In what year was The Shawshank Redemption released?', 'response': '1994'}, {'instruction': 'What is the certificate for The Shawshank Redemption?', 'response': 'A'}]


We save the questions-answers pairs as json file to pass it to the model to train it.

In [7]:
import json

# Save question-answer pairs to a JSON file
output_path = "qa_data.json"
with open(output_path, "w") as f:
    json.dump(qa_data, f, indent=4)

print(f"Question-answer pairs saved to {output_path}")


Question-answer pairs saved to qa_data.json


We load the json file

In [8]:
# Load the JSON file containing question-answer pairs
input_path = "qa_data.json"  # Replace with your actual file path
with open(input_path, "r") as f:
    qa_data = json.load(f)

print(f"Loaded {len(qa_data)} question-answer pairs.")


Loaded 9000 question-answer pairs.


Some LLM requires specific fields in the training json file for the fine-tuning of the model.

In this case, we will use GPT2, which requires:


"instruction": qa["instruction"],


"context": "",  # Optionally, include a context if needed


"response": qa["response"]

In [9]:
# Modify/reformat data to include additional fields
formatted_data = []

for qa in qa_data:
    formatted_entry = {
        "instruction": qa["instruction"],  # Keep the original question
        "context": "",  # Optionally, include a context if needed
        "response": qa["response"]  # Keep the original answer
    }
    formatted_data.append(formatted_entry)

print(f"Formatted {len(formatted_data)} question-answer pairs.")


Formatted 9000 question-answer pairs.


In [10]:
#we save the reformated json file
output_path = "formatted_qa_data.json"  # Specify your desired output path
with open(output_path, "w") as f:
    json.dump(formatted_data, f, indent=4)

print(f"Formatted question-answer pairs saved to {output_path}")


Formatted question-answer pairs saved to formatted_qa_data.json


Given any of the two json files containing the question-answer pairs, we load the data as a huggingface dataset.

In [11]:
!pip install transformers datasets accelerate

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

In [33]:
from datasets import Dataset

# Load the JSON data
with open("formatted_qa_data.json", "r") as f:
    qa_data = json.load(f)

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({
    "text": [
        f"Question: {entry['instruction']} Context: {entry['context']} Answer: {entry['response']}"
        for entry in qa_data
    ]
})


We load GPT2 model and tokenizer

In [34]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load tokenizer and model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token

In [35]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply tokenization to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Split dataset into train and test if not already present
train_test_split = tokenized_datasets.train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Add labels to the tokenized dataset
def add_labels(examples):
    examples["labels"] = examples["input_ids"]  # labels should be same as input_ids
    return examples

# Apply label addition
train_dataset = train_dataset.map(add_labels, batched=True)
test_dataset = test_dataset.map(add_labels, batched=True)


Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8100 [00:00<?, ? examples/s]

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

We use Hugging Face’s Trainer API for training of the model

In [37]:
from transformers import Trainer, TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    logging_dir="./logs",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Set up Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


Now we fine-tune the model and save the trained model

To use the model, you need the API key. It can be obtained from: https://wandb.ai/

In [38]:
#This step could take some time
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.0296,0.028588


Epoch,Training Loss,Validation Loss
1,0.0296,0.028588


KeyboardInterrupt: 

In [None]:
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

#we save the tokenizer to ensure that the same tokens are used for the model next time. They can be imported/loaded as:
#model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned")
#tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned")



Now we test the fine-tuned model

In [None]:
from transformers import pipeline

# Load the fine-tuned model and tokenizer
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned")
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned")

# Create a text generation pipeline
qa_pipeline = pipeline("text-generation", model=fine_tuned_model, tokenizer=fine_tuned_tokenizer)

# Test with a new question
prompt = "Question: What is the genre of Inception? Context: Answer:"
result = qa_pipeline(prompt, max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])


We verify the correct output of the chatbot through a query to the dataset.

In [None]:
print("The correct Genre is:",df_cleaned[df_cleaned["Series_Title"] == "Inception"]["Genre"])

Now we create the interactive chatbot

In [None]:
#this is the chatbot with the fine-tuned model
def chatbot_response(input_text):
    prompt = f"Question: {input_text}? Context: Answer:"
    result = qa_pipeline(prompt, max_length=50, num_return_sequences=1)
    return result[0]["generated_text"]

#start chatbot:
while True:
    user_input = input("User: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Bot: Goodbye!")
        break
    response = chatbot_response(user_input)

In [17]:
#this is a chatbot with no fine-tuned model
import torch
def chatbot_response(input_text, chat_history_ids=None):
    # Encode user input
    input_ids = tokenizer.encode(input_text + tokenizer.eos_token, return_tensors="pt")
    attention_mask = torch.ones_like(input_ids)

    # Append to chat history
    if chat_history_ids is not None:
        input_ids = torch.cat([chat_history_ids, input_ids], dim=-1)
        attention_mask = torch.cat(
            [torch.ones_like(chat_history_ids), attention_mask], dim=-1
        )

    # Generate a response with attention mask
    chat_history_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=1000,
        pad_token_id=tokenizer.pad_token_id,
    )

    response = tokenizer.decode(
        chat_history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True
    )

    return response, chat_history_ids


# Start Chatbot
chat_history_ids = None
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        print("Bot: Goodbye!")
        break
    response, chat_history_ids = chatbot_response(user_input, chat_history_ids)
    print(f"Bot: {response}")


You: hello
Bot: Hello! :D
You: what can you help me with
Bot: I can't help you with anything.
You: quit
Bot: Goodbye!
