## Breaking Bad Data Processing with LLMs 🖊️ ⚙️ 📨

This notebook leverages Large Language Models (LLMs) - to fetch the dialogue (subtitles) of the TV series *Breaking Bad* from the Movie and TV-show wiki: *Fandom.com* and extract relationships between characters, locations, events & season number.

#### 1. Data Acquisition

- **Scraping**: Subtitles are scraped from the Breaking Bad Fandom Wiki for all seasons and saved as individual text files.
- **Cleaning**: The scraped subtitles are cleaned to remove timestamps and other unnecessary elements, leaving only the dialogue.

#### 2. Context and Prompt Creation

- **Wikipedia Summary**: The notebook utilizes the Wikipedia article "List of characters in the Breaking Bad franchise" to create a summarized context of key characters and their relationships using the LLM: `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`. This summary is used as background knowledge for the LLM processing of the subtitles (dialogue in the TV-show).
- **LLM Prompts**: System prompts are  defined for the LLM. These prompts guide its analysis, ensuring that the extracted information follows a predefined JSON schema for representing relationships.

#### 3. LLM Processing, Extraction & Output

- **Episode Analysis**: The LLM (`Qwen/Qwen2.5-72B-Instruct-Turbo`) iterates through each episode's subtitle file. The content of the subtitles, along with details like episode name and season number, are fed to the LLM as prompts.
- **Entity and Relationship Extraction**: The LLM analyzes the script and extracts entities (characters, locations, events). It then identifies relationships between these entities using a set of predefined relationship types (e.g., "friend of," "enemy of," "works with").
- **JSON Structuring**: The extracted information is structured into a JSON format, for easy storage and further analysis...

- **JSON Output**: All the LLM-processed episode data is saved into a single JSON file named "breaking_bad_analysisV2.json".
- **Summary**: A summary indicating the number of processed episodes is displayed to the user along with failed episodes (if it unlikely would occur).

### Install & Import Libraries 🎛️

In [1]:
# Install required packages from requirements.txt
!pip install -r https://raw.githubusercontent.com/Markushenriksson13/NLP-and-Network-Analysis_Exam_Submission/refs/heads/main/requirements.txt -q

# importing the clear_output function from IPython.display module to reduce noise in outputs...
from IPython.display import clear_output

# Datascrapping libs
import os
import requests
from bs4 import BeautifulSoup
import time

# Datahandling
import os
import re
import splitfile

# LLM Libs & Setup
from openai import OpenAI
import json
from pydantic import BaseModel, Field
from typing import List, Optional
import textwrap

# Wikipedia import
import wikipediaapi

# Network analysis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import sparse
import networkx as nx
import holoviews as hv
import hvplot.networkx as hvnx
from community import community_louvain
from holoviews import opts
import plotly.graph_objects as go
import random
from community import community_louvain

# Model prediction
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer
from sklearn.metrics import classification_report

# Gradio deployment
import gradio as gr


### API SETUP (User Together API-Key Input required)... 📝 📝 📝
For the LLM-processing you will need to input a Together API Key in the field "INSERT TOKEN" below
 * (https://api.together.ai/signin)

 Please note that you can use Google Colab SECRETS, if you have saved your Together API Key there - 
 * If yes? (Colab): 
    * 1. Remove the #'s below for Colab lib loading & TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')
    * 2. Remove the other line: TOGETHER_API_KEY = "INSERT TOKEN" or use #

In [None]:
# Setup OpenAI client with custom TogetherAPI key and base URL

# FOR COLAB YOU CAN USE GOOGLE COLAB SECRETS, if you have saved your Together API Key there - 
#  - If yes?: Remove the #'s below and remove the other line: TOGETHER_API_KEY = "INSERT TOKEN"

# from google.colab import userdata
# TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

TOGETHER_API_KEY = "INSERT TOKEN" # INSERT YOUR TOKEN

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=TOGETHER_API_KEY
)

### Setup + Data Extraction ⚙️

#### Fetching Breaking Bad Data from Fandom (using subtitles of each season/episode) 📖

In [13]:
# Base URL
base_url = "https://breakingbad.fandom.com/wiki/Category:Breaking_Bad_Subtitles"

def get_season_links(base_url, target_seasons):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    season_links = []

    for link in soup.select('a.category-page__member-link'):
        for season in target_seasons:
            if f"Season_{season}" in link['href']:
                season_links.append("https://breakingbad.fandom.com" + link['href'])
    return season_links

def get_episode_links(season_url):
    response = requests.get(season_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    episode_links = []
    for link in soup.select('a.category-page__member-link'):
        episode_links.append("https://breakingbad.fandom.com" + link['href'])
    return episode_links

def get_subtitles(episode_url):
    response = requests.get(episode_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    subtitle_pre = soup.find("pre")
    if subtitle_pre:
        subtitles = subtitle_pre.get_text(strip=True)
        return subtitles
    return ""

def save_subtitles(episode_name, subtitles, season):
    # Handle "5A" and "5B" cases
    season_folder = f"Season_{season}"
    os.makedirs(f"subtitles/{season_folder}", exist_ok=True)
    file_path = f"subtitles/{season_folder}/{season_folder} - {episode_name}.txt"
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(subtitles)

def scrape_and_save_subtitles():
    target_seasons = [1, 2, 3, 4, "5A", "5B"] # ALL SEASONS
    # REPLACE # IF ONLY SEASON 1 IS WANTED: target_seasons = [1] # ONLY SEASON 1?
    season_links = get_season_links(base_url, target_seasons)

    for season_url in season_links:
        # Extract season from URL
        season = None
        for s in target_seasons:
            if f"Season_{s}" in season_url:
                season = s
                break

        if season:
            episode_links = get_episode_links(season_url)
            for episode_url in episode_links:
                subtitles = get_subtitles(episode_url)
                episode_name = episode_url.split("/")[-1].replace("_", " ")
                save_subtitles(episode_name, subtitles, season)
                print(f"Saved subtitles for {episode_name} in Season {season}")
# Run the scraper and saver
scrape_and_save_subtitles()

clear_output()
print("Manuscript Saved!")

Manuscript Saved!


#### Data Cleaning 🧹 🧹 🧹

In [14]:
def clean_subtitle(input_text):
    # Fjern linjer med tidsstempler og numre
    lines = input_text.split('\n')
    cleaned_lines = []
    skip_next = False

    for line in lines:
        if skip_next:
            skip_next = False
            continue
        if re.match(r'^\d+$', line.strip()) or '-->' in line:
            skip_next = True
            continue
        if line.strip():
            cleaned_lines.append(line.strip())

    return ' '.join(cleaned_lines)

def process_directory(root_dir):
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith('.txt'):
                file_path = os.path.join(dirpath, filename)

                with open(file_path, 'r', encoding='utf-8') as file:
                    content = file.read()

                cleaned_content = clean_subtitle(content)

                with open(file_path, 'w', encoding='utf-8') as file:
                    file.write(cleaned_content)
# path to the folder containing our fetched subtitles
root_directory = 'subtitles/'

process_directory(root_directory)

### Definition of Extraction Schema 📦

In [15]:
# we initialize the wikipedia api
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',  # we set the language to english
    user_agent='BreakingBadNetwork/1.0'  # we insert a user-agent string for wiki
)

# we fetch a specific article
page = wiki_wiki.page('List of characters in the Breaking Bad franchise')

# we check if the article exists and print the entire content
if page.exists():
    article_content = page.text # saves the article as variable
    print("Title: ", page.title)
    print("Content: ", page.text[:100])  # we retrieve the entire text of the article
else:
    print("the article does not exist")

# save as a file
with open("wiki_breaking_bad_characters.txt", "w", encoding="utf-8") as file:
    file.write(page.text)

Title:  List of characters in the Breaking Bad franchise
Content:  Breaking Bad is a crime drama franchise created by American filmmaker Vince Gilligan. It started wit


### Context / Prompt creation 🛣️

 * We will need some context for the LLM when the manuscript is going to be processed. 
 * For that - we will use the Wikipedia article "List of characters in the Breaking Bad franchise" 
     * (https://en.wikipedia.org/wiki/List_of_characters_in_the_Breaking_Bad_franchise)
 * The purpose of it is to ensure that the LLM understands which characters is present during the processing of the manuscript
 * The result: Improvement of the network creation

#### Create LLM summary of characters to be used as part of the prompt during processing of the manuscript (subtitles) 🎛️

In [18]:
SUM_PROMPT = """
You are an expert analyst of fictional characters. Your task is to summarize key information about characters from the Breaking Bad universe. For each character provided, you should:

1. State their name
2. Describe their primary role in the story
3. Outline their key relationships to other characters

Your summary should be concise yet informative, focusing only on the most important aspects of each character. Avoid including any information not explicitly provided in the input. If you're unsure about any details, do not speculate.

Format your response as a bullet-point list, with each character as a main point and their details as sub-points. You need to focus on that you state the different variances of the characters names in the summary.

Example format:
• Character Name:
  - Role: [Brief description of their role]
  - Key Relationships: [List of important relationships]

Provide this summary based solely on the information given in the input, without adding any external knowledge about Star Wars.
"""

response = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', 
    messages=[
        {'role': 'system', 'content': SUM_PROMPT},
        {'role': 'user', 'content': f"Summarize these Breaking Bad characters:\n\n{article_content}"}
    ],
    temperature=0.7
)

article_sum = response.choices[0].message.content

In [21]:
article_sum[:1000]

"• Walter White (also known by his alias Heisenberg):\n  - Role: A high school chemistry teacher turned methamphetamine manufacturer and dealer.\n  - Key Relationships: Skyler White (wife), Walter Jr. (son), Jesse Pinkman (business partner), Hank Schrader (brother-in-law), Saul Goodman (lawyer), Mike Ehrmantraut (associate).\n\n• Skyler White:\n  - Role: Walter's wife, who becomes involved in his money laundering activities.\n  - Key Relationships: Walter White (husband), Walter Jr. (son), Hank Schrader (brother-in-law), Marie Schrader (sister), Saul Goodman (lawyer).\n\n• Jesse Pinkman:\n  - Role: A small-time methamphetamine user, manufacturer, and dealer who becomes Walter's business partner.\n  - Key Relationships: Walter White (business partner), Andrea Cantillo (girlfriend), Brock Cantillo (Andrea's son), Saul Goodman (lawyer), Mike Ehrmantraut (associate).\n\n• Hank Schrader:\n  - Role: A U.S. Drug Enforcement Administration (DEA) agent and Walter's brother-in-law.\n  - Key Rela

#### Manuscript (subtitles) 📝 --> LLM Processing ⚙️ --> JSON schema of characters, events, locations & seasons 📦

* Here we use the create the system prompt for the LLM - It contains infomation of what we want the LLM to look for
* Furthermore, we specify "Background Information" to be the earlier LLM-processed summary of the Wikipedia article as context for the TV-show

In [None]:
SYSTEM_PROMPT = f"""
You are an assistant specialized in analyzing and structuring information about TV series. Your task is to help build a network of relationships between various entities in a given TV series, based on the following summary:

Background Information:
{article_sum}

This series features a range of characters involved in complex relationships. Your primary goal is to analyze these connections and structure them into JSON format.

Your task includes:
1. Identifying relevant entities such as characters, locations, events, and seasons in the series.
2. Establishing meaningful relationships between these entities, noting when each relationship occurs (season).

Key Guidelines:
- Each entity should have a unique name and a defined type (e.g., 'character', 'location').
- Relationships must always specify the source entity, target entity, relationship type, and season.
- Use only predefined relationship types provided.

Additionally, you should:
- Be able to answer questions about the structure and relationships in the series.
- Offer suggestions for expanding or refining the network.
- Identify central characters, significant events, and key locations, using network connections as a basis for insight into the series' narrative structure and character development.

Explain your choices and reasoning as needed, ensuring that your analysis aids in understanding the series’ narrative structure over time.

Output JSON only.
"""

In [None]:
import json
import os
from typing import Dict, Any

# path to the subtitles directory
subtitles_dir = 'subtitles'

def extract_relationships(script_content: str, episode_name: str, season_number: int) -> Dict[str, Any]:
    prompt = f"""
    Analyze the following Breaking Bad episode script and identify entities and their relationships.
    Episode: {episode_name}
    Season: {season_number}

    Please output ONLY a valid JSON object following exactly this schema:
    {{
        "entities": [
            {{
                "name": "string",
                "type": "Character" | "Location" | "Event" | "Season"
            }}
        ],
        "relationships": [
            {{
                "source": "string",
                "relation": "friend of" | "enemy of" | "related to" | "married to" | "works with" | "lives in" | "visits" | "owns" | "participates in" | "witnesses" | "causes" | "appears in" | "is central to" | "introduces" | "concludes" | "develops" | "part of",
                "target": "string",
                "season": {season_number}
            }}
        ]
    }}

    Script content:
    {script_content[:1000]}...
    """

    try:
        response = client.chat.completions.create(
            model='Qwen/Qwen2.5-72B-Instruct-Turbo',
            messages=[
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': prompt}
            ],
            temperature=0.7
        )

        response_text = response.choices[0].message.content.strip()

        # get response and fix weird characters
        response_text = ''.join(char for char in response.choices[0].message.content if ord(char) >= 32 or char in '\n\r\t')


        # Ensure we get valid JSON
        try:
            data = json.loads(response_text)
            return data
        except json.JSONDecodeError:
            # Fallback: Try to extract JSON from response
            json_start = response_text.find('{')
            json_end = response_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                json_str = response_text[json_start:json_end]
                return json.loads(json_str)
            raise

    except Exception as e:
        print(f"Error processing episode {episode_name}: {str(e)}")
        return {"entities": [], "relationships": []}

def analyze_all_episodes(subtitles_dir: str) -> Dict[str, Any]:
    all_episode_data = {}

    for season_dir in sorted(os.listdir(subtitles_dir)):
        season_path = os.path.join(subtitles_dir, season_dir)
        if os.path.isdir(season_path):
            season_number = int(season_dir.split('_')[1]) if season_dir.split('_')[1].isdigit() else 0

            for episode_file in sorted(os.listdir(season_path)):
                if episode_file.endswith('.txt'):
                    episode_path = os.path.join(season_path, episode_file)
                    clean_episode_name = episode_file.replace('%27', "'").replace('%20', " ")

                    try:
                        with open(episode_path, 'r', encoding='utf-8') as file:
                            script_content = file.read()

                        episode_data = extract_relationships(script_content, clean_episode_name, season_number)
                        if episode_data["entities"] or episode_data["relationships"]:
                            all_episode_data[f"{season_dir} - {clean_episode_name}"] = episode_data

                    except Exception as e:
                        print(f"Error reading file {episode_path}: {str(e)}")
                        continue

    return all_episode_data

# Analyze episodes and save results as JSON
all_episode_data = analyze_all_episodes(subtitles_dir)

# Save to JSON file
with open('breaking_bad_analysisV2.json', 'w', encoding='utf-8') as f:
    json.dump(all_episode_data, f, indent=4, ensure_ascii=False)

# Optional: Print summary of results
print("\nAnalysis complete! Results saved to 'breaking_bad_analysisV2.json'")
print(f"Processed {len(all_episode_data)} episodes")

# Optional: Print sample of the data in JSON format
print("\nSample of the data:")
print(json.dumps(dict(list(all_episode_data.items())[:1]), indent=4))

Error processing episode Season_5B - Buried subtitles.txt: Invalid control character at: line 36 column 33 (char 763)

Analysis complete! Results saved to 'breaking_bad_analysis.json'
Processed 61 episodes

Sample of the data:
{
    "Season_1 - Season_1 - ...and the Bag's in the River subtitles.txt": {
        "entities": [
            {
                "name": "Walter White",
                "type": "Character"
            },
            {
                "name": "Classroom",
                "type": "Location"
            },
            {
                "name": "Chemistry Lesson",
                "type": "Event"
            },
            {
                "name": "Season 1",
                "type": "Season"
            }
        ],
        "relationships": [
            {
                "source": "Walter White",
                "relation": "teaches",
                "target": "Chemistry Lesson",
                "season": 1
            },
            {
                "source": "Wal