## Breaking Bad Data Processing with LLMs 🖊️ ⚙️ 📨

This notebook fetches the dialogue (subtitles) of the TV series *Breaking Bad* from the movie and TV wiki *Fandom.com* and uses Large Language Models (LLMs) to extract relationships between characters, locations, events, and seasons.

#### 1. Data Acquisition

- **Scraping**: Subtitles are scraped from the Breaking Bad Fandom Wiki for all seasons and saved as individual text files.
- **Cleaning**: The scraped subtitles are cleaned to remove timestamps and other unnecessary elements, leaving only the dialogue.

#### 2. Context and Prompt Creation

- **Wikipedia Summary**: The notebook utilizes the Wikipedia article "List of characters in the Breaking Bad franchise" to create a summarized context of key characters and their relationships using the LLM: `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`. This summary is used as background knowledge for the LLM processing of the subtitles (dialogue in the TV-show).
- **LLM Prompts**: System prompts are defined for the LLM. These prompts guide its analysis, ensuring that the extracted information follows a predefined JSON schema for representing relationships.

#### 3. LLM Processing, Extraction & Output

- **Episode Analysis**: The LLM (`Qwen/Qwen2.5-72B-Instruct-Turbo`) is called once for each episode's subtitle file. The subtitle content, along with details such as the episode name and season number, is passed to the LLM in the prompt.
- **Entity and Relationship Extraction**: The LLM analyzes the script and extracts entities (characters, locations, events). It then identifies relationships between these entities using a set of predefined relationship types (e.g., "friend of," "enemy of," "works with").
- **JSON Structuring**: The extracted information is structured as JSON for easy storage and further analysis.

- **JSON Output**: All the LLM-processed episode data is saved into a single JSON file named "breaking_bad_analysisV2.json".
- **Summary**: A summary showing the number of processed episodes is displayed, along with any episodes that failed (which should be rare).

### Install & Import Libraries 🎛️

In [1]:
# Install required packages from requirements.txt
!pip install -r https://raw.githubusercontent.com/Markushenriksson13/NLP-and-Network-Analysis_Exam_Submission/refs/heads/main/requirements.txt -q

# importing the clear_output function from IPython.display module to reduce noise in outputs...
from IPython.display import clear_output

# Data scraping libs
import os
import requests
from bs4 import BeautifulSoup
import time

# Data handling
import re

# LLM Libs & Setup
from openai import OpenAI
import json
from pydantic import BaseModel, Field
from typing import List, Optional

# Wikipedia import
import wikipediaapi


### API SETUP (User Together API-Key Input required)... 📝 📝 📝
For the LLM processing you need to enter a Together API key in place of "INSERT TOKEN" below.
 * (https://api.together.ai/signin)

 If you have saved your Together API key in Google Colab Secrets, you can load it from there instead:
    * 1. Uncomment the two lines below that import `userdata` and set `TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')`.
    * 2. Remove or comment out the line `TOGETHER_API_KEY = "INSERT TOKEN"`.
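
Outside Colab, another option is to read the key from an environment variable so that it never appears in the notebook itself. The following is a minimal sketch under stated assumptions: the helper name `load_together_api_key` is hypothetical, and it assumes you have exported a variable called `TOGETHER_API_KEY` in your shell before starting the notebook.

```python
import os


def load_together_api_key(env_var: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises a RuntimeError if the variable is missing, empty, or still
    holds the "INSERT TOKEN" placeholder used in this notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == "INSERT TOKEN":
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With this helper, the assignment in the next cell could become `TOGETHER_API_KEY = load_together_api_key()`.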

In [2]:
# Setup OpenAI client with custom TogetherAPI key and base URL

# FOR COLAB YOU CAN USE GOOGLE COLAB SECRETS, if you have saved your Together API Key there - 
#  - If yes?: Remove the #'s below and remove the other line: TOGETHER_API_KEY = "INSERT TOKEN"

# from google.colab import userdata
# TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

TOGETHER_API_KEY = "INSERT TOKEN" # INSERT YOUR TOKEN

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=TOGETHER_API_KEY
)

### Setup + Data Extraction ⚙️

#### Fetching Breaking Bad Data from Fandom (using subtitles of each season/episode) 📖

In [None]:
# url for breaking bad subs
base_url = "https://breakingbad.fandom.com/wiki/Category:Breaking_Bad_Subtitles"

# we define functions to extract the subtitles for each season/episode from fandom.com
# get_season_links and get_episode_links collect the category links from the parsed HTML
# get_subtitles then reads the subtitles from the <pre></pre> tag on each episode page
def get_season_links(base_url, target_seasons):
    response = requests.get(base_url)  # get the page
    soup = BeautifulSoup(response.text, 'html.parser')  # parse it
    season_links = []  # list for links

    for link in soup.select('a.category-page__member-link'):  # find links
        for season in target_seasons:  # check seasons
            if f"Season_{season}" in link['href']:  # if it's a season
                season_links.append("https://breakingbad.fandom.com" + link['href'])  # add link
    return season_links  # return all links

def get_episode_links(season_url):
    response = requests.get(season_url)  # get season page
    soup = BeautifulSoup(response.text, 'html.parser')  # parse it - for episode links
    episode_links = []  # list for episodes
    for link in soup.select('a.category-page__member-link'):  # find episode links
        episode_links.append("https://breakingbad.fandom.com" + link['href'])  # add to list
    return episode_links  # return episode links

def get_subtitles(episode_url):
    response = requests.get(episode_url)  # get episode page
    soup = BeautifulSoup(response.text, 'html.parser')  # parse it - to find the subtitles
    subtitle_pre = soup.find("pre")  # find subtitles
    if subtitle_pre:  # if found
        subtitles = subtitle_pre.get_text(strip=True)  # get text
        return subtitles  # return subtitles
    return ""  # return empty if not found

def save_subtitles(episode_name, subtitles, season):
    # season may be an int (1-4) or a string ("5A", "5B"); both work in the f-string
    season_folder = f"Season_{season}"  # folder name
    os.makedirs(f"subtitles/{season_folder}", exist_ok=True)  # make folder
    file_path = f"subtitles/{season_folder}/{season_folder} - {episode_name}.txt"  # file path
    with open(file_path, 'w', encoding='utf-8') as file:  # open file
        file.write(subtitles)  # write subtitles

def scrape_and_save_subtitles():
    target_seasons = [1, 2, 3, 4, "5A", "5B"]  # all seasons
    # if only season 1 wanted: target_seasons = [1] 
    season_links = get_season_links(base_url, target_seasons)  # get season links

    for season_url in season_links:  # for each season
        season = None  # reset season
        for s in target_seasons:  # check seasons
            if f"Season_{s}" in season_url:  # if found
                season = s  # set season
                break  # exit loop

        if season:  # if season is set
            episode_links = get_episode_links(season_url)  # get episodes
            for episode_url in episode_links:  # for each episode
                subtitles = get_subtitles(episode_url)  # get subs
                episode_name = episode_url.split("/")[-1].replace("_", " ")  # get name
                save_subtitles(episode_name, subtitles, season)  # save subs
                print(f"Saved subs for {episode_name} in Season {season}")  # print status

# run the scraper
scrape_and_save_subtitles()

clear_output()  # clear output to remove obsolete noise from output
print("Manuscript Saved!")  # done!
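
The scraper above issues one request per page with no retry, so a single network hiccup stops the run. The following is a minimal sketch of a retry wrapper, not part of the original notebook: the name `fetch_with_retry`, its parameters, and the growing pause between attempts are all assumptions made for illustration. The download step is passed in as a function so the helper itself has no network dependency.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     retries: int = 3, delay: float = 1.0) -> str:
    """Call fetch(url), retrying on any exception with a pause in between."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # e.g. connection errors or timeouts
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

In this notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=30).text`, which also adds a timeout to each request.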


#### Data Cleaning 🧹 🧹 🧹

 * The raw subtitle data needs to be cleaned before processing, since it contains a lot of noise in the form of timestamps and cue numbers.

Example:

```text
1
00:00:03,762 --> 00:00:05,264
In closing, I can tell you...
```

 * We need to remove unnecessary elements and irrelevant information to focus solely on the spoken content. This will help reduce noise for the LLM processing of the subtitles.
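
To make the idea concrete, here is a small standalone sketch applied to the example cue above. It is an illustration, not the notebook's own `clean_subtitle` function (defined in the next cell): the names `NOISE_LINE` and `strip_srt_noise` are hypothetical, and the sketch simply drops every line that is a bare number or contains `-->`. One known limitation is that a dialogue line consisting only of digits would also be dropped.

```python
import re

# Matches an SRT cue index (digits only) or a timestamp line such as
# "00:00:03,762 --> 00:00:05,264".
NOISE_LINE = re.compile(r"^\d+$|-->")


def strip_srt_noise(raw: str) -> str:
    """Keep only dialogue lines and join them with single spaces."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if line and not NOISE_LINE.search(line):
            kept.append(line)
    return " ".join(kept)


sample = """1
00:00:03,762 --> 00:00:05,264
In closing, I can tell you...
"""

print(strip_srt_noise(sample))  # In closing, I can tell you...
```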

In [None]:
def clean_subtitle(input_text):
    # remove lines with timestamps and numbers
    lines = input_text.split('\n')  # split text into lines
    cleaned_lines = []  # list for cleaned lines
    skip_next = False  # flag to skip next line

    for line in lines:  # go through each line
        if skip_next:  # if we need to skip
            skip_next = False  # reset flag
            continue  # move to next line
        if re.match(r'^\d+$', line.strip()) or '-->' in line:  # if it's a number or timestamp
            skip_next = True  # set flag to skip next line
            continue  # skip this line
        if line.strip():  # if line is not empty
            cleaned_lines.append(line.strip())  # add to cleaned lines

    return ' '.join(cleaned_lines)  # join cleaned lines into one string

def process_directory(root_dir):
    for dirpath, dirnames, filenames in os.walk(root_dir):  # walk through the folder
        for filename in filenames:  # for each file
            if filename.endswith('.txt'):  # check if it's a txt file
                file_path = os.path.join(dirpath, filename)  # get full path

                with open(file_path, 'r', encoding='utf-8') as file:  # open file
                    content = file.read()  # read the content

                cleaned_content = clean_subtitle(content)  # clean the content

                with open(file_path, 'w', encoding='utf-8') as file:  # open file to write
                    file.write(cleaned_content)  # write cleaned content

# path to the folder with subtitles
root_directory = 'subtitles/'  # folder path

process_directory(root_directory)  # start processing


### Wikipedia Context for LLM character background information 🌎

We use the Wikipedia article **"List of characters in the Breaking Bad franchise"** to provide the LLM with background knowledge about the key characters and their relationships. This helps us improve the network creation by enabling the LLM to understand which characters are present and how they are connected.

**Fetching the Wikipedia Article:**
   - We import the `wikipediaapi` library
   - We fetch the article content using `wiki_wiki.page('List of characters in the Breaking Bad franchise').text`
   - We save the content to a file named `wiki_breaking_bad_characters.txt` for later use 

In [None]:
# we initialize the wikipedia api
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',  # we set the language to english
    user_agent='BreakingBadNetwork/1.0'  # we insert a user-agent string for wiki
)

# we fetch a specific article
page = wiki_wiki.page('List of characters in the Breaking Bad franchise')

# we check that the article exists and print a short preview of its content
if page.exists():
    article_content = page.text  # keep the article text for the summary step
    print("Title: ", page.title)
    print("Content: ", page.text[:100])  # first 100 characters only

    # save as a file
    with open("wiki_breaking_bad_characters.txt", "w", encoding="utf-8") as file:
        file.write(article_content)
else:
    print("the article does not exist")

Title:  List of characters in the Breaking Bad franchise
Content:  Breaking Bad is a crime drama franchise created by American filmmaker Vince Gilligan. It started wit


### Context / Prompt creation 🛣️

* We need some context for the LLM when the manuscript is processed.
* To achieve this, we use the Wikipedia article **"List of characters in the Breaking Bad franchise"**

  * [Wiki: List of characters in the Breaking Bad franchise](https://en.wikipedia.org/wiki/List_of_characters_in_the_Breaking_Bad_franchise)

* We create a prompt (**SUM_PROMPT** below) to instruct the LLM to summarize the characters and their relationships.
* The LLM (**`meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`**) is called to process the article content and create a summary. We use a different LLM to summarize the wiki article because it accepts a longer input (more tokens) than the Qwen model used later to process the subtitles.
* The generated summary of the wiki-article is stored in the **article_sum** variable.
  
The purpose is to ensure that the LLM knows which characters appear when it processes the manuscript, which should improve the network built from each subtitle file.



#### Create LLM summary of characters to be used as part of the prompt during processing of the manuscript (subtitles) 🎛️

In [None]:
SUM_PROMPT = """
You are an expert analyst of fictional characters. Your task is to summarize key information about characters from the Breaking Bad universe. For each character provided, you should:

1. State their name
2. Describe their primary role in the story
3. Outline their key relationships to other characters

Your summary should be concise yet informative, focusing only on the most important aspects of each character. Avoid including any information not explicitly provided in the input. If you're unsure about any details, do not speculate.

Format your response as a bullet-point list, with each character as a main point and their details as sub-points. Make sure to state the different variants of each character's name (aliases, nicknames) in the summary.

Example format:
• Character Name:
  - Role: [Brief description of their role]
  - Key Relationships: [List of important relationships]

Provide this summary based solely on the information given in the input, without adding any external knowledge about Breaking Bad.
"""

response = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo', 
    messages=[
        {'role': 'system', 'content': SUM_PROMPT},
        {'role': 'user', 'content': f"Summarize these Breaking Bad characters:\n\n{article_content}"}
    ],
    temperature=0.7
)

article_sum = response.choices[0].message.content

In [None]:
article_sum[:1000]

"• Walter White (also known by his alias Heisenberg):\n  - Role: A high school chemistry teacher turned methamphetamine manufacturer and dealer.\n  - Key Relationships: Skyler White (wife), Walter Jr. (son), Jesse Pinkman (business partner), Hank Schrader (brother-in-law), Saul Goodman (lawyer), Mike Ehrmantraut (associate).\n\n• Skyler White:\n  - Role: Walter's wife, who becomes involved in his money laundering activities.\n  - Key Relationships: Walter White (husband), Walter Jr. (son), Hank Schrader (brother-in-law), Marie Schrader (sister), Saul Goodman (lawyer).\n\n• Jesse Pinkman:\n  - Role: A small-time methamphetamine user, manufacturer, and dealer who becomes Walter's business partner.\n  - Key Relationships: Walter White (business partner), Andrea Cantillo (girlfriend), Brock Cantillo (Andrea's son), Saul Goodman (lawyer), Mike Ehrmantraut (associate).\n\n• Hank Schrader:\n  - Role: A U.S. Drug Enforcement Administration (DEA) agent and Walter's brother-in-law.\n  - Key Rela

#### Manuscript (subtitles) 📝 --> LLM Processing ⚙️ --> JSON schema of characters, events, locations & seasons 📦
This section is the core of the analysis: it transforms the unstructured Breaking Bad subtitles into structured data.

##### Manuscript (Subtitles) Preparation 📝

* We start by loading the cleaned subtitle files from the **'subtitles'** directory, one by one
* Each file represents the dialogue from a specific episode of Breaking Bad

##### LLM Processing ⚙️

* **System Prompt**: We provide a detailed prompt (**SYSTEM_PROMPT**) to the LLM (**`Qwen/Qwen2.5-72B-Instruct-Turbo`**). This prompt includes instructions to:
  * Analyze the episode script.
  * Identify characters, locations, events, and the season.
  * Establish relationships between these entities using predefined relationship types.
  * Structure the extracted information into a JSON format following a specific schema.

* **Background Context**: The **SYSTEM_PROMPT** also includes the character summary we derived from Wikipedia (**article_sum**) to give the LLM some context.

* **Episode Analysis**: For each episode, we feed the script content, episode name, and season number to the LLM.

* **Extraction**: The LLM processes the script and extracts entities and their relationships based on the instructions in the prompt and the context we provided. We also account for errors that may occur during the LLM-processing.
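
One way to handle malformed LLM output is to isolate the JSON object in the response before parsing it. The following is a minimal sketch, not the notebook's exact implementation: the name `parse_llm_json` is hypothetical, and passing `strict=False` to `json.loads` (a standard-library option that permits control characters such as raw newlines inside strings) is an addition that the notebook's own code does not use.

```python
import json


def parse_llm_json(response_text: str) -> dict:
    """Parse the first-to-last brace span of an LLM response as JSON.

    Text around the object (for example markdown code fences) is ignored.
    strict=False lets json.loads accept control characters inside strings.
    """
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    if start < 0 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(response_text[start:end], strict=False)
```

A caller can catch `ValueError` (which also covers `json.JSONDecodeError`) and record the episode as failed instead of stopping the whole run.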

##### JSON Schema Output 📦

* **Structuring**: We organize the extracted information into a JSON format according to the predefined schema.
  * **entities**: This contains a list of identified characters, locations, events, and seasons, each with a name and type.
  * **relationships**: This includes a list of relationships between entities, detailing the source, relation, target, and season.

* **Saving**: Finally, we save all the LLM-processed episode data into a single JSON file named **"breaking_bad_analysisV2.json"**.
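
Because the schema is only enforced through the prompt, the LLM can still return values outside it; the sample output at the end of this notebook, for instance, contains the relation "teaches", which is not one of the predefined types. The following is a minimal sketch of a post-processing filter, not part of the original notebook: the names `ALLOWED_RELATIONS`, `ALLOWED_ENTITY_TYPES`, and `filter_episode_data` are hypothetical, and dropping non-conforming items (rather than, say, mapping them to a nearby type) is an assumed design choice.

```python
# Relation and entity types copied from the schema in the extraction prompt.
ALLOWED_RELATIONS = {
    "friend of", "enemy of", "related to", "married to", "works with",
    "lives in", "visits", "owns", "participates in", "witnesses", "causes",
    "appears in", "is central to", "introduces", "concludes", "develops",
    "part of",
}
ALLOWED_ENTITY_TYPES = {"Character", "Location", "Event", "Season"}


def filter_episode_data(episode_data: dict) -> dict:
    """Return a copy that keeps only schema-conforming entities and relationships."""
    entities = [
        e for e in episode_data.get("entities", [])
        if isinstance(e, dict)
        and e.get("name")
        and e.get("type") in ALLOWED_ENTITY_TYPES
    ]
    relationships = [
        r for r in episode_data.get("relationships", [])
        if isinstance(r, dict)
        and r.get("source")
        and r.get("target")
        and r.get("relation") in ALLOWED_RELATIONS
    ]
    return {"entities": entities, "relationships": relationships}


example = {
    "entities": [{"name": "Walter White", "type": "Character"}],
    "relationships": [
        {"source": "Walter White", "relation": "teaches",
         "target": "Chemistry Lesson", "season": 1},
    ],
}
print(filter_episode_data(example))
```

In this example the entity is kept, while the "teaches" relationship is removed because it is not in the predefined list.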


##### System Prompt 🤖 

In [None]:
SYSTEM_PROMPT = f"""
You are an assistant specialized in analyzing and structuring information about TV series. Your task is to help build a network of relationships between various entities in a given TV series, based on the following summary:

Background Information:
{article_sum}

This series features a range of characters involved in complex relationships. Your primary goal is to analyze these connections and structure them into JSON format.

Your task includes:
1. Identifying relevant entities such as characters, locations, events, and seasons in the series.
2. Establishing meaningful relationships between these entities, noting when each relationship occurs (season).

Key Guidelines:
- Each entity should have a unique name and a defined type (e.g., 'character', 'location').
- Relationships must always specify the source entity, target entity, relationship type, and season.
- Use only predefined relationship types provided.

Additionally, you should:
- Be able to answer questions about the structure and relationships in the series.
- Offer suggestions for expanding or refining the network.
- Identify central characters, significant events, and key locations, using network connections as a basis for insight into the series' narrative structure and character development.

Explain your choices and reasoning as needed, ensuring that your analysis aids in understanding the series’ narrative structure over time.

Output JSON only.
"""

##### Definition of JSON Schema & LLM-processing of subtitles 🎛️ ⚙️ 📦

In [None]:
import json
import os
from typing import Dict, Any

# path to the subtitles folder
subtitles_dir = 'subtitles'  # where the subs are

def extract_relationships(script_content: str, episode_name: str, season_number: int) -> Dict[str, Any]:
    prompt = f"""
    Analyze the Breaking Bad script and find entities and their relationships.
    Episode: {episode_name}
    Season: {season_number}

    Output ONLY a valid JSON object like this:
    {{
        "entities": [
            {{
                "name": "string",
                "type": "Character" | "Location" | "Event" | "Season"
            }}
        ],
        "relationships": [
            {{
                "source": "string",
                "relation": "friend of" | "enemy of" | "related to" | "married to" | "works with" | "lives in" | "visits" | "owns" | "participates in" | "witnesses" | "causes" | "appears in" | "is central to" | "introduces" | "concludes" | "develops" | "part of",
                "target": "string",
                "season": {season_number}
            }}
        ]
    }}

    Script content:
    {script_content[:1000]}...
    """

    try:
        response = client.chat.completions.create(
            model='Qwen/Qwen2.5-72B-Instruct-Turbo',
            messages=[
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': prompt}
            ],
            temperature=0.7
        )

        response_text = response.choices[0].message.content.strip()

        # clean up weird characters
        response_text = ''.join(char for char in response_text if ord(char) >= 32 or char in '\n\r\t')

        # make sure we get valid JSON
        try:
            data = json.loads(response_text)
            return data
        except json.JSONDecodeError:
            # try to fix JSON if it fails
            json_start = response_text.find('{')
            json_end = response_text.rfind('}') + 1
            if json_start >= 0 and json_end > json_start:
                json_str = response_text[json_start:json_end]
                return json.loads(json_str)
            raise

    except Exception as e:
        print(f"Error processing episode {episode_name}: {str(e)}")
        return {"entities": [], "relationships": []}

def analyze_all_episodes(subtitles_dir: str) -> Dict[str, Any]:
    all_episode_data = {}  # store all data

    for season_dir in sorted(os.listdir(subtitles_dir)):  # go through each season
        season_path = os.path.join(subtitles_dir, season_dir)
        if os.path.isdir(season_path):
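            # keep only the digits so that "Season_5A" and "Season_5B" map to 5 (0 if none found)
            season_digits = ''.join(ch for ch in season_dir.split('_')[-1] if ch.isdigit())
            season_number = int(season_digits) if season_digits else 0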

            for episode_file in sorted(os.listdir(season_path)):  # check each episode
                if episode_file.endswith('.txt'):
                    episode_path = os.path.join(season_path, episode_file)
                    clean_episode_name = episode_file.replace('%27', "'").replace('%20', " ")

                    try:
                        with open(episode_path, 'r', encoding='utf-8') as file:
                            script_content = file.read()  # read the script

                        episode_data = extract_relationships(script_content, clean_episode_name, season_number)
                        if episode_data["entities"] or episode_data["relationships"]:
                            all_episode_data[f"{season_dir} - {clean_episode_name}"] = episode_data

                    except Exception as e:
                        print(f"Error reading file {episode_path}: {str(e)}")
                        continue

    return all_episode_data

# finally, analyze all episodes
all_episode_data = analyze_all_episodes(subtitles_dir)

# then save the results to a JSON file
with open('breaking_bad_analysisV2.json', 'w', encoding='utf-8') as f:
    json.dump(all_episode_data, f, indent=4, ensure_ascii=False)

#  printing summary of results
print("\nAnalysis complete! Results saved to 'breaking_bad_analysisV2.json'")
print(f"Processed {len(all_episode_data)} episodes")

# also printing sample of the data in the JSON format we want to work with
print("\nSample of the data:")
print(json.dumps(dict(list(all_episode_data.items())[:1]), indent=4))


Error processing episode Season_5B - Buried subtitles.txt: Invalid control character at: line 36 column 33 (char 763)

Analysis complete! Results saved to 'breaking_bad_analysis.json'
Processed 61 episodes

Sample of the data:
{
    "Season_1 - Season_1 - ...and the Bag's in the River subtitles.txt": {
        "entities": [
            {
                "name": "Walter White",
                "type": "Character"
            },
            {
                "name": "Classroom",
                "type": "Location"
            },
            {
                "name": "Chemistry Lesson",
                "type": "Event"
            },
            {
                "name": "Season 1",
                "type": "Season"
            }
        ],
        "relationships": [
            {
                "source": "Walter White",
                "relation": "teaches",
                "target": "Chemistry Lesson",
                "season": 1
            },
            {
                "source": "Wal