# Breaking Bad Relationship Extraction with SetFit 🧑‍🔬

In this notebook, we train a SetFit model to classify the relationships between characters in the TV series *Breaking Bad*.

### 📂 Data and Setup
- We start by loading the LLM-preprocessed JSON data stored in `breaking_bad_analysisV2.json`, which was generated by the notebook `M2_LLM_Data_Fetch_and_Processing_(JSON_Creation).ipynb`.
    * By default, the file is downloaded from GitHub; an offline loading option is also available.
- Our focus is on training and evaluating a SetFit model to classify relationships between characters.

### 🧠 Model Selection and Fine-Tuning
- **Base Model**: We use the `sentence-transformers/paraphrase-mpnet-base-v2` model.
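- **Fine-Tuning**: The model is fine-tuned on our dataset with a batch size of 16, 20 pair-generation iterations, and 1 epoch.
- **Data Split**: We allocate the first 80% of the samples for training and the remaining 20% for evaluation (in file order, without shuffling).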
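
The 80/20 split can be sketched in plain Python. This is a minimal illustration rather than the notebook's exact code: the notebook slices the DataFrame in order, whereas this sketch shuffles first with a fixed seed (an assumption added here) so that both parts draw from all episodes. The function name `split_80_20` and the sample rows are hypothetical.

```python
import random

def split_80_20(rows, train_fraction=0.8, seed=42):
    """Return (train, test) lists after a seeded shuffle."""
    shuffled = list(rows)                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    cut = int(len(shuffled) * train_fraction)  # same cut rule as the notebook
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (text, label) rows in the "source - target" format
rows = [(f"Character {i} - Character {i + 1}", "ally") for i in range(10)]
train, test = split_80_20(rows)
print(len(train), len(test))
```

Whether shuffling is appropriate depends on the goal: keeping file order tests the model on later episodes, while shuffling gives a more even label mix.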

### ⚡ Efficient Execution on Colab
- Use a **Colab GPU** (T4) to speed up training; the estimated runtime is ~5-6 minutes.

### 💾 Model Saving and Reusability
- The trained model is saved in the `saved_model` directory for future use.
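
Reloading the saved model later might look like the sketch below. It assumes the training cell has already written the `saved_model` directory; the helper name `load_saved_model` and the example input string are hypothetical, and the input only follows the `source - target` text format used for training.

```python
def load_saved_model(path="saved_model"):
    """Reload a SetFit model previously written with save_pretrained()."""
    # Imported lazily so the helper can be defined before setfit is installed.
    from setfit import SetFitModel
    return SetFitModel.from_pretrained(path)

# Usage (requires the `saved_model` directory to exist):
# model = load_saved_model()
# print(model.predict(["Walter White - Jesse Pinkman"]))
```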

### ✅ The model is used in a Gradio interface in the companion notebook:
#### ➡️ ➡️ ➡️ `M2_Main_Network_Analysis_and_Text_Classification.ipynb`


### Install & Import Libraries needed for model training 🎛️

In [None]:
# Install required packages from requirements.txt
!pip install -r https://raw.githubusercontent.com/Markushenriksson13/NLP-and-Network-Analysis_Exam_Submission/refs/heads/main/requirements.txt -q

# Import libraries
import json
import os

import pandas as pd
import requests
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer
from sklearn.metrics import classification_report

# Only needed for the optional GitHub split step at the end of the notebook
from filesplit.split import Split

### SetFit model for training 🧮

In [2]:
# ONLINE: download the JSON file from the GitHub repository
url = 'https://raw.githubusercontent.com/Markushenriksson13/NLP-and-Network-Analysis_Exam_Submission/main/breaking_bad_analysisV2.json'

# Download the file and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()  # parse the JSON body into a Python dict

In [3]:
# OFFLINE: load the JSON from a local file instead (uncomment to use)
#with open("breaking_bad_analysisV2.json", 'r', encoding='utf-8') as file:
#    data = json.load(file)
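
The training cell reads each episode's `relationships` list and the `source`, `target`, and `relation` keys of every entry. The sketch below shows that flattening step on a small sample. The episode keys and values are invented for illustration, and `flatten_relationships` is a hypothetical helper name; only the three keys and the `source - target` text format come from the notebook.

```python
# Hypothetical sample mirroring the keys the training cell reads
sample = {
    "episode_1": {
        "relationships": [
            {"source": "Walter", "target": "Jesse", "relation": "partner"},
            {"source": "Walter", "target": "Skyler", "relation": "spouse"},
        ]
    },
    "episode_2": {},  # an episode without relationships is skipped
}

def flatten_relationships(data):
    """Turn the nested JSON into parallel lists of texts and labels."""
    texts, labels = [], []
    for episode in data.values():
        for rel in episode.get("relationships", []):
            texts.append(f"{rel['source']} - {rel['target']}")
            labels.append(rel["relation"])
    return texts, labels

texts, labels = flatten_relationships(sample)
print(texts)
print(labels)
```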

In [None]:
def load_and_prepare_data():
    """Flatten the already-loaded JSON (global `data`) into text/label pairs and split them 80/20."""

    relationships = []
    labels = []

    for episode in data.values():
        for rel in episode.get('relationships', []):
            relationships.append(f"{rel['source']} - {rel['target']}")
            labels.append(rel['relation'])

    # Split into train/test (80/20) in file order, without shuffling
    df = pd.DataFrame({'text': relationships, 'label': labels})
    train_size = int(len(df) * 0.8)

    train_data = Dataset.from_pandas(df[:train_size])
    test_data = Dataset.from_pandas(df[train_size:])

    return train_data, test_data

def train_and_evaluate():
    # load and prepare data
    train_dataset, test_dataset = load_and_prepare_data()

    print(f"Training samples: {len(train_dataset)}")
    print(f"Testing samples: {len(test_dataset)}")

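    # Initialize and train the model
    # Note: SetFitTrainer is the pre-1.0 SetFit API; newer releases favor Trainer with TrainingArguments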
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_dataset,
        batch_size=16,
        num_iterations=20,
        num_epochs=1
    )

    trainer.train()

    # evaluate
    predictions = model.predict(test_dataset['text'])
    print("\nClassification Report:")
    print(classification_report(test_dataset['label'], predictions))

    # Example predictions
    print("\nExample Predictions:")
    for text, true_label, pred_label in zip(
        test_dataset['text'][:3],
        test_dataset['label'][:3],
        predictions[:3]
    ):
        print(f"\nText: {text}")
        print(f"True: {true_label}")
        print(f"Predicted: {pred_label}")

    # save Model
    model.save_pretrained("saved_model")

    return model

# run train & evaluation
model = train_and_evaluate()
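
Because the split is taken in file order, some relation labels may appear only in the evaluation part, and the model cannot predict a label it never saw during training. A quick check like the following can help when reading the classification report. It is a sketch with made-up labels, and `unseen_labels` is a hypothetical helper name.

```python
from collections import Counter

def unseen_labels(train_labels, test_labels):
    """Return test labels (with counts) that never occur in training."""
    seen = set(train_labels)
    return Counter(label for label in test_labels if label not in seen)

# Made-up labels for illustration
train_labels = ["ally", "enemy", "ally", "family"]
test_labels = ["ally", "rival", "rival", "family"]
print(unseen_labels(train_labels, test_labels))
```

With the notebook's data, the two label lists could be taken from the datasets returned by `load_and_prepare_data()`, for example `train_dataset['label']` and `test_dataset['label']`.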

##### FOR GITHUB PUSH: GitHub rejects files larger than 100 MB, so the model files must be split before pushing

In [None]:
# FOR GITHUB push
#from filesplit.split import Split

# filepath, output directory & chunk size
#input_file = 'saved_model/model.safetensors'
#output_dir = 'saved_model'
#chunk_size = 100 * 1024 * 1024  # 100 MB

# split model
#splitter = Split(input_file, output_dir)
#splitter.bysize(chunk_size)
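
Split files have to be joined again before the model can be loaded. The sketch below shows the split-and-merge round trip using only the standard library. It is not the `filesplit` API, and its chunk naming (`.part000`, `.part001`, ...) is an assumption made for this example, so it will not match the files that `filesplit` writes.

```python
import tempfile
from pathlib import Path

def split_file(path, chunk_size):
    """Write path.part000, path.part001, ... and return the chunk paths."""
    src = Path(path)
    parts = []
    with src.open("rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            part = src.with_name(f"{src.name}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts

def merge_files(parts, target):
    """Concatenate the chunks, in the given order, into target."""
    with Path(target).open("wb") as out:
        for part in parts:
            out.write(Path(part).read_bytes())

# Round trip on a small temporary file standing in for the model weights
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "model.bin"
    original.write_bytes(bytes(range(256)) * 10)  # 2560 bytes
    parts = split_file(original, chunk_size=1000)
    restored = Path(tmp) / "restored.bin"
    merge_files(parts, restored)
    num_parts = len(parts)
    round_trip_ok = restored.read_bytes() == original.read_bytes()

print(num_parts, round_trip_ok)
```

Each chunk is read fully into memory, so a chunk size near 100 MB needs roughly that much free RAM.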
