# Rewriting Movie Scripts
This notebook walks you through setting up an ML workflow that provides an interface for labeling and interacting with movie scripts. We'll connect to Label Studio, create a project with a prompt-generation workflow, prepare the data, and ingest it into the project.

## Setup
Install the Label Studio SDK, which we use to set up the project.

In [None]:
!pip install label-studio-sdk

Import the Label Studio SDK and set the [API key](https://labelstud.io/guide/api.html) and URL of your Label Studio instance. 

In [2]:
# Import the SDK and the client module
from label_studio_sdk import Client

# Define the URL where Label Studio is accessible and the API key for your user account
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = 'b7fc43a9abe38ddbebca580cd1a6fd03b3778e8a'

# Connect to the Label Studio API and check the connection
ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()


{'status': 'UP'}
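
A key typed directly into a cell is easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the settings from environment variables. The variable names `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` and the function `load_label_studio_settings` are assumptions made for this example, not something the notebook defines.

```python
import os

def load_label_studio_settings(env=os.environ):
    """Read connection settings from environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    url = env.get('LABEL_STUDIO_URL', 'http://localhost:8080')
    api_key = env.get('LABEL_STUDIO_API_KEY')
    if not api_key:
        raise RuntimeError('Set LABEL_STUDIO_API_KEY before connecting')
    return url, api_key

# Usage (commented out so the sketch runs without a configured environment):
# url, api_key = load_label_studio_settings()
# ls = Client(url=url, api_key=api_key)
```

The `env` parameter defaults to the real environment but accepts any mapping, which makes the helper easy to try out with a plain dictionary.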

# Movie Scripts Project Setup
The following cells set up a project to display samples of movie dialogues. 

The labeling configuration also includes a `prompt` text area that lets the project interact with an LLM through the [Label Studio ML Backend - LLM Interactive](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/llm_interactive) example. This gives us a prompt area in the Labeling Interface for sending instructions to the LLM, alongside a `response` text area for its answer.

In [20]:
project = ls.start_project(
    title='Movie Dialogues',
    label_config='''
<View>
  <Style>
    .lsf-main-content.lsf-requesting .prompt::before { content: ' loading...'; color: #808080; }
  </Style>
  <Paragraphs name="chat" value="$dialogue" layout="dialogue" />
  <Header value="User prompt:" />
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false" />
  </View>
  <Header value="Bot answer:"/>
  <TextArea name="response" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false" />
</View>
    '''
)
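
A malformed labeling configuration is easier to debug locally than from an API error. The sketch below uses only the standard library to confirm that a copy of the configuration parses as XML and that each `TextArea` points at an existing `Paragraphs` tag. `summarize_config` is a hypothetical helper, the `Style` block is left out for brevity, and this check does not replace Label Studio's own validation.

```python
import xml.etree.ElementTree as ET

# Copy of the labeling configuration above, without the <Style> block
LABEL_CONFIG = '''
<View>
  <Paragraphs name="chat" value="$dialogue" layout="dialogue" />
  <Header value="User prompt:" />
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false" />
  </View>
  <Header value="Bot answer:"/>
  <TextArea name="response" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false" />
</View>
'''

def summarize_config(config):
    """Parse the config as XML and report object names and control targets."""
    root = ET.fromstring(config.strip())  # raises ParseError if not well-formed
    objects = {el.get('name') for el in root.iter('Paragraphs')}
    controls = {el.get('name'): el.get('toName') for el in root.iter('TextArea')}
    # A control is dangling when its toName matches no object tag
    dangling = {name: target for name, target in controls.items()
                if target not in objects}
    return objects, controls, dangling

objects, controls, dangling = summarize_config(LABEL_CONFIG)
print('objects:', objects)
print('controls:', controls)
print('dangling:', dangling)
```

An empty `dangling` dictionary means both text areas are attached to the `chat` paragraphs.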

# Cornell Movie Dialogues Dataset
The following cells download and prepare our dataset. We will use the [Cornell Movie-Dialogs Corpus](http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip). The transformation organizes each conversation as a list of lines, each with an author and its text, so that the project can display the dialogue and the prompt-generation workflow can write the rewritten version to a text area.

In [21]:
import requests
import os
import zipfile
import re
import json

def download_and_unzip(url, extract_to='.'):
    # Download the archive and fail early on an HTTP error
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    zip_file_path = os.path.join(extract_to, 'cornell_movie_dialogs_corpus.zip')
    with open(zip_file_path, 'wb') as zip_file:
        zip_file.write(r.content)
    # Extract the archive into the target directory
    with zipfile.ZipFile(zip_file_path, 'r') as zip_file:
        zip_file.extractall(extract_to)

In [22]:
url = 'http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip'
download_and_unzip(url)
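
Re-running the notebook downloads the archive again each time. One possible guard, sketched below, checks whether the two files the notebook parses are already on disk. `corpus_is_extracted` is a hypothetical helper, and the folder name is the one used by the file paths later in this notebook.

```python
import os

# Folder name used by the file paths in this notebook
CORPUS_DIR = 'cornell movie-dialogs corpus'

def corpus_is_extracted(extract_to='.',
                        required=('movie_lines.txt', 'movie_conversations.txt')):
    """Return True if the extracted corpus folder already holds the files we parse."""
    corpus_path = os.path.join(extract_to, CORPUS_DIR)
    return all(os.path.isfile(os.path.join(corpus_path, name)) for name in required)

# Usage (commented out so the sketch runs without network access):
# if not corpus_is_extracted():
#     download_and_unzip(url)
```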

In [23]:
def read_file(file_path):
    with open(file_path, 'r', encoding='ISO-8859-1') as file:
        content = file.readlines()
    return content

# Parse movie lines
def parse_movie_lines(lines_file):
    lines_content = read_file(lines_file)
    line_dict = {}
    for line in lines_content:
        parts = line.strip().split(" +++$+++ ")
        if len(parts) == 5:
            line_id, character_id, movie_id, character_name, text = parts
            line_dict[line_id] = {"author": character_name, "text": text}
    return line_dict


# Parse conversations and create dialogues
def create_conversations(conversations_file, lines):
    conversations_content = read_file(conversations_file)
    movie_conversations = {}

    for conversation in conversations_content:
        parts = conversation.strip().split(" +++$+++ ")
        if len(parts) == 4:
            character_id1, character_id2, movie_id, line_ids_str = parts
            line_ids = json.loads(line_ids_str.replace("'", '"'))
            dialogue = [lines[line_id] for line_id in line_ids if line_id in lines]

            # Add the dialogue to the corresponding movie
            if movie_id not in movie_conversations:
                movie_conversations[movie_id] = []
            movie_conversations[movie_id].append({"dialogue": dialogue})
    return movie_conversations
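
To make the transformation concrete, here is a small self-contained sketch that applies the same two steps to in-memory records. The sample records are made up for illustration and only follow the corpus field layout (fields separated by ` +++$+++ `); they are not taken from the dataset.

```python
import json

SEP = ' +++$+++ '

# Made-up records that follow the corpus field layout
sample_lines = [
    'L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Did you read the script?',
    'L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Twice.',
]
sample_conversation = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']"

# Step 1: index each line by its ID
lines = {}
for raw in sample_lines:
    line_id, character_id, movie_id, character_name, text = raw.strip().split(SEP)
    lines[line_id] = {'author': character_name, 'text': text}

# Step 2: resolve the conversation's line IDs into a dialogue
character_id1, character_id2, movie_id, line_ids_str = sample_conversation.strip().split(SEP)
line_ids = json.loads(line_ids_str.replace("'", '"'))
task = {'dialogue': [lines[line_id] for line_id in line_ids]}

print(json.dumps(task, indent=2))
```

Each resulting task has a single `dialogue` key, which matches the `$dialogue` variable referenced by the `Paragraphs` tag in the labeling configuration.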




In [24]:
characters_file_path = './cornell movie-dialogs corpus/movie_characters_metadata.txt'
movie_lines_file_path = './cornell movie-dialogs corpus/movie_lines.txt'
conversations_file_path = './cornell movie-dialogs corpus/movie_conversations.txt'

# Parsing the movie lines
parsed_lines = parse_movie_lines(movie_lines_file_path)

# Creating the dialogues
movie_conversations = create_conversations(conversations_file_path, parsed_lines)

In [None]:
movie_conversations['m0']

# Import data into Label Studio
Now that the dataset is prepared, we can import it into Label Studio. For simplicity, we import only the dialogues from the first movie (`m0`).

In [None]:
movie_conversations

In [None]:
# Import dialogues for the first movie
project.import_tasks(movie_conversations['m0']) 

# Import all movie dialogues 
# for movie, conversations in movie_conversations.items(): 
#     project.import_tasks(conversations)
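
Importing every movie sends far more tasks than the single-movie import. If large requests turn out to be slow or fail, one option is to import in smaller batches. This is a minimal sketch: `chunked` is a hypothetical helper and the batch size of 100 is an arbitrary assumption, not a Label Studio requirement.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in tasks with the same shape as the parsed dialogues
tasks = [{'dialogue': [{'author': 'A', 'text': f'line {i}'}]} for i in range(5)]
batches = list(chunked(tasks, 2))
print([len(batch) for batch in batches])

# Usage against a live project (commented out):
# for batch in chunked(movie_conversations['m0'], 100):
#     project.import_tasks(batch)
```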