## Data Preprocessing
This notebook preprocesses the raw text data (transcripts) and saves each source as a plain-text file.

In [34]:
import pandas as pd
from random import randint as rint
import re
import plotly.express as px
from statistics import mean
import os
import json

def display_doc(transcript:str, words_per_line=20, lines_to_display=10, random_start=True):
    ''' Accepts a long string and displays an excerpt in readable form in a Jupyter notebook.'''
    t_arr = transcript.split()
    words_to_display = words_per_line*lines_to_display
    # Start at a random word only if the document is longer than the excerpt,
    # so that the range passed to rint is never empty
    if random_start and len(t_arr) > words_to_display:
        random_start_index = rint(0, len(t_arr)-words_to_display)
        t_arr = t_arr[random_start_index:]
    # Print the excerpt in chunks of words_per_line words
    excerpt = t_arr[:words_to_display]
    for i in range(0, len(excerpt), words_per_line):
        print(' '.join(excerpt[i:i+words_per_line]))

def save_arr_str(path:str, arr_str:list):
    ''' Saves an array of strings into a .txt file '''
    with open(path, "w", encoding="utf-8") as file:
        # Write each string to the file, separated by a newline
        for s in arr_str:
            file.write(s + "\n")
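
Because `save_arr_str` writes one string per line, a document that itself contains newlines would spread over several lines of the output file. The following is a minimal sketch of a round trip, assuming that one document per line is the intended layout. `flatten_whitespace`, `save_one_per_line`, and `load_arr_str` are hypothetical helpers and not part of the original notebook.

```python
import os
import tempfile


def flatten_whitespace(text: str) -> str:
    """Collapse every run of whitespace, newlines included, into a single space."""
    return " ".join(text.split())


def save_one_per_line(path: str, docs) -> None:
    """Write each document on exactly one line (hypothetical variant of save_arr_str)."""
    with open(path, "w", encoding="utf-8") as file:
        for doc in docs:
            file.write(flatten_whitespace(doc) + "\n")


def load_arr_str(path: str) -> list:
    """Read the file back, one document per line."""
    with open(path, "r", encoding="utf-8") as file:
        return [line.rstrip("\n") for line in file]


# Illustrative documents, not taken from the datasets
docs = ["first line\nsecond line", "  a   single\tdocument  "]
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "docs.txt")
    save_one_per_line(tmp_path, docs)
    restored = load_arr_str(tmp_path)

print(restored)
```

With this layout the number of lines in the file equals the number of documents, which makes it easier to check that nothing was lost when the file is read back later.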

### Educator: TED Talk transcripts

In [23]:
# Educator Data
transcripts_df = pd.read_csv("./data/educator/ted_transcripts/transcripts.csv")
transcripts_df.info() # 2467 transcripts available in csv file

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2467 entries, 0 to 2466
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   transcript  2467 non-null   object
 1   url         2467 non-null   object
dtypes: object(2)
memory usage: 38.7+ KB


In [24]:
transcripts = transcripts_df['transcript']
display_doc(transcripts[11], random_start=True)

# The transcripts include audience reactions in parentheses, e.g.
# (Laughter) (Applause)
# These annotations might distract the LLMs from learning
# to speak like an educator/TED speaker
# Let's confirm using regex:

bracket_words = {}
all_bracket_words = []
for t in transcripts:
    for b_word in re.findall(r"\(([A-Za-z]+)\)", t):
        if b_word not in bracket_words:
            bracket_words[b_word] = 1
        else:
            bracket_words[b_word] += 1
        all_bracket_words.append(b_word)

bracket_hist = px.histogram(all_bracket_words, template="plotly_dark")
bracket_hist.update_layout(showlegend=False, xaxis_title="Bracket Words")
bracket_hist.show()
print(bracket_words.keys())

I'd like to talk about today is a way for people to travel, to meet people in a different way
than — because you can't travel all over the world at the same time. And a long time ago —
well, about 40 years ago — my mom had an exchange student. And I'm going to show you slides of
the exchange student. This is Donna. This is Donna at the Statue of Liberty. This is my mother and aunt
teaching Donna how to ride a bike. This is Donna eating ice cream. And this is Donna teaching my aunt
how to do a Filipino dance. I really think as the world is getting smaller, it becomes more and more
important that we learn each other's dance moves, that we meet each other, we get to know each other, we
are able to figure out a way to cross borders, to understand each other, to understand people's hopes and dreams,
what makes them laugh and cry. And I know that we can't all do exchange programs, and I can't force
everybody to travel; I've already talked about that to Chris and Amy, and they said that there

dict_keys(['Laughter', 'Applause', 'Music', 'Cheering', 'Sighs', 'Video', 'Singing', 'Whispering', 'Audience', 'Shouting', 'Sings', 'Lyrics', 'Chanting', 'Thunder', 'Scream', 'video', 'Whistling', 'Laughs', 'Bells', 'Trumpet', 'Voices', 'Screams', 'Explosion', 'Portamento', 'Tones', 'Blip', 'Screeching', 'Silence', 'Rattling', 'Clattering', 'Inaudible', 'Beep', 'Burst', 'Knocks', 'Spanish', 'Reading', 'Typewriting', 'Thuds', 'Breathing', 'Laugher', 'Whirring', 'Beatboxing', 'Screaming', 'Clapping', 'Tuning', 'k', 'Growling', 'Rustling', 'Japanese', 'Recording', 'Gasps', 'Sniffs', 'Coughs', 'Buzzing', 'Ringing', 'Buzzer', 'Jackhammer', 'Claps', 'Gunshot', 'Sobs', 'Noise', 'Cries', 'Hindi', 'Murmuring', 'Braying', 'Barking', 'Shouts', 'Sigh', 'Audio', 'Feedback', 'Cheers', 'Nonsense', 'Game', 'Mandarin', 'cheering', 'Live', 'Thumping', 'Static', 'Inhales', 'Barks', 'Sneezes', 'Whinny', 'Roar', 'Chatter', 'Honk', 'Translator', 'Honking', 'Laughing', 'Laughted', 'Crackling', 'Tapping', 'Fu
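
The manual dictionary counting in the cell above can also be written with `collections.Counter`, which additionally gives the most frequent annotations directly. This is only a sketch on made-up sample strings; `count_bracket_words` is a hypothetical helper name, and the pattern is the same single-word pattern used in the notebook.

```python
import re
from collections import Counter

# Same pattern as in the notebook: one alphabetic word inside parentheses
BRACKET_WORD = re.compile(r"\(([A-Za-z]+)\)")


def count_bracket_words(texts) -> Counter:
    """Count how often each parenthesised word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(BRACKET_WORD.findall(text))
    return counts


# Made-up sample strings, not taken from the dataset
sample_texts = [
    "Thank you. (Applause) That was a joke. (Laughter)",
    "(Laughter) And then the music started. (Music)",
]
counts = count_bracket_words(sample_texts)
print(counts.most_common(3))
```

Note that this pattern only matches single alphabetic words, so multi-word annotations would not be counted here even though the broader pattern `\([^)]*\)` used for removal would delete them.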

In [19]:
# As suspected these words will distract the model from learning the speakers' personalities
# We will remove them
transcripts_no_bracketed_words = [re.sub(r'\([^)]*\)', '', t) for t in transcripts]
display_doc(transcripts_no_bracketed_words[0])

# The URL column actually provides a very brief summary of the talk:
url_summaries = []
for url in transcripts_df['url']:
    last_slash = url.rfind('/')
    # Take the slug after the last '/', dropping the final character of the URL
    # (assumed to be a trailing newline)
    url_sum = ' '.join(url[last_slash+1:-1].split('_'))
    url_summaries.append(url_sum)

save_transcripts = [url_summaries[i]+": "+ tnbw for i, tnbw in enumerate(transcripts_no_bracketed_words)]
save_arr_str("./data/educator/ted_transcripts.txt", save_transcripts)

people have never heard of, Gillian Lynne. Have you heard of her? Some have. She's a choreographer, and everybody knows
her work. She did "Cats" and "Phantom of the Opera." She's wonderful. I used to be on the board of
The Royal Ballet, as you can see. Anyway, Gillian and I had lunch one day and I said, "How did
you get to be a dancer?" It was interesting. When she was at school, she was really hopeless. And the
school, in the '30s, wrote to her parents and said, "We think Gillian has a learning disorder." She couldn't concentrate;
she was fidgeting. I think now they'd say she had ADHD. Wouldn't you? But this was the 1930s, and ADHD
hadn't been invented at this point. It wasn't an available condition.People weren't aware they could have that.Anyway, she went to
see this specialist. So, this oak-paneled room, and she was there with her mother, and she was led and sat
on this chair at the end, and she sat on her hands for 20 minutes while this man talked to
her mother about the problems 
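
The excerpt above contains fused sentences such as `condition.People weren't aware`, which suggests that an annotation sat directly between two sentences and was replaced by an empty string. A possible variant, shown here only as a sketch, replaces each annotation with a space and then normalises the whitespace. `remove_annotations` is a hypothetical helper, and the test strings are illustrative.

```python
import re

# Same removal pattern as in the notebook: anything between parentheses
ANNOTATION = re.compile(r"\([^)]*\)")


def remove_annotations(text: str) -> str:
    """Remove parenthesised annotations without gluing neighbouring words together."""
    # Replace each annotation with a space so adjacent sentences stay separated
    without = ANNOTATION.sub(" ", text)
    # Collapse the resulting runs of whitespace into single spaces
    collapsed = " ".join(without.split())
    # Remove a space that ended up directly before punctuation
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)


fused = "It wasn't an available condition.(Laughter)People weren't aware."
print(remove_annotations(fused))
```

The trade-off is that all whitespace is normalised, so any line breaks inside a transcript are lost as well.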

### Politician: UN General Debates transcripts

In [23]:
un_debates_df = pd.read_csv("./data/politician/debates.csv")
print(un_debates_df.info()) # 7507 total speeches
display(un_debates_df.head()) 
un_debates_transcripts = un_debates_df['text']
tr_len = un_debates_transcripts.apply(lambda x: len(x))
print(f"Average transcript length: {int(mean(tr_len))} Max: {max(tr_len)} Min: {min(tr_len)}")
r_start = rint(0, len(un_debates_transcripts)-3)
for i in range(r_start, r_start+3):
    display_doc(un_debates_transcripts[i])

# Looks good to save as is for now
save_arr_str("./data/politician/un_transcripts.txt", un_debates_transcripts)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7507 entries, 0 to 7506
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   session  7507 non-null   int64 
 1   year     7507 non-null   int64 
 2   country  7507 non-null   object
 3   text     7507 non-null   object
dtypes: int64(2), object(2)
memory usage: 234.7+ KB
None


|   | session | year | country | text |
|---|---|---|---|---|
| 0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... |
| 1 | 44 | 1989 | FIN | \nMay I begin by congratulating you. Sir, on ... |
| 2 | 44 | 1989 | NER | \nMr. President, it is a particular pleasure ... |
| 3 | 44 | 1989 | URY | \nDuring the debate at the fortieth session o... |
| 4 | 44 | 1989 | ZWE | I should like at the outset to express my del... |


Average transcript length: 17967 Max: 72041 Min: 2362
I should say that aggressions have come in the past from countries of different ideologies. In the face of aggression
that could lead to conflict between nations, we propose the alternative of diplomacy and political methods. Hence our active neutrality;
we do not justify or explain regional wars which only produce destruction or death, nor do we accept the existence
of any international or ideological right to provoke confrontations between sister States. We Guatemalans affirm that violence, even when labeled
'revolutionary', is new at this historical moment an obstacle to Central American development because funds are allocated for weapons rather
than to meet our needs. We have asserted our neutrality with regard to the differences that might exist among the
Central American countries and, at the same time, our energetic, diplomatic and political participation in the search for an understanding
and in the mechanisms for integratio
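
The `head()` output above shows several speeches starting with `\n`, and the raw text may also begin with a byte-order mark (U+FEFF). If one speech per line is wanted in the saved file, a light cleaning step could be applied first. This is a sketch under that assumption; `clean_speech` is a hypothetical helper that is not in the original notebook.

```python
def clean_speech(text: str) -> str:
    """Strip a byte-order mark and flatten all whitespace to single spaces."""
    # U+FEFF is not treated as whitespace by str.split, so remove it explicitly
    without_bom = text.replace("\ufeff", "")
    # Newlines, tabs and repeated spaces all collapse to one space
    return " ".join(without_bom.split())


# Illustrative string modelled on the shape of the head() output
raw = "\ufeff\nMay I begin by congratulating you.\nSir"
print(clean_speech(raw))
```

On the real data this could be applied with `un_debates_transcripts.map(clean_speech)` before calling `save_arr_str`.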

### Comedian: Stand-up transcripts

In [33]:
stand_up_df = pd.read_csv("./data/comedian/stand-up-data.csv")
print(stand_up_df.info()) # 330 transcripts
display(stand_up_df.head()) 
stand_up_transcripts = stand_up_df['transcript']
tr_len = stand_up_transcripts.apply(lambda x: len(x))
print(f"Average transcript length: {int(mean(tr_len))} Max: {max(tr_len)} Min: {min(tr_len)}")

r_start = rint(0,len(stand_up_transcripts)-3)
for i in range(r_start, r_start+3):
    display_doc(stand_up_transcripts[i])

# This looks good to save as is too
save_arr_str("./data/comedian/stand_up_transcripts.txt", stand_up_transcripts)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 330 entries, 0 to 329
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        330 non-null    object 
 1   date_posted  330 non-null    object 
 2   link         330 non-null    object 
 3   name         326 non-null    object 
 4   year         313 non-null    float64
 5   transcript   330 non-null    object 
dtypes: float64(1), object(5)
memory usage: 15.6+ KB
None


|   | title | date_posted | link | name | year | transcript |
|---|---|---|---|---|---|---|
| 0 | Russell Peters: Deported | May 10th, 2020 | https://scrapsfromtheloft.com/2020/05/10/russe... | Russell Peters | 2020.0 | NARRATOR: Ladies and gentlemen, it’s start t... |
| 1 | Jimmy O. Yang: Good Deal | May 10th, 2020 | https://scrapsfromtheloft.com/2020/05/10/jimmy... | Jimmy O. Yang | 2020.0 | ANNOUNCER: Ladies and gentlemen, welcome to th... |
| 2 | Jo Koy: Lights Out | May 9th, 2020 | https://scrapsfromtheloft.com/2020/05/09/jo-ko... | Jo Koy | 2012.0 | L.A., are you ready? Live from the Alex Thea... |
| 3 | Lee Mack: Going Out Live | May 8th, 2020 | https://scrapsfromtheloft.com/2020/05/08/lee-m... | Lee Mack | 2010.0 | This programme contains strong language Over ... |
| 4 | Lee Mack: Live | May 7th, 2020 | https://scrapsfromtheloft.com/2020/05/07/lee-m... | Lee Mack | 2007.0 | PRESENTER: Ladies and gentlemen, please welco... |


Average transcript length: 42494 Max: 86031 Min: 92
cast Twenty-five years in a Hollywood film since The Joy Luck Club. Think about this, my first scene I shot
that movie, the first scene, hours prior, Dr. Ken had just been canceled. So, to me, that’s all I see
when I see that movie. And that movie got me through everything. Because you know what? Director Jon M. Chu
let me improvise. All that shit was just off the top of my head. Me and Me and Aquafina Aquafina
is my daughter. I love her so much. I’m Papa-fina to her Aquafina. We both improvised that shit. I mean,
we were just on a fucking It was so therapeutic. None of that shit was in the script. Cal State
Fullerton is my You know what I mean? That’s my new Toodle-oo, motherfuckers, you know what I mean? But it
was amazing that I got to improvise that whole scene and get away with it. I mean, it was, like,
crazy. We didn’t pay attention to the script to the point where the director was like, What the fuck, you
know? And then you jum

### Rapper: Eminem Song Lyrics

In [40]:
# Rapper data: the .txt file is used as retrieved
with open("./data/rapper/Eminem/ALL_eminem.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
    r_start = rint(0, len(lines)-10)
    for i in range(r_start, r_start+10):
        print(lines[i])
    print("Num lines:", len(lines))

I'm climbin' all up the sides of the asylum wall

And dive in a pile of Tylenol, you're like a vagina problem

To a diabolical gynecologist, tryna ball a fist I will

Fuck you just buy me, double timing the rhyming

I leave you stymied, that's why they still vilify me like Bill O'Reilly

I'ma show you what I mean when they call me the Harvey Weinstein of 2019

I'm a conniving (What?), when I'm on the mic I'ma standout

Like a lime green wife beater with a knife out

I'm a sight to see, but you can see from the ring I'm wearing

Me and this game, we got married already

Num lines: 23629
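
The blank line after every lyric in the output above most likely comes from `print` adding its own newline to lines that already end in `\n`, since `readlines` keeps line endings. A small sketch of stripping the line endings and dropping empty lines follows; `clean_lines` is a hypothetical helper and the sample lines are made up.

```python
def clean_lines(raw_lines):
    """Strip surrounding whitespace from each line and drop lines that end up empty."""
    stripped = (line.strip() for line in raw_lines)
    return [line for line in stripped if line]


# Made-up lines in the shape returned by readlines()
raw_lines = ["first bar\n", "\n", "  second bar  \n", "third bar"]
cleaned = clean_lines(raw_lines)
for line in cleaned:
    print(line)
print("Num lines:", len(cleaned))
```

Counting the cleaned lines also gives a line total that excludes empty lines, which may differ from the raw count reported by the notebook.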


### Lawyer: Supreme Court Cases Transcripts

In [51]:
# Lawyer Data
def load_files_as_json(directory):
    file_contents = {}
    for filename in os.listdir(directory):
        if filename.endswith('.js'):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                try:
                    data = json.load(file)
                    casename = data['caseName']
                    if casename not in file_contents:
                        file_contents[casename] = data
                    else: 
                        print(f"Warning! Duplicate case name '{casename}' in {directory}")
                except json.JSONDecodeError:
                    print(f"Error decoding JSON from file: {filename}")
    return file_contents

def get_lines(case_transcripts_dict: dict, verbose=False):
    ''' Returns all speaker turns of a case as one "Speaker: text" string. '''
    lines = []
    if len(case_transcripts_dict['caseTranscripts']) > 0:
        # Only the first transcript of each case is used
        for trans in case_transcripts_dict['caseTranscripts'][0]['transcript']:
            speaker = trans['speakerName']
            # Only the first text object of each turn is used
            speaker_text = trans['textObjs'][0]['text']
            line = speaker+": "+speaker_text
            lines.append(line)
            if verbose: print(line)
    return ' '.join(lines)

years = ["2016","2017","2018"]

court_transcripts = []
for year in years:
    case_directory = f"./data/lawyer/{year}/"
    files_data = load_files_as_json(case_directory)
        
    for case_name,case_dict in files_data.items():
        case_transcript = get_lines(case_dict)
        save_str = f"(Transcript for {case_name} in {year}) "+case_transcript
        court_transcripts.append(save_str)

display_doc(court_transcripts[0])
save_arr_str("./data/lawyer/court_transcripts.txt", court_transcripts)

whatever activities that the IRS thought characterized what a church should be. James A. Feldman: I just don't think that
that's what the problem was. John G. Roberts, Jr.: What the -- what was the tenor -- James A. Feldman:
-- of that. John G. Roberts, Jr.: What was the tenor of the hundreds and hundreds of letters that --
that Congress received about what the IRS was doing? What did they understand the IRS to be doing? James A.
Feldman: So, if you look at the 20 -- on page, I think, 10,054 or so of the congressional record
-- I don't remember the volume number -- but it's cited by Petitioners and by us. Samuel A. Alito, Jr.:
Are you saying that the only purpose of the amendment was to avoid the sunset provision? James A. Feldman: I
think there were two purposes. Samuel A. Alito, Jr.: All right. James A. Feldman: That -- well, okay. Samuel A.
Alito, Jr.: Because they honestly would have to do something else, right? And that's what C(i) -- James A. Feldman:
Right. Samuel A. Alito, Jr.:
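
`get_lines` reads only `textObjs[0]` for each speaker turn. If a turn can hold more than one text object (an assumption that should be checked against the JSON files), the remaining text would be dropped. The sketch below joins every text object of a turn; the key names follow the notebook, while `turn_to_line`, `turns_to_text`, and the sample turns are hypothetical.

```python
def turn_to_line(turn: dict) -> str:
    """Format one speaker turn as 'Speaker: text', keeping every text object."""
    text = " ".join(obj["text"] for obj in turn["textObjs"])
    return f"{turn['speakerName']}: {text}"


def turns_to_text(turns) -> str:
    """Join all formatted turns of a transcript into a single string."""
    return " ".join(turn_to_line(turn) for turn in turns)


# Made-up turns that mimic the structure read by get_lines
sample_turns = [
    {"speakerName": "Justice A",
     "textObjs": [{"text": "First part."}, {"text": "Second part."}]},
    {"speakerName": "Counsel B",
     "textObjs": [{"text": "Yes, Your Honor."}]},
]
joined = turns_to_text(sample_turns)
print(joined)
```

Comparing the total character count of this variant with the output of `get_lines` on a few cases would show whether any text is actually being lost.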

### Philosopher: Works from Aristotle, Plato, Aeschylus, and Epictetus

In [36]:
# Philosophers Data

philo_dir = "./data/philosopher/"

all_philo_text = []
for philo in os.listdir(philo_dir):
    philo_dir_ = philo_dir + philo + "/"
    # Skip plain files such as a previously saved all_philo.txt
    if not os.path.isdir(philo_dir_):
        continue
    for txt in os.listdir(philo_dir_):
        with open(philo_dir_+txt, 'r', encoding="utf-8") as f:
            lines = f.readlines()
        # Drop translator credits and lines without any letters
        lines = [line for line in lines if (not line.startswith("Translated by")
                                            and any(c.isalpha() for c in line))]
        # Drop the header: everything up to one line past "Available online at"
        for i, line in enumerate(lines):
            if "Available online at" in line:
                lines = lines[i+2:]
                break

        formatted_txt = ' '.join(lines)
        all_philo_text.append(formatted_txt)

display_doc(all_philo_text[0])
print(len(all_philo_text), "documents")
save_arr_str("./data/philosopher/all_philo.txt",all_philo_text)

gushing fount of tears Is wept away; no drop is left to shed. Dim are the eyes that ever watched
till dawn, Weeping, the bale-fires, piled for thy return, Night after night unkindled. If I slept, Each sound-the tiny humming
of a gnat, Roused me again, again, from fitful dreams Wherein I felt thee smitten, saw thee slain, Thrice for
each moment of mine hour of sleep. All this I bore, and now, released from woe, I hail my lord
as watch-dog of a fold, As saving stay-rope of a storm-tossed ship, As column stout that holds the roof aloft,
As only child unto a sire bereaved, As land beheld, past hope, by crews forlorn, As sunshine fair when tempest's
wrath is past, As gushing spring to thirsty wayfarer. So sweet it is to 'scape the press of pain. With
such salute I bid my husband hail Nor heaven be wroth therewith! for long and hard I bore that ire
of old. Sweet lord, step forth, Step from thy car, I pray-nay, not on earth Plant the proud foot, O
king, that trod down Troy! Women! why tarry y
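
The header-stripping logic of the cell above can be isolated into small functions and tried on an in-memory list of lines, which is easier to check than running it on the files. This is a sketch: the function names are hypothetical, the sample header layout and URL are made up, and unlike the notebook it also strips the whitespace around each line.

```python
def keep_line(line: str) -> bool:
    """Keep lines that contain a letter and are not translator credits."""
    return (not line.startswith("Translated by")) and any(c.isalpha() for c in line)


def strip_header(lines, marker="Available online at"):
    """Drop everything up to and including the line after the first marker line."""
    for i, line in enumerate(lines):
        if marker in line:
            # i + 2 skips the marker line and the line that follows it
            return lines[i + 2:]
    return lines


def clean_document(raw_lines) -> str:
    """Filter the lines, remove the header, and join the rest with spaces."""
    kept = [line.strip() for line in raw_lines if keep_line(line)]
    return " ".join(strip_header(kept))


# Made-up file content; the real header layout may differ
sample_lines = [
    "Agamemnon\n",
    "Translated by Someone\n",
    "----------\n",
    "Available online at\n",
    "    http://example.org/agamemnon.html\n",
    "First line of the play,\n",
    "\n",
    "second line of the play.\n",
]
document = clean_document(sample_lines)
print(document)
```

Returning from the loop at the first match avoids slicing the list a second time if the marker text happens to appear again later in the document.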