# Import the full story

This code reads the text file into a list, creating one item for every line in the file (a new item begins after each newline character).

For example, take these two speeches:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

The first newline comes right after "First Citizen:", so the line "First Citizen:" becomes the first item.
The next item starts at "Before ..." and ends at "speak.", because the next newline follows it. The empty line between the two speeches becomes an item of its own that holds only '\n'.
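
As a minimal sketch of this behaviour, the two speeches above can be read from an in-memory string instead of the real file. The name `sample_text` and the use of `io.StringIO` are assumptions made only for illustration; the notebook itself reads `full-story.txt`.

```python
import io

# Hypothetical in-memory stand-in for full-story.txt (only the two speeches above)
sample_text = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)

# readlines() returns one item per line and keeps the trailing newline on each
sample_lines = io.StringIO(sample_text).readlines()

# repr() makes the newline characters visible
for item in sample_lines:
    print(repr(item))
```

Printing with `repr` shows that the empty line separating the speeches is stored as an item equal to `'\n'`, which is what the later steps rely on.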

In [1]:
import pandas as pd

# Load the text file and store each line in a list
with open("full-story.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Extracting the speakers and speeches from the list

From the list we can see that (after the first two items) a speaker is always found immediately after each item that is exactly '\n', and the relevant speech is found between that speaker and the next occurring '\n' item.

Therefore, we first build a list of the indexes of all '\n' items, and then use it to locate the speakers and their speeches.
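
The indexing logic described here can be sketched on a small toy list before it is applied to the full file. The names `toy_lines`, `blank_positions`, `toy_speakers` and `toy_speeches` are hypothetical, and the list comprehensions are a compact alternative to the loops used in the cells below, not the notebook's own code.

```python
# Toy list in the shape readlines() produces (an assumption, not the real file)
toy_lines = [
    "First Citizen:\n",
    "Before we proceed any further, hear me speak.\n",
    "\n",
    "All:\n",
    "Speak, speak.\n",
    "\n",
    "First Citizen:\n",
    "You are all resolved rather to die than to famish?\n",
]

# Positions of the items that are exactly a newline
blank_positions = [i for i, line in enumerate(toy_lines) if line == "\n"]

# The speaker sits one position after each blank item
toy_speakers = [toy_lines[i + 1] for i in blank_positions]

# Each speech runs from two positions after a blank item up to the next
# blank item, or to the end of the list for the last speech
boundaries = blank_positions + [len(toy_lines)]
toy_speeches = [
    toy_lines[start + 2:end]
    for start, end in zip(boundaries, boundaries[1:])
]
```

As in the notebook, the very first speaker and speech come before any blank item, so they are not picked up here and have to be added separately.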

In [2]:
# Run a loop to create a list of indexes:

n_indexes = []

for i in range(len(lines)):
    if lines[i] == '\n':
        n_indexes.append(i)

In [21]:
# Locate all the speakers

speaker = []

for i in n_indexes:
    speaker.append(lines[i+1])

In [22]:
# Find the number of unique values in speaker
len(set(speaker))

310

In [23]:
# Now locate the relevant speeches

speech = []

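for i in range(len(n_indexes) - 1):
    # skip the blank item and the speaker line, stop at the next blank item
    speech.append(lines[n_indexes[i]+2:n_indexes[i+1]])

# cater for the final index: take every line after the last speaker,
# as a list, so a multi-line final speech is not cut to its first line
speech.append(lines[n_indexes[-1]+2:])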

In [24]:
# Now we account for the first speaker and the first speech:

speaker.insert(0, lines[0])
speech.insert(0, lines[1])

# Extract the data within delimiters to clean the text within

Each of the new lists holds extra characters besides the actual speakers and speeches (the ':' after a speaker and the '\n' at the end of every line), so we need to clean them down to only the relevant information.
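
A minimal sketch of that cleaning step, written as two small helper functions. `clean_speaker` and `clean_speech` are hypothetical names, and they use `str.rstrip` rather than the `replace` calls in the cells below, so this is an illustration of the idea and not the notebook's exact method.

```python
def clean_speaker(raw):
    # Drop the trailing newline first, then the trailing colon
    return raw.rstrip("\n").rstrip(":")


def clean_speech(raw_lines):
    # Strip the newline from every line of the speech
    cleaned = [line.rstrip("\n") for line in raw_lines]
    # A single-line speech becomes a plain string; longer ones stay as lists
    return cleaned[0] if len(cleaned) == 1 else cleaned


name = clean_speaker("First Citizen:\n")
single = clean_speech(["Speak, speak.\n"])
# Placeholder text, used only to show the multi-line case
multi = clean_speech(["line one\n", "line two\n"])
```

Using `rstrip` only touches the end of the string, so a colon inside a speaker name or speech would be left alone.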

In [25]:
# Remove the ":\n" delimiter from each speaker and turn single-line speeches into plain strings
for i in range(len(speaker)):
    speaker[i] = speaker[i].replace(":\n","")

    if i == 0:
        # the first speech was inserted as a plain string, not a list
        speech[i] = speech[i].replace("\n","")
    elif len(speech[i]) == 1:
        # a one-item list becomes the string it contains
        speech[i] = speech[i][0].replace("\n","")

In [28]:
# Remove delimiters in speech for multilined speeches

for n in range(len(speech)):
    # multi-line speeches are the items that are still lists
    if isinstance(speech[n], list):
        # replace the list in place with a newline-free copy
        speech[n] = [line.replace("\n", "") for line in speech[n]]

# Combine the speaker and speech list into a dataframe

In [30]:
# Combine the two lists into rows
data = list(zip(speaker, speech))

# Convert multi-line speeches (nested lists in speech) to comma-separated strings
def flatten_cell(cell):
    if isinstance(cell, list):
        return ', '.join(cell)
    return cell

# Create the DataFrame
df = pd.DataFrame(data, columns=['speaker', 'speech'])

# Apply the flattening to the speech column
df['speech'] = df['speech'].apply(flatten_cell)

In [31]:
df

,speaker,speech
0,First Citizen,"Before we proceed any further, hear me speak."
1,All,"Speak, speak."
2,First Citizen,You are all resolved rather to die than to fam...
3,All,Resolved. resolved.
4,First Citizen,"First, you know Caius Marcius is chief enemy t..."
...,...,...
7219,ANTONIO,"Nor I; my spirits are nimble., They fell toget..."
7220,SEBASTIAN,"What, art thou waking?"
7221,ANTONIO,Do you not hear me speak?
7222,SEBASTIAN,"I do; and surely, It is a sleepy language and ..."
