In [1]:
import pandas as pd

# Load the text file
with open("truncated-story.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

This code reads the text file into a list, creating one item for every line of the file.

For example, take these two speeches:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

readlines() ends an item at each newline character, so the line "First Citizen:" becomes the first item.
The next item starts at "Before ..." and ends at "speak.", where the next newline occurs.
The blank line between the two speeches becomes an item of its own that contains only '\n'.
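
To see this splitting behaviour without needing the file, here is a minimal sketch that runs the same two speeches through `readlines()` using `io.StringIO`. The in-memory string and the names `sample` and `sample_lines` are assumptions made for illustration; they are not part of the notebook.

```python
import io

# In-memory stand-in for the text file (assumed to have the same layout)
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)

sample_lines = io.StringIO(sample).readlines()

# Each item keeps its trailing '\n'; the blank line becomes the item '\n'
print(sample_lines)

# str.splitlines() yields the same items without the trailing newline,
# so the blank line becomes an empty string instead
print(sample.splitlines())
```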

In [2]:
lines

['First Citizen:\n',
 'Before we proceed any further, hear me speak.\n',
 '\n',
 'All:\n',
 'Speak, speak.\n',
 '\n',
 'First Citizen:\n',
 'You are all resolved rather to die than to famish?\n',
 '\n',
 'All:\n',
 'Resolved. resolved.\n',
 '\n',
 'First Citizen:\n',
 'First, you know Caius Marcius is chief enemy to the people.\n',
 '\n',
 'All:\n',
 "We know't, we know't.\n",
 '\n',
 'First Citizen:\n',
 "Let us kill him, and we'll have corn at our own price.\n",
 "Is't a verdict?\n",
 '\n',
 'All:\n',
 "No more talking on't; let it be done: away, away!\n",
 '\n',
 'Second Citizen:\n',
 'One word, good citizens.\n',
 '\n',
 'First Citizen:\n',
 'We are accounted poor citizens, the patricians good.\n',
 'What authority surfeits on would relieve us: if they\n',
 'would yield us but the superfluity, while it were\n',
 'wholesome, we might guess they relieved us humanely;\n',
 'but they think we are too dear: the leanness that\n',
 'afflicts us, the object of our misery, is as an\n',
 'in

## Extracting the speakers and speeches from the list

From the list we can see that, after the first two items, a speaker always comes immediately after an item that consists solely of '\n' (a blank line), and that speaker's speech runs from the following item up to the next '\n' item.

Therefore, we first build a list of the indexes of all '\n' items, and then use it to locate each speaker and the matching speech.
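
The index logic described above can be sketched on a small hard-coded sample before applying it to the full file. The names `sample_lines`, `blank_indexes`, `sample_speakers` and `first_speech` are illustrative only and are chosen so that they do not overwrite the notebook's own variables.

```python
# Small hard-coded sample in the same layout as `lines` (illustrative only)
sample_lines = [
    'First Citizen:\n',
    'Before we proceed any further, hear me speak.\n',
    '\n',
    'All:\n',
    'Speak, speak.\n',
    '\n',
    'First Citizen:\n',
    'You are all resolved rather to die than to famish?\n',
]

# Indexes of the blank-line items
blank_indexes = [i for i, line in enumerate(sample_lines) if line == '\n']

# The speaker sits one position after each blank line
sample_speakers = [sample_lines[i + 1] for i in blank_indexes]

# A speech starts two positions after a blank line and runs up to the next one
first_speech = sample_lines[blank_indexes[0] + 2:blank_indexes[1]]

print(blank_indexes)
print(sample_speakers)
print(first_speech)
```

Note that the very first speaker and speech are not found this way, because no blank line precedes them; they have to be added separately.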

In [3]:
# Run a loop to create a list of indexes:

n_indexes = []

for i in range(len(lines)):
    if lines[i] == '\n':
        n_indexes.append(i)

In [4]:
n_indexes

[2, 5, 8, 11, 14, 17, 21, 24, 27, 39, 42]

In [91]:
# Locate all the speakers

speaker = []

for i in n_indexes:
    speaker.append(lines[i+1])

In [92]:
speaker

['All:\n',
 'First Citizen:\n',
 'All:\n',
 'First Citizen:\n',
 'All:\n',
 'First Citizen:\n',
 'All:\n',
 'Second Citizen:\n',
 'First Citizen:\n',
 'Second Citizen:\n',
 'All:\n']

In [93]:
lines[2+1+1:5]

['Speak, speak.\n']

In [94]:
lines[-1]

"Against him first: he's a very dog to the commonalty.\n"

In [95]:
# Now locate the relevant speeches

speech = []

for i in range(len(n_indexes)):
    if i+1 < len(n_indexes):
        speech.append(lines[n_indexes[i]+1+1:n_indexes[i+1]])

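# Cater for the final speech, which has no blank line after it.
# Slice to the end so that it is a list like the other entries.
speech.append(lines[n_indexes[-1]+1+1:])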

In [96]:
speech

[['Speak, speak.\n'],
 ['You are all resolved rather to die than to famish?\n'],
 ['Resolved. resolved.\n'],
 ['First, you know Caius Marcius is chief enemy to the people.\n'],
 ["We know't, we know't.\n"],
 ["Let us kill him, and we'll have corn at our own price.\n",
  "Is't a verdict?\n"],
 ["No more talking on't; let it be done: away, away!\n"],
 ['One word, good citizens.\n'],
 ['We are accounted poor citizens, the patricians good.\n',
  'What authority surfeits on would relieve us: if they\n',
  'would yield us but the superfluity, while it were\n',
  'wholesome, we might guess they relieved us humanely;\n',
  'but they think we are too dear: the leanness that\n',
  'afflicts us, the object of our misery, is as an\n',
  'inventory to particularise their abundance; our\n',
  'sufferance is a gain to them Let us revenge this with\n',
  'our pikes, ere we become rakes: for the gods know I\n',
  'speak this in hunger for bread, not in thirst for revenge.\n'],
 ['Would you proceed espe

In [97]:
# Now we account for the first speaker and the first speech:

speaker.insert(0, lines[0])
speaker

['First Citizen:\n',
 'All:\n',
 'First Citizen:\n',
 'All:\n',
 'First Citizen:\n',
 'All:\n',
 'First Citizen:\n',
 'All:\n',
 'Second Citizen:\n',
 'First Citizen:\n',
 'Second Citizen:\n',
 'All:\n']

In [98]:
speech.insert(0, lines[1])
speech

['Before we proceed any further, hear me speak.\n',
 ['Speak, speak.\n'],
 ['You are all resolved rather to die than to famish?\n'],
 ['Resolved. resolved.\n'],
 ['First, you know Caius Marcius is chief enemy to the people.\n'],
 ["We know't, we know't.\n"],
 ["Let us kill him, and we'll have corn at our own price.\n",
  "Is't a verdict?\n"],
 ["No more talking on't; let it be done: away, away!\n"],
 ['One word, good citizens.\n'],
 ['We are accounted poor citizens, the patricians good.\n',
  'What authority surfeits on would relieve us: if they\n',
  'would yield us but the superfluity, while it were\n',
  'wholesome, we might guess they relieved us humanely;\n',
  'but they think we are too dear: the leanness that\n',
  'afflicts us, the object of our misery, is as an\n',
  'inventory to particularise their abundance; our\n',
  'sufferance is a gain to them Let us revenge this with\n',
  'our pikes, ere we become rakes: for the gods know I\n',
  'speak this in hunger for bread, not i

In [99]:
# Strip the trailing ":\n" from the speakers and the trailing "\n" from the speeches

for i in range(len(speaker)):
    speaker[i] = speaker[i].replace(":\n", "")

for i in range(len(speech)):
    if isinstance(speech[i], str):
        # Entry is already a plain string: just drop the newline
        speech[i] = speech[i].replace("\n", "")
    elif len(speech[i]) == 1:
        # Single-line speech: unwrap the one-item list into a string
        speech[i] = speech[i][0].replace("\n", "")
    # Multi-line speeches stay as lists here and are cleaned in a later cell
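
The cleaning step above treats plain strings and lists separately. As a hedged alternative, a single helper can strip the newline from either kind of entry. `strip_newlines` is a hypothetical name introduced here, not something defined elsewhere in the notebook, and unlike the cell above it does not unwrap one-item lists into strings.

```python
def strip_newlines(item):
    # Remove the newline character from a string,
    # or from every string inside a list of strings
    if isinstance(item, list):
        return [part.replace("\n", "") for part in item]
    return item.replace("\n", "")


# Illustrative sample: one single-line speech and one multi-line speech
sample_speech = [
    'Speak, speak.\n',
    ["Let us kill him, and we'll have corn at our own price.\n",
     "Is't a verdict?\n"],
]

cleaned = [strip_newlines(item) for item in sample_speech]
print(cleaned)
```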

In [100]:
speaker

['First Citizen',
 'All',
 'First Citizen',
 'All',
 'First Citizen',
 'All',
 'First Citizen',
 'All',
 'Second Citizen',
 'First Citizen',
 'Second Citizen',
 'All']

In [101]:
speech

['Before we proceed any further, hear me speak.',
 'Speak, speak.',
 'You are all resolved rather to die than to famish?',
 'Resolved. resolved.',
 'First, you know Caius Marcius is chief enemy to the people.',
 "We know't, we know't.",
 ["Let us kill him, and we'll have corn at our own price.\n",
  "Is't a verdict?\n"],
 "No more talking on't; let it be done: away, away!",
 'One word, good citizens.',
 ['We are accounted poor citizens, the patricians good.\n',
  'What authority surfeits on would relieve us: if they\n',
  'would yield us but the superfluity, while it were\n',
  'wholesome, we might guess they relieved us humanely;\n',
  'but they think we are too dear: the leanness that\n',
  'afflicts us, the object of our misery, is as an\n',
  'inventory to particularise their abundance; our\n',
  'sufferance is a gain to them Let us revenge this with\n',
  'our pikes, ere we become rakes: for the gods know I\n',
  'speak this in hunger for bread, not in thirst for revenge.\n'],
 'W

In [92]:
test_list = ['a','b','c']

In [93]:
test_list.insert(0,'d')

In [94]:
test_list

['d', 'a', 'b', 'c']

In [102]:
# Find the speeches that are still lists (the multi-line speeches)
speech_indexes = []

for i in range(len(speech)):
    if isinstance(speech[i], list):
        speech_indexes.append(i)

speech_indexes

[6, 9]

In [103]:
for n in speech_indexes:
    # Strip the newline from every line of the multi-line speech, in place
    speech[n] = [line.replace("\n", "") for line in speech[n]]

In [104]:
speech

['Before we proceed any further, hear me speak.',
 'Speak, speak.',
 'You are all resolved rather to die than to famish?',
 'Resolved. resolved.',
 'First, you know Caius Marcius is chief enemy to the people.',
 "We know't, we know't.",
 ["Let us kill him, and we'll have corn at our own price.", "Is't a verdict?"],
 "No more talking on't; let it be done: away, away!",
 'One word, good citizens.',
 ['We are accounted poor citizens, the patricians good.',
  'What authority surfeits on would relieve us: if they',
  'would yield us but the superfluity, while it were',
  'wholesome, we might guess they relieved us humanely;',
  'but they think we are too dear: the leanness that',
  'afflicts us, the object of our misery, is as an',
  'inventory to particularise their abundance; our',
  'sufferance is a gain to them Let us revenge this with',
  'our pikes, ere we become rakes: for the gods know I',
  'speak this in hunger for bread, not in thirst for revenge.'],
 'Would you proceed especiall

In [105]:
len(speech)

12

In [106]:
len(speaker)

12
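
The matching lengths confirm that the two lists line up. As a cross-check, the same pairs could also be built in a single pass, so that a speaker and its speech can never get out of step. This is only a sketch: `parse_speeches` is a hypothetical helper, and it assumes that every block starts with a speaker line ending in ":" and that blocks are separated by blank lines.

```python
def parse_speeches(raw_lines):
    # Group lines into (speaker, [speech lines]) pairs, splitting on blank lines
    pairs = []
    block = []
    # The extra blank line at the end flushes the final block
    for line in list(raw_lines) + ['\n']:
        if line.strip() == '':
            if block:
                speaker_name = block[0].rstrip('\n').rstrip(':')
                speech_lines = [part.rstrip('\n') for part in block[1:]]
                pairs.append((speaker_name, speech_lines))
                block = []
        else:
            block.append(line)
    return pairs


# Illustrative sample in the same layout as `lines`
sample_lines = [
    'First Citizen:\n',
    "Let us kill him, and we'll have corn at our own price.\n",
    "Is't a verdict?\n",
    '\n',
    'All:\n',
    "No more talking on't; let it be done: away, away!\n",
]

for speaker_name, speech_lines in parse_speeches(sample_lines):
    print(speaker_name, speech_lines)
```

Here every speech is a list of lines, including the single-line ones, which avoids handling strings and lists differently later on.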

In [110]:
list_1 = [1, 2, 3]
list_2 = ['a', ['b', 'c'], 'd']

# Combine the two lists into rows
data = list(zip(list_1, list_2))

# Convert nested lists in list_2 to comma-separated strings
def flatten_cell(cell):
    if isinstance(cell, list):
        return ', '.join(cell)
    return cell

# Create the DataFrame
df = pd.DataFrame(data, columns=['list_1', 'list_2'])

# Apply formatting to list_2
df['list_2'] = df['list_2'].apply(flatten_cell)

df

    list_1  list_2
0   1       a
1   2       b, c
2   3       d


In [111]:
# Combine the two lists into rows
data = list(zip(speaker, speech))

# Convert multi-line speeches (nested lists) to comma-separated strings
def flatten_cell(cell):
    if isinstance(cell, list):
        return ', '.join(cell)
    return cell

# Create the DataFrame
df = pd.DataFrame(data, columns=['speaker', 'speech'])

# Apply formatting to the speech column
df['speech'] = df['speech'].apply(flatten_cell)

df

    speaker         speech
0   First Citizen   Before we proceed any further, hear me speak.
1   All             Speak, speak.
2   First Citizen   You are all resolved rather to die than to fam...
3   All             Resolved. resolved.
4   First Citizen   First, you know Caius Marcius is chief enemy t...
5   All             We know't, we know't.
6   First Citizen   Let us kill him, and we'll have corn at our ow...
7   All             No more talking on't; let it be done: away, away!
8   Second Citizen  One word, good citizens.
9   First Citizen   We are accounted poor citizens, the patricians...
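
One thing to be aware of: `', '.join(cell)` puts a comma between the lines of a multi-line speech, and that comma is not in the original text. If that is not wanted, a variant that joins with a space could be used instead. `join_lines` and its `sep` parameter are hypothetical names for this sketch, which is kept free of pandas so that it runs on its own.

```python
def join_lines(cell, sep=" "):
    # Join a multi-line speech (a list of strings) into one string;
    # single-line speeches (plain strings) pass through unchanged
    if isinstance(cell, list):
        return sep.join(cell)
    return cell


multi_line = ["Let us kill him, and we'll have corn at our own price.",
              "Is't a verdict?"]

print(join_lines(multi_line))
print(join_lines("Speak, speak."))
```

It can be passed to `df['speech'].apply(...)` in the same way as `flatten_cell`.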


In [1]:
import re

text = "I am, but just 19 - screamed the boy! -- The narrator said."

# Replace everything that's NOT a letter, digit, or space with empty string
cleaned_text = re.sub(r'[^A-Za-z0-9 ]+', '', text)

print(cleaned_text)


I am but just 19  screamed the boy  The narrator said
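
The output above still contains double spaces where the dashes were removed. A possible follow-up step, sketched here with the assumed variable name `normalized_text`, collapses runs of spaces into a single space and trims the ends.

```python
import re

text = "I am, but just 19 - screamed the boy! -- The narrator said."

# Same cleaning step as above: drop everything except letters, digits and spaces
cleaned_text = re.sub(r'[^A-Za-z0-9 ]+', '', text)

# Collapse runs of spaces left behind by the removed punctuation
normalized_text = re.sub(r' +', ' ', cleaned_text).strip()

print(normalized_text)
# → I am but just 19 screamed the boy The narrator said
```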
