# Import the full story

This code builds a list from the text file, creating a new item at every line break.

e.g. for the two speeches:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

The code treats the first line break as the point just after "First Citizen:", so the line "First Citizen:" becomes the first item.
The next item starts at "Before ..." and ends at "speak.", because a new line starts after it. The empty line between the two speeches becomes an item of its own, containing only '\n'.
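
The splitting described above can be reproduced on a small sample without touching the real file. This is only a sketch: `sample` and `sample_lines` are hypothetical names, and `io.StringIO` stands in for the opened text file.

```python
import io

# Hypothetical in-memory sample standing in for the start of full-story.txt
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)

# io.StringIO behaves like an open text file, so readlines() splits it the same way
sample_lines = io.StringIO(sample).readlines()

# repr() makes the trailing '\n' of every item visible
for item in sample_lines:
    print(repr(item))
```

Every item keeps its trailing '\n', and the empty line shows up as an item that is exactly '\n'.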

In [1]:
import pandas as pd

# Load the text file and store each line in a list
with open("full-story.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Extracting the speakers and speeches from the list

From the list we can see that (after the first two items) a speaker always appears immediately after each item that is exactly '\n', and the corresponding speech lies between that speaker and the next occurring '\n' item.

Therefore, in our logic we first create a list of all indexes where '\n' items are found, and then we use that to locate the speaker and then speeches.
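
As a rough illustration of this logic, the sketch below applies it to a tiny hand-made list. The names `sample_lines`, `blank_indexes`, `sample_speakers` and `sample_speeches` are hypothetical and are not used elsewhere in the notebook.

```python
# Hypothetical sample in the same shape as the output of readlines()
sample_lines = [
    "First Citizen:\n",
    "Before we proceed any further, hear me speak.\n",
    "\n",
    "All:\n",
    "Speak, speak.\n",
    "\n",
    "First Citizen:\n",
    "You are all resolved rather to die than to famish?\n",
]

# Indexes of the '\n' items that separate the speeches
blank_indexes = [i for i, line in enumerate(sample_lines) if line == "\n"]

# The speaker sits on the line right after each '\n' item
sample_speakers = [sample_lines[i + 1] for i in blank_indexes]

# Each speech runs from two lines after a '\n' item up to the next '\n' item,
# or to the end of the list for the final speech
ends = blank_indexes[1:] + [len(sample_lines)]
sample_speeches = [
    sample_lines[start + 2:end] for start, end in zip(blank_indexes, ends)
]

print(blank_indexes)
print(sample_speakers)
print(sample_speeches)
```

The first speaker and speech are not picked up here, because no '\n' item comes before them; they have to be added separately.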

In [2]:
# Run a loop to create a list of the indexes of all '\n' items:

n_indexes = []

for i, line in enumerate(lines):
    if line == '\n':
        n_indexes.append(i)

In [21]:
# Locate all the speakers

speaker = []

for i in n_indexes:
    speaker.append(lines[i+1])

In [22]:
# Find the number of unique values in speaker
len(set(speaker))

310

In [23]:
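# Now locate the relevant speeches

speech = []

# Each speech starts two lines after a '\n' item (skipping the speaker line)
# and ends just before the next '\n' item
for i in range(len(n_indexes) - 1):
    speech.append(lines[n_indexes[i]+2:n_indexes[i+1]])

# Cater for the final index: the last speech runs to the end of the file,
# so take a slice rather than a single line
speech.append(lines[n_indexes[-1]+2:])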

In [24]:
# Now we account for the first speaker and the first speech:

speaker.insert(0, lines[0])
speech.insert(0, lines[1])

# Strip the delimiters to clean the text

Each of the new lists contains extra characters besides the actual speakers and speeches, so we need to clean them down to only the relevant information.

In [25]:
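# Remove the ":\n" delimiter from each speaker and unwrap the one-line lists in speech
for i in range(len(speaker)):
    speaker[i] = speaker[i].replace(":\n", "")

    if i == 0:
        # The first speech was inserted as a plain string, not a list
        speech[i] = speech[i].replace("\n", "")
    elif len(speech[i]) == 1:
        # Unwrap a one-line speech and drop its trailing newline;
        # multi-line speeches stay as lists for now
        speech[i] = speech[i][0].replace("\n", "")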
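
The loop above leaves multi-line speeches as lists whose items still end in '\n'. One possible way to treat both cases uniformly is sketched below. The helper `clean_speech` is hypothetical, and joining the lines with a single space is an assumption rather than something this notebook prescribes.

```python
def clean_speech(raw):
    """Return a speech as one string without newline characters.

    `raw` may be a single string or a list of strings. Joining the lines
    with a space is an assumption, not part of the original notebook.
    """
    parts = [raw] if isinstance(raw, str) else raw
    # strip() removes the trailing '\n' and any other surrounding whitespace
    return " ".join(part.strip() for part in parts)


example = [
    "Let us kill him, and we'll have corn at our own price.\n",
    "Is't a verdict?\n",
]
print(clean_speech(example))
print(clean_speech("Speak, speak.\n"))
```

Applied to the whole list, this would be `speech = [clean_speech(s) for s in speech]`, which leaves every entry as a plain string.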

In [27]:
speech

['Before we proceed any further, hear me speak.',
 'Speak, speak.',
 'You are all resolved rather to die than to famish?',
 'Resolved. resolved.',
 'First, you know Caius Marcius is chief enemy to the people.',
 "We know't, we know't.",
 ["Let us kill him, and we'll have corn at our own price.\n",
  "Is't a verdict?\n"],
 "No more talking on't; let it be done: away, away!",
 'One word, good citizens.',
 ['We are accounted poor citizens, the patricians good.\n',
  'What authority surfeits on would relieve us: if they\n',
  'would yield us but the superfluity, while it were\n',
  'wholesome, we might guess they relieved us humanely;\n',
  'but they think we are too dear: the leanness that\n',
  'afflicts us, the object of our misery, is as an\n',
  'inventory to particularise their abundance; our\n',
  'sufferance is a gain to them Let us revenge this with\n',
  'our pikes, ere we become rakes: for the gods know I\n',
  'speak this in hunger for bread, not in thirst for revenge.\n'],
 'W