<a href="https://colab.research.google.com/github/Jungyoonlim/middlemarch/blob/main/Middlemarch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Middlemarch Project**

Objective: Break George Eliot's *Middlemarch* down into story metadata such as characters, chapters, topics, and sentiment.

In [5]:
import os
print(os.getcwd())

/content


In [6]:
from google.colab import files

uploaded = files.upload()

Saving middlemarch.txt to middlemarch (1).txt


In [7]:
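# Colab saved this upload as 'middlemarch (1).txt' because 'middlemarch.txt'
# already existed, so this path reads the earlier copy.
file_path = 'middlemarch.txt'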

with open(file_path, 'r', encoding='utf-8') as file:
    middlemarch_text = file.read()

print("Text loaded successfully.")


Text loaded successfully.


In [8]:
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt')

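# Assumes `middlemarch_text` holds the full Project Gutenberg text of Middlemarch.

# Step 1: Remove the Gutenberg header and footer.
# The license boilerplate sits outside the novel, so keep only the text
# between the first "PRELUDE." and the first "THE END".
start_marker = "PRELUDE."
end_marker = "THE END"

start = middlemarch_text.find(start_marker)
end = middlemarch_text.find(end_marker)

# str.find returns -1 for a missing marker, which would silently corrupt the slice
if start == -1 or end == -1:
    raise ValueError("Could not locate the start or end marker in the text.")

# Extract only the main text
main_text = middlemarch_text[start:end]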

# Step 2: Tokenization
# Split into sentences
sentences = sent_tokenize(main_text)

# Split into words
words = word_tokenize(main_text)

print(f"Number of sentences: {len(sentences)}")
print(f"Number of words: {len(words)}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Number of sentences: 11900
Number of words: 370620
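
`str.find` returns the first occurrence of a marker. If the file carries a table of contents that also lists "PRELUDE.", the slice would start there rather than at the Prelude itself; the token preview later in this notebook (`'prelude', 'book', 'miss', 'brooke', 'chapter', ...`) is consistent with that. The sketch below is one possible way to skip a given number of earlier occurrences. The helper name `slice_between`, its `skip` parameter, and the sample string are hypothetical and not part of the original notebook.

```python
def slice_between(text, start_marker, end_marker, skip=0):
    """Return text from the (skip+1)-th start_marker up to the last end_marker.

    Hypothetical helper: `skip` is the number of earlier occurrences of
    start_marker to ignore (for example, one inside a table of contents).
    """
    start = -1
    for _ in range(skip + 1):
        start = text.find(start_marker, start + 1)
        if start == -1:
            raise ValueError(f"start marker {start_marker!r} not found often enough")

    end = text.rfind(end_marker)
    if end == -1 or end <= start:
        raise ValueError(f"end marker {end_marker!r} not found after the start marker")

    return text[start:end]


# Made-up miniature of the file layout, for illustration only
sample = "CONTENTS\nPRELUDE.\nBOOK I.\n\nPRELUDE.\nBody of the novel.\nTHE END\nLicense text"
print(slice_between(sample, "PRELUDE.", "THE END", skip=1))
```

Whether `skip=1` is the right value for the real file depends on how many times the marker appears before the novel begins, which would need to be checked against the text.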


Named Entity Recognition

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
# Raise spaCy's length limit so the whole novel can be processed in one pass
nlp.max_length = len(main_text)
doc = nlp(main_text)

character_entities = set()

# Collect PERSON entities, skipping any whose text contains a digit
for entity in doc.ents:
    if entity.label_ == 'PERSON' and not any(char.isdigit() for char in entity.text):
        character_entities.add(entity.text.strip())

for character in character_entities:
    print(character)

Warburton
Nick
Archie Duncan
Bat
Ladislaw?—shall
Call Fred Vincy
Leeds
Horrock
Thomas Aquinas
rick-thatcher
Peter
Killjoy
Fred
Vincy
Robert
Nimrod
comme
Joanna
DONNE
Vesalius
Porson
Sophy
Pope
Fitchett
Sophy Toller
Will
Ladislaw
Galen
Chus
Blindman
Minchin
Rosamond
Vincy
marquis
Mix
que
podremos
emphatically,—“she
Celia
Dagon
John Long
Sister Jane
Vicar
Abraham
marry Rosamond
Briggs
Italian Proverb
Walter Scott
Fred and Mary
Churchill
Waverley
Ian Vor
Winifred
Bruce
Lydgate
Cadwallader
Aristotle
Rigg
Bulstrode
James’s
Satire
Luck
Dagley
James
B.
Providence
Hiram Ford
divin qui
Blessed Virgin
Fred Vincy’s
Ned Plymdale
Ballard
Hate
Bam
Alfred
Lucy
Augustine
Wright
Naumann
Prayer
Clara
Harfager
Harriet
Tollers
Raffles
Eros
Jack
Edinburgh
Letty
Miss Brooke
Godwin
Lydgate
Joseph
Cranch
Jacob
Despond
Lowick Cicero
Freshitt Hall
Robert Brown’s
Faulkner
Parnassus
Vincy
Fred’s
Slaughter Lane
Tyke
Jeremy Taylor
Tamburlaine
Arthur
Carter
Beauty
Guydo
Rigg Featherstone
Trapping Bass
Letty Garth
Be
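
The list above mixes real characters with noise: lowercase words such as `comme`, fragments such as `marry Rosamond`, and possessives such as `Fred Vincy’s`. Some names also appear split across two output lines (for example `Rosamond` followed by `Vincy`), which may mean the entity text contains a line break. A minimal cleanup sketch follows, assuming a simple heuristic that keeps only capitalized words. The function names are hypothetical, and the rule will still let through capitalized non-characters such as `Providence`.

```python
import re

# One or more capitalized words made of letters, periods, or hyphens
NAME_PATTERN = re.compile(r"[A-Z][A-Za-z.\-]*(?: [A-Z][A-Za-z.\-]*)*")


def clean_entity(name):
    """Collapse internal whitespace and drop a trailing possessive."""
    name = " ".join(name.split())
    return re.sub(r"[’']s$", "", name)


def clean_entities(raw_entities):
    """Return a sorted list of cleaned entities that look like names."""
    cleaned = {clean_entity(name) for name in raw_entities}
    return sorted(name for name in cleaned if NAME_PATTERN.fullmatch(name))


raw = {"Rosamond\nVincy", "Fred Vincy’s", "comme", "marry Rosamond", "Celia"}
print(clean_entities(raw))
```

In the notebook, `clean_entities(character_entities)` would be the corresponding call.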

In [10]:
from textblob import TextBlob

sentence = TextBlob(sentences[0])
print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


Summary Generation

In [11]:
!pip install sumy



Text Summarization

In [3]:
import nltk

In [14]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

num_sentences_summary = 10

parser = PlaintextParser.from_string(main_text, Tokenizer("english"))

# Pair each label with its summarizer so the two cannot drift out of sync
summarizers = {
    "TextRankSummarizer": TextRankSummarizer(),
    "LexRankSummarizer": LexRankSummarizer(),
    "LuhnSummarizer": LuhnSummarizer(),
    "LsaSummarizer": LsaSummarizer(),
}

for name, summarizer in summarizers.items():
    print(f"{name}:")
    for sentence in summarizer(parser.document, num_sentences_summary):
        print(sentence)
    print("-" * 30)


TextRankSummarizer:


KeyboardInterrupt: ignored
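
The interrupted run above suggests that summarizing the whole novel in one pass was too slow. Graph-based summarizers such as TextRank and LexRank score sentences by comparing them with one another, so the work grows roughly with the number of sentence pairs. The sketch below only does the arithmetic and shows one possible workaround, splitting the sentences into fixed-size chunks. The helper names and the chunk size of 500 are assumptions for illustration, not values from the notebook.

```python
def pair_count(n):
    """Number of unordered pairs among n sentences."""
    return n * (n - 1) // 2


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


sentence_total = 11900  # sentence count reported earlier in this notebook

# Pairs to compare when the whole novel is one document
print(pair_count(sentence_total))

# Pairs to compare when sentences are summarized 500 at a time
chunks = chunk(list(range(sentence_total)), 500)
print(len(chunks), sum(pair_count(len(c)) for c in chunks))
```

Each chunk of sentences could then be joined back into a string and passed to `PlaintextParser.from_string` as in the cell above; summarizing per chapter would be a natural alternative to fixed-size chunks.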

Topic Modeling

- Preprocess and Tokenize the Text
- Create Dictionary and Corpus
- LDA for Topic Modeling

In [23]:
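# NOTE: preprocess() is not defined in any cell shown here; it is assumed to
# return a list of lowercase tokens, as the output below suggests.
processed_text = preprocess(main_text)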

print(processed_text[:10])

['prelude', 'book', 'miss', 'brooke', 'chapter', 'chapter', 'ii', 'chapter', 'iii', 'chapter']
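
Since `preprocess` does not appear in the cells shown here, the following is a hypothetical stand-in rather than the author's function. It assumes the behavior the output hints at: lowercasing, keeping alphabetic tokens, and dropping very short tokens and stopwords (note that `'i'` is missing from the output while `'ii'` and `'iii'` remain). The tiny stopword set is illustrative only.

```python
import re

# Deliberately small, illustrative stopword set (an assumption, not the
# list used in the original notebook)
STOPWORDS = {"a", "an", "and", "i", "in", "is", "it", "of", "that", "the", "to"}


def preprocess(text, min_len=2):
    """Lowercase the text and return alphabetic tokens, minus short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]


sample = "PRELUDE.\n\nBOOK I. MISS BROOKE.\nCHAPTER I.\nCHAPTER II."
print(preprocess(sample))
```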


In [29]:
import re
# Adjust the regular expression to match the chapter pattern exactly
chapter_pattern = r'\nCHAPTER [IVXLCDM]+\.'  # This includes the newline character and period

# Use re.split to split the text into chapters
chapters = re.split(chapter_pattern, main_text)

# Drop everything before the first chapter heading
chapters = chapters[1:]

# Filter out any empty strings that might have been created during the split
chapters = [chapter.strip() for chapter in chapters if chapter.strip()]

print(f"Total chapters found: {len(chapters)}")
# Slicing avoids an IndexError if fewer than three chapters are found
for i, chapter in enumerate(chapters[:3], start=1):
    print(f"Start of chapter {i}: {chapter[:100]}")  # Preview first 100 characters

Total chapters found: 86
Start of chapter 1: Since I can do no good because a woman,
Reach constantly at something that is near it.
             
Start of chapter 2: “‘Dime; no ves aquel caballero que hacia nosotros viene sobre un
caballo rucio rodado que trae puest
Start of chapter 3: “Say, goddess, what ensued, when Raphael,
The affable archangel . . .
                    Eve
The st
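
`re.split` discards the headings themselves, so the Roman numeral of each chapter is lost. If the numerals are wanted as metadata, one option is to iterate over the heading matches and slice the text between them. The sketch below assumes the same heading shape as the pattern above (`CHAPTER`, a Roman numeral, and a period at the start of a line); the function name and the sample text are hypothetical.

```python
import re

# Same heading shape as above, with the numeral captured in a group
CHAPTER_HEADING = re.compile(r"^CHAPTER ([IVXLCDM]+)\.", flags=re.MULTILINE)


def split_chapters(text):
    """Return a list of (roman_numeral, chapter_body) tuples."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for idx, match in enumerate(matches):
        body_start = match.end()
        # A chapter runs until the next heading, or to the end of the text
        body_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        chapters.append((match.group(1), text[body_start:body_end].strip()))
    return chapters


# Made-up miniature text, for illustration only
sample = "PRELUDE.\nIntro text.\nCHAPTER I.\nFirst body.\nCHAPTER II.\nSecond body.\n"
for numeral, body in split_chapters(sample):
    print(numeral, "->", body)
```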


In [31]:
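from gensim.corpora import Dictionary

# Treat each chapter as one document. With a single document every token has a
# document frequency of 1, so no_below=5 would filter out the whole vocabulary.
chapter_tokens = [preprocess(chapter) for chapter in chapters]

dictionary = Dictionary(chapter_tokens)
dictionary.filter_extremes(no_below=5, no_above=0.7, keep_n=50000)

# Bag-of-words vector for each chapter
corpus = [dictionary.doc2bow(tokens) for tokens in chapter_tokens]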
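
To make the `no_below` and `no_above` thresholds concrete, here is a plain-Python sketch of document-frequency filtering and bag-of-words counting. It mirrors the idea behind `filter_extremes` and `doc2bow` but is not gensim's implementation; the function names and the toy documents are hypothetical. It also shows why a dictionary built from a single document ends up empty under these thresholds.

```python
from collections import Counter


def build_vocabulary(docs, no_below=5, no_above=0.7):
    """Keep tokens that occur in at least `no_below` documents and in no more
    than `no_above` (a fraction) of all documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = int(no_above * len(docs))
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for tokens found in the vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    return sorted((vocab[t], c) for t, c in counts.items())


# Toy documents, for illustration only
docs = [
    ["dorothea", "casaubon", "marriage"],
    ["dorothea", "lydgate"],
    ["dorothea", "casaubon"],
    ["lydgate", "rosamond", "lydgate"],
]
vocab = build_vocabulary(docs, no_below=2, no_above=0.5)
print(vocab)
print(to_bow(docs[3], vocab))

# A single document leaves nothing after filtering with the notebook's thresholds
print(build_vocabulary([["dorothea", "casaubon"]], no_below=5, no_above=0.7))
```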


Fine-Tune Models
- Narrative Generation and Understanding
- Custom NER Model


In [33]:
!pip install transformers



In [7]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [8]:
fill_mask_pipe = pipeline("fill-mask", model="bert-base-uncased")

# Example usage
masked_text = "The main character in Middlemarch is [MASK]."
predictions = fill_mask_pipe(masked_text)

for prediction in predictions:
    print(prediction)


{'score': 0.037618719041347504, 'token': 4300, 'token_str': 'arthur', 'sequence': 'the main character in middlemarch is arthur.'}
{'score': 0.013670222833752632, 'token': 2848, 'token_str': 'peter', 'sequence': 'the main character in middlemarch is peter.'}
{'score': 0.013061015866696835, 'token': 2852, 'token_str': 'dr', 'sequence': 'the main character in middlemarch is dr.'}
{'score': 0.011167364194989204, 'token': 17001, 'token_str': 'sgt', 'sequence': 'the main character in middlemarch is sgt.'}
{'score': 0.009783466346561909, 'token': 2198, 'token_str': 'john', 'sequence': 'the main character in middlemarch is john.'}
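
Each prediction printed above is a dict with `score`, `token`, `token_str`, and `sequence` keys. For a more compact view, the candidates can be reduced to token and rounded score. The helper below is a hypothetical convenience, shown on three of the predictions copied from the output above.

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token_str, rounded score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


# Subset of the predictions printed above
sample_predictions = [
    {"score": 0.013670222833752632, "token": 2848, "token_str": "peter"},
    {"score": 0.037618719041347504, "token": 4300, "token_str": "arthur"},
    {"score": 0.013061015866696835, "token": 2852, "token_str": "dr"},
]
for token, score in top_tokens(sample_predictions):
    print(f"{token:10s} {score}")
```

The low scores, and the absence of any Middlemarch character among the candidates, suggest that the base model has little specific knowledge of the novel, which fits the fine-tuning goal listed above.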


Further Metadata Extraction
- Character Relationships
- Plot Analysis
- Style and Tone Analysis
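
As a possible starting point for the character relationships item, the sketch below counts how often two character names appear in the same sentence. It is a simple co-occurrence heuristic, not a full relationship model. The function name is hypothetical and the sample sentences are invented for illustration, not quoted from the novel.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, characters):
    """Count sentences in which each pair of character names appears together."""
    counts = Counter()
    for sentence in sentences:
        # Plain substring matching: short names such as "Will" may over-match
        present = sorted(name for name in characters if name in sentence)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented sample sentences, for illustration only
sample_sentences = [
    "Dorothea spoke to Casaubon.",
    "Lydgate met Rosamond.",
    "Dorothea and Casaubon left.",
    "Celia laughed.",
]
names = ["Dorothea", "Casaubon", "Lydgate", "Rosamond", "Celia"]

for pair, count in cooccurrence_counts(sample_sentences, names).most_common():
    print(pair, count)
```

In the notebook, `sentences` from the tokenization cell and a cleaned character list could be passed in place of the sample data.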