# Isolation in Wuthering Heights: A Computational Literary Analysis

### Proposal
I plan to study isolation in Wuthering Heights by prompting a language model trained on the novel. Wuthering Heights, written by Emily Brontë and published in 1847, is a seminal work of English literature and a classic by any standard. The novel explores isolation through the lives of its characters, chiefly Catherine and Heathcliff. Using a language model trained on the text, I will generate and analyze outputs to uncover how Brontë portrays isolation. I will supplement this method with a close reading of key passages, both to validate the model's insights and to show how isolation shapes the characters and their interactions and, conversely, how the characters and settings reflect isolation.

### Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tatianasanchez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tatianasanchez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
nlp = spacy.load('en_core_web_sm')

### Load & Preprocess Data

#### Load

In [8]:
file_path = '/Users/tatianasanchez/Desktop/DigHum150C_final/data/wuthering_heights.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

#### Tokenize

In [9]:
tokens = word_tokenize(text)

#### Lowercase

In [10]:
tokens = [word.lower() for word in tokens]

#### Remove Stopwords

In [11]:
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

#### Clean Text

In [12]:
cleaned_text = ' '.join(tokens)

### Vectorize

In [44]:
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform([cleaned_text])

### Topic Model

In [45]:
lda = LatentDirichletAllocation(n_components=4, random_state=123)
lda.fit(doc_term_matrix)

In [46]:
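# Vocabulary terms, indexed by column of the document-term matrix
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    # argsort is ascending, so the ten highest-weight terms print with the strongest last
    print([terms[i] for i in topic.argsort()[-10:]])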

Topic 0:
['master', 'must', 'shall', 'one', 'could', 'catherine', 'linton', 'said', 'heathcliff', 'would']
Topic 1:
['inspected', 'insisting', 'insipid', 'inserting', 'inscribed', 'insanity', 'inroads', 'inquisitively', 'inspecting', 'larger']
Topic 2:
['inspected', 'insisting', 'insipid', 'inserting', 'inscribed', 'insanity', 'inroads', 'inquisitively', 'inspecting', 'larger']
Topic 3:
['inspected', 'insisting', 'insipid', 'inserting', 'inscribed', 'insanity', 'inroads', 'inquisitively', 'inspecting', 'larger']
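
Topics 1 to 3 above are identical. One likely reason is that `fit_transform([cleaned_text])` passes the whole novel to the vectorizer as a single document, so the document-term matrix has one row and LDA has no variation across documents from which to separate topics. A minimal sketch of one possible remedy, assuming the cleaned tokens are grouped into fixed-size pseudo-documents first (the helper `chunk_tokens` and the chunk size are hypothetical and not part of the original notebook):

```python
def chunk_tokens(tokens, chunk_size=1000):
    """Group a flat token list into space-joined pseudo-documents of up to chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# Toy example; in the notebook, `tokens` would be the cleaned token list.
sample_tokens = ["heathcliff", "moor", "catherine", "window", "linton"]
documents = chunk_tokens(sample_tokens, chunk_size=2)
print(documents)  # ['heathcliff moor', 'catherine window', 'linton']
```

The resulting list could then be passed to `vectorizer.fit_transform(documents)` in place of `[cleaned_text]`, giving LDA one row per chunk. Splitting by chapter instead of by token count would be an equally reasonable choice; the chunk size here is arbitrary.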


In [40]:
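# `topics` is not defined in the cells shown; presumably a dict of topic index to top terms from an earlier run
for idx, top_terms in topics.items():  # renamed so the loop does not overwrite `terms`
    print(f"Topic {idx}: {top_terms}")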

Topic 0: ['join', 'addressing', 'grown', 'delight', 'welcome', 'pulling', 'pure', 'sentiment', 'human', 'faint']
Topic 1: ['join', 'addressing', 'grown', 'delight', 'welcome', 'pulling', 'pure', 'sentiment', 'human', 'faint']
Topic 2: ['master', 'must', 'shall', 'one', 'could', 'catherine', 'linton', 'said', 'heathcliff', 'would']


In [28]:
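# `topics` is not defined in the cells shown; presumably a dict of topic index to top terms from an earlier run
for idx, top_terms in topics.items():  # renamed so the loop does not overwrite `terms`
    print(f"Topic {idx}: {top_terms}")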

Topic 0: ['master', 'must', 'shall', 'one', 'could', 'catherine', 'linton', 'said', 'heathcliff', 'would']
Topic 1: ['join', 'addressing', 'grown', 'delight', 'welcome', 'pulling', 'pure', 'sentiment', 'human', 'faint']
Topic 2: ['join', 'addressing', 'grown', 'delight', 'welcome', 'pulling', 'pure', 'sentiment', 'human', 'faint']
Topic 3: ['join', 'addressing', 'grown', 'delight', 'welcome', 'pulling', 'pure', 'sentiment', 'human', 'faint']
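
For the close reading described in the proposal, candidate passages could be located by searching the raw text for sentences that mention isolation-related words. The sketch below is only an illustration under stated assumptions: the seed vocabulary `ISOLATION_TERMS` and the helper `find_passages` are hypothetical, the sample text is made up and is not a quotation from the novel, and the regex sentence splitter is a simple stand-in for `sent_tokenize`.

```python
import re

# Hypothetical seed vocabulary; an assumption for illustration, not drawn from the notebook.
ISOLATION_TERMS = {"alone", "solitude", "solitary", "lonely", "desolate", "isolated"}


def find_passages(text, terms=ISOLATION_TERMS):
    """Return sentences that contain at least one of the given terms (case-insensitive)."""
    # Naive split after ., ! or ? followed by whitespace; nltk's sent_tokenize is more robust.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & set(terms):
            hits.append(sentence)
    return hits


# Made-up sample text, not a quotation from the novel.
sample = "The house stood alone on the hill. Guests arrived at noon. It was a lonely place!"
for passage in find_passages(sample):
    print(passage)
```

In the notebook, `find_passages(text)` would be called on the unprocessed `text` variable, not on `cleaned_text`, since stopword removal discards the sentence boundaries this search relies on.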
