<a href="https://colab.research.google.com/github/R-802/LING-226-Assignments/blob/main/Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#LING226 2023 T3 Assignment Two
- Shemaiah Rangitaawa
- `300601546`

### **Research Questions**

> *How do linguistic features, including distributional and syntactic measures, contribute to the construction of the descriptive language in bestselling mystery novels compared to bestselling science fiction novels?*
   
This question aims to look into the linguistic features chosen for constructing the profiles. By focusing on distributional and syntactic measures, we can uncover the nuances in language use specific to each genre. The rationale behind this question is to understand the how authors use language in mystery and science fiction, contributing to the overall descriptive elements in their respective genres.

> *In what ways do the results of the linguistic analysis provide insights into the role of setting and atmosphere in genre-specific storytelling within bestselling mystery and science fiction novels?*
   
This question addresses the broader narrative context by linking linguistic analysis results to the role of setting and atmosphere in storytelling. It aims to connect the linguistic profiles with the overarching theme of how descriptive language influences the creation of setting and atmosphere in mystery and science fiction. By exploring this connection, the study seeks to extract meaningful insights into the distinct storytelling approaches employed by each genre.

### **Predictions**
**Descriptive Detail**:
   - Mystery novels will probably use more precise and plot-driven descriptions.
   - Science fiction novels are anticipated to have broader, more imaginative descriptions.

**Lexical Choices**:
   - Mystery novels might use language that evokes suspense and mystery.
   - Science fiction novels are expected to include technical and futuristic terminology.

**Atmosphere and Mood**:
   - Descriptions in mystery novels are predicted to create a tense, suspenseful mood.
   - In science fiction, the language is likely to evoke wonder and exploration.

##**Preprocessing Pipeline for Text Analysis**


In [None]:
import re
import nltk
import spacy
from tqdm.notebook import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Load and configure the spacy model
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 1500000

# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

In [2]:
def clean_text(text):
    text = text.lower()

    # Remove punctuations and numbers
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Remove single character words
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)

    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text)
    return text

In [22]:
def remove_stopwords(text, additional_stopwords=None):
    # NLTK's default English stopwords
    stop_words = set(stopwords.words('english'))

    # Add any additional stopwords provided
    if additional_stopwords:
        stop_words.update(additional_stopwords)

    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_text)

In [4]:
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized_text = " ".join([token.lemma_ for token in doc])
    return lemmatized_text

In [5]:
def remove_character_names(text):
    # Process the text with spaCy
    doc = nlp(text)

    # Identify character names (entities labeled as PERSON)
    character_names = {ent.text for ent in doc.ents if ent.label_ == "PERSON"}

    # Reconstruct the text without character names
    filtered_text = ' '.join([token.text for token in doc if token.text not in character_names])

    return filtered_text

In [6]:
def count_words_in_corpus(corpus):
    return [len(doc.split()) for doc in corpus]

In [8]:
def preprocess(corpus, additional_stopwords=None, lower_percentile=10, upper_percentile=90):
    # Clean and normalize each document in the corpus
    cleaned_corpus = [clean_text(doc) for doc in tqdm(corpus, desc="Cleaning")]

    # Lemmatize the text
    lemmatized_corpus = [lemmatize_text(doc) for doc in tqdm(cleaned_corpus, desc="Lemmatizing")]

    # Remove character names
    no_character_names_corpus = [remove_character_names(doc) for doc in tqdm(lemmatized_corpus, desc="NER and Removal")]

    # Remove stop words
    final_corpus = [remove_stopwords(doc, additional_stopwords) for doc in tqdm(no_character_names_corpus, desc="Removing Stopwords")]

    # Counting words before and after preprocessing
    word_count_before = count_words_in_corpus(corpus)
    word_count_after = count_words_in_corpus(final_corpus)

    # Print corpus statistics
    print("\nCorpus Statistics:")
    print("------------------")
    print(f"Total word count before preprocessing: {sum(word_count_before)}")
    print(f"Total word count after preprocessing: {sum(word_count_after)}")
    print(f"Removed {sum(word_count_before) - sum(word_count_after)} words\n\n")

    return final_corpus

#**Data and Corpus Selection**

> To import the corpus into Colab please download the texts from [this link](https://drive.google.com/drive/folders/15A7y8NRaJv2LRBB6zDm4043G1f9sMTWD) and add under 'My Drive' in Google Drive.

### Selection Rationale
I have chosen ten texts for each genre, following these criteria to ensure a comprehensive and meaningful analysis:

- **Genre Representation**: Each book is a well-recognized example of its genre, ensuring a clear distinction between the mystery and science fiction categories. This selection helps maintain genre purity in the analysis, enabling a focused examination of genre-specific linguistic characteristics. Well-known works in each genre are chosen to represent common patterns and themes that are emblematic of their respective genres.

- **Narrative Styles**: The selection includes a range of narrative styles, from first-person accounts to omniscient narrators, offering diverse syntactic structures. This variety allows for a more nuanced exploration of how different narrative perspectives influence the use of descriptive language. By including a mix of narrative styles, the analysis can uncover how the choice of narrator affects the portrayal of setting and atmosphere.

- **Thematic Variety**: The books cover various sub-genres and themes, providing a rich linguistic variety for analysis. For instance, the mystery genre includes classic whodunits, psychological thrillers, and detective stories, while the science fiction genre encompasses hard sci-fi, space operas, and dystopian narratives. This thematic diversity ensures that the analysis captures a broad spectrum of language usage, avoiding biases that might arise from focusing on a narrow thematic range.

- **Publication Era**: The texts span different publication eras, reflecting the evolution of language and narrative techniques over time. This temporal spread helps in understanding how descriptive language in both genres has evolved and how historical and cultural contexts influence storytelling.

- **Authorial Background**: The authors of these texts come from varied backgrounds, contributing to diverse perspectives and styles in their writing. This inclusion enriches the linguistic analysis by introducing different cultural and individual influences in the use of language.

- **Critical and Popular Reception**: The texts are a mix of critically acclaimed works and popular bestsellers. This ensures that the analysis encompasses both literary quality and mass appeal, reflecting a balance between artistic merit and accessibility to a broader audience.

- **Ease of Acquisition**: Each novel in the corpora was relatively easy to find online from various websites.

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
import os

def read_text_files(directory_path):
    """Reads all text files in a directory and returns a set of their contents."""
    text_contents = set()
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                full_text = file.read()
                text_contents.add(full_text)
    return text_contents

In [11]:
def get_titles(text_set):
    """Returns a list of the first lines of each text in a set."""
    titles = []
    for text in text_set:
        first_line = text.split('\n', 1)[0]
        titles.append(first_line)
    return titles

mystery_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/Mystery/'
scifi_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/SciFi/'

mystery_texts = read_text_files(mystery_path)
mystery_titles = get_titles(mystery_texts)
print("Mystery Novels:")
for title in mystery_titles:
    print(title)

scifi_texts = read_text_files(scifi_path)
scifi_titles = get_titles(scifi_texts)
print("\nScifi Novels:")
for title in scifi_titles:
    print(title)

Mystery Novels:
[The Da Vinci Code by Dan Brown 2003]
[In the Woods by Tana French 2007]
[Before I Go to Sleep by S.J Watson 2008]
[Sharp Objects by Gillian Flynn 2006]
[The Girl on the Train by Paula Hawkins 2015]
[The Woman in the Window by A.J. Finn 2018]
[And Then There Were None by Agatha Christie 1939]
[The Girl With The Dragon Tattoo by Stieg Larsson 2005]
[The Name of the Rose by Umberto Eco 1980]
[Murder on the Orient Express by Agatha Christie 1934]

Scifi Novels:
[DUNE by Frank Herbert 1965]
[Snow Crash by Neal Stephenson 1992]
[Neuromancer by William Gibson 1984]
[Brave New World by Aldous Huxley 1931]
[Nineteen Eighty-Four by George Orwell 1949]
[Fahrenheit 451 by Ray Bradbury 1953]
[The War of the Worlds by H. G. Wells 1898]
[Children of Time by Adrain Tchaokovsky 2015]
[The Hitchhiker’s Guide to the Galaxy by Douglas Adams 1979]
[The Martian by Andy Weir 2011]


# **Feature Extraction Functions**

In [None]:
import numpy as np
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.util import bigrams

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

**Lexical Diversity**

> `lexical_diversity` calculates the ratio of unique words to the total number of words in a text, providing a measure of vocabulary richness for individual texts. `calculate_average_lexical_diversity` function computes the average lexical diversity across a collection of texts, offering an aggregated measure of vocabulary variety for comparative analysis.

In [13]:
def lexical_diversity(text):
    """Calculate lexical diversity as the ratio of unique words to total words."""
    tokens = word_tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0

def calculate_average_lexical_diversity(texts):
    """Calculate the average lexical diversity for a set of texts."""
    diversities = [lexical_diversity(text) for text in texts]
    return np.mean(diversities)

**Sentiment Analysis**

> This sentiment analysis implementation utilizes the BERT model from Hugging Face's Transformers library, specifically the `bert-base-uncased` variant. The process begins with the `segment_text` function, which divides a given text into smaller segments based on paragraph breaks, ensuring each segment's length doesn't exceed BERT's token limit (512 tokens). The `sentiment_analysis` function then processes these segments. Each segment is tokenized using BERT's tokenizer and fed into the BERT model for sentiment classification. The model, pre-loaded with weights for the base layers and untrained weights for the classification layer, runs on a GPU (if available) for better performance. For each text segment, the model outputs sentiment scores, which represent the polarity of sentiment (ranging from -1 for negative to 1 for positive sentiment). These scores can then be aggregated or analyzed further to understand the overall sentiment of the text. The implementation is designed to handle large texts (like novels) efficiently, making it suitable for in-depth sentiment analysis across various forms of lengthy written content.

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Setting up the device for GPU usage
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading the tokenizer and model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name).to(device)

cuda_available = torch.cuda.is_available()
gpu_name = torch.cuda.get_device_name(0) if cuda_available else "No CUDA Device Available"

cuda_available, gpu_name

In [15]:
def segment_text(text, max_length=512):
    # Split the text by newline characters and remove empty paragraphs
    paragraphs = [para for para in text.split('\n') if para.strip() != '']
    segments = []
    current_segment = ""

    for paragraph in paragraphs:
        if len(current_segment) + len(paragraph) <= max_length:
            current_segment += " " + paragraph if current_segment else paragraph
        else:
            if current_segment:
                segments.append(current_segment)
            current_segment = paragraph

    # Add the last segment if it contains any text
    if current_segment:
        segments.append(current_segment)

    return segments

def sentiment_analysis(text):
    segments = segment_text(text)
    sentiments = []

    for segment in segments:
        inputs = tokenizer(segment, return_tensors="pt", truncation=True, max_length=512, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():  # Disable gradient calculation for inference
            outputs = model(**inputs)
            prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
            sentiment_score = prediction[:,1].item() - prediction[:,0].item()  # Positive - Negative
            sentiments.append(sentiment_score)

    return sentiments

def summarize_sentiments(sentiment_scores):
    # Flatten the list of lists into a single list of scores
    flat_scores = [score for sublist in sentiment_scores for score in sublist]

    # Calculate the average sentiment score
    avg_sentiment = sum(flat_scores) / len(flat_scores) if flat_scores else 0

    # Categorize the overall sentiment
    if avg_sentiment > 0:
        sentiment_category = 'Positive'
    else:  # Note: This includes avg_sentiment = 0 being labeled as 'Negative'
        sentiment_category = 'Negative'

    return avg_sentiment, sentiment_category

**Frequency of Adjectives and Adverbs**
> The text is tokenized and POS-tagged to count adjectives and adverbs. This count reflects the extent of descriptive language used.

In [16]:
def mean_adj_adv_freq_combined(texts):
    total_adjectives = 0
    total_adverbs = 0
    total_texts = len(texts)

    for text in texts:
        words = word_tokenize(text)
        tagged = pos_tag(words)

        for word, tag in tagged:
            if tag.startswith("JJ"):
                total_adjectives += 1
            elif tag.startswith("RB"):
                total_adverbs += 1

    # Calculating the mean for adjectives and adverbs
    avg_adjectives = total_adjectives / total_texts if total_texts > 0 else 0
    avg_adverbs = total_adverbs / total_texts if total_texts > 0 else 0

    return {"average_adjectives": avg_adjectives, "average_adverbs": avg_adverbs}

 **Part of Speech**

In [17]:
def pos_frequency(text):
    words = word_tokenize(text)
    tagged = pos_tag(words)
    pos_counts = {}
    for word, tag in tagged:
        pos_counts[tag] = pos_counts.get(tag, 0) + 1
    return pos_counts

**Top 10 Bigrams**

In [18]:
def get_bigrams(text):
    tokens = word_tokenize(text)
    bigram_counts = Counter(bigrams(tokens))
    return bigram_counts.most_common(10)

**Top 10 Words**

In [19]:
def get_most_common_words(text, num_words=10):
    tokens = word_tokenize(text)
    words = [word.lower() for word in tokens if word.isalpha()]
    word_freq = Counter(words)
    return word_freq.most_common(num_words)

# **Analysis**

## Lexicon Information
- **Lexical Diversity**: This metric was chosen to compare the complexity of language between mystery and science fiction genres. By measuring the variety in word usage, we can infer the level of sophistication and range of vocabulary, which is often indicative of the target audience and thematic depth of the genre.

- **Sentiment**: Included to gauge the emotional undercurrents prevalent in each genre, reflective of their respective thematic focuses. This analysis helps in understanding how the narrative tone of a genre, like the suspense in mystery or the awe in science fiction, shapes the reader's emotional journey.

- **Adjective and Adverb Frequency**: Used to quantify the descriptiveness and detail orientation in the narrative styles of both genres. This measure reveals the extent to which each genre relies on detailed descriptions to create vivid and immersive story worlds.

## Distributional and Syntactic Information
- **Part of Speech**: Implemented to highlight differences in narrative structures, such as the possible use of action-oriented verbs in science fiction versus descriptive adjectives in mystery. This helps in understanding how different genres employ various parts of speech to create their unique storytelling styles.

- **Bigrams**: Selected to uncover typical phrase patterns, providing insights into thematic elements and narrative techniques unique to each genre. By analyzing common bigrams, we can identify genre-specific tropes and stylistic preferences.

## Preprocessing Decisions
In the development of the preprocessing pipeline, several steps were chosen to refine the texts for analysis, ensuring alignment with the nature of the texts and the objectives of the assignment:

- **Text Normalization:**
This step ensures that texts from various sources are standardized, removing formatting and stylistic variations that aren't relevant to the linguistic analysis. The goal is to focus on the actual language content, free from inconsistencies that could skew comparative insights.

- **Lemmatization:**
Lemmatization was chosen to achieve a more accurate reflection of language use across different genres. By reducing words to their base forms, this process enables a fair comparison of vocabulary usage between various texts.

  Initially, the lemmatization was implemented using NLTK's WordNet Lemmatizer. However, despite its advantages in terms of simplicity and speed, it was observed that NLTK's approach could sometimes lack accuracy. After some research, I found that the inaccuracy is a result of NLTK's lemmatizer primarily relying on WordNet's morphological analysis, which doesn't always consider the contextual meanings of words. To address this limitation and enhance the precision of the analysis, I transitioned to using SpaCy's lemmatizer.

- **Character Name Removal:**
This step was important for shifting the focus from specific characters to the broader narrative and stylistic elements of the genre. Character names, especially in fiction, can recur frequently and could dominate the word frequency analysis, potentially overshadowing other significant linguistic features that are more reflective of the genre's characteristics.

- **Stop Word Removal:**
Implemented to filter out the linguistic 'noise' in the texts. Common words, while structurally important, often don't contribute to the thematic or stylistic understanding of a genre. Removing them sharpens the focus on more meaningful and genre-defining words and phrases, enabling a clearer analysis of the distinctive linguistic patterns.

##**Preprocessing**
> This cell can take up to 15 minutes to execute on a Tesla T4

In [23]:
additional_stopwords = {'could', 'would', 'might', 'ebook', 'com', 'sophie',
                        'justice', 'wargrave', 'mona', 'lisa', 'megan', 'cassie'
                        'molly', 'holsten', 'zaphod', 'bernard', 'guy', 'john',
                        'baron', 'poirot', 'mr', 'mrs', 'leigh', 'teabing', 'lombard',
                        'salander', 'amma', 'monk', 'colonel', 'arbuthnot', 'phone',
                        'abbot', 'gon', 'na', 'editor', 'feyd', 'rautha', 'muad', 'dib',
                        'bene', 'gesserit', 'molly', 'holsten', 'got', 'ta', 'wan'}
mystery_processed_texts = preprocess(mystery_texts, additional_stopwords)
scifi_processed_texts = preprocess(scifi_texts, additional_stopwords)

Cleaning:   0%|          | 0/10 [00:00<?, ?it/s]

Lemmatizing:   0%|          | 0/10 [00:00<?, ?it/s]

NER and Removal:   0%|          | 0/10 [00:00<?, ?it/s]

Removing Stopwords:   0%|          | 0/10 [00:00<?, ?it/s]


Corpus Statistics:
------------------
Total word count before preprocessing: 1164782
Total word count after preprocessing: 541395
Removed 623387 words




Cleaning:   0%|          | 0/10 [00:00<?, ?it/s]

Lemmatizing:   0%|          | 0/10 [00:00<?, ?it/s]

NER and Removal:   0%|          | 0/10 [00:00<?, ?it/s]

Removing Stopwords:   0%|          | 0/10 [00:00<?, ?it/s]


Corpus Statistics:
------------------
Total word count before preprocessing: 1050113
Total word count after preprocessing: 513942
Removed 536171 words




## **Feature Extraction and Profile Generation**
> This cell can take up to 10 minutes to execute on a Tesla T4

In [24]:
mystery_profile = {
    "Lexical Diversity": calculate_average_lexical_diversity(mystery_processed_texts),
    "Sentiment": [sentiment_analysis(text) for text in mystery_texts], # Sentiment of unprocessed texts
    "Adjective and Adverb Frequency": mean_adj_adv_freq_combined(mystery_processed_texts),
    "Top 10 Bigrams": get_bigrams(" ".join(mystery_processed_texts)),
    "Top 10 Words" : get_most_common_words(" ".join(mystery_processed_texts), num_words=10),
    "Part of Speech": pos_frequency(' '.join(mystery_processed_texts))
}

scifi_profile = {
    "Lexical Diversity": calculate_average_lexical_diversity(scifi_processed_texts),
    "Sentiment": [sentiment_analysis(text) for text in scifi_texts],
    "Adjective and Adverb Frequency": mean_adj_adv_freq_combined(scifi_processed_texts),
    "Top 10 Bigrams": get_bigrams(" ".join(scifi_processed_texts)),
    "Top 10 Words" : get_most_common_words(" ".join(scifi_processed_texts), num_words=10),
    "Part of Speech": pos_frequency(' '.join(scifi_processed_texts))
}

# **Results**

## **Question One**

> *How do linguistic features, including distributional and syntactic measures, contribute to the construction of the descriptive language in bestselling mystery novels compared to bestselling science fiction novels?*

In exploring this question, I focused on a set of quantifiable linguistic features to highlight the stylistic distinctions between mystery and science fiction novels. My analytical approach included:

- **Lexical Diversity**: This provided insights into the vocabulary's range and richness in each genre, illustrating how mystery and science fiction authors uniquely leverage language to set their narratives' tone and atmosphere.

- **Sentiment Analysis**: This aspect evaluated the emotional undertones within the texts, shedding light on how mystery and science fiction novels differently evoke moods and influence reader experiences, ranging from suspense and intrigue in mystery to awe and wonder in science fiction.

- **Adjective and Adverb Frequency**: The examination of these parts of speech offered insights into the level of descriptive detail employed by each genre. It highlighted the extent to which authors use adjectives and adverbs for world-building, creating vivid imagery, and character development.

- **Top 10 Most Common Bigrams**: This analysis revealed common phrase structures, instrumental in identifying recurring themes and narrative styles in each genre. It underscored the unique linguistic patterns that define mystery and science fiction storytelling.

- **Top 10 Most Common Words**: Analyzing the most frequently used words in each genre provided a direct insight into the thematic focus and narrative elements prevalent in mystery and science fiction. It reflected the character-driven nature and thematic emphasis of each genre.

- **Part of Speech Analysis**: This syntactic study showcased how each genre utilizes different parts of speech to weave their narratives, revealing stylistic preferences such as the use of descriptive language in mystery versus more action-oriented dialogue in science fiction.

## **Discussion**

### Mystery Novels

#### **Lexical Diversity**
Mystery novels, with a lexical diversity score of 0.12, tend to use a more focused vocabulary. This repetitive use of familiar terms and phrases is crucial in a genre centered around puzzles and mysteries, as it helps build tension and familiarity, guiding the reader through a labyrinth of clues and red herrings.

#### **Sentiment Analysis**
The positive average sentiment score (0.2862) in mystery novels might initially seem at odds with the genre's typically darker themes. However, this positivity could reflect the narrative's eventual resolution of conflicts and mysteries, offering a sense of closure and satisfaction. The novel selection could influence this score, suggesting a potential for broader sentiment range with a more diverse selection.

#### **Adjective and Adverb Usage**
The high frequency of adjectives (11,301.6) and adverbs (4,335.8) in mystery novels underscores the genre's reliance on detailed, descriptive language. This is essential for vividly painting scenes, whether they're crime scenes, suspicious settings, or intense character interactions, thereby heightening suspense and engagement.

#### **Common Bigrams and Words**
The prevalence of bigrams like “tell I,” “shake head,” and “cassie say,” coupled with frequently used verbs such as “say,” “go,” and “see,” indicates a narrative heavily anchored in dialogue. This stylistic choice is pivotal in advancing the plot and developing characters in mystery novels, often through investigative dialogues and interrogations.

#### **Part of Speech Frequencies**
A significant presence of nouns (NN) and adjectives (JJ) in mystery novels points towards a narrative style rich in descriptive elements, focusing on people, objects, and settings. The prominence of present-tense verbs (VBP) and adverbs (RB) further adds to the genre's dynamic and active storytelling approach.

### Science Fiction Novels

#### **Lexical Diversity**
The slightly higher lexical diversity score (0.14) in science fiction novels suggests a broader vocabulary. This is in line with the genre's penchant for exploring diverse concepts, settings, and futuristic technologies, necessitating an extensive and varied linguistic repertoire.

#### **Sentiment Analysis**
The lower average sentiment score (0.2612) in science fiction might reflect the complex and often more challenging themes of the genre, such as dystopian futures or ethical quandaries. The selection of novels can significantly impact this score, indicating the potential for varied sentiment expressions across different science fiction narratives.

#### **Adjective and Adverb Usage**
Although slightly lower than in mystery novels, the still substantial frequency of adjectives and adverbs in science fiction signifies a richly descriptive genre. This is crucial for crafting the intricate and often alien settings typical of science fiction, providing a vivid backdrop for the narrative.

#### **Common Bigrams and Words**
Unique bigrams like “old man,” “great nest,” and “reverend mother” in science fiction novels indicate a distinctive focus on character and setting. Words like “time” and “like” hint at frequent explorations of speculative concepts and imaginative comparisons, a hallmark of the genre.

#### **Part of Speech Frequencies**
Science fiction shares a similar prominence of nouns (NN) and adjectives (JJ) with mystery novels, but the lesser use of present tense verbs (VBP) suggests a narrative style possibly more focused on descriptive world-building than immediate action or dialogue.

### **Comparative Analysis**

Comparatively, mystery novels demonstrate a more focused vocabulary and a greater reliance on dialogue-driven narrative, reflecting the genre’s emphasis on plot development through conversation and interrogation. In contrast, science fiction tends to employ a broader vocabulary and is more inclined towards expansive and speculative world-building, as seen in its unique bigrams and word choices.

In summary, both genres effectively utilize descriptive language, but their approaches diverge to meet their specific storytelling needs. Mystery novels leverage language to create suspense and unfold the plot through dialogue, while science fiction uses it to construct complex worlds and explore speculative ideas. This analysis highlights the critical role of linguistic features in defining the narrative style and thematic essence of different literary genres.

In [26]:
# @title ##Formatting {display-mode: "form"}
def print_profile(profile, title):
    print(f"{title}")
    print("-" * len(title))

    # Lexical Diversity
    print(f"\nLexical Diversity: {profile['Lexical Diversity']:.2f}")

    # Sentiment
    sentiment = summarize_sentiments(profile['Sentiment'])
    print(f"\nSentiment: Average Score: {sentiment[0]:.4f}, Category: {sentiment[1]}")

    # Adjective and Adverb Frequency
    adj_adv_freq = profile.get("Adjective and Adverb Frequency", "Data not available")
    print(f"\nAdjective and Adverb Frequency: {adj_adv_freq}")

    # Top 10 Most Common Bigrams
    print("\nTop 10 Most Common Bigrams:")
    for bigram, frequency in profile['Top 10 Bigrams']:
        print(f"  {' '.join(bigram)}: {frequency}")

    # Top 10 Most Common Words
    print("\nTop 10 Most Common Words:")
    for word_info in profile['Top 10 Words']:
        word, frequency = word_info[:2]
        print(f"  {word}: {frequency}")

    # Part of Speech - Top 10
    print("\nTop 10 Part of Speech Frequencies:")
    pos_sorted = sorted(profile['Part of Speech'].items(), key=lambda item: item[1], reverse=True)[:10]
    for pos, count in pos_sorted:
        print(f"  {pos}: {count}")

In [27]:
print_profile(mystery_profile, "Mystery")

Mystery
-------

Lexical Diversity: 0.12

Sentiment: Average Score: 0.0143, Category: Positive

Adjective and Adverb Frequency: {'average_adjectives': 11301.6, 'average_adverbs': 4335.8}

Top 10 Most Common Bigrams:
  tell I: 874
  shake head: 369
  cassie say: 305
  give I: 299
  look I: 280
  go back: 256
  I say: 250
  look like: 238
  ask I: 214
  come back: 208

Top 10 Most Common Words:
  say: 9842
  i: 8336
  go: 4703
  know: 4664
  one: 4044
  see: 3610
  look: 3455
  think: 3401
  like: 3377
  get: 2951

Top 10 Part of Speech Frequencies:
  NN: 243681
  JJ: 110975
  VBP: 53060
  RB: 42717
  VB: 22386
  IN: 15499
  PRP: 8464
  NNS: 8441
  VBD: 8172
  CD: 8082


In [28]:
print_profile(scifi_profile, "Science Fiction")

Science Fiction
---------------

Lexical Diversity: 0.14

Sentiment: Average Score: 0.0017, Category: Positive

Adjective and Adverb Frequency: {'average_adjectives': 11166.7, 'average_adverbs': 3927.1}

Top 10 Most Common Bigrams:
  old man: 226
  tell I: 223
  look like: 222
  great nest: 191
  shake head: 155
  reverend mother: 151
  long time: 131
  venkat say: 130
  case say: 129
  give I: 128

Top 10 Most Common Words:
  say: 7968
  one: 3899
  go: 3282
  get: 3179
  see: 3102
  like: 2949
  know: 2877
  i: 2703
  think: 2491
  time: 2469

Top 10 Part of Speech Frequencies:
  NN: 241147
  JJ: 109728
  VBP: 42354
  RB: 38615
  VB: 19442
  IN: 14839
  NNS: 8885
  CD: 8310
  VBD: 7847
  VBG: 5020


In [29]:
# @title ##Graphing {display-mode: "form"}

import plotly.graph_objects as go

def plot_pos_pie_chart(pos_counts, title):
    total = sum(pos_counts.values())
    filtered_pos_counts = {tag: count for tag, count in pos_counts.items() if (count / total) * 100 >= 2}

    labels = list(filtered_pos_counts.keys())
    values = list(filtered_pos_counts.values())

    # Create a pie chart
    fig = go.Figure(go.Pie(
        labels=labels,
        values=values,
        hoverinfo='label+percent',
        textinfo='percent+label'
    ))

    # Update layout with the title and the key
    fig.update_layout(
        title=title,
        legend_title="Part of Speech"
    )

    # Show the plot
    fig.show()

#### POS Frequencies


In [31]:
plot_pos_pie_chart(scifi_profile["Part of Speech"], "Part of Speech Frequency in Science Fiction")

In [32]:
plot_pos_pie_chart(mystery_profile["Part of Speech"], "Part of Speech Frequency in Mystery")

## Question Two
> *In what ways do the results of the linguistic analysis provide insights into the role of setting and atmosphere in genre-specific storytelling within bestselling mystery and science fiction novels?*

### Mystery Novels

#### **Creation of Suspenseful and Descriptive Settings**
   - **Adjective and Adverb Usage**: The high frequency of adjectives and adverbs in mystery novels contributes to creating highly detailed and vivid settings. Descriptive language is used to paint a picture of each scene, whether it’s the dark alleyways of a crime scene or the subtle nuances of a suspect’s living room. This meticulous attention to detail is essential in a genre where the setting often holds clues to the mystery.

#### **Atmosphere of Tension and Uncertainty**
   - **Common Bigrams and Words**: The frequent use of dialogue-centric bigrams (e.g., “tell I,” “shake head”) and action-oriented verbs (e.g., “say,” “go,” “see”) contribute to an atmosphere of tension and dynamism. Dialogue drives the plot forward and often reveals crucial information, keeping readers engaged and perpetuating a sense of suspense.

#### **Enhanced Reader Engagement**
   - **Part of Speech Frequencies**: The dominance of nouns and adjectives indicates a focus on concrete imagery and specific details, which are key to immersing readers in the mystery and encouraging them to actively engage with the narrative, piecing together clues and theorizing alongside the characters.

### Science Fiction Novels

#### **Elaborate and Imaginative World-Building**
   - **Lexical Diversity**: The broader vocabulary range in science fiction is instrumental in constructing intricate and often fantastical worlds. These novel settings—whether they’re distant planets, futuristic cities, or alternate realities—are brought to life through the genre’s rich and diverse language.

#### **Atmospheric Depth and Speculative Elements**
   - **Adjective and Adverb Usage**: The substantial use of adjectives and adverbs helps in fleshing out the unique atmospheres of science fiction settings. Descriptions of advanced technologies, alien landscapes, or futuristic societies are not just visually immersive but also provoke thought about the possibilities of the future, a key element of the genre.

#### **Thematic and Conceptual Exploration**
   - **Common Bigrams and Words**: The use of distinctive bigrams and thematic words (e.g., “old man,” “great nest,” “time”) reflects the genre's focus on broader themes and concepts. This linguistic choice contributes to an atmosphere that is often reflective, thought-provoking, and oriented towards larger existential or ethical questions.

### Comparative Insights

In mystery novels, the linguistic elements work together to create a setting that is both detailed and suspenseful, inviting readers to become detectives themselves. The atmosphere is one of tension and engagement, where every word can be a clue and every description potentially holds the key to solving the mystery.

In contrast, science fiction novels use their linguistic repertoire to build expansive and imaginative settings that go beyond the familiar. The atmosphere in these novels is often awe-inspiring, contemplative, and speculative, encouraging readers to explore the possibilities of the unknown and the implications of advanced technologies or alternate realities.

Thus, the results of the linguistic analysis underscore the integral role of setting and atmosphere in genre-specific storytelling. In mystery novels, language is a tool for creating a tightly woven, suspenseful environment, while in science fiction, it serves to expand the reader's imagination and explore new frontiers, both physical and conceptual.

## Limitations and Improvements

Better Named Entity Recognition

A more agressive filtering approach

A larger corpus selection

