<a href="https://colab.research.google.com/github/R-802/LING-226-Assignments/blob/main/Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#LING226 2023 T3 Assignment Two
- Shemaiah Rangitaawa
- `300601546`

### **Research Questions**

> How do linguistic features, including distributional and syntactic measures, contribute to the construction of the descriptive language in bestselling mystery novels compared to bestselling science fiction novels?
   
This question aims to delve into the linguistic features chosen for constructing the profiles. By focusing on distributional and syntactic measures, we can uncover the nuances in language use specific to each genre. The rationale behind this question is to understand the fine-grained details of how authors employ language in mystery and science fiction, contributing to the overall descriptive elements in their respective genres.

> In what ways do the results of the linguistic analysis provide insights into the role of setting and atmosphere in genre-specific storytelling within bestselling mystery and science fiction novels?
   
This question addresses the broader narrative context by linking linguistic analysis results to the role of setting and atmosphere in storytelling. It aims to connect the linguistic profiles with the overarching theme of how descriptive language influences the creation of setting and atmosphere in mystery and science fiction. By exploring this connection, the study seeks to extract meaningful insights into the distinct storytelling approaches employed by each genre.

### **Predictions**
**Descriptive Detail**:
   - Mystery novels will probably use more precise and plot-driven descriptions.
   - Science fiction novels are anticipated to have broader, more imaginative descriptions.

**Lexical Choices**:
   - Mystery novels might use language that evokes suspense and mystery.
   - Science fiction novels are expected to include technical and futuristic terminology.

**Atmosphere and Mood**:
   - Descriptions in mystery novels are predicted to create a tense, suspenseful mood.
   - In science fiction, the language is likely to evoke wonder and exploration.

##**Preprocessing Pipeline for Text Analysis**
**Text Cleaning**: This step involves converting all text to lowercase, removing punctuation, and eliminating numbers. This standardization ensures uniformity and relevance in the analysis.

**Removing Stop Words**: Common words like "the", "is", and "in" are removed using NLTK's predefined list of stop words. These words are irrelevant to the overall meaning in most analysis contexts.

**Custom TF-IDF Vectorization**: Extends standard TF-IDF by applying thresholds to filter out words based on their frequency across the books. This approach allows us to focus on words that are uniquely significant to the novel being analyzed.

In [19]:
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

True

In [20]:
def clean_text(text):
    # Lowercasing
    text = text.lower()

    # Remove punctuations and numbers
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Remove single character words
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)

    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text)
    return text

In [21]:
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_text)

In [30]:
def tfidf_filter(corpus, lower_percentile=10, upper_percentile=90):
    # Vectorize the corpus
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # Calculate the inverse document frequency
    idf = vectorizer.idf_
    idf_dict = dict(zip(vectorizer.get_feature_names_out(), idf))

    # Determine lower and upper thresholds based on percentiles
    lower_threshold = np.percentile(list(idf_dict.values()), lower_percentile)
    upper_threshold = np.percentile(list(idf_dict.values()), upper_percentile)

    # Filter out words outside the middle range of TF-IDF scores
    middle_tfidf_words = {word: score for word, score in idf_dict.items()
                          if lower_threshold <= score <= upper_threshold}

    # Re-create the corpus without extreme TF-IDF words
    cleaned_corpus = []
    for document in corpus:
        tokens = document.split()
        filtered_tokens = [token for token in tokens if token in middle_tfidf_words]
        cleaned_corpus.append(' '.join(filtered_tokens))

    return cleaned_corpus

In [31]:
def preprocess(corpus, lower_percentile=10, upper_percentile=90):
    # Clean and normalize each document in the corpus
    cleaned_corpus = [clean_text(doc) for doc in corpus]

    # Remove stop words
    no_stopwords_corpus = [remove_stopwords(doc) for doc in cleaned_corpus]

    # Remove words outside the middle range of TF-IDF scores
    final_corpus = tfidf_filter(no_stopwords_corpus, lower_percentile, upper_percentile)
    return final_corpus

#**Data and Corpus Selection**

> To import the corpus into Colab please download the texts from [this link](https://drive.google.com/drive/folders/15A7y8NRaJv2LRBB6zDm4043G1f9sMTWD) and add under 'My Drive' in Google Drive.

### Selection Rationale
I have chosen ten texts for each genre, following these criteria to ensure a comprehensive and meaningful analysis:

- **Genre Representation**: Each book is a well-recognized example of its genre, ensuring a clear distinction between the mystery and science fiction categories. This selection helps maintain genre purity in the analysis, enabling a focused examination of genre-specific linguistic characteristics. Well-known works in each genre are chosen to represent common patterns and themes that are emblematic of their respective genres.

- **Narrative Styles**: The selection includes a range of narrative styles, from first-person accounts to omniscient narrators, offering diverse syntactic structures. This variety allows for a more nuanced exploration of how different narrative perspectives influence the use of descriptive language. By including a mix of narrative styles, the analysis can uncover how the choice of narrator affects the portrayal of setting and atmosphere.

- **Thematic Variety**: The books cover various sub-genres and themes, providing a rich linguistic variety for analysis. For instance, the mystery genre includes classic whodunits, psychological thrillers, and detective stories, while the science fiction genre encompasses hard sci-fi, space operas, and dystopian narratives. This thematic diversity ensures that the analysis captures a broad spectrum of language usage, avoiding biases that might arise from focusing on a narrow thematic range.

- **Publication Era**: The texts span different publication eras, reflecting the evolution of language and narrative techniques over time. This temporal spread helps in understanding how descriptive language in both genres has evolved and how historical and cultural contexts influence storytelling.

- **Authorial Background**: The authors of these texts come from varied backgrounds, contributing to diverse perspectives and styles in their writing. This inclusion enriches the linguistic analysis by introducing different cultural and individual influences in the use of language.

- **Critical and Popular Reception**: The texts are a mix of critically acclaimed works and popular bestsellers. This ensures that the analysis encompasses both literary quality and mass appeal, reflecting a balance between artistic merit and accessibility to a broader audience.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
import os

def read_text_files(directory_path):
    """Reads all text files in a directory and returns a set of their contents."""
    text_contents = set()
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                full_text = file.read()
                text_contents.add(full_text)
    return text_contents

In [7]:
def get_titles(text_set):
    """Returns a list of the first lines of each text in a set."""
    titles = []
    for text in text_set:
        first_line = text.split('\n', 1)[0]
        titles.append(first_line)
    return titles

mystery_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/Mystery/'
scifi_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/SciFi/'

mystery_texts = read_text_files(mystery_path)
mystery_titles = get_titles(mystery_texts)
print("Mystery Novels:")
for title in mystery_titles:
    print(title)

scifi_texts = read_text_files(scifi_path)
scifi_titles = get_titles(scifi_texts)
print("\nScifi Novels:")
for title in scifi_titles:
    print(title)

Mystery Novels:
[Sharp Objects by Gillian Flynn 2006]
[The Name of the Rose by Umberto Eco 1980]
[The Girl on the Train by Paula Hawkins 2015]
[Before I Go to Sleep by S.J Watson 2008]
[The Girl With The Dragon Tattoo by Stieg Larsson 2005]
[In the Woods by Tana French 2007]
[And Then There Were None by Agatha Christie 1939]
[The Woman in the Window by A.J. Finn 2018]
[The Da Vinci Code by Dan Brown 2003]
[Murder on the Orient Express by Agatha Christie 1934]

Scifi Novels:
[Neuromancer by William Gibson 1984]
[Brave New World by Aldous Huxley 1931]
[The War of the Worlds by H. G. Wells 1898]
[DUNE by Frank Herbert 1965]
[The Hitchhiker’s Guide to the Galaxy by Douglas Adams 1979]
[Snow Crash by Neal Stephenson 1992]
[The Martian by Andy Weir 2011]
[Children of Time by Adrain Tchaokovsky 2015]
[Nineteen Eighty-Four by George Orwell 1949]
[Fahrenheit 451 by Ray Bradbury 1953]


# **Feature Extraction and Lexical Profile Generation**

## Lexicon Information

**Lexical Diversity and Vocabulary Richness**: Exploring lexical diversity and vocabulary richness in texts offer insights into how authors in different genres, like mystery and science fiction, utilize language. This feature measures the variety and sophistication of the vocabulary, shedding light on the linguistic intricacies that each genre employs to create its unique narrative worlds and tones. In doing so, it allows us to determine the thematic emphases of each genre, as mystery novels might lean towards a more nuanced, reality-anchored lexicon, whereas science fiction could delve into more innovative and speculative terminologies, each contributing to their distinctive atmospheres.

**Sentiment Analysis**: By evaluating the emotional tone, such as the prevalent suspense in mysteries or the awe in science fiction, this feature helps in dissecting how authors manipulate emotions to shape the reader's experience. It is particularly telling in understanding the intended emotional journey and the atmospheric creation within the narrative settings of each genre, revealing how authors use sentiment to immerse readers in the world they have created.

**Frequency of Adjectives and Adverbs**: Since adjectives and adverbs are often used to describe settings and characters, analyzing their frequency can reveal how visually or sensorially detailed each genre is. This can inform how each genre crafts its atmosphere and setting.

## Distributional Information

**Collocates**:
Collocate analysis delves into the contextual associations of key words within different genres, such as mystery and science fiction. It uncovers the thematic nuances and settings unique to each genre by examining how specific words cluster with others, providing insights into the distinct narrative techniques and atmospheres each genre employs.

**Part of Speech (POS) Analysis**:
POS analysis offers a look into the syntactic composition of different literary genres. By comparing how various parts of speech are used in mystery versus science fiction texts, we can infer stylistic and narrative preferences, such as the prevalence of descriptive language versus action-oriented dialogue, each contributing to the creation of unique settings and atmospheres.

**Bigram Analysis**:
Bigram analysis reveals the characteristic phrase structures in different genres, highlighting how specific word pairings contribute to the thematic and atmospheric construction in mystery and science fiction novels. This analysis is instrumental in understanding the narrative style and world-building techniques of each genre.

##**Extraction Function Definitions**

In [8]:
import nltk
from nltk import pos_tag
from nltk import bigrams, word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('vader_lexicon', quiet=True)

True

**Lexical Diversity**
   - **Implementation**:

In [9]:
def lexical_diversity(text):
    """Calculate lexical diversity as the ratio of unique words to total words."""
    tokens = word_tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0

def calculate_average_lexical_diversity(texts):
    """Calculate the average lexical diversity for a set of texts."""
    diversities = [lexical_diversity(text) for text in texts]
    return np.mean(diversities)

**Sentiment Analysis**
   - **Implementation**:

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using CPU")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.to(device)

In [46]:
def segment_text(text, max_length=512):
    # Split the text by newline characters and remove empty paragraphs
    paragraphs = [para for para in text.split('\n') if para.strip() != '']
    segments = []
    current_segment = ""

    for paragraph in paragraphs:
        if len(current_segment) + len(paragraph) <= max_length:
            current_segment += " " + paragraph if current_segment else paragraph
        else:
            if current_segment:
                segments.append(current_segment)
            current_segment = paragraph

    # Add the last segment if it contains any text
    if current_segment:
        segments.append(current_segment)

    return segments


def sentiment_analysis(text):
    segments = segment_text(text)
    sentiments = []

    for segment in segments:
        inputs = tokenizer(segment, return_tensors="pt", truncation=True, max_length=512, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():  # Disable gradient calculation for inference
            outputs = model(**inputs)
            prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
            sentiment_score = prediction[:,1].item() - prediction[:,0].item()  # Positive - Negative
            sentiments.append(sentiment_score)

    return sentiments

**Frequency of Adjectives and Adverbs**
   - **Implementation**: The text is tokenized and POS-tagged to count adjectives and adverbs. This count reflects the extent of descriptive language used.

In [11]:
def mean_adj_adv_freq_combined(texts):
    total_adjectives = 0
    total_adverbs = 0
    total_texts = len(texts)

    for text in texts:
        words = word_tokenize(text)
        tagged = pos_tag(words)

        for word, tag in tagged:
            if tag.startswith("JJ"):
                total_adjectives += 1
            elif tag.startswith("RB"):
                total_adverbs += 1

    # Calculating the mean for adjectives and adverbs
    avg_adjectives = total_adjectives / total_texts if total_texts > 0 else 0
    avg_adverbs = total_adverbs / total_texts if total_texts > 0 else 0

    return {"average_adjectives": avg_adjectives, "average_adverbs": avg_adverbs}

**Collocates**
   - **Implementation**: NLTK's BigramCollocationFinder is used to find words that frequently appear near a target word, giving insight into word associations.

In [12]:
def find_collocates(text, word):
    tokens = word_tokenize(text)
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    word_filter = lambda *w: word not in w
    finder.apply_ngram_filter(word_filter)
    return finder.nbest(bigram_measures.likelihood_ratio, 10)

 **Part of Speech (POS) Analysis**
   - **Implementation**: The text is POS-tagged to analyze the frequency of different parts of speech. This approach gives a broad overview of the syntactic structure of the text.

In [13]:
def pos_frequency(text):
    words = word_tokenize(text)
    tagged = pos_tag(words)
    pos_counts = {}
    for word, tag in tagged:
        pos_counts[tag] = pos_counts.get(tag, 0) + 1
    return pos_counts

**Top 10 Bigrams**
   - **Implementation**: The text is analyzed for bigrams to identify common word pairings.

In [14]:
def get_bigrams(text):
    tokens = word_tokenize(text)
    bigram_counts = Counter(bigrams(tokens))
    return bigram_counts.most_common(10)

##**Preprocessing and Feature Extraction**

In [43]:
mystery_processed_texts = preprocess(mystery_texts)
scifi_processed_texts = preprocess(scifi_texts)

In [45]:
# Apply feature extraction functions for each genre
mystery_profile = {
    "Lexical Diversity": calculate_average_lexical_diversity(mystery_processed_texts),
    "Sentiment": sentiment_analysis(text) for text in mystery_texts,
    "Adjective and Adverb Frequency": mean_adj_adv_freq_combined(mystery_processed_texts),
    "Top 10 Most Common Bigrams": get_bigrams(" ".join(mystery_processed_texts))
}

scifi_profile = {
    "Lexical Diversity":  calculate_average_lexical_diversity(scifi_processed_texts),
    "Sentiment": np.mean([sentiment_analysis(text) for text in scifi_texts]),
    "Adjective and Adverb Frequency": mean_adj_adv_freq_combined(scifi_processed_texts),
    "Top 10 Most Common Bigrams": get_bigrams(" ".join(scifi_processed_texts))
}

KeyboardInterrupt: ignored

In [36]:
def format_and_print_profile(profile, genre_name):
    print(f"{genre_name} Genre Profile:\n")
    for feature, value in profile.items():
        print(f"{feature}:")

        if isinstance(value, (int, float)):
            # For numerical values (e.g., Lexical Diversity, Sentiment)
            print(f"  {value:.4f}\n")

        elif isinstance(value, dict):
            # For dictionary values (e.g., Adjective and Adverb Frequency)
            for sub_feature, sub_value in value.items():
                print(f"  {sub_feature}: {sub_value:.2f}")
            print()  # Add a newline for better separation

        elif isinstance(value, list):
            for item in value[:10]:
                if isinstance(item, tuple) and all(isinstance(elem, str) for elem in item[0]):
                    # For bigrams, which are tuples of strings
                    bigram_str = ' '.join(item[0])
                    count = item[1]
                    print(f"  {bigram_str}: {count}")
            print()  # Add a newline for better separation

    print("\n---------------------------------------\n")

format_and_print_profile(mystery_profile, "Mystery")
format_and_print_profile(scifi_profile, "SciFi")

Mystery Genre Profile:

Lexical Diversity:
  0.3510

Sentiment:
  0.0039

Adjective and Adverb Frequency:
  average_adjectives: 3730.20
  average_adverbs: 855.30

Top 10 Most Common Bigrams:
  langdon sophie: 148
  da vinci: 125
  holy grail: 123
  martin vanger: 120
  sophie langdon: 111
  philip lombard: 84
  opus dei: 82
  justice wargrave: 81
  mr justice: 80
  henrik vanger: 77


---------------------------------------

SciFi Genre Profile:

Lexical Diversity:
  0.3576

Sentiment:
  -0.0067

Adjective and Adverb Frequency:
  average_adjectives: 3813.00
  average_adverbs: 931.90

Top 10 Most Common Bigrams:
  feyd rautha: 201
  muad dib: 199
  bene gesserit: 187
  uncle enzo: 132
  log entry: 119
  entry sol: 119
  ebooks ebook: 117
  ebook com: 117
  da id: 109
  duke leto: 61


---------------------------------------



#**Analysis**