<a href="https://colab.research.google.com/github/R-802/LING-226-Assignments/blob/main/Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LING226 Assignment Two
- Shemaiah Rangitaawa
- `300601546`

### **Research Question**

> *How does the use of descriptive language differ in bestselling mystery novels compared to bestselling science fiction novels, and what does this tell us about the role of setting and atmosphere in genre-specific storytelling?*

This question aims to explore how descriptive language varies between bestselling mystery and science fiction novels, focusing on how these genres use language to create setting and atmosphere. The rationale is based on two key genre characteristics:

**Mystery Novels**:
- Expected to use language that builds suspense and intrigue, with detailed descriptions focused on setting the mood and guiding the plot.

**Science Fiction Novels**:
- Likely to use imaginative descriptions for world-building, introducing futuristic or alien elements.

### **Predictions**
**Descriptive Detail**:
   - Mystery novels will probably use more precise and plot-driven descriptions.
   - Science fiction novels are anticipated to have broader, more imaginative descriptions.

**Lexical Choices**:
   - Mystery novels might use language that evokes suspense and mystery.
   - Science fiction novels are expected to include technical and futuristic terminology.

**Atmosphere and Mood**:
   - Descriptions in mystery novels are predicted to create a tense, suspenseful mood.
   - In science fiction, the language is likely to evoke wonder and exploration.

## **Preprocessing Pipeline**
Text preprocessing involves:
1. **Text Cleaning**: This step involves converting all text to lowercase, removing punctuation, and eliminating numbers. This standardization ensures uniformity and relevance in the analysis.

2. **Removing Stop Words**: Common function words such as "the", "is", and "in" are removed using NLTK's predefined English stop-word list. These words carry little descriptive content, so they add noise to a comparison of vocabulary.

3. **Custom TF-IDF Vectorization**: Extends standard TF-IDF with document-frequency thresholds: a term is kept only if the fraction of books containing it lies between `lower_threshold` and `upper_threshold`. This filters out terms shared by nearly every book and lets the analysis focus on words that are distinctive to individual novels.
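
The thresholding idea in step 3 can be illustrated without any libraries. The following is a minimal sketch, not the notebook's implementation: the function name `document_frequency_filter` and the toy documents are made up for illustration, and the thresholds are chosen so that the tiny example shows both cut-offs in action.

```python
from collections import Counter

def document_frequency_filter(tokenized_docs, lower=0.05, upper=0.95):
    """Return the sorted terms whose document frequency lies in [lower, upper]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs there.
    doc_counts = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, count in doc_counts.items()
        if lower <= count / n_docs <= upper
    )

# Hypothetical toy corpus: four already-tokenized "books".
toy_docs = [
    ["detective", "train", "murder"],
    ["detective", "island", "murder"],
    ["detective", "spaceship", "planet"],
    ["detective", "planet", "murder"],
]
kept = document_frequency_filter(toy_docs, lower=0.3, upper=0.95)
print(kept)  # ['murder', 'planet']
```

Here "detective" appears in every document (frequency 1.0) and is dropped by the upper threshold, while "train", "island", and "spaceship" each appear in one of four documents (0.25) and are dropped by the lower threshold.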

In [3]:
import nltk
import string
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = ''.join([char for char in text if char not in string.punctuation])  # Remove punctuation
    text = ''.join([char for char in text if not char.isdigit()])  # Remove numbers
    return text
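
One limitation worth noting: `string.punctuation` covers only ASCII characters, so typographic marks such as the curly apostrophe in "Hitchhiker’s" pass through `clean_text` unchanged. A possible variant, sketched here under the assumption that all Unicode punctuation, symbols, and numeric characters should be dropped, uses the standard `unicodedata` module. The name `clean_text_unicode` is hypothetical and not part of the notebook.

```python
import unicodedata

def clean_text_unicode(text):
    """Lowercase text and drop Unicode punctuation, symbols, and numeric characters."""
    text = text.lower()
    return ''.join(
        ch for ch in text
        # General categories starting with 'P' are punctuation, 'S' symbols, 'N' numbers.
        if not unicodedata.category(ch).startswith(('P', 'S', 'N'))
    )

print(repr(clean_text_unicode("The Hitchhiker’s Guide, 1979!")))
```

This removes slightly more than the original function (for example, fraction characters such as "½" fall under the numeric categories), so it is a design choice rather than a drop-in replacement.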

In [5]:
def tokenize_and_remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

In [44]:
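class CustomTfidfVectorizer(TfidfVectorizer):
    """TF-IDF vectorizer that drops terms outside a document-frequency range."""

    def __init__(self, lower_threshold=0.05, upper_threshold=0.95, **kwargs):
        super().__init__(**kwargs)
        self.lower_threshold = lower_threshold
        self.upper_threshold = upper_threshold

    def fit_transform(self, raw_documents, y=None):
        dt_matrix = super().fit_transform(raw_documents, y)
        # Document frequency: fraction of documents in which each term occurs
        df = np.asarray((dt_matrix > 0).sum(axis=0)).ravel() / dt_matrix.shape[0]
        terms_to_keep = np.where((df >= self.lower_threshold) & (df <= self.upper_threshold))[0]
        # Index the full vocabulary via super() so that refitting never reuses stale reduced names
        self.reduced_feature_names_ = np.array(super().get_feature_names_out())[terms_to_keep]
        return dt_matrix[:, terms_to_keep]

    def get_feature_names_out(self, input_features=None):
        # Return the reduced vocabulary once fit_transform has run.
        # Note: transform() is not overridden and still returns the full vocabulary.
        try:
            return self.reduced_feature_names_
        except AttributeError:
            return super().get_feature_names_out(input_features)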

In [7]:
def preprocess(documents, lower_threshold=0.05, upper_threshold=0.95):
    cleaned_docs = [clean_text(doc) for doc in documents]
    tokenized_docs = [' '.join(tokenize_and_remove_stopwords(doc)) for doc in cleaned_docs]
    vectorizer = CustomTfidfVectorizer(lower_threshold=lower_threshold, upper_threshold=upper_threshold)
    tfidf_matrix = vectorizer.fit_transform(tokenized_docs)
    return tfidf_matrix, vectorizer.get_feature_names_out()

## **Corpus Selection**

> To import the corpus into Colab, please download the texts from [this link](https://drive.google.com/drive/folders/15A7y8NRaJv2LRBB6zDm4043G1f9sMTWD) and add them to your Google Drive.

### Mystery Novels
1. **"The Girl on the Train" by Paula Hawkins** - A psychological thriller with a complex narrative structure.

2. **"And Then There Were None" by Agatha Christie** - A classic whodunit by the renowned mystery writer.

3. **"In the Woods" by Tana French** - A novel combining police procedural with psychological depth.

4. **"The Da Vinci Code" by Dan Brown** - A fast-paced mystery intertwined with historical and religious themes.

5. **"The Woman in the Window" by A.J. Finn** - A psychological thriller with a gripping plot and unreliable narration.

6. **"Before I Go to Sleep" by S.J. Watson** - A psychological thriller about a woman suffering from anterograde amnesia and trying to piece together her identity.

7. **"Sharp Objects" by Gillian Flynn** - A dark mystery involving a journalist returning to her hometown to cover the murders of two preteen girls, uncovering disturbing secrets.

8. **"Murder on the Orient Express" by Agatha Christie** - A famous mystery set on a luxurious train.

9. **"The Girl with the Dragon Tattoo" by Stieg Larsson** - A gripping, internationally acclaimed novel featuring a complex investigation led by a journalist and a brilliant but troubled hacker.

10. **"The Name of the Rose" by Umberto Eco** - A historical mystery set in a 14th-century Italian monastery, where Brother William of Baskerville investigates mysterious deaths, combining elements of semiotics, theology, and philosophy.


### Science Fiction Novels
1. **"1984" by George Orwell** - A dystopian novel with profound political commentary.

2. **"The Hitchhiker's Guide to the Galaxy" by Douglas Adams** - A blend of science fiction and humor.

3. **"The Martian" by Andy Weir** - A gripping story of survival and ingenuity, detailing an astronaut's solitary struggle to survive on Mars after being left behind.


4. **"Dune" by Frank Herbert** - A science fiction epic with deep world-building and political intrigue.

5. **"Neuromancer" by William Gibson** - A cyberpunk novel with richly detailed futuristic settings.

6. **"Brave New World" by Aldous Huxley** - A dystopian novel exploring a technologically advanced future.

7. **"Snow Crash" by Neal Stephenson** - A cyberpunk novel that blends linguistics, computer science, and politics.

8. **"The War of the Worlds" by H.G. Wells** - An early science fiction novel depicting an alien invasion.

9. **"Children of Time" by Adrian Tchaikovsky** - An award-winning novel featuring the evolution of intelligent spiders on a terraformed planet, exploring themes of civilization, legacy, and survival.

10. **"Fahrenheit 451" by Ray Bradbury** - A dystopian novel centered around themes of censorship and book burning.


### Selection Rationale

- **Genre Representation**: Each book is a well-recognized example of its genre, ensuring a clear distinction between the mystery and science fiction categories.

- **Narrative Styles**: The selection includes a range of narrative styles, from first-person accounts to omniscient narrators, offering diverse syntactic structures.

- **Thematic Variety**: The books cover various sub-genres and themes, providing a rich linguistic variety for analysis.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [33]:
import os

def read_text_files(directory_path):
    """Reads all text files in a directory and returns a list of their contents."""
    # A list keeps one entry per file in a fixed order, so TF-IDF rows can be matched back to files
    text_contents = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                full_text = file.read()  # Read the full text of the file
                text_contents.append(full_text)  # Add the full text to the list
    return text_contents
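
Because later steps need to know which row of the TF-IDF matrix belongs to which book, it can help to keep titles and texts together. The following is a sketch of one way to do that, assuming (as the corpus files here appear to do) that the first line of each file holds the title. The function name `read_titled_texts` and the demo files are hypothetical.

```python
from pathlib import Path
import tempfile

def read_titled_texts(directory_path):
    """Map each .txt file's first line (assumed to be its title) to its full text."""
    texts = {}
    # Sorting the paths makes the document order reproducible across runs.
    for path in sorted(Path(directory_path).glob('*.txt')):
        full_text = path.read_text(encoding='utf-8')
        title = full_text.split('\n', 1)[0]
        texts[title] = full_text
    return texts

# Demonstration on a throwaway directory with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'b.txt').write_text('[Book B]\nbody b', encoding='utf-8')
    Path(tmp, 'a.txt').write_text('[Book A]\nbody a', encoding='utf-8')
    Path(tmp, 'notes.md').write_text('ignored', encoding='utf-8')
    demo = read_titled_texts(tmp)

print(list(demo))  # ['[Book A]', '[Book B]']
```

Passing `list(demo.values())` to a vectorizer then yields rows in the same order as `list(demo)`.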

In [35]:
def print_titles(text_set):
    for text in text_set:
        first_line = text.split('\n', 1)[0]  # The first line of each file holds the title
        print(first_line)

mystery_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/Mystery/'
scifi_path = '/content/drive/My Drive/LING226 Assignment 2 Corpus/SciFi/'

mystery_texts = read_text_files(mystery_path)
print("Mystery Novels:")
print_titles(mystery_texts)

scifi_texts = read_text_files(scifi_path)
print("\nScifi Novels:")
print_titles(scifi_texts)

Mystery Novels:
[And Then There Were None by Agatha Christie 1939]
[The Woman in the Window by A.J. Finn 2018]
[Sharp Objects by Gillian Flynn 2006]
[In the Woods by Tana French 2007]
[The Da Vinci Code by Dan Brown 2003]
[The Name of the Rose by Umberto Eco 1980]
[Before I Go to Sleep by S.J Watson 2008]
[The Girl on the Train by Paula Hawkins 2015]
[Murder on the Orient Express by Agatha Christie 1934]
[The Girl With The Dragon Tattoo by Stieg Larsson 2005]

Scifi Novels:
[Brave New World by Aldous Huxley 1931]
[Fahrenheit 451 by Ray Bradbury 1953]
[Children of Time by Adrain Tchaokovsky 2015]
[Snow Crash by Neal Stephenson 1992]
[DUNE by Frank Herbert 1965]
[The Hitchhiker’s Guide to the Galaxy by Douglas Adams 1979]
[The War of the Worlds by H. G. Wells 1898]
[Nineteen Eighty-Four by George Orwell 1949]
[The Martian by Andy Weir 2011]
[Neuromancer by William Gibson 1984]


In [39]:
tfidf_matrix, feature_names = preprocess(mystery_texts)

In [46]:
print(feature_names)

['aa' 'aah' 'aaron' ... 'östermalmstorg' 'östersund' 'überstigen']
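
The full vocabulary is too large to read directly, so a natural next step is to look at the highest-weighted terms for each book. The helper below is a minimal sketch with made-up weights and terms. `top_terms` is a hypothetical name, and it expects a dense array rather than the sparse matrix returned by `preprocess`.

```python
import numpy as np

def top_terms(weights, feature_names, k=3):
    """Return the k highest-weighted terms for each row of a dense weight matrix."""
    weights = np.asarray(weights)
    names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first k columns.
    order = np.argsort(weights, axis=1)[:, ::-1][:, :k]
    return [names[row].tolist() for row in order]

# Toy values for illustration only, not results from the corpus.
toy_weights = [[0.1, 0.7, 0.2], [0.5, 0.0, 0.4]]
toy_names = ['fog', 'corpse', 'alibi']
print(top_terms(toy_weights, toy_names, k=2))  # [['corpse', 'alibi'], ['fog', 'alibi']]
```

With the notebook's sparse result, `top_terms(tfidf_matrix.toarray(), feature_names)` would apply the same ranking, at the cost of materializing a dense array.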
