# Part One — Syntax and Style 


In the first part of your coursework, your task is to explore the syntax and style of a set of 19th Century novels using the methods and tools that you learned in class. The texts you need for this part are in the novels subdirectory of the texts directory in the coursework Moodle template. The texts are in plain text files, and the filenames include the title, author, and year of publication, separated by hyphens. 

The template code provided in PartOne.py includes function headers for some sub-parts of this question. The main method of your finished script should call each of these functions in order. To complete your coursework, complete these functions so that they perform the tasks specified in the questions below. You may (and in some cases should) define additional functions.

Re-assessment template 2025

Note: The template functions here and the dataframe format for structuring your solution are a suggested but not mandatory approach. You can use a different approach if you like, as long as you answer the questions and communicate your answers clearly.


In [1]:
from nltk.tokenize import word_tokenize
# import nltk.word_tokenize # incorrect -- raises error see: https://www.nltk.org/api/nltk.tokenize.html
# import nltk
from nltk.corpus import cmudict
from nltk.tokenize import sent_tokenize
from collections import Counter
from IPython.display import display, HTML

import spacy
from pathlib import Path

import pandas as pd
import os
import re
from math import log2

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000

import warnings
warnings.filterwarnings('ignore') # for copy warnings in pandas 

## read_novels:
Each file in the novels directory contains the text of a novel, and the name of the file is the title, author, and year of publication of the novel, separated by hyphens. Complete the Python function read_novels to do the following: 
- i. create a pandas dataframe with the following columns: text, title, author, year 
- ii. sort the dataframe by the year column before returning it, resetting or ignoring the dataframe index.

In [None]:
def read_file(filename):
    with open(filename, 'r') as f:
        return f.read()

def remove_chapter_headings(text):
    #pattern = r'chapter\s*(?:[ivxIIVX]+)'
    # pattern = r'(?i)CHAPTER \d+'  # adding this so that chapter breaks are not counted. could be refined for section breaks etc.
    pattern = r'(?i)chapter\s*(?:[ivxIIVX]+|\d+)' #* updating for both digits and roman numerals

    return re.sub(pattern, '', text)


def read_novels(path=Path.cwd() / "zips" / "p1-texts" / "novels"):
    """Reads texts from a directory of .txt files and returns a DataFrame with the text, title,
    author, and year"""
    files = [file for file in os.listdir(path) if file[-4:] == '.txt']
    df_data = []

    for file in files:
        # Build one record per file; the list of records is then used to create the dataframe
        title, author, year = file.split('-')
        df_data.append({
            'text': read_file(os.path.join(path, file)),
            'title': title,
            'author': author,
            'year': year[:-4],  # drop the '.txt' extension
        })
    df = pd.DataFrame(df_data)
    ''' NOTE: I have opted to retain the original text as a copy, and have applied some cleaning to the documents
    Cleaned text is applied to column "text"
    '''
    df['text_c'] = df['text'].copy() # retain original copy
    mask = df['title'] == 'The_Black_Moth'
    ind = df[mask].index[0]
    df.loc[ind, 'text'] = df.loc[ind, 'text'][585:].strip()
    df['text'] = df['text'].apply(lambda x: remove_chapter_headings(x).strip()) # apply function to all items in text column
    df = df[['text_c','text', 'title', 'author', 'year']] # reorder columns
    df = df.sort_values('year') # return sorted by year (assumed default ascending)
    df.reset_index(inplace=True, drop=True)     
    return df     
    
df = read_novels()
# df[0:1]['text'].values


Unnamed: 0,text_c,text,title,author,year
0,\nCHAPTER 1\n\nThe family of Dashwood had long...,The family of Dashwood had long been settled i...,Sense_and_Sensibility,Austen,1811
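One thing worth checking in `remove_chapter_headings` is that its pattern has no word boundaries, so it can also match inside ordinary prose such as "chapter in". The sketch below compares that pattern with a stricter variant; the name `STRICT` and the extra numeral letters `l` and `c` are assumptions for illustration, not part of the submitted code, and switching patterns would change the cleaned text and therefore the scores reported later.

```python
import re

# Pattern as used in remove_chapter_headings.
LOOSE = re.compile(r'(?i)chapter\s*(?:[ivxIIVX]+|\d+)')
# Hypothetical stricter variant: word boundaries on both sides, at least one space.
STRICT = re.compile(r'(?i)\bchapter\s+(?:[ivxlc]+|\d+)\b')

heading = "CHAPTER 1\n\nThe family of Dashwood"
prose = "the next chapter in the story"

# Both patterns strip a real heading.
loose_heading = LOOSE.sub('', heading).strip()
strict_heading = STRICT.sub('', heading).strip()

# Only the loose pattern damages ordinary prose ("chapter i" + "n").
loose_prose = LOOSE.sub('', prose)
strict_prose = STRICT.sub('', prose)

print(repr(loose_prose))
print(repr(strict_prose))
```

Even the stricter variant would still remove a phrase such as "chapter I" in running text, so it narrows the problem rather than solving it.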


## nltk_ttr: 
This function should return a dictionary mapping the title of each novel to its type-token ratio. Tokenize the text using the NLTK library only. Do not include punctuation as tokens, and ignore case when counting types.
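As a quick sanity check of the definition, the type-token ratio can be worked through on a toy sentence. This is a minimal sketch that uses a regular expression as a stand-in for `nltk.word_tokenize`; `simple_ttr` and the sample sentence are hypothetical and only illustrate the arithmetic.

```python
import re


def simple_ttr(text):
    """Type-token ratio with a regex tokenizer (a stand-in for nltk.word_tokenize)."""
    # Keep alphanumeric runs only, so punctuation never becomes a token.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    if not tokens:
        return 0.0
    # Types are the distinct lowercased tokens.
    return len(set(tokens)) / len(tokens)


sample = "The cat sat on the mat. The mat was flat."
ttr = simple_ttr(sample)  # 7 types over 10 tokens
print(round(ttr, 3))
```

Because the ratio tends to fall as a text gets longer, the values for novels of very different lengths are not directly comparable.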

In [35]:
def nltk_ttr(text):
    """Calculates the type-token ratio of a text. Text is tokenized using nltk.word_tokenize."""
    text_cleaned = [t.lower() for t in word_tokenize(text) if t.isalnum()] # I've assumed that we want the numbers, as well as text, as the question didn't state to exclude these.
    vocab = set(text_cleaned) # dedupe words
    ttr = len(vocab) / len(text_cleaned) # ratio words to the text
    return ttr


def get_ttrs(df):
    """helper function to add ttr to a dataframe"""
    results = {}
    for i, row in df.iterrows():
        results[row["title"]] = nltk_ttr(row["text"])
    return results
get_ttrs(df)


{'Sense_and_Sensibility': 0.05288568542519132,
 'North_and_South': 0.05492518524876262,
 'A_Tale_of_Two_Cities': 0.07075401657114008,
 'Erewhon': 0.09140189916402208,
 'The_American': 0.06373402148472664,
 'Dorian_Gray': 0.08359599310916864,
 'Tess_of_the_DUrbervilles': 0.07781506237653235,
 'The_Golden_Bowl': 0.04748457874040203,
 'The_Secret_Garden': 0.058174923913585794,
 'Portrait_of_the_Artist': 0.10497357039884671,
 'The_Black_Moth': 0.07844770693814845,
 'Orlando': 0.11390939170272334,
 'Blood_Meridian': 0.08570454932697223}

## c) flesch_kincaid
This function should return a dictionary mapping the title of each novel to the Flesch-Kincaid reading grade level score of the text.  
Use the NLTK library for tokenization and the CMU pronouncing dictionary for estimating syllable counts.

[Readability](https://readabilityformulas.com/learn-how-to-use-the-flesch-kincaid-grade-level/)
$$
0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right)
    + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59
$$
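To make the formula concrete, here is a small worked example with made-up counts. The function name `fk_grade` and the numbers are assumptions for illustration only; the real counts come from the tokenizers and the syllable counter.

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from three raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 7.8 + 17.7 - 15.59
grade = fk_grade(100, 5, 150)
print(round(grade, 2))
```

Longer sentences and longer words both push the grade up, which is why the score is sensitive to how sentences and syllables are counted.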


### `count_syl()`

In [4]:
import re
from nltk.corpus import cmudict
def count_syl(word, d):
    """Counts the number of syllables in a word given a dictionary of syllables per word.
    if the word is not in the dictionary, syllables are estimated by counting vowel clusters

    Args:
        word (str): The word to count syllables for.
        d (dict): A dictionary of syllables per word.

    Returns:
        int: The number of syllables in the word.
    """
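    if word in d:
        # Each CMU vowel phoneme ends in a stress digit (0, 1 or 2), so counting those
        # phonemes in the first listed pronunciation gives the syllable count.
        return sum(phoneme[-1].isdigit() for phoneme in d[word][0])
    # Fallback for words missing from the dictionary: count runs of consecutive vowels.
    return len(re.findall(r'[aeiouy]+', word.lower()))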
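The two branches of `count_syl` can be illustrated without downloading the CMU dictionary. The sketch below repeats the same logic against a tiny hand-written dictionary in the CMU format; `count_syl_sketch` and `toy_dict` are hypothetical names used only for this example.

```python
import re


def count_syl_sketch(word, d):
    """Same logic as count_syl: dictionary lookup first, vowel clusters as fallback."""
    word = word.lower()
    if word in d:
        # Vowel phonemes carry a trailing stress digit; consonants do not.
        return sum(phoneme[-1].isdigit() for phoneme in d[word][0])
    return len(re.findall(r'[aeiouy]+', word))


# Hand-written entry in the CMU style (one pronunciation, a list of phonemes).
toy_dict = {"family": [["F", "AE1", "M", "AH0", "L", "IY0"]]}

in_dict = count_syl_sketch("family", toy_dict)      # three vowel phonemes
fallback = count_syl_sketch("dashwood", toy_dict)   # clusters "a" and "oo"
overcount = count_syl_sketch("made", toy_dict)      # clusters "a" and "e"
print(in_dict, fallback, overcount)
```

The last case shows a limit of the fallback: a silent final "e" is counted as a syllable, so out-of-dictionary words can be overestimated.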

##### Test for `count_syl()`

In [None]:
# text = df['text'][0]
# pattern = r'chapter [0-9] +'
# text = text.replace('\n', ' ').strip()
# text = re.sub(pattern, '', text.lower())
# d = cmudict.dict()
# sents=sent_tokenize(text.lower())
# words=[t.lower() for t in word_tokenize(text) if t.isalpha()]

# x = words[0:10] # sample of the words

# ''' Manual test of the function vs local use of the cmu dictionary '''

# temp=dict()
# for i in x:
#     counts=0
#     word_cmu = d[i][0]
#     counts += sum(char[-1].isdigit() for char in word_cmu)
#     temp[i] = (counts, word_cmu)
# print([t[0] for t in temp.values()] == [count_syl(i, d) for i in x])

# ''' And a test with some words not in the cmu dictionary'''
# y = ['aaaa', 'maaattaan']
# z = x+y
# print(list(zip(z,[count_syl(i, d) for i in z])))
# print(sum([count_syl(word, d) for word in z]))


### `fk_level()`, `get_fks()`, `add_fks_to_df()`

In [None]:

import nltk 
import re
def fk_level(text, d):
    """Returns the Flesch-Kincaid Grade Level of a text (higher grade is more difficult).
    Requires a dictionary of syllables per word.

    Args:
        text (str): The text to analyze.
        d (dict): A dictionary of syllables per word.

    Returns:
        float: The Flesch-Kincaid Grade Level of the text. (higher grade is more difficult)
    """
    text = text.replace('\n', ' ').strip() #* And removing line breaks "\n"
    sents=sent_tokenize(text.lower())
    words=[t.lower() for t in word_tokenize(text) if t.isalpha()]
    total_syllables = sum([count_syl(word, d) for word in words])
    fkl = 0.39 * (len(words) / len(sents)) + 11.8 * (total_syllables/ len(words))- 15.59
    # print(total_syllables, len(words), len(sents), fkl)

    return fkl


def get_fks(df):
    """helper function to add fk scores to a dataframe"""
    results = {}
    syllable_dict = nltk.corpus.cmudict.dict() # CMU pronouncing dictionary, loaded once for all texts
    for i, row in df.iterrows():
        results[row["title"]] = round(fk_level(row["text"], syllable_dict), 4)
    return results

def add_fks_to_df(df):
    ''' function to add scores to the dataframe, as referenced in get_fks() '''
    fks = get_fks(df)
    temp = pd.DataFrame({'title_f': fks.keys(), 'fks':fks.values()})
    df = df.merge(temp, left_on='title', right_on='title_f')
    return df.drop(columns='title_f')
fks_dict = get_fks(df) # return dictionary mapping
df = add_fks_to_df(df) # add to df

## d) flesch_kincaid
When is the Flesch Kincaid score *not* a valid, robust or reliable estimator of text difficulty? Give two conditions. (Text answer, 200 words maximum).

**Etymology**: The FKS does not capture differences in vocabulary. Archaic terminology that is no longer part of the modern vernacular is reduced mechanically to a syllable-to-word ratio, so the FKS can be misleading when scoring the complexity of such a text. Example - Chaucer's Middle English:
“Whan that Aprill with his shoures soote/The droghte of March hath perced to the roote” (1).
This does not represent modern vernacular, and yet its syllable count and text length would put its readability at $\approx$ 6th-7th grade (11-13 years old).

**Context**: Comprehension gained from word order or the organisation of the content is not factored into the FKS (2). Scoring based on frequencies and ratios omits reader comprehension, contextual awareness and cultural understanding. For readers for whom English is a second language, the FKS is not robust: it is based on the structure of English, and structures are not constant across language families.

Example - Sentence structure differs between German and English in the past tense (the German participle moves to the end of the clause), while the syntax is comparable in the present tense; the differing structures add complexity for comprehension.

|         | German                            | English                   |
| ------- | --------------------------------- | ------------------------- |
| Past    | Ich habe einen Apfel **gegessen** | I **ate** an apple        |
| Present | Ich **esse** einen Apfel          | I **am eating** an apple  |

References
  

1) Canterbury Tales: General Prologue ll. 1-2, [source](https://chaucer.fas.harvard.edu/pages/general-prologue-0) accessed: 2025-06-18
2) Redish, Janice. (2000). Readability formulas have even more limitations than Klare discusses. ACM Journal of Computer Documentation. 24. 132-137, pg. 134. DOI: 10.1145/344599.344637.

## e) parse
The goal of this function is to process the texts with spaCy’s tokenizer and parser, and store the processed texts. Your completed function should: 
- i. Use the spaCy nlp method to add a new column to the dataframe that contains parsed and tokenized Doc objects for each text. 
- ii. Serialise the resulting dataframe (i.e., write it out to disk) using the pickle format. 
- iii. Return the dataframe. 
- iv. Load the dataframe from the pickle file and use it for the remainder of this coursework part. Note: one or more of the texts may exceed the default maximum length for spaCy’s model. You will need to either increase this length or parse the text in sections.
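The code below raises `nlp.max_length`; the alternative mentioned in the note, parsing in sections, could look roughly like the following. This is a minimal sketch under the assumption that paragraphs are separated by blank lines; `split_into_chunks` is a hypothetical helper, and each chunk would still need to be passed to `nlp` separately and the results combined.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars or not current:
            # Still fits (or nothing collected yet): keep growing the current chunk.
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


toy_text = "aaaa\n\nbbbb\n\ncccc"
toy_chunks = split_into_chunks(toy_text, max_chars=10)
print(toy_chunks)
```

Splitting on paragraph boundaries keeps sentences intact, which matters because the dependency parse used later works sentence by sentence.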

In [None]:
from tqdm.notebook import trange, tqdm

def parse(df, store_path=Path.cwd() / "pickles", out_name="parsed.pickle"):
    """Parses the text of a DataFrame using spaCy, stores the parsed docs as a column, writes the
    resulting DataFrame to a pickle file and returns it"""
    docs = []
    for i in trange(len(df)): # ! Uses TQDM notebook. Else switch to below...
    # for i in range(len(df)): # ^ Switch out from TQDM notebook. 
        docs.append(nlp(df['text'].iloc[i])) # create a spaCy Doc for each cell of "text"
        # print(f"Document {i+1} / {len(df)} parsed.") #? added to see process when parsing (takes a while)
    df['docs'] = docs
    os.makedirs(store_path, exist_ok=True) # make sure the output directory exists
    df.to_pickle(os.path.join(store_path, out_name)) #~ write to pickle
    return df # the task asks for the DataFrame to be returned (step iii)
df = parse(df) #^ function call to parse, save to pickle and keep the returned df


In [8]:
def load_from_pickle(store_path=Path.cwd() / "pickles", out_name="parsed.pickle"):
    df = pd.read_pickle(os.path.join(store_path, out_name))
    return df


''' Load from pickle '''
df = load_from_pickle() 
df[0:2]
#? To check the returned type
# type(df['docs'][0]) # ~spacy.tokens.doc.Doc 

Unnamed: 0,text_c,text,title,author,year,fks,docs
0,\nCHAPTER 1\n\nThe family of Dashwood had long...,The family of Dashwood had long been settled i...,Sense_and_Sensibility,Austen,1811,10.8845,"(The, family, of, Dashwood, had, long, been, s..."
1,'Wooed and married and a'.'\n'Edith!' said Mar...,'Wooed and married and a'.'\n'Edith!' said Mar...,North_and_South,Gaskell,1855,6.6544,"(', Wooed, and, married, and, a, ', ., ', \n, ..."


## f) Working with parses:
The final lines of the code template contain three `for` loops. Write the functions needed to complete these loops so that they print: 

- i. The title of each novel and a list of the ten most common syntactic objects overall in the text. 
- ii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their frequency. 
- iii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their Pointwise Mutual Information.

### Notes


<span style="color: #ff0000">Note: </span> Dorian_Gray has only nine distinct subjects of the lemma 'hear', so nine are printed rather than ten.
<p> 
<ul style="list-style-type: '&#x1F4BB;';">

<li> In the following code, word frequencies are calculated per document, and the total number of alphabetic tokens in the document is declared as `N`. </li>
</ul>
This follows the PMI definition by Church and Hanks:
<ul style="list-style-type: '&#x1F4C4;';">
<li><i>(1)Ref: *Word Association Norms, Mutual Information and Lexicography*, Church, K. W., and Hanks, P., in Computational Linguistics Volume 16, Number 1, March 1990</i></li>
</ul>

#### 💬 PMI definitions, per Church and Hanks (1)

$$PMI = \log_2\bigg(\frac{\textcolor{gold}{P(x,y)}}{\textcolor{cyan}{P(x)} \textcolor{pink}{P(y)}}\bigg)$$

**Independent Probabilities**
- are calculated as counts - f(x) and f(y) 
- normalised to $\textcolor{lime}{N}$
- $\textcolor{cyan}{P(x)} = \frac{f(x)}{\textcolor{lime}{N}}$ and
- $\textcolor{pink}{P(y)} = \frac{f(y)}{\textcolor{lime}{N}}$

**Joint Probabilities**
- are calculated as counts of x and y co-occurring in a window - $f(x,y)$
- using spaCy dependencies limits this window to syntactic spans (the sentences of a document)
- normalised to $\textcolor{lime}{N}$
- $\textcolor{gold}{P(x,y)} = \frac{f(x,y)}{\textcolor{lime}{N}}$
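The definitions above can be checked on made-up counts. In this sketch the function `pmi` and all of the frequencies are hypothetical; powers of two are used so that the result is easy to verify by hand.

```python
from math import log2


def pmi(f_xy, f_x, f_y, n):
    """PMI from raw counts, with every probability normalised by the same N."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return log2(p_xy / (p_x * p_y))


# Hypothetical document of N = 65536 tokens in which the verb occurs 128 times.
# A fairly common subject: seen 64 times overall, 8 times as subject of the verb.
common = pmi(f_xy=8, f_x=64, f_y=128, n=65536)
# A word seen only once in the whole document, that once as subject of the verb.
rare = pmi(f_xy=1, f_x=1, f_y=128, n=65536)
print(common, rare)
```

The second case scores higher even though it rests on a single observation, which is consistent with rare words such as "clink" or "formulae" topping the PMI rankings in the results for this part.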


### f i-iii)

In [None]:
from collections import Counter
def find_objects(token):
    ''' Returns True if the token is a syntactic object (its dependency label contains "obj") '''
    return 'obj' in token.dep_


def objects_count(doc, n=10):
    ''' Assumed "object" as syntactical, rather than a word as Python object '''
    # lowercased so that case variants are not counted as separate types
    object_text = [token.text.lower() for token in doc if find_objects(token)]
    return Counter(object_text).most_common(n)
# objects_count(df['docs'][0]) #*to test function with a sample doc

def adjective_counts(doc, n=10):
    """Extracts the most common adjectives in a parsed document. Returns a list of tuples."""
    pos_counts = Counter([token.text for token in doc if token.pos_ == 'ADJ'])
    return pos_counts.most_common()[:n]

''' Adjective Counts '''
# doc = df['docs'][len(df)-1] #*to test function with a sample doc
# df['docs'].apply(lambda x: adjective_counts(x)) # if you want to count adjectives for whole df

''' Manual test '''
# adj_counts= Counter([token.text for token in doc if token.pos_ == 'ADJ'])
# adj_counts.most_common()[:10]


In [11]:
'''- i. The title of each novel and a list of the ten most common syntactic objects overall in the text. '''
for i, row in df.iterrows():
    print(row['title'])
    print(objects_count(row['docs']))

Sense_and_Sensibility
[('it', 706), ('her', 689), ('him', 521), ('them', 404), ('me', 343), ('you', 331), ('which', 251), ('what', 210), ('time', 193), ('herself', 193)]
North_and_South
[('it', 878), ('her', 863), ('him', 779), ('me', 579), ('you', 443), ('which', 415), ('what', 391), ('them', 365), ('time', 291), ('margaret', 240)]
A_Tale_of_Two_Cities
[('him', 897), ('it', 861), ('you', 434), ('me', 426), ('them', 372), ('her', 354), ('which', 220), ('hand', 205), ('himself', 185), ('time', 174)]
Erewhon
[('which', 415), ('me', 372), ('it', 358), ('them', 305), ('him', 184), ('that', 121), ('time', 120), ('what', 107), ('us', 101), ('one', 100)]
The_American
[('it', 875), ('you', 725), ('him', 710), ('me', 656), ('her', 569), ('what', 324), ('that', 295), ('them', 276), ('newman', 240), ('which', 225)]
Dorian_Gray
[('him', 609), ('it', 475), ('me', 442), ('you', 358), ('that', 253), ('them', 186), ('life', 180), ('her', 178), ('what', 177), ('one', 97)]
Tess_of_the_DUrbervilles
[('he

In [30]:

def freq_counter_dict_words(doc):
    ''' 
    Returns a counter dictionary object for the words in the document 
    Only considers alphabet-character tokens (alpha) rather than excluding (e.g., with is_punct)
    Use: freq_counter_dict_words(doc)
    '''
    from collections import Counter

    words = [token.text.lower()
            for token in doc
            if token.is_alpha]
    word_freq = Counter(words)
    return word_freq
# freq_counter_dict_words(df['docs'][0]) #*to test function with a sample doc


def get_verbs(doc, verb='hear'):
    ''' Function to get the count of a verb. Retains original form '''
    verbs = []
    for token in doc:
        if token.pos_ == 'VERB' and token.lemma_==verb:

            verbs.append(token.text.lower())
    verb_counts =Counter(verbs)
    return verb_counts #~ retains the original verb form, returns their frequency

def sum_verbs(doc, verb='hear'):
    ''' Get the total sum of verbs matched with given lemma '''
    verb_counts = get_verbs(doc, verb) # pass the lemma through rather than relying on the default
    return (verb, sum(verb_counts.values()))

def find_subjects(token, match_verb):
    ''' Function to find the subjects of a token among its direct children in the parse tree.
    match_verb is kept for the call signature but is not used here. '''
    subjects = []

    for child in token.children:
        if 'subj' in child.dep_:
            subjects.append((child, token))

    return subjects


def subjects_by_verb_count(doc, verb='hear', n=10):
    """Extracts the most common subjects of a given verb in a parsed document.
    Returns a list of (subject, count) tuples; n limits it to the n most frequent subjects."""
    subject_text = []

    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == verb:
            for subject, _ in find_subjects(token, verb):
                if not subject.is_punct: #~ adding this to ignore punctuation
                    # one entry per subject token, so a verb with several subjects is not over-counted
                    subject_text.append(subject.text.lower())

    return Counter(subject_text).most_common(n)
    # return  Counter([s for s, v in subjects_by_verb_count(doc, verb='hear')]).most_common()[:n]
# doc=df['docs'][0]



In [31]:
''' - ii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their frequency. '''

for i, row in df.iterrows():
    print(row['title'], subjects_by_verb_count(row['docs']))

Sense_and_Sensibility [('i', 28), ('you', 19), ('she', 14), ('elinor', 6), ('he', 6), ('they', 5), ('me', 4), ('jennings', 3), ('them', 2), ('both', 2)]
North_and_South [('she', 60), ('i', 42), ('he', 23), ('you', 16), ('they', 13), ('margaret', 10), ('me', 5), ('we', 5), ('who', 4), ('thornton', 3)]
A_Tale_of_Two_Cities [('i', 21), ('he', 19), ('you', 12), ('she', 10), ('they', 5), ('stranger', 2), ('clink', 2), ('monseigneur', 2), ('me', 2), ('him', 2)]
Erewhon [('i', 39), ('he', 4), ('they', 3), ('she', 2), ('formulae', 2), ('we', 1), ('who', 1), ('destruction', 1), ('machines', 1), ('one', 1)]
The_American [('he', 18), ('i', 13), ('you', 10), ('she', 5), ('newman', 4), ('they', 2), ('we', 2), ('who', 1), ('me', 1), ('one', 1)]
Dorian_Gray [('i', 24), ('he', 16), ('one', 3), ('you', 3), ('lovers', 1), ('hast', 1), ('who', 1), ('jars', 1), ('dorian', 1)]
Tess_of_the_DUrbervilles [('she', 39), ('i', 20), ('they', 12), ('he', 8), ('you', 8), ('who', 6), ('tess', 4), ('clare', 4), ('lad

In [32]:
def apply_log2(x):
    ''' Base2 logarithm '''
    return log2(x)

def get_verb_freqs(doc):
    ''' Treating the verb frequencies together '''
    verb_counts = get_verbs(doc)
    return sum(verb_counts.values()) # consider the verbs as a single verb (as we are treating 'hear' as tense-insensitive)

def calculate_probs(x, N):
    ''' A word or verb, divided by total vocab count, for example '''
    return x / N

def make_freq_p_df(doc):
    ''' Create dataframe of the top ten co-occurring subject-verb pairs
    This treats all inflected forms of the target verb as one tense-insensitive lemma
    - i.e., 'hears', 'heard' and 'hearing' are all counted as 'hear'
    '''
    all_word_freq = freq_counter_dict_words(doc)
    N = sum(all_word_freq.values()) # total count of alphabetic tokens (N)
    top_ten = subjects_by_verb_count(doc, verb='hear')
    df_c = pd.DataFrame(data=top_ten, columns=['word', 'f_word_verb'])
    df_c['f_word'] = df_c['word'].map(all_word_freq)
    df_c['p_word_verb'] = df_c['f_word_verb'].apply(lambda x: calculate_probs(x,N))
    df_c['p_verb'] = calculate_probs(get_verb_freqs(doc),N) #? consider the verbs as a single verb (as we are treating 'hear' as tense-insensitive)
    df_c['p_word'] = df_c['f_word'].apply(lambda x: calculate_probs(x, N))
    return df_c[['word', 'f_word', 'f_word_verb', 'p_word','p_verb', 'p_word_verb']] # ~ reorder


def calculate_pmi(df):
    ''' The co-occurrence probabilities, scaled to independent probability product
    Then log_2 of this normalised value
    See markdown note on reference and approach
    '''
    df['P(w)P(v)'] = df['p_word']*df['p_verb'] #~ product of independent
    df['fracts'] = df['p_word_verb'] / df['P(w)P(v)'] # ~ joint probabilities scaled to product of independents 
    df['pmi'] = df['fracts'].apply(lambda x: apply_log2(x)) #~ apply log2 to result for pmi
    return df

def select_sort_df(df):
    ''' Sort by descending, reset index '''
    df = df[['word', 'pmi']].sort_values('pmi', ascending=False)
    return df.reset_index( drop=True)


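def subjects_by_verb_pmi(doc, target_verb):
    """Ranks the most common subjects of 'hear' in a parsed document by PMI.
    Returns a one-element list holding a DataFrame of (word, pmi) rows, sorted by descending PMI.
    NB: target_verb is not used yet; make_freq_p_df is fixed to the lemma 'hear'."""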
    df_pmi = make_freq_p_df(doc)
    df_pmi = calculate_pmi(df_pmi)
    return [select_sort_df(df_pmi)] # ~ returns a list ;)




In [33]:

# iii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their Pointwise Mutual Information.
for i, row in df.iterrows():
    # Note: Dorian_Gray has 9 verb-word pairs where word is subject
    title = row['title']
    print(title)
    # print(subjects_by_verb_pmi(row["docs"], "hear"))
    display(HTML(subjects_by_verb_pmi(row['docs'], 'hear')[0].T.to_html()))

Sense_and_Sensibility


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,both,you,i,jennings,they,me,she,elinor,he,them
pmi,3.899939,3.4935,3.309079,3.153695,2.758211,2.697631,2.620744,2.614669,1.928663,1.581117


North_and_South


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,she,they,we,i,who,he,you,margaret,me,thornton
pmi,3.785776,3.399851,2.938283,2.88932,2.706553,2.527199,2.48084,2.199301,2.057408,1.74158


A_Tale_of_Two_Cities


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,clink,stranger,she,monseigneur,i,he,they,you,me,him
pmi,11.160835,7.460396,4.634141,4.419368,3.612105,3.551559,3.335559,3.273107,2.116441,1.234539


Erewhon


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,formulae,destruction,i,she,machines,he,who,they,one,we
pmi,10.15276,7.567798,4.602901,3.848979,3.81291,2.82858,2.135952,2.047725,1.617485,1.523404


The_American


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,we,he,they,who,i,you,she,one,newman,me
pmi,3.381236,3.285816,2.749944,2.61037,2.508956,2.424763,2.262193,2.179164,1.999043,0.804705


Dorian_Gray


Unnamed: 0,0,1,2,3,4,5,6,7,8
word,hast,jars,lovers,i,he,one,who,dorian,you
pmi,10.348039,9.348039,8.763076,4.207635,3.759324,3.164817,2.540684,1.654552,1.438146


Tess_of_the_DUrbervilles


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,lady,room,she,who,they,clare,i,you,tess,he
pmi,4.904466,4.350868,4.021884,3.927925,3.672135,3.445034,3.297947,2.2737,2.056146,2.022612


The_Golden_Bowl


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,man,she,you,maggie,which,he,him,i,they,it
pmi,4.547313,3.57273,3.416992,3.198065,3.194599,2.756214,2.183565,2.093034,1.594486,-0.464572


The_Secret_Garden


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,lennox,we,i,she,you,he,they,colin,one,mary
pmi,6.012931,4.852467,3.438107,2.91057,2.456538,2.17781,1.772467,1.729707,1.442183,1.246565


Portrait_of_the_Artist


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,listener,burst,he,voice,you,i,who,stephen,they,that
pmi,9.137738,7.330383,4.240036,4.008455,3.228845,2.91615,2.81581,2.701442,1.541548,0.184996


The_Black_Moth


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,we,street,madam,i,nothing,richard,you,he,she,it
pmi,5.048453,5.017202,4.557771,3.682519,3.634733,2.780163,1.215009,1.174223,0.678625,-0.092815


Orlando


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,hoof,children,she,orlando,i,he,one,they,which,it
pmi,9.905097,7.905097,4.011382,3.435862,3.283045,3.222102,2.943165,2.912631,1.922103,-0.031541


Blood_Meridian


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
word,nobody,who,she,they,i,he,we,you,all,man
pmi,7.104102,4.689065,4.51914,4.104961,4.059708,3.686182,3.561844,3.543387,2.882999,2.368039


## g) Ten marks are allocated for your GitHub commit history. 
You should make regular, atomic commits with concise but informative commit messages. See the section titled Submission (both questions) for more details.

In [None]:
from pprint import pprint
''' Adding pprint for better dictionary output '''
if __name__ == "__main__":
    """ uncomment the following lines to run the functions once you have completed them"""
    # path = Path.cwd() / "p1-texts" / "novels"
    # print(path)
    #df = read_novels(path) # this line will fail until you have completed the read_novels function above.
    #print(df.head())
    #nltk.download("cmudict")
    #parse(df)
    # print(df.head()[0:2])
    # pprint(get_ttrs(df))
    # pprint(get_fks(df))
    # df = pd.read_pickle(Path.cwd() / "pickles" /"name.pickle")

    # for i, row in df.iterrows():
    #     print(row['title'], ': ',adjective_counts(row['docs']),'\n') #? this is using the dataframe, the function and docstring calls for a `doc` object.
   
    # for i, row in df.iterrows():
    #     print(row["title"])
    #     print(subjects_by_verb_count(row["docs"], "hear"))
    #     print("\n")

    # for i, row in df.iterrows():
    #     print(row["title"])
    #     print(subjects_by_verb_pmi(row["docs"], "hear"))
    #     print("\n")


 


# Part Two — Feature Extraction and Classification 
In the second part of the coursework, your task is to train and test machine learning classifiers on a dataset of political speeches. The objective is to learn to predict the political party from the text of the speech. The texts you need for this part are in the speeches sub-directory of the texts directory of the coursework Moodle template. For this part, you can structure your python functions in any way that you like, but pay attention to exactly what information (if any) you are asked to print out in each part. Your final script should print out the answers to each part where required, and nothing else.

## a) Read the hansard40000.csv dataset in the texts directory into a dataframe. 
Subset and rename the dataframe as follows: 
- i. rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’, and then: 
- ii. remove any rows where the value of the ‘party’ column is not one of the four most common party names, and remove the ‘Speaker’ value. 
- iii. remove any rows where the value in the ‘speech_class’ column is not ‘Speech’. 
- iv. remove any rows where the text in the ‘speech’ column is less than 1000 characters long
- Print the dimensions of the resulting dataframe using the shape method.

Assuming blanks (`NaN`) are ignored:

`df['party'].value_counts(dropna=False)`
```
party
Conservative                        25079
Labour                               8038
Scottish National Party              2303
NaN                                  1647
.....
```

`df['party'].value_counts()`, showing the top four less "Speaker":
```
party
Conservative                        25079
Labour                               8038
Scottish National Party              2303
Liberal Democrat                      803
.....
```

In [1]:
import pandas as pd
from pathlib import Path
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
def csv_to_df(path=Path.cwd() / "zips" / "p2-texts", filename="hansard40000.csv"):
    df = pd.read_csv(os.path.join(path, filename))

    #^ - i. rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’, and then: 
    df['party'] = df['party'].replace("Labour (Co-op)", "Labour")

    #^ - ii. remove any rows where the value of the ‘party’ column is not one of the four most common party names, and remove the ‘Speaker’ value. 
    values = ['Conservative', 'Labour', 'Scottish National Party', 'Liberal Democrat']
    df = df[df['party'].isin(values)]

    #^ - iii. remove any rows where the value in the ‘speech_class’ column is not ‘Speech’. 
    df = df[df['speech_class'] == 'Speech']

    #^ - iv. remove any rows where the text in the ‘speech’ column is less than 1000 characters long
    # a boolean mask on string length; rows with a missing speech compare as False and are dropped
    df = df[df['speech'].str.len() >= 1000]
    df = df.reset_index(drop=True)

    #^ - Print the dimensions of the resulting dataframe using the shape method.
    print(df.shape)
    return df #? Although this might not be wanted? "Function to print and nothing else?"

In [3]:
df = csv_to_df()

(8084, 8)


## b) Vectorise the speeches using TfidfVectorizer from scikit-learn. 
Use the default parameters, except for omitting English stopwords and setting max_features to 3000. Split the data into a train and test set, using stratified sampling, with a random seed of 26.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
from sklearn import svm
from pathlib import Path

# load the previously pickled speeches dataframe
path = Path.cwd() / "pickles" / "speeches.pkl"
df = pd.read_pickle(path)


In [None]:
''' Stratify: "If not None, data is split in a stratified fashion, using this as the class labels."
source: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
'''

def establish_tfidf(texts, **kwargs):
    ''' Pass texts without labels and kwargs for the TfidfVectorizer; returns the sparse TF-IDF matrix.
    NB: the vectoriser is fitted on all texts, i.e. before the train/test split. '''
    tfidf_vectorizer = TfidfVectorizer(**kwargs)
    return tfidf_vectorizer.fit_transform(texts) 

def establish_split(vectorised, labels=None, **kwargs):
    ''' Pass vectorised dataset, labels, and kwargs for sample split.
    If labels is omitted, falls back to the notebook-level y (defined in the next cell). '''
    if labels is None:
        labels = y
    return train_test_split(vectorised, labels, **kwargs)
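As a rough guide to the numbers the vectoriser produces, the sketch below computes TF-IDF weights by hand. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, L2-normalised rows) and ignores tokenisation, stop-word removal and `max_features`. The helper `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenised_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenised_docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenised_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for doc in tokenised_docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = [["tax", "budget", "tax"], ["budget", "health"], ["budget", "schools"]]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second toy document, "health" receives a higher weight than "budget" because "budget" occurs in every document and therefore has the lowest idf.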

In [None]:

x = df.speech
y = df.party

vec_kw={
    'stop_words':"english",
    'max_features': 3000,
    }

tts_kw = {
    'random_state':26,
    'shuffle': True,
    'stratify':y
}
    

X_train, X_test, y_train, y_test = establish_split(establish_tfidf(x, **vec_kw), **tts_kw) 

## c) Random forest and SVM
Train **RandomForest** (with n_estimators=300) and **SVM** with linear kernel classifiers on the training set, and print 

- the scikit-learn macro-average f1 score and
- classification report for each classifier on the test set.

The label that you are trying to predict is the ‘party’ value. 

$$F1 = \frac{TP}{TP+\frac{1}{2}(FP+FN)}$$
- best value at 1 
- worst score at 0
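The macro average is the unweighted mean of the per-class F1 scores, so a class that is never predicted correctly pulls the score down regardless of how few examples it has. The following is a small pure-Python sketch of that calculation using the formula above; the function names and the toy labels are assumptions for illustration, and in the notebook itself the value comes from `f1_score(..., average='macro')`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} with F1 = TP / (TP + 0.5 * (FP + FN))."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = tp[label] + 0.5 * (fp[label] + fn[label])
        scores[label] = tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy predictions: the minority class "LD" is never predicted.
y_true = ["Con", "Con", "Con", "Con", "Lab", "Lab", "LD"]
y_pred = ["Con", "Con", "Con", "Con", "Lab", "Con", "Con"]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Here 5 of 7 predictions are correct, yet the macro-average F1 is only about 0.49, because the per-class scores are 0.8, about 0.67 and 0.0 and each counts equally.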

In [None]:
def random_forest(X_train, y_train, X_test, y_test, average, **kwargs):
    rfc = RandomForestClassifier(**kwargs)
    rfc.fit(X_train, y_train)
    y_pred = rfc.predict(X_test)  # predict once and reuse for both metrics

    # params (y_true, y_predicted)
    print(f1_score(y_test, y_pred, average=average))
    print(classification_report(y_test, y_pred))

def svm_cl(X_train, y_train, X_test, y_test, average, **kwargs):
    linear_svc = svm.SVC(**kwargs)
    linear_svc.fit(X_train, y_train)
    y_pred = linear_svc.predict(X_test)  # predict once and reuse for both metrics

    print(f1_score(y_test, y_pred, average=average))
    print(classification_report(y_test, y_pred))

In [None]:
rfkw = {
    'random_state':26,
    'n_estimators' : 300, 
}

random_forest(X_train, y_train, X_test, y_test, average='macro', **rfkw)

0.44849276102645497
                         precision    recall  f1-score   support

           Conservative       0.73      0.97      0.83      1205
                 Labour       0.74      0.45      0.56       579
       Liberal Democrat       0.00      0.00      0.00        67
Scottish National Party       0.88      0.26      0.40       170

               accuracy                           0.73      2021
              macro avg       0.59      0.42      0.45      2021
           weighted avg       0.72      0.73      0.69      2021



In [None]:

svmkw = {
    'kernel': 'linear',
    'random_state':26,
}

svm_cl(X_train, y_train, X_test, y_test, average='macro',**svmkw)

0.5846137591595653
                         precision    recall  f1-score   support

           Conservative       0.82      0.92      0.87      1205
                 Labour       0.72      0.68      0.70       579
       Liberal Democrat       0.83      0.07      0.14        67
Scottish National Party       0.78      0.53      0.63       170

               accuracy                           0.79      2021
              macro avg       0.79      0.55      0.58      2021
           weighted avg       0.79      0.79      0.78      2021



## d) N-Grams
Adjust the parameters of the TfidfVectorizer so that unigrams, bigrams and trigrams are considered as features, limiting the total number of features to 3000. Print the classification report as in 2(c) again using these parameters. 
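To make the feature space concrete, this sketch lists the word n-grams that a range of (1, 3) yields for a short token sequence. It is a simplified illustration rather than scikit-learn's own code; the function name `word_ngrams` and the example tokens are hypothetical.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all n-grams (space-joined) for n in the inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["national", "health", "service"]
print(word_ngrams(tokens))
```

Three tokens already produce six candidate features (three unigrams, two bigrams, one trigram), so with `max_features` fixed at 3000 the n-grams compete with unigrams for the same slots. That may help explain why the scores move only slightly.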

In [None]:
''' establish new vector with ngrams (unigram, bigram and trigram)'''
vec_kw={
    'stop_words':"english",
    'max_features': 3000,
    'ngram_range' : (1,3)
    }

tts_kw = {
    'random_state':26,
    'shuffle': True,
    'stratify':y
}

X_train2, X_test2, y_train2, y_test2 = establish_split(establish_tfidf(x, **vec_kw), **tts_kw) 

In [None]:
rf_kw = {
    'random_state':26,
    'n_estimators' : 300, 
}

random_forest(X_train2, y_train2, X_test2, y_test2, average='macro', **rf_kw)

0.4775447410650979
                         precision    recall  f1-score   support

           Conservative       0.73      0.97      0.84      1205
                 Labour       0.77      0.47      0.58       579
       Liberal Democrat       0.00      0.00      0.00        67
Scottish National Party       0.86      0.35      0.49       170

               accuracy                           0.74      2021
              macro avg       0.59      0.45      0.48      2021
           weighted avg       0.73      0.74      0.71      2021



Random forest macro-average F1, unigrams only: 0.44849276102645497  
With n-grams (1 to 3): 0.4775447410650979

In [16]:
svm_kw = {
    'kernel': 'linear',
    'random_state':26,
}
svm_cl(X_train2, y_train2, X_test2, y_test2, average='macro',**svm_kw)

0.5741880652063533
                         precision    recall  f1-score   support

           Conservative       0.83      0.92      0.87      1205
                 Labour       0.73      0.72      0.73       579
       Liberal Democrat       1.00      0.03      0.06        67
Scottish National Party       0.78      0.55      0.64       170

               accuracy                           0.80      2021
              macro avg       0.83      0.55      0.57      2021
           weighted avg       0.80      0.80      0.78      2021



SVM macro-average F1, unigrams only: 0.5846137591595653  
With n-grams (1 to 3): 0.5741880652063533

## e) Implement a new custom tokenizer and pass it to the tokenizer argument of TfidfVectorizer. 
You can use this function in any way you like to try to achieve the best classification performance while keeping the number of features to no more than 3000, and using the same classifiers as above. Print the classification report for the best performing classifier using your tokenizer. Marks will be awarded both for a high overall classification performance, and a good trade-off between classification performance and efficiency (i.e., using fewer parameters).

In [None]:
import pandas as pd
# path = os.path.join('pickles', 'speeches.pkl')
# df.to_pickle(path) # save to pickle so the texts do not need re-parsing on every load (~15m)
# df = pd.read_pickle(path)

In [None]:

#* RandomForest (with n_estimators=300) 
#* SVM Linear

#* Adjust the parameters of the Tfidfvectorizer
# * so that unigrams, bi-grams and tri-grams 
# *will be considered as features, limiting the
# * total number of features to 3000. 

In [50]:
from gensim.utils import simple_preprocess


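def list_to_str(lst):
    ''' take a list of words or spaCy tokens/spans and return a single joined string ''' 
    if not lst:
        return ''
    if isinstance(lst[0], str):
        return ' '.join(lst)
    if hasattr(lst[0], 'text'):  # spaCy Token and Span objects both expose .text
        return ' '.join(x.text for x in lst)
    raise TypeError('provide a list of strings or spaCy tokens/spans')

def gensim_simple(doc, as_list=False):
    ''' Tokenise with gensim's simple_preprocess (lowercased tokens; very short and very long tokens are dropped).
    Accepts a plain string or a spaCy Doc/Span.
    NB: the default returns a list of tokens; as_list=True returns the tokens joined into one string. '''
    if not isinstance(doc, str):
        doc = doc.text  # spaCy Doc/Span -> raw text
    tokens = simple_preprocess(doc)
    return list_to_str(tokens) if as_list else tokens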




# df['speech_pp']= df['docs'].apply(lambda x: gensim_simple(x))
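A custom tokenizer for this part is simply a callable that takes one speech as a string and returns a list of tokens. Below is a minimal regex-based sketch with no third-party dependencies. The name `simple_tokenizer`, the minimum token length and the tiny stop list are all assumptions chosen for illustration, not a tuned or recommended configuration.

```python
import re

_TOKEN_RE = re.compile(r"[a-z]+")

# Illustrative stop list only; a real run would use a fuller list.
_STOP = frozenset({"the", "and", "that", "for", "with", "this"})


def simple_tokenizer(text, min_len=3):
    """Lowercase, keep alphabetic runs, drop short tokens and stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in _STOP]


print(simple_tokenizer("The Hon. Member's 3 points on the NHS!"))
```

Such a function would be passed as `TfidfVectorizer(tokenizer=simple_tokenizer, token_pattern=None, max_features=3000)`; setting `token_pattern=None` should avoid the warning that the default pattern is unused when a tokenizer is supplied. Dropping short tokens and digits shrinks the candidate vocabulary before the 3000-feature cut-off is applied.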

## f) Explain your tokenizer function and discuss its performance. 

## g) Git history
Ten marks are allocated for your GitHub commit history. You should make regular, atomic commits with concise but informative commit messages. See the section below titled Submission (both questions) for more details.

Part Two total marks: 50

# Git requirements

Submit using GitHub Classroom and confirm on Moodle. The code template for this coursework part is made available to you as a Git repository on GitHub, via an invitation link for GitHub Classroom. 

- 1. First, follow the invitation link for the coursework, which is available on the Moodle page of the module. 
- 2. Then clone the Git repository that GitHub creates for you on the GitHub server. Initially it will contain README.md and a folder structure for Part One and Part Two with placeholder Python scripts (you can change these to Jupyter notebook files if you prefer). 
- 3. Enter your name in README.md (this makes it easy for us to see whose code we are marking). 
- 4. You must also enter the following Academic Declaration into README.md for your submission:
> “I have read and understood the sections of plagiarism in the College Policy on assessment offences and confirm that the work is my own, with the work of others clearly acknowledged. I give my permission to submit my report to the plagiarism testing database that the College is using and test it using plagiarism detection software, search engines or meta-searching software.” 
This refers to the document at: [link](https://www.bbk.ac.uk/student-services/exams/plagiarism-guidelines)

- 5. Whenever you have made a change that can “stand on its own”, say, “Implemented tokenizer method”, this is a good opportunity to commit the change to your local repository and also to push your changed local repository to the GitHub server. As a rule of thumb, in collaborative software development it is common to require that the code base should at least still compile after each commit.

- Entering your name in README.md (using a text editor), then doing a commit of your change to the file into the local repository, and finally doing a push of your local repository to the GitHub server would be an excellent way to start your coursework activities. 
- You can benefit from the GitHub server also to synchronise between, e.g., the Birkbeck lab machines and your own computer. You push the state of your local repository in the lab to the GitHub server before you go home; later, you can pull your changes to the repository on your home computer (and vice versa). Use meaningful commit messages (e.g., “Implemented the pickle output for Q1(e)”, or “fixed a bug in the PMI calculation”), and do not forget to push your changes to the GitHub server! 
- For marking, we plan to clone your repositories from the GitHub server shortly after the submission deadline. We additionally require you to confirm your GitHub submission via a ‘confirm submission’ form on Moodle. The time of this confirmation will tell us whether you would like your code to be considered for the regular (uncapped) deadline or for the late (capped) deadline two weeks later.

## Deadlines
- The submission deadline is: 26th June 2025, 14:00 UK time. 
- The late cut-off deadline for receiving a capped mark is: 10th July 2025, 14:00 UK time.