# Part One — Syntax and Style 


In the first part of your coursework, your task is to explore the syntax and style of a set of 19th Century novels using the methods and tools that you learned in class. The texts you need for this part are in the novels subdirectory of the texts directory in the coursework Moodle template. The texts are in plain text files, and the filenames include the title, author, and year of publication, separated by hyphens. 

The template code provided in PartOne.py includes function headers for some sub-parts of this question. The main method of your finished script should call each of these functions in order. To complete your coursework, complete these functions so that they perform the tasks specified in the questions below. You may (and in some cases should) define additional functions.

Re-assessment template 2025

Note: The template functions here and the dataframe format for structuring your solution are a suggested, not mandatory, approach. You can use a different approach if you like, as long as you answer the questions and communicate your answers clearly.


In [None]:
from nltk.tokenize import word_tokenize
# import nltk.word_tokenize # incorrect -- raises error see: https://www.nltk.org/api/nltk.tokenize.html
# import nltk
from nltk.corpus import cmudict
from nltk.tokenize import sent_tokenize
from collections import Counter
import spacy
from pathlib import Path

import pandas as pd
import os
import re
from math import log2

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000



## read_novels:
Each file in the novels directory contains the text of a novel, and the name of the file is the title, author, and year of publication of the novel, separated by hyphens. Complete the Python function `read_novels` to do the following: 
- i. create a pandas dataframe with the following columns: text, title, author, year 
- ii. sort the dataframe by the year column before returning it, resetting or ignoring the dataframe index.

In [2]:
def read_file(filename):
    with open(filename, 'r') as f:
        return f.read()

def remove_chapter_headings(text):
    pattern = r'(?i)CHAPTER \d+'  #* adding this so that chapter breaks are not counted. could be refined for section breaks etc.
    return re.sub(pattern, '', text)


def read_novels(path=Path.cwd() / "zips" / "p1-texts" / "novels"):
    """Reads texts from a directory of .txt files and returns a DataFrame with the text, title,
    author, and year"""
    df_data = []

    for file in Path(path).glob("*.txt"):
        # Filenames follow the pattern title-author-year.txt
        title, author, year = file.stem.split('-')
        df_data.append({
            'text': read_file(file),
            'title': title,
            'author': author,
            'year': int(year),  # numeric, so that sorting is chronological
        })
    df = pd.DataFrame(df_data)
    df['text_c'] = df['text'].copy() # retain original copy
    df['text'] = df['text'].apply(lambda x: remove_chapter_headings(x).strip()) # apply function to all items in text column
    df = df[['text_c','text', 'title', 'author', 'year']] # reorder columns
    df = df.sort_values('year') # sorted by year (ascending is the default)
    return df.reset_index(drop=True)
    
df = read_novels()

In [3]:
df[0:2]

Unnamed: 0,text_c,text,title,author,year
0,\nCHAPTER 1\n\nThe family of Dashwood had long...,The family of Dashwood had long been settled i...,Sense_and_Sensibility,Austen,1811
1,'Wooed and married and a'.'\n'Edith!' said Mar...,'Wooed and married and a'.'\n'Edith!' said Mar...,North_and_South,Gaskell,1855


## nltk_ttr: 
This function should return a dictionary mapping the title of each novel to its type-token ratio. Tokenize the text using the NLTK library only. Do not include punctuation as tokens, and ignore case when counting types.

In [4]:
def nltk_ttr(text):
    """Calculates the type-token ratio of a text. Text is tokenized using nltk.word_tokenize."""
    text_cleaned = [t.lower() for t in word_tokenize(text) if t.isalnum()] # I've assumed that we want the numbers, as well as text.
    vocab = set(text_cleaned) # dedupe words
    ttr = len(vocab) / len(text_cleaned) # ratio words to the text
    return ttr


def get_ttrs(df):
    """helper function to add ttr to a dataframe"""
    results = {}
    for i, row in df.iterrows():
        results[row["title"]] = nltk_ttr(row["text"])
    return results
get_ttrs(df)


{'Sense_and_Sensibility': 0.05288568542519132,
 'North_and_South': 0.0549040694681204,
 'A_Tale_of_Two_Cities': 0.07075401657114008,
 'Erewhon': 0.0916875393290313,
 'The_American': 0.06385674022547373,
 'Dorian_Gray': 0.08359599310916864,
 'Tess_of_the_DUrbervilles': 0.0778145359339165,
 'The_Golden_Bowl': 0.04748457874040203,
 'The_Secret_Garden': 0.05847231570812455,
 'Portrait_of_the_Artist': 0.10497357039884671,
 'The_Black_Moth': 0.07870315352786751,
 'Orlando': 0.11390939170272334,
 'Blood_Meridian': 0.08570454932697223}

## c) flesch_kincaid
This function should return a dictionary mapping the title of each novel to the Flesch-Kincaid reading grade level score of the text.  
Use the NLTK library for tokenization and the CMU pronouncing dictionary for estimating syllable counts.

[Readability](https://readabilityformulas.com/learn-how-to-use-the-flesch-kincaid-grade-level/)
$$
0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right)
    + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59
$$
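
As a quick sanity check of the formula, the arithmetic can be worked through on raw counts. This is a minimal sketch: the helper name `flesch_kincaid_grade` and the counts are made up for illustration and are not part of the template.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts (hypothetical helper)."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts: 100 words, 5 sentences, 150 syllables
score = flesch_kincaid_grade(100, 5, 150)
print(round(score, 2))  # 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91
```

Longer sentences and longer words both push the grade up, which is why the tokenization choices for words, sentences and syllables all affect the final score.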


### `count_syl()`

In [5]:
import re
from nltk.corpus import cmudict
def count_syl(word, d):
    """Counts the number of syllables in a word given a dictionary of syllables per word.
    if the word is not in the dictionary, syllables are estimated by counting vowel clusters

    Args:
        word (str): The word to count syllables for.
        d (dict): A dictionary of syllables per word.

    Returns:
        int: The number of syllables in the word.
    """
    word = word.lower()  # CMU dictionary keys are lowercase
    if word in d:
        # Vowel phonemes in the first listed pronunciation end in a stress digit (0, 1 or 2)
        return sum(phoneme[-1].isdigit() for phoneme in d[word][0])
    # Fallback for words not in the dictionary: count vowel clusters
    return len(re.findall(r'[aeiouy]+', word))

##### Test for `count_syl()`

In [None]:
# text = df['text'][0]
# pattern = r'chapter [0-9] +'
# text = text.replace('\n', ' ').strip()
# text = re.sub(pattern, '', text.lower())
# d = cmudict.dict()
# sents=sent_tokenize(text.lower())
# words=[t.lower() for t in word_tokenize(text) if t.isalpha()]

# x = words[0:10] # sample of the words

# ''' Manual text of the function vs local use of the cmu dictionary '''

# temp=dict()
# for i in x:
#     counts=0
#     word_cmu = d[i][0]
#     counts += sum(char[-1].isdigit() for char in word_cmu)
#     temp[i] = (counts, word_cmu)
# print([t[0] for t in temp.values()] == [count_syl(i, d) for i in x])

# ''' And a test with some words not in the cmu dictionary'''
# y = ['aaaa', 'maaattaan']
# z = x+y
# print(list(zip(z,[count_syl(i, d) for i in z])))
# print(sum([count_syl(word, d) for word in z]))

True
[('the', 1), ('family', 3), ('of', 1), ('dashwood', 2), ('had', 1), ('long', 1), ('been', 1), ('settled', 2), ('in', 1), ('sussex', 2), ('aaaa', 1), ('maaattaan', 2)]


### `fk_level()`, `get_fks()`, `add_fks_to_df()`

In [6]:

import nltk 
import re
def fk_level(text, d):
    """Returns the Flesch-Kincaid Grade Level of a text (higher grade is more difficult).
    Requires a dictionary of syllables per word.

    Args:
        text (str): The text to analyze.
        d (dict): A dictionary of syllables per word.

    Returns:
        float: The Flesch-Kincaid Grade Level of the text. (higher grade is more difficult)
    """
    text = text.replace('\n', ' ').strip() #* And removing line breaks "\n"
    sents=sent_tokenize(text.lower())
    words=[t.lower() for t in word_tokenize(text) if t.isalpha()]
    total_syllables = sum([count_syl(word, d) for word in words])
    fkl = 0.39 * (len(words) / len(sents)) + 11.8 * (total_syllables/ len(words))- 15.59
    # print(total_syllables, len(words), len(sents), fkl)

    return fkl


def get_fks(df):
    """helper function to add fk scores to a dataframe"""
    results = {}
    cmudict = nltk.corpus.cmudict.dict()
    for i, row in df.iterrows():
        results[row["title"]] = round(fk_level(row["text"], cmudict), 4)
    return results

def add_fks_to_df(df):
    ''' function to add scores to the dataframe, as referenced in get_fks() '''
    fks = get_fks(df)
    temp = pd.DataFrame({'title_f': fks.keys(), 'fks':fks.values()})
    df = df.merge(temp, left_on='title', right_on='title_f')
    return df.drop(columns='title_f')
df = add_fks_to_df(df)


In [7]:
df

Unnamed: 0,text_c,text,title,author,year,fks
0,\nCHAPTER 1\n\nThe family of Dashwood had long...,The family of Dashwood had long been settled i...,Sense_and_Sensibility,Austen,1811,10.8845
1,'Wooed and married and a'.'\n'Edith!' said Mar...,'Wooed and married and a'.'\n'Edith!' said Mar...,North_and_South,Gaskell,1855,6.6583
2,Book the First--Recalled to Life\n\n\n\n\nI. T...,Book the First--Recalled to Life\n\n\n\n\nI. T...,A_Tale_of_Two_Cities,Dickens,1858,9.8547
3,"SAMUEL BUTLER.\nAugust 7, 1901\n\nCHAPTER I: W...","SAMUEL BUTLER.\nAugust 7, 1901\n\nCHAPTER I: W...",Erewhon,Butler,1872,14.6961
4,THE AMERICAN\n\nby Henry James\n\n\n1877\n\n\n...,THE AMERICAN\n\nby Henry James\n\n\n1877\n\n\n...,The_American,James,1877,8.1069
5,\nThe Picture of Dorian Gray\n\nby\n\nOscar Wi...,The Picture of Dorian Gray\n\nby\n\nOscar Wild...,Dorian_Gray,Wilde,1890,4.9577
6,Phase the First: The Maiden\n\n\nI\n\n\nOn an ...,Phase the First: The Maiden\n\n\nI\n\n\nOn an ...,Tess_of_the_DUrbervilles,Hardy,1891,7.6503
7,BOOK FIRST: THE PRINCE\n\n\n\n\nPART FIRST\n\n...,BOOK FIRST: THE PRINCE\n\n\n\n\nPART FIRST\n\n...,The_Golden_Bowl,James,1904,12.4346
8,THE SECRET GARDEN\n\nBY FRANCES HODGSON BURNET...,THE SECRET GARDEN\n\nBY FRANCES HODGSON BURNET...,The_Secret_Garden,Burnett,1911,4.6555
9,Chapter 1\n\nOnce upon a time and a very good ...,Once upon a time and a very good time it was t...,Portrait_of_the_Artist,Joyce,1916,6.4773


## d) flesch_kincaid **todo**
When is the Flesch Kincaid score *not* a valid, robust or reliable estimator of text difficulty? Give two conditions. (Text answer, 200 words maximum).

## e) parse
The goal of this function is to process the texts with spaCy’s tokenizer and parser, and store the processed texts. Your completed function should: 
- i. Use the spaCy nlp method to add a new column to the dataframe that contains parsed and tokenized Doc objects for each text. 
- ii. Serialise the resulting dataframe (i.e., write it out to disk) using the pickle format. 
- iii. Return the dataframe. 
- iv. Load the dataframe from the pickle file and use it for the remainder of this coursework part. Note: one or more of the texts may exceed the default maximum length for spaCy’s model. You will need to either increase this length or parse the text in sections.
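
For the "parse the text in sections" option mentioned in the note, one possible approach is to split each text on blank lines and pack the paragraphs into chunks below the length limit. The sketch below is only an illustration: `split_into_chunks` is a hypothetical helper, and the default of 1,000,000 characters is an assumption about spaCy's default `max_length`.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Greedily packs paragraphs (split on blank lines) into chunks of at most
    max_chars characters. A single paragraph longer than max_chars is kept whole.
    Hypothetical helper, not part of the template."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the current chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
chunks = split_into_chunks(sample, max_chars=10)
print(chunks)  # ['aaaa\n\nbbbb', 'cccc']
```

Each chunk could then be passed to `nlp` separately. Raising `nlp.max_length`, as the setup cell does, is the simpler alternative when memory allows.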

In [None]:

def parse(df, store_path=Path.cwd() / "pickles", out_name="parsed.pickle"):
    """Parses the text of a DataFrame using spaCy, stores the parsed docs as a column and writes the resulting DataFrame to a pickle file"""
    docs = []
    for i, text in enumerate(df['text']):
        docs.append(nlp(text)) # create a spaCy Doc for each cell in "text"
        print(f"Document {i+1} / {len(df)} parsed.") #? added to see progress when parsing (takes a while)
    df['docs'] = docs
    Path(store_path).mkdir(parents=True, exist_ok=True) # make sure the output directory exists
    df.to_pickle(Path(store_path) / out_name) #~ write to pickle
    return df #^ required by part iii
df = parse(df) #^ function call to parse and save df to pickle


In [None]:
def load_from_pickle(store_path=Path.cwd() / "pickles", out_name="parsed.pickle"):
    df = pd.read_pickle(os.path.join(store_path, out_name))
    return df


''' Load from pickle '''
df = load_from_pickle() 
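#! IF TIME, move to regex
# Use .loc for a single-step assignment (avoids the pandas chained-assignment warning)
df.loc[10, 'text'] = df.loc[10, 'text'][585:]
df.loc[10, 'text'] = df.loc[10, 'text'].replace('CHAPTER ', '')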
#? To check the returned type
# type(df['docs'][0]) # ~spacy.tokens.doc.Doc 


## f) Working with parses: **todo**
The final lines of the code template contain three `for` loops. Write the functions needed to complete these loops so that they print: 

### i. The title of each novel and a list of the ten most common syntactic objects overall in the text. 

In [None]:
from collections import Counter
def find_objects(token):
    ''' Function to find the objects of a token  '''
    objects = set()
    if 'obj' in token.dep_:
        objects.add(token)
    return objects



def objects_count(doc, n=10):
    ''' Assumed "object" as syntactical, rather than a word as Python object '''
    object_text = []

    for token in doc:
        for obj in find_objects(token):
            object_text.append(obj.text.lower()) #? have made this lower, to not duplicate the types

    return Counter(object_text).most_common(n)
# objects_count(df['docs'][0]) #*to test function with a sample doc

[('it', 706),
 ('her', 689),
 ('him', 521),
 ('them', 404),
 ('me', 343),
 ('you', 331),
 ('which', 251),
 ('what', 210),
 ('time', 193),
 ('herself', 193)]

In [None]:
def adjective_counts(doc, n=10):
    """Extracts the most common adjectives in a parsed document. Returns a list of tuples."""
    pos_counts = Counter([token.text for token in doc if token.pos_ == 'ADJ'])
    return pos_counts.most_common()[:n]
    
# doc = df['docs'][len(df)-1] #*to test function with a sample doc
# adjective_counts(doc)

### ii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their frequency. 


In [None]:
def freq_counter_dict_words(doc):
    ''' 
    Returns a counter dictionary object for the words in the document 
    Only considers alphabet-character tokens (alpha) rather than excluding (e.g., with is_punct)
    Use: freq_counter_dict_words(doc)
    '''
    from collections import Counter

    words = [token.text.lower()
            for token in doc
            if token.is_alpha]
    word_freq = Counter(words)
    return word_freq
# freq_counter_dict_words(df['docs'][0]) #*to test function with a sample doc

In [35]:
def get_verbs(doc, verb='hear'):
    ''' Function to get the count of a verb. Retains original form '''
    verbs = []
    for token in doc:
        if token.pos_ == 'VERB' and token.lemma_==verb:

            verbs.append(token.text.lower())
    verb_counts =Counter(verbs)
    return verb_counts #~ retains the original verb form, returns their frequency

def sum_verbs(doc, verb='hear'):
    ''' Get the total sum of verbs matched with given lemma '''
    verb_counts = get_verbs(doc, verb) # pass the verb through rather than relying on the default
    return (verb, sum(verb_counts.values()))


In [None]:
# Testing functions
# get_verbs(df['docs'][0])
# #* Counter({'heard': 78, 'hear': 69, 'hearing': 22, 'hears': 1})

# sum_verbs(df['docs'][1])
# #* ('hear', 276)

Ref: *Word Association Norms, Mutual Information and Lexicography*, Church, K. W., and Hanks, P., in Computational Linguistics Volume 16, Number 1, March 1990

PMI by Church and Hanks definition:
$$PMI = \log_2\left(\frac{P(x,y)}{P(x)P(y)}\right)$$

In this application, frequencies are counted per document, and the total number of word tokens in the document is `N`.

**Independent Probabilities**
- are calculated as counts - f(x) and f(y) 
- normalised to N
- $P(x) = \frac{f(x)}{N}$ and $P(y) = \frac{f(y)}{N}$

**Joint Probabilities**
- are calculated as counts of x and y co-occurring in a window - $f(x,y)$
- using spaCy dependencies restricts the window to syntactic relations within a sentence (here, a verb and its subject)
- normalised to N
- $P(x,y) = \frac{f(x,y)}{N}$
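
The definitions above can be checked with a small worked example. This is a sketch with made-up counts; the function name `pmi` and its argument names are illustrative and do not appear in the template.

```python
from math import log2


def pmi(f_xy, f_x, f_y, n):
    """PMI from raw counts: log2(P(x,y) / (P(x) * P(y))), with each probability = count / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return log2(p_xy / (p_x * p_y))


# Made-up counts: x occurs 4 times, y 8 times, together twice, in 64 tokens
value = pmi(f_xy=2, f_x=4, f_y=8, n=64)
print(value)  # log2((2/64) / ((4/64) * (8/64))) = log2(4) = 2.0
```

A positive value means the pair co-occurs more often than independence would predict. Low-frequency words can receive very high scores from a single co-occurrence, which is worth bearing in mind when reading the rankings.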

In [16]:
def find_subjects(token, match_verb):
    ''' Function to find the subjects of a verb token among its direct children in the parse tree.
    Returns a list of (subject, verb) token pairs. match_verb is currently unused. '''
    subjects = []

    for child in token.children:
        if 'subj' in child.dep_:
            subjects.append((child, token))

    return subjects


In [None]:

def subjects_by_verb_count(doc, verb, n=10):
    """Extracts the subjects of a given verb (matched by lemma) in a parsed document.
    Returns a list of (subject_text, verb_token) tuples, one per occurrence."""
    ''' NB: n is not applied here; the caller counts the subjects and takes the most frequent.
    > Setting to n=10 sliced the list by order, which didn't have much meaning in the returned values
    '''
    subject_text = []

    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == verb:
            for subject, verb_token in find_subjects(token, verb):
                if not subject.is_punct: #~ adding this to ignore punctuation
                    # append each subject once (extending with the whole list here double-counted)
                    subject_text.append((subject.text.lower(), verb_token))

    return  subject_text
# doc=df['docs'][0]


In [23]:
def apply_log2(x):
    return log2(x)

def get_verb_freqs(doc):
    verb_counts = get_verbs(doc)
    return sum(verb_counts.values()) # consider the verbs as a single verb (as we are treating 'hear' as tense-insensitive)

def calculate_probs(x, N):
    return x / N

In [None]:
# verb_counts = get_verbs(doc)
# verb_f = sum(verb_counts.values()) # consider the verbs as a single verb (as we are treating 'hear' as tense-insensitive)
# verb_prob = verb_f / N

In [27]:
def make_freq_p_df(doc):
    all_word_freq = freq_counter_dict_words(doc)
    N = sum(all_word_freq.values()) # total number of word tokens (N), not unique types
    top_ten = Counter([s for s, v in subjects_by_verb_count(doc, verb='hear')]).most_common()[:10]
    df_c = pd.DataFrame(data=top_ten, columns=['word', 'f_word_verb'])
    df_c['f_word'] = df_c['word'].map(all_word_freq)
    df_c['p_word_verb'] = df_c['f_word_verb'].apply(lambda x: calculate_probs(x,N))
    # df_c['p_verb'] = verb_prob
    df_c['p_verb'] = calculate_probs(get_verb_freqs(doc),N) #? consider the verbs as a single verb (as we are treating 'hear' as tense-insensitive)
    df_c['p_word'] = df_c['f_word'].apply(lambda x: calculate_probs(x, N))
    return df_c[['word', 'f_word', 'f_word_verb', 'p_word','p_verb', 'p_word_verb']] # ~ reorder


In [None]:
# df_c = make_freq_p_df(doc)

Counter({'heard': 78, 'hear': 69, 'hearing': 22, 'hears': 1})


In [None]:
_

Unnamed: 0,word,f_word,f_word_verb,p_word,p_verb,p_word_verb,P(w)P(v),fracts,pmi
9,both,93,2,0.000788,0.001441,1.7e-05,1e-06,14.927894,3.899939
1,you,1171,19,0.009923,0.001441,0.000161,1.4e-05,11.262847,3.4935
0,i,1961,28,0.016618,0.001441,0.000237,2.4e-05,9.91133,3.309079
7,jennings,234,3,0.001983,0.001441,2.5e-05,3e-06,8.899321,3.153695
5,they,513,5,0.004347,0.001441,4.2e-05,6e-06,6.765566,2.758211
6,me,428,4,0.003627,0.001441,3.4e-05,5e-06,6.487356,2.697631
2,she,1580,14,0.013389,0.001441,0.000119,1.9e-05,6.15067,2.620744
3,elinor,680,6,0.005762,0.001441,5.1e-05,8e-06,6.124827,2.614669
4,he,1094,6,0.009271,0.001441,5.1e-05,1.3e-05,3.807022,1.928663
8,them,464,2,0.003932,0.001441,1.7e-05,6e-06,2.992013,1.581117


In [None]:
# doc=df['docs'][0]


In [None]:

''' work in progress - to go into a function 
and possibly tidy the rest of functions
'''
# df_c['f_word'] = df_c['word'].map(all_word_freq)

# df_c['p_word'] = df_c['f_word'] / N
# df_c = df_c[['word', 'f_word', 'f_word_verb', 'p_word','p_verb', 'p_word_verb']] # ~ reorder
def calculate_pmi(df):
    df['P(w)P(v)'] = df['p_word']*df['p_verb'] #~ product of independent
    df['fracts'] = df['p_word_verb'] / df['P(w)P(v)'] # ~ joint probabilities scaled to product of independents 
    df['pmi'] = df['fracts'].apply(lambda x: apply_log2(x)) #~ apply log2 to result for pmi
    return df.sort_values('pmi', ascending=False) #~ return ordered (descending); sort_values returns a new frame

# calculate_pmi(df_c)

In [47]:
def select_sort_df(df):
    df = df[['word', 'pmi']].sort_values('pmi', ascending=False)
    return df.reset_index( drop=True)


In [74]:
for i, row in df.iterrows():
    print(row["title"])
    doc = row['docs']
    df_pmi = make_freq_p_df(doc)
    df_pmi = calculate_pmi(df_pmi)
    print()
    print(select_sort_df(df_pmi))
    # print(calculate(row["parsed"], "hear"))
    print("\n")

Sense_and_Sensibility

       word       pmi
0      both  3.899939
1       you  3.493500
2         i  3.309079
3  jennings  3.153695
4      they  2.758211
5        me  2.697631
6       she  2.620744
7    elinor  2.614669
8        he  1.928663
9      them  1.581117


North_and_South

       word       pmi
0       she  3.786435
1      they  3.400510
2        we  2.938943
3         i  2.889980
4       who  2.707212
5        he  2.527858
6       you  2.481500
7  margaret  2.199961
8        me  2.058068
9  thornton  1.742240


A_Tale_of_Two_Cities

          word        pmi
0        clink  11.160835
1     stranger   7.460396
2          she   4.634141
3  monseigneur   4.419368
4            i   3.612105
5           he   3.551559
6         they   3.335559
7          you   3.273107
8           me   2.116441
9          him   1.234539


Erewhon

          word       pmi
0  destruction  7.568906
1            i  4.564957
2          she  3.850088
3     machines  3.814019
4           he  2.829688
5  




In [None]:
# for i, row in df.iterrows():
#     print(row["title"])
#     print(subjects_by_verb_count(row["docs"], "hear"))
#     print("\n")

Counter({'hear': 3, 'heard': 1})

### iii. The title of each novel and a list of the ten most common syntactic subjects of the verb ‘to hear’ (in any tense) in the text, ordered by their Pointwise Mutual Information.

In [None]:




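def subjects_by_verb_pmi(doc, target_verb):
    """Extracts the most common subjects of a given verb in a parsed document, ordered by PMI.
    Returns a list of (word, pmi) tuples."""
    # NB: make_freq_p_df() is hard-wired to 'hear', so target_verb is not yet passed through
    df_pmi = select_sort_df(calculate_pmi(make_freq_p_df(doc)))
    return list(df_pmi.itertuples(index=False, name=None))


for i, row in df.iterrows():
    print(row["title"])
    print(subjects_by_verb_pmi(row["docs"], "hear")) # the parsed Docs are stored in the 'docs' column
    print("\n")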







## g) Ten marks are allocated for your github commit history. 
You should make regular, atomic commits with concise but informative commit messages. See the section titled Submission (both questions) for more details.

In [None]:
if __name__ == "__main__":
    """
    uncomment the following lines to run the functions once you have completed them
    """
    #path = Path.cwd() / "p1-texts" / "novels"
    #print(path)
    #df = read_novels(path) # this line will fail until you have completed the read_novels function above.
    #print(df.head())
    #nltk.download("cmudict")
    #parse(df)
    #print(df.head())
    #print(get_ttrs(df))
    #print(get_fks(df))
    #df = pd.read_pickle(Path.cwd() / "pickles" /"name.pickle")
    # print(adjective_counts(df)) #? this is using the dataframe, the function and docstring calls for a `doc` object.
    """ 
    for i, row in df.iterrows():
        print(row["title"])
        print(subjects_by_verb_count(row["parsed"], "hear"))
        print("\n")

    for i, row in df.iterrows():
        print(row["title"])
        print(subjects_by_verb_pmi(row["parsed"], "hear"))
        print("\n")
    """

# Part Two — Feature Extraction and Classification 
In the second part of the coursework, your task is to train and test machine learning classifiers on a dataset of political speeches. The objective is to learn to predict the political party from the text of the speech. The texts you need for this part are in the speeches sub-directory of the texts directory of the coursework Moodle template. For this part, you can structure your python functions in any way that you like, but pay attention to exactly what information (if any) you are asked to print out in each part. Your final script should print out the answers to each part where required, and nothing else.

## a) Read the hansard40000.csv dataset in the texts directory into a dataframe. 
Subset and rename the dataframe as follows: 
- i. rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’, and then: 
- ii. remove any rows where the value of the ‘party’ column is not one of the four most common party names, and remove the ‘Speaker’ value. 
- iii. remove any rows where the value in the ‘speech_class’ column is not ‘Speech’. 
- iv. remove any rows where the text in the ‘speech’ column is less than 1000 characters long
- Print the dimensions of the resulting dataframe using the shape method.

Assuming we ignore missing values (`NaN`):

`df['party'].value_counts(dropna=False)`
```
party
Conservative                        25079
Labour                               8038
Scottish National Party              2303
NaN                                  1647
.....
```

The top four from `df['party'].value_counts()`, excluding "Speaker":
```
party
Conservative                        25079
Labour                               8038
Scottish National Party              2303
Liberal Democrat                      803
.....
```

In [None]:
import pandas as pd
from pathlib import Path
import os
import warnings
warnings.filterwarnings('ignore')

In [None]:
def csv_to_df(path=Path.cwd() / "zips" / "p2-texts"):
    df = pd.read_csv(Path(path) / "hansard40000.csv")

    #^ - i. rename the ‘Labour (Co-op)’ value in ‘party’ column to ‘Labour’, and then: 
    df['party'] = df['party'].replace("Labour (Co-op)", "Labour") # assign back rather than replacing in place on a column

    #^ - ii. remove any rows where the value of the ‘party’ column is not one of the four most common party names, and remove the ‘Speaker’ value. 
    values = ['Conservative', 'Labour', 'Scottish National Party', 'Liberal Democrat',]
    df = df[df['party'].isin(values)]

    #^ - iii. remove any rows where the value in the ‘speech_class’ column is not ‘Speech’. 
    df = df[df['speech_class'] == 'Speech']

    #^ - iv. remove any rows where the text in the ‘speech’ column is less than 1000 characters long
    df = df[df['speech'].str.len() >= 1000]

    df = df.reset_index(drop=True)

    #^ - Print the dimensions of the resulting dataframe using the shape method.
    print(df.shape)
    return df #? Although this might not be wanted? "Function to print and nothing else?"

In [None]:
df = csv_to_df()

## b) Vectorise the speeches using TfidfVectorizer from scikit-learn. 
Use the default parameters, except for omitting English stopwords and setting max_features to 3000. Split the data into a train and test set, using stratified sampling, with a random seed of 26.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold

random_seed = 26
max_features = 3000

tfidf_vectorizer = TfidfVectorizer(stop_words="english",max_features=max_features)
tfidf = tfidf_vectorizer.fit_transform(df.speech) 
X_train, X_test, y_train, y_test = train_test_split(tfidf, df.party, random_state=random_seed, shuffle=True, stratify=df.party)

## c) Random forest and SVM
Train RandomForest (with n_estimators=300) and SVM with linear kernel classifiers on the training set, and print the scikit-learn macro-average f1 score and classification report for each classifier on the test set. The label that you are trying to predict is the ‘party’ value. 
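
To make clear what the macro-average measures, the sketch below computes it by hand on made-up labels. The helper `macro_f1` is hypothetical and for illustration only; the coursework itself asks for scikit-learn's `f1_score` with `average='macro'`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: the majority class is predicted well, the minority class poorly
y_true = ["Con", "Con", "Con", "Con", "Lab", "Lab"]
y_pred = ["Con", "Con", "Con", "Con", "Con", "Lab"]
result = macro_f1(y_true, y_pred)
print(round(result, 4))  # (8/9 + 2/3) / 2 = 7/9, about 0.7778
```

Because every class counts equally, a classifier that does well only on the largest party is penalised more by the macro average than by accuracy, which matters for a dataset as imbalanced as this one.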

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,  classification_report
from sklearn import svm
n_estimators = 300
rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=random_seed) # n_estimators was previously defined but never passed

rfc_fitted = rfc.fit(X_train, y_train)
rfc_pred = rfc_fitted.predict(X_test) # predict once and reuse

print(f1_score(y_test, rfc_pred, average='macro'))
# params (y_true, y_predicted)
print(classification_report(y_test, rfc_pred))

In [None]:
linear_svc = svm.SVC(kernel='linear')
linear_svc_fitted = linear_svc.fit(X_train, y_train)
svc_pred = linear_svc_fitted.predict(X_test) # predict once and reuse

print(f1_score(y_test, svc_pred, average='macro'))
print(classification_report(y_test, svc_pred))

## d) N-Grams
Adjust the parameters of the TfidfVectorizer so that unigrams, bi-grams and tri-grams will be considered as features, limiting the total number of features to 3000. Print the classification report as in 2(c) again using these parameters. 
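
In scikit-learn this is controlled by the `ngram_range` parameter of `TfidfVectorizer`, for example `ngram_range=(1, 3)`. To show what features that range produces, here is a plain-Python sketch; `word_ngrams` is a hypothetical helper and not how scikit-learn implements it internally.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of length n_min..n_max, joined with spaces (hypothetical helper)."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["order", "in", "the", "house"]
grams = word_ngrams(tokens)
print(grams)
# ['order', 'in', 'the', 'house', 'order in', 'in the', 'the house',
#  'order in the', 'in the house']
```

Four tokens already yield nine candidate features, so with a cap of 3000 features the bi-grams and tri-grams compete with unigrams for the available slots.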

## e) Implement a new custom tokenizer and pass it to the tokenizer argument of TfidfVectorizer. 
You can use this function in any way you like to try to achieve the best classification performance while keeping the number of features to no more than 3000, and using the same three classifiers as above. Print the classification report for the best performing classifier using your tokenizer. Marks will be awarded both for a high overall classification performance, and a good trade-off between classification performance and efficiency (i.e., using fewer parameters).
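
As a starting point only, a custom tokenizer is any callable that takes a string and returns a list of tokens. The sketch below is one hypothetical design (lowercase, alphabetic tokens, minimum length of three characters); the name `simple_tokenizer` and these choices are assumptions for illustration, not a recommended solution.

```python
import re

# Alphabetic runs, optionally with one internal apostrophe (e.g. "don't")
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def simple_tokenizer(text):
    """Lowercases the text, keeps alphabetic tokens and drops tokens shorter
    than three characters. Hypothetical example only."""
    return [tok for tok in TOKEN_PATTERN.findall(text.lower()) if len(tok) >= 3]


tokens = simple_tokenizer("The hon. Member's 2 amendments don't work!")
print(tokens)  # ['the', 'hon', "member's", 'amendments', "don't", 'work']
```

It could then be passed as `TfidfVectorizer(tokenizer=simple_tokenizer, max_features=3000)`. Whether dropping digits and short tokens helps on this dataset is something to test rather than assume.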

## f) Explain your tokenizer function and discuss its performance. 

## g) Git history
Ten marks are allocated for your github commit history. You should make regular, atomic commits with concise but informative commit messages. See the section below titled Submission (both questions) for more details. Part Two total marks: 50

# Git requirements

Submit using GitHub Classroom and confirm on Moodle. The code template for this coursework part is made available to you as a Git repository on GitHub, via an invitation link for GitHub Classroom. 

- 1. First you follow the invitation link for the coursework that is available on the Moodle page of the module. 
- 2. Then clone the Git repository from the GitHub server that GitHub will create for you. Initially it will contain README.md and a folder structure for Part One and Part Two with placeholder python scripts (you can change these to Jupyter notebook files if you prefer). 
- 3. Enter your name in README.md (this makes it easy for us to see whose code we are marking) 
- 4. You must also enter the following Academic Declaration into README.md for your submission:
> “I have read and understood the sections of plagiarism in the College Policy on assessment offences and confirm that the work is my own, with the work of others clearly acknowledged. I give my permission to submit my report to the plagiarism testing database that the College is using and test it using plagiarism detection software, search engines or meta-searching software.” 
This refers to the document at: [link](https://www.bbk.ac.uk/student-services/exams/plagiarism-guidelines)

- 5. Whenever you have made a change that can “stand on its own”, say, “Implemented tokenizer method”, this is a good opportunity to commit the change to your local repository and also to push your changed local repository to the GitHub server. As a rule of thumb, in collaborative software development it is common to require that the code base should at least still compile after each commit.

- Entering your name in README.md (using a text editor), then doing a commit of your change to the file into the local repository, and finally doing a push of your local repository to the GitHub server would be an excellent way to start your coursework activities. 
- You can benefit from the GitHub server also to synchronise between, e.g., the Birkbeck lab machines and your own computer. You push the state of your local repository in the lab to the GitHub server before you go home; later, you can pull your changes to the repository on your home computer (and vice versa). Use meaningful commit messages (e.g., “Implemented the pickle output for Q1(e)”, or “fixed a bug in the PMI calculation”), and do not forget to push your changes to the GitHub server! 
- For marking, we plan to clone your repositories from the GitHub server shortly after the submission deadline. We additionally require you to confirm your github submission via a ‘confirm submission’ form on Moodle. The time of this confirmation will tell us if you would like your code to be considered for the regular (uncapped) deadline or for the late (capped) deadline two weeks later.

Deadlines 
- The submission deadline is: 26th June 2025, 14:00 UK time. 
- The late cut-off deadline for receiving a capped mark is: 10th July 2025, 14:00 UK time.