# Assignment 1 - Thomas Steinthal

In this assignment, I loop over a collection of essays written by high school students and build dataframes that record particular linguistic features of each essay. I create one dataset per folder, in which each entry represents an essay in that folder with the following attributes:

Relative frequency of ```nouns```, ```verbs```, ```adjectives``` and ```adverbs``` per 10000 words.
Total number of unique ```persons```, ```locations``` and ```organisations``` mentioned.
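
As a small worked example of the first attribute, the sketch below scales a raw count to a rate per 10000 tokens. The function name `relative_frequency` and the numbers are made up for illustration; they are not taken from the corpus.

```python
def relative_frequency(count, total, per=10000):
    """Scale a raw count to a rate per `per` tokens (hypothetical helper)."""
    if total == 0:  # an empty text has no meaningful rate
        return 0.0
    return count / total * per

# Illustrative numbers only: 150 nouns in a 1200-token essay
print(relative_frequency(150, 1200))  # 1250.0
```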

In [4]:
#Setup. Remember to call setup.sh first
import os #for filehandling
import pandas as pd #for dataframes
import re #for string segmentation

import spacy #for linguistic analysis
nlp = spacy.load("en_core_web_md") #The spacy-model



#Original filepath to data (I somehow have no access to copy the files to my own repository...)
#org_path = os.path.join("..", "..", "..", "..", "cds-assignment-templates", "cds-lang-assignment-1", "in", "USEcorpus")
org_path = os.path.join("USEcorpus")

#Example to test functions out of context
#pth = os.path.join("..", "..", "..", "..", "cds-assignment-templates", "cds-lang-assignment-1", "in", "USEcorpus", "a1", "0100.a1.txt")
#with open(pth, encoding="latin-1") as f:
#    example = f.read()
#example_doc = nlp(example)
#print(example)

I like ```functions```. That's why I have collected all the relevant functions below:

fun_list_files(path): Takes a filepath and returns a sorted list of its contents.

fun_rmv_punct(s): Takes a string, s, and removes all metadata (anything enclosed in angle brackets). It could be expanded to handle more complex cases, but the pattern is hardcoded for now.

fun_cou_rel_ling(doc, feature, per): Takes a doc (the nlp-object), a feature (the part-of-speech code as a string) and the number of words the frequency is reported per (e.g. 10000). A simple for-loop with a counter does the job.

fun_cou_tot_propn(doc, feature): As above, but without a denominator. It could potentially be merged with fun_cou_rel_ling to avoid redundancy, but keeping them separate allows fine-tuning of the two elements (linguistic features and named entities).

In [14]:
def fun_list_files(path):
    files = sorted(os.listdir(path))
    return (files)

#Remove metadata function. Takes a string and returns it with everything inside angle brackets removed (could be extended to take the delimiters as arguments)
def fun_rmv_punct(s):
    s = re.sub('<.+?>', '', s) #Non-greedy pattern (https://stackoverflow.com/questions/8784396/how-to-delete-the-words-between-two-delimiters)
    return s

#Count linguistic features
def fun_cou_rel_ling(doc, feature, per):
    count = 0
    for token in doc: #count-for-loop for each word in doc
        if token.pos_ == feature:
            count += 1
    rel_freq = (count/len(doc)) * per #calculate relative frequency
    
    return rel_freq

#Count unique named entities. Because only unique values count, this function works a bit differently from fun_cou_rel_ling (above)
def fun_cou_tot_propn(doc, feature):
    count_list = [] #A list for all tokens with the given entity type...
    for token in doc:
        if token.ent_type_ == feature:
            count_list.append(token.text) #... each matching token is appended to the list
    c_list = list(set(count_list)) #Removing duplicates by converting to a set, which doesn't allow duplicates
    count = len(c_list)
    return count


['Thomas', 'Thomas', 'Thomas']
['Thomas']
1
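
One caveat with token-level counting is that a multi-word entity such as a full name contributes each of its tokens separately. A possible alternative, sketched here and not what the function above does, is to count whole entity spans through spaCy's `doc.ents`, where each span has a `.text` and a `.label_`. To keep the sketch runnable without loading a model, `SimpleNamespace` objects stand in for a spaCy doc; `count_unique_entities` and the toy data are hypothetical.

```python
from types import SimpleNamespace

def count_unique_entities(doc, label):
    """Count distinct entity spans with the given label (hypothetical helper)."""
    return len({ent.text for ent in doc.ents if ent.label_ == label})

# Stand-in for a spaCy doc: only the .ents attribute is mimicked here
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Thomas Steinthal", label_="PERSON"),
    SimpleNamespace(text="Thomas Steinthal", label_="PERSON"),
    SimpleNamespace(text="Aarhus University", label_="ORG"),
])

print(count_unique_entities(toy_doc, "PERSON"))  # 1
print(count_unique_entities(toy_doc, "ORG"))     # 1
print(count_unique_entities(toy_doc, "LOC"))     # 0
```

With a real `doc = nlp(text)` the same function should apply unchanged, since it only relies on `doc.ents`.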


And now for the real task. First I will define a couple of ```lists``` to make everything easier. 

In [5]:
USEcorp_list = fun_list_files(org_path) #Lists entire USEcorp-folders
lingu_key_list = ["NOUN", "VERB", "ADJ", "ADV"] 
propn_key_list = ["PERSON", "LOC", "ORG"]
column_names_list = ["Filename", "RelFreq_NOUN", "RelFreq_VERB", "RelFreq_ADJ", "RelFreq_ADV", "Unique_PER", "Unique_LOC", "Unique_ORG"]

And then for the actual code. Eventually one could convert this entire chunk to a function.

First we create a ```for-loop``` over each folder in the USE-corpus. We need to output one df per folder, so we start by defining its container (as a list). Then we list the files in the folder.

Another ```for-loop``` is created for each file in the folder. Again we need an output that we can append as a row to the dataframe, so that list is created (or reset). The file is also ```loaded```.

First we need some ```preprocessing```. We do this with the fun_rmv_punct(text), before we ```convert``` the text to a spaCy-readable-format. 
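
To illustrate the preprocessing step, the sketch below applies the same non-greedy pattern used in fun_rmv_punct to a made-up string. The sample text and its tags are assumptions for illustration, not copied from the corpus, and `strip_angle_tags` is a hypothetical name.

```python
import re

def strip_angle_tags(s):
    """Remove everything between '<' and '>' (same pattern as fun_rmv_punct)."""
    return re.sub(r"<.+?>", "", s)

# Made-up sample: a header tag on its own line plus inline tags
sample = "<doc id=0100>\nI have always liked <i>English</i> books."
print(strip_angle_tags(sample))
```

Because `.` does not match a newline by default, a tag that spans several lines would survive; passing `flags=re.DOTALL` is one way to cover that case, if it ever occurs in the data.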

Now we can start ```appending``` features to our row-list. First we pluck the four linguistic features with fun_cou_rel_ling and then we find the named entities with fun_cou_tot_propn.
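
The same numbers could also be gathered in a single pass over the tokens instead of one pass per feature. The sketch below assumes the part-of-speech tags have already been pulled out into a plain list (with spaCy that would be `[token.pos_ for token in doc]`); the function name `rel_freqs` and the toy tags are hypothetical.

```python
from collections import Counter

def rel_freqs(pos_tags, features, per=10000):
    """Relative frequency of each feature in one pass (hypothetical helper)."""
    counts = Counter(pos_tags)      # tally every tag once
    total = len(pos_tags)
    if total == 0:                  # avoid dividing by zero on empty input
        return [0.0 for _ in features]
    return [counts[f] / total * per for f in features]

# Toy tag sequence standing in for [token.pos_ for token in doc]
tags = ["DET", "NOUN", "VERB", "ADJ", "NOUN", "PUNCT", "ADV", "VERB"]
print(rel_freqs(tags, ["NOUN", "VERB", "ADJ", "ADV"]))
# [2500.0, 2500.0, 1250.0, 1250.0]
```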

With the final list constructed, we have a list that represents the file-specific row in the final dataframe. We append this to df_fea_list (which will eventually become the output) before looping over again.

Having looped through all the files in the folder, we ```convert``` the list of rows to a df with pd and ```save``` it under the folder's name. Because the process is heavy, a small print() lets us know how far the computer has got. All in all, the code took about 3 minutes to run.
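
The last step, turning the list of rows into a dataframe and writing it to disk, can be sketched in isolation. The rows below are invented numbers and the output goes to a temporary directory; `os.makedirs(..., exist_ok=True)` is added on the assumption that the output folder may not exist yet, and `index=False` is an optional choice that leaves out the row-number column.

```python
import os
import tempfile
import pandas as pd

columns = ["Filename", "RelFreq_NOUN", "RelFreq_VERB", "Unique_PER"]
rows = [  # invented values, one list per essay
    ["0100.a1.txt", 1500.0, 1200.0, 2],
    ["0101.a1.txt", 1400.0, 1300.0, 0],
]

df = pd.DataFrame(rows, columns=columns)

out_dir = os.path.join(tempfile.mkdtemp(), "out")
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
outpath = os.path.join(out_dir, "a1_annotations.csv")
df.to_csv(outpath, index=False)      # skip the numeric index column

print(pd.read_csv(outpath).shape)  # (2, 4)
```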



In [61]:

for subfolder in USEcorp_list:
    df_fea_list = []
    cur_dir = os.path.join(org_path, subfolder) #current directory in loop
    cur_file_list = fun_list_files(cur_dir)

    for essay in cur_file_list:
        fea_count_list = [essay] #The list for all features. Already named
        cur_essay_path = os.path.join(cur_dir, essay) #current file in loop
        with open(cur_essay_path, encoding="latin-1") as f: #read file
            text = f.read()
        
        text = fun_rmv_punct(text) #Removing metadata and then transforming (tokenising)
        doc = nlp(text)
        

        #First we want to extract the relative frequencies of the 4 linguistic features
        for feature in lingu_key_list:
            fea_count = fun_cou_rel_ling(doc, feature, 10000) #per 10000 words
            fea_count_list.append(fea_count)
        
        #And then the total count of unique named entities
        for ent in propn_key_list:
            fea_count = fun_cou_tot_propn(doc, ent)
            fea_count_list.append(fea_count)

        #Finally I append the row to the list that becomes the output dataframe
        df_fea_list.append(fea_count_list)

    df_features = pd.DataFrame(df_fea_list, columns = column_names_list)
    outpath = os.path.join("..", "out", subfolder + "_" + "annotations.csv")
    df_features.to_csv(outpath)
    print("Finished " + subfolder) #Ran in 3 min approx




Finished a1
Finished a2
Finished a3
Finished a4
Finished a5
Finished b1
Finished b2
Finished b3
Finished b4
Finished b5
Finished b6
Finished b7
Finished b8
Finished c1
