This assignment concerns using ```spaCy``` to extract linguistic information from a corpus of texts.

The corpus is an interesting one: *The Uppsala Student English Corpus (USE)*. All of the data is included in the folder called ```in``` but you can access more documentation via [this link](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).

For this exercise, you should write some code which does the following:

- Loop over each text file in the folder called ```in```
- Extract the following information:
    - Relative frequency of nouns, verbs, adjectives, and adverbs per 10,000 words
    - Total number of *unique* PER, LOC, and ORG entities
- For each sub-folder (a1, a2, a3, ...) save a table which shows the following information:

|Filename|RelFreq NOUN|RelFreq VERB|RelFreq ADJ|RelFreq ADV|Unique PER|Unique LOC|Unique ORG|
|---|---|---|---|---|---|---|---|
|file1.txt|---|---|---|---|---|---|---|
|file2.txt|---|---|---|---|---|---|---|
|etc|---|---|---|---|---|---|---|
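
The "relative frequency per 10,000 words" columns are counts scaled by document length. A minimal sketch of that calculation follows; the function name `relative_frequency` and the choice to return 0.0 for an empty document are assumptions made for illustration, not part of the assignment.

```python
def relative_frequency(count, total_tokens, per=10_000):
    """Return `count` scaled to a rate per `per` tokens, rounded to 2 decimals."""
    if total_tokens == 0:
        # Assumption: an empty document gets a frequency of 0 instead of raising an error.
        return 0.0
    return round(count / total_tokens * per, 2)


# Example: 150 nouns in a 1,200-token document
print(relative_frequency(150, 1200))  # 1250.0
```

Keeping the scaling in one small function avoids repeating the same arithmetic for each part of speech.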

## Objective

This assignment is designed to test that you can:

1. Work with multiple input data arranged hierarchically in folders;
2. Use ```spaCy``` to extract linguistic information from text data;
3. Save those results in a clear way which can be shared or used for future analysis.

## Some notes

- The data is arranged in various subfolders related to their content (see the [README](in/README.md) for more info). You'll need to think a little about how to loop over them. You should be able to do it using a combination of things we've already looked at, such as ```os.listdir()```, ```os.path.join()```, and for loops.
- The text files contain some extra information, such as document IDs and other metadata, that occurs between pointed brackets ```<>```. Make sure to remove this as part of your preprocessing steps!
- There are 14 subfolders (a1, a2, a3, etc), so when completed the folder ```out``` should have 14 CSV files.
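
The first two notes can be sketched as two small helpers. This is only one possible approach: the names `strip_metadata` and `list_text_files` are hypothetical, and the sketch assumes that every file of interest ends in `.txt` and that no tag spans more than one line.

```python
import os
import re


def strip_metadata(text):
    """Remove anything enclosed in pointed brackets."""
    # Non-greedy match, so each <...> tag is removed separately.
    return re.sub(r"<.*?>", "", text)


def list_text_files(corpus_dir):
    """Map each subfolder name to a sorted list of its .txt file paths."""
    files_by_subfolder = {}
    for subfolder in sorted(os.listdir(corpus_dir)):
        subfolder_path = os.path.join(corpus_dir, subfolder)
        if not os.path.isdir(subfolder_path):
            continue  # skip loose files such as a README
        files_by_subfolder[subfolder] = [
            os.path.join(subfolder_path, name)
            for name in sorted(os.listdir(subfolder_path))
            if name.endswith(".txt")
        ]
    return files_by_subfolder
```

Returning a dictionary keyed by subfolder makes it straightforward to build one table per subfolder later on.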

## Additional comments

Your code should include functions that you have written wherever possible. Try to break your code down into smaller self-contained parts, rather than having it as one long set of instructions.

For this assignment, you are welcome to submit your code either as a Jupyter Notebook or as a ```.py``` script. If you do not know how to write ```.py``` scripts, don't worry - we're working towards that!

Lastly, you are welcome to edit this README file to contain whatever information you like. Remember - documentation is important!


## Create a spaCy NLP object


In [5]:
import spacy
nlp = spacy.load("en_core_web_md") # load the medium English spaCy pipeline into the variable nlp

In [6]:
import pandas as pd
import re
import os


## Testing on one text first

## Function for extracting POS frequencies and named entities

In [26]:
def find_attributes(filename):
    filepath = os.path.join("..", "in", "USEcorpus", "c1", filename) # build the path to the file in the c1 subfolder
    with open(filepath, "r", encoding="latin-1") as file: # open the file, decoding it as latin-1
        text = file.read()
    text = re.sub(r'<.*?>', '', text) # remove all characters between < >
    doc = nlp(text) # run the spaCy pipeline to tokenize and annotate the text

    # count the POS tags needed for the relative frequencies
    noun_count = 0 # initialise the counters
    verb_count = 0
    adjective_count = 0
    adverb_count = 0

    for token in doc: # count how many times each adjective, noun, verb and adverb occurs
        if token.pos_ =="ADJ":
            adjective_count +=1
        elif token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count +=1
        elif token.pos_ == "ADV":
            adverb_count += 1

    relative_freq_ADJ = (adjective_count/len(doc)) * 10000 # relative frequency per 10,000 tokens
    relative_freq_ADJ = round(relative_freq_ADJ, 2)
    relative_freq_NOUN = (noun_count/len(doc)) * 10000
    relative_freq_NOUN = round(relative_freq_NOUN, 2)
    relative_freq_VERB = (verb_count/len(doc)) * 10000
    relative_freq_VERB = round(relative_freq_VERB, 2)
    relative_freq_ADV = (adverb_count/len(doc)) * 10000
    relative_freq_ADV = round(relative_freq_ADV, 2)
    # find the unique PER, LOC and ORG entities
    entities_PER = [] # empty lists, one per entity type
    entities_LOC = []
    entities_ORG = []

    # append each named entity's text to the list matching its label
    for ent in doc.ents:
        if ent.label_ == "PERSON": 
            entities_PER.append(ent.text)
        elif ent.label_ == "LOC":
            entities_LOC.append(ent.text)
        elif ent.label_ == "ORG":
            entities_ORG.append(ent.text)

    unique_entities_PER = set(entities_PER) # a set keeps only the unique entities
    unique_entities_LOC = set(entities_LOC)
    unique_entities_ORG = set(entities_ORG)

    # store the results for this file as a one-row pandas DataFrame
    row = (filename, relative_freq_NOUN, relative_freq_VERB, relative_freq_ADJ, relative_freq_ADV, unique_entities_PER, unique_entities_LOC, unique_entities_ORG)
    data = pd.DataFrame([row], columns=['Filename', 'Noun Freq', 'Verb Freq', 'Adj Freq', 'Adv Freq', 'Unique PER', 'Unique LOC', 'Unique ORG'])

    return data



In [27]:
filepath_new = os.path.join("..", "in", "USEcorpus", "c1")
dataframes = []
for file in os.listdir(filepath_new):
    data = find_attributes(file)
    dataframes.append(data)
final_data = pd.concat(dataframes)
final_data

#outpath = os.path.join("..", "out", "df.csv") # path of the CSV file to write
#final_data.to_csv(outpath)
#print(dataframes)



| |Filename|Noun Freq|Verb Freq|Adj Freq|Adv Freq|Unique PER|Unique LOC|Unique ORG|
|---|---|---|---|---|---|---|---|---|
|0|0140.c1.txt|1573.58|933.55|472.89|403.59|{Geroge, Nick, Enoch Robinson, Benjy, George W...|{}|{Hemingway's, Fury, Time, The Sound and the, C...|
|0|0165.c1.txt|1742.49|816.41|580.83|284.32|{Nick, Benjy, Anderson, Quentin, Faulkner, Com...|{}|{Sherwood Anderson's, Bentley, Fury}|
|0|0200.c1.txt|1177.65|1021.64|649.22|508.3|{Miriam, Catherine, Edgar Linton, Isabella, Ca...|{}|{Penguin Classics, p, Nelly, Watts, Kettle, Ar...|
|0|0219.c1.txt|1379.31|974.8|563.66|484.08|{Catherine, Emily Brontë's, Terry, Emily, Long...|{}|{ch.7, Heatcliff, T. Eagleton's, Nelly, St Ive...|
|0|0238.c1.txt|1092.9|1163.15|398.13|288.84|{Emily Brontë, Catherine, Edgar Linton, Cathy,...|{}|{Longman, Kettle, Arnold, Prentize-Hall}|
|0|0501.c1.txt|1231.93|1025.46|461.11|426.7|{XXI, Catherine, Isabella, Nelly Dean, Hindley...|{}|{P.47, Watts, Popular Classics, Hareton, Earns...|
|0|0502.c1.txt|1321.84|1219.67|434.23|408.68|{Catherine, Carl R., L. Cookson, Lockwood, ruf...|{}|{Lockwood, Heatcliff, Nelly, Norgate, Grange}|
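
The assignment table asks for the *number* of unique entities, whereas the output above stores the sets themselves. A minimal sketch of turning entities into counts is shown here. The function name `count_unique_entities` is hypothetical, and it takes plain (text, label) pairs so it can be tried without loading a model; with spaCy, such pairs could be built from `doc.ents` using `ent.text` and `ent.label_`.

```python
def count_unique_entities(entities, labels=("PERSON", "LOC", "ORG")):
    """Count distinct entity texts per label.

    `entities` is an iterable of (text, label) pairs; with spaCy these could be
    built as [(ent.text, ent.label_) for ent in doc.ents].
    """
    unique = {label: set() for label in labels}
    for text, label in entities:
        if label in unique:
            unique[label].add(text)
    return {label: len(texts) for label, texts in unique.items()}


# Hypothetical entity pairs, made up for illustration
sample = [("Catherine", "PERSON"), ("Catherine", "PERSON"),
          ("Nelly", "PERSON"), ("Longman", "ORG"), ("Sweden", "GPE")]
print(count_unique_entities(sample))
# {'PERSON': 2, 'LOC': 0, 'ORG': 1}
```

The `GPE` pair is ignored because only the three requested labels are tracked. In spaCy's English models, countries and cities are typically labelled `GPE` instead of `LOC`, which may be one reason the Unique LOC column above is empty.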


In [24]:

# function to find attributes for a single file
def find_attributes(filename, subdir):
    filepath = os.path.join("..", "in", "USEcorpus", subdir, filename)
    with open(filepath, "r", encoding="latin-1") as file:
        text = file.read()
    text = re.sub(r'<.*?>', '', text)
    doc = nlp(text)

    # count the POS tags needed for the relative frequencies
    noun_count = 0
    verb_count = 0
    adjective_count = 0
    adverb_count = 0

    for token in doc:
        if token.pos_ == "ADJ":
            adjective_count += 1
        elif token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADV":
            adverb_count += 1

    relative_freq_ADJ = (adjective_count/len(doc)) * 10000
    relative_freq_ADJ = round(relative_freq_ADJ, 2)
    relative_freq_NOUN = (noun_count/len(doc)) * 10000
    relative_freq_NOUN = round(relative_freq_NOUN, 2)
    relative_freq_VERB = (verb_count/len(doc)) * 10000
    relative_freq_VERB = round(relative_freq_VERB, 2)
    relative_freq_ADV = (adverb_count/len(doc)) * 10000
    relative_freq_ADV = round(relative_freq_ADV, 2)

    # find the unique PER, LOC and ORG entities
    entities_PER = []
    entities_LOC = []
    entities_ORG = []

    for ent in doc.ents:
        if ent.label_ == "PERSON":
            entities_PER.append(ent.text)
        elif ent.label_ == "LOC":
            entities_LOC.append(ent.text)
        elif ent.label_ == "ORG":
            entities_ORG.append(ent.text)

    unique_entities_PER = set(entities_PER)
    unique_entities_LOC = set(entities_LOC)
    unique_entities_ORG = set(entities_ORG)

    data = pd.DataFrame({
        'Filename': [filename],
        'Noun Freq': [relative_freq_NOUN],
        'Verb Freq': [relative_freq_VERB],
        'Adj Freq': [relative_freq_ADJ],
        'Adv Freq': [relative_freq_ADV],
        'Unique PER': [unique_entities_PER],
        'Unique LOC': [unique_entities_LOC],
        'Unique ORG': [unique_entities_ORG]
    })

    return data


# loop over all subdirectories and files
data_frames = []
data_dir = os.path.join("..", "in", "USEcorpus")
for subdir in sorted(os.listdir(data_dir)):
    # make the subdirectory path string
    subdir_path = os.path.join(data_dir, subdir)
    # skip any non-directory files
    if not os.path.isdir(subdir_path):
        continue
    # for each file in the subdirectory
    for filename in sorted(os.listdir(subdir_path)):
        # skip any non-text files
        if not filename.endswith(".txt"):
            continue
        # find attributes for the file
        file_data = find_attributes(filename, subdir)
        # add the one-row DataFrame for this file to the list
        data_frames.append(file_data)





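
The assignment asks for one CSV file per subfolder in the output folder. A minimal sketch of that saving step is given here, assuming the per-file results are one-row pandas DataFrames like those returned by `find_attributes`. The function name `save_subfolder_table` and the `<subfolder>.csv` naming scheme are assumptions for illustration.

```python
import os

import pandas as pd


def save_subfolder_table(frames, subfolder, out_dir):
    """Concatenate per-file DataFrames and write them to <out_dir>/<subfolder>.csv."""
    os.makedirs(out_dir, exist_ok=True)  # create the output folder if needed
    table = pd.concat(frames, ignore_index=True)  # one row per file, index 0..n-1
    out_path = os.path.join(out_dir, f"{subfolder}.csv")
    table.to_csv(out_path, index=False)
    return out_path
```

Inside the subfolder loop, one option is to collect the frames for the current subfolder in a fresh list and then call `save_subfolder_table(frames, subdir, os.path.join("..", "out"))` once all of its files have been processed.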