This assignment concerns using ```spaCy``` to extract linguistic information from a corpus of texts.

The corpus is an interesting one: *The Uppsala Student English Corpus (USE)*. All of the data is included in the folder called ```in``` but you can access more documentation via [this link](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).

For this exercise, you should write some code which does the following:

- Loop over each text file in the folder called ```in```
- Extract the following information:
    - Relative frequency of Nouns, Verbs, Adjectives, and Adverbs per 10,000 words
    - Total number of *unique* PER, LOC, and ORG entities
- For each sub-folder (a1, a2, a3, ...) save a table which shows the following information:

|Filename|RelFreq NOUN|RelFreq VERB|RelFreq ADJ|RelFreq ADV|Unique PER|Unique LOC|Unique ORG|
|---|---|---|---|---|---|---|---|
|file1.txt|---|---|---|---|---|---|---|
|file2.txt|---|---|---|---|---|---|---|
|etc|---|---|---|---|---|---|---|
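
The relative frequencies in the table are raw counts scaled to a common text length. A minimal sketch of that arithmetic, assuming the part-of-speech tags are already available as a list of strings; the helper names `relative_frequency` and `pos_relative_frequencies` are hypothetical and not part of the assignment:

```python
from collections import Counter


def relative_frequency(count, total, per=10_000):
    """Scale a raw count to a rate per `per` tokens, rounded to 2 decimals."""
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty text
    return round(count / total * per, 2)


def pos_relative_frequencies(pos_tags, wanted=("NOUN", "VERB", "ADJ", "ADV")):
    """Return {tag: relative frequency} for the wanted tags."""
    counts = Counter(pos_tags)
    total = len(pos_tags)
    return {tag: relative_frequency(counts[tag], total) for tag in wanted}


# Illustrative input only. In practice the tags would come from spaCy,
# e.g. [token.pos_ for token in doc]
tags = ["NOUN", "VERB", "NOUN", "ADJ", "DET",
        "NOUN", "ADV", "VERB", "PUNCT", "NOUN"]
freqs = pos_relative_frequencies(tags)
print(freqs)
```

Whether `total` should include punctuation and whitespace tokens is a design choice: filtering them out first gives a rate per word rather than per token.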

## Objective

This assignment is designed to test that you can:

1. Work with multiple input data arranged hierarchically in folders;
2. Use ```spaCy``` to extract linguistic information from text data;
3. Save those results in a clear way which can be shared or used for future analysis.

## Some notes

- The data is arranged in various subfolders related to their content (see the [README](in/README.md) for more info). You'll need to think a little bit about how to loop over them. You should be able to do it using a combination of things we've already looked at, such as ```os.listdir()```, ```os.path.join()```, and for loops.
- The text files contain some extra information, such as document ID and other metadata, that occurs between pointed brackets ```<>```. Make sure to remove these as part of your preprocessing steps!
- There are 14 subfolders (a1, a2, a3, etc), so when completed the folder ```out``` should have 14 CSV files.
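
The folder traversal and metadata removal described in these notes can be sketched as follows. This is one possible approach, not a required solution: the function names `strip_metadata` and `iter_text_files` are hypothetical, and the sketch assumes the text files sit exactly one level below the corpus folder.

```python
import os
import re


def strip_metadata(text):
    """Remove anything between pointed brackets, e.g. <doc id=...>."""
    return re.sub(r"<.*?>", "", text)


def iter_text_files(corpus_dir):
    """Yield (subfolder, filename, path) for every .txt file one level down."""
    for subfolder in sorted(os.listdir(corpus_dir)):
        folder_path = os.path.join(corpus_dir, subfolder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files such as a README
        for filename in sorted(os.listdir(folder_path)):
            if filename.endswith(".txt"):
                yield subfolder, filename, os.path.join(folder_path, filename)


# Small check of the tag removal on an invented string
print(strip_metadata("<doc id=1>Hello <b>world</b>"))
```

Note that `.` in the pattern does not match a newline by default, so a tag that spans several lines would survive; passing `flags=re.DOTALL` to `re.sub` is one way to handle that case if it occurs in the data.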

## Additional comments

Your code should include functions that you have written wherever possible. Try to break your code down into smaller self-contained parts, rather than having it as one long set of instructions.

For this assignment, you are welcome to submit your code either as a Jupyter Notebook, or as a ```.py``` script. If you do not know how to write ```.py``` scripts, don't worry - we're working towards that!

Lastly, you are welcome to edit this README file to contain whatever information you like. Remember - documentation is important!


## Create a spaCy NLP object


In [1]:
import spacy
nlp = spacy.load("en_core_web_md") # load the medium English spaCy pipeline into the variable nlp

In [5]:
import pandas as pd
import re
import os

## Testing on one text first

## Function for extracting POS frequencies and named entities

In [47]:
def find_attributes(filename):
    filepath = os.path.join("..", "in", "USEcorpus", "a1", filename) # build the path to the file in subfolder a1
    with open(filepath, "r", encoding="latin-1") as file: # open the file, decoding it as latin-1
        text = file.read()
    text = re.sub(r'<.*?>', '', text) # remove all characters between < >
    doc = nlp(text) # run the spaCy pipeline to tokenize and annotate the text

    # finding relFreq of nouns, verbs, adjectives and adverbs
    noun_count = 0 # counters start at zero
    verb_count = 0
    adjective_count = 0
    adverb_count = 0

    for token in doc: # count how many times each adjective, noun, verb and adverb occurs
        if token.pos_ == "ADJ":
            adjective_count += 1
        elif token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADV":
            adverb_count += 1

    relative_freq_ADJ = (adjective_count/len(doc)) * 10000 # relative frequency per 10,000 tokens (len(doc) counts every token, punctuation included)
    relative_freq_ADJ = round(relative_freq_ADJ, 2)
    relative_freq_NOUN = (noun_count/len(doc)) * 10000
    relative_freq_NOUN = round(relative_freq_NOUN, 2)
    relative_freq_VERB = (verb_count/len(doc)) * 10000
    relative_freq_VERB = round(relative_freq_VERB, 2)
    relative_freq_ADV = (adverb_count/len(doc)) * 10000
    relative_freq_ADV = round(relative_freq_ADV, 2)
    # Finding unique PER, LOC, ORG
    entities_PER = [] # creating empty lists
    entities_LOC = []
    entities_ORG = []

    # get named entities and add each one to the list matching its label
    for ent in doc.ents: # ent is a named entity found by spaCy
        if ent.label_ == "PERSON": 
            entities_PER.append(ent.text)
        elif ent.label_ == "LOC":
            entities_LOC.append(ent.text)
        elif ent.label_ == "ORG":
            entities_ORG.append(ent.text)

    unique_entities_PER = set(entities_PER) # defining unique only 
    unique_entities_LOC = set(entities_LOC)
    unique_entities_ORG = set(entities_ORG) # using set to find the unique entities in the list.
    # checking to see if it has worked so far 
    #print(filename, relative_freq_NOUN, relative_freq_VERB, relative_freq_ADJ, relative_freq_ADV, unique_entities_PER, unique_entities_LOC, unique_entities_ORG)
    
    # creating a dictionary so i can store the data in a pandas dataframe 
    #datadic = [
    #{"Filename": filename, "RelFreq NOUN": relative_freq_NOUN, "RelFreq VERB": relative_freq_VERB, "RelFreq ADJ": relative_freq_ADJ, "RelFreq ADV": relative_freq_ADV, "Unique PER": unique_entities_PER, "Unique LOC": unique_entities_LOC, "Unique ORG": unique_entities_ORG}
#]
    #touple_of_data = []
    #for doc in [filename]:
        #touple_of_data.append((doc, relative_freq_NOUN, relative_freq_VERB, relative_freq_ADJ, relative_freq_ADV, unique_entities_PER, unique_entities_LOC, unique_entities_ORG))
    #print(touple_of_data)
    # store the results for this file as one row and print it as a check
    # (no loop needed: the old loop variable also overwrote the spaCy doc)
    touple_of_data = [(filename, relative_freq_NOUN, relative_freq_VERB, relative_freq_ADJ, relative_freq_ADV, unique_entities_PER, unique_entities_LOC, unique_entities_ORG)]
    all_tuples = [touple_of_data]
    print(all_tuples)
# creating a pandas dataframe, and storing in out folder as a csv file
    #data = pd.DataFrame(touple_of_data)
    #print(data)
    #outpath = os.path.join("..", "out", filename + "annotations.csv") # creating variable which works like a function for code below
    #data.to_csv(outpath)



In [48]:
find_attributes("0100.a1.txt")

[[('0100.a1.txt', 1533.05, 1223.63, 801.69, 534.46, set(), set(), set())]]
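
The assignment table asks for the total number of unique entities, whereas the output above holds the sets themselves. A minimal sketch of turning entity annotations into counts, assuming they are available as `(text, label)` pairs such as `[(ent.text, ent.label_) for ent in doc.ents]`; the helper name `count_unique_entities` is hypothetical:

```python
def count_unique_entities(entities, labels=("PERSON", "LOC", "ORG")):
    """Count distinct entity strings per label."""
    unique = {label: set() for label in labels}
    for text, label in entities:
        if label in unique:
            unique[label].add(text)
    return {label: len(texts) for label, texts in unique.items()}


# Illustrative pairs, not taken from the corpus output
example = [
    ("Dickens", "PERSON"),
    ("Austen", "PERSON"),
    ("Dickens", "PERSON"),  # duplicate, counted once
    ("Caribbean", "LOC"),
    ("Oxford", "ORG"),
    ("Sweden", "GPE"),      # ignored: GPE is not in `labels`
]
entity_counts = count_unique_entities(example)
print(entity_counts)
```

spaCy's English pipelines label countries and cities as `GPE` rather than `LOC`, so adding `"GPE"` to `labels` is an option if places in general are wanted. Whether that matches the intent of the assignment is a judgment call.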


In [49]:
filepath_new = os.path.join("..", "in", "USEcorpus", "a1")
for file in os.listdir(filepath_new):
    find_attributes(file) # find_attributes builds the full path itself, so pass only the filename

[[('0176.a1.txt', 1408.14, 1243.12, 649.06, 682.07, set(), set(), {'Visingsö Folk High School'})]]
[[('3040.a1.txt', 1168.09, 1737.89, 940.17, 655.27, set(), set(), set())]]
[[('2044.a1.txt', 1398.96, 1450.78, 556.99, 595.85, {'Marie Antoinette'}, set(), set())]]
[[('1102.a1.txt', 1198.63, 1381.28, 730.59, 730.59, {'katt', 'superintendet Morse', 'Britsh', 'Minette Walters'}, {'Africa', 'Asia'}, {'instace'})]]
[[('1029.a1.txt', 1342.04, 1330.17, 748.22, 439.43, set(), set(), set())]]
[[('0218.a1.txt', 1404.36, 1452.78, 714.29, 447.94, set(), set(), set())]]
[[('1043.a1.txt', 1356.32, 1264.37, 643.68, 678.16, set(), set(), set())]]
[[('0200.a1.txt', 1261.17, 1320.75, 774.58, 705.06, {'Dickens', 'Austen', 'Shakespeare'}, {'Caribbean'}, {'Oxford'})]]
[[('1074.a1.txt', 1552.68, 1293.9, 702.4, 674.68, set(), set(), set())]]
[[('2047.a1.txt', 1520.91, 1318.12, 722.43, 671.74, set(), set(), set())]]
[[('0187.a1.txt', 1298.7, 1428.57, 753.25, 753.25, set(), set(), {'the Cambridge Certificate in

KeyboardInterrupt: 

In [20]:
print(touple_of_data)

NameError: name 'touple_of_data' is not defined
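
The `NameError` above arises because `touple_of_data` is created inside `find_attributes` and so exists only while the function runs. One way around this, sketched here under assumptions, is to have the per-file function return its row and to collect the rows for a subfolder before writing one CSV. The names `analyse_folder`, `write_table`, and `analyse_file` are hypothetical, the output filename pattern is an assumption, and the sketch uses the standard-library `csv` module rather than pandas to stay self-contained.

```python
import csv
import os

COLUMNS = ["Filename", "RelFreq NOUN", "RelFreq VERB", "RelFreq ADJ",
           "RelFreq ADV", "Unique PER", "Unique LOC", "Unique ORG"]


def analyse_folder(folder_path, analyse_file):
    """Call analyse_file(path) on each .txt file and collect the returned rows."""
    rows = []
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith(".txt"):
            rows.append(analyse_file(os.path.join(folder_path, filename)))
    return rows


def write_table(rows, out_dir, subfolder):
    """Write one CSV for a subfolder and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"{subfolder}_annotations.csv")
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)  # header row matching the assignment table
        writer.writerows(rows)    # one row per text file
    return out_path
```

Here `analyse_file` stands for a version of `find_attributes` that takes a full path and ends with `return` instead of `print`. With pandas, `pd.DataFrame(rows, columns=COLUMNS).to_csv(out_path, index=False)` would do the same job as the body of `write_table`.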