This assignment concerns using ```spaCy``` to extract linguistic information from a corpus of texts.

The corpus is an interesting one: *The Uppsala Student English Corpus (USE)*. All of the data is included in the folder called ```in```, but you can access more documentation via [this link](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).

For this exercise, you should write some code which does the following:

- Loop over each text file in the folder called ```in```
- Extract the following information:
    - Relative frequency of nouns, verbs, adjectives, and adverbs per 10,000 words
    - Total number of *unique* PER, LOC, and ORG entities
- For each sub-folder (a1, a2, a3, ...) save a table which shows the following information:

|Filename|RelFreq NOUN|RelFreq VERB|RelFreq ADJ|RelFreq ADV|Unique PER|Unique LOC|Unique ORG|
|---|---|---|---|---|---|---|---|
|file1.txt|---|---|---|---|---|---|---|
|file2.txt|---|---|---|---|---|---|---|
|etc|---|---|---|---|---|---|---|
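
As a minimal sketch of the "per 10,000 words" calculation, the helper below scales a raw count by the length of the text. The function name ```relative_frequency``` and the example counts are illustrative assumptions, not part of the assignment.

```python
def relative_frequency(count, total_tokens, per=10_000):
    """Return the number of occurrences per `per` tokens, rounded to two decimals."""
    if total_tokens == 0:
        # An empty text has no meaningful frequency; return 0.0 rather than divide by zero.
        return 0.0
    return round(count / total_tokens * per, 2)


# Hypothetical example: 250 nouns in a text of 1,200 tokens.
print(relative_frequency(250, 1200))  # 2083.33
```

Scaling to a fixed base makes texts of different lengths comparable, which raw counts are not.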

## Objective

This assignment is designed to test that you can:

1. Work with multiple input data arranged hierarchically in folders;
2. Use ```spaCy``` to extract linguistic information from text data;
3. Save those results in a clear way which can be shared or used for future analysis.

## Some notes

- The data is arranged in subfolders according to content (see the [README](in/README.md) for more info), so you'll need to think a little bit about how to loop over both the subfolders and the files inside them. You should be able to do it using a combination of things we've already looked at, such as ```os.listdir()```, ```os.path.join()```, and for loops.
- The text files contain some extra information, such as a document ID and other metadata, that occurs between angle brackets ```<>```. Make sure to remove these as part of your preprocessing steps!
- There are 14 subfolders (a1, a2, a3, etc), so when completed the folder ```out``` should have 14 CSV files.
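
The first two notes could be approached roughly as follows. This is only a sketch: the helper names ```strip_metadata``` and ```files_by_subfolder``` are made up for illustration, and the example tag in the last line is a hypothetical stand-in for the real metadata format.

```python
import os
import re


def strip_metadata(text):
    """Remove everything enclosed in angle brackets, including the brackets."""
    return re.sub(r"<.*?>", "", text)


def files_by_subfolder(corpus_dir):
    """Map each subfolder name to the sorted paths of the files inside it."""
    result = {}
    for folder_name in sorted(os.listdir(corpus_dir)):
        folder_path = os.path.join(corpus_dir, folder_name)
        if not os.path.isdir(folder_path):
            continue  # skip loose files such as a README
        result[folder_name] = [
            os.path.join(folder_path, file_name)
            for file_name in sorted(os.listdir(folder_path))
            if os.path.isfile(os.path.join(folder_path, file_name))
        ]
    return result


# Hypothetical metadata tag, used only to show the effect of the regex.
print(strip_metadata("<doc id=example>Hello world."))
```

Keeping the folder traversal separate from the text cleaning means each part can be checked on its own before spaCy is involved.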

## Additional comments

Your code should include functions that you have written wherever possible. Try to break your code down into smaller self-contained parts, rather than having it as one long set of instructions.
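
One possible way to split the work into small functions is sketched below. The function names and the stand-in data are assumptions made for illustration; with spaCy, the inputs would come from ```[token.pos_ for token in doc]``` and ```[(ent.text, ent.label_) for ent in doc.ents]```.

```python
from collections import Counter


def count_pos(pos_tags, wanted=("NOUN", "VERB", "ADJ", "ADV")):
    """Count how often each wanted part-of-speech tag occurs."""
    counts = Counter(tag for tag in pos_tags if tag in wanted)
    return {tag: counts[tag] for tag in wanted}


def count_unique_entities(entities, wanted=("PERSON", "LOC", "ORG")):
    """Count distinct entity texts per label, given (text, label) pairs."""
    unique = {label: set() for label in wanted}
    for text, label in entities:
        if label in unique:
            unique[label].add(text)
    return {label: len(texts) for label, texts in unique.items()}


# Stand-in data in place of a real spaCy Doc.
tags = ["NOUN", "VERB", "NOUN", "PUNCT", "ADJ"]
ents = [("Acme", "ORG"), ("Acme", "ORG"), ("Anna", "PERSON")]
print(count_pos(tags))              # {'NOUN': 2, 'VERB': 1, 'ADJ': 1, 'ADV': 0}
print(count_unique_entities(ents))  # {'PERSON': 1, 'LOC': 0, 'ORG': 1}
```

Because these functions take plain lists, they can be tested without loading a spaCy model.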

For this assignment, you are welcome to submit your code either as a Jupyter Notebook, or as a ```.py``` script. If you do not know how to write ```.py``` scripts, don't worry - we're working towards that!

Lastly, you are welcome to edit this README file to contain whatever information you like. Remember - documentation is important!


## Create a spaCy NLP object


In [1]:
import spacy
nlp = spacy.load("en_core_web_md") # loads the medium English spaCy pipeline into the variable nlp

In [2]:
import pandas as pd
import re
import os


## Function for finding relative frequencies and unique entities

In [47]:
def find_attributes(directory): # Define a function called find_attributes with the parameter directory (the path to the corpus folder)
    all_data = [] # An empty list to store each dataframe created.
    
    # Loop over every entry in the corpus directory and save its full path in the variable folder_path.
    # os.listdir returns a list of the names of all entries in the specified directory.
    # os.path.join joins the directory path and the entry name into one path.
    for folder_name in os.listdir(directory): 
        folder_path = os.path.join(directory, folder_name)

        # Check whether folder_path is a directory; if so, loop over its contents and build the path to each file in the subfolder.
        if os.path.isdir(folder_path):
            for file_name in os.listdir(folder_path):
                file_path = os.path.join(folder_path, file_name)
                # Check whether file_path is a file; if so, open it with latin-1 encoding.
                # Latin-1 is used here because the files could not be read as UTF-8.
                # The contents of the file are stored in a new variable called text.
                if os.path.isfile(file_path):
                    with open(file_path, 'r', encoding="latin-1") as file:
                        text = file.read()

                    # Use a regular expression to remove everything enclosed in angle brackets.
                    # sub - replaces each match with the given string (here, an empty string)
                    # . - matches any character except a newline
                    # * - means zero or more occurrences
                    # ? - after *, makes the match non-greedy, so it stops at the first closing bracket
                    text = re.sub(r'<.*?>', '', text)
                    doc = nlp(text) # run the spaCy pipeline on the text to get a Doc with tagged tokens and named entities

                    # Create the counters used below.
                    noun_count = 0
                    verb_count = 0
                    adjective_count = 0
                    adverb_count = 0

                    # Loop over the tokens and use spaCy's pos_ attribute to add one to the matching counter for each adjective, noun, verb, and adverb.
                    for token in doc:
                        if token.pos_ == "ADJ":
                            adjective_count += 1
                        elif token.pos_ == "NOUN":
                            noun_count += 1
                        elif token.pos_ == "VERB":
                            verb_count += 1
                        elif token.pos_ == "ADV":
                            adverb_count += 1

                    # Find the relative frequency by dividing the count for each part of speech by the number of tokens in the text
                    # and multiplying by 10,000.
                    relative_freq_ADJ = (adjective_count/len(doc)) * 10000 
                    relative_freq_ADJ = round(relative_freq_ADJ, 2)
                    relative_freq_NOUN = (noun_count/len(doc)) * 10000
                    relative_freq_NOUN = round(relative_freq_NOUN, 2)
                    relative_freq_VERB = (verb_count/len(doc)) * 10000
                    relative_freq_VERB = round(relative_freq_VERB, 2)
                    relative_freq_ADV = (adverb_count/len(doc)) * 10000
                    relative_freq_ADV = round(relative_freq_ADV, 2)


                    # Find the unique PER, LOC, and ORG entities.
                    # Create one empty list per entity type.
                    entities_PER = []
                    entities_LOC = []
                    entities_ORG = []

                    # Get the named entities and add them to the lists.
                    # ent means entity
                    # Loop over the entities; each one labelled PERSON, LOC, or ORG is appended to the matching list.
                    for ent in doc.ents:  
                        if ent.label_ == "PERSON": 
                            entities_PER.append(ent.text)
                        elif ent.label_ == "LOC":
                            entities_LOC.append(ent.text)
                        elif ent.label_ == "ORG":
                            entities_ORG.append(ent.text)

                    # Count the unique entities: set() removes duplicates and len() gives the number that remain.
                    unique_entities_PER = len(set(entities_PER))
                    unique_entities_LOC = len(set(entities_LOC))
                    unique_entities_ORG = len(set(entities_ORG))

                    # Create an empty list to store the tuple created below.
                    tuple_of_data = []

                    # Append all the values as one tuple and create a dataframe from the list, specifying the column names as well.
                    tuple_of_data.append((file_name, relative_freq_NOUN, relative_freq_VERB, relative_freq_ADJ, relative_freq_ADV, unique_entities_PER, unique_entities_LOC, unique_entities_ORG))
                    data = pd.DataFrame(tuple_of_data, columns=['Filename', 'Noun Freq', 'Verb Freq', 'Adj Freq', 'Adv Freq', 'Unique PER', 'Unique LOC', 'Unique ORG'])

                    # Append each dataframe (which is only one row) to the list created at the beginning of this function.
                    all_data.append(data)

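            # Once every file in this subfolder has been processed, concatenate the
            # one-row dataframes into a single dataframe for the subfolder.
            final_data = pd.concat(all_data)

            # Save the dataframe to the folder out, one CSV file per subfolder.
            outpath = os.path.join("..", "out", folder_name + ".csv")
            final_data.to_csv(outpath, index=False)

            # Empty the list so the next subfolder's CSV does not repeat these rows.
            all_data = []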

    return final_data
                        



## Specifying the path and running the function

In [48]:
# Creating a variable which has the directory path.
directory = os.path.join("..", "in", "USECorpus")
# Running the function
dfs = find_attributes(directory)
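
Saving will fail if the ```out``` folder does not exist yet, and the notes above expect 14 CSV files once the run has finished. A small sketch of both checks follows; the helper names ```ensure_folder``` and ```list_csv_files``` are illustrative assumptions, and the demonstration uses a temporary folder so that nothing on disk is touched.

```python
import os
import tempfile


def ensure_folder(path):
    """Create the folder if it does not exist yet and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


def list_csv_files(path):
    """Return the sorted names of all CSV files in the folder."""
    return sorted(name for name in os.listdir(path) if name.endswith(".csv"))


# Demonstration with made-up file names in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_folder(os.path.join(tmp, "out"))
    for name in ("a1.csv", "a2.csv", "notes.txt"):
        open(os.path.join(out_dir, name), "w").close()
    print(list_csv_files(out_dir))  # ['a1.csv', 'a2.csv']
```

In the notebook, the same helpers could be pointed at ```os.path.join("..", "out")```: once before calling ```find_attributes```, and once afterwards to confirm the number of CSV files.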

