## Portfolio 1 - Extracting linguistic features using spaCy

*By Sofie Mosegaard, 22-02-2024*

(Note: the code below contains a known error in the PER, LOC, and ORG features. It counts the total number of entity mentions, not the number of unique entities.)

This assignment concerns using spaCy to extract linguistic information from a corpus of texts.
This assignment is designed to test that you can:

1. Work with multiple input data arranged hierarchically in folders;
2. Use spaCy to extract linguistic information from text data;
3. Save those results in a clear way which can be shared or used for future analysis

### Import packages

In [None]:
import os
import pandas as pd
import glob
import re

In [None]:
import spacy
# python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

### Extracting linguistic features using spaCy
For each text in the subfolders of the folder 'in', I extract linguistic features and append them to a subfolder-specific table. At the end, one .csv file per subfolder is created and saved in the folder 'out'.
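One preprocessing step in the cell below is stripping metadata tags enclosed in angle brackets before the text is passed to spaCy. The following is a minimal sketch of how a non-greedy pattern behaves on such tags. The helper name `strip_tags` and the sample string are assumptions made for illustration and are not taken from the corpus.

```python
import re

def strip_tags(text):
    """Remove anything of the form <...> using a non-greedy match."""
    # "." matches any character except a newline, and "*?" keeps each match
    # as short as possible, so every tag is removed separately.
    return re.sub(r"<.*?>", "", text)

sample = "<doc id=1>Hello</doc> world"  # hypothetical input
cleaned = strip_tags(sample)
print(cleaned)

# Without the ".", the pattern only matches a bare ">" (optionally preceded
# by "<" characters), so the tag contents are left in the text.
without_dot = re.sub(r"<*?>", "", "<doc id=1>Hello</doc>")
print(without_dot)
```

Because `.` does not match newlines by default, a tag that spans several lines would not be removed by this sketch.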

In [None]:
# First, specify the filepath to the folder with all the data
filepath = os.path.join(
                        "..",
                        "in",
                        "USEcorpus"
                        )

# Loop over each of the 14 subfolders (a1, a2, a3...)
for subfolder in sorted(os.listdir(filepath)): # sorted() gives a stable, alphabetical iteration order
    subfolder_path = os.path.join(filepath, subfolder)

    if os.path.isdir(subfolder_path): # Skip anything that is not a directory

        # Initialize empty lists to store data  
        filenames = [] # Original name of the text files in the subfolders
        nouns_freq = []
        verbs_freq = []
        adverbs_freq = []
        adjectives_freq = []
        no_unique_per = []
        no_unique_loc = []
        no_unique_org = []

        # Loop over each text file in the subfolder
        for file in glob.glob(os.path.join(subfolder_path, "*.txt")):
            if os.path.isfile(file): # Make sure the path points to a regular file
                with open(file, "r", encoding = "latin-1") as f:
                    text = f.read()

                    text = re.sub(r"<.*?>", "", text) # Strip metadata tags enclosed in <> by replacing them with an empty string

                    doc = nlp(text) # Create spacy doc

                    ### Count number of each POS ###

                    # Initialise one counter each for nouns, verbs, adverbs, and adjectives
                    nouns_count, verbs_count, adverb_count, adjective_count = 0, 0, 0, 0

                    for token in doc:
                        if token.pos_ == "NOUN":
                            nouns_count += 1
                        elif token.pos_ == "VERB":
                            verbs_count += 1
                        elif token.pos_ == "ADV":
                            adverb_count += 1
                        elif token.pos_ == "ADJ":
                            adjective_count += 1
                
                    # Calculate their relative frequency per 10,000 tokens (len(doc) counts every token, including punctuation) and round to two decimals
                    nouns_relative_freq = round((nouns_count/len(doc) * 10000), 2)
                    verbs_relative_freq = round((verbs_count/len(doc) * 10000), 2)
                    adverb_relative_freq = round((adverb_count/len(doc) * 10000), 2)
                    adjective_relative_freq = round((adjective_count/len(doc) * 10000), 2)

                    ### Count PER, LOC, and ORG entity mentions (total, not unique; see the note at the top) ###

                    unique_per_count = 0
                    unique_loc_count = 0
                    unique_org_count = 0

                    # Iterate over each entity in the spacy doc
                    for ent in doc.ents:
                        # Check the entity label and increment the matching counter for every mention
                        if ent.label_ == "PERSON":
                            unique_per_count += 1
                        elif ent.label_ == "LOC":
                            unique_loc_count += 1
                        elif ent.label_ == "ORG":
                            unique_org_count += 1


                    # Append the relative POS frequencies and the entity counts to the lists
                    filenames.append(os.path.basename(file))
                    nouns_freq.append(nouns_relative_freq)
                    verbs_freq.append(verbs_relative_freq)
                    adverbs_freq.append(adverb_relative_freq)
                    adjectives_freq.append(adjective_relative_freq)
                    no_unique_per.append(unique_per_count)
                    no_unique_loc.append(unique_loc_count)
                    no_unique_org.append(unique_org_count)
    

        # Create a pandas dataframe for each subfolder
        df = pd.DataFrame({
            "Filename": filenames,
            "Nouns_Relative_Freq": nouns_freq,
            "Verbs_Relative_Freq": verbs_freq,
            "Adverbs_Relative_Freq": adverbs_freq,
            "Adjectives_Relative_Freq": adjectives_freq,
            "No_unique_per": no_unique_per,
            "No_unique_loc": no_unique_loc,
            "No_unique_org": no_unique_org
        })

        # Save the dataframe as a .csv file, creating the output folder if needed
        out_dir = os.path.join("..", "out")
        os.makedirs(out_dir, exist_ok=True)
        csv_filename = os.path.join(out_dir, f"{subfolder}_data.csv")
        df.to_csv(csv_filename)
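
The note at the top of this notebook says that the entity columns should hold the number of unique entities. One possible way to get that is to collect the entity strings in a set per label and take the size of each set. This is only a sketch under stated assumptions: the helper `count_unique_entities`, the `Ent` stand-in for a spaCy entity, and the choice to treat strings that differ only in case or whitespace as the same entity are all mine and are not part of the code above.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span; only .text and .label_ are used.
Ent = namedtuple("Ent", ["text", "label_"])

def count_unique_entities(ents, labels=("PERSON", "LOC", "ORG")):
    """Return {label: number of distinct entity strings} for the given labels."""
    unique = {label: set() for label in labels}
    for ent in ents:
        if ent.label_ in unique:
            # Assumption: collapse whitespace and lowercase, so that
            # "BBC" and "bbc" are counted as one entity.
            unique[ent.label_].add(" ".join(ent.text.split()).lower())
    return {label: len(texts) for label, texts in unique.items()}

# Made-up entities for illustration, not output from the corpus
example_ents = [
    Ent("Anna", "PERSON"),
    Ent("Anna", "PERSON"),   # repeated mention, counted once
    Ent("the Alps", "LOC"),
    Ent("BBC", "ORG"),
    Ent("bbc", "ORG"),       # same organisation after lowercasing
    Ent("2024", "DATE"),     # label not requested, ignored
]
counts = count_unique_entities(example_ents)
print(counts)
```

With a real spaCy document, the same function could be called as `count_unique_entities(doc.ents)`, since spaCy entities expose `.text` and `.label_`.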


### Test - import one .csv file to check how the table looks

In [None]:
table_tester = pd.read_csv("../out/a1_data.csv")
table_tester
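
When checking the table, it may help to know that `df.to_csv()` writes the row index as an unnamed first column by default, so a plain `pd.read_csv()` shows an extra `Unnamed: 0` column. The sketch below illustrates this with an in-memory buffer instead of a file. The file names and values are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for one subfolder's table
df = pd.DataFrame({
    "Filename": ["text_a.txt", "text_b.txt"],
    "Nouns_Relative_Freq": [1500.0, 1425.5],
})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)               # index comes back as "Unnamed: 0"

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)  # first column is used as the index

print(list(plain.columns))
print(list(clean.columns))
```

Passing `index=False` to `to_csv` would be another option if the index is not needed in the saved file.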