## Portfolio 1 - Extracting linguistic features using spaCy

*By Sofie Mosegaard, 22-02-2024*

This assignment concerns using spaCy to extract linguistic information from a corpus of texts.
This assignment is designed to test that you can:

1. Work with multiple input data arranged hierarchically in folders;
2. Use spaCy to extract linguistic information from text data;
3. Save those results in a clear way which can be shared or used for future analysis

### Import packages

In [1]:
import os
import pandas as pd
import glob
import re

In [3]:
import spacy 
# python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

KeyboardInterrupt: 

### Define functions

In [4]:
# Calculate the relative frequency per 10,000 words and round the decimals
# The input is the number of POS and the total number of tokens in the given text, while the 
# output will be the he relative frquency per 10,000 words

def rel_freq(count, len_doc): 
    return round((count/len_doc * 10000), 2)

# Count total number of unique PER, LOC, and ORG entities
# The input is a spacy doc object, while the output will be the total number of unique persons, locations (LOC),
# and organisations (ORG) mentioned in the specified, input doc object.

def no_unique_ents(doc):
    enteties = []

    for ent in doc.ents: 
        enteties.append((ent.text, ent.label_))

    enteties_df = pd.DataFrame(enteties, columns=["enteties", "label"])
    enteties_df = enteties_df.drop_duplicates()
    unique_counts = enteties_df.value_counts(subset = "label")
    
    unique_labels = ['PERSON', 'LOC', 'ORG']
    unique_row = []

    for label in unique_labels:
        if label in (unique_counts.index):
            unique_row.append(unique_counts[label])
        else:
            unique_row.append(0)

    return unique_row

### Extracting linguistic features using spaCy 
For each text in subfolders of the folder 'in', I will extract linguistic features and append it to a subfolder-specific table. In the end, a .csv file for each subfolder will be created and saved in the folder 'out'.

In [None]:
# First, specify the filepath to the folder with all the data
filepath = os.path.join(
                        "..",
                        "in",
                        "USEcorpus"
                        )

# Loop over each of the 14 subfolders (a1, a2, a3...)
for subfolder in sorted(os.listdir(filepath)): # sorted = loops through the subfolders in the original, sorted order
    subfolder_path = os.path.join(filepath, subfolder)

    if os.path.isdir(subfolder_path): # Check if the specified directory exists or nor

        # Create a pandas dataframe for each subfolder with specified column names
        out_df = pd.DataFrame(columns=("Filename",
                                        "RelFreq NOUN",
                                        "RelFreq VERB",
                                        "RelFreq ADJ",
                                        "RelFreq ADV",
                                        "No. Unique PER",
                                        "No. Unique LOC",
                                        "No. Unique ORG"))

        # Loop over each text file in the subfolder
        for file in glob.glob(os.path.join(subfolder_path, "*.txt")):
            if os.path.isfile(file): # Function checks whether there excists files on the specified path
                
                with open(file, "r", encoding = "latin-1") as f:
                    text = f.read()

                    text = re.sub(r"<*?>", "", text) # Remove metadata between <> and replace it with "" (= nothing hehe)

                    doc = nlp(text) # Create spacy doc

                    # Count number of each POS
                    nouns_count, verbs_count, adverb_count, adjective_count = 0, 0, 0, 0

                    for token in doc:
                        if token.pos_ == "NOUN":
                            nouns_count += 1
                        elif token.pos_ == "VERB":
                            verbs_count += 1
                        elif token.pos_ == "ADV":
                            adverb_count += 1
                        elif token.pos_ == "ADJ":
                            adjective_count += 1
            
                    nouns_relative_freq = rel_freq(nouns_count, len(doc))
                    verbs_relative_freq = rel_freq(verbs_count, len(doc))
                    adjective_relative_freq = rel_freq(adjective_count, len(doc))
                    adverb_relative_freq = rel_freq(adverb_count, len(doc))
                    
                    # Count total number of unique PER, LOC, and ORG entities
                    No_unique_per, No_unique_loc, No_unique_org = no_unique_ents(doc)

                    # Append the name of the text to the filenames folder 
                    text_name = file.split("/")[-1]

                    # Append the extracted linguistic features for each text to a row in the out_df
                    text_row = [text_name, nouns_relative_freq,
                                verbs_relative_freq, adjective_relative_freq,
                                adverb_relative_freq, No_unique_per,
                                No_unique_loc, No_unique_org]

                    out_df.loc[len(out_df)] = text_row
            
            # Specify path to the output folcder and name of the specific .csv file
            csv_outpath = os.path.join("..", "out", f"{subfolder}_data.csv")

        out_df.to_csv(csv_outpath)  

### Test - import one .csv file to check how the table looks

In [29]:
table_tester = pd.read_csv("../out/a1_data.csv")
table_tester

Unnamed: 0.1,Unnamed: 0,Filename,RelFreq NOUN,RelFreq VERB,RelFreq ADJ,RelFreq ADV,No. Unique PER,No. Unique LOC,No. Unique ORG
0,0,1112.a1.txt,1273.96,1465.61,631.34,823.00,0,0,0
1,1,0107.a1.txt,1204.99,1440.44,886.43,609.42,0,0,0
2,2,1071.a1.txt,1396.71,1302.82,622.07,481.22,3,0,4
3,3,0191.a1.txt,1251.49,1358.76,750.89,679.38,0,0,2
4,4,3064.a1.txt,1538.46,1372.14,665.28,665.28,0,0,0
...,...,...,...,...,...,...,...,...,...
298,298,0149.a1.txt,1239.11,1239.11,735.72,735.72,1,0,0
299,299,0100.a1.txt,1524.48,1216.78,797.20,531.47,0,0,0
300,300,3045.a1.txt,1144.58,1385.54,622.49,401.61,0,1,0
301,301,0128.a1.txt,1355.93,1210.65,641.65,726.39,0,0,0
