### Quick Description

This notebook computes part-of-speech (POS) tags for each sentence in a document. For every sentence, we count how often each POS category occurs and save the resulting count vector as features.
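As a minimal sketch of the count vector described above, the example below uses only the standard library. The tag lists are hypothetical stand-ins for the tagger's output, and the column order (first appearance) is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical POS tags for two sentences, standing in for
# [token.pos_ for token in pos_tagger(sentence)].
tagged_sentences = [
    ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
    ["PRON", "VERB", "ADV", "PUNCT"],
]

# One Counter per sentence: tag -> number of occurrences.
counts = [Counter(tags) for tags in tagged_sentences]

# Columns are the tags seen in any sentence, in order of first appearance.
columns = list(dict.fromkeys(tag for tags in tagged_sentences for tag in tags))

# A tag missing from a sentence counts as 0 (Counter returns 0 for absent keys).
vectors = [[c[tag] for tag in columns] for c in counts]

for row in vectors:
    print(dict(zip((f"POS_{tag}" for tag in columns), row)))
```

Each sentence ends up with one entry per observed category, so sentences of different lengths map to vectors of the same size.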

In [1]:
import pandas as pd
import spacy
import yaml
import glob
import os
import sys

from tqdm import tqdm
from collections import Counter

sys.path.append("../../../utils")
from absolute_path_builder import AbsolutePathBuilder

In [2]:
DATASET = "twitter"

input_path = AbsolutePathBuilder.get_path(
    f"04_{DATASET}_scored",
    filepaths="../../../config/filepaths.yaml"
)

output_path = AbsolutePathBuilder.get_path(
    f"05_{DATASET}_features",
    filepaths="../../../config/filepaths.yaml"
)

In [3]:
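def calculate_pos_tag(input_path, output_path):
    """Append per-row POS tag counts (POS_* columns) to every CSV in input_path."""
    pos_tagger = spacy.load("en_core_web_sm")

    # Keep only the file names so they can be reused under output_path.
    filenames = [os.path.basename(file) for file in glob.glob(os.path.join(input_path, "*"))]
    for file in tqdm(filenames):
        df = pd.read_csv(os.path.join(input_path, file))

        # One dict of {POS tag: count} per row of the `text` column.
        pos_counts = df.text.apply(
            lambda s: dict(Counter(token.pos_ for token in pos_tagger(s)))
        ).tolist()

        # Tags absent from a sentence become NaN in the frame; store them as 0 counts.
        df_pos = pd.DataFrame(pos_counts).fillna(0).astype(int)
        df_pos.columns = [f"POS_{col}" for col in df_pos.columns]

        # Both frames share the default integer index, so rows align one to one.
        df = pd.concat([df, df_pos], axis=1)
        df.to_csv(os.path.join(output_path, file), index=False)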

In [4]:
calculate_pos_tag(input_path, output_path)

100%|██████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 171.63it/s]
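
In `calculate_pos_tag`, the feature columns of each output file are the tags observed in that file, so two files may end up with different columns. One possible alternative, sketched below with only the standard library, is to count against a fixed tag list. `POS_TAGS` and `pos_count_vector` are hypothetical names, and the list assumes the tagger emits the Universal POS tags plus spaCy's `SPACE` tag for whitespace tokens.

```python
from collections import Counter

# Universal POS tags plus spaCy's SPACE tag for whitespace tokens.
# Assumption: this covers every value of token.pos_ for the model in use.
POS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "SPACE",
]


def pos_count_vector(tags, vocabulary=POS_TAGS):
    """Return the counts of `tags` in the fixed order given by `vocabulary`."""
    counts = Counter(tags)
    unknown = set(counts) - set(vocabulary)
    if unknown:
        # Fail loudly instead of silently dropping an unexpected tag.
        raise ValueError(f"tags outside the vocabulary: {sorted(unknown)}")
    return [counts[tag] for tag in vocabulary]


# Hypothetical tags for one sentence.
vector = pos_count_vector(["PRON", "VERB", "ADV", "PUNCT"])
print(dict(zip((f"POS_{tag}" for tag in POS_TAGS), vector)))
```

With a fixed list, every file gets the same columns in the same order, at the cost of carrying all-zero columns for tags that never occur.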
