# Examples of NLP Features

This Jupyter notebook contains example code for generating linguistic features with the functions in `nlp_functions.py`.

## `data_to_df()`

This function provides an easy way to access and inspect spaCy's part-of-speech tagging.
Given a spaCy pipeline and a file path, it returns a pandas DataFrame with one row per token, showing the part-of-speech tags and related token attributes.

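The examples below pass `en_core_web_lg` as the pipeline; if that model is not installed yet, it can be downloaded with `python -m spacy download en_core_web_lg`.
Run the following code block to install spaCy and pandas: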

In [3]:
%pip install spacy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.

In [1]:
from nlp_functions import data_to_df

In [2]:
data_to_df(pipeline="en_core_web_lg", file_path="sample.txt")

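|  | TEXT | LEMMA | POS | TAG | DEP | SHAPE | ALPHA | STOP |
|---|---|---|---|---|---|---|---|---|
| 0 | You | you | PRON | PRP | nsubj | Xxx | True | True |
| 1 | know | know | VERB | VBP | parataxis | xxxx | True | False |
| 2 | , | , | PUNCT | , | punct | , | False | False |
| 3 | when | when | SCONJ | WRB | advmod | xxxx | True | True |
| 4 | I | I | PRON | PRP | nsubj | X | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | everyone | everyone | PRON | NN | nsubj | xxxx | True | True |
| 62 | knew | know | VERB | VBD | relcl | xxxx | True | False |
| 63 | your | your | PRON | PRP$ | poss | xxxx | True | True |
| 64 | name | name | NOUN | NN | dobj | xxxx | True | True |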
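
The column names in the output above line up with standard spaCy token attributes. The sketch below is not the code in the module; it is a minimal illustration that assumes the columns correspond to `text`, `lemma_`, `pos_`, `tag_`, `dep_`, `shape_`, `is_alpha`, and `is_stop`. The helper name `tokens_to_rows` and the `COLUMNS` mapping are hypothetical, and a stand-in token object is used so the example runs without spaCy installed.

```python
from types import SimpleNamespace

# Assumed mapping from output column name to spaCy Token attribute.
COLUMNS = {
    "TEXT": "text",
    "LEMMA": "lemma_",
    "POS": "pos_",
    "TAG": "tag_",
    "DEP": "dep_",
    "SHAPE": "shape_",
    "ALPHA": "is_alpha",
    "STOP": "is_stop",
}


def tokens_to_rows(tokens):
    """Return one dict per token, keyed by the column names in COLUMNS."""
    return [
        {column: getattr(token, attr) for column, attr in COLUMNS.items()}
        for token in tokens
    ]


# Stand-in for a spaCy token, copied from the first row of the output above.
example_token = SimpleNamespace(
    text="You", lemma_="you", pos_="PRON", tag_="PRP",
    dep_="nsubj", shape_="Xxx", is_alpha=True, is_stop=True,
)

rows = tokens_to_rows([example_token])
print(rows[0])
```

With spaCy and pandas installed, `pandas.DataFrame(tokens_to_rows(nlp(text)))` would give a frame with the same columns, where `nlp` is a loaded pipeline and `text` is the contents of the input file.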


## `tag_ratio()`

This function counts the values of a chosen token attribute, such as the part of speech.
Any column of the data frame above can be passed as the `tag` argument.
The function returns a dictionary that maps each value found in that column to its average number of occurrences per 100 tokens of the text (the examples below pass `amount=100`).
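
One way to read the per-100 figure is as each tag's count divided by the total number of tokens, multiplied by `amount`. This is an assumption about the calculation rather than the module's actual code, although it is consistent with the part-of-speech output shown further down, whose values sum to 100. The function name `ratio_per_amount` is hypothetical.

```python
from collections import Counter


def ratio_per_amount(tags, amount=100):
    """Scale each tag's share of all tokens to a rate per `amount` tokens."""
    tags = list(tags)
    if not tags:
        return {}
    counts = Counter(tags)
    total = len(tags)
    return {tag: count / total * amount for tag, count in counts.items()}


ratios = ratio_per_amount(["PRON", "VERB", "PUNCT", "PRON"])
print(ratios)  # {'PRON': 50.0, 'VERB': 25.0, 'PUNCT': 25.0}
```

Under this reading, punctuation tokens count toward the total, so the rates for all tags always add up to `amount`.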

In [5]:
from nlp_functions import tag_ratio

### Tagging parts-of-speech:

In [6]:
tag_ratio(pipeline='en_core_web_lg', file_path="sample.txt", tag="POS", amount=100)

defaultdict(int,
            {'PRON': 15.151515151515152,
             'VERB': 13.636363636363635,
             'PUNCT': 12.121212121212121,
             'SCONJ': 4.545454545454546,
             'ADV': 7.575757575757576,
             'ADP': 9.090909090909092,
             'NOUN': 15.151515151515152,
             'AUX': 3.0303030303030303,
             'ADJ': 6.0606060606060606,
             'DET': 7.575757575757576,
             'PROPN': 1.5151515151515151,
             'PART': 1.5151515151515151,
             'CCONJ': 1.5151515151515151,
             'NUM': 1.5151515151515151})
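
Because the result behaves like a dictionary, ordinary Python tools apply to it. As a small illustration, the sketch below copies the values printed above into a plain `dict` (the variable name `pos_ratios` is only for illustration), ranks the tags by frequency, and checks that the rates add up to the `amount` of 100.

```python
# Values copied from the part-of-speech output above.
pos_ratios = {
    "PRON": 15.151515151515152,
    "VERB": 13.636363636363635,
    "PUNCT": 12.121212121212121,
    "SCONJ": 4.545454545454546,
    "ADV": 7.575757575757576,
    "ADP": 9.090909090909092,
    "NOUN": 15.151515151515152,
    "AUX": 3.0303030303030303,
    "ADJ": 6.0606060606060606,
    "DET": 7.575757575757576,
    "PROPN": 1.5151515151515151,
    "PART": 1.5151515151515151,
    "CCONJ": 1.5151515151515151,
    "NUM": 1.5151515151515151,
}

# Rank tags from most to least frequent; ties keep their original order.
ranked = sorted(pos_ratios.items(), key=lambda item: item[1], reverse=True)
top_three = [tag for tag, _ in ranked[:3]]
print(top_three)  # ['PRON', 'NOUN', 'VERB']

# The rates should add up to `amount` (100 here), up to rounding error.
total = sum(pos_ratios.values())
print(abs(total - 100) < 1e-9)  # True
```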

### Tagging parts-of-speech using the Penn Treebank tagset

In [10]:
tag_ratio(pipeline='en_core_web_lg', file_path="test.txt", tag="TAG", amount=100)

defaultdict(int,
            {'PRP': 8.547008547008547,
             'VBP': 1.7094017094017095,
             ',': 6.1253561253561255,
             'WRB': 1.566951566951567,
             'RB': 5.413105413105413,
             'IN': 8.547008547008547,
             'PRP$': 1.9943019943019942,
             'NN': 13.96011396011396,
             'VBZ': 1.7094017094017095,
             'JJ': 5.555555555555555,
             'DT': 8.262108262108262,
             'NNS': 4.700854700854701,
             'VB': 2.564102564102564,
             'WP': 0.5698005698005698,
             '.': 4.843304843304843,
             'VBD': 7.4074074074074066,
             'RP': 0.5698005698005698,
             'VBN': 0.7122507122507122,
             'NNP': 1.4245014245014245,
             ':': 1.282051282051282,
             'VBG': 2.849002849002849,
             'CC': 4.273504273504273,
             'CD': 0.14245014245014245,
             'HYPH': 0.2849002849002849,
             '_SP': 0.8547008547008548,
         

### Tagging dependencies:

In [11]:
tag_ratio(pipeline='en_core_web_lg', file_path="test.txt", tag="DEP", amount=100)

defaultdict(int,
            {'nsubj': 10.683760683760683,
             'parataxis': 0.14245014245014245,
             'punct': 12.678062678062679,
             'advmod': 6.267806267806268,
             'advcl': 1.282051282051282,
             'prep': 7.977207977207977,
             'poss': 2.1367521367521367,
             'pobj': 7.4074074074074066,
             'ROOT': 4.843304843304843,
             'acomp': 1.1396011396011396,
             'det': 7.4074074074074066,
             'amod': 4.5584045584045585,
             'ccomp': 2.2792022792022792,
             'attr': 2.2792022792022792,
             'prt': 0.7122507122507122,
             'acl': 0.5698005698005698,
             'oprd': 0.14245014245014245,
             'conj': 5.413105413105413,
             'neg': 1.1396011396011396,
             'appos': 0.7122507122507122,
             'cc': 4.415954415954416,
             'nummod': 0.14245014245014245,
             'relcl': 1.9943019943019942,
             'dobj': 5.2706552706

## `num_tense_inflected_verbs()`