# Examples of NLP Features

This Jupyter notebook contains example code for generating linguistic features with the functions in nlp_research.py.

## `data_to_df()`

This function provides an easy way to access and visualize part-of-speech tagging with spaCy.
Given a spaCy pipeline and a file path, it returns a pandas DataFrame with one row per token.
Each row shows the token's part-of-speech tags and related attributes.
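As a rough illustration of where the columns come from, the sketch below maps each column name in the output table to the spaCy `Token` attribute that usually carries that information. The helper `tokens_to_rows` and the stand-in tokens are hypothetical and are not taken from the library. The mapping is an assumption based on the column names, not a copy of the `data_to_df()` source.

```python
from types import SimpleNamespace

# Assumed mapping: output column name -> spaCy Token attribute.
COLUMNS = {
    "TEXT": "text",
    "LEMMA": "lemma_",
    "POS": "pos_",
    "TAG": "tag_",
    "DEP": "dep_",
    "SHAPE": "shape_",
    "ALPHA": "is_alpha",
    "STOP": "is_stop",
}


def tokens_to_rows(tokens):
    """Return one dict per token, keyed by the column names above."""
    return [
        {column: getattr(token, attr) for column, attr in COLUMNS.items()}
        for token in tokens
    ]


# Stand-in tokens so the sketch runs without spaCy installed.
# With spaCy, `tokens` would be the Doc returned by spacy.load(...)(text).
demo_tokens = [
    SimpleNamespace(text="You", lemma_="you", pos_="PRON", tag_="PRP",
                    dep_="nsubj", shape_="Xxx", is_alpha=True, is_stop=True),
    SimpleNamespace(text=",", lemma_=",", pos_="PUNCT", tag_=",",
                    dep_="punct", shape_=",", is_alpha=False, is_stop=False),
]

rows = tokens_to_rows(demo_tokens)
print(rows[0]["POS"], rows[1]["ALPHA"])
```

Passing `rows` to `pandas.DataFrame(rows)` would give a table with the same column layout as the output shown in this section.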

Run the following code block to install spaCy and pandas:

In [1]:
%pip install spacy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.

In [2]:
from nlp_functions import data_to_df

In [3]:
data_to_df(pipeline="en_core_web_lg", file_path="sample.txt")

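| | TEXT | LEMMA | POS | TAG | DEP | SHAPE | ALPHA | STOP |
|---|---|---|---|---|---|---|---|---|
| 0 | You | you | PRON | PRP | nsubj | Xxx | True | True |
| 1 | know | know | VERB | VBP | parataxis | xxxx | True | False |
| 2 | , | , | PUNCT | , | punct | , | False | False |
| 3 | when | when | SCONJ | WRB | advmod | xxxx | True | True |
| 4 | I | I | PRON | PRP | nsubj | X | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | everyone | everyone | PRON | NN | nsubj | xxxx | True | True |
| 62 | knew | know | VERB | VBD | relcl | xxxx | True | False |
| 63 | your | your | PRON | PRP$ | poss | xxxx | True | True |
| 64 | name | name | NOUN | NN | dobj | xxxx | True | True |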


## `tag_ratio()`

This function computes how often each value of a token attribute, such as a part-of-speech tag, occurs in a text.
Any column of the data frame above can be passed as the `tag` argument.
The function returns a dictionary that maps each value found in that column to its average number of occurrences per `amount` words of text (100 in the examples below).
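The per-100-words figure can be sketched in a few lines of plain Python. The helper `tag_rates` below is hypothetical and operates on an already extracted list of tag values rather than on a file. It assumes that every token, punctuation included, counts toward the total, which appears consistent with `PUNCT` showing up in the outputs below but is not confirmed by the library source.

```python
from collections import Counter


def tag_rates(tags, amount=100):
    """Return occurrences of each tag per `amount` tokens."""
    total = len(tags)
    if total == 0:
        return {}
    counts = Counter(tags)
    # Scale each relative frequency to the requested window size.
    return {tag: count / total * amount for tag, count in counts.items()}


# Hypothetical tag sequence standing in for one column of the data frame.
demo_tags = ["PRON", "VERB", "PUNCT", "PRON"]
rates = tag_rates(demo_tags)
print(rates)
```

With this definition the rates for all tags sum to `amount`, which gives a quick sanity check on any output.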

In [4]:
from nlp_functions import tag_ratio

### Tagging parts-of-speech:

In [5]:
tag_ratio(pipeline='en_core_web_lg', file_path="sample.txt", tag="POS", amount=100)

defaultdict(int,
            {'PRON': 15.151515151515152,
             'VERB': 13.636363636363635,
             'PUNCT': 12.121212121212121,
             'SCONJ': 4.545454545454546,
             'ADV': 7.575757575757576,
             'ADP': 9.090909090909092,
             'NOUN': 15.151515151515152,
             'AUX': 3.0303030303030303,
             'ADJ': 6.0606060606060606,
             'DET': 7.575757575757576,
             'PROPN': 1.5151515151515151,
             'PART': 1.5151515151515151,
             'CCONJ': 1.5151515151515151,
             'NUM': 1.5151515151515151})

### Tagging parts-of-speech using the Penn Treebank tagset

In [6]:
tag_ratio(pipeline='en_core_web_lg', file_path="test.txt", tag="TAG", amount=100)

defaultdict(int,
            {'PRP': 8.547008547008547,
             'VBP': 1.7094017094017095,
             ',': 6.1253561253561255,
             'WRB': 1.566951566951567,
             'RB': 5.413105413105413,
             'IN': 8.547008547008547,
             'PRP$': 1.9943019943019942,
             'NN': 13.96011396011396,
             'VBZ': 1.7094017094017095,
             'JJ': 5.555555555555555,
             'DT': 8.262108262108262,
             'NNS': 4.700854700854701,
             'VB': 2.564102564102564,
             'WP': 0.5698005698005698,
             '.': 4.843304843304843,
             'VBD': 7.4074074074074066,
             'RP': 0.5698005698005698,
             'VBN': 0.7122507122507122,
             'NNP': 1.4245014245014245,
             ':': 1.282051282051282,
             'VBG': 2.849002849002849,
             'CC': 4.273504273504273,
             'CD': 0.14245014245014245,
             'HYPH': 0.2849002849002849,
             '_SP': 0.8547008547008548,
         

### Tagging dependencies:

In [7]:
tag_ratio(pipeline='en_core_web_lg', file_path="test.txt", tag="DEP", amount=100)

defaultdict(int,
            {'nsubj': 10.683760683760683,
             'parataxis': 0.14245014245014245,
             'punct': 12.678062678062679,
             'advmod': 6.267806267806268,
             'advcl': 1.282051282051282,
             'prep': 7.977207977207977,
             'poss': 2.1367521367521367,
             'pobj': 7.4074074074074066,
             'ROOT': 4.843304843304843,
             'acomp': 1.1396011396011396,
             'det': 7.4074074074074066,
             'amod': 4.5584045584045585,
             'ccomp': 2.2792022792022792,
             'attr': 2.2792022792022792,
             'prt': 0.7122507122507122,
             'acl': 0.5698005698005698,
             'oprd': 0.14245014245014245,
             'conj': 5.413105413105413,
             'neg': 1.1396011396011396,
             'appos': 0.7122507122507122,
             'cc': 4.415954415954416,
             'nummod': 0.14245014245014245,
             'relcl': 1.9943019943019942,
             'dobj': 5.2706552706

## `num_tense_inflected_verbs()`

This function takes a spaCy pipeline, a file path, and a per-word amount. It loads the requested pipeline and uses it to turn the contents of the file into a spaCy `Doc` object.
The function then loops through the tokens in the text, filtering with `token.is_alpha` to ignore punctuation and digits. It calculates the ratio of tense-inflected verbs to total words in the text and returns the average number of tense-inflected verbs per specified number of words (by default: 100).
Tense-inflected verbs are verbs in the past or present tense, together with modal auxiliaries.
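The calculation described above can be sketched as follows. The helper `tensed_verb_rate` is hypothetical and works on `(text, tag)` pairs instead of a spaCy `Doc`. The choice of Penn Treebank tags (`VBD`, `VBP`, `VBZ` for past and present tense, `MD` for modals) is an assumption about how the definition maps to tags, and `str.isalpha` stands in for `token.is_alpha`.

```python
# Assumed tag set: past tense, present tense (two forms), and modals.
TENSED_TAGS = {"VBD", "VBP", "VBZ", "MD"}


def tensed_verb_rate(tagged_tokens, amount=100):
    """Return tense-inflected verbs per `amount` alphabetic tokens."""
    # Keep only alphabetic tokens, dropping punctuation and digits.
    words = [(text, tag) for text, tag in tagged_tokens if text.isalpha()]
    if not words:
        return 0.0
    tensed = sum(1 for _, tag in words if tag in TENSED_TAGS)
    return tensed / len(words) * amount


# Hypothetical tagged tokens standing in for a parsed document.
demo_tokens = [
    ("You", "PRP"),
    ("know", "VBP"),
    (",", ","),
    ("everyone", "NN"),
    ("knew", "VBD"),
]
rate = tensed_verb_rate(demo_tokens)
print(rate)
```

Here two of the four alphabetic tokens carry a tensed tag, so the rate is 50 per 100 words. The comma is excluded from the denominator.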

In [9]:
from nlp_functions import num_tense_inflected_verbs

In [18]:
print("The following represents the average number of tense-inflected verbs present per 100 words of the given text:")
num_tense_inflected_verbs(pipeline='en_core_web_lg', file_path="sample.txt", amount=100)

The following represents the average number of tense-inflected verbs present per 100 words of the given text:


14.285714285714285

## `calculate_idea_density()`

This function takes a pipeline and a file path and turns the contents of the file into a spaCy `Doc` object, as before. The function then calculates and returns the average idea density per sentence in the document. Idea density is defined as the number of propositions (verbs, adjectives, adverbs, prepositions, and conjunctions) divided by the number of words in a sentence.
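A literal reading of this definition can be sketched as below. The helper `mean_idea_density` is hypothetical and takes each sentence as a list of coarse part-of-speech tags. The mapping of "propositions" to the Universal POS tags `VERB`, `ADJ`, `ADV`, `ADP`, `CCONJ`, and `SCONJ` is an assumption. Under this reading the density cannot exceed 1, whereas the value printed in this notebook does, so the library presumably counts propositions or words somewhat differently.

```python
# Assumed Universal POS tags that count as propositions.
PROPOSITION_POS = {"VERB", "ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}


def mean_idea_density(sentences):
    """Average, over sentences, of propositions divided by token count."""
    densities = []
    for pos_tags in sentences:
        if not pos_tags:
            continue  # skip empty sentences to avoid dividing by zero
        propositions = sum(1 for pos in pos_tags if pos in PROPOSITION_POS)
        densities.append(propositions / len(pos_tags))
    return sum(densities) / len(densities) if densities else 0.0


# Hypothetical tag sequences for two sentences.
demo_sentences = [
    ["PRON", "VERB"],                # 1 proposition / 2 tokens = 0.5
    ["ADJ", "NOUN", "VERB", "ADV"],  # 3 propositions / 4 tokens = 0.75
]
density = mean_idea_density(demo_sentences)
print(density)
```

Averaging per sentence, rather than pooling all tokens, gives short sentences the same weight as long ones.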

In [13]:
from nlp_functions import calculate_idea_density

In [17]:
print("The following represents the average idea density per sentence in the given text:")
calculate_idea_density(pipeline='en_core_web_lg', file_path="sample.txt")

The following represents the average idea density per sentence in the given text:


1.05995670995671

## `abstractness()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object.
The function then calculates the average abstractness value of all words in the text, using the inverse of the concreteness value taken from the dataset. This value is returned.
The dataset values are on a five-point scale, going from abstract to concrete. For the purpose of this feature, the scale is inverted. For more details on the dataset, please see [this article](https://link.springer.com/article/10.3758/s13428-013-0403-5#Sec10).
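One common way to invert a bounded rating scale is to subtract each rating from the sum of the scale's endpoints. The sketch below applies that idea under stated assumptions: the helper `mean_abstractness`, the 1 to 5 bounds, the lowercase lookup, the choice to skip words missing from the lexicon, and the example ratings are all hypothetical. The magnitude of the value printed in this notebook suggests the library applies a different or additional normalization that is not shown here.

```python
def mean_abstractness(words, concreteness, scale_min=1.0, scale_max=5.0):
    """Average inverted concreteness over the words found in the lexicon."""
    scores = [
        # Flip the scale: the most concrete rating becomes the least abstract.
        scale_min + scale_max - concreteness[word.lower()]
        for word in words
        if word.lower() in concreteness  # assumption: skip unknown words
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical concreteness ratings, not taken from the real dataset.
demo_ratings = {"dog": 5.0, "idea": 1.5}
score = mean_abstractness(["Dog", "idea", "the"], demo_ratings)
print(score)
```

In this toy lexicon "dog" inverts to 1.0 and "idea" to 4.5, so the mean is 2.75. The word "the" has no rating and does not affect the result.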

In [20]:
from nlp_functions import abstractness

In [22]:
print("The following represents the average abstractness value of all words in the text:")
abstractness(pipeline='en_core_web_lg', file_path="sample.txt", dataset_path="datasets/dataset_for_abstractness.xlsx")

The following represents the average abstractness value of all words in the text:


0.04472862928497764

## `semantic_ambiguity()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object. The function then calculates the average semantic ambiguity value for all words in the text using a semantic diversity value from a dataset.
The dataset value is based on a measure that treats words appearing in a wide range of contexts on diverse topics as more semantically diverse than words appearing in a restricted set of similar contexts. More details on the calculation method can be found in [this article](https://link.springer.com/article/10.3758/s13428-012-0278-x#SecESM1).

In [24]:
from nlp_functions import semantic_ambiguity

In [26]:
print("The following represents the average semantic ambiguity value for all words in the provided text:")
semantic_ambiguity(pipeline='en_core_web_lg', file_path="sample.txt", dataset_path="datasets/dataset_for_semantic_ambiguity.xlsx")

The following represents the average semantic ambiguity value for all words in the provided text:


0.3134804578708375

## `word_frequency()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object. It looks up the word frequency value of each word in a dataset and returns the average value across all words. More details on the dataset values can be found in [this article](https://link.springer.com/article/10.3758/BRM.41.4.977#SecESM1).

In [27]:
from nlp_functions import word_frequency

In [29]:
print("The following represents the average word frequency value across all words in the given text:")
word_frequency(pipeline='en_core_web_lg', file_path="sample.txt", dataset_path="datasets/dataset_for_word_frequency.xlsx")

The following represents the average word frequency value across all words in the given text:


0.641867548337011

## `word_prevalence()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object. The function calculates the average word prevalence value across all words in the text and returns the result. The word prevalence values are taken from a dataset. More information on them can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9#Sec9).

In [30]:
from nlp_functions import word_prevalence

In [32]:
print("The following represents the average word prevalence value across all words in the text:")
word_prevalence(pipeline='en_core_web_lg', file_path="sample.txt", dataset_path="datasets/dataset_for_word_prevalence_and_familiarity.xlsx")

The following represents the average word prevalence value across all words in the text:


0.4249447629308993

## `word_familiarity()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object. The function calculates the average word familiarity across all words in the text and returns the result. The word familiarity values are taken from a dataset, in which they are calculated from a z-standardized measure of how many people know the word. More information on these values can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9#Sec9).

In [33]:
from nlp_functions import word_familiarity

In [36]:
print("The following represents the average word familiarity value across all words in the given text:")
word_familiarity(pipeline='en_core_web_lg', file_path="sample.txt", dataset_path="datasets/dataset_for_word_prevalence_and_familiarity.xlsx")

The following represents the average word familiarity value across all words in the given text:


0.177838994634659

## `age_of_acquisition()`

This function takes a pipeline, a file path, and a dataset path as inputs. It loads the preferred pipeline and turns the contents of the file into a spaCy `Doc` object. The function calculates and returns the average age of acquisition value across all words in the text. This is a measure of semantic complexity, with a higher age of acquisition representing more complex language. The value is taken from a dataset. More information on this value can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9).

In [37]:
from nlp_functions import age_of_acquisition

In [39]:
print("The following represents the average age of acquisition value across all words in the text:")
age_of_acquisition(pipeline='en_core_web_sm', file_path="sample.txt", dataset_path="datasets/dataset_for_age_of_acquisition.xlsx")

The following represents the average age of acquisition value across all words in the text:


0.9691179896458826