# Examples of NLP Features

This jupyter noteook contains example code to generate linguistic features with the functions in nlp_research.py

## Installation

For specific installation instructions please see [this section](https://github.com/Digital-Working-Group/natural-language-processing/blob/nlp_research/spacy/README.md#installation) of the README

## `data_to_df()`

This function creates an easy way to access and visualize parts-of-speech tagging with spacy.
The tags and values are stored in a pandas dataframe.
Given a spacy pipeline and filepath, the function will output a dataframe showing parts-of-speech tags and their values.

In [2]:
from pos_tagging import data_to_df

In [1]:
import spacy
data_to_df(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

NameError: name 'data_to_df' is not defined

## `tag_ratio()`

This function allows the user to tag parts of speech of the tokens in their text.
The function then creates and outputs a dictionary containing all instances of the categories and on average how many are present per 100 words in the text.

In [1]:
import spacy
from pos_tagging import tag_ratio

### Tagging parts-of-speech:

In [12]:
tag_ratio(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt", amount=100)

{'POS': defaultdict(int,
             {'PRON': 15.151515151515152,
              'VERB': 13.636363636363635,
              'PUNCT': 12.121212121212121,
              'SCONJ': 4.545454545454546,
              'ADV': 7.575757575757576,
              'ADP': 9.090909090909092,
              'NOUN': 15.151515151515152,
              'AUX': 3.0303030303030303,
              'ADJ': 6.0606060606060606,
              'DET': 7.575757575757576,
              'PROPN': 1.5151515151515151,
              'PART': 1.5151515151515151,
              'CCONJ': 1.5151515151515151,
              'NUM': 1.5151515151515151}),
 'TAG': defaultdict(int,
             {'PRP': 9.090909090909092,
              'VBP': 4.545454545454546,
              ',': 7.575757575757576,
              'WRB': 4.545454545454546,
              'RB': 9.090909090909092,
              'IN': 7.575757575757576,
              'PRP$': 3.0303030303030303,
              'NN': 12.121212121212121,
              'VBZ': 1.5151515151515151,
       

## `stats_proportion_coordinators()`

This function takes in a natural language processor and file path. It calls stats_proportion_part_of_speech with specified kwargs to determine mean, min, max, and standard deviation of the proportion of coordinators in a sentence.

In [2]:
from pos_tagging import stats_proportion_coordinators

In [7]:
print("The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of coordinators in a sentence:")
stats_proportion_coordinators(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following is a dictionary containing te mean, min, max, and standard deviation of the proprtion of coordinators in a sentence:


{'mean': 0.015151515151515152,
 'max': 0.045454545454545456,
 'min': 0.0,
 'std': 0.0262431940540739}

## `stats_proportion_auxiliaries()`

This function takes in a natural language processor and file path. It calls stats_proportion_part_of_speech with specified kwargs to determine mean, min, max, and standard deviation of the proportion of auxiliaries in a sentence.

In [11]:
from pos_tagging import stats_proportion_auxiliaries

In [19]:
print("The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of auxiliaries to words in a sentence:")
stats_proportion_auxiliaries(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of auxiliaries to words in a sentence:


{'mean': 0.015151515151515152,
 'max': 0.045454545454545456,
 'min': 0.0,
 'std': 0.0262431940540739}

## `stats_proportion_adjectives()`

This function takes in a natural language processor and file path. It calls stats_proportion_part_of_speech with specified kwargs to determine mean, min, max, and standard deviation of the proportion of adjectives in a sentence.

In [20]:
from pos_tagging import stats_proportion_adjectives

In [21]:
print("The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of adjectives to words in a sentence:")
stats_proportion_adjectives(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of adjectives to words in a sentence:


{'mean': 0.0722943722943723,
 'max': 0.1,
 'min': 0.045454545454545456,
 'std': 0.027283032478941528}

## `stats_proportion_subjects()`

This function takes in a natural language processor and file path. It calls stats_proportion_part_of_speech with specified kwargs to determine mean, min, max, and standard deviation of the proportion of subjects in a sentence.

In [22]:
from pos_tagging import stats_proportion_subjects

In [23]:
print("The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of subjects to words in a sentence:")
stats_proportion_adjectives(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following is a dictionary containing the mean, min, max, and standard deviation of the proportion of subjects to words in a sentence:


{'mean': 0.0722943722943723,
 'max': 0.1,
 'min': 0.045454545454545456,
 'std': 0.027283032478941528}

## `num_tense_inflected_verbs()`

This function takes in a natural language processor, file_path, and per word amount. It loads the desired pipeline as a natural language processor and uses this to create a spacy doc object version of the file provided. 
The function then loops through the tokens in the text, filtered using `token.is_alpha` to ignore punctuation and digits. It calculates the ratio of tense inflected verbs to total words in the text, and outputs on average how many tense inflected verbs are present per specified word amount (by default: 100). 
Tense inflected verbs are considered to be verbs in past or present tense, and modal auxiliaries.

In [13]:
from syntactic_complexity import num_tense_inflected_verbs

In [15]:
print("The following represents the average number of tenseinflected verbs present per 100 words of the given text:")
num_tense_inflected_verbs(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt", amount=100)

The following represents the average number of tenseinflected verbs present per 100 words of the given text:


14.285714285714285

## `calculate_idea_density()`

This function takes in an nlp and filepath and transforms into a spacy doc object, as before. The function then calculates and outputs the average idea density per sentence in the document. Idea density is defined as te number of propositions(verbs, adjectives, adverbs, prepositions, and conjunctions) divided by the number of words in a sentence.

In [17]:
from semantic_complexity import calculate_idea_density

In [19]:
print("The following represents the average idea density per sentence in the given text:")
calculate_idea_density(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the average idea density per sentence in the given text:


[("You know, when I think back on my life, it's funny how the little things really shape who you become.",
  0.55),
 ('I grew up in this small town called Ridgewood, tucked away in the countryside.',
  0.5714285714285714),
 ("It wasn't much just rolling hills, a couple of farms, and one main street with a diner where everyone knew your name.",
  0.4090909090909091)]

## `abstractness()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the abstractness value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average abstractness value across all nouns in the text.
The dataset values are on a five point scale, going from abstract to concrete. For the purpose of this feature, the scale is inverted. More details on the dataset please see [this article](https://link.springer.com/article/10.3758/s13428-013-0403-5#Sec10)

In [20]:
from semantic_complexity import abstractness

In [26]:
print("The following represents the average abstractness value of all words in the text:")
abstractness(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")


The following represents the average abstractness value of all words in the text:


0.2504803239958748

## `semantic_ambiguity()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the semantic ambiguity value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average semantic ambiguity value across all nouns in the text.
The dataset value is based on a measure that considers words that appear in a wide range of contexts on diverse topics more sematically diverse than those that appear in a restricted set of similar contexts. More details on the methods of calculation are present in [this article](https://link.springer.com/article/10.3758/s13428-012-0278-x#SecESM1).

In [24]:
from semantic_complexity import semantic_ambiguity

In [25]:
print("The following represents the average semantic ambiguity value for all words in the provided text:")
semantic_ambiguity(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/test.txt")

The following represents the average semantic ambiguity value for all words in the provided text:


1.822231297208903

## `word_frequency()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the word frequency value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average word frequency value across all nouns in the text. More details on the dataset values are present in [this article](https://link.springer.com/article/10.3758/BRM.41.4.977#SecESM1).

In [27]:
from semantic_complexity import word_frequency

In [29]:
print("the following represents the average word frequency value across all words in the given text:")
word_frequency(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

the following represents the average word frequency value across all words in the given text:


3.5944582706872614

## `word_prevalence()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the word prevalence value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average word prevalence value across all nouns in the text.  More information on dataset values can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9#Sec9).

In [30]:
from semantic_complexity import word_prevalence

In [32]:
print("The following represents the average word prevalence value across all words in the text:")
word_prevalence(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the average word prevalence value across all words in the text:


2.379690672413036

## `word_familiarity()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the word familiarity value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average word familiarity value across all nouns in the text. More information on these values can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9#Sec9).

In [33]:
from semantic_complexity import word_familiarity

In [34]:
print("The following represents the average word familiarity value across all words in the given text:")
word_familiarity(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the average word familiarity value across all words in the given text:


0.9958983699540903

## `age_of_acquisition()`

This function calls `generate_noun_feature()` with specified kwargs. This function finds the age of acquisition value corresponding to each noun in the text utilizing a pre-existing dataset, and averages these values. The function outputs the average age of acquisition value across all nouns in the text. The value is taken from a dataset, more information on this value can be found in [this article](https://link.springer.com/article/10.3758/s13428-018-1077-9).

In [35]:
from semantic_complexity import age_of_acquisition

In [37]:
print("The following represents the average age of aquisition value across all words in the text:")
age_of_acquisition(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the average age of aquisition value across all words in the text:


5.427060742016943

## `nonword_frequency()`

This function takes in a natural language processor, filepath, dataset path, and per word amount. It uses the dataset to find all occurrences of nonwords in the text. The function outputs, on average, how many nonwords are present per 100 words. Non words are defined as groupings of letters that do not form a valid English word such as "jjksj". Note: this feature can confuse uncommon proper nouns with non words. 

In [38]:
from syntactic_errors import nonword_frequency

In [39]:
print("The following represents the average number of nonwords per 100 words in the text:")
nonword_frequency(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/contains_nonwords.txt", dataset_fp="words_alpha.txt", amount=100)

The following represents the average number of nonwords per 100 words in the text:


1.0186757215619695

## `avg_num_nonwords()`

This function is an alternative way of counting non-words. It takes in a natural language processor, filepath, and word amount. It counts number of nonwords by checking if words are in spaCy's vocabulary, and returns the number of nonwords present per word amount.

In [4]:
from syntactic_errors import avg_num_nonwords

In [6]:
print("The following represents the average number of nonwords per 100 words in the text:")
avg_num_nonwords(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/contains_nonwords.txt", amount=100)

The following represents the average number of nonwords per 100 words in the text:


0.5093378607809848

## `sentence_lengths()`

This function takes in a natural language processor and a filepath. It calculates the length of each sentence and returns a list of sentence lengths.

In [43]:
from syntactic_complexity import sentence_lengths

In [45]:
print("The following represents a list of sentence lengths in the document:")
sentence_lengths(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")


The following represents a list of sentence lengths in the document:


[20, 14, 22]

## `most_frequent_word()`

This function takes in a natural language processor and a filepath. It calculates and returns the most commonly occurring word and how many times it appears in the text. 

In [47]:
from lexical_repetition import most_frequent_word

In [48]:
print("The following represents the most commonly occurring word and how many times in appears in the text:")
most_frequent_word(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/test.txt")

The following represents the most commonly occurring word and how many times in appears in the text:


('life', 5)

## `windowed_text_token_ratio()`

This function takes in a natural language processor, filepath, and window size. It averages type token ratio across moving windows and returns this value. 

In [50]:
from lexical_variation import windowed_text_token_ratio

In [52]:
print("The following represents the moving average text token ratio across 20 word windows in the document:")
windowed_text_token_ratio(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/test.txt", window_size=20)

The following represents the moving average text token ratio across 20 word windows in the document:


0.9238095238095237

## `repeating_unique_word_ratio()`

This function takes in a natural language processor and filepath. It calculates the ratio of repeating to unique words in the text and outputs this value.

In [53]:
from lexical_repetition import repeating_unique_word_ratio

In [55]:
print("The following represents the proportion of repeating to unique words in the text:")
repeating_unique_word_ratio(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the proportion of repeating to unique words in the text:


0.07692307692307693

## `incorrectly_followed_articles()`

This function takes in a natural language processor and filepath. It calculates and returns the number of articles (a, and, the) that are not followed by an adjective, noun, or, proper noun. 

In [56]:
from syntactic_errors import incorrectly_followed_articles

In [60]:
print("The following represent the total number of articles in the text that are incorrectly followed:")
incorrectly_followed_articles(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/test.txt")

The following represent the total number of articles in the text that are incorrectly followed:


1

## `dependency_tree_heights()`

This function calls `tree_heights()` in `syntactic_complexity.py` and uses this function to return a list of all dependency tree heights in the text. 

In [8]:
from syntactic_complexity import dependency_tree_heights

In [9]:
print("The following represents a list of dependency tree heights for all dependency trees in the text:")
dependency_tree_heights(nlp=(spacy.load('en_core_web_lg')), file_path="sample_text/sample.txt")

The following represents a list of dependency tree heights for all dependency trees in the text:


[6, 5, 8]

## `ratio_of_nouns()`

This functio calls `ratio_of_pos()` with specified kwargs to calculate and return the ration of nouns to total words.

In [10]:
from pos_tagging import ratio_of_nouns

In [11]:
print("The following represents the proportion of nouns to total words in the text:")
ratio_of_nouns(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the proportion of nouns to total words in the text:


0.19642857142857142

## `ratio_of_pronouns()`

This functio calls `ratio_of_pos()` with specified kwargs to calculate and return the ration of pronouns to total words.

In [12]:
from pos_tagging import ratio_of_pronouns

In [13]:
print("The following represents the proportion of pronouns to total words in the text:")
ratio_of_pronouns(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the proportion of pronouns to total words in the text:


0.17857142857142858

## `ratio_of_conjunctions()`

This function calls `ratio_of_pos()` with specified kwargs to calculate and return the ration of conjunctions to total words.

In [14]:
from pos_tagging import ratio_of_conjunctions

In [15]:
print("The following represents the proportion of conjunctions to total words in the text:")
ratio_of_conjunctions(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the proportion of conjunctions to total words in the text:


0.05357142857142857

## `number_of_unique_tokens()`

This function calculates and returns the number of unique tokens present in a text document.

In [16]:
from lexical_variation import number_of_unique_tokens

In [17]:
print(f"The following represents the number of unique tokens in the text:")
number_of_unique_tokens(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the number of unique tokens in the text:


56

## `number_of_unique_lemmas()`

This function calculates and returns the number of unique lemmas present in a text document.

In [18]:
from lexical_variation import number_of_unique_lemmas

In [19]:
print(f"The following represents the number of unique lemmas in the text:")
number_of_unique_lemmas(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the number of unique lemmas in the text:


52

## `count_num_sentences_without_verbs()`

This function calculates and returns the number of sentences in the text that do not contain any verbs

In [20]:
from syntactic_errors import count_num_sentences_without_verbs

In [21]:
print(f"The following represents the number of sentences without verbs:")
count_num_sentences_without_verbs(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the number of sentences without verbs:


0

## `total_consecutive_words()`

This function calculates and returns the number of consecutive repeating words in the text.

In [22]:
from lexical_repetition import total_consecutive_words

In [23]:
print(f"The following represents the total number of consecutive repeating words in the text:")
total_consecutive_words(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the total number of consecutive repeating words in the text:


0

## `stats_similarity_of_words()`

This function takes in a natural language processor, file path, and window size(default 3). It returns dictionary containing mean, min,  max, and standard deviation of word similarity across all windows.

In [24]:
from similarity import stats_similarity_of_words

In [25]:
print(f"The following is a dictionary containing the mean, min, max and standard deviation of word similarity across moving windows:")
stats_similarity_of_words(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt", window_size=3)

The following is a dictionary containing the mean, min, max and standard deviation of word similarity across moving windows:


{'mean': 0.4301234567901235,
 'max': 0.7233333333333333,
 'min': 0.013333333333333336,
 'std': 0.1383596064509404}

## `mean_similarity_of_sentences()`

This function takes in a natural language processor and a filepath. It calculates and returns the mean similarity of all combinations of sentences.

In [26]:
from similarity import mean_similarity_of_sentences

In [27]:
print(f"The following represents the mean similarity score of all combinations of sentences in the text:")
mean_similarity_of_sentences(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/sample.txt")

The following represents the mean similarity score of all combinations of sentences in the text:


0.89

## `tf_idf()`

Takes in a natural language processor, filepath, document list and target string(term). It first calculates term frequency, which is defined as the frequency of a target string in a document. Then, it calculates inverse-document-frequency, which is defined as log10 of the number of documents divided by the number of documents containing the term. TF-IDF is calculated by multiplying these two values.

In [28]:
from term_freq_inverse_doc_freq import tf_idf

In [32]:
print(f"tf-idf of target string: 'life:'")
tf_idf(nlp=spacy.load('en_core_web_lg'), file_path="sample_text/test.txt", document_list=["sample_text/sample.txt", "sample_text/test.txt", "sample_text/contains_nonwords.txt"], term="life")




0.0