## Word Frequency

In Natural Language Processing (NLP), word frequency is the number of times a specific word appears in a text or a corpus of texts. Counting words is a fundamental text-analysis technique that can provide insights into the themes, style, and structure of a text.

Here are some key points about word frequency:

- Basic Analysis: At its simplest, word frequency analysis involves counting how many times each word appears in a text. This can reveal the most commonly used words, which can be indicative of the text's main topics or themes.
- Stop Words: Common words like "the", "is", and "and" (known as stop words) are often among the most frequent. They are usually filtered out in NLP tasks so that the counts reflect more meaningful words.
- Normalization: Words are often normalized before counting, for example by lowercasing, stemming, or lemmatization, so that different forms of the same word are counted together.
- Term Frequency (TF): In text mining and information retrieval, term frequency is the number of times a term appears in a document, often divided by the total number of terms in that document. It is one component of TF-IDF (Term Frequency-Inverse Document Frequency), which estimates how important a word is to a document within a collection or corpus.
- Applications: Word frequency analysis is used in various applications such as creating word clouds, conducting sentiment analysis, performing authorship attribution, and improving search engine algorithms.
- Insights and Trends: Beyond simple counts, word frequency can be analyzed over time or across different text corpora to identify trends, changes in language use, or differences in vocabulary among authors or genres.

Word frequency analysis is a basic yet powerful tool in NLP and text analytics, providing a foundation for more complex linguistic and semantic analyses.
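
As a rough illustration of the term frequency idea from the list above, the sketch below divides each token's count by the total number of tokens. The helper name `term_frequencies` and the toy token list are made up for this example; it assumes the text has already been tokenized and lowercased, and it uses only the standard library rather than spaCy.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return each token's count divided by the total number of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document, already tokenized and lowercased.
tokens = ["gus", "is", "learning", "piano", "gus", "likes", "piano", "music"]
tf = term_frequencies(tokens)

# "gus" occurs 2 times among 8 tokens.
print(tf["gus"])
```

Because every count is divided by the same total, the values sum to 1, which makes documents of different lengths easier to compare than raw counts.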


In [1]:
# Let's take a look at the example below.

import spacy
from spacy import displacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

In [7]:
complete_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech company. He is"
    " interested in learning Natural Language Processing."
    " There is a developer conference happening on 21 July"
    ' 2019 in London. It is titled "Applications of Natural'
    ' Language Processing". There is a helpline number'
    " available at +44-1234567891. Gus is helping organize it."
    " He keeps organizing local Python meetups and several"
    " internal talks at his workplace. Gus is also presenting"
    ' a talk. The talk will introduce the reader about "Use'
    ' cases of Natural Language Processing in Fintech".'
    " Apart from his work, he is very passionate about music."
    " Gus is learning to play the Piano. He has enrolled"
    " himself in the weekend batch of Great Piano Academy."
    " Great Piano Academy is situated in Mayfair or the City"
    " of London and has world-class piano instructors."
)

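# Run the spaCy pipeline, then keep tokens that are neither stop words nor punctuation.
complete_doc = nlp(complete_text)
word = [
    token.text for token in complete_doc if not token.is_stop and not token.is_punct
]
# Count the remaining tokens and print the five most frequent ones.
word_freq = Counter(word)
common_words = word_freq.most_common(5)
print(common_words)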

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]


In [13]:
# For comparison, count every non-punctuation token, stop words included.
Counter([token.text for token in complete_doc if not token.is_punct]).most_common(5)

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]
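
The counts above use `token.text`, so "Piano" and "piano" are tallied as different words. The sketch below shows the effect of one simple normalization step, lowercasing, on a short adaptation of the sample text. The `count_words` helper and its regex tokenizer are assumptions made for this illustration and stand in for spaCy's tokenizer; with spaCy, a comparable result could likely be obtained by counting `token.lemma_.lower()` instead of `token.text`.

```python
import re
from collections import Counter


def count_words(text, lowercase=False):
    """Count alphabetic tokens, optionally lowercasing them first."""
    # Simplified regex tokenizer: an assumption for this sketch, not spaCy's tokenizer.
    tokens = re.findall(r"[A-Za-z]+", text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return Counter(tokens)


# Short adaptation of the sample text used in this notebook.
excerpt = (
    "Gus is learning to play the Piano. "
    "Great Piano Academy has world-class piano instructors."
)

raw = count_words(excerpt)
folded = count_words(excerpt, lowercase=True)

# Case-sensitive counts keep the two spellings apart.
print(raw["Piano"], raw["piano"])
# After lowercasing, all occurrences are counted together.
print(folded["piano"])
```

Lowercasing merges spellings that differ only in case, but it also merges proper nouns with common nouns, so whether to apply it depends on the analysis.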

## Visualization: Using displaCy

In [14]:
# spaCy comes with a built-in visualizer called displaCy
about_interest_text = "He is interested in learning Natural Language Processing."

about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style="dep", jupyter=True)

In [24]:
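# Optional displaCy settings: compact layout, spacing between words, text color, background color, and font.
display_options = {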
    "compact": True,
    "distance": 100,
    "color": "purple",
    "bg": "#09a3d5",
    "font": "Times",
}
displacy.render(about_interest_doc, style="dep", jupyter=True, options=display_options)