# Word Frequency

Word frequency analysis simply counts how many times each word appears in a given text or collection of texts (corpus).


## Common Approaches to Represent Word Frequency


- **Bag of Words (BoW)**: Counts word frequency, ignores grammar and order.
- **Term Frequency–Inverse Document Frequency (TF-IDF)**: Scales each word count by how rare the word is across documents, so words that appear everywhere carry less weight.
- **CountVectorizer (from sklearn)**: Automates BoW.
- **TfidfVectorizer (from sklearn)**: Automates TF-IDF.



### Bag of Words (BoW)

In [2]:
from collections import Counter
import re

def word_frequency(text):
    # Lowercase the text, then treat each run of word characters as one token
    words = re.findall(r'\b\w+\b', text.lower())
    return Counter(words)

text = "Machine learning is fun. Learning machine learning is very fun!"
freq = word_frequency(text)
freq

Counter({'machine': 2, 'learning': 3, 'is': 2, 'fun': 2, 'very': 1})
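
Once the counts are in a `Counter`, the most frequent words and their relative frequencies are easy to read off. The snippet below is a small sketch on the same sentence; the names `top_two`, `total`, and `relative` are illustrative and not part of the original notebook.

```python
from collections import Counter
import re

text = "Machine learning is fun. Learning machine learning is very fun!"
freq = Counter(re.findall(r"\b\w+\b", text.lower()))

# The two most frequent words, highest count first
top_two = freq.most_common(2)

# Relative frequency: each count divided by the total number of tokens
total = sum(freq.values())
relative = {word: count / total for word, count in freq.items()}

print(top_two)
print(relative["learning"])
```

Relative frequencies make texts of different lengths comparable, which raw counts do not.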

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Machine learning is fun",
    "I love learning machine learning",
    "Machine learning is powerful"
]

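# By default, CountVectorizer lowercases the text and keeps only tokens of
# two or more word characters, so "I" does not appear in the vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)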

print("Vocabulary:", vectorizer.vocabulary_)
print("Feature Matrix:\n", X.toarray())

Vocabulary: {'machine': 4, 'learning': 2, 'is': 1, 'fun': 0, 'love': 3, 'powerful': 5}
Feature Matrix:
 [[1 1 1 0 1 0]
 [0 0 2 1 1 0]
 [0 1 1 0 1 1]]
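
To see what `CountVectorizer` is doing, the same matrix can be rebuilt with the standard library. This is a rough sketch rather than scikit-learn's actual implementation: it assumes the default settings (lowercasing, tokens of at least two word characters, vocabulary indexed in alphabetical order), and the helper name `count_row` is made up for illustration.

```python
import re
from collections import Counter

corpus = [
    "Machine learning is fun",
    "I love learning machine learning",
    "Machine learning is powerful",
]

# Tokenize: lowercase, keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# Vocabulary: every distinct word, indexed in alphabetical order
vocabulary = {
    word: index
    for index, word in enumerate(sorted({w for d in docs for w in d}))
}

def count_row(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

matrix = [count_row(d) for d in docs]

print("Vocabulary:", vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so column 2 (`learning`) holds a 2 for the second sentence.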


### TF-IDF

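TF-IDF gives more weight to words that appear in few documents and less weight to words that appear in most of them. For example, in the first document `fun` (which appears in one document) scores higher than `machine` and `learning` (which appear in all three).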

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

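# Reuses `corpus` from the CountVectorizer cell above.
# The defaults apply smoothed IDF and L2-normalize each row.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)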

print("Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Matrix:\n", X.toarray())

Vocabulary: {'machine': 4, 'learning': 2, 'is': 1, 'fun': 0, 'love': 3, 'powerful': 5}
TF-IDF Matrix:
 [[0.66283998 0.50410689 0.39148397 0.         0.39148397 0.        ]
 [0.         0.         0.71307037 0.60366655 0.35653519 0.        ]
 [0.         0.50410689 0.39148397 0.         0.39148397 0.66283998]]
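
The numbers above can be reproduced by hand, which helps show where the weighting comes from. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf_row` is hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "Machine learning is fun",
    "I love learning machine learning",
    "Machine learning is powerful",
]

# Same tokenization assumption as before: lowercase, two or more word characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Multiply term counts by IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(d) for d in docs]
for row in tfidf:
    print([round(v, 4) for v in row])
```

Because `machine` and `learning` occur in every document, their IDF is exactly 1, the lowest possible value under this formula, while `fun`, `love`, and `powerful` get the highest.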


## Zipf's Law

In [5]:
# https://www.youtube.com/watch?v=zLMEnNbdh4Q
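
Zipf's law says that in a large body of natural-language text, a word's frequency is roughly inversely proportional to its rank, so rank times frequency stays roughly constant. The sketch below only shows how a rank-frequency table could be built; the function name `rank_frequency` is illustrative, and the one-sentence sample is far too small to actually exhibit the pattern.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count, rank * count) tuples, most frequent first."""
    counts = Counter(re.findall(r"\b\w+\b", text.lower()))
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

sample = "Machine learning is fun. Learning machine learning is very fun!"
table = rank_frequency(sample)
for rank, word, count, product in table:
    print(rank, word, count, product)
```

On a real corpus, plotting count against rank on log-log axes is the usual way to check how closely the text follows the law.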