Calculate Term Frequency-Inverse Document Frequency (TF-IDF) scores,
treating each sentence as a separate document.


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents (sentences)
documents = [
    "Natural Language Processing (NLP) is a subfield of linguistics.",
    "Computer science and artificial intelligence are closely related to NLP.",
    "NLP deals with the interactions between computers and human language.",
    "How to program computers to process and analyze large amounts of natural language data is a key aspect of NLP."
]

In [2]:
# TfidfVectorizer lowercases and tokenizes the text internally
# We skip stemming and lemmatization for simplicity
# Common English stop words are removed via stop_words='english'
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Compute TF-IDF scores
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

In [3]:
# Print the non-zero TF-IDF scores for each document
print("Term Frequency - Inverse Document Frequency (TF-IDF) Scores:")
for i in range(len(documents)):
    print(f"\nDocument {i+1}:")
    # Convert the sparse row to a dense 1-D array so it can be paired with the words
    row = tfidf_matrix[i].toarray().ravel()
    for word, tfidf_score in zip(feature_names, row):
        if tfidf_score > 0:
            print(f"{word}: {tfidf_score:.2f}")

Term Frequency - Inverse Document Frequency (TF-IDF) Scores:

Document 1:
language: 0.31
linguistics: 0.48
natural: 0.38
nlp: 0.25
processing: 0.48
subfield: 0.48

Document 2:
artificial: 0.40
closely: 0.40
computer: 0.40
intelligence: 0.40
nlp: 0.21
related: 0.40
science: 0.40

Document 3:
computers: 0.38
deals: 0.48
human: 0.48
interactions: 0.48
language: 0.31
nlp: 0.25

Document 4:
amounts: 0.32
analyze: 0.32
aspect: 0.32
computers: 0.25
data: 0.32
key: 0.32
language: 0.20
large: 0.32
natural: 0.25
nlp: 0.17
process: 0.32
program: 0.32


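We start from a list of sample documents, where each sentence is treated as one document.
We use TfidfVectorizer from scikit-learn, with English stop words removed, to compute the TF-IDF scores.
fit_transform learns the vocabulary and IDF weights from the documents and returns a TF-IDF matrix with one row per document and one column per word.
get_feature_names_out returns the words that correspond to the matrix columns.
Finally, we print every non-zero TF-IDF score for each document. Within each document, nlp (which appears in all four sentences) scores lowest, while words found in only one sentence score highest.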
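
To make the printed scores less of a black box, here is a minimal sketch that recomputes them by hand using only the standard library. It assumes the documented scikit-learn defaults for TfidfVectorizer: raw term counts for TF, the smoothed IDF idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each document's row. The token lists are written out by hand to mirror what is left after lowercasing and stop-word removal, and the names tokenized_docs and manual_tfidf are illustrative, not part of scikit-learn.

```python
import math
from collections import Counter

# Hand-prepared tokens: what remains of each sentence after lowercasing and
# removing English stop words (TfidfVectorizer does this step internally).
tokenized_docs = [
    ["natural", "language", "processing", "nlp", "subfield", "linguistics"],
    ["computer", "science", "artificial", "intelligence", "closely", "related", "nlp"],
    ["nlp", "deals", "interactions", "computers", "human", "language"],
    ["program", "computers", "process", "analyze", "large", "amounts",
     "natural", "language", "data", "key", "aspect", "nlp"],
]

def manual_tfidf(docs):
    """Return one {term: score} dict per document (illustrative helper)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so that each document vector has unit length
        norm = math.sqrt(sum(value * value for value in raw.values()))
        rows.append({term: value / norm for term, value in raw.items()})
    return rows

manual_scores = manual_tfidf(tokenized_docs)
for term, score in sorted(manual_scores[0].items()):
    print(f"{term}: {score:.2f}")
```

Rounded to two decimals, the values for the first document agree with the Document 1 scores printed above. The word nlp has the smallest IDF (exactly 1) because it occurs in all four documents, which is why it receives the lowest score in every row.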