# TF-IDF / Term Frequency-Inverse Document Frequency

- Finding the most important words in a text is one of the most common tasks in information retrieval and text mining, and a common way of doing this is tf-idf.

- This is a measure *to assess a word’s significance within a collection of documents*

- A distinctive word that appears only a few times in a set of documents is therefore treated as more significant and is assigned a higher weight than frequently occurring words.

- TF-IDF is a simple measure of a word’s importance within a set of documents.

- Search engines frequently use variations of the tf-idf weighting scheme as a central scoring and ranking tool when determining how relevant a document is to a user query.

- Tf-idf can also be used to filter out stop-words, which is useful in tasks such as text classification and summarization.

# Term Frequency

- The frequency at which a word appears in a document is referred to as “term frequency.”

- The weight of a term that occurs in a document is simply proportional to the term’s frequency.
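A minimal sketch of term frequency, assuming a naive lowercase whitespace tokenizer and the relative-frequency variant (count divided by document length). The function name `term_frequency` is hypothetical; other variants, such as raw counts, are equally common.

```python
from collections import Counter

def term_frequency(document):
    """Return the relative frequency of each token in one document."""
    tokens = document.lower().split()  # assumption: whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    # Each term's weight is proportional to how often it occurs.
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
print(tf["the"], tf["cat"])
```

Here "the" occurs twice among six tokens, so it gets twice the weight of "cat".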



# Inverse Document Frequency

- Inverse document frequency increases the weight of infrequent terms and decreases the weight of terms that occur frequently in the document set.

- The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
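One common way to express this inverse function is the logarithm of the total number of documents divided by the number of documents containing the term. The sketch below assumes that unsmoothed form and a whitespace tokenizer; the function name is hypothetical, and libraries often use smoothed variants instead.

```python
import math

def inverse_document_frequency(documents):
    """Return log(N / df) for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a term is counted once per document, not once per occurrence
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = ["the cat sat", "the dog barked", "the bird sang"]
idf = inverse_document_frequency(docs)
print(idf["the"], idf["cat"])
```

A term that appears in every document ("the") gets an idf of zero, while a term found in a single document ("cat") gets the highest value.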

# TF-IDF

TF-IDF = term frequency * inverse document frequency
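The product can be sketched as follows, assuming relative term frequency, the unsmoothed `log(N / df)` form of idf, and whitespace tokenization. These are illustrative choices, not the only valid definition, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "the" scores zero because it occurs in every document, "cat" scores lower than "mat" because it also occurs in the second document, and "mat" scores highest among the three.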

- A central problem in natural language processing is that machine learning models generally only work with numerical values. This is a problem because text cannot simply be replaced with arbitrary numbers without losing its meaning.

- Therefore, we must **vectorize the text to convert it into numbers**. This is a crucial step in machine learning, and the outcomes of different vectorization algorithms vary greatly. Hence, choosing one that produces the desired results for your problem is vital.

- The **tf-idf score converts words into numbers that can be fed to algorithms like Naive Bayes and Support Vector Machines,** often improving on the results of simpler techniques like **word counts.**

- In its simplest form, a **word vector** represents a **document as a list of numbers**, with one number for each possible word in the text. By taking a document’s text and turning it into one of these vectors, the text’s content is represented by the vector’s numbers. Then, with the help of **tf-idf, we can quantify the relevance of each word in a document by associating it with a number**. As a result, documents that contain the same pertinent words will have similar vectors, which is the kind of pattern a machine learning algorithm looks for.
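To make the vector idea concrete, here is a small sketch that turns each document into a tf-idf vector over a shared vocabulary and compares vectors with cosine similarity. The helper names, the whitespace tokenizer, and the unsmoothed idf are assumptions made for illustration; cosine similarity is one common way to compare such vectors, not something the notes above prescribe.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors), one fixed-length vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, vectors

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the cat chased the mouse",
    "the cat caught the mouse",
    "the stock market fell today",
]
vocabulary, vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

The first two documents share the informative words "cat" and "mouse", so their vectors are similar. The first and third share only "the", which has zero weight, so their similarity is zero.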



# Applications of TF-IDF

- Information retrieval

- Keyword Extraction - word cloud formations and quick summaries of large bodies of text.
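Keyword extraction can be sketched by scoring every word in a document and keeping the top few. The function `top_keywords`, the whitespace tokenizer, and the unsmoothed idf are assumptions for this example; ties are broken alphabetically so the output is deterministic.

```python
import math
from collections import Counter

def top_keywords(documents, doc_index, k=3):
    """Return the k highest-scoring tf-idf terms of one document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    tokens = tokenized[doc_index]
    counts = Counter(tokens)
    scores = {
        t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
        for t, c in counts.items()
    }
    # Highest score first; alphabetical order among equal scores.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "the rocket launch put the satellite into orbit",
    "the chef put the cake into the oven",
    "the team won the match",
]
keywords = top_keywords(docs, doc_index=0, k=4)
print(keywords)
```

Words that are specific to the first document rank highest, while "the", which appears in every document, scores zero and never makes the list.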

# Advantages

The simplicity and ease of use of tf-idf are its most significant benefits. It is easy to **compute, inexpensive to run, and a clear starting point for similarity calculations**.



# Disadvantages


- Tf-idf does not capture semantic meaning; for example, it treats synonyms as unrelated terms.

- Tf-idf disregards word order, so compound nouns like “New York” will not be regarded as a “single unit.”

- Tf-idf can suffer from the curse of dimensionality and be memory-inefficient, because the length of each tf-idf vector is equal to the vocabulary size.
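Two of these limitations can be illustrated with a short sketch. It assumes whitespace tokenization and uses plain word counts, since tf-idf scores are derived from those counts: documents with the same counts get the same tf-idf vector, and each vector has one entry per vocabulary word.

```python
from collections import Counter

# Word order is lost: both sentences produce the same bag of words,
# so any count-based weighting (including tf-idf) cannot tell them apart.
docs = ["man bites dog", "dog bites man"]
bags = [Counter(doc.split()) for doc in docs]
same_representation = bags[0] == bags[1]

# Vector length equals vocabulary size, so it grows with every new word.
corpus = docs + ["new york is a large city", "stocks fell sharply today"]
vocabulary = sorted({token for doc in corpus for token in doc.split()})
vector_length = len(vocabulary)

print(same_representation, vector_length)
```

Note also that "new" and "york" end up as two separate vocabulary entries rather than one unit.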

# Alternatives: Word2Vec, BERT

# Source

- https://spotintelligence.com/2022/11/28/tf-idf/