## TF-IDF Intuition


*TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used in natural language processing and information retrieval to score how relevant a term is to a document within a corpus of documents.*

__The idea behind TF-IDF is to give more weight to terms that are distinctive for a document, and less weight to terms that are common across the corpus. The score is the product of two components:__

__A-)__ Term frequency (TF): the number of times a term appears in a document, divided by the total number of terms in that document. This measures how frequently a term appears in a document relative to other terms.

__B-)__ Inverse document frequency (IDF): the logarithm of the ratio between the total number of documents in the corpus and the number of documents that contain the term. This measures how rare, and therefore how distinctive, a term is across the entire corpus.
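
In symbols, the two components and their product can be written as follows. The notation is chosen here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$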

![asd.png](attachment:asd.png)
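
The TF and IDF definitions above can be checked by hand on a tiny corpus. The sketch below is a minimal illustration, not a production implementation: the three-document corpus is made up, the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are hypothetical, and it assumes the queried term occurs in at least one document.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["republic", "nation", "republic"],
    ["nation", "sovereignty"],
    ["nation", "progress", "science"],
]

def term_frequency(term, doc):
    """Occurrences of `term` in `doc`, divided by the number of tokens in `doc`."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    """Log of (number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Product of the two components defined above."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

republic_score = tf_idf("republic", docs[0], docs)  # frequent here, absent elsewhere
nation_score = tf_idf("nation", docs[0], docs)      # present in every document
print(round(republic_score, 4), round(nation_score, 4))  # 0.7324 0.0
```

A term that occurs in every document, such as `nation` here, gets an IDF of log(1) = 0 and therefore a score of 0, however often it appears in a single document.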

In [1]:
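# Setup: import NLTK and fetch the resources used below (needed once per environment)

import nltk

for resource in ('punkt', 'stopwords', 'wordnet'):  # newer NLTK releases may also ask for 'punkt_tab'
    nltk.download(resource)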

paragraph = """Turkish nation! Your homeland is no longer a despotic state ruled by sultans. It is now a modern and sovereign republic. A republic that you have established with great sacrifice and determination.
The foundations of this republic are democracy and freedom. Its fundamental principles are based on the concept of national sovereignty, where power belongs to the people. Every citizen, regardless of their religion, language, or race, is equal before the law.
Our aim is to raise the standards of civilization of our nation to the level of contemporary civilizations. We must embrace science, knowledge, and reason. We must strive for progress in every field, from education to industry, from agriculture to technology.
Let us remember that the true power of a nation lies in the unity and solidarity of its people. We must cast aside our differences and work together as one. We have proven to the world that we are capable of achieving great things when we are united.
Our national sovereignty is our most precious possession. We will protect it at all costs and defend our independence against any threat. We will not allow any external influence to hinder our progress or undermine our sovereignty.
We must also never forget the sacrifices made by our heroes and martyrs. They fought bravely to secure the future of our nation. It is our duty to honor their memory by continuing to build a strong and prosperous Turkey.
Turkish nation! The Republic of Turkey is in your hands. It is up to you to preserve its principles and ensure its progress. Let us work together, with determination and perseverance, to create a brighter future for our country and our people.
Long live the Republic of Turkey"""



In [2]:
#Cleaning

import re
from nltk.corpus import stopwords
# from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [6]:
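# Lemmatization is used below, so no stemmer is needed
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)  # split the speech into sentences
corpus = []  # cleaned sentences, one string per sentence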

In [10]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not per word
corpus = []  # reset so that re-running this cell does not append duplicates
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # keep letters only
    review = review.lower().split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [11]:
# Building the TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()  # dense array: one row per corpus entry, one column per term

In [13]:
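X.shape  # (number of corpus entries, vocabulary size)

(72, 110)

Each row is one entry of `corpus` and each column is one term of the learned vocabulary. The paragraph has only about two dozen sentences, so 72 rows suggests the cleaning cell was executed more than once while `corpus` kept accumulating; resetting `corpus` right before the loop gives one row per sentence.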