#### TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF measures how important a word is to a document relative to a collection of documents (the corpus). A word scores highly when it occurs often in that document but rarely in the rest of the corpus. It is commonly used in NLP and search engines.
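
To make the definition concrete, here is a minimal pure-Python sketch of the calculation. It assumes the smoothed formula that scikit-learn applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The three-sentence corpus and the helper names `idf` and `tfidf_vector` are hypothetical, and the whitespace tokenization is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased so a plain split() is enough.
docs = [
    "deep learning is part of machine learning",
    "python is used in data science",
    "machine learning uses data",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    # Raw term count times IDF, then L2-normalize to unit length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Print the weights of the second document, highest first.
for term, weight in sorted(vectors[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second document, "python" appears in only one document of the toy corpus while "is" appears in two, so "python" receives the larger weight even though both occur once.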

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
corpus = [
    "Artificial Intelligence is changing the world.",
    "Machine learning allows systems to learn from data.",
    "Deep learning is a part of machine learning.",
    "Neural networks are used in deep learning models.",
    "Python is widely used in data science.",
    "Data preprocessing is crucial in machine learning.",
    "Supervised learning uses labeled data.",
    "Unsupervised learning works with unlabeled data.",
    "Reinforcement learning learns from rewards and penalties.",
    "Decision trees are simple and interpretable models.",
    "Random forests are built using multiple decision trees.",
    "Logistic regression is used for binary classification.",
    "Linear regression predicts continuous values.",
    "Clustering groups data based on similarity.",
    "K-means is a popular clustering algorithm.",
    "Natural Language Processing deals with human language.",
    "Tokenization is the first step in NLP.",
    "TF-IDF helps identify important words in documents.",
    "Feature scaling improves model performance.",
    "Hyperparameter tuning boosts accuracy.",
    "Cross-validation ensures model generalization.",
    "Overfitting occurs when the model memorizes training data.",
    "Underfitting happens when the model fails to learn patterns.",
    "Model evaluation uses metrics like accuracy and F1-score.",
    "Data visualization reveals insights through graphs and plots."
]


In [28]:
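# Initialize the TfidfVectorizer and fit it on the corpus.
# fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array for printing.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()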

In [30]:
print("Length of the Vocabulary : ", len(tfidf.vocabulary_))
print()
print(tfidf.get_feature_names_out())

Length of the Vocabulary :  115

['accuracy' 'algorithm' 'allows' 'and' 'are' 'artificial' 'based' 'binary'
 'boosts' 'built' 'changing' 'classification' 'clustering' 'continuous'
 'cross' 'crucial' 'data' 'deals' 'decision' 'deep' 'documents' 'ensures'
 'evaluation' 'f1' 'fails' 'feature' 'first' 'for' 'forests' 'from'
 'generalization' 'graphs' 'groups' 'happens' 'helps' 'human'
 'hyperparameter' 'identify' 'idf' 'important' 'improves' 'in' 'insights'
 'intelligence' 'interpretable' 'is' 'labeled' 'language' 'learn'
 'learning' 'learns' 'like' 'linear' 'logistic' 'machine' 'means'
 'memorizes' 'metrics' 'model' 'models' 'multiple' 'natural' 'networks'
 'neural' 'nlp' 'occurs' 'of' 'on' 'overfitting' 'part' 'patterns'
 'penalties' 'performance' 'plots' 'popular' 'predicts' 'preprocessing'
 'processing' 'python' 'random' 'regression' 'reinforcement' 'reveals'
 'rewards' 'scaling' 'science' 'score' 'similarity' 'simple' 'step'
 'supervised' 'systems' 'tf' 'the' 'through' 'to' 'tokenizat

In [8]:
print(X)

[[0.         0.         0.         ... 0.         0.         0.45058348]
 [0.         0.         0.41837712 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.32144851 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


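#### The values in the output matrix are TF-IDF weights (not probabilities). Each row is a document, each column is a term, and a larger value means the term is more important to that document: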

#### Limiting the vocabulary size with max_features:

In [27]:
tfidf1 = TfidfVectorizer(max_features=79)
X1 = tfidf1.fit_transform(corpus).toarray()
print("Length of the Vocabulary : ", len(tfidf1.vocabulary_))
print()
print(tfidf1.get_feature_names_out())

Length of the Vocabulary :  79

['accuracy' 'and' 'are' 'classification' 'clustering' 'continuous' 'cross'
 'crucial' 'data' 'deals' 'decision' 'deep' 'documents' 'ensures'
 'evaluation' 'f1' 'fails' 'feature' 'first' 'for' 'from' 'in' 'is'
 'language' 'learn' 'learning' 'machine' 'metrics' 'model' 'models'
 'neural' 'occurs' 'of' 'on' 'overfitting' 'part' 'patterns' 'penalties'
 'performance' 'plots' 'popular' 'predicts' 'preprocessing' 'processing'
 'python' 'random' 'regression' 'reinforcement' 'reveals' 'rewards'
 'scaling' 'science' 'score' 'similarity' 'simple' 'step' 'supervised'
 'systems' 'tf' 'the' 'through' 'to' 'tokenization' 'training' 'trees'
 'tuning' 'underfitting' 'unlabeled' 'unsupervised' 'used' 'uses' 'using'
 'validation' 'values' 'visualization' 'when' 'widely' 'with' 'words']


#### Using ngram_range:

#### Unigrams only, ngram_range=(1, 1):

In [31]:
tfidf2 = TfidfVectorizer(max_features=75, ngram_range=(1,1))
X2 = tfidf2.fit_transform(corpus).toarray()
print("Length of the Vocabulary : ", len(tfidf2.vocabulary_))
print()
print(tfidf2.get_feature_names_out())

Length of the Vocabulary :  75

['accuracy' 'and' 'are' 'clustering' 'data' 'deals' 'decision' 'deep'
 'documents' 'ensures' 'evaluation' 'f1' 'fails' 'feature' 'first' 'for'
 'from' 'in' 'is' 'language' 'learn' 'learning' 'machine' 'metrics'
 'model' 'models' 'neural' 'occurs' 'of' 'on' 'overfitting' 'part'
 'patterns' 'penalties' 'performance' 'plots' 'popular' 'predicts'
 'preprocessing' 'processing' 'python' 'random' 'regression'
 'reinforcement' 'reveals' 'rewards' 'scaling' 'science' 'score'
 'similarity' 'simple' 'step' 'supervised' 'systems' 'tf' 'the' 'through'
 'to' 'tokenization' 'training' 'trees' 'tuning' 'underfitting'
 'unlabeled' 'unsupervised' 'used' 'uses' 'using' 'validation' 'values'
 'visualization' 'when' 'widely' 'with' 'words']


#### Unigrams and bigrams, ngram_range=(1, 2):

In [32]:
tfidf3 = TfidfVectorizer(ngram_range=(1,2))
X3 = tfidf3.fit_transform(corpus).toarray()
print("Length of the Vocabulary : ", len(tfidf3.vocabulary_))
print()
print(tfidf3.get_feature_names_out())

Length of the Vocabulary :  251

['accuracy' 'accuracy and' 'algorithm' 'allows' 'allows systems' 'and'
 'and f1' 'and interpretable' 'and penalties' 'and plots' 'are'
 'are built' 'are simple' 'are used' 'artificial'
 'artificial intelligence' 'based' 'based on' 'binary'
 'binary classification' 'boosts' 'boosts accuracy' 'built' 'built using'
 'changing' 'changing the' 'classification' 'clustering'
 'clustering algorithm' 'clustering groups' 'continuous'
 'continuous values' 'cross' 'cross validation' 'crucial' 'crucial in'
 'data' 'data based' 'data preprocessing' 'data science'
 'data visualization' 'deals' 'deals with' 'decision' 'decision trees'
 'deep' 'deep learning' 'documents' 'ensures' 'ensures model' 'evaluation'
 'evaluation uses' 'f1' 'f1 score' 'fails' 'fails to' 'feature'
 'feature scaling' 'first' 'first step' 'for' 'for binary' 'forests'
 'forests are' 'from' 'from data' 'from rewards' 'generalization' 'graphs'
 'graphs and' 'groups' 'groups data' 'happens' 'happens w

#### Unigrams, bigrams, and trigrams, ngram_range=(1, 3):

In [33]:
tfidf4 = TfidfVectorizer(ngram_range=(1,3))
X4 = tfidf4.fit_transform(corpus).toarray()
print("Length of the Vocabulary : ", len(tfidf4.vocabulary_))
print()
print(tfidf4.get_feature_names_out())

Length of the Vocabulary :  369

['accuracy' 'accuracy and' 'accuracy and f1' 'algorithm' 'allows'
 'allows systems' 'allows systems to' 'and' 'and f1' 'and f1 score'
 'and interpretable' 'and interpretable models' 'and penalties'
 'and plots' 'are' 'are built' 'are built using' 'are simple'
 'are simple and' 'are used' 'are used in' 'artificial'
 'artificial intelligence' 'artificial intelligence is' 'based' 'based on'
 'based on similarity' 'binary' 'binary classification' 'boosts'
 'boosts accuracy' 'built' 'built using' 'built using multiple' 'changing'
 'changing the' 'changing the world' 'classification' 'clustering'
 'clustering algorithm' 'clustering groups' 'clustering groups data'
 'continuous' 'continuous values' 'cross' 'cross validation'
 'cross validation ensures' 'crucial' 'crucial in' 'crucial in machine'
 'data' 'data based' 'data based on' 'data preprocessing'
 'data preprocessing is' 'data science' 'data visualization'
 'data visualization reveals' 'deals' 'deals wit
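
To see what `ngram_range` does to a single sentence, the vectorizer's analyzer can be called directly. This is a small sketch using `build_analyzer()` on one sentence from the corpus; the `(2, 2)` setting is not used above and is included only to show bigrams on their own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "Deep learning is a part of machine learning."

# (min_n, max_n): every n-gram length from min_n to max_n is generated.
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    analyzer = TfidfVectorizer(ngram_range=ngram_range).build_analyzer()
    print(ngram_range, analyzer(sentence))
```

Note that the default token pattern drops single-character tokens such as "a", which is why the bigram "is part" appears instead of "is a".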

In [19]:
print(X1)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.34493677 0.28916737 0.         ... 0.         0.         0.        ]
 [0.         0.33611434 0.         ... 0.         0.         0.        ]]


In [20]:
print(X2)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.54947226 0.46063355 0.         ... 0.         0.         0.        ]
 [0.         0.54097705 0.         ... 0.         0.         0.        ]]


In [36]:
print(X3)

[[0.         0.         0.         ... 0.         0.         0.31741249]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.22437724 0.25317217 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [37]:
print(X4)

[[0.         0.         0.         ... 0.         0.         0.26797529]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.18642047 0.21034431 0.21034431 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


#### Summary:

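TF captures how often a word occurs within a document.

IDF down-weights words that appear in many documents of the corpus.

TF-IDF multiplies the two, so words that are frequent in one document but rare elsewhere receive the highest weights.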
