# TF-IDF and its Applications

**Term Frequency-Inverse Document Frequency (TF-IDF)** is used for finding keywords in a text or corpus

### Why Do We Need To Find Important Words in Texts?
- If we want to know what kind of activity an influencer on a social network such as Instagram is engaged in, we should find the important words (keywords) in the captions of the influencer's posts
- There are different criteria for deciding what counts as a keyword, so we need to score the words in our texts and sort them by score
- TF-IDF is one of these criteria, but there are other methods too

### What is TF-IDF?
It's a way to score the importance of words (or "terms") in a document based on how frequently they appear in that document and across multiple documents

Intuitively:
- If a word appears frequently in a document, it's important to that document. Give the word a high score
- But if a word appears in many documents, it's not a unique identifier of any one of them. Give the word a low score
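The two intuitions above can be sketched by hand in plain Python. This is a minimal illustration, not a reference implementation: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, and the token lists are toy inputs with stop words already removed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); rows[i][j] is the score of vocabulary[j] in docs[i].

    Each document is a list of already-cleaned tokens.
    """
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Assumed smoothed IDF: rare terms get a larger weight than common ones.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        # L2-normalize so long and short documents are comparable.
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


docs = [
    ["sky", "blue"],
    ["sun", "bright"],
    ["sun", "sky", "bright"],
    ["shining", "sun", "bright", "sun"],
]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(x, 4) for x in row])
```

In the last document, "sun" occurs twice and so outscores "bright", while "shining" scores well despite occurring once because it appears in no other document.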

### How do we calculate TF-IDF programmatically?
Using scikit-learn's `TfidfVectorizer`:

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['The sky is blue', 'The sun is bright', 'The sun in the sky is bright', 'we can see the shining sun, the bright sun']

vectorizer = TfidfVectorizer(stop_words='english')
TFIDF_matrix = vectorizer.fit_transform(documents)

print('TF-IDF Matrix')
print(TFIDF_matrix.toarray())
print('')
print('TF-IDF Keywords')
print(vectorizer.get_feature_names_out().tolist())

TF-IDF Matrix
[[0.78528828 0.         0.         0.6191303  0.        ]
 [0.         0.70710678 0.         0.         0.70710678]
 [0.         0.53256952 0.         0.65782931 0.53256952]
 [0.         0.36626037 0.57381765 0.         0.73252075]]

TF-IDF Keywords
['blue', 'bright', 'shining', 'sky', 'sun']
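The matrix above still has to be turned into a ranked keyword list per document. A minimal sketch of that step follows; `top_keywords` is a hypothetical helper (not part of scikit-learn), and the scores are copied from the printed output so the block runs without any dependencies.

```python
def top_keywords(rows, feature_names, k=2):
    """Return the k highest-scoring terms for each row of a TF-IDF matrix."""
    ranked = []
    for row in rows:
        # Keep only terms that actually occur in the document.
        scored = [(name, score) for name, score in zip(feature_names, row) if score > 0]
        # Highest score first; ties keep their column order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        ranked.append([name for name, _ in scored[:k]])
    return ranked


# Values copied from the printed matrix and feature list.
feature_names = ['blue', 'bright', 'shining', 'sky', 'sun']
rows = [
    [0.78528828, 0.0, 0.0, 0.6191303, 0.0],
    [0.0, 0.70710678, 0.0, 0.0, 0.70710678],
    [0.0, 0.53256952, 0.0, 0.65782931, 0.53256952],
    [0.0, 0.36626037, 0.57381765, 0.0, 0.73252075],
]

keywords = top_keywords(rows, feature_names, k=2)
for doc_id, words in enumerate(keywords):
    print(doc_id, words)
```

With the vectorizer from the cell above, the same helper should also accept `TFIDF_matrix.toarray()` as `rows`.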


## Applications of TF-IDF
- Influencer marketing is a new and innovative way for brands to reach their customers through influencers on social media
- We can use natural language processing (NLP) to extract keywords and top hashtags and then map them to categories (word2vec)

**Word2Vec** assigns a vector of numbers to every word, which lets us compare the keywords found with TF-IDF against category names
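As a rough preview of the keyword-to-category mapping, the sketch below picks the category whose vector is closest to a keyword's vector by cosine similarity. The three-dimensional vectors and the category names are made up for illustration only; real word2vec vectors are learned from a corpus and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT real word2vec output.
toy_vectors = {
    "lipstick": [0.9, 0.1, 0.0],
    "beauty": [0.8, 0.2, 0.1],
    "fitness": [0.1, 0.9, 0.2],
    "travel": [0.0, 0.2, 0.9],
}
categories = ["beauty", "fitness", "travel"]  # assumed pre-defined categories


def closest_category(keyword):
    """Return the category whose vector is most similar to the keyword's vector."""
    return max(categories, key=lambda c: cosine(toy_vectors[keyword], toy_vectors[c]))


print(closest_category("lipstick"))
```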

Why is this important:
- Categorizing influencers manually is an impossible task
- Brands can find the right influencer quickly

Using 50 clusters in the cloud, we can tag roughly 1 million influencers every day

### How can we test our keyword extraction method through a web app interface?
There are different ways to test and deploy a Python module once it has been developed, such as our keyword extraction module

Flask-RESTful is a good choice for this task because:
- It is easy to develop and deploy
- It is completely modular

In the next slide, you will watch a demo of how Flask works with our keyword extraction in the backend
 

## Summary

- We learned that it is possible to extract important words (keywords) using the TF-IDF criterion
- TF-IDF is computed from the Bag-of-Words (BoW) matrix
- TF-IDF is not the only way to extract keywords. The homework will cover the TextRank algorithm
- Sklearn, Gensim and Keras have built-in TF-IDF modules
- TF-IDF or another text vectorization method can also be used for text classification

Next lecture, we will talk about word2vec. It is useful for mapping our keyword list to pre-defined categories