<a href="https://colab.research.google.com/github/BI-DS/ELE-3909/blob/master/lecture6/tf_idf_bigrams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Bigrams and the Term Frequency - Inverse Document Frequency

⭐An easy implementation of the TF-IDF using `sklearn` ⭐

Specify the bigrams that you want to treat as tokens in a list called `my_bigrams`. Note that each bigram is itself a list of two words. Since `data` and `scientists` only matter here as part of a bigram, they shouldn't also be counted as standalone tokens, so we put them in a second list called `exclude_words`. The word `science`, however, isn't excluded, because it also appears on its own at the end of the first sentence.

In [2]:
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data'
          ]

my_bigrams = [['data','science'],
              ['data','scientists']
            ]

exclude_words = ['data',
                 'scientists'
              ]

I modify the original `TfidfVectorizer` class in `sklearn` by overriding its `_word_ngrams` method to achieve what we want...


In [5]:
class NewTfidfVectorizer(TfidfVectorizer):
    def _word_ngrams(self, tokens, stop_words=None):
        # Build all unigrams and bigrams first. Passing None keeps the stop
        # words, so bigrams are formed from words adjacent in the original text.
        tokens = super()._word_ngrams(tokens, None)

        new_tokens = []
        for token in tokens:
            split_words = token.split(' ')
            if len(split_words) == 1:
                # Unigram: skip it if it is excluded or if it is a stop word
                if split_words[0] in exclude_words:
                    continue
                if stop_words is None or split_words[0] not in stop_words:
                    new_tokens.append(token)
            elif split_words in my_bigrams:
                # Bigram: keep it only if it is listed in my_bigrams
                new_tokens.append(token)
        return new_tokens

In [8]:
vectorizer = NewTfidfVectorizer(ngram_range=(1,2))
vectors = vectorizer.fit_transform(corpus)
df = pd.DataFrame(vectors.T.todense(), index=vectorizer.get_feature_names_out())
df.head(20)

|  | 0 | 1 | 2 |
|---|---|---|---|
| analyze | 0.0 | 0.0 | 0.707107 |
| best | 0.0 | 0.393129 | 0.0 |
| courses | 0.0 | 0.393129 | 0.0 |
| data science | 0.241215 | 0.298984 | 0.0 |
| data scientists | 0.0 | 0.0 | 0.707107 |
| fields | 0.317168 | 0.0 | 0.0 |
| important | 0.317168 | 0.0 | 0.0 |
| is | 0.241215 | 0.298984 | 0.0 |
| most | 0.317168 | 0.0 | 0.0 |
| of | 0.482429 | 0.298984 | 0.0 |


Note that both bigrams `data science` and `data scientists` are treated as tokens, whereas `data` and `scientists` are not tokens by themselves.
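
The value `0.707107` in the third column can be checked by hand. The sketch below assumes the `sklearn` defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document); `tfidf_row` is a hypothetical helper written only for this check.

```python
import math

def tfidf_row(counts, doc_freqs, n_docs):
    """TF-IDF weights of one document under the assumed sklearn defaults."""
    # Smoothed inverse document frequency for each term
    idfs = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freqs]
    raw = [tf * idf for tf, idf in zip(counts, idfs)]
    # L2 normalization so the document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

# Third sentence: 'analyze' and 'data scientists' each occur once,
# and each appears in only one of the three documents.
weights = tfidf_row(counts=[1, 1], doc_freqs=[1, 1], n_docs=3)
print([round(w, 6) for w in weights])
```

Because the two surviving tokens have the same count and the same document frequency, they share the weight equally, which gives 1/√2 for each.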

Now, notice that words like `is`, `of`, or `the` don't add any meaning to the sentences. We can pass `stop_words` directly to our version of `TfidfVectorizer` to remove them. See below.

In [9]:
vectorizer = NewTfidfVectorizer(ngram_range=(1,2),stop_words='english')
vectors = vectorizer.fit_transform(corpus)
df = pd.DataFrame(vectors.T.todense(), index=vectorizer.get_feature_names_out())
df.head(20)

|  | 0 | 1 | 2 |
|---|---|---|---|
| analyze | 0.0 | 0.0 | 0.707107 |
| best | 0.0 | 0.562829 | 0.0 |
| courses | 0.0 | 0.562829 | 0.0 |
| data science | 0.343851 | 0.428046 | 0.0 |
| data scientists | 0.0 | 0.0 | 0.707107 |
| fields | 0.452123 | 0.0 | 0.0 |
| important | 0.452123 | 0.0 | 0.0 |
| science | 0.687703 | 0.428046 | 0.0 |
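
To see why some rows disappeared, we can look up the words in the built-in English stop word list that `stop_words='english'` refers to. This is a small sketch; the `dropped` and `kept` lists are just words picked from the two tables above.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Words present in the first table but missing from the second one
dropped = ['is', 'most', 'of']
# Single words that survive in the second table
kept = ['analyze', 'best', 'courses', 'fields', 'important', 'science']

for word in dropped + kept:
    print(word, word in ENGLISH_STOP_WORDS)
```

Note that the stop word check is applied only to single words, so the bigrams in `my_bigrams` are kept regardless of the stop word list.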
