# Text Analysis using TF-IDF, Bag of Words and N-Grams

In [4]:
# TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in information retrieval and text mining.
# It evaluates how important a word is to a document relative to a collection of documents (the corpus).

# Applications of TF-IDF 
# • Text classification 
# • Keyword extraction
# • Information retrieval and search engines
# • Text clustering
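
To make the definition concrete, here is a minimal sketch of the classic textbook formulation, TF-IDF = TF × log(N / DF). The toy corpus and the helper names (`tf`, `idf`, `tf_idf`) are assumptions made for illustration; library implementations such as scikit-learn's use a smoothed variant, so their numbers differ.

```python
import math

# Hypothetical toy corpus, already tokenized (not the notebook's dataset).
corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "date"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Classic inverse document frequency: log(N / df).
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "apple" is frequent in document 0 and absent elsewhere, so it scores high.
apple_score = tf_idf("apple", corpus[0], corpus)
# "banana" occurs in every document, so its classic IDF (and score) is 0.
banana_score = tf_idf("banana", corpus[0], corpus)
print(apple_score, banana_score)
```

Under this definition a word that occurs in every document carries no weight, however often it appears in a single document.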

In [6]:
# Bag of Words (BoW) 
# A representation of text in which each document is described by its word frequencies, ignoring word order.
# Each document becomes a vector with one entry per vocabulary word, holding that word's count.
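
The idea can be sketched without any library. The two-sentence corpus and the whitespace tokenization below are assumptions chosen for brevity; a real vectorizer also handles punctuation and other details.

```python
from collections import Counter

# Hypothetical two-document corpus for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]

# The vocabulary is the sorted set of all words in the corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of counts, one entry per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Note that the vectors record how often each word occurs but not where, so word order is lost.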

In [64]:
# N-Grams
# N-Grams are contiguous sequences of n words from a given text. For example: 
# 1-Grams: Single words (unigrams). 
# 2-Grams: Pairs of words (bigrams). 
# 3-Grams: Triples of words (trigrams).
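
A sliding window is enough to produce these sequences. The helper name `ngrams` and the sample sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data science uses machine learning algorithms".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text of k tokens yields k - n + 1 n-grams, so the six-word sentence above gives five bigrams and four trigrams.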

## STEP 1: Set up

In [10]:
# Importing the necessary libraries and modules
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [11]:
# Loading the sample dataset into a list
documents = [ 
"Data science is an interdisciplinary field.", 
"Machine learning is a subset of artificial intelligence.", 
"Data science uses machine learning algorithms.", 
"Artificial intelligence and data science are growing fields." 
] 

## STEP 2: Bag of Words

In [35]:
# Initializing CountVectorizer
vectorizer_bow = CountVectorizer()

# Fitting the vectorizer on the corpus and transforming it into a document-term matrix
X_bow = vectorizer_bow.fit_transform(documents)

# Converting the result into a DataFrame for readability
bow_df = pd.DataFrame(X_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())
print("Bag of Words Model:\n", bow_df)

Bag of Words Model:
    algorithms  an  and  are  artificial  data  field  fields  growing  \
0           0   1    0    0           0     1      1       0        0   
1           0   0    0    0           1     0      0       0        0   
2           1   0    0    0           0     1      0       0        0   
3           0   0    1    1           1     1      0       1        1   

   intelligence  interdisciplinary  is  learning  machine  of  science  \
0             0                  1   1         0        0   0        1   
1             1                  0   1         1        1   1        0   
2             0                  0   0         1        1   0        1   
3             1                  0   0         0        0   0        1   

   subset  uses  
0       0     0  
1       1     0  
2       0     1  
3       0     0  


In [20]:
# Explanation: 
# Each row represents a document. 
# Each column represents a word from the corpus. 
# The values in the matrix represent the frequency of the words in the corresponding document.

#### 1. Which word appears most frequently across all documents?

In [50]:
word_frequencies = bow_df.sum()
most_frequent_words = word_frequencies[word_frequencies == word_frequencies.max()].index.tolist()
print(f"==> Most frequent words: {most_frequent_words}")

==> Most frequent words: ['data', 'science']


#### 2. Are there any words that appear only once?

In [51]:
unique_words = bow_df.columns[bow_df.sum() == 1].tolist()
print(f"==> Words that appear only once: {unique_words}")

==> Words that appear only once: ['algorithms', 'an', 'and', 'are', 'field', 'fields', 'growing', 'interdisciplinary', 'of', 'subset', 'uses']


## STEP 3: TF-IDF

In [77]:
# Initializing the vectorizer (removing common English stop words)
vectorizer_tfidf = TfidfVectorizer(stop_words = 'english')

# Fitting the vectorizer on the corpus and transforming it into a TF-IDF matrix
X_tfidf = vectorizer_tfidf.fit_transform(documents)

# Converting the result into a DataFrame for readability
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
print("TF-IDF Model:\n", tfidf_df)

TF-IDF Model:
    algorithms  artificial      data     field    fields   growing  \
0    0.000000    0.000000  0.380444  0.596039  0.000000  0.000000   
1    0.000000    0.422247  0.000000  0.000000  0.000000  0.000000   
2    0.496414    0.000000  0.316854  0.000000  0.000000  0.000000   
3    0.000000    0.391378  0.316854  0.000000  0.496414  0.496414   

   intelligence  interdisciplinary  learning   machine   science    subset  \
0      0.000000           0.596039  0.000000  0.000000  0.380444  0.000000   
1      0.422247           0.000000  0.422247  0.422247  0.000000  0.535566   
2      0.000000           0.000000  0.391378  0.391378  0.316854  0.000000   
3      0.391378           0.000000  0.000000  0.000000  0.316854  0.000000   

       uses  
0  0.000000  
1  0.000000  
2  0.496414  
3  0.000000  


#### 3. Which term has the highest TF-IDF score in the first document?

In [78]:
highest_tfidf_term = tfidf_df.iloc[0].idxmax()
highest_tfidf_score = tfidf_df.iloc[0].max()
print(f"==> In first document, term with highest TF-IDF score: {highest_tfidf_term} \n Score: {highest_tfidf_score}")

==> In first document, term with highest TF-IDF score: field 
 Score: 0.5960389368177127
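
The scores for the first document can be reproduced by hand, which also shows that "field" and "interdisciplinary" are tied and that `idxmax()` simply reports the first of them. The sketch below assumes scikit-learn's default settings (smoothed IDF, L2 row normalization) and reads the document frequencies off the Bag of Words table; the variable names are illustrative.

```python
import math

n_docs = 4  # number of documents in the corpus

# Document frequencies for the terms of the first document, read off the
# Bag of Words table (the stop words "is" and "an" are removed).
doc_freq = {"data": 3, "science": 3, "interdisciplinary": 1, "field": 1}

# Each of these terms occurs exactly once in the first document.
term_count = {term: 1 for term in doc_freq}

def smoothed_idf(df, n):
    # scikit-learn's default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

raw = {t: term_count[t] * smoothed_idf(doc_freq[t], n_docs) for t in doc_freq}

# L2-normalize the row, as TfidfVectorizer does by default (norm="l2").
norm = math.sqrt(sum(w * w for w in raw.values()))
scores = {t: w / norm for t, w in raw.items()}

# Collect every term that shares the maximum score.
top = max(scores.values())
tied = sorted(t for t, s in scores.items() if math.isclose(s, top))

print(scores)
print(tied)
```

Both tied terms occur once in this document and in no other, so they receive identical weights.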


#### 4. Why do some terms have a TF-IDF score of 0 in certain documents? 

In [None]:
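# A TF-IDF score of 0 means the term contributes nothing to that document's vector.
# This happens when:
# a. The term does not appear in the document (TF = 0).
# b. Under the classic definition IDF = log(N / df), the term appears in every document (IDF = 0).
#    scikit-learn's default smoothed IDF, ln((1 + N) / (1 + df)) + 1, is never 0,
#    so every 0 in the table above comes from case (a).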
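
Whether a term that appears in every document gets an IDF of 0 depends on the formula. The sketch below compares the classic definition with the smoothed variant that scikit-learn applies by default; the function names are illustrative.

```python
import math

def classic_idf(df, n):
    # Textbook definition: log(N / df); equals 0 when df == N.
    return math.log(n / df)

def smoothed_idf(df, n):
    # scikit-learn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

n = 4  # corpus size used in this notebook
for df in (1, 2, 3, 4):
    print(df, round(classic_idf(df, n), 4), round(smoothed_idf(df, n), 4))
```

With smoothing, a term present in all four documents still gets an IDF of 1, so its TF-IDF score is 0 only in documents where it does not occur.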

## STEP 4: N-Grams

In [68]:
# To generate N-Grams, you can adjust the ngram_range parameter of CountVectorizer. 
# Initializing CountVectorizer with bigrams (ngram_range=(2, 2)) 
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 2))

# Fitting the vectorizer on the corpus and transforming it into a document-bigram matrix
X_bigrams = vectorizer_ngrams.fit_transform(documents)

# Convert the result into a DataFrame for readability
bigrams_df = pd.DataFrame(X_bigrams.toarray(), columns=vectorizer_ngrams.get_feature_names_out())
print("Bigrams Model:\n", bigrams_df)

Bigrams Model:
    an interdisciplinary  and data  are growing  artificial intelligence  \
0                     1         0            0                        0   
1                     0         0            0                        1   
2                     0         0            0                        0   
3                     0         1            1                        1   

   data science  growing fields  intelligence and  interdisciplinary field  \
0             1               0                 0                        1   
1             0               0                 0                        0   
2             1               0                 0                        0   
3             1               1                 1                        0   

   is an  is subset  learning algorithms  learning is  machine learning  \
0      1          0                    0            0                 0   
1      0          1                    0            1              

In [69]:
# Explanation: 
# The columns represent the 2-grams (bigrams) that occur in the text. 
# The values in the matrix indicate the frequency of each bigram in the corresponding document. 

#### 5. Which bigram is most frequent across all documents? 

In [76]:
most_frequent_bigram = bigrams_df.sum().idxmax()
print(f"==> Most frequent bigram: {most_frequent_bigram}")

==> Most frequent bigram: data science
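
`idxmax()` returns only the first label when several bigrams share the maximum, so a tie-aware check can be useful. The library-free sketch below recounts the bigrams; the `tokenize` helper is an assumption that mirrors CountVectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science uses machine learning algorithms.",
    "Artificial intelligence and data science are growing fields.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # so the single-letter word "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

bigram_counts = Counter()
for doc in documents:
    tokens = tokenize(doc)
    bigram_counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# Keep every bigram that reaches the maximum count, not just the first.
top_count = max(bigram_counts.values())
top_bigrams = sorted(b for b, c in bigram_counts.items() if c == top_count)

print(top_bigrams, top_count)
```

Here "data science" is the only bigram with the maximum count, so the result agrees with `idxmax()`.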


#### 6. How might bigrams provide additional context compared to unigrams?

In [82]:
# Bigrams look at two consecutive words together, whereas unigrams treat each word separately.
# Bigrams therefore capture local word order and the relationship between adjacent words, which helps identify phrases.
# For example:
# “New York” specifically refers to the well-known city.
# Treating “New” and “York” as separate unigrams might not clearly indicate that meaning.
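
A small illustration of the word-order point: two sentences with opposite meanings can have identical unigram counts but different bigrams. The sentences and helper names are assumptions for illustration.

```python
from collections import Counter

def unigram_counts(text):
    return Counter(text.lower().split())

def bigram_counts(text):
    tokens = text.lower().split()
    # Pair each token with its successor.
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man"
b = "man bites dog"

# Same words, same counts: unigrams cannot tell the sentences apart.
same_unigrams = unigram_counts(a) == unigram_counts(b)
# Different adjacent pairs: bigrams can.
same_bigrams = bigram_counts(a) == bigram_counts(b)

print(same_unigrams, same_bigrams)
```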

## STEP 5: Comparing the Approaches

### Compare the results from BoW and TF-IDF: 
#### 7. How does TF-IDF handle frequent terms differently from BoW?

In [83]:
# TF-IDF assigns lower weights to words that appear very often across all documents, 
# while BoW simply counts each word's frequency without considering its commonness across the dataset. 
# This means that in TF-IDF, common words (even if frequent in a document) are down-weighted, 
# allowing rarer but potentially more important words to stand out.

#### 8. Why might TF-IDF be preferred for some applications?

In [90]:
# TF-IDF is often preferred because it doesn't just count words like the bag-of-words model does. 
# Instead, it gives less weight to words that are common across the corpus (like "the" or "and") and more weight to words that are frequent in a document but rare elsewhere.
# This helps highlight the key words that truly define the content, which is very useful in applications like search engines and document classification.

### Compare unigrams (single words) with bigrams:
#### 9. How do bigrams capture relationships between words?

In [85]:
# Bigrams capture relationships by pairing consecutive words to form meaningful phrases. 
# For example, in our sample dataset, while unigrams treat "data" and "science" as separate tokens, 
# a bigram like "data science" preserves the link between the two words, conveying a specific concept. 
# This ordering helps the model recognize context and associations, such as
# "machine learning" or "artificial intelligence", which unigrams alone might miss.

#### 10. Provide an example where a bigram adds more meaning than individual words.

In [89]:
# Consider the bigram "Social Media."
# Individually, "Social" and "Media" each have broad meanings.
# But when paired together as a bigram, 
# "Social Media" refers to a specific concept: digital platforms where users create and share content or engage with each other. 
# Without the bigram, the meaning is much less clear: "Social" could refer to any social aspect,
# and "Media" could refer to any form of communication.