***Step 1: Set Up***

In [37]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

# corpus 
documents = [ 
    "Data science is an interdisciplinary field.", 
    "Machine learning is a subset of artificial intelligence.", 
    "Data science uses machine learning algorithms.", 
    "Artificial intelligence and data science are growing fields." 
] 

***Step 2: Bag of Words (BoW)***

In [38]:
# Initialize CountVectorizer 
vectorizer = CountVectorizer()

# Fit and transform the corpus 
X = vectorizer.fit_transform(documents)

# Convert the result into a DataFrame for readability 
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()) 

print("Bag of Words Model:\n", bow_df)

Bag of Words Model:
    algorithms  an  and  are  artificial  data  field  fields  growing  \
0           0   1    0    0           0     1      1       0        0   
1           0   0    0    0           1     0      0       0        0   
2           1   0    0    0           0     1      0       0        0   
3           0   0    1    1           1     1      0       1        1   

   intelligence  interdisciplinary  is  learning  machine  of  science  \
0             0                  1   1         0        0   0        1   
1             1                  0   1         1        1   1        0   
2             0                  0   0         1        1   0        1   
3             1                  0   0         0        0   0        1   

   subset  uses  
0       0     0  
1       1     0  
2       0     1  
3       0     0  


In [39]:
# 1. Which word appears most frequently across all documents? 
word_frequencies = bow_df.sum().sort_values(ascending=False)

# Display the most frequent word (idxmax returns only the first label when several words tie for the maximum)
most_frequent_word = word_frequencies.idxmax()
most_frequent_word, word_frequencies

('science',
 science              3
 data                 3
 learning             2
 machine              2
 intelligence         2
 artificial           2
 is                   2
 algorithms           1
 an                   1
 are                  1
 growing              1
 fields               1
 field                1
 and                  1
 interdisciplinary    1
 of                   1
 subset               1
 uses                 1
 dtype: int64)
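
The output above shows `science` and `data` tied at 3 occurrences, while `idxmax()` reports only one of them. A minimal, dependency-free sketch that lists every word sharing the top count follows. It assumes a simplified tokenizer (lowercased runs of two or more word characters) meant to approximate `CountVectorizer`'s default behaviour; the names `tokenize` and `top_words` are illustrative and not part of the original notebook.

```python
import re
from collections import Counter

documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science uses machine learning algorithms.",
    "Artificial intelligence and data science are growing fields.",
]

def tokenize(text):
    # Assumption: approximates CountVectorizer's default
    # (lowercase, keep tokens of two or more word characters)
    return re.findall(r"\b\w\w+\b", text.lower())

# Total count of each word across the whole corpus
counts = Counter(token for doc in documents for token in tokenize(doc))

# Collect every word that reaches the maximum count, not just the first one
max_count = max(counts.values())
top_words = sorted(word for word, n in counts.items() if n == max_count)

print(top_words, max_count)
```

Reporting all tied words avoids suggesting that one of them is strictly more frequent than the other.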

In [40]:
# 2. Are there any words that appear only once?
unique_words = word_frequencies[word_frequencies == 1].index.tolist()
print("Words that appear only once:", unique_words)

Words that appear only once: ['algorithms', 'an', 'are', 'growing', 'fields', 'field', 'and', 'interdisciplinary', 'of', 'subset', 'uses']


***Step 3: TF-IDF***

In [41]:
# TF-IDF Representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display TF-IDF DataFrame
tfidf_df

   algorithms        an       and       are  artificial      data  \
0    0.000000  0.474771  0.000000  0.000000    0.000000  0.303040
1    0.000000  0.000000  0.000000  0.000000    0.348842  0.000000
2    0.496414  0.000000  0.000000  0.000000    0.000000  0.316854
3    0.000000  0.000000  0.406289  0.406289    0.320323  0.259329

      field    fields   growing  intelligence  interdisciplinary        is  \
0  0.474771  0.000000  0.000000      0.000000           0.474771  0.374315
1  0.000000  0.000000  0.000000      0.348842           0.000000  0.348842
2  0.000000  0.000000  0.000000      0.000000           0.000000  0.000000
3  0.000000  0.406289  0.406289      0.320323           0.000000  0.000000

   learning   machine        of   science    subset      uses
0  0.000000  0.000000  0.000000  0.303040  0.000000  0.000000
1  0.348842  0.348842  0.442462  0.000000  0.442462  0.000000
2  0.391378  0.391378  0.000000  0.316854  0.000000  0.496414
3  0.000000  0.000000  0.000000  0.259329  0.000000  0.000000


In [42]:
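# 3. Which term has the highest TF-IDF score in the first document?
# Note: idxmax returns only the first of any tied terms; in the first document
# 'an', 'field', and 'interdisciplinary' all share the top score of 0.474771.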
highest_tfidf_term = tfidf_df.iloc[0].idxmax()
highest_tfidf_score = tfidf_df.iloc[0].max()

print(f"The highest TF-IDF term in the first document is '{highest_tfidf_term}' with a score of {highest_tfidf_score:.4f}")

The highest TF-IDF term in the first document is 'an' with a score of 0.4748


In [43]:
# 4. Why do some terms have a TF-IDF score of 0 in certain documents? 
# Term absence: a term that does not occur in a document has a term frequency (TF) of 0 there.
# Because TF-IDF is the product TF x IDF, its score in that document is 0 regardless of its IDF.
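
To make the TF x IDF product concrete, the sketch below recomputes the first row of the TF-IDF table by hand using only the standard library. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, raw term counts as TF, and L2 normalisation of each row) and a simplified tokenizer; `tokenize` and `tfidf_row` are hypothetical helper names introduced here.

```python
import math
import re
from collections import Counter

documents = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science uses machine learning algorithms.",
    "Artificial intelligence and data science are growing fields.",
]

def tokenize(text):
    # Assumption: approximates the vectorizer's default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Assumed formula: tf * (ln((1 + n) / (1 + df)) + 1), then L2-normalise the row
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(tokenized[0])

# Terms absent from the document never enter the row, so their score is 0
for term in ("an", "is", "data", "uses"):
    print(term, round(row0.get(term, 0.0), 6))
```

Under these assumptions the values for `an`, `is`, and `data` should agree with the first row of the table above, and `uses`, which does not occur in the first document, comes out as 0.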

***Step 3: N-Grams***

In [45]:
# Using CountVectorizer for N-Grams
# Initialize CountVectorizer with bigrams (ngram_range=(2, 2)) 
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the corpus 
X = vectorizer.fit_transform(documents)

# Convert the result into a DataFrame for readability 
ngram_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()) 
 
print("N-Grams (Bigrams) Model:\n", ngram_df) 

N-Grams (Bigrams) Model:
    an interdisciplinary  and data  are growing  artificial intelligence  \
0                     1         0            0                        0   
1                     0         0            0                        1   
2                     0         0            0                        0   
3                     0         1            1                        1   

   data science  growing fields  intelligence and  interdisciplinary field  \
0             1               0                 0                        1   
1             0               0                 0                        0   
2             1               0                 0                        0   
3             1               1                 1                        0   

   is an  is subset  learning algorithms  learning is  machine learning  \
0      1          0                    0            0                 0   
1      0          1                    0            1    

In [52]:
# 5. Which bigram is most frequent across all documents?

bigram_counts = ngram_df.sum().sort_values(ascending=False)
most_frequent_bigram = bigram_counts.idxmax()

print(f"Most frequent bigram across all documents is {most_frequent_bigram}.")

Most frequent bigram across all documents is data science.


In [None]:
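# 6. How might bigrams provide additional context compared to unigrams?
# Bigrams keep adjacent words together, so they preserve local word order and multi-word terms
# such as 'data science' or 'machine learning' that unigram counts split into unrelated tokens.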
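
As an illustration of what a bigram model sees compared with a unigram model, here is a small standard-library sketch that builds n-grams from one sentence of the corpus. The helpers `tokenize` and `ngrams` are hypothetical names, and the tokenizer is a simplified stand-in for the one `CountVectorizer` uses by default.

```python
import re

def tokenize(text):
    # Assumption: lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # Slide a window of n adjacent tokens over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Data science uses machine learning algorithms."
tokens = tokenize(sentence)

print(ngrams(tokens, 1))  # unigrams: isolated words
print(ngrams(tokens, 2))  # bigrams: adjacent word pairs
```

The unigram list contains `machine` and `learning` as separate entries, whereas the bigram list contains `machine learning` as a single feature.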

In [54]:
#  Generating Trigrams
# Initialize CountVectorizer with trigrams (ngram_range=(3, 3)) 
vectorizer = CountVectorizer(ngram_range=(3, 3)) 
 
# Fit and transform the corpus 
X = vectorizer.fit_transform(documents) 
 
# Convert the result into a DataFrame for readability 
ngram_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()) 
 
print("N-Grams (Trigrams) Model:\n", ngram_df)

N-Grams (Trigrams) Model:
    an interdisciplinary field  and data science  are growing fields  \
0                           1                 0                   0   
1                           0                 0                   0   
2                           0                 0                   0   
3                           0                 1                   1   

   artificial intelligence and  data science are  data science is  \
0                            0                 0                1   
1                            0                 0                0   
2                            0                 0                0   
3                            1                 1                0   

   data science uses  intelligence and data  is an interdisciplinary  \
0                  0                      0                        1   
1                  0                      0                        0   
2                  1                      0            

***Step 4: Analyze Combined Representations***

7. **How does TF-IDF handle frequent terms differently from BoW?**
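   - BoW records raw counts, so a word that occurs in most documents gets large values everywhere. TF-IDF multiplies each count by an inverse document frequency factor, which lowers the weight of terms that appear in many documents (here 'data' and 'science', each in 3 of 4 documents) relative to terms confined to a single document.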

8. **Why might TF-IDF be preferred for some applications?**
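   - It down-weights words that occur in many documents and carry little discriminating information, and it up-weights terms that distinguish one document from the others. This is useful for tasks such as search ranking and document classification.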

9. **How do bigrams capture relationships between words?**
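   - A bigram treats each pair of adjacent words as a single feature, so it records which words occur next to each other and in what order. Unigrams discard that information.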

10. **Provide an example where a bigram adds more meaning than individual words.**
    - 'Machine learning' as a bigram names a specific field, whereas the separate unigrams 'machine' and 'learning' could each refer to unrelated things.