*  The Bag-of-Words (BoW) algorithm converts a piece of text into a numerical representation, disregarding grammar and word order but keeping multiplicity (how many times each word occurs).

*  It is a technique used in Natural Language Processing (NLP) to represent text data. It involves:
1. **Tokenization:** Breaking down text into individual words or terms (tokens).
2. **Counting:** Counting the frequency of each word in the text.
3. **Vectorization:** Representing the text as a numerical vector, where each dimension corresponds to a word, and the value represents its frequency in the document.
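
The three steps above can be sketched in plain Python before handing the work to a library. This is a minimal illustration, assuming whitespace tokenization and already lower-cased, punctuation-free text; the names `docs`, `tokenized`, `counts`, `vocabulary` and `vectors` are chosen here for the example and are not part of the notebook's later code.

```python
from collections import Counter

# Two already-cleaned example documents (lowercase, no punctuation).
docs = [
    "fashion is what you buy style is what you do with it",
    "fashion is an art form and expression",
]

# 1. Tokenization: split each document on whitespace.
tokenized = [doc.split() for doc in docs]

# 2. Counting: tally how often each token occurs in each document.
counts = [Counter(tokens) for tokens in tokenized]

# 3. Vectorization: one dimension per vocabulary word, in sorted order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

In the first document the words "is", "what" and "you" each appear twice, so their dimensions hold 2 rather than 1. That is the multiplicity BoW keeps, while the order of the words is lost.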

# Import libraries


In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


# Data Collection

In [4]:
data = [
    'Fashion is an art form and expression.',
    'Style is a way to say who you are without having to speak.',
    'Fashion is what you buy, style is what you do with it.',
    'With fashion, you convey a message about yourself without uttering a single word'
]

# Text preprocessing

In [5]:
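def preprocess_text(text):
    # Lowercase so that "Fashion" and "fashion" count as the same word.
    text = text.lower()
    # Remove everything except lowercase letters and whitespace (punctuation, digits).
    text = re.sub(r'[^a-z\s]', '', text)
    return text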

preprocessed_data = [preprocess_text(doc) for doc in data]

for i, doc in enumerate(preprocessed_data, 1):
    print(f"Data {i}: {doc}")

Data 1: fashion is an art form and expression
Data 2: style is a way to say who you are without having to speak
Data 3: fashion is what you buy style is what you do with it
Data 4: with fashion you convey a message about yourself without uttering a single word


In [6]:
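# Drop common English stop words such as "the", "is", "are" and "and",
# which usually carry little useful information for the analysis.
vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary and build the sparse document-term count matrix.
X = vectorizer.fit_transform(preprocessed_data)

# Vocabulary terms, in the same order as the columns of X.
Word = vectorizer.get_feature_names_out()
print(Word)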


['art' 'buy' 'convey' 'expression' 'fashion' 'form' 'having' 'message'
 'say' 'single' 'speak' 'style' 'uttering' 'way' 'word']
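
The effect of stop-word removal can be sketched without scikit-learn. The stop-word set below is a hypothetical, hand-picked subset used only for illustration; the built-in English list that `stop_words='english'` selects is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only;
# scikit-learn's built-in English list is much longer.
stop_words = {"is", "an", "and", "a", "to", "you", "with", "it"}

doc = "fashion is an art form and expression"

# Keep only the tokens that are not stop words.
kept = [token for token in doc.split() if token not in stop_words]
print(kept)
```

Only "fashion", "art", "form" and "expression" remain, which are the four words from the first document that appear in the vocabulary printed above.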


# BoW representation

In [7]:
bow_df = pd.DataFrame(X.toarray(), columns=Word)

bow_df.index = [f'Data {i}' for i in range(1, len(data) + 1)]
print(bow_df)

        art  buy  convey  expression  fashion  form  having  message  say  \
Data 1    1    0       0           1        1     1       0        0    0   
Data 2    0    0       0           0        0     0       1        0    1   
Data 3    0    1       0           0        1     0       0        0    0   
Data 4    0    0       1           0        1     0       0        1    0   

        single  speak  style  uttering  way  word  
Data 1       0      0      0         0    0     0  
Data 2       0      1      1         0    1     0  
Data 3       0      0      1         0    0     0  
Data 4       1      0      0         1    0     1  


# Calculate TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection.


* **Term Frequency (TF):** This measures how frequently a term (word) appears in a document.
* **Inverse Document Frequency (IDF):** This measures how rare a term is across the entire corpus (collection of documents). A term that appears in few documents gets a high IDF, while a term that appears in most documents gets a low one.
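
As a worked example, the TF-IDF weights for "Data 1" can be computed by hand. The sketch below assumes the default settings of scikit-learn's `TfidfTransformer`: a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The counts and document frequencies are read off the BoW table above; the names `counts`, `doc_freq` and `smooth_idf` are introduced here for illustration.

```python
import math

# Raw term counts for "Data 1" after stop-word removal (from the BoW table).
counts = {"art": 1, "expression": 1, "fashion": 1, "form": 1}

# Number of documents that contain each term (from the BoW table).
doc_freq = {"art": 1, "expression": 1, "fashion": 3, "form": 1}
n_docs = 4


def smooth_idf(df, n):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Unnormalized weight: term frequency times IDF.
raw = {term: tf * smooth_idf(doc_freq[term], n_docs) for term, tf in counts.items()}

# L2 normalization: divide by the Euclidean length of the document vector.
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

for term, value in tfidf.items():
    print(f"{term}: {value:.6f}")
```

"fashion" appears in three of the four documents, so it ends up with a lower weight (about 0.3458) than "art", "expression" and "form" (about 0.5417 each), which appear only in this document.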


In [8]:
# Reweight the raw counts in X by inverse document frequency.
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X)

tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=Word)

tfidf_df.index = [f'Data {i}' for i in range(1, len(data) + 1)]

print("TF-IDF value:")
print(tfidf_df)

TF-IDF value:
             art       buy    convey  expression   fashion      form  \
Data 1  0.541736  0.000000  0.000000    0.541736  0.345783  0.541736   
Data 2  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000   
Data 3  0.000000  0.702035  0.000000    0.000000  0.448100  0.000000   
Data 4  0.000000  0.000000  0.430037    0.000000  0.274487  0.000000   

          having   message       say    single     speak     style  uttering  \
Data 1  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
Data 2  0.465162  0.000000  0.465162  0.000000  0.465162  0.366739  0.000000   
Data 3  0.000000  0.000000  0.000000  0.000000  0.000000  0.553492  0.000000   
Data 4  0.000000  0.430037  0.000000  0.430037  0.000000  0.000000  0.430037   

             way      word  
Data 1  0.000000  0.000000  
Data 2  0.465162  0.000000  
Data 3  0.000000  0.000000  
Data 4  0.000000  0.430037  
