## One-Hot Encoding

#### One-hot encoding is a simple technique in Natural Language Processing (NLP) for representing categorical data, such as words or characters, as numerical vectors.

#### Each word (or character) in the vocabulary is represented as a binary vector where:

#### <li>The vector length is equal to the size of the vocabulary.</li>
#### <li>Exactly one position in the vector is set to 1 (hot), while all others are 0 (cold).</li>
#### <li>The position of the 1 is unique to each word.</li>

In [3]:
corups = ["i love nlp","i teach gen ai","i love data science",
          "i am working with mitsubishi"]

In [4]:
vocab_list = " ".join(corups).split()
vocab_list

['i',
 'love',
 'nlp',
 'i',
 'teach',
 'gen',
 'ai',
 'i',
 'love',
 'data',
 'science',
 'i',
 'am',
 'working',
 'with',
 'mitsubishi']

In [5]:
vocab_set = set(vocab_list)

In [22]:
# Note: set iteration order is arbitrary, so these indices can differ between runs.
word_to_index = {w: i for i, w in enumerate(vocab_set)}

In [33]:
one_hot_vector = []  # one list of one-hot vectors per sentence

for sentence in corups:
    sentence_vector = []

    for word in sentence.split():
        # Start from an all-zero vector and set the word's position to 1.
        vector = [0] * len(vocab_set)
        vector[word_to_index[word]] = 1
        sentence_vector.append(vector)

    one_hot_vector.append(sentence_vector)

one_hot_vector


[[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]],
 [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]],
 [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]],
 [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]
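
#### Because the mapping above is built from a set, the position of the 1 for a given word may change from run to run. The sketch below is one possible variant, not part of the original notebook: it sorts the vocabulary to make the indices reproducible and shows how a vector can be decoded back to its word. The names `sorted_vocab`, `encode` and `decode` are illustrative assumptions, and a two-sentence corpus is used to keep the output short.

```python
# Illustrative sketch: sorted_vocab, encode and decode are hypothetical helper names.
corpus = ["i love nlp", "i teach gen ai"]

# Sorting the unique words makes the word-to-index mapping reproducible.
sorted_vocab = sorted(set(" ".join(corpus).split()))
word_to_index = {w: i for i, w in enumerate(sorted_vocab)}

def encode(word):
    """Return the one-hot vector for a single word."""
    vector = [0] * len(sorted_vocab)
    vector[word_to_index[word]] = 1
    return vector

def decode(vector):
    """Recover the word from the position of the single 1."""
    return sorted_vocab[vector.index(1)]

encoded = [encode(w) for w in "i love nlp".split()]
decoded = [decode(v) for v in encoded]

print(sorted_vocab)
print(encoded)
print(decoded)
```

#### Decoding works only because each vector contains exactly one 1, which is the defining property of the encoding.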

## Bag of Words (BoW)

#### Bag of Words (BoW) is a simple and widely used text representation technique in Natural Language Processing (NLP). It converts text into numerical feature vectors by counting the occurrences of words, ignoring grammar and word order but keeping track of word frequency.
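
#### The counting step can be sketched in plain Python before turning to a library. This is a minimal illustration under simplifying assumptions (whitespace tokenization, a sorted vocabulary); the names `docs`, `vocabulary` and `bag_of_words` are hypothetical and not taken from the notebook.

```python
from collections import Counter

# Small illustrative corpus (an assumption, chosen to keep the vectors short).
docs = ["i love nlp nlp", "i love data science"]

# A fixed, sorted vocabulary so every document maps onto the same columns.
vocabulary = sorted(set(" ".join(docs).split()))

def bag_of_words(text):
    """Count each vocabulary word in the text; word order is discarded."""
    counts = Counter(text.split())
    return [counts[w] for w in vocabulary]

bow_matrix = [bag_of_words(d) for d in docs]

print(vocabulary)
print(bow_matrix)
```

#### Each row has one entry per vocabulary word, and a word that does not occur in a document simply gets a count of 0.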

## Limitations
#### <li>Ignores word order: "I love programming" and "Programming love I" produce the same vector.</li>
#### <li>High dimensionality: each vector has one entry per vocabulary word, so a large vocabulary yields long, sparse vectors.</li>
#### <li>Does not capture semantics: words like "good" and "great" are treated as unrelated features.</li>
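
#### Two of these limitations can be checked directly. The sketch below uses a simplified bag of words (lowercasing and whitespace splitting, both assumptions made here) to show that the two sentence orderings are indistinguishable, and that the count vectors for "good" and "great" have nothing in common. The helper name `bow_counts` is hypothetical.

```python
from collections import Counter

def bow_counts(text):
    """Simplified bag of words: lowercase, split on whitespace, count tokens."""
    return Counter(text.lower().split())

# Word order: both sentences reduce to the same multiset of words.
same_representation = bow_counts("I love programming") == bow_counts("Programming love I")

# Semantics: over the vocabulary ["good", "great"], each word gets its own column,
# so the two vectors are orthogonal even though the words are near-synonyms.
vocabulary = ["good", "great"]
good_vector = [bow_counts("good")[w] for w in vocabulary]
great_vector = [bow_counts("great")[w] for w in vocabulary]
dot_product = sum(a * b for a, b in zip(good_vector, great_vector))

print(same_representation)
print(good_vector, great_vector, dot_product)
```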

In [14]:
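from sklearn.feature_extraction.text import CountVectorizer

corups = ["i i i love nlp nlp nlp","i teach gen ai ai ai","i love data science as data is new oil",
          "i am working with mitsubishi"]

# Build the vocabulary from every whitespace-separated token in the corpus.
vocab_list = " ".join(corups).split()
vocab_set = set(vocab_list)

# A fixed vocabulary is passed in; the columns end up in sorted order,
# as the feature names below show.
vectorizer = CountVectorizer(vocabulary=vocab_set)

# Each row is a document; each column holds the count of one vocabulary word.
# The default tokenizer drops one-character tokens, so the 'i' column stays 0.
x = vectorizer.fit_transform(corups)

x.toarray()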

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0],
       [3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 2, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]], dtype=int64)

In [15]:
vectorizer.get_feature_names_out()

array(['ai', 'am', 'as', 'data', 'gen', 'i', 'is', 'love', 'mitsubishi',
       'new', 'nlp', 'oil', 'science', 'teach', 'with', 'working'],
      dtype=object)
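
#### In the count matrix above, the column for 'i' is 0 in every row even though 'i' appears in every sentence. This follows from the default `token_pattern` of `CountVectorizer`, which only matches tokens of two or more word characters. The sketch below applies that pattern with Python's `re` module; the constant names are illustrative, and the second pattern is one possible alternative for keeping single-character tokens.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern (an assumption for illustration) that also keeps one-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

sentence = "i i i love nlp nlp nlp"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, sentence)

print(default_tokens)
print(all_tokens)
```

#### Passing the second pattern as `token_pattern` when constructing the vectorizer should let 'i' be counted like any other word.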

## TF-IDF (Term Frequency - Inverse Document Frequency)

#### TF-IDF is a statistical measure used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection (corpus) of documents. It down-weights words that appear in many documents and highlights words that are distinctive to a particular document.
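
#### TF-IDF has several variants. The one sketched here is the formulation scikit-learn's `TfidfVectorizer` uses with its default settings (`smooth_idf=True`, `norm='l2'`). The symbols are notation introduced for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

#### Each document vector is then divided by its Euclidean (L2) norm, so every row has length 1.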

In [17]:
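from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: the vocabulary is learned from the corpus and each row is L2-normalized.
tfvector = TfidfVectorizer()

# Reuses the corpus defined in the Bag of Words cell.
x = tfvector.fit_transform(corups)

x.toarray()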

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.25417303, 0.        , 0.        , 0.96715876,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.90453403, 0.        , 0.        , 0.        , 0.30151134,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.30151134, 0.        , 0.        ],
       [0.        , 0.        , 0.32238625, 0.64477251, 0.        ,
        0.32238625, 0.25417303, 0.        , 0.32238625, 0.        ,
        0.32238625, 0.32238625, 0.        , 0.        , 0.        ],
       [0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.5       , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.5       , 0.5       ]])
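
#### The numbers above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` behaviour: raw term counts, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. Filtering whitespace-split tokens by length stands in for the default tokenizer, which is adequate only because this corpus is lowercase and has no punctuation. The names `docs`, `doc_freq` and `tfidf` are illustrative.

```python
import math
from collections import Counter

docs = [
    "i i i love nlp nlp nlp",
    "i teach gen ai ai ai",
    "i love data science as data is new oil",
    "i am working with mitsubishi",
]

# Stand-in for the default tokenizer: keep tokens of two or more characters.
tokenized = [[w for w in d.split() if len(w) >= 2] for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_counts = Counter(tokens)
    weights = {
        w: count * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, count in term_counts.items()
    }
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

first = tfidf(tokenized[0])
last = tfidf(tokenized[3])

print({w: round(v, 4) for w, v in first.items()})
print({w: round(v, 4) for w, v in last.items()})
```

#### In the first sentence, 'nlp' scores far higher than 'love' because it occurs three times and appears in only one document, while 'love' occurs once and is shared with another sentence. In the last sentence every word occurs once and in no other document, so all four weights are equal.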