# Text-to-Vector Conversion

Text-to-vector conversion is a fundamental step in Natural Language Processing (NLP) because machine learning algorithms operate on numbers, not on raw text. Techniques such as One-Hot Encoding (OHE), Bag of Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF) transform text into numerical representations that these algorithms can process. Each technique captures a different aspect of the text.

Each method has its own advantages and disadvantages, so the right choice depends on the nature of the data and the desired outcome of the task. The methods covered are:

1. [One-Hot Encoding (OHE)](#One-Hot-Encoding)

2. [Bag of Words](#Bag-of-Words)

3. [TF-IDF](#TF-IDF)

4. [Word2vec](#Word2vec)

5. [Avgword2vec](#Avgword2vec)

In [1]:
dummy_text_data = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog is lazy but the fox is quick.",
    "Foxes are wild animals and dogs are domestic.",
    "Both dogs and foxes have a tail."
]

In [3]:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data files
# (newer NLTK releases may additionally need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(lemmatized_tokens)

# Apply the cleaning function to the dummy data
cleaned_data = [clean_text(text) for text in dummy_text_data]

print("Cleaned Data:\n", cleaned_data)

Cleaned Data:
 ['quick brown fox jump lazy dog', 'dog lazy fox quick', 'fox wild animal dog domestic', 'dog fox tail']


- Stopword removal: words like "the", "is", "but", "are", and "have" have been removed.

- Lemmatization: words like "jumps" → "jump" and "foxes" → "fox" have been reduced to their base forms.

# One-Hot-Encoding

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Flatten the cleaned data into a list of words
all_words = ' '.join(cleaned_data).split()

# Reshape for OneHotEncoder
ohe_input = [[word] for word in all_words]

# Initialize and fit the encoder
ohe_encoder = OneHotEncoder()
ohe_data = ohe_encoder.fit_transform(ohe_input)

# Display the One-Hot Encoded vectors
print("One-Hot Encoding:\n", ohe_data)

One-Hot Encoding:
   (0, 7)	1.0
  (1, 1)	1.0
  (2, 4)	1.0
  (3, 5)	1.0
  (4, 6)	1.0
  (5, 2)	1.0
  (6, 2)	1.0
  (7, 6)	1.0
  (8, 4)	1.0
  (9, 7)	1.0
  (10, 4)	1.0
  (11, 9)	1.0
  (12, 0)	1.0
  (13, 2)	1.0
  (14, 3)	1.0
  (15, 2)	1.0
  (16, 4)	1.0
  (17, 8)	1.0


**Pros:**

- Simple and intuitive.

- Gives every vocabulary word its own binary vector with a single 1 marking that word.

**Cons:**

- Creates very large and sparse matrices for large vocabularies.

- Does not capture semantic relationships between words.
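
The sparse printout above lists only the position of each 1, which can be hard to read. The sketch below converts the same encoding to a dense array and prints the learned vocabulary order, so each row can be matched to its word. It is a minimal illustration that assumes the four cleaned sentences shown earlier; the names `sentences`, `encoder`, `dense`, and `vocab` are introduced here for the example only.

```python
from sklearn.preprocessing import OneHotEncoder

# Cleaned sentences copied from the preprocessing output above.
sentences = [
    "quick brown fox jump lazy dog",
    "dog lazy fox quick",
    "fox wild animal dog domestic",
    "dog fox tail",
]

# One row per word occurrence, one column per distinct word.
words = [[w] for w in " ".join(sentences).split()]
encoder = OneHotEncoder()
dense = encoder.fit_transform(words).toarray()

# categories_[0] holds the vocabulary in column order (sorted alphabetically).
vocab = list(encoder.categories_[0])
print("Vocabulary:", vocab)
print("Shape:", dense.shape)
print("Vector for", repr(words[0][0]), "->", dense[0])
```

Each row contains exactly one 1, and the number of columns equals the vocabulary size, which is why the matrix grows quickly and stays sparse as the vocabulary grows.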

# Bag-of-Words

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer_bow = CountVectorizer(max_features=200)
bow_matrix = vectorizer_bow.fit_transform(cleaned_data)

# Display the Bag of Words matrix
print("Bag of Words:\n", bow_matrix.toarray())
print("Feature Names:", vectorizer_bow.get_feature_names_out())

Bag of Words:
 [[0 1 1 0 1 1 1 1 0 0]
 [0 0 1 0 1 0 1 1 0 0]
 [1 0 1 1 1 0 0 0 0 1]
 [0 0 1 0 1 0 0 0 1 0]]
Feature Names: ['animal' 'brown' 'dog' 'domestic' 'fox' 'jump' 'lazy' 'quick' 'tail'
 'wild']


In [11]:
vectorizer_bow.vocabulary_

{'quick': 7,
 'brown': 1,
 'fox': 4,
 'jump': 5,
 'lazy': 6,
 'dog': 2,
 'wild': 9,
 'animal': 0,
 'domestic': 3,
 'tail': 8}

In [7]:
# Initialize the CountVectorizer with unigrams and bigrams
vectorizer_bow = CountVectorizer(max_features=200, ngram_range=(1, 2))  # ngram_range=(1, 2) keeps single words and adjacent word pairs
bow_matrix = vectorizer_bow.fit_transform(cleaned_data)

# Display the Bag of Words matrix
print("Bag of Words:\n", bow_matrix.toarray())
print("Feature Names:", vectorizer_bow.get_feature_names_out())

Bag of Words:
 [[0 0 1 1 1 0 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 0]
 [0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0]
 [1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1]
 [0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0]]
Feature Names: ['animal' 'animal dog' 'brown' 'brown fox' 'dog' 'dog domestic' 'dog fox'
 'dog lazy' 'domestic' 'fox' 'fox jump' 'fox quick' 'fox tail' 'fox wild'
 'jump' 'jump lazy' 'lazy' 'lazy dog' 'lazy fox' 'quick' 'quick brown'
 'tail' 'wild' 'wild animal']


In [9]:
vectorizer_bow.vocabulary_

{'quick': 19,
 'brown': 2,
 'fox': 9,
 'jump': 14,
 'lazy': 16,
 'dog': 4,
 'quick brown': 20,
 'brown fox': 3,
 'fox jump': 10,
 'jump lazy': 15,
 'lazy dog': 17,
 'dog lazy': 7,
 'lazy fox': 18,
 'fox quick': 11,
 'wild': 22,
 'animal': 0,
 'domestic': 8,
 'fox wild': 13,
 'wild animal': 23,
 'animal dog': 1,
 'dog domestic': 5,
 'tail': 21,
 'dog fox': 6,
 'fox tail': 12}

**Pros:**

- Captures word frequency.

- Simple to implement.

**Cons:**

- Ignores word order and context; n-grams recover only local order.

- Results in large, sparse matrices for large vocabularies, and adding n-grams enlarges them further.
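
The loss of word order can be shown with a small sketch. The two sentences in `docs` are hypothetical and are not part of the dataset above; they contain the same words in a different order. With unigrams only, their vectors come out identical, while adding bigrams separates them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: same words, opposite meaning.
docs = ["dog chases fox", "fox chases dog"]

# Unigram Bag of Words: only counts matter, so both rows are the same.
unigram = CountVectorizer().fit_transform(docs).toarray()
same_with_unigrams = bool((unigram[0] == unigram[1]).all())

# Unigrams plus bigrams: pairs such as "dog chases" and "fox chases" differ.
bigram = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
same_with_bigrams = bool((bigram[0] == bigram[1]).all())

print("Identical with unigrams:", same_with_unigrams)
print("Identical with unigrams + bigrams:", same_with_bigrams)
```

The trade-off is visible in the outputs above: the unigram vocabulary has 10 features, while the unigram-plus-bigram vocabulary has 24.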

# TF-IDF

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(cleaned_data)

# Display the TF-IDF matrix
print("TF-IDF:\n", tfidf_matrix.toarray())
print("Feature Names:", vectorizer_tfidf.get_feature_names_out())


TF-IDF:
 [[0.         0.51381313 0.268129   0.         0.268129   0.51381313
  0.40509617 0.40509617 0.         0.        ]
 [0.         0.         0.3902801  0.         0.3902801  0.
  0.58964518 0.58964518 0.         0.        ]
 [0.53114624 0.         0.27717414 0.53114624 0.27717414 0.
  0.         0.         0.         0.53114624]
 [0.         0.         0.41988018 0.         0.41988018 0.
  0.         0.         0.8046125  0.        ]]
Feature Names: ['animal' 'brown' 'dog' 'domestic' 'fox' 'jump' 'lazy' 'quick' 'tail'
 'wild']


In [13]:
# Initialize the TfidfVectorizer with bigrams only
vectorizer_tfidf = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer_tfidf.fit_transform(cleaned_data)

# Display the TF-IDF matrix
print("TF-IDF:\n", tfidf_matrix.toarray())
print("Feature Names:", vectorizer_tfidf.get_feature_names_out())

TF-IDF:
 [[0.         0.4472136  0.         0.         0.         0.4472136
  0.         0.         0.         0.4472136  0.4472136  0.
  0.4472136  0.        ]
 [0.         0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.         0.         0.         0.57735027
  0.         0.        ]
 [0.5        0.         0.5        0.         0.         0.
  0.         0.         0.5        0.         0.         0.
  0.         0.5       ]
 [0.         0.         0.         0.70710678 0.         0.
  0.         0.70710678 0.         0.         0.         0.
  0.         0.        ]]
Feature Names: ['animal dog' 'brown fox' 'dog domestic' 'dog fox' 'dog lazy' 'fox jump'
 'fox quick' 'fox tail' 'fox wild' 'jump lazy' 'lazy dog' 'lazy fox'
 'quick brown' 'wild animal']


**Pros:**

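- Weights each term by its frequency in a document and its rarity across the corpus.

- Down-weights words that appear in many documents (here "dog" and "fox", which occur in all four sentences, get the lowest scores).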

**Cons:**

- Still lacks semantic understanding.

- Results in large, sparse matrices.
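
To see where the numbers in the unigram TF-IDF matrix come from, the sketch below recomputes them by hand. It assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. Textbook definitions of TF-IDF vary, so this reproduces the library's variant rather than a universal formula. The names `sentences`, `counts`, `manual`, and `reference` are introduced here for the example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Cleaned sentences copied from the preprocessing output above.
sentences = [
    "quick brown fox jump lazy dog",
    "dog lazy fox quick",
    "fox wild animal dog domestic",
    "dog fox tail",
]

# Term frequency: raw counts per document.
counts = CountVectorizer().fit_transform(sentences).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
df = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Multiply TF by IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result.
reference = TfidfVectorizer().fit_transform(sentences).toarray()
print("Matches TfidfVectorizer:", np.allclose(manual, reference))
print("First row:", np.round(manual[0], 4))
```

Because "dog" and "fox" occur in every sentence, their IDF is exactly 1, the minimum possible under this formula, which is why they receive the smallest non-zero weights in each row.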