**Word Embedding**
- Create a count-based word embedding
- Create a one-hot embedding
- Create a word embedding using Gensim
    - Create a word2vec model using text from Wikipedia: https://en.wikipedia.org/wiki/Machine_learning
    - What preprocessing steps are possible?
    - How is the text tokenized?
    - What is the trained vocabulary?
    - How should words not in the vocabulary be treated?
    - What is the length of each vector?
- Interpret the output of the following

In [1]:
my_docs = ["The economic slowdown is becoming more severe",
           "The movie was simply awesome",
           "I like cooking my own food",
           "Samsung is announcing a new technology",
           "Machine Learning is an example of awesome technology",
           "All of us were excited at the movie",
           "We have to do more to reverse the economic slowdown"]

## Count-based Word Embedding

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance (by default it lowercases the text and
# drops one-character tokens such as "I" and "a")
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(my_docs)

# Vocabulary
vocab = count_vectorizer.get_feature_names_out()
print(vocab)

# Count-based embedding: one row per document, one column per vocabulary word
count_embedding = count_matrix.toarray()

print("Count-based Word Embedding:")
print(count_embedding)

['all' 'an' 'announcing' 'at' 'awesome' 'becoming' 'cooking' 'do'
 'economic' 'example' 'excited' 'food' 'have' 'is' 'learning' 'like'
 'machine' 'more' 'movie' 'my' 'new' 'of' 'own' 'reverse' 'samsung'
 'severe' 'simply' 'slowdown' 'technology' 'the' 'to' 'us' 'was' 'we'
 'were']
Count-based Word Embedding:
[[0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1]
 [0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 2 0 0 1 0]]
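
Each row above corresponds to one document and each column to one vocabulary word, so the 2 in the last row belongs to the word "to", which appears twice in the final sentence. The following is a minimal sketch (not scikit-learn's actual implementation) that reproduces those counts with the standard library, assuming the default behavior of lowercasing the text and keeping only tokens of two or more word characters. The names `TOKEN_PATTERN` and `count_tokens` are illustrative.

```python
import re
from collections import Counter

# Assumed equivalent of the default tokenization: words of 2+ characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_tokens(doc):
    """Lowercase a document and count its tokens."""
    return Counter(TOKEN_PATTERN.findall(doc.lower()))

doc = "We have to do more to reverse the economic slowdown"
counts = count_tokens(doc)

print(counts["to"])    # number of times "to" occurs in the last document
print(sorted(counts))  # the words with a non-zero entry in that row

# One-character tokens are dropped, which is why "i" is not in the vocabulary
print("i" in count_tokens("I like cooking my own food"))
```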


## One-Hot Embedding

In [5]:
# binary=True records only whether a word is present in a document (0 or 1),
# so each row is a binary bag-of-words vector for the whole document
count_vectorizer = CountVectorizer(binary=True)
one_hot_matrix = count_vectorizer.fit_transform(my_docs)

# Vocabulary
vocab = count_vectorizer.get_feature_names_out()

# Presence/absence matrix: one row per document, one column per vocabulary word
one_hot_embedding = one_hot_matrix.toarray()

print("One-Hot Embedding:")
print(one_hot_embedding)

One-Hot Embedding:
[[0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1]
 [0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0]]
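
Note that `binary=True` still produces one row per document: it records whether each word is present, so the only difference from the count matrix is the "to" entry in the last row (1 instead of 2). A one-hot embedding in the stricter sense gives each word its own vector containing a single 1. A minimal sketch of that idea, using a small hand-picked vocabulary and the hypothetical helper `one_hot`:

```python
# Toy vocabulary taken from the second document; sorted so indices are stable
vocab = sorted({"the", "movie", "was", "simply", "awesome"})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in vocab:
    print(f"{word:>8} {one_hot(word)}")
```

Every vector has the same length as the vocabulary, and any two different words are equally far apart, which is why one-hot vectors carry no notion of similarity.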


## Word Embedding using Gensim (Word2Vec)
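
Possible preprocessing before training includes lowercasing, removing punctuation, and splitting the text into sentences so that context windows do not cross sentence boundaries. A rough sketch using only the standard library follows; the regular expressions and the helper name `preprocess` are assumptions for illustration, and NLTK's `sent_tokenize` and `word_tokenize` are more robust alternatives.

```python
import re

def preprocess(text):
    """Split text into sentences, then into lowercase word tokens."""
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in sentences:
        # Keep runs of letters and digits, allowing inner hyphens/apostrophes
        tokens = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized

# Shortened sample based on the Wikipedia text used in the next cell
sample = "Machine learning (ML) is an umbrella term. Data mining is a related field of study."
for tokens in preprocess(sample):
    print(tokens)
```

The result is a list of token lists, which is the shape that `Word2Vec(sentences=...)` expects.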


In [8]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Download the NLTK tokenizer data used by word_tokenize
# (recent NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

text = """
Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines 'discover' their 'own' algorithms, without needing to be explicitly told what to do by any human-developed algorithms. Recently, generative artificial neural networks have been able to surpass results of many previous approaches. Machine learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine, where it is too costly to develop algorithms to perform the needed tasks.
The mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis through unsupervised learning.
ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically-based, computational statistics is an important source of the field's methods.
"""

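# Preprocessing: tokenization only. There is no lowercasing, so "Machine" and
# "machine" stay separate tokens, and punctuation marks are kept as tokens.
tokens = word_tokenize(text)

# Create and train a CBOW (sg=0) Word2Vec model with 100-dimensional vectors.
# The whole text is passed as a single "sentence"; min_count=1 keeps every token.
model = Word2Vec(sentences=[tokens], vector_size=100, window=5, min_count=1, sg=0)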

# Vocabulary: every distinct token, ordered by descending frequency
vocab = list(model.wv.index_to_key)

# Look up a word; a word that is not in the vocabulary has no vector,
# so fall back to None in that case
query_word = "computer"
if query_word in model.wv:
    embedding = model.wv[query_word]
else:
    embedding = None

print("Word2Vec Embedding:")
print("Embedding for the word:", query_word)
print(embedding)

Word2Vec Embedding:
Embedding for the word: computer
[ 3.35471542e-03 -9.13533848e-03  4.59504547e-03  4.63672820e-03
 -7.76442373e-03 -8.24694987e-03 -7.79009005e-03 -7.75163900e-03
  1.76021364e-03  2.73724040e-03 -9.59737413e-03  6.85713626e-03
 -1.05297026e-04 -4.79714992e-03 -5.18830912e-03  9.33995179e-05
  8.09609052e-03  5.34964586e-03  8.75388179e-03 -6.34399801e-03
 -1.98542210e-03 -1.77396648e-03 -7.58295925e-03  1.84197666e-03
 -7.67881237e-03  5.72665315e-03 -8.51520337e-03  1.45647756e-03
 -3.90171044e-06 -7.10296445e-03  8.20576865e-03  6.60096295e-03
  3.34269600e-03  6.19508931e-03 -7.13171787e-04  4.48410353e-03
 -2.08885851e-03  4.66461061e-03  2.40713079e-03 -9.11064795e-04
  4.54581914e-06  2.97173741e-03 -5.67242922e-03  9.98330116e-03
 -7.04643503e-03  9.54082515e-03  5.69678890e-03 -6.27642963e-03
 -5.32261282e-03  2.75894068e-03 -5.74489310e-03  7.32901599e-03
  7.07960362e-03 -8.79097078e-03  3.09709483e-03  1.52279122e-03
 -1.19015633e-03 -7.68693443e-03 -7.3

[nltk_data] Downloading package punkt to C:\Users\Suraj
[nltk_data]     Pathak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
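
The cell above returns `None` for a word that is missing from the vocabulary. "computer" does have a vector here because it appears in the text ("computer vision") and `min_count=1` keeps every token; each vector has 100 entries because `vector_size=100`. One common alternative for unknown words is to fall back to a zero vector of the same length. A minimal sketch of that option, using a plain dictionary with made-up 3-dimensional vectors as a stand-in for the trained model (`toy_vectors` and `lookup` are hypothetical names):

```python
VECTOR_SIZE = 3  # the model above uses 100

# Made-up vectors standing in for a trained model's lookup table
toy_vectors = {
    "machine": [0.1, -0.2, 0.3],
    "learning": [0.0, 0.5, -0.1],
}

def lookup(word, vectors, size=VECTOR_SIZE):
    """Return the word's vector, or a zero vector if the word is unknown."""
    return vectors.get(word, [0.0] * size)

print(lookup("machine", toy_vectors))     # known word: its stored vector
print(lookup("blockchain", toy_vectors))  # unknown word: zero vector
```

Other options include skipping unknown words entirely or mapping them to a dedicated unknown token that is added to the training data; which one fits depends on the downstream task.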
