<a href="https://colab.research.google.com/github/RajarajachozhanVK/RajarajachozhanVK/blob/main/Morphology_is_an_Important_Factor_for_Word_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Morphology Is an Important Factor for Word Embedding**

Aim: To develop algorithms that capture the morphology of words in documents.
Procedure:
Morphology is indeed an important factor for word embedding. Word embedding is a technique in
natural language processing (NLP) that represents words as dense vectors in a continuous vector
space. These embeddings capture semantic relationships and contextual information about words,
enabling machines to better understand and process natural language.
Morphology refers to the structure of words and how they are formed from smaller meaningful
units called morphemes. Words can be broken down into morphemes, such as prefixes, suffixes,
and roots, each contributing to the overall meaning of the word. Understanding morphology is
crucial for accurate word representation in embedding models for several reasons:
1. Semantic Similarity: Words with related meanings often share morphemes (for example,
"connect", "connection", and "connected"). Embedding models that take morphology into account
can place such words closer together in the vector space.
2. Out-of-Vocabulary Handling: Morphological analysis helps in handling out-of-vocabulary
words by recognizing common prefixes, suffixes, or roots. This allows the model to generalize
to new or unseen words based on their morphological structure.
3. Word Sense Disambiguation: Morphological information aids in distinguishing between
different senses of a word. Words with the same spelling but different meanings (homographs)
can be disambiguated based on their morphological context.
4. Agglutinative Languages: In languages where words are formed by chaining prefixes,
suffixes, or infixes onto a root (for example, Turkish or Finnish), understanding morphology
becomes crucial. Models that capture these morphological rules can generate more accurate
embeddings.
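The out-of-vocabulary and similarity points above can be made concrete with character n-grams, the subword units that FastText-style models rely on. The sketch below is a simplified illustration, not FastText's actual implementation (which also hashes n-grams into buckets); the helper names `char_ngrams` and `ngram_overlap` are hypothetical, and the 3 to 6 character range is an assumed setting.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_overlap(word_a, word_b):
    """Jaccard overlap of the two words' n-gram sets (0 = disjoint, 1 = identical)."""
    a, b = set(char_ngrams(word_a)), set(char_ngrams(word_b))
    return len(a & b) / len(a | b)


# The three 3-grams of "<cat>"
print(char_ngrams("cat", 3, 3))

# Morphologically related words share many n-grams; unrelated words share few or none.
print(ngram_overlap("morphology", "morphological"))
print(ngram_overlap("morphology", "document"))
```

Because "morphology" and "morphological" share the stem "morpholog", many of their n-grams coincide, which is the signal a subword-aware model can exploit for words it has never seen.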
In this example, we will use the FastText library to create a word embedding model with
morphology awareness. The FastText model inherently considers subword information (character
n-grams), which is beneficial for capturing morphological nuances.
1. The NLTK library is used for tokenization and stop-word removal.
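2. The FastText model is trained on the preprocessed data. Subword (character n-gram)
information, which carries the morphological signal, is controlled by the minn and maxn
parameters; wordNgrams=2 sets the length of word-level n-grams and does not affect subword
modeling.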
3. The trained model is saved for later use.
4. An example shows how to obtain the embedding for the word “morphology.”

**Implementation**

In [4]:
pip install nltk



In [5]:
pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.12.0-py3-none-any.whl (234 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4239657 sha256=a73b679dc368636ffc3c136c28399cd2cda39f9d536443574ce32ad80a2b0814
  Stored in directory: /root/.cache/pip/wheels/0d/a2/00/81db54d3e6a8199b829d58e02cec2ddb20ce3e59fad8d3c92a
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.3 pybind11-2.12.0


In [10]:
#!pip install nltk
#!pip install fasttext
import nltk
nltk.download('stopwords') # Download the stopwords corpus
nltk.download('punkt')
import fasttext
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# Sample Word document data
document_data = [
    "This is a sample document.",
    "Word embeddings capture morphological information.",
    "Morphology is essential for understanding words.",
    # Add more sentences as needed
]
# Preprocessing
stop_words = set(stopwords.words('english'))
def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words and token not in string.punctuation]
    return " ".join(tokens)
preprocessed_data = [preprocess(sentence) for sentence in document_data]
# Save the preprocessed data to a text file
with open('morphology_document.txt', 'w', encoding='utf-8') as file:
    for sentence in preprocessed_data:
        file.write(sentence + '\n')
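# Train FastText model with morphology awareness
model = fasttext.train_unsupervised(
    'morphology_document.txt',
    model='skipgram',
    minn=3,  # Shortest character n-gram used as a subword unit (library default)
    maxn=6,  # Longest character n-gram used as a subword unit (library default)
    wordNgrams=2,  # Word-level n-gram length; subword information comes from minn/maxn
    minCount=1,  # Minimum count of a word to be considered
    epoch=10  # Number of training epochs
)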
# Save the trained model
model.save_model('morphology_word_embedding.bin')
# Example: Get the embedding for a word
word_embedding = model.get_word_vector('morphology')
print(f'Embedding for "morphology": {word_embedding}')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Embedding for "morphology": [-3.67912871e-05 -4.51257511e-04  3.27752699e-04  2.42517432e-04
 -7.41376027e-07 -6.91938272e-04 -3.17198079e-04  3.48138768e-04
  1.95542147e-04 -5.99805266e-04 -2.36053820e-05 -2.24640913e-04
 -1.45704093e-04  1.25037594e-04  2.67465453e-04 -1.61467367e-04
 -3.69575457e-04 -1.26908606e-04  2.84750917e-04  7.62814525e-05
 -1.07449639e-04 -4.08452121e-04  3.95845156e-04 -1.16585805e-04
  2.35396627e-04 -1.06461055e-04 -4.26862389e-05 -1.13754002e-04
  2.43401955e-04  3.27853806e-04  9.78696335e-06  8.42085647e-05
  3.66075808e-04  2.36371983e-04 -1.82176649e-04  2.96842860e-04
  3.47433350e-04  9.30823517e-05  1.04221115e-04  5.78945655e-05
 -1.35618553e-04 -2.50680750e-04 -1.14625560e-04  2.65853800e-04
 -2.38906068e-05 -1.68257102e-04  3.32706579e-04  6.07833616e-04
 -3.30856041e-04  3.07992275e-04  4.57142996e-05  5.42950293e-04
 -4.90873994e-04 -7.48773236e-05 -4.81654279e-04  1.90033345e-04
  3.66624910e-04 -3.54755204e-04 -1.10380883e-04  2.38397930e-
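
A subword-aware model such as the FastText model above can return a vector for a word that never appeared in the training data, because it can build one from the word's character n-grams. The sketch below is a toy illustration of that idea under stated assumptions, not FastText's implementation: the n-gram vectors are pseudo-random stand-ins for learned ones, the names `ngram_vector` and `compose_word_vector` are hypothetical, and the whole-word vector that FastText adds for in-vocabulary words is omitted.

```python
import math
import random

DIM = 200  # Toy embedding size (an assumption for this sketch)


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def ngram_vector(ngram):
    """Stand-in for a learned n-gram vector: pseudo-random values seeded by the n-gram text."""
    rng = random.Random(ngram)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def compose_word_vector(word):
    """Average the vectors of all character n-grams of `word`."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        for i, value in enumerate(ngram_vector(gram)):
            total[i] += value
    return [value / len(grams) for value in total]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


known = compose_word_vector("morphology")
unseen = compose_word_vector("morphologies")  # treated here as out-of-vocabulary
unrelated = compose_word_vector("document")

print(f"morphology vs morphologies: {cosine(known, unseen):.3f}")
print(f"morphology vs document:     {cosine(known, unrelated):.3f}")
```

Even with random n-gram vectors, the unseen word lands closer to "morphology" than the unrelated word does, purely because the two words share most of their n-grams. In a trained model, the learned n-gram vectors add meaning on top of this structural overlap.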

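In the next example, a Word2Vec baseline is trained for comparison:
1. NLTK is used for tokenization and stop-word removal.
2. The Word2Vec model from gensim is trained on the preprocessed data. Unlike FastText,
Word2Vec learns one vector per whole word and uses no subword information.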
3. The trained model is saved for later use.
4. An example shows how to obtain the embedding for the word “morphology.”

In [13]:
pip install gensim



In [14]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

In [16]:
# nltk.download('stopwords')
# nltk.download('punkt')
# Sample document data
document_data = [
    "This is a sample document.",
    "Word embeddings capture morphological information.",
    "Morphology is essential for understanding words.",
    # Add more sentences as needed
]
# Preprocess the data
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words and token not in punctuation]
    return tokens
preprocessed_data = [preprocess(sentence) for sentence in document_data]
# Train Word2Vec model
model = Word2Vec(sentences=preprocessed_data, vector_size=100, window=5, min_count=1, workers=4)
# Save the trained model
model.save("morphology_word_embedding.model")
# Example: Get the embedding for a word
word_embedding = model.wv['morphology']
print(f'Embedding for "morphology": {word_embedding}')

Embedding for "morphology": [-8.2426779e-03  9.2993546e-03 -1.9766092e-04 -1.9672764e-03
  4.6036304e-03 -4.0953159e-03  2.7431143e-03  6.9399667e-03
  6.0654259e-03 -7.5107943e-03  9.3823504e-03  4.6718083e-03
  3.9661205e-03 -6.2435055e-03  8.4599797e-03 -2.1501649e-03
  8.8251876e-03 -5.3620026e-03 -8.1294188e-03  6.8245591e-03
  1.6711927e-03 -2.1985089e-03  9.5136007e-03  9.4938548e-03
 -9.7740470e-03  2.5052286e-03  6.1566923e-03  3.8724565e-03
  2.0227872e-03  4.3050171e-04  6.7363144e-04 -3.8206363e-03
 -7.1402504e-03 -2.0888723e-03  3.9238976e-03  8.8186832e-03
  9.2591504e-03 -5.9759365e-03 -9.4026709e-03  9.7643770e-03
  3.4297847e-03  5.1661171e-03  6.2823449e-03 -2.8042626e-03
  7.3227035e-03  2.8302716e-03  2.8710044e-03 -2.3803699e-03
 -3.1282497e-03 -2.3701417e-03  4.2764368e-03  7.6057913e-05
 -9.5842788e-03 -9.6655441e-03 -6.1481940e-03 -1.2856961e-04
  1.9974159e-03  9.4319675e-03  5.5843508e-03 -4.2906962e-03
  2.7831673e-04  4.9643586e-03  7.6983096e-03 -1.1442233e