
# **FastText Embedding Technique**

FastText is a word embedding model developed by Facebook’s AI Research (FAIR) team. Unlike Word2Vec, which learns one vector per whole word, FastText also breaks each word into character n-grams (subwords) and learns embeddings for these subword components. This makes FastText more robust to out-of-vocabulary (OOV) words and misspellings.

**Advantages of FastText:**

1) Handles Out-of-Vocabulary (OOV) Words:
Since FastText learns from subwords, it can generate vectors for words not seen during training, unlike Word2Vec and GloVe, which have no vector for OOV words.

2) Captures Morphological Information:
FastText captures word morphology better by using character n-grams. This is especially useful in morphologically rich languages.

3) Works Well on Small Datasets:
Since it uses subword information, FastText often performs better on smaller datasets compared to models like Word2Vec.

**Disadvantages of FastText:**

1) Larger Memory Footprint:
Due to the subword approach, FastText models can consume more memory and storage compared to Word2Vec and GloVe.

2) Slower Training:
Breaking down words into n-grams increases the number of computations required, leading to slower training compared to Word2Vec.

3) Can Over-Rely on Surface Form:
Because an OOV vector is built only from character n-grams, words that merely look alike can receive similar vectors even when their meanings differ, and the vector for a misspelling may drift away from the intended word.

**Applications of FastText:**

1) Text Classification:
FastText embeddings can improve text classification tasks by capturing subword-level information.

2) Spell Checking and Autocorrect:
FastText embeddings can help suggest corrections for misspelled words based on subword similarities.

3) Named Entity Recognition (NER):
Helps in identifying entities even if they are slightly misspelled or appear in different forms.

4) Multilingual Embeddings:
FastText can be used for training multilingual embeddings, useful for translation tasks.


**Comparison of FastText, Word2Vec, and GloVe in a tabular format based on various aspects**

| **Aspect**                 | **FastText**                                 | **Word2Vec**                                | **GloVe**                                      |
|----------------------------|----------------------------------------------|---------------------------------------------|------------------------------------------------|
| **Model Type**              | Predictive (Subword information used)        | Predictive (Contextual)                     | Count-based (Global co-occurrence statistics)  |
| **Subword Information**     | Yes (Handles subwords via n-grams)           | No (Word-level)                             | No (Word-level)                                |
| **Out-of-Vocabulary (OOV)** | Handles OOV by creating vectors for subwords | Does not handle OOV words                   | Does not handle OOV words                      |
| **Morphological Awareness** | High (Captures morphology)                   | Low (Fails with minor word variations)      | Low (Fails with minor word variations)         |
| **Training Speed**          | Slower (n-gram processing adds computation)  | Faster                                      | Two stages: build a co-occurrence matrix, then fit vectors |
| **Memory Requirement**      | High (Due to subword representations)        | Moderate                                    | Moderate for the vectors (no subwords); the co-occurrence matrix can be large during training |
| **Training Data Needed**    | Often works well on small datasets           | Needs larger datasets for good performance  | Needs a large corpus for reliable co-occurrence statistics |
| **Handling Misspellings**   | Good (Generates embeddings for misspelled words) | Poor (Misspellings are treated as separate words) | Poor (Misspellings are treated as separate words) |
| **Pre-trained Availability**| Available (custom training is also common)   | Available (custom training is also common)  | Available (Pre-trained on large corpora)       |
| **Embedding Type**          | Static word vectors (can be trained further on new data) | Static word vectors (can be trained further on new data) | Static word vectors (usually used as pre-trained; updating typically means retraining) |
| **Dimensionality**          | Adjustable during training                   | Adjustable during training                  | Adjustable during training; fixed when using pre-trained vectors |
| **Contextual Information**  | Captures context with subword representations | Captures context with full words            | Captures global word co-occurrence statistics  |
| **Common Applications**     | Text classification, spell check, NER        | Text classification, word similarity tasks  | Word similarity, input features for tasks like sentiment analysis |


# Code for FastText

In [1]:
# Import the necessary libraries
from gensim.models import FastText
from gensim.test.utils import common_texts

In [2]:
# Sample dataset (list of tokenized sentences)
common_texts


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [5]:
# Train FastText model
model_ft = FastText(sentences=common_texts, vector_size=100, window=5, min_count=1, epochs=10)
# sentences: The tokenized sentences used to train the FastText model.
# vector_size: Size of the embedding vector for each word.
# window: Maximum distance between the current and predicted word in the sentence.
# min_count: Minimum frequency of words to include in the vocabulary.
# epochs: Number of training iterations over the corpus.

In [4]:
# Save the model
model_ft.save("fasttext.model")

In [6]:
# Retrieve the vector for a word in the training set
word_vector = model_ft.wv['computer']
print(f"Embedding for 'computer':\n{word_vector}\n")

Embedding for 'computer':
[ 2.9691675e-04  3.3106157e-04 -8.7776285e-04  3.3971368e-04
 -5.0200790e-04 -2.0427848e-03 -1.2411864e-03 -1.9409786e-03
  1.3460165e-03 -2.4136524e-03  9.1865542e-04 -1.0317172e-03
 -7.6370395e-04  7.3127791e-05  1.3834432e-03  5.1939470e-04
 -2.9898330e-04 -1.1951911e-03 -1.1728198e-03 -6.0902466e-04
 -6.7812472e-04  3.9287744e-04  9.9021803e-05  8.1278791e-04
  5.8207958e-04  7.0247462e-04 -7.3670415e-04 -1.0399165e-03
 -6.2516180e-04 -2.4074930e-04 -1.1935978e-03 -2.6611326e-04
  7.3650532e-04 -7.2191353e-04 -1.2753668e-03  1.2436805e-04
  3.7794228e-04 -1.3318789e-03 -2.7351857e-03 -3.0493268e-04
  9.2886400e-04 -7.2840165e-04 -1.1293935e-03 -3.2223188e-04
 -2.0584364e-04 -1.0493779e-04 -6.2300439e-04 -1.6142892e-03
  9.9138310e-04  9.2263552e-05  3.6843625e-04 -5.3785514e-04
  1.1336672e-03  8.7101565e-04 -1.6394290e-03 -8.5611606e-04
 -6.3161540e-04  6.2353048e-04  8.4044196e-04 -1.1285455e-03
  1.2917615e-03 -3.4059497e-04 -1.1788004e-03 -1.6089173e-0

In [8]:
# Handling OOV word 'computr'
if 'computr' in model_ft.wv:
    print(f"'computr' is in the vocabulary")
else:
    print(f"'computr' is not in the vocabulary but FastText can still generate a vector")


'computr' is in the vocabulary


In [7]:
# Checking a word not seen during training (Out-of-Vocabulary)
oov_vector = model_ft.wv['computr']  # Misspelled 'computer'
print(f"Embedding for OOV word 'computr':\n{oov_vector}\n")

Embedding for OOV word 'computr':
[ 1.2959763e-03  8.6262973e-04  1.0625147e-03 -1.0403705e-04
 -1.5368442e-03 -1.6962028e-03 -1.8042364e-04 -7.4284733e-04
  1.1159903e-03 -1.5890810e-03  2.0372857e-04 -4.8860029e-04
 -8.2608103e-04 -1.1068498e-03  9.5997169e-04 -4.6363252e-04
 -7.4049569e-04 -3.5551208e-04 -1.7914166e-03 -4.1690480e-05
  6.1781757e-04  2.3780437e-04  1.4869868e-03  2.3090348e-03
 -7.2294025e-04  9.5657702e-04 -9.6319732e-04 -1.1511220e-03
  2.0535337e-04 -8.2215574e-04 -5.9127103e-04 -8.6116942e-04
 -8.1335296e-05 -1.8366714e-03  8.6622278e-04 -5.4118922e-05
 -1.3347534e-03 -4.3000176e-04 -1.9358672e-03  2.1117524e-04
  1.0365512e-03 -8.5818191e-04  1.8216838e-03 -2.0865335e-03
 -9.4473729e-04  5.9314049e-04  8.9802052e-04 -2.1881014e-03
 -5.0291722e-04  8.2109409e-04 -7.3110219e-04  5.9765473e-04
  2.4574695e-04 -1.1383955e-03 -3.9746417e-04  1.0086796e-04
 -1.1171998e-03 -8.2829612e-04  8.1999600e-04 -1.4038280e-03
  3.9927711e-04 -3.2045655e-03  2.2712503e-04 -1.26

# Inference and analysis of the above exercise

**Inference from the above exercise**

Here we see that the membership check reports the misspelled word 'computr' as present in the vocabulary, even though it never appeared in the training data.

Here's why this happens:

The FastText model's word vectors (model_ft.wv) include subword vectors. When you query a word like "computr", FastText breaks it into character n-grams and builds a vector from the n-gram vectors it learned during training.

So, checking 'computr' in model_ft.wv answers the question "can a vector be produced for this word?" rather than "was this word seen during training?". In Gensim 4.x (the API used here), this check returns True for any FastText query, because a vector can always be assembled from n-grams, even though the full word "computr" itself was never trained directly.

FastText Can Dynamically Create Vectors for New Words:

Even though "computr" wasn't seen in training, FastText can dynamically create a vector for it by using the subword vectors it learned during training. As a result, FastText models can generate vectors for OOV words (out-of-vocabulary words) because they rely on the subword structure.
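The composition step described above can be sketched in plain Python. This is an illustrative toy, not Gensim's or fastText's actual code: the bucket table is filled with made-up numbers, `toy_bucket` uses CRC32 as a stand-in for the hash function a real implementation uses, and averaging the n-gram vectors is one common way to combine them. The helper names (`char_ngrams`, `toy_bucket`, `compose_vector`) are hypothetical.

```python
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def toy_bucket(ngram, num_buckets):
    """Map an n-gram to a bucket index (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def compose_vector(word, bucket_vectors):
    """Average the bucket vectors of every n-gram of `word`."""
    num_buckets = len(bucket_vectors)
    dim = len(bucket_vectors[0])
    ngrams = char_ngrams(word)
    total = [0.0] * dim
    for gram in ngrams:
        vec = bucket_vectors[toy_bucket(gram, num_buckets)]
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(ngrams) for t in total]


# Hypothetical "learned" table: 50 buckets of 4-dimensional made-up values.
NUM_BUCKETS, DIM = 50, 4
bucket_vectors = [
    [((b * 7 + d * 3) % 11 - 5) / 10 for d in range(DIM)]
    for b in range(NUM_BUCKETS)
]

# 'computr' was never "trained", yet a vector can still be assembled for it.
oov_vector = compose_vector("computr", bucket_vectors)
print(len(char_ngrams("computr")), "n-grams ->", oov_vector)
```

Because every string maps to some set of buckets, a vector can be produced for any query, which is consistent with the membership check above reporting True.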

**How FastText Handles Misspelled Words:**

Breaking Words into Subwords:

FastText represents each word as a bag of character n-grams (subword units), by default of length 3 to 6, taken from the word after adding boundary markers at its start and end. For example, the word "computer" yields n-grams such as com, omp, put, and uter. These subword embeddings are learned during training.

Misspelled Words:

When FastText encounters a misspelled word (e.g., "computr" instead of "computer"), it breaks the word into n-grams in the same way, such as com, omp, put, and utr. Since many n-grams of "computr" also occur in "computer" (e.g., com, omp, and put), the resulting vector for "computr" tends to be close to that of "computer" but not identical.

Creating a Vector:

FastText generates a vector for the misspelled word by combining the embeddings of its n-grams (summing or averaging them, depending on the implementation). Since many n-grams overlap between the misspelled and the correctly spelled word, the vectors tend to be similar.

Similarity to Correct Word:

The vector for the misspelled word (e.g., "computr") will be similar to but distinct from the vector for the correct word ("computer"). This makes FastText resilient to minor spelling mistakes or variations.
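To make the overlap argument concrete, the following sketch lists the n-grams that "computer" and "computr" share. It assumes n-gram lengths 3 to 6 and `<` / `>` boundary markers, which match the usual FastText defaults; `char_ngrams` is a hypothetical helper written for this illustration, and the Jaccard ratio is only a rough proxy for how similar the resulting vectors might be.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` with boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


correct = char_ngrams("computer")
misspelled = char_ngrams("computr")

# N-grams present in both words contribute identical embeddings to both vectors.
shared = correct & misspelled
jaccard = len(shared) / len(correct | misspelled)

print("Shared n-grams:", sorted(shared))
print("Only in 'computr':", sorted(misspelled - correct))
print(f"{len(shared)} shared n-grams, Jaccard overlap = {jaccard:.2f}")
```

The n-grams that appear only in "computr" (for example, utr) are what keep its vector from being identical to the vector for "computer".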

**Important**

FastText can generate embeddings for misspelled or OOV words using subwords, but it does not modify its vocabulary or add those words to it.

The generated vectors for OOV words are temporary and not retained in the model's vocabulary for future use.

This allows FastText to handle spelling variations and OOV words flexibly, but the underlying vocabulary remains static after training.

To check whether a word was actually seen during training, we can test membership in the following attribute instead:

**model_ft.wv.key_to_index**: This dictionary maps each word that was seen during training to its index in the vocabulary.

By using key_to_index, you ensure that only words explicitly present in the training set are counted as being in the vocabulary.

In [12]:
# Handling OOV word 'computr'
if 'computr' in model_ft.wv.key_to_index:
    print(f"'computr' is in the vocabulary")
else:
    print(f"'computr' is not in the vocabulary but FastText can still generate a vector")

'computr' is not in the vocabulary but FastText can still generate a vector


With this check, we now get the correct response for the misspelled word.

# Comparing FastText with Word2Vec for OOV words

In [13]:
# Handling OOV word 'computr' in Fast Text model
if 'computr' in model_ft.wv:
    print(f"'computr' is in the vocabulary")
else:
    print(f"'computr' is not in the vocabulary but FastText can still generate a vector")

# Get the vector for the OOV word
oov_word_vector = model_ft.wv['computr']
print(f"OOV vector for 'computr':\n{oov_word_vector}")

'computr' is in the vocabulary
OOV vector for 'computr':
[ 1.2959763e-03  8.6262973e-04  1.0625147e-03 -1.0403705e-04
 -1.5368442e-03 -1.6962028e-03 -1.8042364e-04 -7.4284733e-04
  1.1159903e-03 -1.5890810e-03  2.0372857e-04 -4.8860029e-04
 -8.2608103e-04 -1.1068498e-03  9.5997169e-04 -4.6363252e-04
 -7.4049569e-04 -3.5551208e-04 -1.7914166e-03 -4.1690480e-05
  6.1781757e-04  2.3780437e-04  1.4869868e-03  2.3090348e-03
 -7.2294025e-04  9.5657702e-04 -9.6319732e-04 -1.1511220e-03
  2.0535337e-04 -8.2215574e-04 -5.9127103e-04 -8.6116942e-04
 -8.1335296e-05 -1.8366714e-03  8.6622278e-04 -5.4118922e-05
 -1.3347534e-03 -4.3000176e-04 -1.9358672e-03  2.1117524e-04
  1.0365512e-03 -8.5818191e-04  1.8216838e-03 -2.0865335e-03
 -9.4473729e-04  5.9314049e-04  8.9802052e-04 -2.1881014e-03
 -5.0291722e-04  8.2109409e-04 -7.3110219e-04  5.9765473e-04
  2.4574695e-04 -1.1383955e-03 -3.9746417e-04  1.0086796e-04
 -1.1171998e-03 -8.2829612e-04  8.1999600e-04 -1.4038280e-03
  3.9927711e-04 -3.2045655e-

In [10]:
# Handling OOV word 'computr' in Word2Vec model
from gensim.models import Word2Vec

# Train Word2Vec model
model_w2v = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, epochs=10)

# Try getting a vector for an OOV word in Word2Vec
try:
    w2v_vector = model_w2v.wv['computr']
except KeyError:
    print("'computr' not found in Word2Vec vocabulary!")

'computr' not found in Word2Vec vocabulary!


# Comparing FastText with GloVe for OOV words

### FastText vs GloVe: Handling OOV Words and Vocabulary Management

| **Aspect**                 | **FastText**                                                                                   | **GloVe**                                                                                     |
|----------------------------|------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| **Vocabulary Handling**     | Vocabulary is based on words seen during training, but **subword n-grams** (subword pieces) are learned and used to generate embeddings for OOV words dynamically. | Vocabulary is static and contains only words seen during training. GloVe does **not handle subwords** or dynamically create vectors for OOV words.|
| **Handling OOV Words**      | Can handle OOV words by generating embeddings using subword n-grams (e.g., "computr" can be broken down into subword pieces like "com", "put", "utr", and generate a vector). | GloVe **cannot handle OOV words**. If you query an OOV word (e.g., "computr"), it will result in an error, as no vector exists for the word.|
| **Misspelled Words**        | Can generate vectors for misspelled words using subword components, resulting in a vector close to the correct word but not identical. | Treats misspelled words as completely new words, and no vector is generated unless they were part of the training data. |
| **Dynamic Vector Creation** | Vectors for OOV and misspelled words are generated dynamically at query time using subwords, but the misspelled word is **not added to the vocabulary**. | GloVe uses static pre-trained vectors, and no new vectors are created after training. No dynamic vector creation for OOV words. |
| **Vocabulary Expansion**    | Does not expand the vocabulary when encountering OOV words. It dynamically creates vectors using subword n-grams but does not add the words to the model's vocabulary. | Vocabulary is fixed after training. There is **no capability to expand the vocabulary** or create vectors for OOV words without retraining the model. |
| **Vector Consistency**      | Subword modeling helps ensure that similar words (including misspelled words) have similar vectors, reducing the impact of minor variations in word forms. | Vectors are only available for words seen during training. Misspelled or new words have no vector, leading to consistency issues with OOV words. |
| **Training Data**           | Works well even on smaller datasets because it uses subword information to generalize to unseen words. | Requires large datasets for good performance because the co-occurrence statistics rely on seeing enough examples of each word. |
| **Pre-trained Models**      | Pre-trained FastText models are available and can be used for various tasks; they can also be trained further on new data. | Pre-trained GloVe models are widely available and are typically used as-is, since updating them generally means retraining. |

---

### Key Differences Between FastText and GloVe:

1. **Subword Information**:
   - **FastText**: Leverages subwords (character n-grams), which allows it to generate vectors for unseen or misspelled words.
   - **GloVe**: Does not use subwords, which means it cannot handle OOV words and requires retraining to expand vocabulary.

2. **OOV Word Handling**:
   - **FastText**: Can handle OOV words dynamically using subwords. This makes it much more robust to spelling variations, new words, or morphological changes.
   - **GloVe**: Fails with OOV words. If a word wasn’t seen during training, no vector is generated, and the model will return an error.

3. **Vocabulary Expansion**:
   - **FastText**: Does not expand its vocabulary after training, but it dynamically generates vectors using subwords.
   - **GloVe**: No vocabulary expansion is possible unless the model is retrained.

4. **Efficiency in Handling Misspellings**:
   - **FastText**: Handles minor variations and misspellings well because it can still create vectors based on subwords.
   - **GloVe**: Treats misspellings as entirely new words and fails to generate a vector unless explicitly trained on those variations.

