# Basic NLP Course

## Introduction to Word Embeddings

Word embeddings are dense vector representations of words that capture meaning, semantic relationships, and syntactic roles. Unlike sparse count-based methods such as Bag of Words or TF-IDF, embeddings place words in a continuous vector space where geometric closeness reflects how words are related, which lets models generalize across related words.

### Concepts of Word Embeddings

- **Dense Representation**: Each word is a dense vector with a fixed, relatively small number of dimensions (typically a few hundred), rather than a sparse vector with one dimension per vocabulary word.
- **Semantic Similarity**: Words with similar meanings are located closer together in the vector space.
- **Contextual Understanding**: Embeddings are learned from the contexts in which words appear, so words used in similar contexts receive similar vectors.
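
The contrast between sparse and dense representations can be sketched with a few toy vectors. The dense values below are invented purely for illustration (they do not come from any trained model), and the `cosine` helper is a plain-Python stand-in for a library routine.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["pump", "compressor", "banana"]

# Sparse one-hot vectors: one dimension per vocabulary word.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors, made up for this example only.
dense = {
    "pump":       [0.9, 0.8, 0.1],
    "compressor": [0.8, 0.9, 0.2],
    "banana":     [0.1, 0.0, 0.9],
}

# One-hot vectors are orthogonal, so every pair of distinct words looks unrelated.
print(cosine(one_hot["pump"], one_hot["compressor"]))

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(dense["pump"], dense["compressor"]) > cosine(dense["pump"], dense["banana"]))
```

With one-hot vectors the similarity between any two different words is zero, which is why sparse representations cannot express that two words are related.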

### Popular Word Embedding Models

1. **Word2Vec**:
    - Developed by Google.
    - Two architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
    - CBOW predicts a word given its context, while Skip-Gram predicts the context given a word.

2. **GloVe (Global Vectors for Word Representation)**:
    - Developed by Stanford.
    - Combines global word co-occurrence statistics with local context to generate embeddings.

3. **FastText**:
    - Developed by Facebook.
    - Extends Word2Vec by representing each word as a set of character n-grams (subword units), which lets it build vectors for rare and out-of-vocabulary words.

4. **BERT (Bidirectional Encoder Representations from Transformers)**:
    - Developed by Google.
    - Produces contextual embeddings that take the entire sentence into account, so the same word gets different vectors in different contexts.
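
Two ideas from the list above can be sketched in plain Python: how CBOW and Skip-Gram frame their training examples, and what subword units look like. The function names (`skipgram_pairs`, `cbow_examples`, `char_ngrams`) are hypothetical helpers written for this illustration, not part of any library, and the sketch only builds the examples; it does not train a model.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: Skip-Gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

def char_ngrams(word, n=3):
    """Character n-grams with '<' and '>' boundary markers, in the spirit of FastText."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

tokens = ["worn", "piston", "rings"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
print(char_ngrams("rings"))
```

A single n-gram length is used here for brevity; FastText itself combines several n-gram lengths with the whole word.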

### Example

Consider the words "king", "queen", "man", and "woman". Word embeddings can capture relationships such as:
- **king - man + woman ≈ queen**

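The offset from "man" to "king" is roughly the same as the offset from "woman" to "queen", so simple vector arithmetic can recover the analogy. This is how embeddings encode semantic relationships as directions in the vector space.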
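
The arithmetic can be sketched with hand-made 2-D vectors. These values are invented so that one axis loosely stands for "royalty" and the other for "gender"; real embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors, made up for this example only.
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "table": [-0.8, 0.5],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Exclude the query words themselves, as is common practice for analogy lookups.
query_words = {"king", "man", "woman"}
candidates = {word: vec for word, vec in toy.items() if word not in query_words}

best = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(best)
```

With these toy values the nearest remaining word to the computed vector is "queen".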

### Pros and Cons

| Pros                                           | Cons                                                 |
|------------------------------------------------|------------------------------------------------------|
| Captures semantic and syntactic relationships. | Requires large datasets for training.                |
| Dense representation reduces dimensionality.   | Pre-trained embeddings may not fit specific domains. |
| Improves performance on NLP tasks.             |                                                      |


In [13]:
import spacy
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [3]:
# Load the large English pipeline, which includes static word vectors
nlp = spacy.load("en_core_web_lg")

In [6]:
# load work order samples
data = pd.read_csv('../data/work_orders_sample.csv')
data.head()

Unnamed: 0,failure_mode,description
0,Internal leakage,Compressor CP-001 is experiencing internal lea...
1,Abnormal instrument reading,Compressor CP-101 is showing abnormal pressure...
2,Abnormal instrument reading,Compressor C-101 is giving an abnormal high pr...
3,Abnormal instrument reading,Compressor C-101-A is giving abnormal instrume...
4,Abnormal instrument reading,Compressor CP-101 is giving an abnormal instru...


In [7]:
sample_description = data.sample(1)['description'].values[0]
print(sample_description)

Compressor leak on process line, suspected worn piston rings.


In [9]:
# Columns: text, has_vector, vector L2 norm, is_oov, is_stop, part-of-speech tag
for token in nlp(sample_description):
    print(f'{token.text:12} {token.has_vector} {token.vector_norm:.2f} {token.is_oov} {token.is_stop} {token.pos_:6}')

Compressor   True 7.24 False False PROPN 
leak         True 6.62 False False NOUN  
on           True 5.22 False True ADP   
process      True 6.46 False False NOUN  
line         True 5.82 False False NOUN  
,            True 5.09 False False PUNCT 
suspected    True 6.48 False False VERB  
worn         True 6.53 False False VERB  
piston       True 7.63 False False NOUN  
rings        True 6.58 False False NOUN  
.            True 4.93 False False PUNCT 


In [10]:
# Inspect the dimensionality of a single word vector
print(f"Vector shape: {nlp.vocab['cat'].vector.shape}")


Vector shape: (300,)


In [11]:
# Compare each token with 'machine'; spaCy's similarity() defaults to the cosine similarity of the vectors
base_word = nlp('machine')

for token in nlp(sample_description):
    print(f'{token.text:12} <--> {base_word.text:12} {token.similarity(base_word):.4f}')

Compressor   <--> machine      0.3332
leak         <--> machine      0.2174
on           <--> machine      0.2343
process      <--> machine      0.4120
line         <--> machine      0.3479
,            <--> machine      0.1672
suspected    <--> machine      0.1293
worn         <--> machine      0.1985
piston       <--> machine      0.3027
rings        <--> machine      0.1988
.            <--> machine      0.2211


In [16]:
compressor = nlp('compressor')
machine = nlp('machine')
rotating = nlp('rotating')
gas = nlp('gas')

# Compose an approximate 'compressor' vector by summing the vectors of related concepts
new_compressor = machine.vector + rotating.vector + gas.vector

cosine_sim = cosine_similarity([compressor.vector], [new_compressor])
print(f"Cosine similarity between 'compressor' and the combination of 'machine', 'rotating', and 'gas': {cosine_sim[0][0]:.4f}")

Cosine similarity between 'compressor' and the combination of 'machine', 'rotating', and 'gas': 0.5097
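
The composed vector above is a plain sum. Because cosine similarity depends only on direction and not on length, summing or averaging the component vectors should give the same score. A minimal sketch with made-up 3-D vectors (not the spaCy vectors used above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical vectors standing in for three component words and a target word.
a = [1.0, 2.0, 0.5]
b = [0.5, 1.5, 1.0]
c = [2.0, 0.0, 1.0]
target = [1.0, 1.0, 1.0]

summed = [x + y + z for x, y, z in zip(a, b, c)]
averaged = [s / 3 for s in summed]

# Scaling a vector by a positive constant does not change its cosine similarity.
print(math.isclose(cosine(target, summed), cosine(target, averaged)))
```

So the choice between summing and averaging does not affect the similarity score; what matters is which component words are combined.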
