**Word embeddings**


Word embeddings are an NLP technique that maps words or phrases to dense vectors in a high-dimensional space, so that words with similar meanings are represented by vectors that lie close to each other. Closeness is usually measured with cosine similarity.
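
As a rough illustration of what "close to each other" means, the sketch below computes the cosine similarity between toy vectors. The three-dimensional vectors and the words attached to them are invented for this example; real embeddings have hundreds of dimensions and come from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "embeddings", made up for illustration only
cough = [0.9, 0.1, 0.0]
wheeze = [0.8, 0.2, 0.1]
fracture = [0.0, 0.1, 0.9]

# Related words should score higher than unrelated ones
print(cosine_similarity(cough, wheeze))
print(cosine_similarity(cough, fracture))
```

With these toy values the first score is much higher than the second, which is the behaviour one would hope to see for related and unrelated terms.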

In [1]:
pip install -U sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [19]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np





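The cell above imports the SentenceTransformer class together with pandas and NumPy. The next cell loads two pre-trained models, 'all-MiniLM-L6-v2' and 'average_word_embeddings_glove.840B.300d', which convert sentences into dense vectors known as sentence embeddings. The first is a transformer model that outputs 384-dimensional vectors; the second averages 300-dimensional GloVe word vectors over the words in the input.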



In [None]:
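# Load the two pre-trained models (they are downloaded on first use)
model1 = SentenceTransformer('all-MiniLM-L6-v2')
model2 = SentenceTransformer('average_word_embeddings_glove.840B.300d')

# Read the dataset that contains the symptom names
data = pd.read_csv('activities_symptoms_bool.csv', sep=",")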


In [None]:
# Assign the 'symptomName' column to a variable
symptoms = data['symptomName']


This code block defines two functions, sentence_to_vector1 and sentence_to_vector2, that use the pre-trained models model1 and model2 to convert a given sentence to an embedding vector. It then encodes every entry of the 'symptomName' column with both models and stores the results in a dataframe. The dimensionality of the vectors is fixed by each model (384 for 'all-MiniLM-L6-v2', 300 for the GloVe model), not 768, so it is read from the models instead of being hard-coded.

In [None]:
# Read the dimensionality of the embedding vectors from each model
# (384 for all-MiniLM-L6-v2, 300 for the averaged GloVe model)
EMBEDDING_DIM1 = model1.get_sentence_embedding_dimension()
EMBEDDING_DIM2 = model2.get_sentence_embedding_dimension()

# Convert a sentence to an embedding vector using model1
def sentence_to_vector1(sentence):
    return model1.encode(sentence)

# Convert a sentence to an embedding vector using model2
def sentence_to_vector2(sentence):
    return model2.encode(sentence)

# Convert each value in the 'symptomName' column to an embedding vector using model1
embedding_vectors1 = [sentence_to_vector1(symptom) for symptom in symptoms]

# Convert each value in the 'symptomName' column to an embedding vector using model2
embedding_vectors2 = [sentence_to_vector2(symptom) for symptom in symptoms]

# Create a dataframe that stores both embedding vectors for each symptom
embedding_df = pd.DataFrame({'Symptom': symptoms, 'Embedding1': embedding_vectors1, 'Embedding2': embedding_vectors2})
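
Encoding the symptoms one at a time works, but `encode` also accepts a list of sentences and returns one row per sentence, which is usually faster for a few thousand entries. The sketch below shows that pattern. `encode_column` and `FakeModel` are hypothetical names introduced here; `FakeModel` only stands in for a SentenceTransformer so the example runs without downloading anything.

```python
import numpy as np

def encode_column(model, texts):
    """Encode all texts in a single call and return one vector per text."""
    sentences = list(texts)
    matrix = np.asarray(model.encode(sentences))  # shape: (len(sentences), dim)
    return list(matrix)  # one 1-D array per input, ready for a dataframe column

class FakeModel:
    """Hypothetical stand-in for a SentenceTransformer, for illustration only."""
    def encode(self, sentences):
        # Toy 2-dimensional "embedding": text length and number of lowercase 'a'
        return np.array([[len(s), s.count("a")] for s in sentences], dtype=float)

vectors = encode_column(FakeModel(), ["Abrasion", "Yawn"])
print(len(vectors), vectors[0].shape)
```

In the notebook this would presumably be called as `encode_column(model1, symptoms)` and `encode_column(model2, symptoms)`, assuming every entry of the column is a string.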



In [29]:
embedding_df

,Symptom,Embedding1,Embedding2
0,Abcess,"[-0.009819672, 0.010166229, 0.037522994, 0.017...","[0.02169, -0.18056, -0.085585, -0.56702, -0.37..."
1,Abdomen,"[0.05988404, 0.016402284, -0.04906652, 0.04811...","[-0.73936, -0.18636, 0.59149, 0.47356, 0.59297..."
2,Abortifacient,"[0.0063083256, 0.069451496, 0.009171189, -0.00...","[0.58928, 0.24762, 0.5015, -0.31308, -0.029607..."
3,Abortive,"[-0.014113224, 0.0776526, -0.008357837, 0.0237...","[0.082946, 0.16964, -0.21112, 0.21073, -0.0094..."
4,Abrasion,"[-0.07861289, -0.02588769, 0.034610912, 0.0558...","[-0.37954, 0.44132, 0.036332, 0.2241, 0.087512..."
...,...,...,...
2399,Xeroderma,"[-0.019220931, 0.039361082, -0.008053315, -0.0...","[-0.1321, -0.03129, 0.56218, -0.37585, -0.1352..."
2400,Xerostomia,"[0.026467256, -0.004775559, -0.030214, -0.0261...","[0.44359, -1.2665, -0.32656, 0.016998, 0.29406..."
2401,Yawn,"[0.025499135, 0.016739048, 0.054517888, -0.011...","[-0.18912, 0.051953, -0.007478, -0.51766, -0.1..."
2402,Hyperhidrosis,"[-0.0497032, 0.010405881, 0.015777154, 0.07780...","[0.33859, 0.84053, -0.07664, 0.27558, 0.026969..."


In [31]:
embedding_df.to_csv('word2_embeddings.csv', index=False)
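
Note that `to_csv` writes each array-valued cell as its string representation, so the vectors have to be parsed back from text when the file is reloaded. One possible alternative, sketched here under the assumption that a binary file is acceptable, is to stack the vectors into a 2-D array and store them with NumPy next to the symptom names. The helper names, the file name, and the toy values are hypothetical; since the two models produce vectors of different lengths, each embedding column would be saved to its own file.

```python
import os
import tempfile
import numpy as np

def save_embeddings(path, names, vectors):
    """Store names and a (n_items, dim) matrix of vectors in one .npz archive."""
    np.savez(path, names=np.asarray(names, dtype=str), vectors=np.vstack(vectors))

def load_embeddings(path):
    """Return (list of names, 2-D array of vectors) from an archive written above."""
    with np.load(path) as archive:
        return archive["names"].tolist(), archive["vectors"]

# Toy data standing in for embedding_df['Symptom'] and embedding_df['Embedding1']
names = ["Abrasion", "Yawn"]
vectors = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]

path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")  # hypothetical file name
save_embeddings(path, names, vectors)
loaded_names, loaded_vectors = load_embeddings(path)
print(loaded_names, loaded_vectors.shape)
```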