"Skip-gram Word Embeddings with Logistic Regression Classifier"

This notebook trains Skip-gram word embeddings on a custom text corpus and then fits a logistic regression classifier that estimates the probability of two words occurring near each other. The embeddings are learned with gensim's Word2Vec, so words that appear in similar contexts receive similar vectors. The cosine similarity between two word vectors is the classifier's only input feature, and its output is the estimated probability that the pair appears within three positions of each other in the text.
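
To make the idea of "words occurring nearby" concrete, the sketch below enumerates the (center, context) pairs that a Skip-gram model with a fixed window would treat as positive examples. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of the notebook. gensim builds such pairs internally and also samples negative examples, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first index inside the window
        hi = min(len(tokens), i + window + 1)   # one past the last index inside the window
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


toy_sentence = "the quick brown fox".split()  # hypothetical example text
pairs = skipgram_pairs(toy_sentence, window=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

A larger `window` yields more pairs per word, which is the role the `window=5` argument plays when the model is trained.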

In [113]:
import gensim
import numpy as np
from sklearn.linear_model import LogisticRegression

In [135]:
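# Read the raw corpus; each line of the file is treated as one "sentence".
with open("american_psycho.txt", "r") as f:
    text = f.read()

# Lowercase each line and split it on whitespace into a list of words.
sentences = text.split("\n")
text_data = [sentence.lower().split() for sentence in sentences]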

In [140]:
# sg=1 selects Skip-gram; workers is the number of training threads and must be positive.
model = gensim.models.Word2Vec(text_data, sg=1, window=5, vector_size=100, min_count=3, workers=4)

In [151]:
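def word_similarity(word1, word2):
    """Cosine similarity between the embeddings of two words."""
    try:
        vec1 = model.wv[word1]
        vec2 = model.wv[word2]
    except KeyError:
        # Words seen fewer than min_count times have no vector; fall back to a random score.
        return np.random.rand()
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [152]:
# Flatten the tokenized lines into one word stream so that each pair consists of two words.
tokens = [word for sentence in text_data for word in sentence]

X_train = []
y_train = []
for i in range(len(tokens) - 1):
    # Pair each word with up to five of the words that follow it.
    for j in range(i + 1, min(i + 6, len(tokens))):
        if tokens[i] != tokens[j]:
            X_train.append([word_similarity(tokens[i], tokens[j])])
            # Positive label when the two words are at most three positions apart.
            y_train.append(1 if j - i <= 3 else 0)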
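
The feature fed to the classifier is a cosine similarity, which depends only on the angle between two vectors and not on their lengths. The sketch below recomputes the same quantity in plain Python on hand-picked 2-D vectors so the values are easy to check; the function name `cosine_similarity` and the vectors are assumptions made for illustration, not embeddings from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D "embeddings" chosen to cover the three characteristic cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors, 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite vectors, -1
print(same_direction, orthogonal, opposite)
```

Because the score always lies between -1 and 1, it can be passed to the classifier as a single bounded feature.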

In [157]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
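
With a single input feature, a fitted logistic regression reduces to a sigmoid applied to `w * similarity + b`. The sketch below shows that mapping with made-up values for `w` and `b`; in the notebook the learned values should be available as `clf.coef_[0][0]` and `clf.intercept_[0]`. The helper names `sigmoid` and `cooccurrence_probability` are assumptions introduced here.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cooccurrence_probability(similarity, w, b):
    """Probability of the positive class for a one-feature logistic model."""
    return sigmoid(w * similarity + b)


w, b = 2.0, -0.5  # hypothetical weight and intercept, not values learned in the notebook
for s in (-1.0, 0.0, 0.25, 1.0):
    print(s, round(cooccurrence_probability(s, w, b), 3))
```

If the learned weight is close to zero, the similarity carries almost no signal and every pair receives roughly the same probability, approximately the fraction of positive labels in the training set.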

In [158]:
def predict_probability(word1, word2):
    similarity = word_similarity(word1, word2)
    return clf.predict_proba([[similarity]])[0][1]

In [159]:
predict_probability("american","psycho")

0.607632571881666

In [161]:
predict_probability("life","play")

0.6063473052704731