[SemAxis](https://arxiv.org/pdf/1806.05521.pdf) is a method for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold), which can be used for a range of empirical questions (for one example, see [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)). In this homework, you'll implement SemAxis using word representations from GloVe, and use it to explore corpus-specific conceptual associations.

Before running, install gensim with:

`conda install gensim`


In [5]:
conda install gensim


In [6]:
import re
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import numpy as np
import numpy.linalg as LA



In this homework, we'll be working with pre-trained word embeddings using the `gensim` library, which provides a number of functions for accessing representations for individual words and comparing them. The representations we'll use come from [GloVe](https://nlp.stanford.edu/projects/glove/). Judging by its file name, the `glove.6B` file loaded below belongs to the release trained on Wikipedia 2014 and Gigaword 5; the [Common Crawl](https://en.wikipedia.org/wiki/Common_Crawl) corpus was used for the larger 42B and 840B GloVe releases.

In [7]:
# First we have to convert the GloVe format into w2v format; this creates a new file
glove_file="../data/glove.6B.100d.100K.txt"
glove_in_w2v_format="../data/glove.6B.100d.100K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)



In [8]:
glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)

In [9]:
good_vector=glove["good"]

Functions useful for the first question include the following:

In [16]:
# access the representation for a single word
great_vector=glove["great"]

# use numpy to average multiple vector representations together
vecs_to_average=[good_vector, great_vector]
average=np.mean(vecs_to_average, axis=0)
# calculate the cosine similarity between two vectors
cosine_similarity=glove.cosine_similarities(good_vector, [great_vector])

print(good_vector.shape, great_vector.shape, average.shape, cosine_similarity)

(100,) (100,) (100,) [0.7592798]
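
For intuition about what `cosine_similarities` returns: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand in plain Python on made-up toy vectors, so it runs without the embedding file. The `cosine` helper and the `toy_*` names and values are illustrative assumptions, not GloVe data or part of `gensim`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (made up for illustration, not real GloVe values)
toy_good = [1.0, 2.0, 0.0]
toy_great = [2.0, 4.0, 0.0]    # same direction as toy_good, twice the length
toy_other = [-2.0, 1.0, 0.0]   # perpendicular to toy_good

print(cosine(toy_good, toy_great))   # close to 1: same direction, length is ignored
print(cosine(toy_good, toy_other))   # 0: the dot product is zero
```

Because the measure ignores vector length, only the direction of an embedding matters when comparing words this way.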


**Q1:** Read the SemAxis [paper](https://arxiv.org/pdf/1806.05521.pdf) and implement the SemAxis method described in sections 3.1.2 and 3.1.3.  Given a set of word embeddings for positive terms $S^+ = \{v_1^+, \ldots, v_n^+\}$ and embeddings for negative terms $S^- = \{v_1^-, \ldots, v_m^-\}$ that define the endpoints of the axis, your output should be a single real-valued score for an input word $w$ with word representation $v_w$:

$$
\textrm{score}(w)_{\mathbf{V}_\textrm{axis}} = \cos(v_w, \mathbf{V}_\textrm{axis})
$$

where:
$$
\mathbf{V}^+ = {1 \over n} \sum_{i=1}^n v_i^+
$$

$$
\mathbf{V}^- = {1 \over m} \sum_{i=1}^m v_i^-
$$

$$
\mathbf{V}_{\textrm{axis}} = \mathbf{V}^+ - \mathbf{V}^-
$$
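
As a sanity check on the formulas, here is a minimal dependency-free sketch that applies them to made-up 2-dimensional vectors, where the result can be verified by hand. The helper names (`mean_vector`, `toy_semaxis_score`) and all vector values are hypothetical and chosen only for illustration; this is not the solution to Q1, which should operate on the GloVe vectors.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def toy_semaxis_score(pos_vectors, neg_vectors, target_vector):
    """cos(v_w, V_axis) with V_axis = mean(pos_vectors) - mean(neg_vectors)."""
    v_plus = mean_vector(pos_vectors)
    v_minus = mean_vector(neg_vectors)
    v_axis = [p - q for p, q in zip(v_plus, v_minus)]
    dot = sum(t * a for t, a in zip(target_vector, v_axis))
    norm_target = math.sqrt(sum(t * t for t in target_vector))
    norm_axis = math.sqrt(sum(a * a for a in v_axis))
    return dot / (norm_target * norm_axis)

# Made-up pole vectors; the two sets deliberately differ in size (n = 2, m = 3)
pos = [[1.0, 0.0], [3.0, 0.0]]                  # V+ = [2, 0]
neg = [[-1.0, 0.0], [-3.0, 0.0], [-2.0, 0.0]]   # V- = [-2, 0]
# V_axis = V+ - V- = [4, 0], i.e. the axis points along the first dimension

print(toy_semaxis_score(pos, neg, [1.0, 1.0]))    # about 0.707 (45 degrees from the axis)
print(toy_semaxis_score(pos, neg, [-5.0, 0.0]))   # -1.0 (points at the negative pole)
print(toy_semaxis_score(pos, neg, [0.0, 2.0]))    # 0.0 (perpendicular to the axis)
```

The score depends only on the angle between the target vector and the axis, so a long vector and a short vector pointing the same way receive the same score.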



In [52]:
def get_semaxis_score(vectors, positive_terms=None, negative_terms=None, target_word=None):
    # look up the embedding of every word that defines each pole
    V_plus = [vectors[p_word] for p_word in positive_terms]
    V_neg = [vectors[n_word] for n_word in negative_terms]

    # average each pole, then subtract to get the axis vector
    average_p = np.mean(V_plus, axis=0)
    average_n = np.mean(V_neg, axis=0)
    V_axis = average_p - average_n

    # calculate the cosine similarity between the target word and the axis
    score = vectors.cosine_similarities(vectors[target_word], [V_axis])

    return score[0]

In [53]:
# should be 0.342
get_semaxis_score(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_word="actress")

0.3424988

Now let's score a set of target terms along that axis.

In [54]:
def score_list_of_targets(vectors, positive_terms=None, negative_terms=None, target_words=None):
    scores=[]
    for target in target_words:
        scores.append((get_semaxis_score(vectors, positive_terms, negative_terms, target), target))

    # print from highest to lowest score
    for score, word in sorted(scores, reverse=True):
        print("%.3f\t%s" % (score, word))

In [55]:
targets=["doctor", "nurse", "actor", "actress", "mechanic", "librarian", "architect", "magician", "cook", "chef"]

In [56]:
score_list_of_targets(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_words=targets)

0.342	actress
0.294	nurse
0.219	librarian
0.106	doctor
0.024	actor
0.003	chef
-0.019	cook
-0.075	architect
-0.153	magician
-0.194	mechanic


**Q2:** Define your own concept axis by selecting a set of positive and negative terms and illustrate its utility by scoring a set of 10 target terms (as we did above).

In [58]:
positive_terms=['patience', 'love', 'knowledge', 'perseverance', 'faith', 'hope']
negative_terms=['pride', 'greed', 'wrath', 'envy', 'lust', 'gluttony', 'sloth']
targets=['jealous', 'control', 'playful', 'competitive', 'humble', 'admire', 'frustration', 'anger', 'mentor', 'sharing']

score_list_of_targets(glove, positive_terms=positive_terms, negative_terms=negative_terms, target_words=targets)

0.493	sharing
0.330	control
0.303	mentor
0.280	competitive
0.264	humble
0.199	admire
0.127	frustration
0.048	playful
0.001	anger
-0.087	jealous


**Q3:** Let's assume now that you're able to score all words in a vocabulary along several conceptual dimensions (like the one you've defined) for a given set of word embeddings trained on a dataset.  What could you do with that score? Brainstorm possible applications.

One application is to encourage positive word use in a community sharing environment, or to support parental controls on a children's gaming platform: words that score below a certain threshold could be banned, while positive attitudes and word choices are rewarded.
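
A minimal sketch of the thresholding idea, using the scores printed in the Q2 output above. The `split_by_threshold` helper and the cutoff of 0.1 are hypothetical choices for illustration; a real system would need a tuned threshold and a much larger vocabulary.

```python
# Scores copied from the Q2 output above
q2_scores = {
    "sharing": 0.493, "control": 0.330, "mentor": 0.303, "competitive": 0.280,
    "humble": 0.264, "admire": 0.199, "frustration": 0.127, "playful": 0.048,
    "anger": 0.001, "jealous": -0.087,
}

def split_by_threshold(scores, threshold):
    """Partition words into (at or above threshold, below threshold) by axis score."""
    rewarded = [word for word, score in scores.items() if score >= threshold]
    flagged = [word for word, score in scores.items() if score < threshold]
    return rewarded, flagged

THRESHOLD = 0.1  # hypothetical cutoff, not derived from any data
rewarded, flagged = split_by_threshold(q2_scores, THRESHOLD)
print("rewarded:", rewarded)
print("flagged:", flagged)
```

With this cutoff, "playful" is flagged while "frustration" is not, which suggests that the axis and threshold would both need validation before being used for moderation.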