# Upskills clustering

A number of experiments trying to make sense of the ~200 job offers compiled as part of the upskills project. 
The core idea is to organise the job offers. Since we do not have any annotation, we have opted for clustering. Both the representations and the clustering algorithms come from [scikit-learn](https://scikit-learn.org/) (tweaking the tokenizer a bit).

**Representations**


* [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
* [SVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD) (LSA)

I had considered doc2vec at first, but the computation was taking way too long
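
As a rough illustration of how these representations chain together, here is a minimal sketch that goes from raw text to TF-IDF vectors and then to LSA topic vectors. The toy documents are invented for the example (they are not from the corpus), and the default scikit-learn tokenizer is used instead of the spaCy-based one.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for job offers (hypothetical data, not from the corpus)
toy_docs = [
    "computational linguist with python experience",
    "senior linguist for speech synthesis",
    "localization editor for japanese",
    "data analyst with python and sql",
]

# Sparse TF-IDF matrix: one row per document, one column per term
tfidf = TfidfVectorizer().fit_transform(toy_docs)

# LSA: project the TF-IDF vectors onto 2 latent dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = lsa.fit_transform(tfidf)

print("TF-IDF shape:", tfidf.shape)
print("LSA shape:", topic_vectors.shape)
```

The number of columns drops from the size of the vocabulary to the number of requested components, which is what makes the clustering below cheaper.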

**Clustering alternatives**
* [Birch](https://scikit-learn.org/stable/modules/clustering.html#birch). Requires the number of clusters. 
* [Affinity propagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation). Estimates a reasonable number of clusters.
* [Meanshift](https://scikit-learn.org/stable/modules/clustering.html#mean-shift).  Estimates a reasonable number of clusters.
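
To make the difference between the three algorithms concrete, the following sketch runs them on synthetic 2D blobs (an assumption for illustration; the real input is the document vectors). Birch is told how many clusters to produce, whereas Affinity Propagation and MeanShift decide on their own.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, Birch, MeanShift
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs (toy data, not the job offers)
X, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Birch needs the number of clusters up front
birch_labels = Birch(n_clusters=3).fit_predict(X)

# Affinity Propagation and MeanShift estimate it from the data
ap_labels = AffinityPropagation(random_state=0).fit_predict(X)
ms_labels = MeanShift().fit_predict(X)

for name, labels in [("birch", birch_labels),
                     ("affprop", ap_labels),
                     ("meanshift", ms_labels)]:
    print(name, "->", len(np.unique(labels)), "clusters")
```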

**Requirements (non-python-standard)**
* spaCy 3.0. Tokenization, lemmatization
* scikit-learn (imported as sklearn). Feature computation, clustering
* pandas. Dataframes 

In [None]:
# Downloading the model for spacy 
# RUN ONLY IF YOU DON'T HAVE THE MODELS READY
! python3 -m spacy download en_core_web_sm
# ! python -V
# ! which python
# ! pip3 install --upgrade --upgrade-strategy eager scikit-learn
# ! pip3 install --upgrade spacy

In [1]:
# checking the version of spacy (don't run)
import spacy as kk
  
# Check the version 
print(kk.__version__) 

3.0.3


In [55]:
# import gensim.models as g  # only needed for the discarded doc2vec experiment
import os
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET

import spacy

# from spacy.tokenizer import Tokenizer
# from spacy.lang.en import English

from sklearn.decomposition import PCA, TruncatedSVD
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import AffinityPropagation
from sklearn.cluster import Birch
from sklearn.cluster import MeanShift, estimate_bandwidth

from sklearn.preprocessing import MinMaxScaler


# inference hyper-parameters (leftovers from the doc2vec experiment; unused below)
start_alpha = 0.01
infer_epoch = 1000

**Constants that you might want to modify**

In [17]:
# the path to the corpus
path = "/Users/albarron/tmp/adriano/txt_raw"

# the number of PCA components (e.g., 2, 20, 100)
PCA_COMPONENTS = 100
# Leftover from the doc2vec experiment; not needed for the rest of the notebook
# model = "/Users/albarron/corpora/embeddings/doc2vec/enwiki_dbow/doc2vec.bin"
# load model
# m = g.Doc2Vec.load(model)

In [4]:
def find_files(path):
    """Finds all the txt files under the path folder (recursively) and 
    returns them with their full path"""
    my_files = []
    for root, dirs, files in os.walk(path):
        # join with root (not path) so that files in subfolders get the right path
        my_files.extend(os.path.join(root, file) for file in files if file.endswith(".txt"))
    return my_files
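
A quick way to check this kind of recursive search is to run it on a throwaway folder. The sketch below uses a hypothetical helper, `find_txt_files`, together with a temporary directory; note that `os.walk` yields the folder currently being visited as `root`, so that is what each file name has to be joined with.

```python
import os
import tempfile


def find_txt_files(folder):
    """Hypothetical helper: full path of every .txt file under folder."""
    found = []
    for root, _dirs, files in os.walk(folder):
        # root is the folder being visited, which may be a subfolder
        found.extend(os.path.join(root, name)
                     for name in files if name.endswith(".txt"))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("a.txt", os.path.join("nested", "b.txt"), "notes.md"):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("placeholder")

    # Paths relative to the temporary folder, for readability
    relative = [os.path.relpath(p, tmp) for p in find_txt_files(tmp)]

print(relative)
```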

In [5]:
def extract_xml(file):
    """Extracts the contents from an xml file (which in this 
    corpus actually has a txt extension) and returns them as 
    a dictionary. 
    The assumed tags are: 'id', 'jobtitle', 'about', 'jobdesc', 
    'keyinfo', 'benefits'; 'id' is read as an attribute of the root element.
    If a tag contains internal tags, all their text is simply 
    merged. All text is lowercased.
    """
    with open(file) as f:
        tree = ET.fromstring(f.read())

    elements = {'id': tree.attrib['id']}
    for child in tree:
        # merge the text of the element and of all its descendants
        elements[child.tag] = ' '.join(child.itertext()).lower()
    return elements
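
To see what merging the internal tags looks like, here is a small sketch on an invented XML string. The root tag name `offer` and the contents are assumptions for the example; only the `id` attribute and the child tags follow the layout described in the docstring. Unlike the function above, this sketch also collapses repeated whitespace, which is an optional extra.

```python
import xml.etree.ElementTree as ET

# Invented sample; the root tag name "offer" is hypothetical
sample = (
    '<offer id="Toy001">'
    '<jobtitle>Localization Editor</jobtitle>'
    '<jobdesc>Languages <b>are</b> key</jobdesc>'
    '</offer>'
)

root = ET.fromstring(sample)
record = {"id": root.attrib["id"]}
for child in root:
    # itertext() yields the text of the element and of all its descendants
    merged = " ".join(child.itertext())
    # Optional extra: collapse runs of whitespace left by the merge
    record[child.tag] = " ".join(merged.split()).lower()

print(record)
```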

## Load all the files into a pandas dataframe

In [6]:
files = find_files(path)
# DataFrame.append was removed in pandas 2.0: collect the records first
records = [extract_xml(file) for file in files]
df = pd.DataFrame(
    records,
    columns=['id', 'jobtitle', 'about', 'jobdesc', 'keyinfo', 'benefits'])

# replace empty jobtitles (NAN in a dataframe) with ''
df.jobtitle = df.jobtitle.fillna('')
# Add a new column combining offer title and description
df['jobtitle_desc'] = " " + df.jobtitle + '\n' + df.jobdesc
# 'id' stays as a regular column; it is used as the index of the topic vectors below
# print(df.jobdesc)
print(df.jobtitle_desc)

0       localization editor,\njapanese\nlanguages are...
1       german; english; applied linguistics; computa...
2       english; computational linguistics; general l...
3        english; french; computational linguistics; ...
4       english; computational linguistics: senior an...
                             ...                        
193     french canadian linguist - siri tts\napple is...
194      english; german; spanish; applied linguistic...
195     french; german; syntax: analytical linguist, ...
196     swedish; computational linguistics; lexicogra...
197     senior social analyst, data\noverview\nthe id...
Name: jobtitle_desc, Length: 198, dtype: object


## Representations production

### Alternative 1: tf-idf

In [12]:
# nlp = English()
# tokenizer = Tokenizer(nlp.vocab)

# nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp = spacy.load("en_core_web_sm")
# config = {"mode": "rule"}
# nlp.add_pipe("lemmatizer", config=config)
# This usually happens under the hood
# processed = nlp(doc)


doc = nlp("this is a very small tests trying to check if I am or you are actually getting lemmas and not tokens")
# lemmas = set([w.lemma for w in doc])
# tokens = set([w for w in doc])
# print (len(lemmas), len(tokens))
# print(doc)
print([token.lemma_ for token in doc])


['this', 'be', 'a', 'very', 'small', 'test', 'try', 'to', 'check', 'if', 'I', 'be', 'or', 'you', 'be', 'actually', 'get', 'lemmas', 'and', 'not', 'token']


In [14]:
# tfidf_docs = pd.DataFrame(tfidf_docs)
# # # centers the vectorized documents (BOW vectors) by subtracting the mean
# # tfidf_docs = tfidf_docs - tfidf_docs.mean()

def lemmatized_words(doc):
    """A tokenizer based on spacy that adds lemmas to the vector, 
    rather than tokens. It also ignores whitespace-only 
    and 1-character lemmas"""
    doc = nlp(doc)
    return [w.lemma_ for w in doc if w.lemma_.strip() != "" and len(w.lemma_) > 1]

# Without lemmatization; the vocabulary is huge: 75.4k.
# lemm_vectorizer = TfidfVectorizer(tokenizer=tokenizer) 

# With lemmatization; the vocabulary is reasonable < 3k
lemm_vectorizer = TfidfVectorizer(
    tokenizer=lemmatized_words, # the tokenizer from the function 
    stop_words='english', 
    min_df=2)

tfidf_docs = lemm_vectorizer.fit_transform(raw_documents=df.jobtitle_desc).toarray()
print("Size of the vocabulary:", len(lemm_vectorizer.vocabulary_))

tfidf_docs = pd.DataFrame(tfidf_docs)
print("Shape of the matrix:", tfidf_docs.shape)

# Uncomment if you want to see the features (vocabulary)
# (get_feature_names_out() in scikit-learn >= 1.0; older versions use get_feature_names())
# print("Feature names:\n", lemm_vectorizer.get_feature_names_out())

# print(tfidf_docs)

Size of the vocabulary: 2317
Shape of the matrix: (198, 2317)
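
For reference, the weights in that matrix can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector) and uses three invented, already tokenized documents.

```python
import math
from collections import Counter

# Invented, already tokenized documents (not from the corpus)
docs = [
    ["linguist", "python"],
    ["linguist", "speech"],
    ["editor", "japanese"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df_counts = Counter(term for doc in docs for term in set(doc))


def tfidf_vector(doc):
    """TF-IDF weights for one document, L2-normalised."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(docs[0])
print(vec)
```

A term that occurs in fewer documents ("python" here) ends up with a higher weight than one shared by several documents ("linguist").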


### Alternative 2: PCA

In [19]:
# Scaling the tfidf vectors in [0,1] (recommended for PCA)
scaler = MinMaxScaler()
tfidf_docs_rescaled = scaler.fit_transform(tfidf_docs)

# Computing the principal components (as many as PCA_COMPONENTS)
pca = PCA(svd_solver = 'full', n_components=PCA_COMPONENTS)
pca = pca.fit(tfidf_docs_rescaled)
pca_topic_vectors = pca.transform(tfidf_docs_rescaled)

columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=df.id)
pca_topic_vectors.round(3).head(6)

id,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15,topic16,topic17,topic18,topic19
Facebook001,-0.739,-0.14,-0.029,-0.298,0.086,-0.097,-0.921,-0.159,0.174,0.017,-0.067,-0.546,0.119,-0.828,-0.267,-0.059,0.412,-1.293,-0.218,0.402
Linguist014,0.504,-0.772,-0.043,-0.135,-0.57,0.361,-0.03,0.464,0.551,-0.382,0.327,0.045,0.111,-0.29,0.267,0.0,-0.312,-0.156,0.162,0.3
Linguist028,0.826,-0.059,0.449,-0.1,-0.07,0.978,-1.857,-1.199,-0.666,-1.143,-0.894,0.558,0.544,-0.075,0.038,-1.106,0.412,0.508,-0.277,-0.248
Linguist029,0.971,-0.068,0.295,-0.18,-0.228,1.131,-1.231,-0.983,-1.063,-1.616,-0.769,1.005,0.614,0.739,0.42,-0.217,0.232,-0.771,-0.917,0.497
Linguist015,-0.306,-0.476,-1.245,3.051,1.819,0.225,-0.562,-0.439,-0.274,0.208,0.195,-0.427,0.28,0.318,-0.481,-0.313,-0.33,0.432,-0.659,0.146
Linguist001,1.071,-0.269,0.009,-0.011,-0.4,0.806,-0.168,1.113,-0.048,0.447,-0.069,-0.581,-0.316,0.469,-0.277,-0.262,-0.337,0.051,-0.361,0.289
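
One possible way to pick the number of components, instead of trying values such as 2, 20 or 100 by hand, is to look at the cumulative explained variance. This is only a sketch on random data with artificially decaying variance (an assumption for the example), and the 90% threshold is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy matrix whose columns have decreasing variance (not the tf-idf data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20)) * np.linspace(5.0, 0.1, 20)

# Fit with all the components, then accumulate the explained variance
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that explains at least 90% of the variance
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% of the variance:", n_for_90)
```

With `svd_solver='full'`, `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the number of components needed to reach that fraction of the variance.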


### Alternative 3: TruncatedSVD (LSA)

In [25]:
svd = TruncatedSVD(n_components=100)   #, n_iter=100)
scaler = MinMaxScaler()
tfidf_docs_rescaled = scaler.fit_transform(tfidf_docs)

# Decomposes the TF-IDF vectors and transforms them into topic vectors
svd_topic_vectors = svd.fit_transform(tfidf_docs_rescaled)

# print(svd_topic_vectors.shape)
columns = ['topic{}'.format(i) for i in range(svd.n_components)]
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=df.id)

# Display the first 6 vectors
svd_topic_vectors.round(3).head(6)

id,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic90,topic91,topic92,topic93,topic94,topic95,topic96,topic97,topic98,topic99
Facebook001,1.612,-0.715,-0.102,0.091,-0.315,-0.006,-0.043,-0.893,-0.017,0.211,...,0.553,0.3,0.303,0.019,-0.16,0.015,-0.527,0.116,0.606,-0.12
Linguist014,1.695,0.583,-0.379,0.441,0.013,-0.785,0.263,-0.062,0.69,0.207,...,0.24,-0.325,0.055,0.003,0.611,0.466,-0.274,0.216,0.078,-0.342
Linguist028,2.21,0.785,0.336,0.319,-0.11,-0.283,0.985,-1.767,-1.478,0.987,...,-0.249,-0.007,0.025,0.328,0.257,-0.132,0.129,-0.343,0.039,0.228
Linguist029,2.313,0.92,0.202,0.237,-0.141,-0.537,1.048,-1.181,-1.603,0.708,...,0.2,0.163,0.346,-0.624,-0.424,0.021,0.183,-0.475,-0.255,-0.344
Linguist015,2.483,-0.438,-1.697,0.804,2.13,2.393,0.431,-0.51,-0.459,-0.134,...,-0.067,0.003,0.305,0.033,0.274,0.475,0.029,-0.498,-0.222,-0.297
Linguist001,2.057,1.055,-0.118,0.187,0.079,-0.626,0.639,-0.269,0.77,-0.868,...,-0.021,0.401,-0.185,-0.62,0.024,-0.429,-0.01,0.12,-0.121,-0.172


# Clustering

Note that every alternative you run adds a column with its cluster labels to the df (and that column ends up in the saved tsv).

## Alternative 1: Birch (which requires the number of clusters)

In [29]:
def birch(data, k):
    """Produces a clustering with k clusters for the given data"""
    brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
    # fit_predict returns the cluster label of each row of data
    return brc.fit_predict(data)


### Birch with tfidf

In [31]:
for k in range(1, 21):
    clusters = birch(tfidf_docs, k)
    df[".".join(["birch", "tfidf", str(k)])] = clusters 

### Birch with PCA

In [32]:
for k in range(1, 21):
    clusters = birch(pca_topic_vectors, k)
    df[".".join(["birch", "pca", str(k)])] = clusters 

### Birch with SVD

In [33]:
for k in range(1, 21):
    clusters = birch(svd_topic_vectors, k)
    df[".".join(["birch", "svd", str(k)])] = clusters 
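
Since there are no gold labels, one option for comparing the different values of k is an intrinsic measure such as the silhouette coefficient. The sketch below is an assumption about how this could be done, not part of the original experiments, and it runs on synthetic blobs. The loop starts at k=2 because the silhouette is undefined for a single cluster.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs (toy data, not the job offers)
X, _ = make_blobs(
    n_samples=120,
    centers=[[0, 0], [8, 8], [-8, 8], [8, -8]],
    cluster_std=0.6,
    random_state=0,
)

scores = {}
for k in range(2, 8):
    labels = Birch(n_clusters=k, threshold=0.1).fit_predict(X)
    # Values close to 1 mean compact, well-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("Best k by silhouette:", best_k)
```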

## Alternative 2: [AffinityPropagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation)

In [45]:
def aff_propagation(X):
    """Computes clusters based on Affinity Propagation"""
    # random_state=None: the clustering may change slightly from run to run
    af = AffinityPropagation(random_state=None)
    af.fit(X)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_

    n_clusters_ = len(cluster_centers_indices)

    print('Estimated number of clusters: %d' % n_clusters_)
    # The following measures need gold labels (labels_true), which we do not have
    # print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
    # print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
    # print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
    # print("Adjusted Rand Index: %0.3f"
    #       % metrics.adjusted_rand_score(labels_true, labels))
    # print("Adjusted Mutual Information: %0.3f"
    #       % metrics.adjusted_mutual_info_score(labels_true, labels))
    # The silhouette is intrinsic: it only needs the data and the predicted labels
    # print("Silhouette Coefficient: %0.3f"
    #       % metrics.silhouette_score(X, labels, metric='sqeuclidean'))
    print(labels)
    return labels

### AffinityPropagation with tfidf

In [46]:
# centers the vectorized documents (BOW vectors) by subtracting the mean
X = tfidf_docs
X = X - X.mean()
clusters = aff_propagation(X)
df["affprop.tfidf"] = clusters 

Estimated number of clusters: 40
[ 0 12  1  5 19 24  0  3  7  6 11 25 39 12 24 24  9 31  3  2  9 31 24  3
  7 19  3 24  4 16  3 25  5  7 14 13 31  0 33  6 35  4  7  5 31  7 10  7
 10 10 23 37 29  0 39  6 35  8  9 10 11 24 10 17 12  9 35 13 13  4 14 34
 14 15 16  3 39  0 11  5 10 15  6 15 31  4 21 24 25 17 31 18 31 25 38 15
 20 19 23 22 30 15 32 32 21 36  4 20 29 20 18 31  1 26 25 13 21 37 10 31
 22 23 36 31 18 29 38 24 31  7 34 38  1 25 26  2 31 37  7 34 27 31  7 23
 16 28 12  9 31 27 39 29 29 30 31 34 33 32 21 10 33  8  3 28 24  3  7  2
  7 10 20 19 10 16 19 23  4 34 35 25 24 34 24 10 27 36 19 24  5  3 25 38
 36 37 24 22 38 39]
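
A small check of what `X - X.mean()` does on a dataframe may be useful here: `DataFrame.mean()` returns one mean per column, and the subtraction is aligned on the columns, so every feature ends up centred on zero. The numbers below are invented.

```python
import pandas as pd

# Two toy "documents" with two features each (invented numbers)
toy = pd.DataFrame([[1.0, 10.0], [3.0, 30.0]])

# mean() is computed per column; the subtraction is broadcast over the rows
centered = toy - toy.mean()

print(centered)
print(centered.mean())
```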


### AffinityPropagation with PCA

In [47]:
clusters = aff_propagation(pca_topic_vectors)
df["affprop.pca"] = clusters 

Estimated number of clusters: 28
[ 3  0  6  6  1 24  3 20  2  3  4  3  3 16  0  0  3  3  4 27  2 23  0  4
  0  1  4 18  5  0 20  3  6  3  7 16  3  3 23  2 26  3  3  6 23  2  3 23
 10 10 16 18  3  3  8  5 26  9 23 10 20  0 10  0  0  2 26 16 16  3  3 27
  7  3  0 24 26  3 20  6 24 24 23  3  5  5 16 16  3  0 23 11  5  3 20  3
  0 25 12 13 14  3 15 16 17 27  5 24 23  3 11  5  6 21  2 16 17 18  2 14
 13 12 27 19  3 23 20  0  3  3 27 20 23  3 21  0  3 18  2 27 22 14  3  2
  0 16  0 16  3 22 23  3 26 14 14  3 23 15 23  3 23  9  0 16  0  4  0 27
 23 23 24  1 24  0 25 27  5 27 26  3  0 27  0 10  3 27  1  0 27 16  2  3
 27 18 18 13 20  3]


### AffinityPropagation with SVD

In [50]:
clusters = aff_propagation(svd_topic_vectors)
df["affprop.svd"] = clusters 

Estimated number of clusters: 26
[ 1 21  8  3  0 21  8 25  1  1  8  1  1  8 21  8  1  1  2  1  1  8  8  2
  8  0  2  1  1  1 25  1  3  1  4  8  8  1  8  1 23  1  1  3  8  1  1 21
  6  6  8 24  1  1  5  1 23 20 21  6  8 21  7  8  9 21 23  8  1  8  1 21
  4  1  8 21  8  8 21  3  1  1  1  1  1  1  8 21  1  8  1 10 11  1 25  1
  1 22 12 21 13  1 14  8 15 21  1  8 21  1 10 11  8 16  1  8 15 24  1 13
 21 12  1 17  1  1 25 21  1  1 21 25  1  1 16  8  1 24  1  1 19 13  1 18
 21 21  9  8  1 19  1  1  1 13 13  8 21 14  8  8  1 20 21 21  8 21  8  1
  8  8  8 22  1  8 22  1  1 21 23  1  8 21 21  7  8  1  1  8  1 21  1 21
  1 24 24 21 25  1]


## Alternative 3: [Meanshift](https://scikit-learn.org/stable/modules/clustering.html#mean-shift)

In [56]:
def mean_shift(X):
    """Computes clusters based on MeanShift"""
    # The bandwidth is estimated automatically from the data
    # (quantile=0.2; n_samples caps the number of samples used)
    bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_

    n_clusters_ = len(np.unique(labels))

    print("number of estimated clusters : %d" % n_clusters_)
    return labels
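
The `quantile` passed to `estimate_bandwidth` has a strong effect on the outcome: a larger quantile gives a larger bandwidth, which tends to merge everything into fewer clusters. The following sketch, on synthetic blobs and with arbitrarily chosen quantiles, is one way to explore this.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three synthetic blobs (toy data, not the job offers)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.5,
    random_state=0,
)

results = {}
for q in (0.1, 0.3, 0.9):
    # Larger quantile -> more neighbours considered -> larger bandwidth
    bw = estimate_bandwidth(X, quantile=q, random_state=0)
    labels = MeanShift(bandwidth=bw, bin_seeding=True).fit_predict(X)
    results[q] = (bw, len(np.unique(labels)))

for q, (bw, n_clusters) in results.items():
    print("quantile=%.1f bandwidth=%.2f clusters=%d" % (q, bw, n_clusters))
```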

### MeanShift with tfidf

In [57]:
X = tfidf_docs
X = X - X.mean()
clusters = mean_shift(X)
df["meanshift.tfidf"] = clusters

number of estimated clusters : 1


### MeanShift with PCA

In [64]:
clusters = mean_shift(pca_topic_vectors)
df["meanshift.pca"] = clusters 

number of estimated clusters : 6


### MeanShift with SVD

In [63]:
clusters = mean_shift(svd_topic_vectors)
df["meanshift.svd"] = clusters 


In [61]:
print(df.head(3))
# df.to_csv("/".join([path, "upskills_clusters.tsv"]), sep="\t")

            id                                           jobtitle  \
0  Facebook001                     localization editor,\njapanese   
1  Linguist014  german; english; applied linguistics; computat...   
2  Linguist028  english; computational linguistics; general li...   

                                               about  \
0  about the facebook company\nfacebook's mission...   
1                                                NaN   
2  we are an equal opportunity employer and value...   

                                             jobdesc  \
0  languages are key to our mission of bringing t...   
1  description:\n\nthis is an exciting opportunit...   
2  description:\n\nappen is the world's leading i...   

                                             keyinfo benefits  \
0                                                         NaN   
1  university or organization: nuance communicati...      NaN   
2  university or organization: appen\ndepartment:...      NaN   

            