# Upskills clustering

A number of experiments to make sense of the ~200 job offers compiled as part of the upskills project. 
The core idea is to organise the job offers. Since we do not have any annotations, we have opted for clustering. Both the representations and the clustering algorithms come from [scikit-learn](https://scikit-learn.org/) (with a slightly tweaked tokenizer).

**Representations**


* [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
* [SVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD) (LSA)

I had considered doc2vec at first, but the computation was taking way too long
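
As a rough illustration of how these representations chain together, the sketch below builds TF-IDF vectors for a tiny made-up corpus and reduces them with TruncatedSVD (LSA). The three toy documents and the choice of two components are assumptions for the demo, not the project's data or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus (not taken from the upskills data)
toy_docs = [
    "computational linguist for speech recognition",
    "localization editor for japanese content",
    "senior linguist for speech synthesis",
]

# Sparse TF-IDF matrix: one row per document, one column per term
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# LSA: project the TF-IDF rows onto 2 latent dimensions
toy_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(toy_tfidf)

print("TF-IDF shape:", toy_tfidf.shape)
print("LSA shape:", toy_lsa.shape)
```

The notebook does the same thing further down, only with the spacy-based tokenizer and many more components.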

**Clustering alternatives**
* [Birch](https://scikit-learn.org/stable/modules/clustering.html#birch). Requires the number of clusters. 
* [Affinity propagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation). Estimates the right number of clusters.
* [Meanshift](https://scikit-learn.org/stable/modules/clustering.html#mean-shift).  Estimates the right number of clusters.
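
To make the difference between the three alternatives concrete, here is a minimal sketch on synthetic 2-D data (the blobs and all parameter values are assumptions for the demo). Birch is told how many clusters to produce, whereas for affinity propagation and mean shift the number of clusters is an outcome of the fit.

```python
from sklearn.cluster import AffinityPropagation, Birch, MeanShift
from sklearn.datasets import make_blobs

# Synthetic, well-separated data with 3 groups (assumption for the demo)
toy_X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=0)

# Birch: the number of clusters is given explicitly
birch_labels = Birch(n_clusters=3).fit_predict(toy_X)

# Affinity propagation and mean shift: the number of clusters is estimated
ap_labels = AffinityPropagation(random_state=0).fit_predict(toy_X)
ms_labels = MeanShift().fit_predict(toy_X)

for name, found in [("Birch", birch_labels),
                    ("AffinityPropagation", ap_labels),
                    ("MeanShift", ms_labels)]:
    print(name, "->", len(set(found)), "clusters")
```

The estimated numbers depend on each algorithm's own parameters (preference and damping for affinity propagation, bandwidth for mean shift), so they are not guaranteed to agree with each other.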

**Requirements (non-python-standard)**
* spacy 3.0. Tokenization, lemmatization
* sklearn. Feature computation, clustering
* pandas. Dataframes 

In [None]:
# Downloading the model for spacy 
# RUN ONLY IF YOU DON'T HAVE THE MODELS READY
! python3 -m spacy download en_core_web_sm
# ! python -V
# ! which python
# ! pip3 install --upgrade --upgrade-strategy eager scikit-learn
# ! pip3 install --upgrade spacy

In [1]:
# checking the version of spacy (don't run)
import spacy as kk
  
# Check the version 
print(kk.__version__) 

3.0.3


In [11]:
# import gensim.models as g
import os
import pandas as pd
import xml.etree.ElementTree as ET

import spacy 

# from spacy.tokenizer import Tokenizer
# from spacy.lang.en import English

from sklearn.decomposition import PCA, TruncatedSVD
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import AffinityPropagation
from sklearn.cluster import Birch
from sklearn.cluster import MeanShift

from sklearn.preprocessing import MinMaxScaler


# doc2vec inference hyper-parameters (only needed by the doc2vec experiment)
start_alpha=0.01
infer_epoch=1000

HERE ARE THE CONSTANTS THAT YOU MIGHT WANT TO MODIFY

In [17]:
# the path to the corpus
path = "/Users/albarron/tmp/adriano/txt_raw"

# the number of PCA components (e.g., 2, 20, 100)
PCA_COMPONENTS = 20
# Leftover from when we considered using doc2vec. Not really necessary
# model = "/Users/albarron/corpora/embeddings/doc2vec/enwiki_dbow/doc2vec.bin"
#load model
# m = g.Doc2Vec.load(model)

In [4]:
def find_files(path):
    """Walks the path folder (including subfolders) and returns 
    the full path of every txt file found"""
    my_files = []
    for root, dirs, files in os.walk(path):
        # join with root (not path) so that files in subfolders get the right path
        my_files.extend(os.path.join(root, file) for file in files if file.endswith(".txt"))
    return my_files

In [5]:
def extract_xml(file):
    """Extract the contents from an xml file (which in this 
    corpus actually have txt extension) and returns it as 
    a dictionary. 
    The assumed tags are: 'id', 'jobtitle', 'about', 'jobdesc', 
    'keyinfo', 'benefits'.
    If jobdesc contains internal tags, all their text is simply 
    merged.
    """
    with open(file) as f:
        elements = {}
        tree = ET.fromstring(f.read())
        elements['id'] = tree.attrib['id']

        for child in tree: #ET.fromstring(f.read()):
            # elements[child.tag] = child.text
            elements[child.tag] = ' '.join(child.itertext()).lower()
    return elements
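
To see what the `itertext()` merging does with internal tags, here is a small sketch that applies the same logic to an in-memory string. The root tag name `offer` and the sample contents are hypothetical; only the child tags follow the ones listed in the docstring.

```python
import xml.etree.ElementTree as ET

# Hypothetical offer; the real files may use a different root tag
sample = """<offer id="Example001">
  <jobtitle>Computational Linguist</jobtitle>
  <jobdesc>We need <b>Python</b> and <b>NLP</b> skills.</jobdesc>
</offer>"""

root = ET.fromstring(sample)
parsed = {'id': root.attrib['id']}
for child in root:
    # itertext() also walks nested tags, so inline markup is flattened
    parsed[child.tag] = ' '.join(child.itertext()).lower()

print(parsed)
```

Note that joining the text fragments with a space can leave double spaces around the former tags; this is harmless here because the tokenizer ignores sequences of spaces.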

## Load all the files into a pandas dataframe

In [6]:
files = find_files(path)
columns = ['id', 'jobtitle', 'about', 'jobdesc', 'keyinfo', 'benefits']

# Parse every file and build the dataframe in one go 
# (DataFrame.append was removed in pandas 2.0)
records = [extract_xml(file) for file in files]
df = pd.DataFrame(records, columns=columns)

# replace empty jobtitles (NaN in a dataframe) with ''
df.jobtitle = df.jobtitle.fillna('')
# Add a new column combining offer title and description
df['jobtitle_desc'] = " " + df.jobtitle + '\n' + df.jobdesc
# 'id' stays a regular column: later cells access it as df.id
# print(df.jobdesc)
print(df.jobtitle_desc)

0       localization editor,\njapanese\nlanguages are...
1       german; english; applied linguistics; computa...
2       english; computational linguistics; general l...
3        english; french; computational linguistics; ...
4       english; computational linguistics: senior an...
                             ...                        
193     french canadian linguist - siri tts\napple is...
194      english; german; spanish; applied linguistic...
195     french; german; syntax: analytical linguist, ...
196     swedish; computational linguistics; lexicogra...
197     senior social analyst, data\noverview\nthe id...
Name: jobtitle_desc, Length: 198, dtype: object


## Representations production

### Alternative 1: tf-idf

In [12]:
# nlp = English()
# tokenizer = Tokenizer(nlp.vocab)

# nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp = spacy.load("en_core_web_sm")
# config = {"mode": "rule"}
# nlp.add_pipe("lemmatizer", config=config)
# This usually happens under the hood
# processed = nlp(doc)


doc = nlp("this is a very small tests trying to check if I am or you are actually getting lemmas and not tokens")
# lemmas = set([w.lemma for w in doc])
# tokens = set([w for w in doc])
# print (len(lemmas), len(tokens))
# print(doc)
print([token.lemma_ for token in doc])


['this', 'be', 'a', 'very', 'small', 'test', 'try', 'to', 'check', 'if', 'I', 'be', 'or', 'you', 'be', 'actually', 'get', 'lemmas', 'and', 'not', 'token']


In [14]:
# tfidf_docs = pd.DataFrame(tfidf_docs)
# # # centers the vectorized documents (BOW vectors) by subtracting the mean
# # tfidf_docs = tfidf_docs - tfidf_docs.mean()

def lemmatized_words(doc):
    """A tokenizer based on spacy to add lemmas to the vector, 
    rather than tokens. It also ignores sequences of spaces 
    and 1-character tokens"""
    doc = nlp(doc)
    return (w.lemma_ for w in doc if w.lemma_.strip() !="" and len(w.lemma_) > 1)

# Without lemmatization; the vocabulary is huge: 75.4k.
# lemm_vectorizer = TfidfVectorizer(tokenizer=tokenizer) 

# With lemmatization; the vocabulary is reasonable < 3k
lemm_vectorizer = TfidfVectorizer(
    tokenizer=lemmatized_words, # the tokenizer from the function 
    stop_words='english', 
    min_df=2)

tfidf_docs = lemm_vectorizer.fit_transform(raw_documents=df.jobtitle_desc).toarray()
print("Size of the vocabulary:", len(lemm_vectorizer.vocabulary_))

tfidf_docs = pd.DataFrame(tfidf_docs)
print("Shape of the matrix:", tfidf_docs.shape)

# Uncomment if you want to see the features (vocabulary)
# (use get_feature_names() instead with scikit-learn < 1.0)
# print("Feature names:\n", lemm_vectorizer.get_feature_names_out())

# print(tfidf_docs)

Size of the vocabulary: 2317
Shape of the matrix: (198, 2317)


### Alternative 2: PCA

In [19]:
# Scaling the tfidf vectors in [0,1] (recommended for PCA)
scaler = MinMaxScaler()
tfidf_docs_rescaled = scaler.fit_transform(tfidf_docs)

# Computing the principal components (PCA_COMPONENTS of them)
pca = PCA(svd_solver = 'full', n_components=PCA_COMPONENTS)
pca = pca.fit(tfidf_docs_rescaled)
pca_topic_vectors = pca.transform(tfidf_docs_rescaled)

columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=df.id)
pca_topic_vectors.round(3).head(6)

id,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15,topic16,topic17,topic18,topic19
Facebook001,-0.739,-0.14,-0.029,-0.298,0.086,-0.097,-0.921,-0.159,0.174,0.017,-0.067,-0.546,0.119,-0.828,-0.267,-0.059,0.412,-1.293,-0.218,0.402
Linguist014,0.504,-0.772,-0.043,-0.135,-0.57,0.361,-0.03,0.464,0.551,-0.382,0.327,0.045,0.111,-0.29,0.267,0.0,-0.312,-0.156,0.162,0.3
Linguist028,0.826,-0.059,0.449,-0.1,-0.07,0.978,-1.857,-1.199,-0.666,-1.143,-0.894,0.558,0.544,-0.075,0.038,-1.106,0.412,0.508,-0.277,-0.248
Linguist029,0.971,-0.068,0.295,-0.18,-0.228,1.131,-1.231,-0.983,-1.063,-1.616,-0.769,1.005,0.614,0.739,0.42,-0.217,0.232,-0.771,-0.917,0.497
Linguist015,-0.306,-0.476,-1.245,3.051,1.819,0.225,-0.562,-0.439,-0.274,0.208,0.195,-0.427,0.28,0.318,-0.481,-0.313,-0.33,0.432,-0.659,0.146
Linguist001,1.071,-0.269,0.009,-0.011,-0.4,0.806,-0.168,1.113,-0.048,0.447,-0.069,-0.581,-0.316,0.469,-0.277,-0.262,-0.337,0.051,-0.361,0.289
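
One hedged way to pick `PCA_COMPONENTS` is to look at how much variance the components retain. The sketch below does this on a random matrix that merely stands in for the rescaled TF-IDF matrix (its 198 x 300 shape is an assumption for the demo); in the notebook the same attribute is available on the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the rescaled TF-IDF matrix (assumption for the demo)
rng = np.random.default_rng(0)
toy_matrix = rng.random((198, 300))

toy_pca = PCA(svd_solver='full', n_components=20).fit(toy_matrix)

# Cumulative share of the variance kept by the first 1..20 components
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)
print(cumulative.round(3))
```

On random data the curve grows slowly; on real documents a sharp early rise would suggest that few components are enough.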


### Alternative 3: TruncatedSVD (LSA)

In [25]:
svd = TruncatedSVD(n_components=100)   #, n_iter=100)
scaler = MinMaxScaler()
tfidf_docs_rescaled = scaler.fit_transform(tfidf_docs)

# Decomposes the TF-IDF vectors and transforms them into topic vectors
svd = svd.fit(tfidf_docs_rescaled)
svd_topic_vectors = svd.transform(tfidf_docs_rescaled)

# print(svd_topic_vectors.shape)
columns = ['topic{}'.format(i) for i in range(svd.n_components)]
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=df.id)

# Display the first 6 vectors
svd_topic_vectors.round(3).head(6)

id,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic90,topic91,topic92,topic93,topic94,topic95,topic96,topic97,topic98,topic99
Facebook001,1.612,-0.715,-0.102,0.091,-0.315,-0.006,-0.043,-0.893,-0.017,0.211,...,0.553,0.3,0.303,0.019,-0.16,0.015,-0.527,0.116,0.606,-0.12
Linguist014,1.695,0.583,-0.379,0.441,0.013,-0.785,0.263,-0.062,0.69,0.207,...,0.24,-0.325,0.055,0.003,0.611,0.466,-0.274,0.216,0.078,-0.342
Linguist028,2.21,0.785,0.336,0.319,-0.11,-0.283,0.985,-1.767,-1.478,0.987,...,-0.249,-0.007,0.025,0.328,0.257,-0.132,0.129,-0.343,0.039,0.228
Linguist029,2.313,0.92,0.202,0.237,-0.141,-0.537,1.048,-1.181,-1.603,0.708,...,0.2,0.163,0.346,-0.624,-0.424,0.021,0.183,-0.475,-0.255,-0.344
Linguist015,2.483,-0.438,-1.697,0.804,2.13,2.393,0.431,-0.51,-0.459,-0.134,...,-0.067,0.003,0.305,0.033,0.274,0.475,0.029,-0.498,-0.222,-0.297
Linguist001,2.057,1.055,-0.118,0.187,0.079,-0.626,0.639,-0.269,0.77,-0.868,...,-0.021,0.401,-0.185,-0.62,0.024,-0.429,-0.01,0.12,-0.121,-0.172


# Clustering

Note that every alternative you run adds one column per clustering to the df (and these columns are included in the saved tsv).

## Alternative 1: Birch (which requires the number of clusters)

In [29]:
def birch(data, k):
    "Produces a clustering with k clusters for the given data"
    brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
    brc.fit(data)

    # the cluster id assigned to each instance in data
    clusters = brc.predict(data)
    return clusters


### Birch with tfidf

In [31]:
for k in range(1, 21):
    clusters = birch(tfidf_docs, k)
    df[".".join(["birch", "tfidf", str(k)])] = clusters 

### Birch with PCA

In [32]:
for k in range(1, 21):
    clusters = birch(pca_topic_vectors, k)
    df[".".join(["birch", "pca", str(k)])] = clusters 

### Birch with SVD

In [33]:
for k in range(1, 21):
    clusters = birch(svd_topic_vectors, k)
    df[".".join(["birch", "svd", str(k)])] = clusters 
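
Since Birch has to be run for each value of k, one possible way to compare the resulting clusterings without annotations is the silhouette coefficient. This is only a sketch on synthetic data: the blobs stand in for the document vectors, and using the silhouette as the selection criterion is an assumption, not something evaluated in this notebook.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document vectors (assumption for the demo)
toy_vectors, _ = make_blobs(n_samples=120, centers=4, random_state=0)

scores = {}
for k in range(2, 11):  # the silhouette needs at least 2 clusters
    toy_labels = Birch(branching_factor=50, n_clusters=k,
                       threshold=0.1).fit_predict(toy_vectors)
    scores[k] = silhouette_score(toy_vectors, toy_labels)

best_k = max(scores, key=scores.get)
print("k with the highest silhouette:", best_k)
```

With the real data, `toy_vectors` would be replaced by `tfidf_docs`, `pca_topic_vectors` or `svd_topic_vectors`.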

In [34]:
print(df.head(3))
# df.to_csv("/".join([path, "upskills_clusters.tsv"]), sep="\t")

            id                                           jobtitle  \
0  Facebook001                     localization editor,\njapanese   
1  Linguist014  german; english; applied linguistics; computat...   
2  Linguist028  english; computational linguistics; general li...   

                                               about  \
0  about the facebook company\nfacebook's mission...   
1                                                NaN   
2  we are an equal opportunity employer and value...   

                                             jobdesc  \
0  languages are key to our mission of bringing t...   
1  description:\n\nthis is an exciting opportunit...   
2  description:\n\nappen is the world's leading i...   

                                             keyinfo benefits  \
0                                                         NaN   
1  university or organization: nuance communicati...      NaN   
2  university or organization: appen\ndepartment:...      NaN   

            

## Alternative 2: [AffinityPropagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation)

### AffinityPropagation with tf-idf vectors

In [None]:
# #############################################################################
# Compute Affinity Propagation
X = tfidf_docs
# centers the vectorized documents (BOW vectors) by subtracting the mean
X = X - X.mean()

af = AffinityPropagation(random_state=None)
af.fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
# print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
# print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
# print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
# print("Adjusted Rand Index: %0.3f"
#       % metrics.adjusted_rand_score(labels_true, labels))
# print("Adjusted Mutual Information: %0.3f"
#       % metrics.adjusted_mutual_info_score(labels_true, labels))
# print("Silhouette Coefficient: %0.3f"
#       % metrics.silhouette_score(X, labels, metric='sqeuclidean'))
print(labels)

### AffinityPropagation with PCA vectors

In [None]:
# #############################################################################
# Compute Affinity Propagation
X = pca_topic_vectors
af = AffinityPropagation()
af.fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
# print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
# print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
# print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
# print("Adjusted Rand Index: %0.3f"
#       % metrics.adjusted_rand_score(labels_true, labels))
# print("Adjusted Mutual Information: %0.3f"
#       % metrics.adjusted_mutual_info_score(labels_true, labels))
# print("Silhouette Coefficient: %0.3f"
#       % metrics.silhouette_score(X, labels, metric='sqeuclidean'))
print(labels)

### AffinityPropagation with SVD (LSA) vectors

In [None]:
# #############################################################################
# Compute Affinity Propagation
X = svd_topic_vectors
af = AffinityPropagation()
af.fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
# print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
# print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
# print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
# print("Adjusted Rand Index: %0.3f"
#       % metrics.adjusted_rand_score(labels_true, labels))
# print("Adjusted Mutual Information: %0.3f"
#       % metrics.adjusted_mutual_info_score(labels_true, labels))
# print("Silhouette Coefficient: %0.3f"
#       % metrics.silhouette_score(X, labels, metric='sqeuclidean'))
print(labels)

for id, cluster in zip(df.id, labels):
    print (id, cluster)
# print("\t".join([id, cluster]) for id, cluster in zip (df.id, labels))

## Alternative 3: Meanshift

In [None]:
cluster_centers_indices
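
The mean shift experiment itself is not included in this notebook. A minimal sketch of what it could look like is shown below, on synthetic data; the blobs and the `quantile=0.2` used to estimate the bandwidth are arbitrary assumptions for the demo.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the document vectors (assumption for the demo)
toy_points, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                           cluster_std=0.6, random_state=0)

# The bandwidth controls the kernel size, and with it the number of clusters
bandwidth = estimate_bandwidth(toy_points, quantile=0.2, random_state=0)

ms = MeanShift(bandwidth=bandwidth)
ms_toy_labels = ms.fit_predict(toy_points)

print("Estimated number of clusters:", len(ms.cluster_centers_))
```

With the real data, `toy_points` would be one of the representations computed above, e.g. `pca_topic_vectors`.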

In [None]:

# Trying with doc2vec 
# The computation is way too slow and we might need to use the cluster
# NOTE: requires the doc2vec model m, whose loading is commented out 
# in the constants cell (import gensim.models as g; m = g.Doc2Vec.load(model))

# infer one vector per job description
test_ids = []
X = []
for i, row in df.iterrows():
    print(row['id'])
    test_ids.append(row['id'])
    X.append(
        m.infer_vector(
            row['jobdesc'].strip().split(),
            # gensim >= 4.0 calls this parameter epochs instead of steps
            alpha=start_alpha, steps=infer_epoch))

print(X)


In [None]:

 
k = 3

brc = Birch(branching_factor=50, n_clusters=k, threshold=0.1, compute_labels=True)
brc.fit(X)

clusters = brc.predict(X)
labels = brc.labels_

print("Clusters: ")
print(clusters)

# silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
# print("Silhouette_score: ")
# print(silhouette_score)


The community has created multiple libraries for pre-processing, which include options for tokenisation. One of the most popular ones is [NLTK](http://www.nltk.org). 

Before using it, you should install it. If using pip, you should do: 

\$ pip install --user -U nltk

\$ pip install --user -U numpy


And now we can import and use one of its tokenisers

In [None]:
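from nltk.tokenize import TreebankWordTokenizer # import one of the many tokenizers available

# txt is not defined in this part of the notebook; this sample text is 
# only a placeholder
txt = "Thomas Jefferson began building Monticello at the age of 26."

tokenizer = TreebankWordTokenizer()             # invoke it 
tokens = tokenizer.tokenize(txt)
print(tokens)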

Now, see the difference between tokenising with split() and with NLTK's treebank tokeniser on a different sentence.

In [None]:
sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."
tokens_split = sentence.split()
tokens_tree = tokenizer.tokenize(sentence)

print("OUTPUT USING split()\t\t", tokens_split)
print("OUTPUT USING TreebankWordTokenizer\t", tokens_tree)

## Normalisation

### Casefolding

In [None]:
sentence  = sentence.lower()
print(sentence)

## Stemming

Once again, we can use a regular expression to do stemming

In [None]:
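import re

def stem(phrase):
    """Strips a final 's' (unless the word ends in 'ss') and 
    any surrounding apostrophes from each word"""
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                     for word in phrase.lower().split()])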

In [None]:
print("'houses' \t\t->", stem('houses'))
print("'Doctor House's calls' \t->", stem("Doctor House's calls"))
print("'stress' \t\t->", stem("stress"))

But we would need to include many more expressions to deal with all cases and exceptions.

Instead, once again we can rely on a library. Let's consider the **Porter stemmer**, available in NLTK.

In [None]:
from nltk.stem.porter import PorterStemmer # Import the stemmer
stemmer = PorterStemmer()                  # invoke the stemmer

# Notice that we are "tokenising" and stemming in one line
x = ' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])
print(x.split())

## Lemmatisation

This is a more complex process than stemming. Let us go straight to using a library.
In this particular case we are going to use NLTK's WordNet lemmatiser. If it is the first time you use it (or you are in an ephemeral environment!), you should download it as follows:

In [None]:
import nltk 
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer # importing the lemmatiser
lemmatizer = WordNetLemmatizer()        # invoking it

print("'better' alone \t->",lemmatizer.lemmatize("better"))
print("'better' including its part of speech (adj) \t->",lemmatizer.lemmatize("better", pos="a"))

## A quick overview on representations

### Bag of Words (BoW)

First, let us see a simple construction, using a dictionary

In [None]:
sentence = """Thomas Jefferson began building Monticello at the age of 26. Thomas"""

sentence_bow = {}
for token in sentence.split():
     sentence_bow[token] = 1
sorted(sentence_bow.items())
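
The dictionary above only records whether a token is present: the second occurrence of "Thomas" is lost. If the frequencies matter, a possible variant is `collections.Counter` from the standard library. This is a minimal sketch on the same sentence; the variable names are only illustrative.

```python
from collections import Counter

toy_sentence = "Thomas Jefferson began building Monticello at the age of 26. Thomas"

# Counter keeps how many times each token occurs
bow_counts = Counter(toy_sentence.split())

print(bow_counts["Thomas"])
print(sorted(bow_counts.items()))
```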


Another option would be using **pandas**

In [None]:
import pandas as pd

# Loading the corpus
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

# Loading the tokens into a dictionary (notice that we assume that each line is a document)
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in
         sent.split())

# Loading the dictionary contents into a pandas dataframe. 
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
# SEE THE .T, which transposes the matrix for visualisation purposes.


df[df.columns[:10]]


### One-hot vectors

This is our input sentence (and its vocabulary)

In [None]:
import numpy as np
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))
print(vocab)

And now, we produce the one-hot representation

In [None]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens, vocab_size), int) # create the |tokens| x |vocabulary size| matrix of zeros 
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1  # set the dimension of this word to 1

print("Vocabulary:\t", vocab)
print("Sentence:\t", token_sequence)
onehot_vectors

Let us bring pandas into the game

In [None]:
pd.DataFrame(onehot_vectors, columns=vocab)
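
The one-hot matrix and the bag of words are closely related: summing the one-hot rows discards the word order and leaves the term counts. A minimal sketch with a made-up sentence (chosen only because it repeats some words):

```python
import numpy as np
import pandas as pd

toy_tokens = "the cat saw the other cat".split()
toy_vocab = sorted(set(toy_tokens))

# One row per token, one column per vocabulary entry
toy_onehot = np.zeros((len(toy_tokens), len(toy_vocab)), int)
for i, word in enumerate(toy_tokens):
    toy_onehot[i, toy_vocab.index(word)] = 1

# Summing over the rows collapses the word order and yields term counts
toy_bow = pd.Series(toy_onehot.sum(axis=0), index=toy_vocab)
print(toy_bow)
```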