According to a [Stack Overflow post](https://stackoverflow.com/a/34350553), it is possible to use an AutoEncoder to compress the output of Word2Vec or RNNs down to fewer dimensions. Conversely, it is not really feasible to implement a generative network directly using a Variational AutoEncoder (VAE).
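As a rough illustration of the compression idea, the sketch below trains a tiny linear autoencoder with plain NumPy on synthetic 16-dimensional vectors that stand in for Word2Vec output. The data, the dimensions, the learning rate, and the function name `train_linear_autoencoder` are all assumptions made for this example; a real setup would more likely use a nonlinear network in a deep learning framework.

```python
import numpy as np


def train_linear_autoencoder(X, code_dim, epochs=300, lr=1.0, seed=0):
    """Fit X ~ (X @ W_enc) @ W_dec by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, code_dim))  # encoder: d -> code_dim
    W_dec = rng.normal(scale=0.1, size=(code_dim, d))  # decoder: code_dim -> d
    losses = []
    for _ in range(epochs):
        Z = X @ W_enc                    # compressed codes
        err = Z @ W_dec - X              # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / (n * d)   # d(loss)/d(reconstruction)
        grad_dec = Z.T @ grad_out
        grad_enc = X.T @ (grad_out @ W_dec.T)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, losses


# synthetic "embeddings": 200 vectors of size 16 that really only use 4 directions
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 4))
basis, _ = np.linalg.qr(rng.normal(size=(16, 4)))  # 16 x 4, orthonormal columns
X = latent @ basis.T

W_enc, W_dec, losses = train_linear_autoencoder(X, code_dim=4)
codes = X @ W_enc  # the compressed representation
print(codes.shape)
print(losses[0], losses[-1])
```

The encoder half (`X @ W_enc`) is the part that would be kept for compression; the decoder only exists to give the encoder a training signal.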

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve, auc
from sklearn import metrics
from sklearn.model_selection import train_test_split


from functools import reduce
from tqdm import tqdm
import gensim
import gensim.corpora as corpora

from Preprocessing import Preprocessor
preprocessor = Preprocessor(0)
tqdm.pandas()

In [2]:
df = pd.read_csv("khan_joined.csv")
df.head()

Unnamed: 0,course,unit,lesson,video_title,about,transcript,transcript_cleaned,transcript_n_entries
0,Computer programming,Intro to JS: Drawing & Animation,Intro to programming,What is Programming?,Programming is the process of creating a set o...,"Hi, welcome to programming! If you've never le...","['help', 'computer', 'doodle', 'great', 'anima...",94
1,Computer programming,Intro to JS: Drawing & Animation,Coloring,The Power of the Docs,Created by Pamela Fox.,Voiceover: Ok so you've\r\nmade a few programs...,"['computer', 'refer', 'great', 'rect', 'like',...",161
2,Computer programming,Intro to HTML/CSS: Making webpages,Further learning,HTML validation,Learn how to validate your webpages with the W...,"- [Voiceover] On Khan Academy, we pop up the o...","['computer', 'try', 'validation', 'image', 'ac...",67
3,Computer programming,Intro to SQL: Querying and managing data,SQL basics,Welcome to SQL,SQL is useful for creating and querying relati...,- [Instructor] The world is full of data. Ever...,"['money', 'help', 'america', 'like', 'location...",79
4,Computer programming,Intro to SQL: Querying and managing data,SQL basics,S-Q-L or SEQUEL?,How is it pronounced? Why? Let's discuss...,"At this point, you've probably heard me\r\npro...","['probably', 'favorite', 'version', 'stand', '...",61


In [3]:
limit = 100
targets = df['course'][0:limit]
data = df['transcript_cleaned'][0:limit]

## Translate words to numerics
### Preliminary Research
[Link to article about Word2Vec and GloVe](https://forecast.global/insight/numerical-interpretation-of-textual-data-understanding-vector-representations/). More on embeddings from [Medium](https://medium.com/@kashyapkathrani/all-about-embeddings-829c8ff0bf5b).


Word embeddings are one option. A bag-of-words (BOW) representation gives a fixed-length vector. Word2Vec also works for this and can be trained with either a skip-gram or a Continuous Bag of Words (CBOW) objective. GloVe instead learns its vectors from global word co-occurrence counts. Neither Word2Vec nor GloVe can produce a vector for a word not seen during training. The FastText extension should be able to, as it breaks words into character n-grams.
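To see why subword units help with unseen words, the sketch below splits words into character n-grams with boundary markers, in the spirit of FastText. The helper name `char_ngrams` and the fixed `n=3` are assumptions chosen for illustration; FastText itself uses a range of n-gram lengths and hashes them into buckets.

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Return the character n-grams of `word`, padded with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


seen = char_ngrams("program")     # pretend this word was in the training data
unseen = char_ngrams("programs")  # pretend this one was not
shared = set(seen) & set(unseen)

print(seen)
print(sorted(shared))
```

Because most of the n-grams of the unseen word overlap with those of the seen word, a vector built from n-gram vectors can still be assembled for it.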

Sentence embeddings better preserve context when compared to word embeddings. ELMo, InferSent, and Sentence-BERT all work in this case.

Can test sentence embeddings using SentEval tool-kit.

### Ideas
Try both word embeddings and one of the sentence embeddings later. Using Word2Vec over GloVe as we want to train on our own corpus.

### Word Embeddings

In [4]:
def str_to_list(s: str) -> list:
    s = s.replace('[', '')
    s = s.replace(']', '')
    s = s.replace('\'', '')
    s = s.split(', ')
    return s

corpus = data.sum(axis=0)
if isinstance(corpus, str):
    # means it performed string concatenation so it needs to be cleaned up
    print(corpus[0:100])
    corpus = str_to_list(corpus)
    print(corpus[0: 25])
    print(len(corpus))
    print(type(corpus))
# convert the corpus to a set since there should be no duplicate values
corpus = set(corpus)
print(len(corpus))
corpus = list(corpus)
print(corpus[0:25])

temp = [d.split() for d in corpus]
print(type(temp))
words = corpora.Dictionary(temp)
words.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
corpus = [words.doc2bow(doc) for doc in temp]

['help', 'computer', 'doodle', 'great', 'animation', 'water', 'like', 'concept', 'english', 'languag
['help', 'computer', 'doodle', 'great', 'animation', 'water', 'like', 'concept', 'english', 'language', 'give', 'tell', 'relate', 'pretty', 'special', 'animate', 'maps', 'wonder', 'khan', 'effect', 'think', 'cool', 'wikipedia', 'surface', 'minecraft']
13714
<class 'list'>
3729
['procure', 'wheelbarrow', 'straight', 'nemo', 'cultural', 'uniformity', 'representation', 'accidentally', 'incredibles', 'imbalance', 'propose', 'input', 'decide', 'state', 'invert', 'well', 'magnification', 'processingjs', '16th', 'prolific', 'aspect', 'entry', 'traveler', 'moment', 'shift']
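Because `transcript_cleaned` is read back from the CSV as the string form of a Python list, an alternative to stripping brackets and quotes by hand is to parse each cell with `ast.literal_eval` from the standard library. The sketch below is one way this could look; the two sample strings stand in for rows of the column, and the helper name `parse_tokens` is made up for the example.

```python
import ast


def parse_tokens(cell: str) -> list:
    """Parse a stringified list such as "['help', 'computer']" into a real list."""
    return ast.literal_eval(cell)


# stand-ins for two rows of df['transcript_cleaned']
rows = ["['help', 'computer', 'doodle']", "['computer', 'refer', 'great']"]

token_lists = [parse_tokens(r) for r in rows]
vocabulary = sorted({tok for toks in token_lists for tok in toks})

print(token_lists[0])
print(vocabulary)
```

Parsing row by row keeps document boundaries intact, so the last token of one transcript cannot run into the first token of the next, which can happen when the stringified lists are concatenated before splitting.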


### Sentence Embeddings
Ended up stumbling across a [SpaCy package](https://spacy.io/universe/project/spacy-universal-sentence-encoder) that uses Google's Universal Sentence Encoder model. It takes text and translates it into a [512 dimensional vector](https://amitness.com/2020/06/universal-sentence-encoder/).

In [6]:
# pip install spacy-universal-sentence-encoder

In [8]:
import spacy_universal_sentence_encoder
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')


Downloaded https://tfhub.dev/google/universal-sentence-encoder-large/5, Total size: 577.10MB



In [10]:
d1 = nlp(data[0])
d2 = nlp(data[1])
d1.similarity(d2)

0.5215909665018893
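For intuition about what a score like this means: spaCy's default `similarity` is the cosine similarity of the two vectors, and the sketch below assumes the value above was computed the same way. The function name `cosine_similarity` and the toy vectors are assumptions for the example, not output from the model.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Vectors pointing the same way score close to 1 and unrelated directions score close to 0, so a value around 0.5 suggests the two transcripts are related but far from identical.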

## Next Step
We have now verified that the sentence embedder works. Now we need to create a list of all sentences for each target domain.

In [14]:
joined = df[['course', 'transcript']][0:limit]
joined.head()

Unnamed: 0,course,transcript
0,Computer programming,"Hi, welcome to programming! If you've never le..."
1,Computer programming,Voiceover: Ok so you've\r\nmade a few programs...
2,Computer programming,"- [Voiceover] On Khan Academy, we pop up the o..."
3,Computer programming,- [Instructor] The world is full of data. Ever...
4,Computer programming,"At this point, you've probably heard me\r\npro..."


First, we notice that the transcripts contain unicode characters and lots of stray tags. These are usually cleaned in the preprocessing pipeline, so let's call it with different arguments.

In [16]:
joined['full_transcript'] = joined['transcript'].progress_apply(lambda x: preprocessor.clean(x))
joined.head()

100%|██████████| 100/100 [00:05<00:00, 16.84it/s]


Unnamed: 0,course,transcript,full_transcript
0,Computer programming,"Hi, welcome to programming! If you've never le...","(hi, ,, welcome, programming, !, never, learne..."
1,Computer programming,Voiceover: Ok so you've\r\nmade a few programs...,"(voiceover, :, ok, made, programs, ,, might, w..."
2,Computer programming,"- [Voiceover] On Khan Academy, we pop up the o...","(-, [, voiceover, ], khan, academy, ,, pop, oh..."
3,Computer programming,- [Instructor] The world is full of data. Ever...,"(-, [, instructor, ], world, full, data, ., ev..."
4,Computer programming,"At this point, you've probably heard me\r\npro...","(point, ,, probably, heard, pronounce, sql, tw..."


In [17]:
type(joined['full_transcript'][0])

spacy.tokens.doc.Doc

We now have the SpaCy doc for every transcript in our dataset. The next step is to use it to split each transcript into sentences.

In [18]:
joined['sents'] = joined['full_transcript'].progress_apply(lambda x: list(x.sents))
joined.head()

100%|██████████| 100/100 [00:00<00:00, 18813.60it/s]


Unnamed: 0,course,transcript,full_transcript,sents
0,Computer programming,"Hi, welcome to programming! If you've never le...","(hi, ,, welcome, programming, !, never, learne...","[(hi, ,, welcome, programming, !), (never, lea..."
1,Computer programming,Voiceover: Ok so you've\r\nmade a few programs...,"(voiceover, :, ok, made, programs, ,, might, w...","[(voiceover, :, ok, made, programs, ,, might, ..."
2,Computer programming,"- [Voiceover] On Khan Academy, we pop up the o...","(-, [, voiceover, ], khan, academy, ,, pop, oh...","[(-, [, voiceover, ], khan, academy, ,, pop, o..."
3,Computer programming,- [Instructor] The world is full of data. Ever...,"(-, [, instructor, ], world, full, data, ., ev...","[(-, [, instructor, ], world, full, data, .), ..."
4,Computer programming,"At this point, you've probably heard me\r\npro...","(point, ,, probably, heard, pronounce, sql, tw...","[(point, ,, probably, heard, pronounce, sql, t..."


In [19]:
type(joined['sents'][0])

list

We now have the sentences for each doc. The next step is to encode each of them with the universal sentence encoder.

In [26]:
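def to_str(lst: list) -> list:
    # convert each SpaCy Span to its plain text
    return list(map(str, lst))

def encode(lst: list) -> list:
    # run every sentence through the universal sentence encoder pipeline
    return list(map(nlp, to_str(lst)))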

print(type(joined['sents'][0][0]))
# encode(joined['sents'][0])
joined['encoding'] = joined['sents'].progress_apply(lambda x: encode(x))
joined.head()

<class 'spacy.tokens.span.Span'>


100%|██████████| 100/100 [00:00<00:00, 534.83it/s]


Unnamed: 0,course,transcript,full_transcript,sents,encoding
0,Computer programming,"Hi, welcome to programming! If you've never le...","(hi, ,, welcome, programming, !, never, learne...","[(hi, ,, welcome, programming, !), (never, lea...","[(hi, ,, welcome, programming, !), (never, lea..."
1,Computer programming,Voiceover: Ok so you've\r\nmade a few programs...,"(voiceover, :, ok, made, programs, ,, might, w...","[(voiceover, :, ok, made, programs, ,, might, ...","[(voiceover, :, ok, made, programs, ,, might, ..."
2,Computer programming,"- [Voiceover] On Khan Academy, we pop up the o...","(-, [, voiceover, ], khan, academy, ,, pop, oh...","[(-, [, voiceover, ], khan, academy, ,, pop, o...","[(-, [, voiceover, ], khan, academy, ,, pop, o..."
3,Computer programming,- [Instructor] The world is full of data. Ever...,"(-, [, instructor, ], world, full, data, ., ev...","[(-, [, instructor, ], world, full, data, .), ...","[(-, [, instructor, ], world, full, data, .), ..."
4,Computer programming,"At this point, you've probably heard me\r\npro...","(point, ,, probably, heard, pronounce, sql, tw...","[(point, ,, probably, heard, pronounce, sql, t...","[(point, ,, probably, heard, pronounce, sql, t..."


In [27]:
type(joined['encoding'][0])

list

In [28]:
type(joined['encoding'][0][0])

spacy.tokens.doc.Doc

At this point we should have a list of sentence embeddings for each document. The next step is to group them by course.
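One possible way to do the grouping is sketched below: collect every sentence vector under its course label, then average them into a single vector per course. The toy 3-dimensional vectors, the `rows` structure, the "Math" course, and the choice of mean pooling are all assumptions for illustration; with the real data each vector would presumably come from the `.vector` attribute of an encoded sentence.

```python
from collections import defaultdict


def group_vectors_by_course(rows):
    """Collect the sentence vectors of all documents under their course label."""
    grouped = defaultdict(list)
    for course, vectors in rows:
        grouped[course].extend(vectors)
    return dict(grouped)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# each row: (course label, list of sentence vectors for one document)
rows = [
    ("Computer programming", [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    ("Computer programming", [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]),
    ("Math", [[2.0, 2.0, 2.0]]),
]

grouped = group_vectors_by_course(rows)
centroids = {course: mean_vector(vecs) for course, vecs in grouped.items()}
print(centroids)
```

Averaging is only one option; keeping the full list of vectors per course would preserve more detail at the cost of variable-length inputs.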