## Problem Statement

My goal is to go through the transcripts of TED talks given by different speakers and separate them into distinct topics such as religion, education, etc. Specifically, I want to find the topic with the most views, i.e. the most popular TED talk topic, and use topic similarity to select transcripts on the same topic.

## Getting the data

I am using the TED - Ultimate Dataset created by Miguel Corral Jr., which is available on Kaggle at https://www.kaggle.com/miguelcorraljr/ted-ultimate-dataset. TED is devoted to spreading powerful ideas on just about any topic. The dataset contains over 4,000 TED talks, including transcripts in many languages, but I will be using only the English version.

In [None]:
base_dir = '/content/drive/My Drive/Ted talk topic modelling'

In [None]:
#Initialization 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Importing the fastai library
from fastai import *
from fastai.text import *

In [None]:
import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
!python3 -m spacy download en

In [None]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
# Setting basepath
path= Path(base_dir)

In [None]:
# reading the data file
df = pd.read_csv(path/'ted_talks_en.csv')
df.head()

Unnamed: 0,talk_id,title,speaker_1,all_speakers,occupations,about_speakers,views,recorded_date,published_date,event,native_lang,available_lang,comments,duration,topics,related_talks,url,description,transcript
0,1,Averting the climate crisis,Al Gore,{0: 'Al Gore'},{0: ['climate advocate']},{0: 'Nobel Laureate Al Gore focused the world’...,3523392,2006-02-25,2006-06-27,TED2006,en,"['ar', 'bg', 'cs', 'de', 'el', 'en', 'es', 'fa...",272.0,977,"['alternative energy', 'cars', 'climate change...","{243: 'New thinking on the climate crisis', 54...",https://www.ted.com/talks/al_gore_averting_the...,With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a gre..."
1,92,The best stats you've ever seen,Hans Rosling,{0: 'Hans Rosling'},{0: ['global health expert; data visionary']},"{0: 'In Hans Rosling’s hands, data sings. Glob...",14501685,2006-02-22,2006-06-27,TED2006,en,"['ar', 'az', 'bg', 'bn', 'bs', 'cs', 'da', 'de...",628.0,1190,"['Africa', 'Asia', 'Google', 'demo', 'economic...","{2056: ""Own your body's data"", 2296: 'A visual...",https://www.ted.com/talks/hans_rosling_the_bes...,You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to teac..."
2,7,Simplicity sells,David Pogue,{0: 'David Pogue'},{0: ['technology columnist']},{0: 'David Pogue is the personal technology co...,1920832,2006-02-24,2006-06-27,TED2006,en,"['ar', 'bg', 'de', 'el', 'en', 'es', 'fa', 'fr...",124.0,1286,"['computers', 'entertainment', 'interface desi...","{1725: '10 top time-saving tech tips', 2274: '...",https://www.ted.com/talks/david_pogue_simplici...,New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garfun..."
3,53,Greening the ghetto,Majora Carter,{0: 'Majora Carter'},{0: ['activist for environmental justice']},{0: 'Majora Carter redefined the field of envi...,2664069,2006-02-26,2006-06-27,TED2006,en,"['ar', 'bg', 'bn', 'ca', 'cs', 'de', 'en', 'es...",219.0,1116,"['MacArthur grant', 'activism', 'business', 'c...",{1041: '3 stories of local eco-entrepreneurshi...,https://www.ted.com/talks/majora_carter_greeni...,"In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy that...
4,66,Do schools kill creativity?,Sir Ken Robinson,{0: 'Sir Ken Robinson'},"{0: ['author', 'educator']}","{0: ""Creativity expert Sir Ken Robinson challe...",65051954,2006-02-25,2006-06-27,TED2006,en,"['af', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs...",4931.0,1164,"['children', 'creativity', 'culture', 'dance',...","{865: 'Bring on the learning revolution!', 173...",https://www.ted.com/talks/sir_ken_robinson_do_...,Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you? (Audience) Good. It...


In [None]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [None]:
stop_words.extend(['go','many' ,'laughter', 'be','thing', 'likely','actually', 'come', 'start', 'happen', 'after', 'really', 'way', 'lot', 'start', 'would', 'also', 'lot', 'have', 'make', 'take', 's', 'get', 'much','try', 'could', 'say', 'tell'])

In [None]:
pprint(stop_words)

In [None]:
# Convert to list
data = df.transcript.values.tolist()

# Remove email addresses (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and runs of whitespace into single spaces
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])
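
To make the effect of these three substitutions concrete, here is a small self-contained sketch on a made-up sentence. The sample text and the helper name `clean_text` are illustrative only and are not part of the notebook.

```python
import re

def clean_text(text):
    """Apply the same three substitutions used above to a single string."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop e-mail-like tokens
    text = re.sub(r'\s+', ' ', text)         # collapse newlines/tabs/spaces
    text = re.sub(r"'", "", text)            # drop apostrophes
    return text

# Made-up input, not taken from the dataset
sample = "Write to me@example.com today.\nIt's   a great\thonor."
print(clean_text(sample))
```

Note that removing apostrophes turns "It's" into "Its", which is why tokens such as `its` and `im` show up in the tokenized output further down.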

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizes and lowercases (dropping punctuation); deacc=True also strips accent marks

data_words = list(sent_to_words(data))

print(data_words[:1])

[['thank', 'you', 'so', 'much', 'chris', 'and', 'its', 'truly', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', 'im', 'extremely', 'grateful', 'have', 'been', 'blown', 'away', 'by', 'this', 'conference', 'and', 'want', 'to', 'thank', 'all', 'of', 'you', 'for', 'the', 'many', 'nice', 'comments', 'about', 'what', 'had', 'to', 'say', 'the', 'other', 'night', 'and', 'say', 'that', 'sincerely', 'partly', 'because', 'mock', 'sob', 'need', 'that', 'laughter', 'put', 'yourselves', 'in', 'my', 'position', 'laughter', 'flew', 'on', 'air', 'force', 'two', 'for', 'eight', 'years', 'laughter', 'now', 'have', 'to', 'take', 'off', 'my', 'shoes', 'or', 'boots', 'to', 'get', 'on', 'an', 'airplane', 'laughter', 'applause', 'ill', 'tell', 'you', 'one', 'quick', 'story', 'to', 'illustrate', 'what', 'thats', 'been', 'like', 'for', 'me', 'laughter', 'its', 'true', 'story', 'every', 'bit', 'of', 'this', 'is', 'true', 'soon', 'after', 'tipper', 'and', 'left'

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])



['thank', 'you', 'so', 'much', 'chris', 'and', 'its', 'truly', 'great', 'honor', 'to', 'have', 'the', 'opportunity', 'to', 'come', 'to', 'this', 'stage', 'twice', 'im', 'extremely', 'grateful', 'have', 'been', 'blown_away', 'by', 'this', 'conference', 'and', 'want', 'to', 'thank', 'all', 'of', 'you', 'for', 'the', 'many', 'nice', 'comments', 'about', 'what', 'had', 'to', 'say', 'the', 'other', 'night', 'and', 'say', 'that', 'sincerely', 'partly', 'because', 'mock', 'sob', 'need', 'that', 'laughter', 'put', 'yourselves', 'in', 'my', 'position', 'laughter', 'flew', 'on', 'air', 'force', 'two', 'for', 'eight', 'years', 'laughter', 'now', 'have', 'to', 'take', 'off', 'my', 'shoes', 'or', 'boots', 'to', 'get', 'on', 'an', 'airplane', 'laughter', 'applause', 'ill', 'tell', 'you', 'one', 'quick', 'story', 'to', 'illustrate', 'what', 'thats', 'been', 'like', 'for', 'me', 'laughter', 'its', 'true', 'story', 'every', 'bit', 'of', 'this', 'is', 'true', 'soon', 'after', 'tipper', 'and', 'left', 't
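
The `min_count` and `threshold` arguments decide which word pairs get joined (such as `blown_away` above). As far as I understand gensim's default scorer, a pair is kept when a score of the following form exceeds `threshold`. The sketch below is a plain-Python illustration of that formula, with a hypothetical function name and made-up counts, not gensim's own code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score: large when a and b co-occur far more often
    than their individual frequencies would suggest."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Made-up counts for illustration only
rare_pair = phrase_score(count_a=40, count_b=60, count_ab=35,
                         vocab_size=50000, min_count=5)
common_pair = phrase_score(count_a=5000, count_b=8000, count_ab=300,
                           vocab_size=50000, min_count=5)

threshold = 100
print(rare_pair > threshold)    # two rare words that almost always co-occur
print(common_pair > threshold)  # two frequent words that sometimes co-occur
```

With a high threshold such as 100, only pairs of relatively rare words that nearly always appear together are merged, which matches the comment that a higher threshold gives fewer phrases.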

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

In [None]:
stop_words.extend(['even', 'may', 'thank', 'great', 'honour', 'nice', 'opprtunity' , 'show', 'first', 'question', 'mean', 'bad', 'day', 'let', 'well', 'give', 'find', 'put', 'maybe', 'ask', 'call', 'big', 'find', 'give', 'back', 'big','need','go','laughter', 'be','thing', 'likely','actually', 'come', 'start', 'happen', 'after', 'really', 'way', 'lot', 'start', 'would','want', 'new', 'know', 'also', 'lot', 'have', 'make', 'take', 's', 'get', 'much','try', 'could', 'say', 'tell'])

In [None]:
data_lemmatized = remove_stopwords(data_lemmatized)
print(data_lemmatized[:1])

[['truly', 'honor', 'opportunity', 'stage', 'twice', 'extremely', 'grateful', 'conference', 'comment', 'night', 'sincerely', 'partly', 'mock', 'sob', 'position', 'fly', 'air', 'force', 'year', 'shoe', 'boot', 'airplane', 'applause', 'ill', 'quick', 'story', 'illustrate', 'true', 'story', 'bit', 'true', 'soon', 'tipper', 'leave', 'mock', 'drive', 'little', 'farm', 'mile', 'drive', 'sound', 'little', 'looked', 'rear', 'view', 'mirror', 'sudden', 'hit', 'motorcade', 'hear', 'pain', 'rent', 'look', 'place', 'eat', 'exit', 'exit', 'shoney', 'restaurant', 'family', 'restaurant', 'chain', 'sit', 'booth', 'commotion', 'tipper', 'order', 'couple', 'booth', 'lower', 'voice', 'strain', 'former', 'wife', 'tipper', 'man', 'long', 'applause', 'series', 'epiphanie', 'next', 'continue', 'totally', 'true', 'story', 'energy', 'begin', 'speech', 'story', 'pretty', 'share', 'tipper', 'driving', 'shoney', 'family', 'restaurant', 'chain', 'man', 'laugh', 'speech', 'airport', 'fly', 'home', 'plane', 'middle'

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
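
Each document in `corpus` is a list of `(token_id, count)` pairs. The following is a minimal pure-Python sketch of that bag-of-words idea on two toy documents. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the way gensim assigns ids may differ from this toy version.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy documents, not taken from the dataset
toy_docs = [['story', 'true', 'story'], ['energy', 'story']]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept.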

In [None]:
id2word[0]

'accomplish'

In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=30, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=100,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# # Build LDA model for the purpose of topic similarity, using per_word_topics=False
# lda_model1 = gensim.models.ldamodel.LdaModel(corpus=corpus,
#                                            id2word=id2word,
#                                            num_topics=30, 
#                                            random_state=100,
#                                            update_every=1,
#                                            chunksize=100,
#                                            passes=100,
#                                            alpha='auto',
#                                            per_word_topics=False)

In [None]:
# Print the top keywords per topic (print_topics shows 20 of the 30 topics by default)
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(11,
  '0.083*"dog" + 0.053*"smell" + 0.033*"bird" + 0.024*"intersection" + '
  '0.022*"skew" + 0.020*"millennial" + 0.017*"wavelength" + 0.017*"loneliness" '
  '+ 0.015*"nanoparticle" + 0.013*"electrify"'),
 (10,
  '0.116*"game" + 0.086*"play" + 0.067*"sleep" + 0.029*"player" + 0.025*"team" '
  '+ 0.024*"motel" + 0.019*"win" + 0.018*"gay" + 0.018*"coach" + '
  '0.017*"pornography"'),
 (29,
  '0.078*"energy" + 0.071*"water" + 0.023*"power" + 0.017*"air" + '
  '0.017*"particle" + 0.016*"solar" + 0.015*"atom" + 0.015*"temperature" + '
  '0.013*"heat" + 0.012*"electricity"'),
 (27,
  '0.046*"planet" + 0.039*"earth" + 0.029*"space" + 0.025*"star" + '
  '0.020*"universe" + 0.020*"life" + 0.012*"year" + 0.011*"orbit" + '
  '0.011*"light" + 0.010*"sky"'),
 (0,
  '0.227*"woman" + 0.070*"man" + 0.057*"sex" + 0.045*"male" + 0.044*"female" + '
  '0.021*"sexual" + 0.013*"seize" + 0.009*"marriage" + 0.009*"tea" + '
  '0.009*"perverse"'),
 (17,
  '0.042*"robot" + 0.025*"fly" + 0.021*"move" + 0.018*

In [None]:
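# Compute Perplexity
# Note: log_perplexity returns the per-word log-likelihood bound, not the perplexity itself.
# Perplexity is 2 ** (-bound), so a higher (less negative) bound means a lower, better perplexity.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute the coherence score for the 30-topic model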
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.9406886781826245

Coherence Score:  0.4726060946162407
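
The value labelled "Perplexity" above is negative because `log_perplexity` reports a per-word log-likelihood bound. Assuming the base-2 convention that gensim uses in its own log messages, the bound can be converted into a perplexity estimate as in this short worked example.

```python
# Per-word bound printed above
per_word_bound = -7.9406886781826245

# Perplexity estimate under the assumed base-2 convention
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

Because this estimate is computed on the training corpus, it is mainly useful for comparing models trained on the same data, for example with different numbers of topics.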


In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

In [None]:
#['even', 'may', 'show', 'first', 'day', 'let', 'well', 'give', 'find','people', 'put', 'maybe', 'ask', 'call', 'big', 'find', 'give', 'back', 'big','need', ,'go',''laughter', 'be','thing', 'likely','actually', 'come', 'start', 'happen', 'after', 'really', 'way', 'lot', 'start', 'would', 'also', 'lot', 'have', 'make', 'take', 's', 'get', 'much','try', 'could', 'say', 'tell']

## Analysis of the above LDA model & identification of topics

 
1.   The words in topic 1 are mostly common words that add little meaning to a sentence, i.e. stopwords. This suggests that the corpus needs further stopword removal, which can be done by adding these common words to the stopword list and retraining the LDA model.

2.   Most of the topics are well separated on the intertopic distance map, which suggests that the LDA model has learnt, to a fair extent, how to separate the transcripts into distinct topics.

Most topics can be identified easily from their word lists. A few of them are summarized below. If these numbers are read off the pyLDAvis display, note that pyLDAvis numbers topics from 1 in order of prevalence, so they need not match the gensim topic IDs printed earlier.

*   **Topic 12**: This topic is about language.
*   **Topic 24**: This topic is about crime.
*   **Topic 17**: This topic is about education.
*   **Topic 20**: This topic is about music.
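
One way to find candidates for the extra stopwords mentioned in point 1 is to look for terms that occur in a large share of the documents. The sketch below illustrates this on toy data. The function name `frequent_terms` and the 0.75 cut-off are assumptions for illustration, not values used in this notebook.

```python
from collections import Counter

def frequent_terms(docs, min_doc_fraction=0.5):
    """Return terms that appear in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    n_docs = len(docs)
    return sorted(term for term, count in doc_freq.items()
                  if count / n_docs >= min_doc_fraction)

# Toy tokenized documents, not taken from the dataset
toy_docs = [
    ['think', 'people', 'energy'],
    ['think', 'people', 'music'],
    ['think', 'crime'],
    ['people', 'think', 'robot'],
]
candidates = frequent_terms(toy_docs, min_doc_fraction=0.75)
print(candidates)
```

On the real data, `frequent_terms(data_lemmatized, ...)` would give a list to review by hand before extending `stop_words`. gensim's `Dictionary.filter_extremes` with its `no_above` argument offers a similar document-frequency filter directly on `id2word`.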










## Finding the dominant topic in each transcript

In [None]:
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
  # Collect one row per document and build the DataFrame once at the end
  # (DataFrame.append, used previously, was removed in pandas 2.0)
  rows = []

  # Get main topic in each document
  for row in ldamodel[corpus]:
    # With per_word_topics=True, row[0] holds the (topic_id, probability) pairs
    row = sorted(row[0], key=lambda x: x[1], reverse=True)
    if not row:
      continue
    # Dominant topic, its share of the document, and its keywords
    topic_num, prop_topic = row[0]
    wp = ldamodel.show_topic(topic_num)
    topic_keywords = ", ".join([word for word, prop in wp])
    rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
  sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

  # Add original text to the end of the output
  contents = pd.Series(texts)
  sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
  return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data)
#df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

In [None]:
#df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data)
#df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,2.0,0.2052,"think, people, time, see, look, right, talk, w...","Thank you so much, Chris. And its truly a grea..."
1,1,15.0,0.2142,"people, percent, help, country, community, yea...","About 10 years ago, I took on the task to teac..."
2,2,2.0,0.2798,"think, people, time, see, look, right, talk, w...","(Music: ""The Sound of Silence,"" Simon & Garfun..."
3,3,28.0,0.1469,"company, work, car, business, create, buy, pro...",If youre here today — and Im very happy that y...
4,4,2.0,0.3858,"think, people, time, see, look, right, talk, w...",Good morning. How are you? (Audience) Good. It...
5,5,9.0,0.2944,"city, build, space, design, building, work, ar...",Im going to present three projects in rapid fi...
6,6,2.0,0.4638,"think, people, time, see, look, right, talk, w...","On September 10, the morning of my seventh bir..."
7,7,2.0,0.2548,"think, people, time, see, look, right, talk, w...",Its wonderful to be back. I love this wonderfu...
8,8,2.0,0.3865,"think, people, time, see, look, right, talk, w...","Im often asked, ""What surprised you about the ..."
9,9,2.0,0.3356,"think, people, time, see, look, right, talk, w...",I cant help but this wish: to think about when...
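
The core of the dominant-topic step is simply picking the topic with the highest probability for each document. A minimal sketch follows, with a hypothetical helper name and a partly made-up distribution: only the first pair mirrors the first row of the table above, the other two are invented.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# (topic_id, probability) pairs for one document; values after the first are invented
doc_topics = [(2, 0.2052), (15, 0.1210), (28, 0.0940)]

topic_id, share = dominant_topic(doc_topics)
print(topic_id, share)
```

A dominant share of around 0.2 means the document is still spread over several topics, so the dominant-topic label is a coarse summary rather than a hard classification.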


## Topic distribution across documents

In [None]:
# Number of documents per dominant topic, together with that topic's keywords.
# Grouping on both columns keeps each topic aligned with its own keywords.
topic_popularity = (
    df_dominant_topic.groupby(['Dominant_Topic', 'Keywords'])
    .size()
    .reset_index(name='total_count')
    .rename(columns={'Dominant_Topic': 'Topic'})
    .sort_values('total_count', ascending=False)
    .reset_index(drop=True)
)
topic_popularity = topic_popularity[['Topic', 'total_count', 'Keywords']]

topic_popularity


Unnamed: 0,Topic,total_count,Keywords
0,2.0,1405,"think, people, time, see, look, right, talk, w..."
1,19.0,472,"see, look, time, light, different, small, move..."
2,13.0,426,"life, man, feel, family, girl, young, become, ..."
3,16.0,253,"people, test, study, time, example, problem, r..."
4,23.0,176,"year, human, world, water, live, time, place, ..."
5,28.0,163,"company, work, car, business, create, buy, pro..."
6,4.0,138,"people, political, power, country, government,..."
7,1.0,136,"world, country, year, global, system, economy,..."
8,9.0,107,"city, build, space, design, building, work, ar..."
9,6.0,95,"cell, patient, disease, drug, doctor, cancer, ..."


In [None]:
# Hand-assigned topic labels, keyed by topic id
{11: 'general', 4: 'world affairs', 16: 'time management', 12: 'family', 2: 'ecology/nature', 0: 'vision', 5: 'intelligence', 7: 'disease', 18: 'space', 14: 'economy', 9: 'construction', 17: 'technology', 15: 'energy', 13: 'health', 8: 'music', 10: 'crime/order', 19: 'experience', 1: 'education', 6: 'food', 3: 'storytelling'}

In [None]:
# d=df_dominant_topic.set_index('Dominant_Topic')
# d.head(10)

## Most viewed topic

In [None]:
# adding views info from the original table
df_dominant_topic['views'] = df['views'] 
df_dominant_topic

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text,views
0,0,2.0,0.2052,"think, people, time, see, look, right, talk, w...","Thank you so much, Chris. And its truly a grea...",3523392
1,1,15.0,0.2142,"people, percent, help, country, community, yea...","About 10 years ago, I took on the task to teac...",14501685
2,2,2.0,0.2798,"think, people, time, see, look, right, talk, w...","(Music: ""The Sound of Silence,"" Simon & Garfun...",1920832
3,3,28.0,0.1469,"company, work, car, business, create, buy, pro...",If youre here today — and Im very happy that y...,2664069
4,4,2.0,0.3858,"think, people, time, see, look, right, talk, w...",Good morning. How are you? (Audience) Good. It...,65051954
...,...,...,...,...,...,...
4000,4000,15.0,0.4090,"people, percent, help, country, community, yea...","""Im 14, and I want to go home."" ""My name is Be...",502934
4001,4001,16.0,0.4829,"people, test, study, time, example, problem, r...","In 1905, psychologists Alfred Binet and Théodo...",307187
4002,4002,12.0,0.3337,"crime, forfeiture, drug, police, case, war, vi...",Picture yourself driving down the road tomorro...,464414
4003,4003,13.0,0.6138,"life, man, feel, family, girl, young, become, ...","In early 1828, Sojourner Truth approached the ...",56582


In [None]:
# Total views per dominant topic
v_df = (
    df_dominant_topic.groupby('Dominant_Topic')['views']
    .sum()
    .reset_index()
    .rename(columns={'Dominant_Topic': 'Topic', 'views': 'total_views'})
)

In [None]:
v_df

Unnamed: 0,Topic,total_views
0,0.0,3648457
1,1.0,212507298
2,2.0,3818059604
3,3.0,164634414
4,4.0,203249020
5,5.0,79417767
6,6.0,143822095
7,7.0,89877673
8,8.0,59791812
9,9.0,131770125
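
To read off the most viewed topic, the per-topic totals need to be compared. The sketch below shows the idea in plain Python, using only the first five (topic, views) rows displayed earlier. The helper name `views_by_topic` is hypothetical, and the result applies to these five rows only, not to the full dataset.

```python
from collections import defaultdict

def views_by_topic(rows):
    """Sum views for each dominant topic from (topic_id, views) pairs."""
    totals = defaultdict(int)
    for topic_id, views in rows:
        totals[topic_id] += views
    return dict(totals)

# (Dominant_Topic, views) for the first five talks shown above
rows = [(2, 3523392), (15, 14501685), (2, 1920832), (28, 2664069), (2, 65051954)]

totals = views_by_topic(rows)
most_viewed = max(totals, key=totals.get)
print(most_viewed, totals[most_viewed])
```

On the full table, sorting with `v_df.sort_values('total_views', ascending=False)` gives the same ranking. Among the rows displayed above, topic 2 has by far the largest total, but it is also the topic dominated by generic words, so its lead partly reflects how many talks were assigned to it.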


## Topic Similarity

In [None]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=id2word, num_topics=30)

In [None]:
doc = "World economy"
vec_bow = id2word.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

In [None]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it



In [None]:
index.save("simIndex.index")



In [None]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples


In [None]:
# print out the documents with highest similarity score to the keyword
sim = sorted(enumerate(sims), key=lambda item: -item[1])
print(sim)
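
`MatrixSimilarity` scores each document by the cosine similarity between its vector and the query vector, and the sort above ranks documents by that score. A minimal pure-Python sketch of the same idea on toy two-dimensional vectors follows. The helper names and the vectors are illustrative, not taken from the LSI model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query, doc_vectors):
    """Return (document_number, similarity) pairs, most similar first."""
    scores = [(i, cosine(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scores, key=lambda item: -item[1])

# Toy vectors standing in for documents projected into topic space
doc_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.1]

ranking = rank_by_similarity(query, doc_vectors)
print(ranking)
```

The first entries of such a ranking are the document numbers to look up with `df.iloc[...]`, as done for the two talks below.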

In [38]:
df.iloc[3215]

talk_id                                                       35621
title             How today's truths shape tomorrow's possibilities
speaker_1                                           Yannick Roudaut
all_speakers                                 {0: 'Yannick Roudaut'}
occupations          {0: ['philosopher', 'author', 'entrepreneur']}
about_speakers    {0: 'Yannick Roudaut is an author, journalist,...
views                                                        358484
recorded_date                                            2013-01-22
published_date                                           2019-02-15
event                                                    TEDxNantes
native_lang                                                      fr
available_lang                                   ['en', 'es', 'fr']
comments                                                        NaN
duration                                                        808
topics            ['future', 'philosophy', 'econ

In [39]:
df.iloc[3217]

talk_id                                                       24019
title                                How to build a fictional world
speaker_1                                              Kate Messner
all_speakers                                    {0: 'Kate Messner'}
occupations                  {0: ['author', 'educator', 'speaker']}
about_speakers    {0: 'Kate Messner believes in nature, art, mag...
views                                                       5091788
recorded_date                                            2014-01-09
published_date                                           2019-02-15
event                                                        TED-Ed
native_lang                                                      en
available_lang    ['ar', 'bg', 'cs', 'de', 'el', 'en', 'es', 'fr...
comments                                                        NaN
duration                                                        303
topics            ['literature', 'TED-Ed', 'anim