# Bag Of Words Topic Modeling and Recommender for Ted Talks - JBCS
 - Summer K. Rankin 

Once the text has been cleaned, we move on to the next steps:

1. Vectorize
2. Topic modeling
3. Visualization
4. Recommender (optional)

# Load the packages

In [None]:
import nltk, re, pickle, os
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation,  TruncatedSVD, NMF
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing  import  StandardScaler

import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

import pyLDAvis
from IPython.display import display

# Import the cleaned text

In [None]:
with open('./data/cleaned_talks.pkl', 'rb') as picklefile:
    cleaned_talks = pickle.load(picklefile)    

# 1 Vectorization
Vectorization is the step that turns our words into numbers. Two common methods are the count vectorizer and tf-idf. The count vectorizer counts the number of times each term appears in each document. The result is a matrix in which each row is a document and each column is a word (or n-gram), so each cell holds the frequency of that term in that document. Because any one document uses only a small part of the vocabulary, most cells are zero; we call this a sparse matrix. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

tf-IDF

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

----

For this tutorial, tokenization and vectorization are bundled together because we are using sklearn's feature extraction functions. This means we will set the parameters of these functions to tokenize the way we want, include n-grams, and set thresholds for the maximum or minimum document frequency of a term. https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af

- takes us from words to numbers 
- create the document-term matrix which is the basis for all modeling
    - row = document, column = word or n-gram, data = word's weight for that document
- we will vectorize in 2 ways 
    1. counting the frequency of each term in each document (**CountVectorizer**)
    2. counting the frequency of each term in each document and down-weighting terms that appear in many documents of the corpus: Term Frequency * Inverse Document Frequency (**TfidfVectorizer**)
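
As a rough illustration of the two approaches, the sketch below builds both document-term matrices for a three-sentence toy corpus. The sentences and the variable names (`toy_docs`, `count_df`, `tfidf_df`) are made up for this example and are not part of the TED data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration; not the TED transcripts)
toy_docs = [
    "the brain learns language",
    "the brain stores memory",
    "language shapes memory and culture",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)   # sparse matrix: rows = documents, columns = terms

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)  # same shape, but weighted and L2-normalized

# Both vectorizers use the same default tokenizer, so the vocabularies match;
# columns are ordered alphabetically by term
terms = sorted(count_vec.vocabulary_)
count_df = pd.DataFrame(counts.toarray(), columns=terms)
tfidf_df = pd.DataFrame(weights.toarray(), columns=terms).round(2)

# Fraction of zero cells: even this tiny matrix is about half zeros
sparsity = 1.0 - counts.nnz / (counts.shape[0] * counts.shape[1])

print(count_df)
print(tfidf_df)
print(f"sparsity: {sparsity:.2f}")
```

In the first document, "learns" and "the" both have a count of 1, but tf-idf gives "learns" the larger weight because it occurs in fewer documents.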


# 1.1 Count Vectorize 
+ Using Sklearn algorithms with text data
+ CountVectorizer: Convert a collection of text documents to a matrix of token counts 
+ This implementation produces a sparse representation
+ **CountVectorizer** is a class, so **c_vectorizer** below is an instance of that class
+ note that we can also **lowercase**, remove stop words (**stop_words**), and keep only tokens that match a pattern (**token_pattern**, e.g. letters only) by using the parameters of this class.
+ a specific vocabulary can also be passed through the **vocabulary** parameter.
+ for more info about the many parameters, see the sklearn docs.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
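
To see what these parameters do, here is a small sketch that fits two vectorizers on a made-up five-sentence corpus and compares their vocabularies. The corpus and the names `plain` and `tuned` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
toy_docs = [
    "climate change is a global problem",
    "climate change affects the global food supply",
    "robots will change how we work",
    "cities must change to survive",
    "music and the brain",
]

# Defaults: unigrams only, no stop-word removal, no document-frequency filter
plain = CountVectorizer().fit(toy_docs)

# Remove English stop words, add bigrams, and drop any term
# that occurs in more than 60% of the documents
tuned = CountVectorizer(ngram_range=(1, 2),
                        stop_words="english",
                        max_df=0.6).fit(toy_docs)

plain_vocab = sorted(plain.vocabulary_)
tuned_vocab = sorted(tuned.vocabulary_)

print(plain_vocab)
print(tuned_vocab)
```

Here "change" appears in four of the five documents, so `max_df=0.6` drops it as a unigram, while the bigram "climate change" (two documents) stays in the vocabulary.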


In [None]:
c_vectorizer = CountVectorizer(ngram_range=(1,3), 
                             stop_words='english', 
                             max_df = 0.6, 
                             max_features=2000)

# call `fit` to build the vocabulary
c_vectorizer.fit(cleaned_talks)

# finally, call `transform` to count the terms and convert the text to a bag of words
count_vect_data = c_vectorizer.transform(cleaned_talks)

# to view the document-term matrix, convert the sparse matrix to a dense array
# (get_feature_names_out needs scikit-learn >= 1.0; older releases use get_feature_names)
pd.DataFrame(data = count_vect_data.toarray(), columns=c_vectorizer.get_feature_names_out())

# 1.2 Tfidf (term frequency * inverse document frequency)
+ Assign a weight to each term in each document rather than a raw count.
+ gives more weight to terms that appear in fewer documents
+ similar parameters for this vectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
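
The weighting can be checked by hand. With its default settings, scikit-learn documents the smoothed inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each row is then scaled to unit L2 length. The sketch below recomputes one row manually on a made-up corpus (the corpus and the helper `manual_idf` are illustrative assumptions).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
toy_docs = ["cats chase mice", "dogs chase cats", "mice fear cats"]

vec = TfidfVectorizer()               # defaults: smooth_idf=True, norm="l2"
tfidf = vec.fit_transform(toy_docs).toarray()
vocab = vec.vocabulary_               # maps term -> column index

n_docs = len(toy_docs)

def manual_idf(doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Document 0 contains each of its three terms once, so tf = 1 for all of them
raw = np.zeros(len(vocab))
raw[vocab["cats"]] = 1 * manual_idf(3)    # "cats" appears in all 3 documents
raw[vocab["chase"]] = 1 * manual_idf(2)   # "chase" appears in 2 documents
raw[vocab["mice"]] = 1 * manual_idf(2)    # "mice" appears in 2 documents
manual_row = raw / np.linalg.norm(raw)    # L2-normalize the row

print(np.round(tfidf[0], 3))
print(np.round(manual_row, 3))
```

The two printed rows should agree, and "cats" receives the smallest weight of the three terms because it occurs in every document.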

In [None]:
t_vectorizer = TfidfVectorizer(ngram_range=(1, 3),  
                                   stop_words='english', 
                                   token_pattern="\\b[a-z][a-z]+\\b",
                                   lowercase=True,
                                   max_df = 0.6,
                                   max_features=2000)


# call `fit` to build the vocabulary and calculate weights
t_vectorizer.fit(cleaned_talks)

# finally, call `transform` to convert text to a bag of words
tfidf_data = t_vectorizer.transform(cleaned_talks)

In [None]:
# view a dense representation of the document-term matrix
# (get_feature_names_out needs scikit-learn >= 1.0; older releases use get_feature_names)
pd.DataFrame(tfidf_data.toarray(), columns=t_vectorizer.get_feature_names_out())

# 2 Topic modeling 
Use the document-term matrix created by vectorization to build a latent space and find the words that tend to occur together.

We will use Latent Dirichlet Allocation (LDA) here; other common methods include NMF and SVD.

This reduces the data from thousands of terms (dimensions) to 20 topics (dimensionality reduction).

The result is a latent space with one dimension per topic, so 20 dimensions here.
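
The change in dimensionality is easiest to see on a tiny example. The following sketch, which uses a made-up four-document corpus and 2 topics instead of 20, shows the shapes of the matrices involved.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (made up for illustration): two "space" and two "medicine" documents
toy_docs = [
    "stars planets galaxy telescope",
    "galaxy stars orbit telescope",
    "doctor patient hospital medicine",
    "medicine hospital nurse patient",
]

doc_term = CountVectorizer().fit_transform(toy_docs)    # (4 documents, 10 terms)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term)                  # (4 documents, 2 topics)

print(doc_term.shape, "->", doc_topic.shape)
print(np.round(doc_topic, 2))     # each row is a distribution over the topics
print(lda.components_.shape)      # (2 topics, 10 terms): term weights per topic
```

Each row of `doc_topic` sums to 1, so it can be read as the mix of topics in that document, while `components_` holds the term weights that define each topic.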

# LDA Latent Dirichlet Allocation

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [None]:
def topic_mod(vectorizer, vect_data, topics=20, iters=5, no_top_words=50):
    
    """ use Latent Dirichlet Allocation to get topics"""

    mod = LatentDirichletAllocation(n_components=topics,
                                    max_iter=iters,
                                    random_state=42,
                                    learning_method='online',
                                    n_jobs=-1)
    
    mod_dat = mod.fit_transform(vect_data)
    
    
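    # print the top-weighted terms for each topic (terms only, not their scores)
    def display_topics(model, feature_names, no_top_words):
        for ix, topic in enumerate(model.components_):
            print("Topic ", ix)
            # argsort is ascending, so slice from the end to get the largest weights first
            print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]) + '\n')

    # get_feature_names_out needs scikit-learn >= 1.0; older releases use get_feature_names
    display_topics(mod, vectorizer.get_feature_names_out(), no_top_words)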

    
    return mod, mod_dat
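
The slice `[:-no_top_words - 1:-1]` used to pick the top terms is compact but easy to misread. Here is a minimal sketch of what it does, using made-up term weights for a single topic (the names `feature_names` and `topic_weights` are illustrative).

```python
import numpy as np

# Hypothetical term weights for one topic (one row of `components_`)
feature_names = np.array(["brain", "music", "ocean", "robot", "water"])
topic_weights = np.array([0.5, 3.0, 7.0, 0.1, 4.0])

no_top_words = 3

# argsort() orders indices from smallest to largest weight; the slice walks
# backwards from the last index and stops after `no_top_words` items
top_idx = topic_weights.argsort()[:-no_top_words - 1:-1]
top_terms = feature_names[top_idx]

# An equivalent form that may be easier to read
top_idx_alt = np.argsort(topic_weights)[::-1][:no_top_words]

print(list(top_terms))   # ['ocean', 'water', 'music']
```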

In [None]:
lda_model, lda_data = topic_mod(c_vectorizer, 
                            count_vect_data, 
                            topics=20, 
                            iters=10, 
                            no_top_words=15)  

# 3 Visualization  

# 3.1 View distribution of topics with pyLDAvis 
+ plot the topics in two dimensions; pyLDAvis projects the distances between topics onto a 2-D map (here with metric multidimensional scaling, `mds='mmds'`).
+ Not the best way to look at clusters, but a good place to start and a very nice way to present data to clients

In [None]:
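# Setup to run in Jupyter notebook
# the sklearn helpers live in a submodule that `import pyLDAvis` alone does not load;
# recent pyLDAvis releases rename this submodule to `pyLDAvis.lda_model`
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Create the visualization
vis = pyLDAvis.sklearn.prepare(lda_model, count_vect_data, c_vectorizer, sort_topics=False, mds='mmds')

# can export as a standalone HTML web page
pyLDAvis.save_html(vis, 'lda_visual.html')

# Let's view it!
display(vis)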

# 3.2 Assign a topic to each document

+ for each document, assign the topic (column) with the  highest score from the LDA


In [None]:
topic_index = np.argmax(lda_data, axis=1)
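
A small sketch of this step on a made-up document-topic matrix follows; the values are hypothetical and only there to show what `argmax` along `axis=1` returns.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics (rows sum to 1)
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.30, 0.28, 0.22, 0.20],
])

topic_index = np.argmax(doc_topic, axis=1)   # best topic per row (document)
topic_score = np.max(doc_topic, axis=1)      # how strongly the document loads on it

# A low top score (document 2 here) suggests the document blends several topics
for doc_id, (idx, score) in enumerate(zip(topic_index, topic_score)):
    print(f"document {doc_id}: topic {idx} (score {score:.2f})")
```

Keeping the top score alongside the index is optional, but it helps spot documents for which a single label is a poor summary.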


# 3.3 Assign labels to topics
Try to use the top terms from each topic to give it a label.

In [None]:
# label for each topic index, chosen by reading the top terms of each topic
topic_labels = {
    0: "family",
    1: "agriculture",
    2: "space",
    3: "environment",
    4: "global economy",
    5: "writing",
    6: "sounds",
    7: "belief, mortality",
    8: "transportation",
    9: "gaming",
    10: "architecture",
    11: "education",
    12: "neuroscience",
    13: "climate, energy",
    14: "politics",
    15: "robotics",
    16: "disease biology",
    17: "medicine",
    18: "technology, privacy",
    19: "war",
}

# one row per document, holding the label of its top topic
topic_names = pd.DataFrame(pd.Series(topic_index).map(topic_labels))

In [None]:
topic_names.head()

# 4. Recommender (optional)
We will use the TED Talk metadata to add some information to our recommender.

# import original talks and metadata
Merge them on the 'url' column.

In [None]:
# relative paths, matching the ./data folder used for the pickle above
ted_trans = pd.read_csv('./data/transcripts.csv', encoding = "UTF-8")
ted_main = pd.read_csv('./data/ted_main.csv', encoding = "UTF-8")

In [None]:
ted_all = pd.merge(ted_trans, right=ted_main, on='url')
ted_all.url = ted_all.url.astype('str',copy=False)

ted_all.head(50)
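
The recommender below looks up rows of `ted_all` by position, so it is worth knowing how the merge orders and filters rows. The sketch uses two made-up miniature tables (the column values are hypothetical) to show the default inner join.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
transcripts = pd.DataFrame({
    "transcript": ["text a", "text b", "text c"],
    "url": ["u/a", "u/b", "u/c"],
})
metadata = pd.DataFrame({
    "title": ["Talk B", "Talk A", "Talk D"],
    "url": ["u/b", "u/a", "u/d"],
})

# Inner join (the default): only URLs present in both tables survive,
# and rows keep the order of the left table.
# validate="one_to_one" raises an error if a URL is duplicated in either table.
merged = pd.merge(transcripts, metadata, on="url", validate="one_to_one")

print(merged)
```

If rows drop out of the merge or are duplicated, positions in the merged table may no longer line up with positions in the list of cleaned transcripts, so comparing the lengths of the two is a cheap sanity check.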

In [None]:
def get_recommendations(target_doc, num_of_recs, topics, data, topic_model, vectorizer, topic_model_data):
    """Print and return the `num_of_recs` documents closest to `target_doc` in topic space."""

    # project the target document into the same topic space as the corpus
    new_vec = topic_model.transform(
        vectorizer.transform([target_doc]))

    # brute-force cosine search over the document-topic matrix
    nn = NearestNeighbors(n_neighbors=num_of_recs, metric='cosine', algorithm='brute')
    nn.fit(topic_model_data)

    # kneighbors returns (distances, indices), each with one row per query
    distances, indices = nn.kneighbors(new_vec)
    recommend_list = indices[0]
    ss = distances[0]

    for i, resp in enumerate(recommend_list):
        print('\n--- ID ---\n', resp)
        print('--- distance ---\n', ss[i])
        print('--- topic ---')
        print(topics.iloc[resp,0])
        print(data.iloc[resp,1])
        print('--- teds tags ---')
        print(data.iloc[resp,-3])

    return recommend_list, ss
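
To make the nearest-neighbor step concrete, here is a minimal sketch on made-up topic vectors. The matrix values are hypothetical; only the `NearestNeighbors` settings mirror the function above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical document-topic vectors: 4 documents, 3 topics
doc_topic = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
])

nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
nn.fit(doc_topic)

# Query with document 0 itself, as happens when the target talk is in the corpus
distances, indices = nn.kneighbors(doc_topic[[0]])

print(indices[0])     # the query document comes back first, then its closest neighbor
print(distances[0])   # cosine distance: 0 means same direction, larger means less similar
```

Because the target talk is part of the fitted data, it is returned as its own nearest neighbor with a distance of about 0; skipping the first result gives the actual recommendations.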

In [None]:
rec_list, scores = get_recommendations(cleaned_talks[804], num_of_recs=10, topics=topic_names, data=ted_all,
                                       topic_model=lda_model, vectorizer=c_vectorizer, topic_model_data=lda_data)

# 4.1 search and recommend similar documents
We use a library called fuzzywuzzy to find the title that is most similar to a search term. Then we use that talk as our target document for the recommendation.

Install fuzzywuzzy first (for example with `pip install fuzzywuzzy`).
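
The idea behind the title lookup can be sketched without any extra dependency. The example below swaps in Python's standard-library `difflib` for fuzzywuzzy, so the scores are not the same as `fuzz.token_set_ratio`; the helper `best_title_match` and the short title list are hypothetical and only illustrate the lookup pattern (best title, its score, and its position).

```python
from difflib import SequenceMatcher

def best_title_match(query, titles):
    """Return (title, score, index) for the title most similar to `query`.

    Similarity is difflib's ratio in [0, 1], compared case-insensitively.
    """
    scored = [
        (SequenceMatcher(None, query.lower(), title.lower()).ratio(), idx, title)
        for idx, title in enumerate(titles)
    ]
    score, idx, title = max(scored)
    return title, score, idx

# Hypothetical titles for illustration
titles = [
    "The birth of a word",
    "Computing a theory of everything",
    "How schools kill creativity",
]

title, score, talk_ind = best_title_match("computing theory", titles)
print(title, round(score, 2), talk_ind)
```

The returned position plays the same role as `talk_ind` in the notebook: it selects the transcript that is passed to the recommender as the target document.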

In [None]:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

search_term = "computer science"

titles = ted_all['title']

In [None]:
# with a pandas Series, extractOne returns (best title, score, index of that title)
title, score, talk_ind = process.extractOne(search_term, titles, scorer=fuzz.token_set_ratio)

In [None]:
rec_list, scores = get_recommendations(cleaned_talks[talk_ind], num_of_recs=10, topics=topic_names, data=ted_all,
                                       topic_model=lda_model, vectorizer=c_vectorizer, topic_model_data=lda_data)