# Table of Contents
1. [Preparation](#preparation)
2. [Dependency Graph](#dependency_graph)  
    1. [Example 1](#dependency_graph_example1)
    2. [Example 2](#dependency_graph_example2)
3. [Feature Extraction](#feature_extraction)   
4. [EDA](#eda)
    1. [Tags Matching](#eda_tags_matching)
    2. [Top Groups](#eda_top_groups)
    3. [First Answer](#eda_first_answer)
    4. [Word Count](#eda_wordcound)
5. [Recommendation](#recommendation)
    1. [Example 1](#recommender_example1)
    2. [Example 2](#recommender_example2)
6. [Topic Model (LDA)](#lda)
    1. [Model](#lda_model)
    2. [Topics](#lda_topics)
    3. [Document-Topic Probabilities](#lda_doc_topic_prob)
    4. [Example 1](#lda_example1)
    5. [Example 2](#lda_example2)
7. [Next steps](#next_steps)
8. [Version control](#version_control)

# 1. Preparation <a id="preparation"></a>

## Libraries

In [1]:
import os
import datetime
import math
import random
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import networkx as nx

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import spacy
nlp = spacy.load('en')
nlp.remove_pipe('parser')
nlp.remove_pipe('ner')
#nlp.remove_pipe('tagger')

from wordcloud import WordCloud

import gensim
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import plotly.offline as py
py.init_notebook_mode(connected=True)


## Read CSV Files

In [None]:
input_dir = '../input'
print(os.listdir(input_dir))

In [None]:
professionals = pd.read_csv(os.path.join(input_dir, 'professionals.csv'))
groups = pd.read_csv(os.path.join(input_dir, 'groups.csv'))
comments = pd.read_csv(os.path.join(input_dir, 'comments.csv'))
school_memberships = pd.read_csv(os.path.join(input_dir, 'school_memberships.csv'))
tags = pd.read_csv(os.path.join(input_dir, 'tags.csv'))
emails = pd.read_csv(os.path.join(input_dir, 'emails.csv'))
group_memberships = pd.read_csv(os.path.join(input_dir, 'group_memberships.csv'))
answers = pd.read_csv(os.path.join(input_dir, 'answers.csv'))
students = pd.read_csv(os.path.join(input_dir, 'students.csv'))
matches = pd.read_csv(os.path.join(input_dir, 'matches.csv'))
questions = pd.read_csv(os.path.join(input_dir, 'questions.csv'))
tag_users = pd.read_csv(os.path.join(input_dir, 'tag_users.csv'))
tag_questions = pd.read_csv(os.path.join(input_dir, 'tag_questions.csv'))

## Global Parameters

In [None]:
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', -1)

seed = 13
random.seed(seed)
np.random.seed(seed)

# 2. Dependency Graph <a id="dependency_graph"></a>  
![workflow_diagram](https://i.imgur.com/zzAo1JD.jpg)

## Functions

In [None]:
def plot_dependecy_graph(email_index=seed, plot_graph=True, print_report=False):
    """ Merges all relevant data for a given email together and builds a dependency graph and report.

        Actual missing: Group membership and School membership
        
        :param email_index: Index of the 'emails' dataframe (default: seed)
        :param plot_graph: Boolean to plot the graph (default: True)
        :param print_report: Boolean to print the text report (default: False)
    """  
    email_id = emails.loc[np.atleast_1d(email_index), 'emails_id'].values  # accept a single index or a list of indices
    # Merge the dataframes
    graph_data = matches[matches['matches_email_id'].isin(email_id)]
    graph_data = pd.merge(graph_data, questions, left_on='matches_question_id', right_on='questions_id', how='left')
    graph_data = pd.merge(graph_data, answers, left_on='questions_id', right_on='answers_question_id', how='left')
    graph_data = pd.merge(graph_data, tag_questions, left_on='questions_id', right_on='tag_questions_question_id', how='left')
    graph_data = pd.merge(graph_data, tags, left_on='tag_questions_tag_id', right_on='tags_tag_id', how='left')
    graph_data = pd.merge(graph_data, tag_users, left_on='questions_author_id', right_on='tag_users_user_id', how='left', suffixes=('', '_student'))
    graph_data = pd.merge(graph_data, tags, left_on='tag_users_tag_id', right_on='tags_tag_id', how='left', suffixes=('', '_student'))
    graph_data = pd.merge(graph_data, group_memberships, left_on='questions_author_id', right_on='group_memberships_user_id', how='left', suffixes=('', '_student'))
    graph_data = pd.merge(graph_data, school_memberships, left_on='questions_author_id', right_on='school_memberships_user_id', how='left', suffixes=('', '_student'))
    graph_data = pd.merge(graph_data, tag_users, left_on='answers_author_id', right_on='tag_users_user_id', how='left', suffixes=('', '_professional'))
    graph_data = pd.merge(graph_data, tags, left_on='tag_users_tag_id_professional', right_on='tags_tag_id', how='left', suffixes=('', '_professional'))
    graph_data = pd.merge(graph_data, group_memberships, left_on='answers_author_id', right_on='group_memberships_user_id', how='left', suffixes=('', '_professional'))
    graph_data = pd.merge(graph_data, school_memberships, left_on='answers_author_id', right_on='school_memberships_user_id', how='left', suffixes=('', '_professional'))    
    
    if plot_graph:
        plt.figure(figsize=(15, 15)) 
        G = nx.Graph()
        node_color = []
        # Nodes
        df_nodes = pd.DataFrame({'node':['matches_email_id', 'questions_id', 'questions_author_id', 'answers_id', 'answers_author_id', 'group_memberships_group_id', 
                                'group_memberships_group_id_professional', 'school_memberships_school_id', 'school_memberships_school_id_professional',
                                'tags_tag_name', 'tags_tag_name_student', 'tags_tag_name_professional'],
                         'color':['grey', 'blue', 'green', 'red', 'cyan', 'orange', 'orange', 'purple', 'purple', 'yellow', 'yellow', 'yellow']})
        for index, row in df_nodes.iterrows():
            G.add_nodes_from(graph_data[row['node']].dropna().unique())
            node_color +=([row['color']]*len(graph_data[row['node']].dropna().unique()))
        # Edges 
        df_edges = pd.DataFrame({'source':['matches_email_id', 'questions_id', 'questions_id', 'answers_id', 'questions_id',
                                          'questions_author_id', 'answers_author_id', 'questions_author_id',
                                          'answers_author_id', 'answers_author_id', 'answers_author_id'],
                                'target':['questions_id', 'questions_author_id', 'answers_id', 'answers_author_id', 'tags_tag_name',
                                         'tags_tag_name_student', 'tags_tag_name_professional', 'group_memberships_group_id',
                                         'group_memberships_group_id_professional', 'school_memberships_school_id', 'school_memberships_school_id_professional']})
        for index, row in df_edges.iterrows():
            G.add_edges_from({tuple(row) for i,row in graph_data[[row['source'], row['target']]].dropna().iterrows()})

        nx.draw_networkx(G, with_labels=True, node_color=node_color, font_size=8, node_size=900)
        plt.title('Dependency graph for email {}'.format(email_id))
        plt.axis('off')

        legend_email = mpatches.Patch(color='grey', label='Email')
        legend_question = mpatches.Patch(color='blue', label='Question')
        legend_student = mpatches.Patch(color='green', label='Student')
        legend_answer = mpatches.Patch(color='red', label='Answer')
        legend_professional = mpatches.Patch(color='cyan', label='Professional')
        legend_tag = mpatches.Patch(color='yellow', label='Tag')
        legend_group = mpatches.Patch(color='orange', label='Group')
        legend_school = mpatches.Patch(color='purple', label='School')
        plt.legend(handles=[legend_email, legend_question, legend_student, legend_answer, legend_professional, legend_tag, legend_group, legend_school])
        plt.show()
    
    if print_report:
        print('Email ID: {}'.format(email_id))
        print('Questions: {}'.format(len(graph_data['questions_id'].unique())))
        for question in graph_data['questions_id'].unique():
            date = graph_data[graph_data['questions_id'] == question]['questions_date_added'].dropna().unique()[0]
            author = graph_data[graph_data['questions_id'] == question]['questions_author_id'].dropna().unique()[0]
            question_tags = graph_data[graph_data['questions_id'] == question]['tags_tag_name'].dropna().unique()
            author_tags = graph_data[graph_data['questions_id'] == question]['tags_tag_name_student'].dropna().unique()
            author_groups = graph_data[graph_data['questions_id'] == question]['group_memberships_group_id'].dropna().unique()
            author_school = graph_data[graph_data['questions_id'] == question]['school_memberships_school_id'].dropna().apply('{0:.0f}'.format).unique()
            question_answers = graph_data[graph_data['questions_id'] == question]['answers_id'].dropna().unique()
            print('  \033[44mQuestion {}\033[0m from \033[42mstudent {}\033[0m on {}'.format(question, author, date))
            print('    Question Tags: {}'.format(', '.join(question_tags)))
            print('    Student Tags: {}'.format(', '.join(author_tags)))
            print('    Student Groups: {}'.format(', '.join(author_groups)))
            print('    Student Schools: {}'.format(', '.join(author_school)))
            print('    Answers: {}'.format(len(question_answers)))
            for question_answer in question_answers:
                date = graph_data[graph_data['answers_id'] == question_answer]['answers_date_added'].dropna().unique()[0]
                author = graph_data[graph_data['answers_id'] == question_answer]['answers_author_id'].dropna().unique()[0]
                author_tags = graph_data[graph_data['answers_id'] == question_answer]['tags_tag_name_professional'].dropna().unique()
                author_groups = graph_data[graph_data['answers_id'] == question_answer]['group_memberships_group_id_professional'].dropna().unique()
                author_school = graph_data[graph_data['answers_id'] == question_answer]['school_memberships_school_id_professional'].dropna().apply('{0:.0f}'.format).unique()
                print('      \033[41mAnswer {}\033[0m from \033[46mprofessional {}\033[0m on {}'.format(question_answer, author, date))
                print('        Professional Tags: {}'.format(', '.join(author_tags)))
                print('        Professional Groups: {}'.format(', '.join(author_groups)))
                print('        Professional Schools: {}'.format(', '.join(author_school)))

## Example 1  <a id="dependency_graph_example1"></a> 
In example 1 we have <span style="background-color:gray">one email</span> with <span style="background-color:blue">two questions</span>, each carrying a few <span style="background-color:yellow">question tags</span>. There are <span style="background-color:red">several answers</span> to the questions. The <span style="background-color:green">students</span> who asked the questions have no <span style="background-color:orange">group</span> or <span style="background-color:purple">school</span> membership. The <span style="background-color:cyan">professionals</span> are more inclined to subscribe to <span style="background-color:yellow">tags</span> and to specify their <span style="background-color:purple">school</span> membership, but only <span style="background-color:cyan">two professionals</span> have joined a <span style="background-color:orange">group</span>.

In [None]:
plot_dependecy_graph(email_index=[seed])

## Example 2 <a id="dependency_graph_example2"></a> 

In [None]:
plot_dependecy_graph(email_index=[seed*2], print_report=True)

# 3. Feature Extraction <a id="feature_extraction"></a> 

## Parameters

In [None]:
# Spacy Tokenfilter for part-of-speech tagging
token_pos = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'INTJ', 'X']

# The data export is from 1 February 2019. In production, use datetime.datetime.now()
actual_date = datetime.datetime(2019, 2, 1)

## Functions

In [None]:
def nlp_preprocessing(data):
    """ Use NLP to transform the text corpus to cleaned sentences and word tokens

    """    
    def token_filter(token):
        """ Keep tokens that are alphabetic, in the POS (part-of-speech) list and not in the stop list

        """    
        return not token.is_stop and token.is_alpha and token.pos_ in token_pos
    
    processed_tokens = []
    data_pipe = nlp.pipe(data)
    for doc in data_pipe:
        filtered_tokens = [token.lemma_.lower() for token in doc if token_filter(token)]
        processed_tokens.append(filtered_tokens)
    return processed_tokens

## Features

In [None]:
# Transform datatypes
questions['questions_date_added'] = pd.to_datetime(questions['questions_date_added'])
answers['answers_date_added'] = pd.to_datetime(answers['answers_date_added'])
professionals['professionals_date_joined'] = pd.to_datetime(professionals['professionals_date_joined'])
students['students_date_joined'] = pd.to_datetime(students['students_date_joined'])

### Questions
# Merge Question Title and Body
questions['questions_full_text'] = questions['questions_title'] +'\r\n\r\n'+ questions['questions_body']
# Count of answers
temp = answers.groupby('answers_question_id').size()
questions['questions_answers_count'] = pd.merge(questions, pd.DataFrame(temp.rename('count')), left_on='questions_id', right_index=True, how='left')['count'].fillna(0).astype(int)
# First answer for questions
temp = answers[['answers_question_id', 'answers_date_added']].groupby('answers_question_id').min()
questions['questions_first_answers'] = pd.merge(questions, pd.DataFrame(temp), left_on='questions_id', right_index=True, how='left')['answers_date_added']
# Last answer for questions
temp = answers[['answers_question_id', 'answers_date_added']].groupby('answers_question_id').max()
questions['questions_last_answers'] = pd.merge(questions, pd.DataFrame(temp), left_on='questions_id', right_index=True, how='left')['answers_date_added']

### Answers
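# Time required to answer the question
# Left-merge from answers so the rows stay in the order of `answers`
temp = pd.merge(answers, questions[['questions_id', 'questions_date_added']], left_on='answers_question_id', right_on='questions_id', how='left')
answers['time_delta_answer'] = (temp['answers_date_added'] - temp['questions_date_added']).values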

### Professionals
# Time since joining
#professionals['professionals_time_delta_joined'] = actual_date - professionals['professionals_date_joined']
# Number of answers
temp = answers.groupby('answers_author_id').size()
professionals['professionals_answers_count'] = pd.merge(professionals, pd.DataFrame(temp.rename('count')), left_on='professionals_id', right_index=True, how='left')['count'].fillna(0).astype(int)
# Last activity (Answer)
temp = answers.groupby('answers_author_id')['answers_date_added'].max()
professionals['date_last_answer'] = pd.merge(professionals, pd.DataFrame(temp.rename('last_answer')), left_on='professionals_id', right_index=True, how='left')['last_answer']
# Avg answers per week
#professionals['avg_answers_week'] = (professionals['answers_count'] / ((professionals['date_last_answer'] - professionals['professionals_date_joined']).dt.days).apply(lambda x: np.ceil(x/7) if x > 0 else 1))

### Students
# Time since joining
#students['students_time_delta_joined'] = actual_date - students['students_date_joined']
# Number of questions
temp = questions.groupby('questions_author_id').size()
students['students_questions_count'] = pd.merge(students, pd.DataFrame(temp.rename('count')), left_on='students_id', right_index=True, how='left')['count'].fillna(0).astype(int)
# Last activity (Question)
temp = questions.groupby('questions_author_id')['questions_date_added'].max()
students['date_last_question'] = pd.merge(students, pd.DataFrame(temp.rename('last_question')), left_on='students_id', right_index=True, how='left')['last_question']


In [None]:
%%time
# Get NLP Tokens
questions['nlp_tokens'] = nlp_preprocessing(questions['questions_full_text'])

# 4. EDA <a id="eda"></a>

## Functions

In [None]:
def word_count(text):
    """ Count the words in the text
    
    """
    result = text.split()
    result = len(result)
    return result

In [None]:
def plot_tags_matching():
    students_tags = tag_users[tag_users['tag_users_user_id'].isin(students['students_id'])]
    students_tags = pd.merge(students_tags, tags, left_on='tag_users_tag_id', right_on='tags_tag_id')
    students_tags['user_type'] = 'student'
    professionals_tags = tag_users[tag_users['tag_users_user_id'].isin(professionals['professionals_id'])]
    professionals_tags = pd.merge(professionals_tags, tags, left_on='tag_users_tag_id', right_on='tags_tag_id')
    professionals_tags['user_type'] = 'professional'
    questions_tags = tag_questions
    questions_tags = pd.merge(questions_tags, tags, left_on='tag_questions_tag_id', right_on='tags_tag_id')
    questions_tags['user_type'] = 'question'
    plt_data = pd.concat([students_tags, professionals_tags, questions_tags])

    plt_data = plt_data[['tags_tag_name', 'user_type']].pivot_table(index='tags_tag_name', columns='user_type', aggfunc=len, fill_value=0)
    plt_data['professional'] = plt_data['professional'] / professionals.shape[0]
    plt_data['student'] = plt_data['student'] / students.shape[0]
    plt_data['question'] = plt_data['question'] / questions.shape[0]
    plt_data['sum'] = (plt_data['professional'] + plt_data['student'] + plt_data['question'])
    plt_data = plt_data.sort_values(by='sum', ascending=False).drop(['sum'], axis=1).head(100)

    # Bubble chart
    fig, ax = plt.subplots(facecolor='w',figsize=(15, 15))
    ax.set_xlabel('Professionals')
    ax.set_ylabel('Students')
    ax.set_title('Tags Matching')
    ax.set_xlim([0, max(plt_data['professional'])+0.001])
    ax.set_ylim([0, max(plt_data['student'])+0.001])
    import matplotlib.ticker as mtick
    ax.xaxis.set_major_formatter(mtick.FuncFormatter("{:.2%}".format))
    ax.yaxis.set_major_formatter(mtick.FuncFormatter("{:.2%}".format))
    ax.grid(True)
    i = 0
    for key, row in plt_data.iterrows():
        ax.scatter(row['professional'], row['student'], s=row['question']*10**4, alpha=.5)
        if i < 25:
            ax.annotate('{}: {:.2%}'.format(key, row['question']), xy=(row['professional'], row['student']))
        i += 1
    plt.show()
    
    # Wordcloud
    plt.figure(figsize=(20, 20))
    wordcloud_values = ['student', 'professional', 'question']
    axisNum = 1
    for wordcloud_value in wordcloud_values:
        wordcloud = WordCloud(margin=0, max_words=20, random_state=seed).generate_from_frequencies(plt_data[wordcloud_value])
        ax = plt.subplot(1, 3, axisNum)
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(wordcloud_value)
        plt.axis("off")
        axisNum += 1
    plt.show()    

In [None]:
def plot_top_groups():
    plt_data = group_memberships.groupby('group_memberships_group_id').size().sort_values(ascending=False)
    plt_data.plot(kind='bar', figsize=(15, 5), color='green')
    plt_data.plot(lw=3, color='green')
    plt.xticks(range(len(plt_data)), [])
    plt.xlabel('Groups')
    plt.ylabel('Members')
    plt.title('Top Groups')
    plt.show()
    print(plt_data.head(5))

In [None]:
def plot_first_answer():
    plt_data = (questions['questions_first_answers'] - questions['questions_date_added']).dt.days.rename("First Answer")
    print(plt_data.describe())
    plt_data.plot(kind='box', showfliers=False, grid=True, vert=False, figsize=(15, 5))
    plt.xlabel('Days')
    plt.title('Time for first answer')
    plt.show()

In [None]:
def plot_wordcount_questions():
    plt_data_title = questions['questions_title'].apply(word_count).rename("Title")
    plt_data_body = questions['questions_body'].apply(word_count).rename("Body")
    plt_data_fulltext = questions['questions_full_text'].apply(word_count).rename("Full Text")
    plt_data = pd.DataFrame([plt_data_title, plt_data_body, plt_data_fulltext]).T
    print(plt_data.describe())
    plt_data.plot(kind='box', showfliers=False, vert=False, figsize=(15, 5), grid=True)
    plt.xticks(range(0, 130, 10))
    plt.xlabel('Words')
    plt.title('Word count (Questions)')
    plt.show()

In [None]:
def plot_wordcount_answers():
    plt_data = answers['answers_body'].astype(str).apply(word_count).rename("Answers body")
    print(plt_data.describe())
    plt_data.plot(kind='box', showfliers=False, vert=False, figsize=(15, 5), grid=True)
    plt.xticks(range(0, 400, 50))
    plt.xlabel('Words')
    plt.title('Word count (Answers)')
    plt.show()

In [None]:
def plot_wordcount():
    plt_data_questions = questions['questions_full_text'].apply(word_count).rename("Questions")
    plt_data_answers = answers['answers_body'].astype(str).apply(word_count).rename("Answers")
    plt_data = pd.DataFrame([plt_data_questions, plt_data_answers]).T
    print(plt_data.describe())
    plt_data.plot(kind='box', showfliers=False, vert=False, figsize=(15, 5), grid=True)
    plt.xticks(range(0, 400, 10))
    plt.xlabel('Words')
    plt.title('Word count')
    plt.show()

## Tags matching <a id="eda_tags_matching"></a> 
The size of each bubble reflects the share of questions that use the tag. The x-axis shows the share of professionals who have subscribed to the tag, and the y-axis the share of students who have subscribed to it.  
The top tag for professionals is *telecommunications*, on the right side with about 11%, but the tag rarely appears in questions or student subscriptions.  
The top tags for questions are *college* with 15.6% and *career* with 6.5%. The other top tags are career specific (*medicine, engineering, business, ...*).  
The top tag for students is *college*, but only 1.5% of the students have subscribed to it.  
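
The normalisation behind the three axes can be sketched on toy data. This is a minimal illustration in plain Python, not the notebook's `plot_tags_matching`; the tag lists and population sizes are made up for the example.

```python
from collections import Counter

# Hypothetical toy data: one entry per (user, tag) subscription or (question, tag) use.
professional_tags = ["telecommunications", "telecommunications", "college"]
student_tags = ["college"]
question_tags = ["college", "college", "career"]

# Hypothetical population sizes used for normalisation.
n_professionals, n_students, n_questions = 10, 20, 8


def tag_shares(tag_list, total):
    """Return {tag: share of `total` that uses the tag}."""
    counts = Counter(tag_list)
    return {tag: count / total for tag, count in counts.items()}


x = tag_shares(professional_tags, n_professionals)  # x-axis of the bubble chart
y = tag_shares(student_tags, n_students)            # y-axis of the bubble chart
size = tag_shares(question_tags, n_questions)       # bubble size

for tag in sorted(set(x) | set(y) | set(size)):
    print(f"{tag}: professionals={x.get(tag, 0):.1%}, "
          f"students={y.get(tag, 0):.1%}, questions={size.get(tag, 0):.1%}")
```

Dividing by the population size rather than by the number of subscriptions keeps the three groups comparable even though their sizes differ.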

In [None]:
plot_tags_matching()

## Top Groups  <a id="eda_top_groups"></a>
There are only two groups with more than 100 members. Relative to the number of students and professionals, groups are currently used by only a small share of members.

In [None]:
plot_top_groups()

## Time for first answer  <a id="eda_first_answer"></a>
Most questions get their first answer within the first few days.  
There are some outliers (e.g. the maximum of 1897 days) that are hidden in the box plot.
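
How the delay is derived can be shown on a few made-up rows. The sketch below uses only the standard library and hypothetical ids and dates; it mirrors the idea of taking the earliest answer per question and subtracting the question date, not the notebook's pandas code.

```python
from datetime import datetime
from statistics import median

# Hypothetical toy data: question id -> date asked, and (question id, answer date) pairs.
asked = {"q1": datetime(2019, 1, 1), "q2": datetime(2019, 1, 5), "q3": datetime(2019, 1, 7)}
answer_rows = [
    ("q1", datetime(2019, 1, 3)),
    ("q1", datetime(2019, 1, 2)),
    ("q2", datetime(2019, 1, 25)),
]

# Earliest answer per question (the equivalent of a groupby(...).min()).
first_answer = {}
for question_id, answered_at in answer_rows:
    if question_id not in first_answer or answered_at < first_answer[question_id]:
        first_answer[question_id] = answered_at

# Whole days until the first answer; unanswered questions (q3) are skipped.
days_to_first_answer = {
    question_id: (first_answer[question_id] - asked[question_id]).days
    for question_id in first_answer
}
print(days_to_first_answer)
print("median days:", median(days_to_first_answer.values()))
```

The median is a more robust summary than the mean here, because a single question answered years later would dominate the average.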

In [None]:
plot_first_answer()

## Word count <a id="eda_wordcound"></a>
Here we can see how many words are used in the questions and answers.  
The professionals write very detailed answers to the students' questions.

In [None]:
plot_wordcount()

# 5. Recommendation <a id="recommendation"></a>  
With the preprocessed data I build a tf-idf corpus and use it to calculate the (cosine) similarity between a new question and the existing questions.  
Here are the detailed steps:  
1. Use NLP on the Questions corpus.  
    a. Use part-of-speech tagging to filter words.  
    b. Calculate the tf-idf for a better Information Retrieval.  
2. Use NLP on the Query text.  
    a. Use part-of-speech tagging to filter words.  
    b. Calculate the tf-idf for a better Information Retrieval.  
3. Use the cosine similarity to get similar questions for the query text.  
4. Get the answers and professionals for the similar questions.  
5. Make a recommendation to fit the best professionals to answer the new question.  
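
Steps 1 to 3 can be sketched without any external library. The snippet below is a plain-Python stand-in for the tf-idf and cosine similarity steps, assuming the texts are already tokenised and filtered; the toy corpus, the query and the helper names `tfidf` and `cosine` are hypothetical. The idf term is modelled on the smoothed variant, so the exact weights may differ from what a library vectoriser produces.

```python
import math
from collections import Counter

# Hypothetical, already tokenised toy corpus (stand-in for the filtered lemmas).
corpus = [
    ["lawyer", "hard", "school"],
    ["doctor", "school", "medicine"],
    ["lawyer", "law", "school", "hard"],
]
query = ["become", "lawyer", "hard"]

n_docs = len(corpus)
doc_freq = Counter(token for doc in corpus for token in set(doc))


def tfidf(tokens):
    """Map tokens to tf-idf weights; tokens unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in doc_freq)
    # Smoothed idf: rare tokens get a higher weight than common ones.
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


query_vec = tfidf(query)
ranking = sorted(((cosine(tfidf(doc), query_vec), i) for i, doc in enumerate(corpus)), reverse=True)
for similarity, index in ranking:
    print(f"doc {index}: {similarity:.3f}")
```

Documents that share no vocabulary with the query get a similarity of zero, which is what a `threshold` parameter can filter out.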

I use the similar questions and the professionals who answered them to calculate a *recommendation score*. Based on this score, I recommend professionals who have already answered many similar questions that few others have answered. The first draft of the formula is as follows:  
$Professional_{score} = \sum\limits_{q \in Q}\left(q_{sim} \cdot \dfrac{1}{q_{answers}} \cdot p_{answers}\right)$  
$Q$ = Similar questions answered by professional $p$  
$q_{sim}$ = Similarity of the question $q$  
$q_{answers}$ = Total answers for question $q$  
$p_{answers}$ = Number of answers given by professional $p$ to the similar questions
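
A small worked example may make the score easier to follow. It is a sketch on made-up data in plain Python; the ids, similarities and the name `answer_pairs` are hypothetical, and $p_{answers}$ is assumed to be counted among the similar questions only.

```python
from collections import Counter, defaultdict

# Hypothetical similar questions with their similarity to the new question.
similarity = {"q1": 0.8, "q2": 0.5}
# Hypothetical answers to those questions as (question id, professional id) pairs.
answer_pairs = [("q1", "p1"), ("q1", "p2"), ("q2", "p1")]

answers_per_question = Counter(q for q, _ in answer_pairs)      # q_answers
answers_per_professional = Counter(p for _, p in answer_pairs)  # p_answers (assumed: similar questions only)

scores = defaultdict(float)
for question_id, professional_id in answer_pairs:
    scores[professional_id] += (
        similarity[question_id]
        / answers_per_question[question_id]
        * answers_per_professional[professional_id]
    )

for professional_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(professional_id, round(score, 2))
```

Here `p1` scores 0.8 / 2 * 2 + 0.5 / 1 * 2 = 1.8 and `p2` scores 0.8 / 2 * 1 = 0.4, so the professional who answered both similar questions, one of them alone, ranks first.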
    

## Functions

In [None]:
def get_similar_docs(corpus, query_text, threshold=0.0, top=5):
    """ Calculates the tf-idf of the corpus and returns similar questions matching the query text.

    """  
    #nlp_corpus = [' '.join(x) for x in nlp_preprocessing(corpus)]
    nlp_corpus = [' '.join(x) for x in questions['nlp_tokens']]
    nlp_text = [' '.join(nlp_preprocessing([query_text])[0])]
    vectorizer = TfidfVectorizer(lowercase = True, stop_words = 'english')
    vectorizer.fit(nlp_corpus)
    corpus_tfidf = vectorizer.transform(nlp_corpus)
    
    text_tfidf = vectorizer.transform(nlp_text)
    sim = cosine_similarity(corpus_tfidf, text_tfidf)
    sim_idx = (sim >= threshold).nonzero()[0]
    result = pd.DataFrame({'similarity':sim[sim_idx].reshape(-1,),
                          'text':corpus[sim_idx]},
                          index=sim_idx)
    result = result.sort_values(by=['similarity'], ascending=False).head(top)
    return result

In [None]:
def get_questions_answers(sim_questions):
    """ Merges the questions with the corresponding answers

    """  
    sim_question_answers = pd.merge(sim_questions, questions, left_index=True, right_index=True)
    sim_question_answers = pd.merge(sim_question_answers, answers, left_on='questions_id', right_on='answers_question_id')
    sim_question_answers = sim_question_answers[['questions_id', 'similarity', 'questions_title', 'questions_body', 'answers_body']]
    return sim_question_answers

In [None]:
def get_recommendation(sim_questions, plot_graph=True, print_report=True, top_n=5):
    """ Get the top recommended professionals based on questions

    """    
    df = pd.merge(sim_questions, questions, left_index=True, right_index=True)
    df = pd.merge(df, answers, left_on='questions_id', right_on='answers_question_id')
    df = df[['questions_id', 'similarity', 'answers_author_id']]
    # Links from the new question to each similar question, weighted by similarity
    sim_links = df.drop_duplicates('questions_id')
    sim_links = pd.DataFrame({'source': 'question',
                              'target': sim_links['questions_id'],
                              'value': sim_links['similarity']})
    # Links from each similar question to its answering professionals, weighted by the score term
    temp_values = df['similarity'] / df.groupby('questions_id')['questions_id'].transform('size')
    temp_values = temp_values * df.groupby('answers_author_id')['answers_author_id'].transform('size')
    prof_links = pd.DataFrame({'source': df['questions_id'],
                               'target': df['answers_author_id'],
                               'value': temp_values})
    plot_data = pd.concat([sim_links, prof_links], ignore_index=True)

    if plot_graph:
        labels = pd.concat([plot_data['source'], plot_data['target']]).unique()
        sources = plot_data['source'].apply(lambda x: labels.tolist().index(x))
        targets = plot_data['target'].apply(lambda x: labels.tolist().index(x))
        values = plot_data['value']*100
        
        data = dict(
            type='sankey',
            node = dict(
              label = labels,
            ),
            link = dict(
              source = sources,
              target = targets,
              value = values
          ))
        layout =  dict(
            title = "Recommendation (Question -> Similar Questions -> Professionals)"
        )

        fig = dict(data=[data], layout=layout)
        py.iplot(fig, validate=False)
    
    if print_report:
        top_professionals = plot_data[~plot_data['target'].isin(plot_data['source'])][['target', 'value']].groupby('target').sum()
        top_professionals = top_professionals.reset_index().sort_values('value', ascending = False).head(top_n)
        top_professionals.columns = ['professional', 'recommendation_score']
        print(top_professionals)

## Example 1  <a id="recommender_example1"></a> 
I use an already existing question to get a recommendation. It's a question about the process of becoming a lawyer and how hard it will be.

In [None]:
sim_corpus = questions['questions_full_text']
sim_text = sim_corpus[seed]
print('Example 1 Question:\n', sim_text)
sim_questions = get_similar_docs(sim_corpus, sim_text, top=10)

**Similar Questions:**  
In this example the top match is the question itself.  
The other matches also look like good fits: they are about becoming a *lawyer* and how *hard* it will be.

In [None]:
sim_questions

**Answers to similar questions**  
Now I can merge the recommended questions with their answers. These answers can be shown to the student who asked the question as a first response; they may already answer the question.

In [None]:
get_questions_answers(sim_questions).head().T

**Recommended Professionals**  
On the left is the new question. In the middle are the top similar questions; the width of each link shows the similarity.  
On the right are the professionals who answered the similar questions in the middle; here the width of each link is the recommendation score.  
Professionals with a big box have a high score and should be recommended.  

The professional **c5c2ca95fcd3463a8852b8bc9d636313** has the highest score with 2.27 because they answered three questions with a high similarity.
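
A Sankey diagram of this kind takes its links as integer indices into a list of node labels. The conversion can be sketched in plain Python as below; the ids and values are made up, and the dictionary layout simply follows the `node`/`link` structure described above.

```python
# Hypothetical links as (source, target, value) triples.
links = [
    ("question", "q1", 0.8),   # new question -> similar question, weighted by similarity
    ("question", "q2", 0.5),
    ("q1", "p1", 0.8),         # similar question -> professional, weighted by score
    ("q1", "p2", 0.4),
    ("q2", "p1", 1.0),
]

# Unique node labels in order of first appearance, sources before targets.
labels = list(dict.fromkeys([s for s, _, _ in links] + [t for _, t, _ in links]))
index = {label: i for i, label in enumerate(labels)}

sankey = {
    "node": {"label": labels},
    "link": {
        "source": [index[s] for s, _, _ in links],
        "target": [index[t] for _, t, _ in links],
        "value": [v for _, _, v in links],
    },
}
print(sankey)
```

Looking labels up in a dictionary avoids scanning the label list once per link, which matters when many similar questions are plotted.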

In [None]:
get_recommendation(sim_questions)

## Example 2  <a id="recommender_example2"></a> 
Example 2 uses a newly written question about a career as a data scientist.

In [None]:
query_text = 'I will finish my college next year and would like to start a career as a data scientist. \n'\
            +'What is the best way to become a good data scientist? #data-science'
print('Example 2 Question:\n', query_text)
sim_questions = get_similar_docs(sim_corpus, query_text, top=5)

**Similar Questions:**  
The recommended questions are also about a career as a data scientist and how to prepare for it.

In [None]:
sim_questions

**Answers to similar questions**  
Here we have several answers to the recommended similar questions and can use this to forward the question to a professional.

In [None]:
get_questions_answers(sim_questions).head(5).T

**Recommended Professionals**  

In [None]:
get_recommendation(sim_questions)

# 6. Topic Model (LDA) <a id="lda"></a>  
In this section I implement an LDA model to get topic probabilities for the questions. We can use this to see how topics are distributed across questions and which words characterize them.  
New questions can be assigned to topics and forwarded to professionals who are familiar with these topics.

1. Use NLP on the Questions corpus.  
    a. Use part-of-speech tagging to filter words.  
    b. Filter extreme values from the corpus.  
    c. Calculate the tf-idf. 
2. Train a LDA Model.  
3. Give the topics names.
4. Get the topic probability of a query text.
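
The filtering of extreme values in step 1b can be illustrated in plain Python. This is a sketch on made-up documents; the threshold names follow the arguments of gensim's `Dictionary.filter_extremes`, and the values are chosen only to make the toy example readable.

```python
from collections import Counter

# Hypothetical tokenised documents.
docs = [
    ["college", "career", "doctor"],
    ["college", "career", "law"],
    ["college", "engineer"],
    ["college", "career", "nurse"],
]

# Illustrative thresholds, named after gensim's `Dictionary.filter_extremes` arguments.
no_below = 2     # keep tokens found in at least this many documents
no_above = 0.8   # ...and in at most this fraction of all documents
keep_n = 5       # then keep only the most frequent remaining tokens

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))
kept = [
    token
    for token, freq in doc_freq.most_common()
    if no_below <= freq <= no_above * len(docs)
][:keep_n]
print(kept)
```

Tokens that occur in nearly every document (here *college*) carry little information about a topic, and tokens that occur only once give the model too little evidence, so both ends are removed.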

## Parameters

In [None]:
# Gensim Dictionary
extremes_no_below = 10
extremes_no_above = 0.6
extremes_keep_n = 8000

# LDA
num_topics = 18
passes = 20
chunksize = 1000
alpha = 1/50

## Functions

In [None]:
def get_model_results(ldamodel, corpus, dictionary):
    """ Create doc-topic probabilities table and visualization for the LDA model

    """  
    vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
    transformed = ldamodel.get_document_topics(corpus)
    df = pd.DataFrame.from_records([{v:k for v, k in row} for row in transformed])
    return vis, df  

In [None]:
def get_model_wordcloud(ldamodel):
    """ Create a Word Cloud for each topic of the LDA model

    """  
    plot_cols = 3
    plot_rows = math.ceil(num_topics / plot_cols)
    axisNum = 0
    plt.figure(figsize=(5*plot_cols, 3*plot_rows))
    for topicID in range(ldamodel.state.get_lambda().shape[0]):
        #gather most relevant terms for the given topic
        topics_terms = ldamodel.state.get_lambda()
        tmpDict = {}
        for i in range(len(topics_terms[0])):  # include word id 0 as well
            tmpDict[ldamodel.id2word[i]]=topics_terms[topicID,i]

        # draw the wordcloud
        wordcloud = WordCloud( margin=0,max_words=20 ).generate_from_frequencies(tmpDict)
        axisNum += 1
        ax = plt.subplot(plot_rows, plot_cols, axisNum)

        plt.imshow(wordcloud, interpolation='bilinear')
        title = topicID
        plt.title(title)
        plt.axis("off")
        plt.margins(x=0, y=0)
    plt.show()

In [None]:
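def topic_query(data, query):
    """ Get Documents matching the query with the doc-topic probabilities

    """  
    # Work on a copy so the caller's DataFrame is not modified
    result = data.copy()
    # Keep only documents that reach the threshold for every queried topic
    for topic, threshold in query.items():
        result = result[result[topic] >= threshold]
    # Rank by the summed probabilities of the queried topics
    result = result.assign(sort=result[list(query)].sum(axis=1))
    result = result.sort_values(['sort'], ascending=False)
    result = result.drop('sort', axis=1)
    return result.head(5)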

In [None]:
def get_text_topics(text, top=20):
    """ Get the topics probabilities for a text and highlight relevant words

    """    
    def token_topic(token):
        return topic_words.get(token, -1)
    
    colors = ['\033[46m', '\033[45m', '\033[44m', '\033[43m', '\033[42m', '\033[41m', '\033[47m']    
    nlp_tokens = nlp_preprocessing([text])

    bow_text = [lda_dic.doc2bow(doc) for doc in nlp_tokens]
    bow_text = lda_tfidf[bow_text]
    topic_text = lda_model.get_document_topics(bow_text)
    topic_text = pd.DataFrame.from_records([{v:k for v, k in row} for row in topic_text])
    
    topic_labeled = 0
    for topic in topic_text:
        print(colors[topic_labeled % len(colors)]+'Topic '+str(topic)+':', '{0:.2%}'.format(topic_text[topic].values[0])+'\033[0m')
        topic_labeled += 1
    print('')
    topic_words = []
    topic_labeled = 0
    for topic in topic_text.columns.values:
        topic_terms = lda_model.get_topic_terms(topic, top)
        topic_words = topic_words+[[topic_labeled, lda_dic[pair[0]], pair[1]] for pair in topic_terms]
        topic_labeled += 1
    topic_words = pd.DataFrame(topic_words, columns=['topic', 'word', 'value']).pivot(index='word', columns='topic', values='value').idxmax(axis=1)
    nlp_doc = nlp(text)
    # Token.text_with_ws replaces the deprecated Token.string
    text_highlight = ''.join([x.text_with_ws if token_topic(x.lemma_.lower()) < 0 else colors[token_topic(x.lemma_.lower()) % len(colors)] + x.text_with_ws + '\033[0m' for x in nlp_doc])
    print(text_highlight) 

    # Plot Pie chart
    plt_data = topic_text
    plt_data.columns = ['Topic '+str(c) for c in plt_data.columns]
    plt_data['Others'] = 1-plt_data.sum(axis=1)
    plt_data = plt_data.T
    plt_data.plot(kind='pie', y=0, autopct='%.2f')
    plt.xlabel('')
    plt.ylabel('')
    plt.title('Topics Probabilities')
    plt.show()

## Model <a id="lda_model"></a> 

In [None]:
lda_tokens = questions['nlp_tokens']
# Gensim Dictionary
lda_dic = gensim.corpora.Dictionary(lda_tokens)
lda_dic.filter_extremes(no_below=extremes_no_below, no_above=extremes_no_above, keep_n=extremes_keep_n)
lda_corpus = [lda_dic.doc2bow(doc) for doc in lda_tokens]

lda_tfidf = gensim.models.TfidfModel(lda_corpus)
lda_corpus = lda_tfidf[lda_corpus]

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(lda_corpus, num_topics=num_topics,
                                            id2word=lda_dic, passes=passes,
                                            chunksize=chunksize, update_every=0,
                                            alpha=alpha, random_state=seed)

## Topics  <a id="lda_topics"></a> 
Each wordcloud shows a topic and the top words that define it.  
Here are some examples:  
Topic 0 covers teachers (*teacher, teaching, education, ...*)  
Topic 1 covers designers (*design, video, graphic, art, ...*)  
Topic 4 covers veterinary medicine (*veterinary, vet, animal, ...*) but also seems to cover actors (*film, theatre, music, singer, ...*)  
Topic 9 covers health (*medicine, doctor, dental, ...*)  
Topic 13 covers engineers (*engineering, mechanical, aerospace, electrical, ...*)  
Topic 17 covers sport (*sport, athlete, basketball, ...*)
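
The manual reading of the wordclouds can be kept in a small lookup table, which is one possible way to handle the step "give the topics names". This is only a sketch: `topic_names` and `label_topic` are hypothetical helpers that are not used elsewhere in this kernel, the labels are shorthand for the examples in this section, and the topic numbers are only valid for this particular training run.

```python
# Labels read off the wordclouds; topics without an entry keep their number only.
topic_names = {0: 'teachers', 1: 'designers', 4: 'veterinary',
               9: 'health', 13: 'engineers', 17: 'sport'}

def label_topic(topic_id, names=topic_names):
    """Return a readable label such as 'Topic 9 (health)'."""
    name = names.get(topic_id)
    if name is None:
        return 'Topic {}'.format(topic_id)
    return 'Topic {} ({})'.format(topic_id, name)

labels = [label_topic(t) for t in (4, 5, 9)]
print(labels)  # ['Topic 4 (veterinary)', 'Topic 5', 'Topic 9 (health)']
```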

In [None]:
get_model_wordcloud(lda_model)

## Interactive Visualization  
*lda_vis* is an interactive visualization of the topic model. It causes problems with the screen width on Kaggle, so I commented it out.

In [None]:
lda_vis, lda_result = get_model_results(lda_model, lda_corpus, lda_dic)
#lda_vis

## Document-Topic Probabilities <a id="lda_doc_topic_prob"></a> 
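Here are the topic probabilities for the first five questions.  
Topics with only *NaN* values for these five questions were dropped.  
If a topic probability is below a given threshold, it automatically gets a *NaN* value: gensim leaves such topics out of its result (`minimum_probability`), so they show up as missing in the table.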
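
Where the *NaN* values come from can be shown with a small sketch. The model returns a sparse list of `(topic_id, probability)` pairs per document, and every topic that is missing from that list ends up as *NaN* in the table. The helper `to_dense_rows` and the numbers are invented for illustration; the kernel itself builds the table with pandas.

```python
import math

def to_dense_rows(sparse_docs, num_topics):
    """Expand sparse (topic_id, probability) lists into fixed-width rows."""
    rows = []
    for doc in sparse_docs:
        row = [float('nan')] * num_topics  # topics missing from the list stay NaN
        for topic_id, prob in doc:
            row[topic_id] = prob
        rows.append(row)
    return rows

# Hypothetical model output for two documents and four topics
sparse_docs = [[(0, 0.91), (3, 0.05)], [(1, 0.48), (2, 0.47)]]
dense = to_dense_rows(sparse_docs, num_topics=4)
for row in dense:
    print(['NaN' if math.isnan(p) else '{:.2f}'.format(p) for p in row])
```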

In [None]:
lda_questions = questions[['questions_id', 'questions_title', 'questions_body']]
lda_questions = pd.concat([lda_questions, lda_result.add_prefix('Topic_')], axis=1)
lda_questions.head(5).dropna(axis=1, how='all').T

## Example 1  <a id="lda_example1"></a> 
The example text about data science was assigned to topic 3 with a probability of 87%.  
The highlighted words are the ones that define the topic.  
A look at the previously created wordclouds shows that topic 3 is a mix of *math* and *computer science*.
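
The highlighting itself is plain ANSI coloring: a word that belongs to a topic is wrapped in that topic's background color code and a reset code. The following is a simplified sketch of that idea. It splits on spaces instead of using spaCy lemmas, and `highlight` and `word_topic` are hypothetical names with made-up content.

```python
RESET = '\033[0m'
COLORS = ['\033[46m', '\033[45m', '\033[44m']  # ANSI background colors

def highlight(text, word_topic):
    """Wrap every word that belongs to a topic in that topic's background color."""
    out = []
    for word in text.split(' '):
        # Look up the word without surrounding punctuation; -1 means "no topic"
        topic = word_topic.get(word.lower().strip('.,?!#'), -1)
        if topic < 0:
            out.append(word)
        else:
            out.append(COLORS[topic % len(COLORS)] + word + RESET)
    return ' '.join(out)

word_topic = {'data': 0, 'college': 1}  # hypothetical word -> topic position
result = highlight('I will finish my college next year', word_topic)
print(result)
```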

In [None]:
query_text = 'I will finish my college next year and would like to start a career as a Data Scientist. \n'\
            +'What is the best way to become a good Data Scientist? #data-science'
get_text_topics(query_text, 80)

## Example 2  <a id="lda_example2"></a>
Now I would like to run a query that returns documents with the topics *veterinary (**Topic 4**)* and *health (**Topic 9**)*.  
The first two questions are about the decision to begin a career as a veterinarian (*Topic 4*) or in another medical field (*Topic 9*).
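
The logic behind such a query can be sketched without pandas: keep every document that reaches the threshold for all queried topics, then rank by the sum of the queried probabilities. The function `topic_query_sketch` and the probabilities are made up for illustration, and a missing topic is treated as probability 0 here.

```python
def topic_query_sketch(docs, query, top_n=5):
    """Keep documents that reach every threshold, ranked by the summed probabilities."""
    hits = []
    for doc_id, probs in docs.items():
        # A topic missing from a document counts as probability 0
        if all(probs.get(topic, 0.0) >= threshold for topic, threshold in query.items()):
            score = sum(probs.get(topic, 0.0) for topic in query)
            hits.append((doc_id, score))
    hits.sort(key=lambda pair: -pair[1])
    return hits[:top_n]

# Hypothetical doc-topic probabilities
docs = {
    'q1': {'Topic_4': 0.45, 'Topic_9': 0.50},
    'q2': {'Topic_4': 0.90},
    'q3': {'Topic_4': 0.41, 'Topic_9': 0.42},
    'q4': {'Topic_9': 0.80, 'Topic_13': 0.15},
}
query = {'Topic_4': 0.4, 'Topic_9': 0.4}
ranked = topic_query_sketch(docs, query)
print([doc_id for doc_id, _ in ranked])  # ['q1', 'q3']
```

Documents that are strong in only one of the two topics (`q2`, `q4`) are filtered out, which is why the query returns questions that sit between both fields.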

In [None]:
query = {'Topic_4':0.4, 'Topic_9':0.4}
topic_query(lda_questions, query).dropna(axis=1, how='all').head(2).T

In [None]:
get_text_topics(questions['questions_full_text'][20658], 50)
print()
get_text_topics(questions['questions_full_text'][3075], 50)

# 7. Next steps <a id="next_steps"></a>  
* Additional features for the recommendation.  
* Evaluation of the LDA model (parameters, Rand score)  
* Professionals scoring  
* ...

# 8. Version control <a id="version_control"></a>  
10.03.2019:  
* Update dependency graph
* Code improvement

09.03.2019:  
* Recommendation: Scoring and Visualization
* LDA Pie charts for topic probabilities

08.03.2019:  
* Feature Extraction
* Some code optimization

07.03.2019:  
* Dependency graph
* Change workflow section

06.03.2019:  
* Workflow diagram  
* Bubble charts for tags  

05.03.2019:  
* Initial version  

**I'm on vacation for the next few days and unfortunately won't be able to make any updates.  
But you are welcome to leave a comment and upvote if you find this kernel useful.**