### This code was created for submission to Data Science for Good: CareerVillage.org
The code covers the following two processes:

1. The first process reads, copies, and tokenizes the data. It takes about 10 minutes on a PC and is run whenever the data is updated.
 The processing steps are as follows.
- Read csv data
- Copy "answers", "questions", "professionals" for data processing and addition
- Add 'score' column from answer_scores to answers_copy
- Add 'score' column from question_scores to questions_copy
- Calculate the response time in days from each question to its answer, and add it to answers_copy
- Tokenize questions_title and questions_body into questions_words
- Extract tags from questions_body into questions_body_tags
- Merge professionals_industry and professionals_headline, and tokenize them into 'ind_head'.
- Join ind_head into ind_head_sp with spaces, for high-speed search
- Convert professionals_date_joined to professionals_date_joined_dt in datetime format
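
The score and response-day steps listed above can also be expressed with pandas merges rather than row-by-row loops. The following is a minimal sketch on toy data, not the notebook's implementation: the tiny frames are made up for illustration, and the `id` and `score` column names of the score file are an assumption.

```python
import pandas as pd

# Toy stand-ins for the CSV files (values are made up for illustration).
answers = pd.DataFrame({
    'answers_id': ['a1', 'a2'],
    'answers_question_id': ['q1', 'q2'],
    'answers_date_added': ['2019-01-03 12:00:00 UTC+0000',
                           '2019-01-10 00:00:00 UTC+0000'],
})
questions = pd.DataFrame({
    'questions_id': ['q1', 'q2'],
    'questions_date_added': ['2019-01-01 00:00:00 UTC+0000',
                             '2019-01-09 00:00:00 UTC+0000'],
})
# Assumption: the score file has the columns 'id' and 'score'.
answer_scores = pd.DataFrame({'id': ['a1'], 'score': [3]})

# Left-join the scores; answers without a score get 0.
answers_copy = answers.merge(answer_scores, how='left',
                             left_on='answers_id', right_on='id').drop(columns='id')
answers_copy['score'] = answers_copy['score'].fillna(0).astype(int)

# Join the question date and compute the response time in fractional days.
answers_copy = answers_copy.merge(questions, how='left',
                                  left_on='answers_question_id',
                                  right_on='questions_id')
fmt = '%Y-%m-%d %H:%M:%S'
answered = pd.to_datetime(answers_copy['answers_date_added'].str[:19], format=fmt)
asked = pd.to_datetime(answers_copy['questions_date_added'].str[:19], format=fmt)
answers_copy['Response_day'] = (answered - asked).dt.total_seconds() / 86400
```

A join of this kind avoids scanning the whole score array once per answer, which is likely where much of the preprocessing time goes.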

2. The second process takes a question as input and selects authors to answer it.
  The processing steps are as follows; it takes about 6 seconds on a PC (i7/8GB).
- Two input methods are provided.
        (1) The first method passes the title and body directly as a str.
        (2) The second method specifies the index of a row in the "questions" DataFrame as an integer.
- The hyperparameters are listed below.
        'number_of_authors': Number of authors to select. default[10]
        'sample_num': Number of similar questions to sample; larger values increase processing time. default[10]
        'tag_coef': Weight of tag similarity relative to body-word similarity when calculating question similarity. default[2]
        'expiration_date': Similar questions are weighted toward newer ones within 'expiration_date' days. default[1000]
        'reseponse_time_limit': Answers are weighted toward faster responses within 'reseponse_time_limit' days. default[30]
        'thank_coef': Weight of the number of "thank" found in the comments on an answer. default[1]
        'rand_coef': Random coefficient from 0. to 1. that adds fluctuation to the selection; 0.: select from the top of the ranking, 1.: 100% random. default[0]
- Calculate questions_copy['questions_similarity'] against the input question using Jaccard similarity.
        questions_similarity = (jaccard_index(word)+jaccard_index(tag)*tag_coef)*(log10(score+1)+1)*(1-delta_day/expiration_date)
- Choose the top 'sample_num' questions as 'questions_selected'
- Select the answers to the 'questions_selected'
- Count 'thank' in the comments on each answer.
- List the authors of the selected answers and calculate the author_priority from score, Response_day, questions_similarity and the number of "thank".
        author_selected['author_priority'] = (1 + author_selected['thank_mean'] * thank_coef) * author_selected['similarity_mean'] * author_selected['answer_count'] * (1 - author_selected['respose_day_mean'] / reseponse_time_limit) * author_selected['score_mean']
- Collect words from the selected authors and aggregate the author_priority for each word.
- Calculate professionals_priority for all professionals from the words' priorities.
- Set a higher priority for the professionals in 'author_selected'
- Sort professionals by priority and select the recommended professionals according to 'rand_coef'
        select.index = int((1-random.random()**rand_coef)*professionals_sorted.shape[0])
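
To make the questions_similarity formula in the list above concrete, here is a small worked example. It is a sketch with made-up word and tag lists; the helper names `jaccard` and `question_similarity` are hypothetical and exist only for this illustration. The logarithm is taken as base 10, matching the `math.log10` call in the notebook code.

```python
import math

def jaccard(a, b):
    """Jaccard index of two iterables, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def question_similarity(words_in, tags_in, words_q, tags_q, score, delta_day,
                        tag_coef=2, expiration_date=1000):
    # Word overlap plus weighted tag overlap.
    text_part = jaccard(words_in, words_q) + jaccard(tags_in, tags_q) * tag_coef
    # Questions with a higher score count more.
    score_part = math.log10(score + 1) + 1
    # Older questions are weighted down linearly.
    recency_part = 1 - delta_day / expiration_date
    return text_part * score_part * recency_part

# Made-up input: word Jaccard = 2/5, tag Jaccard = 1/2.
sim = question_similarity(
    words_in=['job', 'graphic', 'design'], tags_in=['graphic-design'],
    words_q=['graphic', 'design', 'career', 'school'],
    tags_q=['graphic-design', 'art'],
    score=9, delta_day=100)
# (0.4 + 0.5 * 2) * (log10(10) + 1) * (1 - 100/1000) = 1.4 * 2 * 0.9 = 2.52
```

Note that the recency factor becomes negative once delta_day exceeds expiration_date, so such questions sort below all recent ones.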

Loading Libraries

In [1]:
import pandas as pd
import numpy as np
import math
import datetime as dt
import sys, time
from datetime import datetime
from tqdm import tqdm_notebook
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import copy
import random

Functions

In [2]:
#Return date1 - date2 in fractional days (date strings start with 'YYYY-MM-DD HH:MM:SS')
def delta_days(date1, date2):
    d1=dt.datetime.strptime(date1[:19], '%Y-%m-%d %H:%M:%S')
    d2=dt.datetime.strptime(date2[:19], '%Y-%m-%d %H:%M:%S')
    return (d1-d2).days+(d1-d2).seconds/(3600*24)

In [3]:
#Return the fractional days elapsed from date until now
def days_to_now(date):
    d1=dt.datetime.now()
    d2=dt.datetime.strptime(date[:19], '%Y-%m-%d %H:%M:%S')
    return (d1-d2).days+(d1-d2).seconds/(3600*24)
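
As a quick check of what the two date helpers compute, the sketch below restates `delta_days` using `timedelta.total_seconds()`, which gives the same value as `.days + .seconds/(3600*24)` for whole-second timestamps. The sample timestamps are made up for illustration.

```python
import datetime as dt

def delta_days(date1, date2):
    """Return date1 - date2 in fractional days; only the first 19 characters are parsed."""
    fmt = '%Y-%m-%d %H:%M:%S'
    d1 = dt.datetime.strptime(date1[:19], fmt)
    d2 = dt.datetime.strptime(date2[:19], fmt)
    return (d1 - d2).total_seconds() / (3600 * 24)

# Two and a half days between the timestamps.
gap = delta_days('2019-01-03 12:00:00 UTC+0000', '2019-01-01 00:00:00 UTC+0000')
# The result is negative when date1 is earlier than date2.
neg = delta_days('2019-01-01 00:00:00 UTC+0000', '2019-01-01 06:00:00 UTC+0000')
```

Slicing to `[:19]` drops the trailing ` UTC+0000` suffix, so the values are treated as naive datetimes.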

In [4]:
#Tokenizing function
stopWords = set(stopwords.words('english'))
stopWords |= set(["n't","'m","'re","'ve","'s",".", "|", '＋', "^", "="])
def get_word_list(text):
    text_tokenized = [wt[0].lower() for wt in nltk.pos_tag(nltk.word_tokenize(text)) \
                      if ('NN'  in wt[1]) or ('JJ'  in wt[1]) or ('RB'  in wt[1]) or ('VB'  in wt[1])]
    return [w for w in text_tokenized if w not in stopWords if w!='+']

In [5]:
#Function of getting tag list from text
def get_tag_list(text):
    tag_name = []
    wl=word_tokenize(text)
    for i,w in enumerate(wl):
        if ('#' in w) and (i != len(wl)-1) and (len(w)>0):
            if len(wl[i+1])>1:
                tag_name.append(wl[i+1].lower())
    return tag_name
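
For readers without the NLTK tokenizer data installed, the idea behind tag extraction can be approximated with a regular expression. This is a hedged alternative sketch, not the notebook's method: the function name `get_tag_list_regex` is hypothetical, and its results may differ from `get_tag_list` on unusual punctuation.

```python
import re

def get_tag_list_regex(text):
    """Return lower-cased hashtags of at least two characters, without the '#'."""
    return [t.lower() for t in re.findall(r'#([A-Za-z0-9_-]{2,})', text)]

tags = get_tag_list_regex(
    "I'd like to know how hard it is to find a job  #graphic-design #Graphics")
single = get_tag_list_regex("grade #a student, or C# developer")
```

As in the original function, a one-character token after `#` is ignored, so `#a` and a bare `#` yield no tag.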

In [6]:
#jaccard similarity
def get_jac_sim(content1,content2):
    a=set(content1)
    b=set(content2)
    if len(a.union(b))==0:
        return 0
    else:
        return len(a.intersection(b))/len(a.union(b))

Reading and Data processing

In [7]:
#Reading and data preprocessing
def processing_data():
    
    #Update files
    print('---  1/10',"\r", end="")
    answer_scores = pd.read_csv('../input/answer_scores.csv')
    answers = pd.read_csv('../input/answers.csv')
    comments = pd.read_csv('../input/comments.csv')
    professionals = pd.read_csv('../input/professionals.csv')
    question_scores = pd.read_csv('../input/question_scores.csv')
    questions = pd.read_csv('../input/questions.csv')

    #Copy three files for data processing
    print('---  2/10',"\r", end="")
    answers_copy=copy.copy(answers)
    questions_copy=copy.copy(questions)
    professionals_copy=copy.copy(professionals)

    #Add 'score' column from answer_scores to answers_copy
    print("---  3/10","\r", end="")
    answers_copy['score']=0
    answers_arr=answers_copy.values
    answer_s_arr=answer_scores.values
    for i in range(len(answers_arr)):
        search_arr=np.any(answer_s_arr==answers_arr[i,0], axis=1)
        if any(search_arr)==True:
            answers_arr[i,5]=answer_s_arr[search_arr,1][0]
    answers_copy=pd.DataFrame(answers_arr, index=answers_copy.index, columns=answers_copy.columns)

    #Add 'score' column from question_scores to questions_copy
    print("---  4/10","\r", end="")
    questions_copy['score']=0
    questions_arr=questions_copy.values
    question_s_arr=question_scores.values
    for i in range(len(questions_arr)):
        searchq_arr=np.any(question_s_arr==questions_arr[i,0], axis=1)
        if any(searchq_arr)==True:
            questions_arr[i,5]=question_s_arr[searchq_arr,1][0]
    questions_copy=pd.DataFrame(questions_arr, index=questions_copy.index, columns=questions_copy.columns)

    #Calculate response days from question to the answer, and add it to answers_copy
    print("---  5/10","\r", end="")
    for i in answers.index:
        answers_copy.loc[i,'Response_day']=delta_days(answers_copy.loc[i,'answers_date_added'],
            questions_copy[questions_copy['questions_id']==answers_copy.loc[i,'answers_question_id']].iloc[0]['questions_date_added'])

    #Tokenize from questions_title and questions_body to questions_words
    print("---  6/10","\r", end="")
    questions_copy['questions_words']=[get_word_list(x) for x in (questions_copy['questions_body'] +
                                                                  ' ' + questions_copy['questions_title'])]

    #Get tags from questions_body to questions_body_tags
    print("---  7/10","\r", end="")
    questions_copy['questions_body_tags']=[get_tag_list(x) for x in questions_copy['questions_body']]

    #Merge (professionals_location,) professionals_industry and professionals_headline to ind_head
    print("---  8/10","\r", end="")
    professionals_copy = professionals_copy.fillna('')
    professionals_copy['ind_head'] = ''
    for x in professionals_copy.index:
        professionals_copy.at[x,'ind_head'] = set(get_word_list(professionals_copy.loc[x,'professionals_industry'])) | \
                set(get_word_list(professionals_copy.loc[x,'professionals_headline'])) 

    # Join ind_head to ind_head_sp with space for high-speed search 
    print("---  9/10","\r", end="")
    professionals_copy['ind_head_sp']=[" ".join(x) for x in professionals_copy['ind_head']]

    # Convert professionals_date_joined to professionals_date_joined_dt with Datetime format 
    print("--- 10/10","\r", end="")
    professionals_copy['professionals_date_joined_dt']=[dt.datetime.strptime(professionals_copy.loc[x,'professionals_date_joined'][:19],
                                                                        '%Y-%m-%d %H:%M:%S') for x in professionals_copy.index]
    print("Finished ","\r")
    return answers_copy, questions_copy, professionals_copy, comments

In [8]:
%time answers_copy, questions_copy, professionals_copy, comments = processing_data()
# it takes about 12 minutes on a PC (i7/8GB)

Finished  
CPU times: user 7min 55s, sys: 1.89 s, total: 7min 57s
Wall time: 7min 57s


Select author

In [9]:
def select_author(input_question, number_of_authors, sample_num, tag_coef, expiration_date, 
                  reseponse_time_limit, thank_coef, rand_coef):
    
    # calculate questions_similarity in the case of type(input_question)==int: existing question in the data
    print('---  1/6',"\r", end="")
    if type(input_question)==int:
        if input_question in questions_copy['questions_words']:
            questions_copy['questions_similarity'] = [(get_jac_sim(questions_copy['questions_words'][input_question],
                            questions_copy['questions_words'][i]) + get_jac_sim(questions_copy['questions_body_tags'][input_question],
                            questions_copy['questions_body_tags'][i])*tag_coef)* (math.log10(questions_copy['score'][i]+1)+1)
                            * (1 - delta_days(questions_copy['questions_date_added'][input_question],
                            questions_copy['questions_date_added'][i])/expiration_date) for i in questions_copy.index]
            if questions_copy.loc[input_question,'questions_id'] in answers_copy['answers_question_id'].values:
                old_item=1
            else:
                old_item=0
        else:
            print('Err: No applicable questions.index')
            sys.exit()

    # calculate questions_similarity in the case of type(input_question)==str: new question
    elif type(input_question)==str:
        word_list = get_word_list(input_question)
        tag_list = get_tag_list(input_question)
        questions_copy['questions_similarity'] = [(get_jac_sim(word_list, questions_copy['questions_words'][i]) \
                            + get_jac_sim(tag_list, questions_copy['questions_body_tags'][i])*tag_coef) \
                            * (math.log10(questions_copy['score'][i]+1)+1) * (1 - days_to_now(questions_copy['questions_date_added'][i]) \
                            /expiration_date) for i in questions_copy.index]
        old_item=0
    else:
        print('Err: type error')
        sys.exit()
        
    #select questions similar to input
    print('---  2/6',"\r", end="")
    questions_selected = questions_copy.sort_values('questions_similarity', ascending=False).iloc[old_item:sample_num+old_item,:]
    
    #select answers of the selected questions
    print('---  3/6',"\r", end="")
    answers_selected=pd.DataFrame()
    for i in questions_selected.index:        
        for j in answers_copy[answers_copy['answers_question_id']==questions_selected['questions_id'].loc[i]].index:
            answers_selected= answers_selected.append(answers_copy.loc[j,['answers_id','answers_question_id',
                                                                     'answers_author_id', 'answers_date_added','Response_day','score']])
            answers_selected.loc[j,'questions_similarity']= questions_selected.loc[i,'questions_similarity']
            #Count 'thank' from the comments of each answer. 
            answers_selected.loc[j,'thank']=int(any(comments[(comments['comments_parent_content_id']==
                                                              answers_selected.loc[j, 'answers_id'])]['comments_body'].str.contains('Thank|thank|thk|thnk')))
    if len(answers_selected)==0:
        print("Err: No similar answers, use larger 'sample_num'")
        sys.exit()            
    #List authors of the selected answers and calculate the author_priority from score, Response_day, questions_similarity and thank
    print('---  4/6',"\r", end="")
    author_selected = answers_selected.groupby('answers_author_id').agg({'answers_id':'count', 'score':'mean', 
                                                                         'Response_day':'mean', 'questions_similarity': 'mean', 'thank': 'mean'})
    author_selected.columns=['answer_count', 'score_mean', 'respose_day_mean', 'similarity_mean', 'thank_mean']
    author_selected['score_mean'] = [math.log10(author_selected.loc[x,'score_mean']+1)+1 for x in author_selected.index]
    author_selected['author_priority'] = (1+author_selected['thank_mean']*thank_coef)*author_selected['similarity_mean']\
                * author_selected['answer_count']*(1-author_selected['respose_day_mean']/reseponse_time_limit)\
                * author_selected['score_mean']
    author_selected=author_selected[author_selected['author_priority']>0]
    author_selected['ind_head']=''
    for x in author_selected.index:
        if any(professionals_copy['professionals_id']==x):
            author_selected.at[x, 'ind_head']=[professionals_copy[professionals_copy['professionals_id']==x]['ind_head'].values[0]]

    #Collect words from the selected authors and aggregate the author_priority.
    print('---  5/6',"\r", end="")
    word_set=set()
    for i in author_selected.index:
        if author_selected.loc[i, 'ind_head'] != '':
            word_set|=author_selected.loc[i, 'ind_head'][0]

    word_dict = dict(zip(word_set,np.zeros(len(word_set))))
    for i in author_selected.index:
         if author_selected.loc[i, 'ind_head'] != '':
                for j in author_selected.loc[i, 'ind_head'][0]:
                    word_dict[j] = word_dict[j]+author_selected.loc[i, 'author_priority']

    word_dict = {k: v for k, v in word_dict.items() if v != 0.0}

    #Calculate professionals_priority by word_dict.
    print('---  5/6',"\r", end="")
    professionals_copy['professionals_priority']=0.0
    for k, v in word_dict.items():
        professionals_copy.loc[professionals_copy['ind_head_sp'].str.contains(k),'professionals_priority']+=v
    professionals_priority_max=professionals_copy['professionals_priority'].max()
    #Set higher 'author_priority' for 'author_selected'
    for i in author_selected.index:
        professionals_copy.loc[(professionals_copy['professionals_id']==i),'professionals_priority']+= \
                author_selected.loc[i, 'author_priority']+professionals_priority_max
    
    #Sort professionals by the priority
    print('---  6/6',"\r", end="")
    random.seed(dt.datetime.now().microsecond)
    professionals_sorted = copy.copy(professionals_copy.sort_values(['professionals_priority','professionals_date_joined_dt'],
                                                        ascending=[False, False]))
    professionals_sorted['rank']=[i+1 for i in range(professionals_sorted.shape[0])]
    selected_professionals = pd.DataFrame()    
    # and select recommendation of professionals due to 'rand_coef'
    if (rand_coef>=0) and (rand_coef<=1):
        for i in range(number_of_authors):
            ii = int((1-random.random()**(rand_coef))*professionals_sorted.shape[0])
            selected_professionals = selected_professionals.append(professionals_sorted.iloc[ii,:])
            professionals_sorted=professionals_sorted.drop(index=professionals_sorted.index[ii])
        selected_professionals['rank']=selected_professionals['rank'].astype('int64')
    else:
        print("Err: Input 0. to 1. for 'rand_coef'")
        sys.exit()
    
    print("Finished ","\r")
    return selected_professionals.drop(['ind_head', 'ind_head_sp', 'professionals_date_joined_dt','professionals_priority'], axis=1)
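
The author_priority step inside `select_author` can be followed on a tiny example. The sketch below uses made-up answers for two hypothetical authors and pandas named aggregation; it mirrors the formula in the function but is an illustration rather than a drop-in replacement.

```python
import math
import pandas as pd

# Made-up selected answers: author p1 has two answers, p2 has one slow answer.
answers_selected = pd.DataFrame({
    'answers_id': ['a1', 'a2', 'a3'],
    'answers_author_id': ['p1', 'p1', 'p2'],
    'score': [9, 9, 0],
    'Response_day': [3.0, 9.0, 45.0],
    'questions_similarity': [0.5, 0.3, 0.4],
    'thank': [1, 0, 0],
})
thank_coef, reseponse_time_limit = 1, 30

# Aggregate per author (column names follow the notebook, including its spelling).
authors = answers_selected.groupby('answers_author_id').agg(
    answer_count=('answers_id', 'count'),
    score_mean=('score', 'mean'),
    respose_day_mean=('Response_day', 'mean'),
    similarity_mean=('questions_similarity', 'mean'),
    thank_mean=('thank', 'mean'),
)
authors['score_mean'] = authors['score_mean'].map(lambda s: math.log10(s + 1) + 1)
authors['author_priority'] = (
    (1 + authors['thank_mean'] * thank_coef)
    * authors['similarity_mean']
    * authors['answer_count']
    * (1 - authors['respose_day_mean'] / reseponse_time_limit)
    * authors['score_mean']
)
# Authors whose mean response time exceeds the limit get a negative priority and are dropped.
authors = authors[authors['author_priority'] > 0]
# p1: (1 + 0.5) * 0.4 * 2 * (1 - 6/30) * 2 = 1.92; p2 is dropped (45 days > 30).
```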

In [10]:
#parameters
params = {
        'number_of_authors': 10,# Number of authors to select
        'sample_num': 10, # Number of sample questions to compare similarity, the biggest factor of the processing time.
        'tag_coef': 2, # Priority coefficient of Tag in calculating questions similarity
        'expiration_date': 1000, # Similar questions prioritize newer ones within 'expiration_date'
        'reseponse_time_limit': 30, # Similar answers prioritize faster responses, within 30 days
        'thank_coef': 1, # Priority coefficient of the number of "thank" included in the comment of the answer
        'rand_coef': 0.} # Random coefficient of 0. to 1., 0.[default]: select from high order, 1.: 100% random

Two input methods are provided.
The first method passes the title and body directly as a str.

In [11]:
input_question_title="Is it hard to find a job in graphic design?"
input_question_body="I'd like to know how hard it is to find a job straight out of school  #graphic-design #graphics"

input_question=input_question_title+' '+input_question_body
%time select_authors = select_author(input_question, **params)
select_authors

Finished  
CPU times: user 4.19 s, sys: 20 ms, total: 4.21 s
Wall time: 4.19 s


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,rank
18204,2018-04-10 13:19:06 UTC+0000,Creative Director | Art Director | Designer,16584031624041119309e01c908a2f1f,Marketing and Advertising,"Austin, Texas",1
3070,2015-12-16 23:48:37 UTC+0000,"Principal, Illustrator, Graphic Designer at Ki...",349db306672e425f9481e6c30d84afe5,"GRAPHIC_DESIGN, ILLUSTRATION","Seattle, Washington",2
1794,2015-03-09 11:20:00 UTC+0000,"Keeping Busy!!! - Illustrator and Designer, al...",e01894b52bfb4eabb8791beaef276fa7,Civil Engineering,"Frome, England, United Kingdom",3
3045,2015-12-15 16:59:08 UTC+0000,Illustrator and Graphic Designer,4bbd6d03e36b445198780896632a01f1,Graphic Design,"Missoula, Montana",4
1121,2014-05-15 16:57:38 UTC+0000,Principal Artist at Zynga,bc46e3699d92477ba8c7aa723e54a151,Entertainment,"San Francisco, California",5
11580,2017-05-22 20:45:47 UTC+0000,Making the donuts at Okta,67c1bd0e570e447ba48f25bcdc073297,Graphic Design,"Portland, Oregon",6
23605,2018-09-25 23:10:33 UTC+0000,Solutions Manager,cdeef38b72d54b65b0f826261d33276b,Telecommunications,,7
13045,2017-09-18 22:44:13 UTC+0000,Princpal Catalog Expert at SAP Ariba,562bbe2bad304300aa551dceecf5e440,Information Services,"Santa Clarita, California",8
15436,2018-01-17 20:18:04 UTC+0000,,a209eed08b8846adaf95f352b1325815,,"Hallandale Beach, Florida",9
8782,2016-12-06 11:49:51 UTC+0000,"Advertising & Marketing, Graphic Designer, Art...",113c32498e0f4c67929b194a549655db,Marketing and Advertising,"Jeddah, Makkah Province, Saudi Arabia",10


The second method specifies the index of a row in the "questions" DataFrame as an integer.

In [12]:
input_question=123
%time select_authors = select_author(input_question, **params)
select_authors

Finished  
CPU times: user 6.08 s, sys: 24 ms, total: 6.1 s
Wall time: 6.07 s


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,rank
22677,2018-09-01 20:47:44 UTC+0000,Adjunct Professor Engineering,e2b4c84bf1ca4aea9b108869692d8017,Information Technology and Services,Greater Chicago Area,1
27147,2019-01-12 18:57:07 UTC+0000,"Senior Product Line Manager, Servers & Systems...",6cd927ff0179440f955400924564ea78,Information Technology and Services,"Austin, Texas Area",2
3698,2016-01-27 14:51:09 UTC+0000,Retired Engineering Manager with limited consu...,81a594b683d54e6dbb4b04ea00a5e25b,Chemicals,"Greensboro, Georgia",3
3611,2016-01-22 19:47:04 UTC+0000,Retired Civil Engineer,c3b4e11154f74a858779be7ba9b6f00c,Consulting Engineering,"Kent, Washington",4
16061,2018-02-16 01:56:13 UTC+0000,Product Manager,899f9fcf22d04191a294da40f7cc0ade,Telecommunications,"Newark, New Jersey",5
25354,2018-11-15 03:24:17 UTC+0000,Product Marketing Dell,632f3cc0483642ceb798efef2b284cb0,Electrical and Electronic Manufacturing,"Austin, Texas",6
2410,2015-10-19 20:56:49 UTC+0000,Assist with Recognizing and Developing Potential,36ff3b3666df400f956f8335cf53e09e,Mental Health Care,"Cleveland, Ohio",7
27427,2019-01-22 13:33:56 UTC+0000,PMO Business Manager at AT&T,cb9c91e59ab9436d91b43c14e53be4c7,"IT/Advertising, Analytics",Dallas/Fort Worth Area,8
4588,2016-03-14 16:27:13 UTC+0000,Mechanical Engineer I Automotive,58fa5e95fe9e480a9349bbb1d7faaddb,Automotive,"Redford Charter Township, Michigan",9
864,2014-02-21 20:16:29 UTC+0000,"Software engineer, data infrastructure at Link...",19c30dfeabb64b108617c81d87f538fe,Computer Software,"Sunnyvale, California",10


With 'rand_coef' = 0.0005, the selection is slightly randomized.
    This can be used to mitigate the cold-start issue.
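
The effect of 'rand_coef' on the selected row can be seen by sampling the index formula on its own. This is a minimal sketch: the helper name `pick_index`, the seed, and the table size of 28000 rows are assumptions made for illustration.

```python
import random

def pick_index(rand_coef, n_rows, rng):
    """Row index into the sorted table, using the same formula as select_author."""
    return int((1 - rng.random() ** rand_coef) * n_rows)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# rand_coef = 0: u**0 is always 1, so the top-ranked row is always picked.
top_only = [pick_index(0.0, 28000, rng) for _ in range(5)]

# rand_coef = 0.0005: indices stay near the top but fluctuate.
jittered = [pick_index(0.0005, 28000, rng) for _ in range(1000)]
mean_index = sum(jittered) / len(jittered)
```

For a small rand_coef, 1 - u**rand_coef is approximately -rand_coef * ln(u), so the expected index is roughly rand_coef * n_rows. With the assumed 28000 rows that is about 14, which fits the mix of top-ranked and lower-ranked professionals in the output that follows.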

In [13]:
params['rand_coef']=0.0005
input_question=123
%time select_authors = select_author(input_question, **params)
select_authors

Finished  
CPU times: user 6.2 s, sys: 40 ms, total: 6.24 s
Wall time: 6.21 s


Unnamed: 0,professionals_date_joined,professionals_headline,professionals_id,professionals_industry,professionals_location,rank
16061,2018-02-16 01:56:13 UTC+0000,Product Manager,899f9fcf22d04191a294da40f7cc0ade,Telecommunications,"Newark, New Jersey",5
4588,2016-03-14 16:27:13 UTC+0000,Mechanical Engineer I Automotive,58fa5e95fe9e480a9349bbb1d7faaddb,Automotive,"Redford Charter Township, Michigan",9
25354,2018-11-15 03:24:17 UTC+0000,Product Marketing Dell,632f3cc0483642ceb798efef2b284cb0,Electrical and Electronic Manufacturing,"Austin, Texas",6
3611,2016-01-22 19:47:04 UTC+0000,Retired Civil Engineer,c3b4e11154f74a858779be7ba9b6f00c,Consulting Engineering,"Kent, Washington",4
8739,2016-11-18 15:16:57 UTC+0000,EMEA Online Product Manager at Dell,d4b01aa328844ed882f25be3f3e45276,Information Technology and Services,Ireland,40
22210,2018-08-27 20:15:47 UTC+0000,Cybersecurity Consultor Senior at PwC. Master ...,ae113c84e6e1469e9034ec6a7e0520b4,Information Technology and Services,"Mexico City, Mexico",23
22677,2018-09-01 20:47:44 UTC+0000,Adjunct Professor Engineering,e2b4c84bf1ca4aea9b108869692d8017,Information Technology and Services,Greater Chicago Area,1
2312,2015-10-05 16:46:15 UTC+0000,"Senior Manager, Analytics Engineering and BI -...",f16a4f6374fd4e63a1f5ed24baa8eb5e,Information Technology and Services,"San Francisco, California",15
864,2014-02-21 20:16:29 UTC+0000,"Software engineer, data infrastructure at Link...",19c30dfeabb64b108617c81d87f538fe,Computer Software,"Sunnyvale, California",10
16179,2018-02-19 20:14:08 UTC+0000,Scrum Master @ AT&T,ce9d21f7eefc42ec88f27e2cac7ff833,Telecommunications,"St. Louis, Missouri",11
