# 🔎**CareerVillage.org Recommendation System**🔍
![Imgur](https://i.imgur.com/oA5T6RX.jpg)

**Objective:** Develop a method to recommend relevant questions on CareerVillage.org to the professionals who are most likely to answer them.

# **Approach:**



**************************************************************************************************************************************************************************************
[Solution 1:](#1) 

The simplest approach is to find past questions similar to the one asked and track down the professionals who answered them.

👍**Pros:**  ❗❗New Feature Alert❗❗ The same similarity search could power a feature that shows students similar past questions, along with their responses, before they post a new question.

⌛**Added bonus:**⌛
If those answers satisfy the student, they can skip posting the question. This saves the valuable time of our professionals (a.k.a. Superheroes) and avoids duplicate questions in the forum.

👎**Cons:** This method will not suggest professionals who haven't answered any questions yet.

***************************************************************************************************************************************************************************************
[Solution 2:](#2) 

Professionals have tags to indicate topics of interest. We can check for similarity between the question posted and the professionals' tags.

👍**Pros:** This method does not depend on whether a professional has responded in the past.

👎**Cons:** This method will not suggest professionals who do not have any tags.

****************************************************************************************************************************************************************************************
[Solution 3:](#3) 

Find professionals who have commented on the most similar questions to the one posted.

👍**Pros:** This method will include professionals who haven't answered any questions but have commented.

👎**Cons:** This method will not suggest professionals who haven't commented.

****************************************************************************************************************************************************************************************


✅  A combination of the results from these methods should be a good starting point.
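One simple way to combine the three candidate lists is to count how many methods nominated each professional and rank by that count. The sketch below is only an illustration under that assumption: `combine_candidates` and the sample IDs are hypothetical and do not come from the dataset.

```python
from collections import Counter

def combine_candidates(*candidate_sets):
    """Rank professionals by how many methods suggested them (hypothetical helper)."""
    votes = Counter()
    for candidates in candidate_sets:
        votes.update(set(candidates))  # each method votes at most once per professional
    # Sort by vote count (descending), then by id for a stable order
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up professional ids standing in for the outputs of Solutions 1-3
from_answers = {"p1", "p2", "p3"}
from_tags = {"p2", "p3", "p4"}
from_comments = {"p3", "p5"}

ranked = combine_candidates(from_answers, from_tags, from_comments)
print(ranked)
```

A plain union would also work; the vote count simply gives a first ordering when the combined list is long.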



**IMPORT REQUIRED LIBRARIES**

In [None]:
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import gensim
from nltk.tokenize import word_tokenize
import dateutil.parser
from datetime import datetime
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', None) # None disables truncation; -1 is no longer accepted by recent pandas
%matplotlib inline
from matplotlib.pyplot import figure
from matplotlib_venn import venn2, venn2_circles
from matplotlib_venn import venn3, venn3_circles

# **Data Exploration**

**AVAILABLE FILES**

In [None]:
print(os.listdir("../input"))
pro= pd.read_csv('../input/professionals.csv')
stu=pd.read_csv("../input/students.csv")

qs=pd.read_csv("../input/questions.csv")
ans= pd.read_csv('../input/answers.csv')
com= pd.read_csv('../input/comments.csv')
email= pd.read_csv('../input/emails.csv')
match= pd.read_csv('../input/matches.csv')

school_mem= pd.read_csv('../input/school_memberships.csv')

tags= pd.read_csv('../input/tags.csv')
qs_tags=pd.read_csv('../input/tag_questions.csv')
user_tag=pd.read_csv('../input/tag_users.csv')

**SIZE OF THE COMMUNITY**

In [None]:
print("Number of students registered:",stu['students_id'].count())
print("Number of professionals registered:",pro['professionals_id'].count())

In [None]:
ans_stu= pd.merge(ans,stu, left_on='answers_author_id',right_on='students_id')
ans_pro= pd.merge(ans,pro, left_on='answers_author_id',right_on='professionals_id')

com_stu= pd.merge(com,stu, left_on='comments_author_id',right_on='students_id')
com_pro= pd.merge(com,pro, left_on='comments_author_id',right_on='professionals_id')

active_pro= set(ans_pro['answers_author_id']).union(set(com_pro['comments_author_id']))
active_stu= set(ans_stu['answers_author_id']).union(set(com_stu['comments_author_id']))

figure(figsize=(20,20))
plt.subplot(1, 2, 1)
venn3([set(ans_pro['answers_author_id']), set(pro['professionals_id']),set(com_pro['comments_author_id'])],
      set_labels = ('Answered','PROFESSIONALS','Commented'))

plt.subplot(1, 2, 2)
venn3([set(active_stu), set(stu['students_id']), set(qs['questions_author_id'])],
      set_labels = ('Commented/Answered','STUDENTS','Question Posted' ))

plt.tight_layout()
plt.show()


~18k professionals and ~18.6k students haven't contributed in any way: no questions, answers, or comments.

Only ~36% of the professionals and ~36% of the students have engaged on the platform.

⭐  **An 'activity indicator' could be added as a feature to professionals data table.**

⭐  **Another metric, the time since last activity or last login, could serve as an indicator of 'interest' in responding to new questions.**
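A minimal sketch of how these two features might be derived, assuming answers are the only activity signal (comments and questions could be folded in the same way). The toy frames, the reference date `as_of`, and the column names `is_active` and `days_since_last_activity` are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the professionals and answers tables (made-up values)
pro_demo = pd.DataFrame({"professionals_id": ["p1", "p2", "p3"]})
ans_demo = pd.DataFrame({
    "answers_author_id": ["p1", "p1", "p3"],
    "answers_date_added": pd.to_datetime(["2018-01-05", "2018-06-01", "2017-03-10"]),
})

# Most recent answer per author
last_seen = ans_demo.groupby("answers_author_id")["answers_date_added"].max()

# Hypothetical feature columns: 'is_active' and 'days_since_last_activity'
as_of = pd.Timestamp("2018-07-01")  # assumed reference date
pro_demo["last_activity"] = pro_demo["professionals_id"].map(last_seen)
pro_demo["is_active"] = pro_demo["last_activity"].notna()
pro_demo["days_since_last_activity"] = (as_of - pro_demo["last_activity"]).dt.days
print(pro_demo)
```

Professionals with no recorded activity get a missing value for the day count rather than an arbitrary large number, which keeps the two signals separate.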


In [None]:
a_s=pd.read_csv('../input/answer_scores.csv')
an_s= pd.merge(ans,a_s,left_on='answers_id',right_on='id')

ans_score_tab=an_s.pivot_table(values='score',index='answers_author_id',aggfunc='sum')
ans_score_tab.index.names=['user_id']

ans_score_tab.columns=['Hearts Earned']
ans_count_tab=ans.pivot_table(values='answers_id',index='answers_author_id',aggfunc='count')
ans_count_tab.index.names=['user_id']
ans_count_tab.columns=['Questions Answered']

score_tab=ans_score_tab.join(ans_count_tab,how='outer')
score_tab=score_tab.fillna(0) # np.NaN was removed in NumPy 2.0; fillna does the same job
score_tab['Total_score']=score_tab['Hearts Earned']+score_tab['Questions Answered']
score_tab=score_tab.sort_values('Total_score',ascending=False)
print("Professionals' Score Chart")
score_tab.head()

The top user answered 1710 questions, while the third user answered roughly half as many (915).

Yet the latter has earned more 💙 than the former.

So it is best to create two charts for internal use- 
* one based on activity 
* one based on popularity

In [None]:
print("Popularity Chart")
ans_score_tab.sort_values('Hearts Earned',ascending=False).head()

In [None]:
print("Activity Chart")
ans_count_tab.sort_values('Questions Answered',ascending=False).head()

⭐  Activity and popularity charts could be generated for rolling one-month period.
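As a rough sketch of the rolling charts, the answers can be filtered to a trailing window before aggregating. The data below is made up, 30 days is used as a stand-in for "one month", and `as_of` is an assumed reporting date.

```python
import pandas as pd

# Toy answers table with a heart score per answer (made-up values)
answers_demo = pd.DataFrame({
    "answers_author_id": ["p1", "p1", "p2", "p2", "p3"],
    "ts": pd.to_datetime(["2018-05-02", "2018-06-20", "2018-06-25",
                          "2018-06-28", "2018-06-29"]),
    "score": [4, 1, 3, 2, 0],
})

as_of = pd.Timestamp("2018-07-01")            # assumed reporting date
window_start = as_of - pd.Timedelta(days=30)  # 30 days as a stand-in for one month

# Keep only answers inside the trailing window
recent = answers_demo[(answers_demo["ts"] > window_start) & (answers_demo["ts"] <= as_of)]
monthly = recent.groupby("answers_author_id").agg(
    questions_answered=("ts", "count"),
    hearts_earned=("score", "sum"),
)

activity_chart = monthly.sort_values("questions_answered", ascending=False)
popularity_chart = monthly.sort_values("hearts_earned", ascending=False)
print(activity_chart)
print(popularity_chart)
```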

**QnA archive**

In [None]:
print("Number of questions asked:",qs['questions_id'].count())
print("Number of answers given:",ans['answers_id'].count())

**TAGS, COMMENTS AND EMAILS**

In [None]:
print("Number of tags:",tags['tags_tag_id'].count())
print("Number of comments made:",com['comments_id'].count())
print("Number of emails sent:",email['emails_id'].count())

In [None]:
figure(figsize=(20,20))
plt.subplot(1, 2, 1)
venn3([set(ans['answers_id']), set(com['comments_parent_content_id']),set(qs['questions_id'])],
      set_labels = ('ANSWERS','COMMENTS','QUESTIONS' ))

plt.tight_layout()
plt.show()


***The Venn diagram shows that both questions and answers can get commented on.***

***However, answers tend to get more comments than questions.***

**RESPONSE RATES**

Let's inspect the rate of questions getting answered on the platform.

In [None]:
def to_date(val):
    return datetime.strptime( str(val),'%Y-%m-%d %H:%M:%S UTC+0000')
def to_year(val):
    return val.strftime('%Y')
def to_yr_mon(val):
    return val.strftime('%Y-%m')
    
#Question Timestamp Processing    
qs['ts']=qs.apply(lambda x: to_date(x['questions_date_added']),axis=1)
qs['year']=qs.apply(lambda x: to_year(x['ts']),axis=1)
qs['yr_mon']=qs.apply(lambda x: to_yr_mon(x['ts']),axis=1)

#Answer Timestamp Processing    
ans['ts']=ans.apply(lambda x: to_date(x['answers_date_added']),axis=1)
ans['year']=ans.apply(lambda x: to_year(x['ts']),axis=1)

#Question ids that got answered at some point
answered_id= ans['answers_question_id'].unique()

#Questions which never got answered
n_ans=pd.DataFrame(qs.loc[~qs['questions_id'].isin(answered_id)])
n_ans=n_ans.reset_index(drop=True)
n_ans['ts']=n_ans.apply(lambda x: to_date(x['questions_date_added']),axis=1)
n_ans['y_m']=n_ans.apply(lambda x: to_yr_mon(x['ts']),axis=1)
n_ans['yr']=n_ans.apply(lambda x: to_year(x['ts']),axis=1)

#Questions that got answered at some point
y_ans=pd.DataFrame(qs.loc[qs['questions_id'].isin(answered_id)])

#Time taken for replies
ans_gap= pd.merge(ans,y_ans,left_on='answers_question_id',right_on='questions_id')
ans_gap['time_to_answer']= ans_gap.apply(lambda x: (x['ts_x']-x['ts_y']).days ,axis=1 )

#Reply gap tabulated
tab_gap= ans_gap.groupby('time_to_answer').count()[['answers_id']]

#Questions raised by yr-mon
tab_q= qs.groupby('yr_mon').count()[['questions_id']]
tab_q= tab_q.sort_index()

#Unanswered questions by yr-mon
tab_a= n_ans.groupby('y_m').count()[['questions_id']]

#Unanswered questions by year
tab_a_yr = n_ans.groupby('yr').count()[['questions_id']]
#Questions by year
tab_q_yr= qs.groupby('year').count()[['questions_id']]

#Response rates
tab_response_rate=tab_a_yr.join(tab_q_yr,how='outer', lsuffix='_unanswered', rsuffix='_posted')
tab_response_rate=tab_response_rate.fillna(0)
tab_response_rate['response rate']=round(100- ((tab_response_rate['questions_id_unanswered']/tab_response_rate['questions_id_posted'])*100),2)
tab_response_rate.columns=['Questions Unanswered','Questions Posted','Response Rate']
print("RESPONSE RATES OVER THE YEARS")
tab_response_rate

***The platform was able to stick to the policy of answering all questions during the years 2011-'15.***


***2018 has seen the lowest response rate of 91%; 712 questions went unanswered.***


***In 2016, there was a massive increase in questions posted.***
***Did our student community also grow that much in 2016?***

***Let's inspect...***

In [None]:
#Professionals Timestamp Processing    
pro['ts']=pro.apply(lambda x: to_date(x['professionals_date_joined']),axis=1)
pro['year']=pro.apply(lambda x: to_year(x['ts']),axis=1)
tab_pro= pro.groupby('year').count()[['professionals_id']]
tab_pro.columns=['Professionals joined each year']
tab_pro['Total Professionals']= tab_pro['Professionals joined each year'].cumsum()

#Students Timestamp Processing    
stu['ts']=stu.apply(lambda x: to_date(x['students_date_joined']),axis=1)
stu['year']=stu.apply(lambda x: to_year(x['ts']),axis=1)
tab_stu= stu.groupby('year').count()[['students_id']]
tab_stu.columns=['Students joined each year']
tab_stu['Total Students']= tab_stu['Students joined each year'].cumsum()
tab_pro_stu= tab_pro.join(tab_stu,how='inner')

print("Size of the community over the years")
tab_pro_stu[:-1]

⭐  The highest influx in the student community has been in the year 2016-- 12k students joined that year!

⭐  That is the highest number of students joining the platform in any one year.

⭐  Professionals' community has seen a steady growth, which is a good sign.

In [None]:
tab_pro_stu[:-1].plot()

**TIME TO RESPOND TO QUESTIONS POSTED**

In [None]:
#Reply gap tabulated
tab_gap= ans_gap.groupby(['time_to_answer']).count()[['answers_id']]
tab_gap['per']=round((tab_gap['answers_id']/tab_gap['answers_id'].sum())*100,2)
print("TIME TO RESPOND TO QUESTIONS")
tab_gap[0:10]

⭐ **Almost 24% of answers arrive within 24 hours of the question being posted.**

NB: The first row may correspond to some glitch in the timestamp recording, answers being posted before questions?
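One way to investigate such rows is to list the answers whose timestamp precedes that of their question. The sketch below uses made-up timestamps and hypothetical column names (`answer_ts`, `question_ts`); it also shows why a gap of only a few hours in the wrong direction is reported as -1 day.

```python
import pandas as pd

# Toy question/answer timestamps (made-up values)
gap_demo = pd.DataFrame({
    "answers_id": ["a1", "a2", "a3"],
    "answer_ts": pd.to_datetime(["2018-01-02 10:00", "2018-01-01 08:00", "2018-01-05 12:00"]),
    "question_ts": pd.to_datetime(["2018-01-01 09:00", "2018-01-01 09:30", "2018-01-05 11:00"]),
})

delta = gap_demo["answer_ts"] - gap_demo["question_ts"]
# .days floors towards negative infinity, so a gap of -1.5 hours becomes -1 day
gap_demo["time_to_answer"] = delta.dt.days

# Rows where the answer timestamp precedes the question timestamp
suspicious = gap_demo[delta < pd.Timedelta(0)]
print(suspicious[["answers_id", "time_to_answer"]])
```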

In [None]:
email['ts']= email.apply(lambda x: to_date(x['emails_date_sent']),axis=1)
email['year']= email.apply(lambda x: to_year(x['ts']),axis=1)

emailed_qs = pd.merge(email,match,left_on='emails_id',right_on='matches_email_id')
emailed_qs_ans=pd.merge(emailed_qs,ans,left_on=['emails_recipient_id','matches_question_id'],right_on=['answers_author_id','answers_question_id'],how='inner')

sent     = emailed_qs.groupby(['emails_recipient_id','year']).nunique()[['matches_question_id']] 
responded= emailed_qs_ans.groupby(['emails_recipient_id','year_x']).nunique()[['answers_id']]

sent=sent.reset_index(level=['year'])
responded=responded.reset_index(level=['year_x'])

sent_responded= pd.merge(sent,responded,how='left',left_on=['emails_recipient_id','year'],right_on=['emails_recipient_id','year_x'])
sent_responded=sent_responded.drop('year_x',axis=1)
sent_responded['rate']= np.array((sent_responded['answers_id'] / sent_responded['matches_question_id'])*100)

**Let's explore the tag landscape: question tags and user tags**

In [None]:
venn3([set(qs_tags['tag_questions_tag_id']), set(user_tag['tag_users_tag_id']),set(tags['tags_tag_id'])],
      set_labels = ('Question_tags','User_tags','TAGS'))

There are question tags that do not match any user's tags and there are users' tags which haven't matched any questions asked so far.

In [None]:
figure(figsize=(5,5))
venn3([set(user_tag['tag_users_user_id']),set(ans_pro['answers_author_id']),set(pro['professionals_id'])],
      set_labels = ('Tagged Users','Professionals Answered','Professionals'))

Out of the ~24k questions posted, almost 10k were matched through emails or at least emails can be thought to be the motivating factor for answering those questions. 

# **Solution 1** <a id=1></a>

# **Find questions from the past most similar to the current question and target professionals who have answered those questions**

**HISTORICAL QUESTIONS ASKED**

In [None]:
qs.head(2)

Every question has a title and a body. There is a separate table that holds the hashtags associated with each question as well.

Let's have a look at this table.

In [None]:
qs_tags.sort_values('tag_questions_question_id')[:10]

⭐  Well, each question can have multiple tags. We have to concatenate these tags and create a **"tag string"** per question (where tags are available)

In [None]:
qs_tagnames= pd.merge(qs_tags,tags,left_on='tag_questions_tag_id',right_on='tags_tag_id')
qs_tagnames=qs_tagnames.drop(['tags_tag_id','tag_questions_tag_id'],axis=1)
print(qs_tagnames.sort_values('tag_questions_question_id')[:10])
qs_tag_pivot=qs_tagnames.pivot_table(index='tag_questions_question_id',values='tags_tag_name',aggfunc=lambda x: " ".join(x))
qs_tag_pivot['tag_questions_question_id']=qs_tag_pivot.index
print("\nNumber of questions asked:",qs['questions_id'].count())
print("Number of questions with tags:",len(qs_tag_pivot))
qs_tag_pivot=qs_tag_pivot.reset_index(drop=True)
print("\n",qs_tag_pivot.head())

In [None]:
first = qs_tag_pivot.iloc[0]  # first question that has a tag string
print("Example:\nQuestion id-", first['tag_questions_question_id'],
      ":\n\n",qs.loc[qs['questions_id']==first['tag_questions_question_id'],'questions_body'],
      "\n\n*************************************************************************\nTag string:",
      first['tags_tag_name'])

**COMBINE QUESTIONS TABLE WITH CORRESPONDING TAGS**

In [None]:
qs_with_tags=pd.merge(qs,qs_tag_pivot,left_on='questions_id',right_on='tag_questions_question_id')
print("Number of questions with tags:",len(qs_with_tags))
qs_with_tags.head(2)

**Combine question title, body and tags.**

In [None]:
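# Join with spaces so the last word of one field does not fuse with the first word of the next
raw_documents=qs_with_tags['questions_title']+' '+qs_with_tags['questions_body']+' '+qs_with_tags['tags_tag_name']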
raw_documents.head()

**TEXT PROCESSING...**

In [None]:
print("Number of Questions:",len(raw_documents))
print("Tokenizing data...")
gen_docs = [[w.lower() for w in word_tokenize(text)] 
            for text in raw_documents]
print("Creating dictionary...")
dictionary = gensim.corpora.Dictionary(gen_docs)
print("Creating Document-Term Matrix...")
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print("Creating TF-IDF Model...")
tf_idf = gensim.models.TfidfModel(corpus)
print("Creating Similarity Checker...")
similar_qs = gensim.similarities.Similarity("",tf_idf[corpus],num_features=len(dictionary))
print("Processing Completed!")

**NEW QUESTION POSTED**

Let's take a fresh question from the website which is not included in the training set:



In [None]:
Query='Can I become data scientist without studying at university?#technology #data-science'
Query

In [None]:
query_doc = [w.lower() for w in word_tokenize(Query)]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]

**PROCESSING THE QUESTION...**

**CHECKING FOR MOST SIMILAR QUESTIONS FROM THE PAST...**

In [None]:
q_sim=similar_qs[query_doc_tf_idf]

**LET'S SET SIMILARITY THRESHOLD = 0.1**
 
*Any question from the past with a similarity index less than the threshold gets ignored*
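To make the similarity index concrete, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity and thresholding. It loosely follows the same idea as the gensim pipeline but is not gensim's implementation; the helper names and the three toy questions are made up for illustration.

```python
import math
from collections import Counter

def normalise(vec):
    """Scale a {term: weight} dict to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (vectors, idf): one unit-length {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log2(n_docs / df) for term, df in doc_freq.items()}
    vectors = [normalise({t: c * idf[t] for t, c in Counter(doc).items()}) for doc in docs]
    return vectors, idf

def cosine(a, b):
    # Both inputs are unit length, so the dot product is the cosine similarity
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Made-up past questions, already lower-cased and split into tokens
past_questions = [
    "how do i become a data scientist".split(),
    "what degree do i need for nursing".split(),
    "is university required for data science jobs".split(),
]
vectors, idf = tfidf_vectors(past_questions)

query = "can i become data scientist without university".split()
# Terms never seen in the past questions are dropped from the query vector
query_vec = normalise({t: c * idf[t] for t, c in Counter(query).items() if t in idf})

sim_threshold = 0.10
scores = [cosine(query_vec, v) for v in vectors]
similar = [i for i, s in enumerate(scores) if s >= sim_threshold]
print(scores, similar)
```

The nursing question shares only the common word "i" with the query, so it falls below the threshold, while the two data-related questions are kept.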

In [None]:
sim_threshold=0.10

**DISPLAY THE MOST SIMILAR QUESTIONS FROM THE PAST**

In [None]:
qs_with_tags['Similarity']=q_sim
ques=qs_with_tags[qs_with_tags['Similarity']>=sim_threshold]
ques=ques.sort_values('Similarity',ascending=False)
ques.head()

**IDENTIFY PROFESSIONALS WHO ANSWERED THESE QUESTIONS** 

In [None]:
qlist=ques['questions_id']
qlist_ans=ans[ans['answers_question_id'].isin(qlist)]
prof_answered=set(qlist_ans['answers_author_id'])
#print(prof_answered)
solution1= pro[pro['professionals_id'].isin(prof_answered)]
solution1.head()


*Now that's a good start. But if we only use this method, as mentioned earlier, professionals who haven't answered any questions in the past will be ignored.*
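The candidate set can also be ordered rather than treated as a flat list. The sketch below assumes one possible relevance score, the summed similarity of the questions each professional answered; this scoring rule and the toy data are illustrative assumptions, not part of the analysis so far.

```python
import pandas as pd

# Toy stand-ins for the similar-question table and the answers table (made-up values)
ques_demo = pd.DataFrame({
    "questions_id": ["q1", "q2", "q3"],
    "Similarity": [0.60, 0.30, 0.15],
})
ans_demo = pd.DataFrame({
    "answers_question_id": ["q1", "q2", "q2", "q3", "q9"],
    "answers_author_id": ["p1", "p1", "p2", "p3", "p4"],
})

# Inner join keeps only answers to the similar questions
matched = ans_demo.merge(ques_demo, left_on="answers_question_id", right_on="questions_id")

# Hypothetical relevance score: total similarity of the questions each professional answered
pro_rank = (matched.groupby("answers_author_id")["Similarity"]
            .sum()
            .sort_values(ascending=False))
print(pro_rank)
```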

**How big is the population that gets ignored via this method?**

In [None]:
print("Number of professionals registered:",len(pro['professionals_id']))
print("Number of users who have answered:",len(ans['answers_author_id'].unique()))
ans_pro=pro[pro['professionals_id'].isin(ans['answers_author_id'])]
print("Number of professionals who have answered:",len(ans_pro))
ans_stu=stu[stu['students_id'].isin(ans['answers_author_id'])]
print("Number of students who have answered:",len(ans_stu))
print("\n***PROFESSIONALS IGNORED VIA THIS SOLUTION***")
print("Number of professionals who haven't answered yet:",len(set(pro['professionals_id']))-len(ans_pro))

In [None]:
print("From the numbers, it is clear that some users who are registered as neither professionals nor students have answered questions.\nHow big is this population?\n")
u= set(ans['answers_author_id'])
s= set(stu['students_id'])
p= set(pro['professionals_id'])
st_ansrd= u.intersection(s)
pr_ansrd= u.intersection(p)
all_ansrd= st_ansrd.union(pr_ansrd)
unknwn= u.difference(all_ansrd)
print("Unknown users: ",len(unknwn))

# **Solution 2** <a id=2></a>

# **Identify professionals with tags most similar to the question asked.**

**TAGS PER USER**

Combine user_tag table with tag table.

In [None]:
user_tag_exp=pd.merge(tags,user_tag,left_on='tags_tag_id',right_on='tag_users_tag_id')
user_tag_exp=user_tag_exp.drop(['tags_tag_id','tag_users_tag_id'],axis=1)
user_tag_exp.sort_values('tag_users_user_id')[:10]

The first user has three tags - content creation, script writing and digital media. This means each user can have multiple tags.

⭐  Concatenate all tags of a user to create a **'tag string'** per user.

**'TAG STRING' TABLE OF USERS**

In [None]:
tag_pivot=user_tag_exp.pivot_table(values='tags_tag_name',index='tag_users_user_id',aggfunc=lambda x: " ".join(x))
tag_pivot['tag_users_user_id']=tag_pivot.index
print("Number of all users with tags:",len(tag_pivot))
tag_pivot=tag_pivot.reset_index(drop=True)
tag_pivot.head()

This table contains both- students and professionals.

As we are interested in sending emails only to professionals, let's filter this table.

**'TAG STRING' TABLE OF PROFESSIONALS**

In [None]:
pro_tagstring= tag_pivot[tag_pivot['tag_users_user_id'].isin(pro['professionals_id'])]
print("Number of professionals with tags:",len(pro_tagstring))

In [None]:
raw_tags=pro_tagstring['tags_tag_name']
print("Tag string table of professionals:")
raw_tags.head()


**TEXT PROCESSING BEGINS...**

In [None]:
print("Number of professionals with tag strings:",len(raw_tags))
print("Tokenizing data...")
gen_docs = [[w.lower() for w in word_tokenize(text)] 
            for text in raw_tags]
print("Creating dictionary...")
dictionary = gensim.corpora.Dictionary(gen_docs)
print("Creating Document-Term Matrix...")
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print("Creating TF-IDF Model...")
tf_idf = gensim.models.TfidfModel(corpus)
print("Creating Similarity Checker...")
sims = gensim.similarities.Similarity("",tf_idf[corpus],num_features=len(dictionary))
print("Processing Completed!")

Query='Can I become data scientist without studying at university?#technology #data-science'
print("\nQuestion posted:",Query)

**CHECKING FOR PROFESSIONALS WITH THE MOST SIMILAR TAGS**

In [None]:
query_doc = [w.lower() for w in word_tokenize(Query)]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]

In [None]:
sim=sims[query_doc_tf_idf]

**LET'S SET, SIMILARITY THRESHOLD = 0.1**

Any professional whose tag string has a similarity index below the threshold gets ignored.

In [None]:
sim_threshold=0.10

**DISPLAY THE PROFESSIONALS WITH TAGS MOST SIMILAR TO THE QUESTION**

In [None]:
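pro_tagstring=pro_tagstring.assign(sim=sim) # assign returns a new frame and avoids SettingWithCopyWarning on the filtered slice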
prof_tag=pro_tagstring[pro_tagstring['sim']>=sim_threshold]
prof_tag=prof_tag.sort_values('sim',ascending=False)
prof_tag.head()

In [None]:
prof_list=prof_tag['tag_users_user_id']
solution2= pro[pro['professionals_id'].isin(prof_list)]
solution2.head()

*As mentioned earlier, professionals without any tags will be ignored via this solution.*

**How big is that population?**

In [None]:
print("Number of users with tags:",len(tag_pivot))
print("\nNumber of professionals registered:",len(pro['professionals_id']))

print("Number of professionals with tags:",len(pro_tagstring))

print("\n***PROFESSIONALS IGNORED VIA THIS SOLUTION***")
print("Number of professionals without any tags:",len(set(pro['professionals_id']))-len(pro_tagstring))

**Far fewer professionals are ignored by Solution 2 than by Solution 1**

*Are there professionals who are ignored in both methods?*

Let's inspect this.

In [None]:
answered_pro= set(ans_pro['professionals_id'])
tagged_pro= set(pro_tagstring['tag_users_user_id'])
A1= len(answered_pro.difference(tagged_pro))
B1= len(tagged_pro.difference(answered_pro))
AnB= len(answered_pro.intersection(tagged_pro))
print("Number of professionals ignored via both methods:",len(pro['professionals_id'])-(A1+AnB+B1))

These professionals do not have any tags and haven't responded to any questions yet.
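The same group can be obtained directly with set differences, which also yields the ids themselves rather than just a count. The ids below are made up; in the notebook the sets would come from `pro`, `ans_pro` and `pro_tagstring`.

```python
# Made-up ids; in the notebook these would come from pro, ans_pro and pro_tagstring
all_pro = {"p1", "p2", "p3", "p4", "p5"}
answered_demo = {"p1", "p2"}
tagged_demo = {"p2", "p3"}

# Professionals reachable by neither Solution 1 nor Solution 2
ignored_by_both = all_pro - answered_demo - tagged_demo
print(sorted(ignored_by_both))

# Same count via the three-way split used in the cell above
a_only = len(answered_demo - tagged_demo)
b_only = len(tagged_demo - answered_demo)
both = len(answered_demo & tagged_demo)
assert len(ignored_by_both) == len(all_pro) - (a_only + both + b_only)
```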

**Further Thoughts:**

One option is to prompt these professionals to add tags, so that questions can be directed to them via Solution 2.
Once they start answering, they will also be picked up by Solution 1.

Another alternative is to suggest tags to these professionals based on their Job Title, Headline, Location etc.
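A very rough sketch of the tag-suggestion idea: propose any existing tag whose words all appear in a professional's headline. `suggest_tags`, the tag vocabulary and the headline are hypothetical, and a real version would likely need stemming or fuzzy matching.

```python
import re

def suggest_tags(headline, known_tags):
    """Return known tags whose words all appear in the headline (hypothetical helper)."""
    words = set(re.findall(r"[a-z0-9]+", headline.lower()))
    suggestions = []
    for tag in known_tags:
        tag_words = re.findall(r"[a-z0-9]+", tag.lower())  # 'data-science' -> ['data', 'science']
        if tag_words and all(w in words for w in tag_words):
            suggestions.append(tag)
    return suggestions

# Made-up tag vocabulary and headline
known_tags = ["data-science", "nursing", "engineering", "marketing"]
headline = "Senior Data Science Manager, formerly in Marketing"
print(suggest_tags(headline, known_tags))
```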

# **Solution 3** <a id=3></a>

**Find professionals who have commented on the most similar questions.**

**Comments can be made by both professionals and students. Let's check the trend so far.**

In [None]:
com_stu= pd.merge(com,stu, left_on='comments_author_id',right_on='students_id')
com_stu_q= pd.merge(com_stu,qs, left_on='comments_parent_content_id',right_on='questions_id')
com_stu_a= pd.merge(com_stu,ans, left_on='comments_parent_content_id',right_on='answers_id')

com_pro= pd.merge(com,pro, left_on='comments_author_id',right_on='professionals_id')
com_pro_q= pd.merge(com_pro,qs, left_on='comments_parent_content_id',right_on='questions_id')
com_pro_a= pd.merge(com_pro,ans, left_on='comments_parent_content_id',right_on='answers_id')
print("Number of comments posted:",len(com),
      "\n\nNumber of comments made by professionals:",len(com_pro),
      "\nNumber of comments made by students:",len(com_stu),
      "\n\nNumber of questions commented by professionals:",len(com_pro_q['comments_parent_content_id'].unique()),
      "\nNumber of answers commented by professionals:",len(com_pro_a['comments_parent_content_id'].unique()),
      "\n\nNumber of questions commented by students:",len(com_stu_q['comments_parent_content_id'].unique()),
      "\nNumber of answers commented by students:",len(com_stu_a['comments_parent_content_id'].unique())           
     )
      
      
      

**VENN DIAGRAM- REGISTERED USERS VS THOSE WHO COMMENTED**

In [None]:
from matplotlib.pyplot import figure
figure(figsize=(15,10))
plt.subplot(1, 2, 1)
venn3([set(com_pro['comments_author_id']), set(pro['professionals_id']),set(qs['questions_author_id'])],  # qs = all questions; ques only holds the Solution 1 matches
      set_labels = ('Commented','PROFESSIONALS', 'Questions Asked'))

plt.subplot(1, 2, 2)
#figure(figsize=(8,8))
venn3([set(com_stu['comments_author_id']), set(stu['students_id']),set(qs['questions_author_id'])],  # qs = all questions; ques only holds the Solution 1 matches
      set_labels = ('Commented','STUDENTS', 'Questions Asked'))

plt.tight_layout()
plt.show()
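A minimal sketch of how the Solution 3 candidate list could be built, assuming a comment counts when it sits either on a similar question or on an answer to one. The toy tables reuse the notebook's column names, but every value and the `question_id` helper column are made up.

```python
import pandas as pd

# Toy tables using the notebook's column names (values are made up)
similar_qids = {"q1", "q2"}  # stand-in for ques['questions_id']
ans_demo = pd.DataFrame({
    "answers_id": ["a1", "a2"],
    "answers_question_id": ["q1", "q7"],
})
com_demo = pd.DataFrame({
    "comments_author_id": ["p1", "p2", "s1", "p3"],
    "comments_parent_content_id": ["q1", "a1", "q2", "a2"],
})
pro_ids = {"p1", "p2", "p3"}  # stand-in for pro['professionals_id']

# A comment's parent is either a question or an answer; map answers back to their question
answer_to_question = ans_demo.set_index("answers_id")["answers_question_id"]
parent = com_demo["comments_parent_content_id"]
com_demo["question_id"] = parent.map(answer_to_question).fillna(parent)

# Keep comments attached to similar questions, then keep only professional authors
on_similar = com_demo[com_demo["question_id"].isin(similar_qids)]
solution3_ids = set(on_similar["comments_author_id"]) & pro_ids
print(sorted(solution3_ids))
```

Here `p3` is excluded because the answer they commented on belongs to a question outside the similar set, and `s1` is excluded because they are not a professional.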


![Imgur](https://i.imgur.com/gGTpnnt.jpg)
