# Problem Statement: To develop a method to recommend relevant questions to the professionals who are most likely to answer them.

These are just random thoughts looking through the dataset and trying to get a sense of things that could be done. If there's a glaring mistake in this kernel, do point it out and correct me!

Also, I'll admit that I ran through all the `.csv` files first and then looked at the data dictionary provided by CareerVillage.org. 🙈

Don't be silly like me, read the data dictionary first. Reading the data dictionary answered a lot of questions for me 😬

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Professionals table

> We call our volunteers "Professionals", but we might as well call them Superheroes. They're the grown ups who volunteer their time to answer questions on the site.

In [None]:
professionals = pd.read_csv('../input/professionals.csv')

In [None]:
professionals.head()

In [None]:
professionals.describe()

In [None]:
professionals.groupby('professionals_location')['professionals_id'].count().reset_index()

### If we assume that each country has a single time zone (true for India, for example, but not for larger countries like the USA, Canada and Russia), we can make the location much simpler by keeping just the country as a starting point for the problem
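
A minimal sketch of that simplification, assuming locations are comma-separated strings whose last component is the country (US-style entries without a country would yield the state instead). The sample rows and the `extract_region` helper are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real values live in professionals['professionals_location'].
sample = pd.DataFrame({
    'professionals_location': ['Bengaluru, Karnataka, India', 'New York, New York', None]
})

def extract_region(location):
    """Return the last comma-separated component, or None for missing values."""
    if pd.isna(location):
        return None
    return str(location).split(',')[-1].strip()

# Assumption: the last component is the coarsest region available for that row.
sample['professionals_region'] = sample['professionals_location'].apply(extract_region)
print(sample)
```

Entries that end in a state rather than a country would still need a lookup (for example, a list of US states) before they can be grouped by country.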

In [None]:
professionals.groupby('professionals_headline')['professionals_id'].count().reset_index()

### Professional headlines might contain a lot of noise (assuming the data is pulled from LinkedIn, there's a good chance that people put in content that's sometimes not relevant), plus there are Chinese characters as well. Will need to think about how to process them.
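
One possible first pass, sketched on made-up headlines: normalize whitespace and flag rows that contain non-ASCII characters so they can be inspected separately. The sample values and the `is_ascii` helper are hypothetical; this only detects such rows and does not decide what to do with them.

```python
import pandas as pd

# Hypothetical headlines standing in for professionals['professionals_headline'].
headlines = pd.Series(['Software Engineer at Acme', '软件工程师', None, '  Data   Analyst  '])

def is_ascii(text):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

# Fill missing headlines, trim the ends, and collapse runs of whitespace.
cleaned = headlines.fillna('').str.strip().str.replace(r'\s+', ' ', regex=True)

# Rows flagged here contain at least one non-ASCII character.
non_ascii_mask = ~cleaned.apply(is_ascii)
print(cleaned[non_ascii_mask])
```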

### I probably won't add this data to the training set at first (maybe later, if my model is super bad with the things I'm thinking of)

In [None]:
professionals.groupby('professionals_industry')['professionals_id'].count().sort_values(ascending=False).head(20).reset_index()

In [None]:
professionals.groupby('professionals_industry')['professionals_id'].count().sort_values(ascending=False).tail(20).reset_index()

### In the top bracket we have professionals from the traditionally healthy job market scenes and popular streams.

### Whereas there are sectors like: Nanomedicine, NASA Robotics, Navy Cryptology (Wow!) which are seriously dope!

### I'm assuming that `professionals_industry` will be less affected by the noise I talked about above (fingers crossed, I have to do more digging)

### For the above column, we can reduce the number of categories by bundling similarly named values together. E.g. `tech` and `technology` should be one category, and `Accounting`, `Accounting & Finance`, `Accounting / Auditing` can be bundled together
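
A rough sketch of that bundling, assuming a hand-written keyword-to-bucket mapping is acceptable as a starting point. The `BUCKETS` dictionary and the `bundle_industry` helper are hypothetical, and the sample values only mirror the examples mentioned above.

```python
import pandas as pd

# Sample values standing in for professionals['professionals_industry'].
industries = pd.Series([
    'tech', 'Technology', 'Accounting',
    'Accounting & Finance', 'Accounting / Auditing', 'Nanomedicine',
])

# Hypothetical mapping from a lowercase prefix to the bundled category name.
BUCKETS = {'tech': 'Technology', 'account': 'Accounting'}

def bundle_industry(name):
    """Map an industry name to its bucket, or keep it unchanged if no prefix matches."""
    lowered = str(name).strip().lower()
    for keyword, bucket in BUCKETS.items():
        if lowered.startswith(keyword):
            return bucket
    return str(name).strip()

bundled = industries.apply(bundle_industry)
print(bundled.value_counts())
```

Prefix matching is deliberately crude; a real mapping would need to be reviewed against the full list of industry values, and missing values handled before applying it.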

## Students table

> Students are the most important people on CareerVillage.org. They tend to range in age from about 14 to 24. They're all over the world, and they're the reason we exist!

In [None]:
students = pd.read_csv('../input/students.csv')

In [None]:
students.head()

In [None]:
students.describe()

In [None]:
students.groupby('students_location')['students_id'].count().sort_values(ascending=False).head(20).reset_index()

### The above table is a stark reminder of the gap between metro and non-metro cities. Registrations mostly come from places, like metros, where awareness is high. CareerVillage.org needs more awareness in the places where there is less of it

In [None]:
students.groupby('students_location')['students_id'].count().sort_values(ascending=False).tail(30).reset_index()

In [None]:
### Above are the locations with the fewest registered students

## Groups table

In [None]:
groups = pd.read_csv('../input/groups.csv')

In [None]:
groups.head()

In [None]:
groups.groupby('groups_group_type').count().reset_index()

## User Tags

> Users of any type can follow a hashtag. This shows you which hashtags each user follows.

In [None]:
tag_users = pd.read_csv('../input/tag_users.csv')

In [None]:
tag_users.head()

In [None]:
tag_users.describe()

In [None]:
tag_users.shape

### Top 20 tags followed by users

In [None]:
tag_users.groupby('tag_users_tag_id')['tag_users_tag_id'].count().sort_values(ascending=False).head(20)
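
The counts above are keyed by tag ID, which is hard to read. A small sketch of attaching tag names, using toy frames: the `tags_tag_id` and `tag_users_user_id` column names are assumptions about the CSV schema, and the sample tag names are made up.

```python
import pandas as pd

# Toy stand-ins for tag_users.csv and tags.csv (column names partly assumed).
tag_users_demo = pd.DataFrame({
    'tag_users_tag_id': [1, 1, 2, 1, 3, 2],
    'tag_users_user_id': list('abcdef'),
})
tags_demo = pd.DataFrame({
    'tags_tag_id': [1, 2, 3],
    'tags_tag_name': ['college', 'computer-science', 'engineering'],
})

# Count followers per tag, then join the tag name onto each count.
follower_counts = (
    tag_users_demo.groupby('tag_users_tag_id').size()
    .rename('followers').reset_index()
)
top_tags = (
    follower_counts
    .merge(tags_demo, left_on='tag_users_tag_id', right_on='tags_tag_id', how='left')
    .sort_values('followers', ascending=False)
    .loc[:, ['tags_tag_name', 'followers']]
)
print(top_tags.head(20))
```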

## Question Tags

> Every question can be hashtagged. We track the hashtag-to-question pairings, and put them into this file.

In [None]:
tag_questions = pd.read_csv('../input/tag_questions.csv')

In [None]:
tag_questions.head()

In [None]:
tag_questions.shape

## Emails table

> Each email corresponds to one specific email to one specific recipient. The frequency_level refers to the type of email template which includes immediate emails sent right after a question is asked, daily digests, and weekly digests.

`emails_recipient_id` is professional_id

`emails_id` is the ID of the email received

In [None]:
emails = pd.read_csv('../input/emails.csv')

In [None]:
emails.head()

In [None]:
emails.groupby('emails_frequency_level')['emails_id'].count().reset_index()

In [None]:
emails.groupby('emails_recipient_id')['emails_id'].count().sort_values(ascending=False).head(10)

In [None]:
emails['emails_id'].nunique()

In [None]:
emails.shape

## Questions and Answers tables

In [None]:
answers = pd.read_csv('../input/answers.csv')

In [None]:
answers.head()

In [None]:
answers.shape

In [None]:
questions = pd.read_csv('../input/questions.csv')

In [None]:
questions.shape

In [None]:
questions.head()

### To get the final dataset, we need a base table built from the combination of these two tables: `questions` and `answers`. Working our way up by adding in the relevant details, we should be able to get a dataset that will help! (mad hope)

## Comments table

> Comments can be made on Answers or Questions. We refer to whichever the comment is posted to as the "parent" of that comment. Comments can be posted by any type of user. Our favorite comments tend to have "Thank you" in them :)

In [None]:
comments = pd.read_csv('../input/comments.csv')

In [None]:
comments.shape

In [None]:
comments.head()

### Example of a comment by the students probably?

In [None]:
comments['comments_body'][4]

### Example of extra clarification on a follow up comment?

In [None]:
comments['comments_body'][2]

### Example of comment by the professionals?

In [None]:
comments['comments_body'][121]

## Matches table

> Each row tells you which questions were included in emails. If an email contains only one question, that email's ID will show up here only once. If an email contains 10 questions, that email's ID would show up here 10 times.

In [None]:
matches = pd.read_csv('../input/matches.csv')

In [None]:
matches.head()

In [None]:
matches.describe()

In [None]:
matches['matches_email_id'].nunique()

In [None]:
matches['matches_question_id'].nunique()
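
Since an email ID repeats once per question it contains, grouping by email gives the number of questions per email. A sketch on toy data (the IDs below are made up):

```python
import pandas as pd

# Toy stand-in for matches.csv: one row per (email, question) pairing.
matches_demo = pd.DataFrame({
    'matches_email_id': [101, 102, 102, 102, 103, 103],
    'matches_question_id': ['q1', 'q2', 'q3', 'q4', 'q1', 'q5'],
})

# Number of distinct questions included in each email.
questions_per_email = (
    matches_demo.groupby('matches_email_id')['matches_question_id'].nunique()
)

# Number of distinct emails each question was sent in.
emails_per_question = (
    matches_demo.groupby('matches_question_id')['matches_email_id'].nunique()
)

print(questions_per_email.describe())
```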

## Tags table

> Each tag gets a name.

In [None]:
tags = pd.read_csv('../input/tags.csv')

In [None]:
tags.head()

In [None]:
tags['tags_tag_name'].nunique()

In [None]:
tags.shape

In [None]:
import re
[str(tag) for tag in tags['tags_tag_name'].tolist() if re.search(r'computer', str(tag))][:20]

Looking at the tags that contain `computer`, we can see that there are near-duplicate tags: `computer-science`, `computer-engineering`, `computer-engineer` and `computerscienceinformation`
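
One way such near-duplicates could be grouped is sketched below with the standard library's `difflib.SequenceMatcher`. The `normalize_tag` and `cluster_tags` helpers and the `0.85` threshold are assumptions for illustration, not a tuned method.

```python
import re
from difflib import SequenceMatcher

# The example tags mentioned above.
tag_names = [
    'computer-science', 'computer-engineering',
    'computer-engineer', 'computerscienceinformation',
]

def normalize_tag(tag):
    """Lowercase the tag and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', str(tag).lower())

def cluster_tags(names, threshold=0.85):
    """Greedily group tags whose normalized form is close to a cluster's first member."""
    clusters = []
    for name in names:
        key = normalize_tag(name)
        for cluster in clusters:
            representative = normalize_tag(cluster[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([name])
    return clusters

clusters = cluster_tags(tag_names)
for cluster in clusters:
    print(cluster)
```

The greedy pass depends on the order of the input and compares every tag against every cluster, so it is only practical for a filtered subset of tags like the one above.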

## Membership data tables: School and Group

> Group membership : Any type of user can join any group. There are only a handful of groups so far.
> School membership: Just like group_memberships, but for schools instead.

Probably not of much use for building a baseline model  👀

In [None]:
school_membership = pd.read_csv('../input/school_memberships.csv')

In [None]:
group_membership = pd.read_csv('../input/group_memberships.csv')

In [None]:
school_membership.head()

In [None]:
group_membership.head()

In [None]:
school_membership['school_memberships_school_id'].nunique()

In [None]:
school_membership.groupby('school_memberships_school_id').count().reset_index().head(10)

### This is something great that you'll find in real-world datasets! It's messy and leaves you bamboozled like the real Messi

![](https://thumbs.gfycat.com/RewardingPointedBarebirdbat-size_restricted.gif) 


### The next set of tasks will be to get a combined denormalized dataset that tracks `questions`, `answers`, `professionals` and `students` together

## Data Merges [Will keep on updating]

In [None]:
questions.head()

In [None]:
answers.head()

In [None]:
answers = answers.rename(columns={'answers_question_id': 'questions_id'})

In [None]:
questions.shape, answers.shape

In [None]:
merge_qna = pd.merge(questions, answers, on='questions_id', how='left')

In [None]:
merge_qna = merge_qna[['questions_id', 'questions_author_id', 
                       'answers_author_id', 'questions_title', 
                       'questions_body', 'answers_body', 
                       'questions_date_added', 'answers_date_added']]

In [None]:
merge_qna.shape

In [None]:
merge_qna.head()
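
Because the merge above is a left join, questions without any answer stay in the table with missing answer columns, and questions with several answers appear once per answer. A sketch on toy data of how both cases can be counted with the `indicator` option of `pd.merge`; the `answers_id` column name is an assumption about the schema.

```python
import pandas as pd

# Toy stand-ins for the questions and (renamed) answers tables.
questions_demo = pd.DataFrame({'questions_id': ['q1', 'q2', 'q3']})
answers_demo = pd.DataFrame({
    'questions_id': ['q1', 'q1', 'q3'],
    'answers_id': ['a1', 'a2', 'a3'],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(questions_demo, answers_demo,
                  on='questions_id', how='left', indicator=True)

# Questions that found no matching answer.
unanswered = merged.loc[merged['_merge'] == 'left_only', 'questions_id']

# count() skips missing values, so unanswered questions get 0.
answers_per_question = merged.groupby('questions_id')['answers_id'].count()

print(len(merged), unanswered.tolist())
print(answers_per_question)
```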

In [None]:
merge_qna_professionals = pd.merge(merge_qna, professionals, 
                                   left_on='answers_author_id', 
                                   right_on='professionals_id', 
                                   how='left')

In [None]:
merge_qna_professionals = merge_qna_professionals[['questions_id', 'questions_author_id', 
                                                   'answers_author_id', 'questions_title', 
                                                   'questions_body', 'answers_body', 
                                                   'questions_date_added', 'answers_date_added',
                                                   'professionals_location', 'professionals_industry']]

In [None]:
merge_qna_professionals.head()

In [None]:
merge_qna_professionals_students = pd.merge(merge_qna_professionals, students, 
                                             left_on='questions_author_id', 
                                             right_on='students_id', 
                                             how='left')

In [None]:
merge_qna_professionals_students.head()

### There's something weird I sensed with respect to rows `1` and `2`

In [None]:
row1 = merge_qna_professionals_students.loc[1, ['questions_body', 'answers_body', 
                                                'students_id', 'students_location']]

In [None]:
row2 = merge_qna_professionals_students.loc[2, ['questions_body', 'answers_body', 
                                                'students_id', 'students_location']]

In [None]:
row1['questions_body']

In [None]:
row1['answers_body']

In [None]:
row1['students_location']

In [None]:
row2['questions_body']

In [None]:
row2['answers_body']

In [None]:
row2['students_location']

### As mentioned above, data quality errors are sometimes a big problem. In the 2nd answer, the professional (from India) made sure to answer the question properly. Localization-wise, this is a relevant answer and a match with the right professional.

### That was only possible because we got a match based on the mention of Bangalore in the question

### Answer 1, perhaps not so much. Are there a few more problems in the dataset? I'll be Arsene Wenger and say, "I didn't see it". Do I know whether I'll find them? It will take a lot of digging to figure out.
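
To look for more rows like the Bangalore one, a crude check is whether the question text mentions the city of the answering professional. The sketch below runs on made-up rows; it assumes the city is the first comma-separated part of `professionals_location`, and the `mentions_professional_city` helper is hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the merged questions/answers/professionals table.
qa_demo = pd.DataFrame({
    'questions_body': [
        'Which colleges in Bangalore are good for engineering?',
        'How do I become a pilot?',
    ],
    'professionals_location': ['Bangalore, Karnataka, India', 'Boston, Massachusetts'],
})

def mentions_professional_city(row):
    """True when the professional's city (assumed first location part) appears in the question."""
    location = row['professionals_location']
    body = row['questions_body']
    if pd.isna(location) or pd.isna(body):
        return False
    city = str(location).split(',')[0].strip().lower()
    return bool(city) and city in str(body).lower()

qa_demo['city_mentioned'] = qa_demo.apply(mentions_professional_city, axis=1)
print(qa_demo[['professionals_location', 'city_mentioned']])
```

Plain substring matching misses alternate spellings (Bangalore vs. Bengaluru) and can produce false positives on short city names, so it is only a way to surface candidates for manual review.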

In [None]:
merge_qna_professionals_students = merge_qna_professionals_students[['questions_id', 'questions_author_id', 
                                                                     'answers_author_id', 'questions_title', 
                                                                     'questions_body', 'answers_body', 
                                                                     'questions_date_added', 'answers_date_added',
                                                                     'professionals_location', 'professionals_industry',
                                                                     'students_location']]

In [None]:
merge_qna_professionals_students.shape

In [None]:
merge_qna_professionals_students.head()

**This will be updated...hopefully soon**