# Mapping Answers into a User Profile

In [135]:
import math
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [136]:
tags = pd.read_csv("../input/Tags_Filtered.csv", encoding='latin1')
answers = pd.read_csv("../input/Answers_Filtered.csv",encoding="latin1")

In [137]:
answers = answers.set_index('Id');
tags = tags.set_index('Id');

In [138]:
tags.sample(3)

Id,Tag
28836810,disabled-input
22148740,css
11414840,asp.net-mvc-3


In [139]:
answers.sample(3)

Id,OwnerUserId,CreationDate,ParentId,Score,Body
12402306,1464455.0,2012-09-13T08:29:52Z,12401930,0,code quite similar retrieval objects ldap need...
12084894,682515.0,2012-08-23T04:47:58Z,6165590,15,case people problem chrome later manifest must...
35409154,821110.0,2016-02-15T12:15:56Z,35408860,1,third parameter body used post requests header...


## Evaluating a user based on their Answers

In this section, we use the answers and tags datasets to create two matrices of users, with one row per user and one column per tag. The two matrices are:
* A matrix containing the total score a user earned per tag
* A matrix keeping track of the total number of answers a user gave per tag

In [140]:
rows = answers['OwnerUserId'].unique()
dimensions = tags[tags.index.isin(answers['ParentId'])]['Tag'].unique()

In [176]:
user_scores = pd.DataFrame(np.zeros((len(rows), len(dimensions))),index=rows, columns=dimensions)
user_answers = pd.DataFrame(np.zeros((len(rows), len(dimensions))),index=rows, columns=dimensions)

In [154]:
answers[answers['OwnerUserId']== 61].set_index('ParentId').join(tags)

ParentId,OwnerUserId,CreationDate,Score,Body,Tag
90,61.0,2008-08-01T14:45:37Z,13,version control subversion good resource sourc...,svn
90,61.0,2008-08-01T14:45:37Z,13,version control subversion good resource sourc...,tortoisesvn
90,61.0,2008-08-01T14:45:37Z,13,version control subversion good resource sourc...,branch
90,61.0,2008-08-01T14:45:37Z,13,version control subversion good resource sourc...,branching-and-merging
24270,61.0,2008-08-29T01:30:38Z,1,know find oop useful pretty much solely syntac...,language-agnostic
24270,61.0,2008-08-29T01:30:38Z,1,know find oop useful pretty much solely syntac...,oop
47980,61.0,2008-09-07T02:32:02Z,7,sure hell small errors explode pages pages unr...,c++
47980,61.0,2008-09-07T02:32:02Z,7,sure hell small errors explode pages pages unr...,templates
47980,61.0,2008-09-07T02:32:02Z,7,sure hell small errors explode pages pages unr...,compiler-errors
51390,61.0,2008-09-09T08:35:20Z,2,wonder widespread jvm actually case flash ie5 ...,java


Number of users we start with

In [177]:
print(len(user_scores.index.unique()),len(user_answers.index.unique()), len(answers['OwnerUserId'].unique()))

446585 446585 446585


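Check how many tag rows trace back to a question that has at least one answer in our dataset (first number), compared with the total number of tag rows (second number)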

In [144]:
print(len(tags[tags.index.isin(answers['ParentId'])]), len(tags))

3005933 3464882


In [145]:
tags[tags.index.isin(answers['ParentId'])]['Tag'].unique()

array(['flex', 'actionscript-3', 'air', ..., 'grails-spring-security',
       'blacklist', 'docker-windows'], dtype=object)

Number of answers we will be dealing with

In [146]:
len(answers)

1900285

The number of answers that have a score of exactly 0. For this exercise, we treat them as having a score of 1, because dropping them would remove more than a third of our data

In [148]:
len(answers[answers['Score'] == 0])

731495
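
As a minimal sketch of this scoring rule on made-up data (the `toy_scores` series and its values are assumptions for illustration, not taken from the dataset), negative scores are dropped and every remaining score is raised to at least 1:

```python
import pandas as pd

# Toy scores (hypothetical); the notebook reads the real ones from Answers_Filtered.csv.
toy_scores = pd.Series([13, 0, -2, 1, 0], name="Score")

# Negative scores are skipped entirely.
non_negative = toy_scores[toy_scores >= 0]

# Raise every remaining score to at least 1, so a 0 counts as 1.
effective = non_negative.clip(lower=1)

print(effective.tolist())
```

`clip(lower=1)` is used here as a vectorized equivalent of `max(Score, 1)` applied row by row.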

In [160]:
answers.index

Int64Index([      92,      124,      199,      269,      307,      332,
                 344,      359,      473,      529,
            ...
            40143180, 40143200, 40143212, 40143236, 40143237, 40143247,
            40143322, 40143336, 40143349, 40143389],
           dtype='int64', name='Id', length=1900285)

## Populating the matrices

This loop iterates through our dataset and, *per tag per answered question*, adds to the answerer's row:
1. 1 to the tag's column in the answer-count matrix
2. The maximum of (1, Score) to the tag's column in the score matrix. The max serves to treat 0 scores as 1

If the score is less than 0, the answer is skipped
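
The same bookkeeping can also be sketched without an explicit loop. The example below is an assumption-laden alternative on toy data, not the notebook's own method: the frames `toy_answers` and `toy_tags` and their values are made up, and a merge followed by a cross-tabulation stands in for the row-by-row updates.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the answers and tags frames, both indexed by Id.
toy_answers = pd.DataFrame(
    {
        "OwnerUserId": [61, 61, 26, 26],
        "ParentId": [90, 24270, 90, 51390],
        "Score": [13, 0, -1, 2],
    },
    index=pd.Index([92, 24300, 95, 51400], name="Id"),
)
toy_tags = pd.DataFrame(
    {"Tag": ["svn", "branch", "oop", "java"]},
    index=pd.Index([90, 90, 24270, 51390], name="Id"),
)

# Skip answers with a negative score, then count a 0 score as 1.
kept = toy_answers[toy_answers["Score"] >= 0].copy()
kept["Score"] = kept["Score"].clip(lower=1)

# One row per (answer, tag of the answered question).
pairs = kept.merge(toy_tags, left_on="ParentId", right_index=True)

# Answers per user per tag, and summed score per user per tag.
toy_counts = pd.crosstab(pairs["OwnerUserId"], pairs["Tag"])
toy_scores = pairs.pivot_table(
    index="OwnerUserId", columns="Tag", values="Score", aggfunc="sum", fill_value=0
)

print(toy_counts)
print(toy_scores)
```

On a dataset of this size a join-and-aggregate approach is likely to be much faster than `iterrows`, though the resulting frames only contain users and tags that actually occur, so they may need reindexing to match the full matrices.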

In [163]:
answers[answers.index == 92]

Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61.0,2008-08-01T14:45:37Z,90,13,version control subversion good resource sourc...


In [178]:
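for answer_index, answer in answers.iterrows():
    # Skip answers with a negative score
    if answer['Score'] < 0:
        continue
    # Tags are attached to the question, so look them up by the answer's ParentId
    answer_tags = tags[tags.index == answer['ParentId']]
    for tag_index, tag in answer_tags.iterrows():
        user_answers.at[answer['OwnerUserId'], tag['Tag']] += 1
        # max(Score, 1) counts a 0 score as 1
        user_scores.at[answer['OwnerUserId'], tag['Tag']] += max(answer['Score'], 1)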

40.0 %
110.0 %
200.0 %
240.0 %
320.0 %
500.0 %
1330.0 %
1860.0 %
1910.0 %
1980.0 %


In [None]:
bad_users = user_answers[user_answers.sum(axis=1) == 0].index

In [None]:
len(bad_users)

In [48]:
user_answers.head(3)

Id,.net,actionscript-3,angularjs,asp.net,c#,cookies,css,date,flash,flex,generics,html,javascript,session,sqlite,tsql,vb.net,web-services,xml
26.0,1.0,1.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
58.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
83.0,9.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,3.0,0.0,0.0


In [50]:
# index_label names the index column so it can be read back with index_col='Id'
user_scores.to_csv("../profiles/answer_scores.csv", encoding="latin1", index_label='Id')
user_answers.to_csv("../profiles/answer_counts.csv", encoding="latin1", index_label='Id')

## Mapping user score data and user answer count data into a profile

In [72]:
user_scores = pd.read_csv("../profiles/answer_scores.csv",encoding="latin1",index_col='Id')
user_answers = pd.read_csv("../profiles/answer_counts.csv",encoding="latin1",index_col='Id')

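### Treating users with all negative scores
Negative scores are skipped when the matrices are populated, so a user whose scores are all negative ends up with an all-zero row. For this exercise, we drop those users' questions from the question and tag datasets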
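
One reading of this step, sketched on toy data: find the users whose score row is zero everywhere, drop their questions, and keep only the tags of the questions that remain. All three toy frames and their values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the score matrix, questions, and tags.
toy_user_scores = pd.DataFrame(
    {"java": [0.0, 3.0], "string": [0.0, 0.0]},
    index=pd.Index([25778, 61], name="Id"),
)
toy_questions = pd.DataFrame(
    {"OwnerUserId": [25778, 61]},
    index=pd.Index([204970, 90], name="Id"),
)
toy_tags = pd.DataFrame(
    {"Tag": ["java", "string", "svn"]},
    index=pd.Index([204970, 204970, 90], name="Id"),
)

# Users whose score row is zero in every tag column.
toy_bad_users = toy_user_scores.index[(toy_user_scores == 0).all(axis=1)]

# Drop their questions, then keep only tags of the questions that remain.
kept_questions = toy_questions[~toy_questions["OwnerUserId"].isin(toy_bad_users)]
kept_tags = toy_tags[toy_tags.index.isin(kept_questions.index)]

print(list(toy_bad_users), kept_questions.index.tolist(), kept_tags["Tag"].tolist())
```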

In [None]:
# Keep the user ids (the index), not the rows, so they can be passed to isin()
bad_users = user_scores.loc[(user_scores==0).all(axis=1)].index

In [None]:
tags = pd.read_csv("../input/Tags_Filtered.csv", encoding='latin1',index_col='Id')
questions = pd.read_csv("../input/Questions_Filtered.csv",encoding="latin1",index_col='Id')

In [80]:
# Both frames are already indexed by Id, so they can be joined directly
questions[questions['OwnerUserId']== 25778].join(tags)

Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Tag
204970,25778.0,2008-10-15T14:40:11Z,,-3,split string fixed character sequence,suppose following string string asd test ass t...,java
204970,25778.0,2008-10-15T14:40:11Z,,-3,split string fixed character sequence,suppose following string string asd test ass t...,string


In [84]:
# Drop the bad users' questions, then keep only the tags of the remaining questions
questions = questions[~questions['OwnerUserId'].isin(bad_users)]
tags = tags[tags.index.isin(questions.index)]

In [88]:
# to_csv returns None when given a path, so do not reassign; keep the Id index in the file
tags.to_csv("../input/Tags_Filtered.csv", encoding='latin1', index_label='Id')
questions.to_csv("../input/Questions_Filtered.csv", encoding="latin1", index_label='Id')

In [89]:
user_scores = user_scores.loc[~(user_scores==0).all(axis=1)]
user_questions = user_questions.loc[~(user_questions==0).all(axis=1)]

In [95]:
print(len(user_scores), len(user_questions))

553866 553866


## Now that we've gotten rid of bad users, we are ready to generate profiles for the remaining ones

### We will normalize a user's question count here

In [20]:
user_scores = pd.read_csv("../profiles/question_scores.csv",encoding="latin1",index_col='Id' )
user_questions = pd.read_csv("../profiles/question_counts.csv",encoding="latin1",index_col='Id')

In [42]:
bad_users = user_scores[user_scores.sum(axis=1) == 0].index
user_scores = user_scores[~user_scores.index.isin(bad_users)]
user_questions = user_questions[~user_questions.index.isin(bad_users)]

In [46]:
user_questions_index.shape == user_scores.shape

True

In [44]:
user_questions_index = user_questions.div(user_questions.sum(axis=1), axis=0)
user_questions_index.head(1)

Id,flex,actionscript-3,svn,sql,asp.net,algorithm,colors,c#,.net,c++,...,meteor,laravel,firebase,parse.com,typescript,docker,apache-spark,reactjs,spring-boot,ionic-framework
26.0,0.115385,0.038462,0.0,0.0,0.153846,0.0,0.0,0.115385,0.038462,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
user_scores.head(1)

Id,flex,actionscript-3,svn,sql,asp.net,algorithm,colors,c#,.net,c++,...,meteor,laravel,firebase,parse.com,typescript,docker,apache-spark,reactjs,spring-boot,ionic-framework
26.0,30.0,26.0,0.0,0.0,4.0,0.0,0.0,18.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Since we normalized the values, we expect each row to sum to one

In [45]:
user_questions_index.sample(1).sum(axis=1)

Id
5600344.0    1.0
dtype: float64
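
A minimal sketch of the row normalization on toy data (the `toy_counts` frame is made up for illustration): each row is divided by its own total, so every row of the result sums to 1.

```python
import pandas as pd

# Toy question counts per user and tag (hypothetical values).
toy_counts = pd.DataFrame(
    {"flex": [3.0, 0.0], "c#": [1.0, 2.0]},
    index=pd.Index([26, 83], name="Id"),
)

# Divide each row by its row total; axis=0 aligns the totals with the row index.
row_totals = toy_counts.sum(axis=1)
toy_shares = toy_counts.div(row_totals, axis=0)

print(toy_shares)
print(toy_shares.sum(axis=1))
```

A user with a row total of 0 would produce NaN values here, which is one reason such users are removed before this step.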

In [47]:
user_information = user_questions_index.multiply(user_scores)

In [48]:
user_information.head(1).loc[:,['.net','actionscript-3', 'angularjs', 'asp.net', 'c#', 'cookies','css', 'date', 'flash', 'flex', 'generics','html', 'javascript', 'session', 'sqlite', 'tsql',  'vb.net', 'web-services', 'xml']]

Id,.net,actionscript-3,angularjs,asp.net,c#,cookies,css,date,flash,flex,generics,html,javascript,session,sqlite,tsql,vb.net,web-services,xml
26.0,0.230769,1.0,0.076923,0.615385,2.076923,0.230769,0.384615,0.038462,0.038462,3.461538,0.076923,0.384615,0.153846,0.230769,0.038462,0.038462,0.038462,0.153846,0.153846


In this weighted average, the minimum "knowledge" a user can have is 1. This happens when they asked a single question with a score of 1 (a score of 0 is counted as 1)

In [49]:
min(user_information.sum(axis=1))

0.9999999999999998
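
To make the arithmetic concrete, here is a small sketch on made-up data (both toy frames and the user ids are assumptions). Each tag's share of a user's questions is multiplied by the user's total score in that tag, and the row sum gives the overall "knowledge" value.

```python
import pandas as pd

idx = pd.Index([26, 99], name="Id")

# Toy question counts and summed scores per user and tag (hypothetical values).
toy_counts = pd.DataFrame({"flex": [3.0, 0.0], "c#": [1.0, 1.0]}, index=idx)
toy_scores = pd.DataFrame({"flex": [6.0, 0.0], "c#": [2.0, 1.0]}, index=idx)

# Share of each tag in the user's questions, then weight the scores by it.
toy_shares = toy_counts.div(toy_counts.sum(axis=1), axis=0)
toy_profile = toy_shares.multiply(toy_scores)

# User 26: 0.75 * 6 + 0.25 * 2 = 5.0
# User 99: one question with score 1, so 1.0 * 1 = 1.0 (the minimum)
print(toy_profile.sum(axis=1))
```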

In [51]:
user_information.to_csv("../profiles/question_profiles.csv", encoding="latin1", index_label='Id')