# Code for Creating a Reputation-Answer Dataset
*Author*: K. Elizabeth Bui 
*Version*: 3, 12/13/2022

## Purpose
This code aims to create and analyze a dataset made from StackOverflow user and answer post data derived from StackExchange.

## Creating the Dataset
The dataset will combine two queried datasets from [StackExchange](https://data.stackexchange.com/stackoverflow/query/new). The first .csv, futher referred to as users, contains columns id and reputation. The second .csv, answers, contains 5 columns: id, score, commentcount, owneruserid, and body. The resulting data set will have columns reputation and answer, where answer entries are lists of lists containing single answer statistics. These individual answer statistics have the format \[score, comment count, answer length\].

### Import all data and libraries

In [2]:
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

# Queried .csv files 
user_file = 'User_Dataset_4.csv'
answer_file = 'answer_dataset_2.csv'

# Convert the .csv files to dataframes
users = pd.read_csv(user_file)
answers = pd.read_csv(answer_file)

In [3]:
dataset = pd.DataFrame(columns = ['reputation', 'answers'])

### Combine user and answer information

In [4]:
# Parse through each user
for i in range(users.shape[0]):
    
    # Filter for answers written by user 
    idfilter = answers['owneruserid'].isin([users.loc[i,'id']])
    users_answers = answers[idfilter]
    users_answers=users_answers.reset_index(drop=True)
    
    answers_list = []
    # Take each answer and input it's data into a list
    for j in range(users_answers.shape[0]):
        single_answer = [users_answers.loc[j, 'score'], users_answers.loc[j, 'commentcount'], len(users_answers.loc[j,'body'])]
        # Add the single answer list into the complete answers list
        answers_list.append(single_answer)
    
    # if the user has written any answers, add the user and their answers into the main dataset
    if len(answers_list) > 0:
        dataset=dataset.append({'reputation': users.iloc[i,1], 'answers': answers_list}, ignore_index = True)

In [5]:
dataset=dataset.sort_values(by=['reputation'], ascending=False)
dataset=dataset.reset_index(drop=True)
dataset.to_csv('combined_dataset.csv', index=False);
display(dataset)

Unnamed: 0,reputation,answers
0,1371915,"[[1, 0, 1966], [5, 0, 1160], [6, 0, 1334], [5,..."
1,1172922,"[[1, 1, 578], [0, 0, 1114], [0, 0, 612], [0, 0..."
2,999554,"[[2, 0, 2913], [4, 2, 1550], [3, 2, 7587], [2,..."
3,998406,"[[1, 9, 1270], [1, 0, 1196], [1, 1, 798]]"
4,994252,"[[4, 0, 1685], [2, 0, 1429], [1, 0, 6850], [1,..."
...,...,...
4890,1004,"[[0, 0, 468]]"
4891,1003,"[[0, 0, 504], [3, 2, 611], [0, 0, 815], [1, 2,..."
4892,1002,"[[1, 0, 689]]"
4893,1001,"[[0, 0, 126]]"


### Notes about queried datasets
The data from [StackExchange](https://data.stackexchange.com/stackoverflow/query/new) was queried using the following commands: 

- [Answer Query](https://data.stackexchange.com/stackoverflow/query/1683041/answer-posts-version-2)
- [User Query](https://data.stackexchange.com/stackoverflow/query/1683058/user-data-set-4)

The scope of these queries have been adjusted to ensure greater chances of the 50,000 users queried (max allowed) match the 50,000 queried answers.

## Key statistics

### Create table to store statistics 
This table with have rows denoting the top 100 reputed users and all other users (not including the top 100) stats. It will contain columns
- average reputation (avg rep): the average reputation of a user
- median reputation (median rep): the median reputation of the users
- reputation mode (mode rep): the mode of the users' reputations
- average number of answers (avg # answers): the average number of answers a user has written
- average score (avg score): the average score of a user's answer
- average number of comments (avg # comment): the average number of comments a user's answer gets
- average length (avg length): the average length of a user's answer

In [6]:
stats = pd.DataFrame(columns = ['avg rep', 'median rep', 'mode rep', 'avg # answers','avg score', 'avg # comment', 'avg length'], index = ['top 100', 'everyone else'])

#### Reputation Stats

In [7]:
# Computate and store each of the reputation stat entries
stats.loc['top 100','avg rep'] = sum(dataset.iloc[:100,0])//100
stats.loc['top 100','median rep'] = dataset.iloc[49,0]
stats.loc['top 100','mode rep'] = dataset.iloc[0,0] - dataset.iloc[99,0]

stats.loc['everyone else','avg rep'] = sum(dataset.iloc[100:,0])//dataset.shape[0]
stats.loc['everyone else','median rep'] = dataset.iloc[(dataset.shape[0]-100)//2+100,0]
stats.loc['everyone else','mode rep'] = dataset.iloc[100,0] - dataset.iloc[dataset.shape[0]-1,0]

#### Comment Stats

In [8]:
# Helper method used to find the average stat value within a group of users
def stat_sum_avg(size, stat_index, offset):
    total = 0
    counter = 0
    for i in range(size):
        user_answer_stats = dataset.iloc[i+offset, 1]
        for j in range(len(user_answer_stats)):
            counter+=1
            total += user_answer_stats[j][stat_index]
    return total/counter

In [9]:
stats.loc['top 100','avg score'] = stat_sum_avg(100, 0, 0)
stats.loc['top 100','avg # comment'] = stat_sum_avg(100, 1, 0)
stats.loc['top 100','avg length'] = stat_sum_avg(100, 2, 0)

stats.loc['everyone else','avg score'] = stat_sum_avg(dataset.shape[0]-100, 0, 100)
stats.loc['everyone else','avg # comment'] = stat_sum_avg(dataset.shape[0]-100, 1, 100)
stats.loc['everyone else','avg length'] = stat_sum_avg(dataset.shape[0]-100, 2, 100)

In [10]:
# Helper method used to find the total number of comments written by a group of users
def num_comments(size, offset):
    total = 0
    for i in range(size):
        user_answer_stats = dataset.iloc[i + offset, 1]
        total += len(user_answer_stats)
    return total

In [11]:
stats.loc['top 100','avg # answers'] = num_comments(100, 0)//100
stats.loc['everyone else','avg # answers'] = num_comments(dataset.shape[0]-100, 100)//dataset.shape[0]

#### Stats and interpretation

In [12]:
display(stats)

Unnamed: 0,avg rep,median rep,mode rep,avg # answers,avg score,avg # comment,avg length
top 100,426661,354782,1149657,13,1.13,1.29,1442.98
everyone else,17632,5771,220296,3,0.69,1.02,1287.7


##### Noteworthy Observations
- The median is lower than the average reputation for both the top 100 users and everyone else. This means that top users of each group are pulling up the average. Alternatively, lower ranked users of each group are pulling down the average. Notably, the discrepency in average and median is larger for the top 100. 

- The range of reputation level (mode of reputation) is larger for the top 100 than the everyone else.

- Top 100 users write, on average, 10 more answers than everyone else. This is to be expected as one needs to do more to get more reputation.

- Crucially, the average score of the top 100 is almost double that of everyone else.

- Top users, on average, write slightly more than everyone else.

##### What does this mean?
Generally speaking, top users write more higher scored answers than everyone else.

Associated Project Write-up: [*Stack Overflow*: Does Reputation Indicate a Better Answer?](https://docs.google.com/document/d/1MeUrkDXiqXJQ4aoUsPcXUYh1XqN4l1qP/edit?usp=sharing&ouid=100247400743504583876&rtpof=true&sd=true)

Github Repo: [Stack_Overflow_User_Dataset](https://github.com/GreyHeartedKait/Stack_Overflow_User_Dataset)