# Fundamentals of Social Data Science 2025. Week 1. Day 2. Exercises

This is a group assignment. 

You will be expected to submit an individual assignment on Canvas on Tuesday at 12pm (not Monday). The sheet to submit will be released on Friday at 12pm. The deadline is Tuesday so that you can integrate your materials after the presentation. 

- That sheet will have a small number of individual questions related to Friday's assignment
- It will include one question about your presentation. That question is reproduced below so there are no surprises. 

The assignment submission details will be posted on Canvas under assignments.

To itemise: 
- Week 1. Day 2. Wednesday at 12pm: This "getting started" sheet is released. 
- Wednesday afternoon tutorial: We will want to ensure that you can get started on loading data. 
- Week 1. Day 3. Friday at 12pm: The individual assignment is released. 
- Week 1. Day 3. Friday afternoon tutorial: You will want to play with the Claude artifact as well as continue working with your group. 
- Week 2. Day 1. Monday at 12pm: An exercise will be released related to Network Canvas. It will require you to download Network Canvas interviewer from networkcanvas.com. 
- Week 2. Day 1. Monday afternoon tutorial: Bernie will explain the Network Canvas exercise as a part of the class. The tutorial period will be group presentations. 
- Tuesday at 12pm: Your individual assignment is due. 
- Tuesday at 12pm: Your group assignment should be posted. 

> **NOTE:** This assignment will use data from the web. This assignment has NOT been cleared for research via the CUREC process. It is an in-class assignment. Therefore, if you wish to publish anything from this analysis under your Oxford affiliation, you must first apply for CUREC approval. 

# Group exercise: Getting started

The group assignment will make use of the StackDownloader from the FSSTDS repository. This downloader (recently tested) will download, extract and process a StackExchange archive. It is pretty close to 'one click'. It creates a 'feather' archive, which is a very nice format for compressing DataFrames. You can open this in your own code. 

To begin, you will need to have everything installed for the StackDownloader. How do we do that? We install the requirements.

- **Step 1.** Clone the FSSTDS repository. 
- **Step 2.** Open the Ch.00.Stack_downloader notebook and choose 'Select Kernel', then "Python Environments...", then "Create Python Environment", then "Venv -> Creates a `.venv` virtual environment in the current workspace". Select **Python 3.12** (note that 3.14 is untested). When asked which dependencies to install, select requirements.txt. 
- **Step 3.** Run the big code cell in Stack_downloader. Select a specific archive. 
- **Step 4.** Locate and load the DataFrame. You can now use the Stack Exchange in your work. 

Note: if you get errors with PyArrow below, try restarting the kernel. 

In [5]:
# In case this Jupyter Notebook is in a different repo than FSSTDS, you may need to install
# pandas and pyarrow to parse the file. 
import sys
import subprocess

subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pandas", "pyarrow"])

0

## Q2. Defining helpfulness 

If you can describe the data simply then you are on your way to the big question for the group. Recall two of the trade-offs from the last lecture: "operationalisation" and "coding". The group project this week is very simple in some senses and very complex in other senses: 

Two questions: 
> - "How can we identify the most helpful users in this space?" 
> - "When were the helpful users the most helpful or most active?"

So this means that your group will have to discuss:
- What defines helpfulness? Are there multiple possible metrics? 
- Do we think that a helpful person should _always_ be helpful? 
- Is helpfulness topic-specific? 
- You may want to explore wrangling the data by time. 
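
As a starting point for that discussion, here is a minimal sketch of a few candidate per-user metrics (answer count, total score, mean score, accepted answers). The toy table and its values are invented for illustration. Only the column names come from the Posts schema, and the choice of metrics is an assumption, not a recommended definition of helpfulness.

```python
import pandas as pd

# Toy stand-in for the Posts table. Column names follow the Posts schema,
# but every value here is made up for illustration.
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "PostTypeId": [1, 2, 2, 1, 2, 2],          # 1 = question, 2 = answer
    "ParentId": [None, 1, 1, None, 4, 4],
    "AcceptedAnswerId": [2, None, None, 6, None, None],
    "OwnerUserId": [10, 11, 12, 11, 12, 13],
    "Score": [3, 5, 1, 2, 4, 7],
})

# Compare as strings so the filter works whether the IDs were loaded
# as integers or as text.
answers = posts[posts["PostTypeId"].astype(str) == "2"].copy()

# An answer counts as accepted if some question points to it.
accepted_ids = set(posts["AcceptedAnswerId"].dropna().astype(int))
answers["IsAccepted"] = answers["Id"].astype(int).isin(accepted_ids)

# Several candidate metrics side by side, one row per answering user.
metrics = (
    answers.groupby("OwnerUserId")
    .agg(
        n_answers=("Id", "count"),
        total_score=("Score", "sum"),
        mean_score=("Score", "mean"),
        n_accepted=("IsAccepted", "sum"),
    )
    .sort_values("total_score", ascending=False)
)
print(metrics)
```

Different metrics can rank the same users differently, for example a prolific user with many low-scoring answers versus an occasional user with one highly scored answer, which is part of what your group needs to argue about.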

We do not expect you to merge in data from the users.xml / users.feather for this. However, you may want to explore how to create a datetime column. This is not covered in this lecture, but you may want to read Chapter 10 of FSSTDS on cleaning data and Chapter 12 of FSSTDS on wrangling time data. 
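
One possible way to build such a column is sketched below with `pd.to_datetime`, followed by a monthly roll-up. The toy rows are invented, and the `Month` column and `monthly` table are hypothetical names chosen for this example. If your `CreationDate` column is already `datetime64`, the parsing step simply leaves it unchanged.

```python
import pandas as pd

# Toy answers table with timestamps stored as text (values are made up).
answers = pd.DataFrame({
    "OwnerUserId": [11, 12, 12, 13],
    "Score": [5, 1, 4, 7],
    "CreationDate": [
        "2017-02-01T10:15:00.000",
        "2017-02-20T08:00:00.000",
        "2017-03-05T19:30:00.000",
        "not a date",
    ],
})

# Parse text into real timestamps; anything unparseable becomes NaT.
answers["CreationDate"] = pd.to_datetime(answers["CreationDate"], errors="coerce")

# Drop rows without a usable date, then label each row with its calendar month.
dated = (
    answers.dropna(subset=["CreationDate"])
    .assign(Month=lambda d: d["CreationDate"].dt.to_period("M"))
)

# Activity and score per month.
monthly = dated.groupby("Month").agg(
    n_answers=("Score", "count"),
    total_score=("Score", "sum"),
)
print(monthly)
```

Adding `OwnerUserId` to the `groupby` keys would give the same table per user and month, which is one route into the "when were they most helpful" question.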

You will want to divide some tasks among your group. Some might be delegated to surf the space online to come up with abductive hypotheses. Some might want to focus on rendering some charts. Some might be excellent at presentation design or at presenting to the group. Lean into your expertise and collaborate.

Presentations for this will be on Monday afternoon. The presentations will be no more than 12 minutes + 3 minutes of questions & transition. 

Each group will have a 'space' on Canvas to submit 3 things: 
- the presentation 
- the code
- the 'credits'. A single sheet (in docx/md) that details which group members participated in which ways. Treat this not merely as accountability but an opportunity to signal your own strengths. We do not expect everyone to do 1/5 of the work for every task. We do expect everyone to contribute in some way.

This code will not be graded but it will be made available to other students. 
The presentations will be given short written feedback by the instructor post-presentation.

In [None]:
# Q0. Check that you can load your own DataFrame
 
import pandas as pd 

stack_df = df = pd.read_feather("/Users/calebagoha/Desktop/fundamentals_of_sds/group4_fsstds/data/vegetarianism.stackexchange.com/Posts.feather")

# info() prints its summary directly and returns None, so no print() is needed.
stack_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2409 entries, 0 to 2408
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Id                     2409 non-null   object        
 1   PostTypeId             2409 non-null   object        
 2   AcceptedAnswerId       381 non-null    object        
 3   CreationDate           2409 non-null   datetime64[ns]
 4   Score                  2409 non-null   int64         
 5   ViewCount              760 non-null    float64       
 6   Body                   2409 non-null   object        
 7   OwnerUserId            2345 non-null   object        
 8   LastEditorUserId       1418 non-null   object        
 9   LastEditDate           1438 non-null   datetime64[ns]
 10  LastActivityDate       2409 non-null   datetime64[ns]
 11  Title                  760 non-null    object        
 12  Tags                   760 non-null    object        
 13  Ans

### Load all data files

In [12]:
from pathlib import Path

# Folder holding the extracted archive; adjust this to your own machine.
data_dir = Path("/Users/calebagoha/Desktop/fundamentals_of_sds/group4_fsstds/data/vegetarianism.stackexchange.com")

users = pd.read_xml(data_dir / "Users.xml", xpath=".//row")
votes = pd.read_xml(data_dir / "Votes.xml", xpath=".//row")
comments = pd.read_xml(data_dir / "Comments.xml", xpath=".//row")
badges = pd.read_xml(data_dir / "Badges.xml", xpath=".//row")
post_links = pd.read_xml(data_dir / "PostLinks.xml", xpath=".//row")
post_history = pd.read_xml(data_dir / "PostHistory.xml", xpath=".//row")
posts = pd.read_xml(data_dir / "Posts.xml", xpath=".//row")

### Data Cleaning & Preprocessing

In [None]:
# Keep only answers (PostTypeId == 2).
answers = posts[posts["PostTypeId"] == 2].copy()
# Replace missing owner IDs (38 rows, possibly deleted accounts) with the sentinel -999.
answers["OwnerUserId"] = answers["OwnerUserId"].fillna(-999)

# Parse user timestamps; unparseable values become NaT.
users['CreationDate'] = pd.to_datetime(users['CreationDate'], errors='coerce')
users['LastAccessDate'] = pd.to_datetime(users['LastAccessDate'], errors='coerce')

# Collect each user's badge names into a single list per user.
badges = badges[["UserId", "Name"]]
badges_by_user = badges.groupby('UserId', dropna=False)['Name'].agg(list).reset_index().rename(columns={'Name': 'badges'})

In [58]:
# compute per question total answer score
totals = answers.groupby('ParentId', dropna=False)['Score'].sum().rename('TotalAnswerScore').reset_index()

# merge question totals back onto answers
answers = answers.merge(totals, on="ParentId", how="left")

- Posts was filtered to contain only answers.
- A new column, `TotalAnswerScore`, holds the total answer score per question.
- Next steps: Kiki creates a new column for the normalized score.
- Next steps: Howard groups by `OwnerUserId` to get the aggregate normalized score and total post volume.
- Next steps: Caleb finishes creating the centralized dataset by including badges and other descriptive user-level data for Michi and Çelikhan.
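
The next steps above could be sketched as follows. The notes do not define the normalization, so this example assumes one plausible choice: each answer's score as a share of the total answer score on its question. The `NormalizedScore`, `total_normalized` and `n_answers` names are hypothetical, and the toy rows are invented.

```python
import numpy as np
import pandas as pd

# Toy answers table (made-up values) with the same key columns as the real one.
answers = pd.DataFrame({
    "Id": [2, 3, 5, 6, 8],
    "ParentId": [1, 1, 4, 4, 7],
    "OwnerUserId": [11, 12, 12, 13, 11],
    "Score": [6, 2, 1, 3, 0],
})

# Per-question total answer score, merged back onto each answer.
totals = (
    answers.groupby("ParentId")["Score"].sum()
    .rename("TotalAnswerScore")
    .reset_index()
)
answers = answers.merge(totals, on="ParentId", how="left")

# ASSUMPTION: normalized score = this answer's share of the question's total.
# A zero total would divide by zero, so treat it as missing instead.
denominator = answers["TotalAnswerScore"].replace(0, np.nan)
answers["NormalizedScore"] = answers["Score"] / denominator

# Aggregate per user: summed normalized score and number of answers.
per_user = answers.groupby("OwnerUserId").agg(
    total_normalized=("NormalizedScore", "sum"),
    n_answers=("Id", "count"),
)
print(per_user)
```

Negative scores can make a share-based measure hard to interpret (a question's total can be small or negative), so the group may prefer a different normalization, such as a rank within each question.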

### Creating Centralized Dataset