# UC Subreddit Upvote Prediction
### Analyzing Post Features and Engagement Across the UC Subreddit System


## 1. Introduction

Reddit is one of the largest discussion and community-building platforms for college students, including those at the University of California (UC) campuses. Each UC campus's subreddit has its own culture, trends, and content patterns. These communities are especially useful for students who want to voice their opinions or share a funny story!

Take this as an example:

<img src="images\Screenshot 2025-11-12 225912.png" width="800">

This project analyzes Reddit posts from nine UC subreddits to predict how many upvotes a post will earn based on its content, metadata, and engagement features. We found this interesting because comparing individual UC subreddits shows how community culture differs from campus to campus.

There is also a practical use. If you were a club trying to get people to come to your event, you might ask:
- What factors influence whether your post reaches the front page?
- How can you get the most engagement out of your post?

### **Practical Terminology**

Common Reddit terms and their relevance to this project:

| Term | Meaning | Relevance to the Project |
|------|---------|---------------------------|
| **Upvote** | A positive (+) vote indicating support | Our main target variable |
| **Downvote** | A negative (-) vote indicating disagreement | Lowers the score, but is not always available in the data |
| **Score** | Upvotes − Downvotes | Sometimes differs from "upvotes" field |
| **Upvote Ratio** | % of total votes that are upvotes | Proxy for sentiment/approval |
| **Karma** | A user’s total upvote score on Reddit | Used as a predictor of credibility |
| **OP** | "Original Poster": the person who created the post | We track OP karma & account age |
| **Flair** | A label for the post ("Funny", "School", etc.) | Helps to categorize content |
| **Mods** | Subreddit moderators | They influence which posts stay or get removed |
| **NSFW** | “Not Safe For Work” content flag | Included as a binary feature |
| **Hot** | A listing sorted by engagement + time decay | Affects which posts we scraped |
| **Top** | A listing sorted purely by score | May bias dataset toward high-engagement posts |
| **New** | A listing sorted by recent posts | Influences visibility + upvotes |
| **Shitpost** | Low-effort or joke post (usually funny) | Often gets lots of votes in college subreddits |
| **Copypasta** | A repeated block of text/meme | Signals humor, may influence upvote behavior |
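
The relationship between score and upvote ratio in the table above can be made concrete with a little algebra. The sketch below is a minimal illustration, not part of our pipeline: `estimate_votes` is a hypothetical helper, and it assumes the reported score and ratio are exact, which may not hold because Reddit can round or obfuscate these values.

```python
def estimate_votes(score: float, upvote_ratio: float) -> tuple[float, float]:
    """Estimate (upvotes, downvotes) from a post's score and upvote ratio.

    Solves the two definitions from the terminology table:
        score = up - down
        ratio = up / (up + down)
    which gives  total = score / (2 * ratio - 1).
    """
    denominator = 2 * upvote_ratio - 1
    if abs(denominator) < 1e-12:
        # A ratio of exactly 0.5 means up == down, so the total is undetermined.
        raise ValueError("cannot recover vote counts when upvote_ratio is 0.5")
    total = score / denominator
    up = upvote_ratio * total
    down = total - up
    return up, down


# Values taken from the first Berkeley row in our data (score 52, ratio 0.93).
up, down = estimate_votes(score=52, upvote_ratio=0.93)
print(round(up, 1), round(down, 1))
```

For that post, the estimate comes out to roughly 56 upvotes and 4 downvotes, which shows how a score can understate the total number of votes cast.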


### **Research Questions**

**RQ1 (Prediction):**  
Can we predict the number of upvotes a post receives based on title features, metadata, and subreddit characteristics?

**RQ2 (Explanation):**  
Which factors such as title length, sentiment, posting time, media presence, and subreddit size have the strongest influence on upvote counts?

The goal is to combine exploratory analysis and machine learning to uncover meaningful patterns in campus-level Reddit engagement.

## 2. Study Design & Data Description

We scraped posts from the *top*, *hot*, and *new* listings for nine UC subreddits using Reddit’s public JSON API.
Because these listings favor high-visibility posts, the dataset may be biased toward successful or recent posts.

### Subreddits included:
- r/UCSD  
- r/UCLA  
- r/Berkeley  
- r/UCSantaBarbara  
- r/UCI  
- r/UCDavis  
- r/UCSC  
- r/ucr  
- r/ucmerced  
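
For readers who want to reproduce the collection step, here is a rough sketch of how one listing could be pulled from Reddit's public JSON endpoint and flattened into rows. It is not our actual scraper: the function names, the `User-Agent` string, and the `limit` value are assumptions chosen for illustration, and only a handful of fields are kept.

```python
import json
import urllib.request

SUBREDDITS = ["UCSD", "UCLA", "Berkeley", "UCSantaBarbara", "UCI",
              "UCDavis", "UCSC", "ucr", "ucmerced"]
LISTINGS = ["top", "hot", "new"]


def listing_url(subreddit: str, listing: str, limit: int = 100) -> str:
    """Build the public JSON URL for one subreddit listing."""
    return f"https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}"


def parse_listing(payload: dict, listing: str) -> list[dict]:
    """Flatten a listing response into one dict per post."""
    rows = []
    for child in payload["data"]["children"]:
        post = child["data"]
        rows.append({
            "subreddit": post.get("subreddit"),
            "listing": listing,
            "title": post.get("title"),
            "upvotes": post.get("ups"),
            "upvote_ratio": post.get("upvote_ratio"),
            "num_comments": post.get("num_comments"),
            "created_utc": post.get("created_utc"),
            "subreddit_subscribers": post.get("subreddit_subscribers"),
        })
    return rows


def fetch_listing(subreddit: str, listing: str, limit: int = 100) -> list[dict]:
    """Download and parse one listing (requires network access)."""
    request = urllib.request.Request(
        listing_url(subreddit, listing, limit),
        # Hypothetical identifier; replace with your own descriptive string.
        headers={"User-Agent": "uc-upvote-study/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_listing(json.load(response), listing)


# Offline demo using a hand-made payload shaped like a listing response.
sample = {"data": {"children": [{"data": {
    "subreddit": "berkeley", "title": "Sirens sounds in Berkeley",
    "ups": 29, "upvote_ratio": 0.91, "num_comments": 12,
    "created_utc": 1763008000.0, "subreddit_subscribers": 168666,
}}]}}
rows = parse_listing(sample, "top")
print(rows[0]["title"], rows[0]["upvotes"])
```

Looping `fetch_listing` over `SUBREDDITS` and `LISTINGS` would give one batch of rows per campus and listing. The `num_comments` value in the demo payload is made up; the other values mirror a row from our data.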

### Dataset Structure
| Category | Variable | Description |
|----------|----------|-------------|
| **Post Content** | `title` | Text of the post title |
| | `title_length` | Length of the title (engineered feature) |
| | `sentiment` | Sentiment score of the title |
| **Engagement Metrics** | `upvotes` | Number of upvotes |
| | `upvote_ratio` | Proportion of upvotes / total votes |
| | `num_comments` | Number of comments |
| **Post Metadata** | `listing` | Listing the post was scraped from (top/hot/new) |
| | `created_utc` | Timestamp of post |
| | `hour` | Hour of day post was made (engineered feature) |
| **Content Type** | `has_media` | Whether the post contains media (0/1) |
| | `is_video` | Whether the post is a video (0/1) |
| | `over_18` | NSFW flag (0/1) |
| | `link_flair_text` | Category assigned by the subreddit |
| **Author Information** | `author` | Username of the creator |
| | `author_premium` | Whether the author is a Reddit Premium user |
| | `author_karma`| Total karma of the author |
| **Subreddit Information** | `subreddit` | Name of the UC subreddit |
| | `subreddit_subscribers` | Number of subscribers to that subreddit |
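
The two engineered features marked in the table, `title_length` and `hour`, can be derived directly from `title` and `created_utc`. The sketch below shows one way to do it under two assumptions that the table does not settle: title length is counted in characters (not words), and the hour is taken in UTC. The helper name `add_engineered_features` is hypothetical.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with `title_length` and `hour` columns added."""
    out = df.copy()
    # Assumption: length is measured in characters; missing titles count as 0.
    out["title_length"] = out["title"].fillna("").str.len()
    # created_utc is a Unix timestamp in seconds.
    created = pd.to_datetime(out["created_utc"], unit="s", utc=True)
    # Assumption: hour of day in UTC rather than campus-local time.
    out["hour"] = created.dt.hour
    return out


demo = pd.DataFrame({
    "title": ["Sirens sounds in Berkeley", None],
    "created_utc": [1763005000.0, 1763008000.0],
})
features = add_engineered_features(demo)
print(features[["title_length", "hour"]])
```

Since every campus is in California, converting `created` with `dt.tz_convert("America/Los_Angeles")` before taking the hour would give posting times that are easier to interpret.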


### Potential Biases:
Our dataset is subject to several sources of bias. First, scraping from the top listing overrepresents posts that are already performing well, which may distort our picture of typical engagement. Second, subreddit sizes vary widely (for example, UC Berkeley has far more traffic than UC Merced), so posts naturally receive different levels of exposure across campuses. Third, the timing of data collection affects upvote counts, since recently posted content has had less time to accumulate engagement. Finally, we lack historical data for older posts, which limits our ability to analyze long-term trends or normalize upvote counts over time.

Despite these limitations, the variables listed above support both behavioral and structural analysis of community engagement.
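
One way to soften the subreddit-size bias is to express upvotes relative to the number of subscribers. The sketch below is only an illustration of that idea and not a step we have applied: the helper `upvotes_per_1k_subscribers` and the `upvotes_per_1k` column are hypothetical, and subscriber count is an imperfect stand-in for actual traffic.

```python
import pandas as pd


def upvotes_per_1k_subscribers(df: pd.DataFrame) -> pd.Series:
    """Scale raw upvotes by subreddit size (upvotes per 1,000 subscribers)."""
    return 1000 * df["upvotes"] / df["subreddit_subscribers"]


# Two Berkeley rows from our data; 168,666 is the subscriber count they report.
demo = pd.DataFrame({
    "subreddit": ["Berkeley", "Berkeley"],
    "upvotes": [52, 29],
    "subreddit_subscribers": [168666, 168666],
})
demo["upvotes_per_1k"] = upvotes_per_1k_subscribers(demo)
print(demo[["upvotes", "upvotes_per_1k"]])
```

This does nothing for the timing bias: correcting for post age would need the scrape time, which is not one of our columns.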


In [7]:
import os

import pandas as pd

folder_path = "reddit data"  # folder holding one scraped CSV per subreddit
csv_files = [f for f in os.listdir(folder_path) if f.endswith(".csv")]

df_list = []

for file in csv_files:
    path = os.path.join(folder_path, file)
    try:
        # Malformed rows are skipped instead of aborting the whole load
        temp = pd.read_csv(path, on_bad_lines='skip')
        # Tag each row with its source file name (e.g. "Berkeley")
        temp["campus"] = file.replace(".csv", "")
        df_list.append(temp)
        print(f"Loaded {file} successfully: {temp.shape}")
    except Exception as e:
        print(f"ERROR loading {file}: {e}")

# Stack the per-campus frames into a single DataFrame
df_all = pd.concat(df_list, ignore_index=True)
df_all.head()

Loaded Berkeley.csv successfully: (106, 24)
Loaded UCDavis.csv successfully: (103, 24)
Loaded UCI.csv successfully: (104, 24)
Loaded ucla.csv successfully: (111, 24)
Loaded ucmerced.csv successfully: (101, 24)
Loaded ucr.csv successfully: (105, 24)
Loaded UCSantaBarbara.csv successfully: (102, 24)
Loaded UCSC.csv successfully: (101, 24)
Loaded UCSD.csv successfully: (108, 24)


| Unnamed: 0 | subreddit | listing | title | author | upvotes | post_text | upvote_ratio | total_awards_received | score | edited | ... | domain | link_flair_text | created_utc | subreddit_subscribers | author_premium | stickied | has_media | permalink | url | campus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berkeley | top | Can I table on Sproul just for fun? | Junior_Liberator | 52 | Hey, quick question, does anyone know if indiv... | 0.93 | 0 | 52 | False | ... | self.berkeley | Events/Organizations | 1763005000.0 | 168666 | False | False | False | https://www.reddit.com/r/berkeley/comments/1ov... | https://www.reddit.com/r/berkeley/comments/1ov... | Berkeley |
| 1 | Berkeley | top | Sirens sounds in Berkeley | Entire-Vehicle-4559 | 29 | What is happening?! Why I can hear the police ... | 0.91 | 0 | 29 | False | ... | self.berkeley | News | 1763008000.0 | 168666 | False | False | False | https://www.reddit.com/r/berkeley/comments/1ov... | https://www.reddit.com/r/berkeley/comments/1ov... | Berkeley |
| 2 | Berkeley | top | All this band merch... for no one on campus to... | AdSlight4264 | 22 | | 0.79 | 0 | 22 | False | ... | i.redd.it | Other | 1762982000.0 | 168666 | False | False | True | https://www.reddit.com/r/berkeley/comments/1ov... | https://i.redd.it/vgupzxmw6w0g1.jpeg | Berkeley |
| 3 | Berkeley | top | uc berkeley meme sticker rally at Anime Destin... | NyamenRamen | 21 | A group of student artists tabling at Anime De... | 1.0 | 0 | 21 | False | ... | reddit.com | University | 1763004000.0 | 168666 | False | False | False | https://www.reddit.com/r/berkeley/comments/1ov... | https://www.reddit.com/gallery/1ovqjpr | Berkeley |
| 4 | Berkeley | top | Police: Man with sword arrested after cutting ... | BerkeleyScanner | 18 | | 0.96 | 0 | 18 | False | ... | berkeleyscanner.com | News | 1763013000.0 | 168666 | True | False | True | https://www.reddit.com/r/berkeley/comments/1ov... | https://www.berkeleyscanner.com/2025/11/13/uc-... | Berkeley |


## 3. Exploratory Data Analysis (Part 1)

Before building any models, we first need to understand the structure of the dataset and look for patterns in upvotes across UC subreddits. This helps us spot early trends relevant to our research questions.

Some basic questions we had in mind are:

- How many posts did we collect across all UC campuses?
- Which subreddits have the most activity?
- How are upvotes distributed?
- Do different UC campuses show different upvote behaviors?

In [None]:
# We have 941 rows of data scraped from the combined subreddits
df_all.shape

(941, 24)

In [17]:
# And roughly 100 rows of data per campus (between 101 and 111)!
df_all['campus'].value_counts()

campus
ucla              111
UCSD              108
Berkeley          106
ucr               105
UCI               104
UCDavis           103
UCSantaBarbara    102
ucmerced          101
UCSC              101
Name: count, dtype: int64
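
The two remaining questions from the list above, how upvotes are distributed and whether campuses differ, could be approached with a grouped summary like the one sketched here. This is a possible next step, not output from our notebook: `summarize_upvotes` is a hypothetical helper, and the demo frame holds only the five Berkeley rows shown earlier, so that it runs on its own.

```python
import numpy as np
import pandas as pd


def summarize_upvotes(df: pd.DataFrame) -> pd.DataFrame:
    """Per-campus count, median, mean, and max of upvotes, sorted by median."""
    return (
        df.groupby("campus")["upvotes"]
        .agg(["count", "median", "mean", "max"])
        .sort_values("median", ascending=False)
    )


# Stand-in for df_all: the upvote counts of the five Berkeley rows shown above.
demo = pd.DataFrame({
    "campus": ["Berkeley"] * 5,
    "upvotes": [52, 29, 22, 21, 18],
})
summary = summarize_upvotes(demo)
print(summary)

# Upvote counts are often right-skewed, so a log scale can be easier to plot.
log_upvotes = np.log1p(demo["upvotes"])
```

Running `summarize_upvotes(df_all)` would give one row per campus. A mean well above the median would suggest that a few viral posts dominate that subreddit.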