# UC Subreddit Upvote Prediction
### Analyzing Post Features and Engagement Across the UC Subreddit System


## 1. Introduction

Reddit is one of the largest discussion and community-building platforms for college communities, including those of the University of California (UC) campuses. Each UC campus's subreddit has its own culture, trends, and content patterns. These communities give students a place to voice their opinions or share a funny story!

Take this as an example:

<img src="images\Screenshot 2025-11-12 225912.png" width="800">

This project analyzes Reddit posts from various UC subreddits to predict how many upvotes a post will earn based on its content, metadata, and engagement features. We found this interesting because comparing individual UC subreddits reveals differences in community culture.

There is also a practical use. If you were a club trying to get people to come to your event, you might ask:
- What factors influence whether your post reaches the front page?
- How could you get the most engagement out of your post?

### **Practical Terminology**

Common Reddit terms and their descriptions:

| Term | Meaning | Relevance to the Project |
|------|---------|---------------------------|
| **Upvote** | A positive (+1) vote indicating support | Our main target variable |
| **Downvote** | A negative (−1) vote indicating disagreement | Lowers the score, but is not always reported |
| **Score** | Upvotes − Downvotes | Sometimes differs from "upvotes" field |
| **Upvote Ratio** | % of total votes that are upvotes | Proxy for sentiment/approval |
| **Karma** | A user’s total upvote score on Reddit | Used as a predictor of credibility |
| **OP** | "Original Poster": the person who created the post | We track OP karma & account age |
| **Flair** | A label for the post ("Funny", "School", etc.) | Helps to categorize content |
| **Mods** | Subreddit moderators | They influence which posts stay or get removed |
| **NSFW** | “Not Safe For Work” content flag | Included as a binary feature |
| **Hot** | A listing sorted by engagement + time decay | Affects which posts we scraped |
| **Top** | A listing sorted purely by score | May bias dataset toward high-engagement posts |
| **New** | A listing sorted by recent posts | Influences visibility + upvotes |
| **Shitpost** | Low-effort or joke post (usually funny) | Often gets lots of votes in college subreddits |
| **Copypasta** | A repeated block of text/meme | Signals humor, may influence upvote behavior |
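
The **Score** and **Upvote Ratio** definitions above can be combined to back out approximate vote counts. The sketch below assumes that score = upvotes − downvotes and ratio = upvotes / (upvotes + downvotes) hold exactly. `estimate_votes` is a hypothetical helper, not part of the project code, and because Reddit reports a rounded ratio and may fuzz vote counts, the results should be read as rough estimates only.

```python
from typing import Tuple


def estimate_votes(score: int, upvote_ratio: float) -> Tuple[float, float]:
    """Estimate (upvotes, downvotes) from a post's score and upvote ratio.

    Assumes score = u - d and upvote_ratio = u / (u + d), which gives
    total votes = score / (2 * upvote_ratio - 1).
    """
    denom = 2 * upvote_ratio - 1
    if abs(denom) < 1e-9:
        # A ratio of exactly 0.5 means u == d, so the score carries no
        # information about the total number of votes.
        raise ValueError("vote counts are undetermined when upvote_ratio is 0.5")
    total_votes = score / denom
    return upvote_ratio * total_votes, (1 - upvote_ratio) * total_votes


# Illustrative values only (not taken from the dataset).
ups, downs = estimate_votes(score=80, upvote_ratio=0.9)
print(round(ups), round(downs))
```

A post with a score near zero and a ratio near 0.5 gives unstable estimates, which is one reason to treat `upvote_ratio` as a sentiment proxy rather than a source of exact counts.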


### **Research Questions**

**RQ1 (Prediction):**  
Can we predict the number of upvotes a post receives based on title features, metadata, and subreddit characteristics?

**RQ2 (Explanation):**  
Which factors such as title length, sentiment, posting time, media presence, and subreddit size have the strongest influence on upvote counts?

The goal is to combine exploratory analysis and machine learning to uncover meaningful patterns in campus-level Reddit engagement.

## 2. Study Design & Data Description

We scraped posts from the *top*, *hot*, and *new* listings for nine UC subreddits using Reddit’s public JSON API.
Because these listings favor high-visibility posts, the dataset may be biased toward successful or recent posts.
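
The scraper itself is not shown here, so the following is only a sketch of how one listing could be fetched and flattened with the standard library. It assumes the public endpoint pattern `https://www.reddit.com/r/<subreddit>/<listing>.json`. The function names, the `USER_AGENT` string, and the mapping of Reddit's `ups` field to our `upvotes` column are illustrative assumptions, not the project's actual code.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical identifier; Reddit asks clients to send a descriptive User-Agent.
USER_AGENT = "uc-subreddit-study/0.1 (illustrative script)"


def build_listing_url(subreddit, listing="top", limit=100):
    """Build the public JSON URL for one listing of one subreddit."""
    if listing not in {"top", "hot", "new"}:
        raise ValueError(f"unsupported listing: {listing}")
    query = urlencode({"limit": limit, "raw_json": 1})
    return f"https://www.reddit.com/r/{subreddit}/{listing}.json?{query}"


def parse_listing(payload, listing):
    """Flatten a listing response into one dict per post."""
    rows = []
    for child in payload.get("data", {}).get("children", []):
        post = child.get("data", {})
        rows.append({
            "title": post.get("title"),
            "upvotes": post.get("ups"),  # assumed mapping to our `upvotes` column
            "upvote_ratio": post.get("upvote_ratio"),
            "num_comments": post.get("num_comments"),
            "created_utc": post.get("created_utc"),
            "subreddit": post.get("subreddit"),
            "listing": listing,  # remember which listing the post came from
        })
    return rows


def fetch_listing(subreddit, listing="top", limit=100):
    """Download one listing and return its parsed rows (requires network access)."""
    request = urllib.request.Request(
        build_listing_url(subreddit, listing, limit),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return parse_listing(payload, listing)
```

Calling `fetch_listing("UCSD", "hot")` would return a list of dictionaries ready to pass to `pd.DataFrame`. Keeping the parsing step separate from the network call makes it possible to check the field mapping against a saved response.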

### Subreddits included:
- r/UCSD  
- r/UCLA  
- r/Berkeley  
- r/UCSantaBarbara  
- r/UCI  
- r/UCDavis  
- r/UCSC  
- r/ucr  
- r/ucmerced  

### Dataset Structure
| Category | Variable | Description |
|----------|----------|-------------|
| **Post Content** | `title` | Text of the post title |
| | `title_length` | Length of the title (engineered feature) |
| | `sentiment` | Sentiment score of the title |
| **Engagement Metrics** | `upvotes` | Number of upvotes |
| | `upvote_ratio` | Upvotes as a proportion of total votes |
| | `num_comments` | Number of comments |
| **Post Metadata** | `listing` | Listing the post was scraped from (top/hot/new) |
| | `created_utc` | Timestamp of post |
| | `hour` | Hour of day post was made (engineered feature) |
| **Content Type** | `has_media` | Whether the post contains media (0/1) |
| | `is_video` | Whether the post is a video (0/1) |
| | `over_18` | NSFW flag (0/1) |
| | `link_flair_text` | Category assigned by the subreddit |
| **Author Information** | `author` | Username of the creator |
| | `author_premium` | Whether the author is a Reddit Premium user |
| | `author_karma` | Total karma of the author |
| **Subreddit Information** | `subreddit` | Name of the UC subreddit |
| | `subreddit_subscribers` | Number of subscribers to that subreddit |
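
The two engineered columns in the table, `title_length` and `hour`, could be derived roughly as follows. This is a sketch under stated assumptions: title length is counted in characters, and the hour is taken in UTC. The table does not say whether characters or words, or UTC or Pacific time, were actually used, and `engineer_features` is a hypothetical helper.

```python
from datetime import datetime, timezone


def engineer_features(post):
    """Return a copy of `post` with `title_length` and `hour` added."""
    title = post.get("title") or ""
    # `created_utc` is a Unix timestamp in seconds.
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return {
        **post,
        "title_length": len(title),  # assumption: characters, not words
        "hour": created.hour,        # assumption: hour of day in UTC (0-23)
    }


# Made-up example record, not taken from the dataset.
example = {"title": "Is Geisel open during finals week?", "created_utc": 1_700_000_000}
print(engineer_features(example))
```

Since all UC campuses are in the Pacific time zone, converting to local time before extracting the hour may make the feature easier to interpret.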


### Potential Biases:
- Overrepresentation of high-performing posts (from "top" listing)
- Different subreddit sizes (UC Berkeley has more traffic than UC Merced)
- Time of scraping affects upvote count for new posts
- Missing historical data for older posts

Taken together, the variables in the dataset table above support both behavioral and structural analysis of community engagement.
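
Two of the biases listed above are easy to inspect directly. The sketch below counts how many posts came from each listing and rescales upvotes by subreddit size. Both helpers are hypothetical, the per-1,000-subscribers normalization is one possible choice rather than a method the project is known to use, and the records are made-up values for illustration.

```python
from collections import Counter


def listing_counts(posts):
    """Count posts per source listing to see how dominant 'top' is."""
    return Counter(post["listing"] for post in posts)


def upvotes_per_1k_subscribers(upvotes, subscribers):
    """Scale upvotes by subreddit size so campuses are more comparable."""
    if subscribers <= 0:
        raise ValueError("subscribers must be positive")
    return 1000 * upvotes / subscribers


# Made-up records for illustration only.
posts = [
    {"listing": "top", "upvotes": 50, "subreddit_subscribers": 10_000},
    {"listing": "top", "upvotes": 50, "subreddit_subscribers": 100_000},
    {"listing": "new", "upvotes": 2, "subreddit_subscribers": 10_000},
]

print(listing_counts(posts))
for post in posts:
    rate = upvotes_per_1k_subscribers(post["upvotes"], post["subreddit_subscribers"])
    print(post["listing"], rate)
```

The same raw upvote count represents very different reach in a small and a large subreddit, which is why `subreddit_subscribers` is kept as a feature.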


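```python
from pathlib import Path

import numpy as np
import pandas as pd

# Path to the combined master dataset, relative to the notebook's working directory
DATA_PATH = Path("UC_subreddits_posts.csv")

# Load the dataset; pandas raises FileNotFoundError if the CSV is not at this path
df = pd.read_csv(DATA_PATH)

# Display the first few rows
df.head()
```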


FileNotFoundError: [Errno 2] No such file or directory: 'UC_subreddits_posts.csv'