# 01 — Data Collection: The Sims 4
**Goal:** Pull player discussions from Reddit and EA Answers HQ for sentiment & topic analysis.  
**Outputs:**
- `data/raw/reddit_sims4_posts.csv` (and optionally `..._comments.csv`)
- `data/raw/ea_forum_threads.csv`
- `data/raw/sims4.db` (SQLite database with `posts` and `comments` tables)

**Provenance:** Collected with PRAW (Reddit API) and requests/BeautifulSoup (forums).

In [7]:
import os
from pathlib import Path
import pandas as pd
import importlib


# Project paths
DATA_RAW = Path("../data/raw")
DATA_RAW.mkdir(parents=True, exist_ok=True)

# Display options (handy in notebooks)
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", 50)

In [8]:
from dotenv import load_dotenv
load_dotenv()

RID  = os.getenv("REDDIT_ID")
RSEC = os.getenv("REDDIT_SECRET")
RUA  = os.getenv("REDDIT_USER_AGENT")

assert all([RID, RSEC, RUA]), "Missing one or more Reddit creds. Check your .env!"
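
The assertion above fails without saying which credential is absent. A minimal sketch of a more specific check follows; the helper name `missing_env_vars` and the fake environment are illustrative assumptions, not part of the project code.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


REQUIRED = ["REDDIT_ID", "REDDIT_SECRET", "REDDIT_USER_AGENT"]

# Fake environment so the example does not depend on a real .env file
fake_env = {
    "REDDIT_ID": "abc",
    "REDDIT_SECRET": "",
    "REDDIT_USER_AGENT": "sims4-research/0.1",
}
missing = missing_env_vars(REQUIRED, fake_env)
print("Missing:", missing)
```

In the notebook itself, calling `missing_env_vars(REQUIRED)` after `load_dotenv()` would check the real environment.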

In [9]:
import sys
from pathlib import Path

# This notebook lives in: <project_root>/notebooks/
ROOT = Path("..").resolve()   # <-- parent of notebooks = project root
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print("CWD:", Path.cwd())
print("On sys.path?", str(ROOT) in sys.path)

CWD: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/notebooks
On sys.path? True


In [10]:
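# The project root is already on sys.path (see the previous cell), so the
# `src` package imports directly; no extra "../src" entry is needed.
from src.collect_data import collect_reddit_posts, collect_comments_for_posts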

In [11]:
df_sims4_posts = collect_reddit_posts(RID, RSEC, RUA, subreddit_name="Sims4", limit=750, time_filter="year")
df_thesims_posts = collect_reddit_posts(RID, RSEC, RUA, subreddit_name="thesims", limit=750, time_filter="year")

In [12]:
sample_ids = df_thesims_posts["id"].head(250).tolist()
df_thesims_comments = collect_comments_for_posts(RID, RSEC, RUA, sample_ids)

In [13]:
sample_ids = df_sims4_posts["id"].head(250).tolist()
df_sims4_comments = collect_comments_for_posts(RID, RSEC, RUA, sample_ids)

In [14]:
print("Sims4 subreddit:")
display(df_sims4_comments.head(5))
display(df_sims4_posts.head(5))

print("\n\nTheSims subreddit:")
display(df_thesims_posts.head(5))
# Last expression of the cell, so the notebook echoes it as (comment_count, None)
len(df_thesims_comments), display(df_thesims_comments.head(5))

Sims4 subreddit:


Unnamed: 0,post_id,comment_id,created_utc,author,body,score,parent_permalink
0,1mupzk3,n9kjebi,1755627000.0,creeativerex,# How your posted questions should start:\n\n Platform: PC\n Mods: Yes\n Game version: 1.116.202.1030\n When I opened my game all my icons were messed up.....,1,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/
1,1mupzk3,n9km6of,1755627000.0,Better-Gas-2295,"Playform: PC\n\nMods: Yes\n\nVersion: 1.117.227.1030.\n\nWhen I open a save it doesn’t let me play with the characters, it just kicks me back to the world selection screen",46,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/
2,1mupzk3,n9l9sqo,1755634000.0,Muted-Mongoose-5043,"Mac, yes mods and cc,\nVersion 1.117.221.1220 \nEverytime I go to select a sim to play it loads and then brings me back to the world select screen. I can move my sims via the visit option but can’...",29,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/
3,1mupzk3,n9l0dc5,1755632000.0,Sacrefice342,Platform: Pc\n\nMods: Yes\n\nGame version: 1.117.227.1030 DX 11\n\nWhen i load my household it returns me back to the World Selection Screen also the BGM is super delayed after loading in\n\nResul...,23,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/
4,1mupzk3,n9ldt8m,1755635000.0,Less_Ad_1194,`Which platform? PC`\n\n`Any mods or cc? Yes`\n\n`Game version 1.117.221.1020`\n\n`Description: Yeah whenever I try to select my family I just get thrown back to the world select menu :/`,22,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/


Unnamed: 0,id,created_utc,created_date,author,title,body,score,num_comments,permalink,subreddit,mode
0,1mupzk3,1755627000.0,2025-08-19 18:02:28,creeativerex,Troubleshooting Thread — Bugs? Errors? Mod issues? EA app issues? Post about them here! Update 8/19/2025 [PC: 1.117.221.1020 / Mac: 1.117.221.1220 / Console: 2.18]\nTroubleshooting thread,"**Please read** the entire thread if you’re not familiar with this help section. If you have game issues, mod issues, App or console issues, please post them here via comment to get help instead o...",66,1344,https://reddit.com/r/Sims4/comments/1mupzk3/troubleshooting_thread_bugs_errors_mod_issues_ea/,Sims4,hot
1,1mhln76,1754332000.0,2025-08-04 18:32:47,spyder-baby,The Which Pack Thread. Ask for pack recommendations here!,[Current sale on EA app](https://preview.redd.it/3kdpicavp1hf1.png?width=1272&format=png&auto=webp&s=3fb09c5ce4cee6414bfbb9cd5117150b71912659)\n\nThis is a dedicated thread for asking which pack(s...,12,23,https://reddit.com/r/Sims4/comments/1mhln76/the_which_pack_thread_ask_for_pack/,Sims4,hot
2,1n62l6z,1756765000.0,2025-09-01 22:21:12,Thiredistia,Why does Candy Behr look significantly worse in the actual gameplay compared to her promo art?,,2973,86,https://reddit.com/r/Sims4/comments/1n62l6z/why_does_candy_behr_look_significantly_worse_in/,Sims4,hot
3,1n5sjmm,1756742000.0,2025-09-01 15:58:31,Sassysubtext,"How it started, and how it turned out","Edit: Okay so this blew up. I posted and then went back to playing, and when I came back I was shook! SO, heres some answers to some questions I saw: \n1. Yes you can now find it on the gallery. ...",3289,72,https://reddit.com/r/Sims4/comments/1n5sjmm/how_it_started_and_how_it_turned_out/,Sims4,hot
4,1n5zyo0,1756759000.0,2025-09-01 20:35:16,LPhamster,"Okay y’all, I took a break from the sweets to make a savory: I present to you: DIM SUM PLACE!",It’s a 6 tiny unit apartment complex. I wanted to take a break from all the sweets and make a savory. I hope you enjoy! It’s all functional. Including the soy sauce pool and a plate of greens gard...,1267,39,https://reddit.com/r/Sims4/comments/1n5zyo0/okay_yall_i_took_a_break_from_the_sweets_to_make/,Sims4,hot




TheSims subreddit:


Unnamed: 0,id,created_utc,created_date,author,title,body,score,num_comments,permalink,subreddit,mode
0,1n0sqf5,1756231000.0,2025-08-26 17:49:07,TheSimsOfficial,Addressing Community Concerns with For Rent,"Sul Sul Simmers, We’ve received questions from players with concerns about the For Rent Expansion Pack and save file issues in The Sims 4. Today we’re sharing a detailed update on this, but here’s...",291,106,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/,thesims,hot
1,1mmp0ki,1754848000.0,2025-08-10 17:47:54,baar-ur,Bi-Weekly Build Challenge | Week 173: CAS - Kids,"The Bi-Weekly Build Challenge is led by community moderator [u/NoButterOnMyBread](https://www.reddit.com/user/NoButterOnMyBread/) and co-host, community volunteer [u/baar-ur](https://www.reddit.co...",3,2,https://reddit.com/r/thesims/comments/1mmp0ki/biweekly_build_challenge_week_173_cas_kids/,thesims,hot
2,1n62iev,1756765000.0,2025-09-01 22:17:54,AlyxStarlix,Uhm.. What the hell happened?,"So, I don't know if these tags are right. Just tried to age up (child to teen) my sim, he froze, wouldn't respond to commands. Tried to get him to blow out candle, everyone is celebrating over and...",158,30,https://reddit.com/r/thesims/comments/1n62iev/uhm_what_the_hell_happened/,thesims,hot
3,1n5z13i,1756757000.0,2025-09-01 19:58:57,barefootintheopenair,"Bless Plumbella forever, but what content do I watch now?","She's literally the only simmer I enjoy so far. I can't always play so I like to watch and live vicariously. I love her humor and vulnerabilities and her play styles, especially with builds or t...",220,96,https://reddit.com/r/thesims/comments/1n5z13i/bless_plumbella_forever_but_what_content_do_i/,thesims,hot
4,1n5d4k9,1756694000.0,2025-09-01 02:37:03,Relative-Handle-7677,Stop the ads please?!,Ok they have got to fix this asap. This is getting so annoying. It’s getting to be like those mobile games that have an ad like every 2-3mins whilst you’re playing. Some of them are unplayable bec...,5373,120,https://reddit.com/r/thesims/comments/1n5d4k9/stop_the_ads_please/,thesims,hot


Unnamed: 0,post_id,comment_id,created_utc,author,body,score,parent_permalink
0,1n0sqf5,nasx4k2,1756231000.0,AutoModerator,"**Archive Record:** Posted by u/TheSimsOfficial\n\n*This comment is for moderation archival purposes and will remain even if the post is deleted.*\n\n\n*I am a bot, and this action was performed a...",1,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/
1,1n0sqf5,nat80rw,1756234000.0,shoalhavenheads,"This is the big thing: ""We’re unable to recreate every potential Sims scenario across the different combinations of packs, playstyles, and devices before launch.""\n\nThere are 100 packs. So that's...",1148,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/
2,1n0sqf5,natxbf8,1756241000.0,Drorbitaldeathray,"The Sims 4 isn't a video game, it's 130 Flash plugins in a trench coat.",457,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/
3,1n0sqf5,nav4fg1,1756255000.0,Jessiebanana,"It’s insane to me that they are still selling a pack they can’t guarantee will function or not actively harm your game. It’s wild. I would love For Rent, but I will not get it as is.",116,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/
4,1n0sqf5,nauepy9,1756246000.0,Scott43206,"After 6 days of troubleshooting mods (I only use 3) and CC, and updating the only one impacted, my game still didn't allow me to save builds to my library or share them to the gallery.\n\nI notice...",61,https://reddit.com/r/thesims/comments/1n0sqf5/addressing_community_concerns_with_for_rent/


(3967, None)

In [15]:
from pathlib import Path
import sqlite3

DB_DIR = Path("../data/raw")
DB_DIR.mkdir(parents=True, exist_ok=True)
DB_PATH = DB_DIR / "sims4.db"

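def get_conn(db_path=DB_PATH):
    """Open a SQLite connection with foreign-key enforcement enabled."""
    conn = sqlite3.connect(db_path)
    # SQLite leaves foreign keys off by default; enable them per connection
    conn.execute("PRAGMA foreign_keys = ON;")
    return conn

# Note: the next cell imports `get_conn` from `src.collect_data`, which
# replaces this local definition.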
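
To see what the `PRAGMA foreign_keys = ON` line buys, here is a small sketch against an in-memory database. The two-table schema and the helper name `open_conn` are assumptions made for illustration; the real schema lives in `src/collect_data.py` and is not shown in this notebook. The sketch also wraps the connection in `contextlib.closing`, because using a `sqlite3` connection directly in a `with` statement only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing


def open_conn(db_path=":memory:"):
    """Hypothetical stand-in for get_conn that defaults to an in-memory database."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON;")
    return conn


fk_error = None
with closing(open_conn()) as conn:
    # Assumed minimal schema: every comment must point at an existing post
    conn.execute("CREATE TABLE posts (post_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE comments ("
        "comment_id TEXT PRIMARY KEY, "
        "post_id TEXT NOT NULL REFERENCES posts(post_id))"
    )
    conn.execute("INSERT INTO posts VALUES ('p1')")
    conn.execute("INSERT INTO comments VALUES ('c1', 'p1')")
    try:
        # This comment references a post that does not exist
        conn.execute("INSERT INTO comments VALUES ('c2', 'missing')")
    except sqlite3.IntegrityError as exc:
        fk_error = str(exc)
    stored = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]

print("Rejected orphan comment:", fk_error)
print("Comments stored:", stored)
```

Without the pragma, the orphan row would be accepted silently, which is why the setting has to be applied on every new connection.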

In [16]:
from src.collect_data import (
    init_db, get_conn,
    harvest_posts, insert_posts_df,
    get_comment_candidates, collect_comments_for_candidates
)

init_db()

In [28]:
SUBS = ["Sims4", "thesims"]
posts_big = harvest_posts(
    subs=SUBS,
    modes=["top", "hot", "new"],
    time_filters=["day", "week", "month", "year", "all"],
    limit_per=1000,
    sleep_between=0.2,
)
print("New harvested posts:", len(posts_big))

New harvested posts: 194
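
For a sense of scale, the sweep can be enumerated ahead of time. This is a sketch under one assumption: that `harvest_posts` applies the time filters only to the `top` mode, as the comment in the parameters cell further down indicates. The helper `sweep_plan` is hypothetical and is not the project's implementation.

```python
from itertools import product

subs = ["Sims4", "thesims"]
modes = ["top", "hot", "new"]
time_filters = ["day", "week", "month", "year", "all"]
limit_per = 1000


def sweep_plan(subs, modes, time_filters):
    """List every (subreddit, mode, time_filter) request, with None where no filter applies."""
    plan = []
    for sub, mode in product(subs, modes):
        if mode == "top":
            plan.extend((sub, mode, tf) for tf in time_filters)
        else:
            plan.append((sub, mode, None))
    return plan


plan = sweep_plan(subs, modes, time_filters)
print(len(plan), "listing requests")
print(len(plan) * limit_per, "posts at most, before de-duplication")
```

The same post often appears in several listings (for example in both `hot` and `top`/`week`), so the number of new posts per sweep can be far below this ceiling.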


In [18]:
inserted = insert_posts_df(posts_big, batch_size=2000)
print("Inserted posts:", inserted)

Inserted posts: 0
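
An insert count of zero is what an idempotent insert reports when every harvested ID is already stored. The sketch below shows one way to get that behavior with `INSERT OR IGNORE` and a primary key. Whether `insert_posts_df` works this way is an assumption, since its source is not shown here; the table layout, the helper name `insert_posts`, and the sample rows are all hypothetical.

```python
import sqlite3


def insert_posts(conn, rows):
    """Insert (post_id, title) rows, skip IDs already stored, return rows actually added."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, title) VALUES (?, ?)", rows
    )
    conn.commit()
    # total_changes counts only rows that were really written
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT PRIMARY KEY, title TEXT)")

rows = [("a1", "first post"), ("b2", "second post")]
first_pass = insert_posts(conn, rows)
second_pass = insert_posts(conn, rows)  # same IDs again
third_pass = insert_posts(conn, rows + [("c3", "third post")])
conn.close()

print(first_pass, second_pass, third_pass)
```

Re-running a harvest cell is then safe: duplicates are ignored instead of raising or creating repeated rows.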


In [29]:
import importlib
import src.collect_data as cd

importlib.reload(cd)

from src.collect_data import (
    get_db_counts, print_delta
)

In [50]:
# --- Parameters for scaling ---
SUBS = ["Sims4", "thesims"]            # subreddits to pull from
MODES = ["top", "hot", "new"]          # retrieval modes
TIME_FILTERS = ["day", "week", "month", "year", "all"]  # only applies to "top"
LIMIT_PER = 1000                       # max per (sub, mode, time_filter) combo
SLEEP_BETWEEN = 0.4                    # polite delay between API calls
BATCH_INSERT = 2000                    # how many rows to insert at once into SQLite

# Counts before
p1, c1 = get_db_counts()

# Choose candidates that currently have zero comments stored
candidates = get_comment_candidates(limit=5000)
print("Comment candidates:", len(candidates))

# Collect/insert in batches (this returns attempted count; we will still compute actual via DB)
_ = collect_comments_for_candidates(
    post_ids=candidates,
    batch_posts=200,
    sleep_between=SLEEP_BETWEEN,
    insert_batch_size=BATCH_INSERT
)

# Counts after
p2, c2 = get_db_counts()
print_delta(p1, c1, p2, c2, label="COMMENTS UPSERT")

DB file: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/data/raw/sims4.db
Posts: 6,574 | Comments: 591,584
Comment candidates: 36
DB file: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/data/raw/sims4.db
Posts: 6,574 | Comments: 591,584
[2025-09-02 19:01:03Z] COMMENTS UPSERT → added posts: 0 (total: 6,574) | added comments: 0 (total: 591,584)
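
The candidate selection ("posts that currently have zero comments stored") can be expressed as an anti-join. This is a sketch of that idea, not the body of `get_comment_candidates`, which is not shown in this notebook. The column names follow the sanity-check query later in the notebook; the sample rows are made up.

```python
import sqlite3

# Posts with no matching row in comments
CANDIDATE_SQL = """
SELECT p.post_id
FROM posts p
LEFT JOIN comments c ON c.post_id = p.post_id
WHERE c.comment_id IS NULL
ORDER BY p.post_id
LIMIT ?;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (post_id TEXT PRIMARY KEY);
    CREATE TABLE comments (
        comment_id TEXT PRIMARY KEY,
        post_id TEXT REFERENCES posts(post_id)
    );
    INSERT INTO posts VALUES ('a'), ('b'), ('c');
    INSERT INTO comments VALUES ('c1', 'a');
    """
)
candidates = [row[0] for row in conn.execute(CANDIDATE_SQL, (5000,))]
conn.close()

print(candidates)
```

Under a query like this, a post that genuinely has no comments stays a candidate on every run, which would be consistent with a non-zero candidate count alongside zero added comments.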


In [31]:
TARGET_POSTS = 5000

# Check current
posts_now, _ = get_db_counts()
if posts_now < TARGET_POSTS:
    print(f"Currently {posts_now:,} posts. Harvesting until we reach {TARGET_POSTS:,}...")

    # Re-run this cell if one pass does not yield enough posts
    for _ in range(3):  # raise the range if needed
        posts_more = harvest_posts(
            subs=SUBS,
            modes=MODES,                # constants from the parameters cell
            time_filters=TIME_FILTERS,
            limit_per=LIMIT_PER,
            sleep_between=SLEEP_BETWEEN,
        )
        _ = insert_posts_df(posts_more, batch_size=BATCH_INSERT)
        posts_now, comments_now = get_db_counts()
        print_delta(0, 0, posts_now, comments_now, label="TOP-UP CHECK")  # simple progress
        if posts_now >= TARGET_POSTS:
            break

print(f"Final posts total: {posts_now:,}")

Final posts total: 6,574


In [45]:
importlib.reload(cd)
from src.collect_data import get_db_counts, DB_PATH
p, c = get_db_counts()   # prints path + totals
print("Absolute path:", DB_PATH.resolve())

DB file: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/data/raw/sims4.db
Posts: 6,574 | Comments: 591,584
Absolute path: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/data/raw/sims4.db


In [46]:
from src.collect_data import DB_PATH, get_conn
import pandas as pd

print("Notebook DB:", DB_PATH.resolve())
with get_conn() as conn:
    print(pd.read_sql_query("SELECT COUNT(*) AS posts FROM posts;", conn))
    print(pd.read_sql_query("SELECT COUNT(*) AS comments FROM comments;", conn))

Notebook DB: /Users/baderrezek/Desktop/Projects/Personal/sims4-sentiment-analysis/data/raw/sims4.db
   posts
0   6574
   comments
0    591584


In [48]:
# Which tables exist?
with get_conn() as conn:
    print(pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table' ORDER BY 1;", conn))

# Count DISTINCT primary keys (defensive check)
with get_conn() as conn:
    print(pd.read_sql_query("SELECT COUNT(DISTINCT post_id) AS posts_distinct FROM posts;", conn))
    print(pd.read_sql_query("SELECT COUNT(DISTINCT comment_id) AS comments_distinct FROM comments;", conn))

# A tiny join sanity check: top 5 posts by stored comment count
with get_conn() as conn:
    q = """
    SELECT p.post_id, p.subreddit, p.mode, p.score, p.num_comments,
           COUNT(c.comment_id) AS stored_comments
    FROM posts p
    LEFT JOIN comments c ON c.post_id = p.post_id
    GROUP BY p.post_id
    ORDER BY stored_comments DESC, p.score DESC
    LIMIT 5;
    """
    display(pd.read_sql_query(q, conn))

       name
0  comments
1     posts
   posts_distinct
0            6574
   comments_distinct
0             591584


Unnamed: 0,post_id,subreddit,mode,score,num_comments,stored_comments
0,1e0ghpj,Sims4,top,b'\xd5\x07\x00\x00\x00\x00\x00\x00',b'\xa3\x04\x00\x00\x00\x00\x00\x00',500
1,1kch8mb,Sims4,top,b'r\x16\x00\x00\x00\x00\x00\x00',b')\x04\x00\x00\x00\x00\x00\x00',499
2,1l9ottz,Sims4,top,b'=\x1c\x00\x00\x00\x00\x00\x00',b'O\x03\x00\x00\x00\x00\x00\x00',499
3,1mupzk3,Sims4,top,b'@\x00\x00\x00\x00\x00\x00\x00',b'A\x05\x00\x00\x00\x00\x00\x00',498
4,1fvaeur,Sims4,top,b'\xad\x18\x00\x00\x00\x00\x00\x00',b'\x8d\x03\x00\x00\x00\x00\x00\x00',497
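
The `score` and `num_comments` columns above print as byte strings rather than numbers. Each value is 8 bytes long, which is consistent with a 64-bit little-endian integer stored as a BLOB. One plausible cause (an assumption, since the insert code is not shown here) is that NumPy `int64` values were handed to `sqlite3` without first being converted to Python `int`. A minimal sketch for decoding such values, using the first row above as input; the helper name `blob_to_int` is hypothetical.

```python
def blob_to_int(value):
    """Decode a little-endian integer BLOB; pass any other value through unchanged."""
    if isinstance(value, (bytes, bytearray)):
        return int.from_bytes(value, byteorder="little", signed=True)
    return value


# Raw values copied from the first row of the table above
raw_score = b"\xd5\x07\x00\x00\x00\x00\x00\x00"
raw_num_comments = b"\xa3\x04\x00\x00\x00\x00\x00\x00"

decoded_score = blob_to_int(raw_score)
decoded_num_comments = blob_to_int(raw_num_comments)
print(decoded_score, decoded_num_comments)
```

Applied to a DataFrame, this would look like `df["score"] = df["score"].map(blob_to_int)`. Decoding the row for post `1mupzk3` the same way gives a score of 64 and 1,345 comments, close to the 66 and 1,344 displayed for that post earlier in the notebook, which supports the little-endian reading. Casting with `int(...)` before insertion would avoid the BLOB storage in the first place.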
