# Fundamentals of Social Data Science 
## Week 3 Day 2. Lab. Text Processing 

In this lab I have generated some methods that let you download posts from Reddit. The code accepts a list of subreddits of arbitrary length. Each subreddit is processed independently and stored in a single `results` dictionary, keyed by subreddit name. Under each subreddit is a dictionary of subreddit-specific result objects, such as "vectorizer" and "top_terms".

Please read through the code. It is intentionally broken: you will need to add your Reddit username before running it. Other than that you should not need to make any modifications to the cell with the `RedditScraper` class. 

In the cell below is some code to run these methods. At the top are some parameters that you should set. These are typically written in ALL CAPS. You should read the code to understand what they do. 

# Exercises: 
    
0. **Explore subreddits**. The below code uses 'ukpolitics', 'unitedkingdom', and 'uknews'. These were loosely motivated by an interest in whether uknews has become a reactionary subreddit with generally conservative opinions. By comparing it to the other two (which other work has suggested are generally quite similar), we might get a sense of this from the top keywords. Try some other subs related to a topic where you suspect there will be some interesting distinctions. Motivate the distinctions. If you aren't sure about the subs, query an LLM (they are typically trained on a _lot_ of Reddit data and will know good subs). So instead of trying for /r/men and /r/women, if you ask about subs for gender-based interests, it might suggest /r/TwoXChromosomes and /r/MensRights as interesting distinctions. 
1. **Understand the results data structure**. The `results` object returns the top 5
   terms. How would you access more than 5 terms? Expand the results to see 10. Consider
   what way is more general and flexible. How might you change the code so that there is
   a `TOP_N = 10` which is then passed through the code so that the results dictionary
   contains ten terms in the "top_terms" DataFrame rather than hard coding it in the
   method below?

> I added a `TOP_N` global variable and pass it into `analyze_subreddit` as an optional
> keyword argument.

2. **Store results**. Every time we run the code we query Reddit again. How can we store our data so that it is cached for another round? There are many approaches to this and among your group you may discover everything from 'just save the json' to 'DataFrame and then export to feather' to some who would ambitiously use MongoDB. Given this is a simple exercise for now, keep this step simple as you need it to be while still usable enough if you want to add more data.
3. **Plot keywords over time**. Expand your results to 250 posts or more (I would cap this at 500; I think the API may only return the most recent 1000 posts, but this is untested). Determine the top keywords using TF-IDF. Then plot the frequency of these keywords over the time period covered by these results.   
4. **Table the most common URLs for stories**. Triangulate these plots with a table summarising the top news outlets for this sub in this time period. Notice the starter code to process this from the posts data that has been stored in a large `submissions` dictionary. Note, this code does not turn all the `json` into a DataFrame, but extracts only the URL column and processes that. It also uses a _regular expression_ to separate out the domain (the scheme and host, such as `https://www.reddit.com`), which may or may not be the most robust approach.  
5. **Write a summary**. Solely for reflection at this point, write some intuitions that you discover with this exploration. 
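
For exercise 2, the 'just save the json' option could look something like the sketch below. It is only one of many possible approaches. The helper names `save_submissions` and `load_submissions` are hypothetical (they are not part of the starter code), and the sketch assumes that `submissions` is a dictionary mapping each subreddit name to a list of post dictionaries, as in the cells further down.

```python
import json
from pathlib import Path


def save_submissions(submissions, data_dir="data"):
    """Write one JSON file per subreddit and return the paths written."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for subreddit, posts in submissions.items():
        path = out_dir / f"{subreddit}.json"
        path.write_text(json.dumps(posts), encoding="utf-8")
        paths.append(path)
    return paths


def load_submissions(data_dir="data"):
    """Read every cached JSON file back into a {subreddit: posts} dictionary."""
    cached = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        # The file name without its extension is the subreddit name
        cached[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return cached
```

Writing into `data/` keeps the raw posts out of version control, since that folder is listed in the `.gitignore`. A natural next step would be to check for a cached file before querying Reddit again.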

## Caveats for the exercise: 
- Reddit might severely limit the number of posts you download using this scraper, even with your username correctly set in the user agent, so be judicious with your exploration (hence exercise 2 _first_). 
- While you might not have extensive experience with Reddit, I can be confident that there are subreddits on most imaginable topics that can be found with little challenge. However, these subs will have vastly different numbers of subscribers and activity, so bear that in mind with any interpretation when tempted to generalise what is found _beyond_ Reddit (i.e. generalising from /r/republicans to Republicans in the US). 
- You may be tempted out of curiosity to expand your data collection. You will find that this will lead to a trade off if you do not further process your data. If you have 1000 rows for headlines and 3000 for words, that's a big matrix that has to be multiplied by vectors. At some point the size of the matrix will be unnecessary as well as slow. You may need to consider different parameters for `MIN_DOC_FREQ` to get a balance between a big matrix and a meaningful one. 
- These results have not been cleared for publication with CUREC, but only for use within the classroom and for illustrative purposes. Please do not upload raw Reddit data to your own GitHub archive nor seek to publish these results. (Notice that I have pre-emptively edited the .gitignore to include a `data/` folder where you can store results without uploading them.) Seek advice from research.fac@oii.ox.ac.uk for use for a comparable project should you wish to publish this work. If you wish to produce a blog post or other informal analysis, this should be presented in such a way that it is not misconstrued that the University has endorsed this work for publication. 
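
To get a feel for the trade-off that `MIN_DOC_FREQ` controls, here is a minimal pure-Python sketch of a document-frequency filter. It only illustrates the idea behind the `min_df` parameter of `TfidfVectorizer` and does not reproduce its tokenisation. The function name `vocabulary_size` and the toy documents are made up for illustration.

```python
from collections import Counter


def vocabulary_size(docs, min_doc_freq):
    """Count the terms that appear in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each term is counted once per document
        doc_freq.update(set(doc.lower().split()))
    return sum(1 for n in doc_freq.values() if n >= min_doc_freq)


# Toy corpus (invented for illustration): three very short "documents"
toy_docs = [
    "ai will change work",
    "work is changing because of ai",
    "google tracks your data",
]

for threshold in (1, 2, 3):
    print(threshold, vocabulary_size(toy_docs, threshold))
```

Raising the threshold shrinks the vocabulary quickly. With only a handful of documents (one per subreddit), even a small `MIN_DOC_FREQ` can remove exactly the terms that distinguish one subreddit from another, so it is worth checking the matrix shape after each change.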

# Where we are headed with this exercise 

### Today: 
Collect reddit data, make it robust and explore TF-IDF results. 

### Week 3 Day 3. Friday: 
We continue to use the TF-IDF matrix and introduce cosine distance. We show how to plot it using t-SNE. This might sound abstract but the results will be fascinating as we see words plotted in coherent clusters that seem to reveal inductive patterns. 
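
As a preview, cosine distance between two rows of a TF-IDF matrix can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code that will be used on Friday. The function names are hypothetical, and the vectors are assumed to be plain lists of equal length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the cosine similarity: 0 for the same direction, 1 for orthogonal."""
    return 1.0 - cosine_similarity(u, v)
```

Because the measure depends only on direction and not on length, a long document and a short one with the same mix of terms end up close together.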

Worksheets will be uploaded to this repo. 

### Week 4 Day 1. Monday: 
We will use two simple approaches to grouping text: k-means clustering and Naive Bayes classification. You might also be familiar with LDA or 'topic modelling'. We will not cover this as the technique deserves some care to understand its internals even if it is easy to run out of the box. But it is not a big extension from where we end up. 

In the lab we will then compare classification results to results from the t-SNE and exploration of distance from Friday. 

### Week 4 Day 2. Wednesday: 
We will introduce the `networkx` and `community` packages and show how to construct a network both from threaded comments and from the users who write those comments. This will involve two types of graphs: directed acyclic graphs (DAGs) and bipartite graphs. 

In the lab you will have code that shows how to do this with the Reddit data in general. You will have to apply this to your specific case. 

### Week 4 Day 3. Friday: 
In the walkthrough we will see how to create 'embeddings' as abstractions even further than t-SNE but as a next-step up from cosine distance. In fact we will see how you can use cosine distance on embeddings which allows you to do these same steps not with words, but with entire sentences or whole paragraphs. We feature this on Friday and assume that your presentations will not need to use embeddings. 

In the afternoon we will have the second set of group presentations: 
- Take a current event or coherent topic that could be collected from reddit data using the requests API (or more abstract packages such as `praw`, but not entire archive dumps like PushShift, only a limited subset). 
- Look at three or more subreddits that might speak to that topic. Determine which two subs are the most similar and why. Be sure to consider not only common word use. You may define similarity in creative ways so long as they can result in calculable differences without use of ML models, external APIs, or mass labelling of data. If you can download a lexicon, you are welcome to use scoring.
- Motivate this topic deductively. Where possible try to draw upon any existing literature on the topic and not simply abductively from current events. Consider DIKW: Find ways to produce transferable _knowledge_ rather than merely _information_ from _data_. 

In [9]:
import requests
import time
import pandas as pd
import numpy as np
import os
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

In [34]:
class RedditScraper:
    def __init__(self, user_agent):
        """
        Initialize the scraper with a user agent string.
        Example user agent: "SDS_textanalysis/1.0 (by /u/your_username)"
        """
        self.headers = {'User-Agent': user_agent}
        self.base_url = "https://api.reddit.com"

    def get_subreddit_posts(self, subreddit, limit=100):
        """
        Collect posts from a subreddit with proper pagination and rate limiting.
        """
        posts = []
        after = None
        
        while len(posts) < limit:
            url = f"{self.base_url}/r/{subreddit}/new"
            params = {
                'limit': min(100, limit - len(posts)),
                'after': after
            }
            
            response = requests.get(url, headers=self.headers, params=params)
            
            if response.status_code != 200:
                print(f"Error accessing r/{subreddit}: {response.status_code}")
                break
                
            data = response.json()
            new_posts = data['data']['children']
            if not new_posts:
                break
                
            posts.extend([post['data'] for post in new_posts])
            after = data['data']['after']
            
            if not after:
                break
                
            time.sleep(2)
            
        return posts[:limit]

def preprocess_text(text):
    """
    Clean and normalize text.
    """
    if pd.isna(text):
        return ""
    
    # Convert to lowercase and remove special characters
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def analyze_vocabulary(texts, min_freq=2):
    """
    Analyze vocabulary distribution in a corpus.
    Returns word frequencies and vocabulary statistics.
    """
    # Tokenize all texts
    words = ' '.join(texts).split()
    
    # Count word frequencies
    word_freq = Counter(words)
    
    # Calculate vocabulary statistics
    total_words = len(words)
    unique_words = len(word_freq)
    
    # Create frequency distribution DataFrame
    freq_df = pd.DataFrame(list(word_freq.items()), columns=['word', 'frequency'])
    freq_df['percentage'] = freq_df['frequency'] / total_words * 100
    freq_df = freq_df.sort_values('frequency', ascending=False)
    
    # Calculate cumulative coverage
    freq_df['cumulative_percentage'] = freq_df['percentage'].cumsum()
    
    stats = {
        'total_words': total_words,
        'unique_words': unique_words,
        'words_min_freq': sum(1 for freq in word_freq.values() if freq >= min_freq),
        'coverage_top_1000': freq_df.iloc[:1000]['frequency'].sum() / total_words * 100 if len(freq_df) >= 1000 else 100
    }
    
    return freq_df, stats

def analyze_subreddits(
    subreddits: dict[str, list],
    max_terms=1000,
    min_doc_freq=2,
    top_n: int = 5,
):
    """
    Analyze a collection of subreddits using TF-IDF.
    """
    print(f"Analyzing collection of {len(subreddits)} subreddits.")

    subreddit_docs: dict[str, str] = {}
    for subreddit, posts in subreddits.items():
        # Each subreddit document is all posts joined together with newline delimiting
        subreddit_docs[subreddit] = "\n".join([
            preprocess_text(post.get('title', '')) + ' ' + 
            preprocess_text(post.get('selftext', ''))
            for post in posts
        ])
    
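    # Analyze vocabulary first (pass the document texts, not the dict itself,
    # otherwise only the subreddit names would be counted)
    freq_df, vocab_stats = analyze_vocabulary(subreddit_docs.values(), min_freq=min_doc_freq)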
    
    # Initialize TF-IDF vectorizer for the whole collection of subreddits
    stop_words = list(set(stopwords.words('english')))
    vectorizer = TfidfVectorizer(
        stop_words=stop_words,
        max_features=max_terms,
        min_df=min_doc_freq
    )
    
    # Compute TF-IDF
    tfidf_matrix = vectorizer.fit_transform(subreddit_docs.values())

    print(f"Fit TF-IDF. Matrix shape: {tfidf_matrix.shape}")
    
    # # Get top terms
    # feature_names = vectorizer.get_feature_names_out()
    # top_terms = pd.DataFrame({
    #     'term': feature_names,
    #     'score': mean_tfidf
    # }).sort_values('score', ascending=False)
    
    # return {
    #     'vocab_stats': vocab_stats,
    #     'freq_distribution': freq_df,
    #     'top_terms': top_terms.head(n=top_n),
    #     'vectorizer': vectorizer,
    #     'matrix_shape': tfidf_matrix.shape,
    #     'matrix_sparsity': 100 * (1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1]))
    # }

    return vectorizer, tfidf_matrix


In [21]:

vectorizer = TfidfVectorizer(
    stop_words=stop_words,
    max_features=max_terms,
    min_df=min_doc_freq
)

# Compute TF-IDF
tfidf_matrix = vectorizer.fit_transform(texts)

NameError: name 'stop_words' is not defined

In [35]:
# Example subreddits
subreddits = [
    "Futurology",
    "artificial",
    "LateStageCapitalism",
    "Antiwork",
    "BigTech",
    "DeGoogle",
]

# Analysis parameters
MAX_TERMS = 1000
MIN_DOC_FREQ = 1
LIMIT = 50
USERNAME = os.getenv("REDDIT_USERNAME")  # Replace with your Reddit username
TOP_N = 5

# Initialize scraper
scraper = RedditScraper(user_agent=f"SDS_textanalysis/1.0 (by /u/{USERNAME})")

# Analyze collection of subreddits together
# The whole collection constitutes the "corpus", each subreddit is a "document"

results = {}
submissions = {}

for subreddit in subreddits:
    print(f"Collecting posts for r/{subreddit}...")
    submissions[subreddit] = scraper.get_subreddit_posts(subreddit, limit=LIMIT)


Collecting posts for r/Futurology...
Collecting posts for r/artificial...
Collecting posts for r/LateStageCapitalism...
Collecting posts for r/Antiwork...
Collecting posts for r/BigTech...
Collecting posts for r/DeGoogle...


In [36]:
vectorizer, matrix = analyze_subreddits(submissions, max_terms=MAX_TERMS, min_doc_freq=MIN_DOC_FREQ, top_n=TOP_N)

Analyzing collection of 6 subreddits.
Fit TF-IDF. Matrix shape: (6, 1000)


In [37]:
vectorizer.get_feature_names_out()

array(['10', '100', '13', '20', '2024', '25', '2fa', '30', '40',
       'ability', 'able', 'absolutely', 'access', 'according', 'account',
       'accounts', 'across', 'act', 'activities', 'actual', 'actually',
       'add', 'address', 'advance', 'advanced', 'advice', 'affect', 'age',
       'agi', 'ago', 'ai', 'aigenerated', 'aipowered', 'alien', 'allow',
       'allowed', 'allows', 'almost', 'alone', 'along', 'already', 'also',
       'alternative', 'alternatives', 'always', 'american', 'amid',
       'analytics', 'android', 'another', 'answer', 'answers',
       'anthropics', 'anymore', 'anyone', 'anything', 'anyway', 'app',
       'apple', 'applying', 'appreciate', 'apps', 'area', 'arent',
       'around', 'art', 'article', 'artificial', 'ask', 'asked', 'asking',
       'asks', 'assistants', 'attendance', 'aurora', 'available', 'aware',
       'away', 'babel', 'back', 'background', 'backup', 'bad', 'bank',
       'banking', 'based', 'basic', 'basically', 'battle', 'become',
       

In [47]:
terms = vectorizer.get_feature_names_out()

# Iterate over each document
for i, row in enumerate(matrix):
    # Convert the row to a dense array, then get the top TOP_N indices by score
    row_data = row.toarray().flatten()
    top_indices = row_data.argsort()[-TOP_N:][::-1]  # Top TOP_N indices, highest score first
    
    # Print the top terms with their scores
    print(f"Document ({subreddits[i]}) top terms:")
    for idx in top_indices:
        print(f"  {terms[idx]}: {row_data[idx]}")

Document (Futurology) top terms:
  energy: 0.30356765090454885
  ai: 0.26208801911633506
  like: 0.18720572794023935
  robots: 0.17978141678762047
  solar: 0.17052140425891768
Document (artificial) top terms:
  ai: 0.5054984609144059
  comment: 0.19073274471777193
  like: 0.14767370768286014
  images: 0.14545351780174437
  selection: 0.1441206360239499
Document (LateStageCapitalism) top terms:
  capitalism: 0.26156372116911414
  class: 0.23235823826372412
  vote: 0.23235823826372412
  nations: 0.18890584126865304
  qr: 0.18890584126865304
Document (Antiwork) top terms:
  work: 0.2991445432240473
  job: 0.2718586231010764
  im: 0.2657519751625898
  dont: 0.18554534959466226
  get: 0.18372976060623492
Document (BigTech) top terms:
  google: 0.370449676127209
  elon: 0.35672688043902334
  data: 0.32928860100196355
  musks: 0.2972724003658528
  meta: 0.20580537562622722
Document (DeGoogle) top terms:
  google: 0.46909135954721554
  account: 0.21647476028182885
  privacy: 0.2020431095963736
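
Once the top terms are known, exercise 3 asks for their frequency over time. The sketch below is one possible starting point. It assumes that each post dictionary carries a `created_utc` field holding a Unix timestamp in seconds (check this against your own downloaded posts), and the function name `keyword_counts_by_day` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def keyword_counts_by_day(posts, keyword):
    """Count posts per UTC day whose title contains `keyword` as a whole word."""
    counts = Counter()
    for post in posts:
        # Simple whitespace tokenisation; punctuation is NOT stripped here
        words = post.get("title", "").lower().split()
        if keyword.lower() in words:
            # Assumption: `created_utc` is a Unix timestamp in seconds
            day = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc).date()
            counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))
```

Calling `keyword_counts_by_day(submissions["Futurology"], "energy")` would give a date-to-count dictionary that can be passed to a plotting library. Running titles through `preprocess_text()` first would make the matching consistent with the TF-IDF vocabulary.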

In [None]:

# analyze_subreddits returns the fitted vectorizer and the TF-IDF matrix
# (one row per subreddit, in the same order as the keys of `submissions`).
# vocab_stats is computed inside analyze_subreddits but not currently returned.
vectorizer, matrix = analyze_subreddits(
    submissions,
    max_terms=MAX_TERMS,  # Maximum number of terms to keep
    min_doc_freq=MIN_DOC_FREQ,  # Term must appear in at least min_doc_freq documents
    top_n=TOP_N,  # Number of top terms to show
)

n_docs, n_terms = matrix.shape
print(f"Matrix shape: {matrix.shape}")
print(f"Matrix sparsity: {100 * (1 - matrix.nnz / (n_docs * n_terms)):.2f}%")

# Print results for each subreddit
terms = vectorizer.get_feature_names_out()
for i, subreddit in enumerate(submissions):
    scores = matrix[i].toarray().flatten()
    top_terms = pd.DataFrame({"term": terms, "score": scores}).nlargest(TOP_N, "score")
    print(f"\nTop {TOP_N} terms by TF-IDF score for r/{subreddit}:")
    print(top_terms[["term", "score"]].to_string(index=False))

In [6]:
# Data Exploration:
example_subreddit = "Antiwork"

submissions[example_subreddit][0] # Example post from the example subreddit

url_list = [post['url'] for post in submissions[example_subreddit]]
url_df = pd.DataFrame(url_list, columns=['url'])
url_df['domain'] = url_df['url'].str.extract(r'(https?://[^/]+)')

url_df['domain'].value_counts().head(10)

domain
https://www.reddit.com     40
https://i.redd.it           7
https://youtu.be            1
https://www.nbcnews.com     1
https://www.kwqc.com        1
Name: count, dtype: int64
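
The regular expression above keeps the scheme and the full host, so `https://www.reddit.com` and `https://reddit.com` would be counted separately. A possibly more robust alternative is sketched below using the standard library's `urllib.parse`. The helper names `extract_domain` and `top_domains` are hypothetical, and stripping a leading `www.` is a simplification that you may not want for every analysis.

```python
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the host of a URL without a leading 'www.', or '' if none is found."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(urls, n=10):
    """Return the `n` most common domains as (domain, count) pairs."""
    return Counter(extract_domain(u) for u in urls).most_common(n)
```

For example, `top_domains(url_list)` could replace the `value_counts()` step above. Note that this still treats `i.redd.it` and `reddit.com` as different outlets, so some manual grouping may be needed before tabling news sources.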

In [8]:
results[example_subreddit].keys()

dict_keys(['vocab_stats', 'freq_distribution', 'top_terms', 'vectorizer', 'matrix_shape', 'matrix_sparsity'])

# AI Declaration: 

Claude Sonnet 3.5 New produced much of the Reddit code. I was surprised at how similar it was to my past code (knowing it was trained on GitHub I have to wonder). Several tweaks had to be made, such as removing a main() function, altering the results object, altering some NLTK packages, and adding the `submissions` dictionary and the code that fills it. I kept in the `preprocess_text()` function as is, but you are encouraged to consider alternative forms of pre-processing from the walkthrough including the use of standard tokenizers, lemmatisation, and stop-words. It also used anodyne programming subreddits, which I changed, and I reduced the limit to 50, which is just enough to get two queries illustrating that you can get N queries through this approach. 

The URL code was written in VS Code with Copilot. Notably the autocomplete did an excellent job of anticipating steps with minimal prompting. 