> ### Note on Labs and Assignments:
>
> Throughout this lab, you will see **🔧 Try It Yourself** sections and a final 🔧 **Reflection** section.
>
> ✅ You are expected to:
> - Complete each **"🔧 Try It Yourself"** section by writing and running your own code, or by answering the prompted questions, in a Markdown or Python cell below the section.
> - Answer the **Reflection** section at the end of the lab in your own words. This is your opportunity to summarize what you learned and connect the concepts.
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are **graded** and are **not optional**. 
>
> ---

# IS 4487 Lab 14: Reddit API

## Outline

- Retrieve Reddit post titles using the official Reddit API (via PRAW)  
- Clean and prepare text data for analysis  
- Apply TF-IDF and K-Means clustering to discover major discussion topics  
- Visualize topic clusters using word clouds  
- Reflect on business use cases for topic modeling  

In this lab, you will collect Reddit data from a business-focused subreddit and apply **unsupervised learning** to identify emerging themes in public discussion. This type of analysis is valuable in marketing, finance, and product strategy for uncovering real-time customer interests and concerns.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_14_API.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Part 1: Reddit API Setup (via PRAW)

Reddit's official API requires authentication. Follow these steps:

---

### 🔑 Get Your API Credentials

1. Go to: https://www.reddit.com/prefs/apps
2. Log in or create an account
3. Click the "are you a developer? create an app" button on the page
4. Fill out:
   - Name: `lab14`
   - Type: **script**
   - Redirect URI: `http://localhost`
   - Leave the other fields blank
5. After submitting:
   - **client_id**: the string shown below your app name and "personal use script"
   - **client_secret**: the longer string shown next to **secret**

---

Paste your credentials in the Python cell below to connect to Reddit.


In [None]:
!pip install praw
import praw

# Replace these with your own credentials
reddit = praw.Reddit(
    client_id="ENTER CLIENT ID",
    client_secret="ENTER CLIENT SECRET",
    user_agent="lab14-reddit-topic-model"
)
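
If you share or submit this notebook, pasted credentials travel with it. As an optional alternative, here is a minimal sketch that reads them from environment variables instead. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical choices for illustration; they are not part of PRAW.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": "lab14-reddit-topic-model",
    }


# Demo with placeholder values so the sketch runs without real credentials
creds = load_reddit_credentials(
    {"REDDIT_CLIENT_ID": "placeholder-id", "REDDIT_CLIENT_SECRET": "placeholder-secret"}
)
print(sorted(creds))
```

With real values set in the environment, the connection cell could then call `praw.Reddit(**load_reddit_credentials())`.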


## Part 2: Collect Reddit Post Titles

Let’s pull 100 posts from the "hot" listing of the `r/stocks` subreddit.

We'll store the `title` and `created_utc` fields in a DataFrame.


In [None]:
import pandas as pd

# Choose a subreddit and fetch posts
subreddit_name = "stocks"
posts = []

for submission in reddit.subreddit(subreddit_name).hot(limit=100):
    posts.append([submission.title, submission.created_utc])

# Convert to DataFrame
df = pd.DataFrame(posts, columns=["title", "created_utc"])
df.head(20)
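
The `created_utc` values come back as Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read at a glance. The sketch below shows one way to convert them to datetimes with pandas. The two rows are made up for illustration, and the `created_at` column name is an assumption, not something the lab requires.

```python
import pandas as pd

# Made-up rows standing in for real Reddit posts (illustration only)
demo = pd.DataFrame(
    {
        "title": ["Example post A", "Example post B"],
        "created_utc": [1700000000.0, 1700003600.0],
    }
)

# Interpret the numbers as seconds since the Unix epoch, in UTC
demo["created_at"] = pd.to_datetime(demo["created_utc"], unit="s", utc=True)
print(demo[["title", "created_at"]])
```

The same `pd.to_datetime(..., unit="s", utc=True)` call should work on the lab's `df["created_utc"]` column.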



### 🔧 Try It Yourself - Part 2

1. Copy and paste the code above and **change the subreddit name** to one that matches your **future career goal** or your **current major/minor**. Here are some popular business-related subreddits you can choose from:

   - `r/Entrepreneur`
   - `r/business`
   - `r/startups`
   - `r/Finance`
   - `r/smallbusiness`
   - `r/marketing`
   - `r/investing`
   - `r/consulting`

   👉 [Explore more subreddits here](https://www.reddit.com/subreddits/)

2. **Update the code to increase the number of posts pulled by changing the `limit` parameter to `200`.**  

   Then, run the following lines of code to see how many total posts you collected and how many duplicate titles there are:

   ```
   print(f"Total posts: {len(df)}")
   print(df['title'].duplicated().sum(), "duplicates")
   ```


In [None]:
# 🔧 Add code here

## Part 3: Clean and Prepare the Text

Next, we’ll:
- Lowercase the text
- Remove links, punctuation, and extra spaces

Cleaned text will be passed to the vectorizer in the next step.

In [None]:
import re  # Import the regular expressions module for text pattern matching

# Define a function to clean text data
def clean_text(text):
    text = text.lower()  # Convert all text to lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-z\s]", "", text)  # Remove punctuation, numbers, and special characters (keep only letters and spaces)
    text = re.sub(r"\s+", " ", text).strip()  # Replace multiple spaces with a single space and strip leading/trailing spaces
    return text

# Apply the cleaning function to each post title in the DataFrame
df['clean_title'] = df['title'].apply(clean_text)

# Display the first few rows of the updated DataFrame
df.head()
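
To see what the cleaning steps do to a single title, here is a small self-contained sketch. It repeats the lab's `clean_text` function so it can run on its own; the sample title is made up.

```python
import re


def clean_text(text):
    """Same cleaning steps as the lab cell: lowercase, strip URLs, keep letters only."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)  # drop digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text


# Made-up title for illustration
sample = "Check THIS out: https://example.com NVDA up 5%!!"
print(repr(sample))
print(repr(clean_text(sample)))
```

The URL, the colon, and the entire `5%!!` token disappear, so numeric details such as prices and percentages are not available to the clustering step.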



### 🔧 Try It Yourself - Part 3

1. **Create a new column to count the number of words in each post title.**  
   This uses the cleaned title column (`clean_title`) to calculate word count:

   ```
   df['word_count'] = df['clean_title'].apply(lambda x: len(x.split()))
   ```
2. **Filter the DataFrame** to remove any rows where the post title has fewer than 3 words.

3. **Sort the DataFrame** by the `word_count` column in descending order and display the top 5 longest post titles.





In [None]:
# 🔧 Add code here

## Part 4: TF-IDF and KMeans Clustering

Now, we’ll:
- Vectorize the cleaned text using TF-IDF
- Apply KMeans to group posts by topic

Clustering lets us see which themes dominate Reddit discussions.
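
To build intuition for what TF-IDF computes, here is a simplified standard-library sketch. It uses the smoothed IDF formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, but it skips stop-word removal and the L2 normalization that `TfidfVectorizer` also applies, so the numbers will not match the vectorizer's output exactly. The titles and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Made-up cleaned titles (illustration only)
titles = [
    "fed raises rates again",
    "tech stocks rally after fed decision",
    "best dividend stocks for beginners",
]


def tfidf(docs):
    """Return one {term: weight} dict per document (no normalization)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many titles does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    # Weight = term count in the title * inverse document frequency
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


weights = tfidf(titles)
for term in ("fed", "raises"):
    print(term, round(weights[0][term], 3))
```

In the first title, `raises` gets a larger weight than `fed` because `fed` also appears in another title; terms that are rarer across the collection count for more.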

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # For converting text data into numerical features
from sklearn.cluster import KMeans  # For performing K-Means clustering

# Create a TF-IDF vectorizer that removes English stop words and limits features to the top 1,000 terms
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

# Transform the cleaned titles into a TF-IDF matrix (rows = posts, columns = terms)
X = vectorizer.fit_transform(df['clean_title'])

# Initialize a K-Means model with 4 clusters (it is fitted in the fit_predict call below)
kmeans = KMeans(n_clusters=4, random_state=42)

# Predict cluster assignments and add them as a new column to the DataFrame
df['cluster'] = kmeans.fit_predict(X)

# Display the first few cleaned titles along with their assigned cluster
df[['clean_title', 'cluster']].head()


### 🔧 Try It Yourself - Part 4

1. **Change the number of clusters from 4 to 5** in your clustering code by updating the `n_clusters` argument of `KMeans`.

2. **Check how many posts were assigned to each cluster** by counting the values in the `cluster` column (for example, with the pandas `value_counts()` method).


3. **Observe and describe one change** that occurred when you increased the number of clusters from 4 to 5.  
   This could include shifts in cluster size, new topic groupings, or changes in how the posts are categorized.
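
Counting cluster sizes can be sketched as follows, assuming a `cluster` column like the one created in Part 4. The labels here are made up for illustration, and `demo` is a stand-in for your own DataFrame.

```python
import pandas as pd

# Made-up cluster labels for eight posts (illustration only)
demo = pd.DataFrame({"cluster": [0, 2, 1, 0, 3, 0, 2, 4]})

# value_counts() tallies rows per label; sort_index() orders the result by cluster number
cluster_sizes = demo["cluster"].value_counts().sort_index()
print(cluster_sizes)
```

Comparing this tally before and after changing the number of clusters is one way to spot which groups split or shrank.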



In [None]:
# 🔧 Add code here

🔧 Add comment here:

## Part 5: Top Words per Cluster

This step shows the 10 highest-weighted words in each cluster, ranked by their weight in the cluster's centroid (the average TF-IDF vector of the posts assigned to it).

In [None]:
# Get the list of feature names (i.e., the terms used in the TF-IDF matrix)
terms = vectorizer.get_feature_names_out()

# Get the indices of terms with the highest weights in each cluster center
# argsort() sorts each cluster center's weights in ascending order; [::-1] reverses to descending
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

# Loop through each cluster to print the top 10 keywords
for i in range(kmeans.n_clusters):
    print(f"\nCluster {i} keywords:")  # Label for the current cluster
    # Print the 10 terms with the largest centroid weights for this cluster
    print(", ".join([terms[ind] for ind in order_centroids[i, :10]]))
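
The `argsort()[:, ::-1]` step is easier to follow on a tiny example. The sketch below uses a made-up four-word vocabulary and made-up centroid weights for two clusters; only the indexing pattern matches the lab code.

```python
import numpy as np

# Made-up vocabulary and centroid weights for two clusters (illustration only)
terms = np.array(["earnings", "fed", "dividend", "rates"])
centers = np.array([
    [0.1, 0.7, 0.0, 0.4],
    [0.5, 0.0, 0.9, 0.2],
])

# argsort() returns, per row, the column indices from smallest to largest weight;
# [:, ::-1] reverses each row so the largest weight comes first
order = centers.argsort()[:, ::-1]

for i, row in enumerate(order):
    print(f"Cluster {i}:", ", ".join(terms[row[:2]]))
```

Each row of `order` holds positions in the vocabulary, not weights, which is why the lab code looks the words up with `terms[ind]`.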


### 🔧 Try It Yourself - Part 6: Word Clouds by Cluster

Let’s visualize the dominant words in each cluster using **word clouds**. This will help you interpret what each cluster represents based on the most frequent terms.

### Instructions:
1. Copy and paste the code below into a Python cell. Fill in the blanks to generate a word cloud for each cluster of Reddit post titles.


```
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for i in range(____):  # Loop over each cluster number
    texts = df[df['cluster'] == ____]['clean_title']  # Filter titles for the current cluster
    combined = " ".join(____)  # Combine all titles into one string
    
    wc = WordCloud(background_color="white", max_words=100).generate(____)
    
    plt.figure(figsize=(6, 6))
    plt.imshow(wc, interpolation='bilinear')
    plt.title(f"Cluster {i}")
    plt.axis('off')
    plt.show()
```

2. For two of the word clouds, write a 1–2 sentence observation. Consider:

> What common theme or topic does this cluster seem to represent?

> Is there anything surprising or unclear about the cluster?


In [None]:
# 🔧 Add code here


🔧 Add comment here:

# 🔧 Part 7: Reflection (100 words or fewer per question)


### Questions:
1. What business-relevant topics were users discussing the most?
2. How useful would this clustering approach be for a company’s marketing or strategy team?
3. How would the topics shift if you pulled data from a different subreddit like `r/technology` or `r/antiwork`?


🔧 Add comments here:

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_14_LastnameFirstname.ipynb"