# Reddit Suicide Risk Detector: End-to-End Documentation

## 1. Project Overview

This project builds a classifier that predicts whether a Reddit post is likely to concern suicide risk. The notebook walks through each step, from collecting raw Reddit posts to a working tool that classifies new posts.

## 2. Environment Setup

Install dependencies:

```bash
pip install pandas scikit-learn nltk praw flask joblib
```

Download NLTK resources:

```python
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
```


## 3. Data Collection

Gather posts from Reddit with PRAW, a Python client for the Reddit API. These posts provide real examples of how people write about their feelings.

```python
import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="YOUR_USER_AGENT"
)

subreddits = ["SuicideWatch", "depression", "anxiety", "mentalhealth", "mentalillness", "BPD", "bipolar",
        "autism", "MentalHealthUK", "socialanxiety", "talktherapy", "askatherapist", "offmychest",
        "traumatoolbox", "dbtselfhelp", "bodyacceptance", "MMFB", "mentalhealthmemes", "anxietymemes",
        "MentalHealthSupport", "malementalhealth", "mentalhealthph", "mentalhealthsupport",
        "mentalhealthbabies", "emotionalsupport", "helpme", "Advice", "KindVoice", "Vent", "venting",
        "Feels", "sad", "CasualConversation", "MomForAMinute", "DadForAMinute", "BenignExistence",
        "findafriend", "relationship_advice", "internetparents", "freecompliments", "Confessions",
        "Offmychest", "AskReddit", "TodayILearned", "pics", "funny", "worldnews", "science", "movies",
        "books", "technology", "gaming", "sports", "Music", "Art", "food", "DIY", "space", "History",
        "television", "Documentaries", "InternetIsBeautiful", "travel", "photography", "cooking",
        "gardening", "Fitness", "cars", "Bicycling", "boardgames", "CampingandHiking", "coffee", "tea",
        "knitting", "woodworking"]

limit = 100  # posts to request per subreddit; assumed value, the notebook does not state one

posts = []
for subreddit_name in subreddits:
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.new(limit=limit):
        posts.append({
            "post_id": submission.id,
            "subreddit": subreddit_name,
            "title": submission.title,
            "post_text": submission.selftext
        })

df = pd.DataFrame(posts)
df.to_csv("reddit_posts.csv", index=False)
```
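
Reddit treats subreddit names as case-insensitive, and the list above contains a few names that differ only in capitalization (`offmychest`/`Offmychest`, `MentalHealthSupport`/`mentalhealthsupport`), so those subreddits are requested twice. The sketch below shows one way to deduplicate the list before the collection loop. The helper name `unique_subreddits` and the short sample list are illustrative and not part of the notebook.

```python
def unique_subreddits(names):
    """Drop case-insensitive duplicates, keeping the first spelling of each name."""
    seen = set()
    unique = []
    for name in names:
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


# Short sample standing in for the full list used in the notebook.
sample = ["offmychest", "MentalHealthSupport", "mentalhealthsupport", "Offmychest", "pics"]
deduped = unique_subreddits(sample)
print(deduped)
```

The `drop_duplicates` step in the next section already removes the repeated posts, so this only saves API requests.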


## 4. Data Cleaning

Remove duplicate posts and posts with missing or empty fields. Clean data helps the model learn real patterns rather than noise.

```python
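df = pd.read_csv("reddit_posts.csv")
# Keep one row per Reddit submission.
df = df.drop_duplicates(subset=["post_id"])
# read_csv parses empty fields as NaN, so this drops posts with no title or no body.
df = df.dropna(subset=["title", "post_text"])
# Drop rows where both fields contain only whitespace.
df = df[(df["title"].str.strip() != "") | (df["post_text"].str.strip() != "")]
df.to_csv("reddit_posts_cleaned.csv", index=False)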
```
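
One consequence of the steps above is worth seeing on a small example: `pd.read_csv` parses empty fields as NaN, so `dropna` discards link posts that have a title but no body. If title-only posts should be kept (an assumption about intent, not something the notebook states), a more lenient variant could look like the sketch below. The toy rows and the helper name `clean_posts` are invented for illustration.

```python
import io

import pandas as pd

# Toy CSV standing in for reddit_posts.csv; the rows are invented.
raw_csv = io.StringIO(
    "post_id,subreddit,title,post_text\n"
    "a1,sad,Rough week,I could use some company\n"
    "a1,sad,Rough week,I could use some company\n"  # duplicate post_id
    "b2,pics,Sunset over the bay,\n"                # link post: empty body
    "c3,Vent,,\n"                                   # no usable text at all
)


def clean_posts(frame):
    """Drop duplicate IDs, then keep rows with a non-blank title or body."""
    frame = frame.drop_duplicates(subset=["post_id"])
    # Turn the NaN values produced by read_csv back into empty strings.
    frame = frame.fillna({"title": "", "post_text": ""})
    has_text = (frame["title"].str.strip() != "") | (frame["post_text"].str.strip() != "")
    return frame[has_text].reset_index(drop=True)


toy = pd.read_csv(raw_csv)
# The notebook's behavior: rows with any missing field are dropped.
strict = toy.drop_duplicates(subset=["post_id"]).dropna(subset=["title", "post_text"])
# The lenient variant keeps the title-only post as well.
lenient = clean_posts(toy)
```

Here `strict` keeps only post `a1`, while `lenient` keeps `a1` and `b2`.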


## 5. Data Labeling

Assign a label to each post based on its source subreddit: 1 (“high-risk”) for posts from the mental health and support subreddits listed below, and 0 (“neutral”) for posts from all other subreddits.

```python
mental_health_subs = [
    "SuicideWatch", "depression", "anxiety", "mentalhealth", "mentalillness",
    "BPD", "bipolar", "autism", "MentalHealthUK", "socialanxiety", "talktherapy",
    "askatherapist", "offmychest", "traumatoolbox", "dbtselfhelp", "bodyacceptance",
    "MMFB", "mentalhealthmemes", "anxietymemes", "MentalHealthSupport",
    "malementalhealth", "mentalhealthph", "mentalhealthsupport", "mentalhealthbabies",
    "emotionalsupport", "helpme", "Advice", "KindVoice", "Vent", "venting", "Feels",
    "sad", "CasualConversation", "MomForAMinute", "DadForAMinute", "BenignExistence",
    "findafriend", "relationship_advice", "internetparents", "freecompliments",
    "Confessions", "Offmychest"
]

df = pd.read_csv("reddit_posts_cleaned.csv")
df["label"] = df["subreddit"].apply(lambda x: 1 if x in mental_health_subs else 0)
df.to_csv("reddit_posts_labeled.csv", index=False)
```
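
Because the label depends only on the subreddit name, two quick checks are useful after labeling: that name matching is not sensitive to capitalization, and that the two classes are reasonably balanced. The following is a minimal sketch with a shortened subreddit set and invented sample names. `label_for` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

# Shortened stand-in for mental_health_subs, lowercased for case-insensitive lookup.
MENTAL_HEALTH_SUBS = frozenset(
    name.casefold() for name in ["SuicideWatch", "depression", "offmychest"]
)


def label_for(subreddit_name):
    """Return 1 for a listed mental health subreddit, otherwise 0."""
    return 1 if subreddit_name.casefold() in MENTAL_HEALTH_SUBS else 0


sample_subreddits = ["SuicideWatch", "suicidewatch", "pics", "Offmychest", "cooking"]
labels = [label_for(name) for name in sample_subreddits]
balance = Counter(labels)  # number of posts per label
print(labels, dict(balance))
```

On the real DataFrame, `df["label"].value_counts()` gives the same class counts.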


## 6. Text Preprocessing

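Lowercase the text, remove punctuation, drop common stopwords (like “the”, “and”), and lemmatize the remaining words (e.g., “feelings” becomes “feeling”). `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, so verb forms such as “helping” are left unchanged. Preprocessing reduces superficial differences between posts so the model sees a more consistent vocabulary.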

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

df = pd.read_csv("reddit_posts_labeled.csv")
df["clean_text"] = df["title"].fillna("") + " " + df["post_text"].fillna("")
df["clean_text"] = df["clean_text"].apply(clean_text)
df.to_csv("reddit_posts_preprocessed.csv", index=False)
```
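
To see what each stage of `clean_text` does, the sketch below traces one sentence through lowercasing, punctuation removal, and stopword filtering. It is a simplified stand-in: it uses a tiny hand-written stopword set instead of the NLTK list and skips lemmatization, so it runs without any downloads. The name `trace_clean` and the example sentence are invented.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for illustration).
STAND_IN_STOPWORDS = {"i", "a", "the", "and", "to", "don't"}


def trace_clean(text, stop_words=STAND_IN_STOPWORDS):
    """Return the intermediate result of each cleaning stage."""
    lowered = text.lower()
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)  # same pattern as clean_text
    tokens = no_punctuation.split()
    kept = [token for token in tokens if token not in stop_words]
    return {
        "lowered": lowered,
        "no_punctuation": no_punctuation,
        "tokens": tokens,
        "kept": kept,
    }


stages = trace_clean("I don't want to talk, and the days feel long.")
for name, value in stages.items():
    print(name, "->", value)
```

Note that removing punctuation first turns "don't" into "dont", which no longer matches the stopword entry and therefore stays in the output. Whether that matters for the model is a judgment call; filtering stopwords before stripping punctuation would avoid it.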


## 7. Feature Extraction

Convert each post into a numeric vector with TF-IDF (term frequency-inverse document frequency), which weights a word by how often it appears in the post and discounts words that appear in many posts. The model trains on these vectors rather than on raw text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

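df = pd.read_csv("reddit_posts_preprocessed.csv")
# Posts that are empty after cleaning are read back as NaN, which TfidfVectorizer rejects.
df["clean_text"] = df["clean_text"].fillna("")
# Keep the 5,000 most frequent terms across the corpus.
vectorizer = TfidfVectorizer(max_features=5000)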
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
```
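
The weighting can be worked through by hand on a few short documents. The sketch below is a plain-Python approximation of scikit-learn's default TF-IDF settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It tokenizes on whitespace, which differs from `TfidfVectorizer`'s default token pattern (that pattern drops single-character tokens), and the three documents are invented.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights with smoothed IDF."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty documents
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = ["feel alone tonight", "feel great today", "great movie tonight"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "alone" appears in no other document and so receives a higher weight than "feel" or "tonight", which each appear in two documents.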


## 8. Model Training

Train a logistic regression classifier to separate high-risk from neutral posts using the TF-IDF features. 80% of the data is used for training and 20% is held out, so the model can be evaluated on posts it has not seen.


```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
```
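
In the steps above, the vectorizer is fitted on all posts before the split, so the IDF weights are partly learned from the test posts. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is fitted on the training portion only. The sketch below shows that variant on invented toy data. It is an alternative arrangement, not the notebook's own code, and `stratify` is an added option that keeps the label ratio equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy texts standing in for df["clean_text"] and df["label"].
texts = [
    "feel hopeless alone tonight", "cannot go anymore tired", "nobody would miss me",
    "everything hurts want stop", "great recipe sourdough bread", "new bike trail weekend",
    "favorite board game night", "coffee grinder recommendation",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the test posts never influence the vocabulary or IDF.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

model = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression(max_iter=1000))
model.fit(texts_train, y_train)
predictions = model.predict(texts_test)
```

A fitted pipeline can also be saved as a single object with `joblib.dump`, which keeps the vectorizer and classifier in sync.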


## 9. Model Evaluation

Test the model on the held-out posts and measure its performance with a classification report (precision, recall, and F1 per class) and a confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["neutral", "high-risk"]))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```
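
For a binary problem, `confusion_matrix` returns `[[TN, FP], [FN, TP]]`, with rows for the true labels and columns for the predicted labels. The sketch below shows how the high-risk metrics in the classification report follow from those four counts. The counts in `example` are invented and are not results from this project.

```python
def binary_metrics(matrix):
    """Compute metrics for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


example = [[80, 20], [10, 90]]  # invented counts for illustration
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a risk detector, recall on the high-risk class (the share of truly high-risk posts that are flagged) is usually the figure to watch, since a missed post is a false negative.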


## 10. Model Saving

Save the trained model and the fitted vectorizer (which defines how text is converted to numbers).

```python
import joblib

joblib.dump(clf, "suicide_risk_logreg_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
```


## 11. Inference Function

Predict risk for new text.

```python
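import joblib

def predict_post_risk(title, post_text):
    """Return "high-risk" or "neutral" for one post (uses clean_text from Section 6)."""
    # The artifacts are reloaded on each call; load them once at module level if this is called often.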
    clf = joblib.load("suicide_risk_logreg_model.pkl")
    vectorizer = joblib.load("tfidf_vectorizer.pkl")
    text = clean_text(f"{title} {post_text}")
    X_new = vectorizer.transform([text])
    prediction = clf.predict(X_new)
    return "high-risk" if prediction[0] == 1 else "neutral"
```
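
The function above returns only a hard label. Logistic regression also exposes `predict_proba`, which makes it possible to report a score and choose a decision threshold other than 0.5. The sketch below is one possible variant. The function name, the `threshold` parameter, and passing the model objects as arguments are all assumptions for illustration, and the stand-in objects at the bottom exist only so the example runs without the saved `.pkl` files.

```python
from types import SimpleNamespace


def predict_post_risk_with_score(title, post_text, clf, vectorizer, clean, threshold=0.5):
    """Return (label, probability of the high-risk class) for one post."""
    text = clean(f"{title} {post_text}")
    features = vectorizer.transform([text])
    # Column 1 of predict_proba belongs to label 1 when clf.classes_ is [0, 1].
    probability = float(clf.predict_proba(features)[0][1])
    label = "high-risk" if probability >= threshold else "neutral"
    return label, probability


# Stand-in objects that mimic the two method calls used above (not real models).
demo_vectorizer = SimpleNamespace(transform=lambda texts: texts)
demo_clf = SimpleNamespace(predict_proba=lambda rows: [[0.3, 0.7] for _ in rows])

demo_label, demo_probability = predict_post_risk_with_score(
    "Example title", "Example body", demo_clf, demo_vectorizer, str.lower
)
strict_label, _ = predict_post_risk_with_score(
    "Example title", "Example body", demo_clf, demo_vectorizer, str.lower, threshold=0.8
)
```

With the real artifacts, the call would pass the loaded `clf`, `vectorizer`, and `clean_text`. Lowering the threshold flags more posts, raising recall at the cost of precision.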


## 12. Flask Web Demo

**File:** `app.py`

- Accepts Reddit post URLs via web form.
- Fetches post content using PRAW.
- Cleans and classifies the post.
- Displays result with risk color coding.

**Key Flask route logic:**

```python
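import re

from flask import Flask, request, render_template
import praw
import joblib
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Same preprocessing as Section 6; it must match what the vectorizer was fitted on.
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)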

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="YOUR_USER_AGENT"
)

app = Flask(__name__)
clf = joblib.load("suicide_risk_logreg_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

def extract_post_from_url(url):
    submission = reddit.submission(url=url)
    return submission.title, submission.selftext

@app.route("/", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        url = request.form.get("url", "")
        title, post_text = extract_post_from_url(url)
        text = clean_text(f"{title} {post_text}")
        X_new = vectorizer.transform([text])
        prediction = clf.predict(X_new)[0]
        label = "high-risk" if prediction == 1 else "neutral"
        return render_template("result.html", label=label)
    return render_template("form.html")

if __name__ == "__main__":
    # Start the development server so that `python app.py` works.
    app.run()
```
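
The route above passes whatever the form submits straight to PRAW. A small check before that call can reject input that is clearly not a Reddit post link. The sketch below uses only the standard library. The helper name `is_reddit_post_url`, the host list, and the path rule are assumptions and may not cover every URL form Reddit uses. The URLs in the usage lines are made up.

```python
from urllib.parse import urlparse

# Hosts accepted by this sketch; extend as needed.
ALLOWED_HOSTS = {"reddit.com", "www.reddit.com", "old.reddit.com", "redd.it"}


def is_reddit_post_url(url):
    """Return True if url looks like a link to a Reddit submission."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host not in ALLOWED_HOSTS:
        return False
    if host == "redd.it":
        # Short links carry only the submission ID in the path.
        return len(parsed.path.strip("/")) > 0
    return "/comments/" in parsed.path


print(is_reddit_post_url("https://www.reddit.com/r/test/comments/abc123/some_title/"))
print(is_reddit_post_url("https://example.com/r/test/comments/abc123/"))
```

In the route, a failed check could re-render `form.html` with an error message instead of calling `extract_post_from_url`.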


## 13. How to Use the Demo

1. Run the Flask app:

```bash
python app.py
```

2. Open a browser at `http://127.0.0.1:5000/`.
3. Paste a Reddit post URL into the form.
4. View the risk prediction and color-coded result.

## 14. File Structure

| File/Folder | Purpose |
| :-- | :-- |
| reddit_posts_preprocessed.csv | Preprocessed dataset |
| suicide_risk_logreg_model.pkl | Trained logistic regression model |
| tfidf_vectorizer.pkl | Fitted TF-IDF vectorizer |
| app.py | Flask web application |
| templates/ | HTML templates for Flask |
| static/style.css | CSS for web interface |

## 15. Notes

- Model performance is limited by data quality and representativeness.
- For production, secure API credentials and sanitize user input.
- The classifier uses only text features; no user metadata is processed.
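
One way to keep API credentials out of the source code is to read them from environment variables. The following is a minimal sketch. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are assumptions chosen for illustration, not names the project defines.

```python
import os

# Hypothetical environment variable names for the three PRAW settings.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(environ=os.environ):
    """Read Reddit API credentials from the environment, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": environ["REDDIT_CLIENT_ID"],
        "client_secret": environ["REDDIT_CLIENT_SECRET"],
        "user_agent": environ["REDDIT_USER_AGENT"],
    }
```

The returned dictionary matches the keyword arguments used earlier, so the client can be created with `praw.Reddit(**load_reddit_credentials())`.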

This notebook documents the entire pipeline from data collection to live risk prediction.

