## Reddit API
- This notebook extracts Reddit posts:
    - As a first approach, we collect posts that match ADHD-related keywords in related subreddits.
    - As an improvement, the model could learn on its own (for example, from previously collected data) which subreddits are worth scraping.


    General composition of a post:
    -  id: The post’s ID
    -  title: The post’s title
    -  text: The post’s text
    -  author: The post’s author
    -  created_utc: The post’s creation time in UTC
    -  score: The post’s score
    -  num_comments: The number of comments on the post
    -  permalink: The post’s permalink
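
The fields listed above can be gathered into a plain dictionary before storage. The following is a minimal sketch, assuming a PRAW-style submission object on which the post text is exposed as `selftext`; `post_to_record` is a hypothetical helper name, and the example object is a stand-in so that the sketch runs without Reddit credentials.

```python
from types import SimpleNamespace


def post_to_record(post):
    """Map a PRAW-style submission to the fields listed above (hypothetical helper)."""
    return {
        "id": post.id,
        "title": post.title,
        "text": post.selftext,
        "author": str(post.author),  # author is None when the account was deleted
        "created_utc": post.created_utc,
        "score": post.score,
        "num_comments": post.num_comments,
        "permalink": post.permalink,
    }


# Stand-in object with made-up values (assumption, not real Reddit data).
example_post = SimpleNamespace(
    id="abc123",
    title="Example title",
    selftext="Example body",
    author=None,
    created_utc=1723748809.0,
    score=1,
    num_comments=0,
    permalink="/r/ADHD/comments/abc123/example_title/",
)
record = post_to_record(example_post)
```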

## Connections


In [1]:
import os
from dotenv import load_dotenv
import praw

#### Connection to the Reddit API

In [9]:

# Load environment variables from .env file
load_dotenv()

try:
    reddit = praw.Reddit(
        client_id=os.getenv("REDDIT_CLIENT_ID"),
        client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
        user_agent=os.getenv("REDDIT_USER_AGENT"),
        username=os.getenv("REDDIT_USERNAME"),
        password=os.getenv("REDDIT_PASSWORD")
    )
    print(f"Connected! Logged in as: {reddit.user.me()}")
except Exception as e:
    print("An error occurred:", e)

Connected! Logged in as: ProfessorMiddle1326
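
`os.getenv` returns `None` for a variable that is not set, so a missing entry in `.env` may only surface later as a less specific error. A minimal sketch of an up-front check, assuming the five variable names used in the cell above; `missing_reddit_vars` is a hypothetical helper.

```python
import os

# Variable names taken from the connection cell above.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_reddit_vars(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_reddit_vars({"REDDIT_CLIENT_ID": "x"})
```

Calling `missing_reddit_vars()` right after `load_dotenv()` and stopping when the list is non-empty would make a configuration problem visible before any request is sent.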


#### Connect to MongoDB

In [2]:
from pymongo import MongoClient

# MongoDB host (hard-coded here, not read from the .env file)
mongo='127.0.0.1'

try:
    # Connect to MongoDB
    myclient = MongoClient(
                        "mongodb://"+mongo+":27017/") #Mongo URI format
    db=myclient['reddit']
    
    print("Connected to MongoDB successfully!")
except Exception as e:
    print("An error occurred while connecting to MongoDB:", e)

Connected to MongoDB successfully!
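
`MongoClient` connects lazily, so the success message above can be printed even when no server is listening. A minimal sketch of an explicit check, assuming that sending a `ping` command is acceptable in this setup; `is_mongo_reachable` is a hypothetical helper.

```python
def is_mongo_reachable(client):
    """Return True if the server answers a ping command (hypothetical helper)."""
    try:
        client.admin.command("ping")
        return True
    except Exception:
        return False


# Intended use with the client from the cell above (needs pymongo and a running server):
# myclient = MongoClient("mongodb://127.0.0.1:27017/", serverSelectionTimeoutMS=2000)
# print(is_mongo_reachable(myclient))
```

The `serverSelectionTimeoutMS` option shortens the wait when the server is down; the value 2000 is only an illustrative choice.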


#### Connect to Redis
 

In [11]:
import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

In [12]:
r.keys()

[]

## Scrape Data Using Reddit Search

#### Scrape using Reddit's search bar (keyword search):
- Define the search keywords:
["adhd", "diagnose","energy", "brain", "test", "distracted", "forgetful", "doctor","work","task","disord","struggl","focu","dysfunct","forgot","lazi","prescrib","medic","medicin","pill","self diagnosis","self medication"]
- Sort orders to try for each keyword: ["relevance", "hot", "top", "new", "comments"]
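
Every keyword is searched once per sort order, so the number of search calls is the product of the two list lengths. A small sketch of that search plan, using the two lists above; `search_plan` is a name introduced here for illustration.

```python
from itertools import product

keywords = ["adhd", "diagnose", "energy", "brain", "test", "distracted", "forgetful", "doctor",
            "work", "task", "disord", "struggl", "focu", "dysfunct", "forgot", "lazi",
            "prescrib", "medic", "medicin", "pill", "self diagnosis", "self medication"]
sort_orders = ["relevance", "hot", "top", "new", "comments"]

# One (keyword, sort order) pair per search call.
search_plan = list(product(keywords, sort_orders))
print(len(search_plan))  # 22 keywords x 5 sort orders = 110
```

Because the same post can be returned by several of these searches, the results need to be de-duplicated before storage.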


In [20]:
Querykeywords=["adhd", "diagnose","energy", "brain", "test", "distracted", "forgetful", "doctor"
                  ,"work","task","disord","struggl","focu","dysfunct","forgot","lazi","prescrib","medic","medicin","pill","self diagnosis","self medication"]
sortingTechniques=["relevance", "hot", "top", "new", "comments"]


In [18]:
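import datetime

def get_month(dt):
    # Parameter renamed so it no longer shadows the datetime module
    return dt.month

def get_year(dt):
    return dt.year

def get_utc_time(timestamp):
    # Convert a Unix timestamp to a timezone-aware UTC datetime
    # (datetime.utcfromtimestamp is deprecated since Python 3.12)
    utc_time = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    return utc_time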
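
The scraping loop further down keeps only posts created after 2019. A minimal sketch of that year filter as a standalone function, assuming `created_utc` is a Unix timestamp in seconds; `created_year` and `is_recent` are hypothetical names.

```python
from datetime import datetime, timezone


def created_year(created_utc):
    """Year (in UTC) of a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


def is_recent(created_utc, min_year=2020):
    """True when the post was created in min_year or later (same as year > 2019)."""
    return created_year(created_utc) >= min_year


# Timestamp of the sample post shown later in this notebook.
print(created_year(1723748809.0), is_recent(1723748809.0))
```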

In [21]:
import pandas as pd
# Subreddit to target
subreddit_name = "ADHD"
subreddit = reddit.subreddit(subreddit_name)
posts = []

#Querykeywords=["self diagnosis","self medication"]

for keyword in Querykeywords:
    for sorting in sortingTechniques:
        print("Searching for keyword:", keyword, "using sorting technique:", sorting)
        # Search the subreddit for the keyword over all time, with the current sort order
        for post in subreddit.search(query=keyword, sort=sorting, syntax='cloudsearch',
                                     time_filter='all', limit=10000):
            datecreated = get_utc_time(post.created_utc)
            year = datecreated.year
            # Keep posts created after 2019; sadd returns 1 only when the id
            # was not already in the Redis set, which filters out duplicates
            if year > 2019 and r.sadd('reddit_posts', post.id):
                posts.append({
                    "id": post.id,
                    "title": post.title,
                    "author": str(post.author),
                    "score": post.score,
                    "num_comments": post.num_comments,
                    "upvote_ratio": post.upvote_ratio,
                    "url": post.url,
                    "subreddit": post.subreddit.display_name,
                    "created_at": post.created_utc,
                    "self_text": post.selftext,
                    "searchQuery": keyword,
                    "augmented": 0
                })

# Note: insert_many raises an error if posts is empty
db.reddit_posts.insert_many(posts)
Searching for keyword: adhd using sorting technique: relevance


Searching for keyword: adhd using sorting technique: hot
Searching for keyword: adhd using sorting technique: top
Searching for keyword: adhd using sorting technique: new
Searching for keyword: adhd using sorting technique: comments
Searching for keyword: diagnose using sorting technique: relevance
Searching for keyword: diagnose using sorting technique: hot
Searching for keyword: diagnose using sorting technique: top
Searching for keyword: diagnose using sorting technique: new
Searching for keyword: diagnose using sorting technique: comments
Searching for keyword: energy using sorting technique: relevance
Searching for keyword: energy using sorting technique: hot
Searching for keyword: energy using sorting technique: top
Searching for keyword: energy using sorting technique: new
Searching for keyword: energy using sorting technique: comments
Searching for keyword: brain using sorting technique: relevance
Searching for keyword: brain using sorting technique: hot
Searching for keyword: 

InsertManyResult([ObjectId('6759b2886ff094aec1f11277'), ObjectId('6759b2886ff094aec1f11278'), ObjectId('6759b2886ff094aec1f11279'), ObjectId('6759b2886ff094aec1f1127a'), ObjectId('6759b2886ff094aec1f1127b'), ObjectId('6759b2886ff094aec1f1127c'), ObjectId('6759b2886ff094aec1f1127d'), ObjectId('6759b2886ff094aec1f1127e'), ObjectId('6759b2886ff094aec1f1127f'), ObjectId('6759b2886ff094aec1f11280'), ObjectId('6759b2886ff094aec1f11281'), ObjectId('6759b2886ff094aec1f11282'), ObjectId('6759b2886ff094aec1f11283'), ObjectId('6759b2886ff094aec1f11284'), ObjectId('6759b2886ff094aec1f11285'), ObjectId('6759b2886ff094aec1f11286'), ObjectId('6759b2886ff094aec1f11287'), ObjectId('6759b2886ff094aec1f11288'), ObjectId('6759b2886ff094aec1f11289'), ObjectId('6759b2886ff094aec1f1128a'), ObjectId('6759b2886ff094aec1f1128b'), ObjectId('6759b2886ff094aec1f1128c'), ObjectId('6759b2886ff094aec1f1128d'), ObjectId('6759b2886ff094aec1f1128e'), ObjectId('6759b2886ff094aec1f1128f'), ObjectId('6759b2886ff094aec1f112
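
The loop above relies on `r.sadd` returning 1 only when the post id is new to the set. The sketch below isolates that de-duplication step, with an in-memory set standing in for Redis so it can run without a server; `keep_new_posts` and `add_to_seen` are hypothetical names.

```python
def keep_new_posts(posts, add_to_seen):
    """Yield only posts whose id was not seen before.

    add_to_seen mimics Redis SADD: it returns a truthy value when the id
    was newly added and a falsy value when it was already present.
    """
    for post in posts:
        if add_to_seen(post["id"]):
            yield post


# In-memory stand-in for r.sadd('reddit_posts', post_id) (assumption for the demo).
seen = set()


def add_to_seen(post_id):
    if post_id in seen:
        return 0
    seen.add(post_id)
    return 1


batch = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
unique = list(keep_new_posts(batch, add_to_seen))
print([p["id"] for p in unique])
```

With a real Redis connection, passing `lambda post_id: r.sadd('reddit_posts', post_id)` as `add_to_seen` would keep the set across notebook runs, which the in-memory version does not.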

In [41]:
# Fetch all documents from the 'reddit_staging' collection, excluding the annotation fields
all_documents = db.reddit_staging.find({},{'_id':0,'Gender':0,'Mention of Solutions':0,
                                           'Personal_Experience':0,'Self-Diagnosis':0,
                                           'Self-Medication':0,'Sentiment':0,'Topic':0,'augmented':0})

# Convert the documents to a list and print them
documents_list = list(all_documents)
print(documents_list[0])

{'id': '1et3kj0', 'title': 'Diagnosed with Inattentive ADHD at 31. Explains so many things from my childhood.', 'author': 'amadnomad', 'score': 392, 'num_comments': 107, 'upvote_ratio': 0.98, 'url': 'https://www.reddit.com/r/ADHD/comments/1et3kj0/diagnosed_with_inattentive_adhd_at_31_explains_so/', 'subreddit': 'ADHD', 'created_at': 1723748809.0, 'self_text': "Please go out and get tested if you are still on the fence. I always assumed ADHD was only hyperactive. A lot of concerns about day dreaming, zoning out and inattentiveness came into play during my consult. I didn't even consider my lack of sleep being tied to ADHD. But now that I have a diagnoses, it explains quite a bit from my past. I wasn't just lazy and disorganized. \n\n  \nAgain, please go get tested if you suspect anything.", 'searchQuery': 'adhd'}


In [42]:
# Mark each document as staged
for document in documents_list:
    document['staged']=1

# Verify the change
print(documents_list[0])

{'id': '1et3kj0', 'title': 'Diagnosed with Inattentive ADHD at 31. Explains so many things from my childhood.', 'author': 'amadnomad', 'score': 392, 'num_comments': 107, 'upvote_ratio': 0.98, 'url': 'https://www.reddit.com/r/ADHD/comments/1et3kj0/diagnosed_with_inattentive_adhd_at_31_explains_so/', 'subreddit': 'ADHD', 'created_at': 1723748809.0, 'self_text': "Please go out and get tested if you are still on the fence. I always assumed ADHD was only hyperactive. A lot of concerns about day dreaming, zoning out and inattentiveness came into play during my consult. I didn't even consider my lack of sleep being tied to ADHD. But now that I have a diagnoses, it explains quite a bit from my past. I wasn't just lazy and disorganized. \n\n  \nAgain, please go get tested if you suspect anything.", 'searchQuery': 'adhd', 'staged': 1}


In [26]:
db_ingest=myclient['Ingestion_db']

In [43]:
try:
    db_ingest.reddit_ingestion.insert_many(documents_list)
except Exception as e:
    print("An error occurred while inserting documents:", e)
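
`insert_many` rejects an empty list, and a single call with thousands of documents gives no indication of progress. A minimal sketch of a batched insert, assuming any object with an `insert_many` method; `insert_in_batches` and the batch size of 1000 are illustrative choices, not part of the original pipeline.

```python
def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents in fixed-size batches and return how many were sent.

    An empty list results in no call at all, which avoids the error that
    insert_many raises for empty input.
    """
    inserted = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


# Intended use with the collection from the cell above:
# insert_in_batches(db_ingest.reddit_ingestion, documents_list)
```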

In [29]:
db_stage=myclient['Staging_db']

In [30]:
db=myclient['reddit']

In [36]:
reddits_post_augmented=db.reddit_posts.find({},{'_id':0})

In [37]:
reddits_post_augmented_list = list(reddits_post_augmented)

for document in reddits_post_augmented_list:
    document['augmented']=0
    
    



In [39]:
len(reddits_post_augmented_list)

9588

In [40]:
db_stage.reddit_llm.insert_many(reddits_post_augmented_list)

InsertManyResult([ObjectId('6780f8344eb12a9ee92cab7e'), ObjectId('6780f8344eb12a9ee92cab7f'), ObjectId('6780f8344eb12a9ee92cab80'), ObjectId('6780f8344eb12a9ee92cab81'), ObjectId('6780f8344eb12a9ee92cab82'), ObjectId('6780f8344eb12a9ee92cab83'), ObjectId('6780f8344eb12a9ee92cab84'), ObjectId('6780f8344eb12a9ee92cab85'), ObjectId('6780f8344eb12a9ee92cab86'), ObjectId('6780f8344eb12a9ee92cab87'), ObjectId('6780f8344eb12a9ee92cab88'), ObjectId('6780f8344eb12a9ee92cab89'), ObjectId('6780f8344eb12a9ee92cab8a'), ObjectId('6780f8344eb12a9ee92cab8b'), ObjectId('6780f8344eb12a9ee92cab8c'), ObjectId('6780f8344eb12a9ee92cab8d'), ObjectId('6780f8344eb12a9ee92cab8e'), ObjectId('6780f8344eb12a9ee92cab8f'), ObjectId('6780f8344eb12a9ee92cab90'), ObjectId('6780f8344eb12a9ee92cab91'), ObjectId('6780f8344eb12a9ee92cab92'), ObjectId('6780f8344eb12a9ee92cab93'), ObjectId('6780f8344eb12a9ee92cab94'), ObjectId('6780f8344eb12a9ee92cab95'), ObjectId('6780f8344eb12a9ee92cab96'), ObjectId('6780f8344eb12a9ee92cab