## Reddit API
- In this notebook we extract Reddit posts:
    - As a first approach, we collect posts that contain ADHD-related keywords from related subreddits.
    - To improve the model, we could later let it identify on its own (for example, from previously collected data) which subreddits are worth scraping.


    General composition of a post:
    -  id: The post’s ID
    -  title: The post’s title
    -  text: The post’s text
    -  author: The post’s author
    -  created_utc: The post’s creation time in UTC
    -  score: The post’s score
    -  num_comments: The number of comments on the post
    -  permalink: The post’s permalink
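
These fields correspond to attributes of a PRAW submission object. The following is a minimal sketch of that mapping, not the notebook's own code: `post_to_record` is a hypothetical helper name, and it assumes the post's text is exposed as `selftext`, as in the fetch cells of this notebook. The stand-in object and its values are made up so the example runs without API access.

```python
from types import SimpleNamespace


def post_to_record(post):
    """Copy the fields listed above from a submission-like object into a dict."""
    return {
        "id": post.id,
        "title": post.title,
        "text": post.selftext,
        # author can be None (e.g. a deleted account); keep None rather than "None"
        "author": str(post.author) if post.author is not None else None,
        "created_utc": post.created_utc,
        "score": post.score,
        "num_comments": post.num_comments,
        "permalink": post.permalink,
    }


# Stand-in object with made-up values (assumption: not real Reddit data).
example_post = SimpleNamespace(
    id="abc123", title="Example title", selftext="Example body", author=None,
    created_utc=1700000000.0, score=1, num_comments=0,
    permalink="/r/example/comments/abc123/example_title/",
)
record = post_to_record(example_post)
print(record["id"], record["author"])
```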

## Connection to the Reddit API

In [3]:
import os

from dotenv import load_dotenv
import praw

# Load environment variables from .env file
load_dotenv()

try:
    reddit = praw.Reddit(
        client_id=os.getenv("REDDIT_CLIENT_ID"),
        client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
        user_agent=os.getenv("REDDIT_USER_AGENT"),
        username=os.getenv("REDDIT_USERNAME"),
        password=os.getenv("REDDIT_PASSWORD")
    )
    print(f"Connected! Logged in as: {reddit.user.me()}")
except Exception as e:
    print("An error occurred:", e)

Connected! Logged in as: ProfessorMiddle1326
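
`os.getenv` returns `None` for a variable that is not set, so a missing entry in `.env` would only surface later as an authentication error. A small pre-flight check, sketched here under that assumption, can report it earlier. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of PRAW or python-dotenv.

```python
import os

# Variable names used by the connection cell in this notebook.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example with a hand-built mapping instead of the real environment.
fake_env = {"REDDIT_CLIENT_ID": "id", "REDDIT_USER_AGENT": "agent"}
print(missing_env_vars(environ=fake_env))
# ['REDDIT_CLIENT_SECRET', 'REDDIT_USERNAME', 'REDDIT_PASSWORD']
```

Calling `missing_env_vars()` with no arguments after `load_dotenv()` would check the real environment.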


### Text cleaning and tokenization functions

In [23]:
# First check the text for common misspellings, then clean it
from spellchecker import SpellChecker

def correct_spelling(text):
    """
    Correct common misspellings in a text using pyspellchecker.

    Args:
        text (str): The text to correct.

    Returns:
        str: The text with corrected spelling.
        
    """
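    spell = SpellChecker()
    words = text.split()
    # Correct only the words the dictionary does not know. correction() can
    # return None when it finds no candidate, so fall back to the original word.
    # Caveat: unknown words may include domain terms missing from the dictionary.
    corrected_words = [word if word in spell else (spell.correction(word) or word) for word in words]
    return " ".join(corrected_words)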


def clean_text(text):
    # Cleaning function: correct spelling, lowercase, strip punctuation and digits,
    # remove English stop words, then stem each remaining word
    import re
    import string
    import nltk

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    
    text=correct_spelling(text)

    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    #text = re.sub(r'[.]', ' ', text)
    text = re.sub(r'\d+', '', text)
    
    
    stop_words = set(stopwords.words('english'))
    text = " ".join([word for word in text.split() if word not in stop_words])
    


    stemmer = PorterStemmer()
    text = " ".join([stemmer.stem(word) for word in text.split()])     
    return text

def contains_keywords(text, keywords):
    """
    Check if a given text contains at least one keyword from a list after correcting spelling.

    Args:
        text (str): The text to check.
        keywords (list): A list of keywords to look for.

    Returns:
        bool: True if at least one keyword is found, False otherwise.
    """
    # Correct spelling in the text
    corrected_text = correct_spelling(text)
    
    # Convert corrected text to lowercase for case-insensitive matching
    corrected_text = corrected_text.lower()
    
    # Check if any keyword is in the corrected text
    for word in keywords:
        if word.lower() in corrected_text:
            return True
    return False
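
`contains_keywords` uses substring matching, so a short keyword such as `test` also matches inside a longer word such as `latest`. If that is not wanted, matching on whole tokens is one alternative. The sketch below assumes the text has already been cleaned and stemmed; `contains_keyword_tokens` is a hypothetical name, not a function from this notebook.

```python
def contains_keyword_tokens(text, keywords):
    """Return True if any whitespace-separated token equals a keyword."""
    tokens = set(text.lower().split())
    wanted = {keyword.lower() for keyword in keywords}
    return not tokens.isdisjoint(wanted)


keywords = ["test", "work"]
print("test" in "my latest post")                            # substring match: True
print(contains_keyword_tokens("my latest post", keywords))   # token match: False
print(contains_keyword_tokens("i fail the test", keywords))  # token match: True
```

The trade-off is that token matching misses keywords embedded in compound words, which substring matching would catch.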

In [38]:
import datetime

def get_utc_time(timestamp):
    """
    Convert a Unix timestamp (seconds since the epoch) to a UTC datetime object.

    Args:
        timestamp (float): The Unix timestamp, e.g. a post's created_utc value.

    Returns:
        datetime: The timezone-aware UTC datetime object.
    """
    # utcfromtimestamp() is deprecated since Python 3.12; build an aware datetime instead
    utc_time = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    return utc_time
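
As a quick check, a `created_utc` value like the one stored with the posts in this notebook (`1732724389.0`) can be converted and printed in ISO 8601 form. This sketch uses only the standard library; `to_utc_iso` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


print(to_utc_iso(1732724389.0))  # 2024-11-27T16:19:49+00:00
```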

## Scrape to gather keywords
- In this approach we tokenize the 100 top posts in the ADHD subreddit.
- Once we have these keywords, they help us classify new posts as being about ADHD or not.
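
The notebook does not show how the tokens were counted, so the following is only a sketch of one way to rank candidate keywords: count how often each token occurs across the cleaned posts and keep the most frequent ones. `top_tokens` is a hypothetical helper, and the sample strings are made up for illustration.

```python
from collections import Counter


def top_tokens(cleaned_texts, n=20):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return [token for token, _ in counts.most_common(n)]


# Made-up, already-stemmed sample texts (assumption: not real post data).
sample = ["adhd medic doctor", "adhd focu work", "doctor adhd"]
print(top_tokens(sample, n=2))  # ['adhd', 'doctor']
```

A ranked list like this would still need manual review, since frequent tokens are not necessarily specific to ADHD.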

In [26]:
from nltk.stem import PorterStemmer

final_keywords = ["adhd", "diagnose","energy", "brain", "test", "distracted", "forgetful", "doctor"
                  ,"work","task","disord","struggl","focu","dysfunct","forgot","lazi","prescrib","medic","medicin","pill"]

ps = PorterStemmer()

final_stemmed_keywords = [ps.stem(word) for word in final_keywords]
print("Final Keywords:", final_keywords)
print("Final Stemmed Keywords:", final_stemmed_keywords)

Final Keywords: ['adhd', 'diagnose', 'energy', 'brain', 'test', 'distracted', 'forgetful', 'doctor', 'work', 'task', 'disord', 'struggl', 'focu', 'dysfunct', 'forgot', 'lazi', 'prescrib', 'medic', 'medicin', 'pill']
Final Stemmed Keywords: ['adhd', 'diagnos', 'energi', 'brain', 'test', 'distract', 'forget', 'doctor', 'work', 'task', 'disord', 'struggl', 'focu', 'dysfunct', 'forgot', 'lazi', 'prescrib', 'medic', 'medicin', 'pill']


In [None]:
import pandas as pd
# Subreddit to target
subreddit_name = "ADHD"
subreddit = reddit.subreddit(subreddit_name)

# Fetch posts
posts = []
for post in subreddit.top(limit=1):  # 'hot', 'new', or 'top' posts
    if contains_keywords(clean_text(post.title), final_stemmed_keywords):
        posts.append({
            "title": post.title,
            "text": post.selftext,
        })
# TODO: get the list of attributes available on a post fetched through the Reddit API

# Extract the keywords from the posts
df = pd.DataFrame(posts)
df['text_cleaned'] = df['text'].apply(clean_text)
df['title_cleaned'] = df['title'].apply(clean_text)
df_tokenize = pd.DataFrame(index=df.index)
# Join with a space so the last word of the text and the first word of the title stay separate
df_tokenize['text'] = df['text_cleaned'] + " " + df['title_cleaned']

KeyboardInterrupt: 

In [79]:
import pandas as pd
# Subreddit to target
subreddit_name = "ADHD"
subreddit = reddit.subreddit(subreddit_name)

# Fetch posts
posts = []
for post in subreddit.new(limit=1000):  # 'hot', 'new', or 'top' posts
    # Keep the post if its title matches; check the body only when the title does not
    if (contains_keywords(clean_text(post.title), final_stemmed_keywords)
            or contains_keywords(clean_text(post.selftext), final_stemmed_keywords)):
        posts.append({
            "id": post.id,
            "title": post.title,
            "author": str(post.author),
            "score": post.score,
            "num_comments": post.num_comments,
            "upvote_ratio": post.upvote_ratio,
            "url": post.url,
            "subreddit": post.subreddit.display_name,
            "created_at": post.created_utc,
            "self_text": post.selftext,
        })
        

    
    



#### Connect to MongoDB

In [69]:
from pymongo import MongoClient

# Load environment variables from .env file
load_dotenv()
mongo='127.0.0.1'

try:
    # Connect to MongoDB
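    myclient = MongoClient(
                        "mongodb://"+mongo+":27017/",  # Mongo URI format
                        username='admin',
                        password='admin')
    # MongoClient connects lazily, so ping the server to confirm it is reachable
    myclient.admin.command('ping')
    db=myclient['reddit_posts']
    print("Connected to MongoDB successfully!")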
except Exception as e:
    print("An error occurred while connecting to MongoDB:", e)

Connected to MongoDB successfully!


In [70]:
db.reddit_posts.insert_one({"x": 1})

InsertOneResult(ObjectId('67474641202638947092e3d5'), acknowledged=True)

#### Connect to Redis
 

In [11]:
import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

#### Now let's store the posts
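
The storage step can use a Redis set as a deduplication guard: `SADD` returns 1 when the member is new and 0 when it was already present, so a post is written to MongoDB only the first time its ID is seen. The sketch below shows that pattern with in-memory stand-ins so it runs without either service. `store_new_posts`, `FakeSet`, and `FakeCollection` are hypothetical names; only the `sadd` and `insert_one` method names mirror the real clients.

```python
class FakeSet:
    """In-memory stand-in for a Redis client holding named sets."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        # Mimic SADD: 1 if the member was added, 0 if it already existed.
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


class FakeCollection:
    """In-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


def store_new_posts(posts, seen, collection, key="reddit_posts"):
    """Insert each post whose id has not been seen before; return inserted ids."""
    inserted = []
    for post in posts:
        if seen.sadd(key, post["id"]):
            collection.insert_one(post)
            inserted.append(post["id"])
    return inserted


seen, collection = FakeSet(), FakeCollection()
batch = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]
print(store_new_posts(batch, seen, collection))  # ['a1', 'b2']
print(store_new_posts(batch, seen, collection))  # []
```

Running the same batch twice inserts nothing the second time, which is the property that makes repeated scraping runs safe.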

In [77]:
for p in posts:
    print(p)
    print('\n')

{'id': '1h183id', 'title': 'Today i had 2 seperate job interviews and they both forgot to call me on the agreed time somehow.', 'author': 'Doctor-lasanga', 'score': 1, 'num_comments': 1, 'upvote_ratio': 1.0, 'url': 'https://www.reddit.com/r/ADHD/comments/1h183id/today_i_had_2_seperate_job_interviews_and_they/', 'subreddit': 'ADHD', 'created_at': 1732724389.0, 'self_text': "What's up my beloved focus-deprived friends. Today was the day i was supposed to have 2 seperate job interviews that would finally give me a reason to go outside and be a standard member of society. What i failed to take in consideration was the unforseen situation of both forgetting that we had an appointment and not call at all.\n\ni was supposed to get one at 10 am today so i woke up at 9 to make myself look presentable for the teams-call but as time went on i realised that the appontment was not happening and so i send them an email detailing my displeasure. I spend the rest of the day in waiting mode for the nex

In [80]:
r.delete('reddit_posts')

1

In [81]:
for p in posts:
    if r.sadd('reddit_posts', p['id']):
        db.reddit_posts.insert_one(p)
        print('inserted',p['id'])
        

inserted 1h183id
inserted 1h17utj
inserted 1h17ubs
inserted 1h17qzi
inserted 1h17puf
inserted 1h17jh1
inserted 1h17byl
inserted 1h17arp
inserted 1h17700
inserted 1h16dv8
inserted 1h16df2
inserted 1h16a19
inserted 1h15zyj
inserted 1h15y4b
inserted 1h15vcl
inserted 1h15tlj
inserted 1h15ptn
inserted 1h15o0d
inserted 1h15m56
inserted 1h15kaa
inserted 1h15av7
inserted 1h156se
inserted 1h14yrm
inserted 1h14rrz
inserted 1h14ohe
inserted 1h13wls
inserted 1h13j0p
inserted 1h13ar3
inserted 1h1376u
inserted 1h12u1a
inserted 1h12fhq
inserted 1h11o72
inserted 1h11hiy
inserted 1h11hec
inserted 1h11fho
inserted 1h11djh
inserted 1h11b6h
inserted 1h118su
inserted 1h10zv4
inserted 1h10tmt
inserted 1h10szc
inserted 1h10oj3
inserted 1h106ng
inserted 1h102sz
inserted 1h0zqtd
inserted 1h0zqmr
inserted 1h0zoye
inserted 1h0zj04
inserted 1h0ze92
inserted 1h0z7dm
inserted 1h0z1vj
inserted 1h0yvz4
inserted 1h0yvlz
inserted 1h0yugm
inserted 1h0yscq
inserted 1h0yoj5
inserted 1h0y73l
inserted 1h0y3eu
inserted 1h0xz