## Reddit API
- In this notebook we extract Reddit posts:
    - As a first approach, we collect posts that contain ADHD-related keywords from related subreddits.
    - To improve the model later, we could let it identify on its own (using previously collected data, for example) which subreddits are worth scraping.


    General composition of a post:
    -  id: The post’s ID
    -  title: The post’s title
    -  text: The post’s text
    -  author: The post’s author
    -  created_utc: The post’s creation time in UTC
    -  score: The post’s score
    -  num_comments: The number of comments on the post
    -  permalink: The post’s permalink
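
As a rough sketch, the fields listed above can be pulled from a PRAW submission object into a plain dictionary. The helper name `post_to_record` is hypothetical and not part of this notebook; note that PRAW exposes the body as `selftext` rather than `text`, and `author` can be `None` for deleted accounts. A stand-in object with made-up values is used so the example runs without API credentials.

```python
from types import SimpleNamespace


def post_to_record(post):
    """Map a PRAW-style submission to a flat dict (hypothetical helper)."""
    return {
        "id": post.id,
        "title": post.title,
        "text": post.selftext,  # PRAW names the post body `selftext`
        "author": str(post.author) if post.author is not None else None,
        "created_utc": post.created_utc,
        "score": post.score,
        "num_comments": post.num_comments,
        "permalink": post.permalink,
    }


# Stand-in for a real submission; every value here is made up for illustration
fake_post = SimpleNamespace(
    id="abc123", title="Example title", selftext="Example body", author=None,
    created_utc=1700000000.0, score=10, num_comments=2,
    permalink="/r/ADHD/comments/abc123/example_title/",
)
record = post_to_record(fake_post)
print(record["title"], record["author"])
```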

## Connection to the Reddit API

In [4]:
import os

from dotenv import load_dotenv
import praw

# Load environment variables from .env file
load_dotenv()

try:
    reddit = praw.Reddit(
        client_id=os.getenv("REDDIT_CLIENT_ID"),
        client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
        user_agent=os.getenv("REDDIT_USER_AGENT"),
        username=os.getenv("REDDIT_USERNAME"),
        password=os.getenv("REDDIT_PASSWORD")
    )
    print(f"Connected! Logged in as: {reddit.user.me()}")
except Exception as e:
    print("An error occurred:", e)

Connected! Logged in as: ProfessorMiddle1326


### Text cleaning and tokenization functions

In [91]:
# First correct common misspellings in the text, then clean it
from spellchecker import SpellChecker

def correct_spelling(text):
    """
    Correct common misspellings in a text using pyspellchecker.

    Args:
        text (str): The text to correct.

    Returns:
        str: The text with corrected spelling.
        
    """
    spell = SpellChecker()
    words = text.split()
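    # Only correct words the dictionary does not know; correction() can return None,
    # in which case the original word is kept.
    corrected_words = [word if word in spell else (spell.correction(word) or word) for word in words]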
    return " ".join(corrected_words)


def clean_text(text):
    """Correct spelling, lowercase, strip punctuation and digits, remove stopwords, then stem."""
    import re

    # Requires the NLTK stopword corpus: nltk.download("stopwords")
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    text = correct_spelling(text)

    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    #text = re.sub(r'[.]', ' ', text)
    text = re.sub(r'\d+', '', text)
    
    
    stop_words = set(stopwords.words('english'))
    text = " ".join([word for word in text.split() if word not in stop_words])
    


    stemmer = PorterStemmer()
    text = " ".join([stemmer.stem(word) for word in text.split()])     
    return text

def contains_keywords(text, keywords):
    """
    Check if a given text contains at least one keyword from a list after correcting spelling.

    Args:
        text (str): The text to check.
        keywords (list): A list of keywords to look for.

    Returns:
        bool: True if at least one keyword is found, False otherwise.
    """
    # Correct spelling in the text
    corrected_text = correct_spelling(text)
    
    # Convert corrected text to lowercase for case-insensitive matching
    corrected_text = corrected_text.lower()
    
    # Check if any keyword is in the corrected text
    for word in keywords:
        if word.lower() in corrected_text:
            return True
    return False
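
One caveat with the substring test in `contains_keywords`: a short keyword such as `work` also matches inside `network`. A possible alternative, sketched below with only the standard library, is to anchor each keyword at the start of a word, which still lets a stemmed keyword such as `struggl` match `struggling`. The name `contains_keyword_prefix` is an assumption made for illustration, and spelling correction is left out to keep the sketch self-contained.

```python
import re


def contains_keyword_prefix(text, keywords):
    """Return True if any word in `text` starts with one of the keywords."""
    lowered = text.lower()
    for keyword in keywords:
        # \b anchors the keyword at a word boundary, so "work" matches
        # "working" but not "network"
        if re.search(r"\b" + re.escape(keyword.lower()), lowered):
            return True
    return False


print(contains_keyword_prefix("I keep struggling to focus", ["struggl"]))  # True
print(contains_keyword_prefix("My network is down", ["work"]))             # False
```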

## Scrape to gather keywords
- In this approach we tokenize the top 100 posts in the ADHD subreddit.
- These keywords will then help us classify new posts as being about ADHD or not.

In [None]:
import pandas as pd
# Subreddit to target
subreddit_name = "ADHD"
subreddit = reddit.subreddit(subreddit_name)

# Fetch posts
posts = []
for post in subreddit.top(limit=100):  # subreddit.hot or subreddit.new could be used instead
    posts.append({
        "title": post.title,
        "text": post.selftext,
    })
    
# Extract keywords from the posts
df = pd.DataFrame(posts)
df['text_cleaned'] = df['text'].apply(clean_text)
df['title_cleaned'] = df['title'].apply(clean_text)
df_tokenize = pd.DataFrame(index=df.index)
# Join with a space so the last word of the text does not fuse with the first word of the title
df_tokenize['text'] = df['text_cleaned'] + " " + df['title_cleaned']

                                                title  \
0                     How I cured my adhd permanently   
1   I went through 700 reddit comments and collect...   
2   ADHD for me is laying down on my couch using m...   
3   It feels like there aren’t enough hours in the...   
4   It's so damn irritating to be intelligent with...   
..                                                ...   
95  Universities move online amid COVID19, create ...   
96  I have fake conversations in my head all day long   
97  I'm gonna do it. 2023 is the year I start coll...   
98  Every evening I feel like I wasted my entire d...   
99  I’ve brushed my teeth for twenty-seven days st...   

                                                 text  
0   I've been suffering from adhd my whole life, f...  
1   So there was that awesome [Reddit thread](http...  
2   It’s not like I don’t care. I’m stressed out o...  
3                                         IM OVER IT.  
4   So I've always been told I'm sm

In [92]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the cleaned text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df_tokenize['text'])

# Get the feature names (tokens)
tokens = tfidf_vectorizer.get_feature_names_out()




# Sum up the TF-IDF scores of each token
tfidf_scores = np.asarray(tfidf_matrix.sum(axis=0)).flatten()

# Create a DataFrame with tokens and their TF-IDF scores
tfidf_df = pd.DataFrame({'token': tokens, 'score': tfidf_scores})

# Sort by TF-IDF score in descending order and keep the top 100 tokens
top_100_tfidf = tfidf_df.sort_values(by='score', ascending=False).head(100)

keywords = top_100_tfidf['token'].tolist()
print("Top 100 interesting keywords based on TF-IDF scores:")
for i, keyword in enumerate(keywords):
    if i % 10 == 0:
        print('\n')
    print(f"| {keyword} ", end=" ")



Top 100 interesting keywords based on TF-IDF scores:


| adhd  | like  | feel  | time  | peopl  | know  | fuck  | work  | day  | life  

| thing  | someth  | need  | start  | make  | help  | hour  | edit  | use  | want  

| tri  | everi  | think  | realli  | task  | way  | actual  | someon  | thank  | read  

| person  | post  | anyth  | minut  | brain  | say  | got  | said  | year  | told  

| mani  | good  | word  | right  | mean  | disord  | went  | save  | check  | everyth  

| struggl  | attent  | alway  | watch  | thought  | guy  | pm  | focu  | kid  | better  

| ask  | execut  | els  | night  | end  | bad  | parent  | stuff  | lot  | love  

| eat  | self  | instead  | stop  | anyon  | dysfunct  | bed  | spend  | phone  | relat  

| write  | child  | care  | learn  | pay  | sleep  | tell  | remind  | talk  | late  

| shit  | forgot  | dish  | noth  | today  | week  | lazi  | abil  | hard  | point  
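
The ranking above comes from summing each token's TF-IDF weight over all posts. To make that scoring concrete, here is a minimal pure-Python sketch on a made-up toy corpus. It uses the smoothed IDF and L2 row normalisation that scikit-learn documents as its defaults, but with plain whitespace tokenisation and no stop-word removal, so its numbers are not guaranteed to match `TfidfVectorizer`. The function name `summed_tfidf` is hypothetical.

```python
import math
from collections import Counter


def summed_tfidf(documents):
    """Sum L2-normalised TF-IDF weights per token across whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    totals = Counter()
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # empty document contributes nothing
        for t, w in weights.items():
            totals[t] += w / norm
    return totals


docs = ["adhd brain focus", "adhd task", "adhd brain"]  # toy corpus, not real posts
scores = summed_tfidf(docs)
top_tokens = [token for token, _ in scores.most_common(2)]
print(top_tokens)
```

Tokens that appear in many posts, like `adhd` here, get a low IDF but still accumulate a high total because they contribute to every row, which is why very common words dominate the list above.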

- Now let's manually select the common words that best address our needs

In [90]:
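from nltk.stem import PorterStemmer

# `ps` was not defined in the cells shown; assuming it is a PorterStemmer,
# which is consistent with the stemmed output below
ps = PorterStemmer()

final_keywords = ["adhd", "diagnose","energy", "brain", "test", "distracted", "forgetful", "doctor"
                  ,"work","task","disord","struggl","focu","dysfunct","forgot","lazi","prescrib","medic","medicin","pill"]

final_stemmed_keywords = [ps.stem(word) for word in final_keywords]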
print("Final Keywords:", final_keywords)
print("Final Stemmed Keywords:", final_stemmed_keywords)

Final Keywords: ['adhd', 'diagnose', 'energy', 'brain', 'test', 'distracted', 'forgetful', 'doctor', 'work', 'task', 'disord', 'struggl', 'focu', 'dysfunct', 'forgot', 'lazi', 'prescrib', 'medic', 'medicin', 'pill']
Final Stemmed Keywords: ['adhd', 'diagnos', 'energi', 'brain', 'test', 'distract', 'forget', 'doctor', 'work', 'task', 'disord', 'struggl', 'focu', 'dysfunct', 'forgot', 'lazi', 'prescrib', 'medic', 'medicin', 'pill']


### Now let's use these words as a first filter on posts (about ADHD or not)