# Reddit Classification 

## Notebook 1 of 5 - Scraping Data
- There are a total of 5 notebooks.
- Due to the processing power required, the modelling is separated into 3 code notebooks (Notebooks 3, 4 and 5).

## Introduction
- Mental health is critically important to everyone, everywhere. All over the world, mental health needs are high but responses are insufficient and inadequate. This holds in all countries, including first world countries. One of the most pervasive issues is stigma. Stigma wears many faces. We most commonly equate it with how we treat one another. However, that represents only part of the issue; personal shame, internalized through an individual’s mental health suffering, is a silent problem. We must normalize talking about mental health and its multitude of conditions because stigma is the chain onto which all mental health conditions link. [WHO Article](https://www.who.int/news-room/commentaries/detail/world-mental-health-day-is-an-opportunity-for-us-to-embrace-our-sense-of-community-and-normalize-mental-health)
- As a result, some people do not know whether they need help, while others know they need help but do not know where to look for it.
- Because of stigma, people may try to self-medicate, or they may write about their feelings on social media platforms without seeking help.
- This project is a first step towards correctly classifying such posts by topic using NLP.

## Problem Statement
1. Through natural language processing and classification models, how can we classify posts based on the text written by people who may be depressed or anxious?

2. How can sentiment analysis be utilized to detect primary and secondary emotions from the posts?

- The goal is to correctly classify posts as anxiety or depression, based on posts from these two subreddits.
- The vectorizers that will be deployed are CountVectorizer and TfidfVectorizer.
- The models used are Logistic Regression, K-Nearest Neighbors, MultinomialNB and Random Forest, together with models from PyCaret.
- The main metric used for evaluating the performance of the models is accuracy.
- Hugging Face models will also be explored in these notebooks.
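
To make the two vectorizers named above more concrete, here is a minimal plain-Python sketch of the idea behind them: count vectors (as in CountVectorizer) and counts weighted by a smoothed inverse document frequency (the idea behind TfidfVectorizer). The example posts, the whitespace tokenizer and all variable names are assumptions made for illustration; the later notebooks use the scikit-learn classes, which tokenize differently and also normalize each row.

```python
import math
from collections import Counter

# Hypothetical example posts, not taken from the scraped datasets.
docs = [
    "i feel anxious about work",
    "i feel sad and tired",
    "work makes me anxious",
]

def tokenize(text):
    # Simplified whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

# Vocabulary: sorted unique tokens across all documents.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count vectors: one row per document, one column per vocabulary term.
count_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[term] for term in vocab])

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: rarer terms get a larger weight.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# TF-IDF weights: term count times IDF (no row normalization in this sketch).
tfidf_matrix = [[row[j] * idf[j] for j in range(len(vocab))] for row in count_matrix]

print(vocab)
print(count_matrix[0])
```

Terms that appear in fewer posts (such as "sad" here) receive a higher weight than terms shared across posts (such as "anxious"), which is why TF-IDF can help separate two closely related topics.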

## Target Audience
This project is presented from the perspective of a data science enthusiast and fellow Reddit user who is speaking at a World Health Day convention to promote the use of NLP for helping moderators detect emotions in redditors' posts.

## Datasets
- There are two datasets from two subreddits, 'Anxiety' and 'depression', which were scraped using the Pushshift API.
- Each dataset has more than 15,000 posts.
- The scrape starts with the latest post before 03 Oct 2022 and works backwards to earlier posts.
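
The start date is passed to the API as a Unix epoch timestamp. The short sketch below shows how 03 Oct 2022 at midnight UTC maps to the value 1664755200 used in the scraping function, and how to convert it back. The variable names are chosen for illustration only.

```python
import datetime

# Midnight UTC on 03 Oct 2022, built as a timezone-aware datetime.
start = datetime.datetime(2022, 10, 3, tzinfo=datetime.timezone.utc)

# Convert to a Unix epoch timestamp (seconds since 1970-01-01 UTC).
epoch = int(start.timestamp())
print(epoch)

# Convert the epoch value back to a UTC datetime to double-check it.
round_trip = datetime.datetime.fromtimestamp(epoch, tz=datetime.timezone.utc)
print(round_trip.strftime('%Y-%m-%d %H:%M:%S'))
```

Passing `tz=datetime.timezone.utc` matters: without it, `fromtimestamp` returns the time in the local timezone of the machine running the notebook.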

# Scraping Reddit posts using Pushshift API

In [3]:
import requests
import pandas as pd
import datetime

In [17]:
# Function to grab posts from the subreddit.
 
def getposts(subreddit): 
 
    timing = 1664755200 # the timing is set at GMT+0 Monday 12.00am 03 Oct 2022
    posts_number = 0 # set posts_number to zero. the loop will make use of this.
    total_post = [] # create an empty list for the post

    while True:
        url = 'https://api.pushshift.io/reddit/search/submission' # pushshift API site
        params = {
            'subreddit': subreddit, # the subreddit 
            'size': 250, # max limit for Pushshift API
            'before': timing # the first time will be the one defined above.
        }
        print(f'Scraping through subreddit: {subreddit}. Total post so far: {posts_number}') # show count
        res = requests.get(url, params) # requesting url with the params
        data = res.json() # save from the website into json and then store it as data.
        posts = data['data'] # inside column 'data' in data is where the information we want.
                
        if not posts: # stop if the API returns no more posts, otherwise posts[-1] below would fail
            print(f'No more posts returned. Total post : {posts_number}')
            break

        for raw_post in posts:
            post = {}
            try:
                # convert the epoch time to a UTC datetime and save it as a formatted string
                created = datetime.datetime.fromtimestamp(raw_post['created_utc'], tz=datetime.timezone.utc)
                post['date_time'] = created.strftime('%Y-%m-%d %H:%M:%S')
            except KeyError:
                post['date_time'] = 'post deleted' # if the field is missing, input text as post deleted

            # for the remaining fields, fall back to 'post deleted' when the key is missing
            post['subreddit'] = raw_post.get('subreddit', 'post deleted') # save the subreddit name
            post['selftext'] = raw_post.get('selftext', 'post deleted') # save the selftext data
            post['title'] = raw_post.get('title', 'post deleted') # save the post title

            total_post.append(post) # append the current post dictionary to the total_post list.

        posts_number += len(posts) # update the running count used by the stopping condition below.
        timing = posts[-1]['created_utc'] # set the new timing from the last post in the posts list
        

        if posts_number >= 15000: # if this is true, the while loop breaks
            print(f" Completed. Total post : {posts_number}") # once its done, it will print completed
            break
    
    posts_df = pd.DataFrame(total_post) # saves the total_post from the subreddit and place it in a posts_df(dataframe)
    posts_df.to_csv('./datasets/'+subreddit+'.csv') # saves the file name into dataset folder, with the subreddit name.
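
The loop above pages backwards through time: each request asks for posts older than `before`, and the oldest timestamp of the returned page becomes the next `before`. The following sketch isolates that pattern and runs it against a small in-memory stand-in instead of the real API. `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical names invented for this illustration and are not part of Pushshift.

```python
def paginate(fetch_page, start_before, target):
    """Collect posts by walking backwards in time until `target` posts are reached."""
    before = start_before
    collected = []
    while len(collected) < target:
        page = fetch_page(before)
        if not page:  # nothing older is available, so stop early
            break
        collected.extend(page)
        before = page[-1]['created_utc']  # oldest post of this page is the new cutoff
    return collected

# Hypothetical stand-in data: 100 posts, newest first (timestamps 1000, 990, ..., 10).
FAKE_POSTS = [{'created_utc': t} for t in range(1000, 0, -10)]

def fake_fetch(before, size=25):
    # Return up to `size` posts strictly older than `before`, newest first.
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]

posts = paginate(fake_fetch, start_before=1001, target=60)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

Because the stopping condition is only checked after a full page has been added, the final count can overshoot the target by up to one page. The same effect explains why the scrape ends with slightly more than 15,000 posts per subreddit.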
    

In [18]:
# Topic 1: Anxiety
getposts('Anxiety')

Scraping through subreddit: Anxiety. Total post so far: 0
Scraping through subreddit: Anxiety. Total post so far: 250
Scraping through subreddit: Anxiety. Total post so far: 500
Scraping through subreddit: Anxiety. Total post so far: 750
Scraping through subreddit: Anxiety. Total post so far: 1000
Scraping through subreddit: Anxiety. Total post so far: 1250
Scraping through subreddit: Anxiety. Total post so far: 1500
Scraping through subreddit: Anxiety. Total post so far: 1750
Scraping through subreddit: Anxiety. Total post so far: 1999
Scraping through subreddit: Anxiety. Total post so far: 2249
Scraping through subreddit: Anxiety. Total post so far: 2499
Scraping through subreddit: Anxiety. Total post so far: 2749
Scraping through subreddit: Anxiety. Total post so far: 2999
Scraping through subreddit: Anxiety. Total post so far: 3249
Scraping through subreddit: Anxiety. Total post so far: 3499
Scraping through subreddit: Anxiety. Total post so far: 3748
Scraping through subreddit: An

In [19]:
# Topic 2: depression
getposts('depression')

Scraping through subreddit: depression. Total post so far: 0
Scraping through subreddit: depression. Total post so far: 250
Scraping through subreddit: depression. Total post so far: 500
Scraping through subreddit: depression. Total post so far: 750
Scraping through subreddit: depression. Total post so far: 1000
Scraping through subreddit: depression. Total post so far: 1250
Scraping through subreddit: depression. Total post so far: 1499
Scraping through subreddit: depression. Total post so far: 1749
Scraping through subreddit: depression. Total post so far: 1999
Scraping through subreddit: depression. Total post so far: 2248
Scraping through subreddit: depression. Total post so far: 2497
Scraping through subreddit: depression. Total post so far: 2745
Scraping through subreddit: depression. Total post so far: 2995
Scraping through subreddit: depression. Total post so far: 3245
Scraping through subreddit: depression. Total post so far: 3495
Scraping through subreddit: depression. Total 

# Data Scraping Summary

A total of 15,242 posts from 'Anxiety' and 15,238 posts from 'depression' were scraped from the two subreddits.  
This gives a combined total of 30,480 posts.  
The next notebook (Notebook 2) will cover the Base Model, Data Cleaning and EDA.