---
## Background
---

This project was completed as part of the Data Science Immersive course at General Assembly, within a time frame of only two weeks. We used Natural Language Processing (NLP) to build a model that classifies whether a Reddit post belongs to r/schizophrenia or r/bipolar.

### Context
To diagnose and assess mental health conditions, clinicians rely on:

**Face-to-face (F2F) interactions**
- Non-verbal cues such as body language and affect
- Other possible associated symptoms
- Physical state of the patient
- Family members who can provide additional information

**Direct diagnostic questioning**
- Questions that get straight to the point
- "How is your mood lately?"
- "Do you ever feel like your thoughts can be heard?"

### Challenges from the Emergence and Rise of Telemedicine
Telemedicine increases accessibility and convenience, but at the cost of:
- Lack of non-verbal cues
- Limited rapport building
- Potential for misdiagnosis

Current diagnostic methods are also imperfect. Many mental health patients remain undiagnosed: about 50% of all schizophrenia patients and about 70% of all bipolar patients.
The primary healthcare system would therefore benefit greatly from a preliminary alert-filtering tool that is anonymous and community-based.

**There is a need for better linguistic understanding of mental health conditions.**

### Objectives and Scope of this Project
- Evaluate the effectiveness of Natural Language Processing (NLP) techniques as a way of understanding mental health conditions through language.
- Build a model that will assist clinicians in understanding how mental health conditions can be diagnosed from non-contextual linguistic information.
- Test the model on text data taken from the online communities r/bipolar and r/schizophrenia.

---
## Scraping Schizophrenia and Bipolar Subreddits using PRAW
---

In this section, we use PRAW (the Python Reddit API Wrapper) to scrape the r/schizophrenia and r/bipolar subreddits.

For each subreddit, we scrape posts from each of the subreddit's post listing options, namely:
1. Hot posts
2. Controversial posts
3. New posts
4. Rising posts
5. Top posts

For each post, we scrape the following attributes:
1. id
2. title
3. text
4. post score
5. number of comments
6. author
7. created time in utc
8. gilded number (number of awards)

Finally, the posts are put into a dataframe and exported to CSV for further processing.

In [2]:
import praw
import pandas as pd

In [3]:
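import os

# Credentials are read from environment variables (placeholder names) instead of being hard-coded
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='my user agent')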

In [4]:
subs = ['bipolar', 'schizophrenia']
submissions_types = ['hot', 'controversial', 'new', 'rising', 'top']

In [5]:
for x in subs:
    posts = []
    subreddit = reddit.subreddit(x)

    print(f'Starting to scrape for {x} subreddit')
    print()

    for submission_type in submissions_types:
        submission_generator = getattr(subreddit, submission_type)(limit=1000)
        posts.extend([
            [submission.id, submission.title, submission.selftext, submission.score,
             submission.num_comments, submission.author, submission.created_utc, submission.gilded]
            for submission in submission_generator
        ])
        print(f'{submission_type.capitalize()} posts from {x} subreddit scraped')

    print(f'Scraping for {x} subreddit completed successfully')

    df = pd.DataFrame(posts, columns=['id', 'title', 'text', 'score', 'comments_count', 'author', 'created_utc', 'gilding'])
    df.to_csv(f'{x}.csv', index = False)

    print(f'Exporting to csv for {x} subreddit completed successfully')
    print('----------------------------------------')

Starting to scrape for bipolar subreddit

Hot posts from bipolar subreddit scraped
Controversial posts from bipolar subreddit scraped
New posts from bipolar subreddit scraped
Rising posts from bipolar subreddit scraped
Top posts from bipolar subreddit scraped
Scraping for bipolar subreddit completed successfully
Exporting to csv for bipolar subreddit completed successfully
----------------------------------------
Starting to scrape for schizophrenia subreddit

Hot posts from schizophrenia subreddit scraped
Controversial posts from schizophrenia subreddit scraped
New posts from schizophrenia subreddit scraped
Rising posts from schizophrenia subreddit scraped
Top posts from schizophrenia subreddit scraped
Scraping for schizophrenia subreddit completed successfully
Exporting to csv for schizophrenia subreddit completed successfully
----------------------------------------
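
The five listings can overlap (the same post may appear under both hot and top, for example), so `posts` may hold the same submission more than once. The notebook does not show deduplication at this stage, so the following is only a minimal sketch of one way to do it before building the dataframe. The helper name `dedupe_by_id` and the sample rows are made up for illustration; the only assumption taken from the code above is that the post id sits in the first column of each row.

```python
def dedupe_by_id(rows):
    """Keep the first occurrence of each post, keyed on the id in column 0."""
    seen = set()
    unique = []
    for row in rows:
        post_id = row[0]
        if post_id not in seen:
            seen.add(post_id)
            unique.append(row)
    return unique


# Made-up rows in the same [id, title, ...] layout as `posts` above.
sample_posts = [
    ["abc", "post from hot"],
    ["def", "post from new"],
    ["abc", "post from hot"],  # same post surfaced again by another listing
]
unique_posts = dedupe_by_id(sample_posts)
print(len(sample_posts), "->", len(unique_posts))
```

If the data is already in a dataframe, `df.drop_duplicates(subset='id')` should give an equivalent result.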


---
## Using the .json Method to Scrape r/schizophrenia and r/bipolar
---

In [31]:
import time
import random
import requests

In [32]:
url_list = ['https://www.reddit.com/r/bipolar.json', 'https://www.reddit.com/r/schizophrenia.json']

In [33]:
sub_count = 0

In [34]:
for url in url_list:
    headers = {'User-agent': 'Pony Inc 1.0'}
    posts = []
    after = None
    
    print(f'Starting to scrape for {subs[sub_count]} subreddit')

    for a in range(4):
        if after:
            current_url = f"{url}?after={after}"
        else:
            current_url = url

        res = requests.get(current_url, headers=headers)
        
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
        
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']
        
        if a > 0:
            prev_posts = pd.read_csv(f'ryan_code_{subs[sub_count]}.csv')
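            # use only the newly fetched page; `posts` is cumulative and would duplicate rows already saved
            current_df = pd.DataFrame(current_posts)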
            combined_df = pd.concat([prev_posts, current_df])
            combined_df.to_csv(f'ryan_code_{subs[sub_count]}.csv', index = False)
        else:
            pd.DataFrame(posts).to_csv(f'ryan_code_{subs[sub_count]}.csv', index = False)

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        print('Sleeping for', sleep_duration, 'seconds.')
        time.sleep(sleep_duration)
    
    print(f'Scraping and exporting to csv for {subs[sub_count]} subreddit completed successfully')
    print('------------------------------')
    sub_count +=1

Starting to scrape for bipolar subreddit
Sleeping for 6 seconds.
Sleeping for 3 seconds.
Sleeping for 6 seconds.
Sleeping for 2 seconds.
Scraping and exporting to csv for bipolar subreddit completed successfully
------------------------------
Starting to scrape for schizophrenia subreddit
Sleeping for 4 seconds.
Sleeping for 5 seconds.
Sleeping for 4 seconds.
Sleeping for 5 seconds.
Scraping and exporting to csv for schizophrenia subreddit completed successfully
------------------------------
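
The loop above pages through a listing by passing the previous response's `after` token back in the next request. As a rough sketch of that pattern in isolation, the generator below follows the token until it runs out or a page limit is reached, and keeps only a subset of fields per post. The names `iter_posts`, `FIELDS`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and the field names are assumed to match the attributes scraped with PRAW earlier; the fake pages stand in for `requests.get(...).json()` so the sketch runs offline.

```python
from typing import Callable, Iterator, Optional

# Fields kept from each post (assumed to mirror the PRAW attributes used earlier).
FIELDS = ["id", "title", "selftext", "score", "num_comments",
          "author", "created_utc", "gilded"]


def iter_posts(fetch_page: Callable[[Optional[str]], dict],
               max_pages: int = 4) -> Iterator[dict]:
    """Yield one trimmed dict per post, following the listing's `after` token."""
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        for child in listing["data"]["children"]:
            post = child["data"]
            # Missing fields become None instead of raising KeyError.
            yield {field: post.get(field) for field in FIELDS}
        after = listing["data"]["after"]
        if after is None:  # no further pages to request
            break


# Hypothetical stand-in for the HTTP call, keyed by the `after` token.
FAKE_PAGES = {
    None: {"data": {"after": "t3_b", "children": [
        {"data": {"id": "a", "title": "first", "extra": 1}},
        {"data": {"id": "b", "title": "second"}},
    ]}},
    "t3_b": {"data": {"after": None, "children": [
        {"data": {"id": "c", "title": "third"}},
    ]}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


rows = list(iter_posts(fake_fetch))
print([row["id"] for row in rows])
```

In real use, `fetch_page` would wrap the `requests.get` call and the random sleep from the cell above. Because each post is yielded once, the rows can be collected into a single dataframe and written to CSV at the end rather than re-reading the file on every page.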
