# Project 3: Web Scraping and NLP: Depression vs Bipolar

## Problem description
​
Given a large number of Reddit posts, I had a binary classification problem on hand: can a model infer the difference between posts from the depression subreddit and posts from the bipolar subreddit? After scraping the two subreddits, I compared Naive Bayes, Logistic Regression, and KNN models and fine-tuned them to find the one that performed best. My main concern was the accuracy of the model. After choosing my model, I trained it to make real-time predictions. The 'real_time_predictions' subfolder contains code that, when run, tells you with some accuracy whether the person who wrote a paragraph about how they feel should be treated for bipolar disorder or depression.
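
The model comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from the modeling notebooks: the `TfidfVectorizer` step, the default hyperparameters, `n_neighbors=3`, the 2-fold cross-validation, and the placeholder texts and labels are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Placeholder texts and labels, invented for illustration only (not scraped data).
texts = [
    'alpha beta gamma', 'alpha beta delta', 'alpha gamma delta', 'beta gamma alpha',
    'omega sigma theta', 'omega sigma kappa', 'omega theta kappa', 'sigma theta omega',
]
labels = ['class_a'] * 4 + ['class_b'] * 4

# The three model families named in the problem description.
models = {
    'naive_bayes': MultinomialNB(),
    'logistic_regression': LogisticRegression(),
    'knn': KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    # Vectorize the text and fit the classifier inside one pipeline,
    # so each cross-validation fold fits the vectorizer on its own training split.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=2, scoring='accuracy')
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(name, round(score, 3))
```

Scoring every model with the same metric and the same folds keeps the comparison fair, which matters when accuracy is the deciding criterion.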

### Project Structure:
- Notebook 1. Web APIs and Data Collection
- Notebook 2. EDA, Data Cleaning
- Notebook 3. Pre-Processing
- Notebook 4a. Modeling: Naive-Bayes
- Notebook 4b. Modeling: Logistic Regression
- Notebook 4c. Modeling: KNN
- Notebook 5. Model Evaluation

## Web Scraping

In [1]:
import requests
import pandas as pd
import numpy as np
import time

In [2]:
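# Collects posts from a subreddit through the Pushshift API, requesting up to 500 posts
# per call and using the date filter to slide a one-day window further into the past.
def get_data(subreddit):
    after_day = 1   # older edge of the window, in days before now
    before_day = 0  # newer edge of the window, in days before now
    length = 0
    final_df = pd.DataFrame()
    # keep requesting older windows until at least 2000 unique titles are collected
    while length < 2000:
        url = 'https://api.pushshift.io/reddit/search/submission?subreddit={}&size=500&after={}d&before={}d&sort=asc'.format(subreddit, after_day, before_day)
        response = requests.get(url)
        response.raise_for_status()  # fail on an HTTP error instead of parsing an error page
        results = response.json()
        results_df = pd.DataFrame(results['data'])

        final_df = pd.concat([final_df, results_df], sort = True)

        # count unique titles so far; guard against windows that returned no posts
        length = len(final_df['title'].unique()) if 'title' in final_df.columns else 0
        after_day += 1
        before_day += 1

        time.sleep(5)  # pause between requests to go easy on the API
    return final_df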
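
To make the date filter in the request loop above easier to follow, here is a small sketch that only builds the URLs for the first few one-day windows, without sending any requests. The helper names `build_url` and `day_windows` are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/submission'

def build_url(subreddit, after_day, before_day, size=500):
    """Return the search URL for posts made between after_day and before_day days ago."""
    params = {
        'subreddit': subreddit,
        'size': size,
        'after': '{}d'.format(after_day),
        'before': '{}d'.format(before_day),
        'sort': 'asc',
    }
    return BASE_URL + '?' + urlencode(params)

def day_windows(n_windows):
    """Yield (after_day, before_day) pairs: (1, 0), (2, 1), (3, 2), ..."""
    for before_day in range(n_windows):
        yield before_day + 1, before_day

# Hypothetical usage: the first three windows for the bipolar subreddit.
urls = [build_url('bipolar', after, before) for after, before in day_windows(3)]
for url in urls:
    print(url)
```

Each pass through the loop shifts both edges of the window back by one day, so consecutive requests cover adjacent, non-overlapping days.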

In [3]:
bipolar_df = get_data('bipolar') #pulling the bipolar posts

In [4]:
depression_df = get_data('depression') #gathering the depression posts

In [5]:
len(depression_df['title'].unique())

2450

In [6]:
len(bipolar_df['title'].unique())

2091

In [7]:
bipolar_df[['title', 'selftext', 'subreddit']].head()

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | I think I like being depressed more than hypom... | I know this sounds stupid, but I feel like ‘my... | bipolar |
| 1 | Advice | I’ve been super close to my best friend for 6+... | bipolar |
| 2 | I just can’t sleep. | I’m going through a radical cycle right now. I... | bipolar |
| 3 | Breakup into not breakup? | [removed] | bipolar |
| 4 | Deleting rants on FB | A few years ago I went on an epic rant for mon... | bipolar |


In [8]:
depression_df[['title', 'selftext', 'subreddit']].head()

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | i power through | its like shit never stops coming. I just get f... | depression |
| 1 | I feel sick to my stomach | First and foremost, I am not diagnosed with de... | depression |
| 2 | Why are people so cruel? | It really sucks to tell someone you are sad an... | depression |
| 3 | Why bother? | I do not have any motivation to learn grow or ... | depression |
| 4 | Today is my Birthday - shall I kill myself? | In a nutshell, my parents have abandoned me wh... | depression |


In [9]:
bipolar_df.columns

Index(['all_awardings', 'allow_live_comments', 'author', 'author_cakeday',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
       'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link',
       'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_template_id', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media', 'media_embed', 'media_metadata',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'removed_by_catego

In [10]:
bipolar_df.drop_duplicates(subset = ['title'], inplace = True)

In [11]:
depression_df.drop_duplicates(subset = ['title'], inplace = True)
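
As a quick illustration of what `drop_duplicates(subset = ['title'])` does, the sketch below runs it on a tiny made-up frame. The rows are invented for the example and do not come from the scraped data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    'title': ['Title A', 'Title A', 'Title B'],
    'selftext': ['first post', 'second post', 'third post'],
})

# Rows are compared on 'title' only; by default the first occurrence is kept.
deduped = toy_df.drop_duplicates(subset=['title'])

print(deduped)
# After de-duplication the row count equals the number of unique titles.
print(len(deduped) == toy_df['title'].nunique())
```

Because only the title is compared, two posts with the same title but different bodies collapse into one row, which is why the row counts after this step match the unique-title counts computed earlier.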

In [12]:
bipolar_df.shape

(2091, 72)

In [13]:
depression_df.shape

(2450, 61)
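
The two frames end up with different column counts (72 vs 61). One likely reason, stated here as an assumption rather than something verified against the API, is that `pd.concat` keeps the union of the columns across all batches, so a field that shows up in any batch becomes a column. A minimal sketch with made-up batches:

```python
import pandas as pd

# Made-up batches for illustration: the second one has a field the first lacks.
batch_1 = pd.DataFrame({'title': ['a'], 'selftext': ['x']})
batch_2 = pd.DataFrame({'title': ['b'], 'post_hint': ['image']})

# concat keeps the union of columns; missing values are filled with NaN.
combined = pd.concat([batch_1, batch_2], sort=True)
print(combined.columns.tolist())
print(combined.shape)

# Columns present in both frames, useful if the two sets are later stacked together.
shared = batch_1.columns.intersection(batch_2.columns)
print(shared.tolist())
```

Since the later steps only rely on `title`, `selftext`, and `subreddit`, the mismatch in the remaining columns should not matter for the classification task.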

In [14]:
bipolar_df.to_csv('../data/bipolar_df.csv', index = False)
depression_df.to_csv('../data/depression_df.csv', index = False)
