# Project 3: Web Scraping and NLP: Depression vs Bipolar

## Problem description

Given numerous Reddit posts, I had a binary classification problem on hand: can a difference be inferred between depression posts and bipolar posts? After scraping two subreddits, I compared Naive Bayes, Logistic Regression, and KNN models and fine-tuned the one that performed best. My main concern was measuring the accuracy of the model. After choosing my model, I trained it to make real-time predictions. The 'real_time_predictions' subfolder contains code that, when run, tells you with some accuracy whether the person who wrote a paragraph about how they feel should be treated for bipolar disorder or depression.

### Project Structure:
- Notebook 1. Web APIs and Data Collection
- Notebook 2. EDA, Data Cleaning
- Notebook 3. Pre-Processing
- Notebook 4a. Modeling: Naive-Bayes
- Notebook 4b. Modeling: Logistic Regression
- Notebook 4c. Modeling: KNN
- Notebook 5. Model Evaluation

## EDA and Data Cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
depression = pd.read_csv('../data/depression_df.csv')
bipolar = pd.read_csv('../data/bipolar_df.csv')

In [3]:
depression.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_cakeday,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,...,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,url,whitelist_status,wls
0,[],False,PENTANgon,,,[],,text,t2_142i0h,False,...,t5_2qqqf,593389,public,confidence,self,i power through,0,https://www.reddit.com/r/depression/comments/e...,no_ads,0.0
1,[],False,kixback,,,[],,text,t2_xrdn5,False,...,t5_2qqqf,593391,public,confidence,self,I feel sick to my stomach,0,https://www.reddit.com/r/depression/comments/e...,no_ads,0.0
2,[],False,SparkyHollow,,,[],,text,t2_30mnfiop,False,...,t5_2qqqf,593391,public,confidence,self,Why are people so cruel?,0,https://www.reddit.com/r/depression/comments/e...,no_ads,0.0
3,[],False,ronstermonster34,,,[],,text,t2_15mz1j,False,...,t5_2qqqf,593392,public,confidence,self,Why bother?,0,https://www.reddit.com/r/depression/comments/e...,no_ads,0.0
4,[],False,TruDreams,,,[],,text,t2_339337d5,False,...,t5_2qqqf,593393,public,confidence,self,Today is my Birthday - shall I kill myself?,0,https://www.reddit.com/r/depression/comments/e...,no_ads,0.0


In [4]:
bipolar.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,url,whitelist_status,wls
0,[],False,cof-E,,,,[],,,,...,95479,public,self,,,I think I like being depressed more than hypom...,0,https://www.reddit.com/r/bipolar/comments/et0y...,house_only,1
1,[],False,psychnotmyname,,,,[],,,,...,95480,public,self,,,Advice,0,https://www.reddit.com/r/bipolar/comments/et14...,house_only,1
2,[],False,1tapd,,,,[],,,,...,95482,public,self,,,I just can’t sleep.,0,https://www.reddit.com/r/bipolar/comments/et1d...,house_only,1
3,[],False,panquin3535,,,,[],,,,...,95482,public,self,,,Breakup into not breakup?,0,https://www.reddit.com/r/bipolar/comments/et1e...,house_only,1
4,[],False,vr_dream,,,,[],,,,...,95481,public,self,,,Deleting rants on FB,0,https://www.reddit.com/r/bipolar/comments/et1g...,house_only,1


In [5]:
#For our modeling and machine learning we will only use information provided under these columns. 
dep_df = depression[['created_utc', 'title', 'selftext', 'subreddit', 'permalink']]
bip_df = bipolar[['created_utc', 'title', 'selftext', 'subreddit', 'permalink']]
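The column selection above can be sketched on a toy frame. The data, the extra `score` column, and the names `raw`, `keep`, and `subset` are hypothetical; the `.copy()` call is an optional addition, not something the notebook does, that keeps later assignments on the subset independent of the source frame.

```python
import pandas as pd

# Toy stand-in for one scraped subreddit (hypothetical values).
raw = pd.DataFrame({
    "created_utc": [1579819637, 1579819771],
    "title": ["i power through", "I feel sick to my stomach"],
    "selftext": ["some text", None],
    "subreddit": ["depression", "depression"],
    "permalink": ["/r/depression/a/", "/r/depression/b/"],
    "score": [1, 2],  # an extra column that the selection drops
})

keep = ["created_utc", "title", "selftext", "subreddit", "permalink"]

# .copy() makes the subset an independent DataFrame.
subset = raw[keep].copy()
subset["selftext"] = subset["selftext"].fillna("")

print(list(subset.columns))
print(raw["selftext"].isna().sum())  # the source frame still has its missing value
```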

In [6]:
dep_df.head()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink
0,1579819637,i power through,its like shit never stops coming. I just get f...,depression,/r/depression/comments/et0wnm/i_power_through/
1,1579819771,I feel sick to my stomach,"First and foremost, I am not diagnosed with de...",depression,/r/depression/comments/et0xrl/i_feel_sick_to_m...
2,1579819775,Why are people so cruel?,It really sucks to tell someone you are sad an...,depression,/r/depression/comments/et0xtj/why_are_people_s...
3,1579819832,Why bother?,I do not have any motivation to learn grow or ...,depression,/r/depression/comments/et0ybn/why_bother/
4,1579819877,Today is my Birthday - shall I kill myself?,"In a nutshell, my parents have abandoned me wh...",depression,/r/depression/comments/et0ypi/today_is_my_birt...


In [7]:
bip_df.head()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink
0,1579819819,I think I like being depressed more than hypom...,"I know this sounds stupid, but I feel like ‘my...",bipolar,/r/bipolar/comments/et0y78/i_think_i_like_bein...
1,1579820566,Advice,I’ve been super close to my best friend for 6+...,bipolar,/r/bipolar/comments/et14y4/advice/
2,1579821577,I just can’t sleep.,I’m going through a radical cycle right now. I...,bipolar,/r/bipolar/comments/et1dqu/i_just_cant_sleep/
3,1579821641,Breakup into not breakup?,[removed],bipolar,/r/bipolar/comments/et1e90/breakup_into_not_br...
4,1579821932,Deleting rants on FB,A few years ago I went on an epic rant for mon...,bipolar,/r/bipolar/comments/et1gmj/deleting_rants_on_fb/


In [8]:
# Here I concatenate the two subreddits into one DataFrame
together_df = pd.concat([dep_df, bip_df], ignore_index=True)
together_df.head()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink
0,1579819637,i power through,its like shit never stops coming. I just get f...,depression,/r/depression/comments/et0wnm/i_power_through/
1,1579819771,I feel sick to my stomach,"First and foremost, I am not diagnosed with de...",depression,/r/depression/comments/et0xrl/i_feel_sick_to_m...
2,1579819775,Why are people so cruel?,It really sucks to tell someone you are sad an...,depression,/r/depression/comments/et0xtj/why_are_people_s...
3,1579819832,Why bother?,I do not have any motivation to learn grow or ...,depression,/r/depression/comments/et0ybn/why_bother/
4,1579819877,Today is my Birthday - shall I kill myself?,"In a nutshell, my parents have abandoned me wh...",depression,/r/depression/comments/et0ypi/today_is_my_birt...
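Because accuracy is the main metric, it can help to check the class balance right after concatenating. The following is a minimal sketch on toy frames (the names `dep`, `bip`, and `both` and the row counts are hypothetical), showing what `ignore_index=True` does to the index and how `value_counts` reports the share of each subreddit.

```python
import pandas as pd

# Two toy frames standing in for the two subreddits (hypothetical data).
dep = pd.DataFrame({"title": ["a", "b", "c"], "subreddit": "depression"})
bip = pd.DataFrame({"title": ["d", "e"], "subreddit": "bipolar"})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2, 0, 1.
both = pd.concat([dep, bip], ignore_index=True)

counts = both["subreddit"].value_counts()                # rows per class
shares = both["subreddit"].value_counts(normalize=True)  # fraction per class

print(counts)
print(shares)
```

The largest share is the accuracy a model would reach by always predicting the majority class, so it serves as a baseline for the models compared later.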


In [9]:
together_df.tail()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink
4536,1578693802,Anyone just unsure on what to do next or wheth...,"I'm unemployed not sure what direction to go, ...",bipolar,/r/bipolar/comments/emxzsz/anyone_just_unsure_...
4537,1578693883,Abilify experiences,Abilify for treatment resistant depression\n\n...,bipolar,/r/bipolar/comments/emy0i9/abilify_experiences/
4538,1578693987,What books/sites/videos have helped you learn ...,,bipolar,/r/bipolar/comments/emy1au/what_bookssitesvide...
4539,1578694545,A poem I wrote about my experience with Bipola...,"Chasing the feeling of pure bliss,\nGet a hold...",bipolar,/r/bipolar/comments/emy5y4/a_poem_i_wrote_abou...
4540,1578695596,Uh I wrote a 33 page comic script while I was ...,,bipolar,/r/bipolar/comments/emyekb/uh_i_wrote_a_33_pag...


In [10]:
together_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4541 entries, 0 to 4540
Data columns (total 5 columns):
created_utc    4541 non-null int64
title          4541 non-null object
selftext       3987 non-null object
subreddit      4541 non-null object
permalink      4541 non-null object
dtypes: int64(1), object(4)
memory usage: 177.5+ KB


In [11]:
'''There are 554 missing text fields. To avoid problems when joining the title and selftext columns,
I fill the NaN and the [removed] observations with '**¯\\_(ツ)_/¯**'. Then, after joining them,
I remove the '**¯\\_(ツ)_/¯**'.'''

together_df.isna().sum()

created_utc      0
title            0
selftext       554
subreddit        0
permalink        0
dtype: int64
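For comparison, the same goal (joining title and body without a `NaN` wiping out the result) can be reached without a placeholder string. This is a sketch of an alternative, not the notebook's method; the toy data and the names `posts` and `body` are hypothetical.

```python
import pandas as pd

# Toy posts: one normal body, one removed by moderators, one missing.
posts = pd.DataFrame({
    "title": ["Why bother?", "Advice", "Am I right?"],
    "selftext": ["I have no motivation.", "[removed]", None],
})

# Fill missing bodies with an empty string, then blank out whole-cell "[removed]".
body = posts["selftext"].fillna("").replace("[removed]", "")

# Join title and body; strip the trailing space left when the body is empty.
posts["title_selftext"] = (posts["title"] + " " + body).str.strip()

print(posts["title_selftext"].tolist())
```

One trade-off is that an empty string and a genuinely empty post become indistinguishable, whereas a placeholder keeps them visible until it is removed.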

In [12]:
together_df[together_df.selftext.isnull()]

Unnamed: 0,created_utc,title,selftext,subreddit,permalink
158,1579834461,I promised myself never to yell at my students...,,depression,/r/depression/comments/et45yx/i_promised_mysel...
169,1579835734,An empty painful comfortable numbness,,depression,/r/depression/comments/et4faz/an_empty_painful...
177,1579836459,I just text my dad. I haven't talked to him si...,,depression,/r/depression/comments/et4kgq/i_just_text_my_d...
239,1579841887,I literally can’t do life anymore and what’s s...,,depression,/r/depression/comments/et5l69/i_literally_cant...
253,1579842963,The only time I feel some relief and hope is a...,,depression,/r/depression/comments/et5saq/the_only_time_i_...
...,...,...,...,...,...
4518,1578683354,Here we go my babies!!! Guess who's being hell...,,bipolar,/r/bipolar/comments/emvlan/here_we_go_my_babie...
4521,1578685013,Am I right?,,bipolar,/r/bipolar/comments/emvyvk/am_i_right/
4529,1578690170,Meme Friday yaaaay,,bipolar,/r/bipolar/comments/emx516/meme_friday_yaaaay/
4538,1578693987,What books/sites/videos have helped you learn ...,,bipolar,/r/bipolar/comments/emy1au/what_bookssitesvide...


In [13]:
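# Assign the result back instead of using inplace=True on a column accessor,
# which recent pandas versions warn about and may not apply to the DataFrame.
together_df['selftext'] = together_df['selftext'].fillna('**¯\\_(ツ)_/¯**')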
together_df.isna().sum()

created_utc    0
title          0
selftext       0
subreddit      0
permalink      0
dtype: int64

In [14]:
# Replace posts whose entire body is '[removed]' with the same placeholder
together_df['selftext'] = together_df['selftext'].replace('[removed]', '**¯\\_(ツ)_/¯**')
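Note that `Series.replace` with a plain string matches whole cell values only, which is what is wanted for posts whose entire body is `[removed]`. A small sketch of the difference from substring replacement, on a hypothetical Series `s`:

```python
import pandas as pd

s = pd.Series(["[removed]", "post was [removed] by mods", "hello"])

# Series.replace: swaps a cell only when the whole value equals "[removed]".
whole = s.replace("[removed]", "")

# Series.str.replace: removes the substring wherever it occurs.
# regex=False matters here, since "[removed]" would otherwise be read
# as a regular-expression character class.
sub = s.str.replace("[removed]", "", regex=False)

print(whole.tolist())
print(sub.tolist())
```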

In [15]:
together_df['title_selftext'] = together_df['title'] + " " + together_df['selftext']
together_df.head()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink,title_selftext
0,1579819637,i power through,its like shit never stops coming. I just get f...,depression,/r/depression/comments/et0wnm/i_power_through/,i power through its like shit never stops comi...
1,1579819771,I feel sick to my stomach,"First and foremost, I am not diagnosed with de...",depression,/r/depression/comments/et0xrl/i_feel_sick_to_m...,"I feel sick to my stomach First and foremost, ..."
2,1579819775,Why are people so cruel?,It really sucks to tell someone you are sad an...,depression,/r/depression/comments/et0xtj/why_are_people_s...,Why are people so cruel? It really sucks to te...
3,1579819832,Why bother?,I do not have any motivation to learn grow or ...,depression,/r/depression/comments/et0ybn/why_bother/,Why bother? I do not have any motivation to le...
4,1579819877,Today is my Birthday - shall I kill myself?,"In a nutshell, my parents have abandoned me wh...",depression,/r/depression/comments/et0ypi/today_is_my_birt...,Today is my Birthday - shall I kill myself? In...


In [16]:
together_df.loc[1050,'title_selftext']

"Happy Birthday! Anyone else fucking sick of hearing this? Year after year. There's nothing happy about it, another year gone. Fuck all achieved. Just another day to suffer."

In [17]:
'''There are line breaks in the collected data, marked as \n.
            I remove them with .replace'''


together_df['title_selftext'] = [string.replace('\n', '') for string in together_df['title_selftext']]

In [18]:
together_df.loc[1050, 'title_selftext']

"Happy Birthday! Anyone else fucking sick of hearing this? Year after year. There's nothing happy about it, another year gone. Fuck all achieved. Just another day to suffer."
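Row 1050 contains no line breaks, so it reads the same before and after. On text that does contain them, replacing `\n` with an empty string joins the last word of one line to the first word of the next. The sketch below, on a hypothetical string, contrasts that with a variant that substitutes a single space; the variant is an alternative, not what the notebook does.

```python
import pandas as pd

# Hypothetical post body with a blank line between two lines of text.
text = pd.Series(["first line\n\nsecond line"])

# Dropping the line breaks outright merges "line" and "second".
joined = text.str.replace("\n", "", regex=False)

# Replacing each run of line breaks with one space keeps the words apart.
spaced = text.str.replace(r"\n+", " ", regex=True)

print(joined[0])
print(spaced[0])
```

Keeping words separate matters for later tokenization, since a merged token such as `linesecond` would be counted as a new vocabulary entry.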

In [19]:
together_df.loc[2, 'title_selftext']

'Why are people so cruel? It really sucks to tell someone you are sad and then for them to make you feel bad for being upset'

In [20]:
together_df['title_selftext'] = [string.replace('**¯\\_(ツ)_/¯**', '') for string in together_df['title_selftext']]

In [21]:
together_df.loc[2, 'title_selftext']

'Why are people so cruel? It really sucks to tell someone you are sad and then for them to make you feel bad for being upset'

In [22]:
together_df.to_csv('../data/data_cleaned.csv', index = False)
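One thing to keep in mind when the saved file is loaded in a later notebook: by default `pd.read_csv` reads empty fields back as `NaN`. The sketch below round-trips a toy frame through an in-memory buffer instead of a file; the frame and the names `df`, `default`, and `kept` are hypothetical.

```python
import io

import pandas as pd

# Toy frame with one empty text field (hypothetical data).
df = pd.DataFrame({"title": ["Advice", "Why bother?"], "selftext": ["", "text"]})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN.
buffer.seek(0)
default = pd.read_csv(buffer)

# keep_default_na=False (with no na_values given) keeps it as an empty string.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)

print(default["selftext"].isna().sum())
print(kept["selftext"].tolist())
```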
