# Project 3: Web Scraping and NLP: Depression vs Bipolar

## Problem description

Given a large number of Reddit posts, I had a binary classification problem on hand: can a difference be inferred between depression posts and bipolar posts? After scraping two subreddits, I compared Naive Bayes, Logistic Regression, and KNN models and fine-tuned the one that performed best. My main concern was the accuracy of the model. After choosing my model, I trained it to make real-time predictions. The 'real_time_predictions' subfolder contains code that, when run, will tell you with some accuracy whether the person who wrote a paragraph about how they feel should be treated for bipolar disorder or depression.

### Project Structure:
- Notebook 1. Web APIs and Data Collection
- Notebook 2. EDA, Data Cleaning
- Notebook 3. Pre-Processing
- Notebook 4a. Modeling: Naive-Bayes
- Notebook 4b. Modeling: Logistic Regression
- Notebook 4c. Modeling: KNN
- Notebook 5. Model Evaluation

## Pre-Processing

In [1]:
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import re
from nltk.corpus import stopwords

In [2]:
df = pd.read_csv('../data/data_cleaned.csv')

In [3]:
df.head()

Unnamed: 0,created_utc,title,selftext,subreddit,permalink,title_selftext
0,1579819637,i power through,its like shit never stops coming. I just get f...,depression,/r/depression/comments/et0wnm/i_power_through/,i power through its like shit never stops comi...
1,1579819771,I feel sick to my stomach,"First and foremost, I am not diagnosed with de...",depression,/r/depression/comments/et0xrl/i_feel_sick_to_m...,"I feel sick to my stomach First and foremost, ..."
2,1579819775,Why are people so cruel?,It really sucks to tell someone you are sad an...,depression,/r/depression/comments/et0xtj/why_are_people_s...,Why are people so cruel? It really sucks to te...
3,1579819832,Why bother?,I do not have any motivation to learn grow or ...,depression,/r/depression/comments/et0ybn/why_bother/,Why bother? I do not have any motivation to le...
4,1579819877,Today is my Birthday - shall I kill myself?,"In a nutshell, my parents have abandoned me wh...",depression,/r/depression/comments/et0ypi/today_is_my_birt...,Today is my Birthday - shall I kill myself? In...


In [4]:
#take an explicit copy so that later column assignments do not raise SettingWithCopyWarning
data_used = df[['subreddit', 'title_selftext']].copy()


In [5]:
data_used.head()

Unnamed: 0,subreddit,title_selftext
0,depression,i power through its like shit never stops comi...
1,depression,"I feel sick to my stomach First and foremost, ..."
2,depression,Why are people so cruel? It really sucks to te...
3,depression,Why bother? I do not have any motivation to le...
4,depression,Today is my Birthday - shall I kill myself? In...


In [6]:
#credit to lesson 5.03:
def pre_processing_data(raw_text, words_to_remove):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text) #keep letters only; everything else becomes a space
    tokenizer = RegexpTokenizer(r'\w+') #instantiate tokenizer
    text_tokens = tokenizer.tokenize(letters_only.lower()) #make everything lower case, then tokenize
    english_stopwords = set(stopwords.words('english')) #build the stopword set once instead of once per token
    remove_stopwords = [w for w in text_tokens if w not in english_stopwords] #remove english stopwords
    lemmatizer = WordNetLemmatizer() #instantiate lemmatizer
    text_lem = [lemmatizer.lemmatize(i) for i in remove_stopwords] #lemmatize
    words_to_remove = {lemmatizer.lemmatize(i) for i in words_to_remove} #lemmatize the custom words and keep them as a set
    remaining_words = [w for w in text_lem if w not in words_to_remove] #remove the custom stop words
    return " ".join(remaining_words)
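
To make the cleaning stages easier to follow, here is a minimal, standard-library-only sketch of the same idea applied to one sentence. It is a simplification, not the function above: `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, `sketch_pre_processing` is an illustrative name, the sample sentence is invented, and lemmatization is left out entirely. Once only letters and spaces remain, splitting on whitespace yields the same tokens as a `\w+` tokenizer.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list (assumption for this sketch).
TOY_STOPWORDS = {"i", "a", "the", "my", "was", "with", "and", "in"}


def sketch_pre_processing(raw_text, words_to_remove, stop_words=TOY_STOPWORDS):
    """Simplified cleaning: letters only, lower case, drop stopwords and custom words.

    Unlike the notebook function, this sketch does not lemmatize.
    """
    # Replace every non-letter character (digits, punctuation) with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_text)
    # Lower-case and split on whitespace to get tokens.
    tokens = letters_only.lower().split()
    # Merge the generic stopwords and the custom words into one lookup set.
    blocked = set(stop_words) | {w.lower() for w in words_to_remove}
    return " ".join(t for t in tokens if t not in blocked)


# Invented sample sentence, for illustration only.
sample = "I was diagnosed with Bipolar II in 2019, and the mania scared me!"
cleaned = sketch_pre_processing(sample, ["bipolar", "mania"])
print(cleaned)  # diagnosed ii scared me
```

The digits and punctuation disappear in the first step, which is why "2019" leaves no trace while "II" survives as the token "ii".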

In [7]:
#custom stop words to remove: if we left them in, the two subreddits would be too easy to tell apart
words_to_remove = ['depression', 'bipolar', 'antidepressant', 'manic', 'mania', 'hypomanic', 'hypomania']
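
One way to see why these words would give the answer away is to measure how often they appear in each class. The sketch below is only an illustration under stated assumptions: `toy_posts` is a handful of invented posts (not the scraped data), and `share_with_giveaway_words` is a hypothetical helper name. On the real data the same check could be run over the `subreddit` and `title_selftext` columns.

```python
import re
from collections import defaultdict

words_to_remove = ['depression', 'bipolar', 'antidepressant', 'manic', 'mania', 'hypomanic', 'hypomania']

# Invented example posts (assumption for this sketch, not taken from the dataset).
toy_posts = [
    ("depression", "My depression makes it hard to get out of bed"),
    ("depression", "Nothing feels worth doing anymore"),
    ("bipolar", "Coming down from a manic episode and feeling empty"),
    ("bipolar", "Newly diagnosed bipolar and scared of the meds"),
]


def share_with_giveaway_words(posts, giveaway_words):
    """Return, per label, the fraction of posts containing at least one giveaway word."""
    # Whole-word, case-insensitive match on any of the giveaway words.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, giveaway_words)) + r")\b",
        re.IGNORECASE,
    )
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, text in posts:
        totals[label] += 1
        if pattern.search(text):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


shares = share_with_giveaway_words(toy_posts, words_to_remove)
print(shares)  # {'depression': 0.5, 'bipolar': 1.0}
```

If a large share of posts in a class contains one of these words, a model could score well by memorizing them, which is the shortcut the custom stop word list is meant to close off.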

In [8]:
#run our function on our data
data_used['title_selftext'] = [pre_processing_data(string, words_to_remove) for string in data_used['title_selftext']]

In [9]:
data_used.head()

Unnamed: 0,subreddit,title_selftext
0,depression,power like shit never stop coming get frustrat...
1,depression,feel sick stomach first foremost diagnosed fee...
2,depression,people cruel really suck tell someone sad make...
3,depression,bother motivation learn grow part kind relatio...
4,depression,today birthday shall kill nutshell parent aban...


In [10]:
data_used.to_csv('../data/data_pre_processed.csv')