# Project 3: Web Scraping and NLP: Depression vs Bipolar

## Problem description

Given a large set of Reddit posts, I framed a binary classification problem: can depression posts be distinguished from bipolar posts? After scraping two subreddits, I compared Naive Bayes, Logistic Regression, and KNN models and fine-tuned each to find the one that performs best. My main concern was the accuracy of the model. After choosing a model, I trained it to make real-time predictions. The 'real_time_predictions' subfolder contains code that, when run, tells you with some accuracy whether the person who wrote a paragraph about how they feel should be treated for bipolar disorder or depression.
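
As a rough illustration of what such a real-time prediction step could look like, here is a minimal sketch. It is not the code from the 'real_time_predictions' subfolder: the toy corpus, the `LABEL_NAMES` mapping, and the `predict_subreddit` helper are all hypothetical, and the pipeline simply mirrors the CountVectorizer plus MultinomialNB setup used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the scraped, pre-processed posts.
train_texts = [
    "feel empty hopeless tired sad",
    "sad cry alone worthless tired",
    "manic episode mood swing lithium",
    "mania hypomania mood stabilizer episode",
]
train_labels = [0, 0, 1, 1]  # same encoding as the notebook: 0 = depression, 1 = bipolar

LABEL_NAMES = {0: "depression", 1: "bipolar"}  # hypothetical helper mapping

model = Pipeline([("cvec", CountVectorizer()), ("mnb", MultinomialNB())])
model.fit(train_texts, train_labels)


def predict_subreddit(text):
    """Return the predicted label name and the model's probability for it."""
    proba = model.predict_proba([text])[0]
    label = int(model.classes_[proba.argmax()])
    return LABEL_NAMES[label], float(proba.max())


name, confidence = predict_subreddit("manic episode last week mood swing")
print(name, round(confidence, 3))
```

The returned probability is the model's own estimate, not a clinical measure, so it should be read only as a hint about which subreddit the text resembles.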

### Project Structure:
- Notebook 1. Web APIs and Data Collection
- Notebook 2. EDA, Data Cleaning
- Notebook 3. Pre-Processing
- Notebook 4a. Modeling: Naive-Bayes
- Notebook 4b. Modeling: Logistic Regression
- Notebook 4c. Modeling: KNN
- Notebook 5. Model Evaluation

## Naive Bayes

In this notebook my goal was to use GridSearchCV to find the best parameters for each model. I compared a Multinomial Naive Bayes model with a Gaussian Naive Bayes model. Based on the accuracy score, the Multinomial model performed better of the two (test accuracy of about 0.794 versus 0.768); in fact, it performed the best overall. I used CountVectorizer for the Multinomial model because it works best with integer count input, and TfidfVectorizer for the Gaussian model.
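
To make the difference between the two vectorizers concrete, the sketch below runs both on a two-document toy corpus (the `docs` list is hypothetical). CountVectorizer yields integer term counts, which match the count-based likelihood in MultinomialNB, while TfidfVectorizer yields floating-point weights. GaussianNB expects a dense array, which is why the sketch converts the sparse matrix before fitting.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

docs = ["sad tired sad", "manic mood swing"]  # hypothetical toy corpus
labels = [0, 1]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Both vectorizers return sparse matrices; only the value type differs.
print(sparse.issparse(counts), counts.dtype.kind)  # integer counts
print(sparse.issparse(tfidf), tfidf.dtype.kind)    # float weights

# Vocabulary is sorted alphabetically: manic, mood, sad, swing, tired
print(counts.toarray())

# MultinomialNB accepts the sparse count matrix directly.
mnb = MultinomialNB().fit(counts, labels)

# GaussianNB is fitted on a dense array here.
gnb = GaussianNB().fit(tfidf.toarray(), labels)
```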

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer

In [2]:
df = pd.read_csv('../data/data_pre_processed.csv')

In [3]:
df.shape

(4541, 3)

In [4]:
df.head()

,Unnamed: 0,subreddit,title_selftext
0,0,depression,power like shit never stop coming get frustrat...
1,1,depression,feel sick stomach first foremost diagnosed fee...
2,2,depression,people cruel really suck tell someone sad make...
3,3,depression,bother motivation learn grow part kind relatio...
4,4,depression,today birthday shall kill nutshell parent aban...


In [5]:
df.isna().sum().sort_values(ascending = False)

title_selftext    6
subreddit         0
Unnamed: 0        0
dtype: int64

In [6]:
df.dropna(inplace = True)

In [7]:
y = df['subreddit'].map({'depression':0, 'bipolar':1})
y.value_counts(normalize = True)

0    0.540022
1    0.459978
Name: subreddit, dtype: float64
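
Since accuracy is the main metric, the class balance above sets the bar a model has to clear: always predicting the majority class (depression) would already be right about 54% of the time. A minimal sketch of that baseline, using hypothetical labels that only mirror the reported split rather than the real data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels mirroring the roughly 54/46 split reported above.
y_toy = np.array([0] * 54 + [1] * 46)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```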

In [8]:
X = df['title_selftext']
X.head()

0    power like shit never stop coming get frustrat...
1    feel sick stomach first foremost diagnosed fee...
2    people cruel really suck tell someone sad make...
3    bother motivation learn grow part kind relatio...
4    today birthday shall kill nutshell parent aban...
Name: title_selftext, dtype: object

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42, test_size = .33)

### Multinomial Naive Bayes

In [10]:
pipe_mnb = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

In [11]:
pipe_params_mnb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1), (1,2)],
}

In [12]:
gs_mnb = GridSearchCV(pipe_mnb, pipe_params_mnb, cv = 5)

In [13]:
gs_mnb.fit(X_train, y_train);

In [14]:
gs_mnb.score(X_train, y_train)

0.8485845951283739

In [15]:
gs_mnb.score(X_test, y_test)

0.7942551770207081

In [16]:
gs_mnb.best_params_

{'cvec__max_df': 0.8,
 'cvec__max_features': 4000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}
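
Besides `best_params_`, a fitted GridSearchCV also exposes `best_score_` (the mean cross-validated accuracy of the winning combination) and `cv_results_` (one row per combination). The sketch below shows one way to inspect them. It runs on a hypothetical toy corpus with a reduced grid and `cv=2` so that it finishes instantly; the names `texts`, `labels`, `pipe`, `grid`, and `gs` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the notebook fits on X_train / y_train instead.
texts = [
    "sad tired hopeless", "empty sad alone", "tired cry worthless", "hopeless alone cry",
    "manic episode mood", "mood swing mania", "episode lithium manic", "mania swing lithium",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("cvec", CountVectorizer()), ("mnb", MultinomialNB())])
grid = {"cvec__ngram_range": [(1, 1), (1, 2)]}  # reduced grid for the toy data

gs = GridSearchCV(pipe, grid, cv=2)
gs.fit(texts, labels)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(gs.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))

# Mean cross-validated accuracy of the best combination. This differs from
# gs.score(...), which evaluates the refitted best pipeline on the data passed in.
print(gs.best_score_)
```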

### Gaussian Naive Bayes

In [17]:
pipe_gnb = Pipeline([
    ('tfidf', TfidfVectorizer()),
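    # GaussianNB needs dense input; toarray() returns an ndarray (todense() returns np.matrix, which newer scikit-learn rejects)
    ('transform', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True, validate=True)),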
    ('gnb', GaussianNB())
])

In [18]:
pipe_params_gnb = {
    'tfidf__max_features': [2000, 3000],
    'tfidf__min_df': [1, 2],
    'tfidf__max_df': [.8, .9],
    'tfidf__ngram_range': [(1,1), (1,2)],
}

In [19]:
gs_gnb = GridSearchCV(pipe_gnb, pipe_params_gnb, cv = 5)

In [20]:
gs_gnb.fit(X_train, y_train);

In [21]:
gs_gnb.score(X_train, y_train)

0.8775510204081632

In [22]:
gs_gnb.score(X_test, y_test)

0.7682030728122913

In [23]:
gs_gnb.best_params_

{'tfidf__max_df': 0.8,
 'tfidf__max_features': 2000,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 2)}
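
One way to read the scores reported in this notebook side by side is to look at the gap between train and test accuracy, which gives a rough sense of how much each model overfits. The sketch below only redoes that arithmetic on the numbers printed above; the `scores` and `gaps` names are illustrative.

```python
# (train accuracy, test accuracy) copied from the cell outputs above.
scores = {
    "MultinomialNB + CountVectorizer": (0.8485845951283739, 0.7942551770207081),
    "GaussianNB + TfidfVectorizer": (0.8775510204081632, 0.7682030728122913),
}

# A larger train/test gap suggests more overfitting.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, (train, test) in scores.items():
    print(f"{name}: train={train:.3f} test={test:.3f} gap={gaps[name]:.3f}")

# Model with the highest test accuracy.
best_on_test = max(scores, key=lambda name: scores[name][1])
print(best_on_test)
```

On these numbers the Multinomial model has both the higher test accuracy and the smaller gap (about 0.054 versus 0.109), which is consistent with choosing it over the Gaussian model.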