# Project 3: Web Scraping and NLP: Depression vs Bipolar

## Problem description

Given a large number of Reddit posts, I had a binary classification problem on hand: can a difference be inferred between posts about depression and posts about bipolar disorder? After scraping two subreddits, I compared Naive Bayes, Logistic Regression, and KNN models and fine-tuned them to find the one that performs best. My main concern was the accuracy of the model. After choosing my model, I trained it to make real-time predictions. In the 'real_time_predictions' subfolder you will find code that, when run, will tell you with some accuracy whether the person who wrote a paragraph about how they feel should be treated for bipolar disorder or depression.
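
As a rough illustration of what such a prediction step can look like, here is a minimal sketch. It is not the code from the 'real_time_predictions' subfolder: the toy training posts, the `LABEL_NAMES` dictionary, and the `predict_label` helper are all hypothetical, and a Naive Bayes pipeline is assumed only for the sake of the example. The 0/1 encoding follows the mapping used in the notebooks (depression = 0, bipolar = 1).

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy posts; the real model is trained on the scraped data
train_texts = [
    "sad tired empty hopeless",
    "hopeless empty sleep sad",
    "manic racing energy mood",
    "mood swing manic energy",
]
train_labels = [0, 0, 1, 1]  # 0 = depression, 1 = bipolar

# Hypothetical lookup from numeric label back to subreddit name
LABEL_NAMES = {0: "depression", 1: "bipolar"}

# Vectorizer and classifier in one pipeline, so raw text can be passed in
model = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_labels)


def predict_label(text):
    """Return the predicted subreddit name for a single piece of text."""
    return LABEL_NAMES[int(model.predict([text])[0])]


print(predict_label("manic racing energy"))
```

Keeping the vectorizer inside the pipeline means new text is transformed with the vocabulary learned during training, so the prediction step only needs the raw paragraph.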

### Project Structure:
- Notebook 1. Web APIs and Data Collection
- Notebook 2. EDA, Data Cleaning
- Notebook 3. Pre-Processing
- Notebook 4a. Modeling: Naive-Bayes
- Notebook 4b. Modeling: Logistic Regression
- Notebook 4c. Modeling: KNN
- Notebook 5. Model Evaluation

## KNN: K Nearest Neighbors

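Here I run KNN with both CountVectorizer and TfidfVectorizer. The model performed relatively well: not as well as Naive Bayes, but not as badly as Logistic Regression. I ran a number of grid searches to find the best parameters. With CountVectorizer the best pipeline reached about 0.64 accuracy on the test set; with TfidfVectorizer it reached only about 0.47, which is below the 0.54 share of the majority class.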
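
To make the difference between the two vectorizers concrete, here is a minimal sketch on three made-up posts (the `docs` list is hypothetical and not taken from the scraped data). CountVectorizer stores raw term counts, while TfidfVectorizer down-weights terms that appear in many documents and, with its default settings, scales each row to unit length. Since KNN works on distances between rows, the two representations can rank neighbors quite differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy posts, used only for illustration
docs = ["feel sad feel tired", "feel manic energy", "sad tired sleep"]

# Raw term counts per document
cvec = CountVectorizer()
counts = cvec.fit_transform(docs)

# TF-IDF weights per document (L2-normalized rows by default)
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# "feel" appears twice in the first post
print(counts[0, cvec.vocabulary_["feel"]])

# The same term after TF-IDF weighting and row normalization
print(weights[0, tfidf.vocabulary_["feel"]])

# Each TF-IDF row has Euclidean length 1
row_norms = np.asarray(np.sqrt(weights.multiply(weights).sum(axis=1))).ravel()
print(row_norms)
```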

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

In [2]:
df = pd.read_csv('../data/data_pre_processed.csv')

In [6]:
df.dropna(inplace = True)

In [7]:
y = df['subreddit'].map({'depression':0, 'bipolar':1})
y.value_counts(normalize = True)

0    0.540022
1    0.459978
Name: subreddit, dtype: float64
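
The class shares above imply a baseline: always predicting the majority class (depression) would be right about 54% of the time, so a useful model has to beat that figure. A minimal sketch of that baseline, assuming synthetic labels with the same rough 54/46 split (`X_toy` and `y_toy` are hypothetical stand-ins, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels mirroring the roughly 54/46 split shown above
y_toy = np.array([0] * 54 + [1] * 46)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```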

In [8]:
X = df['title_selftext']
X.head()

0    power like shit never stop coming get frustrat...
1    feel sick stomach first foremost diagnosed fee...
2    people cruel really suck tell someone sad make...
3    bother motivation learn grow part kind relatio...
4    today birthday shall kill nutshell parent aban...
Name: title_selftext, dtype: object

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42, test_size = .33)

### KNN: CountVectorizer

In [14]:
pipe_knn = Pipeline([
    ('cvec', CountVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [15]:
pipe_params_knn = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8, .9],
    'cvec__ngram_range': [(1,1), (1,2)],
    'knn__n_neighbors': [3, 5, 7, 9]
}

In [16]:
gs_knn = GridSearchCV(pipe_knn, pipe_params_knn, cv = 5)

In [17]:
gs_knn.fit(X_train, y_train);

In [18]:
gs_knn.score(X_train, y_train)

0.7995391705069125

In [19]:
gs_knn.score(X_test, y_test)

0.6432865731462926

In [20]:
gs_knn.best_params_

{'cvec__max_df': 0.8,
 'cvec__max_features': 5000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1),
 'knn__n_neighbors': 3}
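
The training score above is computed after the best pipeline has been refit on all of `X_train` (the default `refit=True`), so it tends to be optimistic. `GridSearchCV` also keeps the mean cross-validated accuracy of the winning combination in `best_score_`, which is usually a fairer number to compare against the test score. A minimal sketch on a made-up corpus (the `texts`, `labels`, and the small grid are hypothetical):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus: 24 short posts, 12 per class
texts = [
    "sad tired empty", "tired hopeless sad",
    "empty hopeless tired", "sad empty hopeless",
    "manic energy racing", "racing manic mood",
    "mood energy manic", "energy racing mood",
] * 3
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 3

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Deliberately tiny grid so the sketch runs quickly
params = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "knn__n_neighbors": [1, 3],
}

gs = GridSearchCV(pipe, params, cv=3)
gs.fit(texts, labels)

# Mean accuracy over the held-out folds for the best parameter combination
print(gs.best_score_)
print(gs.best_params_)
```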

### KNN: TfidfVectorizer

In [25]:
pipe_knn_tfi = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier())
])

In [26]:
pipe_params_knn_tfi = {
    'tfidf__max_features': [2000, 3000],
    'tfidf__min_df': [1, 2],
    'tfidf__max_df': [.8, .9],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'knn__n_neighbors': [3,5,7,9]
}

In [27]:
gs_knn_tfi = GridSearchCV(pipe_knn_tfi, pipe_params_knn_tfi, cv = 5)

In [28]:
gs_knn_tfi.fit(X_train, y_train);

In [29]:
gs_knn_tfi.score(X_train, y_train)

0.4901250822909809

In [30]:
gs_knn_tfi.score(X_test, y_test)

0.4682698730794923

In [31]:
gs_knn_tfi.best_params_

{'knn__n_neighbors': 3,
 'tfidf__max_df': 0.8,
 'tfidf__max_features': 2000,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1)}

### CountVectorizer: Search for the best k

In [33]:
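# Vectorizer settings follow the earlier CountVectorizer grid search, except
# ngram_range: that search selected (1, 1), while (1, 2) is used here
cvec = CountVectorizer(max_features = 5000, min_df = 2, max_df = .8, ngram_range = (1,2))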

X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)


knn_params = {
    'n_neighbors': range(1, 51, 10),
    'p'        : [1, 2],
    'weights'  : ['uniform', 'distance'],
}

knn_gridsearch = GridSearchCV(KNeighborsClassifier(), 
                              knn_params, 
                              cv = 5,
                              verbose = 1)

knn_gridsearch.fit(X_train_cvec, y_train);
print(knn_gridsearch.score(X_train_cvec, y_train))
print(knn_gridsearch.score(X_test_cvec, y_test))

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  2.6min finished


0.999012508229098
0.6132264529058116


In [34]:
knn_gridsearch.best_params_

{'n_neighbors': 11, 'p': 2, 'weights': 'distance'}
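
The near-perfect training score of this grid search is likely a side effect of `weights='distance'` rather than a sign of a better fit. With distance weighting, a training point queried against the model has zero distance to itself, so its own label decides the vote. A minimal sketch on random data with random labels (`X_toy` and `y_toy` are hypothetical and carry no real signal):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical random features and labels: there is nothing to learn here
rng = np.random.RandomState(42)
X_toy = rng.rand(200, 5)
y_toy = rng.randint(0, 2, size=200)

# Score each weighting scheme on the same data it was fit on
acc = {}
for weights in ["uniform", "distance"]:
    knn = KNeighborsClassifier(n_neighbors=11, weights=weights)
    knn.fit(X_toy, y_toy)
    acc[weights] = knn.score(X_toy, y_toy)

print(acc)
```

Because distance weighting reproduces the training labels even on pure noise, the training score says little about generalization in that setting; the held-out score is the one to read.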

### k = 11 vs k = 3

In [46]:
knn_model = KNeighborsClassifier(n_neighbors=11)

knn_model.fit(X_train_cvec, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=11, p=2,
                     weights='uniform')

In [47]:
knn_model.score(X_train_cvec, y_train)

0.6711652402896643

In [48]:
knn_model.score(X_test_cvec, y_test)

0.6138944555778223

In [49]:
knn_model = KNeighborsClassifier(n_neighbors=3)

knn_model.fit(X_train_cvec, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [50]:
knn_model.score(X_train_cvec, y_train)

0.7880184331797235

In [51]:
knn_model.score(X_test_cvec, y_test)

0.6232464929859719
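
The comparison above scores both values of k on the test set, which risks tuning k to that particular split. One hedged alternative is to compare them with cross-validation on the training data only and keep the test set for a final check. A minimal sketch, assuming synthetic data from `make_classification` in place of the vectorized posts (`X_toy`, `y_toy`, and `cv_means` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized training data
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_means = {}
for k in (3, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(knn, X_toy, y_toy, cv=5).mean()

print(cv_means)
```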