# Project 3: Web APIs & Classification

# Problem Statement

The goal is to scrape two subreddits (Anxiety and Depression) and use the scraped data to build classification models based on Logistic Regression and Multinomial Naive Bayes (MultinomialNB). The models should accurately predict which subreddit a post belongs to, based on the words it uses. We hope that such a model could be used in therapy sessions to help distinguish between patients with anxiety and patients with depression.

At the end of the evaluation, the model with the best accuracy and the fewest type II errors (false negatives) will be selected as the final model. The selection will be based on the accuracy score, the ROC AUC score and the confusion matrix.
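
As a rough illustration of the selection criteria above, the sketch below computes accuracy and the type II error rate (false negative rate) from true and predicted labels in plain Python. The label encoding (1 as the positive class), the toy labels and the helper names `confusion_counts`, `accuracy` and `type_2_error_rate` are assumptions made for this example, not the notebook's actual evaluation code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1          # positive correctly predicted
        elif t == positive:
            fn += 1          # positive missed: type II error
        elif p == positive:
            fp += 1          # negative flagged as positive: type I error
        else:
            tn += 1          # negative correctly predicted
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tn + fp + fn + tp)


def type_2_error_rate(fn, tp):
    """False negative rate: share of actual positives the model missed."""
    return fn / (fn + tp) if (fn + tp) else 0.0


# Toy labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(accuracy(tn, fp, fn, tp), type_2_error_rate(fn, tp))
```

Which class counts as "positive" decides what a type II error means here, so that choice should be fixed before models are compared.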

# Executive Summary

Singapore has one of the highest rates of depressive disorder among high-income nations. There is also an increasing trend of Singaporean youth being diagnosed with depression or anxiety disorders.

We hope that the model will provide insight into the words commonly used by patients with depression or anxiety disorders during therapy, thereby mapping out a relationship between the words used and the sentiments of patients. Eventually, the model could be used to help psychiatrists make better-informed assessments of patients with depression or anxiety disorders, so that the proper resources and treatments are allocated to each patient.

# Contents:
- [Acquiring URL of Anxiety Disorder & Depression](#Acquiring-URL-of-Anxiety-Disorder-&-Depression)
- [Loops to obtain more posts](#Loops-to-obtain-more-posts)
- [Define Functions](#Define-Functions)
- [Importing Scrap Data](#Importing-Scrap-Data)
- [Inspect & Cleaning](#Inspect-&-Cleaning)
- [EDA](#EDA)
- [Target and feature (Post)](#Target-and-feature-(Post))
- [Logistic Regression](#Logistic-Regression)
- [MultinomialNB](#MultinomialNB)
- [Gridsearch & Pipeline](#Gridsearch-&-Pipeline)
- [Evaluation Summary](#Evaluation-Summary)
- [Conclusion & Recommendation](#Conclusion-&-Recommendation)

In [1]:
import requests
import pandas as pd
import time
import random

## Acquiring URL of Anxiety Disorder & Depression

In [2]:
#Anxiety Disorder URL
url_anxiety = 'https://www.reddit.com/r/Anxiety.json'

In [4]:
#Depression Dark Place
url_depression_dark_place = 'https://www.reddit.com/r/depression.json'

In [None]:
#Requests Anxiety Disorder URL 
res_anxiety = requests.get(url_anxiety)

In [None]:
#Requests Depression URL 
res_depression = requests.get(url_depression_dark_place)

In [None]:
res_anxiety.status_code, res_depression.status_code

In [None]:
#Change of user-agent
res_anxiety = requests.get(url_anxiety, headers={'User-agent': 'Wil 1.0'})
res_depression = requests.get(url_depression_dark_place, headers={'User-agent': 'Wil 1.0'})

In [None]:
res_anxiety.status_code, res_depression.status_code

In [None]:
#Anxiety Disorder dictionary
anxiety_dict = res_anxiety.json()

In [None]:
#Depression dictionary
depression_dict = res_depression.json()

In [None]:
anxiety_dict['data'].keys()

In [None]:
depression_dict['data'].keys()

In [None]:
anxiety_dict['data']['children']

In [None]:
len(anxiety_dict['data']['children'])
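
The cells above explore the nesting of the listing JSON: the posts sit under `data` → `children`, and each child wraps a single post in its own `data` key. The sketch below shows that shape on a small hand-written dictionary. The sample values and the helper name `extract_posts` are hypothetical and stand in for a real response such as `res_anxiety.json()`.

```python
# Hypothetical, heavily trimmed stand-in for res_anxiety.json()
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",  # made-up token pointing at the next page
        "children": [
            {"kind": "t3", "data": {"name": "t3_aaa111",
                                    "title": "First post",
                                    "selftext": "Body one"}},
            {"kind": "t3", "data": {"name": "t3_abc123",
                                    "title": "Second post",
                                    "selftext": "Body two"}},
        ],
    },
}


def extract_posts(listing):
    """Return the list of post dictionaries and the 'after' token."""
    data = listing["data"]
    posts = [child["data"] for child in data["children"]]
    return posts, data["after"]


posts, after = extract_posts(sample_listing)
print(len(posts), after)
print([p["title"] for p in posts])
```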

## Loops to obtain more posts

In [7]:
#Collect Anxiety posts (list of dicts, saved as a dataframe afterwards)
posts_anxiety = []
after = None

for a in range(50):
    if after is None:
        current_url = url_anxiety
    else:
        current_url = url_anxiety + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Wil 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_anxiety.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Anxiety.json
4
https://www.reddit.com/r/Anxiety.json?after=t3_fep9on
5
https://www.reddit.com/r/Anxiety.json?after=t3_fek711
2
https://www.reddit.com/r/Anxiety.json?after=t3_felsyc
3
https://www.reddit.com/r/Anxiety.json?after=t3_feji5i
3
https://www.reddit.com/r/Anxiety.json?after=t3_fegdtg
3
https://www.reddit.com/r/Anxiety.json?after=t3_fe7fuk
3
https://www.reddit.com/r/Anxiety.json?after=t3_feb982
3
https://www.reddit.com/r/Anxiety.json?after=t3_fe6jwu
6
https://www.reddit.com/r/Anxiety.json?after=t3_fe7ib7
6
https://www.reddit.com/r/Anxiety.json?after=t3_fdsrvx
3
https://www.reddit.com/r/Anxiety.json?after=t3_fe3lrf
5
https://www.reddit.com/r/Anxiety.json?after=t3_fdw26v
2
https://www.reddit.com/r/Anxiety.json?after=t3_fdyvjx
6
https://www.reddit.com/r/Anxiety.json?after=t3_fdl9t1
4
https://www.reddit.com/r/Anxiety.json?after=t3_fdsyz8
2
https://www.reddit.com/r/Anxiety.json?after=t3_fdn70d
3
https://www.reddit.com/r/Anxiety.json?after=t3_fdpn9v
5
https://

In [10]:
pd.DataFrame(posts_anxiety).to_csv('anxiety.csv', index = False)

In [8]:
len(posts_anxiety)

1253

In [None]:
#Number of unique Anxiety posts (by selftext)
len(set(p['selftext'] for p in posts_anxiety))

In [11]:
#Collect Depression posts (list of dicts, saved as a dataframe afterwards)
posts_depression_dark = []
after = None

for a in range(50):
    if after is None:
        current_url = url_depression_dark_place
    else:
        current_url = url_depression_dark_place + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Wil 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_depression_dark.extend(current_posts)
    after = current_dict['data']['after']

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/depression.json
6
https://www.reddit.com/r/depression.json?after=t3_felz2c
6
https://www.reddit.com/r/depression.json?after=t3_fenddo
2
https://www.reddit.com/r/depression.json?after=t3_feoojv
2
https://www.reddit.com/r/depression.json?after=t3_fem903
4
https://www.reddit.com/r/depression.json?after=t3_feppmn
2
https://www.reddit.com/r/depression.json?after=t3_fep5r1
5
https://www.reddit.com/r/depression.json?after=t3_feok3a
6
https://www.reddit.com/r/depression.json?after=t3_fea8or
4
https://www.reddit.com/r/depression.json?after=t3_feng6j
2
https://www.reddit.com/r/depression.json?after=t3_fehgth
2
https://www.reddit.com/r/depression.json?after=t3_fega6j
2
https://www.reddit.com/r/depression.json?after=t3_fek4za
4
https://www.reddit.com/r/depression.json?after=t3_fef52j
6
https://www.reddit.com/r/depression.json?after=t3_fdnzub
2
https://www.reddit.com/r/depression.json?after=t3_feauao
3
https://www.reddit.com/r/depression.json?after=t3_fe9e27
3
https://www.r

In [12]:
pd.DataFrame(posts_depression_dark).to_csv('depression_dark.csv', index = False)
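
The two scraping loops above differ only in the base URL and the target list, so they could be folded into one function. The following is a minimal sketch under stated assumptions: `scrape_listing`, its `fetch` parameter and the `example.com` URLs are hypothetical names introduced here, and the fake pages replace real network calls so the block runs offline.

```python
import random
import time


def scrape_listing(base_url, fetch, pages=50, pause=(2, 6)):
    """Collect posts from up to `pages` listing pages.

    `fetch` is any callable that takes a URL and returns the parsed
    JSON dictionary, or None when the request did not succeed.
    """
    posts, after = [], None
    for _ in range(pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch(url)
        if listing is None:      # failed request: stop early
            break
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:        # no further pages to request
            break
        # random pause between requests, as in the loops above
        time.sleep(random.randint(*pause))
    return posts


# Made-up two-page listing used instead of real network calls
fake_pages = {
    "https://example.com/r/demo.json": {
        "data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "https://example.com/r/demo.json?after=t3_a": {
        "data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}

posts = scrape_listing("https://example.com/r/demo.json",
                       fake_pages.get, pause=(0, 0))
print([p["name"] for p in posts])
```

In the notebook, `fetch` could wrap `requests.get(url, headers={'User-agent': 'Wil 1.0'})` and return `res.json()` when `res.status_code` is 200. Unlike the original loops, this sketch stops when `after` comes back as `None`; without that check the next request would go back to the first page and collect the same posts again.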

In [13]:
len([p['selftext'] for p in posts_depression_dark])

1246

In [14]:
len(set(p['selftext'] for p in posts_depression_dark))

917
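
The counts above show fewer unique `selftext` values than scraped posts, so some posts share the same body text. One way to keep only the first occurrence of each, while preserving the scraped order, is sketched below. The helper name `drop_duplicate_posts` and the sample posts are hypothetical.

```python
def drop_duplicate_posts(posts, key="selftext"):
    """Keep the first post seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for post in posts:
        value = post.get(key)
        if value not in seen:
            seen.add(value)
            unique.append(post)
    return unique


# Made-up posts for illustration only
sample_posts = [
    {"name": "t3_a", "selftext": "I feel anxious"},
    {"name": "t3_b", "selftext": "I feel low"},
    {"name": "t3_c", "selftext": "I feel anxious"},  # same body as t3_a
]

unique_posts = drop_duplicate_posts(sample_posts)
print(len(sample_posts), len(unique_posts))
```

Posts with an empty `selftext` all compare equal under this rule, so only one of them would survive. With the data already in a pandas dataframe, `DataFrame.drop_duplicates(subset='selftext')` gives a comparable result.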