<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping from Reddit : Depression vs Anxiety

# Project Description

This project was completed as a course requirement for General Assembly. Its aim is to classify posts from two different subreddits using Natural Language Processing (NLP). This involves web scraping, data cleaning and preprocessing, Exploratory Data Analysis (EDA), and training several models to predict which subreddit a post came from. The final model achieved an accuracy of 96%.

# Table of Contents

- [Background](#Background)
- [Problem Statement](#Problem-Statement)
- [Assumptions](#Assumptions)
- [Import Libraries](#Import-Libraries)
- [Functions](#Functions)
- [Webscraping](#Webscraping)
- [Data Cleaning and EDA](#Data-Cleaning-and-EDA)
- [Modeling](#Modeling)
- [Conclusions & Recommendations](#Conclusions-&-Recommendations)

# Background

According to Bachmann, S., *Epidemiology of Suicide and the Psychiatric Perspective*, most suicides are related to psychiatric disease, with depression, substance use disorders and psychosis being the most relevant risk factors. In view of this finding, a newly developed social media application, Chipper, has implemented a feature that lets users report other users' posts for suspected mental health issues, so that help can be offered to those users before it is too late.
As a data scientist working in this company, I am tasked with training a classifier that categorises posts reported for mental health issues as either Anxiety or Depression, so that we can route these users to the appropriate helpline. To train the classifier, I will be using posts from Reddit's r/Anxiety and r/Depression subreddits as proxy data.

# Problem Statement

To correctly classify each post by its subreddit of origin so that Chipper can route users to the appropriate helpline.

# Assumptions

The following are the assumptions made:
<br>
- all analysis reflects the data as it was at the time of scraping,
<br>
- the target audience is a mix of technical and non-technical readers,
<br>
- due to time and hardware limitations, live data monitoring and analysis are not performed,
<br>
- photos, videos and emojis in the posts are ignored and removed from the analysis,
<br>
- missing data are removed from the analysis.
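
As a rough illustration of the last two assumptions, the sketch below drops missing entries and strips emojis from a few made-up posts. The sample data, the `strip_emojis` helper and the emoji character ranges are all assumptions for illustration, not the cleaning code used later in the project.

```python
import re

# Hypothetical sample posts; None and "" stand in for missing text
posts = ["Can't sleep again \U0001F61E", None, "", "Panic attack at work today"]

# Rough emoji filter (an assumption, not exhaustive): characters outside the
# Basic Multilingual Plane plus the symbol blocks U+2600 to U+27BF
EMOJI_PATTERN = re.compile("[\U00010000-\U0010FFFF\u2600-\u27BF]")

def strip_emojis(text):
    """Remove emoji characters and trim surrounding whitespace."""
    return EMOJI_PATTERN.sub("", text).strip()

# Keep only non-missing posts, with emojis removed
cleaned = [strip_emojis(post) for post in posts if post]
print(cleaned)
```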

# Import Libraries

In [1]:
# Import libraries
import requests
import pandas as pd
import datetime as dt 
import time
import random
import string
import re
import nltk
from datetime import datetime
from wordcloud import WordCloud, STOPWORDS
from IPython.display import Image
from nltk.stem import WordNetLemmatizer
import spacy as sp

#Import EDA
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import sklearn

#Import Modeling libraries
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import pickle
from sklearn.pipeline import Pipeline

from sklearn import metrics
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, plot_roc_curve,plot_confusion_matrix

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

# Functions

## Webscraping function

In [2]:
def get_data_from_reddit(subreddit, no_of_post, days = 30):
    
    # Base endpoint of the Pushshift submission search API
    api = 'https://api.pushshift.io/reddit/search/submission'
    
    # Base query url for the chosen subreddit, requesting 100 posts per call
    url = f'{api}?subreddit={subreddit}&size=100'
    
    # Create an empty list to store the DataFrame returned by each request
    posts = []
    
    # Request one page per iteration, changing the `after` parameter each time
    for i in range(1, no_of_post+1):
        urlmod = '{}&after={}d'.format(url, days*i)
        
        # Skip this page if the request fails or returns a non-200 status
        try:
            results = requests.get(urlmod)
            assert results.status_code == 200
        except (requests.RequestException, AssertionError):
            continue
        
        # Extract the list of posts from the JSON response
        extracted = results.json()['data']
        
        # Build a DataFrame from the list of post dicts
        df = pd.DataFrame.from_dict(extracted)
        
        # Add the DataFrame to the posts list created above
        posts.append(df)
        
        # Total number of posts scraped so far
        total_scraped = sum(len(x) for x in posts)
        
        # Stop once more than no_of_post posts have been collected
        if total_scraped > no_of_post:
            break
        
        # Sleep for a random duration to reduce the risk of being blocked
        sleep_duration = random.randint(1,9)
        time.sleep(sleep_duration)
            
    
    # Creating a list of features that we will be using
    features_required = ['title','subreddit','selftext']
    
    # Merge all iterations into 1 dataframe
    df_merged = pd.concat(posts, sort=False)
    
    # Select the columns that we want from the datasets
    df_merged = df_merged[features_required]
    
    # Dropping any duplicates
    df_merged.drop_duplicates(inplace=True)
    
    return df_merged.reset_index(drop=True)
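
To make the paging scheme easier to follow, here is a minimal sketch that builds the same request URLs as `get_data_from_reddit` without sending any requests. The helper name `build_page_urls` and its `pages` parameter are hypothetical and exist only for this illustration.

```python
API = 'https://api.pushshift.io/reddit/search/submission'

def build_page_urls(subreddit, pages, days=30):
    """Return the URLs that would be requested for the first `pages` iterations."""
    base = f'{API}?subreddit={subreddit}&size=100'
    # Iteration i uses after=<days*i>d, mirroring the loop in the function above
    return ['{}&after={}d'.format(base, days * i) for i in range(1, pages + 1)]

urls = build_page_urls('Anxiety', 3)
for url in urls:
    print(url)
```

Printing the URLs before running the scraper is a cheap way to check which `after` values will be requested.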

## Cleaning Function : Check for unique values

In [3]:
#Purpose : Check for inconsistent data and unique values

def inconsist_uni_val(dataframe):
    
    #Select the third and fourth columns (positions 2 and 3) to inspect
    column_name=dataframe.columns[2:4]

    #Loop through each selected column and display its value counts, including missing values
    for name in column_name:
        display(name)
        display(dataframe[name].value_counts(dropna=False))

## Cleaning Function : Lemmatization of text

In [4]:
def lemmatizing(text):
    # Note: relies on the globals `wn` (a lemmatizer with a .lemmatize method, e.g. WordNetLemmatizer())
    # and `stopwords` (a collection of words to drop), which are not defined in this part of the notebook
    # \W matches any non-word character (for ASCII text, equivalent to [^a-zA-Z0-9_]), including whitespace
    # The + treats a run of consecutive separators (e.g. 2 or more spaces) as a single split point
    tokens = re.split(r'\W+', text)
    
    # Lemmatize each non-stopword token and join the result back into a single string
    text = " ".join([wn.lemmatize(word) for word in tokens if word not in stopwords])
    
    return text
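
The tokenization step can be checked on its own. This is a small sketch with a made-up sentence, showing how `re.split` with `\W+` behaves on repeated spaces and trailing punctuation.

```python
import re

# Hypothetical sample text with a double space and trailing punctuation
sample = "feeling  anxious, again!"

# Split on runs of non-word characters (spaces and punctuation alike)
tokens = re.split(r"\W+", sample)
print(tokens)  # ['feeling', 'anxious', 'again', '']

# Trailing punctuation leaves an empty string, which can be filtered out
non_empty = [token for token in tokens if token]
print(non_empty)  # ['feeling', 'anxious', 'again']
```

The empty string at the end comes from the trailing `!`, so filtering empty tokens before lemmatizing may be worth considering.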

## Cleaning Function : Removal of punctuation

In [5]:
def remove_punct(text):
    # store character only if it is not a punctuation
    text_nopunct = "".join([char.lower() for char in text if char not in string.punctuation])
    return text_nopunct
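
As a quick usage check, the snippet below repeats the function so that it runs on its own and applies it to a made-up sentence.

```python
import string

def remove_punct(text):
    # Keep each character only if it is not punctuation, lowercasing as we go
    return "".join([char.lower() for char in text if char not in string.punctuation])

# Hypothetical sample input
result = remove_punct("I can't sleep!!!")
print(result)  # i cant sleep
```

Note that apostrophes are removed too, so contractions such as "can't" become "cant".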

## Modeling : Scores, Metrics, ROC Curve and Confusion Matrix

In [6]:
def model(vect_name_for_display,model_name_for_display,model,X_train,y_train,X_test,y_test):
    
    #Fit model
    model.fit(X_train,y_train)
    
    # Save the fitted model to disk, closing the file once it has been written
    filename = f'{vect_name_for_display}-{model_name_for_display} model.sav'
    with open(filename, 'wb') as file:
        pickle.dump(model, file)
    
    # Predicting the results using the X_train and X_test
    y_train_pred=model.predict(X_train)
    y_test_pred=model.predict(X_test)
    
    # Train and test accuracy scores
    display(f'{vect_name_for_display} - {model_name_for_display} Accuracy Train score : {round(accuracy_score(y_train,y_train_pred),4)}')
    display(f'{vect_name_for_display} - {model_name_for_display} Accuracy Test score : {round(accuracy_score(y_test,y_test_pred),4)}')
    
    # Train and test ROC-AUC scores (computed from predicted labels, not predicted probabilities)
    display(f'{vect_name_for_display} - {model_name_for_display} Roc-Auc Train score : {round(roc_auc_score(y_train,y_train_pred),4)}')
    display(f'{vect_name_for_display} - {model_name_for_display} Roc-Auc Test score : {round(roc_auc_score(y_test,y_test_pred),4)}')
        
    # Display Classification Metrics
    display(f'{vect_name_for_display} - {model_name_for_display} Classification Report : ')
    print(classification_report(y_test,y_test_pred))
    
    # Plotting of the ROC curve
    plot_roc_curve(model,X_test, y_test)
    plt.plot([0, 1], [0, 1], label='baseline', linestyle='--')
    plt.title(f"{vect_name_for_display}-{model_name_for_display} ROC Curve")
    plt.legend();
    plt.show()
    
    # Confusion matrix
    plot_confusion_matrix(model,X_test, y_test,cmap=plt.cm.Blues)
    plt.title(f'{vect_name_for_display} - {model_name_for_display} Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()

# Webscraping

In [7]:
# Record the date and time so that we know when the data was last scraped

# datetime object containing current date and time
now = datetime.now()

# Format: dd/mm/YYYY HH:MM:SS
dt_string = now.strftime("%d/%m/%Y %H:%M:%S")

## Depression and Anxiety 

In [8]:
%%script false --no-raise-error
#The line above skips this cell so that new data is not scraped when the notebook is re-run.

#Datetime object containing current date and time
now = datetime.now()
# Format: dd/mm/YYYY HH:MM:SS
dt_string = now.strftime("%d/%m/%Y %H:%M:%S")

#Start timer to time the process
start=time.perf_counter()

#Use the function above to scrape the data
submissions_depression_df = get_data_from_reddit('Depression', 3000)
submissions_anxiety_df = get_data_from_reddit('Anxiety', 3000)

#Show the results in words
display(f'From Pushshift : Scraped \'depression\' {len(submissions_depression_df)} posts on {dt_string}.')
display(f'From Pushshift : Scraped \'anxiety\' {len(submissions_anxiety_df)} posts on {dt_string}.')

#End timer
end=time.perf_counter() 

#Show the time taken to run the code
display(f'It took {round(end-start,2)} seconds to run the code.')

Summary of the webscraping run:
<br>
- it was last run on 24/11/22 at 1.06 pm,
<br>
- it took 592.09 seconds to run, and
<br>
- a total of around 6000 subreddit posts were scraped across depression and anxiety.

## Save the scraped files to CSV

In [9]:
%%script false --no-raise-error
#The line above skips this cell so that the saved files are not overwritten.

submissions_depression_df.to_csv('../00-datasets/depression_data.csv')
submissions_anxiety_df.to_csv('../00-datasets/anxiety_data.csv')

#Show the results in words
display(f'The depression and anxiety files were last saved on {dt_string}.')

The depression and anxiety files were last saved on 24/11/22 at 1.06 pm.
<br>
Although the next notebook follows on from this one, the scraped data cannot be carried over: the scraping cell is disabled to prevent it from re-running, and the data are lost whenever the notebook is closed or restarted.
<br>
Therefore, we need to save the data as CSV files so that the datasets can be imported for analysis later.
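
A minimal sketch of the save-and-reload round trip is shown below. It uses a small made-up DataFrame and an in-memory buffer instead of the real files. Because `to_csv` is called without `index=False` above, the row index is written as the first column, so the sketch assumes `index_col=0` is used when reading the data back.

```python
import io
import pandas as pd

# Hypothetical stand-in for a scraped DataFrame
df = pd.DataFrame({
    "title": ["post a", "post b"],
    "subreddit": ["Anxiety", "Anxiety"],
    "selftext": ["text a", "text b"],
})

# Write to an in-memory buffer; the row index is saved as the first column
buffer = io.StringIO()
df.to_csv(buffer)

# Read it back, treating the first column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the reloaded DataFrame would gain an extra `Unnamed: 0` column.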

In [10]:
print("Part 1 of the notebook has been synchronized. \nPlease proceed to part 2 of the notebook. \nThank you for your patience.")

Part 1 of the notebook has been synchronized. 
Please proceed to part 2 of the notebook. 
Thank you for your patience.


To continue, please proceed to [part 2](./02-Code-Part-02.ipynb) of the notebook.