<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP (Part 1 Data Collection)
#### Binary Classification of Subreddit Posts

 - [Problem Statement](#Problem-Statement)
 - [Executive Summary](#Executive-Summary)
 - [Methodology](#Methodology)
 - [Datasets](#Datasets)
 - [Data Import & Setup Conditions](#Data-Import-&-Setup-Conditions)

## Problem Statement

We've been tasked by a matchmaking company (Swindler Inc.) to construct a binary classification model that can distinguish genuine relationship advice questions (subreddit r/relationship_advice) from potential spam and unrelated enquiries (subreddit r/stupidquestions). Seven classification models will be created for this project: Naive Bayes, Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbors, Support Vector Machines and Gradient Boosting. Accuracy, precision and F1 score on unseen test data will be used to determine success. Swindler Inc.'s relationship experts are the key stakeholders. The aim of this study is for relationship specialists to spend as little time as possible reviewing incoming enquiries to determine which are spam. The time saved can then be put to better use, such as increasing production or providing clients with more specialized relationship guidance.

## Executive Summary

Working professionals who don't have enough time to meet people in person owing to their tight work schedules are increasingly using the Swindler dating app. It not only makes dating easy and comfortable for working professionals, but also provides customized relationship advice. However, the Swindler dating app receives a flood of requests for advice on a daily basis, and the company fears that some of these inquiries are the result of foul play by competitors.

As such, the company has enlisted well-known NLP experts to develop a binary classification model that can tell the difference between genuine relationship guidance enquiries and unrelated inquiries. The NLP experts recommend collecting data from Reddit, a social news network that allows users to discuss and vote on content posted by other users, in order to train a suitable model. The model will be developed using posts from two subreddits: genuine relationship inquiries from r/relationship_advice and unrelated questions from r/stupidquestions.

Seven classification models were developed for this binary classification project: Naive Bayes, Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbors, Support Vector Machines and Gradient Boosting. Accuracy, precision and F1 score on unseen test data were used to determine success.
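As a reminder of what the three success metrics measure, here is a minimal sketch that computes them from confusion-matrix counts in plain Python. The function name `binary_metrics` and the toy labels are illustrative assumptions and are not taken from the project's data or code.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels invented for illustration (1 = genuine, 0 = unrelated).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Precision matters here because a low value would mean unrelated posts are being passed to the relationship specialists as genuine ones.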

In terms of training and testing grid search scores, the Logistic Regression and GradientBoostingClassifier models perform equally well after hyperparameter tuning. However, because it is the simplest model for the audience to understand, we selected the logistic regression model with CountVectorizer as our final model in this study. On our testing dataset, it has the highest accuracy of 0.96161, precision of 0.9555, and F1 score of 0.9621. However, the model still overfits somewhat, with training scores higher than test scores.

Moving forward, we recommend the following steps to further improve the model:
- Try a more recent classification technique such as CatBoost to check for performance improvements
- Lemmatize the posts from both topics
- Perform sentiment analysis on the two topics and look for clear differences
- Expand the stopword list to include the most frequently misclassified words
- Run a grid search for all 7 classification models

## Methodology 

Following Blitzstein & Pfister’s workflow ([*source3*](https://github.com/cs109/2015/blob/master/Lectures/01-Introduction.pdf)), a five-step framework was used to conduct this analysis. The five steps are:

**Step 1: Ask an interesting question**
- Defining a clear and concise problem statement.

**Step 2: Get the data**
- Import and clean the raw data to ensure that all datatypes are accurate and any other errors are fixed.

**Step 3: Explore the data**
- Check for duplicated posts in each topic
- Plot the distribution of posts for each topic
- Feature engineering
- Remove URLs, punctuation, non-ASCII characters and stopwords from each post
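The duplicate check and text cleaning listed under Step 3 could look roughly like the sketch below. It uses only the standard library. The stopword set, the regular expression and the helper names `clean_post` and `drop_duplicates` are illustrative assumptions, not the ones used later in the project.

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "my", "i"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_post(text):
    """Strip URLs, non-ASCII characters, punctuation and stopwords."""
    text = URL_PATTERN.sub(" ", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE).lower()
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


def drop_duplicates(posts):
    """Keep the first occurrence of each post, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        if post not in seen:
            seen.add(post)
            unique.append(post)
    return unique


# Made-up posts for illustration only.
raw_posts = [
    "Is it normal to argue daily? See https://example.com/thread",
    "Is it normal to argue daily? See https://example.com/thread",
    "My partner’s café habit… advice?",
]
cleaned = [clean_post(p) for p in drop_duplicates(raw_posts)]
print(cleaned)
```

Note that simply dropping non-ASCII characters truncates accented words ("café" becomes "caf"), which is a trade-off of this approach.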

**Step 4: Model the data**
- Create a baseline model with MultinomialNB
- Compare success metrics across the different classification models after hyperparameter tuning
- Select the best machine learning model for submission
- Data visualization
  - barplots
  - histograms
  - SHAP summary plot
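To illustrate the mechanics behind the multinomial Naive Bayes baseline named in Step 4, here is a minimal plain-Python sketch with Laplace smoothing over whitespace-split word counts. The class `TinyMultinomialNB` and the toy training posts are hypothetical and only show the idea. This is not scikit-learn's `MultinomialNB` or the project's actual pipeline.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Bag-of-words multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training posts invented for illustration.
train_docs = [
    "my boyfriend ignores my texts",
    "should i forgive my girlfriend",
    "why is the sky blue",
    "can fish get thirsty",
]
train_labels = ["genuine", "genuine", "unrelated", "unrelated"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
prediction = model.predict_one("my girlfriend ignores me")
print(prediction)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.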

**Step 5: Communicate and visualize the results**
- Present findings to a non-technical audience and provide recommendations

## Datasets

* [`genuine_qns.csv`](../datasets/genuine_qns.csv): Contains genuine relationship questions. This dataset will be split for training and testing purposes.
* [`unrelated_qns.csv`](../datasets/unrelated_qns.csv): Contains unrelated questions. This dataset will be split for training and testing purposes.
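One way to split the two labelled datasets while keeping the class balance the same in both parts is a stratified split. The sketch below uses only the standard library. The 25% test fraction, the seed, the `label` key and the helper name `stratified_split` are assumptions, since the source does not state how the split is performed.

```python
import random


def stratified_split(rows, label_key="label", test_fraction=0.25, seed=42):
    """Split rows into train/test, sampling each label separately."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Placeholder rows standing in for the two CSV files (1 = genuine, 0 = unrelated).
rows = ([{"text": f"genuine {i}", "label": 1} for i in range(8)]
        + [{"text": f"unrelated {i}", "label": 0} for i in range(8)])

train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```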

## Data Import & Setup Conditions

#### Importing Libraries

In [11]:
import requests
import datetime as dt
from pmaw import PushshiftAPI
import pandas as pd

#### Using Pushshift API to extract 8000 posts from subreddit r/relationship_advice

In [17]:
api = PushshiftAPI()

subreddit = 'relationship_advice'
limit = 8000
before = int(dt.datetime(2022,1,1,0,0).timestamp())
after = int(dt.datetime(2010,1,1,0,0).timestamp())
rate_limit = 100  # target request rate, set to avoid overloading the server with requests
max_sleep = 10  # maximum rate-limit sleep between requests, in seconds

real_qns_data = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after, rate_limit=rate_limit, max_sleep=max_sleep)
print(f'Retrieved {len(real_qns_data)} submissions from Pushshift')

Not all PushShift shards are active. Query results may be incomplete.


Retrieved 8000 submissions from Pushshift
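The `before` and `after` bounds above are built from naive `datetime` objects, which `timestamp()` interprets in the machine's local timezone, so the exact window can shift by the local UTC offset. If a reproducible window is wanted, a minimal sketch that pins both bounds to UTC is shown below. The helper name `utc_epoch` is an assumption and not part of the notebook.

```python
import datetime as dt


def utc_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


after = utc_epoch(2010, 1, 1)
before = utc_epoch(2022, 1, 1)
print(after, before)
```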


In [18]:
#Parsing the data into a dataframe
real_qns_df = pd.DataFrame(real_qns_data)

In [19]:
#Filtering the columns to contain only the subreddit name, title and content of the post
real_qns_df = real_qns_df[['subreddit', 'selftext', 'title']]

In [20]:
real_qns_df.shape

(8000, 3)

In [21]:
# Some posts have no text content and consist only of images, GIFs, or videos. Since only 11 of 8000 posts lack content, this is acceptable.
real_qns_df.isnull().sum()

subreddit     0
selftext     11
title         0
dtype: int64
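Because the title is never missing, one option for the few posts without body text is to fall back to the title alone when building the text used for modeling. The sketch below shows that idea in plain Python. The helper `combine_text` is hypothetical, and treating `[removed]` and `[deleted]` as empty is an assumption about the data, since the notebook only reports null counts.

```python
# Placeholder bodies treated as empty (an assumption about the data).
PLACEHOLDERS = {"[removed]", "[deleted]"}


def combine_text(title, selftext):
    """Join title and body, ignoring missing or placeholder bodies."""
    # Missing values loaded from a DataFrame arrive as None or float NaN.
    body = selftext.strip() if isinstance(selftext, str) else ""
    if body in PLACEHOLDERS:
        body = ""
    head = title.strip() if isinstance(title, str) else ""
    return " ".join(part for part in (head, body) if part)


print(combine_text("Need advice", None))
print(combine_text("Need advice", " We argue every day. "))
```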

In [22]:
# Save the dataframe in CSV format
real_qns_df.to_csv('../datasets/genuine_qns.csv')

#### Using Pushshift API to extract 8000 posts from subreddit r/stupidquestions

In [9]:
subreddit = 'stupidquestions'
limit = 8000
before = int(dt.datetime(2022,1,1,0,0).timestamp())
after = int(dt.datetime(2010,1,1,0,0).timestamp())
# rate_limit and max_sleep are not passed for this subreddit, so pmaw's default rate limiting applies

unrelated_qns_data = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
print(f'Retrieved {len(unrelated_qns_data)} submissions from Pushshift')

Not all PushShift shards are active. Query results may be incomplete.


Retrieved 8000 submissions from Pushshift


In [12]:
#Parsing the data into a dataframe
unrelated_qns_df = pd.DataFrame(unrelated_qns_data)

In [13]:
#Filtering the columns to contain only the subreddit name, title and content of the post
unrelated_qns_df = unrelated_qns_df[['subreddit', 'selftext', 'title']]

In [14]:
unrelated_qns_df.shape

(8000, 3)

In [15]:
# Some posts have no text content and consist only of images, GIFs, or videos. Since only 6 of 8000 posts lack content, this is acceptable.
unrelated_qns_df.isnull().sum()

subreddit    0
selftext     6
title        0
dtype: int64

In [16]:
# Save the dataframe in CSV format
unrelated_qns_df.to_csv('../datasets/unrelated_qns.csv')

#### Continued in Project3-Part2