<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP
## Truth or Truth

# Introduction
---

## Table of Contents:
- [Problem Statement](#problem_statement)
- [Data Used](#data_used)
- [Data Dictionary](#data_dictionary)
- [Next Step](#next_step_1)

## Problem Statement<a id='problem_statement'></a>
---

Reddit's moderation teams are overwhelmed by the volume of posts and comments across its subreddit communities. In particular, they have trouble ensuring that every post in a given subreddit meets relevancy standards. The aim of this project is to build a model that takes text content from two different subreddits and accurately classifies the origin of each piece of content. The metrics to maximize are accuracy and F1 score. If successful, the model could be extended to classify content from any number of subreddits across the site, which would ease the burden on Reddit's moderation staff and open new avenues for effective moderation. The two subreddits used in this study are **r/news** and **r/conspiracy**. The four classification models built to solve this problem are Logistic Regression, Multinomial Naive Bayes, XGBClassifier, and Voting Classifier.

### Background

Reddit is a social news aggregation and discussion website. Registered users submit content to the site, which other members can vote up or down. Posts are organized by subject into user-created boards called subreddits. Reddit administrators and subreddit moderators moderate the communities.<sup>1</sup> According to March 2023 statistics, Reddit ranks as the ninth most visited website globally and the sixth most visited website in the United States.<sup>2</sup>

Natural Language Processing (NLP) is the branch of computer science and artificial intelligence concerned with giving computers the ability to understand text and spoken words.<sup>3</sup> Logistic regression estimates the probability of an event occurring based on a dataset of independent variables.<sup>4</sup> The Multinomial Naive Bayes algorithm is a probabilistic learning approach based on Bayes' theorem that is commonly used for text classification.<sup>5</sup> XGBoost is a gradient-boosted decision tree machine learning model. It iteratively trains an ensemble of decision trees, and the final prediction is a weighted sum of all of the tree predictions.<sup>6</sup> A Voting Classifier is a machine learning model that combines an ensemble of several models. It aggregates the results of each classifier and predicts the class that receives the most votes.<sup>7</sup>
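
As a rough illustration of the majority-vote idea described above, the sketch below implements hard voting in plain Python. The function name `hard_vote` and the three prediction lists are hypothetical and are not taken from the project's modeling notebooks; scikit-learn's `VotingClassifier` provides comparable behavior through `voting="hard"`.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for each sample.

    `predictions` is a list of per-model prediction lists, all of equal length.
    """
    voted = []
    # zip(*predictions) groups the votes that every model cast for one sample
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical outputs from three models on four posts (illustrative only)
logreg_preds = ["news", "conspiracy", "news", "conspiracy"]
nb_preds = ["news", "news", "news", "conspiracy"]
xgb_preds = ["conspiracy", "conspiracy", "news", "news"]

final = hard_vote([logreg_preds, nb_preds, xgb_preds])
print(final)
```

With an odd number of models and two classes, as in this toy example, a tie cannot occur, so every sample gets a clear majority label.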

Classification accuracy is a metric that summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions.<sup>8</sup> The F1 score is a second measure of classification performance: it is the harmonic mean of a model’s precision and recall.<sup>9</sup>
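
To make the two metrics concrete, the following minimal sketch computes both from a list of true and predicted labels. The helper name `accuracy_and_f1`, the toy labels, and the choice of r/conspiracy as the positive class are assumptions made for illustration only.

```python
def accuracy_and_f1(y_true, y_pred, positive):
    """Compute accuracy and the F1 score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


# Toy labels (hypothetical): 3 conspiracy posts and 5 news posts
y_true = ["conspiracy"] * 3 + ["news"] * 5
y_pred = ["conspiracy", "conspiracy", "news", "conspiracy", "news", "news", "news", "news"]

accuracy, f1 = accuracy_and_f1(y_true, y_pred, positive="conspiracy")
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

In this toy case 6 of 8 predictions are correct, so accuracy is 0.75, while precision and recall are both 2/3, which gives an F1 score of 2/3.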

1. [Reddit - Overview](https://en.wikipedia.org/wiki/Reddit)
2. [Reddit - Traffic Statistics](https://www.semrush.com/website/reddit.com/overview/)
3. [Natural Language Processing](https://www.ibm.com/topics/natural-language-processing)
4. [Logistic Regression](https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables.)
5. [Multinomial Naive Bayes](https://www.upgrad.com/blog/multinomial-naive-bayes-explained/#:~:text=The%20Multinomial%20Naive%20Bayes%20algorithm%20is%20a%20Bayesian%20learning%20approach,tag%20with%20the%20greatest%20chance.)
6. [XGBoost](https://www.nvidia.com/en-us/glossary/data-science/xgboost/)
7. [Voting Classifier](https://www.geeksforgeeks.org/ml-voting-classifier-using-sklearn/#)
8. [Classification Accuracy](https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/#:~:text=Classification%20accuracy%20is%20a%20metric,used%20for%20evaluating%20classifier%20models.)
9. [F1 Score](https://www.v7labs.com/blog/f1-score-guide)

### Research

#### r/news
>"The place for news articles about current events in the United States and the rest of the world." It boasts over 21 million members and is the ninth largest subreddit on the site.<sup>12</sup> Some of the content removal rules are listed below.

A post will be removed if it:
>* is not news.
>* is not in English.
>* is an opinion/analysis or advocacy piece.
>* primarily concerns politics.
>* has a title that does not match the actual title or the lede.
>* has a pay wall or steals content.
>* covers an already-submitted story.
>* violates reddit's site-wide rules, especially regarding personal info.<sup>12</sup>

A comment will be removed if it:
>* is racist, sexist, vitriolic, or overly crude.
>* is unnecessarily rude or provocative.
>* is a cheap and distracting joke or meme.
>* is responding to spam. 
>* violates reddit's site-wide rules.
>* advocates or celebrates the death of another person. 
>* incites violence.<sup>12</sup>
---
#### r/conspiracy
>"The conspiracy subreddit is a thinking ground. Above all else, we respect everyone's opinions and ALL religious beliefs and creeds. We hope to challenge issues which have captured the public’s imagination, from JFK and UFOs to 9/11. This is a forum for free thinking, not hate speech. Respect other views and opinions, and keep an open mind. Our intentions are aimed towards a fairer, more transparent world and a better future for everyone." It boasts nearly 2 million members.<sup>13</sup> Some of the main rules are listed below.

Rules:
>* Bigoted slurs are not tolerated.
>* Address the argument; not the user, the mods, or the sub.
>* No blog spam/malicious web sites.
>* No stalking or trolling. No threatening or abusive language.
>* No caps lock in titles other than acronyms/initialisms. Comments with a large percentage of all caps, all bold, all large fonts or text colors are considered 'shouting' and will be removed.
>* No memes
>* Posting links in other subs pointing to specific submissions or comments here is subject to a ban, depending on context.
>* Misleading, fabricated or sensationalist headlines are subject to removal.
>* Self posts that lack context or content may be removed.<sup>13</sup>

12. [r/news - Overview](https://www.reddit.com/r/news/)<br>
13. [r/conspiracy - Overview](https://www.reddit.com/r/conspiracy/)

## Data Used<a id='data_used'></a>
---

Data was scraped from the aforementioned subreddits using the [Pushshift Reddit API](https://github.com/pushshift/api). The methodology can be examined in the [Data Collection](./02_Data_Collection.ipynb) notebook. **The origin date is April 26, 2023.**

#### r/news

>* [`120d_0d.csv`](../data/reddit_news/posts/120d_0d.csv): 1000 posts up to 120 days before origin date.
>* [`240d_120d.csv`](../data/reddit_news/posts/240d_120d.csv): 1000 posts between 120-240 days before origin date.
>* [`360d_240d.csv`](../data/reddit_news/posts/360d_240d.csv): 1000 posts between 240-360 days before origin date.
>* [`480d_360d.csv`](../data/reddit_news/posts/480d_360d.csv): 1000 posts between 360-480 days before origin date.
>* [`600d_480d.csv`](../data/reddit_news/posts/600d_480d.csv): 1000 posts between 480-600 days before origin date.
>* [`720d_600d.csv`](../data/reddit_news/posts/720d_600d.csv): 1000 posts between 600-720 days before origin date.
>* [`840d_720d.csv`](../data/reddit_news/posts/840d_720d.csv): 1000 posts between 720-840 days before origin date.
>* [`960d_840d.csv`](../data/reddit_news/posts/960d_840d.csv): 1000 posts between 840-960 days before origin date.
>* [`1080d_960d.csv`](../data/reddit_news/posts/1080d_960d.csv): 1000 posts between 960-1080 days before origin date.
>* [`1200d_1080d.csv`](../data/reddit_news/posts/1200d_1080d.csv): 1000 posts between 1080-1200 days before origin date.

#### r/conspiracy

>* [`120d_0d.csv`](../data/reddit_conspiracy/posts/120d_0d.csv): 1000 posts up to 120 days before origin date.
>* [`240d_120d.csv`](../data/reddit_conspiracy/posts/240d_120d.csv): 1000 posts between 120-240 days before origin date.
>* [`360d_240d.csv`](../data/reddit_conspiracy/posts/360d_240d.csv): 1000 posts between 240-360 days before origin date.
>* [`480d_360d.csv`](../data/reddit_conspiracy/posts/480d_360d.csv): 1000 posts between 360-480 days before origin date.
>* [`600d_480d.csv`](../data/reddit_conspiracy/posts/600d_480d.csv): 1000 posts between 480-600 days before origin date.
>* [`720d_600d.csv`](../data/reddit_conspiracy/posts/720d_600d.csv): 1000 posts between 600-720 days before origin date.
>* [`840d_720d.csv`](../data/reddit_conspiracy/posts/840d_720d.csv): 1000 posts between 720-840 days before origin date.
>* [`960d_840d.csv`](../data/reddit_conspiracy/posts/960d_840d.csv): 1000 posts between 840-960 days before origin date.
>* [`1080d_960d.csv`](../data/reddit_conspiracy/posts/1080d_960d.csv): 1000 posts between 960-1080 days before origin date.
>* [`1200d_1080d.csv`](../data/reddit_conspiracy/posts/1200d_1080d.csv): 1000 posts between 1080-1200 days before origin date.
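
The file names above follow a regular pattern: ten consecutive 120-day windows counted back from the origin date. The sketch below reconstructs those windows and their epoch boundaries. It is only an illustration; the names `build_windows`, `after`, and `before`, as well as the use of midnight UTC for the origin date, are assumptions rather than details taken from the Data Collection notebook.

```python
from datetime import datetime, timedelta, timezone

# Origin date from the text; midnight UTC is an assumption
ORIGIN = datetime(2023, 4, 26, tzinfo=timezone.utc)
WINDOW_DAYS = 120
N_WINDOWS = 10


def build_windows(origin=ORIGIN, window_days=WINDOW_DAYS, n_windows=N_WINDOWS):
    """Return one dict per window with its file name and epoch boundaries."""
    windows = []
    for i in range(n_windows):
        newer = i * window_days        # days before origin where the window ends
        older = (i + 1) * window_days  # days before origin where the window starts
        windows.append({
            "filename": f"{older}d_{newer}d.csv",
            "after": int((origin - timedelta(days=older)).timestamp()),
            "before": int((origin - timedelta(days=newer)).timestamp()),
        })
    return windows


for window in build_windows():
    print(window["filename"], window["after"], window["before"])
```

Each window's `after` value equals the next window's `before` value, so the ten windows tile the 1200 days before the origin date without gaps.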



## Data Dictionary<a id='data_dictionary'></a>
---

The [final preprocessed dataset](../data/cleaned_data/conspiracy_news_preprocessed.csv) contains the following features:

|       **Feature**      | **Type** |            **Dataset**           |                                **Description**                                |
|:----------------------:|:--------:|:--------------------------------:|:-----------------------------------------------------------------------------:|
| **subreddit**          | _object_ | conspiracy_news_preprocessed.csv | origin of post text from Reddit.com                                           |
| **title**              | _object_ | conspiracy_news_preprocessed.csv | content of subreddit post                                                     |
| **utc_datatime_str**   | _datetime64_ | conspiracy_news_preprocessed.csv | date and time of post creation                                                |
| **language**           | _object_ | conspiracy_news_preprocessed.csv | language of post text                                                         |
| **title_length**       |   _int_  | conspiracy_news_preprocessed.csv | number of characters in post text                                             |
| **title_word_count**   |   _int_  | conspiracy_news_preprocessed.csv | number of words in post text                                                  |
| **sentiment**          |  _float_ | conspiracy_news_preprocessed.csv | sentiment polarity score of post text                                         |
| **sentiment_category** | _object_ | conspiracy_news_preprocessed.csv | categorized polarity score (negative, neutral, positive)                      |
| **hour**               | _object_ | conspiracy_news_preprocessed.csv | categorized time of day of post creation (morning, afternoon, evening, night) |
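
As a hedged sketch of how the simpler derived features in the table could be computed from a post title and its creation timestamp: the function names, the timestamp format `%Y-%m-%d %H:%M:%S`, and the hour cut-offs for each time-of-day category are all assumptions, not the project's actual preprocessing. The sentiment features are omitted because the text does not say which library produced them.

```python
from datetime import datetime


def time_of_day(hour):
    """Map an hour (0-23) to a category. The cut-offs are hypothetical."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def derive_features(title, created_str):
    """Build the length, word-count, and hour features for one post."""
    # Assumed timestamp format, e.g. "2023-04-25 13:45:10"
    created = datetime.strptime(created_str, "%Y-%m-%d %H:%M:%S")
    return {
        "title_length": len(title),              # number of characters
        "title_word_count": len(title.split()),  # whitespace-separated words
        "hour": time_of_day(created.hour),
    }


row = derive_features("Example headline about a current event", "2023-04-25 13:45:10")
print(row)
```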

## Next Step:<a id='next_step_1'></a>
---

### [Data Collection](./02_Data_Collection.ipynb)