# Reddit NLP Classification

### Define Problem
Posts are sourced from two separate **subreddits**. The goal is to predict which subreddit (class) a post belongs to using NLP methods.

### Modeling
The two target classes are **r/atheism** and **r/catholicism**. The classes are expected to share many features (words), because both subreddits discuss religion, though from opposing positions. The target variable is binary, so classification models will be used to make predictions.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import r2_score, confusion_matrix, roc_auc_score

import requests
import time
from bs4 import BeautifulSoup

import nltk
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem.porter import PorterStemmer
import regex as re



### Gather Data: Reddit API Request
The data were previously retrieved with the Python Reddit API Wrapper (PRAW).

### Load r/atheism and r/catholicism subreddits

In [2]:
atheism_df = pd.read_pickle('./data/atheism_df.pkl')
catholicism_df = pd.read_pickle('./data/catholicism_df.pkl')

### r/atheism EDA

In [3]:
atheism_df.head()

Unnamed: 0,id,title,author,created,selftext,url,subreddit
0,b98dv9,"Bibleman has been rebooted, and the villains o...",0,1554350000.0,,https://pureflix.com/series/267433510476/bible...,atheism
1,b9b45i,Roughly half of Americans think Christian nati...,0,1554370000.0,,https://www.lgbtqnation.com/2019/04/roughly-ha...,atheism
2,b9enrm,Anti-vaxxer ‘warrior mom’: If vaccines are so ...,0,1554390000.0,,http://deadstate.org/anti-vaxxer-warrior-mom-i...,atheism
3,b9dmqn,Megachurch preachers and their expensive sneak...,0,1554390000.0,,https://boingboing.net/2019/04/03/megachurch-p...,atheism
4,b95ydy,"Mormons say “Priesthood ban”, to describe thei...",0,1554340000.0,,https://www.dialoguejournal.com/wp-content/upl...,atheism


In [4]:
atheism_df.shape

(957, 7)

In [5]:
atheism_df.dtypes

id           object
title        object
author       object
created      object
selftext     object
url          object
subreddit    object
dtype: object

In [6]:
# Check if `author` is always listed as 0
atheism_df['author'].unique()

array([0], dtype=object)

In [7]:
# Check for rows where `selftext` has text (also checking for how many rows
# have no text)
atheism_df[atheism_df['selftext'] != '']

Unnamed: 0,id,title,author,created,selftext,url,subreddit
5,b9dasu,It blows my mind that churches are such an ind...,0,1.55439e+09,I moved from southern Kentucky to central Ohio...,https://www.reddit.com/r/atheism/comments/b9da...,atheism
6,b9filg,HALF of the confirmed Measles cases in 2018 st...,0,1.5544e+09,[Article](https://khn.org/news/why-measles-hit...,https://www.reddit.com/r/atheism/comments/b9fi...,atheism
7,b9cq6o,"Ex muslims, I did a presentation on why religi...",0,1.55438e+09,I got the quotes Straight from an Islamic Qura...,https://www.reddit.com/r/atheism/comments/b9cq...,atheism
10,b98xyw,All religions started out as a cult,0,1.55436e+09,They just got big enough to be called a religion,https://www.reddit.com/r/atheism/comments/b98x...,atheism
16,b9d3y5,Why did God put the fruit of the knowledge of ...,0,1.55439e+09,Like I get the whole free will thing and how G...,https://www.reddit.com/r/atheism/comments/b9d3...,atheism
19,b9fkzy,Abortion in Alabama.,0,1.5544e+09,It just blows my mind beyond belief sometimes ...,https://www.reddit.com/r/atheism/comments/b9fk...,atheism
20,b9e7bh,The idea of Evolution with Jesus,0,1.55439e+09,"Hey, I've been here for a while, and commented...",https://www.reddit.com/r/atheism/comments/b9e7...,atheism
28,b9b9lv,Just a shower-thought I had on religion vs sci...,0,1.55438e+09,Let's imagine that human knowledge is a nevere...,https://www.reddit.com/r/atheism/comments/b9b9...,atheism
31,b9fopr,Irony of the Tree of Knowledge,0,1.5544e+09,Was listening to some talks about faith and ho...,https://www.reddit.com/r/atheism/comments/b9fo...,atheism
32,b99ari,Male and female circumcision,0,1.55436e+09,Imagine a society in which it is custom to rem...,https://www.reddit.com/r/atheism/comments/b99a...,atheism


### r/atheism EDA Findings

- `id` is unique for each post
- `title` and `selftext` must be cleaned of non-alphanumeric characters and URLs
- `author` is always listed as `0`
- `selftext` is blank in some rows
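The cells above do not show how the uniqueness and blank-cell findings were checked. A minimal sketch of such checks, run on a small invented frame (`toy_df` and its values are hypothetical stand-ins for `atheism_df`):

```python
import pandas as pd

# Hypothetical stand-in for atheism_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3'],
    'title': ['First post', 'Second post', 'Third post'],
    'author': [0, 0, 0],
    'selftext': ['', 'Some body text', ''],
})

# True when no post id appears more than once
ids_unique = toy_df['id'].is_unique

# Count posts with an empty body (link and image posts carry no selftext)
n_blank = int((toy_df['selftext'] == '').sum())

# Number of distinct values in `author`; 1 means the column carries no information
n_authors = toy_df['author'].nunique()

print(ids_unique, n_blank, n_authors)
```

The same three checks could be applied unchanged to `atheism_df` and `catholicism_df`.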

### r/catholicism EDA

In [10]:
catholicism_df.head()

Unnamed: 0,id,title,author,created,selftext,url,subreddit
0,b813h8,/r/Catholicism Prayer Requests — Week of April...,0,1554120000.0,\nPlease post your prayer requests in this wee...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
1,b9eqan,Standing next to my painting of the Resurrecti...,0,1554390000.0,,https://i.redd.it/k2i1sh1wu9q21.jpg,Catholicism
2,b9dcdi,Today is the feast day of Saint Benedict the M...,0,1554390000.0,,https://i.redd.it/4xi5i01db9q21.jpg,Catholicism
3,b9ewzu,Found this anime version of Our Lady of Perpet...,0,1554400000.0,,https://i.redd.it/mko7inagx9q21.jpg,Catholicism
4,b9fcxx,Traditional Latin Mass saves U.S. Parish from ...,0,1554400000.0,,https://www.lifesitenews.com/news/traditional-...,Catholicism


In [11]:
catholicism_df.shape

(969, 7)

In [12]:
catholicism_df.dtypes

id           object
title        object
author       object
created      object
selftext     object
url          object
subreddit    object
dtype: object

In [13]:
# Check if `author` is always listed as 0
catholicism_df['author'].unique()

array([0], dtype=object)

In [14]:
# Check for rows where `selftext` has text (also checking for how many rows
# have no text)
catholicism_df[catholicism_df['selftext'] != '']

Unnamed: 0,id,title,author,created,selftext,url,subreddit
0,b813h8,/r/Catholicism Prayer Requests — Week of April...,0,1.55412e+09,\nPlease post your prayer requests in this wee...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
6,b9ejvj,Are People in Your Parish Openly Pro-Abortion?,0,1.55439e+09,They sure are in my parish. During this curre...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
8,b9bei4,I have a question about walking to a church as...,0,1.55438e+09,"I really hope this isn't a stupid question, bu...",https://www.reddit.com/r/Catholicism/comments/...,Catholicism
11,b96c4i,Received the sacrament of reconciliation for t...,0,1.55434e+09,I asked here about a week ago about a first ti...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
13,b9c32m,"Long time lurker, I have a question in my mind",0,1.55438e+09,Newly Catholic here.\n\nCan we pray our own wr...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
14,b9dkph,Catholics and tattoos,0,1.55439e+09,"Hey all, I am wanting to get a tattoo for my 2...",https://www.reddit.com/r/Catholicism/comments/...,Catholicism
15,b9c0ms,THE FIVE HOLY WOUNDS OF JESUS,0,1.55438e+09,&#x200B;\n\n[Precious Blood of Jesus](https://...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
16,b9g2qn,Honest question: how loyal are you to the pope?,0,1.5544e+09,What does he mean to you personally? How would...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism
17,b94vos,I'd like to say hi,0,1.55433e+09,"Hi everyone,\n\nI'm sure it's not mandatory bu...",https://www.reddit.com/r/Catholicism/comments/...,Catholicism
18,b9bfp5,Falling short on a lenten promise,0,1.55438e+09,Hi everyone! I have been thinking about this f...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism


### r/catholicism EDA Findings

- `id` is unique for each post
- `title` and `selftext` must be cleaned of non-alphanumeric characters and URLs
- `author` is always listed as `0`
- `selftext` is blank in some rows

### Balance classes: random sample from r/catholicism to match the row count of r/atheism
- The number of posts retrieved from each subreddit does not match (957 vs. 969). The subreddit with more posts is randomly sampled down to the size of the subreddit with fewer posts.

In [15]:
catholicism_sample = catholicism_df.sample(n=atheism_df.shape[0], random_state=42)
catholicism_sample.shape

(957, 7)

In [16]:
atheism_df.shape

(957, 7)
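Balancing fixes the baseline that any model has to beat. A minimal sketch using the row counts reported above (957 and 969); the helper name `majority_baseline` is hypothetical:

```python
def majority_baseline(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(class_counts) / sum(class_counts)

# Row counts taken from the shapes printed in the notebook
before_sampling = majority_baseline([957, 969])
after_sampling = majority_baseline([957, 957])

print(round(before_sampling, 4))  # 0.5031
print(after_sampling)             # 0.5
```

The imbalance was small to begin with, so sampling mainly makes the 50% baseline exact rather than changing the difficulty of the task.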

### Merge dataframes from both subreddits

In [18]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df = pd.concat([atheism_df, catholicism_sample])
df.shape

(1914, 7)
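Stacking two frames keeps each frame's original row labels, which is why the tail of the merged frame shows labels such as 458 and 330 rather than 1912 and 1913. A small sketch of this behavior with `pd.concat`, using two hypothetical miniature frames (`left` and `right` are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature frames standing in for atheism_df and catholicism_sample
left = pd.DataFrame({'title': ['a', 'b'], 'subreddit': ['atheism', 'atheism']})
right = pd.DataFrame(
    {'title': ['c', 'd'], 'subreddit': ['Catholicism', 'Catholicism']},
    index=[7, 3],  # shuffled labels, as a random sample would have
)

stacked = pd.concat([left, right])                        # keeps the original labels
renumbered = pd.concat([left, right], ignore_index=True)  # relabels the rows 0..n-1

print(stacked.index.tolist())     # [0, 1, 7, 3]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate labels are harmless for vectorizing text, but `ignore_index=True` may be worth considering if rows are later selected with `.loc`.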

In [20]:
df.head(2)

Unnamed: 0,id,title,author,created,selftext,url,subreddit
0,b98dv9,"Bibleman has been rebooted, and the villains o...",0,1554350000.0,,https://pureflix.com/series/267433510476/bible...,atheism
1,b9b45i,Roughly half of Americans think Christian nati...,0,1554370000.0,,https://www.lgbtqnation.com/2019/04/roughly-ha...,atheism


In [21]:
df.tail(2)

Unnamed: 0,id,title,author,created,selftext,url,subreddit
458,b72frr,"Alyssa Milano, 49 celebrities threaten Georgia...",0,1553890000.0,,https://www.foxnews.com/entertainment/alyssa-m...,Catholicism
330,b7qkcc,Why is it ive never heard Eucharistic prayer 1...,0,1554060000.0,When I go to daily Mass i either hear EP 2 or ...,https://www.reddit.com/r/Catholicism/comments/...,Catholicism


### Change target column to binary (atheism: 0, catholicism: 1)

In [23]:
# Encode the labels explicitly. Run this cell only once: the column is
# overwritten in place, so re-running would re-encode already-encoded values.
df['subreddit'] = df['subreddit'].map({'atheism': 0, 'Catholicism': 1})
df.head(2)

Unnamed: 0,id,title,author,created,selftext,url,subreddit
0,b98dv9,"Bibleman has been rebooted, and the villains o...",0,1554350000.0,,https://pureflix.com/series/267433510476/bible...,0
1,b9b45i,Roughly half of Americans think Christian nati...,0,1554370000.0,,https://www.lgbtqnation.com/2019/04/roughly-ha...,0


In [24]:
df.tail(2)

Unnamed: 0,id,title,author,created,selftext,url,subreddit
458,b72frr,"Alyssa Milano, 49 celebrities threaten Georgia...",0,1553890000.0,,https://www.foxnews.com/entertainment/alyssa-m...,1
330,b7qkcc,Why is it ive never heard Eucharistic prayer 1...,0,1554060000.0,When I go to daily Mass i either hear EP 2 or ...,https://www.reddit.com/r/Catholicism/comments/...,1


### Clean data
- Remove non-alphanumeric characters
- Remove URLs
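The notebook does not include the cleaning code itself. One possible sketch of the steps listed above, using only the standard library; the function name `clean_text`, both patterns, and the sample strings are assumptions rather than the author's implementation:

```python
import html
import re

# Assumed patterns: anything starting with http(s):// or www. counts as a URL
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
NON_ALNUM_PATTERN = re.compile(r'[^a-z0-9\s]')

def clean_text(raw):
    """Decode HTML entities, lowercase, drop URLs, and strip non-alphanumeric characters."""
    text = html.unescape(raw).lower()        # e.g. '&#x200B;' becomes a zero-width space
    text = URL_PATTERN.sub(' ', text)        # remove URLs before punctuation is stripped
    text = NON_ALNUM_PATTERN.sub(' ', text)  # replace everything except letters, digits, whitespace
    return ' '.join(text.split())            # collapse runs of whitespace

print(clean_text('Read the [Article](https://example.com/story): HALF of the cases!'))
# read the article half of the cases
```

URLs are removed first on purpose: stripping punctuation beforehand would break a link into ordinary-looking word fragments that the URL pattern could no longer match.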

### Evaluation

In [25]:
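def get_most_important_features(vectorizer, model, n=5):
    """Return the n largest and n smallest coefficients (with their words) per coefficient row."""
    # Invert the vocabulary so that a column index maps back to its word
    index_to_word = {v: k for k, v in vectorizer.vocabulary_.items()}

    # A binary linear model has a single row in coef_; a multiclass model has one row per class
    classes = {}
    for class_index in range(model.coef_.shape[0]):
        word_importances = [(el, index_to_word[i]) for i, el in enumerate(model.coef_[class_index])]
        sorted_coeff = sorted(word_importances, key=lambda x: x[0], reverse=True)
        tops = sorted(sorted_coeff[:n], key=lambda x: x[0])
        bottom = sorted_coeff[-n:]
        classes[class_index] = {
            'tops': tops,
            'bottom': bottom
        }
    return classes

# `count_vectorizer` and `clf` must be a fitted vectorizer and a fitted linear model;
# neither is defined earlier in this notebook, hence the NameError below
importance = get_most_important_features(count_vectorizer, clf, 10)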

NameError: name 'count_vectorizer' is not defined
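The `NameError` above arises because the notebook never creates `count_vectorizer` or `clf`. A minimal sketch of how the two objects might be built, assuming a `CountVectorizer` and a `LogisticRegression` (both imported at the top of the notebook) and an invented four-document corpus in place of the real posts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy corpus; labels follow the notebook's encoding (atheism: 0, catholicism: 1)
texts = [
    'science and evidence against religion',
    'evidence for evolution and science',
    'prayer request for mass on sunday',
    'the priest said mass and led a prayer',
]
labels = [0, 0, 1, 1]

# Learn the vocabulary and build the document-term count matrix
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X, labels)

# A binary LogisticRegression stores a single coefficient row: shape (1, n_features)
print(clf.coef_.shape)

# Positive coefficients push toward class 1, negative ones toward class 0
mass_coef = clf.coef_[0][count_vectorizer.vocabulary_['mass']]
science_coef = clf.coef_[0][count_vectorizer.vocabulary_['science']]
print(mass_coef > 0, science_coef < 0)
```

Because `coef_` has only one row for a binary problem, the dictionary returned by `get_most_important_features` has the single key `0`: its `tops` entries lean toward class 1 and its `bottom` entries toward class 0.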

In [26]:
def plot_important_words(top_scores, top_words, bottom_scores, bottom_words, name):
    """Plot the most negative (class 0) and most positive (class 1) word coefficients side by side."""
    y_pos = np.arange(len(top_words))
    # Sort both groups so that the strongest words appear at the top of each chart
    top_pairs = sorted(zip(top_words, top_scores), key=lambda x: x[1])
    bottom_pairs = sorted(zip(bottom_words, bottom_scores), key=lambda x: x[1], reverse=True)

    top_words = [a[0] for a in top_pairs]
    top_scores = [a[1] for a in top_pairs]

    bottom_words = [a[0] for a in bottom_pairs]
    bottom_scores = [a[1] for a in bottom_pairs]

    plt.figure(figsize=(10, 10))

    # Negative coefficients push a post toward class 0 (atheism)
    plt.subplot(121)
    plt.barh(y_pos, bottom_scores, align='center', alpha=0.5)
    plt.title('r/atheism', fontsize=20)
    plt.yticks(y_pos, bottom_words, fontsize=14)
    plt.xlabel('Importance', fontsize=20)

    # Positive coefficients push a post toward class 1 (catholicism)
    plt.subplot(122)
    plt.barh(y_pos, top_scores, align='center', alpha=0.5)
    plt.title('r/Catholicism', fontsize=20)
    plt.yticks(y_pos, top_words, fontsize=14)
    plt.suptitle(name, fontsize=16)
    plt.xlabel('Importance', fontsize=20)

    plt.subplots_adjust(wspace=0.8)
    plt.show()

# A binary model has a single coefficient row, so the only key in `importance` is 0
top_scores = [a[0] for a in importance[0]['tops']]
top_words = [a[1] for a in importance[0]['tops']]
bottom_scores = [a[0] for a in importance[0]['bottom']]
bottom_words = [a[1] for a in importance[0]['bottom']]

plot_important_words(top_scores, top_words, bottom_scores, bottom_words, "Most important words per subreddit")

NameError: name 'importance' is not defined