# Subreddit Classifier  
  
The social media platform Reddit is divided into subreddits, each a message board devoted to a certain topic. Each post contains a title and some combination of body text, images, videos, or links to external websites, and people viewing the content can comment on it. Reddit uses a voting system in which anybody can vote on each post or comment, either increasing its score (an "upvote") or decreasing it (a "downvote"); posts and comments with higher scores are generally made more visible to anybody viewing the website. Popular subreddit topics include politics, news, humorous content, and AskReddit, a subreddit where people post questions and other people answer them in the comment section.

To scrape Reddit content, I used PRAW (the Python Reddit API Wrapper). It requires a Reddit developer account, which provides the credentials needed to access the API. Since I am only pulling content and not posting anything, I access the API in read-only mode, which requires fewer credentials than read-write mode.  
  
The cell below handles imports and initializes PRAW with my credentials that are loaded via an external file.

In [8]:
#Imports and setting up the reddit API
import pandas as pd
import os
import praw
import numpy as np
import fasttext

#credentials.csv holds a single header row: client ID, secret, user agent in that order
credentials = list(pd.read_csv(os.getcwd()+'/credentials.csv').columns)

reddit = praw.Reddit(client_id=credentials[0],
                     client_secret=credentials[1],user_agent=credentials[2])

#I want:
#Title, text, link (if applicable), top comments, author


As a preliminary test, the cell below gathers the top 1000 posts (by score) from six subreddits:

- politics, a subreddit focused on US politics
- AskReddit, a subreddit where people post questions and others answer them in the comment section
- worldnews, a subreddit focused on international (non-US) news
- funny, a subreddit mostly for memes and joke content
- gaming, a subreddit dedicated to news or stories about video games
- aww, a subreddit dedicated to photos and videos of animals, mostly pets

In [9]:
#Gathers top posts from a few subreddits
politics = list(reddit.subreddit('politics').top(limit=1000))
ask = list(reddit.subreddit('askreddit').top(limit=1000))
worldnews = list(reddit.subreddit('worldnews').top(limit=1000))
funny = list(reddit.subreddit('funny').top(limit=1000))
gaming = list(reddit.subreddit('gaming').top(limit=1000))
aww = list(reddit.subreddit('aww').top(limit=1000))

The code above will gather lists of PRAW submission objects which then need to have the relevant data extracted from them.  
My general idea is to create a Pandas dataframe containing all of the relevant data. Since PRAW requires each submission be handled individually (I cannot just grab the list, type list.title and get a list of all of the titles), I will be creating empty lists and appending the relevant values to them while looping over my lists of submissions.  
  
For now, Reddit comments will be ignored.
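
As a rough sketch of the same extraction idea, the snippet below builds one dictionary per submission instead of one list per field. The `fake_posts` objects are hypothetical stand-ins for PRAW submissions (so the example runs without API access), and `to_records` and `FIELDS` are illustrative names, not part of PRAW.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for PRAW submissions, so the sketch runs offline
fake_posts = {
    'politics': [SimpleNamespace(title='Senate passes bill', score=10, num_comments=3)],
    'aww': [SimpleNamespace(title='My cat', score=25, num_comments=7),
            SimpleNamespace(title='Puppy', score=5, num_comments=1)],
}

# Attributes to copy from every submission (an illustrative subset)
FIELDS = ['title', 'score', 'num_comments']

def to_records(posts_by_subreddit, fields=FIELDS):
    """Builds one dict per submission; the label is the subreddit's position."""
    records = []
    for label, (name, posts) in enumerate(posts_by_subreddit.items()):
        for post in posts:
            record = {field: getattr(post, field) for field in fields}
            record['subreddit'] = label
            records.append(record)
    return records

records = to_records(fake_posts)
print(records[0])
```

A list like `records` can be passed to `pd.DataFrame(records)` to get one column per field, which avoids keeping many parallel lists in sync.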

In [10]:
#Initializing empty lists, one per column of the final dataframe
titles = []
is_self = []
is_video = []
selftext = []
author = []
created = []
num_comments = []
score = []
subreddit = []
#The integer label of each subreddit is its position in this list:
#0 politics, 1 askreddit, 2 worldnews, 3 funny, 4 gaming, 5 aww
all_posts = [politics, ask, worldnews, funny, gaming, aww]
for label, posts in enumerate(all_posts):
    for post in posts:
        titles.append(post.title)
        is_self.append(post.is_self)
        is_video.append(post.is_video)
        selftext.append(post.selftext)
        author.append(post.author)
        created.append(post.created)
        num_comments.append(post.num_comments)
        score.append(post.score)
        subreddit.append(label)



In [11]:
#This creates a Pandas dataframe to hold the data
dat = pd.DataFrame()
dat['title'] = titles
dat['text'] = selftext
dat['author'] = author
dat['created'] = created
dat['score'] = score
dat['num_comments'] = num_comments
dat['is_self'] = is_self
dat['is_video'] = is_video
dat['subreddit'] = subreddit
dat = dat.sample(frac=1).reset_index(drop=True)

In [6]:
#Prints every title that tokenizes to fewer than two tokens,
#skipping titles that contain the word 'interesting'
from nltk.tokenize import word_tokenize
for i in range(len(dat)):
    if 'interesting' in dat['title'][i]:
        continue
    if len(word_tokenize(dat['title'][i])) < 2:
        print(dat['title'][i])


Progressive
Yummy
jogging
pain
True
Hmmm
F
neighbor
Printers
😍
Subwoofer
Shampoolympics
Dreams
Hi
onexboxonexbox
METAphor
Obu
Overdosed
Priorities
hmmm
Steam
✨wish✨
10/10
Link-ception
Geico
RIP
Sure
Hmmmmmmm
Why
Ironic
Uhh..
RiLed
Pathetic
Weaknesses
Pls
return
TSA
Wood
Weakness
hmmm
Fuck
M'laze
Confidence
Herbie


Now that the data is wrangled into a format that is easier to use, I am going to feed it into fasttext, the text classification library that Facebook publishes on its research GitHub. It might not work that well, but it makes for a quick prototype: it is simple to use and does not really require the usual NLP preprocessing.  
  
It can be trained in supervised mode which supplies a classifier, though it requires the data be loaded from an external text file where each entry looks like this:  
  
\__label__*thelabel* The text  
  
Where "*thelabel*" is the label of that data point and "The text" is the actual text. The prefix "\__label__" must appear exactly as written so that fasttext can tell the label apart from the text.
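
To make the format concrete, here is a minimal sketch of writing a line and reading a label back. The helper names `to_fasttext_line` and `parse_label` are my own for illustration and are not part of the fasttext package; the sketch also assumes integer labels, as used in this notebook.

```python
LABEL_PREFIX = '__label__'

def to_fasttext_line(label, text):
    """Formats one example; whitespace runs (including newlines) collapse to
    single spaces because each example must sit on one line."""
    clean_text = ' '.join(str(text).split())
    return LABEL_PREFIX + str(label) + ' ' + clean_text

def parse_label(predicted):
    """Strips the prefix from a label string such as '__label__12'."""
    if not predicted.startswith(LABEL_PREFIX):
        raise ValueError('not a fasttext label: ' + predicted)
    return int(predicted[len(LABEL_PREFIX):])

line = to_fasttext_line(3, 'Shampoolympics\n2020')
print(line)                        # __label__3 Shampoolympics 2020
print(parse_label('__label__12'))  # 12
```

Slicing by the prefix length keeps working once there are ten or more classes, whereas reading a single character at a fixed position only handles one-digit labels.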

In [7]:
#Splits into training and test data
train = dat[0:int(len(dat)*.9)]
test = dat[int(len(dat)*.9):len(dat)].reset_index(drop=True)
datafile = []
#Generates data file for output based on training data
for i in range(len(train)):
    datafile.append('__label__'+str(train['subreddit'][i])+' '+train['title'][i])

In [8]:
#This saves the data as a text file as per fasttext's requirements
#errors='ignore' drops any characters the default encoding cannot write
with open('data.txt','w',errors='ignore') as outfile:
    #This runs through the data and writes one example per line
    for line in datafile:
        outfile.write(line)
        outfile.write('\n')

In [9]:
#This trains the fasttext model
model = fasttext.train_supervised('data.txt')

In [10]:
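#This runs all of the training and testing examples through the fasttext classifier
#predict returns labels such as '__label__3'; index 9 is the first character after
#the prefix, so this only works while every label is a single digit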
trainout = []
for i in range(len(train)):
    trainout.append(int(model.predict(train['title'][i])[0][0][9]))
testout = []
for i in range(len(test)):
    testout.append(int(model.predict(test['title'][i])[0][0][9]))

In [11]:
#This computes the training and test accuracy by counting misclassified examples
trainerr = 0
#loops over training data
for i in range(len(train)):
    #If the model's output is different from the true label, count an error
    if (train['subreddit'][i]!=trainout[i]):
        trainerr = trainerr + 1
trainacc = 1 - trainerr/len(trainout)

testerr = 0
for i in range(len(test)):
    if (test['subreddit'][i] != testout[i]):
        testerr = testerr + 1
testacc = 1 - testerr/len(testout)
print('The training set accuracy is: '+str(trainacc))
print('The test set accuracy is: '+str(testacc))

The training set accuracy is: 0.604586129753915
The test set accuracy is: 0.5587248322147651


In [12]:
#This uses sklearn to plot the confusion matrix for the test set, just to see how it looks
from sklearn.metrics import confusion_matrix

print('politics','askreddit','worldnews','funny','gaming','aww')
confusion = pd.DataFrame(confusion_matrix(test['subreddit'],testout,normalize='true'))
print(confusion)
#Rows are the true subreddit, columns are the predicted subreddit; each row sums to 1

politics askreddit worldnews funny gaming aww
          0         1         2         3         4         5
0  0.898876  0.000000  0.044944  0.022472  0.022472  0.011236
1  0.000000  0.971698  0.018868  0.000000  0.009434  0.000000
2  0.561404  0.000000  0.350877  0.008772  0.070175  0.008772
3  0.045455  0.011364  0.011364  0.227273  0.386364  0.318182
4  0.020202  0.060606  0.020202  0.222222  0.353535  0.323232
5  0.020000  0.010000  0.010000  0.220000  0.190000  0.550000


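The confusion matrix shows some interesting results. AskReddit has the highest accuracy: its posts are classified correctly 97% of the time, and posts from other subreddits are rarely mislabeled as AskReddit. Politics also does well at 90%, but 56% of worldnews posts are mistaken for politics, and funny, gaming, and aww are frequently confused with one another.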
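
For reference, here is a minimal pure-Python sketch of what a row-normalized confusion matrix computes, which is how I read the `normalize='true'` option used above. The function name `row_normalized_confusion` and the toy labels are made up for illustration.

```python
def row_normalized_confusion(true_labels, predicted_labels, n_classes):
    """Entry [t][p] is the fraction of examples with true label t predicted as p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no true examples keeps a row of zeros
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy data: four examples of class 0 and two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
matrix = row_normalized_confusion(y_true, y_pred, n_classes=2)
print(matrix)  # [[0.75, 0.25], [0.5, 0.5]]
```

Read this way, each diagonal entry is the share of a subreddit's posts that were classified correctly, and the off-diagonal entries in a row show where the remaining posts went.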

Overall, that didn't work very well. Maybe the problem is the classifier, so instead of the supervised fasttext model I will try unsupervised fasttext embeddings and then use boosting on top of them.
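
A boosted classifier needs one fixed-length feature vector per title, and a simple way to get one from per-word embeddings is to average them, which is what the cells that follow do. The sketch below shows the idea with a hypothetical two-dimensional lookup table; `average_embedding` and `toy_vectors` are illustrative names and not part of gensim or fasttext.

```python
def average_embedding(tokens, vectors, dim):
    """Averages the vectors of known tokens; returns zeros if none are known."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # skip tokens that have no vector
        total = [a + b for a, b in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # avoids dividing by zero for empty or unknown titles
    return [value / count for value in total]

# Hypothetical 2-dimensional vectors, for illustration only
toy_vectors = {'cute': [1.0, 0.0], 'cat': [0.0, 1.0]}
print(average_embedding(['cute', 'cat', 'zzz'], toy_vectors, dim=2))  # [0.5, 0.5]
print(average_embedding([], toy_vectors, dim=2))                      # [0.0, 0.0]
```

A real fasttext model can usually build a vector for an unseen word from its character n-grams, so the skip branch matters less there, but the guard against an empty token list is still worth keeping.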

In [1]:
import gensim.downloader as api
dataset = api.load('text8')
c = list(dataset)
from gensim.models import FastText
#40-dimensional vectors; note that gensim 4 renamed the `size` argument to `vector_size`
unsupftt = FastText(size=40)
unsupftt.build_vocab(c)

#model = fasttext.load_model(os.getcwd()+'/fasttextmodel.vec')

In [2]:
unsupftt.train(c,total_examples=unsupftt.corpus_count,epochs=3)

In [67]:
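from nltk.tokenize import word_tokenize
#Represents each title as the average of its 40-dimensional word vectors
embeddings = np.zeros((40,len(dat)))
for i in range(len(dat)):
    tokenized = word_tokenize(dat['title'][i])
    #Sums the fasttext vector of every token, then divides by the token count
    avg = np.zeros((1,40))
    for j in range(len(tokenized)):
        avg = avg + unsupftt.wv.get_vector(tokenized[j])
    avg = avg/len(tokenized)
    embeddings[:,i] = avg

#Transposes so that each row is one title: shape (number of titles, 40)
embeddings = embeddings.T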

In [77]:
#Checks the shape: one row per title, one 40-dimensional embedding per row
np.shape(embeddings)

(5960, 40)

In [87]:
import xgboost as xg
#Note: starting the test slices at 5501 leaves row 5500 out of both sets
trainx = embeddings[0:5500]
testx = embeddings[5501:]
trainy = dat['subreddit'][0:5500]
testy = dat['subreddit'][5501:]
model = xg.XGBClassifier(gamma=3,subsample=0.8)
model.fit(trainx,trainy)


XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=3, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=0.8,
              tree_method=None, validate_parameters=False, verbosity=None)

In [88]:
#With the trained model, show accuracy and the confusion matrix
from sklearn.metrics import confusion_matrix
tfidftrain = model.predict(trainx)
tfidfout = model.predict(testx)
confusion2 = pd.DataFrame(confusion_matrix(testy,tfidfout,normalize='true'))
print('Training set accuracy:')
print(sum(trainy==tfidftrain)/len(tfidftrain))
print('Test set accuracy:')
print(sum(testy==tfidfout)/len(tfidfout))
print(confusion2)

Training set accuracy:
0.934
Test set accuracy:
0.5054466230936819
          0         1         2         3         4         5
0  0.652174  0.014493  0.217391  0.057971  0.014493  0.043478
1  0.000000  0.716418  0.029851  0.059701  0.104478  0.089552
2  0.283951  0.049383  0.518519  0.049383  0.037037  0.061728
3  0.029412  0.102941  0.073529  0.308824  0.191176  0.294118
4  0.096386  0.108434  0.012048  0.265060  0.409639  0.108434
5  0.021978  0.109890  0.032967  0.230769  0.142857  0.461538


That didn't work that well either: the training accuracy (93%) is far above the test accuracy (51%), so the model is overfitting. Part of the problem could be the classifier, though the data up until now has barely been cleaned or processed, which is typically important in natural language applications. I will start with the following: 
  
- Word Tokenization (splitting the strings into a list of one-word strings)
- Removal of stopwords
- Lemmatization or stemming
- Moving to a vector representation (bag of words, word2vec, or fasttext)

In [10]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stop = set(stopwords.words('english'))
titles = []
#This tokenizes, removes stopwords and stems
for i in range(len(dat)):
    #tokenizes by word
    tokens = word_tokenize(dat['title'][i].lower())
    #removes stop words
    kept = [word for word in tokens if word not in stop]
    #stems
    stemmed = [ps.stem(word) for word in kept]
    titles.append(stemmed)

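Sklearn's TF-IDF vectorizer does much of this work itself: by default it lowercases and tokenizes the text, and it can optionally remove stop words (it does not stem). I'll give it a try on the raw titles.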
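
To show what the vectorizer computes, here is a simplified pure-Python sketch of TF-IDF. It follows what I understand to be scikit-learn's defaults (tokens of two or more word characters, smoothed IDF, L2 normalization), but it is an illustration rather than a drop-in replacement, and the `tokenize` and `tfidf` helpers are my own names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercases and keeps tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

def tfidf(docs):
    """Returns one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency
        weights = {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

example_titles = ['My cat is cute', 'My dog is cute', 'F']
vectors = tfidf(example_titles)
print(vectors[0])
print(vectors[2])  # {} because 'F' has no token of two or more characters
```

Words that appear in fewer titles (such as "cat") get a larger weight than words shared by many titles (such as "my"). Under this tokenization, one-character titles like "F" end up with an empty vector, so the classifier has nothing to go on for them.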

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(dat['title'])

In [14]:
import xgboost as xg
#Note: starting the test slices at 5501 leaves row 5500 out of both sets
trainx = x[0:5500]
testx = x[5501:]
trainy = dat['subreddit'][0:5500]
testy = dat['subreddit'][5501:]
model = xg.XGBClassifier(gamma=0.3,subsample=0.8)
model.fit(trainx,trainy)


XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0.3, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=0.8,
              tree_method=None, validate_parameters=False, verbosity=None)

In [15]:
#With the trained model, show accuracy and the confusion matrix
tfidftrain = model.predict(trainx)
tfidfout = model.predict(testx)
confusion2 = pd.DataFrame(confusion_matrix(testy,tfidfout,normalize='true'))
print('Training set accuracy:')
print(sum(trainy==tfidftrain)/len(tfidftrain))
print('Test set accuracy:')
print(sum(testy==tfidfout)/len(tfidfout))
print(confusion2)

Training set accuracy:
0.9096363636363637
Test set accuracy:
0.6397379912663755
          0         1         2         3         4         5
0  0.710843  0.000000  0.216867  0.012048  0.036145  0.024096
1  0.000000  1.000000  0.000000  0.000000  0.000000  0.000000
2  0.177215  0.012658  0.658228  0.088608  0.025316  0.037975
3  0.000000  0.033708  0.044944  0.516854  0.168539  0.235955
4  0.000000  0.029851  0.044776  0.373134  0.507463  0.044776
5  0.012821  0.051282  0.025641  0.307692  0.089744  0.512821
