# Subreddit Classifier  
  
The social media platform Reddit is divided into subreddits, each of which is a message board devoted to a certain topic. Each post contains a title and some combination of body text, images, videos, or links to external websites, and people viewing the content can comment on it. Reddit uses a voting system in which anybody can vote on each post or comment, either increasing its score (an "upvote") or decreasing it (a "downvote"); posts and comments with higher scores are generally made more visible to anybody viewing the website. Popular subreddits cover topics such as politics, news, and humorous content, and also include AskReddit, where people post questions and other people answer them in the comment section.

To collect Reddit content, PRAW (the Python Reddit API Wrapper) was used. It requires a Reddit developer account, which provides the credentials needed to access the API. Since I am only pulling content and not posting anything, I access the API in read-only mode, which requires fewer credentials than read-write mode.  
  
The cell below handles imports and initializes PRAW with my credentials that are loaded via an external file.

In [2]:
#Imports and setting up the reddit API
import pandas as pd
import os
import praw
import numpy as np
import fasttext

#Client ID, secret, user agent in that order
credentials = list(pd.read_csv(os.getcwd()+'/credentials.csv').columns)

reddit = praw.Reddit(client_id=credentials[0],
                     client_secret=credentials[1],user_agent=credentials[2])

#I want:
#Title, text, link (if applicable), top comments, author
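
The cell above takes the three credentials from the header row of `credentials.csv`, since `pd.read_csv(...).columns` returns the first line of the file. A minimal sketch of the same idea using only the standard library is shown here. The function name `read_credentials` and the placeholder values are assumptions for illustration and are not part of the original notebook.

```python
import csv
import io

def read_credentials(handle):
    """Return (client_id, client_secret, user_agent) from the first CSV row."""
    first_row = next(csv.reader(handle))
    # Unpacking raises ValueError if fewer than three fields are present
    client_id, client_secret, user_agent = (field.strip() for field in first_row[:3])
    return client_id, client_secret, user_agent

# Placeholder values standing in for the real file contents
example_file = io.StringIO("my_client_id,my_client_secret,my_user_agent\n")
client_id, client_secret, user_agent = read_credentials(example_file)
```

With a real file, the handle would come from `open('credentials.csv', newline='')` instead of the in-memory `StringIO` stand-in.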


As a preliminary test, the cell below gathers up to 1000 of the top posts (highest score) from each of six subreddits. These subreddits are:  
politics, a subreddit focused on US politics,  
AskReddit, a subreddit where people post questions and others answer them in the comment section,  
Worldnews, a subreddit focused on international, non-US news,  
Funny, a subreddit mostly for memes and joke content,  
Gaming, a subreddit dedicated to news or stories about videogames,  
Aww, a subreddit dedicated to photos and videos of animals, mostly pets.

In [3]:
#Gathers top posts from a few subreddits
politics = list(reddit.subreddit('politics').top(limit=1000))
ask = list(reddit.subreddit('askreddit').top(limit=1000))
worldnews = list(reddit.subreddit('worldnews').top(limit=1000))
funny = list(reddit.subreddit('funny').top(limit=1000))
gaming = list(reddit.subreddit('gaming').top(limit=1000))
aww = list(reddit.subreddit('aww').top(limit=1000))

The code above will gather lists of PRAW submission objects which then need to have the relevant data extracted from them.  
My general idea is to create a Pandas dataframe containing all of the relevant data. Since PRAW requires each submission be handled individually (I cannot just grab the list, type list.title and get a list of all of the titles), I will be creating empty lists and appending the relevant values to them while looping over my lists of submissions.  
  
For now, Reddit comments will be ignored.

In [4]:
#Initializing empty lists
titles = []
is_self = []
is_video = []
selftext = []
author = []
created = []
num_comments = []
score = []
subreddit = []
#Loops over each subreddit's submissions and appends the relevant values
#The integer label follows the list order:
#politics=0, askreddit=1, worldnews=2, funny=3, gaming=4, aww=5
for label, posts in enumerate([politics, ask, worldnews, funny, gaming, aww]):
    for post in posts:
        titles.append(post.title)
        is_self.append(post.is_self)
        is_video.append(post.is_video)
        selftext.append(post.selftext)
        author.append(post.author)
        created.append(post.created)
        num_comments.append(post.num_comments)
        score.append(post.score)
        subreddit.append(label)



In [5]:
#This creates a Pandas dataframe to hold the data
dat = pd.DataFrame()
dat['title'] = titles
dat['text'] = selftext
dat['author'] = author
dat['created'] = created
dat['score'] = score
dat['num_comments'] = num_comments
dat['is_self'] = is_self
dat['is_video'] = is_video
dat['subreddit'] = subreddit
dat = dat.sample(frac=1).reset_index(drop=True)
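
As an alternative to the parallel lists used above, each submission can be turned into one dictionary per row. The sketch below is only an illustration under stated assumptions: `submissions_to_rows` and `FIELDS` are hypothetical names, and `SimpleNamespace` objects stand in for PRAW submissions so the example runs without API access.

```python
from types import SimpleNamespace

# Attribute names copied from the extraction cell above
FIELDS = ['title', 'selftext', 'author', 'created', 'score',
          'num_comments', 'is_self', 'is_video']

def submissions_to_rows(groups):
    """Turn a list of submission lists into a list of row dictionaries.

    The position of each list in `groups` becomes its integer label.
    """
    rows = []
    for label, posts in enumerate(groups):
        for post in posts:
            row = {field: getattr(post, field) for field in FIELDS}
            row['subreddit'] = label
            rows.append(row)
    return rows

# Stand-in objects with the same attributes as a submission (made-up values)
fake_politics = [SimpleNamespace(title='A headline', selftext='', author='user_a',
                                 created=1.0, score=10, num_comments=3,
                                 is_self=False, is_video=False)]
fake_ask = [SimpleNamespace(title='A question?', selftext='Details', author='user_b',
                            created=2.0, score=5, num_comments=7,
                            is_self=True, is_video=False)]
rows = submissions_to_rows([fake_politics, fake_ask])
```

Passing `rows` to `pd.DataFrame(rows)` would give one column per key; note that the text column would then be named `selftext` rather than `text`.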

In [20]:
#Checks whether any title contains a literal backslash character
for i in range(len(dat)):
    if ('\\' in dat['title'][i]):
        print(i)
        print(dat['title'][i])
        
'\\' in 'can it find \n'

False

Now that the data is wrangled into a format that is easier to use, it needs to be processed. It will be processed as follows:  
  
- Removal of stopwords
- Tokenization
- Lemmatization or stemming
- Moving to a vector representation (bag of words, word2vec, or fasttext)
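
The steps listed above can be sketched in plain Python. This is a minimal illustration rather than the pipeline used in this notebook: the stopword list is a tiny hand-picked assumption, lemmatization and stemming are left out, and tokenization is done first so that stopwords can be matched token by token.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {'a', 'an', 'the', 'of', 'to', 'in', 'is', 'and', 'what'}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count the non-stopword tokens in the text."""
    tokens = [token for token in tokenize(text) if token not in STOPWORDS]
    return Counter(tokens)

bow = bag_of_words('What is the best game of the year?')
```

Here `bow` maps each of the remaining tokens `best`, `game`, and `year` to a count of 1.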

First, the fasttext package, installed directly from Facebook's GitHub, will be used, mostly because it is simple to use and does not really require the usual NLP preprocessing.  
  
It can be trained in supervised mode which supplies a classifier, though it requires the data be loaded from an external text file where each entry looks like this:  
  
\__label__*thelabel* The text  
  
Where "*thelabel*" is the label of that data point and "The text" is the actual text. The prefix "\__label__" must appear exactly as written to mark what the label is.
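
As a small sketch of this format, the helper below builds one line per example. The name `to_fasttext_line` is hypothetical, and collapsing whitespace is an added precaution (an assumption, not something the original cells do) so that a title containing a newline cannot spill onto a second line of the file.

```python
def to_fasttext_line(label, text):
    """Format one training example as '__label__<label> <text>'."""
    # Collapse every run of whitespace, including newlines, to a single space
    # because each example has to sit on a single line of the file
    clean_text = ' '.join(text.split())
    return '__label__' + str(label) + ' ' + clean_text

line = to_fasttext_line(1, 'What is your favourite\nbook?')
```

For the example above, `line` is `__label__1 What is your favourite book?`.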

In [50]:
#Splits into training and test data
train = dat[0:int(len(dat)*.9)]
test = dat[int(len(dat)*.9):len(dat)].reset_index(drop=True)
datafile = []
#Generates data file for output based on training data
for i in range(len(train)):
    datafile.append('__label__'+str(train['subreddit'][i])+' '+train['title'][i])

In [51]:
#This saves the data as a text file as per fasttext's requirements
#Opens the data file; the with block closes it automatically
with open('data.txt','w',errors='ignore') as outfile:
    #This runs through the data and adds each line
    for line in datafile:
        outfile.write(line)
        outfile.write('\n')

In [55]:
#This trains the fasttext model
model = fasttext.train_supervised('data.txt')

In [56]:
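#This runs all of the training and testing examples through the fasttext classifier
#The predicted label looks like '__label__3'; index 9 is the digit right after the 9-character '__label__' prefix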
trainout = []
for i in range(len(train)):
    trainout.append(int(model.predict(train['title'][i])[0][0][9]))
testout = []
for i in range(len(test)):
    testout.append(int(model.predict(test['title'][i])[0][0][9]))
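
Reading character 9 of the predicted label works here because the six labels are the single digits 0 to 5. A slightly more general sketch, assuming the predicted string has the form `__label__3` as the indexing above implies, strips the prefix and converts whatever follows. The name `parse_label` is hypothetical.

```python
PREFIX = '__label__'

def parse_label(predicted):
    """Convert a label string such as '__label__12' to the integer 12."""
    if not predicted.startswith(PREFIX):
        raise ValueError('unexpected label format: ' + predicted)
    # Everything after the prefix is the label, however many digits it has
    return int(predicted[len(PREFIX):])

single_digit = parse_label('__label__3')
two_digits = parse_label('__label__12')
```

With this helper, a label of 10 or more would be read in full rather than truncated to its first digit.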

In [57]:
#This counts the training and test errors, then converts them to accuracies
trainacc = 0
#loops over training data
for i in range(len(train)):
    #If the model's output is different from the true label, count it as an error
    if (train['subreddit'][i]!=trainout[i]):
        trainacc = trainacc + 1
trainacc = 1- trainacc/len(trainout)

testacc = 0
for i in range(len(test)):
    if (test['subreddit'][i] != testout[i]):
        testacc = testacc + 1
testacc = 1-testacc/len(testout)
print('The training set accuracy is: '+str(trainacc))
print('The test set accuracy is: '+str(testacc))

The training set accuracy is: 0.6255826962520977
The test set accuracy is: 0.5654362416107382
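
The accuracy calculation above can also be written as a small reusable function. This is a sketch with a hypothetical name, `accuracy`, and toy labels; it counts matches directly instead of counting errors and subtracting from 1.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('both sequences must have the same length')
    if len(true_labels) == 0:
        raise ValueError('cannot compute accuracy of an empty sequence')
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy labels: three of the four predictions are right
example_acc = accuracy([0, 1, 2, 3], [0, 1, 2, 5])
```

For the toy labels, `example_acc` is 0.75.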


In [70]:
#This uses sklearn to print the confusion matrix for the test set, just to see how it looks
from sklearn.metrics import confusion_matrix

print('politics','askreddit','worldnews','funny','gaming','aww')
confusion = pd.DataFrame(confusion_matrix(test['subreddit'],testout,normalize='true'))
print(confusion)
#predicted labels along the columns, true labels along the rows; each row is normalized to sum to 1

politics askreddit worldnews funny gaming aww
          0         1         2         3         4         5
0  0.802198  0.010989  0.142857  0.021978  0.010989  0.010989
1  0.000000  0.950980  0.009804  0.009804  0.000000  0.029412
2  0.457143  0.019048  0.457143  0.038095  0.009524  0.019048
3  0.029412  0.029412  0.019608  0.441176  0.098039  0.382353
4  0.030303  0.090909  0.040404  0.414141  0.161616  0.262626
5  0.030928  0.010309  0.010309  0.268041  0.082474  0.597938


The confusion matrix shows some interesting results. The highest per-class accuracy goes to AskReddit: the model labels about 95% of AskReddit titles correctly, and it rarely predicts AskReddit for posts from the other subreddits (at most about 9% of the time, for gaming titles).
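
To make the row normalization concrete, the sketch below rebuilds a row-normalized confusion matrix by hand on toy labels. It illustrates what `normalize='true'` computes and is not a replacement for sklearn; the function name is hypothetical, and returning zeros for a class with no true examples is a choice made for this sketch.

```python
def row_normalized_confusion(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes, rows sum to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no true examples gets a row of zeros in this sketch
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy data: class 0 has four examples (one misclassified), class 1 has two
true = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 1, 1]
matrix = row_normalized_confusion(true, pred, 2)
```

For the toy data, `matrix` is `[[0.75, 0.25], [0.0, 1.0]]`: the diagonal entry of each row is the fraction of that class labelled correctly, which is how the 0.95 for AskReddit should be read.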