# Executive Summary
How many times have you opened a random subreddit in your browser only to find that it wasn't the one you were looking for?  We've all been there.  And have you ever wondered, "Golly, just how similar are subreddits that focus on one concept but from entirely different points of view?"  Well, we hear you.  We've scraped data from two active subreddits that center on sexuality and used it to build a model that can tell which subreddit a post came from with over 80% accuracy on held-out posts.  With further exploratory data analysis, we hope one day to be able to describe the defining features of the subculture each subreddit represents.

# Imports

In [3]:
import requests
import json
import time
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from bs4 import BeautifulSoup
import regex as re

The function below scrapes a subreddit and returns its posts as a pandas dataframe.
It is then used for the actuallesbians, Braincels and Trufemcels subreddits.

In [20]:
def scrape_reddit(the_subreddit, pages = 40):
    all_posts = []
    first_url = 'http://www.reddit.com/r/' + the_subreddit + '.json'
    url = first_url
    list_of_df = []
    
    # Check that the first request succeeds before starting the full scrape:
    quick_check = requests.get(first_url, headers = {'User-agent':'Electronic Goddess'})
    if quick_check.status_code == 200:
        print("Get request successful.")
        time.sleep(3)
        print("Initiating Scrape...")
    else:
        print("Get request not 200, instead received: " + str(quick_check.status_code))
        return
    
    # Now for the actual scraping:
    for page_num in range(pages):
        try:
            res = requests.get(url, headers = {'User-agent':'Electronic Goddess'})
            data = res.json()
            list_of_posts = data['data']['children']
            all_posts = all_posts + list_of_posts
            after = data['data']['after']
            if after is None:
                # No 'after' token means there are no further pages
                print('No more pages.  Returning available posts.')
                break
            url = first_url +'?after=' + after
            print('Current After:' + after,'Round: '+ str(page_num + 1))
            time.sleep(3)
        except Exception:
            # Covers network errors, non-JSON responses and missing keys
            print('Limit likely hit.  Returning available posts.')
            break
    # return all_posts  # Uncomment to return the raw scrape instead

    #Formats the parts we care about into a list of dictionaries that'll become the dataframe
    for i in range(len(all_posts)):
        index_dictionary = {
                'title' : all_posts[i]['data']['title'],
                'selftext': all_posts[i]['data']['selftext'],
                'subreddit' : all_posts[i]['data']['subreddit']
            }
        list_of_df.append(index_dictionary)
    return pd.DataFrame(list_of_df, columns = ['title','selftext','subreddit'])
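
The pagination in `scrape_reddit` depends on the `after` token that each listing page returns. As a minimal sketch of that one step, the helper below builds the next page's URL and returns `None` when no token is available. The name `next_page_url` and the token `t3_abc123` are hypothetical and only for illustration; the sketch makes no network requests.

```python
from typing import Optional


def next_page_url(base_url: str, after: Optional[str]) -> Optional[str]:
    """Return the URL of the next listing page, or None if there is none."""
    if not after:
        # A missing or empty token is treated as "no further pages"
        return None
    return base_url + '?after=' + after


base = 'http://www.reddit.com/r/actuallesbians.json'
print(next_page_url(base, 't3_abc123'))  # hypothetical token
print(next_page_url(base, None))
```

Keeping this step in its own function makes the end-of-listing case explicit, so the scraping loop does not have to rely on an exception to stop.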


These are the scrapes that we'll actually be using.

In [22]:
df_lesbians = scrape_reddit('actuallesbians')
df_incels = scrape_reddit('braincels')

Get request successful.


KeyboardInterrupt: 

Extra Subreddits to check out if there is the opportunity

In [None]:
#df_femcels = scrape_reddit('Trufemcels')
#df_gaybros = scrape_reddit('gaybros')

### Saved and available to be loaded from csv

In [None]:
# Export to csv (Commented out to avoid re-saving errors)
#df_lesbians.to_csv('actuallesbians_9_9_400', index=False)
#df_incels.to_csv('braincels_9_9_400', index=False)
#df_femcels.to_csv('trufemcels_9_9_1000', index=False)
#df_gaybros.to_csv('gaybros_9_10_540', index=False)

In [23]:
# Import from CSV
df_lesbians = pd.read_csv('./actuallesbians_9_9_400')
df_incels = pd.read_csv('./braincels_9_9_400')
#df_femcels = pd.read_csv('./trufemcels_9_9_1000')
#df_gaybros = pd.read_csv('./gaybros_9_10_540')

# Exploratory Data Analysis
    What are the most used words for each subreddit?
    Are the most used words jargon?
    How much text on average do the subredditors post?
    How many of the posts are pictures?

Almost no EDA has been done at this time, in order to expedite getting this project finished.  What was mentioned during the presentation was recalled from memory after the loss of my previous work.
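
The questions listed above could be approached along the following lines. This is only a sketch on an invented toy dataframe, not on the scraped data, and the helper `top_words` is hypothetical. It also assumes that an empty `selftext` is a rough proxy for a picture or link post, which may not hold for every post.

```python
from collections import Counter
import re

import pandas as pd

# Toy stand-in for a scraped dataframe; the rows are invented for illustration.
toy = pd.DataFrame({
    'title': ['First post', 'Second post', 'A picture'],
    'selftext': ['some words here and some more words', 'more words', None],
    'subreddit': ['sub_a', 'sub_a', 'sub_b'],
})


def top_words(texts, n=3):
    """Count lowercase word tokens across texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r'\w+', str(text).lower()))
    return counts.most_common(n)


text = toy['selftext'].fillna('')

# Most used words for one subreddit
top_a = top_words(text[toy['subreddit'] == 'sub_a'], n=1)

# Average number of words per post, by subreddit
avg_words = text.str.split().str.len().groupby(toy['subreddit']).mean()

# Share of posts with no text at all (rough proxy for picture posts)
empty_share = (text == '').groupby(toy['subreddit']).mean()

print(top_a)
print(avg_words)
print(empty_share)
```

Whether the most used words are jargon would still need a manual look at the resulting lists.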

# Natural Language Processing

Using CountVectorizer and/or TF-IDF to generate features from the title and text of each post.
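
To make the difference between the two vectorizers concrete, here is a small sketch on three invented sentences (not the scraped posts). `CountVectorizer` produces raw term counts, while `TfidfVectorizer` down-weights terms that appear in many documents and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy documents, for illustration only
docs = ['cats chase mice', 'dogs chase cats', 'dogs bark']

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

count_matrix = count_vec.fit_transform(docs)
tfidf_matrix = tfidf_vec.fit_transform(docs)

# Both vectorizers learn the same vocabulary; only the cell values differ
print(sorted(count_vec.vocabulary_))
print(count_matrix.toarray())
print(tfidf_matrix.toarray().round(2))

# Length of each TF-IDF row (default norm is L2)
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)
```

Only the count-based features are modeled in this notebook; TF-IDF would be a drop-in alternative to compare against.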



In [125]:
#df_lesbians['selftext'].apply(text_prep)

In [5]:
# Instantiations of the tokenizer, lemmatizer and Count Vectorizer
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()
cvec = CountVectorizer(analyzer = "word",
                             tokenizer = tokenizer.tokenize,
                             preprocessor = None,
                             stop_words = 'english') 

Combining and altering the dataframes to be modeled.

In [6]:
# Identifying the y Values
df_lesbians['is_lesbians'] = 1
df_incels['is_lesbians'] = 0

# Concatenation
les_or_inc = pd.concat([df_lesbians.drop('subreddit',axis=1),df_incels.drop('subreddit', axis=1)])

# Filling Nulls
les_or_inc.fillna('', inplace=True)

# Combining the title and selftext columns for easier Count Vectorization
les_or_inc['all_text'] = les_or_inc['title'] + ' ' + les_or_inc['selftext']

# Resetting the Index (drop=True avoids keeping the old index as an extra column)
les_or_inc.reset_index(drop=True, inplace=True)

Setting up X and y, as well as the train and test sets.

In [24]:
# Defining X and y
X = les_or_inc['all_text']
y = les_or_inc['is_lesbians']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    random_state=76)
# Fit the Count Vectorizer on the training X only, then transform both train and test X
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
X_train = pd.DataFrame(cvec.fit_transform(X_train).toarray(), columns=cvec.get_feature_names_out())
X_test = pd.DataFrame(cvec.transform(X_test).toarray(), columns=cvec.get_feature_names_out())

The baseline accuracy for this model is about 50%, assuming the two subreddits contribute roughly equal numbers of posts: guessing 1 (or 0) for every row would then be correct about half the time.
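
Rather than assuming the classes are balanced, the baseline can be read directly from the labels as the share of the most common class. A minimal sketch on an invented label series (in the notebook, `y` would take the place of `y_toy`):

```python
import pandas as pd

# Invented labels for illustration: three posts of class 1, two of class 0
y_toy = pd.Series([1, 1, 1, 0, 0])

# Accuracy of always guessing the most common class
baseline = y_toy.value_counts(normalize=True).max()
print(baseline)
```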

## Modeling and testing MultinomialNB 

In [15]:
multi_model = MultinomialNB().fit(X_train,y_train)

print("Train:", multi_model.score(X_train,y_train))

print("Test:", multi_model.score(X_test, y_test))

Train: 0.9513513513513514
Test: 0.8704453441295547


## Modeling and testing RandomForestClassifier

In [16]:
rando_forest = RandomForestClassifier().fit(X_train, y_train)

print("Train:", rando_forest.score(X_train,y_train))

print("Test:", rando_forest.score(X_test,y_test))

Train: 0.9932432432432432
Test: 0.8218623481781376
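
Both test scores above come from a single train/test split, so they can shift with the choice of `random_state`. `cross_val_score`, which is imported at the top of this notebook but not used, is one way to get a steadier estimate. The following is a minimal sketch on invented toy posts, not the scraped data. Wrapping the vectorizer and model with `make_pipeline` is an addition here, chosen so that the vocabulary is refit on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy posts and labels, for illustration only
texts = [
    'coffee date went great', 'coffee shop date tomorrow',
    'great date last night', 'date night coffee plans',
    'gym routine leg day', 'leg day gym tips',
    'gym tips routine help', 'routine help leg workout',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are fit together inside each fold
pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())

# cv=2 only because the toy data is tiny; a larger value suits real data
scores = cross_val_score(pipe, texts, labels, cv=2)
print(scores, scores.mean())
```

On the real data, `X` and `y` as defined before vectorizing would take the place of `texts` and `labels`.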
