# Submission Crowd Wisdom
### Train A Model To Classify Submissions Using The Judgements Passed By Commenters.

In [28]:
import praw
from submission_corpus import SubmissionCorpus
from subreddit_corpus import SubredditCorpus
from ngram_classifier import NgramJudgementClassifier
import os

#### Step 1: Find A Judgemental Subreddit

In [2]:
subreddit_name = 'AmITheAsshole'
judgement_categories = ['YTA', 'NTA', 'ESH', 'NAH', 'INFO']

#### Step 2: Use Reddit API To Load Subreddit

In [3]:
client_id = '71ZX5Cupn2Ohpg'
client_secret = 'nzCz5_WlQM4LbJxX-t_3m-tPgZw'

reddit = praw.Reddit(user_agent='Comment Extraction',
                     client_id=client_id,
                     client_secret=client_secret)
subreddit = reddit.subreddit(subreddit_name)
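
If you would rather not keep credentials in the notebook itself, one option is to read them from environment variables. This is a minimal sketch, not part of the original notebook: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are assumptions chosen for illustration.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    names = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
    }

# Usage with praw (not executed here):
# reddit = praw.Reddit(user_agent='Comment Extraction', **load_reddit_credentials())
```

The returned dictionary uses the same keyword names that `praw.Reddit` is called with above, so it can be unpacked directly into that call.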

#### Step 3: Generate Corpus Of Submissions To Selected Subreddit.

In [4]:
# Check if submission corpora were already created 
# (to save the time of creating them again!)
filenames = [category+'.txt' for category in judgement_categories]
source_already_exists = all(os.path.isfile(filename) for filename in filenames)

# If the submission corpora are not in the current directory, build corpora.
if not source_already_exists:
    sc = SubredditCorpus(subreddit, retrieval_limit=5000, bfs_depth=1, judgement_categories=judgement_categories)
    labeled_submissions = sc.get_subredditCorpus()
    submission_text = [tup[1] for tup in labeled_submissions]

#### Step 4: Pick Hyperparameters.
  
As the number of judgement categories and the size of the vocabulary increase, the smoothing constant should be reduced; otherwise, the probability of a zero-frequency word/N-gram will approach the probability of a high-frequency word/N-gram.
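
The effect of the smoothing constant can be sketched with plain additive smoothing. This is an illustration under assumptions, not the internals of `NgramJudgementClassifier`: the functions `smoothed_prob` and `unseen_mass`, the toy counts, and the vocabulary size of 1000 are all made up for the example.

```python
from collections import Counter


def smoothed_prob(ngram, counts, vocab_size, k):
    """Additive smoothing: (count + k) / (total + k * vocab_size)."""
    total = sum(counts.values())
    return (counts[ngram] + k) / (total + k * vocab_size)


def unseen_mass(counts, vocab_size, k):
    """Total probability given to vocabulary items never seen in this category."""
    unseen_items = vocab_size - len(counts)
    return unseen_items * smoothed_prob("<unseen>", counts, vocab_size, k)


# Toy counts for a single category (hypothetical data).
counts = Counter({"cheese": 8, "brother": 2})
vocab_size = 1000  # assumed vocabulary size across all categories

for k in (1.0, 0.001):
    seen = smoothed_prob("cheese", counts, vocab_size, k)
    mass = unseen_mass(counts, vocab_size, k)
    print(f"k={k}: P(cheese)={seen:.4f}, mass on unseen words={mass:.3f}")
```

With these toy numbers, a constant of 1 hands almost all of the probability mass to words the category has never seen, while 0.001 leaves them with less than a tenth of it.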
  
As N increases, the number of unique N-grams usually (though not always) increases. This results in a larger vocabulary and a larger number of zero-frequency N-grams that must be smoothed.
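
To see how the number of unique N-grams changes with N, you can count them on a small piece of text. This is a minimal sketch: the helper `ngrams`, the whitespace tokenization, and the sample sentence are assumptions for illustration and may differ from what the classifier does.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, tokenized on whitespace.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Number of distinct N-grams for N = 1, 2, 3.
unique_counts = [len(set(ngrams(tokens, n))) for n in (1, 2, 3)]

for n, count in zip((1, 2, 3), unique_counts):
    print(f"N={n}: {count} unique N-grams")
```

On this sentence the count grows from unigrams to bigrams but stays flat from bigrams to trigrams, which is the "usually, though not always" behaviour described above.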
  
If you would like to evaluate model performance, you can set a train/test split using train_prop. 
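
A proportional train/test split can be sketched as follows. How `NgramJudgementClassifier` actually splits its data is not shown here, so treat `split_train_test` and its `seed` parameter as hypothetical names for a generic shuffle-and-cut approach.

```python
import random


def split_train_test(items, train_prop, seed=0):
    """Shuffle a copy of items and cut it at the given training proportion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_prop)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder documents split 80/20.
train_docs, test_docs = split_train_test(range(10), train_prop=0.8)
print(len(train_docs), len(test_docs))
```

Fixing the seed keeps the split reproducible between runs, which makes evaluation results comparable when you change other hyperparameters.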

In [29]:
N = 1
smoothing_constant = 0.001
train_prop = 0.8

#### Step 4: Build The Model.

In [20]:
clf = NgramJudgementClassifier(filenames, N=1,
                               smoothing_constant=0.001,
                               train_prop=0.8)

#### Step 4a: Evaluate The Model

In [34]:
#clf.run_splitTest()

#### Step 5: Pick/Write A Submission To Classify.

In [35]:

submission_1 = "I'm secretly obsessed with cheese. I'm considering becoming overtly obsessed with cheese. I want to rid the earth of those who do not like cheese; excusing those with dietary preferences/restrictions against dairy. WIBTA if I were to start my dairy fueled conquest of the food pyramid?"
submission_2 = "I want to help my brother with his anxiety, but I'm not sure what to do. I don't want to make things worse, but I also feel bad if I don't do anything. WIBTA if I try my best to help him (even though I'm not a professional)?" 

#### Step 6: Classify.

In [36]:
judgement_categories[clf.predict(submission_1)]

'NTA'
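
For intuition about what a prediction like this might involve, here is a small unigram classifier with additive smoothing that returns a category index. It is a sketch under assumptions, not the implementation of `NgramJudgementClassifier`: the functions `train_unigram_counts` and `predict_index`, the uniform class priors, and the toy training sentences are all made up for the example.

```python
import math
from collections import Counter


def train_unigram_counts(docs_by_category):
    """Count lowercase whitespace tokens per category and collect the vocabulary."""
    counts = [
        Counter(word for doc in docs for word in doc.lower().split())
        for docs in docs_by_category
    ]
    vocab = set().union(*counts)
    return counts, vocab


def predict_index(text, counts, vocab, k=0.001):
    """Return the index of the category with the highest smoothed log-likelihood.

    Assumes uniform priors over categories.
    """
    scores = []
    for category_counts in counts:
        denom = sum(category_counts.values()) + k * len(vocab)
        scores.append(sum(
            math.log((category_counts[word] + k) / denom)
            for word in text.lower().split()
        ))
    return max(range(len(scores)), key=scores.__getitem__)


# Toy training data (hypothetical), one list of documents per category.
categories = ["YTA", "NTA"]
docs_by_category = [
    ["i yelled at my brother", "i stole the cake"],
    ["i helped my brother", "i shared the cake"],
]
counts, vocab = train_unigram_counts(docs_by_category)
print(categories[predict_index("i helped", counts, vocab)])
```

Working in log space avoids numerical underflow when multiplying many small probabilities, and the returned index is used to look up the label in the same way as `judgement_categories[...]` above.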