# Comment Score Prediction
## Stefan Keselj

The aim of this notebook is to extract comment features from the Kaggle May 2015 data and then use the non-score features to predict each comment's score. The features extracted at this stage are intermediate ones, such as a comment represented as a raw string; they are then processed into finer features, such as a bag-of-words vector of that comment.

In [95]:
import pandas as pd
import numpy as np
import nltk
import math
import enchant
english_dict = enchant.Dict("en_US")
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from HTMLParser import HTMLParser
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

import matplotlib.pyplot as plt
from matplotlib import pylab
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 5)

Populating the interactive namespace from numpy and matplotlib


### 1 Feature extraction

In [64]:
# load my data (Andrew's data, preprocessed for this task)
dftrain = pd.read_csv('/data/train_data_stef.csv',header=0)

In [65]:
dftrain.head(n=10)

Unnamed: 0.1,Unnamed: 0,body,score,subreddit
0,0,There are a lot of small tournaments in CS:GO ...,21,GlobalOffensive
1,1,I actually managed the Chilkoot trail with a 4...,1,pics
2,2,Bruh,1,nba
3,3,[Here you go](http://ftve3100-i.akamaihd.net/h...,1,nba
4,4,Retailers will just jack up the prices across ...,0,worldnews
5,5,Kobe 392 attempts this year and Jason Williams...,1,nba
6,6,China also invented the e-cig and many other i...,6,worldnews
7,7,Vampire hunter,1,pics
8,8,[**Foreigners who want to Understand**](http:/...,926,worldnews
9,9,"&gt; OW is flawed,\n no its not, if it has les...",0,GlobalOffensive


#### 1.1 Bag of words

In [144]:
# stopwords.words() returns a list, so build a set once for fast membership tests
english_stopwords = set(stopwords.words('english'))

# unescape HTML entities, then remove punctuation, numbers, stop-words,
# non-English words and non-ASCII tokens
def clean_comment(sentence):
    sentence = sentence.decode('utf-8')
    parser = HTMLParser()
    sentence = parser.unescape(sentence)
    # check if a string is a number
    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False
    # check if a string is only ascii characters
    def is_ascii(s):
        try:
            # tokens are unicode at this point, so encode (not decode) them;
            # UnicodeError covers both encode and decode failures
            s.encode('ascii')
            return True
        except UnicodeError:
            return False
    # main logic
    try:
        tokenizer = RegexpTokenizer(r'\w+')
        tokens_no_punct = tokenizer.tokenize(sentence)
        meaningful_words = [word.lower() for word in tokens_no_punct
                            if not is_number(word)
                            and word.lower() not in english_stopwords
                            and english_dict.check(word.lower())
                            and is_ascii(word)]
        return " ".join(meaningful_words)
    except Exception:
        # if tokenizing or the dictionary lookup fails, treat the comment as empty
        return ""
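
To see what the cleaning step does to a single comment, here is a minimal sketch that mirrors the same filters using only the standard library. It is written for Python 3 (the notebook itself is Python 2), and `TOY_STOPWORDS`, `TOY_DICTIONARY`, `toy_clean_comment` and the sample comment are all hypothetical stand-ins: the notebook uses the NLTK stop-word list and the enchant `en_US` dictionary, which are far larger.

```python
import html
import re

# Hypothetical, tiny stand-ins for the NLTK stop-word list and the
# enchant dictionary used in the notebook.
TOY_STOPWORDS = {"the", "is", "its", "not", "no", "if", "it", "has", "a"}
TOY_DICTIONARY = {"the", "small", "tournaments", "flawed", "less"}


def is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def is_ascii(token):
    """Return True if every character is in the ASCII range."""
    return all(ord(ch) < 128 for ch in token)


def toy_clean_comment(text):
    """Unescape HTML entities, split on word characters, filter tokens."""
    text = html.unescape(text)          # "&gt;" becomes ">"
    tokens = re.findall(r"\w+", text)   # punctuation is dropped here
    kept = [t.lower() for t in tokens
            if not is_number(t)
            and t.lower() not in TOY_STOPWORDS
            and t.lower() in TOY_DICTIONARY
            and is_ascii(t)]
    return " ".join(kept)


# Made-up comment in the style of the data (not taken from the dataset).
sample = "&gt; The 3 small tournaments aren't flawed!"
print(toy_clean_comment(sample))
```

Without the unescape step, the entity `&gt;` would be tokenized into the word `gt`, which is one reason the HTML is decoded before tokenizing.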

In [145]:
size = 1000
clean_train_comments = []
for i in xrange(0, size):
    clean_train_comments.append(clean_comment(dftrain['body'][i]))
train_data_features = vectorizer.fit_transform(clean_train_comments)

In [100]:
train_data_features_array = train_data_features.toarray()

In [102]:
print dftrain['body'][8][0:230]
print train_data_features_array[8]

[**Foreigners who want to Understand**](http://imgur.com/gallery/6NfmQ)

I wrote a rather melodramatic post earlier explaining the situation surrounding Lula’s nomination as Rousseff's Chief of Staff and now its developments. Re
[ 0  1  1  1  1  0  0  0  0  1  2  0  0  1  0  0  0  0  1  0  0  0  2  0  0
  0  0  1  0  0  1  5  0  1  1  0  1  1  0  0  0  0  0  2  1  0  0  1  2  1
  0  0  2  0  0  0  3  2  1  2  0  0  0  2  0  0  1  0  0  0  1  1  1  0  0
  1  0  1  0  1  2  0  0  0  0  0  0  1  1  0  0  0  0  1  0  0  1  0  0  1
  1  1  0  0  1  1  0  1  0  0  0  0  1  1  1  0  7  0  0  0  2  2  1  2  2
  1  0  1  0  2  1  0  0  0  4  1  1  0  0  0  0  0  1  0  0  5  1  2  0  0
  0  0  1  0  0  2  3  0  1  0  1  3  1  1  0  2  0  2  0  4  0  0  2  1  0
  0  0  0  3  2  1  0  0  0  0  0  1  0  1  0  0  0  2  1  1  1  0  1  0  0
  0  1  1  0  1  1  1  0  0  0  1  0  2  0  1  0  0  1  0  0  0  1  0  0  1
  0  1  0  0  2  1  0  0  1  0  0  0  0  1  1  0  2  0  2  0  0  0  0  0  1
  1  0  1  
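
To read the count vector printed above: each position corresponds to one vocabulary word, and the value is the number of times that word occurs in the cleaned comment. The sketch below reproduces the idea in plain Python 3 on two made-up cleaned comments. It is a simplified stand-in, not scikit-learn's implementation: `bag_of_words` and `toy_comments` are hypothetical names, and `CountVectorizer` additionally lowercases, ignores single-character tokens by default, and (with `max_features`) keeps only the most frequent words. What the two share is that the columns follow the alphabetically sorted vocabulary.

```python
from collections import Counter


def bag_of_words(cleaned_comments):
    """Return (vocabulary, count_vectors) for whitespace-separated comments."""
    # Columns follow the alphabetically sorted vocabulary.
    vocabulary = sorted({w for c in cleaned_comments for w in c.split()})
    column = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for comment in cleaned_comments:
        row = [0] * len(vocabulary)
        for word, count in Counter(comment.split()).items():
            row[column[word]] = count
        vectors.append(row)
    return vocabulary, vectors


# Made-up cleaned comments (hypothetical, not from the dataset).
toy_comments = ["small tournaments small prizes", "vampire hunter"]
vocabulary, vectors = bag_of_words(toy_comments)
print(vocabulary)   # column labels
print(vectors[0])   # counts for the first comment
```

In the notebook, the fitted `vectorizer.vocabulary_` dictionary plays the role of `column` here: it maps each word to its column index in `train_data_features_array`.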

#### 1.2 Word to vector
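
This section is still empty. One common way to turn word vectors into a comment-level feature is to average the vectors of the words in the comment; the sketch below assumes that approach, which may not be what is eventually used here. `TOY_EMBEDDINGS` is a hypothetical three-dimensional lookup table invented for illustration; real embeddings would come from a trained model.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "vampire": [1.0, 0.0, 0.0],
    "hunter": [0.0, 1.0, 0.0],
    "small": [0.0, 0.0, 1.0],
}


def comment_vector(cleaned_comment, embeddings, dim):
    """Average the embeddings of the known words in a cleaned comment."""
    known = [embeddings[w] for w in cleaned_comment.split() if w in embeddings]
    if not known:
        # No known words: fall back to the zero vector.
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]


print(comment_vector("vampire hunter", TOY_EMBEDDINGS, dim=3))
```

Unlike the bag-of-words vector, the result has a fixed, small dimension regardless of vocabulary size, at the cost of discarding word order and counts.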

#### 1.3 Skip-thought vector

#### 1.4 2grams
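
This section is still empty. As a minimal sketch of the idea, a 2-gram (bigram) feature counts pairs of adjacent words instead of single words. The plain Python 3 helpers below, `bigrams` and `bigram_counts`, are hypothetical names used for illustration. With scikit-learn, the same kind of feature can be obtained by passing `ngram_range=(2, 2)` to `CountVectorizer`.

```python
from collections import Counter


def bigrams(cleaned_comment):
    """Return the adjacent word pairs of a whitespace-separated comment."""
    tokens = cleaned_comment.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_counts(cleaned_comments):
    """Count bigram occurrences across a list of cleaned comments."""
    counts = Counter()
    for comment in cleaned_comments:
        counts.update(bigrams(comment))
    return counts


# Made-up cleaned comment (hypothetical, not from the dataset).
print(bigrams("small tournaments small prizes"))
```

A comment with fewer than two words produces no bigrams, so very short comments such as single-word replies would get an all-zero bigram vector.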

### Sketchpad

In [40]:
#clean = []
#for i in xrange(0,100):
#    print i
#    print clean_comment(dftrain.iloc[i]['body'])
#    print ""

In [41]:
#for i in range(0,10):
#    print i
#    print df_pic.iloc[i]['score']
#    print df_pic.iloc[i]['body']
#    print str(clean_comment(df_pic.iloc[i]['body']))
#    print ""

In [7]:
df_pic = dftrain[dftrain.subreddit=="pics"]
df_wne = dftrain[dftrain.subreddit=="worldnews"]
df_fun = dftrain[dftrain.subreddit=="funny"]
df_aww = dftrain[dftrain.subreddit=="aww"]
df_gof = dftrain[dftrain.subreddit=="GlobalOffensive"]
df_nba = dftrain[dftrain.subreddit=="nba"]
df_cje = dftrain[dftrain.subreddit=="circlejerk"]
df_sublist = [df_pic, df_nba, df_wne, df_fun, df_aww, df_gof, df_cje]