# Comment Score Prediction
## Stefan Keselj

The aim of this file is to extract comment features from the Kaggle May 2015 Data and then use non-score features to predict score. Note that the features described in the previous sentence are intermediate features, like a specific comment represented as a string, which will then be further processed into finer features like a bag-of-words vector of that comment.

In [67]:
import pandas as pd; import numpy as np; 
from scipy.sparse import csr_matrix
import nltk
from unidecode import unidecode
import math; import time
import enchant; english_dict = enchant.Dict("en_US")
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from HTMLParser import HTMLParser
from sklearn import linear_model
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, \
                             stop_words = None, max_features = 5000) 
from stemming.porter2 import stem

import matplotlib.pyplot as plt
#from matplotlib import pylab
%matplotlib inline
#%pylab inline
#pylab.rcParams['figure.figsize'] = (20, 5)

### 1 Feature extraction

In [2]:
# load my data (andrew's data preprocessed for this task)
dftrain = pd.read_csv('data/train_data_stef.csv',header=0)

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
dftrain.head(n=10)

Unnamed: 0.1,Unnamed: 0,body,score,subreddit
0,0,There are a lot of small tournaments in CS:GO ...,21,GlobalOffensive
1,1,I actually managed the Chilkoot trail with a 4...,1,pics
2,2,Bruh,1,nba
3,3,[Here you go](http://ftve3100-i.akamaihd.net/h...,1,nba
4,4,Retailers will just jack up the prices across ...,0,worldnews
5,5,Kobe 392 attempts this year and Jason Williams...,1,nba
6,6,China also invented the e-cig and many other i...,6,worldnews
7,7,Vampire hunter,1,pics
8,8,[**Foreigners who want to Understand**](http:/...,926,worldnews
9,9,"&gt; OW is flawed,\n no its not, if it has les...",0,GlobalOffensive


In [4]:
dft_pic = dftrain[dftrain.subreddit=="pics"]
dft_wne = dftrain[dftrain.subreddit=="worldnews"]
dft_fun = dftrain[dftrain.subreddit=="funny"]
dft_aww = dftrain[dftrain.subreddit=="aww"]
dft_gof = dftrain[dftrain.subreddit=="aww"]
dft_nba = dftrain[dftrain.subreddit=="nba"]
dft_cje = dftrain[dftrain.subreddit=="circlejerk"]
dft_list = [dft_pic, dft_nba, dft_wne, dft_fun, dft_aww, dft_gof, dft_cje]

In [5]:
dft_pic.head(n=10)

Unnamed: 0.1,Unnamed: 0,body,score,subreddit
1,1,I actually managed the Chilkoot trail with a 4...,1,pics
7,7,Vampire hunter,1,pics
11,11,Only Animals with 4 legs,1,pics
20,20,That quote could not be more ironic.,1,pics
24,24,And with one piece of propaganda Trump instill...,1,pics
25,25,I thought the final picture was gonna be them ...,0,pics
32,32,You left handed?,1,pics
36,36,You just said its ok for people to spill acros...,1,pics
40,40,Until we inevitably come up with the fake stor...,0,pics
48,48,Oh. True.,3,pics


In [75]:
# check if a string is a number 
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
# remove non-english words, stop-words, punctuation
def clean_comment(sentence, isRemoveStop):
    try:
        sentence = unicode(unidecode(sentence.decode('utf-8')))
        # remove escape sequences, e.g. &gt becomes >
        parser = HTMLParser(); sentence = parser.unescape(sentence)
        # tokenize
        tokenizer = RegexpTokenizer(r'\w+')
        tokens_no_punct = tokenizer.tokenize(sentence)
        # remove punctuation, numbers, stopwords, and non-English; return stemmed and lowercase
        meaningful_words = [stem(word.lower()) for word in tokens_no_punct 
                            if not is_number(word)
                            and ((not isRemoveStop) or (word.lower() not in stopwords.words('english'))) 
                            and english_dict.check(word.lower())]
        return (" ".join( meaningful_words ))
    # catch nans
    except:
        print sentence
        return ""




#### 1.1 Bag of words

In [None]:
# clean all of the comments in the subreddit 'pics'
c = 0; start_time = time.time()
dft_wne_clean_nstop = []
for row_index, row in dft_wne.iterrows():
    dft_wne_clean_nstop.append(clean_comment(row['body'],True)) ### Important: TRUE
    c += 1
    if c==10 or c==100 or c==1000 or c==10000 or c%100000==0:
        print (str(c) + " "),
print time.time() - start_time

10  100  1000  10000  100000 

In [None]:
# save the cleaned text to file (later)
se_wne_clean_nstop = pd.Series(dft_wne_clean_nstop)
se_wne_clean_nstop.to_csv('data/wne_clean_nstop.csv')

In [12]:
# build a bag of words representation 
dtf_BOW = vectorizer.fit_transform(dft_clean)

In [13]:
# save it to a file
save_sparse_csr("data/wne_BOW.npz",dtf_BOW)

In [8]:
# test that we saved everything correctly, these are the ones that are done so far
dft_pic_BOW = load_sparse_csr("data/pic_BOW.npz")
dft_wne_BOW = load_sparse_csr("data/wne_BOW.npz")

In [None]:
dft_pic_BOW_a = dft_pic_BOW.toarray()

#### 1.2 Word to vector

The bag of words representation is convenient for larger bodies of text like articles or books, but might not be best suited for comment data. Comment data is small and semantically packed, we need some way of accessing the intrinsic meaning of a comment from the few words provided. To accomplish this we will use the Word to Vector model, which incoporates deep learning into our existing model.

The first step is to build a list of the words which filters out numbers and punctuation, but has stopwords.

In [76]:
# clean all of the comments in the subreddit 'worldnews'
c = 0; start_time = time.time()
dft_wne_clean_stop = []
for row_index, row in dft_wne.iterrows():
    dft_wne_clean_stop.append(clean_comment(row['body'],False)) ### Important: FALSE
    c += 1
    if c==10 or c==100 or c==1000 or c==10000 or c%100000==0:
        print (str(c) + " "),
print time.time() - start_time

10  100  1000  10000  nan
nan
100000  nan
200000  nan
nan
nan
nan
nan
300000  nan
nan
nan
nan
400000  nan
nan
nan
500000  nan
nan
1885.07237506


We absolutely have to store this, because it takes forever every time.

In [77]:
se_wne_clean_stop = pd.Series(dft_wne_clean_nstop)

In [78]:
se_wne_clean_stop.to_csv('data/wne_clean_stop.csv')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)

### 2. Prediction Techniques

In [35]:
print shape(dtf_pic_BOW_a[0:400000])
print shape(dft_pic['score'][0:400000])
print shape(dtf_pic_BOW_a[400000:])
print shape(dft_pic['score'][400000:])

NameError: name 'shape' is not defined

#### 2.1 Regression

In [85]:
#train
X_train = dtf_pic_features[0:400000]
Y_train = np.array(dft_pic['score'][0:400000])
lm = linear_model.LinearRegression()
lm_fitted = lm.fit(X_train, Y_train)
#test
X_test = dtf_pic_features[400000:]
Y_true = dft_pic['score'][400000:]
Y_pred = lm_fitted.predict(X_test)
#evaluate
relative_error = 0
for i in range(0,len(Y_pred)):
    scale = float(Y_true.iloc[i])
    if abs(float(Y_true.iloc[i])) < 1:
        scale = 1
    relative_error += (float(Y_pred[i])-float(Y_true.iloc[i]))/scale/len(Y_pred)
print relative_error
print metrics.median_absolute_error(Y_true, Y_pred)
print metrics.mean_absolute_error(Y_true, Y_pred)
print metrics.mean_squared_error(Y_true, Y_pred)

6.74660403188
11.30064
22.6266525432
11685.3415595


In [90]:
plt.hist(dftest['sender'].values,10;

2
29.4182398223


#### 2.2 Classification

Instead of fitting a regressor to our data, a classifier might be better. This is because the number of upvotes of a comment's parent post greatly effecsts its success, but we don't have access to this latent variable. It causes the very popular comments to be skewed 

#### 1.3 Skip-thought vector

#### 1.4 2grams

###  Utilities and Sketching

In [2]:
def save_sparse_csr(filename,array):
    np.savez(filename,data = array.data ,indices=array.indices,
             indptr =array.indptr, shape=array.shape )

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((  loader['data'], loader['indices'], loader['indptr']),
                         shape = loader['shape'])

In [None]:
# an experiment
df1 = DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x <-0.7)
df1 = df1[-mask.any(axis=1)]
sLength = len(df1['a'])
df1['e'] = Series(np.random.randn(sLength), index=df1.index)

In [40]:
#clean = []
#for i in xrange(0,100):
#    print i
#    print clean_comment(dftrain.iloc[i]['body'])
#    print ""

In [26]:
#for i in range(0,10):
#    print i
#    print df_pic.iloc[i]['score']
#    print df_pic.iloc[i]['body']
#    print str(clean_comment(df_pic.iloc[i]['body']))
#    print ""

# check if a string is only ascii characters
def is_ascii(s):
    try:
        s.decode('ascii')
        return True
    except:
        print s
        return False
    # main logic

In [None]:
for i in range(0,len(dft_wne_clean_nstop)):
    if (not is_ascii(dft_wne_clean_nstop[i])):
        print i
        break
print dft_wne_clean_nstop[89]
print clean_comment(dft_wne_clean_nstop[89],True)
print clean_comment(dft_wne_clean_nstop[89],False)
print type(clean_comment(dft_wne_clean_nstop[89],True))