# Random Acts of Pizza

### Description

In machine learning, it is often said there are no free lunches. How wrong we were.

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

"I'll write a poem, sing a song, do a dance, play an instrument, whatever! I just want a pizza," says one hopeful poster. What about making an algorithm?


### Data Description

This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.

Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting.

#### Data fields
"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicious".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between requester's first post on RAOP and this request (zero if requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between requester's first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs), which are small pictures shown next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP users are from the USA).

"unix_timestamp_of_request_utc": Unix timestamp of request in UTC.

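The two timestamp fields are plain Unix timestamps, so they can be turned into calendar dates with the standard library. The sketch below is only an illustration: the helper to_utc_datetime and the value example_ts are made up for this example and are not taken from the dataset.

```python
from datetime import datetime, timezone


def to_utc_datetime(unix_seconds):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an aware datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# Illustrative value only (hypothetical, not a row from train.json)
example_ts = 1291766400
print(to_utc_datetime(example_ts).isoformat())  # 2010-12-08T00:00:00+00:00
```

Applied to "unix_timestamp_of_request_utc", the same conversion could be used to derive features such as the hour or weekday of a request.
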
In [1]:
# import basic libraries
import numpy as np
import pandas as pd
import nltk
import matplotlib
%matplotlib inline

# do not show warnings
import warnings
warnings.filterwarnings('ignore')

# import NLTK libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import RegexpParser
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import FreqDist

from nltk.classify.scikitlearn import SklearnClassifier

# import Machine Learning libraries
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

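# NOTE: sklearn.cross_validation was removed in scikit-learn 0.20; on newer
# versions KFold and cross_val_score live in sklearn.model_selection instead.
from sklearn.cross_validation import KFold, cross_val_score
from sklearn import cross_validation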


# IMPORT DATA
db = pd.read_json('../input/train.json')
db_test = pd.read_json('../input/test.json')

In [3]:
print('Summary:\n', '\ttrain db shape: {}\n\ttest db shape: {}'.format(db.shape, db_test.shape))

Summary:
 	train db shape: (4040, 32)
	test db shape: (1631, 17)
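
The train set has 32 columns and the test set only 17, which matches the note above that fields unavailable at posting time were removed from the test set. A minimal sketch of how the train-only columns could be listed follows. The helper train_only_columns is hypothetical, and the column lists are a toy subset of the documented fields whose train/test placement is assumed here for illustration; in the notebook one would pass db.columns and db_test.columns instead.

```python
def train_only_columns(train_columns, test_columns):
    """Return the columns that appear in the train set but not in the test set."""
    test_set = set(test_columns)
    return [col for col in train_columns if col not in test_set]


# Toy column lists (assumed for illustration, not read from the real files)
toy_train_columns = ["request_id", "request_title",
                     "requester_user_flair", "requester_received_pizza"]
toy_test_columns = ["request_id", "request_title"]

print(train_only_columns(toy_train_columns, toy_test_columns))
```

Any feature built from a train-only column cannot be computed for the test set, so it is worth checking this list before engineering features.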


Let's print the title and text of the five successful requests with the most upvotes.

In [4]:
top5 = db.query('requester_received_pizza == True').sort_values(by='number_of_upvotes_of_request_at_retrieval', ascending=False)
top5 = top5.head(10).loc[: , ['request_title', 'request_text']]

for row in range(5):  
    print(top5.iloc[row, 0],'\n')
    print(top5.iloc[row, 1], '\n', '---'* 50)

[Request]Vancouver, BC, Canada Father of 5, wife just got out of surgery, we were suddenly cut off from employment insurance. 

The government screwed up and now we have to wait over a month for them to refile and reestablish my claim.  There is no way to expedite this at all, in spite of the fact that it is their mistake.  We have 2 girls (9 and 7) and 3 boys (5,3, 2months).

My wife had to be taken by ambulance to the hospital last week for emergency gall bladder removal surgery and we are feeling a bit beat on at the moment.  This would be a humungous pick-us-up.

I am happy to provide any verification you need.  Thanks in advance.

**EDIT: http://imgur.com/4FXXT Thank you so much Gama-Go!** 
 ------------------------------------------------------------------------------------------------------------------------------------------------------
[REQUEST] No sob story, it's just my birthday tomorrow and I really like pizza :D 

My husband has to work 7-5 and class from 530-930 on my bda

It looks like both the request title and the request text include valuable information, so it could be a good idea to join the two fields.

## Classification approach
The Random Acts of Pizza dataset includes a lot of information that could be useful for this challenge. However, since my goal here is to learn natural language processing, I will use only the request text and request title, combined together.

In [5]:
# join both fields
db['text'] = db['request_title'] + ' \n'+  db['request_text_edit_aware']
db_test['text'] = db_test['request_title'] + ' \n'+  db_test['request_text_edit_aware']
label = db['requester_received_pizza']
train = db['text']
test = db_test['text']
# convert True/False to 1/0
label = label * 1

In [11]:
ps = PorterStemmer()

# print the tokens (first 30 of each text) whose Porter stem differs from the original word
for row in train.head(2):
    words = word_tokenize(row)
    for w in words[:30]:
        st = ps.stem(w)
        if w != st:
            print(w, ': ', st)


Chunking groups part-of-speech-tagged tokens into phrases, using a regular expression over the tags that we define ourselves.

In [12]:
# chunk grammar: optional adverbs, optional verbs, one or more proper nouns, optional noun
chunkGram = r'''Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}'''
chunkparser = RegexpParser(chunkGram)

for row in train.head(3):
    words = word_tokenize(row)
    tagged = pos_tag(words)
    chunked = chunkparser.parse(tagged)
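
To see what the chunk grammar above selects, it can help to apply the same pattern as an ordinary regular expression over a string of tags. The sketch below only mimics the idea and is not NLTK's implementation. The function find_chunks is hypothetical, and the tags in the example are assigned by hand rather than produced by pos_tag.

```python
import re

# The notebook's grammar as a plain regex over "<TAG>" items. The "." of the
# grammar is written as [^<>] so that it cannot run across tag boundaries.
CHUNK_PATTERN = re.compile(r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*(?:<NNP>)+(?:<NN>)?")


def find_chunks(tagged_tokens):
    """Return the groups of words whose tag sequence matches CHUNK_PATTERN."""
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged_tokens)

    # Character offset in tag_string at which each token's tag starts
    starts = []
    position = 0
    for _, tag in tagged_tokens:
        starts.append(position)
        position += len(tag) + 2  # +2 for the angle brackets

    chunks = []
    for match in CHUNK_PATTERN.finditer(tag_string):
        words = [word for (word, _), start in zip(tagged_tokens, starts)
                 if match.start() <= start < match.end()]
        chunks.append(words)
    return chunks


# Hand-tagged example (assumed tags, for illustration only)
tagged = [("He", "PRP"), ("quickly", "RB"), ("called", "VBD"),
          ("Gama", "NNP"), ("Go", "NNP"), ("pizza", "NN")]
print(find_chunks(tagged))  # [['quickly', 'called', 'Gama', 'Go', 'pizza']]
```

The pattern needs at least one proper noun (NNP), so a text without proper nouns produces no chunks at all.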

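Named entity recognition looks for predefined categories of words, such as organisations, people, or amounts of money. With binary=True, ne_chunk only marks whether a span is a named entity, without assigning it a category.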


In [86]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

for row in train.head(3):
        words = word_tokenize(row)
        taged = pos_tag(words)
        
        namedEnt = ne_chunk(taged, binary=True)

Lemmatizing is similar to stemming, but it returns a real word rather than a truncated stem.

In [13]:
lemmatizer = WordNetLemmatizer()

for row in train.head(5):
        words = word_tokenize(row)
        for w in words:
            lem = lemmatizer.lemmatize(w)
            if w != lem:
                print(w, ": ", lem)

This is not useful yet, but as an idea for later, the following code looks up the synonyms and antonyms of a word.

In [107]:
syns = wordnet.synsets('plan')
# name of the first lemma of the first synset
syns[0].lemmas()[0].name()
# collect synonyms and antonyms of 'good' across all of its synsets
synonyms = []
antonyms = []
for syn in wordnet.synsets('good'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

## Data preparation
Before we can build any model we need to pre-process our data.
The train and test data are currently stored as arrays of text. We need to convert them into a format that NLTK can read.

To do so, we run the following steps:
1. Tokenize each text into separate words.
2. Find the 3000 most common words.
3. For each text, check which of those 3000 words it contains.
4. Build a featuresets list of tuples: ({feature: True/False}, label)

### Build function to extract features

In [None]:
# Create list of all words used:
all_words = []
for row in train:
    all_words.extend(word_tokenize(row.lower()))

all_words = nltk.FreqDist(all_words)

# Keep the 3000 most common of the roughly 15,000 distinct words as features. 3000 is an arbitrary choice and I might increase it in future.
word_features = [w[0] for w in all_words.most_common(3000)]

# create a function that builds a dict of features from the word_features list
def find_features(list_of_words):
    unique_list_of_words = set(list_of_words)
    features = {}
    for w in word_features:
        features[w] = (w in unique_list_of_words)
    return features

### Extract features on train set. 

In [None]:
# Create a list of (tokens, label) tuples, one per request:
request_tuples = []
for row in range(train.shape[0]):
    request_tuples.append((word_tokenize(train[row].lower()), label[row]))

# Extract features
featuresets = [(find_features(text), label) for (text, label) in request_tuples]
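
To make the shape of featuresets concrete, here is a small self-contained sketch of the same pipeline on two toy texts. It is an illustration under assumptions: tokenize is a rough regex stand-in for word_tokenize, build_featuresets is a hypothetical helper, collections.Counter replaces FreqDist, and the texts and labels are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Rough stand-in for nltk.word_tokenize: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_featuresets(texts, labels, n_words):
    """Return (vocabulary, featuresets) using the n_words most common tokens."""
    counts = Counter(token for text in texts for token in tokenize(text))
    vocabulary = [word for word, _ in counts.most_common(n_words)]

    featuresets = []
    for text, label in zip(texts, labels):
        present = set(tokenize(text))
        features = {word: (word in present) for word in vocabulary}
        featuresets.append((features, label))
    return vocabulary, featuresets


# Invented toy data, not taken from the competition files
toy_texts = ["hungry kids, hungry dad, one pizza please",
             "pizza for my birthday, pizza is life"]
toy_labels = [1, 0]

vocabulary, toy_featuresets = build_featuresets(toy_texts, toy_labels, n_words=2)
print(vocabulary)
for features, label in toy_featuresets:
    print(features, label)
```

Each entry is a (dict, label) tuple, which is the input format that NLTK classifiers such as SklearnClassifier expect for training.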

## Machine Learning Algorithms
Our training set is now ready for testing ML algorithms.
The procedure is:
1. Collect the most popular ML algorithms in a list.
2. Loop through this list and apply each technique to part of the train set.
3. Divide the train set into 10 parts and, on each pass, use 90% of the set for training and the remaining 10% for testing. NLTK doesn't have a cross-validation method like sklearn, so we need to design our own.
4. Print the mean accuracy of each method in order to choose which methods will be used for the final model.

In [8]:
# Run a loop through the most popular ML alg and print their results.
algorithms = [# NB
              'MultinomialNB',
              'BernoulliNB',
              
              # Linear models
              'LogisticRegression',
              'Perceptron',
              'RidgeClassifier',
              'SGDClassifier',
              
              # SVM
              'SVC',
              'LinearSVC',
              
              # Ensemble methods
              'RandomForestClassifier',
              'AdaBoostClassifier',
              'BaggingClassifier',
              'ExtraTreesClassifier',
              'GradientBoostingClassifier',
              
              # Tree models
              'DecisionTreeClassifier',
              'ExtraTreeClassifier',
              
              # KNN
              'KNeighborsClassifier']

for alg in algorithms:
    accu = []
    classifier = SklearnClassifier(eval(alg + '()'))
    cv = cross_validation.KFold(len(featuresets), n_folds=10, shuffle=False, random_state=None)
    
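    for traincv, testcv in cv:
        # NOTE: each slice runs from the first to the last index of its fold, so
        # whenever the held-out fold lies between those two indices it is also
        # part of the training slice, which inflates the reported accuracy.
        classifier = classifier.train(featuresets[traincv[0]:traincv[len(traincv)-1]])
        accu.append(nltk.classify.util.accuracy(classifier, featuresets[testcv[0]:testcv[len(testcv)-1]]))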
        
    accu = np.mean(accu)
    print('{} accuracy: {}%'.format(str(alg), round(accu*100, 2)))

print('-'*20,'\nFinished')

MultinomialNB accuracy: 79.78%
BernoulliNB accuracy: 70.15%
LogisticRegression accuracy: 88.34%
Perceptron accuracy: 74.74%
RidgeClassifier accuracy: 88.61%
SGDClassifier accuracy: 77.3%
SVC accuracy: 75.38%
LinearSVC accuracy: 91.51%
RandomForestClassifier accuracy: 92.66%
AdaBoostClassifier accuracy: 76.05%
BaggingClassifier accuracy: 91.89%
ExtraTreesClassifier accuracy: 94.52%
GradientBoostingClassifier accuracy: 78.14%
DecisionTreeClassifier accuracy: 92.85%
ExtraTreeClassifier accuracy: 93.05%
KNeighborsClassifier accuracy: 76.38%
-------------------- 
Finished


## Test set preparation
Now we need to prepare the test set in a similar way to the train set, but it is simpler because the find_features function is already built.

### Extract features on Test set

In [13]:
request_test = []
for row in range(test.shape[0]):
    request_test.append((word_tokenize(test[row].lower())))
    
testing = [(find_features(text)) for (text) in request_test]

## Build prediction
Based on the algorithm search above, we chose the six methods below to build predictions. Each of them had a high average accuracy, and each of them will be uploaded to Kaggle to see how it scores.

In [64]:
# chosen algorithms:
algorithms = ['LogisticRegression',
              'RidgeClassifier',
              'LinearSVC',
              'RandomForestClassifier',
              'BaggingClassifier',
              'KNeighborsClassifier']

for alg in algorithms:
    classifier = SklearnClassifier(eval(alg + '()'))
    classifier = classifier.train(featuresets)
    prediction = classifier.classify_many(testing)

    prediction = pd.concat([pd.Series(db_test['request_id']), pd.Series(prediction)], axis=1, ignore_index=True)
    prediction.columns = ['request_id', 'requester_received_pizza']
    prediction.to_csv('{}.csv'.format(alg), index=False)

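Unfortunately, the result (Kaggle score) is still much lower than I would like it to be. The best method (KNN) received a score of 56%, far below the cross-validated accuracies printed above, which points to a problem in my algorithm search. The most likely cause is the way the folds are sliced: featuresets[traincv[0]:traincv[len(traincv)-1]] runs from the first to the last training index, so for every fold except the first and the last it also contains the held-out requests, and the classifiers are scored on data they were trained on. We also need a different approach to build a better prediction.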
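
One way to check the suspicion about the algorithm search is to look at what the fold slices actually contain. The sketch below rebuilds contiguous, unshuffled folds in plain Python and counts how many held-out items a first-to-last slice of the training indices would include. The helpers kfold_indices, leaked_items and select are hypothetical and only approximate the behaviour of an unshuffled KFold; they are not the library's code.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def leaked_items(train_idx, test_idx):
    """Count held-out items covered by the slice [train_idx[0]:train_idx[-1]]."""
    sliced = set(range(train_idx[0], train_idx[-1]))
    return len(sliced & set(test_idx))


def select(items, indices):
    """Leak-free alternative to slicing: pick exactly the listed positions."""
    return [items[i] for i in indices]


for fold, (train_idx, test_idx) in enumerate(kfold_indices(20, 5)):
    print(fold, leaked_items(train_idx, test_idx))
```

With 20 items and 5 folds, only the first and last folds are free of leakage; the three middle folds each train on all 4 of their held-out items. In the notebook, the equivalent fix would be to train on select(featuresets, traincv) and score on select(featuresets, testcv).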

In [9]:
print('Finished')

Finished


---------------------

# New approach ?? 
Maybe Keras?