# Random Acts of Pizza

### Data Description

This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.

Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting.
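
Because the test set lacks fields that are unknown at posting time, it can help to restrict the training columns in the same way before modelling. The sketch below keeps only names that do not end in `_at_retrieval`. The helper name `request_time_fields` and the suffix rule are assumptions made for illustration, not the organisers' documented procedure, and other fields (for example "requester_user_flair", which encodes whether pizza was received) may need to be dropped by hand.

```python
# Hypothetical helper: the suffix rule is an assumption, not the
# organisers' documented procedure for building the test set.
def request_time_fields(columns, suffix="_at_retrieval"):
    """Return the column names that do not end with the given suffix."""
    return [c for c in columns if not c.endswith(suffix)]


# A few field names taken from the list under "Data fields".
columns = [
    "request_title",
    "requester_number_of_comments_at_request",
    "requester_number_of_comments_at_retrieval",
    "number_of_upvotes_of_request_at_retrieval",
]
kept = request_time_fields(columns)
print(kept)
```

Once the data is loaded into a pandas DataFrame `df`, the same helper could be applied as `df[request_time_fields(df.columns)]`.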

### Data fields
"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicous".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between requesters first post on RAOP and this request (zero if requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between requesters first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted in at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted in at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs) which is a small picture next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP users are from the USA).

"unix_timestamp_of_request_utc": Unit timestamp of request in UTC.

In [81]:
# import libraries
import pandas as pd
import nltk
import matplotlib
%matplotlib inline
# import data
db = pd.read_json('../input/train.json')

In [12]:
sh = db.shape

print('Summary:\n', '\tshape: {}'.format(sh))

Let's print the title and text of the five most-upvoted requests that received a pizza.

In [31]:
top5 = db.query('requester_received_pizza == True').sort_values(by='number_of_upvotes_of_request_at_retrieval', ascending=False)
top5 = top5.head(5).loc[:, ['request_title', 'request_text']]

for row in range(5):  
    print(top5.iloc[row, 0],'\n')
    print(top5.iloc[row, 1], '\n', '---'* 50)

It looks like both the request title and the request text include valuable information, so it could be a good idea to join the two fields.

## Classification approach
The Random Acts of Pizza data set includes a lot of information that could be useful for this challenge. However, since my goal here is to learn natural language processing, I will use only the request text and request title, combined into a single field.

In [43]:
# join both fields
db['text'] = db['request_title'] + ' \n'+  db['request_text']
label = db['requester_received_pizza']
train = db['text']
# convert True/False to 1/0
label = label * 1

In [68]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [92]:
ps = PorterStemmer()

# print tokens whose Porter stem differs from the token itself
# (first 30 tokens of the first 2 requests)
for row in train.head(2):
    words = word_tokenize(row)
    for w in words[:30]:
        st = ps.stem(w)
        if w != st:
            print(w, ': ', st)


Request :  request
Colorado :  colorado
Springs :  spring
Help :  help
Please :  pleas
military :  militari
family :  famili
has :  ha
really :  realli
times :  time
Request :  request
California :  california
gas :  ga
Thursday :  thursday
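
One possible use of stemming for the classification task is to count stems instead of raw words, so that "family" and "Family" or "times" and "time" fall into the same bucket. The sketch below is only an illustration: `bag_of_stems` is a hypothetical helper, the regular-expression tokenizer is a simplified stand-in for `word_tokenize`, and the default `stem` only lowercases so that the snippet runs without NLTK.

```python
import re
from collections import Counter


def bag_of_stems(text, stem=str.lower):
    """Count stems in `text`.

    `stem` is any callable mapping a token to its stem. The default only
    lowercases (an assumption made so the sketch runs without NLTK).
    """
    # Simplified stand-in for word_tokenize: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(stem(tok) for tok in tokens)


counts = bag_of_stems("Request: pizza for a family. Family needs pizza!")
print(counts.most_common(2))
```

Passing `stem=ps.stem` with the `PorterStemmer` instance created above should reproduce the stems printed by the notebook cell.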


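Chunking groups consecutive tokens into phrases by matching regular-expression-like patterns over their part-of-speech tags. The pattern below collects optional adverbs and verbs, followed by one or more proper nouns and an optional singular noun.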

In [85]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import RegexpParser

# chunk grammar: optional adverbs and verbs, one or more proper nouns, optional noun
chunkGram = r'''Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}'''
chunkparser = RegexpParser(chunkGram)

for row in train.head(3):
    words = word_tokenize(row)
    tagged = pos_tag(words)
    chunked = chunkparser.parse(tagged)
    print(chunked)
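
To make the tag pattern `{<RB.?>*<VB.?>*<NNP>+<NN>?}` more concrete, here is a simplified sketch of the idea behind it: write the tag sequence as one string and run an ordinary regular expression over it. This is not how `RegexpParser` is meant to be used and it skips most of its features. The helper `chunk_by_tags`, the hand-translated `PATTERN`, and the hand-assigned tags are all assumptions made for illustration.

```python
import re


def chunk_by_tags(tagged, tag_regex):
    """Return lists of words whose consecutive POS tags match `tag_regex`.

    `tagged` is a list of (word, tag) pairs, the shape returned by pos_tag.
    """
    # Encode the tag sequence as one string and remember where each tag starts.
    starts, pieces, pos = [], [], 0
    for _, tag in tagged:
        starts.append(pos)
        piece = "<{}>".format(tag)
        pieces.append(piece)
        pos += len(piece)
    tag_string = "".join(pieces)

    chunks = []
    for m in re.finditer(tag_regex, tag_string):
        # Keep the words whose tag starts inside the matched span.
        words = [w for (w, _), s in zip(tagged, starts) if m.start() <= s < m.end()]
        chunks.append(words)
    return chunks


# Hand-translated plain-regex version of {<RB.?>*<VB.?>*<NNP>+<NN>?}.
PATTERN = r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*(?:<NNP>)+(?:<NN>)?"

# Tags assigned by hand for illustration, not produced by a tagger.
tagged = [("We", "PRP"), ("really", "RB"), ("love", "VBP"),
          ("Colorado", "NNP"), ("Springs", "NNP"), ("pizza", "NN"), (".", ".")]
chunks = chunk_by_tags(tagged, PATTERN)
print(chunks)
```

Here the adverb, the verb, both proper nouns and the trailing noun end up in a single chunk, while "We" and the final period are left outside it.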

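Named entity recognition looks for predefined kinds of entities such as organisations, people, and money. With `binary=True`, `ne_chunk` only marks whether a span is a named entity (NE) and does not assign it a type.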


In [86]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

for row in train.head(3):
    words = word_tokenize(row)
    tagged = pos_tag(words)

    # binary=True: label spans as NE without distinguishing entity types
    namedEnt = ne_chunk(tagged, binary=True)
    print(namedEnt)

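Lemmatizing is similar to stemming, but it returns a real dictionary word (the lemma) rather than a truncated stem. `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part of speech is passed through its `pos` argument, which is why the output below contains artifacts such as `has : ha`, `was : wa` and `us : u`.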

In [93]:
from nltk import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for row in train.head(5):
    words = word_tokenize(row)
    for w in words:
        lem = lemmatizer.lemmatize(w)
        if w != lem:
            print(w, ": ", lem)

children :  child
has :  ha
times :  time
means :  mean
was :  wa
has :  ha
us :  u
leftovers :  leftover
leftovers :  leftover
guys :  guy
stories :  story
was :  wa
lives :  life
schedules :  schedule
us :  u
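
Results like `has : ha` and `was : wa` appear because the lemmatizer is called without a part of speech and therefore treats each token as a noun. One common workaround is to tag the tokens first and translate the Penn Treebank tags from `pos_tag` into the single-letter codes that `lemmatize` accepts for `pos` ("n", "v", "a", "r"). The helper below is a hypothetical sketch of that mapping, with anything that is not an adjective, verb or adverb falling back to noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun (assumed fallback)


for tag in ["VBZ", "VBD", "NNS", "JJ", "RB", "PRP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

In the loop above this would be used as `lemmatizer.lemmatize(w, pos=penn_to_wordnet(tag))` for each `(w, tag)` pair from `pos_tag(words)`, so that "has" and "was" are looked up as verbs rather than having a trailing "s" stripped.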


This is not directly useful yet, but the following code looks up the synonyms and antonyms of a word in WordNet.

In [107]:
from nltk.corpus import wordnet

syns = wordnet.synsets('plan')
# name of the first lemma of the first synset
syns[0].lemmas()[0].name()
# collect synonyms and antonyms of 'good'
synonyms = []
antonyms = []
for syn in wordnet.synsets('good'):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'beneficial', 'secure', 'salutary', 'in_force', 'undecomposed', 'thoroughly', 'soundly', 'sound', 'honest', 'serious', 'skillful', 'proficient', 'honorable', 'adept', 'trade_good', 'estimable', 'dear', 'near', 'commodity', 'respectable', 'dependable', 'good', 'right', 'well', 'unspoiled', 'unspoilt', 'upright', 'practiced', 'safe', 'just', 'skilful', 'goodness', 'expert', 'in_effect', 'effective', 'full', 'ripe'}
{'evilness', 'badness', 'ill', 'evil', 'bad'}


'plan'