http://jmcauley.ucsd.edu/data/amazon/qa/
# Description

This dataset contains Question and Answer data from Amazon, totaling around 1.4 million answered questions.

This dataset can be combined with Amazon product review data, available here, by matching ASINs in the Q/A dataset with ASINs in the review data. The review data also includes product metadata (product titles etc.).
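
A minimal sketch of the ASIN matching described above, written for Python 3 with pandas. The two small frames, the second and third questions, the ASIN `B00000XXXX` and the `title` column are made-up stand-ins for the Q/A data and the product metadata; only the `asin` key comes from the dataset description.

```python
import pandas as pd

# Hypothetical stand-ins for the Q/A data and the product metadata
qa = pd.DataFrame({
    "asin": ["B000050B6Z", "B000050B6Z", "B00000XXXX"],
    "question": ["Can you use this unit with GEL shaving cans?",
                 "Does it need batteries?",
                 "Is this compatible with Windows 8?"],
})
meta = pd.DataFrame({
    "asin": ["B000050B6Z"],
    "title": ["Hot lather machine (illustrative title)"],
})

# Left join keeps every question, even when no metadata matches its ASIN
merged = qa.merge(meta, on="asin", how="left")
print(merged)
```

A left join is used here so that questions about products missing from the metadata are kept with an empty title rather than silently dropped.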
## Files
Sample question (and answer):

{
  "asin": "B000050B6Z",
  "questionType": "yes/no",
  "answerType": "Y",
  "answerTime": "Aug 8, 2014",
  "unixTime": 1407481200,
  "question": "Can you use this unit with GEL shaving cans?",
  "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years."
}

## where

    asin - ID of the product, e.g. B000050B6Z
    questionType - type of question. Could be 'yes/no' or 'open-ended'
    answerType - type of answer. Could be 'Y', 'N', or '?' (if the polarity of the answer could not be predicted). Only present for yes/no questions.
    answerTime - raw answer timestamp
    unixTime - answer timestamp converted to unix time
    question - question text
    answer - answer text
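
As a quick check of how `answerTime` and `unixTime` relate, the sketch below (Python 3, standard library only) converts both fields of the sample record and compares the dates. Reading the Unix timestamp as UTC is an assumption; for other records the two dates could differ by a day depending on the time zone used when the data was built.

```python
from datetime import datetime, timezone

record = {"answerTime": "Aug 8, 2014", "unixTime": 1407481200}

# Interpret the Unix timestamp as UTC (an assumption, see above)
from_unix = datetime.fromtimestamp(record["unixTime"], tz=timezone.utc)

# Parse the raw timestamp string: month abbreviation, day, year
from_raw = datetime.strptime(record["answerTime"], "%b %d, %Y")

print(from_unix.isoformat())
print(from_raw.date() == from_unix.date())
```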


In [1]:
import pandas as pd
import json
import ngram
import sklearn
#import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import math
import os
import subprocess

# Loading the data

In [2]:
a = open(r'C:\Users\i\Downloads\qa_Software.json').read()

In [3]:
b = a.replace("{","").split("}")

In [4]:
"""
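    Load each record as a dict so it can easily be put into a DataFrame.
    Each fragment produced by the split is wrapped in braces again and
    evaluated as a Python dict literal; fragments that fail are skipped.
"""
c = []
for i in b:
    try:
        # Append a closing quote when the fragment does not end with one
        if i[-1] != '\'':
            i = i + "'"
        # eval executes arbitrary code: only use it on trusted data
        c.append(eval("{" + i + "}"))
    except Exception:
        # Skip fragments that are not valid dict literals (e.g. empty strings)
        pass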
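
A possible alternative to splitting on braces, sketched for Python 3: parse the file line by line with `ast.literal_eval`, which only accepts literals and cannot run code. This assumes that each line of the file holds exactly one Python dict literal; `parse_records` and the sample lines are hypothetical and only illustrate the idea.

```python
import ast
import io

def parse_records(lines):
    """Parse one Python dict literal per line, counting lines that fail."""
    records, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(ast.literal_eval(line))
        except (ValueError, SyntaxError):
            skipped += 1
    return records, skipped

# Two made-up lines in the assumed format, plus one malformed line
sample = io.StringIO(
    "{'asin': 'A1', 'question': 'Does {this} work?', 'answer': 'Yes'}\n"
    "{'asin': 'A2', 'question': 'Is it new?', 'answer': 'No'}\n"
    "{'asin': 'A3', 'question': 'broken\n"
)
records, skipped = parse_records(sample)
print(len(records), skipped)
```

Because each line is parsed as a whole, braces inside a question or answer do not break the record, and the number of skipped lines stays visible.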

In [5]:
d = pd.DataFrame(c)

In [6]:
d.head()

| | answer | answerTime | answerType | asin | question | questionType | unixTime |
|---|---|---|---|---|---|---|---|
| 0 | Yes | Aug 11, 2013 | Y | 439381673 | I have Windows 8, Will this work on my computer? | yes/no | 1376204000.0 |
| 1 | I used it with a pc. So, I have no idea. I hop... | Aug 11, 2014 | | 439381673 | It says above platform Mac, but I see in the q... | open-ended | 1407740000.0 |
| 2 | No it has to have a Mac that runs on power pc ... | Sep 24, 2014 | N | 439381673 | Will this work for Mac OS X? | yes/no | 1411542000.0 |
| 3 | Hi T.Lee, I have not had any trouble running t... | May 23, 2014 | ? | 439381673 | I have Windows7, Will this work on my computer? | yes/no | 1400828000.0 |
| 4 | Yes! :P | Apr 13, 2014 | Y | 439381673 | Will this work on Windows XP? | yes/no | 1397372000.0 |


# Cleaning the data

In [7]:
print d.dtypes,"\n"

answer           object
answerTime       object
answerType       object
asin             object
question         object
questionType     object
unixTime        float64
dtype: object 



In [8]:
"""
    Look for nans
"""
print set(d.answerType)
print set(d.questionType)

set(['Y', nan, '?', 'N'])
set(['open-ended', 'yes/no'])


In [9]:
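# NOTE: `x == np.nan` is always False because NaN never compares equal to
# anything, so this line leaves the column unchanged (see the check in the next cell).
d.answerType = d.answerType.apply(lambda x: "nan" if x == np.nan else x)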

In [10]:
np.isnan(d.loc[1].answerType )

True
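
The `True` above shows that the NaN is still in the column. A small sketch (Python 3, on a toy Series standing in for `answerType`) of why the equality test does nothing and of how `fillna` performs the replacement:

```python
import numpy as np
import pandas as pd

# NaN never compares equal, even to itself
print(np.nan == np.nan)

# Toy stand-in for the answerType column
s = pd.Series(["Y", np.nan, "N", "?"])

# The equality test never matches, so the NaN stays in place
unchanged = s.apply(lambda x: "nan" if x == np.nan else x)

# fillna replaces the missing value with the string "nan"
filled = s.fillna("nan")

print(unchanged.isnull().sum(), filled.isnull().sum())
```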

In [11]:
print set(d.isnull().question)
print set(d.isnull().answer)

set([False])
set([False, True])


In [12]:
"""
    See if we can afford to just drop them
"""
print d.answer.isnull().sum()

5


In [13]:
"""
    Drop the rows whose answer is NaN, since they cannot be used
"""
d = d[d.answer.notnull()]

In [14]:
data = d[["answer","question","questionType","answerType"]]

In [15]:
data.head()

| | answer | question | questionType | answerType |
|---|---|---|---|---|
| 0 | Yes | I have Windows 8, Will this work on my computer? | yes/no | Y |
| 1 | I used it with a pc. So, I have no idea. I hop... | It says above platform Mac, but I see in the q... | open-ended | |
| 2 | No it has to have a Mac that runs on power pc ... | Will this work for Mac OS X? | yes/no | N |
| 3 | Hi T.Lee, I have not had any trouble running t... | I have Windows7, Will this work on my computer? | yes/no | ? |
| 4 | Yes! :P | Will this work on Windows XP? | yes/no | Y |


# Transform

In [16]:
"""
    Index the questions in an NGram set so that similar questions can be looked up
"""
e = ngram.NGram(map(str,data.question))

In [19]:
"""
    Split into the conventional 60% train / 40% test
"""
train, test = train_test_split(data, test_size = 0.4)
print "train",len(train),"\ntest",len(test), "\ndiff data/train+test",len(data)- (len(train)+len(test))

train 4483 
test 2990 
diff data/train+test 0
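
The sizes printed above can be reproduced by hand. The sketch below (Python 3, standard library only) shuffles the row indices and cuts them 60/40; `split_indices` is a hypothetical helper, not scikit-learn's implementation, and rounding the test share up is an assumption chosen because it matches the printed sizes.

```python
import math
import random

def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    # Assumption: the test part is rounded up to a whole number of rows
    n_test = int(math.ceil(test_size * n_rows))
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# 7473 rows is the total of the train and test sizes printed above
train_idx, test_idx = split_indices(7473, 0.4)
print(len(train_idx), len(test_idx))
```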


In [20]:
train.head()

| | answer | question | questionType | answerType |
|---|---|---|---|---|
| 3224 | It will work with windows 8.1, but I do not us... | will this work in 8.1 | yes/no | ? |
| 4772 | As long as your server(s) are configured in th... | Can I split the CALs over two servers? That is... | open-ended | |
| 4776 | shipped | hi is this a download after purchase or shipped? | open-ended | |
| 5752 | the PCEye go comes with tobii gaze interaction... | what software in it | open-ended | |
| 307 | I have never worked on Windows 8. I have not f... | I am looking for a CD-ROM for Windows 8 the on... | yes/no | N |


In [21]:
test.head()

| | answer | question | questionType | answerType |
|---|---|---|---|---|
| 4422 | The manual is on-line. You'll have to search f... | Is the manual online or paperback with CD? | open-ended | |
| 466 | Yes | Will it work with Windows 7? | yes/no | Y |
| 6547 | it should be, yes | does printmaster platinum v6 compatible with w... | yes/no | Y |
| 6841 | I cannot talk for Amazon, but if you purchase ... | Can I download the software if I am a customer... | yes/no | ? |
| 2036 | Yes - You can use a traditional acoustic guita... | Can you use traditional acoustic guitar? | yes/no | Y |


# Matching sentences

In [49]:
"""
    Verify the behavior of NGram
"""
ngram.NGram.compare("i","i")

1.0

In [23]:
"""
    Creating an NGram of questions as a reference
"""
G = ngram.NGram(map(str,train.question))

In [24]:
"""
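    For each test question, find the closest training question and
    accumulate its n-gram dissimilarity with the true answer
"""
err = 0.0
count = 0.0  # number of comparisons that raised an error
for index_i, i in enumerate(test.question):
    closer_question = G.finditem(i)  # most similar question in the training set
    vrai = test.answer.iloc[index_i]  # true answer to the test question
    try:
        # NOTE: the true answer is compared with the closest training *question*,
        # not with the answer that was given to that question
        err += 1 - ngram.NGram.compare(vrai, closer_question)
    except ValueError:
        count += 1
    if index_i % 500 == 0:
        print index_i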

0
500
1000
1500
2000
2500


In [25]:
"""
    Show the last test question, its closest training question and the true answer
"""
print i 
print closer_question
print vrai

I have 4 license currently and need to add another. Will this work? Bob
I am not currently a student but need SPSS for a work related research project. Will the license work for me?
If you already have the software it might be cheaper to just call them and get another license. You can do this from within the program or call 1-866-379-6635. I am not sure, but it may be cheaper to get it from Intuit, otherwise buying another copy of the software should also work, if that is the cheaper route.


In [26]:
print "error :",err/len(test)

error : 0.933377599972


In [27]:
"""
    Number of failed comparisons
"""
count

0.0

## With an average error of about 0.93, this approach is not worth pursuing

# Decision Tree

In [28]:
from sklearn.feature_extraction import DictVectorizer
"""
    Create a function that builds dummy (one-hot) features for the categorical column `question`
"""
def categorizeDF(df):
    old_columns = df.columns
    cat_cols = ["question"]
    temp_dict = df[cat_cols].to_dict(orient="records")
    vec = DictVectorizer()
    vec_arr = vec.fit_transform(temp_dict).toarray()
    
    new_df = pd.DataFrame(vec_arr).convert_objects(convert_numeric=True)
    new_df.index = df.index
    new_df.columns = vec.get_feature_names()
    columns_to_add = [col for col in old_columns if col not in cat_cols]
    new_df[columns_to_add] = df[columns_to_add]
    #new_df.drop(cat_cols, inplace=True, axis=1)
    return new_df

joined_cat = categorizeDF(data)



In [29]:
"""
    Split the dataset into an even train/test split
"""
half = len(joined_cat) // 2
train = joined_cat.iloc[:half]
test = joined_cat.iloc[half:]

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
"""
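    Choose all relevant features for the training
"""
# Drop the last three columns (answer, questionType, answerType), which are not features
Xcols = list(joined_cat)[:-3]
# NOTE: the target is the free-text answer, so each distinct answer string is a separate class
y = train['answer']
print Xcols[:5]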

['question="3- Piece/ 1- year " means it can be installed on 3 pc\'s per household for 1 year duration?', 'question="Nuance-certified handheld device" -- What are they? Anyone knows where I can find a list of such?', 'question="printing is NOT included in this version" what this means? I can only save as file format?', 'question=$30 rebate? Does anyone know where I can find the link to the $30 rebate that was offered when it was in the goldbox?', "question='09 QB Problems solved? Can anyone tell me if QB has solved their '09 Pro problems?"]
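
The feature names printed above show that every column stands for one whole question. The sketch below (Python 3, standard library only) illustrates what that implies: a new question shares no such column with the training questions, whereas word-level columns do overlap. The word-level vocabulary is a swapped-in illustration and not what the notebook does; both helper functions are hypothetical.

```python
def one_hot_columns(questions):
    """One column per distinct question string, named like the output above."""
    return {"question=" + q for q in questions}

def word_columns(questions):
    """One column per distinct lower-cased word (a simple bag-of-words vocabulary)."""
    return {w.strip("?.,!").lower() for q in questions for w in q.split()}

train_q = ["Will this work on Windows XP?", "Will this work for Mac OS X?"]
test_q = ["Will it work with Windows 7?"]

shared_one_hot = one_hot_columns(train_q) & one_hot_columns(test_q)
shared_words = word_columns(train_q) & word_columns(test_q)

print(len(shared_one_hot))   # whole-question columns shared by train and test
print(sorted(shared_words))  # word columns shared by train and test
```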


In [41]:
"""
    Create the classifier
"""
forest = RandomForestClassifier(n_estimators=70, max_features=0.1, min_samples_split=24, random_state=33, n_jobs=3)

In [42]:
"""
    Fit the Classifier
"""
forest.fit(train[Xcols], y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.1, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=24,
            min_weight_fraction_leaf=0.0, n_estimators=70, n_jobs=3,
            oob_score=False, random_state=33, verbose=0, warm_start=False)

In [43]:
"""
    Make predictions: keep the probability of the second class in forest.classes_
"""
probsRF = forest.predict_proba(test[Xcols])[:,1]

In [44]:
"""
    Verifying the results
"""
set(probsRF)

{0.0}

In [46]:
"""
    Function for visualizing a tree
"""
def visualize_tree(tree, feature_names):
    """Create tree png using graphviz.

    Args
    ----
    tree -- scikit-learn DecisionTree.
    feature_names -- list of feature names.
    """
    with open("dt.dot", 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except (OSError, subprocess.CalledProcessError):
        exit("Could not run dot, i.e. graphviz, to "
             "produce visualization")

In [48]:
"""
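    Visualisation of one tree of the forest
"""
# export_graphviz expects a single decision tree; passing the forest itself raises
# AttributeError: 'RandomForestClassifier' object has no attribute 'tree_'
visualize_tree(forest.estimators_[0], Xcols)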