<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#MongoDb" data-toc-modified-id="MongoDb-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>MongoDb</a></span></li><li><span><a href="#Data-load-from-a-file" data-toc-modified-id="Data-load-from-a-file-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data load from a file</a></span></li><li><span><a href="#Data-transformation" data-toc-modified-id="Data-transformation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data transformation</a></span></li><li><span><a href="#Data-discovery" data-toc-modified-id="Data-discovery-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data discovery</a></span></li><li><span><a href="#Training-and-test-set" data-toc-modified-id="Training-and-test-set-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Training and test set</a></span></li><li><span><a href="#Featured-words" data-toc-modified-id="Featured-words-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Featured words</a></span></li><li><span><a href="#NLTK-library---The-Naive-bayes-model" data-toc-modified-id="NLTK-library---The-Naive-bayes-model-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>NLTK library - The Naive bayes model</a></span></li><li><span><a href="#Accuracy---Confusion-matrix" data-toc-modified-id="Accuracy---Confusion-matrix-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Accuracy - Confusion matrix</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Next-steps" data-toc-modified-id="Next-steps-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Next steps</a></span></li></ul></div>

# Amazon fine food - rating prediction model

We have Amazon review data for part of their product range. In this section we predict the star rating (simplified to good or bad) from the sentences a user wrote in the summary and text fields of a review. For an example of such customer reviews, see https://www.amazon.com/Apple-Watch-Gold-Aluminum-Sport/dp/B075TDXYCS/ref=sr_1_17?s=electronics&ie=UTF8&qid=1540892072&sr=1-17#customerReviews 


Model selection

We decided to use the Naive Bayes model as a good example of Bayes classifiers. Naive Bayes classifiers are widely used in text classification, for example in spam filtering, because they handle multi-class problems well and their independence assumption fits word-based features. They tend to perform well with categorical input variables compared to numerical ones. Compared with models such as logistic regression, Naive Bayes can also perform well with less training data, although this depends on how well the independence assumption holds.
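
For reference, the decision rule behind the classifier can be written in its standard textbook form. The symbols $c$ (rating class) and $w_i$ (word features) are notation introduced here for illustration; they do not appear in the notebook code.

```latex
\hat{c} \;=\; \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The product over features is the "naive" part: each word feature is assumed to be conditionally independent of the others given the class.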


##   Libraries

In [1]:
from pymongo import MongoClient # requires a locally installed MongoDB
from pymongo import DESCENDING
import numpy as np # library with mathematical tools
import matplotlib.pyplot as plt # for plotting graphs
import pandas as pd # library for managing and importing datasets
import nltk # natural language processing library containing NaiveBayes
from sklearn.model_selection import train_test_split # splits a sample into training and test sets
from nltk.metrics import ConfusionMatrix

## MongoDb

Now it is time to upload data from a pandas dataframe to a local MongoDB.

In [2]:
#  Auxiliary function - upload to MongoDB

def upload_data_mongoDb(collection, data, delete_before_upload = True, silent_mode = False):
    try:
        if delete_before_upload:
            # delete before insert
            collection.delete_many({})

        # insert the records (a list of dicts) into MongoDB
        collection.insert_many(data)

        if not silent_mode:
            print('Dataframe uploaded to MongoDb')

    except Exception as error:
        # report the cause of the failure instead of hiding it
        print('Error occurred while uploading data to MongoDb: ' + str(error))

In [3]:
# connect to localhost MongoDB database
client = MongoClient('localhost', 27017)

# connect to an amazon_database
db = client.amazon_database

# Collections
# connect to the amazon collection in the amazon database
collection = db.amazon_collection

# connect to the transformed collection in the amazon database
transformed_collection = db.transformed_collection

# connect to the word features collection in the amazon database
wordfeatures_collection = db.wordfeatures_collection

# connect to the training set collection in the amazon database
train_set_collection = db.train_set_collection

# connect to the training set collection in the amazon database - used in nltk model
train_set_collection_nltk = db.train_set_collection_nltk

# connect to the test set collection in the amazon database
test_set_collection = db.test_set_collection

# connect to the test set collection in the amazon database - used in nltk model
test_set_collection_nltk = db.test_set_collection_nltk

# connect to the featuring_temporary_collection in the amazon database
featuring_temporary_collection = db.featuring_temporary_collection

## Data load from a file

In [None]:
def loadData(filename, loadlines = np.inf):
    """Parse a file of blank-line-separated 'key: value' records into a DataFrame."""
    records = []   # one dict per review
    myColl = {}    # the review currently being read
    linecount = 1  # counts reviews (records), not physical lines

    with open(filename, "r") as f:
        while linecount <= loadlines:
            # Read a line.
            line = f.readline()

            # When readline returns an empty string, the file is fully read.
            if line == "":
                break

            # A blank line ends the current review.
            if line == "\n":
                linecount = linecount + 1
                if myColl:
                    records.append(myColl)
                myColl = {}
                continue

            # Split 'key: value' on the first separator only, so values may contain ': '.
            key, separator, value = line.strip().partition(': ')
            if separator:
                myColl[key] = value
            else:
                # Report lines that do not match the expected format.
                print(line)
                print(linecount)

    # Keep a trailing review that was not followed by a blank line.
    if myColl:
        records.append(myColl)

    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0.
    return pd.DataFrame(records)

# number of loaded reviews is limited
dataframe = loadData(filename="foodscopy.txt",loadlines = 500000)

# number of loaded reviews is unlimited - we load the whole sample
# dataframe = loadData(filename="foodscopy.txt")

In [5]:
dataframe.head()

| | _id | product/productId | review/helpfulness | review/profileName | review/score | review/summary | review/text | review/time | review/userId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5bdfe521d95ae035681e56e3 | B001E4KFG0 | 1/1 | delmartian | 5.0 | Good Quality Dog Food | I have bought several of the Vitality canned d... | 1303862400 | A3SGXH7AUHU8GW |
| 1 | 5bdfe521d95ae035681e56e4 | B00813GRG4 | 0/0 | dll pa | 1.0 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | 1346976000 | A1D87F6ZCVE5NK |
| 2 | 5bdfe521d95ae035681e56e5 | B000LQOCH0 | 1/1 | Natalia Corres "Natalia Corres" | 4.0 | "Delight" says it all | This is a confection that has been around a fe... | 1219017600 | ABXLMWJIXXAIN |
| 3 | 5bdfe521d95ae035681e56e6 | B000UA0QIQ | 3/3 | Karl | 2.0 | Cough Medicine | If you are looking for the secret ingredient i... | 1307923200 | A395BORC6FGVXV |
| 4 | 5bdfe521d95ae035681e56e7 | B006K2ZZ7K | 0/0 | Michael D. Bigham "M. Wassir" | 5.0 | Great taffy | Great taffy at a great price. There was a wid... | 1350777600 | A1UQRSCLF8GW1T |


In [None]:
#  Upload data from file to mongoDb
upload_data_mongoDb(collection,dataframe.to_dict('records'))

## Data transformation

Let's create an auxiliary function for the following data transformation. First, we create a function that removes meaningless words. We use the nltk function pos_tag, which assigns a part of speech to every word. Then we keep only the relevant ones. Last, we drop the columns that we are not going to use in this run, such as productid or userid.
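
To make the filtering step concrete, here is a minimal sketch that works on hand-tagged tokens. The (word, tag) pairs are written out by hand as an assumption about what a Penn Treebank tagger would return, so the snippet runs without nltk; `KEEP_TAGS` and `tagged` are illustrative names.

```python
# Tags kept by the filter: adjectives, adverbs, interjections and nouns.
KEEP_TAGS = {'JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS', 'UH', 'NN', 'NNS', 'NNP', 'NNPS'}

# Hand-written (word, tag) pairs; an illustrative assumption, not real tagger output.
tagged = [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
          ('very', 'RB'), ('good', 'JJ'), ('coffee', 'NN')]

# Keep only the words whose tag is in the allowed set.
kept = [word for word, tag in tagged if tag in KEEP_TAGS]
print(kept)
```

Determiners and verbs are dropped, while the adverb, adjective and noun survive.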

In [6]:
# Remove uninformative parts of speech such as 'the','a','this','that'.
# Keep only adjectives, adverbs, interjections and nouns.
# See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

def remove_wrong_words(wordlist):
    # tokenize the text and assign a part-of-speech tag to every token
    tagged = nltk.pos_tag(nltk.word_tokenize(wordlist.replace('.',' ')))
    # alternative: exclude tags instead of including them
    # kept = [word for (word, pos) in tagged if pos not in ['DT','EX','FW','RP','SYM','TO','IN','CC']]
    kept = [word for (word, pos) in tagged if pos in ['JJ','JJR','JJS','RB','RBR','RBS','UH','NN','NNS','NNP','NNPS']]
    return kept

We create a new column 'dependent_variable' (good/bad) from the 'review/score' column, as well as a column 'independent_variable' with the transformed summary, where we use the remove_wrong_words function from the previous section.

In [4]:
#  load data from MongoDb
data = pd.DataFrame(list(collection.find()))

# pick relevant columns
data = data[['review/summary', 'review/text', 'review/score']]
# drop incomplete rows; dropna returns a new dataframe, so assign the result
data = data.dropna()
data.head(10)

| | review/summary | review/text | review/score |
|---|---|---|---|
| 0 | Good Quality Dog Food | I have bought several of the Vitality canned d... | 5.0 |
| 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | 1.0 |
| 2 | "Delight" says it all | This is a confection that has been around a fe... | 4.0 |
| 3 | Cough Medicine | If you are looking for the secret ingredient i... | 2.0 |
| 4 | Great taffy | Great taffy at a great price. There was a wid... | 5.0 |
| 5 | Nice Taffy | I got a wild hair for taffy and ordered this f... | 4.0 |
| 6 | Great! Just as good as the expensive brands! | This saltwater taffy had great flavors and was... | 5.0 |
| 7 | Wonderful, tasty taffy | This taffy is so good. It is very soft and ch... | 5.0 |
| 8 | Yay Barley | Right now I'm mostly just sprouting this so my... | 5.0 |
| 9 | Healthy Dog Food | This is a very healthy dog food. Good for thei... | 5.0 |


In [7]:
# transform
data['dependent_variable']  = data['review/score'].apply(lambda x: 1 if float(x) >= 4 else 0)

# remove non significant words
data['independent_variable'] = data['review/summary'].apply(lambda x: remove_wrong_words(x.lower()))

# Choose only the independent variable and the dependent variable
data = data[['independent_variable', 'dependent_variable']]

In [8]:
# add a sequential id; it stays a regular column because later cells query MongoDB by 'id'
data['id'] = data.reset_index().index
data.head()

| | independent_variable | dependent_variable | id |
|---|---|---|---|
| 0 | [good, quality, dog, food] | 1 | 0 |
| 1 | [not, advertised] | 0 | 1 |
| 2 | [delight] | 1 | 2 |
| 3 | [cough, medicine] | 0 | 3 |
| 4 | [great, taffy] | 1 | 4 |


In [None]:
#  Upload transformed data to MongoDb
upload_data_mongoDb(transformed_collection,data.to_dict('records'))

## Data discovery 

A function for counting good and bad ratings in order to see their ratio in our sample.

In [22]:
def good_count(dataframe):
    # convert to integers without modifying the caller's dataframe
    labels = dataframe['dependent_variable'].astype(int)
    print ('Number of good ratings: ' + str(labels.sum()))
    print ('Number of total ratings: ' + str(len(labels)))

In [23]:
data = pd.DataFrame(list(transformed_collection.find({},{"dependent_variable"}))) 
good_count(data)

Number of good ratings: 389844
Number of total ratings: 500000


Judging by the high share of positive ratings (389,844 of 500,000, about 78%), Amazon apparently sells good-quality products in its fine food section :). This imbalance may make accuracy a misleading measure of model quality. Let's keep it in mind and check how many bad ratings we have in the training sample as well as in the test sample.
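
A quick sketch of why the imbalance matters: with the counts printed above, a "model" that always answers good already reaches a high accuracy. The variable names are illustrative.

```python
# Counts printed by good_count on the full 500,000-review sample.
good = 389844
total = 500000

bad = total - good
# Accuracy of a trivial classifier that always predicts "good".
baseline_accuracy = good / total

print(f"bad ratings: {bad}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any real model has to be judged against this baseline rather than against 50%.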

## Training and test set 

In [33]:
data = pd.DataFrame(list(transformed_collection.find()))
data = data.sample(n = 30000)

# sklearn: test_size=0.7 keeps 30% (9000 rows) for training and 70% (21000 rows) for testing
train_set, test_set = train_test_split(data, test_size=0.7, random_state = 0)

In [34]:
print('training_set')
good_count(train_set)

training_set
Number of good ratings: 7012
Number of total ratings: 9000




In [35]:
print('test_set')
good_count(test_set)

test_set
Number of good ratings: 16326
Number of total ratings: 21000




In [None]:
#  Upload training data to MongoDb
upload_data_mongoDb(train_set_collection,train_set.to_dict('records'))
train_set_collection.create_index([("id", DESCENDING)])

In [None]:
#  Upload test data to MongoDb
upload_data_mongoDb(test_set_collection,test_set.to_dict('records'))
test_set_collection.create_index([("id", DESCENDING)])

## Featured words 

For Naive Bayes we need to create featured words that will indicate good/bad. We want to use only the most common words in our training sample. We call them 'wordfeatures'.
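
The idea of picking the most common words can be sketched with the standard library alone. The toy `reviews` list is made-up data standing in for the 'independent_variable' column, and `top` is deliberately tiny.

```python
from collections import Counter

# Toy token lists standing in for the 'independent_variable' column (made-up data).
reviews = [['great', 'taffy'], ['great', 'coffee'], ['bad', 'coffee'], ['great', 'price']]

# Flatten all reviews into one stream of words and count occurrences.
counts = Counter(word for review in reviews for word in review)

# Keep the `top` most frequent words as featured words.
top = 2
feature_words = [word for word, _ in counts.most_common(top)]
print(feature_words)
```

The notebook does the same thing with nltk.FreqDist, which offers the same most_common method.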

In [None]:
# Pick only the TOP most common words in our dataset and use them as featured words
def wordFeatures(wordList,top):
    forbiddenwords = ['.','..','%',"n't",'amazon.com','dr.','mrs.','.but','mr.','tea..']
    # count word frequencies and keep the `top` most frequent words
    mostCommon = nltk.FreqDist(wordList).most_common(top)
    # one document per word, ready for upload to MongoDB
    return [{'features':word} for word,count in mostCommon if word not in forbiddenwords]

We create a list of all words that occur in the training set, keep the 10,000 most common ones as features, and upload them to MongoDB into the collection wordfeatures_collection.

In [None]:
wordList = []

for row in train_set_collection.find({},{"independent_variable"}):
    wordList.extend(row['independent_variable'])

features = wordFeatures(wordList,10000) 

In [None]:
#  Upload featured word to MongoDb
upload_data_mongoDb(wordfeatures_collection,features)

## NLTK library - The Naive bayes model

Prepare a new column independent_variable_naive_bayes in the format that the nltk library expects.
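
As a small illustration of that format: nltk classifiers are trained on a list of (feature dictionary, label) pairs. The sketch below builds one such pair; `contains_features` and the three feature words are hypothetical stand-ins, not the notebook's data.

```python
def contains_features(doc_words, feature_words):
    """Map a tokenized document to {'contains(word)': bool} for every feature word."""
    present = set(doc_words)
    return {f'contains({word})': (word in present) for word in feature_words}

feature_words = ['great', 'bad', 'coffee']  # hypothetical featured words

# One training example: the feature dictionary paired with its label (1 = good).
labeled_example = (contains_features(['great', 'coffee'], feature_words), 1)
print(labeled_example)
```

Every review gets the same set of keys, so the dictionary grows with the number of featured words, not with the length of the review.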

In [8]:
# for every featured word, record whether it occurs in the document
def getFeatures(doc,featuredwords):
    docWords = set(doc)
    feat ={}
    for word in featuredwords:
        feat['contains(%s)' % word] = (word in docWords)
    return feat

def feature_dataset(collection, featuredwords):
    featuring_temporary_collection.delete_many({})
    number_of_lines = 1000

    # classic ETL, processed in chunks of 1000 ids
    for i in range(0,500):
        # extract data
        data = pd.DataFrame(list(collection.find({"id": {"$gte": i * number_of_lines , "$lt": (i+1)*number_of_lines}})))
        if data.empty:
            # ids come from a random sample, so a chunk can be empty without being the last one
            continue
        # transform data
        data['independent_variable_naive_bayes'] = data['independent_variable'].apply(lambda x: getFeatures(x,featuredwords))
        # load data
        upload_data_mongoDb(featuring_temporary_collection,
                            data.to_dict('records'),
                            delete_before_upload = False,
                            silent_mode = True)

In [10]:
# load a list of featured words
featuredwords = pd.DataFrame(list(wordfeatures_collection.find()))
# select the column by name rather than by position
featuredwords = list(featuredwords['features'].values)

Prepare training set for NLTK library.

In [None]:
# upload to Mongo
feature_dataset(train_set_collection,featuredwords)

In [8]:
training_set_prepared_NLTK = pd.DataFrame(list(featuring_temporary_collection.find()))

In [9]:
training_set_prepared_NLTK.head()

| | _id | dependent_variable | id | independent_variable | independent_variable_naive_bayes |
|---|---|---|---|---|---|
| 0 | 5be2d3d8d95ae0035cbb1644 | 5.0 | 977 | [brews, excellent, cup, coffee, quickly, easily] | {'contains(great)': False, 'contains(good)': F... |
| 1 | 5be2d3d8d95ae0035cbb15f5 | 5.0 | 898 | [i, n't, smell] | {'contains(great)': False, 'contains(good)': F... |
| 2 | 5be2d3d8d95ae0035cbb15f0 | 5.0 | 893 | [strictly, best] | {'contains(great)': False, 'contains(good)': F... |
| 3 | 5be2d3d8d95ae0035cbb15ef | 5.0 | 892 | [fantastic, coffee, best, i, ever] | {'contains(great)': False, 'contains(good)': F... |
| 4 | 5be2d3d8d95ae0035cbb15de | 1.0 | 875 | [something] | {'contains(great)': False, 'contains(good)': F... |


In [None]:
upload_data_mongoDb(train_set_collection_nltk,training_set_prepared_NLTK.to_dict('records'))

Prepare test set for NLTK library.

In [None]:
feature_dataset(test_set_collection,featuredwords)

In [4]:
test_set_prepared_NLTK = pd.DataFrame(list(featuring_temporary_collection.find()))

In [5]:
# upload to Mongo
upload_data_mongoDb(test_set_collection_nltk,test_set_prepared_NLTK.to_dict('records'))

<pymongo.results.InsertManyResult at 0x32178800>

Let's predict with NLTK.
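
Before handing the data to nltk, it may help to see what a Naive Bayes classifier over boolean features computes. The sketch below is a from-scratch illustration on made-up data, using add-one smoothing as an assumption; `train_nb` and `classify_nb` are hypothetical helpers, and nltk's own estimator may smooth differently.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Estimate log P(label) and smoothed P(feature is True | label)."""
    label_counts = Counter(label for _, label in examples)
    true_counts = defaultdict(Counter)  # label -> feature name -> number of True values
    for features, label in examples:
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    names = sorted({name for features, _ in examples for name in features})
    model = {}
    for label, n in label_counts.items():
        prior = math.log(n / len(examples))
        # Add-alpha smoothing over the two possible values (True / False).
        p_true = {name: (true_counts[label][name] + alpha) / (n + 2 * alpha) for name in names}
        model[label] = (prior, p_true)
    return model

def classify_nb(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    def score(label):
        prior, p_true = model[label]
        return prior + sum(
            math.log(p_true[name] if features.get(name, False) else 1.0 - p_true[name])
            for name in p_true
        )
    return max(model, key=score)

# Made-up training data in the (feature dictionary, label) format.
train = [
    ({'contains(great)': True,  'contains(bad)': False}, 1),
    ({'contains(great)': True,  'contains(bad)': False}, 1),
    ({'contains(great)': False, 'contains(bad)': False}, 1),
    ({'contains(great)': False, 'contains(bad)': True},  0),
]
model = train_nb(train)
good_guess = classify_nb(model, {'contains(great)': True, 'contains(bad)': False})
bad_guess = classify_nb(model, {'contains(great)': False, 'contains(bad)': True})
print(good_guess, bad_guess)
```

Note how the class prior enters the score: with an imbalanced sample the positive class starts with an advantage that the word evidence has to overcome.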

In [4]:
def transform_data_for_nltk(collection):
    data = pd.DataFrame(list(collection.find()))

    # nltk expects a list of (feature dictionary, label) pairs
    dataset = list(zip(data['independent_variable_naive_bayes'], data['dependent_variable']))

    return dataset

In [5]:
train_set = transform_data_for_nltk(train_set_collection_nltk)
test_set = transform_data_for_nltk(test_set_collection_nltk)

In [6]:
# create Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Test accuracy
print(nltk.classify.accuracy(classifier, test_set))

0.8371904761904762


In [11]:
# Try your own words
print(classifier.classify(getFeatures('great'.split(),featuredwords)))

1


In [12]:
# Try your own words
print(classifier.classify(getFeatures('bad'.split(),featuredwords)))

0


## Accuracy - Confusion matrix

In [21]:
# predicted labels and reference labels for every (features, tag) pair in the test set
test_set_pred = [classifier.classify(features) for (features, tag) in test_set]
test_set_tag = [tag for (features, tag) in test_set]


print(ConfusionMatrix(test_set_tag, test_set_pred))

  |     0     1 |
--+-------------+
0 | <2005> 2655 |
1 |   764<15576>|
--+-------------+
(row = reference; col = test)



The confusion matrix shows that 2655 bad reviews were predicted as good, i.e. they are false positives (type I errors). If we predicted every review as positive, all 2005 + 2655 = 4660 bad reviews would become false positives, which would still give an accuracy of 16340 / 21000, about 78%.
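
The arithmetic behind these figures can be reproduced directly from the four cells of the matrix. The metric names beyond accuracy (precision, recall, specificity) are additions for illustration and are not computed in the notebook.

```python
# Confusion matrix from the output above (rows = reference, columns = prediction).
tn, fp = 2005, 2655   # reference 0: predicted 0, predicted 1
fn, tp = 764, 15576   # reference 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
baseline = (tp + fn) / total   # accuracy when every review is predicted as 1
precision = tp / (tp + fp)     # share of predicted-good reviews that are good
recall = tp / (tp + fn)        # share of good reviews that are found
specificity = tn / (tn + fp)   # share of bad reviews that are found

print(f"accuracy={accuracy:.4f}  baseline={baseline:.4f}")
print(f"precision={precision:.4f}  recall={recall:.4f}  specificity={specificity:.4f}")
```

The low specificity shows where the model struggles: fewer than half of the bad reviews are recognised as bad.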

## Conclusion 

The predictive accuracy of our Naive Bayes model is 83.7%. If we used no model at all and, given the small number of negative cases in our sample, simply predicted every review as positive, we would get an accuracy of about 78%. Our model is therefore only about 6 percentage points better. We cannot consider this result a success. Let's improve our model with the following steps.

## Next steps

1. Use a different column such as 'review/text' and test whether productid or userid is relevant.
2. For the featured words, use only the words with the biggest difference in frequency between the good and bad categories.
3. Use the collocations functions, which can capture pairs of words (see appendix).
4. Use the sklearn library and its BernoulliNB class (Naive Bayes with a Bernoulli distribution).
5. Start from the full sample and pick an equal number of positive and negative cases.
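
The last step, balancing the classes, could look roughly like the sketch below. It downsamples every class to the size of the smallest one; `balanced_sample` and the toy `rows` are hypothetical, and a real run would draw from the transformed collection instead.

```python
import random

def balanced_sample(rows, label_key, seed=0):
    """Downsample each class to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for group in by_label.values():
        sample.extend(rng.sample(group, smallest))
    rng.shuffle(sample)
    return sample

# Toy data: 40 positive and 10 negative rows (made-up, mimicking the imbalance).
rows = [{'id': i, 'dependent_variable': 1 if i % 5 else 0} for i in range(50)]
balanced = balanced_sample(rows, 'dependent_variable')
print(len(balanced), sum(row['dependent_variable'] for row in balanced))
```

With balanced classes the majority-class baseline drops to 50%, which makes the accuracy figure easier to interpret, at the cost of discarding positive examples.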

# Appendix

Collocations on all words in our data.
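
The ranking measure used in the cell that follows is pointwise mutual information (PMI). As a rough sketch of what it rewards, the snippet below computes the textbook definition on a made-up word list; normalising both unigram and bigram counts by the number of words is a simplifying assumption here, and nltk's exact counting may differ slightly.

```python
import math
from collections import Counter

# Made-up word stream standing in for wordList.
words = ['green', 'tea', 'good', 'green', 'tea', 'good', 'coffee', 'good', 'price', 'good']

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))  # adjacent word pairs
n = len(words)

def pmi(w1, w2):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities estimated from raw counts."""
    p_pair = bigrams[(w1, w2)] / n
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Rank the observed pairs from strongest to weakest association.
ranked = sorted(bigrams, key=lambda pair: pmi(*pair), reverse=True)
print(ranked[0], round(pmi(*ranked[0]), 2))
```

Pairs score highly when the two words appear together more often than their individual frequencies would suggest, which is why a frequent word such as 'good' contributes little.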

In [None]:
#  Usage of collocations in practice

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wordList)
# the 1000 word pairs with the highest pointwise mutual information (PMI)
finder.nbest(bigram_measures.pmi, 1000)
