# Introduction

The purpose of this analysis is to find the products most likely to make people happy, based on the happy moments in the dataset. We use KOKO, an entity extraction system with a declarative query language, to extract products (e.g., car, dress) that we believe contribute to happiness. KOKO is especially suitable for entity extraction with limited evidence (e.g., extracting cafe names from one or a few blogs). To use KOKO, the user specifies a few high-level patterns along with weights; KOKO finds all instances matching the patterns and returns the most probable results based on the accumulated weight of each entity.

The analysis comprises the following steps:
1. Data preprocessing: the dataset is read and converted to a text file as input to KOKO
2. Information extraction: a KOKO query is written and evaluated. The query specifies the patterns that the desired entities must match.

# 1. Data preprocessing

We first load the data and take a quick look at the happy moments inside. Then we use Python to identify happy moments that involve shopping, using keywords like "buy" or "purchase".

## Load HappyDB

In [1]:
import pandas as pds

data = pds.read_csv('./data/cleaned_hm.csv')
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1.1.1,hmid,wid,reflection_period,cleaned_hm,original_hm,modified
0,0,0,0,3833.0,31526.0,8962.0,24h,I found a silver coin from 1852 buried in the ...,I found a silver coin from 1852 buried in the ...,False
1,1,1,1,9336.0,37050.0,10252.0,24h,"This one is pretty minuscule, we had to go to ...","This one is pretty minuscule, we had to go to ...",False
2,2,2,2,18454.0,46196.0,586.0,24h,The word problem is never part of a happy pers...,The word aproblema is never part of a happy pe...,False
3,3,3,3,22355.0,50108.0,409.0,24h,I started studying bagavthgeetha .,I started studying bagavthgeetha .,True
4,4,4,4,27323.0,55093.0,731.0,24h,i go to old age home and service enjoy the mom...,i go to oldage home and service enjoy the moment.,False
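
The notebook does not show how the loaded moments are converted into the text file that KOKO reads (step 1 of the analysis). A minimal sketch of that conversion follows. It assumes that one moment per line is an acceptable input layout, which this notebook does not confirm; the function name `write_moments` and the demo path are hypothetical.

```python
import os
import tempfile

def write_moments(moments, path):
    """Write one whitespace-normalized moment per line; return the number written."""
    count = 0
    with open(path, 'w', encoding='utf-8') as out:
        for moment in moments:
            # Skip missing values (pandas represents them as float NaN).
            if not isinstance(moment, str):
                continue
            # Collapse newlines and repeated spaces so each moment stays on one line.
            text = ' '.join(moment.split())
            if text:
                out.write(text + '\n')
                count += 1
    return count

# Tiny stand-in for data['cleaned_hm']; the output path is a placeholder.
sample = ['I bought a new  car.', float('nan'), 'My husband purchased\na gift for me.']
demo_path = os.path.join(tempfile.gettempdir(), 'happy_moments_demo.txt')
written = write_moments(sample, demo_path)
print(written)
```

With the DataFrame loaded above, the call would look like `write_moments(data['cleaned_hm'], path)`, where `path` is whatever file the KOKO query names in its `from` clause.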


## Identify shopping-related moments

We first look for all the happy moments that contain 'buy', 'bought' or 'purchase'.
To keep the output short, we only scan the first 100 moments. Change `num_moments` to try other values yourself.

In [20]:
num_moments = 100
assert num_moments < data['cleaned_hm'].size
keywords = ('buy', 'bought', 'purchase')
print('Happy moments involving purchasing:\n')
# Scan the first num_moments moments for any of the keywords (substring match).
for i, moment in enumerate(data['cleaned_hm'].iloc[:num_moments]):
    if any(keyword in moment for keyword in keywords):
        print("{}: {}".format(i, moment))

Happy moments involving purchasing:

38: I waited patiantly for my income tax return and finally received it. And no i haven't went rogue with impulse purchases. I just did  the noble thing to do and take care of home, made a couple of investments and put the rest in my bank account.
45: I am happy when i purchase a new vehicle 
94: My husband purchased a new nan for me.
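
The substring test above is case-sensitive and also fires inside longer words (for example, 'buy' in 'buyer'). One possible alternative, sketched here with the standard `re` module, matches whole words only and lists the inflections explicitly. The helper name `mentions_purchase` and the chosen word list are illustrative assumptions, and using it would change which moments are selected.

```python
import re

# Whole-word, case-insensitive match for the purchase verbs and common inflections.
PURCHASE_RE = re.compile(
    r'\b(buy|buys|buying|bought|purchase|purchases|purchased|purchasing)\b',
    re.IGNORECASE,
)

def mentions_purchase(moment):
    """Return True if the moment contains one of the purchase words as a whole word."""
    return PURCHASE_RE.search(moment) is not None

examples = [
    'I am happy when i purchase a new vehicle',
    'My husband purchased a new nan for me.',
    'I Bought a dress.',
    'The buyer called me back.',
]
for text in examples:
    print(mentions_purchase(text), text)
```

The first three examples match; the last one does not, because 'buyer' is not in the word list even though it contains 'buy'.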


# 2. KOKO query

The query extracts the names of purchased products that make people happy.
Let's first take a look at the contents of the query.

In [3]:
with open('./product_names/koko_queries/products_v3.koko', 'r') as query:
    print(query.read())

extract "Ngrams(1,2)" x from "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" if
		   ("bought a new" x {0.01}) or
		   ("bought a few" x {0.01}) or
		   ("bought some" x {0.01}) or
		   ("bought a" x {0.01}) or
		   ("bought" x {0.01}) or		   
		   (x "I bought" {0.01}) or		   
		   ("purchase a new" x {0.01}) or
		   ("purchase a few" x {0.01}) or
		   ("purchase some" x {0.01}) or
		   ("purchase a" x {0.01}) or
		   ("purchase" x {0.01}) or		   
		   (x "I purchase" {0.01})
with threshold 0.0
excluding (str(x) matches ".*[0-9]+.*")
excluding (str(x) matches ".*(new|NEW|few).*")
excluding (str(x) matches ".*(and|or|so).*")
excluding (str(x) matches ".*(,|\.|;|!|\$|\s|\(|\)|-).*")
excluding (str(x) matches ".*(month|week|year|day|night|today).*")



## Introduction to KOKO syntax

The query adopts the following format:

extract "<keyword>" x from "<file>" if
    <conditions>
with threshold <n>
[excluding <e-conditions>]
    
The query extracts every entity of type <keyword> from the file <file> whose accumulated weight exceeds the threshold <n>.
    
For example, in the above query, the <keyword> is "Ngrams(1,2)", which means all n-grams with $1 \le n \le 2$. The user could also use "Ents" for named entities, and "Nps" for noun phrases.
<conditions> can be strict matching rules, e.g., ("bought" x {0.01}), or non-Boolean conditions (not shown here for simplicity). The number accompanying each condition is its weight. Each instance of an entity that matches a condition adds that weight to the entity's score, so entities that match more often rank higher.
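
To make the weight accumulation concrete, here is a toy illustration in plain Python. It is not KOKO's implementation: it only handles conditions of the form ("phrase" x {w}) over single words, uses a regular expression instead of a parser, and applies no normalization or exclusion. The function name `score_unigrams` and the sample sentences are made up for this sketch.

```python
import re
from collections import defaultdict

# (phrase, weight) pairs mirroring a few conditions of the form ("phrase" x {w}).
CONDITIONS = [
    ('bought a new', 0.01),
    ('bought a', 0.01),
    ('bought', 0.01),
]

def score_unigrams(sentences, conditions):
    """Add a condition's weight to each word that directly follows its phrase."""
    scores = defaultdict(float)
    for sentence in sentences:
        for phrase, weight in conditions:
            pattern = r'\b' + re.escape(phrase) + r'\s+(\w+)'
            for match in re.finditer(pattern, sentence):
                scores[match.group(1)] += weight
    return dict(scores)

sentences = [
    'I bought a new car yesterday.',
    'We bought a car last week.',
    'She bought a new dress.',
]
scores = score_unigrams(sentences, CONDITIONS)
# Print entities from highest to lowest accumulated weight.
for entity, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print('{:<10}{:.2f}'.format(entity, score))
```

In this toy run 'car' accumulates 0.02 from two different conditions, while 'a' and 'new' also pick up weight from the shorter phrases. That side effect is why the real query adds `excluding` clauses for words such as 'new' and 'few'.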

# 3. Entity extraction with KOKO

## Install KOKO

To install KOKO locally, simply run the following command:

    pip install pykoko

## Run KOKO

We use the example query presented in Section 2 to extract products that make people happy.

To start, we need to import the KOKO package, along with spaCy for parsing documents.

In [4]:
import koko
import spacy

Then you can run the KOKO query.
Given the size of the corpus, this could take a while.

In [5]:
koko.run('./product_names/koko_queries/products_v3.koko', doc_parser='spacy') 
# TODO: clean up the output below: remove the metadata and show a progress marker (e.g. "...") instead.

INFO 2017-09-18 15:13:42,233 - Loading SpaCy English models
INFO 2017-09-18 15:13:44,956 - Done
Loading embedding models
Creating QueryExpander for: en
Embeddings file not found: /Users/chen/.virtualenv/python3/lib/python3.6/site-packages/koko/../embeddings/commoncrawl.840B.300d.txt
Ontology file not found: /Users/chen/.virtualenv/python3/lib/python3.6/site-packages/koko/../coffee_ontology.txt
Creating QueryExpander for: ja
Embeddings file not found: /Users/chen/.virtualenv/python3/lib/python3.6/site-packages/koko/../embeddings/japanese_noun_verb_embedding_vectors.txt
Ontology file not provided
Done loading embedding models
Parsed query: extract "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" Ngrams(1,2) from "x" if
	("bought a new" x { 0.01 }) or
	("bought a few" x { 0.01 }) or
	("bought some" x { 0.01 }) or
	("bought a" x { 0.01 }) or
	("bought" x { 0.01 }) or
	(x "I bought" { 0.01 }) or
	("purchase a new" x { 0.01 }) or
	("purchase a few" x { 0.01 

stuffed                                            0.010000
szechuan                                           0.010000
Legend                                             0.010000
about                                              0.010000
designer                                           0.010000
string                                             0.010000
turtle                                             0.010000
note                                               0.010000
DAY                                                0.010000
ready                                              0.010000
reliable                                           0.010000
running                                            0.010000
by                                                 0.010000
arrived                                            0.010000
manager                                            0.010000
aircon                                             0.010000
SURPRISE                                

The expected results are as follows:


    
Results:

<table align="left" style="width:50%;text-align:left">
  <tr>
    <th>Entity name</th>
    <th>Entity score</th>
  </tr>
  <tr>
    <td>car</td>
    <td>1.000000</td>
  </tr>
  <tr>
    <td>when</td>
    <td>0.930000</td>
  </tr>
  <tr>
    <td>pair</td>
    <td>0.380000</td>
  </tr>
  <tr>
    <td>house</td>
    <td>0.380000</td>
  </tr>
  <tr>
    <td>bike</td>
    <td>0.350000</td>
  </tr>
  <tr>
    <td>laptop</td>
    <td>0.310000</td>
  </tr>
  <tr>
    <td>that</td>
    <td>0.300000</td>
  </tr>
  <tr>
    <td>video</td>
    <td>0.280000</td>
  </tr>
  <tr>
    <td>phone</td>
    <td>0.240000</td>
  </tr>
  <tr>
    <td>home</td>
    <td>0.230000</td>
  </tr>
  <tr>
    <td>dress</td>
    <td>0.230000</td>
  </tr>
  <tr>
    <td>computer</td>
    <td>0.210000</td>
  </tr>
  <tr>
    <td>good</td>
    <td>0.180000</td>
  </tr>
  <tr>
    <td>TV</td>
    <td>0.170000</td>
  </tr>
  <tr>
    <td>NICE</td>
    <td>0.170000</td>
  </tr>
  <tr>
    <td>shirt</td>
    <td>0.150000</td>
  </tr>
  <tr>
    <td>game</td>
    <td>0.150000</td>
  </tr>
  <tr>
    <td>mobile</td>
    <td>0.150000</td>
  </tr>
  <tr>
    <td>bunch</td>
    <td>0.150000</td>
  </tr>
  <tr>
    <td>watch</td>
    <td>0.150000</td>
  </tr>
  <tr>
    <td>really</td>
    <td>0.140000</td>
  </tr>
  <tr>
    <td>Nintendo</td>
    <td>0.120000</td>
  </tr>
  <tr>
    <td>cellphone</td>
    <td>0.120000</td>
  </tr>
  <tr>
    <td>cell</td>
    <td>0.110000</td>
  </tr>
  <tr>
    <td>YESTERDAY</td>
    <td>0.110000</td>
  </tr>
  <tr>
    <td>puppy</td>
    <td>0.100000</td>
  </tr>
  <tr>
    <td>lot</td>
    <td>0.100000</td>
  </tr>
</table>

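The results show that expensive purchases, such as cars, houses or laptops, are the most frequently mentioned in HappyDB. The list also contains non-product words such as "when", "that" and "really", which additional exclusion rules in the query could filter out.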
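
The ranking can also be cleaned after the fact. The sketch below filters the first rows of the result table against a small stop list. The list is hand-picked for this illustration and is not part of KOKO or of the original query; `drop_stop_words` is a hypothetical helper.

```python
# First rows of the result table above, as (entity, score) pairs.
results = [
    ('car', 1.00), ('when', 0.93), ('pair', 0.38), ('house', 0.38),
    ('bike', 0.35), ('laptop', 0.31), ('that', 0.30), ('video', 0.28),
]

# Hypothetical, hand-picked stop list of function and quantity words.
STOP_WORDS = {'when', 'that', 'good', 'nice', 'really', 'yesterday', 'lot', 'bunch', 'pair'}

def drop_stop_words(rows, stop_words):
    """Keep only rows whose entity (lower-cased) is not in the stop list."""
    return [(entity, score) for entity, score in rows if entity.lower() not in stop_words]

filtered = drop_stop_words(results, STOP_WORDS)
print(filtered)
```

Adding the same words to an `excluding` clause of the query would be the alternative, at the cost of rerunning the extraction.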

# 4. Conclusion

KOKO is easy to use and extracts the desired entities effectively and efficiently.

The author works for Recruit Institute of Technology, the research lab that develops KOKO.